NLTK samples

Installation

sudo apt-get install python-dev
sudo pip install -U numpy
sudo pip install -U pyyaml nltk

For windows there are msi installers, follow links from here: http://nltk.org/install.html

To check installation: run python then type import nltk

Download dictionaries:

python -m nltk.downloader all

Now you can check that all installed:

>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

Stemming

>>> from nltk import stem

>>> stem.SnowballStemmer.languages
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> russian_stemmer = stem.SnowballStemmer('russian')
>>> print russian_stemmer.stem(u'киева')
киев

>>> stemmer = stem.PorterStemmer()
>>> stemmer.stem('buying')
'buy'

Tokenize

>>> import nltk
>>> tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')
>>> tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())
>>> tokenized = tokenizer.tokenize("Hello World! Mac was here.")
>>> print tokenized
['Hello', 'World', '!', 'Mac', 'was', 'here', '.']
>>> tagged = tagger.tag(tokenized)
>>> print tagged
[('Hello', 'UH'), ('World', 'NN-TL'), ('!', '.'), ('Mac', 'NP'), ('was', 'BEDZ'), ('here', 'RB'), ('.', '.')]