Natural Language Processing

Tokenizers

TweetTokenizer

Tokenizer for tweets.
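
A minimal sketch (assumes NLTK is installed; the sample tweet is invented for illustration) showing the handle-stripping and character-run-reduction options:

```python
from nltk.tokenize import TweetTokenizer

# strip_handles drops @-mentions; reduce_len caps repeated characters at three
tk = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
tweet_tokens = tk.tokenize("@user That movie was greeeeeat!!! #film")
# handles removed, text lowercased, 'greeeeeat' shortened to 'greeeat'
```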

NLTKWordTokenizer

The default NLTK word tokenizer, an improved version of the TreebankWordTokenizer.

LegalitySyllableTokenizer

Syllabifies words based on the Legality Principle and Onset Maximization.

MWETokenizer

A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.
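
For example (a minimal sketch; assumes NLTK is installed), merging expressions registered at construction time or added later:

```python
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer([('in', 'spite', 'of')], separator='_')
tokenizer.add_mwe(('New', 'York'))
merged = tokenizer.tokenize('he stayed in New York in spite of the storm'.split())
# registered multi-word expressions are joined with the separator
```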

PunktSentenceTokenizer

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries.

BlanklineTokenizer

Tokenize a string, treating any sequence of blank lines as a delimiter.
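
A short sketch (assumes NLTK is installed) splitting text into paragraphs at blank lines:

```python
from nltk.tokenize import BlanklineTokenizer

# any run of blank lines acts as a paragraph separator
paragraphs = BlanklineTokenizer().tokenize("First paragraph.\n\nSecond paragraph.")
```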

RegexpTokenizer

A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
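
Both modes in a minimal sketch (assumes NLTK is installed; the patterns are illustrative):

```python
from nltk.tokenize import RegexpTokenizer

# Pattern matches the tokens themselves: word runs or dollar amounts
money_tok = RegexpTokenizer(r'\w+|\$[\d\.]+')
tokens = money_tok.tokenize("Good muffins cost $3.88 in New York.")

# With gaps=True the pattern matches the separators instead
gap_tok = RegexpTokenizer(r'\s+', gaps=True)
pieces = gap_tok.tokenize("Good muffins cost $3.88")
```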

WhitespaceTokenizer

Tokenize a string on whitespace (space, tab, newline).

WordPunctTokenizer

Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.
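
A minimal sketch (assumes NLTK is installed) showing how alphabetic and punctuation runs are separated:

```python
from nltk.tokenize import WordPunctTokenizer

# each maximal run of word characters or of punctuation becomes one token
tokens = WordPunctTokenizer().tokenize("Good muffins cost $3.88.")
```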

ReppTokenizer

A class for word tokenization using the REPP parser described in Rebecca Dridan and Stephan Oepen (2012) Tokenization: Returning to a Long Solved Problem - A Survey, Contrastive Experiment, Recommendations, and Toolkit.

SExprTokenizer

A tokenizer that divides strings into s-expressions.
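
For instance (assumes NLTK is installed), each parenthesized group becomes a single token:

```python
from nltk.tokenize import SExprTokenizer

# nested parentheses are kept together; bare atoms stand alone
exprs = SExprTokenizer().tokenize('(a b (c d)) e f (g)')
```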

LineTokenizer

Tokenize a string into its lines, optionally discarding blank lines.

SpaceTokenizer

Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').

TabTokenizer

Tokenize a string using the tab character as a delimiter, the same as s.split('\t').

SyllableTokenizer

Syllabifies words based on the Sonority Sequencing Principle (SSP).
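
A minimal sketch (assumes NLTK 3.4+; the sample word is illustrative):

```python
from nltk.tokenize import SyllableTokenizer

# split a word into syllables using sonority peaks
syllables = SyllableTokenizer().tokenize('justification')
```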

StanfordSegmenter

Interface to the Stanford Segmenter.

TextTilingTokenizer

Tokenize a document into topical sections using the TextTiling algorithm.

ToktokTokenizer

A Python port of tok-tok.pl from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl

TreebankWordDetokenizer

The Treebank detokenizer uses the reverse regex operations corresponding to the Treebank tokenizer's regexes.

TreebankWordTokenizer

The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
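
The two Treebank classes round-trip cleanly on ordinary sentences; a minimal sketch (assumes NLTK is installed):

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

sentence = "They'll save and invest more."
tokens = TreebankWordTokenizer().tokenize(sentence)
# contractions are split off as separate tokens, e.g. "They'll" -> 'They', "'ll"
restored = TreebankWordDetokenizer().detokenize(tokens)
```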

sent_tokenize(text[, language])

Return a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).

word_tokenize(text[, language, preserve_line])

Return a tokenized copy of text, using NLTK's recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).

casual_tokenize(text[, preserve_case, ...])

Convenience function wrapping TweetTokenizer.

blankline_tokenize(text)

Return a tokenized copy of text.

line_tokenize(text[, blanklines])

Tokenize text into its lines, optionally discarding blank lines.

regexp_tokenize(text, pattern[, gaps, ...])

Return a tokenized copy of text.

wordpunct_tokenize(text)

Return a tokenized copy of text.

sexpr_tokenize(text)

Return a list of s-expressions extracted from text.

Stemmers

ARLSTem

ARLSTem stemmer: a light Arabic stemming algorithm without any dictionary.

ARLSTem2

Return a stemmed Arabic word after removing affixes.

Cistem

CISTEM Stemmer for German

ISRIStemmer

ISRI Arabic stemmer, based on the algorithm Arabic Stemming Without a Root Dictionary.

LancasterStemmer

Lancaster Stemmer

PorterStemmer

A word stemmer based on the Porter stemming algorithm.
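
A minimal sketch (assumes NLTK is installed; the word list is illustrative):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# suffixes are stripped in the cascade of Porter rule steps
stems = [stemmer.stem(w) for w in ['caresses', 'ponies', 'running', 'happiness']]
```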

RegexpStemmer

A stemmer that uses regular expressions to identify morphological affixes.
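
For example (assumes NLTK is installed; the suffix pattern is illustrative):

```python
from nltk.stem import RegexpStemmer

# strip the listed suffixes, but only from words of at least four characters
st = RegexpStemmer('ing$|s$|e$|able$', min=4)
stemmed = [st.stem(w) for w in ['cars', 'was', 'compute']]
# 'was' is too short to be stemmed; the others lose a matched suffix
```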

RSLPStemmer

A stemmer for Portuguese.

SnowballStemmer

Snowball Stemmer
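
The Snowball family covers many languages selected by name; a minimal sketch (assumes NLTK is installed):

```python
from nltk.stem import SnowballStemmer

english = SnowballStemmer('english')
german = SnowballStemmer('german')
en_stem = english.stem('running')
de_stem = german.stem('Autobahnen')
```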

WordNetLemmatizer

WordNet Lemmatizer

Taggers

DefaultTagger

A tagger that assigns the same tag to every token.

NgramTagger

A tagger that chooses a token's tag based on its word string and on the preceding n words' tags.

UnigramTagger

Unigram Tagger

BigramTagger

A tagger that chooses a token's tag based on its word string and on the preceding word's tag.

TrigramTagger

A tagger that chooses a token's tag based on its word string and on the preceding two words' tags.
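
The n-gram taggers above are usually chained with backoff so that unseen contexts fall through to a simpler model. A minimal sketch (assumes NLTK is installed; the tiny training corpus is invented for illustration):

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Tiny hand-made training corpus: a list of tagged sentences
train = [[('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
         [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')]]

t0 = DefaultTagger('NN')               # last resort: tag everything NN
t1 = UnigramTagger(train, backoff=t0)  # per-word most frequent tag
t2 = BigramTagger(train, backoff=t1)   # condition on the preceding tag
tagged = t2.tag('the dog sleeps'.split())
```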

AffixTagger

A tagger that chooses a token's tag based on a leading or trailing substring of its word string.

RegexpTagger

Regular Expression Tagger
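
A minimal sketch (assumes NLTK is installed; the pattern list is illustrative). Patterns are tried in order, so the catch-all goes last:

```python
from nltk.tag import RegexpTagger

patterns = [
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*ed$', 'VBD'),   # simple past
    (r'^\d+$', 'CD'),    # cardinal numbers
    (r'.*', 'NN'),       # default
]
tagged = RegexpTagger(patterns).tag('the dog kept running'.split())
```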

ClassifierBasedTagger

A sequential tagger that uses a classifier to choose the tag for each token in a sentence. The featureset input for the classifier is generated by a feature detector function.

ClassifierBasedPOSTagger

A classifier based part of speech tagger.

BrillTagger

Brill's transformational rule-based tagger.

BrillTaggerTrainer

A trainer for transformation-based learning (TBL) taggers.

TnT

TnT - Statistical POS tagger

HunposTagger

A class for POS tagging with HunPos; the input is the paths to the trained model and the HunPos binary.

HiddenMarkovModelTagger

Hidden Markov model class, a generative model for labelling sequence data.

HiddenMarkovModelTrainer

Algorithms for learning HMM parameters from training data.

SennaTagger

A part-of-speech tagger that interfaces with the SENNA pipeline.

SennaChunkTagger

A chunk tagger that interfaces with the SENNA pipeline.

SennaNERTagger

A named-entity tagger that interfaces with the SENNA pipeline.

CRFTagger

A module for POS tagging using CRFSuite (https://pypi.python.org/pypi/python-crfsuite).

PerceptronTagger

Greedy Averaged Perceptron tagger, as implemented by Matthew Honnibal.

pos_tag(tokens[, tagset, lang])

Use NLTK's currently recommended part of speech tagger to tag the given list of tokens.

pos_tag_sents(sentences[, tagset, lang])

Use NLTK's currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens.

Word sense disambiguation

lesk(context_sentence, ambiguous_word[, ...])

Return a synset for an ambiguous word in a context.

Sentiment analysis

SentimentAnalyzer

A Sentiment Analysis tool based on machine learning approaches.

SentimentIntensityAnalyzer

Give a sentiment intensity score to sentences.