Natural Language Processing

Tokenizers

TweetTokenizer

Tokenizer for tweets.
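
A minimal sketch (assumes NLTK is installed; the sample tweet is invented for illustration) showing the handle-stripping and character-run-reduction options:

```python
from nltk.tokenize import TweetTokenizer

# strip_handles drops @-mentions; reduce_len caps repeated characters at three
tk = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
tweet_tokens = tk.tokenize("@user That movie was greeeeeat!!! #film")
# handles removed, text lowercased, 'greeeeeat' shortened to 'greeeat'
```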

NLTKWordTokenizer

The default NLTK word tokenizer, an improved version of the TreebankWordTokenizer.

LegalitySyllableTokenizer

Syllabifies words based on the Legality Principle and Onset Maximization.

MWETokenizer

A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.
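
For example (a minimal sketch; assumes NLTK is installed), merging expressions registered at construction time or added later:

```python
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer([('in', 'spite', 'of')], separator='_')
tokenizer.add_mwe(('New', 'York'))
merged = tokenizer.tokenize('he stayed in New York in spite of the storm'.split())
# registered multi-word expressions are joined with the separator
```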

PunktSentenceTokenizer

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries.

BlanklineTokenizer

Tokenize a string, treating any sequence of blank lines as a delimiter.
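
A short sketch (assumes NLTK is installed) splitting text into paragraphs at blank lines:

```python
from nltk.tokenize import BlanklineTokenizer

# any run of blank lines acts as a paragraph separator
paragraphs = BlanklineTokenizer().tokenize("First paragraph.\n\nSecond paragraph.")
```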

RegexpTokenizer

A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
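
Both modes in a minimal sketch (assumes NLTK is installed; the patterns are illustrative):

```python
from nltk.tokenize import RegexpTokenizer

# Pattern matches the tokens themselves: word runs or dollar amounts
money_tok = RegexpTokenizer(r'\w+|\$[\d\.]+')
tokens = money_tok.tokenize("Good muffins cost $3.88 in New York.")

# With gaps=True the pattern matches the separators instead
gap_tok = RegexpTokenizer(r'\s+', gaps=True)
pieces = gap_tok.tokenize("Good muffins cost $3.88")
```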

WhitespaceTokenizer

Tokenize a string on whitespace (space, tab, newline).

WordPunctTokenizer

Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.
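
A minimal sketch (assumes NLTK is installed) showing how alphabetic and punctuation runs are separated:

```python
from nltk.tokenize import WordPunctTokenizer

# each maximal run of word characters or of punctuation becomes one token
tokens = WordPunctTokenizer().tokenize("Good muffins cost $3.88.")
```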

ReppTokenizer

A class for word tokenization using the REPP parser described in Rebecca Dridan and Stephan Oepen (2012) Tokenization: Returning to a Long Solved Problem - A Survey, Contrastive Experiment, Recommendations, and Toolkit.

SExprTokenizer

A tokenizer that divides strings into s-expressions.
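
For instance (assumes NLTK is installed), each parenthesized group becomes a single token:

```python
from nltk.tokenize import SExprTokenizer

# nested parentheses are kept together; bare atoms stand alone
exprs = SExprTokenizer().tokenize('(a b (c d)) e f (g)')
```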

LineTokenizer

Tokenize a string into its lines, optionally discarding blank lines.

SpaceTokenizer

Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').

TabTokenizer

Tokenize a string using the tab character as a delimiter, the same as s.split('\t').

SyllableTokenizer

Syllabifies words based on the Sonority Sequencing Principle (SSP).
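
A minimal sketch (assumes NLTK 3.4+; the sample word is illustrative):

```python
from nltk.tokenize import SyllableTokenizer

# split a word into syllables using sonority peaks
syllables = SyllableTokenizer().tokenize('justification')
```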

StanfordSegmenter

Interface to the Stanford Segmenter.

TextTilingTokenizer

Tokenize a document into topical sections using the TextTiling algorithm.

ToktokTokenizer

A Python port of tok-tok.pl from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl

TreebankWordDetokenizer

The Treebank detokenizer uses the reverse regex operations corresponding to the Treebank tokenizer's regexes.

TreebankWordTokenizer

The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
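
The two Treebank classes round-trip cleanly on ordinary sentences; a minimal sketch (assumes NLTK is installed):

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

sentence = "They'll save and invest more."
tokens = TreebankWordTokenizer().tokenize(sentence)
# contractions are split off as separate tokens, e.g. "They'll" -> 'They', "'ll"
restored = TreebankWordDetokenizer().detokenize(tokens)
```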

sent_tokenize(text[, language])

Return a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).

word_tokenize(text[, language, preserve_line])

Return a tokenized copy of text, using NLTK's recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).

casual_tokenize(text[, preserve_case, ...])

Convenience function wrapping TweetTokenizer.

blankline_tokenize(text)

Return a tokenized copy of text.

line_tokenize(text[, blanklines])

Tokenize text into its lines, optionally discarding blank lines.

regexp_tokenize(text, pattern[, gaps, ...])

Return a tokenized copy of text.

wordpunct_tokenize(text)

Return a tokenized copy of text.

sexpr_tokenize(text)

Return a list of s-expressions extracted from text.

Stemmers

ARLSTem

ARLSTem stemmer: a light Arabic stemming algorithm without any dictionary.

ARLSTem2

Return a stemmed Arabic word after removing affixes.

Cistem

CISTEM Stemmer for German

ISRIStemmer

ISRI Arabic stemmer, based on the algorithm Arabic Stemming Without a Root Dictionary.

LancasterStemmer

Lancaster Stemmer

PorterStemmer

A word stemmer based on the Porter stemming algorithm.
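
A minimal sketch (assumes NLTK is installed; the word list is illustrative):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# suffixes are stripped in the cascade of Porter rule steps
stems = [stemmer.stem(w) for w in ['caresses', 'ponies', 'running', 'happiness']]
```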

RegexpStemmer

A stemmer that uses regular expressions to identify morphological affixes.
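
For example (assumes NLTK is installed; the suffix pattern is illustrative):

```python
from nltk.stem import RegexpStemmer

# strip the listed suffixes, but only from words of at least four characters
st = RegexpStemmer('ing$|s$|e$|able$', min=4)
stemmed = [st.stem(w) for w in ['cars', 'was', 'compute']]
# 'was' is too short to be stemmed; the others lose a matched suffix
```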

RSLPStemmer

A stemmer for Portuguese.

SnowballStemmer

Snowball Stemmer
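
The Snowball family covers many languages selected by name; a minimal sketch (assumes NLTK is installed):

```python
from nltk.stem import SnowballStemmer

english = SnowballStemmer('english')
german = SnowballStemmer('german')
en_stem = english.stem('running')
de_stem = german.stem('Autobahnen')
```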

WordNetLemmatizer

WordNet Lemmatizer

Taggers

DefaultTagger

A tagger that assigns the same tag to every token.

NgramTagger

A tagger that chooses a token's tag based on its word string and on the preceding n words' tags.

UnigramTagger

Unigram Tagger

BigramTagger

A tagger that chooses a token's tag based on its word string and on the preceding word's tag.

TrigramTagger

A tagger that chooses a token's tag based on its word string and on the preceding two words' tags.
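
The n-gram taggers above are usually chained with backoff so that unseen contexts fall through to a simpler model. A minimal sketch (assumes NLTK is installed; the tiny training corpus is invented for illustration):

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Tiny hand-made training corpus: a list of tagged sentences
train = [[('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
         [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')]]

t0 = DefaultTagger('NN')               # last resort: tag everything NN
t1 = UnigramTagger(train, backoff=t0)  # per-word most frequent tag
t2 = BigramTagger(train, backoff=t1)   # condition on the preceding tag
tagged = t2.tag('the dog sleeps'.split())
```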

AffixTagger

A tagger that chooses a token's tag based on a leading or trailing substring of its word string.

RegexpTagger

Regular Expression Tagger
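
A minimal sketch (assumes NLTK is installed; the pattern list is illustrative). Patterns are tried in order, so the catch-all goes last:

```python
from nltk.tag import RegexpTagger

patterns = [
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*ed$', 'VBD'),   # simple past
    (r'^\d+$', 'CD'),    # cardinal numbers
    (r'.*', 'NN'),       # default
]
tagged = RegexpTagger(patterns).tag('the dog kept running'.split())
```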

ClassifierBasedTagger

A sequential tagger that uses a classifier to choose the tag for each token in a sentence. The featureset input for the classifier is generated by a feature detector function.

ClassifierBasedPOSTagger

A classifier based part of speech tagger.

BrillTagger

Brill's transformational rule-based tagger.

BrillTaggerTrainer

A trainer for transformation-based learning (TBL) taggers.

TnT

TnT - Statistical POS tagger

HunposTagger

A class for POS tagging with HunPos; the input is the paths to the trained model and the HunPos binary.

HiddenMarkovModelTagger

Hidden Markov model class, a generative model for labelling sequence data.

HiddenMarkovModelTrainer

Algorithms for learning HMM parameters from training data.

SennaTagger

A part-of-speech tagger that interfaces with the SENNA pipeline.

SennaChunkTagger

A chunk tagger that interfaces with the SENNA pipeline.

SennaNERTagger

A named-entity tagger that interfaces with the SENNA pipeline.

CRFTagger

A module for POS tagging using CRFSuite (https://pypi.python.org/pypi/python-crfsuite).

PerceptronTagger

Greedy Averaged Perceptron tagger, as implemented by Matthew Honnibal.

pos_tag(tokens[, tagset, lang])

Use NLTK's currently recommended part of speech tagger to tag the given list of tokens.

pos_tag_sents(sentences[, tagset, lang])

Use NLTK's currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens.

Word sense disambiguation

lesk(context_sentence, ambiguous_word[, ...])

Return a synset for an ambiguous word in a context.

Sentiment analysis

SentimentAnalyzer

A Sentiment Analysis tool based on machine learning approaches.

SentimentIntensityAnalyzer

Give a sentiment intensity score to sentences.