Corpora

Data downloaders

The NLTK corpus and module downloader. This module defines several interfaces that can be used to download corpora, models, and other data packages for use with NLTK.

Downloading Packages

If called with no arguments, download() will display an interactive interface which can be used to download and install new packages. If Tkinter is available, a graphical interface will be shown; otherwise, a simple text interface will be provided.

Individual packages can be downloaded by calling the download() function with a single argument, giving the package identifier for the package that should be downloaded:

>>> download('treebank') 
[nltk_data] Downloading package 'treebank'...
[nltk_data]   Unzipping corpora/treebank.zip.

NLTK also provides a number of “package collections”, each consisting of a group of related packages. To download all packages in a collection, simply call download() with the collection’s identifier:

>>> download('all-corpora') 
[nltk_data] Downloading package 'abc'...
[nltk_data]   Unzipping corpora/abc.zip.
[nltk_data] Downloading package 'alpino'...
[nltk_data]   Unzipping corpora/alpino.zip.
  ...
[nltk_data] Downloading package 'words'...
[nltk_data]   Unzipping corpora/words.zip.

Download Directory

By default, packages are installed in a system-wide directory (if Python has permission to write to it) or in the current user’s home directory. However, the download_dir argument may be used to specify a different installation target, if desired.

See Downloader.default_download_dir() for a more detailed description of how the default download directory is chosen.
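
For example, assuming a writable directory such as /tmp/nltk_data (a purely illustrative path), the target can be set explicitly; note that the new location must also be added to nltk.data.path (or the NLTK_DATA environment variable) so that NLTK can find the package later:

>>> import nltk
>>> nltk.download('treebank', download_dir='/tmp/nltk_data')  # illustrative target directory
>>> nltk.data.path.append('/tmp/nltk_data')                   # make the new location searchable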

NLTK Download Server

Before downloading any packages, the corpus and module downloader contacts the NLTK download server to retrieve an index file describing the available packages. By default, this index file is loaded from https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml. If necessary, it is possible to create a new Downloader object that uses a different URL for the package index file.
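
As a sketch (the mirror URL below is purely illustrative, and the server_index_url keyword of the Downloader constructor is an assumption here):

>>> from nltk.downloader import Downloader
>>> d = Downloader(server_index_url='https://example.org/nltk_data/index.xml')  # illustrative mirror
>>> d.download('treebank')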

Usage:

python nltk/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

or:

python -m nltk.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

download([info_or_id, download_dir, quiet, ...])

download_shell()

download_gui()

Corpus readers

PlaintextCorpusReader

Reader for corpora that consist of plaintext documents.
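
A minimal sketch, assuming a directory of plain .txt files (the corpus root below is hypothetical):

>>> from nltk.corpus import PlaintextCorpusReader
>>> reader = PlaintextCorpusReader('/path/to/corpus', r'.*\.txt')  # hypothetical corpus root
>>> reader.fileids()   # file identifiers matching the pattern
>>> reader.words()     # word-tokenized text (WordPunctTokenizer by default)
>>> reader.sents()     # sentence-tokenized text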

TaggedCorpusReader

Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor. Words are parsed using nltk.tag.str2tuple. By default, '/' is used as the separator, i.e., words should have the form word/tag (for example, fly/NN).
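
A short sketch of both pieces: str2tuple splits a word/tag string, and the reader applies it to every token in the corpus files (the corpus root and file pattern below are hypothetical):

>>> from nltk.tag import str2tuple
>>> str2tuple('fly/NN')
('fly', 'NN')
>>> from nltk.corpus import TaggedCorpusReader
>>> reader = TaggedCorpusReader('/path/to/corpus', r'.*\.pos')  # hypothetical root and pattern
>>> reader.tagged_words()
>>> reader.tagged_sents()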

BracketParseCorpusReader

Reader for corpora that consist of parenthesis-delineated parse trees, like those found in the "combined" section of the Penn Treebank, e.g. "(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))".
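
For instance, the 'treebank' data package is read by this class; a short sketch, assuming that package has been downloaded:

>>> from nltk.corpus import treebank
>>> tree = treebank.parsed_sents('wsj_0001.mrg')[0]
>>> print(tree)   # an nltk.Tree built from the bracketed annotation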

CategorizedPlaintextCorpusReader

A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.
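
The movie_reviews corpus is one such reader; a short sketch, assuming that data package has been downloaded:

>>> from nltk.corpus import movie_reviews
>>> movie_reviews.categories()
['neg', 'pos']
>>> movie_reviews.fileids('pos')[:2]       # file identifiers in the 'pos' category
>>> movie_reviews.words(categories='neg')  # words restricted to one category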

CategorizedTaggedCorpusReader

A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.

CategorizedBracketParseCorpusReader

A reader for parsed corpora whose documents are divided into categories based on their file identifiers.

ConllCorpusReader

A corpus reader for CoNLL-style files.

ConllChunkCorpusReader

A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.
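
The conll2000 chunking corpus uses this reader; a short sketch, assuming that data package has been downloaded:

>>> from nltk.corpus import conll2000
>>> conll2000.tagged_words('train.txt')[:5]
>>> conll2000.chunked_sents('train.txt', chunk_types=['NP'])[0]   # NP chunks as an nltk.Tree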

XMLCorpusReader

Corpus reader for corpora whose documents are xml files.

CMUDictCorpusReader

WordListCorpusReader

List of words, one per line.
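
The words and stopwords corpora are read with this class; a short sketch, assuming those data packages have been downloaded:

>>> from nltk.corpus import words, stopwords
>>> len(words.words())             # one entry per line of the wordlist
>>> stopwords.words('english')[:5]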

PPAttachmentCorpusReader

Reader for the Prepositional Phrase Attachment corpus. Each record has the form: sentence_id verb noun1 preposition noun2 attachment.
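
A sketch of reading these records (assuming the 'ppattach' data package has been downloaded; the attribute names below mirror the record format above):

>>> from nltk.corpus import ppattach
>>> inst = ppattach.attachments('training')[0]
>>> inst.verb, inst.noun1, inst.prep, inst.noun2, inst.attachment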

SensevalCorpusReader

IEERCorpusReader

ChunkedCorpusReader

Reader for chunked (and optionally tagged) corpora.

SinicaTreebankCorpusReader

Reader for the Sinica Treebank.

IndianCorpusReader

List of words, one per line.

ToolboxCorpusReader

TimitCorpusReader

Reader for the TIMIT corpus (or any other corpus with the same file layout and use of file formats).

YCOECorpusReader

Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.

MacMorphoCorpusReader

A corpus reader for the MAC_MORPHO corpus.

AlpinoCorpusReader

Reader for the Alpino Dutch Treebank.

RTECorpusReader

Corpus reader for corpora in RTE challenges.

StringCategoryCorpusReader

EuroparlCorpusReader

Reader for Europarl corpora that consist of plaintext documents.

PortugueseCategorizedPlaintextCorpusReader

PropbankCorpusReader

Corpus reader for the propbank corpus, which augments the Penn Treebank with information about the predicate argument structure of every verb instance.

VerbnetCorpusReader

An NLTK interface to the VerbNet verb lexicon.

BNCCorpusReader

Corpus reader for the XML version of the British National Corpus.

NPSChatCorpusReader

SwadeshCorpusReader

WordNetCorpusReader

A corpus reader used to access wordnet or its variants.
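
A short sketch, assuming the 'wordnet' data package has been downloaded:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')[:3]
>>> wn.synset('dog.n.01').definition()
>>> wn.synset('dog.n.01').lemma_names()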

WordNetICCorpusReader

A corpus reader for the WordNet information content corpus.

SwitchboardCorpusReader

DependencyCorpusReader

NombankCorpusReader

Corpus reader for the nombank corpus, which augments the Penn Treebank with information about the predicate argument structure of every noun instance.

IPIPANCorpusReader

Corpus reader designed to work with the corpus created by IPI PAN.

Pl196xCorpusReader

TEICorpusView

KNBCorpusReader

Reader for the KNB corpus of annotated Japanese text.

ChasenCorpusReader

CHILDESCorpusReader

Corpus reader for the XML version of the CHILDES corpus.

AlignedCorpusReader

Reader for corpora of word-aligned sentences.

TimitTaggedCorpusReader

A corpus reader for tagged sentences that are included in the TIMIT corpus.

LinThesaurusCorpusReader

Wrapper for the LISP-formatted thesauruses distributed by Dekang Lin.

SemcorCorpusReader

Corpus reader for the SemCor Corpus.

FramenetCorpusReader

A corpus reader for the Framenet Corpus.

UdhrCorpusReader

SentiWordNetCorpusReader

SentiSynset

TwitterCorpusReader

Reader for corpora that consist of Tweets represented as a list of line-delimited JSON.
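
The twitter_samples corpus uses this reader; a short sketch, assuming that data package has been downloaded (the file identifier below is one of its bundled files):

>>> from nltk.corpus import twitter_samples
>>> twitter_samples.fileids()
>>> twitter_samples.strings('positive_tweets.json')[:2]    # raw tweet text
>>> twitter_samples.tokenized('positive_tweets.json')[0]   # pre-tokenized form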

NKJPCorpusReader

CrubadanCorpusReader

A corpus reader used to access the An Crubadan language n-gram files.

MTECorpusReader

Reader for corpora following the TEI-p5 xml scheme, such as MULTEXT-East.

ReviewsCorpusReader

Reader for the Customer Review Data dataset by Hu and Liu (2004).

OpinionLexiconCorpusReader

Reader for the Liu and Hu opinion lexicon.

ProsConsCorpusReader

Reader for the Pros and Cons sentence dataset.

CategorizedSentencesCorpusReader

A reader for corpora in which each row represents a single instance, typically a sentence.

ComparativeSentencesCorpusReader

Reader for the Comparative Sentence Dataset by Jindal and Liu (2006).

PanLexLiteCorpusReader

NonbreakingPrefixesCorpusReader

This is a class to read the nonbreaking prefixes text files from the Moses Machine Translation toolkit.

UnicharsCorpusReader

This class is used to read lists of characters from the Perl Unicode Properties (see https://perldoc.perl.org/perluniprops.html).

MWAPPDBCorpusReader

This class is used to read the list of word pairs from the subset of lexical pairs of The Paraphrase Database (PPDB) XXXL used in the Monolingual Word Alignment (MWA) algorithm described in Sultan et al. (2014a, 2014b, 2015).

PanlexSwadeshCorpusReader

This is a class to read the PanLex Swadesh lists.

Texts

ContextIndex

A bidirectional index between words and their 'contexts' in a text.

ConcordanceIndex

An index that can be used to look up the offset locations at which a given word occurs in a document.
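
A short sketch, assuming the 'gutenberg' data package has been downloaded:

>>> from nltk.text import ConcordanceIndex
>>> from nltk.corpus import gutenberg
>>> ci = ConcordanceIndex(gutenberg.words('austen-emma.txt'))
>>> ci.offsets('Emma')[:5]                 # token offsets of the word
>>> ci.print_concordance('Emma', lines=3)  # formatted concordance lines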

TokenSearcher

A class that makes it easier to use regular expressions to search over tokenized strings.
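
A short sketch, again over the Gutenberg sample; in the pattern, each pair of angle brackets delimits a single token:

>>> from nltk.text import TokenSearcher
>>> from nltk.corpus import gutenberg
>>> ts = TokenSearcher(gutenberg.words('melville-moby_dick.txt'))
>>> ts.findall(r'<a> (<.*>) <man>')   # tokens occurring between 'a' and 'man'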

Text

A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console).
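
A short sketch, assuming the 'gutenberg' data package has been downloaded:

>>> from nltk.text import Text
>>> from nltk.corpus import gutenberg
>>> emma = Text(gutenberg.words('austen-emma.txt'))
>>> emma.concordance('Emma', lines=5)
>>> emma.collocations()
>>> emma.count('Emma')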

TextCollection

A collection of texts, which can be loaded with a list of texts or with a corpus consisting of one or more texts, and which supports counting, concordancing, collocation discovery, etc.
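
A short sketch of the tf-idf support, built from a few Gutenberg texts (chosen here only as convenient sample data):

>>> from nltk.text import TextCollection
>>> from nltk.corpus import gutenberg
>>> texts = [gutenberg.words(f) for f in gutenberg.fileids()[:3]]
>>> collection = TextCollection(texts)
>>> collection.tf_idf('love', texts[0])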

Visualizations

collocations()

concordance()

nemo()

wordnet()

dispersion_plot(text, words[, ignore_case, ...])

Generate a lexical dispersion plot.
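
A short sketch (matplotlib is required for plotting; the inaugural corpus is used here only as convenient sample data):

>>> from nltk.draw.dispersion import dispersion_plot
>>> from nltk.corpus import inaugural
>>> dispersion_plot(inaugural.words(), ['citizens', 'democracy', 'freedom'], ignore_case=True)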