nltk.text.TextCollection
- class nltk.text.TextCollection
Bases: Text
A collection of texts, which can be loaded with a list of texts or with a corpus consisting of one or more texts, and which supports counting, concordancing, collocation discovery, etc. Initialize a TextCollection as follows:
>>> import nltk.corpus
>>> from nltk.text import TextCollection
>>> print('hack'); from nltk.book import text1, text2, text3
hack...
>>> gutenberg = TextCollection(nltk.corpus.gutenberg)
>>> mytexts = TextCollection([text1, text2, text3])
Iterating over a TextCollection produces all the tokens of all the texts in order.
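The iteration order can be checked without downloading any corpus data. The following is a minimal sketch, assuming NLTK is installed; the two token lists are made up for illustration and are not taken from any corpus.

```python
from nltk.text import Text, TextCollection

# Two tiny in-memory texts (illustrative tokens, not corpus data).
first = Text(["call", "me", "Ishmael", "."])
second = Text(["in", "the", "beginning"])

collection = TextCollection([first, second])

# Iteration yields the tokens of the first text, then those of the second.
all_tokens = list(collection)
print(all_tokens)
print(len(collection))
```

Because the collection behaves like one long text, the methods documented below operate on the concatenated token sequence.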
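- __init__(source)
Create a TextCollection from a list of texts or from a corpus.
- Parameters
source (list of texts, or a corpus) – The texts to load: either a list of texts or a corpus consisting of one or more texts.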
- collocation_list(num=20, window_size=2)
Return collocations derived from the text, ignoring stopwords.
>>> from nltk.book import text4
>>> text4.collocation_list()[:2]
[('United', 'States'), ('fellow', 'citizens')]
- Parameters
num (int) – The maximum number of collocations to return.
window_size (int) – The number of tokens spanned by a collocation (default=2)
- Return type
list(tuple(str, str))
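To make "collocations, ignoring stopwords" concrete, here is a rough sketch built on nltk.collocations rather than on Text itself. It should not be read as the exact logic inside collocation_list: the STOPWORDS set is a hypothetical stand-in for a real stopword list, and the frequency and word-length thresholds are assumptions chosen for illustration.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Hypothetical stand-in for a real stopword list.
STOPWORDS = {"the", "and", "of"}

tokens = "the United States and the United States of the United States".split()

# Collect candidate pairs of adjacent tokens (window_size=2).
finder = BigramCollocationFinder.from_words(tokens, window_size=2)

# Assumed thresholds: drop pairs seen fewer than twice, and pairs that
# contain a very short word or a stopword.
finder.apply_freq_filter(2)
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in STOPWORDS)

# Rank the surviving pairs by likelihood ratio and keep at most 20.
pairs = finder.nbest(BigramAssocMeasures.likelihood_ratio, 20)
print(pairs)
```

The result has the same shape as the documented return type: a list of (str, str) tuples, at most num entries long.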
- collocations(num=20, window_size=2)
Print collocations derived from the text, ignoring stopwords.
>>> from nltk.book import text4
>>> text4.collocations()
United States; fellow citizens; four years; ...
- Parameters
num (int) – The maximum number of collocations to print.
window_size (int) – The number of tokens spanned by a collocation (default=2)
- common_contexts(words, num=20)
Find contexts where the specified words appear; list most frequent common contexts first.
- Parameters
words (list(str)) – The words used to seed the similarity search
num (int) – The maximum number of common contexts to list (default=20)
- Seealso
ContextIndex.common_contexts()
- concordance(word, width=79, lines=25)
Print a concordance for word with the specified context window. Word matching is not case-sensitive.
- Parameters
word (str or list) – The target word or phrase (a list of strings)
width (int) – The width of each line, in characters (default=79)
lines (int) – The number of lines to display (default=25)
- Seealso
ConcordanceIndex
- concordance_list(word, width=79, lines=25)
Generate a concordance for word with the specified context window. Word matching is not case-sensitive.
- Parameters
word (str or list) – The target word or phrase (a list of strings)
width (int) – The width of each line, in characters (default=79)
lines (int) – The number of lines to display (default=25)
- Seealso
ConcordanceIndex
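The following short sketch shows concordance_list on an in-memory text, assuming a recent NLTK release in which each returned entry exposes offset, query and line attributes. The tokens are invented for illustration.

```python
from nltk.text import Text

# A tiny in-memory text; "Whale" and "whale" differ only in case.
text = Text(["The", "Whale", "swam", ".", "A", "whale", "dived", "."])

# Matching is case-insensitive, so both occurrences are returned.
hits = text.concordance_list("whale")
for hit in hits:
    # offset is the token index of the match; line is the formatted row.
    print(hit.offset, hit.query, "|", hit.line)

offsets = [hit.offset for hit in hits]
queries = [hit.query for hit in hits]
```

Unlike concordance, which prints its result, this variant returns the matches, so they can be filtered or counted programmatically.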
- dispersion_plot(words)
Produce a plot showing the distribution of the words through the text. Requires pylab to be installed.
- Parameters
words (list(str)) – The words to be plotted
- Seealso
nltk.draw.dispersion_plot()
- findall(regexp)
Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. For example:
>>> print('hack'); from nltk.book import text1, text5, text9
hack...
>>> text5.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> text1.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing
that; that that thing; through these than through; them that the;
through the thick; them that they; thought that the
- Parameters
regexp (str) – A regular expression
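The role of the angle brackets is easier to see in a standalone sketch. The helper below, find_token_sequences, is a hypothetical name and a simplified stand-in using only the standard re module; it is not presented as NLTK's implementation. It assumes tokens contain no angle brackets themselves.

```python
import re

def find_token_sequences(tokens, pattern):
    """Return token sequences matching an angle-bracket pattern (illustrative)."""
    # Encode the token list as one string, wrapping every token in <...>.
    raw = "".join(f"<{tok}>" for tok in tokens)
    # Ignore whitespace in the pattern, then turn each <...> into a
    # non-capturing group so quantifiers like {3,} apply to whole tokens.
    pattern = re.sub(r"\s", "", pattern)
    pattern = pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    # An unescaped "." must not match across a token boundary.
    pattern = re.sub(r"(?<!\\)\.", "[^>]", pattern)
    # With a capturing group, re.findall returns only the grouped tokens.
    return [hit[1:-1].split("><") for hit in re.findall(pattern, raw)]

tokens = "he was a good man and a wise man".split()
adjectives = find_token_sequences(tokens, "<a>(<.*>)<man>")
triples = find_token_sequences("you rule bro".split(), "<.*><.*><bro>")
print(adjectives)
print(triples)
```

As in the examples above, parentheses select which tokens of a match are reported, and a pattern without parentheses reports the whole matched sequence.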
- generate(length=100, text_seed=None, random_seed=42)
Print random text, generated using a trigram language model. See also help(nltk.lm).
- Parameters
length (int) – The length of text to generate (default=100)
text_seed (list(str)) – Generation can be conditioned on preceding context.
random_seed (int) – A random seed or an instance of random.Random. If provided, makes the random sampling part of generation reproducible. (default=42)
- similar(word, num=20)
Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.
- Parameters
word (str) – The word used to seed the similarity search
num (int) – The number of words to generate (default=20)
- Seealso
ContextIndex.similar_words()
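The idea of distributional similarity can be sketched in plain Python. The helper below, similar_words, is a hypothetical, simplified illustration and not NLTK's code: it assumes a context is the pair (previous token, next token), lowercases all tokens, and ranks candidates by how many distinct contexts they share with the target word.

```python
from collections import Counter, defaultdict

def similar_words(tokens, word, num=20):
    """Rank words sharing (left, right) contexts with `word` (illustrative)."""
    word = word.lower()
    # Pad so the first and last tokens also have a context.
    padded = ["*START*"] + [t.lower() for t in tokens] + ["*END*"]
    contexts = defaultdict(set)
    for left, tok, right in zip(padded, padded[1:], padded[2:]):
        contexts[tok].add((left, right))
    target = contexts.get(word, set())
    # Score every other word by its number of shared contexts.
    scores = Counter()
    for other, seen in contexts.items():
        shared = len(seen & target)
        if other != word and shared:
            scores[other] = shared
    return [w for w, _ in scores.most_common(num)]

tokens = ("the big dog ran . the small dog ran . "
          "the big cat sat . the small cat sat").split()
neighbours = similar_words(tokens, "big")
print(neighbours)
```

Here "big" and "small" both occur between "the" and "dog" and between "the" and "cat", so "small" is the only word reported as similar to "big".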