nltk.text.Text
- class nltk.text.Text
Bases: object
A wrapper around a sequence of simple (string) tokens, intended to support initial exploration of texts via the interactive console. Its methods perform a variety of analyses on the text’s contexts (e.g., counting, concordancing, collocation discovery) and display the results. If you are writing a program that uses these analyses, bypass the Text class and use the appropriate analysis function or class directly instead.
A Text is typically initialized from a given document or corpus. E.g.:
>>> import nltk.corpus
>>> from nltk.text import Text
>>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
- __init__(tokens, name=None)
Create a Text object.
- Parameters
tokens (sequence of str) – The source text.
- concordance(word, width=79, lines=25)
Prints a concordance for word with the specified context window. Word matching is not case-sensitive.
- Parameters
word (str or list) – The target word or phrase (a list of strings)
width (int) – The width of each line, in characters (default=79)
lines (int) – The number of lines to display (default=25)
- Seealso
ConcordanceIndex
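The behaviour described above (case-insensitive matching, a fixed line width, a cap on the number of lines) can be illustrated with a small keyword-in-context sketch in plain Python. This is not NLTK's implementation: the function name `kwic` and the way the width is split between left and right context are assumptions made for illustration, and only single-word targets are handled.

```python
def kwic(tokens, word, width=79, lines=25):
    """Return up to `lines` keyword-in-context strings, each at most `width` chars.

    Hypothetical helper for illustration; matching ignores case.
    """
    target = word.lower()
    # Characters available on each side of the keyword (two spaces reserved).
    half = (width - len(word) - 2) // 2
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() != target:
            continue
        left = " ".join(tokens[:i])[-half:]        # tail of the left context
        right = " ".join(tokens[i + 1:])[:half]    # head of the right context
        rows.append(f"{left:>{half}} {tok} {right}")
        if len(rows) == lines:
            break
    return rows


tokens = "The whale swam . A Whale is big . No whale here".split()
rows = kwic(tokens, "whale", width=31, lines=2)
for row in rows:
    print(row)
```

Right-aligning the left context keeps every keyword in the same column, which is what makes a concordance easy to scan.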
- concordance_list(word, width=79, lines=25)
Generate a concordance for word with the specified context window. Word matching is not case-sensitive.
- Parameters
word (str or list) – The target word or phrase (a list of strings)
width (int) – The width of each line, in characters (default=79)
lines (int) – The number of lines to generate (default=25)
- Seealso
ConcordanceIndex
- collocation_list(num=20, window_size=2)
Return collocations derived from the text, ignoring stopwords.
>>> from nltk.book import text4
>>> text4.collocation_list()[:2]
[('United', 'States'), ('fellow', 'citizens')]
- Parameters
num (int) – The maximum number of collocations to return.
window_size (int) – The number of tokens spanned by a collocation (default=2)
- Return type
list(tuple(str, str))
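To make "collocations, ignoring stopwords" concrete, here is a minimal sketch that counts adjacent word pairs and drops any pair containing a stopword. It ranks pairs by raw frequency, which is a simplification and should not be read as the scoring NLTK uses. The name `top_bigrams` and the tiny `STOPWORDS` set are hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not NLTK's stopword corpus).
STOPWORDS = {"the", "of", "and", "a", "to", "in"}


def top_bigrams(tokens, num=20):
    """Return the `num` most frequent adjacent word pairs without stopwords."""
    pairs = Counter(
        (w1, w2)
        for w1, w2 in zip(tokens, tokens[1:])
        if w1.isalpha() and w2.isalpha()
        and w1.lower() not in STOPWORDS
        and w2.lower() not in STOPWORDS
    )
    return [pair for pair, _count in pairs.most_common(num)]


tokens = ("the United States and the United States of America "
          "thank fellow citizens").split()
result = top_bigrams(tokens, num=2)
print(result)
```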
- collocations(num=20, window_size=2)
Print collocations derived from the text, ignoring stopwords.
>>> from nltk.book import text4
>>> text4.collocations()
United States; fellow citizens; four years; ...
- Parameters
num (int) – The maximum number of collocations to print.
window_size (int) – The number of tokens spanned by a collocation (default=2)
- similar(word, num=20)
Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.
- Parameters
word (str) – The word used to seed the similarity search
num (int) – The number of words to generate (default=20)
- Seealso
ContextIndex.similar_words()
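One way to picture distributional similarity is to treat the context of a token as the pair (previous word, next word) and to score other words by how many such contexts they share with the seed word. The sketch below does exactly that in plain Python. It is an illustration under those assumptions, not the ContextIndex implementation, and `similar_words` here is a hypothetical standalone function.

```python
from collections import Counter, defaultdict


def similar_words(tokens, word, num=20):
    """Rank words by the number of (previous, next) contexts shared with `word`."""
    contexts = defaultdict(set)
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[cur.lower()].add((prev.lower(), nxt.lower()))
    seed = word.lower()
    target = contexts[seed]
    scores = Counter({
        other: len(ctx & target)
        for other, ctx in contexts.items()
        if other != seed and ctx & target
    })
    return [w for w, _score in scores.most_common(num)]


tokens = "a big dog ran . a small dog ran . a big cat ran . a red car".split()
print(similar_words(tokens, "dog"))
```

In this toy text, "cat" is the only word that appears between "big" and "ran" the way "dog" does, so it is the only result.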
- common_contexts(words, num=20)
Find contexts where the specified words appear; list most frequent common contexts first.
- Parameters
words (list(str)) – The words used to seed the similarity search
num (int) – The number of contexts to generate (default=20)
- Seealso
ContextIndex.common_contexts()
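As a rough illustration of "common contexts", the following sketch collects the (previous, next) word pairs around each requested word, keeps only the pairs seen with every word, and orders them by total frequency. The function name `shared_contexts` and the `left_right` output format are assumptions for this example, and a non-empty `words` list is assumed.

```python
from collections import Counter


def shared_contexts(tokens, words, num=20):
    """Return contexts shared by all `words`, most frequent first."""
    wanted = {w.lower() for w in words}
    seen = {w: Counter() for w in wanted}
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        key = cur.lower()
        if key in wanted:
            seen[key][(prev.lower(), nxt.lower())] += 1
    # Keep only contexts observed with every requested word.
    common = set.intersection(*(set(c) for c in seen.values()))
    totals = Counter({ctx: sum(c[ctx] for c in seen.values()) for ctx in common})
    return [f"{left}_{right}" for (left, right), _n in totals.most_common(num)]


tokens = "a big dog ran . a small dog ran . a big cat ran . a red car".split()
print(shared_contexts(tokens, ["dog", "cat"]))
```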
- dispersion_plot(words)
Produce a plot showing the distribution of the words through the text. Requires pylab (part of matplotlib) to be installed.
- Parameters
words (list(str)) – The words to be plotted
- Seealso
nltk.draw.dispersion_plot()
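The data behind a dispersion plot is simply the list of token offsets at which each word occurs. The sketch below computes those offsets without drawing anything, so it runs without a plotting library. `word_offsets` is a hypothetical helper, and exact (case-sensitive) matching is an assumption of this example.

```python
def word_offsets(tokens, words):
    """Map each word to the token positions where it occurs in `tokens`."""
    offsets = {w: [] for w in words}
    for i, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(i)
    return offsets


tokens = "citizens of America , free citizens , America".split()
offsets = word_offsets(tokens, ["citizens", "America", "duty"])
print(offsets)
```

Plotting each offset as a mark on a horizontal line, one line per word, gives the usual dispersion picture.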
- generate(length=100, text_seed=None, random_seed=42)
Print random text, generated using a trigram language model. See also help(nltk.lm).
- Parameters
length (int) – The length of text to generate (default=100)
text_seed (list(str)) – Generation can be conditioned on preceding context.
random_seed (int) – A random seed or an instance of random.Random. If provided, makes the random sampling part of generation reproducible. (default=42)
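The roles of the three parameters can be seen in a toy trigram generator: `length` caps the output, `text_seed` supplies the preceding context, and `random_seed` makes sampling reproducible. This sketch uses a plain lookup table rather than nltk.lm, so it only illustrates the idea. The function name `generate_trigram_text` is hypothetical, and a `text_seed` of at least two tokens is assumed.

```python
import random
from collections import defaultdict


def generate_trigram_text(tokens, length=100, text_seed=None, random_seed=42):
    """Sample up to `length` tokens, each chosen given the two before it."""
    # Accept either a seed value or a ready-made random.Random instance.
    rng = (random_seed if isinstance(random_seed, random.Random)
           else random.Random(random_seed))
    followers = defaultdict(list)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        followers[(w1, w2)].append(w3)
    out = list(text_seed) if text_seed else list(tokens[:2])
    while len(out) < length:
        choices = followers.get(tuple(out[-2:]))
        if not choices:  # unseen context: stop early
            break
        out.append(rng.choice(choices))
    return out


tokens = "the cat sat on the mat and the cat ran".split()
first = generate_trigram_text(tokens, length=8, random_seed=1)
second = generate_trigram_text(tokens, length=8, random_seed=1)
print(" ".join(first))
```

Calling the function twice with the same seed yields the same token sequence, which is the reproducibility the `random_seed` parameter describes.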
- findall(regexp)
Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.
>>> print('hack'); from nltk.book import text1, text5, text9
hack...
>>> text5.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> text1.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good; mature; white; Cape; great; wise; wise; butterless; white; fiendish; pale; furious; better; certain; complete; dismasted; younger; brave; brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing that; that that thing; through these than through; them that the; through the thick; them that they; thought that the
- Parameters
regexp (str) – A regular expression
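A sketch of how angle-bracket patterns can be matched over a token list: join the tokens into one string with each token wrapped in `<...>`, rewrite the pattern so that `.` cannot cross a token boundary, and run an ordinary regular expression search. This is an approximation written for illustration, not NLTK's code. `token_findall` is a hypothetical name, and patterns that themselves contain `<` or `>` inside regex syntax (such as lookbehinds) are not supported.

```python
import re


def token_findall(tokens, regexp):
    """Return each match as a list of tokens."""
    raw = "".join(f"<{t}>" for t in tokens)
    regexp = re.sub(r"\s", "", regexp)
    # Wrap each bracketed token pattern in a non-capturing group.
    regexp = regexp.replace("<", "(?:<(?:").replace(">", ")>)")
    # An unescaped "." must not match past the end of a token.
    regexp = re.sub(r"(?<!\\)\.", "[^>]", regexp)
    hits = re.findall(regexp, raw)
    return [h[1:-1].split("><") for h in hits]


tokens = "a nervous man met a white man".split()
print(token_findall(tokens, "<a><.*><man>"))
print(token_findall(tokens, "<a>(<.*>)<man>"))
```

As in the examples above, a parenthesised group restricts the result to the tokens inside the group, here the word between "a" and "man".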