nltk.text.TokenSearcher
- class nltk.text.TokenSearcher
Bases: object
A class that makes it easier to use regular expressions to search over tokenized strings. The tokenized string is converted to a string where tokens are marked with angle brackets, e.g., '<the><window><is><still><open>'. The regular expression passed to the findall() method is modified to treat angle brackets as non-capturing parentheses, in addition to matching the token boundaries, and to have '.' not match the angle brackets.
- findall(regexp)
Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.
>>> from nltk.text import TokenSearcher
>>> print('hack'); from nltk.book import text1, text5, text9
hack...
>>> text5.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> text1.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good; mature; white; Cape; great; wise; wise; butterless; white; fiendish; pale; furious; better; certain; complete; dismasted; younger; brave; brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing that; that that thing; through these than through; them that the; through the thick; them that they; thought that the
- Parameters
regexp (str) – A regular expression
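The pattern rewriting described above can be illustrated with a small standalone sketch that uses only the standard re module. It is an approximation written from the description on this page, not NLTK's own source; the helper names tokens_to_string, translate_pattern and find_token_sequences are hypothetical and exist only for this example. The sketch assumes the pattern contains at most one capturing group.

```python
import re


def tokens_to_string(tokens):
    # Hypothetical helper: mark each token with angle brackets,
    # e.g. ["the", "window"] -> "<the><window>".
    return "".join("<" + tok + ">" for tok in tokens)


def translate_pattern(regexp):
    # Hypothetical helper: rewrite a token-level pattern into a plain regex.
    regexp = re.sub(r"\s", "", regexp)             # ignore whitespace in the pattern
    regexp = regexp.replace("<", "(?:<(?:")        # '<' opens a non-capturing group
    regexp = regexp.replace(">", ")>)")            # '>' closes it at the token boundary
    regexp = re.sub(r"(?<!\\)\.", "[^>]", regexp)  # unescaped '.' must not cross '>'
    return regexp


def find_token_sequences(tokens, regexp):
    # Assumption: zero or one capturing group, so re.findall returns strings.
    raw = tokens_to_string(tokens)
    hits = re.findall(translate_pattern(regexp), raw)
    # Strip the outer brackets and split each hit back into tokens.
    return [hit[1:-1].split("><") for hit in hits]


tokens = ["the", "window", "is", "still", "open"]
print(tokens_to_string(tokens))                # <the><window><is><still><open>
print(translate_pattern("<.*><is>"))           # (?:<(?:[^>]*)>)(?:<(?:is)>)
print(find_token_sequences(tokens, "<.*><is>"))  # [['window', 'is']]
```

Because '.' is rewritten to exclude '>', a pattern such as <.*> can match at most one token, which is why <.*><is> finds only the single token preceding "is" rather than everything before it. Note that this sketch returns lists of tokens, whereas the findall() calls in the example above print their matches.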