nltk.tokenize.LineTokenizer
- class nltk.tokenize.LineTokenizer[source]
Bases: TokenizerI
Tokenize a string into its lines, optionally discarding blank lines. This is similar to s.split('\n').

>>> from nltk.tokenize import LineTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> LineTokenizer(blanklines='keep').tokenize(s)
['Good muffins cost $3.88', 'in New York. Please buy me', 'two of them.', '', 'Thanks.']
>>> # same as [l for l in s.split('\n') if l.strip()]:
>>> LineTokenizer(blanklines='discard').tokenize(s)
['Good muffins cost $3.88', 'in New York. Please buy me', 'two of them.', 'Thanks.']
- Parameters
blanklines – Indicates how blank lines should be handled. Valid values are:
  - discard: strip blank lines out of the token list before returning it. A line is considered blank if it contains only whitespace characters.
  - keep: leave all blank lines in the token list.
  - discard-eof: if the string ends with a newline, then do not generate a corresponding token '' after that newline.
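The three blanklines modes can be summarized with a small pure-Python sketch. This is not NLTK's actual implementation, just an illustration of the documented behavior, with `line_tokenize` as a hypothetical helper name:

```python
def line_tokenize(s, blanklines="discard"):
    # Sketch of LineTokenizer's documented behavior (assumption: the
    # real class may differ internally, e.g. by using str.splitlines).
    lines = s.split("\n")
    if blanklines == "discard":
        # drop every whitespace-only line
        lines = [l for l in lines if l.strip()]
    elif blanklines == "discard-eof":
        # drop only the trailing '' token produced by a final newline
        if lines and not lines[-1].strip():
            lines.pop()
    return lines

s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
kept = line_tokenize(s, "keep")        # blank line preserved as ''
dropped = line_tokenize(s, "discard")  # blank line removed
```

Note that 'discard-eof' only affects a blank token at the very end of the string; interior blank lines are kept in that mode.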
- span_tokenize(s)[source]
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
- Return type
Iterator[Tuple[int, int]]
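The span contract can be sketched in plain Python: for the 'keep' case, the spans are simply the stretches of text between newlines, and slicing with each span recovers the matching token. This is an illustrative sketch (the hypothetical helper `line_spans` is not part of NLTK):

```python
import re

def line_spans(s):
    # Sketch of span_tokenize for blanklines='keep': yield (start, end)
    # offsets such that s[start:end] is each '\n'-separated line.
    start = 0
    for m in re.finditer(r"\n", s):
        yield (start, m.start())
        start = m.end()
    yield (start, len(s))

s = "Good muffins cost $3.88\nin New York."
spans = list(line_spans(s))
# Each span slices back out the corresponding token:
tokens = [s[a:b] for a, b in spans]
```

Working with spans instead of token strings is useful when you need to map tokens back to their positions in the original text, e.g. for highlighting or stand-off annotation.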