nltk.tokenize.LineTokenizer

class nltk.tokenize.LineTokenizer[source]

Bases: TokenizerI

Tokenize a string into its lines, optionally discarding blank lines. This is similar to s.split('\n').

>>> from nltk.tokenize import LineTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> LineTokenizer(blanklines='keep').tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', '', 'Thanks.']
>>> # same as [l for l in s.split('\n') if l.strip()]:
>>> LineTokenizer(blanklines='discard').tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', 'Thanks.']
Parameters

blanklines

Indicates how blank lines should be handled. Valid values are:

  • discard: strip blank lines out of the token list before returning it.
    A line is considered blank if it contains only whitespace characters.

  • keep: leave all blank lines in the token list.

  • discard-eof: if the string ends with a newline, then do not generate
    a corresponding token '' after that newline (see the example after
    this list).
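
Only discard-eof's handling of a trailing blank line differs from keep; interior blank lines are preserved either way. A doctest-style sketch of that case (the expected output follows from the semantics described above; with blanklines='keep' the same string would end with an extra '' token):

>>> LineTokenizer(blanklines='discard-eof').tokenize("two of them.\n\nThanks.\n\n")
['two of them.', '', 'Thanks.']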

__init__(blanklines='discard')[source]
tokenize(s)[source]

Return a tokenized copy of s.

Return type

List[str]

span_tokenize(s)[source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type

Iterator[Tuple[int, int]]
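
As a sketch of the spans this produces, reusing the sample string s from the doctest above (the exact offsets assume the default blanklines='discard', under which the blank line contributes no span):

>>> list(LineTokenizer().span_tokenize(s))
[(0, 23), (24, 51), (52, 64), (66, 73)]
>>> s[66:73]
'Thanks.'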

span_tokenize_sents(strings: List[str]) → Iterator[List[Tuple[int, int]]][source]

Apply self.span_tokenize() to each element of strings, yielding the span list for each string in turn. I.e.:

for s in strings:
    yield list(self.span_tokenize(s))

Yield

List[Tuple[int, int]]

Parameters

strings (List[str]) – the strings to be tokenized

Return type

Iterator[List[Tuple[int, int]]]
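
A short sketch: per the signature above, each yielded element is the complete span list for one input string (offsets again assume the default blanklines='discard'):

>>> list(LineTokenizer().span_tokenize_sents(["one\ntwo", "three"]))
[[(0, 3), (4, 7)], [(0, 5)]]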

tokenize_sents(strings: List[str]) → List[List[str]][source]

Apply self.tokenize() to each element of strings. I.e.:

return [self.tokenize(s) for s in strings]

Return type

List[List[str]]

Parameters

strings (List[str]) – the strings to be tokenized
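
A short sketch, following the list-comprehension equivalence above:

>>> LineTokenizer().tokenize_sents(["one\ntwo", "three"])
[['one', 'two'], ['three']]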