nltk.tokenize.MWETokenizer

class nltk.tokenize.MWETokenizer[source]

Bases: TokenizerI

A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.

__init__(mwes=None, separator='_')[source]

Initialize the multi-word tokenizer with a list of expressions and a separator.

Parameters
  • mwes (list(list(str))) – A sequence of multi-word expressions to be merged, where each MWE is a sequence of strings.

  • separator (str) – String that should be inserted between words in a multi-word expression token. (Default is ‘_’)
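
For instance, a tokenizer seeded with two MWEs and the default '_' separator (an illustrative sketch; the MWE list here is invented):

>>> from nltk.tokenize import MWETokenizer
>>> tokenizer = MWETokenizer([('New', 'York'), ('Hong', 'Kong')])
>>> tokenizer.tokenize('Welcome to New York .'.split())
['Welcome', 'to', 'New_York', '.']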

add_mwe(mwe)[source]

Add a multi-word expression to the lexicon (stored as a word trie).

We use util.Trie to represent the trie. Its form is a dict of dicts. The key True marks the end of a valid MWE.

Parameters

mwe (tuple(str) or list(str)) – The multi-word expression we’re adding into the word trie

Example

>>> tokenizer = MWETokenizer()
>>> tokenizer.add_mwe(('a', 'b'))
>>> tokenizer.add_mwe(('a', 'b', 'c'))
>>> tokenizer.add_mwe(('a', 'x'))
>>> expected = {'a': {'x': {True: None}, 'b': {True: None, 'c': {True: None}}}}
>>> tokenizer._mwes == expected
True
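
Because all MWEs share one trie, expressions that overlap are resolved greedily in favour of the longest complete match. A sketch of the expected behaviour:

>>> tokenizer = MWETokenizer([('in', 'spite'), ('in', 'spite', 'of')])
>>> tokenizer.tokenize('in spite of everything'.split())
['in_spite_of', 'everything']
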
tokenize(text)[source]

Parameters

text (list(str)) – A list containing tokenized text

Returns

A list of the tokenized text with multi-words merged together

Return type

list(str)

Example

>>> tokenizer = MWETokenizer([('hors', "d'oeuvre")], separator='+')
>>> tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
['An', "hors+d'oeuvre", 'tonight,', 'sir?']
span_tokenize(s: str) → Iterator[Tuple[int, int]][source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Parameters

s (str) – The string to be tokenized.

Return type

Iterator[Tuple[int, int]]
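
A minimal sketch of the offset convention itself (plain Python, independent of any particular tokenizer):

>>> s = 'Good muffins cost'
>>> spans = [(0, 4), (5, 12), (13, 17)]
>>> [s[start:end] for start, end in spans]
['Good', 'muffins', 'cost']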

span_tokenize_sents(strings: List[str]) → Iterator[List[Tuple[int, int]]][source]

Apply self.span_tokenize() to each element of strings. I.e.:

return [self.span_tokenize(s) for s in strings]

Parameters

strings (List[str]) – The strings to be tokenized.

Yields

List[Tuple[int, int]]

Return type

Iterator[List[Tuple[int, int]]]
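
A sketch using WhitespaceTokenizer, assumed here as an example of a tokenizer class that provides a concrete span_tokenize():

>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize_sents(['Good muffins', 'cost $3.88']))
[[(0, 4), (5, 12)], [(0, 4), (5, 10)]]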

tokenize_sents(strings: List[str]) → List[List[str]][source]

Apply self.tokenize() to each element of strings. I.e.:

return [self.tokenize(s) for s in strings]

Parameters

strings (List[str]) – The strings to be tokenized. For MWETokenizer, each element of strings is itself a list of tokens, since tokenize() expects tokenized input.

Return type

List[List[str]]
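
A sketch merging an MWE across several pre-tokenized sentences:

>>> tokenizer = MWETokenizer([('New', 'York')])
>>> sents = ['I love New York .'.split(), 'New York is huge .'.split()]
>>> tokenizer.tokenize_sents(sents)
[['I', 'love', 'New_York', '.'], ['New_York', 'is', 'huge', '.']]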