nltk.tokenize.MWETokenizer
- class nltk.tokenize.MWETokenizer[source]
Bases: TokenizerI
A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.
- __init__(mwes=None, separator='_')[source]
Initialize the multi-word tokenizer with a list of expressions and a separator.
- Parameters
mwes (list(list(str))) – A sequence of multi-word expressions to be merged, where each MWE is a sequence of strings.
separator (str) – String that should be inserted between words in a multi-word expression token. (Default is ‘_’)
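For instance, a minimal sketch of constructing the tokenizer with an initial expression list and the default separator (the expression and sentence below are illustrative, not from the library's documentation):

>>> from nltk.tokenize import MWETokenizer
>>> tokenizer = MWETokenizer([('in', 'spite', 'of')], separator='_')
>>> tokenizer.tokenize('He came in spite of the rain .'.split())
['He', 'came', 'in_spite_of', 'the', 'rain', '.']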
- add_mwe(mwe)[source]
Add a multi-word expression to the lexicon (stored as a word trie).
We use util.Trie to represent the trie. Its form is a dict of dicts. The key True marks the end of a valid MWE.
- Parameters
mwe (tuple(str) or list(str)) – The multi-word expression we’re adding into the word trie
- Example
>>> tokenizer = MWETokenizer()
>>> tokenizer.add_mwe(('a', 'b'))
>>> tokenizer.add_mwe(('a', 'b', 'c'))
>>> tokenizer.add_mwe(('a', 'x'))
>>> expected = {'a': {'x': {True: None}, 'b': {True: None, 'c': {True: None}}}}
>>> tokenizer._mwes == expected
True
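Matching at tokenization time walks this trie greedily, so when several stored expressions share a prefix the longest one that fits the input wins. A brief sketch continuing the example above (the longest-match behaviour is inferred from the trie lookup, so treat it as an assumption rather than a documented guarantee):

>>> tokenizer.tokenize(['a', 'b', 'c', 'd'])
['a_b_c', 'd']
>>> tokenizer.tokenize(['a', 'b', 'd'])
['a_b', 'd']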
- tokenize(text)[source]
- Parameters
text (list(str)) – A list containing tokenized text
- Returns
A list of the tokenized text with multi-word expressions merged into single tokens
- Return type
list(str)
- Example
>>> tokenizer = MWETokenizer([('hors', "d'oeuvre")], separator='+')
>>> tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
['An', "hors+d'oeuvre", 'tonight,', 'sir?']
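Because the input must already be tokenized, the tokenizer is typically applied to the output of another tokenizer such as str.split(); a further sketch with the default '_' separator (the expressions here are illustrative):

>>> tokenizer = MWETokenizer([('New', 'York'), ('Hong', 'Kong')])
>>> tokenizer.tokenize('I flew from New York to Hong Kong .'.split())
['I', 'flew', 'from', 'New_York', 'to', 'Hong_Kong', '.']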
- span_tokenize(s: str) → Iterator[Tuple[int, int]][source]
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
- Return type
Iterator[Tuple[int, int]]
- Parameters
s (str) – string to be tokenized