nltk.tokenize.TweetTokenizer
- class nltk.tokenize.TweetTokenizer
Bases: TokenizerI
Tokenizer for tweets.
>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
>>> tknzr.tokenize(s0)
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']
Examples using strip_handles and reduce_len parameters:
>>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
>>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
>>> tknzr.tokenize(s1)
[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
- __init__(preserve_case=True, reduce_len=False, strip_handles=False, match_phone_numbers=True)
Create a TweetTokenizer instance with settings for use in the tokenize method.
- Parameters
preserve_case (bool) – Flag indicating whether to preserve the casing (capitalisation) of text used in the tokenize method. Defaults to True.
reduce_len (bool) – Flag indicating whether to replace repeated character sequences of length 3 or greater with sequences of length 3. Defaults to False.
strip_handles (bool) – Flag indicating whether to remove Twitter handles from the text passed to the tokenize method. Defaults to False.
match_phone_numbers (bool) – Flag indicating whether the tokenize method should look for phone numbers. Defaults to True.
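The effect of reduce_len and strip_handles can be approximated with the standard library alone. The following is a minimal sketch, not NLTK's implementation: the helper names reduce_len_sketch and strip_handles_sketch are hypothetical, and the handle pattern is a simplified assumption (an "@" followed by 1 to 15 word characters) that will not cover every case the library handles.

```python
import re

# Any single character repeated three or more times in a row.
_REPEAT_RE = re.compile(r"(.)\1{2,}")

# Simplified handle pattern (assumption, not NLTK's own regex):
# "@" plus 1-15 word characters, not preceded by a word character or "@",
# so that e-mail addresses such as user@example.com are left alone.
_HANDLE_RE = re.compile(r"(?<![\w@])@\w{1,15}\b")


def reduce_len_sketch(text: str) -> str:
    """Shorten every run of 3+ identical characters to exactly 3."""
    return _REPEAT_RE.sub(r"\1\1\1", text)


def strip_handles_sketch(text: str) -> str:
    """Replace each handle-like substring with a single space."""
    return _HANDLE_RE.sub(" ", text)


s1 = "@remy: This is waaaaayyyy too much for you!!!!!!"
cleaned = reduce_len_sketch(strip_handles_sketch(s1))
print(cleaned)
```

Both steps operate on the raw string before any splitting into tokens, which is why the shortened run "!!!" can still end up as three separate "!" tokens in the doctest above.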
- tokenize(text: str) → List[str]
Tokenize the input text.
- Parameters
text (str) – The text to tokenize.
- Return type
list(str)
- Returns
a list of token strings. Joining the tokens does not in general reproduce the original string: whitespace between tokens is discarded, and casing changes when preserve_case=False.
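How preserve_case=False affects the returned tokens can be illustrated without NLTK. This sketch assumes that lowercasing is applied per token and that emoticons such as :-P keep their case; that assumption, the helper name lowercase_tokens_sketch, and the simplified emoticon pattern are illustrative rather than taken from the library.

```python
import re
from typing import List

# Simplified emoticon pattern (assumption): eyes, optional nose, mouth.
_EMOTICON_RE = re.compile(r"[:;=8][\-o*']?[)\](\[dDpP/:}{@|\\]")


def lowercase_tokens_sketch(tokens: List[str]) -> List[str]:
    """Lowercase every token except those that look like emoticons."""
    return [t if _EMOTICON_RE.fullmatch(t) else t.lower() for t in tokens]


tokens = ["This", "is", "COOL", ":-P", ":D", "<3"]
lowered = lowercase_tokens_sketch(tokens)
print(lowered)
```

Exempting emoticons matters because lowercasing ":-P" or ":D" would turn them into different symbols.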
- property WORD_RE: Pattern
Core TweetTokenizer regex
- property PHONE_WORD_RE: Pattern
Secondary core TweetTokenizer regex
- span_tokenize(s: str) → Iterator[Tuple[int, int]]
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
- Return type
Iterator[Tuple[int, int]]
- Parameters
s (str) –
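The offset contract, s[start_i:end_i] == token, can be demonstrated with a small standard-library sketch that recovers spans from a token list. The helper spans_from_tokens is hypothetical and does not call NLTK; it assumes every token appears verbatim and in order in the text, which will not hold for tokens altered by reduce_len or preserve_case=False.

```python
from typing import Iterator, List, Tuple


def spans_from_tokens(text: str, tokens: List[str]) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) offsets such that text[start:end] == token."""
    pos = 0
    for tok in tokens:
        # Search only to the right of the previous token.
        start = text.find(tok, pos)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after offset {pos}")
        end = start + len(tok)
        yield start, end
        pos = end


text = "arrows < > -> <--"
tokens = ["arrows", "<", ">", "->", "<--"]
spans = list(spans_from_tokens(text, tokens))

# Each span slices back to its token.
for (start, end), tok in zip(spans, tokens):
    assert text[start:end] == tok
print(spans)
```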