nltk.tokenize.SyllableTokenizer
- class nltk.tokenize.SyllableTokenizer
Bases: TokenizerI
Syllabifies words based on the Sonority Sequencing Principle (SSP).
>>> from nltk.tokenize import SyllableTokenizer
>>> from nltk import word_tokenize
>>> SSP = SyllableTokenizer()
>>> SSP.tokenize('justification')
['jus', 'ti', 'fi', 'ca', 'tion']
>>> text = "This is a foobar-like sentence."
>>> [SSP.tokenize(token) for token in word_tokenize(text)]
[['This'], ['is'], ['a'], ['foo', 'bar', '-', 'li', 'ke'], ['sen', 'ten', 'ce'], ['.']]
- __init__(lang='en', sonority_hierarchy=False)
- Parameters
lang (str) – Language code; defaults to English, ‘en’.
sonority_hierarchy (list(str)) – Sonority hierarchy according to the Sonority Sequencing Principle.
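The `sonority_hierarchy` argument is documented only as a list of strings. The following is a minimal sketch of how such a list could be turned into numeric sonority levels. The example hierarchy, its ordering (most sonorous class first), and the helper name `build_sonority_map` are assumptions made for illustration; none of them are taken from the documentation above.

```python
def build_sonority_map(hierarchy):
    """Map each character to a sonority level (higher = more sonorous).

    Hypothetical helper: assumes `hierarchy` lists character classes
    from most sonorous to least sonorous.
    """
    levels = {}
    for index, char_class in enumerate(hierarchy):
        level = len(hierarchy) - index
        for char in char_class:
            # Register both cases so capitalised words can be looked up too.
            levels[char] = level
            levels[char.upper()] = level
    return levels


# Illustrative English-like hierarchy (an assumption, not the documented default).
example_hierarchy = ["aeiouy", "lmnrw", "zvsf", "bcdgtkpqxhj"]
sonority_map = build_sonority_map(example_hierarchy)
print(sonority_map["a"], sonority_map["m"], sonority_map["s"], sonority_map["t"])  # 4 3 2 1
```

A list shaped like `example_hierarchy` is the kind of value one would expect to pass as `sonority_hierarchy` for another language, though the exact conventions should be checked against the source.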
- assign_values(token)
Assigns each phoneme its value from the sonority hierarchy. Note: the input must be a single token; tokenize a sentence or longer text into words first.
- Parameters
token (str) – Single word or token
- Returns
List of tuples; the first element is the character/phoneme and the second is its sonority value.
- Return type
list(tuple(str, int))
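To make the `list(tuple(str, int))` return shape concrete, here is a small standalone sketch that does not call NLTK. The hierarchy, the helper name `assign_values_sketch`, and the choice of `-1` for digits and punctuation are assumptions for illustration only.

```python
from string import punctuation

# Assumed hierarchy, most sonorous class first (illustrative only).
HIERARCHY = ["aeiouy", "lmnrw", "zvsf", "bcdgtkpqxhj"]
SONORITY = {c: len(HIERARCHY) - i for i, cls in enumerate(HIERARCHY) for c in cls}


def assign_values_sketch(token, sonority=SONORITY):
    """Pair every character of `token` with a sonority level.

    Assumption: digits and punctuation receive -1 so that later steps
    can recognise them as non-phonemes. Letters missing from the map
    raise KeyError in this simplified version.
    """
    pairs = []
    for char in token:
        if char in punctuation or char.isdigit():
            pairs.append((char, -1))
        else:
            pairs.append((char, sonority[char.lower()]))
    return pairs


print(assign_values_sketch("sun"))  # [('s', 2), ('u', 4), ('n', 3)]
```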
- validate_syllables(syllable_list)
Ensures each syllable has at least one vowel. If the following syllable doesn’t have a vowel, it is added to the current one.
- Parameters
syllable_list (list(str)) – Single word or token broken up into syllables.
- Returns
Single word or token broken up into syllables (with vowelless syllables merged into their neighbours where necessary)
- Return type
list(str)
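The merging behaviour described for `validate_syllables` can be sketched in plain Python as follows. This is an approximation under stated assumptions: the vowel set is hard-coded, punctuation gets no special treatment, and the helper name `merge_vowelless` is hypothetical.

```python
def merge_vowelless(syllables, vowels="aeiouy"):
    """Merge chunks that contain no vowel into a neighbouring syllable.

    Vowelless chunks at the start are prepended to the first syllable
    that has a vowel; later ones are appended to the preceding syllable.
    """
    merged = []
    pending = ""  # vowelless chunks seen before the first valid syllable
    for syllable in syllables:
        has_vowel = any(ch.lower() in vowels for ch in syllable)
        if has_vowel:
            merged.append(pending + syllable)
            pending = ""
        elif merged:
            merged[-1] += syllable
        else:
            pending += syllable
    if pending:
        # No syllable contained a vowel at all: keep the characters as one chunk.
        merged.append(pending)
    return merged


print(merge_vowelless(["st", "ra", "nd", "ing", "s"]))  # ['strand', 'ings']
```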
- tokenize(token)
Applies the SSP to return a list of syllables. Note: the input must be a single token; tokenize a sentence or longer text into words first.
- Parameters
token (str) – Single word or token
- Returns
Single word or token broken up into syllables.
- Return type
list(str)
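As a rough illustration of the Sonority Sequencing Principle itself, the sketch below splits a lowercase word wherever sonority reaches a trough or a plateau between neighbouring characters. It is a simplified stand-in, not the library's code: the hierarchy and the function name `ssp_syllabify` are assumptions, and punctuation handling and the vowel validation step are left out.

```python
# Assumed hierarchy, most sonorous class first (illustrative only).
HIERARCHY = ["aeiouy", "lmnrw", "zvsf", "bcdgtkpqxhj"]
LEVEL = {c: len(HIERARCHY) - i for i, cls in enumerate(HIERARCHY) for c in cls}
VOWELS = HIERARCHY[0]


def ssp_syllabify(word):
    """Simplified SSP split for a lowercase alphabetic word."""
    if sum(ch in VOWELS for ch in word) <= 1:
        return [word]  # at most one vowel: nothing to split
    values = [LEVEL[ch] for ch in word]
    syllables, current = [], word[0]
    # Slide a window of (previous, focal, next) over the inner characters.
    for i in range(1, len(word) - 1):
        prev_v, focal_v, next_v = values[i - 1], values[i], values[i + 1]
        if prev_v >= focal_v == next_v:
            # Sonority plateau: close the syllable after the focal character.
            syllables.append(current + word[i])
            current = ""
        elif prev_v > focal_v < next_v:
            # Sonority trough: the focal character opens a new syllable.
            syllables.append(current)
            current = word[i]
        else:
            current += word[i]
    syllables.append(current + word[-1])
    return syllables


print(ssp_syllabify("justification"))  # ['jus', 'ti', 'fi', 'ca', 'tion']
print(ssp_syllabify("sentence"))       # ['sen', 'ten', 'ce']
```

For these two words the sketch gives the same splits as the class example at the top of this page. It may not do so for other inputs.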
- span_tokenize(s: str) → Iterator[Tuple[int, int]]
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
- Return type
Iterator[Tuple[int, int]]
- Parameters
s (str) –
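The offset convention `(start_i, end_i)` with `s[start_i:end_i]` equal to the token can be illustrated without calling the library. The helper `spans_from_tokens` below is hypothetical; it simply locates each already-computed token in the original string, and it says nothing about how (or whether) this class implements `span_tokenize` itself.

```python
def spans_from_tokens(s, tokens):
    """Yield (start_i, end_i) pairs such that s[start_i:end_i] == token.

    Assumes the tokens occur in `s` in order and do not overlap.
    """
    position = 0
    for token in tokens:
        start = s.index(token, position)  # raises ValueError if not found
        end = start + len(token)
        yield start, end
        position = end


word = "justification"
syllables = ["jus", "ti", "fi", "ca", "tion"]
spans = list(spans_from_tokens(word, syllables))
print(spans)  # [(0, 3), (3, 5), (5, 7), (7, 9), (9, 13)]
```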