nltk.tokenize.SyllableTokenizer

class nltk.tokenize.SyllableTokenizer

Bases: TokenizerI

Syllabifies words based on the Sonority Sequencing Principle (SSP).

>>> from nltk.tokenize import SyllableTokenizer
>>> from nltk import word_tokenize
>>> SSP = SyllableTokenizer()
>>> SSP.tokenize('justification')
['jus', 'ti', 'fi', 'ca', 'tion']
>>> text = "This is a foobar-like sentence."
>>> [SSP.tokenize(token) for token in word_tokenize(text)]
[['This'], ['is'], ['a'], ['foo', 'bar', '-', 'li', 'ke'], ['sen', 'ten', 'ce'], ['.']]

__init__(lang='en', sonority_hierarchy=False)

Parameters

lang (str) – Language code. Defaults to 'en' (English).
sonority_hierarchy (list(str)) – Sonority hierarchy according to the Sonority Sequencing Principle. Defaults to False, meaning no custom hierarchy is supplied.
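
As a rough illustration of what a sonority hierarchy can look like, the sketch below turns a list of character groups into numeric levels. The groupings, the most-sonorous-first ordering, and the helper name build_sonority_map are assumptions made for this example, not documented behaviour of the class.

```python
# Illustrative hierarchy, most sonorous group first (an assumption of this sketch).
hierarchy = [
    "aeiouy",       # vowels
    "lmnrw",        # liquids, nasals and glides
    "zvsf",         # fricatives
    "bcdgtkpqxhj",  # stops and the remaining consonants
]

def build_sonority_map(levels):
    """Map each character to a numeric level; a higher number means more sonorous."""
    top = len(levels)
    return {ch: top - i for i, group in enumerate(levels) for ch in group}

sonority = build_sonority_map(hierarchy)
```

With this ordering, vowels receive the highest level and stops the lowest, which is the shape of input that a sonority-based syllabifier needs.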

assign_values(token)

Assigns each phoneme its value from the sonority hierarchy. Note: this method takes a single token, so a sentence or longer text has to be tokenized into words first.

Parameters: token (str) – Single word or token
Returns: List of tuples; the first element of each tuple is the character/phoneme and the second is its sonority value.
Return type: list(tuple(str, int))
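
The return shape described above can be pictured with a small stand-alone sketch. The mapping EXAMPLE_LEVELS, the helper sketch_assign_values, and the choice of -1 for characters missing from the mapping are all hypothetical; the method itself may treat unknown characters differently.

```python
# Hypothetical sonority levels for a handful of characters (higher = more sonorous).
EXAMPLE_LEVELS = {"a": 4, "n": 3, "s": 2, "t": 1}

def sketch_assign_values(token, levels=EXAMPLE_LEVELS, unknown=-1):
    """Pair every character of a single token with its sonority level."""
    # Lower-case each character for the lookup, but keep the original in the output.
    return [(ch, levels.get(ch.lower(), unknown)) for ch in token]

pairs = sketch_assign_values("Stan")
# pairs is [('S', 2), ('t', 1), ('a', 4), ('n', 3)]
```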

validate_syllables(syllable_list)

Ensures each syllable has at least one vowel. If the following syllable has no vowel, it is added to the current one.

Parameters: syllable_list (list(str)) – Single word or token broken up into syllables.
Returns: Single word or token broken up into syllables (with vowelless syllables merged into their neighbours if necessary)
Return type: list(str)
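
A minimal sketch of this kind of vowel check is shown below. The vowel set, the helper names, and the handling of a vowelless fragment at the very start of the word (it is prepended to the next syllable) are assumptions of the sketch rather than a description of the library's code.

```python
VOWELS = set("aeiouy")  # assumed vowel inventory for this sketch

def has_vowel(syllable):
    """Return True if the syllable contains at least one assumed vowel."""
    return any(ch.lower() in VOWELS for ch in syllable)

def merge_vowelless(syllables):
    """Fold syllables without a vowel into a neighbouring syllable."""
    merged = []
    pending = ""  # vowelless fragments seen before the first valid syllable
    for syl in syllables:
        if has_vowel(syl):
            merged.append(pending + syl)
            pending = ""
        elif merged:
            merged[-1] += syl  # attach to the preceding syllable
        else:
            pending += syl
    if pending:  # the whole input contained no vowel at all
        merged.append(pending)
    return merged
```

For instance, merge_vowelless(['ta', 'b', 'le']) gives ['tab', 'le'], since 'b' has no vowel of its own.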

tokenize(token)

Apply the SSP to return a list of syllables. Note: this method takes a single token, so a sentence or longer text has to be tokenized into words first.

Parameters: token (str) – Single word or token
Returns: Single word or token broken up into syllables.
Return type: list(str)
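
To give a feel for the principle, the following sketch applies one simplified SSP rule: start a new syllable at every character whose sonority is lower than that of both neighbours. The hierarchy, the fallback level 0 for unlisted characters, and the function name are assumptions; the library's implementation may apply further rules (for example for punctuation or for runs of equal sonority), so outputs can differ.

```python
# Assumed hierarchy, most sonorous group first.
HIERARCHY = ["aeiouy", "lmnrw", "zvsf", "bcdgtkpqxhj"]
LEVEL = {ch: len(HIERARCHY) - i for i, group in enumerate(HIERARCHY) for ch in group}

def split_at_sonority_minima(word):
    """Split a single word before each strict local minimum in sonority."""
    values = [LEVEL.get(ch.lower(), 0) for ch in word]
    syllables, start = [], 0
    # Only interior characters can be compared with both neighbours.
    for i in range(1, len(word) - 1):
        if values[i - 1] > values[i] < values[i + 1]:
            syllables.append(word[start:i])
            start = i
    syllables.append(word[start:])
    return syllables

print(split_at_sonority_minima("banana"))  # ['ba', 'na', 'na']
```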

span_tokenize(s: str) → Iterator[Tuple[int, int]]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type: Iterator[Tuple[int, int]]
Parameters: s (str) –
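
The offset convention can be illustrated independently of the class. This page does not say whether SyllableTokenizer provides its own implementation of this method, so the sketch below simply derives (start_i, end_i) pairs from an already computed syllable list, assuming the syllables are contiguous and cover the whole word. The helper name spans_from_pieces is hypothetical.

```python
def spans_from_pieces(pieces):
    """Yield (start, end) offsets for contiguous pieces of one string."""
    start = 0
    for piece in pieces:
        end = start + len(piece)
        yield (start, end)
        start = end

word = "justification"
syllables = ["jus", "ti", "fi", "ca", "tion"]
spans = list(spans_from_pieces(syllables))
# Slicing the word with each span recovers the corresponding syllable.
recovered = [word[a:b] for a, b in spans]
```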

span_tokenize_sents(strings: List[str]) → Iterator[List[Tuple[int, int]]]

Apply self.span_tokenize() to each element of strings. I.e.:

return [self.span_tokenize(s) for s in strings]

Yield: List[Tuple[int, int]]
Parameters: strings (List[str]) –
Return type: Iterator[List[Tuple[int, int]]]

tokenize_sents(strings: List[str]) → List[List[str]]

Apply self.tokenize() to each element of strings. I.e.:

return [self.tokenize(s) for s in strings]

Return type: List[List[str]]
Parameters: strings (List[str]) –
