nltk.tokenize.LegalitySyllableTokenizer
- class nltk.tokenize.LegalitySyllableTokenizer[source]
Bases: TokenizerI
Syllabifies words based on the Legality Principle and Onset Maximization.
>>> from nltk.tokenize import LegalitySyllableTokenizer
>>> from nltk import word_tokenize
>>> from nltk.corpus import words
>>> text = "This is a wonderful sentence."
>>> text_words = word_tokenize(text)
>>> LP = LegalitySyllableTokenizer(words.words())
>>> [LP.tokenize(word) for word in text_words]
[['This'], ['is'], ['a'], ['won', 'der', 'ful'], ['sen', 'ten', 'ce'], ['.']]
- __init__(tokenized_source_text, vowels='aeiouy', legal_frequency_threshold=0.001)[source]
- Parameters
tokenized_source_text (list(str)) – List of valid tokens in the language
vowels (str) – Valid vowels in the language, or their IPA representation
legal_frequency_threshold (float) – Minimum relative frequency an onset must have among all onsets to be considered legal
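For example (the stricter settings in the second instance below are purely illustrative, not recommended values):
>>> from nltk.tokenize import LegalitySyllableTokenizer
>>> from nltk.corpus import words
>>> LP = LegalitySyllableTokenizer(words.words())
>>> LP.tokenize('wonderful')
['won', 'der', 'ful']
>>> strict = LegalitySyllableTokenizer(words.words(), vowels='aeiou',
...     legal_frequency_threshold=0.01)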
- find_legal_onsets(words)[source]
Gathers all onsets and returns only those above the frequency threshold.
- Parameters
words (list(str)) – List of words in a language
- Returns
Set of legal onsets
- Return type
set(str)
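A small illustration on a toy word list (each onset here occurs often enough in the list to clear the default threshold):
>>> LP = LegalitySyllableTokenizer(words.words())
>>> sorted(LP.find_legal_onsets(['street', 'strong', 'cat', 'dog']))
['c', 'd', 'str']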
- onset(word)[source]
Returns the consonant cluster of a word, i.e. all characters until the first vowel.
- Parameters
word (str) – Single word or token
- Returns
String of characters of onset
- Return type
str
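For example, with the default vowels 'aeiouy' (a word beginning with a vowel, including 'y', has an empty onset):
>>> LP.onset('string')
'str'
>>> LP.onset('yellow')
''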
- tokenize(token)[source]
Apply the Legality Principle in combination with Onset Maximization to return a list of syllables.
- Parameters
token (str) – Single word or token
- Returns
Single word or token broken up into syllables
- Return type
list(str)
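To make the mechanism concrete, here is a minimal sketch of legality-based syllabification with Onset Maximization. It is a simplification for illustration only, not NLTK's actual implementation: it scans the word right to left, extends each syllable's onset while the consonant cluster stays legal, and otherwise closes the syllable (vowel sequences are handled naively).

def syllabify(word, legal_onsets, vowels="aeiouy"):
    # Simplified sketch, not NLTK's code; `legal_onsets` is a
    # precomputed set such as find_legal_onsets() would return.
    syllables = []
    syllable = ""      # current syllable, built right to left
    onset = ""         # consonant cluster collected as the onset so far
    seen_vowel = False
    for ch in reversed(word.lower()):
        if not seen_vowel:
            # Still collecting the coda/nucleus of the rightmost syllable.
            syllable = ch + syllable
            seen_vowel = ch in vowels
        elif ch not in vowels and ch + onset in legal_onsets:
            # Onset Maximization: extend the onset while it remains legal.
            syllable = ch + syllable
            onset = ch + onset
        else:
            # The cluster would be illegal (or a new nucleus begins):
            # close the current syllable and start a new one.
            syllables.append(syllable)
            syllable, onset = ch, ""
            seen_vowel = ch in vowels
    syllables.append(syllable)
    return syllables[::-1]

>>> syllabify("wonderful", {"w", "d", "f"})
['won', 'der', 'ful']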
- span_tokenize(s: str) → Iterator[Tuple[int, int]][source]
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
- Parameters
s (str) –
- Return type
Iterator[Tuple[int, int]]