nltk.tokenize.LegalitySyllableTokenizer
- class nltk.tokenize.LegalitySyllableTokenizer[source]
Bases: TokenizerI
Syllabifies words based on the Legality Principle and Onset Maximization.
>>> from nltk.tokenize import LegalitySyllableTokenizer
>>> from nltk import word_tokenize
>>> from nltk.corpus import words
>>> text = "This is a wonderful sentence."
>>> text_words = word_tokenize(text)
>>> LP = LegalitySyllableTokenizer(words.words())
>>> [LP.tokenize(word) for word in text_words]
[['This'], ['is'], ['a'], ['won', 'der', 'ful'], ['sen', 'ten', 'ce'], ['.']]
- __init__(tokenized_source_text, vowels='aeiouy', legal_frequency_threshold=0.001)[source]
- Parameters
tokenized_source_text (list(str)) – List of valid tokens in the language
vowels (str) – Valid vowels in the language, or their IPA representation
legal_frequency_threshold (float) – Minimum relative frequency an onset must have among all onsets to be considered legal
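For example (the stricter settings in the second instance below are purely illustrative, not recommended values):
>>> from nltk.tokenize import LegalitySyllableTokenizer
>>> from nltk.corpus import words
>>> LP = LegalitySyllableTokenizer(words.words())
>>> LP.tokenize('wonderful')
['won', 'der', 'ful']
>>> strict = LegalitySyllableTokenizer(words.words(), vowels='aeiou',
...     legal_frequency_threshold=0.01)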
- find_legal_onsets(words)[source]
Gathers all onsets and returns only those above the frequency threshold.
- Parameters
words (list(str)) – List of words in a language
- Returns
Set of legal onsets
- Return type
set(str)
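A small illustration on a toy word list (each onset here occurs often enough in the list to clear the default threshold):
>>> LP = LegalitySyllableTokenizer(words.words())
>>> sorted(LP.find_legal_onsets(['street', 'strong', 'cat', 'dog']))
['c', 'd', 'str']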
- onset(word)[source]
Returns the consonant cluster of a word, i.e. all characters until the first vowel.
- Parameters
word (str) – Single word or token
- Returns
String of characters of onset
- Return type
str
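For example, with the default vowels 'aeiouy' (a word beginning with a vowel, including 'y', has an empty onset):
>>> LP.onset('string')
'str'
>>> LP.onset('yellow')
''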
- tokenize(token)[source]
Apply the Legality Principle in combination with Onset Maximization to return a list of syllables.
- Parameters
token (str) – Single word or token
- Returns
Single word or token broken up into syllables
- Return type
list(str)
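To make the mechanism concrete, here is a minimal sketch of legality-based syllabification with Onset Maximization. It is a simplification for illustration only, not NLTK's actual implementation: it scans the word right to left, extends each syllable's onset while the consonant cluster stays legal, and otherwise closes the syllable (vowel sequences are handled naively).

def syllabify(word, legal_onsets, vowels="aeiouy"):
    # Simplified sketch, not NLTK's code; `legal_onsets` is a
    # precomputed set such as find_legal_onsets() would return.
    syllables = []
    syllable = ""      # current syllable, built right to left
    onset = ""         # consonant cluster collected as the onset so far
    seen_vowel = False
    for ch in reversed(word.lower()):
        if not seen_vowel:
            # Still collecting the coda/nucleus of the rightmost syllable.
            syllable = ch + syllable
            seen_vowel = ch in vowels
        elif ch not in vowels and ch + onset in legal_onsets:
            # Onset Maximization: extend the onset while it remains legal.
            syllable = ch + syllable
            onset = ch + onset
        else:
            # The cluster would be illegal (or a new nucleus begins):
            # close the current syllable and start a new one.
            syllables.append(syllable)
            syllable, onset = ch, ""
            seen_vowel = ch in vowels
    syllables.append(syllable)
    return syllables[::-1]

>>> syllabify("wonderful", {"w", "d", "f"})
['won', 'der', 'ful']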
- span_tokenize(s: str) → Iterator[Tuple[int, int]][source]
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
- Parameters
s (str) –
- Return type
Iterator[Tuple[int, int]]