nltk.tokenize.SExprTokenizer¶
- class nltk.tokenize.SExprTokenizer[source]¶
Bases: TokenizerI
A tokenizer that divides strings into s-expressions. An s-expression can be either:
- a parenthesized expression, including any nested parenthesized expressions, or
- a sequence of non-whitespace non-parenthesis characters.
For example, the string (a (b c)) d e (f) consists of four s-expressions: (a (b c)), d, e, and (f). By default, the characters ( and ) are treated as the open and close parentheses, but alternative strings may be specified.
- Parameters
parens (str or list) – A two-element sequence specifying the open and close parentheses that should be used to find sexprs. This will typically be either a two-character string, or a list of two strings.
strict (bool) – If True, then raise an exception when tokenizing an ill-formed sexpr.
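The effect of both constructor parameters can be illustrated with a minimal depth-counting sketch. This is an illustrative re-implementation under the semantics described on this page, not NLTK's actual code; the function name sexpr_tokenize is hypothetical:

```python
def sexpr_tokenize(text, parens='()', strict=True):
    """Split text into s-expressions by tracking parenthesis depth."""
    open_p, close_p = parens[0], parens[1]
    result = []
    pos = 0    # start of the material not yet emitted
    depth = 0  # current parenthesis nesting depth
    for i, ch in enumerate(text):
        if ch == open_p:
            if depth == 0:
                # Flush any bare (non-parenthesized) tokens before this sexpr.
                result.extend(text[pos:i].split())
                pos = i
            depth += 1
        elif ch == close_p:
            if depth == 0:
                # Unmatched close paren: error in strict mode, otherwise
                # emit it as its own s-expression.
                if strict:
                    raise ValueError('Un-matched close paren at char %d' % i)
                result.extend(text[pos:i].split())
                result.append(ch)
                pos = i + 1
            else:
                depth -= 1
                if depth == 0:
                    # A complete parenthesized expression.
                    result.append(text[pos:i + 1])
                    pos = i + 1
    if depth > 0:
        # Unmatched open paren: error in strict mode, otherwise the
        # trailing partial sexpr becomes its own token.
        if strict:
            raise ValueError('Un-matched open paren at char %d' % pos)
        result.append(text[pos:])
    else:
        result.extend(text[pos:].split())
    return result
```

With the defaults this reproduces the examples shown below, e.g. sexpr_tokenize('(a b (c d)) e f (g)') yields ['(a b (c d))', 'e', 'f', '(g)'].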
- tokenize(text)[source]¶
Return a list of s-expressions extracted from text. For example:
>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']
All parentheses are assumed to mark s-expressions. (No special processing is done to exclude parentheses that occur inside strings, or following backslash characters.)
If the given expression contains non-matching parentheses, then the behavior of the tokenizer depends on the strict parameter to the constructor. If strict is True, then raise a ValueError. If strict is False, then any unmatched close parentheses will be listed as their own s-expression, and the last partial s-expression with unmatched open parentheses will be listed as its own s-expression:
>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
- Parameters
text (str or iter(str)) – the string to be tokenized
- Return type
iter(str)
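The parens parameter described above allows delimiters other than the default parentheses. A short usage sketch, assuming NLTK is installed (parens is the first constructor argument):

```python
from nltk.tokenize import SExprTokenizer

# Treat braces, rather than parentheses, as the sexpr delimiters.
tokenizer = SExprTokenizer(parens='{}')
tokens = tokenizer.tokenize('{a b {c}} {d}')
print(tokens)  # ['{a b {c}}', '{d}']
```

Note that with parens='{}' the characters ( and ) are ordinary non-whitespace characters and receive no special treatment.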
- span_tokenize(s: str) Iterator[Tuple[int, int]][source]¶
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
- Return type
Iterator[Tuple[int, int]]
- Parameters
s (str) –
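The contract above is that each span (start_i, end_i) satisfies s[start_i:end_i] == token. As an illustration of that invariant, the following sketch recovers spans from a token list; the helper name spans_from_tokens is hypothetical, not part of NLTK:

```python
def spans_from_tokens(s, tokens):
    """Locate each token in s, in order, and return its (start, end) offsets."""
    spans = []
    pos = 0
    for tok in tokens:
        start = s.index(tok, pos)  # first occurrence at or after pos
        end = start + len(tok)
        spans.append((start, end))
        pos = end
    return spans

s = '(a b (c d)) e f (g)'
tokens = ['(a b (c d))', 'e', 'f', '(g)']
spans = spans_from_tokens(s, tokens)
# Every span slices out exactly its token.
assert all(s[i:j] == t for (i, j), t in zip(spans, tokens))
```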