nltk.tokenize.stanford_segmenter.StanfordSegmenter

class nltk.tokenize.stanford_segmenter.StanfordSegmenter[source]

Bases: TokenizerI

Interface to the Stanford Segmenter

If the stanford-segmenter version is older than 2016-10-31, then path_to_slf4j should be provided, for example:

seg = StanfordSegmenter(path_to_slf4j='/YOUR_PATH/slf4j-api.jar')

>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> seg = StanfordSegmenter()
>>> seg.default_config('zh')
>>> sent = u'这是斯坦福中文分词器测试'
>>> print(seg.segment(sent))
这 是 斯坦福 中文 分词器 测试

>>> seg.default_config('ar')
>>> sent = u'هذا هو تصنيف ستانفورد العربي للكلمات'
>>> print(seg.segment(sent.split()))
هذا هو تصنيف ستانفورد العربي ل الكلمات
__init__(path_to_jar=None, path_to_slf4j=None, java_class=None, path_to_model=None, path_to_dict=None, path_to_sihan_corpora_dict=None, sihan_post_processing='false', keep_whitespaces='false', encoding='UTF-8', options=None, verbose=False, java_options='-mx2g')[source]
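
A minimal construction sketch follows; every path below, and the Java class name, are illustrative placeholders that depend on where the stanford-segmenter distribution was unpacked (default_config() below fills these in automatically from environment variables):

from nltk.tokenize.stanford_segmenter import StanfordSegmenter

# All paths are hypothetical; point them at your own distribution.
seg = StanfordSegmenter(
    path_to_jar='/opt/stanford-segmenter/stanford-segmenter.jar',
    path_to_model='/opt/stanford-segmenter/data/pku.gz',
    path_to_dict='/opt/stanford-segmenter/data/dict-chris6.ser.gz',
    path_to_sihan_corpora_dict='/opt/stanford-segmenter/data',
    java_class='edu.stanford.nlp.ie.crf.CRFClassifier',  # assumed Chinese CRF class
)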
default_config(lang)[source]

Attempt to initialize Stanford Word Segmenter for the specified language using the STANFORD_SEGMENTER and STANFORD_MODELS environment variables
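
For example, a sketch of driving default_config() through the environment; the install location is an assumption:

import os
from nltk.tokenize.stanford_segmenter import StanfordSegmenter

# Hypothetical unpack location of the stanford-segmenter distribution.
os.environ['STANFORD_SEGMENTER'] = '/opt/stanford-segmenter'
os.environ['STANFORD_MODELS'] = '/opt/stanford-segmenter/data'

seg = StanfordSegmenter()
seg.default_config('zh')  # 'zh' for Chinese, 'ar' for Arabic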

tokenize(s)[source]

Return a tokenized copy of s.

Return type

List[str]
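
The class's working entry point is segment(), which returns one space-delimited string, so a minimal sketch of obtaining the List[str] this contract describes (assuming seg is configured as above) is:

# segment() yields a space-delimited string; splitting it gives tokens.
segmented = seg.segment(u'这是斯坦福中文分词器测试')
print(segmented.split())
# ['这', '是', '斯坦福', '中文', '分词器', '测试']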

segment_file(input_file_path)[source]
segment(tokens)[source]
segment_sents(sentences)[source]
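
A brief sketch of the three segmentation helpers, assuming seg is configured as above; the second sentence and the file path are made-up illustrations:

# One sentence at a time.
print(seg.segment(u'这是斯坦福中文分词器测试'))

# Several sentences in one pass of the Java process.
print(seg.segment_sents([u'这是斯坦福中文分词器测试', u'这是测试']))

# A hypothetical UTF-8 text file to segment from disk.
print(seg.segment_file('/tmp/chinese_input.txt'))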
span_tokenize(s: str) → Iterator[Tuple[int, int]][source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Parameters

s (str)

Return type

Iterator[Tuple[int, int]]
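
This method is inherited from TokenizerI, and not every subclass implements it. The sketch below demonstrates the offset contract with NLTK's WhitespaceTokenizer, which does:

from nltk.tokenize import WhitespaceTokenizer

s = 'Good muffins cost $3.88'
for start, end in WhitespaceTokenizer().span_tokenize(s):
    # Each span satisfies: s[start:end] is the token itself.
    print((start, end), s[start:end])
# (0, 4) Good
# (5, 12) muffins
# (13, 17) cost
# (18, 23) $3.88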

span_tokenize_sents(strings: List[str]) → Iterator[List[Tuple[int, int]]][source]

Apply self.span_tokenize() to each element of strings. I.e.:

return [self.span_tokenize(s) for s in strings]

Parameters

strings (List[str])

Yields

List[Tuple[int, int]]

Return type

Iterator[List[Tuple[int, int]]]
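
A sketch of the per-string grouping, again using WhitespaceTokenizer since it implements span_tokenize():

from nltk.tokenize import WhitespaceTokenizer

strings = ['a b', 'cc d']
for spans in WhitespaceTokenizer().span_tokenize_sents(strings):
    print(spans)  # one list of (start, end) tuples per input string
# [(0, 1), (2, 3)]
# [(0, 2), (3, 4)]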

tokenize_sents(strings: List[str]) → List[List[str]][source]

Apply self.tokenize() to each element of strings. I.e.:

return [self.tokenize(s) for s in strings]

Parameters

strings (List[str])

Return type

List[List[str]]
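
A minimal sketch of the batch behaviour; any TokenizerI works, and WhitespaceTokenizer is used here so the example runs without the Stanford jars:

from nltk.tokenize import WhitespaceTokenizer

strings = ['Good muffins cost $3.88', 'in New York.']
print(WhitespaceTokenizer().tokenize_sents(strings))
# [['Good', 'muffins', 'cost', '$3.88'], ['in', 'New', 'York.']]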