nltk.tokenize.TextTilingTokenizer

class nltk.tokenize.TextTilingTokenizer[source]

Bases: TokenizerI

Tokenize a document into topical sections using the TextTiling algorithm. This algorithm detects subtopic shifts based on the analysis of lexical co-occurrence patterns.

The process starts by tokenizing the text into pseudosentences of a fixed size w. Then, depending on the similarity method chosen (block comparison or vocabulary introduction), lexical similarity scores are assigned at pseudosentence gaps. The algorithm proceeds by detecting peak differences between these scores (the depth scores) and marking them as boundaries. The boundaries are normalized to the closest paragraph break, and the segmented text is returned.
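As an illustrative sketch of the end result (assuming NLTK is installed and the brown and stopwords corpora have been downloaded), the tokenizer returns a list of multi-paragraph topical sections:

    # Segment a stretch of the Brown corpus into topical sections.
    # brown.raw() contains the blank-line paragraph breaks that the
    # boundaries are normalized to; the slice length is illustrative.
    from nltk.corpus import brown
    from nltk.tokenize import TextTilingTokenizer

    tt = TextTilingTokenizer()                 # default parameters, see below
    tiles = tt.tokenize(brown.raw()[:10000])   # list of multi-paragraph sections
    print(len(tiles), "sections")
    print(tiles[0][:200])                      # beginning of the first section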

Parameters
  • w (int) – Pseudosentence size (in words)

  • k (int) – Size (in pseudosentences) of the block used in the block comparison method

  • similarity_method (constant) – The method used for determining similarity scores: BLOCK_COMPARISON (default) or VOCABULARY_INTRODUCTION.

  • stopwords (list(str)) – A list of stopwords that are filtered out (defaults to NLTK’s stopwords corpus)

  • smoothing_method (constant) – The method used for smoothing the score plot: DEFAULT_SMOOTHING (default)

  • smoothing_width (int) – The width of the window used by the smoothing method

  • smoothing_rounds (int) – The number of smoothing passes

  • cutoff_policy (constant) – The policy used to determine the number of boundaries: HC (default) or LC

>>> from nltk.corpus import brown
>>> from nltk.tokenize import TextTilingTokenizer
>>> tt = TextTilingTokenizer(demo_mode=True)
>>> text = brown.raw()[:4000]
>>> s, ss, d, b = tt.tokenize(text)
>>> b
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
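
Here s, ss, d and b are the gap scores, smoothed scores, depth scores and boundary indicator vector computed along the way. A minimal plotting sketch (matplotlib is assumed here; it is not required by the tokenizer itself):

    # Visualize the intermediate scores returned in demo mode.
    from matplotlib import pyplot as plt

    plt.plot(s, label="gap scores")
    plt.plot(ss, label="smoothed scores")
    plt.plot(d, label="depth scores")
    # mark the pseudosentence gaps chosen as boundaries
    plt.vlines([i for i, flag in enumerate(b) if flag], 0, max(d),
               colors="r", label="boundaries")
    plt.legend()
    plt.show()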
__init__(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)[source]
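
The similarity, smoothing and cutoff constants are defined at module level in nltk.tokenize.texttiling. A sketch of a customized construction; the parameter values below are purely illustrative, not recommendations:

    # Construct a tokenizer with non-default settings; the constants are
    # importable from nltk.tokenize.texttiling.
    from nltk.corpus import stopwords
    from nltk.tokenize.texttiling import (
        TextTilingTokenizer,
        BLOCK_COMPARISON,
        DEFAULT_SMOOTHING,
        LC,
    )

    tt = TextTilingTokenizer(
        w=40,                                  # larger pseudosentences
        k=20,                                  # wider comparison blocks
        similarity_method=BLOCK_COMPARISON,
        stopwords=stopwords.words("english"),  # explicit stopword list
        smoothing_method=DEFAULT_SMOOTHING,
        smoothing_width=2,
        smoothing_rounds=1,
        cutoff_policy=LC,                      # alternative to the default HC
    )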
tokenize(text)[source]

Return a tokenized copy of text, where each “token” represents a separate topic.

span_tokenize(s: str) → Iterator[Tuple[int, int]][source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Parameters
  • s (str) – The string to be tokenized

Return type
  Iterator[Tuple[int, int]]

span_tokenize_sents(strings: List[str]) → Iterator[List[Tuple[int, int]]][source]

Apply self.span_tokenize() to each element of strings. I.e.:

    return [self.span_tokenize(s) for s in strings]

Parameters
  • strings (List[str]) – The strings to be tokenized

Yields
  List[Tuple[int, int]]

Return type
  Iterator[List[Tuple[int, int]]]

tokenize_sents(strings: List[str]) → List[List[str]][source]

Apply self.tokenize() to each element of strings. I.e.:

    return [self.tokenize(s) for s in strings]

Parameters
  • strings (List[str]) – The strings to be tokenized

Return type
  List[List[str]]
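
A sketch of batch segmentation over several documents (the choice of Brown corpus files and slice lengths is illustrative only):

    # Segment several documents in one call; each entry of the result is
    # the list of topical sections for the corresponding input string.
    from nltk.corpus import brown
    from nltk.tokenize import TextTilingTokenizer

    tt = TextTilingTokenizer()
    fileids = brown.fileids()[:3]
    docs = [brown.raw(fid)[:10000] for fid in fileids]
    tiled = tt.tokenize_sents(docs)
    for fid, tiles in zip(fileids, tiled):
        print(fid, len(tiles), "sections")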