nltk.lm.StupidBackoff

class nltk.lm.StupidBackoff[source]

Bases: LanguageModel

Provides StupidBackoff scores.

In addition to the initialization arguments of LanguageModel, this class requires a parameter alpha, which scales the scores of lower-order ngrams whenever the model backs off. Note that the result is not a true probability distribution, because the scores for ngrams of the same order do not sum to one.
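The backoff rule described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper names count_ngrams and stupid_backoff_score are hypothetical, the sentences are unpadded, and the context count is assumed to be the raw frequency of the context ngram.

```python
from collections import Counter


def count_ngrams(sentences, order):
    """Count every ngram of length 1..order in the tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        for n in range(1, order + 1):
            for i in range(len(sentence) - n + 1):
                counts[tuple(sentence[i:i + n])] += 1
    return counts


def stupid_backoff_score(word, context, counts, alpha=0.4):
    """Hypothetical helper: relative frequency if the ngram was seen,
    otherwise alpha times the score under a shortened context."""
    context = tuple(context or ())
    if not context:
        # Base case: unigram relative frequency.
        total = sum(c for ngram, c in counts.items() if len(ngram) == 1)
        return counts[(word,)] / total
    seen = counts[context + (word,)]
    if seen > 0:
        return seen / counts[context]
    # Unseen ngram: drop the leftmost context word and scale by alpha.
    return alpha * stupid_backoff_score(word, context[1:], counts, alpha)


counts = count_ngrams([["a", "b", "c"], ["a", "b", "d"]], order=2)
seen_score = stupid_backoff_score("b", ("a",), counts)     # bigram ("a", "b") was observed
backoff_score = stupid_backoff_score("c", ("a",), counts)  # unseen bigram, backs off to the unigram
scores_after_a = sum(stupid_backoff_score(w, ("a",), counts) for w in "abcd")
```

In this toy corpus the scores of all words following "a" add up to more than one, which is why the values are called scores rather than probabilities.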

__init__(alpha=0.4, *args, **kwargs)[source]

Creates new LanguageModel.

Parameters
  • vocabulary (nltk.lm.Vocabulary or None) – If provided, this vocabulary will be used instead of creating a new one when training.

  • counter (nltk.lm.NgramCounter or None) – If provided, use this object to count ngrams.

  • ngrams_fn (function or None) – If given, defines how sentences in training text are turned to ngram sequences.

  • pad_fn (function or None) – If given, defines how sentences in training text are padded.

unmasked_score(word, context=None)[source]

Score a word given some optional context.

Concrete models are expected to provide an implementation. Note that this method does not mask its arguments with the OOV label. Use the score method for that.

Parameters
  • word (str) – Word for which we want the score

  • context (tuple(str) or None) – Context the word is in. If None, compute unigram score.

Return type

float

context_counts(context)[source]

Helper method for retrieving counts for a given context.

Assumes context has been checked and OOV words in it masked.

Parameters

context (tuple(str) or None) – Context whose counts are retrieved.

entropy(text_ngrams)[source]

Calculate cross-entropy of model for given evaluation text.

Parameters

text_ngrams (Iterable(tuple(str))) – A sequence of ngram tuples.

Return type

float

fit(text, vocabulary_text=None)[source]

Trains the model on a text.

Parameters

text – Training text as a sequence of sentences.

generate(num_words=1, text_seed=None, random_seed=None)[source]

Generate words from the model.

Parameters
  • num_words (int) – How many words to generate. By default 1.

  • text_seed – Generation can be conditioned on preceding context.

  • random_seed – A random seed or an instance of random.Random. If provided, makes the random sampling part of generation reproducible.

Returns

One (str) word or a list of words generated from model.

Examples:

>>> from nltk.lm import MLE
>>> lm = MLE(2)
>>> lm.fit([[("a", "b"), ("b", "c")]], vocabulary_text=['a', 'b', 'c'])
>>> lm.fit([[("a",), ("b",), ("c",)]])
>>> lm.generate(random_seed=3)
'a'
>>> lm.generate(text_seed=['a'])
'b'

logscore(word, context=None)[source]

Evaluate the log score of this word in this context.

The arguments are the same as for score and unmasked_score.

perplexity(text_ngrams)[source]

Calculates the perplexity of the given text.

This is simply 2 ** cross-entropy for the text, so the arguments are the same.
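The relationship between the two measures can be sketched in plain Python. The sketch assumes that cross-entropy is the negative mean of the base-2 log scores of the evaluated ngrams; the function names and the list of scores are illustrative only and are not part of the library.

```python
import math


def cross_entropy(scores):
    """Negative mean base-2 log of the per-ngram scores (assumed definition)."""
    return -sum(math.log2(s) for s in scores) / len(scores)


def perplexity(scores):
    """Perplexity as 2 raised to the cross-entropy."""
    return 2 ** cross_entropy(scores)


# Made-up scores, one per ngram of a hypothetical evaluation text.
example_scores = [0.5, 0.25, 0.25, 0.5]
example_entropy = cross_entropy(example_scores)
example_perplexity = perplexity(example_scores)
```

A score of zero for any ngram has no finite logarithm, so an evaluation text containing ngrams the model scores as zero cannot be handled by this sketch.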

score(word, context=None)[source]

Masks out-of-vocabulary (OOV) words in the arguments and then computes the model score.

For model-specific logic of calculating scores, see the unmasked_score method.
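The division of labour between score and unmasked_score can be sketched as follows. This is an illustration under assumptions, not the library's code: mask_oov and masked_score are hypothetical helpers, the vocabulary is a plain set, and "<UNK>" is assumed as the OOV label.

```python
def mask_oov(words, vocabulary, unk_label="<UNK>"):
    """Replace every word that is not in the vocabulary with the OOV label."""
    return tuple(w if w in vocabulary else unk_label for w in words)


def masked_score(word, context, vocabulary, unmasked_score):
    """Mask the word and its context, then delegate to the unmasked scorer."""
    masked_word = mask_oov([word], vocabulary)[0]
    masked_context = mask_oov(context, vocabulary) if context else None
    return unmasked_score(masked_word, masked_context)


# A stand-in scorer that only records the arguments it receives.
calls = []


def toy_unmasked_score(word, context=None):
    calls.append((word, context))
    return 0.0


masked_score("z", ("a", "q"), {"a", "b"}, toy_unmasked_score)
```

After the call, the stand-in scorer has received the OOV label in place of "z" and "q", while the in-vocabulary word "a" is passed through unchanged.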