nltk.translate.StackDecoder

class nltk.translate.StackDecoder

Bases: object

Phrase-based stack decoder for machine translation

>>> from math import log
>>> from nltk.translate import PhraseTable, StackDecoder
>>> phrase_table = PhraseTable()
>>> phrase_table.add(('niemand',), ('nobody',), log(0.8))
>>> phrase_table.add(('niemand',), ('no', 'one'), log(0.2))
>>> phrase_table.add(('erwartet',), ('expects',), log(0.8))
>>> phrase_table.add(('erwartet',), ('expecting',), log(0.2))
>>> phrase_table.add(('niemand', 'erwartet'), ('one', 'does', 'not', 'expect'), log(0.1))
>>> phrase_table.add(('die', 'spanische', 'inquisition'), ('the', 'spanish', 'inquisition'), log(0.8))
>>> phrase_table.add(('!',), ('!',), log(0.8))
>>> #  nltk.model should be used here once it is implemented
>>> from collections import defaultdict
>>> language_prob = defaultdict(lambda: -999.0)
>>> language_prob[('nobody',)] = log(0.5)
>>> language_prob[('expects',)] = log(0.4)
>>> language_prob[('the', 'spanish', 'inquisition')] = log(0.2)
>>> language_prob[('!',)] = log(0.1)
>>> language_model = type('',(object,),{'probability_change': lambda self, context, phrase: language_prob[phrase], 'probability': lambda self, phrase: language_prob[phrase]})()
>>> stack_decoder = StackDecoder(phrase_table, language_model)
>>> stack_decoder.translate(['niemand', 'erwartet', 'die', 'spanische', 'inquisition', '!'])
['nobody', 'expects', 'the', 'spanish', 'inquisition', '!']
__init__(phrase_table, language_model)
Parameters
  • phrase_table (PhraseTable) – Table of translations for source language phrases and the log probabilities for those translations.

  • language_model (object) – Target language model. Must define a probability_change method that calculates the change in log probability of a sentence, if a given string is appended to it. This interface is experimental and will likely be replaced with nltk.model once it is implemented.
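The language model interface can be written out as an ordinary class. The following is a minimal sketch: the class name `ToyLanguageModel` and its `unseen` default are illustrative assumptions, and only the two method names come from the doctest above.

```python
from collections import defaultdict
from math import log


class ToyLanguageModel:
    """Hypothetical stand-in for a target language model (not part of NLTK)."""

    def __init__(self, phrase_log_probs, unseen=-999.0):
        # Phrases missing from the table get a very low log probability.
        self._log_probs = defaultdict(lambda: unseen, phrase_log_probs)

    def probability_change(self, context, phrase):
        # This toy model ignores the context and scores the phrase alone.
        return self._log_probs[phrase]

    def probability(self, phrase):
        return self._log_probs[phrase]


language_model = ToyLanguageModel({
    ('nobody',): log(0.5),
    ('expects',): log(0.4),
})
```

An object like this could be passed as `language_model` in place of the one-line `type(...)` construction used in the doctest.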

word_penalty
float: Influences the translation length exponentially.

If positive, shorter translations are preferred. If negative, longer translations are preferred. If zero, no penalty is applied.
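To illustrate the exponential effect, here is a minimal sketch that assumes the penalty is subtracted from the log score once per target word. The helper name `apply_word_penalty` is hypothetical and not part of the class.

```python
from math import exp, log


def apply_word_penalty(log_score, trg_phrase, word_penalty):
    """Subtract the penalty once per target word (assumed formulation)."""
    return log_score - word_penalty * len(trg_phrase)


base = log(0.5)
one_word = apply_word_penalty(base, ('nobody',), word_penalty=0.5)
two_words = apply_word_penalty(base, ('no', 'one'), word_penalty=0.5)

# In probability space, each extra word multiplies the score by exp(-word_penalty).
ratio = exp(two_words - one_word)
```

With a positive penalty the one-word option keeps the higher score; a negative penalty would reverse the preference.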

beam_threshold
float: Hypotheses that score below this factor of the best hypothesis in a stack are dropped from consideration. Value between 0.0 and 1.0.
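Threshold pruning of this kind can be sketched as follows, assuming scores are log probabilities so that the factor becomes an additive offset of `log(beam_threshold)`. The function `threshold_prune` and the `(name, score)` pairs are illustrative, not the class's internal representation.

```python
from math import log


def threshold_prune(scored_hypotheses, beam_threshold):
    """Keep hypotheses whose log score is within log(beam_threshold) of the best."""
    if not scored_hypotheses or beam_threshold <= 0.0:
        # A threshold of 0.0 disables pruning in this sketch.
        return list(scored_hypotheses)
    best = max(score for _, score in scored_hypotheses)
    cutoff = best + log(beam_threshold)
    return [(hyp, score) for hyp, score in scored_hypotheses if score >= cutoff]


stack = [('a', log(0.4)), ('b', log(0.1)), ('c', log(0.01))]
kept = threshold_prune(stack, beam_threshold=0.2)
# The cutoff is log(0.4 * 0.2) = log(0.08), so 'c' is dropped.
```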

stack_size
int: Maximum number of hypotheses to consider in a stack.

Higher values increase the likelihood of a good translation, but also increase processing time.
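Capping a stack at a fixed size is often called histogram pruning. A minimal sketch, assuming hypotheses are held as `(name, score)` pairs; the helper `histogram_prune` is hypothetical:

```python
import heapq


def histogram_prune(scored_hypotheses, stack_size):
    """Keep only the stack_size highest-scoring hypotheses."""
    return heapq.nlargest(stack_size, scored_hypotheses, key=lambda item: item[1])


stack = [('a', -0.9), ('b', -2.3), ('c', -4.6), ('d', -1.2)]
kept = histogram_prune(stack, stack_size=2)
```

A larger `stack_size` keeps more candidates alive at each step, which is where the extra processing time comes from.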

property distortion_factor
float: Amount of reordering of source phrases.

Lower values favour monotone translation, suitable when word order is similar for both source and target languages. Value between 0.0 and 1.0. Default 0.5.
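One common way to turn such a factor into a score is a log penalty proportional to the jump between consecutive source phrases. The sketch below assumes that formulation; the standalone function and its arguments are illustrative and do not mirror the signature of the `distortion_score` method.

```python
from math import log


def distortion_penalty(prev_src_end, next_src_start, distortion_factor):
    """Log penalty that grows with the distance jumped in the source sentence.

    Assumes 0.0 < distortion_factor <= 1.0, since log(0.0) is undefined.
    """
    distance = abs(next_src_start - prev_src_end)
    return distance * log(distortion_factor)


monotone = distortion_penalty(2, 2, 0.5)   # next phrase starts where the last ended
jump = distortion_penalty(2, 5, 0.5)       # three source words are skipped
lenient = distortion_penalty(2, 5, 0.9)    # same jump, higher factor
```

The monotone step is not penalised, and the same jump costs less as the factor approaches 1.0.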

translate(src_sentence)
Parameters

src_sentence (list(str)) – Sentence to be translated

Returns

Translated sentence

Return type

list(str)

find_all_src_phrases(src_sentence)

Finds all subsequences in src_sentence that have a phrase translation in the translation table

Returns

Subsequences that have a phrase translation, represented as a table of lists of end positions. For example, if result[2] is [5, 6, 9], then there are three phrases starting from position 2 in src_sentence, ending at positions 5, 6, and 9 exclusive. Each list of ending positions is in ascending order.

Return type

list(list(int))
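The return format can be reproduced with a small standalone sketch. It assumes the phrase table is reduced to a plain set of source phrase tuples (`known_phrases`), which is a simplification for illustration.

```python
def find_all_src_phrases(src_sentence, known_phrases):
    """For each start position, list the exclusive end positions of known phrases."""
    n = len(src_sentence)
    table = [[] for _ in range(n)]
    for start in range(n):
        # Ends are visited in increasing order, so each list is ascending.
        for end in range(start + 1, n + 1):
            if tuple(src_sentence[start:end]) in known_phrases:
                table[start].append(end)
    return table


known_phrases = {
    ('niemand',),
    ('erwartet',),
    ('niemand', 'erwartet'),
    ('die', 'spanische', 'inquisition'),
    ('!',),
}
sentence = ['niemand', 'erwartet', 'die', 'spanische', 'inquisition', '!']
spans = find_all_src_phrases(sentence, known_phrases)
# spans == [[1, 2], [2], [5], [], [], [6]]
```

Here `spans[0] == [1, 2]` because both `('niemand',)` and `('niemand', 'erwartet')` start at position 0.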

compute_future_scores(src_sentence)

Determines the approximate scores for translating every subsequence in src_sentence

Future scores can be used as a look-ahead to determine the difficulty of translating the remaining parts of a src_sentence.

Returns

Scores of subsequences referenced by their start and end positions. For example, result[2][5] is the score of the subsequence covering positions 2, 3, and 4.

Return type

dict(int: dict(int: float))
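A table of this shape is typically filled by dynamic programming over span lengths. The sketch below assumes each known phrase has a single best log score (`best_phrase_score`, a hypothetical dict) and that a span's score is the better of its direct score and the best split into two shorter spans.

```python
from collections import defaultdict
from math import isclose, log


def compute_future_scores(src_sentence, best_phrase_score):
    """Return scores[start][end] for every span, end exclusive."""
    scores = defaultdict(lambda: defaultdict(lambda: float('-inf')))
    n = len(src_sentence)
    for length in range(1, n + 1):
        for start in range(0, n - length + 1):
            end = start + length
            phrase = tuple(src_sentence[start:end])
            if phrase in best_phrase_score:
                scores[start][end] = best_phrase_score[phrase]
            # Check whether combining two shorter spans gives a better score.
            for mid in range(start + 1, end):
                combined = scores[start][mid] + scores[mid][end]
                if combined > scores[start][end]:
                    scores[start][end] = combined
    return scores


best_phrase_score = {
    ('niemand',): log(0.8),
    ('erwartet',): log(0.8),
    ('niemand', 'erwartet'): log(0.1),
}
table = compute_future_scores(['niemand', 'erwartet'], best_phrase_score)
```

In this example `table[0][2]` ends up as `log(0.64)`, because translating the two words separately scores better than the two-word phrase at `log(0.1)`.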

future_score(hypothesis, future_score_table, sentence_length)

Determines the approximate score for translating the untranslated words in hypothesis
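One way to compute such an estimate is to sum the precomputed table entries over each maximal run of untranslated positions. In this sketch the hypothesis is reduced to a set of covered source positions, which is an assumption made for illustration; the helper `untranslated_spans` is likewise hypothetical here.

```python
def untranslated_spans(covered, sentence_length):
    """Return maximal (start, end) spans, end exclusive, of uncovered positions."""
    spans = []
    start = None
    for pos in range(sentence_length):
        if pos not in covered:
            if start is None:
                start = pos
        elif start is not None:
            spans.append((start, pos))
            start = None
    if start is not None:
        spans.append((start, sentence_length))
    return spans


def future_score(covered, future_score_table, sentence_length):
    """Sum the future scores of all untranslated spans."""
    return sum(
        future_score_table[start][end]
        for start, end in untranslated_spans(covered, sentence_length)
    )


future_score_table = {0: {2: -1.0}, 3: {5: -0.5}}
score = future_score({2}, future_score_table, sentence_length=5)
# Position 2 is covered, leaving spans (0, 2) and (3, 5): -1.0 + -0.5 = -1.5
```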

expansion_score(hypothesis, translation_option, src_phrase_span)

Calculate the score of expanding hypothesis with translation_option

Parameters
  • hypothesis (_Hypothesis) – Hypothesis being expanded

  • translation_option (PhraseTableEntry) – Information about the proposed expansion

  • src_phrase_span (tuple(int, int)) – Word position span of the source phrase
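As a rough sketch of how the pieces might combine, the function below assumes the expanded score is the hypothesis score so far plus the translation log probability, the language model change, and the distortion score, minus a per-word penalty. All parameter names are illustrative and the function takes plain numbers, not the objects listed above.

```python
from math import isclose, log


def expansion_score_sketch(raw_score, translation_log_prob, lm_change,
                           distortion, trg_phrase, word_penalty):
    """Combine log-domain components into one score (assumed formulation)."""
    score = raw_score                 # score of the hypothesis being expanded
    score += translation_log_prob     # phrase table log probability
    score += lm_change                # change in language model log probability
    score += distortion               # reordering penalty, at most 0.0
    score -= word_penalty * len(trg_phrase)
    return score


score = expansion_score_sketch(
    raw_score=log(0.4),
    translation_log_prob=log(0.8),
    lm_change=log(0.4),
    distortion=0.0,
    trg_phrase=('expects',),
    word_penalty=0.0,
)
```

Because every component is a log value, adding them corresponds to multiplying the underlying probabilities.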

distortion_score(hypothesis, next_src_phrase_span)

static valid_phrases(all_phrases_from, hypothesis)

Extract phrases from all_phrases_from that contain words that have not been translated by hypothesis

Parameters

all_phrases_from (list(list(int))) – Phrases represented by their spans, in the same format as the return value of find_all_src_phrases

Returns

A list of phrases, represented by their spans, that cover untranslated positions.

Return type

list(tuple(int, int))
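The filtering step can be sketched without a hypothesis object by passing in the untranslated spans directly. The `open_spans` argument is an assumption made for this example; the method itself takes a hypothesis.

```python
def valid_phrases_sketch(all_phrases_from, open_spans):
    """Return (start, end) phrase spans that fit entirely inside an open span."""
    result = []
    for span_start, span_end in open_spans:
        for start in range(span_start, span_end):
            for phrase_end in all_phrases_from[start]:
                if phrase_end > span_end:
                    # Ends are ascending, so later ones overflow as well.
                    break
                result.append((start, phrase_end))
    return result


all_phrases_from = [[1, 2], [2], [5], [], [], [6]]
# Suppose position 1 is already translated, leaving spans (0, 1) and (2, 6).
phrases = valid_phrases_sketch(all_phrases_from, [(0, 1), (2, 6)])
# phrases == [(0, 1), (2, 5), (5, 6)]
```

The phrase spanning positions 0 to 2 is excluded because it overlaps the already translated position 1.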