nltk.translate.IBMModel1

class nltk.translate.IBMModel1[source]

Bases: IBMModel

Lexical translation model that ignores word order

>>> from nltk.translate import AlignedSent, IBMModel1
>>> bitext = []
>>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']))
>>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
>>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
>>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))
>>> ibm1 = IBMModel1(bitext, 5)
>>> print(ibm1.translation_table['buch']['book'])
0.889...
>>> print(ibm1.translation_table['das']['book'])
0.061...
>>> print(ibm1.translation_table['buch'][None])
0.113...
>>> print(ibm1.translation_table['ja'][None])
0.072...
>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)])
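
The example above trains the model for five iterations of the EM algorithm. As a rough illustration of what one iteration does (a simplified sketch, not the library's internal code; it assumes t_table maps target word -> source word -> probability and has already been initialized, e.g. uniformly), Model 1 re-estimates t(t|s) from expected co-occurrence counts:

from collections import defaultdict

def em_iteration_sketch(bitext, t_table):
    # E-step: distribute each target word's probability mass over the
    # source words (including the NULL word None) in proportion to t(t|s).
    count = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for pair in bitext:
        src_words = [None] + pair.mots    # mots is the source side
        for t in pair.words:              # words is the target side
            denom = sum(t_table[t][s] for s in src_words)
            for s in src_words:
                c = t_table[t][s] / denom
                count[t][s] += c
                total[s] += c
    # M-step: renormalize the expected counts into new probabilities.
    for t in count:
        for s in count[t]:
            t_table[t][s] = count[t][s] / total[s]
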
__init__(sentence_aligned_corpus, iterations, probability_tables=None)[source]

Train on sentence_aligned_corpus and create a lexical translation model.

Translation direction is from AlignedSent.mots to AlignedSent.words.

Parameters
  • sentence_aligned_corpus (list(AlignedSent)) – Sentence-aligned parallel corpus

  • iterations (int) – Number of iterations to run training algorithm

  • probability_tables (dict[str]: object) – Optional. Use this to pass in custom probability values. If not specified, probabilities will be set to a uniform distribution, or some other sensible value. If specified, the following entry must be present: translation_table. See IBMModel for the type and purpose of this table.
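
For instance (a hypothetical sketch continuing the example above), a previously trained table can seed a new model; when probability_tables is given, the uniform initialization is skipped:

>>> import copy
>>> tables = {'translation_table': copy.deepcopy(ibm1.translation_table)}
>>> ibm1_seeded = IBMModel1(bitext, 1, probability_tables=tables)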

set_uniform_probabilities(sentence_aligned_corpus)[source]

Initialize probability tables to a uniform distribution

Derived classes should implement this accordingly.
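
For Model 1 this means every target word receives the same probability under every source word. A minimal sketch of the idea (trg_vocab stands in for the target-side vocabulary collected by init_vocab):

from collections import defaultdict

trg_vocab = {'klein', 'ist', 'das', 'haus'}    # example target vocabulary
initial_prob = 1 / len(trg_vocab)              # identical for every (t, s)
translation_table = defaultdict(lambda: defaultdict(lambda: initial_prob))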

train(parallel_corpus)[source]
prob_all_alignments(src_sentence, trg_sentence)[source]

Computes the probability of all possible word alignments, expressed as a marginal distribution over target words t.

Each entry in the return value is the contribution of the target word t to the total alignment probability: the sum of t(t|s) over every word s in src_sentence.

During training, these marginals serve as normalization factors when collecting expected counts. A sketch of the computation follows the return type below.

Returns

Probability of t for all s in src_sentence

Return type

dict(str): float
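
A minimal sketch of the computation, assuming translation_table is the nested dict described in IBMModel:

from collections import defaultdict

def prob_all_alignments_sketch(translation_table, src_sentence, trg_sentence):
    # For each target word t, add up t(t|s) over every source word s.
    alignment_prob_for_t = defaultdict(float)
    for t in trg_sentence:
        for s in src_sentence:
            alignment_prob_for_t[t] += translation_table[t][s]
    return alignment_prob_for_t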

prob_alignment_point(s, t)[source]

Probability that word t in the target sentence is aligned to word s in the source sentence

prob_t_a_given_s(alignment_info)[source]

Probability of target sentence and an alignment given the source sentence
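
For Model 1 this reduces to a product of lexical translation probabilities along the alignment. A sketch, assuming the AlignmentInfo convention that position 0 holds the NULL word and alignment[j] = i aligns target word j to source word i:

def prob_t_a_given_s_sketch(model, alignment_info):
    prob = 1.0
    # Skip j = 0, which is reserved for the NULL word.
    for j, i in enumerate(alignment_info.alignment[1:], start=1):
        trg_word = alignment_info.trg_sentence[j]
        src_word = alignment_info.src_sentence[i]
        prob *= model.translation_table[trg_word][src_word]
    return max(prob, model.MIN_PROB)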

align_all(parallel_corpus)[source]
align(sentence_pair)[source]

Determines the best word alignment for one sentence pair from the corpus that the model was trained on.

The best alignment will be set in sentence_pair when the method returns. In contrast with the internal implementation of IBM models, the word indices in the Alignment are zero-indexed, not one-indexed. A usage sketch follows the parameter description below.

Parameters

sentence_pair (AlignedSent) – A sentence in the source language and its counterpart sentence in the target language
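
A usage sketch, continuing the training example above (the printed alignment depends on the trained probabilities, so no output is shown):

pair = bitext[5]          # ['ein', 'buch'] aligned with ['a', 'book']
ibm1.align(pair)          # best alignment is stored on the pair
print(pair.alignment)     # zero-indexed (words index, mots index) pairs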

MIN_PROB = 1e-12
best_model2_alignment(sentence_pair, j_pegged=None, i_pegged=0)[source]

Finds the best alignment according to IBM Model 2

Used as a starting point for hill climbing in Models 3 and above, because it is easier to compute than the best alignments in higher models

Parameters
  • sentence_pair (AlignedSent) – Source and target language sentence pair to be word-aligned

  • j_pegged (int) – If specified, the alignment point of j_pegged will be fixed to i_pegged

  • i_pegged (int) – Alignment point to j_pegged

hillclimb(alignment_info, j_pegged=None)[source]

Starting from the alignment in alignment_info, look at neighboring alignments iteratively for the best one

There is no guarantee that the best alignment in the alignment space will be found, because the algorithm might get stuck in a local maximum. A simplified sketch of the loop follows the return type below.

Parameters

j_pegged (int) – If specified, the search will be constrained to alignments where j_pegged remains unchanged

Returns

The best alignment found from hill climbing

Return type

AlignmentInfo
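
A simplified sketch of the loop in terms of the other methods on this class (mirroring the behavior described above, not necessarily the exact internal code):

def hillclimb_sketch(model, alignment_info, j_pegged=None):
    best = alignment_info
    while True:
        # Examine all alignments reachable by one move or one swap.
        neighbors = model.neighboring(best, j_pegged)
        candidate = max(neighbors, key=model.prob_t_a_given_s)
        if model.prob_t_a_given_s(candidate) <= model.prob_t_a_given_s(best):
            return best                    # local maximum reached
        best = candidate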

init_vocab(sentence_aligned_corpus)[source]
maximize_fertility_probabilities(counts)[source]
maximize_lexical_translation_probabilities(counts)[source]
maximize_null_generation_probabilities(counts)[source]
neighboring(alignment_info, j_pegged=None)[source]

Determine the neighbors of alignment_info, obtained by moving or swapping one alignment point; a simplified sketch follows the return type below

Parameters

j_pegged (int) – If specified, neighbors that have a different alignment point from j_pegged will not be considered

Returns

A set of neighboring alignments, represented by their AlignmentInfo

Return type

set(AlignmentInfo)
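
Conceptually (a simplified sketch over plain tuples, ignoring the cept bookkeeping that AlignmentInfo maintains), the neighborhood consists of all single moves and all single swaps:

def neighbors_sketch(alignment, src_len, j_pegged=None):
    """alignment[j] = i aligns target position j (1-indexed) to source i."""
    result = set()
    m = len(alignment) - 1                 # position 0 is the NULL slot
    for j in range(1, m + 1):              # moves: re-align one position
        if j == j_pegged:
            continue
        for i in range(src_len + 1):
            if i != alignment[j]:
                moved = list(alignment)
                moved[j] = i
                result.add(tuple(moved))
    for j1 in range(1, m + 1):             # swaps: exchange two positions
        for j2 in range(j1 + 1, m + 1):
            if j_pegged in (j1, j2) or alignment[j1] == alignment[j2]:
                continue
            swapped = list(alignment)
            swapped[j1], swapped[j2] = swapped[j2], swapped[j1]
            result.add(tuple(swapped))
    return result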

prob_of_alignments(alignments)[source]
reset_probabilities()[source]
sample(sentence_pair)[source]

Sample the most probable alignments from the entire alignment space

First, determine the best alignment according to IBM Model 2. With this initial alignment, use hill climbing to determine the best alignment according to a higher IBM Model. Add this alignment and its neighbors to the sample set. Repeat this process with other initial alignments obtained by pegging an alignment point.

Hill climbing may get stuck in a local maximum, hence the pegging and trying out of different initial alignments. A sketch of the procedure follows the return type below.

Parameters

sentence_pair (AlignedSent) – Source and target language sentence pair to generate a sample of alignments from

Returns

A set of best alignments, represented by their AlignmentInfo, and the best alignment of the set for convenience

Return type

set(AlignmentInfo), AlignmentInfo
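
A sketch of the procedure in terms of the other methods on this class (the internal implementation may differ in details such as index handling):

def sample_sketch(model, sentence_pair):
    sampled = set()
    # Unpegged start: best Model 2 alignment, refined by hill climbing.
    best = model.hillclimb(model.best_model2_alignment(sentence_pair))
    sampled.add(best)
    sampled.update(model.neighboring(best))
    # Peg each alignment point in turn and repeat, to escape local maxima.
    m = len(sentence_pair.words)          # target length
    l = len(sentence_pair.mots)           # source length
    for j in range(1, m + 1):             # target positions are 1-indexed
        for i in range(l + 1):            # source position 0 is NULL
            pegged = model.best_model2_alignment(sentence_pair, j, i)
            climbed = model.hillclimb(pegged, j)
            sampled.update(model.neighboring(climbed, j))
    return sampled, best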