Collocations

Finding collocations requires first calculating the frequencies of words and of their appearance in the context of other words. Often the collection of words will then require filtering to retain only useful content terms. Each ngram of words may then be scored according to some association measure, in order to determine the relative likelihood of each ngram being a collocation.
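The three steps above (count, filter, score) can be sketched in plain Python. This is only an illustration under stated assumptions, not how the library implements it: the function name `score_bigrams`, its `min_freq` and `stopwords` parameters, and the toy sentence are all hypothetical, and pointwise mutual information (PMI) is used as one possible association measure.

```python
import math
from collections import Counter


def score_bigrams(tokens, min_freq=2, stopwords=frozenset()):
    """Hypothetical helper: return (bigram, PMI) pairs, best score first."""
    # Step 1: frequencies of words and of adjacent word pairs.
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    scored = []
    for (w1, w2), n_ii in bigram_counts.items():
        # Step 2: filtering. Drop rare bigrams and those containing stopwords.
        if n_ii < min_freq or w1 in stopwords or w2 in stopwords:
            continue
        # Step 3: scoring. PMI compares the observed pair count with the
        # count expected if the two words occurred independently.
        pmi = math.log2(n_ii * total / (word_counts[w1] * word_counts[w2]))
        scored.append(((w1, w2), pmi))
    return sorted(scored, key=lambda item: item[1], reverse=True)


tokens = "new york is big and new york is busy but york is old".split()
ranked = score_bigrams(tokens, min_freq=2, stopwords={"is", "and", "but"})
print(ranked)
```

On this toy input only ("new", "york") survives both filters, which shows why filtering usually comes before ranking: very rare pairs and pairs involving function words would otherwise crowd the top of the list.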

The BigramCollocationFinder and TrigramCollocationFinder classes provide this functionality, given a function that scores an ngram from the appropriate frequency counts. A number of standard association measures are provided in BigramAssocMeasures and TrigramAssocMeasures.
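A minimal usage sketch, assuming NLTK is installed, might look like the following. The toy sentence and the small stopword set are made up for illustration. No corpus download is needed, because the finder is built from a plain list of tokens.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

tokens = "new york is big and new york is busy but york is old".split()

# Build the word and bigram frequency counts from the token sequence.
finder = BigramCollocationFinder.from_words(tokens)

# Keep only bigrams seen at least twice, and drop any bigram that
# contains one of the (illustrative) function words.
finder.apply_freq_filter(2)
finder.apply_word_filter(lambda w: w in {"is", "and", "but"})

# Rank the remaining bigrams with a standard association measure.
top = finder.nbest(BigramAssocMeasures.pmi, 3)
scored = finder.score_ngrams(BigramAssocMeasures.pmi)
print(top)
print(scored)
```

Here `nbest` returns only the ngrams, while `score_ngrams` returns (ngram, score) pairs, so the second form is the one to use when the scores themselves are of interest.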

Collocation finders

BigramCollocationFinder

A tool for finding and ranking bigram collocations, or for ranking bigrams by other association measures.

TrigramCollocationFinder

A tool for finding and ranking trigram collocations, or for ranking trigrams by other association measures.

QuadgramCollocationFinder

A tool for finding and ranking quadgram collocations, or for ranking quadgrams by other association measures.

N-gram metrics

BigramAssocMeasures

A collection of bigram association measures. Each association measure is provided as a function with three arguments.
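As a sketch of the three-argument form, a measure can be called directly with marginal counts: the count of the bigram itself, a pair holding the counts of its first and second word, and the overall total. The variable names follow the `n_ii`, `n_ix`, `n_xi`, `n_xx` convention, the numbers are hypothetical, and the hand-computed value assumes that `pmi` is the base-2 log of the observed count over the expected count.

```python
import math

from nltk.metrics import BigramAssocMeasures

# Hypothetical marginal counts for a bigram (w1, w2).
n_ii = 2   # occurrences of (w1, w2)
n_ix = 2   # occurrences of w1 as the first word, (w1, *)
n_xi = 3   # occurrences of w2 as the second word, (*, w2)
n_xx = 13  # total, (*, *)

# Three arguments: the bigram count, the pair of word counts, the total.
score = BigramAssocMeasures.pmi(n_ii, (n_ix, n_xi), n_xx)

# The same quantity worked out by hand, for comparison.
by_hand = math.log2(n_ii * n_xx / (n_ix * n_xi))
print(score, by_hand)
```

Any function with this signature can be passed to a collocation finder's scoring methods, so a custom measure does not need to live in the class.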

TrigramAssocMeasures

A collection of trigram association measures. Each association measure is provided as a function with four arguments.
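The four-argument form can be sketched in the same way: the trigram count, a triple of bigram-level counts, a triple of word counts, and the total. The counts below are hypothetical. The hand computation assumes that the trigram `pmi` uses only the trigram count, the three word counts, and the total, and ignores the bigram-level counts.

```python
import math

from nltk.metrics import TrigramAssocMeasures

# Hypothetical marginal counts for a trigram (w1, w2, w3).
n_iii = 2                         # occurrences of (w1, w2, w3)
n_iix, n_ixi, n_xii = 2, 2, 2     # (w1, w2, *), (w1, *, w3), (*, w2, w3)
n_ixx, n_xix, n_xxi = 2, 2, 2     # (w1, *, *), (*, w2, *), (*, *, w3)
n_xxx = 11                        # total, (*, *, *)

# Four arguments: trigram count, bigram-level counts, word counts, total.
score = TrigramAssocMeasures.pmi(
    n_iii,
    (n_iix, n_ixi, n_xii),
    (n_ixx, n_xix, n_xxi),
    n_xxx,
)

# Hand computation under the assumption stated above.
by_hand = math.log2(n_iii * n_xxx**2 / (n_ixx * n_xix * n_xxi))
print(score, by_hand)
```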

QuadgramAssocMeasures

A collection of quadgram association measures. Each association measure is provided as a function with five arguments.

ContingencyMeasures

Wraps NgramAssocMeasures classes such that the arguments of association measures are contingency table values rather than marginals.
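A small sketch of the difference between the two calling conventions, using hypothetical counts: the wrapped measure takes the four cells of a 2x2 contingency table, while the unwrapped one takes marginal totals. The cell names `n_oi`, `n_io`, and `n_oo` are illustrative, and the cell order passed to the wrapper is an assumption based on the bigram case. PMI is used here because it is symmetric in the two words, so the result does not depend on which off-diagonal cell comes first.

```python
from nltk.metrics import BigramAssocMeasures, ContingencyMeasures

# Marginal counts (hypothetical): bigram, first word, second word, total.
n_ii, n_ix, n_xi, n_xx = 2, 2, 3, 13

# The equivalent contingency table cells.
n_oi = n_xi - n_ii                # w2 preceded by some other word
n_io = n_ix - n_ii                # w1 followed by some other word
n_oo = n_xx - n_ii - n_oi - n_io  # neither word in its position

from_marginals = BigramAssocMeasures.pmi(n_ii, (n_ix, n_xi), n_xx)

# Wrap an instance so that each measure accepts table cells instead.
cont_measures = ContingencyMeasures(BigramAssocMeasures())
from_table = cont_measures.pmi(n_ii, n_oi, n_io, n_oo)
print(from_marginals, from_table)
```

Both calls describe the same observation, so the two scores should agree. The wrapper is mainly a convenience when counts are already stored as a contingency table.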