nltk.tag.BrillTagger¶
- class nltk.tag.BrillTagger[source]¶
Bases:
TaggerI
Brill’s transformational rule-based tagger. Brill taggers use an initial tagger (such as tag.DefaultTagger) to assign an initial tag sequence to a text, and then apply an ordered list of transformational rules to correct the tags of individual tokens. These transformation rules are specified by the TagRule interface.
Brill taggers can be created directly from an initial tagger and a list of transformational rules; but more often, Brill taggers are created by learning rules from a training corpus, using one of the TaggerTrainers available.
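A minimal sketch of the usual training route, assuming the treebank corpus is installed (the brill24 template set, the UnigramTagger initial tagger, and the slice sizes are illustrative choices, not the only ones):

>>> from nltk.corpus import treebank
>>> from nltk.tag import UnigramTagger
>>> from nltk.tag.brill import brill24
>>> from nltk.tag.brill_trainer import BrillTaggerTrainer
>>> train_sents = treebank.tagged_sents()[:400]        # illustrative training slice
>>> initial_tagger = UnigramTagger(train_sents)        # any initial tagger works
>>> trainer = BrillTaggerTrainer(initial_tagger, brill24(), trace=0)
>>> brill_tagger = trainer.train(train_sents, max_rules=10)   # returns a BrillTagger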
- json_tag = 'nltk.tag.BrillTagger'¶
- rules()[source]¶
Return the ordered list of transformation rules that this tagger has learnt
- Returns
the ordered list of transformation rules that correct the initial tagging
- Return type
list of Rules
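For illustration, the learnt rules of the hypothetical brill_tagger trained in the sketch above can be listed directly (each rule prints as a human-readable transformation):

>>> for rule in brill_tagger.rules():
...     print(rule)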
- train_stats(statistic=None)[source]¶
Return a named statistic collected during training, or a dictionary of all available statistics if no name is given
- Parameters
statistic (str) – name of statistic
- Returns
some statistic collected during training of this tagger
- Return type
any (but usually a number)
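The set of statistic names depends on the trainer, so a safe pattern is to request the full dictionary and inspect its keys (brill_tagger as in the sketch above):

>>> stats = brill_tagger.train_stats()   # dict of everything collected during training
>>> names = sorted(stats.keys())         # discover which names can be queried by name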
- tag(tokens)[source]¶
Determine the most appropriate tag sequence for the given token sequence, and return a corresponding list of tagged tokens. A tagged token is encoded as a tuple (token, tag).
- Return type
list(tuple(str, str))
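A minimal usage sketch (brill_tagger as trained in the example above; the resulting tags depend on the learnt rules, so no output is shown):

>>> sent = "The cat sat on the mat .".split()
>>> tagged = brill_tagger.tag(sent)   # a list of (token, tag) tuples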
- print_template_statistics(test_stats=None, printunused=True)[source]¶
Print a list of all templates, ranked according to efficiency.
If test_stats is available, the templates are ranked according to their relative contribution (summed for all rules created from a given template, weighted by score) to the performance on the test set. If test_stats is not given, statistics collected during training are used instead. There is also an unweighted measure (just counting the rules); this is less informative, though, as many low-score rules will appear towards the end of training.
- Parameters
test_stats (dict of str -> any (but usually numbers)) – dictionary of statistics collected during testing
printunused (bool) – if True, print a list of all unused templates
- Returns
None
- Return type
None
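For example (brill_tagger as above; the ranked table is printed to stdout, so output is elided here):

>>> brill_tagger.print_template_statistics(printunused=False)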
- batch_tag_incremental(sequences, gold)[source]¶
Tags by applying each rule to the entire corpus (rather than all rules to a single sequence). The point is to collect statistics on the test set for individual rules.
NOTE: This is inefficient (it builds no index, so it will traverse the entire corpus N times for N rules); usually you will not care about statistics for individual rules and should use batch_tag() instead
- Parameters
sequences (list of list of strings) – lists of token sequences (sentences, in some applications) to be tagged
gold (list of list of strings) – the gold standard
- Returns
tuple of (tagged_sequences, ordered list of rule scores (one for each rule))
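A sketch of collecting per-rule scores, assuming brill_tagger from the training example and a held-out tagged slice (the input sequences are recovered by stripping the gold tags):

>>> gold = treebank.tagged_sents()[400:420]                       # illustrative test slice
>>> untagged = [[token for (token, tag) in sent] for sent in gold]
>>> tagged_seqs, rule_scores = brill_tagger.batch_tag_incremental(untagged, gold)
>>> len(rule_scores) == len(brill_tagger.rules())                 # one score per rule
True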
- accuracy(gold)[source]¶
Score the accuracy of the tagger against the gold standard. Strip the tags from the gold standard text, retag it using the tagger, then compute the accuracy score.
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
- Return type
float
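A usage sketch mirroring the doctests below; as a cross-check, summing the diagonal of the confusion matrix shown under confusion() gives 211 of 236 tokens correct, i.e. an accuracy of roughly 0.894:

>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import treebank
>>> tagger = PerceptronTagger()
>>> gold_data = treebank.tagged_sents()[:10]
>>> acc = tagger.accuracy(gold_data)   # a float between 0 and 1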
- confusion(gold)[source]¶
Return a ConfusionMatrix with the tags from gold as the reference values, with the predictions from tag_sents as the predicted values.

>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import treebank
>>> tagger = PerceptronTagger()
>>> gold_data = treebank.tagged_sents()[:10]
>>> print(tagger.confusion(gold_data))
       | - |
       | N |
       | O P |
       | N J J N N P P R R V V V V V W |
       | ' E C C D E I J J J M N N N O R P R B R T V B B B B B D ` |
       | ' , - . C D T X N J R S D N P S S P $ B R P O B D G N P Z T ` |
-------+----------------------------------------------------------------------------------------------+
    '' | <1> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
     , | .<15> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
-NONE- | . . <.> . . 2 . . . 2 . . . 5 1 . . . . 2 . . . . . . . . . . . |
     . | . . .<10> . . . . . . . . . . . . . . . . . . . . . . . . . . . |
    CC | . . . . <1> . . . . . . . . . . . . . . . . . . . . . . . . . . |
    CD | . . . . . <5> . . . . . . . . . . . . . . . . . . . . . . . . . |
    DT | . . . . . .<20> . . . . . . . . . . . . . . . . . . . . . . . . |
    EX | . . . . . . . <1> . . . . . . . . . . . . . . . . . . . . . . . |
    IN | . . . . . . . .<22> . . . . . . . . . . 3 . . . . . . . . . . . |
    JJ | . . . . . . . . .<16> . . . . 1 . . . . 1 . . . . . . . . . . . |
   JJR | . . . . . . . . . . <.> . . . . . . . . . . . . . . . . . . . . |
   JJS | . . . . . . . . . . . <1> . . . . . . . . . . . . . . . . . . . |
    MD | . . . . . . . . . . . . <1> . . . . . . . . . . . . . . . . . . |
    NN | . . . . . . . . . . . . .<28> 1 1 . . . . . . . . . . . . . . . |
   NNP | . . . . . . . . . . . . . .<25> . . . . . . . . . . . . . . . . |
   NNS | . . . . . . . . . . . . . . .<19> . . . . . . . . . . . . . . . |
   POS | . . . . . . . . . . . . . . . . <1> . . . . . . . . . . . . . . |
   PRP | . . . . . . . . . . . . . . . . . <4> . . . . . . . . . . . . . |
  PRP$ | . . . . . . . . . . . . . . . . . . <2> . . . . . . . . . . . . |
    RB | . . . . . . . . . . . . . . . . . . . <4> . . . . . . . . . . . |
   RBR | . . . . . . . . . . 1 . . . . . . . . . <1> . . . . . . . . . . |
    RP | . . . . . . . . . . . . . . . . . . . . . <1> . . . . . . . . . |
    TO | . . . . . . . . . . . . . . . . . . . . . . <5> . . . . . . . . |
    VB | . . . . . . . . . . . . . . . . . . . . . . . <3> . . . . . . . |
   VBD | . . . . . . . . . . . . . 1 . . . . . . . . . . <6> . . . . . . |
   VBG | . . . . . . . . . . . . . 1 . . . . . . . . . . . <4> . . . . . |
   VBN | . . . . . . . . . . . . . . . . . . . . . . . . 1 . <4> . . . . |
   VBP | . . . . . . . . . . . . . . . . . . . . . . . . . . . <3> . . . |
   VBZ | . . . . . . . . . . . . . . . . . . . . . . . . . . . . <7> . . |
   WDT | . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . <.> . |
    `` | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . <1>|
-------+----------------------------------------------------------------------------------------------+
(row = reference; col = test)
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to run the tagger with, also used as the reference values in the generated confusion matrix.
- Return type
ConfusionMatrix
- evaluate_per_tag(gold, alpha=0.5, truncate=None, sort_by_count=False)[source]¶
Tabulate the recall, precision and f-measure for each tag from gold, or from running tag on the tokenized sentences from gold.

>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import treebank
>>> tagger = PerceptronTagger()
>>> gold_data = treebank.tagged_sents()[:10]
>>> print(tagger.evaluate_per_tag(gold_data))
   Tag | Prec.  | Recall | F-measure
-------+--------+--------+-----------
    '' | 1.0000 | 1.0000 | 1.0000
     , | 1.0000 | 1.0000 | 1.0000
-NONE- | 0.0000 | 0.0000 | 0.0000
     . | 1.0000 | 1.0000 | 1.0000
    CC | 1.0000 | 1.0000 | 1.0000
    CD | 0.7143 | 1.0000 | 0.8333
    DT | 1.0000 | 1.0000 | 1.0000
    EX | 1.0000 | 1.0000 | 1.0000
    IN | 0.9167 | 0.8800 | 0.8980
    JJ | 0.8889 | 0.8889 | 0.8889
   JJR | 0.0000 | 0.0000 | 0.0000
   JJS | 1.0000 | 1.0000 | 1.0000
    MD | 1.0000 | 1.0000 | 1.0000
    NN | 0.8000 | 0.9333 | 0.8615
   NNP | 0.8929 | 1.0000 | 0.9434
   NNS | 0.9500 | 1.0000 | 0.9744
   POS | 1.0000 | 1.0000 | 1.0000
   PRP | 1.0000 | 1.0000 | 1.0000
  PRP$ | 1.0000 | 1.0000 | 1.0000
    RB | 0.4000 | 1.0000 | 0.5714
   RBR | 1.0000 | 0.5000 | 0.6667
    RP | 1.0000 | 1.0000 | 1.0000
    TO | 1.0000 | 1.0000 | 1.0000
    VB | 1.0000 | 1.0000 | 1.0000
   VBD | 0.8571 | 0.8571 | 0.8571
   VBG | 1.0000 | 0.8000 | 0.8889
   VBN | 1.0000 | 0.8000 | 0.8889
   VBP | 1.0000 | 1.0000 | 1.0000
   VBZ | 1.0000 | 1.0000 | 1.0000
   WDT | 0.0000 | 0.0000 | 0.0000
    `` | 1.0000 | 1.0000 | 1.0000
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
alpha (float) – Ratio of the cost of false negatives compared to false positives, as used in the f-measure computation. Defaults to 0.5, where the costs are equal.
truncate (int, optional) – If specified, then only show the specified number of values. Any sorting (e.g., sort_by_count) will be performed before truncation. Defaults to None
sort_by_count (bool, optional) – Whether to sort the outputs on the number of occurrences of that tag in the gold data, defaults to False
- Returns
A tabulated recall, precision and f-measure string
- Return type
str
- f_measure(gold, alpha=0.5)[source]¶
Compute the f-measure for each tag from gold, or from running tag on the tokenized sentences from gold. Then, return the dictionary with mappings from tag to f-measure. The f-measure is the harmonic mean of the precision and recall, weighted by alpha. In particular, given the precision p and recall r defined by:
p = true positive / (true positive + false positive)
r = true positive / (true positive + false negative)
The f-measure is:
1/(alpha/p + (1-alpha)/r)
With alpha = 0.5, this reduces to:
2pr / (p + r)
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
alpha (float) – Ratio of the cost of false negatives compared to false positives. Defaults to 0.5, where the costs are equal.
- Returns
A mapping from tags to f-measure
- Return type
Dict[str, float]
- precision(gold)[source]¶
Compute the precision for each tag from gold, or from running tag on the tokenized sentences from gold. Then, return the dictionary with mappings from tag to precision. The precision is defined as:
p = true positive / (true positive + false positive)
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
- Returns
A mapping from tags to precision
- Return type
Dict[str, float]
- recall(gold) → Dict[str, float][source]¶
Compute the recall for each tag from gold, or from running tag on the tokenized sentences from gold. Then, return the dictionary with mappings from tag to recall. The recall is defined as:
r = true positive / (true positive + false negative)
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
- Returns
A mapping from tags to recall
- Return type
Dict[str, float]
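A small sketch tying the three per-tag metrics together (tagger and gold_data as in the doctests above; the NN comments quote the evaluate_per_tag table):

>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import treebank
>>> tagger = PerceptronTagger()
>>> gold_data = treebank.tagged_sents()[:10]
>>> p = tagger.precision(gold_data)["NN"]   # 0.8000 in the table above
>>> r = tagger.recall(gold_data)["NN"]      # 0.9333 in the table above
>>> f = tagger.f_measure(gold_data)["NN"]   # 0.8615 in the table above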