nltk.tag.CRFTagger
- class nltk.tag.CRFTagger
Bases: TaggerI
A module for part-of-speech (POS) tagging using CRFSuite (https://pypi.python.org/pypi/python-crfsuite).
>>> from nltk.tag import CRFTagger
>>> ct = CRFTagger()
>>> train_data = [[('University','Noun'), ('is','Verb'), ('a','Det'), ('good','Adj'), ('place','Noun')],
... [('dog','Noun'),('eat','Verb'),('meat','Noun')]]
>>> ct.train(train_data,'model.crf.tagger')
>>> ct.tag_sents([['dog','is','good'], ['Cat','eat','meat']])
[[('dog', 'Noun'), ('is', 'Verb'), ('good', 'Adj')], [('Cat', 'Noun'), ('eat', 'Verb'), ('meat', 'Noun')]]
>>> gold_sentences = [[('dog','Noun'),('is','Verb'),('good','Adj')] , [('Cat','Noun'),('eat','Verb'), ('meat','Noun')]]
>>> ct.accuracy(gold_sentences)
1.0
Setting a learned model file:
>>> ct = CRFTagger()
>>> ct.set_model_file('model.crf.tagger')
>>> ct.accuracy(gold_sentences)
1.0
- __init__(feature_func=None, verbose=False, training_opt={})
Initialize the CRFSuite tagger.
- Parameters
feature_func – The function that extracts features for each token of a sentence. It should take two parameters, tokens and index, and return the features extracted at position index of the tokens list. See the built-in _get_features function for more detail.
verbose (boolean) – Whether to output debugging messages during training.
training_opt (dictionary) – python-crfsuite training options.
- Set of possible training options (using the L-BFGS training algorithm).
- ‘feature.minfreq’
The minimum frequency of features.
- ‘feature.possible_states’
Force to generate possible state features.
- ‘feature.possible_transitions’
Force to generate possible transition features.
- ‘c1’
Coefficient for L1 regularization.
- ‘c2’
Coefficient for L2 regularization.
- ‘max_iterations’
The maximum number of iterations for L-BFGS optimization.
- ‘num_memories’
The number of limited memories for approximating the inverse Hessian matrix.
- ‘epsilon’
Epsilon for testing the convergence of the objective.
- ‘period’
The duration of iterations to test the stopping criterion.
- ‘delta’
The threshold for the stopping criterion; an L-BFGS iteration stops when the improvement of the log likelihood over the last ${period} iterations is no greater than this threshold.
- ‘linesearch’
The line search algorithm used in L-BFGS updates:
‘MoreThuente’: More and Thuente’s method,
‘Backtracking’: Backtracking method with regular Wolfe condition,
‘StrongBacktracking’: Backtracking method with strong Wolfe condition
- ‘max_linesearch’
The maximum number of trials for the line search algorithm.
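As an illustration of the two arguments described above, here is a minimal sketch of a custom feature function and a training options dictionary. The function name simple_features and the particular features it emits are hypothetical, and the assumption that a feature function returns a list of feature strings is not stated in the parameter description above. The option keys are taken from the list above; their values are arbitrary.

```python
import re

def simple_features(tokens, idx):
    """Hypothetical feature function: describe tokens[idx] with string features."""
    token = tokens[idx]
    features = ["WORD_" + token.lower()]
    if token[:1].isupper():
        features.append("CAPITALIZED")      # first character is upper case
    if re.search(r"\d", token):
        features.append("HAS_NUM")          # token contains a digit
    if len(token) > 2:
        features.append("SUF_" + token[-2:])  # two-character suffix
    if idx == 0:
        features.append("BOS")              # token starts the sentence
    return features

# Keys come from the option list above; the values are arbitrary examples.
training_opt = {
    "feature.minfreq": 1,
    "c1": 0.1,
    "c2": 0.01,
    "max_iterations": 50,
}

print(simple_features(["Cat", "eat", "meat"], 0))
```

Both objects could then be passed to the constructor as CRFTagger(feature_func=simple_features, training_opt=training_opt), following the signature shown above.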
- tag_sents(sents)
Tag a list of sentences. NB: before using this function, the user should specify the model file in one of two ways:
Train a new model using the train function.
Use a pre-trained model, set via the set_model_file function.
- Parameters
sents – list of sentences to tag.
- Returns
list of tagged sentences.
- Return type
list(list(tuple(str,str)))
- train(train_data, model_file)
Train the CRF tagger using CRFSuite.
- Parameters
train_data (list(list(tuple(str,str)))) – The list of annotated sentences.
model_file – The file the model will be saved to.
- accuracy(gold)
Score the accuracy of the tagger against the gold standard. Strip the tags from the gold standard text, retag it using the tagger, then compute the accuracy score.
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
- Return type
float
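The strip, retag and compare procedure described for accuracy can be sketched in plain Python. This is an illustrative reimplementation, not the library's code: accuracy_sketch and toy_tag_sents are hypothetical names, and the toy tagger simply labels every token 'Noun'.

```python
def accuracy_sketch(tag_sents, gold):
    """Token-level accuracy of a tag_sents-style callable against gold data."""
    # Strip the tags from the gold standard.
    untagged = [[word for word, _ in sent] for sent in gold]
    # Retag the bare tokens.
    predicted = tag_sents(untagged)
    # Flatten both sides and compare (word, tag) pairs position by position.
    gold_pairs = [pair for sent in gold for pair in sent]
    pred_pairs = [pair for sent in predicted for pair in sent]
    correct = sum(g == p for g, p in zip(gold_pairs, pred_pairs))
    return correct / len(gold_pairs)

def toy_tag_sents(sents):
    # Hypothetical stand-in tagger: every token is tagged 'Noun'.
    return [[(word, "Noun") for word in sent] for sent in sents]

gold = [
    [("dog", "Noun"), ("is", "Verb"), ("good", "Adj")],
    [("Cat", "Noun"), ("eat", "Verb"), ("meat", "Noun")],
]
score = accuracy_sketch(toy_tag_sents, gold)
print(score)  # 3 of the 6 tokens are nouns in gold, so 0.5
```

The sketch assumes gold is non-empty; an empty list would cause a division by zero.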
- confusion(gold)
Return a ConfusionMatrix with the tags from gold as the reference values, with the predictions from tag_sents as the predicted values.
>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import treebank
>>> tagger = PerceptronTagger()
>>> gold_data = treebank.tagged_sents()[:10]
>>> print(tagger.confusion(gold_data))
       |        -                                                                                     |
       |        N                                                                                     |
       |        O                                               P                                     |
       |        N                       J  J        N  N  P  P  R     R           V  V  V  V  V  W    |
       |  '     E     C  C  D  E  I  J  J  J  M  N  N  N  O  R  P  R  B  R  T  V  B  B  B  B  B  D  ` |
       |  '  ,  -  .  C  D  T  X  N  J  R  S  D  N  P  S  S  P  $  B  R  P  O  B  D  G  N  P  Z  T  ` |
-------+----------------------------------------------------------------------------------------------+
    '' | <1> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
     , |  .<15> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
-NONE- |  .  . <.> .  .  2  .  .  .  2  .  .  .  5  1  .  .  .  .  2  .  .  .  .  .  .  .  .  .  .  . |
     . |  .  .  .<10> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    CC |  .  .  .  . <1> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    CD |  .  .  .  .  . <5> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    DT |  .  .  .  .  .  .<20> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    EX |  .  .  .  .  .  .  . <1> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    IN |  .  .  .  .  .  .  .  .<22> .  .  .  .  .  .  .  .  .  .  3  .  .  .  .  .  .  .  .  .  .  . |
    JJ |  .  .  .  .  .  .  .  .  .<16> .  .  .  .  1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  . |
   JJR |  .  .  .  .  .  .  .  .  .  . <.> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
   JJS |  .  .  .  .  .  .  .  .  .  .  . <1> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    MD |  .  .  .  .  .  .  .  .  .  .  .  . <1> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    NN |  .  .  .  .  .  .  .  .  .  .  .  .  .<28> 1  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
   NNP |  .  .  .  .  .  .  .  .  .  .  .  .  .  .<25> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
   NNS |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .<19> .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
   POS |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <1> .  .  .  .  .  .  .  .  .  .  .  .  .  . |
   PRP |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <4> .  .  .  .  .  .  .  .  .  .  .  .  . |
  PRP$ |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <2> .  .  .  .  .  .  .  .  .  .  .  . |
    RB |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <4> .  .  .  .  .  .  .  .  .  .  . |
   RBR |  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  . <1> .  .  .  .  .  .  .  .  .  . |
    RP |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <1> .  .  .  .  .  .  .  .  . |
    TO |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <5> .  .  .  .  .  .  .  . |
    VB |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <3> .  .  .  .  .  .  . |
   VBD |  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  . <6> .  .  .  .  .  . |
   VBG |  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  . <4> .  .  .  .  . |
   VBN |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  . <4> .  .  .  . |
   VBP |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <3> .  .  . |
   VBZ |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <7> .  . |
   WDT |  .  .  .  .  .  .  .  .  2  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <.> . |
    `` |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <1>|
-------+----------------------------------------------------------------------------------------------+
(row = reference; col = test)
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to run the tagger with, also used as the reference values in the generated confusion matrix.
- Return type
ConfusionMatrix
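To make the row and column meaning concrete, the sketch below counts (reference tag, predicted tag) pairs for a tiny hand-written example. It is an assumption-laden illustration rather than the library's ConfusionMatrix: confusion_counts is a hypothetical helper, and the predicted sentences are made up instead of coming from a real tagger.

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Count (reference_tag, predicted_tag) pairs over aligned tagged sentences."""
    counts = Counter()
    for gold_sent, pred_sent in zip(gold, predicted):
        for (_, ref_tag), (_, test_tag) in zip(gold_sent, pred_sent):
            counts[(ref_tag, test_tag)] += 1
    return counts

gold = [[("dog", "Noun"), ("is", "Verb"), ("good", "Adj")]]
# Made-up predictions: the tagger mislabels 'good' as a noun.
predicted = [[("dog", "Noun"), ("is", "Verb"), ("good", "Noun")]]

counts = confusion_counts(gold, predicted)
for (ref_tag, test_tag), n in sorted(counts.items()):
    print(ref_tag, test_tag, n)
```

Pairs with equal tags correspond to the bracketed diagonal cells of the printed matrix; the ('Adj', 'Noun') pair is an off-diagonal error in the Adj row and Noun column.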
- evaluate_per_tag(gold, alpha=0.5, truncate=None, sort_by_count=False)
Tabulate the recall, precision and f-measure for each tag from gold or from running tag on the tokenized sentences from gold.
>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import treebank
>>> tagger = PerceptronTagger()
>>> gold_data = treebank.tagged_sents()[:10]
>>> print(tagger.evaluate_per_tag(gold_data))
   Tag | Prec.  | Recall | F-measure
-------+--------+--------+-----------
    '' | 1.0000 | 1.0000 | 1.0000
     , | 1.0000 | 1.0000 | 1.0000
-NONE- | 0.0000 | 0.0000 | 0.0000
     . | 1.0000 | 1.0000 | 1.0000
    CC | 1.0000 | 1.0000 | 1.0000
    CD | 0.7143 | 1.0000 | 0.8333
    DT | 1.0000 | 1.0000 | 1.0000
    EX | 1.0000 | 1.0000 | 1.0000
    IN | 0.9167 | 0.8800 | 0.8980
    JJ | 0.8889 | 0.8889 | 0.8889
   JJR | 0.0000 | 0.0000 | 0.0000
   JJS | 1.0000 | 1.0000 | 1.0000
    MD | 1.0000 | 1.0000 | 1.0000
    NN | 0.8000 | 0.9333 | 0.8615
   NNP | 0.8929 | 1.0000 | 0.9434
   NNS | 0.9500 | 1.0000 | 0.9744
   POS | 1.0000 | 1.0000 | 1.0000
   PRP | 1.0000 | 1.0000 | 1.0000
  PRP$ | 1.0000 | 1.0000 | 1.0000
    RB | 0.4000 | 1.0000 | 0.5714
   RBR | 1.0000 | 0.5000 | 0.6667
    RP | 1.0000 | 1.0000 | 1.0000
    TO | 1.0000 | 1.0000 | 1.0000
    VB | 1.0000 | 1.0000 | 1.0000
   VBD | 0.8571 | 0.8571 | 0.8571
   VBG | 1.0000 | 0.8000 | 0.8889
   VBN | 1.0000 | 0.8000 | 0.8889
   VBP | 1.0000 | 1.0000 | 1.0000
   VBZ | 1.0000 | 1.0000 | 1.0000
   WDT | 0.0000 | 0.0000 | 0.0000
    `` | 1.0000 | 1.0000 | 1.0000
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
alpha (float) – Ratio of the cost of false negative compared to false positives, as used in the f-measure computation. Defaults to 0.5, where the costs are equal.
truncate (int, optional) – If specified, then only show the specified number of values. Any sorting (e.g., sort_by_count) will be performed before truncation. Defaults to None.
sort_by_count (bool, optional) – Whether to sort the outputs on the number of occurrences of that tag in the gold data. Defaults to False.
- Returns
A tabulated recall, precision and f-measure string
- Return type
str
- f_measure(gold, alpha=0.5)
Compute the f-measure for each tag that appears in gold or in the output of running tag on the tokenized sentences from gold. Then, return the dictionary with mappings from tag to f-measure. The f-measure is the harmonic mean of the precision and recall, weighted by alpha. In particular, given the precision p and recall r defined by:
p = true positive / (true positive + false positive)
r = true positive / (true positive + false negative)
The f-measure is:
1/(alpha/p + (1-alpha)/r)
With alpha = 0.5, this reduces to:
2pr / (p + r)
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
alpha (float) – Ratio of the cost of false negative compared to false positives. Defaults to 0.5, where the costs are equal.
- Returns
A mapping from tags to f-measure
- Return type
Dict[str, float]
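A short worked example of the weighted harmonic mean may help. It uses the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN) with made-up counts. The helper name f_measure_from and the choice to return 0.0 when either input is zero are assumptions of this sketch, not documented library behaviour.

```python
def f_measure_from(p, r, alpha=0.5):
    """Weighted harmonic mean of precision p and recall r."""
    if p == 0 or r == 0:
        return 0.0  # sketch-level convention to avoid dividing by zero
    return 1.0 / (alpha / p + (1 - alpha) / r)

# Made-up counts for a single tag.
tp, fp, fn = 8, 2, 4
p = tp / (tp + fp)   # precision: 8 / 10
r = tp / (tp + fn)   # recall: 8 / 12

f = f_measure_from(p, r)
print(round(p, 4), round(r, 4), round(f, 4))

# With alpha = 0.5 the general formula matches 2pr / (p + r).
assert abs(f - 2 * p * r / (p + r)) < 1e-12
```

Raising alpha above 0.5 moves the result toward the precision, and lowering it moves the result toward the recall.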
- precision(gold)
Compute the precision for each tag that appears in gold or in the output of running tag on the tokenized sentences from gold. Then, return the dictionary with mappings from tag to precision. The precision is defined as:
p = true positive / (true positive + false positive)
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
- Returns
A mapping from tags to precision
- Return type
Dict[str, float]
- recall(gold) → Dict[str, float]
Compute the recall for each tag that appears in gold or in the output of running tag on the tokenized sentences from gold. Then, return the dictionary with mappings from tag to recall. The recall is defined as:
r = true positive / (true positive + false negative)
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
- Returns
A mapping from tags to recall
- Return type
Dict[str, float]
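Per-tag precision and recall can be illustrated by counting true positives, false positives and false negatives over aligned tag sequences. The sketch below uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). The helper per_tag_precision_recall is hypothetical, the tag lists are made up, and reporting 0.0 when a denominator is zero is a convention chosen here, not necessarily what the library does.

```python
from collections import Counter

def per_tag_precision_recall(gold_tags, predicted_tags):
    """Return (precision, recall) dictionaries keyed by tag."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_tag, pred_tag in zip(gold_tags, predicted_tags):
        if gold_tag == pred_tag:
            tp[gold_tag] += 1
        else:
            fp[pred_tag] += 1   # predicted this tag, but it was wrong
            fn[gold_tag] += 1   # missed an occurrence of the gold tag
    # Tags seen in either the gold data or the predictions.
    tags = set(gold_tags) | set(predicted_tags)
    precision = {t: tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0 for t in tags}
    recall = {t: tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0 for t in tags}
    return precision, recall

gold_tags = ["Noun", "Verb", "Adj", "Noun", "Verb", "Noun"]
predicted_tags = ["Noun", "Verb", "Noun", "Noun", "Noun", "Noun"]

precision, recall = per_tag_precision_recall(gold_tags, predicted_tags)
for tag in sorted(precision):
    print(tag, round(precision[tag], 4), round(recall[tag], 4))
```

Here 'Noun' is predicted five times but is correct only three times (precision 0.6), while every gold 'Noun' is found (recall 1.0).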
- tag(tokens)
Tag a sentence using the Python CRFSuite tagger. NB: before using this function, the user should specify the model file in one of two ways:
Train a new model using the train function.
Use a pre-trained model, set via the set_model_file function.
- Parameters
tokens – list of tokens to tag.
- Returns
list of tagged tokens.
- Return type
list(tuple(str,str))