nltk.tag package
NLTK Taggers
This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.
A “tag” is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (token, tag). For example, the following tagged token combines the word 'fly' with a noun part-of-speech tag ('NN'):
>>> tagged_tok = ('fly', 'NN')
An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
A Russian tagger is also available if you specify lang='rus'. It uses the Russian National Corpus tagset:
>>> pos_tag(word_tokenize("Илья оторопел и дважды перечитал бумажку."), lang='rus')
[('Илья', 'S'), ('оторопел', 'V'), ('и', 'CONJ'), ('дважды', 'ADV'), ('перечитал', 'V'),
('бумажку', 'S'), ('.', 'NONLEX')]
This package defines several taggers, which take a list of tokens, assign a tag to each one, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:
>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
... print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None
Note that words that the tagger has not seen during training receive a tag of None.
We evaluate a tagger on data that was not seen during training:
>>> tagger.accuracy(brown.tagged_sents(categories='news')[500:600])
0.7...
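The unigram tagger's core idea can be sketched in plain Python without the Brown corpus: count (word, tag) pairs in the training data and keep the most frequent tag for each word. This is a simplified illustration of the idea, not NLTK's actual implementation, and the tiny training set below is invented for the example:

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words):
    """Tag each word; unseen words get None, as with UnigramTagger."""
    return [(w, model.get(w)) for w in words]

# Toy training data (invented for illustration).
train = [
    [('the', 'AT'), ('rate', 'NN'), ('of', 'IN'), ('unemployment', 'NN')],
    [('the', 'AT'), ('high', 'JJ'), ('rate', 'NN')],
]
model = train_unigram(train)
print(tag(model, ['the', 'high', 'rate', 'decried']))
# [('the', 'AT'), ('high', 'JJ'), ('rate', 'NN'), ('decried', None)]
```

Unseen words such as 'decried' come out as None here for the same reason they do above: the model has no counts for them.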
For more information, please consult chapter 5 of the NLTK Book.
- nltk.tag.pos_tag(tokens, tagset=None, lang='eng')
Use NLTK’s currently recommended part of speech tagger to tag the given list of tokens.
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
[('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
NB. Use pos_tag_sents() for efficient tagging of more than one sentence.
- Parameters
tokens (list(str)) – Sequence of tokens to be tagged
tagset (str) – the tagset to be used, e.g. universal, wsj, brown
lang (str) – the ISO 639 code of the language, e.g. ‘eng’ for English, ‘rus’ for Russian
- Returns
The tagged tokens
- Return type
list(tuple(str, str))
- nltk.tag.pos_tag_sents(sentences, tagset=None, lang='eng')
Use NLTK’s currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens.
- Parameters
sentences (list(list(str))) – List of sentences to be tagged
tagset (str) – the tagset to be used, e.g. universal, wsj, brown
lang (str) – the ISO 639 code of the language, e.g. ‘eng’ for English, ‘rus’ for Russian
- Returns
The list of tagged sentences
- Return type
list(list(tuple(str, str)))
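In terms of return shape, pos_tag_sents is the per-sentence analogue of pos_tag: it applies the tagger to each token list and returns list(list(tuple(str, str))). A minimal sketch of that shape, using a trivial stand-in tagger rather than NLTK's real model:

```python
def toy_tagger(tokens):
    # Stand-in for a real tagger: tags every token 'NN'. Illustrative only.
    return [(tok, 'NN') for tok in tokens]

def tag_sents(sentences):
    # Same output shape as pos_tag_sents: one tagged list per input sentence.
    return [toy_tagger(sent) for sent in sentences]

sents = [['Dogs', 'bark'], ['Cats', 'sleep']]
print(tag_sents(sents))
# [[('Dogs', 'NN'), ('bark', 'NN')], [('Cats', 'NN'), ('sleep', 'NN')]]
```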