nltk.corpus.reader.WordNetCorpusReader

class nltk.corpus.reader.WordNetCorpusReader[source]

Bases: CorpusReader

A corpus reader used to access wordnet or its variants.

ADJ = 'a'
ADJ_SAT = 's'
ADV = 'r'
NOUN = 'n'
VERB = 'v'
__init__(root, omw_reader)[source]

Construct a new wordnet corpus reader, with the given root directory.

corpus2sk(corpus=None)[source]

Read the sense key to synset id mapping from the index.sense file in the corpus directory.

map_wn30()[source]

Mapping from WordNet 3.0 to the currently loaded WordNet version.

of2ss(of)[source]

Take an ID (an offset-POS string such as ‘00001740-n’) and return the corresponding synset.

ss2of(ss, lang=None)[source]

Return the ID of the synset.
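
For example, assuming WordNet 3.0, where a synset ID is an eight-digit byte offset plus a part-of-speech suffix (a hedged sketch; the ID format may differ in other versions):

>>> from nltk.corpus import wordnet as wn
>>> wn.ss2of(wn.synset('entity.n.01'))
'00001740-n'
>>> wn.of2ss('00001740-n')
Synset('entity.n.01')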

add_provs(reader)[source]

Add languages from Multilingual Wordnet to the provenance dictionary

add_exomw()[source]

Add languages from Extended OMW

>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> wn.add_exomw()
>>> print(wn.synset('intrinsically.r.01').lemmas(lang="eng_wikt"))
[Lemma('intrinsically.r.01.per_se'), Lemma('intrinsically.r.01.as_such')]
langs()[source]

Return a list of languages supported by the Multilingual Wordnet.

get_version()[source]
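
get_version() returns the version string of the loaded WordNet database; with the Princeton WordNet 3.0 data that NLTK distributes, this is:

>>> from nltk.corpus import wordnet as wn
>>> wn.get_version()
'3.0'
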
lemma(name, lang='eng')[source]

Return the lemma object that matches the given name.
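
For example:

>>> from nltk.corpus import wordnet as wn
>>> wn.lemma('dog.n.01.dog')
Lemma('dog.n.01.dog')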

lemma_from_key(key)[source]
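
For example, using the WordNet 3.0 sense key for the animal sense of ‘dog’ (values may differ in other versions):

>>> from nltk.corpus import wordnet as wn
>>> wn.lemma_from_key('dog%1:05:00::')
Lemma('dog.n.01.dog')
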
synset(name)[source]
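
For example:

>>> from nltk.corpus import wordnet as wn
>>> wn.synset('dog.n.01')
Synset('dog.n.01')
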
synset_from_pos_and_offset(pos, offset)[source]
  • pos: The synset’s part of speech, matching one of the module-level attributes ADJ, ADJ_SAT, ADV, NOUN or VERB (‘a’, ‘s’, ‘r’, ‘n’, or ‘v’).

  • offset: The byte offset of this synset in the WordNet dict file for this pos.

>>> from nltk.corpus import wordnet as wn
>>> print(wn.synset_from_pos_and_offset('n', 1740))
Synset('entity.n.01')
synset_from_sense_key(sense_key)[source]

Retrieve the synset corresponding to the given sense_key. Sense keys can be obtained from lemma.key().

From https://wordnet.princeton.edu/documentation/senseidx5wn: A sense_key is represented as:

lemma % lex_sense (e.g. 'dog%1:18:01::')

where lex_sense is encoded as:

ss_type:lex_filenum:lex_id:head_word:head_id
lemma

ASCII text of the word or collocation, in lower case.

ss_type

synset type for the sense (1 digit int). The synset type is encoded as follows:

1    NOUN
2    VERB
3    ADJECTIVE
4    ADVERB
5    ADJECTIVE SATELLITE

lex_filenum

name of the lexicographer file containing the synset for the sense (2 digit int).

lex_id

when paired with lemma, uniquely identifies a sense within the lexicographer file (2 digit int).

head_word

lemma of the first word in the satellite’s head synset. Only used if the sense is in an adjective satellite synset.

head_id

uniquely identifies the sense in a lexicographer file when paired with head_word (2 digit int). Only used if head_word is present.

>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> print(wn.synset_from_sense_key("drive%1:04:03::"))
Synset('drive.n.06')
>>> print(wn.synset_from_sense_key("driving%1:04:03::"))
Synset('drive.n.06')
synsets(lemma, pos=None, lang='eng', check_exceptions=True)[source]

Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.
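
For example (output as produced by WordNet 3.0):

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog', pos=wn.VERB)
[Synset('chase.v.01')]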

lemmas(lemma, pos=None, lang='eng')[source]

Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.
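
For example (output as produced by WordNet 3.0):

>>> from nltk.corpus import wordnet as wn
>>> wn.lemmas('dog', pos='v')
[Lemma('chase.v.01.dog')]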

all_lemma_names(pos=None, lang='eng')[source]

Return all lemma names for all synsets for the given part of speech tag and language. If pos is not specified, all synsets for all parts of speech will be used.

all_omw_synsets(pos=None, lang=None)[source]
all_synsets(pos=None, lang='eng')[source]

Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.
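
Since the synsets are produced lazily, the result can be consumed one item at a time; a minimal sketch:

>>> from nltk.corpus import wordnet as wn
>>> adverb = next(iter(wn.all_synsets(pos=wn.ADV)))
>>> adverb.pos()
'r'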

all_eng_synsets(pos=None)[source]
words(lang='eng')[source]

Return the lemmas of the given language as a list of words.

synonyms(word, lang='eng')[source]

Return a nested list containing the synonyms of the different senses of the given word in the specified language.
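
For example (output as produced by WordNet 3.0):

>>> from nltk.corpus import wordnet as wn
>>> wn.synonyms('car')
[['auto', 'automobile', 'machine', 'motorcar'], ['railcar', 'railroad_car', 'railway_car'], ['gondola'], ['elevator_car'], ['cable_car']]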

doc(file='README', lang='eng')[source]

Return the contents of the README, LICENSE, or citation file. Use lang=lang to get the file for an individual language.

license(lang='eng')[source]

Return the contents of LICENSE (for OMW). Use lang=lang to get the license for an individual language.

readme(lang='eng')[source]

Return the contents of README (for OMW). Use lang=lang to get the README for an individual language.

citation(lang='eng')[source]

Return the contents of the citation.bib file (for OMW). Use lang=lang to get the citation for an individual language.

lemma_count(lemma)[source]

Return the frequency count for this Lemma
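
The counts come from WordNet’s sense-tagged corpora, so common senses have nonzero counts; a minimal sketch:

>>> from nltk.corpus import wordnet as wn
>>> wn.lemma_count(wn.lemma('dog.n.01.dog')) > 0
True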

path_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (true only for verbs, as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.

Parameters
  • synset1, synset2 (Synset) – The two synsets being compared.

  • simulate_root (bool) – The various verb taxonomies do not share a single root, which prevents this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to False to disable this behavior. For the noun taxonomy, there is usually a default root, except in WordNet version 1.6; if you are using WordNet 1.6, a fake root will be added for nouns as well.

Returns

A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.
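
For example (values as produced by WordNet 3.0):

>>> from nltk.corpus import wordnet as wn
>>> wn.path_similarity(wn.synset('dog.n.01'), wn.synset('cat.n.01'))
0.2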

lch_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.

Parameters
  • synset1, synset2 (Synset) – The two synsets being compared.

  • simulate_root (bool) – The various verb taxonomies do not share a single root, which prevents this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to False to disable this behavior. For the noun taxonomy, there is usually a default root, except in WordNet version 1.6; if you are using WordNet 1.6, a fake root will be added for nouns as well.

Returns

A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.
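
For example (WordNet 3.0; the exact value depends on the taxonomy depth of the installed version):

>>> from nltk.corpus import wordnet as wn
>>> wn.lch_similarity(wn.synset('dog.n.01'), wn.synset('cat.n.01'))  # doctest: +ELLIPSIS
2.028...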

wup_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did _not_ always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, though those for nouns do not always do so.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters
  • synset1, synset2 (Synset) – The two synsets being compared.

  • simulate_root (bool) – The various verb taxonomies do not share a single root, which prevents this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to False to disable this behavior. For the noun taxonomy, there is usually a default root, except in WordNet version 1.6; if you are using WordNet 1.6, a fake root will be added for nouns as well.

Returns

A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.
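
For example (values as produced by WordNet 3.0):

>>> from nltk.corpus import wordnet as wn
>>> wn.wup_similarity(wn.synset('dog.n.01'), wn.synset('cat.n.01'))  # doctest: +ELLIPSIS
0.857...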

res_similarity(synset1, synset2, ic, verbose=False)[source]

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters
  • synset1, synset2 (Synset) – The two synsets being compared.

  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns

A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).
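
For example, using the Brown information content file from the wordnet_ic corpus (values as produced by WordNet 3.0):

>>> from nltk.corpus import wordnet as wn, wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> wn.res_similarity(wn.synset('dog.n.01'), wn.synset('cat.n.01'), brown_ic)  # doctest: +ELLIPSIS
7.911...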

jcn_similarity(synset1, synset2, ic, verbose=False)[source]

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters
  • synset1, synset2 (Synset) – The two synsets being compared.

  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns

A float score denoting the similarity of the two Synset objects.
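
For example, again with the Brown information content file (values as produced by WordNet 3.0):

>>> from nltk.corpus import wordnet as wn, wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> wn.jcn_similarity(wn.synset('dog.n.01'), wn.synset('cat.n.01'), brown_ic)  # doctest: +ELLIPSIS
0.449...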

lin_similarity(synset1, synset2, ic, verbose=False)[source]

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters
  • synset1, synset2 (Synset) – The two synsets being compared.

  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns

A float score denoting the similarity of the two Synset objects, in the range 0 to 1.
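
For example, using the SemCor information content file (values as produced by WordNet 3.0):

>>> from nltk.corpus import wordnet as wn, wordnet_ic
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')
>>> wn.lin_similarity(wn.synset('dog.n.01'), wn.synset('cat.n.01'), semcor_ic)  # doctest: +ELLIPSIS
0.886...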

morphy(form, pos=None, check_exceptions=True)[source]

Find a possible base form for the given form, with the given part of speech, by checking WordNet’s list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.

>>> from nltk.corpus import wordnet as wn
>>> print(wn.morphy('dogs'))
dog
>>> print(wn.morphy('churches'))
church
>>> print(wn.morphy('aardwolves'))
aardwolf
>>> print(wn.morphy('abaci'))
abacus
>>> wn.morphy('hardrock', wn.ADV)
>>> print(wn.morphy('book', wn.NOUN))
book
>>> wn.morphy('book', wn.ADJ)

MORPHOLOGICAL_SUBSTITUTIONS = {'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'n': [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')], 'r': [], 's': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'v': [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')]}
ic(corpus, weight_senses_equally=False, smoothing=1.0)[source]

Creates an information content lookup dictionary from a corpus.

Parameters
  • corpus (CorpusReader) – The corpus from which we create an information content dictionary.

  • weight_senses_equally (bool) – If this is True, gives all possible senses equal weight rather than dividing by the number of possible senses. (If a word has 3 senses, each sense gets 0.3333 per appearance when this is False, 1.0 when it is True.)

  • smoothing (float) – How much to smooth synset counts (default is 1.0).

Returns

An information content dictionary
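
For example, building an information content dictionary from the Genesis corpus, weighting senses equally and with no smoothing:

>>> from nltk.corpus import genesis, wordnet as wn
>>> genesis_ic = wn.ic(genesis, False, 0.0)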

custom_lemmas(tab_file, lang)[source]

Reads a custom tab file containing mappings of lemmas in the given language to Princeton WordNet 3.0 synset offsets, allowing NLTK’s WordNet functions to then be used with that language.

See the “Tab files” section at http://compling.hss.ntu.edu.sg/omw/ for documentation on the Multilingual WordNet tab file format.

Parameters
  • tab_file – Tab file as a file or file-like object

  • lang (str) – ISO 639-3 code of the language of the tab file

disable_custom_lemmas(lang)[source]

Prevent synsets from being mistakenly added.

digraph(inputs, rel=<function WordNetCorpusReader.<lambda>>, pos=None, maxdepth=-1, shapes=None, attr=None, verbose=False)[source]

Produce a graphical representation from ‘inputs’ (a list of start nodes, which can be a mix of Synsets, Lemmas and/or words), and a synset relation, for drawing with the ‘dot’ graph visualisation program from the Graphviz package.

Return a string in the DOT graph file language, which can then be converted to an image by nltk.parse.dependencygraph.dot2img(dot_string).

Optional Parameters
  • rel: WordNet synset relation

  • pos: for words, restricts part of speech to ‘n’, ‘v’, ‘a’ or ‘r’

  • maxdepth: limit the longest path

  • shapes: dictionary of strings that trigger a specified shape

  • attr: dictionary with global graph attributes

  • verbose: warn about cycles

>>> from nltk.corpus import wordnet as wn
>>> print(wn.digraph([wn.synset('dog.n.01')]))
digraph G {
"Synset('animal.n.01')" -> "Synset('organism.n.01')";
"Synset('canine.n.02')" -> "Synset('carnivore.n.01')";
"Synset('carnivore.n.01')" -> "Synset('placental.n.01')";
"Synset('chordate.n.01')" -> "Synset('animal.n.01')";
"Synset('dog.n.01')" -> "Synset('canine.n.02')";
"Synset('dog.n.01')" -> "Synset('domestic_animal.n.01')";
"Synset('domestic_animal.n.01')" -> "Synset('animal.n.01')";
"Synset('living_thing.n.01')" -> "Synset('whole.n.02')";
"Synset('mammal.n.01')" -> "Synset('vertebrate.n.01')";
"Synset('object.n.01')" -> "Synset('physical_entity.n.01')";
"Synset('organism.n.01')" -> "Synset('living_thing.n.01')";
"Synset('physical_entity.n.01')" -> "Synset('entity.n.01')";
"Synset('placental.n.01')" -> "Synset('mammal.n.01')";
"Synset('vertebrate.n.01')" -> "Synset('chordate.n.01')";
"Synset('whole.n.02')" -> "Synset('object.n.01')";
}
abspath(fileid)[source]

Return the absolute path for the given file.

Parameters

fileid (str) – The file identifier for the file whose path should be returned.

Return type

PathPointer
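
A minimal sketch, assuming ‘lexnames’ is among this corpus’s fileids (the concrete path depends on where the data is installed):

>>> from nltk.corpus import wordnet as wn
>>> wn.abspath('lexnames')  # doctest: +ELLIPSIS
FileSystemPathPointer('...')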

abspaths(fileids=None, include_encoding=False, include_fileid=False)[source]

Return a list of the absolute paths for all fileids in this corpus; or for the given list of fileids, if specified.

Parameters
  • fileids (None or str or list) – Specifies the set of fileids for which paths should be returned. Can be None, for all fileids; a list of file identifiers, for a specified set of fileids; or a single file identifier, for a single file. Note that the return value is always a list of paths, even if fileids is a single file identifier.

  • include_encoding – If true, then return a list of (path_pointer, encoding) tuples.

Return type

list(PathPointer)

encoding(file)[source]

Return the unicode encoding for the given corpus file, if known. If the encoding is unknown, or if the given file should be processed using byte strings, then return None.

ensure_loaded()[source]

Load this corpus (if it has not already been loaded). This is used by LazyCorpusLoader as a simple method that can be used to make sure a corpus is loaded – e.g., in case a user wants to do help(some_corpus).

fileids()[source]

Return a list of file identifiers for the fileids that make up this corpus.

open(file)[source]

Return an open stream that can be used to read the given file. If the file’s encoding is not None, then the stream will automatically decode the file’s contents into unicode.

Parameters

file – The file identifier of the file to read.

raw(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a single string.

Return type

str

property root

The directory where this corpus is stored.

Type

PathPointer