wordnet 3.0 maximum depth of the taxonomy

How can I know the maximum depth of the taxonomy for wordnet 3.0? (is-a relationships for synsets)
I read some papers and found in one of them that it is 16 for WordNet 1.7.1.
I'm wondering what the value is for WordNet 3.0.

You can try the WordNet interface in Python NLTK.
Iterate through each synset in WordNet and find the distance to its topmost hypernym:
>>> from nltk.corpus import wordnet as wn
>>> max(max(len(hyp_path) for hyp_path in ss.hypernym_paths()) for ss in wn.all_synsets())
20
To find the possible paths from a synset to its topmost hypernym:
>>> wn.synset('dog.n.1')
Synset('dog.n.01')
>>> wn.synset('dog.n.1').hypernym_paths()
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('chordate.n.01'), Synset('vertebrate.n.01'), Synset('mammal.n.01'), Synset('placental.n.01'), Synset('carnivore.n.01'), Synset('canine.n.02'), Synset('dog.n.01')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('domestic_animal.n.01'), Synset('dog.n.01')]]
To find the maximum path length for a single synset:
>>> max(len(hyp_path) for hyp_path in wn.synset('dog.n.1').hypernym_paths())
14
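If I'm reading the NLTK API right, Synset also has a max_depth() method, which measures depth in hypernym edges up to the root rather than in nodes along the path, so it should come out one less than the figure above; a quick sketch:
from nltk.corpus import wordnet as wn

# max_depth() counts hypernym edges to the root (entity.n.01 has depth 0),
# so the deepest synset should report 19 where the node count above gives 20.
print(max(ss.max_depth() for ss in wn.all_synsets()))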

Related

NLTK single-word part-of-speech tagging

Is there a way to use NLTK to get a set of possible parts of speech of a single string of letters, taking into account that different words might have homonyms?
For example: report -> {Noun, Verb} , kind -> {Adjective, Noun}
I have not been able to find a POS tagger that tags the part of speech of a word outside the context of a full sentence. This seems like a very basic request of NLTK, so I'm confused as to why I've had so much trouble finding it.
Yes. The simplest way is not to use a tagger, but simply load up one or more corpora and collect the set of all tags for the word you are interested in. If you're interested in more than one word, it's simplest to collect the tags for all words in the corpus, then look up anything you want. I'll add frequency counts, just because I can. For example, using the Brown corpus and the simple "universal" tagset:
>>> import nltk
>>> wordtags = nltk.ConditionalFreqDist((w.lower(), t)
...     for w, t in nltk.corpus.brown.tagged_words(tagset="universal"))
>>> wordtags["report"]
FreqDist({'NOUN': 135, 'VERB': 39})
>>> list(wordtags["kind"])
['ADJ', 'NOUN']
POS models are trained on sentence/document-based data, so the expected input to a pre-trained model is a sentence or document. When there is only a single word, the tagger treats it as a one-word sentence, and there will only be one tag for it in that single-word sentence context.
If you're trying to find all possible POS tags per English word, you would need a corpus with many different uses of the words, then tag the corpus and count/extract the number of tags per word. E.g.
>>> from nltk import pos_tag
>>> sent1 = 'The coaches are going from Singapore to Frankfurt'
>>> sent2 = 'He coaches the football team'
>>> pos_tag(sent1.split())
[('The', 'DT'), ('coaches', 'NNS'), ('are', 'VBP'), ('going', 'VBG'), ('from', 'IN'), ('Singapore', 'NNP'), ('to', 'TO'), ('Frankfurt', 'NNP')]
>>> pos_tag(sent2.split())
[('He', 'PRP'), ('coaches', 'VBZ'), ('the', 'DT'), ('football', 'NN'), ('team', 'NN')]
>>> from collections import defaultdict, Counter
>>> from itertools import chain
>>> counts = defaultdict(Counter)
>>> tagged_sents = [pos_tag(sent) for sent in [sent1.split(), sent2.split()]]
>>> for word, pos in chain(*tagged_sents):
...     counts[word][pos] += 1
...
>>> counts
defaultdict(<class 'collections.Counter'>, {'from': Counter({'IN': 1}), 'to': Counter({'TO': 1}), 'Singapore': Counter({'NNP': 1}), 'football': Counter({'NN': 1}), 'coaches': Counter({'VBZ': 1, 'NNS': 1}), 'going': Counter({'VBG': 1}), 'are': Counter({'VBP': 1}), 'team': Counter({'NN': 1}), 'The': Counter({'DT': 1}), 'Frankfurt': Counter({'NNP': 1}), 'the': Counter({'DT': 1}), 'He': Counter({'PRP': 1})})
>>> counts['coaches']
Counter({'VBZ': 1, 'NNS': 1})
Alternatively, there's WordNet:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('coaches')
[Synset('coach.n.01'), Synset('coach.n.02'), Synset('passenger_car.n.01'), Synset('coach.n.04'), Synset('bus.n.01'), Synset('coach.v.01'), Synset('coach.v.02')]
>>> [ss.pos() for ss in wn.synsets('coaches')]
[u'n', u'n', u'n', u'n', u'n', u'v', u'v']
>>> Counter([ss.pos() for ss in wn.synsets('coaches')])
Counter({u'n': 5, u'v': 2})
But note that WordNet is a manually crafted resource, so you cannot expect every English word to be in it.

How to compare output of wordnet.synsets?

I want to compare the output from the wordnet.synsets function in the NLTK library. For example, when I run:
from nltk.corpus import wordnet as wn
wn.synsets('dog')
I get the output:
[Synset('dog.n.01'),
Synset('frump.n.01'),
Synset('dog.n.03'),
Synset('cad.n.01'),
Synset('frank.n.02'),
Synset('pawl.n.01'),
Synset('andiron.n.01'),
Synset('chase.v.01')]
Now if I try:
from nltk.corpus import wordnet as wn
wn.synsets('dogg')
I get the output:
[]
How can I compare the outputs in the console to build logic around it? The logic could be "if word not in output then misspelled".
Thank you in advance.
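A minimal sketch of that logic, assuming the WordNet data is installed: an empty list from wn.synsets() means WordNet has no entry for the word, which works as a rough misspelling signal (though correctly spelled words can also be missing, e.g. proper nouns or recent coinages):
from nltk.corpus import wordnet as wn

def maybe_misspelled(word):
    # wn.synsets() returns an empty list when WordNet has no entry for the
    # word, so an empty result is a rough hint that it may be misspelled.
    return len(wn.synsets(word)) == 0

print(maybe_misspelled('dog'))   # False -- synsets were found
print(maybe_misspelled('dogg'))  # True  -- wn.synsets('dogg') is []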

Dill/Pickling Error: odd numbers of set items

I have a very strange error that cannot be reproduced anywhere except my production environment. What does this error mean? I get it when I try to run the following piece of code:
serialized_object = dill.dumps(object)
dill.loads(serialized_object)
pickle.UnpicklingError: odd number of items for SET ITEMS
I'd never seen this before, so I looked at the source code (see https://github.com/python/cpython/blob/f24143b25e4f83368ff6182bebe14f885073015c/Modules/_pickle.c#L5914). It seems that the implication is that you have a corrupted or hostile pickle.
Based on the OP's comments, I think I see the workaround. I'll have to determine the impact of the workaround, and it will have to be integrated into dill, but for now here it is:
>>> import StringIO as io
>>> f = io.StringIO()
>>> import dill
>>> import numpy as np
>>> x = np.array([1])
>>> y = (x,)
>>> p = dill.Pickler(f)
>>> p.dump(x)
>>> f.getvalue()
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I1\ntp6\ncnumpy\ndtype\np7\n(S'i8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\ntp14\nb."
>>> p.dump(y)
>>> f.getvalue()
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I1\ntp6\ncnumpy\ndtype\np7\n(S'i8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\ntp14\nb.(g5\ntp15\n."
>>> dill.loads(_)
array([1])
>>>
import dill
import numpy as np
x = np.array([1])
y = (x,)
dill.dumps(x)
dill.loads(dill.dumps(y))
This will throw an index-out-of-range error. The reason is that there is a special function registered to serialize numpy array objects. That special function uses the global Pickler to store the serialized data instead of the Pickler that is passed as an argument. To fix it, I used the Pickler that is passed as an argument instead. I'm not sure if it breaks anything else in dill, though.
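To make the pattern concrete, here is a self-contained sketch, not dill's real code: Point, _recreate_point and save_point are invented names, and the dispatch-table registration uses the pure-Python pickle.Pickler (pickle._Pickler in CPython), which is the mechanism dill's own Pickler builds on:
import io
import pickle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def _recreate_point(x, y):
    # Reconstructor that the unpickler calls at load time.
    return Point(x, y)

class SketchPickler(pickle._Pickler):
    # Copy the dispatch table so the custom save function stays local here.
    dispatch = pickle._Pickler.dispatch.copy()

def save_point(pickler, obj):
    # Correct pattern: write through the pickler that was passed in, so the
    # bytes land in the stream of the dump() that is actually in progress.
    # The buggy pattern writes through a shared module-level pickler, which
    # appends the data to an unrelated stream and corrupts both.
    pickler.save_reduce(_recreate_point, (obj.x, obj.y), obj=obj)

SketchPickler.dispatch[Point] = save_point

buf = io.BytesIO()
SketchPickler(buf).dump((Point(1, 2), Point(3, 4)))
first, second = pickle.loads(buf.getvalue())
print(first.x, first.y, second.x, second.y)  # 1 2 3 4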

Is there a corpus of English words in nltk?

Is there any way to get the list of English words in python nltk library?
I tried to find it but the only thing I have found is wordnet from nltk.corpus. But based on the documentation, it does not have what I need (it finds synonyms of a word).
I know how to find a list of these words by myself (this answer covers it in detail), so I am interested in whether I can do this using only the nltk library.
Yes, from nltk.corpus import words
And check using:
>>> "fine" in words.words()
True
Reference: Section 4.1 (Wordlist Corpora), chapter 2 of Natural Language Processing with Python.
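If you need to check many words, it is probably worth building a set first, since words.words() returns a plain list and an in check on a list rescans it every time; a small sketch, assuming the words corpus is downloaded:
from nltk.corpus import words

# Build the set once; membership tests on a set are constant time on average,
# while "fine" in words.words() walks the whole list on every call.
english_vocab = set(words.words())

print("fine" in english_vocab)  # True, same result as the check above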
Other than the nltk.corpus.words that @salvadordali has highlighted:
>>> from nltk.corpus import words
>>> print words.readme()
Wordlists
en: English, http://en.wikipedia.org/wiki/Words_(Unix)
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)
>>> print words.words()[:10]
[u'A', u'a', u'aa', u'aal', u'aalii', u'aam', u'Aani', u'aardvark', u'aardwolf', u'Aaron']
Do note that nltk.corpus.words is a list of words without frequencies, so it's not exactly a corpus of natural text.
The corpus package contains various corpora, some of which are English corpora; see http://www.nltk.org/nltk_data/. E.g. nltk.corpus.brown:
>>> from nltk.corpus import brown
>>> brown.words()[:10]
[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of']
To get a word list from a natural text corpus:
>>> wordlist = set(brown.words())
>>> print len(wordlist)
56057
>>> wordlist_lowercased = set(i.lower() for i in brown.words())
>>> print len(wordlist_lowercased)
49815
Note that brown.words() contains words in both lower and upper case, like natural text.
In most cases, a list of words is not very useful without frequencies, so you can use FreqDist:
>>> from nltk import FreqDist
>>> from nltk.corpus import brown
>>> frequency_list = FreqDist(i.lower() for i in brown.words())
>>> frequency_list.most_common()[:10]
[(u'the', 69971), (u',', 58334), (u'.', 49346), (u'of', 36412), (u'and', 28853), (u'to', 26158), (u'a', 23195), (u'in', 21337), (u'that', 10594), (u'is', 10109)]
For more on how to access corpora and process them in NLTK, see http://www.nltk.org/book/ch01.html.

nltk function to count occurrences of certain words

In the nltk book there is the question
"Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?"
I thought I could use a function like state_union('1945-Truman.txt').count('men')
However, there are over 60 texts in this State of the Union corpus, and I feel like there has to be an easier way to see the count of these words for each one instead of repeating this function over and over for each text.
You can use the .words() function in the corpus reader to return a list of strings (i.e. tokens/words):
>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
Then use a Counter() object to count the instances; see https://docs.python.org/2/library/collections.html#collections.Counter:
>>> wordcounts = Counter(brown.words())
But do note that the Counter is case-sensitive; see:
>>> from nltk.corpus import brown
>>> from collections import Counter
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> wordcounts = Counter(brown.words())
>>> wordcounts['the']
62713
>>> wordcounts['The']
7258
>>> wordcounts_lower = Counter(i.lower() for i in brown.words())
>>> wordcounts_lower['The']
0
>>> wordcounts_lower['the']
69971
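To tie this back to the state_union question, something along these lines should work (a sketch, assuming the corpus has been downloaded, e.g. via nltk.download('state_union')):
from collections import Counter
from nltk.corpus import state_union

targets = ['men', 'women', 'people']

# One Counter per address; fileids look like '1945-Truman.txt', so sorting
# them keeps the addresses in roughly chronological order.
for fileid in sorted(state_union.fileids()):
    counts = Counter(w.lower() for w in state_union.words(fileid))
    print(fileid, [(t, counts[t]) for t in targets])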