NLTK function to count occurrences of certain words

In the nltk book there is the question
"Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?"
I thought I could use a function like state_union.words('1945-Truman.txt').count('men')
However, there are over 60 texts in this State of the Union corpus, and I feel like there has to be an easier way to see the count of these words for each one instead of repeating this call over and over for each text.

You can use the .words() function of the corpus reader, which returns a list of strings (i.e. tokens/words):
>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
Then use the Counter() object from collections to count the instances (see https://docs.python.org/2/library/collections.html#collections.Counter):
>>> from collections import Counter
>>> wordcounts = Counter(brown.words())
But do note that the Counter is case-sensitive, see:
>>> from nltk.corpus import brown
>>> from collections import Counter
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> wordcounts = Counter(brown.words())
>>> wordcounts['the']
62713
>>> wordcounts['The']
7258
>>> wordcounts_lower = Counter(i.lower() for i in brown.words())
>>> wordcounts_lower['The']
0
>>> wordcounts_lower['the']
69971
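Applying the same pattern to the original question, you can loop over the documents and build one Counter per document. Here's a minimal sketch using a toy stand-in corpus (a plain dict of pre-tokenized documents), since the real state_union reader requires downloaded NLTK data; with NLTK available you would iterate state_union.fileids() and pass each to state_union.words(fileid) instead:

```python
from collections import Counter

# Toy stand-in for the state_union corpus reader: fileid -> tokens.
# With NLTK data installed, use state_union.fileids() and
# state_union.words(fileid) instead of this dict.
documents = {
    "1945-Truman.txt": ["men", "and", "women", "of", "the", "nation"],
    "1946-Truman.txt": ["the", "people", "and", "the", "men"],
}

targets = ["men", "women", "people"]

def count_targets(docs, targets):
    """Return {fileid: {word: count}} for the target words."""
    results = {}
    for fileid, tokens in docs.items():
        counts = Counter(t.lower() for t in tokens)  # case-insensitive
        results[fileid] = {w: counts[w] for w in targets}
    return results

print(count_targets(documents, targets))
```

Counter returns 0 for missing keys, so every document gets an entry for every target word even when it never occurs, which makes the per-year comparison straightforward.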

Related

How can I know the next value of the Django queryset without having to loop and store values?

I'm writing a website using Django and Vue.js. Although I'm admittedly a beginner at both, I find it surprising that I can't seem to find a way to know, beforehand, the value of the next item of a queryset. I need to know if there is a way to do it in Django. For instance, I perform a search on the database, it returns a queryset, and I start to call the elements of the queryset one after the other. Is there a way to know the next one beforehand?
def fetch_question(request):
question_id = request.GET.get('question_id', None)
response = Question.objects.filter(pk=question_id)
Django querysets are iterables, so you can retrieve their iterator with the iter() function, then apply next() to the iterator:
>>> from django.contrib.auth.models import User
>>> qs = User.objects.all()
>>> it = iter(qs)
>>> next(it)
<User: root>
>>> next(it)
<User: toto>
>>> next(it)
<User: titi>
>>> next(it)
<User: tata>
>>>
Just beware that this consumes the iterator...
Another solution, if you want to iterate over (current, next) pairs, is to use zip() (or its lazy version itertools.izip() if you're on Python 2 with huge datasets):
>>> for current, nxt in zip(qs, qs[1:]):  # 'nxt' avoids shadowing the built-in next()
...     print("current: {} - next : {}".format(current, nxt))
...
current: root - next : tata
current: tata - next : titi
current: titi - next : toto
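Note that slicing with qs[1:] issues a second database query; the same pairwise pattern can be done in a single pass over any iterable with the stdlib alone. A sketch with a plain list standing in for the queryset (on Python 3.10+ itertools.pairwise does exactly this):

```python
from itertools import tee

users = ["root", "toto", "titi", "tata"]  # stand-in for the queryset

def pairwise(iterable):
    """s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = tee(iterable)       # two independent iterators over the input
    next(b, None)              # advance the second one by a single step
    return zip(a, b)

pairs = list(pairwise(users))
print(pairs)  # [('root', 'toto'), ('toto', 'titi'), ('titi', 'tata')]
```

Because tee buffers internally, this also works on one-shot iterators where a second slice would be impossible.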
But I have to say that I really wonder why you'd want to "know, beforehand, the value of the next item of a queryset"... I fail to imagine what problem this is supposed to solve.

NLTK single-word part-of-speech tagging

Is there a way to use NLTK to get a set of possible parts of speech of a single string of letters, taking into account that different words might have homonyms?
For example: report -> {Noun, Verb} , kind -> {Adjective, Noun}
I have not been able to find a POS tagger that tags the part-of-speech of words outside the context of a full sentence. This seems like a very basic request of NLTK, so I'm confused as to why I've had so much trouble finding it.
Yes. The simplest way is not to use a tagger, but simply load up one or more corpora and collect the set of all tags for the word you are interested in. If you're interested in more than one word, it's simplest to collect the tags for all words in the corpus, then look up anything you want. I'll add frequency counts, just because I can. For example, using the Brown corpus and the simple "universal" tagset:
>>> import nltk
>>> wordtags = nltk.ConditionalFreqDist((w.lower(), t)
...     for w, t in nltk.corpus.brown.tagged_words(tagset="universal"))
>>> wordtags["report"]
FreqDist({'NOUN': 135, 'VERB': 39})
>>> list(wordtags["kind"])
['ADJ', 'NOUN']
POS models are trained on sentence/document-level data, so a pre-trained model expects a sentence/document as input. Given a single word, it treats it as a one-word sentence, and there is only one tag for it in that context.
If you're trying to find all possible POS tags per English word, you would need a corpus with many different uses of the words, then tag the corpus and count/extract the number of tags per word. E.g.
>>> from nltk import pos_tag
>>> sent1 = 'The coaches are going from Singapore to Frankfurt'
>>> sent2 = 'He coaches the football team'
>>> pos_tag(sent1.split())
[('The', 'DT'), ('coaches', 'NNS'), ('are', 'VBP'), ('going', 'VBG'), ('from', 'IN'), ('Singapore', 'NNP'), ('to', 'TO'), ('Frankfurt', 'NNP')]
>>> pos_tag(sent2.split())
[('He', 'PRP'), ('coaches', 'VBZ'), ('the', 'DT'), ('football', 'NN'), ('team', 'NN')]
>>> from collections import defaultdict, Counter
>>> from itertools import chain
>>> counts = defaultdict(Counter)
>>> tagged_sents = [pos_tag(sent) for sent in [sent1.split(), sent2.split()]]
>>> for word, pos in chain(*tagged_sents):
... counts[word][pos] += 1
...
>>> counts
defaultdict(<class 'collections.Counter'>, {'from': Counter({'IN': 1}), 'to': Counter({'TO': 1}), 'Singapore': Counter({'NNP': 1}), 'football': Counter({'NN': 1}), 'coaches': Counter({'VBZ': 1, 'NNS': 1}), 'going': Counter({'VBG': 1}), 'are': Counter({'VBP': 1}), 'team': Counter({'NN': 1}), 'The': Counter({'DT': 1}), 'Frankfurt': Counter({'NNP': 1}), 'the': Counter({'DT': 1}), 'He': Counter({'PRP': 1})})
>>> counts['coaches']
Counter({'VBZ': 1, 'NNS': 1})
Alternatively, there's WordNet:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('coaches')
[Synset('coach.n.01'), Synset('coach.n.02'), Synset('passenger_car.n.01'), Synset('coach.n.04'), Synset('bus.n.01'), Synset('coach.v.01'), Synset('coach.v.02')]
>>> [ss.pos() for ss in wn.synsets('coaches')]
[u'n', u'n', u'n', u'n', u'n', u'v', u'v']
>>> Counter([ss.pos() for ss in wn.synsets('coaches')])
Counter({u'n': 5, u'v': 2})
But note that WordNet is a manually crafted resource, so you cannot expect every English word to be in it.

Dill/Pickling Error: odd numbers of set items

I have a very strange error that cannot be reproduced anywhere except my production environment. What does this error mean? I get it when I try to run the following piece of code:
serialized_object = dill.dumps(obj)
dill.loads(serialized_object)
pickle.UnpicklingError: odd number of items for SET ITEMS
I'd never seen this before, so I looked at the source code (https://github.com/python/cpython/blob/f24143b25e4f83368ff6182bebe14f885073015c/Modules/_pickle.c#L5914). It seems the implication is that you have a corrupted or hostile pickle.
Based on the OP's comments, I think I see the workaround. I'll have to determine the impact of the workaround, and it will have to be integrated into dill, but for now here it is:
>>> import StringIO as io  # Python 2 session; on Python 3 use io.BytesIO
>>> f = io.StringIO()
>>> import dill
>>> import numpy as np
>>> x = np.array([1])
>>> y = (x,)
>>> p = dill.Pickler(f)
>>> p.dump(x)
>>> f.getvalue()
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I1\ntp6\ncnumpy\ndtype\np7\n(S'i8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\ntp14\nb."
>>> p.dump(y)
>>> f.getvalue()
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I1\ntp6\ncnumpy\ndtype\np7\n(S'i8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\ntp14\nb.(g5\ntp15\n."
>>> dill.loads(_)
array([1])
>>>
import dill
import numpy as np
x = np.array([1])
y = (x,)
dill.dumps(x)
dill.loads(dill.dumps(y))
This will throw an index-out-of-range exception. The reason is that there is a special function registered to serialize numpy array objects, and that special function uses the global Pickler to store the serialized data instead of the Pickler that is passed as an argument. To fix it, I used the Pickler that is passed as an argument instead. I'm not sure whether it breaks anything else in dill, though.
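The general lesson, that custom serialization must be routed through the pickler instance that owns the current stream rather than a global one, can be sketched with the stdlib pickle module. This is not dill's actual fix, just an illustration using Python 3.8+'s reducer_override hook (the Point class is a made-up example type):

```python
import io
import pickle

class Point:
    """Made-up example type to demonstrate custom dispatch."""
    def __init__(self, x, y):
        self.x, self.y = x, y

class MyPickler(pickle.Pickler):
    # The hook runs on *this* pickler instance, so the object's data
    # and its memo references all land in the same stream -- exactly
    # the property that was violated in the dill bug described above.
    def reducer_override(self, obj):
        if isinstance(obj, Point):
            return (Point, (obj.x, obj.y))
        return NotImplemented  # fall back to normal pickling

buf = io.BytesIO()
MyPickler(buf).dump((Point(1, 2),))       # works even nested in a tuple
restored = pickle.loads(buf.getvalue())
print(restored[0].x, restored[0].y)
```

If the hook instead wrote through some other, global pickler, the stream in buf would contain references (like the g5 in the broken dump above) into a memo table that the unpickler never sees, producing exactly the kind of corrupted-pickle error the OP hit.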

wordnet 3.0 maximum depth of the taxonomy

How can I know the maximum depth of the taxonomy for wordnet 3.0? (is-a relationships for synsets)
I read some papers and found one stating that it is 16 for wordnet 1.7.1.
I'm wondering what the value is for wordnet 3.0.
You can use the WordNet interface in Python NLTK.
Iterate through each synset in WordNet and find the length of the longest path to its topmost hypernym:
>>> from nltk.corpus import wordnet as wn
>>> max(max(len(hyp_path) for hyp_path in ss.hypernym_paths()) for ss in wn.all_synsets())
20
To find the possible paths from a synset to its topmost hypernym:
>>> wn.synset('dog.n.1')
Synset('dog.n.01')
>>> wn.synset('dog.n.1').hypernym_paths()
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('chordate.n.01'), Synset('vertebrate.n.01'), Synset('mammal.n.01'), Synset('placental.n.01'), Synset('carnivore.n.01'), Synset('canine.n.02'), Synset('dog.n.01')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('domestic_animal.n.01'), Synset('dog.n.01')]]
To find the maximum for a single synset:
>>> max(len(hyp_path) for hyp_path in wn.synset('dog.n.1').hypernym_paths())
14
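The same max-over-all-paths computation can be sketched without the WordNet data, using a toy hypernym map (a hypothetical mini-taxonomy where each node lists its hypernyms, mirroring Synset.hypernym_paths()):

```python
# Toy is-a taxonomy: node -> list of hypernyms (parents).
# 'entity' is the single root, mirroring WordNet's noun hierarchy.
hypernyms = {
    "entity": [],
    "object": ["entity"],
    "living_thing": ["object"],
    "animal": ["living_thing"],
    "domestic_animal": ["animal"],
    "dog": ["animal", "domestic_animal"],  # two hypernyms -> two paths
}

def hypernym_paths(node):
    """All root-to-node paths, like Synset.hypernym_paths()."""
    parents = hypernyms[node]
    if not parents:
        return [[node]]
    return [path + [node] for p in parents for path in hypernym_paths(p)]

# Maximum taxonomy depth = length of the longest path over all nodes.
max_depth = max(len(p) for node in hypernyms for p in hypernym_paths(node))
print(max_depth)  # 6, via entity -> ... -> domestic_animal -> dog
```

This is exactly the shape of the one-liner over wn.all_synsets() above, just on data small enough to check by hand.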

Is there a corpus of English words in nltk?

Is there any way to get the list of English words in python nltk library?
I tried to find it but the only thing I have found is wordnet from nltk.corpus. But based on documentation, it does not have what I need (it finds synonyms for a word).
I know how to find the list of this words by myself (this answer covers it in details), so I am interested whether I can do this by only using nltk library.
Yes:
>>> from nltk.corpus import words
And check using:
>>> "fine" in words.words()
True
Reference: Section 4.1 (Wordlist Corpora), chapter 2 of Natural Language Processing with Python.
Other than the nltk.corpus.words that @salvadordali has highlighted:
>>> from nltk.corpus import words
>>> print words.readme()
Wordlists
en: English, http://en.wikipedia.org/wiki/Words_(Unix)
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)
>>> print words.words()[:10]
[u'A', u'a', u'aa', u'aal', u'aalii', u'aam', u'Aani', u'aardvark', u'aardwolf', u'Aaron']
Do note that nltk.corpus.words is a list of words without frequencies, so it's not exactly a corpus of natural text.
The nltk.corpus package contains various corpora, some of which are English corpora; see http://www.nltk.org/nltk_data/. E.g. nltk.corpus.brown:
>>> from nltk.corpus import brown
>>> brown.words()[:10]
[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of']
To get a word list from a natural text corpus:
>>> wordlist = set(brown.words())
>>> print len(wordlist)
56057
>>> wordlist_lowercased = set(i.lower() for i in brown.words())
>>> print len(wordlist_lowercased)
49815
Note that brown.words() contains words in both lower and upper case, like natural text.
In most cases, a list of words is not very useful without frequencies, so you can use the FreqDist:
>>> from nltk import FreqDist
>>> from nltk.corpus import brown
>>> frequency_list = FreqDist(i.lower() for i in brown.words())
>>> frequency_list.most_common()[:10]
[(u'the', 69971), (u',', 58334), (u'.', 49346), (u'of', 36412), (u'and', 28853), (u'to', 26158), (u'a', 23195), (u'in', 21337), (u'that', 10594), (u'is', 10109)]
For more on how to access and process corpora in NLTK, see http://www.nltk.org/book/ch01.html
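FreqDist is essentially a subclass of collections.Counter, so the same frequency list can be built with the stdlib alone. A sketch with a toy token list standing in for brown.words() (which needs downloaded NLTK data):

```python
from collections import Counter

# Toy stand-in for brown.words().
tokens = ["The", "dog", "chased", "the", "cat", ".", "The", "cat", "ran", "."]

# Lowercase before counting so 'The' and 'the' merge, as in the
# FreqDist example above.
frequency_list = Counter(t.lower() for t in tokens)
print(frequency_list.most_common(3))
```

most_common() gives the same (word, count) pairs sorted by descending frequency that FreqDist.most_common() returns on the Brown corpus.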