I'm getting to grips with WSD and WordNet and I'm trying to work out why they output different results. My understanding is that, in the code below, the disambiguate function picks the most likely Synset for each word:
from pywsd import disambiguate
from nltk.corpus import wordnet as wn

mysent = 'I went to have a drink in a bar'
wsd = disambiguate(mysent)
# print each (word, synset) pair on its own line
for pair in wsd:
    print(pair)
This gives me the output below:
('I', None)
('went', Synset('travel.v.01'))
('to', None)
('have', None)
('a', None)
('drink', Synset('swallow.n.02'))
('in', None)
('a', None)
('bar', Synset('barroom.n.01'))
From this, I find it odd that the word 'I' comes back as None, given that looking the word up in WordNet returns four possible senses. Surely 'I' should correspond to at least one of them?
wordnet.synsets('I')
Out:
[Synset('iodine.n.01'), Synset('one.n.01'), Synset('i.n.03'), Synset('one.s.01')]
In your sentence above, 'I' is a pronoun. The WordNet FAQ states:
Q: Why is WordNet missing: of, an, the, and, about, above, because, etc.
A: WordNet only contains "open-class words": nouns, verbs, adjectives, and adverbs. Thus, excluded words include determiners, prepositions, pronouns, conjunctions, and particles.
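You can see this by printing the definitions of the synsets that do come back for 'I': they are open-class senses (the chemical element iodine, the number one, the letter i), not the personal pronoun, so the disambiguator has nothing suitable to assign and returns None. A minimal check:
from nltk.corpus import wordnet as wn

# None of the senses WordNet stores for 'I' is the personal pronoun, because
# pronouns are closed-class words and are not in WordNet at all.
for synset in wn.synsets('I'):
    print(synset.name(), '-', synset.definition())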
I am preprocessing text data. When I lemmatize after stemming, lemmatization gives exactly the same results as stemming (no change in the text), and I can't understand what the issue is.
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemming = PorterStemmer()
lemma = WordNetLemmatizer()

def stem_list(row):
    # stem every token in the 'no_stopwords' column
    my_list = row['no_stopwords']
    stemmed_list = [stemming.stem(word) for word in my_list]
    return stemmed_list

Japan['stemmed_words'] = Japan.apply(stem_list, axis=1)

def lemma_list(row):
    # lemmatize the already-stemmed tokens
    my_list = row['stemmed_words']
    lemmas_list = [lemma.lemmatize(word) for word in my_list]
    return lemmas_list

Japan['lemma_words'] = Japan.apply(lemma_list, axis=1)
Below is the sample output:
secur huawei involv uk critic network suffici mitig longterm hcsec
form mitig perceiv risk aris involv huawei critic nation infrastructur
governmentl board includ offici britain gchq cybersecur agenc well
senior huawei execut repres uk telecommun
My text is news articles.
I am using PorterStemmer for Stemming, and WordNetLemmatizer for Lemmatizing.
Thank you in advance.
The reason your text is likely not changing during lemmatization is that stemmed words are often not real words, so there are no lemmas for the lemmatizer to find.
Both processes try to shorten a word to its root, but stemming is strictly an algorithm, while lemmatization uses a vocabulary to look up the base form of a word. Generally you would use one or the other, depending on the speed and accuracy you need.
However, if you just want to see the results of both in series, reverse your process: lemmatize first, then stem. You should get stems that differ from the lemmas you feed into the stemmer.
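A quick sketch of that reversed order, using the PorterStemmer and WordNetLemmatizer mentioned in the question (the sample words are taken from the question's output; the expected values in the comments assume the lemmatizer's default noun part-of-speech):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['security', 'involving', 'critical', 'networks']

# Lemmatize the real words first, while WordNet can still find them...
lemmas = [lemmatizer.lemmatize(w) for w in words]  # ['security', 'involving', 'critical', 'network']

# ...then stem the lemmas; the two steps now produce different strings.
stems = [stemmer.stem(lemma) for lemma in lemmas]  # ['secur', 'involv', 'critic', 'network']

print(lemmas)
print(stems)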
I'm using the NLTK lemmatizer and I get the wrong result every time!
>>> import nltk
>>> from nltk.stem import WordNetLemmatizer
>>> print(WordNetLemmatizer().lemmatize('loved'))
loved
>>> print(WordNetLemmatizer().lemmatize('creating'))
creating
The output is 'loved' / 'creating', but it should be 'love' / 'create'.
I think WordNet's lemmatizer defaults the part-of-speech to noun, so you need to tell it you're lemmatizing a verb.
print(WordNetLemmatizer().lemmatize('loved', pos='v'))
love
print(WordNetLemmatizer().lemmatize('creating', pos='v'))
create
Any lemmatizer you use will need to know the part-of-speech so it knows what rules to apply. While the two words you have are always verbs, lots of words can be both. For instance, the word "painting" can be a noun or a verb. The lemma of the verb "painting" (i.e. "I am painting") is "paint".
If you use "painting" as a noun (i.e. "a painting"), "painting" is the lemma, since it's already the singular form of the noun.
In general, NLTK/WordNet is not terribly accurate, especially for words not in its word list. I was unhappy with the performance of the available lemmatizers, so I created my own. See Lemminflect. The main Readme page also has a comparison of a few common lemmatizers if you don't want to use that one.
I need a part of speech tagger that does not just return the optimal tag sequence for a given sentence, but that returns the n-best tag sequences. So for 'time flies like an arrow', it could return both NN VBZ IN DT NN and NN NNS VBP DT NN for example, ordered in terms of their probability. I need to train the tagger using my own tag set and sentence examples, and I would like a tagger that allows different features of the sentence to be engineered. If one of the nltk taggers had this functionality, that would be great, but any tagger that I can interface with my Python code would do. Thanks in advance for any suggestions.
I would recommend having a look at spaCy. From what I have seen, it doesn't by default allow you to return the top-n tags, but it supports creating custom pipeline components.
There is also an issue on Github where exactly this is discussed, and there are some suggestions on how to implement it relatively quickly.
I have found the frequency of bigrams in certain sentences using:
import nltk
from nltk import ngrams

mydata = "xxxxx"
mylist = mydata.split()
mybigrams = list(ngrams(mylist, 2))
fd = nltk.FreqDist(mybigrams)
print(fd.most_common())
On printing out the bigrams with the most common frequencies, one occurs 7 times, whereas all 95 other bigrams occur only once. However, when comparing the bigrams to my sentences, I can see no logical order to the way the frequency-1 bigrams are printed out. Does anyone know if there is any logic to the way .most_common() orders the bigrams, or is it random?
Thanks in advance.
Short answer, based on the documentation of collections.Counter.most_common:
Elements with equal counts are ordered arbitrarily:
In current versions of NLTK, nltk.FreqDist is based on nltk.compat.Counter. On Python 2.7 and 3.x, collections.Counter will be imported from the standard library. On Python 2.6, NLTK provides its own implementation.
For details, look at the source code:
https://github.com/nltk/nltk/blob/develop/nltk/compat.py
In conclusion, without checking all possible version configurations, you cannot expect words with equal frequency to be ordered.
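A small illustration of the point (which bigram comes first among the ties is whatever the implementation happens to produce, and should not be relied on):
import nltk
from nltk import ngrams

tokens = "to be or not to be".split()
fd = nltk.FreqDist(ngrams(tokens, 2))

# ('to', 'be') appears twice, so it is guaranteed to be listed first;
# the remaining bigrams all tie at 1 and their relative order is not guaranteed.
print(fd.most_common())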
I'm developing a system where keywords are extracted from plain text.
The requirements for a keyword are:
Between 1 - 45 letters long
Word must exist within the WordNet database
Must not be a "common" word
Must not be a curse word
I have fulfilled requirements 1 - 3, but I can't find a method for identifying curse words; how do I filter them out?
I know this would not be a definitive way of filtering out all curse words. What happens is that all keywords are first set to a "pending" state before being "approved" by a moderator; but if I can get WordNet to filter out most of the curse words, it would make the moderator's job much easier.
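For context, the first two requirements can be checked directly against WordNet. A minimal sketch (the helper name is made up here, not part of the asker's system):
from nltk.corpus import wordnet

def passes_basic_checks(word):
    # requirement 1: between 1 and 45 letters long
    if not (1 <= len(word) <= 45 and word.isalpha()):
        return False
    # requirement 2: the word must exist in the WordNet database
    return len(wordnet.synsets(word)) > 0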
It's strange, but the Unix command-line version of WordNet (wn) will give you the desired information with the -domn (domain) option:
wn ass -domnn (-domnv for a verb)
...
>>> USAGE->(noun) obscenity#2, smut#4, vulgarism#1, filth#4, dirty word#1
>>> USAGE->(noun) slang#2, cant#3, jargon#1, lingo#1, argot#1, patois#1, vernacular#1
However, the equivalent method in the NLTK just returns an empty list:
from nltk.corpus import wordnet

a = wordnet.synsets('ass')
for s in a:
    for l in s.lemmas():
        print(l.usage_domains())
[]
[]
...
As an alternative, you could try to filter words that have "obscene", "coarse" or "slang" in their synset's definition. But it's probably much easier to filter against a fixed list as suggested before (like the one at noswearing.com).
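If you do want to try the definition-based filter, a rough sketch might look like this (the marker list is just an illustration; it will miss plenty of words and flag some innocent ones):
from nltk.corpus import wordnet

def looks_offensive(word, markers=('obscene', 'vulgar', 'slang', 'coarse')):
    # Flag the word if any of its synset definitions mention one of the markers.
    for synset in wordnet.synsets(word):
        definition = synset.definition().lower()
        if any(marker in definition for marker in markers):
            return True
    return False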
For the 4th point, it would be better and more effective to collect a list of curse words and remove them through an iterative process.
To achieve this you can check out this blog; I will summarize it here.
1. Load the swear-words text file from here.
2. Compare it with the text and remove any word that matches.
# curseWords is assumed to hold the swear-word list loaded in step 1 (lower-cased)
def remove_curse_words(text):
    text = ' '.join(word for word in text.split() if word.lower() not in curseWords)
    return text

print(remove_curse_words('Hey Bro Fuck you'))
The output would be:
Hey Bro you