Wrong lemmatizing using NLTK (Python 3.7.4) - nltk

I'm using the NLTK lemmatizer and I get the wrong result every time!
>>> import nltk
>>> from nltk.stem import WordNetLemmatizer
>>> print(WordNetLemmatizer().lemmatize('loved'))
loved
>>> print(WordNetLemmatizer().lemmatize('creating'))
creating
The output is 'loved' / 'creating', but it should be 'love' / 'create'.

I think WordNet's lemmatizer defaults the part of speech to noun, so you need to tell it you're lemmatizing a verb.
print(WordNetLemmatizer().lemmatize('loved', pos='v'))
love
print(WordNetLemmatizer().lemmatize('creating', pos='v'))
create
Any lemmatizer you use will need to know the part of speech so it knows what rules to apply. While the two words you have are always verbs, lots of words can be both. For instance, the word "painting" can be a noun or a verb. The lemma of the verb "painting" (i.e., I am painting) is "paint".
If you use "painting" as a noun (i.e., a painting), "painting" is the lemma, since it's the singular form of the noun.
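As a quick illustration of that point, using the same WordNetLemmatizer as above, passing pos='n' versus pos='v' for "painting" gives the two different lemmas described:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# As a noun ("a painting"), the lemma is the singular noun itself.
print(lemmatizer.lemmatize('painting', pos='n'))  # painting
# As a verb ("I am painting"), the lemma is the base verb form.
print(lemmatizer.lemmatize('painting', pos='v'))  # paint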
In general, NLTK/WordNet is not terribly accurate, especially for words not in its word list. I was unhappy with the performance of the available lemmatizers, so I created my own. See Lemminflect. The main Readme page also has a comparison of a few common lemmatizers if you don't want to use that one.
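For reference, here is a minimal sketch of a Lemminflect lookup (assuming its getLemma function, which takes a universal POS tag and returns a tuple of candidate lemmas):
from lemminflect import getLemma
# getLemma returns a tuple of candidate lemmas for the given universal POS tag.
print(getLemma('loved', upos='VERB'))     # ('love',)
print(getLemma('painting', upos='NOUN'))  # ('painting',)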

Related

How to extract noun phrases from a sentence using pre-trained BERT?

I want to extract noun phrases from a sentence using BERT. There are some available libraries like TextBlob that allow us to extract noun phrases like this:
from textblob import TextBlob
line = "Out business could be hurt by increased labor costs or labor shortages"
blob = TextBlob(line)
blob.noun_phrases
>>> WordList(['labor costs', 'labor shortages'])
The output of this seems pretty good. However, this is not able to capture longer noun phrases. Consider the following example:
from textblob import TextBlob
line = "The Company’s Image Activation program may not positively affect sales at company-owned and participating "
"franchised restaurants or improve our results of operations"
blob = TextBlob(line)
blob.noun_phrases
>>> WordList(['company ’ s', 'image activation'])
However, the ideal answer here could be a longer phrase, like: company-owned and participating franchised restaurants. Since BERT has proven to be state-of-the-art in many NLP tasks, it should perform at least as well as this approach.
However, I could not find any relevant resource to use BERT for this task. Is it possible to solve this task using pre-trained BERT?

Obtaining METEOR scores for Japanese text

I wish to produce METEOR scores for several Japanese strings. I have imported nltk, wordnet and omw but the results do not convince me it is working correctly.
import nltk
from nltk.corpus import wordnet
from nltk.translate.meteor_score import single_meteor_score
nltk.download('wordnet')
nltk.download('omw')
reference = "チップは含まれていません。"
hypothesis = "チップは含まれていません。"
print(single_meteor_score(reference, hypothesis))
This outputs 0.5 but surely it should be much closer to 1.0 given the reference and hypothesis are identical?
Do I somehow need to specify which wordnet language I want to use in the call to single_meteor_score() for example:
single_meteor_score(reference, hypothesis, wordnet=wordnetJapanese.
Pending review by a qualified linguist, I appear to have found a solution. I found an open-source tokenizer for Japanese. I pre-processed all of my reference and hypothesis strings to insert spaces between Japanese tokens and then ran nltk's single_meteor_score() over the files.
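A minimal sketch of that approach, assuming the Janome tokenizer as the open-source Japanese tokenizer (any tokenizer that yields surface forms would do) and noting that recent NLTK versions expect pre-tokenized input for single_meteor_score:
import nltk
from janome.tokenizer import Tokenizer
from nltk.translate.meteor_score import single_meteor_score

nltk.download('wordnet')
nltk.download('omw')

janome = Tokenizer()

def tokenize_ja(text):
    # Split the Japanese string into surface-form tokens.
    return [token.surface for token in janome.tokenize(text)]

reference = "チップは含まれていません。"
hypothesis = "チップは含まれていません。"
# Identical token sequences should now score close to 1.0.
print(single_meteor_score(tokenize_ja(reference), tokenize_ja(hypothesis)))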

NLTK (or other) Part of speech tagger that returns n-best tag sequences

I need a part of speech tagger that does not just return the optimal tag sequence for a given sentence, but that returns the n-best tag sequences. So for 'time flies like an arrow', it could return both NN VBZ IN DT NN and NN NNS VBP DT NN for example, ordered in terms of their probability. I need to train the tagger using my own tag set and sentence examples, and I would like a tagger that allows different features of the sentence to be engineered. If one of the nltk taggers had this functionality, that would be great, but any tagger that I can interface with my Python code would do. Thanks in advance for any suggestions.
I would recommend having a look at spaCy. From what I have seen, it doesn't by default allow you to return the top-n tags, but it supports creating custom pipeline components.
There is also an issue on GitHub where exactly this is discussed, and there are some suggestions on how to implement it relatively quickly.

WordNet 3.0 Curse Words

I'm developing a system where keywords are extracted from plain text.
The requirements for a keyword are:
Between 1 and 45 letters long
Word must exist within the WordNet database
Must not be a "common" word
Must not be a curse word
I have fulfilled requirements 1 - 3, but I can't find a way to identify curse words; how do I filter them out?
I know this would not be a definitive method of filtering out all the curse words, since every keyword is first set to a "pending" state before being "approved" by a moderator anyway. However, if I can get WordNet to filter out most of the curse words, it would make the moderator's job easier.
It's strange, but the Unix command-line version of WordNet (wn) will give you the desired information with the -domn (domain) option:
wn ass -domnn (-domnv for a verb)
...
>>> USAGE->(noun) obscenity#2, smut#4, vulgarism#1, filth#4, dirty word#1
>>> USAGE->(noun) slang#2, cant#3, jargon#1, lingo#1, argot#1, patois#1, vernacular#1
However, the equivalent method in the NLTK just returns an empty list:
from nltk.corpus import wordnet
a = wordnet.synsets('ass')
for s in a:
    for l in s.lemmas():
        print(l.usage_domains())
[]
[]
...
As an alternative you could try to filter words that have "obscene", "coarse" or "slang" in their SynSet's definition. But probably it's much easier to filter against a fixed list as suggested before (like the one at noswearing.com).
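A minimal sketch of that definition-based filter (the marker words and the helper name are illustrative choices, not part of WordNet):
from nltk.corpus import wordnet

PROFANITY_MARKERS = ('obscene', 'vulgar', 'slang')  # illustrative markers, tune as needed

def looks_like_curse_word(word):
    # Flag the word if any of its synset definitions mentions one of the markers.
    for synset in wordnet.synsets(word):
        if any(marker in synset.definition().lower() for marker in PROFANITY_MARKERS):
            return True
    return False

print(looks_like_curse_word('ass'))  # likely True, depending on the WordNet definitions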
For the 4th point, it would be better and more effective to collect a list of curse words and remove them in an iterative process.
To achieve this, you can check out this blog post; I will summarize it here.
1. Load the swear-words text file from here
2. Compare it with the text and remove any word that matches.
curse_words = {'fuck'}  # in practice, loaded from the swear-words file above

def remove_curse_words():
    text = 'Hey Bro Fuck you'
    # Keep only the words that are not in the curse-word list (case-insensitive).
    text = ' '.join([word for word in text.split() if word.lower() not in curse_words])
    return text
The output would be:
Hey Bro you

What NLTK technique for extracting terms for a tag cloud

I want to extract terms from text and keep only the most relevant ones. For example:
How to config nltk data -> how, to, config ignored
config mysql to scan -> config NOT ignored
Python NLTK usage -> usage ignored
new song by the band usage -> usage NOT ignored
NLTK Thinks that -> thinks ignored
critical thinking -> thinking NOT ignored
I can think only of this crude method:
>>> text = nltk.word_tokenize(input)
>>> nltk.pos_tag(text)
and then keep only the nouns and verbs. But even though "think" and "thinking" are both verbs, I want to retain only "thinking", and likewise "combined" over "combine". I would also like to extract phrases if I could, as well as terms like "free2play", "#pro_blogger", etc.
Please suggest a better scheme or how to actually make my scheme work.
All you need is better POS tagging. This is a well-known problem with NLTK: the core POS tagger is not accurate enough for production use. Maybe you want to try something else. Compare your results for POS tagging here - http://nlp.stanford.edu:8080/parser/ . This is the most accurate POS tagger I have found (I know I will be proved wrong soon). Once you parse your data with this tagger, you will automatically realize exactly what you want.
I suggest you focus on tagging properly.
Check this POS tagging example:
critical/JJ
thinking/NN
Source: I am also struggling with the NLTK POS tagger these days. :)
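To make the noun-filtering idea from the question concrete, here is a minimal sketch using NLTK's default tagger (the answer above recommends a stronger tagger such as Stanford's, so results will vary with the tagger used):
import nltk

def extract_noun_terms(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    # Keep only noun tags (NN, NNS, NNP, NNPS); verb forms such as "thinks" are
    # dropped, while gerunds used as nouns ("critical thinking") survive.
    return [word for word, tag in tagged if tag.startswith('NN')]

print(extract_noun_terms('critical thinking'))  # expected: ['thinking']
print(extract_noun_terms('NLTK thinks that'))   # expected: ['NLTK']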