Obtaining METEOR scores for Japanese text - nltk

I wish to produce METEOR scores for several Japanese strings. I have imported NLTK and downloaded wordnet and omw, but the results do not convince me it is working correctly.
import nltk
from nltk.corpus import wordnet
from nltk.translate.meteor_score import single_meteor_score
nltk.download('wordnet')
nltk.download('omw')
reference = "チップは含まれていません。"
hypothesis = "チップは含まれていません。"
print(single_meteor_score(reference, hypothesis))
This outputs 0.5 but surely it should be much closer to 1.0 given the reference and hypothesis are identical?
Do I somehow need to specify which wordnet language I want to use in the call to single_meteor_score() for example:
single_meteor_score(reference, hypothesis, wordnet=wordnetJapanese.

Pending review by a qualified linguist, I appear to have found a solution. I found an open-source tokenizer for Japanese. I pre-processed all of my reference and hypothesis strings to insert spaces between Japanese tokens and then ran nltk.single_meteor_score() over the files, roughly as sketched below.
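For reference, this is roughly what that pre-processing looks like. The sketch below uses the janome tokenizer purely as an example (MeCab/fugashi or any other Japanese tokenizer would work just as well); note that newer NLTK releases expect pre-tokenized input (lists of tokens), while older releases split plain strings on whitespace.

# A minimal sketch of the tokenize-then-score approach described above.
# janome is just an example tokenizer (pip install janome); substitute your own.
import nltk
from nltk.translate.meteor_score import single_meteor_score
from janome.tokenizer import Tokenizer

nltk.download('wordnet')
nltk.download('omw')

tokenizer = Tokenizer()

def tokenize_ja(text):
    """Split a Japanese string into surface-form tokens."""
    return [token.surface for token in tokenizer.tokenize(text)]

reference = "チップは含まれていません。"
hypothesis = "チップは含まれていません。"

# Newer NLTK expects lists of tokens; on older versions, join the tokens
# with spaces and pass strings instead:
#   single_meteor_score(" ".join(ref_tokens), " ".join(hyp_tokens))
print(single_meteor_score(tokenize_ja(reference), tokenize_ja(hypothesis)))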

Related

wrong lemmatizing using nltk "python 3.7.4"

I'm using the NLTK lemmatizer and I get the wrong result every time!
>>> import nltk
>>> from nltk.stem import WordNetLemmatizer
>>> print(WordNetLemmatizer().lemmatize('loved'))
loved
>>> print(WordNetLemmatizer().lemmatize('creating'))
creating
The output is 'loved' / 'creating', but it should be 'love' / 'create'.
I think WordNet's lemmatizer defaults the part of speech to noun, so you need to tell it you're lemmatizing a verb.
print(WordNetLemmatizer().lemmatize('loved', pos='v'))
love
print(WordNetLemmatizer().lemmatize('creating', pos='v'))
create
Any lemmatizer you use will need to know the part of speech so it knows what rules to apply. While the two words you have are always verbs, lots of words can be both. For instance, the word "painting" can be a noun or a verb. The lemma of the verb "painting" (i.e. "I am painting") is "paint".
If you use "painting" as a noun (i.e. "a painting"), "painting" is the lemma, since it's the singular form of the noun.
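If you don't want to hard-code the part of speech, a common pattern is to run nltk.pos_tag first and map the Penn Treebank tag onto a WordNet POS before lemmatizing. A rough sketch (the penn_to_wordnet helper is my own, not part of NLTK):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag from nltk.pos_tag to a WordNet POS constant."""
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # the lemmatizer's default

lemmatizer = WordNetLemmatizer()
for word, tag in nltk.pos_tag(nltk.word_tokenize("I loved creating a painting")):
    print(word, '->', lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)))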
In general, NLTK/WordNet is not terribly accurate, especially for words not in its word list. I was unhappy with the performance of the available lemmatizers, so I created my own. See Lemminflect. The main README page also has a comparison of a few common lemmatizers if you don't want to use that one.

neutral label for NLTK

I have similar problem like below
Why did NLTK NaiveBayes classifier misclassify one record?
In my case, I queried a positive feed and built positive_vocab, then queried a negative feed and built negative_vocab. I get the data from the feeds, clean it, and build the classifier. How do I build a neutral_vocab? Is there a way I can instruct the NLTK classifier to return a neutral label when the given word is not found in negative_vocab or positive_vocab? How do I do that?
In my current implementation, if I give a word which is not present in either set, it says positive by default. Instead it should say neutral or notfound, something like the sketch below.
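To make the desired behaviour concrete: a simple wrapper that checks the word against both vocabularies before calling the classifier would give the neutral/notfound fallback described. A minimal sketch, assuming positive_vocab, negative_vocab, a trained classifier and an extract_features function already exist (names taken from or implied by the question):

def label_word(word, positive_vocab, negative_vocab, classifier, extract_features):
    # Fall back to 'neutral' (or 'notfound') for words seen in neither vocabulary,
    # instead of letting the classifier default to 'positive'.
    if word not in positive_vocab and word not in negative_vocab:
        return 'neutral'
    return classifier.classify(extract_features(word))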

NLTK (or other) Part of speech tagger that returns n-best tag sequences

I need a part of speech tagger that does not just return the optimal tag sequence for a given sentence, but that returns the n-best tag sequences. So for 'time flies like an arrow', it could return both NN VBZ IN DT NN and NN NNS VBP DT NN for example, ordered in terms of their probability. I need to train the tagger using my own tag set and sentence examples, and I would like a tagger that allows different features of the sentence to be engineered. If one of the nltk taggers had this functionality, that would be great, but any tagger that I can interface with my Python code would do. Thanks in advance for any suggestions.
I would recommend having a look at spaCy. From what I have seen, it doesn't by default allow you to return the top-n tags, but it supports creating custom pipeline components.
There is also an issue on GitHub where exactly this is discussed, and there are some suggestions on how to implement it relatively quickly.
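For a sense of what "n-best tag sequences" looks like mechanically, here is a tiny, library-agnostic beam search over per-token tag probabilities. The probability table is made up for illustration; in practice a tagger (spaCy, a trained NLTK model, etc.) would supply these scores, possibly together with transition scores:

import heapq
from math import log

# Hypothetical per-token tag probabilities for "time flies like an arrow".
token_tag_probs = [
    {'NN': 0.9, 'VB': 0.1},    # time
    {'VBZ': 0.6, 'NNS': 0.4},  # flies
    {'IN': 0.7, 'VBP': 0.3},   # like
    {'DT': 1.0},               # an
    {'NN': 1.0},               # arrow
]

def n_best_sequences(token_tag_probs, n=2, beam_width=10):
    """Return the n highest-scoring tag sequences, assuming tags are
    scored independently per token (no transition model)."""
    beam = [(0.0, [])]  # (log-probability, tag sequence so far)
    for tag_probs in token_tag_probs:
        candidates = [
            (score + log(p), tags + [tag])
            for score, tags in beam
            for tag, p in tag_probs.items()
        ]
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return heapq.nlargest(n, beam, key=lambda c: c[0])

for score, tags in n_best_sequences(token_tag_probs, n=2):
    print(tags, score)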

Nltk .most_common(), what is the order it is returned in?

I have found the frequency of bigrams in certain sentences using:
import nltk
from nltk import ngrams
mydata = "xxxxx"
mylist = mydata.split()
mybigrams = list(ngrams(mylist, 2))
fd = nltk.FreqDist(mybigrams)
print(fd.most_common())
On printing out the bigrams with the most common frequencies, one occurs 7 times whereas all 95 other bigrams occur only once. However, when comparing the bigrams to my sentences I can see no logical order to the way the bigrams with frequency 1 are printed out. Does anyone know if there is any logic to the way .most_common() orders the bigrams, or is it random?
Thanks in advance
Short answer, based on the documentation of collections.Counter.most_common:
Elements with equal counts are ordered arbitrarily.
In current versions of NLTK, nltk.FreqDist is based on nltk.compat.Counter. On Python 2.7 and 3.x, collections.Counter will be imported from the standard library. On Python 2.6, NLTK provides its own implementation.
For details, look at the source code:
https://github.com/nltk/nltk/blob/develop/nltk/compat.py
In conclusion, without checking all possible version configurations, you cannot expect words with equal frequency to be ordered.
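If you need a reproducible order, don't rely on most_common() to break ties; sort the items yourself, for example by descending count and then by the bigram itself. A small sketch along the lines of the question's code:

import nltk
from nltk import ngrams

mydata = "the cat sat on the mat and the cat slept"
mybigrams = list(ngrams(mydata.split(), 2))
fd = nltk.FreqDist(mybigrams)

# Sort by descending count, then alphabetically by bigram, so ties come out
# in a stable order regardless of NLTK/Python version.
for bigram, count in sorted(fd.items(), key=lambda kv: (-kv[1], kv[0])):
    print(bigram, count)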

What NLTK technique for extracting terms for a tag cloud

I want to extract relevant terms from text and choose the most relevant ones.
How to config nltk data -> how, to, config ignored
config mysql to scan -> config NOT ignored
Python NLTK usage -> usage ignored
new song by the band usage -> usage NOT ignored
NLTK Thinks that -> thinks ignored
critical thinking -> thinking NOT ignored
The only crude method I can think of is this:
>>> text = nltk.word_tokenize(input)
>>> nltk.pos_tag(text)
and then keep only the nouns and verbs. But even if "think" and "thinking" are both verbs, I want to retain only "thinking", and likewise "combined" over "combine". I would also like to extract phrases if possible, as well as terms like "free2play", "#pro_blogger", etc.
Please suggest a better scheme or how to actually make my scheme work.
All you need is better POS tagging. This is a well-known problem with NLTK: the core POS tagger is not efficient for production use. Maybe you want to try something else. Compare your results for POS tagging here: http://nlp.stanford.edu:8080/parser/ . This is the most accurate POS tagger I have ever found (I know I will be proved wrong soon). Once you parse your data with this tagger, you will automatically realize what exactly you want.
I suggest you focus on tagging properly.
Check this POS tagging example:
Tagging
critical/JJ
thinking/NN
Source: I am also struggling with the NLTK POS tagger these days. :)
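Whichever tagger you end up using, the filtering step itself is simple: tag the text and keep only the noun tokens, which matches the examples above ("thinking/NN" kept, "Thinks" as a verb dropped). A rough sketch with NLTK's own tagger, bearing in mind the accuracy caveat in this answer:

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def tag_cloud_terms(sentence):
    """Keep only noun tokens (NN, NNS, NNP, NNPS) as candidate tag-cloud terms."""
    tokens = nltk.word_tokenize(sentence)
    return [word.lower() for word, tag in nltk.pos_tag(tokens) if tag.startswith('NN')]

for sentence in ["NLTK Thinks that", "critical thinking", "Python NLTK usage"]:
    print(sentence, '->', tag_cloud_terms(sentence))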