Trouble understanding the VADER sentiment analyser (NLTK)

I have some trouble understanding VADER's output depending on the case of the input. Here is an example. I want to know the reason for this behaviour because, in my opinion, if a sentence is in all caps it should add emphasis and give an even more strongly weighted result. Instead, VADER seems to treat it as something like sarcasm.
In [17]: sentiment_analyse(str.lower("I WILL NEVER RECOMMEND HIM TO ANYONE......."))
compound: -0.2755,
neg: 0.297,
neu: 0.703,
pos: 0.0,
negative
In [18]: sentiment_analyse("I WILL NEVER RECOMMEND HIM TO ANYONE.......")
compound: 0.3612,
neg: 0.0,
neu: 0.667,
pos: 0.333,
positive
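For reference, sentiment_analyse is not shown in the question; here is a minimal sketch of what such a wrapper might look like using NLTK's VADER analyser. The print format and the zero cutoff for the positive/negative label are assumptions chosen to match the output above:
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download('vader_lexicon')

analyzer = SentimentIntensityAnalyzer()

def sentiment_analyse(text):
    # polarity_scores returns {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
    scores = analyzer.polarity_scores(text)
    for key in ('compound', 'neg', 'neu', 'pos'):
        print('%s: %s,' % (key, scores[key]))
    print('positive' if scores['compound'] >= 0 else 'negative')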

Related

How can I use "Interpolated Absolute Discounting" for a bigram model in language modeling?

I want to compare two smoothing methods for a bigram model:
Add-one smoothing
Interpolated Absolute Discounting
For the first method, I found some code:
def calculate_bigram_probability(self, previous_word, word):
    # raw counts for the bigram and for its history word
    bigram_word_probability_numerator = self.bigram_frequencies.get((previous_word, word), 0)
    bigram_word_probability_denominator = self.unigram_frequencies.get(previous_word, 0)
    if self.smoothing:
        # add-one (Laplace) smoothing: add 1 to the count and the
        # vocabulary size to the denominator
        bigram_word_probability_numerator += 1
        bigram_word_probability_denominator += self.unique_bigram_words
    return 0.0 if bigram_word_probability_numerator == 0 or bigram_word_probability_denominator == 0 else \
        float(bigram_word_probability_numerator) / float(bigram_word_probability_denominator)
However, I found nothing for the second method except some references to 'KneserNeyProbDist', and that is for trigrams!
How can I change my code above to calculate it? The parameters of this method must be estimated from a development set.
In this answer I clear up a few things I found out about your problem, but I can't provide a coded solution.
With KneserNeyProbDist you seem to refer to a Python implementation of that problem: https://kite.com/python/docs/nltk.probability.KneserNeyProbDist
There is an article about Kneser–Ney smoothing on Wikipedia: https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing
That article links to this tutorial: https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf, but it has a small fault on the most important page, 29. The clear text is this:
Modified Kneser-Ney
Chen and Goodman introduced modified Kneser-Ney:
- Interpolation is used instead of backoff.
- Uses a separate discount for one- and two-counts instead of a single discount for all counts.
- Estimates discounts on held-out data instead of using a formula based on training counts.
Experiments show all three modifications improve performance.
Modified Kneser-Ney consistently had best performance.
Regrettably, the modified version is not explained in that document.
Luckily, the original paper by Chen & Goodman is available; modified Kneser–Ney smoothing is explained on page 370 of this document: http://u.cs.biu.ac.il/~yogo/courses/mt2013/papers/chen-goodman-99.pdf.
I copy the most important text and formulas here:
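(The screenshot itself is not reproduced here, but from that page of the paper the key definitions are the following.) Modified Kneser–Ney replaces the single discount D of absolute discounting with three discounts that depend on the count c of the n-gram:
D(c) = 0     if c = 0
D(c) = D1    if c = 1
D(c) = D2    if c = 2
D(c) = D3+   if c >= 3
In the formula-based variant, Chen & Goodman estimate the discounts from the training counts as
Y   = n1 / (n1 + 2*n2)
D1  = 1 - 2*Y*(n2/n1)
D2  = 2 - 3*Y*(n3/n2)
D3+ = 3 - 4*Y*(n4/n3)
where n_c is the number of n-grams occurring exactly c times in the training data; alternatively, the discounts are tuned on held-out data.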
So modified Kneser–Ney smoothing is now known and seems to be the best solution; translating the description next to the formulas into running code is still one step to do.
It might also help that, just below the text shown above, the original linked document contains some further explanation of the raw description.
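Since the question explicitly asks for code, here is a minimal sketch of interpolated absolute discounting for a bigram model, written against the same bigram_frequencies/unigram_frequencies dictionaries as the code in the question. The default discount value is an assumption: as the question notes, it should be tuned on a development set (a common formula-based starting point is D = n1 / (n1 + 2*n2)).
def interpolated_absolute_discounting(previous_word, word,
                                      bigram_frequencies, unigram_frequencies,
                                      discount=0.75):
    # P(w | v) = max(c(v, w) - D, 0) / c(v) + lambda(v) * P(w)
    unigram_total = float(sum(unigram_frequencies.values()))
    p_unigram = unigram_frequencies.get(word, 0) / unigram_total
    history_count = unigram_frequencies.get(previous_word, 0)
    if history_count == 0:
        # unseen history: fall back to the unigram distribution
        return p_unigram
    bigram_count = bigram_frequencies.get((previous_word, word), 0)
    # discounted bigram probability
    discounted = max(bigram_count - discount, 0.0) / history_count
    # N1+(v, .): number of distinct words ever seen after previous_word
    continuations = sum(1 for (v, _w) in bigram_frequencies
                        if v == previous_word)
    # lambda(v): the probability mass freed up by discounting
    lam = discount * continuations / history_count
    return discounted + lam * p_unigram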

decompose text into multi-class

I am looking for a machine learning algorithm which serves as a string decoder (multi-label classifier).
For example, decompose an email xyz#qwe.abc.com into different classes, with or without probability:
xyz:owner
#:special_character
qwe:sub-domain
abc:domain
or, for 231115:
23:hour
11:minute
15:seconds
My case is more complex, but the above should guide me where to look.
For example I might have:
74als353d
74:family
als:ftype
353d:id
and
73das
7:family
as:ftype
3d:id
I came across the deep learning algorithm seq2seq. Would that be a good starting point?
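Whatever model you end up with (seq2seq, a CRF, or a plain classifier), the usual first step is to frame the task as character-level sequence labelling. Here is a minimal sketch of that framing; the B-/I- tag scheme and the helper name are illustrative assumptions, not an established API:
def encode_example(text, segments):
    # segments: list of (substring, label) pairs covering `text` in order,
    # e.g. [("74", "family"), ("als", "ftype"), ("353d", "id")]
    tags = []
    for substring, label in segments:
        tags.append("B-" + label)                          # first character of the segment
        tags.extend("I-" + label for _ in substring[1:])   # remaining characters
    assert len(tags) == len(text), "segments must cover the whole string"
    return list(zip(text, tags))

# the question's example: 74als353d -> 74:family, als:ftype, 353d:id
print(encode_example("74als353d",
                     [("74", "family"), ("als", "ftype"), ("353d", "id")]))
A model trained on such (character, tag) pairs predicts one tag per character, which gives you the decomposition plus per-class probabilities for free.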

Proj4JS projection definition for rotated pole

I am trying to define a rotated-pole projection in Proj4JS where the north pole is now at 48N, 176E. I haven't been able to find any other example of rotated poles in Proj4JS, so I have tried to convert one I found for PROJ.4.
proj4.defs('myProjection', '+proj=ob_tran +o_proj=latlon +o_lon_p=-176 +o_lat_p=48 +lon_0=0 +a=1 +to_meter=0.0174532925199');
That line of JS runs without problems, but when I try to use that projection
proj4('EPSG:4326', 'myProjection', [175, -41]);
I get this error
uncaught exception: myProjection
I've tried replacing the projection definition with the one for WGS84 and it works fine, so I believe my use of the function is correct; it's the parameters in that string that I am unsure of.
I think what you want is the so-called Azimuthal Equidistant projection. It's the best choice for measuring true distances radiating away from a center point.
If this is what you're looking for: I asked a similar question a while back over on GIS.SE, and for the coordinate you provided (48N, 176E) you could declare the Proj4js projection definition like so:
Proj4js.defs["CUSTOM:10001"] = "+proj=aeqd +lat_0=48.0 +lon_0=176.0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs";
I hope it helps.
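If you want to sanity-check that projection string outside the browser, here is a quick sketch using pyproj (an assumption: pyproj wraps the same underlying PROJ library that Proj4JS mirrors, so the string can be tested there first):
from pyproj import Transformer

aeqd = ("+proj=aeqd +lat_0=48.0 +lon_0=176.0 +x_0=0 +y_0=0 "
        "+ellps=WGS84 +datum=WGS84 +units=m +no_defs")
# always_xy=True means coordinates are passed as (lon, lat)
transformer = Transformer.from_crs("EPSG:4326", aeqd, always_xy=True)
x, y = transformer.transform(175.0, -41.0)
print(x, y)  # metres east/north of the (48N, 176E) centre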

ActionScript Number add and subtract strange behavior

Well, I just hit a problem with the following simple code:
trace( 0.01+0.05 ); // 0.060000000000000005
trace( 0.03-0.01 ); // 0.019999999999999997
I just want 0.01 + 0.05 to give me 0.06 and 0.03 - 0.01 to give me 0.02.
Does someone have an idea how to retrieve the correct results?
The imprecision is due to floating-point arithmetic. 0.01, 0.05 and 0.03 are all floating-point literals. Not every number (in fact, very few numbers) can be represented precisely in floating point.
For example, 0.5 can be, but 0.06 cannot. As a rule of thumb, the first 15 significant figures will be correct.
For more details, see http://en.wikipedia.org/wiki/Floating_point
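To see the effect concretely, here is the same arithmetic inspected in Python, which uses the same IEEE 754 double type as ActionScript's Number (a sketch for illustration only):
from decimal import Decimal

# the exact value actually stored for the literal 0.06
print(Decimal(0.06))   # 0.059999999999999997779553950749686919152736663818359375
print(0.01 + 0.05)     # 0.060000000000000005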
// Math.format does not exist in ActionScript; Number.toFixed gives the rounded string:
trace((0.01 + 0.05).toFixed(2)); // 0.06

Get the most probable color from a set of words

Are there any existing libraries or methods that let you figure out the most probable color for a set of words? For example, for cucumber, apple, grass, it gives me green. Has anyone worked in that direction before?
If I had to do it, I would search for images based on the words, using Google Images or similar, and recognize the most common color of the top n results.
That sounds like a pretty reasonable NLP problem, and one that's very easy to handle via map-reduce.
Identify a list of words and phrases that you call colors ['blue', 'green', 'red', ...].
Go over a large corpus of sentences, and for the sentences that mention a particular color, for every other word in that sentence, note down (word, color_name) in a file. (Map Step)
Then, for each word you have seen in your corpus, aggregate all the colors you have seen for it to get something like {'cucumber': {'green': 300, 'yellow': 34, 'blue': 2}, 'tomato': {'red': 900, 'green': 430}, ...} (Reduce Step)
Provided you use a large enough corpus (something like Wikipedia), and you figure out how to prune really small counts and rare words, you should be able to build a pretty comprehensive and robust dictionary mapping millions of items to their colors; a minimal sketch of the idea follows.
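Here is a minimal single-machine sketch of the map/reduce idea above; the tiny corpus and colour list are illustrative assumptions:
from collections import defaultdict

colors = {'blue', 'green', 'red', 'yellow'}
corpus = [
    "the cucumber was long and green",
    "a ripe red tomato",
    "green grass after the rain",
]

counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    mentioned = [w for w in words if w in colors]      # map step: colours in the sentence
    for color in mentioned:
        for word in words:
            if word not in colors:
                counts[word][color] += 1               # reduce step: aggregate per word

# the most probable colour for a word is its highest aggregated count
print(max(counts['grass'].items(), key=lambda kv: kv[1])[0])  # green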
Another way is to do a Google text search for combinations of each color with the word in question, and take the combination with the highest number of results. Here's a quick Python 2 script for that:
import urllib
import json
import itertools

def google_count(q):
    # query the Google AJAX web-search API and read the estimated hit count
    query = urllib.urlencode({'q': q})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    return int(data['cursor']['estimatedResultCount'])

colors = ['yellow', 'orange', 'red', 'purple', 'blue', 'green']
# get a list of google search counts
res = [google_count('"%s grass"' % c) for c in colors]
# pair the results with their corresponding colors
res2 = list(itertools.izip(res, colors))
# get the color with the highest score
print "%s is %s" % ('grass', sorted(res2)[-1][1])
This will print:
grass is green
Daniel's and Xi.lin's answers are very good ideas. Along the same axis, we could combine both with an approach similar to Xi.lin's but simpler: query Google Images with the word you want to find the associated color for, plus a "Color" filter (see the lower-left bar), and see which color yields more results.
I would suggest using a tightly defined set of sources if possible, such as Wikipedia and WordNet.
Here, for example, is Wordnet for "panda":
S: (n) giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca
(large black-and-white herbivorous mammal of bamboo forests of China and Tibet;
in some classifications considered a member of the bear family or of a separate
family Ailuropodidae)
S: (n) lesser panda, red panda, panda, bear cat, cat bear,
Ailurus fulgens (reddish-brown Old World raccoon-like carnivore;
in some classifications considered unrelated to the giant pandas)
Because of the concise, carefully constructed language, it is highly likely that any colour words used will be important. Here you can see that pandas are both black-and-white and reddish-brown.
If you identify subsections of Wikipedia (e.g. "Botanical Description") this will help to increase the relevance of your results. Also the first image in Wikipedia is very likely to be the best "definitive" one.
But, as with all statistical methods, you will get false positives (and negatives, though these are probably less of a problem).
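As a concrete sketch of the WordNet route, you can scan the glosses of a word's synsets for colour terms. This assumes NLTK with the wordnet corpus downloaded (nltk.download('wordnet')); the colour list is an illustrative assumption:
from nltk.corpus import wordnet as wn

colors = {'black', 'white', 'red', 'green', 'brown',
          'reddish-brown', 'black-and-white'}

def colors_in_glosses(word):
    found = set()
    for synset in wn.synsets(word):
        # tokenize the gloss crudely and keep any known colour terms
        for token in synset.definition().replace(';', ' ').split():
            if token in colors:
                found.add(token)
    return found

print(colors_in_glosses('panda'))  # e.g. {'black-and-white', 'reddish-brown'}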