Why does sacrebleu return a zero BLEU score for short sentences? - nltk

Why does sacrebleu need sentences to end with a dot? If I remove the dots, the score is zero.
import sacrebleu, nltk
sys = ["This is cat."]
refs = [["This is a cat."],
["This is a bad cat."]]
b3 = sacrebleu.corpus_bleu(sys, refs)
print("b3", b3.score)
print("b3", round(b3.score,2))
This returns the following:
b3 35.1862973998119
b3 35.19
When I remove the ending dots:
sys = ["This is cat"]
refs = [["This is a cat"],
["This is a bad cat"]]
b3 = sacrebleu.corpus_bleu(sys, refs)
print("b3", b3.score)
print("b3", round(b3.score,2))
It prints zero using sacrebleu, which again seems weird:
b3 0.0
b3 0.0

BLEU is defined as the geometric mean of (modified) n-gram precisions for unigrams up to 4-grams, times a brevity penalty. Because it is a geometric mean, if there is no matching 4-gram (no 4-tuple of words) in the whole test set, the 4-gram precision is zero and BLEU is 0 by definition. With the final dot, the tokenizer splits the dot off as its own token, so the hypothesis is four tokens long and contains one 4-gram; that 4-gram does not match the references, but sacrebleu's default smoothing then assigns it a small non-zero precision, which is why you get a non-zero score. Without the dot, the hypothesis has only three tokens, there are no 4-grams at all, the 4-gram precision cannot be smoothed, and the score collapses to zero.
BLEU was designed for scoring test sets with hundreds of sentences, where such a case is very unlikely. For scoring single sentences, you can use a sentence-level version of BLEU which applies some kind of smoothing, but the results are still not ideal. You can also use a character-based metric, e.g. chrF (sacrebleu -m chrf).
You can also pass use_effective_order=True to corpus_bleu so that only the n-gram orders that actually occur are counted instead of a fixed four. However, in that case the metric is not exactly what people refer to as BLEU.
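For illustration, a sketch of these options with the example above (assuming sacrebleu 2.x; the exact numbers depend on the version and its default smoothing):
import sacrebleu
sys = ["This is cat"]
refs = [["This is a cat"], ["This is a bad cat"]]
# Count only the n-gram orders that actually occur in the hypothesis
# (no longer standard BLEU, but non-zero for short sentences):
print(sacrebleu.corpus_bleu(sys, refs, use_effective_order=True).score)
# Sentence-level BLEU (smoothing and, in most versions, effective order are on by default):
print(sacrebleu.sentence_bleu(sys[0], [r[0] for r in refs]).score)
# Character n-gram F-score (chrF), usually a better choice for single sentences:
print(sacrebleu.corpus_chrf(sys, refs).score)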

Related

Why is alpha set to 15 in NLTK - VADER?

I am trying to understand what VADER does when it analyses sentences.
Why is the hyper-parameter alpha set to 15 here? I understand that the score is unstable when left unbounded, but why 15?
def normalize(score, alpha=15):
    """
    Normalize the score to be between -1 and 1 using an alpha that
    approximates the max expected value
    """
    norm_score = score / math.sqrt((score * score) + alpha)
    return norm_score
VADER's normalization equation is norm_score = x / sqrt(x^2 + alpha), which is the equation of an S-shaped (sigmoid-like) curve that maps the unbounded score x into the interval (-1, 1).
I have read the VADER research paper here: http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
Unfortunately, I could not find any justification for this particular formula or for 15 as the value of alpha. However, the experiments and the curve show that as x (the sum of the sentiment scores) grows, the normalized value approaches -1 or 1. In other words, as the number of scored words grows the score saturates towards -1 or 1, which suggests that VADER works better with short documents or tweets than with long documents.
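As a quick illustration (my own sketch, not from the paper), plugging increasing raw scores into the normalize function quoted above shows that saturation:
import math
def normalize(score, alpha=15):
    # VADER's normalization: map the unbounded sum of sentiment scores into (-1, 1)
    return score / math.sqrt((score * score) + alpha)
# As the raw score grows, the normalized value approaches 1 (or -1 for negative scores);
# alpha only controls how quickly the curve saturates.
for raw in [0.5, 1, 2, 5, 10, 20, 50]:
    print(raw, round(normalize(raw), 3))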

How to build a deep learning model that picks words from several distinct bags and forms a meaningful sentence [closed]

[Image: bags and how to choose from them]
Imagine I have 10 bags, ordered one after the other, i.e. Bag 1, Bag 2, ..., Bag n.
Each bag has a distinct set of words.
To understand what a bag is, consider a vocabulary of 10,000 words.
The first bag contains the words Hello, India, Manager.
That is, Bag 1 has 1's at the indices of the words present in the bag.
For example, Bag 1 will be of size 10000*1;
if Hello's index is 1, India's index is 2 and Manager's is 4,
it will be
[0 , 1, 1, 0 , 1 ,0,0,0,0.........]
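(For concreteness, a small sketch of how such a bag vector could be built in Python with numpy; the indices are just the ones from the example.)
import numpy as np
vocab_size = 10000
bag1_word_indices = [1, 2, 4]      # Hello, India, Manager in the example
bag1 = np.zeros(vocab_size)        # shape (10000,)
bag1[bag1_word_indices] = 1        # 1 at each contained word's index, 0 elsewhere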
I don't have a model yet.
I'm thinking of using story books, but it's still kind of abstract to me.
A word has to be chosen from each bag and assigned a number: word 1 (word from bag 1), word 2 (word from bag 2), and so on, and they must form a MEANINGFUL sentence in their numerical order!
First, we need a way for the computer to recognise a word, otherwise it cannot pick the correct one. That means at this stage we need to decide what we are teaching the computer to begin with (i.e. what is a verb, a noun, grammar), but I will assume we will dump a dictionary into it and give no information except the words themselves.
So that the computer can compute what sentences are, we need to convert them to numbers (one way would be to work alphabetically starting at 1, using the numbers as keys for a dictionary (a digital one this time!) and the words as the values). Now we can apply the same linear algebra techniques to this problem as to any other problem.
So we need to make generations of matrices of weights to multiply into the keys of the dictionary, then remove all the weights beyond the range of dictionary keys; the rest can be used to get the value in the dictionary and make a sentence. Optionally, you can also apply a threshold value to all the outputs of the matrix multiplication.
Now for the hard part: learning. Once you have a few (say 100) matrices, we need to "breed" the best ones (this is where human intervention is needed): you need to pick the 50 most meaningful sentences (which might be hard at first) and use them as the basis for your next 100 matrices (the easiest way would be to weight the 50 matrices randomly for a weighted mean, 100 times).
And the boring bit: keep running the generations over and over until you get to a point where your sentences are meaningful most of the time (of course there is no guarantee that they will always be meaningful, but that's the nature of ANNs).
If you find it doesn't work, you can use more layers (more matrices) and/or I recently heard of a different technique that dynamically changed the network but I can't really help with that.
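A rough sketch of the "breeding" step described above (the names and shapes are hypothetical, and numpy stands in for whatever linear-algebra library you use):
import numpy as np
def breed(selected_matrices, n_offspring=100):
    # selected_matrices: the ~50 weight matrices whose outputs gave the most
    # meaningful sentences; all matrices share the same shape.
    stacked = np.stack(selected_matrices)              # (50, rows, cols)
    offspring = []
    for _ in range(n_offspring):
        w = np.random.rand(len(selected_matrices))     # random breeding weights
        offspring.append(np.average(stacked, axis=0, weights=w))
    return offspring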
Have a database with thousands/millions of valid sentences.
Create a dictionary where each word represents a number (reserve 0 for "nothing", 1 for "start of sentence" and 2 for "end of sentence").
word_dic = { "_nothing_": 0, "_start_": 1, "_end_": 2, "word1": 3, "word2": 4, ...}
reverse_dic = {v:k for k,v in word_dic.items()}
Remember to add "_start_" and "_end_" at the beginning and end of all sentences in the database, and "_nothing_" after the end to pad them to a fixed length capable of containing all sentences. (Ideally, work with sentences of 10 or fewer words, so your model won't try to create bigger sentences.)
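A small sketch of that padding step (max_len is a placeholder for whatever sentence length you settle on):
def pad_sentence(words, max_len=12):
    # wrap the sentence with the special tokens, then pad with _nothing_
    tokens = ["_start_"] + words + ["_end_"]
    tokens += ["_nothing_"] * (max_len - len(tokens))
    return tokens
print(pad_sentence(["word1", "word2"]))
# -> ['_start_', 'word1', 'word2', '_end_', '_nothing_', ...]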
Transform all your sentences into sequences of indices:
#supposing you have an array of shape (sentences, length) as string:
indices = []
for word in database.reshape((-1,)):
    indices.append(word_dic[word])
indices = np.array(indices).reshape((sentences, length))
Transform this into categorical words with the keras function to_categorical()
cat_sentences = to_categorical(indices) #shape (sentences,length,dictionary_size)
Hint: keras has lots of useful text preprocessing functions here.
Separate training input and output data:
#input is the sentences except for the last word
x_train = cat_sentences[:,:-1,:]
#output is the same sentences shifted by one word (everything except the first word)
y_train = cat_sentences[:,1:,:]
Let's create an LSTM based model that will predict the next words from the previous words:
model = Sequential()
model.add(LSTM(dontKnow,return_sequences=True,input_shape=(None,dictionary_size)))
model.add(.....)
model.add(LSTM(dictionary_size,return_sequences=True,activation='sigmoid'))
#or a Dense(dictionary_size,activation='sigmoid')
Compile and fit this model with x_train and y_train:
model.compile(....)
model.fit(x_train,y_train,....)
Create an identical model using stateful=True in all LSTM layers:
newModel = ......
Transfer the weights from the trained model:
newModel.set_weights(model.get_weights())
Create your bags in a categorical way, shape (10, dictionary_size).
Use the model to predict one word from the _start_ word.
#reset the states of the stateful model before you start a 10 word prediction:
newModel.reset_states()
firstWord = newModel.predict(startWord) #startword is shaped as (1,1,dictionary_size)
The firstWord will be a vector with size dictionary_size telling (sort of) the probabilities of each existing word. Compare to the words in the bag. You can choose the highest probability, or use some random selecting if the probabilities of other words in the bag are also good.
#example taking the most probable word:
firstWord = np.array(firstWord == firstWord.max(), dtype=np.float32)
Do the same again, but now input firstWord in the model:
secondWord = newModel.predict(firstWord) #respect the shapes
Repeat the process until you get a sentence. Notice that you may find _end_ before the 10 words in the bag are satisfied. You may decide to finish the process with a shorter sentence then, especially if other word probabilities are low.
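Putting the prediction loop together, here is one possible sketch (it reuses word_dic, reverse_dic, newModel and the categorical bags from the steps above; masking the prediction with the bag is my reading of "compare to the words in the bag", not code from the answer):
import numpy as np
dictionary_size = len(word_dic)
def one_hot(index):
    v = np.zeros((1, 1, dictionary_size))
    v[0, 0, index] = 1.
    return v
newModel.reset_states()
current = one_hot(word_dic["_start_"])
sentence = []
for bag in bags:                                  # bags has shape (10, dictionary_size)
    scores = newModel.predict(current)[0, -1]     # model's score for every dictionary word
    if scores.argmax() == word_dic["_end_"]:      # model wants to stop before all bags are used
        break
    masked = scores * bag                         # keep only words present in this bag
    next_index = int(masked.argmax())             # or sample among the high-scoring bag words
    sentence.append(reverse_dic[next_index])
    current = one_hot(next_index)
print(" ".join(sentence))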

Stata drops variables that "predict failure perfectly" even though the correlation between the variables isn't 1 or -1?

I am running a logit regression on some data. My dependent variable is binary as are all but one of my independent variables.
When I run my regression, Stata drops many of my independent variables and gives the error:
"variable name" != 0 predicts failure perfectly
"variable name" dropped and "a number" obs not used
I know for a fact that some of the dropped variables don't predict failure perfectly. In other words, the dependent variable can take the value 1 whether the independent variable is 0 or 1.
Why is this happening and how can I resolve it?
Bivariate cross tabulation does not show the problem. Try this:
http://www.stata.com/support/faqs/statistics/completely-determined-in-logistic-regression/index.html
First confirm that this is what is happening [collinear]. (For your data, replace x1 and x2 with the independent variables of your model.)
1. Number the covariate patterns:
egen pattern = group(x1 x2)
2. Identify the pattern with only one outcome:
logit y x1 x2
predict p
summarize p
The extremes of p will be almost 0 or almost 1.
tab pattern if p < 1e-7 // (use a value here slightly bigger than the min)
or, if p is almost 1, use "if p > 1 - 1e-7" instead
list x1 x2 if pattern == XXXX // (use the value here from the tab step)
The above identifies the covariate pattern.
3. The covariate pattern that predicts the outcome perfectly may be meaningful to the researcher, or it may be an anomaly due to having many variables in the model.
4. Now you must get rid of the collinearity:
logit y x1 x2 if pattern ~= XXXX // (use the value here from the tab step)
Note that there is collinearity. You can omit the variable that logit drops, or drop another one.
5. Refit the model with the collinearity removed:
logit y x1
6. You may or may not want to include the covariate pattern that predicts the outcome perfectly. It depends on the answer to (3). If that covariate pattern is meaningful, you may want to exclude these observations from the model:
logit y x1 if pattern ~= XXXX
Here one would report:
"Covariate pattern such and such predicted the outcome perfectly."
"The best model for the rest of the data is ....xyz."

Stata: How to deactivate automatic omission because of collinearity

Is there any possibility to tell Stata not to automatically omit variables due to (near) collinearity in regressions? I use dummy variables to deal with outliers in my sample; i.e. they take the value 1 for only one observation and are zero for all others. Stata drops most of these dummies as it recognizes them as collinear, which of course is true, but they're not perfectly collinear and I'd like to keep them in the regression.
Stata will only drop perfectly collinear variables, so the answer is "no, you cannot".
I have answered a similar question yesterday; yes, it is possible to detect and fix this. The same covariate-pattern procedure applies, see the step-by-step answer to the previous question above.

Algorithm to generate all possible letter combinations of given string down to 2 letters

Trying to create an Anagram solver in AS3, such as this one found here:
http://homepage.ntlworld.com/adam.bozon/anagramsolver.htm
I'm having a problem wrapping my brain around generating all possible letter combinations for the various lengths of strings. If I was only generating permutations for a fixed length, it wouldn't be such a problem for me... but I'm looking to reduce the length of the string and obtain all the possible permutations from the original set of letters for a string with a max length smaller than the original string. For example, say I want a string length of 2, yet I have a 3 letter string of “abc”, the output would be: ab ac ba bc ca cb.
Ideally the algorithm would produce a complete list of possible combinations starting with the original string length, down to the smallest string length of 2. I have a feeling there is probably a small recursive algorithm to do this, but can't wrap my brain around it. I'm working in AS3.
Thanks!
For the purpose of writing an anagram solver of the kind you linked to, the algorithm that you are requesting is not necessary. It is also VERY expensive.
Let's look at a 6-letter word like MONKEY, for example. All 6 letters of the word are different, so you would create:
6*5*4*3*2*1 = 720 different 6-letter words
6*5*4*3*2 = 720 different 5-letter words
6*5*4*3 = 360 different 4-letter words
6*5*4 = 120 different 3-letter words
6*5 = 30 different 2-letter words
for a total of 1950 words.
Now, presumably you're not trying to spit out all 1950 words (e.g. 'OEYKMN') as anagrams (which they are, but most of them are also gibberish). I'm guessing you have a dictionary of legal English words, and you just want to check if any of those words are anagrams of the query word, with the option of not using all letters.
If that is the case, then the problem is simple.
To determine if two words are anagrams of each other, all you need to do is count how many times each letter is used, and compare these counts!
Let's restrict ourselves to the 26 letters A-Z, case insensitive. What you need to do is write a function countLetters that takes a word and returns an array of 26 numbers. The first number in the array corresponds to the count of the letter A in the word, the second number corresponds to the count of B, etc.
Then, two words W1 and W2 are exact anagram if countLetters(W1)[i] == countLetters(W2)[i] for every i! That is, each word uses each letter the exact same number of times!
For what I'd call sub-anagrams (MONEY is a sub-anagram of MONKEY), W1 is a sub-anagram of W2 if countLetters(W1)[i] <= countLetters(W2)[i] for every i! That is, the sub-anagram may use less of certain letters, but not more!
(note: MONKEY is also a sub-anagram of MONKEY).
This should give you a fast enough algorithm, where given a query string, all you need to do is read through the dictionary once, comparing the letter count array of each word against the letter count array of the query word. You can do some minor optimizations, but this should be good enough.
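For illustration, here is the same idea in Python (the original question is about AS3, but the logic translates directly; count_letters plays the role of countLetters):
def count_letters(word):
    # 26 counts, one per letter A-Z, case insensitive; non-letters are ignored
    counts = [0] * 26
    for ch in word.upper():
        if "A" <= ch <= "Z":
            counts[ord(ch) - ord("A")] += 1
    return counts
def is_sub_anagram(w1, w2):
    # w1 is a sub-anagram of w2 if it never uses a letter more often than w2 does
    c1, c2 = count_letters(w1), count_letters(w2)
    return all(a <= b for a, b in zip(c1, c2))
print(is_sub_anagram("MONEY", "MONKEY"))   # True
print(is_sub_anagram("DONKEY", "MONKEY"))  # False (needs a D)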
Alternatively, if you want utmost performance, you can preprocess the dictionary (which is known in advance) and create a directed acyclic graph of sub-anagram relationship.
Here's a portion of such a graph for illustration:
D=1,G=1,O=1 ----------> D=1,O=1
{dog,god} \ {do,od}
\
\-------> G=1,O=1
{go}
Basically each node is a bucket for all words that have the same letter count array (i.e. they're exact anagrams). Then there's an edge from N1 to N2 if N2's array is <= (as defined above) N1's array (you can perform transitive reduction to store the least amount of edges).
Then to list all sub-anagrams of a word, all you have to do is find the node corresponding to its letter count array, and recursively explore all nodes reachable from that node. All their buckets would contain the sub-anagrams.
The following js code will find all possible "words" in an n letter word. Of course this doesn't mean that they are real words but does give you all the combinations. On my machine it takes about 0.4 seconds for a 7 letter word and 15 secs for a 9 letter word (up to almost a million possibilities if no repeated letters). However those times include looking in a dictionary and finding which are real words.
var getWordsNew = function (masterword) {
    var result = {};
    var a, i, l;
    // Recursively append each unused letter to the current key,
    // recording every prefix generated along the way.
    function nextLetter(a, l, key, used) {
        var i;
        if (key.length == l) {
            return;
        }
        for (i = 0; i < l; i++) {
            if (used.indexOf("" + i) < 0) {
                result[key + a[i]] = "";
                nextLetter(a, l, key + a[i], used + i);
            }
        }
    }
    a = masterword.split("");
    l = a.length;
    for (i = 0; i < a.length; i++) {
        result[a[i]] = "";
        nextLetter(a, l, a[i], "" + i);
    }
    return result;
};
Complete code at: Code for finding words in words
You want a sort of arrangement (an ordered selection of letters). If you're familiar with the permutation-generating algorithm, then you know it has a check to see when it has used enough elements; just change that limit.
I don't know AS3, but here's some pseudocode:
st = an array
Arrangements(LettersInYourWord, MinimumLettersInArrangement, k = 1)
    if ( k > MinimumLettersInArrangement )
    {
        print st;
    }
    if ( k > LettersInYourWord )
        return;
    for ( each position i in your word that hasn't been used before )
    {
        st[k] = YourWord[i];
        Arrangements(<same>, <same>, k + 1);
    }
for "abc" and Arrangements(3, 2, 1); this will print:
ab
abc
ac
acb
...
If you want those with three first, and then those with two, consider this:
st = an array
Arrangements(LettersInYourWord, DesiredLettersInArrangement, k = 1)
    if ( k > DesiredLettersInArrangement )
    {
        print st;
        return;
    }
    for ( each position i in your word that hasn't been used before )
    {
        st[k] = YourWord[i];
        Arrangements(<same>, <same>, k + 1);
    }
Then for "abc" call Arrangements(3, 3, 1); and then Arrangements(3, 2, 1);
You can generate all words in an alphabet by finding all paths in a complete graph of the letters. You can find all paths in that graph by doing a depth first search from each letter and returning the current path at each point.
There is a simple O(N) approach, where N is the size of the vocabulary. Just sort the letters of each word in the vocabulary or, better, create a binary mask of its letters, and then compare with the letters you have.
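A sketch of the sorted-letters variant in Python (the word list is a placeholder): preprocess the vocabulary once, then every exact-anagram lookup is a single dictionary access.
from collections import defaultdict
vocabulary = ["dog", "god", "cat", "act", "monkey"]   # placeholder vocabulary
# Group vocabulary words by their sorted letters (done once, up front)
anagram_index = defaultdict(list)
for word in vocabulary:
    anagram_index["".join(sorted(word))].append(word)
def exact_anagrams(letters):
    return anagram_index.get("".join(sorted(letters)), [])
print(exact_anagrams("odg"))   # ['dog', 'god']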