What's the difference between NLTK's BLEU score and SacreBLEU? - nltk

I'm curious if anyone is familiar with the difference between using NLTK's BLEU score calculation and the SacreBLEU library.
In particular, I'm using both libraries' sentence BLEU scores, averaged over the entire dataset. The two give different results:
>>> from nltk.translate import bleu_score
>>> import sacrebleu
>>> from sacrebleu import sentence_bleu
>>> print(len(predictions))
256
>>> print(len(targets))
256
>>> prediction = "this is the first: the world's the world's the world's the \
... world's the world's the world's the world's the world's the world's the world \
... of the world of the world'"
...
>>> target = "al gore: so the alliance for climate change has launched two campaigns."
>>> print(bleu_score.sentence_bleu([target], prediction))
0.05422283394039736
>>> print(sentence_bleu(prediction, [target]).score)
0.0
>>> print(sacrebleu.corpus_bleu(predictions, [targets]).score)
0.678758518214081
>>> print(bleu_score.corpus_bleu([targets], [predictions]))
0
As you can see, there are a lot of confusing inconsistencies going on. There's no way my BLEU score is 67.8%, but it also shouldn't be 0% (there are a lot of overlapping n-grams like "the").
I'd appreciate it if anyone could shed some light on this. Thanks.

NLTK and SacreBLEU use different tokenization rules, mostly in how they handle punctuation. NLTK applies its own tokenization, whereas SacreBLEU replicates the original Perl implementation from 2002. NLTK's tokenization rules are arguably more elaborate, but they make the numbers incomparable with the original implementation.
The corpus BLEU you got from SacreBLEU is not 67.8% but roughly 0.68%: SacreBLEU scores are already multiplied by 100, unlike NLTK's. So I would not say there is a huge difference between the scores.
Sentence-level BLEU can use different smoothing techniques that keep the score at a reasonable value even when the 3-gram or 4-gram precision is zero. However, note that BLEU is very unreliable as a sentence-level metric.
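For illustration, here is a minimal sketch of the two conventions (the example strings are invented): NLTK expects pre-tokenized input, returns a fraction in [0, 1], and needs explicit smoothing at the sentence level, while SacreBLEU takes raw strings and reports scores on a 0-100 scale.
# Minimal sketch, assuming nltk and sacrebleu are installed; example strings are made up.
from nltk.translate.bleu_score import sentence_bleu as nltk_sentence_bleu, SmoothingFunction
import sacrebleu

hyp = "the cat sat on the mat"
ref = "the cat is on the mat"

# NLTK: token lists in, fraction in [0, 1] out; without smoothing,
# a zero higher-order n-gram precision can zero out the whole score.
nltk_score = nltk_sentence_bleu([ref.split()], hyp.split(),
                                smoothing_function=SmoothingFunction().method1)

# SacreBLEU: raw strings in, its own tokenization, score already multiplied by 100.
sacre_score = sacrebleu.sentence_bleu(hyp, [ref]).score

print(nltk_score, sacre_score / 100)  # comparable only after dividing by 100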

Related

Can LDAvis analyse the results of vowpal_wabbit LDA?

LDAvis provides an excellent way of visualising and exploring topic models. LDAvis requires 5 parameters:
phi (matrix with dimensions number of terms times number of topics)
theta (matrix with dimensions number of documents times number of topics)
number of words per document (integer vector)
the vocabulary (character vector)
the word frequency in the whole corpus (integer vector)
The short version of my question is: after fitting an LDA model with Vowpal Wabbit, how does one derive phi and theta?
theta represents the mixture of topics per document, and must thus sum to 1 per document.
phi represents the probability of a term given the topic, and must thus sum to 1 per topic.
After running LDA with Vowpal Wabbit (vw), some kind of weights are stored in a model. A human-readable version of that model can be acquired by feeding it a special file, with one document per term in the vocabulary, while disabling learning (via the -t parameter), e.g.
vw -t -i weights -d dictionary.vw --readable_model readable.model.txt
According to the documentation of Vowpal Wabbit, all columns except the first one of readable.model.txt now "represent the per-word topic distributions."
You can also generate predictions with vw, i.e. for a collection of documents
vw -t -i weights -d some-documents.txt -p predictions.txt
Both predictions.txt and readable.model.txt have dimensions that reflect the number of inputs (rows) and the number of topics (columns), and neither of them is a probability distribution, because they do not sum to 1 (neither per row nor per column).
I understand that vw is not for the faint-hearted and that some programming/scripting will be required on my part, but I'm sure there must be some way to derive theta and phi from the output of vw. I've been stuck on this problem for days now; please give me some hints.
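For concreteness, here is a naive sketch of what I mean by deriving theta and phi, under the unverified assumption that vw's raw outputs just need normalizing; the matrices below are random placeholders standing in for parsed predictions.txt and readable.model.txt.
# Naive sketch only: whether plain normalization is the right reading of vw's
# outputs is exactly what is being asked. All numbers here are placeholders.
import numpy as np

rng = np.random.default_rng(0)
doc_topic_weights = rng.random((256, 10))     # placeholder for predictions.txt (documents x topics)
term_topic_weights = rng.random((5000, 10))   # placeholder for readable.model.txt (terms x topics)

# theta: one distribution over topics per document, so rows sum to 1
theta = doc_topic_weights / doc_topic_weights.sum(axis=1, keepdims=True)

# phi: one distribution over terms per topic, so columns sum to 1
phi = term_topic_weights / term_topic_weights.sum(axis=0, keepdims=True)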
I don't know how to directly use pyLDAvis with Vowpal Wabbit.
However, since you are already using a Python tool, you could use the Gensim wrapper and pyLDAvis together.
A Python wrapper for Vowpal Wabbit is offered in gensim (< 4.0.0).
After converting with vwmodel2ldamodel, you can simply use Gensim as if you had trained the model with Gensim itself.
This workaround might be the easiest way if you are not familiar with the internals of Vowpal Wabbit (and of LDA in general).
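A rough sketch of that workaround, assuming gensim < 4.0.0 (which still ships the wrapper), a vw binary at the path shown, and an older pyLDAvis that exposes pyLDAvis.gensim (newer releases renamed it pyLDAvis.gensim_models); the toy corpus is invented:
from gensim.corpora import Dictionary
from gensim.models.wrappers.ldavowpalwabbit import LdaVowpalWabbit, vwmodel2ldamodel
import pyLDAvis
import pyLDAvis.gensim  # pyLDAvis.gensim_models in newer releases

# tiny illustrative corpus
docs = [["human", "interface", "computer"], ["survey", "user", "computer"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# train with the Vowpal Wabbit wrapper, then convert to a regular gensim LdaModel
vw_lda = LdaVowpalWabbit("/usr/bin/vw", corpus=corpus, num_topics=2, id2word=dictionary)
lda = vwmodel2ldamodel(vw_lda)

# from here on it is plain gensim + pyLDAvis
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda.html")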

Why do Mel-filterbank energies outperform MFCCs for speech commands recognition using CNN?

Last month, a user called #jojek gave me the following advice in a comment:
I can bet that given enough data, CNN on Mel energies will outperform MFCCs. You should try it. It makes more sense to do convolution on Mel spectrogram rather than on decorrelated coefficients.
Yes, I tried CNN on Mel-filterbank energies, and it outperformed MFCCs, but I still don't know the reason!
Although many tutorials, like this one by Tensorflow, encourage the use of MFCCs for such applications:
Because the human ear is more sensitive to some frequencies than others, it's been traditional in speech recognition to do further processing to this representation to turn it into a set of Mel-Frequency Cepstral Coefficients, or MFCCs for short.
Also, I want to know whether Mel-filterbank energies outperform MFCCs only with CNNs, or whether this also holds for LSTMs, DNNs, etc., and I would appreciate it if you could add a reference.
Update 1:
While my comment on #Nikolay's answer contains the relevant details, I will add it here:
Correct me if I'm wrong: since applying the DCT to the Mel-filterbank energies is, in this case, equivalent to an IDFT, it seems to me that keeping cepstral coefficients 2-13 (inclusive) and discarding the rest is equivalent to low-time liftering, which isolates the vocal-tract components and drops the source components (which contain, e.g., the F0 spike).
So why should I use all 40 MFCCs, when all I care about for the speech-command recognition model is the vocal-tract components?
Update 2
Another point of view (link) is:
Notice that only 12 of the 26 DCT coefficients are kept. This is because the higher DCT coefficients represent fast changes in the filterbank energies and it turns out that these fast changes actually degrade ASR performance, so we get a small improvement by dropping them.
References:
https://tspace.library.utoronto.ca/bitstream/1807/44123/1/Mohamed_Abdel-rahman_201406_PhD_thesis.pdf
The thing is that MFCCs are calculated from the mel energies with a simple matrix multiplication followed by a reduction of dimension. The matrix multiplication by itself doesn't change much, since the neural network applies many further transformations afterwards.
What matters is the reduction of dimension, where instead of 40 mel energies you keep 13 coefficients and drop the rest. That reduces accuracy with a CNN, DNN or whatever.
However, if you don't drop anything and still use all 40 MFCCs, you can get the same accuracy as with mel energies, or even better.
So it doesn't matter whether you use mel energies or MFCCs; what matters is how many coefficients you keep in your features.
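A small numerical sketch of that point, using a random array as a stand-in for a 40-band log-mel spectrogram (no real audio; the 13/40 split just mirrors the numbers above): the DCT step is invertible if you keep all coefficients, so truncation is the only lossy part.
import numpy as np
from scipy.fftpack import dct, idct

# random stand-in for a log-mel spectrogram: 40 mel bands x 100 frames
log_mel = np.random.default_rng(0).random((40, 100))

# "MFCCs" here are just the DCT of the log-mel energies along the band axis
mfcc_full = dct(log_mel, type=2, axis=0, norm='ortho')   # all 40 coefficients
mfcc_13 = mfcc_full[:13]                                 # the usual truncation

# keeping all 40 coefficients loses nothing: the DCT is invertible
recovered = idct(mfcc_full, type=2, axis=0, norm='ortho')
print(np.allclose(recovered, log_mel))                   # True

print(mfcc_13.shape, mfcc_full.shape)                    # (13, 100) (40, 100)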

Is there a notation named sensor in nltk?

I am learning Stanford CS224N: natural language processing with Deep Learning.
Chris said
"very fine-grain differences between sensors that are a human being
can barely understand the difference between them and relate to"
in Lecture 1 while he is illustrating the piece of NLTK code.
Is there a notation named sensor in nltk? If yes, what does it mean?
I think YouTube's automatic captioning is wrong and that the lecturer pronounced the word synset.
And yes, there is a notation for synsets in NLTK; in fact, the notation comes from WordNet.
You can get a synset with:
from nltk.corpus import wordnet as wn
dog = wn.synset('dog.n.01')
where dog is the morphological stem of one of the lemmas, n is the part of speech (noun in this case), and 01 is a sense index.
According to the NLTK documentation:
Synset(wordnet_corpus_reader)
Create a Synset from a lemma.pos.number string where: lemma is the word’s morphological stem pos is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB number is the sense number, counting from 0.
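Once you have a synset, you can inspect it; a quick illustrative session (output trimmed):
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
print(dog.definition())     # a member of the genus Canis ...
print(dog.lemma_names())    # ['dog', 'domestic_dog', 'Canis_familiaris']
print(dog.hypernyms())      # [Synset('canine.n.02'), Synset('domestic_animal.n.01')]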

wordnet on different text?

I am new to nltk, and I find the wordnet functionality pretty useful. It gives synsets, hypernyms, similarity, etc. However, it fails to give a similarity between locations like 'Delhi' and 'Hyderabad', obviously because these words are not in the wordnet corpus.
So I would like to know whether I can somehow update the wordnet corpus OR create a wordnet over a different corpus, e.g. a set of pages extracted from Wikipedia related to travel. If we can create a wordnet over a different corpus, what would be the format, the steps to do so, and any limitations?
Please point me to links that address the above concerns. I have searched the internet, googled, and read portions of the nltk book, but I don't have a single hint for the above question.
Pardon me if the question sounds completely ridiculous.
For flexibility in measuring the semantic similarity of very specific terms like Delhi or Hyderabad, what you want is not something hand-crafted like WordNet, but an automatically learned similarity measure from a very large database. These are statistical similarity approaches. Of course, you want to avoid having to train such a model on data yourself...
Thus one thing that may be useful is the Google Distance (wikipedia, original paper). It seems fairly simple to implement such a measure in a language like R (code), and the original paper reports 87% agreement with WordNet.
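For reference, the Normalized Google Distance from that paper is simple enough to sketch; the hit counts below are made-up placeholders, since in practice they would come from a search engine's result counts.
from math import log

def ngd(f_x, f_y, f_xy, n):
    """Normalized Google Distance from hit counts f(x), f(y), f(x,y) and index size N (lower = more related)."""
    return (max(log(f_x), log(f_y)) - log(f_xy)) / (log(n) - min(log(f_x), log(f_y)))

# hypothetical counts for "Delhi", "Hyderabad", and pages mentioning both
print(ngd(f_x=2.2e8, f_y=9.0e7, f_xy=3.0e7, n=5e10))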
The similarity measures in Wordnet work as expected because Wordnet measures semantic similarity. In that sense, both are cities, so they are very similar. What you are looking for is probably called geographic similarity.
from nltk.corpus import wordnet as wn

delhi = wn.synsets('Delhi', 'n')[0]
print(delhi.definition())
# a city in north central India
hyderabad = wn.synsets('Hyderabad', 'n')[0]
print(hyderabad.definition())
# a city in southern Pakistan on the Indus River
print(delhi.wup_similarity(hyderabad))
# 0.9
melon = wn.synsets('melon', 'n')[0]
print(delhi.wup_similarity(melon))
# 0.3
There is a Wordnet extension called Geowordnet. I had a similar problem at one point and tried to unify Wordnet with some of its extensions: wnext. Hope that helps.

S -> NP VP, do these sentences follow this format?

I am parsing some sentences (from the inaugural speech in the nltk corpus) with the format S -> NP VP, and I want to make sure I parsed them correctly. Do the following sentences follow the aforementioned format? Sorry if this question seems trivial; English is not my first language. If anyone has any questions about whether a given sentence follows NP VP, ask me and I will give you my reasons for picking it, along with its parse tree.
god bless you
our capacity remains undiminished
their memories are short
they are serious
these things are true
the capital was abandoned
they are many
god bless the united states of america
the enemy was advancing
all this we can do
all this we will do
Thanks in advance.
The first 9 are NP VP. In the last two, "all this" is the direct object, which is part of the VP.
god bless you
NP- VP-------
our capacity remains undiminished
NP---------- VP------------------
their memories are short
NP------------ VP-------
they are serious
NP-- VP---------
these things are true
NP---------- VP------
the capital was abandoned
NP--------- VP-----------
they are many
NP-- VP------
god bless the united states of america
NP- VP--------------------------------
the enemy was advancing
NP------- VP-----------
all this we can do
VP------ NP VP----
all this we will do
VP------ NP VP-----
Note that the last two sentences are semantically equivalent to "We can do all this" and "We will do all this", a word order that makes the subject/predicate breakdown easier.
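If you want to sanity-check an analysis mechanically, a toy grammar with NLTK's chart parser works for the simple cases; the grammar below is invented for illustration and only covers two of the sentences.
import nltk

# toy grammar, invented for illustration; it only covers a couple of the sentences
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> N | Det N | Pron
VP -> V NP | V Adj
N -> 'god' | 'capital'
Det -> 'the'
Pron -> 'you'
V -> 'bless' | 'was'
Adj -> 'abandoned'
""")

parser = nltk.ChartParser(grammar)
for sentence in ["god bless you", "the capital was abandoned"]:
    for tree in parser.parse(sentence.split()):
        tree.pretty_print()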