Word synonym / antonym detection with deep learning

I need to create a classifier that takes two words and determines whether they are synonyms or antonyms. I tried NLTK's synonym/antonym relations, but they don't have enough data.
Examples:
capitalism <-[antonym]-> socialism
capitalism =[synonym]= free market
god <-[antonym]-> atheism
political correctness <-[antonym]-> free speech
advertising =[synonym]= marketing
I was thinking about taking a BERT model, because some of these relations may already be embedded in it, and transfer-learning it on a dataset that I found.

I would suggest the following pipeline:
1. Construct a training set from an existing dataset of synonyms and antonyms (taken e.g. from the WordNet thesaurus). You'll need to craft negative examples carefully.
2. Take a pretrained model such as BERT and fine-tune it on your task. If you choose BERT, it should probably be BertForNextSentencePrediction, where you use your words/phrases instead of sentences and predict 1 if they are synonyms and 0 if they are not; the same goes for antonyms.
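A minimal sketch of what step 2 could look like with the Hugging Face transformers library, assuming a plain sequence-pair classification head (BertForSequenceClassification) rather than the NSP head; the three-class label scheme, hyperparameters, and toy word pairs are illustrative assumptions, not part of the answer above.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Hypothetical label scheme: 0 = unrelated, 1 = synonym, 2 = antonym
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Toy training pairs; in practice these come from the WordNet-derived dataset
pairs = [("advertising", "marketing", 1),
         ("capitalism", "socialism", 2),
         ("capitalism", "banana", 0)]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for word_a, word_b, label in pairs:
    # Encode the pair as [CLS] word_a [SEP] word_b [SEP], as in sentence-pair tasks
    enc = tokenizer(word_a, word_b, return_tensors="pt")
    out = model(**enc, labels=torch.tensor([label]))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()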

Giving pretokenized input to sentiment classifier

I am using the sentiment classifier in Python according to this demo.
Is it possible to give pre-tokenized text as input to the predictor? I would like to be able to use my own custom tokenizer.
There are two AllenNLP sentiment analysis models, and they are both tightly tied to their tokenizations. The GloVe-based one needs tokens that correspond to the pre-trained GloVe embeddings, and similarly the RoBERTa one needs tokens (word pieces) that correspond to its pretraining. It does not really make sense to use these models with a different tokenizer.
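For context, a hedged sketch of the usual way these predictors are used; the archive URL and the allennlp_models import path are illustrative and may differ between allennlp-models versions. The point is that raw text goes in and the model's own tokenizer is applied internally:

from allennlp.predictors.predictor import Predictor
import allennlp_models.classification  # registers the sentiment models (module path may vary by version)

# Illustrative archive path for the RoBERTa SST model; check the AllenNLP demo for the current one
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/sst-roberta-large-2020.06.08.tar.gz"
)
# Raw text in; the RoBERTa word-piece tokenization happens inside the predictor
print(predictor.predict(sentence="a very well-made, funny and entertaining picture"))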

Why do Mel-filterbank energies outperform MFCCs for speech commands recognition using CNN?

Last month, a user called @jojek gave me the following advice in a comment:
I can bet that given enough data, CNN on Mel energies will outperform MFCCs. You should try it. It makes more sense to do convolution on Mel spectrogram rather than on decorrelated coefficients.
Yes, I tried CNN on Mel-filterbank energies, and it outperformed MFCCs, but I still don't know the reason!
This is despite the fact that many tutorials, like this one by TensorFlow, encourage the use of MFCCs for such applications:
Because the human ear is more sensitive to some frequencies than others, it's been traditional in speech recognition to do further processing to this representation to turn it into a set of Mel-Frequency Cepstral Coefficients, or MFCCs for short.
Also, I want to know whether Mel-filterbank energies outperform MFCCs only with CNNs, or whether this is also true with LSTMs, DNNs, etc., and I would appreciate a reference.
Update 1:
My comment on @Nikolay's answer contains the relevant details, so I will add it here as well:
Correct me if I'm wrong: since applying the DCT to the Mel-filterbank energies is, in this case, equivalent to an IDFT, it seems to me that keeping cepstral coefficients 2-13 (inclusive) and discarding the rest is equivalent to low-time liftering, which isolates the vocal-tract components and drops the source components (which contain, e.g., the F0 spike).
So why should I use all 40 MFCCs, when all I care about for the speech-command recognition model is the vocal-tract components?
Update 2
Another point of view (link) is:
Notice that only 12 of the 26 DCT coefficients are kept. This is because the higher DCT coefficients represent fast changes in the filterbank energies and it turns out that these fast changes actually degrade ASR performance, so we get a small improvement by dropping them.
References:
https://tspace.library.utoronto.ca/bitstream/1807/44123/1/Mohamed_Abdel-rahman_201406_PhD_thesis.pdf
The thing is that MFCCs are calculated from the Mel energies with a simple matrix multiplication (the DCT) followed by a reduction of dimension. The matrix multiplication itself doesn't really affect anything, since the neural network applies many other operations afterwards.
What is important is the reduction of dimension, where instead of 40 Mel energies you keep only 13 coefficients and drop the rest. That reduces accuracy, with a CNN, a DNN, or whatever else.
However, if you don't drop anything and use all 40 MFCCs, you can get the same accuracy as with Mel energies, or even better.
So it doesn't matter whether you use Mel energies or MFCCs; what matters is how many coefficients you keep in your features.
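To make the relationship concrete, here is a minimal sketch assuming librosa and SciPy (the audio clip is just a bundled example; any mono waveform works): the MFCCs are obtained from the log-Mel energies by a DCT, so keeping all 40 coefficients preserves the same information, while truncating to 13 is where information is lost.

import librosa
import scipy.fftpack

y, sr = librosa.load(librosa.example("trumpet"))

# 40 log-Mel filterbank energies per frame
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
log_mel = librosa.power_to_db(mel)

# Full DCT: an invertible linear transform, so the same information as log_mel
mfcc_40 = scipy.fftpack.dct(log_mel, axis=0, type=2, norm="ortho")

# The usual truncation: keep only the first 13 coefficients, dropping information
mfcc_13 = mfcc_40[:13, :]

print(log_mel.shape, mfcc_40.shape, mfcc_13.shape)  # (40, T), (40, T), (13, T)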

One-hot encoding to word2vec embedding

I'm trying to create vectors for categorical information that I have at hand. This information is intended to aid a seq2seq network for NLP purposes (like summarization).
To get the idea, maybe an example would be of help:
Sample Text: shark attacks off Florida in a 1-hour span
And suppose that we have this hypothetical categorical information:
1. [animal, shark, sea, ocean]
2. [animal, tiger, jungle, mountains]
...
19. [animal, eagle, sky, mountains]
I want to feed the sample text to an LSTM network token by token (as in seq2seq networks). I'm using pre-trained GloVe embeddings as the original embeddings fed into the network, but I also want to concatenate a dense vector to each token denoting its category.
For now, I know that I can simply use a one-hot encoding (0-1 binary). So, for example, the first input (for shark) to the RNN network would be:
# GloVe embeddings of shark + one-hot encoding for shark, + means concatenation
[-0.323 0.213 ... -0.134 0.934 0.031 ] + [1 0 0 0 0 ... 0 0 1]
The problem is that I have an extreme number of categories (around 20,000). After searching the Internet, it seems that people suggest using word2vec instead of one-hot vectors, but I can't grasp the underlying idea of how word2vec can represent the categorical features in this case. Does anybody have a clearer idea?
Word2Vec by itself can't be used for classification; it is just the underlying algorithm.
For classification you can use Doc2Vec or something similar.
It basically takes a list of documents, each with a unique id assigned to it. After training, it builds relations between the documents similar to those word2vec builds between words. Then, when you give it an unknown document, it will tell you the top n most similar ones, and if your documents have previously defined tags, you can assume the unknown document can be labeled the same way.
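A hedged sketch of that idea with gensim's Doc2Vec, treating each category's word list as a tagged document; the category names, word lists, and hyperparameters are toy placeholders:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

categories = {
    "cat_1": ["animal", "shark", "sea", "ocean"],
    "cat_2": ["animal", "tiger", "jungle", "mountains"],
    "cat_19": ["animal", "eagle", "sky", "mountains"],
}

docs = [TaggedDocument(words=words, tags=[name]) for name, words in categories.items()]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=100)

# A dense vector per category, usable in place of a 20,000-dimensional one-hot vector
shark_category_vec = model.dv["cat_1"]

# For an unseen word list, infer a vector and look up the most similar categories
new_vec = model.infer_vector(["animal", "whale", "ocean"])
print(model.dv.most_similar([new_vec], topn=3))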

WordNet on different text?

I am new to NLTK, and I find the WordNet functionality pretty useful. It gives synsets, hypernyms, similarity, etc. However, it fails to give a similarity between locations like 'Delhi' and 'Hyderabad', obviously because these words are not in the WordNet corpus.
So I would like to know whether I can somehow update the WordNet corpus, OR create a WordNet over a different corpus, e.g. a set of pages extracted from Wikipedia related to travel. If we can create a WordNet over a different corpus, what would be the format, the steps to do so, and the limitations?
Please point me to links that address the above concerns. I have searched the internet and read portions of the NLTK book, but I don't have a single hint for the question above.
Pardon me if the question sounds completely ridiculous.
For flexibility in measuring the semantic similarity of very specific terms like Delhi or Hyderabad, what you want is not something hand-crafted like WordNet, but an automatically-learned similarity measure derived from a very large database. These are statistical similarity approaches. Of course, you want to avoid having to train such a model on data yourself...
Thus one thing that may be useful is the Normalized Google Distance (wikipedia, original paper). It seems fairly simple to implement such a measure in a language like R (code), and the original paper reports 87% agreement with WordNet.
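The NGD formula is simple enough to sketch directly in Python as well; the page counts in this example are made-up placeholders, since in practice f(x), f(y), and f(x, y) come from search-engine (or corpus) hit counts:

import math

def ngd(fx, fy, fxy, n):
    # fx, fy: number of pages containing each term
    # fxy:    number of pages containing both terms
    # n:      total number of pages indexed
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))

# Hypothetical counts just to show the call; a smaller NGD means more closely related terms
print(ngd(fx=2.3e7, fy=1.1e7, fxy=4.0e6, n=5.0e10))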
The similarity measures in WordNet work as expected, because WordNet measures semantic similarity. In that sense both are cities, so they are very similar. What you are looking for would probably be called geographic similarity.
from nltk.corpus import wordnet as wn

delhi = wn.synsets('Delhi', 'n')[0]
print(delhi.definition())
# a city in north central India

hyderabad = wn.synsets('Hyderabad', 'n')[0]
print(hyderabad.definition())
# a city in southern Pakistan on the Indus River

# Wu-Palmer similarity: both are cities, so WordNet considers them very similar
delhi.wup_similarity(hyderabad)
# 0.9

melon = wn.synsets('melon', 'n')[0]
delhi.wup_similarity(melon)
# 0.3
There is a WordNet extension called GeoWordNet. I had more or less the same problem as you at one point and tried to unify WordNet with some of its extensions: wnext. Hope that helps.

Find similarity of a sentence with 6 basic emotions using WordNet

I'm working on a project, and a part of it needs to detect the emotion of the text we work on.
For example,
He is happy to go home.
I'll be taking two words from the above sentence, i.e. happy and home.
I'll have a table containing 6 basic emotions (happy, sad, fear, anger, disgust, surprise).
Each of these emotions will have some synsets associated with it.
I need to find the similarity between these synsets and the word happy, and then the similarity between these synsets and the word home.
I tried to use WordNet for this purpose but couldn't understand how WordNet works, as I'm new to this.
I think you want to find words in the sentence that are similar to any of the words representing any of the 6 given basic emotions. If I am correct, I think you can use the following solution.
First, extract the synset of each word sense representing the 6 basic emotions. Now form a vectorized representation of each of these synsets (collections of synonymous words). You can do this using the word2vec tool available at https://code.google.com/archive/p/word2vec/. E.g.:
Suppose "happy" has the word senses a1, a2, a3 as its synonymous words. Then:
1. First train the word2vec tool on any large English corpus, e.g. the Bojar corpus.
2. Then, using the trained word2vec model, obtain word embeddings (vectorized representations) of each synonymous word a1, a2, a3.
3. The vectorized representation of the synset of "happy" is then the average of the vectorized representations of a1, a2, a3.
4. In this way you can obtain a vectorized representation of the synset of each of the 6 basic emotions.
Now, for the given sentence, find the vectorized representation of each word using the trained word2vec vocabulary. You can then use cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity) to find the distance (similarity) of each word from the synsets of the 6 basic emotions. In this way you can determine the (basic-level) emotion of the sentence.
Source of the technique: the research paper "Unsupervised Most Frequent Sense Detection using Word Embeddings" by Sudha et al. (http://www.aclweb.org/anthology/N15-1132).
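A rough sketch of these steps, assuming gensim and NLTK's WordNet. The pretrained GloVe model loaded below is only a stand-in for the word2vec model trained in step 1, and the helper names are hypothetical:

import numpy as np
import gensim.downloader
from nltk.corpus import wordnet as wn

# Stand-in for a word2vec model trained on a large corpus (step 1 above)
w2v = gensim.downloader.load("glove-wiki-gigaword-100")

def synset_vector(word):
    # Steps 2-3: average the embeddings of the word's WordNet synonyms
    lemmas = {l.name().lower() for s in wn.synsets(word) for l in s.lemmas()}
    vecs = [w2v[l] for l in lemmas if l in w2v]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Step 4: a vector for the synset of each basic emotion
emotions = ["happy", "sad", "fear", "anger", "disgust", "surprise"]
emotion_vecs = {e: synset_vector(e) for e in emotions}

# Finally, score a content word from the sentence against each emotion via cosine similarity
word_vec = w2v["happy"]
scores = {e: cosine(word_vec, v) for e, v in emotion_vecs.items()}
print(max(scores, key=scores.get), scores)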