One-hot encoding to word2vec embedding - deep-learning

I'm trying to create vectors for categorical information that I have at hand. This information is intended to be used for aiding seq2seq network for NLP purposes (like summarization).
To get the idea, maybe an example would be of help:
Sample Text: shark attacks off Florida in a 1-hour span
And suppose that we have this hypothetical categorical information:
1. [animal, shark, sea, ocean]
2. [animal, tiger, jungle, mountains]
...
19. [animal, eagle, sky, mountains]
I want to feed sample text to an LSTM network token-by-token (like seq2seq networks). I'm using pre-trained GloVe embeddings as my original embeddings which are fed into the network, but also want to concatenate a dense vector to each token denoting its category.
For now, I know that I can simply use the one-hot embeddings (0-1 binary). So, for example, the first input (for shark) to the RNN network would be:
# GloVe embeddings of shark + one-hot encoding for shark, + means concatenation
[-0.323 0.213 ... -0.134 0.934 0.031 ] + [1 0 0 0 0 ... 0 0 1]
The problem is that I have an extreme number of categories out there (around 20,000). After searching over the Internet, it seemed to me that people suggest using word2vec instead of one-hots. But, I can't get the underlying idea of how word2vec can demonstrate the categorical features in this case. Does anybody have a more clear idea?

Word2Vec can't be used for classification. It is just the underlying algorithm.
For classification you can use Doc2Vec or something similar.
It basically takes a list of documents and each has unique id assigned to it. After the training it builds relations between the documents similar to those which word2vec builds for the words. Then when you give it an unknown document it will tell you the top n most similar, and if your documents have previously defined tags you can assume that the unknown document can be labeled the same way.

Related

What does it mean when we combine text features and feed it to a Neural Network?

I am reading this paper -"Review Spam Detection Using Word Embeddings and Deep Neural Networks" - paywall link and here they talk about how they combined ngram and skip-gram features of text before feeding it to the to feed-forward network.Here is the architectureenter image description here
Some brief description of the dataset:-
no. of documents=1600
dimension of skip-gram model=500
no. of the n-gram features(uni,bi,trigram)=2000
For example:-The pictures show that skip-gram and n-gram models were combined before they were sent as an input to the feed-forward network.
Let's suppose the dimension of skip-gram is (no. of documents, dimension of skip-gram) and the dimension of the n-gram model is (no. of documents, no. of n-gram features)
My question is what does it mean when you combine two different features like skip-gram and n-gram. Does it mean concatenation i.e how do you combine two features? Along which axis do you combine those features?
The size of the word vectors (embeddings) was set to 500 and context size c = 5 [7]
to generate a complex representation. The average values of the vector were used to
represent each review. Thus, the input attributes (features) for the subsequent supervised
learning included 2000 n-grams and 500 embeddings.
Deep feedforward neural network (DNN) was used to classify reviews into
spam/legitimate categories.
Hope I have explained it well this time

How is the self-attention mechanism in Transformers able to learn how the words are related to each other?

Given the sentence The animal didn't cross the street because it was too tired, how the self-attention is able to map with a higher score the word aninal intead of the word street ?
I'm wondering if that might be a consequence of the word embedding vectors fed into the network, that some how already encapsulate some degree of distance among the words.
Word Embeddings are first added to Positional Encoding which adds information about the word's position in the sequence. Then through each Encoder stack(6 to be precise), the Embeddings undergo multiple transformations and are refined to form better representations before they are passed on to the decoder.
The modification to the Embeddings as it passes through the Encoder Stack is learnable. Sometimes it may appear that some Attention-Heads at the top Stack are doing something that may look like coreference resolution which you pointed out in your example. Attending more to the word "animal" simply results in better representation than attending to "street".
How do we know which representations are better? The one that minimizes the loss or produces a better output of course!

Does any H2O algorithm support multi-label classification?

Is deep learning model supports multi-label classification problem or any other algorithms in H2O?
Orginal Response Variable -Tags:
apps, email, mail
finance,freelancers,contractors,zen99
genomes
gogovan
brazil,china,cloudflare
hauling,service,moving
ferguson,crowdfunding,beacon
cms,naytev
y,combinator
in,store,
conversion,logic,ad,attribution
After mapping them on the keys of the dictionary:
Then
Response variable look like this:
[74]
[156, 89]
[153, 13, 133, 40]
[150]
[474, 277, 113]
[181, 117]
[15, 87, 8, 11]
Thanks
No, H2O only contains algorithms that learn to predict a single response variable at a time. You could turn each unique combination into a single class and train a multi-class model that way, or predict each class with a separate model.
Any algorithm that creates a model that gives you "finance,freelancers,contractors,zen99" for one set of inputs, and "cms,naytev" for another set of inputs is horribly over-fitted. You need to take a step back and think about what your actual question is.
But in lieu of that, here is one idea: train some word embeddings (or use some pre-trained ones) on your answer words. You could then average the vectors for each set of values, and hope this gives you a good numeric representation of the "topic". You then need to turn your, say, 100 dimensional averaged word vector into a single number (PCA comes to mind). And now you have a single number that you can give to a machine learning algorithm, and that it can predict.
You still have a problem: having predicted a number, how do you turn that number into a 100-dim vector, and from there in to a topic, and from there into topic words? Tricky, but maybe not impossible.
(As an aside, if you turn the above "single number" into a factor, and have the machine learning model do a categorization, to predicting the most similar topic to those it has seen before... you've basically gone full circle and will get a model identical to the one you started with that has too many classes.)

How to massage inputs into Keras framework?

I am new to keras and despite reading the documentation and the examples folder in keras, I'm still struggling with how to fit everything together.
In particular, I want to start with a simple task: I have a sequence of tokens, where each token has exactly one label. I have a lot training data like this - practically infinite, as I can generate more (token, label) training pairs as needed.
I want to build a network to predict labels given tokens. The number of tokens must always be the same as the number of labels (one token = one label).
And I want this to be based on all surrounding tokens, say within the same line or sentence or window -- not just on the preceding tokens.
How far I got on my own:
created the training numpy vectors, where I converted each sentence into a token-vector and label-vector (of same length), using a token-to-int and label-to-int mappings
wrote a model using categorical_crossentropy and one LSTM layer, based on https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py.
Now struggling with:
All the input_dim and input_shape parameters... since each sentence has a different length (different number of tokens and labels in it), what should I put as input_dim for the input layer?
How to tell the network to use the entire token sentence for prediction, not just one token? How to predict a whole sequence of labels given a sequence of tokens, rather than just label based on previous tokens?
Does splitting the text into sentences or windows make any sense? Or can I just pass a vector for the entire text as a single sequence? What is a "sequence"?
What are "time slices" and "time steps"? The documentation keeps mentioning that and I have no idea how that relates to my problem. What is "time" in keras?
Basically I have trouble connecting the concepts from the documentation like "time" or "sequence" to my problem. Issues like Keras#40 didn't make me any wiser.
Pointing to relevant examples on the web or code samples would be much appreciated. Not looking for academic articles.
Thanks!
If you have sequences of different length you can either pad them or use a stateful RNN implementation in which the activations are saved between batches. The former is the easiest and most used.
If you want to use future information when using RNNs you want to use a bidirectional model where you concatenate two RNN's moving in opposite directions. RNN will use a representation of all previous information when e.g. predicting.
If you have very long sentences it might be useful to sample a random sub-sequence and train on that. Fx 100 characters. This also helps with overfitting.
Time steps are your tokens. A sentence is a sequence of characters/tokens.
I've written an example of how I understand your problem but it's not tested so it might not run. Instead of using integers to represent your data I suggest one-hot encoding if it is possible and then use binary_crossentropy instead of mse.
from keras.models import Model
from keras.layers import Input, LSTM, TimeDistributed
from keras.preprocessing import sequence
# Make sure all sequences are of same length
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
# The input shape is your sequence length and your token embedding size (which is 1)
inputs = Input(shape=(maxlen, 1))
# Build a bidirectional RNN
lstm_forward = LSTM(128)(inputs)
lstm_backward = LSTM(128, go_backwards=True)(inputs)
bidirectional_lstm = merge([lstm_forward, lstm_backward], mode='concat', concat_axis=2)
# Output each timestep into a fully connected layer with linear
# output to map to an integer
sequence_output = TimeDistributed(Dense(1, activation='linear'))(bidirectional_lstm)
# Dense(n_classes, activation='sigmoid') if you want to classify
model = Model(inputs, sequence_output)
model.compile('adam', 'mse')
model.fit(X_train, y_train)

Neural Network outputs

I need to develop a neural network and classify the inputs into 3 categories. One of the category is "Don't Know"
Should I train the network using a single output perceptron which categories the training examples as 1,2, or 3? Or should I use a 2 output perceptron and use a binary scheme (01, 10, 00/11) to classify the inputs?
You should use 3 output neurons (one for each class). In the training phase, set output of neuron representing correct class to 1 and all others to 0. Single output with 1 2 and 3 is not optimal because that contains implicit assumtion that classes 2 and 3 are somehow "closer" to each other then 1 and 3. 2 outputs with binary coding is also not good, because in addition to solving classification problem you NN will have to learn binary encoding.
Also, its probably best to use softmax activation on output layer with cross-entropy error function. Softmax will normalize output, so values at each neuron could be interpreted as class probabilities.
Note that "don't know" class in only useful if you have training examples labeled as "don't know". Otherwise, use two output neurons.