Encoding a lot of categorical variables - deep-learning

I have 10 million categorical variables (each variable has 3 categories). What is the best way to encode these 10 million variables to train a deep learning model on them? If I use one-hot encoding, I will end up with 30 million variables. An embedding layer with one output makes no sense (it is similar to integer encoding, and there is no order between these categories), and an embedding layer with two outputs does not make much of a difference. Usually, we use an embedding layer when the number of categories is large. Please give me your opinion.

You should treat this problem like word embeddings, where you also have a lot of entities (usually 30-50 thousand).
Make a random embedding for each category, of dimension 100-300. Use triplet loss or something like it to train the embeddings. Basically, create a valid pair of embeddings, or a pair of an embedding and an input. For word vectors, these are words that co-occur in a context window (they are near each other in a sentence). Then pick some other, unrelated words at random. Train the network so that the valid pair is closer (in cosine distance) than the random pairs; there are different loss functions you can try, but basically the closer the valid pair and the further apart the random pair, the lower the loss.
However, I would think about how you have formulated your problem. Do you actually have 10 million categories? Why do you have more labels than there are words in any human language? If you can group them into hierarchies so that you have fewer labels at multiple stages your model will be more effective.
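To make the triplet idea above concrete, here is a minimal sketch of the loss only, with no training loop. The hinge form, the margin of 0.2, the tiny dimension, and the names `cosine_distance` and `triplet_loss` are assumptions for illustration, not something the answer prescribes.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (one of several possible choices).

    The loss is zero once the valid pair (anchor, positive) is closer
    than the random pair (anchor, negative) by at least `margin`.
    """
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


# Random initial embeddings; the answer suggests 100-300 dimensions,
# 8 is used here only to keep the example small.
rng = random.Random(0)
dim = 8
embeddings = {
    name: [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for name in ("anchor", "valid", "random")
}

loss = triplet_loss(
    embeddings["anchor"], embeddings["valid"], embeddings["random"]
)
```

In a real model the embeddings would be trainable parameters and this loss would be minimised by gradient descent over many sampled triplets.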

Have you already tried an ordinal encoder? It would encode the categories without increasing the number of variables.
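As a rough illustration of the size difference between the two encodings, here is a small pure-Python sketch. The category names and the helper names `ordinal_encode` and `one_hot_encode` are made up for the example. Note that the integer codes imply an order between categories, which the question says does not exist.

```python
def ordinal_encode(rows, categories):
    """Map each category to its integer index; column count is unchanged."""
    index = {category: i for i, category in enumerate(categories)}
    return [[index[value] for value in row] for row in rows]


def one_hot_encode(rows, categories):
    """Expand each column into len(categories) indicator columns."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for row in rows:
        out = []
        for value in row:
            indicator = [0] * len(categories)
            indicator[index[value]] = 1
            out.extend(indicator)
        encoded.append(out)
    return encoded


categories = ["red", "green", "blue"]  # hypothetical 3-category variable
rows = [["red", "blue"], ["green", "green"]]  # 2 samples, 2 variables

ordinal = ordinal_encode(rows, categories)  # 2 columns per sample
one_hot = one_hot_encode(rows, categories)  # 6 columns per sample
```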

Related

How to reveal relations between number of words and target with self-attention based models?

Transformers can handle variable length input, but what if the number of words might correlate with the target? Let's say we want to perform a sentiment analysis for some reviews where the longer reviews are more probable to be bad. How can the model harness this knowledge? Of course a simple solution could be to add this count as a feature after the self-attention layer. However, this hand-crafted-like approach wouldn't reveal more complex relations, for example if there is a high number of word X, it correlates with target 1, except if there is also high number of word Y, in which case the target tends to be 0.
How could this information be included using deep learning? Paper recommendations in the topic are also well appreciated.

How can I consider word dependence along with the semantic information in information retrieval?

I am working on a project in which text retrieval is an important part. There is a reference collection (D), and users can enter queries (Q). Therefore, like a search engine, the goal is to retrieve the documents most related to each query.
I used pre-trained word embeddings to extract semantic knowledge about each word within a text. I then aggregated the continuous word vectors to represent each text as a single vector (using a mean/sum aggregate function). Next, I indexed the source vectors and extracted the vectors most similar to the query vector. However, the results were not acceptable. I also tested traditional approaches such as the BOW technique. While these approaches work very well in some situations, they do not consider semantic and syntactic information, which makes them perform poorly on some queries.
Based on my investigation, considering word dependence (for example, words co-occurring in the same sentence) along with the semantic information (obtained from the pre-trained word embeddings) can be very useful. However, I do not know how to combine them in a way that is applicable to IR.
It should be noted that:
I'm not looking for paragraph2vec or doc2vec; those require training on a large data corpus, and I don't have one. Instead, I want to use existing word embeddings.
I'm not looking for a re-ranking technique like learning to rank approach. Instead, I'm looking for a way to take advantage of both syntactic and semantic information in the representation step i.e. mapping the text or query to a feature vector.
Any help would be appreciated.

What is the best way to represent a collection of documents in a fixed length vector?

I am trying to build a deep neural network that takes in a set of documents and predicts the category it belongs to.
Since the number of documents in each collection is not fixed, my first attempt was to get a vector for each document from doc2vec and use the average.
The training accuracy is as high as 90%, but the testing accuracy is as low as 60%.
Is there a better way of representing a collection of documents as a fixed length vector so that the words they have in common are captured?
The description of your process so far is a bit vague and unclear – you may want to add more detail to your question.
Typically, Doc2Vec would convert each doc to a vector, not "a collection of documents".
If you did try to collapse a collection into a single vector – for example, by averaging many doc-vecs, or calculating a vector for a synthetic document with all the sub-documents' words – you might be losing valuable higher-dimensional structure.
To "predict the category" would be a typical "classification" problem, and with a bunch of documents (represented by their per-doc vectors) and known-labels, you could try various kinds of classifiers.
I suspect from your description, that you may just be collapsing a category to a single vector, then classifying new documents by checking which existing category-vector they're closest-to. That can work – it's vaguely a K-Nearest-Neighbors approach, but with every category reduced to one summary vector rather than the full set of known examples, and each classification being made by looking at just one single nearest-neighbor. That forces a simplicity on the process that may not match the "shapes" of the real categories as well as a true KNN classifier, or other classifiers, could achieve.
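To illustrate the difference described above, here is a small sketch with made-up 2-D vectors standing in for doc-vectors. The function names and the data are hypothetical; the point is only that collapsing a category to one summary vector can give a different answer than looking at the individual known examples.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_centroid(query, labelled_vectors):
    """Classify by the closest per-category summary (mean) vector."""
    centroids = {
        label: mean_vector(vectors)
        for label, vectors in labelled_vectors.items()
    }
    return min(centroids, key=lambda label: euclidean(query, centroids[label]))


def nearest_neighbour(query, labelled_vectors):
    """Classify by the single closest known example (1-NN)."""
    best_label, best_distance = None, math.inf
    for label, vectors in labelled_vectors.items():
        for vector in vectors:
            distance = euclidean(query, vector)
            if distance < best_distance:
                best_label, best_distance = label, distance
    return best_label


# Category "x" is spread out, so its mean lands far from both members.
training = {
    "x": [[0.0, 0.0], [10.0, 0.0]],
    "y": [[4.0, 1.0], [6.0, 1.0]],
}
query = [1.0, 1.0]

by_centroid = nearest_centroid(query, training)
by_neighbour = nearest_neighbour(query, training)
```

Here the query sits right next to a known "x" example, yet the averaged "x" vector is further away than the averaged "y" vector, so the two methods disagree.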
If accuracy on test data falls far below that observed during training, that can indicate that significant "overfitting" is occurring: the model(s) are essentially memorizing idiosyncrasies of the training data to "cheat" at answers based on arbitrary correlations, rather than learning generalizable rules. Making your model(s) smaller, such as by decreasing the dimensionality of your doc-vectors, may help in such situations, by giving the model less extra state in which to remember peculiarities of the training data. More data can also help, as the "noise" in more numerous varied examples tends to cancel itself out, rather than achieve the sort of misguided importance that can be learned in smaller datasets.
There are other ways to convert a variable-length text into a fixed-length vector, including many based on deeper learning algorithms. But, those can be even more training-data-hungry, and it seems like you may have other factors to improve before trying those in-lieu-of Doc2Vec.

Recurrent NNs: what's the point of parameter sharing? Doesn't padding do the trick anyway?

The following is how I understand the point of parameter sharing in RNNs:
In regular feed-forward neural networks, every input unit is assigned an individual parameter, which means that the number of input units (features) corresponds to the number of parameters to learn. In processing e.g. image data, the number of input units is the same over all training examples (usually constant pixel size * pixel size * rgb frames).
However, sequential input data like sentences can come in highly varying lengths, which means that the number of parameters will not be the same depending on which example sentence is processed. That is why parameter sharing is necessary for efficiently processing sequential data: it makes sure that the model always has the same input size regardless of the sequence length, as it is specified in terms of transition from one state to another. It is thus possible to use the same transition function with the same weights (input to hidden weights, hidden to output weights, hidden to hidden weights) at every time step. The big advantage is that it allows generalization to sequence lengths that did not appear in the training set.
My questions are:
Is my understanding of RNNs, as summarized above, correct?
In the actual code example in Keras I looked at for LSTMs, they padded the sentences to equal lengths before all. By doing so, doesn't this wash away the whole purpose of parameter sharing in RNNs?
Parameter Sharing
Being able to efficiently process sequences of varying length is not the only advantage of parameter sharing. As you said, you can achieve that with padding. The main purpose of parameter sharing is a reduction of the parameters that the model has to learn. This is the whole purpose of using an RNN.
If you learned a different network for each time step and fed the output of the first model to the second, and so on, you would end up with a regular feed-forward network. For 20 time steps, you would have 20 models to learn. In convolutional nets, parameters are shared by the convolutional filters because we can assume that there are similar interesting patterns in different regions of the picture (for example a simple edge). This drastically reduces the number of parameters we have to learn. Analogously, in sequence learning we can often assume that there are similar patterns at different time steps. Compare 'Yesterday I ate an apple' and 'I ate an apple yesterday'. These two sentences mean the same, but the 'I ate an apple' part occurs at different time steps. By sharing parameters, you only have to learn what that part means once. Otherwise, you'd have to learn it for every time step where it could occur in your model.
There is a drawback to sharing the parameters. Because our model applies the same transformation to the input at every time step, it now has to learn a transformation that makes sense for all time steps. So, it has to remember, what word came in which time step, i.e. 'chocolate milk' should not lead to the same hidden and memory state as 'milk chocolate'. But this drawback is small compared to using a large feed-forward network.
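The points above can be seen in a toy forward pass. This is a minimal sketch of a vanilla tanh RNN in plain Python, not an LSTM and not how Keras implements it; the weights, the two-dimensional inputs, and the name `rnn_forward` are invented for the example.

```python
import math


def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a vanilla tanh RNN and return the final hidden state.

    The same three parameters (w_xh, w_hh, b_h) are reused at every
    time step, so the parameter count does not depend on len(inputs).
    """
    hidden = [0.0] * len(b_h)
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_xh[j], x))
                + sum(w * hi for w, hi in zip(w_hh[j], hidden))
                + b_h[j]
            )
            for j in range(len(b_h))
        ]
    return hidden


# Arbitrary fixed weights: 2 input features, 2 hidden units.
w_xh = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.4], [-0.6, 0.1]]
b_h = [0.0, 0.0]

chocolate = [1.0, 0.0]  # toy one-hot "word" vectors
milk = [0.0, 1.0]

# The same parameters handle sequences of different lengths...
short_state = rnn_forward([chocolate, milk], w_xh, w_hh, b_h)
long_state = rnn_forward([milk] * 5, w_xh, w_hh, b_h)

# ...and the hidden state still depends on word order.
reversed_state = rnn_forward([milk, chocolate], w_xh, w_hh, b_h)
```

With these weights, 'chocolate milk' and 'milk chocolate' end in different hidden states even though the same transformation is applied at each step.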
Padding
As for padding the sequences: the main purpose is not directly to let the model predict sequences of varying length. Like you said, this can be done by using parameter sharing. Padding is used for efficient training, specifically to keep the number of computational graphs needed during training low. Without padding, we have two options for training:
We unroll the model for each training sample. So, when we have a sequence of length 7, we unroll the model to 7 time steps, feed the sequence, do back-propagation through the 7 time steps and update the parameters. This seems intuitive in theory. But in practice, this is inefficient, because TensorFlow's computational graphs don't allow recurrency, they are feedforward.
The other option is to create the computational graphs before starting training. We let them share the same weights and create one computational graph for every sequence length in our training data. But when our dataset has 30 different sequence lengths this means 30 different graphs during training, so for large models, this is not feasible.
This is why we need padding. We pad all sequences to the same length and then only need to construct one computational graph before starting training. When you have both very short and very long sequence lengths (5 and 100 for example), you can use bucketing and padding. This means, you pad the sequences to different bucket lengths, for example [5, 20, 50, 100]. Then, you create a computational graph for each bucket. The advantage of this is, that you don't have to pad a sequence of length 5 to 100, as you would waste a lot of time on "learning" the 95 padding tokens in there.
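Bucketing and padding as described above could be sketched like this. The padding id of 0 and the helper names `pad_to_bucket` and `group_by_bucket` are assumptions; the bucket lengths are the ones from the example and must be sorted in ascending order.

```python
import bisect

PAD = 0  # hypothetical padding token id


def pad_to_bucket(sequence, bucket_lengths, pad_value=PAD):
    """Pad a sequence to the smallest bucket length that can hold it."""
    position = bisect.bisect_left(bucket_lengths, len(sequence))
    if position == len(bucket_lengths):
        raise ValueError("sequence is longer than the largest bucket")
    target = bucket_lengths[position]
    return sequence + [pad_value] * (target - len(sequence))


def group_by_bucket(sequences, bucket_lengths):
    """Group padded sequences by bucket so each group has one shape."""
    buckets = {length: [] for length in bucket_lengths}
    for sequence in sequences:
        padded = pad_to_bucket(sequence, bucket_lengths)
        buckets[len(padded)].append(padded)
    return buckets


bucket_lengths = [5, 20, 50, 100]
sequences = [[7] * 3, [7] * 5, [7] * 7, [7] * 60]
buckets = group_by_bucket(sequences, bucket_lengths)
```

A sequence of length 7 is padded to 20 rather than 100, so far fewer padding tokens are processed.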

How to implement a simple Markov model to assign authors to anonymous texts?

Let's say I have harvested the posts from a forum. Then I removed all the usernames and signatures, so that now I only know what post was in which thread but not who posted what, or even how many authors there are (though clearly the number of authors cannot be greater than the number of texts).
I want to use a Markov model (look at which words/letters follow which ones) to figure out how many people used this forum, and which posts were written by the same person. To vastly simplify, perhaps one person tends to say "he were" while another person tends to say "he was". I'm talking about a model that works with this sort of basic logic.
Note how there are some obvious issues with the data: Some posts may be very short (one word answers). They may be repetitive (quoting each other or using popular forum catchphrases). The individual texts are not very long.
One could suspect that it is rare for a person to make consecutive posts, or that people are more likely to post in threads they have already posted in. Exploiting this is optional.
Let's assume the posts are plaintexts and have no markup, and that everyone on the forum uses English.
I would like to obtain a distance matrix for all texts T_i such that D_ij is the probability that text T_i and text T_j are written by the same author, based on word/character pattern. I am planning to use this distance matrix to cluster the texts, and ask questions such as "What other texts were authored by the person who authored this text?"
How would I actually go about implementing this? Do I need a hidden MM? If so, what is the hidden state? I understand how to train an MM on a text and then generate a similar text (eg. generated Alice in the Wonderland) but after I train a frequency tree, how do I check a text with it to get the probability that it was generated by that tree? Should I look at letters, or words when building the tree?
My advice is to put aside the business about the distance matrix and think first about a probabilistic model P(text | author). Constructing that model is the hard part of your work; once you have it, you can compute P(author | text) via Bayes' rule. Don't put the cart before the horse: the model might or might not involve distance metrics or matrices of various kinds, but don't worry about that, just let it fall out of the model.
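As one possible reading of this advice, here is a sketch that scores a text under a character-level first-order Markov model per author and then applies Bayes' rule. It assumes you already have some text attributed to each candidate author (for example from a current cluster assignment), which the question does not have up front. The add-one smoothing, the alphabet size of 256, and all function names are assumptions too.

```python
import math
from collections import Counter


def train_bigram_model(text):
    """Count character transitions: the 'frequency tree' for one author."""
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    return pair_counts, first_counts


def log_likelihood(text, model, alphabet_size=256, alpha=1.0):
    """log P(text | author), ignoring the probability of the first
    character and using additive smoothing for unseen transitions."""
    pair_counts, first_counts = model
    total = 0.0
    for previous, current in zip(text, text[1:]):
        numerator = pair_counts[(previous, current)] + alpha
        denominator = first_counts[previous] + alpha * alphabet_size
        total += math.log(numerator / denominator)
    return total


def posterior(text, models, priors):
    """P(author | text) via Bayes' rule, normalised in log space."""
    log_scores = {
        author: log_likelihood(text, model) + math.log(priors[author])
        for author, model in models.items()
    }
    top = max(log_scores.values())
    unnormalised = {a: math.exp(s - top) for a, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {a: value / total for a, value in unnormalised.items()}


models = {
    "author_1": train_bigram_model("he was he was he was there"),
    "author_2": train_bigram_model("he were he were he were there"),
}
priors = {"author_1": 0.5, "author_2": 0.5}

result = posterior("he was", models, priors)
```

The same scoring works with words instead of characters by passing token lists and a larger vocabulary size.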
You might want to take a look at Hierarchical Clustering. With this algorithm you can define your own distance function and it will give you clusters based on it. If you define a good distance function, the resulting clusters will correspond to one author each.
This is probably quite hard to do though and you might need a lot of posts to really get an interesting result. Nevertheless, I wish you good luck!
You mention a Markov model in your question. Markov models are about sequences of tokens and how one token depends on previous tokens and possibly internal state.
If you want to use probabilistic methods you might want to use a different kind of statistical model that is not so much based on sequences but on bags or sets of words or features.
For example, you could take the K most frequent words of the text and create all M-grams of tokens in each post, where the non-frequent words are replaced by empty placeholders. This could allow you to learn phrases commonly used by different authors.
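A minimal sketch of that feature extraction, assuming whitespace tokenisation and lowercasing; the placeholder symbol and the name `masked_ngrams` are made up here:

```python
from collections import Counter

PLACEHOLDER = "_"  # hypothetical stand-in for non-frequent words


def masked_ngrams(posts, k, m):
    """Return one Counter of M-grams per post, keeping only the K most
    frequent words of the whole corpus and masking everything else."""
    tokenised = [post.lower().split() for post in posts]
    word_counts = Counter(word for tokens in tokenised for word in tokens)
    frequent = {word for word, _ in word_counts.most_common(k)}

    features = []
    for tokens in tokenised:
        masked = [w if w in frequent else PLACEHOLDER for w in tokens]
        grams = [tuple(masked[i:i + m]) for i in range(len(masked) - m + 1)]
        features.append(Counter(grams))
    return features


posts = ["he were late", "he was early", "he were gone"]
features = masked_ngrams(posts, k=2, m=2)
```

With K=2 only "he" and "were" survive, so the first and third posts share the bigram ("he", "were") while the second does not.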
In addition you could use single words as features, so that a post gets as features all words in the post (here you can ignore frequent words and use only rare words - the same authors might be interested in the same topics or use the same words or do the same spelling mistakes).
Additionally you can try to capture the style of authors in features: how many paragraphs, how long sentences, how many commas per sentence, does the author use capitalization or not, are numbers spelled out or not, etc ... these are all features that are not sequences as you would use in a HMM but features assigned to each post.
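A few of these style features can be computed per post along the following lines. This is only a sketch: the sentence splitting on ".", "!" and "?" is crude, paragraphs are assumed to be separated by blank lines, and the feature names are invented.

```python
import re


def style_features(post):
    """Compute simple, order-independent style features for one post."""
    paragraphs = [p for p in post.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
    words = post.split()
    n_sentences = max(len(sentences), 1)  # avoid division by zero
    return {
        "paragraphs": len(paragraphs),
        "words_per_sentence": len(words) / n_sentences,
        "commas_per_sentence": post.count(",") / n_sentences,
        # Share of sentences that start with a capital letter.
        "capitalised_starts": sum(
            s.strip()[0].isupper() for s in sentences
        ) / n_sentences,
        # Tokens containing digits, i.e. numbers that are not spelled out.
        "digit_tokens": sum(any(ch.isdigit() for ch in w) for w in words),
    }


example = style_features("Well, he was late. I left at 5.\n\nhe never came.")
```

Each post then becomes a fixed-length feature vector that can be combined with the word and M-gram features before clustering.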
In summary: even though sequences are certainly important to catch phrases you definitely want more than just a sequence model.