Is normalization of word embeddings important? - reinforcement-learning

I am doing actor-critic reinforcement learning for an environment that is best represented as a "bag-of-words". For this reason, I have opted for a single-body, multi-head approach to the net architecture. I use a linear pre-processing layer to generate n word embeddings of dimension d. I then run the (batch, n, d) words through a stack of two nn.TransformerEncoder layers for the body, and each head is another encoder layer followed by a linear logit layer.
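For concreteness, here is a minimal sketch of roughly that kind of architecture in PyTorch. The input feature size, embedding dimension, attention head count, head output sizes, and the mean-pooling before the logit layer are all assumptions for illustration, not the poster's actual choices:

import torch
import torch.nn as nn

class BagOfWordsNet(nn.Module):
    # Hypothetical sketch: shared encoder body, one encoder layer + linear logit layer per head.
    def __init__(self, in_dim=64, d=256, nhead=8, head_out_dims=(4, 1)):
        super().__init__()
        self.embed = nn.Linear(in_dim, d)  # linear pre-processing -> (batch, n, d) word embeddings
        make_layer = lambda: nn.TransformerEncoderLayer(d_model=d, nhead=nhead, batch_first=True)
        self.body = nn.TransformerEncoder(make_layer(), num_layers=2)          # shared body
        self.head_encoders = nn.ModuleList([make_layer() for _ in head_out_dims])
        self.head_logits = nn.ModuleList([nn.Linear(d, k) for k in head_out_dims])

    def forward(self, x):                        # x: (batch, n, in_dim) bag-of-words features
        z = self.body(self.embed(x))             # latent words: (batch, n, d)
        outs = [lin(enc(z).mean(dim=1))          # pool over words before the logit layer (assumption)
                for enc, lin in zip(self.head_encoders, self.head_logits)]
        return z, outs

net = BagOfWordsNet()
z, (policy_logits, value) = net(torch.randn(8, 32, 64))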
Since this is RL and I have limited compute, it is also difficult to evaluate training as it's happening. I decided to try looking at the mean cosine similarity of the latent words after the encoder body. My intuition tells me that if the net is learning a proper latent representation of the environment, then dissimilar words should have low cosine similarity.
However, even though the net is clearly improving somewhat, the mean cosine similarity remains very high (> 0.99).
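For reference, this is the kind of diagnostic being described: mean pairwise cosine similarity over the latent words coming out of the body. The tensor shapes below are assumptions:

import torch
import torch.nn.functional as F

def mean_pairwise_cosine(z):
    # z: (batch, n, d) latent words from the encoder body
    z = F.normalize(z, dim=-1)                 # unit-normalize each word vector
    sim = z @ z.transpose(1, 2)                # (batch, n, n) cosine similarities
    n = z.shape[1]
    mask = ~torch.eye(n, dtype=torch.bool, device=z.device)  # drop the diagonal (self-similarity)
    return sim[:, mask].mean()

z = torch.randn(8, 16, 256)
print(mean_pairwise_cosine(z))  # near 0 for random vectors in 256 dimensions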
Thinking about it more, I don't think there's any reason to believe my first intuition, especially since I am not even normalizing the words after the encoder body stack. But even if I did normalize, I'm not sure that would encourage lower cosine similarity, as I am using 256 dimensions per word. All normalizing does is reduce the dimension of the output space by 1, which should hardly matter here.
Does this make sense? Also, any general advice about my net is welcome.

Related

What do activation layers learn?

I am trying to figure out what my CNN architecture learns after every activation layer. Therefore, I have written some code to visualize a few activation layers in my model. I used LeakyReLU as my activation layer. This is the figure: "LeakyReLU after Conv2d + BatchNorm".
As can be seen from the figure, there are quite a few purple frames which show nothing. So my question is: what does this mean? Does my model learn anything?
Generally speaking, activation layers (AL) don't learn.
The purpose of an AL is to add non-linearity to the model; hence they apply a certain fixed function regardless of the data, without adapting to it. For example:
Max Pool: takes the highest number in the region
Sigmoid/Tanh: puts all the numbers through a fixed computation
ReLU: takes the max of each number and 0
I tried to simplify the math, so pardon my inaccuracies.
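As a quick sanity check of the "don't learn" point, you can compare parameter counts. A small sketch, assuming PyTorch since the layers are named Conv2d/BatchNorm/LeakyReLU: the convolution and batch-norm layers have learnable parameters, while the activation and pooling layers have none.

import torch.nn as nn

def n_params(layer):
    return sum(p.numel() for p in layer.parameters())

print(n_params(nn.Conv2d(3, 16, kernel_size=3)))   # 448: learnable weights + bias
print(n_params(nn.BatchNorm2d(16)))                # 32: learnable scale + shift
print(n_params(nn.LeakyReLU()))                    # 0: nothing to learn
print(n_params(nn.MaxPool2d(2)))                   # 0: nothing to learn
print(n_params(nn.Sigmoid()), n_params(nn.Tanh())) # 0 0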
To close, your purple frames are probably filters that haven't learned anything just yet. Train the model to convergence and, unless your model is highly bloated (too big for your data), you will see 'structures' in your filters.

Doesn't introduction of polynomial features lead to increased collinearity?

I was going through linear and logistic regression in ISLR, and in both cases I found that one of the approaches adopted to increase the flexibility of the model is to use polynomial features: take both X and X^2 as features and then apply the regression models as usual, treating X and X^2 as independent features (in sklearn, not the polynomial fit of statsmodels). Doesn't that increase the collinearity amongst the features, though? How does it affect model performance?
To summarize my thoughts regarding this:
First, X and X^2 no doubt have substantial correlation.
Second, I wrote a blog post demonstrating that, at least in linear regression, collinearity amongst features does not affect the model fit score, though it makes the model less interpretable by increasing coefficient uncertainty.
So does the second point have anything to do with this, given that model performance is measured by the fit score?
Multicollinearity isn't always a hindrance; it varies from dataset to dataset. If your model isn't giving you the best results (high accuracy or low loss), you can remove outliers or highly correlated features to improve it, but if everything is hunky-dory, you don't need to bother about them.
The same goes for polynomial regression. Yes, it adds multicollinearity to your model by introducing x^2, x^3, ... features.
To overcome that, you can use orthogonal polynomial regression, which introduces polynomials that are orthogonal to each other.
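To make both points concrete, here is a small sketch (assuming numpy): raw x and x^2 are highly correlated, while orthogonalizing the polynomial basis, for example via a QR decomposition of the Vandermonde matrix (similar in spirit to R's poly()), removes that correlation.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)

# Raw polynomial features: strongly correlated
print(np.corrcoef(x, x**2)[0, 1])                 # roughly 0.97 for uniform x on [0, 10]

# Orthogonal polynomial features: QR on the Vandermonde matrix [1, x, x^2, x^3]
V = np.vander(x, N=4, increasing=True)
Q, _ = np.linalg.qr(V)
ortho = Q[:, 1:]                                   # drop the constant column
print(np.corrcoef(ortho, rowvar=False).round(6))   # approximately the identity matrix -> no collinearity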
But it still introduces higher-degree polynomials, which can become unstable at the boundaries of your data space.
To overcome this issue, you can use regression splines, which divide the range of the data into separate portions and fit a linear or low-degree polynomial function on each portion. The points where the divisions occur are called knots, and the functions used to model each piece/bin are known as piecewise functions. These functions come with a smoothness constraint: for example, if you are fitting degree-3 (cubic) polynomials, the overall function should be twice differentiable at the knots.
Such a piecewise polynomial of degree m with m-1 continuous derivatives is called a spline.
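If you want to try regression splines in practice, scikit-learn (1.0+) ships a spline basis transformer. A small sketch; the knot count, degree, and toy data below are only illustrative choices:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)

# Cubic spline basis with 5 knots, then an ordinary linear regression on top
model = make_pipeline(SplineTransformer(degree=3, n_knots=5), LinearRegression())
model.fit(X, y)
print(model.score(X, y))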

Predicting rare events and their strength with LSTM autoencoder

I'm currently creating an LSTM to predict rare events. I've seen this paper, which suggests first using an autoencoder LSTM to extract features and then feeding the embeddings into a second LSTM that makes the actual prediction. According to them, the autoencoder extracts features (this is usually true) which are then useful for the prediction layers.
In my case, I need to predict whether or not there will be an extreme event (this is the most important thing) and then how strong it is going to be. Following their advice, I've created the model, but instead of adding one LSTM from the embeddings to the predictions I add two: one for the binary prediction (it is, or it is not), ending with a sigmoid layer, and a second one for predicting how strong it will be. I then have three losses: the reconstruction loss (MSE), the prediction loss (MSE), and the binary loss (binary cross-entropy).
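For reference, a rough Keras sketch of that three-loss setup (shared LSTM embedding, one reconstruction head, one binary head, one strength head). The layer sizes, sequence length, and loss weights are placeholders, not a recommendation:

from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features, latent_dim = 30, 1, 16

inputs = keras.Input(shape=(timesteps, n_features))
encoded = layers.LSTM(latent_dim, return_sequences=True)(inputs)         # shared embedding

# Head 1: reconstruction (autoencoder branch)
recon = layers.LSTM(latent_dim, return_sequences=True)(encoded)
recon = layers.TimeDistributed(layers.Dense(n_features), name="recon")(recon)

# Head 2: binary "extreme event or not"
binary = layers.LSTM(latent_dim)(encoded)
binary = layers.Dense(1, activation="sigmoid", name="is_event")(binary)

# Head 3: strength of the event
strength = layers.LSTM(latent_dim)(encoded)
strength = layers.Dense(1, name="strength")(strength)

model = keras.Model(inputs, [recon, binary, strength])
model.compile(
    optimizer="adam",
    loss={"recon": "mse", "is_event": "binary_crossentropy", "strength": "mse"},
    loss_weights={"recon": 1.0, "is_event": 1.0, "strength": 1.0},
)
model.summary()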
The thing is that I'm not sure it is learning anything… the binary loss stays at 0.5, and even the reconstruction loss is not really good. And of course, the bad thing is that the time series is mostly zeros, with some numbers from 1 to 10, so MSE is definitely not a good metric.
What do you think about this approach?
Is this a good architecture for predicting rare events? Which one would be better?
Should I add some CNN or FC layers between the embeddings and the two prediction LSTMs, to extract 1D patterns from the embedding, or should the embeddings feed the predictions directly?
Should there be just one prediction LSTM, using only the MSE loss?
Would it be a good idea to multiply the two predictions, to force the predicted days without the event to coincide in both cases?
Thanks,

Why do we need the hyperparameters beta and alpha in LDA?

I'm trying to understand the technical part of Latent Dirichlet Allocation (LDA), but I have a few questions on my mind:
First: why do we need to add alpha and gamma every time we sample from the equation below? What if we deleted alpha and gamma from the equation? Would it still be possible to get the result?
Second: in LDA, we randomly assign a topic to every word in the document. Then we try to optimize the topic assignments by observing the data. Which part of the equation above relates to posterior inference?
If you look at the inference derivation on the Wiki, alpha and beta are introduced simply because theta and phi are drawn from Dirichlet distributions uniquely determined by them, respectively. The reason for choosing the Dirichlet distribution as the prior (e.g. P(phi|beta)) is mainly to make the math tractable by exploiting the nice form of the conjugate prior (here, Dirichlet and categorical distributions; the categorical distribution is a special case of the multinomial distribution where n is set to one, i.e. only one trial). Also, the Dirichlet distribution lets us "inject" our belief that the doc-topic and topic-word distributions are concentrated on a few topics and words for a given document or topic (if we set low hyperparameters). If you remove alpha and beta, I am not sure how it would work.
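To see the "inject our belief" point concretely, here is a tiny sketch (assuming numpy) of how a low versus high symmetric alpha changes the doc-topic distributions drawn from the Dirichlet prior:

import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of topics

# Low alpha: mass concentrated on a few topics (sparse doc-topic distributions)
print(rng.dirichlet([0.1] * K, size=3).round(2))

# High alpha: mass spread roughly evenly across topics
print(rng.dirichlet([10.0] * K, size=3).round(2))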
The posterior inference is replaced with inference from the joint probability: at least in Gibbs sampling, you need the joint probability while picking one dimension at a time to "transition the state", as the Metropolis-Hastings paradigm does. The formula you put here is essentially derived from the joint probability P(w, z). I would refer you to the book Monte Carlo Statistical Methods (by Robert) to fully understand why this inference works.

Loss function in LSTM neural network

I do not understand what is being minimized in these networks.
Can someone please explain what is going on mathematically when the loss gets smaller in LSTM network?
model.compile(loss='categorical_crossentropy', optimizer='adam')
From the Keras documentation, categorical_crossentropy is just the multiclass log loss. Math and a theoretical explanation of log loss can be found here.
Basically, the LSTM is assigning labels to words (or characters, depending on your model) and optimizing the model by penalizing incorrect labels in word (or character) sequences. The model takes an input word or character vector and tries to guess the next "best" word based on training examples. Categorical crossentropy is a quantitative way of measuring how good the guess is. As the model iterates over the training set, it makes fewer mistakes in guessing the next best word (or character).
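As a small illustration of what that loss computes (a sketch with made-up numbers, not your model's actual outputs): for a one-hot target, categorical cross-entropy is just the negative log of the probability assigned to the correct next word.

import numpy as np

# Predicted probabilities over a 4-word vocabulary for one time step
y_pred = np.array([0.1, 0.7, 0.1, 0.1])
# One-hot target: the "true" next word is index 1
y_true = np.array([0.0, 1.0, 0.0, 0.0])

loss = -np.sum(y_true * np.log(y_pred))
print(loss)  # = -log(0.7) ≈ 0.357; a more confident correct guess (0.9) would give ≈ 0.105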