I do not understand what is being minimized in these networks.
Can someone please explain what is going on mathematically when the loss gets smaller in an LSTM network?
model.compile(loss='categorical_crossentropy', optimizer='adam')
From the Keras documentation, categorical_crossentropy is just the multiclass log loss. The math and a theoretical explanation of log loss are covered here.
Basically, the LSTM is assigning labels to words (or characters, depending on your model), and optimizing the model by penalizing incorrect labels in word (or character) sequences. The model takes an input word or character vector and tries to guess the next "best" word, based on training examples. Categorical cross-entropy is a quantitative way of measuring how good the guess is. As the model iterates over the training set, it makes fewer mistakes in guessing the next best word (or character).
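As a concrete illustration, for a one-hot target $y$ and predicted distribution $\hat{y}$ the categorical cross-entropy is $-\sum_k y_k \log \hat{y}_k$: the loss is small when the model puts high probability on the correct next word. A minimal NumPy sketch (the vocabulary and probabilities below are made up for illustration):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Multiclass log loss for one-hot targets, averaged over samples."""
    y_pred = np.clip(y_pred, eps, 1.0)          # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=-1))

# Hypothetical 3-word vocabulary; the true next word is index 1.
y_true     = np.array([[0.0, 1.0, 0.0]])
bad_guess  = np.array([[0.6, 0.2, 0.2]])   # low probability on the right word
good_guess = np.array([[0.1, 0.8, 0.1]])   # high probability on the right word

print(categorical_crossentropy(y_true, bad_guess))   # ~1.61 (high loss)
print(categorical_crossentropy(y_true, good_guess))  # ~0.22 (low loss)
```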
Related
I am doing actor-critic reinforcement learning for an environment that is best represented as a "bag of words". For this reason, I have opted for a single-body, multi-head approach for the net architecture. I use a linear pre-processing layer to generate n word embeddings of dimension d. Then I run the (batch, n, d) words through a stack of two nn.TransformerEncoder layers for the body, and each head is another encoder layer followed by a linear logit layer.
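For concreteness, here is a minimal PyTorch sketch of an architecture along those lines; the vocabulary size, number of attention heads, output sizes, mean-pooling in the heads, and the embedding table standing in for the linear pre-processing layer are all placeholder assumptions, not details from the original setup:

```python
import torch
import torch.nn as nn

class BagOfWordsNet(nn.Module):
    """Single-body, multi-head net over a bag-of-words observation."""

    def __init__(self, vocab_size=1000, d=256, out_dim=10, n_heads=2):
        super().__init__()
        # Pre-processing: one embedding of dimension d per word
        self.embed = nn.Embedding(vocab_size, d)

        def enc_layer():
            return nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)

        # Shared body: a stack of 2 encoder layers
        self.body = nn.TransformerEncoder(enc_layer(), num_layers=2)
        # Each head: one more encoder layer followed by a linear logit layer
        self.head_encoders = nn.ModuleList([enc_layer() for _ in range(n_heads)])
        self.head_linears = nn.ModuleList([nn.Linear(d, out_dim) for _ in range(n_heads)])

    def forward(self, tokens):                    # tokens: (batch, n) word indices
        z = self.body(self.embed(tokens))         # latent words: (batch, n, d)
        outputs = []
        for enc, lin in zip(self.head_encoders, self.head_linears):
            h = enc(z).mean(dim=1)                # pool over words (one possible choice)
            outputs.append(lin(h))                # (batch, out_dim) logits per head
        return z, outputs

net = BagOfWordsNet()
z, heads = net(torch.randint(0, 1000, (4, 32)))   # batch of 4, 32 words each
```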
Since this is RL and I have limited compute, it is also difficult to evaluate training as it's happening. I decided to try looking at the mean cosine similarity of the latent words after the encoder body. My intuition tells me that if the net is learning a proper latent representation of the environment, then dissimilar words should have low cosine similarity.
However, even though the net is clearly improving somewhat, the mean cosine similarity remains very high (> 0.99).
Thinking about it more, I don't think there's any reason to believe my first intuition, especially since I am not even normalizing the words after the encoder body stack. But even if I did normalize, I'm not sure that would encourage lower cosine similarity, as I am using 256 dimensions per word. All normalizing does is reduce the dimension of the output space by 1, which should hardly matter here.
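For reference, one way to compute the mean pairwise cosine similarity of the (batch, n, d) latent words coming out of the body is sketched below (the shapes are placeholders):

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(z):
    """z: (batch, n, d) latent words from the encoder body."""
    z = F.normalize(z, dim=-1)                 # unit-normalize each word vector
    sim = z @ z.transpose(1, 2)                # (batch, n, n) cosine similarities
    n = z.shape[1]
    off_diag = ~torch.eye(n, dtype=torch.bool, device=z.device)
    return sim[:, off_diag].mean()             # average over distinct word pairs

z = torch.randn(4, 32, 256)                    # random latents: mean cosine sim near 0
print(mean_pairwise_cosine(z))
```

Note that very high cosine similarity between contextual token representations is a commonly reported property of transformer encoders (often called anisotropy), so on its own it may not say much about whether the net is learning.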
Does this make sense? Also, any general advice about my net is welcome.
I'm currently creating an LSTM to predict rare events. I've seen this paper, which suggests first using an autoencoder LSTM to extract features, and second using the embeddings in a second LSTM that makes the actual prediction. According to them, the autoencoder extracts features (this is usually true) which are then useful for the prediction layers.
In my case, I need to predict whether or not there will be an extreme event (this is the most important thing) and then how strong it is going to be. Following their advice, I've created the model, but instead of adding one LSTM from embeddings to predictions I add two: one for the binary prediction (it is, or it is not), ending with a sigmoid layer, and a second one for predicting how strong it will be. So I have three losses: the reconstruction loss (MSE), the prediction loss (MSE), and the binary loss (binary cross-entropy).
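For reference, a minimal Keras sketch of this kind of setup (an LSTM autoencoder plus two prediction heads trained with three losses); all shapes, layer sizes, names, and loss weights below are placeholders rather than values from the actual model:

```python
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features, latent_dim = 30, 1, 32   # placeholder shapes

inp = keras.Input(shape=(timesteps, n_features))
# Autoencoder: encode to an embedding, then reconstruct the input sequence
emb = layers.LSTM(latent_dim, name="encoder")(inp)
rec = layers.RepeatVector(timesteps)(emb)
rec = layers.LSTM(latent_dim, return_sequences=True)(rec)
rec = layers.TimeDistributed(layers.Dense(n_features), name="reconstruction")(rec)

# Two prediction heads on top of the embedding
h = layers.RepeatVector(timesteps)(emb)
binary = layers.LSTM(latent_dim)(h)
binary = layers.Dense(1, activation="sigmoid", name="is_event")(binary)
strength = layers.LSTM(latent_dim)(h)
strength = layers.Dense(1, name="strength")(strength)

model = keras.Model(inp, [rec, binary, strength])
model.compile(
    optimizer="adam",
    loss={"reconstruction": "mse", "is_event": "binary_crossentropy", "strength": "mse"},
    loss_weights={"reconstruction": 1.0, "is_event": 1.0, "strength": 1.0},
)
```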
The thing is that I'm not sure it is learning anything… the binary loss stays at 0.5, and even the reconstruction loss is not very good. And of course, the bad part is that the time series is full of zeros, with some values from 1 to 10, so MSE is definitely not a good metric.
What do you think about this approach?
Is this the best architecture for predicting rare events? If not, which one would be better?
Should I add some CNN or FC layers between the embeddings and the other two LSTMs, to extract 1D patterns from the embedding, or feed the embeddings directly into the prediction LSTMs?
Should there be just a single prediction LSTM, using only the MSE loss?
Would it be a good idea to multiply the two predictions, to force the days predicted without an event to coincide in both cases?
Thanks,
I'm trying to understand the technical part of Latent Dirichlet Allocation (LDA), but I have a few questions on my mind:
First: Why do we need to add alpha and gamma every time we sample the equation below? What if we delete the alpha and gamma from the equation? Would it still be possible to get the result?
Second: In LDA, we randomly assign a topic to every word in the document. Then, we try to optimize the topic by observing the data. Where is the part which is related to posterior inference in the equation above?
If you look at the inference derivation on the Wikipedia page, alpha and beta are introduced simply because theta and phi are each drawn from a Dirichlet distribution uniquely determined by them. The reason for choosing the Dirichlet distribution as the prior (e.g. P(phi|beta)) is mainly to make the math tractable by exploiting the nice form of the conjugate prior (here, the Dirichlet and the categorical distribution; the categorical distribution is a special case of the multinomial distribution with n set to one, i.e. a single trial). The Dirichlet distribution also lets us "inject" our belief that the doc-topic and topic-word distributions are concentrated on a few topics and words per document or topic (if we set low hyperparameters). If you remove alpha and beta, I am not sure how it would work.
The posterior inference is replaced with inference from the joint probability: at least in Gibbs sampling, you need the joint probability while picking one dimension at a time to "transform the state", as in the Metropolis-Hastings paradigm. The formula you put here is essentially derived from the joint probability P(w, z). I would refer you to the book Monte Carlo Statistical Methods (by Robert and Casella) to fully understand why the inference works.
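For reference, and assuming the formula in question is the usual collapsed Gibbs sampling conditional for LDA with symmetric priors (with the question's gamma playing the role of beta here), it reads:

$$ P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \left(n_{d,k}^{-i} + \alpha\right)\,\frac{n_{k,w_i}^{-i} + \beta}{n_{k,\cdot}^{-i} + V\beta} $$

where $n_{d,k}^{-i}$ counts the words in document $d$ currently assigned to topic $k$, $n_{k,w_i}^{-i}$ counts how often word $w_i$ is assigned to topic $k$ across the corpus, $n_{k,\cdot}^{-i}$ is the total count for topic $k$, $V$ is the vocabulary size, and the superscript $-i$ means the current word is excluded from the counts. The $\alpha$ and $\beta$ terms act as pseudo-counts (smoothing): they keep unseen document-topic and topic-word pairs at nonzero probability, so dropping them would let zero counts wipe out entire topics during sampling.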
In the implementation of the DenseNet model from the CheXNet paper, section 3.1 mentions that:
Before inputting the images into the network, we downscale the images to 224x224 and normalize based on the mean and standard deviation of images in the ImageNet training set.
Why would we want to normalize a new set of images with the mean and std of a different dataset?
How do we get the mean and std of ImageNet dataset? Is it provided somewhere?
Subtracting the mean centers the input to 0, and dividing by the standard deviation makes any scaled feature value the number of standard deviations away from the mean.
Consider how a neural network learns its weights. C(NN)s learn by continually adding gradient error vectors (multiplied by a learning rate) computed from backpropagation to various weight matrices throughout the network as training examples are passed through.
The thing to notice here is the "multiplied by a learning rate".
If we didn't scale our input training vectors, the ranges of our distributions of feature values would likely be different for each feature, and thus the learning rate would cause corrections in each dimension that differ (proportionally speaking) from one another. We might be overcompensating a correction in one weight dimension while undercompensating in another.
This is non-ideal, as we might find ourselves in an oscillating state (unable to settle on a better minimum in cost(weights) space) or in a slow-moving state (traveling too slowly to reach a better minimum).
Original Post: https://stats.stackexchange.com/questions/185853/why-do-we-need-to-normalize-the-images-before-we-put-them-into-cnn
They used mean and std dev of the ImageNet training set because the weights of their model were pretrained on ImageNet (see Model Architecture and Training section of the paper).
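To answer the second question concretely: the commonly used per-channel ImageNet statistics are mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225] (for RGB images scaled to [0, 1]); they are quoted in most framework examples rather than shipped with the dataset itself. A typical PyTorch-style preprocessing pipeline (an illustration, not CheXNet's actual code) would look like:

```python
from torchvision import transforms

# Per-channel statistics computed over the ImageNet training set,
# for pixel values already scaled to [0, 1] by ToTensor().
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),                       # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```

If you were training from scratch on chest X-rays only, you could instead compute the mean and std over your own training images and normalize with those.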
I wonder if anyone had experience with training deep neural nets on, say, CIFAR10 or ILSVRC-2012 datasets and comparing the final results for single and double precision computations?
I can't comment, but this is less of an answer and more of just information and a thought experiment. I think this question will be hard to test, just because using 64-bit will require the CPU instead of the GPU and will increase runtime considerably.
First, a paper on the subject from Vincent Vanhoucke at Google: http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/37631.pdf. This paper is focused on optimizing deep networks on CPUs, and a huge area of optimization is using 'fixed-point SIMD' instructions, which, as explained in the next link, equates to 8-bit precision (unless I'm mistaken). An explanation of this paper can be found at http://petewarden.com/2015/05/23/why-are-eight-bits-enough-for-deep-neural-networks/.
In my own experience, I've used 16 bit precision for my inputs instead of 32 bit for Deep Q-Learning and I've noticed no difference in performance.
I'm not an expert in low-level computing, but what are the extra digits really helping with? The point of training is to maximize the probability that the network will assign the correct class with a large margin (i.e. a softmax output of 95%+). A ± difference of 0.0001 won't change the predicted class.
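To make that last point concrete, here is a tiny NumPy sketch comparing a softmax computed in single and double precision on made-up logits; the predicted class (the argmax) is the same either way:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())        # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.1, -1.3, 0.5])

p32 = softmax(logits.astype(np.float32))
p64 = softmax(logits.astype(np.float64))

print(p32, p32.argmax())           # probabilities in single precision
print(p64, p64.argmax())           # probabilities in double precision
print(np.abs(p64 - p32).max())     # tiny difference, well below 0.0001
```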