How to apply exponential moving average decay for variables in PyTorch?

I am reading the following paper, and it uses EMA decay for variables.
Bidirectional Attention Flow for Machine Comprehension
During training, the moving averages of all weights of the model are
maintained with the exponential decay rate of 0.999.
They use TensorFlow, and I found the related EMA code:
https://github.com/allenai/bi-att-flow/blob/master/basic/model.py#L229
In PyTorch, how do I apply EMA to Variables?

Moving average is the key concept behind momentum in gradient descent.
In the PyTorch documentation you can find:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
Change the momentum parameter to the value you want.
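If you also want shadow EMA copies of the weights themselves, as the paper describes, a minimal hand-rolled sketch could look like this (the EMA class below is hypothetical, written for illustration, not a PyTorch API):

import torch

class EMA:
    def __init__(self, model, decay=0.999):
        self.decay = decay
        # Shadow copies of every trainable parameter.
        self.shadow = {name: p.detach().clone()
                       for name, p in model.named_parameters() if p.requires_grad}

    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current parameter
        with torch.no_grad():
            for name, p in model.named_parameters():
                if p.requires_grad:
                    self.shadow[name].mul_(self.decay).add_(p, alpha=1 - self.decay)

model = torch.nn.Linear(4, 1)
ema = EMA(model, decay=0.999)
# ... then call ema.update(model) after each optimizer.step()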

Related

Use LSTM to forecast Precipitation

I built an LSTM to forecast precipitation, but it doesn't work well.
My code is very simple, and my data is very short: it only contains 720 points.
I use MinMaxScaler to scale the data.
This is my code, with seq_len = 12:
model = Sequential([
    layers.LSTM(2, input_shape=(SEQ_LEN, 1)),
    layers.Dense(1)])
My data looks like this, and the output compared with the true values looks like this (plots not shown).
I use Adam and the MAE loss function, with epochs=10.
Is it underfitting, or is this simple net just not able to do this work?
The r2_score is no more than 0.55.
Please tell me how to adjust it. Thanks.
There are many options (a sketch combining several of them follows this list):
- First of all, it would be better to find an optimal window size by changing the period of the sequences.
- The second option would be changing the batch size of the dataset.
- Change the optimizer to SGD because of the few data points, and before training the model, find the best learning rate by setting a learning-rate schedule callback.
- Try another model architecture, e.g., one with convolution layers, etc.
- Sometimes it helps model performance to put a Lambda layer after the last layer to scale up the values, because the LSTM's default activation function is tanh.
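Putting a few of these together, here is a rough Keras sketch (the LSTM width, scale factor, batch size, and LR sweep values are all assumptions to be tuned, not known-good settings):

from tensorflow.keras import Sequential, layers, optimizers, callbacks

SEQ_LEN = 12  # from the question

model = Sequential([
    layers.LSTM(16, input_shape=(SEQ_LEN, 1)),
    layers.Dense(1),
    # Lambda layer to rescale the output, since the LSTM's tanh keeps
    # activations in (-1, 1); the factor 10.0 is a guess to tune.
    layers.Lambda(lambda x: x * 10.0),
])
model.compile(optimizer=optimizers.SGD(learning_rate=0.01), loss='mae')

# LR sweep: raise the learning rate each epoch, note where the loss
# starts to diverge, then fix the LR just below that point.
lr_schedule = callbacks.LearningRateScheduler(
    lambda epoch, lr: 1e-4 * 10 ** (epoch / 10))
# model.fit(x_train, y_train, batch_size=8, epochs=30, callbacks=[lr_schedule])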

Most weird loss function shape (because of weight decay parameter)

I am training a large neural network model (a one-module Hourglass) for a facial landmark recognition task. The database used for training is WFLW.
Loss function used is MSELoss() between the predicted output heatmaps, and the ground-truth heatmaps.
- Batch size = 32
- Adam Optimizer
- Learning rate = 0.0001
- Weight decay = 0.0001
As I am building a baseline model, I launched a basic experiment with the parameters shown above. I had previously run a model with the exact same parameters but with weight_decay=0, and that model converged successfully. Thus, the problem lies with the new weight-decay value.
I was expecting to observe a smooth loss curve that slowly decreased. As can be seen in the image below, the loss function instead has a very, very weird shape.
This will probably be fixed by changing the weight decay parameter (decreasing it, maybe?).
I would highly appreciate if someone could provide a more in-depth explanation into the strange shape of this loss function, and its relation with the weight-decay parameter.
In addition, can someone explain this premature convergence to the very specific value of 0.000415, with a very narrow standard deviation? Is it a strong local minimum?
Thanks in advance.
Loss should not consistently increase when using gradient descent. Whether or not you use weight decay, there is either a bug in your code (e.g., it is worth checking what happens with plain gradient descent rather than Adam, as there are ways to implement weight decay incorrectly with Adam), or your learning rate is too large.
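On the Adam point: classic Adam folds weight decay into the adaptive gradient update (equivalent to an L2 penalty on the loss), while AdamW decays the weights directly, and the two can behave quite differently. A minimal PyTorch illustration (the Linear model is just a stand-in, not the poster's network):

import torch

model = torch.nn.Linear(10, 1)  # stand-in for the actual hourglass model

# Adam: weight decay is added to the gradient before the adaptive scaling
# (coupled, i.e. an L2 penalty on the loss).
adam = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

# AdamW: weight decay is applied directly to the weights, decoupled from
# the adaptive update, which is often better behaved.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)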

Weight decay in Caffe. How exactly is it used?

From some old discussions (link1, link2) I got the idea that the 'weight_decay' parameter is the regularization parameter for an L2 loss over the weights. For example, in the cifar10 solver, the weight_decay value is 0.004. Does it mean the loss to be minimized is "cross-entropy + 0.004*sum_of_L2_Norm_of_all_weights"? Or is it, by any chance, "cross-entropy + 0.004/2*sum_of_L2_Norm_of_all_weights"?
The loss seems to be cross-entropy + 0.004/2 * sum_of_L2_Norm_of_all_weights.
Looking at the official caffe implementation of AlexNet, the solver file (https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/solver.prototxt) sets weight_decay=0.0005, while in the original AlexNet paper (http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf, page 6) the gradient update includes the term
-0.0005*e*w_i
Since the gradient is the partial derivative of the loss, and the regularization component of the loss is usually expressed as lambda*||w||^2, it seems as if
weight_decay=2*lambda
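A quick numerical check of that claim (plain Python, just to verify the derivative):

import numpy as np

# If the regularizer in the loss is (weight_decay / 2) * ||w||^2, its
# gradient is weight_decay * w -- matching the -0.0005 * e * w_i term in
# the AlexNet update rule (e being the learning rate).
weight_decay = 0.0005
w = np.random.randn(5)

def reg_loss(w):
    return 0.5 * weight_decay * np.sum(w ** 2)

reg_grad = weight_decay * w  # analytic gradient

# Finite-difference check of the analytic gradient.
eps = 1e-6
numeric = np.array([(reg_loss(w + eps * np.eye(5)[i]) - reg_loss(w)) / eps
                    for i in range(5)])
print(np.allclose(reg_grad, numeric))  # True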

Hard to understand Caffe MNIST example

After going through the Caffe tutorial here: http://caffe.berkeleyvision.org/gathered/examples/mnist.html
I am really confused about the different (and efficient) model used in this tutorial, which is defined here: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt
As I understand it, a convolutional layer in Caffe simply calculates the sum Wx+b for each input, without applying any activation function. If we would like to add an activation function, we should add another layer immediately after that convolutional layer, such as a Sigmoid, TanH, or ReLU layer. Any paper/tutorial I have read on the internet applies an activation function to the neuron units.
It leaves me with a big question mark, as we can only see convolutional layers and pooling layers interleaved in the model. I hope someone can give me an explanation.
As a side note, another doubt for me is the max_iter in this solver:
https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_solver.prototxt
We have 60,000 images for training and 10,000 images for testing. So why is max_iter here only 10,000 (and it can still get a >99% accuracy rate)? What does Caffe do in each iteration?
Actually, I'm not so sure whether the accuracy rate is total correct predictions / test size.
I'm very amazed by this example, as I haven't found any other example or framework that can achieve such a high accuracy rate in such a short time (only 5 minutes to get >99% accuracy). Hence, I suspect there is something I have misunderstood.
Thanks.
Caffe uses batch processing. max_iter is 10,000 because the batch_size is 64. Number of epochs = (batch_size × max_iter) / number of training samples = (64 × 10,000) / 60,000 ≈ 10.7, so the model trains for roughly ten epochs. The accuracy is calculated on the test data. And yes, the accuracy of the model is indeed >99%, as the dataset is not very complicated.
For your question about the missing activation layers, you are correct: the model in the tutorial is missing activation layers. This seems to be an oversight of the tutorial. For the real LeNet-5 model, there should be activation functions following the convolution layers (see the sketch at the end of this answer). For MNIST, the model still works surprisingly well without the additional activation layers.
For reference, LeCun's LeNet-5 paper states:
As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted a_i, for unit i, is then passed through a sigmoid squashing function to produce the state of unit i ...
F6 is the "blob" between the two fully connected layers. Hence the first fully connected layer should have an activation function applied (the tutorial uses a ReLU activation instead of the sigmoid).
MNIST is the hello-world example for neural networks, and it is very simple by today's standards. A single fully connected layer can solve the problem with an accuracy of about 92%. LeNet-5 is a big improvement over that.
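To make the activation fix concrete, here is a pycaffe NetSpec sketch (layer sizes and the LMDB path mirror the tutorial, but treat them as assumptions) showing a ReLU inserted right after a convolution:

import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
n.data, n.label = L.Data(batch_size=64, backend=P.Data.LMDB,
                         source='mnist_train_lmdb', ntop=2,
                         transform_param=dict(scale=1. / 255))
n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20,
                        weight_filler=dict(type='xavier'))
n.relu1 = L.ReLU(n.conv1, in_place=True)  # activation right after the conv
n.pool1 = L.Pooling(n.relu1, kernel_size=2, stride=2, pool=P.Pooling.MAX)
print(n.to_proto())  # emits the corresponding prototxt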

Loss function for ordinal target on SoftMax over Logistic Regression

I am using Pylearn2 or Caffe to build a deep network. My target is ordinal (ordered categories). I am trying to find a proper loss function, but I cannot find any in Pylearn2 or Caffe.
I read the paper "Loss Functions for Preference Levels: Regression with Discrete Ordered Labels". I get the general idea, but I am not sure I understand what the thresholds would be if my final layer is a softmax over logistic regression (outputting probabilities).
Can someone help me by pointing me to any implementation of such a loss function?
Thanks
Regards
For both Pylearn2 and Caffe, your labels will need to be 0-4 instead of 1-5; it's just the way they work. The output layer will be 5 units, each essentially a logistic unit, and the softmax can be thought of as an adaptor that normalizes the final outputs (though "softmax" is commonly used as an output type). When training, the value of any individual unit is rarely ever exactly 0.0 or 1.0; it's always a distribution across your units, on which log-loss can be calculated. This loss is compared against the "perfect" case, and the error is back-propagated to update your network weights. Note that a raw output from Pylearn2 or Caffe is not a specific digit 0, 1, 2, 3, or 4; it's 5 numbers, each giving the likelihood of one of the 5 classes. When classifying, one just takes the class with the highest value as the 'winner'.
I'll try to give an example. Say I have a 3-class problem, and I train a network with a 3-unit softmax; the first unit represents the first class, the second the second, and the third the third.
Say I feed a test case through and get 0.25, 0.5, 0.25. Since 0.5 is the highest, a classifier would say "class 2". This is the softmax output; it makes sure the sum of the output units is one.
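A tiny NumPy sketch of exactly that computation (the logits are made up so the probabilities land near 0.25/0.5/0.25):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([0.0, 0.69, 0.0])  # hypothetical raw outputs for 3 classes
probs = softmax(logits)              # roughly [0.25, 0.5, 0.25]; sums to 1
print(probs, probs.argmax())         # argmax picks the 'winner' class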
You should have a look at ordinal (logistic) regression. This is the formal solution to the problem setup you describe (do not use plain regression, as the distance measures of the errors would be wrong).
https://stats.stackexchange.com/questions/140061/how-to-set-up-neural-network-to-output-ordinal-data
In particular I recommend looking at Coral ordinal regression implementation at
https://github.com/ck37/coral-ordinal/issues.