I've trained a single layer, 100 hidden unit RBM with binary input units and ReLU activation on the hidden layer. Using a training set of 50k MNIST images, I end up with ~5% RMSE on the 10k image test set after 500 epochs of full-batch training with momentum and L1 weight penalty.
Looking at the visualisation below, it is clear that there are big differences between the hidden units. Some appear to have converged into a very well defined response pattern, while others are indistinguishable from noise.
My question is: how would you interpret this apparent variety, and what technique could possibly help with achieving a more balanced result? Does a situation like this call for more regularization, slower learning, longer learning, or something else?
Raw weights of the 100 hidden units, reshaped into the input image size.
Related
I'm investigating the task of training a neural network to predict one future value given a sinusoidal input. So for example, as seen in the Figure, the input signal is x and the expected output signal y. The model's output is y^. Doing the regression task is fairly straightforward, and there are a lot of choices for this problem. I'm using a simple recurrent neural network with mean-squared error (MSE) loss between y and y^.
Additionally, suppose I know that the sinusoid is made up of N modalities, e.g., at some points, the wave oscillates at 5 Hz, then 10 Hz, then back to 5 Hz, then up to 15 Hz maybe—i.e., N=3.
In this case, I have ground-truth class labels in a vector k and the model does both regression and classification, additionally outputting a vector k^. An example is shown in the Figure. As this is a multi-class problem with exclusivity, I figured binary cross entropy (BCE) loss should be relevant here.
I'm sure there is a lot of research about combining loss functions, but does just adding MSE and BCE make sense? Scaling one up or down by a factor of 10 doesn't seem to change the learning outcome too much. So I was wondering what is considered the standard approach to problems where there is a joint classification and regression objective.
Additionally, on top of just BCE, I want to penalize k^ for quickly jumping around between classes; for example, if the model guesses one class, I'd like it to stay in that one class and switch only when it's necessary. See how in the Figure, there are fast dark blue blips in k^. I would like the same solid bands as seen in k, and naive BCE loss doesn't account for that.
Appreciate any and all advice!
I am training a large neural network model (1 module Hourglass) for a facial landmark recognition task. Database used for training is WFLW.
Loss function used is MSELoss() between the predicted output heatmaps, and the ground-truth heatmaps.
- Batch size = 32
- Adam Optimizer
- Learning rate = 0.0001
- Weight decay = 0.0001
As I am building a baseline model, I have launched a basic experiment with the parameters shown above. I previously had executed a model with the same exact parameters, but with weight-decay=0. The model converged successfully. Thus, the problem is with the weight-decay new value.
I was expecting to observe a smooth loss function that slowly decreased. As it can be observed in the image below, the loss function has a very very wierd shape.
This will probably be fixed by changing the weight decay parameter (decreasing it, maybe?).
I would highly appreciate if someone could provide a more in-depth explanation into the strange shape of this loss function, and its relation with the weight-decay parameter.
In addition, to explain why this premature convergence into a very specific value of 0.000415 with a very narrow standard deviation? Is it a strong local minimum?
Thanks in advance.
Loss should not consistently increase when using gradient descent. It does not matter if you use weight decay or not, there is either a bug in your code (e.g. worth checking what happens with normal gradient descent, not Adam, as there are ways in which one can wrongly implement weight decay with Adam), or your learning rate is too large.
So, I'm doing a 4 label x-ray images classification on around 12600 images:
Class1:4000
Class2:3616
Class3:1345
Class4:4000
I'm using VGG-16 architecture pertained on the imageNet dataset with cross-entrpy and SGD and a batch size of 32 and a learning rate of 1e-3 running on pytorch
[[749., 6., 50., 2.],
[ 5., 707., 9., 1.],
[ 56., 8., 752., 0.],
[ 4., 1., 0., 243.]]
I know since both train loss/acc are relatively 0/1 the model is overfitting, though I'm surprised that the val acc is still around 0.9!
How to properly interpret that and what causing it and how to prevent it?
I know it's something like because the accuracy is the argmax of softmax like the actual predictions are getting lower and lower but the argmax always stays the same, but I'm really confused about it! I even let it train for +64 epochs same results flat acc while loss increases gradually!
PS. I have seen other questions with answers and didn't really get an explanation
I think your question already says about what is going on. Your model is overfitting as you have also figured out. Now, as you are training more your model slowly becoming more specialized to the train set and loosing the the capability to generalize gradually. So the softmax probabilities are getting more and more flat. But still it is showing more or less the same accuracy for validation set as still now the correct class has at least slightly more probability than the others. So in my opinion there can be some possible reasons for this:
Your train set and validation set may not be from the same distribution.
Your validation set doesn't cover all cases need to be evaluated, it probably contains similar types of images but they do not differ too much. So, when the model can identify one, it can identify many of them from the validation set. If you add more heterogeneous images in validation set, you will no longer see such a large accuracy in validation set.
Similarly, we can say your train set has images which are heterogeneous i.e, they have a lot of variations, and the validation set is covering only a few varieties, so as training goes on, those minorities are getting less priority as the model yet to have many things to learn and generalize. This can happen if you augment your train-set and your model finds the validation set is relatively easier initially (until overfitting), but as training goes on the model gets lost itself while learning a lot of augmented varieties available in the train set. In this case don't make the augmentation too much wild. Think, if the augmented images are still realistic or not. Do augmentation on images as long as they remain realistic and each type of these images' variations occupy enough representative examples in the train set. Don't include unnecessary situations in augmentation those will never occur in reality, as these unrealistic examples will just increase burden on the model than doing any help.
Would you please advice how to interpret the results of an epoch; loss and val_loss, their differences with each other and also with other epochs?
An output as an example:
"In machine-learning parlance, an epoch is a complete pass through a given dataset." taken from https://deeplearning4j.org/glossary
Whereas an iterations is for example a mini batch
loss is the actual error based on the measure you've specified, the lower the better. Or in other words, how good does the neural network fit the data
val_loss is the same, but not on the training but on the validation data set, I assume you are taking about keras...
COMMENT ON THE FIGURE:
So since your loss is not decreasing over time I'd try the following:
increase learning rate
increase size of neural network (layers, neurons)
train for longer
Since loss and val_loss are pretty the same this means you are not overf-itting, but it seams you are not learning at all
After going through the Caffe tutorial here: http://caffe.berkeleyvision.org/gathered/examples/mnist.html
I am really confused about the different (and efficient) model using in this tutorial, which is defined here: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt
As I understand, Convolutional layer in Caffe simply calculate the sum of Wx+b for each input, without applying any activation function. If we would like to add the activation function, we should add another layer immediately below that convolutional layer, like Sigmoid, Tanh, or Relu layer. Any paper/tutorial I read on the internet applies the activation function to the neuron units.
It leaves me a big question mark as we only can see the Convolutional layers and Pooling layers interleaving in the model. I hope someone can give me an explanation.
As a site note, another doubt for me is the max_iter in this solver:
https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_solver.prototxt
We have 60.000 images for training, 10.000 images for testing. So why does the max_iter here only 10.000 (and it still can get > 99% accuracy rate)? What does Caffe do in each iteration?
Actually, I'm not so sure if the accuracy rate is the total correct prediction/test size.
I'm very amazed of this example, as I haven't found any example, framework that can achieve this high accuracy rate in that very short time (only 5 mins to get >99% accuracy rate). Hence, I doubt there should be something I misunderstood.
Thanks.
Caffe uses batch processing. The max_iter is 10,000 because the batch_size is 64. No of epochs = (batch_size x max_iter)/No of train samples. So the number of epochs is nearly 10. The accuracy is calculated on the test data. And yes, the accuracy of the model is indeed >99% as the dataset is not very complicated.
For your question about the missing activation layers, you are correct. The model in the tutorial is missing activation layers. This seems to be an oversight of the tutorial. For the real LeNet-5 model, there should be activation functions following the convolution layers. For MNIST, the model still works surprisingly well without the additional activation layers.
For reference, in Le Cun's 2001 paper, it states:
As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted a_i, for unit i, is then passed through a sigmoid squashing function to produce the state of unit i ...
F6 is the "blob" between the two fully connected layers. Hence the first fully connected layers should have an activation function applied (the tutorial uses ReLU activation functions instead of sigmoid).
MNIST is the hello world example for neural networks. It is very simple to today's standard. A single fully connected layer can solve the problem with accuracy of about 92%. Lenet-5 is a big improvement over this example.