PyTorch: test loss becoming NaN after some iterations - deep-learning

I am trying to train a deep learning architecture and the model trains fine. I test after each epoch. For the first 7 epochs the loss and accuracy look okay, but at epoch 8 the test loss becomes NaN during testing. I have checked my data and it contains no NaNs. Also, my test accuracy is higher than my train accuracy, which is weird. The train set has 37646 samples and the test set has 18932, so it should be enough. Before becoming NaN, the test loss started to become very high, around 1.6513713663602217e+30. This is really weird and I don't understand why it is happening. Any help or suggestion is much appreciated.

Assuming that a very high learning rate isn't the cause of the problem, you can clip your gradients before the update, using PyTorch's gradient clipping.
Example:
optimizer.zero_grad()
loss, hidden = model(data, hidden, targets)  # here the model returns the loss directly (e.g. an RNN language model)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)  # clip the gradient norm in place before the update
optimizer.step()
This is the first thing to do when you have a NaN loss, provided of course that you have made sure you don't have NaNs elsewhere, e.g. in your input features. I have made use of gradient clipping in cases where increasing the learning rate caused NaNs but I still wanted to test the higher learning rate. Decreasing the learning rate could also solve your problem, but I'm guessing that you have already tried this.
Empirically, I set clip_value = 5 most of the time and then check its (usually insignificant) impact on performance. Feel free to experiment with different values.
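Since that answer assumes the inputs are already NaN-free, here is a minimal sketch of how one might verify that and locate where NaNs first appear; the model, data and criterion below are placeholders, not the asker's architecture:
import torch
import torch.nn as nn

# Placeholder model and data, just for illustration
model = nn.Linear(16, 1)
criterion = nn.MSELoss()
data = torch.randn(32, 16)
targets = torch.randn(32, 1)

# 1) Check the inputs and targets themselves
assert not torch.isnan(data).any(), "NaNs in the input features"
assert not torch.isnan(targets).any(), "NaNs in the targets"

# 2) Ask autograd to raise an error at the first operation that produces NaN
with torch.autograd.detect_anomaly():
    loss = criterion(model(data), targets)
    loss.backward()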

Related

Is GradScaler necessary with Mixed precision training with pytorch?

So, going through the AMP: Automatic Mixed Precision Training tutorial for normal networks, I found out that there are two versions, autocast and GradScaler. I just want to know if it's advisable / necessary to use GradScaler during training, because it is written in the documentation that:
Gradient scaling helps prevent gradients with small magnitudes from flushing to zero (“underflowing”) when training with mixed precision.
scaler = torch.cuda.amp.GradScaler()
for epoch in range(1):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad()
Also, looking at NVIDIA Apex Documentation for PyTorch, they have used it as,
from apex import amp
model, optimizer = amp.initialize(model, optimizer)
loss = criterion(…)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
I think this is what GradScaler does too, so I think it is a must. Can someone help me with this query?
Short answer: yes, your model may fail to converge without GradScaler().
There are three basic problems with using FP16:
Weight updates: with half precision, 1 + 0.0001 rounds to 1. autocast() takes care of this one.
Vanishing gradients: with half precision, anything less than (roughly) 2^-14 rounds to 0, as opposed to 2^-126 for single precision. GradScaler() takes care of this one.
Exploding loss: similarly, overflow is also much more likely with half precision. This is also managed by the autocast() context.
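A quick illustration of the rounding and underflow effects described above (this snippet is not from the original answer; the exact printed values depend on the PyTorch version):
import torch

# 1 + 0.0001 is not representable in FP16, so the update is silently lost
one = torch.tensor(1.0, dtype=torch.float16)
small_step = torch.tensor(0.0001, dtype=torch.float16)
print(one + small_step)        # tensor(1., dtype=torch.float16)

# A tiny gradient underflows to zero in FP16 ...
grad = torch.tensor(1e-8)      # fine in FP32
print(grad.half())             # tensor(0., dtype=torch.float16)

# ... but survives if scaled up first, which is the idea behind GradScaler
print((grad * 2**16).half())   # non-zero in FP16 (about 6.55e-4)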

Most weird loss function shape (because of the weight decay parameter)

I am training a large neural network model (a 1-module Hourglass) for a facial landmark recognition task. The database used for training is WFLW.
The loss function used is MSELoss() between the predicted output heatmaps and the ground-truth heatmaps.
- Batch size = 32
- Adam Optimizer
- Learning rate = 0.0001
- Weight decay = 0.0001
As I am building a baseline model, I launched a basic experiment with the parameters shown above. I had previously run a model with the exact same parameters but with weight decay = 0, and that model converged successfully. Thus, the problem seems to be with the new weight-decay value.
I was expecting to observe a smooth loss curve that slowly decreased. As can be observed in the image below, the loss function instead has a very weird shape.
This will probably be fixed by changing the weight decay parameter (decreasing it, maybe?).
I would highly appreciate it if someone could provide a more in-depth explanation of the strange shape of this loss function and its relation to the weight-decay parameter.
In addition, how can this premature convergence to the very specific value of 0.000415, with a very narrow standard deviation, be explained? Is it a strong local minimum?
Thanks in advance.
Loss should not consistently increase when using gradient descent. It does not matter whether you use weight decay or not: there is either a bug in your code (it is worth checking what happens with plain gradient descent rather than Adam, as there are ways in which weight decay can be wrongly implemented with Adam), or your learning rate is too large.
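As a side note on the "wrongly implemented weight decay with Adam" point, PyTorch already ships both formulations; a minimal sketch (the Linear model is just a placeholder, not the Hourglass network from the question):
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model, just for illustration

# Classic Adam: weight decay is added to the gradient as an L2 penalty and then
# rescaled by Adam's adaptive step sizes
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

# AdamW: the decay is applied directly to the weights, decoupled from the
# adaptive update, which is usually the better-behaved formulation
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)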

Deep learning model stuck in local minima or overfit?

I trained an image classification model with 10 classes by fine-tuning EfficientNet-B4 for 100 epochs. I split my training data 70/30. I used stochastic gradient descent with Nesterov momentum of 0.9 and a starting learning rate of 0.001. The batch size is 10. The test accuracy seemed to be stuck at 84% for the last 50 epochs (51st - 100th). I do not know whether the model was stuck in a local minimum or had overfitted. Below is an image of the test and train loss from the 51st to the 100th epoch. I need your help a lot. Thanks. Train/test loss image from the 51st to the 100th epoch.
From the graph you provided, both validation and training losses are still going down, so your model is still training and there is no overfitting. If your test set is stuck at the same accuracy, the reason is probably that the data you are using for your training/validation split does not generalize well enough to your test dataset (in your graph the validation only reached 50% accuracy, while your test set reached 84% accuracy).
I looked into your training and validation graph. Yes, your model is training and the losses are going down, but your validation error is near 50%, which means 'random guess'.
Possible reasons:
1- From your train error (shown in the image between epochs 50 and 100), the error is going down on average, but it is noisy: your error at epoch 100 is pretty much the same as at epoch 70. This could be because your dataset is too simple and you are forcing a huge network like EfficientNet to overfit it.
2- It could also be due to the way you are fine-tuning it, for example which layers you froze and for which layers you compute gradients during backpropagation (see the sketch after this answer). I am assuming you are using pre-trained weights.
3- An optimizer issue: try Adam.
It would be great if you could provide the full loss curves (from epoch 1 to 100).
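A minimal sketch of the freezing pattern mentioned in point 2, assuming torchvision's EfficientNet-B4 and its default classifier layout (the layer names here are torchvision's, not necessarily the asker's code):
import torch
import torchvision

# Load pre-trained EfficientNet-B4 and freeze the backbone
model = torchvision.models.efficientnet_b4(weights="IMAGENET1K_V1")
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for 10 classes; only this part stays trainable
num_classes = 10
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, num_classes)

# Pass only the trainable parameters to the optimizer (same settings as the question)
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.001, momentum=0.9, nesterov=True,
)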

How to interpret the results of an epoch in Deep Learning?

Would you please advise how to interpret the results of an epoch: loss and val_loss, their difference from each other, and how they evolve from one epoch to the next?
An example output:
"In machine-learning parlance, an epoch is a complete pass through a given dataset." taken from https://deeplearning4j.org/glossary
Whereas an iterations is for example a mini batch
loss is the actual error based on the measure you've specified, the lower the better. Or in other words, how good does the neural network fit the data
val_loss is the same, but not on the training but on the validation data set, I assume you are taking about keras...
COMMENT ON THE FIGURE:
So since your loss is not decreasing over time I'd try the following:
increase learning rate
increase size of neural network (layers, neurons)
train for longer
Since loss and val_loss are pretty much the same, this means you are not overfitting, but it seems you are not learning at all.
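For concreteness, a minimal sketch of where loss and val_loss come from in Keras, using random toy data rather than the asker's model: model.fit() reports loss on the training split and val_loss on the held-out validation split at the end of each epoch.
import numpy as np
from tensorflow import keras

# Toy data, just for illustration
x = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000,))

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(x, y, epochs=5, validation_split=0.2, verbose=2)
print(history.history["loss"])      # training loss, one value per epoch
print(history.history["val_loss"])  # validation loss, one value per epoch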

How do I prevent Keras from always predicting the underlying distribution of my data?

I am training a Deep CNN on a very unbalanced data set for a binary classification problem. I have 90% 0's and 10% 1's. To penalize the misclassification of 1, I am using a class_weight that was determined by sklearn's compute_class_weight(). In the validation tuple passed to the fit_generator(), I am using a sample_weight that was computed by sklearn's compute_sample_weight().
The network seems to be learning fine but the validation accuracy continues to be 90% or 10% after every epoch. How can I solve this data unbalance issue in Keras considering the steps I have already taken to overcome it?
Picture of the fit_generator() call: [image]
Picture of the log outputs: [image]
It's very strange that your val_accuracy jumps from 0.9 to 0.1 and back. Do you have the right learning rate? Try lowering it even more.
And my advice: also track an F1 metric.
How did you split the data? Do the classes have the same proportions in the train and test sets?
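A minimal sketch of the class-weight setup described in the question, with a hypothetical y_train label array standing in for the asker's data:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical 90/10 imbalanced labels, mirroring the question
y_train = np.random.choice([0, 1], size=1000, p=[0.9, 0.1])

weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y_train),
                               y=y_train)
class_weight = dict(enumerate(weights))   # roughly {0: 0.56, 1: 5.0}

# Then pass it to training, e.g.:
# model.fit(train_generator, class_weight=class_weight, ...)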