Some of my parameters:
base_lr: 0.04
max_iter: 170000
lr_policy: "poly"
batch_size = 8
iter_size = 16
This is how the training process looks so far:
The loss here seems stagnant. Is there a problem, or is this normal?
The solution for me was to lower the base learning rate by a factor of 10 before resuming training from a solverstate snapshot.
To achieve the same effect automatically, you can set the gamma and stepsize parameters in your solver.prototxt and switch lr_policy to "step" (gamma and stepsize have no effect under the "poly" policy):
base_lr: 0.04
stepsize: 10000
gamma: 0.1
max_iter: 170000
lr_policy: "step"
batch_size = 8
iter_size = 16
With lr_policy: "step", this reduces the learning rate by a factor of 10 every 10,000 iterations.
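For reference, here is a minimal Python sketch (not Caffe code) of how the "step" policy computes the effective learning rate, using the values from the snippet above:

# Caffe's "step" policy: lr = base_lr * gamma ^ floor(iter / stepsize)
base_lr, gamma, stepsize = 0.04, 0.1, 10000

def step_lr(iteration):
    # effective learning rate at the given iteration under the "step" policy
    return base_lr * gamma ** (iteration // stepsize)

for it in (0, 9999, 10000, 20000, 30000):
    print(it, step_lr(it))   # approx. 0.04, 0.04, 0.004, 4e-4, 4e-5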
Please note that it is normal for the loss to fluctuate, and even to hover around a constant value before dipping. This could be what you are seeing, so I would suggest training well beyond 1,800 iterations before falling back on the change above. Look up graphs of Caffe training-loss logs for comparison.
Additionally, please direct future questions to the Caffe users mailing list, which serves as a central location for Caffe questions and solutions.
I struggled with this myself and didn't find solutions anywhere before I figured it out. Hope what worked for me will work for you!
Related
I am working on a project to predict soccer player values from a set of inputs. The data consists of about 19,000 rows and 8 columns (7 input columns and 1 target column), all numerical values.
I am using a fully connected neural network for the prediction, but the problem is that the loss is not decreasing as it should.
The loss is very large (around 1e+13) and does not really decrease; it just fluctuates.
This is the function I am using to run the model:
import torch

def gradient_descent(model, learning_rate, num_epochs, data_loader, criterion):
    losses = []
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)  # pass the learning rate explicitly
    for epoch in range(num_epochs):              # one epoch
        for inputs, outputs in data_loader:      # one iteration (mini-batch)
            inputs, outputs = inputs.to(torch.float32), outputs.to(torch.float32)
            logits = model(inputs)
            loss = criterion(torch.squeeze(logits), outputs)  # forward pass
            optimizer.zero_grad()                # zero out the gradients
            loss.backward()                      # compute the gradients (backward pass)
            optimizer.step()                     # take one step
            losses.append(loss.item())
        epoch_loss = sum(losses[-len(data_loader):]) / len(data_loader)
        print(f'Epoch #{epoch}: Loss={epoch_loss:.3e}')
    return losses
The model is a fully connected neural network with 4 hidden layers of 7 neurons each. The input layer has 7 neurons and the output layer has 1. I am using MSE as the loss function. I tried changing the learning rate, but it is still bad.
What could be the reason behind this?
Thank you!
It is difficult to diagnose your problem from the information you provided, but I'll try to point you in some useful directions.
Data Normalization:
The way we initialize the weights of a deep NN has a significant effect on the training process. See, e.g.:
He, K., Zhang, X., Ren, S. and Sun, J., Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (ICCV 2015).
Most initialization methods assume the inputs have zero mean and unit variance (or similar statistics). If your inputs violate these assumptions, you will find it difficult to train. See, e.g., this post.
Normalize the Targets:
You are trying to solve a regression problem (MSE loss); it might be that your targets are poorly scaled, causing very large loss values. Try normalizing the targets to span a more compact range.
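For instance, a minimal sketch of standardizing both the inputs and the targets before building the DataLoader; X and y are placeholder tensors for the 7 input columns and the player-value column:

import torch
from torch.utils.data import TensorDataset, DataLoader

# X: (N, 7) float tensor of inputs, y: (N,) float tensor of player values (placeholder names)
X_mean, X_std = X.mean(dim=0), X.std(dim=0)
y_mean, y_std = y.mean(), y.std()

X_norm = (X - X_mean) / X_std   # zero mean, unit variance per feature
y_norm = (y - y_mean) / y_std   # compact target range, so MSE values stay reasonable

data_loader = DataLoader(TensorDataset(X_norm, y_norm), batch_size=64, shuffle=True)
# At prediction time, undo the scaling: prediction * y_std + y_mean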
Learning Rate:
Try and adjust your learning rate: both increasing it and decreasing it by orders of magnitude.
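A quick way to try this, sketched with the gradient_descent function above (model_factory is a placeholder for whatever builds a fresh network):

import torch

# Hypothetical sweep: retrain from scratch for a few epochs at each learning rate
for lr in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5):
    model = model_factory()            # placeholder: builds a fresh, untrained model
    criterion = torch.nn.MSELoss()
    losses = gradient_descent(model, lr, num_epochs=5,
                              data_loader=data_loader, criterion=criterion)
    avg = sum(losses[-len(data_loader):]) / len(data_loader)
    print(f'lr={lr:.0e}: final epoch loss {avg:.3e}')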
I am trying to train a deep learning architecture; the model trains fine, and I test after each epoch. For 7 epochs all the losses and accuracies look okay, but at epoch 8 the test loss becomes NaN. I have checked my data and it contains no NaNs. Also, my test accuracy is higher than my train accuracy, which is weird. The training set has 37,646 samples and the test set has 18,932, so the amount of data should be enough. Before becoming NaN, the test loss first grew very large, around 1.6513713663602217e+30. This is really weird and I don't understand why it is happening. Any help or suggestion is much appreciated.
Assuming that a very high learning rate isn't the cause of the problem, you can clip your gradients before the update, using PyTorch's gradient clipping.
Example:
optimizer.zero_grad()
loss, hidden = model(data, hidden, targets)
loss.backward()
# rescale the gradients in-place so that their global norm does not exceed clip_value
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)
optimizer.step()
This is the first thing to do when you have a NaN loss, provided of course you have made sure that you don't have NaNs elsewhere, e.g. in your input features. I have used gradient clipping in cases where increasing the learning rate caused NaNs but I still wanted to test that higher rate. Decreasing the learning rate could also solve your problem, but I'm guessing you have already tried that.
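For instance, a small sketch that scans a dataset for non-finite values before blaming the optimization itself (data_loader is a placeholder for your own loader):

import torch

# Scan every batch for NaNs or infinities in inputs and targets
for i, (inputs, targets) in enumerate(data_loader):
    if not torch.isfinite(inputs).all():
        print(f'non-finite value found in inputs of batch {i}')
    if not torch.isfinite(targets).all():
        print(f'non-finite value found in targets of batch {i}')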
Empirically, I set clip_value = 5 most of the time and then check its (usually insignificant) impact on performance. Feel free to experiment with different values.
I trained an image classification model with 10 classes by fine-tuning EfficientNet-B4 for 100 epochs. I split my training data 70/30. I used stochastic gradient descent with Nesterov momentum of 0.9, a starting learning rate of 0.001, and a batch size of 10. The test accuracy seemed to be stuck at 84% for the next 50 epochs (51st to 100th). I do not know whether the model is stuck in a local minimum or has overfitted. Below is an image of the test and train loss from the 51st epoch to the 100th. I need your help a lot. Thanks.
From the graph you provided, both validation and training losses are still going down, so your model is still training and there is no overfitting. If your test set is stuck at the same accuracy, the reason is probably that the data you are using for your training/validation dataset does not generalize well enough to your test dataset (in your graph the validation only reached 50% accuracy, while your test set reached 84% accuracy).
I looked at your training and validation graph. Yes, your model is training and the losses are going down, but your validation error is near 50%, which essentially means a random guess.
Possible reasons:
1. From your train error (shown in the image between epochs 50 and 100), the error is going down on average, but noisily; the error at epoch 100 is pretty much the same as at epoch 70. This could be because your dataset is too simple and you are forcing a huge network like EfficientNet to overfit it.
2. It could also be due to the way you are fine-tuning it: which layers you froze and which layers receive gradients during backpropagation (I am assuming you are using pre-trained weights); see the sketch below for one common setup.
3. An optimizer issue; try using Adam.
It would be great if you could provide the full loss curves (from epoch 1 to 100).
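For point 2, a rough sketch of one common fine-tuning setup, assuming torchvision's efficientnet_b4 (your implementation may differ): freeze the backbone and train only a new 10-class classifier head.

import torch
import torch.nn as nn
from torchvision import models

model = models.efficientnet_b4(weights='IMAGENET1K_V1')   # pre-trained ImageNet weights

for param in model.parameters():          # freeze the whole backbone first
    param.requires_grad = False

# replace the classification head with a fresh, trainable 10-class layer
in_features = model.classifier[1].in_features
model.classifier[1] = nn.Linear(in_features, 10)

# optimize only the parameters that still require gradients
optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                            lr=0.001, momentum=0.9, nesterov=True)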
https://github.com/slavaglaps/ResNet_cifar10/blob/master/resnet.ipynb
This is my model, trained for 100 epochs.
Accuracy with similar models on similar data reaches 90%.
What is my problem?
I think it is worth reducing the learning rate as the epochs progress.
What do you think could help me?
There are a few subtle differences.
You are trying to apply an ImageNet-style architecture to CIFAR-10. In the CIFAR variant the first convolution is 3 x 3, not 7 x 7, and there is no max-pooling layer; the image is downsampled purely by stride-2 convolutions (see the sketch below).
You should probably do mean-centering by setting featurewise_center = True in ImageDataGenerator.
Do not use a very high number of filters, such as [512, 1024, 2048]. You only have 50,000 images to train on, unlike ImageNet, which has about a million.
In short, read section 4.2 of the deep residual network paper and try to replicate that network. You may also read this blog.
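To illustrate the first point, here is a rough PyTorch sketch of the two stems (this is not the code from the linked notebook):

import torch.nn as nn

# ImageNet-style stem: 7x7 stride-2 convolution followed by max-pooling (224x224 inputs)
imagenet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# CIFAR-10-style stem (ResNet paper, section 4.2): a single 3x3 stride-1 convolution,
# no max-pooling; downsampling happens later via stride-2 convolutions
cifar_stem = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(inplace=True),
)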
I trained FCN32 from scratch on my data; unfortunately, I am getting a black image as output. Here is the loss curve.
I am not sure whether this training loss curve is normal or whether I have done something wrong.
I would really appreciate experts' ideas on this:
Why is the output a black image?
Is the network overfitting?
Should I change the lr_mult value in the Deconvolution layer from 0 to some other value?
Thanks a lot
Edit:
I changed the lr_mult value in the Deconvolution layer from 0 to 3, and the following shows the solver:
test_interval: 1000 #1000000
display: 100
average_loss: 100
lr_policy: "step"
stepsize: 100000
gamma: 0.1
base_lr: 1e-7
momentum: 0.99
iter_size: 1
max_iter: 500000
weight_decay: 0.0005
I got the following train-loss curve, and again I am getting a black image. I do not know what the mistake is or why it is behaving like this; could someone please share some ideas? Thanks.
There is an easy way to check whether you are overfitting on the training data or just did something wrong in the algorithm: simply predict on the training data and look at the output. If it is very similar or equal to the desired output, you are overfitting, and you will probably have to apply dropout and weight regularization.
If the output is also black on the training data, your labels or your optimization metric is probably wrong.
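The question itself is about Caffe, but as an illustration of this check in PyTorch terms (model and train_loader are placeholder names for a trained segmentation network and its training loader):

import torch

model.eval()                                    # placeholder: your trained network
with torch.no_grad():
    images, labels = next(iter(train_loader))   # placeholder: a batch from the *training* set
    preds = model(images).argmax(dim=1)         # per-pixel class predictions
    print('predicted classes:', preds.unique())
    print('label classes:    ', labels.unique())
# If predictions match the labels closely here but not on unseen data, you are overfitting;
# if they are all background (black) even on training data, suspect the labels or the loss setup.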
Should I change the lr_mult value in the Deconvolution layer from 0 to some other value?
lr_mult = 0 means this layer does not learn (source, source 2). If you want that layer to learn, you should set it to a positive value. Depending on your initialization, this might very well be the reason why the image is black.