I trained FCN32 from scratch on my data, but unfortunately I am getting a black image as output. Here is the loss curve.
I am not sure whether this training loss curve is normal or not, or whether I have done something wrong.
I would really appreciate experts' ideas on this:
Why is the output a black image?
Is the network overfitting?
Should I change the lr_mult value in the Deconvolution layer from 0 to some other value?
Thanks a lot
Edited:
I changed the lr_mult value in the Deconvolution layer from 0 to 3, and the following shows the solver:
test_interval: 1000 #1000000
display: 100
average_loss: 100
lr_policy: "step"
stepsize: 100000
gamma: 0.1
base_lr: 1e-7
momentum: 0.99
iter_size: 1
max_iter: 500000
weight_decay: 0.0005
I got the following train-loss curve, and again I am getting a black image. I do not know what the mistake is or why it is behaving like this; could someone please share some ideas? Thanks
There is an easy way to check whether you are overfitting on the training data or just did something wrong in the algorithm: predict on the training data and look at the output. If it is very similar or equal to the desired output, you are overfitting, and you will probably have to apply dropout and weight regularization.
If the output is also black on the training data, your labels or your optimization metric is probably wrong.
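For example, with pycaffe you could run the trained net on a couple of training images and inspect the predicted label map; the file and blob names below ('deploy.prototxt', 'score', ...) are placeholders for whatever your FCN32 setup actually uses:

```python
import numpy as np
import caffe

# Hypothetical paths and blob names; adjust to your own deploy prototxt and snapshot.
net = caffe.Net('deploy.prototxt', 'fcn32_snapshot.caffemodel', caffe.TEST)

image = np.load('train_sample.npy')          # a preprocessed training image, shape (C, H, W)
net.blobs['data'].reshape(1, *image.shape)
net.blobs['data'].data[...] = image
net.forward()

prediction = net.blobs['score'].data[0].argmax(axis=0)   # per-pixel class labels
print(np.unique(prediction))   # if this prints only [0], the prediction really is all background (black)
```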
Should I change the lr_mult value in the Deconvolution layer from 0 to some other value?
lr_mult = 0 means this layer does not learn (source, source 2). If you want that layer to learn, you should set it to a positive value. Depending on your initialization, this might very well be the reason why the image is black.
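As a quick sanity check, you can compare the deconvolution weights across two snapshots and see whether they ever move away from their initialization; the layer name 'upscore' and the file names here are just guesses, so substitute your own:

```python
import numpy as np
import caffe

# Hypothetical file/layer names: compare the Deconvolution weights between two snapshots.
net_a = caffe.Net('train.prototxt', 'snapshot_iter_10000.caffemodel', caffe.TEST)
net_b = caffe.Net('train.prototxt', 'snapshot_iter_50000.caffemodel', caffe.TEST)

w_a = net_a.params['upscore'][0].data
w_b = net_b.params['upscore'][0].data
print('all zeros?', not w_a.any())                              # zero weights would explain a black output
print('changed between snapshots?', not np.allclose(w_a, w_b))  # False means the layer never learned
```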
I am training VGG11 on a custom image dataset for 3-way 5-shot image classification using MAML from learn2learn. I am encapsulating the whole VGG11 model with MAML, i.e., not just the classification head. My hyperparameters are as follows (a simplified sketch of the training loop is included after the list):
Meta LR: 0.001
Fast LR: 0.5
Adaptation steps: 1
First order: False
Meta Batch Size: 5
Optimizer: AdamW
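The loop itself follows the standard learn2learn MAML pattern, roughly like this simplified sketch (the task sampler, image size, and iteration count are placeholders, not my exact code):

```python
import torch
import torch.nn as nn
import learn2learn as l2l
from torchvision.models import vgg11_bn

def sample_task(ways=3, shots=5, image_size=128):
    """Placeholder 3-way 5-shot task sampler with random data; the real code uses my dataset."""
    support_x = torch.randn(ways * shots, 3, image_size, image_size)
    support_y = torch.arange(ways).repeat_interleave(shots)
    query_x = torch.randn(ways * shots, 3, image_size, image_size)
    query_y = torch.arange(ways).repeat_interleave(shots)
    return support_x, support_y, query_x, query_y

model = vgg11_bn(num_classes=3)                                    # the whole VGG11 is wrapped
maml = l2l.algorithms.MAML(model, lr=0.5, first_order=False)       # fast LR = 0.5
opt = torch.optim.AdamW(maml.parameters(), lr=0.001)               # meta LR = 0.001
loss_fn = nn.CrossEntropyLoss()
meta_batch_size, adaptation_steps = 5, 1

for iteration in range(100):                                       # iteration count is illustrative
    opt.zero_grad()
    meta_loss = 0.0
    for _ in range(meta_batch_size):
        support_x, support_y, query_x, query_y = sample_task()
        learner = maml.clone()
        for _ in range(adaptation_steps):
            learner.adapt(loss_fn(learner(support_x), support_y))  # inner-loop adaptation
        meta_loss += loss_fn(learner(query_x), query_y)            # outer-loop loss on the query set
    (meta_loss / meta_batch_size).backward()
    opt.step()                                                     # the outer-loop AdamW step
```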
During training, I noticed that after taking the first outer-loop optimization step, i.e., AdamW.step(), the loss skyrockets to very large values, in the tens of thousands. Is this normal? Also, I am measuring the micro F1 score as the accuracy metric; its curve for meta training/validation is as follows:
It fluctuates too much in my opinion; is this normal? What could be the reason for this?
Thanks
I figured it out. I was using VGG11 with vanilla BatchNorm layers from PyTorch, which was not working properly in the meta-training setup. I removed the BatchNorm layers and now it works as expected.
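In case it helps others, one way to strip the BatchNorm layers is to recursively replace them with identity modules, along the lines of this sketch (not necessarily the exact code I used):

```python
import torch.nn as nn
from torchvision.models import vgg11_bn

def strip_batchnorm(module: nn.Module) -> nn.Module:
    """Recursively replace every BatchNorm2d layer with a no-op Identity."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.Identity())
        else:
            strip_batchnorm(child)
    return module

model = strip_batchnorm(vgg11_bn(num_classes=3))
# Alternatively, torchvision's plain vgg11() has no BatchNorm layers to begin with.
```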
I am training a large neural network model (a 1-module Hourglass) for a facial landmark recognition task. The database used for training is WFLW.
The loss function is MSELoss() between the predicted output heatmaps and the ground-truth heatmaps. The training settings are listed below (a minimal sketch of this setup follows the list):
- Batch size = 32
- Adam Optimizer
- Learning rate = 0.0001
- Weight decay = 0.0001
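In code, this corresponds to roughly the following; the model here is just a stand-in for the Hourglass, the 98 heatmap channels assume the standard WFLW landmark count, and only the loss, learning rate, weight decay, and batch size are taken from the list above:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 98, kernel_size=3, padding=1)     # stand-in for the 1-module Hourglass
criterion = nn.MSELoss()                                # loss between predicted and ground-truth heatmaps
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

images = torch.randn(32, 3, 128, 128)                   # batch size 32, dummy inputs
target_heatmaps = torch.rand(32, 98, 128, 128)          # placeholder ground-truth heatmaps

optimizer.zero_grad()
loss = criterion(model(images), target_heatmaps)
loss.backward()
optimizer.step()
```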
As I am building a baseline model, I launched a basic experiment with the parameters shown above. I had previously trained a model with the exact same parameters but with weight-decay = 0, and it converged successfully. Thus, the problem is with the new weight-decay value.
I was expecting to observe a smooth loss curve that slowly decreased. As can be observed in the image below, the loss curve has a very weird shape.
This will probably be fixed by changing the weight-decay parameter (decreasing it, maybe?).
I would highly appreciate it if someone could provide a more in-depth explanation of the strange shape of this loss curve and its relation to the weight-decay parameter.
In addition, how can this premature convergence to the very specific value of 0.000415, with a very narrow standard deviation, be explained? Is it a strong local minimum?
Thanks in advance.
Loss should not consistently increase when using gradient descent. It does not matter whether you use weight decay or not: there is either a bug in your code (it is worth checking what happens with plain gradient descent rather than Adam, since there are ways to implement weight decay incorrectly with Adam), or your learning rate is too large.
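For reference, PyTorch's built-in options differ in how the decay is applied; a minimal sketch for comparison, where the model is a stand-in and only the learning rate and weight decay values come from the question:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 98, kernel_size=3, padding=1)   # stand-in for the actual Hourglass model

# Adam's weight_decay adds an L2 penalty to the gradients (coupled with the adaptive step sizes).
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

# AdamW applies decoupled weight decay directly to the weights, which is usually better behaved.
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

# Sanity check suggested above: plain SGD with the same decay and no adaptive moments.
opt_sgd = torch.optim.SGD(model.parameters(), lr=1e-4, weight_decay=1e-4)
```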
Thanks for looking at this question!
I attempted to train a simple DCGAN to generate room designs from a dataset of 215 coloured images of size 128x128. My attempt can be summarised as below (a rough sketch of the architecture follows the list):
Generator: 5 deconvolution layers from (100x1) noise input to (128x128x1) grayscale image output
Discriminator: 4 convolution layers from (128x128x1) grayscale image input
Optimizer: Adam at learning rate of 0.002 for both Generator and Discriminator
Batch size: 21 images/batch
Epochs: 100, with 10 batches/epoch
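The networks are roughly along the lines of this simplified sketch; the channel widths, normalization, and activations shown here are indicative rather than my exact code:

```python
import torch.nn as nn

latent_dim = 100

# Generator: 5 transposed-convolution layers, (100, 1, 1) noise -> (1, 128, 128) grayscale image
generator = nn.Sequential(
    nn.ConvTranspose2d(latent_dim, 512, 8, 1, 0), nn.BatchNorm2d(512), nn.ReLU(True),  # 1x1 -> 8x8
    nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),         # 8 -> 16
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),         # 16 -> 32
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),           # 32 -> 64
    nn.ConvTranspose2d(64, 1, 4, 2, 1), nn.Tanh(),                                     # 64 -> 128
)

# Discriminator: 4 convolution layers, (1, 128, 128) image -> single real/fake score
discriminator = nn.Sequential(
    nn.Conv2d(1, 64, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),     # 128 -> 64
    nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),   # 64 -> 32
    nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),  # 32 -> 16
    nn.Conv2d(256, 1, 16, 1, 0), nn.Sigmoid(),                      # 16 -> 1
)
```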
Results:
1. D-loss is close to 0 and G-loss is close to 1. After this, I cut my discriminator down by 2 convolution layers and reduced the Adam learning rate to 0.00002, hoping that the discriminator would not overpower my generator.
2. With the changes in (1), D-loss and G-loss hover around 0.5 - 1.0. However, the generated images still show noise even after 100 epochs.
Questions:
Is there something wrong in terms of how I trained my GAN?
How should I modify my approach to successfully train the GAN?
Thank you so much everyone for your help, I am really looking forward to your suggestions!
I trained an image classification model with 10 classes by finetuning EfficientNet-B4 for 100 epochs. I split my training data 70/30. I used stochastic gradient descent with Nesterov momentum of 0.9, a starting learning rate of 0.001, and a batch size of 10. The test accuracy seemed to be stuck at 84% for the last 50 epochs (51st - 100th). I do not know whether the model was stuck in a local minimum or was overfitting. Below is an image of the test and train loss from the 51st epoch to the 100th. I need your help a lot. Thanks. Train/test loss image from the 51st to the 100th epoch.
From the graph you provided, both validation and training losses are still going down, so your model is still training and there is no overfitting. If your test set is stuck at the same accuracy, the reason is probably that the data you are using for your training/validation dataset does not generalize well enough to your test dataset (in your graph the validation only reached 50% accuracy while your test set reached 84% accuracy).
I looked at your training and validation graph. Yes, your model is training and the losses are going down, but your validation error is near 50%, which essentially means a random guess.
Possible reasons:
1. From your training error (shown in the image between epochs 50-100), the error is going down on average, but it is noisy; for example, your error at epoch 100 is pretty much the same as at epoch 70. This could be because your dataset is too simple and you are forcing a huge network like EfficientNet to overfit it.
2. It could also be due to the way you are finetuning it: for example, which layers you froze and which layers you take gradients for during backpropagation (I am assuming you are using pre-trained weights). A sketch of a typical freezing setup is shown below this list.
3. Optimizer issue; try using Adam.
It would be great if you could provide the full losses (from epoch 1 to 100).
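To illustrate point 2, here is a minimal sketch of a common freezing setup, assuming the torchvision implementation of EfficientNet-B4 (your actual finetuning code may of course differ):

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b4

# Freeze the pretrained backbone and train only a new 10-class head.
model = efficientnet_b4(weights="IMAGENET1K_V1")

for param in model.features.parameters():
    param.requires_grad = False                                           # backbone stays frozen

model.classifier[1] = nn.Linear(model.classifier[1].in_features, 10)     # new 10-class head

# Only the trainable parameters go to the optimizer (Adam, as suggested in point 3).
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```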
Some of my parameters
base_lr: 0.04
max_iter: 170000
lr_policy: "poly"
batch_size = 8
iter_size =16
This is how the training process looks so far:
The loss here seems stagnant; is there a problem, or is this normal?
The solution for me was to lower the base learning rate by a factor of 10 before resuming training from a solverstate snapshot.
To achieve this same solution automatically, you can set the "gamma" and "stepsize" parameters in your solver.prototxt:
base_lr: 0.04
stepsize:10000
gamma:0.1
max_iter: 170000
lr_policy: "poly"
batch_size = 8
iter_size =16
With lr_policy set to "step", this will reduce your base_lr by a factor of 10 every 10,000 iterations.
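Concretely, under Caffe's "step" policy the effective learning rate is base_lr * gamma^floor(iter / stepsize); a quick illustrative calculation:

```python
# Effective learning rate under Caffe's "step" policy: base_lr * gamma ** (iter // stepsize)
base_lr, gamma, stepsize = 0.04, 0.1, 10000

for it in (0, 10000, 20000, 30000):
    lr = base_lr * gamma ** (it // stepsize)
    print(f"iteration {it:>6}: lr = {lr:g}")
# iteration      0: lr = 0.04
# iteration  10000: lr = 0.004
# iteration  20000: lr = 0.0004
# iteration  30000: lr = 4e-05
```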
Please note that it is normal for the loss to fluctuate between values, and even to hover around a constant value before making a dip. This could be the cause of your issue; I would suggest training well beyond 1800 iterations before falling back on the above implementation. Look up graphs of Caffe train-loss logs.
Additionally, please direct all future questions to the caffe mailing group. This serves as a central location for all caffe questions and solutions.
I struggled with this myself and didn't find solutions anywhere before I figured it out. Hope what worked for me will work for you!