I'm trying to train ArcFace with reference to a GitHub implementation.
As far as I know, ArcFace requires more than 200 training epochs on CASIA-WebFace with a large batch size.
After around 100 epochs of training, I stopped training for a while because I needed the GPU for other tasks. The checkpoints of the model (ResNet) and the margin were saved. Before training was stopped, the loss was between 0.3 and 1.0, and training accuracy had climbed to 80-95%.
When I resume ArcFace training by loading the checkpoint files with load_state_dict, the first batch looks normal, but then the loss suddenly increases sharply and the accuracy drops very low.
How did this happen? I had no other option, so I continued training anyway, but the loss doesn't seem to be decreasing well even though the model had already been trained for over 100 epochs...
When I searched for similar issues, people said the problem was that the optimizer state was not saved (the reference GitHub page doesn't save the optimizer, and neither did I). Is that true?
My losses after loading the checkpoint:
If you look at this line, you are decaying the learning rate of each parameter group by gamma.
This had changed your learning rate by the time you reached the 100th epoch, and moreover you did not save your optimizer state when saving your model.
As a result, your code restarted with the initial learning rate, i.e. 0.1, after resuming training.
And that is what spiked your loss again.
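For what it's worth, here is a minimal sketch of saving and restoring the optimizer and LR scheduler state along with the model, assuming a PyTorch setup (which load_state_dict suggests); the modules, schedule values, and file name below are placeholders, not taken from the reference repo:

    import torch
    import torch.nn as nn

    # Illustrative stand-ins; in the real code these would be the ResNet backbone,
    # the ArcFace margin head, and the schedule used by the reference repo.
    model = nn.Linear(512, 512)
    margin = nn.Linear(512, 10575)  # e.g. one logit per CASIA-WebFace identity
    optimizer = torch.optim.SGD(
        list(model.parameters()) + list(margin.parameters()),
        lr=0.1, momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    epoch = 100

    # Saving: keep optimizer and scheduler state alongside the weights,
    # so the decayed learning rate and momentum buffers survive a restart.
    torch.save({
        "epoch": epoch,
        "model": model.state_dict(),
        "margin": margin.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }, "checkpoint.pth")

    # Resuming: restore everything, not just the model weights.
    ckpt = torch.load("checkpoint.pth", map_location="cpu")
    model.load_state_dict(ckpt["model"])
    margin.load_state_dict(ckpt["margin"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    start_epoch = ckpt["epoch"] + 1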
I am working on reproducing the results reported in this paper. A U-Net-based network is used to estimate a sound speed map from raw ultrasound channel data. I have been stuck for a long time trying to reduce the train/val loss further.
Basically, I followed their data simulation and preprocessing methods and used the same network architecture and hyperparameters (including kernel initializer, batch size, decay rate, etc.). The input size is 128×1024 rather than 192×2048, to match my ultrasound probe (according to their recent paper, the input size shouldn't affect performance).
So my question is: based on your experience, do you have any suggestions for investigating this problem further?
I attached my results (RMSE loss curve and estimated sound speed image), together with their results for comparison.
It seems my network fails to converge comparably in the background region, which could explain why my initial loss is larger.
P.S. Unfortunately, the paper doesn't provide code, so I have no clue about some details of the data simulation and training. I have contacted the author but haven't received a response yet.
The author mentioned somewhere that instead of using a pixel-wise MSE, one could try a larger window size such as 3×3 or 5×5. I am not clear whether this is meant for training or for metric evaluation. Is there any reference for the former?
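In case it helps, here is a minimal sketch of one possible reading of a windowed MSE for training, assuming a PyTorch setup: prediction and target are averaged over local 3×3 (or 5×5) windows before computing the MSE, so the loss penalises window-level rather than strictly pixel-level disagreement. This is only my guess at what the author meant, not their actual implementation:

    import torch
    import torch.nn.functional as F

    def windowed_mse(pred, target, window=3):
        # pred, target: (N, 1, H, W) maps; shapes below are illustrative.
        # Average both maps over local windows, then compare the averages.
        pad = window // 2
        pred_avg = F.avg_pool2d(pred, kernel_size=window, stride=1, padding=pad)
        target_avg = F.avg_pool2d(target, kernel_size=window, stride=1, padding=pad)
        return F.mse_loss(pred_avg, target_avg)

    # Example usage with random tensors of the input size mentioned above.
    pred = torch.rand(2, 1, 128, 1024)
    target = torch.rand(2, 1, 128, 1024)
    loss = windowed_mse(pred, target, window=5)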
I'm training a CNN U-Net model for semantic segmentation of images, but the training loss seems to decrease at a much faster rate than the validation loss. Is this normal?
I'm using a loss of 0.002.
The training and validation loss can be seen in the image below:
Yes, this is perfectly normal.
As the NN learns, it fits the training samples, which it gets to know better at each iteration. The validation set is never used during training, which is why it is so important.
Basically:
as long as the validation loss decreases (even slightly), it means the NN is still able to learn/generalise better,
as soon as the validation loss stagnates, you should stop training,
if you keep training, the validation loss will likely increase again; this is called overfitting. Put simply, it means the NN learns the training data "by heart" instead of really generalising to unknown samples (such as those in the validation set).
We usually use early stopping to avoid the last case: basically, if your validation loss doesn't improve in X iterations, stop training (X being a value such as 5 or 10).
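Here is a minimal sketch of that rule in plain Python, with the training and evaluation steps passed in as callables so the snippet stays self-contained:

    def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=200, patience=10):
        # train_one_epoch() runs one pass over the training data;
        # evaluate() returns the current validation loss.
        best_val = float("inf")
        bad_epochs = 0
        for epoch in range(max_epochs):
            train_one_epoch()
            val_loss = evaluate()
            if val_loss < best_val:
                best_val, bad_epochs = val_loss, 0   # improvement: reset the counter
            else:
                bad_epochs += 1
                if bad_epochs >= patience:           # no improvement for `patience` epochs
                    break
        return best_val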
I have a DQN agent that is trained on a specific network to perform a task. However, while training the agent I noticed that, after an initial number of epochs in which the agent shows a general increase in its score on the task, there is suddenly a drastic drop in performance, as if the agent were starting out afresh. This happens a number of times.
My agent's performance fluctuates from bad to good and back again. Is this normal for DQN agents? What diagnostics should I perform to remove such fluctuations? I have used experience replay and an exploration-exploitation strategy for the agent. I am relatively new to the field, so the question may be pretty trivial.
These fluctuations are normal until the agent reaches an optimal level. In most reinforcement learning experiments and papers, results are shown as a weighted average with a window size of 15-30. Here is a graph from my DQN implementation.
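For reference, here is a minimal sketch of that kind of smoothing using a plain moving average (the window of 20 is just an example in the 15-30 range; papers sometimes use a weighted variant instead):

    import numpy as np

    def moving_average(scores, window=20):
        # Smooth a list of per-episode scores with a sliding window,
        # which is how RL training curves are usually reported.
        scores = np.asarray(scores, dtype=float)
        kernel = np.ones(window) / window
        return np.convolve(scores, kernel, mode="valid")

    # Example: noisy raw scores vs. their smoothed curve.
    raw = np.random.randn(500).cumsum()
    smoothed = moving_average(raw, window=20)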
My apologies, since my question may sound like a stupid one, but I am quite new to deep learning and Caffe.
How can we determine how many iterations are required to fine-tune a pre-trained network on our own dataset? For example, I am running fcn32 on my own data with 5 classes. When can I stop the fine-tuning process by looking at the loss and accuracy of the training phase?
Many thanks
You shouldn't do it by looking at the loss or accuracy of the training phase. Theoretically, the training accuracy should always be increasing (which also means the training loss should always be decreasing), because you are training the network to decrease the training loss. But a high training accuracy doesn't necessarily mean a high test accuracy; that is what we refer to as the over-fitting problem. So what you need to find is the point where the accuracy on the test set (or validation set, if you have one) stops increasing. You can do this simply by specifying a relatively large number of iterations at first and then monitoring the test accuracy or test loss; if the test accuracy stops increasing (or the loss stops decreasing) for N consecutive iterations (or epochs), where N could be 10 or some other number you specify, then stop the training process.
The best thing to do is to track training and validation accuracy and store snapshots of the weights every k iterations. To compute validation accuracy you need a separate set of held-out data which you do not use for training.
Then, you can stop once the validation accuracy stops increasing or starts decreasing. This is called early stopping in the literature. Keras, for example, provides functionality for this: https://keras.io/callbacks/#earlystopping
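For instance, a minimal sketch of plugging that Keras callback into a toy model (the tiny model and random data below are only there to make the snippet self-contained):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.callbacks import EarlyStopping

    # Toy model and data, just to show where the callback plugs in.
    model = Sequential([Dense(16, activation="relu", input_shape=(8,)),
                        Dense(1, activation="sigmoid")])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    x = np.random.rand(1000, 8)
    y = (x.sum(axis=1) > 4).astype("float32")

    # Stop when the validation loss has not improved for 10 consecutive epochs,
    # and keep the best weights seen so far.
    early_stop = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
    model.fit(x, y, validation_split=0.2, epochs=200, callbacks=[early_stop], verbose=0)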
Also, it's good practice to plot the above quantities, because it gives you important insights into the training process. See http://cs231n.github.io/neural-networks-3/#accuracy for a great illustration (not specific to early stopping).
Hope this helps
Normally you converge to a specific validation accuracy for your model. In practice, you normally stop training if the validation loss has not decreased for x epochs. Depending on your epoch duration, x most commonly varies between 5 and 20.
Edit:
An epoch is one iteration over your training dataset, in ML terms. You do not seem to have a validation set. Normally the data is split into training and validation data so you can see how well your model performs on unseen data and make decisions about which model to take by looking at this data. You might want to take a look at http://caffe.berkeleyvision.org/gathered/examples/mnist.html to see the usage of a validation set, even though they call it a test set.
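As an illustration (the file names here are made up), a simple random split of a Caffe-style "path label" list file into training and validation lists could look like this:

    import random

    # Read a list of "path label" lines and split it 90/10 into train/val.
    with open("all_images.txt") as f:      # hypothetical input list
        lines = f.readlines()

    random.seed(0)
    random.shuffle(lines)
    split = int(0.9 * len(lines))

    with open("train.txt", "w") as f:
        f.writelines(lines[:split])
    with open("val.txt", "w") as f:
        f.writelines(lines[split:])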
I have a pre-trained network with which I would like to test my data. I defined the network architecture in a .prototxt, and my data layer is a custom Python layer that receives a .txt file with the paths of my data and their labels, preprocesses them, and then feeds them to the network.
At the end of the network, I have a custom Python layer that gets the class prediction made by the net and the label (from the first layer) and prints, for example, the accuracy over all batches.
I would like to run the network until all examples have passed through the net.
However, while searching for the command to test a network, I've found:
caffe test -model architecture.prototxt -weights model.caffemodel -gpu 0 -iterations 100
If I don't set -iterations, it uses the default value (50).
Does any of you know a way to run caffe test without setting the number of iterations?
Thank you very much for your help!
No, Caffe does not have a facility to detect that it has run exactly one epoch (i.e., used each input exactly once). You could write a validation input routine to do that, but Caffe expects you to supply the quantity. This way, you can generate easily comparable results for a variety of validation data sets. However, I agree that it would be a convenient feature.
The lack of this feature might be related to its absence for training and for the interstitial testing done during training.
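For what it's worth, a small sketch of computing the -iterations value for exactly one pass and invoking the command shown above (the list file name and batch size are assumptions about your setup, not taken from it):

    import math
    import subprocess

    batch_size = 32                          # must match the batch size in architecture.prototxt
    with open("test_list.txt") as f:         # hypothetical list file fed to the Python data layer
        num_examples = sum(1 for _ in f)

    iterations = math.ceil(num_examples / batch_size)   # one full pass over the data

    subprocess.run(["caffe", "test",
                    "-model", "architecture.prototxt",
                    "-weights", "model.caffemodel",
                    "-gpu", "0",
                    "-iterations", str(iterations)],
                   check=True)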
In training, we tune the hyper-parameters to get the most accurate model for a given application. As it turns out, this is more closely dependent on TOTAL_NUM than on the number of epochs (given a sufficiently large training set).
With a fixed training set, we often graph accuracy (y-axis) against epochs (x-axis), because that gives tractable results as we adjust batch size. However, if we cut the size of the training set in half, the most comparable graph would scale on TOTAL_NUM rather than the epoch number.
Also, by restricting the size of the test set, we avoid long waits for that feedback during training. For instance, in training against the ImageNet data set (1.2M images), I generally test with around 1000 images, typically no more than 5 times per epoch.