I'm facing severe power cuts in my hometown and have had to restart my training multiple times. Any suggestions on how I can resume training from the last iteration point?
I am using Caffe and LMDB files.
Thanks in advance.
Caffe can save a "snapshot" of the training state at regular intervals. You can resume training from the last snapshot you have by simply running:
$CAFFE_ROOT/build/tools/caffe train -solver /path/to/solver.prototxt -snapshot /path/to/latest.solverstate
In your solver.prototxt you can define how often a snapshot is taken by setting
snapshot: 2500 # take a snapshot every 2500 iterations
The snapshot files are saved to the location defined by
snapshot_prefix: "/path/to/snaps"
There you will find both a .solverstate and a .caffemodel file written every 2500 iterations.
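If you use the Python interface instead of the command-line tool, the same resume can be done with pycaffe; note the paths below are placeholders, not your actual files:
import caffe

caffe.set_mode_gpu()  # or caffe.set_mode_cpu()

# Placeholder paths; point these at your own solver and the latest snapshot.
solver = caffe.get_solver('/path/to/solver.prototxt')
solver.restore('/path/to/latest.solverstate')  # restores weights, iteration count and solver history
solver.solve()                                 # continues training from the restored iteration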
I'm trying to train ArcFace with reference to a GitHub implementation.
As far as I know, ArcFace requires more than 200 training epochs on CASIA-WebFace with a large batch size.
About 100 epochs into training, I stopped the training for a while because I needed the GPU for other tasks, and I saved checkpoints of the model (ResNet) and the margin. Before it was stopped, the loss was somewhere between 0.3 and 1.0, and the training accuracy had climbed to 80~95%.
When I resume the ArcFace training by loading the checkpoint files with load_state, the first batch looks normal, but then the loss suddenly increases sharply and the accuracy drops very low.
Why did the loss suddenly jump like this? I had no other option, so I continued the training anyway, but the loss does not seem to be decreasing well, even though the model has already been trained for over 100 epochs...
When I searched for similar issues, the answers said the problem was that the optimizer state was not saved (the reference GitHub repository didn't save the optimizer, and neither did I). Is that true?
My losses after loading
If you look at that line (the learning-rate scheduler in your code), you are decaying the learning rate of each parameter group by gamma.
By the time you reached the 100th epoch, this scheduler had already reduced your learning rate. Moreover, you did not save your optimizer state when saving your model.
Because of that, your code resumed training with the initial learning rate of 0.1.
And that is what spiked your loss again.
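As a rough sketch of what saving and restoring the full training state could look like (the model, margin, optimizer and scheduler below are dummy stand-ins for your own ArcFace objects, and the file name is arbitrary):
import torch
import torch.nn as nn

# Dummy stand-ins for illustration only; replace with your ResNet backbone,
# ArcFace margin, and the optimizer/scheduler you actually use.
model = nn.Linear(512, 10)
margin = nn.Linear(10, 10)
optimizer = torch.optim.SGD(list(model.parameters()) + list(margin.parameters()),
                            lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
epoch = 100

# Save everything, not just the model and margin weights.
torch.save({
    'epoch': epoch,
    'model': model.state_dict(),
    'margin': margin.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scheduler': scheduler.state_dict(),
}, 'checkpoint.pth')

# When resuming, restore the optimizer and scheduler as well, so the learning
# rate and momentum buffers continue from where they left off.
ckpt = torch.load('checkpoint.pth')
model.load_state_dict(ckpt['model'])
margin.load_state_dict(ckpt['margin'])
optimizer.load_state_dict(ckpt['optimizer'])
scheduler.load_state_dict(ckpt['scheduler'])
start_epoch = ckpt['epoch'] + 1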
Vote if you find this useful
I have trained a YOLOv4 model using my original dataset and a custom yolov4 configuration file, which I will refer to as my 'base' YOLOv4 model.
Now I want to use this base model to train again on images that I have manually augmented, to try to increase the mAP and AP. So I want to use the weights from my base model to train a new YOLOv4 model with the manually augmented images.
I have seen on the YOLOv4 wiki page that setting stopbackward = 1 freezes layers so that the weights in those layers are not updated; however, this reduces accuracy. I also read that ./darknet partial cfg/yolov4.cfg yolov4.weights yolov4.conv.137 137 takes out the first 137 layers. Does this mean that the first 137 layers are frozen in the network, or does it mean you are only training those 137 layers?
My questions are:
Which code actually freezes layers so I can do transfer learning on the base YOLOv4 model I have created?
Which layers would you recommend freezing: the first 137, before the first YOLO layer in the network?
Thank you in advance!
To answer your questions:
If you want to use transfer learning, you don't have to freeze any layers. You should simply start training with the weights you have stored from your first run. So instead of darknet.exe detector train data/obj.data yolo-obj.cfg yolov4.conv.137 you can run darknet.exe detector train data/obj.data yolo-obj.cfg backup/your_weights_file. The weights are stored in the backup folder build\darknet\x64\backup\. So for example, the command could look like this: darknet.exe detector train data/obj.data yolo-obj.cfg backup/yolov4_2000.weights
Freezing layers can save time during training. A good approach is to first train the model with the first layers frozen, and later unfreeze them to fine-tune the learning. I am not sure what a good number of layers to freeze in the first run is; you may have to find that by trial and error.
The command "./darknet partial cfg/yolov4.cfg yolov4.weights yolov4.conv.137 137" dumps the weights from the first 137 layers in "yolov4.weights" into the file "yolov4.conv.137", and has nothing to do with training.
In Caffe, what happens if I change some parameters in the solver or train prototxt while a network is being trained with those files (and, e.g., run another training using the updated solver/train prototxt)? Does it affect the running training, or is the content of the files loaded at the beginning so that training is unaffected by later changes?
The prototxt files are read from disk the moment you call caffe train from the command line or caffe.Net/caffe.get_solver via the Python interface, and never again. The solver or network is instantiated with those parameters, and any further changes to the files are irrelevant (until you manually reload them, of course).
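For example, from the Python interface (paths are placeholders):
import caffe

# The solver and net prototxt files are parsed here, exactly once.
solver = caffe.get_solver('/path/to/solver.prototxt')

# Editing solver.prototxt or the train prototxt on disk at this point changes
# nothing for this solver object; it keeps the parameters it was built with.
solver.step(100)

# To pick up edited files, create a fresh solver (or net) from them.
solver = caffe.get_solver('/path/to/solver.prototxt')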
I am using Caffe to train a deep network and have set the display interval to 200 iterations in my solver prototxt file. However, instead of just the loss and accuracy, I am getting a large number of lines of the form
solver.cpp:245] Train net output #{no.}: prob = {no.}.
The prob values are the output probabilities of the softmax layer (the last layer in my network).
This display is of no real use to me. I am interested in seeing only how the accuracy evolves with the number of iterations. Can someone suggest a way to print only the relevant parameters to stdout? Is there a way to control, in general, what Caffe prints to stdout?
Thanks.
(Note: I am using the caffe executable for training on Ubuntu.)
I'm working with Caffe and am interested in comparing my training and test errors in order to determine if my network is overfitting or underfitting. However, I can't seem to figure out how to have Caffe report training error. It will show training loss (the value of the loss function computed over the batch), but this is not useful in determining if the network is overfitting/underfitting. Is there a straightforward way to do this?
I'm using the Python interface to Caffe (pycaffe). If I could get access to the raw training set somehow, I could just put batches through with forward passes and evaluate the results. But, I can't seem to figure out how to access more than the currently-processing batch of training data. Is this possible? My data is in a LMDB format.
In the train_val.prototxt file, change the source in the TEST phase to point to the training LMDB database (by default it points to the validation LMDB database) and then run this command:
$ ./build/tools/caffe test -model models/bvlc_reference_caffenet/train_val.prototxt -weights models/bvlc_reference_caffenet/<caffenet_train_iter>.caffemodel -gpu 0
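If you prefer to stay in pycaffe, a rough equivalent (assuming the TEST-phase data layer now points at the training LMDB and the net defines accuracy and loss output blobs; the snapshot name is the same placeholder as above) would be:
import caffe

caffe.set_mode_gpu()
net = caffe.Net('models/bvlc_reference_caffenet/train_val.prototxt',
                'models/bvlc_reference_caffenet/<caffenet_train_iter>.caffemodel',
                caffe.TEST)

n_batches = 100  # number of training batches to average over
acc = loss = 0.0
for _ in range(n_batches):
    out = net.forward()              # the data layer feeds the next LMDB batch automatically
    acc += float(out['accuracy'])
    loss += float(out['loss'])
print('training accuracy: %.4f  training loss: %.4f' % (acc / n_batches, loss / n_batches))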