What happens if I change solver or train prototxt while training - caffe

In Caffe, what happens if I change some parameters in the solver or train prototxt while training a network using those files (for example, running another training with the updated solver/train prototxt)? Does it affect the running training, or is the content of the files loaded at the beginning so that the training is unaffected by later changes?

Prototxt files are read from disk the moment you call caffe train from the command line or caffe.Net/caffe.get_solver via the Python interface, and never again. The solver or network is instantiated with those parameters, and any further changes to the files are irrelevant (until you manually reload, of course).
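For illustration, a minimal pycaffe sketch of that point (the file name is a placeholder): the prototxt files are parsed exactly once, when the solver object is constructed.

import caffe
caffe.set_mode_gpu()
solver = caffe.get_solver('solver.prototxt')   # prototxt files are parsed here, once
solver.step(100)                               # editing the files on disk now changes nothing
solver = caffe.get_solver('solver.prototxt')   # only re-instantiating picks up the edits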

Related

When we validate the model after one epoch training in Pytorch Lightning, can we just use one GPU?

I am trying to run code written with PyTorch Lightning on one machine with multiple GPUs.
The running setting is:
--gpus=1,2,3,4
--strategy=ddp
There is no problem during training. But when the model is validated after one epoch of training, it still runs on multiple GPUs (multiple processes), so the validation dataset is split and assigned to different GPUs. When the code tries to write the prediction file and compute the scores against the gold file, it runs into problems. Source Code
So I just want to shut down DDP when I validate the model, and run it only on local_rank 0.
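A common workaround (not from the original post; the hooks and names below are illustrative and assume a Lightning 1.x-style module) is to gather the per-rank validation outputs and let only global rank 0 write the prediction file and compute the scores:

import torch
import pytorch_lightning as pl
from pytorch_lightning.utilities import rank_zero_only

class MyModel(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        preds = self(batch['input'])                 # hypothetical forward pass
        return {'preds': preds}

    def validation_epoch_end(self, outputs):
        # all_gather collects each DDP process's shard; the result carries an
        # extra leading world-size dimension
        preds = self.all_gather(torch.cat([o['preds'] for o in outputs]))
        self.write_and_score(preds)

    @rank_zero_only
    def write_and_score(self, preds):
        # only global rank 0 runs this; the decorator makes it a no-op on other ranks
        ...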

net surgery on a custom caffe model

I'm trying to modify the weights of a caffemodel which is part of a caffe branch called DeepLab. Although there is a tutorial on how to do net surgery, when I try to do the same with my custom caffemodel the Python kernel always dies on the following line:
# Load the original network and extract the fully connected layers' parameters.
net = caffe.Net('../models/deeplab/train.prototxt',
                '../models/deeplab/train.caffemodel',
                caffe.TRAIN)
I think it's because pycaffe doesn't know their custom layers such as ImageSegData, Silence and SegAccuracy, so I removed these layers from the prototxt file, but the Python kernel still keeps dying when I try to load the network model. Does anyone know how to load these weights into Python?
I found the answer already: I literally had to remove every custom layer and, in particular, adapt the data layer so that it could read all the input images and thereby determine the input dimensions.
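For reference, a minimal sketch of the resulting workflow (paths and the stripped prototxt name are placeholders): remove the custom layers from a copy of the prototxt, give the data layer explicit input dimensions, then load the weights and read them through net.params.

import caffe
caffe.set_mode_cpu()
# 'train_stripped.prototxt' is a copy of train.prototxt with ImageSegData, Silence and
# SegAccuracy removed, and a plain data/Input layer carrying the original input dimensions.
net = caffe.Net('../models/deeplab/train_stripped.prototxt',
                '../models/deeplab/train.caffemodel',
                caffe.TEST)
for name, blobs in net.params.items():
    print(name, [b.data.shape for b in blobs])   # weight and bias shapes per learnable layer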

Testing a network without setting number of iterations

I have a pre-trained network with which I would like to test my data. I defined the network architecture using a .prototxt, and my data layer is a custom Python layer that receives a .txt file with the paths of my data and their labels, preprocesses them, and then feeds them to the network.
At the end of the network, I have a custom Python layer that gets the class prediction made by the net and the label (from the first layer) and prints, for example, the accuracy over all batches.
I would like to run the network until all examples have passed through the net.
However, while searching for the command to test a network, I've found:
caffe test -model architecture.prototxt -weights model.caffemodel -gpu 0 -iterations 100
If I don't set the -iterations, it uses the default value (50).
Does any of you know a way to run caffe test without setting the number of iterations?
Thank you very much for your help!
No, Caffe does not have a facility to detect that it has run exactly one epoch (use each input vector exactly once). You could write a validation input routine to do that, but Caffe expects you to supply the quantity. This way, you can generate easily comparable results for a variety of validation data sets. However, I agree that it would be a convenient feature.
The lack of this feature might be related to its lack for training and the interstitial testing.
In training, we tune the hyper-parameters to get the most accurate model for a given application. As it turns out, this is more closely dependent on the total number of input images processed than on the number of epochs (given a sufficiently large training set).
With a fixed training set, we often graph accuracy (y-axis) against epochs (x-axis), because that gives tractable results as we adjust batch size. However, if we cut the size of the training set in half, the most comparable graph would scale by the total number of images processed rather than the epoch number.
Also, by restricting the size of the test set, we avoid long waits for that feedback during training. For instance, in training against the ImageNet data set (1.2M images), I generally test with around 1000 images, typically no more than 5 times per epoch.
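Since the tool will not infer one epoch for you, the usual workaround is to compute the iteration count yourself from the size of the test set and the batch size declared in the prototxt; a small sketch (the numbers and file names are placeholders):

import math
import subprocess

num_examples = 10000   # e.g. the number of lines in the .txt list your Python data layer reads
batch_size = 50        # must match the batch size in architecture.prototxt
iterations = math.ceil(num_examples / batch_size)   # one full pass over the data

# Note: if num_examples is not a multiple of batch_size, the final batch will wrap
# around to the start of the list, so a few examples are counted twice.
subprocess.run(['caffe', 'test',
                '-model', 'architecture.prototxt',
                '-weights', 'model.caffemodel',
                '-gpu', '0',
                '-iterations', str(iterations)],
               check=True)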

Caffe snapshots: .solverstate vs .caffemodel

When training a network, the snapshots taken every N iterations come in two forms together. One is the .solverstate file, which I presume is exactly what it sounds like, storing the state of the loss functions and gradients, etc. The other is the .caffemodel file which I know stores the trained parameters.
The .caffemodel is the file you need if you want a pre-trained model, so I imagine it's also the file you want if you are going to test your network.
What is the .solverstate good for? In this tutorial it looks like you can restart training from it, but how does that differ from using the .caffemodel? Does .solverstate also include the same info as .caffemodel? Put another way, is .caffemodel just a subset of .solverstate?
The solverstate file, as its name conveys, stores the state of the solver and not any information related to classification results. The model is saved as a caffemodel file, which you can use to obtain classification results for your data. If you want to fine-tune your network you may use a pre-trained caffemodel file. This will save time, as your network does not need to learn from scratch. But in case your present training needs to be halted, due to a power cut or an unexpected reboot, you may resume your training from the previous snapshot of the solverstate. The difference between using the solverstate and the caffemodel files is that the former allows you to complete your training in the pre-determined manner, while the latter may require changes in certain training parameters such as the maximum number of iterations.
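In pycaffe terms (the file names below are placeholders), the distinction looks roughly like this: restoring a .solverstate resumes the exact solver trajectory (iteration counter, learning-rate schedule, momentum history), while copy_from a .caffemodel only initializes the weights, so the solver state starts over.

import caffe

# Resume an interrupted run exactly where it stopped:
solver = caffe.get_solver('solver.prototxt')
solver.restore('snapshot_iter_10000.solverstate')
solver.solve()

# Warm-start / fine-tune: only the learned weights are copied, solver state is fresh:
solver = caffe.get_solver('solver.prototxt')
solver.net.copy_from('snapshot_iter_10000.caffemodel')
solver.solve()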

Measuring Training Error in Caffe

I'm working with Caffe and am interested in comparing my training and test errors in order to determine if my network is overfitting or underfitting. However, I can't seem to figure out how to have Caffe report training error. It will show training loss (the value of the loss function computed over the batch), but this is not useful in determining if the network is overfitting/underfitting. Is there a straightforward way to do this?
I'm using the Python interface to Caffe (pycaffe). If I could get access to the raw training set somehow, I could just put batches through with forward passes and evaluate the results. But, I can't seem to figure out how to access more than the currently-processing batch of training data. Is this possible? My data is in a LMDB format.
In the train_val.prototxt file change the source in the TEST phase to point to the training LMDB database (by default it points to the validation LMDB database) and then run this command:
$ ./build/tools/caffe test -model models/bvlc_reference_caffenet/train_val.prototxt -weights models/bvlc_reference_caffenet/<caffenet_train_iter>.caffemodel -gpu 0
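Alternatively, from pycaffe (paths, blob name, and batch count below are placeholders), you can load the net in the TEST phase, with its TEST-phase data layer pointed at the training LMDB as described above, and average an accuracy blob over enough forward passes to cover the training set:

import caffe

caffe.set_mode_gpu()
net = caffe.Net('models/bvlc_reference_caffenet/train_val.prototxt',   # TEST source = training LMDB
                'path/to/weights.caffemodel',
                caffe.TEST)

num_batches = 1000    # enough batches to cover the training set once
total = 0.0
for _ in range(num_batches):
    net.forward()
    total += float(net.blobs['accuracy'].data)   # requires an Accuracy layer named 'accuracy'
print('training accuracy:', total / num_batches)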