Which iteration's weights are saved for deployment and testing? - caffe

I'm training a unet neural network. During training, each iteration has a "loss value". This value generally converges, but sometimes jumps around. What weights are finally saved in the .caffemodel file?
What happens if I save it at iteration 20000, and that just so happens to be a point where the loss jumped up a bit, and isn't the lowest loss that it has seen? Are the weights and biases saved from the last iteration or something smarter like the lowest of last 5% iterations?
Thank you

solver.prototxt has a parameter called "snapshot":
net: "path/to/train.prototxt"
.
.
max_iter: 20000
snapshot: 1000
snapshot_prefix: "path/to/caffemodel/"
solver_mode: GPU
For example, if you set snapshot: 1000, Caffe saves one .caffemodel file every 1000 iterations. Each file holds the weights exactly as they were at that iteration, regardless of whether the loss was lower at an earlier iteration.
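Because snapshots are taken purely by iteration count, choosing the "best" weights is left to you. A minimal sketch, assuming you have logged a validation loss at each snapshot iteration (the loss values below are invented for illustration) and that snapshot files are named `<snapshot_prefix>_iter_<N>.caffemodel`; `best_snapshot` is a hypothetical helper, not part of Caffe:

```python
def best_snapshot(val_losses, snapshot_prefix):
    """Return (iteration, path) of the snapshot with the lowest validation loss."""
    best_iter = min(val_losses, key=val_losses.get)
    # Assumed naming scheme: <snapshot_prefix>_iter_<N>.caffemodel
    path = f"{snapshot_prefix}_iter_{best_iter}.caffemodel"
    return best_iter, path


# Hypothetical validation losses recorded at the last three snapshots.
val_losses = {18000: 0.212, 19000: 0.198, 20000: 0.231}

best_iter, path = best_snapshot(val_losses, "path/to/caffemodel/")
print(best_iter, path)
```

Here the final snapshot (iteration 20000) is not the one selected, which is the situation the question describes.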

Related

Lower batch size in the last iteration of first training epoch than the other iteration

I'm trying to train a deep neural network model. The output of each iteration in one epoch has dimensions like [64,1600,8] (64 is the batch size). But in the last iteration of the first epoch, the output changed to [54,1600,8] and I got a dimension error. Why did the batch size change in the last iteration?
Additionally, if I change the batch size to 32, the last iteration's output is [22,1600,8].
I expected the output of the last iteration to have the same shape as in the other iterations.
The last iteration batch size changed because you did not have enough data to completely fill the batch. If you have a batch size of 10, for example, and you have 101 entries total in your data, then you will have 10 batches of 10 and 1 batch of 1.
The solution is to either drop the batch if it is not the correct size, or to adapt your model so that it will detect the size of the batch and change accordingly, instead of having the batch size hard-coded in to your model parameters.
Seeing that you are using PyTorch, I'll add to the answer by Richard: PyTorch DataLoaders have built-in functionality to drop the last (incomplete) batch. According to the documentation, you can specify drop_last=True when instantiating the DataLoader.
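The arithmetic behind the smaller final batch can be sketched in plain Python. This is only an illustration of the counting, not of how any DataLoader is implemented; `batch_sizes` is a hypothetical helper, and the dataset size 1014 is an assumed value chosen because it reproduces the numbers in the question:

```python
def batch_sizes(num_samples, batch_size, drop_last=False):
    """Sizes of the batches produced by one pass over the dataset."""
    full, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    # The leftover samples form a smaller final batch unless it is dropped.
    if remainder and not drop_last:
        sizes.append(remainder)
    return sizes


# Example from the answer: 101 samples with a batch size of 10.
print(batch_sizes(101, 10))                  # ten batches of 10, then one of 1
print(batch_sizes(101, 10, drop_last=True))  # the incomplete batch is dropped

# Assumed dataset size that matches the question's last-batch sizes.
print(batch_sizes(1014, 64)[-1], batch_sizes(1014, 32)[-1])
```

With the incomplete batch dropped, every batch has the same first dimension, at the cost of skipping a few samples per epoch.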

Is it possible to execute from the point where the neural network model is interrupted?

Assume that I am training a neural network model. I am storing the tensor file of the neural network model for every 15 epochs in .pth format.
I need to run 1000 epochs in total. Suppose I stopped my program during the 501st epoch, then I have the following files
15.pth, 30.pth, 45.pth, 60.pth, 75.pth,.... 420.pth, 435.pth, 450.pth, 465.pth, 480.pth, 495.pth
Then my doubt is
Is it possible to use the last stored model 495.pth and continue execution as it generally happens if done without any interruption? In short, I am asking for something similar to the "resumption" of the training phase with a few modifications to the existing code. I am just asking for such a possibility.
I am asking for general practice and not particular to any code. If such a method exists, I will be free to stop any program under execution and can resume later. Currently, I cannot use resources for shorter programs if longer programs are in execution and hence I am asking this question.
In order to resume training from a checkpoint, you need to save the entire state of your training process. This includes:
Current weights of the model.
State of the optimizer: most optimizers keep track of different statistics of the updates, e.g., momentum, variance etc.
State of the learning rate scheduler.
Additional "state" variables unique to your code.
If you saved all this information, you should be able to fully restore the "state" of your training process and resume from that point.
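As a framework-independent illustration of the list above, here is a minimal sketch using a toy problem: gradient descent with momentum on f(w) = (w - 3)^2. The toy objective, the decay factor, the file name, and the helper names are all assumptions made for the example. The checkpoint stores the weight, the optimizer's momentum term, the scheduled learning rate, and the epoch counter.

```python
import json
import os
import tempfile


def fresh_state():
    # Everything needed to continue training lives in this one dict.
    return {"epoch": 0, "w": 0.0, "velocity": 0.0, "lr": 0.1}


def train(state, num_epochs, checkpoint_path=None, checkpoint_every=5):
    """Train until `num_epochs` is reached, starting from `state`."""
    while state["epoch"] < num_epochs:
        grad = 2.0 * (state["w"] - 3.0)
        # Optimizer state: the momentum term carries over between steps.
        state["velocity"] = 0.9 * state["velocity"] - state["lr"] * grad
        state["w"] += state["velocity"]
        state["lr"] *= 0.99  # scheduler state: decayed learning rate
        state["epoch"] += 1
        if checkpoint_path and state["epoch"] % checkpoint_every == 0:
            with open(checkpoint_path, "w") as f:
                json.dump(state, f)
    return state


# Reference: an uninterrupted run of 20 epochs.
full = train(fresh_state(), 20)

# Interrupted run: stops at epoch 12, so the last checkpoint is from epoch 10.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
train(fresh_state(), 12, checkpoint_path=path)

with open(path) as f:
    restored = json.load(f)
resumed_from = restored["epoch"]
resumed = train(restored, 20)

print(resumed_from, resumed == full)
```

Because the full state is restored, the resumed run ends in the same state as the uninterrupted one. Restoring only `w` would restart the momentum and learning rate from their initial values and generally give a different trajectory.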
So what I do is the following:
After each epoch I save my model's weights into a .pt file, and each time I run my program I check whether the resume argument is set to True. If so, I initialize the model using the weights in the .pt file and just continue training; if not, I initialize random weights as normal. Model, Loss, Optimizer and dataloader below are placeholders for your own classes and data. This could look like this:
    import torch

    def train(resume: bool = False):
        model = Model()
        if resume:
            # Restore the weights saved at the end of the last completed epoch.
            model.load_state_dict(torch.load("weights.pt"))
        criterion = Loss()
        optimizer = Optimizer(model.parameters())
        model.train()  # switch to training mode once, before the loop
        for epoch in range(100):
            for data, targets in dataloader:
                optimizer.zero_grad()
                predictions = model(data)
                loss = criterion(predictions, targets)
                loss.backward()
                optimizer.step()
            # Overwrite the checkpoint after every epoch.
            torch.save(model.state_dict(), "weights.pt")
So if I interrupt the training, I can still continue after my last epoch that I saved.
Normally you are logging more stuff than only the weights, for example the learning-rate scheduler or simply the loss and accuracy history. For that you could save the training history into a json file and read it out if resume is True.

Deep learning model stuck in local minima or overfit?

I trained a 10-class image classification model by fine-tuning EfficientNet-B4 for 100 epochs. I split my training data 70/30. I used stochastic gradient descent with Nesterov momentum of 0.9 and a starting learning rate of 0.001. The batch size is 10. The test loss seemed to be stuck at 84% for the next 50 epochs (51st to 100th). I do not know whether the model is stuck in a local minimum or is overfitting. Below is an image of the test and train loss from the 51st to the 100th epoch. I need your help. Thanks.
[Image: train and test loss from the 51st to the 100th epoch]
From the graph you provided, both validation and training losses are still going down so your model is still training and there is no overfit. If your test set is stuck at the same accuracy, the reason is probably that the data you are using for your training/validation dataset does not generalize well enough on your test dataset (in your graph the validation only reached 50% accuracy while your test set reached 84% accuracy).
I looked at your training and validation graph. Yes, your model is training and the losses are going down, but your validation error is near 50%, which means 'random guess'.
Possible reasons:
1- From your training error (shown in the image for epochs 50-100), the error is going down on average, but it fluctuates randomly: your error at epoch 100 is pretty much the same as at epoch 70. This could be because your dataset is too simple and you are forcing a huge network like EfficientNet to overfit it.
2- It could also be the way you are fine-tuning. There could be a problem there, such as which layers you froze and which layers you compute gradients for during backpropagation. I am assuming you are using pre-trained weights.
3- Optimizer issue. Try using Adam.
It would be great if you can provide total losses (from epoch 1 - 100).

The last step of each epoch takes too long

I'm using Keras. When I run model.fit_generator(...), it goes 1 step per about 1.5 second, but the last step takes a few minutes.
Epoch 1/50
30/31 [============================>.] - ETA: 0s - loss: 2.0676 - acc: 0.2010
Why?
This happens because you are giving validation data to Keras, through a parameter in model.fit or model.fit_generator.
After each epoch, Keras takes the validation data and evaluates the model on this data, which implies one forward pass for each validation data point, which might take a lot of time and might seem that Keras is stuck, but it is necessary when training a model.
I faced this issue while training a CNN, and found that decreasing the image dimensions speeds up training. Processing time drops because the smaller input reduces the work in both the forward pass and backpropagation. For example, if you are using a CNN for image classification, 64*64 images (16 times fewer pixels than 256*256) are processed much faster than 256*256 images, though obviously at the cost of losing information due to the lower resolution.

How do I prevent Keras from always predicting the underlying distribution of my data?

I am training a Deep CNN on a very unbalanced data set for a binary classification problem. I have 90% 0's and 10% 1's. To penalize the misclassification of 1, I am using a class_weight that was determined by sklearn's compute_class_weight(). In the validation tuple passed to the fit_generator(), I am using a sample_weight that was computed by sklearn's compute_sample_weight().
The network seems to be learning fine but the validation accuracy continues to be 90% or 10% after every epoch. How can I solve this data unbalance issue in Keras considering the steps I have already taken to overcome it?
Picture of fit_generator: fit_generator()
Picture of log outputs: log outputs
It's very strange that your val_accuracy jumps from 0.9 to 0.1 and back. Is your learning rate right? Try lowering it even more.
And my advice: also use the F1 metric.
How did you split the data? Do the classes in the training set have the same proportions as in the test set?
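To see why F1 is more informative than accuracy here, consider a minimal sketch in plain Python. It assumes a validation set with the same 90/10 split as in the question and two degenerate classifiers that always predict a single class; the helper functions are written out by hand for illustration rather than taken from any library.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: treat F1 as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Assumed validation labels: 90% zeros, 10% ones.
y_true = [0] * 90 + [1] * 10
always_zero = [0] * 100
always_one = [1] * 100

print(accuracy(y_true, always_zero), f1_score(y_true, always_zero))
print(accuracy(y_true, always_one), f1_score(y_true, always_one))
```

Always predicting 0 gives 90% accuracy and always predicting 1 gives 10%, which matches the two values reported in the question. One plausible reading is that the network flips between predicting a single class for everything. The F1 score for the positive class stays low in both cases, so it exposes the problem that accuracy hides.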