Questions about starting training from a checkpoint using --recover in allennlp

For some reason, during training:
I want to store the checkpoint after every epoch and start training the next epoch from the stored checkpoint.
I want the training state to continue seamlessly across epochs. For example, when training epoch 2 from the checkpoint of epoch 1, the learning rate schedule, epoch number, etc. should be the same as if I had trained epochs 1 and 2 together (the vanilla training process).
My implementation uses the --recover argument. Allennlp stores a checkpoint after every epoch, so for every epoch after the first I add --recover to the training command, hoping that the model's parameters and training state will be restored.
However, the above implementation seems wrong because, in my testing, training epoch 2 from the checkpoint of epoch 1 gives different results from training epoch 2 and 1 together.
I tried hard to read the allennlp documentation but found it difficult to figure out the problem. Does anyone have comments on my implementation, or suggestions for other ways to fulfill my requirements? Thanks a lot!
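For reference, the workflow described above maps onto the allennlp CLI roughly like this (a sketch; `my_experiment.jsonnet` and `my_run/` are hypothetical names):

```shell
# First run: train and checkpoint into my_run/ after every epoch.
allennlp train my_experiment.jsonnet -s my_run

# Later runs: reuse the SAME serialization directory and add --recover,
# so that model weights AND trainer state (optimizer, LR scheduler,
# epoch counter) are restored from the latest checkpoint.
allennlp train my_experiment.jsonnet -s my_run --recover
```

One caveat worth noting: even with --recover, a resumed run is not guaranteed to be bit-identical to an uninterrupted one, because random state (data shuffling order, dropout masks) is not necessarily part of the checkpoint. That alone can explain small result differences between the two training regimes.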

Related

Compute the number of epoch from iteration in training data set?

I'm trying to solve a deep-learning problem in which I took 8,000 input data points in the training set with a batch size of 10 and 100 epochs, yet the per-epoch iteration counter reads 5359/5359 at every epoch.
What is the relation between
batch size,
epoch,
the total number of iterations,
number of input features and
the number of passes in the case of deep learning model training?
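The relationships asked about above can be sketched numerically (the figures from the question are used; the variable names are my own):

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """One iteration = one batch; the last batch may be partial,
    hence the ceiling division."""
    return math.ceil(num_samples / batch_size)

num_samples = 8000   # training data points from the question
batch_size  = 10
epochs      = 100

per_epoch = iterations_per_epoch(num_samples, batch_size)
total_iterations = per_epoch * epochs

print(per_epoch)          # 800 iterations per epoch
print(total_iterations)   # 80000 iterations over the whole run
```

Note that with these numbers one would expect the counter to read 800/800 per epoch, not 5359/5359, so the progress display in the question likely reflects a different effective dataset size or batch size. The number of input features does not enter this count at all; it affects the cost of each iteration, not how many there are.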

At which point adding new data to a training set, will not improve training accuracy

This is more of a general question about training a CNN, but the one I'm using is YOLO.
I started my training set for 'person' detections by labelling some data from different cameras' videos (in similar environments). Every time I added new data for a new camera I retrained YOLO, which actually improved the detection for that camera. For the training, I split my data randomly into training/validation sets and use the validation set to compute accuracy. This is not overfitting, as all the previous data are also used in the training.
Now I've gathered more than 100,000 labelled samples. I was expecting not to have to train anymore at this point, as my data set is pretty big. But it looks like I still need to: if I get a new camera video, label 500-1000 samples, add them to my huge data set, and train again, the accuracy improves for this camera.
I don't really understand why. Why do I still need to add new data to my set? Why does the accuracy improve so much on the new data, while they are 'drowned' in the thousands of already-existing samples? Is there a point where I will be able to stop training because adding new data will not improve the accuracy?
Thanks for sharing your thoughts and ideas!
Interesting question. If your data quality is good and the training procedure is 'perfect', you will always be able to generalize better. Think about all the infinitely many different images that you will want to detect. You are only using a sample of them, hoping that it is enough to generalize. You can keep increasing your dataset and might gain another 0.01%; the question is when you want to stop. Your model's accuracy will never be 100%.
My opinion: if you have a nice accuracy above 95%, stop generating more data, provided your project is personal and no one's life depends on it. Think about post-processing to improve the results. Since you are detecting on video, maybe try to follow the person's movement: if they are not detected in one frame but you have information from the previous and following frames, you might be able to do something fancy.
Hope it helps, cheers!
To create a good model you will of course need as many images as possible. But you have to watch whether your model becomes overfit, i.e. it is no longer learning: the average loss is getting higher and the mAP is getting lower. When overfitting occurs, you have to stop the training and choose the best weights that have been saved in the darknet/backup/ folder.
For YOLO, there are some guidelines that you can follow about when you should stop training. The most obvious is:
During training, you will see varying indicators of error, and you should stop when the 0.XXXXXXX avg value no longer decreases:
Region Avg IOU: 0.798363, Class: 0.893232, Obj: 0.700808, No Obj: 0.004567, Avg Recall: 1.000000, count: 8
Region Avg IOU: 0.800677, Class: 0.892181, Obj: 0.701590, No Obj: 0.004574, Avg Recall: 1.000000, count: 8
9002: 0.211667, 0.060730 avg, 0.001000 rate, 3.868000 seconds, 576128 images Loaded: 0.000000 seconds
9002 - iteration number (number of batch)
0.060730 avg - average loss (error) - the lower, the better
When you see that the average loss 0.xxxxxx avg no longer decreases over many iterations, you should stop training. The final average loss can range from 0.05 (for a small model and an easy dataset) to 3.0 (for a big model and a difficult dataset). I personally think a model with an avg loss of 0.06 is good enough.
AlexeyAB explained everything in detail in his GitHub repo; please read this section: https://github.com/AlexeyAB/darknet#when-should-i-stop-training
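The "stop when 0.xxxxxx avg no longer decreases" rule above can be automated by parsing the darknet progress lines. A minimal sketch (the line format is taken from the log excerpt above; the plateau thresholds are my own illustrative choices):

```python
import re

# Matches darknet progress lines such as:
#   "9002: 0.211667, 0.060730 avg, 0.001000 rate, 3.868000 seconds, ..."
LINE_RE = re.compile(r"^\s*(\d+):\s*([\d.]+),\s*([\d.]+)\s+avg")

def parse_avg_losses(log_lines):
    """Return (iteration, avg_loss) pairs found in darknet training output."""
    out = []
    for line in log_lines:
        m = LINE_RE.match(line)
        if m:
            out.append((int(m.group(1)), float(m.group(3))))
    return out

def has_plateaued(avg_losses, window=1000, min_improvement=0.001):
    """Heuristic stopping rule: the avg loss improved by less than
    `min_improvement` over the last `window` recorded values."""
    if len(avg_losses) < window:
        return False
    recent = avg_losses[-window:]
    return (recent[0] - min(recent)) < min_improvement

lines = ["9002: 0.211667, 0.060730 avg, 0.001000 rate, 3.868000 seconds, 576128 images"]
print(parse_avg_losses(lines))  # [(9002, 0.06073)]
```

The window size and improvement threshold are judgment calls; in practice one would also keep the intermediate weights from darknet/backup/ and pick the one with the best validation mAP rather than simply the last.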

Keras: Why is my testing accuracy unstable?

I designed and trained an Inception-ResNet model for image recognition. The network has learned well from the training dataset; however, the test accuracy is very unstable.
Here are some parameters and important information I used for the learning process:
The number of training samples: 40,000 images.
The number of test samples: 15,000 images.
Learning rate is set to 0.001 for the first 50 epochs, 0.0001 for the next 50 epochs and 0.00001 for the rest.
Batch size: 128
Dropout rate: 0.2
After 150 epochs, the learning curves (training loss and test accuracy) look like this:
[Figure: training loss and test accuracy]
I tried to increase the batch size. However, it is not the solution to my problem.
Thank you in advance for any help you might be able to provide.
Regards,
An Nhien.
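For concreteness, the step-wise learning-rate schedule described in the question can be written as a plain function (a sketch; in Keras it would typically be attached via a `LearningRateScheduler` callback, which is only referenced in a comment here so the snippet stays framework-free):

```python
def lr_schedule(epoch):
    """Step-wise schedule from the question:
    0.001 for epochs 0-49, 0.0001 for epochs 50-99, 0.00001 afterwards."""
    if epoch < 50:
        return 1e-3
    elif epoch < 100:
        return 1e-4
    return 1e-5

# In Keras this would usually be wired in as:
#   model.fit(..., callbacks=[keras.callbacks.LearningRateScheduler(lr_schedule)])

print(lr_schedule(0), lr_schedule(50), lr_schedule(149))  # 0.001 0.0001 1e-05
```

One common observation with such schedules: test accuracy tends to stabilize when the learning rate drops, so if the curve is still noisy after epoch 100, the learning rate in the final phase may still be too high for this model, independently of the batch size.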

Estimating the training time in convolutional neural network

I want to know whether it is possible to estimate the training time of a convolutional neural network, given parameters like depth, filter, size of input, etc.
For instance, I am working on a 3D convolutional neural network whose structure is like:
a (20x20x20) convolutional layer with stride of 1 and 8 filters
a (20x20x20) max-pooling layer with stride of 20
a fully connected layer mapping to 8 nodes
a fully connected layer mapping to 1 output
I am running 100 epochs and printing the loss (mean squared error) every 10 epochs. It has now been running for 24 hours and no loss has been printed (I suppose it has not completed 10 epochs yet). By the way, I am not using a GPU.
Is it possible to estimate the training time with a formula or something similar? Is it related to time complexity or to my hardware? I also found the following paper; will it give me some information?
https://ai2-s2-pdfs.s3.amazonaws.com/140f/467566e799f32831db6913d84ccdbdcac0b2.pdf
Thanks in advance.
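A rough back-of-the-envelope estimate is possible by counting multiply-accumulate operations (MACs). The sketch below is illustrative only: actual wall-clock time depends heavily on the framework and hardware, and the input dimensions used here are hypothetical, since the question does not state them.

```python
def conv3d_macs(in_shape, kernel, filters, stride=1, in_channels=1):
    """Approximate multiply-accumulates for one 3D convolutional layer.
    in_shape and kernel are (D, H, W); 'valid' padding is assumed."""
    out_shape = [(i - k) // stride + 1 for i, k in zip(in_shape, kernel)]
    out_elems = out_shape[0] * out_shape[1] * out_shape[2]
    kernel_volume = kernel[0] * kernel[1] * kernel[2]
    return out_elems * kernel_volume * in_channels * filters

# Hypothetical 100x100x100 input; 20x20x20 kernel and 8 filters as in the question.
macs = conv3d_macs((100, 100, 100), (20, 20, 20), filters=8)
print(macs)  # ~3.4e10 MACs per forward pass for this one layer
```

To turn this into a time estimate: multiply by roughly 2-3x for the backward pass, by the number of batches per epoch, and by the number of epochs, then divide by your hardware's sustained FLOP/s. With tens of billions of MACs per sample and no GPU, epochs taking hours each is entirely plausible, which matches the 24-hours-with-no-output observation.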

Keras pass data through layers explicitly

I am trying to implement a pairwise learning-to-rank model with Keras, where the features are computed by a deep neural network.
In the pairwise L2R model, while training, I give the query, one positive result, and one negative result, and the model is trained on a classification loss over the difference of the feature vectors.
I am able to compile and fit the model successfully, but the problem is actually using this model on test data.
In a pairwise L2R model, at testing time I only have query-sample pairs (no separate negatives and positives), and I can use the value computed before the softmax to rank the samples.
Is there any way I can use Keras to pass data manually through particular trained layers at test time? (In short, I have 3 sets of inputs at training time and 2 at testing time.)
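The usual Keras answer is to build the scoring layers once and reuse the same layer objects in two `Model`s: a three-input training model and a two-input scoring model that shares the trained weights. The numpy sketch below illustrates only the underlying idea (one shared scoring function serving both views); all names and the scoring form are hypothetical, not the asker's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared scoring function f(query, sample) -> scalar.
# In Keras the same sharing is achieved by instantiating layers once and
# calling those layer objects from both the training Model (3 inputs)
# and a separate scoring Model (2 inputs).
W = rng.normal(size=(8,))  # shared weights, learned once during training

def score(query, sample):
    """Toy scorer: feature = elementwise product, score = W . feature."""
    return float(W @ (query * sample))

# Training-time view: three inputs, pairwise loss on the score difference.
def pairwise_hinge_loss(query, positive, negative, margin=1.0):
    return max(0.0, margin - (score(query, positive) - score(query, negative)))

# Test-time view: only (query, sample) pairs; rank candidates by raw score.
query = rng.normal(size=(8,))
candidates = [rng.normal(size=(8,)) for _ in range(5)]
ranking = sorted(range(5), key=lambda i: score(query, candidates[i]), reverse=True)
print(ranking)
```

In Keras terms: after fitting the three-input model, construct `Model(inputs=[query_in, sample_in], outputs=score_out)` from the already-trained shared layers; no retraining is needed, since layer objects carry their weights with them.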