At which point adding new data to a training set, will not improve training accuracy - deep-learning

This is more a general question about training a CNN but the one i'm using is YOLO.
I've started my training set for 'person' detections by labelling some data from different cameras videos (in similar environment).... Every time I was adding new data for a new camera I was retraining YOLO, which actually improved the detection for this camera. For the training, I split my data randomly into training/validation set. I use the validation set to compute accuracy. This is not overfitting as all the previous data are also used in the training.
Now, I've gathered more than 100 000 labelled data. I was expecting to not have to train anymore at this point as my data set is pretty big. But looks like I still need to do it. if i'm getting a new camera video, labelling 500-1000 samples, adding them to my huge data set and training again, the accuracy is improving for this camera.
I don't understand really understand why. Why do i still need to add new data to my set? Why is the accuracy improving a lot on the new data, while there are 'drawn' in the thousands of already existing data? Is there a point where I will be able to stop training because adding new data will not improve the accuracy?
Thanks for sharing your thoughts and ideas!

Interesting question. If your data quality is good and the training procedure is 'perfect' you will always be able to generalize better. Think about all the possible infite different images that you will want to detect. You are only using a sample of that, hoping that it is enough to generalize. You can keep increasing your dataset and might gain a 0.01% more, the question is when you want to stop. Your model accuracy will never be 100%.
My opinion: if you have a nice above 95% of accuracy stop generating more data if your project is personal and no one's life depends on it. Think about post processing to improve the results. Since you are detecting on video maybe try to follow the person movement so if in one frame it is not detected and you have info from the previous and posterior frame you might be able to do something fancy.
Hope it helps, cheers!

To create a good model of course you will need as many images as possible. But you have to pay attention whether your model become overfit, which is your model is not learning anymore and the average loss getting higher and the mAP getting lower, when overfitting occurs you have to stop the training and choose the best weight that has been saved in darknet/backup/ folder.
For YOLO, there are some guidelines that you can follow about when you should to stop training. The most obvious is :
During training, you will see varying indicators of error, and you should stop when no longer decreases 0.XXXXXXX avg:
Region Avg IOU: 0.798363, Class: 0.893232, Obj: 0.700808, No Obj: 0.004567, Avg Recall: 1.000000, count: 8 Region Avg IOU: 0.800677, Class: 0.892181, Obj: 0.701590, No Obj: 0.004574, Avg Recall: 1.000000, count: 8
9002: 0.211667, 0.060730 avg, 0.001000 rate, 3.868000 seconds, 576128 images Loaded: 0.000000 seconds
9002 - iteration number (number of batch)
0.060730 avg - average loss (error) - the lower, the better
When you see that average loss 0.xxxxxx avg no longer decreases at many iterations then you should stop training. The final average loss can be from 0.05 (for a small model and easy dataset) to 3.0 (for a big model and a difficult dataset). I personally think that model with avg loss 0.06 is good enough.
AlexeyAB explained everything in detail on his github repo, read this section please https://github.com/AlexeyAB/darknet#when-should-i-stop-training

Related

Ways to prevent underfitting and overfitting to when using data augmentation to train a transposed CNN

I'm training a CNN (one using a series of ConvTranspose2D in pytorch) that uses input data from JSON to constitute an image. Unlike natural language, the input data can be in any order, as it contains info about various sprites in a scene.
In my first attempts to train the model, I didn't change the order of the input data (meaning, on each epoch, each sprite was represented in the same place in the input data). The model learned for about 10 epochs, but then there started to be divergence between the training loss (which continued to go down) and the test loss. So classic overfitting.
I tried to solve this by doing a form of data augmentation where the output data (in this case an image) stayed the same but I shuffled the order of the input data. As I have around 400 sprites, the maximum shuffling is 400!, so theoretically this can vastly expand the amount of training data. For example, instead of 100k JSON documents corresponding to 100K images, by shuffling the order of sprites in the input data, you have 400!*100000 training data points. In practice of course this amount of data is impractical, so I went with around 2m data points for an initial test. The issue I ran into here was that the model was not learning at all - after getting to a certain loss very quickly (after the first few mini-batches), it didn't learn at all for around 4 epochs. So classic underfitting.
Like Goldilocks, I'd like to find "just right" between the initial overfitting and subsequent underfitting. I'm wondering other strategies I could try out. One idea I had was letting the model train on a predetermined order of sprites (the overfitting case) and then, once overfitting starts (ie two straight epochs with divergence between the test and training loss) shuffling the data. I can also play with changing the model, although it can only be so big because of constraints with the hardware and the fact that inference needs to happen in under 20ms.
Are there any papers or techniques that are recommended in this scenario where data augmentation can lead to vastly more data points but results in a model ceasing to learn? Thanks in advance for any tips!

Deep learning model test accuracy unstable

I am trying to train and test a pytorch GCN model that is supposed to identify person. But the test accuracy is quite jumpy like it gives 49% at 23 epoch then goes below near 45% at 41 epoch. So it's not increasing all the time though loss seems to decrease at every epoch.
My question is not about implementation errors rather I want to know why this happens. I don't think there is something wrong in my coding as I saw SOTA architecture has this type of behavior as well. The author just picked the best result and published saying that their models gives that result.
Is it normal for the accuracy to be jumpy (up-down) and am I just to take the best ever weights that produce that?
Accuracy is naturally more "jumpy", as you put it. In terms of accuracy, you have a discrete outcome for each sample - you either get it right or wrong. This makes it so that the result fluctuate, especially if you have a relatively low number of samples (as you have a higher sampling variance).
On the other hand, the loss function should vary more smoothly. It is based on the probabilities for each class calculated at your softmax layer, which means that they vary continuously. With a small enough learning rate, the loss function should vary monotonically. Any bumps you see are due to the optimization algorithm taking discrete steps, with the assumption that the loss function is roughly linear in the vicinity of the current point.

Regression problem getting much better results when dividing values by 100

I'm working on a regression problem in pytorch. My target values can be either between 0 to 100 or 0 to 1 (they represent % or % divided by 100).
The data is unbalanced, I have much more data with lower targets.
I've noticed that when I run the model with targets in the range 0-100, it doesn't learn - the validation loss doesn't improve, and the loss on the 25% large targets is very big, much bigger than the std in this group.
However, when I run the model with targets in the range 0-1, it does learn and I get good results.
If anyone can explain why this happens, and if using the ranges 0-1 is "cheating", that will be great.
Also - should I scale the targets? (either if I use the larger or the smaller range).
Some additional info - I'm trying to fine tune bert for a specific task. I use MSEloss.
Thanks!
I think your observation relates to batch normalization. There is a paper written on the subject, an numerous medium/towardsdatascience posts, which i will not list here. Idea is that if you have a no non-linearities in your model and loss function, it doesn't matter. But even in MSE you do have non-linearity, which makes it sensitive to scaling of both target and source data. You can experiment with inserting Batch Normalization Layers into your models, after dense or convolutional layers. In my experience it often improves accuracy.

Regarding overfitting a model

Suppose I want to learn a dataset, and I pick a deep-NN model with some complexity (say 5-layers).
Now, I know that this model is overfitting my data when I notice that during the training, the train-loss and validation-loss both decrease until N-epochs, and then the valid-loss starts to go up after that.
My question is: Is this model good for my data if I stop at Nth epoch? Or it is an over-complicated model in the first-place, even if I stop the training at N-epochs or not. Do I just discard this architecture and hunt for a better one?

training small amount of data on the large capacity network

Currently I am using the convolutional neural networks to solve the binary classification problem. The data I use is 2D-images and the number of training data is only about 20,000-30,000. In deep learning, it is generally known that overfitting problems can arise if the model is too complex relative to the amount of the training data. So, to prevent overfitting, the simplified model or transfer learning is used.
Previous developers in the same field did not use high-capacity models (high-capacity means a large number of model parameters) due to the small amount of training data. Most of them used small-capacity models and transfer learning.
But, when I was trying to train the data on high-capacity models (based on ResNet50, InceptionV3, DenseNet101) from scratch, which have about 10 million to 20 million parameters in, I got a high accuracy in the test set.
(Note that the training set and the test set were exclusively separated, and I used early stopping to prevent overfitting)
In the ImageNet image classification task, the training data is about 10 million. So, I also think that the amount of my training data is very small compared to the model capacity.
Here I have two questions.
1) Even though I got high accuracy, is there any reason why I should not use a small amount of data on the high-capacity model?
2) Why does it perform well? Even if there is a (very) large gap between the amount of data and the number of model parameters, the techniques like early stopping overcome the problems?
1) You're completely right that small amounts of training data can be problematic when working with a large model. Given that your ultimate goal is to achieve a "high accuracy" this theoretical limitation shouldn't bother you too much if the practical performance is satisfactory for you. Of course, you might always do better but I don't see a problem with your workflow if the score on the test data is legit and you're happy with it.
2) First of all, I believe ImageNet consists of 1.X million images so that puts you a little closer in terms of data. Here are a few ideas I can think of:
Your problem is easier to solve than ImageNet
You use image augmentation to synthetically increase your image data
Your test data is very similar to the training data
Also, don't forget that 30,000 samples means (30,000 * 224 * 224 * 3 =) 4.5 billion values. That should make it quite hard for a 10 million parameter network to simply memorize your data.
3) Welcome to StackOverflow