Single Prediction when using Batch Normalization - deep-learning

I have a CNN that is learning pretty well on a dataset I created. I added Batch Normalization to this network to try to improve performance.
But when I try to make a prediction on a single image, I always end up with the same result (whatever the image). I think it is because I need batches to actually do batch normalization.
So is it possible to make a prediction on a single image with a CNN that uses BN?
I thought about deleting the BN layers once my network is done training; is that the way to go?
Thank you :)

I found the exact answer to the problem I face here: https://r2rt.com/implementing-batch-normalization-in-tensorflow.html
In the section "Making predictions with the model", it is explained that when using BN, you need to estimate the population mean and population variance on your training set during training, so that you don't have to use a batch at test time (which would be "cheating") :)

Related

At which point will adding new data to a training set no longer improve training accuracy?

This is more of a general question about training a CNN, but the one I'm using is YOLO.
I started my training set for 'person' detection by labelling some data from different cameras' videos (in similar environments). Every time I added new data for a new camera, I retrained YOLO, which actually improved the detection for that camera. For the training, I split my data randomly into training/validation sets and use the validation set to compute accuracy. This is not overfitting, as all the previous data are also used in the training.
Now I've gathered more than 100,000 labelled samples. I was expecting not to have to train anymore at this point, since my data set is pretty big. But it looks like I still need to: if I get a new camera video, label 500-1000 samples, add them to my huge data set and train again, the accuracy improves for that camera.
I don't really understand why. Why do I still need to add new data to my set? Why does the accuracy improve so much on the new data, even though it is 'drowned' in the thousands of already existing samples? Is there a point where I will be able to stop training because adding new data will not improve the accuracy?
Thanks for sharing your thoughts and ideas!
Interesting question. If your data quality is good and the training procedure is 'perfect', you will always be able to generalize better. Think about the infinite number of different images that you might want to detect: you are only using a sample of them, hoping that it is enough to generalize. You can keep increasing your dataset and might gain another 0.01%; the question is when you want to stop. Your model's accuracy will never be 100%.
My opinion: if you have a nice accuracy above 95%, stop generating more data, as long as your project is personal and no one's life depends on it. Think about post-processing to improve the results. Since you are detecting on video, maybe try to follow the person's movement: if the person is not detected in one frame but you have information from the previous and following frames, you might be able to do something fancy.
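To make that last idea a bit more concrete, here is a tiny hypothetical sketch (the box format and the data structure are assumptions for illustration, not the poster's code) that fills a missed detection by interpolating between the neighbouring frames:

```python
def fill_missing_box(prev_box, next_box):
    """Linearly interpolate an (x, y, w, h) box for the middle frame."""
    return tuple((p + n) / 2.0 for p, n in zip(prev_box, next_box))

# detections[frame] is None where YOLO missed the person in that frame.
detections = {10: (100, 80, 40, 90), 11: None, 12: (108, 82, 40, 92)}
if detections[11] is None:
    detections[11] = fill_missing_box(detections[10], detections[12])
print(detections[11])  # (104.0, 81.0, 40.0, 91.0)
```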
Hope it helps, cheers!
To create a good model, of course you will need as many images as possible. But you have to pay attention to whether your model becomes overfit, which is when your model is no longer learning, the average loss gets higher and the mAP gets lower; when overfitting occurs, you have to stop the training and choose the best weights that have been saved in the darknet/backup/ folder.
For YOLO, there are some guidelines you can follow about when you should stop training. The most obvious one is:
During training you will see varying indicators of error, and you should stop when the 0.XXXXXXX avg value no longer decreases:
Region Avg IOU: 0.798363, Class: 0.893232, Obj: 0.700808, No Obj: 0.004567, Avg Recall: 1.000000, count: 8
Region Avg IOU: 0.800677, Class: 0.892181, Obj: 0.701590, No Obj: 0.004574, Avg Recall: 1.000000, count: 8
9002: 0.211667, 0.060730 avg, 0.001000 rate, 3.868000 seconds, 576128 images
Loaded: 0.000000 seconds
9002 - iteration number (number of batches processed)
0.060730 avg - average loss (error); the lower, the better
When you see that the average loss (0.xxxxxx avg) no longer decreases over many iterations, you should stop training. The final average loss can range from 0.05 (for a small model and an easy dataset) to 3.0 (for a big model and a difficult dataset). I personally think that a model with an avg loss of 0.06 is good enough.
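As a rough illustration of that stopping rule, here is a small Python sketch (not part of darknet; the log file name, window size and threshold are assumptions) that scans a saved training log for the 0.xxxxxx avg value and flags when it has stopped decreasing:

```python
import re

# Collect the "avg loss" value from lines like
# "9002: 0.211667, 0.060730 avg, 0.001000 rate, ..."
avg_losses = []
with open("darknet_training.log") as f:          # assumed log file
    for line in f:
        m = re.match(r"\s*(\d+):\s*([\d.]+),\s*([\d.]+) avg", line)
        if m:
            avg_losses.append(float(m.group(3)))

window = 1000  # compare the last 1000 iterations to the 1000 before them
if len(avg_losses) >= 2 * window:
    recent = sum(avg_losses[-window:]) / window
    earlier = sum(avg_losses[-2 * window:-window]) / window
    if earlier - recent < 0.001:   # avg loss is no longer decreasing
        print("avg loss has plateaued - consider stopping and picking "
              "the best weights from darknet/backup/")
```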
AlexeyAB explains everything in detail in his GitHub repo; please read this section: https://github.com/AlexeyAB/darknet#when-should-i-stop-training

Training a small amount of data on a large-capacity network

Currently I am using convolutional neural networks to solve a binary classification problem. The data I use are 2D images, and the number of training samples is only about 20,000-30,000. In deep learning, it is generally known that overfitting can arise if the model is too complex relative to the amount of training data. So, to prevent overfitting, a simplified model or transfer learning is used.
Previous developers in the same field did not use high-capacity models (high capacity meaning a large number of model parameters) due to the small amount of training data. Most of them used small-capacity models and transfer learning.
But when I trained high-capacity models (based on ResNet50, InceptionV3, DenseNet101) from scratch, which have about 10 million to 20 million parameters, I got high accuracy on the test set.
(Note that the training set and the test set were strictly separated, and I used early stopping to prevent overfitting.)
In the ImageNet image classification task, the training data is about 10 million images, so I also think that the amount of my training data is very small compared to the model capacity.
Here I have two questions.
1) Even though I got high accuracy, is there any reason why I should not use a small amount of data with a high-capacity model?
2) Why does it perform well? Even if there is a (very) large gap between the amount of data and the number of model parameters, do techniques like early stopping overcome the problems?
1) You're completely right that small amounts of training data can be problematic when working with a large model. Given that your ultimate goal is to achieve "high accuracy", this theoretical limitation shouldn't bother you too much if the practical performance is satisfactory for you. Of course, you might always do better, but I don't see a problem with your workflow if the score on the test data is legit and you're happy with it.
2) First of all, I believe ImageNet consists of 1.X million images, so that puts you a little closer in terms of data. Here are a few ideas I can think of:
Your problem is easier to solve than ImageNet
You use image augmentation to synthetically increase your image data (a minimal sketch follows below)
Your test data is very similar to the training data
Also, don't forget that 30,000 samples mean (30,000 * 224 * 224 * 3 =) about 4.5 billion values. That should make it quite hard for a 10-million-parameter network to simply memorize your data.
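For the augmentation point above, here is a minimal tf.keras sketch (the dummy data, the toy model and the augmentation parameters are all assumptions for illustration, not the poster's setup):

```python
import numpy as np
import tensorflow as tf

# Dummy stand-ins for the real images/labels (assumed shapes).
x_train = np.random.rand(100, 64, 64, 3).astype("float32")
y_train = np.random.randint(0, 2, size=(100,))

# A deliberately tiny model just to make the example runnable.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random rotations, shifts and flips synthetically enlarge the training data.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
)
model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=2)
```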
3) Welcome to StackOverflow

Is BatchNorm turned off during inference?

I have read several sources that implicitly suggest batchnorm is turned off for inference, but I have no definite answer for this.
The most common approach is to use a moving average of the mean and std for your batch normalization, as Keras does for example (https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py). If you simply turned it off, the network would perform worse on the same data, due to changes in how the images are processed.
This is done by storing a moving average of the mean and the std over all the batches used during training. At inference time, this moving average is then used for normalization.
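A small tf.keras sketch of what that means in practice (the layer, data and tolerance here are illustrative assumptions): after training, the layer's stored moving_mean / moving_variance are what normalize a single input at inference time.

```python
import numpy as np
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
bn.build((None, 4))

# Feed some "training" batches so the moving statistics get populated.
for _ in range(100):
    bn(np.random.rand(16, 4).astype("float32"), training=True)

# Inference on a single sample uses the stored moving averages...
x = np.random.rand(1, 4).astype("float32")
y_inference = bn(x, training=False).numpy()

# ...which is the same as applying them by hand:
gamma, beta = bn.gamma.numpy(), bn.beta.numpy()
mean, var = bn.moving_mean.numpy(), bn.moving_variance.numpy()
y_manual = gamma * (x - mean) / np.sqrt(var + bn.epsilon) + beta
print(np.allclose(y_inference, y_manual, atol=1e-4))  # expected: True
```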

Difference between training and testing phase in caffe

This might seem like a silly question, but I am trying to understand to what extent the testing phase in Caffe is important for good results. Of course the training phase is important, but is the testing phase simply there to periodically check how much loss is obtained on a set that is not trained on? If so, does the size of my test set really matter? Does testing even matter at all? I ask because I currently have some serious overfitting problems. If I have a large dataset (>50,000 images), how should I go about splitting it between test and train?
Caffe never uses the results on the test set during training to modify any parameters or fix issues like overfitting.
The purpose of a validation set (a test set used during training) is to let us see whether the model overfits the data, by looking at the accuracy or loss values, plotting them, or inspecting the outputs.
For example, if the loss on the training set keeps decreasing at every iteration while the loss on the test set keeps increasing, this is a solid case of the model overfitting the training set. To draw such conclusions, the images selected for the test set shouldn't be the same as those in the training set. It is ideal to keep roughly a 1:10 ratio between the test and train image counts. If the test set were a subset of the training set, the loss on the test set would also decrease and we might not detect the overfitting behaviour of the model.
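As an illustration of such a split (this uses scikit-learn rather than anything Caffe-specific, and the file list and labels are placeholders), holding out roughly that fraction could look like:

```python
from sklearn.model_selection import train_test_split

# Placeholder file list and labels standing in for the real dataset.
image_paths = [f"img_{i:06d}.jpg" for i in range(50_000)]
labels = [i % 2 for i in range(50_000)]

train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels,
    test_size=0.1,      # hold out ~10%, close to the 1:10 test:train ratio
    stratify=labels,    # keep the class balance similar in both sets
    random_state=42,
)
print(len(train_paths), len(test_paths))  # 45000 5000
```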

Semantic Segmentation using deep learning

I have a 512x512 image and I want to perform per-pixel classification of that image. I have already trained the model using 80x80 patches. So, at test time I have 512x512 = 262,144 patches, each with dimension 80x80, and this classification is too slow. How can I improve the testing time? Please help me out.
I might be wrong, but there are not many ways to speed up the testing phase; the main one is to reduce the number of neurons in the NN in order to reduce the number of operations:
80x80 patches are really big; you may want to try to reduce their size and retrain your NN. That alone will already reduce the number of neurons a lot.
Analyze the NN weights/inputs/outputs in order to detect the neurons that do not matter in your NN. They may, for example, always return 0; in that case they can be deleted from the NN, and you then retrain the NN with the simplified architecture.
If you have not done so already, it's much faster to feed a batch of patches (the bigger the better) instead of one patch at a time (a sketch follows below).
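A minimal sketch of that last point (assuming a Keras-style `model` that accepts a batch of 80x80x3 patches; the names and batch size are placeholders):

```python
import numpy as np

def predict_in_batches(model, patches, batch_size=512):
    """patches: (N, 80, 80, 3) array of all patches cut from the image."""
    outputs = []
    for start in range(0, len(patches), batch_size):
        batch = patches[start:start + batch_size]
        outputs.append(model.predict(batch))   # one forward pass per batch
    return np.concatenate(outputs, axis=0)

# Usage idea: extract every 80x80 patch of the 512x512 image into one array
# first, then call predict_in_batches once instead of making 262,144
# single-patch predictions.
```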