As far as I understand, a model is certainly overfitting if (1) it converges too soon and (2) the validation loss keeps increasing.
Again, to my knowledge, there is no way around this other than making the validation loss follow a trend similar to the training loss, for example through more data augmentation.
However, many papers I have read claim that a 10-fold test is a sign of robustness and shows the model is not overfitting. When I recreate those experiments, though, I find that they do overfit, whether or not they report robust accuracies. Also, many people seem to think that they can just add a 10-fold test and be good to go. In reviews, too, people only ask for 10-fold experiments to address overfitting.
Is my take wrong? Is there hope for a validation loss that does not converge but goes up? Or is there a measure besides validation loss?
I assume that by a 10-fold test you mean 10-fold cross-validation.
Usually, cross-validation is useful only on very small datasets, i.e. those with fewer than 1,000 samples.
Overfitting means that the complexity of your model is much higher than necessary. A typical sign of overfitting is a very high learning (training) accuracy combined with a low validation accuracy.
Therefore, using 10-fold cross-validation does not in itself prevent overfitting.
Consider two examples:
First: learning accuracy 99.8%, 10-fold cross-validation accuracy 70%.
Second: learning accuracy 77%, 10-fold cross-validation accuracy 70%.
In both cases, the same 10-fold cross-validation yields 70% accuracy. However, the first case is clearly overfitting, whereas the second is not.
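To make this concrete, here is a minimal sketch (scikit-learn with a synthetic dataset, so the model and numbers are purely illustrative) of how you can compare training accuracy against 10-fold cross-validation accuracy:

```python
# Minimal sketch: compare training accuracy with 10-fold CV accuracy.
# The dataset and model are placeholders for illustration only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0)

# 10-fold cross-validation accuracy (an estimate of generalization).
cv_acc = cross_val_score(model, X, y, cv=10).mean()

# Training accuracy on the full dataset.
train_acc = model.fit(X, y).score(X, y)

print(f"training accuracy: {train_acc:.3f}, 10-fold CV accuracy: {cv_acc:.3f}")
# A large gap (like 99.8% vs 70% above) points to overfitting;
# similar values (like 77% vs 70%) do not.
```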
I hope this clarifies the situation.
Why does the DQN algorithm perform only one gradient descent step per update, i.e. train for only one epoch? Wouldn't it benefit from more epochs? Wouldn't its accuracy improve with more epochs?
Time efficiency.
In theory, in the policy iteration / evaluation scheme, you should wait until convergence before moving to the next update. However, this can (a) never happen or (b) take too long.
So people typically do one single step with a small learning rate in the hope that the critic (Q) is not "too wrong".
You could try more steps, but in general how many gradient steps to do is a design choice, and they probably found that this works the best.
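As a rough sketch of what that design choice looks like in code (PyTorch here purely for illustration; the networks, optimizer, and batch tensors are assumed to come from your own implementation):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99, n_grad_steps=1):
    """One DQN update: by default a single gradient step on a sampled batch.

    q_net and target_net are torch.nn.Modules mapping states to Q-values;
    batch is a tuple of tensors (states, actions, rewards, next_states, dones).
    Everything here is schematic, not taken from a specific implementation.
    """
    states, actions, rewards, next_states, dones = batch
    for _ in range(n_grad_steps):  # the design choice discussed above
        with torch.no_grad():
            # Bootstrapped target: r + gamma * max_a' Q_target(s', a')
            target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
        q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = F.smooth_l1_loss(q_pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # one small step, in the hope that Q stays "not too wrong"
    return loss.item()
```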
Call for experts in deep learning.
Hey, I am currently working on training images with TensorFlow in Python for tone mapping. To get better results, I focused on using the perceptual loss introduced in this paper by Justin Johnson.
In my implementation, I use all three parts of the loss: a feature loss extracted from VGG16, an L2 pixel-level loss between the transferred image and the ground-truth image, and the total variation loss. I sum them up as the loss for backpropagation.
From the objective in the paper,

$$\hat{y} = \arg\min_{y} \; \lambda_c \, \mathcal{L}_{\text{content}}(y, y_c) + \lambda_s \, \mathcal{L}_{\text{style}}(y, y_s) + \lambda_{TV} \, \mathcal{L}_{TV}(y),$$

we can see that there are three weights, the λ's, that balance the losses. The values of the three λ's are presumably fixed throughout training.
My question is: does it make sense to dynamically change the λ's every epoch (or every several epochs) to adjust the relative importance of these losses?
For instance, the perceptual loss converges drastically in the first several epochs, yet the pixel-level L2 loss converges fairly slowly. So maybe the weight should be higher for the content loss at first, say 0.9, and lower for the others. As training progresses, the pixel-level loss becomes increasingly important for smoothing the image and minimizing artifacts, so it might be better to raise its weight a bit, much like changing the learning rate across epochs.
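Concretely, what I have in mind is something like the sketch below; the schedule shape and the numbers are just placeholders to illustrate the idea:

```python
# Sketch of a per-epoch weight schedule for the three loss terms.
# The numbers are made up, not tuned values.
def loss_weights(epoch, total_epochs=100):
    """Return (lambda_content, lambda_pixel, lambda_tv) for a given epoch."""
    progress = epoch / total_epochs
    lambda_content = 0.9 * (1.0 - progress) + 0.3 * progress   # start high, decay
    lambda_pixel   = 0.05 * (1.0 - progress) + 0.6 * progress  # ramp up over time
    lambda_tv      = 0.05                                      # keep constant
    return lambda_content, lambda_pixel, lambda_tv

# Inside the training loop:
# lc, lp, ltv = loss_weights(epoch)
# total_loss = lc * content_loss + lp * pixel_l2_loss + ltv * tv_loss
```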
The postdoc who supervises me strongly opposes my idea. He thinks it amounts to dynamically changing the training objective and could cause inconsistency in the training.
So, pros and cons; I need some ideas...
Thanks!
It's hard to answer this without knowing more about the data you're using, but in short, a dynamic loss weighting should not really have much effect and may even have the opposite effect.
If you are using Keras, you could simply run a hyperparameter tuner similar to the following in order to see if there is any effect (change the loss accordingly):
https://towardsdatascience.com/hyperparameter-optimization-with-keras-b82e6364ca53
I've only done this on smaller models (it's way too time-consuming otherwise), but in essence it's best to keep the weights constant, and it also avoids angering your supervisor :D
If you are using a different ML or DL library, there are tuners for each; just Google them. It may be best to run these on a cluster and overnight, but they usually give you a well-enough optimized version of your model.
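If you'd rather not set up a full tuner, a crude alternative is a small grid search over the λ's; this sketch assumes you have some train_and_validate function that trains briefly with the given weights and returns a validation metric (all names here are hypothetical):

```python
import itertools

# Crude grid search over the loss weights; train_and_validate is assumed
# to train briefly with the given weights and return a validation score.
def grid_search_lambdas(train_and_validate):
    candidates = [0.1, 0.5, 1.0]
    best_score, best_combo = float("-inf"), None
    for lc, lp, ltv in itertools.product(candidates, repeat=3):
        score = train_and_validate(lambda_content=lc, lambda_pixel=lp, lambda_tv=ltv)
        if score > best_score:
            best_score, best_combo = score, (lc, lp, ltv)
    return best_combo, best_score
```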
Hope that helps and good luck!
I read everywhere that, in addition to improving accuracy, "Batch Normalization makes training faster".
I am probably misunderstanding something (since BN has been proven effective more than once), but it seems kind of illogical to me.
Indeed, adding BN to a network increases the number of parameters to learn: BN comes with "scale" and "offset" parameters that must be learned. See: https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization
How can the network train faster while having "more work to do"?
(I hope my question is legitimate or at least not too stupid).
Thank you :)
Batch normalization accelerates training by requiring fewer iterations to converge to a given loss value. This can be done by using higher learning rates, but even with smaller learning rates you can still see an improvement. The paper shows this pretty clearly.
Using ReLU also has this effect when compared to saturating activations such as sigmoid or tanh, as shown in the original AlexNet paper (which did not use BN).
Batch normalization also makes the optimization problem "easier", as reducing the covariate shift avoids many of the plateaus where the loss stagnates or decreases slowly. They can still happen, but much less frequently.
Batch normalization fixes the distribution of a lower layer's activations as seen by the next layer. The scale and offset just "move" that distribution to a more effective position, but it is still a fixed distribution at every training step. This means the parameter updates in the higher layers do not need to compensate for changes to the parameters in the lower layer(s), which makes training more efficient.
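To make the "more parameters, yet faster" point concrete, here is a minimal Keras sketch (layer sizes and learning rates are arbitrary): the BN variant does carry extra scale/offset parameters, but it typically tolerates a much larger learning rate and reaches a given loss in fewer iterations:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def make_model(use_bn):
    model = models.Sequential()
    model.add(layers.Dense(256, use_bias=not use_bn, input_shape=(784,)))
    if use_bn:
        # BatchNormalization adds a per-feature scale (gamma) and offset (beta).
        model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.Dense(10, activation="softmax"))
    lr = 1e-1 if use_bn else 1e-3  # illustrative: BN usually tolerates a much larger rate
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

print(make_model(use_bn=False).count_params())  # baseline parameter count
print(make_model(use_bn=True).count_params())   # slightly more parameters with BN
```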
My apologies, since my question may sound stupid, but I am quite new to deep learning and Caffe.
How can we determine how many iterations are required to fine-tune a pre-trained model on our own dataset? For example, I am running FCN-32 on my own data with 5 classes. When can I stop the fine-tuning process by looking at the loss and accuracy of the training phase?
Many thanks
You shouldn't do it by looking at the loss or accuracy of the training phase. In theory, the training accuracy should always increase (and the training loss should always decrease), because you train the network to decrease the training loss. But a high training accuracy doesn't necessarily mean a high test accuracy; that is what we refer to as overfitting. So what you need to find is the point where the accuracy on the test set (or validation set, if you have one) stops increasing. You can do this by specifying a relatively large number of iterations at first and then monitoring the test accuracy or test loss: if the test accuracy stops increasing (or the loss stops decreasing) for N consecutive iterations (or epochs), where N could be 10 or some other number you choose, stop the training process.
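A minimal sketch of that stopping rule (framework-agnostic; train_one_epoch and evaluate_on_validation are placeholders for whatever your Caffe setup provides):

```python
def train_with_patience(train_one_epoch, evaluate_on_validation,
                        max_epochs=200, patience=10):
    """Stop once validation accuracy has not improved for `patience` epochs."""
    best_acc, epochs_without_improvement = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_acc = evaluate_on_validation()
        if val_acc > best_acc:
            best_acc, epochs_without_improvement = val_acc, 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"stopping at epoch {epoch}; best validation accuracy {best_acc:.3f}")
            break
    return best_acc
```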
The best thing to do is to track training and validation accuracy and store snapshots of the weights every k iterations. To compute validation accuracy you need a separate set of held-out data which you do not use for training.
Then, you can stop once the validation accuracy stops increasing or starts decreasing. This is called early stopping in the literature. Keras, for example, provides functionality for this: https://keras.io/callbacks/#earlystopping
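Roughly like this, for example (a toy model and random data just to show the callback wiring; on older Keras versions the monitored metric may be called val_acc instead of val_accuracy):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Toy model and random data, only to demonstrate the callbacks.
model = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(20,)),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

x = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

callbacks = [
    # Stop once validation accuracy has not improved for 10 epochs.
    EarlyStopping(monitor="val_accuracy", patience=10, restore_best_weights=True),
    # Snapshot the best weights seen so far.
    ModelCheckpoint("weights_best.h5", monitor="val_accuracy", save_best_only=True),
]

model.fit(x, y, validation_split=0.2, epochs=200, callbacks=callbacks)
```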
Also, it's good practice to plot the above quantities, because it gives you important insights into the training process. See http://cs231n.github.io/neural-networks-3/#accuracy for a great illustration (not specific to early stopping).
Hope this helps
Normally you converge to a specific validation accuracy for your model. In practice, you normally stop training if the validation loss has not decreased in x epochs. Depending on your epoch duration, x varies, most commonly between 5 and 20.
Edit:
An epoch is one pass over your training dataset, in ML terms. You do not seem to have a validation set. Normally the data is split into training and validation data so you can see how well your model performs on unseen data and make decisions about which model to take by looking at this data. You might want to take a look at http://caffe.berkeleyvision.org/gathered/examples/mnist.html to see the usage of a validation set, even though they call it a test set.
I am trying to build an 11-class image classifier with 13000 training images and 3000 validation images. I am using a deep neural network which is being trained using mxnet. Training accuracy is increasing and has reached above 80%, but validation accuracy stays in the range of 54-57% and is not increasing.
What can be the issue here? Should I increase the number of images?
The issue here is that at some point your network stops learning useful general features and starts adapting to the peculiarities of your training set (overfitting it as a result). You want to 'force' your network to keep learning useful features, and you have a few options here:
Use weight regularization. It tries to keep the weights low, which very often leads to better generalization. Experiment with different regularization coefficients; try 0.1, 0.01, 0.001 and see what impact they have on accuracy (a small sketch of this and the next option follows this list).
Corrupt your input (e.g., randomly substitute some pixels with black or white). This way you remove information from the input and 'force' the network to pick up on important general features. Experiment with the noising coefficient, which determines how much of your input should be corrupted. Research suggests that anything in the range of 15%-45% works well.
Expand your training set. Since you're dealing with images, you can expand your set by rotating, scaling, etc. your existing images (as suggested). You could also experiment with pre-processing your images (e.g., mapping them to black and white or grayscale), though the effectiveness of this technique will depend on your exact images and classes.
Pre-train your layers with a denoising criterion. Here you pre-train each layer of your network individually before fine-tuning the entire network. Pre-training 'forces' the layers to pick up important general features that are useful for reconstructing the input signal. Look into auto-encoders, for example (they have been applied to image classification in the past).
Experiment with the network architecture. Your network might not have sufficient learning capacity. Experiment with different neuron types, numbers of layers, and numbers of hidden neurons. Make sure to try compressing architectures (fewer neurons than inputs) and sparse architectures (more neurons than inputs).
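Here is the small sketch mentioned above, covering the first two options in Keras (the same ideas carry over to mxnet); the layer sizes and coefficients are illustrative only:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Sketch of options 1 and 2: L2 weight regularization plus input corruption
# (dropout on the input randomly zeroes a fraction of the input features).
model = keras.Sequential([
    layers.Dropout(0.3, input_shape=(64 * 64 * 3,)),         # corrupt ~30% of the input
    layers.Dense(512, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),  # try 0.1, 0.01, 0.001
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(11, activation="softmax"),                  # 11 classes as in the question
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```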
Unfortunately, the process of training a network that generalizes well involves a lot of experimentation and an almost brute-force exploration of the parameter space, with a bit of human supervision (you'll see many research works employing this approach). It's good to try 3-5 values for each parameter and see if it leads you somewhere.
When you experiment, plot accuracy / cost / F1 as a function of the number of iterations and see how they behave. Often you'll notice a peak in accuracy for your test set, followed by a continuous drop. So apart from a good architecture, regularization, corruption, etc., you're also looking for the number of iterations that yields the best results.
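A minimal plotting sketch, assuming you have recorded the per-epoch training and validation (or test) accuracies in two lists:

```python
import matplotlib.pyplot as plt

# train_acc and val_acc are assumed to be lists of per-epoch accuracies
# recorded during training.
def plot_accuracy(train_acc, val_acc):
    epochs = range(1, len(train_acc) + 1)
    plt.plot(epochs, train_acc, label="training accuracy")
    plt.plot(epochs, val_acc, label="validation accuracy")
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()  # look for the point where validation accuracy peaks and then drops
```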
One more hint: make sure each training epoch randomizes the order of the images.
This clearly looks like a case where the model is overfitting the training set, as the validation accuracy improved step by step until it got stuck at a particular value. If the learning rate were a bit higher, you would likely see the validation accuracy decreasing while the training accuracy keeps increasing.
Increasing the size of the training set is the best solution to this problem. You could also try applying different transformations (flipping, cropping random portions from a slightly bigger image) to the existing images and see if the model learns better.
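For example, a sketch using Keras' ImageDataGenerator (mxnet has similar augmentation utilities); the ranges below are placeholders to tune for your images:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation sketch: random flips, shifts, zooms and rotations on the fly.
datagen = ImageDataGenerator(
    horizontal_flip=True,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    rotation_range=10,
)

# Then train from the augmented stream, e.g.:
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=50)
```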