Data augmentation stops overfitting by preventing learning entirely? - deep-learning

I am training a network to classify psychosis (binary classification as either healthy or psychosis) given an MRI scan of a subject. My dataset is 500 items, where I am using 350 for training and 150 for validation. Around 44% of the dataset is healthy, and ~56% has psychosis.
When I train the network without data augmentation, the training loss begins decreasing immediately while validation loss never changes. The red line in the accuracy graph below is the dominant class percentage (56%).
When I re-train using data augmentation 80% of the time (random affine, blur, noise, flip), overfitting is prevented, but now nothing is learned at all.
So I suppose my question is: What are some ideas for how to get the validation accuracy to increase? i.e. get the network to learn things without overfitting...

Related

Ways to prevent underfitting and overfitting to when using data augmentation to train a transposed CNN

I'm training a CNN (one using a series of ConvTranspose2D in pytorch) that uses input data from JSON to constitute an image. Unlike natural language, the input data can be in any order, as it contains info about various sprites in a scene.
In my first attempts to train the model, I didn't change the order of the input data (meaning, on each epoch, each sprite was represented in the same place in the input data). The model learned for about 10 epochs, but then there started to be divergence between the training loss (which continued to go down) and the test loss. So classic overfitting.
I tried to solve this by doing a form of data augmentation where the output data (in this case an image) stayed the same but I shuffled the order of the input data. As I have around 400 sprites, the maximum shuffling is 400!, so theoretically this can vastly expand the amount of training data. For example, instead of 100k JSON documents corresponding to 100K images, by shuffling the order of sprites in the input data, you have 400!*100000 training data points. In practice of course this amount of data is impractical, so I went with around 2m data points for an initial test. The issue I ran into here was that the model was not learning at all - after getting to a certain loss very quickly (after the first few mini-batches), it didn't learn at all for around 4 epochs. So classic underfitting.
Like Goldilocks, I'd like to find "just right" between the initial overfitting and subsequent underfitting. I'm wondering other strategies I could try out. One idea I had was letting the model train on a predetermined order of sprites (the overfitting case) and then, once overfitting starts (ie two straight epochs with divergence between the test and training loss) shuffling the data. I can also play with changing the model, although it can only be so big because of constraints with the hardware and the fact that inference needs to happen in under 20ms.
Are there any papers or techniques that are recommended in this scenario where data augmentation can lead to vastly more data points but results in a model ceasing to learn? Thanks in advance for any tips!

A flat validation loss and a decreasing training loss can be considered a symptom of overfitting?

My questions are about underfitting/overfitting and it's related with the following results :
here
In this scenario, a flat validation loss and a decreasing training loss can be considered a symptom of overfitting? I'd have expected a validation loss that starts to increase.
Moreover, at the end, the training loss was flattening, so is it correct to say that the model can't learn more with these hyperparameters? Is this, instead, a symptom of underfitting?
I'm working on this dataset (here).
I implemented a convolutional neural network with 7 conv layers and 2 FC (similar to VGG, 64-P-128-128-P-256-256-P-512-512, a hidden FC of 256 neurons and the last for classifcation), clearly not to obtain a state-of-the-art score (currently about 75%).
It seems strange to me to talk about underfitting and overfitting in the same training process, so I'm pretty sure there's something I'm missing. Could you help me to understand these results?
Thanks for your attention
I've found a similar question but it didn't help (here).

Large number of training steps results in poor performance in transfer learning

I have a question. I have used transfer learning to retrain googlenet on my image classification problem. I have 80,000 images which belong to 14 categories. I set number of training steps equal to 200,000. I think the code provided by Tensorflow should have drop out implimented and it trains based on random shuffling of dataset and cross validation approach, and and I do not see any overfiting in training and classification curves, and I get high cross validation accuracy and high test accuracy but when I apply my model to new dataset then I get poor classification result. Anybodey know what is going on?Thanks!

What does it mean to normalize based on mean and standard deviation of images in the imagenet training dataset?

In implementation of densenet model as in CheXNet paper, in section 3.1 it is mentioned that:
Before inputting the images into the network, we downscale the images to 224x224 and normalize based on the mean and standard edviation of images in the ImageNet training set.
Why would we want to normalize new set of images with mean and std of different dataset?
How do we get the mean and std of ImageNet dataset? Is it provided somewhere?
Subtracting the mean centers the input to 0, and dividing by the standard deviation makes any scaled feature value the number of standard deviations away from the mean.
Consider how a neural network learns its weights. C(NN)s learn by continually adding gradient error vectors (multiplied by a learning rate) computed from backpropagation to various weight matrices throughout the network as training examples are passed through.
The thing to notice here is the "multiplied by a learning rate".
If we didn't scale our input training vectors, the ranges of our distributions of feature values would likely be different for each feature, and thus the learning rate would cause corrections in each dimension that would differ (proportionally speaking) from one another. We might be over compensating a correction in one weight dimension while undercompensating in another.
This is non-ideal as we might find ourselves in a oscillating (unable to center onto a better maxima in cost(weights) space) state or in a slow moving (traveling too slow to get to a better maxima) state.
Original Post: https://stats.stackexchange.com/questions/185853/why-do-we-need-to-normalize-the-images-before-we-put-them-into-cnn
They used mean and std dev of the ImageNet training set because the weights of their model were pretrained on ImageNet (see Model Architecture and Training section of the paper).

RNN L2 Regularization stops learning

I use Bidirectional RNN to detect an event of unbalanced occurence. The positive class is 100times less often than the negative class.
While no regularization use I can get 100% accuracy on train set and 30% on validation set.
I turn on l2 regularization and the result is only 30% accuracy on train set too instead of longer learning and 100% accuracy on validation set.
I was thinking that maybe my data is too small so just for experiment I merged train set with test set which I did not use before. Situation was the same as I would use l2 regularization, which I did not now. I get 30% accuracy on train+test and validation.
In use 128hidden units and 80 timesteps in the mentioned experiments
When I increased the number of hidden units to 256 I can again overfit on train+test set to get 100% accuracy but still only 30% on validation set.
I did try so many options for hyperparameters and almost no result. Maybe the weighted cross entropy is causing the problem, in given experiments the weight on positive class is 5. While trying larger weights the results are often worse around 20% of accuracy.
I tried LSTM and GRU cells, no difference.
The best results I got. I tried 2 hidden layers with 256 hidden units, it took around 3 days of computation and 8GB of GPU memory. I got around 40-50% accuracy before it starts overfitting again while l2 regularization was on but no so strong.
Is there some general guideline what to do in this situation? I was not able to find anything.
Too much hidden units can overfit your model. You can try with smaller number of hidden units. As you mentioned, training with more data might improve the performance. If you don't have enough data, you can generate some artificial data. Researchers add distortions to their training data to increase their data size but in a controlled way. This type of strategy is pretty good for image data but certainly if you are dealing with text data, probably you can use some knowledge base that can improve the performance.
There are many works going on using Knowledge-bases to solve NLP and deep learning related tasks.