NN converges very quick but performs well. Should I be concerned? - deep-learning

I have LSTM model I'm using for time series predictions. In training it converges already after 3 epochs. The model performs quite well on the test data, but should I still be concerned about the fast convergence or should performance on test set be the overruling factor to decide if a model is good or not?
There is plenty of data points(100k) and two hidden layers with 124 and 64 nodes, so I don't think the model lacks complexity or data.

Related

Ways to prevent underfitting and overfitting to when using data augmentation to train a transposed CNN

I'm training a CNN (one using a series of ConvTranspose2D in pytorch) that uses input data from JSON to constitute an image. Unlike natural language, the input data can be in any order, as it contains info about various sprites in a scene.
In my first attempts to train the model, I didn't change the order of the input data (meaning, on each epoch, each sprite was represented in the same place in the input data). The model learned for about 10 epochs, but then there started to be divergence between the training loss (which continued to go down) and the test loss. So classic overfitting.
I tried to solve this by doing a form of data augmentation where the output data (in this case an image) stayed the same but I shuffled the order of the input data. As I have around 400 sprites, the maximum shuffling is 400!, so theoretically this can vastly expand the amount of training data. For example, instead of 100k JSON documents corresponding to 100K images, by shuffling the order of sprites in the input data, you have 400!*100000 training data points. In practice of course this amount of data is impractical, so I went with around 2m data points for an initial test. The issue I ran into here was that the model was not learning at all - after getting to a certain loss very quickly (after the first few mini-batches), it didn't learn at all for around 4 epochs. So classic underfitting.
Like Goldilocks, I'd like to find "just right" between the initial overfitting and subsequent underfitting. I'm wondering other strategies I could try out. One idea I had was letting the model train on a predetermined order of sprites (the overfitting case) and then, once overfitting starts (ie two straight epochs with divergence between the test and training loss) shuffling the data. I can also play with changing the model, although it can only be so big because of constraints with the hardware and the fact that inference needs to happen in under 20ms.
Are there any papers or techniques that are recommended in this scenario where data augmentation can lead to vastly more data points but results in a model ceasing to learn? Thanks in advance for any tips!

Should I be using whole available data for training my deep learning model ? What are the pros and cons of using only a subset?

I have a very complex LSTM based neural network model which I'm training on Quora Duplicate Question pairs. There are approximately 400 000 sentence pairs in the original dataset. It would take a lot of processing power and computation time to train on the entire (or 80%) dataset. Would it be unwise if I choose a random subset of the dataset (say 8000 pairs only) for training and 2000 for testing? Would it have a severe impact on the performance? Is always "more the data, better the model" true?
As a Rule of Thumb, Deep Neural Networks usually benefit from more data.
If you have a well described model and properly engineered your inputs, you will lose if you chose a smaller subset of your dataset.
However, you could always evaluate this by using metrics. Check how your loss decreases at every sample size, starting from your 8000 pairs.
For big problems, you always have to keep in mind that computation time is usually also big.

training small amount of data on the large capacity network

Currently I am using the convolutional neural networks to solve the binary classification problem. The data I use is 2D-images and the number of training data is only about 20,000-30,000. In deep learning, it is generally known that overfitting problems can arise if the model is too complex relative to the amount of the training data. So, to prevent overfitting, the simplified model or transfer learning is used.
Previous developers in the same field did not use high-capacity models (high-capacity means a large number of model parameters) due to the small amount of training data. Most of them used small-capacity models and transfer learning.
But, when I was trying to train the data on high-capacity models (based on ResNet50, InceptionV3, DenseNet101) from scratch, which have about 10 million to 20 million parameters in, I got a high accuracy in the test set.
(Note that the training set and the test set were exclusively separated, and I used early stopping to prevent overfitting)
In the ImageNet image classification task, the training data is about 10 million. So, I also think that the amount of my training data is very small compared to the model capacity.
Here I have two questions.
1) Even though I got high accuracy, is there any reason why I should not use a small amount of data on the high-capacity model?
2) Why does it perform well? Even if there is a (very) large gap between the amount of data and the number of model parameters, the techniques like early stopping overcome the problems?
1) You're completely right that small amounts of training data can be problematic when working with a large model. Given that your ultimate goal is to achieve a "high accuracy" this theoretical limitation shouldn't bother you too much if the practical performance is satisfactory for you. Of course, you might always do better but I don't see a problem with your workflow if the score on the test data is legit and you're happy with it.
2) First of all, I believe ImageNet consists of 1.X million images so that puts you a little closer in terms of data. Here are a few ideas I can think of:
Your problem is easier to solve than ImageNet
You use image augmentation to synthetically increase your image data
Your test data is very similar to the training data
Also, don't forget that 30,000 samples means (30,000 * 224 * 224 * 3 =) 4.5 billion values. That should make it quite hard for a 10 million parameter network to simply memorize your data.
3) Welcome to StackOverflow

Predicting rare events and their strength with LSTM autoencoder

I’m currently creating and LSTM to predict rare events. I’ve seen this paper which suggest: first an autoencoder LSTM for extracting features and second to use the embeddings for a second LSTM that will make the actual prediction. According to them, the autoencoder extract features (this is usually true) which are then useful for the prediction layers to predict.
In my case, I need to predict if it would be or not an extreme event (this is the most important thing) and then how strong is gonna be. Following their advice, I’ve created the model, but instead of adding one LSTM from embeddings to predictions I add two. One for binary prediction (It is, or it is not), ending with a sigmoid layer, and the second one for predicting how strong will be. Then I have three losses. The reconstruction loss (MSE), the prediction loss (MSE), and the binary loss (Binary Entropy).
The thing is that I’m not sure that is learning anything… the binary loss keeps in 0.5, and even the reconstruction loss is not really good. And of course, the bad thing is that the time series is plenty of 0, and some numbers from 1 to 10, so definitely MSE is not a good metric.
What do you think about this approach?
This is the better architecture for predicting rare events? Which one would be better?
Should I add some CNN or FC from the embeddings before the other to LSTM, for extracting 1D patterns from the embedding, or directly to make the prediction?
Should the LSTM that predicts be just one? And only use MSE loss?
Would be a good idea to multiply the two predictions to force in both cases the predicted days without the event coincide?
Thanks,

Large number of training steps results in poor performance in transfer learning

I have a question. I have used transfer learning to retrain googlenet on my image classification problem. I have 80,000 images which belong to 14 categories. I set number of training steps equal to 200,000. I think the code provided by Tensorflow should have drop out implimented and it trains based on random shuffling of dataset and cross validation approach, and and I do not see any overfiting in training and classification curves, and I get high cross validation accuracy and high test accuracy but when I apply my model to new dataset then I get poor classification result. Anybodey know what is going on?Thanks!