How to reduce the difference between training and validation in the loss curve? - deep-learning

I have used the Transformer model to train the time series dataset, but there is always a gap between training and validation in my loss curve. I have tried using different learning rates, batch sizes, dropout, heads, dim_feedforward, and layers, but they don't work. Can anyone give me some ideas on reducing the gap between them?
I also tried to ask the question on the Pytorch forum but didn't get any reply.
How to design a decoder for time series regression in Transformer?

Since you are overfitting your model here
1.Try using more data.
2.Try to add dropOut layers
3. Try using lasso or Ridge

Related

Cant overfit Neural Network

I have a simple encoder-decoder network. The encoder has several layers of conv1d with linear at the end and Relu between them, the decoder is consists of conv1d layers and Relu between them(no batch norm or dropout).
Using this model I try to overfit one example,I work with batch size=1 and always give the same input and same desired output, however no success. The loss indeed goes down until some threshold, but no matter what I do I can't get the loss lower than this bound and the output is useless. I tried more sophisticated encoder/decoder, change hyperparameters, make different preprocessing on my data, but I never can't get the loss lower than that threshold.
Just for the protocol, if I give it as input the desired output(so it will learn the id function) the network works, but that doesn't help me.
I will appreciate any help with it with any idea what might be the problem.
Try more number of epochs, with a lower learning rate.
Try increasing the size of your Dense layers.
Try avoiding any Dropout layers.
These can make the model more vulnerable to overfitting, if this is what you wanted.

how to train pre-trained CNN on new dataset which is not organised in classes (Unsupervised)

I have a pretrained CNN (Resnet-18) trained on Imagenet, now i want to extend it on my own dataset of video frames , now the point is all tutorials i found on Finetuning required dataset to be organised in classes like
class1/train/
class1/test/
class2/train/
class2/test/
but i have only frames on many videos , how will i train my CNN on it.
So can anyone point me in right direction , any tutorial or paper etc ?
PS: My final task is to get deep features of all frames that i provide at the time of testing
for training network, you should have some 'label'(sometimes called y) of your input data. from there, network calculate loss between logit(answer of network) and the given label.
And the network will self-revise using that loss value by backpropagating. that process is what we call 'training'.
Because you only have input data, not label, so you can get the logit only. that means a loss cannot be calculated.
Fine tuning is almost same word with 'additional training', so that you cannot fine tuning your pre-trained network without labeled data.
About train set & test set, that is not the problem right now.
If you have enough labeled input data, you can divide it with some ratio.
(e.g. 80% of data for training, 20% of data for testing)
the reason why divide data into these two sets, we want to check the performance of our trained network more general, unseen situation.
However, if you just input your data into pre-trained network(encoder part), it will give a deep feature. It doesn't exactly fit to your task, still it is deep feature.
Added)
Unsupervised pre-training for convolutional neural network in theano
here is the method you need, deep feature encoder in unsupervised situation. I hope it will help.

How do I verify that my model is actually functioning correctly in deep learning?

I have a dataset of around 6K chemical formulas which I am preprocessing via Keras' tokenization to perform binary classification. I am currently using a 1D convolutional neural network with dropouts and am obtaining an accuracy of 82% and validation accuracy of 80% after only two epochs. No matter what I try, the model just plateaus there and doesn't seem to be improving at all. Those same exact accuracies are reached with a vanilla LSTM too. What else can I try to improve my accuracies? Losses only have a difference of 0.04... Anyone have any ideas? Both models use an embedding layer and changing the output dimension isn't having an effect either.
According to your answer, I believe your model has a high bias and low variance (see this link for further details). Thus, your model is not fitting your data very well and it is causing underfitting. So, I suggest you 3 things:
Train your model a little longer: I believe two epoch are too few to give a chance to your model understand the patterns in the data. Try to minimize learning rate and increase the number of epochs.
Try a different architecture: you may change the amount of convolutions, filters and layers, You can also use different activation functions and other layers like max pooling.
Make an error analysis: once you finished your training, apply your model to test set and take a look into the errors. How much false positives and false negatives do you have? Is your model better to classify one class than the other? You can see a pattern in the errors that may be related to your data?
Finally, if none of these suggestions helped you, you may also try to increase the number of features, if possible.

May I use CaffeNet for 3 labels? [duplicate]

I trained GoogLeNet model from scratch. But it didn't give me the promising results.
As an alternative, I would like to do fine tuning of GoogLeNet model on my dataset. Does anyone know what are the steps should I follow?
Assuming you are trying to do image classification. These should be the steps for finetuning a model:
1. Classification layer
The original classification layer "loss3/classifier" outputs predictions for 1000 classes (it's mum_output is set to 1000). You'll need to replace it with a new layer with appropriate num_output. Replacing the classification layer:
Change layer's name (so that when you read the original weights from caffemodel file there will be no conflict with the weights of this layer).
Change num_output to the right number of output classes you are trying to predict.
Note that you need to change ALL classification layers. Usually there is only one, but GoogLeNet happens to have three: "loss1/classifier", "loss2/classifier" and "loss3/classifier".
2. Data
You need to make a new training dataset with the new labels you want to fine tune to. See, for example, this post on how to make an lmdb dataset.
3. How extensive a finetuning you want?
When finetuning a model, you can train ALL model's weights or choose to fix some weights (usually filters of the lower/deeper layers) and train only the weights of the top-most layers. This choice is up to you and it ususally depends on the amount of training data available (the more examples you have the more weights you can afford to finetune).
Each layer (that holds trainable parameters) has param { lr_mult: XX }. This coefficient determines how susceptible these weights to SGD updates. Setting param { lr_mult: 0 } means you FIX the weights of this layer and they will not be changed during the training process.
Edit your train_val.prototxt accordingly.
4. Run caffe
Run caffe train but supply it with caffemodel weights as an initial weights:
~$ $CAFFE_ROOT/build/tools/caffe train -solver /path/to/solver.ptototxt -weights /path/to/orig_googlenet_weights.caffemodel
Fine-tuning is a very useful trick to achieve a promising accuracy compared to past manual feature. #Shai already posted a good tutorial for fine-tuning the Googlenet using Caffe, so I just want to give some recommends and tricks for fine-tuning for general cases.
In most of time, we face a task classification problem that new dataset (e.g. Oxford 102 flower dataset or Cat&Dog) has following four common situations CS231n:
New dataset is small and similar to original dataset.
New dataset is small but is different to original dataset (Most common cases)
New dataset is large and similar to original dataset.
New dataset is large but is different to original dataset.
In practice, most of time we do not have enough data to train the network from scratch, but may be enough for pre-trained model. Whatever which cases I mentions above only thing we must care about is that do we have enough data to train the CNN?
If yes, we can train the CNN from scratch. However, in practice it is still beneficial to initialize the weight from pre-trained model.
If no, we need to check whether data is very different from original datasets? If it is very similar, we can just fine-tune the fully connected neural network or fine-tune with SVM. However, If it is very different from original dataset, we may need to fine-tune the convolutional neural network to improve the generalization.

Caffe Autoencoder

I wanna compare the performance of CNN and autoencoder in caffe. I'm completely familiar with cnn in caffe but I wanna is the autoencoder also has deploy.prototxt file ? is there any differences in using this two models rather than the architecture?
Yes it also has a deploy.prototxt.
both train_val.prototxt and 'deploy.prototxt' are cnn architecture description files. The sole difference between them is, train_val.prototxt takes training data and loss as input/output, but 'deploy.prototxt' takes testing image as input, and predicted value as out put.
Here is an example of a cnn and autoencoder for MINST: Caffe Examples. (I have not tried the examples.) Using the models is generally the same. Learning rates etc. depend on the model.
You need to implement an auto-encoder example using python or matlab. The example in Caffe is not true auto-encoder because it doesn't set layer-wise training stage and during training stage, it doesn't fix W{L->L+1} = W{L+1->L+2}^T. It is easily to find a 1D auto-encoder in github, but 2D auto-encoder may be hard to find.
The main difference between the Auto encoders and conventional network is
In Auto encoder your input is your label image for training.
Auto encoder tries to approximate the output similar as input.
Auto encoders does not have softmax layer while training.
It can be used as a pre-trained model for your network which converge faster comparing to other pre-trained models. It is because your network has already extracted the features for your data.
The Conventional training and testing you can perform on pre trained auto encoder network for faster convergence and accuracy.