Still confused about model.train() - deep-learning

I read all the posts here regarding model.train() and still don't understand what it does. Specifically, when I use a pre-trained model like DenseNet or VGG with all parameters frozen except the last layer, which uses neither dropout nor batch normalization, the training loss starts off a lot smaller when I call model.train(), but then decreases at about the same rate as without it.
Why?

There are just three options: plain model(inputs), model.train()(inputs) and model.eval()(inputs). The difference is that in .eval() mode dropout is switched off and batch normalization uses its stored running statistics instead of the statistics of the current batch, because the training-time behavior is not wanted during testing. Freezing parameters does not change this: the mode is independent of requires_grad.
Now, you asked why it still trains when you just use model(inputs)? Because a module is in train mode by default until you call eval(). So model(inputs) is the same as model.train()(inputs), unless model.eval() was called earlier.
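To make the mode switch concrete, here is a toy sketch in plain Python. ToyBatchNorm is a made-up class, not PyTorch's implementation; it only mimics the relevant behavior under stated assumptions: a training flag that is on by default, batch statistics in train mode, stored running statistics in eval mode, and train()/eval() returning the object itself. Note that DenseNet contains batch-normalization layers, so its forward pass can depend on the mode even when all of its weights are frozen.

```python
class ToyBatchNorm:
    """1-D batch normalization over a list of floats (illustrative toy only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.training = True       # modules start in training mode
        self.running_mean = 0.0    # statistics used in eval mode
        self.running_var = 1.0
        self.momentum = momentum
        self.eps = eps

    def train(self, mode=True):
        self.training = mode
        return self                # returning self is why model.train()(inputs) works

    def eval(self):
        return self.train(False)

    def __call__(self, batch):
        if self.training:
            # Train mode: normalize with the statistics of the current batch
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # ...and update the running statistics as a side effect
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Eval mode: normalize with the stored running statistics
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]


layer = ToyBatchNorm()
batch = [10.0, 12.0, 14.0]
train_out = layer(batch)         # no train() call needed: training mode is the default
eval_out = layer.eval()(batch)   # same input, different output in eval mode
print(train_out)
print(eval_out)
```

The same input gives different outputs in the two modes, and no trainable weight is involved, which is one way the starting loss can differ between modes in a frozen network.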

Related

How can you increase the accuracy of ResNet50?

I'm using a ResNet50 model to classify images into two classes: normal cells and cancer cells.
I want to increase the accuracy, but I don't know what to modify.
# We are using ResNet50 for transfer learning here, so we import it
# together with the Keras pieces used below
from tensorflow.keras import layers
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adamax
from tensorflow.keras.applications import resnet50
# img_shape and class_count are not defined in this snippet; they come from earlier in the script
# Initialize the model with weights='imagenet', i.e. we keep its pre-trained weights
model_name = 'resnet50'
base_model = resnet50.ResNet50(include_top=False, weights="imagenet", input_shape=img_shape, pooling='max')
last_layer = base_model.output  # output of the last layer of the base model
# Add a flatten layer (a no-op here, since pooling='max' already returns a flat vector)
flatten = layers.Flatten()(last_layer)
# Add a dense layer
dense1 = layers.Dense(100, activation='relu')(flatten)
# Add the final output layer
# NOTE: as written it is fed from `flatten`, so `dense1` above is never used
output_layer = layers.Dense(class_count, activation='softmax')(flatten)
# Create the model from the input and output layers
model = Model(inputs=base_model.inputs, outputs=output_layer)
model.compile(Adamax(learning_rate=.001), loss='categorical_crossentropy', metrics=['accuracy'])
There were 48 errors in 534 test cases, so the model accuracy is 91.01 %.
Also, what do you think about the results in the graph?
This is the classification report.
I got good results, but is it possible to increase the accuracy further?
This is a broad question, as there are many ways to try to improve a network's accuracy. Some of them are:
Increase the dimension of the layers that are learned in transfer learning (make sure not to overfit).
Use convolutional layers rather than an MLP for the part trained during transfer learning.
Let the optimization algorithm adapt the learning rate on its own.
Experiment with additional augmentations of the dataset.
And the list goes on.
Also, if possible, I would suggest comparing your results with other publicly available benchmarks. Doing so may give you a better idea of the upper bound on the achievable accuracy.

Both validation loss and accuracy are increasing using a pre-trained VGG-16

I'm doing a 4-class X-ray image classification on around 12600 images:
Class1:4000
Class2:3616
Class3:1345
Class4:4000
I'm using the VGG-16 architecture pre-trained on the ImageNet dataset, with cross-entropy loss, SGD, a batch size of 32 and a learning rate of 1e-3, running on PyTorch.
[[749., 6., 50., 2.],
[ 5., 707., 9., 1.],
[ 56., 8., 752., 0.],
[ 4., 1., 0., 243.]]
I know the model is overfitting, since the train loss and accuracy are close to 0 and 1 respectively, but I'm surprised that the validation accuracy is still around 0.9!
How should I interpret this, what is causing it, and how can I prevent it?
I suspect it is because accuracy only looks at the argmax of the softmax output: the predicted probabilities get lower and lower while the argmax stays the same. But I'm really confused about it! I even let it train for 64+ epochs with the same result: flat accuracy while the loss increases gradually.
PS: I have seen other questions with answers, but none of them really gave me an explanation.
I think your question already describes what is going on. Your model is overfitting, as you have figured out. As you keep training, the model slowly becomes more specialized to the train set and gradually loses its ability to generalize. So the softmax probabilities on the validation set are getting flatter and flatter. But the validation accuracy stays more or less the same, because the correct class still has at least slightly more probability than the others. In my opinion there are a few possible reasons for this:
Your train set and validation set may not come from the same distribution.
Your validation set doesn't cover all the cases that need to be evaluated; it probably contains similar types of images that do not differ much from one another. So when the model can identify one of them, it can identify many of them. If you add more heterogeneous images to the validation set, you will no longer see such a high validation accuracy.
Similarly, your train set may contain heterogeneous images, i.e. images with a lot of variation, while the validation set covers only a few varieties. As training goes on, those few varieties get less priority because the model still has many other things to learn and generalize. This can happen if you augment your train set: the model finds the validation set relatively easy at first (until it overfits), but as training goes on it gets lost while learning the many augmented varieties in the train set. In that case, don't make the augmentation too wild. Ask whether the augmented images are still realistic. Augment images only as long as they remain realistic and each type of variation has enough representative examples in the train set. Don't include situations that will never occur in reality, as such unrealistic examples only burden the model without helping it.
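The point that accuracy can stay flat while the loss grows can be checked with a small worked example. The probabilities below are invented for illustration (they are not taken from the question): between the two snapshots no argmax changes, the correct predictions become flatter, and the one wrong prediction becomes more confident, which is another common way for the validation loss to rise.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def accuracy(probs, labels):
    """Fraction of samples whose most probable class is the true class."""
    hits = sum(p.index(max(p)) == y for p, y in zip(probs, labels))
    return hits / len(labels)


labels = [0, 1, 2, 3]

# Hypothetical softmax outputs early in training (sample 3 is misclassified)
early = [
    [0.90, 0.04, 0.03, 0.03],
    [0.05, 0.85, 0.05, 0.05],
    [0.60, 0.10, 0.25, 0.05],
    [0.02, 0.03, 0.05, 0.90],
]

# Hypothetical outputs many epochs later: every argmax is unchanged
late = [
    [0.40, 0.25, 0.20, 0.15],
    [0.20, 0.45, 0.20, 0.15],
    [0.97, 0.01, 0.01, 0.01],
    [0.20, 0.20, 0.20, 0.40],
]

for name, probs in (("early", early), ("late", late)):
    print(name, "accuracy:", accuracy(probs, labels),
          "loss:", round(cross_entropy(probs, labels), 3))
```

Both snapshots classify the same three of four samples correctly, yet the cross-entropy of the later one is several times larger, because the loss depends on the probability given to the true class and not only on which class wins.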

Trade off between losses?

I have been working on a super-resolution task and have a question about choosing the loss function. For this task I went with SSIM as the loss function to train my model, and I got a good set of results. Recently I came across the perceptual loss, where we compare how a pre-trained model sees the ground-truth (GT) image and the super-resolution (SR) image (the image generated by the model). I am thinking of using both, (1 - SSIM(SR, GT)) + PerceptualLoss(SR, GT), as the loss for backpropagation. Should I use a trade-off parameter between these two losses? If so, how can I set it? Or should I add the losses with equal weights?
PS: The perceptual loss is calculated by finding the SSIMs between the feature maps of the GT and SR images from the pre-trained model.
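A weighted sum is one common way to write such a trade-off; here is a minimal sketch. The function name combined_loss and the parameter perceptual_weight are hypothetical, and the inputs are assumed to be scalar values already computed elsewhere. Setting perceptual_weight to 1.0 recovers the equal-weight sum from the question; any other value would have to be chosen empirically, for example by comparing results on a validation set.

```python
def combined_loss(ssim_value, perceptual_value, perceptual_weight=1.0):
    """Return (1 - SSIM) + perceptual_weight * perceptual loss.

    ssim_value: SSIM(SR, GT), where 1.0 means identical images.
    perceptual_value: perceptual loss between SR and GT, already computed.
    perceptual_weight: hypothetical trade-off hyperparameter.
    """
    return (1.0 - ssim_value) + perceptual_weight * perceptual_value


# Equal weights, as in the question
equal = combined_loss(0.8, 0.5)

# Down-weighting the perceptual term, e.g. if it dominates in magnitude
down_weighted = combined_loss(0.8, 0.5, perceptual_weight=0.1)

print(equal, down_weighted)
```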

Training and test diverge while running catboost

When I run the CatBoost regressor, my training and test curves diverge, with weird kinks at ~1000 iterations. The plot is attached below, and my regressor setup is as follows:
cat_model=CatBoostRegressor(iterations=2500, depth=4, learning_rate=0.01, loss_function='RMSE', thread_count=-1, use_best_model=True, random_seed=12, random_strength=10, rsm=0.5)
I tried different values of leaf_estimation_iterations and bagging_temperature, but without success. Any suggestions on what I should try to get better results?
Model Fit Plot
The divergence is normal. You will almost always perform better on the train set, as the model overfits the training set, and your objective is to keep that in check using the validation set.
First, I would recommend reading about the bias-variance tradeoff for a general intuition on how to tackle this issue.
Specifically for CatBoost, you want to regularize the training procedure so that it generalizes better.
You can start by adding more data and setting a higher l2_leaf_reg parameter.
The official documentation has many more good suggestions on model tuning:
https://catboost.ai/docs/concepts/parameter-tuning.html
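As a sketch of the l2_leaf_reg suggestion, the snippet below starts from the parameters in the question and raises the L2 regularization. The helper names, the value 10.0 and the 100-round early-stopping patience are illustrative assumptions, not tuned recommendations. The catboost import sits inside the fitting helper so that the parameter handling can be read and run on its own.

```python
# Parameters copied from the question
base_params = dict(
    iterations=2500, depth=4, learning_rate=0.01, loss_function='RMSE',
    thread_count=-1, use_best_model=True, random_seed=12,
    random_strength=10, rsm=0.5,
)


def with_stronger_l2(params, l2_leaf_reg=10.0):
    """Return a copy of params with a higher L2 leaf regularization (hypothetical helper)."""
    updated = dict(params)
    updated["l2_leaf_reg"] = l2_leaf_reg
    return updated


def fit_with_validation(params, X_train, y_train, X_val, y_val):
    """Fit on the train split while monitoring a validation split (hypothetical helper)."""
    from catboost import CatBoostRegressor  # imported lazily; requires catboost

    model = CatBoostRegressor(**params)
    # use_best_model=True needs an eval_set; early stopping halts training
    # once the validation metric has not improved for the given number of rounds
    model.fit(X_train, y_train, eval_set=(X_val, y_val),
              early_stopping_rounds=100, verbose=False)
    return model


regularized_params = with_stronger_l2(base_params)
print(regularized_params)
```

Calling fit_with_validation(regularized_params, ...) with your own train and validation splits would then train the regularized model; comparing its validation RMSE with the original setup shows whether the gap narrows.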

deep autoencoder training, small data vs. big data

I am training a deep autoencoder (for now 5 encoding layers and 5 decoding layers, using leaky ReLU) to reduce the dimensionality of the data from about 2000 dimensions to 2. I can train my model on 10k samples, and the outcome is acceptable.
The problem arises when I use more data (50k to 1M samples). The same model with the same optimizer, dropout, etc. does not work, and the training gets stuck after a few epochs.
I am trying a hyperparameter search on the optimizer (I am using Adam), but I am not sure whether this will solve the problem.
Should I look for something else to change or check? Does the batch size matter in this case? Should I solve the problem by fine-tuning the optimizer? Should I play with the dropout ratio?
Any advice is very much appreciated.
p.s. I am using Keras. It is very convenient. If you do not know about it, then check it out: http://keras.io/
I would ask the following questions when trying to find the cause of the problem:
1) What happens if you change the size of the middle layer from 2 to something bigger? Does it improve the performance of the model trained on the >50k training set?
2) Are the 10k training examples and the test examples randomly selected from the 1M dataset?
My guess is that your model is simply not able to compress your 50k-1M data into just 2 dimensions in the middle layer and reconstruct it. So it is easier for the model to fit its parameters to 10k data, where the activations of the middle layer are more meaningful, whereas for >50k data the activations are random noise.
After some investigation, I have realized that the layer configuration I am using is ill-suited to the problem, and this seems to cause at least part of the problem.
I have been using a sequence of layers for encoding and decoding. The layer sizes were chosen to decrease linearly, for example:
input: 1764 (dims)
hidden1: 1176
hidden2: 588
encoded: 2
hidden3: 588
hidden4: 1176
output: 1764 (same as input)
However, this seems to work only occasionally, and it is sensitive to the choice of hyperparameters.
I tried replacing this with exponentially decreasing layer sizes for encoding, and the reverse for decoding:
1764, 128, 16, 2, 16, 128, 1764
In this case the training seems to be more robust. I still have to run a hyperparameter search to see how sensitive this configuration is, but a few manual trials seem to show its robustness.
I will post an update if I encounter other interesting points.
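The two layouts described above can be written down and compared in a few lines. This is only a sketch with hypothetical helper names; it builds the symmetric width lists and counts the weights and biases of the corresponding fully connected stacks. It shows that the linear layout has far more parameters, but it does not establish that this is why the exponential layout trained more robustly.

```python
def mirror(encoder_sizes):
    """Append the decoder widths: the encoder widths reversed, without repeating the code layer."""
    return encoder_sizes + encoder_sizes[-2::-1]


def dense_param_count(sizes):
    """Weights plus biases of a stack of fully connected layers with the given widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# Widths taken from the post: linearly vs. exponentially decreasing encoders
linear = mirror([1764, 1176, 588, 2])
exponential = mirror([1764, 128, 16, 2])

for name, sizes in (("linear", linear), ("exponential", exponential)):
    print(name, sizes, "parameters:", dense_param_count(sizes))
```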