I have an interesting problem. I am working on a project in which I am trying to classify 15 logo classes (14 logos + 1 non-logo class). The dataset is our own. I am using DIGITS 5/6, which uses Caffe. My Caffe is the NVIDIA fork, version 0.15.14.
I have trained it with the AlexNet and GoogLeNet models that ship with DIGITS. The models built both from scratch and by finetuning seem OK (GoogLeNet: 90% accuracy, AlexNet: 80%). Those models were created by finetuning the pretrained (ImageNet) models.
My problem is that I wanted to extend my study to cover ResNet-32, DenseNet-121 and VGG16/19. Whenever I train these models, their top-1 accuracy is very poor (generally 0). You might guess, as I did, that this comes from training from scratch. However, as far as I know, the model should still converge to some limit, but I always see a flat line after 2 or 3 epochs (the line is the accuracy curve, and it is generally at 0), and the loss value rises to 87 after a few epochs.
I have looked into the possible causes:
1. I changed the weight_filler param to "xavier", but nothing changed.
2. I increased the learning rate, but nothing changed.
3. I even used a pretrained model to finetune VGG16, but the result is still the same.
4. I tried the CIFAR-10 dataset, upscaling the images to 224*224, but the values are very similar to those on my logo dataset.
I am struggling to find the correct approach. I am not an expert, but it seems very odd to me to get such bad results after getting good ones with AlexNet and GoogLeNet.
Why do my models not converge on these more recent networks? I need your advice.
By the way, my training data contains 400 images per class, and for the non-logo class I have collected 1200 non-logo images. The validation data contains a different number of images per logo class and a different set of 1200 non-logo images. So in total I have 5204 training, 579 test (10% of training) and 4729 validation images.
Here I am attaching a trainval.txt for my ResNet-32 model.
So what is my problem?
Thanks in advance
resnet32_train_val.prototxt
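A side note on the loss value of 87, offered as a hedged sketch rather than a diagnosis. To my knowledge, Caffe's softmax loss clamps the predicted probability at FLT_MIN before taking the log, so the largest per-sample loss it can report is -log(FLT_MIN), which is about 87.34. If that assumption holds for this Caffe build, a loss pinned near 87 suggests the network output has blown up (training diverged) rather than that it is converging slowly. The helper name below is hypothetical and only illustrates the arithmetic.

```python
import math

# FLT_MIN in C: the smallest positive normalized float32, i.e. 2**-126.
FLT_MIN = 2.0 ** -126


def clamped_log_loss(p_true_class):
    """Per-sample log loss with the probability clamped at FLT_MIN.

    Assumption: this mirrors how the loss layer guards against log(0);
    it is an illustration, not a copy of the Caffe source.
    """
    return -math.log(max(p_true_class, FLT_MIN))


# Loss when the true class gets (numerically) zero probability.
saturated_loss = clamped_log_loss(0.0)

# Loss of a network that guesses uniformly over the 15 classes in the question.
uniform_loss = clamped_log_loss(1.0 / 15)

print(f"saturated loss: {saturated_loss:.4f}")
print(f"uniform-guess loss for 15 classes: {uniform_loss:.4f}")
```

For comparison, an untrained 15-class network that guesses uniformly should start near ln(15), so a loss that climbs from there to the saturated value is moving in the wrong direction. In that situation, lowering the learning rate is usually a more promising experiment than raising it.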
So, I'm doing a 4-label X-ray image classification on around 12600 images:
Class1:4000
Class2:3616
Class3:1345
Class4:4000
I'm using the VGG-16 architecture pretrained on the ImageNet dataset, with cross-entropy loss and SGD, a batch size of 32 and a learning rate of 1e-3, running on PyTorch.
[[749., 6., 50., 2.],
[ 5., 707., 9., 1.],
[ 56., 8., 752., 0.],
[ 4., 1., 0., 243.]]
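As a quick check of the numbers in this matrix, here is a minimal sketch that computes overall accuracy and per-class recall. It assumes the matrix is a validation confusion matrix with rows as true classes and columns as predicted classes; the question does not state the convention, so treat that as an assumption.

```python
# Confusion matrix copied from the question.
# Assumption: rows = true class, columns = predicted class.
confusion = [
    [749, 6, 50, 2],
    [5, 707, 9, 1],
    [56, 8, 752, 0],
    [4, 1, 0, 243],
]

# Overall accuracy is the diagonal (correct predictions) over all samples.
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
overall_accuracy = correct / total

# Recall per class: correct predictions for a row divided by the row total.
per_class_recall = [row[i] / sum(row) for i, row in enumerate(confusion)]

print(f"samples: {total}, correct: {correct}")
print(f"overall accuracy: {overall_accuracy:.4f}")
for i, recall in enumerate(per_class_recall):
    print(f"class {i}: recall {recall:.4f}")
```

Under that row/column assumption, most of the errors sit in the confusion between the first and third classes, which may be worth inspecting separately from the overall figure.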
I know that since the train loss and accuracy are roughly 0 and 1 the model is overfitting, though I'm surprised that the val acc is still around 0.9!
How should I interpret that, what is causing it, and how can I prevent it?
I know it has something to do with accuracy being the argmax of the softmax: the actual predicted probabilities get lower and lower while the argmax stays the same. But I'm really confused about it! I even let it train for 64+ epochs with the same result: flat accuracy while the loss increases gradually!
PS. I have seen other questions with answers but didn't really get an explanation.
I think your question already describes what is going on. Your model is overfitting, as you have figured out. As you train longer, the model slowly becomes more specialized to the train set and gradually loses its ability to generalize, so the softmax probabilities on the validation set become flatter and flatter. It still shows more or less the same validation accuracy because, for now, the correct class still has at least slightly more probability than the others. In my opinion there are some possible reasons for this:
Your train set and validation set may not be from the same distribution.
Your validation set doesn't cover all the cases that need to be evaluated. It probably contains similar types of images that do not differ much from each other, so when the model can identify one, it can identify many of them. If you add more heterogeneous images to the validation set, you will no longer see such a high validation accuracy.
Similarly, your train set may contain heterogeneous images, i.e. images with a lot of variation, while the validation set covers only a few varieties. As training goes on, those few varieties get less priority because the model still has many other things to learn and generalize. This can happen if you augment your train set: the model finds the validation set relatively easy at first (until it overfits), but as training goes on it gets lost learning the many augmented varieties in the train set. In this case, don't make the augmentation too wild. Ask whether the augmented images are still realistic. Augment only as long as the images remain realistic and each type of variation has enough representative examples in the train set. Don't include situations that will never occur in reality, as these unrealistic examples only add burden to the model without helping.
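To make the "loss rises while accuracy stays flat" effect concrete, here is a minimal sketch with made-up softmax outputs. All the numbers are hypothetical and only for illustration. It shows two ways the mean cross-entropy can grow with identical argmax predictions: the probabilities becoming flatter, as described above, and the model becoming very confident on a sample it already gets wrong.

```python
import math


def evaluate(predictions, labels):
    """Return (mean cross-entropy, accuracy) for lists of softmax outputs."""
    losses = [-math.log(p[y]) for p, y in zip(predictions, labels)]
    hits = [max(range(len(p)), key=p.__getitem__) == y
            for p, y in zip(predictions, labels)]
    return sum(losses) / len(losses), sum(hits) / len(hits)


labels = [0, 1, 2, 0]

# Hypothetical outputs at an early epoch: the last sample is misclassified.
early = [[0.70, 0.20, 0.10], [0.15, 0.75, 0.10],
         [0.10, 0.20, 0.70], [0.30, 0.60, 0.10]]

# Later epoch, flatter probabilities: every argmax is unchanged.
flatter = [[0.40, 0.30, 0.30], [0.30, 0.40, 0.30],
           [0.30, 0.30, 0.40], [0.35, 0.40, 0.25]]

# Later epoch, overconfident: same argmax, but confidently wrong on sample 4.
overconfident = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05],
                 [0.05, 0.05, 0.90], [0.01, 0.98, 0.01]]

early_loss, early_acc = evaluate(early, labels)
flat_loss, flat_acc = evaluate(flatter, labels)
over_loss, over_acc = evaluate(overconfident, labels)

for name, loss, acc in [("early", early_loss, early_acc),
                        ("flatter", flat_loss, flat_acc),
                        ("overconfident", over_loss, over_acc)]:
    print(f"{name:>13}: loss={loss:.4f} accuracy={acc:.2f}")
```

In both later cases the accuracy is unchanged while the loss is higher, because accuracy only looks at which class wins and cross-entropy looks at how much probability the true class receives.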
I trained an image classification model with 10 classes by finetuning EfficientNet-B4 for 100 epochs. I split my training data 70/30. I used stochastic gradient descent with Nesterov momentum of 0.9 and a starting learning rate of 0.001. The batch size is 10. The test loss seemed to be stuck at 84% for the last 50 epochs (51st to 100th). I do not know whether the model was stuck in a local minimum or was overfitted. Below is an image of the test and train loss from the 51st epoch to the 100th. I would really appreciate your help. Thanks.
[Image: train and test loss from the 51st to the 100th epoch]
From the graph you provided, both validation and training losses are still going down, so your model is still training and there is no overfitting. If your test set is stuck at the same accuracy, the reason is probably that the data you are using for training/validation does not generalize well enough to your test dataset (in your graph the validation only reached 50% accuracy while your test set reached 84% accuracy).
I looked at your training and validation graph. Yes, your model is training and the losses are going down, but your validation error is near 50%, which means "random guess".
Possible reasons:
1. Your train error (shown in the image for epochs 50-100) goes down on average, but it is noisy: the error at epoch 100 is pretty much the same as at epoch 70. This could be because your dataset is too simple and you are forcing a huge network like EfficientNet to overfit it.
2. It could also be the way you are finetuning. There could be a problem with which layers you froze and which layers you are computing gradients for during backpropagation. I am assuming you are using pretrained weights.
3. An optimizer issue. Try using Adam.
It would be great if you could provide the full loss curves (from epoch 1 to 100).
https://github.com/slavaglaps/ResNet_cifar10/blob/master/resnet.ipynb
This is my model, trained for 100 epochs.
Accuracy on similar models and similar data reaches 90%.
What is my problem?
I think it may be worth reducing the learning rate as the epochs go by.
What do you think could help me?
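The idea of reducing the learning rate over the epochs can be sketched as a simple step decay. This is only an illustration: the function name and the values for the base rate, the drop factor and the interval are hypothetical and would need tuning for the notebook above.

```python
def step_decay(epoch, base_lr=0.1, drop=0.1, epochs_per_drop=30):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs.

    All default values are illustrative assumptions, not tuned settings.
    """
    return base_lr * (drop ** (epoch // epochs_per_drop))


# Print the learning rate at a few points of a 100-epoch run.
for epoch in (0, 29, 30, 60, 99):
    print(f"epoch {epoch:3d}: lr = {step_decay(epoch):.6f}")
```

In Keras, a function like this can be passed to the `LearningRateScheduler` callback so the rate is updated at the start of each epoch.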
There are a few subtle differences.
You are trying to apply an ImageNet-style architecture to CIFAR-10. In the CIFAR-10 version, the first convolution is 3 x 3, not 7 x 7, and there is no max-pooling layer. The image is downsampled purely by stride-2 convolutions.
You should probably do mean-centering by setting featurewise_center = True in ImageDataGenerator.
Do not use a very high number of filters such as [512, 1024, 2048]. You only have 50,000 images to train on, unlike ImageNet, which has about a million.
In short, read Section 4.2 of the deep residual network paper and try to replicate the network. You may also read this blog.
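As a rough guide to what Section 4.2 describes, here is a minimal sketch of the layer plan for the CIFAR-10 ResNet: one 3 x 3 convolution, then 6n stacked 3 x 3 convolutions split over feature map sizes 32, 16 and 8 with 16, 32 and 64 filters, then global average pooling and a 10-way fully connected layer, for 6n + 2 weighted layers in total. The function name and the dictionary layout are my own and purely illustrative.

```python
def cifar_resnet_plan(n):
    """Layer plan for the CIFAR-10 ResNet with 6n + 2 weighted layers."""
    stages = [
        {"feature_map": 32, "filters": 16, "conv_layers": 2 * n},
        {"feature_map": 16, "filters": 32, "conv_layers": 2 * n},
        {"feature_map": 8, "filters": 64, "conv_layers": 2 * n},
    ]
    # 1 initial 3x3 convolution + 6n stacked convolutions + 1 final FC layer.
    depth = 1 + sum(stage["conv_layers"] for stage in stages) + 1
    return stages, depth


stages, depth = cifar_resnet_plan(5)
print(f"depth: {depth}")
for stage in stages:
    print(stage)
```

With n = 5 this gives the 32-layer variant, and n = 3 gives the 20-layer one. Note how much smaller the filter counts are than in the ImageNet models.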
I am training a deep autoencoder (for now 5 encoding layers and 5 decoding layers, using leaky ReLU) to reduce the dimensionality of the data from about 2000 dims to 2. I can train my model on 10k data points, and the outcome is acceptable.
The problem arises when I use more data (50k to 1M). Using the same model with the same optimizer, dropout, etc. does not work, and training gets stuck after a few epochs.
I am trying to do a hyper-parameter search on the optimizer (I am using Adam), but I am not sure whether this will solve the problem.
Should I look for something else to change or check? Does the batch size matter in this case? Should I solve the problem by fine-tuning the optimizer? Should I play with the dropout ratio? ...
Any advice is very much appreciated.
P.S. I am using Keras. It is very convenient. If you do not know about it, then check it out: http://keras.io/
I would ask the following questions when trying to find the cause of the problem:
1) What happens if you change the size of the middle layer from 2 to something bigger? Does it improve the performance of the model trained on the >50k training set?
2) Are the 10k training examples and the test examples randomly selected from the 1M dataset?
My guess is that your model is simply not able to compress and reconstruct your 50k-1M data using just 2 dimensions in the middle layer. It is easier for the model to fit its parameters to 10k data points, so the activations of the middle layer are more sensible in that case, but for >50k data points the activations are random noise.
After some investigation, I have realized that the layer configuration I am using is ill-suited to the problem, and this seems to cause at least part of it.
I have been using a sequence of layers for encoding and decoding. The layer sizes were chosen to decrease linearly, for example:
input: 1764 (dims)
hidden1: 1176
hidden2: 588
encoded: 2
hidden3: 588
hidden4: 1176
output: 1764 (same as input)
However, this seems to work only occasionally, and it is sensitive to the choice of hyper-parameters.
I tried replacing this with exponentially decreasing layer sizes for encoding, and the reverse for decoding, so:
1764, 128, 16, 2, 16, 128, 1764
In this case the training seems to be more robust. I still have to run a hyper-parameter search to see whether this configuration is sensitive or not, but a few manual trials seem to show that it is robust.
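One concrete difference between the two configurations is their size. The sketch below counts the weights and biases of fully connected layers for both size lists. It assumes plain dense layers with one bias per unit, which may not match the exact Keras model, and it does not claim that parameter count is the cause of the difference in robustness.

```python
def dense_param_count(layer_sizes):
    """Total weights and biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Layer sizes taken from the two configurations described above.
linear_sizes = [1764, 1176, 588, 2, 588, 1176, 1764]
exponential_sizes = [1764, 128, 16, 2, 16, 128, 1764]

linear_params = dense_param_count(linear_sizes)
exponential_params = dense_param_count(exponential_sizes)

print(f"linear layout:      {linear_params:,} parameters")
print(f"exponential layout: {exponential_params:,} parameters")
print(f"ratio: {linear_params / exponential_params:.1f}x")
```

The linear layout also squeezes 588 units into 2 in a single step, whereas the exponential layout approaches the bottleneck more gradually (16 to 2), which may be part of why it trains more reliably.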
I will post an update if I encounter some other interesting points.
After going through the Caffe tutorial here: http://caffe.berkeleyvision.org/gathered/examples/mnist.html
I am really confused by the different (and efficient) model used in this tutorial, which is defined here: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt
As I understand it, a Convolution layer in Caffe simply calculates the sum Wx+b for each input, without applying any activation function. If we want an activation function, we should add another layer immediately after that convolutional layer, such as a Sigmoid, TanH, or ReLU layer. Every paper or tutorial I have read on the internet applies an activation function to the neuron units.
This leaves me with a big question mark, as we only see Convolution layers and Pooling layers interleaved in the model. I hope someone can give me an explanation.
As a side note, another doubt I have is about max_iter in this solver:
https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_solver.prototxt
We have 60,000 images for training and 10,000 images for testing. So why is max_iter here only 10,000 (and how can it still reach > 99% accuracy)? What does Caffe do in each iteration?
Actually, I'm not so sure whether the accuracy rate is the total number of correct predictions divided by the test size.
I'm amazed by this example, as I haven't found any other example or framework that can achieve this accuracy in such a short time (only 5 minutes to get >99% accuracy). Hence, I suspect there is something I have misunderstood.
Thanks.
Caffe uses batch processing. Each iteration processes one batch, so max_iter can be 10,000 because the batch_size is 64. Number of epochs = (batch_size x max_iter) / number of train samples, so the number of epochs is roughly 10 (640,000 / 60,000 is about 10.7). The accuracy is calculated on the test data. And yes, the accuracy of the model is indeed >99%, as the dataset is not very complicated.
As for your question about the missing activation layers, you are correct. The model in the tutorial has no activation layers after the convolution layers. This seems to be an oversight in the tutorial. In the real LeNet-5 model, there should be activation functions following the convolution layers. For MNIST, the model still works surprisingly well without the additional activation layers.
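The epoch arithmetic above can be checked with a few lines. The values are the ones quoted in this thread (batch_size 64, max_iter 10,000, 60,000 training images); the variable names are just for illustration.

```python
# Values quoted in the thread for the LeNet MNIST example.
batch_size = 64
max_iter = 10_000
num_train_samples = 60_000

# Each iteration processes one batch, so this is the total number of
# training images seen (with repetition) over the whole run.
images_seen = batch_size * max_iter

# One epoch is one full pass over the training set.
epochs = images_seen / num_train_samples
iterations_per_epoch = num_train_samples / batch_size

print(f"images seen: {images_seen:,}")
print(f"epochs: {epochs:.2f}")
print(f"iterations per epoch: {iterations_per_epoch:.1f}")
```

So an iteration is one forward and backward pass over a single batch, not over the whole dataset, which is why 10,000 iterations still amount to several full passes over the 60,000 images.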
For reference, in Le Cun's 2001 paper, it states:
As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted a_i, for unit i, is then passed through a sigmoid squashing function to produce the state of unit i ...
F6 is the "blob" between the two fully connected layers. Hence the first fully connected layer should have an activation function applied (the tutorial uses a ReLU activation function there instead of a sigmoid).
MNIST is the "hello world" example for neural networks. It is very simple by today's standards. A single fully connected layer can solve the problem with an accuracy of about 92%. LeNet-5 is a big improvement over that baseline.