I have tried to finetune a pretrained model using a training set that has 20 classes. The important thing to mention is that even though I have 20 classes, one class consist the 1/3 of the training images. Is that a reason that my loss does not decrease and testing accuracy is almost 30%?
Thank you for any advise
I had similar problem. I resolved it by increasing the variance of the initial values for the neural network weights. This serves as pre-conditioning for the neural network, to prevent the weights from dying out during back-prop.
I came across neural network lectures from Prof. Jenny Orr's course and found it very informative. (Just realized that Jenny co-authored many papers with Yann LeCun and Leon bottou in the early years on neural network training).
Hope it helps!
Yes it is very possible that your net is overfitting to the unbalanced labels. One solution is you can perform data augmentation on the other labels to balance them out. For example, if you have image data: you can do random crops, take horizontal/vertical flips, a variety of techniques.
Edit:
One way to check if you are overfitting to the unbalanced labels is to compute a histogram of your nets predicted labels. If it's highly skewed towards the unbalanced class, you should try the above data augmentation method and retrain your net and see if that helps.
Related
I just read the paper of Mnih (2013) and was really wondering about the aspect that he talks about using RMSprop with minibatches of size 32 (page 6).
My understanding of these kinds of reinforcement learning algorithms is, that there is only 1 or at least very little amount of training samples per fit, and in every fit I update the network.
Whereas in supervised learning I have up to millions of samples and divide them in minibatches of e.g. 32 and update the network after every minibatch, which makes sense.
So my question is: If I put only one sample into the neural network at a time, how does minibatches make sense? Did I understand something wrong about that concept?
Thanks in advance!
The answer provided by Filip is correct. Just to add intuition to his answer, the reason why an experience replay is used is to decorrelate the experiences that the RL experienced. This is essential when non-linear function approximation is used such as neural networks.
Example: Imagine if you had 10 days to study for a chemistry and math test, and both test were on the same day. If you spend the first 5 days on chemistry and last 5 days on math, you would have forgotten most of the chemistry you studied. A neural network behaves similarly.
By decorrelating the experiences, a more general policy can be identified through the training data.
And while training the neural network, we have a batch of memory (i.e., data), and we sample random mini-batches of 32 from them to do supervised learning, just as any other neural network is trained.
The paper you mentioned introduces two mechanisms that stabilize Q-Learning method when used with a deep neural network function approximator. One of the mechanisms is called Experience Replay, and it is basically a memory buffer for observed experiences. You can find the description in the paper in the end of the fourth page. Instead of learning from the single experience you have just seen, you save it to the buffer. Learning is done every N iterations and you sample a random minibatch of experiences from the replay buffer.
I have a small dataset collect from imagenet(7 classes each class with 1000 training data). I try to train it with alexnet model. But somehow the accuracy just cant go any higher(about 68% maximum). I remove conv4 and conv5 layer to prevent model overfitting also decrease the number of neuron in each layer(conv and fc). here is my setup.
Did i do anything wrong so that the accuracy is so low?
I want to sort out a few terms:
(1) A perceptron is an individual cell in a neural net.
(2) In a CNN, we generally focus on the kernel (filter) as a unit; this is the square matrix of perceptrons that forms a psuedo-visual unit.
(3) The only place it usually makes sense to focus on an individual perceptron is in the FC layers. When you talk about removing some of the perceptrons, I think you mean kernels.
The most important part of training a model is to make sure that your model is properly fitted to the problem at hand. AlexNet (and CaffeNet, the BVLC implementation) is fitted to the full ImageNet data set. Alex Krizhevsky and his colleagues spent a lot of research effort in tuning their network to the problem. You are not going to get similar accuracy -- on a severely reduced data set -- by simply removing layers and kernels at random.
I suggested that you start from CONVNET (the CIFAR-10 net) because it's much better tuned to this scale of problem. Most of all, I strongly recommend that you make constant use of your visualization tools, so that you can detect when the various kernel layers begin to learn their patterns, and to see the effects of small changes in the topology.
You need to run some experiments to tune and understand your topology. Record the kernel visualizations at chosen times during the training -- perhaps at intervals of 10% of expected convergence -- and compare the visual acuity as you remove a few kernels, or delete an entire layer, or whatever else you choose.
For instance, I expect that if you do this with your current amputated CaffeNet, you'll find that the severe losses in depth and breadth greatly change the feature recognition it's learning. The current depth of building blocks is not enough to recognize edges, then shapes, then full body parts. However, I could be wrong -- you do have three remaining layers. That's why I asked you to post the visualizations you got, to compare with published AlexNet features.
edit: CIFAR VISUALIZATION
CIFAR is much better differentiated between classes than is ILSVRC-2012. Thus, the training requires less detail per layer and fewer layers. Training is faster, and the filters are not nearly as interesting to the human eye. This is not a problem with the Gabor (not Garbor) filter; it's just that the model doesn't have to learn so many details.
For instance, for CONVNET to discriminate between a jonquil and a jet, we just need a smudge of yellow inside a smudge of white (the flower). For AlexNet to tell a jonquil from a cymbidium orchid, the network needs to learn about petal count or shape.
I have a dataset of around 6K chemical formulas which I am preprocessing via Keras' tokenization to perform binary classification. I am currently using a 1D convolutional neural network with dropouts and am obtaining an accuracy of 82% and validation accuracy of 80% after only two epochs. No matter what I try, the model just plateaus there and doesn't seem to be improving at all. Those same exact accuracies are reached with a vanilla LSTM too. What else can I try to improve my accuracies? Losses only have a difference of 0.04... Anyone have any ideas? Both models use an embedding layer and changing the output dimension isn't having an effect either.
According to your answer, I believe your model has a high bias and low variance (see this link for further details). Thus, your model is not fitting your data very well and it is causing underfitting. So, I suggest you 3 things:
Train your model a little longer: I believe two epoch are too few to give a chance to your model understand the patterns in the data. Try to minimize learning rate and increase the number of epochs.
Try a different architecture: you may change the amount of convolutions, filters and layers, You can also use different activation functions and other layers like max pooling.
Make an error analysis: once you finished your training, apply your model to test set and take a look into the errors. How much false positives and false negatives do you have? Is your model better to classify one class than the other? You can see a pattern in the errors that may be related to your data?
Finally, if none of these suggestions helped you, you may also try to increase the number of features, if possible.
all experts
I am new in CNN and Caffe. I have a task in classification between 2 classes. The data set that I have collected is very small about 50 for class A and 50 for class B (I know that it is very very small). It is a human images.
I took the BVLC model and made a change such as Batch size for testing and training and also the maximum iteration. I try with many various setup, but the model doesn't work.
Any idea about how to come up with appropriate model or setting or other solutions ?
remark** I once randomly made a change to the BVLC model setup and it worked, but i lost the set up file.
For the train.prototxt and Solve.prototxt, I get it from this guy Adil Moujahid
I did try training batch size as 32,64,128,256 and testing for 5,20,30 but failed
For the data set, it is images of normal women and beautiful women and i will classify it, but Stackoverflow does not allowed me to add more than 2 links
I wonder that is there any formula , equation or steps that I can come up with and choose the right model setting.
Thank you in advance.
What is your meaning in "doesn't work"? Loss stays too high? Training is converged, but accuracy is low? Andrew Ng has an excellent session on "debugging" CNNs - Nuts and Bolts of Building Applications using Deep Learning (NIPS slides, summary, additional summary).
My humble guess is that your network has an overfitting problem - it learns the specific examples and can't generalize - so increasing the train dataset / regularization / data augmentation can help.
I'm enrolled in Coursera ML class and I just started learning about neural networks.
One thing that truly mystifies me is how recognizing something so “human”, like a handwritten digit, becomes easy once you find the good weights for linear combinations.
It is even crazier when you understand that something seemingly abstract (like a car) can be recognized just by finding some really good parameters for linear combinations, and combining them, and feeding them to each other.
Combinations of linear combinations are much more expressible than I once thought.
This lead me to wonder if it is possible to visualize NN's decision process, at least in simple cases.
For example, if my input is 20x20 greyscale image (i.e. total 400 features) and the output is one of 10 classes corresponding to recognized digits, I would love to see some kind of visual explanation of which cascades of linear combinations led the NN to its conclusion.
I naïvely imagine that this may be implemented as visual cue over the image being recognized, maybe a temperature map showing “pixels that affected the decision the most”, or anything that helps to understand how neural network worked in a particular case.
Is there some neural network demo that does just that?
This is not a direct answer to your question. I would suggest you take a look at convolutional neural networks (CNN). In CNNs you can almost see the concept that is learned. You should read this publication:
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998
CNNs are often called "trainable feature extractors". In fact, CNNs implement 2D filters with trainable coefficients. This is why the activation of the first layers are usually shown as 2D images (see Fig. 13). In this paper the authors use another trick to make the networks even more transparant: the last layer is a radial basis function layer (with gaussian functions), i. e. the distance to an (adjustable) prototype for each class is calculated. You can really see the learned concepts by looking at the parameters of the last layer (see Fig. 3).
However, CNNs are artificial neural networks. But the layers are not fully connected and some neurons share the same weights.
Maybe it doesn't answer the question directly but I found this interesting piece in this Andrew Ng, Jeff Dean, Quoc Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin,
Kai Chen and
Greg Corrado paper (emphasis mine):
In this section, we will present two visualization techniques to verify if the optimal stimulus of the neuron is indeed a face. The first method is visualizing the most responsive stimuli in the test set. Since the test set is large, this method can reliably detect near optimal stimuli of the tested neuron. The second approach is to perform numerical optimization to find the optimal stimulus
...
These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. Results, shown [below], confirm that the tested neuron indeed learns the concept of faces.
In other words, they take a neuron that is best-performing at recognizing faces and
select images from the dataset that it cause it to output highest confidence;
mathematically find an image (not in dataset) that would get highest condifence.
It's fun to see that it actually “captures” features of the human face.
The learning is unsupervised, i.e. input data didn't say whether an image is a face or not.
Interestingly, here are generated “optimal input” images for cat heads and human bodies: