I encountered a problem with an oscillating training loss curve, so I followed some tutorials and tried to solve it.
I ran a learning-rate range test, sweeping the LR from 1e-7 to 0.1. The network I used is ResNet-50, and the optimizer is Adam.
Below is the LR test result I got.
What I expected was a curve that first decreases slowly, then drops steeply, and finally goes back up. Instead I got this different result, and I don't know what this trend means. How should I select an appropriate learning rate from it?
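For reference, the kind of LR range test described above is usually run like this (a minimal sketch assuming a standard PyTorch setup; the stand-in model and random batches are placeholders for the actual ResNet-50 pipeline):

# LR range test sketch: multiply the learning rate by a constant factor each
# batch and record the loss, then plot loss against LR afterwards.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in for ResNet-50
criterion = nn.CrossEntropyLoss()

lr_start, lr_end, num_steps = 1e-7, 0.1, 100
mult = (lr_end / lr_start) ** (1.0 / (num_steps - 1))  # exponential LR growth
optimizer = torch.optim.Adam(model.parameters(), lr=lr_start)

lrs, losses = [], []
for step in range(num_steps):
    x = torch.randn(32, 3, 32, 32)        # placeholder batch
    y = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    losses.append(loss.item())
    for g in optimizer.param_groups:      # raise the LR for the next batch
        g["lr"] *= mult

# Plot losses against lrs on a log x-axis and pick an LR somewhat below the
# point where the loss starts to blow up.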
I am trying to train and test a PyTorch GCN model that is supposed to identify persons, but the test accuracy is quite jumpy: it reaches 49% at epoch 23 and then drops to around 45% at epoch 41. So it does not increase monotonically, even though the loss seems to decrease at every epoch.
My question is not about implementation errors; rather, I want to know why this happens. I don't think there is anything wrong with my code, as I have seen SOTA architectures show this type of behaviour as well. The authors just picked the best result and published it, saying that their model gives that result.
Is it normal for the accuracy to be jumpy (up and down), and should I just take the best weights that ever produced that result?
Accuracy is naturally more "jumpy", as you put it. In terms of accuracy, you have a discrete outcome for each sample: you either get it right or wrong. This makes the results fluctuate, especially if you have a relatively low number of samples (since the sampling variance is higher).
The loss function, on the other hand, should vary more smoothly. It is based on the class probabilities computed at your softmax layer, which vary continuously. With a small enough learning rate, the loss should decrease roughly monotonically; any bumps you see are due to the optimization algorithm taking discrete steps, under the assumption that the loss function is roughly linear in the vicinity of the current point.
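As a toy illustration of that discrete-vs-continuous difference (made-up numbers, unrelated to the model above): a tiny change in the predicted probabilities can flip an accuracy count while barely moving the loss.

# Accuracy changes in discrete jumps; cross-entropy changes smoothly.
import torch
import torch.nn.functional as F

labels = torch.tensor([1, 0, 1, 1])

# Two sets of logits that differ only slightly on the last sample.
logits_a = torch.tensor([[0.2, 1.0], [1.0, 0.3], [0.1, 0.9], [0.49, 0.51]])
logits_b = torch.tensor([[0.2, 1.0], [1.0, 0.3], [0.1, 0.9], [0.51, 0.49]])

for name, logits in [("a", logits_a), ("b", logits_b)]:
    acc = (logits.argmax(dim=1) == labels).float().mean().item()
    loss = F.cross_entropy(logits, labels).item()
    print(name, "accuracy:", acc, "loss:", round(loss, 4))

# Accuracy drops from 1.00 to 0.75 while the loss barely moves.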
I am having trouble understanding deep learning.
I see that deep learning is basically an inductive process, so the function must be adjusted until it hits the right target.
But I cannot figure out how much the w and b values should be changed in each trial. Is there any rule for the adjustment?
If there is not, is there any trick, like some formulas that are normally used?
Also, do deeper networks always perform better?
I understand that a single layer cannot fit as many targets as multiple layers can, but I don't know whether a 3-layer network is better than a 2-layer one.
The changes to w and b are based on their gradients.
You can calculate the gradient by taking the derivative of the error (which depends on the loss function) with respect to the parameters; moving the parameters against the gradient decreases the error.
The largest step you would ever want to take is roughly (gradient / ||gradient||) * (total_error / ||gradient||).
When you move the parameters by one unit along the gradient direction, the output changes by about the magnitude of the gradient. For that reason, the maximum changes of w and b are about gradient * err / ||gradient||^2.
However, pushing the step to this limit is not recommended, because you can overshoot or get trapped in a local minimum. That is why a learning rate (and techniques such as dropout) is usually used instead.
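As a concrete sketch of that kind of scaled update, here is gradient descent on a single weight with made-up numbers (a toy example, not a general recipe):

# Gradient descent on one weight; the learning rate scales the step size.
# Toy model: prediction = w * x, squared-error loss.
w = 0.0
x, target = 2.0, 6.0           # made-up data; the true w would be 3.0
learning_rate = 0.1

for step in range(20):
    pred = w * x
    error = pred - target
    grad = 2 * error * x       # d/dw of (w*x - target)^2
    w -= learning_rate * grad  # small step against the gradient

print(w)                       # approaches 3.0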
The method above is not the only way to adjust the parameters. Genetic algorithms, RBMs, or reinforcement learning methods could be used to replace or complement it.
I am able to get pretty good results from batch gradient descent (batch size 37000), but when I try mini-batch gradient descent I get very poor results (even with Adam and dropout).
With batch GD I get 100% train and 97% dev/CV accuracy.
With a mini-batch size of 128 I only get around 88% accuracy on both.
The training loss seems to hover around 1.6 and doesn't decrease with further iterations, but it slowly decreases when I increase the batch size (which improves accuracy), and eventually I arrive at a batch size of 37000 for maximum accuracy.
I tried tweaking alpha, but I still get the same accuracy.
I'm training on the MNIST digits dataset.
What could be the reason? Please help.
In batch gradient descent, all the training data is used to take a single step. In mini-batch gradient descent you use only part of the data before taking each step, so the model update frequency is higher than with batch gradient descent (illustrated in the sketch after this list).
But mini-batch gradient descent comes with trade-offs:
Firstly, mini-batches make some learning problems computationally tractable that a full batch could not handle, thanks to the reduced memory and compute demand of a smaller batch.
Secondly, a reduced batch size does not necessarily mean reduced gradient accuracy, since the training samples may contain a lot of noise, outliers, or bias.
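A minimal sketch of that update-frequency difference, using a toy NumPy linear-regression setup with made-up data (not the asker's MNIST model):

# One epoch of batch GD vs. mini-batch GD on a toy least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(37000, 10))
y = X @ rng.normal(size=10)
lr = 0.01

def grad(w, Xb, yb):
    # Gradient of the mean squared error for a linear model.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Batch GD: a single update per epoch using all 37000 samples.
w_batch = np.zeros(10)
w_batch -= lr * grad(w_batch, X, y)

# Mini-batch GD: ~290 updates per epoch with batch size 128.
w_mini = np.zeros(10)
for start in range(0, len(X), 128):
    Xb, yb = X[start:start + 128], y[start:start + 128]
    w_mini -= lr * grad(w_mini, Xb, yb)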
I believe that because of the oscillations in mini-batch training you might have fallen into a local minimum. Try increasing the learning rate with mini-batches; it may solve the problem. Also try normalizing the images; that may help too.
I found the solution.
The lambda value I used for batch GD (i.e. 10) seems to be too big for mini-batch GD.
By decreasing it to 0.1, I fixed the problem.
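One plausible explanation (an assumption about the implementation, since the code isn't shown): if the L2 penalty is divided by the current batch size when it is added to the loss, then shrinking the batch from 37000 to 128 makes the penalty roughly 290 times stronger per example, so lambda has to shrink with it. A toy illustration:

# Why the same lambda can over-regularize mini-batches (assumes the penalty
# term is averaged over the current batch, one common implementation).
import numpy as np

w = np.full(100, 0.1)          # made-up weight vector
lmbda = 10.0
sq_norm = np.sum(w ** 2)

for batch_size in (37000, 128):
    penalty_per_example = lmbda * sq_norm / (2 * batch_size)
    print(batch_size, penalty_per_example)

# With batch size 128 the penalty per example is ~290x larger, which is
# consistent with a much smaller lambda (0.1) working better.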
I think the title says it. A "game" takes a number of moves to complete, at which point a total score is computed. The goal is to maximize this score, and there are no rewards provided for specific moves during the game. Is there an existing algorithm that is geared toward this type of problem?
EDIT: By "continuously variable" reward, I mean it is a floating point number, not a win/loss binary. So you can't, for example, respond to "winning" by reinforcing the moves made to get there. All you have is a number. You can rank different runs in order of preference, but a single result is not especially meaningful.
First of all, in my opinion, the title of your question is a little confusing when you talk about a "continuously variable reward". Maybe you could clarify this aspect.
That aside, it sounds like you are talking about the temporal credit-assignment problem: how do you distribute credit over a sequence of actions that only obtains a reward (positive or negative) at the end of the sequence?
E.g., a tic-tac-toe game where the agent doesn't receive any reward until the game ends. In this case, almost any RL algorithm tries to solve the temporal credit-assignment problem. See, for example, Section 1.5 of Sutton and Barto's RL book, where they explain the working principles of RL and its advantages over other approaches using tic-tac-toe as an example.
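As a minimal sketch of how a single real-valued terminal score can still drive learning, here is a Monte Carlo-style value update that pushes every visited state toward the (discounted) final score; the state encoding and episode below are made up for illustration:

# Monte Carlo value updates driven only by a terminal score.
from collections import defaultdict

def update_values(V, states, final_score, gamma=1.0, alpha=0.1):
    # Walk the episode backwards, moving each state's value toward the return.
    G = final_score
    for s in reversed(states):
        V[s] += alpha * (G - V[s])
        G *= gamma                  # discount as we move away from the end
    return V

V = defaultdict(float)
episode = ["s0", "s1", "s2", "s3"]  # hypothetical states from one game
V = update_values(V, episode, final_score=42.7)
print(dict(V))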
This is more of a general question about training a CNN, but the one I'm using is YOLO.
I started my training set for 'person' detection by labelling some data from different cameras' videos (in similar environments). Every time I added data from a new camera I retrained YOLO, which actually improved detection for that camera. For training, I split my data randomly into training/validation sets and use the validation set to compute accuracy. This is not overfitting, as all the previous data are also used in training.
Now I've gathered more than 100,000 labelled samples. I was expecting not to have to retrain anymore at this point, as my dataset is pretty big, but it looks like I still need to: if I get a new camera video, label 500-1000 samples, add them to my huge dataset and train again, accuracy improves for that camera.
I don't really understand why. Why do I still need to add new data to my set? Why does accuracy improve so much on the new data, while those samples are 'drowned' among the thousands of existing ones? Is there a point where I will be able to stop training because adding new data no longer improves accuracy?
Thanks for sharing your thoughts and ideas!
Interesting question. If your data quality is good and the training procedure is 'perfect', you will always be able to generalize a bit better. Think about all the infinitely many different images you might want to run detection on: you are only using a sample of those, hoping it is enough to generalize. You can keep increasing your dataset and might gain another 0.01%; the question is when you want to stop. Your model's accuracy will never be 100%.
My opinion: if you have a nice accuracy above 95%, stop generating more data if the project is personal and no one's life depends on it. Think about post-processing to improve the results. Since you are detecting on video, maybe try to follow the person's movement, so if they are not detected in one frame but you have information from the previous and following frames, you might be able to do something fancy.
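For instance, here is a toy sketch of that kind of video post-processing (the detection format and track layout are made up): if a person is detected in frames t-1 and t+1 but missed in frame t, you can interpolate the missing box instead of reporting a miss.

# Fill single-frame detection gaps by interpolating the neighbouring boxes.
# Boxes are (x, y, w, h) tuples; None means "no detection in this frame".

def fill_gaps(track):
    filled = list(track)
    for t in range(1, len(track) - 1):
        if track[t] is None and track[t - 1] is not None and track[t + 1] is not None:
            filled[t] = tuple(
                (a + b) / 2 for a, b in zip(track[t - 1], track[t + 1])
            )
    return filled

track = [(10, 20, 50, 100), None, (14, 22, 50, 100)]  # made-up track
print(fill_gaps(track))  # the middle frame gets an interpolated box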
Hope it helps, cheers!
To create a good model you will of course need as many images as possible. But you have to watch whether your model starts to overfit, i.e. it is not learning anymore, the average loss starts rising and the mAP drops. When overfitting occurs, stop the training and choose the best weights saved in the darknet/backup/ folder.
For YOLO, there are some guidelines you can follow about when to stop training. The most obvious is:
During training you will see varying indicators of error, and you should stop when the 0.XXXXXXX avg value no longer decreases:
Region Avg IOU: 0.798363, Class: 0.893232, Obj: 0.700808, No Obj: 0.004567, Avg Recall: 1.000000, count: 8
Region Avg IOU: 0.800677, Class: 0.892181, Obj: 0.701590, No Obj: 0.004574, Avg Recall: 1.000000, count: 8
9002: 0.211667, 0.060730 avg, 0.001000 rate, 3.868000 seconds, 576128 images
Loaded: 0.000000 seconds
9002 - iteration number (batch number)
0.060730 avg - average loss (error); the lower, the better
When you see that the average loss 0.xxxxxx avg no longer decreases over many iterations, you should stop training. The final average loss can range from 0.05 (for a small model and an easy dataset) to 3.0 (for a big model and a difficult dataset). I personally think a model with an avg loss of 0.06 is good enough.
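A small sketch of how you could monitor this automatically by parsing training log lines like the ones above (the log file path and plateau window are placeholders):

# Parse darknet log lines such as "9002: 0.211667, 0.060730 avg, ..." and
# check whether the average loss has stopped decreasing.
import re

def avg_losses(log_path):
    pattern = re.compile(r"^\s*(\d+):\s*[\d.]+,\s*([\d.]+) avg")
    losses = []
    with open(log_path) as f:
        for line in f:
            m = pattern.match(line)
            if m:
                losses.append(float(m.group(2)))
    return losses

def has_plateaued(losses, window=1000, tol=1e-3):
    # True if the best avg loss barely improved over the last `window` iterations.
    if len(losses) < 2 * window:
        return False
    return min(losses[-2 * window:-window]) - min(losses[-window:]) < tol

losses = avg_losses("training.log")  # placeholder path
print("stop training?", has_plateaued(losses))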
AlexeyAB explains everything in detail on his GitHub repo; please read this section: https://github.com/AlexeyAB/darknet#when-should-i-stop-training