Deep learning model test accuracy unstable - deep-learning

I am trying to train and test a PyTorch GCN model that is supposed to identify a person. But the test accuracy is quite jumpy: it gives 49% at epoch 23, then drops to around 45% at epoch 41. So it is not increasing all the time, even though the loss seems to decrease at every epoch.
My question is not about implementation errors; rather, I want to know why this happens. I don't think there is anything wrong with my code, as I have seen SOTA architectures show this type of behavior as well. The authors just picked the best result and published it, saying that their model gives that result.
Is it normal for the accuracy to be jumpy (up and down), and should I simply take the best-ever weights that produce it?

Accuracy is naturally more "jumpy", as you put it. In terms of accuracy, you have a discrete outcome for each sample: you either get it right or wrong. This makes the result fluctuate, especially if you have a relatively low number of samples (since you then have a higher sampling variance).
On the other hand, the loss function should vary more smoothly. It is based on the class probabilities computed at your softmax layer, which vary continuously. With a small enough learning rate, the loss should decrease roughly monotonically; any bumps you see are due to the optimization algorithm taking discrete steps, under the assumption that the loss function is roughly linear in the vicinity of the current point.
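To make the distinction concrete, here is a minimal, self-contained sketch (plain NumPy with made-up numbers, not the asker's GCN) showing that nudging a predicted probability changes the cross-entropy loss smoothly, while accuracy only jumps when a prediction crosses the 0.5 decision threshold:

```python
import numpy as np

# Hypothetical binary predictions for 5 samples (probability of the positive class)
# and their ground-truth labels - purely illustrative.
labels = np.array([1, 0, 1, 1, 0])

def metrics(probs):
    acc = np.mean((probs > 0.5).astype(int) == labels)   # discrete: moves in steps of 1/N
    ce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))  # continuous
    return acc, ce

probs = np.array([0.9, 0.4, 0.52, 0.7, 0.2])
print(metrics(probs))        # accuracy 1.0, loss ~0.3

# A tiny nudge barely moves the loss but flips the accuracy by a whole 1/N step.
probs[2] = 0.48              # crosses the 0.5 threshold
print(metrics(probs))        # accuracy drops to 0.8, loss changes only slightly
```

With few test samples, each such threshold crossing moves the accuracy by a visible amount, which is exactly the jumpiness described above.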

Related

Regression problem getting much better results when dividing values by 100

I'm working on a regression problem in PyTorch. My target values can be either between 0 and 100 or between 0 and 1 (they represent a percentage, or the percentage divided by 100).
The data is unbalanced: I have much more data with lower targets.
I've noticed that when I run the model with targets in the range 0-100, it doesn't learn - the validation loss doesn't improve, and the loss on the 25% largest targets is very big, much bigger than the standard deviation within that group.
However, when I run the model with targets in the range 0-1, it does learn and I get good results.
If anyone can explain why this happens, and whether using the 0-1 range is "cheating", that would be great.
Also - should I scale the targets? (either if I use the larger or the smaller range).
Some additional info: I'm trying to fine-tune BERT for a specific task, and I use MSE loss.
Thanks!
I think your observation relates to batch normalization. There is a paper written on the subject, and numerous Medium/Towards Data Science posts, which I will not list here. The idea is that if you have no non-linearities in your model and loss function, scaling doesn't matter. But even with MSE you do have a non-linearity, which makes it sensitive to the scaling of both target and source data. You can experiment with inserting batch normalization layers into your model, after dense or convolutional layers. In my experience it often improves accuracy.
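As a complement to the batch-norm suggestion, here is a minimal sketch of the other common fix: scaling the targets into a small range before the MSE loss and undoing the scale at prediction time. The factor of 100 matches the percentage targets from the question; the model and function names are hypothetical placeholders, not the asker's actual BERT code:

```python
import torch
import torch.nn as nn

TARGET_SCALE = 100.0            # targets originally in [0, 100]
criterion = nn.MSELoss()

def training_step(model, inputs, targets):
    preds = model(inputs).squeeze(-1)                 # model emits one raw value per sample
    return criterion(preds, targets / TARGET_SCALE)   # train in the 0-1 range

def predict(model, inputs):
    with torch.no_grad():
        return model(inputs).squeeze(-1) * TARGET_SCALE   # report in the original 0-100 range
```

As long as the same factor is applied consistently and predictions are reported back in the original units, rescaling the targets is standard practice rather than cheating.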

At what point will adding new data to a training set no longer improve training accuracy

This is more of a general question about training a CNN, but the one I'm using is YOLO.
I've started my training set for 'person' detections by labelling some data from different cameras' videos (in similar environments).... Every time I added new data for a new camera, I retrained YOLO, which actually improved the detection for that camera. For training, I split my data randomly into training/validation sets. I use the validation set to compute accuracy. This is not overfitting, as all the previous data is also used in training.
Now I've gathered more than 100,000 labelled samples. I was expecting not to have to train anymore at this point, as my data set is pretty big. But it looks like I still need to: if I get a new camera video, label 500-1000 samples, add them to my huge data set and train again, the accuracy improves for this camera.
I don't really understand why. Why do I still need to add new data to my set? Why does the accuracy improve a lot on the new data, while it is 'drowned' in the thousands of already existing samples? Is there a point where I will be able to stop training because adding new data will not improve the accuracy?
Thanks for sharing your thoughts and ideas!
Interesting question. If your data quality is good and the training procedure is 'perfect', you will always be able to generalize better. Think about all the possible, infinitely many different images that you will want to detect on. You are only using a sample of those, hoping that it is enough to generalize. You can keep increasing your dataset and might gain another 0.01%; the question is when you want to stop. Your model's accuracy will never be 100%.
My opinion: if you have a nice accuracy above 95%, stop generating more data if your project is personal and no one's life depends on it. Think about post-processing to improve the results. Since you are detecting on video, maybe try to follow the person's movement: if the person is not detected in one frame but you have information from the previous and following frames, you might be able to do something fancy.
Hope it helps, cheers!
To create a good model you will of course need as many images as possible. But you have to pay attention to whether your model becomes overfit, meaning it is not learning anymore, the average loss is getting higher and the mAP is getting lower; when overfitting occurs you have to stop the training and choose the best weights saved in the darknet/backup/ folder.
For YOLO, there are some guidelines that you can follow about when you should stop training. The most obvious is:
During training, you will see varying indicators of error, and you should stop when the 0.XXXXXXX avg value no longer decreases:
Region Avg IOU: 0.798363, Class: 0.893232, Obj: 0.700808, No Obj: 0.004567, Avg Recall: 1.000000, count: 8
Region Avg IOU: 0.800677, Class: 0.892181, Obj: 0.701590, No Obj: 0.004574, Avg Recall: 1.000000, count: 8
9002: 0.211667, 0.060730 avg, 0.001000 rate, 3.868000 seconds, 576128 images
Loaded: 0.000000 seconds
9002 - iteration number (number of batch)
0.060730 avg - average loss (error) - the lower, the better
When you see that the average loss (0.xxxxxx avg) no longer decreases over many iterations, you should stop training. The final average loss can range from 0.05 (for a small model and an easy dataset) to 3.0 (for a big model and a difficult dataset). I personally think that a model with an average loss of 0.06 is good enough.
AlexeyAB explains everything in detail in his GitHub repo; please read this section: https://github.com/AlexeyAB/darknet#when-should-i-stop-training
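As a rough illustration of that stopping rule (this is not part of darknet itself), here is a hypothetical Python helper that decides to stop once the average loss has not improved over a window of recent iterations; the window size and tolerance are arbitrary choices, and parsing the "0.xxxxxx avg" values out of the training log is left to you:

```python
def should_stop(avg_losses, window=1000, tolerance=1e-3):
    """Return True when the trailing `window` of average-loss values
    has stopped improving by more than `tolerance`."""
    if len(avg_losses) < 2 * window:
        return False                                     # not enough history yet
    recent = min(avg_losses[-window:])                   # best loss in the latest window
    earlier = min(avg_losses[-2 * window:-window])       # best loss in the window before
    return earlier - recent < tolerance                  # no meaningful improvement -> stop

# Example: avg loss flattens out around 0.06 for the last 2000 iterations -> stop.
history = [1.0 / (i + 1) + 0.06 for i in range(8000)] + [0.06] * 2000
print(should_stop(history))   # True
```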

RNN L2 Regularization stops learning

I use a bidirectional RNN to detect an event with unbalanced occurrence. The positive class is 100 times less frequent than the negative class.
With no regularization I can get 100% accuracy on the train set and 30% on the validation set.
When I turn on L2 regularization, I get only 30% accuracy on the train set too, instead of the longer learning and 100% validation accuracy I was hoping for.
I was thinking that maybe my data is too small, so just as an experiment I merged the train set with the test set, which I had not used before. The situation was the same as if I had used L2 regularization, which I did not this time: I get 30% accuracy on train+test and on validation.
I use 128 hidden units and 80 timesteps in the mentioned experiments.
When I increased the number of hidden units to 256, I could again overfit the train+test set to get 100% accuracy, but still only 30% on the validation set.
I tried many options for hyperparameters, with almost no result. Maybe the weighted cross entropy is causing the problem; in the given experiments the weight on the positive class is 5. With larger weights the results are often worse, around 20% accuracy.
I tried LSTM and GRU cells; no difference.
The best result I got: with 2 hidden layers of 256 hidden units, which took around 3 days of computation and 8 GB of GPU memory, I got around 40-50% accuracy before it started overfitting again, with L2 regularization on but not so strong.
Is there some general guideline on what to do in this situation? I was not able to find anything.
Too many hidden units can overfit your model. You can try a smaller number of hidden units. As you mentioned, training with more data might improve the performance. If you don't have enough data, you can generate some artificial data: researchers add distortions to their training data to increase its size, but in a controlled way. This type of strategy works well for image data; if you are dealing with text data, you can probably use a knowledge base to improve the performance.
There is a lot of ongoing work on using knowledge bases to solve NLP and deep-learning-related tasks.
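To make the "fewer hidden units" suggestion concrete, here is a minimal PyTorch sketch of a bidirectional LSTM with a reduced hidden size, dropout, weight decay (PyTorch's built-in L2 penalty), and a pos_weight for the rare positive class. The sizes, dropout rate, and weight-decay value are arbitrary placeholders, not the asker's actual setup; pos_weight=5 just mirrors the class weight mentioned in the question:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, input_size=32, hidden_size=64, dropout=0.3):  # 64 units instead of 128/256
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(dropout)
        self.head = nn.Linear(2 * hidden_size, 1)     # single logit for the rare event

    def forward(self, x):                             # x: (batch, timesteps, features)
        out, _ = self.rnn(x)
        return self.head(self.drop(out[:, -1, :])).squeeze(-1)

model = BiLSTMClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(5.0))   # weight on the positive class

x = torch.randn(16, 80, 32)                # 80 timesteps, as in the question
y = torch.randint(0, 2, (16,)).float()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```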

Difference between training and testing phase in caffe

This might seem like a silly question, but I am trying to understand to what extent the testing phase in Caffe matters for good results. Of course the training phase is important, but is the testing phase simply there to periodically measure the loss on a set that is not trained on? If that is the case, does the size of my test set really matter? Does testing matter at all? I ask because I currently have some serious overfitting problems. If I have a large dataset (>50,000 images), how should I go about splitting it between test and train?
Caffe never uses the results on the test set during training to modify parameters or fix issues like overfitting.
The purpose of a validation set (the test set used during training) is for us to see whether the model overfits the data, by looking at the accuracy or loss values, plotting them, or inspecting the outputs.
For example, if the loss on the training set keeps decreasing at every iteration while the loss on the test set keeps increasing, that is a solid case of the model overfitting the training set. For such conclusions to hold, the images selected for the test set shouldn't be the same as those in the training set. It's ideal to keep a 1:10 ratio of test to training images. If the test set were a subset of the training set, the test loss would also decrease and we might not detect the overfitting behaviour of the model.
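Here is a minimal sketch of such a disjoint split at roughly a 1:10 test-to-train ratio. The file list is hypothetical and the code is plain Python, not Caffe-specific; in practice you would list your own labelled images:

```python
import random

# Hypothetical list of labelled image paths; in practice, glob your dataset directory.
all_images = [f"img_{i:05d}.jpg" for i in range(50_000)]

random.seed(0)                      # reproducible split
random.shuffle(all_images)

n_test = len(all_images) // 11      # ~1:10 test-to-train ratio
test_set = all_images[:n_test]
train_set = all_images[n_test:]

# The two sets are disjoint by construction, so the test loss can reveal overfitting.
assert not set(test_set) & set(train_set)
print(len(train_set), len(test_set))   # ~45455 / ~4545
```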

Benchmarking: When can I stop making measurements?

I have a series of functions that are all designed to do the same thing. The same inputs produce the same outputs, but the time that it takes to do them varies by function. I want to determine which one is 'fastest', and I want to have some confidence that my measurement is 'statistically significant'.
Perusing Wikipedia and the interwebs tells me that statistical significance means that a measurement or group of measurements differs from a null hypothesis by more than some p-value threshold. How would that apply here? What would the null hypothesis be when testing whether function A is faster than function B?
Once I've got that whole setup defined, how do I figure out when to stop measuring? I'll typically see that a benchmark is run three times, and then the average is reported; why three times and not five or seven? According to this page on Statistical Significance (which I freely admit I do not understand fully), Fisher used 8 as the number of samples that he needed to measure something with 98% confidence; why 8?
I would not bother applying statistics principles to benchmarking results. In general, the term "statistical significance" refers to the likelihood that your results were achieved accidentally, and do not represent an accurate assessment of the true values. In statistics, as a result of simple probability, the likelihood of a result being achieved by chance decreases as the number of measurements increases. In the benchmarking of computer code, it is a trivial matter to increase the number of trials (the "n" in statistics) so that the likelihood of an accidental result is below any arbitrary threshold you care to define (the "alpha" or level of statistical significance).
To simplify: benchmark by running your code a huge number of times, and don't worry about statistical measurements.
Note to potential down-voters of this answer: this answer is somewhat of a simplification of the matter, designed to illustrate the concepts in an accessible way. Comments like "you clearly don't understand statistics" will result in a savage beat-down. Remember to be polite.
You are asking two questions:
How do you perform a test of statistical significance that the mean time of function A is greater than the mean time of function B?
If you want a certain confidence in your answer, how many samples should you take?
The most common answer to the first question is that you either want to compute a confidence interval or perform a t-test. It's not different than any other scientific experiment with random variation. To compute the 95% confidence interval of the mean response time for function A simply take the mean and add 1.96 times the standard error to either side. The standard error is the square root of the variance divided by N. That is,
95% CI = mean ± 1.96 * sqrt(sigma^2 / N)
where sigma^2 is the variance of the speed of function A and N is the number of runs you used to calculate the mean and variance.
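A quick sketch of that computation in Python (the timing numbers are made up; 1.96 is the usual normal-approximation quantile):

```python
import math
import statistics

# Hypothetical per-run timings (seconds) for function A.
times_a = [0.101, 0.098, 0.105, 0.099, 0.103, 0.097, 0.102, 0.100]

n = len(times_a)
mean = statistics.mean(times_a)
variance = statistics.variance(times_a)        # sample variance, sigma^2
std_err = math.sqrt(variance / n)              # standard error of the mean

ci_low, ci_high = mean - 1.96 * std_err, mean + 1.96 * std_err
print(f"mean = {mean:.4f} s, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```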
Your second question relates to statistical power analysis and the design of experiments. You describe a sequential setup where you are asking whether to continue sampling. The design of sequential experiments is actually a very tricky problem in statistics, since in general you are not allowed to calculate confidence intervals or p-values and then draw additional samples conditional on not reaching your desired significance. If you wish to do this, it would be wiser to set up a Bayesian model and calculate your posterior probability that speed A is greater than speed B. This, however, is massive overkill.
In a computing environment it is generally pretty trivial to achieve a very small confidence interval both because drawing large N is easy and because the variance is generally small -- one function obviously wins.
Given that Wikipedia and most online sources are still horrible when it comes to statistics, I recommend buying Introductory Statistics with R. You will learn both the statistics and the tools to apply what you learn.
The research you cite sounds more like a highly controlled environment. This is purely a practical answer that has proven itself time and again to be effective for performance testing.
If you are benchmarking code in a modern, multi-tasking, multi-core, computing environment, the number of iterations required to achieve a useful benchmark goes up as the length of time of the operation to be measured goes down.
So, if you have an operation that takes ~5 seconds, you'll want, typically, 10 to 20 iterations. As long as the deviation across the iterations remains fairly constant, then your data is sound enough to draw conclusions. You'll often want to throw out the first iteration or two because the system is typically warming up caches, etc...
If you are testing something in the millisecond range, you'll want 10s of thousands of iterations. This will eliminate noise caused by other processes, etc, firing up.
Once you hit the sub-millisecond range -- 10s of nanoseconds -- you'll want millions of iterations.
Not exactly scientific, but neither is testing "in the real world" on a modern computing system.
When comparing the results, consider the difference in execution speed as percentage, not absolute. Anything less than about 5% difference is pretty close to noise.
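Here is a minimal sketch of that style of benchmark: discard warm-up runs, average over many iterations, and compare the difference as a percentage. The functions, iteration counts, and 5% threshold are placeholder choices, not a prescription:

```python
import time

def benchmark(fn, iterations=100_000, warmup=1_000):
    for _ in range(warmup):          # let caches etc. warm up; these runs are thrown away
        fn()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations   # mean seconds per call

# Hypothetical functions under test.
def func_a(): sum(range(100))
def func_b(): sum(i for i in range(100))

t_a, t_b = benchmark(func_a), benchmark(func_b)
diff_pct = abs(t_a - t_b) / min(t_a, t_b) * 100
print(f"A: {t_a*1e6:.2f} us, B: {t_b*1e6:.2f} us, difference: {diff_pct:.1f}%")
if diff_pct < 5:
    print("Within ~5%: treat it as noise.")
```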
Do you really care about statistical significance or plain old significance? Ultimately you're likely to have to form a judgement about readability vs performance - and statistical significance isn't really going to help you there.
A couple of rules of thumb I use:
Where possible, test for enough time to make you confident that little blips (like something else interrupting your test for a short time) won't make much difference. Usually I reckon 30 seconds is enough for this, although it depends on your app. The longer you test for, the more reliable the test will be - but obviously your results will be delayed :)
Running a test multiple times can be useful, but if you're timing for long enough then it's not as important IMO. It would alleviate other forms of error which made a whole test take longer than it should. If a test result looks suspicious, certainly run it again. If you see significantly different results for different runs, run it several more times and try to spot a pattern.
The fundamental question you're trying to answer is: how likely is it that what you observe could have happened by chance? Is this coin fair? Throw it once: HEADS. No, it's not fair, it always comes down heads. Bad conclusion! Throw it 10 times and get 7 heads; now what do you conclude? 1000 times and 700 heads?
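To put numbers on the coin example, here is a small sketch of a two-sided binomial test done by hand (standard probability, no statistics library); it reproduces the 10-flip and 1000-flip cases above:

```python
from math import comb

def binomial_p_value(heads, flips, p=0.5):
    """Two-sided p-value: probability of a count at least as far from flips*p
    as the observed one, assuming the coin is fair."""
    expected = flips * p
    def prob(k): return comb(flips, k) * p**k * (1 - p)**(flips - k)
    return sum(prob(k) for k in range(flips + 1)
               if abs(k - expected) >= abs(heads - expected))

print(binomial_p_value(7, 10))      # ~0.34: 7/10 heads is quite compatible with a fair coin
print(binomial_p_value(700, 1000))  # astronomically small: the coin is almost surely biased
```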
For simple cases we can imagine how to figure out when to stop testing. But you have a slightly different situation - are you really doing a statistical analysis?
How much control do you have over your tests? Does repeating them add any value? Your computer is deterministic (maybe). Einstein's definition of insanity is to repeat something and expect a different outcome. So when you run your tests, do you get repeatable answers? I'm not sure that statistical analyses help if you are doing good enough tests.
For what you're doing I would say that the first key thing is to make sure that you really are measuring what you think you are measuring. Run every test for long enough that any startup or shutdown effects are hidden. Useful performance tests tend to run for quite extended periods for that reason. Make sure that you are not actually measuring the time spent in your test harness rather than the time spent in your code.
You have two primary variables: how many iterations of your method to run in one test? How many tests to run?
Wikipedia says this:
In addition to expressing the variability of a population, standard deviation is commonly used to measure confidence in statistical conclusions. For example, the margin of error in polling data is determined by calculating the expected standard deviation in the results if the same poll were to be conducted multiple times. The reported margin of error is typically about twice the standard deviation.
Hence if your objective is to be sure that one function is faster than another you could run a number of tests of each, compute the means and standard deviations. My expectation is that if your number of iterations within any one test is high then the standard deviation is going to be low.
If we accept that definition of margin of error, you can then see whether the two means are further apart than the sum of their margins of error.
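A small sketch of that final comparison, using made-up per-test mean timings and taking the margin of error as twice the standard deviation, per the quoted rule of thumb:

```python
import statistics

# Hypothetical mean timings (seconds) from several independent tests of each function.
tests_a = [0.104, 0.101, 0.103, 0.102, 0.105]
tests_b = [0.093, 0.095, 0.092, 0.094, 0.096]

mean_a, moe_a = statistics.mean(tests_a), 2 * statistics.stdev(tests_a)
mean_b, moe_b = statistics.mean(tests_b), 2 * statistics.stdev(tests_b)

if abs(mean_a - mean_b) > moe_a + moe_b:
    print("Difference exceeds the combined margins of error: one function really is faster.")
else:
    print("The means overlap within their margins of error: call it a tie.")
```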