YOLOv5 decreasing inference speed - yolov5

I am using the YOLOv5x model on my custom dataset. Inference time starts at 0.055 s, then gradually increases up to 2 seconds. The same thing happens during validation: iterations start at 6 seconds and climb to as much as 34 seconds.
This performance drop happens in every training setting, so I don't think it is about the dataset. I can train without the performance drop on an SSH server.
My GPU is an RTX 2070, with 16 GB of RAM and an i7-9750H CPU.
edit:
If I split the images into small parts and wait between inferences, I get optimal performance. Also, if I run detect on the same part without waiting, I get worse inference times for the same images.

It was thermal throttling. Cleaning the laptop and applying new thermal paste solved the problem. You can also see the original answer on the GitHub issue page.
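A quick way to confirm a suspicion like this is to log GPU temperature alongside inference time and watch whether the slowdown tracks the temperature curve. A minimal sketch using `nvidia-smi` (the query flags are standard; the helper names are mine):

```python
import subprocess

def parse_temps(raw: str) -> list[int]:
    """Parse the output of `nvidia-smi --query-gpu=temperature.gpu
    --format=csv,noheader`: one integer (degrees C) per GPU, one per line."""
    return [int(line.strip()) for line in raw.splitlines() if line.strip()]

def read_gpu_temps() -> list[int]:
    """Take one temperature reading per installed GPU."""
    raw = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_temps(raw)
```

If the temperature climbs toward the throttle point (typically in the high 80s C on a laptop) as inference time degrades, thermal throttling is the likely cause.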

Related

Why does a loaded PyTorch model's loss increase sharply?

I'm trying to train ArcFace following a reference implementation.
As far as I know, Arcface requires more than 200 training epochs on CASIA-webface with a large batch size.
Within 100 epochs of training, I stopped the training for a while because I needed the GPU for other tasks, and saved checkpoints of the model (ResNet) and the margin. Before it was stopped, the loss was between 0.3 and 1.0, and training accuracy had reached 80-95%.
When I resume the ArcFace training by loading the checkpoint files with load_state_dict, the first batch looks normal, but then the loss increases sharply and the accuracy becomes very low.
How did this happen? I had no other option, so I continued the training anyway, but the loss does not seem to be decreasing well, even though the model was already trained for over 100 epochs...
When I searched for similar issues, the explanation given was that the optimizer state was not saved (the reference GitHub page didn't save the optimizer, so neither did I). Is that true?
My losses after loading
If you look at this line, you are decaying the learning rate of each parameter group by gamma. This had altered your learning rate by the time you reached the 100th epoch. Moreover, you did not save your optimizer state when saving your model. This made your code restart with the initial lr, i.e. 0.1, after resuming training, and that is what spiked your loss again.
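The learning-rate jump can be reproduced with simple arithmetic. A StepLR-style schedule multiplies the base lr by gamma once per step_size epochs; resuming without the scheduler (and optimizer) state means the schedule restarts from epoch 0. The hyperparameters below are illustrative, not taken from the question; in PyTorch the fix is to also checkpoint `optimizer.state_dict()` and `scheduler.state_dict()` next to the model weights:

```python
def steplr(base_lr: float, gamma: float, step_size: int, epoch: int) -> float:
    """Learning rate produced by a StepLR-style schedule at a given epoch."""
    return base_lr * gamma ** (epoch // step_size)

# Illustrative numbers: base lr 0.1, decayed 10x every 30 epochs.
lr_at_epoch_100 = steplr(0.1, 0.1, 30, 100)        # the schedule expects 1e-4 here
lr_without_scheduler_state = steplr(0.1, 0.1, 30, 0)  # fresh schedule: back to 0.1

# In real code the checkpoint would look roughly like:
# torch.save({"model": model.state_dict(),
#             "optimizer": optimizer.state_dict(),
#             "scheduler": scheduler.state_dict(),
#             "epoch": epoch}, path)
```

A 1000x jump in effective learning rate is more than enough to throw a nearly converged model out of its minimum, which matches the sudden loss spike described.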

Schedule jobs between GPUs for PyTorch Models

I'm trying to build up a system that trains deep models on requests. A user comes to my web site, clicks a button and a training process starts.
However, I have two GPUs and I'm not sure of the best way to queue/handle jobs between them: start a job when at least one GPU is available, and queue the job if no GPU is currently available. I'd like to use one GPU per job request.
Is this something I can do in combination with Celery? I've used this in the past but I'm not sure how to handle this GPU related problem.
Thanks a lot!
I'm not sure about Celery as I've never used it, but conceptually here is what seems reasonable (the question is quite open-ended anyway):
create thread(s) responsible solely for distributing tasks to the GPUs and receiving requests
if any GPU is free, assign the task to it immediately
if both are occupied, estimate the time it will probably take to finish the task (neural network training)
add it to the GPU with the smallest estimated remaining time
Time estimation
The ETA of the current task can be approximated quite well given a fixed number of samples and epochs. If that's not the case (e.g. early stopping), it will be much harder and would need some heuristic.
When GPUs are overloaded (say each has 5 tasks in queue), what I would do is:
Stop the process currently running on the GPU
Run the new process for a few batches of data to make a rough estimate of how long it might take to finish this task
Add it to the estimates of all queued tasks
Now, this depends on the traffic. If it's heavy and would interrupt the on-going process too often, you should simply add new tasks to the GPU queue that has the fewest tasks (some heuristic would be needed here as well; you should have an estimate of the possible number of requests by now, and with only 2 GPUs it probably cannot be huge).
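The dispatch rule above (free GPU first, otherwise the smallest estimated backlog) can be sketched as a small pure function; the data shapes and names here are my own, not part of any library:

```python
def pick_gpu(queues: dict[int, list[float]]) -> int:
    """Choose a GPU for a new job.

    `queues` maps GPU id -> list of estimated runtimes (seconds) of the
    jobs already queued on that GPU. A free GPU (empty queue) wins;
    otherwise pick the GPU with the smallest total estimated backlog.
    """
    free = [gpu for gpu, q in queues.items() if not q]
    if free:
        return free[0]
    return min(queues, key=lambda gpu: sum(queues[gpu]))
```

As for Celery: one common pattern (an assumption on my part, not something the answer prescribes) is to run one worker process per GPU, each pinned to its card via the `CUDA_VISIBLE_DEVICES` environment variable, and let the broker's per-worker queues do the job queueing for you.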

Using nvidia-smi, what is the best strategy to capture power?

I am using a Tesla K20c and measuring power with nvidia-smi while my application runs. My problem is that power consumption does not reach a steady state but keeps rising. For example, if my application runs for 100 iterations, power reaches 106 W (in 4 seconds); for 1000 iterations, 117 W (in 41 seconds); for 10000 iterations, 122 W (in 415 seconds); and so on, increasing slightly every time. I am looking for a recommendation on which power value I should record. My experimental setup has over 400 experiments, and running each one for 10000 iterations is not feasible, at least for now. The application is matrix multiplication, which is doable in just one iteration taking only a few milliseconds. Increasing the number of iterations does not add value to the results, but it increases the run time, allowing power monitoring.
The reason you are seeing power consumption increase over time is that the GPU is heating up under a sustained load. Electronic components draw more power at increased temperature mostly due to an increase in Ohmic resistance. In addition, the Tesla K20c is an actively cooled GPU: as the GPU heats up, the fan on the card spins faster and therefore requires more power.
I have run experiments on a K20c that were very similar to yours, out to about 10 minutes. I found that the power draw plateaus after 5 to 6 minutes, and that there are only noise-level oscillations of +/-2 W after that. These may be due to hysteresis in the fan's temperature-controlled feedback loop, or due to short-term fluctuations from incomplete utilization of the GPU at the end of every kernel. Differences in power draw due to fan speed were about 5 W. The reason it takes fairly long for the GPU to reach steady state is the heat capacity of the entire assembly, which has quite a bit of mass, including a solid metal back plate.
Your measurements seem to be directed at determining the relative power consumption when running with 400 different variants of the code. It does not seem critical that steady-state power consumption is achieved, just that the conditions under which each variant is tested are as equal as is practically achievable. Keep in mind that the GPU's power sensors are not designed to provide high-precision measurements, so for comparison purposes you would want to assume a noise level on the order of 5%. For an accurate comparison you may even want to average measurements from more than one GPU of the same type, as manufacturing tolerances could cause variations in power draw between multiple "identical" GPUs.
I would therefore suggest the following protocol: Run each variant for 30 seconds, measuring power consumption close to the end of that interval. Then let the GPU idle for 30 seconds to let it cool down before running the next kernel. This should give roughly equal starting conditions for each variant. You may need to lengthen the proposed idle time a bit if you find that the temperature stays elevated for a longer time. The temperature data reported by nvidia-smi can guide you here. With this process you should be able to complete the testing of 400 variants in an overnight run.
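The suggested protocol is easy to script around `nvidia-smi` (the query flags below are standard; the loop structure and helper names are my own sketch, not part of the answer):

```python
import subprocess

RUN_S, IDLE_S = 30, 30  # protocol from the answer: 30 s under load, 30 s cool-down

def read_power_w(gpu: int = 0) -> float:
    """Instantaneous board power draw in watts, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu),
         "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())

def protocol_hours(n_variants: int, run_s: int = RUN_S, idle_s: int = IDLE_S) -> float:
    """Total wall-clock time of the run/idle protocol over all variants."""
    return n_variants * (run_s + idle_s) / 3600.0
```

With 400 variants at 30 s load plus 30 s idle each, `protocol_hours(400)` comes to about 6.7 hours, which is consistent with the answer's "overnight run" estimate.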

right way to report CUDA speedup

I would like to compare the performance of a serial program running on a CPU and a CUDA program running on a GPU. But I'm not sure how to compare the performance fairly. For example, if I compare the performance of an old CPU with a new GPU, then I will have immense speedup.
Another question: How can I compare my CUDA program with another CUDA program reported in a paper (both run on different GPUs and I cannot access the source code).
For fairness, you should include the data transfer times to get the data into and out of the GPU. It's not hard to write a blazing fast CUDA function. The real trick is in figuring out how to keep it fed, or how to hide the cost of data transfer by overlapping it with other necessary work. Unless your routine is 100% compute-bound, including data transfer in your units-of-work-done-per-unit-of-time is critical to understanding how your implementation would handle, say, a lot more units of work.
For cross-device comparisons, it might be useful to report units of work performed per unit of time per processor core. The per processor core will help normalize large differences between, say, a 200 core and a 2000 core CUDA device.
If you're talking about your algorithm (not just output), it is useful to describe how you broke the problem down for parallel execution - your block/thread distribution, for example.
Make sure you are not measuring performance on a debug build, or running in a debugger. Debugging adds overhead.
Make sure that your work sample is large enough that it is significantly above the "noise floor". A test run that takes a few seconds to complete will be measuring more of your function and less of the ambient noise of the environment than a test run that completes in milliseconds. You can always divide the units of work by the test execution time to arrive at a sexy "units per nanosecond" figure, but you don't actually measure it that way.
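The two metrics recommended above (count transfer time, normalize per core) combine into one simple formula; this is my own sketch of the bookkeeping, with made-up example numbers:

```python
def throughput_per_core(units: float, h2d_s: float, kernel_s: float,
                        d2h_s: float, cores: int) -> float:
    """Units of work per second per core, counting host-to-device and
    device-to-host transfer time along with the kernel time, as the
    answer recommends for a fair comparison."""
    return units / (h2d_s + kernel_s + d2h_s) / cores

# Hypothetical run: 1e6 work units, 0.25 s upload, 0.5 s kernel,
# 0.25 s download, on a 2000-core device.
rate = throughput_per_core(1e6, 0.25, 0.5, 0.25, 2000)
```

Note that leaving out the transfer terms here would double the apparent throughput, which is exactly the kind of unfair comparison the answer warns against.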
The speed of a CUDA program on different GPUs depends on many GPU factors: memory bandwidth, core clock speed, number of cores, and the number of threads/registers/amount of shared memory available. So it is difficult to compare performance across different GPUs.

Estimating increase in speed when changing NVIDIA GPU model

I am currently developing a CUDA application that will most certainly be deployed on a GPU much better than mine. Given another GPU model, how can I estimate how much faster my algorithm will run on it?
You're going to have a difficult time, for a number of reasons:
Clock rate and memory speed only have a weak relationship to code speed, because there is a lot more going on under the hood (e.g., thread context switching) that gets improved/changed for almost all new hardware.
Caches have been added to new hardware (e.g., Fermi) and unless you model cache hit/miss rates, you'll have a tough time predicting how this will affect the speed.
Floating point performance in general is very dependent on model (e.g.: Tesla C2050 has better performance than the "top of the line" GTX-480).
Register usage per device can change for different devices, and this can also affect performance; occupancy will be affected in many cases.
Performance can be improved by targeting specific hardware, so even if your algorithm is perfect for your GPU, it could be better if you optimize it for the new hardware.
Now, that said, you can probably make some predictions if you run your app through one of the profilers (such as the NVIDIA Compute Profiler), and you look at your occupancy and your SM utilization. If your GPU has 2 SMs and the one you will eventually run on has 16 SMs, then you will almost certainly see an improvement, but not specifically because of that.
So, unfortunately, it isn't easy to make the type of predictions you want. If you're writing something open source, you could post the code and ask others to test it with newer hardware, but that isn't always an option.
This can be very hard to predict for certain hardware changes and trivial for others. Highlight the differences between the two cards you're considering.
For example, the change could be as trivial as: if I had purchased one of those EVGA water-cooled behemoths, how much better would it perform than a standard GTX 580? This is just an exercise in computing the difference in the limiting clock speed (memory or GPU clock). I've also encountered this question when wondering whether I should overclock my card.
If you're going to a similar architecture, GTX 580 to Tesla C2070, you can make a similar case of differences in clock speeds, but you have to be careful of the single/double precision issue.
If you're doing something much more drastic, say going from a mobile card -- GTX 240M -- to a top of the line card -- Tesla C2070 -- then you may not get any performance improvement at all.
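For the "trivial" same-architecture cases described above, the arithmetic is just a ratio of core count times clock. Treat this as a rough ceiling, not a prediction: as both answers stress, memory bandwidth, occupancy, caches, and precision can dominate. The numbers in the example are illustrative, not real card specs:

```python
def naive_scaling(cores_old: int, clock_old_mhz: float,
                  cores_new: int, clock_new_mhz: float) -> float:
    """Naive compute-throughput ratio from core count and clock alone.
    Only meaningful between similar architectures; the real speedup is
    usually lower, and across architectures it can be wildly off."""
    return (cores_new * clock_new_mhz) / (cores_old * clock_old_mhz)

# Same card, core clock raised 10% (e.g. a mild overclock): at most ~1.1x.
overclock_ceiling = naive_scaling(512, 700, 512, 770)
```

For drastic changes, like the mobile-card-to-Tesla scenario below, this ratio is close to meaningless, which is the whole point of the warning that follows.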
Note: Chris is very correct in his answer, but I wanted to stress this caution because I envision this common work path:
One says to the boss:
So I've heard about this CUDA thing... I think it could make function X much more efficient.
Boss says you can have 0.05% of work time to test out CUDA -- hey we already have this mobile card, use that.
One year later... So CUDA could get us a three fold speedup. Could I buy a better card to test it out? (A GTX 580 only costs $400 -- less than that intern fiasco...)
You spend the $$, buy the card, and your CUDA code runs slower.
Your boss is now upset. You've wasted time and money.
So what happened? Developing on an old card (think 8800, 9800, or even a mobile GTX 2xx with around 30 cores) leads one to optimize and design an algorithm very differently from how you would to efficiently utilize a card with 512 cores. Caveat emptor: you get what you pay for. Those awesome cards are awesome, but your code may not run faster.
Warning issued; so what's the walk-away message? When you get that nicer card, be sure to invest time in tuning, testing, and possibly redesigning your algorithm from the ground up.
OK, so with that said, what's the rule of thumb? GPUs roughly double in speed every six months. So if you're moving from a card that's two years old to a top-of-the-line card, you can claim to your boss that it will run between 4 and 8 times faster (and if you get the full 16-fold improvement, bravo!).