CUDA: performance throttling even though clocks are stable [closed]

I'm doing some benchmarking of a kernel on an RTX 3070 Ti by running it a few thousand times. I've tried to set stable clocks using
nvidia-smi -lgc 1575
nvidia-smi -lmc 9251
Despite this, I find that performance varies randomly by up to 28%. I've used Nsight Systems to record what happens and sometimes I can see a sharp drop after a few thousand iterations (it's fast and stable until a step transition, after which it is slow and stable). However, I can't see any corresponding dip in clock speeds:
I've tried just watching nvidia-smi -q output (updated every 0.05 seconds) to check for either down-clocking or reports of throttling. Temperature stays below 50°C.
I've run nsys with --gpu-metrics-device=0; it shows the graphics clock stable at 1575 MHz.
I've run the same benchmark using Nsight Compute to record details from every 1000th invocation, which shows that the memory clock is also stable.
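For reference, the kind of polling I mean looks roughly like this (a minimal NVML sketch, not my exact script; it reports the SM/memory clocks together with the throttle-reason bitmask, which distinguishes thermal slowdown from power capping):

// poll_throttle.cpp -- build (paths may vary): g++ poll_throttle.cpp -I/usr/local/cuda/include -lnvidia-ml
#include <nvml.h>
#include <cstdio>
#include <unistd.h>

int main() {
    nvmlInit();                                   // error checking omitted for brevity
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);
    for (int i = 0; i < 1000; ++i) {              // ~50 s at 0.05 s per sample
        unsigned int sm = 0, mem = 0;
        unsigned long long reasons = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem);
        nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons);
        std::printf("sm %u MHz  mem %u MHz  reasons 0x%llx%s%s%s\n", sm, mem, reasons,
                    (reasons & nvmlClocksThrottleReasonSwThermalSlowdown) ? " [sw-thermal]" : "",
                    (reasons & nvmlClocksThrottleReasonHwThermalSlowdown) ? " [hw-thermal]" : "",
                    (reasons & nvmlClocksThrottleReasonSwPowerCap)        ? " [power-cap]"  : "");
        usleep(50 * 1000);                        // 0.05 s, matching the nvidia-smi interval above
    }
    nvmlShutdown();
    return 0;
}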
I don't have a rigorous test, but it feels like it might be thermal in the sense that performance is worse if I repeat the test immediately after loading the GPU, whereas if I give it a minute to cool off then performance is better.
Any idea what sort of throttling this might be, and how to prevent or at least measure it?

Related

GPU Programming, CUDA or OpenCL or? [closed]

What is the best way to do GPU programming?
I know:
CUDA is very good, has a lot of developer support and is very nice to debug, but it runs only on NVIDIA hardware.
OpenCL is very flexible: it runs on NVIDIA, AMD and Intel hardware, on accelerators, GPUs and CPUs, but as far as I know it is no longer actively supported by NVIDIA.
Coriander (https://github.com/hughperkins/coriander), which converts CUDA to OpenCL.
HIP (https://github.com/ROCm-Developer-Tools/HIP) is made by AMD so that code can be written once and built for both AMD and NVIDIA (CUDA). It can also convert CUDA code to HIP.
OpenCL would be my preferred way, since I want to be very flexible in hardware support. But if it is no longer supported by NVIDIA, that is a knockout.
HIP then sounds best to me, with separate release files per vendor. But how good will the support for Intel's upcoming hardware be?
Are there any other options?
Important for me are broad hardware support, long-term support (so that the code can still be compiled in a few years), and vendor independence.
Additionally: it should work with more than one compiler and be supported on both Linux and Windows.
Nvidia won't cancel OpenCL support anytime soon.
A newly emerging approach for portable code on GPUs is SYCL. It enables higher-level programming from a single source file that is then compiled twice, once for the CPU and once for the GPU. The GPU part then runs on the GPU via OpenCL, CUDA or some other backend.
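For a flavour of that single-source model, here is a minimal SYCL 2020 vector-add sketch (assuming a conforming compiler such as DPC++ or AdaptiveCpp; an illustration only, not a tuned example):

// The same source is compiled for the host and for the device.
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    const size_t n = 1 << 20;
    sycl::queue q;                                  // picks a default device (GPU if one is available)
    float *a = sycl::malloc_shared<float>(n, q);    // USM: visible to both host and device
    float *b = sycl::malloc_shared<float>(n, q);
    float *c = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];                         // this lambda runs on the device
    }).wait();

    std::printf("c[0] = %f\n", c[0]);
    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    return 0;
}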
As of right now, however, the best-supported GPU framework across platforms is OpenCL 1.2, which is very well established at this point. With that, your code runs on 10-year-old GPUs, on the latest and fastest data-center GPUs, on gaming and workstation GPUs and even on CPUs if you need more memory. On Nvidia GPUs there is no performance/efficiency tradeoff at all compared to CUDA; it runs just as fast.
The porting tools like HIP are great if you already have a large code base, but performance could possibly suffer. My advice is to go for either one framework and stay fully committed to it, rather than using some tool to then generate a possibly poorly optimized port.
If you choose to start with OpenCL, have a look at this OpenCL-Wrapper. The native OpenCL C++ bindings are a bit cumbersome to use, and this lightweight wrapper simplifies learning a lot, while keeping functionality and full performance.
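To give a sense of the boilerplate the wrapper removes, here is roughly what a vector add looks like with the plain C++ bindings and OpenCL 1.2 (a sketch; error handling omitted):

// vecadd.cpp -- build (paths may vary): g++ vecadd.cpp -lOpenCL
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/opencl.hpp>   // header name varies by SDK; older ones ship CL/cl2.hpp
#include <cstdio>
#include <string>
#include <vector>

static const std::string kSource = R"(
__kernel void add(__global const float* a, __global const float* b, __global float* c) {
    const int i = get_global_id(0);
    c[i] = a[i] + b[i];
})";

int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);                       // pick the first platform / first GPU found
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
    cl::Device device = devices[0];
    cl::Context context(device);
    cl::CommandQueue queue(context, device);

    const size_t n = 1 << 20, bytes = n * sizeof(float);
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    cl::Buffer bufA(context, CL_MEM_READ_ONLY, bytes), bufB(context, CL_MEM_READ_ONLY, bytes);
    cl::Buffer bufC(context, CL_MEM_WRITE_ONLY, bytes);

    cl::Program program(context, kSource);
    program.build();                                     // build errors not checked here
    cl::Kernel kernel(program, "add");
    kernel.setArg(0, bufA); kernel.setArg(1, bufB); kernel.setArg(2, bufC);

    queue.enqueueWriteBuffer(bufA, CL_TRUE, 0, bytes, a.data());
    queue.enqueueWriteBuffer(bufB, CL_TRUE, 0, bytes, b.data());
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n), cl::NullRange);
    queue.enqueueReadBuffer(bufC, CL_TRUE, 0, bytes, c.data());
    std::printf("c[0] = %f\n", c[0]);
    return 0;
}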

Why do I have to manually activate my GPUs? [closed]

I installed a new Intel Xeon Phi in a workstation that already has 3 NVIDIA GPUs installed. To make the Phi card work, I have to load Intel's MIC kernel module into my Linux kernel, and with that the Phi card works fine. However, every time we reboot the system we can no longer use the GPUs; the error message is that the system couldn't find the CUDA driver.
However, the only thing I need to do to fix this is to use sudo to run one of the CUDA binaries or some NVIDIA command such as "sudo nvidia-smi". Then everything works fine, both CUDA and the Intel Xeon Phi.
Does anybody know why? Without my sudo command, other people simply cannot use the GPUs, which is rather annoying. How can I fix this?
CUDA requires that certain resource files be established for GPU usage, and this is covered in the Linux getting started guide (step 6 under runfile installation -- note the recommended startup script).
You may also be interested in this article, which focuses on the same subject -- how to automatically establish the resource files at startup.
Once these files are established correctly, an ordinary user (non-root) will be able to use the GPUs without any other intervention.
I have no idea why Xeon Phi installation might have affected this in your particular setup.
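As an aside, the reason the sudo workaround helps is (as far as I know) that any process which initializes the CUDA driver with sufficient privileges will create the missing device files under /dev. A minimal driver-API sketch that could be run once at startup, purely as an illustration (the startup script in the guide remains the documented approach):

// init_gpus.cpp -- build: g++ init_gpus.cpp -I/usr/local/cuda/include -lcuda
#include <cuda.h>
#include <cstdio>

int main() {
    if (cuInit(0) != CUDA_SUCCESS) {   // loads the driver; creates device nodes when run with enough privileges
        std::fprintf(stderr, "cuInit failed -- is the NVIDIA driver loaded?\n");
        return 1;
    }
    int count = 0;
    cuDeviceGetCount(&count);
    std::printf("CUDA driver initialized, %d device(s) visible\n", count);
    return 0;
}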

How to connect NVIDIA CUDA PCI-E graphic card over USB? [closed]

I would like to run some CUDA calculations, but I only have a simple notebook without an NVIDIA GPU.
Is there any USB adapter that allows me to connect an NVIDIA graphics card to my notebook?
It would be great if such a device existed: I would connect my NVIDIA card, plug it into my computer, start the calculation, and disconnect it from the laptop until the calculations are finished.
Unfortunately not.
USB is very, very slow compared to the internal bus to the graphics card in a PC, so the speed of the GPU for calculations would be wasted by the long time it takes to copy the data there and back.
USB is also message-based; it doesn't allow your computer to see the GPU card's memory (or the other way around), so you would effectively need another computer on the GPU end to unwrap things.
There is a new high-speed connector called Thunderbolt, which is (essentially) the PCIe bus inside your computer connected to a socket. This would allow an external device (like a GPU) to act as if it were directly connected to the bus. But it's only on a few expensive models today, and not many devices exist for it (yet).
Amazon do now offer GPUs on their cloud service, but this might be a bit expensive for just learning / playing with.

How to shut off the graphics card's output signal but still keep it linked for CUDA? (it is a GeForce) [closed]

When using CUDA on a PC's graphics card (usually a single card), it is known that Windows or Linux will reset the card if it stops responding for 5 or 2 seconds (depending on the OS version; this mechanism is called Timeout Detection and Recovery, TDR).
MSDN says a graphics card that produces an output signal is subject to TDR, so that the video signal from the card is not interrupted.
If Windows does that, my CUDA program (which takes much longer than 2 or 5 seconds to run on the graphics card) cannot complete.
To avoid this, I enabled the onboard graphics (Biostar HD 880G mainboard) and attached the monitor to the onboard graphics output.
The system now recognizes both graphics adapters (the NVIDIA GTX 460 and the onboard AMD HD 4250), but the 2-second restriction on the GTX 460 is still there. I tried my monitor on both adapters; both give an output signal.
How can I make the discrete graphics card stop giving a video signal (or stop the OS from sending it one), while still keeping it attached to the system?
http://msdn.microsoft.com/zh-cn/library/windows/hardware/ff569918(v=vs.85).aspx
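For reference, whether the watchdog currently applies to a given GPU can be checked from CUDA itself via the kernelExecTimeoutEnabled device property (a minimal sketch; it only reports the state, it does not change the TDR behaviour):

// tdr_check.cpp -- build: nvcc tdr_check.cpp -o tdr_check
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("device %d (%s): run-time limit on kernels: %s\n",
                    i, prop.name, prop.kernelExecTimeoutEnabled ? "yes (watchdog applies)" : "no");
    }
    return 0;
}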

Open source segmented interrupt architecture RTOS? [closed]

A segmented interrupt architecture RTOS can boast "zero interrupt latency" using clever partitioning of work between the interrupt handler and the scheduler. There are at least a couple of proprietary, closed-source instances of this approach, e.g., AVIX and Quasarsoft Q-Kernel.
A related SO question asked about open source RTOS links, but all of the suggested operating systems used unified interrupt architectures.
Is there any open source segmented interrupt architecture RTOS?
I believe this is also sometimes referred to as "deferred interrupt" servicing or handling, so it may be worth using that term to find candidates.
It is perhaps possible to 'fake' it by reserving the highest-priority task levels for ISR servicing. Say you have 32 interrupt vectors: you would reserve priority levels 0 to 31 (assuming zero is high) for the ISR2 levels. Each real interrupt then simply sets an event flag signalling the corresponding ISR2 task. It remains your responsibility in this case not to call blocking functions in the ISR2 tasks, but non-blocking kernel services can be used freely.
I am not sure whether this gives you exactly the same effect (I'd have to study it more fully than I have - or care to right now), but it does mean that you can do minimal work in the true ISR, and a true ISR will always preempt any ISR2.
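As a concrete illustration of that pattern, here is a sketch using FreeRTOS primitives (chosen only because the API is widely known; FreeRTOS itself is a unified interrupt architecture RTOS, and the vector name and device handling below are hypothetical):

/* The real ISR does minimal work and defers to a highest-priority handler task ("ISR2"). */
#include "FreeRTOS.h"
#include "task.h"

static TaskHandle_t xIsr2Task = NULL;

void UART_IRQHandler(void)                        /* hypothetical interrupt vector */
{
    BaseType_t xWoken = pdFALSE;
    /* ...acknowledge/clear the hardware interrupt source here... */
    vTaskNotifyGiveFromISR(xIsr2Task, &xWoken);   /* non-blocking, ISR-safe signal */
    portYIELD_FROM_ISR(xWoken);                   /* switch straight to the handler task if it woke */
}

static void vIsr2Task(void *pvParameters)
{
    (void) pvParameters;
    for (;;) {
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY);  /* block until the real ISR signals us */
        /* ...service the device here; as noted above, keep this work non-blocking... */
    }
}

void vStartIsr2Task(void)
{
    xTaskCreate(vIsr2Task, "isr2_uart", configMINIMAL_STACK_SIZE + 128,
                NULL, configMAX_PRIORITIES - 1, &xIsr2Task);
}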