Code to limit GPU usage - cuda

Is there a command/function/variable that can be set in CUDA code that limits the GPU usage percent? I'd like to modify an open-source project called Flam4CUDA so that that option exists. The way it is now, it uses as much of all GPUs present as possible, with the effect that the temperatures skyrocket (obviously). In an effort to keep temps down over long periods of computing, I'd like to be able to tell the program to use, say, 50% of each GPU (or even have different percentages for different GPUs, or maybe also be able to select which GPU(s) to use). Any ideas?
If you want to see the code, it's available with "svn co https://flam4.svn.sourceforge.net/svnroot/flam4 flam4".

There is no easy way to do what you are asking to do. CPU usage is controlled via time-slicing of context switches, while GPUs do not have such fine-grained context switching. GPUs are cooperatively multitasked. This is why the nvidia-smi tool for workstation- and server-class boards has "exclusive" and "prohibited" modes to control the number of GPU contexts that are allowed on a given board.
Messing with the number of threads/block or blocks in a grid, as has been suggested, will break applications that are passing metadata to the kernel (not easily inferred by your software) that depends on the expected block and grid size.

You can query the GPU properties using CUDA and find the number of multiprocessors and the number of cores per multiprocessor. What you basically need to do is change the block size and grid size of the kernel functions so that you use half of the total number of cores.
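For reference, a minimal sketch of querying those properties through the runtime API (note that the number of cores per multiprocessor is not reported directly; it depends on the architecture):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // multiProcessorCount is the number of SMs; cores per SM depend on
        // the compute capability and are not exposed by the runtime API.
        printf("Device %d: %s, %d multiprocessors, max %d threads per block\n",
               dev, prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock);
    }
    return 0;
}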

Related

Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code?

Assume I have an Nvidia K40, and for some reason I want my code to use only a portion of the CUDA cores (i.e. instead of using all 2880, only use 400 cores, for example). Is this possible? Is it even logical to do this?
In addition, is there any way to see how many cores the GPU is using while my code runs? In other words, can we check during execution how many cores are being used by the code, with a report like Task Manager in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for CUDA. Not to say it couldn't be useful for something. For example, if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of streaming multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think for 99% of cases manual shared memory methods would be better).
The way you could do this would be to access the PTX identifiers %nsmid and %smid and put a conditional at the start of each kernel. You would have to launch only 1 block per streaming multiprocessor (SM) and then have each block return early depending on which kernel you want running on which SMs.
I would warn that this method should be reserved for very experienced CUDA programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a thread block could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware and CUDA version dependent. However, since you asked and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTX registers for the SM index and the number of SMs:
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use them in a CUDA kernel without writing PTX directly:
https://gist.github.com/allanmac/4751080
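Putting those pieces together, here is a rough, untested sketch of the idea (the kernel and all names are hypothetical): blocks read %smid through inline PTX and retire if they land on an SM outside the allowed range, while the surviving blocks pull work from an atomic counter so that no array elements get skipped.

#include <cuda_runtime.h>

// Read the index of the SM this block is running on (PTX special register %smid).
__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void restricted_kernel(float *data, int n,
                                  unsigned int allowed_sms, unsigned int *next_chunk)
{
    // Blocks the scheduler placed on SM indices >= allowed_sms simply exit.
    if (get_smid() >= allowed_sms) return;

    __shared__ unsigned int chunk;
    while (true) {
        if (threadIdx.x == 0) chunk = atomicAdd(next_chunk, 1u);  // grab the next chunk of work
        __syncthreads();
        unsigned int base = chunk * blockDim.x;
        if (base >= (unsigned int)n) break;                       // uniform exit for the whole block
        unsigned int i = base + threadIdx.x;
        if (i < (unsigned int)n) data[i] = 2.0f * data[i];        // placeholder work
        __syncthreads();                                          // done with 'chunk' before it is reused
    }
}

// Launch with one block per SM, with block residency limited to 1 per SM
// (e.g. via thread count or shared memory usage); next_chunk is a device
// counter initialized to zero. Which SM a block lands on is still the
// hardware scheduler's choice, so measure before relying on this.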
Not sure whether it works with the K40, but for newer Ampere GPUs there is the MIG (Multi-Instance GPU) feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know of such methods, but I would like to learn about them.
As to question 2, I suppose sometimes this can be useful. When you have complicated execution graphs, with many kernels, some of which can be executed in parallel, you want to load the GPU as fully and effectively as possible. But it seems that on its own the GPU can occupy all SMs with single blocks of one kernel. I.e. if you have a kernel with a 30-block grid and 30 SMs, this kernel can occupy the entire GPU. I believe I saw such an effect. Really this kernel will be faster (maybe 1.5x versus 4 256-thread blocks per SM), but it will not be effective when you have other work to do.
The GPU can't know whether we are going to run another kernel after this 30-block one or not, i.e. whether it will be more effective to spread it onto all SMs or not. So some manual way to say this should exist.
As to question 3, I suppose GPU profiling tools should show this: the Visual Profiler and the newer Parallel Nsight and Nsight Compute. But I didn't try them. This will not be a task manager, but rather statistics for the kernels that were executed by your program.
As to the possibility of moving thread blocks between SMs when necessary, @ChristianSarofeen, I can't find any mention that this is possible. Quite the contrary:
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although, starting from some architecture, there is such a thing as preemption. As I remember, NVIDIA advertised it in the following way: let's say you made a game that runs some heavy kernels (say for graphics rendering), and then something unusual happens and you need to execute some not-so-heavy kernel as fast as possible. With preemption you can somehow unload the running kernels and execute this high-priority one, which greatly reduces the time until that high-priority kernel finishes.
I also found this:
CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. This allows a graph to be defined once and then launched repeatedly. Separating out the definition of a graph from its execution enables a number of optimizations: first, CPU launch costs are reduced compared to streams, because much of the setup is done in advance; second, presenting the whole workflow to CUDA enables optimizations which might not be possible with the piecewise work submission mechanism of streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernel invocations take a lot of time (at least when you launch a stream of kernels and don't wait for results in between). If you call several kernels, it seems possible to send all the necessary data for all kernels while the first kernel is executing on the GPU. So I believe NVIDIA means that it runs several kernels in parallel and performs some smart load balancing between SMs.
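For reference, a minimal stream-capture sketch of the CUDA Graphs workflow described in that quote (the kernels and names are placeholders, and the exact cudaGraphInstantiate signature varies slightly between toolkit versions):

#include <cuda_runtime.h>

__global__ void step1(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] += 1.0f; }
__global__ void step2(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0f; }

void run_with_graph(float *d_x, int n, cudaStream_t stream)
{
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    // Record the sequence of launches once...
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step1<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    step2<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    // ...then replay it many times with much lower per-launch CPU overhead.
    for (int iter = 0; iter < 1000; ++iter)
        cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}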

Strong scaling on GPUs

I'd like to investigate the strong scaling of my parallel GPU code (written with OpenACC). The concept of strong scaling with GPUs is - at least as far as I know - more murky than with CPUs. The only resource I found regarding strong scaling on GPUs suggests fixing the problem size and increasing the number of GPUs. However, I believe there is some amount of strong scaling within GPUs, for example scaling over streaming multiprocessors (in the Nvidia Kepler architecture).
The intent of OpenACC and CUDA is to abstract away the hardware from the parallel programmer, constraining her to their three-level programming model with gangs (thread blocks), workers (warps) and vectors (SIMT groups of threads). It is my understanding that the CUDA model aims at offering scalability with respect to its thread blocks, which are independent and are mapped to SMXs. I therefore see two ways to investigate strong scaling with the GPU:
Fix the problem size, and set the thread block size and number of threads per block to an arbitrary constant number. Scale the number of thread blocks (grid size).
Given additional knowledge on the underlying hardware (e.g. CUDA compute capability, max warps/multiprocessor, max thread blocks/multiprocessor, etc.), set the thread block size and number of threads per block such that a block occupies an entire and single SMX. Therefore, scaling over thread blocks is equivalent to scaling over SMXs.
My questions are: is my train of thought regarding strong scaling on the GPU correct/relevant? If so, is there a way to do #2 above within OpenACC?
GPUs do strong scale, but not necessarily in the way that you're thinking, which is why you've only been able to find information about strong scaling to multiple GPUs. With a multi-core CPU you can trivially decide exactly how many CPU cores you want to run on, so you can fix the work and adjust the degree of threading across the cores. With a GPU the allocation across SMs is handled automatically and is completely out of your control. This is by design, because it means that a well-written GPU code will strong scale to fill whatever GPU (or GPUs) you throw at it without any programmer or user intervention.
You could run on some small number of OpenACC gangs/CUDA threadblocks and assume that 14 gangs will run on 14 different SMs, but there are a couple of problems with this. First, 1 gang/threadblock will not saturate a single Kepler SMX. No matter how many threads, no matter what the occupancy, you need more blocks per SM in order to fully utilize the hardware. Second, you're not really guaranteed that the hardware will choose to schedule the blocks that way. Finally, even if you find the optimal number of blocks or gangs per SM on the device you have, it won't scale to other devices. The trick with GPUs is to expose as much parallelism as possible so that you can scale from devices with 1 SM up to devices with 100, if they ever exist, or to multiple devices.
If you want to experiment with how varying the number of OpenACC gangs for a fixed amount of work affects performance, you'd do that with either the num_gangs clause, if you're using a parallel region, or the gang clause, if you're using kernels. Since you're trying to force a particular mapping of the loops, you're really better off using parallel, since that's the more prescriptive directive. What you'd want to do is something like the following:
#pragma acc parallel loop gang vector num_gangs(vary this number) vector_length(fix this number)
for(i=0; i<N; i++)
do something
This tells the compiler to vectorize the loop using some provided vector length and then partition the loop across OpenACC gangs. What I'd expect is that as you add gangs you'll see better performance up until some multiple of the number of SMs, at which point performance would become roughly flat (with outliers of course). As I said above, fixing the number of gangs at the point where you see optimal performance is not necessarily the best idea, unless this is the only device you're interested in. Instead, by either letting the compiler decide how to decompose the loop, which allows the compiler to make smart decisions based on the architecture you tell it to build for, or by exposing as many gangs as possible, which gives you additional parallelism that will strong scale to larger GPUs or multiple GPUs, you'd have more portable code.
For occupying a complete SMX I would suggest using shared memory as the limiting resource for occupancy. Write a kernel that consumes 32 KB of shared memory and the block will occupy the entire SMX, because the SMX is out of resources for another block. Then you can scale your blocks up from 1 to 13 (for a K20c) and the scheduler will (hopefully) schedule each block to a different SMX. Then you can scale up the threads per block, first to 192 to get each CUDA core busy, and then further to keep the warp schedulers happy. GPUs provide performance through latency hiding. So you have to move on from 1 block occupying an SMX to N blocks. You can do that by using less shared memory, and again scaling up your warps to cover latency hiding.
I have never touched OpenACC, and if you really want full control over your experimental code, use CUDA instead of OpenACC. You cannot see inside the OpenACC compiler or what it is doing with the pragmas used in your code.
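To make the shared-memory trick concrete, here is a minimal sketch (the kernel body is just a placeholder; the 32 KB figure follows the answer above and limits a Kepler SMX to one resident block):

__global__ void one_block_per_smx(float *data, int n)
{
    __shared__ float scratch[8192];                    // 8192 * 4 B = 32 KB of shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (i < n) ? data[i] : 0.0f;
    __syncthreads();
    if (i < n) data[i] = scratch[threadIdx.x] * 2.0f;  // placeholder work
}

// Launch with an increasing number of blocks, e.g.:
//   one_block_per_smx<<<numBlocks, 192>>>(d_data, n);   // numBlocks = 1 .. 13 on a K20c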

How does the speed of CUDA program scale with the number of blocks?

I am working on a Tesla C1060, which contains 240 processor cores with compute capability 1.3. Knowing that each group of 8 cores is controlled by a single multiprocessor, and that each block of threads is assigned to a single multiprocessor, I would expect that launching a grid of 30 blocks should take the same execution time as one single block. However, things don't scale that nicely, and I never got this nice scaling even with 8 threads per block. Going to the other extreme with 512 threads per block, I get approximately the same time as one block when the grid contains a maximum of 5 blocks. This was disappointing when I compared the performance with implementing the same task parallelized with MPI on an 8-core CPU machine.
Can someone explain that to me?
By the way, the computer actually contains two of this Tesla card, so does it distribute blocks between them automatically, or do I have to take further steps to ensure that both are fully exploited?
EDIT:
Regarding my last question, if I launch two independent MPI processes on the same computer, how can I make each work on a different graphics card?
EDIT2: Based on the request of Pedro, here is a plot depicting the total time on the vertical axis, normalized to 1, versus the number of parallel blocks. The number of threads/block = 512. The numbers are rough, since I observed quite large variance of the times for large numbers of blocks.
The speed is not a simple linear function of the number of blocks. It depends on a lot of things, for example the memory usage, the number of instructions executed in a block, etc.
If you want to do multi-GPU computing, you need to modify your code, otherwise you can only use one GPU card.
It seems to me that you have simply taken a C program and compiled it in CUDA without much thought.
Dear friend, this is not the way to go. You have to design your code to take advantage of the fact that CUDA cards have a different internal architecture than regular CPUs. In particular, take the following into account:
memory access pattern - there is a number of memory systems in a GPU and each requires consideration on how to use it best
thread divergence problems - performance will only be good if most of your threads follow the same code path most of the time
If your system has 2 GPUs, you can use both to accelerate some (suitable) problems. The thing is that the memory areas of the two are separate and not easily 'visible' to each other - you have to design your algorithm to take this into account.
A typical C program written in the pre-GPU era will often not be easily transplantable unless it was originally written with MPI in mind.
To make each MPI process work with a different GPU card you can use cudaSetDevice().
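A minimal sketch of that pattern (assuming the ranks on a node should simply round-robin over the local GPUs):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, deviceCount = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&deviceCount);

    // Bind this MPI process to one of the GPUs on the node. A real code would
    // use a node-local rank (e.g. via MPI_Comm_split_type) when running on
    // more than one node.
    if (deviceCount > 0)
        cudaSetDevice(rank % deviceCount);

    // ... allocate memory and launch kernels as usual; they run on this rank's GPU ...

    MPI_Finalize();
    return 0;
}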

How to adjust the cuda number of block and of thread to get optimal performances

I've tested empirically for several values of block and of thread, and the execution time can be greatly reduced with specific values.
I don't see what the differences between blocks and threads are. I figure it may be that threads in a block have specific cache memory, but it's quite fuzzy for me. For the moment, I parallelize my functions into N parts, which are allocated to blocks/threads.
My goal would be to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Could that be possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
If you are using shared memory, you might want to consider it first, because it is a very limited resource and it is not unlikely that kernels have very specific needs that constrain the many variables controlling parallelism. You either have blocks with many threads sharing larger regions, or blocks with fewer threads sharing smaller regions (under constant occupancy).
If your code can live with as little as 16 KB of shared memory per multiprocessor, you might want to opt for the larger (48 KB) L1 cache by calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, L1 caching of global accesses can be disabled with the compiler option -Xptxas=-dlcm=cg to avoid pollution when the kernel accesses global memory carefully.
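A small sketch of where those calls would go (the kernel name is a placeholder; cudaFuncSetCacheConfig is the per-kernel variant of the same setting):

#include <cuda_runtime.h>

__global__ void my_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // uses little shared memory
    if (i < n) x[i] += 1.0f;
}

int main()
{
    // Prefer a 48 KB L1 / 16 KB shared split for the whole device...
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
    // ...or only for a specific kernel:
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);

    // Compile with: nvcc -Xptxas=-dlcm=cg ... to bypass L1 for global loads.
    // (memory allocation and kernel launches omitted)
    return 0;
}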
Before worrying about optimal performance based on occupancy you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or appropriate optimization options are given, read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs. That is, register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
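For reference, __launch_bounds__ is how you constrain those per-kernel limits at compile time; a hedged sketch (the numbers are just examples to experiment with, not recommendations):

// __launch_bounds__ tells the compiler the maximum block size and, optionally,
// a desired minimum number of resident blocks per multiprocessor, which bounds
// how many registers it may use per thread before spilling.
__global__ void __launch_bounds__(256, 4)
tuned_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];               // placeholder work
}

// Alternatively, cap registers for the whole compilation unit with
// nvcc --maxrregcount=32 ... and compare profiler results.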
It's theoretically possible to find optimal values from within an application; however, having the client code adjust optimally to both different devices and launch parameters can be nontrivial and will require recompilation, or different variants of the kernel to be deployed for every target device architecture.
I believe automatically adjusting the block and thread sizes is a highly difficult problem. If it were easy, CUDA would most probably have this feature for you.
The reason is that the optimal configuration depends on the implementation and the kind of algorithm you are implementing. It requires profiling and experimentation to get the best performance.
Here are some limitations that you can consider:
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to best performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
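For what it's worth, newer CUDA toolkits (6.5 and later) do expose an occupancy-based heuristic in the runtime API. It only maximizes occupancy, not overall performance, but it is a reasonable automatic starting point; a minimal sketch (the kernel is a placeholder):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void example_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main()
{
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for a block size that maximizes occupancy for this
    // kernel on the current device.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, example_kernel, 0, 0);

    int n = 1 << 20;
    int gridSize = (n + blockSize - 1) / blockSize;
    printf("suggested block size %d, grid size %d (min grid for full occupancy: %d)\n",
           blockSize, gridSize, minGridSize);
    // example_kernel<<<gridSize, blockSize>>>(d_x, n);   // d_x would be a device array
    return 0;
}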
I have a fairly good answer here; in a word, computing the optimal distribution of blocks and threads is a difficult problem.

Maximum (shared memory per block) / (threads per block) in CUDA with 100% MP load

I'm trying to process an array of big structures with CUDA 2.0 (NVIDIA 590). I'd like to use shared memory for it. I've experimented with the CUDA occupancy calculator, trying to allocate the maximum shared memory per thread, so that each thread can process a whole element of the array.
However, the maximum (shared memory per block) / (threads per block) I can see in the calculator with 100% multiprocessor load is 32 bytes, which is not enough for a single element (by an order of magnitude).
Is 32 bytes the maximum possible value of (shared memory per block) / (threads per block)?
Is it possible to say which alternative is preferable - allocating part of the array in global memory, or just using an underloaded multiprocessor? Or can it only be decided by experiment?
Yet another alternative I can see is to process the array in several passes, but that looks like a last resort.
This is the first time I'm trying something really complex with CUDA, so I could be missing some other options...
There are many hardware limitations you need to keep in mind when designing a CUDA kernel. Here are some of the constraints you need to consider:
maximum number of threads you can run in a single block
maximum number of blocks you can load on a streaming multiprocessor at once
maximum number of registers per streaming multiprocessor
maximum amount of shared memory per streaming multiprocessor
Whichever of these limits you hit first becomes a constraint that limits your occupancy (is maximum occupancy what you are referring to by "100% Multiprocessor load"?). Once you reach a certain threshold of occupancy, it becomes less important to pay attention to occupancy. For example, occupancy of 33% does not mean that you are only able to achieve 33% of the maximum theoretical performance of the GPU. Vasily Volkov gave a great talk at the 2010 GPU Technology Conference which recommends not worrying too much about occupancy, and instead trying to minimize memory transactions by using some explicit caching tricks (and other stuff) in the kernel. You can watch the talk here: http://www.gputechconf.com/gtcnew/on-demand-GTC.php?sessionTopic=25&searchByKeyword=occupancy&submit=&select=+&sessionEvent=&sessionYear=&sessionFormat=#193
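As an aside, much newer toolkits than the one in the question (CUDA 6.5 and later) let you query this trade-off programmatically, which makes it easy to see how the shared memory you request per block limits the number of resident blocks; a hedged sketch with placeholder numbers:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void big_smem_kernel(float *x, int n)
{
    extern __shared__ float scratch[];              // dynamically sized shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    __syncthreads();
    if (i < n) x[i] = scratch[threadIdx.x] + 1.0f;  // placeholder work
}

int main()
{
    int blockSize = 256;
    size_t smemPerBlock = 8 * 1024;                 // try different amounts here

    int maxBlocksPerSM = 0;
    // How many blocks of this kernel can be resident on one multiprocessor,
    // given the block size and dynamic shared memory request.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, big_smem_kernel,
                                                  blockSize, smemPerBlock);
    printf("resident blocks per SM with %zu bytes shared/block: %d\n",
           smemPerBlock, maxBlocksPerSM);
    return 0;
}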
The only real way to be sure that you are using a kernel design that gives best performance is to test all the possibilities. And you need to redo this performance testing for each type of device you run it on, because they all have different constraints in some way. This can obviously be tedious, especially when the different design patterns result in fundamentally different kernels. I get around this to some extent by using a templating engine to dynamically generate kernels at runtime according to the device hardware specifications, but it's still a bit of a hassle.