Strong scaling on GPUs - CUDA

I'd like to investigate the strong scaling of my parallel GPU code (written with OpenACC). The concept of strong scaling with GPUs is - at least as far as I know - more murky than with CPUs. The only resource I found regarding strong scaling on GPUs suggests fixing the problem size and increasing the number of GPUs. However, I believe there is some amount of strong scaling within GPUs, for example scaling over streaming multiprocessors (in the Nvidia Kepler architecture).
The intent of OpenACC and CUDA is to abstract the hardware away from the parallel programmer, constraining her to their three-level programming model with gangs (thread blocks), workers (warps) and vectors (SIMT groups of threads). It is my understanding that the CUDA model aims at offering scalability with respect to its thread blocks, which are independent and are mapped to SMXs. I therefore see two ways to investigate strong scaling with the GPU:
1. Fix the problem size, and set the thread block size and number of threads per block to an arbitrary constant. Scale the number of thread blocks (grid size).
2. Given additional knowledge of the underlying hardware (e.g. CUDA compute capability, max warps/multiprocessor, max thread blocks/multiprocessor, etc.), set the thread block size and number of threads per block so that a block occupies an entire, single SMX. Scaling over thread blocks is then equivalent to scaling over SMXs.
My questions are: is my train of thought regarding strong scaling on the GPU correct/relevant? If so, is there a way to do #2 above within OpenACC?

GPUs do strong scale, but not necessarily in the way that you're thinking, which is why you've only been able to find information about strong scaling to multiple GPUs. With a multi-core CPU you can trivially decide exactly how many CPU cores you want to run on, so you can fix the work and adjust the degree of threading across the cores. With a GPU the allocation across SMs is handled automatically and is completely out of your control. This is by design, because it means that a well-written GPU code will strong scale to fill whatever GPU (or GPUs) you throw at it without any programmer or user intervention.
You could run on some small number of OpenACC gangs/CUDA threadblocks and assume that 14 gangs will run on 14 different SMs, but there are a couple of problems with this. First, 1 gang/threadblock will not saturate a single Kepler SMX. No matter how many threads, no matter what the occupancy, you need more blocks per SM in order to fully utilize the hardware. Second, you're not really guaranteed that the hardware will choose to schedule the blocks that way. Finally, even if you find the optimal number of blocks or gangs per SM on the device you have, it won't scale to other devices. The trick with GPUs is to expose as much parallelism as possible so that you can scale from devices with 1 SM up to devices with 100, if they ever exist, or to multiple devices.
If you want to experiment with how varying the number of OpenACC gangs for a fixed amount of work affects performance, you'd do that with either the num_gangs clause, if you're using a parallel region, or the gang clause, if you're using kernels. Since you're trying to force a particular mapping of the loops, you're really better off using parallel, since that's the more prescriptive directive. What you'd want to do is something like the following:
#pragma acc parallel loop gang vector num_gangs(G) vector_length(VL) /* vary G, keep VL fixed */
for (int i = 0; i < N; i++) {
    /* do something */
}
This tells the compiler to vectorize the loop using some provided vector length and then partition the loop across OpenACC gangs. What I'd expect is that as you add gangs you'll see better performance up until some multiple of the number of SMs, at which point performance would become roughly flat (with outliers of course). As I said above, fixing the number of gangs at the point where you see optimal performance is not necessarily the best idea, unless this is the only device you're interested in. Instead, by either letting the compiler decide how to decompose the loop, which allows the compiler to make smart decisions based on the architecture you tell it to build for, or by exposing as many gangs as possible, which gives you additional parallelism that will strong scale to larger GPUs or multiple GPUs, you'd have more portable code.
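To make that concrete, here is a minimal sketch of such a sweep, assuming a simple saxpy-style loop body (the array size, the gang counts, and the vector length of 128 are arbitrary choices, not anything from the original code); num_gangs accepts an ordinary integer expression, so the gang count can be a runtime variable:

#include <stdio.h>

#define N (1 << 22)

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Fixed problem size, varying number of gangs: time each iteration externally
       (e.g. with the compiler's profiling output or nvprof). */
    for (int g = 1; g <= 128; g *= 2) {
        #pragma acc parallel loop gang vector num_gangs(g) vector_length(128) copyin(x[0:N]) copy(y[0:N])
        for (int i = 0; i < N; i++)
            y[i] = 2.0f * x[i] + y[i];
        printf("ran with %d gangs\n", g);
    }
    return 0;
}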

For occupying a complete SMX I would suggest using shared memory as the limiting resource for occupancy. Write a kernel whose block consumes most of the shared memory (more than half of the 48 kB available per Kepler SMX) and the block will occupy the entire SMX, because the SMX has no resources left for another block. Then you can scale your blocks up from 1 to 13 (for a K20c) and the scheduler will (hopefully) assign each block to a different SMX. Then you can scale up the threads per block, first to 192 to get each CUDA core busy, and then further to keep the warp schedulers happy. GPUs provide performance through latency hiding, so you eventually have to move on from 1 block occupying an SMX to N blocks. You can do that by using less shared memory, and then again scaling up your warps to cover latency hiding.
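A minimal CUDA sketch of that shared-memory trick (the kernel name, the 40 kB buffer and the 192-thread block size are illustrative assumptions; any static allocation above half of the 48 kB per Kepler SMX prevents a second resident block):

#include <cstdio>

// Hypothetical kernel: the large static shared array exists only to limit
// occupancy to one block per SMX; the computation itself is a trivial copy.
__global__ void one_block_per_smx(const float *in, float *out, int n)
{
    __shared__ float buf[10 * 1024];   // 40 kB > 48 kB / 2, so only one block fits per SMX

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        buf[threadIdx.x] = in[tid];    // stage through shared memory
        out[tid] = buf[threadIdx.x];
    }
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // Scale the grid from 1 to 13 blocks (one per SMX on a K20c) and time each run.
    for (int blocks = 1; blocks <= 13; ++blocks)
        one_block_per_smx<<<blocks, 192>>>(in, out, n);

    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}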
I have never touched OpenACC, but if you really want full control over your experimental code, use CUDA instead. You cannot see inside the OpenACC compiler to know what it is doing with the pragmas in your code.

Related

CUDA dynamic parallelism: depth of child threads one can create

I am reading the CUDA programming guide which I find dense. I came to the section where they explain that a parent grid can create a child grid, and the parent grid is considered completed only when all its spawned child threads have completed.
My question is: how "deep" is the parent-child tree allowed to grow in Cuda? Is it constrained only by the compute capabilities of the hardware in question (e.g. one can spawn as many parent/child blocks of threads as he/she wants, provided we don't exceed the max number of threads that can run on the hardware at once), or are there further constraints? I am asking this because, absent this capability, I don't see how recursion can be implemented on GPUs.
My question is: how "deep" is the parent-child tree allowed to grow in Cuda
The documentation indicates a maximum nesting depth of 24.
As indicated in the documentation, there typically will be other limits that you may hit first, before actually reaching a nesting depth of 24. One of these would be general limits on device-side kernel launches, including memory requirements as well as pending-launch limits. Another possible limit is the synchronization depth limit. This has to do with whether a parent kernel explicitly waits on a child kernel to complete (e.g. via device-side cudaDeviceSynchronize()), and to what depth this synchronization is extended.
provided we don't exceed the max number of threads that can run on the hardware at once
None of this depends explicitly on how many threads are in the parent kernel, or child kernel(s). CUDA kernels don't have any basic limitation on the number of threads the hardware can run at once, and neither does CUDA Dynamic Parallelism (CDP).
As a practical matter then, large depth CDP launches may run into a variety of limits. Further, such design patterns may not be the best from a performance perspective. A CDP launch has time and resource overheads associated with it, and for any pattern that would subdivide the work this way, it's generally desirable in a CUDA kernel to do more work in the kernel, not less.
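For illustration, here is a minimal CUDA Dynamic Parallelism sketch (the kernel name, depths and limit values are arbitrary assumptions; it requires compute capability 3.5+, compilation with -rdc=true, and linking against cudadevrt):

#include <cstdio>

// Hypothetical recursive kernel: each level launches one child grid until
// max_depth is reached.
__global__ void recurse(int depth, int max_depth)
{
    if (threadIdx.x == 0 && depth < max_depth) {
        printf("depth %d\n", depth);
        recurse<<<1, 32>>>(depth + 1, max_depth);   // child launch from the device
    }
}

int main()
{
    // Pending-launch and synchronization-depth limits can be raised within hardware bounds.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 2048);
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);

    recurse<<<1, 32>>>(0, 8);                       // well below the 24-level cap
    cudaDeviceSynchronize();
    return 0;
}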

How does the speed of a CUDA program scale with the number of blocks?

I am working on a Tesla C1060, which contains 240 processor cores with compute capability 1.3. Knowing that each group of 8 cores is controlled by a single multiprocessor, and that each block of threads is assigned to a single multiprocessor, I would expect that launching a grid of 30 blocks should take the same execution time as a single block. However, things don't scale that nicely, and I never got this nice scaling, even with 8 threads per block. Going to the other extreme with 512 threads per block, I get approximately the same time as one block only when the grid contains at most 5 blocks. This was disappointing when I compared the performance with the same task parallelized with MPI on an 8-core CPU machine.
Can some one explain that to me?
By the way, the computer actually contains two of these Tesla cards, so does it distribute blocks between them automatically, or do I have to take further steps to ensure that both are fully exploited?
EDIT:
Regarding my last question, if I launch two independent MPI processes on the same computer, how can I make each work on a different graphics card?
EDIT2: Based on the request of Pedro, here is a plot depicting the total time on the vertical axis, normalized to 1, versus the number of parallel blocks. The number of threads/block = 512. The numbers are rough, since I observed quite a large variance of the times for large numbers of blocks.
The speed is not a simple linear function of the number of blocks. It depends on many factors, for example memory usage, the number of instructions executed in a block, etc.
If you want to do multi-GPU computing, you need to modify your code, otherwise you can only use one GPU card.
It seems to me that you have simply taken a C program and compiled it in CUDA without much thought.
Dear friend, this is not the way to go. You have to design your code to take advantage of the fact that CUDA cards have a different internal architecture than regular CPUs. In particular, take the following into account:
memory access pattern - there is a number of memory systems in a GPU and each requires consideration on how to use it best
thread divergence problems - performance will only be good if most of your threads follow the same code path most of the time
If your system has 2 GPUs, you can use both to accelerate some (suitable) problems. The thing is that the memory areas of the two are separate and not easily 'visible' to each other - you have to design your algorithm to take this into account.
A typical C program written in pre-GPU era will often not be easily transplantable unless originally written with MPI in mind.
To make each MPI process work with a different GPU card you can use cudaSetDevice().
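A minimal sketch of that idea (assuming one MPI process per GPU on a single node; binding by rank modulo the device count is just one common convention):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);

    // Bind each MPI process to a GPU based on its rank.
    cudaSetDevice(rank % ndev);
    printf("rank %d uses GPU %d of %d\n", rank, rank % ndev, ndev);

    /* ... launch kernels as usual; each process now talks to its own card ... */

    MPI_Finalize();
    return 0;
}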

How to adjust the CUDA number of blocks and threads to get optimal performance

I've tested empirically for several values of block and of thread, and the execution time can be greatly reduced with specific values.
I don't see what the differences between blocks and threads are. I figure it may be that threads in a block have specific cache memory, but it's quite fuzzy to me. For the moment, I parallelize my functions in N parts, which are allocated to blocks/threads.
My goal would be to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Could that be possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
When using shared memory you might want to consider it first, because it's a very limited resource and it's not unlikely for kernels to have very specific needs that constrain the many variables controlling parallelism. You either have blocks with many threads sharing larger regions, or blocks with fewer threads sharing smaller regions (under constant occupancy).
If your code can live with as little as 16 KB of shared memory per multiprocessor, you might want to opt for the larger (48 KB) L1 cache by calling cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, L1 caches can be disabled for non-local global access using the compiler option -Xptxas=-dlcm=cg to avoid pollution when the kernel accesses global memory carefully.
Before worrying about optimal performance based on occupancy, you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or that appropriate optimization options are given; read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs; that is, register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
It's theoretically possible to find optimal values from within an application; however, having the client code adjust optimally to both different devices and launch parameters can be nontrivial and will require recompilation, or different variants of the kernel to be deployed for every target device architecture.
I believe automatically adjusting the block and thread sizes is a highly difficult problem. If it were easy, CUDA would most probably have this feature for you.
The reason is that the optimal configuration depends on the implementation and on the kind of algorithm you are implementing. It requires profiling and experimenting to get the best performance.
Here are some limitations which you can consider.
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to best performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
I've given quite a good answer here; in short, computing the optimal distribution over blocks and threads is a difficult problem.
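If you do want a runtime heuristic rather than hand tuning, the occupancy API introduced in CUDA 6.5 can suggest a block size; a minimal sketch with a hypothetical kernel:

#include <cstdio>

// Hypothetical kernel used only to illustrate the occupancy API.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *data;
    cudaMalloc(&data, n * sizeof(float));

    // Let the runtime suggest a block size that maximizes occupancy for this kernel.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);

    int gridSize = (n + blockSize - 1) / blockSize;
    printf("suggested block size %d, grid size %d\n", blockSize, gridSize);

    scale<<<gridSize, blockSize>>>(data, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}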

Maximum (shared memory per block) / (threads per block) in CUDA with 100% MP load

I'm trying to process an array of big structures with CUDA compute capability 2.0 (an NVIDIA GTX 590). I'd like to use shared memory for it. I've experimented with the CUDA occupancy calculator, trying to allocate the maximum shared memory per thread, so that each thread can process a whole element of the array.
However, the maximum of (shared memory per block) / (threads per block) I can see in the calculator with 100% multiprocessor load is 32 bytes, which is not enough for a single element (by an order of magnitude).
Is 32 bytes a maximum possible value for (shared memory per block) / (threads per block)?
Is it possible to say which alternative is preferable - allocating part of the array in global memory, or just using an underloaded multiprocessor? Or can it only be decided by experiment?
Yet another alternative I can see is to process array in several passes, but it looks like a last resort.
This is the first time I'm trying something really complex with CUDA, so I could be missing some other options...
There are many hardware limitations you need to keep in mind when designing a CUDA kernel. Here are some of the constraints you need to consider:
maximum number of threads you can run in a single block
maximum number of blocks you can load on a streaming multiprocessor at once
maximum number of registers per streaming multiprocessor
maximum amount of shared memory per streaming multiprocessor
Whichever of these limits you hit first becomes a constraint that limits your occupancy (is maximum occupancy what you are referring to by "100% Multiprocessor load"?). Once you reach a certain threshold of occupancy, it becomes less important to pay attention to occupancy. For example, occupancy of 33% does not mean that you are only able to achieve 33% of the maximum theoretical performance of the GPU. Vasily Volkov gave a great talk at the 2010 GPU Technology Conference which recommends not worrying too much about occupancy, and instead trying to minimize memory transactions by using some explicit caching tricks (and other stuff) in the kernel. You can watch the talk here: http://www.gputechconf.com/gtcnew/on-demand-GTC.php?sessionTopic=25&searchByKeyword=occupancy&submit=&select=+&sessionEvent=&sessionYear=&sessionFormat=#193
The only real way to be sure that you are using a kernel design that gives best performance is to test all the possibilities. And you need to redo this performance testing for each type of device you run it on, because they all have different constraints in some way. This can obviously be tedious, especially when the different design patterns result in fundamentally different kernels. I get around this to some extent by using a templating engine to dynamically generate kernels at runtime according to the device hardware specifications, but it's still a bit of a hassle.
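A minimal sketch of that brute-force testing, assuming a trivial kernel and cudaEvent timing (the kernel and the set of candidate block sizes are illustrative, not from the original project):

#include <cstdio>

// Hypothetical kernel to benchmark.
__global__ void touch(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 22;
    float *data;
    cudaMalloc(&data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the same amount of work for several candidate block sizes.
    int sizes[] = {64, 128, 192, 256, 512, 1024};
    for (int s = 0; s < 6; ++s) {
        int block = sizes[s];
        int grid  = (n + block - 1) / block;

        cudaEventRecord(start);
        touch<<<grid, block>>>(data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block %4d: %.3f ms\n", block, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(data);
    return 0;
}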

Code to limit GPU usage

Is there a command/function/variable that can be set in CUDA code that limits the GPU usage percentage? I'd like to modify an open-source project called Flam4CUDA so that that option exists. The way it is now, it uses as much of all GPUs present as possible, with the effect that the temperatures skyrocket (obviously). In an effort to keep temps down over long periods of computing, I'd like to be able to tell the program to use, say, 50% of each GPU (or even have different percentages for different GPUs, or maybe also be able to select which GPU(s) to use). Any ideas?
If you want to see the code, it's available with "svn co https://flam4.svn.sourceforge.net/svnroot/flam4 flam4".
There is no easy way to do what you are asking to do. CPU usage is controlled via time-slicing of context switches, while GPUs do not have such fine-grained context switching. GPUs are cooperatively multitasked. This is why the nvidia-smi tool for workstation- and server-class boards has "exclusive" and "prohibited" modes to control the number of GPU contexts that are allowed on a given board.
Messing with the number of threads/block or blocks in a grid, as has been suggested, will break applications that are passing metadata to the kernel (not easily inferred by your software) that depends on the expected block and grid size.
You can look at the GPU properties using CUDA and find the number of multiprocessors and the number of cores per multiprocessor. What you basically need to do is change the block size and grid size of the kernel functions so that you use half of the total number of cores.
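A minimal sketch of querying those properties (the cores-per-multiprocessor figure is architecture-specific and not reported directly; 32 below is the Fermi value, assumed here for a GTX 590):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    for (int d = 0; d < ndev; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);

        // Cores per SM is not in cudaDeviceProp; 32 is the Fermi (GTX 590) figure.
        int coresPerSM = 32;
        printf("device %d: %s, %d SMs, ~%d cores\n",
               d, prop.name, prop.multiProcessorCount,
               prop.multiProcessorCount * coresPerSM);

        // To target roughly half the GPU one could, for example, launch a grid with
        // prop.multiProcessorCount / 2 blocks, though the block scheduler still
        // decides where those blocks actually run.
    }
    return 0;
}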