CUDA dynamic parallelism: depth of child kernels one can create

I am reading the CUDA programming guide, which I find dense. I came to the section explaining that a parent grid can create a child grid, and that the parent grid is considered complete only when all of its spawned child grids have completed.
My question is: how "deep" is the parent-child tree allowed to grow in CUDA? Is it constrained only by the compute capability of the hardware in question (e.g., can one spawn as many parent/child blocks of threads as one wants, provided the maximum number of threads that can run on the hardware at once is not exceeded), or are there further constraints? I am asking because, absent this capability, I don't see how recursion can be implemented on GPUs.
thanks,
Amine

My question is: how "deep" is the parent-child tree allowed to grow in Cuda
The documentation indicates a maximum nesting depth of 24.
As indicated in the documentation, there will typically be other limits that you may hit first, before actually reaching a nesting depth of 24. One of these is the general set of limits on device-side kernel launches, including memory requirements as well as pending-launch limits. Another possible limit is the synchronization depth limit. This has to do with whether a parent kernel explicitly waits on a child kernel to complete (e.g. via device-side cudaDeviceSynchronize()), and to what depth that synchronization is extended.
provided we don't exceed the max number of threads that can run on the hardware at once
None of this depends explicitly on how many threads are in the parent kernel or the child kernel(s). CUDA kernels have no basic limitation tied to the number of threads the hardware can run at once, and neither does CUDA Dynamic Parallelism (CDP).
As a practical matter, then, deeply nested CDP launches may run into a variety of limits. Furthermore, such design patterns may not be the best from a performance perspective. A CDP launch has time and resource overheads associated with it, and for any pattern that would subdivide the work this way, it's generally desirable in a CUDA kernel to do more work per kernel, not less.
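As a concrete (and deliberately simplified) illustration, here is a minimal sketch of a recursive CDP kernel. The MAX_DEPTH constant and the limit values are illustrative assumptions, not figures from the documentation; the cudaDeviceSetLimit calls show where the sync-depth and pending-launch limits discussed above can be raised. It must be compiled with relocatable device code (e.g. nvcc -arch=sm_35 -rdc=true ... -lcudadevrt).

#include <cuda_runtime.h>

#define MAX_DEPTH 8   // illustrative cap, well below the documented 24

__global__ void recurse(int depth)
{
    // ... do this level's real work here ...
    if (threadIdx.x == 0 && depth < MAX_DEPTH)
        recurse<<<1, 32>>>(depth + 1);   // device-side (nested) launch
}

int main()
{
    // Raise the device-runtime limits before the first launch. The sync-depth
    // limit only matters if parents synchronize on their children with
    // device-side cudaDeviceSynchronize(); the pending-launch default is small.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, MAX_DEPTH);
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

    recurse<<<1, 32>>>(0);   // host-side launch at depth 0
    cudaDeviceSynchronize();
    return 0;
}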

Related

How about the register resource situation when all threads quit (return) except one?

I'm writing a CUDA program with the dynamic parallelism mechanism, just like this:
__global__ void parentKernel()
{
    int tid = threadIdx.x;
    if (tid != 0) return;                    // every thread except thread 0 exits early
    // only thread 0 launches the child kernel
    anotherKernel<<<gridDim, blockDim>>>();
}
I know the parent kernel will not quit until the child kernel finishes its work. Does that mean the register resources of the other threads in the parent kernel (those with tid != 0) will not be reclaimed? Can anyone help me?
When and how a terminated thread's resources (e.g. register use) are returned to the machine for use by other blocks is unspecified, and empirically seems to vary by GPU architecture. The reasonable candidates here are that resources are returned at completion of the block, or at completion of the warp.
But that uncertainty need not go beyond the block level. A block that is fully retired returns its resources to the SM that it was resident on for future scheduling purposes. It does not wait for the completion of the kernel. This characteristic is self-evident(*) as being a necessity for the proper operation of a CUDA GPU.
Therefore for the example you have given, we can be sure that all threadblocks except the first threadblock will release their resources, at the point of the return statement. I cannot make specific claims about when exactly warps in the first threadblock may release their resources (except that when thread 0 terminates, resources will be released at that point, if not before).
(*) If it were not the case, a GPU would not be able to process a kernel with more than a relatively small number of blocks (e.g. for the latest GPUs, on the order of several thousand blocks.) Yet it is easy to demonstrate that even the smallest GPUs can process kernels with millions of blocks.
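As a quick way to convince yourself of (*), the following sketch (the sizes are arbitrary, not from the answer above) launches far more blocks than could ever be resident at once; it can only complete because retired blocks return their resources to the SM, allowing further blocks to be scheduled.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void tiny(unsigned long long *count)
{
    if (threadIdx.x == 0)
        atomicAdd(count, 1ULL);   // one increment per block
}

int main()
{
    unsigned long long *count;
    cudaMallocManaged(&count, sizeof(*count));
    *count = 0;

    tiny<<<4000000, 64>>>(count);   // millions of blocks on any GPU
    cudaDeviceSynchronize();
    printf("blocks completed: %llu\n", *count);

    cudaFree(count);
    return 0;
}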

Can a kernel change its block size?

The title can't hold the whole question: I have a kernel doing a stream compaction, after which it continues using a smaller number of threads.
I know one way to avoid execution of unused threads: returning and executing a second kernel with smaller block size.
What I'm asking is, provided unused threads diverge and end (return), and provided they align in complete warps, can I safely assume they won't waste execution?
Is there a common practice for this, other than splitting in two consecutive kernel execution?
Thank you very much!
The unit of execution scheduling and resource scheduling within the SM is the warp - groups of 32 threads.
It is perfectly legal to retire threads in any order using return within your kernel code. However there are at least 2 considerations:
The usage of __syncthreads() in device code depends on having every thread in the block participating. So if a thread hits a return statement, that thread could not possibly participate in a future __syncthreads() statement, and so usage of __syncthreads() after one or more threads have retired is illegal.
From an execution efficiency standpoint (and also from a resource scheduling standpoint, although this latter concept is not well documented and somewhat involved to prove), a warp will still consume execution (and other) resources, until all threads in the warp have retired.
If you can retire your threads in warp units, and don't require the usage of __syncthreads(), you should be able to make fairly efficient usage of the GPU resources even in a threadblock that retires some warps.
For completeness, a threadblock's dimensions are defined at kernel launch time, and they cannot and do not change at any point thereafter. All threadblocks have threads that eventually retire. The concept of retiring threads does not change a threadblock's dimensions, in my usage here (and consistent with usage of __syncthreads()).
Although probably not related to your question directly, CUDA Dynamic Parallelism could be another methodology to allow a threadblock to "manage" dynamically varying execution resources. However for a given threadblock itself, all of the above comments apply in the CDP case as well.
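To make the warp-aligned retirement idea concrete, here is a hedged sketch (the kernel name, data layout and the doubling "work" are placeholders, not taken from the question): whole warps whose elements were eliminated by the compaction return immediately, and no __syncthreads() appears after that return.

__global__ void compactedWork(const int *data, int *out, int numValid)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = tid / warpSize;

    // Retire whole warps that have nothing left to do after compaction.
    // Because the exit is warp-aligned, retired warps stop consuming
    // execution slots; no __syncthreads() may follow this return.
    if (warp * warpSize >= numValid)
        return;

    if (tid < numValid)
        out[tid] = data[tid] * 2;   // placeholder for the real work
}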

Is it possible to set a limit on the number of cores to be used in CUDA programming for a given code?

Assume I have an Nvidia K40, and for some reason I want my code to use only a portion of the CUDA cores (i.e. instead of using all 2880, only use 400 cores, for example). Is it possible? Is it logical to do this?
In addition, is there any way to see how many cores are being used by the GPU when I run my code? In other words, can we check during execution how many cores are being used by the code, with a report like Task Manager in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for CUDA. Not to say it couldn't be useful for something, for example if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of streaming multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think for 99% of cases manual shared memory methods would be better).
The way you could do this would be to access the PTX identifiers %nsmid and %smid and put a conditional at the beginning of each kernel. You would have to launch only one block per streaming multiprocessor (SM) and then return from blocks based on which SMs you want each kernel to run on.
I would warn that this method should be reserved for very experienced CUDA programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a threadblock could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware- and CUDA-version-dependent. However, since you asked, and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTX registers for the SM index and the number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use them in a CUDA kernel without writing PTX directly...
https://gist.github.com/allanmac/4751080
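Along the lines of that gist, here is a minimal sketch (not production code; the cutoff of 4 SMs, the kernel name and the work are arbitrary assumptions) that reads %smid with inline PTX and retires blocks that landed on SMs this kernel is not supposed to use:

__device__ __forceinline__ unsigned int get_smid()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void restrictedKernel(float *data, int n)
{
    // Keep this kernel on (roughly) the first 4 SMs; blocks scheduled
    // elsewhere exit immediately. This behavior is not guaranteed by the
    // programming model, so measure before relying on it.
    if (get_smid() >= 4)
        return;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        data[tid] += 1.0f;   // placeholder work
}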
This will not work with the K40, but for newer data-center GPUs beginning with the Ampere generation (e.g. A100) there is the MIG (Multi-Instance GPU) feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know of such methods, but I would like to learn of them.
As to question 2, I suppose this can sometimes be useful. When you have complicated execution graphs with many kernels, some of which can be executed in parallel, you want to load the GPU as fully and effectively as possible. But it seems the GPU on its own can occupy all SMs with single blocks of one kernel. I.e., if you have a kernel with a 30-block grid and 30 SMs, this kernel can occupy the entire GPU. I believe I saw such an effect. This kernel really will be faster (maybe 1.5x against 4 256-thread blocks per SM), but it will not be effective when you have other work.
The GPU can't know whether we are going to run another kernel after this 30-block one, i.e. whether it would be more effective to spread it onto all SMs or not. So some manual way to express this should exist.
As to question 3, I suppose GPU profiling tools should show this: the Visual Profiler and the newer Nsight Systems and Nsight Compute. But I haven't tried. This will not be a task manager, but rather statistics for the kernels that were executed by your program.
As to the possibility of moving thread blocks between SMs when necessary:
@ChristianSarofeen, I can't find any mention that this is possible. Quite the contrary:
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although, starting from some architecture, there is such a thing as preemption. As I remember, NVIDIA advertised it this way: say you made a game that runs some heavy kernels (say, for graphics rendering), and then something unusual happens and you need to execute some not-so-heavy kernel as fast as possible. With preemption you can suspend the currently running kernels and execute this high-priority one, which greatly improves its response time.
I also found this:
CUDA Graphs present a new model for work submission in CUDA. A graph
is a series of operations, such as kernel launches, connected by
dependencies, which is defined separately from its execution. This
allows a graph to be defined once and then launched repeatedly.
Separating out the definition of a graph from its execution enables a
number of optimizations: first, CPU launch costs are reduced compared
to streams, because much of the setup is done in advance; second,
presenting the whole workflow to CUDA enables optimizations which
might not be possible with the piecewise work submission mechanism of
streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernel invocations take a lot of time (at least in the case of a stream of kernels, when you don't wait for results in between). If you call several kernels, it seems possible to send all the necessary data for all kernels while the first kernel is executing on the GPU. So I believe NVIDIA means that it runs several kernels in parallel and performs some smart load balancing between SMs.
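For reference, here is a minimal stream-capture sketch of the CUDA Graphs mechanism quoted above; kernelA/kernelB and the sizes are placeholders, and the exact cudaGraphInstantiate signature has varied slightly across CUDA versions.

#include <cuda_runtime.h>

__global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }   // placeholder
__global__ void kernelB(float *x) { x[threadIdx.x] *= 2.0f; }   // placeholder

int main()
{
    float *d_x;
    cudaMalloc(&d_x, 256 * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record a fixed sequence of launches once...
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    kernelA<<<1, 256, 0, stream>>>(d_x);
    kernelB<<<1, 256, 0, stream>>>(d_x);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);

    // ...then replay it many times with very low per-launch CPU overhead.
    for (int i = 0; i < 1000; ++i)
        cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_x);
    return 0;
}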

Strong scaling on GPUs

I'd like to investigate the strong scaling of my parallel GPU code (written with OpenACC). The concept of strong scaling with GPUs is - at least as far as I know - more murky than with CPUs. The only resource I found regarding strong scaling on GPUs suggests fixing the problem size and increasing the number of GPUs. However, I believe there is some amount of strong scaling within GPUs, for example scaling over streaming multiprocessors (in the Nvidia Kepler architecture).
The intent of OpenACC and CUDA is to abstract the hardware away from the parallel programmer, constraining her to their three-level programming model with gangs (thread blocks), workers (warps) and vectors (SIMT groups of threads). It is my understanding that the CUDA model aims at offering scalability with respect to its thread blocks, which are independent and are mapped to SMXs. I therefore see two ways to investigate strong scaling with the GPU:
1. Fix the problem size, and set the thread block size and number of threads per block to an arbitrary constant number. Scale the number of thread blocks (grid size).
2. Given additional knowledge on the underlying hardware (e.g. CUDA compute capability, max warps/multiprocessor, max thread blocks/multiprocessor, etc.), set the thread block size and number of threads per block such that a block occupies an entire and single SMX. Therefore, scaling over thread blocks is equivalent to scaling over SMXs.
My questions are: is my train of thought regarding strong scaling on the GPU correct/relevant? If so, is there a way to do #2 above within OpenACC?
GPUs do strong scale, but not necessarily in the way that you're thinking, which is why you've only been able to find information about strong scaling to multiple GPUs. With a multi-core CPU you can trivially decide exactly how many CPU cores you want to run on, so you can fix the work and adjust the degree of threading across the cores. With a GPU the allocation across SMs is handled automatically and is completely out of your control. This is by design, because it means that a well-written GPU code will strong scale to fill whatever GPU (or GPUs) you throw at it without any programmer or user intervention.
You could run on some small number of OpenACC gangs/CUDA threadblocks and assume that 14 gangs will run on 14 different SMs, but there's a couple of problems with this. First, 1 gang/threadblock will not saturate a single Kepler SMX. No matter how many threads, no matter what the occupancy, you need more blocks per SM in order to fully utilize the hardware. Second, you're not really guaranteed that the hardware will choose to schedule the blocks that way. Finally, even if you find the optimal number of blocks or gangs per SM on the device you have, it won't scale to other devices. The trick with GPUs is to expose as much parallelism as possible so that you can scale from devices with 1 SM up to devices with 100, if they ever exist, or to multiple devices.
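In CUDA terms, the usual way to "expose as much parallelism as possible" while letting the hardware handle the SM allocation is a grid-stride loop; a short sketch (the SAXPY body is only an example):

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    // Grid-stride loop: correct for any grid size, so the same code fills
    // 1 SM or 100 SMs; the hardware decides where the blocks run.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        y[i] = a * x[i] + y[i];
}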
If you want to experiment with how varying the number of OpenACC gangs for a fixed amount of work affects performance, you'd do that with either the num_gangs clause, if you're using a parallel region, or the gang clause, if you're using kernels. Since you're trying to force a particular mapping of the loops, you're really better off using parallel, since that's the more prescriptive directive. What you'd want to do is something like the following:
#pragma acc parallel loop gang vector num_gangs(vary this number) vector_length(fix this number)
for(i=0; i<N; i++)
do something
This tells the compiler to vectorize the loop using some provided vector length and then partition the loop across OpenACC gangs. What I'd expect is that as you add gangs you'll see better performance up until some multiple of the number of SMs, at which point performance would become roughly flat (with outliers of course). As I said above, fixing the number of gangs at the point where you see optimal performance is not necessarily the best idea, unless this is the only device you're interested in. Instead, by either letting the compiler decide how to decompose the loop, which allows the compiler to make smart decisions based on the architecture you tell it to build for, or by exposing as many gangs as possible, which gives you additional parallelism that will strong scale to larger GPUs or multiple GPUs, you'd have more portable code.
For occupying a complete SMX I would suggest using shared memory as the limiting resource for occupancy. Write a kernel that consumes all 32KB of shared memory and the block will occupy the entire SMX, because the SMX is then out of resources for another block. Then you can scale up your blocks from 1 to 13 (for a K20c) and the scheduler will (hopefully) schedule each block to a different SMX. Then you can scale up the threads per block, first to 192 to get each CUDA core busy, and then further to keep the warp schedulers busy. GPUs provide performance through latency hiding, so you eventually have to move on from 1 block occupying an SMX to N blocks. You can do that by using less shared memory, and again scale up your warps to cover latency hiding.
I have never touched OpenACC, and if you really want full control over your experimental code, use CUDA instead of OpenACC. You cannot see inside the OpenACC compiler and what it is doing with the pragmas used in your code.
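A hedged sketch of the shared-memory trick described above (the ~40KB figure assumes a Kepler-class SMX configured with 48KB of shared memory; the kernel name and "work" are placeholders):

// Statically allocating more than half of the SMX's shared memory means at
// most one block of this kernel can be resident per SMX.
__global__ void oneBlockPerSMX(float *out)
{
    __shared__ float hog[10240];   // 10240 floats = 40KB of shared memory
    int tid = threadIdx.x;
    hog[tid] = (float)tid;         // touch it so the allocation is not removed
    __syncthreads();
    out[blockIdx.x * blockDim.x + tid] = hog[tid];   // placeholder work
}

// Launch with 1..13 blocks on a K20c (13 SMXs) to scale over SMXs; later,
// shrink the shared array to allow several blocks per SMX for latency hiding.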

How to adjust the CUDA number of blocks and threads to get optimal performance

I've tested empirically several values of blocks and threads, and the execution time can be greatly reduced with specific values.
I don't see what the differences between blocks and threads are. I figure it may be that threads in a block have specific cache memory, but it's quite fuzzy to me. For the moment, I parallelize my functions into N parts, which are allocated on blocks/threads.
My goal could be to automatically adjust the number of blocks and threads with regard to the size of the memory that I have to use. Could it be possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
When using shared memory you might want to consider it first, because it is a very limited resource and it's not unlikely for kernels to have very specific needs that constrain the many variables controlling parallelism. You either have blocks with many threads sharing larger regions or blocks with fewer threads sharing smaller regions (under constant occupancy).
If your code can live with as little as 16KB of shared memory per multiprocessor, you might want to opt for the larger (48KB) L1 cache by calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, L1 caches can be disabled for non-local global access using the compiler option -Xptxas=-dlcm=cg to avoid pollution when the kernel accesses global memory carefully.
Before worrying about optimal performance based on occupancy, you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or that appropriate optimization options are given; read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs. That is, register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
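As a hedged illustration of the launch-bounds knob mentioned here (the numbers 256 and 4 are arbitrary, and the kernel is a placeholder):

// Promise at most 256 threads per block and ask for at least 4 resident
// blocks per multiprocessor; the compiler then caps register usage per
// thread accordingly, possibly introducing spills. That trade-off is
// exactly what to watch in the profiler.
__global__ void __launch_bounds__(256, 4)
myKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];   // placeholder work
}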
It's theoretically possible to find optimal values from within an application; however, having the client code adjust optimally to both different devices and launch parameters can be nontrivial and will require recompilation, or different variants of the kernel to be deployed for every target device architecture.
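One partial way to do such an adjustment at run time is the occupancy API that the CUDA runtime has offered since CUDA 6.5; the sketch below (kernel and sizes are placeholders) picks a block size that maximizes theoretical occupancy, which is a reasonable starting point but not a guarantee of best performance:

#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        data[i] *= 2.0f;   // placeholder work
}

void launchAuto(float *data, int n)
{
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for the occupancy-maximizing block size for this
    // kernel on the current device (0 bytes of dynamic shared memory).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       scaleKernel, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;
    scaleKernel<<<gridSize, blockSize>>>(data, n);
}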
I believe that automatically adjusting the block and thread sizes is a highly difficult problem. If it were easy, CUDA would most probably have this feature for you.
The reason is that the optimal configuration depends on the implementation and the kind of algorithm you are implementing. It requires profiling and experimenting to get the best performance.
Here are some limiting factors you can consider:
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to better performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
I have a quite good answer here; in a word, computing the optimal distribution over blocks and threads is a difficult problem.