The title can't hold the whole question: I have a kernel doing a stream compaction, after which it continues using a smaller number of threads.
I know one way to avoid execution of unused threads: returning and executing a second kernel with a smaller block size.
What I'm asking is: provided the unused threads diverge and end (return), and provided they align in complete warps, can I safely assume they won't waste execution resources?
Is there a common practice for this, other than splitting the work into two consecutive kernel launches?
Thank you very much!
The unit of execution scheduling and resource scheduling within the SM is the warp - groups of 32 threads.
It is perfectly legal to retire threads in any order using return within your kernel code. However there are at least 2 considerations:
The usage of __syncthreads() in device code depends on having every thread in the block participating. So if a thread hits a return statement, that thread could not possibly participate in a future __syncthreads() statement, and so usage of __syncthreads() after one or more threads have retired is illegal.
From an execution efficiency standpoint (and also from a resource scheduling standpoint, although this latter concept is not well documented and somewhat involved to prove), a warp will still consume execution (and other) resources, until all threads in the warp have retired.
If you can retire your threads in warp units, and don't require the use of __syncthreads(), you should be able to make fairly efficient use of the GPU resources even in a threadblock that retires some warps.
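To make that concrete, here is a minimal sketch (the names and the doubling operation are purely illustrative, not from the question) of a post-compaction kernel that retires its excess threads in whole-warp units and performs no __syncthreads() after the early return:

__global__ void postCompaction(float *data, int compactedCount)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // round the live region up to a warp boundary so whole warps retire together
    int liveThreads = (compactedCount + 31) & ~31;
    if (tid >= liveThreads) return;

    if (tid < compactedCount) {
        data[tid] *= 2.0f;   // placeholder for the real post-compaction work
    }
    // no __syncthreads() beyond this point
}

The few threads between compactedCount and the next warp boundary stay resident but do no work; only warps that lie entirely past the live region retire as a unit.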
For completeness, a threadblock's dimensions are defined at kernel launch time, and they cannot and do not change at any point thereafter. All threadblocks have threads that eventually retire. The concept of retiring threads does not change a threadblock's dimensions, in my usage here (and consistent with usage of __syncthreads()).
Although probably not related to your question directly, CUDA Dynamic Parallelism could be another methodology to allow a threadblock to "manage" dynamically varying execution resources. However for a given threadblock itself, all of the above comments apply in the CDP case as well.
I'm writing a CUDA program with the dynamic parallelism mechanism, just like this:
__global__ void anotherKernel();   // child kernel, defined elsewhere

__global__ void parentKernel()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid != 0) return;
    else {
        anotherKernel<<<gridDim, blockDim>>>();   // only thread 0 launches the child grid
    }
}
I know the parent kernel will not quit until the child kernel finishes its work. Does that mean the register resources of the other threads in the parent kernel (all except tid==0) will not be reclaimed? Can anyone help me?
When and how a terminated thread's resources (e.g. register use) are returned to the machine for use by other blocks is unspecified, and empirically seems to vary by GPU architecture. The reasonable candidates here are that resources are returned at completion of the block, or at completion of the warp.
But that uncertainty need not go beyond the block level. A block that is fully retired returns its resources to the SM that it was resident on for future scheduling purposes. It does not wait for the completion of the kernel. This characteristic is self-evident(*) as being a necessity for the proper operation of a CUDA GPU.
Therefore for the example you have given, we can be sure that all threadblocks except the first threadblock will release their resources, at the point of the return statement. I cannot make specific claims about when exactly warps in the first threadblock may release their resources (except that when thread 0 terminates, resources will be released at that point, if not before).
(*) If it were not the case, a GPU would not be able to process a kernel with more than a relatively small number of blocks (e.g. for the latest GPUs, on the order of several thousand blocks.) Yet it is easy to demonstrate that even the smallest GPUs can process kernels with millions of blocks.
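The demonstration mentioned in (*) is easy to reproduce with a sketch along these lines (grid size is illustrative, error checking omitted; assumes a GPU of compute capability 3.0 or newer so the grid x-dimension can exceed 65535):

__global__ void tiny() { }

int main()
{
    // far more blocks than can ever be resident at once on any GPU
    tiny<<<4000000, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}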
I have two questions regarding __syncwarp() in CUDA:
If I understand correctly, a warp in CUDA is executed in a SIMD fashion. Does that not imply that all threads in a warp are always synchronized? If so, what exactly does __syncwarp() do, and why is it necessary?
Say we have a kernel launched with a block size of 1024, where the threads within a block are divided into groups of 32 threads each. Each thread communicates with other threads in its group via shared memory, but does not communicate with any thread outside its group. In such a kernel, I can see how a more granular synchronization than __syncthreads() may be useful, but since the warps the block is split into may not match the groups, how would one guarantee correctness when using __syncwarp()?
If I understand correctly, a warp in CUDA is executed in a SIMD fashion. Does that not imply that all threads in a warp are always synchronized?
No. There can be warp level execution divergence (usually branching, but can be other things like warp shuffles, voting, and predicated execution), handled by instruction replay or execution masking. Note that in "modern" CUDA, implicit warp synchronous programming is no longer safe, thus warp level synchronization is not just desirable, it is mandatory.
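That requirement shows up directly in the modern warp primitives, which all take an explicit mask naming the lanes expected to participate instead of relying on implicit lock-step execution. A minimal warp-sum sketch (names are illustrative; assumes each warp is fully active and the buffers are sized accordingly):

__global__ void warpSum(const int *in, int *out)
{
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;
    int v = in[gid];

    // tree reduction across the warp, with an explicit full-warp mask
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    if (lane == 0)
        out[gid >> 5] = v;   // one result per warp
}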
If so, what exactly does __syncwarp() do, and why is it necessary?
Because there can be warp level execution divergence, and this is how synchronization within a divergent warp is achieved.
Say we have a kernel launched with a block size of 1024, where the threads within a block are divided into groups of 32 threads each. Each thread communicates with other threads in its group via shared memory, but does not communicate with any thread outside its group. In such a kernel, I can see how a more granular synchronization than __syncthreads() may be useful, but since the warps the block is split into may not match the groups, how would one guarantee correctness when using __syncwarp()?
By ensuring that the split is always performed explicitly using calculated warp boundaries (or a suitable thread mask).
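Here is a minimal sketch of what that looks like when the 32-thread groups are deliberately laid out so that each group is exactly one warp (assumes the 1024-thread block from the question; names and the per-thread values are illustrative):

__global__ void groupKernel(float *out)
{
    __shared__ float buf[1024];

    int lane  = threadIdx.x & 31;   // position within the 32-thread group
    int group = threadIdx.x >> 5;   // group index == warp index by construction

    buf[threadIdx.x] = (float)threadIdx.x;   // stand-in for real per-thread data
    __syncwarp();                            // all 32 lanes of this warp participate

    // communicate only within the group: read the next lane's value
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[group * 32 + ((lane + 1) & 31)];
}

If the groups could not be guaranteed to be warp-aligned, the mask passed to __syncwarp() (and to the other *_sync primitives) would have to be computed from the group layout rather than defaulting to all 32 lanes.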
I read that the number of threads in a warp can be 32 or more. Why is that? If the number is less than 32 threads, does that mean the resources go underutilized, or that we will not be able to tolerate memory latency?
Your question needs clarification - perhaps you are confusing the CUDA "warp" and "block" concepts?
Regarding warps, it's important to remember that warps and their size are a property of the hardware. Warps are a grouping of hardware threads that execute the same instruction (these days) every cycle. In other words, the warp size is the SIMD-style execution width, something the programmer cannot change. In CUDA you launch blocks of threads which, when mapped to the hardware, get executed in warp-sized bunches. If you start blocks with a thread count that is not divisible by the warp size, the hardware will simply execute the last warp with some of the threads "masked out" (i.e. they do have to execute, but without any effect on the state of the GPU/memory).
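The usual pattern on the software side looks like this minimal sketch (sizes are illustrative): the grid is rounded up so every element is covered, and the bounds check effectively masks off the extra threads in the final, partially filled warp.

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // threads past the end do no useful work
        x[i] *= 2.0f;
}

// launch side:
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;   // round up
// scale<<<blocks, threads>>>(d_x, n);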
For more details I recommend reading carefully the hardware and execution-related sections of the CUDA programming guide.
This question is related to: Does Nvidia Cuda warp Scheduler yield?
However, my question is about forcing a thread block to yield by doing some controlled memory operation (which is heavy enough to make the thread block yield). The idea is to allow another ready-state thread block to execute on the now vacant multiprocessor.
The PTX manual v2.3 mentions (section 6.6):
...Much of the delay to memory can be hidden in a number of ways. The first is to have multiple threads of execution so that the hardware can issue a memory operation and then switch to other execution. Another way to hide latency is to issue the load instructions as early as possible, as execution is not blocked until the desired result is used in a subsequent (in time) instruction...
So it sounds like this can be achieved (despite being an ugly hack). Has anyone tried something similar? Perhaps with block_size = warp_size kind of setting?
EDIT: I've raised this question without clearly understanding the difference between resident and non-resident (but assigned to the same SM) thread blocks. So, the question should be about switching between two resident (warp-sized) thread blocks. Apologies!
In the CUDA programming model as it stands today, once a thread block starts running on a multiprocessor, it runs to completion, occupying resources until it completes. There is no way for a thread block to yield its resources other than returning from the global function that it is executing.
Multiprocessors will switch among warps of all resident thread blocks automatically, so thread blocks can "yield" to other resident thread blocks. But a thread block can't yield to a non-resident thread block without exiting -- which means it can't resume.
Starting from compute capability 7.0 (Volta), you have the __nanosleep() intrinsic, which puts a thread to sleep for approximately the given number of nanoseconds.
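A minimal sketch of how __nanosleep() is typically used (requires compiling for compute capability 7.0 or newer; the flag here is a hypothetical signal set by some other work):

__device__ volatile int flag = 0;   // hypothetical signal, set elsewhere

__global__ void waiter()
{
    unsigned ns = 8;
    while (flag == 0) {
        __nanosleep(ns);          // sleep ~ns nanoseconds instead of spinning hot
        if (ns < 256) ns *= 2;    // simple exponential backoff
    }
    // ... continue once the flag is set ...
}

Note that the sleeping warp still holds its registers and its slot on the SM; __nanosleep() reduces instruction-issue pressure, it does not make the block yield its resources.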
Another option (available since compute capability 3.5) is to start a grid with lower priority using the cudaStreamCreateWithPriority() call. This allows you to run one stream at a lower priority than another. Do note that some GPUs only have 2 priority levels, meaning that you may have to run your main code at high priority to keep it above work launched at the default priority.
Here's a code snippet:
// get the range of stream priorities for this device
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
// create streams with highest and lowest available priorities
cudaStream_t st_high, st_low;
cudaStreamCreateWithPriority(&st_high, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&st_low, cudaStreamNonBlocking, priority_low);
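Kernels are then simply launched into the two streams; the kernel names below are hypothetical, for illustration only:

dim3 grid(256), block(256);
// latency-sensitive work goes into the high-priority stream,
// background work into the low-priority one
mainWork<<<grid, block, 0, st_high>>>();
backgroundWork<<<grid, block, 0, st_low>>>();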
As I know, GPUs switch between warps to hide memory latency. But I wonder under which conditions a warp will be switched out. For example, if a warp performs a load and the data is already in the cache, is the warp switched out, or does it continue with the next computation? What happens if there are two consecutive adds?
Thanks
First of all, once a thread block is launched on a multiprocessor (SM), all of its warps are resident until they all exit the kernel. Thus a block is not launched until there are sufficient registers for all warps of the block, and until there is enough free shared memory for the block.
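To put rough numbers on that (purely as an illustration; the actual limits are architecture specific): on a GPU whose SM has a 64K-entry register file, a 1024-thread block compiled to 64 registers per thread needs the entire register file, so only one such block can be resident on that SM at a time.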
So warps are never "switched out" -- there is no inter-warp context switching in the traditional sense of the word, where a context switch requires saving registers to memory and restoring them.
The SM does, however, choose instructions to issue from among all resident warps. In fact, the SM is more likely to issue two instructions in a row from different warps than from the same warp, no matter what type of instruction they are, regardless of how much ILP (instruction-level parallelism) there is. Not doing so would expose the SM to dependency stalls. Even "fast" instructions like adds have a non-zero latency, because the arithmetic pipeline is multiple cycles long. On Fermi, for example, the hardware can issue 2 or more warp-instructions per cycle (peak), and the arithmetic pipeline latency is ~12 cycles. Therefore you need multiple warps in flight just to hide arithmetic latency, not just memory latency.
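Taking those Fermi numbers at face value, hiding arithmetic latency alone already calls for on the order of 2 instructions/cycle × 12 cycles ≈ 24 independent warp-instructions in flight per SM, before any memory latency enters the picture.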
In general, the details of warp scheduling are architecture dependent, not publicly documented, and pretty much guaranteed to change over time. The CUDA programming model is independent of the scheduling algorithm, and you should not rely on it in your software.