As the title says, I wonder whether it is possible to call a sort of __syncthreads() function where the barrier is not at block level but at sub-block level, so that I can synchronize all threads having a particular threadIdx.x.
For instance, if I define a kernel launch as <<<1, (32, 32)>>>, is it possible to define something like __syncthreads(5) so that it syncs all threads having threadIdx.x == 5?
Following the documentation, it seems that CUDA does not define such a function; however, I wonder whether there exists some trick that can achieve the same result.
Generally, no, this is not possible in CUDA. There are no provided methods to do this.
CUDA does provide __syncwarp() which allows synchronization of a warp (32 threads).
The CUDA cooperative groups mechanism does allow synchronization of only a subgroup of threads, but it does not give you an arbitrary method to assign threads to groups.
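For example, here is a minimal sketch of the cooperative groups approach (the kernel name and output array are illustrative), using a statically sized 32-thread tile; tile.sync() synchronizes only the threads of that tile:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tile_sync_sketch(int *out)
{
    cg::thread_block block = cg::this_thread_block();
    // Partition the block into tiles of 32 consecutive threads; the
    // partitioning is fixed, not an arbitrary assignment of threads to groups.
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    // ... per-tile work ...

    tile.sync();   // barrier for this 32-thread tile only

    out[block.thread_rank()] = tile.thread_rank();
}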
At the PTX level, there is more flexibility in the use of barriers. But you don't have the ability to assign an arbitrary set of threads to a barrier. (Instead, for example, arriving threads may simply be "counted").
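As an illustration of that PTX-level flexibility, a named barrier can be reached through inline PTX. This is only a sketch (the barrier number and thread count are illustrative), and the count must be a multiple of the warp size:

// bar.sync <name>, <count>: named barrier 1 completes once 64 threads
// (two warps' worth) have arrived at it. Arriving threads are counted,
// not identified.
__device__ void barrier_two_warps()
{
    asm volatile("bar.sync 1, 64;");
}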
My suggestion would be to use one of the above methods. For example, if you wanted to assign all threads with threadIdx.x == 5 in a (32,32) threadblock, that is 32 threads, the same size as a warp. Rework your thread assignment pattern so that those 32 threads belong to the same warp, and use __syncwarp().
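A minimal sketch of that suggestion (the kernel and array names are illustrative): in a (32,32) block the linear thread ID is threadIdx.x + threadIdx.y * blockDim.x and warps are built from 32 consecutive IDs, so threads sharing threadIdx.y already form one warp. Swap the roles of .x and .y in your indexing so the group you care about maps onto threadIdx.y:

__global__ void group_sync_sketch(float *data)
{
    int group = threadIdx.y;   // plays the role threadIdx.x had originally
    int lane  = threadIdx.x;   // 0..31 within the group == lane within the warp

    data[group * 32 + lane] += 1.0f;   // ... per-group work ...

    __syncwarp();              // synchronizes only the 32 threads of this group
}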
Related
I have two questions regarding __syncwarp() in CUDA:
If I understand correctly, a warp in CUDA is executed in an SIMD fashion. Does that not imply that all threads in a warp are always synchronized? If so, what exactly does __syncwarp() do, and why is it necessary?
Say we have a kernel launched with a block size of 1024, where the threads within a block are divided into groups of 32 threads each. Each thread communicates with other threads in its group via shared memory, but does not communicate with any thread outside its group. In such a kernel, I can see how a more granular synchronization than __syncthreads() may be useful, but since the warps the block is split into may not match the groups, how would one guarantee correctness when using __syncwarp()?
If I understand correctly, a warp in CUDA is executed in an SIMD fashion. Does that not imply that all threads in a warp are always synchronized?
No. There can be warp level execution divergence (usually branching, but can be other things like warp shuffles, voting, and predicated execution), handled by instruction replay or execution masking. Note that in "modern" CUDA, implicit warp synchronous programming is no longer safe, thus warp level synchronization is not just desirable, it is mandatory.
If so, what exactly does __syncwarp() do, and why is it necessary?
Because there can be warp level execution divergence, and this is how synchronization within a divergent warp is achieved.
Say we have a kernel launched with a block size of 1024, where the threads within a block are divided into groups of 32 threads each. Each thread communicates with other threads in its group via shared memory, but does not communicate with any thread outside its group. In such a kernel, I can see how a more granular synchronization than __syncthreads() may be useful, but since the warps the block is split into may not match the groups, how would one guarantee correctness when using __syncwarp()?
By ensuring that the split is always performed explicitly using calculated warp boundaries (or a suitable thread mask).
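For example, a minimal sketch under the assumptions of the question (a 1024-thread 1-D block split into groups of 32), with group boundaries chosen to coincide with warp boundaries so that __syncwarp() covers exactly one group; the names are illustrative:

__global__ void grouped_kernel(float *out)
{
    __shared__ float buf[1024];

    int lane  = threadIdx.x % 32;   // position within the group
    int group = threadIdx.x / 32;   // group index == warp index, because each
                                    // group starts on a warp boundary

    buf[group * 32 + lane] = (float)lane;
    __syncwarp();                   // all 32 threads of this group (one warp)

    // Safe: the value read was written by a thread of the same,
    // now-synchronized warp.
    out[threadIdx.x] = buf[group * 32 + ((lane + 1) % 32)];
}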
The title can't hold the whole question: I have a kernel doing stream compaction, after which it continues using fewer threads.
I know one way to avoid executing the unused threads: returning from them and then launching a second kernel with a smaller block size.
What I'm asking is: provided the unused threads diverge and end (return), and provided they align into complete warps, can I safely assume they won't waste execution?
Is there a common practice for this, other than splitting into two consecutive kernel executions?
Thank you very much!
The unit of execution scheduling and resource scheduling within the SM is the warp - groups of 32 threads.
It is perfectly legal to retire threads in any order using return within your kernel code. However there are at least 2 considerations:
The usage of __syncthreads() in device code depends on having every thread in the block participating. So if a thread hits a return statement, that thread could not possibly participate in a future __syncthreads() statement, and so usage of __syncthreads() after one or more threads have retired is illegal.
From an execution efficiency standpoint (and also from a resource scheduling standpoint, although this latter concept is not well documented and somewhat involved to prove), a warp will still consume execution (and other) resources, until all threads in the warp have retired.
If you can retire your threads in warp units, and don't require the usage of __syncthreads() you should be able to make fairly efficient usage of the GPU resources even in a threadblock that retires some warps.
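A minimal sketch of warp-unit retirement after a compaction step (the kernel and variable names are hypothetical); note there is no __syncthreads() after the early return:

__global__ void compacted_work(const int *compacted, int live_count, int *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = tid / 32;

    // Every thread of this warp is past the live range: the whole warp
    // returns, so it stops consuming execution resources.
    if (warp * 32 >= live_count)
        return;

    if (tid < live_count)
        out[tid] = compacted[tid] * 2;   // surviving threads keep working
}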
For completeness, a threadblock's dimensions are defined at kernel launch time, and they cannot and do not change at any point thereafter. All threadblocks have threads that eventually retire. The concept of retiring threads does not change a threadblock's dimensions, in my usage here (and consistent with usage of __syncthreads()).
Although probably not related to your question directly, CUDA Dynamic Parallelism could be another methodology to allow a threadblock to "manage" dynamically varying execution resources. However for a given threadblock itself, all of the above comments apply in the CDP case as well.
indirectJ2[MAX_SUPER_SIZE] is a shared array.
My CUDA device kernel contains the following statement (executed by all threads in the thread block):
int nnz_col = indirectJ2[MAX_SUPER_SIZE - 1];
I suspect this would cause bank conflicts.
Is there any way I can implement the above thread-block-level broadcast efficiently using the new shuffle instructions for Kepler GPUs? I understand how they work at warp level. Other solutions beyond shuffle instructions (for instance, the use of CUB etc.) are also welcome.
There is no bank conflict for that line of code on a K40. Shared memory accesses already offer a broadcast mechanism. Quoting from the programming guide:
"A shared memory request for a warp does not generate a bank conflict between two threads that access any sub-word within the same 32-bit word or within two 32-bit words whose indices i and j are in the same 64-word aligned segment (i.e., a segment whose first index is a multiple of 64) and such that j=i+32 (even though the addresses of the two sub-words fall in the same bank): In that case, for read accesses, the 32-bit words are broadcast to the requesting threads "
There is no such concept as shared memory bank conflicts at the threadblock level. Bank conflicts only pertain to the access pattern generated by the shared memory request emanating from a single warp, for a single instruction in that warp.
If you like, you can write a simple test kernel and use profiler metrics (e.g. shared_replay_overhead) to test for shared memory bank conflicts.
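A minimal sketch of such a test kernel (MAX_SUPER_SIZE and the kernel name are illustrative); profiling it with a metric such as shared_replay_overhead should report no shared-memory replays for the uniform read:

#define MAX_SUPER_SIZE 64   // illustrative value

__global__ void broadcast_test(int *out)
{
    __shared__ int indirectJ2[MAX_SUPER_SIZE];

    if (threadIdx.x == 0)
        indirectJ2[MAX_SUPER_SIZE - 1] = 123;
    __syncthreads();

    // Uniform read: the single 32-bit word is broadcast within each warp,
    // so there is no bank conflict.
    int nnz_col = indirectJ2[MAX_SUPER_SIZE - 1];
    out[blockIdx.x * blockDim.x + threadIdx.x] = nnz_col;
}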
Warp shuffle mechanisms do not extend beyond a single warp; therefore there is no short shuffle-only sequence that can broadcast a single quantity to multiple warps in a threadblock. Shared memory can be used to provide direct access to a single quantity to all threads in a warp; you are already doing that.
Global memory, __constant__ memory, and kernel parameters can all also be used to "broadcast" the same value to all threads in a threadblock.
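A minimal sketch of those alternatives (the symbol and parameter names are illustrative); the kernel parameter and the __constant__ variable both deliver the same value uniformly to every thread:

__constant__ int d_nnz_col;   // set from the host with cudaMemcpyToSymbol

__global__ void use_broadcast(int nnz_col_param, int *out)
{
    // Both reads are uniform across the whole threadblock: the kernel
    // parameter is the cheapest option, and __constant__ memory is cached
    // and broadcast to all threads of a warp reading the same address.
    out[blockIdx.x * blockDim.x + threadIdx.x] = nnz_col_param + d_nnz_col;
}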
There is very little information on Kepler's dynamic parallelism. From the description of this new technology, does it mean that the issue of thread control-flow divergence within the same warp is solved?
It allows recursion and launching a kernel from device code. Does it mean that different control paths in different threads can be executed simultaneously?
Take a look at this paper.
Dynamic parallelism, flow divergence and recursion are separate concepts. Dynamic parallelism is the ability to launch a kernel from within a kernel. This means, for example, you may do this:
__global__ void t_father(...) {
    ...
    t_child<<<BLOCKS, THREADS>>>();
    ...
}
I have personally investigated this area. When you do something like this, i.e. when t_father launches t_child, the whole GPU's resources are redistributed among them, and t_father waits until all the t_child launches have finished before it can go on (see also Slide 25 of this paper).
Recursion has been available since Fermi and is the ability for a thread to call itself without any other thread/block re-configuration.
Regarding flow divergence, I guess we will never see threads within a warp executing different code simultaneously.
No. The warp concept still exists. All the threads in a warp are SIMD (Single Instruction, Multiple Data), which means that at any given time they run one instruction. Even when you call a child kernel, the GPU designates one or more warps to your call. Keep 3 things in mind when you're using dynamic parallelism:
The deepest you can go is 24 (CC=3.5).
The number of dynamic kernels pending at the same time is limited (the default is 4096) but can be increased, as shown in the sketch after this list.
Keep the parent kernel busy after the child kernel call, otherwise there is a good chance you will waste resources.
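A minimal host-side sketch of raising the device-runtime limits mentioned above (the numeric values are illustrative):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Maximum nesting depth at which a parent may synchronize on its children.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);
    if (err != cudaSuccess)
        printf("sync depth: %s\n", cudaGetErrorString(err));

    // Number of outstanding child-kernel launches that can be buffered
    // (the 4096 default mentioned above).
    err = cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 8192);
    if (err != cudaSuccess)
        printf("pending launches: %s\n", cudaGetErrorString(err));

    return 0;
}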
There's a sample cuda source in this NVidia presentation on slide 9.
__global__ void convolution(int x[])
{
    for (int j = 1; j <= x[blockIdx.x]; j++)
        kernel<<< ... >>>(blockIdx.x, j);
}
It goes on to show how part of the CUDA control code is moved to the GPU, so that the kernel can spawn other kernel functions on partial compute domains of various sizes (slide 14).
The global compute domain and the partitioning of it are still static, so you can't actually go and change this DURING GPU computation to e.g. spawn more kernel executions because you've not reached the end of your evaluation function yet. Instead, you provide an array that holds the number of threads you want to spawn with a specific kernel.
I have three questions to ask:
1. If I create only one block of threads in CUDA and execute the parallel program on it, is it possible that more than one processor would be given to that single block, so that my program gets some benefit from the multiprocessor platform? To be more clear: if I use only one block of threads, how many processors will be allocated to it? As far as I know (I might have misunderstood it), one warp is given only a single processing element.
2. Can I synchronize the threads of different blocks? If yes, please give some hints on how to do it.
3. How do I find out the warp size? Is it fixed for particular hardware?
1. Is it possible that more than one processor would be given to a single block so that my program gets some benefit from the multiprocessor platform?
Simple answer: No.
The CUDA programming model maps one threadblock to one multiprocessor (SM); the block cannot be split across two or more multiprocessors and, once started, it will not move from one multiprocessor to another.
As you have seen, CUDA provides __syncthreads() to allow threads within a block to synchronise. This is a very low cost operation, and that's partly because all the threads within a block are in close proximity (on the same SM). If they were allowed to split then this would no longer be possible. In addition, threads within a block can cooperate by sharing data in the shared memory; the shared memory is local to a SM and hence splitting the block would break this too.
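A minimal sketch of that cooperation (the kernel name is illustrative, and a 256-thread 1-D block is assumed): threads of one block stage data in shared memory, and __syncthreads() makes every write visible before any thread reads a neighbour's value; everything involved lives on a single SM:

__global__ void reverse_in_block(const int *in, int *out)
{
    __shared__ int tile[256];                 // assumes blockDim.x == 256

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[gid];
    __syncthreads();                          // all writes to tile are visible

    // Each thread reads a value written by a different thread of its block.
    out[gid] = tile[blockDim.x - 1 - threadIdx.x];
}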
2. Can I synchronize the threads of different blocks?
Not really, no. There are some things you can do, like getting the very last block to do something special (see the threadFenceReduction sample in the SDK), but general synchronisation is not really possible. When you launch a grid, you have no control over the scheduling of the blocks onto the multiprocessors, so any attempt to do global synchronisation would risk deadlock.
3. How do I find out the warp size? Is it fixed for particular hardware?
Yes, it is fixed. In fact, for all current CUDA capable devices (both compute capability 1.x and 2.0) it is fixed at 32. If you are relying on the warp size, you should ensure forward compatibility by querying it rather than hard-coding it.
In device code you can just use the special variable warpSize. In host code you can query the warp size for a specific device with:
cudaError_t result;
int deviceID;
struct cudaDeviceProp prop;

// Identify the currently active device.
result = cudaGetDevice(&deviceID);
if (result != cudaSuccess)
{
    ...
}

// Query its properties; the warp size is one of them.
result = cudaGetDeviceProperties(&prop, deviceID);
if (result != cudaSuccess)
{
    ...
}

int warpSize = prop.warpSize;
As of CUDA 2.3, it is one multiprocessor per thread block. It might be different with CUDA 3 / Fermi processors; I do not remember.
Not really, but... depending on your requirements you may find a workaround; read this post: CUDA: synchronizing threads.
#3. You can query SIMDWidth using cuDeviceGetProperties - see doc
To synchronize threads across multiple blocks (at least as far as memory updates are concerned), you can use the new __threadfence_system() call, which is only available on Fermi devices (Compute Capability 2.0 and better). This function is described in the CUDA Programming guide for CUDA 3.0.
Can I synchronize threads of different blocks with the following approach? Please do tell me if there is any problem in this approach (I think there will be some, but since I'm not very experienced in CUDA I might not have considered some facts).
__global__ void sync_func(int *glob_var){
    int i = 0;                                   // local variable to each thread
    int total_threads = gridDim.x * blockDim.x;  // all threads in the grid
    while (*glob_var != total_threads){
        if (i == 0){
            atomicAdd(glob_var, 1);
            i = 1;
        }
    }
    // execute the code which is to be executed at the same time by all threads;
}