Do warp vote functions synchronize threads in the warp? - cuda

Do CUDA warp vote functions, such as __any() and __all(), synchronize threads in the warp?
In other words, is there any guarantee that all threads inside the warp have executed the instructions preceding the warp vote function, especially the instruction(s) that compute the predicate?

The synchronization is implicit, since threads within a warp execute in lockstep. [*]
Code that relies on this behavior is known as "warp synchronous."
[*] If you are thinking that conditional code will cause threads within a warp to follow different execution paths, you have more to learn about how CUDA hardware works. Divergent conditional code (i.e. conditional code where the condition is true for some threads but not for others) causes certain threads within the warp to be disabled (either by predication or the branch synchronization stack), but each thread still occupies one of the 32 lanes available in the warp.
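As a sketch of such "warp synchronous" code, here is a minimal intra-warp reduction that relies on this implicit lockstep (a hypothetical kernel, assuming one 32-thread warp per block; note that on Volta and newer GPUs, with independent thread scheduling, this pattern additionally requires explicit __syncwarp() calls):

```cuda
// Hypothetical warp-synchronous reduction: no __syncthreads() between steps.
// Assumes each block consists of a single 32-thread warp.
__global__ void warpReduce(int *data)
{
    // volatile prevents the compiler from caching values in registers
    __shared__ volatile int s[32];
    int lane = threadIdx.x & 31;
    s[lane] = data[blockIdx.x * 32 + lane];

    // Each step assumes the whole warp finished the previous step in lockstep.
    if (lane < 16) s[lane] += s[lane + 16];
    if (lane <  8) s[lane] += s[lane +  8];
    if (lane <  4) s[lane] += s[lane +  4];
    if (lane <  2) s[lane] += s[lane +  2];
    if (lane <  1) s[lane] += s[lane +  1];

    if (lane == 0) data[blockIdx.x] = s[0];  // lane 0 holds the warp's sum
}
```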

They don't. You can use warp vote functions within code branches. If they synchronized in such a case, a deadlock would be possible. From the PTX ISA:
vote
Vote across thread group.
Syntax
vote.mode.pred d, {!}a;
vote.ballot.b32 d, {!}a; // 'ballot' form, returns bitmask
.mode = { .all, .any, .uni };
Description
Performs a reduction of the source predicate across threads in a warp. The destination predicate value is the same across all threads in the warp.
The reduction modes are:
.all
True if source predicate is True for all active threads in warp. Negate the source predicate to compute .none.
.any
True if source predicate is True for some active thread in warp. Negate the source predicate to compute .not_all.
.uni
True if source predicate has the same value in all active threads in warp. Negating the source predicate also computes .uni.
In the ballot form, vote.ballot.b32 simply copies the predicate from each thread in a warp into the corresponding bit position of destination register d, where the bit position corresponds to the thread's lane id.
EDIT:
Since threads within a warp are implicitly synchronized, you don't have to manually ensure that the threads are properly synchronized when the vote takes place. Note that for __all only active threads participate in the vote. Active threads are the threads currently executing, i.e. those for which the branch condition was true. This explains why a vote can occur within code branches.
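As a sketch, this is what a vote inside a divergent branch can look like (hypothetical kernel; the bare __any()/__all() intrinsics shown here are the pre-compute-capability-7.0 forms, replaced by __any_sync()/__all_sync() on newer GPUs):

```cuda
// Hypothetical example: only the threads that enter the branch are active,
// and only they participate in the vote.
__global__ void voteInBranch(const int *in, int *out)
{
    int v = in[threadIdx.x];
    if (v > 0) {                         // warp may diverge here
        int allEven = __all(v % 2 == 0); // vote among the active threads only
        out[threadIdx.x] = allEven;
    } else {
        out[threadIdx.x] = -1;           // inactive for the vote above
    }
}
```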

Related

'ballot' behavior on inactive lanes

Warp voting functions can be invoked within a diverging branch, and their effects are considered only among active threads. However, I am unsure how ballot works in that case. Do inactive threads always contribute 0, or is the result undefined?
Similar question: Do warp vote functions synchronize threads in the warp?
One answer quotes PTX ISA, which contains a sentence
In the ballot form, vote.ballot.b32 simply copies the predicate from
each thread in a warp into the corresponding bit position of
destination register d, where the bit position corresponds to the
thread's lane id.
but it does not explain how inactive threads are treated.
From the documentation:
For each of these warp vote operations, the result excludes threads that are inactive (e.g., due to warp divergence). Inactive threads are represented by 0 bits in the value returned by __ballot() and are not considered in the reductions performed by __all() and __any().
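A sketch illustrating this (hypothetical kernel, using the pre-compute-capability-7.0 __ballot(); assuming a full warp of 32 threads, only the odd lanes are active inside the branch, so the returned mask has only the odd bit positions set, i.e. 0xAAAAAAAA):

```cuda
// Hypothetical example: inactive lanes contribute 0 bits to the ballot.
__global__ void ballotInBranch(unsigned int *out)
{
    if (threadIdx.x % 2 == 1) {              // only odd lanes are active here
        unsigned int mask = __ballot(1);     // even lanes contribute 0 bits
        if (threadIdx.x % 32 == 1)
            out[threadIdx.x / 32] = mask;    // 0xAAAAAAAA for a full warp
    }
}
```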

Is there a way to avoid redundant computations by warps in a thread block on partially overlapped arrays

I want to design a CUDA kernel with thread blocks in which each warp reads its own 1-D array. Suppose a thread block with two warps takes the two arrays {1,2,3,4} and {2,4,6,8}. Each warp then performs some computation by reading its own array. The computation is done on a per-element basis, which means the thread block performs redundant computations for the elements 2 and 4, which appear in both arrays.
Here is my question: how can I avoid such redundant computations?
Precisely, I want a warp to skip the computation for an element once that element has already been touched by another warp; otherwise the computation proceeds normally, because no warp has touched the element before.
Using a hash table in the shared memory dedicated to a thread block might be an option, but I worry about performance degradation from the hash-table accesses every time a warp reads elements of an array.
Any idea or comments?
In parallel computation on many-core coprocessors, it is desirable to perform arithmetic operations on an independent set of data, i.e. to eliminate any sort of dependency among the vectors which you provide to the threads/warps. That way, the computations can run in parallel. If you want to keep track of elements that you have previously computed (in this case, 2 and 4, which are common to the two input arrays), you have to serialize and create branches, which in turn diminishes computing performance.
In conclusion, check whether it is possible to eliminate the redundancy at the input level, by reducing the input vectors to ones with distinct components. If not, skipping the redundant computations of repeated components may not necessarily improve performance, since the computations are performed in batch.
Let us try to understand what happens on the hardware level.
First of all, computation in CUDA happens via warps. A warp is a group of 32 threads which are synchronized with each other. An instruction is executed on all the threads of the corresponding warp at the same time instant. So technically it's not the threads but the warps that ultimately execute on the hardware.
Now, let us suppose that somehow you are able to keep track of which elements does not need computation so you put a condition in the kernel like
...
if (computationNeeded) {
    compute();
}
else {
    ...
}
...
Now let us assume that there are 5 threads in a particular warp for which "computationNeeded" is false, so they don't need the computation. But by our definition, all threads of a warp execute the same instruction. So even if you put in these conditions, all threads have to step through both the if and the else blocks.
Exception:
It will be faster if all the threads of a particular warp take either the if or the else path. But that is an exceptional case for almost all real-world algorithms.
Suggestion
Put a pre-processing step, either on the CPU or the GPU, which eliminates the redundant elements from the input data. Also, check whether this step is worth its cost.
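One way to sketch such a pre-processing step on the GPU is with Thrust (an illustrative host-side snippet, not the asker's code; the function name is hypothetical):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/unique.h>

// Remove duplicate elements from the device array before the main kernel runs.
void deduplicate(thrust::device_vector<int> &d)
{
    thrust::sort(d.begin(), d.end());                 // unique() needs sorted input
    auto newEnd = thrust::unique(d.begin(), d.end()); // compacts adjacent duplicates
    d.erase(newEnd, d.end());                         // shrink to the unique elements
}
```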
Precisely, I want to make a warp skip the computation of an element once the element has been already touched by other warps, otherwise the computation goes normally because any warps never touched the element before.
Using a hash table on the shared memory dedicated into a thread block
may be considered
If you want inter-warp communication between ALL the warps of the grid, it is only possible via global memory, not shared memory.

How is a warp formed and handled by the hardware warp scheduler?

My questions are about warps and scheduling. I'm using NVIDIA Fermi terminology here. My observations are below, are they correct?
A. Threads in the same warp execute the same instruction. Each warp includes 32 threads.
According to the Fermi Whitepaper:
"Fermi’s dual warp scheduler selects two warps, and issues one
instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. "
From here, I think a warp (32 threads) is scheduled twice, since 16 cores out of 32 are grouped together. Each scheduler issues half of a warp to 16 cores in a cycle, and in all, the two schedulers issue two warp-halves to the two 16-core groups in a cycle. In other words, one warp needs to be scheduled twice, half by half, on this Fermi architecture. If a warp contains only SFU operations, then this warp needs to be issued 8 times (32/4), since there are only 4 SFUs in an SM.
B. When a large amount of threads (say 1-D array, 320 threads) is launched, consecutive threads will be grouped into 10 warps automatically, each has 32 threads. Therefore, if all threads are doing the same work, they will execute exactly the same instruction. Then all warps are always carrying the same instruction in this case.
Questions:
Q1. Which part handles the grouping of threads into warps, software or hardware? If hardware, is it the warp scheduler? And how is the hardware warp scheduler implemented, and how does it work?
Q2. If I have 64 threads, with threads 0-15 and 32-47 executing one instruction while 16-31 and 48-63 execute another, is the scheduler smart enough to group nonconsecutive threads (with the same instruction) into the same warp (i.e., to group threads 0-15 and 32-47 into one warp, and threads 16-31 and 48-63 into another)?
Q3. What's the point of having a warp size (32) larger than the scheduling group size (16 cores)? (This is a hardware question.) Since in this case (Fermi) a warp is scheduled twice (in two cycles) anyway, if a warp were 16 wide, two warps would simply be scheduled (also in two cycles), which seems the same as the previous case. I wonder whether this organization is due to a performance concern.
What I can imagine now is: threads in the same warp are guaranteed to be synchronized, which can be useful sometimes; or other resources, such as registers and memory, are organized on a warp-size basis. I'm not sure whether this is correct.
Correcting some misconceptions:
A. ...From here, I think a warp(32 threads) is scheduled twice since 16 cores out of 32 are grouped together.
When the warp instruction is issued to a group of 16 cores, the entire warp executes the instruction, because the cores are clocked twice (Fermi's "hotclock") so that each core actually executes two thread's worth of computation in a single cycle (= 2 hotclocks). When a warp instruction is dispatched, the entire warp gets serviced. It does not need to be scheduled twice.
B. ...Therefore, if all threads are doing the same work, they will execute exactly the same instruction. Then all warps are always carrying the same instruction in this case.
It's true that all threads in a block (and therefore all warps) are executing from the same instruction stream, but they are not necessarily executing the same instruction. Certainly all threads in a warp are executing the same instruction at any given time. But warps execute independently from each other and so different warps within a block may be executing different instructions from the stream, at any given time. The diagram on page 10 of the Fermi whitepaper makes this evident.
Q1: Which part handles the threads grouping (into warps)? software or hardware?
It is done by hardware, as explained in the hardware implementation section of the programming guide: "The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block. "
and how the hardware warp scheduler is implemented and work?
I don't believe this is formally documented anywhere. Greg Smith has provided various explanations about it, and you may wish to search on "user:124092 scheduler" or a similar query, to read some of his comments.
Q2. If I have 64 threads, threads 0-15 and 32-47 are executing the same instruction while 16-31 and 48-63 executes another instruction, is the scheduler smart enough to group nonconsecutive threads( with the same instruction) into the same warp (i.e., to group threads 0-15 and 32-47 into the same warp, and to group threads 16-31 and 48-63 into another warp)?
This question is predicated on misconceptions outlined earlier. The grouping of threads into a warp is not dynamic; it is fixed at threadblock launch time, and it follows the methodology described above in the answer to Q1. Furthermore, threads 0-15 will never be scheduled with any threads other than 16-31, as 0-31 comprise a warp, which is indivisible for scheduling purposes, on Fermi.
Q3. What's the point to have a warp size(32) larger than the scheduling group size(16 cores)?
Again, I believe this question is predicated on previous misconceptions. The hardware units used to provide resources for a warp may exist in 16 units (or some other number) at some functional level, but from an operational level, the warp is scheduled as 32 threads, and each instruction is scheduled for the entire warp, and executed together, within some number of Fermi hotclocks.
As far as I know:
Q1 - Scheduling is done at the hardware level. Warps are the scheduling units, and warps, their constituent lanes (a laneid is the hardware equivalent of the thread index within a warp), SMs, and the other components at this level are all hardware units which are abstracted away and programmed via the CUDA programming model.
Q2 - It also depends on the grid: if you launch two blocks containing a single thread each, you end up with two warps, each of which contains only one active thread. As I said, all scheduling and execution is done on a warp basis, and the more warps the hardware has, the more it can schedule (although they may contain dummy NOP threads) to hide latency and reduce instruction pipeline stalls.
Q3 - Once resources are allocated, threads are always divided into 32-thread warps. On Fermi, the warp schedulers pick two warps per cycle and dispatch them to the execution units. On pre-Fermi architectures, SMs had fewer than 32 thread processors; Fermi has 32. However, a full memory request can only retrieve 128 bytes at a time. Therefore, for data sizes larger than 32 bits per thread per transaction, the memory controller may still break the request down into half-warp-sized pieces (https://stackoverflow.com/a/14927626/1938163). Besides,
The SM schedules threads in groups of 32 parallel threads called
warps. Each SM features two warp schedulers and two instruction
dispatch units, allowing two warps to be issued and executed
concurrently. Fermi’s dual warp scheduler selects two warps, and
issues one instruction from each warp to a group of sixteen cores,
sixteen load/store units, or four SFUs.
you don't have a "scheduling group size" at the thread level as you wrote; if you re-read the above statement, you'll see that 16 cores (or 16 load/store units or 4 SFUs) are readied with one instruction from a 32-thread warp each. If you were asking "why 16?", well, that's another architectural story, and I suspect it's a carefully designed tradeoff. I'm sorry, but I don't know more about this.

Does __syncthreads() synchronize all threads in the grid?

...or just the threads in the current warp or block?
Also, when the threads in a particular block encounter (in the kernel) the following line
__shared__ float srdMem[128];
will they just declare this space once (per block)?
They all obviously operate asynchronously, so if thread 23 in block 22 is the first thread to reach this line, and thread 69 in block 22 is the last one to reach it, will thread 69 know that the array has already been declared?
The __syncthreads() command is a block-level synchronization barrier. That means it is safe to use when all threads in a block reach the barrier. It is also possible to use __syncthreads() in conditional code, but only when all threads evaluate such code identically; otherwise the execution is likely to hang or produce unintended side effects [4].
Example of using __syncthreads(): (source)
__global__ void globFunction(int *arr, int N)
{
    __shared__ int local_array[THREADS_PER_BLOCK]; // local block memory cache
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int results = arr[idx]; // ...calculate results...
    local_array[threadIdx.x] = results;
    // synchronize the local threads writing to the local memory cache
    __syncthreads();
    // read the result of another thread in the current thread
    int val = local_array[(threadIdx.x + 1) % THREADS_PER_BLOCK];
    // write the value back to global memory
    arr[idx] = val;
}
To synchronize all threads in a grid there is currently no native API call. One way of synchronizing threads at the grid level is to use consecutive kernel calls, since at that point all threads finish and start again from the same point. This is commonly called CPU synchronization or implicit synchronization.
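A minimal sketch of the consecutive-kernel technique (hypothetical kernels K1 and K2 standing in for the two phases):

```cuda
// Hypothetical two-phase computation: the kernel boundary acts as a
// grid-wide barrier, because kernels launched into the same stream
// execute in order, and all global writes of K1 are visible to K2.
__global__ void K1(float *buf) { buf[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f; }
__global__ void K2(float *buf) { buf[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }

void runPhases(float *buf, int blocks, int threads)
{
    K1<<<blocks, threads>>>(buf);  // phase 1: every block writes
    K2<<<blocks, threads>>>(buf);  // phase 2: runs only after K1 finishes
}
```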
Regarding the second question. Yes, it does declare the amount of shared memory specified per block. Take into account that the quantity of available shared memory is measured per SM. So one should be very careful how the shared memory is used along with the launch configuration.
I agree with all the answers here, but I think we are missing one important point with respect to the first question. I am not answering the second question, as it has been answered perfectly above.
Execution on a GPU happens in units of warps. A warp is a group of 32 threads, and at any time instant each thread of a particular warp executes the same instruction. If you allocate 128 threads in a block, that is (128/32 =) 4 warps for the GPU.
Now the question becomes: "If all threads are executing the same instruction, then why is synchronization needed?" The answer is that we need to synchronize the warps that belong to the SAME block. __syncthreads does not synchronize the threads in a warp; they are already synchronized. It synchronizes the warps that belong to the same block.
That is why the answer to your question is: __syncthreads does not synchronize all threads in a grid, but only the threads belonging to one block, since each block executes independently.
If you want to synchronize a grid then divide your kernel (K) into two kernels(K1 and K2) and call both. They will be synchronized (K2 will be executed after K1 finishes).
__syncthreads() waits until all threads within the same block have reached the command; that means all warps that belong to a thread block must reach the statement.
If you declare shared memory in a kernel, the array will only be visible to one thread block. So each block will have its own shared memory block.
Existing answers have done a great job answering how __syncthreads() works (it allows intra-block synchronization); I just wanted to add an update that there are now newer methods for inter-block synchronization. Since CUDA 9.0, "Cooperative Groups" have been introduced, which allow synchronizing an entire grid of blocks (as explained in the CUDA Programming Guide). This achieves the same functionality as launching a new kernel (as mentioned above), but can usually do so with lower overhead and make your code more readable.
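A minimal sketch of grid-wide synchronization with Cooperative Groups (hypothetical kernel; it must be launched via cudaLaunchCooperativeKernel, and the device must support cooperative launch):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void twoPhaseKernel(float *data)
{
    cg::grid_group grid = cg::this_grid();
    // phase 1: every block writes its part of data ...
    grid.sync();   // grid-wide barrier: all blocks reach this point
    // phase 2: safely read what other blocks wrote ...
}
```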
To provide further details, aside from the other answers, quoting seibert:
More generally, __syncthreads() is a barrier primitive designed to protect you from read-after-write memory race conditions within a block.
The rules of use are pretty simple:
Put a __syncthreads() after the write and before the read when there is a possibility of a thread reading a memory location that another thread has written to.
__syncthreads() is only a barrier within a block, so it cannot protect you from read-after-write race conditions in global memory unless the only possible conflict is between threads in the same block. __syncthreads() is pretty much always used to protect shared memory read-after-write.
Do not use a __syncthreads() call in a branch or a loop until you are sure every single thread will reach the same __syncthreads() call. This can sometimes require that you break your if-blocks into several pieces to put __syncthread() calls at the top-level where all threads (including those which failed the if predicate) will execute them.
When looking for read-after-write situations in loops, it helps to unroll the loop in your head when figuring out where to put __syncthread() calls. For example, you often need an extra __syncthreads() call at the end of the loop if there are reads and writes from different threads to the same shared memory location in the loop.
__syncthreads() does not mark a critical section, so don’t use it like that.
Do not put a __syncthreads() at the end of a kernel call. There’s no need for it.
Many kernels do not need __syncthreads() at all because two different threads never access the same memory location.
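To illustrate the loop rule above, here is a sketch of a kernel that needs a __syncthreads() both after the read and after the write inside the loop (hypothetical kernel, assuming a block of exactly 256 threads):

```cuda
// Hypothetical neighbour-averaging loop: unrolling one iteration in your
// head shows why both barriers are needed.
__global__ void smooth(float *g)
{
    __shared__ float s[256];              // assumes blockDim.x == 256
    int t = threadIdx.x;
    s[t] = g[t];
    __syncthreads();                      // everyone finished writing s[t]

    for (int step = 0; step < 8; ++step) {
        float v = s[(t + 1) % 256];       // read a neighbour's value
        __syncthreads();                  // all reads done before any write
        s[t] = 0.5f * (s[t] + v);
        __syncthreads();                  // all writes done before next read
    }
    g[t] = s[t];
}
```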

Which is better: atomic contention between threads of a single warp, or between threads of different warps?

Which is better: atomic contention (concurrency) between threads of a single warp, or between threads of different warps in one block? My guess is that for shared memory access it is better when the competing threads belong to a single warp, while for global memory access it is, on the contrary, better when threads of different warps of one block compete rather than threads of a single warp. Is that right?
I need this to know how best to resolve contention and how best to separate the stores: between threads in a single warp or between warps.
Incidentally, is it correct that __syncthreads() synchronizes the warps in a single block, and not the threads of one warp?
If a significant number of threads in a block perform atomic updates to the same value, you will get poor performance since those threads must all be serialized. In such cases, it is usually better to have each thread write its result to a separate location and then, in a separate kernel, process those values.
If each thread in a warp performs an atomic update to the same value, all the threads in the warp perform the update in the same clock cycle, so they must all be serialized at the point of the atomic update. This probably means that the warp is scheduled 32 times to get all the threads serviced (very bad).
On the other hand, if a single thread in each warp in a block performs an atomic update to the same value, the impact will be lower because the pairs of warps (the two warps processed at each clock by the two warp schedulers) are offset in time (by one clock cycle), as they move through the processing pipelines. So you end up with only two atomic updates (one from each of the two warps), getting issued within one cycle and needing to immediately be serialized.
So, in the second case, the situation is better, but still problematic. The reason is that, depending on where the shared value is, you can still get serialization between SMs, and this can be very slow since each thread may have to wait for updates to go all the way out to global memory, or at least to L2, and then back. It may be possible to refactor the algorithm in such a way that threads within a block perform atomic updates to a value in shared memory (L1), and then have one thread in each block perform an atomic update to a value in global memory (L2).
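A sketch of that two-level refactoring (hypothetical kernel counting positive elements; threads contend only on a shared-memory counter, and a single thread per block performs the global atomic):

```cuda
__global__ void twoLevelCount(const int *in, int *globalCount, int n)
{
    __shared__ int blockCount;            // per-block partial count in shared memory
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0)
        atomicAdd(&blockCount, 1);        // contention stays inside the block

    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(globalCount, blockCount); // one global atomic per block
}
```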
The atomic operations can be complete lifesavers but they tend to be overused by people new to CUDA. It is often better to use a separate step with a parallel reduction or parallel stream compaction algorithm (see thrust::copy_if).