'ballot' behavior on inactive lanes - cuda

Warp vote functions can be invoked within a diverging branch, and their effects are considered only among the active threads. However, I am unsure how ballot works in that case. Do inactive threads always contribute 0? Or is the result undefined?
Similar question: Do warp vote functions synchronize threads in the warp?
One answer quotes the PTX ISA, which contains this sentence:
In the ballot form, vote.ballot.b32 simply copies the predicate from
each thread in a warp into the corresponding bit position of
destination register d, where the bit position corresponds to the
thread's lane id.
but it does not explain how inactive threads are treated.

From the documentation:
For each of these warp vote operations, the result excludes threads that are inactive (e.g., due to warp divergence). Inactive threads are represented by 0 bits in the value returned by __ballot() and are not considered in the reductions performed by __all() and __any().
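As a minimal sketch of that documented behavior (this example is mine, not from the original posts), consider a kernel in which only the even lanes reach the vote; the odd lanes are inactive at the call site and show up as 0 bits in the result:
__global__ void ballot_demo(unsigned int *out)
{
    unsigned int lane = threadIdx.x % 32;
    if (lane % 2 == 0) {                    // diverging branch: odd lanes are inactive here
        unsigned int mask = __ballot(1);    // every active lane votes "true"
        if (lane == 0)
            out[threadIdx.x / 32] = mask;   // expected 0x55555555: only even bit positions set
    }
}
// On Volta and later, __ballot() is replaced by __ballot_sync(mask, pred)
// with an explicit member mask.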


Control flow efficiency

As stated in the Branch statistic manual, there are two metrics: branch efficiency and control flow efficiency.
The former has a hardware counter branch_efficiency. However, it seems that there is no direct hardware counter for the latter. Is it possible to find the ratio of executed and issued control flow instructions and use that as the second efficiency metric? Or the control flow utilization metric cf_fu_utilization?
Since control flow efficiency can be interpreted as the average number of threads that are active per instruction in a warp, I guess that warp_execution_efficiency can also be used, since the definition says
Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor
Any comment on that?
Both branch efficiency and control flow efficiency are metrics. Branch efficiency can be collected in a single pass and is shown as per-SM values. Control flow efficiency is smsp__thread_inst_executed / smsp__inst_executed / WARP_SIZE * 100.0. These counters cannot be collected from all SMs in a single pass on all hardware, so the metric is shown on the chart as an average across all SMs.
If using CUPTI/NVPROF the hardware events are:
inst_executed: Number of instructions executed per warp.
WARNING: The description says "per warp", but the reported value is actually the sum over all warps.
thread_inst_executed: Number of instructions executed by the active threads. For each instruction it increments by the number of threads, including predicated-off threads, that execute the instruction. It does not include replays.
not_predicated_off_thread_inst_executed: Number of thread instructions executed that are not predicated off.
These events can be used to calculate either average_threads_executed_per_inst_executed or average_threads_executed_not_predicated_off_per_inst_executed. Either ratio can be converted to a percentage by dividing by 32 and multiplying by 100.0.
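For example (my own sketch, assuming you have already collected the raw event values through CUPTI or nvprof), the conversion to percentages is just:
// Average active threads per executed warp instruction, as a percentage.
double thread_exec_pct(unsigned long long thread_inst_executed,
                       unsigned long long inst_executed)
{
    return (double)thread_inst_executed / (double)inst_executed / 32.0 * 100.0;
}
// Same ratio, counting only threads that were not predicated off.
double not_pred_off_pct(unsigned long long not_predicated_off_thread_inst_executed,
                        unsigned long long inst_executed)
{
    return (double)not_predicated_off_thread_inst_executed
         / (double)inst_executed / 32.0 * 100.0;
}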
The compiler will use predication instead of a branch if the body of the conditional is small (several instructions).

Is warp shuffling with less than a full warp safe?

The CUDA documentation tells us that the result of a warp shuffle is undefined if the origin thread is "inactive". Does that mean we can safely shuffle with only part of the threads, and only need to pay attention to the junk data coming from the inactive ones? Or might the entire shuffle output be garbage?
If the target thread is inactive, the retrieved value is undefined.
My understanding is that the value returned to the thread that targeted an inactive thread is undefined. Threads that target an active thread behave as normal.
So you can get correct answers from shuffle in diverged code, so long as your target has followed the same path through the divergence.
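A small sketch of that last point (my example, not from the original answer): if both the reading lane and its source lane took the same side of the branch, the shuffled value is well defined, whereas reading from a lane on the other side would give an undefined value.
__global__ void partial_shuffle(int *out)
{
    int lane = threadIdx.x % 32;
    if (lane < 16) {                        // only the lower half-warp is active here
        int v = lane * 10;
        int partner = __shfl(v, lane ^ 1);  // lane ^ 1 is also < 16, so the source is active
        out[threadIdx.x] = partner;         // well defined
        // __shfl(v, lane + 16) would read from an inactive lane: undefined result.
    }
}
// On Volta and later, use __shfl_sync(mask, v, srcLane) with an explicit member mask.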

How is a warp formed and handled by the hardware warp scheduler?

My questions are about warps and scheduling. I'm using NVIDIA Fermi terminology here. My observations are below; are they correct?
A. Threads in the same warp execute the same instruction. Each warp includes 32 threads.
According to the Fermi Whitepaper:
"Fermi’s dual warp scheduler selects two warps, and issues one
instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. "
From here, I think a warp (32 threads) is scheduled twice, since 16 cores out of 32 are grouped together. Each scheduler issues half of a warp to 16 cores in a cycle, so in all, the two schedulers issue two warp-halves onto two 16-core scheduling groups per cycle. In other words, one warp needs to be scheduled twice, half by half, on this Fermi architecture. If a warp contains only SFU operations, then this warp needs to be issued 8 times (32/4), since there are only 4 SFUs in an SM.
B. When a large number of threads (say a 1-D array of 320 threads) is launched, consecutive threads will be grouped into 10 warps automatically, each with 32 threads. Therefore, if all threads are doing the same work, they will execute exactly the same instruction. Then all warps are always carrying the same instruction in this case.
Questions:
Q1. Which part handles the grouping of threads into warps? Software or hardware? If hardware, is it the warp scheduler? And how is the hardware warp scheduler implemented, and how does it work?
Q2. If I have 64 threads, where threads 0-15 and 32-47 execute one instruction while threads 16-31 and 48-63 execute another, is the scheduler smart enough to group nonconsecutive threads (with the same instruction) into the same warp (i.e., to group threads 0-15 and 32-47 into one warp, and threads 16-31 and 48-63 into another warp)?
Q3. What's the point of having a warp size (32) larger than the scheduling group size (16 cores)? (This is a hardware question.) Since in this case (Fermi), a warp will be scheduled twice (in two cycles) anyway. If a warp were 16 wide, simply two warps would be scheduled (also in two cycles), which seems the same as the previous case. I wonder whether this organization is due to performance concerns.
What I can imagine now is: threads in the same warp are guaranteed to be synchronized, which can be useful sometimes, or other resources such as registers and memory are organized on a warp-size basis. I'm not sure whether this is correct.
Correcting some misconceptions:
A. ...From here, I think a warp (32 threads) is scheduled twice, since 16 cores out of 32 are grouped together.
When the warp instruction is issued to a group of 16 cores, the entire warp executes the instruction, because the cores are clocked twice (Fermi's "hotclock") so that each core actually executes two threads' worth of computation in a single cycle (= 2 hotclocks). When a warp instruction is dispatched, the entire warp gets serviced. It does not need to be scheduled twice.
B. ...Therefore, if all threads are doing the same work, they will execute exactly the same instruction. Then all warps are always carrying the same instruction in this case.
It's true that all threads in a block (and therefore all warps) are executing from the same instruction stream, but they are not necessarily executing the same instruction. Certainly all threads in a warp are executing the same instruction at any given time. But warps execute independently from each other and so different warps within a block may be executing different instructions from the stream, at any given time. The diagram on page 10 of the Fermi whitepaper makes this evident.
Q1: Which part handles the grouping of threads into warps? Software or hardware?
It is done by hardware, as explained in the hardware implementation section of the programming guide: "The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block. "
And how is the hardware warp scheduler implemented, and how does it work?
I don't believe this is formally documented anywhere. Greg Smith has provided various explanations about it, and you may wish to search on "user:124092 scheduler" or a similar query to read some of his comments.
Q2. If I have 64 threads, where threads 0-15 and 32-47 execute one instruction while threads 16-31 and 48-63 execute another, is the scheduler smart enough to group nonconsecutive threads (with the same instruction) into the same warp (i.e., to group threads 0-15 and 32-47 into one warp, and threads 16-31 and 48-63 into another warp)?
This question is predicated on misconceptions outlined earlier. The grouping of threads into a warp is not dynamic; it is fixed at threadblock launch time, and it follows the methodology described above in the answer to Q1. Furthermore, threads 0-15 will never be scheduled with any threads other than 16-31, as 0-31 comprise a warp, which is indivisible for scheduling purposes, on Fermi.
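As an illustration of that fixed partitioning (my sketch, assuming a 1-D block; the function and parameter names are mine), the warp and lane of a thread follow directly from its thread index:
__device__ void my_warp_and_lane(int *warpId, int *laneId)
{
    int tid = threadIdx.x;     // linear thread index within a 1-D block
    *warpId = tid / 32;        // warp 0 holds threads 0-31, warp 1 holds threads 32-63, ...
    *laneId = tid % 32;        // position of the thread within its warp
}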
Q3. What's the point of having a warp size (32) larger than the scheduling group size (16 cores)?
Again, I believe this question is predicated on previous misconceptions. The hardware units used to provide resources for a warp may exist in 16 units (or some other number) at some functional level, but from an operational level, the warp is scheduled as 32 threads, and each instruction is scheduled for the entire warp, and executed together, within some number of Fermi hotclocks.
As far as I know:
Q1 - Scheduling is done at the hardware level, and warps are the scheduling units. Warps, their constituent lanes (a lane id is the hardware equivalent of the thread index within a warp), SMs and the other components at this level are all hardware units which are abstracted and programmed via the CUDA programming model.
Q2 - It also depends on the grid: if you launch two blocks containing a single thread each, you end up with two warps, each of which contains only one active thread. As I said, all scheduling and execution is done on a warp basis, and the more warps the hardware has, the more it can schedule (even though some may contain dummy NOP threads) to hide latency and reduce instruction pipeline stalls.
Q3 - Once resources are allocated, threads are always divided into 32-thread warps. On Fermi, the warp schedulers pick two warps per cycle and dispatch them to the execution units. On pre-Fermi architectures SMs had fewer than 32 thread processors, whereas Fermi has 32. However, a full memory request can only retrieve 128 bytes at a time. Therefore, for data sizes larger than 32 bits per thread per transaction, the memory controller may still break the request down into half-warp-sized pieces (https://stackoverflow.com/a/14927626/1938163). Besides:
The SM schedules threads in groups of 32 parallel threads called
warps. Each SM features two warp schedulers and two instruction
dispatch units, allowing two warps to be issued and executed
concurrently. Fermi’s dual warp scheduler selects two warps, and
issues one instruction from each warp to a group of sixteen cores,
sixteen load/store units, or four SFUs.
You don't have a "scheduling group size" at the thread level as you wrote; if you re-read the statement above you'll see that 16 cores (or 16 load/store units or 4 SFUs) are each readied with one instruction from a 32-thread warp. If you were asking "why 16?", well, that's another architectural story, and I suspect it's a carefully designed tradeoff. I'm sorry, but I don't know more about this.

Do warp vote functions synchronize threads in the warp?

Do CUDA warp vote functions, such as __any() and __all(), synchronize threads in the warp?
In other words, is there any guarantee that all threads inside the warp execute instructions preceding warp vote function, especially the instruction(s) that manipulate the predicate?
The synchronization is implicit, since threads within a warp execute in lockstep. [*]
Code that relies on this behavior is known as "warp synchronous."
[*] If you are thinking that conditional code will cause threads within a warp to follow different execution paths, you have more to learn about how CUDA hardware works. Divergent conditional code (i.e. conditional code where the condition is true for some threads but not for others) causes certain threads within the warp to be disabled (either by predication or the branch synchronization stack), but each thread still occupies one of the 32 lanes available in the warp.
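A classic example of such warp-synchronous code is the last stage of a shared-memory reduction (my sketch, assuming a shared array sdata with at least 64 valid elements). It omits __syncthreads() between steps precisely because it relies on the lockstep behavior described above, and the volatile qualifier keeps the compiler from caching values in registers. Note that on Volta and later this idiom additionally needs __syncwarp() to stay correct.
__device__ void warp_reduce(volatile int *sdata, int tid)
{
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }
}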
They don't. You can use warp vote functions within code branches. If they synchronized in such a case, there would be a possible deadlock. From the PTX ISA:
vote
Vote across thread group.
Syntax
vote.mode.pred d, {!}a;
vote.ballot.b32 d, {!}a; // 'ballot' form, returns bitmask
.mode = { .all, .any, .uni };
Description
Performs a reduction of the source predicate across threads in a warp. The destination predicate value is the same across all threads in the warp.
The reduction modes are:
.all
True if source predicate is True for all active threads in warp. Negate the source predicate to compute .none.
.any
True if source predicate is True for some active thread in warp. Negate the source predicate to compute .not_all.
.uni
True if source predicate has the same value in all active threads in warp. Negating the source predicate also computes .uni.
In the ballot form, vote.ballot.b32 simply copies the predicate from each thread in a warp into the corresponding bit position of destination register d, where the bit position corresponds to the thread's lane id.
EDIT:
Since threads within a warp are implicitly synchronized, you don't have to manually ensure that the threads are properly synchronized when the vote takes place. Note that for __all only the active threads participate in the vote. Active threads are the threads that are executing the instruction, i.e. those for which the branch condition is true. This explains why a vote can occur within code branches.
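For instance (my sketch, not from the original answer), a vote inside a divergent branch only polls the threads that entered the branch, and no extra synchronization is needed:
__global__ void vote_in_branch(const int *data, int *flag)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (data[tid] > 0) {                        // divergent branch
        int anyLarge = __any(data[tid] > 100);  // vote among the active threads only
        if (anyLarge)
            atomicOr(flag, 1);
    }
}
// On Volta and later, use __any_sync(mask, pred) with an explicit member mask.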

Which is better: atomic contention between threads of a single warp, or between threads of different warps?

Which is better: atomic contention (competition) between threads of a single warp, or between threads of different warps of one block? My guess is that for shared memory it is better when threads of one warp contend with each other less than threads of different warps do, and that for global memory it is the opposite: it is better when threads of different warps of one block contend less than the threads of a single warp. Is that right?
I need to know this in order to decide how best to resolve the contention, and how best to partition the stored data: among the threads of a single warp, or across different warps.
Incidentally, is it correct to say that __syncthreads() synchronizes the warps within a single block, and not the threads within one warp?
If a significant number of threads in a block perform atomic updates to the same value, you will get poor performance since those threads must all be serialized. In such cases, it is usually better to have each thread write its result to a separate location and then, in a separate kernel, process those values.
If each thread in a warp performs an atomic update to the same value, all the threads in the warp perform the update in the same clock cycle, so they must all be serialized at the point of the atomic update. This probably means that the warp is scheduled 32 times to get all the threads serviced (very bad).
On the other hand, if a single thread in each warp in a block performs an atomic update to the same value, the impact will be lower because the pairs of warps (the two warps processed at each clock by the two warp schedulers) are offset in time (by one clock cycle), as they move through the processing pipelines. So you end up with only two atomic updates (one from each of the two warps), getting issued within one cycle and needing to immediately be serialized.
So, in the second case, the situation is better, but still problematic. The reason is that, depending on where the shared value is, you can still get serialization between SMs, and this can be very slow since each thread may have to wait for updates to go all the way out to global memory, or at least to L2, and then back. It may be possible to refactor the algorithm in such a way that threads within a block perform atomic updates to a value in shared memory (L1), and then have one thread in each block perform an atomic update to a value in global memory (L2).
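A sketch of that refactoring (my example, not from the original answer): each block first accumulates into a value in shared memory, and then a single thread per block performs one atomic update on global memory.
__global__ void blockwise_accumulate(const int *in, int *globalSum, int n)
{
    __shared__ int blockSum;
    if (threadIdx.x == 0)
        blockSum = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&blockSum, in[i]);       // contention stays within the block (shared memory)
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(globalSum, blockSum);    // one global atomic per block
}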
The atomic operations can be complete lifesavers but they tend to be overused by people new to CUDA. It is often better to use a separate step with a parallel reduction or parallel stream compaction algorithm (see thrust::copy_if).