I am trying to implement an atomic-based mutex.
I succeeded, but I have one question about warps / deadlock.
This code works well.
bool blocked = true;
while(blocked) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;
        doCriticJob();
        atomicExch(&mLock, 0);
        blocked = false;
    }
}
But this one doesn't...
while(true) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;
        doCriticJob();
        atomicExch(&mLock, 0);
        break;
    }
}
I think it has to do with where the loop is exited. In the first one, the exit happens at the loop condition; in the second one it happens at the end of the if, so the thread waits for the other threads in the warp to finish the loop, while the other threads are also waiting for the first thread... But I think I am wrong, so I'd appreciate an explanation :).
Thanks!
There are other questions here on mutexes. You might want to look at some of them. Search on "cuda critical section", for example.
Assuming that one will work and one won't because one seemed to work for your test case is dangerous. Managing mutexes or critical sections, especially when the negotiation is amongst threads in the same warp, is notoriously difficult and fragile. The general advice is to avoid it. As discussed elsewhere, if you must use mutexes or critical sections, have a single thread in the threadblock negotiate for any thread that needs it, then control behavior within the threadblock using intra-threadblock synchronization mechanisms, such as __syncthreads().
This question (IMO) can't really be answered without looking at the way the compiler is ordering the various paths of execution. Therefore we need to look at the SASS code (the machine code). You can use the cuda binary utilities to do this, and will probably want to refer to both the PTX reference as well as the SASS reference. This also means that you need a complete code, not just the snippets you've provided.
Here's my code for analysis:
$ cat t830.cu
#include <stdio.h>

__device__ int mLock = 0;

__device__ void doCriticJob(){
}

__global__ void kernel1(){
    int index = 0;
    int mSize = 1;
    while(true) {
        if(0 == atomicCAS(&mLock, 0, 1)) {
            index = mSize++;
            doCriticJob();
            atomicExch(&mLock, 0);
            break;
        }
    }
}

__global__ void kernel2(){
    int index = 0;
    int mSize = 1;
    bool blocked = true;
    while(blocked) {
        if(0 == atomicCAS(&mLock, 0, 1)) {
            index = mSize++;
            doCriticJob();
            atomicExch(&mLock, 0);
            blocked = false;
        }
    }
}

int main(){
    kernel2<<<4,128>>>();
    cudaDeviceSynchronize();
}
kernel1 is my representation of your deadlock code, and kernel2 is my representation of your "working" code. When I compile this on Linux under CUDA 7 and run it on a cc2.0 device (Quadro5000), if I call kernel1 the code will deadlock, and if I call kernel2 (as shown) it doesn't.
I use cuobjdump -sass to dump the machine code:
$ cuobjdump -sass ./t830
Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_20
Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_20
Function : _Z7kernel1v
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ MOV32I R4, 0x1; /* 0x1800000004011de2 */
/*0010*/ SSY 0x48; /* 0x60000000c0000007 */
/*0018*/ MOV R2, c[0xe][0x0]; /* 0x2800780000009de4 */
/*0020*/ MOV R3, c[0xe][0x4]; /* 0x280078001000dde4 */
/*0028*/ ATOM.E.CAS R0, [R2], RZ, R4; /* 0x54080000002fdd25 */
/*0030*/ ISETP.NE.AND P0, PT, R0, RZ, PT; /* 0x1a8e0000fc01dc23 */
/*0038*/ @P0 BRA 0x18; /* 0x4003ffff600001e7 */
/*0040*/ NOP.S; /* 0x4000000000001df4 */
/*0048*/ ATOM.E.EXCH RZ, [R2], RZ; /* 0x547ff800002fdd05 */
/*0050*/ EXIT; /* 0x8000000000001de7 */
............................
Function : _Z7kernel2v
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ MOV32I R0, 0x1; /* 0x1800000004001de2 */
/*0010*/ MOV32I R3, 0x1; /* 0x180000000400dde2 */
/*0018*/ MOV R4, c[0xe][0x0]; /* 0x2800780000011de4 */
/*0020*/ MOV R5, c[0xe][0x4]; /* 0x2800780010015de4 */
/*0028*/ ATOM.E.CAS R2, [R4], RZ, R3; /* 0x54061000004fdd25 */
/*0030*/ ISETP.NE.AND P1, PT, R2, RZ, PT; /* 0x1a8e0000fc23dc23 */
/*0038*/ @!P1 MOV R0, RZ; /* 0x28000000fc0025e4 */
/*0040*/ @!P1 ATOM.E.EXCH RZ, [R4], RZ; /* 0x547ff800004fe505 */
/*0048*/ LOP.AND R2, R0, 0xff; /* 0x6800c003fc009c03 */
/*0050*/ I2I.S32.S16 R2, R2; /* 0x1c00000008a09e84 */
/*0058*/ ISETP.NE.AND P0, PT, R2, RZ, PT; /* 0x1a8e0000fc21dc23 */
/*0060*/ @P0 BRA 0x18; /* 0x4003fffec00001e7 */
/*0068*/ EXIT; /* 0x8000000000001de7 */
............................
Fatbin ptx code:
================
arch = sm_20
code version = [4,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
$
Considering a single warp, with either code, all threads must acquire the lock (via atomicCAS) once, in order for the code to complete successfully. With either code, only one thread in a warp can acquire the lock at any given time, and in order for other threads in the warp to (later) acquire the lock, that thread must have an opportunity to release it (via atomicExch).
The key difference between these realizations then, lies in how the compiler scheduled the atomicExch instruction with respect to conditional branches.
Let's consider the "deadlock" code (kernel1). In this case, the ATOM.E.EXCH instruction does not occur until after the one (and only) conditional branch (@P0 BRA 0x18;) instruction. A conditional branch in CUDA code represents a possible point of warp divergence, and execution after warp divergence is, to some degree, unspecified and up to the specifics of the machine. But given this uncertainty, it's possible that the thread that acquired the lock will wait for the other threads to complete their branches, before executing the atomicExch instruction, which means that the other threads will not have a chance to acquire the lock, and we have deadlock.
If we then compare that to the "working" code, we see that once the ATOM.E.CAS instruction is issued, there are no conditional branches in between that point and the point at which the ATOM.E.EXCH instruction is issued, thus releasing the lock just acquired. Since each thread that acquires the lock (via ATOM.E.CAS) will release it (via ATOM.E.EXCH) before any conditional branching occurs, there isn't any possibility (given this code realization) for the kind of deadlock witnessed previously (with kernel1) to occur.
(@P0 is a form of predication, and you can read about it in the PTX reference here to understand how it can lead to conditional branching.)
NOTE: I consider both of these codes to be dangerous, and possibly flawed. Even though the current tests don't seem to uncover a problem with the "working" code, I think it's possible that a future CUDA compiler might choose to schedule things differently, and break that code. It's even possible that compiling for a different machine architecture might produce different code here. I consider a mechanism like this one, which avoids intra-warp contention entirely, to be more robust. Even such a mechanism, however, can lead to inter-threadblock deadlocks. Any mutex must be used under specific programming and usage limitations.
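As a hedged sketch of that more robust pattern (the kernel name is made up for this illustration; `mLock` and `doCriticJob` are reused from the question), a single thread per threadblock negotiates for the lock while the rest of the block waits at __syncthreads():

```cuda
__device__ int mLock = 0;

__device__ void doCriticJob(){
}

__global__ void kernelBlockNegotiate(){
    // Only thread 0 of each block contends for the lock, so there is
    // no intra-warp contention for the mutex at all.
    if (threadIdx.x == 0) {
        while (atomicCAS(&mLock, 0, 1) != 0);   // spin until acquired
    }
    __syncthreads();    // the whole block is now inside the critical section

    doCriticJob();      // block-wide critical work

    __syncthreads();    // ensure all threads are finished before release
    if (threadIdx.x == 0) {
        atomicExch(&mLock, 0);   // thread 0 releases the lock
    }
}
```

This is still subject to the inter-threadblock caveats above, and a __threadfence() may be needed before release if the critical section writes global data that other blocks will read.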
Related
When different threads in a warp execute divergent code, the divergent branches are serialized, and inactive threads are "disabled."
If the divergent paths contain a small number of instructions, such that branch predication is used, it's pretty clear what "disabled" means (threads are turned on/off by the predicate), and it's also clearly visible in the sass dump.
If the divergent execution paths contain larger numbers of instructions (exact number dependent on some compiler heuristics) branch instructions are inserted to potentially skip one execution path or the other. This makes sense: if one long branch is seldom taken, or not taken by any threads in a certain warp, it's advantageous to allow the warp to skip those instructions (rather than being forced to execute both paths in all cases as for predication).
My question is: How are inactive threads "disabled" in the case of divergence with branches? The slide on page 2, lower left of this presentation seems to indicate that branches are taken based on a condition and threads that do not participate are switched off via predicates attached to the instructions at the branch targets. However, this is not the behavior I observe in SASS.
Here's a minimal compilable sample:
#include <stdio.h>
__global__ void nonpredicated( int* a, int iter )
{
    if( a[threadIdx.x] == 0 )
        // Make the number of divergent instructions unknown at
        // compile time so the compiler is forced to create branches
        for( int i = 0; i < iter; i++ )
        {
            a[threadIdx.x] += 5;
            a[threadIdx.x] *= 5;
        }
    else
        for( int i = 0; i < iter; i++ )
        {
            a[threadIdx.x] += 2;
            a[threadIdx.x] *= 2;
        }
}

int main(){}
Here's the SASS dump showing that the branch instructions are predicated, but the code at the branch targets is not predicated. Are the threads that did not take the branch switched off implicitly during execution of those branch targets, in some way that is not directly visible in the SASS? I often see terminology like "active mask" alluded to in various CUDA documents, but I'm wondering how this manifests in SASS, if it is a separate mechanism from predication.
Additionally, for pre-Volta architectures, the program counter is shared per-warp, so the idea of a predicated branch instruction is confusing to me. Why would you attach a per-thread predicate to an instruction that might change something (the program counter) that is shared by all threads in the warp?
code for sm_20
Function : _Z13nonpredicatedPii
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R0, SR_TID.X; /* 0x2c00000084001c04 */
/*0010*/ MOV32I R3, 0x4; /* 0x180000001000dde2 */
/*0018*/ IMAD.U32.U32 R2.CC, R0, R3, c[0x0][0x20]; /* 0x2007800080009c03 */
/*0020*/ IMAD.U32.U32.HI.X R3, R0, R3, c[0x0][0x24]; /* 0x208680009000dc43 */
/*0028*/ LD.E R0, [R2]; /* 0x8400000000201c85 */
/*0030*/ ISETP.EQ.AND P0, PT, R0, RZ, PT; /* 0x190e0000fc01dc23 */
/*0038*/ @P0 BRA 0xd0; /* 0x40000002400001e7 */
/*0040*/ MOV R4, c[0x0][0x28]; /* 0x28004000a0011de4 */
/*0048*/ ISETP.LT.AND P0, PT, R4, 0x1, PT; /* 0x188ec0000441dc23 */
/*0050*/ MOV R4, RZ; /* 0x28000000fc011de4 */
/*0058*/ @P0 EXIT; /* 0x80000000000001e7 */
/*0060*/ NOP; /* 0x4000000000001de4 */
/*0068*/ NOP; /* 0x4000000000001de4 */
/*0070*/ NOP; /* 0x4000000000001de4 */
/*0078*/ NOP; /* 0x4000000000001de4 */
/*0080*/ IADD R4, R4, 0x1; /* 0x4800c00004411c03 */
/*0088*/ IADD R0, R0, 0x2; /* 0x4800c00008001c03 */
/*0090*/ ISETP.LT.AND P0, PT, R4, c[0x0][0x28], PT; /* 0x188e4000a041dc23 */
/*0098*/ SHL R0, R0, 0x1; /* 0x6000c00004001c03 */
/*00a0*/ @P0 BRA 0x80; /* 0x4003ffff600001e7 */
/*00a8*/ ST.E [R2], R0; /* 0x9400000000201c85 */
/*00b0*/ BRA 0x128; /* 0x40000001c0001de7 */
/*00b8*/ NOP; /* 0x4000000000001de4 */
/*00c0*/ NOP; /* 0x4000000000001de4 */
/*00c8*/ NOP; /* 0x4000000000001de4 */
/*00d0*/ MOV R0, c[0x0][0x28]; /* 0x28004000a0001de4 */
/*00d8*/ MOV R4, RZ; /* 0x28000000fc011de4 */
/*00e0*/ ISETP.LT.AND P0, PT, R0, 0x1, PT; /* 0x188ec0000401dc23 */
/*00e8*/ MOV R0, RZ; /* 0x28000000fc001de4 */
/*00f0*/ @P0 EXIT; /* 0x80000000000001e7 */
/*00f8*/ MOV32I R5, 0x19; /* 0x1800000064015de2 */
/*0100*/ IADD R0, R0, 0x1; /* 0x4800c00004001c03 */
/*0108*/ IMAD R4, R4, 0x5, R5; /* 0x200ac00014411ca3 */
/*0110*/ ISETP.LT.AND P0, PT, R0, c[0x0][0x28], PT; /* 0x188e4000a001dc23 */
/*0118*/ @P0 BRA 0x100; /* 0x4003ffff800001e7 */
/*0120*/ ST.E [R2], R4; /* 0x9400000000211c85 */
/*0128*/ EXIT; /* 0x8000000000001de7 */
.....................................
Are the threads that did not take the branch switched off implicitly during execution of those branch targets, in some way that is not directly visible in the SASS?
Yes.
There is a warp execution or "active" mask which is separate from the formal concept of predication as defined in the PTX ISA manual.
Predicated execution may allow instructions to be executed (or not) for a particular thread on an instruction-by-instruction basis. The compiler may also emit predicated instructions to enact a conditional jump or branch.
However the GPU also maintains a warp active mask. When the machine observes that thread execution within a warp has diverged (for example at the point of a predicated branch, or perhaps any predicated instruction), it will set the active mask accordingly. This process isn't really "visible" at the SASS level. AFAIK the low level execution process for a diverged warp (not via predication) isn't well specified, so questions around how long the warp stays diverged and the exact mechanism for re-synchronization aren't well specified, and AFAIK can be affected by compiler choices, on some architectures. This is one recent discussion (note particularly the remarks by @njuffa).
Why would you attach a per-thread predicate to an instruction that might change something (the program counter) that is shared by all threads in the warp?
This is how you perform a conditional jump or branch. Since all execution is lock-step, if we are going to execute a particular instruction (regardless of mask status or predication status) the PC had better point to that instruction. However, the GPU can perform instruction replay to handle different cases, as needed at execution time.
A few other notes:
a mention of the "active mask" is here:
The scheduler dispatches all 32 lanes of the warp to the execution units with an active mask. Non-active threads execute through the pipe.
some NVIDIA tools allow for inspection of the active mask.
When looking into the SASS output generated for the NVIDIA Fermi architecture, I observed the instruction IADD.X. From NVIDIA documentation, IADD means integer add, but I don't understand what IADD.X means. Can somebody please help? Does it mean an integer addition with an extended number of bits?
The instruction snippet is:
IADD.X R5, R3, c[0x0][0x24]; /* 0x4800400090315c43 */
Yes, the .X stands for eXtended precision. You will see IADD.X used together with IADD.CC, where the latter adds the less significant bits, and produces a carry flag (thus the .CC), and this carry flag is then incorporated into addition of the more significant bits performed by IADD.X.
Since NVIDIA GPUs are basically 32-bit processors with 64-bit addressing capability, a frequent use of this idiom is in address (pointer) arithmetic. The use of 64-bit integer types, such as long long int or uint64_t will likewise lead to the use of these instructions.
Here is a worked example of a kernel doing 64-bit integer addition. This CUDA code was compiled for compute capability 3.5 with CUDA 7.5, and the machine code dumped with cuobjdump --dump-sass.
__global__ void addint64 (long long int a, long long int b, long long int *res)
{
    *res = a + b;
}
MOV R1, c[0x0][0x44];
MOV R2, c[0x0][0x148]; // b[31:0]
MOV R0, c[0x0][0x14c]; // b[63:32]
IADD R4.CC, R2, c[0x0][0x140]; // tmp[31:0] = b[31:0] + a[31:0]; carry-out
MOV R2, c[0x0][0x150]; // res[31:0]
MOV R3, c[0x0][0x154]; // res[63:32]
IADD.X R5, R0, c[0x0][0x144]; // tmp[63:32] = b[63:32] + a[63:32] + carry-in
ST.E.64 [R2], R4; // [res] = tmp[63:0]
EXIT
I have read a lot of threads about CUDA branch divergence telling me that using the ternary operator is better than if/else statements, because the ternary operator doesn't result in branch divergence.
I wonder, for the following code:
foo = (a > b) ? (bar(a)) : (b);
where bar is another function or some more complicated statement, is it still true that there is no branch divergence?
I don't know what sources you consulted, but with the CUDA toolchain there is no noticeable performance difference between the use of the ternary operator and the equivalent if-then-else sequence in most cases. In the case where such differences are noticed, they are due to second order effects in the code generation, and the code based on if-then-else sequence may well be faster in my experience. In essence, ternary operators and tightly localized branching are treated in much the same way. There can be no guarantees that a ternary operator may not be translated into machine code containing a branch.
The GPU hardware offers multiple mechanisms that help avoid branches, and the CUDA compiler makes good use of these mechanisms to minimize branches. One is predication, which can be applied to pretty much any instruction. The other is support for select-type instructions, which are essentially the hardware equivalent of the ternary operator. The compiler uses if-conversion to translate short branches into branch-less code sequences. Often, it chooses a combination of predicated code and a uniform branch. In cases of non-divergent control flow (all threads in a warp take the same branch) the uniform branch skips over the predicated code section.
Except in cases of extreme performance optimization, CUDA can (and should) be written in natural idioms that are clear and appropriate to the task at hand, using either if-then-else sequences or ternary operators as you see fit. The compiler will take care of the rest.
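As an illustrative sketch (the kernel and function names are made up for this example, not from the question), these two forms express the same computation; with the CUDA toolchain both will typically be if-converted into a select or predicated sequence when bar is cheap, and either may contain a branch when it is not:

```cuda
__device__ float bar(float x) { return x * 2.0f; }

__global__ void with_ternary(float *foo, const float *a, const float *b)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    foo[i] = (a[i] > b[i]) ? bar(a[i]) : b[i];
}

__global__ void with_if_else(float *foo, const float *a, const float *b)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (a[i] > b[i]) foo[i] = bar(a[i]);   // short branch: candidate for
    else             foo[i] = b[i];        // if-conversion / predication
}
```

Dumping both kernels with cuobjdump -sass and comparing is the reliable way to see what the compiler actually chose.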
(I would like to add this as a comment on @njuffa's answer, but my reputation is not enough.)
I found a performance difference between them in my program.
The if-clause style costs 4.78ms:
// fin is {0-4}, range_limit = 5
if(fin >= range_limit){
res_set = res_set^1;
fx_ref = fx + (fxw*(float)DEPTH_M_H/(float)DEPTH_BLOCK_X);
fin = 0;
}
// then branch for next loop iteration.
// nvvp report these assemblies.
@!P1 LOP32I.XOR R48, R48, 0x1;
@!P1 FMUL.FTZ R39, R7, 14;
@!P1 MOV R0, RZ;
MOV R40, R48;
{ @!P1 FFMA.FTZ R6, R39, c[0x2][0x0], R5;
@!P0 BRA `(.L_35); } // this predicate is also used for the loop's branching
And the ternary style costs 4.46ms:
res_set = (fin < range_limit) ? res_set: (res_set ^1);
fx_ref = (fin < range_limit) ? fx_ref : fx + (fxw*(float)DEPTH_M_H/(float)DEPTH_BLOCK_X) ;
fin = (fin < range_limit) ? fin:0;
//comments are where nvvp mark the instructions are for the particular code line
ISETP.GE.AND P2, PT, R34.reuse, c[0x0][0x160], PT; //res_set
FADD.FTZ R27, -R25, R4;
ISETP.LT.AND P0, PT, R34, c[0x0][0x160], PT; //fx_ref
IADD32I R10, R10, 0x1;
SHL R0, R9, 0x2;
SEL R4, R4, R27, P1;
ISETP.LT.AND P1, PT, R10, 0x5, PT;
IADD R33, R0, R26;
{ SEL R0, RZ, 0x1, !P2;
STS [R33], R58; }
{ FADD.FTZ R3, R3, 74.75;
STS [R33+0x8], R29; }
{ @!P0 FMUL.FTZ R28, R4, 14; //fx_ref
STS [R33+0x10], R30; }
{ IADD32I R24, R24, 0x1;
STS [R33+0x18], R31; }
{ LOP.XOR R9, R0, R9; //res_set
STS [R33+0x20], R32; }
{ SEL R0, R34, RZ, P0; //fin
STS [R33+0x28], R36; }
{ @!P0 FFMA.FTZ R2, R28, c[0x2][0x0], R3; //fx_ref
The inserted lines are from the next loop iteration calculation.
I think that in the case where many instructions share the same predicate value, the ternary style may provide more opportunity for ILP optimization.
I am trying to run a simple program with a 3-dimensional grid, but for some reason when I launch it with cuda-memcheck it just gets stuck, and after the timeout it's terminated. The problem has nothing to do with a short timeout, because I changed it to 60 seconds just for this test.
The code I run has a grid of 45x1575x1575 and it runs an empty __global__ function. My compute capability is 2.1, and I compile with the flag -maxrregcount=24 to limit the number of registers each thread can use (I saw in another program of mine that this gives the best results with the occupancy calculator).
Here's my code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
__global__ void stam(int a){
}

int main()
{
    // Choose which GPU to run on; change this on a multi-GPU system.
    cudaError_t cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        return 1;
    }
    dim3 gridSize(45,1575,1575);
    stam<<<gridSize,224>>>(4);
    cudaStatus = cudaDeviceSynchronize(); // This function gets stuck
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize failed!");
        return 1;
    }
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }
    return 0;
}
Isn't the max grid size 65535x65535x65535? What is the problem in here?
Edit: it only crashes when I compile it with the -G flag. Otherwise it's just slow, but it doesn't exceed the 60 seconds.
Your code is simply taking too long (yes, longer than 60 seconds) to run.
Even though your kernel "does nothing" it still represents a __global__ function call. To facilitate it, a fair amount of preamble code gets generated by the compiler. Normally the compiler would optimize much of that preamble code away, since your function does nothing (e.g. it does nothing with the variable passed to it, which the preamble code makes available to each thread.) However when you pass the -G switch, you eliminate nearly all compiler optimizations. You can get a sense of the size of the code that is actually running for each threadblock, by taking your executable and inspecting the code with cuobjdump -sass ....
Secondly, running code with cuda-memcheck usually increases execution time. The cuda-memcheck executive adjusts the order and reduces the rate at which threadblocks get executed, so it can do full analysis of the memory access pattern of each threadblock, among other things.
The net effect is that your empty kernel call, in part due to the very large grid (over 100 million threadblocks need to be processed), is taking longer than 60 seconds to execute. If you want to verify this, increase your TDR timeout to 5 minutes or 10 minutes, and eventually you will see the program return normally.
In my case, with -G and cuda-memcheck your program takes about 30 seconds to run on a Quadro5000 GPU, which has 11 SMs. Your cc2.1 GPU may have around 2 SMs, and so will run even slower than mine. If I compile without the -G switch, the runtime drops to about 2 seconds. If I compile with the -G switch, but run without cuda-memcheck, it takes about 4 seconds. If I eliminate the int a parameter from the kernel (which drastically reduces the preamble code), I can compile with -G and run with cuda-memcheck and it only takes 2 seconds.
Kernel machine code with -G and int a parameter:
Function : _Z4stami
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ ISUB R1, R1, 0x8; /* 0x4800c00020105d03 */
/*0010*/ S2R R0, SR_LMEMHIOFF; /* 0x2c000000dc001c04 */
/*0018*/ ISETP.GE.AND P0, PT, R1, R0, PT; /* 0x1b0e00000011dc23 */
/*0020*/ @P0 BRA 0x30; /* 0x40000000200001e7 */
/*0028*/ BPT.TRAP; /* 0xd00000000000c007 */
/*0030*/ IADD R0, R1, RZ; /* 0x48000000fc101c03 */
/*0038*/ MOV R2, R0; /* 0x2800000000009de4 */
/*0040*/ MOV R3, RZ; /* 0x28000000fc00dde4 */
/*0048*/ MOV R2, R2; /* 0x2800000008009de4 */
/*0050*/ MOV R3, R3; /* 0x280000000c00dde4 */
/*0058*/ MOV R4, c[0x0][0x4]; /* 0x2800400010011de4 */
/*0060*/ MOV R5, RZ; /* 0x28000000fc015de4 */
/*0068*/ IADD R2.CC, R2, R4; /* 0x4801000010209c03 */
/*0070*/ IADD.X R3, R3, R5; /* 0x480000001430dc43 */
/*0078*/ MOV32I R0, 0x20; /* 0x1800000080001de2 */
/*0080*/ LDC R0, c[0x0][R0]; /* 0x1400000000001c86 */
/*0088*/ IADD R2.CC, R2, RZ; /* 0x48010000fc209c03 */
/*0090*/ IADD.X R3, R3, RZ; /* 0x48000000fc30dc43 */
/*0098*/ MOV R2, R2; /* 0x2800000008009de4 */
/*00a0*/ MOV R3, R3; /* 0x280000000c00dde4 */
/*00a8*/ ST.E [R2], R0; /* 0x9400000000201c85 */
/*00b0*/ BRA 0xc8; /* 0x4000000040001de7 */
/*00b8*/ EXIT; /* 0x8000000000001de7 */
/*00c0*/ EXIT; /* 0x8000000000001de7 */
/*00c8*/ EXIT; /* 0x8000000000001de7 */
/*00d0*/ EXIT; /* 0x8000000000001de7 */
.........................
Kernel machine code with -G but without int a parameter:
Function : _Z4stamv
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ BRA 0x20; /* 0x4000000040001de7 */
/*0010*/ EXIT; /* 0x8000000000001de7 */
/*0018*/ EXIT; /* 0x8000000000001de7 */
/*0020*/ EXIT; /* 0x8000000000001de7 */
/*0028*/ EXIT; /* 0x8000000000001de7 */
.........................
I've just run your code with no problems on a C2050 (capability 2.0) under both cuda-memcheck and cuda-gdb. This suggests the problem is more likely related to your card/setup than the code itself.
If you were exceeding capability, you should get a launch error code, not a hang (you can check max sizes using deviceQuery SDK code if you're unsure).
It may be that cuda-memcheck is trying to gain exclusive control of the GPU, and is timing out as something else is using it [e.g. your X server] - does cuda-gdb work any better, do these tools work for other codes?
In Nsight Visual Studio, there is a graph presenting the statistics of "taken", "not taken" and "diverged" branches. I am confused about the difference between "not taken" and "diverged".
For example
kernel()
{
    if(tid % 32 != 31)
    {...}
    else
    {...}
}
In my opinion, when tid % 32 == 31 in a warp, divergence will happen, but what is "not taken"?
From the Nsight Visual Studio Edition User Guide:
Not Taken / Taken Total: number of executed branch instructions with a uniform control flow decision; that is all active threads of a warp either take or not take the branch.
Diverged: Total number of executed branch instruction for which the conditional resulted in different outcomes across the threads of the warp. All code paths with at least one participating thread get executed sequentially. Lower numbers are better, however, check the Flow Control Efficiency to understand the impact of control flow on the device utilization.
Now, let us consider the following simple code, which perhaps is what you are currently considering in your tests:
#include <thrust/device_vector.h>

__global__ void test_divergence(int* d_output) {
    int tid = threadIdx.x;
    if(tid % 32 != 31)
        d_output[tid] = tid;
    else
        d_output[tid] = 30000;
}

int main() {
    const int N = 32;
    thrust::device_vector<int> d_vec(N,0);
    test_divergence<<<2,32>>>(thrust::raw_pointer_cast(d_vec.data()));
    return 0;
}
The Branch Statistics graph produced by Nsight is reported below. As you can see, Taken is equal to 100%, since all the threads bump into the if statement. The surprising result is that there are no Diverged branches. This can be explained by taking a look at the disassembled code of the kernel function (compiled for a compute capability of 2.1):
MOV R1, c[0x1][0x100];
S2R R0, SR_TID.X;
SHR R2, R0, 0x1f;
IMAD.U32.U32.HI R2, R2, 0x20, R0;
LOP.AND R2, R2, -0x20;
ISUB R2, R0, R2;
ISETP.EQ.AND P0, PT, R2, 0x1f, PT;
ISCADD R2, R0, c[0x0][0x20], 0x2;
SEL R0, R0, 0x7530, !P0;
ST [R2], R0;
EXIT;
As you can see, the compiler is able to optimize the disassembled code so that no branching is present, except the uniform one due to the EXIT instruction, as pointed out by Greg Smith in the comment below.
EDIT: A MORE COMPLEX EXAMPLE FOLLOWING GREG SMITH'S COMMENT
I'm now considering the following more complex example
/**************************/
/* TEST DIVERGENCE KERNEL */
/**************************/
__global__ void testDivergence(float *a, float *b)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < 16) a[tid] = tid + 1;
    else b[tid] = tid + 2;
}

/********/
/* MAIN */
/********/
int main() {
    const int N = 64;
    float* d_a; cudaMalloc((void**)&d_a,N*sizeof(float));
    float* d_b; cudaMalloc((void**)&d_b,N*sizeof(float));
    testDivergence<<<2,32>>>(d_a, d_b);
    return 0;
}
This is the Branch Statistics graph
while this is the disassembled code
MOV R1, c[0x1][0x100];
S2R R0, SR_CTAID.X; R0 = blockIdx.x
S2R R2, SR_TID.X; R2 = threadIdx.x
IMAD R0, R0, c[0x0][0x8], R2; R0 = threadIdx.x + blockIdx.x * blockDim.x
ISETP.LT.AND P0, PT, R0, 0x10, PT; Checks if R0 < 16 and puts the result in predicate register P0
/*0028*/ @P0 BRA.U 0x58; If P0 = true, jumps to address 0x58
@!P0 IADD R2, R0, 0x2; If P0 = false, R2 = R0 + 2
@!P0 ISCADD R0, R0, c[0x0][0x24], 0x2; If P0 = false, calculates the address at which to store b[tid] in global memory
@!P0 I2F.F32.S32 R2, R2; "
@!P0 ST [R0], R2; "
/*0050*/ @!P0 BRA.U 0x78; If P0 = false, jumps to address 0x78
/*0058*/ @P0 IADD R2, R0, 0x1; If P0 = true, R2 = R0 + 1
@P0 ISCADD R0, R0, c[0x0][0x20], 0x2;
@P0 I2F.F32.S32 R2, R2;
@P0 ST [R0], R2;
/*0078*/ EXIT;
As can be seen, we now have two BRA instructions in the disassembled code. From the graph above, each warp bumps into 3 branches (one for the EXIT and two for the BRAs). Both warps have 1 taken branch, since all the threads uniformly bump into the EXIT instruction. The first warp has 2 not taken branches, since the two BRA paths are not followed uniformly across the warp threads. The second warp has 1 not taken branch and 1 taken branch, since all the warp threads uniformly follow one of the two BRAs. I would say that Diverged is again equal to zero because the instructions in the two branches are exactly the same, although performed on different operands.