In Nsight Visual Studio, there is a graph presenting the statistics of "taken", "not taken" and "diverged" branches. I am confused about the difference between "not taken" and "diverged".
For example:

kernel()
{
    if (tid % 32 != 31)
    {...}
    else
    {...}
}
In my opinion, when tid % 32 == 31 in a warp, divergence will happen, but what is "not taken"?
From the Nsight Visual Studio Edition User Guide:
Not Taken / Taken Total: number of executed branch instructions with a uniform control flow decision; that is all active threads of a warp either take or not take the branch.
Diverged: Total number of executed branch instruction for which the conditional resulted in different outcomes across the threads of the warp. All code paths with at least one participating thread get executed sequentially. Lower numbers are better, however, check the Flow Control Efficiency to understand the impact of control flow on the device utilization.
Now, let us consider the following simple code, which perhaps is what you are currently considering in your tests:
#include <thrust/device_vector.h>

__global__ void test_divergence(int* d_output) {
    int tid = threadIdx.x;
    if (tid % 32 != 31)
        d_output[tid] = tid;
    else
        d_output[tid] = 30000;
}

int main() {
    const int N = 32;
    thrust::device_vector<int> d_vec(N, 0);
    test_divergence<<<2, 32>>>(thrust::raw_pointer_cast(d_vec.data()));
    return 0;
}
In the Branch Statistics graph produced by Nsight, Taken is equal to 100%, since all the threads run into the if statement. The surprising result is that there is no Divergence. This can be explained by looking at the disassembled code of the kernel function (compiled for a compute capability of 2.1):
MOV R1, c[0x1][0x100];
S2R R0, SR_TID.X;
SHR R2, R0, 0x1f;
IMAD.U32.U32.HI R2, R2, 0x20, R0;
LOP.AND R2, R2, -0x20;
ISUB R2, R0, R2;
ISETP.EQ.AND P0, PT, R2, 0x1f, PT;
ISCADD R2, R0, c[0x0][0x20], 0x2;
SEL R0, R0, 0x7530, !P0;
ST [R2], R0;
EXIT;
As you can see, the compiler is able to optimize the disassembled code so that no branching is present, except the uniform one due to the EXIT instruction, as pointed out by Greg Smith in the comments.
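To make the transformation concrete, here is a hand-written sketch of what the SEL instruction achieves at the source level (the kernel name is mine, for illustration only; this is not actual compiler output):

__global__ void test_divergence_branchless(int* d_output) {
    int tid = threadIdx.x;
    // Both "paths" collapse into a single straight-line instruction
    // stream: each thread selects one of the two values per element.
    d_output[tid] = (tid % 32 != 31) ? tid : 30000;
}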
EDIT: A MORE COMPLEX EXAMPLE FOLLOWING GREG SMITH'S COMMENT
I'm now considering the following more complex example:
/**************************/
/* TEST DIVERGENCE KERNEL */
/**************************/
__global__ void testDivergence(float *a, float *b)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < 16) a[tid] = tid + 1;
    else          b[tid] = tid + 2;
}
/********/
/* MAIN */
/********/
int main() {
    const int N = 64;
    float* d_a; cudaMalloc((void**)&d_a, N*sizeof(float));
    float* d_b; cudaMalloc((void**)&d_b, N*sizeof(float));
    testDivergence<<<2,32>>>(d_a, d_b);
    return 0;
}
The Branch Statistics graph now shows three branch instructions per warp (analyzed below), while this is the disassembled code:
MOV R1, c[0x1][0x100];
S2R R0, SR_CTAID.X;                       R0 = blockIdx.x
S2R R2, SR_TID.X;                         R2 = threadIdx.x
IMAD R0, R0, c[0x0][0x8], R2;             R0 = threadIdx.x + blockIdx.x * blockDim.x
ISETP.LT.AND P0, PT, R0, 0x10, PT;        checks if R0 < 16 and puts the result in predicate register P0
/*0028*/ @P0 BRA.U 0x58;                  if P0 = true, jumps to address 0x58
@!P0 IADD R2, R0, 0x2;                    if P0 = false, R2 = R0 + 2
@!P0 ISCADD R0, R0, c[0x0][0x24], 0x2;    if P0 = false, calculates the address at which to store b[tid] in global memory
@!P0 I2F.F32.S32 R2, R2;                  "
@!P0 ST [R0], R2;                         "
/*0050*/ @!P0 BRA.U 0x78;                 if P0 = false, jumps to address 0x78
/*0058*/ @P0 IADD R2, R0, 0x1;            if P0 = true, R2 = R0 + 1
@P0 ISCADD R0, R0, c[0x0][0x20], 0x2;     if P0 = true, calculates the address at which to store a[tid] in global memory
@P0 I2F.F32.S32 R2, R2;                   "
@P0 ST [R0], R2;                          "
/*0078*/ EXIT;
As can be seen, there are now two BRA instructions in the disassembled code. From the graph above, each warp runs into 3 branches (one for the EXIT and two for the BRAs). Both warps have 1 taken branch, since all the threads uniformly run into the EXIT instruction. The first warp has 2 not-taken branches, since the two BRA paths are not followed uniformly across the warp threads. The second warp has 1 not-taken branch and 1 taken branch, since all the warp threads uniformly follow one of the two BRAs. I would say that, again, diverged is equal to zero because the instructions in the two branches are exactly the same, although performed on different operands.
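To actually provoke a nonzero Diverged count, one could try a variant whose two paths execute genuinely different instruction sequences; the following is an untested sketch of that idea (the kernel name and the chosen operations are mine):

__global__ void testRealDivergence(float *a, float *b)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // Even and odd lanes of the same warp run different instruction
    // sequences, so the two paths cannot be folded into identical code.
    if (tid % 2 == 0) a[tid] = sinf((float)tid);
    else              b[tid] = a[tid] * a[tid] + 2.0f;
}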
When looking into the SASS output generated for the NVIDIA Fermi architecture, the instruction IADD.X can be observed. From the NVIDIA documentation, IADD means integer add, but I do not understand what IADD.X means. Can somebody please help? Does it mean an integer addition with an extended number of bits?
The instruction snippet is:
IADD.X R5, R3, c[0x0][0x24]; /* 0x4800400090315c43 */
Yes, the .X stands for eXtended precision. You will see IADD.X used together with IADD.CC, where the latter adds the less significant bits and produces a carry flag (thus the .CC); this carry flag is then incorporated into the addition of the more significant bits performed by IADD.X.
Since NVIDIA GPUs are basically 32-bit processors with 64-bit addressing capability, a frequent use of this idiom is in address (pointer) arithmetic. The use of 64-bit integer types, such as long long int or uint64_t will likewise lead to the use of these instructions.
Here is a worked example of a kernel doing 64-bit integer addition. This CUDA code was compiled for compute capability 3.5 with CUDA 7.5, and the machine code dumped with cuobjdump --dump-sass.
__global__ void addint64 (long long int a, long long int b, long long int *res)
{
    *res = a + b;
}
MOV R1, c[0x0][0x44];
MOV R2, c[0x0][0x148]; // b[31:0]
MOV R0, c[0x0][0x14c]; // b[63:32]
IADD R4.CC, R2, c[0x0][0x140]; // tmp[31:0] = b[31:0] + a[31:0]; carry-out
MOV R2, c[0x0][0x150]; // res[31:0]
MOV R3, c[0x0][0x154]; // res[63:32]
IADD.X R5, R0, c[0x0][0x144]; // tmp[63:32] = b[63:32] + a[63:32] + carry-in
ST.E.64 [R2], R4; // [res] = tmp[63:0]
EXIT;
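The same carry idiom can be written by hand with inline PTX. The helper below is a sketch of mine (the name add64_via_carry is not from any library) that mirrors the IADD.CC / IADD.X pairing using PTX's add.cc.u32 and addc.u32 instructions:

__device__ unsigned long long add64_via_carry(unsigned long long a,
                                              unsigned long long b)
{
    unsigned int alo = (unsigned int)a, ahi = (unsigned int)(a >> 32);
    unsigned int blo = (unsigned int)b, bhi = (unsigned int)(b >> 32);
    unsigned int rlo, rhi;
    asm("add.cc.u32 %0, %2, %3;\n\t"   // add low halves; write carry-out
        "addc.u32   %1, %4, %5;"       // add high halves plus carry-in
        : "=r"(rlo), "=r"(rhi)
        : "r"(alo), "r"(blo), "r"(ahi), "r"(bhi));
    return ((unsigned long long)rhi << 32) | rlo;
}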
I am trying to run a simple program with a 3-dimensional grid, but for some reason, when I launch it with cuda-memcheck, it just gets stuck and is terminated after the timeout. The problem has nothing to do with a short timeout: I already increased it to 60 seconds just for this test.
The code uses a grid of 45x1575x1575 and runs an empty __global__ function. My compute capability is 2.1 and I compile with the flag -maxrregcount=24 to limit the number of registers the device functions can use (I saw in another program of mine that this gives the best results with the occupancy calculator).
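For reference, a build line matching this description (the source file name stam.cu is assumed; -G is added per the edit below) would look something like:

nvcc -arch=sm_21 -maxrregcount=24 -G -o stam stam.cu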
Here's my code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
__global__ void stam(int a){
}
int main()
{
// Choose which GPU to run on, change this on a multi-GPU system.
cudaError_t cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
return;
}
dim3 gridSize(45,1575,1575);
stam<<<gridSize,224>>>(4);
cudaStatus = cudaDeviceSynchronize(); // This function gets stuck
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaSetDevice failed!!");
return;
}
cudaStatus = cudaDeviceReset();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceReset failed!");
return 1;
}
return 0;
}
Isn't the max grid size 65535x65535x65535? What is the problem here?
Edit: it only crashes when I compile it with the -G flag. Otherwise it's just slow, but it doesn't exceed the 60 seconds.
Your code is simply taking too long (yes, longer than 60 seconds) to run.
Even though your kernel "does nothing", it still represents a __global__ function call. To facilitate it, a fair amount of preamble code gets generated by the compiler. Normally the compiler would optimize much of that preamble code away, since your function does nothing (e.g. it does nothing with the variable passed to it, which the preamble code makes available to each thread). However, when you pass the -G switch, you eliminate nearly all compiler optimizations. You can get a sense of the size of the code that is actually running for each threadblock by taking your executable and inspecting the code with cuobjdump -sass ...
Secondly, running code with cuda-memcheck usually increases execution time. The cuda-memcheck executive adjusts the order and reduces the rate at which threadblocks get executed, so it can do full analysis of the memory access pattern of each threadblock, among other things.
The net effect is that your empty kernel call, in part due to the very large grid (over 100 million threadblocks need to be processed), is taking longer than 60 seconds to execute. If you want to verify this, increase your TDR timeout to 5 minutes or 10 minutes, and eventually you will see the program return normally.
In my case, with -G and cuda-memcheck your program takes about 30 seconds to run on a Quadro5000 GPU, which has 11 SMs. Your cc2.1 GPU may have around 2 SMs, and so will run even slower than mine. If I compile without the -G switch, the runtime drops to about 2 seconds. If I compile with the -G switch, but run without cuda-memcheck, it takes about 4 seconds. If I eliminate the int a parameter from the kernel (which drastically reduces the preamble code), I can compile with -G and run with cuda-memcheck and it only takes 2 seconds.
Kernel machine code with -G and int a parameter:
Function : _Z4stami
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ ISUB R1, R1, 0x8; /* 0x4800c00020105d03 */
/*0010*/ S2R R0, SR_LMEMHIOFF; /* 0x2c000000dc001c04 */
/*0018*/ ISETP.GE.AND P0, PT, R1, R0, PT; /* 0x1b0e00000011dc23 */
/*0020*/ @P0 BRA 0x30; /* 0x40000000200001e7 */
/*0028*/ BPT.TRAP; /* 0xd00000000000c007 */
/*0030*/ IADD R0, R1, RZ; /* 0x48000000fc101c03 */
/*0038*/ MOV R2, R0; /* 0x2800000000009de4 */
/*0040*/ MOV R3, RZ; /* 0x28000000fc00dde4 */
/*0048*/ MOV R2, R2; /* 0x2800000008009de4 */
/*0050*/ MOV R3, R3; /* 0x280000000c00dde4 */
/*0058*/ MOV R4, c[0x0][0x4]; /* 0x2800400010011de4 */
/*0060*/ MOV R5, RZ; /* 0x28000000fc015de4 */
/*0068*/ IADD R2.CC, R2, R4; /* 0x4801000010209c03 */
/*0070*/ IADD.X R3, R3, R5; /* 0x480000001430dc43 */
/*0078*/ MOV32I R0, 0x20; /* 0x1800000080001de2 */
/*0080*/ LDC R0, c[0x0][R0]; /* 0x1400000000001c86 */
/*0088*/ IADD R2.CC, R2, RZ; /* 0x48010000fc209c03 */
/*0090*/ IADD.X R3, R3, RZ; /* 0x48000000fc30dc43 */
/*0098*/ MOV R2, R2; /* 0x2800000008009de4 */
/*00a0*/ MOV R3, R3; /* 0x280000000c00dde4 */
/*00a8*/ ST.E [R2], R0; /* 0x9400000000201c85 */
/*00b0*/ BRA 0xc8; /* 0x4000000040001de7 */
/*00b8*/ EXIT; /* 0x8000000000001de7 */
/*00c0*/ EXIT; /* 0x8000000000001de7 */
/*00c8*/ EXIT; /* 0x8000000000001de7 */
/*00d0*/ EXIT; /* 0x8000000000001de7 */
.........................
Kernel machine code with -G but without int a parameter:
Function : _Z4stamv
.headerflags @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ BRA 0x20; /* 0x4000000040001de7 */
/*0010*/ EXIT; /* 0x8000000000001de7 */
/*0018*/ EXIT; /* 0x8000000000001de7 */
/*0020*/ EXIT; /* 0x8000000000001de7 */
/*0028*/ EXIT; /* 0x8000000000001de7 */
.........................
I've just run your code with no problems on a C2050 (capability 2.0) under both cuda-memcheck and cuda-gdb. This suggests the problem is more likely related to your card/setup than the code itself.
If you were exceeding capability, you should get a launch error code, not a hang (you can check max sizes using deviceQuery SDK code if you're unsure).
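For instance, a minimal sketch that queries the actual limits at runtime, rather than relying on an assumed 65535x65535x65535:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("max grid: %d x %d x %d, max threads per block: %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2],
           prop.maxThreadsPerBlock);
    return 0;
}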
It may be that cuda-memcheck is trying to gain exclusive control of the GPU and is timing out because something else is using it (e.g. your X server). Does cuda-gdb work any better? Do these tools work for other codes?
I was under the impression that adding a positive zero to a negative zero should produce a positive zero. To quote IEEE 754-2008:
When the sum of two operands with opposite signs (or the difference of two operands with like signs) is exactly zero, the sign of that sum (or difference) shall be +0 in all rounding-direction attributes except roundTowardNegative; under that attribute, the sign of an exact zero sum (or difference) shall be −0. However, x + x = x − (−x) retains the same sign as x even when x is zero.
However, in the case of CUDA, it looks like the compiler is being too aggressive in optimizing away the addition of a positive zero in Release builds. Plain C/C++ (and C#/.NET) work as expected. I've looked at the PTX code produced by the compiler for different builds, and the add.f32 instruction is indeed missing in the Release build.
Am I missing anything here?
__global__ void convertToPositiveZero(float* dst, int size)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < size)
    {
        dst[index] += 0;
    }
}
// Host code
int size = 100;
float* zzh = (float*)malloc(size * sizeof(float));
zzh[0] = -0.0f;
zzh[1] = 0.0f;
assert(0x80000000 == *((int*)&zzh[0]));
if (0x80000000 != *((int*)&zzh[0]))
{
    printf("Expected negative zero.\n");
    exit(-1);
}
assert(0x00000000 == *((int*)&zzh[1]));

float* zzd;
cudaMalloc(&zzd, size * sizeof(float));
cudaMemcpy(zzd, zzh, size * sizeof(float), cudaMemcpyHostToDevice);
convertToPositiveZero<<<1, 100>>>(zzd, size);
cudaMemcpy(zzh, zzd, size * sizeof(float), cudaMemcpyDeviceToHost);
//zzh[0] += 0.0f;
assert(0x00000000 == *((int*)&zzh[0]));
if (0x00000000 != *((int*)&zzh[0]))
{
    printf("Expected positive zero.\n");
    exit(-1);
}
assert(0x00000000 == *((int*)&zzh[1]));
printf("Done.\n");
Your problem seems to be due to the optimizations carried out by nvcc when fusing FADD and FMUL into FMAD operations.
I was able to reproduce your problem in a Release build. The resulting disassembled code, compiled with CUDA 5.5 for sm_21, is
code for sm_21
Function : _Z21convertToPositiveZeroPfi
.headerflags @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100];
/*0008*/ S2R R0, SR_CTAID.X;
/*0010*/ S2R R2, SR_TID.X;
/*0018*/ IMAD R0, R0, c[0x0][0x8], R2;
/*0020*/ ISETP.GE.AND P0, PT, R0, c[0x0][0x28], PT;
/*0028*/ @P0 BRA.U 0x60;
/*0030*/ @!P0 MOV32I R3, 0x4;
/*0038*/ @!P0 IMAD R2.CC, R0, R3, c[0x0][0x20];
/*0040*/ @!P0 IMAD.HI.X R3, R0, R3, c[0x0][0x24];
/*0048*/ @!P0 LD.E R0, [R2];
/*0050*/ @!P0 F2F.F32.F32 R0, R0;
/*0058*/ @!P0 ST.E [R2], R0;
/*0060*/ EXIT;
As you also noticed from the PTX file, there is no floating-point add operation. Now, if you compile with the -fmad=false option, the disassembled code becomes
code for sm_21
Function : _Z21convertToPositiveZeroPfi
.headerflags @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100];
/*0008*/ S2R R0, SR_CTAID.X;
/*0010*/ S2R R2, SR_TID.X;
/*0018*/ IMAD R0, R0, c[0x0][0x8], R2;
/*0020*/ ISETP.GE.AND P0, PT, R0, c[0x0][0x28], PT;
/*0028*/ @P0 BRA.U 0x60;
/*0030*/ @!P0 MOV32I R3, 0x4;
/*0038*/ @!P0 IMAD R2.CC, R0, R3, c[0x0][0x20];
/*0040*/ @!P0 IMAD.HI.X R3, R0, R3, c[0x0][0x24];
/*0048*/ @!P0 LD.E R0, [R2];
/*0050*/ @!P0 FADD R0, R0, RZ;
/*0058*/ @!P0 ST.E [R2], R0;
/*0060*/ EXIT;
As you can see, the presence of an FADD operation is restored, and the "correct" sign of 0 is restored as well.
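As a narrower alternative to compiling the whole file with -fmad=false, one could use the __fadd_rn intrinsic, which the CUDA documentation guarantees is never contracted into an FMAD; the following is an untested sketch of the same kernel rewritten that way:

__global__ void convertToPositiveZeroIntrinsic(float* dst, int size)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < size)
    {
        // __fadd_rn performs a round-to-nearest add the compiler will
        // not fuse or fold away, so -0.0f + 0.0f yields +0.0f.
        dst[index] = __fadd_rn(dst[index], 0.0f);
    }
}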
When I declare device functions with __forceinline__, the linker outputs this information:
2> nvlink : info : Function properties for '_ZN3GPU4Flux4calcILj512EEEvv':
2> nvlink : info : used 28 registers, 456 stack, 15776 bytes smem, 320 bytes cmem[0], 0 bytes lmem
and without it the output is:
2> nvlink : info : Function properties for '_ZN3GPU4Flux4calcILj512EEEvv':
2> nvlink : info : used 23 registers, 216 stack, 15776 bytes smem, 320 bytes cmem[0], 0 bytes lmem
Why is the size of the stack frame smaller when __forceinline__ is not used?
How important is it to keep the stack frame as small as possible?
Thank you for your help.
The main reason to reduce the stack frame is that the stack is allocated in local memory, which resides in off-chip device memory. This makes access to the stack (if not cached) slow.
To show this, let me make a simple example. Consider the case:
__device__ __noinline__ void func(float* d_a, float* test, int tid) {
    d_a[tid] = test[tid] * d_a[tid];
}

__global__ void kernel_function(float* d_a) {
    float test[16];
    test[threadIdx.x] = threadIdx.x;
    func(d_a, test, threadIdx.x);
}
Note that the __device__ function is declared __noinline__. In this case
ptxas : info : Function properties for _Z15kernel_functionPf
64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 7 registers, 36 bytes cmem[0]
i.e., we have 64 bytes of stack frame. The corresponding disassembled code is
MOV R1, c[0x1][0x100];
ISUB R1, R1, 0x40;                reserves 0x40 = 64 bytes for the stack frame
S2R R6, SR_TID.X;                 R6 = threadIdx.x
MOV R4, c[0x0][0x20];             R4 = pointer to d_a
IADD R5, R1, c[0x0][0x4];         R5 = generic address of test on the stack
I2F.F32.U32 R2, R6;               R2 = R6 (integer to float conversion)
ISCADD R0, R6, R1, 0x2;           R0 = address of test[threadIdx.x]
STL [R0], R2;                     stores R2 to test[threadIdx.x]
CAL 0x50;                         calls the __device__ function
EXIT;                             end of kernel; the __device__ function part follows
ISCADD R2, R6, R5, 0x2;           R2 = address of test[tid]
ISCADD R3, R6, R4, 0x2;           R3 = address of d_a[tid]
LD R2, [R2];                      loads test[tid]
LD R0, [R3];                      loads d_a[tid]
FMUL R0, R2, R0;                  d_a[tid] = test[tid] * d_a[tid]
ST [R3], R0;                      stores the new value of d_a[tid]
RET;
As you can see, test is stored to and loaded from (off-chip) local memory, forming the stack frame (16 floats = 64 bytes).
Now change the device function to
__device__ __forceinline__ void func(float* d_a, float* test, int tid) {
    d_a[tid] = test[tid] * d_a[tid];
}
that is, change the __device__ function from __noinline__ to __forceinline__. In this case, we have
ptxas : info : Compiling entry function '_Z15kernel_functionPf' for 'sm_20'
ptxas : info : Function properties for _Z15kernel_functionPf
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
i.e., we have an empty stack frame now. Indeed, the disassembled code becomes:
MOV R1, c[0x1][0x100];
S2R R2, SR_TID.X;                 R2 = threadIdx.x
ISCADD R3, R2, c[0x0][0x20], 0x2; R3 = address of d_a[threadIdx.x]
I2F.F32.U32 R2, R2;               R2 = R2 (integer to float conversion)
LD R0, [R3];                      R0 = d_a[threadIdx.x] (load from global memory)
FMUL R0, R2, R0;                  d_a[threadIdx.x] = d_a[threadIdx.x] * threadIdx.x
ST [R3], R0;                      stores the new value of d_a[threadIdx.x] to global memory
EXIT;
As you can see, forcing the inlining enables the compiler to perform proper optimizations, so that test is now fully discarded from the code.
In the above example, __forceinline__ has an effect opposite to what you are experiencing, which also shows that, without further information, your first question cannot be answered.
Edit: this question is a re-done version of the original, so the first several responses may no longer be relevant.
I'm curious about what impact a device function call with forced no-inlining has on synchronization within a device function. I have a simple test kernel that illustrates the behavior in question.
The kernel takes a buffer and passes it to a device function, along with a shared buffer and an indicator variable which identifies a single thread as the "boss" thread. The device function has divergent code: the boss thread first spends time doing trivial operations on the shared buffer, then writes to the global buffer. After a synchronization call, all threads write to the global buffer. After the kernel call, the host prints the contents of the global buffer. Here is the code:
CUDA CODE:
test_main.cu
#include <cutil_inline.h>
#include "test_kernel.cu"

int main()
{
    int scratchBufferLength = 100;
    int *scratchBuffer;
    int *d_scratchBuffer;

    int b = 1;
    int t = 64;

    // copy scratch buffer to device
    scratchBuffer = (int *)calloc(scratchBufferLength, sizeof(int));
    cutilSafeCall( cudaMalloc(&d_scratchBuffer,
        sizeof(int) * scratchBufferLength) );
    cutilSafeCall( cudaMemcpy(d_scratchBuffer, scratchBuffer,
        sizeof(int) * scratchBufferLength, cudaMemcpyHostToDevice) );

    // kernel call
    testKernel<<<b, t>>>(d_scratchBuffer);
    cudaThreadSynchronize();

    // copy data back to host
    cutilSafeCall( cudaMemcpy(scratchBuffer, d_scratchBuffer,
        sizeof(int) * scratchBufferLength, cudaMemcpyDeviceToHost) );

    // print results
    printf("Scratch buffer contents: \t");
    for (int i = 0; i < scratchBufferLength; ++i)
    {
        if (i % 25 == 0)
            printf("\n");
        printf("%d ", scratchBuffer[i]);
    }
    printf("\n");

    // cleanup
    cudaFree(d_scratchBuffer);
    free(scratchBuffer);

    return 0;
}
test_kernel.cu
#ifndef __TEST_KERNEL_CU
#define __TEST_KERNEL_CU

#define IS_BOSS() (threadIdx.x == blockDim.x - 1)

__device__
__noinline__
void testFunc(int *sA, int *scratchBuffer, bool isBoss) {

    if(isBoss) {  // produces unexpected output -- "broken" code
    //if(IS_BOSS()) {  // produces expected output -- "working" code
        for (int c = 0; c < 10000; c++) {
            sA[0] = 1;
        }
    }

    if(isBoss) {
        scratchBuffer[0] = 1;
    }

    __syncthreads();

    scratchBuffer[threadIdx.x] = threadIdx.x;

    return;
}

__global__
void testKernel(int *scratchBuffer)
{
    __shared__ int sA[4];

    bool isBoss = IS_BOSS();

    testFunc(sA, scratchBuffer, isBoss);

    return;
}
#endif
I compiled this code from within the CUDA SDK to take advantage of the cutilSafeCall() functions in test_main.cu, but of course these could be taken out if you'd like to compile outside the SDK. I compiled with CUDA Driver/Toolkit version 4.0 for compute capability 2.0, and the code was run on a GeForce GTX 480, which has the Fermi architecture.
The expected output is
0 1 2 3 ... blockDim.x-1
However, the output I get is
1 1 2 3 ... blockDim.x-1
This seems to indicate that the boss thread executed the conditional "scratchBuffer[0] = 1;" statement AFTER all threads executed the "scratchBuffer[threadIdx.x] = threadIdx.x;" statement, even though they are separated by a __syncthreads() barrier.
This occurs even if the boss thread is instructed to write a sentinel value into the buffer position of a thread in its own warp; the sentinel is the final value present in the buffer, rather than the appropriate threadIdx.x.
One modification that causes the code to produce expected output is to change the conditional statement
if(isBoss) {
to
if(IS_BOSS()) {
i.e., to change the divergence-controlling variable from being stored in a parameter register to being computed in a macro function. (Note the comments on the appropriate lines in the source code.) It's this particular change I've been focusing on to try to track down the problem. In looking at the disassembled .cubins of the kernel with the 'isBoss' conditional (i.e., broken code) and the 'IS_BOSS()' conditional (i.e., working code), the most conspicuous difference in the instructions seems to be the absence of an SSY instruction in the disassembled broken code.
Here are the disassembled kernels generated by disassembling the .cubin files with "cuobjdump -sass test_kernel.cubin". Everything up to the first 'EXIT' is the kernel, and everything after that is the device function. The only differences are in the device function.
DISASSEMBLED OBJECT CODE:
"broken" code
code for sm_20
Function : _Z10testKernelPi
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x20009de428004000*/ MOV R2, c [0x0] [0x8];
/*0010*/ /*0x84001c042c000000*/ S2R R0, SR_Tid_X;
/*0018*/ /*0xfc015de428000000*/ MOV R5, RZ;
/*0020*/ /*0x00011de428004000*/ MOV R4, c [0x0] [0x0];
/*0028*/ /*0xfc209c034800ffff*/ IADD R2, R2, 0xfffff;
/*0030*/ /*0x9001dde428004000*/ MOV R7, c [0x0] [0x24];
/*0038*/ /*0x80019de428004000*/ MOV R6, c [0x0] [0x20];
/*0040*/ /*0x08001c03110e0000*/ ISET.EQ.U32.AND R0, R0, R2, pt;
/*0048*/ /*0x01221f841c000000*/ I2I.S32.S32 R8, -R0;
/*0050*/ /*0x2001000750000000*/ CAL 0x60;
/*0058*/ /*0x00001de780000000*/ EXIT;
/*0060*/ /*0x20201e841c000000*/ I2I.S32.S8 R0, R8;
/*0068*/ /*0xfc01dc231a8e0000*/ ISETP.NE.AND P0, pt, R0, RZ, pt;
/*0070*/ /*0xc00021e740000000*/ @!P0 BRA 0xa8;
/*0078*/ /*0xfc001de428000000*/ MOV R0, RZ;
/*0080*/ /*0x04001c034800c000*/ IADD R0, R0, 0x1;
/*0088*/ /*0x04009de218000000*/ MOV32I R2, 0x1;
/*0090*/ /*0x4003dc231a8ec09c*/ ISETP.NE.AND P1, pt, R0, 0x2710, pt;
/*0098*/ /*0x00409c8594000000*/ ST.E [R4], R2;
/*00a0*/ /*0x600005e74003ffff*/ @P1 BRA 0x80;
/*00a8*/ /*0x040001e218000000*/ @P0 MOV32I R0, 0x1;
/*00b0*/ /*0x0060008594000000*/ @P0 ST.E [R6], R0;
/*00b8*/ /*0xffffdc0450ee0000*/ BAR.RED.POPC RZ, RZ;
/*00c0*/ /*0x84001c042c000000*/ S2R R0, SR_Tid_X;
/*00c8*/ /*0x10011c03200dc000*/ IMAD.U32.U32 R4.CC, R0, 0x4, R6;
/*00d0*/ /*0x10009c435000c000*/ IMUL.U32.U32.HI R2, R0, 0x4;
/*00d8*/ /*0x08715c4348000000*/ IADD.X R5, R7, R2;
/*00e0*/ /*0x00401c8594000000*/ ST.E [R4], R0;
/*00e8*/ /*0x00001de790000000*/ RET;
.................................
"working" code
code for sm_20
Function : _Z10testKernelPi
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x20009de428004000*/ MOV R2, c [0x0] [0x8];
/*0010*/ /*0x84001c042c000000*/ S2R R0, SR_Tid_X;
/*0018*/ /*0xfc015de428000000*/ MOV R5, RZ;
/*0020*/ /*0x00011de428004000*/ MOV R4, c [0x0] [0x0];
/*0028*/ /*0xfc209c034800ffff*/ IADD R2, R2, 0xfffff;
/*0030*/ /*0x9001dde428004000*/ MOV R7, c [0x0] [0x24];
/*0038*/ /*0x80019de428004000*/ MOV R6, c [0x0] [0x20];
/*0040*/ /*0x08001c03110e0000*/ ISET.EQ.U32.AND R0, R0, R2, pt;
/*0048*/ /*0x01221f841c000000*/ I2I.S32.S32 R8, -R0;
/*0050*/ /*0x2001000750000000*/ CAL 0x60;
/*0058*/ /*0x00001de780000000*/ EXIT;
/*0060*/ /*0x20009de428004000*/ MOV R2, c [0x0] [0x8];
/*0068*/ /*0x8400dc042c000000*/ S2R R3, SR_Tid_X;
/*0070*/ /*0x20201e841c000000*/ I2I.S32.S8 R0, R8;
/*0078*/ /*0x4000000760000001*/ SSY 0xd0;
/*0080*/ /*0xfc209c034800ffff*/ IADD R2, R2, 0xfffff;
/*0088*/ /*0x0831dc031a8e0000*/ ISETP.NE.U32.AND P0, pt, R3, R2, pt;
/*0090*/ /*0xc00001e740000000*/ @P0 BRA 0xc8;
/*0098*/ /*0xfc009de428000000*/ MOV R2, RZ;
/*00a0*/ /*0x04209c034800c000*/ IADD R2, R2, 0x1;
/*00a8*/ /*0x04021de218000000*/ MOV32I R8, 0x1;
/*00b0*/ /*0x4021dc231a8ec09c*/ ISETP.NE.AND P0, pt, R2, 0x2710, pt;
/*00b8*/ /*0x00421c8594000000*/ ST.E [R4], R8;
/*00c0*/ /*0x600001e74003ffff*/ @P0 BRA 0xa0;
/*00c8*/ /*0xfc01dc33190e0000*/ ISETP.EQ.AND.S P0, pt, R0, RZ, pt;
/*00d0*/ /*0x040021e218000000*/ @!P0 MOV32I R0, 0x1;
/*00d8*/ /*0x0060208594000000*/ @!P0 ST.E [R6], R0;
/*00e0*/ /*0xffffdc0450ee0000*/ BAR.RED.POPC RZ, RZ;
/*00e8*/ /*0x10311c03200dc000*/ IMAD.U32.U32 R4.CC, R3, 0x4, R6;
/*00f0*/ /*0x10309c435000c000*/ IMUL.U32.U32.HI R2, R3, 0x4;
/*00f8*/ /*0x84001c042c000000*/ S2R R0, SR_Tid_X;
/*0100*/ /*0x08715c4348000000*/ IADD.X R5, R7, R2;
/*0108*/ /*0x00401c8594000000*/ ST.E [R4], R0;
/*0110*/ /*0x00001de790000000*/ RET;
.................................
The "SSY" instruction is present in the working code but not the broken code. The cuobjdump manual describes the instruction with, "Set synchronization point; used before potentially divergent instructions." This makes me think that for some reason the compiler does not recognize the possibility of divergence in the broken code.
I also found that if I comment out the __noinline__ directive, the code produces the expected output, and indeed the assembly produced by the otherwise "broken" and "working" versions is exactly identical. So this makes me think that when a variable is passed via the call stack, that variable cannot be used to control divergence and a subsequent synchronization call; the compiler does not seem to recognize the possibility of divergence in that case, and therefore does not insert an SSY instruction. Does anyone know whether this is indeed a legitimate limitation of CUDA and, if so, whether it is documented anywhere?
Thanks in advance.
This appears to have simply been a compiler bug, fixed in CUDA 4.1/4.2; it does not reproduce for the asker on CUDA 4.2.