I cannot find a document that explains the following instruction format in CUDA:
FMAD R6, -R6, c [0x1] [0x1], R5;
What is the format (source, destination, ...) and what is that -R6?
The PTX reference guide describes fma as follows
fma.rnd{.ftz}{.sat}.f32 d, a, b, c;
fma.rnd.f64 d, a, b, c;
performs
d = a*b + c;
in either single or double precision.
You are looking at disassembled SASS. The instruction set reference shows FMAD as the (non IEEE 754 compliant) single precision multiply-add from the GT200 instruction set. That is a little problematic, because I don't presently have a toolchain which supports that deprecated instruction set. However, if I use the Fermi instruction set instead and compile this kernel:
__global__ void kernel(const float *x, const float *y, float *a)
{
float xval = x[threadIdx.x];
float yval = y[threadIdx.x];
float aval = -xval * xval + yval;
a[threadIdx.x] = aval;
}
I get this SASS:
code for sm_20
Function : _Z6kernelPKfS0_Pf
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R3, SR_TID.X; /* 0x2c0000008400dc04 */
/*0010*/ MOV32I R5, 0x4; /* 0x1800000010015de2 */
/*0018*/ IMAD.U32.U32 R8.CC, R3, R5, c[0x0][0x20]; /* 0x200b800080321c03 */
/*0020*/ IMAD.U32.U32.HI.X R9, R3, R5, c[0x0][0x24]; /* 0x208a800090325c43 */
/*0028*/ IMAD.U32.U32 R6.CC, R3, R5, c[0x0][0x28]; /* 0x200b8000a0319c03 */
/*0030*/ LD.E R0, [R8]; /* 0x8400000000801c85 */
/*0038*/ IMAD.U32.U32.HI.X R7, R3, R5, c[0x0][0x2c]; /* 0x208a8000b031dc43 */
/*0040*/ IMAD.U32.U32 R4.CC, R3, R5, c[0x0][0x30]; /* 0x200b8000c0311c03 */
/*0048*/ LD.E R2, [R6]; /* 0x8400000000609c85 */
/*0050*/ IMAD.U32.U32.HI.X R5, R3, R5, c[0x0][0x34]; /* 0x208a8000d0315c43 */
/*0058*/ FFMA.FTZ R0, -R0, R0, R2; /* 0x3004000000001e40 */
/*0060*/ ST.E [R4], R0; /* 0x9400000000401c85 */
/*0068*/ EXIT; /* 0x8000000000001de7 */
..................................
Note that I also have the negated register in the FFMA.FTZ arguments. So I would guess that your:
FMAD R6, -R6, c [0x1] [0x1], R5;
is the equivalent of
R6 = -R6 * const + R5
where c [0x1] [0x1] is a compile-time constant, and that the GPU has some sort of instruction operand modifier which it can set to control negation of a floating point value as part of a floating point operation, without explicitly twiddling the sign bit of the register beforehand.
(I look forward to @njuffa tearing this answer to shreds.)
I have the following kernel performing a simple copy of a global memory matrix indata into a global memory matrix outdata:
__global__ void simple_copy(float *outdata, const float *indata){
int x = blockIdx.x * TILE_DIM + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y;
int width = gridDim.x * TILE_DIM;
outdata[y*width + x] = indata[y*width + x];
}
I'm inspecting the disassembled microcode dumped by cuobjdump:
Function : _Z11simple_copyPfPKf
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x80001de218000000*/ MOV32I R0, 0x20; R0 = TILE_DIM
/*0010*/ /*0x00001c8614000000*/ LDC R0, c [0x0] [R0]; R0 = c
/*0018*/ /*0x90009de218000000*/ MOV32I R2, 0x24; R2 = 36
/*0020*/ /*0x00209c8614000000*/ LDC R2, c [0x0] [R2]; R2 = c
int x = blockIdx.x * TILE_DIM + threadIdx.x;
/*0028*/ /*0x9400dc042c000000*/ S2R R3, SR_CTAid_X; R3 = BlockIdx.x
/*0030*/ /*0x0c00dde428000000*/ MOV R3, R3; R3 = R3 ???
/*0038*/ /*0x84011c042c000000*/ S2R R4, SR_Tid_X; R4 = ThreadIdx.x
/*0040*/ /*0x10011de428000000*/ MOV R4, R4; R4 = R4 ???
/*0048*/ /*0x8030dca32008c000*/ IMAD R3, R3, 0x20, R4; R3 = R3 * TILE_DIM + R4 (contains x)
int y = blockIdx.y * TILE_DIM + threadIdx.y;
/*0050*/ /*0x98011c042c000000*/ S2R R4, SR_CTAid_Y;
/*0058*/ /*0x10011de428000000*/ MOV R4, R4;
/*0060*/ /*0x88015c042c000000*/ S2R R5, SR_Tid_Y;
/*0068*/ /*0x14015de428000000*/ MOV R5, R5;
/*0070*/ /*0x80411ca3200ac000*/ IMAD R4, R4, 0x20, R5; R4 ... (contains y)
int width = gridDim.x * TILE_DIM;
/*0078*/ /*0x50015de428004000*/ MOV R5, c [0x0] [0x14]; R5 = c
/*0080*/ /*0x80515ca35000c000*/ IMUL R5, R5, 0x20; R5 = R5 * TILE_DIM (contains width)
y*width + x
/*0088*/ /*0x14419ca320060000*/ IMAD R6, R4, R5, R3; R6 = R4 * R5 + R3 (contains y*width+x)
Loads indata[y*width + x]
/*0090*/ /*0x08619c036000c000*/ SHL R6, R6, 0x2;
/*0098*/ /*0x18209c0348000000*/ IADD R2, R2, R6;
/*00a0*/ /*0x08009de428000000*/ MOV R2, R2; R2 = R2 ???
/*00a8*/ /*0x00209c8580000000*/ LD R2, [R2]; Load from memory - R2 = indata[y*width + x]
Stores outdata[y*width + x]
/*00b0*/ /*0x1440dca320060000*/ IMAD R3, R4, R5, R3;
/*00b8*/ /*0x0830dc036000c000*/ SHL R3, R3, 0x2;
/*00c0*/ /*0x0c001c0348000000*/ IADD R0, R0, R3; R0 = R0 + R3
/*00c8*/ /*0x00001de428000000*/ MOV R0, R0; R0 = R0 ???
/*00d0*/ /*0x00009c8590000000*/ ST [R0], R2; Store to memory
/*00d8*/ /*0x40001de740000000*/ BRA 0xf0;
/*00e0*/ /*0x00001de780000000*/ EXIT;
/*00e8*/ /*0x00001de780000000*/ EXIT;
/*00f0*/ /*0x00001de780000000*/ EXIT;
/*00f8*/ /*0x00001de780000000*/ EXIT;
The comments on top or aside of the disassembled code are my own.
As you can see, there are some apparently useless operations, marked by ??? in the comments. Essentially, they are moves of registers into themselves.
I have then the two following questions:
If they really are useless, they are needlessly consuming execution time. Can I optimize the disassembled microcode by removing them?
PTX files can be inlined in CUDA codes. However, PTX is just an intermediate language needed for portability across GPUs. Can I somehow "inline" an optimized disassembled microcode?
Thank you very much in advance.
EDIT: THE SAME CODE COMPILED IN RELEASE MODE FOR SM = 2.0
Function : _Z11simple_copyPfPKf
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R0, SR_CTAID.Y; /* 0x2c00000098001c04 */
/*0010*/ S2R R2, SR_TID.Y; /* 0x2c00000088009c04 */
/*0018*/ S2R R3, SR_CTAID.X; /* 0x2c0000009400dc04 */
/*0020*/ S2R R4, SR_TID.X; /* 0x2c00000084011c04 */
/*0028*/ MOV R5, c[0x0][0x14]; /* 0x2800400050015de4 */
/*0030*/ ISCADD R2, R0, R2, 0x5; /* 0x4000000008009ca3 */
/*0038*/ ISCADD R3, R3, R4, 0x5; /* 0x400000001030dca3 */
/*0040*/ SHL R0, R5, 0x5; /* 0x6000c00014501c03 */
/*0048*/ IMAD R2, R0, R2, R3; /* 0x2006000008009ca3 */
/*0050*/ ISCADD R0, R2, c[0x0][0x24], 0x2; /* 0x4000400090201c43 */
/*0058*/ ISCADD R2, R2, c[0x0][0x20], 0x2; /* 0x4000400080209c43 */
/*0060*/ LD R0, [R0]; /* 0x8000000000001c85 */
/*0068*/ ST [R2], R0; /* 0x9000000000201c85 */
/*0070*/ EXIT ; /* 0x8000000000001de7 */
EDIT: THE SAME CODE COMPILED IN RELEASE MODE FOR SM = 2.1
Function : _Z11simple_copyPfPKf
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ NOP; /* 0x4000000000001de4 */
/*0010*/ MOV R0, c[0x0][0x14]; /* 0x2800400050001de4 */
/*0018*/ S2R R2, SR_CTAID.Y; /* 0x2c00000098009c04 */
/*0020*/ SHL R0, R0, 0x5; /* 0x6000c00014001c03 */
/*0028*/ S2R R3, SR_TID.Y; /* 0x2c0000008800dc04 */
/*0030*/ ISCADD R3, R2, R3, 0x5; /* 0x400000000c20dca3 */
/*0038*/ S2R R4, SR_CTAID.X; /* 0x2c00000094011c04 */
/*0040*/ S2R R5, SR_TID.X; /* 0x2c00000084015c04 */
/*0048*/ ISCADD R2, R4, R5, 0x5; /* 0x4000000014409ca3 */
/*0050*/ IMAD R2, R0, R3, R2; /* 0x200400000c009ca3 */
/*0058*/ ISCADD R0, R2, c[0x0][0x24], 0x2; /* 0x4000400090201c43 */
/*0060*/ ISCADD R2, R2, c[0x0][0x20], 0x2; /* 0x4000400080209c43 */
/*0068*/ LD R0, [R0]; /* 0x8000000000001c85 */
/*0070*/ ST [R2], R0; /* 0x9000000000201c85 */
/*0078*/ EXIT ; /* 0x8000000000001de7 */
The answer to both questions is no.
If you try to delete instructions from the final binary payload, you will change the length of the code sections and break the ELF and fatbinary files. Fixing that would require hand-crafting headers whose formats are not readily documented, which sounds like a lot of work just to optimize away a couple of instructions.
And inline native assembler is not supported, but I am sure you knew that already.
And finally, I can't reproduce using CUDA 5.0:
Fatbin elf code:
================
arch = sm_20
code version = [1,6]
producer = cuda
host = mac
compile_size = 32bit
identifier = pumpkinhead.cu
code for sm_20
Function : _Z11simple_copyPfPKf
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x98001c042c000000*/ S2R R0, SR_CTAid_Y;
/*0010*/ /*0x88009c042c000000*/ S2R R2, SR_Tid_Y;
/*0018*/ /*0x9400dc042c000000*/ S2R R3, SR_CTAid_X;
/*0020*/ /*0x84011c042c000000*/ S2R R4, SR_Tid_X;
/*0028*/ /*0x08001ca340000000*/ ISCADD R0, R0, R2, 0x5;
/*0030*/ /*0x10309ca340000000*/ ISCADD R2, R3, R4, 0x5;
/*0038*/ /*0x50001ca350004000*/ IMUL R0, R0, c [0x0] [0x14];
/*0040*/ /*0x08009ca340000000*/ ISCADD R2, R0, R2, 0x5;
/*0048*/ /*0x90201c4340004000*/ ISCADD R0, R2, c [0x0] [0x24], 0x2;
/*0050*/ /*0x80209c4340004000*/ ISCADD R2, R2, c [0x0] [0x20], 0x2;
/*0058*/ /*0x00001c8580000000*/ LD R0, [R0];
/*0060*/ /*0x00201c8590000000*/ ST [R2], R0;
/*0068*/ /*0x00001de780000000*/ EXIT;
.....................................
Are you sure the code you have shown was compiled with release settings?
I'm trying to understand how to use __threadfence(), as it seems like a powerful synchronization primitive that lets different blocks work together without going through the huge hassle of ending a kernel and starting a new one. The CUDA C Programming Guide has an example of it (Appendix B.5), which is fleshed out in the "threadFenceReduction" sample in the SDK, so it seems like something we "should" be using.
However, when I have tried using __threadfence(), it is shockingly slow. See the code below for an example. From what I understand, __threadfence() should just make sure that all pending memory transfers from the current thread block are finished before proceeding. Memory latency is somewhat better than a microsecond, I believe, so the total time to deal with the 64KB of memory transfers in the included code, on a GTX 680, should be somewhere around a microsecond. Instead, the __threadfence() instruction seems to take around 20 microseconds! Instead of using __threadfence() to synchronize, I can end the kernel and launch an entirely new kernel (in the same, default, stream so that it is synchronized) in less than a third of the time!
What is going on here? Does my code have a bug in it that I'm not noticing? Or is __threadfence() really 20x slower than it should be, and 6x slower than an entire kernel launch+cleanup?
Time for 1000 runs of the threadfence kernel: 27.716831 ms
Answer: 120
Time for 1000 runs of just the first 3 lines, including threadfence: 25.962912 ms
Synchronizing without threadfence, by splitting to two kernels: 7.653344 ms
Answer: 120
#include "cuda.h"
#include <cstdio>
__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__device__ int scratch[16];
__device__ int junk[16000];
__device__ int answer;
__global__ void usethreadfence() //just like the code example in B.5 of the CUDA C Programming Guide
{
if (threadIdx.x==0) scratch[blockIdx.x]=blockIdx.x;
junk[threadIdx.x+blockIdx.x*1000]=17+threadIdx.x; //do some more memory writes to make the kernel nontrivial
__threadfence();
if (threadIdx.x==0) {
unsigned int value = atomicInc(&count, gridDim.x);
isLastBlockDone = (value == (gridDim.x - 1));
}
__syncthreads();
if (isLastBlockDone && threadIdx.x==0) {
// The last block sums the results stored in scratch[0 .. gridDim.x-1]
int sum=0;
for (int i=0;i<gridDim.x;i++) sum+=scratch[i];
answer=sum;
}
}
__global__ void justthreadfence() //first three lines of the previous kernel, so we can compare speeds
{
if (threadIdx.x==0) scratch[blockIdx.x]=blockIdx.x;
junk[threadIdx.x+blockIdx.x*1000]=17+threadIdx.x;
__threadfence();
}
__global__ void usetwokernels_1() //this and the next kernel reproduce the functionality of the first kernel, but faster!
{
if (threadIdx.x==0) scratch[blockIdx.x]=blockIdx.x;
junk[threadIdx.x+blockIdx.x*1000]=17+threadIdx.x;
}
__global__ void usetwokernels_2()
{
if (threadIdx.x==0) {
int sum=0;
for (int i=0;i<gridDim.x;i++) sum+=scratch[i];
answer=sum;
}
}
int main() {
int sum;
cudaEvent_t start, stop; float time; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0);
for (int i=0;i<1000;i++) usethreadfence<<<16,1000>>>();
cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop); printf ("Time for 1000 runs of the threadfence kernel: %f ms\n", time); cudaEventDestroy(start); cudaEventDestroy(stop);
cudaMemcpyFromSymbol(&sum,answer,sizeof(int)); printf("Answer: %d\n",sum);
cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0);
for (int i=0;i<1000;i++) justthreadfence<<<16,1000>>>();
cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop); printf ("Time for 1000 runs of just the first 3 lines, including threadfence: %f ms\n", time); cudaEventDestroy(start); cudaEventDestroy(stop);
cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0);
for (int i=0;i<1000;i++) {usetwokernels_1<<<16,1000>>>(); usetwokernels_2<<<16,1000>>>();}
cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop); printf ("Synchronizing without threadfence, by splitting to two kernels: %f ms\n", time); cudaEventDestroy(start); cudaEventDestroy(stop);
cudaMemcpyFromSymbol(&sum,answer,sizeof(int)); printf("Answer: %d\n",sum);
}
I have tested your code, compiled with CUDA 6.0, on two different cards: a GT540M (Fermi) and a K20c (Kepler). These are the results:
GT540M
Time for 1000 runs of the threadfence kernel: 303.373688 ms
Answer: 120
Time for 1000 runs of just the first 3 lines, including threadfence: 300.395416 ms
Synchronizing without threadfence, by splitting to two kernels: 597.729919 ms
Answer: 120
Kepler K20c
Time for 1000 runs of the threadfence kernel: 10.164096 ms
Answer: 120
Time for 1000 runs of just the first 3 lines, including threadfence: 8.808896 ms
Synchronizing without threadfence, by splitting to two kernels: 17.330784 ms
Answer: 120
I do not observe any particularly slow behavior of __threadfence() compared to the other two cases.
This can be explained by looking at the disassembled code.
usethreadfence()
c[0xe][0x0] = scratch
c[0xe][0x4] = junk
c[0xe][0xc] = count
c[0x0][0x14] = gridDim.x
/*0000*/ MOV R1, c[0x1][0x100];
/*0008*/ S2R R0, SR_TID.X; R0 = threadIdx.x
/*0010*/ ISETP.NE.AND P0, PT, R0, RZ, PT; P0 = (R0 != 0)
/*0018*/ S2R R5, SR_CTAID.X; R5 = blockIdx.x
/*0020*/ IMAD R3, R5, 0x3e8, R0; R3 = R5 * 1000 + R0 = threadIdx.x + blockIdx.x * 1000
if (threadIdx.x == 0)
/*0028*/ #!P0 ISCADD R2, R5, c[0xe][0x0], 0x2; R2 = scratch + 4 * blockIdx.x
/*0030*/ IADD R4, R0, 0x11; R4 = R0 + 17 = threadIdx.x + 17
/*0038*/ ISCADD R3, R3, c[0xe][0x4], 0x2; R3 = junk + threadIdx.x + blockIdx.x * 1000
/*0040*/ #!P0 ST [R2], R5; scratch[blockIdx.x] = blockIdx.x
/*0048*/ ST [R3], R4; junk[threadIdx.x + blockIdx.x * 1000] = threadIdx.x + 17
/*0050*/ MEMBAR.GL; __threadfence
/*0058*/ #P0 BRA.U 0x98; if (threadIdx.x != 0) branch to 0x98
if (threadIdx.x == 0)
/*0060*/ #!P0 MOV R2, c[0xe][0xc]; R2 = &count
/*0068*/ #!P0 MOV R3, c[0x0][0x14]; R3 = gridDim.x
/*0070*/ #!P0 ATOM.INC R2, [R2], R3; R2 = value = old count; count incremented (atomicInc)
/*0078*/ #!P0 IADD R3, R3, -0x1; R3 = R3 - 1 = gridDim.x - 1
/*0080*/ #!P0 ISETP.EQ.AND P1, PT, R2, R3, PT; P1 = (R2 == R3) = (value == (gridDim.x - 1))
/*0088*/ #!P0 SEL R2, RZ, 0x1, !P1; if (!P1) R2 = RZ otherwise R2 = 1 (R2 = isLastBlockDone)
/*0090*/ #!P0 STS.U8 [RZ], R2; Stores R2 (i.e., isLastBlockDone) to shared memory to [0]
/*0098*/ ISETP.EQ.AND P0, PT, R0, RZ, PT; P0 = (R0 == 0) = (threadIdx.x == 0)
/*00a0*/ BAR.RED.POPC RZ, RZ, RZ, PT; __syncthreads()
/*00a8*/ LDS.U8 R0, [RZ]; R0 = R2 = isLastBlockDone
/*00b0*/ ISETP.NE.AND P0, PT, R0, RZ, P0; P0 = (R0 != 0) && P0 = (isLastBlockDone && threadIdx.x == 0)
/*00b8*/ #!P0 EXIT; exit unless (isLastBlockDone && threadIdx.x == 0)
/*00c0*/ ISETP.NE.AND P0, PT, RZ, c[0x0][0x14], PT; IMPLEMENTING THE FOR LOOP WITH A LOOP UNROLL OF 4
/*00c8*/ MOV R0, RZ;
/*00d0*/ #!P0 BRA 0x1b8;
/*00d8*/ MOV R2, c[0x0][0x14];
/*00e0*/ ISETP.GT.AND P0, PT, R2, 0x3, PT;
/*00e8*/ MOV R2, RZ;
/*00f0*/ #!P0 BRA 0x170;
/*00f8*/ MOV R3, c[0x0][0x14];
/*0100*/ IADD R7, R3, -0x3;
/*0108*/ NOP;
/*0110*/ ISCADD R3, R2, c[0xe][0x0], 0x2;
/*0118*/ IADD R2, R2, 0x4;
/*0120*/ LD R4, [R3];
/*0128*/ ISETP.LT.U32.AND P0, PT, R2, R7, PT;
/*0130*/ LD R5, [R3+0x4];
/*0138*/ LD R6, [R3+0x8];
/*0140*/ LD R3, [R3+0xc];
/*0148*/ IADD R0, R4, R0;
/*0150*/ IADD R0, R5, R0;
/*0158*/ IADD R0, R6, R0;
/*0160*/ IADD R0, R3, R0;
/*0168*/ #P0 BRA 0x110;
/*0170*/ ISETP.LT.U32.AND P0, PT, R2, c[0x0][0x14], PT;
/*0178*/ #!P0 BRA 0x1b8;
/*0180*/ ISCADD R3, R2, c[0xe][0x0], 0x2;
/*0188*/ IADD R2, R2, 0x1;
/*0190*/ LD R3, [R3];
/*0198*/ ISETP.LT.U32.AND P0, PT, R2, c[0x0][0x14], PT;
/*01a0*/ NOP;
/*01a8*/ IADD R0, R3, R0;
/*01b0*/ #P0 BRA 0x180;
/*01b8*/ MOV R2, c[0xe][0x8];
/*01c0*/ ST [R2], R0;
/*01c8*/ EXIT;
justthreadfence()
Function : _Z15justthreadfencev
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R3, SR_TID.X; /* 0x2c0000008400dc04 */
/*0010*/ ISETP.NE.AND P0, PT, R3, RZ, PT; /* 0x1a8e0000fc31dc23 */
/*0018*/ S2R R4, SR_CTAID.X; /* 0x2c00000094011c04 */
/*0020*/ IMAD R2, R4, 0x3e8, R3; /* 0x2006c00fa0409ca3 */
/*0028*/ #!P0 ISCADD R0, R4, c[0xe][0x0], 0x2; /* 0x4000780000402043 */
/*0030*/ IADD R3, R3, 0x11; /* 0x4800c0004430dc03 */
/*0038*/ ISCADD R2, R2, c[0xe][0x4], 0x2; /* 0x4000780010209c43 */
/*0040*/ #!P0 ST [R0], R4; /* 0x9000000000012085 */
/*0048*/ ST [R2], R3; /* 0x900000000020dc85 */
/*0050*/ MEMBAR.GL; /* 0xe000000000001c25 */
/*0058*/ EXIT; /* 0x8000000000001de7 */
usetwokernels_1()
Function : _Z15usetwokernels_1v
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R0, SR_TID.X; /* 0x2c00000084001c04 */
/*0010*/ ISETP.NE.AND P0, PT, R0, RZ, PT; /* 0x1a8e0000fc01dc23 */
/*0018*/ S2R R2, SR_CTAID.X; /* 0x2c00000094009c04 */
/*0020*/ IMAD R4, R2, 0x3e8, R0; /* 0x2000c00fa0211ca3 */
/*0028*/ #!P0 ISCADD R3, R2, c[0xe][0x0], 0x2; /* 0x400078000020e043 */
/*0030*/ IADD R0, R0, 0x11; /* 0x4800c00044001c03 */
/*0038*/ ISCADD R4, R4, c[0xe][0x4], 0x2; /* 0x4000780010411c43 */
/*0040*/ #!P0 ST [R3], R2; /* 0x900000000030a085 */
/*0048*/ ST [R4], R0; /* 0x9000000000401c85 */
/*0050*/ EXIT; /* 0x8000000000001de7 */
.....................................
usetwokernels_2()
Function : _Z15usetwokernels_2v
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R0, SR_TID.X; /* 0x2c00000084001c04 */
/*0010*/ ISETP.NE.AND P0, PT, R0, RZ, PT; /* 0x1a8e0000fc01dc23 */
/*0018*/ #P0 EXIT; /* 0x80000000000001e7 */
/*0020*/ ISETP.NE.AND P0, PT, RZ, c[0x0][0x14], PT; /* 0x1a8e400053f1dc23 */
/*0028*/ MOV R0, RZ; /* 0x28000000fc001de4 */
/*0030*/ #!P0 BRA 0x130; /* 0x40000003e00021e7 */
/*0038*/ MOV R2, c[0x0][0x14]; /* 0x2800400050009de4 */
/*0040*/ ISETP.GT.AND P0, PT, R2, 0x3, PT; /* 0x1a0ec0000c21dc23 */
/*0048*/ MOV R2, RZ; /* 0x28000000fc009de4 */
/*0050*/ #!P0 BRA 0xe0; /* 0x40000002200021e7 */
/*0058*/ MOV R3, c[0x0][0x14]; /* 0x280040005000dde4 */
/*0060*/ IADD R7, R3, -0x3; /* 0x4800fffff431dc03 */
/*0068*/ NOP; /* 0x4000000000001de4 */
/*0070*/ NOP; /* 0x4000000000001de4 */
/*0078*/ NOP; /* 0x4000000000001de4 */
/*0080*/ ISCADD R3, R2, c[0xe][0x0], 0x2; /* 0x400078000020dc43 */
/*0088*/ LD R4, [R3]; /* 0x8000000000311c85 */
/*0090*/ IADD R2, R2, 0x4; /* 0x4800c00010209c03 */
/*0098*/ LD R5, [R3+0x4]; /* 0x8000000010315c85 */
/*00a0*/ ISETP.LT.U32.AND P0, PT, R2, R7, PT; /* 0x188e00001c21dc03 */
/*00a8*/ LD R6, [R3+0x8]; /* 0x8000000020319c85 */
/*00b0*/ LD R3, [R3+0xc]; /* 0x800000003030dc85 */
/*00b8*/ IADD R0, R4, R0; /* 0x4800000000401c03 */
/*00c0*/ IADD R0, R5, R0; /* 0x4800000000501c03 */
/*00c8*/ IADD R0, R6, R0; /* 0x4800000000601c03 */
/*00d0*/ IADD R0, R3, R0; /* 0x4800000000301c03 */
/*00d8*/ #P0 BRA 0x80; /* 0x4003fffe800001e7 */
/*00e0*/ ISETP.LT.U32.AND P0, PT, R2, c[0x0][0x14], PT; /* 0x188e40005021dc03 */
/*00e8*/ #!P0 BRA 0x130; /* 0x40000001000021e7 */
/*00f0*/ NOP; /* 0x4000000000001de4 */
/*00f8*/ NOP; /* 0x4000000000001de4 */
/*0100*/ ISCADD R3, R2, c[0xe][0x0], 0x2; /* 0x400078000020dc43 */
/*0108*/ IADD R2, R2, 0x1; /* 0x4800c00004209c03 */
/*0110*/ LD R3, [R3]; /* 0x800000000030dc85 */
/*0118*/ ISETP.LT.U32.AND P0, PT, R2, c[0x0][0x14], PT; /* 0x188e40005021dc03 */
/*0120*/ IADD R0, R3, R0; /* 0x4800000000301c03 */
/*0128*/ #P0 BRA 0x100; /* 0x4003ffff400001e7 */
/*0130*/ MOV R2, c[0xe][0x8]; /* 0x2800780020009de4 */
/*0138*/ ST [R2], R0; /* 0x9000000000201c85 */
/*0140*/ EXIT; /* 0x8000000000001de7 */
.....................................
As can be seen, the instructions of justthreadfence() are strictly contained in those of usethreadfence(), while those of usetwokernels_1() and usetwokernels_2() together practically partition those of usethreadfence(). So, the difference in timings can be ascribed to the kernel launch overhead of the second kernel.
Edit: this question is a re-done version of the original, so the first several responses may no longer be relevant.
I'm curious about what impact a device function call with forced no-inlining has on synchronization within a device function. I have a simple test kernel that illustrates the behavior in question.
The kernel takes a buffer and passes it to a device function, along with a shared buffer and an indicator variable which identifies a single thread as the "boss" thread. The device function has divergent code: the boss thread first spends time doing trivial operations on the shared buffer, then writes to the global buffer. After a synchronization call, all threads write to the global buffer. After the kernel call, the host prints the contents of the global buffer. Here is the code:
CUDA CODE:
test_main.cu
#include<cutil_inline.h>
#include "test_kernel.cu"
int main()
{
int scratchBufferLength = 100;
int *scratchBuffer;
int *d_scratchBuffer;
int b = 1;
int t = 64;
// copy scratch buffer to device
scratchBuffer = (int *)calloc(scratchBufferLength,sizeof(int));
cutilSafeCall( cudaMalloc(&d_scratchBuffer,
sizeof(int) * scratchBufferLength) );
cutilSafeCall( cudaMemcpy(d_scratchBuffer, scratchBuffer,
sizeof(int)*scratchBufferLength, cudaMemcpyHostToDevice) );
// kernel call
testKernel<<<b, t>>>(d_scratchBuffer);
cudaThreadSynchronize();
// copy data back to host
cutilSafeCall( cudaMemcpy(scratchBuffer, d_scratchBuffer,
sizeof(int) * scratchBufferLength, cudaMemcpyDeviceToHost) );
// print results
printf("Scratch buffer contents: \t");
for(int i=0; i < scratchBufferLength; ++i)
{
if(i % 25 == 0)
printf("\n");
printf("%d ", scratchBuffer[i]);
}
printf("\n");
//cleanup
cudaFree(d_scratchBuffer);
free(scratchBuffer);
return 0;
}
test_kernel.cu
#ifndef __TEST_KERNEL_CU
#define __TEST_KERNEL_CU
#define IS_BOSS() (threadIdx.x == blockDim.x - 1)
__device__
__noinline__
void testFunc(int *sA, int *scratchBuffer, bool isBoss) {
if(isBoss) { // produces unexpected output-- "broken" code
//if(IS_BOSS()) { // produces expected output-- "working" code
for (int c = 0; c < 10000; c++) {
sA[0] = 1;
}
}
if(isBoss) {
scratchBuffer[0] = 1;
}
__syncthreads();
scratchBuffer[threadIdx.x ] = threadIdx.x;
return;
}
__global__
void testKernel(int *scratchBuffer)
{
__shared__ int sA[4];
bool isBoss = IS_BOSS();
testFunc(sA, scratchBuffer, isBoss);
return;
}
#endif
I compiled this code from within the CUDA SDK to take advantage of the "cutilSafeCall()" functions in test_main.cu, but of course these could be taken out if you'd like to compile outside the SDK. I compiled with CUDA Driver/Toolkit version 4.0, compute capability 2.0, and the code was run on a GeForce GTX 480, which has the Fermi architecture.
The expected output is
0 1 2 3 ... blockDim.x-1
However, the output I get is
1 1 2 3 ... blockDim.x-1
This seems to indicate that the boss thread executed the conditional "scratchBuffer[0] = 1;" statement AFTER all threads execute the "scratchBuffer[threadIdx.x] = threadIdx.x;" statement, even though they are separated by a __syncthreads() barrier.
This occurs even if the boss thread is instructed to write a sentinel value into the buffer position of a thread in its same warp; the sentinel is the final value present in the buffer, rather than the appropriate threadIdx.x .
One modification that causes the code to produce expected output is to change the conditional statement
if(isBoss) {
to
if(IS_BOSS()) {
That is, the divergence-controlling variable changes from being stored in a parameter register to being computed in a macro function. (Note the comments on the appropriate lines in the source code.) It is this particular change I've been focusing on to try and track down the problem. Looking at the disassembled .cubins of the kernel with the 'isBoss' conditional (i.e., broken code) and the 'IS_BOSS()' conditional (i.e., working code), the most conspicuous difference in the instructions is the absence of an SSY instruction in the disassembled broken code.
Here are the disassembled kernels generated by disassembling the .cubin files with
"cuobjdump -sass test_kernel.cubin" . everything up to the first 'EXIT' is the kernel, and everything after that is the device function. The only differences are in the device function.
DISASSEMBLED OBJECT CODE:
"broken" code
code for sm_20
Function : _Z10testKernelPi
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x20009de428004000*/ MOV R2, c [0x0] [0x8];
/*0010*/ /*0x84001c042c000000*/ S2R R0, SR_Tid_X;
/*0018*/ /*0xfc015de428000000*/ MOV R5, RZ;
/*0020*/ /*0x00011de428004000*/ MOV R4, c [0x0] [0x0];
/*0028*/ /*0xfc209c034800ffff*/ IADD R2, R2, 0xfffff;
/*0030*/ /*0x9001dde428004000*/ MOV R7, c [0x0] [0x24];
/*0038*/ /*0x80019de428004000*/ MOV R6, c [0x0] [0x20];
/*0040*/ /*0x08001c03110e0000*/ ISET.EQ.U32.AND R0, R0, R2, pt;
/*0048*/ /*0x01221f841c000000*/ I2I.S32.S32 R8, -R0;
/*0050*/ /*0x2001000750000000*/ CAL 0x60;
/*0058*/ /*0x00001de780000000*/ EXIT;
/*0060*/ /*0x20201e841c000000*/ I2I.S32.S8 R0, R8;
/*0068*/ /*0xfc01dc231a8e0000*/ ISETP.NE.AND P0, pt, R0, RZ, pt;
/*0070*/ /*0xc00021e740000000*/ #!P0 BRA 0xa8;
/*0078*/ /*0xfc001de428000000*/ MOV R0, RZ;
/*0080*/ /*0x04001c034800c000*/ IADD R0, R0, 0x1;
/*0088*/ /*0x04009de218000000*/ MOV32I R2, 0x1;
/*0090*/ /*0x4003dc231a8ec09c*/ ISETP.NE.AND P1, pt, R0, 0x2710, pt;
/*0098*/ /*0x00409c8594000000*/ ST.E [R4], R2;
/*00a0*/ /*0x600005e74003ffff*/ #P1 BRA 0x80;
/*00a8*/ /*0x040001e218000000*/ #P0 MOV32I R0, 0x1;
/*00b0*/ /*0x0060008594000000*/ #P0 ST.E [R6], R0;
/*00b8*/ /*0xffffdc0450ee0000*/ BAR.RED.POPC RZ, RZ;
/*00c0*/ /*0x84001c042c000000*/ S2R R0, SR_Tid_X;
/*00c8*/ /*0x10011c03200dc000*/ IMAD.U32.U32 R4.CC, R0, 0x4, R6;
/*00d0*/ /*0x10009c435000c000*/ IMUL.U32.U32.HI R2, R0, 0x4;
/*00d8*/ /*0x08715c4348000000*/ IADD.X R5, R7, R2;
/*00e0*/ /*0x00401c8594000000*/ ST.E [R4], R0;
/*00e8*/ /*0x00001de790000000*/ RET;
.................................
"working" code
code for sm_20
Function : _Z10testKernelPi
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x20009de428004000*/ MOV R2, c [0x0] [0x8];
/*0010*/ /*0x84001c042c000000*/ S2R R0, SR_Tid_X;
/*0018*/ /*0xfc015de428000000*/ MOV R5, RZ;
/*0020*/ /*0x00011de428004000*/ MOV R4, c [0x0] [0x0];
/*0028*/ /*0xfc209c034800ffff*/ IADD R2, R2, 0xfffff;
/*0030*/ /*0x9001dde428004000*/ MOV R7, c [0x0] [0x24];
/*0038*/ /*0x80019de428004000*/ MOV R6, c [0x0] [0x20];
/*0040*/ /*0x08001c03110e0000*/ ISET.EQ.U32.AND R0, R0, R2, pt;
/*0048*/ /*0x01221f841c000000*/ I2I.S32.S32 R8, -R0;
/*0050*/ /*0x2001000750000000*/ CAL 0x60;
/*0058*/ /*0x00001de780000000*/ EXIT;
/*0060*/ /*0x20009de428004000*/ MOV R2, c [0x0] [0x8];
/*0068*/ /*0x8400dc042c000000*/ S2R R3, SR_Tid_X;
/*0070*/ /*0x20201e841c000000*/ I2I.S32.S8 R0, R8;
/*0078*/ /*0x4000000760000001*/ SSY 0xd0;
/*0080*/ /*0xfc209c034800ffff*/ IADD R2, R2, 0xfffff;
/*0088*/ /*0x0831dc031a8e0000*/ ISETP.NE.U32.AND P0, pt, R3, R2, pt;
/*0090*/ /*0xc00001e740000000*/ #P0 BRA 0xc8;
/*0098*/ /*0xfc009de428000000*/ MOV R2, RZ;
/*00a0*/ /*0x04209c034800c000*/ IADD R2, R2, 0x1;
/*00a8*/ /*0x04021de218000000*/ MOV32I R8, 0x1;
/*00b0*/ /*0x4021dc231a8ec09c*/ ISETP.NE.AND P0, pt, R2, 0x2710, pt;
/*00b8*/ /*0x00421c8594000000*/ ST.E [R4], R8;
/*00c0*/ /*0x600001e74003ffff*/ #P0 BRA 0xa0;
/*00c8*/ /*0xfc01dc33190e0000*/ ISETP.EQ.AND.S P0, pt, R0, RZ, pt;
/*00d0*/ /*0x040021e218000000*/ #!P0 MOV32I R0, 0x1;
/*00d8*/ /*0x0060208594000000*/ #!P0 ST.E [R6], R0;
/*00e0*/ /*0xffffdc0450ee0000*/ BAR.RED.POPC RZ, RZ;
/*00e8*/ /*0x10311c03200dc000*/ IMAD.U32.U32 R4.CC, R3, 0x4, R6;
/*00f0*/ /*0x10309c435000c000*/ IMUL.U32.U32.HI R2, R3, 0x4;
/*00f8*/ /*0x84001c042c000000*/ S2R R0, SR_Tid_X;
/*0100*/ /*0x08715c4348000000*/ IADD.X R5, R7, R2;
/*0108*/ /*0x00401c8594000000*/ ST.E [R4], R0;
/*0110*/ /*0x00001de790000000*/ RET;
.................................
The "SSY" instruction is present in the working code but not the broken code. The cuobjdump manual describes the instruction with, "Set synchronization point; used before potentially divergent instructions." This makes me think that for some reason the compiler does not recognize the possibility of divergence in the broken code.
I also found that if I comment out the __noinline__ directive, then the code produces the expected output, and indeed the assembly produced by the otherwise "broken" and "working" versions is exactly identical. So, this makes me think that when a variable is passed via the call stack, that variable cannot be used to control divergence and a subsequent synchronization call; the compiler does not seem to recognize the possibility of divergence in that case, and therefore doesn't insert an "SSY" instruction. Does anyone know if this is indeed a legitimate limitation of CUDA, and if so, if this is documented anywhere?
Thanks in advance.
This appears to have simply been a compiler bug that was fixed in CUDA 4.1/4.2; it does not reproduce for the asker on CUDA 4.2.
The following code sums each group of 32 elements in an array into the first element of that group:
int i = threadIdx.x;
int warpid = i&31;
if(warpid < 16){
s_buf[i] += s_buf[i+16];__syncthreads();
s_buf[i] += s_buf[i+8];__syncthreads();
s_buf[i] += s_buf[i+4];__syncthreads();
s_buf[i] += s_buf[i+2];__syncthreads();
s_buf[i] += s_buf[i+1];__syncthreads();
}
I thought I could eliminate all the __syncthreads() in the code, since all the operations are performed within the same warp. But if I eliminate them, I get garbage results back. Removing them should not affect performance much, but I want to know why the __syncthreads() calls are needed here.
I'm providing an answer here because I think that the above two answers are not fully satisfactory. The "intellectual property" of this answer belongs to Mark Harris, who pointed out this issue in this presentation (slide 22), and to @talonmies, who pointed the problem out to the OP in the comments above.
Let me first try to summarize what the OP is asking, correcting his mistakes.
The OP seems to be dealing with the last step of a shared memory reduction, namely warp reduction by loop unrolling. He is doing something like
template <class T>
__device__ void warpReduce(T *sdata, int tid) {
sdata[tid] += sdata[tid + 32];
sdata[tid] += sdata[tid + 16];
sdata[tid] += sdata[tid + 8];
sdata[tid] += sdata[tid + 4];
sdata[tid] += sdata[tid + 2];
sdata[tid] += sdata[tid + 1];
}
template <class T>
__global__ void reduce4_no_synchthreads(T *g_idata, T *g_odata, unsigned int N)
{
extern __shared__ T sdata[];
unsigned int tid = threadIdx.x; // Local thread index
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x; // Global thread index - Fictitiously double the block dimension
// --- Performs the first level of reduction in registers when reading from global memory.
T mySum = (i < N) ? g_idata[i] : 0;
if (i + blockDim.x < N) mySum += g_idata[i+blockDim.x];
sdata[tid] = mySum;
// --- Before going further, we have to make sure that all the shared memory loads have been completed
__syncthreads();
// --- Reduction in shared memory. Only half of the threads contribute to reduction.
for (unsigned int s=blockDim.x/2; s>32; s>>=1)
{
if (tid < s) { sdata[tid] = mySum = mySum + sdata[tid + s]; }
// --- At the end of each iteration loop, we have to make sure that all memory operations have been completed
__syncthreads();
}
// --- Single warp reduction by loop unrolling. Assuming blockDim.x >= 64
if (tid < 32) warpReduce(sdata, tid);
// --- Write result for this block to global memory. At the end of the kernel, global memory will contain the results for the summations of
// individual blocks
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
As pointed out by Mark Harris and talonmies, the shared memory pointer sdata must be declared volatile, to prevent the compiler from caching intermediate results in registers. So the right way to define the __device__ function above is:
template <class T>
__device__ void warpReduce(volatile T *sdata, int tid) {
sdata[tid] += sdata[tid + 32];
sdata[tid] += sdata[tid + 16];
sdata[tid] += sdata[tid + 8];
sdata[tid] += sdata[tid + 4];
sdata[tid] += sdata[tid + 2];
sdata[tid] += sdata[tid + 1];
}
Let us now look at the disassembled code for the two cases examined above, i.e., with sdata declared non-volatile and volatile (code compiled for the Fermi architecture).
Not volatile
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R0, SR_CTAID.X; /* 0x2c00000094001c04 */
/*0010*/ SHL R3, R0, 0x1; /* 0x6000c0000400dc03 */
/*0018*/ S2R R2, SR_TID.X; /* 0x2c00000084009c04 */
/*0020*/ IMAD R3, R3, c[0x0][0x8], R2; /* 0x200440002030dca3 */
/*0028*/ IADD R4, R3, c[0x0][0x8]; /* 0x4800400020311c03 */
/*0030*/ ISETP.LT.U32.AND P0, PT, R3, c[0x0][0x28], PT; /* 0x188e4000a031dc03 */
/*0038*/ ISETP.GE.U32.AND P1, PT, R4, c[0x0][0x28], PT; /* 0x1b0e4000a043dc03 */
/*0040*/ #P0 ISCADD R3, R3, c[0x0][0x20], 0x2; /* 0x400040008030c043 */
/*0048*/ #!P1 ISCADD R4, R4, c[0x0][0x20], 0x2; /* 0x4000400080412443 */
/*0050*/ #!P0 MOV R5, RZ; /* 0x28000000fc0161e4 */
/*0058*/ #!P1 LD R4, [R4]; /* 0x8000000000412485 */
/*0060*/ #P0 LD R5, [R3]; /* 0x8000000000314085 */
/*0068*/ SHL R3, R2, 0x2; /* 0x6000c0000820dc03 */
/*0070*/ NOP; /* 0x4000000000001de4 */
/*0078*/ #!P1 IADD R5, R4, R5; /* 0x4800000014416403 */
/*0080*/ MOV R4, c[0x0][0x8]; /* 0x2800400020011de4 */
/*0088*/ STS [R3], R5; /* 0xc900000000315c85 */
/*0090*/ BAR.RED.POPC RZ, RZ, RZ, PT; /* 0x50ee0000ffffdc04 */
/*0098*/ MOV R6, c[0x0][0x8]; /* 0x2800400020019de4 */
/*00a0*/ ISETP.LT.U32.AND P0, PT, R6, 0x42, PT; /* 0x188ec0010861dc03 */
/*00a8*/ #P0 BRA 0x118; /* 0x40000001a00001e7 */
/*00b0*/ NOP; /* 0x4000000000001de4 */
/*00b8*/ NOP; /* 0x4000000000001de4 */
/*00c0*/ MOV R6, R4; /* 0x2800000010019de4 */
/*00c8*/ SHR.U32 R4, R4, 0x1; /* 0x5800c00004411c03 */
/*00d0*/ ISETP.GE.U32.AND P0, PT, R2, R4, PT; /* 0x1b0e00001021dc03 */
/*00d8*/ #!P0 IADD R7, R4, R2; /* 0x480000000841e003 */
/*00e0*/ #!P0 SHL R7, R7, 0x2; /* 0x6000c0000871e003 */
/*00e8*/ #!P0 LDS R7, [R7]; /* 0xc10000000071e085 */
/*00f0*/ #!P0 IADD R5, R7, R5; /* 0x4800000014716003 */
/*00f8*/ #!P0 STS [R3], R5; /* 0xc900000000316085 */
/*0100*/ BAR.RED.POPC RZ, RZ, RZ, PT; /* 0x50ee0000ffffdc04 */
/*0108*/ ISETP.GT.U32.AND P0, PT, R6, 0x83, PT; /* 0x1a0ec0020c61dc03 */
/*0110*/ #P0 BRA 0xc0; /* 0x4003fffea00001e7 */
/*0118*/ ISETP.GT.U32.AND P0, PT, R2, 0x1f, PT; /* 0x1a0ec0007c21dc03 */
/*0120*/ #P0 BRA.U 0x198; /* 0x40000001c00081e7 */
/*0128*/ #!P0 LDS R8, [R3]; /* 0xc100000000322085 */
/*0130*/ #!P0 LDS R5, [R3+0x80]; /* 0xc100000200316085 */
/*0138*/ #!P0 LDS R4, [R3+0x40]; /* 0xc100000100312085 */
/*0140*/ #!P0 LDS R7, [R3+0x20]; /* 0xc10000008031e085 */
/*0148*/ #!P0 LDS R6, [R3+0x10]; /* 0xc10000004031a085 */
/*0150*/ #!P0 IADD R8, R8, R5; /* 0x4800000014822003 */
/*0158*/ #!P0 IADD R8, R8, R4; /* 0x4800000010822003 */
/*0160*/ #!P0 LDS R5, [R3+0x8]; /* 0xc100000020316085 */
/*0168*/ #!P0 IADD R7, R8, R7; /* 0x480000001c81e003 */
/*0170*/ #!P0 LDS R4, [R3+0x4]; /* 0xc100000010312085 */
/*0178*/ #!P0 IADD R6, R7, R6; /* 0x480000001871a003 */
/*0180*/ #!P0 IADD R5, R6, R5; /* 0x4800000014616003 */
/*0188*/ #!P0 IADD R4, R5, R4; /* 0x4800000010512003 */
/*0190*/ #!P0 STS [R3], R4; /* 0xc900000000312085 */
/*0198*/ ISETP.NE.AND P0, PT, R2, RZ, PT; /* 0x1a8e0000fc21dc23 */
/*01a0*/ #P0 BRA.U 0x1c0; /* 0x40000000600081e7 */
/*01a8*/ #!P0 ISCADD R0, R0, c[0x0][0x24], 0x2; /* 0x4000400090002043 */
/*01b0*/ #!P0 LDS R2, [RZ]; /* 0xc100000003f0a085 */
/*01b8*/ #!P0 ST [R0], R2; /* 0x900000000000a085 */
/*01c0*/ EXIT; /* 0x8000000000001de7 */
Lines /*0128*/-/*0148*/, /*0160*/ and /*0170*/ are the shared memory loads into registers, and line /*0190*/ is the single shared memory store from a register. The lines in between perform the summations in registers. The intermediate results are therefore kept in registers (which are private to each thread) and are not flushed to shared memory after each step, preventing the threads from having full visibility of each other's intermediate results.
volatile
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R0, SR_CTAID.X; /* 0x2c00000094001c04 */
/*0010*/ SHL R3, R0, 0x1; /* 0x6000c0000400dc03 */
/*0018*/ S2R R2, SR_TID.X; /* 0x2c00000084009c04 */
/*0020*/ IMAD R3, R3, c[0x0][0x8], R2; /* 0x200440002030dca3 */
/*0028*/ IADD R4, R3, c[0x0][0x8]; /* 0x4800400020311c03 */
/*0030*/ ISETP.LT.U32.AND P0, PT, R3, c[0x0][0x28], PT; /* 0x188e4000a031dc03 */
/*0038*/ ISETP.GE.U32.AND P1, PT, R4, c[0x0][0x28], PT; /* 0x1b0e4000a043dc03 */
/*0040*/ #P0 ISCADD R3, R3, c[0x0][0x20], 0x2; /* 0x400040008030c043 */
/*0048*/ #!P1 ISCADD R4, R4, c[0x0][0x20], 0x2; /* 0x4000400080412443 */
/*0050*/ #!P0 MOV R5, RZ; /* 0x28000000fc0161e4 */
/*0058*/ #!P1 LD R4, [R4]; /* 0x8000000000412485 */
/*0060*/ #P0 LD R5, [R3]; /* 0x8000000000314085 */
/*0068*/ SHL R3, R2, 0x2; /* 0x6000c0000820dc03 */
/*0070*/ NOP; /* 0x4000000000001de4 */
/*0078*/ #!P1 IADD R5, R4, R5; /* 0x4800000014416403 */
/*0080*/ MOV R4, c[0x0][0x8]; /* 0x2800400020011de4 */
/*0088*/ STS [R3], R5; /* 0xc900000000315c85 */
/*0090*/ BAR.RED.POPC RZ, RZ, RZ, PT; /* 0x50ee0000ffffdc04 */
/*0098*/ MOV R6, c[0x0][0x8]; /* 0x2800400020019de4 */
/*00a0*/ ISETP.LT.U32.AND P0, PT, R6, 0x42, PT; /* 0x188ec0010861dc03 */
/*00a8*/ #P0 BRA 0x118; /* 0x40000001a00001e7 */
/*00b0*/ NOP; /* 0x4000000000001de4 */
/*00b8*/ NOP; /* 0x4000000000001de4 */
/*00c0*/ MOV R6, R4; /* 0x2800000010019de4 */
/*00c8*/ SHR.U32 R4, R4, 0x1; /* 0x5800c00004411c03 */
/*00d0*/ ISETP.GE.U32.AND P0, PT, R2, R4, PT; /* 0x1b0e00001021dc03 */
/*00d8*/ #!P0 IADD R7, R4, R2; /* 0x480000000841e003 */
/*00e0*/ #!P0 SHL R7, R7, 0x2; /* 0x6000c0000871e003 */
/*00e8*/ #!P0 LDS R7, [R7]; /* 0xc10000000071e085 */
/*00f0*/ #!P0 IADD R5, R7, R5; /* 0x4800000014716003 */
/*00f8*/ #!P0 STS [R3], R5; /* 0xc900000000316085 */
/*0100*/ BAR.RED.POPC RZ, RZ, RZ, PT; /* 0x50ee0000ffffdc04 */
/*0108*/ ISETP.GT.U32.AND P0, PT, R6, 0x83, PT; /* 0x1a0ec0020c61dc03 */
/*0110*/ #P0 BRA 0xc0; /* 0x4003fffea00001e7 */
/*0118*/ ISETP.GT.U32.AND P0, PT, R2, 0x1f, PT; /* 0x1a0ec0007c21dc03 */
/*0120*/ SSY 0x1f0; /* 0x6000000320000007 */
/*0128*/ #P0 NOP.S; /* 0x40000000000001f4 */
/*0130*/ LDS R5, [R3]; /* 0xc100000000315c85 */
/*0138*/ LDS R4, [R3+0x80]; /* 0xc100000200311c85 */
/*0140*/ IADD R6, R5, R4; /* 0x4800000010519c03 */
/*0148*/ STS [R3], R6; /* 0xc900000000319c85 */
/*0150*/ LDS R5, [R3]; /* 0xc100000000315c85 */
/*0158*/ LDS R4, [R3+0x40]; /* 0xc100000100311c85 */
/*0160*/ IADD R6, R5, R4; /* 0x4800000010519c03 */
/*0168*/ STS [R3], R6; /* 0xc900000000319c85 */
/*0170*/ LDS R5, [R3]; /* 0xc100000000315c85 */
/*0178*/ LDS R4, [R3+0x20]; /* 0xc100000080311c85 */
/*0180*/ IADD R6, R5, R4; /* 0x4800000010519c03 */
/*0188*/ STS [R3], R6; /* 0xc900000000319c85 */
/*0190*/ LDS R5, [R3]; /* 0xc100000000315c85 */
/*0198*/ LDS R4, [R3+0x10]; /* 0xc100000040311c85 */
/*01a0*/ IADD R6, R5, R4; /* 0x4800000010519c03 */
/*01a8*/ STS [R3], R6; /* 0xc900000000319c85 */
/*01b0*/ LDS R5, [R3]; /* 0xc100000000315c85 */
/*01b8*/ LDS R4, [R3+0x8]; /* 0xc100000020311c85 */
/*01c0*/ IADD R6, R5, R4; /* 0x4800000010519c03 */
/*01c8*/ STS [R3], R6; /* 0xc900000000319c85 */
/*01d0*/ LDS R5, [R3]; /* 0xc100000000315c85 */
/*01d8*/ LDS R4, [R3+0x4]; /* 0xc100000010311c85 */
/*01e0*/ IADD R4, R5, R4; /* 0x4800000010511c03 */
/*01e8*/ STS.S [R3], R4; /* 0xc900000000311c95 */
/*01f0*/ ISETP.NE.AND P0, PT, R2, RZ, PT; /* 0x1a8e0000fc21dc23 */
/*01f8*/ #P0 BRA.U 0x218; /* 0x40000000600081e7 */
/*0200*/ #!P0 ISCADD R0, R0, c[0x0][0x24], 0x2; /* 0x4000400090002043 */
/*0208*/ #!P0 LDS R2, [RZ]; /* 0xc100000003f0a085 */
/*0210*/ #!P0 ST [R0], R2; /* 0x900000000000a085 */
/*0218*/ EXIT; /* 0x8000000000001de7 */
As can be seen from lines /*0130*/-/*01e8*/, each time a summation is performed the intermediate result is now immediately flushed to shared memory, so that all threads have full visibility of it.
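One caveat worth adding, not part of the original answers: since the Volta architecture, CUDA no longer guarantees that the threads of a warp execute in lockstep (independent thread scheduling), so even the volatile version is no longer safe on its own. The patterns currently recommended use explicit warp-level synchronization or shuffle intrinsics; a hedged sketch:

```cuda
// Warp reduction that remains safe under independent thread scheduling
// (Volta and later): __syncwarp() separates every load from the store
// that other lanes may read in the following step.
template <class T>
__device__ void warpReduceSync(T *sdata, int tid) {
    T v = sdata[tid];
    #pragma unroll
    for (int offset = 32; offset > 0; offset >>= 1) {
        v += sdata[tid + offset]; __syncwarp();
        sdata[tid] = v;           __syncwarp();
    }
}

// Alternative that avoids shared memory for the last warp entirely:
// each lane keeps its partial sum in a register and exchanges it via
// warp shuffles.
template <class T>
__device__ T warpReduceShfl(T val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // lane 0 ends up holding the warp-wide sum
}
```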
Maybe have a look at these slides from Mark Harris; why reinvent the wheel:
www.uni-graz.at/~haasegu/Lectures/GPU_CUDA/Lit/reduction.pdf?page=35
Each reduction step depends on the previous one.
So you can only leave out the synchronization in the last part of the reduction, where only 32 threads (a single warp) are active.
One step earlier, 64 threads (i.e., two warps) are needed, and a synchronization is required there because lockstep execution across different warps is not guaranteed.