I'm trying to find ways to avoid thread divergence (branching or warp divergence) in my CUDA kernel.
For instance, I have the following conditional assignment (a and b are char values, x and y are unsigned int values):
if (a == b) { ++x; }
else { ++y; }
Or, alternatively:
if (a == b) { ++x; }
if (a != b) { ++y; }
How can the above operations be re-written to avoid branching?
I've looked in the type casting intrinsics, but there is no casting available from bool to int. I'm thinking there might be some trick with min, max and absolute values (e.g., __sad) to obtain the appropriate integer result to add for each case (i.e., 1, 0 or 0, 1).
There doesn't seem to be a regular int absolute value function, but what I do see is:
Calculate | x − y | + z , the sum of absolute difference.
__device__ unsigned int __sad ( int x, int y, unsigned int z )
Which I suppose I could provide a z = 0 argument to, in order to get a normal absolute value. Maybe something along the lines of:
const unsigned int mu = __sad(a, b, 1);
const unsigned int mv = __sad(a, b, 0);
const int u = __nv_min(1, mu);
const int v = __nv_min(1, mv);
x += u;
y += v;
However, there is no min function (see related question).
tl;dr: Consider avoiding such supposed-micro-optimizations.
Let's see if we can determine what differences there are (if any) from the original formulation suggested in the question:
if (a == b) { ++x; }
else { ++y; }
and the formulation suggested in another answer:
x += (a == b);
y += (a != b);
we'll use this test code:
$ cat t1513.cu
__global__ void k(char a, char b, unsigned int *dx, unsigned int *dy){
unsigned int x=*dx;
unsigned int y=*dy;
#ifndef USE_OPT
if (a == b)
{
++x;
} else {
++y;
}
#else
x += (a == b);
y += (a != b);
#endif
*dy = y;
*dx = x;
}
$ nvcc -c t1513.cu
$ cuobjdump -sass t1513.o >out1.sass
$ nvcc -c t1513.cu -DUSE_OPT
$ cuobjdump -sass t1513.o >out2.sass
$ diff out1.sass out2.sass
28,29c28,29
< /*0078*/ BFE R7, R7, 0x1000; /* 0x7000c0400071dc23 */
< /* 0x22e04283f2828287 */
---
> /*0078*/ BFE R9, R7, 0x1000; /* 0x7000c04000725c23 */
> /* 0x22804283f2804287 */
31,41c31,41
< /*0090*/ ISET.EQ.AND R7, R8, R7, PT; /* 0x110e00001c81dc23 */
< /*0098*/ LOP32I.AND R7, R7, 0x1; /* 0x380000000471dc02 */
< /*00a0*/ LOP32I.XOR R8, R7, 0x1; /* 0x3800000004721c82 */
< /*00a8*/ IADD R8, R6, R8; /* 0x4800000020621c03 */
< /*00b0*/ IADD R7, R0, R7; /* 0x480000001c01dc03 */
< /*00b8*/ ST.E [R4], R8; /* 0x9400000000421c85 */
< /* 0x200000000002f047 */
< /*00c8*/ ST.E [R2], R7; /* 0x940000000021dc85 */
< /*00d0*/ EXIT; /* 0x8000000000001de7 */
< /*00d8*/ BRA 0xd8; /* 0x4003ffffe0001de7 */
< /*00e0*/ NOP; /* 0x4000000000001de4 */
---
> /*0090*/ ISET.NE.AND R7, R8, R9, PT; /* 0x128e00002481dc23 */
> /*0098*/ ISET.EQ.AND R8, R8, R9, PT; /* 0x110e000024821c23 */
> /*00a0*/ LOP32I.AND R7, R7, 0x1; /* 0x380000000471dc02 */
> /*00a8*/ IADD R7, R6, R7; /* 0x480000001c61dc03 */
> /*00b0*/ LOP32I.AND R6, R8, 0x1; /* 0x3800000004819c02 */
> /*00b8*/ ST.E [R4], R7; /* 0x940000000041dc85 */
> /* 0x2000000002f04287 */
> /*00c8*/ IADD R6, R0, R6; /* 0x4800000018019c03 */
> /*00d0*/ ST.E [R2], R6; /* 0x9400000000219c85 */
> /*00d8*/ EXIT; /* 0x8000000000001de7 */
> /*00e0*/ BRA 0xe0; /* 0x4003ffffe0001de7 */
$
Studying the above diff output we see:
There is no branching (and indeed not even any predication) in either realization.
The supposedly "optimized" case is nearly identical, except that it is 1 instruction longer than the if/else case.
Yes, I understand this is not "your code". I can only work with what is presented.
This gives me the intuition that these types of transformations:
Require effort (potentially wasted time)
May not yield any improvement in performance
May obfuscate the code, making maintenance more difficult
Proceed as you wish, of course.
As a helpful comment pointed out, I was overthinking the problem.
The following works, and uses a simple bool to int conversion:
x += (a == b);
y += (a != b);
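The rewrite relies on a language guarantee: in C++ (and so in CUDA C++) a comparison yields a bool, and a bool converts to exactly 0 or 1 when added to an integer. Below is a minimal host-side sketch that checks both formulations against each other. The names Counters, countBranchy, countBranchless and checkFormulationsAgree are hypothetical and not part of the kernel.

```cpp
#include <cassert>

// Hypothetical helper type and names, for illustration only.
struct Counters {
    unsigned int x;
    unsigned int y;
};

// Original formulation: explicit if/else.
Counters countBranchy(char a, char b, Counters c) {
    if (a == b) { ++c.x; } else { ++c.y; }
    return c;
}

// Rewritten formulation: the comparison result (bool) converts to 0 or 1.
Counters countBranchless(char a, char b, Counters c) {
    c.x += (a == b);
    c.y += (a != b);
    return c;
}

// Asserts that both formulations agree for every pair of sample characters.
void checkFormulationsAgree() {
    const char samples[] = {'A', 'C', '.'};
    for (char a : samples) {
        for (char b : samples) {
            const Counters start{3u, 5u};
            const Counters p = countBranchy(a, b, start);
            const Counters q = countBranchless(a, b, start);
            assert(p.x == q.x && p.y == q.y);
        }
    }
}
```

This only shows that the two forms compute the same values. Whether the generated GPU code differs depends on the compiler and the surrounding kernel.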
Examining the PTX assembly file before and after this change (several places in the kernel), the number of branches was reduced from 39 to 9, so this made a significant change. The nvcc compiler did not optimize these out on its own, particularly in cases where there were if/then/else statements two or three levels deep, as in:
bool ag = (ca == '.');
bool bg = (cb == '.');
bool agx = ag && apg;
bool bgx = bg && bpg;
bool gx = agx || bgx;
if (ag || bg)
{
if (ag && bg)
{
// ignore
} else {
if (!gx)
{
++gs;
++ps;
}
apg = ag;
bpg = bg;
}
} else {
if (ca == cb)
{
++ms;
++ps;
} else {
++ns;
++ps;
}
apg = false;
bpg = false;
}
Once I was able to reduce all of the assignments to boolean expressions (here are two out of the six assignments after conversion from the original kernel):
apg = (apg && !!(ag && bg)) || ((ag || bg) && !(ag && bg) && ag)
bpg = (bpg && !!(ag && bg)) || ((ag || bg) && !(ag && bg) && bg)
I was able to simplify those expressions:
apg = (ag && !bg) || (ag && apg)
bpg = (!ag && bg) || (bg && bpg)
And in two cases I was able to combine several expressions (multiple assignments) into a single boolean expression. Ultimately, the full set of conditionals reduced to:
ps += ((ca != '.') && (cb != '.')) || ((ca != '.') && !bpg) || ((cb != '.') && !apg);
ms += (ca == cb) && (ca != '.') && (cb != '.');
apg = ((ca == '.') && (cb != '.')) || ((ca == '.') && apg);
bpg = ((ca != '.') && (cb == '.')) || ((cb == '.') && bpg);
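Hand-simplified boolean expressions are easy to get subtly wrong, so it may be worth checking them exhaustively against the nested if/else. The following host-side sketch does that for the four expressions shown above. The State struct and the function names are hypothetical, and gs and ns are left out because their reduced forms are not shown.

```cpp
#include <cassert>

// Hypothetical container for the state touched by the four reduced expressions.
struct State {
    unsigned int ps = 0;
    unsigned int ms = 0;
    bool apg = false;
    bool bpg = false;
};

// Nested if/else version (gs and ns omitted, see the note above).
State stepBranchy(char ca, char cb, State s) {
    const bool ag = (ca == '.');
    const bool bg = (cb == '.');
    const bool gx = (ag && s.apg) || (bg && s.bpg);
    if (ag || bg) {
        if (!(ag && bg)) {           // the (ag && bg) case is ignored
            if (!gx) { ++s.ps; }
            s.apg = ag;
            s.bpg = bg;
        }
    } else {
        if (ca == cb) { ++s.ms; }
        ++s.ps;                      // incremented in both inner branches
        s.apg = false;
        s.bpg = false;
    }
    return s;
}

// Reduced version; every right-hand side reads the incoming state s.
State stepBranchless(char ca, char cb, State s) {
    State r = s;
    r.ps += ((ca != '.') && (cb != '.')) || ((ca != '.') && !s.bpg) || ((cb != '.') && !s.apg);
    r.ms += (ca == cb) && (ca != '.') && (cb != '.');
    r.apg = ((ca == '.') && (cb != '.')) || ((ca == '.') && s.apg);
    r.bpg = ((ca != '.') && (cb == '.')) || ((cb == '.') && s.bpg);
    return r;
}

// Asserts agreement for all 3 * 3 * 4 combinations of symbols and flags.
void checkAllInputs() {
    const char symbols[] = {'.', 'A', 'C'};
    for (char ca : symbols) {
        for (char cb : symbols) {
            for (int bits = 0; bits < 4; ++bits) {
                State s;
                s.apg = (bits & 1) != 0;
                s.bpg = (bits & 2) != 0;
                const State p = stepBranchy(ca, cb, s);
                const State q = stepBranchless(ca, cb, s);
                assert(p.ps == q.ps && p.ms == q.ms);
                assert(p.apg == q.apg && p.bpg == q.bpg);
            }
        }
    }
}
```

Calling checkAllInputs() once on the host is enough to cover every case, since the logic only distinguishes gap from non-gap characters and equal from unequal ones.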
Based on the method from this answer, I found that the number of real branches in my kernel was ultimately reduced from 39 to 12:
cuobjdump -sass kernel_original.o > kernel_original.sass
grep BRA kernel_original.sass | wc -l
39
cuobjdump -sass kernel_simplified.o > kernel_simplified.sass
grep BRA kernel_simplified.sass | wc -l
12
tl;dr: Consider the larger picture first before applying such supposed-micro-optimizations.
Looking at Robert's example code, my first thought was
++*( (a==b) ? &x : &y);
However I was on my mobile phone and could not check the disassembly of this myself.
Robert was kind enough to insert it into his test kernel and posted the SASS diff of this idea vs. the original if/else code posted in the question:
$ cuobjdump -sass t1513.o >out3.sass
$ diff out1.sass out3.sass
13,44c13,52
< /* 0x2230427042004307 */
< /*0008*/ MOV R1, c[0x0][0x44]; /* 0x2800400110005de4 */
< /*0010*/ MOV R4, c[0x0][0x150]; /* 0x2800400540011de4 */
< /*0018*/ MOV R5, c[0x0][0x154]; /* 0x2800400550015de4 */
< /*0020*/ MOV R2, c[0x0][0x148]; /* 0x2800400520009de4 */
< /*0028*/ MOV R3, c[0x0][0x14c]; /* 0x280040053000dde4 */
< /*0030*/ LD.E R6, [R4]; /* 0x8400000000419c85 */
< /*0038*/ LDC.U8 R7, c[0x0][0x141]; /* 0x1400000507f1dc06 */
< /* 0x2272028042824047 */
< /*0048*/ LD.E R0, [R2]; /* 0x8400000000201c85 */
< /*0050*/ LDC.U8 R8, c[0x0][0x140]; /* 0x1400000503f21c06 */
< /*0058*/ I2I.S16.S8 R7, R7; /* 0x1c0000001c11de84 */
< /*0060*/ I2I.S16.S8 R8, R8; /* 0x1c00000020121e84 */
< /*0068*/ LOP32I.AND R7, R7, 0xff; /* 0x38000003fc71dc02 */
< /*0070*/ LOP32I.AND R8, R8, 0xff; /* 0x38000003fc821c02 */
< /*0078*/ BFE R7, R7, 0x1000; /* 0x7000c0400071dc23 */
< /* 0x22e04283f2828287 */
< /*0088*/ BFE R8, R8, 0x1000; /* 0x7000c04000821c23 */
< /*0090*/ ISET.EQ.AND R7, R8, R7, PT; /* 0x110e00001c81dc23 */
< /*0098*/ LOP32I.AND R7, R7, 0x1; /* 0x380000000471dc02 */
< /*00a0*/ LOP32I.XOR R8, R7, 0x1; /* 0x3800000004721c82 */
< /*00a8*/ IADD R8, R6, R8; /* 0x4800000020621c03 */
< /*00b0*/ IADD R7, R0, R7; /* 0x480000001c01dc03 */
< /*00b8*/ ST.E [R4], R8; /* 0x9400000000421c85 */
< /* 0x200000000002f047 */
< /*00c8*/ ST.E [R2], R7; /* 0x940000000021dc85 */
< /*00d0*/ EXIT; /* 0x8000000000001de7 */
< /*00d8*/ BRA 0xd8; /* 0x4003ffffe0001de7 */
< /*00e0*/ NOP; /* 0x4000000000001de4 */
< /*00e8*/ NOP; /* 0x4000000000001de4 */
< /*00f0*/ NOP; /* 0x4000000000001de4 */
< /*00f8*/ NOP; /* 0x4000000000001de4 */
---
> /* 0x2270420042304307 */
> /*0008*/ MOV R1, c[0x0][0x44]; /* 0x2800400110005de4 */
> /*0010*/ MOV R10, c[0x0][0x148]; /* 0x2800400520029de4 */
> /*0018*/ IADD32I R1, R1, -0x8; /* 0x0bffffffe0105c02 */
> /*0020*/ MOV R11, c[0x0][0x14c]; /* 0x280040053002dde4 */
> /*0028*/ LDC.U8 R0, c[0x0][0x141]; /* 0x1400000507f01c06 */
> /*0030*/ MOV R8, c[0x0][0x150]; /* 0x2800400540021de4 */
> /*0038*/ MOV R9, c[0x0][0x154]; /* 0x2800400550025de4 */
> /* 0x2232423240423047 */
> /*0048*/ LD.E R4, [R10]; /* 0x8400000000a11c85 */
> /*0050*/ I2I.S16.S8 R0, R0; /* 0x1c00000000101e84 */
> /*0058*/ LD.E R5, [R8]; /* 0x8400000000815c85 */
> /*0060*/ LDC.U8 R2, c[0x0][0x140]; /* 0x1400000503f09c06 */
> /*0068*/ LOP32I.AND R0, R0, 0xff; /* 0x38000003fc001c02 */
> /*0070*/ I2I.S16.S8 R2, R2; /* 0x1c00000008109e84 */
> /*0078*/ BFE R0, R0, 0x1000; /* 0x7000c04000001c23 */
> /* 0x2283f282b2028287 */
> /*0088*/ LOP32I.AND R2, R2, 0xff; /* 0x38000003fc209c02 */
> /*0090*/ BFE R3, R2, 0x1000; /* 0x7000c0400020dc23 */
> /*0098*/ ISETP.NE.AND P0, PT, R3, R0, PT; /* 0x1a8e00000031dc23 */
> /*00a0*/ LOP.OR R3, R1, c[0x0][0x24]; /* 0x680040009010dc43 */
> /*00a8*/ #P0 IADD32I R3, R3, 0x4; /* 0x080000001030c002 */
> /*00b0*/ LOP32I.AND R3, R3, 0xffffff; /* 0x3803fffffc30dc02 */
> /*00b8*/ SEL R0, R4, R5, !P0; /* 0x2010000014401c04 */
> /* 0x22f042e3f2e28047 */
> /*00c8*/ STL.64 [R1], R4; /* 0xc800000000111ca5 */
> /*00d0*/ IADD32I R0, R0, 0x1; /* 0x0800000004001c02 */
> /*00d8*/ STL [R3], R0; /* 0xc800000000301c85 */
> /*00e0*/ LDL.64 R6, [R1]; /* 0xc000000000119ca5 */
> /*00e8*/ ST.E [R8], R7; /* 0x940000000081dc85 */
> /*00f0*/ ST.E [R10], R6; /* 0x9400000000a19c85 */
> /*00f8*/ EXIT; /* 0x8000000000001de7 */
> /*0100*/ BRA 0x100; /* 0x4003ffffe0001de7 */
> /*0108*/ NOP; /* 0x4000000000001de4 */
> /*0110*/ NOP; /* 0x4000000000001de4 */
> /*0118*/ NOP; /* 0x4000000000001de4 */
> /*0120*/ NOP; /* 0x4000000000001de4 */
> /*0128*/ NOP; /* 0x4000000000001de4 */
> /*0130*/ NOP; /* 0x4000000000001de4 */
> /*0138*/ NOP; /* 0x4000000000001de4 */
$
Robert concluded that the compiler chose to use predication in this case.
The disassembly seemed to make no sense to me, until I realised that Robert had inserted my one-liner in a different way than I expected. In trying to stay close to the (most likely accurately) presumed intentions of the questioner, he dereferenced the pointers into automatic variables, then inserted my one-liner (which really makes little sense in that case, because taking the address of automatic variables forces them into local memory), and wrote the contents of the automatic variables back to global memory.
My thought however was to just replace the entire body of the test case with my ++*( (a==b) ? dx : dy); one-liner, which would have led to better looking assembly:
/*0008*/ MOV R1, c[0x0][0x44]; /* 0x2800400110005de4 */
/*0010*/ LDC.U8 R0, c[0x0][0x141]; /* 0x1400000507f01c06 */
/*0018*/ LDC.U8 R2, c[0x0][0x140]; /* 0x1400000503f09c06 */
/*0020*/ I2I.S16.S8 R0, R0; /* 0x1c00000000101e84 */
/*0028*/ I2I.S16.S8 R2, R2; /* 0x1c00000008109e84 */
/*0030*/ LOP32I.AND R0, R0, 0xff; /* 0x38000003fc001c02 */
/*0038*/ LOP32I.AND R2, R2, 0xff; /* 0x38000003fc209c02 */
/* 0x228202c042804237 */
/*0048*/ BFE R0, R0, 0x1000; /* 0x7000c04000001c23 */
/*0050*/ BFE R3, R2, 0x1000; /* 0x7000c0400020dc23 */
/*0058*/ MOV R2, c[0x0][0x148]; /* 0x2800400520009de4 */
/*0060*/ ISETP.NE.AND P0, PT, R3, R0, PT; /* 0x1a8e00000031dc23 */
/*0068*/ MOV R0, c[0x0][0x14c]; /* 0x2800400530001de4 */
/*0070*/ SEL R2, R2, c[0x0][0x150], !P0; /* 0x2010400540209c04 */
/*0078*/ SEL R3, R0, c[0x0][0x154], !P0; /* 0x201040055000dc04 */
/* 0x20000002f04283f7 */
/*0088*/ LD.E R0, [R2]; /* 0x8400000000201c85 */
/*0090*/ IADD32I R4, R0, 0x1; /* 0x0800000004011c02 */
/*0098*/ ST.E [R2], R4; /* 0x9400000000211c85 */
/*00a0*/ EXIT; /* 0x8000000000001de7 */
/*00a8*/ BRA 0xa8; /* 0x4003ffffe0001de7 */
/*00b0*/ NOP; /* 0x4000000000001de4 */
/*00b8*/ NOP; /* 0x4000000000001de4 */
This code looks better to me than Robert's testcase (by itself). But it probably is of no use to vallismortis, because in his case the variables will not be in addressable memory.
Of course, Robert's other comment about premature optimisation also applies here, even if this should actually result in faster code.
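For readers who want to try the one-liner outside a kernel, here is a small host-side sketch of the pointer-selecting form. The function names are hypothetical, and the two counters are passed by pointer, as dx and dy are in the test kernel.

```cpp
#include <cassert>

// Hypothetical helper: increments *px when a == b, otherwise *py.
// The conditional operator selects an address; the increment is unconditional.
void incrementSelected(char a, char b, unsigned int* px, unsigned int* py) {
    ++*((a == b) ? px : py);
}

// Small self-check of the selection logic.
void checkIncrementSelected() {
    unsigned int x = 0;
    unsigned int y = 0;
    incrementSelected('A', 'A', &x, &y);  // equal: x is incremented
    incrementSelected('A', 'C', &x, &y);  // different: y is incremented
    incrementSelected('C', 'C', &x, &y);  // equal again: x is incremented
    assert(x == 2u && y == 1u);
}
```

As noted above, on the GPU this form only looks attractive when the counters already live in addressable memory.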
I was curious to know the meaning behind the footnote at the bottom of Table 2 on page 18 of the Volta whitepaper. While the table indicates that Volta has 256 KB of registers per SM, similar to its predecessors, the footnote mentions that
The per-thread program counter (PC) that forms part of the improved SIMT model typically requires two of the
register slots per thread.
Does it mean that for every running thread in Volta you have 2 reserved 32-bit registers that keep track of the PC? If yes, does it also mean that this reservation is static, in the sense that regardless of how many threads reside on the SM, 2048 (the maximum number of threads allowed on an SM) * 2 = 4096 registers are taken? Also, can this reservation be eliminated by compiling for a CC lower than 7.0?
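To put the numbers in the question side by side, here is the arithmetic as a small compile-time sketch. All figures are the ones quoted above. The constant names are made up for illustration, and whether the slots really are reserved statically is exactly the open question.

```cpp
#include <cstddef>

// Figures quoted in the question; the names are hypothetical.
constexpr std::size_t kRegisterFileBytesPerSM = 256 * 1024;  // 256 KB per SM
constexpr std::size_t kBytesPerRegister       = 4;           // 32-bit registers
constexpr std::size_t kMaxThreadsPerSM        = 2048;
constexpr std::size_t kPcSlotsPerThread       = 2;           // from the footnote

// Total number of 32-bit registers in one SM's register file.
constexpr std::size_t kRegistersPerSM =
    kRegisterFileBytesPerSM / kBytesPerRegister;

// Slots taken if every one of the 2048 threads reserved two registers.
constexpr std::size_t kPcSlotsAtFullOccupancy =
    kMaxThreadsPerSM * kPcSlotsPerThread;

// Share of the register file that such a static reservation would occupy.
constexpr double kPcSlotSharePercent =
    100.0 * static_cast<double>(kPcSlotsAtFullOccupancy) /
    static_cast<double>(kRegistersPerSM);

static_assert(kRegistersPerSM == 65536, "256 KB of 32-bit registers");
static_assert(kPcSlotsAtFullOccupancy == 4096, "2048 threads * 2 slots");
```

Under these assumptions a static reservation would amount to 4096 of 65536 registers, i.e. 6.25% of the register file.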
It seems that for every running thread, 2 additional registers are allocated from SM's register file when compiling for Compute Capability 7.0.
Using CUDA 9.1, I compiled the following simple saxpy kernel
__global__ void saxpy(float* out, float a, float* x, float* y) {
out[ threadIdx.x ] = a * x[ threadIdx.x ] + y[ threadIdx.x ];
}
for CC 6.1 and 7.0 with the maximum compiler optimization flag (-O3) applied. While using cuobjdump -res-usage on the binary for CC 6.1 shows that 8 registers are used for every thread in the kernel, the same command on the binary for CC 7.0 reports that register usage per thread is 10. I also printed the SASS using cuobjdump -sass. Below is the content for the binary for CC 6.1. You can see that architected registers with indices 0 to 7 are all used.
code for sm_61
Function : _Z5saxpyPffS_S_
.headerflags @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x083fc400e3e007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ S2R R0, SR_TID.X; /* 0xf0c8000002170000 */
/*0018*/ SHL R6, R0.reuse, 0x2; /* 0x3848000000270006 */
/* 0x081fc840fec007f5 */
/*0028*/ SHR.U32 R0, R0, 0x1e; /* 0x3828000001e70000 */
/*0030*/ IADD R2.CC, R6.reuse, c[0x0][0x150]; /* 0x4c10800005470602 */
/*0038*/ IADD.X R3, R0.reuse, c[0x0][0x154]; /* 0x4c10080005570003 */
/* 0x001f8800eec007f0 */
/*0048*/ { IADD R4.CC, R6, c[0x0][0x158]; /* 0x4c10800005670604 */
/*0050*/ LDG.E R2, [R2]; } /* 0xeed4200000070202 */
/*0058*/ IADD.X R5, R0, c[0x0][0x15c]; /* 0x4c10080005770005 */
/* 0x001fdc00fec00771 */
/*0068*/ LDG.E R4, [R4]; /* 0xeed4200000070404 */
/*0070*/ IADD R6.CC, R6, c[0x0][0x140]; /* 0x4c10800005070606 */
/*0078*/ IADD.X R7, R0, c[0x0][0x144]; /* 0x4c10080005170007 */
/* 0x001ffc001e2047f2 */
/*0088*/ FFMA R0, R2, c[0x0][0x148], R4; /* 0x4980020005270200 */
/*0090*/ STG.E [R6], R0; /* 0xeedc200000070600 */
/*0098*/ EXIT; /* 0xe30000000007000f */
/* 0x001f8000fc0007ff */
/*00a8*/ BRA 0xa0; /* 0xe2400fffff07000f */
/*00b0*/ NOP; /* 0x50b0000000070f00 */
/*00b8*/ NOP; /* 0x50b0000000070f00 */
..........................
Now for CC 7.0.
code for sm_70
Function : _Z5saxpyPffS_S_
.headerflags @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
/*0000*/ @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ; /* 0x000000fffffff389 */
/* 0x000fe200000e00ff */
/*0010*/ MOV R1, c[0x0][0x28]; /* 0x00000a0000017a02 */
/* 0x000fd00000000f00 */
/*0020*/ S2R R6, SR_TID.X; /* 0x0000000000067919 */
/* 0x000e220000002100 */
/*0030*/ MOV R7, 0x4; /* 0x0000000400077802 */
/* 0x000fca0000000f00 */
/*0040*/ IMAD.WIDE.U32 R2, R6.reuse, R7.reuse, c[0x0][0x170]; /* 0x00005c0006027625 */
/* 0x0c1fe400078e0007 */
/*0050*/ IMAD.WIDE.U32 R4, R6, R7, c[0x0][0x178]; /* 0x00005e0006047625 */
/* 0x000fd000078e0007 */
/*0060*/ LDG.E.SYS R2, [R2]; /* 0x0000000002027381 */
/* 0x000e2800001ee900 */
/*0070*/ LDG.E.SYS R4, [R4]; /* 0x0000000004047381 */
/* 0x000e2200001ee900 */
/*0080*/ IMAD.WIDE.U32 R6, R6, R7, c[0x0][0x160]; /* 0x0000580006067625 */
/* 0x000fe400078e0007 */
/*0090*/ FFMA R0, R2, c[0x0][0x168], R4; /* 0x00005a0002007a23 */
/* 0x001fd00000000004 */
/*00a0*/ STG.E.SYS [R6], R0; /* 0x0000000006007386 */
/* 0x0001e2000010e900 */
/*00b0*/ EXIT; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*00c0*/ BRA 0xc0; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*00d0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*00e0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*00f0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
Again, only architected registers R0 to R7 appear in the code block (R3 and R5 are not named explicitly, but they are the upper halves of the 64-bit pairs R2:R3 and R4:R5 that IMAD.WIDE writes and the loads use as addresses). RZ is also used at the beginning of the kernel. I do not see where the two other registers are, which makes me inclined to believe that two registers are reserved for tracking the thread's PC.
Anyway, I came to the conclusion I stated at the beginning of the post with clearly insufficient observations. Any contribution to improve this answer is appreciated.
The problem
During a project in CUDA C, I came across unexpected behaviour regarding single precision and double precision floating point operations. In the project, I first fill an array with numbers in a kernel and in another kernel, I do some computation on these numbers. All variables and arrays are double precision, so I would not expect any single precision floating point operation to happen. However, if I analyze the executable of the program using NVPROF, it shows that single precision operations are executed. How is this possible?
Minimal, Complete, and Verifiable example
Here is the smallest program that shows this behaviour on my architecture (asserts and error checking have been left out). I use an Nvidia Tesla K40 graphics card.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define Nx 10
#define Ny 10
#define RANDOM double(0.236954587566)
__global__ void test(double *array, size_t pitch){
double rho, u;
int x = threadIdx.x + blockDim.x*blockIdx.x;
int y = threadIdx.y + blockDim.y*blockIdx.y;
int idx = y*(pitch/sizeof(double)) + 2*x;
if(x < Nx && y < Ny){
rho = array[idx];
u = array[idx+1]/rho;
array[idx] = rho*u;
}
}
__global__ void fill(double *array, size_t pitch){
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int idx = y*(pitch/sizeof(double)) + 2*x;
if(x < Nx && y < Ny){
array[idx] = RANDOM*idx;
array[idx + 1] = idx*idx*RANDOM;
}
}
int main(int argc, char* argv[]) {
double *d_array;
size_t pitch;
cudaMallocPitch((void **) &d_array, &pitch, 2*Nx*sizeof(double), Ny);
dim3 threadDistribution = dim3(8,8);
dim3 blockDistribution = dim3( (Nx + threadDistribution.x - 1) / (threadDistribution.x), (Ny + threadDistribution.y - 1) / (threadDistribution.y));
fill <<< blockDistribution, threadDistribution >>> (d_array, pitch);
cudaDeviceSynchronize();
test <<< blockDistribution, threadDistribution >>> (d_array, pitch);
return 0;
}
The output of NVPROF (edited to make it more readable, if you need the full output, just ask in the comments):
....
Device "Tesla K40c (0)"
Kernel: test(double*, unsigned long)
Metric Name Min Max Avg
flop_count_sp 198 198 198
flop_count_sp_add 0 0 0
flop_count_sp_mul 0 0 0
flop_count_sp_fma 99 99 99
flop_count_sp_special 102 102 102
flop_count_dp 1214 1214 1214
flop_count_dp_add 0 0 0
flop_count_dp_mul 204 204 204
flop_count_dp_fma 505 505 505
What I've found so far
I found that if I delete the division in line 16:
u = array[idx+1]/rho;
==>
u = array[idx+1];
the output is as expected: zero single precision operations and exactly 100 double precision operations are executed. Does anyone know why the division causes the program to use single precision flop and 10 times more double precision floating point operations?
I've also tried using intrinsics (__ddiv_rn), but this didn't solve the problem.
Many thanks in advance!
Edit - Working solution
Although I still haven't figured out why it uses single precision, I have found a 'solution' to this problem, thanks to @EOF.
Replacing the division by multiplication with the reciprocal of rho did the job:
u = array[idx+1]/rho;
==>
u = array[idx+1]*__drcp_rn(rho);
As others have pointed out, CUDA devices do not have instructions for floating point division in hardware. Instead, they start from an initial approximation to the reciprocal of the denominator, provided by a single precision special function unit. Its product with the numerator is then iteratively refined until it matches the fraction to within machine precision.
Even the __ddiv_rn() intrinsic is compiled to this instruction sequence by ptxas, so its use makes no difference.
You can gain closer insight by inspecting the code yourself using cuobjdump -sass, although this is made difficult by the lack of official documentation for shader assembly beyond the bare list of instructions.
I'll use the following bare-bones division kernel as an example:
__global__ void div(double x, double y, double *z) {
*z = x / y;
}
This is compiled to the following shader assembly for a compute capability 3.5 device:
Function : _Z3divddPd
.headerflags @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
/* 0x08a0109c10801000 */
/*0008*/ MOV R1, c[0x0][0x44]; /* 0x64c03c00089c0006 */
/*0010*/ MOV R0, c[0x0][0x14c]; /* 0x64c03c00299c0002 */
/*0018*/ MOV32I R2, 0x1; /* 0x74000000009fc00a */
/*0020*/ MOV R8, c[0x0][0x148]; /* 0x64c03c00291c0022 */
/*0028*/ MOV R9, c[0x0][0x14c]; /* 0x64c03c00299c0026 */
/*0030*/ MUFU.RCP64H R3, R0; /* 0x84000000031c000e */
/*0038*/ MOV32I R0, 0x35b7333; /* 0x7401adb9999fc002 */
/* 0x08a080a080a4a4a4 */
/*0048*/ DFMA R4, -R8, R2, c[0x2][0x0]; /* 0x9b880840001c2012 */
/*0050*/ DFMA R4, R4, R4, R4; /* 0xdb801000021c1012 */
/*0058*/ DFMA R4, R4, R2, R2; /* 0xdb800800011c1012 */
/*0060*/ DMUL R6, R4, c[0x0][0x140]; /* 0x64000000281c101a */
/*0068*/ FSETP.GE.AND P0, PT, R0, |c[0x0][0x144]|, PT; /* 0x5db09c00289c001e */
/*0070*/ DFMA R8, -R8, R6, c[0x0][0x140]; /* 0x9b881800281c2022 */
/*0078*/ MOV R2, c[0x0][0x150]; /* 0x64c03c002a1c000a */
/* 0x0880acb0a0ac8010 */
/*0088*/ MOV R3, c[0x0][0x154]; /* 0x64c03c002a9c000e */
/*0090*/ DFMA R4, R8, R4, R6; /* 0xdb801800021c2012 */
/*0098*/ @P0 BRA 0xb8; /* 0x120000000c00003c */
/*00a0*/ FFMA R0, RZ, c[0x0][0x14c], R5; /* 0x4c001400299ffc02 */
/*00a8*/ FSETP.GT.AND P0, PT, |R0|, c[0x2][0x8], PT; /* 0x5da01c40011c021e */
/*00b0*/ @P0 BRA 0xe8; /* 0x120000001800003c */
/*00b8*/ MOV R4, c[0x0][0x140]; /* 0x64c03c00281c0012 */
/* 0x08a1b810b8008010 */
/*00c8*/ MOV R5, c[0x0][0x144]; /* 0x64c03c00289c0016 */
/*00d0*/ MOV R7, c[0x0][0x14c]; /* 0x64c03c00299c001e */
/*00d8*/ MOV R6, c[0x0][0x148]; /* 0x64c03c00291c001a */
/*00e0*/ CAL 0xf8; /* 0x1300000008000100 */
/*00e8*/ ST.E.64 [R2], R4; /* 0xe5800000001c0810 */
/*00f0*/ EXIT; /* 0x18000000001c003c */
/*00f8*/ LOP32I.AND R0, R7, 0x40000000; /* 0x20200000001c1c00 */
/* 0x08a08010a010b010 */
/*0108*/ MOV32I R15, 0x1ff00000; /* 0x740ff800001fc03e */
/*0110*/ ISETP.LT.U32.AND P0, PT, R0, c[0x2][0xc], PT; /* 0x5b101c40019c001e */
/*0118*/ MOV R8, RZ; /* 0xe4c03c007f9c0022 */
/*0120*/ SEL R9, R15, c[0x2][0x10], !P0; /* 0x65002040021c3c26 */
/*0128*/ MOV32I R12, 0x1; /* 0x74000000009fc032 */
/*0130*/ DMUL R10, R8, R6; /* 0xe4000000031c202a */
/*0138*/ LOP32I.AND R0, R5, 0x7f800000; /* 0x203fc000001c1400 */
/* 0x08a0108ca01080a0 */
/*0148*/ MUFU.RCP64H R13, R11; /* 0x84000000031c2c36 */
/*0150*/ DFMA R16, -R10, R12, c[0x2][0x0]; /* 0x9b883040001c2842 */
/*0158*/ ISETP.LT.U32.AND P0, PT, R0, c[0x2][0x14], PT; /* 0x5b101c40029c001e */
/*0160*/ MOV R14, RZ; /* 0xe4c03c007f9c003a */
/*0168*/ DFMA R16, R16, R16, R16; /* 0xdb804000081c4042 */
/*0170*/ SEL R15, R15, c[0x2][0x10], !P0; /* 0x65002040021c3c3e */
/*0178*/ SSY 0x3a0; /* 0x1480000110000000 */
/* 0x08acb4a4a4a4a480 */
/*0188*/ DMUL R14, R14, R4; /* 0xe4000000021c383a */
/*0190*/ DFMA R12, R16, R12, R12; /* 0xdb803000061c4032 */
/*0198*/ DMUL R16, R14, R12; /* 0xe4000000061c3842 */
/*01a0*/ DFMA R10, -R10, R16, R14; /* 0xdb883800081c282a */
/*01a8*/ DFMA R10, R10, R12, R16; /* 0xdb804000061c282a */
/*01b0*/ DSETP.LEU.AND P0, PT, |R10|, RZ, PT; /* 0xdc581c007f9c2a1e */
/*01b8*/ @!P0 BRA 0x1e0; /* 0x120000001020003c */
/* 0x088010b010b8acb4 */
/*01c8*/ DSETP.EQ.AND P0, PT, R10, RZ, PT; /* 0xdc101c007f9c281e */
/*01d0*/ @!P0 BRA 0x358; /* 0x12000000c020003c */
/*01d8*/ DMUL.S R8, R4, R6; /* 0xe4000000035c1022 */
/*01e0*/ ISETP.GT.U32.AND P0, PT, R0, c[0x2][0x18], PT; /* 0x5b401c40031c001e */
/*01e8*/ MOV32I R0, 0x1ff00000; /* 0x740ff800001fc002 */
/*01f0*/ MOV R14, RZ; /* 0xe4c03c007f9c003a */
/*01f8*/ SEL R15, R0, c[0x2][0x10], !P0; /* 0x65002040021c003e */
/* 0x08b4a49c849c849c */
/*0208*/ DMUL R12, R10, R8; /* 0xe4000000041c2832 */
/*0210*/ DMUL R18, R10, R14; /* 0xe4000000071c284a */
/*0218*/ DMUL R10, R12, R14; /* 0xe4000000071c302a */
/*0220*/ DMUL R16, R8, R18; /* 0xe4000000091c2042 */
/*0228*/ DFMA R8, R10, R6, -R4; /* 0xdb901000031c2822 */
/*0230*/ DFMA R12, R16, R6, -R4; /* 0xdb901000031c4032 */
/*0238*/ DSETP.GT.AND P0, PT, |R8|, |R12|, PT; /* 0xdc209c00061c221e */
/* 0x08b010ac10b010a0 */
/*0248*/ SEL R9, R17, R11, P0; /* 0xe5000000059c4426 */
/*0250*/ FSETP.GTU.AND P1, PT, |R9|, 1.469367938527859385e-39, PT; /* 0xb5e01c00801c263d */
/*0258*/ MOV R11, R9; /* 0xe4c03c00049c002e */
/*0260*/ SEL R8, R16, R10, P0; /* 0xe5000000051c4022 */
/*0268*/ @P1 NOP.S; /* 0x8580000000443c02 */
/*0270*/ FSETP.LT.AND P0, PT, |R5|, 1.5046327690525280102e-36, PT; /* 0xb5881c20001c161d */
/*0278*/ MOV32I R0, 0x3ff00000; /* 0x741ff800001fc002 */
/* 0x0880a48090108c10 */
/*0288*/ MOV R16, RZ; /* 0xe4c03c007f9c0042 */
/*0290*/ SEL R17, R0, c[0x2][0x1c], !P0; /* 0x65002040039c0046 */
/*0298*/ LOP.OR R10, R8, 0x1; /* 0xc2001000009c2029 */
/*02a0*/ LOP.AND R8, R8, -0x2; /* 0xca0003ffff1c2021 */
/*02a8*/ DMUL R4, R16, R4; /* 0xe4000000021c4012 */
/*02b0*/ DMUL R6, R16, R6; /* 0xe4000000031c401a */
/*02b8*/ DFMA R14, R10, R6, -R4; /* 0xdb901000031c283a */
/* 0x08b010b010a0b4a4 */
/*02c8*/ DFMA R12, R8, R6, -R4; /* 0xdb901000031c2032 */
/*02d0*/ DSETP.GT.AND P0, PT, |R12|, |R14|, PT; /* 0xdc209c00071c321e */
/*02d8*/ SEL R8, R10, R8, P0; /* 0xe5000000041c2822 */
/*02e0*/ LOP.AND R0, R8, 0x1; /* 0xc2000000009c2001 */
/*02e8*/ IADD R11.CC, R8, -0x1; /* 0xc88403ffff9c202d */
/*02f0*/ ISETP.EQ.U32.AND P0, PT, R0, 0x1, PT; /* 0xb3201c00009c001d */
/*02f8*/ IADD.X R0, R9, -0x1; /* 0xc88043ffff9c2401 */
/* 0x08b4a480a010b010 */
/*0308*/ SEL R10, R11, R8, !P0; /* 0xe5002000041c2c2a */
/*0310*/ @P0 IADD R8.CC, R8, 0x1; /* 0xc084000000802021 */
/*0318*/ SEL R11, R0, R9, !P0; /* 0xe5002000049c002e */
/*0320*/ @P0 IADD.X R9, R9, RZ; /* 0xe08040007f802426 */
/*0328*/ DFMA R14, R10, R6, -R4; /* 0xdb901000031c283a */
/*0330*/ DFMA R4, R8, R6, -R4; /* 0xdb901000031c2012 */
/*0338*/ DSETP.GT.AND P0, PT, |R4|, |R14|, PT; /* 0xdc209c00071c121e */
/* 0x08b4acb4a010b810 */
/*0348*/ SEL R8, R10, R8, P0; /* 0xe5000000041c2822 */
/*0350*/ SEL.S R9, R11, R9, P0; /* 0xe500000004dc2c26 */
/*0358*/ MOV R8, RZ; /* 0xe4c03c007f9c0022 */
/*0360*/ MUFU.RCP64H R9, R7; /* 0x84000000031c1c26 */
/*0368*/ DSETP.GT.AND P0, PT, |R8|, RZ, PT; /* 0xdc201c007f9c221e */
/*0370*/ @P0 BRA.U 0x398; /* 0x120000001000023c */
/*0378*/ @!P0 DSETP.NEU.AND P1, PT, |R6|, +INF , PT; /* 0xb4681fff80201a3d */
/* 0x0800b8a010ac0010 */
/*0388*/ @!P0 SEL R9, R7, R9, P1; /* 0xe500040004a01c26 */
/*0390*/ @!P0 SEL R8, R6, RZ, P1; /* 0xe50004007fa01822 */
/*0398*/ DMUL.S R8, R8, R4; /* 0xe4000000025c2022 */
/*03a0*/ MOV R4, R8; /* 0xe4c03c00041c0012 */
/*03a8*/ MOV R5, R9; /* 0xe4c03c00049c0016 */
/*03b0*/ RET; /* 0x19000000001c003c */
/*03b8*/ BRA 0x3b8; /* 0x12007ffffc1c003c */
The MUFU.RCP64H instruction provides the initial approximation of the reciprocal. It operates on the high 32 bits of the denominator (y) and provides the high 32 bits of the double precision approximation, and is therefore counted as a single precision special floating point operation (flop_count_sp_special) by the profiler.
There is another single precision FFMA instruction further down apparently used as a high-throughput version of testing a conditional where full precision isn't required.
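The refinement itself can be sketched on the host. The function below mirrors my reading of the first DFMA/DMUL chain in the listing (offsets 0x0048 to 0x0090). It is only a sketch under assumptions: a single precision reciprocal stands in for MUFU.RCP64H (the real instruction works on the high word only), the constant c[0x2][0x0] is assumed to be 1.0, and the slow path for special cases behind the CAL is ignored. The names divideByRefinement and checkDivision are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Division via reciprocal refinement, for finite inputs of moderate magnitude.
// Assumption: the float reciprocal is a stand-in for MUFU.RCP64H.
double divideByRefinement(double x, double y) {
    const double r0  = static_cast<double>(1.0f / static_cast<float>(y));  // low-precision seed
    const double e   = std::fma(-y, r0, 1.0);  // error of the seed: 1 - y*r0
    const double t   = std::fma(e, e, e);      // e + e^2
    const double r   = std::fma(t, r0, r0);    // refined reciprocal: r0*(1 + e + e^2)
    const double q   = r * x;                  // first quotient estimate
    const double rem = std::fma(-y, q, x);     // remainder: x - y*q
    return std::fma(rem, r, q);                // corrected quotient
}

// Compares the sketch against the built-in division on two sample inputs.
void checkDivision() {
    assert(divideByRefinement(10.0, 4.0) == 2.5);
    assert(std::fabs(divideByRefinement(1.0, 3.0) - 1.0 / 3.0) <= 1e-15);
}
```

The point is only to show where the single precision step enters: the seed is cheap and imprecise, and the fused multiply-adds recover the remaining bits.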
I cannot find a document that explains the following instruction format in CUDA:
FMAD R6, -R6, c [0x1] [0x1], R5;
What is the format (source, destination, ...) and what is that -R6?
The PTX reference guide describes fma as follows
fma.rnd{.ftz}{.sat}.f32 d, a, b, c;
fma.rnd.f64 d, a, b, c;
performs
d = a*b + c;
in either single or double precision.
You are looking at disassembled SASS. The instruction set references for it show FMAD as the (non IEEE 754 compliant) single precision form from the GT200 instruction set. That is a little bit problematic, because I don't presently have a toolchain which supports that deprecated instruction set. However, if I use the Fermi instruction set instead and compile this kernel:
__global__ void kernel(const float *x, const float *y, float *a)
{
float xval = x[threadIdx.x];
float yval = y[threadIdx.x];
float aval = -xval * xval + yval;
a[threadIdx.x] = aval;
}
I get this SASS:
code for sm_20
Function : _Z6kernelPKfS0_Pf
.headerflags @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R3, SR_TID.X; /* 0x2c0000008400dc04 */
/*0010*/ MOV32I R5, 0x4; /* 0x1800000010015de2 */
/*0018*/ IMAD.U32.U32 R8.CC, R3, R5, c[0x0][0x20]; /* 0x200b800080321c03 */
/*0020*/ IMAD.U32.U32.HI.X R9, R3, R5, c[0x0][0x24]; /* 0x208a800090325c43 */
/*0028*/ IMAD.U32.U32 R6.CC, R3, R5, c[0x0][0x28]; /* 0x200b8000a0319c03 */
/*0030*/ LD.E R0, [R8]; /* 0x8400000000801c85 */
/*0038*/ IMAD.U32.U32.HI.X R7, R3, R5, c[0x0][0x2c]; /* 0x208a8000b031dc43 */
/*0040*/ IMAD.U32.U32 R4.CC, R3, R5, c[0x0][0x30]; /* 0x200b8000c0311c03 */
/*0048*/ LD.E R2, [R6]; /* 0x8400000000609c85 */
/*0050*/ IMAD.U32.U32.HI.X R5, R3, R5, c[0x0][0x34]; /* 0x208a8000d0315c43 */
/*0058*/ FFMA.FTZ R0, -R0, R0, R2; /* 0x3004000000001e40 */
/*0060*/ ST.E [R4], R0; /* 0x9400000000401c85 */
/*0068*/ EXIT; /* 0x8000000000001de7 */
..................................
Note that I also have the negated register in the FFMA.FTZ arguments. So I would guess that your:
FMAD R6, -R6, c [0x1] [0x1], R5;
is the equivalent of
R6 = -R6 * const + R5
where c [0x1] [0x1] is a compile time constant, and that the GPU has some sort of instruction modifier which it can set to control negation of a floating point value as part of a floating point operation without explicitly twiddling the sign bit of the register before the call.
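Written out as host code, that reading of the operand order looks like the sketch below. The function names are hypothetical, and the arithmetic is ordinary float multiply and add, so it illustrates only the operand order and the negation modifier, not the rounding behaviour of FMAD or FFMA.

```cpp
// Hypothetical model of "OP Rd, Ra, Rb, Rc": destination first, then d = a*b + c.
constexpr float multiplyAdd(float a, float b, float c) {
    return a * b + c;
}

// FMAD R6, -R6, c[0x1][0x1], R5  read as  R6 = (-R6) * constant + R5.
// The leading minus negates the first source as part of the operation.
constexpr float fmadNegatedFirstSource(float r6, float constant, float r5) {
    return multiplyAdd(-r6, constant, r5);
}

// Example: R6 = 2, constant = 3, R5 = 10 gives (-2)*3 + 10 = 4.
static_assert(fmadNegatedFirstSource(2.0f, 3.0f, 10.0f) == 4.0f,
              "destination = -a*b + c");
```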
(I look forward to @njuffa tearing this answer to shreds).
The following code sums every 32 elements in an array to the very first element of each 32 element group:
int i = threadIdx.x;
int warpid = i&31;
if(warpid < 16){
s_buf[i] += s_buf[i+16];__syncthreads();
s_buf[i] += s_buf[i+8];__syncthreads();
s_buf[i] += s_buf[i+4];__syncthreads();
s_buf[i] += s_buf[i+2];__syncthreads();
s_buf[i] += s_buf[i+1];__syncthreads();
}
I thought I could eliminate all the __syncthreads() calls in the code, since all the operations are done in the same warp. But if I eliminate them, I get garbage results back. It should not affect performance too much, but I want to know why I need __syncthreads() here.
I'm providing an answer here because I think that the above two are not fully satisfactory. The "intellectual property" of this answer belongs to Mark Harris, who has pointed out this issue in this presentation (slide 22), and to @talonmies, who has pointed this problem out to the OP in the comments above.
Let me first try to summarize what the OP was asking, filtering out his mistakes.
The OP seems to be dealing with the last step of a shared memory reduction: warp reduction by loop unrolling. He is doing something like
template <class T>
__device__ void warpReduce(T *sdata, int tid) {
sdata[tid] += sdata[tid + 32];
sdata[tid] += sdata[tid + 16];
sdata[tid] += sdata[tid + 8];
sdata[tid] += sdata[tid + 4];
sdata[tid] += sdata[tid + 2];
sdata[tid] += sdata[tid + 1];
}
template <class T>
__global__ void reduce4_no_synchthreads(T *g_idata, T *g_odata, unsigned int N)
{
extern __shared__ T sdata[];
unsigned int tid = threadIdx.x; // Local thread index
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x; // Global thread index - Fictitiously double the block dimension
// --- Performs the first level of reduction in registers when reading from global memory.
T mySum = (i < N) ? g_idata[i] : 0;
if (i + blockDim.x < N) mySum += g_idata[i+blockDim.x];
sdata[tid] = mySum;
// --- Before going further, we have to make sure that all the shared memory stores have been completed
__syncthreads();
// --- Reduction in shared memory. Only half of the threads contribute to reduction.
for (unsigned int s=blockDim.x/2; s>32; s>>=1)
{
if (tid < s) { sdata[tid] = mySum = mySum + sdata[tid + s]; }
// --- At the end of each iteration loop, we have to make sure that all memory operations have been completed
__syncthreads();
}
// --- Single warp reduction by loop unrolling. Assuming blockDim.x >64
if (tid < 32) warpReduce(sdata, tid);
// --- Write result for this block to global memory. At the end of the kernel, global memory will contain the results for the summations of
// individual blocks
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
As pointed out by Mark Harris and talonmies, the shared memory pointer sdata must be declared volatile, so that the compiler does not keep the intermediate sums in registers instead of writing them back to shared memory after every step. So, the right way to define the __device__ function above is:
template <class T>
__device__ void warpReduce(volatile T *sdata, int tid) {
sdata[tid] += sdata[tid + 32];
sdata[tid] += sdata[tid + 16];
sdata[tid] += sdata[tid + 8];
sdata[tid] += sdata[tid + 4];
sdata[tid] += sdata[tid + 2];
sdata[tid] += sdata[tid + 1];
}
Let us now look at the disassembled code for the two cases examined above, i.e., sdata declared without and with volatile (code compiled for the Fermi architecture).
Not volatile
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R0, SR_CTAID.X; /* 0x2c00000094001c04 */
/*0010*/ SHL R3, R0, 0x1; /* 0x6000c0000400dc03 */
/*0018*/ S2R R2, SR_TID.X; /* 0x2c00000084009c04 */
/*0020*/ IMAD R3, R3, c[0x0][0x8], R2; /* 0x200440002030dca3 */
/*0028*/ IADD R4, R3, c[0x0][0x8]; /* 0x4800400020311c03 */
/*0030*/ ISETP.LT.U32.AND P0, PT, R3, c[0x0][0x28], PT; /* 0x188e4000a031dc03 */
/*0038*/ ISETP.GE.U32.AND P1, PT, R4, c[0x0][0x28], PT; /* 0x1b0e4000a043dc03 */
/*0040*/ @P0 ISCADD R3, R3, c[0x0][0x20], 0x2; /* 0x400040008030c043 */
/*0048*/ @!P1 ISCADD R4, R4, c[0x0][0x20], 0x2; /* 0x4000400080412443 */
/*0050*/ @!P0 MOV R5, RZ; /* 0x28000000fc0161e4 */
/*0058*/ @!P1 LD R4, [R4]; /* 0x8000000000412485 */
/*0060*/ @P0 LD R5, [R3]; /* 0x8000000000314085 */
/*0068*/ SHL R3, R2, 0x2; /* 0x6000c0000820dc03 */
/*0070*/ NOP; /* 0x4000000000001de4 */
/*0078*/ @!P1 IADD R5, R4, R5; /* 0x4800000014416403 */
/*0080*/ MOV R4, c[0x0][0x8]; /* 0x2800400020011de4 */
/*0088*/ STS [R3], R5; /* 0xc900000000315c85 */
/*0090*/ BAR.RED.POPC RZ, RZ, RZ, PT; /* 0x50ee0000ffffdc04 */
/*0098*/ MOV R6, c[0x0][0x8]; /* 0x2800400020019de4 */
/*00a0*/ ISETP.LT.U32.AND P0, PT, R6, 0x42, PT; /* 0x188ec0010861dc03 */
/*00a8*/ @P0 BRA 0x118; /* 0x40000001a00001e7 */
/*00b0*/ NOP; /* 0x4000000000001de4 */
/*00b8*/ NOP; /* 0x4000000000001de4 */
/*00c0*/ MOV R6, R4; /* 0x2800000010019de4 */
/*00c8*/ SHR.U32 R4, R4, 0x1; /* 0x5800c00004411c03 */
/*00d0*/ ISETP.GE.U32.AND P0, PT, R2, R4, PT; /* 0x1b0e00001021dc03 */
/*00d8*/ @!P0 IADD R7, R4, R2; /* 0x480000000841e003 */
/*00e0*/ @!P0 SHL R7, R7, 0x2; /* 0x6000c0000871e003 */
/*00e8*/ @!P0 LDS R7, [R7]; /* 0xc10000000071e085 */
/*00f0*/ @!P0 IADD R5, R7, R5; /* 0x4800000014716003 */
/*00f8*/ @!P0 STS [R3], R5; /* 0xc900000000316085 */
/*0100*/ BAR.RED.POPC RZ, RZ, RZ, PT; /* 0x50ee0000ffffdc04 */
/*0108*/ ISETP.GT.U32.AND P0, PT, R6, 0x83, PT; /* 0x1a0ec0020c61dc03 */
/*0110*/ @P0 BRA 0xc0; /* 0x4003fffea00001e7 */
/*0118*/ ISETP.GT.U32.AND P0, PT, R2, 0x1f, PT; /* 0x1a0ec0007c21dc03 */
/*0120*/ @P0 BRA.U 0x198; /* 0x40000001c00081e7 */
/*0128*/ @!P0 LDS R8, [R3]; /* 0xc100000000322085 */
/*0130*/ @!P0 LDS R5, [R3+0x80]; /* 0xc100000200316085 */
/*0138*/ @!P0 LDS R4, [R3+0x40]; /* 0xc100000100312085 */
/*0140*/ @!P0 LDS R7, [R3+0x20]; /* 0xc10000008031e085 */
/*0148*/ @!P0 LDS R6, [R3+0x10]; /* 0xc10000004031a085 */
/*0150*/ @!P0 IADD R8, R8, R5; /* 0x4800000014822003 */
/*0158*/ @!P0 IADD R8, R8, R4; /* 0x4800000010822003 */
/*0160*/ @!P0 LDS R5, [R3+0x8]; /* 0xc100000020316085 */
/*0168*/ @!P0 IADD R7, R8, R7; /* 0x480000001c81e003 */
/*0170*/ @!P0 LDS R4, [R3+0x4]; /* 0xc100000010312085 */
/*0178*/ @!P0 IADD R6, R7, R6; /* 0x480000001871a003 */
/*0180*/ @!P0 IADD R5, R6, R5; /* 0x4800000014616003 */
/*0188*/ @!P0 IADD R4, R5, R4; /* 0x4800000010512003 */
/*0190*/ @!P0 STS [R3], R4; /* 0xc900000000312085 */
/*0198*/ ISETP.NE.AND P0, PT, R2, RZ, PT; /* 0x1a8e0000fc21dc23 */
/*01a0*/ @P0 BRA.U 0x1c0; /* 0x40000000600081e7 */
/*01a8*/ @!P0 ISCADD R0, R0, c[0x0][0x24], 0x2; /* 0x4000400090002043 */
/*01b0*/ @!P0 LDS R2, [RZ]; /* 0xc100000003f0a085 */
/*01b8*/ @!P0 ST [R0], R2; /* 0x900000000000a085 */
/*01c0*/ EXIT; /* 0x8000000000001de7 */
Lines /*0128*/-/*0148*/, /*0160*/ and /*0170*/ are the loads from shared memory into registers, and line /*0190*/ is the single store from a register back to shared memory. The lines in between are the summations, performed entirely in registers. So the intermediate results are kept in registers (which are private to each thread) and are not written back to shared memory after each step, which means the other threads of the warp never see the intermediate results they need.
volatile
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R0, SR_CTAID.X; /* 0x2c00000094001c04 */
/*0010*/ SHL R3, R0, 0x1; /* 0x6000c0000400dc03 */
/*0018*/ S2R R2, SR_TID.X; /* 0x2c00000084009c04 */
/*0020*/ IMAD R3, R3, c[0x0][0x8], R2; /* 0x200440002030dca3 */
/*0028*/ IADD R4, R3, c[0x0][0x8]; /* 0x4800400020311c03 */
/*0030*/ ISETP.LT.U32.AND P0, PT, R3, c[0x0][0x28], PT; /* 0x188e4000a031dc03 */
/*0038*/ ISETP.GE.U32.AND P1, PT, R4, c[0x0][0x28], PT; /* 0x1b0e4000a043dc03 */
/*0040*/ @P0 ISCADD R3, R3, c[0x0][0x20], 0x2; /* 0x400040008030c043 */
/*0048*/ @!P1 ISCADD R4, R4, c[0x0][0x20], 0x2; /* 0x4000400080412443 */
/*0050*/ @!P0 MOV R5, RZ; /* 0x28000000fc0161e4 */
/*0058*/ @!P1 LD R4, [R4]; /* 0x8000000000412485 */
/*0060*/ @P0 LD R5, [R3]; /* 0x8000000000314085 */
/*0068*/ SHL R3, R2, 0x2; /* 0x6000c0000820dc03 */
/*0070*/ NOP; /* 0x4000000000001de4 */
/*0078*/ @!P1 IADD R5, R4, R5; /* 0x4800000014416403 */
/*0080*/ MOV R4, c[0x0][0x8]; /* 0x2800400020011de4 */
/*0088*/ STS [R3], R5; /* 0xc900000000315c85 */
/*0090*/ BAR.RED.POPC RZ, RZ, RZ, PT; /* 0x50ee0000ffffdc04 */
/*0098*/ MOV R6, c[0x0][0x8]; /* 0x2800400020019de4 */
/*00a0*/ ISETP.LT.U32.AND P0, PT, R6, 0x42, PT; /* 0x188ec0010861dc03 */
/*00a8*/ @P0 BRA 0x118; /* 0x40000001a00001e7 */
/*00b0*/ NOP; /* 0x4000000000001de4 */
/*00b8*/ NOP; /* 0x4000000000001de4 */
/*00c0*/ MOV R6, R4; /* 0x2800000010019de4 */
/*00c8*/ SHR.U32 R4, R4, 0x1; /* 0x5800c00004411c03 */
/*00d0*/ ISETP.GE.U32.AND P0, PT, R2, R4, PT; /* 0x1b0e00001021dc03 */
/*00d8*/ @!P0 IADD R7, R4, R2; /* 0x480000000841e003 */
/*00e0*/ @!P0 SHL R7, R7, 0x2; /* 0x6000c0000871e003 */
/*00e8*/ @!P0 LDS R7, [R7]; /* 0xc10000000071e085 */
/*00f0*/ @!P0 IADD R5, R7, R5; /* 0x4800000014716003 */
/*00f8*/ @!P0 STS [R3], R5; /* 0xc900000000316085 */
/*0100*/ BAR.RED.POPC RZ, RZ, RZ, PT; /* 0x50ee0000ffffdc04 */
/*0108*/ ISETP.GT.U32.AND P0, PT, R6, 0x83, PT; /* 0x1a0ec0020c61dc03 */
/*0110*/ @P0 BRA 0xc0; /* 0x4003fffea00001e7 */
/*0118*/ ISETP.GT.U32.AND P0, PT, R2, 0x1f, PT; /* 0x1a0ec0007c21dc03 */
/*0120*/ SSY 0x1f0; /* 0x6000000320000007 */
/*0128*/ @P0 NOP.S; /* 0x40000000000001f4 */
/*0130*/ LDS R5, [R3]; /* 0xc100000000315c85 */
/*0138*/ LDS R4, [R3+0x80]; /* 0xc100000200311c85 */
/*0140*/ IADD R6, R5, R4; /* 0x4800000010519c03 */
/*0148*/ STS [R3], R6; /* 0xc900000000319c85 */
/*0150*/ LDS R5, [R3]; /* 0xc100000000315c85 */
/*0158*/ LDS R4, [R3+0x40]; /* 0xc100000100311c85 */
/*0160*/ IADD R6, R5, R4; /* 0x4800000010519c03 */
/*0168*/ STS [R3], R6; /* 0xc900000000319c85 */
/*0170*/ LDS R5, [R3]; /* 0xc100000000315c85 */
/*0178*/ LDS R4, [R3+0x20]; /* 0xc100000080311c85 */
/*0180*/ IADD R6, R5, R4; /* 0x4800000010519c03 */
/*0188*/ STS [R3], R6; /* 0xc900000000319c85 */
/*0190*/ LDS R5, [R3]; /* 0xc100000000315c85 */
/*0198*/ LDS R4, [R3+0x10]; /* 0xc100000040311c85 */
/*01a0*/ IADD R6, R5, R4; /* 0x4800000010519c03 */
/*01a8*/ STS [R3], R6; /* 0xc900000000319c85 */
/*01b0*/ LDS R5, [R3]; /* 0xc100000000315c85 */
/*01b8*/ LDS R4, [R3+0x8]; /* 0xc100000020311c85 */
/*01c0*/ IADD R6, R5, R4; /* 0x4800000010519c03 */
/*01c8*/ STS [R3], R6; /* 0xc900000000319c85 */
/*01d0*/ LDS R5, [R3]; /* 0xc100000000315c85 */
/*01d8*/ LDS R4, [R3+0x4]; /* 0xc100000010311c85 */
/*01e0*/ IADD R4, R5, R4; /* 0x4800000010511c03 */
/*01e8*/ STS.S [R3], R4; /* 0xc900000000311c95 */
/*01f0*/ ISETP.NE.AND P0, PT, R2, RZ, PT; /* 0x1a8e0000fc21dc23 */
/*01f8*/ @P0 BRA.U 0x218; /* 0x40000000600081e7 */
/*0200*/ @!P0 ISCADD R0, R0, c[0x0][0x24], 0x2; /* 0x4000400090002043 */
/*0208*/ @!P0 LDS R2, [RZ]; /* 0xc100000003f0a085 */
/*0210*/ @!P0 ST [R0], R2; /* 0x900000000000a085 */
/*0218*/ EXIT; /* 0x8000000000001de7 */
As can be seen from lines /*0130*/-/*01e8*/, each summation is now followed immediately by a store of the intermediate result to shared memory, and the next step reloads it from there, so every thread in the warp sees the updated values.
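To try the volatile version end to end, here is a minimal, self-contained sketch; it is not the original poster's code. Several things in it are my own assumptions: the kernel is specialised to int and renamed reduceBlocks (a hypothetical name), the host harness and the test data are invented, and each step of warpReduce is split into a read and a write separated by __syncwarp() (available from CUDA 9). That last change is there because on Volta and newer GPUs the threads of a warp are not guaranteed to run in lockstep, so volatile alone may not be sufficient on those devices.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Last-warp reduction. 'volatile' forces every partial sum to be stored to and
// reloaded from shared memory. Assumption (not in the answer above): the read
// and the write of each step are separated by __syncwarp() for Volta and later.
__device__ void warpReduce(volatile int *sdata, int tid) {
    for (int offset = 32; offset > 0; offset >>= 1) {
        int v = sdata[tid] + sdata[tid + offset];  // read phase
        __syncwarp();
        sdata[tid] = v;                            // write phase
        __syncwarp();
    }
}

// Same structure as the kernel in the answer, specialised to int.
// Assumes blockDim.x is a power of two and at least 64.
__global__ void reduceBlocks(const int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    int mySum = (i < n) ? g_idata[i] : 0;
    if (i + blockDim.x < n) mySum += g_idata[i + blockDim.x];
    sdata[tid] = mySum;
    __syncthreads();

    // More than one warp is active here, so __syncthreads() is required.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] = mySum = mySum + sdata[tid + s];
        __syncthreads();
    }

    if (tid < 32) warpReduce(sdata, tid);          // whole first warp takes part
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

int main() {
    const unsigned int n = 1 << 20;
    const unsigned int threads = 256;
    const unsigned int blocks = (n + 2 * threads - 1) / (2 * threads);

    // Invented test data: small values so the total fits comfortably in an int.
    int *h_in = (int *)malloc(n * sizeof(int));
    long long expected = 0;
    for (unsigned int k = 0; k < n; ++k) { h_in[k] = (int)(k % 7); expected += h_in[k]; }

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    reduceBlocks<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);

    int *h_out = (int *)malloc(blocks * sizeof(int));
    cudaMemcpy(h_out, d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    // Finish the reduction on the host and compare with the CPU reference.
    long long total = 0;
    for (unsigned int b = 0; b < blocks; ++b) total += h_out[b];
    printf("%s\n", total == expected ? "match" : "MISMATCH");

    cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
    return total == expected ? 0 : 1;
}
```

Removing the volatile qualifier and the __syncwarp() calls from warpReduce and comparing the cuobjdump -sass output should show the same kind of difference as the two listings above, although the exact instructions will depend on the target architecture and compiler version.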
Have a look at these slides from Mark Harris; there is no need to reinvent the wheel.
www.uni-graz.at/~haasegu/Lectures/GPU_CUDA/Lit/reduction.pdf?page=35
Each reduction step depends on the result of the previous one.
So you can only leave out the synchronization in the final phase of the reduction, once a single warp (32 active threads) remains.
One step earlier, 64 threads are still active, which spans 2 warps. Since two warps are not guaranteed to execute in lockstep with each other, a synchronization is still needed at that point.
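The one-warp versus two-warp distinction can also be sketched with warp shuffles, which is a different technique from the shared memory approach discussed in this thread: values are exchanged directly between registers within a warp, so no volatile shared array is needed there, while the hand-off between warps still goes through shared memory and a __syncthreads(). The names warpReduceSum and blockSum and the test data are hypothetical, and the sketch assumes CUDA 9 or later (for __shfl_down_sync) and a block size that is a multiple of 32.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sum across the 32 lanes of a warp using register shuffles.
// After the loop, lane 0 holds the sum of all 32 lanes.
__inline__ __device__ int warpReduceSum(int val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

// One partial sum per block. Assumes blockDim.x is a multiple of 32.
__global__ void blockSum(const int *in, int *out, int n) {
    __shared__ int partial[32];            // one slot per warp (at most 32 warps)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;

    int v = (i < n) ? in[i] : 0;           // out-of-range threads contribute 0
    v = warpReduceSum(v);                  // within one warp: no __syncthreads()
    if (lane == 0) partial[warp] = v;
    __syncthreads();                       // partials come from different warps

    if (warp == 0) {
        int numWarps = blockDim.x / 32;
        v = (lane < numWarps) ? partial[lane] : 0;
        v = warpReduceSum(v);              // single warp again
        if (lane == 0) out[blockIdx.x] = v;
    }
}

int main() {
    const int n = 1000, threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Invented test data: 1, 2, ..., n.
    int h_in[n], h_out[blocks];
    long long expected = 0;
    for (int k = 0; k < n; ++k) { h_in[k] = k + 1; expected += h_in[k]; }

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    // Add the per-block partial sums on the host and compare.
    long long total = 0;
    for (int b = 0; b < blocks; ++b) total += h_out[b];
    printf("%s\n", total == expected ? "match" : "MISMATCH");

    cudaFree(d_in); cudaFree(d_out);
    return total == expected ? 0 : 1;
}
```

The only block-wide barrier left is the one between the per-warp sums and the final pass, which is exactly the place where more than one warp has to agree on the contents of shared memory.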