CUDA stack frame size increase by __forceinline__

When I declare device functions with __forceinline__, the linker outputs this information:
2> nvlink : info : Function properties for '_ZN3GPU4Flux4calcILj512EEEvv':
2> nvlink : info : used 28 registers, 456 stack, 15776 bytes smem, 320 bytes cmem[0], 0 bytes lmem
and without it the output is:
2> nvlink : info : Function properties for '_ZN3GPU4Flux4calcILj512EEEvv':
2> nvlink : info : used 23 registers, 216 stack, 15776 bytes smem, 320 bytes cmem[0], 0 bytes lmem
Why is the stack frame smaller when __forceinline__ is not used?
How important is it to keep the stack frame as small as possible?
Thank you for your help.

The main reason to reduce the stack frame is that the stack is allocated in local memory, which resides in off-chip device memory. This makes access to the stack (if not cached) slow.
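As a side note, the size of this per-thread stack is a runtime-adjustable limit that can be queried and, if needed, enlarged; a minimal host-side sketch (the 2048-byte value is purely illustrative):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stackSize = 0;
    // query the current per-thread stack size (it lives in local memory)
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("per-thread stack: %zu bytes\n", stackSize);

    // enlarge it if deep call chains or large local arrays need more room
    cudaDeviceSetLimit(cudaLimitStackSize, 2048);
    return 0;
}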
To show this, let me make a simple example. Consider the case:
__device__ __noinline__ void func(float* d_a, float* test, int tid) {
    d_a[tid] = test[tid] * d_a[tid];
}

__global__ void kernel_function(float* d_a) {
    float test[16];
    test[threadIdx.x] = threadIdx.x;
    func(d_a, test, threadIdx.x);
}
Note that the __device__ function is declared __noinline__. In this case
ptxas : info : Function properties for _Z15kernel_functionPf
64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 7 registers, 36 bytes cmem[0]
i.e., we have 64 bytes of stack frame. The corresponding disassembled code is
MOV R1, c[0x1][0x100];
ISUB R1, R1, 0x40;
S2R R6, SR_TID.X; R6 = ThreadIdx.x
MOV R4, c[0x0][0x20];
IADD R5, R1, c[0x0][0x4];
I2F.F32.U32 R2, R6; R2 = R6 (integer to float conversion)
ISCADD R0, R6, R1, 0x2;
STL [R0], R2; stores R2 to test[ThreadIdx.x]
CAL 0x50; calls the __device__ function located at offset 0x50
EXIT ; end of the kernel; the __device__ function body follows
ISCADD R2, R6, R5, 0x2;
ISCADD R3, R6, R4, 0x2;
LD R2, [R2]; loads d_a[tid]
LD R0, [R3]; loads test[tid]
FMUL R0, R2, R0; d_a[tid] = d_a[tid]*test[tid]
ST [R3], R0; store the new value of d_a[tid] to global memory
RET ;
As you can see, test is stored to and loaded back from the stack in local memory (which physically resides in off-chip device memory), forming the stack frame (16 floats = 64 bytes).
Now change the device function to
__device__ __forceinline__ void func(float* d_a, float* test, int tid) {
d_a[tid]=test[tid]*d_a[tid];
}
that is, change the __device__ function from __noinline__ to __forceinline__. In this case, we have
ptxas : info : Compiling entry function '_Z15kernel_functionPf' for 'sm_20'
ptxas : info : Function properties for _Z15kernel_functionPf
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
i.e., we have an empty stack frame now. Indeed, the disassembled code becomes:
MOV R1, c[0x1][0x100];
S2R R2, SR_TID.X; R2 = ThreadIdx.x
ISCADD R3, R2, c[0x0][0x20], 0x2;
I2F.F32.U32 R2, R2; R2 = R2 (integer to float conversion)
LD R0, [R3]; R0 = d_a[ThreadIdx.x] (load from global memory)
FMUL R0, R2, R0; d_a[ThreadIdx.x] = d_a[ThreadIdx.x] * ThreadIdx.x
ST [R3], R0; stores the new value of d_a[ThreadIdx.x] to global memory
EXIT ;
As you can see, forcing the inlining enables the compiler to perform proper optimizations so that now test is fully discarded from the code.
In the above example, __forceinline__ has an effect that is opposite to what you are experiencing, which also shows that, without any further information, the first question cannot be answered.

Related

IADD.X GPU instruction

When looking into the SASS output generated for the NVIDIA Fermi architecture, the instruction IADD.X is observed. From NVIDIA documentation, IADD means integer add, but I do not understand what IADD.X means. Can somebody please help? Does it mean an integer addition with an extended number of bits?
The instruction snippet is:
IADD.X R5, R3, c[0x0][0x24]; /* 0x4800400090315c43 */
Yes, the .X stands for eXtended precision. You will see IADD.X used together with IADD.CC, where the latter adds the less significant bits and produces a carry flag (thus the .CC), and this carry flag is then incorporated into the addition of the more significant bits performed by IADD.X.
Since NVIDIA GPUs are basically 32-bit processors with 64-bit addressing capability, a frequent use of this idiom is in address (pointer) arithmetic. The use of 64-bit integer types, such as long long int or uint64_t will likewise lead to the use of these instructions.
Here is a worked example of a kernel doing 64-bit integer addition. This CUDA code was compiled for compute capability 3.5 with CUDA 7.5, and the machine code dumped with cuobjdump --dump-sass.
__global__ void addint64(long long int a, long long int b, long long int *res)
{
    *res = a + b;
}
MOV R1, c[0x0][0x44];
MOV R2, c[0x0][0x148]; // b[31:0]
MOV R0, c[0x0][0x14c]; // b[63:32]
IADD R4.CC, R2, c[0x0][0x140]; // tmp[31:0] = b[31:0] + a[31:0]; carry-out
MOV R2, c[0x0][0x150]; // res[31:0]
MOV R3, c[0x0][0x154]; // res[63:32]
IADD.X R5, R0, c[0x0][0x144]; // tmp[63:32] = b[63:32] + a[63:32] + carry-in
ST.E.64 [R2], R4; // [res] = tmp[63:0]
EXIT
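For completeness, the same carry idiom can be expressed directly from CUDA C via inline PTX, using add.cc and addc (the helper name add64_via_32 is made up for this sketch):
__device__ unsigned long long add64_via_32(unsigned int alo, unsigned int ahi,
                                           unsigned int blo, unsigned int bhi)
{
    unsigned int lo, hi;
    asm("add.cc.u32  %0, %2, %3;\n\t"   // low halves, sets the carry flag (like IADD.CC)
        "addc.u32    %1, %4, %5;\n\t"   // high halves plus carry-in (like IADD.X)
        : "=r"(lo), "=r"(hi)
        : "r"(alo), "r"(blo), "r"(ahi), "r"(bhi));
    return ((unsigned long long)hi << 32) | lo;
}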

The concept of branch (taken, not taken, diverged) in CUDA

In Nsight Visual Studio, there is a graph presenting the statistics of "taken", "not taken" and "diverged" branches. I am confused about the difference between "not taken" and "diverged".
For example
kernel()
{
    if (tid % 32 != 31)
    { ... }
    else
    { ... }
}
In my opinion, when tid % 32 == 31 for one thread in a warp, divergence will happen, but what is "not taken"?
From the Nsight Visual Studio Edition User Guide:
Not Taken / Taken Total: number of executed branch instructions with a uniform control flow decision; that is all active threads of a warp either take or not take the branch.
Diverged: Total number of executed branch instruction for which the conditional resulted in different outcomes across the threads of the warp. All code paths with at least one participating thread get executed sequentially. Lower numbers are better, however, check the Flow Control Efficiency to understand the impact of control flow on the device utilization.
Now, let us consider the following simple code, which perhaps is what you are currently considering in your tests:
#include <thrust/device_vector.h>
__global__ void test_divergence(int* d_output) {
    int tid = threadIdx.x;
    if (tid % 32 != 31)
        d_output[tid] = tid;
    else
        d_output[tid] = 30000;
}

int main() {
    const int N = 32;
    thrust::device_vector<int> d_vec(N, 0);
    test_divergence<<<2,32>>>(thrust::raw_pointer_cast(d_vec.data()));
}
The Branch Statistics graph produced by Nsight is reported below. As you can see, Taken is equal to 100%, since all the threads bump into the if statement. The surprising result is that you have no Diverged branches. This can be explained by taking a look at the disassembled code of the kernel function (compiled for compute capability 2.1):
MOV R1, c[0x1][0x100];
S2R R0, SR_TID.X;
SHR R2, R0, 0x1f;
IMAD.U32.U32.HI R2, R2, 0x20, R0;
LOP.AND R2, R2, -0x20;
ISUB R2, R0, R2;
ISETP.EQ.AND P0, PT, R2, 0x1f, PT;
ISCADD R2, R0, c[0x0][0x20], 0x2;
SEL R0, R0, 0x7530, !P0;
ST [R2], R0;
EXIT;
As you can see, the compiler is able to optimize the code so that no branching is present, except the uniform one due to the EXIT instruction, as pointed out by Greg Smith in the comment below.
EDIT: A MORE COMPLEX EXAMPLE FOLLOWING GREG SMITH'S COMMENT
I'm now considering the following more complex example
/**************************/
/* TEST DIVERGENCE KERNEL */
/**************************/
__global__ void testDivergence(float *a, float *b)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < 16) a[tid] = tid + 1;
    else          b[tid] = tid + 2;
}

/********/
/* MAIN */
/********/
int main() {
    const int N = 64;
    float* d_a; cudaMalloc((void**)&d_a, N*sizeof(float));
    float* d_b; cudaMalloc((void**)&d_b, N*sizeof(float));
    testDivergence<<<2,32>>>(d_a, d_b);
}
This is the Branch Statistics graph
while this is the disassembled code
MOV R1, c[0x1][0x100];
S2R R0, SR_CTAID.X; R0 = blockIdx.x
S2R R2, SR_TID.X; R2 = threadIdx.x
IMAD R0, R0, c[0x0][0x8], R2; R0 = threadIdx.x + blockIdx.x * blockDim.x
ISETP.LT.AND P0, PT, R0, 0x10, PT; Checks if R0 < 16 and puts the result in predicate register P0
/*0028*/ @P0 BRA.U 0x58; If P0 = true, jumps to offset 0x58
@!P0 IADD R2, R0, 0x2; If P0 = false, R2 = R0 + 2
@!P0 ISCADD R0, R0, c[0x0][0x24], 0x2; If P0 = false, calculates the address at which to store b[tid] in global memory
@!P0 I2F.F32.S32 R2, R2; "
@!P0 ST [R0], R2; "
/*0050*/ @!P0 BRA.U 0x78; If P0 = false, jumps to offset 0x78
/*0058*/ @P0 IADD R2, R0, 0x1; R2 = R0 + 1
@P0 ISCADD R0, R0, c[0x0][0x20], 0x2;
@P0 I2F.F32.S32 R2, R2;
@P0 ST [R0], R2;
/*0078*/ EXIT;
As can be seen, there are now two BRA instructions in the disassembled code. From the graph above, each warp bumps into 3 branches (one for the EXIT and two for the BRAs). Both warps have 1 taken branch, since all the threads uniformly bump into the EXIT instruction. The first warp has 2 not taken branches, since the two BRA paths are not followed uniformly across the warp threads. The second warp has 1 not taken branch and 1 taken branch, since all the warp threads uniformly follow one of the two BRAs. I would say that diverged is again equal to zero because the instructions in the two branches are exactly the same, although performed on different operands.
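For contrast, a kernel along the following lines, where the two paths execute genuinely different instruction sequences and therefore cannot be collapsed into predicated code, is the kind of case in which the Diverged counter would be expected to become non-zero (this is only a sketch and has not been profiled here):
__global__ void testRealDivergence(float *a)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid % 2 == 0) {
        // even threads of a warp run a short loop...
        float acc = 0.0f;
        for (int i = 0; i < 10; i++) acc += i * a[tid];
        a[tid] = acc;
    } else {
        // ...while odd threads of the same warp perform a single store
        a[tid] = 0.0f;
    }
}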

When to use volatile with register/local variables

What is the meaning of declaring register arrays in CUDA with the volatile qualifier?
When I tried the volatile keyword with a register array, it removed the register spills to local memory (i.e., it forced CUDA to use registers instead of local memory). Is this the intended behavior?
I did not find any information about the usage of volatile with regard to register arrays in the CUDA documentation.
Here is the ptxas -v output for both versions
With volatile qualifier
__volatile__ float array[32];
ptxas -v output
ptxas info : Compiling entry function '_Z2swPcS_PfiiiiS0_' for 'sm_20'
ptxas info : Function properties for _Z2swPcS_PfiiiiS0_
88 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 47 registers, 16640 bytes smem, 80 bytes cmem[0], 8 bytes cmem[16]
Without volatile qualifier
float array[32];
ptxas -v output
ptxas info : Compiling entry function '_Z2swPcS_PfiiiiS0_' for 'sm_20'
ptxas info : Function properties for _Z2swPcS_PfiiiiS0_
96 bytes stack frame, 100 bytes spill stores, 108 bytes spill loads
ptxas info : Used 51 registers, 16640 bytes smem, 80 bytes cmem[0], 8 bytes cmem[16]
The volatile qualifier specifies to the compiler that all references to a variable (read or write) must result in a memory access, and that those accesses must occur in the order specified in the program. The use of the volatile qualifier is illustrated in Chapter 12 of Shane Cook's book, "CUDA Programming".
Using volatile prevents some optimizations the compiler could otherwise perform and so changes the number of registers used. The best way to understand what volatile is actually doing is to disassemble the relevant __global__ function with and without the qualifier.
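For instance, assuming the kernels are saved in a file called volatile_test.cu (the file name is just for illustration), the SASS reported further down can be reproduced with something like:
nvcc -arch=sm_20 -cubin -o volatile_test.cubin volatile_test.cu
cuobjdump --dump-sass volatile_test.cubin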
Consider indeed the following kernel functions
__global__ void volatile_test() {
    volatile float a[3];
    for (int i = 0; i < 3; i++) a[i] = (float)i;
}

__global__ void no_volatile_test() {
    float a[3];
    for (int i = 0; i < 3; i++) a[i] = (float)i;
}
Disassembling the above kernel functions, one obtains:
code for sm_20
Function : _Z16no_volatile_testv
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ EXIT ; /* 0x8000000000001de7 */
Function : _Z13volatile_testv
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ ISUB R1, R1, 0x10; /* 0x4800c00040105d03 */ R1 = address of a[0]
/*0010*/ MOV32I R2, 0x3f800000; /* 0x18fe000000009de2 */ R2 = 1.0f
/*0018*/ MOV32I R0, 0x40000000; /* 0x1900000000001de2 */ R0 = 2.0f
/*0020*/ STL [R1], RZ; /* 0xc8000000001fdc85 */ a[0] = 0.0f
/*0028*/ STL [R1+0x4], R2; /* 0xc800000010109c85 */ a[1] = R2 = 1.0f
/*0030*/ STL [R1+0x8], R0; /* 0xc800000020101c85 */ a[2] = R0 = 2.0f
/*0038*/ EXIT ; /* 0x8000000000001de7 */
As you can see, when NOT using the volatile keyword, the compiler realizes that a is set but never used (indeed, the compiler returns the following warning: variable "a" was set but never used) and there is practically no disassembled code.
In contrast, when using the volatile keyword, all the references to a are translated into memory references (writes, in this case).

How to force var4 into registers

I have the following representative code:
__global__ void func()
{
    register ushort4 result = make_ushort4(__float2half_rn(0.5), __float2half_rn(0.5),
                                           __float2half_rn(0.5), __float2half_rn(1.0));
}
When compiling, result is stored in local memory. Is it possible to force this to registers? Local memory is too slow for the intended application.
Furthermore, this result must be stored to an array of var4 elements. I would like to store these results coalesced, like ((ushort4*)(output))[x + y * width] = result;. Another solution without var4 is also an option.
A vector type should be compiled into registers if there are registers available to do so. Turning your snippet into something that will survive dead code removal:
__global__ void func(ushort4 *out)
{
    ushort4 result = make_ushort4(__float2half_rn(0.5), __float2half_rn(0.5),
                                  __float2half_rn(0.5), __float2half_rn(1.0));
    out[threadIdx.x + blockDim.x*blockIdx.x] = result;
}
and compiling it:
>nvcc -cubin -arch=sm_20 -Xptxas="-v" ushort4.cu
ushort4.cu
ushort4.cu
tmpxft_000010b8_00000000-3_ushort4.cudafe1.gpu
tmpxft_000010b8_00000000-10_ushort4.cudafe2.gpu
ptxas info : Compiling entry function '_Z4funcP7ushort4' for 'sm_20'
ptxas info : Function properties for _Z4funcP7ushort4
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 36 bytes cmem[0]
shows no spills (i.e., no local memory usage). Further, disassembling the resulting cubin file shows:
>cuobjdump --dump-sass ushort4.cubin
code for sm_20
Function : _Z4funcP7ushort4
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x01101c041000cfc0*/ F2F.F16.F32 R0, 0x3f000;
/*0010*/ /*0x94009c042c000000*/ S2R R2, SR_CTAid_X;
/*0018*/ /*0x8400dc042c000000*/ S2R R3, SR_Tid_X;
/*0020*/ /*0x01111c041000cfe0*/ F2F.F16.F32 R4, 0x3f800;
/*0028*/ /*0x00915c041c000000*/ I2I.U16.U16 R5, R0;
/*0030*/ /*0x20209c0320064000*/ IMAD.U32.U32 R2, R2, c [0x0] [0x8], R3;
/*0038*/ /*0x40019c03280ac040*/ BFI R6, R0, 0x1010, R5;
/*0040*/ /*0x4041dc03280ac040*/ BFI R7, R4, 0x1010, R5;
/*0048*/ /*0x80201c6340004000*/ ISCADD R0, R2, c [0x0] [0x20], 0x3;
/*0050*/ /*0x00019ca590000000*/ ST.64 [R0], R6;
/*0058*/ /*0x00001de780000000*/ EXIT;
.................................
i.e., the ushort4 is packed into a register pair and then a single 64-bit store is used to write the vector out to global memory. There is no local memory access to be seen.
So if you have convinced yourself that you have a vector value compiling into local memory, it is either because you have a kernel with a lot of register pressure, or because you are asking the compiler to do so (the volatile keyword will do that), or because you have misinterpreted what the compiler/assembler are telling you at compile time.
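As a follow-up to the 2D indexing mentioned in the question, here is a minimal sketch of the coalesced ushort4 store (the kernel name, the width/height parameters, and the era-appropriate __float2half_rn returning an unsigned short are assumptions of this sketch):
__global__ void write_ushort4(ushort4 *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    ushort4 result = make_ushort4(__float2half_rn(0.5f), __float2half_rn(0.5f),
                                  __float2half_rn(0.5f), __float2half_rn(1.0f));

    // adjacent threads along x write adjacent 64-bit elements, so the stores coalesce
    out[x + y * width] = result;
}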
EDIT: Using the CUDA 4.0 release toolkit with Visual Studio Express 2008 and compiling on 32-bit Windows 7 for a compute capability 1.1 device gives:
>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Fri_May_13_02:42:40_PDT_2011
Cuda compilation tools, release 4.0, V0.2.1221
>cl.exe
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
usage: cl [ option... ] filename... [ /link linkoption... ]
>nvcc -cubin -arch=sm_11 -Xptxas=-v ushort4.cu
ushort4.cu
ushort4.cu
tmpxft_00001788_00000000-3_ushort4.cudafe1.gpu
tmpxft_00001788_00000000-10_ushort4.cudafe2.gpu
ptxas info : Compiling entry function '_Z4funcP7ushort4' for 'sm_11'
ptxas info : Used 4 registers, 4+16 bytes smem
which is essentially the same result as for the original build targeting compute capability 2.0: no local memory usage.

How to determine which lines of CUDA use the most registers?

I have a somewhat complex kernel with the following stats:
ptxas info : Compiling entry function 'my_kernel' for 'sm_21'
ptxas info : Function properties for my_kernel
32 bytes stack frame, 64 bytes spill stores, 40 bytes spill loads
ptxas info : Used 62 registers, 120 bytes cmem[0], 128 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
It's not clear to me which part of the kernel is the "high water mark" in terms of register usage. The nature of the kernel is such that stubbing out various parts with constant values causes the optimizer to constant-fold later parts, etc. (at least that's how it seems, since the numbers I get back when I do so don't make much sense).
The CUDA profiler is similarly unhelpful AFAICT, simply telling me that I have register pressure.
Is there a way to get more information about register usage? I'd prefer a tool of some kind, but I'd also be interested in hearing about digging into the compiled binary directly, if that's what it takes.
Edit: It is certainly possible for me to approach this bottom-up (ie. making experimental code changes, checking the impact on register usage, etc.) but I'd rather start top-down, or at least get some guidance on where to begin bottom-up investigation.
You can get a feel for the complexity of the compiler output by compiling to annotated PTX like this:
nvcc -ptx -Xopencc="-LIST:source=on" branching.cu
which will emit a PTX assembly file with the original C code inside it as comments:
.entry _Z11branchTest0PfS_S_ (
.param .u64 __cudaparm__Z11branchTest0PfS_S__a,
.param .u64 __cudaparm__Z11branchTest0PfS_S__b,
.param .u64 __cudaparm__Z11branchTest0PfS_S__d)
{
.reg .u16 %rh<4>;
.reg .u32 %r<5>;
.reg .u64 %rd<10>;
.reg .f32 %f<5>;
.loc 16 1 0
// 1 __global__ void branchTest0(float *a, float *b, float *d)
$LDWbegin__Z11branchTest0PfS_S_:
.loc 16 7 0
// 3 unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
// 4 float aval = a[tidx], bval = b[tidx];
// 5 float z0 = (aval > bval) ? aval : bval;
// 6
// 7 d[tidx] = z0;
mov.u16 %rh1, %ctaid.x;
mov.u16 %rh2, %ntid.x;
mul.wide.u16 %r1, %rh1, %rh2;
cvt.u32.u16 %r2, %tid.x;
add.u32 %r3, %r2, %r1;
cvt.u64.u32 %rd1, %r3;
mul.wide.u32 %rd2, %r3, 4;
ld.param.u64 %rd3, [__cudaparm__Z11branchTest0PfS_S__a];
add.u64 %rd4, %rd3, %rd2;
ld.global.f32 %f1, [%rd4+0];
ld.param.u64 %rd5, [__cudaparm__Z11branchTest0PfS_S__b];
add.u64 %rd6, %rd5, %rd2;
ld.global.f32 %f2, [%rd6+0];
max.f32 %f3, %f1, %f2;
ld.param.u64 %rd7, [__cudaparm__Z11branchTest0PfS_S__d];
add.u64 %rd8, %rd7, %rd2;
st.global.f32 [%rd8+0], %f3;
.loc 16 8 0
// 8 }
exit;
$LDWend__Z11branchTest0PfS_S_:
} // _Z11branchTest0PfS_S_
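For reference, the kernel this PTX corresponds to can be reconstructed from the embedded source comments:
__global__ void branchTest0(float *a, float *b, float *d)
{
    unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
    float aval = a[tidx], bval = b[tidx];
    float z0 = (aval > bval) ? aval : bval;

    d[tidx] = z0;
}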
Note that this doesn't directly tell you anything about the register usage, because PTX uses virtual registers in static single-assignment form, but it shows you what the assembler is given as input and how it relates to your original code. With the CUDA 4.0 toolkit, you can then compile the C code to a cubin file for the Fermi architecture:
$ nvcc -cubin -arch=sm_20 -Xptxas="-v" branching.cu
ptxas info : Compiling entry function '_Z11branchTest1PfS_S_' for 'sm_20'
ptxas info : Function properties for _Z11branchTest1PfS_S_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
and use the cuobjdump utility to disassemble the machine code the assembler produces.
$ cuobjdump -sass branching.cubin
code for sm_20
Function : _Z11branchTest0PfS_S_
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x94001c042c000000*/ S2R R0, SR_CTAid_X;
/*0010*/ /*0x84009c042c000000*/ S2R R2, SR_Tid_X;
/*0018*/ /*0x10015de218000000*/ MOV32I R5, 0x4;
/*0020*/ /*0x2000dc0320044000*/ IMAD.U32.U32 R3, R0, c [0x0] [0x8], R2;
/*0028*/ /*0x10311c435000c000*/ IMUL.U32.U32.HI R4, R3, 0x4;
/*0030*/ /*0x80319c03200b8000*/ IMAD.U32.U32 R6.CC, R3, R5, c [0x0] [0x20];
/*0038*/ /*0x9041dc4348004000*/ IADD.X R7, R4, c [0x0] [0x24];
/*0040*/ /*0xa0321c03200b8000*/ IMAD.U32.U32 R8.CC, R3, R5, c [0x0] [0x28];
/*0048*/ /*0x00609c8584000000*/ LD.E R2, [R6];
/*0050*/ /*0xb0425c4348004000*/ IADD.X R9, R4, c [0x0] [0x2c];
/*0058*/ /*0xc0329c03200b8000*/ IMAD.U32.U32 R10.CC, R3, R5, c [0x0] [0x30];
/*0060*/ /*0x00801c8584000000*/ LD.E R0, [R8];
/*0068*/ /*0xd042dc4348004000*/ IADD.X R11, R4, c [0x0] [0x34];
/*0070*/ /*0x00201c00081e0000*/ FMNMX R0, R2, R0, !pt;
/*0078*/ /*0x00a01c8594000000*/ ST.E [R10], R0;
/*0080*/ /*0x00001de780000000*/ EXIT;
......................................
It is usually possible to trace back from the assembly to the PTX and get at least a rough idea of where the "greedy" code sections are. Having said all that, managing register pressure is one of the more difficult aspects of CUDA programming at the moment. If/when NVIDIA ever document their ELF format for device code, I reckon a proper code-analysis tool would be a great project for someone.