Cost of calling CUDA device functions vs MACROS [duplicate] - cuda

Two facts: CUDA 5.0 lets you compile CUDA code into different object files for linking later on. CUDA architecture 2.x no longer inlines functions automatically.
As usual in C/C++, I've implemented a function __device__ int foo() in functions.cu and placed its header in functions.hu. The function foo is called in other CUDA source files.
When I examine functions.ptx, I see that foo() spills to local memory. For testing purposes, I commented out all of the meat of foo() and just made it return 1. Something still spills to local memory according to the .ptx. (I can't imagine what it is, since the function does nothing!)
However, when I move the implementation of foo() to the header file functions.hu and add the __forceinline__ qualifier, then nothing is written to local memory!
What is going on here? Why doesn't CUDA inline such a simple function automatically?
The whole point of separate header & implementation files is to make my life easier when maintaining the code. But if I have to stick a bunch of functions (or all of them) in the header and __forceinline__ them, it kind of defeats the purpose of CUDA 5.0's separate compilation units...
Is there any way around this?
Simple, real example:
functions.cu:
__device__ int foo(const uchar param0,
                   const uchar *const param1,
                   const unsigned short int param2,
                   const unsigned short int param3,
                   const uchar param4)
{
    return 1; // real code commented out.
}
The above function spills to local memory.
functions.ptx:
.visible .func (.param .b32 func_retval0) _Z45fooPKhth(
.param .b32 _Z45foohPKhth_param_0,
.param .b64 _Z45foohPKhth_param_1,
.param .b32 _Z45foohPKhth_param_2,
.param .b32 _Z45foohPKhth_param_3
)
{
.local .align 8 .b8 __local_depot72[24];
.reg .b64 %SP;
.reg .b64 %SPL;
.reg .s16 %rc<3>;
.reg .s16 %rs<4>;
.reg .s32 %r<2>;
.reg .s64 %rd<2>;

Not all local memory usage represents spilling. Called functions need to follow the ABI calling conventions, which include the creation of a stack frame that resides in local memory. When nvcc is passed the command-line switch -Xptxas -v, the compiler reports stack usage and spilling as a subcomponent thereof.
Currently (CUDA 5.0), the CUDA toolchain does not support function inlining across the boundaries of compilation units, as some host compilers do. Thus there is a tradeoff between the flexibility of separate compilation (such as re-compiling only a small part of a large project with lengthy compile times, and the possibility to create device-side libraries) and the performance gain that usually results from function inlining (e.g. elimination of overhead due to the ABI calling convention, enabling additional optimizations such as constant propagation across function boundaries).
Function inlining within a single compilation unit is controlled by compiler heuristics that try to determine whether inlining is likely profitable in terms of performance (if possible at all). This means that not all functions may be inlined. Programmers can override the heuristics with the function attributes __forceinline__ and __noinline__.
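As a sketch of the header-based workaround the question describes (the file name, typedef and parameters mirror the question; this is just one possible layout), defining the function in the header with __forceinline__ lets every .cu file that includes it inline the call, so no ABI stack frame is needed at the call site:
// functions.hu -- sketch: the definition lives in the header so each including
// compilation unit can inline the call instead of emitting an ABI call.
__forceinline__ __device__ int foo(const uchar param0,
                                   const uchar *const param1,
                                   const unsigned short int param2,
                                   const unsigned short int param3,
                                   const uchar param4)
{
    return 1; // real code omitted, as in the question
}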

Related

Optimizing register usage in dot product

I'm developing a kernel function with several vector operations like scalar and vector products. The kernel uses a large number of registers, so occupancy is very low. I'm trying to reduce the number of registers used in order to improve occupancy.
Consider for example the following __device__ function performing a scalar product between two float3:
__device__ float dot(float3 in1, float3 in2) { return in1.x * in2.x + in1.y * in2.y + in1.z * in2.z; }
If I generate the .ptx file using
nvcc -ptx -gencode arch=compute_52,code=sm_52 -rdc=true simpleDot2.cu
(the file simpleDot2.cu contains only the definition of the __device__ function), I essentially obtain
// .globl _Z3dot6float3S_
.visible .func (.param .b32 func_retval0) _Z3dot6float3S_(
.param .align 4 .b8 _Z3dot6float3S__param_0[12],
.param .align 4 .b8 _Z3dot6float3S__param_1[12]
)
{
.reg .f32 %f<10>;
ld.param.f32 %f1, [_Z3dot6float3S__param_0+8];
ld.param.f32 %f2, [_Z3dot6float3S__param_0];
ld.param.f32 %f3, [_Z3dot6float3S__param_0+4];
ld.param.f32 %f4, [_Z3dot6float3S__param_1+8];
ld.param.f32 %f5, [_Z3dot6float3S__param_1];
ld.param.f32 %f6, [_Z3dot6float3S__param_1+4];
mul.f32 %f7, %f3, %f6;
fma.rn.f32 %f8, %f2, %f5, %f7;
fma.rn.f32 %f9, %f1, %f4, %f8;
st.param.f32 [func_retval0+0], %f9;
ret;
}
From the .ptx code, it seems that 9 registers are used, which perhaps could be lowered. I understand that the .ptx code is not the ultimate code executed by the GPU.
Question
Is there any chance to rearrange the register usage in the .ptx code, for example by recycling registers f1-f6, so as to reduce the overall number of registers occupied?
Thank you very much for any help.
TL;DR To first order, no.
PTX is both a virtual ISA and a compiler intermediate representation. The registers used in PTX code are virtual registers and there is no fixed relation to the physical registers of the GPU. The PTX code generated by the CUDA toolchain follows the SSA (static single assignment) convention. This means that every virtual register is written to exactly once. Stated differently: When an instruction produces a result it is assigned to a new register. This means that longer kernels may use thousands of registers.
In the CUDA toolchain, PTX code is compiled to machine code (SASS) by the ptxas component. Despite its name, ptxas is not an assembler but an optimizing compiler that can do loop unrolling, CSE (common subexpression elimination), and so on. Most importantly, ptxas is responsible for register allocation and instruction scheduling, plus all optimizations specific to a particular GPU architecture.
As a consequence, any examination of register usage issues needs to focus on the machine code, which can be extracted with cuobjdump --dump-sass. Furthermore, the programmer has very limited influence on the number of registers used, because ptxas uses numerous heuristics when determining register allocation, in particular to trade off register usage against performance: scheduling loads early tends to increase register pressure by extending live ranges, as does the creation of temporary variables during CSE or the creation of induction variables for strength reduction in loops.
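For example (the executable name here is hypothetical), the machine code of a built application can be inspected with:
cuobjdump --dump-sass my_app
which prints the SASS for each kernel contained in the binary.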
Modern versions of CUDA that target compute capability 3.0 and higher usually make excellent choices when determining these trade-offs, and it is rarely necessary for programmers to consider register pressure. It is not clear what motivates the asker's question in this regard.
The documented mechanisms in CUDA to control maximum register usage are the -maxrregcount command-line flag of nvcc, which applies to an entire compilation unit, and the __launch_bounds__ attribute that allows control on a per-kernel basis. See the CUDA documentation for details. Beyond that, one can try to influence register usage by choosing the ptxas optimization level with -Xptxas -O{1|2|3} (the default is -O3), by re-arranging source code, or by using compiler flags that tend to simplify the generated code, such as -use_fast_math.
Of course such indirect methods could have numerous other effects that are generally unpredictable, and any desirable result achieved will be "brittle", e.g. easily destroyed by changing to a new version of the toolchain.
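As a hedged illustration of the per-kernel mechanism mentioned above (the kernel name and bounds are hypothetical, not taken from the question):
// Sketch: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// promises ptxas that this kernel is never launched with more than 256 threads
// per block and asks for at least 4 resident blocks per multiprocessor, which
// caps the number of registers ptxas may allocate per thread.
__global__ void __launch_bounds__(256, 4) scale(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= s;
}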

Can my kernel code tell how much shared memory it has available?

Is it possible for running device-side CUDA code to know how much (static and/or dynamic) shared memory is allocated to each block of the running kernel's grid?
On the host side, you know how much shared memory a launched kernel had (or will have), since you set that value yourself; but what about the device side? It's easy to compile in the upper limit to that size, but that information is not available (unless passed explicitly) to the device. Is there an on-GPU mechanism for obtaining it? The CUDA C Programming Guide doesn't seem to discuss this issue (in or outside of the section on shared memory).
TL;DR: Yes. Use the function below.
It is possible: That information is available to the kernel code in special registers: %dynamic_smem_size and %total_smem_size.
Typically, when we write kernel code, we don't need to be aware of specific registers (special or otherwise) - we write C/C++ code. Even when we do use these registers, the CUDA compiler hides this from us through functions or structures which hold their values. For example, when we use the value threadIdx.x, we are actually accessing the special register %tid.x, which is set differently for every thread in the block. You can see these registers "in action" when you look at compiled PTX code. ArrayFire have written a nice blog post with some worked examples: Demystifying PTX code.
But if the CUDA compiler "hides" register use from us, how can we go behind that curtain and actually insist on using them, accessing them with those %-prefixed names? Well, here's how:
__forceinline__ __device__ unsigned dynamic_smem_size()
{
    unsigned ret;
    // Read the %dynamic_smem_size special register; "%%" emits a literal '%'
    // in the generated PTX (a single '%' would be taken as an operand marker).
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(ret));
    return ret;
}
and a similar function for %total_smem_size. This function makes the compiler add an explicit PTX instruction, just like asm can be used for host code to emit CPU assembly instructions directly. This function should always be inlined, so when you assign
x = dynamic_smem_size();
you actually just assign the value of the special register to x.
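The companion function for %total_smem_size that the answer mentions could look like this (a sketch following the same pattern, just reading a different special register):
// Sketch: reads the %total_smem_size special register, i.e. the total
// (static + dynamic) shared memory allocated per block.
__forceinline__ __device__ unsigned total_smem_size()
{
    unsigned ret;
    asm volatile ("mov.u32 %0, %%total_smem_size;" : "=r"(ret));
    return ret;
}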

How to redefine malloc/free in CUDA?

I want to redefine malloc() and free() in my code, but when I run, two errors appear:
allowing all exceptions is incompatible with previous function "malloc";
allowing all exceptions is incompatible with previous function "free";
Then I searched for this error; it seems CUDA doesn't allow us to redefine library functions. Is this true? If we can't redefine those functions, how can I resolve the error?
The very short answer is that you cannot.
malloc is fundamentally a C++ standard library function which the CUDA toolchain internally overloads with a device hook in device code. Attempting to define your own device version of malloc or free can and will break the toolchain's internals. Exactly how depends on platform and compiler.
In your previous question on this, you had code like this:
__device__ void* malloc(size_t t)
{ return theHeap.alloc(t); }
__device__ void free(void* p)
{ theHeap.dealloc(p); }
Because of existing standard library requirements, malloc and free must be defined as __device__ __host__ functions at global namespace scope. It is illegal in CUDA to have separate __device__ and __host__ definitions of the same function. You could probably get around this restriction by using a private namespace for the custom allocator, or by using different function names. But don't try to redefine anything from the standard library in device or host code. It will break things.
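A minimal sketch of the namespace workaround suggested above (the namespace and function names are hypothetical; theHeap is the allocator object from the asker's earlier code):
// Sketch: keep the custom allocator out of the global namespace so it cannot
// collide with the built-in device-side malloc/free.
namespace my_heap {
    __device__ void* alloc(size_t n)  { return theHeap.alloc(n); }
    __device__ void  dealloc(void* p) { theHeap.dealloc(p); }
}
Call my_heap::alloc / my_heap::dealloc explicitly wherever the custom heap is wanted, and leave the standard names alone.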

PTX - where are .reg registers located?

When I use .reg to declare registers.. where are they?
I mean: if I use .reg inside a device function, the registers are stored in the register file that each thread has... but what if I declare a .reg variable at module scope, i.e. in the global scope (not .global, simply global scope)?
Any .reg declaration winds up in the PTX register state space. How that maps to hardware features is determined by the assembler, but the usual rules of register or local memory hold true. You should be aware that register state space declarations at module scope are only supported in PTX 1.x and 2.x code and can't be used with the CUDA ABI. The PTX documentation notes:
Registers differ from the other state spaces in that they are not fully addressable, i.e., it is not possible to refer to the address of a register. When compiling to use the Application Binary Interface (ABI), register variables are restricted to function scope and may not be declared at module scope. When compiling legacy PTX code (ISA versions prior to 3.0) containing module-scoped .reg variables, the compiler silently disables use of the ABI.
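As a hand-written sketch (not actual compiler output; the function name and body are made up), a function-scoped .reg declaration like the one below is ABI-compatible, whereas hoisting that same .reg line to module scope would only be accepted as legacy pre-3.0 PTX, with the ABI silently disabled:
.visible .func (.param .b32 func_retval0) plus_one(
.param .b32 plus_one_param_0
)
{
.reg .f32 %f<3>;                      // function-scope register declaration (virtual registers)
ld.param.f32 %f1, [plus_one_param_0];
add.f32 %f2, %f1, 0f3F800000;         // add 1.0f (hex single-precision immediate)
st.param.f32 [func_retval0+0], %f2;
ret;
}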
