Is there a guideline about register and local memory in CUDA programming? [duplicate] - cuda

This question already has answers here:
In a CUDA kernel, how do I store an array in "local thread memory"?
The number of registers is limited on GPUs, e.g. on an A100 each thread cannot use more than 255 registers.
But in my tests, even when staying under 255, the compiler uses local memory instead of registers. Is there a more detailed guideline on how to keep my data in registers, and on when it will be placed in local memory?
I tried defining a local array in my kernel. It looks like the array length affects the compiler's behavior:
template<int len>
__global__ void test() {
    // ...
    float arr[len];
    // ...
}

Local arrays are placed in local memory if they are not accessed with compile-time constant indices.
This is described in the Programming Guide Section 5.3.2 Paragraph Local Memory. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
Local memory accesses only occur for some automatic variables as mentioned in Variable Memory Space Specifiers. Automatic variables that the compiler is likely to place in local memory are:
Arrays for which it cannot determine that they are indexed with constant quantities,
Large structures or arrays that would consume too much register space,
Any variable if the kernel uses more registers than available (this is also known as register spilling).
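As a rough sketch of the distinction above, the two kernels below differ only in how the array is indexed. In the first, the fully unrolled loop makes every index a compile-time constant, so the compiler can keep arr in registers; the runtime index in the second typically forces arr into local memory, since registers cannot be addressed indirectly. (Kernel names are illustrative; whether spilling actually occurs depends on the compiler version and overall register pressure.)

```cuda
// Indices known at compile time after full unrolling: arr can live in registers.
__global__ void constant_index(float *out) {
    float arr[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)   // fully unrolled, so i is a constant in each copy
        arr[i] = i * 2.0f;
    out[threadIdx.x] = arr[0] + arr[3];
}

// Index depends on a runtime value: the compiler will usually place arr
// in local memory, because registers cannot be indexed dynamically.
__global__ void dynamic_index(const int *idx, float *out) {
    float arr[4];
    for (int i = 0; i < 4; ++i)
        arr[i] = i * 2.0f;
    out[threadIdx.x] = arr[idx[threadIdx.x] & 3];  // runtime index
}
```

Compiling with nvcc -Xptxas=-v prints per-kernel register, spill, and local-memory usage, which is the easiest way to check what the compiler actually did.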

Related

How is stack frame managed within a thread in Cuda?

Suppose we have a kernel that invokes some functions, for instance:
__device__ int fib(int n) {
    if (n == 0 || n == 1) {
        return n;
    } else {
        int x = fib(n-1);
        int y = fib(n-2);
        return x + y;
    }
}

__global__ void fib_kernel(int *n, int *ret) {
    *ret = fib(*n);
}
The kernel fib_kernel will invoke the function fib(), which internally will invoke fib() twice. Suppose the GPU has 80 SMs and we launch exactly 80 threads to do the computation, passing in n as 10. I am aware that there will be a ton of duplicated computation, which violates the idea of data parallelism, but I would like to better understand the stack management of the thread.
The CUDA PTX documentation states the following:
the GPU maintains execution state per thread, including a program counter and call stack
The stack resides in local memory. As the threads execute the kernel, do they behave just like the calling convention on a CPU? In other words, is it true that for each thread, the corresponding stack grows and shrinks dynamically?
The stack of each thread is private and not accessible by other threads. Is there a way I can manually instrument the compiler/driver, so that the stack is allocated in global memory rather than local memory?
Is there a way that allows threads to obtain the current program counter and frame pointer values? I think they are stored in some specific registers, but the PTX documentation does not provide a way to access them. May I know what I would have to modify (e.g. the driver or the compiler) to be able to obtain those registers?
If we increase the input to fib(n) to 10000, it is likely to cause a stack overflow. Is there a way to deal with it? The answer to question 2 might be able to address this. Any other thoughts would be appreciated.
You'll get a somewhat better idea of how these things work if you study the generated SASS code from a few examples.
As the threads execute the kernel, do they behave just like the calling convention on a CPU? In other words, is it true that for each thread, the corresponding stack grows and shrinks dynamically?
The CUDA compiler will aggressively inline functions when it can. When it can't, it builds a stack-like structure in local memory. However, the GPU instructions I'm aware of don't include explicit stack management (e.g. push and pop), so the "stack" is "built by the compiler" using registers that hold a (local) address and LD/ST instructions to move data to/from the "stack" space. In that sense, the actual stack can change in size dynamically; however, the maximum allowable stack space is limited. Each thread has its own stack, using the definition of "stack" given here.
Is there a way that I can manually instrument the compiler/driver, so that the stack is allocated in global memory, no longer in local memory?
Practically, no. The NVIDIA compiler that generates instructions has a front end and a back end that are closed source. If you want to modify an open-source compiler for the GPUs it might be possible, but at the moment there are no widely recognized toolchains I am aware of that don't use the closed-source back end (ptxas or its driver equivalent). The GPU driver is also largely closed source. There aren't any exposed controls that would affect the location of the stack, either.
May I know what I have to modify (e.g. the driver or the compiler) to be able to obtain those registers?
There is no published register for the instruction pointer/program counter. Therefore it's impossible to state what modifications would be needed.
If we increase the input to fib(n) to 10000, it is likely to cause a stack overflow. Is there a way to deal with it?
As I mentioned, the maximum stack-space per thread is limited, so your observation is correct, eventually a stack could grow to exceed the available space (and this is a possible hazard for recursion in CUDA device code). The provided mechanism to address this is to increase the per-thread local memory size (since the stack exists in the logical local space).
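Concretely, the per-thread stack limit can be raised from the host with cudaDeviceSetLimit before launching a deeply recursive kernel. A minimal sketch (the 64 KB figure is just an illustrative choice):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t stack_size = 0;
    cudaDeviceGetLimit(&stack_size, cudaLimitStackSize);   // query the default (typically small)
    printf("default per-thread stack: %zu bytes\n", stack_size);

    // Raise the per-thread stack limit before launching deeply recursive kernels.
    cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);

    cudaDeviceGetLimit(&stack_size, cudaLimitStackSize);
    printf("new per-thread stack: %zu bytes\n", stack_size);
    return 0;
}
```

Note that this reserves the larger stack for every resident thread, so memory cost multiplies quickly; rewriting the recursion iteratively (straightforward for Fibonacci) sidesteps stack growth entirely.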

When passing parameter by value to kernel function, where are parameters copied?

I'm beginner at CUDA programming and have a question.
When I pass parameters by value, like this:
__global__ void add(int a, int b, int *c) {
// some operations
}
Since variables a and b are passed to the kernel function add by value, I guessed some memory space would be needed for the copies.
If I'm right, is that additional memory space where the parameters are copied
on the GPU or in the host's main memory?
The reason I wonder about this is that I need to pass a big struct to a kernel function.
I also thought about passing a pointer to the struct, but that way seems to require calling cudaMalloc for the struct and for each member variable.
The very short answer is that all arguments to CUDA kernels are passed by value, and those arguments are copied by the host via an API into a dedicated argument buffer on the GPU. At present, this buffer is stored in constant memory and there is a limit of 4KB of arguments per kernel launch -- see here.
In more detail, the PTX standard (technically since compute capability 2.0 hardware and the CUDA ABI appeared) defines a dedicated logical state space called .param which holds kernel and device function parameter arguments. See here. Quoting from that documentation:
Each kernel function definition includes an optional list of
parameters. These parameters are addressable, read-only variables
declared in the .param state space. Values passed from the host to the
kernel are accessed through these parameter variables using ld.param
instructions. The kernel parameter variables are shared across all
CTAs within a grid.
It further notes that:
Note: The location of parameter space is implementation specific. For example, in some implementations kernel parameters reside in
global memory. No access protection is provided between parameter and
global space in this case. Similarly, function parameters are mapped
to parameter passing registers and/or stack locations based on the
function calling conventions of the Application Binary Interface
(ABI).
So the precise location of the parameter state space is implementation specific. In the first iteration of CUDA hardware, it actually mapped to shared memory for kernel arguments and registers for device function arguments. However, since compute 2.0 hardware and the PTX 2.2 standard, it maps to constant memory for kernels under most circumstances. The documentation says the following on the matter:
The constant (.const) state space is a read-only memory initialized
by the host. Constant memory is accessed with a ld.const
instruction. Constant memory is restricted in size, currently limited
to 64 KB which can be used to hold statically-sized constant
variables. There is an additional 640 KB of constant memory,
organized as ten independent 64 KB regions. The driver may allocate
and initialize constant buffers in these regions and pass pointers to
the buffers as kernel function parameters. Since the ten regions are
not contiguous, the driver must ensure that constant buffers are
allocated so that each buffer fits entirely within a 64 KB region and
does not span a region boundary.
Statically-sized constant variables have an optional variable initializer; constant variables with no explicit initializer are
initialized to zero by default. Constant buffers allocated by the
driver are initialized by the host, and pointers to such buffers are
passed to the kernel as parameters.
[Emphasis mine]
So while kernel arguments are stored in constant memory, this is not the same constant memory which maps to the .const state space accessible by defining a variable as __constant__ in CUDA C or the equivalent in Fortran or Python. Rather, it is an internal pool of device memory managed by the driver and not directly accessible to the programmer.
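In practical terms, this means a struct can simply be passed by value like any other kernel argument, with no cudaMalloc for the struct itself; only pointer members need device allocations. A sketch under those assumptions (BigParams and its fields are hypothetical, and the total argument size must stay within the per-launch limit):

```cuda
#include <cuda_runtime.h>

// Hypothetical parameter block; passed by value, it is copied into the
// driver-managed constant-memory argument buffer at launch time.
struct BigParams {
    float scale;
    float offset;
    int   n;
    float *data;   // pointer members must still point to device memory
};

__global__ void apply(BigParams p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n)
        p.data[i] = p.data[i] * p.scale + p.offset;
}

void launch(float *d_data, int n) {
    BigParams p{2.0f, 1.0f, n, d_data};  // ordinary host-side struct
    apply<<<(n + 255) / 256, 256>>>(p);  // struct copied by value at launch
}
```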

Cuda optim for maxwell

I am trying to understand the Parallel Forall post on instruction-level profiling, and especially the following lines in the section Reducing Memory Dependency Stalls:
NVIDIA GPUs do not have indexed register files, so if a stack array is accessed with dynamic indices, the compiler must allocate the array in local memory. In the Maxwell architecture, local memory stores are not cached in L1 and hence the latency of local memory loads after stores is significant.
I understand what register files are, but what does it mean that they are not indexed? And why does that prevent the compiler from keeping a stack array accessed with dynamic indices in registers?
The quote says that the array will be stored in local memory. What block does this local memory correspond to in the architecture below?
... what does it mean that they are not indexed
It means that indirect addressing of registers is not supported. So it isn't possible to index from one register (theoretically the register holding the first element of an array) to another arbitrary register. As a result the compiler can't generate code for non-static indexing of an array stored in registers.
What block does this local memory correspond to in the architecture below?
It doesn't correspond to any of them. Local memory resides in the device's DRAM, not in any of the on-chip blocks shown in the diagram.

Utilizing Register memory in CUDA

I have some questions regarding CUDA register memory.
1) Is there any way to free registers in a CUDA kernel? I have variables and 1D and 2D arrays in registers (max array size 48).
2) If I use device functions, then what happens to the registers used in a device function after its execution? Will they be available for the calling kernel or for other device functions?
3) How does nvcc optimize register usage? Please share the points important w.r.t. optimizing a memory-intensive kernel.
PS: I have a complex algorithm to port to CUDA which takes a lot of registers for computation, and I am trying to figure out whether to store intermediate data in registers and write one kernel, or store it in global memory and break the algorithm into multiple kernels.
Only local variables are eligible to reside in registers (see also Declaring Variables in a CUDA kernel). You don't have direct control over which variables (scalar or static array) will reside in registers. The compiler makes its own choices, striving for performance while sparing registers.
Register usage can be limited using the -maxrregcount option of the nvcc compiler.
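Besides the compiler-wide -maxrregcount flag, register pressure can also be bounded per kernel with the __launch_bounds__ qualifier. A minimal sketch (kernel name and launch configuration are illustrative):

```cuda
// Telling the compiler this kernel launches with at most 256 threads per block
// and should fit at least 2 blocks per SM lets it budget registers accordingly
// (possibly spilling to local memory if the budget is too tight).
__global__ void __launch_bounds__(256, 2) my_kernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}
```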
You can also put most small 1D and 2D arrays in shared memory or, if you are accessing constant data, put that content into constant memory (which is cached on-chip, close to the registers, like L1 cache content).
Another way of reducing register usage when dealing with compute-bound kernels in CUDA is to process data in stages, using multiple __global__ kernel calls and storing intermediate results in global memory. Each kernel will use far fewer registers, so that more active threads per SM will be able to hide load/store data movements. This technique, in combination with proper usage of streams and asynchronous data transfers, is very successful most of the time.
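The staging idea above can be sketched as two simpler kernels communicating through a global-memory scratch buffer (names and the computations are placeholders):

```cuda
// Stage 1: compute an intermediate result with a modest register footprint.
__global__ void stage1(const float *in, float *scratch, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) scratch[i] = in[i] * in[i];          // placeholder computation
}

// Stage 2: consume the intermediate result. Each kernel alone needs fewer
// registers than a fused version would, which can improve occupancy.
__global__ void stage2(const float *scratch, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = scratch[i] + 1.0f;          // placeholder computation
}
```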
Regarding the use of device functions, I'm not sure, but I guess the register contents of the calling function will be moved/stored into local memory, in the same way register spilling occurs when using too many local variables (see CUDA Programming Guide -> Device Memory Accesses -> Local Memory). This frees up some registers for the called device function. After the device function completes, its local variables no longer exist, and the registers can be used again by the caller and refilled with the previously saved contents.
Keep in mind that small device functions defined in the same source code as the global kernel can be inlined by the compiler for performance reasons: when this happens, the resulting kernel will in general require more registers.

Is local memory access coalesced?

Suppose, I declare a local variable in a CUDA kernel function for each thread:
float f = ...; // some calculations here
Suppose also that the declared variable was placed by the compiler in local memory (which is the same as global memory except that it is visible to one thread only, as far as I know). My question is: will the access to f be coalesced when reading it?
I don't believe there is official documentation of how local memory (or the stack on Fermi) is laid out in memory, but I am pretty certain that multiprocessor allocations are accessed in a "striped" fashion so that non-diverging threads in the same warp will get coalesced access to local memory. On Fermi, local memory is also cached using the same L1/L2 access mechanism as global memory.
CUDA GPUs keep local variables in registers whenever they can. Complex kernels with lots of variables reduce the number of threads that can run concurrently, a condition known as low occupancy; if register demand grows too high, variables spill into local memory instead.