how does cuda 4.0 support recursion - cuda

I'm wondering: does CUDA 4.0 support recursion using local memory or shared memory? I have to maintain a stack in global memory myself, because the system-level recursion can't support my program (probably too many levels of recursion). When the recursion gets deeper, the threads stop working.
So I really want to know how the default recursion works in CUDA: does it use local memory or shared memory? Thanks!

Use of recursion requires the use of the ABI, which requires architecture >= sm_20. The ABI has a function calling convention that includes the use of a stack frame. The stack frame is allocated in local memory ("local" means "thread-local", that is, storage private to a thread). Please refer to the CUDA C Programming Guide for basic information on CUDA memory spaces. In addition, you may want to have a look at this previous question: Where does CUDA allocate the stack frame for kernels?
For deeply recursive functions it is possible to exceed the default stack size. For example, on my current system the default stack size is 1024 bytes. You can retrieve the current stack size via the CUDA API function cudaDeviceGetLimit(). You can adjust the stack size via the CUDA API function cudaDeviceSetLimit():
cudaError_t stat;
size_t myStackSize = [your preferred stack size];
stat = cudaDeviceSetLimit(cudaLimitStackSize, myStackSize);
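For reference, a minimal, self-contained sketch of querying the limit and then raising it (the 8 KB value is just an illustrative choice):
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t stackSize = 0;
    // Query the current per-thread stack size (1024 bytes by default on many systems)
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("default per-thread stack size: %zu bytes\n", stackSize);

    // Raise it, e.g. to 8 KB per thread (illustrative value)
    cudaError_t stat = cudaDeviceSetLimit(cudaLimitStackSize, 8 * 1024);
    if (stat != cudaSuccess) {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(stat));
    }

    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("current per-thread stack size: %zu bytes\n", stackSize);
    return 0;
}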
Note that the total amount of memory needed for stack frames is at least the per-thread size multiplied by the number of threads specified in the kernel launch. Often it can be larger due to allocation granularity. So increasing the stack size can eat up memory pretty quickly, and you may find that a deeply recursive function requires more local memory than can be allocated on your GPU.
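For example (numbers purely illustrative), raising the stack to 8 KB per thread for a launch of 1024 blocks of 256 threads reserves at least 8 KB x 262,144 threads = 2 GB of device memory for stacks alone.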
While recursion is supported on modern GPUs, its use can lead to code with fairly low performance due to function call overhead, so you may want to check whether there is an iterative version of the algorithm you are implementing that may be better suited to the GPU.

Related

How is stack frame managed within a thread in Cuda?

Suppose we have a kernel that invokes some functions, for instance:
__device__ int fib(int n) {
    if (n == 0 || n == 1) {
        return n;
    } else {
        int x = fib(n-1);
        int y = fib(n-2);
        return x + y;
    }
    return -1;
}

__global__ void fib_kernel(int* n, int *ret) {
    *ret = fib(*n);
}
The kernel fib_kernel will invoke the function fib(), which internally will invoke two fib() functions. Suppose the GPU has 80 SMs, we launch exactly 80 threads to do the computation, and pass in n as 10. I am aware that there will be a ton of duplicated computations which violates the idea of data parallelism, but I would like to better understand the stack management of the thread.
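For concreteness, here is a host-side sketch of that launch (the 80 blocks of 1 thread each and the host boilerplate are my own illustrative choices; it assumes it is compiled together with the fib code above):
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int h_n = 10, h_ret = 0;
    int *d_n, *d_ret;
    cudaMalloc((void **)&d_n, sizeof(int));
    cudaMalloc((void **)&d_ret, sizeof(int));
    cudaMemcpy(d_n, &h_n, sizeof(int), cudaMemcpyHostToDevice);

    // 80 blocks of 1 thread each -- one thread per SM in the scenario above;
    // every thread redundantly computes fib(10) and writes the same result to *ret.
    fib_kernel<<<80, 1>>>(d_n, d_ret);

    cudaMemcpy(&h_ret, d_ret, sizeof(int), cudaMemcpyDeviceToHost);
    printf("fib(%d) = %d\n", h_n, h_ret);
    cudaFree(d_n);
    cudaFree(d_ret);
    return 0;
}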
The CUDA PTX documentation states the following:
the GPU maintains execution state per thread, including a program counter and call stack
The stack resides in local memory. As the threads execute the kernel, do they behave just like the calling convention on a CPU? In other words, is it true that for each thread, the corresponding stack will grow and shrink dynamically?
The stack of each thread is private and is not accessible by other threads. Is there a way that I can manually instrument the compiler/driver, so that the stack is allocated in global memory, no longer in local memory?
Is there a way that allows threads to obtain the current program counter and frame pointer values? I think they are stored in some specific registers, but the PTX documentation does not provide a way to access those. May I know what I have to modify (e.g. the driver or the compiler) to be able to obtain those registers?
If we increase the input to fib(n) to 10000, it is likely to cause a stack overflow. Is there a way to deal with it? The answer to question 2 might be able to address this. Any other thoughts would be appreciated.
You'll get a somewhat better idea of how these things work if you study the generated SASS code from a few examples.
As the threads execute the kernel, do they behave just like the calling convention on a CPU? In other words, is it true that for each thread, the corresponding stack will grow and shrink dynamically?
The CUDA compiler will aggressively inline functions when it can. When it can't, it builds a stack-like structure in local memory. However, the GPU instructions I'm aware of don't include explicit stack management (push and pop, for example), so the "stack" is "built by the compiler" with the use of registers that hold a (local) address and LD/ST instructions to move data to/from the "stack" space. In that sense, the actual stack can and does change size dynamically, although the maximum allowable stack space is limited. Each thread has its own stack, using the definition of "stack" given here.
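As an aside (the exact commands are my own suggestion, not part of the original answer), you can see this machinery directly by compiling with nvcc -cubin and disassembling the result with cuobjdump -sass: the local-memory loads and stores that implement the call stack show up in the SASS listing.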
Is there a way that I can manually instrument the compiler/driver, so that the stack is allocated in global memory, no longer in local memory?
Practically, no. The NVIDIA compiler that generates instructions has a front end and a back end that is closed source. If you want to modify an open-source compiler for the GPUs it might be possible, but at the moment there are no widely recognized tool chains that I am aware of that don't use the closed-source back end (ptxas or its driver equivalent). The GPU driver is also largely closed source. There aren't any exposed controls that would affect the location of the stack, either.
May I know what I have to modify (e.g. the driver or the compiler) to be able to obtain those registers?
There is no published register for the instruction pointer/program counter. Therefore it's impossible to state what modifications would be needed.
If we increase the input to fib(n) to 10000, it is likely to cause a stack overflow. Is there a way to deal with it?
As I mentioned, the maximum stack space per thread is limited, so your observation is correct: eventually a stack could grow to exceed the available space (and this is a possible hazard for recursion in CUDA device code). The provided mechanism to address this is to increase the per-thread local memory size (since the stack exists in the logical local space).
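One way to avoid the problem entirely (my own sketch, not part of the original answer) is to rewrite the function iteratively, so that only a constant amount of local state is needed and the stack never grows with n. The result still overflows int for large n, so this only illustrates the control flow:
// Iterative replacement for the recursive fib(): a constant number of local
// variables instead of O(n) stack frames, so the stack limit is irrelevant.
__device__ int fib_iter(int n) {
    int a = 0, b = 1;           // fib(0), fib(1)
    for (int i = 0; i < n; ++i) {
        int next = a + b;       // overflows int for large n
        a = b;
        b = next;
    }
    return a;                   // a == fib(n)
}

__global__ void fib_iter_kernel(int* n, int *ret) {
    *ret = fib_iter(*n);
}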

I don't know how to export a matrix from CUDA [duplicate]

I am a newbie in CUDA programming and in the process of rewriting a C code into new, parallelized CUDA code.
Is there a way to write output data files directly from the device without bothering to copy arrays from device to host? I assume that if cuPrintf exists, there must be a way to write a cuFprintf?
Sorry, if the answer has already been given in a previous topic, I can't seem to find it...
Thanks!
The short answer is: no, there is not.
cuPrintf and the built-in printf support in the Fermi and Kepler runtime are implemented using device-to-host copies. The mechanism is no different from using cudaMemcpy to transfer a buffer to the host yourself.
Just about all CUDA compatible GPUs support so-called zero-copy (AKA "pinned, mapped") memory, which allows the GPU to map a host buffer into its address space and execute DMA transfers into that mapped host memory. Note, however, that setup and initialisation of mapped memory has considerably higher overhead than conventional memory allocation (so you really need a lot of transactions to amortise that overhead over the life of your application), and that the CUDA driver can't use zero-copy with anything other than addresses backed by physical memory. So you can't mmap a file and use zero-copy on it, i.e. you will still need explicit host-side file I/O code to get from a zero-copy buffer to disk.
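To make the zero-copy route concrete, here is a minimal sketch (the kernel, sizes, and file name are my own illustrative choices): the kernel writes straight into a mapped host buffer, and ordinary host file I/O then puts it on disk.
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that writes its results directly into mapped host memory.
__global__ void fill(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

int main() {
    const int n = 1024;
    float *h_buf = 0, *d_buf = 0;

    cudaSetDeviceFlags(cudaDeviceMapHost);              // enable mapped (zero-copy) memory
    cudaHostAlloc((void **)&h_buf, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0); // device alias of the host buffer

    fill<<<(n + 255) / 256, 256>>>(d_buf, n);           // writes travel over PCI-E into h_buf
    cudaDeviceSynchronize();

    FILE *f = fopen("out.txt", "w");                    // plain host-side file I/O, no cudaMemcpy
    for (int i = 0; i < n; ++i) fprintf(f, "%f\n", h_buf[i]);
    fclose(f);

    cudaFreeHost(h_buf);
    return 0;
}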

CUDA malloc, mmap/mremap

CUDA device memory can be allocated using cudaMalloc/cudaFree, sure. This is fine, but primitive.
I'm curious to know, is device memory virtualised in some way? Are there equivalent operations to mmap, and more importantly, mremap for device memory?
If device memory is virtualised, I expect these sorts of functions should exist. It seems modern GPU drivers implement paging when there is contention for limited video resources by multiple processes, which suggests it's virtualised in some way or another...
Does anyone know where I can read more about this?
Edit:
Okay, my question was a bit general. I've read the bits of the manual that talk about mapping system memory for device access. I was more interested in device-allocated memory however.
Specific questions:
- Is there any possible way to remap device memory? (ie, to grow a device allocation)
- Is it possible to map device allocated memory to system memory?
- Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
I have cases where the memory is used by the GPU 99% of the time; so it should be device-local, but it may be convenient to map device memory to system memory for occasional structured read-back without having to implement an awkward deep-copy.
Yes, unified memory exists, however I'm happy with explicit allocation, save for the odd moment when I want a sneaky read-back.
I've found the manual fairly light on detail in general.
CUDA comes with a fine CUDA C Programming Guide as its main manual, which has sections on Mapped Memory as well as Unified Memory Programming.
Responding to your additional posted questions, and following your cue to leave UM out of the consideration:
Is there any possible way to remap device memory? (ie, to grow a device allocation)
There is no direct method. You would have to manually create a new allocation of the desired size, copy the old data to it, then free the old allocation. If you expect to do this a lot, and don't mind the significant overhead associated with it, you could take a look at thrust device vectors, which hide some of the manual labor and allow you to resize an allocation in a single vector-style .resize() operation. There's no magic, however: thrust is just a template library built on top of CUDA C (for the CUDA device backend), so it is going to do a sequence of cudaMalloc and cudaFree operations, just as you would "manually".
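A minimal sketch of that thrust approach (the sizes are illustrative):
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

int main() {
    thrust::device_vector<float> v(1 << 20);   // initial device allocation: ~1M floats
    thrust::sequence(v.begin(), v.end());      // fill with 0, 1, 2, ...

    // "Grow" the allocation: behind the scenes this is a new cudaMalloc,
    // a device-to-device copy of the old contents, and a cudaFree.
    v.resize(2 << 20);
    return 0;
}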
Is it possible to map device allocated memory to system memory?
Leaving aside UM, no. Device memory cannot be mapped into host address space.
Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
No, host-mapped data is never duplicated in device memory, and apart from L2 caching, mapped data needed by the GPU will always be fetched across the PCI-E bus.

Programming CUDA architecture

While programming on the CUDA architecture I ran into a problem: device resources are too limited. In other words, the stack and heap are too small.
While researching this, I found the function
cudaDeviceSetLimit(cudaLimitStackSize, limit_stack)
that enlarges the stack size, and a similar one for the heap. However, their sizes are still too limited.
I wonder how can I store more information on the device?
The stack and heap are provided for convenience. However, you may allocate memory on the device using cudaMalloc if your GPU is recent enough. In that case, the limit is the GPU's on-board memory.
Should you want more, you would need a custom memory allocator managing a large block of system memory and sharing it with the GPU (see cudaHostRegister). Then, the limit would be your system memory.
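A rough sketch of that approach (the buffer size and names are my own; some platforms may want a page-aligned allocation):
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);                    // allow mapped host memory

    // Back a large buffer with system memory and make it GPU-accessible.
    const size_t bytes = (size_t)1 << 30;                     // e.g. 1 GB of host memory
    void *h_buf = malloc(bytes);

    cudaHostRegister(h_buf, bytes, cudaHostRegisterMapped);   // pin + map for the GPU

    void *d_alias = 0;
    cudaHostGetDevicePointer(&d_alias, h_buf, 0);             // pointer usable from kernels

    // ... launch kernels that read/write d_alias; accesses cross the PCI-E bus ...

    cudaHostUnregister(h_buf);
    free(h_buf);
    return 0;
}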

How to adjust the cuda number of block and of thread to get optimal performances

I've tested empirically with several values of blocks and threads, and the execution time can be greatly reduced with specific values.
I don't see what the differences between blocks and threads are. I figure that threads in a block may have specific cache memory, but it's quite fuzzy for me. For the moment, I parallelize my functions in N parts, which are allocated on blocks/threads.
My goal would be to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Could it be possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
When using shared memory you might want to consider it first, because it's a very limited resource, and it's not unlikely for kernels to have very specific needs that constrain the many variables controlling parallelism. You either have blocks with many threads sharing larger regions, or blocks with fewer threads sharing smaller regions (under constant occupancy).
If your code can live with as little as 16 KB of shared memory per multiprocessor, you might want to opt for larger (48 KB) L1 caches by calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, L1-caches can be disabled for non-local global access using the compiler option -Xptxas=-dlcm=cg to avoid pollution when the kernel accesses global memory carefully.
Before worrying about optimal performance based on occupancy you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or that appropriate optimization options are given; read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs: register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
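For reference, launch bounds are expressed directly on the kernel; a sketch under assumed numbers (the kernel name and values are my own), with --ptxas-options=-v showing the resulting register and spill counts at compile time:
// Constrain the compiler's register allocation for this kernel: at most
// 256 threads per block, aiming for at least 4 resident blocks per SM.
__global__ void
__launch_bounds__(256, 4)
scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}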
It's theoretically possible to find optimal values from within an application; however, having the client code adjust optimally to both different devices and launch parameters can be nontrivial, and it will require recompilation or different variants of the kernel to be deployed for every target device architecture.
I believe automatically adjusting the block and thread sizes is a highly difficult problem. If it were easy, CUDA would most probably have this feature for you.
The reason is that the optimal configuration depends on the implementation and the kind of algorithm you are implementing. It requires profiling and experimenting to get the best performance.
Here are some factors you can consider:
- Register usage in your kernel.
- Occupancy of your current implementation.
Note: having more threads does not equate to best performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
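That said, newer CUDA releases (6.5 and later, so possibly newer than what this thread originally targeted) offer an occupancy API that suggests a block size automatically; a sketch with a hypothetical kernel:
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {       // hypothetical kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launch(float *d_data, int n) {
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for a block size that maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;   // enough blocks to cover n elements
    scale<<<gridSize, blockSize>>>(d_data, n);
}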
I have quite a good answer here; in a word, computing the optimal distribution over blocks and threads is a difficult problem.