The manuals for cuda-gdb and cuda-memcheck both mention CUDA_EXCEPTION_9, "Warp Hardware Stack Overflow", but I have not been able to find further details; the only comment given in both manuals is
"This occurs when any thread in a warp triggers a hardware stack overflow. This should be a rare occurrence."
In my case the exception occurs only sometimes when I try to dynamically allocate memory on the device via malloc(), even when processing the same set of data. Calling malloc() with 0 bytes (a bug that has since been fixed) reliably triggered the same exception.
What precisely causes this exception, and under which circumstances? What does it indicate, and how can one fix or work around it?
Thank you very much
A stack overflow on a Fermi GPU is no different from a stack overflow on any other device. Each thread gets a static stack and heap allocation from global memory at launch. If you exhaust the stack via excessive recursion, allocate more than the available heap memory, or operate out of bounds on any variable stored in heap memory, a protection fault is generated and a stack overflow error is reported. From your question, I would guess that you are exhausting the available per-thread heap space via device-side malloc calls.
The CUDA runtime API includes functions for managing stack and heap memory, cudaDeviceSetLimit and cudaDeviceGetLimit. With these you can check how much stack, heap, and printf FIFO space each thread is being given by the runtime, and try increasing the heap and stack sizes to see whether your problem goes away.
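A minimal sketch of checking and raising those limits from the host (the sizes used below are arbitrary placeholders, not recommendations):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stackSize = 0, heapSize = 0, fifoSize = 0;

    // Query the current per-thread stack size, device malloc heap size and printf FIFO size.
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    cudaDeviceGetLimit(&heapSize,  cudaLimitMallocHeapSize);
    cudaDeviceGetLimit(&fifoSize,  cudaLimitPrintfFifoSize);
    printf("stack %zu bytes/thread, heap %zu bytes, printf FIFO %zu bytes\n",
           stackSize, heapSize, fifoSize);

    // Raise the limits before launching any kernel that needs them.
    cudaDeviceSetLimit(cudaLimitStackSize,      16 * 1024);          // 16 KB per thread
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256 * 1024 * 1024);  // 256 MB device heap
    return 0;
}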
Related
While programming for the CUDA architecture I ran into a problem: the device resources are too limited. In other words, the stack and heap are too small.
While researching this, I found the function
cudaDeviceSetLimit(cudaLimitStackSize, limit_stack)
which enlarges the stack size, and a similar call for the heap. Even so, the resulting sizes are still too small.
How can I store more data on the device?
The stack and heap are provided for convenience. However, you can allocate memory on the device using cudaMalloc if your GPU is recent enough. In that case, the limit is the GPU's on-board memory.
Should you need more, you would have to implement a custom memory allocator that manages a large array of system memory and shares it with the GPU (see cudaHostRegister). Then the limit would be your system memory.
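As a minimal sketch of the cudaHostRegister approach (the buffer size and usage are placeholders; mapped host memory is accessed over PCIe, so expect lower bandwidth than on-board memory):

#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    // Allow mapped (zero-copy) host memory; must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const size_t bytes = (size_t)1 << 30;            // e.g. 1 GB of ordinary system memory
    void *hostBuf = malloc(bytes);

    // Pin the existing allocation and make it addressable from the GPU.
    cudaHostRegister(hostBuf, bytes, cudaHostRegisterMapped);

    void *devPtr = NULL;
    cudaHostGetDevicePointer(&devPtr, hostBuf, 0);   // device-visible alias of hostBuf

    // ... pass devPtr to kernels that read/write the host buffer directly ...

    cudaHostUnregister(hostBuf);
    free(hostBuf);
    return 0;
}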
I'm wondering: does CUDA 4.0 support recursion using local memory or shared memory? I have to maintain a stack in global memory myself, because the system-level recursion can't support my program (probably too many levels of recursion). When the recursion gets deeper, the threads stop working.
So I really want to know how the default recursion works in CUDA: does it use local memory or shared memory? Thanks!
Use of recursion requires the use of the ABI, which requires architecture >= sm_20. The ABI has a function calling convention that includes the use of a stack frame. The stack frame is allocated in local memory ("local" means "thread-local", that is, storage private to a thread). Please refer to the CUDA C Programming Guide for basic information on CUDA memory spaces. In addition, you may want to have a look at this previous question: Where does CUDA allocate the stack frame for kernels?
For deeply recursive functions it is possible to exceed the default stack size. For example, on my current system the default stack size is 1024 bytes. You can retrieve the current stack size via the CUDA API function cudaDeviceGetLimit(). You can adjust the stack size via the CUDA API function cudaDeviceSetLimit():
cudaError_t stat;
size_t myStackSize = [your preferred stack size];
stat = cudaDeviceSetLimit (cudaLimitStackSize, myStackSize);
Note that the total amount of memory needed for stack frames is at least the per-thread size multiplied by the number of threads specified in the kernel launch. Often it can be larger due to allocation granularity. So increasing the stack size can eat up memory pretty quickly, and you may find that a deeply recursive function requires more local memory than can be allocated on your GPU.
While recursion is supported on modern GPUs, its use can lead to code with fairly low performance due to function call overhead, so you may want to check whether there is an iterative version of the algorithm you are implementing that may be better suited to the GPU.
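To illustrate that trade-off, here is a small sketch (the factorial is just a stand-in for your algorithm) contrasting a recursive device function, which consumes one stack frame per level, with an iterative equivalent whose stack usage stays flat:

#include <cuda_runtime.h>

__device__ unsigned long long factRecursive(unsigned int n)
{
    // Each level of recursion occupies another stack frame in local memory.
    return (n < 2) ? 1ULL : n * factRecursive(n - 1);
}

__device__ unsigned long long factIterative(unsigned int n)
{
    // The loop reuses the same frame, so stack usage does not grow with n.
    unsigned long long r = 1ULL;
    for (unsigned int i = 2; i <= n; ++i)
        r *= i;
    return r;
}

__global__ void demo(unsigned long long *out, unsigned int n)
{
    out[0] = factRecursive(n);
    out[1] = factIterative(n);
}

int main()
{
    unsigned long long *out;
    cudaMalloc(&out, 2 * sizeof(unsigned long long));
    demo<<<1, 1>>>(out, 10);      // the recursive call requires sm_20 or later
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}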
Is the number of resident warps also limited by the user-specified heap size?
For example, suppose each thread needs to allocate 1 MB of memory and the heap is set to a total of 32 MB (I'm assuming that cudaLimitMallocHeapSize controls heap usage per kernel launch rather than per thread; is that correct?). Would it be true that only one warp is allowed on the device?
The kernel launch (or the issuing of warps or blocks) will not be limited by the heap size. Instead, the per-thread malloc calls will fail once the number of threads that have reached their malloc, but not the corresponding free, multiplied by the requested allocation per thread, can no longer be satisfied from the heap. You may wish to refer to the heap memory allocation section of the CUDA C Programming Guide. A per-thread allocation sample is given in that section, and you can easily modify that code to prove this behavior to yourself: simply adjust the heap size and the number of threads (or blocks) launched to see what happens when the heap limit is reached.
And yes, cudaLimitMallocHeapSize actually applies to the whole device context, so it affects all kernel launches that come after the relevant call to cudaDeviceSetLimit(). It is not a per-thread limit. Also note that there is some allocation overhead: setting a heap size of 128 MB does not mean that all 128 MB will be available for subsequent device malloc operations. It may also be useful to mention that device malloc operations are only possible on compute capability 2.0 and above.
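A hedged sketch of the kind of experiment described above (heap size, grid dimensions, and per-thread allocation size are arbitrary placeholders): with 2048 threads each requesting 1 MB against a 32 MB heap, the launch itself succeeds but most of the mallocs fail.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread tries to allocate a fixed amount from the device heap and reports failure.
__global__ void mallocTest(size_t bytesPerThread)
{
    void *p = malloc(bytesPerThread);                 // device-side malloc, CC 2.0+
    if (p == NULL)
        printf("thread %d: allocation failed\n", threadIdx.x + blockIdx.x * blockDim.x);
    else
        free(p);                                      // release so other threads can allocate
}

int main()
{
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);  // 32 MB device heap
    mallocTest<<<64, 32>>>(1 << 20);                                // 2048 threads x 1 MB each
    cudaDeviceSynchronize();
    return 0;
}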
My kernel call fails with "out of memory". It makes significant usage of the stack frame and I was wondering if this is the reason for its failure.
When invoking nvcc with --ptxas-options=-v, it prints the following profile information:
150352 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 59 registers, 40 bytes cmem[0]
Hardware: GTX480, sm20, 1.5GB device memory, 48KB shared memory/multiprocessor.
My question is: where is the stack frame allocated? In shared memory, global memory, constant memory, ...?
I tried with 1 thread per block, as well as with 32 threads per block. Same "out of memory" error.
Another issue: one can only increase the number of threads resident on a multiprocessor if the total number of registers does not exceed the number of registers available on the multiprocessor (32K for my card). Does something similar apply to the stack frame size?
The stack is allocated in local memory. The allocation is per physical thread (GTX480: 15 SMs * 1536 threads/SM = 23040 threads). You are requesting 150,352 bytes/thread => ~3.4 GB of stack space. CUDA may reduce the maximum number of physical threads per launch if the size is that high. The CUDA language is not designed to support a large per-thread stack.
In terms of registers, the GTX480 is limited to 63 registers per thread and 32K registers per SM.
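A rough back-of-the-envelope check of that arithmetic can be done from the host; this sketch assumes device 0 and simply multiplies the number of physically resident threads by the current per-thread stack limit:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    size_t stackPerThread = 0;
    cudaDeviceGetLimit(&stackPerThread, cudaLimitStackSize);

    // Worst case: every physically resident thread needs its own stack allocation.
    size_t residentThreads = (size_t)prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("%zu resident threads x %zu bytes = ~%zu MB of local memory for stacks\n",
           residentThreads, stackPerThread,
           residentThreads * stackPerThread / (1024 * 1024));
    return 0;
}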
The stack frame is most likely in local memory.
I believe there is some limit on local memory usage, but even without it, I think the CUDA driver might allocate local memory for more than just one thread in your <<<1,1>>> launch configuration.
One way or another, even if you manage to actually run your code, I fear it may be quite slow because of all those stack operations. Try to reduce the number of function calls, for example by inlining those functions.
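As a small illustration of the inlining suggestion (the helper function here is a placeholder), a device function can be inlined explicitly so that no ABI call or stack frame is generated for it:

// Ask nvcc to expand the helper at each call site instead of emitting a call.
__forceinline__ __device__ float scaleAndAdd(float x, float a, float b)
{
    return a * x + b;   // inlined, so no call overhead or stack frame
}

__global__ void kernel(float *data, float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = scaleAndAdd(data[i], a, b);
}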
Is the kernel stack a different structure to the user-mode stack that is used by applications we (programmers) write?
Can you explain the differences?
Conceptually, both are the same data structure: a stack.
The reason there are two different stacks per thread is that, in user mode, code must not be allowed to mess up kernel memory. When switching to kernel mode, a different stack, in memory accessible only from kernel mode, is used for return addresses and so on.
If user-mode code had access to the kernel stack, it could modify a jump address (for instance) and then make a system call; when the kernel jumps to the previously modified address, your code would be executed in kernel mode!
Also, security-related information and information about other processes (for synchronisation) might be on the kernel stack, so user mode should not have read access to it either.
The stack of a typical modern operating system is just a region of memory used for storing return addresses and local data. It has the same structure in both kernel and user mode, but each thread gets its own memory area for its stack. Context switches restore the stack pointer, so no thread sees another thread's stack, even though threads may be able to share other memory (if they are in the same process).
A thread doesn't have to use the stack by the way. The operating system makes assumptions about how it will be used, but the thread doesn't have to follow them.