I'm using Azure RTOS ThreadX on a Cortex-M0+ with 20 KB of RAM. The Cortex-M0 port module allocates 1024 bytes for the timer thread's stack by default, but after some debugging I noticed that the stack was not being used (the 0xEF fill pattern was untouched), so I reduced it to 256 bytes. While testing the code, this thread overflowed. What stack size does this thread need?
Thanks for all the attention!
It is no different from other threads: the required size depends on whether the functions you call inside it use a lot of stack or not.
If you want the thread to have a very small stack, use only a few small local variables, do the bare minimum of work there, and let a thread with a larger stack do the rest. In any case, keeping the timer thread's execution very short is a good approach.
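A minimal sketch of that hand-off pattern, using standard ThreadX calls (the names, priorities, tick counts, and the 1024-byte worker stack are illustrative, not values from the question): the timer expiration function only signals a semaphore, and a worker thread with an adequately sized stack does the actual work.
#include "tx_api.h"

#define WORKER_STACK_SIZE 1024              /* illustrative size */

static TX_TIMER     periodic_timer;
static TX_SEMAPHORE work_ready;
static TX_THREAD    worker_thread;
static UCHAR        worker_stack[WORKER_STACK_SIZE];

/* Runs on the (small-stack) timer thread: do nothing but signal. */
static VOID timer_expired(ULONG id)
{
    (void)id;
    tx_semaphore_put(&work_ready);
}

/* Runs on a thread whose stack is sized for the real work. */
static VOID worker_entry(ULONG input)
{
    (void)input;
    for (;;)
    {
        tx_semaphore_get(&work_ready, TX_WAIT_FOREVER);
        /* ... stack-hungry processing goes here ... */
    }
}

void tx_application_define(void *first_unused_memory)
{
    (void)first_unused_memory;
    tx_semaphore_create(&work_ready, "work ready", 0);
    tx_thread_create(&worker_thread, "worker", worker_entry, 0,
                     worker_stack, WORKER_STACK_SIZE,
                     16, 16, TX_NO_TIME_SLICE, TX_AUTO_START);
    tx_timer_create(&periodic_timer, "periodic timer", timer_expired, 0,
                    100, 100, TX_AUTO_ACTIVATE);
}
With this split, the timer thread's own stack only has to cover the expiration function and the semaphore call, so it can stay small.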
Related
Suppose we have a kernel that invokes some functions, for instance:
__device__ int fib(int n) {
    if (n == 0 || n == 1) {
        return n;
    } else {
        int x = fib(n-1);
        int y = fib(n-2);
        return x + y;
    }
    return -1;
}

__global__ void fib_kernel(int* n, int *ret) {
    *ret = fib(*n);
}
The kernel fib_kernel will invoke the function fib(), which internally will invoke two fib() functions. Suppose the GPU has 80 SMs, we launch exactly 80 threads to do the computation, and pass in n as 10. I am aware that there will be a ton of duplicated computations which violates the idea of data parallelism, but I would like to better understand the stack management of the thread.
The CUDA PTX documentation states the following:
the GPU maintains execution state per thread, including a program counter and call stack
The stack resides in local memory. As the threads execute the kernel, do they behave just like the calling convention on a CPU? In other words, is it true that for each thread, the corresponding stack grows and shrinks dynamically?
The stack of each thread is private and is not accessible by other threads. Is there a way that I can manually instrument the compiler/driver, so that the stack is allocated in global memory, no longer in local memory?
Is there a way that allows threads to obtain the current program counter and frame pointer values? I think they are stored in some specific registers, but the PTX documentation does not provide a way to access those. May I know what I have to modify (e.g. the driver or the compiler) to be able to obtain those registers?
If we increase the input to fib(n) to 10000, it is likely to cause a stack overflow. Is there a way to deal with it? The answer to question 2 might be able to address this. Any other thoughts would be appreciated.
You'll get a somewhat better idea of how these things work if you study the generated SASS code from a few examples.
As the threads execute the kernel, do they behave just like the calling convention on a CPU? In other words, is it true that for each thread, the corresponding stack grows and shrinks dynamically?
The CUDA compiler will aggressively inline functions when it can. When it can't, it builds a stack-like structure in local memory. However, the GPU instructions I'm aware of don't include explicit stack management (push and pop, for example), so the "stack" is "built by the compiler" with the use of registers that hold a (local) address and LD/ST instructions to move data to/from the "stack" space. In that sense, the actual stack does/can dynamically change in size; however, the maximum allowable stack space is limited. Each thread has its own stack, using the definition of "stack" given here.
Is there a way that I can manually instrument the compiler/driver, so that the stack is allocated in global memory, no longer in local memory?
Practically, no. The NVIDIA compiler that generates instructions has a front-end and a back-end that is closed source. If you want to modify an open-source compiler for the GPUs it might be possible, but at the moment there are no widely recognized tool chains that I am aware of that don't use the closed-source back end (ptxas or its driver equivalent). The GPU driver is also largely closed source. There aren't any exposed controls that would affect the location of the stack, either.
May I know what I have to modify (e.g. the driver or the compiler) to be able to obtain those registers?
There is no published register for the instruction pointer/program counter. Therefore it's impossible to state what modifications would be needed.
If we increase the input to fib(n) to be 10000, it is likely to cause stack overflow, is there a way to deal with it?
As I mentioned, the maximum stack space per thread is limited, so your observation is correct: eventually a stack could grow to exceed the available space (and this is a possible hazard for recursion in CUDA device code). The provided mechanism to address this is to increase the per-thread local memory size (since the stack exists in the logical local space).
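As a rough host-side sketch of that mechanism (the boilerplate, the factor of four, and the launch configuration are illustrative; it assumes the fib_kernel from the question is in scope), you can query the current per-thread stack limit and raise it before launching:
#include <cstdio>
#include <cuda_runtime.h>

// Assumes the fib/fib_kernel definitions from the question above.

int main() {
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("default per-thread stack: %zu bytes\n", stackSize);

    // Illustrative choice: quadruple the per-thread stack before the recursive launch.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 4 * stackSize);
    if (err != cudaSuccess) {
        printf("cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int n = 30, result = 0;
    int *d_n, *d_ret;
    cudaMalloc(&d_n, sizeof(int));
    cudaMalloc(&d_ret, sizeof(int));
    cudaMemcpy(d_n, &n, sizeof(int), cudaMemcpyHostToDevice);

    fib_kernel<<<80, 1>>>(d_n, d_ret);   // 80 threads in total (80 blocks of 1 thread), as in the question
    cudaMemcpy(&result, d_ret, sizeof(int), cudaMemcpyDeviceToHost);
    printf("fib(%d) = %d\n", n, result);

    cudaFree(d_n);
    cudaFree(d_ret);
    return 0;
}
Note that this only postpones the problem; for deep recursion the stack still grows linearly with the recursion depth.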
I'm wondering, does CUDA 4.0 support recursion using local memory or shared memory? I have to maintain a stack in global memory by myself, because the system-level recursion can't support my program (probably too many levels of recursion). When the recursion gets deeper, the threads stop working.
So I really want to know how the default recursion works in CUDA: does it use local memory or shared memory? Thanks!
Use of recursion requires the use of the ABI, which requires architecture >= sm_20. The ABI has a function calling convention that includes the use of a stack frame. The stack frame is allocated in local memory ("local" means "thread-local", that is, storage private to a thread). Please refer to the CUDA C Programming Guide for basic information on CUDA memory spaces. In addition, you may want to have a look at this previous question: Where does CUDA allocate the stack frame for kernels?
For deeply recursive functions it is possible to exceed the default stack size. For example, on my current system the default stack size is 1024 bytes. You can retrieve the current stack size via the CUDA API function cudaDeviceGetLimit(). You can adjust the stack size via the CUDA API function cudaDeviceSetLimit():
cudaError_t stat;
size_t myStackSize = [your preferred stack size];
stat = cudaDeviceSetLimit (cudaLimitStackSize, myStackSize);
Note that the total amount of memory needed for stack frames is at least the per-thread size multiplied by the number of threads specified in the kernel launch. Often it can be larger due to allocation granularity. So increasing the stack size can eat up memory pretty quickly, and you may find that a deeply recursive function requires more local memory than can be allocated on your GPU.
While recursion is supported on modern GPUs, its use can lead to code with fairly low performance due to function call overhead, so you may want to check whether there is an iterative version of the algorithm you are implementing that may be better suited to the GPU.
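For fib in particular, an iterative formulation needs only a couple of scalars per thread instead of one call frame per recursion level, so the stack no longer grows with n. A minimal sketch (the names mirror the earlier question; note that the result overflows a 32-bit int long before n reaches 10000):
// Iterative replacement for the recursive fib(): O(1) stack usage per thread.
__device__ int fib_iter(int n) {
    if (n == 0) return 0;
    int prev = 0, curr = 1;
    for (int i = 2; i <= n; ++i) {
        int next = prev + curr;   // overflows int for large n; shown for structure only
        prev = curr;
        curr = next;
    }
    return curr;
}

__global__ void fib_kernel_iter(int *n, int *ret) {
    *ret = fib_iter(*n);
}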
I've tested empirically with several values of blocks and threads, and the execution time can be greatly reduced with specific values.
I don't see what the differences between blocks and threads are. I figure it may be that threads in a block have specific cache memory, but it's quite fuzzy to me. For the moment, I parallelize my functions into N parts, which are allocated to blocks/threads.
My goal is to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Would that be possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
When using shared memory, you might want to consider it first, because it's a very limited resource and it's not unlikely for kernels to have very specific needs that constrain the many variables controlling parallelism. You either have blocks with many threads sharing larger regions or blocks with fewer threads sharing smaller regions (under constant occupancy).
If your code can live with as little as 16 KB of shared memory per multiprocessor, you might want to opt for the larger (48 KB) L1 cache by calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, L1-caches can be disabled for non-local global access using the compiler option -Xptxas=-dlcm=cg to avoid pollution when the kernel accesses global memory carefully.
Before worrying about optimal performance based on occupancy you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or that appropriate optimization options are given; read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs. That is, register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
It's theoretically possible to find optimal values from within an application; however, having the client code adjust optimally to both different device and launch parameters can be nontrivial and will require recompilation or different variants of the kernel to be deployed for every target device architecture.
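If a newer toolkit is available, the runtime's occupancy API can at least suggest a starting point automatically. A minimal sketch (cudaOccupancyMaxPotentialBlockSize is a standard CUDA runtime call, while my_kernel and the problem size are placeholders):
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for your real one.
__global__ void my_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for a block size that maximizes theoretical occupancy.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, my_kernel, 0, 0);

    int n = 1 << 20;                                 // placeholder problem size
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    int gridSize = (n + blockSize - 1) / blockSize;  // enough blocks to cover n
    my_kernel<<<gridSize, blockSize>>>(d_data, n);
    cudaDeviceSynchronize();

    printf("suggested block size: %d (min grid for full occupancy: %d)\n",
           blockSize, minGridSize);
    cudaFree(d_data);
    return 0;
}
This only maximizes theoretical occupancy; as discussed above, the actually fastest configuration still has to be confirmed by profiling.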
I believe automatically adjusting the block and thread sizes is a highly difficult problem. If it were easy, CUDA would most probably have this feature for you.
The reason is that the optimal configuration depends on the implementation and the kind of algorithm you are implementing. It requires profiling and experimentation to get the best performance.
Here are some limitations which you can consider.
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to best performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
I got a quite good answer here; in a word, it is a difficult problem to compute the optimal distribution of blocks and threads.
When I run the profiler against my code, part of the output is:
Limiting Factor
Achieved Occupancy: 0.02 ( Theoretical Occupancy: 0.67 )
IPC: 1.00 ( Maximum IPC: 4 )
Achieved occupancy of 0.02 seems horribly low. Is it possible that this is due to missing .csv files from the profile run? It complains about:
Program run #18 completed.
Read profiler output file for context #0, run #1, Number of rows=6
Error : Error in profiler data file '/.../temp_compute_profiler_1_0.csv' at line number 1. No column found
Error in reading profiler output:
Application : "/.../bin/python".
Profiler data file '/.../temp_compute_profiler_2_0.csv' for application run 2 not found.
Read profiler output file for context #0, run #4, Number of rows=6
My blocks are 32*4*1, the grid is 25*100, and testing has shown that 32 registers provides the best performance (even though that results in spilling).
If the 0.02 number is correct, how can I go about debugging what's going on? I've already tried moving likely candidates to shared and/or constant memory, experimenting with launch_bounds, moving data into textures, etc.
Edit: if more data from a profile run will be helpful, just let me know and I can provide it. Thanks for reading.
Edit 2: requested data.
IPC: 1.00
Maximum IPC: 4
Divergent branches(%): 6.44
Control flow divergence(%): 96.88
Replayed Instructions(%): -0.00
Global memory replay(%): 10.27
Local memory replays(%): 5.45
Shared bank conflict replay(%): 0.00
Shared memory bank conflict per shared memory instruction(%): 0.00
L1 cache read throughput(GB/s): 197.17
L1 cache global hit ratio (%): 51.23
Texture cache memory throughput(GB/s): 0.00
Texture cache hit rate(%): 0.00
L2 cache texture memory read throughput(GB/s): 0.00
L2 cache global memory read throughput(GB/s): 9.80
L2 cache global memory write throughput(GB/s): 6.80
L2 cache global memory throughput(GB/s): 16.60
Local memory bus traffic(%): 206.07
Peak global memory throughput(GB/s): 128.26
The following derived statistic(s) cannot be computed as required counters are not available:
Kernel requested global memory read throughput(GB/s)
Kernel requested global memory write throughput(GB/s)
Global memory excess load(%)
Global memory excess store(%)
Achieved global memory read throughput(GB/s)
Achieved global memory write throughput(GB/s)
Solution(s):
The issue with missing data was due to a too-low timeout value; certain early runs would time out and the data would not be written (and those error messages would get lost in the spam of later runs).
The 0.02 achieved occupancy was due to active_warps and active_cycles (and potentially other values as well) hitting maxint (2**32-1). Reducing the size of the input to the profiled script caused much more sane values to come out (including better/more realistic IPC and branching stats).
The hardware counters used by the Visual Profiler, Parallel Nsight, and the CUDA command line profiler are 32-bit counters and will overflow in 2^32 / shaderclock seconds (~5s). Some of the counters will overflow quicker than this. If you see values of MAX_INT or if your duration is in seconds then you are likely to see incorrect results in the tools.
I recommend splitting your kernel launch into 2 or more launches for profiling such that the duration of the launch is less than 1-2 seconds. In your case you have a Theoretical Occupancy of 67% (32 warps/SM) and a block size of 4 warps. When dividing work you want to make sure that each SM is fully loaded and preferably receives multiple waves of blocks. For each launch try launching NumSMs * MaxBlocksPerSM * 10 Blocks. For example, if you have a GTX560 which has 8 SMs and your reported configuration above you would break the single launch of 2500 blocks into 4 launches of 640, 640, 640, and 580.
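A rough sketch of that splitting (the kernel body, chunk size, and explicit synchronization are illustrative; the point is just to pass a block offset so each partial launch covers a different slice of the original 2500-block grid and stays short enough for the counters):
#include <algorithm>
#include <cuda_runtime.h>

// Illustrative kernel that takes an explicit block offset so the grid
// can be launched in slices without changing the per-thread indexing.
__global__ void sliced_kernel(int blockOffset, float *data, int n) {
    int globalBlock = blockOffset + blockIdx.x;
    int i = globalBlock * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launch_in_slices(float *d_data, int n, int totalBlocks, int threadsPerBlock) {
    const int blocksPerLaunch = 640;   // e.g. 2500 blocks -> 640, 640, 640, 580
    for (int offset = 0; offset < totalBlocks; offset += blocksPerLaunch) {
        int blocks = std::min(blocksPerLaunch, totalBlocks - offset);
        sliced_kernel<<<blocks, threadsPerBlock>>>(offset, d_data, n);
        cudaDeviceSynchronize();       // keep each profiled launch short and separate
    }
}
Each slice then shows up as its own launch in the profiler, keeping the per-launch duration under the counter-overflow threshold.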
Improved support for handling overflows should be in a future version of the tools.
Theoretical occupancy is the maximum number of warps you can execute on an SM divided by the device limit. Theoretical occupancy can be lower than the device limit based upon the kernel's use of threads per block, registers per thread, or shared memory per block.
Achieved occupancy is the measure of (active_warps / active_cycles) / max_warps_per_sm.
An achieved occupancy of .02 implies that only 1 warp is active on the SM. Given a launch of 10000 warps (2500 blocks * 128 threads / WARP_SIZE) this can only happen if you have extremely divergent code where all warps except for 1 immediately exit and 1 warp runs for a very long time. Also, it is highly unlikely that you could achieve an IPC of 1 with this achieved occupancy, so I suspect an error in the reported value.
If you would like help diagnosing the problem I would suggest you
post your device information
verify that you launched <<<{25,100,1}, {32,4,1}>>>
post your code
If you cannot post your code I would recommend capturing the counters active_cycles and active_warps and calculating achieved occupancy as
(active_warps / active_cycles) / 48
Given that you have errors in your profiler log it is possible that the results are invalid.
I believe from the output that you are using an older version of the Visual Profiler. You may want to consider updating to version 4.1, which improves the collection of PM counters and will help provide hints on how to improve your code.
It seems like (a big part of) your issue here is:
Control flow divergence(%): 96.88
It sounds like 96.88 percent of the time, threads are not running the same instruction at the same time. The GPU can only really run the threads in parallel when each thread in a warp is running the same instruction at the same time. Things like if-else statements can cause some threads of a given warp to enter the if, and some threads to enter the else, causing divergence. What happens then is the GPU switches back and forth between executing each set of threads, causing each execution cycle to have a less than optimal occupancy.
To improve this, try to make sure that threads that will execute together in a warp (32 at a time on all NVIDIA cards today... I think) will all take the same path through the kernel code. Sometimes sorting the input data so that like data gets processed together works. Beyond that, adding a barrier in strategic places in the kernel code can help. If threads of a warp are forced to diverge, a barrier will make sure that, after they reach common code again, they wait for each other to get there and then resume executing with full occupancy (for that warp). Just beware that a barrier must be hit by all threads, or you will cause a deadlock.
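To make the divergence point concrete, here is a small sketch (kernel names and bodies are purely illustrative): the first kernel branches on a per-thread condition, so the two halves of every warp serialize, while the second branches on a per-warp condition, so each warp follows a single path:
// Diverges inside every warp: odd and even lanes take different paths,
// so the hardware executes both branches serially for each warp.
__global__ void divergent(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}

// Branches on the warp index instead: all 32 threads of a warp agree,
// so each warp executes only one of the two branches.
__global__ void warp_uniform(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / 32;
    if (warpId % 2 == 0)
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}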
I can't promise this is your whole answer, but it seems to be a big problem for your code given the numbers listed in your question.
Is the kernel stack a different structure to the user-mode stack that is used by applications we (programmers) write?
Can you explain the differences?
Conceptually, both are the same data structure: a stack.
The reason why there are two different stacks per thread is that in user mode, code must not be allowed to mess up kernel memory. When switching to kernel mode, a different stack, in memory only accessible in kernel mode, is used for return addresses and so on.
If the user mode had access to the kernel stack, it could modify a jump address (for instance), then do a system call; when the kernel jumps to the previously modified address, your code is executed in kernel mode!
Also, security-related information/information about other processes (for synchronisation) might be on the kernel stack, so the user mode should not have read access to it either.
The stack of a typical modern operating system is just a region of memory used for storing return addresses and local data. It has the same structure in both the kernel and user-mode, but each thread gets its own memory area for storing its stack. Context switches restore the stack pointer, so no thread sees another thread's stack even though they may be able to share other memory (if the threads are in the same process).
A thread doesn't have to use the stack by the way. The operating system makes assumptions about how it will be used, but the thread doesn't have to follow them.