Where does CUDA allocate the stack frame for kernels?

My kernel call fails with "out of memory". It makes significant use of the stack frame and I was wondering if this is the reason for the failure.
When invoking nvcc with --ptxas-options=-v it prints the following profile information:
150352 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 59 registers, 40 bytes cmem[0]
Hardware: GTX480, sm_20, 1.5 GB device memory, 48 KB shared memory per multiprocessor.
My question is: where is the stack frame allocated? In shared memory, global memory, constant memory, ...?
I tried with 1 thread per block, as well as with 32 threads per block. Same "out of memory".
Another issue: one can only increase the number of threads resident on a multiprocessor if the total number of registers does not exceed the number of registers available on that multiprocessor (32K for my card). Does something similar apply to the stack frame size?

The stack is allocated in local memory. Allocation is per physical thread (GTX480: 15 SMs * 1536 threads/SM = 23040 threads). You are requesting 150,352 bytes/thread => ~3.4 GB of stack space. CUDA may reduce the maximum number of physical threads per launch if the size is that high. The CUDA language is not designed to have a large per-thread stack.
In terms of registers, the GTX480 is limited to 63 registers per thread and 32K registers per SM.
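As a hedged illustration (not part of the original answer): the per-thread stack reservation can be inspected and adjusted through the runtime's cudaLimitStackSize limit. The kernel launch below is only hinted at, since no kernel was shown in the question.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // Query the current per-thread stack size (the default is on the order of 1 KB).
        size_t stackSize = 0;
        cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
        printf("stack size per thread: %zu bytes\n", stackSize);

        // Request a larger per-thread stack. The driver backs this with local memory
        // for every physically resident thread, so large values can consume gigabytes
        // of device memory or make the next launch fail with "out of memory".
        cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 150352);
        if (err != cudaSuccess)
            printf("cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));

        // myKernel<<<grid, block>>>(...);   // hypothetical kernel launch
        return 0;
    }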

The stack frame is most likely in local memory.
I believe there is some limit on local memory usage, but even without it, I think the CUDA driver might allocate local memory for more than just one thread with your <<<1,1>>> launch configuration.
One way or another, even if you manage to actually run your code, I fear it may be quite slow because of all those stack operations. Try to reduce the number of function calls (e.g. by inlining those functions).
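A minimal sketch of that suggestion, with made-up function names: marking small device functions __forceinline__ lets the compiler remove the ABI call and the stack traffic that would come with it.

    // scale() is a hypothetical helper; __forceinline__ asks the compiler to inline
    // it so no call frame is generated for the call.
    __device__ __forceinline__ float scale(float x, float factor) {
        return x * factor;
    }

    __global__ void apply_scale(float* out, const float* in, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = scale(in[i], factor);   // inlined: no stack frame for the call
    }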

Related

Maximum number of GPU Threads on Hardware and used memory

I have already read several threads about the capacity of the GPU and understood that the concept of blocks and threads has to be separated from the physical hardware. Although the maximum number of threads per block is 1024, there is no limit on the number of blocks one can use. However, as the number of streaming processors is finite, there has to be a physical limit. After writing a GPU program, I would be interested in evaluating the used capacity of my GPU. To do this, I have to know how many threads I could theoretically start at one time on the hardware. My graphics card is an Nvidia GeForce 1080Ti, so I have 3584 CUDA cores. As far as I understood, each CUDA core executes one thread, so in theory I would be able to execute 3584 threads per cycle. Is this correct?
Another question is about memory. I installed and used nvprof to get some insight into the kernels used. What is displayed there is, for example, the number of registers used. I transfer my arrays to the GPU using cuda.to_device (in Python Numba) and, as far as I understood, the arrays then reside in global memory. How do I find out how big this global memory is? Is it equivalent to the DRAM size?
Thanks in advance
I'll focus on the first part of the question. The second should really be its own separate question.
CUDA cores do not map 1-to-1 to threads. They are more like ports in a superscalar CPU. Multiple threads can issue instructions to the same CUDA core in different clock cycles, somewhat like hyperthreading in a CPU.
You can see the relation and the numbers in the documentation, chapter K, Compute Capabilities, compared with Table 3, Throughput of Native Arithmetic Instructions. For your card (compute capability 6.1) that gives 2048 threads per SM and 128 32-bit floating point operations per clock cycle, i.e. 128 CUDA cores shared by a maximum of 2048 threads.
Within one GPU generation, the absolute number of threads and CUDA cores only scales with the number of multiprocessors (SMs). TechPowerUp's excellent GPU database documents 28 SMs for your card, which should give you 28 * 2048 threads unless I did something wrong.
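As a hedged sketch (not part of the original answer), both numbers, as well as the global memory size asked about in the second part, can be queried at runtime through the standard cudaDeviceProp fields:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // Maximum number of simultaneously resident threads: SM count times the
        // per-SM resident-thread limit (e.g. 28 * 2048 on a 1080 Ti).
        int residentThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
        printf("SMs: %d, max resident threads: %d\n", prop.multiProcessorCount, residentThreads);

        // Size of the global memory visible to CUDA (essentially the card's DRAM).
        printf("global memory: %zu bytes\n", prop.totalGlobalMem);
        return 0;
    }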

CUDA: Does the compute capability impact the maximum number of active threads?

If I have a device supporting CC 3.0, that means it has a maximum number of active threads equal to 2048 per multiprocessor. And if I set the CC to 2.0 (compute_20,sm_20), does that mean the maximum number of active threads will be only 1536 per multiprocessor, or does the compute capability have no impact on this?
Or does it have an impact on the shared memory size?
CUDA is designed for scalability; kernels will expand to use all of the resources they can. So it doesn't matter how you compile the kernel; it will fill up all of the available threads unless you do something that prevents it from doing so, like launching it with 768 threads per block.
Now, GPU threads aren't like CPU cores; you aren't losing the ability to do computation if you aren't using all of the threads. A streaming multiprocessor (SM) on a device of compute capability 3.0 can manage 2048 threads simultaneously, but is only capable of executing 256 instructions per tick. There are other limits too; e.g. if you're doing 32-bit floating point addition, it can only do 192 of those per tick. Doing left shifts on 32-bit integers? Only 64 per tick.
The point of having more threads is latency hiding: when one thread is blocked for some reason, such as waiting to fetch a value from memory or for the result of an arithmetic instruction, the SM will run a different thread instead. Using more threads gives you more opportunities to hide this latency: more chances to have independent work available to do when some instructions are blocked, waiting for data.
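As an illustrative sketch (dummy_kernel is a placeholder), the occupancy API reports how many blocks of a given size actually end up resident per SM for a particular kernel on the device it runs on:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy_kernel(float* data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] += 1.0f;
    }

    int main() {
        int blocksPerSM = 0;
        // 256 threads per block, no dynamic shared memory.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy_kernel, 256, 0);
        printf("resident blocks per SM: %d (%d resident threads per SM)\n",
               blocksPerSM, blocksPerSM * 256);
        return 0;
    }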

Understanding CUDA kernel stack usage and register spilling

I am trying to fully understand the ptxas -v output regarding kernel stack usage and register spilling (for the sm_35 architecture). For one of my kernels it produces:
3536 bytes stack frame, 3612 bytes spill stores, 6148 bytes spill loads
ptxas info : Used 255 registers, 392 bytes cmem[0]
I know that the stack frame is allocated in local memory which lives physically where global memory is and is private to each thread.
My questions are:
Is the memory needed for register spillage also allocated in local memory?
Is the total amount of memory needed for register spilling and stack usage equal to [number of threads] x [3536 bytes]? In other words, do register spill loads/stores operate on the stack frame?
The number of spill stores/loads doesn't detail the size of the transfers. Are these always 32-bit registers? Thus, would a 64-bit floating point spill be counted as 2 spill stores?
Are spill stores/loads cached in the L2 cache?
Registers are spilled to local memory. "local" means "thread-local", i.e. storage private to each thread.
The amount of local memory required for the entire launch is at least number_of_threads times local_memory_bytes_per_thread. Due to allocation granularity it can often be more.
The compiler statistics for spill transfers are already normalized to bytes, as individual local memory accesses may have different widths. Inspection of the generated machine code (run cuobjdump --dump-sass on the binary) will show the width of the individual accesses. The relevant instructions have names like LLD, LST, LDL, STL.
I am reasonably sure that local memory accesses are cached in the L1 and L2 caches, but cannot quote the relevant paragraphs from the documentation at this time.
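As a hedged add-on (my_kernel is a placeholder name), the per-thread local memory that backs the stack frame and the spills can also be queried at runtime:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float* data) {
        data[threadIdx.x] *= 2.0f;
    }

    int main() {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, my_kernel);

        // localSizeBytes covers the stack frame plus spilled registers, per thread.
        printf("local memory per thread: %zu bytes, registers: %d\n",
               attr.localSizeBytes, attr.numRegs);
        return 0;
    }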

Load the same 32 bytes (ulong4) from shared memory in every thread of a warp

If each thread of a warp accesses shared memory at the same address, how would that load 32 bytes of data (ulong4)? Will it be 'broadcast'? Would the access time be the same as if each thread loaded a 2-byte unsigned short int?
Now, in case I need to load the same 32/64 bytes from shared memory in each warp, how could I do this?
On devices before compute capability 3.0 shared memory accesses are always 32 bit / 4 bytes wide and will be broadcast if all threads of a warp access the same address. Wider accesses will compile to multiple instructions.
On compute capability 3.0 shared memory accesses can be configured to be either 32 bit wide or 64 bit wide using cudaDeviceSetSharedMemConfig(). The chosen setting will apply to the entire kernel though.
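A minimal sketch of that call, assuming a compute capability 3.x device; note that it is a device-wide setting applying to kernels launched afterwards:

    #include <cuda_runtime.h>

    int main() {
        // Switch shared memory banks to 8-byte mode so 64-bit shared memory loads
        // are serviced in a single pass on compute capability 3.x.
        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

        // ... launch kernels that perform 64-bit shared memory accesses ...

        // Restore the default (4-byte) bank size if other kernels expect it.
        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeDefault);
        return 0;
    }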
[As I had originally missed the little word "shared" in the question, I gave a completely off-topic answer for global memory instead. Since that one should still be correct, I'll leave it in here:]
It depends:
Compute capability 1.0 and 1.1 don't broadcast and use 64 separate 32 byte memory transactions (two times 16 bytes, extended to the minimum 32 byte transaction size, for each thread of the warp)
Compute capability 1.2 and 1.3 broadcast, so two 32 byte transactions (two times 16 bytes, extended to minimum 32 byte transaction size) suffice for all threads of the warp
Compute capability 2.0 and higher just read a 128 byte cache line and satisfy all requests from there.
The compute capability 1.x devices will waste 50% of the transferred data, as a single thread can load at most 16 bytes, but the minimum transaction size is 32 bytes. Additionally, 32 byte transactions are a lot slower than 128 byte transactions.
The time will be the same as if just 8 bytes were read by each thread because of the minimum transaction size, and because data paths are sufficiently wide to transfer either 8 or 16 bytes to each thread per transaction.
Reading 2× or 4× the data will take 2× or 4× as long on compute capability 1.x, but only minimally longer on 2.0 and higher if the data falls into the same cache line so no further memory transactions are necessary.
So on compute capability 2.0 and higher you don't need to worry. On 1.x read the data through the constant cache or a texture if it is constant, or reorder it in shared memory otherwise (assuming your kernel is memory bandwidth bound).
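For illustration (a sketch, not from the original answer), this is the broadcast pattern the shared memory part of the answer describes: every thread reads the same 32-byte ulong4 from a single shared memory location.

    __global__ void broadcast_demo(const ulong4* in, ulong4* out) {
        __shared__ ulong4 value;

        // One thread stages the 32 bytes in shared memory.
        if (threadIdx.x == 0)
            value = in[blockIdx.x];
        __syncthreads();

        // Every thread of every warp reads the same shared memory address; the value
        // is broadcast (possibly in more than one pass, depending on the compute
        // capability, as described above).
        ulong4 v = value;
        out[blockIdx.x * blockDim.x + threadIdx.x] = v;
    }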

My GPU has 2 multiprocessors with 48 CUDA cores each. What does this mean?

My GPU has 2 multiprocessors with 48 CUDA cores each. Does this mean that I can execute 96 thread blocks in parallel?
No it doesn't.
From chapter 4 of the CUDA C programming guide:
The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits as well the amount of registers and shared memory available on the multiprocessor are a function of the compute capability of the device and are given in Appendix F. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
Get the guide at: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
To check the limits for your specific device, compile and execute the deviceQuery example from the SDK.
So far the maximum number of resident blocks per multiprocessor is the same across all compute capabilities and is equal to 8.
This comes down to semantics. What do "execute" and "running in parallel" really mean?
At a basic level, having 96 CUDA cores really means that you have a potential throughput of 96 results of calculations per cycle of the core clock.
A core is mainly an Arithmetic Logic Unit (ALU), it performs basic arithmetic and logical operations. Aside from access to an ALU, a thread needs other resources, such as registers, shared memory and global memory to run. The GPU will keep many threads "in flight" to keep all these resources utilized to the fullest. The number of threads "in flight" will typically be much higher than the number of cores. On one hand, these threads can be seen as being "executed in parallel" because they are all consuming resources on the GPU at the same time. But on the other hand, most of them are actually waiting for something, such as data to arrive from global memory or for results of arithmetic to go through the pipelines in the cores. The GPU puts threads that are waiting for something on the "back burner". They are consuming some resources, but are they actually running? :)
The number of concurrently executed threads depends on your code and the type of your CUDA device. For example, Fermi has 2 warp schedulers per streaming multiprocessor, and in each clock cycle 2 half-warps can be scheduled for computation, a memory load, or a transcendental function. While one half-warp waits for a load or executes a transcendental function, the CUDA cores may execute something else. So you can get 96 threads running on the cores, but only if your code allows it. And, of course, you must have enough memory.