Understanding GPU heap memory and resident warps

Is the number of resident warps also limited by the user-specified heap size?
For example, suppose each thread needs to allocate 1 MB of memory and the heap is set to a total of 32 MB (I'm assuming that cudaLimitMallocHeapSize governs heap usage per kernel launch rather than per thread; is that correct?). Would it be true that only one warp is allowed on the device?

The kernel launch (or the issuing of warps or blocks) will not be limited by the heap size. Instead, the in-kernel allocation will fail (malloc returns a NULL pointer) if the number of threads that have reached the per-thread malloc (but not yet the corresponding free), times the requested allocation per thread, cannot be satisfied from the heap. You may wish to refer to the heap memory allocation section of the CUDA C Programming Guide. A per-thread allocation sample code is given in that section, and you can easily modify it to prove this behavior to yourself: simply adjust the heap size and the number of threads (or blocks) launched to see what happens when the heap limit is reached.
And yes, cudaLimitMallocHeapSize actually applies to the whole device context, so it applies to all kernel launches that come after the relevant call to cudaDeviceSetLimit(). It is not a per-thread limit. Also note that there is some allocation overhead: setting a heap size of 128 MB does not mean that all 128 MB will be available for subsequent device malloc operations. It may also be useful to mention that device malloc operations are only possible on CC 2.0 and above.
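A minimal sketch of that experiment (my own illustration, loosely following the per-thread allocation pattern in the programming guide; the kernel name and sizes are just examples):
#include <cstdio>
#define ALLOC_PER_THREAD (1024 * 1024)      // 1 MB per thread, as in the question
__global__ void perThreadAlloc()
{
    // In-kernel malloc draws from the device heap; it returns NULL when the heap is exhausted.
    char *p = (char *)malloc(ALLOC_PER_THREAD);
    if (p == NULL)
        printf("thread %d: device malloc failed\n", blockIdx.x * blockDim.x + threadIdx.x);
    else
        free(p);
}
int main()
{
    // Whole-context setting; must be done before launching any kernel that uses device malloc/free.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);  // 32 MB heap
    perThreadAlloc<<<2, 32>>>();    // 64 threads x 1 MB > 32 MB, so some allocations fail
    cudaDeviceSynchronize();
    return 0;
}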


NVIDIA Architecture: CUDA threads and thread blocks

This is mostly from the book "Computer Architecture: A Quantitative Approach."
The book states that threads are grouped in sets of 32 and executed together in what's called a thread block, but it shows an example with a function call that has 256 threads per thread block, and CUDA's documentation states that you can have a maximum of 512 threads per thread block.
The function call looks like this:
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
Could somebody please explain how thread blocks are structured?
The question is a little unclear in my opinion. I will highlight a difference between thread warps and thread blocks that I find important in hopes that it helps answer whatever the true question is.
The number of threads per warp is defined by the hardware. Often, a thread warp is 32 threads wide (NVIDIA) because the SIMD unit on the GPU has exactly 32 lanes of execution, each with its own ALU (this is not always the case as far as I know; some architectures have only 16 lanes even though thread warps are 32 wide).
The size of a thread block is user defined (although constrained by the hardware). The hardware will still execute the thread code in 32-wide thread warps. Some GPU resources, such as shared memory and synchronization, cannot be shared arbitrarily between any two threads on the GPU. However, the GPU will allow threads to share a larger subset of resources if they belong to the same thread block. That is the main idea behind why thread blocks exist.
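To make that concrete, here is a sketch of what the daxpy kernel behind the launch in the question might look like (my reconstruction, not the book's code), showing how a thread locates its element from its block and thread indices:
// y[i] = a * x[i] + y[i], one element per thread
__global__ void daxpy(int n, double a, double *x, double *y)
{
    // blockIdx.x selects the 256-thread block, threadIdx.x the thread within it.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // the last block may be only partially used
        y[i] = a * x[i] + y[i];
}
// Host side: enough 256-thread blocks to cover n elements.
// int nblocks = (n + 255) / 256;
// daxpy<<<nblocks, 256>>>(n, 2.0, x, y);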

Issued load/store instructions for replay

There are two nvprof metrics regarding load/store instructions, ldst_executed and ldst_issued, and we know that executed <= issued. I expected that loads/stores that are issued but not executed are related to branch predication and other incorrect predictions. However, according to this document (slide 9) and this topic, instructions that are issued but not executed are related to serialization and replay.
I don't know whether that reasoning applies to load/store instructions or not. Moreover, I would like to know why such terminology is used for issued-but-not-executed instructions. If there is serialization for any reason, instructions are executed multiple times. So why are they not counted as executed?
Any explanation for that?
The NVIDIA architecture optimizes memory throughput by issuing an instruction for a group of threads called a warp. If each thread accesses a consecutive data element or the same element, then the access can be performed very efficiently. However, if each thread accesses data in a different cache line or at a different address in the same bank, then there is a conflict and the instruction has to be replayed.
inst_executed is the count of instructions retired.
inst_issued is the count of instructions issued. An instruction may be issued multiple times in the case of a vector memory access, memory address conflict, memory bank conflict, etc. On each issue the thread mask is reduced until all threads have completed.
The distinction is made for two reasons:
1. Retirement of an instruction indicates completion of a data dependency. The data dependency is only resolved 1 time despite possible replays.
2. The ratio between issued and executed is a simple way to show opportunities to save warp scheduler issue cycles.
In the Fermi and Kepler SMs, if a memory conflict was encountered, the instruction was replayed (re-issued) until all threads completed. This was performed by the warp scheduler. These replays consume issue cycles, reducing the SM's ability to issue instructions to the math pipes. On these SMs, issued > executed indicates an opportunity for optimization, especially if the issued IPC is high.
In the Maxwell through Turing SMs, replays for vector accesses, address conflicts, and memory conflicts are handled by the memory unit (shared memory, L1, etc.) and do not steal warp scheduler issue cycles. On these SMs, issued is very seldom more than a few percent above executed.
EXAMPLE: A kernel loads a 32-bit value. All 32 threads in the warp are active and each thread accesses a unique cache line (stride = 128 bytes).
On Kepler (CC3.*) SM the instruction is issued 1 time then replayed 31 additional times as the Kepler L1 can only perform 1 tag lookup per request.
inst_executed = 1
inst_issued = 32
On Kepler the instruction has to be replayed again for each request that missed in the L1. If all threads miss in the L1 cache then
inst_executed = 1
inst_issued >= 64 (32 requests + 32 replays for the misses)
On the Maxwell through Turing architectures, the replay is performed by the SM memory system. The replays can limit memory throughput but will not block the warp scheduler from issuing instructions to the math pipes.
inst_executed = 1
inst_issued = 1
On Maxwell through Turing, Nsight Compute/PerfWorks expose throughput counters for each of the memory pipelines, including the number of cycles lost to memory bank conflicts, serialization of atomics, address divergence, etc.
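As an illustration of the example above (my own sketch, not from the original answer), here is a load where each active thread of a warp touches a different 128-byte cache line, plus a way to collect the metrics mentioned in the question:
__global__ void stridedLoad(const int *in, int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // 32 ints = 128 bytes apart, so the 32 threads of a warp hit 32 distinct cache lines
    out[tid] = in[tid * 32];
}
// Collect the counters discussed here (assuming an nvprof-era GPU):
//   nvprof --metrics inst_executed,inst_issued,ldst_executed,ldst_issued ./app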
GPU architecture is based on maximizing throughput rather than minimizing latency. Thus, GPUs (currently) don't really do out-of-order execution or branch prediction. Instead of building a few cores full of complex control logic to make one thread run really fast (like you'd have on a CPU), GPUs rather use those transistors to build more cores to run as many threads as possible in parallel.
As explained on slide 9 of the presentation you linked, executed instructions are the instructions that control flow passes over in your program (basically, the number of lines of assembly code that were run). When you, e.g., execute a global load instruction and the memory request cannot be served immediately (it misses the cache), the GPU will switch to another thread. Once the value is ready in the cache and the GPU switches back to your thread, the load instruction will have to be issued again to complete fetching the value (see also this answer and this thread). When you, e.g., access shared memory and there are bank conflicts, the shared memory access will have to be replayed multiple times for different threads in the warp…
The main reason to differentiate between executed and issued instructions would seem to be that the ratio of the two can serve as a measurement for the amount of overhead your code produces due to instructions that cannot be completed immediately at the time they are executed…
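For the shared-memory case in the previous paragraph, a small sketch of an access pattern that produces bank conflicts and therefore replays (illustrative only, assuming the usual 32 banks of 4-byte words):
__global__ void bankConflict(float *out)
{
    __shared__ float tile[32 * 32];
    int tid = threadIdx.x;
    // Stride of 32 floats: every thread of the warp maps to bank 0, a 32-way conflict,
    // so the shared-memory access is replayed until all threads are served.
    tile[tid * 32] = (float)tid;
    __syncthreads();
    out[tid] = tile[tid * 32];
}
// launched e.g. as bankConflict<<<1, 32>>>(d_out);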

CUDA: Does the compute capability impact the maximum number of active threads?

If I have a device supporting CC 3.0, that means it has a maximum of 2048 active threads per multiprocessor. And if I build for CC 2.0 (compute_20,sm_20), does that mean the maximum number of active threads will be only 1536 per multiprocessor, or does the compute capability I compile for have no impact on this?
Does it have an impact on the shared memory size?
CUDA is designed for scalability; kernels will expand to use all of the resources they can. So it doesn't matter how you compile the kernel; it will fill up all of the available threads unless you do something that prevents it from doing so, like launching it with 768 threads per block.
Now, GPU threads aren't like CPU cores; you aren't losing the ability to do computation if you aren't using all of the threads. A streaming multiprocessor (SM) on a device of compute capability 3.0 can manage 2048 threads simultaneously, but is only capable of executing 256 instructions per tick. There are other limits too; e.g. if you're doing 32-bit floating point addition, it can only do 192 of those per tick. Doing left shifts on 32-bit integers? Only 64 per tick.
The point of having more threads is latency hiding: when one thread is blocked for some reason, such as waiting to fetch a value from memory or to get the result of an arithmetic instruction, the SM will run a different thread instead. Using more threads gives you more opportunities to hide this latency: more chances to have independent work available when some instructions are stalled waiting for data.
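If you want to check these numbers on your own device rather than a table, a small sketch (the empty kernel is just a placeholder):
#include <cstdio>
__global__ void myKernel() { }
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);  // 2048 on CC 3.0
    // With 768-thread blocks, only 2 blocks (1536 threads) fit on a CC 3.0 SM,
    // which is the kind of self-imposed limit mentioned above.
    int numBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel, 768, 0);
    printf("resident 768-thread blocks per SM for this kernel: %d\n", numBlocks);
    return 0;
}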

CUDA shared memory occupancy

If I have 48 KB of shared memory per SM and I write a kernel in which each block allocates 32 KB of shared memory, does that mean only one block can be running on an SM at a time?
Yes, that is correct.
Shared memory must support the "footprint" of all "resident" threadblocks. In order for a threadblock to be launched on a SM, there must be enough shared memory to support it. If not, it will wait until the presently executing threadblock has finished.
There is some nuance to this arriving with Maxwell GPUs (cc 5.0, 5.2). These GPUs support either 64KB (cc 5.0) or 96KB (cc 5.2) of shared memory. In this case, the maximum shared memory available to a single threadblock is still limited to 48KB, but multiple threadblocks may use more than 48KB in aggregate, on a single SM. This means a cc 5.2 SM could support 2 threadblocks, even if both were using 32KB shared memory.
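A quick way to confirm this on a given device (a sketch; the kernel and block size are just examples) is to ask the occupancy API how many blocks fit when each block requests 32 KB of dynamic shared memory:
#include <cstdio>
__global__ void usesSharedMem()
{
    extern __shared__ char buf[];   // dynamic shared memory, sized at launch time
    buf[threadIdx.x] = 0;
}
int main()
{
    const size_t smemPerBlock = 32 * 1024;   // 32 KB per block, as in the question
    int numBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, usesSharedMem, 256, smemPerBlock);
    // 1 on an SM with 48 KB of shared memory; more on parts with a larger per-SM amount (e.g. cc 5.2)
    printf("resident blocks per SM at 32 KB shared memory each: %d\n", numBlocks);
    return 0;
}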

Where does CUDA allocate the stack frame for kernels?

My kernel call fails with "out of memory". It makes significant usage of the stack frame and I was wondering if this is the reason for its failure.
When invoking nvcc with --ptxas-options=-v, it prints the following profile information:
150352 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 59 registers, 40 bytes cmem[0]
Hardware: GTX480, sm20, 1.5GB device memory, 48KB shared memory/multiprocessor.
My question is: where is the stack frame allocated? In shared memory, global memory, constant memory, ...?
I tried with 1 thread per block, as well as with 32 threads per block. Same "out of memory".
Another issue: one can only increase the number of threads resident on one multiprocessor as long as the total number of registers does not exceed the number of registers available on the multiprocessor (32K for my card). Does something similar apply to the stack frame size?
Stack is allocated in local memory. Allocation is per physical thread (GTX480: 15 SM * 1536 threads/SM = 23040 threads). You are requesting 150,352 bytes/thread => ~3.4 GB of stack space. CUDA may reduce the maximum physical threads per launch if the size is that high. The CUDA language is not designed to have a large per thread stack.
In terms of registers, the GTX480 is limited to 63 registers per thread and 32K registers per SM.
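The per-thread stack size the runtime reserves in local memory can be queried and adjusted through the device limit API; a sketch (not from the original answer) using the 150,352-byte figure reported by ptxas:
#include <cstdio>
int main()
{
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("current stack size per thread: %zu bytes\n", stackSize);
    // The stack is backed by local memory sized for the maximum number of resident
    // physical threads, so a frame this large can exceed the 1.5 GB on a GTX480.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 150352);
    if (err != cudaSuccess)
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
    return 0;
}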
The stack frame is most likely in local memory.
I believe there is some limit on local memory usage, but even without it, I think the CUDA driver might allocate local memory for more threads than just the one in your <<<1,1>>> launch configuration.
One way or another, even if you manage to run your code, I fear it may be quite slow because of all those stack operations. Try to reduce the number of function calls (e.g. by inlining those functions).