Limiting register usage in CUDA: __launch_bounds__ vs maxrregcount - cuda

From the NVIDIA CUDA C Programming Guide:
Register usage can be controlled using the maxrregcount compiler
option or launch bounds as described in Launch Bounds.
From my understanding (and correct me if I'm wrong), while -maxrregcount limits the number of registers the entire .cu file may use, the __launch_bounds__ qualifier defines the maxThreadsPerBlock and minBlocksPerMultiprocessor for each __global__ kernel. These two accomplish the same task, but in two different ways.
My usage requires me to have 40 registers per thread to maximize the performance. Thus, I can use -maxrregcount 40. I can also force 40 registers by using __launch_bounds__(256, 6) but this causes load & store register spills.
What is the difference between the two to cause these register spills?

The preface of this question is that, quoting the CUDA C Programming Guide,
the fewer registers a kernel uses, the more threads and thread blocks
are likely to reside on a multiprocessor, which can improve
performance.
Now, __launch_bounds__ and maxregcount limit register usage by two different mechanisms.
__launch_bounds__
nvcc decides the number of registers to be used by a __global__ function through balancing the performance and the generality of the kernel launch setup. Saying it differently, such a choice of the number of used registers "guarantees effectiveness" for different numbers of threads per block and of blocks per multiprocessor. However, if an approximate idea of the maximum number of threads per block and (possibly) of the minimum number of blocks per multiprocessor is available at compile-time, then this information can be used to optimize the kernel for such launches. In other words
#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP 2
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
fooKernel(int *inArr, int *outArr)
{
// ... Computation of kernel
}
informs the compiler of a likely launch configuration, so that nvcc can select the number of registers for such a launch configuration in an "optimal" way.
The MAX_THREADS_PER_BLOCK parameter is mandatory, while the MIN_BLOCKS_PER_MP parameter is optional. Also note that if the kernel is launched with a number of threads per block larger than MAX_THREADS_PER_BLOCK, the kernel launch will fail.
The limiting mechanism is described in the Programming Guide as follows:
If launch bounds are specified, the compiler first derives from them
the upper limit L on the number of registers the kernel should use
to ensure that minBlocksPerMultiprocessor blocks (or a single block
if minBlocksPerMultiprocessor is not specified) of
maxThreadsPerBlock threads can reside on the multiprocessor. The
compiler then optimizes register usage in the following way:
If the initial register usage is higher than L, the compiler reduces it further until it becomes less or equal to L, usually at
the expense of more local memory usage and/or higher number of
instructions;
Accordingly, __launch_bounds__ can lead to register spill.
maxrregcount
maxrregcount is a compiler flag that simply hardlimits the number of employed registers to a number set by the user, at variance with __launch_bounds__, by forcing the compiler to rearrange its use of registers. When the compiler can't stay below the imposed limit, it will simply spill it to local memory which is in fact DRAM. Even this local variables are stored in global DRAM memory variables can be cached in L1, L2.

Related

CUDA How Does Kernel Fusion Improve Performance on Memory Bound Applications on the GPU?

I've been conducting research on streaming datasets larger than the memory available on the GPU to the device for basic computations. One of the main limitations is the fact that the PCIe bus is generally limited around 8GB/s, and kernel fusion can help reuse data that can be reused and that it can exploit shared memory and locality within the GPU. Most research papers I have found are very difficult to understand and most of them implement fusion in complex applications such as https://ieeexplore.ieee.org/document/6270615 . I've read many papers and they ALL FAIL TO EXPLAIN some simple steps to fuse two kernels together.
My question is how does fusion actually work?. What are the steps one would go through to change a normal kernel to a fused kernel? Also, is it necessary to have more than one kernel in order to fuse it, as fusing is just a fancy term for eliminating some memory bound issues, and exploiting locality and shared memory.
I need to understand how kernel fusion is used for a basic CUDA program, like matrix multiplication, or addition and subtraction kernels. A really simple example (The code is not correct but should give an idea) like:
int *device_A;
int *device_B;
int *device_C;
cudaMalloc(device_A,sizeof(int)*N);
cudaMemcpyAsync(device_A,host_A, N*sizeof(int),HostToDevice,stream);
KernelAdd<<<block,thread,stream>>>(device_A,device_B); //put result in C
KernelSubtract<<<block,thread,stream>>>(device_C);
cudaMemcpyAsync(host_C,device_C, N*sizeof(int),DeviceToHost,stream); //send final result through the PCIe to the CPU
The basic idea behind kernel fusion is that 2 or more kernels will be converted into 1 kernel. The operations are combined. Initially it may not be obvious what the benefit is. But it can provide two related kinds of benefits:
by reusing the data that a kernel may have populated either in registers or shared memory
by reducing (i.e. eliminating) "redundant" loads and stores
Let's use an example like yours, where we have an Add kernel and a multiply kernel, and assume each kernel works on a vector, and each thread does the following:
Load my element of vector A from global memory
Add a constant to, or multiply by a constant, my vector element
Store my element back out to vector A (in global memory)
This operation requires one read per thread and one write per thread. If we did both of them back-to-back, the sequence of operations would look like:
Add kernel:
Load my element of vector A from global memory
Add a value to my vector element
Store my element back out to vector A (in global memory)
Multiply kernel:
Load my element of vector A from global memory
Multiply my vector element by a value
Store my element back out to vector A (in global memory)
We can see that step 3 in the first kernel and step 1 in the second kernel are doing things that aren't really necessary to achieve the final result, but they are necessary due to the design of these (independent) kernels. There is no way for one kernel to pass results to another kernel except via global memory.
But if we combine the two kernels together, we could write a kernel like this:
Load my element of vector A from global memory
Add a value to my vector element
Multiply my vector element by a value
Store my element back out to vector A (in global memory)
This fused kernel does both operations, produces the same result, but instead of 2 global memory load operations and 2 global memory store operations, it only requires 1 of each.
This savings can be very significant for memory-bound operations (like these) on the GPU. By reducing the number of loads and stores required, the overall performance is improved, usually proportional to the reduction in number of load/store operations.
Here is a trivial code example.

Can many threads set bit on the same word simultaneously?

I need each thread of a warp deciding on setting or not its respective bit in a 32 bits word. Does this multiple setting take only one memory access, or will be one memory access for each bit set?
There is no independent bit-setting capability in CUDA. (There is a bit-field-insert instruction in PTX, but it nevertheless operates on a 32-bit quantity.)
Each thread would set a bit by doing a full 32-bit write. Such a write would need to be an atomic RMW operation in order to preserve the other bits. Therefore the accesses would effectively be serialized, at whatever the throughput of atomics are.
If memory space is not a concern, breaking the bits out into separate integers would allow you to avoid atomics.
A 32-bit packed quantity could then be quickly assembled using the __ballot() warp vote function. An example is given in the answer here.
(In fact, the warp vote function may allow you to avoid memory transactions altogether; everything can be handled in registers, if the only result you need is the 32-bit packed quantity.)

How do I appropriately size and launch a CUDA grid?

First question:
Suppose I need to launch a kernel with 229080 threads on a Tesla C1060 which has compute capability 1.3.
So according to the documentation this machine has 240 cores with 8 cores on each symmetric multiprocessor for a total of 30 SMs.
I can use up to 1024 per SM for a total of 30720 threads running "concurrently".
Now if I define blocks of 256 threads that means I can have 4 blocks for each SM because 1024/256=4. So those 30720 threads can be arranged in 120 blocks across all SMs.
Now for my example of 229080 threads I would need 229080/256=~895 (rounded up) blocks to process all the threads.
Now lets say I want to call a kernel and I must use those 229080 threads so I have two options. The first one is to I divide the problem so that I call the kernel ~8 times in a for loop with a Grid of 120 blocks and 30720 threads each time (229080/30720). That way I make sure the device will stay occupied completely. The other option is to call the kernel with a Grid of 895 blocks for the entire 229080 threads on which case many blocks will remain idle until a SM finishes with the 8 blocks it has.
So which is the preferred option? does it make any difference for those blocks to remain idle waiting? do they take resources?
Second question
Let's say that within the kernel I'm calling I need to access non coalesced global memory so an option is to use shared memory.
I can then use each thread to extract a value from an array on global memory say global_array which is of length 229080. Now as I understand correctly you have to avoid branching when copying to shared memory since all threads on a block need to reach the syncthreads() call to make sure they all can access the shared memory.
The problem here is that for the 229080 threads I need exactly 229080/256=894.84375 blocks because there is a residue of 216 threads. I can round up that number and get 895 blocks and the last block will just use 216 threads.
But since I need to extract the value to shared memory from global_array which is of length 229080 and I can't use a conditional statement to prevent the last 40 threads (256-216) from accessing illegal addresses on global_array then how can I circumvent this problem while working with shared memory loading?
So which is the preferred option? does it make any difference for those blocks to remain idle waiting? do they take resources?
A single kernel is preferred according to what you describe. Threadblocks queued up but not assigned to an SM don't take any resources you need to worry about, and the machine is definitely designed to handle situations just like that. The overhead of 8 kernel calls will definitely be slower, all other things being equal.
Now as I understand correctly you have to avoid branching when copying to shared memory since all threads on a block need to reach the syncthreads() call to make sure they all can access the shared memory.
This statement is not correct on the face of it. You can have branching while copying to shared memory. You just need to make sure that either:
The __syncthreads() is outside the branching construct, or,
The __syncthreads() is reached by all threads within the branching construct (which effectively means that the branch construct evaluates to the same path for all threads in the block, at least at the point where the __syncthreads() barrier is.)
Note that option 1 above is usually achievable, which makes code simpler to follow and easy to verify that all threads can reach the barrier.
But since I need to extract the value to shared memory from global_array which is of length 229080 and I can't use a conditional statement to prevent the last 40 threads (256-216) from accessing illegal addresses on global_array then how can I circumvent this problem while working with shared memory loading?
Do something like this:
int idx = threadIdx.x + (blockDim.x * blockIdx.x);
if (idx < data_size)
shared[threadIdx.x] = global[idx];
__syncthreads();
This is perfectly legal. All threads in the block, whether they are participating in the data copy to shared memory or not, will reach the barrier.

Does CUDA automatically load-balance for you?

I'm hoping for some general advice and clarification on best practices for load balancing in CUDA C, in particular:
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
If so, will the spare processing capacity be assigned to another warp?
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
So in general, for a given call to a kernel what do I need load balance?
Threads in each warp?
Threads in each block?
Threads across all blocks?
Finally, to give an example, what load balancing techniques you would use for the following function:
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly select 5% of the points and log them (or some complicated function)
I write the resulting vector x1 (e.g. [1, log(2), 3, 4, 5, ..., N]) to memory
I repeat the above 2 operations on x1 to yield x2 (e.g. [1, log(log(2)), 3, 4, log(5), ..., N]), and then do a further 8 iterations to yield x3 ... x10
I return x10
Many thanks.
Threads are grouped into three levels that are scheduled differently. Warps utilize SIMD for higher compute density. Thread blocks utilize multithreading for latency tolerance. Grids provide independent, coarse-grained units of work for load balancing across SMs.
Threads in a warp
The hardware executes the 32 threads of a warp together. It can execute 32 instances of a single instruction with different data. If the threads take different control flow, so they are not all executing the same instruction, then some of those 32 execution resources will be idle while the instruction executes. This is called control divergence in CUDA references.
If a kernel exhibits a lot of control divergence, it may be worth redistributing work at this level. This balances work by keeping all execution resources busy within a warp. You can reassign work between threads as shown below.
// Identify which data should be processed
if (should_do_work(threadIdx.x)) {
int tmp_index = atomicAdd(&tmp_counter, 1);
tmp[tmp_index] = threadIdx.x;
}
__syncthreads();
// Assign that work to the first threads in the block
if (threadIdx.x < tmp_counter) {
int thread_index = tmp[threadIdx.x];
do_work(thread_index); // Thread threadIdx.x does work on behalf of thread tmp[threadIdx.x]
}
Warps in a block
On an SM, the hardware schedules warps onto execution units. Some instructions take a while to complete, so the scheduler interleaves the execution of multiple warps to keep the execution units busy. If some warps are not ready to execute, they are skipped with no performance penalty.
There is usually no need for load balancing at this level. Simply ensure that enough warps are available per thread block so that the scheduler can always find a warp that is ready to execute.
Blocks in a grid
The runtime system schedules blocks onto SMs. Several blocks can run concurrently on an SM.
There is usually no need for load balancing at this level. Simply ensure that enough thread blocks are available to fill all SMs several times over. It is useful to overprovision thread blocks to minimize the load imbalance at the end of a kernel, when some SMs are idle and no more thread blocks are ready to execute.
As others have already said, the threads within a warp use a scheme called Single Instruction, Multiple Data (SIMD.) SIMD means that there is a single instruction decoding unit in the hardware controling multiple arithmetic and logic units (ALU's.) A CUDA 'core' is basically just a floating-point ALU, not a full core in the same sense as a CPU core. While the exact CUDA core to instruction decoder ratio varies between different CUDA Compute Capability versions, all of them use this scheme. Since they all use the same instruction decoder, each thread within a warp of threads will execute the exact same instruction on every clock cycle. The cores assigned to the threads within that warp that do not follow the currently-executing code path will simply do nothing on that clock cycle. There is no way to avoid this, as it is an intentional physical hardware limitation. Thus, if you have 32 threads in a warp and each of those 32 threads follows a different code path, you will have no speedup from parallelism at all within that warp. It will execute each of those 32 code paths sequentially. This is why it is ideal for all threads within the warp to follow the same code path as much as possible, since parallelism within a warp is only possible when multiple threads are following the same code path.
The reason that the hardware is designed this way is that it saves chip space. Since each core doesn't have its own instruction decoder, the cores themselves take up less chip space (and use less power.) Having smaller cores that use less power per core means that more cores can be packed onto the chip. Having small cores like this is what allows GPU's to have hundreds or thousands of cores per chip while CPU's only have 4 or 8, even while maintaining similar chip sizes and power consumption (and heat dissipation) levels. The trade off with SIMD is that you can pack a lot more ALU's onto the chip and get a lot more parallelism, but you only get the speedup when those ALU's are all executing the same code path. The reason this trade off is made to such a high degree for GPU's is that much of the computation involved in 3D graphics processing is simply floating-point matrix multiplication. SIMD lends itself well to matrix multiplication because the process to compute each output value of the resultant matrix is identical, just on different data. Furthermore, each output value can be computed completely independently of every other output value, so the threads don't need to communicate with each other at all. Incidentally, similar patterns (and often even matrix multiplication itself) also happen to appear commonly in scientific and engineering applications. This is why General Purpose processing on GPU's (GPGPU) was born. CUDA (and GPGPU in general) was basically an afterthought on how existing hardware designs which were already being mass produced for the gaming industry could also be used to speed up other types of parallel floating-point processing applications.
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
Yes. As soon as you have divergence in a Warp, the scheduler needs to take all divergent branches and process them one by one. The compute capacity of the threads not in the currently executed branch will then be lost. You can check the CUDA Programming Guide, it explains quite well what exactly happens.
If so, will the spare processing capacity be assigned to another warp?
No, unfortunately that is completely lost.
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
Because a Warp has to be SIMD (single instruction, multiple data) to achieve optimal performance, the Warps inside a block can be completely divergent, however, they share some other resources. (Shared Memory, Registers, etc.)
So in general, for a given call to a kernel what do I need load balance?
I don't think load balance is the right word here. Just make sure, that you always have enough Threads being executed all the time and avoid divergence inside warps. Again, the CUDA Programming Guide is a good read for things like that.
Now for the example:
You could execute m threads with m=0..N*0.05, each picking a random number and putting the result of the "complicated function" in x1[m].
However, randomly reading from global memory over a large area isn't the most efficient thing you can do with a GPU, so you should also think about whether that really needs to be completely random.
Others have provided good answers for the theoretical questions.
For your example, you might consider restructuring the problem as follows:
have a vector x of N points: [1, 2, 3, ..., N]
compute some complicated function on every element of x, yielding y.
randomly sample subsets of y to produce y0 through y10.
Step 2 operates on every input element exactly once, without consideration for whether that value is needed. If step 3's sampling is done without replacement, this means that you'll be computing 2x the number of elements you'll actually need, but you'll be computing everything with no control divergence and all memory access will be coherent. These are often much more important drivers of speed on GPUs than the computation itself, but this depends on what the complicated function is really doing.
Step 3 will have a non-coherent memory access pattern, so you'll have to decide whether it's better to do it on the GPU or whether it's faster to transfer it back to the CPU and do the sampling there.
Depending on what the next computation is, you might restructure step 3 to instead randomly draw an integer in [0,N) for each element. If the value is in [N/2,N) then ignore it in the next computation. If it's in [0,N/2), then associate its value with an accumulator for that virtual y* array (or whatever is appropriate for your computation).
Your example is a really good way of showing of reduction.
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly pick 50% of the points and log them (or some complicated function) (1)
I write the resulting vector x1 to memory (2)
I repeat the above 2 operations on x1 to yield x2, and then do a further 8 iterations to yield x3 ... x10 (3)
I return x10 (4)
Say |x0| = 1024, and you pick 50% of the points.
The first stage could be the only stage where you have to read from the global memory, I will show you why.
512 threads read 512 values from memory(1), it stores them into shared memory (2), then for step (3) 256 threads will read random values from shared memory and store them also in shared memory. You do this until you end up with one thread, which will write it back to global memory (4).
You could extend this further by at the initial step having 256 threads reading two values, or 128 threads reading 4 values, etc...

Threads hierarchy design in kernel in CUDA

Assuming a block has limit of 512 threads, say my kernel needs more than 512 threads for execution, how should one design the thread hierarchy for optimal performance?
(case 1)
1st block - 512 threads
2nd block - remaining threads
(case 2) distribute equal number of threads across certain blocks.
I don't think that it really matters, but it is more important to group the thread blocks logically, so that you are able to use other CUDA optimizations (like memory coalescing)
This link provides some insight into how CUDA will (likely) and organize your threads.
A quote from the summary:
To summarize, special parameters at a
kernel launch define the dimensions of
a grid and its blocks. Unique
coordinates in blockId and threadId
variables allow threads of a grid to
distinguish among them. It is the
programmer's responsibility to use
these variables in the kernel
functions so that the threads can
properly identify the portion of the
data to process. These variables
compel the programmers to organize
threads and there data into
hierarchical and multi-dimensional
organizations.
It is preferable to divide equally the threads into two blocks, in order to maximize the computation / memory access overlap. When you have for instance 256 threads in a block, they do not compute all in the same time, there are scheduled on the SM by warp of 32 threads. When a warp is waiting for a global memory data, another warp is scheduled. If you have a small block of threads, your global memory accesses are a lot more penalizing.
Furthermore, in your example you underuse your GPU. Just remember that a GPU have dozens of multiprocessors (eg. 30 for the C1060 Tesla), and a block is mapped to a multiprocessor. In your case, you will only use 2 multiprocessors.