Does CUDA automatically load-balance for you? - cuda

I'm hoping for some general advice and clarification on best practices for load balancing in CUDA C, in particular:
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
If so, will the spare processing capacity be assigned to another warp?
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
So in general, for a given call to a kernel what do I need load balance?
Threads in each warp?
Threads in each block?
Threads across all blocks?
Finally, to give an example, what load balancing techniques you would use for the following function:
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly select 5% of the points and log them (or some complicated function)
I write the resulting vector x1 (e.g. [1, log(2), 3, 4, 5, ..., N]) to memory
I repeat the above 2 operations on x1 to yield x2 (e.g. [1, log(log(2)), 3, 4, log(5), ..., N]), and then do a further 8 iterations to yield x3 ... x10
I return x10
Many thanks.

Threads are grouped into three levels that are scheduled differently. Warps utilize SIMD for higher compute density. Thread blocks utilize multithreading for latency tolerance. Grids provide independent, coarse-grained units of work for load balancing across SMs.
Threads in a warp
The hardware executes the 32 threads of a warp together. It can execute 32 instances of a single instruction with different data. If the threads take different control flow, so they are not all executing the same instruction, then some of those 32 execution resources will be idle while the instruction executes. This is called control divergence in CUDA references.
If a kernel exhibits a lot of control divergence, it may be worth redistributing work at this level. This balances work by keeping all execution resources busy within a warp. You can reassign work between threads as shown below.
// Identify which data should be processed
if (should_do_work(threadIdx.x)) {
int tmp_index = atomicAdd(&tmp_counter, 1);
tmp[tmp_index] = threadIdx.x;
}
__syncthreads();
// Assign that work to the first threads in the block
if (threadIdx.x < tmp_counter) {
int thread_index = tmp[threadIdx.x];
do_work(thread_index); // Thread threadIdx.x does work on behalf of thread tmp[threadIdx.x]
}
Warps in a block
On an SM, the hardware schedules warps onto execution units. Some instructions take a while to complete, so the scheduler interleaves the execution of multiple warps to keep the execution units busy. If some warps are not ready to execute, they are skipped with no performance penalty.
There is usually no need for load balancing at this level. Simply ensure that enough warps are available per thread block so that the scheduler can always find a warp that is ready to execute.
Blocks in a grid
The runtime system schedules blocks onto SMs. Several blocks can run concurrently on an SM.
There is usually no need for load balancing at this level. Simply ensure that enough thread blocks are available to fill all SMs several times over. It is useful to overprovision thread blocks to minimize the load imbalance at the end of a kernel, when some SMs are idle and no more thread blocks are ready to execute.

As others have already said, the threads within a warp use a scheme called Single Instruction, Multiple Data (SIMD.) SIMD means that there is a single instruction decoding unit in the hardware controling multiple arithmetic and logic units (ALU's.) A CUDA 'core' is basically just a floating-point ALU, not a full core in the same sense as a CPU core. While the exact CUDA core to instruction decoder ratio varies between different CUDA Compute Capability versions, all of them use this scheme. Since they all use the same instruction decoder, each thread within a warp of threads will execute the exact same instruction on every clock cycle. The cores assigned to the threads within that warp that do not follow the currently-executing code path will simply do nothing on that clock cycle. There is no way to avoid this, as it is an intentional physical hardware limitation. Thus, if you have 32 threads in a warp and each of those 32 threads follows a different code path, you will have no speedup from parallelism at all within that warp. It will execute each of those 32 code paths sequentially. This is why it is ideal for all threads within the warp to follow the same code path as much as possible, since parallelism within a warp is only possible when multiple threads are following the same code path.
The reason that the hardware is designed this way is that it saves chip space. Since each core doesn't have its own instruction decoder, the cores themselves take up less chip space (and use less power.) Having smaller cores that use less power per core means that more cores can be packed onto the chip. Having small cores like this is what allows GPU's to have hundreds or thousands of cores per chip while CPU's only have 4 or 8, even while maintaining similar chip sizes and power consumption (and heat dissipation) levels. The trade off with SIMD is that you can pack a lot more ALU's onto the chip and get a lot more parallelism, but you only get the speedup when those ALU's are all executing the same code path. The reason this trade off is made to such a high degree for GPU's is that much of the computation involved in 3D graphics processing is simply floating-point matrix multiplication. SIMD lends itself well to matrix multiplication because the process to compute each output value of the resultant matrix is identical, just on different data. Furthermore, each output value can be computed completely independently of every other output value, so the threads don't need to communicate with each other at all. Incidentally, similar patterns (and often even matrix multiplication itself) also happen to appear commonly in scientific and engineering applications. This is why General Purpose processing on GPU's (GPGPU) was born. CUDA (and GPGPU in general) was basically an afterthought on how existing hardware designs which were already being mass produced for the gaming industry could also be used to speed up other types of parallel floating-point processing applications.

If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
Yes. As soon as you have divergence in a Warp, the scheduler needs to take all divergent branches and process them one by one. The compute capacity of the threads not in the currently executed branch will then be lost. You can check the CUDA Programming Guide, it explains quite well what exactly happens.
If so, will the spare processing capacity be assigned to another warp?
No, unfortunately that is completely lost.
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
Because a Warp has to be SIMD (single instruction, multiple data) to achieve optimal performance, the Warps inside a block can be completely divergent, however, they share some other resources. (Shared Memory, Registers, etc.)
So in general, for a given call to a kernel what do I need load balance?
I don't think load balance is the right word here. Just make sure, that you always have enough Threads being executed all the time and avoid divergence inside warps. Again, the CUDA Programming Guide is a good read for things like that.
Now for the example:
You could execute m threads with m=0..N*0.05, each picking a random number and putting the result of the "complicated function" in x1[m].
However, randomly reading from global memory over a large area isn't the most efficient thing you can do with a GPU, so you should also think about whether that really needs to be completely random.

Others have provided good answers for the theoretical questions.
For your example, you might consider restructuring the problem as follows:
have a vector x of N points: [1, 2, 3, ..., N]
compute some complicated function on every element of x, yielding y.
randomly sample subsets of y to produce y0 through y10.
Step 2 operates on every input element exactly once, without consideration for whether that value is needed. If step 3's sampling is done without replacement, this means that you'll be computing 2x the number of elements you'll actually need, but you'll be computing everything with no control divergence and all memory access will be coherent. These are often much more important drivers of speed on GPUs than the computation itself, but this depends on what the complicated function is really doing.
Step 3 will have a non-coherent memory access pattern, so you'll have to decide whether it's better to do it on the GPU or whether it's faster to transfer it back to the CPU and do the sampling there.
Depending on what the next computation is, you might restructure step 3 to instead randomly draw an integer in [0,N) for each element. If the value is in [N/2,N) then ignore it in the next computation. If it's in [0,N/2), then associate its value with an accumulator for that virtual y* array (or whatever is appropriate for your computation).

Your example is a really good way of showing of reduction.
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly pick 50% of the points and log them (or some complicated function) (1)
I write the resulting vector x1 to memory (2)
I repeat the above 2 operations on x1 to yield x2, and then do a further 8 iterations to yield x3 ... x10 (3)
I return x10 (4)
Say |x0| = 1024, and you pick 50% of the points.
The first stage could be the only stage where you have to read from the global memory, I will show you why.
512 threads read 512 values from memory(1), it stores them into shared memory (2), then for step (3) 256 threads will read random values from shared memory and store them also in shared memory. You do this until you end up with one thread, which will write it back to global memory (4).
You could extend this further by at the initial step having 256 threads reading two values, or 128 threads reading 4 values, etc...

Related

Optimising Monte-Carlo algorithm | Reduce operation on GPU & Eigenvalues problem | Many-body problem

This issue reminds some typical many-body problem, but with some extra calculations.
I am working on the generalized Metropolis Monte-Carlo algorithm for the modeling of large number of arbitrary quantum systems (magnetic ions for example) interacting classically with each other. But it actually doesn't matter for the question.
There is more than 100000 interacting objects, each one can be described by a coordinate and a set of parameters describing its current state r_i, s_i.
Can be translated to the C++CUDA as float4 and float4 vectors
To update the system following Monte-Carlo method for such systems, we need to randomly sample 1 object from the whole set; calculate the interaction function for it f(r_j - r_i, s_j); substitute to some matrix and find eigenvectors of it, from which one a new state will be calculated.
The interaction is additive as usual, i.e. the total interaction will be the sum between all pairs.
Formally this can be decomposed into steps
Generate random number i
Calculate the interaction function for all possible pairs f(r_j - r_i, s_j)
Sum it. The result will be a vector F
Multiply it by some tensor and add another one h = h + dot(F,t). Some basic linear algebra stuff.
Find the eigenvectors and eigenvalues, based on some simple algorithm, choose one vector V_k and write in back to the array s_j of all objects's states.
There is a big question, which parts of this can be computed on CUDA kernels.
I am quite new to CUDA programming. So far I ended up with the following algorithm
//a good random generator
std::uniform_int_distribution<std::mt19937::result_type> random_sampler(0, N-1);
for(int i=0; i\<a_lot; ++i) {
//sample a number of object
nextObject = random_sampler(rng);
//call kernel to calculate the interaction and sum it up by threads. also to write down a new state back to the d_s array
CUDACalcAndReduce<THREADS><<<blocksPerGrid, THREADS>>>(d_r, d_s, d_sum, newState, nextObject, previousObject, N);
//copy the sum
cudaMemcpy(buf, d_sum, sizeof(float)*4*blocksPerGrid, cudaMemcpyDeviceToHost);
//manually reduce the rest of the sum
total = buf[0];
for (int i=1; i<blocksPerGrid; ++i) {
total += buf[i];
}
//find eigenvalues and etc. and determine a new state of the object
//just linear algebra with complex numbers
newState = calcNewState(total);
//a new state will be written by CUDA function on the next iteration
//remember the previous number of the object
previousObject = nextObject;
}
The problem is continuous transferring data between CPU and GPU, and the actual number of bytes is blocksPerGrid*4*sizeof(float) which sometimes is just a few bytes. I optimized CUDA code following the guide from NVIDIA and now it limited by the bus speed between CPU and GPU. I guess switching to pinned memory type will not make any sense since the number of transferred bytes is low.
I used Nvidia Visual Profiler and it shows the following
the most time was waisted by the transferring the data to CPU. The speed as one can see by the inset is 57.143 MB/s and the size is only 64B!
The question is is it worth to move the logic of eigenvalues algorithm to CUDA kernel?
Therefore there will be no data transfer between CPU and GPU. The problem with this algorithm, you can update only one object per iteration. It means that I can run the eigensolver only on one CUDA core. ;( Will it be that slow compared to my CPU, that will eliminate the advantage of keeping data inside the GPU ram?
The matrix size for the eigensolver algorithm does not exceed 10x10 complex numbers. I've heard that cuBLAS can be run fully on CUDA kernels without calling the CPU functions, but not sure how it is implemented.
UPD-1
As it was mentioned in the comment section.
For the each iteration we need to diagonalize only one 10x10 complex Hermitian matrix, which depends on the total calculated interaction function f. Then, we in general it is not allowed to a compute a new sum of f, before we update the state of the sampled object based on eigenvectors and eigenvalues of 10x10 matrix.
Due to the stochastic nature of Monte-Carlo approach we need all 10 eigenvectors to pick up a new state for the sampled object.
However, the suggested idea of double-buffering (in the comments) can work out in a way if we calculate the total sum of f for the next j-th iteration without the contribution of i-th sampled object and, then, add it later. I need to test it carefully in action...
UPD-2
The specs are
CPU 4-cores Intel(R) Core(TM) i5-6500 CPU # 3.20GHz
GPU GTX960
quite outdated, but I might find an access to the better system. However, switching to GTX1660 SUPER did not affect the performance, which means that a PCI bus is a bottleneck ;)
The question is is it worth to move the logic of eigenvalues algorithm
to CUDA kernel?
Depends on the system. Old cpu + new gpu? Both new? Both old?
Generally single cuda thread is a lot slower than single cpu thread. Because cuda compiler does not vectorize its loops but host c++ compiler vectorizes. So, you need to use 10-100 cuda threads to make the comparison fair.
For the optimizations:
According to the image, currently it loses 1 microsecond as a serial part of overall algorithm. 1 microsecond is not much compared to the usual kernel-launch latency from CPU but is big when it is GPU launching the kernel (dynamic parallelism) itself.
CUDA-graph feature enables the overall algorithm re-launch every step(kernel) automatically and complete quicker if steps are not CPU-dependent. But it is intended for "graph"-like workloads where some kernel leads to multiple kernels and they later join in another kernel, etc.
CUDA-dynamic-parallelism feature lets a kernel's cuda threads launch new kernels. This has much better timings than launching from CPU due to not waiting for the synchronizations between driver and host.
Sampling part's copying could be made in chunks like 100-1000 elements at once and consumed by CUDA part at once for 100-1000 steps if all parts are in CUDA.
If I were to write it, I would do it like this:
launch a loop kernel (only 1 CUDA thread) that is parent
start loop in the kernel
do real (child) kernel-launching within the loop
since every iteration needs serial, it should sync before continuing next iteration.
end the parent after 100-1000 sized chunk is complete and get new random data from CPU
when parent kernel ends, it shows in profiler as a single kernel launch that takes a lot of time and it doesn't have any CPU-based inefficiencies.
On top of the time saved from not synching a lot, there would be consistency of performance between 10x10 matrix part and the other kernel part because they are always in same hardware, not some different CPU and GPU.
Since random-num generation is always an input for the system, at least it can be double-buffered to hide cpu-to-gpu data copying latency behind the computation. Iirc, random number generation is much cheaper than sending data over pcie bridge. So this would hide mostly the data transmission slowness.
If it is a massively parallel experiment like running the executable N times, you can still launch like 10 executable instances at once and let them keep gpu busy with good efficiency. Not practical if too much memory is required per instance. Many gpus except ancient ones can run tens of kernels in parallel if each of them can not fully occupy all resources of gpu.

CUDA coalesced memory access speed depending on word size

I have a CUDA program where one warp needs to access (for example) 96 bytes of global memory.
It properly aligns the memory location and lane indices such that the access is coalesced and done in a single transaction.
The program could do the access using 12 lanes each accessing a uint8_t. Alternately it would use 6 lanes accessing a uint16_t, or 3 lanes accessing a uint32_t.
Is there a performance difference between these alternatives, is the access faster if each thread accesses a smaller amount of memory?
When the amounts of memory each warp needs to access vary, is there a benefit in optimizing it such that the threads are made to access smaller units (16bit or 8bit) when possible?
Without knowing how the data will be used once in registers it is hard to state the optimal option. For almost all GPUs the performance difference between these options will likely be very small.
NVIDIA GPU L1 supports returning either 64 bytes/warp (CC5.,6.) or 128 bytes/warp (CC3., CC7.) returns from L1. As long as the size <= 32 bits per thread then the performance should be very similar.
In CC 5./6. there may be a small performance benefit to reduce the number of predicated true threads (prefer larger data). The L1TEX unit breaks global access into 4 x 8 thread requests. If full groups of 8 threads are predicated off then a L1TEX cycle is saved. Write back to the register file takes the same number of cycles. The grouping order of threads is not disclosed.
Good practice is to write a micro-benchmark. The CUDA profilers have numerous counters for different portions of the L1TEX path to help see the difference.

CUDA How Does Kernel Fusion Improve Performance on Memory Bound Applications on the GPU?

I've been conducting research on streaming datasets larger than the memory available on the GPU to the device for basic computations. One of the main limitations is the fact that the PCIe bus is generally limited around 8GB/s, and kernel fusion can help reuse data that can be reused and that it can exploit shared memory and locality within the GPU. Most research papers I have found are very difficult to understand and most of them implement fusion in complex applications such as https://ieeexplore.ieee.org/document/6270615 . I've read many papers and they ALL FAIL TO EXPLAIN some simple steps to fuse two kernels together.
My question is how does fusion actually work?. What are the steps one would go through to change a normal kernel to a fused kernel? Also, is it necessary to have more than one kernel in order to fuse it, as fusing is just a fancy term for eliminating some memory bound issues, and exploiting locality and shared memory.
I need to understand how kernel fusion is used for a basic CUDA program, like matrix multiplication, or addition and subtraction kernels. A really simple example (The code is not correct but should give an idea) like:
int *device_A;
int *device_B;
int *device_C;
cudaMalloc(device_A,sizeof(int)*N);
cudaMemcpyAsync(device_A,host_A, N*sizeof(int),HostToDevice,stream);
KernelAdd<<<block,thread,stream>>>(device_A,device_B); //put result in C
KernelSubtract<<<block,thread,stream>>>(device_C);
cudaMemcpyAsync(host_C,device_C, N*sizeof(int),DeviceToHost,stream); //send final result through the PCIe to the CPU
The basic idea behind kernel fusion is that 2 or more kernels will be converted into 1 kernel. The operations are combined. Initially it may not be obvious what the benefit is. But it can provide two related kinds of benefits:
by reusing the data that a kernel may have populated either in registers or shared memory
by reducing (i.e. eliminating) "redundant" loads and stores
Let's use an example like yours, where we have an Add kernel and a multiply kernel, and assume each kernel works on a vector, and each thread does the following:
Load my element of vector A from global memory
Add a constant to, or multiply by a constant, my vector element
Store my element back out to vector A (in global memory)
This operation requires one read per thread and one write per thread. If we did both of them back-to-back, the sequence of operations would look like:
Add kernel:
Load my element of vector A from global memory
Add a value to my vector element
Store my element back out to vector A (in global memory)
Multiply kernel:
Load my element of vector A from global memory
Multiply my vector element by a value
Store my element back out to vector A (in global memory)
We can see that step 3 in the first kernel and step 1 in the second kernel are doing things that aren't really necessary to achieve the final result, but they are necessary due to the design of these (independent) kernels. There is no way for one kernel to pass results to another kernel except via global memory.
But if we combine the two kernels together, we could write a kernel like this:
Load my element of vector A from global memory
Add a value to my vector element
Multiply my vector element by a value
Store my element back out to vector A (in global memory)
This fused kernel does both operations, produces the same result, but instead of 2 global memory load operations and 2 global memory store operations, it only requires 1 of each.
This savings can be very significant for memory-bound operations (like these) on the GPU. By reducing the number of loads and stores required, the overall performance is improved, usually proportional to the reduction in number of load/store operations.
Here is a trivial code example.

How do I appropriately size and launch a CUDA grid?

First question:
Suppose I need to launch a kernel with 229080 threads on a Tesla C1060 which has compute capability 1.3.
So according to the documentation this machine has 240 cores with 8 cores on each symmetric multiprocessor for a total of 30 SMs.
I can use up to 1024 per SM for a total of 30720 threads running "concurrently".
Now if I define blocks of 256 threads that means I can have 4 blocks for each SM because 1024/256=4. So those 30720 threads can be arranged in 120 blocks across all SMs.
Now for my example of 229080 threads I would need 229080/256=~895 (rounded up) blocks to process all the threads.
Now lets say I want to call a kernel and I must use those 229080 threads so I have two options. The first one is to I divide the problem so that I call the kernel ~8 times in a for loop with a Grid of 120 blocks and 30720 threads each time (229080/30720). That way I make sure the device will stay occupied completely. The other option is to call the kernel with a Grid of 895 blocks for the entire 229080 threads on which case many blocks will remain idle until a SM finishes with the 8 blocks it has.
So which is the preferred option? does it make any difference for those blocks to remain idle waiting? do they take resources?
Second question
Let's say that within the kernel I'm calling I need to access non coalesced global memory so an option is to use shared memory.
I can then use each thread to extract a value from an array on global memory say global_array which is of length 229080. Now as I understand correctly you have to avoid branching when copying to shared memory since all threads on a block need to reach the syncthreads() call to make sure they all can access the shared memory.
The problem here is that for the 229080 threads I need exactly 229080/256=894.84375 blocks because there is a residue of 216 threads. I can round up that number and get 895 blocks and the last block will just use 216 threads.
But since I need to extract the value to shared memory from global_array which is of length 229080 and I can't use a conditional statement to prevent the last 40 threads (256-216) from accessing illegal addresses on global_array then how can I circumvent this problem while working with shared memory loading?
So which is the preferred option? does it make any difference for those blocks to remain idle waiting? do they take resources?
A single kernel is preferred according to what you describe. Threadblocks queued up but not assigned to an SM don't take any resources you need to worry about, and the machine is definitely designed to handle situations just like that. The overhead of 8 kernel calls will definitely be slower, all other things being equal.
Now as I understand correctly you have to avoid branching when copying to shared memory since all threads on a block need to reach the syncthreads() call to make sure they all can access the shared memory.
This statement is not correct on the face of it. You can have branching while copying to shared memory. You just need to make sure that either:
The __syncthreads() is outside the branching construct, or,
The __syncthreads() is reached by all threads within the branching construct (which effectively means that the branch construct evaluates to the same path for all threads in the block, at least at the point where the __syncthreads() barrier is.)
Note that option 1 above is usually achievable, which makes code simpler to follow and easy to verify that all threads can reach the barrier.
But since I need to extract the value to shared memory from global_array which is of length 229080 and I can't use a conditional statement to prevent the last 40 threads (256-216) from accessing illegal addresses on global_array then how can I circumvent this problem while working with shared memory loading?
Do something like this:
int idx = threadIdx.x + (blockDim.x * blockIdx.x);
if (idx < data_size)
shared[threadIdx.x] = global[idx];
__syncthreads();
This is perfectly legal. All threads in the block, whether they are participating in the data copy to shared memory or not, will reach the barrier.

how does CUDA schedule its threads

i've got a few questions regarding cuda's scheduling system.
A.When i use for example the foo<<<255, 255>>() function, what actually happens inside of the card? i know that each SM receives from the upper level a block to schedule, and each SM is responsible to schedule its incoming BLOCK, but which part does it? if for example i've got 8 SMs, when each of each contains 8 small CPUs, is the upper level responsible to schedule the remaining 255*255 - (8 * 8) threads?
B.What's the limit of maximum threads that one can define? i mean foo<<<X, Y>>>(); x,y =?
C. Regarding the last example, how many threads can be inside of one block? can we say that the more blocks / threads we have, the faster the execution will be?
Thanks for your help
A. The compute work distributor will distribute a block from the grid to a SM. The SM will convert the block in warps (WARP_SIZE = 32 on all NVIDIA GPUs). Fermi 2.0 GPUs each SM has two warp schedulers which share a set of data paths. Every cycle each warp scheduler picks a warp and issues an instruction to one of data paths (please don't think of CUDA cores). On Fermi 2.1 GPUs each warp scheduler has independent data paths as well as a set of shared data paths. On 2.1 every cycle each warp scheduler will pick a warp and attempt to dual issue instructions for each warp.
The warp schedulers attempt to optimize the use of data paths. This means that it is possible that a single warp will execute multiple instructions in back to back cycle or the warp scheduler can choose to issue from a different warp every cycle.
The number of warps/threads that each SM can handle is specified in the CUDA Programming Guide v.4.2 Table F-1. This scales from 768 threads to 2048 threads (24-64 warps).
B. The maximum threads per launch is defined by the maximum GridDims * the maximum threads per block. See Table F-1 or refer to the documentation for cudaGetDeviceProperties.
C. See the same resources as (B). The optimum distribution of threads/block is defined by your algorithm partitioning and is influenced by the occupancy calculation. There are observable performance impacts based around problem set size of the warps on the SM and the amount of time blocked at instruction barriers (among other things). For starters I recommend at least 2 blocks per SM and ~50% occupancy.
B. It depends on your device. You can use the cuda function cudaGetDeviceProperties to see the specifications for your device. A common maximum number is y=1024 threads per block and x=65535 blocks per Grid dimension.
C.A common practise is to have powers of 2 (128,256,512 etc.) threads/block. Reducing large arrays is very effective that way (see Reduction). The optimum distribution of blocks and threads actually depends on your application and your hardware. I personally use 512 threads/block for large sparse linear algebra computations on a TeslaM2050 since it's the most efficient for my applications.