CUDA coalesced memory access speed depending on word size

I have a CUDA program where one warp needs to access (for example) 12 bytes (96 bits) of global memory.
It properly aligns the memory location and lane indices such that the access is coalesced and done in a single transaction.
The program could do the access using 12 lanes, each accessing a uint8_t. Alternatively, it could use 6 lanes each accessing a uint16_t, or 3 lanes each accessing a uint32_t.
Is there a performance difference between these alternatives? Is the access faster if each thread accesses a smaller amount of memory?
When the amount of memory each warp needs to access varies, is there a benefit in optimizing it so that threads access smaller units (16-bit or 8-bit) when possible?

Without knowing how the data will be used once it is in registers, it is hard to say which option is optimal. For almost all GPUs the performance difference between these options will likely be very small.
The NVIDIA GPU L1 cache returns either 64 bytes/warp (CC 5.x, 6.x) or 128 bytes/warp (CC 3.x, 7.x) per request. As long as the access size is <= 32 bits per thread, the performance should be very similar.
On CC 5.x/6.x there may be a small performance benefit to reducing the number of predicated-true threads (i.e. prefer larger per-thread data). The L1TEX unit breaks a global access into 4 requests of 8 threads each. If a full group of 8 threads is predicated off, an L1TEX cycle is saved. Write-back to the register file takes the same number of cycles. The grouping order of threads is not disclosed.
Good practice is to write a micro-benchmark. The CUDA profilers have numerous counters for different portions of the L1TEX path to help see the difference.
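For example, a minimal micro-benchmark might time the same volume of coalesced loads at each per-thread width and compare the timings (and the L1TEX counters in the profiler). The kernel and helper names, the data size and the launch configuration below are illustrative assumptions, not from the original answer:

#include <cstdint>
#include <cstdio>

// One coalesced load of sizeof(T) bytes per thread; the conditional store
// keeps the compiler from discarding the loads.
template <typename T>
__global__ void load_kernel(const T* __restrict__ in, T* out, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] == T(123))   // never true for zero-filled test data
        out[i] = in[i];
}

template <typename T>
float time_loads(const void* buf, void* sink, size_t bytes)
{
    size_t n = bytes / sizeof(T);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    load_kernel<T><<<(unsigned)((n + 255) / 256), 256>>>((const T*)buf, (T*)sink, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms;
}

int main()
{
    const size_t bytes = 256u << 20;   // 256 MiB of zero-filled test data (arbitrary size)
    void *buf, *sink;
    cudaMalloc(&buf, bytes);
    cudaMalloc(&sink, bytes);
    cudaMemset(buf, 0, bytes);
    printf("uint8_t : %.3f ms\n", time_loads<uint8_t>(buf, sink, bytes));
    printf("uint16_t: %.3f ms\n", time_loads<uint16_t>(buf, sink, bytes));
    printf("uint32_t: %.3f ms\n", time_loads<uint32_t>(buf, sink, bytes));
    cudaFree(buf);
    cudaFree(sink);
    return 0;
}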

Related

CUDA memory bandwidth when reading a limited number of finite-sized chunks?

Knowing hardware limits is useful for understanding if your code is performing optimally. The global device memory bandwidth limits how many bytes you can read per second, and you can approach this limit if the chunks you are reading are large enough.
But suppose you are reading, in parallel, N chunks of D bytes each, scattered in random locations in global device memory. Is there a useful formula limiting how much of the bandwidth you'd be able to achieve then?
Let's assume:
we are talking about accesses from device code
a chunk of D bytes means D contiguous bytes
when reading a chunk, the read operation is fully coalesced - those bytes are read 4 bytes per thread, by however many adjacent threads in the block as dictated by D/4.
the temporal and spatial characteristics are such that no two chunks are within 32 bytes of each other - either they are all separated by at least that much, or else the distribution of loads in time is such that the L2 doesn't provide any benefit. In short, the L2 hit rate is zero. This seems implied by your phrase "global device memory bandwidth" - if the L2 hit rate is not zero, you're not measuring (purely) global device memory bandwidth
we are talking about a relatively recent GPU architecture, say Pascal or newer, or else, for an older architecture, that the L1 is disabled for global loads. In short, the L1 hit rate is zero.
the overall footprint is not so large as to thrash the TLB
the starting address of each chunk is aligned to a 32-byte boundary (&)
your GPU is sufficiently saturated with warps and blocks to make full use of all resources (e.g. all SMs, all SM partitions, etc.)
the actual chunk access pattern (distribution of addresses) does not result in partition camping or some other hard-to-predict effect
In that case, you can simply round the chunk size D up to the next multiple of 32, and do a calculation based on that. What does that mean?
The predicted bandwidth (B) is:
Bd = the device memory bandwidth of your GPU as indicated by deviceQuery
B = Bd/(((D+31)/32)*32)
And the resulting units are chunks/sec (bytes/sec divided by bytes/chunk). The second division operation shown, (D+31)/32, is integer division, i.e. any fractional part is dropped.
(&) In the case where we don't want this assumption, the worst case is to add an additional 32-byte segment per chunk. The formula then becomes:
B = Bd/((((D+31)/32)+1)*32)
Note that this worst case cannot occur when D mod 32 == 1 (e.g. a 1-byte or 33-byte chunk); for such sizes the formula overestimates the segment count by one.
All I am really doing here is calculating the number of 32-byte DRAM transactions that would be generated by a stream of such requests, and using that to "derate" the observed peak (100% coalesced/100% utilized) case.
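As a worked illustration of the formula (the 500 GB/s bandwidth and the 96-byte chunk size are made-up numbers, not from the original answer), a small host-side helper can compute the predicted chunk rate for the aligned and worst-case unaligned variants:

#include <stdint.h>
#include <stdio.h>

// Predicted chunk rate under the assumptions above: round D up to the next
// multiple of 32 (the DRAM segment size) and divide the device memory
// bandwidth by that. Pass aligned = 0 to add the extra worst-case segment.
double predicted_chunks_per_sec(double Bd_bytes_per_sec, uint64_t D, int aligned)
{
    uint64_t segments = (D + 31) / 32;   // integer division, as in the formula
    if (!aligned)
        segments += 1;                   // worst case for unaligned chunks
    return Bd_bytes_per_sec / (double)(segments * 32);
}

int main(void)
{
    // Example numbers for illustration: 500 GB/s device bandwidth, 96-byte chunks.
    printf("aligned:   %.3g chunks/s\n", predicted_chunks_per_sec(500e9, 96, 1));
    printf("unaligned: %.3g chunks/s\n", predicted_chunks_per_sec(500e9, 96, 0));
    return 0;
}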
Under @RobertCrovella's assumptions, and assuming the chunk sizes are multiples of 32 bytes and chunks are 32-byte aligned, you will get the same bandwidth as for a single chunk - as Robert's formula tells you. So, no benefit and no detriment.
But ensuring these assumptions hold is often not trivial (even merely ensuring coalesced memory reads).

CUDA Lookup Table vs. Algorithm

I know this can be tested, but I am interested in the theory: on paper, what should be faster?
I'm trying to work out what would be theoretically faster: a random look-up from a table in shared memory (so bank conflicts are possible) vs. an algorithm with, say, 'n' fp multiplications.
Best case, the shared memory look-up has no bank conflicts and so takes 20-40 clock cycles; worst case is 32 bank conflicts and 640-1280 clock cycles. The multiplications will take 'n' * cycles per instruction. Is this proper reasoning?
Do the fp multiplications each take 1 cycle? 5 cycles? At which point, as a number of multiplications, does it make sense to use a shared memory look-up table?
The multiplications will be 'n' x cycles per instruction. Is this proper reasoning?
When doing 'n' fp multiplications, the cores are kept busy with those operations. It's probably not just multiply instructions either; there will be others like 'mov' in between, so it might be around n*3 instructions in total. When you fetch a cached value from shared memory, the roughly (20-40) * 5 (a guess at the average number of bank conflicts) = ~150 clocks are time during which the cores are free to do other things. If the kernel is compute-bound (limited), then using shared memory might be more efficient. If the kernel has limited shared memory, or if using more shared memory would result in fewer in-flight warps, then recalculating the value would be faster.
Do the fp multiplications each take 1 cycle? 5 cycles?
When I wrote this it was 6 cycles, but that was 7 years ago. It might (or might not) be faster now. That figure is only for a particular core, not the entire SM.
At which point, as a number of multiplications, does it make sense to use a shared memory look-up table? It's really hard to say. There are a lot of variables here, like the GPU generation, what the rest of the kernel is doing, the setup time for the shared memory, etc.
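To make the two alternatives concrete, here is a minimal sketch (the table size, kernel names and the stand-in compute function are assumptions, not from the question): one kernel stages the table in shared memory and does the random lookups, the other recomputes the value directly.

#define TABLE_SIZE 256   // assumed table size

// Stand-in for the chain of 'n' fp multiplications being considered.
__device__ float compute_value(float x)
{
    return x * x * x * 1.234f;
}

// Option 1: random lookups from a table staged in shared memory
// (bank conflicts are possible because the indices are arbitrary).
__global__ void lut_kernel(const float* __restrict__ table_global,
                           const int* __restrict__ indices,
                           float* out, int n)
{
    __shared__ float table[TABLE_SIZE];
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        table[i] = table_global[i];      // cooperatively stage the table once per block
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = table[indices[tid]];
}

// Option 2: recompute the value instead of looking it up
// (no shared memory, but more instructions and possibly more registers).
__global__ void compute_kernel(const float* __restrict__ in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = compute_value(in[tid]);
}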
A problem with generating random numbers in a kernel is also the additional register requirements. This might slow down the rest of the kernel, because higher register usage can result in lower occupancy.
Another solution (again, depending on the problem) would be to use a GPU RNG to fill a global memory array with random numbers, and then have your kernel access these. A global memory access would take 300-500 clock cycles, but there would not be any bank conflicts. Also, with Pascal (not released yet) there will be HBM2, which will likely lower the global memory access time even further.
Hope this helps. Hopefully some other experts can chime in and give you better information.

Minimizing registers per thread + "maxregcount" effect

Profiling results for my program say that the theoretical (maximum achievable) occupancy is 50% and that the limiter is registers. What are the general guidelines for minimizing the number of registers in CUDA code? I see that the profiling results show many more registers (per thread) than the number of 32-bit and 16-bit variables I have in my code. What could be the reason?
Plus, setting "maxregcount" to 32 (32 * 2048 (max threads per SMX) = 65536 (max registers per SMX)) solves the occupancy-limit issue, but I don't get much of a speedup. Does "maxregcount" make the compiler optimize the code further, so it is less wasteful with registers? Or does it simply choose L1 cache or local memory for register spilling?
As per the NVIDIA presentation given here: if the source exceeds the register limit, local memory is used. It's worth spending time studying this presentation, as it describes various options for increasing performance. As Vasily Volkov says in this presentation, occupancy is one of the metrics, not the only one.
Also notice that
32 (registers per thread) * 2048 (max threads per SMX) = 65536 (max registers per SMX)
is somewhat misleading, I feel: a single block can have at most 1024 threads, so 32 * 1024 = 32768 registers per block, which is well below the 65536 registers available per SMX. You could therefore still increase the limit to 64 registers per thread.
maxrregcount does cause the compiler to rearrange its use of registers, but the compiler is always trying to keep the register count low anyway. Where it can't stay below your imposed limit, it simply spills to L1, L2 and DRAM. When spilled local variables have to be fetched from DRAM, those loads can crowd out your explicit memory fetches and/or cause your kernel to become "latency-bound" - that is, computation is held up while waiting for the data to come back.
You might have better luck choosing something between unlimited registers and 32. Often some spilling and less-than-perfect occupancy beats lots of spilling with 100% occupancy, for the reasons given above.
As a side note, you can limit registers for a specific kernel (rather than the whole file) by using __launch_bounds__, which you can read about in the Programming Guide.
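A minimal sketch of that per-kernel limit (the kernel itself is just a placeholder):

// Per-kernel register limiting with __launch_bounds__: at most 256 threads per
// block, with enough registers left for at least 4 resident blocks per SM
// (the compiler derives a register cap from these two numbers).
__global__ void __launch_bounds__(256, 4)
my_kernel(const float* __restrict__ in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // placeholder body
}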

CUDA memory for lookup tables

I'm designing a set of mathematical functions and implementing them in both CPU and GPU (with CUDA) versions.
Some of these functions are based upon lookup tables. Most of the tables take 4KB, some of them a bit more. The functions based upon lookup tables take an input, pick one or two entries of the lookup table and then compute the result by interpolating or applying similar techniques.
My question is now: where should I save these lookup tables? A CUDA device has many places for storing values (global memory, constant memory, texture memory,...). Provided that every table could be read concurrently by many threads and that the input values, and therefore the lookup indices, can be completely uncorrelated among the threads of every warp (resulting in uncorrelated memory accesses), which memory provides the fastest access?
I add that the contents of these tables are precomputed and completely constant.
EDIT
Just to clarify: I need to store about 10 different 4KB lookup tables. Anyway, it would be great to know whether the solution for this case would be the same as for, e.g., 100 4KB tables or 10 16KB lookup tables.
Texture memory (now called read only data cache) would probably be a choice worth exploring, although not for the interpolation benefits. It supports 32 bit reads without reading beyond this amount. However, you're limited to 48K in total. For Kepler (compute 3.x) this is quite simple to program now.
Global memory, unless you configure it in 32-bit mode, will often drag in 128 bytes for each thread, hugely multiplying the amount of data actually needed from memory, since you (apparently) can't coalesce the memory accesses. Thus the 32-bit mode is probably what you need if you want to use more than 48K (you mentioned 40K).
Thinking of coalescing, if you were to access a set of values in series from these tables, you might be able to interleave the tables such that these combinations could be grouped and read as a 64 or 128 bit read per thread. This would mean the 128 byte reads from global memory could be useful.
The problem you will have is that you're making the solution memory bandwidth limited by using lookup tables. Changing the L1 cache size (on Fermi / compute 2.x) to 48K will likely make a significant difference, especially if you're not using the other 32K of shared memory. Try texture memory and then global memory in 32 bit mode and see which works best for your algorithm. Finally pick a card with a good memory bandwidth figure if you have a choice over hardware.
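As a sketch of the read-only data cache route on CC 3.5+ (the table length, the assumption that inputs lie in [0,1], and the linear interpolation step are illustrative, not from the question):

__global__ void interp_kernel(const float* __restrict__ table,  // read-only lookup table
                              const float* __restrict__ x,      // inputs assumed in [0,1]
                              float* y, int n, int table_len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Map the input to a table index plus a fractional part.
    float pos  = x[i] * (table_len - 1);
    int   idx  = min((int)pos, table_len - 2);
    float frac = pos - idx;

    // __ldg() (or just the const __restrict__ qualifiers on newer compilers)
    // routes these uncorrelated reads through the read-only/texture cache.
    float a = __ldg(&table[idx]);
    float b = __ldg(&table[idx + 1]);
    y[i] = a + frac * (b - a);   // linear interpolation between adjacent entries
}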

Does CUDA automatically load-balance for you?

I'm hoping for some general advice and clarification on best practices for load balancing in CUDA C, in particular:
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
If so, will the spare processing capacity be assigned to another warp?
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
So in general, for a given call to a kernel, what do I need to load balance?
Threads in each warp?
Threads in each block?
Threads across all blocks?
Finally, to give an example, what load balancing techniques would you use for the following function:
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly select 5% of the points and log them (or some complicated function)
I write the resulting vector x1 (e.g. [1, log(2), 3, 4, 5, ..., N]) to memory
I repeat the above 2 operations on x1 to yield x2 (e.g. [1, log(log(2)), 3, 4, log(5), ..., N]), and then do a further 8 iterations to yield x3 ... x10
I return x10
Many thanks.
Threads are grouped into three levels that are scheduled differently. Warps utilize SIMD for higher compute density. Thread blocks utilize multithreading for latency tolerance. Grids provide independent, coarse-grained units of work for load balancing across SMs.
Threads in a warp
The hardware executes the 32 threads of a warp together. It can execute 32 instances of a single instruction with different data. If the threads take different control flow, so they are not all executing the same instruction, then some of those 32 execution resources will be idle while the instruction executes. This is called control divergence in CUDA references.
If a kernel exhibits a lot of control divergence, it may be worth redistributing work at this level. This balances work by keeping all execution resources busy within a warp. You can reassign work between threads as shown below.
// Shared scratch space for compacting the work list (BLOCK_SIZE = threads per block)
__shared__ int tmp[BLOCK_SIZE];
__shared__ int tmp_counter;
if (threadIdx.x == 0) tmp_counter = 0;
__syncthreads();

// Identify which data should be processed
if (should_do_work(threadIdx.x)) {
    int tmp_index = atomicAdd(&tmp_counter, 1);
    tmp[tmp_index] = threadIdx.x;
}
__syncthreads();

// Assign that work to the first threads in the block
if (threadIdx.x < tmp_counter) {
    int thread_index = tmp[threadIdx.x];
    do_work(thread_index); // Thread threadIdx.x does work on behalf of thread tmp[threadIdx.x]
}
Warps in a block
On an SM, the hardware schedules warps onto execution units. Some instructions take a while to complete, so the scheduler interleaves the execution of multiple warps to keep the execution units busy. If some warps are not ready to execute, they are skipped with no performance penalty.
There is usually no need for load balancing at this level. Simply ensure that enough warps are available per thread block so that the scheduler can always find a warp that is ready to execute.
Blocks in a grid
The runtime system schedules blocks onto SMs. Several blocks can run concurrently on an SM.
There is usually no need for load balancing at this level. Simply ensure that enough thread blocks are available to fill all SMs several times over. It is useful to overprovision thread blocks to minimize the load imbalance at the end of a kernel, when some SMs are idle and no more thread blocks are ready to execute.
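A common idiom that follows this advice (not from the original answer, just a standard pattern) is the grid-stride loop: launch a fixed, GPU-filling number of blocks and let each thread stride across the data, so the block count no longer has to match the problem size.

// Grid-stride loop: launch enough blocks to fill the GPU several times over
// and let each thread walk the input with a stride of the whole grid, so the
// hardware scheduler balances the tail end of the work automatically.
__global__ void scale_kernel(const float* __restrict__ in, float* out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        out[i] = 2.0f * in[i];   // placeholder per-element work
    }
}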
As others have already said, the threads within a warp use a scheme called Single Instruction, Multiple Data (SIMD). SIMD means that there is a single instruction decoding unit in the hardware controlling multiple arithmetic and logic units (ALUs). A CUDA 'core' is basically just a floating-point ALU, not a full core in the same sense as a CPU core. While the exact CUDA core to instruction decoder ratio varies between different CUDA Compute Capability versions, all of them use this scheme. Since they all use the same instruction decoder, each thread within a warp of threads will execute the exact same instruction on every clock cycle. The cores assigned to the threads within that warp that do not follow the currently-executing code path will simply do nothing on that clock cycle. There is no way to avoid this, as it is an intentional physical hardware limitation. Thus, if you have 32 threads in a warp and each of those 32 threads follows a different code path, you will have no speedup from parallelism at all within that warp. It will execute each of those 32 code paths sequentially. This is why it is ideal for all threads within the warp to follow the same code path as much as possible, since parallelism within a warp is only possible when multiple threads are following the same code path.
The reason that the hardware is designed this way is that it saves chip space. Since each core doesn't have its own instruction decoder, the cores themselves take up less chip space (and use less power). Having smaller cores that use less power per core means that more cores can be packed onto the chip. Having small cores like this is what allows GPUs to have hundreds or thousands of cores per chip while CPUs only have 4 or 8, even while maintaining similar chip sizes and power consumption (and heat dissipation) levels. The trade-off with SIMD is that you can pack a lot more ALUs onto the chip and get a lot more parallelism, but you only get the speedup when those ALUs are all executing the same code path. The reason this trade-off is made to such a high degree for GPUs is that much of the computation involved in 3D graphics processing is simply floating-point matrix multiplication. SIMD lends itself well to matrix multiplication because the process to compute each output value of the resultant matrix is identical, just on different data. Furthermore, each output value can be computed completely independently of every other output value, so the threads don't need to communicate with each other at all. Incidentally, similar patterns (and often even matrix multiplication itself) also happen to appear commonly in scientific and engineering applications. This is why General Purpose processing on GPUs (GPGPU) was born. CUDA (and GPGPU in general) was basically an afterthought on how existing hardware designs which were already being mass produced for the gaming industry could also be used to speed up other types of parallel floating-point processing applications.
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
Yes. As soon as you have divergence in a warp, the scheduler needs to take all divergent branches and process them one by one. The compute capacity of the threads not in the currently executed branch is lost during that time. You can check the CUDA Programming Guide; it explains quite well what exactly happens.
If so, will the spare processing capacity be assigned to another warp?
No, unfortunately that is completely lost.
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
A warp has to execute in SIMD (single instruction, multiple data) fashion to achieve optimal performance, whereas the warps inside a block can diverge from each other freely; however, they share some other resources (shared memory, registers, etc.).
So in general, for a given call to a kernel what do I need to load balance?
I don't think load balancing is the right term here. Just make sure that you always have enough threads being executed and avoid divergence inside warps. Again, the CUDA Programming Guide is a good read for things like that.
Now for the example:
You could execute m threads, m = 0..N*0.05, each picking a random point and putting the result of the "complicated function" in x1[m].
However, randomly reading from global memory over a large area isn't the most efficient thing you can do with a GPU, so you should also think about whether that really needs to be completely random.
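A minimal sketch of that suggestion (the kernel name and the curand-based seeding are assumptions; index collisions are ignored, and x is modified in place rather than compacted into a separate x1):

#include <curand_kernel.h>

// N/20 threads (5% of the points), each picking one random index of x and
// replacing that entry with log(x). The random index is the scattered global
// access discussed above.
__global__ void random_log_kernel(float* x, int N, unsigned long long seed)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= N / 20) return;

    curandState state;
    curand_init(seed, m, 0, &state);
    int idx = curand(&state) % N;   // random point to transform

    x[idx] = logf(x[idx]);          // the "complicated function"
}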
Others have provided good answers for the theoretical questions.
For your example, you might consider restructuring the problem as follows:
have a vector x of N points: [1, 2, 3, ..., N]
compute some complicated function on every element of x, yielding y.
randomly sample subsets of y to produce y0 through y10.
Step 2 operates on every input element exactly once, without consideration for whether that value is needed. If step 3's sampling is done without replacement, this means that you'll be computing 2x the number of elements you'll actually need, but you'll be computing everything with no control divergence and all memory access will be coherent. These are often much more important drivers of speed on GPUs than the computation itself, but this depends on what the complicated function is really doing.
Step 3 will have a non-coherent memory access pattern, so you'll have to decide whether it's better to do it on the GPU or whether it's faster to transfer it back to the CPU and do the sampling there.
Depending on what the next computation is, you might restructure step 3 to instead randomly draw an integer in [0,N) for each element. If the value is in [N/2,N) then ignore it in the next computation. If it's in [0,N/2), then associate its value with an accumulator for that virtual y* array (or whatever is appropriate for your computation).
Your example is a really good way of showing off reduction.
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly pick 50% of the points and log them (or some complicated function) (1)
I write the resulting vector x1 to memory (2)
I repeat the above 2 operations on x1 to yield x2, and then do a further 8 iterations to yield x3 ... x10 (3)
I return x10 (4)
Say |x0| = 1024, and you pick 50% of the points.
The first stage could be the only stage where you have to read from global memory; I will show you why.
512 threads read 512 values from memory (1) and store them into shared memory (2); then, for step (3), 256 threads read random values from shared memory and store them back into shared memory. You keep halving until you end up with one thread, which writes the result back to global memory (4).
You could extend this further by having, at the initial step, 256 threads read two values each, or 128 threads read four values each, and so on.
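A minimal sketch of that staging pattern for |x0| = 1024 and a single block of 512 threads (logf stands in for the "complicated function", curand for the random pick, and collisions between threads choosing the same index are ignored):

#include <curand_kernel.h>

__global__ void staged_random_log(const float* __restrict__ x0, float* result,
                                  unsigned long long seed)
{
    __shared__ float s[512];
    int tid = threadIdx.x;

    curandState rng;
    curand_init(seed, tid, 0, &rng);

    // (1)+(2): 512 threads each pick one of the 1024 points, apply the function,
    // and store the result in shared memory. This is the only global read.
    s[tid] = logf(x0[curand(&rng) % 1024]);
    __syncthreads();

    // (3): repeat on the shared-memory copy, halving the number of active
    // threads each iteration, until a single value remains.
    for (int active = 256; active >= 1; active /= 2) {
        float v = 0.0f;
        if (tid < active)
            v = logf(s[curand(&rng) % (active * 2)]);  // read the previous stage
        __syncthreads();                               // finish all reads first
        if (tid < active)
            s[tid] = v;                                // then overwrite shared memory
        __syncthreads();
    }

    // (4): one thread writes the final value back to global memory.
    if (tid == 0)
        *result = s[0];
}

It would be launched as staged_random_log<<<1, 512>>>(d_x0, d_result, seed); larger inputs would need multiple blocks or the wider initial reads suggested above.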