CUDA memory bandwidth when reading a limited number of finite-sized chunks?

Knowing hardware limits is useful for understanding if your code is performing optimally. The global device memory bandwidth limits how many bytes you can read per second, and you can approach this limit if the chunks you are reading are large enough.
But suppose you are reading, in parallel, N chunks of D bytes each, scattered in random locations in global device memory. Is there a useful formula limiting how much of the bandwidth you'd be able to achieve then?

Let's assume:
we are talking about accesses from device code
a chunk of D bytes means D contiguous bytes
when reading a chunk, the read operation is fully coalesced - those bytes are read 4 bytes per thread, by however many adjacent threads in the block are implied by D/4.
the temporal and spatial characteristics are such that no two chunks are within 32 bytes of each other - either they are all gapped by that much, or else the distribution of loads in time is such that the L2 doesn't provide any benefit. Pretty much saying the L2 hit rate is zero. This seems evident in your statement "global device memory bandwidth" - if the L2 hit rate is not zero, you're not measuring (purely) global device memory bandwidth
we are talking about a relatively recent GPU architecture, say Pascal or newer, or else for an older architecture the L1 is disabled for global loads. Pretty much saying the L1 hit rate is zero.
the overall footprint is not so large as to thrash the TLB
the starting address of each chunk is aligned to a 32-byte boundary (&)
your GPU is sufficiently saturated with warps and blocks to make full use of all resources (e.g. all SMs, all SM partitions, etc.)
the actual chunk access pattern (distribution of addresses) does not result in partition camping or some other hard-to-predict effect
In that case, you can simply round the chunk size D up to the next multiple of 32 and base the calculation on that. Concretely, the predicted bandwidth (B) is:
Bd = the device memory bandwidth of your GPU as indicated by deviceQuery
B = Bd/(((D+31)/32)*32)
The resulting units are chunks/sec (bytes/sec divided by bytes/chunk). The inner division, (D+31)/32, is integer division, i.e. it drops any fractional part.
(&) In the case where we don't want this assumption, the worst case is to add an additional 32-byte segment per chunk. The formula then becomes:
B = Bd/((((D+31)/32)+1)*32)
Note that this condition cannot apply when the chunk size is less than 34 bytes.
All I am really doing here is calculating the number of 32-byte DRAM transactions that would be generated by a stream of such requests, and using that to "derate" the observed peak (100% coalesced/100% utilized) case.
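As a rough illustration of the two formulas above, here is a minimal Python sketch. The function names, and the `useful_fraction` helper, are my own and not part of the answer; the unaligned case simply adds the one extra 32-byte segment the answer describes as the worst case.

```python
SEGMENT_BYTES = 32  # DRAM transaction granularity assumed in the answer above


def segments_per_chunk(chunk_bytes: int, aligned: bool = True) -> int:
    """32-byte segments charged to one chunk of chunk_bytes bytes.

    aligned=False applies the answer's worst case: one extra segment.
    """
    segments = (chunk_bytes + SEGMENT_BYTES - 1) // SEGMENT_BYTES  # integer division
    if not aligned:
        segments += 1
    return segments


def chunks_per_second(bd_bytes_per_s: float, chunk_bytes: int, aligned: bool = True) -> float:
    """Predicted rate B = Bd / (segments * 32), in chunks per second."""
    return bd_bytes_per_s / (segments_per_chunk(chunk_bytes, aligned) * SEGMENT_BYTES)


def useful_fraction(chunk_bytes: int, aligned: bool = True) -> float:
    """Share of the moved bytes that the kernel actually asked for."""
    return chunk_bytes / (segments_per_chunk(chunk_bytes, aligned) * SEGMENT_BYTES)
```

For example, an aligned 100-byte chunk is charged 4 segments (128 bytes moved), so at best 100/128 of the peak bandwidth carries useful data.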

Under @RobertCrovella's assumptions, and assuming the chunk sizes are multiples of 32 bytes and chunks are 32-byte aligned, you will get the same bandwidth as for a single chunk - as Robert's formula tells you. So, no benefit and no detriment.
But ensuring these assumptions hold is often not trivial (even merely ensuring coalesced memory reads).

Related

CUDA coalesced memory access speed depending on word size

I have a CUDA program where one warp needs to access (for example) 96 bytes of global memory.
It properly aligns the memory location and lane indices such that the access is coalesced and done in a single transaction.
The program could do the access using 12 lanes each accessing a uint8_t. Alternately it would use 6 lanes accessing a uint16_t, or 3 lanes accessing a uint32_t.
Is there a performance difference between these alternatives, is the access faster if each thread accesses a smaller amount of memory?
When the amounts of memory each warp needs to access vary, is there a benefit in optimizing it such that the threads are made to access smaller units (16bit or 8bit) when possible?
Without knowing how the data will be used once in registers it is hard to state the optimal option. For almost all GPUs the performance difference between these options will likely be very small.
NVIDIA GPU L1 returns either 64 bytes/warp (CC 5.x, 6.x) or 128 bytes/warp (CC 3.x, 7.x). As long as the access size is <= 32 bits per thread, the performance should be very similar.
In CC 5.x/6.x there may be a small performance benefit to reducing the number of predicated-true threads (prefer larger data). The L1TEX unit breaks a global access into 4 x 8 thread requests. If a full group of 8 threads is predicated off, an L1TEX cycle is saved. Write back to the register file takes the same number of cycles. The grouping order of threads is not disclosed.
Good practice is to write a micro-benchmark. The CUDA profilers have numerous counters for different portions of the L1TEX path to help see the difference.

How are L2 transactions mapped to DRAM in GPUs?

In GPUs the transactions to the L2 cache can be of size 32B, 64B or 128B (both read and write). And the total number of such transactions can be measured using nvprof metrics like gst_transactions and gld_transactions. However, I am unable to find any material that details how these transactions are mapped for DRAM access, i.e. how are these transactions handled by the DRAM, which usually has a different bus width? For example, the TitanXp GPU has a 384 bit global memory bus and the P100 has a 3072 bit memory bus. So how are the 32B, 64B or 128B transactions mapped to these memory buses? And how can I measure the number of transactions generated by the DRAM controller?
PS: The dram_read_transactions metric does not seem to do this. I say that because I get the same value for dram_read_transactions on the TitanXp and the P100 (even during sequential access) in spite of the two having widely different bus widths.
Although GPU DRAM may have different (hardware) bus widths across different GPU types, the bus is always composed of a set of partitions, each of which has an effective width of 32 bytes. A DRAM transaction from the profiler perspective actually consists of one of these 32-byte transactions, not a transaction at full "bus width".
Therefore a (single) 32 byte transaction to L2, if it misses in the L2, will convert to a single 32-byte DRAM transaction. Transactions of higher granularity, such as 64-byte or 128-byte, will convert into the requisite number of 32-byte DRAM transactions. This is discoverable using any of the CUDA profilers.
These related questions here and here may be of interest as well.
Note that an "effective width" of 32 bytes, as used above, does not necessarily mean that a transaction requires 32bytes * 8bits/byte = 256 bit wide interface. DRAM busses can be "double-pumped" or "quad-pumped" which means a transaction may consist of multiple bits transferred per "wire" of the interface. Therefore you will find GPUs that have only a 128-bit wide (or even 64-bit wide) interface to GPU DRAM, but a "transaction" on these busses will still consist of 32-bytes, which will require multiple bits to be transferred (probably in multiple DRAM bus clock cycles) per "wire" of the interface.
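The arithmetic in this answer can be sketched in a few lines of Python. The function names are hypothetical, and `interface_width_bits` is assumed to be the width of the interface that carries one transaction, not necessarily the GPU's total bus width.

```python
DRAM_TRANSACTION_BYTES = 32  # profiler-visible DRAM transaction size per the answer


def dram_transactions(l2_transaction_bytes: int) -> int:
    """32-byte DRAM transactions produced by one L2 transaction that misses."""
    if l2_transaction_bytes not in (32, 64, 128):
        raise ValueError("L2 transactions are 32, 64 or 128 bytes")
    return l2_transaction_bytes // DRAM_TRANSACTION_BYTES


def bits_per_wire(interface_width_bits: int) -> int:
    """Bits each wire must carry to move one 32-byte transaction."""
    return DRAM_TRANSACTION_BYTES * 8 // interface_width_bits
```

Under these assumptions a 128-byte L2 miss becomes 4 DRAM transactions, and a 64-bit interface has to move 4 bits per wire for each of them.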

Minimizing registers per thread + "maxregcount" effect

The profiling result for my program says the maximum theoretical occupancy is 50% and the limiter is registers. What are the general guidelines for minimizing the number of registers in CUDA code? The profiling results show that the number of registers is much higher than the number of 32-bit and 16-bit variables I have in my code (per thread). What could be the reason?
Also, setting "maxregcount" to 32 (32 * 2048 (max threads per SMX) = 65536 (max registers per SMX)) solves the occupancy limit issue, but I don't get much of a speedup. Does "maxregcount" try to optimize the code more, so that it isn't wasteful in using registers? Or does it simply choose L1 cache or local memory for register spilling?
As per the NVIDIA presentation given here, if the source exceeds the register limit, local memory is used. It's worth spending time studying this presentation, as it describes various options to increase performance. As Vasily Volkov says in this presentation, occupancy is one of the metrics, not the only one.
Also notice:
32 (32 * 2048 (max threads per SMX) = 65536 (max registers per SMX)) is somewhat wrong, I feel.
32 * 1024 (max threads per block) = 32768 < 65536 (registers per block). You can still increase the number of registers per thread up to 64.
maxrregcount does cause the compiler to rearrange its use of registers, but it's always trying to keep register count low. Where it can't stay below your imposed limit, it will simply spill it to L1, L2 and DRAM. When you have to go to DRAM to fetch your spilled local variables, it can crowd out your explicit memory fetches and/or cause your kernel to become "latency-bound"--that is, computation is held up while waiting for the data to come back.
You might have better luck choosing something between unlimited registers and 32. Often some spilling and less than perfect occupancy beats lots of spilling with 100% occupancy for the reasons given above.
As a side note, you can limit registers for a specific kernel (rather than the whole file) by using __launch_bounds__, which you can read about in the Programming Guide.
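The register arithmetic in this thread can be checked with a small Python sketch. It uses the figures quoted in the question (65536 registers and 2048 threads per SMX) and is only a first-order estimate: it assumes registers are the only limiter and ignores register allocation granularity, block size and shared memory. The function names are my own.

```python
REGS_PER_SM = 65536         # from the question: max registers per SMX
MAX_THREADS_PER_SM = 2048   # from the question: max resident threads per SMX


def max_resident_threads(regs_per_thread: int) -> int:
    """Threads that fit on one SM if registers are the only limiter."""
    return min(MAX_THREADS_PER_SM, REGS_PER_SM // regs_per_thread)


def register_limited_occupancy(regs_per_thread: int) -> float:
    """Theoretical occupancy implied by the register budget alone."""
    return max_resident_threads(regs_per_thread) / MAX_THREADS_PER_SM
```

With 64 registers per thread this estimate gives 50% occupancy, and with 32 it gives 100%, which is consistent with the numbers discussed above. It says nothing about speed: a kernel forced down to 32 registers may spill enough to run slower despite the higher occupancy.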

CUDA memory for lookup tables

I'm designing a set of mathematical functions and implementing them in both CPU and GPU (with CUDA) versions.
Some of these functions are based upon lookup tables. Most of the tables take 4KB, some of them a bit more. The functions based upon lookup tables take an input, pick one or two entry of the lookup table and then compute the result by interpolating or applying similar techniques.
My question is now: where should I save these lookup tables? A CUDA device has many places for storing values (global memory, constant memory, texture memory,...). Provided that every table could be read concurrently by many threads and that the input values, and therefore the lookup indices, can be completely uncorrelated among the threads of every warp (resulting in uncorrelated memory accesses), which memory provides the fastest access?
I add that the contents of these tables are precomputed and completely constant.
EDIT
Just to clarify: I need to store about 10 different 4KB lookup tables. Anyway, it would be great to know whether the solution for this case would be the same for the case with e.g. 100 4KB tables or with e.g. 10 16KB lookup tables.
Texture memory (now called read only data cache) would probably be a choice worth exploring, although not for the interpolation benefits. It supports 32 bit reads without reading beyond this amount. However, you're limited to 48K in total. For Kepler (compute 3.x) this is quite simple to program now.
Global memory, unless you configure it in 32 bit mode, will often drag in 128 bytes for each thread, hugely multiplying the amount of data actually needed from memory, as you (apparently) can't coalesce the memory accesses. Thus the 32 bit mode is probably what you need if you want to use more than 48K (you mentioned 40K).
Thinking of coalescing, if you were to access a set of values in series from these tables, you might be able to interleave the tables such that these combinations could be grouped and read as a 64 or 128 bit read per thread. This would mean the 128 byte reads from global memory could be useful.
The problem you will have is that you're making the solution memory bandwidth limited by using lookup tables. Changing the L1 cache size (on Fermi / compute 2.x) to 48K will likely make a significant difference, especially if you're not using the other 32K of shared memory. Try texture memory and then global memory in 32 bit mode and see which works best for your algorithm. Finally pick a card with a good memory bandwidth figure if you have a choice over hardware.
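To put numbers on the sizes mentioned in this thread, here is a small Python sketch. The 48K limit and the 128-byte fetch are taken from the answer above; the function names and the assumption of 4-byte table entries are mine.

```python
CACHE_LIMIT_BYTES = 48 * 1024  # the 48K limit quoted in the answer


def total_table_bytes(num_tables: int, table_kib: int) -> int:
    """Combined size of num_tables lookup tables of table_kib KB each."""
    return num_tables * table_kib * 1024


def fits_in_limit(num_tables: int, table_kib: int) -> bool:
    """True if all tables together stay within the 48K limit."""
    return total_table_bytes(num_tables, table_kib) <= CACHE_LIMIT_BYTES


def load_efficiency(useful_bytes: int, fetched_bytes: int) -> float:
    """Fraction of the fetched bytes that a thread actually uses."""
    return useful_bytes / fetched_bytes
```

Ten 4KB tables (40K) fit within the limit, while 100 4KB tables or ten 16KB tables do not. An uncoalesced 4-byte lookup that drags in 128 bytes uses only 4/128 of the data it fetches.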

data per block in CUDA -- does it broadcast in one transaction?

I have a data array that is per-block.
I have N blocks inside a CUDA grid and a constant array of data "block_data[]" with size N.
So, all threads in a given block 'X' access block_data[X] just one time, and do something with that value.
My question is: does this broadcast scheme work efficiently?
If not, what approach should I take?
Edit after comments: my only problem with constant memory is its limited size, since I could have more than 64K blocks. That would mean more than 64KB.
regards
If you just use a normal global memory access then the transaction is fairly inefficient, although depending on how much work your kernel is doing the impact is probably quite small.
I'm assuming sizeof(block_data) is one byte (inferred from your question "...could have more than 64K blocks. That would mean more than 64KB").
If the operation is cached in L1 then you will fetch 128 bytes for the one byte of info you need (sizeof(block_data)). If other warps in the block request the same data, they should get it from L1. The efficiency of the load is 1/128 but you should only pay that once for the block.
If the operation is not cached in L1 (e.g. you pass "-dlcm=cg" to the assembler) then you will fetch 32 bytes. The efficiency is 1/32 but you pay that once for each warp.
Once the data is loaded, it is broadcast to all threads in the warp.
An alternative would be to mark the data as const __restrict__ which indicates to the compiler that the data is a) read-only and b) not aliased by any other pointer. Since the compiler can detect that the access is uniform then it can optimise the access to use one of the read-only caches (e.g. constant cache or, on compute capability >=3.5, read-only data cache aka texture cache).
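The cached and uncached cases described in this answer can be compared with a short Python sketch. It takes the answer's figures at face value (one 128-byte fetch per block when cached in L1, one 32-byte fetch per warp otherwise) and assumes every later warp in the block really does hit in L1. The function name is hypothetical.

```python
WARP_SIZE = 32


def bytes_fetched_per_block(threads_per_block: int, cached_in_l1: bool) -> int:
    """Bytes pulled in so that every warp of one block can read block_data[X]."""
    warps = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    if cached_in_l1:
        return 128        # one L1 line, paid once per block
    return 32 * warps     # one 32-byte segment, paid once per warp
```

For a 256-thread block the uncached path fetches 8 x 32 = 256 bytes, twice what the cached path does, while for blocks of 128 threads or fewer it fetches the same amount or less.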
If you want to change the values in the block_data[N] array, it is better to use shared memory (__shared__). If you are not changing the values of block_data[N], use __constant__ memory or rely on the caches. With the L2 cache, you get 1536KB of memory (Kepler).