Why are CUDA memory allocations aligned to 256 bytes? - cuda

According to "cuda alignment 256bytes seriously?", CUDA memory allocations are guaranteed to be aligned to at least 256 bytes.
Why is that the case? 256 bytes is much larger than any numeric data type. It might be the size of a vector, but GPUs do not require loads/stores to be aligned to the size of the whole vector; indeed, they go so far as to support gather/scatter, where every individual element may be placed at any memory address that is a multiple of that element's size.
What purpose does the 256-byte alignment serve?

Why is that the case? 256 bytes is much larger than any numeric data type.
Well, I'm sure there are multiple reasons (e.g. it's easier to manage fewer, larger allocations), but regarding your specific point: don't think about a single value of a numeric data type - think about a full warp's worth. If sizeof(float) is 4, then a warp's worth of floats is 32 * 4 = 128 bytes; and if it's a double or a long int (64-bit int), you get 32 * 8 = 256 bytes (a minimal sketch follows below).
Note: it is not necessary for warps to make such coalesced reads of multiple values from memory. A single thread can read a single unaligned byte and that will work. But performance will suffer if the read pattern does not coalesce into contiguous, aligned chunks (typically of 32 or 128 bytes); see also:
In CUDA, what is memory coalescing, and how is it achieved?
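For illustration, here is a minimal sketch (my own, not from the linked answers) in which each warp reads 32 consecutive doubles, i.e. 32 * 8 = 256 contiguous bytes, so the 256-byte alignment returned by cudaMalloc keeps every warp's load on naturally aligned segments:

```
#include <cstdio>
#include <cuda_runtime.h>

// Each thread loads one double; thread i of a warp reads element base+i,
// so a full warp reads 32 * sizeof(double) = 256 contiguous, aligned bytes.
__global__ void scale(const double* __restrict__ in, double* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0 * in[i];   // coalesced load and store
}

int main()
{
    const int n = 1 << 20;
    double *in, *out;
    cudaMalloc((void**)&in,  n * sizeof(double));   // returned pointers are >= 256-byte aligned
    cudaMalloc((void**)&out, n * sizeof(double));   // (contents left uninitialized; this only shows the access pattern)
    scale<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```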

Related

CUDA memory bandwidth when reading a limited number of finite-sized chunks?

Knowing hardware limits is useful for understanding if your code is performing optimally. The global device memory bandwidth limits how many bytes you can read per second, and you can approach this limit if the chunks you are reading are large enough.
But suppose you are reading, in parallel, N chunks of D bytes each, scattered in random locations in global device memory. Is there a useful formula limiting how much of the bandwidth you'd be able to achieve then?
Let's assume:
we are talking about accesses from device code
a chunk of D bytes means D contiguous bytes
when reading a chunk, the read operation is fully coalesced - those bytes are read 4 bytes per thread, by as many adjacent threads in the block as D/4 dictates.
the temporal and spatial characteristics are such that no two chunks are within 32 bytes of each other - either they are all gapped by at least that much, or else the distribution of loads in time is such that the L2 provides no benefit. This is pretty much saying the L2 hit rate is zero, which seems implicit in your phrase "global device memory bandwidth" - if the L2 hit rate is not zero, you're not measuring (purely) global device memory bandwidth
we are talking about a relatively recent GPU architecture, say Pascal or newer, or else, for an older architecture, the L1 is disabled for global loads. This is pretty much saying the L1 hit rate is zero.
the overall footprint is not so large as to thrash the TLB
the starting address of each chunk is aligned to a 32-byte boundary (&)
your GPU is sufficiently saturated with warps and blocks to make full use of all resources (e.g. all SMs, all SM partitions, etc.)
the actual chunk access pattern (distribution of addresses) does not result in partition camping or some other hard-to-predict effect
In that case, you can simply round the chunk size D up to the next multiple of 32, and do a calculation based on that. What does that mean?
The predicted bandwidth (B) is:
Bd = the device memory bandwidth of your GPU as indicated by deviceQuery
B = Bd/(((D+31)/32)*32)
And the resulting units are chunks/sec (bytes/sec divided by bytes/chunk). The second division shown, (D+31)/32, is integer division, i.e. any fractional part is dropped.
(&) In the case where we don't want this assumption, the worst case is to add an additional 32-byte segment per chunk. The formula then becomes:
B = Bd/((((D+31)/32)+1)*32)
Note that this condition cannot apply when the chunk size is less than 34 bytes.
All I am really doing here is calculating the number of 32-byte DRAM transactions that would be generated by a stream of such requests, and using that to "derate" the observed peak (100% coalesced/100% utilized) case.
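As a hedged sketch of that arithmetic in plain host code (the function name chunks_per_sec and the example values for Bd and D are my own assumptions, not part of the answer):

```
#include <cstdio>

// Predicted chunk rate (chunks/sec) for reading scattered D-byte chunks,
// following the formula above: Bd / (ceil(D/32) * 32).
double chunks_per_sec(double Bd_bytes_per_sec, unsigned D, bool assume_aligned)
{
    unsigned segments = (D + 31) / 32;   // integer division: 32-byte DRAM segments per chunk
    if (!assume_aligned)
        segments += 1;                   // worst case when chunks are not 32-byte aligned
    return Bd_bytes_per_sec / (segments * 32.0);
}

int main()
{
    double   Bd = 700e9;                 // example: ~700 GB/s device memory bandwidth
    unsigned D  = 40;                    // example: 40-byte chunks
    printf("aligned:   %.3e chunks/sec\n", chunks_per_sec(Bd, D, true));
    printf("unaligned: %.3e chunks/sec\n", chunks_per_sec(Bd, D, false));
    return 0;
}
```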
Under @RobertCrovella's assumptions, and assuming the chunk sizes are multiples of 32 bytes and chunks are 32-byte aligned, you will get the same bandwidth as for a single chunk - as Robert's formula tells you. So, no benefit and no detriment.
But ensuring these assumptions hold is often not trivial (even merely ensuring coalesced memory reads).

CUDA coalesced memory access speed depending on word size

I have a CUDA program where one warp needs to access (for example) 96 bits of global memory.
It properly aligns the memory location and lane indices such that the access is coalesced and done in a single transaction.
The program could do the access using 12 lanes each accessing a uint8_t. Alternatively, it could use 6 lanes each accessing a uint16_t, or 3 lanes each accessing a uint32_t.
Is there a performance difference between these alternatives? Is the access faster if each thread accesses a smaller amount of memory?
When the amount of memory each warp needs to access varies, is there a benefit in optimizing the code so that threads access smaller units (16-bit or 8-bit) when possible?
Without knowing how the data will be used once in registers it is hard to state the optimal option. For almost all GPUs the performance difference between these options will likely be very small.
The NVIDIA GPU L1 cache can return either 64 bytes/warp (CC 5.x, 6.x) or 128 bytes/warp (CC 3.x, 7.x). As long as the size is <= 32 bits per thread, the performance should be very similar.
On CC 5.x/6.x there may be a small performance benefit to reducing the number of predicated-true threads (prefer larger data types). The L1TEX unit breaks a global access into 4 requests of 8 threads each. If a full group of 8 threads is predicated off, an L1TEX cycle is saved. Write-back to the register file takes the same number of cycles. The grouping order of threads is not disclosed.
Good practice is to write a micro-benchmark. The CUDA profilers have numerous counters for different portions of the L1TEX path to help see the difference.
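A hedged skeleton for such a micro-benchmark (kernel and buffer names, sizes, and the cudaEvent timing scaffold are my own; here every lane is active and only the per-thread word size varies, so adapting it to the partial-warp case in the question means predicating off the unused lanes):

```
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Same total bytes copied in every variant; only the per-thread word size differs.
template <typename T>
__global__ void copy_words(const T* __restrict__ in, T* __restrict__ out, size_t n_words)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n_words)
        out[i] = in[i];
}

template <typename T>
float time_copy(const void* in, void* out, size_t bytes)
{
    size_t   n_words = bytes / sizeof(T);
    unsigned grid    = (unsigned)((n_words + 255) / 256);
    copy_words<T><<<grid, 256>>>((const T*)in, (T*)out, n_words);   // warm-up launch
    cudaDeviceSynchronize();

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    copy_words<T><<<grid, 256>>>((const T*)in, (T*)out, n_words);   // timed launch
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms;
}

int main()
{
    const size_t bytes = 256u << 20;                 // 256 MiB per buffer
    void *in, *out;
    cudaMalloc(&in, bytes); cudaMalloc(&out, bytes);
    printf("uint8_t : %.3f ms\n", time_copy<uint8_t >(in, out, bytes));
    printf("uint16_t: %.3f ms\n", time_copy<uint16_t>(in, out, bytes));
    printf("uint32_t: %.3f ms\n", time_copy<uint32_t>(in, out, bytes));
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Running each variant through the profiler (e.g. the L1TEX counters mentioned above) will show where any difference comes from.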

Can many threads set a bit on the same word simultaneously?

I need each thread of a warp to decide whether or not to set its respective bit in a 32-bit word. Does this multiple setting take only one memory access, or will there be one memory access for each bit set?
There is no independent bit-setting capability in CUDA. (There is a bit-field-insert instruction in PTX, but it nevertheless operates on a 32-bit quantity.)
Each thread would set a bit by doing a full 32-bit write. Such a write would need to be an atomic RMW operation in order to preserve the other bits. Therefore the accesses would effectively be serialized, at whatever the throughput of atomics is.
If memory space is not a concern, breaking the bits out into separate integers would allow you to avoid atomics.
A 32-bit packed quantity could then be quickly assembled using the __ballot() warp vote function. An example is given in the answer here.
(In fact, the warp vote function may allow you to avoid memory transactions altogether; everything can be handled in registers, if the only result you need is the 32-bit packed quantity.)
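A hedged sketch of both options (kernel names and the flags[] input are mine; __ballot_sync is the current synchronized form of the __ballot() vote mentioned above):

```
// Launch each kernel with a single warp (<<<1, 32>>>); flags[] has 32 entries.

// Option 1: per-thread atomic RMW on the shared word. Correct, but the
// conflicting atomics on the same address are serialized.
__global__ void set_bits_atomic(unsigned int* word, const int* flags)
{
    int lane = threadIdx.x & 31;
    if (flags[lane])
        atomicOr(word, 1u << lane);
}

// Option 2: a warp vote packs all 32 decisions into registers; lane 0 then
// writes the result with one ordinary 32-bit store - no atomics at all.
__global__ void set_bits_ballot(unsigned int* word, const int* flags)
{
    int lane = threadIdx.x & 31;
    unsigned int packed = __ballot_sync(0xFFFFFFFFu, flags[lane] != 0);
    if (lane == 0)
        *word = packed;
}
```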

data per block in CUDA -- does it broadcast in one transaction?

I have a data array that is per-block: there are N blocks inside a CUDA grid and a constant array of data block_data[] with size N.
All threads in a given block X access block_data[X] exactly once and do something with that value.
My question is: does this broadcast scheme work efficiently? If not, what approach should I take?
Edit after comments: my only problem with constant memory is its limited size, since I could have more than 64K blocks. That would mean more than 64KB.
If you just use a normal global memory access then the transaction is fairly inefficient, although depending on how much work your kernel is doing the impact is probably quite small.
I'm assuming sizeof(block_data) is one byte (inferred from your question "...could have more than 64K blocks. That would mean more than 64KB").
If the operation is cached in L1 then you will fetch 128 bytes for the one small piece of info you need (sizeof(block_data)); if other warps in the block request the same data, they should get it from L1. The efficiency of the load is 1/128, but you only pay that once per block.
If the operation is not cached in L1 (e.g. you pass "-dlcm=cg" to the assembler) then you will fetch 32 bytes. The efficiency is 1/32 but you pay that once for each warp.
Once the data is loaded, it is broadcast to all threads in the warp.
An alternative would be to mark the data as const __restrict__, which indicates to the compiler that the data is (a) read-only and (b) not aliased by any other pointer. Since the compiler can detect that the access is uniform, it can optimise the access to use one of the read-only caches (e.g. the constant cache or, on compute capability >= 3.5, the read-only data cache, aka the texture cache).
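A hedged sketch of that last suggestion (hypothetical kernel; float is used for illustration even though the asker's per-block element may be a single byte):

```
// block_data is read-only and not aliased, so the compiler may route the
// per-block load through the read-only (texture) cache; every thread in the
// block requests the same element, and the value is broadcast within each warp.
__global__ void process(const float* __restrict__ block_data,
                        float* __restrict__ out, int items_per_block)
{
    float v    = block_data[blockIdx.x];       // uniform, read-only load
    int   base = blockIdx.x * items_per_block;
    for (int i = threadIdx.x; i < items_per_block; i += blockDim.x)
        out[base + i] *= v;                    // do something with the per-block value
}
```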
If you want to change the values in the block_data[N] array, it is better to use shared memory (__shared__). If you are not changing the values, use constant memory (__constant__) or rely on the caches; the L2 cache is 1536KB on Kepler.
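And, for completeness, a hedged sketch of the __constant__ route, the counterpart of the previous sketch (only viable while the table fits within the 64KB constant-memory limit the asker mentions; names and sizes are mine):

```
#include <cuda_runtime.h>

#define MAX_BLOCKS 16384                     // 16384 floats = 64KB, the constant-memory limit

__constant__ float block_data_c[MAX_BLOCKS]; // read through the constant cache, broadcast per warp

__global__ void process_const(float* out, int items_per_block)
{
    float v    = block_data_c[blockIdx.x];   // uniform index: one constant-cache access, broadcast
    int   base = blockIdx.x * items_per_block;
    for (int i = threadIdx.x; i < items_per_block; i += blockDim.x)
        out[base + i] *= v;
}

// Host side: fill the table before launching, e.g.
//   cudaMemcpyToSymbol(block_data_c, host_table, n_blocks * sizeof(float));
```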

Clarifying memory transactions in CUDA

I am confused about the following statements in the CUDA Programming Guide 4.0, section 5.3.2.1, in the Performance Guidelines chapter:
Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e. whose first address is a multiple of their size) can be read or written by memory transactions.
1) My understanding of device memory was that accesses to it by threads are uncached: if a thread accesses memory location a[i], it fetches only a[i] and none of the values around a[i]. The first statement seems to contradict this. Or perhaps I am misunderstanding the use of the phrase "memory transaction" here?
2) The second sentence does not seem very clear. Can someone explain it?
Memory transactions are performed per warp. So a 32-byte transaction is a warp-sized read of an 8-bit type, a 64-byte transaction is a warp-sized read of a 16-bit type, and a 128-byte transaction is a warp-sized read of a 32-bit type.
The second sentence just means that each transaction has to be naturally aligned, i.e. aligned to its own size: it is not possible for a warp to read a 128-byte transaction with a one-byte offset. See this answer for more details.
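As a hedged illustration (the classic offset-copy experiment, sketched by me rather than taken from the linked answer): with offset = 0 each warp's 128-byte float load falls on naturally aligned segments, while with offset = 1 (one element, i.e. 4 bytes) every warp straddles a segment boundary and extra transactions are issued. A one-byte offset on a float pointer would not merely be slow - it would be a misaligned access and is not allowed at all.

```
// Copy with a configurable element offset. offset = 0 gives fully aligned,
// coalesced 128-byte accesses per warp; offset = 1 shifts every warp's window
// by 4 bytes, so each warp touches one extra naturally aligned segment.
__global__ void offset_copy(const float* __restrict__ in, float* __restrict__ out,
                            int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    if (i < n)
        out[i] = in[i];
}
```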