CUDA: async-copy vs coalesced global memory read atomicity

I was reading about the memory model in CUDA. In particular, when copying data from global to shared memory, my understanding of shared_mem_data[i] = global_mem_data[i] is that it is done in a coalesced atomic fashion, i.e. each thread in the warp reads global_mem_data[i] in a single indivisible transaction. Is that correct?

tl;dr: No.
It is not guaranteed, AFAIK, that all values are read in a single transaction. In fact, a GPU's memory bus is not even guaranteed to be wide enough for a single transaction to retrieve a full warp's width of data (1024 bits for a full warp reading 4 bytes per thread). It is theoretically possible for some values in the read-from locations in memory to change while the read is underway.
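For reference, here is a minimal sketch (hypothetical kernel and names, assuming blockDim.x == 256) of the copy pattern the question describes. Each assignment is an ordinary load followed by a store; the warp's loads may be split across several transactions, and the consistency guarantee within the block comes from the __syncthreads() barrier, not from any atomicity of the copy itself.

```
// Hypothetical sketch of the shared_mem_data[i] = global_mem_data[i] pattern.
// Each assignment is a plain load + store; the warp's loads may be split into
// several memory transactions, so nothing about the copy itself is indivisible.
__global__ void stage_in_shared(const float* global_mem_data, float* out, int n)
{
    __shared__ float shared_mem_data[256];             // assumes blockDim.x == 256

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        shared_mem_data[threadIdx.x] = global_mem_data[i];

    __syncthreads();                                   // visibility comes from the barrier

    if (i < n)
        out[i] = 2.0f * shared_mem_data[threadIdx.x];  // arbitrary use of the staged data
}
```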

Related

CUDA coalesced memory access speed depending on word size

I have a CUDA program where one warp needs to access (for example) 96 bytes of global memory.
It properly aligns the memory location and lane indices such that the access is coalesced and done in a single transaction.
The program could do the access using 12 lanes each accessing a uint8_t. Alternatively, it could use 6 lanes accessing a uint16_t, or 3 lanes accessing a uint32_t.
Is there a performance difference between these alternatives? Is the access faster if each thread accesses a smaller amount of memory?
When the amounts of memory each warp needs to access vary, is there a benefit in optimizing it such that the threads are made to access smaller units (16-bit or 8-bit) when possible?
Without knowing how the data will be used once in registers it is hard to state the optimal option. For almost all GPUs the performance difference between these options will likely be very small.
The NVIDIA GPU L1 cache can return either 64 bytes/warp (CC 5.x, 6.x) or 128 bytes/warp (CC 3.x, 7.x). As long as the access size is <= 32 bits per thread, the performance should be very similar.
On CC 5.x/6.x there may be a small performance benefit in reducing the number of predicated-true threads (i.e. prefer larger data types). The L1TEX unit breaks a global access into 4 requests of 8 threads each. If a full group of 8 threads is predicated off, an L1TEX cycle is saved. Write-back to the register file takes the same number of cycles. The grouping order of threads is not disclosed.
Good practice is to write a micro-benchmark. The CUDA profilers have numerous counters for different portions of the L1TEX path to help see the difference.
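As a starting point for such a micro-benchmark, the sketch below (illustrative names and sizes, not from the original answer) times the same copy done with 32-bit lanes versus 8-bit lanes using CUDA events; the profiler's L1TEX counters can then confirm what the timings suggest.

```
#include <cstdint>
#include <cstdio>

// Copy one element per thread, once with 32-bit lanes and once with 8-bit lanes.
__global__ void copy_u32(const uint32_t* in, uint32_t* out, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

__global__ void copy_u8(const uint8_t* in, uint8_t* out, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    const size_t bytes = 1 << 26;                 // 64 MiB of data
    uint8_t *in, *out;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    copy_u32<<<bytes / 4 / 256, 256>>>((const uint32_t*)in, (uint32_t*)out, bytes / 4);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("uint32_t lanes: %.3f ms\n", ms);

    cudaEventRecord(start);
    copy_u8<<<bytes / 256, 256>>>(in, out, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("uint8_t lanes:  %.3f ms\n", ms);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```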

Can many threads set bits on the same word simultaneously?

I need each thread of a warp to decide whether or not to set its respective bit in a 32-bit word. Does this multiple setting take only one memory access, or will there be one memory access for each bit set?
There is no independent bit-setting capability in CUDA. (There is a bit-field-insert instruction in PTX, but it nevertheless operates on a 32-bit quantity.)
Each thread would set a bit by doing a full 32-bit write. Such a write would need to be an atomic RMW operation in order to preserve the other bits. Therefore the accesses would effectively be serialized, at whatever the throughput of atomics is.
If memory space is not a concern, breaking the bits out into separate integers would allow you to avoid atomics.
A 32-bit packed quantity could then be quickly assembled using the __ballot() warp vote function. An example is given in the answer here.
(In fact, the warp vote function may allow you to avoid memory transactions altogether; everything can be handled in registers, if the only result you need is the 32-bit packed quantity.)
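A hedged sketch of that warp-vote approach (the predicate and names are made up for illustration, and it assumes blockDim.x is a multiple of 32 so every warp is full; on CUDA 9+ the synchronizing variant __ballot_sync is used): each lane contributes one bit, the warp packs them into a single 32-bit word held in a register, and only lane 0 writes it out, so no atomics are needed.

```
// Each lane of the warp votes with one bit; __ballot_sync packs the 32 votes
// into a single 32-bit word held in a register. Only lane 0 stores the word,
// so the bit-setting costs one ordinary 32-bit write per warp and no atomics.
__global__ void pack_bits(const int* values, unsigned int* packed, int n)
{
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;

    int bit = (i < n) && (values[i] > 0);             // this lane's decision (illustrative)
    unsigned int word = __ballot_sync(0xffffffffu, bit);

    if (lane == 0 && i < n)
        packed[i / 32] = word;                        // one 32-bit store per warp
}
```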

data per block in CUDA -- does it broadcast in one transaction?

I have a data array that is per-block.
I have N blocks inside a CUDA grid and a constant array of data "block_data[]" with size N.
So, all threads in a given block X access block_data[X] just once, and do something with that value.
My question is: does this broadcast scheme work efficiently?
If not, what approach should I take?
Edit after comments: my only problem with constant memory is its limited size, since I could have more than 64K blocks. That would mean more than 64KB.
If you just use a normal global memory access then the transaction is fairly inefficient, although depending on how much work your kernel is doing the impact is probably quite small.
I'm assuming each element of block_data is one byte (inferred from your question "...could have more than 64K blocks. That would mean more than 64KB").
If the operation is cached in L1 then you will fetch 128 bytes for the one byte of information you need, and if other warps in the block request the same data then they should get it from L1. The efficiency of the load is 1/128, but you should only pay that once per block.
If the operation is not cached in L1 (e.g. you pass -dlcm=cg to ptxas, typically via -Xptxas -dlcm=cg) then you will fetch 32 bytes. The efficiency is 1/32, but you pay that once per warp.
Once the data is loaded, it is broadcast to all threads in the warp.
An alternative would be to mark the data as const __restrict__, which indicates to the compiler that the data is (a) read-only and (b) not aliased by any other pointer. Since the compiler can detect that the access is uniform, it can optimise the access to use one of the read-only caches (e.g. the constant cache or, on compute capability >= 3.5, the read-only data cache, aka texture cache).
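As an illustration (hypothetical kernel, not from the original question), the const __restrict__ hint looks like this; the per-block load is uniform across the warp, so it can be served through the read-only path and broadcast:

```
// block_data is declared read-only and non-aliased, so the compiler may route
// the uniform block_data[blockIdx.x] load through the read-only/constant path;
// the fetched value is then broadcast to every thread in the warp.
__global__ void scale_by_block_value(const unsigned char* __restrict__ block_data,
                                     float* __restrict__ out, int n)
{
    unsigned char v = block_data[blockIdx.x];   // same address for the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] *= static_cast<float>(v);        // arbitrary use of the per-block value
}
```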
If you want to change the values in the block_data[N] array, it is better to use shared memory (__shared__). If you are not changing the values of block_data[N], use __constant__ memory or rely on the cache hierarchy. The L2 cache gives you 1536 KB of memory on Kepler.

Clarifying memory transactions in CUDA

I am confused about the following statements in the CUDA Programming Guide 4.0, section 5.3.2.1, in the Performance Guidelines chapter.
Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e. whose first address is a multiple of their size) can be read or written by memory transactions.
1) My understanding of device memory was that accesses to device memory by threads are uncached: so if a thread accesses memory location a[i], it will fetch only a[i] and none of the values around a[i]. The first statement seems to contradict this. Or perhaps I am misunderstanding the usage of the phrase "memory transaction" here?
2) The second sentence does not seem very clear. Can someone explain it?
Memory transactions are performed per warp. So a 32-byte transaction is a warp-sized read of an 8-bit type, a 64-byte transaction is a warp-sized read of a 16-bit type, and a 128-byte transaction is a warp-sized read of a 32-bit type.
It just means that all reads have to be aligned to a natural size boundary for the transaction. It is not possible for a warp to perform a 128-byte transaction with a one-byte offset. See this answer for more details.
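To make the alignment point concrete, here is an illustrative kernel (not from the linked answer): with offset == 0 each warp's 128 bytes of float loads fall inside naturally aligned segments, while offset == 1 makes every warp straddle a segment boundary and costs an extra transaction. Pointers returned by cudaMalloc are aligned to at least 256 bytes, so it is the offset parameter that breaks the alignment.

```
// offset == 0: each warp reads a naturally aligned 128-byte span (one transaction
// on hardware with 128-byte segments). offset == 1: the same warp straddles a
// segment boundary, so the hardware must issue an additional transaction.
__global__ void offset_copy(const float* in, float* out, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    if (i < n)
        out[i] = in[i];
}
```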

Which is better, the atomic's competition between: threads of the single Warp or threads of different Warps?

Which is better: atomic contention (concurrency) between threads of a single warp, or between threads of different warps in one block? I think that for shared memory access it is better when the competing threads belong to a single warp rather than to different warps, and that for global memory it is the opposite: it is better when threads of different warps of one block compete rather than threads of a single warp. Is that right?
I need to know this so I can decide how best to resolve the contention and how best to partition the stores: between threads of a single warp or between warps.
Incidentally, is it correct to say that __syncthreads() synchronizes the warps within a single block, and not the threads of one warp?
If a significant number of threads in a block perform atomic updates to the same value, you will get poor performance since those threads must all be serialized. In such cases, it is usually better to have each thread write its result to a separate location and then, in a separate kernel, process those values.
If each thread in a warp performs an atomic update to the same value, all the threads in the warp perform the update in the same clock cycle, so they must all be serialized at the point of the atomic update. This probably means that the warp is scheduled 32 times to get all the threads serviced (very bad).
On the other hand, if a single thread in each warp in a block performs an atomic update to the same value, the impact will be lower because the pairs of warps (the two warps processed at each clock by the two warp schedulers) are offset in time (by one clock cycle), as they move through the processing pipelines. So you end up with only two atomic updates (one from each of the two warps), getting issued within one cycle and needing to immediately be serialized.
So, in the second case, the situation is better, but still problematic. The reason is that, depending on where the shared value is, you can still get serialization between SMs, and this can be very slow since each thread may have to wait for updates to go all the way out to global memory, or at least to L2, and then back. It may be possible to refactor the algorithm in such a way that threads within a block perform atomic updates to a value in shared memory (L1), and then have one thread in each block perform an atomic update to a value in global memory (L2).
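A hedged sketch of that two-level refactoring (names are illustrative): contention for the shared-memory counter is confined to one block, and only one global atomic is issued per block.

```
// Threads update a per-block counter in shared memory; one thread per block then
// folds the partial result into the global counter, so the global atomic is hit
// only once per block instead of once per thread.
__global__ void two_level_count(const int* data, int n, unsigned int* global_count)
{
    __shared__ unsigned int block_count;
    if (threadIdx.x == 0)
        block_count = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0)
        atomicAdd(&block_count, 1u);              // shared-memory atomic, block-local contention

    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(global_count, block_count);     // one global atomic per block
}
```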
Atomic operations can be complete lifesavers, but they tend to be overused by people new to CUDA. It is often better to use a separate step with a parallel reduction or parallel stream compaction algorithm (see thrust::copy_if).