CUDA: Thread-block Level Broadcast on K40 using Shuffle Instructions

indirectJ2[MAX_SUPER_SIZE] is a shared array.
My CUDA device kernel contains the following statement (executed by all threads in the thread block):
int nnz_col = indirectJ2[MAX_SUPER_SIZE - 1];
I suspect this would cause bank conflicts.
Is there any way I can implement the above thread-block level broadcast efficiently using the new shuffle instructions for Kepler GPUs? I understand how it works at the warp level. Other solutions beyond shuffle instructions (for instance, the use of CUB, etc.) are also welcome.

There is no bank conflict for that line of code on K40. Shared memory accesses already offer a broadcast mechanism. Quoting from the programming guide:
"A shared memory request for a warp does not generate a bank conflict between two threads that access any sub-word within the same 32-bit word or within two 32-bit words whose indices i and j are in the same 64-word aligned segment (i.e., a segment whose first index is a multiple of 64) and such that j=i+32 (even though the addresses of the two sub-words fall in the same bank): In that case, for read accesses, the 32-bit words are broadcast to the requesting threads."
There is no such concept as shared memory bank conflicts at the threadblock level. Bank conflicts only pertain to the access pattern generated by the shared memory request emanating from a single warp, for a single instruction in that warp.
If you like, you can write a simple test kernel and use profiler metrics (e.g. shared_replay_overhead) to test for shared memory bank conflicts.
Warp shuffle mechanisms do not extend beyond a single warp; therefore there is no short shuffle-only sequence that can broadcast a single quantity to multiple warps in a threadblock. Shared memory can be used to provide direct access to a single quantity to all threads in a warp; you are already doing that.
Global memory, __constant__ memory, and kernel parameters can also all be used to "broadcast" the same value to all threads in a threadblock.
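
To make that concrete, here is a minimal sketch (hypothetical kernel and variable names, with an assumed value for MAX_SUPER_SIZE) showing the original shared-memory read alongside the __constant__-memory and kernel-parameter alternatives; per warp, the shared read of a single word is served as a broadcast with no bank conflict:

// Minimal sketch: three ways to make one value visible to every thread in a block.
#define MAX_SUPER_SIZE 64                 // assumed value, for illustration only

__constant__ int nnz_col_const;           // set from the host with cudaMemcpyToSymbol

__global__ void broadcast_demo(const int *indirectJ2_in, int nnz_col_param, int *out)
{
    __shared__ int indirectJ2[MAX_SUPER_SIZE];

    // Cooperative fill of the shared array (details depend on the real kernel)
    if (threadIdx.x < MAX_SUPER_SIZE)
        indirectJ2[threadIdx.x] = indirectJ2_in[threadIdx.x];
    __syncthreads();

    // Every thread reads the same 32-bit word: within each warp the hardware
    // broadcasts it, so no bank conflict and no replay occurs.
    int nnz_col = indirectJ2[MAX_SUPER_SIZE - 1];

    // Equivalent block-wide "broadcasts" that avoid shared memory entirely:
    int from_constant = nnz_col_const;    // served by the constant cache
    int from_param    = nnz_col_param;    // kernel parameters are uniform per launch

    out[blockIdx.x * blockDim.x + threadIdx.x] = nnz_col + from_constant + from_param;
}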

Related

Are load and store operations in shared memory atomic?

I'm trying to figure out whether load and store operations on primitive types are atomic when we load/store from shared memory in CUDA.
On the one hand, it seems that any load/store is compiled to the PTX instruction ld.weak.shared.cta, which does not enforce atomicity. But on the other hand, it is said in the manual that loads are serialized (9.2.3.1):
However, if multiple addresses of a memory request map to the same memory bank, the accesses are serialized
which hints at load/store atomicity "by default" in shared memory. Thus, would the instructions ld.weak.shared.cta and ld.relaxed.shared.cta have the same effect?
Or is it information the compiler needs anyway to avoid optimizing away loads and stores?
More generally, supposing variables are properly aligned, would __shared__ int and __shared__ cuda::atomic<int, cuda::thread_scope_block> provide the same guarantees (when considering only load and store operations)?
Bonus (relevant) question: with a primitive data type properly aligned, stored in global memory, and accessed by threads from a single block, are __device__ int and __device__ cuda::atomic<int, cuda::thread_scope_block> equivalent in terms of atomicity of load/store operations?
Thanks for any insight.
Serialization does not imply atomicity: thread A writes the first 2 bytes of an integer, then thread B reads the variable, and finally thread A writes the last 2 bytes. All of this happens in sequence (not in parallel), but it is still not atomic.
Further, serialization is not guaranteed in all cases, see:
Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.
Conclusion: use atomic.
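
As a rough illustration of the difference being discussed, here is a minimal sketch (hypothetical kernel name; assumes libcu++, shipped with CUDA 11 and later, for cuda::atomic) contrasting a plain __shared__ int with a block-scoped cuda::atomic for simple load/store:

#include <cuda/atomic>

__global__ void flag_demo(int *result)
{
    // Plain shared int: loads/stores compile to weak accesses, with no formal
    // atomicity guarantee from the programming model.
    __shared__ int plain_flag;

    // Block-scoped atomic: loads and stores are guaranteed atomic and cannot
    // be torn or optimized away.
    __shared__ cuda::atomic<int, cuda::thread_scope_block> safe_flag;

    if (threadIdx.x == 0) {
        plain_flag = 42;
        safe_flag.store(42, cuda::std::memory_order_relaxed);
    }
    __syncthreads();

    int a = plain_flag;                                        // e.g. ld.weak.shared.cta
    int b = safe_flag.load(cuda::std::memory_order_relaxed);   // e.g. ld.relaxed.shared.cta

    if (threadIdx.x == 0)
        result[blockIdx.x] = a + b;
}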

Does CUDA broadcast shared memory to all threads in a block without a bank conflict?

In the CUDA programming guide, in the shared memory section, it states that shared memory access by a warp is not serialized but broadcast for reads.
However, it doesn't state what happens if the entire block requests the same memory address. Are the accesses between warps serialized, or can CUDA broadcast to the whole block?
Demo code for my case
// Assume a 1024-sized int array
__global__ void add_from_shared(int* i, int* j, int* out)
{
    __shared__ int shmem[1024];
    shmem[threadIdx.x] = i[threadIdx.x];
    __syncthreads();  // make every thread's shared write visible to the whole block
    ...
    // Do some stuff
    ...
    // Is the shared memory read here serialized between warps, or is it a broadcast over the entire block?
    j[threadIdx.x] += shmem[0];
}
Thanks
Shared memory bank conflicts are only relevant for threads within a warp, on a particular instruction/cycle. All instructions in the GPU are issued warp-wide. They are not issued to all warps in a threadblock, from a single warp scheduler, in the same cycle.
There is no such concept as shared memory bank conflicts between threads in different warps, nor is there any concept of shared memory bank conflicts between threads that are executing different issued instructions.
The warp scheduler will issue the shared read instructions (LDS) to each warp individually. Depending on the access pattern evident among threads in that warp, for that issued instruction, bank conflicts may or may not occur. There are no bank conflicts possible between threads of one warp and threads of another warp.
There is likewise no broadcast mechanism that extends beyond a warp.
All instructions in the GPU are issued per warp.
If all threads in a block read the same address, then the warp scheduler will issue that instruction to one warp, and for the threads in that warp, broadcast will apply. At the same time or at a different time, from the same warp scheduler or a different warp scheduler, the same instruction (i.e. from the same point in the instruction stream) will be issued to another warp. Broadcast will happen within that request as well. Repeat for as many warps as there are in the threadblock.
Your code doesn't contain atomics, or shared memory writes to the same location, and almost nothing I've said here pertains to atomics. Atomics are either warp-aggregated or serialized by the atomic handling mechanism, and multiple (non-atomic) writes to the same location lead to undefined behavior. You can expect that one of the writes will show up in that location, but which one is undefined. From a performance perspective, I don't know of any statements about same-location shared-write performance, and atomics are a completely different animal.
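
A minimal sketch (hypothetical kernel name) of that last point: a plain same-location shared write has one unspecified "winner", while a shared atomicAdd to the same location is well defined:

__global__ void same_location_writes(int *out)
{
    __shared__ int plain_target;
    __shared__ int atomic_target;

    if (threadIdx.x == 0) {
        plain_target  = 0;
        atomic_target = 0;
    }
    __syncthreads();

    plain_target = threadIdx.x;       // all threads write: one value survives, which one is undefined
    atomicAdd(&atomic_target, 1);     // well-defined: final value is blockDim.x

    __syncthreads();
    if (threadIdx.x == 0) {
        out[0] = plain_target;        // some thread index from this block
        out[1] = atomic_target;       // blockDim.x
    }
}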

Nvidia GPU simultaneous access to a single location in global memory

I was wondering what happens when multiple threads within a single warp try to access the same location in global memory (e.g., the same 4-byte word), particularly in Turing GPUs with compute capability 7.5. I believe that in shared memory this would result in a bank conflict, unless all of the threads access the same location, in which case the data would be broadcast.
Just to give a contrived example:
1) Consider that the first 16 threads of a warp access a single 4-byte word, whereas the remaining 16 threads access the next 4-byte word. How is the access handled in such a situation? Is it serialized for every thread of a half-warp?
2) What if the entire warp tries to access a single 4-byte word from a global memory?
There is no serialization. All CUDA GPUs Kepler and newer will broadcast in that scenario. There is no performance hit.
No difference. Any pattern of overlapping read access is handled in a single request, with an optimized number of transactions per request. The transactions per request would be no higher than for an ordinary coalesced one-thread-per-adjacent-location type read, and may be lower. For example, on a modern GPU, you might observe 4 (32-byte) transactions per coalesced global read request. In the case of all threads (in a warp) accessing a single location, it would be only one transaction per request.
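
For illustration, here is a minimal sketch (hypothetical kernel name) of the two access patterns from the question, plus an ordinary coalesced read for comparison; transactions/sectors per request can be inspected in the profiler:

__global__ void read_patterns(const int *g, int *out)
{
    int lane = threadIdx.x % 32;

    // (1) First half-warp reads word 0, second half-warp reads word 1:
    //     still a single request, served without serialization.
    int a = g[lane < 16 ? 0 : 1];

    // (2) Entire warp reads the same word: one transaction per request,
    //     value broadcast to all 32 threads.
    int b = g[0];

    // Ordinary coalesced read, one adjacent word per thread, for comparison.
    int c = g[threadIdx.x + blockIdx.x * blockDim.x];

    out[threadIdx.x + blockIdx.x * blockDim.x] = a + b + c;
}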

Which is faster for CUDA shared-mem atomics - warp locality or anti-locality?

Suppose many warps in a (CUDA kernel grid) block are updating a fair-sized number of shared memory locations, repeatedly.
In which of the cases will such work be completed faster? :
The case of intra-warp access locality, e.g. the total number of memory positions accessed by each warp is small and most of them are indeed accessed by multiple lanes
The case of access anti-locality, where all lanes typically access distinct positions (and perhaps with an effort to avoid bank conflicts)?
and no less importantly - is this microarchitecture-dependent, or is it essentially the same on all recent NVIDIA microarchitectures?
Anti-localized access will be faster.
On SM 5.0 (Maxwell) and above GPUs, for shared memory atomics (assume add), the shared memory unit will replay the instruction due to address conflicts (two lanes with the same address). Normal bank-conflict replays also apply. On Maxwell/Pascal the shared memory unit has fixed round-robin access between the two SM partitions (two schedulers in each partition). For each partition the shared memory unit will complete all replays of the instruction prior to moving to the next instruction. The Volta SM will complete the instruction prior to issuing any other shared memory instruction.
Avoid bank conflicts
Avoid address conflicts
On Fermi and Kepler architecture a shared memory lock operation had to be performed prior to the read modify write operation. This blocked all other warp instructions.
Maxwell and newer GPUs have significantly faster shared memory atomic performance than Fermi/Kepler.
A very simple kernel could be written to micro-benchmark your two different cases. The CUDA profilers provide instruction executed counts and replay counts for shared memory accesses but do not differentiate between replays due to atomics and replays due to load/store conflicts or vector accesses.
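
One possible shape for such a micro-benchmark is sketched below (hypothetical kernel name; the histogram size and iteration count are arbitrary). Each thread repeatedly performs a shared-memory atomicAdd either on a warp-localized bin or on a per-lane bin, so timing the two variants compares the two cases in the question:

#define HIST_BINS 1024
#define ITERS     10000

__global__ void shared_atomic_bench(int *out, bool localized)
{
    __shared__ int hist[HIST_BINS];
    for (int i = threadIdx.x; i < HIST_BINS; i += blockDim.x)
        hist[i] = 0;
    __syncthreads();

    int warp = threadIdx.x / 32;

    // localized: all 32 lanes of a warp contend on the same bin (address conflicts)
    // anti-localized: each lane gets its own bin in a distinct bank (no conflicts)
    int bin = localized ? (warp % HIST_BINS) : (threadIdx.x % HIST_BINS);

    for (int i = 0; i < ITERS; ++i)
        atomicAdd(&hist[bin], 1);

    __syncthreads();
    for (int i = threadIdx.x; i < HIST_BINS; i += blockDim.x)
        atomicAdd(&out[i], hist[i]);
}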
There's a quite simple argument to be made even without needing to know anything about how shared memory atomics are implemented in CUDA hardware: At the end of the day, atomic operations must be serialized somehow at some point. This is true in general, it doesn't matter which platform or hardware you're running on. Atomicity kinda requires that simply by nature. If you have multiple atomic operations issued in parallel, you have to somehow execute them in a way that ensures atomicity. That means that atomic operations will always become slower as contention increases, no matter if we're talking GPU or CPU. The only question is: by how much. That depends on the concrete implementation.
So generally, you want to keep the level of contention, i.e., the number of threads that will be trying to perform atomic operations on the same memory location in parallel, as low as possible…
This is a speculative partial answer.
Consider the related question: Performance of atomic operations on shared memory and its accepted answer.
If the accepted answer there is correct (and continues to be correct even today), then warp threads in a more localized access would get in each other's way, making it slower for many lanes to operate atomically, i.e. making anti-locality of warp atomics better.
But to be honest - I'm not sure I completely buy into this line of argumentation, nor do I know if things have changed since that answer was written.

What's the mechanism of the warps and the banks in CUDA?

I'm a rookie in learning CUDA parallel programming. Now I'm confused about global memory access on the device. It's about the warp model and coalescing.
There are some points:
It's said that threads in one block are split into warps. In each warp there are at most 32 threads. That means all the threads of the same warp will execute simultaneously on the same processor. So what is the sense of a half-warp?
When it comes to the shared memory of one block, it is split into 16 banks. To avoid bank conflicts, multiple threads can READ one bank at the same time but not WRITE to the same bank. Is this a correct interpretation?
Thanks in advance!
The principal usage of "half-warp" was applied to CUDA processors prior to the Fermi generation (e.g. the "Tesla" or GT200 generation, and the original G80/G92 generation). These GPUs were architected with an SM (streaming multiprocessor -- a HW block inside the GPU) that had fewer than 32 thread processors. The definition of warp was still the same, but the actual HW execution took place in "half-warps" at a time. Actually the granular details are more complicated than this, but suffice it to say that the execution model caused memory requests to be issued according to the needs of a half-warp, i.e. 16 threads within the warp. A full warp that hit a memory transaction would thus generate a total of 2 requests for that transaction.

Fermi and newer GPUs have at least 32 thread processors per SM. Therefore a memory transaction is immediately visible across a full warp. As a result, memory requests are issued at the per-warp level, rather than per-half-warp. However, a full memory request can only retrieve 128 bytes at a time. Therefore, for data sizes larger than 32 bits per thread per transaction, the memory controller may still break the request down into a half-warp size.

My view is that, especially for a beginner, it's not necessary to have a detailed understanding of half-warp. It's generally sufficient to understand that it refers to a group of 16 threads executing together and that it has implications for memory requests.
Shared memory, for example, on Fermi-class GPUs is broken into 32 banks. On previous GPUs it was broken into 16 banks. Bank conflicts occur any time an individual bank is accessed by more than one thread in the same memory request (i.e. originating from the same code instruction). To avoid bank conflicts, the basic strategies are very similar to the strategies for coalescing memory requests, e.g. for global memory. On Fermi and newer GPUs, multiple threads can read the same address without causing a bank conflict; in general, a bank conflict occurs when multiple threads in a warp access different 32-bit words that map to the same bank. For further understanding of shared memory and how to avoid bank conflicts, I would recommend the NVIDIA webinar on this topic.
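
As a concrete illustration of one such strategy, here is the classic padded-tile sketch (hypothetical kernel name; assumes 32 banks, a 32x32 thread block, and a square width-by-width matrix whose width is a multiple of 32): padding each row of a shared tile by one element shifts column accesses across banks and removes the conflict.

#define TILE 32

__global__ void transpose_tile(const float *in, float *out, int width)
{
    // A TILE x TILE tile would put every element of a column in the same bank;
    // the +1 padding shifts each row by one bank, making column reads conflict-free.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced global read
    __syncthreads();

    // Read the tile transposed: column access in shared memory, no bank conflicts.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}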