Nvidia GPU simultaneous access to a single location in global memory - cuda

I was wondering what happens when multiple threads within a single warp try to access the same location in global memory (e.g., the same 4-byte word), particularly on Turing GPUs with compute capability 7.5. I believe that in shared memory this would result in a bank conflict, unless all of the threads access the same location, in which case the data is broadcast.
Just to give a contrived example:
1) Consider that the first 16 threads of a warp access a single 4-byte word, whereas the remaining 16 threads access the next 4-byte word. How is the access handled in such a situation? Is it serialized for every thread of a half-warp?
2) What if the entire warp tries to access a single 4-byte word in global memory?

There is no serialization. All CUDA GPUs Kepler and newer will broadcast in that scenario. There is no performance hit.
No difference. Any pattern of overlapping read access is handled in a single request, with an optimized number of transactions per request. The transactions per request would be no higher than for an ordinary coalesced one-thread-per-adjacent-location type read, and may be lower. For example, on a modern GPU, you might observe 4 (32-byte) transactions per coalesced global read request. In the case of all threads (in a warp) accessing a single location, it would be only one transaction per request.
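For illustration, here is a minimal sketch (the kernel and array names are my own, not from the question) contrasting the two patterns described above; in both cases the warp issues a single load request, and in the all-threads-same-word case a single transaction services it:

    __global__ void broadcast_read(const int *in, int *out)
    {
        int lane = threadIdx.x % 32;

        // pattern 1: the first half-warp reads word 0, the second half-warp reads word 1
        int a = in[lane < 16 ? 0 : 1];

        // pattern 2: every thread in the warp reads the same 4-byte word (broadcast)
        int b = in[0];

        out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
    }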

What are CUDA Global Memory 32-, 64- and 128-byte transactions?

I am relatively new to CUDA programming.
In this blog (How to Access Global Memory Efficiently in CUDA C/C++ Kernels), we have the following:
"The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size."
The 128-byte transaction is also mentioned in this post (The cost of CUDA global memory transactions)
In addition, 32- and 128-byte memory transactions are also mentioned in the CUDA C Programming Guide. This guide also shows Figure 20 about aligned and misaligned access, which I couldn't quite understand.
Could you explain and give examples of how 32-, 64-, and 128-byte transactions happen?
Could you go through Figure 20 in more detail? What is the point that the figure is making?
Both of these need to be understood in the context of a CUDA warp. All operations are issued warp-wide, and this includes instructions that access memory.
An individual CUDA thread can access 1, 2, 4, 8, or 16 bytes in a single instruction or transaction. When considered warp-wide, that translates to 32 bytes all the way up to 512 bytes. The GPU memory controller can typically issue requests to memory in granularities of 32 bytes, up to 128 bytes. Larger requests (say, 512 bytes, considered warp-wide) will get issued via multiple "transactions" of typically no more than 128 bytes.
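As a rough sketch (the types and names here are illustrative, not from the original question), the per-thread width of a load determines the warp-wide footprint:

    __global__ void access_widths(const char *c, const float *f, const float4 *v, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        char   a = c[i];   //  1 byte per thread  ->  32 bytes warp-wide
        float  b = f[i];   //  4 bytes per thread -> 128 bytes warp-wide
        float4 d = v[i];   // 16 bytes per thread -> 512 bytes warp-wide (multiple transactions)

        out[i] = a + b + d.x + d.y + d.z + d.w;
    }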
Modern DRAM memory has the design characteristic that you don't typically ask for a single byte, you request a "segment" typically of 32 bytes at a time for typical GPU designs. The division of memory into segments is fixed at design time. As a result, you can request either the first 32 bytes (the first segment) or the second 32 bytes (the second segment). You cannot request bytes 16-47 for example. This is all a function of the DRAM design, but it manifests in terms of memory behavior.
The diagrams depict the behavior of each thread in a warp. Individually, the threads are depicted by the gray/black arrows pointing upwards. Each arrow represents the request from one thread, and the arrow points to the relative location in memory that that thread would like to load or store.
The diagrams are presented in comparison to each other to show the effect of "alignment". When considered warp-wide, if all 32 threads are requesting bytes of data that belong to a single segment, this would require the memory controller to retrieve only one segment to satisfy the request. This would arguably be the most efficient possible behavior (and therefore data organization as well as access pattern, considered warp-wide) for a single request (i.e. a single load or store instruction).
However if the addresses emanating from each thread in the warp result in a pattern depicted in the 2nd figure, this would be "unaligned", and even though you are effectively asking for a similar data "footprint", the lack of alignment to a single segment means the memory controller will need to retrieve 2 segments from memory to satisfy the request.
That is the key point of understanding associated with the figure. But there is more to the story than that. Misaligned access is not necessarily as tragic (performance cut in half) as this might suggest. The GPU caches play a role here, when we consider these transactions not just in the context of a single warp, but across many warps.
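If you want to see the effect yourself, a sketch along the lines of the copy kernel in the blog post linked above (the kernel name is mine, and the arrays are assumed to be allocated large enough for the offset) lets you vary the alignment and watch the transaction counts in the profiler:

    __global__ void offset_copy(const float *in, float *out, int offset)
    {
        // with offset = 0 each warp's 128-byte footprint maps onto aligned segments;
        // with offset = 1 the same footprint straddles a segment boundary and
        // requires an extra transaction to service
        int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
        out[i] = in[i];
    }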
To get a more complete and orderly treatment of these topics, I suggest referring to various training material. It's by no means the only one, but unit 4 of this training series will cover the topic in more detail.

How many simultaneous read instructions per thread on modern GPU?

On a modern GPU (let's say, Kepler), if I have 4 independent global memory reads (no dependencies between reads) from a single thread, will all 4 reads be pipelined at once, so that I only pay the latency penalty of a single global memory read? What about from shared memory? How many reads can be in the pipeline at once, is this documented somewhere?
The GPU threads do not work in that way. Multiple global memory reads from a single thread will never be combined.
However, multiple global memory reads from different threads may be combined if they are issued at the same time and the locations they are reading are within 128 bytes. This happens within a warp (a group of threads that always execute the same instruction). For example, if threads 0~31 in a warp read input[0~31] of type float, all these reads will be combined into one memory transaction (assuming the data is properly aligned). But if threads 0~31 in a warp read input[0,2,4,...,62], these reads will be combined into two memory transactions and half of the fetched data will be read and discarded.
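A sketch of those two patterns (the kernel names are mine):

    __global__ void unit_stride(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];      // threads 0~31 read input[0~31]: one transaction per warp
    }

    __global__ void stride_two(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[2 * i];  // threads 0~31 read input[0,2,...,62]: two transactions,
                             // half of the fetched data is discarded
    }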
For shared memory, the latency is roughly 100x lower than for global memory access. The main concern there is to avoid bank conflicts.
You may want to read the following links for more information.
https://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/
https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-memory-throughput
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#device-memory-spaces

CUDA: Thread-block Level Broadcast on K40 using Shuffle instructions

indirectJ2[MAX_SUPER_SIZE] is a shared array.
My CUDA device kernel contains the following statement (executed by all threads in the thread block):
int nnz_col = indirectJ2[MAX_SUPER_SIZE - 1];
I suspect this would cause bank conflicts.
Is there any way I can implement the above thread-block-level broadcast efficiently using the new shuffle instructions on Kepler GPUs? I understand how it works at the warp level. Other solutions beyond shuffle instructions (for instance, the use of CUB, etc.) are also welcome.
There is no bank conflict for that line of code on K40. Shared memory accesses already offer a broadcast mechanism. Quoting from the programming guide
"A shared memory request for a warp does not generate a bank conflict between two threads that access any sub-word within the same 32-bit word or within two 32-bit words whose indices i and j are in the same 64-word aligned segment (i.e., a segment whose first index is a multiple of 64) and such that j=i+32 (even though the addresses of the two sub-words fall in the same bank): In that case, for read accesses, the 32-bit words are broadcast to the requesting threads "
There is no such concept as shared memory bank conflicts at the threadblock level. Bank conflicts only pertain to the access pattern generated by the shared memory request emanating from a single warp, for a single instruction in that warp.
If you like, you can write a simple test kernel and use profiler metrics (e.g. shared_replay_overhead) to test for shared memory bank conflicts.
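For example, something like the following sketch could be profiled (MAX_SUPER_SIZE, the input array, and the launch shape are assumed here, not taken from your code; launch with at least MAX_SUPER_SIZE threads per block). If the broadcast read caused bank conflicts, shared_replay_overhead would be non-zero for this kernel:

    #define MAX_SUPER_SIZE 64

    __global__ void broadcast_test(const int *in, int *out)
    {
        __shared__ int indirectJ2[MAX_SUPER_SIZE];

        if (threadIdx.x < MAX_SUPER_SIZE)
            indirectJ2[threadIdx.x] = in[threadIdx.x];
        __syncthreads();

        // every thread in the block reads the same 32-bit word: a broadcast, not a bank conflict
        int nnz_col = indirectJ2[MAX_SUPER_SIZE - 1];
        out[blockIdx.x * blockDim.x + threadIdx.x] = nnz_col;
    }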
Warp shuffle mechanisms do not extend beyond a single warp; therefore there is no short shuffle-only sequence that can broadcast a single quantity to multiple warps in a threadblock. Shared memory can be used to provide direct access to a single quantity to all threads in a threadblock; you are already doing that.
Global memory, __constant__ memory, and kernel parameters can also all be used to "broadcast" the same value to all threads in a threadblock.
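As a sketch of those alternatives (the names are illustrative), a kernel parameter or a __constant__ variable makes the same value visible to every thread with no shared memory involved:

    __constant__ int c_nnz_col;

    __global__ void use_param(int nnz_col, int *out)
    {
        out[blockIdx.x * blockDim.x + threadIdx.x] = nnz_col;    // broadcast via kernel parameter
    }

    __global__ void use_constant(int *out)
    {
        out[blockIdx.x * blockDim.x + threadIdx.x] = c_nnz_col;  // broadcast via __constant__ memory
    }

    // host side: cudaMemcpyToSymbol(c_nnz_col, &value, sizeof(int));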

What's the mechanism of the warps and the banks in CUDA?

I'm a rookie learning CUDA parallel programming. Now I'm confused about the device's global memory access. It's about the warp model and coalescing.
There are some points:
It's said that threads in one block are split into warps. In each warp there are at most 32 threads. That means all the threads of the same warp will execute simultaneously on the same processor. So what is the point of a half-warp?
When it comes to the shared memory of one block, it is split into 16 banks. To avoid bank conflicts, multiple threads can READ from one bank at the same time, but not WRITE to the same bank. Is this a correct interpretation?
Thanks in advance!
The principal usage of "half-warp" applied to CUDA processors prior to the Fermi generation (e.g. the "Tesla" or GT200 generation, and the original G80/G92 generation). These GPUs were architected with an SM (streaming multiprocessor -- a HW block inside the GPU) that had fewer than 32 thread processors. The definition of a warp was still the same, but the actual HW execution took place in "half-warps" at a time. Actually the granular details are more complicated than this, but suffice it to say that the execution model caused memory requests to be issued according to the needs of a half-warp, i.e. 16 threads within the warp. A full warp that hit a memory transaction would thus generate a total of 2 requests for that transaction.
Fermi and newer GPUs have at least 32 thread processors per SM. Therefore a memory transaction is immediately visible across a full warp. As a result, memory requests are issued at the per-warp level, rather than per half-warp. However, a full memory request can only retrieve 128 bytes at a time. Therefore, for data sizes larger than 32 bits per thread per transaction, the memory controller may still break the request down into a half-warp size.
My view is that, especially for a beginner, it's not necessary to have a detailed understanding of the half-warp. It's generally sufficient to understand that it refers to a group of 16 threads executing together and that it has implications for memory requests.
Shared memory, for example, is broken into 32 banks on Fermi-class GPUs. On previous GPUs it was broken into 16 banks. Bank conflicts occur any time an individual bank is accessed by more than one thread in the same memory request (i.e. originating from the same code instruction). To avoid bank conflicts, the basic strategies are very similar to the strategies for coalescing memory requests, e.g. for global memory. On Fermi and newer GPUs, multiple threads can read the same address without causing a bank conflict, but in general a bank conflict occurs when multiple threads in a warp access different 32-bit words that reside in the same bank. For further understanding of shared memory and how to avoid bank conflicts, I would recommend the NVIDIA webinar on this topic.
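A standard illustration (my own sketch, assuming a 32x32 thread block on a 32-bank GPU) is column-wise access to a 2D shared array: every thread of a warp hits the same bank, and padding the row length by one element removes the conflict:

    __global__ void column_access(const float *in, float *out)
    {
        __shared__ float tile[32][32];     // column reads cause a 32-way bank conflict
        __shared__ float padded[32][33];   // +1 padding shifts each row by one bank: conflict-free

        int x = threadIdx.x, y = threadIdx.y;
        tile[y][x]   = in[y * 32 + x];
        padded[y][x] = in[y * 32 + x];
        __syncthreads();

        out[y * 32 + x] = tile[x][y] + padded[x][y];
    }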

The relationship between bank conflict and coalesced access in CUDA

I am trying to transfer some data from shared memory to global memory. Some consecutive threads will access one bank (but not the same 32 bits), so there are some bank conflicts. (I used the Visual Profiler to check this.)
However, that data is also coalesced when it is transferred to global memory. (I used the Visual Profiler to check this.)
Why is the data written to global memory in a coalesced way? In my opinion, a streaming multiprocessor pops the 32-bit words out one by one (based on the bank's bandwidth), so the memory transactions cannot be coalesced in global memory.
I may have made some mistakes here. Please help me find them or give me a reasonable explanation. Thank you.
You have two different things going on here: the read, which incurs a bank conflict, and the write, which may not be coalesced. Since shared memory is much much faster than global, you usually need to worry about the coalesced access first.
Coalescing means that the threads are writing into a small range of memory addresses. For example, if thread 1 writes to address 1 and thread 2 writes to address 2, this is good. If they write to addresses* 1 and 4, respectively, this is worse. Less importantly, it's optimal if threads write to increasing addresses starting at a multiple of 32, for example addresses 32 and 33. *(I use "address" loosely here, meaning a 4-byte offset.)
Bank conflicts occur when multiple threads access shared memory addresses that have the same lower bits (specifically, word indices that are equivalent mod 16 on a 16-bank device). If two threads use the same bank, they will be serialized, meaning one will execute after the other instead of both accessing memory at the same time.
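To tie this back to your question, here is a sketch of the situation you describe (the names are mine; assume a 32-thread block, i.e. one warp, and a 32-bank device): the shared memory read has a bank conflict because threads of the same warp hit the same bank at different 32-bit words, while the global memory write is still coalesced because consecutive threads write consecutive addresses. The two effects are independent of each other:

    __global__ void conflict_read_coalesced_write(float *out)
    {
        __shared__ float smem[64];

        smem[threadIdx.x]      = (float)threadIdx.x;
        smem[threadIdx.x + 32] = (float)(threadIdx.x + 32);
        __syncthreads();

        // stride-2 read: threads i and i+16 read different words in the same bank
        // (e.g. words 0 and 32 both live in bank 0), a two-way bank conflict
        float v = smem[(2 * threadIdx.x) % 64];

        // contiguous write: consecutive threads write consecutive addresses,
        // so the store is coalesced regardless of the conflicting read above
        out[blockIdx.x * blockDim.x + threadIdx.x] = v;
    }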