How many simultaneous read instructions per thread on modern GPU? - cuda

On a modern GPU (let's say, Kepler), if I have 4 independent global memory reads (no dependencies between reads) from a single thread, will all 4 reads be pipelined at once, so that I only pay the latency penalty of a single global memory read? What about from shared memory? How many reads can be in the pipeline at once, is this documented somewhere?

GPU threads do not work that way. Multiple global memory reads issued by a single thread will never be combined into a single transaction.
However, multiple global memory reads from different threads may be combined if they are issued at the same time and the locations they read fall within the same 128-byte segment. This happens within a warp (a group of 32 threads that always execute the same instruction). For example, if threads 0-31 of a warp read input[0..31] of type float, all of these reads will be combined into one memory transaction (assuming the data is properly aligned). But if threads 0-31 read input[0,2,4,...,62], the reads will be combined into two memory transactions, and half of the data fetched will be read and discarded.
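As a rough sketch of the two patterns described above (the kernel names and parameters are made up for illustration; the transaction counts assume properly aligned float data):
__global__ void coalesced_read(const float *input, float *output)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads 0..31 of a warp read input[0..31]: adjacent 4-byte words,
    // so the warp's loads fall within one 128-byte segment -> one transaction.
    output[i] = input[i];
}
__global__ void strided_read(const float *input, float *output)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads 0..31 read input[0,2,4,...,62]: the warp touches 256 bytes,
    // so two transactions are issued and half of the fetched data is unused.
    output[i] = input[2 * i];
}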
For shared memory, the latency is roughly 100x lower than for global memory access. The main concern there is avoiding bank conflicts.
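A similarly minimal sketch of a bank conflict, assuming the usual layout of 32 banks of 4-byte words (the kernel is hypothetical):
__global__ void bank_conflict_demo(float *out)
{
    __shared__ float tile[64];
    int t = threadIdx.x;               // assume blockDim.x == 32 for this sketch
    tile[t] = (float)t;
    tile[t + 32] = (float)(t + 32);
    __syncthreads();
    // Stride-1 access: each lane hits a different bank, no conflict.
    float a = tile[t];
    // Stride-2 access: lanes t and t+16 hit the same bank, a 2-way conflict
    // that serializes the warp's shared memory request into two passes.
    float b = tile[2 * t];
    out[t] = a + b;
}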
You may want to read the following links for more information.
https://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/
https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-memory-throughput
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#device-memory-spaces

Related

Nvidia GPU simultaneous access to a single location in global memory

I was wondering what happens when multiple threads within a single warp try to access the same location in global memory (e.g., the same 4-byte word), particularly on Turing GPUs with compute capability 7.5. I believe that in shared memory this would result in a bank conflict, unless all of the threads access the same location, in which case the data would be broadcast.
Just to give a contrived example:
1) Consider that the first 16 threads of a warp access a single 4-byte word, whereas the remaining 16 threads access the next 4-byte word. How is the access handled in such a situation? Is it serialized for every thread of a half-warp?
2) What if the entire warp tries to access a single 4-byte word from a global memory?
There is no serialization. All CUDA GPUs Kepler and newer will broadcast in that scenario. There is no performance hit.
No difference. Any pattern of overlapping read access is handled in a single request, with an optimized number of transactions per request. The transactions per request would be no higher than for an ordinary coalesced one-thread-per-adjacent-location type read, and may be lower. For example, on a modern GPU, you might observe 4 (32-byte) transactions per coalesced global read request. In the case of all threads (in a warp) accessing a single location, it would be only one transaction per request.
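As a concrete illustration of the broadcast case (hypothetical kernel and buffer names):
__global__ void broadcast_read(const int *global_array, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread in the warp reads the same 4-byte word. The warp issues a
    // single request serviced by a single 32-byte transaction, and the value
    // is broadcast to all lanes with no serialization.
    out[i] = global_array[0];
}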

Which is faster for CUDA shared-mem atomics - warp locality or anti-locality?

Suppose many warps in a (CUDA kernel grid) block are updating a fair-sized number of shared memory locations, repeatedly.
In which of these cases will such work be completed faster?
The case of intra-warp access locality, e.g. the total number of memory positions accessed by each warp is small and most of them are indeed accessed by multiple lanes
The case of access anti-locality, where all lanes typically access distinct positions (and perhaps with an effort to avoid bank conflicts)?
and no less importantly - is this microarchitecture-dependent, or is it essentially the same on all recent NVIDIA microarchitectures?
Anti-localized access will be faster.
On SM 5.0 (Maxwell) and above GPUs, for shared memory atomics (assume atomicAdd) the shared memory unit will replay the instruction when there are address conflicts (two lanes with the same address). Normal bank-conflict replays also apply. On Maxwell/Pascal the shared memory unit has fixed round-robin access between the two SM partitions (2 schedulers in each partition). For each partition, the shared memory unit will complete all replays of the instruction prior to moving to the next instruction. The Volta SM will complete the instruction prior to any other shared memory instruction.
Avoid bank conflicts
Avoid address conflicts
On the Fermi and Kepler architectures a shared memory lock operation had to be performed prior to the read-modify-write operation. This blocked all other warp instructions.
Maxwell and newer GPUs have significantly faster shared memory atomic performance than Fermi/Kepler.
A very simple kernel could be written to micro-benchmark your two different cases. The CUDA profilers provide instruction executed counts and replay counts for shared memory accesses but do not differentiate between replays due to atomics and replays due to load/store conflicts or vector accesses.
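A minimal sketch of such a micro-benchmark, under the assumption that relative clock64() cycle counts are good enough to compare the two cases (the kernel and buffer names are made up here):
__global__ void shmem_atomic_bench(int mode, int iters, long long *cycles)
{
    __shared__ int counters[32];
    int lane = threadIdx.x % 32;
    if (threadIdx.x < 32) counters[threadIdx.x] = 0;
    __syncthreads();

    // mode 0: "localized"  - every lane of the warp hammers the same counter.
    // mode 1: "anti-local" - each lane updates its own counter (distinct banks).
    int idx = (mode == 0) ? 0 : lane;

    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        atomicAdd(&counters[idx], 1);
    long long stop = clock64();

    if (threadIdx.x == 0)
        *cycles = stop - start;
}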
There's a quite simple argument to be made even without needing to know anything about how shared memory atomics are implemented in CUDA hardware: At the end of the day, atomic operations must be serialized somehow at some point. This is true in general, it doesn't matter which platform or hardware you're running on. Atomicity kinda requires that simply by nature. If you have multiple atomic operations issued in parallel, you have to somehow execute them in a way that ensures atomicity. That means that atomic operations will always become slower as contention increases, no matter if we're talking GPU or CPU. The only question is: by how much. That depends on the concrete implementation.
So generally, you want to keep the level of contention, i.e., the number of threads that will be trying to perform atomic operations on the same memory location in parallel, as low as possible…
This is a speculative partial answer.
Consider the related question: Performance of atomic operations on shared memory and its accepted answer.
If the accepted answer there is correct (and continues to be correct even today), then warp threads in a more localized access would get in each other's way, making it slower for many lanes to operate atomically, i.e. making anti-locality of warp atomics better.
But to be honest - I'm not sure I completely buy into this line of argumentation, nor do I know if things have changed since that answer was written.

global memory access for individual threads

I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene)
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all, I think you have the wrong idea when you say concurrent reads may or may not cripple the degree of parallelism, because that is what it means to be parallel: each thread is reading concurrently. Instead, you should be asking whether performance suffers from the extra memory accesses when each thread basically wants the same thing, i.e. the same node.
According to the article here, memory accesses can be coalesced only within a warp, and only if data locality is present.
Which means if threads within a warp are trying to access memory locations near each other they can be coalesced. In your case each thread is trying to access the "same" node until it meets an endpoint where they branch.
This means the memory accesses will be coalesced within the warp till the threads branch off.
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated in global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces global memory loads and stores issued by the threads of a warp into as few transactions as possible to minimize DRAM bandwidth. A misaligned data access for devices with compute capability less than 2.0 affects the effective bandwidth of accessing data. This is not a serious issue when working with a device that has compute capability > 2.0. That being said, pretty much regardless of your device generation, when accessing global memory with large strides, the effective bandwidth becomes poor (Reference). I would assume that the same behavior is likely for random access.
If you are changing the structure while it is being read, which I assume you are (if it's a scene you probably update it each frame?), then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem; using atomic operations guarantees that race conditions don't happen.
You can try stuffing the 'scene' into shared memory if you can fit it.
You can also try using streams to increase concurrency, which also provides some synchronization between kernels run in the same stream.
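A minimal sketch of the shared memory suggestion, assuming a hypothetical Node layout and a top part of the scene small enough to fit in per-block shared memory (launched with nTopNodes * sizeof(Node) bytes of dynamic shared memory):
struct Node { float bbox[6]; int left, right; };   // hypothetical node layout
__global__ void trace(int width, int height, float *frameBuffer,
                      const Node *scene, int nTopNodes)
{
    extern __shared__ Node topNodes[];

    // Cooperatively copy the most frequently traversed nodes once per block.
    for (int i = threadIdx.x; i < nTopNodes; i += blockDim.x)
        topNodes[i] = scene[i];
    __syncthreads();

    // ... traversal starts from topNodes[0] instead of scene[0],
    // falling back to the global 'scene' array for deeper nodes ...
}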

GPU coalesced global memory access vs using shared memory

If a thread is accessing global memory, why does it access a large chunk? Where is this large chunk stored?
If you're reading from global memory in a coalesced manner, would it be beneficial to copy a common chunk of the global memory into shared memory, or would there not be any improvement?
i.e.: If each thread is reading the next 5 or 10 or 100 memory locations and averaging them, and you could fit a chunk of X points from global memory into shared memory, could you not write an if statement saying: if you're looking for one of these memory values, read from shared memory rather than global? I'm assuming the warp divergence penalty would be less than reading from global memory each time.
When you read from global memory, the data are searched for first in the L1 cache (high bandwidth, 1,600 GB/s on Fermi, but limited in size, 48 KB on Fermi); then, if not present in L1, they are searched for in L2 (lower bandwidth, but larger than L1, 768 KB on Fermi); and finally, if not present in L2, they are loaded from global memory*.
When a global memory load occurs, the data are moved to L2 and then to L1, so to be able to access them in a faster way next time a global memory read is required.
Possibly, such data are evicted by a subsequent global memory load, possibly not. So, in principle, if you are reading "small" chunks of data, you do not need to necessarily force the data to be located in the shared memory to access them next time in a fast way.
Take into account that, on Fermi and Kepler, shared memory is made by the same circuitry of the L1 cache. You can then see the shared memory as a controlled L1 cache.
You should then force the data to reside in shared memory when you want to be sure they live in, say, the "fastest available cache", and you do this whenever you need to access the same data multiple times.
Note that the above is the general philosophy behind global memory transfers. Implementation details can differ depending on the underlying architecture.
*It should be noted that L1 caching of global loads can be disabled by a compiler option. This is useful in terms of performance for random-access data patterns.
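For the averaging example in the question, a minimal sketch of the shared memory staging pattern (the tile size, radius, and names are assumptions; whether it beats simply relying on L1/L2 depends on the amount of reuse and on the architecture, as discussed above):
#define RADIUS 10      // each thread averages its next RADIUS values (assumption)
#define TILE   256     // threads per block in this sketch
__global__ void average_next(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + RADIUS];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage the block's window, plus a RADIUS-wide halo, into shared memory.
    if (gid < n) tile[threadIdx.x] = in[gid];
    if (threadIdx.x < RADIUS && gid + TILE < n)
        tile[TILE + threadIdx.x] = in[gid + TILE];
    __syncthreads();

    // Each thread averages its next RADIUS values from shared memory instead
    // of issuing RADIUS separate (largely overlapping) global loads.
    if (gid + RADIUS < n) {
        float sum = 0.0f;
        for (int k = 0; k < RADIUS; ++k)
            sum += tile[threadIdx.x + k];
        out[gid] = sum / RADIUS;
    }
}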

CUDA Compute Capability 2.0. Global memory access pattern

Starting with CUDA Compute Capability 2.0 (Fermi), global memory access works through a 768 KB L2 cache. It looks like developers no longer need to care about global memory banks. But global memory is still very slow, so the right access pattern is important. Now the point is to use/reuse L2 as much as possible. And my question is: how? I would be thankful for some detailed info on how L2 works and how I should organize and access global memory if I need, for example, an array of 100-200 elements per thread.
L2 cache helps in some ways, but it does not obviate the need for coalesced access of global memory. In a nutshell, coalesced access means that for a given read (or write) instruction, individual threads in a warp are reading (or writing) adjacent, contiguous locations in global memory, preferably that are aligned as a group on a 128-byte boundary. This will result in the most effective utilization of the available memory bandwidth.
In practice this is often not difficult to accomplish. For example:
int idx=threadIdx.x + (blockDim.x * blockIdx.x);
int mylocal = global_array[idx];
will give coalesced (read) access across all the threads in a warp, assuming global_array is allocated in an ordinary fashion using cudaMalloc in global memory. This type of access makes 100% usage of the available memory bandwidth.
A key takeaway is that memory transactions ordinarily occur in 128-byte blocks, which happens to be the size of a cache line. If you request even one of the bytes in a block, the entire block will be read (and stored in L2, normally). If you later read other data from that block, it will normally be serviced from L2, unless it has been evicted by other memory activity. This means that the following sequence:
int mylocal1 = global_array[0];
int mylocal2 = global_array[1];
int mylocal3 = global_array[31];
would all typically be serviced from a single 128-byte block. The first read for mylocal1 will trigger the 128 byte read. The second read for mylocal2 would normally be serviced from the cached value (in L2 or L1) not by triggering another read from memory. However, if the algorithm can be suitably modified, it's better to read all your data contiguously from multiple threads, as in the first example. This may be just a matter of clever organization of data, for example using Structures of Arrays rather than Arrays of structures.
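As a brief illustration of that reorganization (the struct layouts here are hypothetical):
// Array of Structures: reading p[idx].x touches every 12th float, so a
// warp's loads are spread over several 128-byte blocks.
struct PointAoS { float x, y, z; };
__global__ void read_aos(const PointAoS *p, float *out)
{
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    out[idx] = p[idx].x;      // 12-byte stride between threads
}
// Structure of Arrays: reading xs[idx] touches adjacent floats, so the
// warp's loads coalesce into the minimum number of transactions.
__global__ void read_soa(const float *xs, float *out)
{
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    out[idx] = xs[idx];       // fully coalesced
}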
In many respects, this is similar to CPU cache behavior. The concept of a cache line is similar, along with the behavior of servicing requests from the cache.
Fermi L1 and L2 can support write-back and write-through. L1 is available on a per-SM basis, and is configurably split with shared memory to be either 16KB L1 (and 48KB SM) or 48KB L1 (and 16KB SM). L2 is unified across the device and is 768KB.
Some advice I would offer is to not assume that the L2 cache just fixes sloppy memory accesses. The GPU caches are much smaller than equivalent caches on CPUs, so it's easier to get into trouble there. A general piece of advice is simply to code as if the caches were not there. Rather than CPU oriented strategies like cache-blocking, it's usually better to focus your coding effort on generating coalesced accesses and then possibly make use of shared memory in some specific cases. Then for the inevitable cases where we can't make perfect memory accesses in all situations, we let the caches provide their benefit.
You can get more in-depth guidance by looking at some of the available NVIDIA webinars. For example, the Global Memory Usage & Strategy webinar (and slides) or the CUDA Shared Memory & Cache webinar would be instructive for this topic. You may also want to read the Device Memory Access section of the CUDA C Programming Guide.