What are CUDA Global Memory 32-, 64- and 128-byte transactions? - cuda

I am relatively new to CUDA programming.
In this blog (How to Access Global Memory Efficiently in CUDA C/C++ Kernels), we have the following:
"The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size."
The 128-byte transaction is also mentioned in this post (The cost of CUDA global memory transactions)
In addition, 32-and 128-byte memory transactions are also mentioned in the CUDA C Programming Guide. This guide also show Figure 20 about aligned and mis-aligned access, that I couldn't quite understand.
Explain and give examples on how 32-, 64-, 128-byte transactions would happen?
Go through Figure 20 in more details. What is the point that the Figure is making?

Both of these need to be understood in the context of a CUDA warp. All operations are issued warp-wide, and this includes instructions that access memory.
An individual CUDA thread can access 1,2,4,8,or 16 bytes in a single instruction or transaction. When considered warp-wide, that translates to 32 bytes all the way up to 512 bytes. The GPU memory controller can typically issue requests to memory in granularities of 32 bytes, up to 128 bytes. Larger requests (say, 512 bytes, considered warp wide) will get issued via multiple "transactions" of typically no more than 128 bytes.
Modern DRAM memory has the design characteristic that you don't typically ask for a single byte, you request a "segment" typically of 32 bytes at a time for typical GPU designs. The division of memory into segments is fixed at design time. As a result, you can request either the first 32 bytes (the first segment) or the second 32 bytes (the second segment). You cannot request bytes 16-47 for example. This is all a function of the DRAM design, but it manifests in terms of memory behavior.
The diagram(s) depicts the behavior of each thread in a warp. Individually, they are depicted by the gray/black arrows pointing upwards. Each arrow represents the request from a thread, and the arrow points to a relative location in memory that that thread would like to load or store.
The diagrams are presented in comparison to each other to show the effect of "alignment". When considered warp-wide, if all 32 threads are requesting bytes of data that belong to a single segment, this would require the memory controller to retrieve only one segment to satisfy the request. This would arguably be the most efficient possible behavior (and therefore data organization as well as access pattern, considered warp-wide) for a single request (i.e. a single load or store instruction).
However if the addresses emanating from each thread in the warp result in a pattern depicted in the 2nd figure, this would be "unaligned", and even though you are effectively asking for a similar data "footprint", the lack of alignment to a single segment means the memory controller will need to retrieve 2 segments from memory to satisfy the request.
That is the key point of understanding associated with the figure. But there is more to the story than that. Misaligned access is not necessarily as tragic (performance cut in half) as this might suggest. The GPU caches play a role here, when we consider these transactions not just in the context of a single warp, but across many warps.
To get a more complete and orderly treatment of these topics, I suggest referring to various training material. It's by no means the only one, but unit 4 of this training series will cover the topic in more detail.

Related

Nvidia GPU simultaneous access to a single location in global memory

I was wondering what happens when multiple threads within a single warp try to access the same location in global memory (e.g., the same 4 byte word), particularly in Turing GPUs with compute capability 7.5. I believe that in shared memory that would result in a bank conflict, unless all of the threads access the same location, then the data would be broadcasted.
Just to give a contrived example:
1) Consider that the first 16 threads of a warp access a single 4-byte word, whereas the remaining 16 threads access the next single 4-byte word. How the access is handled in such situation? Is it serialized for every thread of a half warp?
2) What if the entire warp tries to access a single 4-byte word from a global memory?
There is no serialization. All CUDA GPUs Kepler and newer will broadcast in that scenario. There is no performance hit.
No difference. Any pattern of overlapping read access is handled in a single request, with an optimized number of transactions per request. The transactions per request would be no higher than for an ordinary coalesced one-thread-per-adjacent-location type read, and may be lower. For example, on a modern GPU, you might observe 4 (32-byte) transactions per coalesced global read request. In the case of all threads (in a warp) accessing a single location, it would be only one transaction per request.

global memory access for individual threads

I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene)
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all I think you got the wrong idea when you say concurrent reads may or may not cripple the degree of parallelism. Because that is what it means to be parallel. Each thread is reading concurrently. Instead you should be thinking if it affects the performance due to more memory accesses when each thread basically wants the same thing i.e. the same node.
Well according to the article here, Memory accesses can be coalesced if data locality is present and within warps only.
Which means if threads within a warp are trying to access memory locations near each other they can be coalesced. In your case each thread is trying to access the "same" node until it meets an endpoint where they branch.
This means the memory accesses will be coalesced within the warp till the threads branch off.
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated on global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces global memory loads and stores issued by threads of a warp into as few transactions as possible to minimize DRAM bandwidth. A misaligned data access for devices with a compute capability of less than 2.0 affects the effective bandwidth of accessing data. This is not a serious issue when working with a device that has a compute capability of > 2.0. That being said, pretty much regardless of your device generation, when accessing global memory with large strides, the effective bandwidth becomes poor (Reference). I would assume that for random access the the same behavior is likely.
Unless you are not changing the structure while reading, which I assume you do (if it's a scene you probably render each frame?) then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem. Using atomic operations guarantees that the race conditions don't happen.
You can try, stuffing the 'scene' to the shared memory if you can fit it.
You can also try using streams to increase concurrency which also brings some sort of synchronization to the kernels that are run in the same stream.

Weak guarantees for non-atomic writes on GPUs?

OpenCL and CUDA have included atomic operations for several years now (although obviously not every CUDA or OpenCL device supports these). But - my question is about the possibility of "living with" races due to non-atomic writes.
Suppose several threads in a grid all write to the same location in global memory. Are we guaranteed that, when kernel execution has concluded, the results of one of these writes will be present in that location, rather than some junk?
Relevant parameters for this question (choose any combination(s), edit except nVIDIA+CUDA which already got an answer):
Memory space: Global memory only; this question is not about local/shared/private memory.
Alignment: Within a single memory write width (e.g. 128 bits on nVIDIA GPUs)
GPU Manufacturer: AMD / nVIDIA
Programming framework: CUDA / OpenCL
Position of store instruction in code: Same line of code for all threads / different lines of code.
Write destination: Fixed address / fixed offset from the address of a function parameter / completely dynamic
Write width: 8 / 32 / 64 bits.
Are we guaranteed that, when kernel execution has concluded, the results of one of these writes will be present in that location, rather than some junk?
For current CUDA GPUs, and I'm pretty sure for NVIDIA GPUs with OpenCL, the answer is yes. Most of my terminology below will have CUDA in view. If you require an exhaustive answer for both CUDA and OpenCL, let me know, and I'll delete this answer. Very similar questions to this one have been asked, and answered, before anyway. Here's another, and I'm sure there are others.
When multiple "simultaneous" writes occur to the same location, one of them will win, intact.
Which one will win is undefined. The behavior of the non-winning writes is also undefined (they may occur, but be replaced by the winner, or they may not occur at all.) The actual contents of the memory location may transit through various values (such as the original value, plus any of the valid written values), but the transit will not pass through "junk" values (i.e. values that were not already there and were not written by any thread.) The transit ends up at the "winner", eventually.
Example 1:
Location X contains zero. Threads 1,5,32, 30000, and 450000 all write one to that location. If there is no other write traffic to that location, that location will eventually contain the value of one (at kernel termination, or earlier).
Example 2:
Location X contains 5. Thread 32 writes 1 to X. Thread 90303 writes 7 to X. Thread 432322 writes 972 to X. If there is no other write traffic to that location, upon kernel termination, or earlier, location X will contain either 1, 7 or 972. It will not contain any other value, including 5.
I'm assuming X is in global memory, and all traffic to it is naturally aligned to it, and all traffic to it is of the same size, although these principles apply to shared memory as well. I'm also assuming you have not violated CUDA programming principles, such as the requirement for naturally aligned traffic to device memory locations. The transactions I have in view here are those transactions that originate from a single SASS instruction (per thread) Such transactions can have a width of 1,2,4,or 8bytes. The claims I've made here apply whether the writes are originating from "the same line of code" or "different lines".
These claims are based on the PTX memory consistency model, and therefore the "correctness" is ensured by the GPU hardware, not by the compiler, the CUDA programming model, or the C++ standard that CUDA is based on.
This is a fairly complex topic (especially when we factor in cache behavior, and what to expect when we throw reads in the mix), but "junk" values should never occur. The only values that should occur in global memory are those values that were there to begin with, or those values that were written by some thread, somewhere.

filtering an image, best practices

I have an input image "let it be a buffer of 1024 * 1024 pixels, with RGBA color data"
what I want to do for each pixel, is to filter it depending on neighbors , like [-15,15] in x and y directions
so my concern is, doing this with global memory will do like 31 * 31 global memory access for each pixel "which would be very performance bottleneck" , also I'm not sure about the behavior of multiple threads trying to read from the same memory location at the same time "may be some of them fail to read so -> rubbish data in -> rubbish data out"
this question is for CUDA or OpenCL as the concept should be the same
I know that shared memory (per work group) or local memory (per thread) won't solve this as I can't read another thread local memory, or another group shared memory "correct me if I misunderstand this concept"
Shared memory is a typical approach to this problem, although the stencil area (31*31) is quite large. Data re-use benefit can still be gained however. Since adjacent pixel computations only extend the region required by one column, in a 16KB shared memory array of 32bit RGBA pixels, you could have enough data for at least 64 threads to cooperatively compute their pixel values out of a single shared memory load.
Regarding the concern about multiple threads reading the same location, there is no possibility for garbage data reads. Certainly there is a possibility for contention leading to a performance impact, but in fact with an orderly for-loop progression in the kernel, no threads will be reading the same location at the same time anyway. With appropriate data organization there will be good opportunity for coalesced reads from global memory and no bank conflicts in shared memory.
This type of problem is well-suited for GPUs e.g. CUDA or OpenCL, and there are many examples of programs like this on SO.

Memory coalescing and transaction

After reading about the topic, I have 2 questions related to Global Memory coalescing access:
1- I read that one requirement for Memory coalescing is that words accessed by the threads must be 4, 8, or 16 byte but apparently this is valid only for device with compute capability less than 1.3. Is that right? for the latter device (>=1.3), a thread can even access one or 2 bytes and have a successful coalesced memory access
2- Will it matter (time mainly) if a (half) warp Global Memory access generates a 128-byte instead of 64-byte memory transaction because of the words misalignment and what about the extra data transferred, will it be discarded by the system?
Thank you
1) You can access the data any way you want on later devices, but the performance will still be poor if you request a data segment that is narrow, i.e. you will not achieve the full memory bandwidth of your GPU.
2) This again depends on the overall scheme of you code. Generally, the improvement in later version of CUDA was that non-aligned reads/writes did not result in abysmal performance, but resulted in e.g. 2 write commands being issues instead of one.
Think of it like putting people on a bus. If you can fill your whole crowd into a single bus with one destination, you get better efficiency than using two buses that are only half filled.
So yes, it will matter, but depending on whether you are memory or compute bound, it will matter differently.
Arranging your read/write patterns to utilize the full bandwidth have given me the last 20-30% performance in many applications.
/Henrik