How do these PTX instructions accelerate CUDA applications?

In CUDA 11 we have cp.async to load data from global memory (GMEM) into shared memory (SMEM), and ldmatrix.sync to move data from SMEM into registers. How do those instructions help us? By vectorizing?

The new, asynchronous, copy-to-shared-memory instructions help us by freeing the warp which issues the instruction to execute other instructions in parallel with the copying.
Of course, this is mostly useful if the warp has such instructions to execute which aren't other loads from global memory. So, arithmetic, special functions, random number generation, etc. etc.
The matrix-load instructions are useful for matrix operations. Recent NVIDIA GPUs have specialized hardware for matrix multiplication ("tensor cores"), which is considerably faster than performing the same multiplications on the GPU's general-purpose hardware. To use the matrix operations, the data must first be in registers, arranged the way the matrix instructions expect, and that is what the ldmatrix instructions do: they load matrix fragments from shared memory into registers.

Related

Memory Coalescing vs. Vectorized Memory Access

I am trying to understand the relationship between memory coalescing on NVIDIA GPUs/CUDA and vectorized memory access on x86-SSE/C++.
It is my understanding that:
Memory coalescing is a run-time optimization of the memory controller (implemented in hardware). How many memory transactions are required to fulfill the load/store of a warp is determined at run-time. A load/store instruction of a warp may be issued repeatedly unless there is perfect coalescing.
Memory vectorization is a compile-time optimization. The number of memory transactions for a vectorized load/store is fixed. Each vector load/store instruction is issued exactly once.
Coalescable GPU load/store instructions are more expressive than SSE vector load/store instructions. E.g., a st.global.s32 PTX instruction may store into 32 arbitrary memory locations (warp size 32), whereas a movdqa SSE instruction can only store into a consecutive block of memory.
Memory coalescing in CUDA seems to guarantee efficient vectorized memory access (when accesses are coalescable), whereas on x86-SSE, we have to hope that the compiler actually vectorizes the code (it may fail to do so) or vectorize code manually with SSE intrinsics, which is more difficult for programmers.
Is this correct? Did I miss an important aspect (thread masking, maybe)?
Now, why do GPUs have run-time coalescing? This probably requires extra circuits in hardware. What are the main benefits over compile-time coalescing as in CPUs? Are there applications/memory access patterns that are harder to implement on CPUs because of missing run-time coalescing?
Caveat: I don't really know / understand the architecture / microarchitecture of GPUs very well. Some of this understanding is cobbled together from the question + what other people have written in comments / answers here.
The way GPUs let one instruction operate on multiple data is very different from CPU SIMD. That's why they need special support for memory coalescing at all. CPU-SIMD can't be programmed in a way that needs it.
BTW, CPUs have cache to absorb multiple accesses to the same cache line before the actual DRAM controllers get involved. GPUs have cache too, of course.
Yes, memory-coalescing basically does at runtime what short-vector CPU SIMD does at compile time, within a single "core". The CPU-SIMD equivalent would be gather/scatter loads/stores that could optimize to a single wide access to cache for indices that were adjacent. Existing CPUs don't do this: each element accesses cache separately in a gather. You shouldn't use a gather load if you know that many indices will be adjacent; it will be faster to shuffle 128-bit or 256-bit chunks into place. For the common case where all your data is contiguous, you just use a normal vector load instruction instead of a gather load.
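To make the "runtime" part concrete, here is a rough host-side C++ sketch (not GPU code) that counts how many aligned memory segments one warp-wide access would touch. The names count_segments and strided_addresses are made up for this example, and the segment size is an assumed parameter; real transaction granularity depends on the GPU generation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical model: the number of aligned segments touched by the lanes
// of one warp stands in for the number of memory transactions.
// segment_bytes is an assumption (for example 32 or 128).
std::size_t count_segments(const std::vector<std::uint64_t>& lane_addresses,
                           std::uint64_t segment_bytes) {
    assert(segment_bytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t addr : lane_addresses) {
        segments.insert(addr / segment_bytes);  // index of the aligned segment
    }
    return segments.size();
}

// Per-lane byte addresses for 4-byte elements:
// lane i accesses base + i * stride_elements * 4.
std::vector<std::uint64_t> strided_addresses(std::uint64_t base,
                                             std::uint64_t stride_elements,
                                             std::size_t lanes = 32) {
    std::vector<std::uint64_t> out;
    out.reserve(lanes);
    for (std::size_t lane = 0; lane < lanes; ++lane) {
        out.push_back(base + lane * stride_elements * 4);
    }
    return out;
}
```

With 128-byte segments, count_segments(strided_addresses(0, 1), 128) gives one segment for the contiguous case, while a stride of 32 elements gives 32 segments. The same load instruction costs very different amounts depending on addresses that are only known at run time, which is the contrast with a CPU vector load.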
The point of modern short-vector CPU SIMD is to feed more work through a fetch/decode/exec pipeline without making it wider in terms of having to decode + track + exec more CPU instructions per clock cycle. Making a CPU pipeline wider quickly hits diminishing returns for most use-cases, because most code doesn't have a lot of ILP.
A general-purpose CPU spends a lot of transistors on instruction-scheduling / out-of-order execution machinery, so just making it wider to be able to run many more uops in parallel isn't viable (see https://electronics.stackexchange.com/questions/443186/why-not-make-one-big-cpu-core).
To get more throughput, we can raise the frequency, raise IPC, and use SIMD to do more work per instruction/uop that the out-of-order machinery has to track. (And we can build multiple cores on a single chip, but cache-coherent interconnects between them + L3 cache + memory controllers are hard). Modern CPUs use all of these things, so we get a total throughput capability of frequency * IPC * SIMD, and times number of cores if we multithread. They aren't viable alternatives to each other, they're orthogonal things that you have to do all of to drive lots of FLOPs or integer work through a CPU pipeline.
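As a back-of-the-envelope version of that product, here is a tiny C++ sketch. The function name and every number you feed it are illustrative assumptions, not measurements of any real CPU.

```cpp
#include <cassert>

// Peak throughput as the product of the orthogonal factors discussed above.
// All inputs are hypothetical; real code rarely reaches this upper bound.
double peak_ops_per_second(double frequency_hz,
                           double instructions_per_cycle,
                           double simd_lanes,
                           double cores) {
    assert(frequency_hz >= 0 && instructions_per_cycle >= 0);
    assert(simd_lanes >= 1 && cores >= 1);
    return frequency_hz * instructions_per_cycle * simd_lanes * cores;
}
```

For example, an assumed 3 GHz core retiring 2 vector instructions per cycle on 8 lanes, across 4 cores, gives peak_ops_per_second(3.0e9, 2, 8, 4), i.e. 1.92e11 element operations per second. Setting any one factor back to 1 shows how much of the total it is responsible for.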
This is why CPU SIMD has wide fixed-width execution units, instead of a separate instruction for each scalar operation. There isn't a mechanism for one scalar instruction to flexibly be fed to multiple execution units.
Taking advantage of this requires vectorization at compile time, not just of your loads / stores but also your ALU computation. If your data isn't contiguous, you have to gather it into SIMD vectors either with scalar loads + shuffles, or with AVX2 / AVX512 gather loads that take a base address + vector of (scaled) indices.
But GPU SIMD is different. It's for massively parallel problems where you do the same thing to every element. The "pipeline" can be very lightweight because it doesn't need to support out-of-order exec or register renaming, or especially branching and exceptions. This makes it feasible to just have scalar execution units without needing to handle data in fixed chunks from contiguous addresses.
These are two very different programming models. They're both SIMD, but the details of the hardware that runs them are very different.
Each vector load/store instruction is issued exactly once.
Yes, that's logically true. In practice the internals can be slightly more complicated, e.g. AMD Ryzen splitting 256-bit vector operations into 128-bit halves, or Intel Sandybridge/IvB doing that for just loads+stores while having 256-bit wide FP ALUs.
There's a slight wrinkle with misaligned loads/stores on Intel x86 CPUs: on a cache-line split, the uop has to get replayed (from the reservation station) to do the other part of the access (to the other cache line).
In Intel terminology, the uop for a split load gets dispatched twice, but only issues + retires once.
Aligned loads/stores like movdqa, or movdqu when the memory happens to be aligned at runtime, are just a single access to L1d cache (assuming a cache hit). Unless you're on a CPU that decodes a vector instruction into two halves, like AMD for 256-bit vectors.
But that stuff is purely inside the CPU core for access to L1d cache. CPU <-> memory transactions are in whole cache lines, with write-back L1d / L2 private caches, and shared L3 on modern x86 CPUs (see "Which cache mapping technique is used in intel core i7 processor?"). (Intel since Nehalem, the start of the i3/i5/i7 series, AMD since Bulldozer I think introduced L3 caches for them.)
In a CPU, it's the write-back L1d cache that basically coalesces transactions into whole cache lines, whether you use SIMD or not.
What SIMD helps with is getting more work done inside the CPU, to keep up with faster memory. Or for problems where the data fits in L2 or L1d cache, to go really fast over that data.
Memory coalescing is related to parallel accesses: when the cores in an SM access consecutive memory locations, the accesses are combined and the memory access is optimized.
Conversely, SIMD is a single-core optimization: when a vector register is filled with operands and an SSE operation is performed, the parallelism is inside the CPU core, with one operation being performed on each internal logical unit per clock cycle.
However, you are right: coalesced/uncoalesced memory access is a runtime aspect, while SIMD operations are compiled in. I don't think the two compare well.
If I were to draw a parallel, I would compare coalescing in GPUs to memory prefetching in CPUs. This is a very important runtime optimization as well, and I believe it is active behind the scenes when using SSE too.
However, there is nothing similar to coalescing in Intel CPU cores. Because of cache coherency, the best you can do in optimizing parallel memory accesses is to let each core access independent memory regions.
Now, why do GPUs have run-time coalescing?
Graphics processing is optimized for executing a single task in parallel on adjacent elements.
For example, think of performing an operation on every pixel of an image, assigning each pixel to a different core. Now it's clear that you want an optimal path for loading the image and handing one pixel to each core.
That's why memory coalescing is deeply buried in the GPU architecture.

Which is faster for CUDA shared-mem atomics - warp locality or anti-locality?

Suppose many warps in a (CUDA kernel grid) block are updating a fair-sized number of shared memory locations, repeatedly.
In which of the cases will such work be completed faster? :
The case of intra-warp access locality, e.g. the total number of memory positions accessed by each warp is small and most of them are indeed accessed by multiple lanes
The case of access anti-locality, where all lanes typically access distinct positions (and perhaps with an effort to avoid bank conflicts)?
and no less importantly - is this microarchitecture-dependent, or is it essentially the same on all recent NVIDIA microarchitectures?
Anti-localized access will be faster.
On SM 5.0 (Maxwell) and later GPUs, for shared memory atomics (assume add), the shared memory unit will replay the instruction on address conflicts (two lanes with the same address). Normal bank conflict replays also apply. On Maxwell/Pascal the shared memory unit has fixed round-robin access between the two SM partitions (2 schedulers in each partition). For each partition the shared memory unit will complete all replays of the instruction prior to moving to the next instruction. The Volta SM will complete the instruction prior to any other shared memory instruction.
Avoid bank conflicts
Avoid address conflicts
On Fermi and Kepler architecture a shared memory lock operation had to be performed prior to the read modify write operation. This blocked all other warp instructions.
Maxwell and newer GPUs have significantly faster shared memory atomic performance than Fermi/Kepler.
A very simple kernel could be written to micro-benchmark your two different cases. The CUDA profilers provide instruction executed counts and replay counts for shared memory accesses but do not differentiate between replays due to atomics and replays due to load/store conflicts or vector accesses.
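Before writing the actual micro-benchmark, the two cases can be compared with a simplified host-side C++ model (not GPU code). It assumes that lanes hitting the same 4-byte word must be serialized, and that distinct words in the same bank conflict, with 32 banks of 4-byte words. Both function names are invented for this sketch, and real replay behaviour is microarchitecture-dependent.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Assumed model: an atomic needs as many passes as the most contended
// word has lanes (address conflicts, as described in the answer above).
std::size_t address_conflict_passes(const std::vector<std::uint32_t>& lane_word_index) {
    std::map<std::uint32_t, std::size_t> lanes_per_word;
    std::size_t worst = 0;
    for (std::uint32_t word : lane_word_index) {
        worst = std::max(worst, ++lanes_per_word[word]);
    }
    return worst;
}

// Assumed model: distinct words that fall into the same bank are serialized.
// The default of 32 banks of 4-byte words is an assumption of this sketch.
std::size_t bank_conflict_passes(const std::vector<std::uint32_t>& lane_word_index,
                                 std::uint32_t banks = 32) {
    assert(banks > 0);
    std::map<std::uint32_t, std::set<std::uint32_t>> words_per_bank;
    std::size_t worst = 0;
    for (std::uint32_t word : lane_word_index) {
        std::set<std::uint32_t>& words = words_per_bank[word % banks];
        words.insert(word);
        worst = std::max(worst, words.size());
    }
    return worst;
}
```

Under these assumptions, 32 lanes updating words 0..31 need one pass in both functions (the anti-locality case), 32 lanes updating the same word need 32 address-conflict passes (the locality case), and 32 lanes striding by 32 words need 32 bank-conflict passes. A real kernel timed with the profiler is still needed to confirm how closely the hardware follows this.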
There's a quite simple argument to be made even without needing to know anything about how shared memory atomics are implemented in CUDA hardware: At the end of the day, atomic operations must be serialized somehow at some point. This is true in general, it doesn't matter which platform or hardware you're running on. Atomicity kinda requires that simply by nature. If you have multiple atomic operations issued in parallel, you have to somehow execute them in a way that ensures atomicity. That means that atomic operations will always become slower as contention increases, no matter if we're talking GPU or CPU. The only question is: by how much. That depends on the concrete implementation.
So generally, you want to keep the level of contention, i.e., the number of threads that will be trying to perform atomic operations on the same memory location in parallel, as low as possible…
This is a speculative partial answer.
Consider the related question: Performance of atomic operations on shared memory and its accepted answer.
If the accepted answer there is correct (and continues to be correct even today), then warp threads in a more localized access would get in each other's way, making it slower for many lanes to operate atomically, i.e. making anti-locality of warp atomics better.
But to be honest - I'm not sure I completely buy into this line of argumentation, nor do I know if things have changed since that answer was written.

how exactly does CUDA handle a memory access?

I would like to know how CUDA hardware/run-time system handles the following case.
If a warp (warp1 in the following) instruction involves access to global memory (load/store); the run-time system schedules the next ready warp for execution.
When the new warp is executed,
Will the "memory access" of warp1 be conducted in parallel, i.e. while the new warp is running ?
Will the run time system put warp1 into a memory access waiting queue; once the memory request is completed, the warp is then moved into the runnable queue?
Will the instruction pointer related to warp1 execution be incremented automatically and in parallel to the new warp execution, to annotate that the memory request is completed?
For instance, consider this pseudo code output=input+array[i]; where output and input are both scalar variables mapped into registers, whereas array is saved in the global memory.
To run the above instruction, we need to load the value of array[i] into a (temporary) register before updating output; i.e. the above instruction can be translated into 2 macro assembly instructions: a load of array[i] into reg, followed by output_register = input_register + reg.
I would like to know how the hardware and runtime system handle the execution of the above 2 macro assembly instructions, given that load can't return immediately
I am not sure I understand your questions correctly, so I'll just try to answer them as I read them:
1. Yes, while a memory transaction is in flight further independent instructions will continue to be issued. There isn't necessarily a switch to a different warp though - while instructions from other warps will always be independent, the following instructions from the same warp might be independent as well and the same warp may keep running (i.e. further instructions may be issued from the same warp).
2. No. As explained under 1. the warp can and will continue executing instructions until either the result of the load is needed by a dependent instruction, or a memory fence / barrier instruction requires it to wait for the effect of the store being visible to other threads.
This can go as far as issuing further (independent) load or store instructions, so that multiple memory transactions can be in flight for the same warp at the same time. So the status of a warp after issuing a load/store doesn't change fundamentally and it is not halted until necessary.
3. The instruction pointer will always be incremented automatically (there is no situation where you ever do this manually, nor are there instructions allowing you to do so). However, as 2. implies, this doesn't necessarily indicate that the memory access has been performed - there is separate hardware to track progress of memory accesses.
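The behaviour described in these answers can be sketched with a toy in-order issue model in host-side C++. This is only an illustration under stated assumptions, since the real hardware is undocumented: one instruction issues per cycle, loads take an assumed fixed latency, and an instruction waits only for its source registers. The Instr struct and issue_cycles function are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// One instruction of a single warp. Register numbers are indices;
// -1 marks an unused source operand.
struct Instr {
    int dst;
    int src_a;
    int src_b;
    bool is_load;
};

// Returns the cycle at which each instruction issues. A load does not
// block the warp; only an instruction that reads a pending register waits.
// Write-after-write hazards are ignored to keep the sketch short.
std::vector<int> issue_cycles(const std::vector<Instr>& program,
                              int load_latency,
                              int num_regs = 16) {
    std::vector<int> ready_at(static_cast<std::size_t>(num_regs), 0);
    std::vector<int> issued;
    int cycle = 0;
    for (const Instr& in : program) {
        assert(in.dst >= 0 && in.dst < num_regs);
        int earliest = cycle;
        for (int src : {in.src_a, in.src_b}) {
            if (src >= 0) {
                assert(src < num_regs);
                earliest = std::max(earliest, ready_at[static_cast<std::size_t>(src)]);
            }
        }
        issued.push_back(earliest);
        // The destination becomes readable after the operation's latency.
        ready_at[static_cast<std::size_t>(in.dst)] =
            earliest + (in.is_load ? load_latency : 1);
        cycle = earliest + 1;  // in-order: the next instruction issues later
    }
    return issued;
}
```

For a program consisting of a load into r0, an independent add, a second independent load, and finally an add that reads r0, an assumed load latency of 100 cycles gives issue cycles 0, 1, 2 and 100. The warp keeps issuing, including a second load, and only stalls at the first instruction that actually needs the loaded value.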
Please note that the hardware implementation is completely undocumented by Nvidia. You might find some indications of possible implementations if you search through Nvidia's patent applications.
GPUs up to the Fermi generation (compute capability 2.x) tracked outstanding memory transactions completely in hardware. While undocumented by Nvidia, the common mechanism to track (memory) transactions in flight is scoreboarding.
GPUs from newer generations starting with Kepler (compute capability 3.x) use some assistance in the form of control words embedded in the shader assembly code. While again undocumented, Scott Gray has reverse-engineered these for his Maxas Maxwell assembler. He found that (amongst other things) the control words contain barrier instructions for tracking memory transactions and was kind enough to document his findings on his Control-Codes wiki page.

Are GPU/CUDA cores SIMD ones?

Let's take the nVidia Fermi Compute Architecture. It says:
The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each.
[...]
Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU).
[...]
In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations.
From what I know, and what is unclear for me, is that GPUs execute the threads in so called warps, each warp consists of ~32 threads. Each warp is assigned to only one core (is that true?). So does that mean, that each of the 32 cores of a single SM is a SIMD processor, where a single instruction handles 32 data portions ? If so, then why we say there are 32 threads in a warp, not a single SIMD thread? Why cores are sometimes referred to as scalar processors, not vector processors ?
Each warp is assigned to only one core (is that true?).
No, it's not true. A warp is a logical assembly of 32 threads of execution. To execute a single instruction from a single warp, the warp scheduler must usually schedule 32 execution units (or "cores", although the definition of a "core" is somewhat loose).
Cores are in fact scalar processors, not vector processors. 32 cores (or execution units) are marshalled by the warp scheduler to execute a single instruction, across 32 threads, which is where the "SIMT" moniker comes from.
CUDA "cores" can be thought of as SIMD lanes.
First let's recall that the term "CUDA core" is nVIDIA marketing-speak. These are not cores the same way a CPU has cores. Similarly, "CUDA threads" are not the same as the threads we know on CPUs.
The equivalent of a CPU core on a GPU is a "streaming multiprocessor" (SM): it has its own instruction scheduler/dispatcher, its own L1 cache, its own shared memory etc. It is CUDA thread blocks rather than warps that are assigned to a GPU core, i.e. to a streaming multiprocessor. Within an SM, warps get selected to have instructions scheduled, for the entire warp. From a CUDA perspective, those are 32 separate threads, which are instruction-locked; but that's really no different than saying that a warp is like a single thread, which only executes 32-lane-wide SIMD instructions. Of course this isn't a perfect analogy, but I feel it's pretty sound. Something you don't quite / don't always have on CPU SIMD lanes is masking of which lanes are actively executing: inactive lanes do not take part in the active lanes' setting of register values, memory writes etc.
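To illustrate the "warp as one thread executing 32-lane SIMD instructions with a lane mask" view, here is a small host-side C++ sketch. It is a mental model only, not how the hardware is implemented; WarpRegister, filled and warp_add are names invented for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;

// One architectural register as seen by a whole warp: one value per lane.
using WarpRegister = std::array<int, kWarpSize>;

// Helper: a warp register holding the same value in every lane.
WarpRegister filled(int value) {
    WarpRegister r{};
    r.fill(value);
    return r;
}

// One warp-wide add. Bit i of active_mask says whether lane i executes.
// Inactive lanes keep their previous dst value.
void warp_add(WarpRegister& dst, const WarpRegister& a, const WarpRegister& b,
              std::uint32_t active_mask) {
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (active_mask & (1u << lane)) {
            dst[lane] = a[lane] + b[lane];
        }
    }
}
```

Calling warp_add with mask 0x0000FFFF updates lanes 0 to 15 and leaves lanes 16 to 31 untouched, which is roughly what happens when half of a warp takes one side of a branch.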
I hope this helps you make intuitive sense of things.

Role of Warps in NVIDIA GPU architecture with OpenCL

I'm studying OpenCL concepts as well as the CUDA architecture for a small project, and there is one thing that is unclear to me: the necessity for Warps.
I know a lot of questions have been asked on this subject, however after having read some articles I still don't get the "meaning" of warps.
As far as I understand (speaking for my GPU card which is a Tesla, but I guess this easily translates to other boards):
A work-item corresponds to a CUDA thread, and several of them can be executed by a Streaming Processor (SP). BTW, does an SP treat those WIs in parallel?
Work-items are grouped into Work-groups. Work-groups operate on a Streaming Multiprocessor (SM) and cannot migrate. However, work-items in a work-group can collaborate via shared memory (a.k.a. local memory). One or more work-groups may be executed by a Streaming Multiprocessor. BTW, does an SM treat those WGs in parallel?
Work-items are executed in parallel inside a work-group. However, synchronization is NOT guaranteed, that's why you need concurrent programming primitives, such as barriers.
As far as I understand, all of this is rather a logical view than a 'physical', hardware perspective.
If all of the above is correct, can you help me with the following? Is it true to say that:
1 - Warps execute 32 threads or work-items simultaneously. Thus, they will 'consume' parts of a work-group. And that's why in the end you need stuff like memory fences to synchronize work-items in work groups.
2 - The Warp scheduler allocates the registers for the 32 threads of warp when it becomes active.
3 - Also, are the threads executed in a warp synchronized at all?
Thanks for any input on Warps, and especially why they are necessary in the CUDA architecture.
My best analogy is that a Warp is a vector that is processed in parallel, not unlike an AVX or SSE vector on an Intel CPU. This makes an SM a 32-length vector processor.
Then, to your questions:
Yes, all 32 elements will be run in parallel. Note also that a GPU takes hyperthreading to the extreme: a workgroup will consist of multiple Warps, which are all run more-or-less in parallel. You will need memory fences to synchronise all of that.
Yes, typically all 32 work elements (CUDA: threads) in a Warp will work in parallel. Note that you will typically have multiple registers per work element.
Not guaranteed, AFAIK.