Interpreting compute workload analysis in Nsight Compute - cuda

Compute Workload Analysis displays the utilization of different compute pipelines. I know that in a modern GPU, integer and floating point pipelines are different hardware units and can execute in parallel. However, it is not very clear which pipeline represents which hardware unit for the other pipelines. I also couldn't find any documentation online about abbreviations and interpretations of the pipelines.
My questions are:
1) What are the full names of ADU, CBU, TEX, XU? How do they map to the hardware?
2) Which of the pipelines utilize the same hardware unit (e.g., do FP16, FMA, and FP64 all use the floating point unit)?
3) A warp scheduler in a modern GPU can schedule 2 instructions per cycle (using different pipelines). Which pipelines can be used at the same time (e.g. FMA-ALU, FMA-SFU, ALU-Tensor, etc.)?
P.s.: I am adding the screenshot for those who are not familiar with Nsight Compute.

The Volta (CC 7.0) and Turing (CC 7.5) SM is composed of 4 sub-partitions (SMSP). Each sub-partition contains:
warp scheduler
register file
immediate constant cache
execution units
ALU, FMA, FP16, UDP (7.5+), and XU
FP64 on compute centric parts (GV100)
Tensor units
The SM contains several other partitions that contain execution units and resources shared by the 4 sub-partitions, including:
instruction cache
index constant cache
L1 data cache that is partitioned into tagged RAM and shared memory
execution units
ADU, LSU, TEX
On non-compute parts FP64 and Tensor may be implemented as a shared execution unit
In Volta (CC7.0, 7.2) and Turing (CC7.5) each SM sub-partition can issue 1 instruction per cycle. The instruction can be issued to a local execution unit or the SM shared execution units.
ADU - Address Divergence Unit. The ADU is responsible for per-thread address divergence handling for branches/jumps and indexed constant loads before instructions are forwarded to other execution units.
ALU - Arithmetic Logic Unit. The ALU is responsible for execution of most integer instructions, bit manipulation instructions, and logic instructions.
CBU - Convergence Barrier Unit. The CBU is responsible for barrier, convergence, and branch instructions.
FMA - Floating point Multiply and Accumulate Unit. The FMA is responsible for most FP32 instructions, integer multiply and accumulate instructions, and integer dot product.
FP16 - Paired half-precision floating point unit. The FP16 unit is responsible for execution of paired half-precision floating point instructions.
FP64 - Double precision floating point unit. The FP64 unit is responsible for all FP64 instructions. FP64 is often implemented as several different pipes on NVIDIA GPUs. The throughput varies greatly per chip.
LSU - Load Store Unit. The LSU is responsible for load, store and atomic instructions to global, local, and shared memory.
Tensor (FP16) - Half-precision floating point matrix multiply and accumulate unit.
Tensor (INT) - Integer matrix multiply and accumulate unit.
TEX - Texture Unit. The texture unit is responsible for sampling, load, and filtering instructions on textures and surfaces.
UDP (Uniform) - Uniform Data Path - A scalar unit used to execute instructions where the input and output are identical for all threads in a warp.
XU - Transcendental and Data Type Conversion Unit - The XU is responsible for special functions such as sin, cos, and reciprocal square root as well as data type conversions.
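To make the mapping above concrete, here is a minimal hedged CUDA sketch (mine, not from the original answer); each marked statement typically ends up on one of the pipelines listed above, though the exact SASS and pipeline assignment depend on the architecture and on compiler optimizations.

// Illustrative only, assuming sm_70+; the compiler may reorder or fuse these.
__global__ void pipeline_mix(const float* in, const int* mask, float* out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;

    float a = in[t];                 // LSU: global load (through L1TEX)
    int   m = mask[t] & 0xFF;        // ALU: integer/bitwise logic
    float b = fmaf(a, 2.0f, 1.0f);   // FMA: FP32 fused multiply-add
    float s = __sinf(b);             // XU:  transcendental (MUFU.SIN)
    float c = (float)m;              // XU:  data type conversion (I2F)

    out[t] = b + s + c;              // FMA: FP32 adds; LSU: global store
}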

Related

What are the "long" and "short" scoreboards w.r.t. MIO/L1TEX?

With recent NVIDIA micro-architectures, there's a new (?) taxonomy of warp stall reasons / warp scheduler states.
Two of the items in this taxonomy are:
Short scoreboard - scoreboard dependency on an MIO queue operation.
Long scoreboard - scoreboard dependency on an L1TEX operation.
where, I presume, "scoreboard" is used in the sense of out-of-order execution data dependency tracking (see e.g. here).
My questions:
What do the adjectives "short" or "long" describe? Is it the length of a single scoreboard? Two different scoreboards for the two different kinds of operations?
What's the meaning of this somewhat non-intuitive dichotomy between MIO - some, but not all of which are memory operations; and L1TEX operations, which are all memory operations? Is it a dichotomy w.r.t. stall reasons only or is it about actual hardware?
NVIDIA GPUs have two classifications of instructions:
Fixed latency - math, bitwise, register movement
Variable latency - ld/st to shared, local, global, and texture as well as slow math operations
The Short Scoreboard and Long Scoreboard stalls are reported on instructions dependent on data returned from a variable latency instruction. Short scoreboard stalls are reported for dependencies on variable latency instructions that will not leave the SM (such as slow math like reciprocal square root, or shared memory accesses). Long scoreboard stalls are reported for dependencies that may leave the SM, such as global/local memory accesses and texture fetches.
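As a hedged illustration (not part of the original answer), the kernel below produces both kinds of stall: the first dependent use waits on a long scoreboard (global load through L1TEX), the second on a short scoreboard (shared-memory load through MIO). The names are made up for the example.

// Sketch only; assumes blockDim.x <= 256.
__global__ void scoreboard_demo(const float* g_in, float* g_out)
{
    __shared__ float s_buf[256];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    float v = g_in[i];        // variable latency: global load (L1TEX)
    float a = v * 2.0f;       // consumer waits on a LONG scoreboard for v

    s_buf[t] = a;
    __syncthreads();

    float w = s_buf[(t + 1) % blockDim.x];  // variable latency: shared load (MIO)
    float b = w + 1.0f;       // consumer waits on a SHORT scoreboard for w

    g_out[i] = b;
}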
Detailed descriptions from the Nsight Compute v2020.3.1 Kernel Profiling Guide:
Long Scoreboard
Warp was stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, tex) operation. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality, or by changing the cache configuration, and consider moving frequently used data to shared memory.
Short Scoreboard
Warp was stalled waiting for a scoreboard dependency on a MIO (memory input/output) operation (not to L1TEX). The primary reason for a high number of stalls due to short scoreboards is typically memory operations to shared memory. Other reasons include frequent execution of special math instructions (e.g. MUFU) or dynamic branching (e.g. BRX, JMX). Verify if there are shared memory operations and reduce bank conflicts, if applicable.
MIO vs. L1TEX
MIO and L1TEX are partitions in the NVIDIA SM. The MIO unit is responsible for the shared execution units (shared by 1 or more SM sub-partitions), including lower rate math units (e.g. double precision on a GeForce chip), and for memory input/output. The memory subsystem contains the L1 cache, TEX unit, shared memory unit, and other domain-specific (e.g. graphics) interfaces to the SM. The implementation of the MIO subsystem, including L1, TEX, and shared memory, varies greatly between Kepler, Maxwell-Pascal, and Volta-Ampere. SM sub-partitions (warp schedulers) issue instructions to the shared execution units through instruction queues rather than direct dispatch. For SM 7.0+ there are stall reasons (mio_throttle, lg_throttle, and tex_throttle) that occur if the instruction queues for those units are full.
What is included in the definition of MIO varies by architecture. L1TEX is technically in the MIO partition. L1TEX is complicated in that it has two input interfaces:
The LSU interface is for shared memory, local/global memory (tagged), and special operations such as shuffle and special purpose registers.
The TEX interface is for texture fetches and, on 7.0-8.x, a subset of the slow math operations (e.g. FP64 on a GeForce card). The latter is a little confusing. The slow math units exist for binary compatibility and are not expected to be used at the same time as texture fetches.
The term MIO can be confusing.
The term L1TEX can also be confusing given the two different interfaces. While there are two interfaces, the local/global and texture/surface paths share the same cache lookup stages, the same cache RAM, and the same SM-to-L2 interface, so for many metrics the term L1TEX is used to refer to the unit as a whole.

Number of clock cycles per one instruction CUDA

I am a beginner in CUDA. I am trying to calculate the number of clock cycles per instruction (e.g. addition). https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions only gives the instruction throughput for different arithmetic operations. For example, the throughput in 7.x is 64 for 32-bit floating-point add. So, can I take 64/32 = 2 as the number of clock cycles per instruction? If not, how can I calculate it?
In the general case, the CUDA documentation does not give you enough information to calculate the number of clock cycles that a particular instruction requires. This would be related to the pipeline depth for the instruction (i.e. for the functional unit servicing that instruction) and this is not documented. The throughput table is largely useless for this exercise.
This is one reason why you will find various microbenchmarking papers for CUDA. Here is one such example.
It has to be measured empirically (and carefully), for each architecture of interest, and for each SASS instruction of interest; it is not documented.
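For what it's worth, a common measurement pattern (a hedged sketch of the general technique, not the exact method from the linked paper) is to time a long chain of dependent instructions with the on-chip clock and divide by the chain length:

// Sketch only: approximates the dependent-issue latency of FADD for one thread.
// The compiler can still reorder or fold the chain, so always inspect the SASS.
__global__ void fadd_latency(float seed, float* sink, long long* cycles)
{
    const int N = 1024;                 // length of the dependent chain
    float x = seed;

    long long start = clock64();
    for (int i = 0; i < N; ++i)
        x = x + 1.0f;                   // each add depends on the previous one
    long long stop = clock64();

    *sink = x;                          // keep the chain from being optimized away
    *cycles = stop - start;             // roughly N * latency, plus loop overhead
}

Launch it with <<<1,1>>> and divide the reported cycle count by N on the host; the loop overhead can be estimated by repeating the experiment with different values of N.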

Memory Coalescing vs. Vectorized Memory Access

I am trying to understand the relationship between memory coalescing on NVIDIA GPUs/CUDA and vectorized memory access on x86-SSE/C++.
It is my understanding that:
Memory coalescing is a run-time optimization of the memory controller (implemented in hardware). How many memory transactions are required to fulfill the load/store of a warp is determined at run-time. A load/store instruction of a warp may be issued repeatedly unless there is perfect coalescing.
Memory vectorization is a compile-time optimization. The number of memory transactions for a vectorized load/store is fixed. Each vector load/store instruction is issued exactly once.
Coalescable GPU load/store instructions are more expressive than SSE vector load/store instructions. E.g., a st.global.s32 PTX instruction may store into 32 arbitrary memory locations (warp size 32), whereas a movdqa SSE instruction can only store into a consecutive block of memory.
Memory coalescing in CUDA seems to guarantee efficient vectorized memory access (when accesses are coalescable), whereas on x86-SSE, we have to hope that the compiler actually vectorizes the code (it may fail to do so) or vectorize code manually with SSE intrinsics, which is more difficult for programmers.
Is this correct? Did I miss an important aspect (thread masking, maybe)?
Now, why do GPUs have run-time coalescing? This probably requires extra circuits in hardware. What are the main benefits over compile-time coalescing as in CPUs? Are there applications/memory access patterns that are harder to implement on CPUs because of missing run-time coalescing?
caveat: I don't really know / understand the architecture / microarchitecture of GPUs very well. Some of this understanding is cobbled together from the question + what other people have written in comments / answers here.
The way GPUs let one instruction operate on multiple data is very different from CPU SIMD. That's why they need special support for memory coalescing at all. CPU-SIMD can't be programmed in a way that needs it.
BTW, CPUs have cache to absorb multiple accesses to the same cache line before the actual DRAM controllers get involved. GPUs have cache too, of course.
Yes, memory-coalescing basically does at runtime what short-vector CPU SIMD does at compile time, within a single "core". The CPU-SIMD equivalent would be gather/scatter loads/stores that could optimize to a single wide access to cache for indices that were adjacent. Existing CPUs don't do this: each element accesses cache separately in a gather. You shouldn't use a gather load if you know that many indices will be adjacent; it will be faster to shuffle 128-bit or 256-bit chunks into place. For the common case where all your data is contiguous, you just use a normal vector load instruction instead of a gather load.
The point of modern short-vector CPU SIMD is to feed more work through a fetch/decode/exec pipeline without making it wider in terms of having to decode + track + exec more CPU instructions per clock cycle. Making a CPU pipeline wider quickly hits diminishing returns for most use-cases, because most code doesn't have a lot of ILP.
A general-purpose CPU spends a lot of transistors on instruction-scheduling / out-of-order execution machinery, so just making it wider to be able to run many more uops in parallel isn't viable. (https://electronics.stackexchange.com/questions/443186/why-not-make-one-big-cpu-core).
To get more throughput, we can raise the frequency, raise IPC, and use SIMD to do more work per instruction/uop that the out-of-order machinery has to track. (And we can build multiple cores on a single chip, but cache-coherent interconnects between them + L3 cache + memory controllers are hard). Modern CPUs use all of these things, so we get a total throughput capability of frequency * IPC * SIMD, and times number of cores if we multithread. They aren't viable alternatives to each other, they're orthogonal things that you have to do all of to drive lots of FLOPs or integer work through a CPU pipeline.
This is why CPU SIMD has wide fixed-width execution units, instead of a separate instruction for each scalar operation. There isn't a mechanism for one scalar instruction to flexibly be fed to multiple execution units.
Taking advantage of this requires vectorization at compile time, not just of your loads / stores but also your ALU computation. If your data isn't contiguous, you have to gather it into SIMD vectors either with scalar loads + shuffles, or with AVX2 / AVX512 gather loads that take a base address + vector of (scaled) indices.
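As a concrete host-side illustration of that point (a hedged sketch; the function names are mine), an AVX2 gather touches the cache once per index, while a contiguous vector load is a single wide access:

#include <immintrin.h>

// Gathers 8 floats by index: no runtime coalescing, one cache access per element.
__m256 load_by_index(const float* base, const int* idx)
{
    __m256i vidx = _mm256_loadu_si256((const __m256i*)idx); // 8 x int32 indices
    return _mm256_i32gather_ps(base, vidx, 4);              // scale = sizeof(float)
}

// Contiguous case: a single 256-bit access to L1d (assuming a hit).
__m256 load_contiguous(const float* p)
{
    return _mm256_loadu_ps(p);
}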
But GPU SIMD is different. It's for massively parallel problems where you do the same thing to every element. The "pipeline" can be very lightweight because it doesn't need to support out-of-order exec or register renaming, or especially branching and exceptions. This makes it feasible to just have scalar execution units without needing to handle data in fixed chunks from contiguous addresses.
These are two very different programming models. They're both SIMD, but the details of the hardware that runs them is very different.
Each vector load/store instruction is issued exactly once.
Yes, that's logically true. In practice the internals can be slightly more complicated, e.g. AMD Ryzen splitting 256-bit vector operations into 128-bit halves, or Intel Sandybridge/IvB doing that for just loads+stores while having 256-bit wide FP ALUs.
There's a slight wrinkle with misaligned loads/stores on Intel x86 CPUs: on a cache-line split, the uop has to get replayed (from the reservation station) to do the other part of the access (to the other cache line).
In Intel terminology, the uop for a split load gets dispatched twice, but only issues + retires once.
Aligned loads/stores like movdqa, or movdqu when the memory happens to be aligned at runtime, are just a single access to L1d cache (assuming a cache hit). Unless you're on a CPU that decodes a vector instruction into two halves, like AMD for 256-bit vectors.
But that stuff is purely inside the CPU core for access to L1d cache. CPU <-> memory transactions are in whole cache lines, with write-back L1d / L2 private caches and a shared L3 on modern x86 CPUs (see "Which cache mapping technique is used in intel core i7 processor?"). Intel has had this layout since Nehalem, the start of the i3/i5/i7 series; for AMD, I think Bulldozer introduced L3 caches for them.
In a CPU, it's the write-back L1d cache that basically coalesces transactions into whole cache lines, whether you use SIMD or not.
What SIMD helps with is getting more work done inside the CPU, to keep up with faster memory. Or for problems where the data fits in L2 or L1d cache, to go really fast over that data.
Memory coalescing is related to parallel accesses: when each core in an SM accesses a consecutive memory location, the memory access is optimized.
Conversely, SIMD is a single-core optimization: when a vector register is filled with operands and an SSE operation is performed, the parallelism is inside the CPU core, with one operation being performed on each internal logical unit per clock cycle.
However, you are right: coalesced/uncoalesced memory access is a runtime aspect, while SIMD operations are compiled in, so I don't think the two compare directly.
If I were to draw a parallel, I would compare coalescing in GPUs to memory prefetching in CPUs. That is a very important runtime optimization as well, and I believe it is active behind the scenes when using SSE too.
However, there is nothing similar to coalescing in Intel CPU cores. Because of cache coherency, the best you can do when optimizing parallel memory accesses is to let each core access independent memory regions.
Now, why do GPUs have run-time coalescing?
Graphics processing is optimized for executing a single task in parallel on adjacent elements.
For example, think of performing an operation on every pixel of an image, assigning each pixel to a different core. Now it is clear that you want an optimal path for loading the image that spreads one pixel to each core.
That's why memory coalescing is buried deep in the GPU architecture.
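As a hedged CUDA sketch of that per-pixel pattern (illustrative names, not from the original answer): in the first kernel adjacent threads read adjacent bytes, so a warp's 32 accesses coalesce into a few wide transactions; the strided variant defeats coalescing.

// Coalesced: thread t touches element t; a warp covers one contiguous range.
__global__ void brighten_coalesced(const unsigned char* in, unsigned char* out, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) out[t] = (unsigned char)min(in[t] + 10, 255);
}

// Strided: thread t touches element 32*t; each access lands in a different
// cache line, so the same warp needs many more memory transactions.
__global__ void brighten_strided(const unsigned char* in, unsigned char* out, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int s = 32 * t;
    if (s < n) out[s] = (unsigned char)min(in[s] + 10, 255);
}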

Do CUDA cores have vector instructions?

According to most NVIDIA documentation, CUDA cores are scalar processors and should only execute scalar operations, which get vectorized into 32-component SIMT warps.
But OpenCL has vector types such as uchar8. It has the same size as ulong (64 bit), which can be processed by a single scalar core. If I do operations on a uchar8 vector (for example component-wise addition), will this also map to an instruction on a single core?
If there are 1024 work items in a block (work group), and each work item processes a uchar8, will this effectively process 8192 uchars in parallel?
Edit:
My question was if on CUDA architectures specifically (independently of OpenCL), there are some vector instructions available in "scalar" cores. Because if the core is already capable of handling a 32-bit type, it would be reasonable if it can also handle addition of a 32-bit uchar4 for example, especially since vector operations are often used in computer graphics.
CUDA has "built-in" (i.e. predefined) vector types up to a size of 4 for 4-byte quantities (e.g. int4) and up to a size of 2 for 8-byte quantities (e.g. double2). A CUDA thread has a maximum read/write transaction size of 16 bytes, so these particular size choices tend to line up with that maximum.
These are exposed as typical structures, so you can reference for example .x to access just the first element of a vector type.
Unlike OpenCL, CUDA does not provide built-in operations ("overloads") for basic arithmetic e.g. +, -, etc. for element-wise operations on these vector types. There's no particular reason you couldn't provide such overloads yourself. Likewise, if you wanted a uchar8 you could easily provide a structure definition for such, as well as any desired operator overloads. These could probably be implemented just as you would expect for ordinary C++ code.
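For example, a hedged sketch of such an overload for the built-in uchar4 (CUDA does not provide this itself; a uchar8 could be built analogously, e.g. as a struct of two uchar4):

// Plain C++ element-wise add for CUDA's built-in uchar4; serial per thread.
__host__ __device__ inline uchar4 operator+(uchar4 a, uchar4 b)
{
    return make_uchar4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}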
Probably an underlying question is, then, what is the difference in implementation between CUDA and OpenCL in this regard? If I operate on a uchar8, e.g.
uchar8 v1 = {...};
uchar8 v2 = {...};
uchar8 r = v1 + v2;
what will the difference be in terms of machine performance (or low-level code generation) between OpenCL and CUDA?
Probably not much, for a CUDA-capable GPU. A CUDA core (i.e. the underlying ALU) does not have direct native support for such an operation on a uchar8, and furthermore, if you write your own C++ compliant overload, you're probably going to use C++ semantics for this which will inherently be serial:
r.x = v1.x + v2.x;
r.y = v1.y + v2.y;
...
So this will decompose into a sequence of operations performed on the CUDA core (or in the appropriate integer unit within the CUDA SM). Since the NVIDIA GPU hardware doesn't provide any direct support for an 8-way uchar add within a single core/clock/instruction, there's really no way OpenCL (as implemented on a NVIDIA GPU) could be much different. At a low level, the underlying machine code is going to be a sequence of operations, not a single instruction.
As an aside, CUDA (or PTX, or CUDA intrinsics) does provide for a limited amount of vector operations within a single core/thread/instruction. Some examples of this are:
a limited set of "native" "video" SIMD instructions. These instructions are per-thread, so if used, they allow for "native" support of up to 4x32 = 128 (8-bit) operands per warp, although the operands must be properly packed into 32-bit registers. You can access these from C++ directly via a set of built-in intrinsics. (A CUDA warp is a set of 32 threads, and is the fundamental unit of lockstep parallel execution and scheduling on a CUDA capable GPU.)
a vector (SIMD) multiply-accumulate operation, which is not directly translatable to a single particular elementwise operation overload, the so-called int8 dp2a and dp4a instructions. int8 here is somewhat misleading. It does not refer to an int8 vector type but rather a packed arrangement of 4 8-bit integer quantities in a single 32-bit word/register. Again, these are accessible via intrinsics (a sketch follows this list).
16-bit floating point is natively supported via half2 vector type in cc 5.3 and higher GPUs, for certain operations.
The new Volta tensorCore is something vaguely like a SIMD-per-thread operation, but it operates (warp-wide) on a set of 16x16 input matrices producing a 16x16 matrix result.
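A hedged sketch of that dp4a path (assuming cc 6.1 or higher, where the __dp4a intrinsic is available):

// __dp4a treats each 32-bit operand as four packed 8-bit integers, multiplies
// them pairwise, and adds the sum of the four products to the accumulator c.
__global__ void dp4a_demo(const int* a, const int* b, int* acc, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n)
        acc[t] = __dp4a(a[t], b[t], acc[t]);  // acc += a0*b0 + a1*b1 + a2*b2 + a3*b3
}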
Even with a smart OpenCL compiler that could map certain vector operations into the various operations "natively" supported by the hardware, it would not be complete coverage. There is no operational support for an 8-wide vector (e.g. uchar8) on a single core/thread, in a single instruction, to pick one example. So some serialization would be necessary. In practice, I don't think the OpenCL compiler from NVIDIA is that smart, so my expectation is that you would find such per-thread vector operations fully serialized, if you studied the machine code.
In CUDA, you could provide your own overload for certain operations and vector types, that could be represented approximately in a single instruction. For example a uchar4 add could be performed "natively" with the __vadd4() intrinsic (perhaps included in your implementation of an operator overload.) Likewise, if you are writing your own operator overload, I don't think it would be difficult to perform a uchar8 elementwise vector add using two __vadd4() instructions.
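A hedged sketch of that suggestion, using a hypothetical packed 8-byte type (the name is mine, not a CUDA built-in; a uchar8 of 8 separate bytes would first need to be packed or reinterpreted this way):

// Eight unsigned chars stored as two packed 32-bit words; __vadd4 performs
// four simultaneous per-byte additions per word (no carry across bytes).
struct uchar8_packed { unsigned int lo, hi; };

__device__ inline uchar8_packed operator+(uchar8_packed a, uchar8_packed b)
{
    uchar8_packed r;
    r.lo = __vadd4(a.lo, b.lo);   // bytes 0..3
    r.hi = __vadd4(a.hi, b.hi);   // bytes 4..7
    return r;
}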
If I do operations on a uchar8 vector (for example component-wise addition), will this also map to an instruction on a single core?
AFAIK it'll always be on a single core (instructions from a single kernel / work item don't cross cores, except for special instructions like barriers), but it may be more than one instruction. This depends on whether your hardware supports operations on uchar8 natively. If it does not, then the uchar8 will be broken up into as many pieces as required, and each piece will be processed with a separate instruction.
OpenCL is very "generic" in the sense that it supports many different vector type/size combos, but real-world hardware usually only implements some vector type/size combinations. You can query OpenCL devices for "preferred vector size" which should tell you what's the most efficient for that hardware.
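For reference, a hedged host-side sketch of that query using the standard OpenCL API (error handling omitted):

#include <CL/cl.h>

// Returns the device's preferred vector width for uchar elements.
// A result of 1 suggests vector types will simply be scalarized.
cl_uint preferred_uchar_width(cl_device_id dev)
{
    cl_uint width = 0;
    clGetDeviceInfo(dev, CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR,
                    sizeof(width), &width, NULL);
    return width;
}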

Are GPU/CUDA cores SIMD ones?

Let's take the nVidia Fermi Compute Architecture. It says:
The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each.
[...]
Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU).
[...]
In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations.
From what I know, GPUs execute threads in so-called warps, where each warp consists of ~32 threads. Each warp is assigned to only one core (is that true?). So does that mean that each of the 32 cores of a single SM is a SIMD processor, where a single instruction handles 32 data portions? If so, why do we say there are 32 threads in a warp rather than a single SIMD thread? Why are cores sometimes referred to as scalar processors and not vector processors?
Each warp is assigned to only one core (is that true?).
No, it's not true. A warp is a logical assembly of 32 threads of execution. To execute a single instruction from a single warp, the warp scheduler must usually schedule 32 execution units (or "cores", although the definition of a "core" is somewhat loose).
Cores are in fact scalar processors, not vector processors. 32 cores (or execution units) are marshalled by the warp scheduler to execute a single instruction, across 32 threads, which is where the "SIMT" moniker comes from.
CUDA "cores" can be thought of as SIMD lanes.
First let's recall that the term "CUDA core" is nVIDIA marketing-speak. These are not cores the same way a CPU has cores. Similarly, "CUDA threads" are not the same as the threads we know on CPUs.
The equivalent of a CPU core on a GPU is a "streaming multiprocessor": it has its own instruction scheduler/dispatcher, its own L1 cache, its own shared memory, etc. It is CUDA thread blocks rather than warps that are assigned to a GPU core, i.e. to a streaming multiprocessor. Within an SM, warps get selected to have instructions scheduled, for the entire warp. From a CUDA perspective, those are 32 separate threads, which are instruction-locked; but that's really no different than saying that a warp is like a single thread which only executes 32-lane-wide SIMD instructions. Of course this isn't a perfect analogy, but I feel it's pretty sound. Something you don't quite / don't always have on CPU SIMD lanes is masking of which lanes are actively executing, where inactive lanes will not have the effect of active lanes' setting of register values, memory writes, etc.
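A hedged CUDA sketch of that lane-masking behaviour (illustrative only): inside a divergent branch, __activemask() reports which lanes of the warp are currently executing, and the inactive lanes simply don't perform the branch's writes.

__global__ void lane_mask_demo(int* out)
{
    int lane = threadIdx.x % 32;

    if (lane < 16) {
        // Only lanes 0..15 are active here; the mask is typically 0x0000FFFF.
        unsigned active = __activemask();
        out[threadIdx.x] = (int)active;
    } else {
        out[threadIdx.x] = -1;    // lanes 16..31 take this path instead
    }
}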
I hope this helps you make intuitive sense of things.