Memory Coalescing vs. Vectorized Memory Access - cuda

I am trying to understand the relationship between memory coalescing on NVIDIA GPUs/CUDA and vectorized memory access on x86-SSE/C++.
It is my understanding that:
Memory coalescing is a run-time optimization of the memory controller (implemented in hardware). How many memory transactions are required to fulfill the load/store of a warp is determined at run-time. A load/store instruction of a warp may be issued repeatedly unless there is perfect coalescing.
Memory vectorization is a compile-time optimization. The number of memory transactions for a vectorized load/store is fixed. Each vector load/store instruction is issued exactly once.
Coalescable GPU load/store instructions are more expressive than SSE vector load/store instructions. E.g., a st.global.s32 PTX instruction may store into 32 arbitrary memory locations (warp size 32), whereas a movdqa SSE instruction can only store into a consecutive block of memory.
Memory coalescing in CUDA seems to guarantee efficient vectorized memory access (when accesses are coalescable), whereas on x86-SSE, we have to hope that the compiler actually vectorizes the code (it may fail to do so) or vectorize code manually with SSE intrinsics, which is more difficult for programmers.
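To make the comparison concrete, here is a minimal sketch of the two access patterns I have in mind (hypothetical kernel and function names, illustration only; the SSE part is ordinary host code):

#include <xmmintrin.h>   // SSE intrinsics (host side)

// CUDA: each thread issues a scalar load/store; the hardware decides at
// run time how many memory transactions the warp's 32 accesses require.
__global__ void scale_gpu(const float* in, float* out, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = s * in[i];            // lane k of a warp touches element base + k
}

// x86 SSE: the width (4 floats) is fixed in the instruction at compile time,
// and each vector load/store is issued exactly once.
void scale_sse(const float* in, float* out, float s, int n) {
    __m128 vs = _mm_set1_ps(s);
    for (int i = 0; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_mul_ps(vs, _mm_loadu_ps(in + i)));
}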
Is this correct? Did I miss an important aspect (thread masking, maybe)?
Now, why do GPUs have run-time coalescing? This probably requires extra circuits in hardware. What are the main benefits over compile-time coalescing as in CPUs? Are there applications/memory access patterns that are harder to implement on CPUs because of missing run-time coalescing?

caveat: I don't really know / understand the architecture / microarchitecture of GPUs very well. Some of this understanding is cobbled together from the question + what other people have written in comments / answers here.
The way GPUs let one instruction operate on multiple data is very different from CPU SIMD. That's why they need special support for memory coalescing at all. CPU-SIMD can't be programmed in a way that needs it.
BTW, CPUs have cache to absorb multiple accesses to the same cache line before the actual DRAM controllers get involved. GPUs have cache too, of course.
Yes, memory-coalescing basically does at runtime what short-vector CPU SIMD does at compile time, within a single "core". The CPU-SIMD equivalent would be gather/scatter loads/stores that could optimize to a single wide access to cache for indices that were adjacent. Existing CPUs don't do this: each element accesses cache separately in a gather. You shouldn't use a gather load if you know that many indices will be adjacent; it will be faster to shuffle 128-bit or 256-bit chunks into place. For the common case where all your data is contiguous, you just use a normal vector load instruction instead of a gather load.
The point of modern short-vector CPU SIMD is to feed more work through a fetch/decode/exec pipeline without making it wider in terms of having to decode + track + exec more CPU instructions per clock cycle. Making a CPU pipeline wider quickly hits diminishing returns for most use-cases, because most code doesn't have a lot of ILP.
A general-purpose CPU spends a lot of transistors on instruction-scheduling / out-of-order execution machinery, so just making it wider to be able to run many more uops in parallel isn't viable. (https://electronics.stackexchange.com/questions/443186/why-not-make-one-big-cpu-core).
To get more throughput, we can raise the frequency, raise IPC, and use SIMD to do more work per instruction/uop that the out-of-order machinery has to track. (And we can build multiple cores on a single chip, but cache-coherent interconnects between them + L3 cache + memory controllers are hard). Modern CPUs use all of these things, so we get a total throughput capability of frequency * IPC * SIMD, and times number of cores if we multithread. They aren't viable alternatives to each other, they're orthogonal things that you have to do all of to drive lots of FLOPs or integer work through a CPU pipeline.
This is why CPU SIMD has wide fixed-width execution units, instead of a separate instruction for each scalar operation. There isn't a mechanism for one scalar instruction to flexibly be fed to multiple execution units.
Taking advantage of this requires vectorization at compile time, not just of your loads / stores but also your ALU computation. If your data isn't contiguous, you have to gather it into SIMD vectors either with scalar loads + shuffles, or with AVX2 / AVX512 gather loads that take a base address + vector of (scaled) indices.
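For example, the two CPU-side options for non-contiguous data might look like this (a host-side sketch assuming AVX2; the array and function names are mine):

#include <immintrin.h>   // AVX2 intrinsics

// Contiguous case: one 256-bit load covers 8 adjacent ints.
__m256i load_contiguous(const int* base) {
    return _mm256_loadu_si256((const __m256i*)base);
}

// Non-contiguous case: a gather takes a base address plus a vector of
// (scaled) indices; each of the 8 elements still accesses cache separately.
__m256i load_gathered(const int* base, const int* idx) {
    __m256i vidx = _mm256_loadu_si256((const __m256i*)idx);  // 8 indices
    return _mm256_i32gather_epi32(base, vidx, 4);             // scale = sizeof(int)
}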
But GPU SIMD is different. It's for massively parallel problems where you do the same thing to every element. The "pipeline" can be very lightweight because it doesn't need to support out-of-order exec or register renaming, or especially branching and exceptions. This makes it feasible to just have scalar execution units without needing to handle data in fixed chunks from contiguous addresses.
These are two very different programming models. They're both SIMD, but the details of the hardware that runs them are very different.
Each vector load/store instruction is issued exactly once.
Yes, that's logically true. In practice the internals can be slightly more complicated, e.g. AMD Ryzen splitting 256-bit vector operations into 128-bit halves, or Intel Sandybridge/IvB doing that for just loads+stores while having 256-bit wide FP ALUs.
There's a slight wrinkle with misaligned loads/stores on Intel x86 CPUs: on a cache-line split, the uop has to get replayed (from the reservation station) to do the other part of the access (to the other cache line).
In Intel terminology, the uop for a split load gets dispatched twice, but only issues + retires once.
Aligned loads/stores like movdqa, or movdqu when the memory happens to be aligned at runtime, are just a single access to L1d cache (assuming a cache hit). Unless you're on a CPU that decodes a vector instruction into two halves, like AMD for 256-bit vectors.
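As a small host-side illustration of the aligned vs. line-split cases (hypothetical buffer, sketch only):

#include <emmintrin.h>   // SSE2

alignas(64) static int buf[64];   // 64-byte aligned: buf[0] starts a cache line

void load_demo(void) {
    // Aligned 16-byte load (movdqa-style): a single access to L1d on a hit.
    __m128i a = _mm_load_si128((const __m128i*)&buf[0]);

    // Unaligned load covering bytes 60..75: still one instruction, but it
    // spans two cache lines, so the load uop is dispatched twice internally.
    __m128i b = _mm_loadu_si128((const __m128i*)((const char*)buf + 60));

    (void)a; (void)b;   // silence unused-variable warnings
}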
But that stuff is purely inside the CPU core, for access to L1d cache. CPU <-> memory transactions are in whole cache lines, with write-back private L1d / L2 caches and a shared L3 on modern x86 CPUs (see: Which cache mapping technique is used in intel core i7 processor?). Intel has had this since Nehalem, the start of the i3/i5/i7 series; AMD introduced L3 caches with Bulldozer, I think.
In a CPU, it's the write-back L1d cache that basically coalesces transactions into whole cache lines, whether you use SIMD or not.
What SIMD helps with is getting more work done inside the CPU, to keep up with faster memory. Or for problems where the data fits in L2 or L1d cache, to go really fast over that data.

Memory coalescing is related to parallel accesses: when the cores in an SM each access consecutive memory locations, the memory access is optimized.
Conversely, SIMD is a single-core optimization: when a vector register is filled with operands and an SSE operation is performed, the parallelism is inside the CPU core, with one operation performed on each SIMD lane per clock cycle.
However, you are right: coalesced/uncoalesced memory access is a run-time aspect, while SIMD operations are fixed at compile time. I don't think the two compare well.
If I had to draw a parallel, I would compare coalescing in GPUs to memory prefetching in CPUs. That is a very important run-time optimization as well, and I believe it's active behind the scenes when using SSE, too.
However, there is nothing similar to coalescing in Intel CPU cores. Because of cache coherency, the best you can do to optimize parallel memory accesses is to let each core access independent memory regions.
Now, why do GPUs have run-time coalescing?
Graphics processing is optimized for executing a single task in parallel on adjacent elements.
For example, think of performing an operation on every pixel of an image, assigning each pixel to a different core. Now it's clear that you want an optimal path that loads the image while handing one pixel to each core.
That's why memory coalescing is so deeply built into GPU architectures.
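For instance, the per-pixel kernel might look roughly like this (a hypothetical brightness adjustment, just to show the access pattern):

__global__ void brighten(const unsigned char* in, unsigned char* out,
                         int n_pixels, int delta) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pixels) {
        int v = in[i] + delta;        // adjacent threads read adjacent pixels,
        out[i] = v > 255 ? 255 : v;   // so the warp's loads and stores coalesce
    }
}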

Related

Separate instruction and data memory

I am currently in a Computer Architecture class and this is the one thing majorly stumping me. I asked my professor why we have separate instruction and data memory (consider the single-cycle MIPS data path I'm attaching).
My thoughts:
add extra ports (not an issue of FU reuse, similar to register file implementation but with a port for instructions)
consolidate so that memory could be unified and not go unused
His:
agreed with me on last point
adding ports has a roughly quadratic cost, hurting performance
separate allows more leeway in placement on chip
single-access memory is faster
Could anyone please elaborate on any of these points in more depth, or add anything of their own? I'm still not fully clear on this.
Yes, multi-ported DRAM is an option, but much more expensive, probably more than twice as expensive per byte. (And lower capacity per die area, so available sizes will be smaller).
In practice real CPUs just have split L1d/L1i caches, and unified L2 cache and memory, assuming it's ultimately a von Neumann type of architecture.
We call this "modified Harvard" - the performance advantages of Harvard allowing parallel code-fetch and load/store, except for contention for access to the unified cache or memory. But it's rare to have lots of code cache misses at the same time as data misses, because if you're stalling on code fetch then you'll have bubbles in the pipeline anyway. (Out-of-order exec could hide that better than a single single-cycle design, of course!)
It needs extra sync / pipeline flushing when we want to run machine code that we recently generated / stored, e.g. a JIT compiler, but other than that it has all the advantages of unified memory and the CPU-pipeline advantages of the Harvard split. (You need extra synchronization anyway to run recently-stored code on an ISA that allows deeply pipelined and out-of-order exec implementations, and which fetch code far ahead into buffers in the pipeline to give more room to absorb bubbles).
See also: What does a 'split' cache mean, and how is it useful (if it is)?
See also: Why do L1 caches usually have a split design, while L2 and L3 caches have a unified design?
The first pipelined CPUs had small caches or, in the case of the MIPS R2000, even off-chip caches with only the controllers on-chip. But yes, MIPS R2000 had split I and D caches. Because you don't want code-fetch to conflict with the MEM stage of load or store instructions; that would introduce a structural hazard that would interfere with running 1 instruction per cycle when you don't have cache misses.
In a single-cycle design I guess your cycle would normally be long enough to access memory twice because you aren't overlapping code-fetch and load/store, so you might not even need multi-ported memory?
L1 data caches are already multi-ported on modern high-performance CPUs, allowing them to commit a store from the store buffer in the same cycle as doing 1 or 2 loads on load execution units.
Having even more ports to also allow code-fetch from it would be even more expensive in terms of power, vs. two slightly smaller caches.
If you think of the Instruction Memory and Data Memory as caches, as in being backed by a unified main memory, then you have the traditional Modified Harvard Architecture, which has some of the advantages of both the Von Neumann and the Harvard Architecture together.
One point you didn't seem to raise is that separation of the two memories (caches) allows for simultaneous access, so an instruction can be read while a data memory is read or written in the same cycle.  This would be more difficult with a unified cache/memory.  This advantage applies to single cycle and pipelined processors since in both designs there is overlap between instruction fetch (IF stage in pipelined) and memory operations (MEM stage in pipelined).
Further, as the Instruction Memory is read-only it has less circuitry.  In the case of being caches, the IM has no dirty bits, no write back, etc..  Further, the IM and DM can have different associativity.
In the case of not being caches, it is not clear how the computer system loads the instruction memory; perhaps it is some fast ROM, or it is loaded by an external device from ROM into IM. A number of embedded systems have Instruction Tightly Integrated Memory and/or Data Tightly Integrated Memory (ITIM/DTIM) that do not act as caches and are not necessarily backed by main memory, instead serving as the primary memories.

CUDA Compute Capability 2.0. Global memory access pattern

Since CUDA Compute Capability 2.0 (Fermi), global memory access works through a 768 KB L2 cache. It looks like developers no longer need to care about global memory banks. But global memory is still very slow, so the right access pattern is important. Now the point is to use/reuse L2 as much as possible. My question is: how? I would be thankful for some detailed info on how L2 works and how I should organize and access global memory if I need, for example, an array of 100-200 elements per thread.
L2 cache helps in some ways, but it does not obviate the need for coalesced access of global memory. In a nutshell, coalesced access means that for a given read (or write) instruction, individual threads in a warp are reading (or writing) adjacent, contiguous locations in global memory, preferably that are aligned as a group on a 128-byte boundary. This will result in the most effective utilization of the available memory bandwidth.
In practice this is often not difficult to accomplish. For example:
int idx = threadIdx.x + (blockDim.x * blockIdx.x);  // consecutive threads get consecutive indices
int mylocal = global_array[idx];                    // so adjacent threads read adjacent ints
will give coalesced (read) access across all the threads in a warp, assuming global_array is allocated in an ordinary fashion using cudaMalloc in global memory. This type of access makes 100% usage of the available memory bandwidth.
A key takeaway is that memory transactions ordinarily occur in 128-byte blocks, which happens to be the size of a cache line. If you request even one of the bytes in a block, the entire block will be read (and stored in L2, normally). If you later read other data from that block, it will normally be serviced from L2, unless it has been evicted by other memory activity. This means that the following sequence:
int mylocal1 = global_array[0];
int mylocal2 = global_array[1];
int mylocal3 = global_array[31];
would all typically be serviced from a single 128-byte block. The first read for mylocal1 will trigger the 128 byte read. The second read for mylocal2 would normally be serviced from the cached value (in L2 or L1) not by triggering another read from memory. However, if the algorithm can be suitably modified, it's better to read all your data contiguously from multiple threads, as in the first example. This may be just a matter of clever organization of data, for example using Structures of Arrays rather than Arrays of structures.
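For instance, here is a sketch of that difference (hypothetical point data, illustration only):

// Array of Structures: thread i reads pt[i].x, so consecutive threads are
// 12 bytes apart and a warp's reads are spread over several 128-byte lines.
struct PointAoS { float x, y, z; };
__global__ void read_aos(const PointAoS* pt, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = pt[i].x;
}

// Structure of Arrays: thread i reads x[i], so a warp reads 32 adjacent
// floats -- one 128-byte line, fully coalesced.
__global__ void read_soa(const float* x, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = x[i];
}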
In many respects, this is similar to CPU cache behavior. The concept of a cache line is similar, along with the behavior of servicing requests from the cache.
Fermi L1 and L2 can support write-back and write-through. L1 is per-SM, and is configurably split with shared memory as either 16KB L1 (and 48KB shared memory) or 48KB L1 (and 16KB shared memory). L2 is unified across the device and is 768KB.
Some advice I would offer is to not assume that the L2 cache just fixes sloppy memory accesses. The GPU caches are much smaller than equivalent caches on CPUs, so it's easier to get into trouble there. A general piece of advice is simply to code as if the caches were not there. Rather than CPU oriented strategies like cache-blocking, it's usually better to focus your coding effort on generating coalesced accesses and then possibly make use of shared memory in some specific cases. Then for the inevitable cases where we can't make perfect memory accesses in all situations, we let the caches provide their benefit.
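As an example of the "shared memory in some specific cases" point, a common pattern is to stage a tile in shared memory so that an otherwise strided access stays on-chip. A minimal transpose sketch (assumes a square matrix whose width is a multiple of 32 and a 32x32 thread block; not tuned):

#define TILE 32
__global__ void transpose_tile(const float* in, float* out, int width) {
    __shared__ float tile[TILE][TILE + 1];                // +1 pads away bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced global read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;                  // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced global write
}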
You can get more in-depth guidance by looking at some of the available NVIDIA webinars. For example, the Global Memory Usage & Strategy webinar (and slides) or the CUDA Shared Memory & Cache webinar would be instructive for this topic. You may also want to read the Device Memory Access section of the CUDA C Programming Guide.

CUDA fft - cooley tukey, how is parallelism exploited?

I know how the FFT implementation works (the Cooley-Tukey algorithm), and I know that there's a CUFFT CUDA library to compute the 1D or 2D FFT quickly, but I'd like to know how CUDA parallelism is exploited in the process.
Is it related to the butterfly computation? (something like each thread loads part of the data into shared memory and then each thread computes an even term or an odd term?)
I do not think they use the Cooley-Tukey algorithm, because its index-permutation phase makes it not very convenient for shared-memory architectures. Additionally, this algorithm works with power-of-two memory strides, which is also not good for memory coalescing. Most likely they use some formulation of the Stockham self-sorting FFT: for example, Bailey's algorithm.
As for the implementation, you are right: usually one splits a large FFT into several smaller ones which fit perfectly within one thread block. In my work, I used 512- or 1024-point FFTs (completely unrolled, of course) per thread block with 128 threads. Typically, you do not work with a classical radix-2 algorithm on the GPU due to the large amount of data transfer required. Instead, one chooses a radix-8 or even radix-16 algorithm so that each thread performs one large "butterfly" at a time. For example implementations, you can also visit Vasily Volkov's page, or check this "classic" paper.

What happens when all threads of a warp read the same global memory?

I want to know what happens when all threads of a warp read the same 32-bit address of global memory. How many memory requests are there? Is there any serialization? The GPU is a Fermi card, and the programming environment is CUDA 4.0.
Besides, can anybody explain the concept of bus utilization? What is the difference between a caching load and a non-caching load? I saw the concept in http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.
Related: All threads in warp accessing same address in global memory
I could answer your questions off the top of my head for AMD GPUs. For Nvidia, googling found the answers quickly enough.
I want to know what happens when all threads of a warp read the same 32-bit address of global memory. How many memory requests are there? Is there any serialization? The GPU is a Fermi card, and the programming environment is CUDA 4.0.
http://developer.download.nvidia.com/CUDA/training/NVIDIA_GPU_Computing_Webinars_Best_Practises_For_OpenCL_Programming.pdf, dated 2009, says:
Coalescing:
Global memory latency: 400-600 cycles. The single most important performance consideration!
Global memory access by threads of a half warp can be coalesced to one transaction for word of size 8-bit, 16-bit, 32-bit, 64-bit or two transactions for 128-bit.
Global memory can be viewed as composing aligned segments of 16 and 32 words.
Coalescing in Compute Capability 1.0 and 1.1: the k-th thread in a half warp must access the k-th word in a segment; however, not all threads need to participate.
Coalescing in Compute Capability 1.2 and 1.3: coalescing for any pattern of access that fits into a segment size.
So, it sounds like having all threads of a warp access the same 32-bit address of global memory will work as well as could be hoped for, in anything >= Compute Capability 1.2. But not for 1.0 and 1.1.
Your card is okay.
I must admit that I have not tested this for Nvidia. I have tested it for AMD.
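For what it's worth, the access pattern being asked about is just this (hypothetical kernel, sketch only):

__global__ void broadcast_read(const int* global_array, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = global_array[0];   // every thread of the warp reads the same 32-bit word;
    out[i] = v;                // the 32 identical addresses should collapse into one request
}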
Difference between caching and non-caching loads
To start off, look at slide 4 of the presentation you refer to, http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.
I.e. the slide titled "Differences between CPU & GPUs" - that says that CPUs have huge caches, and GPUs don't.
A few years ago such a slide might have said that GPUs don't have any caches at all. However, GPUs have begun to add more and more cache, and/or to switch more and more from local memory to cache.
I am not sure if you understand what a "cache" is in computer architecture. It's a big topic, so I will only provide a short answer.
Basically, a cache is like local memory. Both cache and local memory are closer to the processor or GPU than DRAM main memory - whether that be the GPU's private DRAM or the CPU's system memory. Nvidia calls DRAM main memory "global memory". Slide 9 illustrates this.
Both cache and local memory are closer to the GPU than DRAM global memory: on slide 9 they are drawn as being inside the same chip as the GPU, whereas the DRAMs are separate chips. This can have several good effects, on latency, throughput, power - and, yes, bus utilization (related to bandwidth).
Latency: global memory is 400-800 cycles away. This means that if you only had one warp in your application, it would only execute one memory operation every 400-800 cycles. This means that, in order not to slow down, you need many threads/warps producing memory requests that can be run in parallel, i.e. that have high MLP (Memory Level Parallelism). Fortunately graphics usually does this. The caches are closer, so will have lower latency. Your slides do not say what it is, but other places say 50-200 cycles, 4-8X faster than global memory. This translates to needing fewer threads&warps to avoid slowing down.
Throughput/Bandwidth: there is typically more bandwidth to local memory and/or cache than to DRAM global memory. Your slides say 1+ TB/s versus 177 GB/s - i.e. cache and local memory are more than 5X faster. This higher bandwidth could translate to significantly higher framerates.
Power: you save a lot of power going to cache or local memory rather than to DRAM global memory. This may not matter to a desktop gaming PC, but it matters to a laptop or a tablet PC. Actually, it matters even to a desktop gaming PC, because less power means it can be (over)clocked faster.
OK, so local and cache memory are similar in the above? What's the difference?
Basically, it is easier to program a cache than it is local memory. Very good, expert, ninja programmers are needed to manage local memory properly, copying stuff from global memory as needed and flushing it out. Whereas cache memory is much easier to manage, because you just do a cached load, and the memory is put in cache automatically, where it will be accessed faster the next time around.
But caches have a downside as well.
First, they actually burn a bit more power than local memory - or they would, if there were actually separate local and cache memories. However, in Fermi, the local memory may be configured as cache, and vice versa. (For years GPU folks said "we don't need no stinking cache - cache tags and other overhead are just wasteful.")
More importantly, caches tend to operate on cache lines - but not all programs do. This leads to the bus utilization issue you mention. If a warp accesses all words in a cache line, great. But if a warp only accesses 1 word in a cache line, i.e. one 4-byte word, and then skips 124 bytes, then 128 bytes of data are transferred over the bus but only 4 bytes are used. I.e. >96% of the bus bandwidth is being wasted. This is low bus utilization.
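For example, an access pattern with exactly that problem might be (sketch, hypothetical stride):

__global__ void sparse_read(const int* in, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread reads one 4-byte word that is 128 bytes away from its
    // neighbour's, so every read drags in a full 128-byte line but uses 4 bytes of it.
    out[i] = in[i * 32];
}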
Whereas the very next slide shows that a non-caching load, such as one might use to load data into local memory, would transfer only 32 bytes, so "only" 28 bytes out of 32 are wasted. In other words, the non-caching loads could be 4X more efficient, 4X faster, than the cached loads.
Then why not use non-caching loads everywhere? Because they are harder to program - it requires expert ninja programmers. And caches work pretty well much of the time.
So, instead of paying your expert ninja programmers to spend a lot of time optimizing all of the code to use non-cache loads and hand-managed local memory - instead you do the easy stuff using cached loads, and you let the highly paid expert ninja programmers concentrate on the stuff that the cache does not work well for.
Besides: nobody likes admitting it, but oftentimes the cache does better than the expert ninja programmers.
Hope this helps.

Why do GPU based algorithms perform faster

I just implemented an algorithm on the GPU that computes the difference between consecutive indices of an array. I compared it with a CPU-based implementation and noticed that for large arrays, the GPU-based implementation performs faster.
I am curious WHY the GPU-based implementation performs faster. I know the surface reasoning that a GPU has several cores and can thus do the operation in parallel, i.e., instead of visiting each index sequentially, we can assign a thread to compute the difference for each index.
But can someone give me a deeper reason why GPUs perform faster? What is so different about their architecture that they can beat a CPU-based implementation?
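For reference, the kind of kernel I mean is roughly this (a sketch; the names are mine):

__global__ void adjacent_diff(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1)
        out[i] = in[i + 1] - in[i];   // each thread handles one index
}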
They don't perform faster, generally.
The point is: Some algorithms fit better into a CPU, some fit better into a GPU.
The execution model of GPUs differs (see SIMD), the memory model differs, the instruction set differs... The whole architecture is different.
There is no obvious way to compare a CPU versus a GPU. You can only discuss whether (and why) CPU implementation A of an algorithm is faster or slower than GPU implementation B of that algorithm.
This ended up kind of vague, so a tip of the iceberg of concrete reasons: the strong side of CPUs is random memory access, branch prediction, etc. A GPU excels when there's a high amount of computation with high data locality, so that your implementation can achieve a nice compute-to-memory-access ratio. SIMD makes GPU implementations slower than CPUs where there's a lot of unpredictable branching to many code paths, for example.
The real reason is that a GPU not only has several cores, but it has many cores, typically hundreds of them! Each GPU core however is much slower than a low-end CPU.
But the programming model is not at all like that of multi-core CPUs. So most programs cannot be ported to, or benefit from, GPUs.
While some answers have already been given here and this is an old thread, I just thought I'd add this for posterity and what not:
The main reason that CPUs and GPUs differ in performance so much for certain problems is design decisions about how to allocate the chip's resources. CPUs devote much of their chip space to large caches, instruction decoders, peripheral and system management, etc. Their cores are much more complicated and run at much higher clock rates (which produces more heat per core that must be dissipated). By contrast, GPUs devote their chip space to packing as many floating-point ALUs onto the chip as they can possibly get away with.
The original purpose of GPUs was to multiply matrices as fast as possible (because that is the primary type of computation involved in graphics rendering). Since matrix multiplication is an embarrassingly parallel problem (i.e., each output value is computed completely independently of every other output value) and the code path for each of those computations is identical, chip space can be saved by having several ALUs follow the instructions decoded by a single instruction decoder, since they're all performing the same operations at the same time. By contrast, each of a CPU's cores must have its own separate instruction decoder, since the cores are not following identical code paths, which makes each of a CPU's cores much larger on the die than a GPU's cores.
Since the primary computations performed in matrix multiplication are floating-point multiplication and floating-point addition, GPUs are implemented such that each of these is a single-cycle operation and, in fact, they even have a fused multiply-add instruction that multiplies two numbers and adds the result to a third number in a single cycle. This is much faster than a typical CPU, where floating-point multiplication is often a many-cycle operation. Again, the trade-off here is that the chip space is devoted to the floating-point math hardware, and other instructions (such as control flow) are often much slower per core than on a CPU, or sometimes don't even exist on a GPU at all.
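As a small illustration of the fused multiply-add point, the inner loop of a (hypothetical, deliberately naive) matrix-multiply kernel is essentially nothing but FMAs:

__global__ void matmul_naive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc = fmaf(A[row * n + k], B[k * n + col], acc);   // one fused multiply-add per step
        C[row * n + col] = acc;
    }
}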
Also, since GPU cores run at much lower clock rates than typical CPU cores and don't contain as much complicated circuitry, they don't produce as much heat per core (or use as much power per core.) This allows more of them to be packed into the same space without overheating the chip and also allows a GPU with 1,000+ cores to have similar power and cooling requirements to a CPU with only 4 or 8 cores.