What happens when all threads of a warp read the same global memory address? - cuda

I want to know what happens when all threads of a warp read the same 32-bit address of global memory. How many memory requests are there? Is there any serialization? The GPU is a Fermi card, and the programming environment is CUDA 4.0.
Besides, can anybody explain the concept of bus utilization? What is the difference between a caching load and a non-caching load? I saw these concepts in http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.

All threads in a warp accessing the same address in global memory
I could answer your questions off the top of my head for AMD GPUs. For Nvidia, googling found the answers quickly enough.
I want to know what happens when all threads of a warp read the same 32-bit address of global memory. How many memory requests are there? Is there any serialization? The GPU is a Fermi card, the programming environment is CUDA 4.0.
http://developer.download.nvidia.com/CUDA/training/NVIDIA_GPU_Computing_Webinars_Best_Practises_For_OpenCL_Programming.pdf, dated 2009, says:
Coalescing:
Global memory latency: 400-600 cycles. The single most important performance consideration!
Global memory accesses by the threads of a half warp can be coalesced into one transaction for words of size 8, 16, 32, or 64 bits, or two transactions for 128 bits.
Global memory can be viewed as composed of aligned segments of 16 and 32 words.
Coalescing in Compute Capability 1.0 and 1.1: the k-th thread in a half warp must access the k-th word in a segment; however, not all threads need to participate.
Coalescing in Compute Capability 1.2 and 1.3: coalescing for any pattern of accesses that fits into a segment size.
So, it sounds like having all threads of a warp access the same 32-bit address of global memory will work as well as could be hoped for, in anything >= Compute Capability 1.2. But not for 1.0 and 1.1.
Your card is okay.
I must admit that I have not tested this for Nvidia. I have tested it for AMD.
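As a concrete illustration, here is a minimal sketch of the access pattern in question (the kernel name and the trivial arithmetic are just for illustration). On compute capability >= 1.2, and on your Fermi card, the hardware should notice that all 32 addresses of the warp fall in the same segment/cache line, serve them with a single memory transaction, and broadcast the value to every thread - no serialization.

    __global__ void broadcastRead(const int *data, int *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int v = data[0];    // every thread in every warp reads the same 32-bit word
        out[tid] = v + tid; // use the value so the load is not optimized away
    }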
Difference between caching and non-caching loads
To start off, look at slide 4 of the presentation you refer to, http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.
That is, the slide titled "Differences between CPU & GPUs", which says that CPUs have huge caches and GPUs don't.
A few years ago such a slide might have said that GPUs don't have any caches at all. However, GPUs have begun to add more and more cache, and/or to switch more and more local memory to cache.
I am not sure if you understand what a "cache" is in computer architecture. It's a big topic, so I will only provide a short answer.
Basically, a cache is like local memory. Both cache and local memory are closer to the processor or GPU than DRAM main memory, whether that is the GPU's private DRAM or the CPU's system memory. Nvidia calls DRAM main memory global memory. Slide 9 illustrates this.
Both cache and local memory are closer to the GPU than DRAM global memory: on slide 9 they are drawn as being inside the same chip as the GPU, whereas the DRAMs are separate chips. This can have several good effects, on latency, throughput, power - and, yes, bus utilization (related to bandwidth).
Latency: global memory is 400-800 cycles away. This means that if you had only one warp in your application, it would only execute one memory operation every 400-800 cycles. This means that, in order not to slow down, you need many threads/warps producing memory requests that can be run in parallel, i.e. that have high MLP (Memory Level Parallelism). Fortunately graphics usually does this. The caches are closer, so they have lower latency. Your slides do not say what that latency is, but other places say 50-200 cycles, 4-8X faster than global memory. This translates to needing fewer threads/warps to avoid slowing down.
Throughput/Bandwidth: there is typically more bandwidth to local memory and/or cache than to DRAM global memory. Your slides say 1+ TB/s versus 177 GB/s - i.e. cache and local memory are more than 5X faster. This higher bandwidth could translate to significantly higher framerates.
Power: you save a lot of power going to cache or local memory rather than to DRAM global memory. This may not matter to a desktop gaming PC, but it matters to a laptop or a tablet PC. Actually, it matters even to a desktop gaming PC, because less power means it can be (over)clocked faster.
OK, so local and cache memory are similar in the above? What's the difference?
Basically, it is easier to program with a cache than with local memory. Very good, expert, ninja programmers are needed to manage local memory properly, copying stuff in from global memory as needed and flushing it out. Whereas cache memory is much easier to manage, because you just do a cached load, and the data is put in cache automatically, where it will be accessed faster the next time around.
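To make the contrast concrete, here is a rough sketch of the two styles for a simple 3-point sum (the kernel names are made up, and a block size of 256 is assumed): a cached-load version where the hardware does the work, versus hand-managed staging into CUDA __shared__ memory, which is the on-chip scratchpad I am calling "local memory" above.

    // Cached-load version: just read global memory and let the cache keep
    // recently used lines around for the neighbouring accesses.
    __global__ void sum3_cached(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1)
            out[i] = in[i - 1] + in[i] + in[i + 1];   // reuse handled by the cache automatically
    }

    // Hand-managed version: the programmer stages a tile (plus halo) into
    // on-chip __shared__ memory and synchronizes explicitly.
    __global__ void sum3_shared(const float *in, float *out, int n)
    {
        __shared__ float tile[256 + 2];                 // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x + 1;

        if (i < n)                     tile[t] = in[i];
        if (threadIdx.x == 0 && i > 0) tile[0] = in[i - 1];
        if (threadIdx.x == blockDim.x - 1 && i < n - 1)
                                       tile[t + 1] = in[i + 1];
        __syncthreads();               // the copy and the synchronization are the programmer's job

        if (i > 0 && i < n - 1)
            out[i] = tile[t - 1] + tile[t] + tile[t + 1];
    }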
But caches have a downside as well.
First, they actually burn a bit more power than local memory - or they would, if there were actually separate local and cache memories. However, in Fermi, the local memory may be configured as cache, and vice versa. (For years GPU folks said "we don't need no stinking cache - cache tags and other overhead are just wasteful.")
More importantly, caches tend to operate on cache lines - but not all programs do. This leads to the bus utilization issue you mention. If a warp accesses all the words in a cache line, great. But if a warp only accesses 1 word in a cache line, i.e. one 4-byte word, and then skips 124 bytes, then 128 bytes of data are transferred over the bus but only 4 bytes are used. I.e. >96% of the bus bandwidth is being wasted. This is low bus utilization.
Whereas the very next slide shows that a non-caching load, such as one might use to load data into local memory, would transfer only 32 bytes, so "only" 28 bytes out of 32 are wasted. In other words, the non-caching loads could be 4X more efficient, 4X faster, than the cached loads.
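A hypothetical kernel showing the worst-case stride those slides describe: each thread touches one 4-byte word and then skips the rest of a 128-byte line. With cached loads, every access drags in a full 128-byte line of which only a fraction is used; on Fermi you can ask the compiler for non-caching (L2-only, 32-byte) loads instead, e.g. by compiling with nvcc -Xptxas -dlcm=cg, which is, if I remember the flag correctly, how one selects the non-caching path shown on that slide.

    __global__ void stridedRead(const int *in, int *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int idx = tid * 32;         // 32 ints = 128 bytes apart: one word used per cache line
        if (idx < n)
            out[tid] = in[idx];     // cached load: 128 bytes moved, 4 bytes used (~3% utilization)
                                    // non-caching load: 32 bytes moved, 4 bytes used (~12.5%)
    }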
Then why not use non-caching loads everywhere? Because they are harder to program - it requires expert ninja programmers. And caches work pretty well much of the time.
So, instead of paying your expert ninja programmers to spend a lot of time optimizing all of the code to use non-caching loads and hand-managed local memory, you do the easy stuff using cached loads, and you let the highly paid expert ninja programmers concentrate on the stuff that the cache does not handle well.
Besides: nobody likes admitting it, but oftentimes the cache does better than the expert ninja programmers.
Hope this helps.

Related

Separate instruction and data memory

I am currently in a Computer Architecture class and this is the one thing majorly stumping me. I asked my professor why we have separate instruction and data memory (consider the single-cycle MIPS data path I'm attaching).
My thoughts:
add extra ports (not an issue of FU reuse, similar to register file implementation but with a port for instructions)
consolidate so that memory could be unified and not go unused
His:
agreed with me on last point
adding ports has roughly quadratic cost and hurts performance
separate allows more leeway in placement on chip
single-access memory is faster
Could anyone please elaborate on any of these points in more depth, or add anything of their own? I'm still not fully clear on this.
Yes, multi-ported DRAM is an option, but much more expensive, probably more than twice as expensive per byte. (And lower capacity per die area, so available sizes will be smaller).
In practice real CPUs just have split L1d/L1i caches, and unified L2 cache and memory, assuming it's ultimately a von Neumann type of architecture.
We call this "modified Harvard" - the performance advantages of Harvard allowing parallel code-fetch and load/store, except for contention for access to the unified cache or memory. But it's rare to have lots of code cache misses at the same time as data misses, because if you're stalling on code fetch then you'll have bubbles in the pipeline anyway. (Out-of-order exec could hide that better than a single single-cycle design, of course!)
It needs extra sync / pipeline flushing when we want to run machine code that we recently generated / stored, e.g. a JIT compiler, but other than that it has all the advantages of unified memory and the CPU-pipeline advantages of the Harvard split. (You need extra synchronization anyway to run recently-stored code on an ISA that allows deeply pipelined and out-of-order exec implementations, and which fetch code far ahead into buffers in the pipeline to give more room to absorb bubbles).
What does a 'split' cache mean? And how is it useful (if it is)?
L1 caches usually have a split design, but L2 and L3 caches have a unified design. Why?
The first pipelined CPUs had small caches or in the case of MIPS R2000 even off-chip caches with only the controllers on-chip. But yes, MIPS R2000 had split I and D cache. Because you don't want code-fetch to conflict with the MEM stage of load or store instructions; that would introduce a structural hazard that would interfere with running 1 instruction per cycle when you don't have cache misses.
In a single-cycle design I guess your cycle would normally be long enough to access memory twice because you aren't overlapping code-fetch and load/store, so you might not even need multi-ported memory?
L1 data caches are already multi-ported on modern high-performance CPUs, allowing them to commit a store from the store buffer in the same cycle as doing 1 or 2 loads on load execution units.
Having even more ports to also allow code-fetch from it would be even more expensive in terms of power, vs. two slightly smaller caches.
If you think of the Instruction Memory and Data Memory as caches, as in being backed by a unified main memory, then you have the traditional Modified Harvard Architecture, which has some of the advantages of both the Von Neumann and the Harvard Architecture together.
One point you didn't seem to raise is that separation of the two memories (caches) allows for simultaneous access, so an instruction can be read while a data memory is read or written in the same cycle.  This would be more difficult with a unified cache/memory.  This advantage applies to single cycle and pipelined processors since in both designs there is overlap between instruction fetch (IF stage in pipelined) and memory operations (MEM stage in pipelined).
Further, as the Instruction Memory is read-only, it needs less circuitry. In the case of being caches, the IM has no dirty bits, no write back, etc. Further, the IM and DM can have different associativity.
In the case of not being caches, it is not clear how the computer system loads the instruction memory, perhaps it is some fast ROM or is loaded by an external device from ROM into IM.  A number of embedded systems have Instruction Tightly Integrated Memory (and/or Data memory ITIM/DTIM) that then do not act as caches and are not necessarily backed by main memory, instead serving as the primary memories.

Memory Coalescing vs. Vectorized Memory Access

I am trying to understand the relationship between memory coalescing on NVIDIA GPUs/CUDA and vectorized memory access on x86-SSE/C++.
It is my understanding that:
Memory coalescing is a run-time optimization of the memory controller (implemented in hardware). How many memory transactions are required to fulfill the load/store of a warp is determined at run-time. A load/store instruction of a warp may be issued repeatedly unless there is perfect coalescing.
Memory vectorization is a compile-time optimization. The number of memory transactions for a vectorized load/store is fixed. Each vector load/store instruction is issued exactly once.
Coalescable GPU load/store instructions are more expressive than SSE vector load/store instructions. E.g., a st.global.s32 PTX instruction may store into 32 arbitrary memory locations (warp size 32), whereas a movdqa SSE instruction can only store into a consecutive block of memory.
Memory coalescing in CUDA seems to guarantee efficient vectorized memory access (when accesses are coalescable), whereas on x86-SSE, we have to hope that the compiler actually vectorizes the code (it may fail to do so) or vectorize code manually with SSE intrinsics, which is more difficult for programmers.
Is this correct? Did I miss an important aspect (thread masking, maybe)?
Now, why do GPUs have run-time coalescing? This probably requires extra circuits in hardware. What are the main benefits over compile-time coalescing as in CPUs? Are there applications/memory access patterns that are harder to implement on CPUs because of missing run-time coalescing?
caveat: I don't really know / understand the architecture / microarchitecture of GPUs very well. Some of this understanding is cobbled together from the question + what other people have written in comments / answers here.
The way GPUs let one instruction operate on multiple data is very different from CPU SIMD. That's why they need special support for memory coalescing at all. CPU-SIMD can't be programmed in a way that needs it.
BTW, CPUs have cache to absorb multiple accesses to the same cache line before the actual DRAM controllers get involved. GPUs have cache too, of course.
Yes, memory-coalescing basically does at runtime what short-vector CPU SIMD does at compile time, within a single "core". The CPU-SIMD equivalent would be gather/scatter loads/stores that could optimize to a single wide access to cache for indices that were adjacent. Existing CPUs don't do this: each element accesses cache separately in a gather. You shouldn't use a gather load if you know that many indices will be adjacent; it will be faster to shuffle 128-bit or 256-bit chunks into place. For the common case where all your data is contiguous, you just use a normal vector load instruction instead of a gather load.
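To make the analogy concrete, here is a minimal sketch of the GPU side (kernel name and the scale factor are just for illustration): each thread issues what looks like an independent scalar load, and the hardware merges the warp's 32 contiguous addresses into one wide transaction at run time. The compile-time CPU equivalent would be a single 128- or 256-bit vector load covering the same bytes.

    __global__ void scaledCopy(const float *a, float *b, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            b[i] = s * a[i];   // per-thread scalar load/store; coalesced per warp by the hardware
    }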
The point of modern short-vector CPU SIMD is to feed more work through a fetch/decode/exec pipeline without making it wider in terms of having to decode + track + exec more CPU instructions per clock cycle. Making a CPU pipeline wider quickly hits diminishing returns for most use-cases, because most code doesn't have a lot of ILP.
A general-purpose CPU spends a lot of transistors on instruction-scheduling / out-of-order execution machinery, so just making it wider to be able to run many more uops in parallel isn't viable. (https://electronics.stackexchange.com/questions/443186/why-not-make-one-big-cpu-core).
To get more throughput, we can raise the frequency, raise IPC, and use SIMD to do more work per instruction/uop that the out-of-order machinery has to track. (And we can build multiple cores on a single chip, but cache-coherent interconnects between them + L3 cache + memory controllers are hard). Modern CPUs use all of these things, so we get a total throughput capability of frequency * IPC * SIMD, and times number of cores if we multithread. They aren't viable alternatives to each other, they're orthogonal things that you have to do all of to drive lots of FLOPs or integer work through a CPU pipeline.
This is why CPU SIMD has wide fixed-width execution units, instead of a separate instruction for each scalar operation. There isn't a mechanism for one scalar instruction to flexibly be fed to multiple execution units.
Taking advantage of this requires vectorization at compile time, not just of your loads / stores but also your ALU computation. If your data isn't contiguous, you have to gather it into SIMD vectors either with scalar loads + shuffles, or with AVX2 / AVX512 gather loads that take a base address + vector of (scaled) indices.
But GPU SIMD is different. It's for massively parallel problems where you do the same thing to every element. The "pipeline" can be very lightweight because it doesn't need to support out-of-order exec or register renaming, or especially branching and exceptions. This makes it feasible to just have scalar execution units without needing to handle data in fixed chunks from contiguous addresses.
These are two very different programming models. They're both SIMD, but the details of the hardware that runs them are very different.
Each vector load/store instruction is issued exactly once.
Yes, that's logically true. In practice the internals can be slightly more complicated, e.g. AMD Ryzen splitting 256-bit vector operations into 128-bit halves, or Intel Sandybridge/IvB doing that for just loads+stores while having 256-bit wide FP ALUs.
There's a slight wrinkle with misaligned loads/stores on Intel x86 CPUs: on a cache-line split, the uop has to get replayed (from the reservation station) to do the other part of the access (to the other cache line).
In Intel terminology, the uop for a split load gets dispatched twice, but only issues + retires once.
Aligned loads/stores like movdqa, or movdqu when the memory happens to be aligned at runtime, are just a single access to L1d cache (assuming a cache hit). Unless you're on a CPU that decodes a vector instruction into two halves, like AMD for 256-bit vectors.
But that stuff is purely inside the CPU core, for access to L1d cache. CPU <-> memory transactions are in whole cache lines, with write-back L1d / L2 private caches and a shared L3 on modern x86 CPUs - Which cache mapping technique is used in intel core i7 processor? (Intel has had a shared L3 since Nehalem, the start of the i3/i5/i7 series; AMD, I think, since Bulldozer.)
In a CPU, it's the write-back L1d cache that basically coalesces transactions into whole cache lines, whether you use SIMD or not.
What SIMD helps with is getting more work done inside the CPU, to keep up with faster memory. Or for problems where the data fits in L2 or L1d cache, to go really fast over that data.
Memory coalescing is related to parallel accesses: when the cores of an SM access consecutive memory locations, the accesses are combined into fewer, wider memory transactions.
Conversely, SIMD is a single-core optimization: when a vector register is filled with operands and an SSE operation is performed, the parallelism is inside the CPU core, with one operation performed on each internal logic unit per clock cycle.
However, you are right: coalesced/uncoalesced memory access is a runtime aspect, while SIMD operations are compiled in. I don't think they compare well.
If I were to draw a parallel, I would compare coalescing on GPUs to memory prefetching on CPUs. That is a very important runtime optimization as well - and I believe it is active behind the scenes when using SSE too.
However, there is nothing similar to coalescing in Intel CPU cores. Because of cache coherency, the best you can do when optimizing parallel memory accesses is to let each core access independent memory regions.
Now, why do GPUs have run-time coalescing?
Graphics processing is optimized for executing a single task in parallel on adjacent elements.
For example, think of performing an operation on every pixel of an image, assigning each pixel to a different core. Now it is clear that you want an optimal path for loading the image, spreading one pixel to each core.
That is why memory coalescing is deeply buried in the GPU architecture.
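A minimal sketch of that one-pixel-per-thread pattern (the kernel and the "+20 brightening" operation are just illustrative): threads along x map to adjacent pixels of a row, so each warp's loads fall on consecutive addresses and coalesce into a few wide transactions.

    __global__ void brighten(const unsigned char *img, unsigned char *out,
                             int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            int p = y * width + x;           // row-major: neighbours in x are adjacent in memory
            out[p] = min(img[p] + 20, 255);  // same simple operation on every pixel
        }
    }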

random access gpgpu performance drop?

I've heard there is a drop in performance when performing computations on arrays with random access on a gpu.
My question is how severe is this performance drop?
Searching around, some comments seemed to imply the code ran faster on the CPU. But given the vast difference in integer and floating-point throughput between GPUs and CPUs, it seems hard to believe that performance would drop that badly.
I think it is related to cache misses. The GPU also has L1 and L2 caches, and if you hit random memory locations you have a much higher chance of missing in them. The GPU also relies on a special memory access pattern called memory coalescing: a warp's accesses to a wide, contiguous range of memory are merged into a few transactions. That is why the GPU is so fast when it runs SIMD-friendly code. But if you access random memory locations, you break memory coalescing. I think it would be good to read the CUDA documentation to see how the GPU works.
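A rough sketch of the contrast, with a hypothetical index array perm holding a random permutation of 0..n-1:

    __global__ void sequentialRead(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];        // neighbouring threads hit neighbouring addresses: coalesced
    }

    __global__ void randomRead(const float *in, float *out, const int *perm, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[perm[i]];  // scattered addresses: up to 32 separate transactions per warp,
                                   // and far more cache misses
    }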

Why is shared memory faster than global memory?

Is that difference in speed due to the technology each is made of? (I read that shared memory is a scratchpad memory that is mainly SRAM, while global memory is typically DRAM.)
If both were made with the same technology, would there still be a difference in performance - because shared memory is on-chip and global memory is off-chip - due to extra instructions (load instructions) or extra hardware circuitry needed for global memory to load its data into the processor?
At least two reasons are the ones you've already pointed out. There is a:
Location difference - shared memory is on-chip; global memory (at least, ordinary global memory accesses that do not hit in one of the caches) is off-chip. Memory is generally clocked at a fixed frequency, and the maximum frequency will depend on how fast the system can be clocked. Long transmission lines, buffers that drive signals from off-chip to on-chip or vice versa, and many other circuit effects will slow down the maximum rate at which a particular circuit can be clocked. Therefore shared memory is considerably advantaged by being on-chip. The caches (L1, L2, read-only, constant cache, texture cache, etc.) all benefit from the same principle.
Technology difference. An SRAM cell (e.g. shared memory) might be clocked faster than a DRAM cell (e.g. off-chip global memory), and SRAM is more amenable to fast random access. DRAM has a more complicated access sequence that comes into play when a cell is accessed. DRAM is also burdened by mechanisms such as refresh that may get in the way of continuous fast access. However I would suggest that the technology difference is less of an issue. Another technology related issue is that SRAM arrays are generally more amenable (able to be placed in higher density) on the logic processes that modern processors use. For highest density, DRAM arrays use a semiconductor process that differs substantially from the one used for general logic inside a processor.
The processor instructions required wouldn't be a meaningful differentiator between shared memory and global memory access times.

"Global Load Efficiency" over 100%

I have a CUDA program in which threads of a block read elements of a long array in several iterations and memory accesses are almost fully coalesced. When I profile, Global Load Efficiency is over 100% (between 119% and 187% depending on the input). Description for Global Load Efficiency is "Ratio of global memory load throughput to required global memory load throughput." Does it mean that I'm hitting L2 cache a lot and my memory accesses are benefiting from it?
My GPU is GeForce GTX 780 (Kepler architecture).
I asked this question on the NVIDIA forum here. I quote the answer I got:
"Global Load Efficiency and Global Store Efficiency describe how well the coalescing of DRAM-accesses and (L2?)Cache-accesses works. If they're 100 percent then you've got perfect coalescing. Since efficiencies above 100 percent don't make any sense (you cannot be better than optimal) this has to be an error.
This error is caused by the Visual Profiler, which counts hardware events to calculate some abstract metrics. But the GPU doesn't have the "correct" events to exactly calculate all those metrics, so the Visual Profiler has to estimate them using some complex formulas and the "wrong" events. Some metrics are just rough estimations, and Global Load Efficiency and Global Store Efficiency are two of them. Thus if such an efficiency is bigger than 100 percent, it is an estimation error. As far as I have observed, Global Load Efficiency and Global Store Efficiency both increased above 100 percent in some of my register-spilling kernels. That's why I assume that the Visual Profiler uses some events, which may also be caused by local memory accesses, to calculate those two efficiencies. Furthermore, GPUs just use 32-bit counters, so long-running kernels tend to overflow those counters, which also causes the Visual Profiler to display wrong metrics."