GPU serializations decomposition - cuda

According to this, http://www.nvidia.co.uk/content/PDF/isc-2011/Ziegler.pdf, I understand that replays in the GPU literature mean serializations. But what are the factors that contribute to the number of serializations?
To do this, I did some experiments. Profiled a some kernels and find the number of replays (= issued instructions - executed instructions). Sometimes, the number of bank conflicts to be equal to the number of replays. Some other times, the number of bank conflicts is smaller. This implies the number of bank conflicts is always a factor. What about the other?
According to the slides above (from slides 35), there are some others:
. The instruction cache misses
. Constant memory bank conflicts
To my understanding, there can be two others:
. The number of branches divergences. Since both paths are executed, there are replays. But i'm not sure if the number of issued instructions are affected by divergences or not?
. The number of cache misses. I have heard that long latency memory requests will be replayed sometimes. But in my experiments, L1 cache misses are often higher than replays.
Can anyone confirm these factors are those contribute to serializations? What is incorrect and do I miss something else?
Thanks

As far as I know branch divergence contributes to instruction replay.
I am not sure about the number of cache misses. That should be handled transparently by the memory controller not affecting the instruction. Worse thing I can think is that the pipeline gets stopped until the memory has been properly fetched.

Related

Which is faster for CUDA shared-mem atomics - warp locality or anti-locality?

Suppose many warps in a (CUDA kernel grid) block are updating a fair-sized number of shared memory locations, repeatedly.
In which of the cases will such work be completed faster? :
The case of intra-warp access locality, e.g. the total number of memory position accessed by each warp is small and most of them are indeed accessed by multiple lanes
The case of access anti-locality, where all lanes typically access distinct positions (and perhaps with an effort to avoid bank conflicts)?
and no less importantly - is this microarchitecture-dependent, or is it essentially the same on all recent NVIDIA microarchitectures?
Anti-localized access will be faster.
On SM5.0 (Maxwell) and above GPUs the shared memory atomics (assume add) the shared memory unit will replay the instruction due to address conflicts (two lanes with the same address). Normal bank conflict replays also apply. On Maxwell/Pascal the shared memory unit has fixed round robin access between the two SM partitions (2 scheduler in each partition). For each partition the shared memory unit will complete all replays of the instruction prior to moving to the next instruction. The Volta SM will complete the instruction prior to any other shared memory instruction.
Avoid bank conflicts
Avoid address conflicts
On Fermi and Kepler architecture a shared memory lock operation had to be performed prior to the read modify write operation. This blocked all other warp instructions.
Maxwell and newer GPUs have significantly faster shared memory atomic performance thank Fermi/Kepler.
A very simple kernel could be written to micro-benchmark your two different cases. The CUDA profilers provide instruction executed counts and replay counts for shared memory accesses but do not differentiate between replays due to atomics and replays due to load/store conflicts or vector accesses.
There's a quite simple argument to be made even without needing to know anything about how shared memory atomics are implemented in CUDA hardware: At the end of the day, atomic operations must be serialized somehow at some point. This is true in general, it doesn't matter which platform or hardware you're running on. Atomicity kinda requires that simply by nature. If you have multiple atomic operations issued in parallel, you have to somehow execute them in a way that ensures atomicity. That means that atomic operations will always become slower as contention increases, no matter if we're talking GPU or CPU. The only question is: by how much. That depends on the concrete implementation.
So generally, you want to keep the level of contention, i.e., the number of threads what will be trying to perform atomic operations on the same memory location in parallel, as low as possible…
This is a speculative partial answer.
Consider the related question: Performance of atomic operations on shared memory and its accepted answer.
If the accepted answer there is correct (and continues to be correct even today), then warp threads in a more localized access would get in each other's way, making it slower for many lanes to operate atomically, i.e. making anti-locality of warp atomics better.
But to be honest - I'm not sure I completely buy into this line of argumentation, nor do I know if things have changed since that answer was written.

Is it possible that in MIPS an instruction's certain steps come before that of its predecessor in a pipelined structure?

This is a problem about computer architecture and hope somebody has a clue. More specifically, it is about MIPS instruction pipelined flow. But I feel obscured about some aspects of it. Because I currently do not have enough reputation so I cannot post a image.
Does an S (stall) mean no following instructions can utilize the time slot taken by the stall?
Can two consecutive instructions both have X (execute) in the same time slot?
Is it possible that the M (Memory Access) and W (Write Back) of an instruction come before that of its predecessor in a pipelined structure????
In the situation of a loop and the last instruction is a repetition of the first instruction, why there are 2 F's (fetch) in the last instruction?
For issue 1, in a simple, scalar pipeline, a stall introduces a pipeline bubble which cannot be "popped". To allow an instruction later in program order to fill a pipeline bubble, that instruction would have to go past the stalled instruction. Supporting such reordering of instructions increases the complexity of the pipeline, which tends to increase design and production costs and to increase either pipeline depth or cycle time (as well as use more energy per active cycle [out-of-order execution can be more energy efficient in total even when more energy is used when active]). The mechanisms needed to support such reordering also increases the complexity of explaining pipelines.
For issue 2, with a more complex pipeline it is possible to begin execution of more than one instruction at the same time. Such processors are called superscalar. With in-order execution, only instructions in a consecutive sequence (in program order) can begin execution at the same time, and this requires that the instructions do not have data dependencies and that sufficient hardware resources are available to execute the instructions and handle their results. For an in-order microarchitecture, the width of the earlier pipeline stages is typically the same as the width of later pipeline stages, though buffering would allow multiple instructions to accumulate behind a stall.
(Even at only two-wide execution, there are usually additional restrictions on what kinds of instructions can be executed in parallel. E.g., one execution port might not handle memory accesses or branches while the other execution port might handle those instructions but not shifts or multiplies. Having two copies of hardware for relatively expensive operations [like shifts and multiplies] increases size and limiting the data paths for memory accesses and branches can simplify design and potentially reduce delay.)
For issue 3, out-of-order execution allows the reordering of instructions, so an instruction later in program order could execute and writeback results to the register file before an earlier instruction. With some additional complexity in handling exceptions/interrupts and arbitrating register write port use (or increasing the number of write ports), it is also possible for an in-order processor to writeback results out of program order. The Motorola 88110 (from the early 1990s) is an example of a processor which did such. In order to handle exceptions, the 88110 had a history buffer to hold data that is overwritten by instructions that are later in program order than where the exception is. The 88110 had two additional read ports to each of the register files to read the data in the destination registers and write such to the history buffer.
For issue 4, I am guessing that you mean the case where the body of the loop is composed on only one instruction. For a typical RISC instruction set the branch instruction controlling the loop is a separate instruction from the instruction performing a computation or memory access, so the loop would actually contain two instructions. (Power, formerly PowerPC, could have a one instruction delay loop using branch on counter which decrements the special counter register, but optimizing instruction fetch for a simple implementation for such peculiar code would be foolish.)
For the simple classic 5-stage pipeline with delayed branches, it does not make sense from a performance perspective to avoid an instruction cache access since the loop branch does not introduce a pipeline bubble even when taken. This means that there is no opportunity to execute more instructions. However, in some microarchitectures where redirecting instruction fetch to a non-sequential address introduces a pipeline bubble (particularly if from instruction fetch taking more than one cycle), providing a small fast-access buffer can improve performance. (Instruction fetch bandwidth limitations could also justify a buffer for performance; a small buffer could provide higher bandwidth than a large cache or an off-chip memory.) In addition, to reduce energy use, the use of a loop buffer makes considerable sense, but one would almost certainly not want to limit the size of the buffer to only two instructions (the branch plus one "body" instruction) because such tiny loops are rare and even increasing the buffer size to eight instructions would only add a modest amount of hardware.
In order to specially handle the case of small bodied loops, such loops must be detected. While the buffer could always be filled with the last N instructions (to avoid the first encounter of the short backward branch not "hitting" in the loop buffer--and such a buffer could also even out variations in instruction fetch which might be caused by crossing cache line boundaries, cache misses, fetch redirection delays, etc.), it would be necessary to check each branch instruction to see if it targeted an instruction within the buffer. (It would even be possible to provide a special storage for the loop branch instruction since storage is only needed for the condition checked, a small index into the loop buffer and an indication of where the branch is, but short loops are probably not sufficiently common for such specialized hardware.) In effect, a loop buffer can be a very small Level 0 instruction cache
(A branch target instruction cache [BTIC] is a mechanism similar to a loop buffer, but instead of caching instructions only from the target of the most recent loop branch a BTIC caches instructions from the targets of a number of recent branches. BTICs are primarily used to hide instruction fetch latency.)
When teaching pipelines, such complicating factors are usually avoided initially.

In CUDA, do non-coalesced memory accesses cause branch divergence?

I always thought that branch divergence is only caused by the branching code, like "if", "else", "for", "switch", etc. However I have read a paper recently in which it says:
"
One can clearly observe that the number of divergent branches taken by threads in each first exploration-based algorithm is at least twice more important than the full exploration strategy. This is typically the results from additional non-coalesced accesses to the global memory. Hence, such a threads divergence leads to many memory accesses that have to be serialized, increasing the total number of instructions executed.
One can observe that the number of warp serializations for the version using non-coalesced accesses is between seven and sixteen times more important than for its counterpart. Indeed, a threads divergence caused by non-coalesced accesses leads to many memory accesses that have to be serialized, increasing the instructions to be executed.
"
It seems like, according to the author, non-coalesced accesses can cause divergent branches. Is that true?
My question is, how many reasons exactly are there for the branch divergence?
Thanks in advance.
I think the author is unclear on the concepts and/or terminology.
The two concepts of divergence and serialization are closely related. Divergence causes serialization, as the divergent groups of threads in a warp must be executed serially. But serialization does not cause divergence, as divergence refers specifically to threads within a warp running different code paths.
Other things that cause serialization (but not divergence) are bank conflicts and atomic operations.

How to adjust the cuda number of block and of thread to get optimal performances

I've tested empirically for several values of block and of thread, and the execution time can be greatly reduced with specific values.
I don't see what are the differences between blocks and thread. I figure that it may be that thread in a block have specific cache memory but it's quite fuzzy for me. For the moment, I parallelize my functions in N parts, which are allocated on blocks/threads.
My goal could be to automaticaly adjust the number of blocks and thread regarding to the size of the memory that I've to use. Could it be possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
When using shared memory you might want to consider it first, because it's a very much limited resource and it's not unlikely for kernels to have very specific needs that constrain
those many variables controlling parallelism.
You either have blocks with many threads sharing larger regions or blocks with fewer
threads sharing smaller regions (under constant occupancy).
If your code can live with as little as 16KB of shared memory per multiprocessor
you might want to opt for larger (48KB) L1-caches calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, L1-caches can be disabled for non-local global access using the compiler option -Xptxas=-dlcm=cg to avoid pollution when the kernel accesses global memory carefully.
Before worrying about optimal performance based on occupancy you might also want to check
that device debugging support is turned off for CUDA >= 4.1 (or appropriate optimization options are given, read my post in this thread for a suitable compiler
configuration).
Now that we have a memory configuration and registers are actually used aggressively,
we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs. That is register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while
varying the launch bounds and parameters will help you find a suitable configuration.
It's theoretically possible to find optimal values from within an application, however,
having the client code adjust optimally to both different device and launch parameters
can be nontrivial and will require recompilation or different variants of the kernel to be deployed for every target device architecture.
I believe to automatically adjust the blocks and thread size is a highly difficult problem. If it is easy, CUDA would most probably have this feature for you.
The reason is because the optimal configuration is dependent of implementation and the kind of algorithm you are implementing. It requires profiling and experimenting to get the best performance.
Here are some limitations which you can consider.
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to best performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
I've a quite good answer here, in a word, this is a difficult problem to compute the optimal distribution on blocks and threads.

What happened when all thread of a warp read the same global memory?

I want to know what happened when all threads of a warp read the same 32-bit address of global memory. How many memory requests are there? Is there any serialization. The GPU is Fermi card, the programming environment is CUDA 4.0.
Besides, can anybody explain the concept of bus utilization? What is the difference between caching loading and non-caching loading? I saw the concept in http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.
All threads in warp accessing same address in global memory
I could answer your questions off the top of my head for AMD GPUs. For Nvidia, googling found the answers quickly enough.
I want to know what happened when all threads of a warp read the same 32-bit address of global memory. How many memory requests are
there? Is there any serialization. The GPU is Fermi card, the
programming environment is CUDA 4.0.
http://developer.download.nvidia.com/CUDA/training/NVIDIA_GPU_Computing_Webinars_Best_Practises_For_OpenCL_Programming.pdf, dated 2009, says:
Coalescing:
Global memory latency: 400-600 cycles. The single most important
performance consideration!
Global memory access by threads of a half warp can be coalesced to
one transaction for word of size 8-bit, 16-bit, 32-bit, 64-bit or two
transactions for 128-bit.
Global memory can be viewed as composing aligned segments of 16 and
32 words.
Coalescing in Compute Capability 1.0 and 1.1:
K-th thread in a half warp must access the k-th word in a segment;
however, not all threads need to participate
Coalescing in Compute Capability
1.2 and 1.3:
Coalescing for any pattern of access that fits into a segment size
So, it sounds like having all threads of a warp access the same 32-bit address of global memory will work as well as could be hoped for, in anything >= Compute Capability 1.2. But not for 1.0 and 1.1.
Your card is okay.
I must admit that I have not tested this for Nvidia. I have tested it for AMD.
Difference between cache and uncached load
To start off, look at slide 4 of the presentation you refer to, http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.
I.e. the slide titled "Differences between CPU & GPUs" - that says that CPUs have huge caches, and GPUs don't.
A few years ago such a slide might have said that GPUs don't have any caches at all. However, GPUs have begin to add more and more cache, and/or switch more and more local to cache.
I am not sure if you understand what a "cache" is in computer architecture. It's a big topic, so I will only provide a short answer.
Basically, a cache is like local memory. Both cache and local memory - are closer to the processor or GPU than DRAM, main memory - whether that be the GPU's private DRAM, or the CPU's system memory. DRAM main memory is called by Nvidia Global Memory. Slide 9 illustrates this.
Both cache and local memory are closer to the GPU than DRAM global memory: on slide 9 they are drawn as being inside the same chip as the GPU, whereas the DRAMs are separate chips. This can have several good effects, on latency, throughput, power - and, yes, bus utilization (related to bandwidth).
Latency: global memory is 400-800 cycles away. This means that if you only had one warp in your application, it would only execute one memory operation every 400-800 cycles. This means that, in order not to slow down, you need many threads/warps producing memory requests that can be run in parallel, i.e. that have high MLP (Memory Level Parallelism). Fortunately graphics usually does this. The caches are closer, so will have lower latency. Your slides do not say what it is, but other places say 50-200 cycles, 4-8X faster than global memory. This translates to needing fewer threads&warps to avoid slowing down.
Throughput/Bandwidth: there is typically more bandwidth to local memory and/or cache than to DRAM global memory. Your slides say 1+ TB/s versus 177 GB/s - i.e. cache and local memory is more than 5X faster. This higher bandwidth could translate to significantly higher framerates.
Power: you save a lot of power going to cache or local memory rather than to DRAM global memory. This may not matter to a desktop gaming PC, but it matters to a laptop, or to a tablet PC. Actually, it matters even to a desktop gaming PC, because less opower means it can be (over)clocked faster.
OK, so local and cache memory are similar in the above? What's the difference?
Basically, it is easier to program cache than it is local memory. Very good, expert, nionja programmers are needed to manage local memory properly, copying stuff from global memory as needed, and flushing it out. Whereas cache memory is much easier to manage, because you just doing a cached load, and the memory is put in cache automatically, where it will be accessed faster the next time around.
But caches have a downside as well.
First, they actually burn a bit more power than local memory - or they would, if there were actually separate local and global memories. However, in Fermi, the local memory may be configured as cache, and vice versa. (For years GPU folks said "we don't need no stinking cache - cache tags and other overhead are jujst wasteful.)
More importantly, caches tend to operate on cache lines - but not all programs do. This leads to the bus utilization issue you mention. If a warp accesses all words in a cache line, great. But if a warp only accesses 1 word in a cache line, i.e. 1 4 byte word and then skips 124 bytes, then 128 bytes of data are transferred over the bus, but only 4 bytes are used. I.e. >96% of the bus bandwidth is being wasted. This is the low bus utilization.
Whereas the very next slide shows that a non-caching load, suich as one might use to load data into local memory, would transfer only 32 bytes, so "only" 28 bytes out of 32 are wasted. In other words, the non-cache loads could be 4X more efficient, 4X faster, than the cached loads.
Then why not use non-cache loads everywhere? Because they are harder to program - it requirwes expert ninja programmers. And caches work pretty well much of the time.
So, instead of paying your expert ninja programmers to spend a lot of time optimizing all of the code to use non-cache loads and hand-managed local memory - instead you do the easy stuff using cached loads, and you let the highly paid expert ninja programmers concentrate on the stuff that the cache does not work well for.
Besides: nobody likes admitting it, but oftentimes the cache does better than the expert ninja programmers.
Hope this helps.
Throughput, power, and bus utilization: in addition to reducing