"Global Load Efficiency" over 100% - cuda

I have a CUDA program in which threads of a block read elements of a long array in several iterations and memory accesses are almost fully coalesced. When I profile, Global Load Efficiency is over 100% (between 119% and 187% depending on the input). Description for Global Load Efficiency is "Ratio of global memory load throughput to required global memory load throughput." Does it mean that I'm hitting L2 cache a lot and my memory accesses are benefiting from it?
My GPU is GeForce GTX 780 (Kepler architecture).

I asked this question at NVIDIA forum here. I quote the answer I got:
"Global Load Efficiency and Global Store Efficiency describe how well the coalescing of DRAM-accesses and (L2?)Cache-accesses works. If they're 100 percent then you've got perfect coalescing. Since efficiencies above 100 percent don't make any sense (you cannot be better than optimal) this has to be an error.
This error is caused by the Visual Profiler, which counts hardware events to calculate some abstract metrics. But the GPU doesn't have the "correct" events to exactly calculate all those metrics, so the Visual Profiler has to estimate them using a complex formula and "wrong" events. Some metrics are only rough estimations, and Global Load Efficiency and Global Store Efficiency are two of them. Thus, if such an efficiency is bigger than 100 percent, it is an estimation error. As far as I observed, Global Load Efficiency and Global Store Efficiency both increased above 100 percent in some of my register-spilling kernels. That's why I assume that the Visual Profiler uses some events, which may also be caused by local memory accesses, to calculate those two efficiencies. Furthermore, GPUs only use 32-bit counters, so long-running kernels tend to overflow them, which also causes the Visual Profiler to display wrong metrics."

Related

Memory Coalescing vs. Vectorized Memory Access

I am trying to understand the relationship between memory coalescing on NVIDIA GPUs/CUDA and vectorized memory access on x86-SSE/C++.
It is my understanding that:
Memory coalescing is a run-time optimization of the memory controller (implemented in hardware). How many memory transactions are required to fulfill the load/store of a warp is determined at run-time. A load/store instruction of a warp may be issued repeatedly unless there is perfect coalescing.
Memory vectorization is a compile-time optimization. The number of memory transactions for a vectorized load/store is fixed. Each vector load/store instruction is issued exactly once.
Coalescable GPU load/store instructions are more expressive than SSE vector load/store instructions. E.g., a st.global.s32 PTX instruction may store into 32 arbitrary memory locations (warp size 32), whereas a movdqa SSE instruction can only store into a consecutive block of memory.
Memory coalescing in CUDA seems to guarantee efficient vectorized memory access (when accesses are coalescable), whereas on x86-SSE, we have to hope that the compiler actually vectorizes the code (it may fail to do so) or vectorize code manually with SSE intrinsics, which is more difficult for programmers.
Is this correct? Did I miss an important aspect (thread masking, maybe)?
Now, why do GPUs have run-time coalescing? This probably requires extra circuits in hardware. What are the main benefits over compile-time coalescing as in CPUs? Are there applications/memory access patterns that are harder to implement on CPUs because of missing run-time coalescing?
caveat: I don't really know / understand the architecture / microarchitecture of GPUs very well. Some of this understanding is cobbled together from the question + what other people have written in comments / answers here.
The way GPUs let one instruction operate on multiple data is very different from CPU SIMD. That's why they need special support for memory coalescing at all. CPU-SIMD can't be programmed in a way that needs it.
BTW, CPUs have cache to absorb multiple accesses to the same cache line before the actual DRAM controllers get involved. GPUs have cache too, of course.
Yes, memory-coalescing basically does at runtime what short-vector CPU SIMD does at compile time, within a single "core". The CPU-SIMD equivalent would be gather/scatter loads/stores that could optimize to a single wide access to cache for indices that were adjacent. Existing CPUs don't do this: each element accesses cache separately in a gather. You shouldn't use a gather load if you know that many indices will be adjacent; it will be faster to shuffle 128-bit or 256-bit chunks into place. For the common case where all your data is contiguous, you just use a normal vector load instruction instead of a gather load.
The point of modern short-vector CPU SIMD is to feed more work through a fetch/decode/exec pipeline without making it wider in terms of having to decode + track + exec more CPU instructions per clock cycle. Making a CPU pipeline wider quickly hits diminishing returns for most use-cases, because most code doesn't have a lot of ILP.
A general-purpose CPU spends a lot of transistors on instruction-scheduling / out-of-order execution machinery, so just making it wider to be able to run many more uops in parallel isn't viable. (https://electronics.stackexchange.com/questions/443186/why-not-make-one-big-cpu-core).
To get more throughput, we can raise the frequency, raise IPC, and use SIMD to do more work per instruction/uop that the out-of-order machinery has to track. (And we can build multiple cores on a single chip, but cache-coherent interconnects between them + L3 cache + memory controllers are hard). Modern CPUs use all of these things, so we get a total throughput capability of frequency * IPC * SIMD, and times number of cores if we multithread. They aren't viable alternatives to each other, they're orthogonal things that you have to do all of to drive lots of FLOPs or integer work through a CPU pipeline.
This is why CPU SIMD has wide fixed-width execution units, instead of a separate instruction for each scalar operation. There isn't a mechanism for one scalar instruction to flexibly be fed to multiple execution units.
Taking advantage of this requires vectorization at compile time, not just of your loads / stores but also your ALU computation. If your data isn't contiguous, you have to gather it into SIMD vectors either with scalar loads + shuffles, or with AVX2 / AVX512 gather loads that take a base address + vector of (scaled) indices.
But GPU SIMD is different. It's for massively parallel problems where you do the same thing to every element. The "pipeline" can be very lightweight because it doesn't need to support out-of-order exec or register renaming, or especially branching and exceptions. This makes it feasible to just have scalar execution units without needing to handle data in fixed chunks from contiguous addresses.
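To make that concrete, here is a minimal CUDA sketch (the kernel names and the stride parameter are only illustrative, not from the question): the per-thread load is the same instruction in both kernels; whether it becomes one wide transaction per warp or many narrow ones is decided by the hardware from the addresses the lanes actually present at run time.

// Illustration only: the same per-thread load compiles to the same instruction;
// the hardware decides at run time how many memory transactions the warp needs.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                                // adjacent lanes read adjacent words: coalesced
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[((long long)i * stride) % n];      // scattered addresses: many transactions
}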
These are two very different programming models. They're both SIMD, but the details of the hardware that runs them are very different.
Each vector load/store instruction is issued exactly once.
Yes, that's logically true. In practice the internals can be slightly more complicated, e.g. AMD Ryzen splitting 256-bit vector operations into 128-bit halves, or Intel Sandybridge/IvB doing that for just loads+stores while having 256-bit wide FP ALUs.
There's a slight wrinkle with misaligned loads/stores on Intel x86 CPUs: on a cache-line split, the uop has to get replayed (from the reservation station) to do the other part of the access (to the other cache line).
In Intel terminology, the uop for a split load gets dispatched twice, but only issues + retires once.
Aligned loads/stores like movdqa, or movdqu when the memory happens to be aligned at runtime, are just a single access to L1d cache (assuming a cache hit). Unless you're on a CPU that decodes a vector instruction into two halves, like AMD for 256-bit vectors.
But that stuff is purely inside the CPU core for access to L1d cache. CPU <-> memory transactions are in whole cache lines, with write-back L1d / L2 private caches and a shared L3 on modern x86 CPUs (see "Which cache mapping technique is used in intel core i7 processor?"). Intel has had that since Nehalem, the start of the i3/i5/i7 series; AMD, I think, introduced L3 caches with Bulldozer.
In a CPU, it's the write-back L1d cache that basically coalesces transactions into whole cache lines, whether you use SIMD or not.
What SIMD helps with is getting more work done inside the CPU, to keep up with faster memory. Or for problems where the data fits in L2 or L1d cache, to go really fast over that data.
Memory coalescing is related to parallel accesses: when each core in an SM accesses a subsequent memory location, the memory access is optimized.
Vice versa, SIMD is a single-core optimization: when a vector register is filled with operands and an SSE operation is performed, the parallelism is inside the CPU core, with one operation being performed on each internal logic unit per clock cycle.
However, you are right: coalesced/uncoalesced memory access is a runtime aspect, while SIMD operations are compiled in, so I don't think the two compare well.
If I were to draw a parallel, I would compare coalescing in GPUs to memory prefetching in CPUs. That is a very important runtime optimization as well, and I believe it is active behind the scenes when SSE is used too.
However, there is nothing similar to coalescing in Intel CPU cores. Because of cache coherency, the best you can do to optimize parallel memory accesses is to let each core access independent memory regions.
Now, why do GPUs have run-time coalescing?
Graphics processing is optimized for executing a single task in parallel on adjacent elements.
For example, think of performing an operation on every pixel of an image, assigning each pixel to a different core. Now it's clear that you want an optimal path to load the image, spreading one pixel to each core.
That's why memory coalescing is buried deep in the GPU architecture.
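As a rough sketch of that pixel-per-thread picture (the image layout and kernel name are my assumptions, not part of the answer above):

// Sketch: each thread brightens one pixel of a row-major 8-bit grayscale image.
// Threads with consecutive threadIdx.x touch consecutive bytes, so each warp's
// loads and stores coalesce into a few wide memory transactions.
__global__ void brighten(const unsigned char *src, unsigned char *dst,
                         int width, int height, int delta)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;
        int v = src[idx] + delta;
        dst[idx] = v > 255 ? 255 : (unsigned char)v;
    }
}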

How to adjust the cuda number of block and of thread to get optimal performances

I've tested empirically with several numbers of blocks and threads, and the execution time can be greatly reduced with specific values.
I don't see what the differences between blocks and threads are. I figure that threads in a block may have specific cache memory, but it's quite fuzzy for me. For the moment, I parallelize my functions into N parts, which are allocated to blocks/threads.
My goal would be to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Is that possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
When using shared memory you might want to consider it first, because it is a very limited resource, and it is not unlikely for kernels to have very specific needs that constrain the many variables controlling parallelism. You either have blocks with many threads sharing larger regions, or blocks with fewer threads sharing smaller regions (under constant occupancy).
If your code can live with as little as 16 KB of shared memory per multiprocessor, you might want to opt for the larger (48 KB) L1 cache by calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, the L1 cache can be disabled for non-local global accesses using the compiler option -Xptxas=-dlcm=cg, to avoid pollution when the kernel accesses global memory carefully.
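A minimal host-side sketch of those two knobs (the kernel is a hypothetical placeholder; the nvcc option belongs on the command line and is shown only as a comment):

#include <cuda_runtime.h>

// Hypothetical kernel assumed to need no more than 16 KB of shared memory per block.
__global__ void myKernel(float *data) { /* ... */ }

int main()
{
    // Device-wide preference: 48 KB L1 / 16 KB shared memory.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

    // The same preference can also be set per kernel.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    // To bypass L1 for global loads (32-byte L2 transactions instead of
    // 128-byte lines), compile with:  nvcc -Xptxas=-dlcm=cg ...
    return 0;
}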
Before worrying about optimal performance based on occupancy, you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or that appropriate optimization options are given; read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration, and registers are actually used aggressively, we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor), the less likely the multiprocessor will have to wait (for memory transactions or data dependencies), but the more threads must share the same L1 cache, shared memory area, and register file (see the CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs; that is, register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics, and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
It is theoretically possible to find optimal values from within an application; however, having the client code adjust optimally to different devices and launch parameters can be nontrivial, and it will require recompilation or different variants of the kernel to be deployed for every target device architecture.
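One concrete lever here is __launch_bounds__, which tells ptxas the block size (and, optionally, a minimum number of resident blocks) you intend to launch so it can budget registers accordingly; a small sketch, with the kernel and the numbers chosen only for illustration:

// Sketch: cap register usage so the desired number of blocks can stay resident
// per multiprocessor. Compile with  nvcc --ptxas-options=-v  to see the
// resulting register and shared-memory usage.
#define BLOCK_SIZE        256
#define MIN_BLOCKS_PER_SM 4        // illustrative target; tune per device

__global__ void __launch_bounds__(BLOCK_SIZE, MIN_BLOCKS_PER_SM)
scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}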
I believe that automatically adjusting the block and thread sizes is a highly difficult problem. If it were easy, CUDA would most probably have this feature already.
The reason is that the optimal configuration depends on the implementation and the kind of algorithm you are implementing. It requires profiling and experimentation to get the best performance.
Here are some of the limitations you can consider:
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to best performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
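In practice that usually means a brute-force sweep. Here is a sketch of timing a hypothetical kernel at several block sizes with CUDA events (the kernel, sizes, and candidate list are all just illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel to be tuned.
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep a few candidate block sizes and keep the fastest.
    // (A warm-up launch before timing would make the numbers fairer.)
    int candidates[] = {64, 128, 192, 256, 512, 1024};
    for (int c = 0; c < 6; ++c) {
        int block = candidates[c];
        int grid  = (n + block - 1) / block;
        cudaEventRecord(start);
        saxpy<<<grid, block>>>(2.0f, x, y, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %4d : %.3f ms\n", block, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}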
I've given a fairly detailed answer to this here; in short, computing the optimal distribution over blocks and threads is a difficult problem.

What happens when all threads of a warp read the same global memory?

I want to know what happens when all threads of a warp read the same 32-bit address of global memory. How many memory requests are there? Is there any serialization? The GPU is a Fermi card, the programming environment is CUDA 4.0.
Besides, can anybody explain the concept of bus utilization? What is the difference between a caching load and a non-caching load? I saw the concepts in http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.
All threads in warp accessing same address in global memory
I could answer your questions off the top of my head for AMD GPUs. For Nvidia, googling found the answers quickly enough.
I want to know what happens when all threads of a warp read the same 32-bit address of global memory. How many memory requests are there? Is there any serialization? The GPU is a Fermi card, the programming environment is CUDA 4.0.
http://developer.download.nvidia.com/CUDA/training/NVIDIA_GPU_Computing_Webinars_Best_Practises_For_OpenCL_Programming.pdf, dated 2009, says:
Coalescing:
Global memory latency: 400-600 cycles. The single most important performance consideration!
Global memory accesses by the threads of a half-warp can be coalesced into one transaction for words of size 8, 16, 32, or 64 bits, or two transactions for 128-bit words.
Global memory can be viewed as being composed of aligned segments of 16 and 32 words.
Coalescing in Compute Capability 1.0 and 1.1: the k-th thread in a half-warp must access the k-th word in a segment; however, not all threads need to participate.
Coalescing in Compute Capability 1.2 and 1.3: coalescing for any pattern of accesses that fits into a segment size.
So, it sounds like having all threads of a warp access the same 32-bit address of global memory will work as well as could be hoped for, in anything >= Compute Capability 1.2. But not for 1.0 and 1.1.
Your card is okay.
I must admit that I have not tested this for Nvidia. I have tested it for AMD.
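For reference, the access pattern in the question is just a warp-wide broadcast; a minimal sketch (the kernel name is mine, not from the question):

// Sketch: every thread of every warp reads the same 32-bit word.
// On compute capability >= 1.2, and on Fermi through the caches, this should be
// served as a broadcast rather than 32 serialized transactions per warp.
__global__ void broadcast_read(const int *src, int *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = src[0];          // same address for all 32 lanes of the warp
    if (i < n)
        dst[i] = v;          // the stores, by contrast, are ordinary coalesced writes
}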
Difference between cached and non-caching loads
To start off, look at slide 4 of the presentation you refer to, http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.
I.e. the slide titled "Differences between CPU & GPUs", which says that CPUs have huge caches and GPUs don't.
A few years ago such a slide might have said that GPUs don't have any caches at all. However, GPUs have begun to add more and more cache, and/or to switch more and more local memory over to cache.
I am not sure if you understand what a "cache" is in computer architecture. It's a big topic, so I will only provide a short answer.
Basically, a cache is like local memory. Both cache and local memory are closer to the processor or GPU than DRAM main memory, whether that be the GPU's private DRAM or the CPU's system memory. Nvidia calls DRAM main memory "global memory". Slide 9 illustrates this.
Both cache and local memory are closer to the GPU than DRAM global memory: on slide 9 they are drawn as being inside the same chip as the GPU, whereas the DRAMs are separate chips. This can have several good effects, on latency, throughput, power - and, yes, bus utilization (related to bandwidth).
Latency: global memory is 400-800 cycles away. This means that if you only had one warp in your application, it would only execute one memory operation every 400-800 cycles. This means that, in order not to slow down, you need many threads/warps producing memory requests that can be run in parallel, i.e. that have high MLP (Memory Level Parallelism). Fortunately graphics usually does this. The caches are closer, so will have lower latency. Your slides do not say what it is, but other places say 50-200 cycles, 4-8X faster than global memory. This translates to needing fewer threads&warps to avoid slowing down.
Throughput/Bandwidth: there is typically more bandwidth to local memory and/or cache than to DRAM global memory. Your slides say 1+ TB/s versus 177 GB/s - i.e. cache and local memory is more than 5X faster. This higher bandwidth could translate to significantly higher framerates.
Power: you save a lot of power going to cache or local memory rather than to DRAM global memory. This may not matter to a desktop gaming PC, but it matters to a laptop or a tablet PC. Actually, it matters even to a desktop gaming PC, because less power means it can be (over)clocked faster.
OK, so local and cache memory are similar in the above? What's the difference?
Basically, it is easier to program a cache than it is local memory. Very good, expert, ninja programmers are needed to manage local memory properly, copying stuff in from global memory as needed and flushing it out. Whereas cache memory is much easier to manage, because you just do a cached load, and the memory is put in cache automatically, where it will be accessed faster the next time around.
But caches have a downside as well.
First, they actually burn a bit more power than local memory - or they would, if there were actually separate local and cache memories. However, in Fermi, the local memory may be configured as cache, and vice versa. (For years GPU folks said "we don't need no stinking cache - cache tags and other overhead are just wasteful.")
More importantly, caches tend to operate on cache lines - but not all programs do. This leads to the bus utilization issue you mention. If a warp accesses all the words in a cache line, great. But if a warp only accesses one word in a cache line, i.e. one 4-byte word, and then skips 124 bytes, then 128 bytes of data are transferred over the bus but only 4 bytes are used. I.e. >96% of the bus bandwidth is being wasted. This is low bus utilization.
Whereas the very next slide shows that a non-caching load, such as one might use to load data into local memory, would transfer only 32 bytes, so "only" 28 bytes out of 32 are wasted. In other words, the non-caching loads could be 4X more efficient, 4X faster, than the cached loads.
Then why not use non-caching loads everywhere? Because they are harder to program - it requires expert ninja programmers. And caches work pretty well much of the time.
So, instead of paying your expert ninja programmers to spend a lot of time optimizing all of the code to use non-caching loads and hand-managed local memory, you do the easy stuff using cached loads, and you let the highly paid expert ninja programmers concentrate on the stuff that the cache does not work well for.
Besides: nobody likes admitting it, but oftentimes the cache does better than the expert ninja programmers.
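To make the bus-utilization arithmetic above concrete, here is a sketch of the one-word-per-cache-line pattern (the kernel is illustrative; the -dlcm switches are the Fermi-era options for cached versus non-caching global loads):

// Sketch: each lane reads one float and the lanes are 128 bytes apart, so every
// lane of the warp touches a different cache line and uses only 4 of its bytes.
// With cached (ca) loads the warp moves 32 * 128 bytes for 32 * 4 useful bytes;
// with non-caching (cg) loads it moves 32 * 32 bytes instead.
//   cached loads (default):   nvcc -Xptxas=-dlcm=ca ...
//   non-caching loads:        nvcc -Xptxas=-dlcm=cg ...
__global__ void one_word_per_line(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int stride = 32;                           // 32 floats = 128 bytes = one cache line
    if (i < n)
        out[i] = in[((long long)i * stride) % n];
}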
Hope this helps.

Maximum (shared memory per block) / (threads per block) in CUDA with 100% MP load

I'm trying to process an array of big structures with CUDA 2.0 (NVIDIA 590). I'd like to use shared memory for it. I've experimented with the CUDA occupancy calculator, trying to allocate the maximum shared memory per thread so that each thread can process a whole element of the array.
However, the maximum (shared memory per block) / (threads per block) I can see in the calculator with 100% multiprocessor load is 32 bytes, which is not enough for a single element (it is off by roughly an order of magnitude).
Is 32 bytes the maximum possible value of (shared memory per block) / (threads per block)?
Is it possible to say which alternative is preferable - allocating part of the array in global memory, or just using an underloaded multiprocessor? Or can it only be decided by experiment?
Yet another alternative I can see is to process array in several passes, but it looks like a last resort.
This is the first time I'm trying something really complex with CUDA, so I could be missing some other options...
There are many hardware limitations you need to keep in mind when designing a CUDA kernel. Here are some of the constraints you need to consider:
maximum number of threads you can run in a single block
maximum number of blocks you can load on a streaming multiprocessor at once
maximum number of registers per streaming multiprocessor
maximum amount of shared memory per streaming multiprocessor
Whichever of these limits you hit first becomes a constraint that limits your occupancy (is maximum occupancy what you are referring to by "100% Multiprocessor load"?). Once you reach a certain threshold of occupancy, it becomes less important to pay attention to occupancy. For example, occupancy of 33% does not mean that you are only able to achieve 33% of the maximum theoretical performance of the GPU. Vasily Volkov gave a great talk at the 2010 GPU Technology Conference which recommends not worrying too much about occupancy, and instead trying to minimize memory transactions by using some explicit caching tricks (and other stuff) in the kernel. You can watch the talk here: http://www.gputechconf.com/gtcnew/on-demand-GTC.php?sessionTopic=25&searchByKeyword=occupancy&submit=&select=+&sessionEvent=&sessionYear=&sessionFormat=#193
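If it helps, the limits in the list above can be queried at run time instead of hard-coded; a small sketch using cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: print the per-device limits that constrain occupancy, so a heuristic
// can adapt to whatever device it is actually running on.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("SMs:                      %d\n",        prop.multiProcessorCount);
    printf("max threads per block:    %d\n",        prop.maxThreadsPerBlock);
    printf("max threads per SM:       %d\n",        prop.maxThreadsPerMultiProcessor);
    printf("registers per block:      %d\n",        prop.regsPerBlock);
    printf("shared memory per block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("warp size:                %d\n",        prop.warpSize);
    return 0;
}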
The only real way to be sure that you are using a kernel design that gives best performance is to test all the possibilities. And you need to redo this performance testing for each type of device you run it on, because they all have different constraints in some way. This can obviously be tedious, especially when the different design patterns result in fundamentally different kernels. I get around this to some extent by using a templating engine to dynamically generate kernels at runtime according to the device hardware specifications, but it's still a bit of a hassle.

CUDA profiled achieved occupancy very low; how to diagnose?

When I run the profiler against my code, part of the output is:
Limiting Factor
Achieved Occupancy: 0.02 ( Theoretical Occupancy: 0.67 )
IPC: 1.00 ( Maximum IPC: 4 )
Achieved occupancy of 0.02 seems horribly low. Is it possible that this is due to missing .csv files from the profile run? It complains about:
Program run #18 completed.
Read profiler output file for context #0, run #1, Number of rows=6
Error : Error in profiler data file '/.../temp_compute_profiler_1_0.csv' at line number 1. No column found
Error in reading profiler output:
Application : "/.../bin/python".
Profiler data file '/.../temp_compute_profiler_2_0.csv' for application run 2 not found.
Read profiler output file for context #0, run #4, Number of rows=6
My blocks are 32*4*1, the grid is 25*100, and testing has shown that 32 registers provides the best performance (even though that results in spilling).
If the 0.02 number is correct, how can I go about debugging what's going on? I've already tried moving likely candidates to shared and/or constant memory, experimenting with launch_bounds, moving data into textures, etc.
Edit: if more data from a profile run will be helpful, just let me know and I can provide it. Thanks for reading.
Edit 2: requested data.
IPC: 1.00
Maximum IPC: 4
Divergent branches(%): 6.44
Control flow divergence(%): 96.88
Replayed Instructions(%): -0.00
Global memory replay(%): 10.27
Local memory replays(%): 5.45
Shared bank conflict replay(%): 0.00
Shared memory bank conflict per shared memory instruction(%): 0.00
L1 cache read throughput(GB/s): 197.17
L1 cache global hit ratio (%): 51.23
Texture cache memory throughput(GB/s): 0.00
Texture cache hit rate(%): 0.00
L2 cache texture memory read throughput(GB/s): 0.00
L2 cache global memory read throughput(GB/s): 9.80
L2 cache global memory write throughput(GB/s): 6.80
L2 cache global memory throughput(GB/s): 16.60
Local memory bus traffic(%): 206.07
Peak global memory throughput(GB/s): 128.26
The following derived statistic(s) cannot be computed as required counters are not available:
Kernel requested global memory read throughput(GB/s)
Kernel requested global memory write throughput(GB/s)
Global memory excess load(%)
Global memory excess store(%)
Achieved global memory read throughput(GB/s)
Achieved global memory write throughput(GB/s)
Solution(s):
The issue with missing data was due to a too-low timeout value; certain early runs would time out and their data would not be written (and those error messages would get lost in the spam of later runs).
The 0.02 achieved occupancy was due to active_warps and active_cycles (and potentially other values as well) hitting maxint (2**32-1). Reducing the size of the input to the profiled script caused much more sane values to come out (including better/more realistic IPC and branching stats).
The hardware counters used by the Visual Profiler, Parallel Nsight, and the CUDA command line profiler are 32-bit counters and will overflow in 2^32 / shaderclock seconds (~5s). Some of the counters will overflow quicker than this. If you see values of MAX_INT or if your duration is in seconds then you are likely to see incorrect results in the tools.
I recommend splitting your kernel launch into 2 or more launches for profiling such that the duration of the launch is less than 1-2 seconds. In your case you have a Theoretical Occupancy of 67% (32 warps/SM) and a block size of 4 warps. When dividing work you want to make sure that each SM is fully loaded and preferably receives multiple waves of blocks. For each launch try launching NumSMs * MaxBlocksPerSM * 10 Blocks. For example, if you have a GTX560 which has 8 SMs and your reported configuration above you would break the single launch of 2500 blocks into 4 launches of 640, 640, 640, and 580.
Improved support for handling overflows should be in a future version of the tools.
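A hedged sketch of that splitting idea, assuming the kernel can take a block offset (the kernel, parameter names, and chunk size are placeholders):

#include <cuda_runtime.h>

// Sketch: split one long-running launch into several shorter ones so the
// profiler's 32-bit counters don't overflow during any single launch.
__global__ void work(float *data, int blockOffset)
{
    int block = blockIdx.x + blockOffset;
    int i = block * blockDim.x + threadIdx.x;
    data[i] += 1.0f;                         // stand-in for the real per-element work
}

void launch_in_chunks(float *data, int totalBlocks, int threadsPerBlock)
{
    const int chunk = 640;                   // e.g. NumSMs * MaxBlocksPerSM * 10
    for (int offset = 0; offset < totalBlocks; offset += chunk) {
        int blocks = (totalBlocks - offset < chunk) ? (totalBlocks - offset) : chunk;
        work<<<blocks, threadsPerBlock>>>(data, offset);
        cudaDeviceSynchronize();             // keep each profiled launch separate
    }
}

int main()
{
    const int totalBlocks = 2500, threadsPerBlock = 128;
    float *data;
    cudaMalloc(&data, (size_t)totalBlocks * threadsPerBlock * sizeof(float));
    cudaMemset(data, 0, (size_t)totalBlocks * threadsPerBlock * sizeof(float));
    launch_in_chunks(data, totalBlocks, threadsPerBlock);
    cudaFree(data);
    return 0;
}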
Theoretical occupancy is the maximum number of warps you can execute on an SM divided by the device limit. Theoretical occupancy can be lower than the device limit based upon the kernel's use of threads per block, registers per thread, or shared memory per block.
Achieved occupancy is the measure of (active_warps / active_cycles) / max_warps_per_sm.
An achieved occupancy of .02 implies that only 1 warp is active on the SM. Given a launch of 10000 warps (2500 blocks * 128 threads / WARP_SIZE), this can only happen if you have extremely divergent code where all warps except 1 immediately exit and 1 warp runs for a very long time. It is also highly unlikely that you could achieve an IPC of 1 with this achieved occupancy, so I suspect an error in the reported value.
If you would like help diagnosing the problem I would suggest you
post your device information
verify that you launched <<<{25,100,1}, {32, 4, 1}>>>
post your code
If you cannot post your code, I would recommend capturing the counters active_cycles and active_warps and calculating achieved occupancy as
(active_warps / active_cycles) / 48
Given that you have errors in your profiler log it is possible that the results are invalid.
I believe from the output that you are using an older version of the Visual Profiler. You may want to consider updating to version 4.1, which improves collection of PM counters and will also help provide hints on how to improve your code.
It seems like (a big part of) your issue here is:
Control flow divergence(%): 96.88
It sounds like 96.88 percent of the time, threads are not running the same instruction at the same time. The GPU can only really run the threads in parallel when each thread in a warp is running the same instruction at the same time. Things like if-else statements can cause some threads of a given warp to enter the if, and some threads to enter the else, causing divergence. What happens then is the GPU switches back and forth between executing each set of threads, causing each execution cycle to have a less than optimal occupancy.
To improve this, try to make sure that threads that will execute together in a warp (32 at a time on all NVIDIA cards today... I think) all take the same path through the kernel code. Sometimes sorting the input data so that like data gets processed together helps. Beyond that, adding a barrier in strategic places in the kernel code can help. If the threads of a warp are forced to diverge, a barrier will make sure that, after they reach common code again, they wait for each other to get there and then resume executing at full occupancy (for that warp). Just beware that a barrier must be hit by all threads, or you will cause a deadlock.
I can't promise this is your whole answer, but it seems to be a big problem for your code given the numbers listed in your question.
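For illustration, a minimal sketch of that pattern, with the barrier placed after the branch has reconverged so every thread of the block reaches it (in real code the barrier would normally also be protecting shared memory):

// Sketch: a divergent branch followed by a barrier placed after the
// reconvergence point, so it is hit unconditionally by all threads of the block.
__global__ void classify(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.0f;

    if (i < n) {
        if (in[i] > 0.0f)        // some lanes take this path...
            v = in[i] * 2.0f;
        else                     // ...others take this one: the warp diverges here
            v = -in[i];
    }

    __syncthreads();             // safe: unconditional, reached by every thread

    if (i < n)
        out[i] = v;
}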