CUDA profiler achieved occupancy very low; how to diagnose? - cuda

When I run the profiler against my code, part of the output is:
Limiting Factor
Achieved Occupancy: 0.02 ( Theoretical Occupancy: 0.67 )
IPC: 1.00 ( Maximum IPC: 4 )
Achieved occupancy of 0.02 seems horribly low. Is it possible that this is due to missing .csv files from the profile run? It complains about:
Program run #18 completed.
Read profiler output file for context #0, run #1, Number of rows=6
Error : Error in profiler data file '/.../temp_compute_profiler_1_0.csv' at line number 1. No column found
Error in reading profiler output:
Application : "/.../bin/python".
Profiler data file '/.../temp_compute_profiler_2_0.csv' for application run 2 not found.
Read profiler output file for context #0, run #4, Number of rows=6
My blocks are 32*4*1, the grid is 25*100, and testing has shown that 32 registers provides the best performance (even though that results in spilling).
If the 0.02 number is correct, how can I go about debugging what's going on? I've already tried moving likely candidates to shared and/or constant memory, experimenting with launch_bounds, moving data into textures, etc.
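For reference, register pressure can be steered either per kernel with __launch_bounds__ or for a whole file with the -maxrregcount compiler flag; a minimal sketch with a hypothetical kernel, where the 128 and 8 are only assumptions chosen to match the 32*4*1 block size:
// At most 128 threads per block, and request at least 8 resident blocks
// per SM, which caps the number of registers the compiler may use per thread.
__global__ void __launch_bounds__(128, 8)
myKernel(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;    // placeholder body
}

// Alternatively, limit registers for every kernel in the file at compile time:
//   nvcc -maxrregcount=32 kernel.cu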
Edit: if more data from a profile run will be helpful, just let me know and I can provide it. Thanks for reading.
Edit 2: requested data.
IPC: 1.00
Maximum IPC: 4
Divergent branches(%): 6.44
Control flow divergence(%): 96.88
Replayed Instructions(%): -0.00
Global memory replay(%): 10.27
Local memory replays(%): 5.45
Shared bank conflict replay(%): 0.00
Shared memory bank conflict per shared memory instruction(%): 0.00
L1 cache read throughput(GB/s): 197.17
L1 cache global hit ratio (%): 51.23
Texture cache memory throughput(GB/s): 0.00
Texture cache hit rate(%): 0.00
L2 cache texture memory read throughput(GB/s): 0.00
L2 cache global memory read throughput(GB/s): 9.80
L2 cache global memory write throughput(GB/s): 6.80
L2 cache global memory throughput(GB/s): 16.60
Local memory bus traffic(%): 206.07
Peak global memory throughput(GB/s): 128.26
The following derived statistic(s) cannot be computed as required counters are not available:
Kernel requested global memory read throughput(GB/s)
Kernel requested global memory write throughput(GB/s)
Global memory excess load(%)
Global memory excess store(%)
Achieved global memory read throughput(GB/s)
Achieved global memory write throughput(GB/s)
Solution(s):
The issue with missing data was due to a too-low timeout value; certain early runs would time out and their data would not be written (and those error messages got lost in the spam of later runs).
The 0.02 achieved occupancy was due to active_warps and active_cycles (and potentially other values as well) hitting maxint (2**32-1). Reducing the size of the input to the profiled script caused much more sane values to come out (including better/more realistic IPC and branching stats).

The hardware counters used by the Visual Profiler, Parallel Nsight, and the CUDA command line profiler are 32-bit counters and will overflow in 2^32 / shaderclock seconds (~5s). Some of the counters will overflow quicker than this. If you see values of MAX_INT or if your duration is in seconds then you are likely to see incorrect results in the tools.
I recommend splitting your kernel launch into 2 or more launches for profiling such that the duration of each launch is less than 1-2 seconds. In your case you have a Theoretical Occupancy of 67% (32 warps/SM) and a block size of 4 warps. When dividing work you want to make sure that each SM is fully loaded and preferably receives multiple waves of blocks. For each launch try launching NumSMs * MaxBlocksPerSM * 10 blocks. For example, if you have a GTX 560, which has 8 SMs, then with your reported configuration above you would break the single launch of 2500 blocks into 4 launches of 640, 640, 640, and 580.
Improved support for handling overflows should be in a future version of the tools.
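A minimal sketch of what splitting the launch might look like (hypothetical kernel name and offset parameter; the chunk size should follow the NumSMs * MaxBlocksPerSM * 10 rule of thumb above):
// Hypothetical: give the kernel a block-row offset so several smaller
// launches together cover the original 25x100 grid of blocks.
__global__ void myKernel(int blockRowOffset /*, ...original arguments... */)
{
    // int row = blockRowOffset + blockIdx.y;   // recover the original block row
    // ... original kernel body ...
}

void profileInChunks()
{
    dim3 block(32, 4, 1);            // 128 threads, as in the question
    const int rowsPerChunk = 25;     // 25 x 25 = 625 blocks per launch
    for (int offset = 0; offset < 100; offset += rowsPerChunk) {
        dim3 grid(25, rowsPerChunk, 1);
        myKernel<<<grid, block>>>(offset);
        cudaDeviceSynchronize();     // keep each launch short and well under the ~5 s counter limit
    }
}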

Theoretical occupancy is the maximum number of warps you can execute on an SM divided by the device limit. Theoretical occupancy can be lower than the device limit based upon the kernel's use of threads per block, registers per thread, or shared memory per block.
Achieved occupancy is the measure of (active_warps / active_cycles) / max_warps_per_sm.
An achieved occupancy of .02 implies that only 1 warp is active on the SM. Given a launch of 10000 warps (2500 blocks * 128 threads / WARP_SIZE) this can only happen if you have extremely divergent code where all warps except for 1 immediately exit and 1 warp runs for a very long time. Also, it is highly unlikely that you could achieve an IPC of 1 with this achieved occupancy, so I suspect an error in the reported value.
If you would like help diagnosing the problem I would suggest you
post your device information
verify that you launched <<<{25,100,1}, {32,4,1}>>>
post your code
If you cannot post your code I would recommend capturing the counters active_cycles and active_warps and calculate achieved occupancy as
(active_warps / active_cycles) / 48
Given that you have errors in your profiler log it is possible that the results are invalid.
I believe from the output you are using an older version of the Visual Profiler. You may want to consider updating to version 4.1 which improves both collection of PM counters as well as will help provide hints on how to improve your code.

It seems like (a big part of) your issue here is:
Control flow divergence(%): 96.88
It sounds like 96.88 percent of the time, threads are not running the same instruction at the same time. The GPU can only really run the threads in parallel when each thread in a warp is running the same instruction at the same time. Things like if-else statements can cause some threads of a given warp to enter the if, and some threads to enter the else, causing divergence. What happens then is the GPU switches back and forth between executing each set of threads, causing each execution cycle to have a less than optimal occupancy.
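A toy illustration of how that happens (a hypothetical kernel, not the code being profiled):
__global__ void toyDivergence(const int *in, int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Lanes of the same warp pick different branches depending on their
        // data, so the warp executes both branches one after the other with
        // part of its lanes masked off each time.
        if (in[idx] % 2 == 0)
            out[idx] = in[idx] * 2;
        else
            out[idx] = in[idx] + 1;
    }
}
Sorting the input so that even and odd values end up in different warps would let each warp take a single path, which is the kind of rearrangement suggested below.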
To improve this, try to make sure that threads that will execute together in a warp (32 at a time in all NVIDIA cards today... I think) will all take the same path through the kernel code. Sometimes sorting the input data so that like data gets processed together works. Beyond that, adding a barrier in strategic places in the kernel code can help. If threads of a warp are forced to diverge, a barrier will make sure that, after they reach common code again, they wait for each other to get there and then resume executing with full occupancy (for that warp). Just beware that a barrier must be hit by all threads, or you will cause a deadlock.
I can't promise this is your whole answer, but it seems to be a big problem for your code given the numbers listed in your question.

Related

Issued load/store instructions for replay

There are two nvprof metrics regarding load/store instructions, ldst_executed and ldst_issued. We know that executed <= issued. I expect that those load/stores that are issued but not executed are related to branch predication and other incorrect predictions. However, according to this (slide 9) document and this topic, instructions that are issued but not executed are related to serialization and replay.
I don't know whether that reason applies to load/store instructions or not. Moreover, I would like to know why such terminology is used for issued-but-not-executed instructions: if there is serialization for any reason, instructions are executed multiple times, so why are they not counted as executed?
Any explanation for that?
The NVIDIA architecture optimizes memory throughput by issuing one instruction for a group of threads called a warp. If each thread accesses a consecutive data element or the same element then the access can be performed very efficiently. However, if each thread accesses data in a different cache line or at a different address in the same bank then there is a conflict and the instruction has to be replayed.
inst_executed is the count of instructions retired.
inst_issued is the count of instructions issued. An instruction may be issued multiple times in the case of a vector memory access, memory address conflict, memory bank conflict, etc. On each issue the thread mask is reduced until all threads have completed.
The distinction is made for two reasons:
1. Retirement of an instruction indicates completion of a data dependency. The data dependency is only resolved 1 time despite possible replays.
2. The ratio between issued and executed is a simple way to show opportunities to save warp scheduler issue cycles.
In the Fermi and Kepler SMs, if a memory conflict was encountered then the instruction was replayed (re-issued) until all threads completed. This was performed by the warp scheduler. These replays consume issue cycles, reducing the ability of the SM to issue instructions to the math pipes. In these SMs, issued > executed indicates an opportunity for optimization, especially if issued IPC is high.
In the Maxwell through Turing SMs, vector accesses, address conflicts, and memory conflicts are replayed by the memory unit (shared memory, L1, etc.) and do not steal warp scheduler issue cycles. In these SMs, issued is very seldom more than a few % above executed.
EXAMPLE: A kernel loads a 32-bit value. All 32 threads in the warp are active and each thread accesses a unique cache line (stride = 128 bytes).
On Kepler (CC3.*) SM the instruction is issued 1 time then replayed 31 additional times as the Kepler L1 can only perform 1 tag lookup per request.
inst_executed = 1
inst_issued = 32
On Kepler the instruction has to be replayed again for each request that missed in the L1. If all threads miss in the L1 cache then
inst_executed = 1
inst_issued >= 64 (32 requests + 32 replays for misses)
On Maxwell - Turing architecture the replay is performed by the SM memory system. The replays can limit memory throughput but will not block the warp scheduler from issuing instructions to the math pipe.
inst_executed = 1
inst_issued = 1
On Maxwell-Turing Nsight Compute/Perfworks expose throughput counters for each of the memory pipelines including number of cycles due to memory bank conflicts, serialization of atomics, address divergence, etc.
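A toy version of the access pattern in the EXAMPLE above (hypothetical kernel, meant to be launched as a single warp): every lane loads one 32-bit value from a different 128-byte cache line.
__global__ void stridedLoad(const float *in, float *out)
{
    int lane = threadIdx.x & 31;        // lane id within the warp
    out[threadIdx.x] = in[lane * 32];   // 32 floats = 128 bytes between lanes
}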
GPU architecture is based on maximizing throughput rather than minimizing latency. Thus, GPUs (currently) don't really do out-of-order execution or branch prediction. Instead of building a few cores full of complex control logic to make one thread run really fast (like you'd have on a CPU), GPUs rather use those transistors to build more cores to run as many threads as possible in parallel.
As explained on slide 9 of the presentation you linked, executed instructions are the instructions that control flow passes over in your program (basically, the number of lines of assembly code that were run). When you, e.g., execute a global load instruction and the memory request cannot be served immediately (misses the cache), the GPU will switch to another thread. Once the value is ready in the cache and the GPU switches back to your thread, the load instruction will have to be issued again to complete fetching the value (see also this answer and this thread). When you, e.g., access shared memory and there are bank conflicts, the shared memory access will have to be replayed multiple times for different threads in the warp…
The main reason to differentiate between executed and issued instructions would seem to be that the ratio of the two can serve as a measurement for the amount of overhead your code produces due to instructions that cannot be completed immediately at the time they are executed…

Interpreting NVIDIA Visual Profiler outputs

I have recently started playing with the NVIDIA Visual Profiler (CUDA 7.5) to time my applications.
However, I don't seem to fully understand the implications of the outputs I get, and I am not sure how to act on them.
As an example: a CUDA code that calls a single kernel ~360 times in a for loop. Each time, the kernel performs about 1000 3D texture memory reads for each of 512^2 units of work, with one thread allocated per unit. Some arithmetic is needed to know which position to read in texture memory. The texture read is performed without interpolation, always at the exact data index. The reason 3D texture memory was chosen is that the memory reads are relatively random, so memory coalescence is not expected. I can't find the reference for this, but I definitely read it on SO somewhere.
The description is short, but I hope it gives a small overview of what operations the kernel does (posting the whole kernel would probably be too much, but I can if required).
From now on, I will describe my interpretation of the profiler.
When profiling, if I run Examine GPU Usage I get:
From here I see several things:
Low Memcopy/Compute overlap 0%. This is expected, as I run a big kernel, wait until it has finished and then memcopy. There should not be overlap.
Low Kernel Concurrency 0%. I just got 1 kernel, this is expected.
Low Memcopy Overlap 0%. Same thing. I only memcopy once at the beginning, and I memcopy once after each kernel. This is expected.
From the kernel execution "bars", top and right, I can see:
Most of the time is running kernels. There is little memory overhead.
All kernels take the same time (good)
The biggest flag is occupancy, always below 45%, with registers being the limiter. However, optimizing occupancy doesn't always seem to be a priority.
I follow my profiling by running Perform Kernel Analysis, getting:
I can see here that
Compute and memory utilization is low in the kernel. The profiler suggests that below 60% is no good.
Most of the time is in computing and L2 cache reading.
Something else?
I continue with Perform Latency Analysis, as the profiler suggests that the biggest bottleneck is there.
The 3 biggest stall reasons seem to be
Memory dependency. Too many texture memory reads? But I need this many reads.
Execution dependency. "can be reduced by increasing instruction level parallelism". Does this mean that I should try to change e.g. a=a+1;a=a*a;b=b+1;b=b*b; to a=a+1;b=b+1;a=a*a;b=b*b;? (see the sketch after this list)
Instruction fetch (????)
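That reordering is the basic idea behind instruction-level parallelism: interleaving two independent chains gives the scheduler something to issue while each chain is still waiting on its previous result. A hedged sketch (plain statements, not taken from the profiled kernel):
// Dependent chains back to back: every statement waits on the one before it.
a = a + 1;  a = a * a;
b = b + 1;  b = b * b;

// Interleaved: the a-chain and b-chain are independent, so while one is
// still waiting on its previous result the other can be issued.
a = a + 1;  b = b + 1;
a = a * a;  b = b * b;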
Questions:
Are there additional tests I can perform to better understand my kernel's execution time limitations?
Is there a way to profile at the instruction level inside the kernel?
Are there more conclusions one can obtain by looking at the profiling than the ones I do obtain?
If I were to start trying to optimize the kernel, where would I start?
Are there additional tests I can perform to better understand my kernel's execution time limitations?
Of course! If you pay attention to the "Properties" window, your screenshot is telling you that your kernel 1. is limited by register usage (check it in the 'Kernel Latency' analysis), and 2. has low Warp Efficiency (less than 100% means thread divergence) (check it in 'Divergent Execution').
Is there a way to profile at the instruction level inside the kernel?
Yes, you have available two types of profiling:
'Kernel Profile - Instruction Execution'
'Kernel Profile - PC Sampling' (Only in Maxwell)
Are there more conclusions one can obtain by looking at the profiling than the ones I do obtain?
You should check if your kernel has some thread divergence. Also you should check that there is no problem with shared/global memory access patterns.
If I were to start trying to optimize the kernel, where would I start?
I find the Kernel Latency window the most useful one, but I suppose it depends on the type of kernel you are analyzing.

"Global Load Efficiency" over 100%

I have a CUDA program in which threads of a block read elements of a long array in several iterations and memory accesses are almost fully coalesced. When I profile, Global Load Efficiency is over 100% (between 119% and 187% depending on the input). Description for Global Load Efficiency is "Ratio of global memory load throughput to required global memory load throughput." Does it mean that I'm hitting L2 cache a lot and my memory accesses are benefiting from it?
My GPU is GeForce GTX 780 (Kepler architecture).
I asked this question at NVIDIA forum here. I quote the answer I got:
"Global Load Efficiency and Global Store Efficiency describe how well the coalescing of DRAM-accesses and (L2?)Cache-accesses works. If they're 100 percent then you've got perfect coalescing. Since efficiencies above 100 percent don't make any sense (you cannot be better than optimal) this has to be an error.
This error is caused by the Visual Profiler, which counts hardware events to calculate some abstract metrics. But the GPU doesn't have the "correct" events to exactly calculate all those metrics, thus Visual Profiler has to estimate those metrics by using some complex formula and "wrong" events. There are some metrics which are just rough estimations and Global Load Efficiency and Global Store Efficiency are two of them. Thus if such an efficiency is bigger than 100 percent it is an estimation error. As far as I observed the Global Load Efficiency and Global Store Efficiency both increased above 100 percent in some of my register spilling kernels. That's why i assume that the Visual-Profiler uses some events, which also may be caused by local memory accesses, to calculate those two efficiencies. Furthermore GPUs just uses 32 Bit Counters. Thus long running kernel tend to overflow those counters, which also causes the Visual Profiler to display wrong metrics."

The number of blocks that can be scheduled at the same time

This question also started from the following link: shared memory optimization confusion
In the above link, from talonmies's answer, I found that the first limit on the number of blocks that will be scheduled to run is "8". I have 3 questions, shown below.
Does it mean that only 8 blocks can be scheduled at the same time when the number of blocks allowed by conditions 2 and 3 is over 8? Is this regardless of any condition such as the CUDA environment, GPU device, or algorithm?
If so, it really means that in some cases it is better not to use shared memory; it depends. Then we have to think about how to judge which is better: using or not using shared memory. I think one approach is checking whether there is a global memory access limitation (a memory bandwidth bottleneck) or not, meaning we can select "not using shared memory" if there is no such limitation. Is that a good approach?
In addition to question 2 above, I think that if the data my CUDA program has to handle is huge, then "not using shared memory" may be better because it is hard to fit within shared memory. Is that also a good approach?
The number of concurrently scheduled blocks is always going to be limited by something.
Playing with the CUDA Occupancy Calculator should make it clear how it works. The usage of three types of resources affect the number of concurrently scheduled blocks. They are, Threads Per Block, Registers Per Thread and Shared Memory Per Block.
If you set up a kernel that uses 1 Thread Per Block, 1 Register Per Thread, and 1 Shared Memory Per Block on Compute Capability 2.0, you are limited by Max Blocks per Multiprocessor, which is 8. If you start increasing Shared Memory Per Block, Max Blocks per Multiprocessor will continue to be your limiting factor until you reach a threshold at which Shared Memory Per Block becomes the limiting factor. Since there are 49152 bytes of shared memory per SM, that happens at around 49152 / 8 = 6144 bytes per block (it's a bit less because some shared memory is used by the system and it's allocated in chunks of 128 bytes).
In other words, given the limit of 8 Max Blocks per Multiprocessor, using shared memory is completely free (as it relates to the number of concurrently running blocks), as long as you stay below the threshold at which Shared Memory Per Block becomes the limiting factor.
The same goes for register usage.
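If you prefer to check this programmatically rather than in the spreadsheet, the runtime occupancy API (available in CUDA 6.5 and later) reports the same limit; a minimal sketch with a hypothetical kernel:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { }     // hypothetical placeholder kernel

int main()
{
    int blockSize = 128;           // threads per block
    size_t dynamicSmem = 6144;     // dynamic shared memory per block, in bytes

    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, myKernel,
                                                  blockSize, dynamicSmem);

    // Whichever resource (threads, registers, shared memory, or the
    // hardware block limit per SM) runs out first determines this value.
    printf("Max concurrent blocks per SM: %d\n", maxBlocksPerSM);
    return 0;
}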

How much is run concurrently on a GPU given its number of SMs and SPs?

I am having some trouble understanding threads in the NVIDIA GPU architecture with CUDA.
Could anybody please clarify this info:
An 8800 GPU has 16 SMs with 8 SPs each, so we have 128 SPs.
I was viewing Stanford University's video presentation and it said that every SP is capable of running 96 threads concurrently. Does this mean that it (the SP) can run 96/32 = 3 warps concurrently?
Moreover, since every SP can run 96 threads and we have 8 SPs in every SM, does this mean that every SM can run 96*8 = 768 threads concurrently? But if every SM can run a single block at a time, and the maximum number of threads in a block is 512, what is the purpose of running 768 threads concurrently while having a max of 512 threads?
A more general question: how are blocks, threads, and warps distributed to SMs and SPs? I read that every SM gets a single block to execute at a time, the threads in a block are divided into warps (32 threads), and the SPs execute warps.
You should check out the webinars on the NVIDIA website, you can join a live session or view the pre-recorded sessions. Below is a quick overview, but I strongly recommend you watch the webinars, they will really help as you can see the diagrams and have it explained at the same time.
When you execute a function (a kernel) on a GPU it executes as a grid of blocks of threads.
A thread is the finest granularity, each thread has a unique identifier within the block (threadIdx) which is used to select which data to operate on. The thread can have a relatively large number of registers and also has a private area of memory known as local memory which is used for register file spilling and any large automatic variables.
A block is a group of threads which execute together in a batch. The main reason for this level of granularity is that threads within a block can cooperate by communicating using the fast shared memory. Each block has a unique identifier (blockIdx) which, in conjunction with the threadIdx, is used to select data.
A grid is a set of blocks which together execute the GPU operation.
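For example, the usual way those identifiers combine to pick a data element looks like this (a minimal sketch, not tied to any particular application):
__global__ void scale(float *data, int n)
{
    // blockIdx selects the block, threadIdx selects the thread within it;
    // together they give each thread its own element to work on.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}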
That's the logical hierarchy. You really only need to understand the logical hierarchy to implement a function on the GPU, however to get performance you need to understand the hardware too which is SMs and SPs.
A GPU is composed of SMs, and each SM contains a number of SPs. Currently there are 8 SPs per SM and between 1 and 30 SMs per GPU, but really the actual number is not a major concern until you're getting really advanced.
The first point to consider for performance is that of warps. A warp is a set of 32 threads (if you have 128 threads in a block, for example, then threads 0-31 will be in one warp, 32-63 in the next, and so on). Warps are very important for a few reasons, the most important being:
Threads within a warp are bound together: if one thread within a warp goes down the 'if' side of an if-else block and the others go down the 'else', then actually all 32 threads will go down both sides. Functionally there is no problem, those threads which should not have taken the branch are disabled so you will always get the correct result, but if both sides are long then the performance penalty is important.
Threads within a warp (actually a half-warp, but if you get it right for warps then you're safe on the next generation too) fetch data from memory together, so if you can ensure that all threads fetch data within the same 'segment' then you will only pay one memory transaction, whereas if they all fetch from random addresses then you will pay 32 memory transactions (see the sketch after this list). See the Advanced CUDA C presentation for details on this, but only when you are ready!
Threads within a warp (again half-warp on current GPUs) access shared memory together and if you're not careful you will have 'bank conflicts' where the threads have to queue up behind each other to access the memories.
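As a rough sketch of the coalescing point above (hypothetical kernel; the exact 'segment' size depends on the GPU generation):
__global__ void copyExample(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Coalesced: consecutive threads read consecutive addresses, so the
    // 32 loads of a warp fall into the same memory segment.
    out[i] = in[i];

    // Strided (commented out): consecutive threads would hit addresses far
    // apart, turning one warp load into many separate transactions.
    // out[i] = in[i * 32];
}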
So having understood what a warp is, the final point is how the blocks and grid are mapped onto the GPU.
Each block will start on one SM and will remain there until it has completed. As soon as it has completed it will retire and another block can be launched on the SM. It's this dynamic scheduling that gives the GPUs the scalability - if you have one SM then all blocks run on the same SM on one big queue, if you have 30 SMs then the blocks will be scheduled across the SMs dynamically. So you should ensure that when you launch a GPU function your grid is composed of a large number of blocks (at least hundreds) to ensure it scales across any GPU.
The final point to make is that an SM can execute more than one block at any given time. This explains why a SM can handle 768 threads (or more in some GPUs) while a block is only up to 512 threads (currently). Essentially, if the SM has the resources available (registers and shared memory) then it will take on additional blocks (up to 8). The Occupancy Calculator spreadsheet (included with the SDK) will help you determine how many blocks can execute at any moment.
Sorry for the brain dump, watch the webinars - it'll be easier!
It's a little confusing at first, but it helps to know that each SP does something like 4 way SMT - it cycles through 4 threads, issuing one instruction per clock, with a 4 cycle latency on each instruction. So that's how you get 32 threads per warp running on 8 SPs.
Rather than go through all the rest of the stuff with warps, blocks, threads, etc, I'll refer you to the nVidia CUDA Forums, where this kind of question crops up regularly and there are already some good explanations.