Issued load/store instructions for replay - CUDA

There are two nvprof metrics regarding load/store instructions, ldst_executed and ldst_issued, and we know that executed <= issued. I would expect the load/stores that are issued but not executed to be related to branch predication and other incorrect predictions. However, according to this document (slide 9) and this topic, instructions that are issued but not executed are related to serialization and replay.
I don't know whether that reason applies to load/store instructions or not. Moreover, I would like to know why such terminology is used for instructions that are issued but not executed. If there is serialization for any reason, the instructions are run multiple times, so why are they not counted as executed?
Any explanation for that?

The NVIDIA architecture optimizes memory throughput by issuing an instruction for a group of threads called a warp. If each thread accesses a consecutive data element or the same element, then the access can be performed very efficiently. However, if each thread accesses data in a different cache line or at a different address in the same bank, then there is a conflict and the instruction has to be replayed.
inst_executed is the count of instructions retired.
inst_issued is the count of instructions issued. An instruction may be issued multiple times in the case of a vector memory access, memory address conflict, memory bank conflict, etc. On each issue the thread mask is reduced until all threads have completed.
The distinction is made for two reasons:
1. Retirement of an instruction indicates completion of a data dependency. The data dependency is only resolved 1 time despite possible replays.
2. The ratio between issued and executed is a simple way to show opportunities to save warp scheduler issue cycles.
In the Fermi and Kepler SMs, if a memory conflict was encountered, the instruction was replayed (re-issued) until all threads completed. This was performed by the warp scheduler. These replays consume issue cycles, reducing the SM's ability to issue instructions to the math pipes. On these SMs, issued > executed indicates an opportunity for optimization, especially if the issued IPC is high.
In the Maxwell through Turing SMs, replays for vector accesses, address conflicts, and memory conflicts are handled by the memory unit (shared memory, L1, etc.) and do not steal warp scheduler issue cycles. On these SMs, issued is very seldom more than a few percent above executed.
EXAMPLE: A kernel loads a 32-bit value. All 32 threads in the warp are active and each thread accesses a unique cache line (stride = 128 bytes).
On a Kepler (CC 3.x) SM the instruction is issued 1 time and then replayed 31 additional times, as the Kepler L1 can only perform 1 tag lookup per request.
inst_executed = 1
inst_issued = 32
On Kepler the instruction has to be replayed again for each request that missed in the L1. If all threads miss in the L1 cache then
inst_executed = 1
inst_issued >= 64 (32 requests + 32 replays for misses)
On the Maxwell through Turing architectures, the replay is performed by the SM memory system. The replays can limit memory throughput but will not block the warp scheduler from issuing instructions to the math pipes.
inst_executed = 1
inst_issued = 1
On Maxwell through Turing, Nsight Compute/Perfworks expose throughput counters for each of the memory pipelines, including the number of cycles lost to memory bank conflicts, serialization of atomics, address divergence, etc.
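To make the example concrete, here is a minimal sketch of such a kernel (the name and sizes are illustrative, not taken from the answer). Profiling it with the ldst_issued and ldst_executed metrics from the question should show the architecture-dependent gap described above:

    // Each thread of a warp loads one 32-bit value from a different 128-byte
    // cache line (stride of 32 ints = 128 bytes), as in the example above.
    __global__ void stridedLoad(const int *in, int *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int idx = tid * 32;         // 32 ints * 4 bytes = 128-byte stride
        if (idx < n)
            out[tid] = in[idx];     // one load instruction per warp:
                                    // inst_executed = 1, inst_issued varies
    }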

GPU architecture is based on maximizing throughput rather than minimizing latency. Thus, GPUs (currently) don't really do out-of-order execution or branch prediction. Instead of building a few cores full of complex control logic to make one thread run really fast (like you'd have on a CPU), GPUs rather use those transistors to build more cores to run as many threads as possible in parallel.
As explained on slide 9 of the presentation you linked, executed instructions are the instructions that control flow passes over in your program (basically, the number of lines of assembly code that were run). When you, e.g., execute a global load instruction and the memory request cannot be served immediately (it misses the cache), the GPU will switch to another thread. Once the value is ready in the cache and the GPU switches back to your thread, the load instruction will have to be issued again to complete fetching the value (see also this answer and this thread). When you, e.g., access shared memory and there are bank conflicts, the shared memory access has to be replayed multiple times for different threads in the warp…
The main reason to differentiate between executed and issued instructions would seem to be that the ratio of the two serves as a measure of the amount of overhead your code produces due to instructions that cannot be completed immediately at the time they are executed…
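If it helps, that overhead ratio can be computed directly from the two metrics named in the question. Below is a minimal host-side sketch with purely hypothetical counter values (not real measurements):

    #include <cstdio>

    int main()
    {
        // Hypothetical values as reported by a profiler run
        // (e.g. nvprof --metrics ldst_issued,ldst_executed ./app).
        unsigned long long ldst_issued   = 1200000;
        unsigned long long ldst_executed = 1000000;

        // Fraction of load/store issues that were replays rather than
        // first-time issues of an instruction.
        double replay_overhead =
            (double)(ldst_issued - ldst_executed) / (double)ldst_executed;

        printf("load/store replay overhead: %.1f%%\n", replay_overhead * 100.0);
        return 0;
    }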

Related

Why does each thread have its own instruction address counter inside a warp?

Warps in CUDA always contain 32 threads, and all 32 of these threads run the same instruction when the warp is running on an SM. A previous question also says each thread has its own instruction counter, as quoted below.
Then why does each thread need its own instruction address counter if all 32 threads always execute the same instruction? Couldn't the threads inside one warp just share an instruction address counter?
Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data
I'm not able to respond directly to the quoted text, because I don't have the book it comes from, nor do I know the author's intent.
However, an independent program counter per thread is considered to be a new feature in Volta; see Figure 21 and its caption in the Volta whitepaper:
Volta maintains per-thread scheduling resources such as program counter (PC) and call stack (S), while earlier architectures maintained these resources per warp.
The same whitepaper probably does about as good a job as you will find of explaining why this is needed in Volta, and presumably it carries forward to newer architectures such as Turing:
Volta’s independent thread scheduling allows the GPU to yield execution of any thread, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. To maximize parallel efficiency, Volta includes a schedule optimizer which determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity, while the convergence optimizer in Volta will still group together threads which are executing the same code and run them in parallel for maximum efficiency.
Because of this, a Volta warp could have any number of subgroups of threads (up to the warp size, 32), which could be at different places in the instruction stream. The Volta designers decided that the best way to support this flexibility was to provide (among other things) a separate PC per thread in the warp.
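As a small illustration of why this matters, consider a warp whose two halves take different branches. The sketch below is illustrative only (it assumes the kernel is launched with a multiple of 32 threads and one element of data per thread); on Volta and later the two paths can make independent progress and must be explicitly reconverged, which is also why the *_sync warp intrinsics take a thread mask:

    __global__ void divergentHalves(int *data)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % 32;

        int v = data[tid];
        if (lane < 16)
            v *= 2;                  // path A: lower half-warp
        else
            v += 1;                  // path B: upper half-warp

        __syncwarp();                // explicitly reconverge the warp

        // The full mask is required because lockstep execution of the whole
        // warp is no longer guaranteed after divergence on Volta+.
        data[tid] = v + __shfl_xor_sync(0xffffffffu, v, 16);
    }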

how exactly does CUDA handle a memory access?

i would like to know how CUDA hardware/run-time system handles the following case.
If an instruction of a warp (warp1 in the following) involves access to global memory (load/store), the run-time system schedules the next ready warp for execution.
When the new warp is executed,
Will the "memory access" of warp1 be conducted in parallel, i.e. while the new warp is running ?
Will the run time system put warp1 into a memory access waiting queue; once the memory request is completed, the warp is then moved into the runnable queue?
Will the instruction pointer related to warp1 execution be incremented automatically and in parallel to the new warp execution, to annotate that the memory request is completed?
For instance, consider this pseudocode: output = input + array[i], where output and input are both scalar variables mapped to registers, whereas array resides in global memory.
To run the above statement, we need to load the value of array[i] into a (temporary) register before updating output; i.e. the statement can be translated into 2 macro assembly instructions: load reg, reg = &array[i] and output_register = input_register + reg.
I would like to know how the hardware and the runtime system handle the execution of these 2 macro assembly instructions, given that the load can't return immediately.
I am not sure I understand your questions correctly, so I'll just try to answer them as I read them:
1. Yes, while a memory transaction is in flight, further independent instructions will continue to be issued. There isn't necessarily a switch to a different warp, though: while instructions from other warps will always be independent, the following instructions from the same warp might be independent as well, and the same warp may keep running (i.e. further instructions may be issued from the same warp).
2. No. As explained under 1., the warp can and will continue executing instructions until either the result of the load is needed by a dependent instruction, or a memory fence / barrier instruction requires it to wait for the effect of the store to be visible to other threads. This can go as far as issuing further (independent) load or store instructions, so that multiple memory transactions can be in flight for the same warp at the same time. So the status of a warp after issuing a load/store doesn't change fundamentally, and it is not halted until necessary.
3. The instruction pointer will always be incremented automatically (there is no situation where you ever do this manually, nor are there instructions that allow you to do so). However, as 2. implies, this doesn't necessarily indicate that the memory access has been performed: there is separate hardware to track the progress of memory accesses.
Please note that the hardware implementation is completely undocumented by Nvidia. You might find some indications of possible implementations if you search through Nvidia's patent applications.
GPUs up to the Fermi generation (compute capability 2.x) tracked outstanding memory transactions completely in hardware. While undocumented by Nvidia, the common mechanism to track (memory) transactions in flight is scoreboarding.
GPUs from newer generations, starting with Kepler (compute capability 3.x), use some assistance in the form of control words embedded in the shader assembly code. While again undocumented, Scott Gray has reverse engineered these for his Maxas Maxwell assembler. He found that (amongst other things) the control words contain barrier instructions for tracking memory transactions, and he was kind enough to document his findings on his Control-Codes wiki page.
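Points 1 and 2 above can be illustrated with a trivial kernel (illustrative only, not from the answer): the two loads below are independent, so both memory transactions can be in flight at the same time, and the warp only has to wait at the add, where the loaded values are first consumed.

    __global__ void twoLoadsInFlight(const float *a, const float *b,
                                     float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = a[i];          // load issued; the warp keeps going
            float y = b[i];          // second, independent load also issued
            out[i] = x + y;          // first dependent use: the warp waits here
        }
    }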

What's the mechanism of the warps and the banks in CUDA?

I'm a rookie learning CUDA parallel programming. Now I'm confused about global memory access on the device. It's about the warp model and coalescing.
There are some points:
It's said that the threads in one block are split into warps, and each warp contains at most 32 threads. That means all the threads of the same warp will execute simultaneously on the same multiprocessor. So what is the point of a half-warp?
When it comes to the shared memory of one block, it is split into 16 banks. To avoid bank conflicts, multiple threads can read from one bank at the same time rather than write to the same bank. Is this a correct interpretation?
Thanks in advance!
The principal usage of "half-warp" applied to CUDA processors prior to the Fermi generation (e.g. the "Tesla" or GT200 generation, and the original G80/G92 generation). These GPUs were architected with an SM (streaming multiprocessor, a HW block inside the GPU) that had fewer than 32 thread processors. The definition of a warp was still the same, but the actual HW execution took place in "half-warps" at a time. Actually the granular details are more complicated than this, but suffice it to say that the execution model caused memory requests to be issued according to the needs of a half-warp, i.e. 16 threads within the warp. A full warp that hit a memory transaction would thus generate a total of 2 requests for that transaction.
Fermi and newer GPUs have at least 32 thread processors per SM, so a memory transaction is immediately visible across a full warp. As a result, memory requests are issued at the per-warp level rather than per half-warp. However, a full memory request can only retrieve 128 bytes at a time. Therefore, for data sizes larger than 32 bits per thread per transaction, the memory controller may still break the request down into a half-warp size.
My view is that, especially for a beginner, it's not necessary to have a detailed understanding of the half-warp. It's generally sufficient to understand that it refers to a group of 16 threads executing together and that it has implications for memory requests.
Shared memory on Fermi-class GPUs is broken into 32 banks; on previous GPUs it was broken into 16 banks. Bank conflicts occur any time an individual bank is accessed at more than one address by more than one thread in the same memory request (i.e. originating from the same code instruction). To avoid bank conflicts, the basic strategies are very similar to the strategies for coalescing memory requests, e.g. for global memory. On Fermi and newer GPUs, multiple threads can read the same address without causing a bank conflict; a conflict arises when multiple threads access different words that map to the same bank. For further understanding of shared memory and how to avoid bank conflicts, I would recommend the NVIDIA webinar on this topic.

CUDA profiled achieved occupancy very low; how to diagnose?

When I run the profiler against my code, part of the output is:
Limiting Factor
Achieved Occupancy: 0.02 ( Theoretical Occupancy: 0.67 )
IPC: 1.00 ( Maximum IPC: 4 )
Achieved occupancy of 0.02 seems horribly low. Is it possible that this is due to missing .csv files from the profile run? It complains about:
Program run #18 completed.
Read profiler output file for context #0, run #1, Number of rows=6
Error : Error in profiler data file '/.../temp_compute_profiler_1_0.csv' at line number 1. No column found
Error in reading profiler output:
Application : "/.../bin/python".
Profiler data file '/.../temp_compute_profiler_2_0.csv' for application run 2 not found.
Read profiler output file for context #0, run #4, Number of rows=6
My blocks are 32*4*1, the grid is 25*100, and testing has shown that 32 registers provides the best performance (even though that results in spilling).
If the 0.02 number is correct, how can I go about debugging what's going on? I've already tried moving likely candidates to shared and/or constant memory, experimenting with launch_bounds, moving data into textures, etc.
Edit: if more data from a profile run will be helpful, just let me know and I can provide it. Thanks for reading.
Edit 2: requested data.
IPC: 1.00
Maximum IPC: 4
Divergent branches(%): 6.44
Control flow divergence(%): 96.88
Replayed Instructions(%): -0.00
Global memory replay(%): 10.27
Local memory replays(%): 5.45
Shared bank conflict replay(%): 0.00
Shared memory bank conflict per shared memory instruction(%): 0.00
L1 cache read throughput(GB/s): 197.17
L1 cache global hit ratio (%): 51.23
Texture cache memory throughput(GB/s): 0.00
Texture cache hit rate(%): 0.00
L2 cache texture memory read throughput(GB/s): 0.00
L2 cache global memory read throughput(GB/s): 9.80
L2 cache global memory write throughput(GB/s): 6.80
L2 cache global memory throughput(GB/s): 16.60
Local memory bus traffic(%): 206.07
Peak global memory throughput(GB/s): 128.26
The following derived statistic(s) cannot be computed as required counters are not available:
Kernel requested global memory read throughput(GB/s)
Kernel requested global memory write throughput(GB/s)
Global memory excess load(%)
Global memory excess store(%)
Achieved global memory read throughput(GB/s)
Achieved global memory write throughput(GB/s)
Solution(s):
The issue with missing data was due to a too-low timeout value; certain early runs of the data would time out and the data not be written (and those error messages would get lost in the spam of later runs).
The 0.02 achieved occupancy was due to active_warps and active_cycles (and potentially other values as well) hitting maxint (2**32-1). Reducing the size of the input to the profiled script caused much more sane values to come out (including better/more realistic IPC and branching stats).
The hardware counters used by the Visual Profiler, Parallel Nsight, and the CUDA command line profiler are 32-bit counters and will overflow in 2^32 / shaderclock seconds (~5s). Some of the counters will overflow quicker than this. If you see values of MAX_INT or if your duration is in seconds then you are likely to see incorrect results in the tools.
I recommend splitting your kernel launch into 2 or more launches for profiling such that the duration of the launch is less than 1-2 seconds. In your case you have a Theoretical Occupancy of 67% (32 warps/SM) and a block size of 4 warps. When dividing work you want to make sure that each SM is fully loaded and preferably receives multiple waves of blocks. For each launch try launching NumSMs * MaxBlocksPerSM * 10 Blocks. For example, if you have a GTX560 which has 8 SMs and your reported configuration above you would break the single launch of 2500 blocks into 4 launches of 640, 640, 640, and 580.
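A rough sketch of that splitting (the kernel and its blockOffset parameter are hypothetical stand-ins for your kernel, and the grid is flattened to 1D for simplicity): each sub-launch covers a slice of the original 2500-block grid so that no single profiled launch runs longer than a second or two.

    #include <algorithm>
    #include <cuda_runtime.h>

    __global__ void myKernel(float *data, int blockOffset)
    {
        // Index as if this were one big launch of totalBlocks blocks.
        int block = blockOffset + blockIdx.x;
        int i = block * (blockDim.x * blockDim.y)
              + threadIdx.y * blockDim.x + threadIdx.x;
        data[i] += 1.0f;                          // placeholder work
    }

    void launchInChunks(float *d_data, int totalBlocks, int chunkBlocks)
    {
        // d_data must hold totalBlocks * 128 floats.
        dim3 block(32, 4, 1);                     // 128 threads, as in the question
        for (int first = 0; first < totalBlocks; first += chunkBlocks) {
            int blocks = std::min(chunkBlocks, totalBlocks - first);
            myKernel<<<blocks, block>>>(d_data, first);
            cudaDeviceSynchronize();              // keep each profiled launch separate
        }
    }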
Improved support for handling overflows should be in a future version of the tools.
Theoretical occupancy is the maximum number of warps you can execute on an SM divided by the device limit. Theoretical occupancy can be lower than the device limit based upon the kernel's use of threads per block, registers per thread, or shared memory per block.
Achieved occupancy is the measure of (active_warps / active_cycles) / max_warps_per_sm.
An achieved occupancy of .02 implies that only 1 warp is active on the SM. Given a launch of 10000 warps (2500 blocks * 128 threads / WARP_SIZE), this can only happen if you have extremely divergent code where all warps except for 1 immediately exit and 1 warp runs for a very long time. Also, it is highly unlikely that you could achieve an IPC of 1 with this achieved occupancy, so I suspect an error in the reported value.
If you would like help diagnosing the problem, I would suggest you
1. post your device information
2. verify that you launched <<<{25,100,1}, {32,4,1}>>>
3. post your code
If you cannot post your code, I would recommend capturing the counters active_cycles and active_warps and calculating achieved occupancy as
(active_warps / active_cycles) / 48
Given that you have errors in your profiler log it is possible that the results are invalid.
I believe from the output that you are using an older version of the Visual Profiler. You may want to consider updating to version 4.1, which improves collection of the PM counters and will also help provide hints on how to improve your code.
It seems like (a big part of) your issue here is:
Control flow divergence(%): 96.88
It sounds like 96.88 percent of the time, threads are not running the same instruction at the same time. The GPU can only really run the threads in parallel when each thread in a warp is running the same instruction at the same time. Things like if-else statements can cause some threads of a given warp to enter the if, and some threads to enter the else, causing divergence. What happens then is the GPU switches back and forth between executing each set of threads, causing each execution cycle to have a less than optimal occupancy.
To improve this, try to make sure that threads that will execute together in a warp (32 at a time in all NVIDIA cards today... I think) will all take the same path through the kernel code. Sometimes sorting the input data so that like data gets processed together works. Beyond that, adding a barrier at strategic places in the kernel code can help. If the threads of a warp are forced to diverge, a barrier will make sure that, after they reach common code again, they wait for each other to get there and then resume executing at full occupancy (for that warp). Just beware that a barrier must be hit by all threads, or you will cause a deadlock.
I can't promise this is your whole answer, but it seems to be a big problem for your code given the numbers listed in your question.
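For what it's worth, the kind of branch that produces numbers like that looks roughly like this (illustrative kernel, not the asker's code): the condition depends on the data, so neighbouring threads in a warp often disagree and the two paths get serialized.

    __global__ void divergent(const int *in, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (in[i] % 2 == 0)          // data-dependent: neighbours often differ
            out[i] = in[i] * 3;      // part of the warp takes this path...
        else
            out[i] = in[i] - 7;      // ...and idles while the rest runs this one
    }

Sorting or grouping the input so that elements taking the same path end up in the same warps (the suggestion above) lets most warps execute only one of the two branches.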

My GPU has 2 multiprocessors with 48 CUDA cores each. What does this mean?

My GPU has 2 multiprocessors with 48 CUDA cores each. Does this mean that I can execute 96 thread blocks in parallel?
No it doesn't.
From chapter 4 of the CUDA C programming guide:
The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits, as well as the amount of registers and shared memory available on the multiprocessor, are a function of the compute capability of the device and are given in Appendix F. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
Get the guide at: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
To check the limits for your specific device, compile and execute the deviceQuery example from the SDK.
So far the maximum number of resident blocks per multiprocessor is the same across all compute capabilities and is equal to 8.
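If it helps, here is a short host-side sketch along the lines of the deviceQuery sample (assuming a reasonably recent CUDA toolkit) that prints the per-SM limits for your particular device:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);        // device 0

        printf("multiprocessors (SMs):     %d\n", prop.multiProcessorCount);
        printf("max threads per block:     %d\n", prop.maxThreadsPerBlock);
        printf("max threads per SM:        %d\n", prop.maxThreadsPerMultiProcessor);
        printf("registers per SM:          %d\n", prop.regsPerMultiprocessor);
        printf("shared memory per SM (B):  %zu\n", prop.sharedMemPerMultiprocessor);
        return 0;
    }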
This comes down to semantics. What does "execute" and "running in parallel" really mean?
At a basic level, having 96 CUDA cores really means that you have a potential throughput of 96 results of calculations per cycle of the core clock.
A core is mainly an Arithmetic Logic Unit (ALU), it performs basic arithmetic and logical operations. Aside from access to an ALU, a thread needs other resources, such as registers, shared memory and global memory to run. The GPU will keep many threads "in flight" to keep all these resources utilized to the fullest. The number of threads "in flight" will typically be much higher than the number of cores. On one hand, these threads can be seen as being "executed in parallel" because they are all consuming resources on the GPU at the same time. But on the other hand, most of them are actually waiting for something, such as data to arrive from global memory or for results of arithmetic to go through the pipelines in the cores. The GPU puts threads that are waiting for something on the "back burner". They are consuming some resources, but are they actually running? :)
The number of concurrently executing threads depends on your code and on the type of your CUDA device. For example, Fermi has 2 warp schedulers per streaming multiprocessor, and on each clock cycle 2 half-warps can be scheduled for calculation, a memory load, or a transcendental function calculation. While one half-warp waits on a load or executes a transcendental function, the CUDA cores may execute something else. So you can get 96 threads running on the cores, but only if your code can actually achieve it. And, of course, you must have enough memory.