Which GPU execution dependencies have fixed latency (causing 'Wait' stalls)? - cuda

With recent NVIDIA micro-architectures, there's a new (?) taxonomy of warp stall reasons / warp scheduler states. One of these is:
Wait : Warp was stalled waiting on a fixed latency execution dependency.
As @GregSmith explains, fixed-latency instructions are: "Math, bitwise [and] register movement". But what are fixed-latency "execution dependencies"? Are these just "waiting for somebody else's fixed-latency instruction to conclude before we can issue it ourselves"?

Execution dependencies are dependencies that need to be resolved before the next instruction can be issued. These include register operands and predicates. The Wait stall reason is reported between instructions that have a fixed-latency dependency. The compiler can also choose to add additional waits between instructions issued to the same pipeline if that pipeline's issue frequency is not 1 warp per cycle (e.g. the FMA and ALU pipes can issue every other cycle on GV100 - GA100).
EXAMPLE 1 - No dependencies - compiler added waits
IADD R0, R1, R2;    // R0 = R1 + R2
// stall = wait for 1 additional cycle
IADD R4, R5, R6;    // R4 = R5 + R6
// stall = wait for 1 additional cycle
IADD R8, R9, R10;   // R8 = R9 + R10
If the compiler did not add wait cycles then the stall reason would be math_throttle. This can also show up if the warp is ready to issue the instruction (all dependencies resolved) and another warp is issuing an instruction to the target pipeline.
EXAMPLE 2 - Wait stalls due to read after write dependency
IADD R0, R1, R2;    // R0 = R1 + R2
// stall - wait a fixed number of cycles to clear the read-after-write dependency
IADD R0, R0, R3;    // R0 += R3
// stall - wait a fixed number of cycles to clear the read-after-write dependency
IADD R0, R0, R4;    // R0 += R4
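For illustration, here is a minimal CUDA source sketch (the kernel name and parameters are made up for this example) that produces a dependent chain like EXAMPLE 2. Each add reads the result of the previous one, so consecutive instructions must be separated by the fixed ALU latency; the exact SASS (IADD3, IMAD.IADD, or a partially combined form) depends on the architecture and compiler.
__global__ void dependent_chain(int* out, int a, int b, int c, int d)
{
    // each add consumes the result of the previous add, so the warp sits in
    // the Wait state between issues of these fixed-latency instructions
    int r = a + b;
    r = r + c;
    r = r + d;
    // store the result so the chain is not optimized away
    out[threadIdx.x] = r;
}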

Is there a way to run a thread in USER mode for azure-rtos (threadx)?

I have been playing around with azure-rtos (ThreadX) and trying to port the OS to a Cortex-R5-based system. After looking at the port files, it seems that the OS runs threads in Supervisor (SVC) mode.
For example, in the function _tx_thread_stack_build, while building the stack for a thread, the initialization value for the CPSR is such that the mode bits correspond to SVC mode. This initialization value is later used to initialize the CPSR before jumping to the thread entry function.
Following is a snippet of _tx_thread_stack_build storing the initial CPSR value on a thread's stack. For reference, see the file tx_thread_stack_build.S.
.global _tx_thread_stack_build
.type _tx_thread_stack_build,function
_tx_thread_stack_build:
# Stack Bottom: (higher memory address)
#
...
MRS r1, CPSR # Pickup CPSR
BIC r1, r1, #CPSR_MASK # Mask mode bits of CPSR
ORR r3, r1, #SVC_MODE # Build CPSR, SVC mode, interrupts enabled
STR r3, [r2, #4] # Store initial CPSR
...
To give another example, the code in tx_thread_context_restore.S switches from IRQ mode to SVC mode to save the context of the thread being switched out, which indicates that the OS assumes the thread runs in SVC mode. For reference, see the file tx_thread_context_restore.S.
Following is a snippet of that function saving the context of the thread being switched out.
LDMIA sp!, {r3, r10, r12, lr} ; Recover temporarily saved registers
MOV r1, lr ; Save lr (point of interrupt)
MOV r2, #SVC_MODE ; Build SVC mode CPSR
MSR CPSR_c, r2 ; Enter SVC mode
STR r1, [sp, #-4]! ; Save point of interrupt
STMDB sp!, {r4-r12, lr} ; Save upper half of registers
MOV r4, r3 ; Save SPSR in r4
MOV r2, #IRQ_MODE ; Build IRQ mode CPSR
MSR CPSR_c, r2 ; Enter IRQ mode
LDMIA sp!, {r0-r3} ; Recover r0-r3
MOV r5, #SVC_MODE ; Build SVC mode CPSR
MSR CPSR_c, r5 ; Enter SVC mode
STMDB sp!, {r0-r3} ; Save r0-r3 on thread's stack
This leads me to a question: is there a way to run threads in USER mode? It is typically the case in an OS that threads run in USER mode while the kernel and the services it provides run in SVC mode, which does not seem to be the case with Azure RTOS.
This is by design: ThreadX is a small monolithic kernel where application code is tightly integrated with the kernel and lives in the same address space and mode. This allows for greater performance and a lower footprint. You can also use ThreadX Modules, where the available MPU or MMU is used to separate kernel and user code into different modes and provide additional protection, but this incurs a small performance and footprint penalty.

Instruction execution order by cuda driver

The following piece of code
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
asm volatile("ld.global.ca.u64 data, [%0];"::"l"(po):"memory");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
looks like this in the SASS code
/*0420*/ CS2R R2, SR_CLOCKLO ;
/*0430*/ LDG.E.64.STRONG.CTA R4, [R4] ;
/*0440*/ CS2R R6, SR_CLOCKLO ;
I want to be sure that the scheduler issues the second CS2R after the LDG instruction and not earlier due to any optimization like out-of-order execution.
How can I be sure about that?
UPDATE:
Based on Greg's suggestion, I added a dependent instruction which looks like
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
asm volatile("ld.global.ca.u64 data, [%0];"::"l"(po):"memory");
asm volatile("add.u64 %0, data, %0;":"+l"(sink)::"memory");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
where uint64_t sink = 0; is defined. Still, I see only one LDG between the CS2R instructions. I expected to also see an IADD instruction, since I am reading data again. I think I wrote the asm add instruction incorrectly, but I don't know what is wrong.
NVIDIA GPUs of compute capability 1.0 - 7.x issue instructions for a warp in order. The special-purpose registers clock and clock64 can be used to time sections of code by reading the register before and after a sequence of instructions.
This can be useful to estimate the number of cycles that it took to issue a sequence of instructions for a single warp.
CASE 1 : Instruction Issue Latency
clock64 reads are inserted before and after a sequence of instructions. In the case below the clock64 reads wrap a single global load. This style estimates the instruction issue latency of the global load instruction. The warp can be stalled between the start and end CS2R, increasing the duration. Stall reasons can include the following:
- not_selected - the warp scheduler selected a higher priority warp
- no_instruction - LDG was on a new instruction cache line and the warp is stalled until the cache line is fetched
- mio_throttle - the LDG instruction cannot be issued as the instruction queue for the Load Store Unit is full.
- lg_throttle - the LDG instruction cannot be issued as the instruction queue for the Load Store Unit has reached a local/global watermark.
In order to increase accuracy it is recommended to measure a sequence of instructions as opposed to a single instruction.
PTX
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
asm volatile("ld.global.ca.u32 data, [%0];"::"l"(po):"memory");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
SASS (SM_70)
/*0420*/ CS2R R2, SR_CLOCKLO ;
/*0430*/ LDG.E.64.STRONG.CTA R4, [R4] ;
/*0440*/ CS2R R6, SR_CLOCKLO ;
CASE 2: Instruction Execution Latency
A clock64 read is inserted before a sequence of instructions. A set of instructions that guarantees completion of the sequence, followed by a clock64 read, is inserted after the sequence. In the case below an integer add that depends on the value from the global load is inserted before the last read. This technique can be used to estimate the execution duration of the global load.
PTX
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
asm volatile("ld.global.ca.u32 data, [%0];"::"l"(po):"memory");
asm volatile("add.u32 %0, data, %0;":"+l"(sink)::"memory");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
SASS (SM_70)
/*0420*/ CS2R R2, SR_CLOCKLO ;
/*0430*/ LDG.E.64.STRONG.CTA R4, [R4] ;
/*0440*/ IADD R4, R4, 1 ;
/*0450*/ CS2R R6, SR_CLOCKLO ;
DIAGRAM
The measurement period for Case 1 and Case 2 is shown in the waveform diagram. The diagram shows the CS2R and IADD instructions taking 4 cycles to execute. The CS2R instructions read the time on the 3rd cycle.
For Case 1 the measured time may be as small as 2 cycles.
For Case 2 the measured time includes the load from global memory. If the load hits in the L1 cache then the time is likely in the 20-50 cycle range, else the time is likely greater than 200 cycles.
WARNING
In practice this type of instruction issue or instruction execution latency measurement is very hard to get right. These techniques can be used to write micro-benchmarks or to time large sequences of code. In the case of micro-benchmarks it is critical to understand and potentially isolate other factors such as warp scheduling, instruction cache misses, constant cache misses, etc.
The compiler does not treat a read of clock/clock64 as an instruction fence. The compiler is free to move the read to an unexpected location. It is recommended to always inspect the generated SASS code.
Compute capability 6.0 and higher supports instruction-level preemption. Instruction-level preemption can lead to unexpected timing results.
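As a hedged illustration of the Case 2 pattern, the sketch below uses the clock64() intrinsic instead of inline PTX; the kernel and variable names are made up for this example. Per the WARNING above, the compiler is free to move the clock reads or the load, so the generated SASS must be inspected to confirm the CS2R / LDG / IADD ordering before trusting the numbers.
__global__ void time_load(const unsigned int* po, unsigned int* sink_out, long long* elapsed)
{
    long long start = clock64();
    unsigned int data = *po;        // global load to be timed
    unsigned int sink = data + 1u;  // dependent add forces the load to complete
    long long stop = clock64();
    *sink_out = sink;               // keep the result live so nothing is optimized away
    *elapsed = stop - start;        // issue + execution estimate, in clock cycles
}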

Combined format of SASS instructions

I haven't seen a CUDA document that describes the combined form of SASS instructions. For example, I know what IADD and IMAD are. But
IMAD.IADD R8, R8, 0x1, R7 ;
is not clear. Which operand belongs to which opcode? How is it executed? Moreover, are we dealing with one ADD and one MAD, which would mean two ADDs and one MUL? Or is it considered one MAD, which means one ADD and one MUL?
How about IMAD.MOV.U32 R5, RZ, RZ, 0x0 ;? How is that interpreted?
The Volta and Turing architectures have two primary execution pipes.
FMA pipe is responsible for FFMA, FMUL, FADD, FSWZADD, and IMAD instructions.
ALU pipe is responsible for integer (except IMAD), bit manipulation, logical, and data movement instructions.
The ALU pipe executes MOV and IADD3.
The FMA pipe executes IMAD including variants IMAD.IADD and IMAD.MOV.
Using IMAD to emulate IADD and MOV allows the compiler to explicitly schedule instructions to FMA pipe instead of the ALU pipe.
What's clear from the compiler output is that the compiler is emulating plain integer adds and raw moves with IMAD, which generalizes both. The suffix is just the disassembler being nice: it matches the pattern and tells you the operation is semantically equivalent to a simpler one. The IMAD.* sequences cleverly use RZ (the zero register), 0x0 and 0x1 to accomplish this. When the disassembler sees such a pattern, it adds the .MOV op suffix to say, "Hey, this is just a simple move."
E.g.
IMAD.IADD R8, R8, 0x1, R7
is:
R8 = 1*R8 + R7 = R8 + R7
IADD R8, R8, R7
(If IADD existed.)
Similarly for the MOV case, you see that it's using RZ. It's emulating the following.
MOV R5, 0x0
There is a MOV op in Volta, but I almost never see it.
(There's also a left-shift-by-K version IMAD.SHL I think, which uses a multiplier of 2^K where K is the shift amount.)
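As a hedged illustration (the file and kernel names here are made up), a plain integer add in CUDA source can come out either way on Volta/Turing, and you can check which pipe the compiler picked by dumping the SASS:
// add_pipes.cu -- compile with:  nvcc -arch=sm_70 -cubin add_pipes.cu
// then inspect the SASS with:    cuobjdump -sass add_pipes.cubin
__global__ void add_kernel(int* out, const int* a, const int* b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // may be emitted as IADD3 (ALU pipe) or as IMAD.IADD Rd, Ra, 0x1, Rb
    // (FMA pipe), at the compiler's discretion when balancing the two pipes
    out[i] = a[i] + b[i];
}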

Is starting 1 thread per element always optimal for data independent problems on the GPU?

I was writing a simple memcpy kernel to measure the memory bandwidth of my GTX 760M and to compare it to cudaMemcpy(). It looks like this:
template<unsigned int THREADS_PER_BLOCK>
__global__ static
void copy(void* src, void* dest, unsigned int size) {
    using vector_type = int2;
    vector_type* src2 = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);
    // This copy kernel is only correct when size % sizeof(vector_type) == 0
    auto numElements = size / sizeof(vector_type);
    for (auto id = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; id < numElements; id += gridDim.x * THREADS_PER_BLOCK) {
        dest2[id] = src2[id];
    }
}
I also calculated the number of blocks required to reach 100% occupancy like so:
THREADS_PER_BLOCK = 256
Multi-Processors: 4
Max Threads per Multi Processor: 2048
NUM_BLOCKS = 4 * 2048 / 256 = 32
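For concreteness, here is a hedged host-side sketch of the two launch configurations being compared; the buffer names and helper function are illustrative, error handling is omitted, and "400mb" is taken as 400 * 1000 * 1000 bytes, which matches the 195313-block count in the timings below.
void run_copy_benchmark() {
    constexpr unsigned int THREADS_PER_BLOCK = 256;
    unsigned int size = 400 * 1000 * 1000;     // "400mb" of data
    void *d_src, *d_dest;
    cudaMalloc(&d_src, size);
    cudaMalloc(&d_dest, size);
    auto numElements = size / sizeof(int2);    // 50,000,000 int2 elements
    auto blocksForOccupancy = 4 * 2048 / THREADS_PER_BLOCK;                              // 32 blocks
    auto blocksOnePerThread = (numElements + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK; // 195313 blocks
    copy<THREADS_PER_BLOCK><<<blocksOnePerThread, THREADS_PER_BLOCK>>>(d_src, d_dest, size);
    copy<THREADS_PER_BLOCK><<<blocksForOccupancy, THREADS_PER_BLOCK>>>(d_src, d_dest, size);
    cudaDeviceSynchronize();
    cudaFree(d_src);
    cudaFree(d_dest);
}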
My tests, on the other hand, showed that starting enough blocks so that each thread only processes one element always outperformed the "optimal" block count. Here are the timings for 400 MB of data:
bandwidth test by copying 400mb of data.
cudaMemcpy finished in 15.63ms. Bandwidth: 51.1838 GB/s
thrust::copy finished in 15.7218ms. Bandwidth: 50.8849 GB/s
my memcpy (195313 blocks) finished in 15.6208ms. Bandwidth: 51.2137 GB/s
my memcpy (32 blocks) finished in 16.8083ms. Bandwidth: 47.5956 GB/s
So my questions are:
Why is there a speed difference?
Are there any downsides of starting one thread per element, when each element can be processed completely independent of all other elements?
Is starting 1 thread per element always optimal for data independent problems on the GPU?
Not always. Let's consider 3 different implementations. In each case we'll assume we're dealing with a trivially parallelizable problem that involves one element load, some "work" and one element store per thread. In your copy example there is basically no work - just loads and stores.
One element per thread. Each thread is doing 1 element load, the work, and 1 store. The GPU likes to have a lot of exposed parallel-issue-capable instructions per thread available, in order to hide latency. Your example consists of one load and one store per thread, ignoring other instructions like index arithmetic, etc. In your example GPU, you have 4 SMs, and each is capable of a maximum complement of 2048 threads (true for nearly all GPUs today), so the maximum in-flight complement is 8192 threads. So at most, 8192 loads can be issued to the memory pipe, then we're going to hit machine stalls until that data comes back from memory, so that the corresponding store instructions can be issued. In addition, for this case, we have overhead associated with retiring threadblocks and launching new threadblocks, since each block only handles 256 elements.
Multiple elements per thread, not known at compile time. In this case, we have a loop. The compiler does not know the loop extent at compile time, so it may or may not unroll the loop. If it does not unroll the loop, then we have a load followed by a store for each loop iteration. This doesn't give the compiler a good opportunity to reorder (independent) instructions, so the net effect may be the same as case 1, except that we have some additional overhead associated with processing the loop.
Multiple elements per thread, known at compile time. You haven't really provided this example, but it is often the best scenario. In the parallelforall blog matrix transpose example, the writer of that (essentially copy) kernel chose to have each thread perform 8 elements of copy "work". The compiler then sees a loop:
LOOP: LD R0, in[idx];
ST out[idx], R0;
...
BRA LOOP;
which it can unroll (let's say) 8 times:
LD R0, in[idx];
ST out[idx], R0;
LD R0, in[idx+1];
ST out[idx+1], R0;
LD R0, in[idx+2];
ST out[idx+2], R0;
LD R0, in[idx+3];
ST out[idx+3], R0;
LD R0, in[idx+4];
ST out[idx+4], R0;
LD R0, in[idx+5];
ST out[idx+5], R0;
LD R0, in[idx+6];
ST out[idx+6], R0;
LD R0, in[idx+7];
ST out[idx+7], R0;
and after that it can reorder the instructions, since the operations are independent:
LD R0, in[idx];
LD R1, in[idx+1];
LD R2, in[idx+2];
LD R3, in[idx+3];
LD R4, in[idx+4];
LD R5, in[idx+5];
LD R6, in[idx+6];
LD R7, in[idx+7];
ST out[idx], R0;
ST out[idx+1], R1;
ST out[idx+2], R2;
ST out[idx+3], R3;
ST out[idx+4], R4;
ST out[idx+5], R5;
ST out[idx+6], R6;
ST out[idx+7], R7;
at the expense of some increased register pressure. The benefit here, as compared to the non-unrolled loop case, is that the first 8 LD instructions can all be issued - they are all independent. After issuing those, the thread will stall at the first ST instruction - until the corresponding data is actually returned from global memory. In the non-unrolled case, the machine can issue the first LD instruction, but immediately hits a dependent ST instruction, and so it may stall right there. The net of this is that in the first 2 scenarios, I was only able to have 8192 LD operations in flight to the memory subsystem, but in the 3rd case I was able to have 65536 LD instructions in flight. Does this provide a benefit? In some cases, it does. The benefit will vary depending on which GPU you are running on.
What we have done here is effectively (working in conjunction with the compiler) increase the number of instructions that can be issued per thread before the thread will hit a stall. This is also referred to as increasing the exposed parallelism, basically via ILP in this approach. Whether or not it has any benefit will vary depending on your actual code, your actual GPU, and what else is going on in the GPU at that time. But it is always a good strategy to increase exposed parallelism using techniques such as this, because the ability to issue instructions is how the GPU hides the various forms of latency that it must deal with, so we have effectively improved the GPU's ability to hide latency with this approach.
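To make case 3 concrete for the copy kernel in the question, here is a hedged sketch; the ELEMENTS_PER_THREAD parameter and the kernel name are additions for illustration, not from the original post. The inner loop trip count is known at compile time, so the compiler can unroll it and batch the independent loads as described above.
template<unsigned int THREADS_PER_BLOCK, unsigned int ELEMENTS_PER_THREAD>
__global__ static
void copy_unrolled(void* src, void* dest, unsigned int size) {
    using vector_type = int2;
    vector_type* src2 = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);
    // only correct when size is a multiple of
    // sizeof(vector_type) * THREADS_PER_BLOCK * ELEMENTS_PER_THREAD
    auto numElements = size / sizeof(vector_type);
    constexpr auto TILE = THREADS_PER_BLOCK * ELEMENTS_PER_THREAD;
    for (auto base = blockIdx.x * TILE + threadIdx.x; base < numElements; base += gridDim.x * TILE) {
        #pragma unroll
        for (unsigned int i = 0; i < ELEMENTS_PER_THREAD; ++i) {
            // consecutive threads touch consecutive elements in each unrolled
            // iteration, so accesses stay coalesced and the loads are independent
            dest2[base + i * THREADS_PER_BLOCK] = src2[base + i * THREADS_PER_BLOCK];
        }
    }
}
Launched as, e.g., copy_unrolled<256, 8><<<32, 256>>>(src, dest, size) (the grid-stride loop handles any grid size), each thread now exposes 8 independent loads per iteration, which is the ILP effect described above.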
Why is there a speed difference?
This can be difficult to answer without profiling the code carefully. However it's often the case that launching just enough threads to fully satisfy the instantaneous carrying capacity of the GPU is not a good strategy, possibly due to the "tail effect" or other types of inefficiency. It may also be the case that blocks are limited by some other factor, such as registers or shared memory usage. It's usually necessary to carefully profile as well as possibly study the generated machine code to fully answer such a question. But it may be that the loop overhead measurably impacts your comparison, which is basically my case 2 vs. my case 1 above.
(note the memory indices in my "pseudo" machine code example are not what you would expect for a well written grid-striding copy loop - they are just for example purposes to demonstrate unrolling and the benefit it can have via compiler instruction reordering).
One-liner answer: when you have one thread per element, you pay the thread setup cost - at the very least, copying the kernel arguments from constant memory to registers - for every single element, and that's wasteful.

How do I get the CC 2.0 and 3.0 compilers to generate FMA instructions?

I'm attempting to run a performance test by generating a series of FMA instructions. However, I can't seem to get the CC 2.0 and CC 3.0 compilers to generate FMA instructions.
If I compile:
for (float x = 0; x < loop; x++) {
    a += x * loop;
    a += x * loop;
    ... (6 more repetitions)
}
Where loop is also a float, I get the following for each line of a += x * loop;:
compute_10,sm_10:
a += x * loop;
0x0001ffa0 [0103] mov.f32 %f11, %f2;
0x0001ffa0 MOV R3, R2;
0x0001ffa8 [0104] ld.param.f32 %f12, [__cudaparm__Z6kernelPfifS__loop];
0x0001ffa8 MOV32I R2, 0x28;
0x0001ffb0 LDC R2, c[0x0][R2];
0x0001ffb8 [0105] mov.f32 %f13, %f4;
0x0001ffb8 MOV R0, R0;
0x0001ffc0 [0106] mad.f32 %f14, %f12, %f13, %f11;
0x0001ffc0 FFMA.FTZ R2, R2, R0, R3;
0x0001ffc8 [0107] mov.f32 %f2, %f14;
0x0001ffc8 MOV R2, R2;
compute_30,sm_30:
a += x * loop;
0x00044688 [0101] mul.f32 %f14, %f30, %f7;
0x00044688 FMUL R5, R4, R0;
0x00044690 [0102] add.f32 %f15, %f13, %f14;
0x00044690 FADD R3, R3, R5;
That is, when compiling for CC 3.0, I get FMUL/FADD instructions instead of FFMA. When compiling for CC 1.0, I get an FFMA instruction.
I also get this result on a CC 2.0 compiler with compute_20,sm_20, and on both release and debug builds.
I have tried to specify -use_fast_math and --fmad=true. I created the projects with the CUDA 4.2 and 5.0 wizards and made no changes in the default settings.
Environments:
Windows 7 64-bit
Visual Studio 2010
CUDA 4.2 + CUDA 5.0 (5.0 installed on top of 4.2)
GPU: Single GTX660
Nsight 3.0 RC1
and
Windows 7 64-bit
Visual Studio 2010
CUDA 4.2
Nsight 2.2
GPU: Single GTX570
Passing the -G switch to nvcc affects code generation and also causes additional debug info (symbols) to be added to the output file. According to the nvcc documentation, the description of the -G switch is not "generate device debug info" but rather "generate debug-able device code".
There will be many instances where using the -G switch causes substantially different device code generation. In this case it appears to inhibit generation of FMA instructions in favor of separate MUL/ADD sequences.
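As a hedged way to verify this (the file and kernel names below are made up; the flags are standard nvcc options), build the same multiply-add with and without -G and compare the SASS:
// fma_test.cu
// release build:  nvcc -arch=sm_30 -cubin -o fma_test.cubin   fma_test.cu
// debug build:    nvcc -arch=sm_30 -cubin -G -o fma_test_g.cubin fma_test.cu
// inspect SASS:   cuobjdump -sass fma_test.cubin    (look for FFMA vs FMUL/FADD)
__global__ void fma_test(float* a, float loop)
{
    float acc = *a;
    for (float x = 0; x < loop; x++) {
        // with the default --fmad=true and no -G, the compiler may contract this
        // multiply-add into a single FFMA; with -G it tends to stay FMUL + FADD
        acc += x * loop;
    }
    *a = acc;
}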