Instruction execution order by CUDA driver

The following piece of code
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
asm volatile("ld.global.ca.u64 data, [%0];"::"l"(po):"memory");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
looks like this in the SASS code
/*0420*/ CS2R R2, SR_CLOCKLO ;
/*0430*/ LDG.E.64.STRONG.CTA R4, [R4] ;
/*0440*/ CS2R R6, SR_CLOCKLO ;
I want to be sure that the scheduler issues the second CS2R after the LDG instruction and not earlier due to any optimization like out-of-order execution.
How can I be sure about that?
UPDATE:
Based on Greg's suggestion, I added a dependent instruction which looks like
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
asm volatile("ld.global.ca.u64 data, [%0];"::"l"(po):"memory");
asm volatile("add.u64 %0, data, %0;":"+l"(sink)::"memory");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
where uint64_t sink = 0; is defined. Still I see only one LDG between the CS2R instructions. I expected to also see an IADD instruction, since I am reading data again. I think I wrote the asm add instruction incorrectly, but I don't see what is wrong.
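One way to write the sequence so the dependency is visible to the compiler is to bind the loaded value to a real output operand rather than a bare symbol - a minimal sketch, assuming po and sink as above:
uint64_t start, stop, data, sink = 0;
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
// bind the load destination to an operand instead of the bare symbol "data"
asm volatile("ld.global.ca.u64 %0, [%1];" : "=l"(data) : "l"(po) : "memory");
// the add consumes the loaded value, forcing completion of the load
asm volatile("add.u64 %0, %1, %0;" : "+l"(sink) : "l"(data) : "memory");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");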

NVIDIA GPUs compute capability 1.0 - 7.x will issue instructions for a warp in order. The special purpose registers clock and clock64 can be used to time sections of code by reading the register before and after a sequence of instructions.
This can be useful to estimate the number of cycles that it took to issue a sequence of instructions for a single warp.
CASE 1 : Instruction Issue Latency
clock64 reads are inserted before and after a sequence of instructions. In the case below the clock64 reads wrap a single global load. This style estimates the instruction issue latency of the global load instruction. The warp can be stalled between the start and end CS2R, increasing the duration. Stall reasons can include the following:
- not_selected - the warp scheduler selected a higher priority warp
- no_instruction - LDG was on a new instruction cache line and the warp is stalled until the cache line is fetched
- mio_throttle - LDG instruction cannot be issued as the instruction queue for the Load Store Unit was full.
- lg_throttle - LDG instruction cannot be issued as the instruction queue for the Load Store Unit has reached a local/global watermark.
In order to increase accuracy it is recommended to measure a sequence of instructions as opposed to a single instruction.
PTX
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
asm volatile("ld.global.ca.u32 data, [%0];"::"l"(po):"memory");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
SASS (SM_70)
/*0420*/ CS2R R2, SR_CLOCKLO ;
/*0430*/ LDG.E.64.STRONG.CTA R4, [R4] ;
/*0440*/ CS2R R6, SR_CLOCKLO ;
CASE 2: Instruction Execution Latency
A clock64 read is inserted before the sequence of instructions. After the sequence, instructions that guarantee its completion are inserted, followed by a second clock64 read. In the case below an integer add that depends on the value from the global load is inserted before the last read. This technique can be used to estimate the execution duration of the global load.
PTX
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
asm volatile("ld.global.ca.u32 data, [%0];"::"l"(po):"memory");
asm volatile("add.u32 %0, data, %0;":"+l"(sink)::"memory");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
SASS (SM_70)
/*0420*/ CS2R R2, SR_CLOCKLO ;
/*0430*/ LDG.E.64.STRONG.CTA R4, [R4] ;
/*0440*/ IADD R4, R4, 1 ;
/*0450*/ CS2R R6, SR_CLOCKLO ;
DIAGRAM
The measurement period for Case 1 and Case 2 is shown in the waveform diagram. The diagram shows the CS2R and IADD instructions taking 4 cycles to execute. The CS2R instructions read the time on the 3rd cycle.
For Case 1 the measured time may be as small as 2 cycles.
For Case 2 the measured time includes the load from global memory. If the load hits in the L1 cache then the time is in the 20-50 cycle range; otherwise the time is likely greater than 200 cycles.
WARNING
In practice this type of instruction issue or instruction execution latency measurement is very hard to get right. These techniques can be used to write micro-benchmarks or to time large sequences of code. In the case of micro-benchmarks it is critical to understand and potentially isolate other factors such as warp scheduling, instruction cache misses, constant cache misses, etc.
The compiler does not treat a read of clock/clock64 as an instruction fence. The compiler is free to move the read to an unexpected location. It is recommended to always inspect the generated SASS code.
Compute Capability 6.0 and higher supports instruction level preemption. Instruction level preemption can produce unexpected measurement results.

Related

Is starting 1 thread per element always optimal for data independent problems on the GPU?

I was writing a simple memcpy kernel to measure the memory bandwidth of my GTX 760M and to compare it to cudaMemcpy(). It looks like this:
template<unsigned int THREADS_PER_BLOCK>
__global__ static
void copy(void* src, void* dest, unsigned int size) {
    using vector_type = int2;
    vector_type* src2 = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);
    // This copy kernel is only correct when size % sizeof(vector_type) == 0
    auto numElements = size / sizeof(vector_type);
    for (auto id = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; id < numElements; id += gridDim.x * THREADS_PER_BLOCK) {
        dest2[id] = src2[id];
    }
}
I also calculated the number of blocks required to reach 100% occupancy like so:
THREADS_PER_BLOCK = 256
Multi-Processors: 4
Max Threads per Multi Processor: 2048
NUM_BLOCKS = 4 * 2048 / 256 = 32
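For concreteness, a sketch of the two launch configurations being compared (assuming src, dest and size as in the kernel above; the block counts correspond to 400 * 10^6 bytes):
auto numElements = size / sizeof(int2);                 // 50,000,000 elements
unsigned int oneElemBlocks = (numElements + 255) / 256; // 195313 blocks, one element per thread
unsigned int occupancyBlocks = 4 * 2048 / 256;          // 32 blocks, "100% occupancy"
copy<256><<<oneElemBlocks, 256>>>(src, dest, size);     // grid-stride loop runs once per thread
copy<256><<<occupancyBlocks, 256>>>(src, dest, size);   // each thread loops over many elements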
My tests, on the other hand, showed that starting enough blocks so that each thread processes only one element always outperformed the "optimal" block count. Here are the timings for 400 MB of data:
bandwidth test by copying 400mb of data.
cudaMemcpy finished in 15.63ms. Bandwidth: 51.1838 GB/s
thrust::copy finished in 15.7218ms. Bandwidth: 50.8849 GB/s
my memcpy (195313 blocks) finished in 15.6208ms. Bandwidth: 51.2137 GB/s
my memcpy (32 blocks) finished in 16.8083ms. Bandwidth: 47.5956 GB/s
So my questions are:
Why is there a speed difference?
Are there any downsides of starting one thread per element, when each element can be processed completely independent of all other elements?
Is starting 1 thread per element always optimal for data independent problems on the GPU?
Not always. Let's consider 3 different implementations. In each case we'll assume we're dealing with a trivially parallelizable problem that involves one element load, some "work" and one element store per thread. In your copy example there is basically no work - just loads and stores.
One element per thread. Each thread is doing 1 element load, the work, and 1 store. The GPU likes to have a lot of exposed parallel-issue-capable instructions per thread available, in order to hide latency. Your example consists of one load and one store per thread, ignoring other instructions like index arithmetic, etc. In your example GPU, you have 4 SMs, and each is capable of a maximum complement of 2048 threads (true for nearly all GPUs today), so the maximum in-flight complement is 8192 threads. So at most, 8192 loads can be issued to the memory pipe, and then we're going to hit machine stalls until that data comes back from memory, so that the corresponding store instructions can be issued. In addition, for this case, we have overhead associated with retiring threadblocks and launching new threadblocks, since each block only handles 256 elements.
Multiple elements per thread, not known at compile time. In this case, we have a loop. The compiler does not know the loop extent at compile time, so it may or may not unroll the loop. If it does not unroll the loop, then we have a load followed by a store for each loop iteration. This doesn't give the compiler a good opportunity to reorder (independent) instructions, so the net effect may be the same as case 1, except that we have some additional overhead associated with processing the loop.
Multiple elements per thread, known at compile time. You haven't really provided this example, but it is often the best scenario. In the parallelforall blog matrix transpose example, the writer of that essentially-copy kernel chose to have each thread perform 8 elements of copy "work". The compiler then sees a loop:
LOOP: LD R0, in[idx];
ST out[idx], R0;
...
BRA LOOP;
which it can unroll (let's say) 8 times:
LD R0, in[idx];
ST out[idx], R0;
LD R0, in[idx+1];
ST out[idx+1], R0;
LD R0, in[idx+2];
ST out[idx+2], R0;
LD R0, in[idx+3];
ST out[idx+3], R0;
LD R0, in[idx+4];
ST out[idx+4], R0;
LD R0, in[idx+5];
ST out[idx+5], R0;
LD R0, in[idx+6];
ST out[idx+6], R0;
LD R0, in[idx+7];
ST out[idx+7], R0;
and after that it can reorder the instructions, since the operations are independent:
LD R0, in[idx];
LD R1, in[idx+1];
LD R2, in[idx+2];
LD R3, in[idx+3];
LD R4, in[idx+4];
LD R5, in[idx+5];
LD R6, in[idx+6];
LD R7, in[idx+7];
ST out[idx], R0;
ST out[idx+1], R1;
ST out[idx+2], R2;
ST out[idx+3], R3;
ST out[idx+4], R4;
ST out[idx+5], R5;
ST out[idx+6], R6;
ST out[idx+7], R7;
at the expense of some increased register pressure. The benefit here, as compared to the non-unrolled loop case, is that the first 8 LD instructions can all be issued - they are all independent. After issuing those, the thread will stall at the first ST instruction - until the corresponding data is actually returned from global memory. In the non-unrolled case, the machine can issue the first LD instruction, but immediately hits a dependent ST instruction, and so it may stall right there. The net of this is that in the first 2 scenarios, I was only able to have 8192 LD operations in flight to the memory subsystem, but in the 3rd case I was able to have 65536 LD instructions in flight. Does this provide a benefit? In some cases, it does. The benefit will vary depending on which GPU you are running on.
What we have done here, is effectively (working in conjunction with the compiler) increase the number of instructions that can be issued per thread, before the thread will hit a stall. This is also referred to as increasing the exposed parallelism, basically via ILP in this approach. Whether or not it has any benefit will vary depending on your actual code, your actual GPU, and what else is going in the GPU at that time. But it is always a good strategy to increase exposed parallelism using techniques such as this, because the ability to issue instructions is how the GPU hides the various forms of latency that it must deal with, so we have effectively improved the GPU's ability to hide latency, with this approach.
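A sketch of what case 3 could look like in CUDA (my illustration, not code from the question) - the per-thread element count is a compile-time constant, so the compiler can fully unroll the loop and batch the loads ahead of the stores:
template<unsigned int THREADS_PER_BLOCK, unsigned int ELEMS_PER_THREAD>
__global__ static
void copy_unrolled(int2* src, int2* dest, unsigned int numElements) {
    // consecutive threads touch consecutive elements on every iteration,
    // keeping the accesses coalesced
    unsigned int id = blockIdx.x * THREADS_PER_BLOCK * ELEMS_PER_THREAD + threadIdx.x;
    #pragma unroll
    for (unsigned int i = 0; i < ELEMS_PER_THREAD; ++i, id += THREADS_PER_BLOCK) {
        if (id < numElements)
            dest[id] = src[id];
    }
}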
Why is there a speed difference?
This can be difficult to answer without profiling the code carefully. However it's often the case that launching just enough threads to fully satisfy the instantaneous carrying capacity of the GPU is not a good strategy, possibly due to the "tail effect" or other types of inefficiency. It may also be the case that blocks are limited by some other factor, such as registers or shared memory usage. It's usually necessary to carefully profile as well as possibly study the generated machine code to fully answer such a question. But it may be that the loop overhead measurably impacts your comparison, which is basically my case 2 vs. my case 1 above.
(Note: the memory indices in my "pseudo" machine code example are not what you would expect for a well-written grid-striding copy loop - they are just for example purposes, to demonstrate unrolling and the benefit it can have via compiler instruction reordering.)
One-liner answer: When you have one thread per element, you pay the thread setup cost - at the very least, copying arguments from constant memory to registers - for every single element, and that's wasteful.

what is the current execution mode/exception level, etc?

I am new to ARMv8 architecture. I have following basic questions on my mind:
How do I know what the current execution mode is, AArch32 or AArch64? Should I read CPSR or SPSR to ascertain this?
What is the current exception level, EL0/1/2/3?
Once an exception comes, can I read any register to determine whether I am in an SError/Synchronous/IRQ/FIQ exception handler?
TIA.
The assembly instructions and their binary encoding are entirely different for 32 and 64 bit, so which mode you are in is something that you/the compiler already need to know during compilation; checking for it at runtime doesn't make sense. For C and C++ the check can be done at compile time (#ifdef) through compiler-provided macros like the ones provided by armclang: __aarch64__ for 64 bit, __arm__ for 32 bit.
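A minimal sketch of that compile-time check:
#if defined(__aarch64__)
    /* 64-bit (AArch64) code path */
#elif defined(__arm__)
    /* 32-bit (AArch32) code path */
#endif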
Depends on the execution mode:
aarch32: MRS <Rn>, CPSR reads the current state into register number n. Then extract bits 3:0, which contain the current mode.
aarch64: MRS <Xn>, CurrentEL reads the current EL into register number n.
Short answer: you can't. Long answer: the assumption is that, by the structure of the code and the state of any user-defined variables, you already know what you are doing, i.e. whether you came to a position in the code through regular code or through an exception.
aarch64 C code:
register uint64_t x0 __asm__ ("x0");
__asm__ ("mrs x0, CurrentEL;" : : : "%x0");
printf("EL = %" PRIu64 "\n", x0 >> 2);
arm C code:
register uint32_t r0 __asm__ ("r0");
__asm__ ("mrs r0, CPSR" : : : "%r0");
printf("EL = %" PRIu32 "\n", r0 & 0x1F);
CurrentEL, however, is not readable from EL0, as shown in the ARMv8 manual C5.2.1 "CurrentEL, Current Exception Level", section "Accessibility". Trying to run it in Linux userland raises SIGILL. You could catch that signal, however, I suppose...
CPSR is readable from EL0 however.
Tested on QEMU and gem5 with this setup.

What is the cause of undefined ARM Exceptions?

One question is: when an undefined instruction exception happens, do we need to get the currently executing instruction from R14_SVC or R14_UNDEF? Currently I am working on a problem where an undefined instruction exception happened. On checking R14_SVC I found instructions like the following:
0x46BFD73C cmp r0, #0x0
0x46BFD740 beq 0x46BFD75C
0x46BFD744 ldr r0,0x46BFE358
so my assumption is that the undefined instruction exception happened while executing the instruction beq 0x46BFD75C.
One thing that puzzles me is that when I checked R14_UNDEF the instructions were different:
0x46bfd4b8 bx r14
0x46bfd4bC mov r0, 0x01
0x46bfd4c0 bx r14
Which one caused the undefined instruction exception?
All of your answers are in the ARM ARM, the ARM Architecture Reference Manual. Go to infocenter.arm.com and under reference manuals find the architecture family you are interested in. The non-cortex-m series all handle these exceptions the same way.
When an Undefined Instruction exception occurs, the following actions are performed:
R14_und = address of next instruction after the Undefined instruction
SPSR_und = CPSR
CPSR[4:0] = 0b11011 /* Enter Undefined Instruction mode */
CPSR[5] = 0 /* Execute in ARM state */
/* CPSR[6] is unchanged */
CPSR[7] = 1 /* Disable normal interrupts */
/* CPSR[8] is unchanged */
CPSR[9] = CP15_reg1_EEbit
/* Endianness on exception entry */
if high vectors configured then
    PC = 0xFFFF0004
else
    PC = 0x00000004
R14_und points at the next instruction AFTER the undefined instruction. You have to examine SPSR_und to determine what mode the processor was in (ARM or Thumb) to know whether you need to subtract 2 or 4 from R14_und and whether you need to fetch 2 or 4 bytes. Unfortunately, if you are on a newer architecture that supports Thumb2, you may have to fetch 4 bytes even in Thumb mode and try to figure out what happened. Being variable word length, it is very possible to be in a situation where it is impossible to determine what happened. If you are not using Thumb2 instructions then it is deterministic.
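As a sketch (my code, following the rules above - the T bit, bit 5, of the saved CPSR indicates Thumb state), the address of the offending instruction can be recovered like this, assuming R14_und and SPSR_und have already been read into variables:
#include <stdint.h>

/* Recover the address of the undefined instruction from the saved state. */
uint32_t undef_fault_address(uint32_t r14_und, uint32_t spsr_und)
{
    int thumb = (spsr_und >> 5) & 1;  /* T bit: Thumb state at the fault */
    return r14_und - (thumb ? 2 : 4); /* Thumb: -2, ARM: -4 */
}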

Avoiding unnecessary mov operations in inline PTX

When writing PTX in a separate file, a kernel parameter can be loaded into a register with:
.reg .u32 test;
ld.param.u32 test, [test_param];
However, when using inline PTX, the Using Inline PTX Assembly in CUDA (version 01) application note describes a syntax where loading a parameter is closely linked to another operation. It provides this example:
asm("add.s32 %0, %1, %2;" : "=r"(i) : "r"(j), "r"(k));
Which generates:
ld.s32 r1, [j];
ld.s32 r2, [k];
add.s32 r3, r1, r2;
st.s32 [i], r3;
In many cases, it is necessary to separate the two operations. For instance, one might want to store the parameter in a register outside of a loop and then reuse and modify the register inside the loop. The only way I have found to do this is to use an extra mov instruction that moves the parameter from the register into which it was implicitly loaded to another register I can use later.
Is there a way to avoid this additional mov instruction when moving from PTX in a separate file to inline PTX?
If I were you I wouldn't worry too much about those mov operations.
Keep in mind that PTX is not the final assembly code.
PTX is further compiled into CUBIN before the kernel launch. Among others, this last step performs register allocation and will remove all unnecessary mov operations.
In particular, if you move from %r1 to %r2 and then never ever use %r1 at all, the algorithm is likely to assign %r1 and %r2 to the same hardware register and remove the move.
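As a hedged illustration (not from the application note), the pattern in question looks like this; after register allocation the explicit mov normally disappears from the SASS:
__device__ unsigned int twice(unsigned int x)
{
    unsigned int r;
    asm("mov.u32 %0, %1;" : "=r"(r) : "r"(x)); // copy the parameter into "our" register
    asm("add.u32 %0, %0, %0;" : "+r"(r));      // reuse and modify that register later
    return r; // ptxas typically assigns r and x the same hardware register
}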

clock() in opencl

I know that there is a function clock() in CUDA that you can put in kernel code to query the GPU time. But I wonder if such a thing exists in OpenCL? Is there any way to query the GPU time in OpenCL? (I'm using NVIDIA's toolkit.)
There is no OpenCL way to query clock cycles directly. However, OpenCL does have a profiling mechanism that exposes incremental counters on compute devices. By comparing the differences between ordered events, elapsed times can be measured. See clGetEventProfilingInfo.
Just for others coming here for help: a short introduction to profiling kernel runtime with OpenCL.
Enable profiling mode:
cmdQueue = clCreateCommandQueue(context, *devices, CL_QUEUE_PROFILING_ENABLE, &err);
Profiling kernel:
cl_event prof_event;
clEnqueueNDRangeKernel(cmdQueue, kernel, 1 , 0, globalWorkSize, NULL, 0, NULL, &prof_event);
Read profiling data in:
cl_ulong ev_start_time=(cl_ulong)0;
cl_ulong ev_end_time=(cl_ulong)0;
clFinish(cmdQueue);
err = clWaitForEvents(1, &prof_event);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &ev_start_time, NULL);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ev_end_time, NULL);
Calculate kernel execution time:
float run_time_gpu = (float)(ev_end_time - ev_start_time)/1000; // in usec
Profiling of individual work-items / work-groups is NOT possible yet.
You can set globalWorkSize = localWorkSize for profiling. Then you have only one work-group.
Btw: Profiling of a single work-item (or a few work-items) isn't very helpful. With only a few work-items you won't be able to hide memory latencies and the overhead, so the measurements aren't meaningful.
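For example, the single-work-group setup mentioned above would look like this (a sketch; the sizes are illustrative):
size_t globalWorkSize = 64;
size_t localWorkSize = 64; // globalWorkSize == localWorkSize => exactly one work-group
clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, &prof_event);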
Try this (only works with NVIDIA OpenCL, of course):
uint clock_time()
{
    uint clock_time;
    asm("mov.u32 %0, %%clock;" : "=r"(clock_time));
    return clock_time;
}
The NVIDIA OpenCL SDK has an example Using Inline PTX with OpenCL. The clock register is accessible through inline PTX as the special register %clock. %clock is described in PTX: Parallel Thread Execution ISA manual. You should be able to replace the %%laneid with %%clock.
I have never tested this with OpenCL but use it in CUDA.
Please be warned that the compiler may reorder or remove the register read.
On NVIDIA you can use the following:
typedef unsigned long uint64_t; // if you haven't done so earlier
inline uint64_t n_nv_Clock()
{
    uint64_t n_clock;
    asm volatile("mov.u64 %0, %%clock64;" : "=l" (n_clock)); // make sure the compiler will not reorder this
    return n_clock;
}
The volatile keyword tells the optimizer that you really mean it and don't want it moved / optimized away. This is a standard way of doing so both in PTX and e.g. in gcc.
Note that this returns clocks, not nanoseconds. You need to query the device clock frequency (using clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, 0)). Also note that on older devices there are two frequencies (or three if you count the memory frequency, which is irrelevant in this case): the device clock and the shader clock. What you want is the shader clock.
With the 64-bit version of the register you don't need to worry about overflowing as it generally takes hundreds of years. On the other hand, the 32-bit version can overflow quite often (you can still recover the result - unless it overflows twice).
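A sketch of the frequency query and the cycles-to-time conversion (variable names are illustrative; CL_DEVICE_MAX_CLOCK_FREQUENCY reports MHz, and as noted above it may not be the shader clock you actually want):
cl_uint freq_mhz = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq_mhz), &freq_mhz, 0);
// cycles / (freq_mhz * 1e6) seconds == cycles * 1000 / freq_mhz nanoseconds
double elapsed_ns = (double)(stop_clock - start_clock) * 1000.0 / freq_mhz;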
Now, 10 years after the question was posted, I did some tests on NVIDIA hardware. I tried running the answers given by the users 'Spectral' and 'the swine'. The answer given by 'Spectral' does not work; I always got the same invalid values returned by the clock_time function.
uint clock_time()
{
    uint clock_time;
    asm("mov.u32 %0, %%clock;" : "=r"(clock_time)); // this is wrong
    return clock_time;
}
After subtracting the start time from the end time I got zero.
So I had a look at the PTX assembly, which in PyOpenCL you can get this way:
kernel_string = """
your OpenCL code
"""
prg = cl.Program(ctx, kernel_string).build()
print(prg.binaries[0].decode())
It turned out that the clock command was optimized away! So there was no '%clock' instruction in the printed assembly.
Looking into Nvidia's PTX documentation I found the following:
'Normally any memory that is written to will be specified as an out operand, but if there is a hidden side effect on user memory (for example, indirect access of a memory location via an operand), or if you want to stop any memory optimizations around the asm() statement performed during generation of PTX, you can add a "memory" clobbers specification after a 3rd colon, e.g.:'
So the function that actually works is this:
uint clock_time()
{
    uint clock_time;
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(clock_time) :: "memory");
    return clock_time;
}
The assembly contained lines like:
// inline asm
mov.u32 %r13, %clock;
// inline asm
The version given by 'the swine' also works.