When writing PTX in a separate file, a kernel parameter can be loaded into a register with:
.reg .u32 test;
ld.param.u32 test, [test_param];
However, when using inline PTX, the Using Inline PTX Assembly in CUDA (version 01) application note describes a syntax where loading a parameter is closely linked to another operation. It provides this example:
asm("add.s32 %0, %1, %2;" : "=r"(i) : "r"(j), "r"(k));
Which generates:
ld.s32 r1, [j];
ld.s32 r2, [k];
add.s32 r3, r1, r2;
st.s32 [i], r3;
In many cases, it is necessary to separate the two operations. For instance, one might want to store the parameter in a register outside of a loop and then reuse and modify the register inside a loop. The only way I have found to do this is to use an extra mov instruction, to move the parameter from the register to which it was implicitly loaded, to another register I can use later.
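For illustration, the workaround looks roughly like this (a minimal sketch with made-up names, using the brace-scoping pattern from the same application note; it is not code from the original question):

__global__ void kernel(unsigned int param, unsigned int* out) {
    unsigned int result;
    asm("{\n\t"
        ".reg .u32 t;\n\t"
        "mov.u32 t, %1;\n\t"   // the extra mov: copy the implicitly
                               // loaded parameter into our own register
        "mov.u32 %0, t;\n\t"   // later uses of t would go here
        "}" : "=r"(result) : "r"(param));
    *out = result;
}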
Is there a way to avoid this additional mov instruction when moving from PTX in a separate file to inline PTX?
If I were you I wouldn't worry too much about those mov operations.
Keep in mind that PTX is not the final assembly code.
PTX is further compiled into CUBIN before the kernel launch. Among other things, this last step performs register allocation and will remove all unnecessary mov operations.
In particular, if you move from %r1 to %r2 and then never ever use %r1 at all, the algorithm is likely to assign %r1 and %r2 to the same hardware register and remove the move.
Related
The following piece of code
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
asm volatile("ld.global.ca.u64 data, [%0];"::"l"(po):"memory");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
looks like this in the SASS code
/*0420*/ CS2R R2, SR_CLOCKLO ; /* 0x0000000000027805 */
/*0430*/ LDG.E.64.STRONG.CTA R4, [R4] ; /* 0x0000000004047381 */
/*0440*/ CS2R R6, SR_CLOCKLO ;
I want to be sure that the scheduler issues the second CS2R after the LDG instruction and not earlier due to any optimization such as out-of-order execution.
How can I be sure about that?
UPDATE:
Based on Greg's suggestion, I added a dependent instruction which looks like
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
asm volatile("ld.global.ca.u64 data, [%0];"::"l"(po):"memory");
asm volatile("add.u64 %0, data, %0;":"+l"(sink)::"memory");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
where uint64_t sink = 0; is defined. Still, I see only one LDG between the CS2R instructions. I expected to also see an IADD instruction, since I am reading data again. I think I wrote the asm add instruction incorrectly, but I don't see what is wrong.
NVIDIA GPUs of compute capability 1.0 through 7.x issue instructions for a warp in order. The special-purpose registers clock and clock64 can be used to time sections of code by reading the register before and after a sequence of instructions.
This can be useful to estimate the number of cycles that it took to issue a sequence of instructions for a single warp.
CASE 1: Instruction Issue Latency
clock64 reads are inserted before and after a sequence of instructions. In the case below, clock64 reads wrap a single global load. This style estimates the instruction issue latency of the global load instruction. The warp can be stalled between the start and end CS2R, increasing the duration. Stall reasons can include the following:
- not_selected - the warp scheduler selected a higher priority warp
- no_instruction - LDG was on a new instruction cache line and the warp is stalled until the cache line is fetched
- mio_throttle - LDG instruction cannot be issued because the instruction queue for the Load Store Unit is full.
- lg_throttle - LDG instruction cannot be issued as the instruction queue for the Load Store Unit has reached a local/global watermark.
In order to increase accuracy it is recommended to measure a sequence of instructions as opposed to a single instruction.
PTX
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
asm volatile("ld.global.ca.u32 data, [%0];"::"l"(po):"memory");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
SASS (SM_70)
/*0420*/ CS2R R2, SR_CLOCKLO ;
/*0430*/ LDG.E.64.STRONG.CTA R4, [R4] ;
/*0440*/ CS2R R6, SR_CLOCKLO ;
CASE 2: Instruction Execution Latency
A clock64 read is inserted before a sequence of instructions. After the sequence, a set of instructions that guarantees completion of the sequence is inserted, followed by a second clock64 read. In the case below, an integer add that depends on the value from the global load is inserted before the last read. This technique can be used to estimate the execution duration of the global load.
PTX
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
asm volatile("ld.global.ca.u32 data, [%0];"::"l"(po):"memory");
asm volatile("add.u32 %0, data, %0;":"+l"(sink)::"memory");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
SASS (SM_70)
/*0420*/ CS2R R2, SR_CLOCKLO ;
/*0430*/ LDG.E.64.STRONG.CTA R4, [R4] ;
/*0440*/ IADD R4, R4, 1 ;
/*0450*/ CS2R R6, SR_CLOCKLO ;
DIAGRAM
The measurement period for Case 1 and Case 2 is shown in a waveform diagram (not reproduced here). The diagram shows the CS2R and IADD instructions taking 4 cycles to execute. The CS2R instructions read the time on the 3rd cycle.
For Case 1 the measured time may be as small as 2 cycles.
For Case 2 the measured time includes the load from global memory. If the load hits in the L1 cache then the time is in the 20-50 cycle range; otherwise the time is likely greater than 200 cycles.
WARNING
In practice, this type of instruction-issue or instruction-execution latency measurement is very hard to implement correctly. These techniques can be used to write micro-benchmarks or to time large sequences of code. In the case of micro-benchmarks it is critical to understand, and potentially isolate, other factors such as warp scheduling, instruction cache misses, constant cache misses, etc.
The compiler does not treat a read of clock/clock64 as an instruction fence. The compiler is free to move the read to an unexpected location. It is recommended to always inspect the generated SASS code.
Compute Capability 6.0 and higher supports instruction-level preemption. Instruction-level preemption can lead to unexpected measurement results.
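Putting the Case 2 recipe together, a self-contained sketch might look like this (names and structure are mine, not from the answer; binding data as a proper output operand is a safer pattern than referencing a bare PTX register across asm statements, and the generated SASS should still be inspected, per the warning above):

#include <cstdint>

__global__ void time_load(const uint64_t* po, uint64_t* out) {
    uint64_t start, stop, data, sink = 0;
    asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
    asm volatile("ld.global.ca.u64 %0, [%1];" : "=l"(data) : "l"(po) : "memory");
    asm volatile("add.u64 %0, %1, %0;" : "+l"(sink) : "l"(data) : "memory"); // dependent add
    asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
    out[0] = stop - start;
    out[1] = sink;   // keep sink live so the add is not eliminated
}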
I am trying to learn assembly through the guide from azeria-labs.com
I have a question about the use of the LR register and the PC register in the epilogue of non-leaf and leaf functions.
In the snippet below they show the difference for the epilogue in these functions.
If I write a program in C and look at it in GDB, it will always use "pop {r11, pc}" for a non-leaf function and "pop {r11}; bx lr" for a leaf function. Can anybody tell me why this is?
When I am in a leaf function, does it make a difference whether I use "bx lr" or "pop pc" to go back to the parent function?
/* An epilogue of a leaf function */
pop {r11}
bx lr
/* An epilogue of a non-leaf function */
pop {r11, pc}
I am trying to learn assembly
I have a question about the use of the LR register and the PC register in the epilogue of non-leaf and leaf functions.
This is part of the beauty and pain of assembler: there are no rules for the use of anything. It is up to you to decide what is needed. Please see ARM Link and frame pointer, as it may be helpful.
... it will always use pop {r11, pc} for a non-leaf function and pop {r11}; bx lr for a leaf function. Can anybody tell me why this is?
A 'C' compiler is different. It has rules, called an ABI. The latest version is called the AAPCS for ARM or the ATPCS for Thumb. These rules exist so that different compilers can call each other's functions (see note 1), i.e., so tools can interoperate. You can keep this 'rule' in assembler or you can disregard it; if your goal is to interoperate with a compiler's code, you need to follow that ABI's rules.
Some of the rules say what needs to be pushed on the stack and how registers are used. The 'reason' the leaf case is different is that it is more efficient: writing to the register lr is much faster than memory (a push to the stack). In a non-leaf function, a function call inside it would destroy the existing lr and you would not return to the right place afterwards, so lr is pushed to the stack to make things work.
When I am in a leaf function, does it make a difference whether I use "bx lr" or "pop pc" to go back to the parent function?
The bx lr is faster than the pop pc because one uses memory and the other does not. Functionally they are the same. However, one common reason to use assembler is to be faster. You will end up with the same execution path either way; it will just take longer, and how much longer will depend on the memory system. It could be next to negligible for a Cortex-M with TCM, or very high for Cortex-A CPUs.
The ARM uses registers to pass parameters because this is faster than pushing parameters on the stack. Consider this code:
int foo(int a, int b, int c) {return a+b+c;}
int bar(int a) { return foo(a, 1, 2);}
Here is possible ARM code (see note 2):
foo:
pop {r0, r1}
add r0,r0,r1 ; only two registers needed.
pop {r1}
add r0,r0,r1
bx lr
bar:
push {lr}
push {r0} ; notice we are only using one register?
mov r0, #1
push {r0}
mov r0, #2
push {r0}
bl foo
pop {pc}
This is not how any ARM compiler will do things. The convention is to use R0, R1, and R2 to pass the parameters, because this is faster and actually produces less code. But either way achieves the same thing. Maybe:
foo:
add r0,r0,r1 ; a = a + b
add r0,r0,r2 ; a = a + c
bx lr
bar:
push {lr} ; a = a from caller of bar.
mov r1, #1 ; b = 1
mov r2, #2 ; c = 2
bl foo
pop {pc}
The lr is somewhat similar to the parameters. You could push the parameters on the stack, or just leave them in registers. You could put the lr on the stack and pop it off later, or you can just leave it where it is. What should not be underestimated is how much faster code can become when it uses registers as opposed to memory. Moving things around is generally a sign that assembler code is not optimal: the more mov, push, and pop instructions you have, the slower your code is.
So, generally, quite a bit of thought went into the ABI to make it as fast as possible. The older APCS is slightly slower than the newer AAPCS, but both work.
Note 1: You will notice a difference between static and non-static functions if you turn up optimizations. This is because the compiler may ignore the ABI to be faster. Static functions cannot be called by another compiler's code and don't need to interoperate.
Note 2: In fact, the CPU designers think a lot about the ABI and take into consideration how many registers to provide. Too many registers and the opcodes will be big; too few and lots of memory will be used instead of registers.
In a leaf function, there are no function calls that would modify the link register lr.
For a non-leaf function, the lr must be preserved, which is done here by pushing it to the stack (somewhere not shown, earlier in the function).
The epilogue of the non-leaf function could be rewritten:
pop {r11, lr}
bx lr
This is, however, one more instruction, so it is slightly less efficient.
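For experimenting, a tiny hypothetical C pair (the names are made up) makes the difference easy to reproduce; compile with something like arm-none-eabi-gcc -O0 -S and compare the epilogues:

int leaf(int x) {        /* makes no calls: lr survives, so bx lr suffices */
    return x + 1;
}

int nonleaf(int x) {     /* bl leaf overwrites lr, so lr must be saved,    */
    return leaf(x) + 2;  /* typically push {r11, lr} ... pop {r11, pc}     */
}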
I am reading the Programming from the Ground Up book. I see two different examples of how the base pointer %ebp is created from the current stack position %esp.
In one case, it is done before the local variables.
_start:
# INITIALIZE PROGRAM
subl $ST_SIZE_RESERVE, %esp # Allocate space for pointers on the
# stack (file descriptors in this
# case)
movl %esp, %ebp
_start, however, is not like other functions; it is the entry point of the program.
In another case it is done after.
power:
pushl %ebp # Save old base pointer
movl %esp, %ebp # Make stack pointer the base pointer
subl $4, %esp # Get room for our local storage
So my question is: do we first reserve space for local variables on the stack and then create the base pointer, or do we first create the base pointer and then reserve space for local variables?
Wouldn't both just work, even if I mixed them up in different functions of a program? One function does it before, the other does it after, etc. Does C have a specific convention when it creates the machine code?
My reasoning is that all the code in a function is relative to the base pointer, so as long as each function follows the convention by which it created its reference into the stack, it just works?
A few related links for those who are interested:
Function Prologue
In your first case you don't care about preservation: this is the entry point. You are trashing %ebp when you exit the program, and who cares about the state of the registers? It doesn't matter any more, as your application has ended. But in a function, when you return from that function, the caller certainly doesn't want %ebp trashed. Now, can you modify %esp first, then save %ebp, then use %ebp? Sure, so long as you unwind the same way on the other end of the function. You may not need to have a frame pointer at all; often that is just a personal choice.
You just need a relative picture of the world. A frame pointer is usually just there to make the compiler author's job easier; on many instruction sets it is actually there just to waste a register. Perhaps it is done because some teacher or textbook taught it that way, and nobody asked why.
For coding sanity (the compiler author's sanity, etc.), if you need to use the stack, it is desirable to have a base address from which to offset into your portion of the stack FOR THE DURATION of the function, or at least after the setup and before the cleanup. This can be the stack pointer (sp) itself, or it can be a frame pointer; sometimes the instruction set makes the choice obvious. Some architectures have a stack that grows down (in address space, toward zero), and sp-based addressing may allow only positive offsets (sane) or only negative ones (insane; unlikely, but let's say it exists), so you may want a general purpose register instead. Maybe there are instruction sets where you can't use sp in addressing at all and you have to use a general purpose register.
Bottom line: for sanity you want a reference point for offsetting items in the stack. The more painful way, which uses less memory, is to add and remove things as you go:
x is at sp+4
push a
push b
do stuff
x is at sp+12
pop b
x is at sp+8
call something
pop a
x is at sp+4
do stuff
More work, but a program (compiler) can keep track and is less error prone than a human by hand; when debugging the compiler output, though, it is harder for a human to follow and keep track. So generally we burn the stack space and have one reference point.
A frame pointer can be used to separate the incoming parameters and the local variables: the base pointer (bp), for example, serves as a static base address within the function, and sp as the base address for local variables (although sp could be used for everything if the instruction set provides a large enough offset). So by pushing bp and then modifying sp, you are creating this two-base-address situation: sp can move around for local stuff (although that is not usually sane), and bp can be used as a static place to grab parameters, if the calling convention dictates that all parameters go on the stack (generally the case when you don't have a lot of general purpose registers). Sometimes you will see the parameters copied to local allocations on the stack for later use; but if you have enough registers, you may instead see a register saved on the stack and used in the function, avoiding accesses through a base address and offset.
unsigned int more_fun ( unsigned int x );
unsigned int fun ( unsigned int x )
{
unsigned int y;
y = x;
return(more_fun(x+1)+y);
}
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e1a04000 mov r4, r0
8: e2800001 add r0, r0, #1
c: ebfffffe bl 0 <more_fun>
10: e0800004 add r0, r0, r4
14: e8bd4010 pop {r4, lr}
18: e12fff1e bx lr
Do not take what you see in a textbook or on a whiteboard (or in answers on StackOverflow) as gospel. Think through the problem, and through the alternatives:
- Are the alternatives functionally broken?
- Are they functionally correct?
- Are there disadvantages, like readability?
- Performance?
- Is the performance hit universal, or does it depend on just how slow/fast the memory is?
- Do the alternatives generate more code, which is a performance hit, but maybe that code is pipelined vs. random memory accesses?
- If I don't use a frame pointer, does the architecture let me regain that register for general purpose use?
In the first example, bp is being trashed. That is bad in general, but this is the entry point to the program; there is no need to preserve bp (unless the operating system dictates otherwise).
In a function, though, based on the calling convention one assumes that bp is used by the caller and must be preserved, so you have to save it on the stack before using it. In this case it appears bp is wanted for accessing the parameters passed in by the caller on the stack; then sp is moved to make room for (and possibly access, though that is not necessarily required if bp can be used) local variables.
If you were to modify sp first and then push bp, you would basically have two pointers one push-width away from each other. Does that make much sense? Does it make sense to have two frame pointers at all, and if so, does it make sense to have them at almost the same address?
By pushing bp first, and if the calling convention pushes the first parameter last, then as a compiler author you can make bp+N always (or ideally always) point at the first parameter for a fixed value N; likewise, bp+M always points at the second. A bit lazy to me, but if the register is there to be burned, then burn it...
In one case, it is done before the local variables.
_start is not a function. It's your entry point. There's no return address, and no caller's value of %ebp to save.
The i386 System V ABI doc suggests (in section 2.3.1 Initial Stack and Register State) that you might want to zero %ebp to mark the deepest stack frame. (i.e. before your first call instruction, so the linked list of saved ebp values has a NULL terminator when that first function pushes the zeroed ebp. See below).
Does C have a specific convention when it creates the machine code?
No, unlike in some other x86 systems, the i386 System V ABI doesn't require much about your stack-frame layout. (Linux uses the System V ABI / calling convention, and the book you're using (PGU) is for Linux.)
In some calling conventions, setting up ebp is not optional, and the function entry sequence has to push ebp just below the return address. This creates a linked list of stack frames which allows an exception handler (or debugger) to backtrace up the stack. (How to generate the backtrace by looking at the stack values?). I think this is required in 32-bit Windows code for SEH (structured exception handling), at least in some cases, but IDK the details.
The i386 SysV ABI defines an alternate mechanism for stack unwinding which makes frame pointers optional, using metadata in another section (.eh_frame and .eh_frame_hdr which contains metadata created by .cfi_... assembler directives, which in theory you could write yourself if you wanted stack-unwinding through your function to work. i.e. if you were calling any C++ code which expected throw to work.)
If you want to use the traditional frame-walking in current gdb, you have to actually do it yourself by defining a GDB function like gdb backtrace by walking frame pointers or Force GDB to use frame-pointer based unwinding. Or apparently if your executable has no .eh_frame section at all, gdb will use the EBP-based stack-walking method.
If you compile with gcc -fno-omit-frame-pointer, your call stack will have this linked-list property, because when C compilers do make proper stack frames, they push ebp first.
IIRC, perf has a mode for using the frame-pointer chain to get backtraces while profiling, and apparently this can be more reliable than the default .eh_frame stuff for correctly accounting which functions are responsible for using the most CPU time. (Or causing the most cache misses, branch mispredicts, or whatever else you're counting with performance counters.)
Wouldn't both just work even if I mix them up in different functions of a program? One function does it before, the other does it after etc.
Yes, it would work fine. In fact setting up ebp at all is optional, but when writing by hand it's easier to have a fixed base (unlike esp which moves around when you push/pop).
For the same reason, it's easier to stick to the convention of mov %esp, %ebp after one push (of the old %ebp), so the first function arg is always at ebp+8. See What is stack frame in assembly? for the usual convention.
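A minimal function (my naming, not from the book) to observe that convention: build with gcc -m32 -O0 -S, or -O2 -fno-omit-frame-pointer, and look for the push %ebp / movl %esp, %ebp / subl $N, %esp sequence in the listing.

int frame_demo(int arg) {
    volatile int local = arg + 1;  /* volatile forces a real stack slot */
    return local;                  /* arg at 8(%ebp), local at some -N(%ebp) */
}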
But you could maybe save code size by having ebp point in the middle of some space you reserved, so all the memory addressable with an ebp + disp8 addressing mode is usable. (disp8 is a signed 8-bit displacement: -128 to +124 if we're limiting to 4-byte aligned locations). This saves code bytes vs. needing a disp32 to reach farther. So you might do
bigfunc:
push %ebp
lea -112(%esp), %ebp # first arg at ebp+8+112 = 120(%ebp)
sub $236, %esp # locals from -124(%ebp) ... 108(%ebp)
# saved EBP at 112(%ebp), ret addr at 116(%ebp)
# 236 was chosen to leave %esp 16-byte aligned.
Or delay saving any registers until after reserving space for locals, so we aren't using up any of the locations (other than the ret addr) with saved values we never want to address.
bigfunc2: # first arg at 4(%esp)
sub $252, %esp # first arg at 252+4(%esp)
push %ebp # first arg at 252+4+4(%esp)
lea 140(%esp), %ebp # first arg at 260-140 = 120(%ebp)
push %edi # save the other call-preserved regs
push %esi
push %ebx
# %esp is 16-byte aligned after these pushes, in case that matters
(Remember to be careful how you restore registers and clean up. You can't use leave because esp = ebp isn't right. With the "normal" stack frame sequence, you might restore other pushed registers (from near the saved EBP) with mov, then use leave. Or restore esp to point at the last push (with add), and use pop instructions.)
But if you're going to do this, there's no advantage to using ebp instead of ebx or something. In fact, there's a disadvantage to using ebp: the 0(%ebp) addressing mode requires a disp8 of 0, instead of no displacement, but %ebx wouldn't. So use %ebp for a non-pointer scratch register. Or at least one that you don't dereference without a displacement. (This quirk is irrelevant with a real frame pointer: (%ebp) is the saved EBP value. And BTW, the encoding that would mean (%ebp) with no displacement is how the ModRM byte encodes a disp32 with no base register, like (12345) or my_label)
These examples are pretty artificial; you usually don't need that much space for locals unless it's an array, and then you'd use indexed addressing modes or pointers, not just a disp8 relative to ebp. But maybe you need space for a few 32-byte AVX vectors. In 32-bit code with only 8 vector registers, that's plausible.
AVX512 compressed disp8 mostly defeats this argument for 64-byte AVX512 vectors, though. (But AVX512 in 32-bit mode can still only use 8 vector registers, zmm0-zmm7, so you could easily need to spill some. You only get x/ymm8-15 and zmm8-31 in 64-bit mode.)
I was writing a simple memcpy kernel to measure the memory bandwidth of my GTX 760M and to compare it to cudaMemcpy(). It looks like this:
template<unsigned int THREADS_PER_BLOCK>
__global__ static
void copy(void* src, void* dest, unsigned int size) {
using vector_type = int2;
vector_type* src2 = reinterpret_cast<vector_type*>(src);
vector_type* dest2 = reinterpret_cast<vector_type*>(dest);
//This copy kernel is only correct when size%sizeof(vector_type)==0
auto numElements = size / sizeof(vector_type);
for(auto id = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; id < numElements ; id += gridDim.x * THREADS_PER_BLOCK){
dest2[id] = src2[id];
}
}
I also calculated the number of blocks required to reach 100% occupancy like so:
THREADS_PER_BLOCK = 256
Multi-Processors: 4
Max Threads per Multi Processor: 2048
NUM_BLOCKS = 4 * 2048 / 256 = 32
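(For reference, the same number can be derived at runtime with the CUDA occupancy API; a minimal sketch, where the helper name is made up and copy<256> is the kernel above:)

#include <cuda_runtime.h>

int computeNumBlocks() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, copy<256>, 256, 0 /* dynamic smem */);
    // expected to reproduce the hand calculation (8 * 4 = 32 on the
    // GTX 760M above), assuming nothing else limits occupancy
    return blocksPerSM * prop.multiProcessorCount;
}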
My tests, on the other hand, showed that starting enough blocks so that each thread processes only one element always outperformed the "optimal" block count. Here are the timings for 400 MB of data:
bandwidth test by copying 400mb of data.
cudaMemcpy finished in 15.63ms. Bandwidth: 51.1838 GB/s
thrust::copy finished in 15.7218ms. Bandwidth: 50.8849 GB/s
my memcpy (195313 blocks) finished in 15.6208ms. Bandwidth: 51.2137 GB/s
my memcpy (32 blocks) finished in 16.8083ms. Bandwidth: 47.5956 GB/s
So my questions are:
Why is there a speed difference?
Are there any downsides of starting one thread per element, when each element can be processed completely independent of all other elements?
Is starting 1 thread per element always optimal for data independent problems on the GPU?
Not always. Let's consider 3 different implementations. In each case we'll assume we're dealing with a trivially parallelizable problem that involves one element load, some "work" and one element store per thread. In your copy example there is basically no work - just loads and stores.
One element per thread. Each thread is doing 1 element load, the work, and 1 store. The GPU likes to have a lot of exposed parallel-issue-capable instructions per thread available, in order to hide latency. Your example consists of one load and one store per thread, ignoring other instructions like index arithmetic, etc. In your example GPU, you have 4 SMs, and each is capable of a maximum complement of 2048 threads (true for nearly all GPUs today), so the maximum in-flight complement is 8192 threads. So at most, 8192 loads can be issued to the memory pipe, then we're going to hit machine stalls until that data comes back from memory, so that the corresponding store instructions can be issued. In addition, for this case, we have overhead associated with retiring threadblocks and launching new threadblocks, since each block only handles 256 elements.
Multiple elements per thread, not known at compile time. In this case, we have a loop. The compiler does not know the loop extent at compile time, so it may or may not unroll the loop. If it does not unroll the loop, then we have a load followed by a store per loop iteration. This doesn't give the compiler a good opportunity to reorder (independent) instructions, so the net effect may be the same as case 1, except that we have some additional overhead associated with processing the loop.
Multiple elements per thread, known at compile time. You haven't really provided this example, but it is often the best scenario. In the parallelforall blog matrix transpose example, the writer of that (essentially copy) kernel chose to have each thread perform 8 elements of copy "work". The compiler then sees a loop:
LOOP: LD R0, in[idx];
ST out[idx], R0;
...
BRA LOOP;
which it can unroll (let's say) 8 times:
LD R0, in[idx];
ST out[idx], R0;
LD R0, in[idx+1];
ST out[idx+1], R0;
LD R0, in[idx+2];
ST out[idx+2], R0;
LD R0, in[idx+3];
ST out[idx+3], R0;
LD R0, in[idx+4];
ST out[idx+4], R0;
LD R0, in[idx+5];
ST out[idx+5], R0;
LD R0, in[idx+6];
ST out[idx+6], R0;
LD R0, in[idx+7];
ST out[idx+7], R0;
and after that it can reorder the instructions, since the operations are independent:
LD R0, in[idx];
LD R1, in[idx+1];
LD R2, in[idx+2];
LD R3, in[idx+3];
LD R4, in[idx+4];
LD R5, in[idx+5];
LD R6, in[idx+6];
LD R7, in[idx+7];
ST out[idx], R0;
ST out[idx+1], R1;
ST out[idx+2], R2;
ST out[idx+3], R3;
ST out[idx+4], R4;
ST out[idx+5], R5;
ST out[idx+6], R6;
ST out[idx+7], R7;
at the expense of some increased register pressure. The benefit here, as compared to the non-unrolled loop case, is that the first 8 LD instructions can all be issued - they are all independent. After issuing those, the thread will stall at the first ST instruction - until the corresponding data is actually returned from global memory. In the non-unrolled case, the machine can issue the first LD instruction, but immediately hits a dependent ST instruction, and so it may stall right there. The net of this is that in the first 2 scenarios, I was only able to have 8192 LD operations in flight to the memory subsystem, but in the 3rd case I was able to have 65536 LD instructions in flight. Does this provide a benefit? In some cases, it does. The benefit will vary depending on which GPU you are running on.
What we have done here, is effectively (working in conjunction with the compiler) increase the number of instructions that can be issued per thread, before the thread will hit a stall. This is also referred to as increasing the exposed parallelism, basically via ILP in this approach. Whether or not it has any benefit will vary depending on your actual code, your actual GPU, and what else is going in the GPU at that time. But it is always a good strategy to increase exposed parallelism using techniques such as this, because the ability to issue instructions is how the GPU hides the various forms of latency that it must deal with, so we have effectively improved the GPU's ability to hide latency, with this approach.
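As a concrete illustration of case 3 (a sketch under my own naming, not code from the question): fixing the per-thread element count at compile time lets the compiler fully unroll the loop and batch the independent loads, as described above.

template<unsigned int THREADS_PER_BLOCK, unsigned int ELEMS_PER_THREAD>
__global__ void copy_unrolled(const int2* __restrict__ src,
                              int2* __restrict__ dest,
                              unsigned int numElements) {
    unsigned int stride = gridDim.x * THREADS_PER_BLOCK;
    unsigned int id = blockIdx.x * THREADS_PER_BLOCK + threadIdx.x;
    #pragma unroll
    for (unsigned int i = 0; i < ELEMS_PER_THREAD; ++i, id += stride) {
        if (id < numElements)   // guard keeps partial tails correct
            dest[id] = src[id];
    }
}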
Why is there a speed difference?
This can be difficult to answer without profiling the code carefully. However it's often the case that launching just enough threads to fully satisfy the instantaneous carrying capacity of the GPU is not a good strategy, possibly due to the "tail effect" or other types of inefficiency. It may also be the case that blocks are limited by some other factor, such as registers or shared memory usage. It's usually necessary to carefully profile as well as possibly study the generated machine code to fully answer such a question. But it may be that the loop overhead measurably impacts your comparison, which is basically my case 2 vs. my case 1 above.
(note the memory indices in my "pseudo" machine code example are not what you would expect for a well written grid-striding copy loop - they are just for example purposes to demonstrate unrolling and the benefit it can have via compiler instruction reordering).
One-liner answer: when you have one thread per element, you pay the thread setup cost - at the very least, copying arguments from constant memory to registers - for every single element, and that's wasteful.
I'm currently analyzing a program I wrote in assembly and was thinking about moving some of the code around. I have a procedure which takes one argument, but I'm not sure whether it is passed on the stack or in a register.
When I open my program in IDA Pro, the first line in the procedure is:
ThreadID= dword ptr -4
If I hover my cursor over the declaration, the following also appears:
ThreadID dd ?
r db 4 dup(?)
which I would assume points to a stack variable?
When I open the same program in OllyDbg, however, at this spot on the stack there is a large value, which would be inconsistent with any parameter that could have been passed, leading me to believe that it is passed in a register.
Can anyone point me in the right direction?
The way arguments are passed to a function depends on the function's calling convention. The default calling convention depends on the language, compiler and architecture.
I can't say anything for sure with the information you provided; however, you shouldn't forget that assembly-level debuggers like OllyDbg and disassemblers like IDA often use heuristics to reverse-engineer the program. The best way to study the code generated by the compiler is to instruct it to write assembly listings. Most compilers have an option to do this.
It is a local variable for sure. To check out the arguments, look for [esp+XXX] values; IDA names those [esp+arg_XXX] automatically.
.text:0100346A sub_100346A proc near ; CODE XREF: sub_100347C+44p
.text:0100346A ; sub_100367A+C6p ...
.text:0100346A
.text:0100346A arg_0 = dword ptr 4
.text:0100346A
.text:0100346A mov eax, [esp+arg_0]
.text:0100346E add dword_1005194, eax
.text:01003474 call sub_1002801
.text:01003474
.text:01003479 retn 4
.text:01003479
.text:01003479 sub_100346A endp
And the fastcall convention, as outlined in the comment above, uses registers to pass arguments. I'd bet on the Microsoft or GCC compiler, as they are more widely used. So check the ECX and EDX registers first.
The Microsoft or GCC __fastcall convention (aka __msfastcall) passes the first two arguments (evaluated left to right) that fit into ECX and EDX. Remaining arguments are pushed onto the stack from right to left.
http://en.wikipedia.org/wiki/X86_calling_conventions#fastcall
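To illustrate (a hypothetical snippet in MSVC syntax; GCC spells the attribute __attribute__((fastcall))):

int __fastcall sum3(int a, int b, int c) {
    /* a arrives in ECX, b in EDX, c is pushed on the stack */
    return a + b + c;
}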