When do we create base pointer in a function - before or after local variables?

I am reading the book Programming from the Ground Up. I see two different examples of how the base pointer %ebp is created from the current stack position %esp.
In one case, it is done before the local variables.
_start:
# INITIALIZE PROGRAM
subl $ST_SIZE_RESERVE, %esp # Allocate space for pointers on the
# stack (file descriptors in this
# case)
movl %esp, %ebp
The _start however is not like other functions, it is the entry point of the program.
In another case it is done after.
power:
pushl %ebp # Save old base pointer
movl %esp, %ebp # Make stack pointer the base pointer
subl $4, %esp # Get room for our local storage
So my question is, do we first reserve space for local variables in the stack and create the base pointer or first create the base pointer and then reserve space for local variables?
Wouldn't both just work even if I mix them up in different functions of a program? One function does it before, the other does it after etc. Does C have a specific convention when it creates the machine code?
My reasoning is that all the code in a function would be relative to the base pointer, so as long as that function follows the convention according to which it created a reference of the stack, it just works?
A few related links for those who are interested:
Function Prologue

In your first case you don't care about preservation - this is the entry point. You are trashing %ebp when you exit the program - who cares about the state of the registers? It doesn't matter any more as your application has ended. But in a function, when you return from that function the caller certainly doesn't want %ebp trashed. Now can you modify %esp first, then save %ebp, then use %ebp? Sure, so long as you unwind the same way on the other end of the function. You may not need a frame pointer at all; often that is just a personal choice.
You just need a relative picture of the world. A frame pointer is usually just there to make the compiler author's job easier, actually it is usually there just to waste a register for many instruction sets. Perhaps because some teacher or textbook taught it that way, and nobody asked why.
For coding sanity, the compiler author's sanity etc., it is desirable, if you need to use the stack, to have a base address from which to offset into your portion of the stack, FOR THE DURATION of the function. Or at least after the setup and before the cleanup. This can be the stack pointer (sp) itself or it can be a frame pointer; sometimes it is obvious from the instruction set. Some have a stack that grows down (in address space toward zero) and the stack pointer can only have positive offsets in sp-based addressing (sane) or only negative ones (insane) (unlikely, but let's say it's there). So you may want a general purpose register. Maybe on some you can't use the sp in addressing at all and you have to use a general purpose register.
Bottom line, for sanity you want a reference point to offset items in the stack. The more painful way, which uses less memory, would be to add and remove things as you go:
x is at sp+4
push a
push b
do stuff
x is at sp+12
pop b
x is at sp+8
call something
pop a
x is at sp+4
do stuff
That is more work, but a program (a compiler) can keep track and is less error prone than a human doing it by hand; a human debugging the compiler output, though, finds it harder to follow and keep track. So generally we burn the stack space and have one reference point. A frame pointer can be used to separate the incoming parameters from the local variables, using the base pointer (bp), for example, as a static base address within the function and sp as the base address for local variables (although sp could be used for everything if the instruction set provides that much of an offset). So by pushing bp and then modifying sp you are creating this two-base-address situation: sp can move around, perhaps for local stuff (although that is not usually sane), and bp can be used as a static place to grab parameters if this is a calling convention that dictates all parameters are on the stack (generally when you don't have a lot of general purpose registers). Sometimes you see the parameters copied to a local allocation on the stack for later use, but if you have enough registers you may instead see a register saved on the stack and used in the function, with no need to access the stack using a base address and offset.
unsigned int more_fun ( unsigned int x );
unsigned int fun ( unsigned int x )
{
unsigned int y;
y = x;
return(more_fun(x+1)+y);
}
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e1a04000 mov r4, r0
8: e2800001 add r0, r0, #1
c: ebfffffe bl 0 <more_fun>
10: e0800004 add r0, r0, r4
14: e8bd4010 pop {r4, lr}
18: e12fff1e bx lr
Do not take what you see in a textbook, on a whiteboard (or in answers on Stack Overflow) as gospel. Think through the problem, and through alternatives.
Are the alternatives functionally broken?
Are they functionally correct?
Are there disadvantages like readability?
Performance?
Is the performance hit universal or does it depend on just how slow/fast the memory is?
Do the alternatives generate more code which is a performance hit but maybe that code is pipelined vs random memory accesses?
If I don't use a frame pointer does the architecture let me regain that register for general purpose use?
In the first example bp is being trashed, that is bad in general but this is the entry point to the program, there is no need to preserve bp (unless the operating system dictates).
In a function, though, based on the calling convention one assumes that bp is used by the caller and must be preserved, so you have to save it on the stack to use it. In this case it appears to be used to access parameters passed in by the caller on the stack; then sp is moved to make room for local variables (and possibly to access them, although that is not required if bp can be used).
If you were to modify sp first then push bp, you would basically have two pointers one push width away from each other, does that make much sense? Does it make sense to have two frame pointers anyway and if so does it make sense to have them almost the same address?
By pushing bp first, and if the calling convention pushes the first parameter last, then as a compiler author you can make bp+N always (or ideally always) point at the first parameter for a fixed value N; likewise bp+M always points at the second. A bit lazy to me, but if the register is there to be burned then burn it...

In one case, it is done before the local variables.
_start is not a function. It's your entry point. There's no return address, and no caller's value of %ebp to save.
The i386 System V ABI doc suggests (in section 2.3.1 Initial Stack and Register State) that you might want to zero %ebp to mark the deepest stack frame. (i.e. before your first call instruction, so the linked list of saved ebp values has a NULL terminator when that first function pushes the zeroed ebp. See below).
Does C have a specific convention when it creates the machine code?
No, unlike in some other x86 systems, the i386 System V ABI doesn't require much about your stack-frame layout. (Linux uses the System V ABI / calling convention, and the book you're using (PGU) is for Linux.)
In some calling conventions, setting up ebp is not optional, and the function entry sequence has to push ebp just below the return address. This creates a linked list of stack frames which allows an exception handler (or debugger) to backtrace up the stack. (How to generate the backtrace by looking at the stack values?). I think this is required in 32-bit Windows code for SEH (structured exception handling), at least in some cases, but IDK the details.
The i386 SysV ABI defines an alternate mechanism for stack unwinding which makes frame pointers optional, using metadata in another section (.eh_frame and .eh_frame_hdr which contains metadata created by .cfi_... assembler directives, which in theory you could write yourself if you wanted stack-unwinding through your function to work. i.e. if you were calling any C++ code which expected throw to work.)
If you want to use the traditional frame-walking in current gdb, you have to actually do it yourself by defining a GDB function, as in "gdb backtrace by walking frame pointers" or "Force GDB to use frame-pointer based unwinding". Or apparently if your executable has no .eh_frame section at all, gdb will use the EBP-based stack-walking method.
If you compile with gcc -fno-omit-frame-pointer, your call stack will have this linked-list property, because when C compilers do make proper stack frames, they push ebp first.
IIRC, perf has a mode for using the frame-pointer chain to get backtraces while profiling, and apparently this can be more reliable than the default .eh_frame stuff for correctly accounting which functions are responsible for using the most CPU time. (Or causing the most cache misses, branch mispredicts, or whatever else you're counting with performance counters.)
Wouldn't both just work even if I mix them up in different functions of a program? One function does it before, the other does it after etc.
Yes, it would work fine. In fact setting up ebp at all is optional, but when writing by hand it's easier to have a fixed base (unlike esp which moves around when you push/pop).
For the same reason, it's easier to stick to the convention of mov %esp, %ebp after one push (of the old %ebp), so the first function arg is always at ebp+8. See What is stack frame in assembly? for the usual convention.
But you could maybe save code size by having ebp point in the middle of some space you reserved, so all the memory addressable with an ebp + disp8 addressing mode is usable. (disp8 is a signed 8-bit displacement: -128 to +124 if we're limiting to 4-byte aligned locations). This saves code bytes vs. needing a disp32 to reach farther. So you might do
bigfunc:
push %ebp
lea -112(%esp), %ebp # first arg at ebp+8+112 = 120(%ebp)
sub $236, %esp # locals from -124(%ebp) ... 108(%ebp)
# saved EBP at 112(%ebp), ret addr at 116(%ebp)
# 236 was chosen to leave %esp 16-byte aligned.
Or delay saving any registers until after reserving space for locals, so we aren't using up any of the locations (other than the ret addr) with saved values we never want to address.
bigfunc2: # first arg at 4(%esp)
sub $252, %esp # first arg at 252+4(%esp)
push %ebp # first arg at 252+4+4(%esp)
lea 140(%esp), %ebp # first arg at 260-140 = 120(%ebp)
push %edi # save the other call-preserved regs
push %esi
push %ebx
# %esp is 16-byte aligned after these pushes, in case that matters
(Remember to be careful how you restore registers and clean up. You can't use leave because esp = ebp isn't right. With the "normal" stack frame sequence, you might restore other pushed registers (from near the saved EBP) with mov, then use leave. Or restore esp to point at the last push (with add), and use pop instructions.)
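The offset arithmetic in the comments of the bigfunc example can be checked with a small C model. This is only a sketch: `frame_model`, `model_frame` and `fits_disp8` are hypothetical names introduced here, and the model assumes the prologue shape of bigfunc (push %ebp, then a base register set `bias` bytes below the saved-EBP slot, then `sub $reserve, %esp`).

```c
#include <assert.h>
#include <stdbool.h>

/* True if a displacement can be encoded as a signed 8-bit disp8. */
bool fits_disp8(int disp)
{
    return disp >= -128 && disp <= 127;
}

/* Offsets of interesting stack slots, relative to the base register
 * (hypothetical helper type, not from the original answer). */
struct frame_model {
    int saved_ebp;    /* slot written by push %ebp */
    int ret_addr;     /* return address pushed by the caller's call */
    int first_arg;    /* first stack argument */
    int lowest_local; /* lowest reserved byte, i.e. the final %esp */
};

/* Models: push %ebp ; lea -bias(%esp), %ebp ; sub $reserve, %esp */
struct frame_model model_frame(int bias, int reserve)
{
    assert(bias >= 0 && reserve >= 0);
    struct frame_model f;
    f.saved_ebp = bias;               /* %esp after the push is base + bias */
    f.ret_addr = bias + 4;            /* one slot above the saved EBP */
    f.first_arg = bias + 8;           /* one slot above the return address */
    f.lowest_local = bias - reserve;  /* %esp after the sub */
    return f;
}
```

With `model_frame(112, 236)` the model gives 112, 116, 120 and -124, matching the comments in bigfunc, and every one of those offsets passes `fits_disp8`.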
But if you're going to do this, there's no advantage to using ebp instead of ebx or something. In fact, there's a disadvantage to using ebp: the 0(%ebp) addressing mode requires a disp8 of 0, instead of no displacement, but %ebx wouldn't. So use %ebp for a non-pointer scratch register. Or at least one that you don't dereference without a displacement. (This quirk is irrelevant with a real frame pointer: (%ebp) is the saved EBP value. And BTW, the encoding that would mean (%ebp) with no displacement is how the ModRM byte encodes a disp32 with no base register, like (12345) or my_label)
These examples are pretty artificial; you usually don't need that much space for locals unless it's an array, and then you'd use indexed addressing modes or pointers, not just a disp8 relative to ebp. But maybe you need space for a few 32-byte AVX vectors. In 32-bit code with only 8 vector registers, that's plausible.
AVX512 compressed disp8 mostly defeats this argument for 64-byte AVX512 vectors, though. (But AVX512 in 32-bit mode can still only use 8 vector registers, zmm0-zmm7, so you could easily need to spill some. You only get x/ymm8-15 and zmm8-31 in 64-bit mode.)

Related

Looking for a _one byte_ invalid opcode with x86

I need an invalid opcode with x86 (not x64!) that's exactly one byte in length to overwrite some code in a foreign process. Currently I'm using INT3 (0xCC) but it would be nicer to trap an invalid opcode separately since the foreign process contains a lot of valid INT3.
According to http://ref.x86asm.net/coder32.html, there aren't any in 32-bit mode guaranteed to #UD. Anything that wasn't nailed down has been reused as building material for new extensions.
The ones that exist in 64-bit mode are reserved and not guaranteed to fault on future CPUs; only ud2 is truly guaranteed future-proof. Assuming x86-64 lasts long enough, likely some vendor will make use of that 64-bit-only coding space and stop wasting code-size to also cater to increasingly obsolete 32-bit mode.
If you don't need #UD, you can raise #GP(0) with some privileged instructions in user-space, assuming you're never going to be running in kernel mode.
F4 hlt will always #GP(0) in user-space, not enabled by IOPL, only true CPL=0. (Or #UD if used with a lock prefix). Even if it somehow gets executed in a kernel context, it just stops and waits for the next interrupt, so typically no effect on correctness unless executed with interrupts disabled. (In which case you're stuck until the next NMI).
A similar but worse option is FB sti. But it can execute successfully in a program that's used Linux iopl(), like an X11 server. Unless interrupts were supposed to be disabled, though, that's still not going to lock up your system, it just won't trigger the exception you were looking for. (Unlike cli which could get that CPU stuck, or in al, dx which could do wild IO and even be allowed by ioperm not just iopl, depending on what value is in DX.)
Depending what comes next in memory, 9A callf ptr16:32 might fault on trying to load an invalid value into CS. That value would come from the 2 bytes of machine code 5 and 6 bytes after this one (i.e. after a 32-bit new EIP, since ptr16:32 is stored little-endian). Unlike call rel32 or whatever, it may fault before actually pushing anything and overwriting the current CS:EIP. (But if not, in theory your debugger could simulate popping that far-return address back into CS:EIP after catching the fault.)
Just to be clear, I'm suggesting overwriting a byte with 9A, and leaving the later bytes of machine code unmodified, after checking that the bytes that would be the new CS value are in fact invalid. e.g. by making sure a far call to that address segfaults. Or if this is near the end of a page, and the next is unmapped, it can #PF.
The F0 lock prefix faults with #UD if used on things other than a memory-destination RMW operation, so it can also work if later context would decode as any other instruction. But you can't always use it; you need to check that you aren't creating a valid atomic RMW instruction. e.g. if the ModRM byte was 00 or 01, replacing the opcode with a lock prefix creates a memory-destination add.
@ecm points out that f1 on some CPUs is icebp / int1, but on other CPUs where it isn't, it's undefined but doesn't raise #UD. (http://ref.x86asm.net/coder32.html#xF1)
If the following byte is 0, D4 00 aam 0 is guaranteed to #DE (divide exception). But any other value does immediate 8-bit division of AL.
Depending what byte comes next, CD int n can be used. But not for all following bytes, e.g. int 0x80 won't fault under Linux (unless your kernel is built without CONFIG_IA32_EMULATION). And you might not want some of the other random interrupt numbers. e.g. CD 03 int 3 is pretty much like CC int3.

Use of LR and PC instructions in non-leaf and leaf functions epilogue

I am trying to learn assembly through the guide from azeria-labs.com.
I have a question about the use of the LR register and the PC register in the epilogue of non-leaf and leaf functions.
In the snippet below they show the difference for the epilogue in these functions.
If I write a program in C and look at it in GDB, it will always use "pop {r11, pc}" for a non-leaf function and "pop {r11}; bx lr" for a leaf function. Can anybody tell me why this is?
When I am in a leaf function, does it make a difference, for example, if I use "bx lr" or "pop pc" to go back to the parent function?
/* An epilogue of a leaf function */
pop {r11}
bx lr
/* An epilogue of a non-leaf function */
pop {r11, pc}
I am trying to learn assembly
I have a question about the use of the LR register and the PC register in the epilogue of non-leaf and leaf functions.
This is part of the beauty and pain of assembler. There are no rules for the use of anything. It is up to you to decide what is needed. Please see: ARM Link and frame pointer, as it may be helpful.
... it will always use pop {r11, pc} for a non-leaf function and pop {r11}; bx lr for a leaf function. Can anybody tell me why this is?
A 'C' compiler is different. It has rules called an ABI. The latest version is called the AAPCS (it superseded the older APCS and ATPCS). These rules exist so that different compilers can call each other's functions (see Note 1), i.e. so that tools can interoperate. You can follow this 'rule' in assembler or you can disregard it. But if your goal is to interoperate with a compiler's code, you need to follow the ABI rules.
Some of the rules say what needs to be pushed on the stack and how registers are used. The 'reason' that the leaf is different is that it is more efficient. Keeping the return address in the register lr is much faster than going through memory (pushing it to the stack). In a non-leaf function, a function call will destroy the existing lr and you would not return to the right place afterwards, so lr is pushed to the stack to make things work.
When I am in a leaf function, does it make a difference, for example, if I use "bx lr" or "pop pc" to go back to the parent function?
The bx lr is faster than the pop pc because one uses memory and the other does not. Functionally they are the same. However, one common reason to use assembler is to be faster. You will functionally end up with the same execution path; it will just take longer, and how much longer depends on the memory system. It could be next to negligible for a Cortex-M with TCM or very high for Cortex-A CPUs.
The ARM uses register to pass parameters because this is faster than pushing parameters on the stack. Consider this code,
int foo(int a, int b, int c) {return a+b+c;}
int bar(int a) { return foo(a, 1, 2);}
Here is one possible ARM version (see Note 2):
foo:
pop {r0, r1}
add r0,r0,r1 ; only two registers needed.
pop {r1}
add r0,r0,r1
bx lr
bar:
push {lr}
push {r0} ; notice we are only using one register?
mov r0, #1
push {r0}
mov r0, #2
push {r0}
bl foo
pop {pc}
This is not how any ARM compiler will do things. The convention is to use R0, R1, and R2 to pass the parameters, because this is faster and actually produces less code. But either way achieves the same thing. Maybe,
foo:
add r0,r0,r1 ; a = a + b
add r0,r0,r2 ; a = a + c
bx lr
bar:
push {lr} ; a = a from caller of bar.
mov r1, #1 ; b = 1
mov r2, #2 ; c = 2
bl foo
pop {pc}
The lr is somewhat similar to the parameters. You could push the parameters on the stack or just leave them in a register. You could put the lr on the stack and then pop it off later, or you can just leave it in the register. What should not be underestimated is how much faster code can become when it uses registers as opposed to memory. Moving things around is generally a sign that assembler code is not optimal. The more mov, push and pop you have, the slower your code is.
So generally quite a bit of thought went into the ABI to make it as fast as possible. The older APCS is slightly slower than the newer AAPCS, but they both work.
Note 1: You will notice a difference between static and non-static functions if you turn up optimizations. This is because the compiler may ignore the ABI to be faster. Static functions can NOT be called by another compiler and don't need to interoperate.
Note 2: In fact the CPU designers think a lot about the ABI and take into consideration how many registers. Too many registers and the opcodes will be big. Too few and there will be lots of memory used instead of registers.
In the leaf function, there are no other function calls which would modify the link register lr.
For a non-leaf function, the lr must be preserved, done here by pushing it to the stack (somewhere not shown, earlier in the function).
The epilogue of the non-leaf function could be rewritten:
pop {r11, lr}
bx lr
This is however one more instruction, and so it is slightly less efficient.

How can I simulate a CALL instruction by using JMP?

Like this but without the CALL instruction. I suppose that I should use JMP and probably other instructions.
PUSH 5
PUSH 4
CALL Function
This is fairly easy to do. Push the return address onto the stack and then jump to the subroutine.
The final code looks like this:
PUSH 5
PUSH 4
PUSH offset label1
jmp Function
label1: ; returns here
lea esp, 8[esp]
Function:
...
ret
While this works, you really don't want to do this. Most modern processors keep an on-chip cache of return addresses (a return-address stack), which pushes the return address on a CALL and pops one on a RET. Being on the processor, it has extremely short update/access times, which means the RET instruction can use the popped value to predict where the PC should go next, rather than waiting for the actual memory read from the location pointed to by ESP. If you do the "PUSH offset label1" trick, this cache does not get updated, so the RET branch prediction is wrong and the processor pipeline gets flushed, with a severe negative impact on performance. (I think IBM has a patent on special instructions which are essentially "PUSHRETURNADDRESS k" and "POPRETURNADDRESS", allowing this trick to be used on some of their CPUs. Alas, not on the x86.)
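The misprediction described above can be illustrated with a toy C model of a return-address predictor. This is a sketch under loud assumptions: `struct ras`, `ras_on_call` and `ras_on_ret` are hypothetical names, the depth of 16 is arbitrary, and real hardware handles overflow and speculation in ways this model ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RAS_DEPTH 16 /* arbitrary depth for the sketch */

/* Toy return-address stack kept "inside the CPU". */
struct ras {
    unsigned long entries[RAS_DEPTH];
    size_t top;
};

void ras_init(struct ras *r)
{
    assert(r != NULL);
    r->top = 0;
}

/* A real CALL: the predictor records the return address.
 * Entries beyond RAS_DEPTH are simply dropped in this model. */
void ras_on_call(struct ras *r, unsigned long ret_addr)
{
    if (r->top < RAS_DEPTH)
        r->entries[r->top++] = ret_addr;
}

/* A RET: pop the prediction and compare it with the address actually
 * read from the memory stack. Returns true on a correct prediction. */
bool ras_on_ret(struct ras *r, unsigned long actual_ret_addr)
{
    if (r->top == 0)
        return false; /* nothing to predict from */
    return r->entries[--r->top] == actual_ret_addr;
}
```

A PUSH/JMP pair never calls `ras_on_call`, so the matching RET pops an entry that belongs to an outer caller. That RET mispredicts, and so does the outer caller's own RET afterwards, because its entry has already been consumed.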
It depends on the situation. If the last thing your function does before returning is call another function, you can simply jump to that function. This is called tail call elimination, and is an optimization performed by many compilers. Example:
foo:
call B
call A
ret
Tail call elimination replaces the last two lines with a single jump instruction:
foo:
call B
jmp A
This works because the stack contains the return address of foo's caller. So when function A returns, it returns back to the function that called foo.
If you want execution to resume after the jump to A, push that address onto the stack before jumping:
foo:
call B
push offset bar
jmp A
bar:
However, I can think of no reason why anybody would want to do this.
Before x86-64, call was the only instruction that could read EIP. (I guess int as well, but it doesn't put the result anywhere you can read from user-space).
So it's impossible to simulate call in position-independent code. In fact, 32-bit PIC code uses call to find out its own address.
But in x86-64, we have RIP-relative lea
... put function args in registers
lea rax, [rel ret_addr] ; AT&T lea ret_addr(%rip), %rax
push rax
jmp call_target
ret_addr:
call itself internally decodes as push RIP / jmp target, where RIP during execution of an instruction = address of the end of that instruction = start of the next.
Of course this is normally terrible for performance, unbalancing the return-address predictor stack. http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/. Use a normal call unless you want a ret to mispredict, e.g. for a retpoline or specpoline.
(A tailcall with just jmp is fine, collapsing a call/ret pair into a jmp, but pushing a new return address manually is always a problem.)

How to know how many arguments takes a function?

I have this function:
BOOL WINAPI MyFunction(WORD a, WORD b, WORD *c, WORD *d)
When disassembling, I'm getting something like this:
PUSH EBP
MOV EBP, ESP
SUB ESP, C
...
LEAVE
RETN C
As far as I know, the SUB ESP, C means that the function takes 12 bytes for all its arguments, right? Each argument is 4 bytes, and there are 4 arguments, so shouldn't this function be disassembled as SUB ESP, 10?
Also, if I don't know about the C header of the function, how can I know the size of each parameter (not the size of all the parameters)?
No, the SUB instruction only tells you that the function needs 12 bytes for its local variables. Inferring the arguments requires looking at the code that calls this function. You'll see it setting up the stack before the CALL instruction.
In the specific case of a WINAPI function (aka __stdcall), the RET instruction gives you information, since that calling convention requires the function to clean up the stack before it returns. So a RET 0x0C tells you that the arguments required 12 bytes; that this matches the stack frame size is just an accident. That usually means it takes 3 arguments, though it depends on the argument types. A WORD-sized argument gets promoted to a 32-bit value, so the signature you theorized is not a match.
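The mismatch can be checked with a little arithmetic. The sketch below uses a hypothetical helper, `stack_arg_bytes`, and assumes the 32-bit convention described above, where every argument is widened to a multiple of a 4-byte stack slot.

```c
#include <assert.h>
#include <stddef.h>

/* Total stack bytes used by the arguments when each one is rounded up
 * to a whole 4-byte slot (assumption: 32-bit stack-based convention). */
size_t stack_arg_bytes(const size_t *arg_sizes, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        assert(arg_sizes[i] > 0);
        total += (arg_sizes[i] + 3) / 4 * 4; /* round up to a slot */
    }
    return total;
}
```

For the theorized signature (two WORDs and two pointers, sizes 2, 2, 4, 4) this gives 16 bytes, which would show up as RETN 10. Three 4-byte arguments give the 12 bytes seen in RETN C.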
If the calling convention uses the stack (as it seems) to pass parameters, you can figure out how many parameters there are and what size they have.
For "how many", you can look at the operand of the RET instruction, if any (stdcall convention). This will give you how many bytes the parameters are using. Of course, this data alone is not of much use.
You have to read the function code and search for memory references like [EBP+n], where n is a positive offset from the value of EBP. Positive offsets address parameters, and negative offsets address local variables (created with the SUB ESP,x instruction).
Hopefully, you will be able to spot all distinct parameters. If the function has been compiled with optimizations, this may be hard to figure out.
For size and type, more reverse engineering is needed. Look at the instructions that use addressed parameters. If you find something like dword ptr [ebp+n] then that parameter is 32 bits long. word ptr [ebp+n] tells you that the parameter is 16 bits long, and byte ptr [ebp+n] means a byte-sized parameter.
For byte and word sized parameters, the most plausible options are char/unsigned char and short/unsigned short.
For double word sized parameters, the type may be int/unsigned int/long/unsigned long, but it may be a pointer as well. To differentiate a pointer from a plain integer, you will have to look further, to see if the dword read from the parameter is itself being used as a memory address to access memory (i.e. it's being dereferenced).
To tell the signedness of a parameter, you have to search for a code fragment in which that parameter is compared against some other value, and then a conditional jump is issued. The particular condition used in the jump will tell you if the comparison was performed taking the sign into account or not. For example, a comparison followed by a JA / JB / JAE / JBE conditional jump indicates an unsigned comparison and hence an unsigned parameter. Conditional jumps such as JG / JL / JGE / JLE indicate that a signed parameter is involved in the comparison.
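The signedness rule can be written down as a small lookup. This is a minimal sketch: `classify_jcc` and `enum cmp_kind` are hypothetical names, and only the above/below and greater/less families (with their negated aliases) are covered. Jumps such as JE or JNE say nothing about signedness and are reported as unknown.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum cmp_kind { CMP_UNSIGNED, CMP_SIGNED, CMP_UNKNOWN };

/* Case-insensitive string equality, to accept "JA" as well as "ja". */
static int equals_ignore_case(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

/* Classify a conditional jump mnemonic by the comparison it implies. */
enum cmp_kind classify_jcc(const char *mnemonic)
{
    static const char *const unsigned_jumps[] = {
        "ja", "jae", "jb", "jbe", "jna", "jnae", "jnb", "jnbe"
    };
    static const char *const signed_jumps[] = {
        "jg", "jge", "jl", "jle", "jng", "jnge", "jnl", "jnle"
    };
    assert(mnemonic != NULL);
    for (size_t i = 0; i < sizeof unsigned_jumps / sizeof *unsigned_jumps; i++)
        if (equals_ignore_case(mnemonic, unsigned_jumps[i]))
            return CMP_UNSIGNED;
    for (size_t i = 0; i < sizeof signed_jumps / sizeof *signed_jumps; i++)
        if (equals_ignore_case(mnemonic, signed_jumps[i]))
            return CMP_SIGNED;
    return CMP_UNKNOWN;
}
```

A single jump is only a hint. Compilers may also use unsigned jumps for range checks on signed values, so it is worth looking at more than one use of the parameter.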
That depends on your ABI.
In your case, it seems you're using Windows x86 (32 bit), which allows several C calling conventions. Some pass parameters in registers, others on the stack.
If the parameters are passed on the stack, they will be above the frame pointer, so subtracting from the stack pointer is used to make space for local variables, not to read the function parameters.

x86 assembly functions

I have a function that is called by main. Assume that function's name is funct1. funct1 calls another function named read_input.
Now assume that funct1 starts as follows:
push %rbp
push %rbx
sub $0x28, %rsp
mov %rsp, %rsi
callq 4014f0 read_input
cmpl $0x0, (%rsp)
jne (some terminating function)
So just a few questions:
In this case, does read_input only have one argument, which is %rbx?
Furthermore, if the stack pointer is being decreased by 0x28, this means a string of size 0x28 is getting pushed onto the stack? (I know it's a string).
And what is the significance of mov %rsp, %rsi before calling a function?
And lastly, when read_input returns, where is the return value put?
Thank you and sorry for the questions but I am just starting to learn x86!
It looks like your code is using the Linux x86-64 (System V AMD64) ABI. I'll answer your questions in that context.
No, rbx is a callee-saved (nonvolatile) register. Your function is saving it so that it doesn't disturb the caller's value. It's not being restored in the code you've shown, but that's because you haven't shown the whole function. If there's more to this function, and I think there is, it's because rbx is being used somewhere later on in this routine.
Yes, space for 0x28 bytes of data is being made on the stack. Assuming read_input is taking a string as a parameter, your description is reasonable. It's not necessarily accurate, however. Some of that data might be used for other local variables aside from just the buffer being allocated to pass to read_input.
This instruction is putting a pointer to the newly allocated stack buffer into rsi. rsi is the second parameter register for the AMD x64 calling convention. That means you're going to be calling read_input with whatever the first parameter passed to this function is, along with a pointer to your new stack buffer.
In rax, if it's a 64-bit value or smaller, in rax & rdx if it's larger. Or if it's floating point, in xmm0, ymm0, or st(0). You probably should look at a description of your calling convention to get a handle on this stuff - there's a great PDF file at this link. Check out Table 4.
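To tie the answers together, here is one plausible C shape for the code in the question. It is purely a guess for illustration: the real signature of read_input is not known, the `read_input` below is a hypothetical stand-in that parses integers, and the six-element buffer is an assumption chosen only to fit inside the 0x28 bytes reserved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for read_input: parses up to six integers from
 * `line` into `out` and returns sscanf's result. Not the real function. */
int read_input(const char *line, int *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%d %d %d %d %d %d",
                  &out[0], &out[1], &out[2], &out[3], &out[4], &out[5]);
}

/* Hypothetical reconstruction of funct1. Returns 0 where the original
 * falls through and -1 where it jumps to the terminating function. */
int funct1(const char *line)
{
    /* Lives in the space made by sub $0x28, %rsp. Zeroed here only so
     * the sketch has defined behavior; the assembly shows no such step. */
    int values[6] = {0};

    /* First argument stays in %rdi; mov %rsp, %rsi passes the buffer. */
    read_input(line, values);

    /* cmpl $0x0, (%rsp) ; jne ... */
    if (values[0] != 0)
        return -1;
    return 0;
}
```

In this sketch the result of read_input would come back in eax (the low half of rax) and is simply ignored, which is consistent with the code shown not inspecting rax before the cmpl.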