Can I identify a "function" in an x86 binary? - language-agnostic

"Function" meaning a chunk (or a graph of chunks) of the binary that starts at a point (likely arriving from one of the CALL instructions), possibly sets up a stack frame, and has one or more endpoints in the form of RETs (and depending on the calling convention it may also unwind said stack frame).
My current idea is to treat the various conditional branching instructions as junctions in a graph and do a Breadth-first search on the code this way. Is this viable at all? If not, what's a better approach?
My objective with this is just what it is: extract the functions. Purely for the sake of doing it. Maybe doing something fancy later if I have the time and notion.

You can use a disassembler library like BeaEngine to do the hard work for you and then search on resulting mnemonics for call.

Without a symbol table I would say: almost impossible. At least without false positives/negatives.
What you need first is a disassembler. Just looking for a byte combination won't cut it, the combination might be part of some "random" data. Then, tracing the CALLs is likely the best solution as a function doesn't necessarily always start with the same opcode sequence. But even a disassembler might have a hard time and get confused by embedded data in the text segment.
Even if you were able to find the functions, you cannot get their names without debug symbols (in the compiled program there's no need for names any more, only addresses).
Also, you'd have a very hard time finding out what kind of parameters the function accepts. For example, a function might accept 2 arguments but use neither. In this case you would need to find a call to the function and look at how the stack is prepared before the call.
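To make the "trace the CALLs" idea concrete, here is a minimal sketch in Python. It assumes a disassembler has already produced (mnemonic, target) pairs; the toy PROGRAM table, its addresses, and the one-unit instruction length are all made up for illustration. It walks each function breadth-first, treating conditional jumps as forks and every CALL target as a new function candidate. Indirect calls, tail-call jumps, and data embedded in the code are not handled, which is exactly where the false positives/negatives come from.

```python
from collections import deque

# Toy program: address -> (mnemonic, target or None). Hypothetical data;
# every instruction is pretended to be one unit long.
PROGRAM = {
    0x00: ("call", 0x10),
    0x01: ("call", 0x20),
    0x02: ("ret", None),
    0x10: ("cmp", None),
    0x11: ("jz", 0x13),
    0x12: ("call", 0x20),
    0x13: ("ret", None),
    0x20: ("nop", None),
    0x21: ("ret", None),
}

def discover_functions(program, entry):
    """Return {function_start: set of instruction addresses reached from it}."""
    functions = {}
    pending = deque([entry])          # function entry points still to explore
    while pending:
        start = pending.popleft()
        if start in functions:
            continue
        body = set()
        work = deque([start])         # breadth-first walk inside one function
        while work:
            addr = work.popleft()
            if addr in body or addr not in program:
                continue
            body.add(addr)
            mnemonic, target = program[addr]
            if mnemonic == "ret":
                continue                  # this path ends here
            if mnemonic == "call":
                pending.append(target)    # callee is a new function candidate
                work.append(addr + 1)     # execution resumes after the call
            elif mnemonic == "jmp":
                work.append(target)       # unconditional: only the target
            elif mnemonic.startswith("j"):
                work.append(target)       # conditional: branch taken
                work.append(addr + 1)     # conditional: fall-through
            else:
                work.append(addr + 1)
        functions[start] = body
    return functions

funcs = discover_functions(PROGRAM, 0x00)
```

Here funcs ends up with three entries (0x00, 0x10 and 0x20), each mapped to the addresses reachable from that entry without crossing a RET.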

You have to look for things like:
push ebp
mov ebp, esp
sub esp, ???
...
...
add esp, ???
pop ebp
ret
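As a rough illustration of looking for that pattern, the sketch below scans raw bytes for the two common 32-bit encodings of "push ebp; mov ebp, esp". The blob is a hypothetical hand-assembled example, and the results are only candidates: the same bytes can appear inside data or in the middle of another instruction, and code built without a frame pointer has no such prologue at all.

```python
# Two common encodings of "push ebp; mov ebp, esp" in 32-bit x86.
PROLOGUES = (b"\x55\x89\xe5", b"\x55\x8b\xec")

def find_prologue_candidates(code):
    """Return sorted byte offsets where a frame-setup prologue may start."""
    hits = []
    for pattern in PROLOGUES:
        pos = code.find(pattern)
        while pos != -1:
            hits.append(pos)
            pos = code.find(pattern, pos + 1)
    return sorted(hits)

# Hypothetical blob: two tiny functions, back to back.
blob = (
    b"\x55\x89\xe5"   # push ebp; mov ebp, esp
    b"\x83\xec\x10"   # sub esp, 16
    b"\xc9\xc3"       # leave; ret
    b"\x55\x8b\xec"   # push ebp; mov ebp, esp (alternate encoding)
    b"\x5d\xc3"       # pop ebp; ret
)
candidates = find_prologue_candidates(blob)
```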


Using "saved" registers in the main function at RISC-V Assembly

Suppose the simple following main function written in RISC-V Assembly:
.globl main
main:
addi s3,zero,10 #Should this register (s3) be saved before using?
Since s3 is a "saved register", the procedure calling conventions should be followed and thus, this register should be pushed to the stack before using it. However, by looking at the source file, no other procedure has used this register and saving the register to the stack seems redundant.
My question is, should these types of registers be saved every time before every usage even if it means writing more (redundant) code just to obey the calling conventions? Can these conventions sometimes be ignored to improve performance?
In the example above, should the register be saved because it is unknown if the main's caller has been using the s3 register?
Yes, main is a function that has a real caller you return to, and that caller might be using s3 for something.
Unless your main never returns, either being an infinite loop or only exiting by calling exit or a system call. If you never return, you don't need to be able to restore the caller's state, or even find your way back (via a return address).
So if it's just as convenient to call exit instead of ever returning from main, doing that allows you to avoid saving anything.
This also applies in cases where there's nothing for main to return to, of course, so returning wasn't even an option. e.g. if it's the entry point in a kernel or other freestanding code.
Also, I hope you understand that saved every time before every usage means once per function that uses them, not separately around each separate block. And not saving call-clobbered registers around each function call; just let them die.
Can these conventions sometimes be ignored to improve performance?
Yes, if you keep the details invisible to any code you don't control.
If you treat small private helper functions as actually part of one big function, then they can use a "private" custom calling convention. (Even if you do actually call / return instead of just jumping to them, if you want to avoid inlining them at multiple callsites)
Sometimes this is just taking advantage of extra guarantees when you know about the function you're calling. e.g. that it doesn't actually clobber some of its input arg registers. This can be useful in recursion when you're calling yourself: foo(int *p, int a) self calls might take advantage of p still being in the same register unmodified, instead of having to keep p somewhere else for use after the call returns like it would if calling an "unknown" function where you can't assume anything the calling convention doesn't guarantee.
Or if you have a publicly-visible wrapper in front of your actual private recursive function, you can set up some constants, or even have the recursive function treat one register as a static variable, instead of passing around pointers to some shared state in memory. (That's no longer pure recursion, just a loop that uses the asm stack to keep track of some history that happens to include a jump address.)

How many arguments are passed in a function call?

I wish to analyze assembly code that calls functions, and for each 'call' find out how many arguments are passed to the function. I assume that the target functions are not accessible to me, but only the calling code.
I limit myself to code that was compiled with GCC only, and to System V ABI calling convention.
I tried scanning back from each 'call' instruction, but I failed to find a good enough convention (e.g., where to stop scanning? what happen on two subsequent calls with the same arguments?). Assistance is highly appreciated.
Reposting my comments as an answer.
You can't reliably tell in optimized code. And even doing a good job most of the time probably requires human-level AI. e.g. did a function leave a value in RSI because it's a second argument, or was it just using RSI as a scratch register while computing a value for RDI (the first argument)? As Ross says, gcc-generated code for stack-args calling conventions has more obvious patterns, but still nothing easy to detect.
It's also potentially hard to tell the difference between stores that spill locals to the stack vs. stores that store args to the stack (since gcc can and does use mov stores for stack-args sometimes: see -maccumulate-outgoing-args). One way to tell the difference is that locals will be reloaded later, but args are always assumed to be clobbered.
what happen on two subsequent calls with the same arguments?
Compilers always re-write args before making another call, because they assume that functions clobber their args (even on the stack). The ABI says that functions "own" their args. Compilers do generate functions that overwrite their own incoming args (see comments), but compiler-generated code isn't always willing to re-purpose the stack memory holding its args for storing completely different args in order to enable tail-call optimization. :( This is hand-wavy because I don't remember exactly what I've seen as far as missed tail-call optimization opportunities.
Yet if arguments are passed by the stack, then it shall probably be the easier case (and I conclude that all 6 registers are used as well).
Even that isn't reliable. The System V x86-64 ABI is not simple.
int foo(int, big_struct, int) would pass the two integer args in regs, but pass the big struct by value on the stack. FP args are also a major complication. You can't conclude that seeing stuff on the stack means that all 6 integer arg-passing slots are used.
The Windows x64 ABI is significantly different: For example, if the 2nd arg (after adding a hidden return-value pointer if needed) is integer/pointer, it always goes in RDX, regardless of whether the first arg went in RCX, XMM0, or on the stack. It also requires the caller to leave "shadow space".
So you might be able to come up with some heuristics that will work OK for un-optimized code. Even that will be hard to get right.
For optimized code generated by different compilers, I think it would be more work to implement anything even close to useful than you'd ever save by having it.
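For what it's worth, the kind of heuristic being discussed might look like the Python sketch below. It is deliberately naive and only an illustration of the idea and its limits: the (mnemonic, destination register) tuples are a made-up stand-in for disassembler output, only full 64-bit register names are recognised, and stack, floating-point and struct args are ignored. It scans back from a call to the previous call, and counts how many of the System V integer argument registers were written, in order.

```python
# System V x86-64 integer argument registers, in passing order.
ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

def guess_int_arg_count(block, call_index):
    """Guess how many integer-register args the call at block[call_index] takes.

    `block` is a list of (mnemonic, destination_register) tuples, a
    simplified and hypothetical stand-in for real disassembler output.
    """
    written = set()
    for mnemonic, dest in reversed(block[:call_index]):
        if mnemonic == "call":
            break  # the earlier call may have clobbered the arg registers
        if dest in ARG_REGS:
            written.add(dest)
    count = 0
    for reg in ARG_REGS:
        if reg not in written:
            break  # args fill registers in order, so stop at the first gap
        count += 1
    return count

block = [
    ("mov", "rdi"),
    ("call", None),
    ("mov", "rsi"),   # second argument, or just a scratch register?
    ("lea", "rdi"),
    ("call", None),
]
first_guess = guess_int_arg_count(block, 1)
second_guess = guess_int_arg_count(block, 4)
```

The second guess comes out as 2, but nothing in the instruction stream says whether RSI was really an argument or only a temporary used while computing RDI, which is the ambiguity described above.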

Why should you keep ESP in EBP inside a call?

I'm reading in Professional Assembly Language by Richard Blum that when you enter a call you should copy the value of the ESP register to EBP, and he also provided the following template:
function_label:
pushl %ebp
movl %esp, %ebp
< normal function code goes here>
movl %ebp, %esp
popl %ebp
ret
I don't understand why this is necessary. When you push something inside the function, you obviously intend to pop it back, thus restoring ESP to its original value.
So why have this template?
And what's the use of the EBP register anyway?
I'm obviously missing something, but what is it?
When you push something inside the function, you obviously intend to pop it back
That's just part of the reason for using the stack. The far more common usage is the one that's missing from your snippet: storing local variables. The next common code you see after setting up EBP is a subtraction on ESP, equal to the amount of space required for local variable storage. That's of course easy to balance as well, just add the same amount back in the function epilogue. It gets more difficult when the code is also using things like C99 variable length arrays or the non-standard but commonly available _alloca() function. Being able to restore ESP from EBP makes this simple.
More to the point perhaps, it is not necessary to set up the stack frame like this. Almost any x86 compiler supports an optimization called "frame pointer omission", turned on with GCC's -fomit-frame-pointer or /Oy on MSVC. It makes the EBP register available for general use, which can be very helpful on x86 with its dearth of CPU registers.
That optimization has a very grave disadvantage though. Without the EBP register pointing at the start of a stack frame, it gets very difficult to perform stack walks. That matters when you need to debug your code. A stack trace can be very important to find out how your code ended up crashing. Invaluable when you get a "core dump" of a crash from your customer. So valuable that Microsoft agreed to turn off the optimization on Windows binaries to give their customers a shot at diagnosing crashes.
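To show why the EBP chain makes stack walks easy, here is a small Python simulation. With the standard prologue, [ebp] holds the caller's saved EBP and [ebp + 4] holds the return address, so a debugger can hop from frame to frame. The memory contents and addresses are invented, and the convention that the chain ends with a saved EBP of 0 is an assumption of this sketch, not something every runtime guarantees.

```python
def walk_stack(memory, ebp, max_frames=64):
    """Follow the saved-EBP chain and return the return addresses found.

    `memory` maps 32-bit addresses to 32-bit values. [ebp] is the caller's
    saved EBP and [ebp + 4] is the return address into the caller.
    """
    return_addresses = []
    while ebp != 0 and len(return_addresses) < max_frames:
        return_addresses.append(memory[ebp + 4])
        ebp = memory[ebp]             # hop to the caller's frame
    return return_addresses

# Hypothetical stack for main -> f -> g, stopped inside g (EBP = 0x1000).
memory = {
    0x1000: 0x1020, 0x1004: 0x401234,  # g's frame: f's EBP, return into f
    0x1020: 0x1040, 0x1024: 0x401100,  # f's frame: main's EBP, return into main
    0x1040: 0x0000, 0x1044: 0x401000,  # main's frame: end of chain (assumed)
}
trace = walk_stack(memory, 0x1000)
```

If any function in the chain used EBP as a general-purpose register instead, the value read at [ebp] would be garbage and the walk would break at that frame.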

Stack(s), Registers in ActionScript ByteCode AVM2, which all are there?

From the AVM2 Overview PDF I encountered references to two types of stacks - Scope Stack and Operand Stack.
1) I assume these are two different memory stacks, each handling different things. Are there even more stacks?
2) pushstring "hello" - this would push a start of memory address where "hello" string is located onto Operand Stack. Right?
3) setlocal 0 - this would store a value from the stack (above) into register0 by popping it off. Right?
4) PushScope() - hmm, docs say pop value of stack, push value onto Scope Stack. Why?
I know a little bit of NASM but ABC seems more complex than that. Especially I'm confused about Scope Stack and the whole concept of multiple stacks.
I am no AVM2 expert, but here's what I know:
There are only 2 stacks, the two you mention: scope and operand.
Yes, pushstring "hello" will push the string onto the operand stack.
Also, correct. setlocal0 will pop "hello" off the stack and store it in reg 0.
The scope stack is used by all operations that require a name lookup for scope, for instance closures and exceptions. Often in ASM code you'll see getlocal_0 immediately followed by a pushscope. This is pretty common. You can kind of think of it as adding the "this" object to the scope stack for future reference in method calls, scope for closures, etc.
I highly recommend downloading the Tamarin source and playing with the decompiler there. Also, Yogda looks to be pretty handy for learning: http://www.yogda.com/
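If it helps to picture the two stacks, here is a toy Python model of the instructions discussed above. It is not real AVM2: the ToyAvm class and its methods are hypothetical and only mimic how values move between the operand stack, the scope stack and the local registers.

```python
class ToyAvm:
    """Toy model of the operand stack, scope stack and local registers."""

    def __init__(self, local_count=4):
        self.operand_stack = []
        self.scope_stack = []
        self.locals = [None] * local_count

    def pushstring(self, value):
        self.operand_stack.append(value)

    def getlocal(self, index):
        self.operand_stack.append(self.locals[index])

    def setlocal(self, index):
        # Pops the top of the operand stack into a register.
        self.locals[index] = self.operand_stack.pop()

    def pushscope(self):
        # Pops from the operand stack and pushes onto the scope stack.
        self.scope_stack.append(self.operand_stack.pop())

vm = ToyAvm()
vm.locals[0] = "this-object"   # register 0 holds "this" on method entry
vm.getlocal(0)
vm.pushscope()                 # "this" moves from operand stack to scope stack
vm.pushstring("hello")
vm.setlocal(1)                 # pops "hello" into register 1
```

After this sequence the operand stack is empty, the scope stack holds the "this" object, and register 1 holds "hello".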

Why is memoization not a language feature?

I was wondering: why is memoization not provided natively as a language feature in any language I know about?
Edit: to clarify, what I mean is that the language provides a keyword to specify a given function as memoizable, not that every function is automatically memoized "by default" unless specified otherwise. For example, Fortran provides the keyword PURE to specify a specific function as such. I guess that the compiler can take advantage of this information to memoize the call, but I don't know what happens if you declare a function with side effects as PURE.
What YOU want from memoization may not be the same as what the compiler memoization option would provide.
You may know that it is only profitable to memoize the last 10 or so distinct values computed, because you know how the function will be used.
You may know that it only makes sense to memoize the last 2 or 3 values, because you will never use values older than that. (Fibonacci's Sequence comes to mind.)
You may be generating a LOT of values on some runs, and just a few on others.
You may want to "throw away" some of the memoized values and start over. (I memoized a random number generator this way, so I could replay the sequence of random numbers that built a certain structure, while some other parameters of the structure had been changed.)
Memoization as an optimization depends on the search for the memoized value being a lot cheaper than recomputation of the value. This in turn depends on the ordering of the input requests. This has implications for the memoization database: Does it use a stack, an array of all possible input values (which may be very large), a bucket hash, or a b-tree?
The memoizing compiler has to either provide a "one size fits all" memoization, or it has to provide lots of possible alternatives, and parameters to control the alternatives. At some point, it becomes easier for everyone to require the user to provide his own memoization.
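As an illustration of how much policy the user may want to control, here is a short Python sketch using the standard library's functools.lru_cache. The function slow_square and the call_log list are made up for the example. The cache keeps only the two most recently used results, and cache_clear() throws everything away to start over.

```python
from functools import lru_cache

call_log = []   # records every real (non-cached) computation

@lru_cache(maxsize=2)   # keep only the two most recently used results
def slow_square(n):
    call_log.append(n)
    return n * n

slow_square(1)              # miss
slow_square(2)              # miss
slow_square(1)              # hit, served from the cache
slow_square(3)              # miss, evicts 2 (least recently used)
slow_square(2)              # miss again, 2 has to be recomputed
slow_square.cache_clear()   # discard all memoized values
slow_square(3)              # miss, because the cache was cleared
```

Whether 2 entries, 10 entries, or an unbounded table is the right size depends entirely on the access pattern, which is the point made above.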
Because compilers have to emit semantically correct programs. You can't memoize a function without changing program semantics unless it is referentially transparent. In most programming languages not all functions are referentially transparent (pure functional programming languages are an exception) so you can't memoize everything. But then a mechanism is needed for detecting referential transparency and that is too hard.
In Haskell, memoization is automatic for (pure) functions you've defined that take no arguments. And the Fibonacci example in that Wiki is really about the simplest demonstrable example I would be able to think of either.
Haskell can do this because your pure functions are defined to produce the same results every time; of course, monadic functions that depend on side effects won't be memoized.
I'm not sure what the upper limits are -- obviously, it won't memoize more than the available memory. And I'm also not sure offhand if the memoization occurs at compile-time (if the values can be determined at compile-time), or if it always occurs the first time the function is called.
Clojure has a memoize function (http://richhickey.github.com/clojure/clojure.core-api.html#clojure.core/memoize):
memoize
function
Usage: (memoize f)
Returns a memoized version of a referentially transparent function. The
memoized version of the function keeps a cache of the mapping from arguments
to results and, when calls with the same arguments are repeated often, has
higher performance at the expense of higher memory use.
A) Memoization trades space for time. I imagine that this can turn out to be a fairly unbounded property, in the sense that the amount of data programs or libraries would have to store could consume large parts of memory really quickly.
For a couple of languages, memoization is easy to implement and easy to customize for the given requirements.
As an example take some natural language processing on large bodies of text, where you don't want to compute basic properties of texts (word count, frequency, cooccurrences, ...) over and over again. In that case a memoization in combination with object serialization can be useful as opposed to memory caching, since you may run your application multiple times on unchanged corpora.
B) Another aspect: it's not true that all functions or methods yield the same output for the same input. In any case, some keyword or syntax for memoization would be necessary, along with configuration (memory limits, invalidation policy, ...).
Because you shouldn't implement something as a language feature when it can easily be implemented in the language itself. A memoization feature belongs in a library, which is exactly where most languages put it.
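For example, in a language with first-class functions a basic version takes only a few lines. The Python sketch below is one possible way to write it, not a canonical implementation: the memoize decorator and the exposed cache attribute are names chosen for this example, and it only handles hashable positional arguments of a pure function.

```python
import functools

def memoize(func):
    """Cache results by positional arguments; only safe for pure functions."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)   # compute once, reuse afterwards
        return cache[args]

    wrapper.cache = cache   # exposed so callers can inspect or clear it
    return wrapper

@memoize
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

result = fib(30)
```

Because the recursive calls go through the decorated name, each fib(k) is computed only once, so fib(30) needs 31 cache entries instead of over a million calls.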
Your question also leaves open the solution of your learning more languages. I think that Lisp supports memoization, and I know that Mathematica does.
In order for memoization to work as a language feature there would be a couple requirements.
The compiler would need to identify valid functions for memoization (e.g. they are referentially transparent).
The run-time would have to be able to intelligently select candidates for memoization without slowing down the overall performance.
There are some assumptions in the other language, but if we can have performance gains by just-in-time compilation of hot-spots in a Java VM, then one can surely write an automated memoization system.
While non-trivial I think this is all theoretically possible to get performance gains in a language (especially an interpreted one) and is a worthwhile area for research.
Not all the languages natively support function decorators. I guess it would be a more general approach to support rather than supporting just memoization.
Reverse the question: why should it? As someone has said, it can be put in a library, so there is no need to add syntax to the language. It's only usable on pure functions, which are hard to identify automatically (unless you force the programmer to annotate them). It's also very hard to determine whether memoization is going to speed things up or not. I don't think it's a desirable feature for a programming language.
I really think such an option should exist.
In data processing tasks there is often immutable input data (time series, for example, where as soon as the value for a given time is known, it can never change). Given how affordable RAM is today, if a function's result depends only on such immutable data, it is rational to memoize it rather than reread it every time it's needed. Currently I have to (in Scala and C#) manually introduce an in-memory storage table and write 3 functions instead of one: one reading a value from file/db/ws, one storing it into an in-memory table, and one wrapping them to read from memory if available or call the raw function if not. I think this could and should be implemented as a keyword and done behind the scenes.