IDA Not Recognizing Instructions as Code? - function

I made a new segment in the ELF (ARM) game library I am working on (Page Aligned, .text segment (Pure Code)). The entirety of the segment includes only NOPs (0000A0E1). However, IDA recognizes the NOPs as:
DCD 0xE1A00000
If I press C to make it into code, it shows one NOP per line I press it on. When I try to create a function, it tells me that the function has undefined instruction/data at the specified address, and won't let me do it. In addition, each line of the DCD 0xE1A00000s is referenced in the .plt segment because they're referencing an offset out of the segment. This happens even after I turn it into code.
Is there a way to declare the entire segment as code, and/or put a segment property that does this, possibly fixing the function problem?

Related

How do i know when a function body ends in assembly

I want to know when a function body end in assemby, for example in c you have this brakets {} that tell you when the function body start and when it ends but how do i know this in assembly?
Is there a parser that can extract me all the functions from assembly and start line and endline of their body?
There's no foolproof way, and there might not even be a well-defined correct answer in hand-written asm.
Usually (e.g. in compiler-generated code) you know a function ends when you see the next global symbol, like objdump does to decide when to print a new "banner". But without all function-start symbols being visible, there's no unambigious way. That's why some object file formats have room for size metadata associated with a symbol. Like .size foo, . - foo in GAS syntax.
It's not as easy as looking for a ret; some functions end with a jmp tail-call to another function. And some call a noreturn function like abort or __stack_chk_fail (not tailcall because they want to push a return address for a backtrace.) Or just fall off into whatever's next because that path had undefined behaviour in the source so the compiler assumed it wasn't reachable and stopped generating instructions for it, e.g. a C++ non-void function where execution can/does fall off the end without a return.
In general, assembly can blur the lines of what a function is.
Asm has features you can use to implement the high-level concept of a function, but you're not restricted to that.
e.g. multiple asm functions could all return by jumping to a common block of code that pops some registers before a ret. Is that shared tail a separate function that's called with a tail-called with a special calling convention?
Compilers don't usually do that, but humans could.
As for function entry points, usually some other code somewhere in the program will contain a call to it. But not necessarily; it might only be reachable via a table of function pointers, and you don't know that a block of .rodata holds function pointers until you find some code loading from it and calling or jumping.
But that doesn't work if the lowest-address instruction of the function isn't its entry point. See Does a function with instructions before the entry-point label cause problems for anything (linking)? for an example
Compilers don't generate code like that, but humans can. (It's a handy trick sometimes for https://codegolf.stackexchange.com/ questions.)
Or in the general case, a function might have multiple entry points. Or you could describe that as multiple functions with overlapping implementations. Sometimes it's as simple as one tailcalling another by falling into it without needing a jmp, i.e. it starts a few instructions before another.
I wan't to know when a function body ends in assembly, [...]
There are mainly four ways that the execution of a stream of (userspace) instructions can "end":
An unconditional jump like jmp or a conditional one like Jcc (je,jnz,jg ...)
A ret instruction (meaning the end of a subroutine) which probably comes closest to the intent of your question (including the ExitProcess "ret" command)
The call of another "function"
An exception. Not a C style exception, but rather an exception like "Invalid instruction" or "Division by 0" which terminates the user space program
[...] for example in c you have this brakets {} that tell you when the function body start and when it ends but how do i know this in assembly?
Simple answer: you don't. On the machine level every address can (theoretically) be an entry point to a "function". So there is no unique entry point to a "function" other than defined - and you can define anything.
On a tangent, this relates to self-modifying code and viruses, but it must not. The exit/end is as described in the first part above.
Is there a parser that can extract me all the functions from assembly and
start line and endline of their body?
Disassemblers create some kind of "functions" with entry and exit points. But they are merely assumed. No way to know if that assumption is correct. This may cause problems.
The usual approach is using a disassembler and the work to recombinate the stream of instructions to different "functions" remains to the person that mandated this task (vulgo: you). Some tools exist that claim to simplify this, but I cannot judge their efficacy.
From the perspective of a high level language, there are decompilers that try to reverse the transformation from (for example) C to assembly/machine code that try to automatize that task and will work more or less or in some cases.

IDA Pro jumping to offset from base

I use CheatEngine as a debugger (and get a lot of crap for it). When I find addresses, I always write them down based on the offset from where the start of the instructions are (e.g. program.exe+402C0). It would be nice to be able to use the goto function with this method of referencing a location; is there a way to do this?
According to IDA Pro's documentation:
If the entered [goto] string can not be recognized as a hexadecimal or location name, IDA will try to interpreet it as an expression using the current script interpreter. The default interpreter is IDC.
So what you can do is define a global variable in the IDC interpreter (using the bar at the bottom of your IDA view) that identifies the base address of your module as such:
extern ModuleBaseAddress;
ModuleBaseAddress = 0x400000; // Example base address
Then whenever you want to go to the base address + offset you would simply open the Jump window (using the g-key) and type in:
ModuleBaseAddress + 0x1000 // 0x1000 is your offset

How do I use a function from one lisp file to solve for something in another lisp file?

I'm new to lisp and my professor gave some .lisp files to play around with.
http://pastebin.com/eDPUmTa1 (search functions)
http://pastebin.com/xuxgeeaM (water jug problem saved as waterjug.lisp)
The problem is I don't know how to implement running functions from one file to solve problems from another. The most I've done is compiled functions from one file and played around with it in the terminal. I'm not sure how to load 2 files in this IDE as well as how I should run the function. I'm trying, for example, to run the breadth-first-search function to solve the problem to no avail.
I'm currently using emacs as the text editor SBCL as the common lisp implementation along with quicklisp and slime.
Assuming each file is in its own buffer, say f1.lisp and f2.lisp, then you only have to call slime-compile-and-load-file when you are in each buffer. This is bound by default to C-c C-k. You have to compile the first file first, because it contains definitions for the second one.
But, your second file (f2.lisp) has two problems: search for (break and (bread and remove those strings. Check if the forms around them have their parenthesis well balanced.
Take care of warning messages and errors while compiling your file.
Then, if you want to evaluate something directly from the buffer, put your cursor (the point) after the form you want to evaluate, and type C-x C-e (imagine the cursor is represented by % below):
(dump-5 (start-state *water-jug*))%
This will print the result in the minibuffer, in your case something like #<JUG-STATE {1004B61A63}>, which represents an instance of the JUG-STATE class. Keep a window open to the REPL buffer in case the functions write something to standard output (this is the case with the (describe ...) expression below).
If instead you do C-c I, this will ask you which expression you want to inspect, already filled with the form before the point. When you press enter, the inspector buffer will show up:
#<JUG-STATE {1004BD8F53}>
--------------------
Class: #<STANDARD-CLASS COMMON-LISP-USER::JUG-STATE>
--------------------
Group slots by inheritance [ ]
Sort slots alphabetically [X]
All Slots:
[ ] FIVE = 0
[ ] TWO = 2
[set value] [make unbound]
Read http://www.cliki.net/slime-howto.

Why does '.CODE' have to be inserted before 'main PROC' only?

I am learning assembly language and had a question on syntax of procedure.
(I am using VS 2012).
For main procedure, if the line 'main PROC' was inserted before '.data', the error would occur.
.data
.code
main Proc
;some code
main ENDP
local1 Proc
.data
.code
ret
.local1 ENDP
END main
But for other local procedure under the main, it works fine with '.data' after the declaration of the procedure.
Could someone explain to me why?
p.s. Also is assembly lanaugage unpopular? I have learned little java and c++ and compare to them, there are much less discussion and searches on google.
Imagine assembler as a machine which reads input source and writes (emits) to two output channels: one for executable code and another one for data.
Segment-switch directives .data and .code tell assembler where it should write the emitted information.
Using directive .data you actually command your assembler: stop emitting to whichever output segment was current at the moment, switch to the segment .data and continue writing to this segment at the next free space (origin), where you left it the last time it was active.
Alternation between .code and .data segment in source text is good for readability of the program, it allows to keep code procedures close to the global data it operates with. On the other hand, when the compiled program is loaded in memory, all procedures have to be linked together in one code segment and all data are kept together in data segment. Operating system usually does not allow to execute any instruction from segments marked as "data", and to write any data to segment marked as "code". That is why it is programmer's responsibility to switch to .code before any executable statement is emitted.

How does backpatching work with markers?

I've searched all over the Internet and could not find a proper explanation of how backpatching works?
Can you please explain me how does backpatching works? How does it work with the markers?
I know it has 2 main types of markers:
with next-quad in it
with next-list in it
I found this code, in which they are taking an input file and creating a file with RISKI language.
In their first roll they have:
PROGRAM : N FUNCTION M MAIN_FUNCTION
and you can see that N and M are markers (they are empty rolls).
One-pass code generation has a small problem with generating code for conditionals. A typical if statement:
if CONDITION then ALTERNATIVE_1 else ALTERNATIVE_2
needs to be compiled into something like this:
compute CONDITION
JUMP_IF_TRUE label1
JUMP_IF_FALSE label2
label1:
code for ALTERNATIVE_1
JUMP label3
label2:
code for ALTERNATIVE_2
JUMP label3
label3:
next statement
But when the code for CONDITION is being generated, it is not known where label1 and label2 are, and when the code for ALTERNATIVE_1 and ALTERNATIVE_2 is being generated, it is not known where label3 is.
One way to do this is to use symbolic names for the the labels, as in the above pseudocode, and fill in the actual values when they are known. That requires storing a symbolic name in the jump statement, which complicates the datastructures (in particular, you can't just use a binary assembler code). It also requires a second pass, just to fill in jump targets.
A (possibly) simpler way is to just remember the address of the jump statements, and patch in the target address when it is known. This is known as "backpatching", because you go back and patch the generated code.
It turns out that in many cases, you end up with multiple branches to the same label. A typical case is "short-circuit" booleans like the C-family's && and || operators. For example, extending the original example:
if (CONDITION_1 and CONDITION_2) or CONDITION_3 then ALTERNATIVE_1 else ALTERNATIVE_2
compute CONDITION_1
JUMP_IF_TRUE label1
JUMP_IF_FALSE label2
label1:
compute CONDITION_2
JUMP_IF_TRUE label3
JUMP_IF_FALSE label2
label2:
compute CONDITION_3
JUMP_IF_TRUE label3
JUMP_IF_FALSE label4
label3:
code for ALTERNATIVE_1
JUMP label5
label4:
code for ALTERNATIVE_2
JUMP label5
label5:
next statement
It turns out that for simple languages, it is only necessary to remember two incomplete jump statements (often called "true" and "false"). Because there might be multiple jumps to the same target, each of these is actually a linked list of incomplete jump statements, where the target address is used to point to the next jump in the list. So the backpatching walks back through the list, patching in the correct target and using the original target to find the previous statement which needs to be patched.
What you call markers (which are an instance of what yacc/bison refers to as "Mid-Rule Productions") are not really related to backpatching. They can be used for many purposes. In one-pass code-generation, it is often necessary to perform some action in the middle of a rule, and backpatching is just one example.
For example, in the hypothetical if statement, it would be necessary to initialize the backpatch lists before the CONDITION is parsed and then backpatch at the beginning of the THEN and ELSE clauses. (Another backpatch will be triggered at the end of the parse of the entire if statement, but that one will be in the rule's finnal action.)
The easiest way to perform actions in the middle of a rule is to insert a mid-rule action, which is equivalent to inserting an empty "marker" production with an action, as in the example bison file you point to.