using dis to disassemble opcodes - reverse-engineering

Basically I want to disassemble a bunch of opcodes using cpython.
I have the following opcodes:
data = b"\xdb\xda\xbb\x04\xd7\x03\xf9\xd9\x74\x24\xf4\x5f\x2b\xc9\xb1\x56\x31\x5f\x18\x03\x5f\x18\x83\xc7\x00\x35\xf6\x05\xe0\x3b\xf9\xf5\xf0\x5b\x73\x10\xc1\x5b\xe7\x50\x71\x6c\x63\x34\x7d\x07\x21\xad\xf6\x65\xee\xc2\xbf\xc0\xc8\xed\x40\x78\x28\x6f\xc2\x83\x7d\x4f\xfb\x4b\x70\x8e\x3c\xb1\x79\xc2\x95\xbd\x2c\xf3\x92\x88\xec\x78\xe8\x1d\x75\x9c\xb8\x1c\x54\x33\xb3\x46\x76\xb5\x10\xf3\x3f\xad\x75\x3e\x89\x46\x4d\xb4\x08\x8f\x9c\x35\xa6\xee\x11\xc4\xb6\x37\x95\x37\xcd\x41\xe6\xca\xd6\x95\x95\x10\x52\x0e\x3d\xd2\xc4\xea\xbc\x37\x92\x79\xb2\xfc\xd0\x26\xd6\x03\x34\x5d\xe2\x88\xbb\xb2\x63\xca\x9f\x16\x28\x88\xbe\x0f\x94\x7f\xbe\x50\x77\xdf\x1a\x1a\x95\x34\x17\x41\xf1\xf9\x1a\x7a\x01\x96\x2d\x09\x33\x39\x86\x85\x7f\xb2\x00\x51\xf6\xd4\xb2\x8d\xb0\xb5\x4c\x2e\xc0\x9c\x8a\x7a\x90\xb6\x3b\x03\x7b\x47\xc3\xd6\x11\x4d\x53\xd3\xe5\x51\xc2\x8b\xe7\x51\x15\x10\x6e\xb7\x45\xf8\x20\x68\x26\xa8\x80\xd8\xce\xa2\x0f\x06\xee\xcc\xda\x2f\x85\x22\xb2\x18\x32\xda\x9f\xd3\xa3\x23\x0a\x9e\xe4\xa8\xbe\x5e\xaa\x58\xcb\x4c\xdb\x3e\x33\x8d\x1c\xab\x33\xe7\x18\x7d\x64\x9f\x22\x58\x42\x00\xdc\x8f\xd1\x47\x22\x4e\xe3\x3c\x15\xc4\x4b\x2b\x5a\x08\x4b\xab\x0c\x42\x4b\xc3\xe8\x36\x18\xf6\xf6\xe2\x0d\xab\x62\x0d\x67\x1f\x24\x65\x85\x46\x02\x2a\x76\xad\x10\x2d\x88\x33\x3f\x96\xe0\xcb\x7f\x26\xf0\xa1\x7f\x76\x98\x3e\xaf\x79\x68\xbe\x7a\xd2\xe0\x35\xeb\x90\x91\x4a\x26\x74\x0f\x4a\xc5\xad\xa0\x31\xa6\x52\x41\xc6\xae\x36\x42\xc6\xce\x48\x7f\x10\xf7\x3e\xbe\xa0\x4c\x30\xf5\x85\xe5\xdb\xf5\x9a\xf6\xc9"
I would like to disassemble this. I tried using Cs (Capstone), but sometimes Cs cannot disassemble all of the opcodes; e.g. in the data above, Cs fails on \xf0 or \xf0\x5b\x73\x10. Below is how I use the Cs module:
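from capstone import Cs, CS_ARCH_X86, CS_MODE_64

md = Cs(CS_ARCH_X86, CS_MODE_64)
md.detail = True
# Note: the second argument of disasm() is the base address of the code, not its length.
for i in md.disasm(data, dataLength):
    print(i)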
So I thought that Cs is not intelligent enough to disassemble everything. Otherwise, please tell me what is wrong here.
I looked around for a better Python module that could disassemble the opcodes, and the first results used the dis module. Reading through the documentation, I couldn't find an example of disassembling a bunch of opcodes; all I found was how to disassemble a function. Can anyone tell me how to disassemble these bytes using dis? Or is there a better disassembling module?
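For context on why the documentation only shows functions: dis disassembles CPython bytecode, not x86 machine code, so it cannot decode the raw bytes above. A minimal sketch of what dis does operate on (the add function is a made-up example, and the exact instruction names vary between Python versions):

```python
import dis


def add(a, b):
    # Hypothetical function, used only to have some bytecode to inspect.
    return a + b


# Print a human-readable listing of the function's Python bytecode.
dis.dis(add)

# The same instructions are also available as objects.
opnames = [ins.opname for ins in dis.get_instructions(add)]
print(opnames)
```

The listing contains Python virtual machine instructions, which is a different instruction set from the x86 opcodes in data.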

Related

Is it possible to execute part of the decompiled code?

I am currently trying to solve a reversing challenge in which C code was compiled for a 32-bit Linux system.
To solve this challenge I am trying to use Ghidra, but I am facing a few issues. Here is a summary of what I have done so far:
I have two operating systems available: a 64-bit Linux system on my laptop and a 64-bit Windows 10. Apparently the program was compiled with gcc without the -g option, making Ghidra fail to debug the program. Manually debugging it with gdb in a terminal is possible but terrible to use (at least for me).
So all I can do is look at the assembly code in Ghidra's CodeBrowser and its respective decompiled C code. From that I understood that some of the instructions are decrypted at runtime. To analyse the code further, I want to be able to execute parts of the instructions, so that I can slowly but surely decrypt and understand the hidden parts of the program.
The only issue is that I do not know how to do that. I have noticed that Ghidra can run Java code, but all the examples provided by Ghidra that I looked at only allow me to patch hardcoded instructions into the program, not to actually execute or evaluate them.
My specific issue is the following part of the program (the part marked in green):
Ghidra has all the knowledge it needs to execute this part; I just do not know how to make it do so. I could of course do it by hand, but that is boring and not really why I am doing these challenges. For the same reason, I am not looking for finished scripts that unpack this program for me, but for a way to execute my analysis.
To summarize my question: I am asking for a way to execute the green-marked decrypting part of the target program in Ghidra without starting the debugger (since the Ghidra debugger keeps failing on me).
I think you are mixing up a few things here. You say:
the program was compiled with gcc without the -g option, making Ghidra fail to debug the program
The debug information added with -g makes it easier to analyze and debug a program, because it gives you information that would otherwise have to be recovered by reverse engineering. It should have no influence on whether you can run the program under a debugger in the first place, and as you noted, running it with gdb in the terminal works. The Ghidra debugger basically just runs gdb in the background and attaches to it to exchange information, so it should work.
You have a few options now:
1. Get the Ghidra Debugger to run with this binary
Whatever issue you are encountering with the Ghidra debugger is probably a valid question for https://reverseengineering.stackexchange.com/
From then on you can pursue your initial plan to solve this via debugging.
2. Write a GhidraScript to reimplement the decryption
Understand the basic idea of what you correctly recognized as some kind of decryption loop. Then you can use one of Ghidra's scripting options[0] to write a simple script that reimplements this decryption, but writes the decrypted values directly to the Ghidra memory.
Any scripting language will obviously include basic arithmetic operations such as +, - and xor, as well as loops, and the Ghidra API provides the functions byte getByte(Address address) and setByte(Address address, byte value). If you encounter any issues or API questions while writing this script, that would also be a valid follow-up question for the RE Stack Exchange.
This approach has the advantage that you can then statically analyse the resulting data inside Ghidra again, e.g. disassemble the resulting code.
[0] Ghidra natively supports Python 2.7 and Java-based scripts and a rudimentary Python REPL, but there are other options, such as Jupyter- and script-based Kotlin, or Ruby, Kotlin and Clojure scripts.
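The idea of option 2 can be sketched in plain Python, outside Ghidra. This is only an illustration under loud assumptions: the single-byte XOR scheme, the key 0x5A, the decrypt_region helper and the fake memory are all hypothetical, and the real loop in the binary may do something different. Inside a GhidraScript, the two callbacks would be backed by getByte and setByte instead of a bytearray.

```python
def decrypt_region(read_byte, write_byte, start, length, key):
    """Reimplement a (hypothetical) single-byte XOR decryption loop in place.

    read_byte and write_byte stand in for the memory accessors of the tool
    you script; here they are ordinary callables so the sketch runs anywhere.
    """
    for offset in range(length):
        address = start + offset
        # Mask to 0..255 in case the accessor returns a signed byte.
        value = read_byte(address) & 0xFF
        write_byte(address, value ^ key)


# Fake "program memory": 4 untouched bytes, then 3 bytes XOR-ed with 0x5A.
KEY = 0x5A  # assumed key, for illustration only
plain = b"\x90\x90\xc3"
memory = bytearray(b"\x2a" * 4 + bytes(b ^ KEY for b in plain))

decrypt_region(memory.__getitem__, memory.__setitem__, start=4, length=3, key=KEY)
print(memory[4:].hex())
```

Because the decrypted bytes are written back to the same memory, the result can afterwards be analysed statically like any other part of the program.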

passing c++ variables to python via gdb

I am developing/debugging C++ code which makes extensive use of C++ STL vectors and Blitz++ arrays
(the vectors/arrays are multidimensional, up to 4D/5D).
I am currently using cout/print to log the inputs and outputs of functions, but it is getting very tedious. Can you suggest any options for printing the vectors/arrays while debugging?
I thought of a couple of options:
(a) Write template functions in C++ to print the data and use GDB's "call" feature. However, I am unable to use GDB's "call" with C++ template functions, although it works for normal functions.
(b) Is it possible to pass C++ variables to the Python interface of GDB and print them? Are there any examples of this?
I googled before posting this question, but did not find any useful thread.
Any help is highly appreciated (even if some links can be provided)
Thanks a lot in advance !
Writing code in C++ to print the array and call it from gdb is certainly an option, but it might be unreliable because the print function you write might not be accessible (the linker might have dropped it because it was not used in your c++ code, for instance). Also, remember that templates are just "recipes" and you actually need to use them in order for the compiler to generate a class/function from it.
Is it possible to pass c++ variables to python interface of GDB and print them ? any examples for the same ?
A simple answer to this is "yes". You can use the parse_and_eval function in the gdb module when you use gdb's python API. Something such as
py print(gdb.parse_and_eval('your_variable'))
would print the value of a variable called your_variable using gdb's python API. But just that would be the same as just p your_variable in gdb's regular prompt without using the python API. The real power comes when you use gdb's python API to write pretty-printers for the types you want to debug.
A pretty-printer is basically just some code that you or someone else wrote to tell gdb how to print some type in a nice way. With a pretty-printer for a type just p your_variable in gdb's prompt prints the variable in the nice way defined by your pretty-printer.
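A minimal sketch of what such a pretty-printer can look like. Everything about the printed type is an assumption made for illustration: Matrix2D is a hypothetical struct with fields rows, cols and data (row-major doubles), not a Blitz++ or Armadillo type. The gdb module only exists inside a gdb session, so the import is guarded to let the class be exercised outside gdb as well.

```python
try:
    import gdb  # only available when this file is sourced inside gdb
except ImportError:
    gdb = None  # allows importing and testing the class outside gdb


class Matrix2DPrinter:
    """Printer for a hypothetical struct Matrix2D { int rows, cols; double *data; }."""

    def __init__(self, val):
        self.val = val

    def to_string(self):
        rows = int(self.val["rows"])
        cols = int(self.val["cols"])
        data = self.val["data"]
        lines = ["Matrix2D %dx%d" % (rows, cols)]
        for r in range(rows):
            # Row-major layout is an assumption of this sketch.
            row = [float(data[r * cols + c]) for c in range(cols)]
            lines.append("  " + " ".join("%g" % x for x in row))
        return "\n".join(lines)


def lookup_matrix2d(val):
    """Return a printer for Matrix2D values and None for every other type."""
    if val.type.strip_typedefs().tag == "Matrix2D":
        return Matrix2DPrinter(val)
    return None


if gdb is not None:
    # Register the lookup function so that `p some_matrix` uses the printer.
    gdb.pretty_printers.append(lookup_matrix2d)

# Outside gdb, a plain dict can stand in for a gdb.Value to try the formatting.
fake_value = {"rows": 2, "cols": 2, "data": [1.0, 2.0, 3.0, 4.0]}
print(Matrix2DPrinter(fake_value).to_string())
```

Inside gdb the file would be loaded with the source command (or from .gdbinit), after which p on a variable of the matching type goes through to_string.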
I couldn't find a pretty-printer for blitz with a quick google search and I haven't used blitz before. However, I have used another library for vectors and matrices in scientific computing called armadillo and thus faced similar problems. I have thus written some pretty printers for armadillo here that might help you in case you decide to write pretty printers for blitz.
As an illustration, below you can see how the arma::mat (a matrix of doubles) type from armadillo is printed in gdb without a pretty printer (the m1 variable, which is a 6x3 matrix of doubles)
Notice that we can't even see the matrix elements. They are stored in a contiguous memory region pointed to by the mem attribute of the arma::mat object.
Now the same matrix with the pretty printer available here.
That makes debugging code a lot easier.
Note: You can also write pretty printers in the guile language, but I bet python is a much more common choice.

Stanford NLP chunking for nested grammar

I switched from NLTK to Stanford NLP and am having trouble getting chunk results. A simple nested chunk grammar like
JJI:{(<JJ.*|VB.*>*)}
JJJ:{<RB.*>*(<TO>)?<JJI>?}
works like a charm in NLTK with easy code
import nltk

cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
whereas it's overly complicated in Stanford NLP.
TokenSequencePattern.getMultiPatternMatcher failed to give results for the nested grammar.
I turned to CoreMapExpressionExtractor.createExtractorFromFiles, which required extensive configuration with ENV.defaults["stage"] and action: result: properties, plus intensive debugging. After much struggle I turned here to SO.
Is there no simpler way/API in Stanford NLP, like nltk.RegexpParser, to perform such an elementary task?
Also, if CoreMapExpressionExtractor.createExtractorFromFiles is the only way forward, could someone please advise me on what the NERRulesFile.rules file should look like for this use case?

How to use JRuby's org.jruby.lexer.yacc.RubyYaccLexer

I'm using Ripper to do Ruby code lexing in mri-1.9.*, and I would like to do the same thing in JRuby. I noticed that org.jruby.lexer.yacc.RubyYaccLexer is used in org.jruby.parser.DefaultRubyParser, and I'm thinking that I can use it to do what Ripper does in mri-1.9.*, though definitely at a lower level than Ripper. Being a noob in Java, I couldn't figure out how to use it from within JRuby. I'm not sure if it is doable at all; I hope to get some advice on this.
Take a look at this post from JRuby committer Ola Bini. In it he shows some brief usage of JRuby's AST. You can use the code from JRuby to create an AST and navigate it in memory, manipulate it, and turn it back into executable code.
require 'jruby'
JRuby.ast_for "puts 'hello'"
# => RootNode
#      NewlineNode
#        FCallOneArgNode |puts|
#          ArrayNode
#            StrNode =="hello"
It doesn't give you the same event-based approach that Ripper does, but by traversing the AST you can get similar information.

Understanding run time code interpretation and execution

I'm creating a game in XNA and was thinking of creating my own scripting language (extremely simple, mind you). I know there are better ways to go about this (and that I'm reinventing the wheel), but I want the learning experience more than I want to be productive and fast.
When confronted with code at run time, from what I understand, the usual approach is to parse into a machine code or byte code or something else that is actually executable and then execute that, right? But, for instance, when Chrome first came out they said their JavaScript engine was fast because it compiles the JavaScript into machine code. This implies other engines weren't compiling into machine code.
I'd prefer not compiling to a lower language, so are there any known modern techniques for parsing and executing code without compiling to low level? Perhaps something like parsing the code into some sort of tree, branching through the tree, and comparing each symbol and calling some function that handles that symbol? (Wild guessing and stabbing in the dark)
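The guess in the last paragraph describes what is usually called a tree-walking interpreter. A minimal sketch for arithmetic expressions follows; the node layout (nested tuples), the function names and the tiny grammar are choices made for this illustration, and error handling is left out.

```python
import operator
import re


def tokenize(text):
    """Lexer: split the source into number and operator tokens.

    Characters that are not part of a token are silently skipped.
    """
    return re.findall(r"\d+|[-+*/()]", text)


def parse(tokens):
    """Parser: build a tree of nested tuples using recursive descent."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():  # expr := term (("+" | "-") term)*
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    def term():  # term := factor (("*" | "/") factor)*
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():  # factor := NUMBER | "(" expr ")"
        tok = advance()
        if tok == "(":
            node = expr()
            advance()  # consume the closing ")"
            return node
        return ("num", int(tok))

    return expr()


# One handler function per symbol, looked up while walking the tree.
HANDLERS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate(node):
    """Interpreter: walk the tree and call the handler for each node."""
    if node[0] == "num":
        return node[1]
    op, left, right = node
    return HANDLERS[op](evaluate(left), evaluate(right))


tree = parse(tokenize("2 + 3 * (4 - 1)"))
print(tree)
print(evaluate(tree))
```

Nothing is compiled to a lower-level form here: the tree is the program, and executing it means visiting its nodes.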
I personally wouldn't roll your own lexer (turning the input into tokens) or parser (checking the tokens against your language's grammar). Take a look at ANTLR for lexing/parsing: it's a great framework and has full source code if you want to dig into the guts of it.
For executing code that you've parsed, I'd look at running a simple virtual machine, or even better, look at LLVM, which is an open-source(ish) attempt to standardise the virtual machine byte code format and provide nice features like JITing (turning your script's compiled byte code into machine code).
I wouldn't discourage you from the more advanced options that you mention, such as native machine code execution, but bear in mind that this is a very specialist area and gets real complex, real fast!
Earlz pointed out that my reply might seem to imply 'don't bother doing this yourself'. Re-reading my post, it does sound a bit that way. The reason I mentioned ANTLR and LLVM is that they both have heaps of source code and tutorials, so I feel they are a good reference source. Take them as a base and play.
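To make the "simple virtual machine" option concrete, here is a minimal sketch of a stack-based byte code interpreter. The instruction set (PUSH, ADD, MUL, JUMP_IF_ZERO, HALT) is invented for this illustration and has nothing to do with LLVM's format.

```python
def run(program):
    """Execute a list of (opcode, argument) pairs on a value stack."""
    stack = []
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif op == "JUMP_IF_ZERO":
            # Pop the condition and jump to instruction index `arg` if it is 0.
            if stack.pop() == 0:
                pc = arg
        elif op == "HALT":
            break
        else:
            raise ValueError("unknown opcode: %r" % (op,))
    return stack


# Byte code for 2 + 3 * 4, as a compiler front end might emit it.
program = [
    ("PUSH", 2),
    ("PUSH", 3),
    ("PUSH", 4),
    ("MUL", None),
    ("ADD", None),
    ("HALT", None),
]
print(run(program))
```

A script compiler would translate the parsed source into such an instruction list once, and the game would then only run the loop above.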
You can try this framework for building languages (it works well with XNA):
http://www.meta-alternative.net/mbase.html
There are some tutorials:
http://www.meta-alternative.net/calc.pdf
http://www.meta-alternative.net/pfront.pdf
Python is great as a scripting language. I would recommend you make a C# binding for its C API and use that. Embedding Python is easy. Your application can define functions, types/classes and variables inside modules which the Python interpreter can access. The application can also call functions in Python scripts and get a result back. These two features combined gives you a two-way communication scheme.
Basically, you get the Python syntax and semantics for free. What you would need to implement is the API your application exposes to Python. An example could be access to game logic functions and render functions. Python scripts would then define functions which call these, and the host application would invoke the Python functions (with parameters) to get work done.
EDIT: Seems like IronPython can save you even more work. It's a C# implementation of Python, and has its own embedding API: http://www.ironpython.net/