How can you reverse engineer a binary Thrift file?

I've been asked to process some files serialized as binary (not text/JSON unfortunately) Thrift objects, but I don't have access to the program or programmer that created the files, so I have no idea of their structure, field order, etc. Is there a way using the Thrift libraries to open a binary file and analyze it, getting a list of the field types, values, nesting, etc.?

Unfortunately it appears that Thrift's binary protocol does not do very much tagging of data at all; decoding appears to assume you have the .thrift file in hand, so that you know, say, the next 4 bytes are supposed to be an integer and aren't actually the first half of a float. So it appears you are stuck with, basically, looking at the files in a hex editor (or equivalent) and trying to deduce fields from the exact patterns you're seeing.
There are a very few helpful bits:
- Each file begins with a version, a protocol identifier string, and a sequence number.
- Maps begin with 6 bytes that identify the key and value types (the first two bytes, as integer type codes) plus the number of elements as a 4-byte integer.
- The type codes appear to be standard (the canonical location of their definitions seems to be TProtocol.h in the Thrift sources; for instance, a boolean value is specified by type code 2, a UTF-8 string by type code 16, and so on).
- Strings are prefixed by a 4-byte integer length field, and lists are prefixed by the element type (1 byte) and a 4-byte length.
- All integer fields appear to be saved big-endian, and floating-point values are saved in IEEE format (which should make doubles relatively easy to find, at least).
The TBinaryProtocol* files in Thrift have a few more helpful details; on the plus side, there are a number of different implementations so you can read the ones implemented in the language you are most comfortable with.
Sorry, I know this probably isn't that helpful, but it really does appear this is all the information the Thrift binary format provides; clearly the binary format was designed with the intent that you would always know the exact protocol spec already, and the goal was to minimize wire space rather than to make it at all easy to decode blindly.
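If you would rather poke at a file programmatically than purely in a hex editor, here is a rough sketch (Python) of that kind of blind walk, using the type codes from TProtocol.h. It assumes you have already skipped past the version/name/sequence-number header and are positioned at the start of a struct; it is a guess-and-check aid rather than a real decoder, and starting at the wrong offset will make it misread bytes exactly as described above.
    import struct
    # Type codes as defined in TProtocol.h: STOP=0, BOOL=2, BYTE=3, DOUBLE=4,
    # I16=6, I32=8, I64=10, STRING=11, STRUCT=12, MAP=13, SET=14, LIST=15.
    # Everything multi-byte is big-endian in TBinaryProtocol.
    def read_value(buf, pos, ttype):
        """Return (value, new position) for one value of the given type code."""
        if ttype == 2:                       # BOOL: one byte
            return bool(buf[pos]), pos + 1
        if ttype == 3:                       # BYTE
            return buf[pos], pos + 1
        if ttype == 4:                       # DOUBLE: 8-byte IEEE 754
            return struct.unpack_from(">d", buf, pos)[0], pos + 8
        if ttype == 6:                       # I16
            return struct.unpack_from(">h", buf, pos)[0], pos + 2
        if ttype == 8:                       # I32
            return struct.unpack_from(">i", buf, pos)[0], pos + 4
        if ttype == 10:                      # I64
            return struct.unpack_from(">q", buf, pos)[0], pos + 8
        if ttype == 11:                      # STRING: 4-byte length, then raw bytes
            (n,) = struct.unpack_from(">i", buf, pos)
            return buf[pos + 4:pos + 4 + n], pos + 4 + n
        if ttype == 12:                      # STRUCT: nested fields until STOP
            return read_struct(buf, pos)
        if ttype in (14, 15):                # SET/LIST: element type + 4-byte count
            etype, n = struct.unpack_from(">bi", buf, pos)
            pos += 5
            items = []
            for _ in range(n):
                item, pos = read_value(buf, pos, etype)
                items.append(item)
            return items, pos
        if ttype == 13:                      # MAP: key type, value type, 4-byte count
            ktype, vtype, n = struct.unpack_from(">bbi", buf, pos)
            pos += 6
            entries = []
            for _ in range(n):
                k, pos = read_value(buf, pos, ktype)
                v, pos = read_value(buf, pos, vtype)
                entries.append((k, v))
            return entries, pos
        raise ValueError("unknown type code %d at offset %d" % (ttype, pos))
    def read_struct(buf, pos):
        """Read {field id: value} pairs until the STOP byte."""
        fields = {}
        while True:
            ttype = buf[pos]
            pos += 1
            if ttype == 0:                   # STOP marker ends the struct
                return fields, pos
            (fid,) = struct.unpack_from(">h", buf, pos)
            pos += 2
            fields[fid], pos = read_value(buf, pos, ttype)
    # Usage (the offset past the header is something you have to find by inspection):
    # fields, _ = read_struct(open("mystery.bin", "rb").read(), offset_you_found)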

Related

Convert binary coordinates to decimal ASN.1 UPER

I cannot correctly convert binary numbers to decimal using ASN.1 compilation. These binary values correspond to lat and long.
lat 1001110010100100101010110011111
long 01101100100101011100100100111000
If I convert them to decimal I get 1314018719 and 1821755704, respectively. However, the coordinates should be these:
[image of the expected coordinates, not reproduced here]
I've tried multiple converters but without success. Any clue?
I don't know how you think the encoding works. ASN.1 PER is specified by ITU-T X.680 and ITU-T X.691. (UPER is unaligned PER, a variant of PER defined in the same specs.) The rules for integers include things such as encoding as an offset from a lower bound, using a length determinant and minimal octets, using a fixed number of octets and no length determinant, etc., depending on the INTEGER type's constraints. Nobody can tell you how to treat the data you've provided without having the ASN.1 schema and knowing what part of it relates to this data, as well as knowing whether the bits you have include the length determinant or not (if there is one).
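Just to illustrate what "an offset from a lower bound" means in practice, and nothing more: if, hypothetically, your schema constrained these fields the way some location schemas do, e.g. INTEGER (-900000000..900000001) for latitude and INTEGER (-1800000000..1800000001) for longitude, then a PER constrained whole number is encoded as the value minus the lower bound. The constants below are assumptions chosen purely for illustration; only your actual ASN.1 schema can tell you whether anything like this applies and what unit the result would be in.
    # Illustration only: decode "offset from an assumed lower bound".
    lat_bits = "1001110010100100101010110011111"
    lon_bits = "01101100100101011100100100111000"
    ASSUMED_LAT_LOWER = -900000000       # hypothetical constraint, not derivable from the bits
    ASSUMED_LON_LOWER = -1800000000      # hypothetical constraint, not derivable from the bits
    lat = int(lat_bits, 2) + ASSUMED_LAT_LOWER
    lon = int(lon_bits, 2) + ASSUMED_LON_LOWER
    # Only meaningful if the assumed constraints (and an assumed unit such as
    # tenths of microdegrees) actually match the real schema:
    print(lat / 1e7, lon / 1e7)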

When JSON is send over network how are numbers represented (as binary or text)?

This might be a trivial question... Or might not be. When I serialize an object to JSON how are numbers represented?
Specifically, I need to know how efficiently they are encoded to binary. There are 2 ways:
Transform number to its decimal string representation and then encode that string to binary.
Or encode the number directly to binary.
Which is the case?
That is a big difference: let's say the serialized object contains the number 12345678. Encoded the first way it will take 8 B to transfer; encoded the second way, only 4 B. When it comes to lots of big numbers (my case), in the first case I would be better off using base64 as a pre-processing step before serialization.
I can imagine that this might depend on the serializer (though I really hope it does not). In case it matters, I am using the Firebase Realtime Database SDK.
JSON is a textual notation. So the number 12345678 is sent as those eight characters, 1, 2, 3, etc. Depending on your text encoding, that's probably eight bytes (e.g., UTF-8 or Windows-1252; but if you were using UTF-16, for instance, it would be 16 bytes).
There have been various "binary JSON" proposals over the years, but I don't think any of them really caught on outside of specific applications (for instance, BSON in MongoDB).
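If you want to see the size difference the asker is describing, a quick check (assuming UTF-8 on the wire; what the Firebase SDK actually puts on the wire is a separate question):
    import json
    import struct
    n = 12345678
    as_json = json.dumps(n).encode("utf-8")   # textual JSON: one byte per digit
    as_raw = struct.pack("<i", n)             # a raw 32-bit integer, for comparison
    print(len(as_json), len(as_raw))          # 8 vs 4 bytes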

How do computers differentiate between letters and numbers in binary?

I was just curious because 65 is the same as the letter A
If this is the wrong stack sorry.
"65 is the same as the letter A": It is true if you say it is. But not saying more than that isn't very useful.
There is no text but encoded text. There are no numbers but encoded numbers. To the CPU, some number encodings are native, everything else is just undifferentiated data.
(Some data is just data for programs, other data is the CPU instructions of programs. It's a security problem if a CPU executes data as instructions inappropriately. Some architectures keep program data and instructions separate.)
Common native number encodings are signed and unsigned integers of 1, 2, 4, and 8 bytes and IEEE-754 single and double precision floating point numbers. Signed integers are usually two's-complement. Multi-byte integers have a byte ordering (or endianness) because on typical machines each byte is individually addressable. If a number encoding is not native, a program library is needed to process such data.
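For example, here are a few of those native encodings written out as bytes (using Python's struct module, where ">" means big-endian and "<" little-endian):
    import struct
    print(struct.pack(">i", 65))    # 32-bit signed int, big-endian:    b'\x00\x00\x00A'
    print(struct.pack("<i", 65))    # 32-bit signed int, little-endian: b'A\x00\x00\x00'
    print(struct.pack(">h", -1))    # 16-bit two's complement:          b'\xff\xff'
    print(struct.pack(">d", 1.5))   # IEEE-754 double precision:        b'?\xf8\x00\x00\x00\x00\x00\x00'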
Text is a sequence of encoded characters from a character set. There are hundreds of character sets. A character set is an assignment of a conceptual character to a number called a codepoint. Sometimes the conceptual characters are categorized as lowercase letter, digit, symbol, etc. A codepoint value is mapped to bytes using a character encoding. Most character sets have one encoding, but Unicode has several. Some character sets are subsets of other character sets—such relationships are not generally useful because exactly one character set is used in any one context.
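For example, one conceptual character, one codepoint, two different byte encodings:
    ch = "é"
    print(ord(ch))               # codepoint 233 (U+00E9) in the Unicode character set
    print(ch.encode("utf-8"))    # b'\xc3\xa9' (two bytes under the UTF-8 encoding)
    print(ch.encode("latin-1"))  # b'\xe9'     (one byte under the Latin-1 encoding)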
A program is a set of instructions that operate on data. It must apply the correct operations to the right data. So, it is the program that differentiates between text and number, usually by its location or flow path.
Stored data must be in a known layout of encoded text and numbers. Sometimes the layout is stored also. The layout is called metadata. Without the metadata accompanying the data, or being agreed upon, the data cannot be used.
It's all quite simple with appropriate bookkeeping. But there are several methods of bookkeeping so there is no general solution to how to handle data without metadata. Methods include: Well-known and/or registered file extensions, HTTP headers, MIME types, HTML meta charset tag, XML encoding declaration. Some methods only work in a certain context, such as audio/video codecs having a four-character code (FourCC), and unix shell scripts with a shebang. Some methods only help narrow guessing, such as file signatures. Needless to say, guessing should be avoided; it leads to security issues and data loss.
Unfortunately, text files are often without metadata. It is particularly important to agree upon or separately communicate the metadata.
Data without metadata is "binary". So the writer of a text must agree with its reader on which character encoding is to be used, and similarly for all other types of data. Here reader and writer are both humans and programs.
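In code, the same byte is whatever the program treats it as at that point:
    data = b"\x41"
    as_number = data[0]               # treat it as an unsigned 8-bit integer: 65
    as_text = data.decode("ascii")    # treat it as an ASCII-encoded character: 'A'
    print(as_number, as_text)         # 65 A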
Short answer: they don't. Longer answer: every binary combination between 00000000 and 01111111 has a character representation in the ASCII character set (extended character sets assign the remaining values up to 11111111). 01000001 just happens to be the first capital letter of the Latin alphabet, as designated back in 1963. There are other character sets, and code pages, that represent different letters, numbers, non-printable characters, and accented letters. It's entirely possible that the binary 01000001 could be a lower-case z with a tilde over the top in a different character set. Computers don't know (or care) what a particular binary representation means to humans.

Bulk export of binary waveform data from oscilloscope to data points (csv preferred)

I'm working with some binary waveform files from various early-to-mid-90's HP scopes. I am trying to do a bulk conversion (we have over 5000) of the files to CSVs and then upload them into a database. I've tried hexdump, xxd, od, strings, etc., and none of them seem to work. I did hunt down a programmer's manual but it's not making a whole lot of sense.
The files have a preamble line as ASCII text, but the data points are in binary, and for some reason nothing I try can decode them. The preamble gives the data necessary to interpret the binary values and calculate the correct values. It also states that the data is in WORD format.
:WAV:PRE 2,1,32768,1,+4.000000E-08,-4.9722700001108E-06,0,+2.460630E-04,+2.500000E+00,16384;:WAV:DATA #800065536^W�^W�^W�^
I'm pretty confused.
Have a look at
http://www.naic.edu/~phil/hardware/oscilloscopes/9000A_Programmer_Reference.pdf
specifically page 1-21. After ":WAV:DATA", I think the rest of the chunk above will have 65536 8-bit data bytes (the start of which is represented above by �). The ^W is probably a delimiter, so you would have to parse that out. Just a thought.
UPDATE: I'm new to oscilloscope data collection and am trying to figure the whole thing out from scratch. So, on further digging, it looks like the data you have provided shows this:
PREamble:
- WORD format (16-bit signed integers split into 2 8-bit bytes)
- If there is a WAV:BYT section, that would specify byte order for each pair
- RAW data
- 32768 data points
- COUNT = 1 (I'm not clear on the meaning of this)
- Next 3 should be X increment, origin, reference
- Next 3 should be Y increment, origin, reference, although the manual that I pointed you at above has many more fields than just these, so you might want to consult your specific scope manual.
DATA:
- On closer examination, I don't think the ^W is a delimiter; I think it is the first byte of the pair (00010111). The � character is apparently a standard "I don't know how to represent this character" web representation. You would need to look at that character as 8 bits also.
- 65536 byte pairs of data
I'm not finding a utility that will do this for you. I think you're going to have to write or acquire some code (Perl, C, Java, Python, VB, etc.) to get this done.
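As a starting point, here is a rough Python sketch rather than a finished tool. It bakes in several assumptions you would need to verify against your scope's programming manual: that the preamble fields are format, type, points, count, X increment, X origin, X reference, Y increment, Y origin, Y reference; that the WORD data is 16-bit signed, most-significant-byte first (check :WAV:BYT or the manual for the real byte order); and that the usual HP-style scaling of volts = (code - yref) * yinc + yorigin and time = (i - xref) * xinc + xorigin applies.
    import csv
    import struct
    def convert(path_in, path_out):
        raw = open(path_in, "rb").read()
        # ASCII preamble sits between ":WAV:PRE" and ";:WAV:DATA"
        pre_start = raw.index(b":WAV:PRE") + len(b":WAV:PRE")
        pre_end = raw.index(b";:WAV:DATA")
        fields = [float(x) for x in raw[pre_start:pre_end].decode("ascii").split(",")]
        fmt, typ, points, count, xinc, xorig, xref, yinc, yorig, yref = fields[:10]
        # Definite-length block: '#', one digit giving the width of the length
        # field, that many digits giving the byte count, then the raw bytes.
        blk = raw.index(b"#", pre_end)
        ndigits = int(raw[blk + 1:blk + 2])
        nbytes = int(raw[blk + 2:blk + 2 + ndigits])
        data = raw[blk + 2 + ndigits:blk + 2 + ndigits + nbytes]
        # 16-bit signed words, assumed MSB first
        codes = struct.unpack(">%dh" % (nbytes // 2), data)
        with open(path_out, "w", newline="") as f:
            out = csv.writer(f)
            out.writerow(["time_s", "volts"])
            # assumed HP-style scaling; verify against your scope's manual
            for i, code in enumerate(codes):
                out.writerow([(i - xref) * xinc + xorig,
                              (code - yref) * yinc + yorig])
Loop that over your 5000 files and you have CSVs to load into the database; the scaling and byte-order assumptions are the parts most worth double-checking first.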

Why is it useful to know how to convert between numeric bases?

We are learning about converting Binary to Decimal (and vice-versa) as well as other base-conversion methods, but I don't understand the necessity of this knowledge.
Are there any real-world uses for converting numbers between different bases?
- When dealing with Unicode escape codes: '\u2014' in JavaScript is &mdash; in HTML.
- When debugging: many debuggers show all numbers in hex.
- When writing bitmasks: it's more convenient to specify powers of two in hex (or by writing 1 << 4); see the quick illustration after this list.
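For example, in Python:
    mask = 1 << 4            # "bit 4 set"
    print(mask)              # 16       (decimal)
    print(hex(mask))         # 0x10     (hexadecimal)
    print(bin(mask))         # 0b10000  (binary)
    print(int("2014", 16))   # 8212, the decimal codepoint behind '\u2014'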
In this article I describe a concrete use case. In short, suppose you have a series of bytes you want to transfer using some transport mechanism, but you cannot simply pass the payload as bytes, because you are not able to send binary content. Let's say you can only use 64 characters for encoding the payload. A solution to this problem is to convert the bytes (8-bit characters) into 6-bit characters. Here the number conversion comes into play. Consider the series of bytes as a big number whose base is 256. Then convert it into a number with base 64 and you are done. Each digit of the new base 64 number now denotes a character of your encoded payload...
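A sketch of that idea in Python, treating the whole payload as one big base-256 number and peeling off base-64 "digits" (real Base64 libraries work on 3-byte groups and add padding, but the underlying base conversion is the same idea):
    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    def to_base64_digits(payload: bytes) -> str:
        n = int.from_bytes(payload, "big")   # the payload read as a base-256 number
        digits = ""
        while n:
            n, rem = divmod(n, 64)           # peel off one base-64 digit
            digits = ALPHABET[rem] + digits
        return digits or ALPHABET[0]
    print(to_base64_digits(b"Hi!"))   # "SGkh", same as base64.b64encode(b"Hi!") here,
                                      # since 3 bytes split evenly into four 6-bit digits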
If you have a device, such as a hard drive, that can only have a set number of states, you can only count in a number system with that many states.
Because a computer's bits have only two states, on and off, you can only represent 0 and 1; therefore a base-2 system is used.
If you had a device with 3 states, you could represent 0, 1, and 2, and therefore count in a base-3 system.