Help Me Understand This Binary File Format

I am attempting to write a small utility to produce a binary file that mimics one produced by another, closed application. I've used hex editors to decipher the format, but I'm stuck trying to understand what the format/encoding is so that I can produce it using C++ or C#.
The file starts with the four bytes 01 00 FF FE. My understanding is that the file begins with SOH followed by the byte order mark for little-endian. After these four bytes, the program appears to write BSTRs for each of the string fields from the app's GUI.
Using C#, I have produced a Unicode file that starts with FF FE, but I'm not sure how to insert the SOH character first.
I would be forever grateful if someone could offer insight into the file format or encoding, and why the file starts with the SOH character.
Thank you in advance.

Reverse engineering a binary file format can be a challenging task. On the surface, I don't recognize this as an obvious, well-known file format ... but there are thousands out there, so who knows.
Legal issues aside, I would suggest you look at some of the following resources that talk about approaches to such an endeavor:
How To Crack a Binary File Format
Tools to Reverse Engineer Binary Files
Basics of Reverse Engineering File Formats
File Format Reverse Engineering

If you are just having trouble writing out the first four bytes, this will do it for you:
using System.IO;
using System.Text;

using (var stream = new FileStream("myfile.bin", FileMode.Create))
{
    using (var binaryWriter = new BinaryWriter(stream))
    {
        // SOH encoded as a UTF-16LE character: 01 00
        binaryWriter.Write((byte)0x01);
        binaryWriter.Write((byte)0x00);
        // UTF-16 little-endian byte order mark: FF FE
        binaryWriter.Write((byte)0xFF);
        binaryWriter.Write((byte)0xFE);
        // String payload encoded as UTF-16LE ("Unicode" in .NET terms)
        binaryWriter.Write(Encoding.Unicode.GetBytes("string"));
    }
}
This will output the following file:
01 00 FF FE 73 00 74 00 72 00 69 00 6e 00 67 00 ....s.t.r.i.n.g.
Edit: Added Mark H's suggestion for writing out a string.
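If the remaining fields really are BSTRs, each one is conventionally a 4-byte length prefix (the payload's byte count) followed by the UTF-16 code units; in-memory COM BSTRs also carry a trailing null character. A minimal sketch of writing one field that way (WriteBstr is a hypothetical helper; whether this particular file includes the trailing null is something to verify in the hex dump):

// Writes a BSTR-style field: a little-endian Int32 byte count followed by
// the UTF-16LE code units (no trailing null assumed here).
static void WriteBstr(BinaryWriter binaryWriter, string value)
{
    byte[] payload = Encoding.Unicode.GetBytes(value);
    binaryWriter.Write(payload.Length); // BinaryWriter writes Int32 as little-endian
    binaryWriter.Write(payload);
}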

Related

Tiff tag data buffer strange hex values for floating point variables

So, I have been assigned a task to read and save the values of a C struct that was stored within a TIFF tag of a TIFF image as a byte buffer. This might be quite simple, but I am new to this realm of programming.
I know the exact positions I need to read bytes from. When I use Python TIFF tag readers, I get weird byte values that I cannot make sense of. I was expecting something in the form \xb5\x00\x00\x00\x01, not something strange like \n\xd7#=\n\xd7#=K.
Here is a snippet of the weird buffer values (screenshot omitted); in the utility app AsTiffViewer, however, those same bytes display perfectly fine.
How do I decode this? What does this all mean?
\n\xd7#=\n\xd7#=K (0A D7 23 3D 0A D7 23 3D - as per AsTiffViewer)
By the way, these 0A D7 23 3D & 0A D7 23 3D are supposed to be two float values, each of them 4 bytes.
I was expecting the TIFF tag byte buffer to be in a format like \xb5\x00\x00\x00\x01; however, it spat out something strange: \n\xd7#=\n\xd7#=K, which I didn't know how to decode or read.
So, after mucking around a bit, I found out that \n\xd7#=\n\xd7#=K is nothing but how Python displays a byte string: printable ASCII bytes are shown as characters (\x23 as #, \x3d as =, \x4b as K), the newline byte \x0a is shown as \n, and only the remaining bytes appear as \xNN escapes. The underlying bytes are exactly the ones AsTiffViewer reports.
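Interpreting each four-byte group as a little-endian IEEE 754 single then gives the values AsTiffViewer shows. A minimal C# sketch of the decoding (the byte values are from the question; BitConverter assumes a little-endian platform, which is the usual case on x86/x64):

using System;

byte[] buffer = { 0x0A, 0xD7, 0x23, 0x3D, 0x0A, 0xD7, 0x23, 0x3D };
float first = BitConverter.ToSingle(buffer, 0);  // ≈ 0.04
float second = BitConverter.ToSingle(buffer, 4); // ≈ 0.04
Console.WriteLine($"{first} {second}");

In Python, struct.unpack("<ff", buffer) would do the same job.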

How to convert 4 bytes hex to decimal manually

I am doing a CTF challenge. I opened a broken BMP image file with a hex editor (HexFiend) and highlighted 4 bytes: 8E262C00. At the bottom, HexFiend shows their value in decimal as 2893454. However, when I use an online hex-to-decimal converter, the value is 2384866304.
Does anyone know how HexFiend comes up with 2893454? I believe it is the correct answer, because that is the size of the file.
It's the endianness of the file.
A binary file can be encoded as little-endian or big-endian. The difference is the order in which the bytes of a multi-byte value are stored, i.e. whether you read them from left to right or from right to left. (Bit order within a byte is almost always big-endian, so only the byte order matters here.) Big-endian is the natural way of reading: the bytes are stored as you would expect, so 8E262C00 is read as 8E 26 2C 00, which is 2384866304 in decimal. This file, however, is stored little-endian, so the byte order is flipped: 8E262C00 becomes 00 2C 26 8E, i.e. 0x002C268E, which is 2893454 in decimal.
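A quick way to check both interpretations, sketched in C# (BinaryPrimitives lives in System.Buffers.Binary; the bytes are the ones highlighted in the question):

using System;
using System.Buffers.Binary;

byte[] bytes = { 0x8E, 0x26, 0x2C, 0x00 };
Console.WriteLine(BinaryPrimitives.ReadUInt32LittleEndian(bytes)); // 2893454
Console.WriteLine(BinaryPrimitives.ReadUInt32BigEndian(bytes));    // 2384866304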
I think it's the Big Endian versus Little Endian thing.
Check how the online converting tool interprets the bytes: the BMP file format is little-endian, but I think the tool may be converting by the big-endian method.
try it: https://www.scadacore.com/tools/programming-calculators/online-hex-converter/

Figuring out how a number is represented in hex form

I am currently trying to reverse engineer a file format that is produced by a CNC machine when backing up its programs, so that I can read the programs on a standard PC. I have opened a few of the backup files and can clearly see patterns of data, such as the program name, in plaintext form. One thing I am struggling with is how numbers are represented.
For example, the number '20' is represented in this file in hex form as '40 0D 03 00'.
More examples:
"-213.6287": "21 67 DF FF"
"-500.3366": "9A A7 B3 FF"
Any help with trying to figure out how these hex values make up those numbers?
Thanks
These numbers are stored as little-endian signed integers, as a count of ten-thousandths.
For example, the number '20' is represented as '40 0D 03 00':
0x00030D40 = 200000, and 200000 / 10000 = 20.
"-213.6287" is "21 67 DF FF":
0xFFDF6721 = -2136287, and -2136287 / 10000 = -213.6287.
"-500.3366" is "9A A7 B3 FF":
0xFFB3A79A = -5003366, and -5003366 / 10000 = -500.3366.
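A minimal C# sketch of the decoding, using one of the examples above (ReadInt32LittleEndian handles both the byte order and the sign):

using System;
using System.Buffers.Binary;

byte[] raw = { 0x21, 0x67, 0xDF, 0xFF };                  // "-213.6287"
int scaled = BinaryPrimitives.ReadInt32LittleEndian(raw); // -2136287
Console.WriteLine(scaled / 10000.0);                      // -213.6287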

In-memory layout of array in Turbo Pascal

We have an old application in Turbo Pascal which can save its internal state into a file, and we need to be able to read/write this file in a C# application.
The old application generates the file by dumping various in-memory data structures. In one place, the application just dumps a range of memory, and this memory range contains some arrays. I am trying to noodle out the purpose of the bytes immediately preceding the actual array elements. In particular, the first two items in the block can be represented as:
type
    string2 = string[2];
    stringarr2 = array[0..64] of string2;
    string4 = string[4];
    stringarr4 = array[0..64] of string4;
In the data file, I see the following byte sequence:
25 00 02 02 41 42 02 43 44 ...
The 25 is the number of elements in the array. The 02 41 42 is the first string element, "AB"; the 02 43 44 is the second string element, "CD", and so on. I don't know what the 00 02 between the array element count and the first array element refers to. It's possible the array element count is 25 00 and the element size is 02, but each array element is actually 3 bytes in size.
In the place in the file where the array of 4-character strings starts, I see the following:
25 00 04 00 00 04 41 42 43 44 04 45 46 47 48
Again, there's the 25 which is the number of elements in the array; 04 41 42 43 44 is the first element in the array, "ABCD", and so on. In between there are the bytes 00 04 00 00. Maybe they are flags. Maybe they are some kind of indication of the shape of the array (but I don't see how 02 and 04 both indicate a one-dimensional array).
I don't have access to Turbo Pascal to try writing different kinds of arrays to a file, and don't have authorization to install something like Free Pascal, so my opportunities for experimentation along those lines are very limited.
These arrays are not dynamic, since Turbo Pascal didn't have them.
Thanks in advance for any dusty memories.
Pascal arrays have no bookkeeping data. You have an array of five-byte data structures (string[4]), so an array of 65 of them occupies 65*5=325 bytes. If the program wrote more than that, then it's because the program took special measures to write more. The "extra" values weren't just sitting in memory that the program happened to write to disk when it naively wrote the whole data structure with SizeOf. Thus, the only way to know what those bytes mean is to find the source code or the documentation. Merely knowing that it's Turbo Pascal is no help.
It's possible that the first section of the file is intentionally the same size as all the other array elements. For the two-character strings, the "header" is three bytes, and for the four-character strings, the "header" is five bytes, the same as the size of the strings. That would have allowed the program to use a file of string4 data type for the file, and then just skip the file's first record. The zero between the file length and the string length in the header might belong to either of those fields, and the remaining two zero bytes might just be filler.
Besides the layout of the individual strings in the file, you will also need to consider what code page those single-byte characters come from: C# chars are two-byte Unicode (UTF-16) code units.
If you're lucky, the original file data contains only 7-bit ASCII characters, which cover the English alphabet. If the original data contains "European" letters such as umlauts or accented characters, these will be "high ASCII" values in the range 128..255, and you'll need to perform an encoding conversion to see these characters correctly in C#. Code page 1252 (Windows Latin 1) would be a good starting point.
If the original file data contains Japanese, Chinese, Korean, Thai, or characters from other "Eastern" scripts, you have a lot of work ahead of you.
Turbo Pascal strings are prefixed with a length byte, so a string[2] is actually 3 bytes: length, char1 and char2. An array of string[2] holds all the strings one after another, directly adjacent in memory. If you do a BlockWrite with the array as a parameter, it will start immediately with the first string; it will not write any headers. So if you have the source, you should be able to see what it writes before the array.
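Putting the last two answers together, reading one such string[n] slot in C# might look like this (a sketch; ReadShortString is a hypothetical helper, and the encoding choice follows the code-page caveat above):

using System;
using System.IO;
using System.Text;

// Reads one Turbo Pascal "string[n]" slot: a length byte followed by
// n bytes of fixed storage, of which only the first 'length' are valid.
static string ReadShortString(BinaryReader reader, int declaredSize, Encoding encoding)
{
    byte length = reader.ReadByte();
    byte[] storage = reader.ReadBytes(declaredSize);
    return encoding.GetString(storage, 0, Math.Min(length, declaredSize));
}

// On the bytes 02 41 42 this returns "AB", e.g.:
// var text = ReadShortString(reader, 2, Encoding.ASCII);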

Creating "holes" in a binary H.264 bitstream

I am trying to simulate data loss in a video by selectively removing H.264 bitstream data. The data is simply a raw H.264 file, which is essentially a binary file. My plan is to delete 2 bytes for every 100 bytes so as to achieve a 2% loss. Eventually, I will be testing the effectiveness of some motion vector error concealment algorithms.
It would be nice to be able to do this in a Unix environment. So far, I have investigated the xxd command a bit, and I am able to save a specific portion of a hex dump from a binary file. For example, to skip the first 50 bytes of a binary bitstream and save the subsequent 100 bytes, I would do the following:
xxd -s 50 -l 100 inputBinaryFile | xxd -r > outputBinaryFile
I'm hoping to incorporate something similar into a bash script that will automatically delete the last 2 bytes of every 100 bytes. Furthermore, I would like the script to skip everything before the second occurrence of the byte sequence 00 00 01 06 05 (the first P-frame SEI start code).
I don't know how much easier this would be in a C-based language, but my programming skills are quite limited, and I would rather deal with just Linux programming for now if possible.
Thanks.
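For what it's worth, the core logic is just a pattern search plus a copy loop, so a small compiled utility is also an option. Here is a rough C# sketch of the plan described above (file names are placeholders, and whether the deletions should begin exactly at the second start code or just after it is an assumption to adjust):

using System;
using System.IO;

class HolePuncher
{
    // Returns the index of the first occurrence of pattern at or after 'from', or -1.
    static int FindPattern(byte[] data, byte[] pattern, int from)
    {
        for (int i = Math.Max(from, 0); i <= data.Length - pattern.Length; i++)
        {
            bool match = true;
            for (int j = 0; j < pattern.Length && match; j++)
                match = data[i + j] == pattern[j];
            if (match) return i;
        }
        return -1;
    }

    static void Main()
    {
        byte[] data = File.ReadAllBytes("input.264");
        byte[] startCode = { 0x00, 0x00, 0x01, 0x06, 0x05 };

        int first = FindPattern(data, startCode, 0);
        int second = first < 0 ? -1 : FindPattern(data, startCode, first + 1);
        if (second < 0) { Console.Error.WriteLine("start code not found twice"); return; }

        using (var output = new FileStream("output.264", FileMode.Create))
        {
            // Everything before the second start code is copied untouched.
            output.Write(data, 0, second);
            // From there on, keep 98 bytes out of every 100 (a 2% loss).
            for (int pos = second; pos < data.Length; pos += 100)
            {
                int block = Math.Min(100, data.Length - pos);
                output.Write(data, pos, block == 100 ? 98 : block);
            }
        }
    }
}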