HTML entities appear to contain nonsense - html

신영 안
Above is the html, below is the code. Is this a name? What does it mean?
신옠안

You have a double Mojibake, data mangled by using incorrect codecs.
It's actually Korean, a name:
신영 안
or, if using HTML entities, this should have been encoded to
&#49888;&#50689; &#50504;
It translates to English as Shin-Young An.
When encoded to UTF-8, and grouped per input code-point then displayed using hex digits you would get this:
ec 8b a0
ec 98 81
20
ec 95 88
To produce the output you have, someone must have:
Decoded the above UTF-8 data using Windows codepage 1252, producing
ì‹<A0>ì˜<81> ì•ˆ
(where <A0> is the non-breaking space character, and <81> is an invalid CP1252 byte, but this is often ignored in many decoders; I've included them in this notation because they'd not otherwise be printable)
Encoded the resulting mess to UTF-8 again to give you the following byte values:
c3 ac e2 80 b9 c2 a0
c3 ac cb 9c c2 81
20
c3 ac e2 80 a2 cb 86
(the grouping matches the correct UTF-8, above)
Decoded those UTF-8 bytes a second time using the same Windows CP1252 codec, this time producing:
Ã¬â€¹Â<A0>Ã¬ËœÂ<81> Ã¬â€¢Ë†
(with the same note on the <A0> and <81> characters)
Finally encoded the resulting characters to HTML entities:
신영 안
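The encode/decode chain above can be reproduced with the standard library alone. This is a minimal sketch, not ftfy's implementation: `sloppy_cp1252_decode` and `sloppy_cp1252_encode` are hypothetical helpers that imitate a lenient CP1252 codec by passing the undefined bytes (such as 81) through as the code point of the same value.

```python
def sloppy_cp1252_decode(data: bytes) -> str:
    """Decode CP1252, mapping bytes CP1252 leaves undefined to the same code point."""
    return "".join(
        bytes([b]).decode("cp1252", errors="ignore") or chr(b) for b in data
    )


def sloppy_cp1252_encode(text: str) -> bytes:
    """Inverse of sloppy_cp1252_decode for the strings it can produce."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out.append(ord(ch))  # pass-through for characters such as U+0081
    return bytes(out)


original = "신영 안"

# Steps 1 and 2: UTF-8 bytes misread as CP1252, then re-encoded as UTF-8.
once = sloppy_cp1252_decode(original.encode("utf-8"))
# Step 3: the re-encoded bytes misread as CP1252 a second time.
twice = sloppy_cp1252_decode(once.encode("utf-8"))

# Repair: undo each misreading in reverse order.
repaired = sloppy_cp1252_encode(
    sloppy_cp1252_encode(twice).decode("utf-8")
).decode("utf-8")
```

Here `once.encode("utf-8")` yields the second set of byte values listed above, and `repaired` equals `original`.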
If you have Python installed, then the ftfy library can 'repair' text like this in a single step:
>>> import ftfy
>>> sample = '신영 안'
>>> ftfy.ftfy(sample)
'신영 안'
I used that library to tell me which codecs were used, and used its sloppy CP1252 decoder to produce the decoded text above.
E.g. for your input I used:
>>> ftfy.fixes.fix_encoding_and_explain(ftfy.fixes.unescape_html(sample))
('신영 안', [('encode', 'sloppy-windows-1252', 0), ('decode', 'utf-8', 0), ('encode', 'sloppy-windows-1252', 0), ('decode', 'utf-8', 0)])
to see the repair plan, and reversed it to explain how the Mojibake was produced in the first place.

Related

How to encode raw bytes?

We have a 2D datamatrix barcode which outputs as 12002052 (CR+LF after the value). When scanned into Chrome, the barcode triggers the downloads menu, which according to other posts is caused by the CR+LF. To troubleshoot, we generated a new 2D datamatrix barcode for 12002052 with an online generator. It scans successfully in Chrome (it doesn't trigger the downloads menu), but when scanned into notepad++ (showing all characters) it shows exactly the same output as the original/bad barcode.
I took an image of both the good and bad barcode and uploaded them to a datamatrix decoding website (zxing) and what is interesting is the last value in the "raw bytes" is different for each barcode
bad 2D
Raw text 12002052
Raw bytes 8e 82 96 b6 81
Barcode format DATA_MATRIX
Parsed Result Type TEXT
Parsed Result 12002052
good 2D
Raw text 12002052
Raw bytes 8e 82 96 b6 0b
Barcode format DATA_MATRIX
Parsed Result Type TEXT
Parsed Result 12002052
My question is: what exactly are the "raw bytes", and how could I possibly encode them to reverse engineer this and find what differentiates the two barcodes?
I believe that 'Raw bytes' refers to a byte array. A byte array is exactly what it sounds like: an array of bytes, each of which is 8 bits. So the raw bytes '8e 82 96 b6 0b' are hexadecimal representations of each byte.
That said, from the string you have provided, I do not get a byte array that matches the raw text input. (There are plenty of string-to-byte converters online.) Perhaps some character encoding other than ASCII or UTF-8 is used in this case.
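One possible reading, offered as an assumption and not something the zxing output confirms: the raw bytes may be Data Matrix codewords in the symbology's ASCII encodation mode, not character bytes. In that mode, codewords 1 to 128 stand for an ASCII character (value + 1), 129 is padding, and 130 to 229 stand for the digit pairs 00 to 99. A minimal sketch under that assumption, with `decode_datamatrix_ascii` as a hypothetical helper name:

```python
def decode_datamatrix_ascii(codewords: bytes) -> str:
    """Decode Data Matrix codewords, assuming plain ASCII encodation mode."""
    out = []
    for cw in codewords:
        if 1 <= cw <= 128:
            out.append(chr(cw - 1))         # single ASCII character
        elif cw == 129:
            break                           # pad codeword: end of data
        elif 130 <= cw <= 229:
            out.append(f"{cw - 130:02d}")   # two digits packed in one codeword
        else:
            raise ValueError(f"codeword {cw:#04x} is outside this sketch")
    return "".join(out)


bad = decode_datamatrix_ascii(bytes.fromhex("8e 82 96 b6 81"))
good = decode_datamatrix_ascii(bytes.fromhex("8e 82 96 b6 0b"))
```

Under this reading the first four codewords of both barcodes decode to 12002052, and only the final codeword differs: 81 is padding, while 0b decodes to a line feed character.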

Figuring out how a number is represented in hex form

I am currently trying to reverse engineer a file format that a CNC machine produces when backing up its programs, so that I can read the programs on a standard PC. I have opened a few of the backup files and can clearly see patterns of data, such as the program name, in plaintext form. One thing I am struggling with is how numbers are represented.
For example, the number '20' is represented in this file in hex form as '40 0D 03 00'.
More examples:
"-213.6287": "21 67 DF FF"
"-500.3366": "9A A7 B3 FF"
Any help with trying to figure out how these hex values make up those numbers?
Thanks
These numbers are stored as little-endian signed 32-bit integers that count ten-thousandths: reverse the four bytes, read them as a signed integer, and divide by 10,000.
for Example: the number '20' is represented in this file in hex form as '40 0D 03 00'.
0x00030d40 = 200000.
"-213.6287": "21 67 DF FF"
0xffdf6721 = -2136287.
"-500.3366": "9A A7 B3 FF"
0xffb3a79a = -5003366.
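The same conversion can be sketched with Python's `struct` module. The function names are illustrative, and the 10,000 scale factor is inferred from the three samples above, not taken from any documentation of the format.

```python
import struct


def decode_count(hex_bytes: str) -> int:
    """Read four bytes as a little-endian signed 32-bit integer."""
    (count,) = struct.unpack("<i", bytes.fromhex(hex_bytes))
    return count


def decode_value(hex_bytes: str) -> float:
    """Scale the integer count of ten-thousandths to the displayed number."""
    return decode_count(hex_bytes) / 10_000


samples = ["40 0D 03 00", "21 67 DF FF", "9A A7 B3 FF"]
values = [decode_value(s) for s in samples]
```

Keeping the integer count around (as `decode_count` does) avoids floating-point rounding if the values need to be written back to the file.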

What is the difference between binary and ASCII based file comparison?

If I use a file comparison tool like fc in Windows, you can choose between ASCII and binary comparison.
What is the actual difference between these two comparisons? If I compare two ASCII files, don't I want the binary data of the files to be identical?
WARNING: this is 5 year old loose remembrance of knowledge from uni
A binary comparison compares the files byte for byte, while an ASCII comparison treats the data as text. To take a simple case, the char 'A' is represented by 01000001, which is also the 8-bit integer 65, so 'A' and 65 are the same byte. If you compared the string A+A with the integers 65 43 65 (43 being the decimal code for '+'), they would be equivalent in binary, but not when viewed as ASCII text. This is a very loose explanation and I'm sure I missed a lot, but that should sum it up.
In a text file you want ASCII because you write in ascii characters. In say, a program state saved to a file you want binary to get a direct comparison.
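A rough illustration of the practical difference, as a sketch of the general idea and not of fc's exact algorithm: a binary comparison walks both files offset by offset, so one inserted line makes everything after it mismatch, while a line-based text comparison can resynchronize and report only the inserted line. The sample data is made up.

```python
from difflib import SequenceMatcher

old = b"alpha\nbeta\ngamma\n"
new = b"alpha\nINSERTED\nbeta\ngamma\n"

# Binary-style comparison: report every offset where the bytes differ.
byte_mismatches = [
    (offset, a, b) for offset, (a, b) in enumerate(zip(old, new)) if a != b
]

# Text-style comparison: split into lines and let the matcher resynchronize.
old_lines = old.decode("ascii").splitlines()
new_lines = new.decode("ascii").splitlines()
line_changes = [
    op
    for op in SequenceMatcher(None, old_lines, new_lines).get_opcodes()
    if op[0] != "equal"
]
```

`byte_mismatches` starts at offset 6 and continues to the end of the shorter file, whereas `line_changes` holds a single insert operation.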

In-memory layout of array in Turbo Pascal

We have an old application in Turbo Pascal which can save its internal state into a file, and we need to be able to read/write this file in a C# application.
The old application generates the file by dumping various in-memory data structures. In one place, the application just dumps a range of memory, and this memory range contains some arrays. I am trying to noodle out the purpose of the bytes immediately preceding the actual array elements. In particular, the first two items in the block can be represented as:
type
string2 = string[2];
stringarr2 = array[0..64] of string2;
string4 = string[4];
stringarr4 = array[0..64] of string4;
In the data file, I see the following byte sequence:
25 00 02 02 41 42 02 43 44 ...
The 25 is the number of elements in the array. The 02 41 42 is the first string element, "AB"; the 02 43 44 is the second string element, "CD", and so on. I don't know what the 00 02 between the array element count and the first array element refers to. It's possible the array element count is 25 00 and the element size is 02, but each array element is actually 3 bytes in size.
In the place in the file where the array of 4-character strings starts, I see the following:
25 00 04 00 00 04 41 42 43 44 04 45 46 47 48
Again, there's the 25 which is the number of elements in the array; 04 41 42 43 44 is the first element in the array, "ABCD", and so on. In between there are the bytes 00 04 00 00. Maybe they are flags. Maybe they are some kind of indication of the shape of the array (but I don't see how 02 and 04 both indicate a one-dimensional array).
I don't have access to Turbo Pascal to try writing different kinds of arrays to a file, and don't have authorization to install something like Free Pascal, so my opportunities for experimentation along those lines are very limited.
These arrays are not dynamic, since Turbo Pascal didn't have them.
Thanks in advance for any dusty memories.
Pascal arrays have no bookkeeping data. You have an array of five-byte data structures (string[4]), so an array of 65 of them occupies 65*5=325 bytes. If the program wrote more than that, then it's because the program took special measures to write more. The "extra" values weren't just sitting in memory that the program happened to write to disk when it naively wrote the whole data structure with SizeOf. Thus, the only way to know what those bytes mean is to find the source code or the documentation. Merely knowing that it's Turbo Pascal is no help.
It's possible that the first section of the file is intentionally the same size as all the other array elements. For the two-character strings, the "header" is three bytes, and for the four-character strings, the "header" is five bytes, the same as the size of the strings. That would have allowed the program to use a file of string4 data type for the file, and then just skip the file's first record. The zero between the file length and the string length in the header might belong to either of those fields, and the remaining two zero bytes might just be filler.
Besides the layout of the individual strings of characters in the file, you will also need to consider which code page those single-byte characters come from. A C# char is a 2-byte UTF-16 code unit.
If you're lucky, the original file data contains only ASCII 7 bit characters, which covers characters of the English alphabet. If the original data contains "European" letters such as umlauts or accented characters, these will be "high ascii" char values in the range 128..255. You'll need to perform an encoding conversion to see these characters correctly in C#. Code page 1252 Windows Latin 1 would be a good starting point.
If the original file data contains Japanese, Chinese, Korean, Thai, or characters from other "Eastern" scripts, you have a lot of work ahead of you.
Turbo Pascal strings are prefixed with a length byte. So a string[2] is actually 3 bytes: length, char1 and char2. An array of string[2] will hold all the strings one by one directly after each other in memory. If you do a blockwrite with the array as a parameter it will immediately start with the first string, it will not write any headers etc. So if you have the source you should be able to see what it writes before the array.
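The record layout described above (one length byte followed by a fixed number of character bytes) can be read with a few lines of Python. This is a sketch only: `parse_pascal_string_array` is a hypothetical helper, it assumes any header bytes have already been skipped, and the `cp1252` default simply follows the code page suggested earlier in this thread.

```python
def parse_pascal_string_array(
    data: bytes, max_len: int, count: int, encoding: str = "cp1252"
) -> list[str]:
    """Read `count` consecutive string[max_len] records from `data`."""
    record_size = max_len + 1  # one length byte plus max_len character bytes
    strings = []
    for i in range(count):
        record = data[i * record_size:(i + 1) * record_size]
        if len(record) < record_size:
            raise ValueError(f"record {i} is truncated")
        length = record[0]
        if length > max_len:
            raise ValueError(f"record {i} has invalid length byte {length}")
        # Bytes past `length` are unused filler and are ignored.
        strings.append(record[1:1 + length].decode(encoding))
    return strings


two_char = parse_pascal_string_array(bytes.fromhex("02 41 42 02 43 44"), 2, 2)
four_char = parse_pascal_string_array(
    bytes.fromhex("04 41 42 43 44 04 45 46 47 48"), 4, 2
)
```

The length check doubles as a sanity test: if the header size is guessed wrong, the first "length byte" is likely to exceed `max_len` and the function raises instead of returning garbage.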

Help Me Understand This Binary File Format

I am attempting to write a small utility to produce a binary file that mimics the one produced by another closed application. I've used hex editors to inspect the format, but I'm stuck trying to understand what the format/encoding is so that I can produce it using C++ or C#.
The file starts with the first four bytes: 01 00 followed by FF FE. My understanding is that the file begins with SOH followed by the byte order mark for little endian. After these four bytes, the program appears to write BSTR's for each of the string fields from the app's GUI.
Using C#, I have produced a unicode file that starts with FF FE, but I'm not sure how to insert the SOH character first.
I would be forever grateful if someone could offer insight to the file format or encoding and why the file starts with the SOH character.
Thank you in advance.
Reverse engineering a binary file format can be a challenging task. On the surface, I don't recognize this as an obvious, well-known file format ... but there are thousands out there, so who knows.
Legal issues aside, I would suggest you look at some of the following resources that talk about approaches to such an endeavor:
How To Crack a Binary File Format
Tools to Reverse Engineer Binary Files
Basics of Reverse Engineering File Formats
File Format Reverse Engineering
If you are just having trouble writing out the first four bytes this will do it for you.
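using System.IO;
using System.Text;

using (var stream = new FileStream("myfile.bin", FileMode.Create))
{
    using (var binaryWriter = new BinaryWriter(stream))
    {
        binaryWriter.Write((byte)1);     // SOH
        binaryWriter.Write((byte)0);
        binaryWriter.Write((byte)0xFF);  // FF FE: UTF-16 little-endian byte order mark
        binaryWriter.Write((byte)0xFE);
        // Encoding.Unicode is UTF-16LE; writing a byte[] adds no length prefix.
        binaryWriter.Write(Encoding.Unicode.GetBytes("string"));
    }
}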
This will output the following file
01 00 FF FE 73 00 74 00 72 00 69 00 6e 00 67 00 ....s.t.r.i.n.g.
Edit: Added Mark H's suggestion for writing out a string.