I am reading a binary file written as 16-bit integers (little-endian and signed).
I successfully read the file and got the correct values from the conversion from bytes to integers. But there are some characters that I don't understand, so I hope someone can explain them to me :)
b'\xff\xff' gives me -1, which is correct, and I understand that \x indicates a hexadecimal character escape.
b'\x00\x00' gives 0, which makes sense.
b'v\x1d' gives 7542, which is the correct value (I know this because I know the value I should get, and this is it), but I don't understand the meaning of the 'v'. What does it signify? I found an ASCII-to-binary character table on the web in which 'v' is 01110110. If we take this value for 'v' and treat '\x1d' as 00011101, then we have 01110110 00011101, which is not 7542 but 30237, so the 'v' seems wrong here...
b'K\x1d' gives 7499. Same here: the value is correct, but I do not understand the 'K'.
So if anyone can explain to me what the 'v' and the 'K' mean, that would be great!
Thank you.
Your guess regarding 'K' and 'v' is half right: since the data is little-endian, the least significant byte comes first:
'v' is 0x76, so b'v\x1d' = 0x1D76 = 7542
'K' is 0x4B, so b'K\x1d' = 0x1D4B = 7499
Related
I was looking over some hex data the other day and I got a bit confused by something.
Say I see the hex code #41, which is 65 in decimal or 0100 0001 in binary.
Fine!
But what confuses me is that #41 is the code for the letter A in ASCII.
So when I was looking at the stream of hex bytes in Sublime, it actually displayed it as "A" and not the number 65.
So the confusion is: how did it know to represent this hex or binary as the letter A instead of the integer 65? Is there some kind of flag in the binary that Sublime used to decide whether it should show the character or the integer?
In other words, if someone gave me a byte of binary, how do I determine whether they wanted me to see it as ASCII or as an integer, without them actually telling me?
I believe the answer to this question (albeit very late) is that the ASCII code for the letter A sits at index position 65 of the character set.
The number 6 would be 0000 0110, or just "6". "65" in a text string is just a collection of individual digit characters, not something like an int data type.
Admittedly, I don't know how you'd handle identifying whether someone is asking you for the ASCII value of the hex or for the denary value of the hex. I'm still too new to this concept.
Here's where I derived my answer: https://www.bbc.co.uk/bitesize/guides/zp73wmn/revision/5
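To see the ambiguity concretely, here is a minimal Python sketch (Python is only used for illustration; the question doesn't mention a language) showing the same byte read both ways:

b = bytes([0x41])          # a single byte with value 0x41 / 65

print(b[0])                # 65   -> the byte treated as an integer
print(b.decode('ascii'))   # A    -> the same byte treated as ASCII text
print(hex(b[0]))           # 0x41 -> the same byte shown in hexadecimal

Nothing in the byte itself says which view is "right"; the editor simply chose to render it as text.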
Instead of writing ffff, why is the syntax for writing hexadecimal numbers like 0*ffff? What is the meaning of the "0*"? Does it specify something?
Anyhow, the A, B, C, D, E, F notation exists only in the hexadecimal number system, so what's the need for the "0*"?
Sorry, "*" was not the character I meant; it is "x".
Is it a nomenclature or notation for hexadecimal number systems?
I don't know what language you are talking about, but if, for example, you write this in C#:
var ffffff = "Some unrelated string";
...
var nowYouveDoneIt = ffffff;
what do you expect to happen? How does the compiler know whether ffffff refers to the hexadecimal representation of the decimal number 16777215 or to the string variable defined earlier?
Since identifiers (in C#) can't begin with a number, prefixing with a 0 and some other character (in C# it's 0xffffff for hex and 0b111111111111111111111111 for binary, IIRC) is a handy way of communicating what base the number literal is in.
EDIT: Another issue: if you were to write var myCoolNumber = 10, how would you have ANY way of knowing whether that means 2, 10 or 16? Or something else entirely?
It's typically 0xFFFF: the letter, not the multiplication symbol.
As for why, 0x is just the most common convention, like how some programming languages allow binary to be prefixed by 0b. Prefixing a number with just 0 is typically reserved for octal, or base 8; they wanted a way to tell the machine that the following number is in hexadecimal, or base 16 (10 != 0b10 [2] != 010 [8] != 0x10 [16]). They typically omitted a small 'o' from identifying octal for human readability purposes.
Interestingly enough, most Assembly-based implementations I've come across use (or at least allow the use of) 0h instead or as well.
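For what it's worth, the same idea is easy to see in Python, where the prefix (not the digits) tells the parser which base a literal is in; this is just an illustration, since the original question doesn't name a language:

print(10)       # 10    decimal
print(0b10)     # 2     binary
print(0o10)     # 8     octal (Python 3 requires 0o; a bare leading 0 is a syntax error)
print(0x10)     # 16    hexadecimal
print(0xffff)   # 65535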
It's there to indicate that the number is in hex. It's not '*', it's actually 'x'.
See:
http://www.tutorialspoint.com/cprogramming/c_constants.htm
How do you convert the alphabet to binary? I searched on Google and it says to first convert each letter to its ASCII numeric value and then convert that numeric value to binary. Is there any other way to convert?
And if that's the only way, then are the binary values of "A" and 65 the same?
Because the ASCII value of 'A' = 65, and when converted to binary it's 01000001,
and 65 = 01000001.
That is indeed the way in which text is converted to binary.
And to answer your second question, yes, it is true that the binary values of 'A' and 65 are the same. If you are wondering how the CPU distinguishes between "A" and "65" in that case, you should know that it doesn't. It is up to your operating system and program to decide how to treat the data at hand. For instance, say your memory looked like the following, starting at 0 on the left and incrementing to the right:
00000001 00001111 00000001 01100110
This binary data could mean anything, and only has a meaning in the context of whatever program it is in. In a given program, you could have it be read as:
1. An integer, in which case you'll get one number.
2. Character data, in which case you'll output 4 ASCII characters.
In short, binary is read by CPUs, which do not understand the context of anything and simply execute whatever they are given. It is up to your program/OS to specify instructions in order for data to be handled properly.
Thus, converting the alphabet to binary depends on the program in which you are doing so; outside the context of a program/OS, converting the alphabet to binary is really the exact same thing as converting a sequence of numbers to binary, as far as a CPU is concerned.
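As a minimal Python sketch of the conversion described above (letter -> ASCII code -> binary string; any language with an ord-style function works the same way):

for ch in "ABC":
    code = ord(ch)                        # 'A' -> 65, 'B' -> 66, 'C' -> 67
    print(ch, code, format(code, '08b'))  # A 65 01000001, B 66 01000010, C 67 01000011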
The number 65 in decimal is 0100 0001 in binary, and it refers to the letter A in the binary alphabet (ASCII) table: https://www.bin-dec-hex.com/binary-alphabet-the-alphabet-letters-in-binary. The easiest way to convert the alphabet to binary is to use an online converter, or you can do it manually with a binary alphabet table.
This is from the main Tcl doc:
\xhh The hexadecimal digits hh give an eight-bit hexadecimal value for the
Unicode character that will be inserted. Any number of hexadecimal digits may be
present; however, **all but the last two are ignored** (the result is always a
one-byte quantity).
My doubt is about this part: all but the last two are ignored. Here is my experiment:
>set a "\x22"
"
>set a "\x2230"
"30
So you can see that it is the first 2 hexadecimal digits that are taken, and the rest are just treated as plain characters.
Am I missing something?
[EDIT] Looks like I am right; here is the relevant part of parser.c from Tcl 8.6:
case 'x':
    count += TclParseHex(p+1, (numBytes > 3) ? 2 : numBytes-2, &result);
So only the 2 digits immediately after \x are taken. Weird, how come nobody has caught this doc error.
This is a place where the behaviour changed from Tcl 8.5 (and before) to 8.6. It was a bug fix because the old behaviour was so damn weird that nobody ever expected it. (Or the Spanish Inquisition, but I digress…)
In 8.6, the documentation says:
\xhh
The hexadecimal digits hh (one or two of them) give an eight-bit hexadecimal value for the Unicode character that will be inserted. The upper bits of the Unicode character will be 0.
In 8.5, the documentation says:
\xhh
The hexadecimal digits hh give an eight-bit hexadecimal value for the Unicode character that will be inserted. Any number of hexadecimal digits may be present; however, all but the last two are ignored (the result is always a one-byte quantity). The upper bits of the Unicode character will be 0.
The difference is plain, and 8.5 and 8.6 behave differently here. The change was due to TIP #388 “Extending Unicode literals past the BMP” (part of a general programme of fixes, some of which had to be postponed to after 8.6 due to the impact on the ABI) which was voted on in September 2011; project lead was Jan Nijtmans.
I remember voting for that TIP, and that fix was something I was very glad was in there.
Sorry it wasn't flagged as a Potential Incompatibility. Missed that one (probably because the old behaviour was so badly broken that nobody really believed that we hadn't fixed it long before…)
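For anyone who wants to see the two documented rules side by side, here is a minimal Python sketch that imitates them; it is not Tcl's actual parser, just an illustration of the substitution rules quoted above:

import re

def subst_85(s):
    # Tcl 8.5 rule: consume every hex digit after \x, keep only the last two.
    return re.sub(r'\\x([0-9a-fA-F]+)',
                  lambda m: chr(int(m.group(1)[-2:], 16)), s)

def subst_86(s):
    # Tcl 8.6 rule: consume at most two hex digits; the rest stays literal.
    return re.sub(r'\\x([0-9a-fA-F]{1,2})',
                  lambda m: chr(int(m.group(1), 16)), s)

print(subst_86(r'\x2230'))   # "30  (matches the experiment in the question)
print(subst_85(r'\x2230'))   # 0    (old behaviour: last two digits, 0x30, i.e. '0')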
This question is language agnostic but is inspired by these C/C++ questions.
How to convert a single char into an int
Char to int conversion in C
Is it safe to assume that the characters for digits (0123456789) appear contiguously in all text encodings?
i.e. is it safe to assume that
'9'-'8' = 1
'9'-'7' = 2
...
'9'-'0' = 9
in all encodings?
I'm looking forward to a definitive answer to this one :)
Thanks,
Update: OK, let me limit "all encodings" to mean anything as old as ASCII and/or EBCDIC and afterwards. Sanskrit I'm not so worried about . . .
I don't know about all encodings, but at least in ASCII and <shudder> EBCDIC, the digits 0-9 all come consecutively and in increasing numeric order. Which means that all ASCII- and EBCDIC-based encodings should also have their digits in order. So for pretty much anything you'll encounter, barring Morse code or worse, I'm going to say yes.
You're going to find it hard to prove a negative. Nobody can possibly know every text encoding ever invented.
All encodings in common use today (except EBCDIC, is it still in common use?) are supersets of ASCII. I'd say you're more likely to win the lottery than you are to find a practical environment where the strict ordering of '0' to '9' doesn't hold.
Both the C++ Standard and the C standard require that this be so, for C++ and C program text.
According to K&R ANSI C it is.
Excerpt:
..."This particular program relies on the properties of the character representation of the digits. For example, the test
if (c >= '0' && c <= '9') ...
determines whether the character in c is a digit. If it is, the numeric value of that
digit is
c - '0'
This works only if '0', '1', ..., '9' have consecutive increasing values. Fortunately, this is true for all character sets...."
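For illustration, here is the Python equivalent of that K&R idiom (same reasoning, just a different language than the C excerpt):

def digit_value(c):
    # Relies on '0'..'9' occupying consecutive code points, which holds
    # in ASCII, in EBCDIC, and in every ASCII-compatible encoding.
    if '0' <= c <= '9':
        return ord(c) - ord('0')
    raise ValueError(repr(c) + " is not a decimal digit")

print(digit_value('7'))      # 7
print(ord('9') - ord('0'))   # 9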
All text encodings I know of typically order each representation of digits sequentially. However, your question becomes a lot broader when you include all of the other representations of digits in other encodings, such as the full-width digits used in Japanese text: １２３４５６７８９０. Notice how the characters for the numbers are different? Well, they are actually different code points. So I really think the answer to your question is a hard maybe, since there are so many encodings out there and they have multiple representations of digits in them.
A better question is to ask yourself, why do I need to count on digits to be in sequential code points in the first place?
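As a small illustration of the "different code points" point above: the full-width digits used in East Asian text are contiguous within their own Unicode block, but they are not interchangeable with ASCII '0'-'9' (a Python sketch, assuming a Unicode-aware environment):

print(hex(ord('０')), hex(ord('９')))   # 0xff10 0xff19  (FULLWIDTH DIGIT ZERO / NINE)
print(ord('９') - ord('０'))            # 9      -> contiguous within their block
print(ord('９') - ord('0'))             # 65257  -> not the same block as ASCII '9'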