Is the doc wrong about interpreting \xhh - tcl

This is from the main TCL doc:
\xhh The hexadecimal digits hh give an eight-bit hexadecimal value for the
Unicode character that will be inserted. Any number of hexadecimal digits may be
present; however, **all but the last two are ignored** (the result is always a
one-byte quantity).
My doubt is about this part: "all but the last two are ignored". Here is my experiment:
>set a "\x22"
"
>set a "\x2230"
"30
So you can see that it is the first two hexadecimal digits that are taken, and the rest are treated as plain characters.
Am I missing something?
[EDIT] Looks like I am right; here is the relevant code from parser.c of Tcl 8.6:
860 case 'x':
861 count += TclParseHex(p+1, (numBytes > 3) ? 2 : numBytes-2, &result);
So only the two digits immediately after \x are taken. Weird; how come nobody has noticed this doc error?

This is a place where the behaviour changed from Tcl 8.5 (and before) to 8.6. It was a bug fix because the old behaviour was so damn weird that nobody ever expected it. (Or the Spanish Inquisition, but I digress…)
In 8.6, the documentation says:
\xhh
The hexadecimal digits hh (one or two of them) give an eight-bit hexadecimal value for the Unicode character that will be inserted. The upper bits of the Unicode character will be 0.
In 8.5, the documentation says:
\xhh
The hexadecimal digits hh give an eight-bit hexadecimal value for the Unicode character that will be inserted. Any number of hexadecimal digits may be present; however, all but the last two are ignored (the result is always a one-byte quantity). The upper bits of the Unicode character will be 0.
The difference is plain, and 8.5 and 8.6 behave differently here. The change was due to TIP #388 “Extending Unicode literals past the BMP” (part of a general programme of fixes, some of which had to be postponed to after 8.6 due to the impact on the ABI) which was voted on in September 2011; project lead was Jan Nijtmans.
I remember voting for that TIP, and that fix was something I was very glad was in there.
Sorry it wasn't flagged as a Potential Incompatibility. Missed that one (probably because the old behaviour was so badly broken that nobody really believed that we hadn't fixed it long before…)

Related

Reading binary as decimal or ascii confusion

I was looking over some hex data the other day and I got a bit confused by something.
If I see the hex code #41, that's 65 in decimal or 0100 0001 in binary.
Fine!
But, what confuses me is that #41 is the code for letter A in ascii.
So when I was looking at the stream of hex bytes in Sublime it actually picked it up as “A” and not the number 65.
So the confusion is, how did it know to represent this hex or binary as the letter A instead of the integer 65? Is there some kind of flag in the binary that sublime used to determine if it should show the character or the integer?
In other words, if someone gave me a byte of binary how do I then determine if they wanted me to see it as ascii or an integer without them actually telling me?
I believe the answer to this question (albeit very late) is that the ASCII code for the letter A sits at position 65 of the character set, so anything that decodes the bytes as ASCII text shows “A” rather than 65.
Note that the integer 6 would be 0000 0110, while the character "6" has its own ASCII code (0x36). "65" in a text string is just a collection of individual digit characters, not something like an int data type.
Admittedly, I don't know how you'd handle identifying whether someone is asking you for the ASCII character of the hex, or the denary value of the hex. I'm still too new to this concept.
Here's where I derived my answer: https://www.bbc.co.uk/bitesize/guides/zp73wmn/revision/5
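To make the decoding question concrete, here is a minimal C sketch (C just for illustration): the byte 0x41 carries no flag saying “character” or “number”; whatever program displays it simply chooses an interpretation.

    #include <stdio.h>

    int main(void) {
        unsigned char byte = 0x41;                /* the same eight bits: 0100 0001 */

        printf("as a character: %c\n", byte);     /* A    */
        printf("as an integer:  %d\n", byte);     /* 65   */
        printf("as hexadecimal: 0x%02X\n", byte); /* 0x41 */
        return 0;
    }

A hex viewer in “text” mode is effectively doing the first printf; in “numeric” mode, the second.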

What's the exact meaning of the statement "Since ASCII used 7 bits for the character, it could only represent 128 different characters"?

I came across the statement below while studying HTML character sets and character encoding:
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
When we convert any decimal value from the ASCII character set to its binary equivalent, it comes down to a binary number at most 7 bits long.
E.g. for the capital English letter 'E', the decimal value 69 exists in the ASCII table. If we convert 69 to its binary equivalent, it comes down to the 7-bit binary number 1000101.
Then why is it shown in the ASCII table as the 8-bit binary number 01000101 instead of the 7-bit binary number 1000101?
This is contradictory to the statement
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
The above statement is saying that ASCII used 7 bits for the character.
Please clear up my confusion about the binary equivalent of a decimal value: should I consider the 7-bit or the 8-bit binary equivalent of a decimal value from the ASCII table? Please explain in easy-to-understand language.
Again, consider the statement below:
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
According to the above statement, how does the number of characters (128) that ASCII supports relate to the fact that ASCII uses 7 bits to represent any character?
Please clear the confusion.
Thank You.
In most processors, memory is byte-addressable and not bit-addressable. That is, a memory address gives the location of an 8-bit value. So, almost all data is manipulated in multiples of 8 bits at a time.
If we were to store a value that has by its nature only 7 bits, we would very often use one byte per value. If the data is a sequence of such values, as text might be, we would still use one byte per value to make counting, sizing, indexing and iterating easier.
When we describe the value of a byte, we often show all of its bits, either in binary or hexadecimal. If a value is some sort of integer (say of 1, 2, 4, or 8 bytes) and its decimal representation would be more understandable, we would write the decimal digits for the whole integer. But in those cases, we might lose the concept of how many bytes it is.
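As a small illustration of the paragraph above (a C sketch, assuming an ASCII system), the letter 'E' needs only 7 significant bits but is stored and shown as a full 8-bit byte:

    #include <stdio.h>

    int main(void) {
        unsigned char c = 'E';                 /* ASCII 69: needs only 7 bits */

        printf("decimal: %d\n", c);            /* 69   */
        printf("hex:     0x%02X\n", c);        /* 0x45 */

        printf("binary:  ");
        for (int bit = 7; bit >= 0; bit--)     /* print the whole 8-bit byte */
            putchar(((c >> bit) & 1) ? '1' : '0');
        putchar('\n');                         /* 01000101: 7 significant bits, padded to one byte */
        return 0;
    }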
BTW—HTML doesn't have anything to do with ASCII. And, Extended ASCII isn't one encoding. The fundamental rule of character encodings is to read (decode) with the encoding the text was written (encoded) with. So, a communication consists of the transferring of bytes and a shared understanding of the character encoding. (That makes saying "Extended ASCII" so inadequate as to be nearly useless.)
An HTML document represents a sequence of Unicode characters. So, one of the Unicode character encodings (UTF-8) is the most common encoding for an HTML document. Regardless, after it is read, the result is Unicode. An HTML document could be encoded in ASCII, but why do that? If you did know it was ASCII, you would also know it is valid UTF-8, since ASCII is a subset of UTF-8.
Outside of HTML, ASCII is used billions—if not trillions—of times per second. But, unless you know exactly how it pertains to your work, forget about it, you probably aren't using ASCII.

Why are leading zeroes used to represent octal numbers?

I've always wondered why leading zeroes (0) are used to represent octal numbers, instead of — for example — 0o. The use of 0o would be just as helpful, but would not cause as many problems as leading 0s (e.g. parseInt('08'); in JavaScript). What are the reason(s) behind this design choice?
All modern languages import this convention from C, which imported it from B, which imported it from BCPL.
Except BCPL used #1234 for octal and #x1234 for hexadecimal. B departed from this convention because # was a unary operator in B (integer to floating-point conversion), so #1234 could not be used, and # as a base indicator was replaced with 0.
The designers of B tried to make the syntax very compact. I guess this is the reason they did not use a two-character prefix.
Worth noting that in Python 3.0, they decided that octal literals must be prefixed with '0o' and the old '0' prefix became a SyntaxError, for the exact reasons you mention in your question
https://www.python.org/dev/peps/pep-3127/#removal-of-old-octal-syntax
"0b" is often used for binary rather than for octal. The leading "0" is, I suspect for "O -ctal".
If you know what base you are parsing, pass it explicitly: parseInt('08', 10); makes it treat the number as base ten.
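For concreteness, a small C sketch of the convention the answer above traces back through B and BCPL; the leading 0 alone changes the base:

    #include <stdio.h>

    int main(void) {
        int dec = 10;     /* plain decimal: ten              */
        int oct = 010;    /* leading 0: octal, so eight      */
        int hex = 0x10;   /* 0x prefix: hexadecimal, sixteen */

        printf("%d %d %d\n", dec, oct, hex);  /* prints: 10 8 16 */
        return 0;
    }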

How to use the alphabet binary symbols

I was reading an article on binary numbers and it had some practice problems at the end, but it didn't give the solutions to the problems. The last one is "How many bits are required to represent the alphabet?". Can you tell me the answer to that question and briefly explain why?
Thanks.
You would only need 5 bits because you are counting to 26 (if we take only uppercase or only lowercase letters). 5 bits will count up to 31, so you've actually got more space than you need. You can't use 4 because that only counts to 15.
If you want both upper and lowercase then 6 bits is your answer - 6 bits will happily count to 63, while your double alphabet has (2 * 26 = 52) characters, again leaving plenty of headroom.
It depends on your definition of alphabet. If you want to represent one character from the 26-letter Roman alphabet (A-Z), then you need log2(26) ≈ 4.7 bits. Obviously, in practice, you'll need 5 bits.
However, given an infinite stream of characters, you could theoretically come up with an encoding scheme that got close to 4.7 bits (there just won't be a one-to-one mapping between individual characters and bit vectors any more).
If you're talking about representing actual human language, then you can get away with a far lower number than this (in the region of 1.5 bits/character), due to redundancy. But that's too complicated to get into in a single post here... (Google keywords are "entropy", and "information content").
There are 26 letters in the alphabet, so 2^5 = 32 gives the minimum word length (5 bits) that can contain all the letters.
How direct does the representation need to be? If you need 1:1 with no translation layer, then 5 bits will do. But if a translation layer is an option, then you can get away with less. Morse code, for example, can do it in 3 bits. :)
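The arithmetic behind these answers, as a small C sketch (the helper name bits_needed is just for illustration):

    #include <stdio.h>

    /* Smallest number of bits b such that 2^b can hold n distinct values. */
    static int bits_needed(int n) {
        int bits = 0;
        while ((1 << bits) < n)
            bits++;
        return bits;
    }

    int main(void) {
        printf("26 letters -> %d bits (2^%d = %d)\n",
               bits_needed(26), bits_needed(26), 1 << bits_needed(26));
        printf("52 letters -> %d bits (2^%d = %d)\n",
               bits_needed(52), bits_needed(52), 1 << bits_needed(52));
        return 0;
    }

That prints 5 bits (32 combinations) for a single-case alphabet and 6 bits (64 combinations) for both cases.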

Are digits represented in sequence in all text encodings?

This question is language agnostic but is inspired by these c/c++ questions.
How to convert a single char into an int
Char to int conversion in C
Is it safe to assume that the characters for digits (0123456789) appear contiguously in all text encodings?
i.e. is it safe to assume that
'9'-'8' = 1
'9'-'7' = 2
...
'9'-'0' = 9
in all encodings?
I'm looking forward to a definitive answer to this one :)
Thanks,
Update: OK, let me limit all encodings to mean anything as old as ASCII and/or EBCDIC and afterwards. Sanskrit I'm not so worried about . . .
I don't know about all encodings, but at least in ASCII and <shudder> EBCDIC, the digits 0-9 all come consecutively and in increasing numeric order. Which means that all ASCII- and EBCDIC-based encodings should also have their digits in order. So for pretty much anything you'll encounter, barring Morse code or worse, I'm going to say yes.
You're going to find it hard to prove a negative. Nobody can possibly know every text encoding ever invented.
All encodings in common use today (except EBCDIC, is it still in common use?) are supersets of ASCII. I'd say you're more likely to win the lottery than you are to find a practical environment where the strict ordering of '0' to '9' doesn't hold.
Both the C++ Standard and the C standard require that this be so, for C++ and C program text.
According to K&R ANSI C it is.
Excerpt:
..."This particular program relies on the properties of the character representation of the digits. For example, the test
if (c >= '0' && c <= '9') ...
determines whether the character in c is a digit. If it is, the numeric value of that
digit is
c - '0'
This works only if '0', '1', ..., '9' have consecutive increasing values. Fortunately, this is true for all character sets...."
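A minimal C sketch of that idiom, relying only on the guarantee quoted above (the digit characters are consecutive, so subtracting '0' yields the numeric value):

    #include <stdio.h>

    int main(void) {
        const char *s = "9870";

        for (const char *p = s; *p != '\0'; p++) {
            if (*p >= '0' && *p <= '9') {   /* the K&R test quoted above */
                int value = *p - '0';       /* works because '0'..'9' are consecutive */
                printf("'%c' -> %d\n", *p, value);
            }
        }
        return 0;
    }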
All text encodings I know of typically order each representation of digits sequentially. However, your question becomes a lot broader when you include all of the other representations of digits in other encodings, such as the full-width digits used in Japanese text: １２３４５６７８９０. Notice how the characters for the numbers are different? Well, they are actually different code points. So, I really think the answer to your question is a hard maybe, since there are so many encodings out there and they have multiple representations of digits in them.
A better question is to ask yourself, why do I need to count on digits to be in sequential code points in the first place?