What does the SSIS CODEPOINT value 26 represent

In one of the existing SSIS projects, I found a Conditional Split with the expression CODEPOINT(column)==26.
But I couldn't find what the value "26" represents. When I looked up the CODEPOINT values, letters of the alphabet start at 65 and the digits 0-9 start at 48.

The documentation for the CODEPOINT expression merely states the following:
Returns the Unicode code point of the leftmost character of a character expression
Exploring Wikipedia turned up Unicode encodings:
The first 128 Unicode code points, U+0000 to U+007F, are used for the C0 Controls and Basic Latin characters and correspond one-to-one to their ASCII-code equivalents.
Sweet. Putting that together and consulting the ASCII table allows me to determine that 26 is the SUB/Substitute control character: Ctrl-Z for those really wanting to try this at home (and working under a Unix(tm) variant).
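A quick way to double-check this outside SSIS is to look the code point up directly; the snippet below is a Python sketch of my own, not anything from the original project:

# Code point 26 is the SUB (Substitute) control character, i.e. Ctrl-Z.
ch = chr(26)
print(ord(ch))             # 26
print(ch == "\x1a")        # True: decimal 26 is hex 0x1A
print(ord("A"), ord("0"))  # 65 48: letters start at 65, digits at 48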

Related

What's the exact meaning of the statement "Since ASCII used 7 bits for the character, it could only represent 128 different characters"?

I came across the statement below while studying HTML character sets and character encoding:
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
When we convert any decimal value from the ASCII character set to its binary equivalent, it comes out as a binary number that is at most 7 bits long.
For example, the capital letter 'E' has the decimal value 69 in the ASCII table. If we convert 69 to its binary equivalent, we get the 7-bit binary number 1000101.
Then why is it shown in the ASCII table as the 8-bit binary number 01000101 instead of the 7-bit binary number 1000101?
This seems to contradict the statement:
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
The above statement is saying that ASCII used 7 bits for the character.
Please clear up my confusion about the binary equivalent of a decimal value: should I consider the 7-bit or the 8-bit binary equivalent of a decimal value from the ASCII table? Please explain it in easy-to-understand language.
Again, consider the statement below:
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
According to the above statement, how does the number of characters (128) that ASCII supports relate to the fact that ASCII uses 7 bits to represent any character?
Please clear the confusion.
Thank You.
In most processors, memory is byte-addressable and not bit-addressable. That is, a memory address gives the location of an 8-bit value. So, almost all data is manipulated in multiples of 8 bits at a time.
If we were to store a value that has by its nature only 7 bits, we would very often use one byte per value. If the data is a sequence of such values, as text might be, we would still use one byte per value to make counting, sizing, indexing and iterating easier.
When we describe the value of a byte, we often show all of its bits, either in binary or hexadecimal. If a value is some sort of integer (say of 1, 2, 4, or 8 bytes) and its decimal representation would be more understandable, we would write the decimal digits for the whole integer. But in those cases, we might lose the concept of how many bytes it is.
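To make the 7-bit versus 8-bit point concrete, here is a small illustration in Python (the language is my choice for the example, not something from the question):

# 'E' has code 69 in ASCII.
print(format(69, "b"))    # 1000101   - the value itself needs only 7 bits
print(format(69, "08b"))  # 01000001 is wrong; this prints 01000101 - the same value stored in an 8-bit byte
print(2 ** 7)             # 128       - 7 bits give 128 distinct values, hence 128 ASCII characters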
BTW—HTML doesn't have anything to do with ASCII. And, Extended ASCII isn't one encoding. The fundamental rule of character encodings is to read (decode) with the encoding the text was written (encoded) with. So, a communication consists of the transferring of bytes and a shared understanding of the character encoding. (That makes saying "Extended ASCII" so inadequate as to be nearly useless.)
An HTML document represents a sequence of Unicode characters. So, one of the Unicode character encodings (UTF-8) is the most common encoding for an HTML document. Regardless, after it is read, the result is Unicode. An HTML document could be encoded in ASCII, but why do that? If you did know it was ASCII, you could just as easily know that it's UTF-8, since every ASCII file is also valid UTF-8.
Outside of HTML, ASCII is used billions, if not trillions, of times per second. But unless you know exactly how it pertains to your work, forget about it; you probably aren't using ASCII.

Converting from lowercase to uppercase using decimal/binary representation of alphabets

I'm using RISC-V and I am limited to using just and, or, xori, addition, subtraction, multiplication, and division of integer values.
So for instance, the letter "a" will be represented as 97 and "aa" will be represented as 24929, and so on. The UI converts the binary sequence into its decimal representation, and I cannot directly modify the n-th bit.
Is there any way I can find a simple, general equation for converting the decimal representation of a string of at most 8 letters from lowercase to uppercase?
Also, I forgot to add, I can't partition the string into individual letters either. Maybe it's possible, but I don't know how to do it.
Letters or characters are usually represented as byte values, which are easier to read in hexadecimal. This can be seen if you convert 97 and 24929 to hex.
You did not mention the system which was used to encode the characters; mentioning the value for one character is not definitive. Assuming your letters are encoded as ASCII, find an ASCII table and figure out the DIFFERENCE between upper- and lowercase character codes.
Use this knowledge to design an algorithm to transform lowercase character codes to uppercase.
A good uppercase conversion algorithm will not modify characters that are not lowercase letters.
This can be extended to a SIMD-style approach if you are careful to avoid carries between bytes when you need to add or subtract.
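To make that concrete, here is a rough sketch in Python (not RISC-V, and assuming ASCII encoding with up to 8 bytes packed most-significant-byte first, as in the "aa" = 24929 example). Each byte is isolated with division, multiplication and subtraction rather than bit addressing, and the conversion itself is just subtracting 32 ('a' minus 'A') in that byte's position:

def packed_to_upper(n, width=8):
    # n packs up to `width` ASCII bytes, e.g. "aa" -> 0x6161 -> 24929.
    result = n
    power = 1
    for _ in range(width):
        # Isolate one byte using only division, multiplication and subtraction.
        byte = (n // power) - (n // (power * 256)) * 256
        if 97 <= byte <= 122:        # lowercase 'a'..'z' in ASCII
            result -= 32 * power     # subtract 32 in that byte position
        power *= 256
    return result

print(packed_to_upper(24929))        # 16705, i.e. 0x4141, i.e. "AA"

The per-byte subtraction of 32 can never borrow into a neighbouring byte, because a lowercase code minus 32 is still positive; that is what makes the SIMD-style variant mentioned above safe.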

JSON, Unicode: a way to detect that XXXX in \uXXXX does not correspond to a Unicode character?

The JSON specification says that a character may be escaped using this notation: \uXXXX (where XXXX are four hex digits)
However, not every set of four hex digits corresponds to a Unicode character.
Are there tools that can scan a JSON document to detect the presence of \uXXXX, where XXXX does not correspond to any Unicode character? More generally, how does one determine that \uXXXX does not correspond to any Unicode character?
When the JSON spec talks about Unicode characters, it really means Unicode codepoints. Every \uXXXX sequence represents a valid codepoint, since every value from U+0000 through U+FFFF is a defined codepoint; note, however, that \uXXXX by itself only reaches U+FFFF, while Unicode defines codepoints all the way up to U+10FFFF.
When not using escaped hex notation, the full range of Unicode codepoints can be used as-is in JSON. On the other hand, when using escaped hex notation, only codepoints up to U+FFFF can be written directly. This is OK though, because codepoints above U+FFFF are escaped using UTF-16 surrogate pairs: two \uXXXX escapes, each in the surrogate range, acting together. This is described in RFC 7159, Section 7 (Strings):
Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A through
F can be upper or lower case. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".
...
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
So your question should not be "does \uXXXX correspond to a Unicode character?", because every value from 0x0000 to 0xFFFF is a valid Unicode codepoint. The real question should be "does \uXXXX correspond to a codepoint that can stand on its own, and if not (i.e. it is a surrogate), is it half of a \uXXXX\uXXXX sequence that forms a valid UTF-16 surrogate pair?".
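As for tooling, a short script is enough to flag the cases that actually matter: surrogate escapes that are unpaired or out of order. The following is a Python sketch of my own rather than an existing tool, and it deliberately ignores corner cases such as an escaped backslash followed by a literal u:

import re

ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def check_unicode_escapes(json_text):
    # Return (offset, codepoint, problem) for every lone or mismatched surrogate escape.
    problems = []
    escapes = [(m.start(), int(m.group(1), 16)) for m in ESCAPE.finditer(json_text)]
    i = 0
    while i < len(escapes):
        pos, cp = escapes[i]
        if 0xD800 <= cp <= 0xDBFF:   # high surrogate: must be followed immediately by a low surrogate
            nxt = escapes[i + 1] if i + 1 < len(escapes) else None
            if nxt is not None and nxt[0] == pos + 6 and 0xDC00 <= nxt[1] <= 0xDFFF:
                i += 2               # valid surrogate pair
                continue
            problems.append((pos, cp, "high surrogate not followed by a low surrogate"))
        elif 0xDC00 <= cp <= 0xDFFF: # low surrogate with no preceding high surrogate
            problems.append((pos, cp, "unpaired low surrogate"))
        i += 1
    return problems

print(check_unicode_escapes('{"ok": "\\uD834\\uDD1E", "bad": "\\uD800x"}'))

The doubled backslashes in the test string are Python escaping; the JSON text itself contains \uD834 and so on.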

How to convert alphabet to binary?

How do you convert a letter of the alphabet to binary? I searched on Google and it says to first convert the letter to its ASCII numeric value and then convert that numeric value to binary. Is there any other way to convert?
And if that's the only way, then are the binary values of "A" and 65 the same?
Because the ASCII value of 'A' = 65, and when converted to binary it is 01000001,
and 65 = 01000001.
That is indeed the way in which text is converted to binary.
And to answer your second question, yes, it is true that the binary values of A and 65 are the same. If you are wondering how the CPU distinguishes between "A" and "65" in that case, you should know that it doesn't. It is up to your operating system and program to decide how to treat the data at hand. For instance, say your memory looked like the following, starting at address 0 on the left and incrementing to the right:
00000001 00001111 00000001 01100110
This binary data could mean anything, and only has a meaning in the context of whatever program it is in. In a given program, you could have it be read as:
1. An integer, in which case you'll get one number.
2. Character data, in which case you'll output 4 ASCII characters.
In short, binary is read by CPUs, which do not understand the context of anything and simply execute whatever they are given. It is up to your program/OS to specify instructions in order for data to be handled properly.
Thus, converting the alphabet to binary is dependent on the program in which you are doing so, and outside the context of a program/OS converting the alphabet to binary is really the exact same thing as converting a sequence of numbers to binary, as far as a CPU is concerned.
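To see how the same bytes change meaning purely based on how a program reads them, here is a tiny Python illustration (my own example bytes, not the ones above):

data = bytes([0b01001000, 0b01101001])   # two bytes: 0x48 0x69

print(int.from_bytes(data, "big"))   # read as one integer: 18537
print(data.decode("ascii"))          # read as ASCII text:  Hi
print(format(ord("A"), "08b"))       # 'A' and the number 65 share the bit pattern 01000001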
The number 65 in decimal is 0100 0001 in binary, and it refers to the letter A in the binary alphabet table (ASCII): https://www.bin-dec-hex.com/binary-alphabet-the-alphabet-letters-in-binary. The easiest way to convert a letter to binary is to use an online converter, or you can do it manually with a binary alphabet table.

Is the doc wrong about interpreting \xhh

This is from the main Tcl doc:
\xhh The hexadecimal digits hh give an eight-bit hexadecimal value for the
Unicode character that will be inserted. Any number of hexadecimal digits may be
present; however, **all but the last two are ignored** (the result is always a
one-byte quantity).
My doubt is about this part: "all but the last two are ignored". Here is my experiment:
>set a "\x22"
"
>set a "\x2230"
"30
So you can see that the first 2 hexadecimal digits are taken and the rest are just treated as plain characters.
Am I missing something?
[EDIT] Looks like I am right; here is the relevant code from parser.c of tcl8.6:
860 case 'x':
861 count += TclParseHex(p+1, (numBytes > 3) ? 2 : numBytes-2, &result);
So only the first 2 digits immediately after \x are taken. Weird that nobody has noticed this doc error before.
This is a place where the behaviour changed from Tcl 8.5 (and before) to 8.6. It was a bug fix because the old behaviour was so damn weird that nobody ever expected it. (Or the Spanish Inquisition, but I digress…)
In 8.6, the documentation says:
\xhh
The hexadecimal digits hh (one or two of them) give an eight-bit hexadecimal value for the Unicode character that will be inserted. The upper bits of the Unicode character will be 0.
In 8.5, the documentation says:
\xhh
The hexadecimal digits hh give an eight-bit hexadecimal value for the Unicode character that will be inserted. Any number of hexadecimal digits may be present; however, all but the last two are ignored (the result is always a one-byte quantity). The upper bits of the Unicode character will be 0.
The difference is plain, and 8.5 and 8.6 behave differently here. The change was due to TIP #388, “Extending Unicode literals past the BMP” (part of a general programme of fixes, some of which had to be postponed to after 8.6 due to the impact on the ABI), which was voted on in September 2011; the project lead was Jan Nijtmans.
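To see the two documented rules side by side, here is a small simulation of what each version's documentation says should happen to "\x2230" (written in Python purely for illustration; it is not the actual Tcl parser):

def parse_x_escape_85(hex_digits):
    # 8.5 doc: any number of hex digits may follow \x, but only the last two count.
    return chr(int(hex_digits[-2:], 16))

def parse_x_escape_86(hex_digits):
    # 8.6 doc: at most two hex digits are consumed; anything after them stays literal text.
    return chr(int(hex_digits[:2], 16)) + hex_digits[2:]

print(parse_x_escape_85("2230"))   # 0   -> the character 0x30, the old surprising behaviour
print(parse_x_escape_86("2230"))   # "30 -> the character 0x22 followed by the literal text 30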
I remember voting for that TIP, and that fix was something I was very glad was in there.
Sorry it wasn't flagged as a Potential Incompatibility. Missed that one (probably because the old behaviour was so badly broken that nobody really believed that we hadn't fixed it long before…)