\r\n as a part of UTF8 character? - language-agnostic

Is it possible that some UTF-8 character includes the bytes 0x0D 0x0A as part of its encoding? If yes, which characters are they?
(The task I'm trying to solve is reading a textual UTF-8 file from a certain point rather than from the very beginning.)

No, every byte of a multibyte encoded codepoint will always have the most significant bit set.
Bytes with values 0-127 in a UTF-8 stream are uniquely mapped to ASCII.

No, every character in the ASCII range 0-127 is represented "as is" in UTF-8 text. Every byte of a multi-byte character has its high (eighth) bit set. This is one of the advantages of UTF-8.

The single Unicode code point U+0D0A will be represented as the three bytes 0xE0 0xB4 0x8A in UTF-8. The two Unicode code points U+000D U+000A will be represented as two bytes 0x0D 0x0A in UTF-8.
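That property is what makes the original task safe: you can search the raw bytes for 0x0D 0x0A and start decoding right after the match. A minimal sketch in Python (the file name and offset are just placeholder assumptions):
# Find the next line break after an arbitrary byte offset and decode from there.
# Safe because 0x0D and 0x0A can never appear inside a multi-byte UTF-8 sequence.
with open("input.txt", "rb") as f:
    data = f.read()
offset = 1000                              # arbitrary starting point, assumed for illustration
nl = data.find(b"\r\n", offset)            # byte-wise search, no decoding needed yet
if nl != -1:
    text = data[nl + 2:].decode("utf-8")   # nl + 2 is guaranteed to be a character boundary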

Related

UTF-8 - contradictory definitions

My understanding of UTF-8 encoding is that the first byte of a UTF-8 char carries either
data in the lower 7 bits (0-6) with high bit (7) clear for single byte ASCII range code-points
data in the lower 5 bits (0-4), with high bits 7-5 = 110 to indicate a 2 byte char
data in the lower 4 bits (0-3), with high bits 7-4 = 1110 to indicate a 3 byte char
data in the lower 3 bits (0-2), with high bits 7-3 = 11110 to indicate a 4 byte char
noting that bit 7 is always set and this tells utf-8 parsers that this is a multi-byte char.
This means that any Unicode code-point in the range 128-255 has to be encoded in 2 or more bytes, because the high bit that would be required to encode it in one byte is reserved in UTF-8 as the 'multi-byte indicator bit'. So e.g. the character é (e-acute, which is Unicode code-point \u00E9, 233 decimal) is encoded in UTF-8 as the two-byte sequence \xC3\xA9.
The standard UTF-8 encoding table shows how the code-point \u00E9 is encoded in UTF-8 as \xC3\xA9.
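(A quick check, for example in a Python shell, confirms these bytes; the snippet is purely illustrative:)
"é".encode("utf-8")      # b'\xc3\xa9' -> the two bytes 0xC3 0xA9
0xC3 >> 5                # 6 == 0b110 -> lead-byte pattern for a 2-byte character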
However, this is not how it seems to work in a web page. I have recently run into some contradictory behavior in the rendering of Unicode chars, and in my exploratory reading came across this:
"UTF-8 is identical to both ANSI and 8859-1 for the values from 160 to 255." (w3schools)
which clearly contradicts the above.
And if I render these various values in jsfiddle, HTML renders the Unicode code-point as é, not the UTF-8 2-byte encoding of that code-point. In fact, HTML renders the UTF-8 char \xC3\xA9 (written as the entity &#xC3A9;) as the Hangul syllable that has the code-point 0xC3A9.
W3schools has a table that explicitly defines the UTF-8 of é as Decimal 233 (\xE9).
So HTML is rendering code-points, not UTF-8 chars.
Am I missing something here? Can anyone explain to me why in a supposedly UTF-8 HTML document, it seems like there is no UTF-8 parsing going on at all?
Your understanding of the encoding of UTF-8 bytes is correct.
Your jsfiddle example is using UTF-8 only as a byte encoding for the HTML file (hence the use of the <meta charset="UTF-8"> HTML tag), but not as an encoding of the HTML itself. HTML only uses ASCII characters for its markup, but that markup can represent Unicode characters.
UTF-8 is a byte encoding for Unicode codepoints. It is commonly used for transmission of Unicode data, such as an HTML file over HTTP. But HTML itself is defined in terms of Unicode codepoints only, not in UTF-8 specifically. A web browser receives the raw UTF-8 bytes over the wire and decodes them to Unicode codepoints before processing them in the context of the HTML.
HTML entities deal in Unicode codepoints only, not in code units such as those used in UTF-8.
HTML entities in &#<xxx>; format represent Unicode codepoints by their numeric values directly.
&#233; (é) and &#xE9; (é) represent integer 233 in decimal and hex formats, respectively. 233 is the numeric value of Unicode codepoint U+00E9 LATIN SMALL LETTER E WITH ACUTE, which is encoded in UTF-8 as the bytes 0xC3 0xA9.
&#xC3A9; (쎩) represents integer 50089 in hex format (0xC3A9). 50089 is the numeric value of Unicode codepoint U+C3A9 HANGUL SYLLABLE SSYEOLG, which is encoded in UTF-8 as the bytes 0xEC 0x8E 0xA9.
HTML entities in &<name>; format represent Unicode codepoints by a human-readable name defined by HTML.
&eacute; (é) represents Unicode codepoint U+00E9, same as &#233; and &#xE9; do.
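The same distinction can be reproduced outside the browser; a small Python sketch (illustrative only, chosen because Python ships a convenient entity decoder, html.unescape):
import html
html.unescape("&#233;")      # 'é'  -> numeric entity resolved to codepoint U+00E9
html.unescape("&#xE9;")      # 'é'  -> same codepoint, hex form
html.unescape("&#xC3A9;")    # '쎩' -> codepoint U+C3A9, the Hangul syllable
"é".encode("utf-8")          # b'\xc3\xa9' -> UTF-8 bytes only appear when the text is serialized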

Why does HTML treat 2 and 3-byte characters the same, but not 4-byte?

I'm doing some GUI work for a website and using the "maxlength" attribute for some text inputs, some of which may contain Unicode characters.
Say I've got a text field with maxlength = 50 and I fill it full of 2-byte Unicode characters (UTF-16). I can get 50 characters in the text field.
I can also do the same with 3-byte characters. 50 of them.
I can only get 25 4-byte characters in the field, however. Stands to reason, since it's twice as many bytes, but why does it still respond normally when using 3-byte characters? How is the extra byte handled?
Unicode characters can generally be encoded in UTF-8, UTF-16, or UTF-32 (see the Unicode FAQ). Your usage of 2-, 3- and 4-byte characters tells me you're working from a UTF-8 perspective.
However, the maxlength attribute is defined as the maximum number of UTF-16 code units, not number of bytes. Each UTF-16 code unit is two bytes.
A 2-byte UTF-8 character will be a single UTF-16 code unit. A 3-byte UTF-8 character will also be a single UTF-16 code unit. However, a 4-byte UTF-8 character represents a Unicode character greater than 0xFFFF. UTF-16 represents this as two code units (called surrogate pairs, see faq linked above).
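If you want to predict how a given string counts against maxlength, count UTF-16 code units rather than bytes or codepoints. A rough Python sketch (the helper function is mine, purely illustrative):
def utf16_units(s: str) -> int:
    # maxlength counts UTF-16 code units; each code unit is two bytes in UTF-16-LE
    return len(s.encode("utf-16-le")) // 2

utf16_units("é")     # 1 -> a 2-byte UTF-8 character is one code unit
utf16_units("€")     # 1 -> a 3-byte UTF-8 character is still one code unit
utf16_units("𝄞")    # 2 -> a 4-byte UTF-8 character becomes a surrogate pair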

JSON, Unicode: a way to detect that XXXX in \uXXXX does not correspond to a Unicode character?

The JSON specification says that a character may be escaped using this notation: \uXXXX (where XXXX are four hex digits)
However, not every set of four hex digits corresponds to a Unicode character.
Are there tools that can scan a JSON document to detect the presence of \uXXXX, where XXXX does not correspond to any Unicode character? More generally, how does one determine that \uXXXX does not correspond to any Unicode character?
When the JSON spec talks about Unicode characters, it really means Unicode codepoints. Every \uXXXX sequence represents a valid codepoint, since \uXXXX can only express values up to U+FFFF, while Unicode defines codepoints all the way up to U+10FFFF.
When not using escaped hex notation, the full range of Unicode codepoints can be used as-is in JSON. When using escaped hex notation, on the other hand, only codepoints up to U+FFFF can be written directly. This is OK though, because codepoints above U+FFFF must be represented as UTF-16 surrogate pairs: two \uXXXX values from the reserved surrogate range acting together. This is described in RFC 7159 Section 7, Strings:
Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A through
F can be upper or lower case. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".
...
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
So your question should not be "does \uXXXX correspond to a Unicode character?", because it logically always will, as all values 0x0000-0xFFFF are valid Unicode codepoints. The real question should be "does \uXXXX correspond to a Unicode codepoint in the BMP, and if not, does it belong to a \uXXXX\uXXXX pair that forms a valid UTF-16 surrogate pair?".
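So a practical checker validates surrogate pairing rather than individual \uXXXX values. A rough Python sketch (the regex and function are mine, not from any existing tool, and it ignores the corner case of an escaped backslash before the u):
import re

ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def lone_surrogates(json_text):
    # Yield the offset of every \uXXXX escape that is an unpaired surrogate.
    matches = list(ESCAPE.finditer(json_text))
    i = 0
    while i < len(matches):
        value = int(matches[i].group(1), 16)
        if 0xD800 <= value <= 0xDBFF:        # high surrogate: must be followed by a low one
            nxt = matches[i + 1] if i + 1 < len(matches) else None
            if nxt and nxt.start() == matches[i].end() and 0xDC00 <= int(nxt.group(1), 16) <= 0xDFFF:
                i += 1                       # well-formed pair, skip the low half
            else:
                yield matches[i].start()
        elif 0xDC00 <= value <= 0xDFFF:      # low surrogate with no preceding high one
            yield matches[i].start()
        i += 1

list(lone_surrogates(r'"\uD834\uDD1E ok, \uD800 broken"'))   # -> [18], the offset of the lone \uD800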

Why it is very likely to see Chinese characters when I view a random UTF-8 webpage as UTF-16?

Out of curiosity, I chose UTF-16 in the Encoding menu of a random English webpage to see what happens (in Chrome: Tools -> Encoding -> Unicode (UTF-16LE)). What interested me is that almost all of the mojibake I see are Chinese characters (and some integral signs).
Are there any statistical reasons for seeing Chinese characters when switching from ASCII/UTF-8 English to UTF-16? Are the random non-Chinese special characters from HTML tags?
Since the smallest unit in UTF-16 is two bytes long, most "low" characters like Latin letters contain a NUL byte: 00 xx. Since normal content does not typically contain NUL bytes, it's virtually impossible to hit Latin characters when interpreting random byte sequences as UTF-16. Most bytes of UTF-8 encoded content will be somewhere in the lower middle, like say 46 6F. And that happens to be where many Asian languages are situated in UTF-16, and since Chinese is a ginormous block you're very likely to hit it.
Most English text consists of ASCII characters roughly in the 0x41-0x7A hex range. If you reinterpret that byte stream as UTF-16, most of your code units fall roughly in the range 0x4141-0x7A7A, which apparently maps mostly to Chinese (CJK) characters.
I agree with Raul Andres as long as you view ASCII, or UTF-8 that contains just ASCII characters, as UTF-16. However, you might not see Chinese characters anymore if your UTF-8 content contains Thai, Hebrew or other languages that result in 2-byte, 3-byte or 4-byte sequences in UTF-8.
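You can reproduce the effect outside the browser by reinterpreting plain ASCII bytes as UTF-16LE; a small Python illustration (the sample sentence is arbitrary):
ascii_bytes = b"Hello dear reader, this is plain English text"
even = ascii_bytes[: len(ascii_bytes) // 2 * 2]   # UTF-16 needs an even byte count
even.decode("utf-16-le")   # pairs of ASCII bytes become code units roughly in 0x2020-0x7E7E,
                           # most of which land in or near the CJK ideograph blocks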

Flash CS4/AS3: differing behavior between console and textarea for printing UTF-16 characters

trace(escape("д"));
will print "%D0%B4", the correct URL encoding for this character (Cyrillic equivalent of "A").
However, if I were to do..
myTextArea.htmlText += unescape("%D0%B4");
What gets printed is:
Ð´
which is of course incorrect. Simply tracing the above unescape returns the correct Cyrillic character, though! For this textarea, escaping "д" returns its Unicode code-point "%u0434".
I'm not sure what exactly is happening to mess this up, but...
UTF-16 д in web encoding is: %FE%FF%00%D0%00%B4
Whereas
UTF-16 д in web encoding is: %00%D0%00%B4
So it's padding this value with something at the beginning. Why would a trace provide different text than a print to an (empty) textarea? What's goin' on?
The textarea in question has no weird encoding properties attached to it, if that sort of thing is even possible.
The problem is unescape (escape could also be a problem, but it's not the culprit in this case). These functions are not multibyte aware. What escape does is this: it takes each byte of the input string and returns its hex representation with a % prepended. unescape does the opposite. The key point here is that they work with bytes, not characters.
What you want is encodeURIComponent / decodeURIComponent. Both use utf-8 as the string encoding scheme (the encoding used by flash everywhere). Note that it's not utf-16 (which you shouldn't care about as far as flash is concerned).
encodeURIComponent("д"); //%D0%B4
decodeURIComponent("%D0%B4"); // д
Now, if you want to dig a bit deeper, here's what's going on (this assumes a basic knowledge of how utf-8 works).
escape("д")
This returns
%D0%B4
Why?
"д" is treated by flash as utf-8. The codepoint for this character is 0x0434.
In binary:
0000 0100 0011 0100
It fits in two utf-8 bytes, so it's encoded thus (where e means encoding bit, and p means payload bit):
1101 0000 1011 0100
eeep pppp eepp pppp
Converting it to hex, we get:
0xd0 0xb4
So, 0xd0,0xb4 is a utf-8 encoded "д".
This is fed to escape. escape sees two bytes, and gives you:
%d0%b4
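(For reference, the same bytes can be double-checked in Python, used here purely as an illustration:)
hex(ord("д"))           # '0x434' -> the codepoint
"д".encode("utf-8")     # b'\xd0\xb4' -> the two bytes escape sees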
Now, you pass this to unescape. But unescape is a little bit brain-dead, so it thinks one byte is one and the same thing as one char, always. As far as unescape is concerned, you have two bytes, hence, you have two chars. If you look up the code-points for 0xd0 and 0xb4, you'll see this:
0xd0 -> Ð
0xb4 -> ´
So, unescape returns a string consisting of two chars, Ð and ´ (instead of figuring out that the two bytes it got were actually just one char, utf-8 encoded). Then, when you assign the text property, you are not really passing д but Ð´, and this is what you see in the text area.
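The byte-versus-character distinction is easy to reproduce outside Flash as well; a short Python sketch (illustrative only):
raw = bytes.fromhex("d0b4")    # the escaped bytes %D0 %B4

raw.decode("latin-1")          # 'Ð´' -> one character per byte, effectively what unescape does
raw.decode("utf-8")            # 'д'  -> multibyte-aware decoding, what decodeURIComponent does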