UTF-8 - contradictory definitions - HTML

My understanding of UTF-8 encoding is that the first byte of a UTF-8 char carries either
data in the lower 7 bits (0-6) with high bit (7) clear for single byte ASCII range code-points
data in the lower 5 bits (0-4), with high bits 7-5 = 110 to indicate a 2 byte char
data in the lower 4 bits (0-3), with high bits 7-4 = 1110 to indicate a 3 byte char
data in the lower 3 bits (0-2), with high bits 7-3 = 11110 to indicate a 4 byte char
noting that for multi-byte chars bit 7 is always set, and this tells UTF-8 parsers that this is a multi-byte char.
This means that any unicode code-point in the range 128-255 has to be encoded in 2 or more bytes, because the high bit that is required if they were to be encoded in one byte is reserved in UTF-8 for the 'multi-byte indicator bit'. So e.g. the character é (e-acute, which is unicode code-point \u00E9, 233 decimal) is encoded in UTF-8 as a two byte character \xC3A9.
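For example, a quick check in Python (just my own verification, not from any reference) shows exactly that byte layout:
encoded = '\u00e9'.encode('utf-8')   # é, code point 233
print(encoded)                       # b'\xc3\xa9'
lead, cont = encoded                 # iterating a bytes object yields ints
print(f'{lead:08b} {cont:08b}')      # 11000011 10101001 - lead byte starts with 110, continuation with 10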
However this is not how it works in a web page it seems. I have recently had some contradictory behavior in the rendering of unicode chars, and in my exploratory reading came across this:
"UTF-8 is identical to both ANSI and 8859-1 for the values from 160 to 255." (w3schools)
which clearly contradicts the above.
And if I render these various values in jsfiddle, HTML renders the Unicode code-point as é, not the UTF-8 2-byte encoding of that code-point. In fact HTML renders the UTF-8 char \xC3A9 as the Hangul syllable that has the code-point \xC3A9 (쎩).
W3schools has a table that explicitly defines the UTF-8 encoding of é as decimal 233 (\xE9).
So HTML is rendering code-points, not UTF-8 chars.
Am I missing something here? Can anyone explain to me why in a supposedly UTF-8 HTML document, it seems like there is no UTF-8 parsing going on at all?

Your understanding of the encoding of UTF-8 bytes is correct.
Your jsfiddle example is using UTF-8 only as a byte encoding for the HTML file (hence the use of the <meta charset="UTF-8"> HTML tag), but not as an encoding of the HTML itself. HTML only uses ASCII characters for its markup, but that markup can represent Unicode characters.
UTF-8 is a byte encoding for Unicode codepoints. It is commonly used for transmission of Unicode data, such as an HTML file over HTTP. But HTML itself is defined in terms of Unicode codepoints only, not UTF-8 specifically. A web browser receives the raw UTF-8 bytes over the wire and decodes them to Unicode codepoints before processing them in the context of the HTML.
HTML entities deal in Unicode codepoints only, not in code units, such as those used in UTF-8.
HTML entities in &#<xxx>; format represent Unicode codepoints by their numeric values directly.
&#233; (é) and &#xE9; (é) represent integer 233 in decimal and hex formats, respectively. 233 is the numeric value of Unicode codepoint U+00E9 LATIN SMALL LETTER E WITH ACUTE, which is encoded in UTF-8 as bytes 0xC3 0xA9.
&#xC3A9; (쎩) represents integer 50089 in hex format (0xC3A9). 50089 is the numeric value of Unicode codepoint U+C3A9 HANGUL SYLLABLE SSYEOLG, which is encoded in UTF-8 as bytes 0xEC 0x8E 0xA9.
HTML entities in &<name>; format represent Unicode codepoints by a human-readable name defined by HTML.
&eacute; (é) represents Unicode codepoint U+00E9, same as &#233; and &#xE9; do.
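A short Python sketch (using the standard html module purely for illustration) shows the same resolution from references to codepoints to UTF-8 bytes:
import html
for ref in ('&#233;', '&#xE9;', '&eacute;', '&#xC3A9;'):
    ch = html.unescape(ref)                      # reference -> codepoint
    print(ref, '->', 'U+%04X' % ord(ch), ch, ch.encode('utf-8'))
The first three all print U+00E9 é b'\xc3\xa9'; the last prints U+C3A9 쎩 b'\xec\x8e\xa9'.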

Related

Could a file encoded in Latin-1 but read as UTF-8 cause any problems? [duplicate]

What is the difference between UTF-8 and ISO-8859-1?
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
Wikipedia explains both reasonably well: UTF-8 vs Latin-1 (ISO-8859-1). The former is a variable-length encoding, the latter a single-byte fixed-length encoding.
Latin-1 encodes just the first 256 code points of the Unicode character set, whereas UTF-8 can be used to encode all code points. At the physical encoding level, only code points 0-127 get encoded identically; code points 128-255 differ by becoming a 2-byte sequence in UTF-8, whereas they are single bytes in Latin-1.
UTF
UTF is a family of multi-byte encoding schemes for Unicode code points. The scheme was originally designed to cover up to 2^31 [roughly 2 billion] characters, though Unicode itself is now capped at code point U+10FFFF. UTF-8 is a flexible encoding that uses between 1 and 4 bytes per code point and can structurally represent the first 2^21 [roughly 2 million] code points.
Long story short: any character with a code point/ordinal value in the 7-bit-safe ASCII range (0-127) is represented by the same single byte as in most other single-byte encodings. Any character with a code point above 127 is represented by a sequence of two or more bytes, following the bit layout described in the first question above.
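As a small illustration (a Python sketch, not tied to any particular tool), the incompatibility above 127 shows up immediately when bytes are decoded under the wrong assumption:
print('\u00e9'.encode('latin-1'))     # b'\xe9'      - one byte in Latin-1
print('\u00e9'.encode('utf-8'))       # b'\xc3\xa9'  - two bytes in UTF-8
try:
    b'\xe9'.decode('utf-8')           # Latin-1 bytes above 127 are not valid UTF-8 on their own
except UnicodeDecodeError as err:
    print('not valid UTF-8:', err.reason)
print(b'\xc3\xa9'.decode('latin-1'))  # Ã©  - UTF-8 bytes misread as Latin-1 become mojibake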
ISO-8859
ISO-8859 is a family of single-byte encoding schemes used to represent alphabets whose additional characters fit within the range 128 to 255. These various alphabets are defined as "parts" in the format ISO-8859-n, the most familiar of these likely being ISO-8859-1, aka 'Latin-1'. As with UTF-8, 7-bit-safe ASCII remains unaffected regardless of the part used.
The drawback to this encoding scheme is its inability to accommodate languages comprising more than 128 symbols, or to safely display more than one family of symbols at a time. ISO-8859 encodings have also fallen out of favor with the rise of UTF; the ISO working group in charge of them disbanded in 2004, leaving maintenance to its parent subcommittee.
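For instance (a Python sketch; the codec names are the standard aliases), the same byte means a different letter depending on which part is in use:
print(b'\xe9'.decode('iso-8859-1'))   # é  (U+00E9, Latin-1)
print(b'\xe9'.decode('iso-8859-5'))   # щ  (U+0449, Cyrillic part)
print(b'\xe9'.decode('iso-8859-7'))   # ι  (U+03B9, Greek part)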
Windows Code Pages
It's worth mentioning that Microsoft also maintains a set of character encodings with limited compatibility with ISO-8859, usually denoted as "cp####". Microsoft has been pushing its recent product releases toward Unicode in one form or another, but for legacy and/or interoperability reasons you're still likely to run into these code pages.
For example, cp1252 is a superset of ISO-8859-1, containing additional printable characters in the 0x80-0x9F range, notably the Euro symbol € and the much maligned "smart quotes" “”. This frequently leads to a mismatch where 8859-1 can be displayed as 1252 perfectly fine, while 1252 may seem to display fine as 8859-1 but misbehaves when one of those extra symbols shows up.
Aside from cp1252, the Turkish cp1254 is a similar superset of ISO-8859-9, but all other Windows Code Pages have at least some fundamental conflicts, if not differing entirely from their 8859 equivalent.
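A quick Python illustration of that mismatch (the byte string is hypothetical, chosen to hit the 0x80-0x9F range):
data = b'\x93Caf\xe9 \x80 5\x94'      # bytes as written by a cp1252 system
print(data.decode('cp1252'))          # “Café € 5”  - curly quotes and euro sign intact
print(data.decode('latin-1'))         # same letters, but 0x93, 0x80 and 0x94 turn into invisible C1 controls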
ASCII: 7 bits. 128 code points.
ISO-8859-1: 8 bits. 256 code points.
UTF-8: 8-32 bits (1-4 bytes). 1,112,064 code points.
Both ISO-8859-1 and UTF-8 are backwards compatible with ASCII, but UTF-8 is not backwards compatible with ISO-8859-1:
#!/usr/bin/env python3
c = chr(0xa9)                  # U+00A9 COPYRIGHT SIGN
print(c)                       # the character itself
print(c.encode('utf-8'))       # two bytes in UTF-8
print(c.encode('iso-8859-1'))  # a single byte in ISO-8859-1
Output:
©
b'\xc2\xa9'
b'\xa9'
ISO-8859-1 is a legacy standard from back in the 1980s. It can only represent 256 characters, so it is only suitable for some languages of the Western world, and even for many supported languages some characters are missing. If you create a text file in this encoding and try to copy/paste some Chinese characters, you will see weird results. In other words, don't use it. Unicode has taken over the world, and UTF-8 is pretty much the standard these days unless you have some legacy reasons (like HTTP headers, which need to be compatible with everything).
One more important thing to realise: if you see iso-8859-1, it probably refers to Windows-1252 rather than ISO/IEC 8859-1. They differ in the range 0x80–0x9F, where ISO 8859-1 has the C1 control codes, and Windows-1252 has useful visible characters instead.
For example, ISO 8859-1 has 0x85 as a control character (in Unicode, U+0085 NEXT LINE, which is invisible), while Windows-1252 has a horizontal ellipsis (in Unicode, U+2026 HORIZONTAL ELLIPSIS, …).
The WHATWG Encoding spec (as used by HTML) expressly declares iso-8859-1 to be a label for windows-1252, and web browsers do not support ISO 8859-1 in any way: the HTML spec says that all encodings in the Encoding spec must be supported, and no more.
Also of interest, HTML numeric character references essentially use Windows-1252 for 8-bit values rather than Unicode code points; per https://html.spec.whatwg.org/#numeric-character-reference-end-state, &#x85; will produce U+2026 rather than U+0085.
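Both points are easy to verify in Python (a small sketch; html.unescape applies the HTML 5 rules):
import html
print(repr(b'\x85'.decode('iso-8859-1')))   # '\x85'  - U+0085 NEXT LINE, an invisible control
print(b'\x85'.decode('cp1252'))             # …       - U+2026 HORIZONTAL ELLIPSIS
print(html.unescape('&#x85;'))              # …       - the numeric reference is remapped to U+2026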
From another perspective: a file that fails to decode as UTF-8 or ASCII because it contains a byte such as 0xC0 can still be read as ISO-8859-1, since every byte value is a valid character there. The caveat, of course, is that the file shouldn't actually contain UTF-8 multi-byte sequences.
My reason for researching this question was to understand in what way they are compatible. The Latin-1 character set (ISO-8859-1) can be stored losslessly in a UTF-8 datastore, because every Latin-1 character is also a Unicode code point; ASCII characters stay single-byte, while the extended-ASCII characters (128-255) become two-byte sequences.
Going the other way, from UTF-8 to a Latin-1 datastore, may or may not work: any character with a code point beyond 255 (U+00FF) cannot be stored in a Latin-1 datastore.
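A brief Python sketch of both directions (illustrative only):
latin1_bytes = 'Caf\xe9'.encode('latin-1')                # b'Caf\xe9'
as_utf8 = latin1_bytes.decode('latin-1').encode('utf-8')  # b'Caf\xc3\xa9' - é grows to two bytes
print(as_utf8.decode('utf-8'))                            # Café - nothing lost going into UTF-8
try:
    '\u20ac'.encode('latin-1')                            # € is beyond U+00FF
except UnicodeEncodeError:
    print('no Latin-1 representation for \u20ac')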

Why does HTML treat 2 and 3-byte characters the same, but not 4-byte?

I'm doing some GUI work for a website and using the "maxlength" attribute for some text inputs, some of which may contain Unicode characters.
Say I've got a text field with maxlength = 50 and I fill it full of 2-byte Unicode characters (UTF-16). I can get 50 characters in the text field.
I can also do the same with 3-byte characters. 50 of them.
I can only get 25 4-byte characters in the field, however. Stands to reason, since it's twice as many bytes, but why does it still respond normally when using 3-byte characters? How is the extra byte handled?
Unicode characters can generally be encoded in either UTF-8, UTF-16, or UTF-32 (see their faq). Your usage of 2, 3 & 4 byte characters tells me you're working from a UTF-8 perspective.
However, the maxlength attribute is defined as the maximum number of UTF-16 code units, not number of bytes. Each UTF-16 code unit is two bytes.
A 2-byte UTF-8 character will be a single UTF-16 code unit. A 3-byte UTF-8 character will also be a single UTF-16 code unit. However, a 4-byte UTF-8 character represents a Unicode character greater than 0xFFFF. UTF-16 represents this as two code units (called surrogate pairs, see faq linked above).
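A small Python sketch (the characters are arbitrary examples with 2-, 3- and 4-byte UTF-8 encodings) makes the relationship visible:
for ch in ('\u00e9', '\u20ac', '\U0001f600'):     # é, €, 😀
    utf8_bytes = len(ch.encode('utf-8'))
    utf16_units = len(ch.encode('utf-16-le')) // 2
    print('U+%04X: %d UTF-8 bytes, %d UTF-16 code unit(s)' % (ord(ch), utf8_bytes, utf16_units))
Output:
U+00E9: 2 UTF-8 bytes, 1 UTF-16 code unit(s)
U+20AC: 3 UTF-8 bytes, 1 UTF-16 code unit(s)
U+1F600: 4 UTF-8 bytes, 2 UTF-16 code unit(s)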

JSON, Unicode: a way to detect that XXXX in \uXXXX does not correspond to a Unicode character?

The JSON specification says that a character may be escaped using this notation: \uXXXX (where XXXX are four hex digits)
However, not every set of four hex digits corresponds to a Unicode character.
Are there tools that can scan a JSON document to detect the presence of \uXXXX, where XXXX does not correspond to any Unicode character? More generally, how does one determine that \uXXXX does not correspond to any Unicode character?
When the JSON spec talks about Unicode characters, it really means Unicode codepoints. Every valid \uXXXX sequence represents a valid codepoint, since \uXXXX can only express values up to U+FFFF and Unicode defines codepoints all the way up to U+10FFFF.
When not using escaped hex notation, the full range of Unicode codepoints can be used as-is in JSON. On the other hand, when using escaped hex notation, only codepoints up to U+FFFF can be written directly. This is OK though, because codepoints above U+FFFF must be represented using UTF-16 surrogate pairs, which consist of two values in the \uXXXX range acting together. This is described in RFC 7159 Section 7 Strings:
Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A though
F can be upper or lower case. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".
...
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
So your question should not be "does \uXXXX correspond to a Unicode character?", because it logically always will as all values 0x0000 - 0xFFFF are valid Unicode codepoints. The real question should be "does \uXXXX correspond to a Unicode codepoint in the BMP, and if not does it belong to a \uXXXX\uXXXX pair that corresponds to a valid UTF-16 surrogate?".
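As a rough sketch of such a check (a hypothetical helper, not an existing tool), one can scan the raw JSON text for \uXXXX escapes and flag surrogate values that are not part of a high/low pair:
import re

# Matches either a complete surrogate pair or a single \uXXXX escape.
# (Corner case ignored: a literal backslash followed by "uXXXX".)
PAIR_OR_SINGLE = re.compile(
    r'\\u[dD][89abAB][0-9a-fA-F]{2}\\u[dD][c-fC-F][0-9a-fA-F]{2}'
    r'|\\u([0-9a-fA-F]{4})'
)

def lone_surrogates(json_text):
    problems = []
    for m in PAIR_OR_SINGLE.finditer(json_text):
        if m.group(1):                           # single escape, not a pair
            value = int(m.group(1), 16)
            if 0xD800 <= value <= 0xDFFF:        # surrogate with no partner
                problems.append(m.group(0))
    return problems

print(lone_surrogates(r'"\u00e9 \ud834\udd1e"'))  # []  - plain BMP escape plus a valid pair
print(lone_surrogates(r'"\udd1e \ud834"'))        # ['\\udd1e', '\\ud834'] - both unpaired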

Why it is very likely to see Chinese characters when I view a random UTF-8 webpage as UTF-16?

Out of curiosity, I chose UTF-16 in the Encoding menu of a random English webpage to see what happens (in Chrome: Tools -> Encoding -> Unicode (UTF-16LE)). What interested me is that almost all of the mojibake I see consists of Chinese characters (and some integral signs).
Are there any statistical reasons for seeing Chinese characters when switching from ASCII/UTF-8 English to UTF-16? Are the random non-Chinese special characters from HTML tags?
Since the smallest unit in UTF-16 is two bytes long, one of the two bytes of most "low" characters like Latin letters is a NUL byte (00 xx big-endian, xx 00 little-endian). Since normal content does not typically contain NUL bytes, it's virtually impossible to hit Latin characters when interpreting such byte sequences as UTF-16. Most bytes of UTF-8 encoded English content will be somewhere in the lower middle, like say 46 6F. And that happens to be where many Asian languages are situated in UTF-16, and since Chinese is a ginormous block you're very likely to hit it.
Most English letters are ASCII-coded in the hex range 0x41-0x7A. If you reinterpret that UTF-8/ASCII byte stream as UTF-16, pairs of letters become code units roughly in the range 0x4141-0x7A7A, which largely falls within the Chinese (CJK ideograph) blocks.
I agree with Raul Andres as long as you view ASCII, or UTF-8 that contains just ASCII characters, as UTF-16. However, you might not see Chinese characters anymore if your UTF-8 content contains Thai, Hebrew or other languages that result in 2-byte, 3-byte or 4-byte sequences in UTF-8.
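The effect is easy to reproduce in Python (a sketch; the exact characters depend on the input text, but most of the resulting code units land in the CJK blocks):
text = 'Stack Overflow'                          # 14 ASCII bytes in UTF-8, an even number
as_utf16 = text.encode('utf-8').decode('utf-16-le')
print(as_utf16)                                  # mostly CJK ideographs
print(' '.join('U+%04X' % ord(u) for u in as_utf16))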

\r\n as a part of UTF8 character?

Is it possible that some UTF-8 symbol includes the bytes 0x0D 0x0A as part of its encoding? If yes, what are such symbols?
(The task I'm trying to solve is reading a textual UTF-8 file from a certain point rather than from the very beginning.)
No, every byte of a multibyte encoded codepoint will always have the most significant bit set.
Bytes with values 0-127 in a UTF-8 stream are uniquely mapped to ASCII.
No, every character in the ASCII range 0-127 is represented "as is" in UTF-8 text. Every byte of a multi-byte character has its high bit set. It's one of the advantages of UTF-8.
The single Unicode code point U+0D0A (MALAYALAM LETTER UU) will be represented as the three bytes 0xE0 0xB4 0x8A in UTF-8. The two Unicode code points U+000D U+000A will be represented as the two bytes 0x0D 0x0A in UTF-8.
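A short Python check (illustrative) confirms both statements:
encoded = '\u0d0a'.encode('utf-8')       # the single code point U+0D0A
print(encoded)                           # b'\xe0\xb4\x8a'
print(all(b >= 0x80 for b in encoded))   # True - every byte of the multi-byte sequence has the high bit set
print(b'\x0d\x0a' in encoded)            # False - CR/LF bytes can only come from real U+000D U+000A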