I've been reading up on the RFC 4627 specification, and I've come to the following interpretation:
When advertising a payload as application/json mime-type,
there MUST be no BOMs at the beginning of properly encoded JSON streams (based on section "3. Encoding"), and
no media parameters are supported, thus a mime-type header of application/json; charset=utf-8 does not conform to RFC-4627 (based on section "6. IANA Considerations").
Are these correct deductions? Will I run into problems when implementing web services or web clients which adhere to these interpretations? Should I file bugs against web browsers which violate the two properties above?
You are right
The BOM character is illegal in JSON (and not needed)
The MIME charset is illegal in JSON (and not needed as well)
RFC 7159, Section 8.1:
Implementations MUST NOT add a byte order mark to the beginning of a JSON text.
This is put as clearly as it can be. This is the only "MUST NOT" in the entire RFC.
RFC 7159, Section 11:
The MIME media type for JSON text is application/json.
Type name: application
Subtype name: json
Required parameters: n/a
Optional parameters: n/a
[...]
Note: No "charset" parameter is defined for this registration.
JSON encoding
The only valid encodings of JSON are UTF-8, UTF-16 and UTF-32 (LE or BE). Since the first character (or the first two, if there is more than one character) always has a Unicode value lower than 128 (there is no valid JSON text whose first two characters have higher values), it is always possible to tell which of the valid encodings, and which endianness, was used just by looking at the byte stream.
RFC recommendation
The JSON RFC says that the first two characters will always be below 128 and you should check the first 4 bytes.
I would put it differently: since the one-character text "1" is also valid JSON, there is no guarantee that you have two characters at all - let alone 4 bytes.
My recommendation
My recommendation for determining the JSON encoding is slightly different:
Fast method:
if you have 1 byte and it's not NUL - it's UTF-8
(actually the only valid character here would be an ASCII digit)
if you have 2 bytes and none of them are NUL - it's UTF-8
(those must be ASCII digits with no leading '0', {}, [] or "")
if you have 2 bytes and only the first is NUL - it's UTF-16BE
(it must be an ASCII digit encoded as UTF-16, big endian)
if you have 2 bytes and only the second is NUL - it's UTF-16LE
(it must be an ASCII digit encoded as UTF-16, little endian)
if you have 3 bytes and they are not NUL - it's UTF-8
(again, ASCII digits with no leading '0's, "x", [1] etc.)
if you have 4 bytes or more, then the RFC method works:
00 00 00 xx - it's UTF-32BE
00 xx 00 xx - it's UTF-16BE
xx 00 00 00 - it's UTF-32LE
xx 00 xx 00 - it's UTF-16LE
xx xx xx xx - it's UTF-8
but it only works if the input is indeed a valid string in one of those encodings, which it may not be. Moreover, even if you have a valid string in one of the 5 valid encodings, it may still not be valid JSON. (The heuristic is sketched in code below.)
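A minimal sketch of this fast method in Python (the function name and the exact return values are mine, not from the RFC):

    def sniff_json_encoding(data: bytes) -> str:
        """Guess the encoding of a JSON byte stream from its NUL-byte pattern."""
        if not data:
            raise ValueError("empty input is not valid JSON")
        if len(data) < 4:
            # 1-3 bytes: only UTF-8 or UTF-16 are possible
            if data[0] == 0:
                return "utf-16-be"
            if len(data) >= 2 and data[1] == 0:
                return "utf-16-le"
            return "utf-8"
        # 4 bytes or more: use the NUL pattern of the first four octets
        nul = tuple(b == 0 for b in data[:4])
        if nul == (True, True, True, False):
            return "utf-32-be"       # 00 00 00 xx
        if nul == (False, True, True, True):
            return "utf-32-le"       # xx 00 00 00
        if nul[0] and not nul[1]:
            return "utf-16-be"       # 00 xx 00 xx
        if not nul[0] and nul[1]:
            return "utf-16-le"       # xx 00 xx 00
        return "utf-8"               # xx xx xx xx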
My recommendation would be to have a slightly more rigid verification than the one included in the RFC to verify that you have:
a valid encoding of either UTF-8, UTF-16 or UTF-32 (LE or BE)
valid JSON
Looking only for NUL bytes is not enough.
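A sketch of that stricter verification, reusing the sniffing function from the sketch above (parse_strict is my name; json is Python's standard library module):

    import json

    VALID_ENCODINGS = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")

    def parse_strict(data: bytes):
        """Accept only input that is a valid JSON text in one of the five
        valid encodings; raise ValueError otherwise."""
        encoding = sniff_json_encoding(data)          # heuristic sketched above
        if encoding not in VALID_ENCODINGS:
            raise ValueError("unsupported encoding")
        try:
            text = data.decode(encoding)              # must be validly encoded...
        except UnicodeDecodeError as exc:
            raise ValueError("not valid %s" % encoding) from exc
        return json.loads(text)                       # ...and must be valid JSON

(In Python specifically, json.loads already accepts bytes and autodetects UTF-8/UTF-16/UTF-32 on its own, so a wrapper like this mainly matters when you want to reject questionable input rather than tolerate it.)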
That having been said, at no point do you need any BOM characters to determine the encoding, nor do you need a MIME charset - both are unnecessary and invalid in JSON.
You only have to use the binary content-transfer-encoding when using UTF-16 or UTF-32, because those may contain NUL bytes. UTF-8 doesn't have that problem, so the 8bit content-transfer-encoding is fine: a UTF-8 string contains no NUL bytes. It does still contain bytes >= 128, though, so a 7-bit transfer will not work (UTF-7 would survive such a transfer, but it is not one of the valid JSON encodings, so the result would not be valid JSON).
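A small illustration of why that is (Python; the sample document is made up):

    doc = '{"price": "\u20ac10"}'            # contains a non-ASCII character (the euro sign)

    utf8  = doc.encode("utf-8")
    utf16 = doc.encode("utf-16-le")

    assert b"\x00" not in utf8               # UTF-8: no NUL bytes, 8bit CTE is enough
    assert b"\x00" in utf16                  # UTF-16: NUL bytes, so binary CTE is needed
    assert any(b >= 128 for b in utf8)       # but UTF-8 is not 7-bit clean either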
See also this answer for more details.
Answering your followup questions
Are these correct deductions?
Yes.
Will I run into problems when implementing web services or web clients which adhere to these interpretations?
Possibly, if you interact with incorrect implementations. Your implementation MAY ignore the BOM for the sake of interoperability with incorrect implementations - see RFC 7159, Section 8.1:
In the interests of interoperability, implementations
that parse JSON texts MAY ignore the presence of a byte order mark
rather than treating it as an error.
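If you do take advantage of that "MAY", a lenient front end might look like the sketch below (parse_lenient is my name; this is one possible policy, not something the RFC prescribes):

    import codecs
    import json

    # Check the UTF-32 BOMs before the UTF-16 ones: the UTF-32LE BOM starts
    # with the same two bytes as the UTF-16LE BOM.
    BOMS = (codecs.BOM_UTF8,
            codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
            codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

    def parse_lenient(data: bytes):
        """Silently drop a leading BOM (tolerated by RFC 7159), then parse."""
        for bom in BOMS:
            if data.startswith(bom):
                data = data[len(bom):]
                break
        return json.loads(data)      # json.loads autodetects UTF-8/16/32 for bytes input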
Also, ignoring the MIME charset is the expected behavior of compliant JSON implementations - see RFC 7159, Section 11:
Note: No "charset" parameter is defined for this registration.
Adding one really has no effect on compliant recipients.
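For example, a compliant producer simply labels the body application/json, encodes it as UTF-8 and adds no BOM (a minimal sketch using Python's standard http.server module; the handler class and the payload are mine):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class JsonHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = json.dumps({"hello": "world"}).encode("utf-8")   # UTF-8, no BOM
            self.send_response(200)
            self.send_header("Content-Type", "application/json")    # no charset parameter
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    # HTTPServer(("", 8000), JsonHandler).serve_forever()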
Security considerations
I am not personally convinced that silently accepting incorrect JSON streams is always desirable. If you decide to accept input with a BOM and/or a MIME charset, then you will have to answer these questions:
What to do in case of a mismatch between MIME charset and actual encoding?
What to do in case of a mismatch between BOM and MIME charset?
What to do in case of a mismatch between BOM and the actual encoding?
What to do when all of them differ?
What to do with encodings other than UTF-8/16/32?
Are you sure that all security checks will work as expected?
Having the encoding defined in three independent places - in the JSON text itself, in the BOM and in the MIME charset - makes the question inevitable: what do you do if they disagree? And unless you reject such input, there is no one obvious answer.
For example, if you have code that verifies a JSON string to see whether it's safe to eval it in JavaScript, it might be misled by the MIME charset or the BOM, treat the input as a different encoding than it actually is, and fail to detect strings that it would have detected had it used the correct encoding. (A similar problem with HTML has led to XSS attacks in the past.)
You have to be prepared for all of those possibilities whenever you decide to accept incorrect JSON strings with multiple, possibly conflicting encoding indicators. That is not to say that you should never do it - you may need to consume input generated by incorrect implementations. I'm just saying that you need to consider the implications thoroughly.
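If you do decide to accept such input, one defensible policy is to require that all the indicators agree and reject everything else. A sketch of that policy, reusing the sniffing function from earlier (check_consistency and its arguments are my names):

    import codecs

    # UTF-32 BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    BOM_ENCODINGS = {
        codecs.BOM_UTF32_LE: "utf-32-le", codecs.BOM_UTF32_BE: "utf-32-be",
        codecs.BOM_UTF16_LE: "utf-16-le", codecs.BOM_UTF16_BE: "utf-16-be",
        codecs.BOM_UTF8: "utf-8",
    }

    def check_consistency(data: bytes, mime_charset=None) -> str:
        """Return the detected encoding, or raise ValueError if the BOM,
        the MIME charset and the actual byte pattern disagree."""
        bom_encoding = None
        for bom, name in BOM_ENCODINGS.items():
            if data.startswith(bom):
                bom_encoding, data = name, data[len(bom):]
                break
        detected = sniff_json_encoding(data)          # heuristic sketched earlier
        if bom_encoding and bom_encoding != detected:
            raise ValueError("BOM says %s, bytes look like %s" % (bom_encoding, detected))
        if mime_charset:
            # crude normalization: compare without hyphens, and let a plain
            # "UTF-16"/"UTF-32" charset match either endianness
            want = mime_charset.lower().replace("-", "")
            have = detected.replace("-", "")
            if not have.startswith(want):
                raise ValueError("charset=%s, bytes look like %s" % (mime_charset, detected))
        return detected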
Nonconforming implementations
Should I file bugs against web browsers which violate the two properties above?
Certainly - if they call it JSON and the implementation doesn't conform to the JSON RFC then it is a bug and should be reported as such.
Have you found any specific implementations that don't conform to the JSON specification and yet claim to do so?
I think you are correct about question 1, due to Section 3 about the first two characters being ASCII and the Unicode FAQ on BOMs (see "Q: How I should deal with BOMs?", answer part 3). Your emphasis on MUST may be a bit strong: the FAQ seems to imply SHOULD.
Don't know the answer to question 2.
Related
I have this piece of HTML from Wikipedia:

    <a href=3D"URL">Judith Ehrlich</a>
I understand "=3D" is Quoted-Printable encoding for "=" but im not sure what 3D"URL" means. Normally when I would see a link in HTML it would be written like this
Judith Ehrlich
In quoted-printable, any non-standard octets are represented as an = sign followed by two hex digits representing the octet's value. To represent a plain =, it needs to be represented using quoted-printable encoding too: 3D are the hex digits corresponding to ='s ASCII value (61).
In other words, the sequence of =3D"URL" in those fields is converted to just ="URL". 3D"URL" without = has no meaning on its own.
If used in a parsing/rendering situation that interprets = as the start of a QP-encoded octet, omitting the 3D would result in the parser wrongly interpreting the following characters (e.g. "U) as an encoded octet. Writing =3D is necessary to get an actual = into the parsed result.
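You can check this with any quoted-printable decoder, for instance Python's standard quopri module (the URL below is a placeholder, not the one from the original snippet):

    import quopri

    raw = b'<a href=3D"https://example.org/">Judith Ehrlich</a>'
    print(quopri.decodestring(raw).decode("ascii"))
    # prints: <a href="https://example.org/">Judith Ehrlich</a>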
For example, if I want the bullet point character in my HTML page, I could either type out &#8226; or just copy-paste •. What's the real difference?
&#8226; is a sequence of 7 ASCII characters: ampersand (&), number sign (#), eight (8), two (2), two (2), six (6), semicolon (;).
• is 1 single bullet point character.
That is the most obvious difference.
The former is not a bullet point. It's a string of characters that an HTML browser would parse to produce the final bullet point that is rendered to the user. You will always be looking at this string of ASCII characters whenever you look at your HTML's source code.
The latter is exactly the bullet point character that you want, and it's clear and precise to understand when you look at it.
Now, &#8226; uses only ASCII characters, and so the file it appears in can be encoded using pure ASCII, or any ASCII-compatible encoding. Since ASCII is the de facto basis of virtually all common encodings, this means you don't need to worry much about the file encoding; you can blissfully ignore that part of working with text files and you'll probably never run into any issues.
However, &#8226; is only meaningful in HTML. It remains just a string of ASCII characters in the context of a database, a plain-text email, or any other non-HTML situation.
•, on the other hand, is not a character that can be encoded in ASCII, so you need to consciously choose an encoding which can represent that character (like UTF-8), and you need to ensure that you're sending the correct metadata to ensure that clients interpret the encoding correctly as well (HTTP headers, HTML <meta> tags, etc). See UTF-8 all the way through.
But • means • in any context, plain-text or otherwise, and does not need to be specifically HTML-interpreted.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
A character entity reference such as &#8226; works independently of the document encoding. It takes up more octets in the source (here: 7).
A character such as • works only with the precise encoding declared for the document. It takes up fewer octets in the source (here: 3, assuming UTF-8).
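To make the difference concrete, a short Python illustration (html.unescape is the standard-library HTML entity decoder):

    import html

    entity  = "&#8226;"      # 7 ASCII characters; meaningful only to an HTML parser
    literal = "\u2022"       # the single bullet point character itself

    print(len(entity), len(literal))          # 7 1
    print(html.unescape(entity) == literal)   # True: an HTML parser turns one into the other
    print(entity.encode("ascii"))             # works: b'&#8226;'
    print(len(literal.encode("utf-8")))       # 3 octets in UTF-8
    # literal.encode("ascii") would raise UnicodeEncodeError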
I learned recently (from these questions) that at some point it was advisable to encode ampersands in href parameters. That is to say, instead of writing:
...
One should write:
...
Apparently, the former example shouldn't work, but browser error recovery means it does.
Is this still the case in HTML5?
We're now past the era of draconian XHTML requirements. Was this a requirement of XHTML's strict handling, or is it really still something that I should be aware of as a web developer?
It is true that one of the differences between HTML5 and HTML4, quoted from the W3C Differences Page, is:
The ampersand (&) may be left unescaped in more cases compared to HTML4.
In fact, the HTML5 spec goes to great lengths describing actual algorithms that determine what it means to consume (and interpret) characters.
In particular, in the section on tokenizing character references from Chapter 8 in the HTML5 spec, we see that when you are inside an attribute, and you see an ampersand character that is followed by:
a tab, LF, FF, space, <, &, EOF, or the additional allowed character (a " or ' if the attribute value is quoted or a > if not) ===> then the ampersand is just an ampersand, no worries;
a number sign ===> then the HTML5 tokenizer will go through the many steps to determine if it has a numeric character entity reference or not, but note in this case one is subject to parse errors (do read the spec)
any other character ===> the parser will try to find a named character reference, e.g., something like &notin;.
The last case is the one of interest to you since your example has:
...
You have the character sequence
AMPERSAND
LATIN SMALL LETTER Y
EQUAL SIGN
Now here is the part from the HTML5 spec that is relevant in your case, because y is not a named entity reference:
If no match can be made, then no characters are consumed, and nothing is returned. In this case, if the characters after the U+0026 AMPERSAND character (&) consist of a sequence of one or more alphanumeric ASCII characters followed by a U+003B SEMICOLON character (;), then this is a parse error.
You don't have a semicolon there, so you don't have a parse error.
Now suppose you had, instead,
...
which is different because &eacute; is a named entity reference in HTML. In this case, the following rule kicks in:
If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a "=" (U+003D) character, then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.
So there the = makes it an error, because legacy browsers might get confused.
Despite the fact that the HTML5 spec seems to go to great lengths to say "well, this ampersand is not beginning a character entity reference, so there's no reference here", the fact that you might run into URLs whose parameter names are also named references (e.g., isin, part, sum, sub), which would result in parse errors, means that IMHO you're better off escaping your ampersands. But of course, you only asked whether the restrictions were relaxed in attributes, not what you should do, and it does appear that they have been.
It would be interesting to see what validators can do.
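One practical way to sidestep the whole problem when generating markup is to escape the URL for the attribute context instead of reasoning about which parameter names collide with named references - for example with Python's html.escape (the URL here is made up):

    import html

    url = "page.cgi?isin=1&sub=2&part=3"    # parameter names that are also named references
    print('<a href="%s">...</a>' % html.escape(url, quote=True))
    # <a href="page.cgi?isin=1&amp;sub=2&amp;part=3">...</a>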
I'm reading a chapter from the W3C HTML Document Representation.
Section 5.1 says this:
User agents must also know the specific character encoding that was used to transform the document character stream into a byte stream.
Then Section 5.2 says this:
The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters.
5.1: characters -> bytes
5.2: bytes -> characters
So am I wrong, or are there two encodings involved in the representation?
A "character encoding" such as UTF-8 is, strictly speaking, a specification for representing characters as a sequence of bytes. But the encodings are always reversible, so we can speak of a (single) character encoding as going both ways.
Other character encodings used in practice are UTF-16 and UTF-32.
Each of these is a specification under which you can encode text as bytes and decode bytes back into characters - two parts of the same specification.
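In other words, both sections describe the same mapping, just read in opposite directions; for example:

    text  = "h\u00e9llo"             # a string of characters (é is non-ASCII)
    data  = text.encode("utf-8")     # characters -> bytes (what 5.1 is about)
    again = data.decode("utf-8")     # bytes -> characters (what 5.2 is about)
    assert again == text             # the same specification, applied in reverse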
RFC 4627 on JSON reads:
Encoding
JSON text SHALL be encoded in Unicode. The default encoding is
UTF-8.
Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.
What does "Since the first two characters of a JSON text will always be ASCII characters [RFC0020]" mean? I've looked at RFC 0020 but couldn't find anything about it. A JSON text could start with {" or with { " (i.e. whitespace before the quote).
It means that since JSON will always start with ASCII characters (non-ASCII is only permitted in strings, which cannot be the root object), it is possible to determine from the start of the stream/file what encoding it is in.
UTF-16 and UTF-32 should have a BOM that appears at the start of the stream and by finding out what it is, you can determine the exact encoding. This is possible as one can determine if the first characters are JSON or not.
I assume the spec specifically mentions this as for many other text streams/files, this is not always possible (as most text files can start with any two characters and the two first bytes of the actual file are not known in advance).
RFC 4627 requires a JSON document to represent either an object or an array. So, the first characters must be (with any amount of JSON whitespace characters) [ followed by a value or { followed by ". Values are null, true, false, or a string ("…), object or array. So, since JSON whitespace characters, [, {, n, t, f, and " are in the C0 Controls and Basic Latin block, they are also in the ASCII character set [by the design of Unicode]. (Not sure why the standard is fixated on "ASCII" when it says, "JSON text SHALL be encoded in Unicode." Future standards drop the reference.)
UTF-32 has four bytes per character. UTF-16 has two. So, to distinguish between UTF-16 and UTF-32, you need 4 bytes. In both of those encodings, characters from the C0 Controls and Basic Latin block are encoded with at most one non-zero byte (a byte with a value of 0 is sometimes called a "null byte"). Also, U+0000 (which is encoded as 0x00 0x00 0x00 0x00 in UTF-32 and 0x00 0x00 in UTF-16) is not valid JSON whitespace. So, the pattern of 0x00 bytes can be used to determine which of the allowed encodings, a valid JSON document uses.
RFC 7159 changed JSON to allow a JSON document to represent any value, not just an object or array. So, the statement in the previous standard is no longer valid. Therefore, the character detection algorithm was broken and removed from the standard.
For accurate detection, you need to see the beginning and the end of the document. 0x22 0x00 0x00 0x00 at the beginning could be any of UTF-8, UTF-16LE or UTF-32LE; it's the start of a string with zero or more U+0000 characters. In that case, you need the number of 0x00 bytes at the end to tell which.
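You can check that ambiguity directly; all three decodes below succeed on the same four bytes (Python):

    data = b'\x22\x00\x00\x00'               # 0x22 is the double-quote character

    print(repr(data.decode("utf-8")))        # '"\x00\x00\x00' - " plus three U+0000
    print(repr(data.decode("utf-16-le")))    # '"\x00'         - " plus one U+0000
    print(repr(data.decode("utf-32-le")))    # '"'             - just an opening quote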
RFC 8259 changed JSON to require UTF-8 (for JSON "exchanged between systems that are not part of a closed ecosystem"). Out of practicality, a JSON reader would still accept UTF-16 and UTF-32.
In the end, some popular JSON parsers leave character decoding up to the caller, having APIs that accept only the "native" string type for the programming environment. (This opens up the very common hazard of using the wrong character encoding for reading text files or HTTP bodies.)