I learned recently (from these questions) that at some point it was advisable to encode ampersands in href parameters. That is to say, instead of writing:
...
One should write:
...
Apparently, the former example shouldn't work, but browser error recovery means it does.
Is this still the case in HTML5?
We're now past the era of draconian XHTML requirements. Was this a requirement of XHTML's strict handling, or is it really still something that I should be aware of as a web developer?
It is true that one of the differences between HTML5 and HTML4, quoted from the W3C Differences Page, is:
The ampersand (&) may be left unescaped in more cases compared to HTML4.
In fact, the HTML5 spec goes to great lengths describing actual algorithms that determine what it means to consume (and interpret) characters.
In particular, in the section on tokenizing character references from Chapter 8 in the HTML5 spec, we see that when you are inside an attribute, and you see an ampersand character that is followed by:
a tab, LF, FF, space, <, &, EOF, or the additional allowed character (a " or ' if the attribute value is quoted or a > if not) ===> then the ampersand is just an ampersand, no worries;
a number sign ===> then the HTML5 tokenizer will go through the many steps to determine if it has a numeric character entity reference or not, but note in this case one is subject to parse errors (do read the spec)
any other character ===> the parser will try to find a named character reference, e.g., something like &notin;.
The last case is the one of interest to you since your example has:
...
You have the character sequence
AMPERSAND
LATIN SMALL LETTER Y
EQUAL SIGN
Now here is the part from the HTML5 spec that is relevant in your case, because y is not a named entity reference:
If no match can be made, then no characters are consumed, and nothing is returned. In this case, if the characters after the U+0026 AMPERSAND character (&) consist of a sequence of one or more alphanumeric ASCII characters followed by a U+003B SEMICOLON character (;), then this is a parse error.
You don't have a semicolon there, so you don't have a parse error.
Now suppose you had, instead,
...
which is different because &eacute; is a named entity reference in HTML. In this case, the following rule kicks in:
If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a "=" (U+003D) character, then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.
So there the = makes it an error, because legacy browsers might get confused.
Despite the fact that the HTML5 spec goes to great lengths to say "well, this ampersand is not beginning a character entity reference, so there's no reference here", you might still run into URLs whose parameter names happen to be named references (e.g., isin, part, sum, sub), which would result in parse errors, so IMHO you're better off escaping your ampersands. But of course, you only asked whether the restrictions were relaxed in attributes, not what you should do, and it does appear that they have been.
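In practice, if you generate the markup programmatically you can simply escape attribute values and not worry about which parameter names happen to collide with character references. A minimal Python sketch (the URL is just an illustration, not from the question):

import html

# Hypothetical query-string URL used only for illustration.
url = "results.cgi?chapter=1&section=2&copy=true"

# html.escape turns & into &amp; (and also escapes <, > and quotes),
# so named references such as &copy or &sect can never be formed.
print('<a href="{}">Results</a>'.format(html.escape(url, quote=True)))
# -> <a href="results.cgi?chapter=1&amp;section=2&amp;copy=true">Results</a>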
It would be interesting to see what validators can do.
Related
I have this piece of HTML from Wikipedia.
Judith Ehrlich
I understand "=3D" is Quoted-Printable encoding for "=" but im not sure what 3D"URL" means. Normally when I would see a link in HTML it would be written like this
Judith Ehrlich
In quoted-printable, any non-standard octets are represented as an = sign followed by two hex digits representing the octet's value. To represent a plain =, it needs to be represented using quoted-printable encoding too: 3D are the hex digits corresponding to ='s ASCII value (61).
In other words, the sequence of =3D"URL" in those fields is converted to just ="URL". 3D"URL" without = has no meaning on its own.
If used in a parsing/rendering situation that interprets = as a QP escape, omitting 3D would result in the parser wrongly interpreting the next two characters (here "U) as a hex byte value. Using =3D is necessary to put an actual = in the decoded result.
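For illustration, Python's standard quopri module performs exactly this decoding (the URL below is just a placeholder, not the one from the question):

import quopri

raw = b'<a href=3D"https://example.org/">Judith Ehrlich</a>'

# Quoted-printable decoding turns =3D back into a literal "=".
decoded = quopri.decodestring(raw)
print(decoded.decode("ascii"))
# -> <a href="https://example.org/">Judith Ehrlich</a>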
For example, if I want the bullet point character in my HTML page, I could either type out &#8226; or just copy-paste •. What's the real difference?
&#8226; is a sequence of 7 ASCII characters: ampersand (&), number sign (#), eight (8), two (2), two (2), six (6), semicolon (;).
• is 1 single bullet point character.
That is the most obvious difference.
The former is not a bullet point. It's a string of characters that an HTML browser would parse to produce the final bullet point that is rendered to the user. You will always be looking at this string of ASCII characters whenever you look at your HTML's source code.
The latter is exactly the bullet point character that you want, and it's clear and precise to understand when you look at it.
Now, &#8226; uses only ASCII characters, and so the file it is in can be encoded using pure ASCII, or any compatible encoding. Since ASCII is the de-facto basis of virtually all common encodings, this means you don't need to worry much about the file encoding and you can blissfully ignore that part of working with text files and you'll probably never run into any issues.
However, &#8226; is only meaningful in HTML. It remains just a string of ASCII characters in the context of a database, a plain-text email, or any other non-HTML situation.
•, on the other hand, is not a character that can be encoded in ASCII, so you need to consciously choose an encoding which can represent that character (like UTF-8), and you need to ensure that you're sending the correct metadata to ensure that clients interpret the encoding correctly as well (HTTP headers, HTML <meta> tags, etc). See UTF-8 all the way through.
But • means • in any context, plain-text or otherwise, and does not need to be specifically HTML-interpreted.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
A character entity reference such as &#8226; works independently of the document encoding. It takes up more octets in the source (here: 7).
A character such as • works only with the precise encoding declared for the document. It takes up fewer octets in the source (here: 3, assuming UTF-8).
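To see both points concretely, a small Python sketch comparing the two representations (assuming a UTF-8 source file):

import html

entity = "&#8226;"          # 7 ASCII characters, works in any HTML encoding
literal = "\u2022"          # the bullet character itself

print(html.unescape(entity) == literal)   # True: both render the same bullet
print(len(entity.encode("ascii")))        # 7 octets in the source
print(len(literal.encode("utf-8")))       # 3 octets in a UTF-8 source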
There are HTML equivalents for ">" and "<" ("&gt;" and "&lt;") in the OBX-5 field, which is causing the Terser.get(..) method to fetch only the characters up to the ampersand character. The encoding characters in MSH-2 are "^~\&". Is terser.get(..) failing because there's an encoding character in the OBX-5 field? Is there a way to change these characters to ">" and "<" easily?
Thanks a lot for your help.
Yes, it fails because the ampersand has been declared as the subcomponent separator, and the message you are trying to process is not valid -- it should not contain (unescaped) HTML character entities (&lt; and &gt;).
If you cannot influence how the incoming messages are encoded, you should preprocess the message before giving it to the Terser, replacing the illegal characters (see the sketch below). I'm pretty sure HAPI cannot help you there.
In a valid HL7v2 message, the data type used in OBX-5 is determined by OBX-2. OBX-5 should only contain the characters and escape sequences allowed by the declared data type. < and > are among them (if not declared as separators in MSH-2).
The HL7 standard defines escape sequences for the separator and delimiter characters (e.g. \T\ is the escape sequence for the subcomponent separator).
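If you do have to accept such messages, a preprocessing step along these lines (a rough sketch, not a built-in HAPI feature) can be applied before handing the message to the Terser:

def preprocess_hl7(message: str) -> str:
    """Rough sketch: replace the HTML entities that break parsing.

    Safe only as long as < and > are not declared as separators in MSH-2.
    """
    return message.replace("&lt;", "<").replace("&gt;", ">")

# Hypothetical OBX segment used only for illustration.
obx = "OBX|1|ST|GLU^Glucose||&lt;5 mmol/L&gt;||||||F"
print(preprocess_hl7(obx))
# -> OBX|1|ST|GLU^Glucose||<5 mmol/L>||||||F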
I've been reading up on the RFC-4627 specification, and I've come to this interpretation:
When advertising a payload as application/json mime-type,
there MUST be no BOMs at the beginning of properly encoded JSON streams (based on section "3. Encoding"), and
no media parameters are supported, thus a mime-type header of application/json; charset=utf-8 does not conform to RFC-4627 (based on section "6. IANA Considerations").
Are these correct deductions? Will I run into problems when implementing web services or web clients which adhere to these interpretations? Should I file bugs against web browsers which violate the two properties above?
You are right
The BOM character is illegal in JSON (and not needed)
The MIME charset is illegal in JSON (and not needed as well)
RFC 7159, Section 8.1:
Implementations MUST NOT add a byte order mark to the beginning of a JSON text.
This is put as clearly as it can be. This is the only "MUST NOT" in the entire RFC.
RFC 7159, Section 11:
The MIME media type for JSON text is application/json.
Type name: application
Subtype name: json
Required parameters: n/a
Optional parameters: n/a
[...]
Note: No "charset" parameter is defined for this registration.
JSON encoding
The only valid encodings of JSON are UTF-8, UTF-16 and UTF-32. Since the first character (or the first two, if there is more than one) will always have a Unicode value lower than 128 (there is no valid JSON text whose first two characters have higher values), it is always possible to tell which of the valid encodings, and which endianness, was used just by looking at the byte stream.
RFC recommendation
The JSON RFC says that the first two characters will always be below 128 and you should check the first 4 bytes.
I would put it differently: since a string "1" is also valid JSON there is no guarantee that you have two characters at all - let alone 4 bytes.
My recommendation
My recommendation for determining the JSON encoding would be slightly different:
Fast method:
if you have 1 byte and it's not NUL - it's UTF-8
(actually the only valid character here would be an ASCII digit)
if you have 2 bytes and none of them are NUL - it's UTF-8
(those must be ASCII digits with no leading '0', {}, [] or "")
if you have 2 bytes and only the first is NUL - it's UTF-16BE
(it must be an ASCII digit encoded as UTF-16, big endian)
if you have 2 bytes and only the second is NUL - it's UTF-16LE
(it must be an ASCII digit encoded as UTF-16, little endian)
if you have 3 bytes and they are not NUL - it's UTF-8
(again, ASCII digits with no leading '0's, "x", [1] etc.)
if you have 4 bytes or more, then the RFC method works:
00 00 00 xx - it's UTF-32BE
00 xx 00 xx - it's UTF-16BE
xx 00 00 00 - it's UTF-32LE
xx 00 xx 00 - it's UTF-16LE
xx xx xx xx - it's UTF-8
but it only works if it is indeed a valid string in any of those encodings, which it may not be. Moreover, even if you have a valid string in one of the 5 valid encodings, it may still not be a valid JSON.
My recommendation would be to have a slightly more rigid verification than the one included in the RFC to verify that you have:
a valid encoding of either UTF-8, UTF-16 or UTF-32 (LE or BE)
a valid JSON
Looking only for NUL bytes is not enough.
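For illustration, a rough Python sketch of the NUL-byte heuristic described above; as noted, it only guesses the encoding, and you still need to validate both the encoding and the JSON itself:

def detect_json_encoding(data: bytes) -> str:
    """Rough sketch of the NUL-byte heuristic.

    Assumes `data` is valid JSON in one of UTF-8, UTF-16LE/BE or UTF-32LE/BE;
    it does not validate the text itself.
    """
    if len(data) >= 4:
        if data[:3] == b"\x00\x00\x00":
            return "utf-32-be"      # 00 00 00 xx
        if data[1:4] == b"\x00\x00\x00":
            return "utf-32-le"      # xx 00 00 00
        if data[0] == 0 and data[2] == 0:
            return "utf-16-be"      # 00 xx 00 xx
        if data[1] == 0 and data[3] == 0:
            return "utf-16-le"      # xx 00 xx 00
        return "utf-8"              # xx xx xx xx
    if len(data) >= 2:
        if data[0] == 0:
            return "utf-16-be"
        if data[1] == 0:
            return "utf-16-le"
    return "utf-8"

print(detect_json_encoding(b'{"a": 1}'))                 # utf-8
print(detect_json_encoding("[1]".encode("utf-16-le")))   # utf-16-le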
That having been said, at no point do you need any BOM characters to determine the encoding, nor do you need the MIME charset - both of which are neither needed nor valid in JSON.
You only have to use the binary content-transfer-encoding for UTF-16 and UTF-32, because those may contain NUL bytes. UTF-8 doesn't have that problem, and the 8bit content-transfer-encoding is fine, since a UTF-8 string never contains NUL (though it still contains bytes >= 128, so 7-bit transfer will not work - UTF-7 would survive such a transfer, but it wouldn't be valid JSON, as it is not one of the valid JSON encodings).
See also this answer for more details.
Answering your followup questions
Are these correct deductions?
Yes.
Will I run into problems when implementing web services or web clients which adhere to these interpretations?
Possibly, if you interact with incorrect implementations. Your implementation MAY ignore the BOM for the sake of interoperability with incorrect implementations - see RFC 7159, Section 8.1:
In the interests of interoperability, implementations
that parse JSON texts MAY ignore the presence of a byte order mark
rather than treating it as an error.
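As a sketch of that choice (parse_json_strict is a hypothetical helper, assuming UTF-8 input):

import json

UTF8_BOM = b"\xef\xbb\xbf"

def parse_json_strict(data: bytes, tolerate_bom: bool = False):
    """Hypothetical helper: parse UTF-8 JSON bytes, rejecting a BOM by default.

    RFC 7159 forbids producers from emitting a BOM, but allows parsers to
    ignore one; pass tolerate_bom=True to opt into that behaviour.
    """
    if data.startswith(UTF8_BOM):
        if not tolerate_bom:
            raise ValueError("JSON text must not begin with a byte order mark")
        data = data[len(UTF8_BOM):]
    return json.loads(data.decode("utf-8"))

print(parse_json_strict(b'{"ok": true}'))                          # {'ok': True}
print(parse_json_strict(UTF8_BOM + b'[1, 2]', tolerate_bom=True))  # [1, 2]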
Also, ignoring the MIME charset is the expected behavior of compliant JSON implementations - see RFC 7159, Section 11:
Note: No "charset" parameter is defined for this registration.
Adding one really has no effect on compliant recipients.
Security considerations
I am not personally convinced that silently accepting incorrect JSON streams is always desirable. If you decide to accept input with a BOM and/or a MIME charset, then you will have to answer these questions:
What to do in case of a mismatch between MIME charset and actual encoding?
What to do in case of a mismatch between BOM and MIME charset?
What to do in case of a mismatch between BOM and the actual encoding?
What to do when all of them differ?
What to do with encodings other than UTF-8/16/32?
Are you sure that all security checks will work as expected?
Having the encoding defined in three independent places - in the JSON string itself, in the BOM and in the MIME charset - makes one question inevitable: what do you do if they disagree? And unless you reject such input, there is no single obvious answer.
For example, if you have code that verifies a JSON string to see if it's safe to eval it in JavaScript - it might be misled by the MIME charset or the BOM, treat it as a different encoding than it actually is, and fail to detect strings that it would detect if it used the correct encoding. (A similar problem with HTML has led to XSS attacks in the past.)
You have to be prepared for all of those possibilities whenever you decide to accept incorrect JSON strings with multiple and possibly conflicting encoding indicators. It's not to say that you should never do that because you may need to consume input generated by incorrect implementations. I'm just saying that you need to thoroughly consider the implications.
Nonconforming implementations
Should I file bugs against web browsers which violate the two properties above?
Certainly - if they call it JSON and the implementation doesn't conform to the JSON RFC then it is a bug and should be reported as such.
Have you found any specific implementations that don't conform to the JSON specification yet advertise that they do?
I think you are correct about question 1, due to Section 3 about the first two characters being ASCII and the Unicode FAQ on BOMs (see "Q: How I should deal with BOMs?", answer part 3). Your emphasis on MUST may be a bit strong: the FAQ seems to imply SHOULD.
Don't know the answer to question 2.
I am proposing to convert my windows-1252 XHTML web pages to UTF-8.
I have the following character entities in my coding:
&#39; — apostrophe,
&#9658; — right pointer,
&#9668; — left pointer.
If I change the charset and save the pages as UTF-8 using my editor:
the apostrophe remains in as a character entity;
the pointers are converted to symbols within the code (presumably because the entities are not supported in UTF-8?).
Questions:
If I understand UTF-8 correctly, you don't need to use the entities and can type characters directly into the code. In which case is it safe for me to replace #39 with a typed in apostrophe?
Is it correct that the editor has placed the pointer symbols directly into my code, and will these be displayed reliably by modern browsers? It seems to be OK. Presumably I can't revert to the entities anyway if I use UTF-8?
Thanks.
It's charset, not chartset.
1) It depends on where the apostrophe is used. It's a valid ASCII character as well, so depending on the character's intention (whether it's for display only (inside a DOMText node) or used in code), you may or may not be able to use a literal apostrophe.
2) If your editor is a modern editor, it will be using UTF-8 byte sequences rather than single bytes to represent text. Most of the characters used in code are plain ASCII (and ASCII is a subset of UTF-8), so those characters take up one byte. Other characters may take up two, three or even four bytes. They will still be displayed to you as one character, but the relation between character and byte is no longer one-to-one.
Anyway, since all valid ASCII characters are encoded exactly the same in ASCII, UTF-8 and even windows-1252, you should not see any problems using UTF-8. And you can still use numeric and named entities, because they are written using those ASCII characters. You just don't have to.
P.S. All modern browsers can handle UTF-8 just fine, but our definitions of "modern" may vary.
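If you'd rather do the conversion outside your editor, here is a small Python sketch (the file names and the pointer entities it rewrites are just assumptions based on the question):

import html
from pathlib import Path

def reencode(src: str, dst: str) -> None:
    """Re-encode a windows-1252 page as UTF-8 (hypothetical file names).

    The entity replacements are optional: &#9658; and the literal character
    render identically once the file is UTF-8.
    """
    text = Path(src).read_text(encoding="windows-1252")
    for ref in ("&#9658;", "&#9668;"):      # the pointer entities
        text = text.replace(ref, html.unescape(ref))
    Path(dst).write_text(text, encoding="utf-8")

reencode("page-1252.html", "page-utf8.html")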
Entities have three purposes: encoding characters that cannot be represented in the document's character encoding (not relevant with UTF-8), encoding characters that are not convenient to type on a given keyboard, and encoding characters that are illegal when left unescaped.
&#9658; should always produce ► no matter what the encoding. If it doesn't, it's a bug elsewhere.
► directly in the source is fine in UTF-8. You can do either that or the entity, and it makes no difference.
' is fine in most contexts, but not in some. The following is allowed:
<span title="Jon's example">This is Jon's example</span>
But it would have to be encoded in:
<span title='Jon's example'>This is Jon's example</span>
because otherwise it would be taken as the ' that ends the attribute value.
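When you generate such attribute values programmatically, escaping handles this for you; a minimal Python sketch using the standard html module:

import html

title = "Jon's example"

# quote=True also escapes ' (as &#x27;) and " (as &quot;), so the value is
# safe inside single- or double-quoted attributes alike.
attr = html.escape(title, quote=True)
print("<span title='{}'>{}</span>".format(attr, html.escape(title)))
# -> <span title='Jon&#x27;s example'>Jon&#x27;s example</span>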
Use entities if you copy/paste content from a word processor or if the code is an XML dialect. Use a macro in your text-editor to find/replace the common ones in one shot. Here is a simple list:
Half: ½ => &frac12;
Acute Accent: é => &eacute;
Ampersand: & => &amp;
Apostrophe: ’ => &#39;
Backtick: ‘ => &#96;
Backslash: \ => &#92;
Bullet: • => &bull;
Dollar Sign: $ => &#36;
Cents Sign: ¢ => &cent;
Ellipsis: … => &hellip;
Emdash: — => &mdash;
Endash: – => &ndash;
Left Quote: “ => &ldquo;
Right Quote: ” => &rdquo;
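Such a find/replace pass can also be scripted; a small Python sketch using an assumed mapping that mirrors the list above:

# Assumed mapping mirroring the list above: word-processor characters
# on the left, HTML entities on the right.
REPLACEMENTS = {
    "½": "&frac12;", "é": "&eacute;", "’": "&#39;", "‘": "&#96;",
    "•": "&bull;", "¢": "&cent;", "…": "&hellip;",
    "—": "&mdash;", "–": "&ndash;", "“": "&ldquo;", "”": "&rdquo;",
}

def encode_entities(text: str) -> str:
    # One-pass replacement of the listed characters; ampersands are handled
    # separately so existing references are not double-escaped.
    for char, entity in REPLACEMENTS.items():
        text = text.replace(char, entity)
    return text

print(encode_entities("“Copy–paste from Word…”"))
# -> &ldquo;Copy&ndash;paste from Word&hellip;&rdquo;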
References
XML Entity Names