What Are The Reserved Characters In (X)HTML? - html

Yes, I've googled it, and surprisingly got confusing answers.
One page says that < > & " are the only reserved characters in (X)HTML. No doubt, this makes sense.
This page says < > & " ' are the reserved characters in (X)HTML. A little confusing, but okay, this makes sense too.
And then comes this page which says < > & " © ° £ and non-breaking space (&nbsp) are all reserved characters in (X)HTML. This makes no sense at all, and pretty much adds to my confusion.
Can someone knowledgeable, who actually do know this stuff, clarify which the reserved characters in (X)HTML actually are?
EDIT: Also, should all the reserved characters in code be escaped when wrapped in <pre> tag? or is it just these three -- < > & ??

The XHTML 1.0 specification states at http://www.w3.org/TR/2002/REC-xhtml1-20020801/#xhtml:
XHTML 1.0 [...] is a reformulation of the three HTML 4 document types as
applications of XML 1.0 [XML].
The XML 1.0 specification states at http://www.w3.org/TR/2008/REC-xml-20081126/#syntax:
Character Data and Markup: Text consists of intermingled character
data and markup. [...] The ampersand character (&) and the left angle
bracket (<) MUST NOT appear in their literal form, except when used as
markup delimiters, or within a comment, a processing instruction, or a
CDATA section. If they are needed elsewhere, they MUST be escaped
using either numeric character references or the strings "&" and
"<" respectively. The right angle bracket (>) may be represented
using the string ">", and MUST, for compatibility, be escaped
using either ">" or a character reference when it appears in the
string "]]>" in content, when that string is not marking the end of
a CDATA section.
This means that when writing the text parts of an XHTML document you must escape &, <, and >.
You can escape a lot more, e.g. ü for umlaut u. You can as well state that the document is encoded in for example UTF-8 and write the byte sequence 0xc3bc instead to get the same umlaut u.
When writing the element parts (col. "tags") of the document, there are different rules. You have to take care of ", ' and a lot of rules concerning comments, CDATA and so on. There are also rules which characters can be used in element and attribute names. You can look it up in the XML specification, but in the end it comes down to: for element and attribute names, use letters, digits and "-"; do not use "_". For attribute values, you must escape & and (depending on the quote style) either ' or ".
If you use one of the many libraries to write XML / XHTML documents, somebody else has already taken care of this and you just have to tell the library to write text or elements. All the escaping is done the in the background.&

Only < and & need to be escaped. Inside attributes, " or ' (depending on which quote style you use for the attribute's value) needs to be escaped, too.
<a href="#" onclick='here you can use " safely'></a>

By writing "(X)HTML", you are asking (at least) two different questions.
By the HTML rules, with "HTML" meaning any HTML version up to and including HTML 4.01, only "<" and "&" are reserved. The rules are somewhat complex. They should not not appear literally except in their syntactic use in tags, entity references, and character references. But by the formal rules, they may appear literally e.g. in the context "A & B" or "A < B" (but A&B be formally wrong, and so would A<B).
The XHTML rules, based on XML, are somewhat stricter, simpler: "<" and "&" are unconditionally reserved.
The ASCII quotation mark " and the ASCII apostrophe ' are not reserved, except in the very specific sense that a quoted attribute value must not literally contain the character used as quote, i.e. in "foo" the string foo must not contain " as such and in 'foo' the string foo must not contain ' as such.

The characters < > & " are reserved by XML format.
It means that you can use < and > chars only to define tags (<mytag></mytag>).
Double quotes (") are used to define values of attributes (<mytag attribute="value" />)
Ampersand (&) is used to write entities (& is used when you actually want to write ampersand, NOT &). Also, when you write url in your XML document, you should use &, not just &: www.aaa.com?a=1&b=2 - is wrong; www.aaa.com?a=1&b=2 - is good!
XHTML is based on XML, so what I have wrote applies to XHTML.
© ° £ - These are not reserved chars. These are entities defined specifically for XHTML, not for XML.
In XML you can simply write ©. In XHMTL you can also simply write ©, or use entity ©, or numeric entity &00A9;.

In addition to the other answers, it might help to know that there are also forbidden characters: all control characters in ASCII and ISO-8859-1 except TAB, LF, and CR.
https://www.w3.org/MarkUp/html3/specialchars.html

Related

Difference in HTML Entity length in JavaScript

Why does the entity have length 6 while the entity ↓ has length 1? Is this in the spec somewhere? (Tested in Firefox, Chrome and Safari.)
JSFiddle
I agree that this is very weird behavior, but at least it's specified.
The HTML fragment serialization algorithm states that:
Escaping a string (for the purposes of the algorithm above) consists of replacing any occurrences of the "&" character by the string "&", any occurrences of the "<" character by the string "<", any occurrences of the ">" character by the string ">", any occurrences of the U+00A0 NO-BREAK SPACE character by the string " ", and, if the algorithm was invoked in the attribute mode, any occurrences of the """ character by the string """.
Emphasis by me. If I had to guess this is to support backwards compatibility in older browsers that did this and to get consistent behavior when deserializing and serializing strings. If the browser serialized the DOM tree result of <div> </div> to <div> </div> deserializing it to the DOM tree again would result in a single space*. This is pretty much the only way the browser can achieve consistent behavior.
The replacement to ↓ on the other hand is completely safe and makes sense.
If you're actually interested in the length of the string stored inside the text using .textContent you'd get the result you were interested in.
* well, not really since it would still be a U+00A0 - but I could get why people think it might be confusing in the early DOM days
Consider the following HTML snippet:
<div>
<p>foo & bar 𝌆 baz</p>
</div>
Let’s look up innerHTML in the HTML Living Standard to see what happens when we run div.innerHTML in the context of the above HTML document. Ah, it defers to the DOM Parsing spec, which says:
On getting, if the context object’s node document is an HTML document, then the attribute must return the result of running the HTML fragment serialization algorithm on the context object; […]
The HTML fragment serialization algorithm is defined in the HTML Living Standard. Following the algorithm with the div.innerHTML example in mind, it’s clear that the first time it will descend to the “if current node is an Element” branch under step 3.2. This adds <p> to the output.
Then it calls the algorithm again on the text node within. This time we end up in the “if current node is a Text node” branch. It says:
[…] Otherwise, append the value of current node’s data IDL attribute, escaped as described below.
The data IDL attribute contains the textual contents of the element. The escaping instructions are defined as follows:
Escaping a string (for the purposes of the algorithm above) consists of running the following steps:
Replace any occurrence of the & character by the string &.
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string .
If the algorithm was invoked in the attribute mode, replace any occurrences of the " character by the string ".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the < character by the string <, and any occurrences of the > character by the string >.
Only the abovementioned symbols are escaped as HTML entities in the result of .innerHTML – other Unicode symbols are just displayed in their raw form, regardless of how they are represented in the HTML source code.
Because of this, "↓" in the HTML source code turns into "↓" when reading it back out through innerHTML. But e.g. "&" or "&" turn into "&", and " " or   become " ".

What characters must be escaped in HTML 5?

HTML 4 states pretty which characters should be escaped:
Four character entity references deserve special mention since they
are frequently used to escape special characters:
"<" represents the < sign.
">" represents the > sign.
"&" represents the & sign.
"" represents the " mark.
Authors wishing
to put the "<" character in text should use "<" (ASCII decimal 60)
to avoid possible confusion with the beginning of a tag (start tag
open delimiter). Similarly, authors should use ">" (ASCII decimal
62) in text instead of ">" to avoid problems with older user agents
that incorrectly perceive this as the end of a tag (tag close
delimiter) when it appears in quoted attribute values.
Authors should use "&" (ASCII decimal 38) instead of "&" to avoid
confusion with the beginning of a character reference (entity
reference open delimiter). Authors should also use "&" in
attribute values since character references are allowed within CDATA
attribute values.
Some authors use the character entity reference """ to encode
instances of the double quote mark (") since that character may be
used to delimit attribute values.
I'm surprised I can't find anything like this in HTML 5. With the help of grep the only non-XML mention I could find comes as an aside regarding the deprecated XMP element:
Use pre and code instead, and escape "<" and "&" characters as "<" and "&" respectively.
Could somewhat point to the official source on this matter?
The specification defines the syntax for normal elements as:
Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.
So you have to escape <, or & when followed by anything that could begin a character reference. The rule on ampersands is the only such rule for quoted attributes, as the matching quotation mark is the only thing that will terminate one. (Obviously, if you don’t want to terminate the attribute value there, escape the quotation mark.)
These rules don’t apply to <script> and <style>; you should avoid putting dynamic content in those. (If you have to include JSON in a <script>, replace < with \x3c, the U+2028 character with \u2028, and U+2029 with \u2029 after JSON serialization.)
From http://www.w3.org/html/wg/drafts/html/master/single-page.html#serializing-html-fragments
Escaping a string (for the purposes of the algorithm* above) consists
of running the following steps:
Replace any occurrence of the "&" character by the string "&".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string " ".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string """.
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "<", and any
occurrences of the ">" character by the string ">".
*Algorithm is the built-in serialization algorithm as called e.g. by the innerHTML getter.
Strictly speaking, this is not exactly an aswer to your question, since it deals with serialization rather than parsing. But on the other hand, the serialized output is designed to be safely parsable. So, by implication, when writing markup:
The & character should be replaced by &
Non-breaking spaces should be escaped as (surprise!...)
Within attributes, " should be escaped as "
Outside of attributes, < should be escaped as < and > should be escaped as >
I'm intentionaly writing "should", not "must", since parsers may be able to correct violations of the above.
Adding my voice to insist that things are not that easy -- strictly speaking:
HTML5 is a language specifications
it could be serialized either as HTML or as XML
Case 1 : HTML serialization
(the most common)
If you serialize your HTML5 as HTML, "the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand."
An ambiguous ampersand is an "ampersand followed by one or more alphanumeric ASCII characters, followed by a U+003B SEMICOLON character (;)"
Furthermore, "the parsing of certain named character references in attributes happens even with the closing semicolon being omitted."
So, in that case editable && copy (notice the spaces around &&) is valid HTML5 serialized as HTML construction as none of the ampersands is followed by a letter.
As a counter example: editable&&copy is not safe (even if this might work) as the last sequence &copy might be interpreted as the entity reference for ©
Case 1 : XML serialization
(the less common)
Here the classic XML rules apply. For example, each and every ampersand either in the text or in attributes should be escaped as &.
In that case && (with or without spaces) is invalid XML. You should write &&
Tricky, isn't it ?

escaping inside html tag attribute value

I am having trouble understanding how escaping works inside html tag attribute values that are javascript.
I was lead to believe that you should always escape & ' " < > . So for javascript as an attribute value I tried:
It doesn't work. However:
and
does work in all browsers!
Now I am totally confused. If all my attribute values are enclosed in double quotes, does this mean I do not have to escape single quotes? Or is apos and ascii 39 technically different characters? Such that javascript requires ascii 39, but not apos?
There are two types of “escapes” involved here, HTML and JavaScript. When interpreting an HTML document, the HTML escapes are parsed first.
As far as HTML is considered, the rules within an attribute value are the same as elsewhere plus one additional rule:
The less-than character < should be escaped. Usually < is used for this. Technically, depending on HTML version, escaping is not always required, but it has always been good practice.
The ampersand & should be escaped. Usually & is used for this. This, too, is not always obligatory, but it is simpler to do it always than to learn and remember when it is required.
The character that is used as delimiters around the attribute value must be escaped inside it. If you use the Ascii quotation mark " as delimiter, it is customary to escape its occurrences using " whereas for the Ascii apostrophe, the entity reference &apos; is defined in some HTML versions only, so it it safest to use the numeric reference ' (or ').
You can escape > (or any other data character) if you like, but it is never needed.
On the JavaScript side, there are some escape mechanisms (with \) in string literals. But these are a different issue, and not relevant in your case.
In your example, on a browser that conforms to current specifications, the JavaScript interpreter sees exactly the same code alert('Hello');. The browser has “unescaped” &apos; or ' to '. I was somewhat surprised to hear that &apos; is not universally supported these days, but it’s not an issue: there is seldom any need to escape the Ascii apostrophe in HTML (escaping is only needed within attribute values and only if you use the Ascii apostrophe as its delimiter), and when there is, you can use the ' reference.
&apos; is not a valid HTML reference entity. You should escape using '

Is > ever necessary?

I now develop websites and XML interfaces since 7 years, and never, ever came in a situation, where it was really necessary to use the > for a >. All disambiguition could so far be handled by quoting <, &, " and ' alone.
Has anyone ever been in a situation (related to, e.g., SGML processing, browser issues, XSLT, ...) where you found it indespensable to escape the greater-than sign with >?
Update: I just checked with the XML spec, where it says, for example, about character data in section 2.4:
Character Data
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
So even there, the > isn't mentioned as something special, except from the ending sequence of a CDATA section.
This one single case, where the > is of any significance, would be the ending of a CDATA section, ]]>, but then again, if you'd quote it, the quote (i.e., the literal string ]]>) would land literally in the output (since it's CDATA).
You don't need to absolutely because almost any XML interpreter will understand what you mean. But still you use a special character without any protection if you do so.
XML is all about semantic, and this is not really semantic compliant.
About your update, you forgot this part :
The right angle bracket (>) may be represented using the string " > ", and must, for compatibility, be escaped using either " > " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.
The use case given in the documentation is more about something like this :
<xmlmarkup>
]]>
</xmlmarkup>
Here the ]]> part could be a problem with old SGML parsers, so it must be escaped into = ]]> for compatibilities reasons.
I used one not 19 hours ago to pass a strict xml validator. Another case is when you use them actually in html/xml content text (rather than attributes), like this: <.
Sure, a lax parser will accept most anything you throw at it, but if you're ever worried about XSS, < is your friend.
Update: Here's an example where you need to escape > in Firefox:
<?xml version="1.0" encoding="utf-8" ?>
<test>
]]>
</test>
Granted, it still isn't an example of having to escape a lone >.
Not so much as an author of (x)html documents, but more as a user of sloppy written comments fields in websites, that "offer" you to insert html.
I mean if you do your site the right way, you wouldn't hardcode your content anyway, right? So your call to htmlentities or whatever (long time no see, php) would take care of replacing special characters for you.
So sure, you wouldn't manually type > but I hope you take measures so > is automatically replaced.
I just thought of another example, where you need to quote > in HTML5 (not XHTML5) documents: If you need it in attributes without quotes (which is something, that can be argued of course).
<img src=arrow.png alt=>>
should be equivalent to XHTML
<img src="arrow.png" alt=">" />
But then again, (?<!X)HTML is not SGML.
Imagine that you have the following text this is a not a ]]> nice day and you decide to surround it by CDATA sections <![CDATA[this is a not a ]]> nice day]]>.
In order to avoid that (and for allowing parsing of SGML fragments with unterminated marked sections), clause 10.4 of ISO 8879:1986 declares that the occurrence of ]]> outside a marked
section is an error.
Also, in the times of SGML marked sections were very popular, as they were not only used for CDATA (as in XML), but also for RCDATA (only entities and character references allowed) and IGNORE and INCLUDE (which allowed for recognition of markup inside them).
For instance, in SGML one could write:
<!ENTITY %WHATTODO "INCLUDE">
<![%WHATTODO;[<b>]]></b>]]>
Which is equivalent to:
<b>]]></b>

Why shouldn't `&apos;` be used to escape single quotes?

As stated in, When did single quotes in HTML become so popular? and Jquery embedded quote in attribute, the Wikipedia entry on HTML says the following:
The single-quote character ('), when used to quote an attribute value, must also be escaped as ' or ' (should NOT be escaped as &apos; except in XHTML documents) when it appears within the attribute value itself.
Why shouldn't &apos; be used? Also, is " safe to be used instead of "?
" is on the official list of valid HTML 4 entities, but &apos; is not.
From C.16. The Named Character Reference &apos;:
The named character reference &apos;
(the apostrophe, U+0027) was
introduced in XML 1.0 but does not
appear in HTML. Authors should
therefore use ' instead of
&apos; to work as expected in HTML 4
user agents.
" is valid in both HTML5 and HTML4.
&apos; is valid in HTML5, but not HTML4. However, most browsers support &apos; for HTML4 anyway.
&apos; is not part of the HTML 4 standard.
" is, though, so is fine to use.
If you need to write semantically correct mark-up, even in HTML5, you must not use &apos; to escape single quotes. Although, I can imagine you actually meant apostrophe rather then single quote.
single quotes and apostrophes are not the same, semantically, although they might look the same.
Here's one apostrophe.
Use ' to insert it if you need HTML4 support. (edited)
In British English, single quotes are used like this:
"He told me to 'give it a try'", I said.
Quotes come in pairs. You can use:
<p><q>He told me to <q>give it a try</q></q>, I said.<p>
to have nested quotes in a semantically correct way, deferring the substitution of the actual characters to the rendering engine. This substitution can then be affected by CSS rules, like:
q {
quotes: '"' '"' '<' '>';
}
An old but seemingly still relevant article about semantically correct mark-up: The Trouble With EM ’n EN (and Other Shady Characters).
(edited) This used to be:
Use ’ to insert it if you need HTML4 support.
But, as #James_pic pointed out, that is not the straight single quote, but the "Single curved quote, right".
If you really need single quotes, apostrophes, you can use
html | numeric | hex
‘ | ‘ | ‘ // for the left/beginning single-quote and
’ | ’ | ’ // for the right/ending single-quote