HTML 5 Custom element names? - html

In HTML 5 specification the parser and the specification state that the element name can be everything starting with a letter and followed by alpha-numeric characters.
Now the question is what happens if I introduce additional elements not part of the specification but valid in terms of compliance to the specified syntax.
What do all those browsers do when they encounter elements with custom yet unknown name? Does those elements got treaten like any element or are they left out, stripped out or replaced?
How do for instance do HTML5 editor behave?
Is there anything in the specifications I have overlooked regarding valid element tag names?
[Update]
The specification was missleading here since it states the alpha-numerical character of the HTML element names. While reading a HTML 5 Specification, I missconcluded that this is true for all element names.
That is appearently wrong. In the parser section it states that a element name must only start with an ASCII letter and after that letter everthing except:
"tab" (U+0009)
"LF" (U+000A)
"FF" (U+000C)
U+0020 SPACE
"/" (U+002F)
">" (U+003E)
U+0000 NULL
EOF
Beside those mentioned characters which require special treatment involving errors or ending the tag name all other possible characters seam to be allowed.
Anything else
--> Append the current input character to the current tag token's tag name.
From my field test also additional uni-code letters are allowed for the first letter by several parsers (at least they are graceful with those).
[/Update]

In HTML 5 specification the parser and the specification state that
the element name can be everything starting with a letter and followed
by alpha-numeric characters.
Incorrect. The specification states that the element name must be one of the names explicitly listed in that document, or in another applicable specification. These include but are not limited to SVG and MathML.
The specification also includes a processing specification for consumers of HTML, such as browsers. This doesn't describe what's "allowed", it describes what those consumers should do with each character of the document regardless of whether it contains things that are allowed or not allowed.
Now the question is what happens if I introduce additional elements
not part of the specification but valid in terms of compliance to the
specified syntax.
The above rules are followed. The "specified syntax" is irrelevant. The specification describes what the consumer should do for any input stream of characters.
What do all those browsers do when they encounter elements with custom
yet unknown name? Does those elements got treated like any element or
are they left out, stripped out or replaced?
They are treated as elements in the http://www.w3.org/1999/xhtml namespace which implement the HTMLUnknownElement interface.
How for instance do HTML5 editor behave?
If they are HTML5 compliant they will behave the same way when reading in the HTML.
Is there anything in the specifications I have overlooked regarding
valid element tag names?
See the first paragraph above. Also the Custom Elements spec which makes any element name starting with an ASCII letter and containing a hyphen to be considered valid. It is unclear whether that specification is currently an "HTML5 applicable specification" but if not, it will very probably be one soon.

Related

Limit of characters inside tag property [duplicate]

How long is too long for an attribute value in HTML?
I'm using HTML5 style data attributes (data-foo="bar") in a new application, and in one place it would be really handy to store a fair whack of data (upwards of 100 characters). While I suspect that this amount is fine, it raises the question of how much is too much?
HTML5 has no limits on the length of attribute values.
As the spec says, "This version of HTML thus returns to a non-SGML basis."
Later on, when describing how to parse HTML5, the following passage appears (emphasis added):
The algorithm described below places
no limit on the depth of the DOM tree
generated, or on the length of tag
names, attribute names, attribute
values, text nodes, etc. While
implementors are encouraged to avoid
arbitrary limits, it is recognized
that practical concerns will likely
force user agents to impose nesting
depth constraints.
Therefore, (theoretically) there is no limit to the length/size of HTML5 attributes.
See revision history for original answer covering HTML4.
I've just written a test (Note! see update below) which puts a string of length 10 million into an attribute and then retrieves it again, and it works fine (Firefox 3.5.2 & Internet Explorer 7)
50 million makes the browser hang with the "This script is taking a long time to complete" message.
Update: I've fixed the script: it previously set the innerHTML to a long string and now it's setting a data attribute. https://output.jsbin.com/wikulamuni It works for me with a length of 100 million. YMMV.
el.setAttribute('data-test', <<a really long string>>)
I really don't think there is any limit. I know now you can do
<a onclick=" //...insert 100KB of javascript code here">
and it works fine. Albeit a little unreadable.
From HTML5 syntax doc
9.1.2.3 Attributes
Attributes for an element are
expressed inside the element's start
tag.
Attributes have a name and a value.
Attribute names must consist of one or
more characters other than the space
characters, U+0000 NULL, U+0022
QUOTATION MARK ("), U+0027 APOSTROPHE
('), U+003E GREATER-THAN SIGN (>),
U+002F SOLIDUS (/), and U+003D EQUALS
SIGN (=) characters, the control
characters, and any characters that
are not defined by Unicode. In the
HTML syntax, attribute names may be
written with any mix of lower- and
uppercase letters that are an ASCII
case-insensitive match for the
attribute's name.
Attribute values are a mixture of text
and character references, except with
the additional restriction that the
text cannot contain an ambiguous
ampersand.
Attributes can be specified in four
different ways:
Empty attribute syntax
Unquoted attribute value syntax
Single-quoted attribute value syntax
Double-quoted attribute value syntax
Here there hasn't mentioned a limit on the size of the attribute value. So I think there should be none.
You can also validate your document against the
HTML5 Validator(Highly Experimental)
I've never heard of any limit on the length of attributes.
In the HTML 4.01 specifications, in the section on Attributes there is nothing that mention any limitation on this.
Same in the HTML 4.01 DTD -- in fact, as far as I know, DTD don't allow you to specify a length to attributes.
If there is nothing about that in HTML 4, I don't imagine anything like that would appear for HTML 5 -- and I actually don't see any length limitation in the 9.1.2.3 Attributes section for HTML 5 either.
Tested recently in Edge (Version 81.0.416.58 (64 bits)), and data-attributes seem to have a limit of 64k.
From http://dev.w3.org/html5/spec/Overview.html#embedding-custom-non-visible-data:
Every HTML element may have any number of custom data attributes specified, with any value.
That which is used to parse/process these data-* attribute values will have
limitations.
Turns out the data-attributes and values are placed in a DOMStringMap object.
This has no inherent limits.
From http://dev.w3.org/html5/spec/Overview.html#domstringmap:
Note: The DOMStringMap interface definition here is only intended for JavaScript
environments. Other language bindings will need to define how DOMStringMap is to be
implemented for those languages
DOMStringMap is an interface with a getter, setter, greator and deleter.
The setter has two parameters of type DOMString, name and value.
The value is of type DOMString that is is mapped directly to a JavaScript String.
From https://bytes.com/topic/javascript/answers/92088-max-allowed-length-javascript-string:
The maximum length of a JavaScript String is implementation specific.
The SGML Defines attributes with a limit set of 65k characters, seen here:
http://www.highdots.com/forums/html/length-html-attribute-175546.html
Although for what you are doing, you should be fine.
As for the upper limits, I have seen jQuery use data attributes hold a few k of data personally as well.

Does HTML5 change the standard for HTML commenting?

Recently I found that there is, possibly, a new way of commenting in HTML5.
Instead of the typical <!-- --> multi-line commenting I've read about, I thought I noticed that my IDE made a regular <!div > commented out. So I tested it out, and to my surprise Chrome had commented out that tag. It only commented out the tag and not the contents of the div, so I had to comment out the closer <!/div> to avoid closing other divs.
I tested another and it appears that generally putting an exclamation marker in front of the opening of any tag, this symbol <, makes that tag commented out.
Is this actually new? Is it bad practice? It is actually very convenient, but is it practical yet (if not new)?
Extra details:
Although a syntax error or misinterpretations of this particular syntax is a good reason, how come Chrome actually renders them as full comments?
The code is written as:
<!div displayed> some text here that is still displayed <!/div>
And then it is rendered as:
<!--div displayed--> some text here that is still displayed <!--/div-->
There is no new standard for comments in HTML5. The only valid comment syntax is still <!-- -->. From section 8.1.6 of W3C HTML5:
Comments must start with the four character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS (<!--).
The <! syntax originates in SGML DTD markup, which is not part of HTML5. In HTML5, it is reserved for comments, CDATA sections, and the DOCTYPE declaration. Therefore whether this alternative is bad practice depends on whether you consider the use of (or worse, the dependence on) obsolete markup to be bad practice.
Validator.nu calls what you have a "Bogus comment." — which means that it's treated like a comment even though it's not a valid comment. This is presumably for backward compatibility with pre-HTML5, which was SGML-based, and had markup declarations that took the form <!FOO>, so I wouldn't call this new. The reason they're treated like comments is because SGML markup declarations were special declarations not meant to be rendered, but since they are meaningless in HTML5 (with the above exceptions), as far as the HTML5 DOM is concerned they are nothing more than comments.
The following steps within section 8.2.4 lead to this conclusion, which Chrome appears to be following to the letter:
8.2.4.1 Data state:
Consume the next input character:
"<" (U+003C)
Switch to the tag open state.
8.2.4.8 Tag open state:
Consume the next input character:
"!" (U+0021)
Switch to the markup declaration open state.
8.2.4.45 Markup declaration open state:
If the next two characters are both "-" (U+002D) characters, consume those two characters, create a comment token whose data is the empty string, and switch to the comment start state.
Otherwise, if the next seven characters are an ASCII case-insensitive match for the word "DOCTYPE", then consume those characters and switch to the DOCTYPE state.
Otherwise, if there is an adjusted current node and it is not an element in the HTML namespace and the next seven characters are a case-sensitive match for the string "[CDATA[" (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET character before and after), then consume those characters and switch to the CDATA section state.
Otherwise, this is a parse error. Switch to the bogus comment state. The next character that is consumed, if any, is the first character that will be in the comment.
Notice that it says to switch to the comment start state only if the sequence of characters encountered is <!--, otherwise it's a bogus comment. This reflects what is stated in section 8.1.6 above.
8.2.4.44 Bogus comment state:
Consume every character up to and including the first ">" (U+003E) character or the end of the file (EOF), whichever comes first. Emit a comment token whose data is the concatenation of all the characters starting from and including the character that caused the state machine to switch into the bogus comment state, up to and including the character immediately before the last consumed character (i.e. up to the character just before the U+003E or EOF character), but with any U+0000 NULL characters replaced by U+FFFD REPLACEMENT CHARACTER characters. (If the comment was started by the end of the file (EOF), the token is empty. Similarly, the token is empty if it was generated by the string "<!>".)
In plain English, this turns <!div displayed> into <!--div displayed--> and <!/div> into <!--/div-->, exactly as described in the question.
On a final note, you can probably expect other HTML5-compliant parsers to behave the same as Chrome.
I don't think this is a good habit to take since <! stands for markup declarations like <!DOCTYPE. Thus you think it's commented (well... browser will try to interpret it).
Even if it doesn't appear, this seems not to be the correct syntax for commenting HTML code.

Can an HTML element have the same attribute twice?

I'm considering writing code which produces an HTML tag that could have duplicate attributes, like this:
<div data-foo="bar" class="some-class" data-foo="baz">
Is this legal HTML? Does one of the data-foo-values take precendence over the other? Can I count on semi-modern browsers (IE >= 9) to parse it without choking?
Or am I about to do something really stupid here?
It is not valid to have the same attribute name twice in an element. The authoritative references for this are somewhat complicated, as old HTML versions were nominally based on SGML and the restriction is implied by a normative reference to the SGML standard. In HTML5 PR, section 8.1.2.3 Attributes explicitly says: “There must never be two or more attributes on the same start tag whose names are an ASCII case-insensitive match for each other.”
What happens in practice is that the latter attribute is ignored. Well, future browsers might do otherwise. In the DOM, attributes appear as properties of the element node as well as in the attributes object, so there would be no natural way to store two values.
It's not technically valid, but every browser will ignore duplicate attributes in HTML documents and use the first value (data-foo="bar" in your case).
Using the same attribute name twice in a tag is considered an internal parse error. It would cause your document to fail validation, if that's something you're worried about. However, it's important to understand that HTML 5 defines an expected result even for cases where you have a "parse error". The parser is allowed to stop when it encounters an error, but if it chooses not to stop it must produce a specific result described in the specification. In practice, no browsers choose to stop when encountering errors in HTML documents (XML/XHTML is a different matter), so all modern browsers will handle this case successfully and consistently.
The WHATWG HTML specification describes this case in section 12.2.4.33 "Attribute name state":
When the user agent leaves the attribute name state (and before emitting the tag token, if appropriate), the complete attribute's name must be compared to the other attributes on the same token; if there is already an attribute on the token with the exact same name, then this is a parse error and the new attribute must be dropped, along with the value that gets associated with it (if any).
See also its description of "parse error" from the opening of section 12.2 "Parsing HTML documents":
Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.
I wanted to add a comment to the excellent accepted answer, but my reputation is not high enough.
I wanted to add it is important to consider how your code gets compiled.
For example, Angular removes prior duplicate (non-angular) class attributes and only keeps the last one.
Note: Angular also modifies the value of the class attribute with ngClass and any [class.class-name] attributes.
This is also something you can use linter for.
See htmlhint (attr-no-duplication) or htmllint (attr-no-dup).

What are all the HTML escaping contexts?

When outputting HTML, there are several different places where text can be interpreted as control characters rather than as text literals. For example, in "regular" text (that is, outside any element markup):
<div>This is regular text</div>
As well as within the values of attributes:
<input value="this is value text">
And, I believe, within HTML comments:
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Each of these three kinds of text has different rules for how it must be escaped in order to be treated as non-markup. So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters? The above contexts clearly have different rules about what needs to be escaped.
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup? For example, in theory you only need to escape ' and " in attribute values, since within an attribute value only the closing-delimiter character (' or " depending on which delimiter the attribute value started with) would have control meaning. Similarly, within "regular" text only < and & have control meaning. (I realize that not all HTML parsers are identical. I'm mostly interested in what is the minimum set of characters that need escaping in order to appease a spec-conforming parser.)
Tangentially: The following text will throw errors as HTML 4.01 Strict:
foo
Specifically, it says that it doesn't know what the entity "&y" is supposed to be. If you put a space after the &, however, it validates just fine. But if you're generating this on the fly, you're probably not going to want to check whether each use of & will cause a validation error, and instead just escape all & inside attribute values.
<div>This is regular text</div>
Text content: & must be escaped. < must be escaped.
If producing a document in a non-UTF encoding, characters that do not fit inside the chosen encoding must be escaped.
In XHTML (and XML in general), the sequence ]]> must not occur in text content, so in that specific case one of the characters in that sequence must be escaped, traditionally the >. For consistency, the Canonical XML specification chooses to escape > every time in text content, which is not a bad strategy for an escaping function, though you can certainly skip it for hand-authoring.
<input value="this is value text">
Attribute values: & must be escaped. The attribute value delimiter " or ' must be escaped. If no attribute value delimiter is used (don't do that) no escape is possible.
Canonical XML always chooses " as the delimiter and therefore escapes it. The > character does not need to be escaped in attribute values and Canonical XML does not. The HTML4 spec suggested encoding > anyway for backwards compatibility, but this affects only a few truly ancient and dreadful browsers that no-one remembers now; you can ignore that.
In XHTML < must be escaped. Whilst you can get away with not escaping it in HTML4, it's not a good idea.
To include tabs, CR or LF in attribute values (without them being turned into plain spaces by the attribute value normalisation algorithm) you must encode them as character references.
For both text content and attribute values: in XHTML under XML 1.1, you must escape the Restricted Characters, which are the Delete character and C0 and C1 control codes, minus tab, CR, LF and NEL. In total, [\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]. The null character may not be included at all even escaped in XML 1.1. Outside XML 1.1 you can't use any of these characters at all, nor is there a good reason you'd ever want to.
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Yes, but since there is no escaping possible inside comments, there is nothing you can do about it. If you write <!-- < -->, it literally means a comment containing “ampersand-letter l-letter t-semicolon” and will be reflected as such in the DOM or other infoset. A comment containing -- simply cannot be serialised at all.
<![CDATA[ sections and <?pi​s in XML also cannot use escaping. The traditional solution to serialise a CDATA section including a ]]> sequence is to split that sequence over two CDATA sections so it doesn't occur together. You can't serialise it in a single CDATA section, and you can't serialise a PI with ?> in the data.
CDATA-elements like <script> and <style> in HTML (not XHTML) may not contain the </ (ETAGO) sequence as this would end the element early and then error if not followed by the end-tag-name. Since no escaping is possible within CDATA-elements, this sequence must be avoided and worked around (eg. by turning document.write('</p>') into document.write('<\/p>');. (You see a lot of more complicated silly strategies to get around this one, like calling unescape on a JS-%-encoded string; even often '</scr'+'ipt>' which is still quite invalid.)
There is one more context in HTML and XML where different rules apply, and that's in the DTD (including the internal subset in the DOCTYPE declaration, if you have one), where the % character has Special Powers and would need to be escaped to be used literally. But as an HTML document author it is highly unlikely you would ever need to go anywhere near that whole mess.
The following text will throw errors as HTML 4.01 Strict:
foo
Yes, and it's just as much an error in Transitional.
If you put a space after the &, however, it validates just fine.
Yes, under SGML rules anything but [A-Za-z] and # doesn't start parsing as a reference. Not a good idea to rely on this though. (Of course, it's not well-formed in XHTML.)
The above contexts clearly have different rules about what needs to be escaped.
I'm not sure that the different elements have different encoding rules like you say. All the examples you list require the HTML encoding.
E.g.
<h1>Fish & Chips</h1>
<img alt="Awesome picture of Meat Pie & Chips" />
Fish & Chips
The last example includes some URL Encoding for the ampersand too (&) and its at this point things get hairy (sending an ampersand as data, which is why it must be encoded).
So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters?
Anywhere within the HTML document, if the control characters are not being used as control characters, you should encode them (as a good rule of thumb). Most of the time, its HTML Encoding, & or > etc. Othertimes, when trying to pass these characters via a URL, use URL Encoding %20, %26 etc.
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup?
I'd say that the Wikipedia article has a few good comments on it and might be worth a read - also the W3 Schools article I guess is a good point. Most languages have built in functions to prepare text as safe HTML, so it may be worth checking your language of choice (if you are indeed even using any scripting languages and not hand coding the HTML).
Specifically, Wikipedia says: "Characters <, >, " and & are used to delimit tags, attribute values, and character references. Character entity references <, >, " and &, which are predefined in HTML, XML, and SGML, can be used instead for literal representations of the characters."
For URL Encoding, this article seems a good starting point.
Closing thoughts as I've already rambled a bit: This is all excluding the thoughts of XML / XHTML which brings a whole other ballgame to the court and its requirement that pretty much the world and its dog needs to be encoded. If you are using a scripting language and writing out a variable via that, I'm pretty sure it'll be easier to find the built in function, or download a library that'll do this for you. :) I hope this answer was scoped ok and didn't miss the point or question or come across in the wrong tone. :)
If you are looking for the best practices to escape characters in web browsers (including HTML, JavaScript and style sheets), the XSS prevention cheat sheet by Michael Coates is probably what you're looking for. It includes a description of the different interpretation contexts, tables indicating how to encode characters in each context and code samples (using ESAPI).
http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet
Beware that <script> followed by <!-- followed by <script> again, enters double-escaped state, in which you probably never want to be, so ideally you should escape < with "\u003C" within your script's strings (and regexps) to not trigger it accidentally.
You can read more about it here http://qbolec-memdump.blogspot.com/2013/11/script-tag-content-madness.html
If you are this concerned about the validity of the final HTML, you might consider constructing the HTML via a DOM, versus as text.
You don't say what environment you are targeting.

Is there a limit to the length of HTML attributes?

How long is too long for an attribute value in HTML?
I'm using HTML5 style data attributes (data-foo="bar") in a new application, and in one place it would be really handy to store a fair whack of data (upwards of 100 characters). While I suspect that this amount is fine, it raises the question of how much is too much?
HTML5 has no limits on the length of attribute values.
As the spec says, "This version of HTML thus returns to a non-SGML basis."
Later on, when describing how to parse HTML5, the following passage appears (emphasis added):
The algorithm described below places
no limit on the depth of the DOM tree
generated, or on the length of tag
names, attribute names, attribute
values, text nodes, etc. While
implementors are encouraged to avoid
arbitrary limits, it is recognized
that practical concerns will likely
force user agents to impose nesting
depth constraints.
Therefore, (theoretically) there is no limit to the length/size of HTML5 attributes.
See revision history for original answer covering HTML4.
I've just written a test (Note! see update below) which puts a string of length 10 million into an attribute and then retrieves it again, and it works fine (Firefox 3.5.2 & Internet Explorer 7)
50 million makes the browser hang with the "This script is taking a long time to complete" message.
Update: I've fixed the script: it previously set the innerHTML to a long string and now it's setting a data attribute. https://output.jsbin.com/wikulamuni It works for me with a length of 100 million. YMMV.
el.setAttribute('data-test', <<a really long string>>)
I really don't think there is any limit. I know now you can do
<a onclick=" //...insert 100KB of javascript code here">
and it works fine. Albeit a little unreadable.
From HTML5 syntax doc
9.1.2.3 Attributes
Attributes for an element are
expressed inside the element's start
tag.
Attributes have a name and a value.
Attribute names must consist of one or
more characters other than the space
characters, U+0000 NULL, U+0022
QUOTATION MARK ("), U+0027 APOSTROPHE
('), U+003E GREATER-THAN SIGN (>),
U+002F SOLIDUS (/), and U+003D EQUALS
SIGN (=) characters, the control
characters, and any characters that
are not defined by Unicode. In the
HTML syntax, attribute names may be
written with any mix of lower- and
uppercase letters that are an ASCII
case-insensitive match for the
attribute's name.
Attribute values are a mixture of text
and character references, except with
the additional restriction that the
text cannot contain an ambiguous
ampersand.
Attributes can be specified in four
different ways:
Empty attribute syntax
Unquoted attribute value syntax
Single-quoted attribute value syntax
Double-quoted attribute value syntax
Here there hasn't mentioned a limit on the size of the attribute value. So I think there should be none.
You can also validate your document against the
HTML5 Validator(Highly Experimental)
I've never heard of any limit on the length of attributes.
In the HTML 4.01 specifications, in the section on Attributes there is nothing that mention any limitation on this.
Same in the HTML 4.01 DTD -- in fact, as far as I know, DTD don't allow you to specify a length to attributes.
If there is nothing about that in HTML 4, I don't imagine anything like that would appear for HTML 5 -- and I actually don't see any length limitation in the 9.1.2.3 Attributes section for HTML 5 either.
Tested recently in Edge (Version 81.0.416.58 (64 bits)), and data-attributes seem to have a limit of 64k.
From http://dev.w3.org/html5/spec/Overview.html#embedding-custom-non-visible-data:
Every HTML element may have any number of custom data attributes specified, with any value.
That which is used to parse/process these data-* attribute values will have
limitations.
Turns out the data-attributes and values are placed in a DOMStringMap object.
This has no inherent limits.
From http://dev.w3.org/html5/spec/Overview.html#domstringmap:
Note: The DOMStringMap interface definition here is only intended for JavaScript
environments. Other language bindings will need to define how DOMStringMap is to be
implemented for those languages
DOMStringMap is an interface with a getter, setter, greator and deleter.
The setter has two parameters of type DOMString, name and value.
The value is of type DOMString that is is mapped directly to a JavaScript String.
From https://bytes.com/topic/javascript/answers/92088-max-allowed-length-javascript-string:
The maximum length of a JavaScript String is implementation specific.
The SGML Defines attributes with a limit set of 65k characters, seen here:
http://www.highdots.com/forums/html/length-html-attribute-175546.html
Although for what you are doing, you should be fine.
As for the upper limits, I have seen jQuery use data attributes hold a few k of data personally as well.