In the HTML specs one can find the following line:
In the HTML syntax, namespace prefixes and namespace declarations do not have the same effect as in XML. For instance, the colon has no special meaning in HTML element names.
Looking into the grammar definition, I found the following sections.
On tag names it states:
Tags contain a tag name, giving the element's name. HTML elements all have names that only use alphanumeric ASCII characters. In the HTML syntax, tag names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag name; tag names are case-insensitive.
This leaves almost no room for interpretation. There is no underscore or dollar sign here. There is also no ':', which makes it impossible to legally express namespaces. Taken alone, it would also permit a name made only of digits, like <1>, but then the grammar states:
Uppercase ASCII letter
Create a new start tag token, set its tag name to the lowercase version of the current input character (add 0x0020 to the character's code point), then switch to the tag name state. (Don't emit the token yet; further details will be filled in before it is emitted.)
Lowercase ASCII letter
Create a new start tag token, set its tag name to the current input character, then switch to the tag name state. (Don't emit the token yet; further details will be filled in before it is emitted.)
So we are left with only names like <a1234>.
On attribute names it states:
Attributes have a name and a value. Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), ">" (U+003E), "/" (U+002F), and "=" (U+003D) characters, the control characters, and any characters that are not defined by Unicode. In the HTML syntax, attribute names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that are an ASCII case-insensitive match for the attribute's name.
Reading this, it seems the following is possible:
<div ::::::="hello" $_$="dollar"></div>
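To make the quoted attribute-name rule concrete, here is a minimal Python sketch of it. The function name is made up for illustration, and I am assuming that "space characters" means tab, LF, FF, CR and space, that "control characters" maps to Unicode category Cc, and that "not defined by Unicode" maps to category Cn. It checks only the authoring rule quoted above, not what a browser's parser will actually accept.

```python
import unicodedata

# Characters the quoted rule excludes by name: the space characters
# (assumed here to be tab, LF, FF, CR, space), NULL, ", ', >, / and =.
FORBIDDEN = set('\t\n\x0c\r \x00"\'>/=')


def is_conforming_attribute_name(name: str) -> bool:
    """Return True if `name` satisfies the quoted authoring rule (sketch)."""
    if not name:
        return False  # the rule demands "one or more characters"
    for ch in name:
        if ch in FORBIDDEN:
            return False
        # Assumption: Cc = control characters, Cn = code points that
        # Unicode leaves unassigned (including noncharacters).
        if unicodedata.category(ch) in ("Cc", "Cn"):
            return False
    return True
```

Under these assumptions both ::::::  and $_$ pass the check, while a name containing = or a space does not.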
From all this, using namespaces in tag names is forbidden, and for attributes a prefix is merely a convention that you may follow but do not need to.
To put it simply: namespaces do not exist in HTML 5, and at least for tag names they cannot be emulated, since we have no underscore, no dot, or anything alike.
Is this correct? On the other hand, the HTML 5 specs state that we are free to add xmlns attributes to elements, which would make it possible to introduce new namespaces. How does this fit?
[Update]
I rechecked the specification using the single-page version, and it actually states that the namespace declaration is allowed as an XHTML leftover but has to be ignored, so no namespaces for us. Sad thing.
[/Update]
So the only question left is: if there is no ':' or anything else, what can I legally do with element tag names? Can I use a special one I have made up? Remember that we have a relaxed specification for the parser here. The parser should be built so that it can handle unknown element tags. The question is, how does it handle them?
The HTML 5 specification allows xmlns namespace attributes only for compatibility with XHTML documents. Those namespaces are ignored and have no effect.
The tag name section of the specs is a bit confusing since it only talks about HTML elements. The parser section for tag names reads:
8.2.4.10 Tag name state
Consume the next input character:
"tab" (U+0009)
"LF" (U+000A)
"FF" (U+000C)
U+0020 SPACE
-> Switch to the before attribute name state.
"/" (U+002F)
-> Switch to the self-closing start tag state.
">" (U+003E)
-> Switch to the data state. Emit the current tag token.
Uppercase ASCII letter
-> Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current tag token's tag name.
U+0000 NULL
-> Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current tag token's tag name.
EOF
-> Parse error. Switch to the data state. Reconsume the EOF character.
Anything else
-> Append the current input character to the current tag token's tag name.
The last line is the important part. Also, the naming rule in the specification only covers HTML elements, that is, the elements the specification itself defines. Therefore we are free to use a made-up name containing other characters, and it is considered a valid element but not a valid HTML element. The question is how a browser or editor reacts to this character soup. But again, it is a valid element name, just not a valid HTML element name.
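The two parser excerpts quoted above (a letter is needed to start a tag token, then the tag name state) can be sketched in a few lines of Python. This is only an illustration of those quoted rules, not a real HTML tokenizer: the function name is hypothetical, and it ignores end tags, comments, DOCTYPE and attributes.

```python
import string

WHITESPACE = "\t\n\x0c "  # tab, LF, FF, space, as listed in the tag name state


def read_start_tag_name(markup: str):
    """Return the tag name of a start tag at the start of `markup`,
    or None if no start tag token would be emitted (sketch only)."""
    if len(markup) < 2 or markup[0] != "<":
        return None
    # Only an ASCII letter after "<" creates a start tag token.
    if markup[1] not in string.ascii_letters:
        return None
    name = []
    for ch in markup[1:]:
        if ch in WHITESPACE or ch in "/>":
            return "".join(name)  # these characters end the tag name
        if ch in string.ascii_uppercase:
            name.append(chr(ord(ch) + 0x20))  # lowercase by adding 0x0020
        elif ch == "\x00":
            name.append("\ufffd")  # parse error: NULL becomes U+FFFD
        else:
            name.append(ch)  # "anything else": appended unchanged
    return None  # EOF inside the tag: parse error, no tag token
```

With this sketch, read_start_tag_name("<a:b_c$ x=1>") gives "a:b_c$", while "<1>" and "<_person>" give None because the first character is not a letter.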
So recently I found a question that was
Which of the following is a valid tag?
<213person>
<_person> (This is given as the right answer)
Both
None
(Note: this is the explanation that was given: valid HTML tags are surrounded by angle brackets, and the tag name can only start with a letter or an underscore (_).)
As far as I know, none of the standard tags start with an underscore, and according to what I've read about custom HTML tags, the name has to start with a letter (I tested it, and a custom tag starting with any character that is not a letter does not work). So in my opinion, and according to my tests, HTML tags can only start with a letter or with ! (in the case of !-- -- and !DOCTYPE HTML).
What I want to know is whether the given explanation is correct, and if it is, can someone provide proper documentation and working examples for it?
As mentioned by @Rob, the standard defines a valid tag name as a string containing only alphanumeric ASCII characters, being:
0-9|a-z|A-Z
However, browsers handle things differently.
There are a few main points that I've noticed which don't align with the current standard.
Tag names must start with a letter
If a tag name starts with any character outside a-z|A-Z, the start tag ends up being interpreted as text and the end tag gets converted into a comment.
Special characters can be used
The following HTML is valid in a lot of browsers and will create an element:
<Z[\]^_`a></Z[\]^_`a>
This seems to be browsers only checking if the characters are ASCII. The only exception is the first character (as stated above).
Initially, I thought this was a simplified check, so that instead of [A-Z]|[a-z] they checked [A-z], but you can use any character outside this range.
This makes the following HTML also "valid" in the eyes of certain browsers:
<a!></a!>
<aʬ></aʬ>
<a͢͢͢></a͢͢͢>
<a͢͢͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ΘΘΘΘ></a͢͢͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ΘΘΘΘ>
<a></a>
I tested the HTML elements in both Chrome and Firefox; I didn't test any other browsers. I also didn't test every ASCII character, just some very high and very low ones in terms of their character code.
From the HTML standard:
Start tags must have the following format:
The first character of a start tag must be a U+003C LESS-THAN SIGN
character (<). The next few characters of a start tag must be the
element's tag name.
So what is allowed in the element's tag name? This is defined just above:
Tags contain a tag name, giving the element's name. HTML elements all
have names that only use ASCII alphanumerics. In the HTML syntax, tag
names, even those for foreign elements, may be written with any mix of
lower- and uppercase letters that, when converted to all-lowercase,
matches the element's tag name; tag names are case-insensitive.
I was wondering if you can use a random letter as an html tag. Like, f isn't a tag, but I tried it in some code and it worked just like a span tag. Sorry if this is a bad question, I've just been curious about it for a while, and I couldn't find anything online.
I was wondering if you can use a random letter as an html tag.
Yes and no.
"Yes" - in that it works, but it isn't correct: when you have something like <z> it only works because the web (HTML+CSS+JS) has a degree of forwards compatibility built-in: browsers will render HTML elements that they don't recognize basically the same as a <span> (i.e. an inline element that doesn't do anything other than reify a range of the document's text).
However, to use HTML5 Custom Elements correctly you need to conform to the Custom Elements specification which states:
The name of a custom element must contain a dash (-). So <x-tags>, <my-element>, and <my-awesome-app> are all valid names, while <tabs> and <foo_bar> are not. This requirement is so the HTML parser can distinguish custom elements from regular elements. It also ensures forward compatibility when new tags are added to HTML.
So if you use <my-z> then you'll be fine.
The HTML Living Standard document, as of 2021-12-04, indeed makes an explicit reference to forward-compatibility in its list of requirements for custom element names:
https://html.spec.whatwg.org/#valid-custom-element-name
They start with an ASCII lower alpha, ensuring that the HTML parser will treat them as tags instead of as text.
They do not contain any ASCII upper alphas, ensuring that the user agent can always treat HTML elements ASCII-case-insensitively.
They contain a hyphen, used for namespacing and to ensure forward compatibility (since no elements will be added to HTML, SVG, or MathML with hyphen-containing local names in the future).
They can always be created with createElement() and createElementNS(), which have restrictions that go beyond the parser's.
Apart from these restrictions, a large variety of names is allowed, to give maximum flexibility for use cases like <math-α> or <emotion-😍>.
So, by example:
<a>, <q>, <b>, <i>, <u>, <p>, <s>
No: these single-letter elements are already used by HTML.
<z>
No: element names that don't contain a hyphen - cannot be custom elements and will be interpreted by present-day browsers as invalid/unrecognized markup that they will nevertheless (largely) treat the same as a <span> element.
<a:z>
No: using a colon to use an XML element namespace is not a thing in HTML5 unless you're using XHTML5.
<-z>
No - the element name must start with a lowercase ASCII character from a to z, so - is not allowed.
<a-z>
Yes - this is fine.
<a-> and <a-->
Unsure - these two names are curious:
The HTML spec says the name must match the grammar rule [a-z] (PCENChar)* '-' (PCENChar)*.
The * denotes "zero-or-more" which is odd, because that implies the hyphen doesn't need to be followed by another character.
PCENChar represents a huge range of visible characters permitted in element names, curiously this includes -, so by that rule <a--> should be valid.
But note that -- is a reserved character sequence in the greater SGML-family (including HTML and XML) which may cause weirdness. YMMV!
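To check the examples above mechanically, here is a rough Python sketch of the grammar rule [a-z] (PCENChar)* '-' (PCENChar)*. The PCENChar ranges are written down as I recall them from the 2021 text of the HTML Standard, so treat them as an assumption and verify against the spec version you target. The sketch also ignores the spec's short list of reserved hyphenated names, and the function name is made up.

```python
import re

# PCENChar ranges (assumption: as listed in the 2021 HTML Standard).
PCEN_CHAR = (
    r"\-.0-9_a-z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u203F-\u2040\u2070-\u218F"
    r"\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
    r"\U00010000-\U000EFFFF"
)

# [a-z] (PCENChar)* '-' (PCENChar)*
CUSTOM_NAME_RE = re.compile(
    "[a-z][" + PCEN_CHAR + "]*-[" + PCEN_CHAR + "]*"
)


def matches_custom_element_grammar(name: str) -> bool:
    """True if `name` matches the grammar rule above (reserved names ignored)."""
    return CUSTOM_NAME_RE.fullmatch(name) is not None
```

Under these assumptions the grammar accepts a-z, a- and a--, and rejects z, -z and a:z, which agrees with the reading above that the trailing-hyphen names match the rule on paper.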
I was browsing through the source code of a moderately popular repo, and not sure what are the following tags.
see https://github.com/pusher/react-slack-clone/blob/master/src/index.js#L243
<row->
<col->
....
</col->
</row->
Why is there a - after the tag names, and how is this an acceptable tag?
They are custom elements. In regards to the tag's validity, you may have noticed that it is not defined anywhere in the code. As per step 5 of the spec, it is valid, and it uses the HTMLElement interface.
For a higher-level overview of custom elements, take a look at the MDN tutorial on using custom elements.
An additional note: These tags could be replaced by regular <div> tags with classes, and the functionality would be no different.
This is most likely an error in the source code which has gone unnoticed (possibly introduced by search & replace?). React accepts element names which end in a - character, and they get rendered to the DOM via document.createElement() like any other element (for a simple example see here: https://jsfiddle.net/nso3gjpw/ ). Since browsers are very forgiving with weird HTML, the element is simply rendered as an unknown custom element which behaves roughly like a span element. The row- and col- elements are also styled (https://github.com/pusher/react-slack-clone/blob/master/src/index.css#L73).
In the Blink rendering engine source code the following definition for tag names is given (https://www.w3.org/TR/REC-xml/#NT-CombiningChar):
// DOM Level 2 says (letters added):
//
// a) Name start characters must have one of the categories Ll, Lu, Lo, Lt, Nl.
// b) Name characters other than Name-start characters must have one of the categories Mc, Me, Mn, Lm, or Nd.
// c) Characters in the compatibility area (i.e. with character code greater than #xF900 and less than #xFFFE) are not allowed in XML names.
// d) Characters which have a font or compatibility decomposition (i.e. those with a "compatibility formatting tag" in field 5 of the database -- marked by field 5 beginning with a "<") are not allowed.
// e) The following characters are treated as name-start characters rather than name characters, because the property file classifies them as Alphabetic: [#x02BB-#x02C1], #x0559, #x06E5, #x06E6.
// f) Characters #x20DD-#x20E0 are excluded (in accordance with Unicode, section 5.14).
// g) Character #x00B7 is classified as an extender, because the property list so identifies it.
// h) Character #x0387 is added as a name character, because #x00B7 is its canonical equivalent.
// i) Characters ':' and '_' are allowed as name-start characters.
// j) Characters '-' and '.' are allowed as name characters.
//
// It also contains complete tables. If we decide it's better, we could include those instead of the following code.
Especially important here is rule j) Characters '-' and '.' are allowed as name characters.
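As a rough illustration of rules a), b), i) and j) from that comment, the following Python sketch classifies characters by their Unicode general category. It is not Blink's actual code: it deliberately ignores rules c) through h), and the function names are hypothetical.

```python
import unicodedata

NAME_START_CATEGORIES = {"Ll", "Lu", "Lo", "Lt", "Nl"}   # rule a)
EXTRA_NAME_CATEGORIES = {"Mc", "Me", "Mn", "Lm", "Nd"}   # rule b)


def is_name_start(ch: str) -> bool:
    # Rule i): ':' and '_' are allowed as name-start characters.
    return ch in ":_" or unicodedata.category(ch) in NAME_START_CATEGORIES


def is_name_char(ch: str) -> bool:
    # Rule j): '-' and '.' are allowed as name characters.
    return (
        is_name_start(ch)
        or ch in "-."
        or unicodedata.category(ch) in EXTRA_NAME_CATEGORIES
    )


def is_valid_name(name: str) -> bool:
    """Simplified check covering only rules a), b), i) and j)."""
    return (
        bool(name)
        and is_name_start(name[0])
        and all(is_name_char(ch) for ch in name[1:])
    )
```

By this simplified check row- and col- are accepted, because - is a name character, while a name that starts with - or a digit is not.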
I am developing a JS plugin for serving retina images. The attributes that identify these images are supposed to be the following:
data-retina#2x,
data-retina#1.5x,
data-retina#2.5x.
Could you tell me if these attributes are valid? What characters are allowed (not allowed) in the names of custom data-* attributes in HTML and XHTML?
See the definition of the data-* attribute in the W3C HTML5 Recommendation:
In HTML5, the name must be XML-compatible (and it gets ASCII-lowercased automatically).
In XHTML5, the name must be XML-compatible and must not contain uppercase ASCII letters.
The definition of XML-compatible says that it
must not contain : characters
must match the Name production in the XML 1.0 specification
This Name production lists which characters are allowed.
tl;dr: For the part after data-, you may use the following characters:
0-9
a-z
A-Z (not in XHTML5)
- _ . ·
and characters from these Unicode ranges:
[#x0300-#x036F] (Combining Diacritical Marks)
[#x203F-#x2040] (‿ ⁀)
[#xC0-#xD6]
[#xD8-#xF6]
[#xF8-#x2FF]
[#x370-#x37D]
[#x37F-#x1FFF]
[#x200C-#x200D] (ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER)
[#x2070-#x218F]
[#x2C00-#x2FEF]
[#x3001-#xD7FF]
[#xF900-#xFDCF]
[#xFDF0-#xFFFD]
[#x10000-#xEFFFF]
So the # (U+0023) is not allowed.
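If you want to check candidate names in code, the list above translates fairly directly into a regular expression. This is a minimal Python sketch built only from the ranges listed in this answer; the function name and the xhtml flag are made up for illustration, and treating an empty suffix as invalid is my assumption.

```python
import re

# Character class assembled from the ranges listed above.
ALLOWED_CLASS = (
    r"0-9a-zA-Z\-_.\u00B7"
    r"\u0300-\u036F\u203F-\u2040"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    r"\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
SUFFIX_RE = re.compile("[" + ALLOWED_CLASS + "]+")


def is_valid_data_attribute(name: str, xhtml: bool = False) -> bool:
    """Check a data-* attribute name against the character list above."""
    if not name.startswith("data-"):
        return False
    suffix = name[len("data-"):]
    # XHTML5 additionally forbids uppercase ASCII letters.
    if xhtml and any("A" <= ch <= "Z" for ch in suffix):
        return False
    # Assumption: at least one character must follow "data-".
    return SUFFIX_RE.fullmatch(suffix) is not None
```

For the names in the question, data-retina#2x fails because of the #, whereas something like data-retina-2x or data-retina-1.5x passes.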
Please refer to the Before attribute name state section of the HTML5 Spec:
U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+002F SOLIDUS (/)
Switch to the self-closing start tag state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current tag token.
Uppercase ASCII letter
Start a new attribute in the current tag token. Set that attribute's name to the lowercase version of the current input character (add 0x0020 to the character's code point), and its value to the empty string. Switch to the attribute name state.
U+0000 NULL
Parse error. Start a new attribute in the current tag token. Set that attribute's name to a U+FFFD REPLACEMENT CHARACTER character, and its value to the empty string. Switch to the attribute name state.
U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
U+003C LESS-THAN SIGN (<)
U+003D EQUALS SIGN (=)
Parse error. Treat it as per the "anything else" entry below.
EOF
Parse error. Switch to the data state. Reconsume the EOF character.
Anything else
Start a new attribute in the current tag token. Set that attribute's name to the current input character, and its value to the empty string. Switch to the attribute name state.
In simple words:
It says that every character except tab, line feed, form feed, space, solidus and greater-than sign starts an attribute name. Quotation mark, apostrophe, less-than sign and equals sign are flagged as parse errors but still become part of the name, and NULL is replaced by U+FFFD. Personally, I wouldn't attempt pushing the edge cases of this though.
Inspired by: What characters are allowed in an HTML attribute name?
I wasn't able to find any restrictions on which characters are allowed in Text. Does this imply that everything is allowed, or are there restrictions that affect HTML documents in general?
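The state table quoted above can be read as a small decision function. Here is a hedged Python sketch of just that one state; the function name and the wording of the returned action strings are my own, and None stands in for EOF.

```python
import string


def before_attribute_name(ch):
    """Return (action, is_parse_error) for one input character.

    `ch` is a single character, or None for EOF. Sketch of the quoted
    "before attribute name" state only.
    """
    if ch is None:
        return ("switch to data state, reconsume EOF", True)
    if ch in "\t\n\x0c ":
        return ("ignore", False)
    if ch == "/":
        return ("switch to self-closing start tag state", False)
    if ch == ">":
        return ("switch to data state, emit tag", False)
    if ch == "\x00":
        return ("start attribute named \ufffd", True)
    if ch in string.ascii_uppercase:
        return ("start attribute named " + ch.lower(), False)
    # ", ', < and = are parse errors but are otherwise handled like
    # "anything else": they still start an attribute name.
    return ("start attribute named " + ch, ch in "\"'<=")
```

So a character like # quietly starts an attribute name, and = does too, only with a parse error attached.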
For example the Character Reference Section states that:
The numeric character reference forms [...] are allowed to reference any Unicode code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters), surrogates (U+D800–U+DFFF), and control characters other than space characters.
Are those characters still allowed in their "unescaped" form in Text? E.g. as attribute value: <span title="Hello ␀ World"></span> where ␀ is the U+0000 NULL character (not U+2400).
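For reference, the quoted rule for numeric character references can be written as a small predicate. This Python sketch assumes that "space characters" means tab, LF, FF, CR and space, that "control characters" corresponds to Unicode category Cc, and that noncharacters are U+FDD0 to U+FDEF plus every code point ending in FFFE or FFFF; the function names are made up.

```python
import unicodedata

# Assumption: HTML "space characters" are tab, LF, FF, CR and space.
SPACE_CHARACTERS = {0x09, 0x0A, 0x0C, 0x0D, 0x20}


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def may_be_referenced(cp: int) -> bool:
    """True if a numeric character reference may point at code point `cp`."""
    if not 0 <= cp <= 0x10FFFF:
        return False  # not a Unicode code point at all
    if cp in (0x0000, 0x000D):
        return False  # excluded explicitly by the quoted rule
    if 0xD800 <= cp <= 0xDFFF:
        return False  # surrogates
    if is_noncharacter(cp):
        return False
    if unicodedata.category(chr(cp)) == "Cc" and cp not in SPACE_CHARACTERS:
        return False  # control characters other than space characters
    return True
```

This only models what the reference forms may point at; whether the same code points may appear unescaped is exactly the open question here.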
The character restriction for text on your page and in your markup is defined according to your selected character set. If you don't define a character set, the browser will take a guess or assert its default option (usually, whatever is the least restrictive). The character set is defined by using the meta tag with the charset attribute in your document's head section. The most common example of this uses the UTF-8 character set:
<meta charset="UTF-8" />
The value of this attribute can be any of the character sets defined by the Internet Assigned Numbers Authority (IANA). The full list of defined character sets is available here.
Additionally, there may be specific restrictions on unescaped text used within certain elements (or types of elements). In this case, you would have to read the specification for that tag or type of tag, or simply escape the characters in question by replacing them with their ampersand-encoded HTML character references.
I don't think there is any restriction on Text in the context you have pointed to. Text here means all letters, digits and other alphanumeric characters.
The answer is in the link you provided:
Text is allowed inside elements, attribute values, and comments. Extra constraints are placed on what is and what is not allowed in text based on where the text is to be put, as described in the other sections
Now if we go to the syntax definition for CDATA sections:
CDATA sections must consist of the following components, in this
order:
The string "<![CDATA[".
Optionally, text, with the additional restriction that the text must not contain the string "]]>".
The string "]]>".
So every type of content has its own set of restrictions, and text is just used to define the superset of all characters, symbols and so on.
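As a tiny example of such a per-context restriction, the CDATA rule quoted above could be enforced like this in Python. The helper name is made up, and the sketch only covers the one constraint mentioned in the quote.

```python
def make_cdata_section(text: str) -> str:
    """Wrap `text` in a CDATA section, enforcing the quoted restriction."""
    if "]]>" in text:
        # The text would terminate the section early, so it is not allowed.
        raise ValueError('CDATA text must not contain the string "]]>"')
    return "<![CDATA[" + text + "]]>"
```

A comment or an attribute value would need a different check, since each context adds its own constraints on top of text.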