Does "Text" in the HTML5 syntax mean "any character"? - html

I wasn't able to find any restrictions on which characters are allowed in Text. Does this imply that everything is allowed, or are there restrictions that affect HTML documents in general?
For example the Character Reference Section states that:
The numeric character reference forms [...] are allowed to reference any Unicode code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters), surrogates (U+D800–U+DFFF), and control characters other than space characters.
Are those characters still allowed in their "unescaped" form in Text? E.g. as attribute value: <span title="Hello ␀ World"></span> where ␀ is the U+0000 NULL character (not U+2400).
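The restriction quoted above for numeric character references can be written down as a small check. This is only a sketch of that one sentence, not of the rules for unescaped text; the function name and the SPACE_CHARACTERS set are illustrative assumptions.

```python
# Code points the quoted rule calls "space characters" (assumed set:
# space, tab, line feed, form feed, carriage return).
SPACE_CHARACTERS = {0x20, 0x09, 0x0A, 0x0C, 0x0D}


def is_allowed_in_numeric_reference(cp: int) -> bool:
    """Return True if a numeric character reference may name code point cp."""
    if not 0 <= cp <= 0x10FFFF:
        return False  # not a Unicode code point at all
    if cp in (0x0000, 0x000D):
        return False  # explicitly excluded: NULL and CR
    if 0xD800 <= cp <= 0xDFFF:
        return False  # surrogates
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False  # noncharacters
    is_control = cp <= 0x1F or 0x7F <= cp <= 0x9F
    if is_control and cp not in SPACE_CHARACTERS:
        return False  # control characters other than space characters
    return True
```

With this sketch, U+0041 and U+000A pass, while U+0000, U+000D, U+D800 and U+FFFE are rejected.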

The character restriction for text on your page and in your markup is defined according to your selected character set. If you don't define a character set, the browser will take a guess or assert its default option (usually, whatever is the least restrictive). The character set is defined by using the meta tag with the charset attribute in your document's head section. The most common example of this uses the UTF-8 character set:
<meta charset="UTF-8" />
The value of this attribute can be any of the character sets defined by the Internet Assigned Numbers Authority (IANA). The full list of defined character sets is available in the IANA character sets registry.
Additionally, there may be specific restrictions on unescaped text used within certain elements (or types of elements). In this case, you would have to read the specifications for that tag or type of tag, or simply escape the characters in question by replacing them with their character references (the ampersand-encoded HTML entities).
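As a rough illustration of escaping characters that the chosen character set cannot represent, Python's standard "xmlcharrefreplace" error handler turns them into numeric character references. The helper name below is hypothetical, and the example assumes a document that must be written in plain ASCII.

```python
import html


def to_ascii_with_references(text: str) -> str:
    """Encode text as ASCII, replacing other characters with &#N; references."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Markup-significant characters are a separate concern; html.escape handles those.
escaped_markup = html.escape("a < b & c", quote=False)
```

Here `to_ascii_with_references("a Θ b")` gives `a &#920; b`, and `html.unescape` turns that back into the original string.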

I don't think there is any restriction on Text in the context you have pointed out. Text here means all the allowed letters, digits and alphanumeric characters.

The answer is in the link you provided:
Text is allowed inside elements, attribute values, and comments. Extra constraints are placed on what is and what is not allowed in text based on where the text is to be put, as described in the other sections
Now if we go to the syntax definition for CDATA sections:
CDATA sections must consist of the following components, in this order:
The string "<![CDATA[".
Optionally, text, with the additional restriction that the text must not contain the string "]]>".
The string "]]>".
So every type of content has its own set of restrictions, and text is just used to define the superset of all characters, symbols and so on...
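The CDATA rule quoted above is a good example of "text plus one extra constraint". A minimal sketch, with a hypothetical helper name, could look like this:

```python
CDATA_START = "<![CDATA["
CDATA_END = "]]>"


def make_cdata_section(text: str) -> str:
    """Wrap text in a CDATA section, enforcing the one extra restriction."""
    if CDATA_END in text:
        # The text itself must not contain the terminator string.
        raise ValueError("CDATA text must not contain ']]>'")
    return CDATA_START + text + CDATA_END
```

For instance, `make_cdata_section("a < b")` returns `<![CDATA[a < b]]>`, while any text containing `]]>` is rejected.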

Related

Valid and Invalid HTML tags

So recently I found a question that was
Which of the following is a valid tag?
<213person>
<_person> (This is given as the right answer)
Both
None
(Note: this is the explanation that was given:- Valid HTML tags are surrounded by the angle brackets and the tag name can only either start from an alphabet or an underscore(_))
As far as I know, none of the reserved tags start with an underscore, and according to what I've read about custom HTML tags, a tag name has to start with a letter (I tested it, and a custom tag starting with any character that is not a letter doesn't work). So in my opinion, and according to what I tested, HTML tags can only start with letters or ! (in the case of !-- -- and !DOCTYPE HTML).
What I want to know is if the given explanation is correct or not and if it's correct then can someone provide some proper documentation and working examples for it?
As mentioned by @Rob, the standard defines a valid tag name as a string containing alphanumeric ASCII characters, being:
0-9|a-z|A-Z
However, browsers handle things differently.
There are a few main points that I've noticed which don't align with the current standard.
Tag names must start with a letter
If a tag name starts with any character outside a-z|A-Z, the start tag ends up being interpreted as text and the end tag gets converted into a comment.
Special characters can be used
The following HTML is valid in a lot of browsers and will create an element:
<Z[\]^_`a></Z[\]^_`a>
This seems to be browsers only checking if the characters are ASCII. The only exception is the first character (as stated above).
Initially, I thought this was a simplified check, so that instead of [A-Z]|[a-z] they checked [A-z], but you can use any character outside this range as well.
This makes the following HTML also "valid" in the eyes of certain browsers:
<a!></a!>
<aʬ></aʬ>
<a͢͢͢></a͢͢͢>
<a͢͢͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ΘΘΘΘ></a͢͢͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ΘΘΘΘ>
<a></a>
I tested the HTML elements in both Chrome and Firefox; I didn't test any other browsers. I also didn't test every ASCII character, just some with very high and very low character codes.
From the HTML standard:
Start tags must have the following format:
The first character of a start tag must be a U+003C LESS-THAN SIGN character (<). The next few characters of a start tag must be the element's tag name.
So what is allowed in the element's tag name? This is defined just above:
Tags contain a tag name, giving the element's name. HTML elements all have names that only use ASCII alphanumerics. In the HTML syntax, tag names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag name; tag names are case-insensitive.

Namespace and HTML 5

In the HTML specs one can find the following line:
In the HTML syntax, namespace prefixes and namespace declarations do not have the same effect as in XML. For instance, the colon has no special meaning in HTML element names.
After looking into the Grammar definition there are the following sections:
On tag names it states:
Tags contain a tag name, giving the element's name. HTML elements all have names that only use alphanumeric ASCII characters. In the HTML syntax, tag names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag name; tag names are case-insensitive.
This leaves almost no room for interpretation. There is no underscore or dollar sign here. There is also no ':', making it impossible to legally express namespaces. It would also seem to allow a purely numeric name like <1>, but then the grammar states:
Uppercase ASCII letter
Create a new start tag token, set its tag name to the lowercase version of the current input character (add 0x0020 to the character's code point), then switch to the tag name state. (Don't emit the token yet; further details will be filled in before it is emitted.)
Lowercase ASCII letter
Create a new start tag token, set its tag name to the current input character, then switch to the tag name state. (Don't emit the token yet; further details will be filled in before it is emitted.)
So we are only left to something like <a1234>.
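The two grammar rules quoted above boil down to "a start tag token is only created when the character after < is an ASCII letter". A minimal sketch of that single decision, with a hypothetical function name and ignoring every other branch of the tokenizer:

```python
import string


def starts_start_tag(markup: str) -> bool:
    """Return True if markup begins with '<' followed by an ASCII letter."""
    return (
        len(markup) >= 2
        and markup[0] == "<"
        and markup[1] in string.ascii_letters
    )
```

Under this sketch `<a1234>` begins a start tag, while `<1>`, `<_person>` and `<213person>` do not.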
On attribute names it states:
Attributes have a name and a value. Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), ">" (U+003E), "/" (U+002F), and "=" (U+003D) characters, the control characters, and any characters that are not defined by Unicode. In the HTML syntax, attribute names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that are an ASCII case-insensitive match for the attribute's name.
Reading this it seems this is possible:
<div ::::::="hello" $_$="dollar"></div>
From all this, using namespaces for tag names is forbidden, and for attributes it's merely a convention you may follow but do not need to.
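The attribute-name rule quoted above can be approximated with a short check. This is a sketch under assumptions: the function name is made up, "control characters" is taken to mean Unicode category Cc, and "not defined by Unicode" is approximated by category Cn as reported by the Unicode database bundled with the running Python.

```python
import unicodedata

# Space characters plus the individually listed forbidden characters.
FORBIDDEN = set(" \t\n\f\r") | {"\0", '"', "'", ">", "/", "="}


def is_valid_attribute_name(name: str) -> bool:
    """Check a name against the quoted attribute-name rule (approximation)."""
    if not name:
        return False  # "one or more characters"
    for ch in name:
        if ch in FORBIDDEN:
            return False
        if unicodedata.category(ch) in ("Cc", "Cn"):
            return False  # control or unassigned character
    return True
```

Both `::::::` and `$_$` pass this check, which matches the reading above, while a name containing `=` or a space does not.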
So, to put it simply, namespaces do not exist in HTML 5 and, at least for the tag name, cannot be emulated, and we have no underscore, no dot, or anything alike.
Is this correct? On the other hand HTML 5 specs state that we are free to add xmlns attributes to the elements making it possible to clearly introduce new namespaces. How does this fit?
[Update]
I rechecked the specification using the single-page version of the specs, and it actually states that the namespace declaration is allowed for XHTML leftovers but has to be ignored, so no namespaces for us. Sad thing.
[/Update]
So the only question left is: if there is no ':' or anything else, what can I legally do with element tag names? Can I use a special one I have made up? Remember we have a relaxed specification for the parser here. The parser should be built in a way that it can handle unknown element tags. The question here is, how do they handle unknown element tags?
The HTML 5 specification allows only xmlns name space attributes with regard to the xhtml document specification. Those name spaces are ignored and not valued.
The tag name section of the specs is a bit confusing since it only talks about HTML elements. The parser section for tag names reads:
8.2.4.10 Tag name state
Consume the next input character:
"tab" (U+0009)
"LF" (U+000A)
"FF" (U+000C)
U+0020 SPACE
-> Switch to the before attribute name state.
"/" (U+002F)
-> Switch to the self-closing start tag state.
">" (U+003E)
-> Switch to the data state. Emit the current tag token.
Uppercase ASCII letter
-> Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current tag token's tag name.
U+0000 NULL
-> Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current tag token's tag name.
EOF
-> Parse error. Switch to the data state. Reconsume the EOF character.
Anything else
-> Append the current input character to the current tag token's tag name.
The last line is the important part. Also, the tag-name wording in the specification applies only to HTML elements defined as such. Therefore we are free to use other characters in a tag name, and the result is considered a valid element but not a valid HTML element. The question is how a browser or editor reacts to such character soup. But again, it is a valid element name, just not a valid HTML element name.
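The tag name state listed above can be sketched as a small loop. The function name and the returned state labels are illustrative, parse errors are not reported, and the sketch assumes the first character has already been accepted as an ASCII letter by the preceding state.

```python
from typing import Tuple


def consume_tag_name(source: str) -> Tuple[str, str]:
    """Consume a tag name; return (tag_name, name_of_next_state)."""
    name = []
    for ch in source:
        if ch in "\t\n\f ":
            return "".join(name), "before attribute name"
        if ch == "/":
            return "".join(name), "self-closing start tag"
        if ch == ">":
            return "".join(name), "data"
        if "A" <= ch <= "Z":
            name.append(chr(ord(ch) + 0x20))  # lowercase ASCII letter
        elif ch == "\0":
            name.append("\ufffd")  # parse error: REPLACEMENT CHARACTER
        else:
            name.append(ch)  # "Anything else" is appended unchanged
    return "".join(name), "data"  # EOF: parse error, back to data state
```

Feeding it the characters after `<` from the earlier `<Z[\]^_`a>` example yields the tag name `z[\]^_`a`, which illustrates why the "Anything else" branch matters.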

Newlines and special characters in HTML attributes

My questions are simple:
Is the following valid? If it is, would it break in some browsers?
<div data-text="Blah blah blah
More blah
And just a little extra blah to finish"> ... </div>
Which characters "must" be encoded in attribute values? I know " should be &quot;, but are any others required to be encoded?
Is the following valid?
It's a valid fragment of HTML5, yes.
would it break in some browsers?
Unlikely.
Which characters "must" be encoded in attribute values? I know " should be &quot;, but are any others required to be encoded?
That depends on whether the attribute value is double quoted, single quoted or unquoted.
For the double quoted form " must be replaced by its character reference, and & may need to be replaced by its character reference depending on the characters that follow it. See attribute-value-double-quoted-state
For the single quoted form ' must be replaced by its character reference, and & may need to be replaced by its character reference depending on the characters that follow it. See attribute-value-single-quoted-state
For the unquoted form TAB, LINEFEED, FORMFEED, SPACE, > must be replaced by their character references, and & may need to be replaced by its character reference depending on the characters that follow it. See attribute-value-unquoted-state
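The three cases above can be sketched as one helper. This is a simplified illustration: the function name and the "double"/"single"/"unquoted" labels are made up, it always escapes & rather than only where it would be ambiguous, and for the unquoted form it covers only the characters listed above.

```python
def escape_attribute_value(value: str, quoting: str = "double") -> str:
    """Escape value for the given attribute quoting style."""
    escaped = value.replace("&", "&amp;")  # conservative: always escape &
    if quoting == "double":
        return escaped.replace('"', "&quot;")
    if quoting == "single":
        return escaped.replace("'", "&#39;")
    if quoting == "unquoted":
        for ch in "\t\n\f >":
            escaped = escaped.replace(ch, "&#%d;" % ord(ch))
        return escaped
    raise ValueError("quoting must be 'double', 'single' or 'unquoted'")
```

Note that a line feed inside a double-quoted value is left untouched, which is consistent with the multi-line data-text example being valid.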
HTML 5 spec
There are different requirements for different attributes so there isn't one answer.
For instance, title attributes allow line feeds, but a class attribute is a space-separated list of string tokens.
For data-* attributes, though, the spec says of the name:
contains no characters in the range U+0041 to U+005A (LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z).
Other than that, it doesn't make any distinctions.
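The uppercase restriction quoted above applies to the attribute name, not its value. A minimal sketch of a name check, with a hypothetical function name; it also requires the "data-" prefix followed by at least one character, and it does not check XML name compatibility:

```python
def is_custom_data_attribute_name(name: str) -> bool:
    """Check the data-* name rule: prefix present and no A-Z characters."""
    prefix = "data-"
    if not name.startswith(prefix) or len(name) == len(prefix):
        return False
    # No characters in the range U+0041 to U+005A.
    return not any("A" <= ch <= "Z" for ch in name)
```

So `data-text` is accepted and `data-Text` is not, regardless of what the attribute value contains.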

Valid Characters in Option Value

I was just wondering what valid characters can be included as the value for a <option>
i.e. is this valid?
<select>
<option value='0dbl,2sgl'>0 Double and 2 Singles</option>
<option value='1dbl,0sgl'>1 Double and 0 Singles</option>
</select>
Yes, that's perfectly valid. See the specification: it says the content of value should be CDATA, in which pretty much everything is valid, with the following caveats:
Replace character entities with characters,
Ignore line feeds,
Replace each carriage return or tab with a single space.
For HTML4:
OPTION Attribute definitions
selected [CI]
When set, this boolean attribute specifies that this option is pre-selected.
value = cdata [CS]
This attribute specifies the initial value of the control. If this attribute is not set, the initial value is set to the contents of the OPTION element.
label = text [CS]
This attribute allows authors to specify a shorter label for an option than the content of the OPTION element. When specified, user agents should use the value of this attribute rather than the content of the OPTION element as the option label.
Source: http://www.w3.org/TR/html401/interact/forms.html#h-17.6
So we go to the definition of CDATA:
CDATA is a sequence of characters from
the document character set and may
include character entities. User
agents should interpret attribute
values as follows:
Replace character entities with characters,
Ignore line feeds,
Replace each carriage return or tab with a single space.
User agents may ignore leading and trailing white space in CDATA attribute values (e.g., " myval " may be interpreted as "myval"). Authors should not declare attribute values with leading or trailing white space.
For some HTML 4 attributes with CDATA
attribute values, the specification
imposes further constraints on the set
of legal values for the attribute that
may not be expressed by the DTD.
Source: http://www.w3.org/TR/html401/types.html#type-cdata
As there is no constraint noted, the valid content of value must have properly escaped entities, properly defined entities, and be within the scope of the document's encoding.
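The three interpretation steps from the HTML 4 CDATA definition can be sketched as follows. The function name is hypothetical, the whitespace steps are applied before entity replacement (an assumption about ordering, so that a character reference to a line feed survives), and `html.unescape` follows HTML5 entity rules rather than the HTML 4 entity list.

```python
import html


def normalize_cdata(value: str) -> str:
    """Interpret an HTML 4 CDATA attribute value as a user agent might."""
    value = value.replace("\n", "")                        # ignore line feeds
    value = value.replace("\r", " ").replace("\t", " ")    # CR or tab -> space
    return html.unescape(value)                            # replace character entities
```

The option values from the question, such as `0dbl,2sgl`, pass through this unchanged.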

Can data-* attribute contain HTML tags?

I.E. <img src="world.jpg" data-title="Hello World!<br/>What gives?"/>
As far as I understand the guidelines, it is basically valid, but it's better to use HTML entities.
From the HTML 4 reference:
You should also escape & within attribute values since entity references are allowed within cdata attribute values. In addition, you should escape > as &gt; to avoid problems with older user agents that incorrectly perceive this as the end of a tag when coming across this character in quoted attribute values.
From the HTML 5 reference:
Except where otherwise specified, attributes on HTML elements may have any string value, including the empty string. Except where explicitly stated, there is no restriction on what text can be specified in such attributes.
So the best thing to do, as @tdammers already says, is to escape these characters (quoting the W3C reference)
&amp; to represent the & sign.
&lt; to represent the < sign.
&gt; to represent the > sign.
&quot; to represent the " mark.
and decoding them from their entity values if they are to be used as HTML.
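As a sketch of that escape-then-decode round trip, Python's standard `html.escape` and `html.unescape` cover these four characters. The `build_img_tag` helper is hypothetical and only illustrates the img example from the question.

```python
import html


def build_img_tag(src: str, title: str) -> str:
    """Build an img tag with both attribute values escaped for double quotes."""
    return '<img src="%s" data-title="%s"/>' % (
        html.escape(src, quote=True),
        html.escape(title, quote=True),
    )


raw_title = "Hello World!<br/>What gives?"
tag = build_img_tag("world.jpg", raw_title)
# Decoding the escaped attribute value recovers the original markup string.
decoded = html.unescape(html.escape(raw_title, quote=True))
```

The resulting tag carries `data-title="Hello World!&lt;br/&gt;What gives?"`, and decoding gives back the original string with the `<br/>` intact.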
Providing you're serving it as text/html, then yes it's valid.
Note that not only is it possible to include markup inside attributes, but the HTML5 srcdoc attribute on the iframe element positively encourages it. The HTML5 draft says:
In the HTML syntax, authors need only remember to use U+0022 QUOTATION MARK characters (") to wrap the attribute contents and then to escape all U+0022 QUOTATION MARK (") and U+0026 AMPERSAND (&) characters, ....
Note, that when served with an XML content type (e.g. application/xhtml+xml), it is not valid, or even well-formed.
I'd say yes, as in it's still valid HTML5. Older browsers (which ones?) may not parse correctly.
Section 3.2.4.1 Attributes of the current HTML5 draft says this:
Except where otherwise specified, attributes on HTML elements may have any string value, including the empty string. Except where explicitly stated, there is no restriction on what text can be specified in such attributes.
HTML tags inside attributes also validates at http://html5.validator.nu
No. That would be invalid - HTML does not allow < or > inside attributes.
<img src="world.jpg" data-title="Hello World!&lt;br/&gt;What gives?"/> would be valid, but it would display the <br/> literally, not as a newline.