Valid Characters in Option Value - html

I was just wondering which characters are valid in the value of an <option>,
i.e. is this valid?
<select>
<option value='0dbl,2sgl'>0 Double and 2 Singles</option>
<option value='1dbl,0sgl'>1 Double and 0 Singles</option>
</select>

Yes, that's perfectly valid. See the specification: it says the content of value should be CDATA, in which pretty much everything is valid. The only caveats are in how user agents interpret the value:
Replace character entities with characters,
Ignore line feeds,
Replace each carriage return or tab with a single space.
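The three rules above can be sketched in a few lines of Python. This is only an illustration, not how any browser is implemented: the function name normalize_cdata is made up, and the order of the steps (whitespace handling on the literal source characters first, entity decoding last) is an assumption, since the specification just lists the rules.

```python
import html


def normalize_cdata(raw: str) -> str:
    """Apply the three HTML 4 CDATA rules to a raw attribute value.

    Assumption: the whitespace rules are applied to the literal source
    characters before character entities are decoded.
    """
    without_line_feeds = raw.replace("\n", "")            # ignore line feeds
    spaced = without_line_feeds.replace("\r", " ").replace("\t", " ")  # CR/tab -> space
    return html.unescape(spaced)                          # entities -> characters


print(normalize_cdata("0dbl,2sgl"))          # the value from the question is unchanged
print(normalize_cdata("a\tb\nc &amp; d"))
```

The value from the question contains only letters, digits and a comma, so none of the rules touches it.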

For HTML4:
OPTION Attribute definitions
selected [CI]
When set, this boolean attribute specifies that this option is pre-selected.
value = cdata [CS]
This attribute specifies the initial value of the control. If this attribute is not set, the initial value is set to the contents of the OPTION element.
label = text [CS]
This attribute allows authors to specify a shorter label for an option than the content of the OPTION element. When specified, user agents should use the value of this attribute rather than the content of the OPTION element as the option label.
Source: http://www.w3.org/TR/html401/interact/forms.html#h-17.6
So we go to the definition of CDATA:
CDATA is a sequence of characters from
the document character set and may
include character entities. User
agents should interpret attribute
values as follows:
Replace character entities with characters,
Ignore line feeds,
Replace each carriage return or tab with a single space.
User agents may ignore leading and trailing white space in CDATA attribute values (e.g., " myval " may be interpreted as "myval"). Authors should not declare attribute values with leading or trailing white space.
For some HTML 4 attributes with CDATA
attribute values, the specification
imposes further constraints on the set
of legal values for the attribute that
may not be expressed by the DTD.
Source: http://www.w3.org/TR/html401/types.html#type-cdata
As no further constraint is noted for value, any content is valid as long as its entities are properly defined and properly escaped and its characters are within the scope of the document's encoding.
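To see what a parser makes of such values, here is a minimal sketch using Python's standard html.parser module. The class name OptionValueCollector and the third option (with an escaped ampersand) are made up for illustration; the first two options are the ones from the question.

```python
from html.parser import HTMLParser


class OptionValueCollector(HTMLParser):
    """Collects the value attribute of every <option> start tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            # attrs is a list of (name, value) pairs; entities are already decoded
            self.values.append(dict(attrs).get("value"))


def option_values(markup: str) -> list:
    collector = OptionValueCollector()
    collector.feed(markup)
    collector.close()
    return collector.values


markup = """
<select>
<option value='0dbl,2sgl'>0 Double and 2 Singles</option>
<option value='1dbl,0sgl'>1 Double and 0 Singles</option>
<option value="bed &amp; breakfast">Hypothetical extra option</option>
</select>
"""
print(option_values(markup))
```

The comma passes through untouched, and the escaped ampersand comes back as a plain & character.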

Related

Does "Text" in the HTML5 syntax mean "any character"?

I wasn't able to find any restrictions on which characters are allowed in Text. Does this imply that everything is allowed, or are there restrictions that affect HTML documents in general?
For example the Character Reference Section states that:
The numeric character reference forms [...] are allowed to reference any Unicode code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters), surrogates (U+D800–U+DFFF), and control characters other than space characters.
Are those characters still allowed in their "unescaped" form in Text? E.g. as attribute value: <span title="Hello ␀ World"></span> where ␀ is the U+0000 NULL character (not U+2400).
The character restriction for text on your page and in your markup is defined according to your selected character set. If you don't define a character set, the browser will take a guess or assert its default option (usually, whatever is the least restrictive). The character set is defined by using the meta tag with the charset attribute in your document's head section. The most common example of this uses the UTF-8 character set:
<meta charset="UTF-8" />
The value of this attribute can be any of the character sets defined by the Internet Assigned Numbers Authority (IANA). The full list of defined character sets is available from IANA.
Additionally, there may be specific restrictions on unescaped text used within certain elements (or types of elements). In that case, you would have to read the specification for that element or type of element, or simply escape the characters in question by replacing them with their HTML character references (the ampersand-prefixed escapes).
I don't think there is any restriction on Text in the context you have pointed out. Text here means all the allowed letters, numbers and alphanumeric characters.
The answer is in the link you provided:
Text is allowed inside elements, attribute values, and comments. Extra constraints are placed on what is and what is not allowed in text based on where the text is to be put, as described in the other sections
Now if we go to the syntax definition for CDATA sections:
CDATA sections must consist of the following components, in this
order:
The string "<![CDATA[".
Optionally, text, with the additional restriction that the text must not contain the string "]]>".
The string "]]>".
So every type of content has its own set of restrictions, and text is just used to define the superset of all characters, symbols and so on...
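As a small illustration of such a context-specific restriction, the CDATA rule quoted above could be checked like this. The helper name wrap_in_cdata is made up, and the sketch checks only the one restriction quoted here, not any other rule that applies to text in general.

```python
CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def wrap_in_cdata(text: str) -> str:
    """Wrap text in a CDATA section, enforcing the quoted restriction."""
    if CDATA_CLOSE in text:
        # The text would terminate the section early, so it is not allowed here.
        raise ValueError("text for a CDATA section must not contain ']]>'")
    return f"{CDATA_OPEN}{text}{CDATA_CLOSE}"


print(wrap_in_cdata("a < b && b > c"))
```

The same string that is rejected here may be perfectly fine as an attribute value, which is the point of the "extra constraints based on where the text is put" wording.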

Should a value be used for all attributes?

There are attributes in HTML that only specify a boolean value. These include multiple, disabled, selected etc.
In XHTML, due to the strict XML syntax, you must give the attributes a value. This is normally the name of the attribute.
<select multiple="multiple">
But HTML also supports just the name of the attribute.
<select multiple>
And, as seen here, browsers (at least Firefox) also allow for other values with the same result.
<select multiple="yes">
Which one of these is the generally used one, or is there one? What is the official recommendation?
From the spec
A number of attributes are boolean attributes. The presence of a
boolean attribute on an element represents the true value, and the
absence of the attribute represents the false value.
If the attribute is present, its value must either be the empty string
or a value that is an ASCII case-insensitive match for the attribute's
canonical name, with no leading or trailing whitespace.
So multiple, multiple="", multiple=multiple, multiple='multiple' or multiple="multiple". Nothing else (case insensitivity aside), even if browsers recover from the error.
I'd lean towards either the short (multiple) or the XML parser-friendly with the more conventional quotes (multiple="multiple").
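The rule quoted from the spec is simple enough to express as a check. This is a sketch only; the function name is made up, and None is used here (as an assumption of this example) to represent an attribute written without any value, as in <select multiple>.

```python
import string
from typing import Optional

# ASCII-only lowercasing, so that non-ASCII look-alikes are not folded.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def is_conforming_boolean(name: str, value: Optional[str]) -> bool:
    """Return True if value is allowed for the boolean attribute name.

    None stands for an attribute written with no value at all.
    """
    if value is None or value == "":
        return True
    # Leading or trailing whitespace makes the comparison fail, as required.
    return value.translate(_ASCII_LOWER) == name.translate(_ASCII_LOWER)


for candidate in (None, "", "multiple", "MULTIPLE", "yes", " multiple"):
    print(repr(candidate), is_conforming_boolean("multiple", candidate))
```

Only "yes" and the value with a leading space are rejected, which matches the reading that browsers tolerating multiple="yes" are recovering from an error.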

Namespace and HTML 5

In the HTML specs one can find the following line:
In the HTML syntax, namespace prefixes and namespace declarations do not have the same effect as in XML. For instance, the colon has no special meaning in HTML element names.
After looking into the Grammar definition there are the following sections:
On tag names it states:
Tags contain a tag name, giving the element's name. HTML elements all have names that only use alphanumeric ASCII characters. In the HTML syntax, tag names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag name; tag names are case-insensitive.
This leaves almost no room for interpretation. There is no underscore or dollar sign here. There is also no ':', making it impossible to legally express namespaces. It would also make it possible to use only a number, like <1>, but then the grammar states:
Uppercase ASCII letter
Create a new start tag token, set its tag name to the lowercase version of the current input character (add 0x0020 to the character's code point), then switch to the tag name state. (Don't emit the token yet; further details will be filled in before it is emitted.)
Lowercase ASCII letter
Create a new start tag token, set its tag name to the current input character, then switch to the tag name state. (Don't emit the token yet; further details will be filled in before it is emitted.)
So we are only left to something like <a1234>.
On attribute names it states:
Attributes have a name and a value. Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), ">" (U+003E), "/" (U+002F), and "=" (U+003D) characters, the control characters, and any characters that are not defined by Unicode. In the HTML syntax, attribute names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that are an ASCII case-insensitive match for the attribute's name.
Reading this it seems this is possible:
<div ::::::="hello" $_$="dollar"></div>
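The attribute name rule quoted above can be turned into a rough check. This is a sketch under assumptions: the function name is made up, "control characters" is approximated by the Unicode category Cc, and "not defined by Unicode" by the category Cn as reported by Python's unicodedata module, which depends on the Unicode version shipped with the interpreter.

```python
import unicodedata

SPACE_CHARACTERS = {" ", "\t", "\n", "\f", "\r"}
FORBIDDEN = SPACE_CHARACTERS | {"\x00", '"', "'", ">", "/", "="}


def is_valid_attribute_name(name: str) -> bool:
    """Check a name against the attribute name rule quoted from the spec."""
    if not name:
        return False  # "one or more characters"
    for ch in name:
        if ch in FORBIDDEN:
            return False
        if unicodedata.category(ch) in ("Cc", "Cn"):
            return False  # control character or unassigned code point
    return True


for candidate in ("::::::", "$_$", "data-x", "a=b", "a b", ""):
    print(repr(candidate), is_valid_attribute_name(candidate))
```

Under this reading both attribute names from the example pass, because colons, dollar signs and underscores are not in the excluded set.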
From all this, using namespaces for tag names is forbidden, and for attributes it's merely a convention you may follow but do not need to.
So, to put it simply, namespaces do not exist in HTML 5; at least for tag names they cannot be emulated, and we have no underscore, no dot or anything alike.
Is this correct? On the other hand HTML 5 specs state that we are free to add xmlns attributes to the elements making it possible to clearly introduce new namespaces. How does this fit?
[Update]
I rechecked the specification using the single-page version of the specs, and it actually states that the namespace declaration is allowed for XHTML leftovers but has to be ignored, so no namespaces for us. Sad thing.
[/Update]
So the only question left is: if there is no ':' or anything else, what can I legally do with element tag names? Can I use a special one I have made up? Remember we have a relaxed specification for the parser here. The parser should be built in a way that it can handle unknown element tags. The question here is, how do they handle unknown element tags?
The HTML 5 specification allows only xmlns name space attributes with regard to the xhtml document specification. Those name spaces are ignored and not valued.
The tag name section of the specs is a bit confusing since it only talks about HTML elements. The parser section for tag names reads:
8.2.4.10 Tag name state
Consume the next input character:
"tab" (U+0009)
"LF" (U+000A)
"FF" (U+000C)
U+0020 SPACE
-> Switch to the before attribute name state.
"/" (U+002F)
-> Switch to the self-closing start tag state.
">" (U+003E)
-> Switch to the data state. Emit the current tag token.
Uppercase ASCII letter
-> Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current tag token's tag name.
U+0000 NULL
-> Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current tag token's tag name.
EOF
-> Parse error. Switch to the data state. Reconsume the EOF character.
Anything else
-> Append the current input character to the current tag token's tag name.
The last line is the important part. Also, the specification only makes its statement about names for HTML elements defined as such. Therefore we are free to use tag names containing other characters, and the result is considered a valid element but not a valid HTML element. The question is how a browser or editor reacts to this character soup. But again, it is a valid element name but not a valid HTML element name.
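The state table above, together with the tag open rule quoted in the question (a start tag begins with an ASCII letter), can be sketched as a tiny function. This is a simplification and not a full tokenizer: the function name, the returned tuple and the error strings are made up for this example, and the "not a start tag" result glosses over what the real tag open state does with other characters.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def read_tag_name(source: str):
    """Read a tag name from source, which starts right after the '<'.

    Returns (tag_name, next_state, parse_errors).
    """
    first = source[:1]
    if not (first.isascii() and first.isalpha()):
        return None, "not a start tag", []  # simplified tag open handling
    name, errors = [], []
    for ch in source:
        if ch in "\t\n\f ":
            return "".join(name), "before attribute name", errors
        if ch == "/":
            return "".join(name), "self-closing start tag", errors
        if ch == ">":
            return "".join(name), "data", errors
        if "A" <= ch <= "Z":
            name.append(chr(ord(ch) + 0x20))  # lowercase version
        elif ch == "\x00":
            errors.append("unexpected NULL")
            name.append(REPLACEMENT_CHARACTER)
        else:
            name.append(ch)                   # "Anything else"
    errors.append("EOF in tag name")
    return "".join(name), "data", errors


print(read_tag_name("my:Tag$_ class='x'>"))
print(read_tag_name("1>"))
```

The first call shows the "Anything else" branch at work: the colon, dollar sign and underscore simply become part of the tag name, while a name starting with a digit never reaches the tag name state.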

Can the select option value be of different types?

I want to know what is good practice for select option values.
Example
<select name="select">
<option value="0-9">Sample</option>
<option value="a-z">Sample</option>
<option value="this is sample value">Sample</option>
<option value="this-is-sample-value">Sample</option>
<option value="this_is_sample_value">Sample</option>
<option value="this & is | sample ** value">Sample</option>
</select>
I'm a little bit confused here. Is the select value the same as for input text and textarea?
There are no real limits to the type of data that can be set in the value attribute of the option element. Characters with special meaning in HTML do, of course, need to be represented by the appropriate entities (& as &amp; for example (although the one in the question meets the "followed by a space character" exception to the rule)).
The attribute is defined as containing CDATA:
<!ELEMENT OPTION - O (#PCDATA) -- selectable choice -->
<!ATTLIST OPTION
%attrs; -- %coreattrs, %i18n, %events --
selected (selected) #IMPLIED
disabled (disabled) #IMPLIED -- unavailable in this context --
label %Text; #IMPLIED -- for use in hierarchical menus --
value CDATA #IMPLIED -- defaults to element content --
>
— http://www.w3.org/TR/html4/interact/forms.html#h-17.6
CDATA is a sequence of characters from
the document character set and may
include character entities. User
agents should interpret attribute
values as follows:
Replace character entities with characters,
Ignore line feeds,
Replace each carriage return or tab with a single space.
User agents may ignore leading and
trailing white space in CDATA
attribute values (e.g., " myval "
may be interpreted as "myval").
Authors should not declare attribute
values with leading or trailing white
space.
For some HTML 4 attributes with CDATA
attribute values, the specification
imposes further constraints on the set
of legal values for the attribute that
may not be expressed by the DTD.
— http://www.w3.org/TR/html4/types.html#type-cdata
The specification doesn't impose additional limits for the option element's value attribute.
Same as a text-type input -- it can be string, float, etc. This is more a question of which is most reliable to parse when you process the form data.
The posted value will be the one corresponding to the selection.
In that regard, it is treated the same way as an input of type text is.
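What "posted" means for a value full of special characters can be sketched with Python's urllib.parse, which produces and reads the application/x-www-form-urlencoded format that HTML forms submit by default. The field name and value are taken from the question; treating Python as the receiving side is just an assumption for the demonstration.

```python
from urllib.parse import parse_qs, urlencode

selected_value = "this & is | sample ** value"

# What a form submission would send for <select name="select">.
body = urlencode({"select": selected_value})
print(body)

# What the server gets back after decoding the body.
received = parse_qs(body)["select"][0]
print(received)
```

The special characters are percent-encoded on the wire and arrive on the server as the original string, so from the form's point of view the value is just text.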
Yes, it is a string and can have any value. The value is sent when you submit the form, and that is where limitations may apply.
The limitations depend on which technology you are using on the server end.
With ASP.NET, for example, when you post special characters such as &, or especially something like < script > some script < / script > or similar input that resembles HTML tags and could be a dangerous script, ASP.NET checks the posted data and throws an exception. So some special characters are not allowed in the value of a select box as far as ASP.NET is concerned.
However, the samples you have given (except for the bare &, which should be written as &amp;) are allowed and can be set in the option tag's value attribute.
Hope that clears things up.

Can data-* attribute contain HTML tags?

I.E. <img src="world.jpg" data-title="Hello World!<br/>What gives?"/>
As far as I understand the guidelines, it is basically valid, but it's better to use HTML entities.
From the HTML 4 reference:
You should also escape & within attribute values since entity references are allowed within cdata attribute values. In addition, you should escape > as &gt; to avoid problems with older user agents that incorrectly perceive this as the end of a tag when coming across this character in quoted attribute values.
From the HTML 5 reference:
Except where otherwise specified, attributes on HTML elements may have any string value, including the empty string. Except where explicitly stated, there is no restriction on what text can be specified in such attributes.
So the best thing to do, as @tdammers already says, is to escape these characters (quoting the W3C reference):
&amp; to represent the & sign.
&lt; to represent the < sign.
&gt; to represent the > sign.
&quot; to represent the " mark.
and decoding them from their entity values if they are to be used as HTML.
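A minimal sketch of that round trip, using Python's standard html module. The helper name to_attribute_value is made up; the markup is the one from the question.

```python
import html


def to_attribute_value(markup: str) -> str:
    """Escape markup so it can sit inside a double-quoted attribute value."""
    return html.escape(markup, quote=True)


title = "Hello World!<br/>What gives?"
escaped = to_attribute_value(title)

tag = f'<img src="world.jpg" data-title="{escaped}"/>'
print(tag)

# Decoding the entities gives back the original markup, ready to be used as HTML.
print(html.unescape(escaped) == title)
```

In a browser, reading the attribute through the DOM already returns the decoded string, so the explicit unescape step is only needed when the raw source text is processed by other means.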
Providing you're serving it as text/html, then yes it's valid.
Note that not only is it possible to include markup inside attributes, but the HTML5 srcdoc attribute on the iframe element positively encourages it. The HTML5 draft says:
In the HTML syntax, authors need only
remember to use U+0022 QUOTATION MARK
characters (") to wrap the attribute
contents and then to escape all U+0022
QUOTATION MARK (") and U+0026
AMPERSAND (&) characters, ....
Note, that when served with an XML content type (e.g. application/xhtml+xml), it is not valid, or even well-formed.
I'd say yes, as in it's still valid HTML5. Older browsers (which ones?) may not parse correctly.
Section 3.2.4.1 Attributes of the current HTML5 draft says this:
Except where otherwise specified, attributes on HTML elements may have any string value, including the empty string. Except where explicitly stated, there is no restriction on what text can be specified in such attributes.
HTML tags inside attributes also validate at http://html5.validator.nu
No. That would be invalid - HTML does not allow < or > inside attributes.
<img src="world.jpg" data-title="Hello World!&lt;br/&gt;What gives?"/> would be valid, but it would display the <br/> literally, not as a newline.