Undefined behaviour in (X)HTML? - html

Is there such a thing as undefined behaviour in (X)HTML?
I have wondered this after playing around with the <button> tag, which allows HTML to be rendered as button. Nothing new so far...
But I noticed that one can also use the <a> tag. Complete example:
<button>
normal text
<b>bold text</b>
<a href="...">linked text</a>
</button>
This is rendered as follows in Firefox:
And in Google Chrome:
Now, in Firefox, the link target is NOT clickable, only the button. In Chrome, however, the link is clickable and will redirect to the IANA RFC2606 page.
Is this undefined behaviour? Are there more cases in (X)HTML that could be described as undefined behaviour?

It's a little more complex than just inspecting the DTD as given by Yi Jiang and mu is too short.
It's true that the XHTML 1.0 DTDs explicitly forbid <a> elements as children of <button> elements as given in your question. However, they do not forbid <a> elements as descendants of <button> elements.
So
<button>
normal text
<b>bold text</b>
<span><a href="...">linked text</a></span>
</button>
is XHTML 1.0 Strict DTD conforming. But it has the same behavioural difference between Firefox and Chrome as the button fragment in the question.
Now, it is known that DTDs have problems describing limitations on descendant relationships, so it's maybe not surprising that the above sample is DTD conforming.
However, Appendix B of the XHTML 1.0 spec normatively describes descendant limitations in addition to the DTD. It says:
The following elements have prohibitions on which elements they can contain (see SGML Exclusions). This prohibition applies to all depths of nesting, i.e. it contains all the descendant elements.
button
must not contain the input, select, textarea, label, button, form, fieldset, iframe or isindex elements.
Note that it does not contain an exclusion for the <a> element. So it seems that XHTML 1.0 does not prohibit the <a> element from being a non-child descendant of <button>, and the behaviour in this case is indeed undefined.
This omission is almost certainly a mistake. The <a> element should have been in the list of elements prohibited as descendants of button in Appendix B.
HTML5 (including XHTML5) is much more thorough on the matter. It says:
4.10.8 The button element
Content model:
Phrasing content, but there must be no interactive content descendant.
where interactive content is defined as
Interactive content is content that is specifically intended for user interaction.
a
audio (if the controls attribute is present)
button
details
embed
iframe
img (if the usemap attribute is present)
input (if the type attribute is not in the Hidden state)
keygen
label
menu (if the type attribute is in the toolbar state)
object (if the usemap attribute is present)
select
textarea
video (if the controls attribute is present)
So in (X)HTML5 the <a> element is prohibited from being a descendant of the <button> element.
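The HTML5 rule above can be checked mechanically. The following is a minimal sketch in Python, assuming the standard-library html.parser module and a simple open-element stack; the names ButtonChecker and interactive_in_button are hypothetical, and only the unconditional entries of the interactive-content list are included. It is not a conformance checker, just an illustration of "no interactive content descendant at any depth".

```python
from html.parser import HTMLParser

# Unconditional entries from the interactive-content list quoted above.
# Conditional entries (audio, img, input, menu, object, video) are left out
# to keep the sketch short; that omission is an assumption of this example.
INTERACTIVE = {"a", "button", "details", "embed", "iframe",
               "keygen", "label", "select", "textarea"}

# Void elements never stay open, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "keygen", "link", "meta", "param", "source", "track", "wbr"}


class ButtonChecker(HTMLParser):
    """Records interactive elements that appear anywhere inside <button>."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements, outermost first
        self.violations = []  # (tag, path) pairs

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE and "button" in self.stack:
            self.violations.append((tag, "/".join(self.stack + [tag])))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching start tag.
            while self.stack.pop() != tag:
                pass


def interactive_in_button(markup):
    checker = ButtonChecker()
    checker.feed(markup)
    checker.close()
    return checker.violations


sample = ('<button>normal text <b>bold text</b> '
          '<span><a href="#">linked text</a></span></button>')
print(interactive_in_button(sample))
```

Because the check looks at the whole stack rather than only the parent, it flags the <span>-wrapped link that the XHTML 1.0 DTD lets through.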

The HTML 4 specification declares the <button> element as follows:
<!ELEMENT BUTTON - -
(%flow;)* -(A|%formctrl;|FORM|FIELDSET)
-- push button -->
If my reading of the DTD is correct (and I'm not exactly familiar with this), <a> elements are explicitly forbidden from being nested in buttons. What you're looking at is therefore invalid HTML, and its behavior is undefined.

XHTML says this about <button>:
<!-- button uses %Flow; but excludes a, form and form controls -->
<!ENTITY % button.content
"(#PCDATA | p | %heading; | div | %lists; | %blocktext; |
table | %special; | %fontstyle; | %phrase; | %misc;)*">
So <a> is explicitly excluded from XHTML as well. The allowable elements inside <button> appear to be pretty much the same in XHTML-1.0 as in HTML-4.0.

To add to Alohci's good answer, but more specifically to answer the question: If your (X)HTML is invalid, the behaviour is by definition undefined. In this case, browsers are free to interpret the markup as they like (or however chance has it), or even to reject it (which no real browser does).
This is exactly the problem that the tag soup introduced and that was the origin of XML's strict parsing rules and the HTML5 spec growing to >500 pages.

Related

Why some non self-closing HTML elements that do not have a close tag (self closing) will not display these elements, but some do

<textarea> is not a self-closing element, right? If so, when I remove </textarea> in this w3school code example, why it still works?
There are only 12 self-closing elements based on this explanation? Is it complete? Does it means we have to add closing tag except these 12 self-closing elements? If not, then element cannot display correctly?
Self-closing tags accompany void elements, which don't allow any content within them.
The void elements are <area>, <base>, <br>, <col>, <embed>, <hr>, <img>, <input>, <keygen>, <link>, <menuitem>, <meta>, <param>, <source>, <track> and <wbr>.
Consider <textarea> Text </textarea>. That is not self-closing, because it makes sense for it to contain content; the text the user inputs.
Conversely, consider <br />. That is self-closing, because it's a line break; there can never be anything between the start and end of a new line.
Void elements have an implied closing tag if omitted; you can safely leave it out when writing the tag. <br> is just as valid as <br />.
Omitting the closing tag of a non-void element will still work in some circumstances. In fact, there's a list of optional start and end tags that covers things such as </body> and </head>. These elements are always part of an HTML document, so if you omit their tags, the parser infers them and creates the elements for you. Inspecting the page with the F12 developer tools will show that the elements are present even though the tags were omitted.
Relying on this can be confusing, and it's safer to write the tags yourself. Otherwise, you may end up with invalid markup. You can test whether your markup is valid with the W3C Markup Validation Service.
Hope this helps! :)
It doesn’t “work” when you omit the </textarea> end tag. Instead, as @Kaiido alludes to above, “</body>\n</html>” gets added to the contents of the textarea element as text. Look:
As you can see there, “</body>\n</html>” has become part of the textarea contents.
That is, by removing the </textarea> end tag, you’ve caused all the remaining HTML markup in the source to be parsed not as markup but instead as text contents of the textarea element.
And while it’s true that for some elements, the parser will infer where the end tag should be and add it to the DOM for you, the parser will never do that for the textarea element.
There are only 12 self-closing elements based on this explanation? Is it complete? Does it means we have to add closing tag except these 12 self-closing elements? If not, then element cannot display correctly?
Check https://html.spec.whatwg.org/multipage/form-elements.html#the-textarea-element and you’ll see that for the Tag omission in text/html section there for the textarea element it says:
Neither tag is omissible.
Every element in the spec has a similar Tag omission in text/html section that explains whether or not you can ever omit the end tag or start tag for that element.

Can empty HTML elements have attributes in HTML5?

Can empty HTML elements (i.e. elements having no content and no closing tag, like br, hr, or any others I'm not aware of) have attributes in the latest HTML5 standard?
Could somebody please explain this to me in simple, easy-to-understand language?
Yes. Example: The <hr> tag can be modified to move the line around or change its length.
<hr width="50%" align="right">
They can. For example, the <br> tag supports the global HTML attributes. You can check the attributes of HTML tags on the W3Schools site. Here is the one for br:
http://www.w3schools.com/tags/tag_br.asp
(Check out the Global Attributes and/or Event Attributes)
You can easily check yourself which attributes an HTML5 element can have. In short:
Visit the HTML5 specification.
Search for the element under the "Table of Contents" (section 4).
For each element, see the attributes listed under "Content attributes".
In case of br and hr, they can have the global attributes (class, id, lang etc.).
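As a quick illustration that the syntax of an empty element carries attributes like any other start tag, here is a small sketch using Python's standard html.parser. The names AttributeCollector and collect_attributes are made up for this example, and the sketch only shows what a parser sees; it says nothing about which attributes are conforming for a given element.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects (tag, attributes) for every start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.seen.append((tag, dict(attrs)))


def collect_attributes(markup):
    parser = AttributeCollector()
    parser.feed(markup)
    parser.close()
    return parser.seen


print(collect_attributes('<hr class="divider" id="top"><br lang="en">'))
```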

Invalid location of tag (p)

I have this code:
<p class="pHelp"> xxxxx Form components yyyyy </p>
This line is inside html/body/a/a/a/a/a/a/a/a/p/#text (<- btw, what is the proper name for this "html's tag route/path"?)
Eclipse gives me an Other error on that line. How should I solve this?
It also complains about this sentence...
<a id="pd" /><h5>Provisional Data</h5>
...where it indicates 'No end tag (</a>).' Aren't self-closing tags allowed in HTML?
Thanks!
An interactive anchor element must not appear as a descendant of an anchor element. Your code appears to have multiple levels of nesting of anchor elements.
An anchor element must have a start tag and an end tag.
See the W3C language reference for the anchor element.
Only empty elements (that is, elements that can't contain anything, such as img or br) can use the self-closing syntax, unless you explicitly use an XHTML DOCTYPE, and even then older browsers such as IE<7 don't support it.
Also, you don't need to use an <a> element if you just need an id to set an anchor. Just set the ID on any other element, in this case, for example, directly on the <h5> element:
<h5 id="pd">Provisional Data</h5>
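To see why the path in the question probably grows to a/a/a/..., here is a rough Python sketch. It assumes a naive open-element stack built on the standard html.parser, with the trailing slash ignored on non-void elements as it is in the HTML syntax. PathTracer and element_paths are hypothetical names, and this is not the HTML5 tree-construction algorithm, which handles nested anchors and other cases differently.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "keygen", "link", "meta", "param", "source", "track", "wbr"}


class PathTracer(HTMLParser):
    """Records the path of each element using a simple open-element stack."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # In the HTML syntax the trailing slash has no effect, so <a ... />
        # is treated as a plain start tag and stays open.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching start tag.
            while self.stack.pop() != tag:
                pass


def element_paths(markup):
    tracer = PathTracer()
    tracer.feed(markup)
    tracer.close()
    return tracer.paths


print(element_paths('<a id="pd" /><h5>Provisional Data</h5>'
                    '<p class="pHelp">text</p>'))
```

Under these assumptions both the <h5> and the <p> end up inside the never-closed <a>, and every further <a ... /> adds one more level.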

What kinds of HTML tags come in pairs and what not in pairs?

What kinds of HTML tags come in pairs and what not in pairs?
Wikipedia about HTML says:
HTML tags most commonly come in pairs like <h1> and </h1>, although some tags represent empty elements and so are unpaired, for example <img>
What does it mean by "empty elements"? <img> represents image embedding which isn't it nonempty element?
Thanks.
All tags come in pairs except for those of void elements, which at the time of this writing include: area, base, br, col, embed, hr, img, input, keygen, link, meta, param, source, track, and wbr.
The Wikipedia page (as of now) is confused and confusing, and it uses old HTML terminology: the term “empty element” as used up to HTML 4.01 and various XHTML specifications was replaced by “void element” in HTML5. These terms are used to describe elements that cannot have any content, in the technical meaning for “content” as defined in HTML. When HTML appears as linear text, as serialized, using tags, this “content” is what appears between the start tag and the end tag (either of which may be implied for some elements). In the Document Object Model, “content” consists of the child nodes of an element that are element nodes or text nodes.
Emptiness does not imply invisibility. For example, <hr> is normally rendered as a horizontal rule. But the element is empty because its definition does not allow any content for the element.
“Void element” is introduced in HTML5 by a list: area, base, br, col, embed, hr, img, input, keygen, link, meta, param, source, track, wbr. However, this is meant to mean the elements with content model that allows no content; the list follows from this. The definitions of these elements specify “Content model: Empty.”
An element is an empty/void element if it is declared so in an applicable HTML specification. The definition of each element indicates its content model (allowed content).
“Empty element” is (or was) an element with EMPTY declared content in the formal syntax. As such, it was simply a syntactic concept: an empty element cannot have any content (any elements or any text except whitespace) between the start tag and the end tag. According to HTML rules except XHTML, the end tag must be omitted (implied), whereas in XHTML, an empty element may be written either with the end tag present, e.g. <br></br>, or using a special syntax where a slash in the start tag makes it act as an end tag, too, e.g. <br/>. The latter is recommended in clause C.2 of XHTML 1.0.
HTML5 has two syntaxes (serializations, linearizations): classic HTML syntax and XHTML syntax. In the former, old HTML rules for empty elements apply to void elements, e.g. no end tag is allowed for <br>. However, for compatibility, a slash is allowed as in XHTML, e.g. <br/>, but it has no effect. In the XHTML syntax, all XML syntax rules apply, so <br> alone is fatally invalid (well-formedness error), and either <br></br> or <br/> must be used.
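The XHTML side of this can be tried with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed is invented for the example. It only tests XML well-formedness, not validity against any XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


for sample in ("<p>one<br/>two</p>",
               "<p>one<br></br>two</p>",
               "<p>one<br>two</p>"):
    print(sample, "->", is_well_formed(sample))
```

The first two forms parse, while the bare <br> leaves an unclosed element and is rejected by the XML parser.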

Should an end tag close all unclosed intervening start tags with omitted end tags?

Am I reading the HTML 4.01 standard wrong, or is Google? In HTML 4.01, if I write:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html> <head> <body>plain <em>+em <strong>+strong </em>-em
The rendering in Google Chrome is:
plain +em (italic) +strong (bold italic) -em (bold)
This seems to contradict the HTML 4.01 standard, which summarizes the underlying SGML rules as: “an end tag closes, back to the matching start tag, all unclosed intervening start tags with omitted end tags”.¹
That is, the </em> end tag should close not only the <em> start tag but also the unclosed intervening <strong> start tag, and the rendering should be:
plain +em (italic) +strong (bold italic) -em (plain)
A commenter pointed out that it is bad practice to leave tags open, but this is only an academic example. An equally good example would be: <em> +em <strong> +strong </em> -em </strong>. It was my understanding from the HTML 4.01 standard that this code fragment would not work as intended because of the overlapping elements: the </em> end tag should implicitly close the <strong>. The fact that it did work as intended was surprising, and this is what led to my question.
And it turned out I proposed a false dichotomy in the question: neither Google nor I were reading the HTML 4.01 standard wrong. A private correspondent at w3.org pointed me to Web SGML and HTML 4.0 Explained by Martin Bryan, which explains that “[t]he parsing program will automatically close any currently open embedded element which has been declared as having omissible end-tags when it encounters an end-tag for a higher level element. (If an embedded element whose end-tag cannot be omitted is still open, however, the program will report an error in the coding.)”² (Emphasis added.) Bryan’s summarization of the SGML standard is right, and HTML 4.01’s summarization is wrong.
The statement quoted from the HTML 4.01 specification is very obscure, or just plain wrong on all accounts. HTML 4.01 has specific rules for end tag omission, and these rules depend on the element. For example, the end tag of a p element may be omitted, the end tag of an em may never be omitted. The statement in the specification probably tries to say that an end tag implicitly closes any inner elements that have not yet been closed, to the extent that end tag omission is allowed.
No browser has ever implemented HTML 4.01 (or any earlier HTML specification) as defined, with the SGML features that are formally part of it. Anything that the HTML specifications say about SGML should be taken as just theoretical until proven otherwise.
HTML5 doesn’t change the rules of the game in this respect, except that it writes down the error handling rules. In simple issues like these, the rules just make the traditional browser behavior a norm. They are tagsoup-oriented, treating certain tags more or less as formatting commands: <em> means “italicize,” </em> means “stop italicizing,” etc. But HTML5 also takes measures to define error handling more formally so that despite such tag soup usage, it is well-defined what document tree in the DOM will be constructed.
Some tags are allowed to be omitted (such as the end tag for <p> or the start and end tags for <body>), and some are not (such as the end tag for <strong>). It is the former that the section of the spec you quote is referring to. You can identify them in the DTD, where a dash marks a required tag and an O marks one that may be omitted:
<!ELEMENT P - O (%inline;)* -- paragraph -->
          ^ A p element
            ^ requires a start tag
              ^ has optional end tag
                ^ contains zero or more inline things
                            ^ Comment: Is a paragraph
What you have is not an HTML document with an omitted tag, but an invalid pseudo-HTML document that browsers will try to perform error recovery on.
The specification (for HTML 4) does not describe how to perform error recovery; that is left up to browsers.
The specification says that:
Some HTML element types allow authors to omit end tags (e.g., the P and LI element types).
This:
Please consult the SGML standard for information about rules governing elements (e.g., they must be properly nested, an end tag closes, back to the matching start tag, all unclosed intervening start tags with omitted end tags (section 7.5.1), etc.).
Applies to elements which can have omitted end tags.
If you look at the P element spec you will see:
Start tag: required, End tag: optional
So, when you use this:
<DIV>
<P>This is the paragraph.
</DIV>
The P element will be automatically closed.
But, if you look at the EM spec, you will see:
Start tag: required, End tag: required
So this rule of automatic closing is not valid since the HTML is not valid.
Curiously all the browsers presented the same behavior with that kind of invalid HTML.
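The rule as described above (an end tag implicitly closes only those intervening elements whose end tags may be omitted, and anything else is an error) can be sketched in Python. This is an illustration built on the standard html.parser with a hand-written subset of the HTML 4.01 element declarations; EndTagRule and check are hypothetical names, and no browser works this way.

```python
from html.parser import HTMLParser

# HTML 4.01 elements declared with an optional end tag ("- O" or "O O").
OMISSIBLE_END = {"p", "li", "dt", "dd", "tr", "td", "th", "thead", "tfoot",
                 "tbody", "colgroup", "option", "html", "head", "body"}

# HTML 4.01 elements declared EMPTY; they never stay open.
EMPTY = {"area", "base", "basefont", "br", "col", "frame", "hr", "img",
         "input", "isindex", "link", "meta", "param"}


class EndTagRule(HTMLParser):
    """Applies the 'close back to the matching start tag' rule."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements
        self.implied = []  # elements closed implicitly (allowed)
        self.errors = []   # elements that required an explicit end tag

    def handle_starttag(self, tag, attrs):
        if tag not in EMPTY:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            self.errors.append(f"stray </{tag}>")
            return
        while True:
            top = self.stack.pop()
            if top == tag:
                break
            if top in OMISSIBLE_END:
                self.implied.append(top)
            else:
                self.errors.append(
                    f"</{tag}> found while <{top}> is still open")


def check(markup):
    rule = EndTagRule()
    rule.feed(markup)
    rule.close()
    return rule.implied, rule.errors


print(check("<DIV><P>This is the paragraph.</DIV>"))
print(check("<em>+em <strong>+strong </em>-em"))
```

The DIV/P fragment yields an implied close for p and no errors, while the em/strong fragment is reported as an error because the end tag of strong is required.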
All modern browsers use an HTML5 parser (even for HTML 4.01 content), so the parsing rules of HTML5 apply. You can find more information at the Parsing HTML Documents section in the HTML5 spec.
HTML Outline
HTML
  HEAD
    #text " " ()
  BODY
    #text "plain " ()
    EM
      #text "+em " (italic)
      STRONG
        #text "+strong " (bold/italic)
    STRONG
      #text "-em" (bold)
If you try running your HTML through http://validator.w3.org/check it will flag up this HTML as being pretty much invalid.
If your HTML is invalid, all bets are off, and different browsers may render your HTML differently.
If you look at the DOM in Chrome by right-clicking and choosing Inspect Element, you'll be able to deduce that, since your tags do not match up, it applied an algorithm to decide where you messed up. Technically, it does close the strong tag at the correct place. However, it decides that you were probably trying to make both pieces of text bold, so it puts the last -em in an entirely new, extra "strong" element while keeping the '+strong' in its own "strong" element. It looks to me like the Chrome team decided it is statistically likely that you want both things to be bold.