What is the difference between HTML and XHTML? - html

Note: this is supposed to be the canonical post for this question. A number of answers exist already, but descriptions of the various differences are scattered all over the place, and more often than not they also offer opinions on "which one should I use", which I will refrain from here.
If you have more questions to ask, or you know of more differences, feel free to edit.
What is the difference between XHTML and HTML? Isn't XHTML merely a more strict version of HTML? And why are there different versions of XHTML if they all act the same?

What is the difference between HTML and XHTML?
There are many differences. The main one is that XHTML is HTML in an XML document, and XML has different syntax rules:
XML has a different namespace by default, so you'll have to use the HTML namespace, xmlns="http://www.w3.org/1999/xhtml" explicitly in an XHTML document
XML is case sensitive and you'll have to use lowercase for tag names and attributes and even the x in hexadecimal character references
XML doesn't have optional start and end tags, so you'll have to write out all of them in full
Likewise, XML doesn't have void tags, so you'll have to close every void element yourself with a slash.
Non-void elements that have no content can be written as a single empty element tag in XML.
XML can contain CDATA sections, sections of plain text delimited with <![CDATA[ .. ]]>; HTML cannot
On the other hand, there are no CDATA or PCDATA elements or attributes in XML, so you'll have to escape your < signs everywhere (except in CDATA sections)
Quotes around attribute values are not optional in XML, and there is no attribute minimization (name-only attributes)
And the XML parser is not as forgiving of errors as the HTML parser.
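A few of the syntax rules above can be checked mechanically. The following is only a rough sketch: it uses Python's standard-library XML parser as a stand-in for a browser's XML parser, and the helper name is_well_formed_xml and the sample snippets are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Hypothetical snippets, each exercising one rule from the list above.
SAMPLES = {
    "void element without a slash": "<p>one<br>two</p>",
    "void element closed with a slash": "<p>one<br/>two</p>",
    "omitted end tags": "<ul><li>a<li>b</ul>",
    "empty non-void element as a single tag": "<div/>",
    "unquoted attribute value": "<a href=page.html>x</a>",
    "minimized attribute": "<input disabled/>",
    "attribute written out in full": '<input disabled="disabled"/>',
    "start and end tag differing in case": "<P>x</p>",
    "unescaped < in text": "<p>1 < 2</p>",
    "< inside a CDATA section": "<p><![CDATA[1 < 2]]></p>",
}

if __name__ == "__main__":
    for label, markup in SAMPLES.items():
        verdict = "well-formed" if is_well_formed_xml(markup) else "rejected"
        print(f"{label}: {verdict}")
```

An HTML parser would accept every one of these snippets; the XML parser stops at the first violation instead of recovering.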
Then there are a couple of not XML-related differences:
XHTML documents are always rendered in standards mode, never in quirks mode
XHTML does not look at meta commands in the head to determine the encoding. In fact, the W3C validator flags <meta http-equiv="content-type" ... as an error in XHTML5 files, but not in HTML5 files.
Earlier on, mismatches between the DTDs for XHTML 1.0 Strict and HTML 4.01 Strict led to validation issues. The definition for XHTML 1.0 was missing the name attribute on <img> and <form>. This was an error though, fixed in XHTML 1.1.
Note that XHTML documents should be served up with the correct file type, i.e. a .xhtml file extension or an application/xhtml+xml MIME type. You can't really have XHTML in an HTML document, because browsers don't differentiate between the two syntaxes by looking at the content, only by file type.
In other words, if you have an HTML file, its contents are HTML, no matter if it has valid XML in it or not.
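The rule that the parser is picked from the file type, not from the content, can be modelled roughly as follows. This is a sketch only: parse_like_a_browser and the XML_TYPES set are hypothetical names, and Python's standard-library parsers stand in for the two browser parsers.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: a simplified set of media types that select the XML parser.
XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def parse_like_a_browser(markup, content_type):
    """Pick a parser from the declared type only; never sniff the markup."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in XML_TYPES:
        ET.fromstring(markup)  # raises ET.ParseError if not well-formed
        return "xml"
    parser = HTMLParser()  # forgiving: does not raise on tag soup
    parser.feed(markup)
    parser.close()
    return "html"
```

Under this model, perfectly well-formed XHTML labelled text/html still goes through the HTML branch, while tag soup labelled application/xhtml+xml fails outright.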
One point about the syntax rules worth mentioning is the casing of tag names. Although HTML documents are case-insensitive, the tag names are actually exposed as uppercase by the DOM. That means that under HTML, a JavaScript command like console.log(document.body.tagName); would output "BODY", whereas the same command under XHTML would output "body".
Isn't XHTML merely a stricter version of HTML?
No; XML has different rules than HTML, but it's not necessarily stricter. If anything, XML has fewer rules!
In HTML, many features are optional. You can choose to put quotes around attribute values or not; in XML you don't have that choice. And in HTML, you have to remember when you have the choice and when you don't: are quotes optional in <a href=http://my-website.com/?login=true>? In XML, you don't have to think about that. XML is easier.
In HTML, some elements are defined as raw text elements, that is, elements that contain plain text rather than markup.
And some other elements are escapable raw text elements, in which references like &eacute; will be parsed, but things like <b>bold</b> and <!-- comment --> will be treated as plain text. If you can remember which elements those are, you don't have to escape < signs (you optionally can though). XML doesn't have that, so there's nothing to remember and all elements have the same content type.
XML has processing instructions. The best-known construct of this kind is the XML declaration in the prolog, <?xml version="1.0" encoding="windows-1252"?>. This tells the browser which version of XML is used (1.0 is the only version that works, by the way) and which character set.
And XML parses comments in a different way. For example, HTML comments can't start with <!--> (with a > as the first character inside); XHTML comments can.
Speaking of comments, with XHTML you can comment out blocks of code inside <script> and <style> elements using <!-- comment -->. Don't try that in HTML. (It's not recommended in XHTML either, because of compatibility issues, but you can.)
Why are there different versions of XHTML if they all act the same?
They don't! For instance, in XHTML 1.1 you can refer to character entities like &eacute; and &nbsp;, because those entities are defined in the DTD. The current version of XHTML (formerly known as XHTML5) does not have a DTD, so you will have to use numerical references, in this case &#233; and &#160; (or define those entities yourself in the DOCTYPE declaration; the X means eXtensible, after all).
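The three options (an undeclared named entity, a numerical reference, and an entity declared in the DOCTYPE) can be tried out with any DTD-less XML parser. Below is a minimal sketch using Python's xml.etree.ElementTree in place of a browser; the helper name parse_text and the sample strings are invented for the example.

```python
import xml.etree.ElementTree as ET


def parse_text(markup):
    """Parse the markup as XML and return the root element's text."""
    return ET.fromstring(markup).text


# Numerical references need no declarations at all.
NUMERIC = "<p>caf&#233;&#160;au lait</p>"

# Named entities work once they are declared in the internal subset.
DECLARED = (
    '<!DOCTYPE p [<!ENTITY eacute "&#233;"><!ENTITY nbsp "&#160;">]>'
    "<p>caf&eacute;&nbsp;au lait</p>"
)

# The same named entities without any declaration are an error in XML.
UNDECLARED = "<p>caf&eacute;&nbsp;au lait</p>"

if __name__ == "__main__":
    print(repr(parse_text(NUMERIC)))
    print(repr(parse_text(DECLARED)))
    try:
        parse_text(UNDECLARED)
    except ET.ParseError as err:
        print("rejected:", err)
```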

Related

HTML5 Doctype for Domparser

Task: I want to parse an XML document using DOMParser (https://developer.mozilla.org/en-US/docs/Web/API/DOMParser). I have no and need no formal DTD and parsing this as "text/xml" worked pretty well. Now I want to use certain symbolic entities, such as in my xml and the parser, of course, complains that they are not known. Since I want to be able to access, in principle, all existing html entities, I tried to use a doctype specification
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/html4/strict.dtd">
and this worked as expected, since DOMParser seems to have this doctype and the connected entity list preloaded. However, this doctype is outdated. So I tried the new <!DOCTYPE html>, but this did not work. This is also expected, as the new HTML5 doctype works differently from the older XML/SGML-based ones.
Question: Is there some standardized !DOCTYPE for HTML (5) which the browser recognizes and which contains the preloaded HTML entities? (I do not want to copy in a list of all entities as separate entity definitions; the browser has them somewhere, I just do not know how to activate them by an XML/SGML-style DTD for HTML5.)
If you want to continue using XML, but don't want to use the XHTML doctype, then you have to declare the character entities of XHTML via ENTITY declarations directly in your document (in the internal subset or an external declaration set) since only HTML has nbsp and many others as predefined entities (XML has only quot, amp, apos, lt, and gt). You can use the HTML5 entity set from https://www.w3.org/2003/entities/2007/htmlmathml-f.ent (which includes the large set of MathML entities), or the much smaller set of classic HTML4 entities.
But I would first check whether DOMParser actually processes markup declarations and/or external declaration sets with markup declarations. Try to parse the following
<?xml version="1.0"?>
<!DOCTYPE test [
<!ENTITY nbsp "&#160;">
]>
<test>&nbsp;</test>
and check the console for error messages.
There is no "official" DTD for HTML (in fact, no formal grammar at all), but there's my SGML DTD for W3C HTML 5.1 with much more information about parsing HTML5 than you probably are interested in, including info about HTML5's predefined entities.

What does exclamation point stand for in HTML in constructs like DOCTYPE and comments?

I am curious about the syntax of the doctype and comment tags...
Why the exclamation point? What is it called, what does it mean/do?
I have read through the HTML syntax spec and found no real explanation other than
Any case-insensitive match for the string <!DOCTYPE.
Cite: http://www.w3.org/TR/html-markup/syntax.html#doctype-syntax
In SGML, which is what HTML was nominally based on, up to and including HTML 4.01, the exclamation mark is part of the construct <!, which is the reference concrete syntax for mdo, markup declaration open. Markup declarations are not markup elements but, informally speaking, declarations relating to elements. This includes document type declaration, comment declarations, and entity declarations.
In XML, which is what XHTML is based on, there is no general concept like that. Instead, the character pair <! just appears in some constructs, with no uniform theory.
In HTML5, the HTML syntax has been defined very much in an ad hoc manner, and the doctype string is called just the doctype string – it has no role and no meaning beyond the expected effect of triggering “standards mode” (or “no-quirks mode”) in browsers. In the XHTML syntax, it has its XML meaning.
The ! is used for comments (<!-- -->) and to define the DOCTYPE (<!DOCTYPE ...>) of the HTML document. The DOCTYPE describes some characteristics of the document, such as the root of the XML/XHTML/HTML file (in HTML, usually <html>), a DTD, a public identifier and other subset declarations.
From the specification of HTML in the CERN docs:
MDO
Markup Declaration Open: "<!", when followed by a letter or "--" or "[", signals one of several SGML markup declarations. The only purpose it serves in HTML is to introduce comments.
Source:
http://info.cern.ch/hypertext/WWW/MarkUp/Connolly/Text.html
A <!DOCTYPE ...> is an SGML document type declaration. Its purpose is to tell an SGML parser what DTD it should use to parse the document.

Does HTML conform to the XML specification?

HTML and XML are syntactically very similar, so what I want to know is if valid HTML code will always conform to the XML specification.
No, it won't.
HTML 2 through 4.x were SGML applications, not XML applications. (HTML+ might also have been an SGML application, it isn't clear from a brief skim of the specification)
HTML 5 has its own parse rules.
(XHTML and the XML serialisation of HTML 5 will be XML though)
Does HTML conform to the XML specification?
No, it does not. HTML supports:
unclosed tags (e.g. <img> instead of <img />)
wrongly nested tags (e.g. <b><i>bla</b></i> instead of <b><i>bla</i></b>)
unquoted attributes (e.g. <a name=foo>...</a>)
content that is not properly encoded (e.g. <em>this & that</em> instead of <em>this &amp; that</em>)
tags that explicitly must contain unencoded content (i.e. <script>)
named entities (e.g. &copy; instead of &#169;)
The standard does not explicitly allow all of these notions, but all HTML parsers understand and support them.
None of them is legal in XML.
HTML is more lenient. For example,
<!DOCTYPE html>
<title>foo</title>
bar
is a valid HTML5 document, but it's obviously not valid XML, since XML requires a top-level element that encompasses the whole document.
However, you can use one of the XHTML languages, which are applications of XML with the same semantics as the corresponding HTML standards.
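To see the three-line HTML5 document above go through both kinds of parser, here is a small sketch with Python's standard library standing in for browser parsers. The class name EventCollector and the constant HTML5_DOC are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

HTML5_DOC = "<!DOCTYPE html>\n<title>foo</title>\nbar"


class EventCollector(HTMLParser):
    """Record the start tags and text that the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse_as_html(markup):
    collector = EventCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags, "".join(collector.text)


def parse_as_xml(markup):
    # Raises ET.ParseError: "bar" sits outside the single root element.
    return ET.fromstring(markup)
```

The HTML side reports a title element followed by loose text, while the XML side refuses the document because content appears after the root element has closed.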

Why does the .tagName DOM property return an uppercase value?

For example, if we have
<html>
<head>
<title>FooBar</title>
</head>
<body></body>
</html>
If we do document.getElementsByTagName("title")[0].tagName, then we will get TITLE (uppercase), even though the HTML standards recommend writing HTML tags in lowercase.
I know there is no relationship between both, but this still doesn't make sense.
Is there any reason that DOM should return tag names in uppercase?
Technically, this is mandated in DOM Level 1:
The HTML DOM returns the tagName of an HTML element in the canonical uppercase form, regardless of the case in the source HTML document.
The convention of uppercase tag names probably stems from legacy, when HTML was previously developed based on SGML, and element types were declared in uppercase. See this section of the HTML 4.01 spec discussing SGML, HTML and its syntax, as well as for example the HTML 4.01 Strict doctype definition. Any DOM implementations supporting HTML would follow suit.
Note that lowercase tag names are only explicitly required in XHTML (but not XML), and authors are generally recommended to write lowercase tags for easy porting between HTML/XHTML, as well as improving readability. However, this recommendation doesn't occur in the spec; all it says is that tag names are case-insensitive only in HTML as opposed to XHTML and XML.
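The case-insensitive versus case-sensitive distinction can be illustrated outside the browser as well. This sketch uses Python's standard-library parsers; note that Python's HTML parser happens to normalize names to lowercase rather than to the DOM's canonical uppercase, so it only demonstrates that case is folded, not which form the DOM exposes. The names TagCollector, html_tag_names and xml_root_tag are invented.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect start-tag names as the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tag_names(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def xml_root_tag(markup):
    # XML keeps the name exactly as written; <P> and <p> are different names.
    return ET.fromstring(markup).tag
```

For example, html_tag_names("<P>one</P><p>two</p>") yields the same name twice, whereas xml_root_tag("<P>one</P>") keeps the capital letter and "<P>one</p>" is rejected as a mismatched tag.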

DOCTYPE's role in general XML

I know the purpose of DOCTYPE (and what each url/identifier on the line is) as far as web standards and page validation goes, but I am unsure about what it actually "is" in the context of an XML document.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>My Page</title>
</head>
<body>
<p>Hello</p>
</body>
</html>
Is it part of the actual XML document structure, or is it some kind of comment-like "hint" that is noted then stripped?
What is the significance of the "!" before the name? Does this denote a special type of "element"? What are they called?
The example I posted is XHTML for the web, but is DOCTYPE also used in general purpose XML documents?
DOCTYPE has been "inherited" from SGML (it was supposed to point to a DTD file that explains how to parse the file); however, XML's self-explanatory syntax and namespaces made it largely irrelevant. The only real use for DOCTYPE/DTD in XML is to define allowed named entities (e.g. &nbsp;).
The XML spec even allows "non-validating" parsers that ignore the DTD file completely (web browsers use such parsers, unless you've fallen into the text/html trap, in which case the XML parser is not used at all).
A DTD is quite poor for the purpose of validation (hard to specify rules for more than one level of nesting, no way to specify types of attributes beyond a few predefined types). XML Schema and RelaxNG can be far more precise.
DTDs don't fully support namespaces either, which leads to ridiculous workarounds like the XHTMLplusMathMLplusSVG DOCTYPE.
In web browsers, certain DOCTYPEs have the desirable side effect of triggering standards-compliant rendering mode. This is more of a hack than an intended use of DOCTYPEs.
If you're using real XHTML (application/xhtml+xml, the one that doesn't open in IE at all), then don't use a DOCTYPE at all (that's the recommendation from XHTML 5). XML mode will trigger standards-compliant rendering regardless of DOCTYPE.
If you're using text/html mode, then use <!DOCTYPE html>. That's the HTML 5 DOCTYPE, and it's the shortest one that triggers the best possible rendering in all browsers. Browsers don't use the DOCTYPE for any other purpose, so you're not missing out on anything.
If you're processing XHTML files with XML parsers (outside browsers), then please don't forget to set up a DTD catalog properly; otherwise your parser may be DoS-ing w3.org by trying to fetch the DTD every time. If you can't use a DTD catalog, then disable "externals" in the parser, or omit the DOCTYPE and don't use named entities (i.e. use &#160; rather than &nbsp;).
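What "disable externals" looks like depends on the parser. As one possible sketch, Python's xml.sax exposes it through feature flags; recent Python versions already default external general entities to off, so the explicit calls below mostly document intent. The names TextCollector, parse_offline and SAMPLE are made up, and the sample reuses the XHTML 1.1 DOCTYPE from the question with a numerical reference instead of a named entity.

```python
import io
import xml.sax
from xml.sax.handler import (
    ContentHandler,
    feature_external_ges,
    feature_external_pes,
)

SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    b'"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b"<body><p>a&#160;b</p></body></html>"
)


class TextCollector(ContentHandler):
    """Accumulate all character data seen in the document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_offline(xml_bytes):
    """Parse without loading the external DTD or other external entities."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    handler = TextCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return "".join(handler.chunks)
```

With these settings the DOCTYPE line is read but the DTD it points to is never requested, so the document has to get by with numerical references.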
DOCTYPE is part of the XML specification (see the relevant subsection here) and can include either a link to a DTD, "internal" DTD declarations, or both. Many "modern" uses of XML don't use a DOCTYPE at all, though - as porneL mentions, both XML Schema and RelaxNG are more powerful ways to specify a document's syntax. See this Tim Bray blog post for a bit more background.