DOCTYPE's role in general XML - html

I know the purpose of DOCTYPE (and what each url/identifier on the line is) as far as web standards and page validation goes, but I am unsure about what it actually "is" in the context of an XML document.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>My Page</title>
</head>
<body>
<p>Hello</p>
</body>
</html>
Is it part of the actual XML document structure, or is it some kind of comment-like "hint" that is noted then stripped?
What is the significance of the "!" before the name? Does this denote a special type of "element"? What are they called?
The example I posted is XHTML for the web, but is DOCTYPE also used in general purpose XML documents?

DOCTYPE was "inherited" from SGML, where it was supposed to point to a DTD file that explains how to parse the document. XML's self-describing syntax and namespaces have made it largely irrelevant. The only real use for a DOCTYPE/DTD in XML is to define the allowed named entities (e.g. &nbsp;).
The XML spec even allows "non-validating" parsers that ignore the DTD file completely (web browsers use such parsers, unless you've fallen into the text/html trap, in which case an XML parser is not used at all).
A DTD is quite poor for validation (it is hard to specify rules for more than one level of nesting, and there is no way to specify attribute types beyond a few predefined ones). XML Schema and RELAX NG can be far more precise.
DTDs don't fully support namespaces either, which leads to ridiculous workarounds like the XHTMLplusMathMLplusSVG DOCTYPE.
In web browsers, certain DOCTYPEs have the desirable side effect of triggering standards-compliant rendering mode. This is more of a hack than an intended use of DOCTYPEs.
If you're using real XHTML (application/xhtml+xml, the one that doesn't open in IE at all), then don't use a DOCTYPE at all (that's the recommendation from XHTML 5). XML mode will trigger standards-compliant rendering regardless of DOCTYPE.
If you're using text/html mode, then use <!DOCTYPE html>. That's the HTML 5 DOCTYPE, and it's the shortest one that triggers the best possible rendering in all browsers. Browsers don't use the DOCTYPE for any other purpose, so you're not missing out on anything.
If you're processing XHTML files with XML parsers (outside browsers), then please don't forget to set up a DTD catalog properly, otherwise your parser may be DoS-ing w3.org by trying to fetch the DTD every time. If you can't use a DTD catalog, then disable "externals" in the parser, or omit the DOCTYPE and don't use named entities (i.e. use &#160; rather than &nbsp;).
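To make the catalog idea concrete, here is a minimal sketch in Python using the standard-library expat binding (xml.parsers.expat). The LOCAL_DTDS table, the public identifier and the example.invalid URL are made up for illustration; a real catalog would map the W3C identifiers to DTD files on disk. The only point is that the application, not the parser, decides where an external DTD comes from, so nothing has to go over the network:

```python
from xml.parsers import expat

# Hypothetical catalog: public identifier -> locally stored DTD text.
LOCAL_DTDS = {
    "-//EXAMPLE//DTD Demo//EN": '<!ENTITY nbsp "&#160;">',
}

def parse_with_catalog(xml_text):
    """Parse xml_text, resolving its external DTD from LOCAL_DTDS only."""
    text_chunks = []
    requested = []
    parser = expat.ParserCreate()
    # Ask expat to report the external DTD subset instead of ignoring it.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def resolve(context, base, system_id, public_id):
        requested.append((public_id, system_id))
        dtd_text = LOCAL_DTDS.get(public_id)
        if dtd_text is not None:
            # Feed the local copy to a sub-parser; no network access happens.
            sub = parser.ExternalEntityParserCreate(context)
            sub.Parse(dtd_text, True)
        return 1  # non-zero tells expat to carry on

    parser.ExternalEntityRefHandler = resolve
    parser.CharacterDataHandler = text_chunks.append
    parser.Parse(xml_text, True)
    return "".join(text_chunks), requested

doc = (
    '<!DOCTYPE demo PUBLIC "-//EXAMPLE//DTD Demo//EN" '
    '"http://example.invalid/demo.dtd">'
    "<demo>a&nbsp;b</demo>"
)
text, requested = parse_with_catalog(doc)
```

Here text should contain a real no-break space between "a" and "b", and requested records the single lookup the parser asked for. Other parsers expose the same hook under different names (entity resolvers, catalog files), so treat this as an illustration of the mechanism rather than a recipe.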

DOCTYPE is part of the XML specification (see the relevant subsection here) and can include either a link to a DTD, "internal" DTD declarations, or both. Many "modern" uses of XML don't use a DOCTYPE at all, though - as porneL mentions, both XML Schema and RelaxNG are more powerful ways to specify a document's syntax. See this Tim Bray blog post for a bit more background.
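As for whether it is "part of the document structure": in DOM terms it is a node of its own, sitting next to the root element. A small sketch with Python's standard-library minidom (chosen here only because it is easy to run; it does not fetch the DTD), using the XHTML sample from the question:

```python
from xml.dom import minidom, Node

XHTML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">'
    "<head><title>My Page</title></head>"
    "<body><p>Hello</p></body></html>"
)

doc = minidom.parseString(XHTML)
doctype = doc.doctype

# The DOCTYPE is kept as a DocumentType node, not stripped like a hint.
is_doctype_node = doctype.nodeType == Node.DOCUMENT_TYPE_NODE

# Root element name, public identifier, system identifier (the DTD URL).
summary = (doctype.name, doctype.publicId, doctype.systemId)
```

The three parts of summary correspond to the three things on the DOCTYPE line: the expected root element, the public identifier and the system identifier.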

Related

What is the difference between HTML and XHTML?

Note: this is supposed to be the canonical post for this question. A number of answers exist already, but descriptions of the various differences are scattered all over the place, and more often than not, they also offer opinions to "which one should I use", which I will refrain from in here.
If you have more questions to ask, or you know of more differences, feel free to edit.
What is the difference between XHTML and HTML? Isn't XHTML merely a more strict version of HTML? And why are there different versions of XHTML if they all act the same?
What is the difference between HTML and XHTML?
There are many differences. The main one is that XHTML is HTML in an XML document, and XML has different syntax rules:
XML has a different namespace by default, so you'll have to use the HTML namespace, xmlns="http://www.w3.org/1999/xhtml" explicitly in an XHTML document
XML is case sensitive and you'll have to use lowercase for tag names and attributes and even the x in hexadecimal character references
XML doesn't have optional start and end tags, so you'll have to write out all of them in full
Likewise, XML doesn't have void tags, so you'll have to close every void element yourself with a slash.
Non-void elements that have no content can be written as a single empty element tag in XML.
XML can contain CDATA sections, sections of plain text delimited with <![CDATA[ .. ]]>; HTML cannot
On the other hand, there are no CDATA or PCDATA elements or attributes in XML, so you'll have to escape your < signs everywhere (except in CDATA sections)
Quotes around attribute values are not optional in XML, and there is no attribute minimization (name-only attributes)
And the XML parser is not as forgiving of errors as the HTML parser.
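A rough way to see several of these points at once, sketched with Python's standard library. Note the assumptions: html.parser is only a lenient tokenizer, not the parsing algorithm browsers implement, and the SLOPPY snippet is invented for the demo. It has an uppercase tag, an unquoted attribute, a void element without a slash and no end tags:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class TagCollector(HTMLParser):
    """Record every start tag the lenient HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))

SLOPPY = "<P class=intro>one<br><p>two"

# The HTML side accepts the markup and normalizes tag names to lowercase.
collector = TagCollector()
collector.feed(SLOPPY)
collector.close()
html_tags = collector.tags

# The XML side refuses the same input outright.
try:
    ET.fromstring(SLOPPY)
    xml_error = None
except ET.ParseError as exc:
    xml_error = exc
```

The HTML tokenizer reports three start tags; the XML parser stops at the first problem it meets (the unquoted attribute value) and raises an error instead of guessing.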
Then there are a couple of differences that are not related to XML:
XHTML documents are always rendered in standards mode, never in quirks mode
XHTML does not look at meta commands in the head to determine the encoding. In fact, the W3C validator flags <meta http-equiv="content-type" ... as an error in XHTML5 files, but not in HTML5 files.
Earlier on, mismatches between the DTDs for XHTML 1.0 Strict and HTML 4.01 Strict led to validation issues. The definition for XHTML 1.0 was missing the name attribute on <img> and <form>. This was an error though, fixed in XHTML 1.1.
Note that XHTML documents should be served up with the correct file type, i.e. a .xhtml file extension or an application/xhtml+xml MIME type. You can't really have XHTML in an HTML document, because browsers don't differentiate between the two syntaxes by looking at the content, only by file type.
In other words, if you have an HTML file, its contents are HTML, no matter if it has valid XML in it or not.
One point about the syntax rules worth mentioning is the casing of tag names. Although HTML documents are case-insensitive, the tag names are actually exposed as uppercase by the DOM. That means that under HTML, a JavaScript command like console.log(document.body.tagName); would output "BODY", whereas the same command under XHTML would output "body".
Isn't XHTML merely a stricter version of HTML?
No; XML has different rules than HTML, but it's not necessarily stricter. If anything, XML has fewer rules!
In HTML, many features are optional. You can choose to put quotes around attribute values or not; in XML you don't have that choice. And in HTML, you have to remember when you have the choice and when you don't: are quotes optional in <a href=http://my-website.com/?login=true>? In XML, you don't have to think about that. XML is easier.
In HTML, some elements are defined as raw text elements, that is, elements that contain plain text rather than markup.
And some other elements are escapable raw text elements, in which references like &eacute; will be parsed, but things like <b>bold</b> and <!-- comment --> will be treated as plain text. If you can remember which elements those are, you don't have to escape < signs (you optionally can though). XML doesn't have that, so there's nothing to remember and all elements have the same content type.
XML has processing instructions, the best known of which is the XML declaration in the prolog, <?xml version="1.0" encoding="windows-1252"?>. This tells the browser which version of XML is used (1.0 is the only version that works, by the way) and which character set.
And XML parses comments in a different way. For example, HTML comments can't start with <!--> (with a > as the first character inside); XHTML comments can.
Speaking of comments, with XHTML you can comment out blocks of code inside <script> and <style> elements using <!-- comment -->. Don't try that in HTML. (It's not recommended in XHTML either, because of compatibility issues, but you can.)
Why are there different versions of XHTML if they all act the same?
They don't! For instance, in XHTML 1.1 you can refer to character entities like &eacute; and &nbsp;, because those entities are defined in the DTD. The current version of XHTML (formerly known as XHTML5) does not have a DTD, so you will have to use numerical references, in this case &#233; and &#160; (or define those entities yourself in the DOCTYPE declaration; the X means eXtensible, after all).

HTML5 Doctype for Domparser

Task: I want to parse an XML document using DOMParser (https://developer.mozilla.org/en-US/docs/Web/API/DOMParser). I neither have nor need a formal DTD, and parsing this as "text/xml" worked pretty well. Now I want to use certain symbolic entities, such as &nbsp;, in my XML, and the parser, of course, complains that they are not known. Since I want to be able to access, in principle, all existing HTML entities, I tried to use a doctype specification
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/html4/strict.dtd">
and this worked as expected, since DOMParser seems to have this doctype and the connected entity list preloaded. However, this doctype is outdated. So I tried the new <!DOCTYPE html>, but this did not work. This is also expected, as the new HTML5 doctype works differently from the older XML/SGML-based ones.
Question: Is there some standardized !DOCTYPE for HTML (5) which the browser recognizes and which contains the preloaded HTML entities? (I do not want to copy in a list of all entities as separate entity definitions. The browser has them somewhere; I just do not know how to activate them with an XML/SGML-style DTD for HTML5.)
If you want to continue using XML, but don't want to use the XHTML doctype, then you have to declare the character entities of XHTML via ENTITY declarations directly in your document (in the internal subset or an external declaration set) since only HTML has nbsp and many others as predefined entities (XML has only quot, amp, apos, lt, and gt). You can use the HTML5 entity set from https://www.w3.org/2003/entities/2007/htmlmathml-f.ent (which includes the large set of MathML entities), or the much smaller set of classic HTML4 entities.
But I would first check whether DOMParser actually processes markup declarations and/or external declaration sets with markup declarations. Try to parse the following
<?xml version="1.0"?>
<!DOCTYPE test [
<!ENTITY nbsp "&#160;">
]>
<test>
&nbsp;
</test>
and check the console for error messages.
There is no "official" DTD for HTML (in fact, no formal grammar at all), but there's my SGML DTD for W3C HTML 5.1 with much more information about parsing HTML5 than you probably are interested in, including info about HTML5's predefined entities.
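If typing out the ENTITY declarations by hand is the sticking point, they can be generated. A sketch in Python, assuming the classic HTML 4 entity table that ships with the standard library (html.entities.name2codepoint) is enough; the helper names below are invented for this example, and the same generated subset could just as well be pasted into a document that is later handed to DOMParser:

```python
from html.entities import name2codepoint
import xml.etree.ElementTree as ET

# XML already predefines these five, so they are left out of the subset.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}

def entity_subset():
    """Build ENTITY declarations for the HTML 4 named entities."""
    return "\n".join(
        f'<!ENTITY {name} "&#{codepoint};">'
        for name, codepoint in sorted(name2codepoint.items())
        if name not in XML_PREDEFINED
    )

def parse_with_html_entities(root_name, body):
    """Prepend a DOCTYPE with an internal subset, then parse as plain XML."""
    doc = f"<!DOCTYPE {root_name} [\n{entity_subset()}\n]>\n{body}"
    return ET.fromstring(doc)

root = parse_with_html_entities("test", "<test>caf&eacute;&nbsp;&copy;</test>")
```

After this, root.text holds the expanded characters rather than the entity names. The much larger HTML5/MathML set would need a different source table, which is not attempted here.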

What's the purpose of HTML doctype's .dtd?

When I look at the XHTML doctype, there's a .dtd file.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
What's the purpose of it?
Do browsers actually access it and use it to parse HTML?
If so, what happens when w3.org goes down?
The document type definition is basically there to tell the browser what version of HTML is being used. It dates back to SGML (Standard Generalised Markup Language). SGML was basically used to explain to the browser how to understand the structure of a document (such as an HTML page). Interestingly enough, XML is a restricted subset of SGML with many (exotic) features turned off.
Browsers do use it to parse the document but they don't need to nuke the W3C servers with a request every time a document is fetched. Instead they use a cached local copy.
When W3C.org goes down, they continue to use the cached copy. Unless you specify another URL of course...
One more thing to note with regard to the DOCTYPE declaration is that the DTD reference is gone in HTML5, because HTML5 is no longer based on SGML. HTML5 uses just <!DOCTYPE html>.
Browsers do not actually read that file from w3.org.
Instead, they have a list of known DTD URIs, and they know how to handle each one. (probably using a copy of the DTD file embedded in the browser)

Uppercase or lowercase doctype?

When writing the HTML5 doctype what is the correct method?
<!DOCTYPE html>
or
<!doctype html>
In HTML, the DOCTYPE is case insensitive. The following DOCTYPEs are all valid:
<!doctype html>
<!DOCTYPE html>
<!DOCTYPE HTML>
<!DoCtYpE hTmL>
In XML serializations (i.e. XHTML) the DOCTYPE is not required, but if you use it, DOCTYPE should be uppercase:
<!DOCTYPE html>
See The XML serialization of HTML5, aka ‘XHTML5’:
Note that if you don’t uppercase DOCTYPE in an XHTML document, the XML parser will return a syntax error.
The second part can be written in lowercase (html), uppercase (HTML) or even mixed case (hTmL) — it will still work. However, to conform to the Polyglot Markup Guidelines for HTML-Compatible XHTML Documents, it should be written in lowercase.
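This is easy to check with any XML parser. A quick sketch using Python's xml.etree.ElementTree (a non-validating parser, so the root element name is not compared against the DOCTYPE name); the is_well_formed helper is just a name made up for this demo:

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Same tiny document, four spellings of the doctype.
results = {
    doctype: is_well_formed(f"{doctype}<html/>")
    for doctype in (
        "<!DOCTYPE html>",
        "<!doctype html>",
        "<!DoCtYpE html>",
        "<!DOCTYPE HTML>",
    )
}
```

Only the spellings with an uppercase DOCTYPE keyword get through; the case of the name after it makes no difference to well-formedness.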
If anyone is still wondering in 2014, please consult this:
HTML5
W3 HTML5 Spec - Doctype
A DOCTYPE must consist of the following components, in this order:
A string that is an ASCII case-insensitive match for the string "<!DOCTYPE".
...
Note: Despite being displayed in all caps, the spec states it is insensitive.
XHTML5
W3 HTML5 - XHTML
This specification does not define any syntax-level requirements beyond those defined for XML proper.
XML documents may contain a DOCTYPE if desired, but this is not required to conform to this specification. This specification does not define a public or system identifier, nor provide a formal DTD.
Looking at the XML spec, it lists DOCTYPE in caps, but I can't find anything that states that 'all caps' is required (for comparison, in the HTML5 spec listed above, it is displayed in the example in all caps, but the spec explicitly states that it is case-insensitive).
Polyglot Markup
W3 Polyglot Markup - Intro
It is sometimes valuable to be able to serve HTML5 documents that are also well formed XML documents.
W3 Polyglot Markup - Doctype
Polyglot markup uses a document type declaration (DOCTYPE) specified by section 8.1.1 of [HTML5]. In addition, the DOCTYPE conforms to the following rules:
The string DOCTYPE is in uppercase letters.
...
So, note that Polyglot Markup uses a regular HTML5 doctype, but with additions/changes. For our discussion, most notably that DOCTYPE is declared in all caps.
Summary
View the W3's HTML vs. XHTML section.
[Opinion] I wouldn't worry too much about satisfying XML compliance unless you are specifically trying to make considerations for it. For most client and JS-based server development, JSON has replaced XML.
Therefore, I can only see this really applying if you are trying to update an existing, XHTML/XML-based legacy system to co-exist with new, HTML5 functionality. If this is the case then look into the polyglot markup spec.
According to the latest spec, you should use something that is a case-insensitive match for <!DOCTYPE html>. So while browsers are required to support whatever case you prefer, it's reasonable to infer from this that <!DOCTYPE html> is the canonical case.
Either upper or lower case is "correct". However if you use web fonts and care about IE7, I'd recommend using <!DOCTYPE html> because of a bug in IE7 where web fonts sometimes fail if using <!doctype html> (e.g. in this answer).
This is why I always upper-case the doctype.
The standard for HTML5 is that tags are case insensitive.
http://www.w3schools.com/html5/tag_doctype.asp
More Technically: (http://www.w3.org/TR/html5/syntax.html)
A DOCTYPE must consist of the following components, in this order:
A string that is an ASCII case-insensitive match for the string <!DOCTYPE.
The question sort of implies there's only one correct answer, supplies a multiple choice of two, and asks us to pick one. I would suggest that for HTML5 both <!DOCTYPE html> and <!doctype html> are valid.
So a HTML5-capable browser would accept the lowercase one and process the html properly.
Browsers that predate HTML5 and know nothing about it will, I've heard, attempt to process the HTML as best they can even without a doctype. If they don't recognize the lowercase doctype, they will do the same. So there's no point in making it uppercase, since those browsers won't be able to fully implement any HTML5 declarations anyway.
The doctype declaration is case insensitive: any string that is an ASCII case-insensitive match for <!DOCTYPE is accepted (HTML5 standard).

Do I need to declare XML on a page using the XHTML doctype?

I've been seeing some conflicting information that an XHTML document must also declare itself as XML.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
However, in other places I'm seeing (including w3.org) that the DOCTYPE must be the first tag declaration.
Since W3 says it, it must be true. However, I probably have some pages/apps lying about that are following the first method. What are my risks?
Edit: I just ran a page through the W3 Validator with and without the XML declaration and it passed both ways. At this point, then, I'm guessing it's just a "style" thing.
<?xml version="1.0" encoding="utf-8"?>
...is the default version and encoding for XML, so you don't need it at all. If you are serving XHTML as text/html, it probably shouldn't be there at all.
However, in other places I'm seeing (including w3.org) that the DOCTYPE must be the first tag.
Sounds like some confusion... DOCTYPE isn't a tag and neither is <?xml?> (which is called the XML declaration, and looks like a Processing Instruction, but it isn't one of those, either).
If you are including both, the XML declaration must come first. The trick is that IE6's DOCTYPE sniffer only detects Standards Mode DOCTYPEs if they're the first thing on the page, which means you can't use an XML declaration and you must stick with XML 1.0 and UTF-8 encoding (which is no great loss).
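The ordering rule can be checked with any XML parser. A small sketch with Python's xml.etree.ElementTree, which does not fetch the DTD; the parses helper and the tiny page body are made up for the demo:

```python
import xml.etree.ElementTree as ET

XML_DECL = b'<?xml version="1.0" encoding="utf-8"?>\n'
DOCTYPE = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
)
BODY = (
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b"<head><title>t</title></head><body><p>Hello</p></body></html>"
)

def parses(data):
    """Return True if the bytes are accepted as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

declaration_first = parses(XML_DECL + DOCTYPE + BODY)  # declaration, then DOCTYPE
doctype_first = parses(DOCTYPE + XML_DECL + BODY)      # DOCTYPE, then declaration
no_declaration = parses(DOCTYPE + BODY)                # DOCTYPE only
```

The declaration-first and DOCTYPE-only variants are accepted; putting anything before the XML declaration is a well-formedness error. This says nothing about how a text/html parser treats the same bytes.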
From the XHTML 1.1 specification:
An XML declaration like the one above is not required in all XML documents. XHTML document authors SHOULD use XML declarations in all their documents. XHTML document authors MUST use an XML declaration when the character encoding of the document is other than the default UTF-8 or UTF-16 and no encoding is specified by a higher-level protocol.
http://www.w3.org/TR/xhtml11/conformance.html
http://validator.w3.org/ only accepts the <?xml> stuff before <!DOCTYPE>. The other way around (doctype before ?xml) won't get validated.
I've never included it (always gone with just the doctype), and w3c says my XHTML 1.0 Strict projects are "valid."