Task: I want to parse an XML document using DOMParser (https://developer.mozilla.org/en-US/docs/Web/API/DOMParser). I have no formal DTD and need none, and parsing the document as "text/xml" worked pretty well. Now I want to use certain symbolic entities, such as &nbsp;, in my XML, and the parser, of course, complains that they are not known. Since I want to be able to access, in principle, all existing HTML entities, I tried to use a doctype specification
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/html4/strict.dtd">
and this worked as expected, since DOMParser seems to have this doctype and the connected entity list preloaded. However, this doctype is outdated. So I tried the new <!DOCTYPE html>, but this did not work. This is also expected, as the new HTML5 doctype works differently from the older XML/SGML-based ones.
Question: Is there some standardized !DOCTYPE for HTML (5) which the browser recognizes and which contains the preloaded HTML entities? (I do not want to copy in a list of all entities as separate entity definitions; the browser has them somewhere, I just do not know how to activate them with an XML/SGML-style DTD for HTML5.)
If you want to continue using XML, but don't want to use the XHTML doctype, then you have to declare the character entities of XHTML via ENTITY declarations directly in your document (in the internal subset or an external declaration set) since only HTML has nbsp and many others as predefined entities (XML has only quot, amp, apos, lt, and gt). You can use the HTML5 entity set from https://www.w3.org/2003/entities/2007/htmlmathml-f.ent (which includes the large set of MathML entities), or the much smaller set of classic HTML4 entities.
But I would first check whether DOMParser actually processes markup declarations and/or external declaration sets with markup declarations. Try to parse the following
<?xml version="1.0"?>
<!DOCTYPE test [
<!ENTITY nbsp "&#160;">
]>
<test>&nbsp;</test>
and check the console for error messages.
There is no "official" DTD for HTML (in fact, no formal grammar at all), but there's my SGML DTD for W3C HTML 5.1 with much more information about parsing HTML5 than you probably are interested in, including info about HTML5's predefined entities.
Related
Note: this is supposed to be the canonical post for this question. A number of answers exist already, but descriptions of the various differences are scattered all over the place, and more often than not, they also offer opinions to "which one should I use", which I will refrain from in here.
If you have more questions to ask, or you know of more differences, feel free to edit.
What is the difference between XHTML and HTML? Isn't XHTML merely a more strict version of HTML? And why are there different versions of XHTML if they all act the same?
What is the difference between HTML and XHTML?
There are many differences. The main one is that XHTML is HTML in an XML document, and XML has different syntax rules:
XML has a different namespace by default, so you'll have to use the HTML namespace, xmlns="http://www.w3.org/1999/xhtml" explicitly in an XHTML document
XML is case sensitive and you'll have to use lowercase for tag names and attributes and even the x in hexadecimal character references
XML doesn't have optional start and end tags, so you'll have to write out all of them in full
Likewise, XML doesn't have void tags, so you'll have to close every void element yourself with a slash.
Non-void elements that have no content can be written as a single empty element tag in XML.
XML can contain CDATA sections, sections of plain text delimited with <![CDATA[ .. ]]>; HTML cannot
On the other hand, there are no CDATA or PCDATA elements or attributes in XML, so you'll have to escape your < signs everywhere (except in CDATA sections)
Quotes around attribute values are not optional in XML, and there is no attribute minimization (name-only attributes)
And the XML parser is not as forgiving of errors as the HTML parser.
Then there are a couple of not XML-related differences:
XHTML documents are always rendered in standards mode, never in quirks mode
XHTML does not look at meta commands in the head to determine the encoding. In fact, the W3C validator flags <meta http-equiv="content-type" ... as an error in XHTML5 files, but not in HTML5 files.
Earlier on, mismatches between the DTDs for XHTML 1.0 Strict and HTML 4.01 Strict led to validation issues. The definition for XHTML 1.0 was missing the name attribute on <img> and <form>. This was an error though, fixed in XHTML 1.1.
Note that XHTML documents should be served up with the correct file type, i.e. a .xhtml file extension or an application/xhtml+xml MIME type. You can't really have XHTML in an HTML document, because browsers don't differentiate between the two syntaxes by looking at the content, only by file type.
In other words, if you have an HTML file, its contents are HTML, no matter if it has valid XML in it or not.
One point about the syntax rules worth mentioning is the casing of tag names. Although HTML documents are case-insensitive, the tag names are actually exposed as uppercase by the DOM. That means that under HTML, a JavaScript command like console.log(document.body.tagName); would output "BODY", whereas the same command under XHTML would output "body".
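The casing difference can be observed without serving two files, by asking DOMParser to parse similar markup in both modes. This is a sketch that assumes a browser; `isElementNamed` is a hypothetical helper showing one way to write name comparisons that behave the same in either mode. It relies on `localName`, which is lowercase for HTML elements in both kinds of document.

```javascript
// Hypothetical helper: compare element names independently of HTML/XHTML mode.
function isElementNamed(element, name) {
  return element.localName === name.toLowerCase();
}

// DOMParser only exists in browsers, so guard the demonstration.
if (typeof DOMParser !== "undefined") {
  const parser = new DOMParser();
  const asHtml = parser.parseFromString(
    "<!DOCTYPE html><title>t</title><p>x</p>",
    "text/html"
  );
  const asXhtml = parser.parseFromString(
    '<html xmlns="http://www.w3.org/1999/xhtml">' +
      "<head><title>t</title></head><body><p>x</p></body></html>",
    "application/xhtml+xml"
  );
  const htmlBody = asHtml.querySelector("body");
  const xhtmlBody = asXhtml.querySelector("body");
  console.log(htmlBody.tagName);  // uppercased by the DOM in an HTML document
  console.log(xhtmlBody.tagName); // kept as written in the XHTML source
  console.log(isElementNamed(htmlBody, "body"), isElementNamed(xhtmlBody, "BODY"));
}
```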
Isn't XHTML merely a stricter version of HTML?
No; XML has different rules than HTML, but it's not necessarily stricter. If anything, XML has fewer rules!
In HTML, many features are optional. You can choose to put quotes around attribute values or not; in XML you don't have that choice. And in HTML, you have to remember when you have the choice and when you don't: are quotes optional in <a href=http://my-website.com/?login=true>? In XML, you don't have to think about that. XML is easier.
In HTML, some elements are defined as raw text elements, that is, elements that contain plain text rather than markup.
And some other elements are escapable raw text elements, in which references like &eacute; will be parsed, but things like <b>bold</b> and <!-- comment --> will be treated as plain text. If you can remember which elements those are, you don't have to escape < signs (you optionally can though). XML doesn't have that, so there's nothing to remember and all elements have the same content type.
XML has processing instructions. The best-known construct with that syntax is the XML declaration in the prolog, <?xml version="1.0" encoding="windows-1252"?> (strictly speaking, the declaration is not itself a processing instruction). This tells the browser which version of XML is used (1.0 is the only version that works, by the way) and which character set.
And XML parses comments in a different way. For example, HTML comments can't start with <!--> (with a > as the first character inside); XHTML comments can.
Speaking of comments, with XHTML you can comment out blocks of code inside <script> and <style> elements using <!-- comment -->. Don't try that in HTML. (It's not recommended in XHTML either, because of compatibility issues, but you can.)
Why are there different versions of XHTML if they all act the same?
They don't! For instance, in XHTML 1.1 you can refer to character entities like &eacute; and &nbsp;, because those entities are defined in the DTD. The current version of XHTML (formerly known as XHTML5) does not have a DTD, so you will have to use numerical references, in this case &#233; and &#160; (or define those entities yourself in the DOCTYPE declaration; the X means eXtensible, after all).
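Converting named references to numeric ones can be automated. The sketch below is one possible approach, assuming a small hand-written lookup table; `ENTITY_CODE_POINTS`, `XML_PREDEFINED` and `toNumericReferences` are illustrative names, and the table would need to be extended for real content. The five entities predefined by XML are left alone, as are names the table does not know. Note that the sketch does not skip CDATA sections or comments.

```javascript
// Illustrative subset of HTML entity names mapped to code points (extend as needed).
const ENTITY_CODE_POINTS = { nbsp: 0xa0, eacute: 0xe9, copy: 0xa9 };

// Entities every XML parser already knows; these must stay as they are.
const XML_PREDEFINED = new Set(["amp", "lt", "gt", "quot", "apos"]);

// Replace &name; with &#xHEX; when the name is found in the table.
function toNumericReferences(source, table = ENTITY_CODE_POINTS) {
  return source.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
    const known = Object.prototype.hasOwnProperty.call(table, name);
    if (XML_PREDEFINED.has(name) || !known) {
      return match; // leave predefined and unknown references untouched
    }
    return "&#x" + table[name].toString(16).toUpperCase() + ";";
  });
}

console.log(toNumericReferences("<p>caf&eacute;&nbsp;&amp; bar</p>"));
```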
tl;dr: How do you include a file in the DOCTYPE declaration of an XHTML file? The solution from this answer doesn't work, giving different types of errors in the different browsers.
Longer version:
I am interested in defining character entities for XHTML files. One disadvantage of XHTML5 is that you can't use entity references such as &nbsp; and &eacute;, only the five basic XML ones.
Reverting to the XHTML 1 DOCTYPE would work (since the XHTML DTD contains definitions for all the entities) but then the validator complains about all the new XHTML5 features in the file.
Reverting the file type to text/html would also work, but then it wouldn't be XHTML any more! (Defining new entities doesn't work in HTML, only in XHTML.)
So the solution is to actively add all the entity names to the DOCTYPE declaration.
<!DOCTYPE html
[
<!ENTITY nbsp "&#160;">
<!ENTITY eacute "é">
]
>
This works great! But including hundreds of entities this way would not be very efficient. Doing it by way of an include file would be much better.
I found this answer which seems to do exactly what I need...
<!ENTITY % xhtml-special SYSTEM "xhtml-special.ent">
%xhtml-special;
but it doesn't work. If I try that, I get errors: PEReference: %xhtml-special; not found in Chromium; XML Parsing Error: undefined entity in SeaMonkey.
So does anybody know how it should be done? With a working example?
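For the inline approach that does work, the declarations do not have to be typed by hand. The sketch below generates the internal subset from a table, which could run as a build step or just before a string is handed to DOMParser. It is only an illustration: `ENTITIES`, `buildDoctype` and `withEntities` are made-up names, and the two-entry table stands in for a full list.

```javascript
// Illustrative table of entity names and code points (an assumption for this example).
const ENTITIES = { nbsp: 0xa0, eacute: 0xe9 };

// Build a DOCTYPE whose internal subset has one <!ENTITY> declaration per table entry.
function buildDoctype(rootName, entities) {
  const declarations = Object.entries(entities).map(
    ([name, codePoint]) => `  <!ENTITY ${name} "&#${codePoint};">`
  );
  return `<!DOCTYPE ${rootName} [\n${declarations.join("\n")}\n]>`;
}

// Prepend the generated DOCTYPE to a document that has none of its own.
function withEntities(xmlBody, rootName = "html", entities = ENTITIES) {
  return buildDoctype(rootName, entities) + "\n" + xmlBody;
}

const source = withEntities(
  '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>' +
    "<body><p>caf&eacute;&nbsp;bar</p></body></html>"
);

// DOMParser only exists in browsers, so guard the call.
if (typeof DOMParser !== "undefined") {
  const doc = new DOMParser().parseFromString(source, "application/xhtml+xml");
  console.log(doc.getElementsByTagName("parsererror").length === 0);
}
```

This keeps everything in the internal subset, so it does not depend on the parser loading an external file.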
When I look at the XHTML doctype, there's a .dtd file.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
What's the purpose of it?
Do browsers actually access it and use it to parse HTML?
If so, what happens when w3.org goes down?
The document type definition is basically there to tell the browser what version of HTML is being used. It dates back to SGML (Standard Generalised Markup Language). SGML was basically used to explain to the browser how to understand the structure of a document (such as an HTML page). Interestingly enough, XML is a restricted subset of SGML with many (exotic) features turned off.
Browsers do use it to parse the document but they don't need to nuke the W3C servers with a request every time a document is fetched. Instead they use a cached local copy.
When W3C.org goes down, they continue to use the cached copy. Unless you specify another URL of course...
One more thing to note with regard to the DOCTYPE declaration is that the DTD reference is gone in HTML5, because HTML5 is no longer based on SGML. HTML5 uses just <!DOCTYPE html>.
Browsers do not actually read that file from w3.org.
Instead, they have a list of known DTD URIs, and they know how to handle each one. (probably using a copy of the DTD file embedded in the browser)
I've been seeing some conflicting information that an XHTML document must also declare itself as XML.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
However, in other places I'm seeing (including w3.org) that the DOCTYPE must be the first tag declaration.
Since W3 says it, it must be true. However, I probably have some pages/apps lying about that are following the first method. What are my risks?
Edit: I just ran a page through the W3 Validator with and without the XML declaration and it passed both ways. At this point, then, I'm guessing it's just a "style" thing.
<?xml version="1.0" encoding="utf-8"?>
...is the default version and encoding for XML, so you don't need it at all. If you are serving XHTML as text/html, it probably shouldn't be there at all.
However, in other places I'm seeing (including w3.org) that the DOCTYPE must be the first tag.
Sounds like some confusion... DOCTYPE isn't a tag and neither is <?xml?> (which is called the XML declaration, and looks like a Processing Instruction, but it isn't one of those, either).
If you are including both, the XML declaration must come first. The trick is that IE6's DOCTYPE sniffer only detects Standards Mode DOCTYPEs if they're the first thing on the page, which means you can't use an XML declaration and you must stick with XML 1.0 and UTF-8 encoding (which is no great loss).
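The ordering rule is easy to check mechanically. The following is a rough sketch rather than a validator; `xmlDeclarationPlacement` is a hypothetical helper that only reports whether an XML declaration, if present, is the very first thing in the source (a leading byte order mark is tolerated).

```javascript
// Hypothetical helper: classify where the XML declaration sits in a source string.
function xmlDeclarationPlacement(source) {
  const text = source.replace(/^\uFEFF/, ""); // ignore a leading byte order mark
  const index = text.search(/<\?xml\s/); // "<?xml" followed by whitespace
  if (index === -1) return "absent";
  return index === 0 ? "first" : "misplaced";
}

console.log(xmlDeclarationPlacement('<?xml version="1.0"?>\n<!DOCTYPE html>'));
console.log(xmlDeclarationPlacement('<!DOCTYPE html>\n<?xml version="1.0"?>'));
```

A result of "misplaced" corresponds to the case where the DOCTYPE, or anything else, precedes the declaration.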
From the XHTML 1.1 specification:
An XML declaration like the one above is not required in all XML documents. XHTML document authors SHOULD use XML declarations in all their documents. XHTML document authors MUST use an XML declaration when the character encoding of the document is other than the default UTF-8 or UTF-16 and no encoding is specified by a higher-level protocol.
http://www.w3.org/TR/xhtml11/conformance.html
http://validator.w3.org/ only accepts the <?xml> stuff before <!DOCTYPE>. The other way around (doctype before ?xml) won't get validated.
I've never included it (always gone with just the doctype), and w3c says my XHTML 1.0 Strict projects are "valid."
I know the purpose of DOCTYPE (and what each url/identifier on the line is) as far as web standards and page validation goes, but I am unsure about what it actually "is" in the context of an XML document.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>My Page</title>
</head>
<body>
<p>Hello</p>
</body>
</html>
Is it part of the actual XML document structure, or is it some kind of comment-like "hint" that is noted then stripped?
What is the significance of the "!" before the name? Does this denote a special type of "element"? What are they called?
The example I posted is XHTML for the web, but is DOCTYPE also used in general purpose XML documents?
DOCTYPE has been "inherited" from SGML (it was supposed to point to a DTD file that explains how to parse the file); however, XML's self-explanatory syntax and namespaces made it largely irrelevant. The only real use for DOCTYPE/DTD in XML is to define allowed named entities (e.g. &nbsp;).
XML spec even allows "non-validating" parsers that ignore DTD file completely (web browsers use such parsers, unless you've fallen into the text/html trap in which case XML parser is not used at all).
DTD is quite poor for the purpose of validation (hard to specify rules for more than one level of nesting, no way to specify types of attributes beyond a few predefined types). XML Schema and RelaxNG can be far more precise.
DTD doesn't fully support namespaces either, which leads to ridiculous workarounds like the XHTMLplusMathMLplusSVG DOCTYPE.
In web browsers, certain DOCTYPEs have the desirable side effect of triggering standards-compliant rendering mode. This is more of a hack than an intended use of DOCTYPEs.
If you're using real XHTML (application/xhtml+xml – the one that doesn't open in IE at all), then don't use DOCTYPE at all (that's recommendation from XHTML 5). XML mode will trigger standards-compliant rendering regardless of DOCTYPE.
If you're using text/html mode, then use <!DOCTYPE html>. That's HTML 5 DOCTYPE and it's a shortest one that triggers best possible rendering in all browsers. Browsers don't use DOCTYPE for any other purpose, so you're not missing out on anything.
If you're processing XHTML files with XML parsers (outside browsers), then please don't forget to set up a DTD catalog properly, otherwise your parser may be DoS-ing w3.org trying to fetch the DTD every time. If you can't use a DTD catalog, then disable "externals" in the parser, or omit the DOCTYPE and don't use named entities (i.e. use &#160; rather than &nbsp;).
DOCTYPE is part of the XML specification (see the relevant subsection here) and can include either a link to a DTD, "internal" DTD declarations, or both. Many "modern" uses of XML don't use a DOCTYPE at all, though - as porneL mentions, both XML Schema and RelaxNG are more powerful ways to specify a document's syntax. See this Tim Bray blog post for a bit more background.