HTML and XML are syntactically very similar, so what I want to know is if valid HTML code will always conform to the XML specification.
No, it won't.
HTML 2 through 4.x were SGML applications, not XML applications. (HTML+ might also have been an SGML application; it isn't clear from a brief skim of the specification.)
HTML 5 has its own parse rules.
(XHTML and the XML serialisation of HTML 5 will be XML though)
Does HTML conform to the XML specification?
No, it does not. HTML supports:
unclosed tags (e.g. <img> instead of <img />)
wrongly nested tags (e.g. <b><i>bla</b></i> instead of <b><i>bla</i></b>)
unquoted attributes (e.g. <a name=foo>...</a>)
content that is not properly encoded (e.g. <em>this & that</em> instead of <em>this &amp; that</em>)
tags that explicitly must contain unencoded content (i.e. <script>)
named entities (e.g. &copy; instead of &#169;)
The standard does not explicitly allow all of these notions, but all HTML parsers understand and support them.
None of them is legal in XML.
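A quick way to check this yourself is to feed such snippets to a strict XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the snippet strings and the helper name is_well_formed_xml are made up for illustration, and only the list items that are easy to show in one line are covered.

```python
import xml.etree.ElementTree as ET

# Each snippet is tolerated by HTML parsers but is not well-formed XML.
SNIPPETS = {
    "unclosed tag": "<p><img></p>",
    "wrongly nested tags": "<b><i>bla</b></i>",
    "unquoted attribute": "<a name=foo>...</a>",
    "unescaped ampersand": "<em>this & that</em>",
    "undeclared named entity": "<p>&copy;</p>",
}


def is_well_formed_xml(text: str) -> bool:
    """Return True if the standard-library XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


results = {label: is_well_formed_xml(s) for label, s in SNIPPETS.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

The XML parser stops at the first error in each snippet, whereas a browser would still render all of them.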
HTML is more lenient. For example,
<!DOCTYPE html>
<title>foo</title>
bar
is a valid HTML5 document, but it's obviously not valid XML, since XML requires a top-level element that encompasses the whole document.
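To make the contrast concrete, here is a small sketch that runs that same three-line document through Python's standard-library html.parser and xml.etree.ElementTree. The TagLogger class is a hypothetical helper written for this example, and html.parser is only a lenient tokenizer, not a full HTML5 parser, so treat it as an illustration rather than a conformance test.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

DOC = "<!DOCTYPE html>\n<title>foo</title>\nbar"


class TagLogger(HTMLParser):
    """Record start tags and non-blank text reported by the HTML parser."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


# The HTML parser reads the whole document without complaint.
logger = TagLogger()
logger.feed(DOC)
logger.close()
print(logger.start_tags, logger.text)

# The XML parser rejects it: "bar" sits outside the single root element.
try:
    ET.fromstring(DOC)
    xml_error = None
except ET.ParseError as exc:
    xml_error = str(exc)
print("XML parser error:", xml_error)
```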
However, you can use one of the XHTML languages, which are applications of XML with the same semantics as the corresponding HTML standards.
Note: this is supposed to be the canonical post for this question. A number of answers exist already, but descriptions of the various differences are scattered all over the place, and more often than not, they also offer opinions to "which one should I use", which I will refrain from in here.
If you have more questions to ask, or you know of more differences, feel free to edit.
What is the difference between XHTML and HTML? Isn't XHTML merely a more strict version of HTML? And why are there different versions of XHTML if they all act the same?
What is the difference between HTML and XHTML?
There are many differences. The main one is that XHTML is HTML in an XML document, and XML has different syntax rules:
XML has a different namespace by default, so you'll have to use the HTML namespace, xmlns="http://www.w3.org/1999/xhtml" explicitly in an XHTML document
XML is case sensitive and you'll have to use lowercase for tag names and attributes and even the x in hexadecimal character references
XML doesn't have optional start and end tags, so you'll have to write out all of them in full
Likewise, XML doesn't have void tags, so you'll have to close every void element yourself with a slash.
Non-void elements that have no content can be written as a single empty element tag in XML.
XML can contain CDATA sections, sections of plain text delimited with <![CDATA[ .. ]]>; HTML cannot
On the other hand, there are no CDATA or PCDATA elements or attributes in XML, so you'll have to escape your < signs everywhere (except in CDATA sections)
Quotes around attribute values are not optional in XML, and there is no attribute minimization (name-only attributes)
And the XML parser is not as forgiving of errors as the HTML parser.
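As an illustration of those rules, the following sketch parses a small document that follows all of them with Python's standard xml.etree.ElementTree. The document content is invented for the example; the point is that the namespace becomes part of every element name, and that the CDATA section lets the script keep its raw < sign.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Lowercase names, explicit namespace, quoted attribute values,
# void elements closed with a slash, script content in a CDATA section.
XHTML_DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <p>Line one<br />line two</p>
    <input type="checkbox" disabled="disabled" />
    <script><![CDATA[ if (a < b) { run(); } ]]></script>
  </body>
</html>"""

root = ET.fromstring(XHTML_DOC)

# ElementTree spells namespaced names as "{namespace}localname".
print(root.tag)

script = root.find(f".//{{{XHTML_NS}}}script")
print(script.text)
```

Changing <html ...> to <HTML ...> while leaving </html> as it is would make the parser reject the document, because XML names are case sensitive.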
Then there are a couple of not XML-related differences:
XHTML documents are always rendered in standards mode, never in quirks mode
XHTML does not look at meta commands in the head to determine the encoding. In fact, the W3C validator flags <meta http-equiv="content-type" ... as an error in XHTML5 files, but not in HTML5 files.
Earlier on, mismatches between the DTDs for XHTML 1.0 Strict and HTML 4.01 Strict led to validation issues. The definition for XHTML 1.0 was missing the name attribute on <img> and <form>. This was an error though, fixed in XHTML 1.1.
Note that XHTML documents should be served up with the correct file type, i.e. a .xhtml file extension or an application/xhtml+xml MIME type. You can't really have XHTML in an HTML document, because browsers don't differentiate between the two syntaxes by looking at the content, only by file type.
In other words, if you have an HTML file, its contents are HTML, no matter if it has valid XML in it or not.
One point about the syntax rules worth mentioning is the casing of tag names. Although HTML documents are case-insensitive, the tag names are actually exposed as uppercase by the DOM. That means that under HTML, a JavaScript command like console.log(document.body.tagName); would output "BODY", whereas the same command under XHTML would output "body".
Isn't XHTML merely a stricter version of HTML?
No; XML has different rules than HTML, but it's not necessarily stricter. If anything, XML has fewer rules!
In HTML, many features are optional. You can choose to put quotes around attribute values or not; in XML you don't have that choice. And in HTML, you have to remember when you have the choice and when you don't: are quotes optional in <a href=http://my-website.com/?login=true>? In XML, you don't have to think about that. XML is easier.
In HTML, some elements are defined as raw text elements, that is, elements that contain plain text rather than markup.
And some other elements are escapable raw text elements, in which references like &eacute; will be parsed, but things like <b>bold</b> and <!-- comment --> will be treated as plain text. If you can remember which elements those are, you don't have to escape < signs (you optionally can though). XML doesn't have that, so there's nothing to remember and all elements have the same content type.
XML has processing instructions. The best-known construct with that syntax is the XML declaration in the prolog, <?xml version="1.0" encoding="windows-1252"?> (strictly speaking, the XML specification does not classify the declaration itself as a processing instruction). This tells the browser which version of XML is used (1.0 is the only version that works, by the way) and which character set.
And XML parses comments in a different way. For example, HTML comments can't start with <!--> (with a > as the first character inside); XHTML comments can.
Speaking of comments, with XHTML you can comment out blocks of code inside <script> and <style> elements using <!-- comment -->. Don't try that in HTML. (It's not recommended in XHTML either, because of compatibility issues, but you can.)
Why are there different versions of XHTML if they all act the same?
They don't! For instance, in XHTML 1.1 you can refer to character entities like &eacute; and &nbsp;, because those entities are defined in the DTD. The current version of XHTML (formerly known as XHTML5) does not have a DTD, so you will have to use numerical references, in this case &#xE9; and &#xA0; (or define those entities yourself in the DOCTYPE declaration; the X means eXtensible, after all).
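Here is a small sketch of how a generic XML parser treats the three cases, using Python's standard xml.etree.ElementTree. The one-element documents are invented for the example, and browsers may treat the XHTML named entities specially, so this only shows what plain XML parsing does.

```python
import xml.etree.ElementTree as ET

# Numeric character references need no DTD at all.
numeric = ET.fromstring("<p>caf&#xE9;</p>")

# A named entity works once it is declared in the internal DTD subset.
declared = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY eacute "&#xE9;">]>'
    "<p>caf&eacute;</p>"
)

# Without a declaration, the same named reference is a fatal error.
try:
    ET.fromstring("<p>caf&eacute;</p>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(numeric.text, declared.text, undeclared_ok)
```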
Can I parse an HTML file using an XML parser?
Why can('t) I do this. I know that XML is used to store data and that HTML is used to display data. But syntactically they are almost identical.
The intended use is to make an HTML parser, that is part of a web crawler application
You can try parsing an HTML file using an XML parser, but it’s likely to fail. The reason is that HTML documents can have the following features that XML parsers don’t understand:
elements that never have end tags and that don’t use XML’s so-called “self-closing tag syntax”; e.g., <br>, <meta>, <link>, and <img> (also known as void elements)
elements that don’t need end tags; e.g., <p> <dt> <li> (their end tags can be implied)
elements that can contain unescaped markup "<" characters; e.g., style, textarea, title, script; <script> if (a < b) … </script>, <title>Using the "<" operator</title>
attributes with unquoted values; for example, <meta charset=utf-8>
attributes that are empty, with no separate value given at all; e.g., <input disabled>
XML parsers will fail to parse any HTML document that uses any of those features.
HTML parsers, on the other hand, will basically never fail no matter what a document contains.
All that said, there’s also been work done toward developing a new type of XML parsing: so-called XML5 parsing, capable of handling things like empty/unquoted attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.
The intended use is to make an HTML parser, that is part of a web crawler application
If you’re going to create a web-crawler application, you should absolutely use an HTML parser—and ideally, an HTML parser that conforms to the parsing requirements in the HTML standard.
These days, there are such conformant HTML parsers for many (or even most) languages; e.g.:
parse5 (node.js/JavaScript)
html5lib (python)
html5ever (rust)
validator.nu html5 parser (java)
gumbo (c, with bindings for ruby, objective c, c++, php, c#, perl, lua, D, julia…)
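For a rough idea of what the link-extraction step of a crawler looks like, here is a sketch using Python's standard-library html.parser. Note the assumptions: html.parser is lenient but does not implement the full parsing algorithm from the HTML standard, so for real crawling one of the parsers listed above would be the better choice; the LinkExtractor class, the sample page and the example.com base URL are all made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Valueless attributes such as <input disabled> arrive as None.
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


# Unquoted attributes, omitted </p> end tags, void elements, empty attributes:
# everything an XML parser would reject.
PAGE = """<!DOCTYPE html>
<meta charset=utf-8>
<p>First paragraph
<p>Second <a href=/about>about</a> and <a href="news.html">news</a><br>
<input disabled>
"""

extractor = LinkExtractor("http://example.com/docs/index.html")
extractor.feed(PAGE)
extractor.close()
print(extractor.links)
```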
syntactically they are almost identical
Computers are picky. "Almost identical" isn't good enough. HTML allows things that XML doesn't, therefore an XML parser will reject (many, though not all) HTML documents.
In addition, there's a different quality culture. With HTML the culture for a parser is "try to do something with the input if you possibly can". With XML the culture is "if it's faulty, send it back for repair or replacement".
I'm trying to convert some XML to HTML. The XML contains only a few known elements that map to HTML tags. Do I need to html encode text nodes?
Is valid XML also valid HTML assuming we are only using HTML tags?
No. Here's a simple example.
<div>
<span/>
</div>
This is well-formed and valid XML. It is not valid HTML (except when processed as XHTML) in any version of HTML.
That's not to say that an HTML parser won't process it, but that's not a good test. An HTML parser will process any byte sequence, valid or not.
Is valid* XML also valid HTML assuming we are only using HTML tags?
*Note that "valid" is not the same as "well-formed". Validity is a property that requires well-formedness and successful comparison against a DTD or schema. Well-formedness only means syntactical correctness, which is what you mean here.
Yes. HTML uses a few conventions that are not present in XML (prominently unclosed tags, unencoded tag bodies like <script>, namespaces are unsupported, incorrect tag nesting is glossed over) but all things considered well-formed, vanilla (!) XML that only uses HTML tag names will be understood by an HTML parser.
Vanilla means in this case: No custom DTDs, no custom named character entities.
Do I need to html encode text nodes?
No. All characters valid in a certain encoding (say, UTF-8) will be acceptable in both XML and HTML, as long as the encoding is correctly declared. Character escaping schemes are compatible, so e.g. &#160; (or &#xA0;) will represent a non-breaking space in both XML and HTML. Writing that non-breaking space verbatim (i.e. as the character U+00A0 in the declared encoding) into the text will work as well. Named character entities besides &lt;, &gt;, &amp;, &quot; and &apos; are unsupported in XML, whereas all numbered character entities XML could use will work in HTML. That means you will not encounter a problem there.
XML that does not declare an encoding will default to UTF-8. You should not have a problem with leaving all text nodes and attribute values as they are as long as you use the same encoding for your HTML.
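If the conversion happens in code, the serializer usually takes care of escaping for you. As a sketch, assuming Python and its standard xml.etree.ElementTree module (the sample element is invented), the text nodes come out of the XML parser already decoded, and writing them back with method="html" re-escapes only what HTML needs:

```python
import xml.etree.ElementTree as ET

# XML input using predefined entities and a numeric character reference.
XML_SOURCE = "<p>a &lt; b &amp; c&#160;d<br/>caf\u00e9</p>"

root = ET.fromstring(XML_SOURCE)

# The parser hands back decoded text: real <, & and U+00A0 characters.
print(repr(root.text))

# The HTML output method escapes < and & again and writes void
# elements such as <br> without an end tag.
html_out = ET.tostring(root, encoding="unicode", method="html")
print(html_out)
```

The same idea applies to other toolchains: keep text nodes as plain strings in memory and let the output step do the escaping.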
http://www.w3schools.com/tags/tag_doctype.asp
HTML5 is not based on SGML, and therefore does not require a reference to a DTD.
On what standard is HTML 5 based, if not on SGML?
The HTML5 standard specifies two serializations of HTML5: "html" and "xml". "xml" is a valid XML serialization (which in turn is a subset of SGML). "html" is not based on any specific serialization standard anymore; it has its own complete serialization. Herein lies the difference: HTML4 has an "sgml" serialization and an "xml" serialization (called XHTML 1.0).
Of course HTML5 is for a large part based on HTML4 (based on SGML) and XHTML (based on HTML4 and XML).
Also see the history section of the HTML5 specification
What is the HTML 5 standard based on?
It is based on what browsers actually do.
In 2002-2005 Ian Hickson went through every browser, and found every parsing edge case for the DOM tree they create when presented with some HTML.
For Example
For example, what should the DOM tree of this (invalid) HTML be:
<!DOCTYPE html><em><p>XY</p></em>
Browsers seemed to agree on the tree:
DOCTYPE: html
HTML
  HEAD
  BODY
    EM
      P
        #text: XY
Even though it is invalid HTML, browsers were happy to parse it into what you meant. The last thing your browser should do is refuse to display what is perfectly understandable HTML.
Now what about this invalid html:
<!DOCTYPE html><em><p>X</em>Y</p>
IE: Y is a child of both p and body. This violates the DOM spec (a node is supposed to have only one parent), but is what the author of the HTML wanted.
Opera: Makes a valid DOM tree, but X isn't emphasised - violating CSS spec.
Mozilla and Safari: make it a valid DOM tree, but Y isn't emphasised (which is what the author wanted)
DOCTYPE: html
HTML
  HEAD
  BODY
    EM
    P
      EM
        #text: X
      #text: Y
Which means that different browsers had different ideas on how to handle HTML (hence the need for an HTML standard).
A parser can't say:
Well, HTML is supposed to be a subset of SGML. And if your HTML isn't well-formed, then the results are undefined.
Not good enough
The web needs a standard to reflect how browsers should parse HTML. The W3C wasn't doing it. They hated HTML, and wanted everyone to move to their beautiful SGML version of HTML, an XML-ified version of HTML: XHTML.
The HTML 5 standard is meant to be used in the real world. There needs to be a definition of how browsers should handle HTML that is not well-formed. It was based on a survey of all existing implementations, choosing either what the consensus was or what the consensus should be.
Which brings us to HTML5
From the HTML5 spec, and they lay it out quite plainly:
While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.
Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.
In other words (and they also say this):
An HTML5 parser is any parser that follows the parsing rules of HTML5
HTML5 has no grammar. There is no regex, lexer, BNF, EBNF you can use to parse HTML.
In order to correctly parse HTML to the HTML5 standard, you must implement the (very meticulously detailed) algorithm described in the HTML5 standard.
And if your parser doesn't handle invalid HTML: then that's the fault of your parser.
For example, if we have
<html>
<head>
<title>FooBar</title>
</head>
<body></body>
</html>
If we call document.getElementsByTagName("title")[0].tagName, then we get TITLE (uppercase), even though the HTML standards recommend writing HTML tags in lowercase.
I know there is no relationship between both, but this still doesn't make sense.
Is there any reason that DOM should return tag names in uppercase?
Technically, this is mandated in DOM Level 1:
The HTML DOM returns the tagName of an HTML element in the canonical uppercase form, regardless of the case in the source HTML document.
The convention of uppercase tag names probably stems from legacy, when HTML was previously developed based on SGML, and element types were declared in uppercase. See this section of the HTML 4.01 spec discussing SGML, HTML and its syntax, as well as for example the HTML 4.01 Strict doctype definition. Any DOM implementations supporting HTML would follow suit.
Note that lowercase tag names are only explicitly required in XHTML (but not XML), and authors are generally recommended to write lowercase tags for easy porting between HTML/XHTML, as well as improving readability. However, this recommendation doesn't occur in the spec; all it says is that tag names are case-insensitive only in HTML as opposed to XHTML and XML.