Declaring Doctype HTML5 [duplicate] - html

The "old" HTML/XHTML standards have a DTD (Document Type Definition) defined for them:
HTML 4.01 http://www.w3.org/TR/html401/sgml/dtd.html
XHTML 1.0 http://www.w3.org/TR/xhtml1/dtds.html#a_dtd_XHTML-1.0-Strict
This DTDs specify the rules for nesting elements - "which types of elements may appear in which types of elements". I made a diagram for XHTML 1.0 here (sorry, I no longer have that resource)
I would like to update that diagram with a new version which also includes the new HTML5 elements. However, there doesn't seem to be a HTML5 DTD. It seems that the nesting rules are defined by the various content models that are defined in HTML5.
So there is no DTD, correct?
Follow-up question: Is there a reason why there is no DTD in HTML5? The DTD is such a nice method of defining the nesting rules for all the different types of elements. Why wouldn't they include such a thing?
Update: I found this: http://www.w3.org/TR/html5/dom.html#kinds-of-content I guess, this is the closest to having a DTD.
Update: The Visual Studio Team made a XML Schema for XHTML5. I guess that answers my question: Link

There is no HTML5 DTD. The HTML5 RC explicitly says this when discussing XHTML serialization, and this clearly applies to HTML serialization as well.
DTDs have been regarded by the designers of HTML5 as too limited in expressive power, and HTML5 validators (basically the HTML5 mode of http://validator.nu and its copy at http://validator.w3.org/nu/) use schemas and ad hoc checks, not DTD-based validation.
Moreover, HTML5 has been designed so that writing a DTD for it is impossible. For example, there is no SGML way to capture the HTML5 rule that any attribute name that starts with “data-” and complies with certain general rules is valid. In SGML, attributes need to be listed individually, so a DTD would need to be infinite.
It is possible to design DTDs that correspond to HTML5 with some omissions and perhaps with some extra rules imposed, but they won’t really be HTML5 DTDs. My experiment with the idea is not very encouraging: too many limitations, too tricky, and the DTD would need to be so permissive that many syntax errors would go uncaught.

Correct. There is no DTD. However, HTML5 documents should start with <!DOCTYPE html>
So there's a DOCTYPE, but no DTD.
See:
http://dev.w3.org/html5/spec/syntax.html#the-doctype
http://en.wikipedia.org/wiki/Document_Type_Declaration#HTML5_DTD-less_DOCTYPE

I have created an HTML5 DTD for use in my PHP XML projects. It ain't beautiful, but it works with well-formed XHTML5 (that is, HTML5 expressed as XML).
You can grab it from my bitbucket account here:
https://bitbucket.org/kashbridge/dtd/overview
Enjoy!

Certain Marcus from sgmljs.net created and analyzed an SGML DTD for HTML 5.1 and started a thread in the XML-DEV mailing list for review and discussion. The discussion revolves around entity definitions so far.
I've just completed my analysis of W3C's HTML 5.1 recommendation at
http://sgmljs.net/docs/html5.html (from a markup language rather than
web development PoV), and I'm publishing it here for review in the
form of an initial SGML DTD for parsing HTML 5.1, along with a lengthy
analysis text.
[…]
I'm aware that WHATWG and W3C have since long moved away from SGML
(and XML in most web-related specification work), treating it as a
legacy technique and with a somewhat presumptuous attitude in the
specification text and elsewhere. But as the analysis of HTML5's
grammar shows, they've essentially abandoned use of any formal methods
alltogether (and it shows in at least two flaws discussed in the
analysis).
Nothing official yet, but maybe this initiative will get traction, or at least find its users as an unofficial resource.

I think they did away with the old DTDs, now we just start HTML pages with: <!DOCTYPE HTML>
Maybe the W3C will come out with one eventually.

Related

HTML5 is not based on SGML, so what is it based on then?

http://www.w3schools.com/tags/tag_doctype.asp
HTML5 is not based on SGML, and therefore does not require a reference to a DTD.
On what standard is HTML 5 based on if not on SGML?
The HTML5 standard specifies two serializations of HTML5: "html" and "xml". "xml" is a valid XML serialization (which in turn is a subset of SGML). "html" is not based on any specific serialization standard anymore, it has its own complete serialization. Herein lies the difference: HTML4 has a "sgml" serialization and "xml" serialization (called XHTML 1.0)
Of course HTML5 is for a large part based on HTML4 (based on SGML) and XHTML (based on HTML4 and XML).
Also see the history section of the HTML5 specification
What is the HTML 5 standard based on?
It is based on what browsers actually do.
In 2002-2005 Ian Hickson went through every browser, and found every parsing edge case for the DOM tree they create when presented with some HTML.
For Example
For example, what should the DOM tree of this (invalid) HTML be:
<!DOCTYPE html><em><p>XY</p></em>
Browsers seemed to agree on the tree:
DOCTYPE: html
HTML
HEAD
BODY
EM
P
#text: XY
Even though it is invalid html, browsers were happy to parse it into what you meant. The last thing your browser should do refuse to display what is perfectly understandable HTML.
Now what about this invalid html:
<!DOCTYPE html><em><p>X</em>Y</p>
IE: Y is a child of both p and body. This violates the DOM spec (a note is supposed to have only one parent), but is what the author of the HTML wanted.
Opera: Makes a valid DOM tree, but X isn't emphasised - violating CSS spec.
Mozilla and Safari: make it a valid DOM tree, but Y isn't emphasised (which is what the author wanted)
DOCTYPE: html
HTML
HEAD
BODY
EM
P
EM
#text: X
#text: Y
Which means that different browsers had different ideas on how to handle HTML (hence the need for an HTML standard).
A parser can't say:
Well, HTML is supposed to be a subset of SGML. And if your HTML isn't well-formed, then the results are undefined.
Not good enough
The web needs a standard to reflect how browsers should parse HTML. The W3C wasn't doing it. They hated HTML, and wanted everyone to move their beautiful SGML version of HTML, an xml-ified version of HTML: xhtml.
The HTML 5 standard is meant to be used in the real world. There needs to be a definition on how to handle not well-formed HTML, and define how browsers should handle it. It was based on a survey of all existing implementations, and choosing what either a consensus is, or what a consensus should be.
Which brings us to HTML5
From the HTML5 spec, and they lay it out quite plainly:
While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.
Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.
In other words (and they also say this):
An HTML5 parser is any parser that follows the parsing rules of HTML5
HTML5 has no grammer. There is no regex, lexer, BNF, EBNF you can use to parse HTML.
In order to correctly parse HTML to the HTML5 standard, you must implement the (very meticulously detailed) algorithm described in the HTML5 standard.
And if your parser doesn't handle invalid HTML: then that's the fault of your parser.

XHTML 5 standard check (HTML + CSS)

My web page must be strictly developed using XHTML 5 standard. How can I check it?
W3C validator:
http://validator.w3.org/
You can use the W3C validator here: http://validator.w3.org/
on a sidenote: there is no such thing as a XHTML5 standard. The last XHTML version was 2 and the development there has ended. The current standard is just called HTML5.
The W3C has an experimental HTML 5 validation engine here you might want to check out. But since the standard is still in development, I don't think you'll find any definitive validation engines just yet.
Please also keep in mind that there will be no official XHTML5 standard. HTML 5 will support two formats, one which uses strict XML syntax, and another which uses regular HTML syntax, which is somewhat looser in that it doesn't mandate closing tags or capitalization rules.
To get what you're looking for you might want to try two different validations. One to check that your document is a fully compliant XML document and also another to run it through the HTML 5 validation engine to check for non conforming tags, etc.
By comparing your code against the rules for XHTML serialization in the HTML5 specification, as soon as it becomes official (allow ten years for delivery). Meanwhile, the so-called HTML5 validators, such as http://validator.nu and the W3C service based on it may be useful, but a) they are known to be incomplete and can probably never check all aspects of the rules, b) they do not necessarily reflect the most recent HTML5 draft, and c) the drafts themselves are work in progress and may be changed at any moment without prior notice.
I recommend using http://validator.nu/.
Note what Jukka says, but with regard to using either validator.nu or the W3C HTML validator:
If you want to validate a page at a URL as XHTML with the W3C HTML validator, and the page has <!DOCTYPE html> as its doctype, then you must serve the page with an XML mime type such as application/xhtml+xml to the validator.
This is a good thing. Only if you use such a mime type will browsers treat your XHTML as XHTML, otherwise they will treat it as HTML and all your careful XHTML will be be so much tag soup. With HTML5, validators now behave the same way as browsers.

Does every html page with doctype need internet connection to render page properly?

many doctype use a url link
like this
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
and this dtd file is on live url http://www.w3.org/TR/html4/strict.dtd
What is the use of this online live dtd and how any page (which use this doctype) will render properly according to this doctype without having access to this url (i mean if internet access is not available?)
update : I found this info from wikipedia http://en.wikipedia.org/wiki/System_identifier
In HTML and XML, a system identifier
is a fragmentless URI reference. It
typically occurs in a Document Type
Declaration. In this context, it is
intended to identify a document type
which is used exclusively in one
application, whereas a public
identifier is meant to identify a
document type that may span more than
one application.
In the following example, the system
identifier is the text contained
within quotes:
update 2 : is it only to use for Validators? how some software like dreamweaver provides offline validation?
update 3: i found this info from w3c site http://www.w3.org/QA/Tips/Doctype
Why specify a doctype? Because it
defines which version of (X)HTML your
document is actually using (version
for what browser or validator?), and
this is a critical piece of
information needed by some tools
(which tools? any other tools then validator?) processing the document.
For example, specifying the doctype of
your document allows you to use tools
such as the Markup Validator to check
the syntax of your (X)HTML. Such tools
won't be able to work if they do not
know what kind of document you are
using.
But the most important thing is that
with most families of browsers, a
doctype declaration will make a lot of
guessing unnecessary, and will thus
trigger a "standard" rendering mode.
No, no browsers actually fetch or validate against the doctype. See DTDs Don't Work on the Web for a good argument for why fetching and validating DTDs is a bad idea.
The doctype is there, in theory, to tell what version of the standard the document uses. The browsers generally don't use this information, other than to switch between quirks and standards mode. All modern browsers accept the simplest possible doctype, with no URL or version information, <!DOCTYPE html>, for this purpose; because of this, HTML5 has adopted this as the recommended doctype.
Validators sometimes use this information to tell what DTD to validate against, but DTDs embedded in the document aren't actually a very good way of specifying validation information. The problem with validating against a DTD referenced within a document is that the consumer of that document doesn't really care all that much whether the document is self-consistent, but whether it follows a schema that the consumer knows how to interpret reliably. Instead, it's generally better to validate against an external schema, in a more powerful schema language like RELAX NG.
When validators use this information, they frequently use the URI as an identifier only, not as a locator. That means that the validator already knows about all of the common HTML doctypes, and uses that knowledge for validation, instead of downloading from the URI referred to. This is in part to avoid the problem of having to download the DTD every time, and also because a DTD doesn't actually specify enough information to provide very good validation and error messages, so some parts of the validator may be specified in custom code or a more powerful schema language. For more information, see Henri Sivonen's thesis on his implementation of the validator.nu HTML5 conformance checker.
Some validators may also download and then cache DTDs, so they would need to be online once to download it, but will later work from the cached version.
The URI is there to identify the document type uniquely - it is not meant for retrieval and no browser (or other piece of software) should rely on a document existing at that web address.
I used to wonder about that myself. But if you have your own HTTP server, it's pretty easy to prove that it doesn't matter. Just yank the cable to the outside world and see if you can still open the pages on your server.

Why is XHTML syntax so widely used in web pages?

First of all let's emphasize that syntax rules don't work alone, but they need the correct Content-type header to be fully interpreted by the clients. Currently web pages cannot be served with the correct XHTML header because Internet Explorer doesn't understand that.
The first advantage usually mentioned is that XHTML requires pages to be well-formed: true, but when browsers treat them as (malformed) HTML nothing enforces this rule, so it's up to you being a disciplined developer -- but you can be as disciplined writing good well-formed HTML too.
Another point often mentioned is that XHTML promotes the separation between content and presentation, but even in this case it doesn't really offer anything that can't be done with HTML -- it still depends on the developer since nothing is enforced, and no exclusive tools are offered.
So why do so many developer (including those of famous CMS/blogging softwares) still use XHTML syntax instead of directly writing what those pages will become anyway (i.e. plain HTML)?
Related fact: Stackoverflow uses HTML strict.
http://en.wikipedia.org/wiki/XHTML
From the wiki:
"The only essential difference between XHTML and HTML is that XHTML must be well-formed XML, while HTML need not be."
It's up to you which one you choose. There is no real difference in terms of what the user sees. Whichever you choose, please try to make it well-formed and make sure that your HTML/XHTML validates and follows the standards.
This probably isn't the actual reason, but it makes them parsable using a regular XML parser.
Sadly, XHTML syntax isn't as widely used as the XHTML doctype. You'd think people would be conscious about it, but a lot of the time (at least a few years ago), an XHTML doctype was used mostly because HTML 4 was being "dissed". That hasn't stopped people from continuing to use HTML syntax though. Open ended <li> and <p> tags, non-terminated <br> and <img> tags, tag attributes not enclosed in quotes, and more hypocritical nonsense.
Currently web pages cannot be served with the correct XHTML header because Internet Explorer doesn't understand that.
Sure they can, provided you're prepared to use content negotiation to serve a application/xhtml+xml content type to those user-agents that say they accept it.
There a number of reasons both good and bad why xhtml is so widely used. Jay Askren has a point about people who use XML in other contexts, (I'm one of them), but I doubt if that accounts for much use. If there is a good reason why XHTML is popular, it's most likely that the orthogonality of XML is a very seductive idea. It's simply easier remembering "Always close every tag, always quote the attribute values" than trying to remember all the rules about when you can safely omit tags and leave attributes unquoted etc., even though it results in a more verbose document.
There are other reasons like the fact that it's easier to indent your code if every opening tag has a matching closing one, and if you do, you've got a pretty accurate picture of the DOM laid out in the source code, which can aid with scripting. But I doubt that this is a primary reason.
Using XHTML states an intent, don’t underestimate that (but don’t overestimate this either). Web standards are politics: if nobody cares, nothing is gonna change. Using XHTML (or HTML5) signals “yes, we are in fact interested in the continued development of the standards.
Furthermore, while clients certainly don’t enforce XHTML rules with a text/html content type, design tools still can do this. XHTML is much easier to support for editors than real HTML (with “real” I mean the whole ugly SGML package). There are good XHTML validators that do much more than HTML validators can (e.g. Schneegans’ XML schema validator).
All in all, many arguments against XHTML are in fact straw-men that aim at some of the poorly-formulated arguments for XHTML. For instance, Microsoft is responsible of publishing long lists of purported XHTML advantages (such as semantic web design). Attacking those arguments is like reductio ad absurdum. But there are good arguments for XHTML.
I suspect a major reason xhtml is so popular is cultural and historical more than anything. XML became quite popular some time ago and it is still used quite heavily. It is good for for defining a data model that can be sent over the wire using webservices. There are lots of tools/technologies that work with it such as xslt and many others. It is natural for a developer to use html which is structured like xml, even if there is no real advantage just because they use xml in other contexts.

At the end of the day, why choose XHTML over HTML? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 5 years ago.
Improve this question
I wonder why I should use XHTML instead of HTML.
XHTML is supposed to be "modularized", but I haven't seen any server side language take advantage of any of that.
XHTML is also more strict, and I don't see the advantage. What does XHTML offer that I need so bad? How does it make my code "better"?
EDIT: another question I found in the comments: Does XHTML parse faster than HTML?
EDIT2: after reading all your comments and the links, I indeed agree that another post deserves to be the correct answer, so I chose the one that directly links to the best source.
Also, goes to show that people upvote the green comment without even reading it.
You should read Beware of XHTML, which is an informative article that warns about some of the pitfalls of XHTML over HTML.
I was pretty gung-ho about XHTML until I read it, but it does make several valid points. Including the following bit;
XHTML 1.x is not “future-compatible”. XHTML 2, currently in the drafting stages, is not backwards-compatible with XHTML 1.x. XHTML 2 will have lots of major changes to the way documents are written and structured, and even if you already have your site written in XHTML 1.1, a complete site rewrite will usually be necessary in order to convert it to proper XHTML 2. A simple XSL transformation will not be sufficient in most cases, because some semantics won't translate properly.
HTML 4.01 is actually more future-compatible. A valid HTML 4.01 document written to modern support levels will be valid HTML 5, and HTML 5 is where the majority of attention is from browser developers and the W3C.
Future compatibility can be huge when working on some projects. The article goes on to make several other good points, but I think that may have stood out the most for me.
Don't mistake the article for a rant against XHTML, the author does talk about the good points of XHTML, but it is good to be aware of the shortcomings before you dive in.
I was going to add this as a comment to one of the other posts, but it grew a little too large.
What the fundamental point that most people seem to be missing, is the purpose behind XHTML. One of the major reasons for developing the XHTML specification was to de-emphasise presentation-related tags in the markup, and to defer presentation to CSS. Whilst this separation can be achieved with plain HTML, this behaviour isn't promoted by the specifcation.
Separating meta-markup and presentation is a vital part of developing for the 'programmable web', and will not only improve SEO, and access for screen readers/text browsers, but will also lead towards your website being more easily analysable by those wishing to access it programmatically (in many simple cases, this can negate the need for developing a specific API, or even just allow for client-side scripts to do things like, identify phone numbers readily). If your web-page conforms to the XHTML specification, it can easily be traversed using XML-related tools, and things such as XPath... which is fantastic news for those who want to extract particular information from your website.
XHTML was not developed for use by itself, but by use with a variety of other technologies. It relies heavily on the use of CSS for presentation, and places a foundation for things like Microformats (whether you love them, or hate them) to offer a standardised markup for common data presentation.
Don't be fooled by the crowd who think that XHTML is insignificant, and is just overly restrictive and pointless... it was created with a purpose that 95% of the world seems to ignore/not know about.
By all means use HTML, but use it for what it's good for, and take the same approach when looking at XHTML.
With regard to parsing speed, I imagine there would be very little difference in the parsing of the actual documents between XHTML and HTML. The trade-off will come purely in how you describe the document using the available markup. XHTML tags tend to be longer, due to required attributes, proper closing, etc. but will forego the need for any presentational markup in the document itself. With that being the case, I think you're talking about comparing one type of apple, with a very slightly different type of apple... they're different, but it's unlikely to be of any consequence (in terms of parsing and rendering) when all you want is a healthy, tasty apple.
For the visitor of a website it probably doesn't make any visible difference. Furthermore, XHTML is usually more of a pain to use as at least one widespread browser still doesn't know how to handle it and you need to serve it as text/html in that case (which yields invalid HTML).
If your HTML is going to be regularly processed by automated tools instead of being read by humans, then you might want to use XHTML because of its more strict structure and being XML it's more easy to parse (from an application standpoint. Not that XML is inherently easy to parse, though).
Apart from that I don't see any compelling reasons to use it, though. XHTML was created in an approach of making use of XML features for HTML and basically it boils down to "HTML 4 with several annoying side-effects" (IMHO, at least).
Use HTML (HTML4 Strict or HTML5).
HTML can fully utilize CSS, can be validated and parsed unambiguously. Separation of structure and presentation has been done in HTML4 and XHTML merely continued that.
All browsers support HTML. Only some browsers support XHTML and those that do, often have more mature and better tested and optimized support for HTML (it's caused by the fact that tiny fraction of pages uses XML mode).
If you care about IE and Google, you have to use HTML or subset of XHTML and HTML defined in Appendix C of XHTML spec. The latter is almost worst of the both worlds, because such XHTML cannot be generated with standard XML tools, cannot use extension mechanisms new to XHTML and has additional limitations over those in HTML alone.
XHTML1.0 is now over 10 years old, it was designed in "Web1.0" times, and as head of W3C said, in retrospect it didn't work out and better approach is needed. W3C HTML5 is written as we speak and addresses needs of web applications used today, and has very good backwards compatibility.
HTML5 closes many gaps that were between HTML4 and XHTML1 (e.g. adds inline SVG, MathML i RDF), cleans up language beyond what was done in XHTML1.0 and XHTML1.1.
XHTML2 is not going to be supported by web browsers in forseeable future. It's likely that it will never be supported (all browser vendors heavily support [X]HTML5, some have already declared that they won't implement XHTML2).
XHTML1.0 has exactly the same semantics and separation of presentation from structure as HTML4.01. Anybody who says otherwise, hasn't read the specification. I encourage everybody to read the spec – it's suprisingly short and uninteresting.
Stylesheets were introduced in HTML4.01 and were not changed in XHTML1.0.
Presentational elements were deprecated in HTML4.01 and were not removed in XHTML1.0.
XHTML myths.
There are no untractable differences in HTML and XHTML that would make parsing of one much slower than another. It depends how the parser is implemented.
Both SGML and XML parsers need to load and parse entire DTD in order to understand entities. This alone is usually more work than parsing of the document itself. HTML parsers almost always "cheat" and use hardcoded entities and element information. XHTML parsers in browsers cheat too.
Parsing of HTML requires handling of implied start and end tags, and real-world HTML requires additional work to handle misplaced tags.
Proper parsing of XHTML requires tracking of XML namespaces.
Draconian XML rules require checking if every character is properly encoded. HTML parsers may get away with this, but OTOH they need to look for <meta>.
The overall difference in cost of parsing is tiny compared to time it takes to download document, build DOM, run scripts, apply CSS and all other things browsers have to do.
I'm surprised that all the answers here recommend XHTML over HTML. I am firmly of the opposite opinion - you should not use XHTML, for the foreseeable future. Here's why:
No browser interprets XHTML as XHTML unless you serve it as mimetype application/xhtml+xml. If you just serve it with the default mimetype, all browsers will interpret it as HTML - eg, accepting unclosed or improperly nested elements.
However, you should never actually do this, as Internet Explorer does not recognise application/xhtml+xml, and would fail to render the page completely.
There are significant differences in the DOM between XHTML and HTML. Since all so-called XHTML pages are being served as HTML at the moment, all javascript code is written using the HTML DOM. If, support for the XHTML mimetype becomes significant enough to convince people to start using it, most of their javascript code will break - even if they think their pages validate as XHTML.
Instead of continuing to debate HTML 4.01 Strict vs XHTML Strict, I would suggest starting to use HTML 5 today. John Resig, the author of jquery, made a similar suggestion last year on his blog.
The HTML 5 doctype, in it's beautiful simplicity will trigger standards mode in all browsers (including IE6).
<!DOCTYPE html>
That's it.
HTML 5 provides some exciting new features such as the <canvas> tag which potentially can push javascript application development to the next level. HTML 5 also has proper support for media (and media is a fairly important aspect of the web these days!) in the form of <video> and <audio> tags.
If you like the syntax of XHTML, i.e. closing "empty" tags such as <br />, that is fully supported in HTML 5. From Karl Dubost of the W3C's post Learn How To Write HTML 5:
auto-closing tag is allowed and conformant in HTML 5.
XHTML2 has received relatively little attention compared to HTML 5. It's becoming increasingly clear that HTML 5 is the future of markup on the web. Microsoft's latest browser, IE8 still renders XHTML served as text/xml as text/html.
Microsoft have a co-chair on the W3C HTML working group and there's an implied support from them for HTML 5. All of the browser vendors have publicly announced their support for HTML 5.
At the end of the day, even if XHTML2 regains support from the industry, it won't be a significant issue having two competing standards as it has been in the past. Both languages support XML namespaces (in the case of HTML 5, serialization of HTML i.e. DOCTYPE switching).
As a programmer, you should be VERY concerned about your code. HTML is ugly and follows few rules.
XHTML on the other hand, turns HTML into a proper language, following strict structural and syntactic rules.
XHTML is better for everyone, as it will help move the web to a point where everyone (all browsers) can agree on how to display a web page.
XHTML is an XML descendent, and us such is much easier on parsers built for the job of analysing syntactically sound XML documents.
If you can't see the benefit of XHTML, you might as well be using MS Word to create your HTML documents.
Take a look at http://www.w3.org/MarkUp/2004/xhtml-faq#need. There are some good reasons apart from modularisation.
I favor XHTML because it's stricter and more clearly laid out. HTML is quirky and browsers have to accept things like <b><i>sadasd</b></i>.
While this is a really simple example, it could also get more confusing and different browsers could lay out things differently.
Also I think that XHTML has to be "faster" since the browser doesn't have to do that kind of "reparations".
Some differences are:
XHTML tags must be properly nested
The documents must have one root element
XHTML tags are always in lowercase
Tags must always be closed (e.g. using the <br> tag in XHTML must have closing tag <br /> or <br></br> in XHTML)
Here are some links on it
wiki XHTML
wiki HTML vs XHTML
XHTML allows to use all those tools designed for XML. Among then, there is XSLT, embedding SVG, etc...
Interesting development: XHTML 2 Working Group Expected to Stop Work End of 2009, W3C to Increase Resources on HTML 5
2009-07-02: Today the Director announces that when the XHTML 2 Working Group charter expires as scheduled at the end of 2009, the charter will not be renewed. By doing so, and by increasing resources in the Working Group, W3C hopes to accelerate the progress of HTML 5 and clarify W3C's position regarding the future of HTML. A FAQ answers questions about the future of deliverables of the XHTML 2 Working Group, and the status of various discussions related to HTML. Learn more about the HTML Activity.
Well, I guess that makes the future of HTML pretty clear.
XHTML forces you to be neat.
For example, in HTML, you can write:
<img src="image.jpg">
This isn't very logical, because the img tag never gets closed. In XHTML, however, you're forced to close the tag neatly, like this:
<img src="image.jpg" />
I like using something that forces me to be neat.
Steve
The subtitle to the XHTML 1.0 recommendation:
A Reformulation of HTML 4 in XML 1.0
Many tools exist today to process XML. By using XHTML, you are allowing a huge set of tools to operate on your pages and to extract information programmatically.
If you were to use HTML, this would be possible too. There are tools in existence to parse HTML DOM trees. However, these tools can often be more specialized than those for XML. You may not find your favorite XML data processing tools compatible with HTML. Furthermore, there are so many uses for XML nowadays that you may be using XML for some other part of an application; why not also use that same XML parser to parse your web pages? This is the motivation behind XHTML.
If you're already comfortable and familiar with HTML 4.01, you have an established project using HTML 4, and you don't have tons of spare time, just go with HTML 4.01. If you have spare time, learn XHTML 1.1 anyway, and start your new projects in XHTML 1.1 – there's no harm in doing so. If you're using something other than HTML 4.01 or are pretty unfamiliar with HTML 4 anyway, just learn XHTML 1.1.
Using XHTML with the correct DocType will force the browser to render the content in a more standards compliant (strict) mode. This makes the different browsers behave better and, most importantly, more like each other. This makes your job as a webdeveloper a lot easier since it reduces the amount of browser specific tweaks needed to make the content look the same in all browsers.
Quirksmode.org has a lot of good info on this subject.
In my opinion, the strictness is, at least in theory, a good thing, because in HTML, you don't need to be strict, and because of that and the HTML5 junk, Browsers have advanced error correction algorithms that will make the best out of broken HTML. The problem is, the algorithms are not exactly the same and will lead to really strange behaviour you can't predict. With XHTML, on the other hand, you typically have fine, valid XHTML and so the error correction algorithms are not needed, i.e. the entire Browser behaviour is predictable. In addition, strict code makes it easier for your tools to work with the code. So you have actually nothing to lose by using XHTML, but there is some potential to gain. Things will get worse with plain HTML when HTML5 is finally out and the "be open in what you accept" will lead to the described strange behaviour. But at least then it's a standardized strange behaviour. Sigh.
On the other hand, if you use a good IDE like Visual Studio, it's almost impossible to produce broken HTML code anyway, so the result is the same.
Use XHTML
Fails fast. If there are any inconsistencies they will be found during validation.
It encourages better design by separating semantic markup from presentation etc.
It's structured which means that you can treat it as a data object and run all sorts of queries against it. For example you could find all addresses or citations within your website.
You can do build-time optimizations. Since it's well-formed XML you can easily do find/replace operations during build time. Or any document management and manipulation.
You can write XSLT or other transformation scripts to programatically transform your XHTML for other platforms. For example you could have an XSLT for the iPhone that would transform all XHTML to make it compatible or more user-friendly for the iPhone
You are future proofing yourself. Transforming XHTML to newer semantics is again, very easy using transformation.
Search engines will continue to evolve to gather more semantic information as part of the programmable web.
DOM operations are more reliable since it's structured.
From an algorithmic perspective, it yields easier and faster parsing.
XHTMl is a good standing point to use because if you want valid code you would need to provide some aspect of help to the disabled community due to the fact screen readers need the alt and title parts of the image and link tags.
It must be faster to parse to an extent because unlike HTML the parser wouldn't need to check to see if the tag wasn't closed properly, if it was nested correctly etc.
Also it is better to use it because yes it is strict but it helps you to think more logically (in my opinion) when it comes to learning programming languages.
I believe XHTML is (or should be) faster to parse. A valid XHTML document must be written to a stricter spec in that errors are fatal when parsing, whereas HTML is more lenient and allows for oddities mentioned before my comment like out of order closing tags and such. I found this helpful in uncovering the differences between HTML and XHTML parsing:
http://wiki.whatwg.org/wiki/HTML_vs._XHTML#Parsing
A reason you might use XHTML over HTML might be if you intend to have mobile users as part of your audience. If I recall, many phones use something more of an XML parser, rather than an HTML one to display the web. If you are writing for desktop browsers, HTML would probably be acceptable.
That said, if you are going to serve the data as text/html anyway, you should use HTML:
http://www.hixie.ch/advocacy/xhtml