What is the relationship between XHTML and HTML5 (and on)? - html

Is there any reason to consider XHTML related details for web development for someone who works with HTML4 and plans to work with HTML5?
Or was XHTML a "side" development in web technology which is not related to the current mainstream of web development (which I guess is HTML5)?
Rationale for asking
The reason I'm asking is that a StackOverflow answer I previously posted was recently commented on to the effect that "You should NOT use upper case HTML tags". When I asked why, the commenter pointed me to a couple of articles, both of which gave "Upper case tag names are not allowed in XHTML" as the main reason for lowercase-only tag names.
Since I don't currently do web development in XHTML (pure HTML4.0) and only plan to use HTML5 as the next step, I'm interested in whether XHTML should be a factor in the assorted decisions I make in web development.
Please note that this question is NOT asking whether one should use upper or lower case tag names in pure HTML - merely whether the upper-case-tag prohibition in XHTML should be a factor in my decision making, or I should ignore XHTML related rules.

Most of the info you will find on XHTML fails to mention that the whole purpose of XHTML is to be used by both humans and computers for data exchange. XHTML was created so that computers could talk to each other while the markup remained readable by humans.
The craziest thing is that various code monks became obsessed with XHTML compliance without any intent of using XHTML for data exchange and without understanding what XHTML actually is.
Edit:
A good example of XHTML use is an airline company publishing its flights. The company can use XHTML to produce a single list of flights that is consumable both by humans visiting the site and by services that read the data for their own needs.
On the other hand if you own a hair-salon, there is no need for your site to validate against XHTML standards.

Directly, there's no reason to take into account XHTML rules when marking up documents for the HTML serialization of HTML5.¹ However, it seems to be the modern practice to use lower case for tags, and that has been partly influenced by XML and XHTML. So if you find yourself working in a large team of developers, it is quite likely that the coding guidelines for the team will mandate lower case just for consistency's sake. So, it might be worth getting into the habit of doing so.
¹ Note that there's also the possibility of using the XHTML serialization of HTML5, in which case it would matter of course, but I don't think that's what you are asking about.

As far as I can tell, XHTML was intended to be a kind of combination between the features of HTML and the strictness of XML. For example, in HTML you can write <b><i>test</b></i> and the browser won't care, it knows what you meant. Whereas in XHTML that would cause an error and block the entire page from showing.
However, most browsers tolerated errors in XHTML as well (pages served as text/html went through the forgiving HTML parser), so the strictness became kind of pointless.
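To make the difference concrete, here is a minimal Python sketch using only the standard library. It is not what a browser does internally: xml.etree.ElementTree stands in for a strict XML parser, and html.parser is just a forgiving tokenizer, not a full HTML tree builder. The names MISNESTED, is_well_formed_xml, TagLogger and html_events are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The misnested example from the answer, wrapped in a <p> so it has one root.
MISNESTED = "<p><b><i>test</b></i></p>"


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # XML-style "draconian" handling: the first error is fatal.
        return False
    return True


class TagLogger(HTMLParser):
    """Records start/end tag events without checking nesting."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(markup):
    """Tokenize markup the forgiving way and return the tag events seen."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


print(is_well_formed_xml(MISNESTED))  # the XML parser rejects the misnesting
print(html_events(MISNESTED))         # the HTML tokenizer reads to the end
```

The XML side refuses the whole document, while the HTML side simply reports the tags in the order they appear and leaves it to the consumer to make sense of them.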
Personally, the only reason I can give for choosing to use lowercase tag names rather than uppercase is because I'm too lazy to hold the Shift key and I remapped my CapsLock key to something more useful.

From what I've personally observed about XHTML, the only "benefit" of it is the versatility—for example, it is my understanding that with relative ease, one can create a custom tag/element or attribute and still validate with XHTML.
I see no need for custom tags, since if there isn't an element that fits semantically (though such cases are rare thanks to HTML5's improvements in semantic markup), you can just use a div or span with a class or id to make it "fit." And HTML5 does allow custom attributes anyway.
So I see no direct relationship (or at least, I don't think there should be one) between HTML5 and XHTML (and anyway, XHTML 1.0 is meant to be a combination of HTML 4 and XML). The spec doesn't recommend upper or lower case; it just notes that in HTML5, tag names are case-insensitive.
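A quick way to see the case rule in practice, sketched in Python with the standard library only. This is an illustration rather than a browser: html.parser reports tag names converted to lower case, while the XML parser in xml.etree.ElementTree treats P and p as different names. The helper names html_tag_names and xml_accepts are invented for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagNames(HTMLParser):
    """Collects the start tag names exactly as the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over the tag name already lowercased.
        self.names.append(tag)


def html_tag_names(markup):
    parser = TagNames()
    parser.feed(markup)
    parser.close()
    return parser.names


def xml_accepts(markup):
    """Return True if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(html_tag_names("<P>one</P><p>two</p>"))  # both reported as 'p'
print(xml_accepts("<P>text</p>"))              # XML: P and p do not match
```

So in the HTML syntax the choice of case is purely stylistic, while in any XML-based syntax a mismatch in case is a well-formedness error.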
I personally use lowercase tag names for two reasons:
It's a nuisance to hold shift every time (though one could argue that you hold shift to get the angle brackets, but whatever).
It's a style preference—I truly think it looks much cleaner with lowercase tag names as well as a lowercase doctype (<!doctype html>). Uppercase tag names are really old-fashioned, and you don't really see them anymore. It may just be a thing with newer generation programmers such as myself.
And either way, you should pay no mind to rules that apply to a doctype you're not even working with.

Related

HTML: Include, or exclude, optional closing tags?

Some HTML¹ closing tags are optional, i.e.:
</HTML>
</HEAD>
</BODY>
</P>
</DT>
</DD>
</LI>
</OPTION>
</THEAD>
</TH>
</TBODY>
</TR>
</TD>
</TFOOT>
</COLGROUP>
Note: Not to be confused with closing tags that are forbidden to be included, i.e.:
</IMG>
</INPUT>
</BR>
</HR>
</FRAME>
</AREA>
</BASE>
</BASEFONT>
</COL>
</ISINDEX>
</LINK>
</META>
</PARAM>
Note: XHTML is different from HTML. XHTML is a form of XML, which requires every element to have a closing tag. A closing tag can be forbidden in HTML, yet mandatory in XHTML.
Are the optional closing tags
ideally included, but we'll accept them if you forgot them, or
ideally not included, but we'll accept them if you put them in
In other words, should I include them, or should I not include them?
The HTML 4.01 spec talks about closing element tags being optional, but doesn't say if it's preferable to include them, or preferable to not include them.
On the other hand, a random article on DevGuru says:
The ending tag is optional. However, it is recommended that it be included.
The reason I ask is because you just know it's optional for compatibility reasons; and they would have made them (mandatory | forbidden) if they could have.
Put it another way: What did HTML 1, 2, 3 do with regards to these, now optional, closing tags. What does HTML 5 do? And what should I do?
Note
Some elements in HTML are forbidden from having closing tags. You may disagree with that, but that is the specification, and it's not up for debate. I'm asking about optional closing tags, and what the intention was.
Footnotes
¹ HTML 4.01
There are cases where explicit tags help, but sometimes it's needless pedantry.
Note that the HTML spec clearly specifies when it's valid to omit tags, so it's not always an error.
For example you never need </body></html>. Nobody ever remembers to put <tbody> explicitly (to the point that XHTML made exceptions for it).
You don't need </head><body> unless you have DOM-manipulating scripts that actually search <head> (then it's better to close it explicitly, because rules for implied end of <head> could surprise you).
Nested lists are actually better off without </li>, because then it's harder to create erroneous ul > ul tree.
Valid:
<ul>
<li>item
<ul>
<li>item
</ul>
</ul>
Invalid:
<ul>
<li>item</li>
<ul>
<li>item</li>
</ul>
</ul>
And keep in mind that end tags are implied whether you try to close all elements or not. Putting end tags won't automatically make parsing more robust:
<p>foo <p>bar</p> baz</p>
will parse as:
<p>foo</p><p>bar</p> baz
It can only help when you validate documents.
The optional ones are all ones that should be semantically clear where they end, without needing the end tag.
E.g., each <li> implies a </li> if there isn't one right before it.
Elements with forbidden end tags would all be immediately followed by their end tag anyway, so it would be kind of redundant to have to type <img src="blah" alt="blah"></img> every time.
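The <li> rule can be sketched in a few lines of Python. This is a deliberately simplified model, not the real HTML parsing algorithm: it only implies </li> when a new <li> starts directly inside an open <li>, or when an enclosing element is closed. The names ImpliedLiCloser, add_implied_li_end_tags and VOID_ELEMENTS are made up for the example.

```python
from html.parser import HTMLParser

# Elements whose end tag is forbidden; they never stay open (partial list).
VOID_ELEMENTS = {"area", "base", "br", "col", "hr", "img", "input",
                 "link", "meta", "param"}


class ImpliedLiCloser(HTMLParser):
    """Re-emits markup with the implied </li> end tags written out."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open elements
        self.out = []        # pieces of the rewritten markup

    def handle_starttag(self, tag, attrs):
        # A new <li> directly inside an open <li> implies </li> first.
        if tag == "li" and self.open_tags and self.open_tags[-1] == "li":
            self._close_top()
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: ignored in this sketch
        # Closing an element also closes anything still open inside it.
        while self._close_top() != tag:
            pass

    def handle_data(self, data):
        self.out.append(data)

    def _close_top(self):
        tag = self.open_tags.pop()
        self.out.append(f"</{tag}>")
        return tag


def add_implied_li_end_tags(markup):
    parser = ImpliedLiCloser()
    parser.feed(markup)
    parser.close()
    while parser.open_tags:  # close whatever is still open at the end
        parser._close_top()
    return "".join(parser.out)


print(add_implied_li_end_tags("<ul><li>one<li>two</ul>"))
print(add_implied_li_end_tags("<ul><li>item<ul><li>item</ul></ul>"))
```

In the second call the inner <ul> ends up inside the outer <li>, which is the valid nesting; the end tags are unambiguous even though none were typed.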
I almost always use the optional tags (unless I have a very good reason not to) because it leads to more readable and updateable code.
I am adding some links here to help you with the history of HTML, for you to understand the various contradictions. This is not the answer to your question, but you will know more after reading these various digests.
How Did We Get Here? – Dive Into HTML5
The History of the Web
Brief History of HTML
HTML’s History – HTML WG Wiki
Some excerpts from Dive Into HTML5:
[T]he fact that “broken” HTML markup still worked in web browsers led authors to create broken HTML pages. A lot of broken pages. By some estimates, over 99% of HTML pages on the web today have at least one error in them. But because these errors don’t cause browsers to display visible error messages, nobody ever fixes them.
The W3C saw this as a fundamental problem with the web, and they set out to correct it. XML, published in 1997, broke from the tradition of forgiving clients and mandated that all programs that consumed XML must treat so-called “well-formedness” errors as fatal. This concept of failing on the first error became known as “draconian error handling,” after the Greek leader Draco who instituted the death penalty for relatively minor infractions of his laws. When the W3C reformulated HTML as an XML vocabulary, they mandated that all documents served with the new application/xhtml+xml MIME type would be subject to draconian error handling. If there was even a single well-formedness error in your XHTML page […] web browsers would have no choice but to stop processing and display an error message to the end user.
This idea was not universally popular. With an estimated error rate of 99% on existing pages, the ever-present possibility of displaying errors to the end user, and the dearth of new features in XHTML 1.0 and 1.1 to justify the cost, web authors basically ignored application/xhtml+xml. But that doesn’t mean they ignored XHTML altogether. Oh, most definitely not. Appendix C of the XHTML 1.0 specification gave the web authors of the world a loophole: “Use something that looks kind of like XHTML syntax, but keep serving it with the text/html MIME type.” And that’s exactly what thousands of web developers did: they “upgraded” to XHTML syntax but kept serving it with a text/html MIME type.
Even today, millions of web pages claim to be XHTML. They start with the XHTML doctype on the first line, use lowercase tag names, use quotes around attribute values, and add a trailing slash after empty elements like <br /> and <hr />. But only a tiny fraction of these pages are served with the application/xhtml+xml MIME type that would trigger XML’s draconian error handling. Any page served with a MIME type of text/html — regardless of doctype, syntax, or coding style — will be parsed using a “forgiving” HTML parser, silently ignoring any markup errors, and never alerting end users (or anyone else) even if the pages are technically broken.
XHTML 1.0 included this loophole, but XHTML 1.1 closed it, and the never-finalized XHTML 2.0 continued the tradition of requiring draconian error handling. And that’s why there are billions of pages that claim to be XHTML 1.0, and only a handful that claim to be XHTML 1.1 (or XHTML 2.0). So are you really using XHTML? Check your MIME type. (Actually, if you don’t know what MIME type you’re using, I can pretty much guarantee that you’re still using text/html.) Unless you’re serving your pages with a MIME type of application/xhtml+xml, your so-called “XHTML” is XML in name only.
[T]he people who had proposed evolving HTML and HTML forms were faced with two choices: give up, or continue their work outside of the W3C. They chose the latter, registered the whatwg.org domain, and in June 2004, the WHAT Working Group was born.
[T]he WHAT working group was quietly working on a few other things, too. One of them was a specification, initially dubbed Web Forms 2.0, which added new types of controls to HTML forms. (You’ll learn more about web forms in A Form of Madness.) Another was a draft specification called “Web Applications 1.0,” which included major new features like a direct-mode drawing canvas and native support for audio and video without plugins.
In October 2009, the W3C shut down the XHTML 2 Working Group and issued this statement to explain their decision:
When W3C announced the HTML and XHTML 2 Working Groups in March 2007, we indicated that we would continue to monitor the market for XHTML 2. W3C recognizes the importance of a clear signal to the community about the future of HTML.
While we recognize the value of the XHTML 2 Working Group’s contributions over the years, after discussion with the participants, W3C management has decided to allow the Working Group’s charter to expire at the end of 2009 and not to renew it.
The ones that win are the ones that ship.
The reason I ask is because you just know it's optional for compatibility reasons; and they would have made them (mandatory | forbidden) if they could have.
That's an interesting inference. My reading of it is that just about any time a tag could be reliably inferred, the tag is optional. The design suggests that the intention was to make it quick and easy to write.
What did HTML 1, 2, 3 do with regards to these, now optional, closing tags.
The DTD for HTML 2 is embedded in the RFC which, along with the original HTML DTD, has optional start and end tags all over the place.
HTML 3 was abandoned (thanks to the browser wars) and replaced with HTML 3.2 (which was designed to describe the then current state of the web).
What does HTML 5 do?
HTML 5 was geared towards "paving the cowpaths" from the outset.
And what should I do?
Ah, now that is subjective and argumentative :)
Some people think that explicit tags are better for readability and maintainability by virtue of being in front of the reader's eyes.
Some people think that inferred tags are better for readability and maintainability by virtue of not cluttering up the editor.
What does HTML 5 do?
The answer to this question is in the W3C Working Draft:
http://www.w3.org/TR/html5/syntax.html#syntax-tag-omission
And what should I do?
It's a matter of style. I try to never omit end tags because it helps me to be rigorous and not omit tags that are necessary.
If it is superfluous, leave it out.
If it serves a purpose (even a seemingly trivial purpose, such as appeasing your IDE or appeasing your eyes), leave it in.
It's rare in a well-defined spec to see optional items that do not affect behavior. With the exception of "comments", of course. But the HTML spec is less of a design spec, and more of a document of the state of current major implementations. So when an item is optional in HTML and it seems to serve no purpose, we may guess that its optional nature is merely documentation of a quirk in a specific browser.
Looking at the HTML5 spec section linked above, you see that the optional tags are strangely linked to the presence of comments! That should tell you that the authors are not wearing design hats. They are instead playing the game of "document the quirks" in major implementations. So we can't take the spec too seriously in this respect.
So, the solution is: Don't sweat it. Move on to something that actually matters. :)
I think the best answer is to include closing tags for readability or error detection. However, if you have lots of generated HTML (say, tables of data), you could save significant bandwidth by omitting optional tags.
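For generated tables, the saving is easy to estimate. Here is a rough Python sketch; render_table and size_in_bytes are hypothetical helpers and the 1000x3 table is invented data. It counts uncompressed bytes only, so if the server compresses responses the real difference on the wire will be smaller.

```python
from html import escape


def render_table(rows, include_optional_end_tags):
    """Render rows as an HTML table, with or without </td> and </tr>."""
    td_end = "</td>" if include_optional_end_tags else ""
    tr_end = "</tr>" if include_optional_end_tags else ""
    lines = ["<table>"]
    for row in rows:
        cells = "".join(f"<td>{escape(cell)}{td_end}" for cell in row)
        lines.append(f"<tr>{cells}{tr_end}")
    lines.append("</table>")
    return "\n".join(lines)


def size_in_bytes(markup):
    return len(markup.encode("utf-8"))


# Invented sample data: 1000 rows of 3 short cells each.
rows = [[str(r * 3 + c) for c in range(3)] for r in range(1000)]

full = render_table(rows, include_optional_end_tags=True)
lean = render_table(rows, include_optional_end_tags=False)
saved = size_in_bytes(full) - size_in_bytes(lean)

print(size_in_bytes(full), size_in_bytes(lean), saved)
```

Each row drops three </td> tags and one </tr>, i.e. 20 bytes, so the saving grows linearly with the number of rows.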
My recommendation is that you omit most optional close tags, and all optional attributes that you can get away with. Many IDEs will complain, so you may not be able to get away with omitting some of these, but it is generally better for smaller file size and less clutter. If you have code generators, definitely omit end tags there because you can get some good size reduction from it. Usually it doesn't really matter one way or the other.
But when it does matter, act on it. On some recent work of mine I was able to reduce the size of my rendered HTML from 1.5 MB to 800 KB by eliminating most of the generated end tags, along with redundant value attributes on opening tags where the text of the element was the same as the value. I have about 200 tags. I could implement this some other way entirely, but that would be more work ($$$), so this allows me to easily make the page more responsive.
Just out of curiosity, I found that if I removed quotes around attributes that didn't need them I could save 20 KB, but my IDE (Visual Studio) doesn't like it. I also was surprised to find that the really long IDs that ASP.NET generates account for 20% of my file.
The idea that we could ever get any relevant fraction of HTML strictly valid was misguided in the first place, so do whatever works best for you and your customers. Most tools that I have ever seen or used will say they generate XHTML, but they don't really work 100%, and there isn't any benefit to strict adherence anyway.
Personally, I'm a fan of XHTML and, like ghoppe, "I try to never omit end tags because it helps me to be rigorous and not omit tags that are necessary."
but
If you're deliberately using HTML 4.n, one can't argue that including them makes it easier to consume the document, as the notion of well-formedness as opposed to validity is an XML concept, and you lose that benefit when you forbid certain close tags. So the only issue becomes validity... and if it's still valid without them... you might as well save the bandwidth, no?
Using end tags makes dealing with fragments easier because their behaviour is not dependent on sibling elements. This reason alone should be compelling enough. Does anyone deal with monolithic HTML documents anymore?
In some curly-bracket languages like C#, you can omit the curly braces around the body of an if statement if the body is a single statement. For example...
if ([condition])
    [code]
but you can't do this...
if ([condition])
    [code]
    [code]
The third line won't be part of the if statement. Omitting the braces hurts readability, and bugs can be easily introduced and be difficult to find.
For the same reasons, I close all tags. Tags like the img tag do still need to be closed, just not with a separate closing tag.
Do whatever you feel makes the code more readable and maintainable.
Personally I would always be inclined to close <td> and <tr>, but I would never bother with <li>.
If you were writing an HTML parser, would it be easier to parse HTML that included optional closing tags, or HTML that doesn't? I think the optional closing tags being present would make it easier, as I wouldn't have to infer where the closing tag should be.
For that reason, I always include the optional closing tags - on the theory that my page might render faster, as I'm creating less work for the browser's HTML parser.
For elements whose closing tags are forbidden, use syntax like <img />, with the /> closing the tag, which is accepted in XML.

Why is XHTML syntax so widely used in web pages?

First of all let's emphasize that syntax rules don't work alone, but they need the correct Content-type header to be fully interpreted by the clients. Currently web pages cannot be served with the correct XHTML header because Internet Explorer doesn't understand that.
The first advantage usually mentioned is that XHTML requires pages to be well-formed: true, but when browsers treat them as (malformed) HTML nothing enforces this rule, so it's up to you to be a disciplined developer -- but you can be just as disciplined writing good well-formed HTML too.
Another point often mentioned is that XHTML promotes the separation between content and presentation, but even in this case it doesn't really offer anything that can't be done with HTML -- it still depends on the developer since nothing is enforced, and no exclusive tools are offered.
So why do so many developers (including those of famous CMS/blogging software) still use XHTML syntax instead of directly writing what those pages will become anyway (i.e. plain HTML)?
Related fact: Stack Overflow uses HTML strict.
http://en.wikipedia.org/wiki/XHTML
From the wiki:
"The only essential difference between XHTML and HTML is that XHTML must be well-formed XML, while HTML need not be."
It's up to you which one you choose. There is no real difference in terms of what the user sees. Whichever you choose, please try to make it well-formed and make sure that your HTML/XHTML validates and follows the standards.
This probably isn't the actual reason, but it makes them parsable using a regular XML parser.
Sadly, XHTML syntax isn't as widely used as the XHTML doctype. You'd think people would be conscious about it, but a lot of the time (at least a few years ago), an XHTML doctype was used mostly because HTML 4 was being "dissed". That hasn't stopped people from continuing to use HTML syntax though. Open ended <li> and <p> tags, non-terminated <br> and <img> tags, tag attributes not enclosed in quotes, and more hypocritical nonsense.
Currently web pages cannot be served with the correct XHTML header because Internet Explorer doesn't understand that.
Sure they can, provided you're prepared to use content negotiation to serve an application/xhtml+xml content type to those user-agents that say they accept it.
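As a rough illustration of that negotiation, here is a framework-independent Python sketch. It uses a simplified reading of the Accept header (exact media types and q values only, wildcards are not matched), and parse_accept and choose_content_type are hypothetical helper names, not part of any library. A real server would normally also send a Vary: Accept header so caches keep the two variants apart.

```python
XHTML = "application/xhtml+xml"
HTML = "text/html"


def parse_accept(header):
    """Return {media_type: q} for an Accept header (simplified parsing)."""
    prefs = {}
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unreadable q as "not acceptable"
        prefs[media_type] = q
    return prefs


def choose_content_type(accept_header):
    """Serve XHTML only to clients that explicitly list it; else text/html."""
    prefs = parse_accept(accept_header or "")
    xhtml_q = prefs.get(XHTML, 0.0)
    if xhtml_q > 0.0 and xhtml_q >= prefs.get(HTML, 0.0):
        return XHTML
    return HTML


# A client that lists both types with equal preference gets XHTML.
print(choose_content_type("text/html,application/xhtml+xml,*/*;q=0.8"))
# A client that only sends a wildcard falls back to text/html.
print(choose_content_type("*/*"))
```

Falling back to text/html for anything that does not explicitly name application/xhtml+xml is the conservative choice here, since a wildcard alone says nothing about whether the client can actually process XHTML.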
There are a number of reasons, both good and bad, why XHTML is so widely used. Jay Askren has a point about people who use XML in other contexts (I'm one of them), but I doubt if that accounts for much use. If there is a good reason why XHTML is popular, it's most likely that the orthogonality of XML is a very seductive idea. It's simply easier remembering "Always close every tag, always quote the attribute values" than trying to remember all the rules about when you can safely omit tags and leave attributes unquoted etc., even though it results in a more verbose document.
There are other reasons like the fact that it's easier to indent your code if every opening tag has a matching closing one, and if you do, you've got a pretty accurate picture of the DOM laid out in the source code, which can aid with scripting. But I doubt that this is a primary reason.
Using XHTML states an intent, don’t underestimate that (but don’t overestimate this either). Web standards are politics: if nobody cares, nothing is gonna change. Using XHTML (or HTML5) signals “yes, we are in fact interested in the continued development of the standards.”
Furthermore, while clients certainly don’t enforce XHTML rules with a text/html content type, design tools still can do this. XHTML is much easier to support for editors than real HTML (with “real” I mean the whole ugly SGML package). There are good XHTML validators that do much more than HTML validators can (e.g. Schneegans’ XML schema validator).
All in all, many arguments against XHTML are in fact straw-men that aim at some of the poorly-formulated arguments for XHTML. For instance, Microsoft is responsible for publishing long lists of purported XHTML advantages (such as semantic web design). Attacking those arguments is like reductio ad absurdum. But there are good arguments for XHTML.
I suspect a major reason XHTML is so popular is cultural and historical more than anything. XML became quite popular some time ago and it is still used quite heavily. It is good for defining a data model that can be sent over the wire using web services. There are lots of tools/technologies that work with it, such as XSLT and many others. It is natural for a developer to use HTML which is structured like XML, even if there is no real advantage, just because they use XML in other contexts.

Why is XHTML 1.0 Transitional so popular?

My company is looking to replace all websites in the group with a new CMS-based system and similar designs/styling, with E-Commerce functionality being added in a future phase. It's too big a job for me to do in a reasonable time-frame, so we are going to be inviting tenders from agencies.
I'm currently in the process of defining the technical requirements, and I'm intending to dictate that the selected system must have a Strict DOCTYPE and must trigger Standards mode (or Almost Standards Mode) in common browsers, or something to that effect [We have to allow Almost Standards Mode to cater for IE, obviously].
I've done a bit of homework on all of this - I don't want the spec to be limited by my ignorance, after all - but it won't surprise you all that I've found that 'current opinion' is completely divided on what is good practice.
There are plenty of people who advocate HTML4.01 Strict (fair enough), plenty of people who recommend XHTML1.0 Strict served as text/html (I'm OK with this too), some who recommend HTML5 but restricted to HTML4.01 tags (erm... still not sure if this is a good idea or not, but I see the principle), but also a not-inconsiderable number (including people on other SO threads) who recommend XHTML1.0 Transitional.
I just don't understand the reasoning for this... OK, you may happen to want to temporarily use something that has been deprecated, and thus Transitional seems sensible, but some people recommend XHTML Transitional for new build.
After checking out other companies' sites for design inspiration, I've noticed that many sites (if they have any DOCTYPE specified at all) will refer to a Transitional DTD. OK, we all know there is plenty of crap on the web, so perhaps I shouldn't draw too many conclusions. But checking out Web Design Agencies that we've come across, an amazing proportion of them (the vast majority, I would say) are using XHTML 1.0 Transitional.
Fine, so you don't necessarily have to be an expert to call yourself a Web Designer, but the sheer volume of Transitional layouts makes me wonder... Most of the sites seem to be otherwise reasonably designed (CSS layout, validating, accessible etc).
So, having finally got to the point(!), is there some reason why such a large proportion of these agencies are opting for Transitional DOCTYPEs? Am I missing something, something that I need to consider for my new sites?
Edit: Yeah, I realise the purpose of the Transitional DTD - I was just suspicious that so many otherwise-competent web developers are clinging on to deprecated markup. I wonder if you guys are correct and the answer is simply that they are a) too lazy to get their own website to validate, or b) sticking with the default DTD of their preferred IDE.
The key re-assurance for me is that (according to your responses so far) I don't seem to be missing out on some key reason to use a Transitional DTD.
Edit 2: Regarding our CMS project - all the short-listed agencies thankfully seem to have their heads screwed on - Strict, valid and accessible.
I'm going to take a hard-nosed approach and say that Transitional doctypes are popular because some people are lazy or just don't care. Somebody told them they should validate their code, and transitional is the easiest to validate because it allows deprecated markup.
I agree with those who answered this previously and said that transitional doctypes are for transitioning to strict. Those are the only situations where I've used them.
Any new development should definitely be with a strict doctype. The choice of Strict vs. Transitional is more important than XHTML vs. HTML4. I would strongly recommend to you that in your technical requirements you require Strict.
Simply put, XHTML Transitional allows developers to more easily migrate a legacy HTML codebase over to that doctype. The Transitional doctype is more forgiving than Strict, but is meant as a stepping stone toward Strict.
Here's a good article talking about Transitional vs Strict: http://24ways.org/2005/transitional-vs-strict-markup
The default DOCTYPE in MS Visual Studio is XHTML 1.0 Transitional.
XHTML 1.0 Transitional is more forgiving than its Strict counterpart and is the easiest to transition to.
For example in Strict DTD, you would need to write JS to open linked pages in a new window... and target="_blank" would flag an error (eg. from Designing with Web Standards...)
Strict is definitely the way to move forward, but Transitional is the best choice for transitioning.
I think many people use XHTML 1.0 Transitional because it seems advanced, up-to-date or even beyond (XHTML is the future, dude!) on the one hand but on the other hand still does allow this “nasty stuff” like center, font, align, border, target etc.
Most of the web doesn't validate, as either transitional or strict. Transitional is generally easier to work with, but I think a more telling reason it's used more frequently is that it's also the default for editors like Dreamweaver, and many CMSs use it as their default doctype.
Most web authors don't validate their pages, in which case there's very little to choose between the DOCTYPEs and no good reason to change from the IDE default. The only practical difference to browsers between Strict and Transitional is that Strict triggers Standards mode and Transitional triggers Almost Standards mode in Gecko, Presto, WebKit, etc. Pages that actually require Standards mode behaviour are very few and far between and certainly won't affect the overall proportions that you see on the web.
It's worth noting that HTML 5 is being written without a Strict/Transitional distinction, largely because the authors of that recognise that the Transitional DOCTYPEs of XHTML and HTML4 are widely misused.
In XHTML 1.0 Transitional you have <center>, <font> and <strike>, which are not allowed within the strict version.
The reason is mostly portability between other templates. A nice article about that is here.
Good luck bludgeoning your pages into validating if you're using user input in a CMS. XHTML Strict doesn't like really common things like target="_blank" (we can have a debate about that, but people still use it). Additionally, if anyone adds the default youtube embed html, it won't validate.

At the end of the day, why choose XHTML over HTML? [closed]

I wonder why I should use XHTML instead of HTML.
XHTML is supposed to be "modularized", but I haven't seen any server side language take advantage of any of that.
XHTML is also more strict, and I don't see the advantage. What does XHTML offer that I need so bad? How does it make my code "better"?
EDIT: another question I found in the comments: Does XHTML parse faster than HTML?
EDIT2: after reading all your comments and the links, I indeed agree that another post deserves to be the correct answer, so I chose the one that directly links to the best source.
Also, goes to show that people upvote the green comment without even reading it.
You should read Beware of XHTML, which is an informative article that warns about some of the pitfalls of XHTML over HTML.
I was pretty gung-ho about XHTML until I read it, but it does make several valid points, including the following bit:
XHTML 1.x is not “future-compatible”. XHTML 2, currently in the drafting stages, is not backwards-compatible with XHTML 1.x. XHTML 2 will have lots of major changes to the way documents are written and structured, and even if you already have your site written in XHTML 1.1, a complete site rewrite will usually be necessary in order to convert it to proper XHTML 2. A simple XSL transformation will not be sufficient in most cases, because some semantics won't translate properly.
HTML 4.01 is actually more future-compatible. A valid HTML 4.01 document written to modern support levels will be valid HTML 5, and HTML 5 is where the majority of attention is from browser developers and the W3C.
Future compatibility can be huge when working on some projects. The article goes on to make several other good points, but I think that may have stood out the most for me.
Don't mistake the article for a rant against XHTML, the author does talk about the good points of XHTML, but it is good to be aware of the shortcomings before you dive in.
I was going to add this as a comment to one of the other posts, but it grew a little too large.
The fundamental point that most people seem to be missing is the purpose behind XHTML. One of the major reasons for developing the XHTML specification was to de-emphasise presentation-related tags in the markup, and to defer presentation to CSS. Whilst this separation can be achieved with plain HTML, this behaviour isn't promoted by the specification.
Separating meta-markup and presentation is a vital part of developing for the 'programmable web', and will not only improve SEO, and access for screen readers/text browsers, but will also lead towards your website being more easily analysable by those wishing to access it programmatically (in many simple cases, this can negate the need for developing a specific API, or even just allow for client-side scripts to do things like, identify phone numbers readily). If your web-page conforms to the XHTML specification, it can easily be traversed using XML-related tools, and things such as XPath... which is fantastic news for those who want to extract particular information from your website.
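To make the XPath point concrete, here is a minimal sketch using only Python's standard library. The markup, the `tel` class name and the `extract_phone_numbers` helper are all made up for illustration; the only assumption is that the page is well-formed XHTML in the usual XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML page; the class name "tel" is an assumption for this example.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Contact</title></head>
  <body>
    <p>Sales: <span class="tel">555-0100</span></p>
    <p>Support: <span class="tel">555-0199</span></p>
    <p>Open <span class="hours">9-5</span></p>
  </body>
</html>"""

# Prefix-to-namespace map used by the XPath-style queries below.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_phone_numbers(document: str) -> list[str]:
    """Return the text of every <span class="tel"> in a well-formed XHTML document."""
    root = ET.fromstring(document)  # raises ParseError if the page is not well-formed
    spans = root.findall(".//x:span[@class='tel']", NS)
    return [span.text for span in spans]


print(extract_phone_numbers(XHTML))
```

ElementTree only supports a subset of XPath, so anything fancier would need a fuller XPath engine, but the idea is the same: no HTML-specific scraper is needed when the page is valid XML.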
XHTML was not developed for use by itself, but by use with a variety of other technologies. It relies heavily on the use of CSS for presentation, and places a foundation for things like Microformats (whether you love them, or hate them) to offer a standardised markup for common data presentation.
Don't be fooled by the crowd who think that XHTML is insignificant, and is just overly restrictive and pointless... it was created with a purpose that 95% of the world seems to ignore/not know about.
By all means use HTML, but use it for what it's good for, and take the same approach when looking at XHTML.
With regard to parsing speed, I imagine there would be very little difference in the parsing of the actual documents between XHTML and HTML. The trade-off will come purely in how you describe the document using the available markup. XHTML tags tend to be longer, due to required attributes, proper closing, etc. but will forego the need for any presentational markup in the document itself. With that being the case, I think you're talking about comparing one type of apple, with a very slightly different type of apple... they're different, but it's unlikely to be of any consequence (in terms of parsing and rendering) when all you want is a healthy, tasty apple.
For the visitor of a website it probably doesn't make any visible difference. Furthermore, XHTML is usually more of a pain to use as at least one widespread browser still doesn't know how to handle it and you need to serve it as text/html in that case (which yields invalid HTML).
If your HTML is going to be regularly processed by automated tools instead of being read by humans, then you might want to use XHTML because of its stricter structure and because, being XML, it is easier to parse (from an application standpoint; not that XML is inherently easy to parse, though).
Apart from that I don't see any compelling reasons to use it, though. XHTML was created in an approach of making use of XML features for HTML and basically it boils down to "HTML 4 with several annoying side-effects" (IMHO, at least).
Use HTML (HTML4 Strict or HTML5).
HTML can fully utilize CSS, can be validated and parsed unambiguously. Separation of structure and presentation has been done in HTML4 and XHTML merely continued that.
All browsers support HTML. Only some browsers support XHTML, and those that do often have more mature, better tested and better optimized support for HTML (because only a tiny fraction of pages use XML mode).
If you care about IE and Google, you have to use HTML or the subset of XHTML and HTML defined in Appendix C of the XHTML spec. The latter is almost the worst of both worlds, because such XHTML cannot be generated with standard XML tools, cannot use extension mechanisms new to XHTML, and has additional limitations over those in HTML alone.
XHTML1.0 is now over 10 years old. It was designed in "Web1.0" times, and as the head of the W3C said, in retrospect it didn't work out and a better approach is needed. W3C HTML5 is being written as we speak, addresses the needs of web applications used today, and has very good backwards compatibility.
HTML5 closes many gaps that existed between HTML4 and XHTML1 (e.g. it adds inline SVG, MathML and RDF) and cleans up the language beyond what was done in XHTML1.0 and XHTML1.1.
XHTML2 is not going to be supported by web browsers in the foreseeable future. It's likely that it will never be supported (all browser vendors heavily support [X]HTML5, and some have already declared that they won't implement XHTML2).
XHTML1.0 has exactly the same semantics and separation of presentation from structure as HTML4.01. Anybody who says otherwise hasn't read the specification. I encourage everybody to read the spec – it's surprisingly short and uninteresting.
Stylesheets were introduced in HTML4.01 and were not changed in XHTML1.0.
Presentational elements were deprecated in HTML4.01 and were not removed in XHTML1.0.
XHTML myths.
There are no intractable differences between HTML and XHTML that would make parsing of one much slower than the other. It depends on how the parser is implemented.
Both SGML and XML parsers need to load and parse entire DTD in order to understand entities. This alone is usually more work than parsing of the document itself. HTML parsers almost always "cheat" and use hardcoded entities and element information. XHTML parsers in browsers cheat too.
Parsing of HTML requires handling of implied start and end tags, and real-world HTML requires additional work to handle misplaced tags.
Proper parsing of XHTML requires tracking of XML namespaces.
Draconian XML rules require checking if every character is properly encoded. HTML parsers may get away without this, but OTOH they need to look for <meta>.
The overall difference in cost of parsing is tiny compared to time it takes to download document, build DOM, run scripts, apply CSS and all other things browsers have to do.
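The entity point in the list above is easy to see for yourself. This is just a sketch with Python's standard library, not a benchmark: a generic XML parser that has not loaded the XHTML DTD does not know `&nbsp;`, while HTML tooling ships with the entity table hardcoded. The helper name `xml_parser_accepts` is made up for the example.

```python
import html
import xml.etree.ElementTree as ET
from html.entities import name2codepoint


def xml_parser_accepts(fragment: str) -> bool:
    """Return True if a plain XML parser (no DTD loaded) can parse the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# &amp; is one of the five entities predefined by XML itself, so it parses.
print(xml_parser_accepts("<p>a&amp;b</p>"))

# &nbsp; is only defined in the (X)HTML DTDs; without the DTD it is an undefined entity.
print(xml_parser_accepts("<p>a&nbsp;b</p>"))

# HTML tooling "cheats" with a built-in table instead of reading a DTD.
print(name2codepoint["nbsp"])
print(html.unescape("a&nbsp;b") == "a\u00a0b")
```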
I'm surprised that all the answers here recommend XHTML over HTML. I am firmly of the opposite opinion - you should not use XHTML, for the foreseeable future. Here's why:
No browser interprets XHTML as XHTML unless you serve it as mimetype application/xhtml+xml. If you just serve it with the default mimetype, all browsers will interpret it as HTML - eg, accepting unclosed or improperly nested elements.
However, you should never actually do this, as Internet Explorer does not recognise application/xhtml+xml, and would fail to render the page completely.
There are significant differences in the DOM between XHTML and HTML. Since all so-called XHTML pages are being served as HTML at the moment, all javascript code is written using the HTML DOM. If support for the XHTML mimetype becomes significant enough to convince people to start using it, most of their javascript code will break - even if they think their pages validate as XHTML.
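For reference, the mimetype switch mentioned above is sometimes done per request, based on the Accept header the browser sends. The following is only a simplified sketch of that idea (the function name `choose_content_type` is hypothetical, and real content negotiation has more rules than this); given the DOM and IE caveats above, it is not a recommendation to actually do it.

```python
def choose_content_type(accept_header: str) -> str:
    """Pick application/xhtml+xml only when the client lists it explicitly with q > 0.

    Wildcards such as */* are deliberately ignored: a browser that merely
    sends */* gets text/html.
    """
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if media_type != "application/xhtml+xml":
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        if q > 0:
            return "application/xhtml+xml"
    return "text/html"


# A header that lists the XHTML type explicitly.
print(choose_content_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
# A header with only image types and a wildcard.
print(choose_content_type("image/gif, image/jpeg, */*"))
```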
Instead of continuing to debate HTML 4.01 Strict vs XHTML Strict, I would suggest starting to use HTML 5 today. John Resig, the author of jquery, made a similar suggestion last year on his blog.
The HTML 5 doctype, in its beautiful simplicity, will trigger standards mode in all browsers (including IE6).
<!DOCTYPE html>
That's it.
HTML 5 provides some exciting new features such as the <canvas> tag which potentially can push javascript application development to the next level. HTML 5 also has proper support for media (and media is a fairly important aspect of the web these days!) in the form of <video> and <audio> tags.
If you like the syntax of XHTML, i.e. closing "empty" tags such as <br />, that is fully supported in HTML 5. From Karl Dubost of the W3C's post Learn How To Write HTML 5:
auto-closing tag is allowed and conformant in HTML 5.
XHTML2 has received relatively little attention compared to HTML 5. It's becoming increasingly clear that HTML 5 is the future of markup on the web. Microsoft's latest browser, IE8 still renders XHTML served as text/xml as text/html.
Microsoft have a co-chair on the W3C HTML working group and there's an implied support from them for HTML 5. All of the browser vendors have publicly announced their support for HTML 5.
At the end of the day, even if XHTML2 regains support from the industry, it won't be a significant issue having two competing standards as it has been in the past. Both languages support XML namespaces (in the case of HTML 5, serialization of HTML i.e. DOCTYPE switching).
As a programmer, you should be VERY concerned about your code. HTML is ugly and follows few rules.
XHTML on the other hand, turns HTML into a proper language, following strict structural and syntactic rules.
XHTML is better for everyone, as it will help move the web to a point where everyone (all browsers) can agree on how to display a web page.
XHTML is an XML descendant, and as such is much easier on parsers built for the job of analysing syntactically sound XML documents.
If you can't see the benefit of XHTML, you might as well be using MS Word to create your HTML documents.
Take a look at http://www.w3.org/MarkUp/2004/xhtml-faq#need. There are some good reasons apart from modularisation.
I favor XHTML because it's stricter and more clearly laid out. HTML is quirky and browsers have to accept things like <b><i>sadasd</b></i>.
While this is a really simple example, it could also get more confusing and different browsers could lay out things differently.
Also I think that XHTML has to be "faster" since the browser doesn't have to do that kind of "repair".
Some differences are:
XHTML tags must be properly nested
The documents must have one root element
XHTML tags are always in lowercase
Tags must always be closed (e.g. the <br> tag must be written as <br /> or <br></br> in XHTML)
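The nesting and closing rules listed above can be checked mechanically. A rough sketch with Python's standard library follows; the `TagLogger` class and the `is_well_formed` helper are names invented for this example. Note that `html.parser` is only a tokenizer, so it shows the leniency (no error is raised) but does not reproduce the tree repair a browser performs.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Lenient HTML tokenizer that records start and end tags in the order seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, i.e. is properly nested and closed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


MISNESTED = "<p><b><i>sadasd</b></i></p>"

# The HTML tokenizer accepts the mis-nested tags without complaint.
logger = TagLogger()
logger.feed(MISNESTED)
logger.close()
print(logger.events)

# The XML parser rejects mis-nested and unclosed tags outright.
print(is_well_formed(MISNESTED))
print(is_well_formed("<p><b><i>sadasd</i></b></p>"))
print(is_well_formed("<p>line<br>break</p>"))
print(is_well_formed("<p>line<br />break</p>"))
```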
Here are some links on it
wiki XHTML
wiki HTML vs XHTML
XHTML lets you use all the tools designed for XML. Among them are XSLT, embedded SVG, etc.
Interesting development: XHTML 2 Working Group Expected to Stop Work End of 2009, W3C to Increase Resources on HTML 5
2009-07-02: Today the Director announces that when the XHTML 2 Working Group charter expires as scheduled at the end of 2009, the charter will not be renewed. By doing so, and by increasing resources in the Working Group, W3C hopes to accelerate the progress of HTML 5 and clarify W3C's position regarding the future of HTML. A FAQ answers questions about the future of deliverables of the XHTML 2 Working Group, and the status of various discussions related to HTML. Learn more about the HTML Activity.
Well, I guess that makes the future of HTML pretty clear.
XHTML forces you to be neat.
For example, in HTML, you can write:
<img src="image.jpg">
This isn't very logical, because the img tag never gets closed. In XHTML, however, you're forced to close the tag neatly, like this:
<img src="image.jpg" />
I like using something that forces me to be neat.
Steve
The subtitle to the XHTML 1.0 recommendation:
A Reformulation of HTML 4 in XML 1.0
Many tools exist today to process XML. By using XHTML, you are allowing a huge set of tools to operate on your pages and to extract information programmatically.
If you were to use HTML, this would be possible too. There are tools in existence to parse HTML DOM trees. However, these tools can often be more specialized than those for XML. You may not find your favorite XML data processing tools compatible with HTML. Furthermore, there are so many uses for XML nowadays that you may be using XML for some other part of an application; why not also use that same XML parser to parse your web pages? This is the motivation behind XHTML.
If you're already comfortable and familiar with HTML 4.01, you have an established project using HTML 4, and you don't have tons of spare time, just go with HTML 4.01. If you have spare time, learn XHTML 1.1 anyway, and start your new projects in XHTML 1.1 – there's no harm in doing so. If you're using something other than HTML 4.01 or are pretty unfamiliar with HTML 4 anyway, just learn XHTML 1.1.
Using XHTML with the correct DocType will force the browser to render the content in a more standards compliant (strict) mode. This makes the different browsers behave better and, most importantly, more like each other. This makes your job as a webdeveloper a lot easier since it reduces the amount of browser specific tweaks needed to make the content look the same in all browsers.
Quirksmode.org has a lot of good info on this subject.
In my opinion, the strictness is, at least in theory, a good thing, because in HTML, you don't need to be strict, and because of that and the HTML5 junk, Browsers have advanced error correction algorithms that will make the best out of broken HTML. The problem is, the algorithms are not exactly the same and will lead to really strange behaviour you can't predict. With XHTML, on the other hand, you typically have fine, valid XHTML and so the error correction algorithms are not needed, i.e. the entire Browser behaviour is predictable. In addition, strict code makes it easier for your tools to work with the code. So you have actually nothing to lose by using XHTML, but there is some potential to gain. Things will get worse with plain HTML when HTML5 is finally out and the "be open in what you accept" will lead to the described strange behaviour. But at least then it's a standardized strange behaviour. Sigh.
On the other hand, if you use a good IDE like Visual Studio, it's almost impossible to produce broken HTML code anyway, so the result is the same.
Use XHTML
Fails fast. If there are any inconsistencies they will be found during validation.
It encourages better design by separating semantic markup from presentation etc.
It's structured which means that you can treat it as a data object and run all sorts of queries against it. For example you could find all addresses or citations within your website.
You can do build-time optimizations. Since it's well-formed XML you can easily do find/replace operations during build time. Or any document management and manipulation.
You can write XSLT or other transformation scripts to programmatically transform your XHTML for other platforms. For example, you could have an XSLT for the iPhone that would transform all XHTML to make it compatible or more user-friendly for the iPhone.
You are future proofing yourself. Transforming XHTML to newer semantics is again, very easy using transformation.
Search engines will continue to evolve to gather more semantic information as part of the programmable web.
DOM operations are more reliable since it's structured.
From an algorithmic perspective, it yields easier and faster parsing.
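As an illustration of the build-time find/replace idea in the list above, here is a small sketch that rewrites relative image URLs in a well-formed XHTML page. Everything specific in it is an assumption for the example: the `prefix_image_urls` helper, the sample page and the `cdn.example.com` prefix. It uses Python's ElementTree rather than XSLT, simply because that ships with the language.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialise XHTML elements without an "ns0:" prefix.
ET.register_namespace("", XHTML_NS)

# Hypothetical page used only for this example.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <img src="logo.png" alt="Logo" />
    <img src="http://example.org/a.png" alt="A" />
  </body>
</html>"""


def prefix_image_urls(document: str, prefix: str) -> str:
    """Prepend `prefix` to every relative <img src> and return the new markup."""
    root = ET.fromstring(document)
    for img in root.iter(f"{{{XHTML_NS}}}img"):
        src = img.get("src", "")
        # Leave absolute URLs (anything containing a scheme separator) untouched.
        if src and "://" not in src:
            img.set("src", prefix + src)
    return ET.tostring(root, encoding="unicode")


print(prefix_image_urls(PAGE, "https://cdn.example.com/"))
```

The same job on tag-soup HTML would need an HTML-aware parser first; with XHTML any XML toolkit will do.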
XHTML is a good starting point because, if you want valid code, you have to provide some help to the disabled community: screen readers need the alt and title parts of the image and link tags.
It must be faster to parse to an extent because unlike HTML the parser wouldn't need to check to see if the tag wasn't closed properly, if it was nested correctly etc.
Also it is better to use it because yes it is strict but it helps you to think more logically (in my opinion) when it comes to learning programming languages.
I believe XHTML is (or should be) faster to parse. A valid XHTML document must be written to a stricter spec in that errors are fatal when parsing, whereas HTML is more lenient and allows for oddities mentioned before my comment like out of order closing tags and such. I found this helpful in uncovering the differences between HTML and XHTML parsing:
http://wiki.whatwg.org/wiki/HTML_vs._XHTML#Parsing
A reason you might use XHTML over HTML might be if you intend to have mobile users as part of your audience. If I recall, many phones use something more of an XML parser, rather than an HTML one to display the web. If you are writing for desktop browsers, HTML would probably be acceptable.
That said, if you are going to serve the data as text/html anyway, you should use HTML:
http://www.hixie.ch/advocacy/xhtml

What problem does XHTML strict solve?

I really don't understand the fascination with XHTML strict. Inline JavaScript typically requires a rats nest of escapes to make it compatible with XHTML and semi-backwards compatible with MSIE 5 & 6. Then there is the issue of not being OCD enough on user input to make sure you don't miss any illegal characters. It just seems like more effort than it's worth. Never mind that almost every developer I've worked alongside keeps forgetting to ensure the content-type returned from the server is reset for XHTML pages from text/html to application/xhtml+xml.
Wish I knew the name of the blogger, but someone else pointed out that a majority of supposedly XHTML compliant websites and open source packages are actually not because of that last issue, forgetting to set the content-type header correctly.
I'm looking to understand why XHTML is useful, or build enough of an arsenal of arguments to prevent it ever being used in future projects that I have influence on.
XHTML1 vs HTML4 and Strict vs Transitional are completely orthogonal issues.
XML might not give any huge advantage to browsers today, but on the server end it's an order of magnitude easier to process documents using XML than trying to parse the mess that is old-school-SGML-except-not-really HTML4.
Restricting yourself to [X]HTML Strict doesn't achieve anything in itself, other than simply that it discourages the use of old, less-maintainable techniques you shouldn't be using anyway.
Inline javascript typically requires a rats nest of escapes to make it compatible with XHTML
You can get away without any escapes as long as you don't use the characters < or &. And ‘//<![CDATA[’ isn't really much worse than ‘<!--’ was in the old days.
In any case, keeping the scripting external is much more manageable; you don't want to be doing anything significant inline.
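To illustrate the escaping point, here is a quick sketch that feeds an inline script to an XML parser with and without a CDATA wrapper. The script content and the `parses_as_xml` helper are made up for the example; the `//` in front of the CDATA markers is the usual trick to hide them from the JavaScript engine when the page is served as text/html.

```python
import xml.etree.ElementTree as ET

# Inline script using < and & with no escaping: fine as HTML, not well-formed XML.
PLAIN = "<script>if (a < b && c) { go(); }</script>"

# The same script wrapped in a CDATA section, with the markers commented out for JS.
WRAPPED = "<script>//<![CDATA[\nif (a < b && c) { go(); }\n//]]></script>"


def parses_as_xml(markup: str) -> bool:
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(PLAIN))
print(parses_as_xml(WRAPPED))

# Inside CDATA the characters come through untouched.
print(ET.fromstring(WRAPPED).text)
```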
Then there is the issue of not being OCD enough on user input to make sure you don't miss any illegal characters.
Out-of-band characters are exactly as invalid in HTML4 Transitional as in XHTML1 Strict.
If you're accepting user-submitted HTML and not checking/escaping it with enough of a fine tooth comb to prevent well-formedness errors you have much bigger problems than just complying with a doctype. You'll be letting injection hacks through and making your site vulnerable to cross-site-scripting security holes.
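As a minimal sketch of the escaping described above, assuming the user input is meant to be shown as plain text rather than accepted as markup (the `render_comment` helper and the sample input are hypothetical): escaping the special characters keeps the page well-formed and stops the input from being interpreted as tags.

```python
import html
import xml.etree.ElementTree as ET


def render_comment(user_text: str) -> str:
    """Wrap untrusted text in a paragraph, escaping &, <, > and quotes."""
    return '<p class="comment">' + html.escape(user_text, quote=True) + "</p>"


hostile = '<script>alert("x")</script> & more'
rendered = render_comment(hostile)
print(rendered)

# The result is well-formed, and a parser recovers the original text, not a <script> element.
paragraph = ET.fromstring(rendered)
print(paragraph.text == hostile)
print(len(paragraph))  # number of child elements
```

This does not cover everything (control characters that XML forbids would still need filtering, and accepting real user-submitted HTML needs a proper sanitiser), but it shows that the work is the same for HTML and XHTML.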
forgetting to ensure the content-type returned from the server is reset for XHTML pages from text/html to application/xhtml+xml.
It's not ‘forgetting’, it's deliberate: there is not really that much point in serving application/xhtml+xml today. To account for IE you have to sniff UA, and then make sure you understand the CSS and JavaScript differences that pop up in both parsing modes... you can do it to prove your technical prowess, but it doesn't really get you anything.
Serving XHTML as legacy HTML may not be ideal, but it lets you keep the simpler, more processable syntax of XML (and potential interoperability with other XML languages like SVG) whilst still being browser-friendly.
People complain about the pickiness of the well-formedness errors, but having those errors picked up straight away for you to fix them is way better than leaving them there silently, ready to trip up some future browser.
There is a great post about the usage of XHTML: Beware of XHTML.
Hope it helps,
Bruno Figueiredo
XHTML 1.0 Strict tries to solve four problems:
XML is W3C technology and HTML4 wasn’t using it. Not your problem.
Strict seeks to be more theoretically pure than Transitional when it comes to presentationalism. But this is not an XHTML vs. HTML issue.
XML parser is supposedly simpler. (Not entirely true; the code for dealing with the DTD part is pretty complex.) These days, you get both XML and HTML parsers off-the-shelf, so this isn’t your problem. (Aside: the mobile argument is utterly bogus.)
application/xhtml+xml (though not valid XHTML 1.0 Strict!) allows you to mix other vocabularies. If you want to use inline MathML or SVG today, this is the main reason to use application/xhtml+xml today. However, the direction the HTML5 work is taking is making it possible to use MathML and SVG in text/html.
XHTML is useful because it's much easier to create a simple transforming stylesheet or roll your own parser for it, than it is for HTML.
Do you have to parse your HTML with a program, or for some tests? Then use XHTML.
For everything else, HTML 4.01 (strict, loose, transitional, whatever) is perfectly "standard" and less "troublesome".
XHTML enables advanced rendering such as SVG (Scalable Vector Graphics), which is itself an XML language and can easily be embedded in XHTML through XML namespaces, without <embed> or <object>. Unfortunately, only Firefox and Safari support it. Sorry, IE6 users.
For more on SVG, see http://en.wikipedia.org/wiki/Svg
XHTML makes HTML orthogonal with all the other xml-based structures in our universe, which has two primary benefits.
Design patterns we use in dealing with xml can be applied to html.
Software tools ditto.
XHTML has the advantages of xml. But then why the strict variant?
I see some similarities with deprecated functions. You can still use them this version, but they are possibly removed the next version. So I see the transitional version as deprecated use. It still works and it will work for a couple of versions, but if you want to build for the future, use the strict version.
Strict is intended to formalize the separation between content and style by making it more difficult to commingle the two. Elliotte Rusty Harold has a good write up on XHTML in one of his books, here's the relevant excerpt on 'Why XHTML'.
The only thing I've seen solved by XHTML is the "problem" of users using Safari: I don't know if the bug is still there, but when we were last asked to write in XHTML, we ran across a bug that made XHTML unusable with Safari. In XHTML, the following URL isn't allowed in anchor tags, because the ampersand isn't escaped:
http://www.example.com/page.php?arg1=val1&arg2=val2
so what you have to do is replace it with &amp; like this:
http://www.example.com/page.php?arg1=val1&amp;arg2=val2
but Safari converts &amp; to &#38; so you get this URL:
http://www.example.com/page.php?arg1=val1&#38;arg2=val2
...and the hash symbol ends the URL as far as PHP is concerned. I know that there are ugly hacks that allow you to pass two variables in other ways, but if XHTML is going to force you to use ugly hacks, then you're better off without it.
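For comparison, here is what a conforming parser is supposed to do with an escaped ampersand in an attribute. This is only a sketch with Python's standard library (the `anchor_for` helper is a made-up name) and says nothing about the Safari behaviour described above: the ampersand is written as an entity or character reference in the markup, and the parser hands back the plain URL.

```python
import html
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit

URL = "http://www.example.com/page.php?arg1=val1&arg2=val2"


def anchor_for(url: str, label: str) -> str:
    """Build an <a> element with the URL escaped for use inside an attribute."""
    return f'<a href="{html.escape(url, quote=True)}">{html.escape(label)}</a>'


markup = anchor_for(URL, "link")
print(markup)  # the & appears as &amp; in the markup

# After parsing, the attribute value is the original URL again.
parsed_href = ET.fromstring(markup).get("href")
print(parsed_href == URL)
print(parse_qs(urlsplit(parsed_href).query))

# A numeric character reference for the ampersand is resolved the same way.
numeric = '<a href="http://www.example.com/page.php?arg1=val1&#38;arg2=val2">link</a>'
print(ET.fromstring(numeric).get("href") == URL)
```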
Personally, I liked the concept of XHTML: much cleaner than most HTML we can see, easier to parse and validate. Like everybody, I started to code XHTML pages. BTW, I don't see an issue with inline JavaScript: no need for escapes if you put the code in CDATA. And IE5 is fortunately a bit out of the browser landscape, like Netscape 4, which forced us to write / > instead of />, something I still see in pure XML sometimes...
Now, I have read a number of articles, like the one linked by Bruno, which has a lot of good arguments against its use in most cases. Basically, it says most browsers just aren't ready for strict XHTML (served as XML), it doesn't make much sense to serve XHTML as HTML, and anyway it isn't that useful for the majority of sites.
Look at the arguments above: they are perfectly valid, and it is great to be able to put MathML or SVG directly in the page, to transform XML with an XSLT parser, to process the page with an XML parser.
But how often do you do that? Parsing the page is most often the problem of end users, which can use a good HTML parser. And given the number of browsers able to manage MathML, SVG or XSLT, it is more a need for intranet than for the vast Internet.
You can have an e-commerce or a blog or a forum, which spits out good XHTML pages. And the persons writing the descriptions, articles or messages insert <p><p><p> to skip some lines, when it isn't <p/> or some other exotic construct...
I believe in XHTML, but I think I will no longer use it for the little pages I do for my site. I will use HTML 4 with well written code (quoted attributes, closing tags even if optional, etc.).
And after all, if the W3C is working on HTML 5, it is for a reason: HTML still has a life ahead of it, otherwise it would have been killed in favor of XHTML 2.
XHTML is by definition XML, unlike HTML.
This means you can do funky useful stuff with it, such as easily validate and parse it (since you know it's XML and thus can use the myriad of tools available).
Also, geeks like to make things "more correct" ;-)
This is a global standard issue
This is not just about xHTML, but about all the standards in the world. You need to make things clearer, from version to version.
xHTML is square and pushes coders to add semantic value to the code. It's fully XML compatible and therefore more easily parseable, stylable, etc.
Remember that code is not just for coders, but for machines too. In 10 years, people creating browsers or libraries won't want to implement the same complex rules for old HTML processing, but will rather expect something as clean as possible.
Search engines need something to rely on to build semantic links between values, so it's better if there is only one easy way to do it.
And I am not talking about screen readers...
Standards are, above all, about moving toward one open solution that fits everybody's needs, not just about adding shiny new features.