Yet another question regarding the html5 dtd/schema - html

If there is no DTD or schema to validate the H5 document against, how are we supposed to do document validation? And by document validation, I mean "how are we supposed to ensure our html5 documents are both syntactically accurate and structurally sound?" Please help! This is going to become a huge problem for our industry if we have no way to accurately validate HTML5 documents!
Sure, the W3C has an online tool that validates individual pages. But, if I'm creating A LOT of pages (hundreds, say) and I want to validate them in a sort of batch mode, what is the accepted method of ensuring valid structure and syntax? I mean, it seems rather rudimentary to just look at the document and say "yep. that's a valid xml document." What about custom tags? What about tag attributes? It seems like the W3C is leaving us out in the cold a little bit here.
Maybe the best answer will be found in the HTML editor. But then you get DTD/schema fragmentation. Each editor vendor coming up with their own rendition of what a valid structure is.
Maybe the answer is "wait for HTML5 to become official". But I really can't wait for that. I need to start creating and validating content now. I have applications I want to publish that can only be accomplished with html5.
So, any thoughts?

If there is no DTD or schema to validate the H5 document against, how are we supposed to do document validation?
With a specialized HTML5 validator rather than a generic SGML or XML validator.
Obviously, as the specification is still in draft form, the tools that do exist are immature and likely to be out of date or become out of date.
Sure, the W3C has an online tool that validates individual pages. But, if I'm creating A LOT of pages (hundreds, say) and I want to validate them in a sort of batch mode, what is the accepted method of ensuring valid structure and syntax?
Either use a different tool or download the W3C validator and run a local copy. It has a SOAP API so writing a batch validation tool isn't difficult.
What about custom tags?
HTML5 doesn't allow custom elements.
What about tag attributes?
The only custom attributes in HTML5 are data-* attributes, so an HTML 5 validator can recognize them.
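To illustrate how a checker can recognize data-* attributes by name alone, here is a minimal Python sketch. The function name is made up for this example, and the rule is simplified: the spec also requires the name to be XML-compatible, which this does not check.

```python
import re

# Simplified rule: "data-" followed by at least one character, with no ASCII
# uppercase letters, whitespace, or attribute-syntax characters (=, /, >).
# The XML-compatibility requirement from the spec is NOT checked here.
_DATA_ATTR = re.compile(r"data-[^\sA-Z=/>]+")


def is_custom_data_attribute(name: str) -> bool:
    """Return True if `name` looks like a valid data-* attribute name."""
    return _DATA_ATTR.fullmatch(name) is not None


for candidate in ("data-user-id", "data-", "data-userId", "user-id"):
    print(candidate, is_custom_data_attribute(candidate))
```

Anything that fails this test and is not an attribute defined by the spec is what a validator would report as an error.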
It seems like the W3C is leaving us out in the cold a little bit here.
It seems like you expect the state of QA tools for HTML 5 (unfinished) to be up to the same standard as those for HTML 4 (over a decade old). This isn't a realistic expectation.
Maybe the best answer will be found in the HTML editor. But then you get DTD/schema fragmentation. Each editor vendor coming up with their own rendition of what a valid structure is.
The specification is clear (although in flux) even if it isn't expressed in the form of a DTD or schema. If each editor has a different idea of what is valid, then most or all of them are going to be either out of date or just buggy.
Maybe the answer is "wait for HTML5 to become official". But I really can't wait for that. I need to start creating and validating content now. I have applications I want to publish that can only be accomplished with html5.
If you need to live in the bleeding edge, then you have to accept the limitations and risks of doing so.

You might find this question/answer interesting: Will HTML 5 validation be worth the candle? . The answer is written by the developer of http://about.validator.nu/ .

You should start by taking a look at http://about.validator.nu/ .
Some, though not all, of your concerns are addressed there. You can host your own validator, there's a Python-based submission script, you can use a RESTful web service API, and there are ways to get validation output in a variety of different forms.
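For batch checking through the web service API, something along these lines may be enough. This is only a sketch: the URL assumes a self-hosted checker instance listening on localhost port 8888 (adjust it to your own setup), and the function names are invented for this example. It posts each document as the request body and counts the messages of type "error" in the JSON report.

```python
import json
import urllib.request
from pathlib import Path

# Assumption: a self-hosted checker instance is listening here. Point this
# at whichever instance you are allowed to send batch traffic to.
CHECKER_URL = "http://localhost:8888/?out=json"


def build_request(html: bytes, url: str = CHECKER_URL) -> urllib.request.Request:
    """Build a POST request carrying one HTML document as the body."""
    return urllib.request.Request(
        url,
        data=html,
        headers={"Content-Type": "text/html; charset=utf-8"},
        method="POST",
    )


def count_errors(response_body: str) -> int:
    """Count messages of type "error" in a JSON validation report."""
    report = json.loads(response_body)
    return sum(1 for m in report.get("messages", []) if m.get("type") == "error")


def validate_tree(root: str) -> dict:
    """Validate every .html file under `root`; map each path to its error count."""
    results = {}
    for path in sorted(Path(root).rglob("*.html")):
        with urllib.request.urlopen(build_request(path.read_bytes())) as resp:
            results[str(path)] = count_errors(resp.read().decode("utf-8"))
    return results
```

Calling validate_tree("site/") (a hypothetical directory) would then give one error count per file. Run it against your own instance rather than the public service if you have hundreds of pages.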
I can't however see a simple way to integrate XHTML5 with other applications of XML such that one can easily create a validator of such compound documents. Not that there's really been a way to do that with earlier versions of XHTML either though.

This is working well for me: https://github.com/hober/html5-el
To get this to work, I renamed the default '/etc/schema/schemas.xml' file in order to move it out of the way and let the 'html5-el' one be used by nxml-mode.

If there is no DTD or schema to validate the H5 document against, how are we supposed to do document validation? And by document validation, I mean "how are we supposed to ensure our html5 documents are both syntactically accurate and structurally sound?" Please help! This is going to become a huge problem for our industry if we have no way to accurately validate HTML5 documents!
If testing pages with either Firefox or Opera, both of those will report errors such as code that is not "well-formed" and mismatched tags. Beyond that, one of the validators such as validator.w3.org or validator.nu will definitely help.
Sure, the W3C has an online tool that validates individual pages. But, if I'm creating A LOT of pages (hundreds, say) and I want to validate them in a sort of batch mode, what is the accepted method of ensuring valid structure and syntax? I mean, it seems rather rudimentary to just look at the document and say "yep. that's a valid xml document."
There are ways to run the W3C validator in batch mode.
What about custom tags? What about tag attributes? It seems like the W3C is leaving us out in the cold a little bit here.
The easy answer to that one is that "custom tags" are simply not considered valid. The Working Group has thoroughly addressed the issue of "distributed extensibility", particularly with respect to allowing "decentralized parties to create their own languages" and "extension attributes" (http://lists.w3.org/Archives/Public/public-html/2011Feb/0085.html). There are numerous ways to extend HTML (http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#extensibility) but adding custom tags is not one of them. Custom data and microdata attributes should validate fine.
Maybe the answer is "wait for HTML5 to become official". But I really can't wait for that. I need to start creating and validating content now. I have applications I want to publish that can only be accomplished with html5.
Since HTML 5 was stabilized at the end of last year (Dec. 2010), IMO we don't need to wait for it to become an official "recommendation" by the W3C. The stabilized spec provides a solid base that all browser vendors can implement consistently and for the ongoing evolution beyond HTML 5 of the spec, which is now being called the "HTML Living Standard" (Jan. 2011 and later). There is a good diagram of this at http://www.HTML-5.com/html-versions-and-history.html#html-versions (scroll down to see the diagram).

Related

Is there a better approach to html standards validation than w3c validator?

Granted, this is a very generic question, but I am wondering if w3c validation is considered a best practice for html validation, or if there are better approaches to ensure contemporary standards-compliant markup.
This question arose when I noticed duplicate IDs on an MDN page (a site I would have assumed would be very strict about its coding practices). It appeared to be an artifact of how they generated the sections of the page.
Curious, I validated the page's code on the w3c validator, and there were various "errors" that suggested that MDN was just ignoring that a certain attribute or value was not valid. Generally, these related to seemingly appropriate uses of rel attributes.
I was left wondering if standards for valid, semantic markup matter less, or if there's a new ideal approach to code validation and standardization than relying on w3c validation.
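For what it's worth, the duplicate-ID case mentioned above is easy to check locally without any validator. The following is a rough sketch using Python's standard html.parser (a tokenizer, not a full conformance checker); the class and function names are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute value seen on a start tag."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] += 1


def duplicate_ids(html: str) -> list:
    """Return the id values that occur more than once, sorted."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return sorted(i for i, n in collector.ids.items() if n > 1)


print(duplicate_ids('<div id="a"></div><p id="b"></p><span id="a"></span>'))
```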
Maintainer of the current W3C HTML Checker (validator) here. I think it's important to understand the intended purpose of the current HTML checker, which is different from the purpose of the legacy W3C Markup Validator.
The purpose of the checker is documented at https://validator.w3.org/nu/about.html#why-validate:
The core reason to run your HTML documents through a conformance checker is simple: To catch unintended mistakes—mistakes you might have otherwise missed—so that you can fix them.
Beyond that, some document-conformance requirements (validity rules) in the HTML spec are there to help you and the users of your documents avoid certain kinds of potential problems.
There are some markup cases defined as errors because they are potential problems for accessibility, usability, interoperability, security, or maintainability—or because they can result in poor performance, or that might cause your scripts to fail in ways that are hard to troubleshoot.
Along with those, some markup cases are defined as errors because they can cause you to run into potential problems in HTML parsing and error-handling behavior—so that, say, you’d end up with some unintuitive, unexpected result in the DOM
Validating your documents alerts you to those potential problems.
So as far as your question about "are better approaches to ensure contemporary standards-compliant markup", the answer is that it's not an either-or thing; there are a variety of approaches and the W3C HTML Checker is just one of them, and its goal isn't to be the single way to determine anything but instead to just help you catch mistakes you might otherwise miss and that might cause unexpected problems for your users.
As far as ways to get alerted to specific device issues or browser-implementation issues, we don’t have good automated checking tools for that, but a couple of things which are huge help there are:
https://caniuse.com/ — detailed information about the level of support for particular web-runtime features in different browsers, and in different versions of those browsers, and in release of the browsers for mobile devices vs desktop
https://wptdashboard.appspot.com/ — current test results across all major browser engines for dozens of web-runtime features/specs; if https://caniuse.com/ doesn’t have information about a particular feature, you can look through this dashboard and browse to the directory that has tests for that feature, and find whether a browser passes the tests for the feature
But as far as good automated tools we do actually have for checking other things, here are two:
https://validator.w3.org/i18n-checker/ — W3C Internationalization Checker
https://observatory.mozilla.org/ — for doing a security assessment of the content of your site
I recently ran into a problem with the W3C HTML Checker mentioned above. I respect the huge amount of work its author has put into this validator, but it would not accept the tag <script type="text/vbscript" src="file.vbs"> in any form. It told me to change the type value to an empty string, a JavaScript MIME type, or module, which would make my page useless.
I know that VBScript is rarely used now, and it was just a test page, but let me share a less tricky alternative that is just as good as the first one for HTML error checking.
Maintainer of the current JsonFormatter (validator) is here

Switch browser to a strict mode in order to write proper html code

Is it possible to switch a browser to a "strict mode" in order to write proper code at least during the development phase?
I see always invalid, dirty html code (besides bad javascript and css) and I feel that one reason is also the high tolerance level of all browsers. So at least I would be ready to have a stricter mode while I use the browser for the development for the pages in order to force myself to proper code.
Is there anything like that with any of the known browser?
I know about w3c-validator but honestly who is really using this frequently?
Is there maybe some sort of regular interface between browser and validator? Are there any development environments where the validation is tested automatically?
Is there anything like that with any of the known browser? Is there maybe some sort of regular interface between browser and validator? Are there any development environments where the validation is tested automatically?
The answer to all those questions is “No”. No browsers have any built-in integration like what you describe. There are (or were) some browser extensions that would take every single document you load and send it to the W3C validator for checking, but using one of those extensions (or anything else that automatically sends things to the W3C validator in the background) is a great way to get the W3C to block your IP address (or the IP-address range for your entire company network) for abuse of W3C services.
I know about w3c-validator but honestly who is really using this frequently?
The W3C validator currently processes around 17 requests every second—around 1.5 million documents every day—so I guess there are quite a lot of people using it frequently.
I see always invalid, dirty html code… I would be ready to have a stricter mode while I use the browser for the development for the pages in order to force myself to proper code.
I'm not sure what specifically you mean by “dirty html code” or “proper code” but I can say that there are a lot of markup cases that are not bad or invalid but which some people mistakenly consider bad.
For example, some people think every <p> start tag should always have a matching </p> end tag, but the fact is that from the time when HTML was created, it has never required documents to always have matching </p> end tags in all cases (in fact, when HTML was created, the <p> element was basically an empty element, not a container, and so the <p> tag was simply a marker).
Another example of a case that some people mistakenly think of as bad is the case of unquoted attribute values; e.g., <link rel=stylesheet …>. But the fact is that unless an attribute value contains spaces, it generally doesn't need to be quoted. So in fact there's actually nothing wrong at all with a case like <link rel=stylesheet …>.
So there's basically no point in trying to find a tool or mechanism to check for cases like that, because those cases are not actually real problems.
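As a quick illustration of the unquoted-attribute point, here is a small sketch using Python's standard html.parser. It is a tokenizer rather than a conforming HTML parser, and the class name and the site.css file name are made up, but it shows that the quoted and unquoted forms yield the same tag and attributes.

```python
from html.parser import HTMLParser


class StartTags(HTMLParser):
    """Record (tag, attributes) for each start tag encountered."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


def start_tags(html: str) -> list:
    """Return the list of (tag, attribute dict) pairs found in `html`."""
    parser = StartTags()
    parser.feed(html)
    parser.close()
    return parser.seen


unquoted = start_tags("<link rel=stylesheet href=site.css>")
quoted = start_tags('<link rel="stylesheet" href="site.css">')
print(unquoted == quoted)
```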
All that said, the HTML spec does define some markup cases as being errors, and those cases are what the W3C validator checks.
So if you want to catch real problems and be able to fix them, the answer is pretty simple: Use the W3C validator.
Disclosure: I'm the maintainer of the W3C validator. 😀
As @sideshowbarker notes, there isn't anything built in to all browsers at the moment.
However I do like the idea and wish there was such a tool also (that's how I got to this question)
There is a "partial" solution, in that if you use Firefox, and view the source (not the developer tools, but the CTRL+U or right click "View Page Source") Firefox will highlight invalid tag nesting, and attribute issues in red in the raw HTML source. I find this invaluable as a first pass looking at a page that doesn't seem to be working.
It is quite nice because it isn't super picky about the asdf id not being quoted, or if an attribute is deprecated, but it highlights glitchy stuff like the spacing on the td attributes is messed up (this would cause issues if the attributes were not quoted), and it caught that the span tag was not properly closed, and that the script tag is outside of the html tag, and if I had missed the doctype or had content before it, it flags that too.
Unfortunately "seeing" these issues is a manual process... I'd love to see these in the dev console, and in all browsers.
Most plugins/extensions only get access to the DOM after it has been parsed and these errors are gone or negated... however if there is a way to get the raw HTML source in one of these extension models that we can code an extension for to test for these types of errors, I'd be more than willing to help write one (DM @scunliffe on Twitter). Alternatively this may require writing something at a lower level, like a script to run in Fiddler.

What are the advantages of creating web pages with XML instead of HTML?

From time to time, I see web pages whose content is solely written in XML (not HTML or XHTML). These pages usually have some style sheets (either XSLT or CSS) attached to them which makes them look like any other ordinary web page.
My question is, what are the advantages of such an approach (if any), and why would anyone choose to work this way?
EDIT: If this is a good thing, why is it not widespread?
EDIT 2: Thanks everyone for the great responses. They really enlightened me. I also found this question whose content is also related.
It's easier to generate it programmatically and reuse it for other purposes than displaying as webpage.
Update:
EDIT: If this is a good thing, why is it not widespread?
Not everyone needs to generate it programmatically or reuse it for other purposes than displaying as webpage. It's then easier to use plain HTML.
One possible advantage would be for use of the data of the page in something other than a web browser; that would (presumably) be easier to do if a page's content were well-formed XML. Of course in theory a well-formed, semantic XHTML page should be nearly as able to be parsed, as well.
It can also be easier to generate XML instead of XHTML, depending on the data source.
When you are getting XML data in to your system, and you are supposed to present this XML data then it is much easier to write some XSLT for that XML instead of parsing it using some sort of parser and then presenting the data.
That can be a valid point for using XML instead of XHTML or HTML
Update
To answer your question on why this is not widespread: XSLT is tedious and hard to work with. XPath in particular can be quite difficult for some people to use.
Those pages use XSLT to get rendered on the client side. Not every browser (especially older ones) supports rendering XML + XSLT. XML can however be used server-side as template and get transformed to HTML by the application running on the server. I personally don't see any advantages to this approach.
There are a lot more web pages that are written solely in XML than you know. You're only seeing the ones that do the XSLT transformation on the client side. Server-side transformation of XML is not at all unusual, because there's a plethora of things that produce data in XML, and transforming XML to HTML in XSLT is straightforward. You'll never know this is happening if you just look at the HTML, which bears no signs of having been generated via XSLT.
Personally, I don't understand it either though one of the biggest problems is support in IE. I created a skeleton ecommerce site serving XML, transformed by XSLT and styled using CSS. I sorely missed the ability to use XLink and other wonderful XML features. It's also nice to be able to tag the data for what it is. I used a 'menu' tag for the restaurant menus. 'price' tags for prices and so on. If a user clicked on a link to change menus, all I had to do was send the name of the item, the price and the description instead of the complete page. iirc, a 4K or more HTML menu page was only 200 bytes of sent data.
As far as the "one error makes everything crash in XML" type comments, the same is true of any programming language so proper coding should be no bother for programmers and careful HTML/CSS types.
Before anyone says that what I did was actually XHTML...no. I served XML. I did call up XHTML namespaces when needed for links, images and HTML type things but only when necessary.

Should I write Polyglot HTML5 documents?

I've been considering converting my current HTML5 documents to polyglot HTML5 ones. I figure that even if they only ever get served as text/html, the extra checks of writing it XML would help to keep my coding habits tidy and valid.
Is there anything particularly thrilling in the HTML5-only space that would make this an unwise choice?
Secondly, the specs are a bit hazy on how to validate a polyglot document. I assume the basics are:
No errors when run through the W3C Validator as HTML5
No errors when run through an XML parser
But are there any other rules I'm missing?
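The second check in that list (no errors from an XML parser) can be scripted with nothing but the Python standard library. This is a sketch covering well-formedness only; it does not replace the HTML5 validation step, and the function name is invented for this example.

```python
import xml.etree.ElementTree as ET


def xml_error(document: str):
    """Return None if `document` is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(document)
    except ET.ParseError as exc:
        return str(exc)
    return None


polyglot = (
    "<!DOCTYPE html>"
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title></head><body><hr/></body></html>"
)
html_only = "<html><head><title>t</title></head><body><hr></body></html>"

print(xml_error(polyglot))   # None when the document is well-formed
print(xml_error(html_only))  # a parser message, since <hr> is never closed
```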
Thirdly, seeing as it is a polyglot, does anyone know any caveats to serving it as application/xhtml+xml to supporting browsers and text/html to non-supporting ones?
Edit: After a bit of experimenting I found that entities like &nbsp; break in XHTML5 (no DTD). That XML parser is a bit of a double-edged sword; I guess I've answered my third question already.
Work on defining how to create HTML5 polyglot documents is currently on-going, but see http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html for an early draft. It's certainly possible to do, but it does require a good deal of coding discipline, and you will need to decide whether it's worth the effort. Although I create HTML4.01/XHTML1.0 polyglot documents, I create them using an XML tool chain which guarantees XML well-formedness and have specialized code to ensure compatibility with HTML non-void elements and valid XML characters. Direct hand coding would be very difficult.
One known current issue in HTML5 is the srcdoc attribute on the iframe element. Because the value of the attribute contains markup, certain characters need to be escaped. The HTML5 draft spec describes how to do this for the HTML serialization, but not (the last time I looked) how to do it in the XHTML serialization.
I'm late to the party but after 5 years the question is still relevant.
On one hand closing all my tags strongly appeals to me. For people reading it, for easier editing, for Great Justice. OTOH, looking at the gory details of the polyglot spec — http://www.sitepoint.com/have-you-considered-polyglot-markup/ has a convenient summary at the end — it's clear to me I can't get it all right by hand.
https://developer.mozilla.org/en/docs/Writing_JavaScript_for_XHTML also sheds interesting light on why XHTML failed: the very choice to use XML mime type has various side effects at run time. By now it should be routine for good JS code to handle these (e.g. always lowercase tag names before comparing) but I don't want all that. There are enough cross-browser issues to test for as-is, thank you.
So I think there is a useful middle way:
For now serve only as text/html. Stop worrying that it will actually parse as exactly the same DOM with same runtime behavior in both HTML and XML modes.
Only strive that it parses as some well-formed XML. It helps readers, it helps editors, it lets me use XML parser on my own documents.
Unfortunately, polyglot tools are rare to non-existent — it's hard to even serialize back XML in a way that also passes the HTML requirements...
No brainer: always self close void tags (<hr/>) and separately close non-void tags (<script ...></script>).
No brainers: use lowercase tags and attr (except some SVG but foreign content uses XML rules anyway), always quote attribute values, always provide attribute values (selected="selected" is more verbose than standalone selected but I can live with that).
Inline <script> and <style> are most annoying. I can't use & or < inside without breaking XML parsing. I need:
<script>/*<![CDATA[*/
foo < bar && bar < baz;
/*]]>*/</script>
...and that's about it! Not caring about XML namespaces or matching HTML's implied DOM for tables drops about half the rules :-)
Await some future when I can directly go to authoring XHTML, skipping polyglotness. The benefits are I'll be able to forget the tag-closing limitations, will be able to directly consume and produce it with XML tools. Sure, neglecting xml namespaces and other things now will make the switch harder, but I think I'll create more new documents in this future than convert existing ones.
Actually I'm not entirely sure what's stopping me from living in that future right now. Is it only IE 8? I'm also a tiny bit concerned about the all-or-nothing error handling. I'm slightly hoping a future HTML spec will find a way to shrink the HTML vs XML gaps, e.g. make browsers accept <hr></hr> and <script .../> in HTML while still retaining HTML error handling.
Also, tools. Having libraries in many languages that can serialize to polyglot markup would make it feasible for programs to generate it. Having tools to validate and convert HTML5 <-> polyglot <-> XHTML5 would help. Otherwise, it's pretty much doomed.
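To see why the CDATA wrapper around inline scripts matters, here is a small sketch with Python's standard XML parser (the function name is made up). The bare script is rejected as XML, while the wrapped form parses and keeps the script text intact.

```python
import xml.etree.ElementTree as ET

# The same inline script, without and with the comment-guarded CDATA wrapper.
BARE = "<script>foo < bar && bar < baz;</script>"
WRAPPED = (
    "<script>/*<![CDATA[*/\n"
    "foo < bar && bar < baz;\n"
    "/*]]>*/</script>"
)


def parses_as_xml(fragment: str) -> bool:
    """Return True if `fragment` is well-formed XML."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


print(parses_as_xml(BARE))
print(parses_as_xml(WRAPPED))
print(ET.fromstring(WRAPPED).text)
```

In HTML mode the CDATA markers sit inside JavaScript comments, so the script engine ignores them either way.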
Given that the W3C's documentation on the differences between HTML and XHTML isn't even finished, it's probably not worth your time to try to do polyglot. Not yet anyways.... give it another couple of years.
In any event, only in the extremely narrow circumstances where you are actively planning on parsing your HTML as XML for some specific purpose, should you invest the extra time in XML-compliance. There are no benefits of doing it purely for consumption by web browsers -- only drawbacks.
Should you? Yes. But first some clarification on a couple points.
Sending the Content-Type: application/xhtml+xml header only means it should go through an XML parser, it still has all the benefits of HTML5 as far as I can tell.
About &nbsp;: that isn't defined in XML. The only character entity references XML defines are lt, gt, apos, quot, and amp; you will need to use numeric character references for anything else. The code for nbsp is &#160; or &#xA0;. I personally prefer hex because Unicode code points are represented that way (U+00A0).
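A quick way to see the entity behaviour is to feed a few fragments to an XML parser with no DTD loaded. This is a sketch using Python's standard library; the helper name is invented for this example.

```python
import xml.etree.ElementTree as ET


def parse_text(fragment: str):
    """Return the element text, or None if the fragment is not well-formed XML."""
    try:
        return ET.fromstring(fragment).text
    except ET.ParseError:
        return None


print(parse_text("<p>a&nbsp;b</p>"))           # named entity, undefined without a DTD
print(repr(parse_text("<p>a&#160;b</p>")))     # decimal character reference
print(repr(parse_text("<p>a&#xA0;b</p>")))     # hexadecimal character reference
print(parse_text("<p>&lt;&amp;&gt;</p>"))      # predefined XML entities
```

The named reference fails to parse, while both numeric forms give the same U+00A0 character.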
Sending the header is useful for testing because you can quickly find problems with your markup such as unclosed tags, stray end tags, text that could be interpreted as a tag, etc, basically stuff that can break the look or even functionality of your site.
Most significantly in my opinion, is if you are allowing user input and it fails to parse, that generally means you didn't escape their data and are leaving yourself open to a vulnerability. Parsed as HTML, you might not ever notice a problem until someone starts injecting scripts to harass your users or steal data.
This page is pretty good about explaining what polyglot markup is: https://blog.whatwg.org/xhtml5-in-a-nutshell
This sounds like a very difficult thing to do. One of the downfalls of XHTML was that it wasn't possible to steer successfully between the competing demands of XML and vintage HTML.
I think if you write HTML5 and validate it successfully, you will have as tidy and valid a document as anyone would need.
This wiki has some information not present in the W3C document: http://wiki.whatwg.org/wiki/HTML_vs._XHTML

What's the key difference between HTML 4 and HTML 5?

What are the key differences between HTML4 and HTML5 draft?
Please keep the answers related to changed syntax and added/removed html elements.
HTML5 has several goals which differentiate it from HTML4.
Consistency in Handling Malformed Documents
The primary one is consistent, defined error handling. As you know, HTML purposely supports 'tag soup', or the ability to write malformed code and have it corrected into a valid document. The problem is that the rules for doing this aren't written down anywhere. When a new browser vendor wants to enter the market, they just have to test malformed documents in various browsers (especially IE) and reverse-engineer their error handling. If they don't, then many pages won't display correctly (estimates place roughly 90% of pages on the net as being at least somewhat malformed).
So, HTML5 is attempting to discover and codify this error handling, so that browser developers can all standardize and greatly reduce the time and money required to display things consistently. As well, long in the future after HTML has died as a document format, historians may still want to read our documents, and having a completely defined parsing algorithm will greatly aid this.
Better Web Application Features
The secondary goal of HTML5 is to develop the ability of the browser to be an application platform, via HTML, CSS, and Javascript. Many elements have been added directly to the language that are currently (in HTML4) Flash or JS-based hacks, such as <canvas>, <video>, and <audio>. Useful things such as Local Storage (a js-accessible browser-built-in key-value database, for storing information beyond what cookies can hold), new input types such as date for which the browser can expose easy user interface (so that we don't have to use our js-based calendar date-pickers), and browser-supported form validation will make developing web applications much simpler for the developers, and make them much faster for the users (since many things will be supported natively, rather than hacked in via javascript).
Improved Element Semantics
There are many other smaller efforts taking place in HTML5, such as better-defined semantic roles for existing elements (<strong> and <em> now actually mean something different, and even <b> and <i> have vague semantics that should work well when parsing legacy documents) and adding new elements with useful semantics - <article>, <section>, <header>, <aside>, and <nav> should replace the majority of <div>s used on a web page, making your pages a bit more semantic, but more importantly, easier to read. No more painful scanning to see just what that random </div> is closing - instead you'll have an obvious </header>, or </article>, making the structure of your document much more intuitive.
From Wikipedia:
New parsing rules oriented towards flexible parsing and compatibility
New elements – section, video, progress, nav, meter, time, aside, canvas
New input attributes – dates and times, email, url
New attributes – ping, charset, async
Global attributes (that can be applied for every element) – id, tabindex, repeat
Deprecated elements dropped – center, font, strike
HTML5 introduces a number of APIs that help in creating Web applications. These can be used together with the new elements introduced for applications:
An API for playing of video and audio which can be used with the new video and audio elements.
An API that enables offline Web applications.
An API that allows a Web application to register itself for certain protocols or media types.
An editing API in combination with a new global contenteditable attribute.
A drag & drop API in combination with a draggable attribute.
An API that exposes the history and allows pages to add to it to prevent breaking the back button.
You'll want to check HTML5 Differences from HTML4: W3C Working Group Note 9 December 2014 for the complete differences. There are many new elements and element attributes. Some elements were removed and others have different semantic value than before.
There are also APIs defined, such as the use of canvas, to help build the next generation of web apps and make sure implementations are standardized.
You might be interested in this list of HTML5 elements and attributes.
Also, please note that it's "HTML 4", not "HTML4". Indeed, for HTML 5, both variants are used, but there is an important difference in meaning. HTML 5 refers to the name of the W3C specification, whereas "HTML5" is the document type of those HTML files with a text/html MIME type that follow this spec.
The same goes for XHTML 5 vs. XHTML5.
The W3C now provides an official list of differences on its site:
http://www.w3.org/TR/html5-diff/
HTML 5 invites you to add a lot of semantic value to your code. What's more, there are native solutions for embedding multimedia content.
The rest is important, but it's more technical sugar that will save you from doing the same stuff with a client programming language.
In short, it is much simpler than earlier HTML: the long doctype is gone, and the center and font tags have been removed.
I also answered this difference in my blog :
http://ravisinghblog.in/key-difference-between-html-and-html-5/