Parsing an html document using an XML-parser - html

Can I parse an HTML file using an XML parser?
Why can('t) I do this. I know that XML is used to store data and that HTML is used to display data. But syntactically they are almost identical.
The intended use is to make an HTML parser, that is part of a web crawler application

You can try parsing an HTML file using a XML parser, but it’s likely to fail. The reason is that HTML documents can have the following HTML features that XML parsers don’t understand.
elements that never have end tags and that don’t use XML’s so-called “self-closing tag syntax”; e.g., <br>, <meta>, <link>, and <img> (also known as void elements)
elements that don’t need end tags; e.g., <p> <dt> <li> (their end tags can be implied)
elements that can contain unescaped markup "<" characters; e.g., style, textarea, title, script; <script> if (a < b) … </script>, <title>Using the "<" operator</title>
attributes with unquoted values; for example, <meta charset=utf-8>
attributes that are empty, with no separate value given at all; e.g., <input disabled>
XML parsers will fail to parse any HTML document that uses any of those features.
HTML parsers, on the other hand, will basically never fail no matter what a document contains.
All that said, there’s also been work done toward developing a new type of XML parsing: so-called XML5 parsing, capable of handling things like empty/unquoted attributes attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.
The intended use is to make an HTML parser, that is part of a web
crawler application
If you’re going to create a web-crawler application, you should absolutely use an HTML parser—and ideally, an HTML parser that conforms to the parsing requirements in the HTML standard.
These days, there are such conformant HTML parsers for many (or even most) languages; e.g.:
parse5 (node.js/JavaScript)
html5lib (python)
html5ever (rust)
validator.nu html5 parser (java)
gumbo (c, with bindings for ruby, objective c, c++, per, php, c#, perl, lua, D, julia…)

syntactically they are almost identical
Computers are picky. "Almost identical" isn't good enough. HTML allows things that XML doesn't, therefore an XML parser will reject (many, though not all) HTML documents.
In addition, there's a different quality culture. With HTML the culture for a parser is "try to do something with the input if you possibly can". With XML the culture is "if it's faulty, send it back for repair or replacement".

Related

How to make libxml2 parse non-strict HTML?

I am using LibXML in my Vala application to parse HTML code. However the HTML I use is invalid if you pass it through validator (although browser displays it normally). In this HTML some tags are not closed, e.g. they use <img> instead of <img /> and <meta> instead of <meta/>. I cannot do anything about it, e.g. ask them to write valid HTML. But I need to parse it and libxml2 fails to do this (in short, doc->get_root_element() always return null).
Can I do something to make libxml2 parse invalid HTML?
HTML is not XML. People tried to make it XML (it was called XHTML), and we mostly just learned that people can't be trusted to write valid XML. When you say that it is invalid, I assume you mean it is not valid XML but is, in fact, valid HTML.
libxml includes an HTML parser, you need to use that. In Vala everything is in the Html namespace.

Does HTML conform to the XML specification?

HTML and XML are syntactically very similar, so what I want to know is if valid HTML code will always conform to the XML specification.
No, it won't.
HTML 2 through 4.x were SGML applications, not XML applications. (HTML+ might also have been an SGML application, it isn't clear from a brief skim of the specification)
HTML 5 has its own parse rules.
(XHTML and the XML serialisation of HTML 5 will be XML though)
Does HTML conform to the XML specification?
No, it does not. HTML supports:
unclosed tags (e.g. <img> instead of <img />)
wrongly nested tags (e.g. <b><i>bla</b></i> instead of <b><i>bla</i></b>)
unquoted attributes (e.g. <a name=foo>...</a>)
contents that is not propery encoded (e.g. <em>this & that</em> instead of <em>this & that</em>)
tags that explicitly must contain unencoded content (i.e. <script>)
named entities (e.g. © instead of ©)
The standard does not explicitly allow all of these notions, but all HTML parsers understand and support them.
None of them is legal in XML.
HTML is more lenient. For example,
<!DOCTYPE html>
<title>foo</title>
bar
is a valid HTML5 document, but it's obviously not valid XML, since XML requires a top-level element that encompasses the whole document.
However, you can use one of the XHTML languages, which are applications of XML with the same semantics as the corresponding HTML standards.

Is HTML an application of XML?

So, yeah, is HTML a particular application of XML? Like, instead of user-customizable tags, "hard coded" fixed tags decided by the W3C and interpreted by navigators? Or are them totally different things?
Also, in which case is XML better than a database to transfer information inside a Web application? (I was thinking, saving users information or things like that may do better with XML documents than with a database).
Here's a history of HTML
...The HTML that Tim invented was strongly based on SGML (Standard Generalized Mark-up Language), an internationally agreed upon method for marking up text into structural units such as paragraphs, headings, list items and so on. SGML could be implemented on any machine. The idea was that the language was independent of the formatter (the browser or other viewing software) which actually displayed the text on the screen. The use of pairs of tags such as and is taken directly from SGML, which does exactly the same. The SGML elements used in Tim's HTML included P (paragraph); H1 through H6 (heading level 1 through heading level 6); OL (ordered lists); UL (unordered lists); LI (list items) and various others. What SGML does not include, of course, are hypertext links: the idea of using the anchor element with the HREF attribute was purely Tim's invention, as was the now-famous `www.name.name' format for addressing machines on the Web....
And in no case is XML "better" than a database (are cakes better than ovens?). XML isn't for storing data, it's for transfering it. Unless the data is absolutely minimal, you have to find some other way to store it. Opening static XML files on the file system over and over as you save and read data is a terrible way to go about it.
So, yeah, is HTML a particular application of XML?
No.
HTML 4 is an application of SGML, but most parsers for it do not treat it as such.
XHTML is an application of XML, but it is usually served as text/html instead of application/xhtml+xml and so is treated like HTML.
HTML 5 is not an application of either SGML or XML (except in its XML serialisation) and has its own parsing rules.
Also, in which case is XML better than a database to transfer information inside a Web application?
XML is a good basis for a data exchange format. It is not a good basis for storing data in order to search it (which is what happens "inside" most web applications)
HTML and XML both come from SGML, hence their similarities. But XML is a strict grammar (no predefined tag names), while HTML is both a not very strict grammar and a vocabulary (tag names). There is an HTML variant which strictly complies with XML rules : XHTML.
As for using XML as a database, it is possible under certain circumstances. But it really depends on your architecture, language, volumetry and lots of other considerations. I suggest you open a new question with more details for this.
XHTML is a reformulation of HTML as XML app.
You can invent your own tags. I don't think HTML5 has a doctype for that though. You can create them with JavaScript and initalize/style then with CSS like any other element.
instead of using XML, spit out JSON, seriously, do this.
if you are worried about using your db, think about switching to couchdb or nosql. they're ripe for JSON.
don't get me wrong, your thought process isn't wrong, you can do that. i've seen it done rather well. but most people don't get it right. and seriously, JSON is your friend.
For the differences between HTML & XML see:
http://www.w3schools.com/xml/xml_whatis.asp
XML is primarily used for transfering data, not storing it. A database will generally give you much more flexibility in querying the data.
HTML allows things that XML doesn't allow, like omitting end tags, omitting the quotes around attribute values, and using upper-case and lower-case interchangeably. So HTML is not just another XML vocabulary.
XHTML, however, was an attempt to reformulate HTML as an XML vocabulary.

Should I write Polyglot HTML5 documents?

I've been considering converting my current HTML5 documents to polyglot HTML5 ones. I figure that even if they only ever get served as text/html, the extra checks of writing it XML would help to keep my coding habits tidy and valid.
Is there anything particularly thrilling in the HTML5-only space that would make this an unwise choice?
Secondly, the specs are a bit hazy on how to validate a polyglot document. I assume the basics are:
No errors when run through the W3C Validator as HTML5
No errors when run through an XML parser
But are there any other rules I'm missing?
Thirdly, seeing as it is a polyglot, does anyone know any caveats to serving it as application/xhtml+xml to supporting browsers and text/html to non-supporting ones?
Edit: After a small bit of experimenting I found that entities like break in XHTML5 (no DTD). That XML parser is a bit of a double-edged sword, I guess I've answered my third question already.
Work on defining how to create HTML5 polyglot documents is currently on-going, but see http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html for an early draft. It's certainly possible to do, but it does require a good deal of coding discipline, and you will need to decide whether it's worth the effort. Although I create HTML4.01/XHTML1.0 polyglot documents, I create them using an XML tool chain which guarantees XML well-formedness and have specialized code to ensure compatibility with HTML non-void elements and valid XML characters. Direct hand coding would be very difficult.
One known current issue in HTML5 is the srcdoc attribute on the iframe element. Because the value of the attribute contains markup, certain characters need to be escaped. The HTML5 draft spec describes how to do this for the HTML serialization, but not (the last time I looked) how to do it in the XHTML serialization.
I'm late to the party but after 5 years the question is still relevant.
On one hand closing all my tags strongly appeals to me. For people reading it, for easier editing, for Great Justice. OTOH, looking at the gory details of the polyglot spec — http://www.sitepoint.com/have-you-considered-polyglot-markup/ has a convenient summary at the end — it's clear to me I can't get it all right by hand.
https://developer.mozilla.org/en/docs/Writing_JavaScript_for_XHTML also sheds interesting light on why XHTML failed: the very choice to use XML mime type has various side effects at run time. By now it should be routine for good JS code to handle these (e.g. always lowercase tag names before comparing) but I don't want all that. There are enough cross-browser issues to test for as-is, thank you.
So I think there is a useful middle way:
For now serve only as text/html. Stop worrying that it will actually parse as exactly the same DOM with same runtime behavior in both HTML and XML modes.
Only strive that it parses as some well-formed XML. It helps readers, it helps editors, it lets me use XML parser on my own documents.
Unfortunately, polyglot tools are rare to non-existant — it's hard to even serialize back XML in a way that also passes the HTML requirements...
No brainer: always self close void tags (<hr/>) and separately close non-void tags (<script ...></script>).
No brainers: use lowercase tags and attr (except some SVG but foreign content uses XML rules anyway), always quote attribute values, always provide attribute values (selected="selected" is more verbose than stanalone selected but I can live with that).
Inline <script> and <style> are most annoying. I can't use & or < inside without breaking XML parsing. I need:
<script>/*<![CDATA[*/
foo < bar && bar < baz;
/*]]>*/</script>
...and that's about it! Not caring about XML namespaces or matching HTML's implied DOM for tables drops about half the rules :-)
Await some future when I can directly go to authoring XHTML, skipping polyglotness. The benefits are I'll be able to forget the tag-closing limitations, will be able to directly consume and produce it with XML tools. Sure, neglecting xml namespaces and other things now will make the switch harder, but I think I'll create more new documents in this future than convert existing ones.
Actually I'm not entirely sure what's stopping me from living in that future right now. Is it only IE 8? I'm also a tiny bit concerned about the all-or-nothing error handling. I'm slighly hoping a future HTML spec will find a way to shrink the HTML vs XML gaps, e.g. make browsers accept <hr></hr> and <script .../> in HTML— while still retaining HTML error handling.
Also, tools. Having libraries in many languages that can serialize to polyglot markup would make it feasible for programs to generate it. Having tools to validate and convert HTML5 <-> polyglot <-> XHTML5 would help. Otherwise, it's pretty much doomed.
Given that the W3C's documentation on the differences between HTML and XHTML isn't even finished, it's probably not worth your time to try to do polyglot. Not yet anyways.... give it another couple of years.
In any event, only in the extremely narrow circumstances where you are actively planning on parsing your HTML as XML for some specific purpose, should you invest the extra time in XML-compliance. There are no benefits of doing it purely for consumption by web browsers -- only drawbacks.
Should you? Yes. But first some clarification on a couple points.
Sending the Content-Type: application/xhtml+xml header only means it should go through an XML parser, it still has all the benefits of HTML5 as far as I can tell.
About , that isn't defined in XML, the only character entity references XML defines are lt, gt, apos, quot, and amp, you will need to use numeric character references for anything else. The code for nbsp is   or  , I personally prefer hex because unicode code points are represented that way (U+00A0).
Sending the header is useful for testing because you can quickly find problems with your markup such as unclosed tags, stray end tags, text that could be interpreted as a tag, etc, basically stuff that can break the look or even functionality of your site.
Most significantly in my opinion, is if you are allowing user input and it fails to parse, that generally means you didn't escape their data and are leaving yourself open to a vulnerability. Parsed as HTML, you might not ever notice a problem until someone starts injecting scripts to harass your users or steal data.
This page is pretty good about explaining what polyglot markup is: https://blog.whatwg.org/xhtml5-in-a-nutshell
This sounds like a very difficult thing to do. One of the downfalls of XHTML was that it wasn't possible to steer successfully between the competing demands of XML and vintage HTML.
I think if you write HTML5 and validate it successfully, you will have as tidy and valid a document as anyone would need.
This wiki has some information not present in the W3C document: http://wiki.whatwg.org/wiki/HTML_vs._XHTML

How to parse not strict HTML documents indulgently?

i've got one more question today
are there any html parsers with not strict syntax analyzers available?
as far as i can see such analyzers are built in web browsers
i mean it should be very nice to get a parser that indulgently process the input document allowing any of the following situations that are invalid in xhtml and xml:
not self-closed single tags. for example: <br> or <hr>...
mismatched casing pairs: <td>...</TD>
attributes with no quotes marks: <span class=hilite>...</SPAN>
so on and so on... etc
suggest any suitable parser, please
thank you
TagSoup is available for various languages, including Java, C++ (Taggle) and XSLT (TSaxon).
...TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
If you're happy with Python, Beautiful Soup is just such a parser.
"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."
Hpricot is particularly good at parsing broken markup if you're not afraid of a bit of Ruby. http://github.com/whymirror/hpricot