How to make libxml2 parse non-strict HTML?

I am using LibXML in my Vala application to parse HTML code. However, the HTML I have to handle is invalid if you pass it through a validator (although browsers display it normally). In this HTML some tags are not closed, e.g. it uses <img> instead of <img/> and <meta> instead of <meta/>. I cannot do anything about that, e.g. ask the authors to write valid HTML. But I need to parse it, and libxml2 fails to do so (in short, doc->get_root_element() always returns null).
Can I do something to make libxml2 parse invalid HTML?

HTML is not XML. People tried to make it XML (it was called XHTML), and we mostly just learned that people can't be trusted to write valid XML. When you say that it is invalid, I assume you mean it is not valid XML but is, in fact, valid HTML.
libxml2 includes an HTML parser; you need to use that instead of the XML parser. In Vala, everything you need is in the Html namespace.

Related

Can I give jsoup a fallback character encoding to use when meta tags aren't found?

I am using Jsoup to parse HTML files which have unknown character encodings. I am calling Jsoup.parse with a null charset and letting Jsoup auto-detect the encoding. Some of the files have charset meta tags and Jsoup picks those up nicely.
Some of my files, however, have no meta tags and use various encodings that are not UTF-8. Jsoup falls back to UTF-8 in these cases, resulting in broken characters.
I have found that the juniversalchardet library is able to autodetect these cases correctly. For example it correctly detected the WINDOWS-1252 encoding in several examples.
Ideally I want to use the meta tags if they exist. If they do not, then fall back to what juniversalchardet reports (rather than just guessing UTF-8).
Can I provide Jsoup with a fallback charset to use only in cases when it cannot find a meta tag?
Alternatively, can I get info from Jsoup about whether it had to guess the encoding or not? If it reports that it guessed then I could call out to juniversalchardet and then reparse with an explicit encoding passed to Jsoup.
I have looked into the source code of Jsoup, and as of v1.8.3 it appears that the code to detect the charset from meta tags is not factored out into a separate method (see the source of the org.jsoup.helper.DataUtil class). Additionally, information about whether it guessed or not does not appear to make it into the resulting document.
Is there a better way to achieve my goal? Is there a library for detecting the character encoding of a file that can already make use of HTML meta tags if they exist, which I could use in place of Jsoup's auto-detection entirely?
I decided to use Apache Tika. It has an HtmlEncodingDetector class to find HTML meta tags. When that fails because no meta tags exist, I fall back to Tika's UniversalEncodingDetector. (The latter is a wrapper for juniversalchardet. I use the wrapper instead of calling juniversalchardet directly because it's handy for both detectors to have the same Java interface.)
I effectively never use Jsoup's auto detection now.
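For reference, here is a minimal sketch of that detect-then-parse flow. The package names are from Tika 1.x, and the class and helper names (CharsetAwareParse, detectCharset) are made up for illustration; treat them as assumptions to check against your own versions.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.html.HtmlEncodingDetector;
import org.apache.tika.parser.txt.UniversalEncodingDetector;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CharsetAwareParse {

    // Try the HTML meta tags first, then juniversalchardet (via Tika's
    // wrapper), then fall back to UTF-8 as a last resort.
    static Charset detectCharset(InputStream in) throws IOException {
        Metadata metadata = new Metadata();
        Charset charset = new HtmlEncodingDetector().detect(in, metadata);
        if (charset == null) {
            charset = new UniversalEncodingDetector().detect(in, metadata);
        }
        return charset != null ? charset : StandardCharsets.UTF_8;
    }

    public static void main(String[] args) throws IOException {
        // The detectors rely on mark/reset to rewind the stream after
        // sniffing it, hence the BufferedInputStream.
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            Charset charset = detectCharset(in);
            Document doc = Jsoup.parse(in, charset.name(), "");
            System.out.println(doc.title());
        }
    }
}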
The only caveat is that Tika is quite a large project and adding it pulled in a large number of irrelevant dependencies.

Parsing an HTML document using an XML parser

Can I parse an HTML file using an XML parser?
Why can (or can't) I do this? I know that XML is used to store data and that HTML is used to display data, but syntactically they are almost identical.
The intended use is to make an HTML parser that is part of a web crawler application.
You can try parsing an HTML file using an XML parser, but it's likely to fail. The reason is that HTML documents can have the following HTML features that XML parsers don't understand:
elements that never have end tags and that don’t use XML’s so-called “self-closing tag syntax”; e.g., <br>, <meta>, <link>, and <img> (also known as void elements)
elements that don’t need end tags; e.g., <p> <dt> <li> (their end tags can be implied)
elements whose content can contain unescaped "<" characters; e.g., style, textarea, title, script; <script> if (a < b) … </script>, <title>Using the "<" operator</title>
attributes with unquoted values; for example, <meta charset=utf-8>
attributes that are empty, with no separate value given at all; e.g., <input disabled>
XML parsers will fail to parse any HTML document that uses any of those features.
HTML parsers, on the other hand, will basically never fail no matter what a document contains.
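A small Java illustration of that difference, using the JDK's XML DocumentBuilder on one side and Jsoup on the other (Jsoup is just a convenient HTML parser for the demo, not one of the spec-conformant parsers listed below):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

import org.jsoup.Jsoup;
import org.xml.sax.SAXParseException;

public class HtmlVsXml {
    public static void main(String[] args) throws Exception {
        // Perfectly ordinary HTML: a void element, an unquoted attribute
        // value, and <p> elements whose end tags are implied.
        String html = "<html><head><meta charset=utf-8><title>Hi</title></head>"
                    + "<body><p>one<p>two</body></html>";

        // The XML parser rejects it; the unquoted attribute value alone
        // already breaks well-formedness.
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        } catch (SAXParseException e) {
            System.out.println("XML parser gave up: " + e.getMessage());
        }

        // The HTML parser recovers and produces a sensible tree.
        System.out.println(Jsoup.parse(html).body().html());
    }
}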
All that said, there's also been work done toward developing a new type of XML parsing: so-called XML5 parsing, capable of handling things like empty/unquoted attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.
The intended use is to make an HTML parser that is part of a web crawler application
If you’re going to create a web-crawler application, you should absolutely use an HTML parser—and ideally, an HTML parser that conforms to the parsing requirements in the HTML standard.
These days, there are such conformant HTML parsers for many (or even most) languages; e.g.:
parse5 (Node.js/JavaScript)
html5lib (Python)
html5ever (Rust)
validator.nu html5 parser (Java)
gumbo (C, with bindings for Ruby, Objective-C, C++, PHP, C#, Perl, Lua, D, Julia…)
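For example, the validator.nu parser can be dropped into ordinary Java DOM code. A sketch, assuming the package and class names from the nu.validator htmlparser artifact (verify them against the version you actually use):

import java.io.StringReader;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

import nu.validator.htmlparser.dom.HtmlDocumentBuilder;

public class SpecConformantParse {
    public static void main(String[] args) throws Exception {
        // HtmlDocumentBuilder plugs into the standard javax.xml DocumentBuilder
        // API but tokenizes and builds the tree per the HTML standard, so messy
        // real-world markup still yields a usable DOM for the crawler.
        Document doc = new HtmlDocumentBuilder().parse(new InputSource(
            new StringReader("<p>unclosed paragraphs<p>are fine<br><meta charset=utf-8>")));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}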
syntactically they are almost identical
Computers are picky. "Almost identical" isn't good enough. HTML allows things that XML doesn't, therefore an XML parser will reject (many, though not all) HTML documents.
In addition, there's a different quality culture. With HTML the culture for a parser is "try to do something with the input if you possibly can". With XML the culture is "if it's faulty, send it back for repair or replacement".

Is valid XML also valid HTML

I'm trying to convert some XML to HTML. The XML contains only a few known elements that map to HTML tags. Do I need to html encode text nodes?
Is valid XML also valid HTML assuming we are only using HTML tags?
Is valid XML also valid HTML assuming we are only using HTML tags?
No. Here's a simple example.
<div>
<span/>
</div>
This is well-formed and valid XML. It is not valid HTML (except when processed as XHTML) in any version of HTML.
That's not to say that a HTML parser won't process it, but that's not a good test. An HTML parser will process any byte sequence, valid or not.
Is valid* XML also valid HTML assuming we are only using HTML tags?
*Note that "valid" is not the same as "well-formed". Validity is a property that requires well-formedness and succesful comparison against a DTD or schema. Well-formedness only means syntactical correctness, which is what you mean here.
Yes. HTML uses a few conventions that are not present in XML (most prominently: unclosed tags, unescaped element content as in <script>, no namespace support, and forgiving treatment of incorrect tag nesting), but all things considered, well-formed, vanilla (!) XML that only uses HTML tag names will be understood by an HTML parser.
Vanilla means in this case: No custom DTDs, no custom named character entities.
Do I need to html encode text nodes?
No. All characters valid in a given encoding (say, UTF-8) will be acceptable in both XML and HTML, as long as the encoding is correctly declared. Character escaping schemes are compatible, so e.g. &#160; (or &#xA0;) will represent a non-breaking space in both XML and HTML. Writing that non-breaking space verbatim (i.e., as the raw character U+00A0) into the text will work as well. Named character entities besides &lt;, &gt;, &amp;, &quot; and &apos; are unsupported in XML, whereas all numeric character references that XML could use will also work in HTML. That means you will not encounter a problem there.
XML that does not declare an encoding will default to UTF-8. You should not have a problem with leaving all text nodes and attribute values as they are as long as you use the same encoding for your HTML.
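As a practical aside for the original goal (converting XML whose elements map to HTML tags into HTML), the JAXP identity transform with the html output method handles the serialization differences discussed above. A sketch, assuming the JDK's built-in transformer is acceptable for your case:

import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XmlToHtml {
    public static void main(String[] args) throws Exception {
        String xml = "<div><span/><br/><p>caf\u00e9 &amp; tea</p></div>";

        // Identity transform, serialized with the HTML output method: the
        // serializer emits <br> rather than <br/>, writes the empty <span/>
        // as <span></span>, and escapes text content appropriately for HTML.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "html");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        System.out.println(out);
    }
}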

Using an NSXMLParser to parse HTML

I'm working on an app which aggregates some feeds from the internet and reformats the content, so I'm looking for a way to parse some HTML. Given that XML and HTML are very similar in structure, I was thinking "maybe I should just use an NSXMLParser". I'm already using it to parse my RSS feeds and I've become comfortable with it, but I'm running into a problem.
The parser will not recognize <p> as an element. It has no problem extracting elements like <title> or <img>, but it doesn't like <p>. Has anyone tried doing this, and if so, do you have any suggestions or workarounds for this issue? I think NSXMLParser is good for what I'm doing and I would like to use it, but obviously, if I can't get the text in <p> elements it's completely useless to me.
Any suggestions are welcome, even ones suggesting a different method entirely. I've looked into some third party libraries for doing this but from what I've read they all have some bugs and I would much prefer to use something provided by Apple.
There's absolutely nothing special about "p" as the name of an element. While it is hard to be sure because you haven't provided an example of the HTML you are parsing, the problem is most likely caused by HTML that is not well-formed XML. In other words, using NSXMLParser would work on XHTML, but not necessarily on plain old HTML.
The "p" element is frequently found in HTML without the matching closing tag, which is not valid XML. My guess is that you would have to convert the HTML to XHTML before trying to parse it with an NSXMLParser.
HTML is not necessarily well-formed XML, and that's the trouble when you parse it as XML.
Take the following example:
<body>
<p>123
<p>abc
<p>789
</body>
If you view this chunk of HTML in a browser, it shows just what you would expect. But if you parse it as XML, there will be trouble, because those p tags are not closed.
I recommend you use my DTHTMLParser, which is modeled after NSXMLParser and uses libxml2 to parse HTML properly. You generally cannot rely on HTML being well-formed and parseable as XML.
libxml2 has an HTML mode in which it is able to tolerate things like unclosed tags and whatever other idiosyncrasies HTML might have.
HTML parsing explained:
http://www.cocoanetics.com/2011/09/taming-html-parsing-with-libxml-1/
http://www.cocoanetics.com/2012/01/taming-html-parsing-with-libxml-2/
DTHTMLParser documentation:
https://docs.cocoanetics.com/DTFoundation/Classes/DTHTMLParser.html
Source, part of DTFoundation:
DTHTMLParser.h
DTHTMLParser.m

What is the safest way to extract <title> from an HTML file using xpath?

Here is my current XPath expression: "/html/head/title".
But as you know, in a real-world HTML environment the markup is often broken; e.g. a missing <html> tag could cause an exception. So I would like to know whether there's a safe way to extract the <title> tag (something like getElementsByTagName).
"//title" perhaps?
Because of the unruly nature of HTML markup, you should use an HTML parsing library. You didn't specify a platform or language, but there are a number of open-source libraries out there.
Actually /html/head/title should work just fine, even on badly malformed mark-up, assuming:
there is a title element;
your HTML parser behaves the same way browser parsers do;
your HTML parser puts the HTML elements into the null namespace.
You will have to allow for the possibility of there being multiple title elements in invalid HTML, so /html/head/title[1] is possibly better.
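If you want to keep using XPath, one approach is to let a lenient HTML parser build a W3C DOM first and query that. A Java sketch, using the validator.nu parser mentioned elsewhere on this page as an assumption (any HTML parser that produces a DOM would do); note that this particular parser puts elements in the XHTML namespace rather than the null namespace, hence the local-name() test:

import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

import nu.validator.htmlparser.dom.HtmlDocumentBuilder;

public class TitleExtractor {
    public static void main(String[] args) throws Exception {
        // Deliberately broken markup: no <html> or <head> tags at all.
        String html = "<title>Hello</title><p>body text";

        // The HTML parser builds the implied html/head/body structure anyway.
        Document doc = new HtmlDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));

        // Match on local-name() instead of a plain //title, since the
        // elements are not in the null namespace here.
        XPath xpath = XPathFactory.newInstance().newXPath();
        String title = xpath.evaluate("//*[local-name()='title']", doc);
        System.out.println(title); // prints: Hello
    }
}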
If you can use javascript, you can do it:
document.title
If you have something that an XML parser can parse (which is not the case with most HTML, but needs to be the case to use XPath), then you could use //title to get the element.