Parsing HTML fragments with libxml SAX

I need to parse HTML fragments, by which I mean files that lack the <html>, <head> and <body> elements but otherwise use well-formed XHTML syntax; UTF-8 encoding is guaranteed. It looks like libxml is ideal for this task, but I have certain constraints that I just don't know how to satisfy.
htmlSAXParseFile() does its job well enough, but it seems to build the DOM itself, inserting body and html elements in the process. I would like to build the DOM myself, because I may need to skip some elements and modify others on the fly. Is it possible to somehow tell libxml not to create a DOM at all and just parse the HTML and call my handlers?
If that's not possible with libxml's HTML parser, I might as well use xmlSAXUserParseFile(), which doesn't seem to create a DOM. However, since the files have a structure like <p>...</p><p>...</p>, the parser bails out with "Extra content at the end of the document" as soon as it hits the second top-level element. Is there a way to suppress some parsing errors while still getting notified about them (because nobody guarantees there will never be other errors in those files)?
There is a whole heap of parsing functions in libxml, some of which accept xmlParserOption as a parameter. Alas, xmlSAXUserParseFile() does not. And those that do all seem to create a DOM, for API design reasons that don't help my case. Am I missing an obvious candidate?
Oh, and I confess that my reluctance to use libxml's DOM may look like a quirk. I am extremely constrained on RAM, so I desperately need total control over the DOM so I can drop some nodes under low-memory conditions and re-read them later if necessary.
Thanks in advance.

OK, since nobody has answered the question, I'll try to do it myself.
I wrote all the start/end element handlers and it looks like libxml does not create the DOM any more. At least, the returned document pointer is NULL. It still insists on html and body elements, but I can live with that.
One major problem is that libxml preserves all whitespace nodes, no matter what. So I have to parse text content to eliminate ignorable whitespace. It's ugly, but it works. Should I mention that parsing UTF-8 is the kind of fun you rarely miss?
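For comparison only, since this project is in C: the same libxml2 parser is wrapped by Python's lxml, whose target interface gives an equivalent callback-only flow. A minimal sketch of that flow, whitespace filtering included (all names here are mine, not part of the original code):

# Illustration only -- lxml wraps the same libxml2 HTML parser and lets you
# receive callbacks without keeping any tree around.
from lxml import etree

class FragmentTarget:
    def start(self, tag, attrib):
        print("start", tag, dict(attrib))
    def end(self, tag):
        print("end", tag)
    def data(self, text):
        if text.strip():               # drop whitespace-only text nodes
            print("text", repr(text))
    def close(self):
        return None                    # nothing to build -- no tree is kept

parser = etree.HTMLParser(target=FragmentTarget())
# The fragment parses fine; html/body still show up as implied events.
etree.HTML("<p>one</p><p>two</p>", parser)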
To be honest, libxml's documentation is atrocious. My advice to anybody who ever tries reading the docs: read the source code instead. The code is far more readable and much better documented.
Thanks for your attention.

Related

Perl HTML::TreeBuilder adding <html>, <head> and <body> tags to parsed content, how to stop or work around it?

Background:
I'm using HTML::TreeBuilder to parse an entire html page, say "whole_page" for reference's sake. I'm then using the inherited parse_content method (same as for whole_page) of a new TreeBuilder object to parse a chunk of html, say "html_to_insert". The root element of html_to_insert should be a <div> tag. Ultimately, the html_to_insert tree needs to be inserted into the whole_page tree.
Problem:
The html_to_insert tree is being wrapped with <html>, <head> and <body> tags, which I obviously don't need. I looked at HTML::Parser to see if there was a parameter that might solve the problem, but I couldn't find anything.
Question:
Is there a simple way to stop the parse method from wrapping html_to_insert with the un-needed tags? Knowing what I'm trying to do, am I doing this ass backwards (is there a better way)?
Thanks for any help.
You might want to look at the guts method in HTML::Tree. It returns only the non-implicit nodes, as a list.
If you can ensure your HTML is XHTML-compliant, that is, it's a proper XML document, you may be able to use XML tools to do the job instead. In the past, I've used XML::Twig for this type of job, it was a bit easier that way.
Of course, if you're parsing arbitrary web pages from the internet, you may not have this type of guarantee.

What are some good ways to parse HTML and CSS in Perl?

I have a project where my input files used to be XML. I'm now being asked to start processing HTML with embedded CSS instead, and I'd like to accomplish this as cleanly and with as few code changes as possible. I was using XML::LibXML to parse the XML files, but now that we're moving to HTML with CSS, I'm thinking I'll need to move to something else. That said, before I dig myself knee deep into silly decisions I'll likely regret, I wanted to ask here: what do you guys use for this kind of task?
The structures of the old XML and the new HTML input files are pretty similar, with both holding the same information. The HTML uses divs in place of the XML's text nodes, and holds its style information in style tags and attributes instead of separate XML attributes.
An example of the old XML is:
<text font="TimesNewRoman,BoldItalic" size="11.04" x="59" y="405" w="52"
h="12" bold="yes" italic="yes" cs="4.6" o_bbox="59,405;52,12"
o_size="11.04" o_cs="4.6">
Some text
</text>
An example of the new HTML is:
<div o="9ka" style="position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;">
<span class="ft19" >
Some text
</span></nobr>
</div>
where "ft19" refers to a css style element from the top of the page of the format:
.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
font-family:Times;color:#000000;x-pdf-color:#000000;font-style:italic;
x-pdf-letter-spacing:0.83px;}
Basically, all I want is a parser that can read the stylistic elements of each node as attributes, so I could do something like:
my @texts_arr = $page_node->findnodes('text');
my $text_node = $texts_arr[1];
print "node's bold value is: " . $text_node->getAttribute('bold');
as I'm able to do with the XML. Does anything like that exist for parsing HTML? I'd really like to make sure I start this the right way instead of finding something that sort of does what I want on CPAN and realizing two months later that there was another module that was way better for what I'm trying to do.
Ideas?
The basic one I am aware of is HTML::Parser.
There is also a related project, Marpa::HTML, which builds on the larger Marpa parser (it can parse any language that can be described in BNF) and is documented on the author's blog. It is very interesting, but much newer and more experimental.
I also see that the wildly successful WWW::Mechanize uses HTML::TokeParser, which in turn uses HTML::PullParser, so there's that too.
If you need something even more generic (and evil) you can look into "writing" your own using something like Text::Balanced (which has some nice methods for tags, not sure about tag properties though) or even Regexp::Grammars, but again this means reinventing the wheel somewhat, I would only choose these routes if the above don't do what you need.
Perhaps I haven't helped. Perhaps I have just done a literature search for you, but maybe one of these will work better for you than others.
Edit: one more parser for you that seems like it might do what you need: HTML::Tree. Then look at methods like look_down from HTML::Element to act on the tree. I saw an example here.
It's not clear - is the Perl parsing for the purposes of doing the conversion to HTML (with embedded CSS)? If so, why not forget Perl and use XSLT which is designed to transform XML documents?

Should I write Polyglot HTML5 documents?

I've been considering converting my current HTML5 documents to polyglot HTML5 ones. I figure that even if they only ever get served as text/html, the extra checks involved in writing them as XML would help to keep my coding habits tidy and valid.
Is there anything particularly thrilling in the HTML5-only space that would make this an unwise choice?
Secondly, the specs are a bit hazy on how to validate a polyglot document. I assume the basics are:
No errors when run through the W3C Validator as HTML5
No errors when run through an XML parser
But are there any other rules I'm missing?
Thirdly, seeing as it is a polyglot, does anyone know any caveats to serving it as application/xhtml+xml to supporting browsers and text/html to non-supporting ones?
Edit: After a small bit of experimenting I found that entities like &nbsp; break in XHTML5 (no DTD). That XML parser is a bit of a double-edged sword; I guess I've answered my third question already.
Work on defining how to create HTML5 polyglot documents is currently on-going, but see http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html for an early draft. It's certainly possible to do, but it does require a good deal of coding discipline, and you will need to decide whether it's worth the effort. Although I create HTML4.01/XHTML1.0 polyglot documents, I create them using an XML tool chain which guarantees XML well-formedness and have specialized code to ensure compatibility with HTML non-void elements and valid XML characters. Direct hand coding would be very difficult.
One known current issue in HTML5 is the srcdoc attribute on the iframe element. Because the value of the attribute contains markup, certain characters need to be escaped. The HTML5 draft spec describes how to do this for the HTML serialization, but not (the last time I looked) how to do it in the XHTML serialization.
I'm late to the party but after 5 years the question is still relevant.
On one hand closing all my tags strongly appeals to me. For people reading it, for easier editing, for Great Justice. OTOH, looking at the gory details of the polyglot spec — http://www.sitepoint.com/have-you-considered-polyglot-markup/ has a convenient summary at the end — it's clear to me I can't get it all right by hand.
https://developer.mozilla.org/en/docs/Writing_JavaScript_for_XHTML also sheds interesting light on why XHTML failed: the very choice of an XML MIME type has various side effects at run time. By now it should be routine for good JS code to handle these (e.g. always lowercase tag names before comparing), but I don't want all that. There are enough cross-browser issues to test for as-is, thank you.
So I think there is a useful middle way:
For now, serve only as text/html. Stop worrying about whether it will actually parse as exactly the same DOM with the same runtime behavior in both HTML and XML modes.
Only strive for it to parse as some well-formed XML. That helps readers, it helps editors, and it lets me use an XML parser on my own documents.
Unfortunately, polyglot tools are rare to non-existent; it's hard to even serialize XML back out in a way that also passes the HTML requirements...
No-brainer: always self-close void tags (<hr/>) and separately close non-void tags (<script ...></script>).
No-brainers: use lowercase tag and attribute names (except some SVG, but foreign content uses XML rules anyway), always quote attribute values, and always provide attribute values (selected="selected" is more verbose than standalone selected, but I can live with that).
Inline <script> and <style> are most annoying. I can't use & or < inside without breaking XML parsing. I need:
<script>/*<![CDATA[*/
foo < bar && bar < baz;
/*]]>*/</script>
...and that's about it! Not caring about XML namespaces or matching HTML's implied DOM for tables drops about half the rules :-)
Await some future when I can go directly to authoring XHTML, skipping polyglotness. The benefits are that I'll be able to forget the tag-closing limitations and will be able to consume and produce it directly with XML tools. Sure, neglecting XML namespaces and other things now will make the switch harder, but I think I'll create more new documents in that future than convert existing ones.
Actually I'm not entirely sure what's stopping me from living in that future right now. Is it only IE 8? I'm also a tiny bit concerned about the all-or-nothing error handling. I'm slightly hoping a future HTML spec will find a way to shrink the HTML vs. XML gaps, e.g. make browsers accept <hr></hr> and <script .../> in HTML, while still retaining HTML error handling.
Also, tools. Having libraries in many languages that can serialize to polyglot markup would make it feasible for programs to generate it. Having tools to validate and convert HTML5 <-> polyglot <-> XHTML5 would help. Otherwise, it's pretty much doomed.
Given that the W3C's documentation on the differences between HTML and XHTML isn't even finished, it's probably not worth your time to try to do polyglot. Not yet anyways.... give it another couple of years.
In any event, only in the extremely narrow circumstances where you are actively planning on parsing your HTML as XML for some specific purpose, should you invest the extra time in XML-compliance. There are no benefits of doing it purely for consumption by web browsers -- only drawbacks.
Should you? Yes. But first some clarification on a couple points.
Sending the Content-Type: application/xhtml+xml header only means the document goes through an XML parser; it still has all the benefits of HTML5 as far as I can tell.
About &nbsp;: that isn't defined in XML. The only character entity references XML defines are lt, gt, apos, quot, and amp; you will need to use numeric character references for anything else. The code for nbsp is &#160; or &#xA0;. I personally prefer hex because Unicode code points are written that way (U+00A0).
Sending the header is useful for testing because you can quickly find problems with your markup, such as unclosed tags, stray end tags, text that could be interpreted as a tag, etc. - basically stuff that can break the look or even the functionality of your site.
Most significantly in my opinion, is if you are allowing user input and it fails to parse, that generally means you didn't escape their data and are leaving yourself open to a vulnerability. Parsed as HTML, you might not ever notice a problem until someone starts injecting scripts to harass your users or steal data.
This page is pretty good about explaining what polyglot markup is: https://blog.whatwg.org/xhtml5-in-a-nutshell
This sounds like a very difficult thing to do. One of the downfalls of XHTML was that it wasn't possible to steer successfully between the competing demands of XML and vintage HTML.
I think if you write HTML5 and validate it successfully, you will have as tidy and valid a document as anyone would need.
This wiki has some information not present in the W3C document: http://wiki.whatwg.org/wiki/HTML_vs._XHTML

Django templatetag for rendering a subset of html

I have some html (in this case created via TinyMCE) that I would like to add to a page. However, for security reason, I don't want to just print everything the user has entered.
Does anyone know of a templatetag (a filter, preferably) that will allow only a safe subset of html to be rendered?
I realize that markdown and others do this. However, they also add additional markup syntax which could be confusing for my users, since they are using a rich text editor that doesn't know about markdown.
There's removetags, but it's a blacklisting approach which fails to remove tags when they don't look exactly like the well-formed tags Django expects, and of course since it doesn't attempt to remove attributes it is totally vulnerable to the 1,000 other ways of script-injection that don't involve the <script> tag. It's a trap, offering the illusion of safety whilst actually providing no real security at all.
HTML-sanitisation approaches based on regex hacking are almost inevitably a total fail. Using a real HTML parser to get an object model for the submitted content, then filtering and re-serialising in a known-good format, is generally the most reliable approach.
If your rich text editor outputs XHTML it's easy, just use minidom or etree to parse the document then walk over it removing all but known-good elements and attributes and finally convert back to safe XML. If, on the other hand, it spits out HTML, or allows the user to input raw HTML, you may need to use something like BeautifulSoup on it. See this question for some discussion.
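A minimal sketch of that whitelist-walk approach using xml.etree.ElementTree, assuming the editor really does emit well-formed XHTML (the tag and attribute lists below are only examples):

import xml.etree.ElementTree as ET

# Illustrative whitelists -- adjust to whatever your editor actually emits.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li", "br", "a"}
ALLOWED_ATTRS = {"a": {"href", "title"}}

def sanitize_xhtml(fragment):
    # Wrap the fragment so multiple top-level elements still form one XML document.
    root = ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    _strip_unknown(root)
    # Serialize the children back out, dropping the artificial wrapper element.
    return (root.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in root
    )

def _strip_unknown(element):
    for child in list(element):
        if child.tag not in ALLOWED_TAGS:
            element.remove(child)        # note: this drops the child's text and tail too
            continue
        allowed = ALLOWED_ATTRS.get(child.tag, set())
        for attr in list(child.attrib):
            if attr not in allowed:
                del child.attrib[attr]
        _strip_unknown(child)

A real filter should also vet the attribute values it keeps (for example rejecting javascript: URLs in href) before trusting the result.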
Filtering HTML is a large and complicated topic, which is why many people prefer the text-with-restrictive-markup languages.
Use HTML Purifier, html5lib, or another library that is built to do HTML sanitization.
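For instance, bleach is one such Python library (my suggestion; the answer above only names HTML Purifier and html5lib), and it drops straight into a custom template filter. The module path and whitelists below are hypothetical:

# e.g. yourapp/templatetags/sanitize.py
import bleach
from django import template
from django.utils.safestring import mark_safe

register = template.Library()

ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li", "a", "br"}
ALLOWED_ATTRS = {"a": ["href", "title"]}

@register.filter
def sanitize(value):
    # bleach strips disallowed tags and attributes instead of escaping them.
    return mark_safe(bleach.clean(value, tags=ALLOWED_TAGS,
                                  attributes=ALLOWED_ATTRS, strip=True))

In a template that becomes {% load sanitize %} and then {{ data|sanitize }}.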
You can use removetags to specify a list of tags to remove:
{{ data|removetags:"script" }}

Win32: How to scrape HTML without regular expressions?

A recent blog entry by Jeff Atwood says that you should never parse HTML using regular expressions - yet it doesn't give an alternative.
I want to scrape search results, extracting these values:
<div class="used_result_container">
...
...
<div class="vehicleInfo">
...
...
<div class="makemodeltrim">
...
<a class="carlink" href="[Url]">[MakeAndModel]</a>
...
</div>
<div class="kilometers">[Kilometers]</div>
<div class="price">[Price]</div>
<div class="location">
<span class='locationText'>Location:</span>[Location]
</div>
...
...
</div>
...
...
</div>
...and it repeats
You can see the values I want to extract, [enclosed in brackets]:
Url
MakeAndModel
Kilometers
Price
Location
Assuming we accept the premise that parsing HTML with regular expressions:
is generally a bad idea
rapidly devolves into madness
What's the way to do it?
Assumptions:
native Win32
loose html
Assumption clarifications:
Native Win32
.NET/CLR is not native Win32
Java is not native Win32
perl, python, ruby are not native Win32
assume C++, in Visual Studio 2000, compiled into a native Win32 application
Native Win32 applications can call library code:
copied source code
DLLs containing function entry points
DLLs containing COM objects
DLLs containing COM objects that are COM-callable wrappers (CCW) around managed .NET objects
Loose HTML
xml is not loose HTML
xhtml is not loose HTML
strict HTML is not loose HTML
Loose HTML implies that the HTML is not well-formed XML (strict HTML is not well-formed XML anyway), and so an XML parser cannot be used. In reality I am presenting the assumption that any HTML parser must be generous in the HTML it accepts.
Clarification#2
Assuming you like the idea of turning the HTML into a Document Object Model (DOM), how then do you access repeating structures of data? How would you walk a DOM tree? I need a DIV node with class used_result_container that has a descendant DIV with class vehicleInfo. But the nodes don't necessarily have to be direct children of one another.
It sounds like I'm trading one set of regular expression problems for another. If they change the structure of the HTML, I will have to re-write my code to match - as I would with regular expressions. And assuming we want to avoid those problems, because those are the problems with regular expressions, what do I do instead?
And would I not be writing a regular expression parser for DOM nodes? I'd be writing an engine to parse a string of objects, using an internal state machine and forward and back capture. No, there must be a better way - the way that Jeff alluded to.
I intentionally kept the original question vague, so as not to lead people down the wrong path. I didn't want to imply that the solution, necessarily, had anything to do with:
walking a DOM tree
xpath queries
Clarification#3
The sample HTML I provided was trimmed down to the important elements and attributes. The mechanism I used to trim the HTML was based on my internal bias towards regular expressions: I naturally think that I need various "sign-posts" in the HTML to look for.
So don't confuse the presented HTML for the entire HTML. Perhaps some other solution depends on the presence of all the original HTML.
Update 4
The only proposed solutions seem to involve using a library to convert the HTML into a Document Object Model (DOM). The question then would have to become: then what?
Now that I have the DOM, what do I do with it? It seems that I still have to walk the tree with some sort of regular DOM expression parser, capable of forward matching and capture.
In this particular case I need all the used_result_container DIV nodes which contain vehicleInfo DIV nodes. Any used_result_container DIV nodes that do not contain a vehicleInfo node are not relevant.
Is there a DOM regular expression parser with capture and forward matching? I don't think XPath can select higher level nodes based on criteria of lower level nodes:
\\div[#class="used_result_container" && .\div[#class="vehicleInfo"]]\*
Note: I use XPath so infrequently that I cannot make up hypothetical xpath syntax very goodly.
Python:
lxml - faster, perhaps better at parsing bad HTML
BeautifulSoup - if lxml fails on your input try this.
Ruby: (heard of the following libraries, but never tried them)
Nokogiri
hpricot
Though if your parsers choke, and you can roughly pinpoint what is causing the choking, I frankly think it's okay to use a regex hack to remove that portion before passing it to the parser.
If you do decide on using lxml, here are some XPath tutorials that you may find useful. The lxml tutorials kind of assume that you know what XPath is (which I didn't when I first read them.)
Edit: Your post has really grown since it first came out... I'll try to answer what I can.
I don't think XPath can select higher level nodes based on criteria of lower level nodes:
It can. Try //div[@class='vehicleInfo']/parent::div[@class='used_result_container']. Use ancestor if you need to go up more levels. lxml also provides a getparent() method on its search results, and you could use that too. Really, you should look at the XPath sites I linked; you can probably solve your problems from there.
how then do you access repeating structures of data?
It would seem that DOM queries are exactly suited to your needs. XPath queries return you a list of the elements found -- what more could you want? And despite its name, lxml does accept 'loose HTML'. Moreover, the parser recognizes the 'sign-posts' in the HTML and structures the whole document accordingly, so you don't have to do it yourself.
Yes, you still have to do a search on the structure, but at a higher level of abstraction. If the site designers decide to do a page overhaul and completely change the names and structure of their divs, then that's too bad: you have to rewrite your queries, but it should take less time than rewriting your regex. Nothing will do it automatically for you, unless you want to write some AI capabilities into your page-scraper...
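For concreteness, here is a rough sketch of the lxml + XPath approach against the sample markup in the question (Python rather than native Win32, and the file name and exact queries are mine):

from lxml import html

tree = html.parse("results.html")      # lxml's HTML parser accepts loose HTML

# Every used_result_container that actually has a vehicleInfo descendant;
# the parent::/ancestor:: formulation above reaches the same nodes from the other direction.
containers = tree.xpath(
    "//div[@class='used_result_container'][.//div[@class='vehicleInfo']]"
)

for c in containers:
    link = c.xpath(".//a[@class='carlink']")[0]
    print({
        "Url": link.get("href"),
        "MakeAndModel": link.text_content().strip(),
        "Kilometers": c.xpath("string(.//div[@class='kilometers'])").strip(),
        "Price": c.xpath("string(.//div[@class='price'])").strip(),
        "Location": c.xpath("string(.//div[@class='location'])").strip(),  # still includes the "Location:" label
    })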
I apologize for not providing 'native Win32' libraries, I'd assumed at first that you simply meant 'runs on Windows'. But the others have answered that part.
Native Win32
You can always use IHtmlDocument2. This is built into Windows at this point. With this COM interface, you get native access to a powerful DOM parser (IE's DOM parser!).
Use Html Agility Pack for .NET
Update
Since you need something native/antique, and the markup is likely bad, I would recommend running the markup through Tidy and then parsing it with Xerces
Use Beautiful Soup.
Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. There's also a Ruby port called Rubyful Soup.
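To make that concrete against the sample markup in the question (the class names come from the question; the file name and the rest are mine):

from bs4 import BeautifulSoup

with open("results.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")      # tolerant of loose HTML

for container in soup.find_all("div", class_="used_result_container"):
    info = container.find("div", class_="vehicleInfo")
    if info is None:
        continue                                # skip containers without vehicleInfo
    link = info.find("a", class_="carlink")
    print(link["href"], link.get_text(strip=True))

(This uses the current bs4 package; Beautiful Soup 3 spells the method findAll.)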
If you are really under Win32 you can use a tiny and fast COM object to do it
example code with vbs:
Set dom = CreateObject("htmlfile")
dom.write("<div>Click for <img src='http://www.google.com/images/srpr/logo1w.png'>Google</a></div>")
WScript.Echo(dom.Images.item(0).src)
You can also do this in JScript, or VB/Delphi/C++/C#/Python etc. on Windows. It uses the mshtml.dll DOM layout engine and parser directly.
The alternative is to use an HTML DOM parser. Unfortunately, it seems like most of them have problems with poorly formed HTML, so in addition you need to run it through HTML Tidy or something similar first.
If a DOM parser is out of the question - for whatever reason - I'd go for some variant of PHP's explode() or whatever is available in the programming language that you use.
You could for example start out by splitting by <div class="vehicleInfo">, which would give you each result (remember to ignore the first piece). After that you could loop over the results and split each result by <div class="makemodeltrim">, etc.
This is by no means an optimal solution, and it will be quite fragile (almost any change in the layout of the document would break the code).
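A crude Python rendering of that splitting idea, for what it's worth (deliberately fragile, as noted above; the file name is mine):

with open("results.html", encoding="utf-8") as f:
    page = f.read()

# Everything before the first marker is not a result, hence the [1:].
for chunk in page.split('<div class="vehicleInfo">')[1:]:
    price = chunk.split('<div class="price">')[1].split('</div>')[0]
    print(price.strip())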
Another option would be to go after some CSS selector library like phpQuery or similar for your programming language.
Use a DOM parser
e.g. for java check this list
Open Source HTML Parsers in Java (I like to use cobra)
Or, if you are sure that you only want to parse a certain subset of your HTML, one which ideally is also valid XML, you could use an XML parser to parse only the fragment you pass it, and then even use XPath to request the values you are interested in.
Open Source XML Parsers in Java (e.g. dom4j is easy to use)
I think libxml2, despite its name, also does its best to parse tag soup HTML. It is a C library, so it should satisfy your requirements. You can find it here.
BTW, another answer recommended lxml, which is a Python library, but is actually built on libxml2. If lxml worked well for him, chances are libxml2 is going to work well for you.
How about using Internet Explorer as an ActiveX control? It will give you a fully rendered structure as it viewed the page.
The HTML::Parser and HTML::Tree modules in Perl are pretty good at parsing most typical so-called HTML on the web. From there, you can locate elements using XPath-like queries.
What do you think about IHTMLDocument2? I think it should help.