Handling of errors while parsing HTML

For various reasons that are beyond the scope of this question, I am using an ad hoc HTML parsing class written in Python. This simple class has so far been sufficient for the kind of input it was fed, but it recently tried to parse http://forum.macbidouille.com/index.php?showtopic=160607
This webpage is obviously generated automatically by some PHP code, but it contains user-generated HTML that is included verbatim as a signature for each post. Most notably, http://forum.macbidouille.com/index.php?showtopic=160607#entry1563022 contains the following HTML (comments removed and tags indented for clarity):
<div class="signature">
<span style="font-family:Verdana">
<span style="color:#8B0000">
<span style="font-size:12pt;line-height:100%">
<div align='center'>La Culture coûte cher, mais l'inculture coûte encore plus cher à la Société. <br />
<span style="font-size:8pt;line-height:100%"><i>Marcel Landowsky</i></span>
</span><br />
</div>
</span>
</span>
<div align='left'><br />macbook unibody 10.6.8 - 2.26ghz - 4Go- 250Go - <br />Je n'ai pas de télévision !</div>
</div>
As should be obvious from the above, there is a stray tag that is closed too early. i.e., we have invalid HTML here. Nothing extraordinary but this is sufficient to make my parsing code fail. Specifically, so far, that parsing code has a very simple error handling strategy: it merely tries to match each closing tag with the currently opened tag and if the closing tag does not match, it is ignored.
In the case of the above code, this results in ignoring </span> on line 7 because it does not match the currently open <div> tag from line 5, and then ignoring </div> on the last line because it does not match the currently open <span> tag from line 2. The result is that all the HTML that follows this block is assumed to be hierarchically nested within that first <span> tag, which leads to other problems later.
What I would like to achieve is to 'synchronize' the parsing state better and I wonder what kind of simple approach would lead to a parser that can handle this block of html. I can see how I could try to minimize the number of closing tags thrown away once I have completed the parsing by re-arranging the generated tree but I am looking for a simpler solution.
I know that the first answer will be: "use library X" and this is likely what I am going to end up doing but I am actually curious as to what kind of interesting parsing and error handling strategies could be used in this case. i.e., I am trying to get educated :)
thanks!

Your best bet is to parse (and fix) the user-supplied HTML first; otherwise it can corrupt the DOM structure of the surrounding page in all kinds of ways. First, check the tag nesting of the user HTML and sanitize it (i.e. the stray </span> has no corresponding start tag, so it should be removed). If you have an HTML-only parser, enclose the user HTML in <div>..</div> before parsing - this should do the trick.
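As an illustration of one possible recovery strategy (a minimal sketch, not the asker's parser): keep a stack of open tags, and when a closing tag does not match the innermost open tag, search further up the stack instead of dropping it immediately. A block-level closing tag such as </div> may implicitly close inline elements left open inside it, while an inline closing tag such as </span> is not allowed to close "through" an open block element and is ignored. This is loosely inspired by how browsers treat such input; the class name RecoveringParser, the helper trace and the BLOCK_TAGS and VOID_TAGS sets are assumptions made up for this example and are far from complete.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete tag sets for this sketch.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "table", "blockquote"}


class RecoveringParser(HTMLParser):
    """Tracks tag nesting and records how each closing tag was resolved."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.events = []  # (kind, tag); kind is open/close/implied-close/ignored

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never stay open
        self.stack.append(tag)
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Walk the stack from the innermost open tag outwards.
        for depth in range(len(self.stack) - 1, -1, -1):
            open_tag = self.stack[depth]
            if open_tag == tag:
                # Everything opened after the match is closed implicitly.
                for implied in reversed(self.stack[depth + 1:]):
                    self.events.append(("implied-close", implied))
                self.events.append(("close", tag))
                del self.stack[depth:]
                return
            if open_tag in BLOCK_TAGS and tag not in BLOCK_TAGS:
                break  # an inline end tag may not cross an open block
        self.events.append(("ignored", tag))


def trace(html):
    """Parse html and return the parser so stack and events can be inspected."""
    parser = RecoveringParser()
    parser.feed(html)
    parser.close()
    return parser
```

On a condensed version of the signature above, the stray </span> is ignored (it would have to cross the open <div align='center'>), and the final </div> closes the outer <div> together with the <span> that was still open, so whatever follows the signature is no longer nested inside it.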

Related

Can I safely replace "<ul>" tags within HTML using regexes?

I am trying to solve this issue, where users paste invalid HTML that we have to deal with, of the form <ol><ul><li>item</li></ul></ol>. We are currently parsing using lxml. In legal HTML, <ol> cannot have a (direct) child of a <ul> (it must be in an <li>) so lxml closes the ol tag too soon to try to "repair" the HTML, producing <div><ol/><ul><li>item</li></ul>.
The user-pasted text also might be invalid XML (e.g., bare <br> tag), so we can't just parse it as XML.
Thus, we can neither parse it as HTML nor XML, because it might be invalid.
To make this certain (common) case of invalid HTML into valid HTML, can we just replace all <ul> tags with <ol> tags using regexes?
If I use lxml to parse <ol><ol><li>item</li></ol></ol>, the output looks fine (does not close a tag too soon).
However, I don't want to break actual user-typed text, and I'm wondering if there are edge cases I haven't thought of (like "<ul>" within a <pre> tag or some other crazy thing that isn't actually a tag, though I've tested that particular case).
Yes, it would change unnumbered lists to numbered lists. I'm okay with that.
Yes, I have read this fun regex answer.
In general, there is no guarantee of a 'non-edge case' transform with HTML and regular expressions. HTML, more so than XML, has rules that make a direct text replacement of things that look like tags problematic.
The following text validates as HTML using w3c.org validation checker without any warnings.
<!DOCTYPE html>
<html lang="en">
<head>
<title><!--<ul>--></title>
<style lang="css">s {content: "<ul>";}</style>
<script>"<ul>"</script>
</head>
<body data-ul="<ul>"></body>
</html>
That aside, some regular expression heuristics might solve the issue at hand, at least within a reasonable scope. A streaming HTML token parser that does not attempt to apply any validation or DOM/tree building might also be useful for the initial replacement stage.
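A minimal sketch of the streaming-token idea, using Python's standard html.parser module rather than lxml: each token is re-emitted as text, and only real <ul> start and end tags are rewritten to <ol>. Text that merely looks like a tag inside comments, scripts or attribute values is passed through untouched. The names UlToOl and replace_ul_with_ol are made up for this example, and end tags are re-serialized in lowercase, so the output is not byte-for-byte identical to the input.

```python
from html.parser import HTMLParser


class UlToOl(HTMLParser):
    """Re-emits the token stream, turning <ul> tags into <ol> tags."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_start(self, tag):
        raw = self.get_starttag_text()  # the start tag exactly as written
        if tag == "ul":
            raw = "<ol" + raw[3:]       # swap only the tag name, keep attributes
        self.out.append(raw)

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag)

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag)           # self-closing form such as <br />

    def handle_endtag(self, tag):
        self.out.append("</ol>" if tag == "ul" else "</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)           # includes script and style contents

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def replace_ul_with_ol(html):
    parser = UlToOl()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because no tree is built, nothing gets "repaired" at this stage; the rewritten text can then be handed to lxml as before.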

HTML XPath: Extracting text mixed in with multiple level and complex tags?

related questions before:
HTML XPath: Extracting text mixed in with multiple tags?
HTML XPath: Selectively avoiding tags when extracting text
//sorry for my poor English
I'm a beginner at writing web crawlers. I'm trying to extract the main content from web pages (in Chinese) with XPath (though I have learned that there are algorithms, both traditional and machine-learning based, for extracting the main content of a web page), and I'm a complete beginner at writing XPath rules.
I'm faced with a web page that contains text mixed in with complex tags. I summarize it as follows, where a character (e.g. A, A2) means text only and '...' means more tags, possibly nested, without text. I want to get "AA2BB2CDEFGHIJKLMNOP"
...
<div id="artibody" class="art_context">
<div align="center">...</div>
<div align="center"><font>A</font>A2</div>
<div align="left"><br><br><strong>B</strong>B2</div>
<div align="left">
<p>C<a>D</a>E</p>
<p>F<a>G</a>H<a>I</a>J</p>K
</div>
<div align="center">...</div>
<div align="center"><font>L</font></div>
<p>M</p><!--M contains only text luckly-->
<p>N</p>
<p>O</p>
<p>P<span>...</span><div class="shareBox">...</div>
</p>
<span id="arctTailMark"></span>
<script>
var page_navigation = document.getElementById('page_navigation');
...
</script>
<div style="padding:10px 0 30px 0">...</div>
</div>
Thanks to the previous questions, I wrote this rule:
'string(//div[@class=\"art_context\"])'
It gives me all the content I want as plain text without tags, but the JS code in <script> is extracted as well. I tried the following, but it does not seem to help; the JS code is still in the result.
'string(//div[@class=\"art_context\" and not(self::script)])'
The following one returns "\r\n" only.
'//div[@class=\"art_context\" and not(self::script)]/text()'
Here are my questions:
1. How do I write an XPath rule that meets my need: extracting the content of div[@id="artibody"] except the code in <script>?
2. Is the rule for question 1 simple and powerful? I may meet more pages with a div[@id="artibody"] whose descendant nodes are quite different.
3. Any further suggestions on my task? I am extracting web content from one website, but the main content lies in <div> elements with different id, class, and descendant node structure. I run the spider on my laptop (Intel Core i5 3225, 8 GB RAM), and using machine learning algorithms may decrease the crawl speed significantly. At the same time, writing many XPath rules seems tedious.
I'd appreciate it if you could give me any suggestions on this question(and my English).
To get all descendant text nodes except the script contents, you can use this:
//div[@class="art_context"]//*[not(self::script)]/text()
In natural language: “Get all text nodes from descendants of all div[@class="art_context"] elements that are not script elements”.
The // after div[@class="art_context"] is needed to select descendants, not just children.
In comparison, the //div[@class="art_context" and not(self::script)]/text() expression in the question says “Get all text-node children of all div[@class="art_context"] elements that are not also script elements.”
So the and not(self::script) part in the expression in the question is redundant, because all the expression is doing is selecting just //div[@class="art_context"] anyway, and then the /text() part is selecting only the text-node direct children of that div, which is just line breaks.
Also, if instead of using XPath to just get the set of text nodes, you want to use XPath to get the result as a single string, you can use the functions string-join(…) and normalize-space(…). Note that string-join(…) requires an XPath 2.0 processor:
normalize-space(string-join(//div[@class="art_context"]//*[not(self::script)]/text(), ""))
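Where only XPath 1.0 is available, or no XPath processor at all, the same filtering can be done while streaming through the page. The following is a sketch of that different technique using Python's standard html.parser, not an XPath solution: it collects text inside the first-level div whose class list contains a given name and skips everything inside <script> and <style>. The names ArticleText and extract_text are hypothetical, and unlike @class="art_context" the class test here matches any div that has art_context among its classes.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collects text inside a target div, skipping script and style content."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.div_depth = 0      # 0 means we are outside the target div
        self.skipping = False   # True while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Count nested divs so we know when the target div ends.
            if self.div_depth or self.target_class in classes:
                self.div_depth += 1
        elif tag in ("script", "style"):
            self.skipping = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag in ("script", "style"):
            self.skipping = False

    def handle_data(self, data):
        if self.div_depth and not self.skipping:
            self.chunks.append(data)


def extract_text(html, target_class="art_context"):
    parser = ArticleText(target_class)
    parser.feed(html)
    parser.close()
    # Join the chunks, then collapse whitespace like normalize-space().
    return " ".join("".join(parser.chunks).split())
```

Counting only div tags keeps the sketch tolerant of unclosed tags such as <br>, but it assumes the div tags themselves are balanced.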

What do square brackets mean in html?

I am assisting on a project right now and building out templates for the first time, trying to wrap my head around a few things but one aspect of the html that's confusing me are certain things sitting in square brackets. I've never used these in html before so I'm just wondering what they are for (when I open the page in a browser they all show up as text)
Here's a bit of the code:
<div class="container">
[HASBREADCRUMBS]
<ol class="nav-breadcrumb">
[BREADCRUMBS]
</ol>
[/HASBREADCRUMBS]
<h1 class="header-title" style="color:[TITLECOLOR];font-size:[TITLESIZE];">[TITLE]</h1>
</div>
It's using some templating engine and the whole page is parsed before getting output to the browser. During parsing, those square bracket tags work as something else (depending on the templating engine used).
So, for example, [HASBREADCRUMBS] and [/HASBREADCRUMBS] could denote a piece of code that might be similar to:
if (breadcrumbs) {
and:
} // closed if
and for each value of the breadcrumbs object (whatever it might be) one ordered HTML list is rendered with the breadcrumb value as its content ([BREADCRUMBS]).
So in short: it's not HTML, that part of the file never reaches the browser but is converted into proper HTML (based on conditions, can also use loops, etc.) before rendering.
The square brackets have nothing to do with HTML. They probably belong to the template and will be replaced by actual value from the template engine.
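To make the idea concrete, here is a toy renderer for this kind of placeholder syntax. It is only a sketch under assumed semantics, since the actual template engine is unknown: [HASNAME]...[/HASNAME] is treated as a block kept only when NAME has a non-empty value, and [NAME] is replaced by that value. The function render and the context dictionary are made up for this example.

```python
import re


def render(template, context):
    """Expand [HASX]...[/HASX] blocks and [X] placeholders (assumed semantics)."""

    def expand_block(match):
        name, body = match.group(1), match.group(2)
        # Keep the block body only if the named value is present and non-empty.
        return body if context.get(name) else ""

    without_blocks = re.sub(
        r"\[HAS([A-Z]+)\](.*?)\[/HAS\1\]", expand_block, template, flags=re.DOTALL
    )

    def expand_value(match):
        # Unknown placeholders are left in place so they are easy to spot.
        return str(context.get(match.group(1), match.group(0)))

    return re.sub(r"\[([A-Z]+)\]", expand_value, without_blocks)
```

Opening the raw template directly in a browser skips this step entirely, which is why the bracketed names show up as plain text.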

Forcing a line break in a string in HTML, Equivalent to \n in HTML

This is being used in a Bootstrap Popover.
The live page under development can be viewed here
This has got to be simple, but I can't find it anywhere. Within the data-content attribute I want to force a paragraph or line break between "Date Assessed: 10-Nov-13" and "Results: CR= ..."
Using a BR or P tag doesn't work; it shows the literal tag. In JavaScript you use \n to force a line break. How do you do the same in HTML within a quoted string?
<td class="setWidth concat"><div class="boldTitle"><a href="#"
class="tip" rel="popover" data-trigger="hover"
data-placement="top"
data-content="Date Assessed: 10-Nov-13 <br />
Results: Cr = 2.2 mg/dl"
data-original-title="Out of Range">
<span style="color:red"
class="glyphicon glyphicon-warning-sign"></span> Cr = 2.2 mg/dL</a></div></td>
See last update: Bootstrap gives you ability to specify that the content is HTML instead of text.
It depends entirely on bootstrap's implementation of the popover effect. If they are using $('.popover').html($(this).data('content')) then it should "just work". If they are using $('.popover').text($(this).data('content')) or otherwise escaping the results of the data-attribute first, then it probably won't.
If bootstrap's implementation isn't working the way you want it to work, you might be served better by writing your own javascript to handle the effect you're looking for.
See this fiddle for an example of a line break from a data-attribute working correctly:
http://jsfiddle.net/g32tw/1/
Update: I've updated the fiddle with a second link that produces the error you're experiencing, which is likely how bootstrap's implementation works.
UPDATE: I just looked at Bootstrap's documentation. Have you tried adding data-html="true" to the element?
Source: http://getbootstrap.com/javascript/#popovers-usage
Watch out with this - if the content is end-user-supplied using the html option might subject you to XSS attack vulnerabilities. If you trust the data it's fine. See https://www.acunetix.com/websitesecurity/cross-site-scripting/ for information about cross-site scripting.
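One way to keep the HTML option while limiting that risk is to escape each untrusted piece of text on the server and add only the <br /> yourself. A minimal sketch, assuming the page is generated by Python code (the question does not say what generates it); popover_content and popover_attribute are hypothetical helper names:

```python
from html import escape


def popover_content(lines):
    """Join untrusted text lines with <br />, escaping each line first."""
    return "<br />".join(escape(line) for line in lines)


def popover_attribute(lines):
    """Return the value to place inside data-content="..." in the page source."""
    # Escaped a second time so the markup survives inside a quoted attribute;
    # the browser undoes this layer when it reads the attribute.
    return escape(popover_content(lines), quote=True)
```

With data-html="true", the popover then receives a string in which only the <br /> is real markup, and anything the user typed is shown as literal text.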
I am not sure that you can. You could try to have two data items:
data-assessdate="Date Assessed: 10-Nov-13"
data-results="Cr = 2.2 mg/dl"
and reassemble afterward with Javascript before displaying:
var summary = this.dataset;
var newhtml = summary.assessdate + "<br />" + summary.results;
and then write newhtml to the DOM wherever you want.

What does the html bindpoint attribute do?

I'm currently making a facebook application and while investigating the (X)HTML source code for a message thread page to see if it was possible to link to specific messages within threads (apparently it's not), I encountered an HTML attribute that I cannot seem to find any information about. Some span elements on the page had a 'bindpoint' attribute that was set to various values (presumably element IDs). Here is an excerpt from the page source (I replaced some private info with Xs)
<div class="GBThreadMessageRow_Info">
<span class="GBThreadMessageRow_AuthorLink_Wrapper" bindpoint="authorLinkWrapper">
XXXXXXXX
</span>
<span class="GBThreadMessageRow_Date">
April 8, 2010 at 10:13pm
</span>
<span bindpoint="branchLinkWrapper" class="GBThreadMessageRow_BranchLink">Reply</span>
<span bindpoint="reportLinkWrapper" class="GBThreadMessageRow_ReportLink"> • Report</span>
</div>
I have never seen this attribute before and any information about it would be useful/helpful/interesting. Thanks!
As was said in the comments, it has to be something they're doing in the javascript code.
Facebook uses an interesting technique to import their javascript files dynamically (basically they seem to write out script tags in the javascript, when necessary), and it's not quite as simple as just pressing ctrl-F through the first file you find.
So, in conclusion, the bindpoint attribute is something internal to the Facebook ecosystem, and not standard HTML. From the name, I assume it has something to do with which events (clicks, mouseovers, etc.) should be bound to the element in question, which is signified by a variable name given in the bindpoint attribute. Or maybe it has to do with which element the element in question should be 'bound' to, like the for attribute for a label. Anyway, this is pure speculation.