I'm interested in a parser that could take a malformed HTML page, and turn it into well formed HTML before performing some XPath queries on it. Do you know of any?
You should not use an XML parser to parse HTML. Use an HTML parser.
Note that the following is perfectly valid HTML (and an XML parser would choke on it):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Is this valid?</title>
</head>
<body>
<p>This is a paragraph
<table>
<tr> <td>cell 1 <td>cell 2
<tr> <td>cell 3 <td>cell 4
</table>
</body>
</html>
There are many task-specific (in addition to the general-purpose) HTML parsers on CPAN. They have worked perfectly for me on an immense variety of extremely messy (and most of the time invalid) HTML.
It would be possible to give specific recommendations if you can specify the problem you are trying to solve.
There is also HTML::TreeBuilder::XPath which uses HTML::Parser to parse the document into a tree and then allows you to query it using XPath. I have never used it but see Randal Schwartz's HTML Scraping with XPath.
Given the HTML file above, the following short script:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new;
$tree->parse_file("valid.html");
my @td = $tree->findnodes_as_strings('//td');
print $_, "\n" for @td;
outputs:
C:\Temp> z
cell 1
cell 2
cell 3
cell 4
The key point here is that the document was parsed by an HTML parser as an HTML document (despite the fact that we were able to query it using XPath).
Unless you're looking to learn more about wheels, use the HTML Tidy code.
You could rephrase the question like this, swapping C for HTML:
I'm interested in a parser that could take malformed C source, and turn it into well-formed C source before performing compilation and linking on it. Do you know of any?
Now the question may be a bit more obvious: it's not going to be easy. If it's truly malformed HTML, you may need to do the work by hand until it can be fed into an HTML parser. Then you can use any of the other modules presented here to do the work. It's unlikely, though, that you could ever programmatically translate raw HTML into strictly valid XHTML.
I'm trying to run some tests with XXE attacks in an HTML page, but I'm having trouble coming up with a working example. After looking around the internet for a long time, I came up with this:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
<script id="embeddedXML" type="text/xml">
<!DOCTYPE foo [
<!ELEMENT foo ANY>
<!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<foo>&xxe;</foo>
</script>
</head>
<body>
<script type="application/javascript">
alert(document.getElementById('embeddedXML').innerHTML);
</script>
</body>
</html>
But it doesn't work. The XML inside the script tag doesn't "run", per se, meaning that when the alert pops up, it just displays the XML as plain text. It doesn't interpret the DOCTYPE header and get the information from the listed file.
It's been very hard to google around for this because apparently XML doesn't "run", but something needs to happen where this text is interpreted instead of just written out. I don't know what that thing is, or how to get it working inside an HTML page as written here.
Any tips much appreciated. Thanks!
See OWASP
Among the Risk Factors is:
The application parses XML documents.
Now, script elements are defined (in HTML 4 terms) as containing CDATA, so markup in them (except </script>) has no special meaning. So there is no XML parsing going on there.
Meanwhile alert() deals in strings, not in markup, so there's still no XML parsing going on.
Since you have no XML parser, there's no vulnerability.
In general, if you want XML parsing in the middle of a web page, you need to do it from JavaScript (e.g. with DOMParser). I wouldn't be surprised if it was not DTD aware and so not vulnerable (and even if it were vulnerable, it might well block access to local external entities).
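As a quick sketch (reusing the payload from the question; hedged, since behaviour varies by browser), this is what that looks like with DOMParser:
var payload = '<!DOCTYPE foo [ <!ENTITY xxe SYSTEM "file:///etc/passwd"> ]><foo>&xxe;</foo>';
var doc = new DOMParser().parseFromString(payload, 'application/xml');
// As far as I know, browsers' XML parsers do not fetch external
// entities here: you get an empty string or a parsererror document,
// never the contents of /etc/passwd.
console.log(doc.documentElement.textContent);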
I want to parse some HTML documents, but it seems that Racket's html and xml libraries can't handle this very well. For example, here's an HTML document:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>Test</title>
<script>
var k = "<scr";
</script>
</head>
<body>
</body>
</html>
Neither read-html nor read-xml can parse this. They think the <scr in var k = "<scr" is part of an opening tag.
So, is there a better way to do this?
Try the html-parsing package.
The html-parsing parsing behavior is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages actually contain syntax errors that would defeat a strict or validating parser. html-parsing’s handling of errors is intended to generally emulate popular Web browsers’ interpretation of the structure of erroneous HTML. We euphemistically term this kind of parse “pragmatic.”
Although I don't know for sure if it will handle <script> tags like this, it might. The author, Neil Van Dyke, is active on the Racket mailing list.
This is meant to provide a canonical Q&A for all the similar questions popping up once or twice a week that are much too specific to be good close targets.
I'm developing an application that needs to parse a website with tables in it. As deriving XPath expressions for scraping web pages is boring and error-prone work, I'd like to use the XPath extractor feature of Firebug (or similar tools in other browsers) for this.
Example input looks like this:
<!-- snip -->
<table id="example">
<tr>
<th>Example Cell</th>
<th>Another one</th>
</tr>
<tr>
<td>foobar</td>
<td>42</td>
</tr>
</table>
<!-- snip -->
I want to extract the first data cell ("foobar"). Firebug proposes the XPath expression
//table[@id="example"]/tbody/tr[2]/td[1]
which works fine in any XPath tester plugin, but not in my own application (no results found). If I cut down the query to //table[@id], it works again.
What's going wrong?
The Problem: DOM Requires <tbody/> Tags
Firebug, Chrome's Developer Tool, XPath functions in JavaScript and others work on the DOM, not the basic HTML source code.
The DOM for HTML requires that all table rows not contained in a table header or footer (<thead/>, <tfoot/>) be included in table body tags <tbody/>. Thus, browsers add this tag if it's missing while parsing (X)HTML. For example, Microsoft's DOM documentation says
The tbody element is exposed for all tables, even if the table does not explicitly define a tbody element.
There is an in-depth explanation in another answer on stackoverflow.
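You can watch the browser do this from the console (a quick sketch):
var div = document.createElement('div');
div.innerHTML = '<table id="example"><tr><td>foobar</td></tr></table>';
// The HTML parser has wrapped the row in <tbody> on its own:
console.log(div.innerHTML);
// -> <table id="example"><tbody><tr><td>foobar</td></tr></tbody></table>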
On the other hand, HTML does not necessarily require that tag to be used:
The TBODY start tag is always required except when the table contains only one table body and no table head or foot sections.
Most XPath Processors Work on Raw XML
Excluding JavaScript, most XPath processors work on raw XML, not the DOM, and thus do not add <tbody/> tags. Also, HTML parser libraries like tag-soup and htmltidy only output XHTML, not "DOM-HTML".
This is a common problem posted on Stackoverflow for PHP, Ruby, Python, Java, C#, Google Docs (Spreadsheets) and lots of others. Selenium runs inside the browser and works on the DOM -- so it is not affected!
Reproducing the Issue
Compare the source shown by Firebug (or Chrome's Dev Tools) with the one you get by right-clicking and selecting "Show Page Source" (or whatever it's called in your browser) -- or by using curl http://your.example.org on the command line. The latter will probably not contain any <tbody/> elements (they're rarely used), while Firebug will always show them.
Solution 1: Remove /tbody Axis Step
Check if the table you're stuck at really does not contain a <tbody/> element (see last paragraph). If it does, you've probably got another kind of problem.
Now remove the /tbody axis step, so your query will look like
//table[@id="example"]/tr[2]/td[1]
Solution 2: Skip <tbody/> Tags
This is a rather dirty solution and likely to fail for nested tables (it can jump into inner tables). I would only recommend doing this in very rare cases.
Replace the /tbody axis step by a descendant-or-self step:
//table[@id="example"]//tr[2]/td[1]
Solution 3: Allow Both Input With and Without <tbody/> Tags
If you're not sure in advance whether your table has a <tbody/> element, or you need the query to work in both "HTML source" and DOM contexts, and you don't want to or cannot use the hack from solution 2, provide an alternative query (for XPath 1.0) or use an "optional" axis step (XPath 2.0 and higher).
XPath 1.0:
//table[@id="example"]/tr[2]/td[1] | //table[@id="example"]/tbody/tr[2]/td[1]
XPath 2.0: //table[@id="example"]/(tbody, .)/tr[2]/td[1]
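In the browser itself, document.evaluate only speaks XPath 1.0, so the union query is the variant to use there. A console sketch, assuming the example table from above is part of the page:
var query = '//table[@id="example"]/tr[2]/td[1] | //table[@id="example"]/tbody/tr[2]/td[1]';
var cell = document.evaluate(query, document, null,
    XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
console.log(cell && cell.textContent); // -> "foobar"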
Just came across the same problem. I almost wrote a recursive function to check for every tbody tag whether it exists and traverse the DOM that way; then I remembered I know regex. :)
Before parsing, get the HTML as a string. Insert missing <tbody> and </tbody> tags with regex, then load it back into your DOMDocument object.
Jens Erat gives a good explanation, but here is
Solution 4: Make sure the HTML source always has the <tbody> tags with regex
JavaScript
var html = '<html><table><tr><td>foo</td><td>bar</td></tr></table></html>';
// String.replace returns a new string, so assign the result back:
html = html.replace(/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/g,"$1<tbody>")
           .replace(/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/g,"$1</tbody>$4");
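Run against the sample string above, the two replacements give (quick check in a console):
console.log(html);
// -> <html><table><tbody><tr><td>foo</td><td>bar</td></tr></tbody></table></html>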
PHP
$html = $dom->saveHTML();
$html = preg_replace(array('/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/','/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/'),array('$1<tbody>','$1</tbody>$4'),$html);
$dom->loadHTML($html);
Just the regex:
matches a `<table>` tag (with whatever other junk is inside the tag) plus any text between it and the next tag, if the next tag is NOT `<tbody>` (also with stuff inside the tag)
/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/
replace with
$1<tbody>
the $1 referencing the captured `<table>` tag with contents.
Do the same for the closing tag like this:
/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/
replace with
$1</tbody>$4
This way the DOM will ALWAYS have the <tbody> tags where necessary.
This would be my first website, and I do not want to leave these errors in it. Can someone please help me with them?
Error 1:
if (xmlhttp.readyState==4 && xmlhttp.status==200)
error: character "&" is the first character of a delimiter but occurred as data.
When I write &amp; instead, my AJAX code stops working.
I have no clue how to correct this one.
Error 2:
…ems"><a href="brushdescription.php?id=<?php echo $popularbrushesrow['bd_brushi…
error: character "<" is the first character of a delimiter but occurred as data
Again the same error but for < this time
UPDATE:
I am using this doctype:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
&lt; and &amp; are two of the predefined entities in XML; the literal characters < and & need escaping when validating the page as XML or XHTML.
< should be replaced with &lt; (less than)
& should be replaced with &amp; (ampersand)
However, if using these characters in JavaScript you can (instead) enclose the script in a <![CDATA[]]> section, which instructs the parser to not interpret the code as markup and will also not result in a validation error.
Try wrapping your Javascript with <![CDATA[]]> tags like so:
<script>
//<![CDATA[
// Javascript goes here.
//]]>
</script>
Also, you should look into separation of concerns. Try to move your logic out of your view. If your JavaScript is in your HTML page, try to include it from a separate file.
From Wikipedia:
HyperText Markup Language (HTML), Cascading Style Sheets (CSS), and JavaScript (JS) are complementary languages used in the development of webpages and websites. HTML is mainly used for organization of webpage content, CSS is used for definition of content presentation style, and JS defines how the content interacts and behaves with the user. Historically, this was not the case though. Prior to the introduction of CSS, HTML performed both duties of defining semantics and style.
Use HTML, not XHTML (or, if you insist on using XHTML, see the guidelines on how to write XHTML that can be parsed as HTML).
I can't see how you could have generated that error. Some more context would be useful.
For the first error, consider switching from XHTML to HTML5. There's really little reason to use XHTML. Use this:
<!DOCTYPE html>
The W3C validator is for client-side code, but it seems you are trying to validate server-side code, hence the PHP tag. Send the rendered code for validation and the second error will go away. The rendered code is the one visible in the browser under "View source". You can supply the URL if it's already online somewhere.
By XML rules, “The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.” So “&&” is to be written as “&amp;&amp;”.
However, this works only when the document is processed as “real XHTML” due to having been sent with an XML content type, e.g. with the HTTP header Content-Type: application/xhtml+xml. Doing so implies that old versions of IE will choke on it and that modern browsers will refuse to render the document at all if it contains any well-formedness error. People don’t normally do that – they just use XHTML because someone told them to, and their documents are sent with an HTML content type, which means among other things that script element content is processed differently. This explains why a fix that satisfies the validator makes the page break.
In the XHTML 1.0 specification, the (in)famous appendix C says: “Use external scripts if your script uses < or & or ]]> or --.” This is the simple cure if you need to use XHTML. Just put your script in an external file and “call” it with <script src="foo.js"></script>.
I'm using the html5-tidy program to format debug output from a web app for display on the console, so that, as a developer, a debug dump of a string variable containing HTML is not an awful blob of text but rather a somewhat structured view of the HTML.
This is basically to extend upon what I have already done using perltidy for inspecting Perl data structures: the string output by Data::Dumper is sent through perltidy to make it easier for a human to analyze. Because Dumper will only ever produce syntactically valid Perl, this works pretty well.
Until we get to the big blobs of HTML text variables.
So I'd like to do the same with the HTML text (intelligently insert whitespace and newlines), but tidy is doing too much work for me:
$ ../bin/tidy -q test_tidy.html 2>/dev/null | diff test_tidy.html -
1,6c1,17
< <!-- COMMENT --> <p>This example shows how Tidy can indent output while preserving formatting of particular elements.</p><pre>This is <em>genuine preformatted</em> text</pre> <!-- END -->
---
> <!-- COMMENT -->
> <!DOCTYPE html>
> <html>
> <head>
> <meta name="generator" content=
> "HTML Tidy for HTML5 (experimental) for Linux https://github.com/w3c/tidy-html5/tree/c63cc39">
> <title></title>
> </head>
> <body>
> <p>This example shows how Tidy can indent output while preserving
> formatting of particular elements.</p>
> <pre>
> This is <em>genuine preformatted</em> text
> </pre>
> <!-- END -->
> </body>
> </html>
Theoretically I could assume that tidy is going to "always" add those extraneous things and basically strip them back out afterwards, or something. But that's horrible for many reasons. First, if it so happens that the input already has them, correctly or partially correctly, tidy will have changed them to be more correct than the original input was, which is bad! I could potentially display both copies so that there is no strange ambiguity in using the tool. But I'd like to avoid this and somehow have tidy just do tidying of these HTML pieces, instead of trying to build a standalone HTML page.
However, I am basically really close to what I want, so I'd rather not try to make something from scratch, because I know it will be difficult and error-prone. Tidy also automatically sends a really nice collection of warnings and errors over STDERR (which I suppressed in the example above); these are superb for placing alongside the debug functionality, because while we have a good automated code-checking standard in place for our Perl, the generated HTML is not subject to any sort of scrutiny.
In my opinion, you shouldn't even be using HTML Tidy; it makes for bad coding practices. You can, however, validate your code. This should make things a little more tidy.
http://jigsaw.w3.org/css-validator/
http://validator.w3.org/