I want to parse some HTML documents, but it seems that Racket's html and xml libraries can't handle this very well. For example, here's an HTML document:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>Test</title>
<script>
var k = "<scr";
</script>
</head>
<body>
</body>
</html>
Neither read-html nor read-xml can parse this. They think the <scr in var k = "<scr" is part of an opening tag.
So, is there a better way to do this?
Try the html-parsing package.
The html-parsing parsing behavior is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages actually contain syntax errors that would defeat a strict or validating parser. html-parsing’s handling of errors is intended to generally emulate popular Web browsers’ interpretation of the structure of erroneous HTML. We euphemistically term this kind of parse “pragmatic.”
Although I don't know for sure if it will handle <script> tags like this, it might. The author, Neil Van Dyke, is active on the Racket mailing list.
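A quick way to check would be something like this (a minimal sketch, assuming the package has been installed with raco pkg install html-parsing):
#lang racket
(require html-parsing)

;; html->xexp is permissive and does not raise parse errors;
;; inspecting the result shows how the script body was treated.
(html->xexp
 "<html><head><script>var k = \"<scr\";</script></head><body></body></html>")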
I'm trying to run some tests with XXE attacks in an HTML page, but I'm having trouble coming up with a working example. After looking around the internet for a long time, I came up with this:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
<script id="embeddedXML" type="text/xml">
<!DOCTYPE foo [
<!ELEMENT foo ANY>
<!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<foo>&xxe;</foo>
</script>
</head>
<body>
<script type="application/javascript">
alert(document.getElementById('embeddedXML').innerHTML);
</script>
</body>
</html>
But it doesn't work. The XML inside the script tag doesn't "run", per se: when the alert pops up, it just displays the XML as plain text. It doesn't interpret the DOCTYPE declaration and pull in the contents of the listed file.
It's been very hard to google around for this, because apparently XML doesn't "run"; but something needs to happen where this text is interpreted instead of just written out. I don't know what that thing is, or how to get it working inside an HTML page as written here.
Any tips much appreciated. Thanks!
See OWASP
Among the Risk Factors is:
The application parses XML documents.
Now, script elements are defined (in HTML 4 terms) as containing CDATA, so markup in them (except </script>) has no special meaning. So there is no XML parsing going on there.
Meanwhile alert() deals in strings, not in markup, so there's still no XML parsing going on.
Since you have no XML parser, there's no vulnerability.
In general, if you want XML parsing in the middle of a web page, you need to invoke it explicitly from JavaScript (e.g. with DOMParser), but I wouldn't be surprised if it were not DTD aware and so not vulnerable (and even if it were vulnerable, it might well block access to local external entities).
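For illustration, this is roughly what explicitly invoking a parser looks like (a sketch; DOMParser and XMLSerializer are standard browser APIs, and modern engines generally refuse to resolve external entities, so the XXE payload should stay inert):
<script type="application/javascript">
  // Pull the raw text out of the embedded block and parse it explicitly;
  // this parsing step is what the original page was missing.
  var xml = document.getElementById('embeddedXML').textContent;
  var doc = new DOMParser().parseFromString(xml, 'application/xml');
  // Serialize the parsed tree so we can see what the parser actually did.
  alert(new XMLSerializer().serializeToString(doc));
</script>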
On a website, we are using an HTML <base> tag in the head. The reason is the convenience of linking to static resources through relative URLs, and it's very hard to change. Content URLs are always fully qualified. So the head section looks like:
<head>
<base href="http://example.com/static/" />
</head>
Now, we are using RDFa to specify structured data on the page, e.g. to populate a schema:Product page such as http://example.com/product1. The problem comes from the base tag: in the absence of any other correction, the RDFa parser considers all the RDFa data to be about http://example.com/static/, not about http://example.com/product1.
We have tried with mixed results adding the property about="http://example.com/product1" on either <html> or <body>.
This works intermittently with Google's Structured Data Testing Tool: about 2 months ago it seemed to work when added to <body>; now it appears to work when added to <html>.
However, in the Search Console under "Structured Data", it is not even intermittently working. It used to work about 8 months ago with <html about="...">, but now it doesn't work either way: the pages are indexed, but the structured data is not.
So, is there a standard, tried and proven way to make Google (and any generic meta parser) properly know the URL of a webpage that has a generic <base href="" /> tag that is different to its actual URL?
Example 1
Assume the following is rendered by HTTP GET http://bar.com/product1
<html prefix="schema: http://schema.org/">
<head>
<base href="http://foo.com/" />
</head>
<body about="http://bar.com/product1" typeof="schema:Product">
<span property="schema:name">Bar product</span>
</body>
</html>
The above:
Was working with Google, based on Google Search Console / Structured Data ~8 months ago and the Google Structured Data Testing Tool ~2 months ago
Is not working with Google, based on Google Search Console / Structured Data, since ~8 months ago (no errors reported, but new content is not fetched into the structured data report), and is not parsing with the Testing Tool at the moment
Example 2
<html prefix="schema: http://schema.org/" about="http://bar.com/product1" typeof="schema:Product">
<head>
<base href="http://foo.com/" />
</head>
<body>
<span property="schema:name">Bar product</span>
</body>
</html>
Was not parsing with the Google Structured Data Testing Tool ~2 months ago
Is parsing with the Google Structured Data Testing Tool at the moment
Is not working with Google, based on Google Search Console / Structured Data, at the moment (no errors reported, but new content is not fetched into the structured data report)
Both of your example snippets seem to work correctly in Google’s Structured Data Testing Tool. As one would expect, they generate the same output.
@type Product
@id http://bar.com/product1
name Bar product
I can’t test it in Google’s Search Console, but I could imagine that the issue you see is not related to the RDFa markup.
Anyway, you could try to use resource instead of about. While both ways are fine in RDFa, RDFa Lite supports only resource. I’m not saying that Google only supports RDFa Lite (probably not, because their SDTT seems to support about fine), but when they refer to RDFa, they typically link to the RDFa Lite spec.
<html prefix="schema: http://schema.org/">
<head>
<base href="http://foo.com/" />
</head>
<body resource="http://bar.com/product1" typeof="schema:Product">
<span property="schema:name">Bar product</span>
</body>
</html>
I have a client/friend who is preparing an email to send through agencyaccess. They require an all-inclusive document with the HTML and plain-text versions of the email in one HTML document. I think I have a basic understanding, but am a bit confused. I generally use Mailchimp to handle my email marketing.
So we would use a regular html document with
<html>
<head>
<title>Our Email</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
our html markup
</body>
</html>
but then is it somewhere below this that we declare an alternative MIME type for the plain-text version, so that the email client chooses which to show? And would both of these have to be wrapped in a multipart/mixed MIME type?
I know this is probably pretty simple, but most of what I have read handles the MIME type declaration in the PHP file sending the mail, whereas we need to differentiate inside this document. Really just wondering how this should be structured.
So it was confusion over nothing. The service implied that the user was expected to upload a single document covering both (which would have meant specifying MIME types in the document), but this was not the case: they required everything to be plain HTML markup. The service itself was supposed to offer an additional step for inserting the plain-text version, and it was a bug on their part that they are working on. Hope that makes sense, and thanks for the responses, guys.
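For reference, if a single message really did need to carry both versions, that would happen at the MIME level of the raw email, not inside the HTML document. A minimal sketch of such a message (multipart/alternative rather than multipart/mixed, with the preferred HTML part last; the boundary string is arbitrary):
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset=iso-8859-1

Our plain-text version.

--BOUNDARY
Content-Type: text/html; charset=iso-8859-1

<html><body>our html markup</body></html>

--BOUNDARY--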
I'm trying to pass a page through the W3C Validator. The validation fails on the sitemap, which I'm including like this:
<link rel="sitemap" type="application/xml" title="Sitemap" href="../sitemap.xml" />
The error I'm getting is:
Bad value sitemap for attribute rel on element link: Not an absolute IRI. The string sitemap is not a registered keyword or absolute URL.
I have been trying forever to fix it, but nothing seems to work, and this is the layout recommended by Google and HTML5 Boilerplate.
Is there anything wrong with my syntax? Seems correct, but why is it not passing?
Dropping in from the future (June 2021).
The entry:
<link rel="sitemap" type="application/xml" title="Sitemap" href="/sitemap.xml">
is now accepted by the W3C HTML5 validator.
That is to say:
rel="sitemap"
is now a valid attribute + value.
Validating the following HTML file:
<!DOCTYPE html>
<html lang="en-gb">
<head>
<meta charset="utf-8">
<title>My Rel Sitemap Test</title>
<link rel="sitemap" type="application/xml" title="Sitemap" href="/sitemap.xml">
</head>
<body>
<h1>My Rel Sitemap Test</h1>
<p>This is my Rel Sitemap Test.</p>
<p>The document passes.</p>
<p>This document is valid HTML5 + ARIA + SVG 2 + MathML 3.0</p>
</body>
</html>
here: https://validator.w3.org/nu/
returns the response:
Document checking completed. No errors or warnings to show.
The short answer is that you cannot.
HTML 5 defines the values that you are allowed to use in rel, and sitemap is not one of the ones recognised by the validator.
The error message does say that you can register a new link type on a wiki, but sitemap is already there, so you just have to wait for the validator developers to update the validator to reflect the new state of the wiki (assuming nobody deletes the entry).
(The basic problems here are that having the specification use a wiki page as a normative resource is nuts, that HTML 5 is still a draft, and that the HTML 5 validator is still considered experimental).
If you only need the W3C validator to pass, perhaps you could detect its user agent and modify the output of your application so that it passes. I think of strict validation as more of a marketing benefit than anything when it comes to minor issues like this. If other developers use the W3C validator to say your client's web site is full of errors, then that is annoying.
You can check if the HTTP_USER_AGENT contains "W3C_Validator" and remove the non-standard code.
In CFML, I wrote code like this to make my Google Authorship link still validate with the W3C validator:
<cfif cgi.HTTP_USER_AGENT CONTAINS "W3C_Validator">data-</cfif>rel="publisher"
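In PHP, an equivalent check might look like this (a sketch; the href is a placeholder, and the data- prefix trick is the same as above):
<?php
// Serve data-rel to the validator so the page passes,
// and the real rel attribute to everyone else.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$attr = (strpos($ua, 'W3C_Validator') !== false) ? 'data-rel' : 'rel';
echo '<a ' . $attr . '="publisher" href="https://plus.google.com/your-page-id">Google+</a>';
?>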
I just posted a question on the Google forum asking if they could begin supporting data-rel, or confirm whether Google Search already supports it. The structured data testing tool they provide didn't parse data-rel when I tested it just now.
http://www.google.com/webmasters/tools/richsnippets
Hopefully, someone will follow up:
https://groups.google.com/a/googleproductforums.com/d/msg/webmasters/-/g0RDfpFwmqAJ
The string sitemap is not a registered keyword or absolute URL
Your problem is right here:
href="../sitemap.xml"
You are using a relative URL to indicate where your sitemap is. Try something like this:
<link rel="sitemap" type="application/xml" title="Sitemap" href="/myfolder/sitemap.xml" />
EDIT
Since robots crawl your root directory first, the best approach is indeed to keep your sitemap.xml file in your root directory:
<link rel="sitemap" type="application/xml" title="Sitemap" href="/sitemap.xml" />
or
<link rel="sitemap" type="application/xml" title="Sitemap" href="http://yoursite.com/sitemap.xml" /> <!-- No www -->
Also,
Make sure your link tag is a child of your head tag
Try this!
<link rel="alternate" type="application/xml" title="Site Map" href="http://yoursite.com/sitemap.xml" />
The rel attribute value alternate is also recognized for RSS and Atom feeds. I personally use it for all XML documents.
I'm interested in a parser that could take a malformed HTML page, and turn it into well formed HTML before performing some XPath queries on it. Do you know of any?
You should not use an XML parser to parse HTML. Use an HTML parser.
Note that the following is perfectly valid HTML (and an XML parser would choke on it):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Is this valid?</title>
</head>
<body>
<p>This is a paragraph
<table>
<tr> <td>cell 1 <td>cell 2
<tr> <td>cell 3 <td>cell 4
</table>
</body>
</html>
There are many task-specific (in addition to general-purpose) HTML parsers on CPAN. They have worked perfectly for me on an immense variety of extremely messy (and often invalid) HTML.
It would be possible to give specific recommendations if you can specify the problem you are trying to solve.
There is also HTML::TreeBuilder::XPath, which uses HTML::Parser to parse the document into a tree and then allows you to query it using XPath. I have never used it, but see Randal Schwartz's HTML Scraping with XPath.
Given the HTML file above, the following short script:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TreeBuilder::XPath;

# Parse the HTML file into a tree that can be queried with XPath.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("valid.html");

# Grab the text content of every table cell.
my @td = $tree->findnodes_as_strings('//td');
print $_, "\n" for @td;
outputs:
C:\Temp> z
cell 1
cell 2
cell 3
cell 4
The key point here is that the document was parsed by an HTML parser as an HTML document (despite the fact that we were able to query it using XPath).
Unless you're looking to learn more about reinventing wheels, use the HTML Tidy code.
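For example, a typical command-line invocation might look like this (a sketch; check tidy -help for the exact options your build supports):
# Turn tag soup into well-formed XHTML that an XML/XPath toolchain can consume.
tidy -asxhtml -numeric -o clean.html messy.html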
You could rephrase the question like this:
I'm interested in a tool that could take malformed C source, and turn it into well-formed C source before performing compilation and linking on it. Do you know of any?
Now the question may be a bit more obvious: it's not going to be easy. If it's truly malformed HTML, you may need to do the work by hand until it can be fed into an HTML parser. Then you can use any of the other modules presented here to do the work. It's unlikely, though, that you could ever programmatically translate raw HTML into strictly valid XHTML.