I'm trying to run some tests with XXE attacks in an HTML page, but I'm having trouble coming up with a working example. After looking around the internet for a long time, I came up with this:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
<script id="embeddedXML" type="text/xml">
<!DOCTYPE foo [
<!ELEMENT foo ANY>
<!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<foo>&xxe;</foo>
</script>
</head>
<body>
<script type="application/javascript">
alert(document.getElementById('embeddedXML').innerHTML);
</script>
</body>
</html>
But it doesn't work. The XML inside the script tag doesn't "run", per se, meaning that when the alert pops up, it just displays the XML as plain text. It doesn't interpret the DOCTYPE declaration and pull in the contents of the listed file.
It's been very hard to google around for this, because apparently XML doesn't "run", but something needs to happen where this text is interpreted instead of just written out. I don't know what that thing is, or how to get it working inside an HTML page as written here.
Any tips much appreciated. Thanks!
See OWASP's page on XML External Entity (XXE) Processing.
Among the risk factors it lists is:
The application parses XML documents.
Now, script elements are defined (in HTML 4 terms) as containing CDATA, so markup in them (except </script>) has no special meaning. So there is no XML parsing going on there.
Meanwhile alert() deals in strings, not in markup, so there's still no XML parsing going on.
Since you have no XML parser, there's no vulnerability.
In general, if you want XML parsing in the middle of a web page then you need to use JavaScript (e.g. with DOMParser), but I wouldn't be surprised if it were not DTD-aware and so not vulnerable (and even if it were vulnerable, it might well block access to local external entities).
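For illustration, here is a sketch of what actually parsing that embedded XML would look like. In the browsers I'm aware of, DOMParser does not fetch external entities, so &xxe; should stay unexpanded or trigger a parse error rather than leak /etc/passwd:
<script type="application/javascript">
// Pull the XML text out of the embedded script element and parse it for real.
var xmlText = document.getElementById('embeddedXML').textContent;
var doc = new DOMParser().parseFromString(xmlText, 'application/xml');
// If the parser choked on the DTD, the root is a parsererror element;
// otherwise <foo> should hold an unexpanded entity, not file contents.
alert(doc.documentElement.nodeName + ': ' + doc.documentElement.textContent);
</script>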
On a website, we are using an HTML <base> tag in the head. The reason is the convenience of linking to static resources through relative URLs, and it is very hard to change. Content URLs are always fully qualified. So the head section looks like:
<head>
<base href="http://example.com/static/" />
</head>
Now, we are using RDFa to specify structured data on the page, e.g. to populate a schema:Product page, say http://example.com/product1. The problem comes from the base tag: in the absence of any other correction, the RDFa parser considers that all the RDFa data is about http://example.com/static/, not about http://example.com/product1.
We have tried, with mixed results, adding the property about="http://example.com/product1" on either <html> or <body>.
This works intermittently with Google's Structured Data Testing Tool: about 2 months ago it seemed to work when added to <body>, while now it appears to work when added to <html>.
However, in the Search Console under "Structured Data", it is not even intermittently working. It used to work about 8 months ago with <html about="...">, but now it just doesn't work either way. I mean the pages are indexed, but the structured data is not.
So, is there a standard, tried and proven way to make Google (and any generic metadata parser) properly know the URL of a webpage that carries a generic <base href="" /> tag that is different from its actual URL?
Example 1
Assume the following is returned by HTTP GET http://bar.com/product1:
<html prefix="schema: http://schema.org/">
<head>
<base href="http://foo.com/" />
</head>
<body about="http://bar.com/product1" typeof="schema:Product">
<span property="schema:name">Bar product</span>
</body>
</html>
The above:
Was working with Google, based on Google Search Console / Structured Data ~8 months ago and the Google Structured Data Testing Tool ~2 months ago
Is not working with Google, based on Google Search Console / Structured Data, since ~8 months ago (no errors reported, but new content is not fetched into the structured data report), and is not parsing with the Testing Tool at the moment
Example 2
<html prefix="schema: http://schema.org/" about="http://bar.com/product1" typeof="schema:Product">
<head>
<base href="http://foo.com/" />
</head>
<body>
<span property="schema:name">Bar product</span>
</body>
</html>
Was not parsing with the Google Structured Data Testing Tool ~2 months ago
Is parsing with the Google Structured Data Testing Tool at the moment
Is not working with Google, based on Google Search Console / Structured Data, at the moment (no errors reported, but new content is not fetched into the structured data report)
Both of your example snippets seem to work correctly in Google’s Structured Data Testing Tool. As one would expect, they generate the same output.
#type Product
#id http://bar.com/product1
name Bar product
I can’t test it in Google’s Search Console, but I could imagine that the issue you see is not related to the RDFa markup.
Anyway, you could try to use resource instead of about. While both ways are fine in RDFa, RDFa Lite supports only resource. I’m not saying that Google only supports RDFa Lite (probably not, because their SDTT seems to support about fine), but when they refer to RDFa, they typically link to the RDFa Lite spec.
<html prefix="schema: http://schema.org/">
<head>
<base href="http://foo.com/" />
</head>
<body resource="http://bar.com/product1" typeof="schema:Product">
<span property="schema:name">Bar product</span>
</body>
</html>
What is the right way to insert an HTML snippet into the main HTML file with HTML5 imports?
The answer to the more generic question https://stackoverflow.com/a/22142111/239247 mentions that it is possible to do:
<head>
<link rel="import" href="header.html">
</head>
But this doesn't work on its own. I don't need to import JS and CSS, only plain HTML markup inserted at the top of <body>. What is the simplest way to do this and keep the HTML readable?
The way I have found to do this is to use ASP.NET and .cshtml files with Razor, as seen here:
http://weblogs.asp.net/scottgu/asp-net-mvc-3-layouts
Beyond simply inserting HTML into other HTML, this also allows you to have consistent navigation bars, footers, etc., and reduces duplication. Also, the use of a layout file gives the site a better feel, as only a section of the page refreshes when you click an internal link, as opposed to the whole site.
Found a way to do this from html5rocks, but it is far from readable. This is the ideal way:
<body>
<include src="header.html"/>
</body>
And this is how it is implemented by HTML5 imports:
<head>
<link rel="import" href="header.html">
</head>
<body>
...
<script>
// Grab the imported document, clone its <body>, and append the clone
// to the end of this page's <body>.
document.body.appendChild(
  document.querySelector('link[rel="import"]')
    .import
    .querySelector('body')
    .cloneNode(true));
</script>
</body>
Notes:
it is not clear how to choose which import to use if both header.html and footer.html are imported
querySelector('body') is required to avoid Uncaught HierarchyRequestError: Failed to execute 'appendChild' on 'Node': Nodes of type '#document' may not be inserted inside nodes of type 'BODY'.
it is not clear how to insert the contents of the <body> tag without the tag itself
HTML5 imports don't work in Firefox (38), see http://caniuse.com/#feat=imports (a fetch-based workaround is sketched below) =/
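Given the Firefox gap, fetching the fragment yourself is one workaround. A minimal sketch, assuming header.html is served from the same origin and contains plain body markup (the data-include attribute name is just an invented convention):
<body>
<div data-include="header.html"></div>
<script>
// Replace each placeholder element with the markup of the fetched fragment.
document.querySelectorAll('[data-include]').forEach(function (el) {
  fetch(el.getAttribute('data-include'))
    .then(function (response) { return response.text(); })
    .then(function (html) { el.outerHTML = html; });
});
</script>
</body>
This stays close to the ideal <include src="..."/> form and sidesteps the cloned <body> problem from the notes above.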
See: http://www.w3schools.com/angular/angular_includes.asp
It says:
HTML Includes in Future HTML. Including a portion of HTML in HTML is, unfortunately, not (yet) supported by HTML.
So this is on its way, but not here yet.
EDIT: If you are able to, I would use PHP, which is close to that level of cleanliness. The link I included shows multiple ways to do what you are trying to do.
Late edit: if it still counts (for those worried about sourcing): http://caniuse.com/#feat=imports
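For completeness, the PHP version of the ideal <include> above is a one-liner (assuming header.html sits next to the page and the server runs PHP):
<body>
<?php include 'header.html'; ?>
...
</body>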
I want to parse some HTML documents, but it seems that Racket's html and xml libraries can't handle this very well. For example, here's an HTML document:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>Test</title>
<script>
var k = "<scr";
</script>
</head>
<body>
</body>
</html>
Neither read-html nor read-xml can parse this. They think the <scr in var k = "<scr" is part of an opening tag.
So, is there a better way to do this?
Try the html-parsing package.
The html-parsing parsing behavior is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages actually contain syntax errors that would defeat a strict or validating parser. html-parsing’s handling of errors is intended to generally emulate popular Web browsers’ interpretation of the structure of erroneous HTML. We euphemistically term this kind of parse “pragmatic.”
Although I don't know for sure if it will handle <script> tags like this, it might. The author, Neil Van Dyke, is active on the Racket mailing list.
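For reference, a minimal sketch of what using it looks like; html->xexp is the package's entry point, and whether it keeps the <script> body verbatim is exactly what you'd want to verify:
#lang racket
(require html-parsing)

;; html->xexp accepts a string or an input port and returns an SXML-style
;; x-expression, recovering from markup that defeats read-html/read-xml.
(html->xexp
 "<html><head><script>var k = \"<scr\";</script></head><body></body></html>")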
I have a client/friend who is preparing an email to send through agencyaccess. They require an all-inclusive document, with the HTML and plain text versions of the email in one HTML document. I think I have a basic understanding, but am a bit confused. I generally use Mailchimp to handle my email marketing.
So we would use a regular html document with
<html>
<head>
<title>Our Email</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
our html markup
</body>
</html>
But then, is it somewhere below this that we declare an alternative MIME type for the plain text version, so that the email client chooses which to show? And would both of these have to be wrapped in a multipart/mixed MIME type?
I know this is probably pretty simple, but most of what I have read handles the MIME type declaration in the PHP file sending the mail, whereas we need to differentiate inside this document. Really just wondering how this should be structured.
So it was confusion for nothing. The service implied that the user was expected to upload a single document covering both (which would imply specifying MIME types inside the document), but this was not the case, as they required everything to be plain HTML markup. The service itself was supposed to offer the additional step to insert the plain text version, and it was a bug on their part that they are working on. Hope that makes sense, but thanks for the responses, guys.
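For anyone who does need a single raw message carrying both versions: the standard structure is multipart/alternative (multipart/mixed is for attachments), and it is declared in the message headers, not inside the HTML document. A minimal sketch, with an arbitrary boundary string:
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset=iso-8859-1

Our plain text version

--BOUNDARY
Content-Type: text/html; charset=iso-8859-1

<html>
<head><title>Our Email</title></head>
<body>our html markup</body>
</html>

--BOUNDARY--
Clients are expected to render the last part they support, so the HTML part goes last and plain text clients fall back to the first.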
I'm interested in a parser that could take a malformed HTML page and turn it into well-formed HTML before performing some XPath queries on it. Do you know of any?
You should not use an XML parser to parse HTML. Use an HTML parser.
Note that the following is perfectly valid HTML (and an XML parser would choke on it):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Is this valid?</title>
</head>
<body>
<p>This is a paragraph
<table>
<tr> <td>cell 1 <td>cell 2
<tr> <td>cell 3 <td>cell 4
</table>
</body>
</html>
There are many task-specific (in addition to the general-purpose) HTML parsers on CPAN. They have worked perfectly for me on an immense variety of extremely messy (and often invalid) HTML.
It would be possible to give specific recommendations if you can specify the problem you are trying to solve.
There is also HTML::TreeBuilder::XPath, which uses HTML::Parser to parse the document into a tree and then allows you to query it using XPath. I have never used it, but see Randal Schwartz's HTML Scraping with XPath.
Given the HTML file above, the following short script:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("valid.html");

# Collect the text of every table cell via an XPath query.
my @td = $tree->findnodes_as_strings('//td');
print $_, "\n" for @td;
outputs:
C:\Temp> z
cell 1
cell 2
cell 3
cell 4
The key point here is that the document was parsed by an HTML parser as an HTML document (despite the fact that we were able to query it using XPath).
Unless you're looking to learn more about wheels, use the HTML Tidy code.
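For example, something like this (both flags exist in the Tidy builds I have used; check tidy -help on yours) converts messy HTML into XHTML, which any XML/XPath tooling can then consume:
tidy -q -asxhtml messy.html > clean.xhtml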
You could rephrase the question like this:
I'm interested in a parser that could take malformed C source and turn it into well-formed C source before performing compilation and linking on it. Do you know of any?
Now the question may be a bit more obvious: it's not going to be easy. If it's truly malformed HTML, you may need to do the work by hand until it can be fed into an HTML parser. Then you can use any of the other modules presented here to do the work. It's unlikely, though, that you could ever programmatically translate raw HTML into strictly valid XHTML.