preventing xss hole in django - html

In the Django docs it says:
Django templates escape specific characters which are particularly
dangerous to HTML. While this protects users from most malicious
input, it is not entirely foolproof. For example, it will not protect
the following:
<style class={{ var }}>...</style>
If var is set to 'class1 onmouseover=javascript:func()', this can
result in unauthorized JavaScript execution, depending on how the
browser renders imperfect HTML.
How can I prevent this?

I'm not especially familiar with Django, but it looks to me like the error they intended to point out is that there are no quotes around the attribute value, meaning that the space in the example value causes the rest of the string (onmouseover=...) to be interpreted as a separate attribute. Instead, you should put quotes like so:
<style class="{{ var }}">...</style>
If I understand correctly, this would be safe since all the characters that could interfere with the quoting are escaped. You might want to verify that interpretation; for example, write <span title="{{ var }}">foo</span>, run the template with var set to <>"'&, and then make sure that they're properly escaped in the HTML and that the title appears in the browser with the original characters.
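The difference between the unquoted and quoted attribute can be checked outside a browser. The sketch below is only an approximation: it uses Python's standard-library html.escape as a stand-in for Django's autoescaping and html.parser as a stand-in for a browser's parser (both are assumptions; it does not run the Django template engine), and the helper names are made up for illustration.

```python
from html import escape
from html.parser import HTMLParser


class FirstTagAttrs(HTMLParser):
    """Record the attributes of the first start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if self.attrs is None:
            self.attrs = dict(attrs)


def parse_attrs(markup):
    """Hypothetical helper: parse markup and return the first tag's attributes."""
    parser = FirstTagAttrs()
    parser.feed(markup)
    parser.close()
    return parser.attrs


payload = "class1 onmouseover=javascript:func()"
escaped = escape(payload, quote=True)  # nothing to escape: no < > & ' " in it

# Unquoted attribute: the space ends the value, so a second attribute appears.
unquoted = parse_attrs("<span class=" + escaped + ">foo</span>")

# Quoted attribute: the whole payload stays inside the class value.
quoted = parse_attrs('<span class="' + escaped + '">foo</span>')

# The check suggested above: special characters survive a round trip.
tricky = "<>\"'&"
round_tripped = parse_attrs('<span title="' + escape(tricky, quote=True) + '">foo</span>')

print(unquoted)
print(quoted)
print(round_tripped)
```

In the unquoted case the parser reports a separate onmouseover attribute; in the quoted case there is only class.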

One thing you can do is not allow variable classes. You can use something like
<style class={% if class_foo %}foo{% elif class_bar %}bar{% else %}baz{% endif %}>...</style>
There are also filters available to prevent xss elsewhere: https://docs.djangoproject.com/en/dev/ref/templates/builtins/#std:templatefilter-escape

Related

Why does the browser automatically unescape html tag attribute values?

Below I have an HTML tag, and use JavaScript to extract the value of the widget attribute. This code will alert <test> instead of &lt;test&gt;, so the browser automatically unescapes attribute values:
alert(document.getElementById("hau").attributes[1].value)
<div id="hau" widget="&lt;test&gt;"></div>
My questions are:
Can this behavior be prevented in any way, besides doing a double escape of the attribute contents? (It would look like this: &amp;lt;test&amp;gt;)
Does anyone know why the browser behaves like this? Is there any place in the HTML specs that this behavior is mentioned explicitly?
1) It can be done without doing a double escape
Looks like yours is closer to htmlEncode().
If you don't mind using jQuery
alert(htmlEncode($('#hau').attr('widget')))
function htmlEncode(value){
//create an in-memory div, set its inner text (which jQuery automatically encodes)
//then grab the encoded contents back out. The div never exists on the page.
return $('<div/>').text(value).html();
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="hau" widget="&lt;test&gt;"></div>
If you're interested in a pure vanilla js solution
alert(htmlEncode(document.getElementById("hau").attributes[1].value))
function htmlEncode( html ) {
return document.createElement( 'a' ).appendChild(
document.createTextNode( html ) ).parentNode.innerHTML;
};
<div id="hau" widget="&lt;test&gt;"></div>
2) Why does the browser behave like this?
This behaviour is what lets us do certain things, such as including quotes inside a pre-filled input field as shown below. That would not be possible if the only way to insert " were to type it literally, which in turn would require escaping it with another character such as \
<input type='text' value="&quot;You &apos;should&apos; see the double quotes here&quot;" />
The browser unescapes the attribute value as soon as it parses the document (mentioned here). One of the reasons might be that it would otherwise be impossible to include, for example, double quotes in your attribute value (well, technically it would if you put the value in single quotes instead, but then you wouldn't be able to include single quotes in the value).
That said, the behavior cannot be prevented, although if you really need the value with the HTML entities in it, you can simply turn the special characters back into their codes (I recommend Underscore's escape for such a task).
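The same decode-once behaviour can be reproduced outside the browser. This is a small sketch using Python's standard-library html module rather than the DOM (so it illustrates the principle, not the browser's implementation); the variable names are invented for the example.

```python
from html import escape, unescape

# What is written in the HTML source for the attribute value.
source_value = "&lt;test&gt;"

# What a parser hands back after decoding character references once.
parsed_value = unescape(source_value)

# Re-escaping the parsed value recovers the entity form.
re_escaped = escape(parsed_value)

# Double escaping in the source is the other way to get entities out.
double_escaped = escape(source_value)

print(parsed_value)
print(re_escaped)
print(double_escaped)
```

Re-escaping after reading the value is usually simpler than double escaping in the markup, because the markup then stays readable to other consumers.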

XSS without HTML tags

Is it possible to do an XSS attack if my input does not allow < and > characters?
Example: I enter <script>alert('this');</script> text
But if I delete < and >, the script is just text:
I enter script alert('this'); script text
Yes, it could still be possible.
e.g. Say your site injects user input into the following location
<img src="http://example.com/img.jpg" alt="USER-INPUT" />
If USER-INPUT is " ONLOAD="alert('xss'), this will render
<img src="http://example.com/img.jpg" alt="" ONLOAD="alert('xss')" />
No angle brackets necessary.
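To see this without a browser, here is a rough sketch that renders the img tag from the example and inspects the attributes that Python's standard-library html.parser finds. The parser is only a stand-in for a browser, and img_attrs and AttrCollector are hypothetical helper names.

```python
from html import escape
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect every attribute of every tag into one dict."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        self.attrs.update(attrs)


def img_attrs(alt_text):
    """Hypothetical helper: render the img tag and return its parsed attributes."""
    markup = '<img src="http://example.com/img.jpg" alt="' + alt_text + '" />'
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attrs


user_input = "\" ONLOAD=\"alert('xss')"

# Removing angle brackets changes nothing here: the input contains none.
stripped = user_input.replace("<", "").replace(">", "")
unsafe = img_attrs(stripped)

# Escaping quotes keeps the whole input inside the alt attribute.
safe = img_attrs(escape(user_input, quote=True))

print(unsafe)
print(safe)
```

The parser lowercases attribute names, so the injected handler shows up under the key onload.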
Also, check out OWASP XSS Experimental Minimal Encoding Rules.
For HTML body:
HTML Entity encode < &
specify charset in metatag to avoid UTF7 XSS
For XHTML body:
HTML Entity encode < & >
limit input to charset http://www.w3.org/TR/2008/REC-xml-20081126/#charsets
So within the body you can get away with only encoding (or removing) a subset of the characters usually recommended to prevent XSS. However, you cannot do this within attributes - the full XSS (Cross Site Scripting) Prevention Cheat Sheet recommends the following, and they do not have a minimal alternative:
Except for alphanumeric characters, escape all characters with the HTML Entity &#xHH; format, including spaces. (HH = Hex Value)
This is mainly to cover the three ways of specifying the attribute value:
Unquoted
Single quoted
Double quoted
Encoding in such a way will prevent XSS in attribute values in all three cases.
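A minimal sketch of that rule, assuming ASCII letters and digits are the only characters left untouched (characters outside ASCII are also encoded here, which is a choice made for this example and not something the quoted rule spells out). The function name is hypothetical.

```python
from html import unescape


def encode_for_html_attribute(value):
    """Encode everything except ASCII alphanumerics as &#xHH; references."""
    out = []
    for ch in value:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append("&#x{:X};".format(ord(ch)))
    return "".join(out)


payload = "x onmouseover=alert(1)"
encoded = encode_for_html_attribute(payload)

print(encoded)
print(unescape(encoded) == payload)
```

Because spaces, quotes and equals signs are all encoded, the result cannot end an unquoted, single-quoted or double-quoted value.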
Also be wary that UTF-7 attacks do not need angle bracket characters. However, unless the charset is explicitly set to UTF-7, this type of attack isn't possible in modern browsers.
+ADw-script+AD4-alert(document.location)+ADw-/script+AD4-
Also beware of attributes that allow URLs like href and ensure any user input is a valid web URL. Using a reputable library to validate the URL is highly recommended using an allow-list approach (e.g. if protocol not HTTPS then reject). Attempting to block sequences like javascript: is not sufficient.
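As an illustration of the allow-list idea only, and not a replacement for a vetted validation library, a check might look like the sketch below. ALLOWED_SCHEMES and is_allowed_url are names invented for this example, and requiring a host is an extra assumption.

```python
from urllib.parse import urlsplit

# Allow-list: anything not named here is rejected.
ALLOWED_SCHEMES = {"https"}


def is_allowed_url(candidate):
    """Accept only absolute URLs whose scheme is on the allow-list."""
    try:
        parts = urlsplit(candidate.strip())
    except ValueError:
        # Malformed input (for example a broken IPv6 literal) is rejected.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)


for url in ("https://example.com/page", "javascript:alert(1)",
            "JaVaScRiPt:alert(1)", "http://example.com/", "/relative/path"):
    print(url, is_allowed_url(url))
```

Checking for what is allowed sidesteps the many spellings of javascript: that a block-list would have to anticipate.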
If the user-supplied input is printed inside an HTML attribute, you also need to escape quotation marks or you would be vulnerable to inputs like this:
" onload="javascript-code" foobar="
You should also escape the ampersand character as it generally needs to be encoded inside HTML documents and might otherwise destroy your layout.
So you should take care of the following characters: < > & ' "
You should however not completely strip them but replace them with the correct HTML codes i.e. &lt; &gt; &amp; &quot; &#39;

Passing style parameters in query string

I have a simple html page with a div element in it.
The innerHTML property of the div is set through query String.
In query string I pass html strings,i.e.
<p style='font-size:20px;color:green;'> Sun rises in the east </p> etc...
I get the appropriate output.
However, if I pass color code in style attribute say, #00990a, I am not displayed any content.
Can someone help me through this?
If there's a color code that contains a #, everything after it will be treated as the fragment identifier. To avoid this you have to URL-encode your parameter value (replacing # with %23 and doing the same with other characters that have a special meaning, such as # & % = ?).
Finally your url should look like this:
PageUrl?Content=%3Cp+style%3D%27color%3A%23009900%27%3EContent%3C%2Fp%3E
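The effect can be reproduced with Python's standard-library urllib.parse, shown here only as an illustration of how the URL is split; http://example.com/page is a placeholder standing in for PageUrl.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

fragment_html = "<p style='color:#009900'>Content</p>"

# Unencoded: the # cuts the query string short and starts the fragment.
raw_parts = urlsplit("http://example.com/page?Content=" + fragment_html)

# Encoded: the whole value stays in the query string.
encoded_query = urlencode({"Content": fragment_html})
encoded_parts = urlsplit("http://example.com/page?" + encoded_query)
decoded = parse_qs(encoded_parts.query)["Content"][0]

print(raw_parts.query)
print(raw_parts.fragment)
print(encoded_query)
print(decoded)
```

In the unencoded case the server never sees anything after the #, which is why no content is displayed.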
Since you haven't shown us any code, I shall guess…
In a URI, # indicates the start of the fragment identifier (as ? indicates the start of the query string). Your colour is terminating the query string and starting the fragment identifier. You need to URL encode any character that has special meaning in URLs. (# is %23).
Do make sure that you sanitise the passed HTML and CSS on the server though. It is very easy to expose yourself to XSS attacks otherwise.

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's html content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything encapsuled in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that i want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the html. It's supposed to extract all text. Additionally, I need to preserve the original html so the page retains all its links and scripts - i only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (and I know about HTML Agility Pack and XPath, but don't feel like).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here's some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url(&#39;link&#39;);"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires that every tag is properly closed.
If you are using PHP, for simplicity, I strongly recommend you to use DOM (Document Object Model) to parse/extract HTML documents. DOM library usually exists in every programming language.
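For comparison, here is a parser-based sketch of the text extraction. It is written in Python with the standard-library html.parser, not in PHP or VB.Net (a substitution made for this example), and it assumes that "visible text" means everything outside script, style and comments. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside script or style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; they go to handle_comment.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data)


def extract_text(markup):
    parser = VisibleTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.chunks


sample = ('<p title="a > b">Hello <b>world</b></p>'
          '<script>var s = "<b>no</b>";</script><!-- hidden -->')
text = "".join(extract_text(sample))
print(text)
```

Note the unescaped > inside the title attribute, which a pattern like <[\s\S]*?> would treat as the end of the tag.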
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
Dim plainText As String = match.Groups("text").Value
If plainText IsNot Nothing AndAlso plainText <> "" Then
MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
Else
MatchEvalFunction = match.Value
End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
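For readers not using VB.Net, the same match-evaluator idea can be sketched in Python with re.sub and a callback. This is a port written for illustration, with the same limitations as the original pattern; the function names and the sample markup are made up.

```python
import re

# Same pattern as above: markup alternatives first, leftover text last.
PATTERN = re.compile(
    r"(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>))"
    r"|(?P<text>[^<>]*)",
    re.IGNORECASE,
)


def replace_in_text(match):
    """Rewrite only matches that captured plain text; leave markup as is."""
    plain = match.group("text")
    if plain:
        return plain.replace("Original word", "Replacement word")
    return match.group(0)


def rewrite(source):
    return PATTERN.sub(replace_in_text, source)


source = ('<p class="Original word">Original word</p>'
          '<script>var s = "Original word";</script>')
new_html = rewrite(source)
print(new_html)
```

Only the text between the p tags changes; the attribute value and the script body are returned untouched.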
For Your Information,
Instead of regex, with jQuery it's possible to extract the text alone from HTML markup. For that you can use the following pattern.
$("<div/>").html($("#elementId").html()).text()
You can refer to this JSFiddle.

escaping html inside comment tags

Escaping HTML is fine - it will remove <'s and >'s etc.
I've run into a problem where I am outputting a filename inside a comment tag, e.g. <!-- ${filename} -->
Of course things can be bad if you don't escape, so it becomes:
<!-- <c:out value="${filename}"/> -->
The problem is that if the file has "--" in the name, all the HTML gets screwed up, since you're not allowed to have <!-- -- -->.
The standard HTML escape doesn't escape these dashes, and I was wondering if anyone is familiar with a simple / standard way to escape them.
Definition of a HTML comment:
A comment declaration starts with <!, followed by zero or more comments, followed by >. A comment starts and ends with "--", and does not contain any occurrence of "--".
Of course the parsing of a comment is up to the browser.
Nothing strikes me as an obvious solution here, so I'd suggest you str_replace those double dashes out.
There is no good way to solve this. You can't just escape them because comments are read in plaintext. You will have to do something like put a space between the hyphens, or use some sort of code for hyphens (like [HYPHEN]).
Since it is obvious that you cannot directly display the '--'s, you can either encode them or use the fn:escapeXml or fn:replace functions for appropriate replacements.
JSTL documentation
There's no universally working way to escape those characters in HTML unless the - characters come in multiples of four: -- won't work in Firefox, but ---- will. So it all depends on the browser. For example, in Internet Explorer 8 it is not a problem; those characters are handled properly. The same goes for Google Chrome. However, Firefox, even the latest version (3.0.4), doesn't handle these characters well.
You shouldn't be trying to HTML-escape, the contents of comments are not escapable and it's fine to have a bare ‘>’ or ‘&’ inside.
‘--’ is its own, unrelated problem and is not really fixable. If you don't need to recover the exact string, just do a replacement to get rid of them (eg. replace with ‘__’).
If you do need to get a string through completely unmolested to a JavaScript that will be reading the contents of the comment, use a string literal:
<!-- 'my-string' -->
which the script can then read using eval(commentnode.data). (Yes, a valid use for eval() at last!)
Then your escaping problem becomes how to put things in JS string literals, which is fairly easily solvable by escaping the ‘'’ and ‘-’ characters:
<!-- 'Bob\x27s\x2D\x2Dstring' -->
(You should probably also escape ‘<’, ‘&’ and ‘"’, in case you ever want to use the same escaping scheme to put a JS string literal inside a <script> block or inline handler.)
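The escaping scheme described above could be generated server-side along these lines. This is a sketch in Python, not JSTL (a substitution for illustration); js_string_literal is a hypothetical name, and escaping the backslash and line breaks is an addition that the answer does not mention but that a string literal needs.

```python
# Characters that must not appear literally, mapped to JS escape sequences.
_ESCAPES = {
    "\\": "\\\\",
    "'": "\\x27",
    "-": "\\x2D",
    "<": "\\x3C",
    "&": "\\x26",
    '"': "\\x22",
    "\n": "\\n",
    "\r": "\\r",
}


def js_string_literal(value):
    """Return value as a single-quoted JS string literal safe inside a comment."""
    return "'" + "".join(_ESCAPES.get(ch, ch) for ch in value) + "'"


def html_comment(value):
    return "<!-- " + js_string_literal(value) + " -->"


comment = html_comment("Bob's--string")
print(comment)
```

Since every hyphen is written as \x2D, the literal can never contain "--", whatever the filename is.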