Safely process user content in Django template language [duplicate] - html

Is there a generic "form sanitizer" that I can use to ensure all HTML/scripting is stripped from a submitted form? form.clean() doesn't seem to do any of that: HTML tags are all still in cleaned_data. Or is doing this manually (by overriding the form's clean() method) my only option?

strip_tags actually removes the tags from the input, which may not be what you want.
To convert a string to a "safe string" with angle brackets, ampersands and quotes converted to the corresponding HTML entities, you can use the escape filter:
from django.utils.html import escape
message = escape(form.cleaned_data['message'])
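To illustrate what escaping does to a submission, here is a minimal sketch. It uses Python's standard-library html.escape instead of Django's helper, purely so that it runs without Django installed; that substitution and the sample message are assumptions for illustration.

```python
import html

# Hypothetical user submission containing markup and a script.
submitted = '<b>Hi</b> & <script>alert("x")</script>'

# Convert &, <, > and both quote characters to HTML entities.
# The tags are kept as visible text instead of being removed.
safe_message = html.escape(submitted, quote=True)
print(safe_message)
```

Nothing is lost here: a browser rendering safe_message shows the original characters as text rather than interpreting them as markup.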

Django comes with a template filter called striptags, which you can use in a template:
value|striptags
It uses the function strip_tags which lives in django.utils.html. You can utilize it also to clean your form data:
from django.utils.html import strip_tags
message = strip_tags(form.cleaned_data['message'])
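For comparison with escaping, the following is a rough sketch of what tag stripping means, written with the standard-library HTMLParser. It is not Django's implementation; the class and function names (TagStripper, strip_tags_sketch) are hypothetical, and a sketch like this should not be relied on as a security boundary.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text content and drops every tag (illustrative only)."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_tags_sketch(value: str) -> str:
    stripper = TagStripper()
    stripper.feed(value)
    stripper.close()
    return "".join(stripper.parts)


print(strip_tags_sketch("<b>Hello</b> <i>world</i>"))
```

Unlike escaping, this discards the markup, so the user's original input cannot be recovered from the result.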

Alternatively, there is a Python library called bleach:
Bleach is a whitelist-based HTML sanitization and text linkification library. It is designed to take untrusted user input with some HTML.
Because Bleach uses html5lib to parse document fragments the same way browsers do, it is extremely resilient to unknown attacks, much more so than regular-expression-based sanitizers.
Example:
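import bleach
# ALLOWED_TAGS, ALLOWED_ATTRIBUTES and ALLOWED_STYLES are allowlists you define yourself.
# Note: the styles argument was removed in Bleach 5.0 in favour of css_sanitizer.
message = bleach.clean(form.cleaned_data['message'],
                       tags=ALLOWED_TAGS,
                       attributes=ALLOWED_ATTRIBUTES,
                       styles=ALLOWED_STYLES,
                       strip=False, strip_comments=True)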

Related

Are there native methods to include dynamic content in HTML?

You should never create HTML by concatenating strings, as in "<div>" + user_content + "</div>", because user_content may include HTML tags that inject scripts, leading to XSS attacks.
There are libraries that escape HTML in strings, and the DOM's JavaScript API allows safe content assignment with .textContent, but what about native HTML methods?
For instance, there could be an HTML element <text> that treats all of its inner content as text. A length attribute would tell the parser to skip the next length characters.
`<div><text length="${user_content.length}">${user_content}</text></div>`
For background, I'm writing an aws lambda function in javascript. It reads user data from dynamo and builds a simple web page. The lambda has no dependencies (except aws-sdk which is included in the lambda environment) and I'd like to keep it that way as a challenge. The page should not use javascript.
Can you write a JavaScript program that generates a safe HTML document from untrusted input, where the document itself does not use JavaScript and the program does not use HTML entity encoding?

Sanitize <script> element contents

Say that I want to provide some data to my client (in the first response, with no latency) via a dynamic <script> element.
<script><%= payload %></script>
Say that payload is the string var data = '</script><script>alert("Muahahaha!")';</script>. An end tag (</script>) will allow users to inject arbitrary scripts into my page. How do I properly sanitize the contents of my script element?
I figure I could change </script> to <\/script> and <!-- to <\!--. Are there any other dangerous strings I need to escape? Is there a better way to provide this "cold start" data?
Edited for non-mutation of data.
If I'm interpreting this correctly, you want to prevent the user from ending the script element prematurely from within the user-submitted string. That can be done just as you stated, by adding a backslash to the end tag: <\/script>. That is the only escaping you should have to worry about in that case. You shouldn't need to escape HTML comments, as the browser will interpret them as part of the JavaScript. If some older browsers do not treat script tags as text/javascript by default (for example, ones that expect the deprecated language="javascript"), adding type='text/javascript' may be necessary.
Based on Mike Samuel's answer here I may have been wrong about not needing to escape html comments. However I was not able to reproduce it in chrome or chromium.
Assuming that you're doing this:
Payload is set to
var data = '[this is user controlled data]';
and the rest of the code (assignment, quotes and semi-colon) is generated by your application, then the encoding you want is hex entity encoding.
See the OWASP XSS Prevention Cheat Sheet, Rule #3 for more information. This will convert
</script><script>alert("Muahahaha!")
into
var data = '\x3c\x2fscript\x3e\x3cscript\x3ealert\x28\x22Muahahaha\x21\x22\x29';
Try this and you will see that it has the advantage of preserving the user-supplied string exactly, no matter what characters it contains. It also takes care of single and double quote encoding. As a bonus, it is also suitable for use in HTML attributes:
<a onclick="alert('[user data]');" />
which normally would have to be HTML encoded again for correct display (because &amp; inside an HTML attribute is interpreted as &). However, hex entity encoding does not include any HTML characters with special meaning, so you get two for the price of one.
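A minimal sketch of this encoding in Python, assuming the approach of escaping every character that is not an ASCII letter or digit. The function name js_hex_escape is hypothetical, and the handling of characters above U+00FF (\uHHHH, or \u{...} beyond the BMP, which needs an ES2015+ engine) is my own assumption rather than something the answer specifies.

```python
def js_hex_escape(value: str) -> str:
    """Encode a string for use inside a quoted JavaScript string literal."""
    out = []
    for ch in value:
        cp = ord(ch)
        if ch.isascii() and ch.isalnum():
            out.append(ch)                  # letters and digits pass through
        elif cp < 0x100:
            out.append(f"\\x{cp:02x}")      # \xHH for everything else below 256
        elif cp < 0x10000:
            out.append(f"\\u{cp:04x}")      # \uHHHH for the rest of the BMP
        else:
            out.append(f"\\u{{{cp:x}}}")    # \u{...} for astral code points
    return "".join(out)


user_data = '</script><script>alert("Muahahaha!")'
print("var data = '" + js_hex_escape(user_data) + "';")
```

The output contains no <, >, &, quotes or slashes, which is why the same value can sit in a script block or in an event-handler attribute.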
Update from comments
The OP indicated that the server-side code would be generated in the form
var data = <%= JSON.stringify(data) %>;
The above still applies. It is up to the JSON class to properly hex entity encode values as they're inserted into the JSON. This cannot easily be done outside of the class, as you'd effectively have to parse the JSON again to determine the current language context. I wouldn't recommend the simple option of escaping the forward slash in </script>, because there are other sequences that can end the grammar context, such as CDATA closing tags. Escape properly and your code will be future-proof and secure.
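One commonly used shortcut, sketched here in Python rather than the OP's server-side language, is to serialize to JSON and then replace the HTML-significant characters with \uXXXX escapes. This works without re-parsing because <, > and & can only occur inside string values in valid JSON. The function name json_for_script is hypothetical, and this is an illustration of the idea rather than a vetted encoder.

```python
import json

# Characters that could end or confuse the surrounding <script> context.
_SCRIPT_ESCAPES = str.maketrans({
    "<": "\\u003C",
    ">": "\\u003E",
    "&": "\\u0026",
})


def json_for_script(data) -> str:
    """Serialize data as JSON that is safe to embed in a <script> block."""
    # ensure_ascii=True (the default) already escapes U+2028 and U+2029.
    return json.dumps(data).translate(_SCRIPT_ESCAPES)


payload = {"data": '</script><script>alert("Muahahaha!")'}
print("var data = " + json_for_script(payload) + ";")
```

Because the replacements are ordinary JSON string escapes, the client still parses back exactly the value that was sent.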

Why do I need XSS library while I can use Html-encode?

I'm trying to understand why I need to use an XSS library when I can merely do HtmlEncode when sending data from server to client.
For example, here on Stackoverflow.com, in the editor, all the SO team needs to do is save the user input and display it with HTML encoding.
This way, there is never going to be an HTML tag that gets executed.
I'm probably wrong here, but can you please contradict my statement, or explain?
For example:
I know that an IMG tag, for example, can have onmouseover or onload attributes through which a user can run malicious scripts, but the IMG won't even run in the browser as an IMG, since it's &lt;img&gt; and not <img>
So, where is the problem?
HTML-encoding is itself one feature an “XSS library” might provide. This can be useful when the platform doesn't have a native HTML encoder (eg scriptlet-based JSP) or the native HTML encoder is inadequate (eg not escaping quotes for use in attributes, or ]]> if you're using XHTML, or #{} if you're worried about cross-origin-stylesheet-inclusion attacks).
There might also be other encoders for other situations, for example injecting into JavaScript strings in a <script> block or URL parameters in an href attribute, which are not provided directly by the platform/templating language.
Another useful feature an XSS library could provide might be HTML sanitisation, for when you want to allow the user to input data in HTML format, but restrict which tags and attributes they use to a safe whitelist.
Another less-useful feature an XSS library could provide might be automated scanning and filtering of input for HTML-special characters. Maybe this is the kind of feature you are objecting to? Certainly trying to handle HTML-injection (an output stage issue) at the input stage is a misguided approach that security tools should not be encouraging.
HTML encoding is only one aspect of making your output safe against XSS.
For example, if you output a string to JavaScript using this code:
<script>
var enteredName = '<%=EnteredNameVariableFromServer %>';
</script>
You will be wanting to hex entity encode the variable for proper insertion in JavaScript, not HTML encode. Suppose the value of EnteredNameVariableFromServer is O'leary, then the rendered code when properly encoded will become:
<script>
var enteredName = 'O\x27leary';
</script>
In this case this prevents the ' character from breaking out of the string and into the JavaScript code context, and also ensures proper treatment of the variable (HTML encoding it would result in the literal value O&#39;leary being used in JavaScript, affecting processing and display of the value).
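The difference between the two encoders can be seen side by side in a small Python sketch. It uses the standard-library html.escape as a stand-in for the platform's HTML encoder (an assumption; note that it emits &#x27; rather than &#39;) and an inline \xHH encoder that is only valid for code points below 256.

```python
import html

name = "O'leary"

# HTML encoding: entities are not decoded inside a <script> block, so
# JavaScript would receive the entity text itself as part of the value.
html_encoded = html.escape(name)

# JavaScript \xHH escaping: the literal parses back to the original value.
# (This one-liner only handles code points below 256.)
js_encoded = "".join(
    ch if ch.isascii() and ch.isalnum() else f"\\x{ord(ch):02x}"
    for ch in name
)

print(f"var enteredName = '{html_encoded}';  // wrong value in JavaScript")
print(f"var enteredName = '{js_encoded}';  // original value preserved")
```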
Side note:
Also, that's not quite true of Stack Overflow. Certain characters still have special meanings like in the <!-- language: lang-none --> tag. See this post on syntax highlighting if you're interested.

For setHTML() method, is it still safe If we do not use Safehtml but we validate the String & only accept some limited html tag (Gwt)?

Any widget that has a setHTML method could open a hole in the security system. But suppose we validate the String and only accept a limited set of HTML tags such as <b>, <i>..., and then pass this string to the setHTML method.
My question is: is it still safe if we do that?
For example, we check the String text to make sure it only contains a limited set of HTML tags: <b>, </b>, <i>, </i>... If the text contains other tags, we won't let users input that text. Then we use:
html1.setHTML(text); instead of html1.setHTML(SafeHtmlUtils.fromString(text))
I don't know why html1.setHTML(SafeHtmlUtils.fromString(text)) does not generate the formatted text; it just shows plain text when I run it in Eclipse. For example,
html1.setHTML(SafeHtmlUtils.fromString("<b>text</b>"))
will produce the plain text <b>text</b> instead of the bold text "text" with correct HTML formatting.
You want to sanitize the HTML, not escape it. The fromString method is meant to escape the string: if a user means to type a < b but forgets the space, then adds >c, you don't want the c to be bold and the b to be missing entirely. Escaping is done to actually render the string that is given, assuming it is text.
On the complete other end of the spectrum, you can use fromTrustedString which tells GWT that you absolutely trust the source of the data, and that you will allow it to do anything. This typically should not be done for any data that comes from the user.
Somewhere off to the side of all of this we have sanitization, the process where you take a string that is meant to be HTML and ensure it is safe, rather than either treating it like text or trusting it implicitly. This is hard to do well: any tag that has a style attribute could potentially attack you (this is why GWT has SafeStyle, like SafeHtml), any tag that has a uri, url or href could be used to attack (hence SafeUri), and any attribute that the browser treats as a callback, such as onclick, can be used to run JavaScript. The HtmlSanitizer type is meant to be able to do this.
There is a built-in implementation of this, as of at least GWT 2.4 - SimpleHtmlSanitizer. This class whitelists certain html tags, including your <b> and <i> tags, as well as a few others. Attributes are completely removed, as there are too many cases where they might not be safe. As the class name suggests, this is just a simple approach to this problem - a more complex and in-depth approach might be more true to the original code, but this also comes with the risk of allowing unsafe HTML content.
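The general idea of such a simple sanitizer (keep a few bare tags, escape everything else) can be sketched in Python as follows. This is not GWT's code; the allowlist, the regular expression and the function name sanitize_simple are assumptions made for illustration, and it makes no attempt to balance tags.

```python
import html
import re

# Hypothetical allowlist of formatting tags that are kept as markup.
ALLOWED_TAGS = {"b", "i", "em", "strong"}

# Matches only bare tags such as <b> or </b>; a tag carrying any
# attribute does not match and is therefore escaped as plain text.
BARE_TAG = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)>")


def sanitize_simple(text: str) -> str:
    out = []
    pos = 0
    for match in BARE_TAG.finditer(text):
        out.append(html.escape(text[pos:match.start()]))
        if match.group(1).lower() in ALLOWED_TAGS:
            out.append(match.group(0).lower())       # keep allowlisted tag
        else:
            out.append(html.escape(match.group(0)))  # show other tags as text
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)


print(sanitize_simple("<b>bold</b> <script>alert(1)</script>"))
```

Dropping support for attributes entirely is what keeps the sketch small; allowing even a few attributes safely is where real sanitizers become complicated.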

How do I html encode text inside Action Script 3?

I have an app that builds XML; the text node values come from users.
How would I HTML encode that input to avoid bad characters?
Preferably looking for a built in solution in Action Script.
I've used escape() and unescape() for POST variables in the past, not sure if it's the best solution for XML though.
This method seems promising: http://www.markledford.com/blog/2009/02/25/as3-htmldecode-htmlencode-xml-hack/
You can also try converting the string to a textnode as mentioned in How do you encode XML safely with ActionScript 3?
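The text-node idea is not specific to ActionScript. As an illustration only, here is the same approach in Python with the standard-library ElementTree: assign the user's string as the text of a node and let the serializer do the escaping. The element name and sample text are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical user-supplied value with XML-special characters.
user_text = "Tom & Jerry <script>"

# Assign the raw string as node text; no manual encoding is applied.
root = ET.Element("message")
root.text = user_text

# The serializer escapes &, < and > when writing the text node.
xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)
```

Parsing xml_string again yields the original text, so the escaping never has to be undone by hand.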