Sanitize <script> element contents - html

Say that I want to provide some data to my client (in the first response, with no latency) via a dynamic <script> element.
<script><%= payload %></script>
Say that payload is the string var data = '</script><script>alert("Muahahaha!")';</script>. An end tag (</script>) will allow users to inject arbitrary scripts into my page. How do I properly sanitize the contents of my script element?
I figure I could change </script> to <\/script> and <!-- to <\!--. Are there any other dangerous strings I need to escape? Is there a better way to provide this "cold start" data?

Edited for non-mutation of data.
If I'm interpreting this correctly, you want to prevent the user from ending the script tag prematurely from within the user-submitted string. In HTML that can be done just as you stated, by adding a backslash to the end tag: <\/script>. That is the only escaping you should have to worry about in that case. You shouldn't need to escape HTML comments, as the browser will interpret them as part of the JavaScript. If some older browsers do not default script tags to text/javascript correctly (language="javascript" is deprecated), adding type='text/javascript' may be necessary.
Based on Mike Samuel's answer here, I may have been wrong about not needing to escape HTML comments. However, I was not able to reproduce it in Chrome or Chromium.

Assuming that you're doing this:
Payload is set to
var data = '[this is user controlled data]';
and the rest of the code (assignment, quotes and semi-colon) is generated by your application, then the encoding you want is hex entity encoding.
See the OWASP XSS Prevention Cheat Sheet, Rule #3 for more information. This will convert
</script><script>alert("Muahahaha!")
into
var data = '\x3c\x2fscript\x3e\x3cscript\x3ealert\x28\x22Muahahaha\x21\x22\x29';
Try this and you will see this has the advantage of storing the user set string exactly correct, no matter what characters it contains. Additionally it takes care of single and double quote encoding. As a super bonus, it is also suitable for storing in HTML attributes:
<a onclick="alert('[user data]');" />
which normally would have to be HTML encoded again for correct display (because &amp; inside an HTML attribute is interpreted as &). However, hex entity encoding does not include any HTML characters with special meaning, so you get two for the price of one.
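As a rough sketch of the encoding described above: the function name hexEscapeForJsString is made up for illustration, and in practice you would normally reach for a maintained encoder rather than rolling your own. It replaces every character that is not an ASCII letter or digit with a \xHH or \uHHHH escape.

```javascript
// Hypothetical helper, not part of any library: escapes a string so it can
// be placed between quotes in a JavaScript string literal.
function hexEscapeForJsString(value) {
  let out = "";
  for (let i = 0; i < value.length; i++) {
    const ch = value[i];
    const code = value.charCodeAt(i); // UTF-16 code unit
    if (/[A-Za-z0-9]/.test(ch)) {
      out += ch; // ASCII letters and digits are left alone
    } else if (code < 0x100) {
      out += "\\x" + code.toString(16).padStart(2, "0");
    } else {
      out += "\\u" + code.toString(16).padStart(4, "0");
    }
  }
  return out;
}

console.log(hexEscapeForJsString('</script><script>alert("Muahahaha!")'));
// \x3c\x2fscript\x3e\x3cscript\x3ealert\x28\x22Muahahaha\x21\x22\x29
```

The output contains no <, >, &, quotes or slashes, which is why the same text can sit in a script block or in an event-handler attribute.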
Update from comments
The OP indicated that the server-side code would be generated in the form
var data = <%= JSON.stringify(data) %>;
The above still applies. It is up to the JSON class to properly hex entity encode values as they're inserted into the JSON. This cannot easily be done outside of the class, as you'd effectively have to parse the JSON again to determine the current language context. I wouldn't recommend going for the simple option of escaping the forward slash in the </script>, because there are other sequences that can end the grammar context, such as CDATA closing tags. Escape properly and your code will be future-proof and secure.
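If the JSON serializer cannot do this for you, one workaround is to post-process its output. This is only a sketch, and jsonForScript is a made-up name. It relies on the fact that in JSON.stringify output the characters <, >, & and the U+2028/U+2029 separators can only appear inside string values, so rewriting them as \uXXXX escapes does not change the parsed data. It assumes the input is JSON-serializable.

```javascript
// Hypothetical helper: serialize data so the result can be written
// inside a <script> element without ending it early.
function jsonForScript(data) {
  return JSON.stringify(data)
    .replace(/</g, "\\u003c")
    .replace(/>/g, "\\u003e")
    .replace(/&/g, "\\u0026")
    .replace(/\u2028/g, "\\u2028") // line separators, an issue in older engines
    .replace(/\u2029/g, "\\u2029");
}

const payload = jsonForScript({ foo: "</script><!--" });
console.log(payload); // {"foo":"\u003c/script\u003e\u003c!--"}
console.log(JSON.parse(payload).foo); // </script><!--
```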

Related

why sql server add extra backslash in url syntax when i store data using json_encode()? [duplicate]

The reason for this "escapes" me.
JSON escapes the forward slash, so a hash {a: "a/b/c"} is serialized as {"a":"a\/b\/c"} instead of {"a":"a/b/c"}.
Why?
JSON doesn't require you to do that, it allows you to do that. It also allows you to use "\u0061" for "a", but it's not required, like Harold L points out:
The JSON spec says you CAN escape forward slash, but you don't have to.
Harold L answered Oct 16 '09 at 21:59
Allowing \/ helps when embedding JSON in a <script> tag, which doesn't allow </ inside strings, like Seb points out:
This is because HTML does not allow a string inside a <script> tag to contain </, so in case that substring's there, you should escape every forward slash.
Seb answered Oct 16 '09 at 22:00 (#1580667)
Some of Microsoft's ASP.NET Ajax/JSON APIs use this loophole to add extra information, e.g., a datetime will be sent as "\/Date(milliseconds)\/". (Yuck)
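A quick way to convince yourself that the escape is optional is to parse both forms and compare. This is just an illustration using the built-in JSON.parse:

```javascript
// "\/" and "/" decode to the same character, so both documents are equal.
const escaped = JSON.parse('{"a":"a\\/b\\/c"}');
const plain = JSON.parse('{"a":"a/b/c"}');
console.log(escaped.a === plain.a); // true

// A \uXXXX escape is likewise optional: \u0061 is the letter "a".
console.log(JSON.parse('"\\u0061"')); // a

// Escapes that JSON does not define, such as \z, are rejected.
let rejected = false;
try {
  JSON.parse('"\\z"');
} catch (e) {
  rejected = e instanceof SyntaxError;
}
console.log(rejected); // true
```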
I asked the same question some time ago and had to answer it myself. Here's what I came up with:
It seems, my first thought [that it comes from its JavaScript
roots] was correct.
'\/' === '/' in JavaScript, and JSON is valid JavaScript. However,
why are the other ignored escapes (like \z) not allowed in JSON?
The key for this was reading
http://www.cs.tut.fi/~jkorpela/www/revsol.html, followed by
http://www.w3.org/TR/html4/appendix/notes.html#h-B.3.2. The feature of
the slash escape allows JSON to be embedded in HTML (as SGML) and XML.
PHP escapes forward slashes by default which is probably why this appears so commonly. I suspect it's because embedding the string "</script>" inside a <script> tag is considered unsafe.
Example:
<script>
var searchData = <?= json_encode(['searchTerm' => $_GET['search'], ...]) ?>;
// Do something else with the data...
</script>
Based on this code, an attacker could append this to the page's URL:
?search=</script> <some attack code here>
Which, if PHP's protection was not in place, would produce the following HTML:
<script>
var searchData = {"searchTerm":"</script> <some attack code here>"};
...
</script>
Even though the closing script tag is inside a string, it will cause many (most?) browsers to exit the script tag and interpret the items following as valid HTML.
With PHP's protection in place, it will appear instead like this, which will NOT break out of the script tag:
<script>
var searchData = {"searchTerm":"<\/script> <some attack code here>"};
...
</script>
This functionality can be disabled by passing in the JSON_UNESCAPED_SLASHES flag but most developers will not use this since the original result is already valid JSON.
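For comparison, JavaScript's JSON.stringify does not escape slashes, so getting the same effect there takes an extra step. A minimal sketch, with stringifyWithEscapedSlashes being a name invented here; it only mimics the slash escaping, not the rest of json_encode's behaviour:

```javascript
// Hypothetical helper: JSON.stringify plus PHP-style "\/" escaping.
// A "/" in JSON text can only occur inside a string, so this stays valid JSON.
function stringifyWithEscapedSlashes(data) {
  return JSON.stringify(data).replace(/\//g, "\\/");
}

const encoded = stringifyWithEscapedSlashes({
  searchTerm: "</script> <some attack code here>",
});
console.log(encoded);
// {"searchTerm":"<\/script> <some attack code here>"}
```

Note that this only defuses </script>; sequences such as <!-- and <script are untouched by slash escaping.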
Yes, some JSON utility libraries do it for various good but mostly legacy reasons. But then they should also offer something like a setEscapeForwardSlashAlways method to switch this behaviour off.
In Java, org.codehaus.jettison.json.JSONObject does offer a method called
setEscapeForwardSlashAlways(boolean escapeForwardSlashAlways)
to switch this default behaviour off.

Why do I need XSS library while I can use Html-encode?

I'm trying to understand why I need an XSS library when I can merely HtmlEncode data when sending it from server to client.
For example, here on Stackoverflow.com, in the editor, all the SO team needs to do is save the user input and display it HTML-encoded.
That way there is never going to be an HTML tag that gets executed.
I'm probably wrong here, but can you please contradict my statement, or explain?
For example:
I know that an IMG tag, for example, can have onmouseover or onload attributes through which a user can run malicious scripts, but the IMG won't even run in the browser as an IMG, since it's &lt;img&gt; and not <img>.
So - where is the problem ?
HTML-encoding is itself one feature an “XSS library” might provide. This can be useful when the platform doesn't have a native HTML encoder (eg scriptlet-based JSP) or the native HTML encoder is inadequate (eg not escaping quotes for use in attributes, or ]]> if you're using XHTML, or #{} if you're worried about cross-origin-stylesheet-inclusion attacks).
There might also be other encoders for other situations, for example injecting into JavaScript strings in a <script> block or URL parameters in an href attribute, which are not provided directly by the platform/templating language.
Another useful feature an XSS library could provide might be HTML sanitisation, for when you want to allow the user to input data in HTML format, but restrict which tags and attributes they use to a safe whitelist.
Another less-useful feature an XSS library could provide might be automated scanning and filtering of input for HTML-special characters. Maybe this is the kind of feature you are objecting to? Certainly trying to handle HTML-injection (an output stage issue) at the input stage is a misguided approach that security tools should not be encouraging.
HTML encoding is only one aspect of making your output safe against XSS.
For example, if you output a string to JavaScript using this code:
<script>
var enteredName = '<%=EnteredNameVariableFromServer %>';
</script>
You will be wanting to hex entity encode the variable for proper insertion in JavaScript, not HTML encode. Suppose the value of EnteredNameVariableFromServer is O'leary, then the rendered code when properly encoded will become:
<script>
var enteredName = 'O\x27leary';
</script>
In this case this prevents the ' character from breaking out of the string and into the JavaScript code context, and also ensures proper treatment of the variable (HTML encoding it would result in the literal value O&#39;leary being used in JavaScript, affecting processing and display of the value).
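To see the difference, here is a small sketch that evaluates both encodings the way a script block would. valueSeenByScript is a hypothetical helper written for this illustration; it uses new Function only to simulate the JavaScript parser reading the rendered literal.

```javascript
// Hypothetical helper: evaluates the given text as the body of a
// single-quoted JavaScript string literal, as a <script> block would.
function valueSeenByScript(literalBody) {
  return new Function("return '" + literalBody + "';")();
}

const hexEscaped = "O\\x27leary";  // text emitted by a JavaScript encoder
const htmlEncoded = "O&#39;leary"; // text emitted by a typical HTML encoder

console.log(valueSeenByScript(hexEscaped));  // O'leary
console.log(valueSeenByScript(htmlEncoded)); // O&#39;leary
```

The HTML-encoded form is safe from breaking out of the string, but the script ends up holding the entity text rather than the name the user typed.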
Side note:
Also, that's not quite true of Stack Overflow. Certain characters still have special meanings like in the <!-- language: lang-none --> tag. See this post on syntax highlighting if you're interested.

Label text ignoring html tags

<label for="abc" id="xyz">http://abc.com/player.js</xref>?xyz="foo" </label>
is ignoring the </xref> tag in the browser. So, the displayed output is
http://abc.com/player.js?xyz="foo"
but I want the browser to display
http://abc.com/player.js</xref>?xyz="foo"
Please help me achieve this.
It isn't being ignored. It is being treated as an end tag (for a non-HTML element that has no start tag). Use &lt; if you want a < character to appear as data instead of as "start of tag".
That said, this is a URL and raw <, > and " characters shouldn't appear in URIs anyway. So encode it as http://abc.com/player.js%3C/xref%3E?xyz=%22foo%22
You should do it like this
"http://abc.com/player.js%3C/xref%3E?xyz=foo"
The URL has to be encoded properly to be a valid URL.
Use encodeURI to encode it:
var ValidURL = encodeURI("http://abc.com/player.js</xref>?xyz=foo");
See this answer on encodeURI for better knowledge.
I misunderstood the question, I thought the URI was to be used elsewhere within JavaScript. But the question pretty clearly states that the URI is to just be rendered as text.
If the text being displayed is being passed in from a server, then your best bet is to encode it before printing it on the page (or if you're using a template engine, then you can most likely just encode it on the template). Pretty much any web framework/templating engine should have this functionality.
However, if it is just static HTML, just manually encode the characters. If you don't know the codes off the top of your head, you can use an online converter to help, such as:
HTML Encode/Decode:
http://htmlentities.net/
Old Answer:
Try encoding the URI using the JavaScript function encodeURI before using it:
encodeURI('http://abc.com/player.js</xref>?xyz="foo"');
You can also decode it using decodeURI if need be:
decodeURI(yourEncodedURI);
So ultimately I don't think you'll be able to get the browser to display the </xref> tag as is, but you will be able to preserve it (using encodeURI/decodeURI) and use it in your code, if this is what you need.
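For reference, this is what the round trip looks like with the URL from the question, using only the built-in encodeURI and decodeURI:

```javascript
const original = 'http://abc.com/player.js</xref>?xyz="foo"';

// encodeURI percent-encodes <, > and " but leaves URL syntax (:, /, ?, =) alone.
const encodedUri = encodeURI(original);
console.log(encodedUri);
// http://abc.com/player.js%3C/xref%3E?xyz=%22foo%22

// decodeURI restores the original text.
console.log(decodeURI(encodedUri) === original); // true
```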
Fiddle:
http://jsfiddle.net/rk8nR/3/
More info:
When are you supposed to use escape instead of encodeURI / encodeURIComponent?

What other characters beside ampersand (&) should be encoded in HTML href/src attributes?

Is the ampersand the only character that should be encoded in an HTML attribute?
It's well known that this won't pass validation:
Because the ampersand should be &amp;. Here's a direct link to the validation fail.
This guy lists a bunch of characters that should be encoded, but he's wrong. If you encode the first "/" in http:// the href won't work.
In ASP.NET, is there a helper method already built to handle this? Stuff like Server.UrlEncode and HtmlEncode obviously don't work - those are for different purposes.
I can build my own simple extension method (like .ToAttributeView()) which does a simple string replace.
Other than standard URI encoding of the values, & is the only character related to HTML entities that you have to worry about simply because this is the character that begins every HTML entity. Take for example the following URL:
http://query.com/?q=foo&lt=bar&gt=baz
Even though there aren't trailing semi-colons, since &lt; is the entity for < and &gt; is the entity for >, some old browsers would translate this URL to:
http://query.com/?q=foo<=bar>=baz
So you need to specify & as &amp; to prevent this from occurring for links within an HTML parsed document.
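As a rough sketch of that step, assuming you are building the markup as a string yourself (escapeForHtmlAttribute and the link text are made up for this example; most frameworks ship an equivalent encoder):

```javascript
// Hypothetical helper: encode a value for use inside a quoted HTML attribute.
// The ampersand is replaced first so later replacements are not double-encoded.
function escapeForHtmlAttribute(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const url = "http://query.com/?q=foo&lt=bar&gt=baz";
const link = '<a href="' + escapeForHtmlAttribute(url) + '">search</a>';
console.log(link);
// <a href="http://query.com/?q=foo&amp;lt=bar&amp;gt=baz">search</a>
```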
The purpose of escaping characters is so that they won't be processed as arguments. So you actually don't want to encode the entire url, just the values you are passing via the querystring. For example:
http://example.com/?parameter1=<ENCODED VALUE>&parameter2=<ENCODED VALUE>
The url you showed is actually a perfectly valid url that will pass validation. However, the browser will interpret the & symbols as a break between parameters in the querystring. So your querystring:
?q=whatever&lang=en
Will actually be translated by the recipient as two parameters:
q = "whatever"
lang = "en"
For your url to work you just need to ensure that your values are being encoded:
?q=<ENCODED VALUE>&lang=<ENCODED VALUE>
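In JavaScript, for instance, that could look roughly like the following. buildQueryString is a name invented for this sketch; encodeURIComponent is the built-in that does the per-value encoding.

```javascript
// Hypothetical helper: encode each name and value separately, then join
// the pairs with "&" so only the separators remain unencoded.
function buildQueryString(params) {
  return "?" + Object.entries(params)
    .map(([name, value]) =>
      encodeURIComponent(name) + "=" + encodeURIComponent(value))
    .join("&");
}

console.log(buildQueryString({ q: "fish & chips", lang: "en" }));
// ?q=fish%20%26%20chips&lang=en
```

The & inside the value becomes %26, so the recipient still sees exactly two parameters.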
Edit: The common problems page from the W3C you linked to is talking about edge cases when urls are rendered in html and the & is followed by text that could be interpreted as an entity reference (&copy for example). Here is a test in jsfiddle showing the url:
http://jsfiddle.net/YjPHA/1/
In Chrome and Firefox the links work correctly, but IE renders &copy as ©, breaking the link. I have to admit I've never had a problem with this in the wild (it would only affect those entity references which don't require a semicolon, which is a pretty small subset).
To ensure you're safe from this bug, you can HTML encode any URLs you render to the page and you should be fine. If you're using ASP.NET, the HttpUtility.HtmlEncode method should work just fine.
You do not need HTML escaping here:
According to the HTML5 spec:
http://www.w3.org/TR/html5/tokenization.html#character-reference-in-attribute-value-state
&lang= should be parsed as non-recognized character reference and value of the attribute should be used as it is: http://domain.com/search?q=whatever&lang=en
For the reference: added question to HTML5 WG: http://lists.w3.org/Archives/Public/public-html/2011Sep/0163.html
In HTML attribute values, if you want ", & and a non-breaking space as a result, you should (as an author who is clear about intent) have &quot;, &amp; and &nbsp; in the markup.
For " though, you don't have to use &quot; if you use single quotes to encase your attribute values.
For HTML text nodes, in addition to the above, if you want < and > as a result, you should use &lt; and &gt;. (I'd even use these in attribute values too.)
For hfnames and hfvalues (and directory names in the path) for URIs, I'd use JavaScript's encodeURIComponent() (on a utf-8 page when encoding for use on a utf-8 page).
If I understand the question correctly, I believe this is what you want.

Embedding JSON objects in script tags

EDIT: For future reference, I'm using non-xhtml content type definition <!html>
I'm creating a website using Django, and I'm trying to embed arbitrary json data in my pages to be used by client-side javascript code.
Let's say my json object is {"foo": "</script>"}. If I embed this directly,
<script type='text/javascript'>JSON={"foo": "</script>"};</script>
The first </script> ends the script element in the middle of the JSON object. (Also, it will make the site vulnerable to XSS, since this JSON object will be dynamically generated.)
If I use django's HTML escape function, the resulting output is:
<script type='text/javascript'>JSON={&quot;foo&quot;: &quot;&lt;/script&gt;&quot;};</script>
and the browser cannot interpret the <script> tag.
The question I have here is,
Which characters am I supposed to escape / not escape in this situation?
Is there an automated way to perform this in Python / Django?
If you are using XHTML, you would be able to use entity references (&lt;, &gt;, &amp;) to escape any string you want within <script>. You would not want to use a <![CDATA[...]]> section, because the sequence "]]>" can't be expressed within a CDATA section, and you would have to change the script to express ]]>.
But you're probably not using XHTML. If you're using regular HTML, the <script> tag acts somewhat like a CDATA section in XML, except that it has even more pitfalls. It ends with </script>. There are also arcane rules to allow <!-- document.write("<script>...</script>") --> (the comments and <script> opening tag must both be present for </script> to be passed through). The compromise that the HTML5 editors adopted for future browsers is described in HTML 5 tokenization and CDATA Escapes
I think the takeaway is that you must prevent </script> from occurring in your JSON, and to be safe you should also avoid <script>, <!--, and --> to prevent runaway comments or script tags. I think it's easiest just to replace < with \u003c and --> with --\>
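One way to sanity-check whatever escaping you settle on is to scan the final text for those sequences before writing it into the page. The following is only a sketch with made-up names (SCRIPT_BREAKERS, isSafeInsideScript). It escapes > as \u003e rather than \> so that the result stays valid JSON, and it is not a substitute for the escaping itself.

```javascript
// Sequences that can end or confuse a <script> element in HTML parsing.
const SCRIPT_BREAKERS = /<\/script|<script|<!--|-->/i;

// Hypothetical check: true when none of the sequences occur in the text.
function isSafeInsideScript(text) {
  return !SCRIPT_BREAKERS.test(text);
}

const rawJson = JSON.stringify({ foo: "</script>" });
const escapedJson = rawJson
  .replace(/</g, "\\u003c")
  .replace(/>/g, "\\u003e");

console.log(isSafeInsideScript(rawJson));     // false
console.log(isSafeInsideScript(escapedJson)); // true
console.log(JSON.parse(escapedJson).foo);     // </script>
```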
I tried backslash escaping the forward slash and that seems to work:
<script type='text/javascript'>JSON={"foo": "<\/script>"};</script>
have you tried that?
On a side note, I am surprised that the embedded </script> tag in a string breaks the javascript. Couldn't believe it at first but tested in Chrome and Firefox.
I would do something like this:
<script type='text/javascript'>JSON={"foo": "</" + "script>"};</script>
For this case in Python, I have opened a bug in the bug tracker. However, the rules are indeed complicated, as <!-- and <script> play together in quite evil ways even in the adopted HTML5 parsing rules. BTW, "\>" is not a valid JSON escape, so it would be better to replace it with "\u003E". The absolutely safe approach should therefore be to escape < and > as \u003C and \u003E, AND a couple of other evil characters mentioned in the Python bug...