We want to allow "normal" href links to other webpages, but we don't want to allow anyone to sneak in client-side scripting.
Is searching for "javascript:" within the HREF and onclick/onmouseover/etc. events good enough? Or are there other things to check?
It sounds like you're allowing users to submit content with markup. As such, I would recommend taking a look at a few articles about preventing cross-site scripting which would cover a bit more than simply preventing javascript from being inserted into an HREF tag. Below is one I found that might be useful:
http://weblogs.java.net/blog/gmurray71/archive/2006/09/preventing_cros.html
You'll have to use a whitelist of allowed protocols to be completely safe. If you use a blacklist, sooner or later you'll miss something like "telnet://" or "shell:" or some exploitable browser-specific thing you've never heard of...
Nope, there's a lot more that you need to check.
First of the URL could be encoded (using HTML entities or URL encoding or a mixture of both).
Secondly you need to check for malformed HTML, which the browser might guess at and end up allowing some script in.
Thirdly you need to check for CSS based script, e.g. background: url(javascript:...) or width:expression(...)
There's probably more that I've missed - you need to be careful!
You have to be extremely careful when taking user input. You'll want to do a whitelist as mentioned, but not just with the href. Example:
<img src="nosuchimage.blahblah" onerror="alert('Haxored!!!');" />
or
click meh
one option would be to disallow html at all and use the same sort of formatting that some forums use. Just replace
[url="xxx"]yyy[/url]
with
yyy
That'll get you around the issues with mouse over etc. Then just make sure the link starts off with a white-listed protocol, and doesn't have a quote in it (" or some such that might be decrypted by php or the browser).
Sounds like you're looking for the companion function to PHP's strip_tags, which is strip_attributes. Unfortunately, it hasn't been written yet. (Hint, hint.)
There is, however, an interesting-looking suggestion in the strip_tags documentation, here:
http://www.php.net/manual/en/function.strip-tags.php#85718
In theory this will strip anything that isn't an href, class, or ID from submitted links; seems like you probably want to lock it down even further and just take hrefs.
Related
I really dislike the non-semantic usage of <big> on our wiki, and would like to prevent it. Flat-out commands didn't work so far, so I'm switching to doing it by code...
AFAIK, there's no configuration switch to control the blacklist/whitelist of HTML tags. Looking at the source code, it seems like the data is coming from Sanitizer::getRecognizedTagData(), while the work itself is done in Sanitizer::removeHTMLtags(). However, I do not see a way to add to the list myself, except using one of the hooks before or after (InternalParseBeforeSanitize, InternalParseBeforeLinks) and either:
Call Sanitizer::removeHTMLtags() again myself, with the additional tag to blacklist as a parameter
Do a search myself on the text to remove all the <big> tags.
The first one is a duplication of work, the second one is a duplication of code. Is there a better way? What would you recommend?
No coding is needed: just install AbuseFilter and create a rule that warns or disallows on save of pages containing these tags.
So, the citeattribute is used with an URL as a value to indicate the source of a quote (for <q> and <blockquote>) or a page that can provide additional information (for <del> and <ins>).
Because this URL isn't shown in any way to the end user by the browser, the only reason to put it in the document would be for non-user, hence crawlers, bots and whatever. You could also use it with a script, but that's not in my intentions.
My question: is it, from your experience, worth it to bother with this attribute or should I pass over?
What if I linked to that URL with a, which is way more common: would your answer change?
Yes, use cite, the semantic web can be ours today. Cite becomes more important the more people use it. If everyone used the attribute, developers could make some pretty awesome stuff.
Using extra semantics such as these are also good for SEO. Even if Google does not currently look for it, it is a safe bet they eventually will.
This is an optional attribute which you can use if you feel it has value in your case. If you are concerned that your quote's legitimacy might be in question, and you have a source you took it from, then adding the cite attribute gives inquisitive people or bots an idea of where you got the information. If you don't really care that people know where you got the quote from, then don't bother.
You might also put it for your own reference so when you look at it a year from now you'll have that information at hand.
I was wondering if there was any way to dynamically obfuscate html on a live server but not offline, so soon as my website was visited the source would be obfuscated rather than in plain text.
Since the client (browser) will have to parse it into a sensible DOM tree, this is pretty much fruitless. These days it's a lot more common to inspect a site using Firebug/Webkit Inspector, which provides a nicely formatted, navigable tree. Most people won't even notice that the HTML is "obfuscated", much less be stopped by it.
Executable code can be obfuscated by minimizing variable names and such without changing the result. HTML is the result though, if you change anything about it, the result will change. So "obfuscation" would mostly be limited to creative use of spacing anyway.
The real question you should ask yourself is "why do I need to obfuscate HTML?". If you're hiding sensitive information, then you should be either encrypting that data, or never presenting it to the client.
Most sensitive information or transactions should take place on the server, and the client only receives a token, or encrypted information, or a unique transaction identifier that can be passed back and forth.
Let me put it this way: There's no way to dynamically obfuscate the HTML on your site such that any reasonably competent person couldn't get it anyway.
You could use JavaScript to attempt to obfuscate it, but you'd have to do it in a way that didn't actually affect the DOM.
You could generate the contents of the page itself with JavaScript, but that is likely to damage accessibility, and once again the DOM will have to be in a condition the browser can use.
You could insert massive amounts of whitespace into the source, but that is easily overcome as well.
All this, and you make it harder and more annoying to manage your site. Minification has its purpose, but obfuscation here is lose-lose.
Your could search for and remove all tabs, newlines, extra spaces, and comments
If you are using php, IonCube has a plugin. it can be found here: http://www.ioncube.com/html_encoder.php it turns your html page into minified javascript.
I keep seeing things like <div ... g_editable="true" ... >.
I've searched for anything to help me understand its purpose, but all I get back is more markup and nothing explaining it.
Can somebody please explain it?
The g_editable="true" attribute is used by Google in their Gmail email composition (and likely other Google products as well). Instead of using a textarea tag they use a source-less iframe and all of the email contents are highlighted, typed in, made bold, adding line breaks etc are all handled by javascript so that it resembles a textarea, but with a lot more functionality.
The g_editable="true" attribute itself is likely to limit the scope that different functions and events have so that you can't just boldify text in your contacts list and that sort of thing. It's probably why they have it in an iframe as well.
It may just be a custom attribute defined by Google. It could be a hook for their JavaScript and/or CSS.
For best practice though, they should have prefixed it with data-.
I was wondering, and was as of yet, unable to find any answers online, how to accomplish the following.
Let's say I have a string that contains the following:
my_string = "Hello, I am a string."
(in the preview window I see that this is actually formatting in BOLD and ITALIC instead of showing the "strong" and "i" tags)
Now, I would like to make this secure, using the html_escape() (or h()) method/function.
So I'd like to prevent users from inserting any javascript and/or stylesheets, however, I do still want to have the word "Hello" shown in bold, and the word "string" shown in italic.
As far as I can see, the h() method does not take any additional arguments, other than the piece of text itself.
Is there a way to escape only certain html tags, instead of all? Like either White or Black listing tags?
Example of what this might look like, of what I'm trying to say would be:
h(my_string, :except => [:strong, :i]) # => so basically, escape everything, but leave "strong" and "i" tags alone, do not escape these.
Is there any method or way I could accomplish this?
Thanks in advance!
Excluding specific tags is actually pretty hard problem. Especially the script tag can be inserted in very many different ways - detecting them all is very tricky.
If at all possible, don't implement this yourself.
Use the white list plugin or a modified version of it . It's superp!
You can have a look Sanitize as well(Seems better, never tried it though).
Have you considered using RedCloth or BlueCloth instead of actually allowing HTML? These methods provide quite a bit of formatting options and manage parsing for you.
Edit 1: I found this message when browsing around for how to remove HTML using RedCloth, might be of some use. Also, this page shows you how version 2.0.5 allows you to remove HTML. Can't seem to find any newer information, but a forum post found a vulnerability. Hopefully it has been fixed since that was from 2006, but I can't seem to find a RedCloth manual or documentation...
I would second Sanitize for removing HTML tags. It works really well. It removes everything by default and you can specify a whitelist for tags you want to allow.
Preventing XSS attacks is serious business, follow hrnt's and consider that there is probably an order of magnitude more exploits than that possible due to obscure browser quirks. Although html_escape will lock things down pretty tightly, I think it's a mistake to use anything homegrown for this type of thing. You simply need more eyeballs and peer review for any kind of robustness guarantee.
I'm the in the process of evaluating sanitize vs XssTerminate at the moment. I prefer the xss_terminate approach for it's robustness—scrubbing at the model level will be quite reliable in a regular Rails app where all user input goes through ActiveRecord, but Nokogiri and specifically Loofah seem to be a little more peformant, more actively maintained, and definitely more flexible and Ruby-ish.
Update I've just implemented a fork of ActsAsTextiled called ActsAsSanitiled that uses Santize (which has recently been updated to use nokogiri by the way) to guarantee safety and well-formedness of the RedCloth output, all without needing any helpers in your templates.