URL hash format, what's allowed and what's not? - html

I'm using hash-based navigation in my rich web app. I also found I needed to create permalinks that would point to single instances of resources, but since I cannot cause the page o refresh, and the main page is loaded from a single path '/', I cannot use real URLs. Instead I thought about using hashes. Let me give you an example because I know the explanation above sucks.
So, instead of having http://example.com/path/to/resource/1, I would have http://example.com/#path/to/resource/1
This seems to work ok, and browser believes '#path/to/resource/1' is a hash (slashes permitted, I think) but I was wondering about what characters are allowed in URL hash. Is there a specification or a RFC that I could read to find out what the standard behavior of browsers is when it comes to hashes?
EDIT: Ok, so silly me. Didn't actually check if slashes worked in all browsers. Chrome obviously doesn't like them. Only works in FF.

Look at: http://www.w3.org/Addressing/rfc1630.txt or http://www.w3.org/Addressing/URL/4_2_Fragments.html
Basically you can use anything that can be encoded in an URL.
Note: There might be browser inconsistencies. If you fear them, you might use a serialization mechanism, like converting the string to hex or something (will be twice longer though), or use an id of some sort.

This document should help. Slashes are allowed, but the lexical analysis might differ between browsers.

I think you might find that useful: RFC3986
If you use PHP to generate your page paths you could also urlencode() which generates you a valide URL.

Related

Hide the #hash fragment from the url while keeping the same behavior?

For a typical tag like this:
One
Is there a way to keep the same behavior of jumping to the corresponding id, but without displaying the fragment identifier in the url?
I know I can just write some js to handle all of this, but it seems like there outta be an easy way to do this, not that I could find any though.
Thanks!
The fragment identifier is part of the URL. You can't really navigate to a section in the document represented by the fragment identifier without having the fragment identifier in the destination URL.
So the best you can do is to prevent the browser from navigating to the hash, and then faking the jump. And that — at least the former — requires JS.
As mentioned in the comments, due to the nature of URL fragments you should avoid doing this unless you have a very specific reason to do so (e.g. if the hashes are an implementation detail and not intended to be exposed to or bookmarked by the user).

Should the percent symbol (%) always be HTML-escaped?

I know the percent symbol has to be URL-encoded when being passed around, but when I display it in the browser, is it also necessary to escape it like so: %?
In URLs, the percent sign (%) has a special meaning, so it should be escaped. In HTML, it does not, so it is not necessary to escape it.
I agree with the chosen answer, but would like to qualify the statement “it is not necessary to escape it.”
If you have a need (or desire) to escape a percentage sign in HTML code, (and there are good reasons to do this with any potentially ambiguous character or symbol) then I would highly recommend using the percentage entity code % as opposed to any numeric code. (those I use when there is no entity name you could use)
That was the answer I was looking for when I found this page, because I forgot it looses the final "e".
We should probably all be using at least the entities kindly listed here. (whoever Webmasterish is; thank you)
Reasoning: Numeric codes (and particularly byte codes from unencoded characters) change with code–pages, on systems using different default languages, and / or different operating systems. (Windows and Mac using slightly different code sets for “English” being the classic, which still plagues plain–text eMail sent between Apple Mail and Outlook) This is slowing down, and should stop with UTF, but I'm still seeing it pop up.
If you're converting HTML to some other mark–up, (note, I used "–" not a "-", or even "−" for the same reason) such as LaTeX, DVI, PostScript or even MarkDown, then it's useful to completely squash any ambiguity… And those processes tend to happen on the information you least expect to be used in such a way when you initially write it. So just get used to doing it everywhere and be grateful to your former self for having had the foresight to do so. Probably years down the line, when you're looking to update formulae to be more readable by utilising MathJax or such, and keep picking up hyphenated words. <swearmarks>
I'd like to add this - if you use javascript in href, you are in troubles too. Check this example:
http://jsfiddle.net/cs4MZ/
One of the workarounds might be using onclick instead of href.
If you're talking about in HTML text, visible to the reader, no. It can't do anything harmful, there.
...if you're talking about inside of HTML attributes, then yes, that would be good to consider.
URLs and HTML are different languages, as weird as that might seem, so they have different weaknesses.

Rails - Escaping HTML using the h() AND excluding specific tags

I was wondering, and was as of yet, unable to find any answers online, how to accomplish the following.
Let's say I have a string that contains the following:
my_string = "Hello, I am a string."
(in the preview window I see that this is actually formatting in BOLD and ITALIC instead of showing the "strong" and "i" tags)
Now, I would like to make this secure, using the html_escape() (or h()) method/function.
So I'd like to prevent users from inserting any javascript and/or stylesheets, however, I do still want to have the word "Hello" shown in bold, and the word "string" shown in italic.
As far as I can see, the h() method does not take any additional arguments, other than the piece of text itself.
Is there a way to escape only certain html tags, instead of all? Like either White or Black listing tags?
Example of what this might look like, of what I'm trying to say would be:
h(my_string, :except => [:strong, :i]) # => so basically, escape everything, but leave "strong" and "i" tags alone, do not escape these.
Is there any method or way I could accomplish this?
Thanks in advance!
Excluding specific tags is actually pretty hard problem. Especially the script tag can be inserted in very many different ways - detecting them all is very tricky.
If at all possible, don't implement this yourself.
Use the white list plugin or a modified version of it . It's superp!
You can have a look Sanitize as well(Seems better, never tried it though).
Have you considered using RedCloth or BlueCloth instead of actually allowing HTML? These methods provide quite a bit of formatting options and manage parsing for you.
Edit 1: I found this message when browsing around for how to remove HTML using RedCloth, might be of some use. Also, this page shows you how version 2.0.5 allows you to remove HTML. Can't seem to find any newer information, but a forum post found a vulnerability. Hopefully it has been fixed since that was from 2006, but I can't seem to find a RedCloth manual or documentation...
I would second Sanitize for removing HTML tags. It works really well. It removes everything by default and you can specify a whitelist for tags you want to allow.
Preventing XSS attacks is serious business, follow hrnt's and consider that there is probably an order of magnitude more exploits than that possible due to obscure browser quirks. Although html_escape will lock things down pretty tightly, I think it's a mistake to use anything homegrown for this type of thing. You simply need more eyeballs and peer review for any kind of robustness guarantee.
I'm the in the process of evaluating sanitize vs XssTerminate at the moment. I prefer the xss_terminate approach for it's robustness—scrubbing at the model level will be quite reliable in a regular Rails app where all user input goes through ActiveRecord, but Nokogiri and specifically Loofah seem to be a little more peformant, more actively maintained, and definitely more flexible and Ruby-ish.
Update I've just implemented a fork of ActsAsTextiled called ActsAsSanitiled that uses Santize (which has recently been updated to use nokogiri by the way) to guarantee safety and well-formedness of the RedCloth output, all without needing any helpers in your templates.

Ultimate Website Testing String

I've been grappling with the fraught area of escaping user (text) input for web pages. The ultimate goal is to have user input displayed and stored exactly as typed in, without breaking anything.
To that end I have been using the following test string :
'"_$%^&*()+=-£{}[]/n/<>\#~;|,.?#:!&``"'
It seems to work well (even Stack Overflow or Twitter is not immune, hence the back ticks). My question is, will this string capture most escaping problems, for example going from a web page via Ajax and to a database and back again?
In fact how do I display this string in Stack Overflow without the back ticks?
Is there a better one, e.g. say one that will highlight encoding problems too?
When I'm testing, I'm using something like this
a’b<’>",!"/%$?$&?%(()%/"!"/&?%$/"&$/"?%&?-f¯Ñ112üêù
This is generally sufficient to highlight encoding issues, at least from what I can see.
Including a mathematical symbol such as unicode x2202 might be useful too.
That seems like it should be all of them. The smartest thing to do would be to (depending on the language you're using) use a library that has been well tested, that can sanitize user input. Just ask around what other websites use.
See here: http://gendoh.com/2511063
The post itself is written in Korean, but you could see what makes difference between several given patterns. (V1 to V3 are for generic web apps while V4 and V5 is for javascripts.)

Removing Javascript from HREFs

We want to allow "normal" href links to other webpages, but we don't want to allow anyone to sneak in client-side scripting.
Is searching for "javascript:" within the HREF and onclick/onmouseover/etc. events good enough? Or are there other things to check?
It sounds like you're allowing users to submit content with markup. As such, I would recommend taking a look at a few articles about preventing cross-site scripting which would cover a bit more than simply preventing javascript from being inserted into an HREF tag. Below is one I found that might be useful:
http://weblogs.java.net/blog/gmurray71/archive/2006/09/preventing_cros.html
You'll have to use a whitelist of allowed protocols to be completely safe. If you use a blacklist, sooner or later you'll miss something like "telnet://" or "shell:" or some exploitable browser-specific thing you've never heard of...
Nope, there's a lot more that you need to check.
First of the URL could be encoded (using HTML entities or URL encoding or a mixture of both).
Secondly you need to check for malformed HTML, which the browser might guess at and end up allowing some script in.
Thirdly you need to check for CSS based script, e.g. background: url(javascript:...) or width:expression(...)
There's probably more that I've missed - you need to be careful!
You have to be extremely careful when taking user input. You'll want to do a whitelist as mentioned, but not just with the href. Example:
<img src="nosuchimage.blahblah" onerror="alert('Haxored!!!');" />
or
click meh
one option would be to disallow html at all and use the same sort of formatting that some forums use. Just replace
[url="xxx"]yyy[/url]
with
yyy
That'll get you around the issues with mouse over etc. Then just make sure the link starts off with a white-listed protocol, and doesn't have a quote in it (" or some such that might be decrypted by php or the browser).
Sounds like you're looking for the companion function to PHP's strip_tags, which is strip_attributes. Unfortunately, it hasn't been written yet. (Hint, hint.)
There is, however, an interesting-looking suggestion in the strip_tags documentation, here:
http://www.php.net/manual/en/function.strip-tags.php#85718
In theory this will strip anything that isn't an href, class, or ID from submitted links; seems like you probably want to lock it down even further and just take hrefs.