Is escaping image URLs OK? - html

I've been (not so) accidentally escaping my image URLs for a while now, and have never seen any issues in any browsers.
e.g., changing this:
<img src="http://example.com/image.jpg" />
into this:
<img src="http&#58;&#47;&#47;example.com&#47;image.jpg" />
But I have been wondering if there could be issues with this. I've left it in place so far due to XSS concerns about where the image URLs come from, but I could also validate them with a regex and not escape them.
It does make pages slightly bigger currently.
Has anyone experienced issues with escaping image URLs..?

The "escaping" you're doing there is purely on the level of the XML/HTML, and by the time the document has been read and understood by the XML/HTML parser the escaping is gone -- long before any "URL-ness" even comes into play.
So, no, there shouldn't be any issues with that, but probably also not many benefits either :)
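If you do want a belt-and-braces approach for untrusted URLs, escaping the attribute value at output time is the usual move, and it covers the characters that actually matter, like & and quotes. A minimal PHP sketch (the query string here is an invented example):
<?php
// Escape the attribute value at output time; the browser's HTML parser
// decodes the entities again, so the request still goes to the original URL.
$url = 'http://example.com/image.jpg?a=1&b=2'; // invented example with an &
echo '<img src="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '" />';
// <img src="http://example.com/image.jpg?a=1&amp;b=2" />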

Related

JSoup - Quotations inside attributes

I'm using JSoup in an attempt to build valid XML from a couple of websites. Most of the time it has worked phenomenally well, but recently I've encountered some cases of bad HTML that JSoup can't seem to fix.
<meta name="saploTags" content="Tag1,Tag2,Tag3," Tag4,Tag5,Tag6"/>
Results in
<meta name="saploTags" content="Tag1,Tag2,Tag3," tag4,tag5,tag6"="" />
This causes problems later on when I'm trying to index the resulting XML. Does anyone have any suggestions on what to do? Preferably I'd have everything between the leftmost and rightmost quotation marks escaped or removed in some way in order to prevent data loss (like content="Tag1,Tag2,Tag3,Tag4,Tag5,Tag6"). Otherwise it would be OK if JSoup cut off after its first "end quote", disregarding the last tags, like content="Tag1,Tag2,Tag3".
(A similar problem I've found is e.g. <img src=".." alt="This text contains the quote "The quote" and here's some more text"/>, which causes the same kind of trouble.)
Is it possible to get around this with jsoup, or have I reached a dead end?
/Regards, Magnus
That's quite simply neither valid XML nor HTML. Those double quotes should be turned into character references if they're to be considered part of the attribute value. Even if you could set a parser to be very lenient, it's not gonna be able to solve this, because it is no longer clear where the attribute content ends.
Trying to automatically fix this seems rather difficult. There's all sorts of corner cases that'll wreak havoc on any sort of solution. How's this supposed to be interpreted, for example:
<element attribute="this isn't "quite" the=correct way="to=" do things"" />
Look at how the SO code formatter struggles with it.
Even making sense of this yourself is difficult, let alone writing a tool that's gonna make sense of what is or isn't attribute content.
Simple approach? Just don't accept invalid HTML. It's lenient enough as it is, with most parsers allowing lower case and upper case element names, closing tags not always being mandatory etc. If people still manage to generate invalid HTML, then too bad for them.
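For comparison, here is what the producing side should have emitted: the embedded quote turned into a character reference, so the attribute boundary stays unambiguous. A tiny PHP sketch with made-up content:
<?php
// The embedded quote becomes &quot;, so the end of the attribute value
// stays unambiguous. (Made-up content, just to show the shape.)
$content = 'Tags with a " character inside';
printf('<meta name="saploTags" content="%s"/>',
       htmlspecialchars($content, ENT_QUOTES, 'UTF-8'));
// <meta name="saploTags" content="Tags with a &quot; character inside"/>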

Is this URL invalid and not good practice?

I have a URL in this format:
http://www.example.com/manchester united
Note the space between manchester and united. Is this bad practice, or is it perfectly fine? I just wanted to check before I proceed, thanks.
The space is not a valid character in URIs; you have to replace it with %20. It may also be considered bad practice. Replacing the space with -, + or _ is preferable; it is both “prettier” and doesn't require escaping of the URI.
Most browsers will still try to parse URIs with a space; but that's highly ambiguous.
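If you really must keep a value with spaces, percent-encode it yourself instead of leaving it to the browser. A minimal PHP sketch:
<?php
// Percent-encode the path segment instead of emitting a raw space.
$segment = 'manchester united';
echo 'http://www.example.com/' . rawurlencode($segment);
// http://www.example.com/manchester%20united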
It's bad practice not only because browsers are required to turn the space into a %20 and thus obfuscate your users' address bars, but because it would be difficult to communicate the url to anyone.
Furthermore, what about all of those "find links in text" regexes that are around stack overflow? You effectively break them all!
It will be replaced in the address bar as http://www.example.com/manchester%20united, which I personally think is far uglier than the alternative http://www.example.com/manchester_united.
I believe spaces in URLs are replaced with %20 by many browsers.
You will need to use %20 instead of the space; the browser will do it for you, but I would rather not have any spaces in the URI.
Technically this will work. The browser will replace the space with a %20, and the server will translate it back.
But ... it's not generally a good idea because it can lead to ambiguity, or difficulty in communicating the URL to others, particularly in an advertising setting where you're expecting someone to type in a URL they've seen in print.
Maybe a question for: https://webmasters.stackexchange.com/
But...
If you enter that into a browser, it will add %20 between manchester and united. Technically you should do this in your HTML page, but most modern browsers can handle it. Common practice is to split the words with a hyphen, i.e. http://www.example.com/manchester-united.
Look at the URL of this question for an example of this in action.
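A minimal sketch of that hyphen convention (slugify here is a hypothetical helper, not a built-in):
<?php
// Hypothetical slug helper: lower-case, collapse runs of non-alphanumerics
// into a single hyphen, trim stray hyphens.
function slugify($text) {
    $text = strtolower($text);
    $text = preg_replace('/[^a-z0-9]+/', '-', $text);
    return trim($text, '-');
}

echo 'http://www.example.com/' . slugify('Manchester United');
// http://www.example.com/manchester-united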
Can do that, but apparently it's bad style.
See the following: http://www.blooberry.com/indexdot/html/topics/urlencoding.htm

Value of whitespace in html source code

I'm using a function to generate all output in php. Using that function I can control whether to display the code like this:
<html><header></header><body><p>Hello World!</p></body></html>
or like this
<html>
<header>
</header>
<body>
<p>Hello World!</p>
</body>
</html>
including the indentation and all.
Is there a particular value to displaying the code indented and spaced (besides the seemingly slower loading time)? I usually don't need to view the source code, since I can simply access the PHP file. During development I would most likely prefer whitespace, but in production would it necessarily be advantageous?
Thanks!
I'd space it out if you have the option; there's nothing wrong with whitespace that makes something readable, and with gzip the download difference isn't all that major anyway. You never know when you'll have to debug a style; it'll save you time later by having it pretty now, trust me.
All whitespace is condensed to a single space, rather than nothing, so there is a slight difference. For example:
<img src="image.jpg"><img src="image2.jpg">
Will produce slightly different results to this:
<img src="image.jpg">
<img src="image2.jpg">
So at a minimum, use a single space/newline between tags. Personally I prefer using spacing on live sites because it aids live debugging, and when using gzip the difference between space and no-space is tiny anyway.
And of course, it would also help budding new developers who might like to see "how it was done".
I prefer to omit whitespace, especially in production.
You can still view the code through Firebug. There is no reason to do "view source".
Note that whitespace between tags can still cause problems, because it is rendered as a space.
If you strip out whitespace you'll rue your parsimonious nature one day when you have to View Source in Internet Explorer on some remote client machine and have to wade through a swamp of HTML tags.
Whitespace unnecessarily consumes network bandwidth. No, GZIP won't compensate for it 100%. I myself trim all the whitespace from the response and then pass it through GZIP. The only ones who care about whitespace in the HTML source are web developers who are curious what the page source looks like. They are really not worth the waste of network bandwidth -- unless you're practically the only visitor ;)
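A rough sketch of that trim-then-gzip setup using PHP output buffering (deliberately naive: this whitespace filter would also mangle <pre> blocks and inline scripts, so treat it as an illustration only):
<?php
// Buffer everything, collapse whitespace between tags down to a single space
// (a single space, not nothing, to keep the rendering behaviour described
// above), then hand the result to PHP's gzip handler.
ob_start('ob_gzhandler');
ob_start(function ($html) {
    return preg_replace('/>\s+</', '> <', $html);
});

echo "<html>\n  <header>\n  </header>\n  <body>\n    <p>Hello World!</p>\n  </body>\n</html>";

ob_end_flush(); // flush the whitespace filter into the gzip buffer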
Is JavaScript or PHP generating your HTML? Either way, use an escape character.
\n = new line
\t = tab
\r = carriage return
<?php echo "This is a test. <br> \n"; ?>

Removing Javascript from HREFs

We want to allow "normal" href links to other webpages, but we don't want to allow anyone to sneak in client-side scripting.
Is searching for "javascript:" within the HREF and onclick/onmouseover/etc. events good enough? Or are there other things to check?
It sounds like you're allowing users to submit content with markup. As such, I would recommend taking a look at a few articles about preventing cross-site scripting, which covers a bit more than simply preventing JavaScript from being inserted into an href attribute. Below is one I found that might be useful:
http://weblogs.java.net/blog/gmurray71/archive/2006/09/preventing_cros.html
You'll have to use a whitelist of allowed protocols to be completely safe. If you use a blacklist, sooner or later you'll miss something like "telnet://" or "shell:" or some exploitable browser-specific thing you've never heard of...
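A minimal sketch of such a whitelist in PHP (safe_href is a hypothetical helper; note it also rejects relative URLs, which you may or may not want):
<?php
// Hypothetical helper: only URLs whose scheme is explicitly whitelisted get
// through; everything else (including relative URLs here) collapses to "#".
function safe_href($url) {
    $allowed = array('http', 'https', 'mailto', 'ftp');
    $scheme  = parse_url($url, PHP_URL_SCHEME);
    return (is_string($scheme) && in_array(strtolower($scheme), $allowed, true))
        ? $url
        : '#';
}

echo safe_href('http://example.com/');                 // http://example.com/
echo safe_href('javascript:alert(document.cookie)');   // #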
Nope, there's a lot more that you need to check.
First of all, the URL could be encoded (using HTML entities or URL encoding or a mixture of both).
Secondly you need to check for malformed HTML, which the browser might guess at and end up allowing some script in.
Thirdly you need to check for CSS based script, e.g. background: url(javascript:...) or width:expression(...)
There's probably more that I've missed - you need to be careful!
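On the first point, here's a small sketch of why you need to decode before you check anything (normalize_url is a hypothetical helper):
<?php
// Hypothetical helper: keep decoding HTML entities and percent-encoding
// until the value stops changing, then inspect the scheme.
function normalize_url($url) {
    do {
        $prev = $url;
        $url  = rawurldecode(html_entity_decode($url, ENT_QUOTES, 'UTF-8'));
    } while ($url !== $prev);
    return trim($url);
}

echo normalize_url('&#106;avascript&#58;alert(1)'); // javascript:alert(1)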
You have to be extremely careful when taking user input. You'll want to do a whitelist as mentioned, but not just with the href. Example:
<img src="nosuchimage.blahblah" onerror="alert('Haxored!!!');" />
or
<a href="#" onmouseover="alert('Haxored!!!');">click meh</a>
One option would be to disallow HTML altogether and use the same sort of formatting that some forums use. Just replace
[url="xxx"]yyy[/url]
with
<a href="xxx">yyy</a>
That'll get you around the issues with mouseover etc. Then just make sure the link starts off with a white-listed protocol, and doesn't have a quote in it (&quot; or some such that might be decoded by PHP or the browser).
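A rough sketch of that idea in PHP (render_links and the exact BBCode-style syntax are assumptions, not a standard): the markup gets translated rather than passed through, and only http(s) links survive:
<?php
// Hypothetical helper and BBCode syntax: the markup is translated, never
// passed through, and only http(s) links survive. The surrounding user text
// would still need to be HTML-escaped separately before output.
function render_links($text) {
    return preg_replace_callback(
        '/\[url="([^"\]]*)"\](.*?)\[\/url\]/is',
        function ($m) {
            $label = htmlspecialchars($m[2], ENT_QUOTES, 'UTF-8');
            if (!preg_match('#^https?://#i', $m[1])) {
                return $label;   // non-whitelisted scheme: keep the text only
            }
            return '<a href="' . htmlspecialchars($m[1], ENT_QUOTES, 'UTF-8') . '">'
                 . $label . '</a>';
        },
        $text
    );
}

echo render_links('[url="http://example.com/"]yyy[/url]');
// <a href="http://example.com/">yyy</a>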
Sounds like you're looking for the companion function to PHP's strip_tags, which is strip_attributes. Unfortunately, it hasn't been written yet. (Hint, hint.)
There is, however, an interesting-looking suggestion in the strip_tags documentation, here:
http://www.php.net/manual/en/function.strip-tags.php#85718
In theory this will strip anything that isn't an href, class, or ID from submitted links; seems like you probably want to lock it down even further and just take hrefs.
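Since it hasn't been written yet, here's a rough sketch of what such a strip_attributes pass could look like with DOMDocument, locked down to hrefs only (a hypothetical, lightly defended helper; for real user input a maintained sanitizer is still the safer bet):
<?php
// Rough sketch: drop every attribute except href, and only keep href when
// it's a plain http(s) URL.
function strip_attributes($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML('<div id="wrap">' . $html . '</div>'); // silence warnings on broken markup
    $xpath = new DOMXPath($doc);
    $wrap  = $xpath->query('//div[@id="wrap"]')->item(0);
    foreach ($xpath->query('.//*', $wrap) as $el) {
        for ($i = $el->attributes->length - 1; $i >= 0; $i--) {
            $attr = $el->attributes->item($i);
            $keep = ($attr->name === 'href'
                     && preg_match('#^https?://#i', $attr->value));
            if (!$keep) {
                $el->removeAttribute($attr->name);
            }
        }
    }
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo strip_attributes('<a href="http://example.com/" onclick="evil()">hi</a>');
// <a href="http://example.com/">hi</a>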

Apart from <script> tags, what should I strip to make sure user-entered HTML is safe?

I have an app that reprocesses HTML in order to do nice typography. Now, I want to put it up on the web to let users type in their text. So here's the question: I'm pretty sure that I want to remove the SCRIPT tag, plus closing tags like </form>. But what else should I remove to make it totally safe?
Oh good lord you're screwed.
Take a look at this
Basically, there are so many things you want to strip out. Plus, there's stuff that's valid, but could be used in malicious ways. What if the user wants to set their font size smaller on a footnote? Do you care if that gets applied to your entire page? How about setting colors? Now all the words on your page are white on a white background.
I would look into the requirements phase again.
Is a markdown-like alternative possible?
Can you restrict access to the final content, reducing risk of exposure? (meaning, can you set it up so the user only screws themselves, and can't harm other people?)
You should take the white-list rather than the black-list approach: Decide which features are desired, rather than try to block any unwanted feature.
Make a list of desired typographic features that match your application. Note that there is probably no one-size-fits-all list: It depends both on the nature of the site (programming questions? teenagers' blog?) and the nature of the text box (are you leaving a comment or writing an article?). You can take a look at some good and useful text boxes in open source CMSs.
Now you have to choose between your own markup language and HTML. I would choose a markup language. The pro is better security; the con is the inability to embed arbitrary web content, like YouTube videos. A good way to prevent users' rage is adding an "HTML to my-site" feature that translates the corresponding HTML tags to your markup language and deletes all other tags.
The pros for HTML are consistency with standards, extensibility to new content types, and simplicity. The big con is code injection security issues. Should you pick HTML, try to adopt some working system for filtering it (I think Drupal is doing quite a good job in this case).
Instead of blacklisting some tags, it's always safer to whitelist. See what stackoverflow does: What HTML tags are allowed on Stack Overflow?
There are just too many ways to embed scripts in the markup. javascript: URLs (encoded of course)? CSS behaviors? I don't think you want to go there.
There are plenty of ways that code could be sneaked in - especially watch for situations like <img src="http://nasty/exploit/here.php"> that can feed a <script> tag to your clients. I've seen <script> blocked on sites before, but the <img> tag got right through, which resulted in 30-40 stolen passwords.
<iframe>
<style>
<form>
<object>
<embed>
<bgsound>
That's what I can think of. But to be sure, use a whitelist instead - things like <a> and <img>† that are (mostly) harmless.
† Just make sure that any javascript:... / on*=... are filtered out too... as you can see, it can get quite complicated.
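A tiny sketch of the tag-whitelist half of that, with the caveat baked in that strip_tags alone doesn't touch attributes, so the javascript:/on*= filtering still needs its own pass (or a real sanitizer such as HTML Purifier):
<?php
// strip_tags() removes everything except the allowed tags, but note it keeps
// every attribute on the tags that survive.
$input = '<iframe src="evil"></iframe><a href="http://example.com/" onclick="x()">hi</a>';
echo strip_tags($input, '<a><img>');
// <a href="http://example.com/" onclick="x()">hi</a>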
I disagree with person-b. You're forgetting about javascript attributes, like this:
<img src="xyz.jpg" onload="javascript:alert('evil');"/>
Attackers will always be more creative than you when it comes to this. Definitely go with the whitelist approach.
MediaWiki is more permissive than this site; yes, it accepts setting colors (even white on white), margins, indents and absolute positioning (including positions that would put the text completely off screen), null, clippings and "display:none", font sizes (even ridiculously small or excessively large ones) and font names (even a legacy non-Unicode Symbol font name that will not render text successfully), as opposed to this site, which strips out almost everything.
But MediaWiki successfully strips the dangerous active scripts out of CSS (i.e. the behaviors, the onEvent handlers, the active filters and javascript: link targets) without filtering out the style attribute completely, and it bans a few other active elements like object, embed and bgsound.
Both sites ban marquees as well (not standard HTML, and needlessly distracting).
But MediaWiki sites are patrolled by lots of users, and there are policy rules to ban users who abuse this repeatedly.
It offers support for animated images, and provides support for active extensions, such as rendering TeX math expressions, or other approved active extensions (like timeline), or creating or customizing a few forms.