I've been (not so) accidentally escaping my image url's for a while now, and have never seen any issues in any browsers.
eg, changing this:
<img src="http://example.com/image.jpg" />
into this:
<img src="http://example.com/image.jpg" />
But I have been wondering if there could be issues with this. I've left this in place so far due to XSS concerns around where the image URL's come from, but I could regex validity check them and not escape them too..
It does make pages slightly bigger curently..
Has anyone experienced issues with escaping image URLs..?
The "escaping" you're doing there is purely on the level of the XML/HTML, and by the time the document has been read and understood by the XML/HTML parser the escaping is gone -- long before any "URL-ness" even comes into play.
So, no, there shouldn't be any issues with that, but probably also not many benefits either :)
I'm using an '&' symbol with HTML5 and UTF-8 in my site's <title>. Google shows the ampersand fine on its SERPs, as do all the browsers in their titles.
http://validator.w3.org is giving me this:
& did not start a character reference. (& probably should have been escaped as &.)
Do I really need to do &?
I'm not fussed about my pages validating for the sake of validating, but I'm curious to hear people's opinions on this and if it's important and why.
Yes. Just as the error said, in HTML, attributes are #PCDATA meaning they're parsed. This means you can use character entities in the attributes. Using & by itself is wrong and if not for lenient browsers and the fact that this is HTML not XHTML, would break the parsing. Just escape it as & and everything would be fine.
HTML5 allows you to leave it unescaped, but only when the data that follows does not look like a valid character reference. However, it's better just to escape all instances of this symbol than worry about which ones should be and which ones don't need to be.
Keep this point in mind; if you're not escaping & to &, it's bad enough for data that you create (where the code could very well be invalid), you might also not be escaping tag delimiters, which is a huge problem for user-submitted data, which could very well lead to HTML and script injection, cookie stealing and other exploits.
Please just escape your code. It will save you a lot of trouble in the future.
Validation aside, the fact remains that encoding certain characters is important to an HTML document so that it can render properly and safely as a web page.
Encoding & as & under all circumstances, for me, is an easier rule to live by, reducing the likelihood of errors and failures.
Compare the following: which is easier? Which is easier to bugger up?
Methodology 1
Write some content which includes ampersand characters.
Encode them all.
Methodology 2
(with a grain of salt, please ;) )
Write some content which includes ampersand characters.
On a case-by-case basis, look at each ampersand. Determine if:
It is isolated, and as such unambiguously an ampersand. eg. volt & amp > In that case don't bother encoding it.
It is not isolated, but you feel it is nonetheless unambiguous, as the resulting entity does not exist and will never exist since the entity list could never evolve. E.g., amp&volt >. In that case, don't bother encoding it.
It is not isolated, and ambiguous. E.g., volt& > Encode it.
??
HTML5 rules are different from HTML4. It's not required in HTML5 - unless the ampersand looks like it starts a parameter name. "©=2" is still a problem, for example, since © is the copyright symbol.
However it seems to me that it's harder work to decide to encode or not to encode depending on the following text. So the easiest path is probably to encode all the time.
I think this has turned into more of a question of "why follow the spec when browser's don't care." Here is my generalized answer:
Standards are not a "present" thing. They are a "future" thing. If we, as developers, follow web standards, then browser vendors are more likely to correctly implement those standards, and we move closer to a completely interoperable web, where CSS hacks, feature detection, and browser detection are not necessary. Where we don't have to figure out why our layouts break in a particular browser, or how to work around that.
Specifically, if HTML5 does not require using & in your specific situation, and you're using an HTML5 doctype (and also expecting your users to be using HTML5-compliant browsers), then there is no reason to do it.
Well, if it comes from user input then absolutely yes, for obvious reasons. Think if this very website didn't do it: the title of this question would show up as Do I really need to encode ‘&’ as ‘&’?
If it's just something like echo '<title>Dolce & Gabbana</title>'; then strictly speaking you don't have to. It would be better, but if you don't, no user will notice the difference.
Could you show us what your title actually is? When I submit
<!DOCTYPE html>
<html>
<title>Dolce & Gabbana</title>
<body>
<p>Am I allowed loose & mpersands?</p>
</body>
</html>
to http://validator.w3.org/ - explicitly asking it to use the experimental HTML 5 mode - it has no complaints about the &s...
In HTML, a & marks the begin of a reference, either of a character reference or of an entity reference. From that point on, the parser expects either a # denoting a character reference, or an entity name denoting an entity reference, both followed by a ;. That’s the normal behavior.
But if the reference name or just the reference opening & is followed by a white space or other delimiters like ", ', <, >, &, the ending ; and even a reference to represent a plain, & can be omitted:
<p title="&">foo & bar</p>
<p title="&">foo & bar</p>
<p title="&">foo & bar</p>
Only in these cases can the ending ; or even the reference itself be omitted (at least in HTML 4). I think HTML 5 requires the ending ;.
But the specification recommends to always use a reference like the character reference & or the entity reference & to avoid confusion:
Authors should use "&" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&" in attribute values since character references are allowed within CDATA attribute values.
Update (March 2020): The W3C validator no longer complains about escaping URLs.
I was checking why image URLs need escaping and hence tried it in https://validator.w3.org. The explanation is pretty nice. It highlights that even URLs need to be escaped. [PS: I guess it will be unescaped when it's consumed since URLs need &. Can anyone clarify?]
<img alt="" src="foo?bar=qut&qux=fop" />
An entity reference was found in the document, but there is no
reference by that name defined. Often this is caused by misspelling
the reference name, unencoded ampersands, or by leaving off the
trailing semicolon (;). The most common cause of this error is
unencoded ampersands in URLs as described by the WDG in "Ampersands in
URLs". Entity references start with an ampersand (&) and end with a
semicolon (;). If you want to use a literal ampersand in your document
you must encode it as "&" (even inside URLs!). Be careful to end
entity references with a semicolon or your entity reference may get
interpreted in connection with the following text. Also keep in mind
that named entity references are case-sensitive; &Aelig; and æ
are different characters. If this error appears in some markup
generated by PHP's session handling code, this article has
explanations and solutions to your problem.
It depends on the likelihood of a semicolon ending up near your &, causing it to display something quite different.
For example, when dealing with input from users (say, if you include the user-provided subject of a forum post in your title tags), you never know where they might be putting random semicolons, and it might randomly display strange entities. So always escape in that situation.
For your own static HTML content, sure, you could skip it, but it's so trivial to include proper escaping, that there's no good reason to avoid it.
If the user passes it to you, or it will wind up in a URL, you need to escape it.
If it appears in static text on a page? All browsers will get this one right either way, and you don't worry much about it, since it will work.
Yes, you should try to serve valid code if possible.
Most browsers will silently correct this error, but there is a problem with relying on the error handling in the browsers. There is no standard for how to handle incorrect code, so it's up to each browser vendor to try to figure out what to do with each error, and the results may vary.
Some examples where browsers are likely to react differently is if you put elements inside a table but outside the table cells, or if you nest links inside each other.
For your specific example it's not likely to cause any problems, but error correction in the browser might for example cause the browser to change from standards compliant mode into quirks mode, which could make your layout break down completely.
So, you should correct errors like this in the code, if not for anything else so to keep the error list in the validator short, so that you can spot more serious problems.
A couple of years ago, we got a report that one of our web apps wasn't displaying correctly in Firefox. It turned out that the page contained a tag that looked like
<div style="..." ... style="...">
When faced with a repeated style attribute, Internet Explorer combines both of the styles, while Firefox only uses one of them, hence the different behavior. I changed the tag to
<div style="...; ..." ...>
and sure enough, it fixed the problem! The moral of the story is that browsers have more consistent handling of valid HTML than of invalid HTML. So, fix your damn markup already! (Or use HTML Tidy to fix it.)
If & is used in HTML then you should escape it.
If & is used in JavaScript strings, e.g., an alert('This & that'); or document.href, you don't need to use it.
If you're using document.write then you should use it, e.g. document.write(<p>this & that</p>).
If you're really talking about the static text
<title>Foo & Bar</title>
stored in some file on the hard disk and served directly by a server, then yes: it probably doesn't need to be escaped.
However, since there is very little HTML content nowadays that's completely static, I'll add the following disclaimer that assumes that the HTML content is generated from some other source (database content, user input, web service call result, legacy API result, ...):
If you don't escape a simple &, then chances are you also don't escape a & or a or <b> or <script src="http://attacker.com/evil.js"> or any other invalid text. That would mean that you are at best displaying your content wrongly and more likely are suspectible to XSS attacks.
In other words: when you're already checking and escaping the other more problematic cases, then there's almost no reason to leave the not-totally-broken-but-still-somewhat-fishy standalone-& unescaped.
The link has a fairly good example of when and why you may need to escape & to &
https://jsfiddle.net/vh2h7usk/1/
Interestingly, I had to escape the character in order to represent it properly in my answer here. If I were to use the built-in code sample option (from the answer panel), I can just type in & and it appears as it should. But if I were to manually use the <code></code> element, then I have to escape in order to represent it correctly :)
I know the percent symbol has to be URL-encoded when being passed around, but when I display it in the browser, is it also necessary to escape it like so: %?
In URLs, the percent sign (%) has a special meaning, so it should be escaped. In HTML, it does not, so it is not necessary to escape it.
I agree with the chosen answer, but would like to qualify the statement “it is not necessary to escape it.”
If you have a need (or desire) to escape a percentage sign in HTML code, (and there are good reasons to do this with any potentially ambiguous character or symbol) then I would highly recommend using the percentage entity code % as opposed to any numeric code. (those I use when there is no entity name you could use)
That was the answer I was looking for when I found this page, because I forgot it looses the final "e".
We should probably all be using at least the entities kindly listed here. (whoever Webmasterish is; thank you)
Reasoning: Numeric codes (and particularly byte codes from unencoded characters) change with code–pages, on systems using different default languages, and / or different operating systems. (Windows and Mac using slightly different code sets for “English” being the classic, which still plagues plain–text eMail sent between Apple Mail and Outlook) This is slowing down, and should stop with UTF, but I'm still seeing it pop up.
If you're converting HTML to some other mark–up, (note, I used "–" not a "-", or even "−" for the same reason) such as LaTeX, DVI, PostScript or even MarkDown, then it's useful to completely squash any ambiguity… And those processes tend to happen on the information you least expect to be used in such a way when you initially write it. So just get used to doing it everywhere and be grateful to your former self for having had the foresight to do so. Probably years down the line, when you're looking to update formulae to be more readable by utilising MathJax or such, and keep picking up hyphenated words. <swearmarks>
I'd like to add this - if you use javascript in href, you are in troubles too. Check this example:
http://jsfiddle.net/cs4MZ/
One of the workarounds might be using onclick instead of href.
If you're talking about in HTML text, visible to the reader, no. It can't do anything harmful, there.
...if you're talking about inside of HTML attributes, then yes, that would be good to consider.
URLs and HTML are different languages, as weird as that might seem, so they have different weaknesses.
Within my HTML, can I use the character entity reference " " in place of "%20" in Web URLs?
They're both spaces, right?
The short answer is, they are both used to represent "spaces", but they represent different spaces.
%20 is the URL escaping for byte 32, which corresponds to plain old space in pretty much any encoding you're likely to use in a URL.
is an HTML character reference which actually refers to character 160 of Unicode (and also ISO-8859-1 aka Latin-1). It's a different space character entirely -- the "non-breaking space". Even though they look pretty much the same, they're different characters and it's unlikely that your server will treat them the same way.
No. Neither are spaces (technically). Both represent spaces in different ways though. Make every effort to NOT have spaces, or representatives of spaces, in your URLs. Many find it more elegant (me included) to replace spaces with _ or -
No. is an HTML non-breaking-space entity; this entity has no meaning when used in a filesystem or wherever else that a URL might point. URLs are not encoded in HTML.
No, not in the URLs. What you can do is replace spaces in the textual representation of the URL.
So instead of:
http://some.site/doc%20with%20spaces
you can have:
http://some.site/doc with spaces
%20 is what you get with URL encoding, so this is what you should use if you are going to use it in a URL.
is a HTML entity, which is what should be used for 'non breaking space' in an HTML document.
Most persons try to absolutely avoid spaces in their filenames in URLs. They will give you a serious headache every time so try to do so.
If you want to have spaces in an URL you have to encode them with %20.
  is used by the browser to know how to display the page. This information is only used for displaying. The %20 will be sent to the server that manages all the stuff needed to transfer the webpage to your visitors. The server doesn't speak html so the server would interpret   as a normal part of the filenname and search for a file called in the way foo bar. This file will not be found. Much worse the web server will think that the & begins the variable part of the url and only search for the page foo and then try to generate a variable nbsp and a variable bar but he want see any values for them. All in all the web server can't handle a URL with an in it.
Neither are spaces. You shouldnt be using spaces but if for what ever reason you can't avoid it you should just be able to do...
Hey there
...clicking on which will automatically navigate the user to
WebSite/Web%20Page.aspx
We want to allow "normal" href links to other webpages, but we don't want to allow anyone to sneak in client-side scripting.
Is searching for "javascript:" within the HREF and onclick/onmouseover/etc. events good enough? Or are there other things to check?
It sounds like you're allowing users to submit content with markup. As such, I would recommend taking a look at a few articles about preventing cross-site scripting which would cover a bit more than simply preventing javascript from being inserted into an HREF tag. Below is one I found that might be useful:
http://weblogs.java.net/blog/gmurray71/archive/2006/09/preventing_cros.html
You'll have to use a whitelist of allowed protocols to be completely safe. If you use a blacklist, sooner or later you'll miss something like "telnet://" or "shell:" or some exploitable browser-specific thing you've never heard of...
Nope, there's a lot more that you need to check.
First of the URL could be encoded (using HTML entities or URL encoding or a mixture of both).
Secondly you need to check for malformed HTML, which the browser might guess at and end up allowing some script in.
Thirdly you need to check for CSS based script, e.g. background: url(javascript:...) or width:expression(...)
There's probably more that I've missed - you need to be careful!
You have to be extremely careful when taking user input. You'll want to do a whitelist as mentioned, but not just with the href. Example:
<img src="nosuchimage.blahblah" onerror="alert('Haxored!!!');" />
or
click meh
one option would be to disallow html at all and use the same sort of formatting that some forums use. Just replace
[url="xxx"]yyy[/url]
with
yyy
That'll get you around the issues with mouse over etc. Then just make sure the link starts off with a white-listed protocol, and doesn't have a quote in it (" or some such that might be decrypted by php or the browser).
Sounds like you're looking for the companion function to PHP's strip_tags, which is strip_attributes. Unfortunately, it hasn't been written yet. (Hint, hint.)
There is, however, an interesting-looking suggestion in the strip_tags documentation, here:
http://www.php.net/manual/en/function.strip-tags.php#85718
In theory this will strip anything that isn't an href, class, or ID from submitted links; seems like you probably want to lock it down even further and just take hrefs.