Related
I'm using an '&' symbol with HTML5 and UTF-8 in my site's <title>. Google shows the ampersand fine on its SERPs, as do all the browsers in their titles.
http://validator.w3.org is giving me this:
& did not start a character reference. (& probably should have been escaped as &.)
Do I really need to do &?
I'm not fussed about my pages validating for the sake of validating, but I'm curious to hear people's opinions on this and if it's important and why.
Yes. Just as the error said, in HTML, attributes are #PCDATA meaning they're parsed. This means you can use character entities in the attributes. Using & by itself is wrong and if not for lenient browsers and the fact that this is HTML not XHTML, would break the parsing. Just escape it as & and everything would be fine.
HTML5 allows you to leave it unescaped, but only when the data that follows does not look like a valid character reference. However, it's better just to escape all instances of this symbol than worry about which ones should be and which ones don't need to be.
Keep this point in mind; if you're not escaping & to &, it's bad enough for data that you create (where the code could very well be invalid), you might also not be escaping tag delimiters, which is a huge problem for user-submitted data, which could very well lead to HTML and script injection, cookie stealing and other exploits.
Please just escape your code. It will save you a lot of trouble in the future.
Validation aside, the fact remains that encoding certain characters is important to an HTML document so that it can render properly and safely as a web page.
Encoding & as & under all circumstances, for me, is an easier rule to live by, reducing the likelihood of errors and failures.
Compare the following: which is easier? Which is easier to bugger up?
Methodology 1
Write some content which includes ampersand characters.
Encode them all.
Methodology 2
(with a grain of salt, please ;) )
Write some content which includes ampersand characters.
On a case-by-case basis, look at each ampersand. Determine if:
It is isolated, and as such unambiguously an ampersand. eg. volt & amp > In that case don't bother encoding it.
It is not isolated, but you feel it is nonetheless unambiguous, as the resulting entity does not exist and will never exist since the entity list could never evolve. E.g., amp&volt >. In that case, don't bother encoding it.
It is not isolated, and ambiguous. E.g., volt& > Encode it.
??
HTML5 rules are different from HTML4. It's not required in HTML5 - unless the ampersand looks like it starts a parameter name. "©=2" is still a problem, for example, since © is the copyright symbol.
However it seems to me that it's harder work to decide to encode or not to encode depending on the following text. So the easiest path is probably to encode all the time.
I think this has turned into more of a question of "why follow the spec when browser's don't care." Here is my generalized answer:
Standards are not a "present" thing. They are a "future" thing. If we, as developers, follow web standards, then browser vendors are more likely to correctly implement those standards, and we move closer to a completely interoperable web, where CSS hacks, feature detection, and browser detection are not necessary. Where we don't have to figure out why our layouts break in a particular browser, or how to work around that.
Specifically, if HTML5 does not require using & in your specific situation, and you're using an HTML5 doctype (and also expecting your users to be using HTML5-compliant browsers), then there is no reason to do it.
Well, if it comes from user input then absolutely yes, for obvious reasons. Think if this very website didn't do it: the title of this question would show up as Do I really need to encode ‘&’ as ‘&’?
If it's just something like echo '<title>Dolce & Gabbana</title>'; then strictly speaking you don't have to. It would be better, but if you don't, no user will notice the difference.
Could you show us what your title actually is? When I submit
<!DOCTYPE html>
<html>
<title>Dolce & Gabbana</title>
<body>
<p>Am I allowed loose & mpersands?</p>
</body>
</html>
to http://validator.w3.org/ - explicitly asking it to use the experimental HTML 5 mode - it has no complaints about the &s...
In HTML, a & marks the begin of a reference, either of a character reference or of an entity reference. From that point on, the parser expects either a # denoting a character reference, or an entity name denoting an entity reference, both followed by a ;. That’s the normal behavior.
But if the reference name or just the reference opening & is followed by a white space or other delimiters like ", ', <, >, &, the ending ; and even a reference to represent a plain, & can be omitted:
<p title="&">foo & bar</p>
<p title="&">foo & bar</p>
<p title="&">foo & bar</p>
Only in these cases can the ending ; or even the reference itself be omitted (at least in HTML 4). I think HTML 5 requires the ending ;.
But the specification recommends to always use a reference like the character reference & or the entity reference & to avoid confusion:
Authors should use "&" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&" in attribute values since character references are allowed within CDATA attribute values.
Update (March 2020): The W3C validator no longer complains about escaping URLs.
I was checking why image URLs need escaping and hence tried it in https://validator.w3.org. The explanation is pretty nice. It highlights that even URLs need to be escaped. [PS: I guess it will be unescaped when it's consumed since URLs need &. Can anyone clarify?]
<img alt="" src="foo?bar=qut&qux=fop" />
An entity reference was found in the document, but there is no
reference by that name defined. Often this is caused by misspelling
the reference name, unencoded ampersands, or by leaving off the
trailing semicolon (;). The most common cause of this error is
unencoded ampersands in URLs as described by the WDG in "Ampersands in
URLs". Entity references start with an ampersand (&) and end with a
semicolon (;). If you want to use a literal ampersand in your document
you must encode it as "&" (even inside URLs!). Be careful to end
entity references with a semicolon or your entity reference may get
interpreted in connection with the following text. Also keep in mind
that named entity references are case-sensitive; &Aelig; and æ
are different characters. If this error appears in some markup
generated by PHP's session handling code, this article has
explanations and solutions to your problem.
It depends on the likelihood of a semicolon ending up near your &, causing it to display something quite different.
For example, when dealing with input from users (say, if you include the user-provided subject of a forum post in your title tags), you never know where they might be putting random semicolons, and it might randomly display strange entities. So always escape in that situation.
For your own static HTML content, sure, you could skip it, but it's so trivial to include proper escaping, that there's no good reason to avoid it.
If the user passes it to you, or it will wind up in a URL, you need to escape it.
If it appears in static text on a page? All browsers will get this one right either way, and you don't worry much about it, since it will work.
Yes, you should try to serve valid code if possible.
Most browsers will silently correct this error, but there is a problem with relying on the error handling in the browsers. There is no standard for how to handle incorrect code, so it's up to each browser vendor to try to figure out what to do with each error, and the results may vary.
Some examples where browsers are likely to react differently is if you put elements inside a table but outside the table cells, or if you nest links inside each other.
For your specific example it's not likely to cause any problems, but error correction in the browser might for example cause the browser to change from standards compliant mode into quirks mode, which could make your layout break down completely.
So, you should correct errors like this in the code, if not for anything else so to keep the error list in the validator short, so that you can spot more serious problems.
A couple of years ago, we got a report that one of our web apps wasn't displaying correctly in Firefox. It turned out that the page contained a tag that looked like
<div style="..." ... style="...">
When faced with a repeated style attribute, Internet Explorer combines both of the styles, while Firefox only uses one of them, hence the different behavior. I changed the tag to
<div style="...; ..." ...>
and sure enough, it fixed the problem! The moral of the story is that browsers have more consistent handling of valid HTML than of invalid HTML. So, fix your damn markup already! (Or use HTML Tidy to fix it.)
If & is used in HTML then you should escape it.
If & is used in JavaScript strings, e.g., an alert('This & that'); or document.href, you don't need to use it.
If you're using document.write then you should use it, e.g. document.write(<p>this & that</p>).
If you're really talking about the static text
<title>Foo & Bar</title>
stored in some file on the hard disk and served directly by a server, then yes: it probably doesn't need to be escaped.
However, since there is very little HTML content nowadays that's completely static, I'll add the following disclaimer that assumes that the HTML content is generated from some other source (database content, user input, web service call result, legacy API result, ...):
If you don't escape a simple &, then chances are you also don't escape a & or a or <b> or <script src="http://attacker.com/evil.js"> or any other invalid text. That would mean that you are at best displaying your content wrongly and more likely are suspectible to XSS attacks.
In other words: when you're already checking and escaping the other more problematic cases, then there's almost no reason to leave the not-totally-broken-but-still-somewhat-fishy standalone-& unescaped.
The link has a fairly good example of when and why you may need to escape & to &
https://jsfiddle.net/vh2h7usk/1/
Interestingly, I had to escape the character in order to represent it properly in my answer here. If I were to use the built-in code sample option (from the answer panel), I can just type in & and it appears as it should. But if I were to manually use the <code></code> element, then I have to escape in order to represent it correctly :)
I'm using an '&' symbol with HTML5 and UTF-8 in my site's <title>. Google shows the ampersand fine on its SERPs, as do all the browsers in their titles.
http://validator.w3.org is giving me this:
& did not start a character reference. (& probably should have been escaped as &.)
Do I really need to do &?
I'm not fussed about my pages validating for the sake of validating, but I'm curious to hear people's opinions on this and if it's important and why.
Yes. Just as the error said, in HTML, attributes are #PCDATA meaning they're parsed. This means you can use character entities in the attributes. Using & by itself is wrong and if not for lenient browsers and the fact that this is HTML not XHTML, would break the parsing. Just escape it as & and everything would be fine.
HTML5 allows you to leave it unescaped, but only when the data that follows does not look like a valid character reference. However, it's better just to escape all instances of this symbol than worry about which ones should be and which ones don't need to be.
Keep this point in mind; if you're not escaping & to &, it's bad enough for data that you create (where the code could very well be invalid), you might also not be escaping tag delimiters, which is a huge problem for user-submitted data, which could very well lead to HTML and script injection, cookie stealing and other exploits.
Please just escape your code. It will save you a lot of trouble in the future.
Validation aside, the fact remains that encoding certain characters is important to an HTML document so that it can render properly and safely as a web page.
Encoding & as & under all circumstances, for me, is an easier rule to live by, reducing the likelihood of errors and failures.
Compare the following: which is easier? Which is easier to bugger up?
Methodology 1
Write some content which includes ampersand characters.
Encode them all.
Methodology 2
(with a grain of salt, please ;) )
Write some content which includes ampersand characters.
On a case-by-case basis, look at each ampersand. Determine if:
It is isolated, and as such unambiguously an ampersand. eg. volt & amp > In that case don't bother encoding it.
It is not isolated, but you feel it is nonetheless unambiguous, as the resulting entity does not exist and will never exist since the entity list could never evolve. E.g., amp&volt >. In that case, don't bother encoding it.
It is not isolated, and ambiguous. E.g., volt& > Encode it.
??
HTML5 rules are different from HTML4. It's not required in HTML5 - unless the ampersand looks like it starts a parameter name. "©=2" is still a problem, for example, since © is the copyright symbol.
However it seems to me that it's harder work to decide to encode or not to encode depending on the following text. So the easiest path is probably to encode all the time.
I think this has turned into more of a question of "why follow the spec when browser's don't care." Here is my generalized answer:
Standards are not a "present" thing. They are a "future" thing. If we, as developers, follow web standards, then browser vendors are more likely to correctly implement those standards, and we move closer to a completely interoperable web, where CSS hacks, feature detection, and browser detection are not necessary. Where we don't have to figure out why our layouts break in a particular browser, or how to work around that.
Specifically, if HTML5 does not require using & in your specific situation, and you're using an HTML5 doctype (and also expecting your users to be using HTML5-compliant browsers), then there is no reason to do it.
Well, if it comes from user input then absolutely yes, for obvious reasons. Think if this very website didn't do it: the title of this question would show up as Do I really need to encode ‘&’ as ‘&’?
If it's just something like echo '<title>Dolce & Gabbana</title>'; then strictly speaking you don't have to. It would be better, but if you don't, no user will notice the difference.
Could you show us what your title actually is? When I submit
<!DOCTYPE html>
<html>
<title>Dolce & Gabbana</title>
<body>
<p>Am I allowed loose & mpersands?</p>
</body>
</html>
to http://validator.w3.org/ - explicitly asking it to use the experimental HTML 5 mode - it has no complaints about the &s...
In HTML, a & marks the begin of a reference, either of a character reference or of an entity reference. From that point on, the parser expects either a # denoting a character reference, or an entity name denoting an entity reference, both followed by a ;. That’s the normal behavior.
But if the reference name or just the reference opening & is followed by a white space or other delimiters like ", ', <, >, &, the ending ; and even a reference to represent a plain, & can be omitted:
<p title="&">foo & bar</p>
<p title="&">foo & bar</p>
<p title="&">foo & bar</p>
Only in these cases can the ending ; or even the reference itself be omitted (at least in HTML 4). I think HTML 5 requires the ending ;.
But the specification recommends to always use a reference like the character reference & or the entity reference & to avoid confusion:
Authors should use "&" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&" in attribute values since character references are allowed within CDATA attribute values.
Update (March 2020): The W3C validator no longer complains about escaping URLs.
I was checking why image URLs need escaping and hence tried it in https://validator.w3.org. The explanation is pretty nice. It highlights that even URLs need to be escaped. [PS: I guess it will be unescaped when it's consumed since URLs need &. Can anyone clarify?]
<img alt="" src="foo?bar=qut&qux=fop" />
An entity reference was found in the document, but there is no
reference by that name defined. Often this is caused by misspelling
the reference name, unencoded ampersands, or by leaving off the
trailing semicolon (;). The most common cause of this error is
unencoded ampersands in URLs as described by the WDG in "Ampersands in
URLs". Entity references start with an ampersand (&) and end with a
semicolon (;). If you want to use a literal ampersand in your document
you must encode it as "&" (even inside URLs!). Be careful to end
entity references with a semicolon or your entity reference may get
interpreted in connection with the following text. Also keep in mind
that named entity references are case-sensitive; &Aelig; and æ
are different characters. If this error appears in some markup
generated by PHP's session handling code, this article has
explanations and solutions to your problem.
It depends on the likelihood of a semicolon ending up near your &, causing it to display something quite different.
For example, when dealing with input from users (say, if you include the user-provided subject of a forum post in your title tags), you never know where they might be putting random semicolons, and it might randomly display strange entities. So always escape in that situation.
For your own static HTML content, sure, you could skip it, but it's so trivial to include proper escaping, that there's no good reason to avoid it.
If the user passes it to you, or it will wind up in a URL, you need to escape it.
If it appears in static text on a page? All browsers will get this one right either way, and you don't worry much about it, since it will work.
Yes, you should try to serve valid code if possible.
Most browsers will silently correct this error, but there is a problem with relying on the error handling in the browsers. There is no standard for how to handle incorrect code, so it's up to each browser vendor to try to figure out what to do with each error, and the results may vary.
Some examples where browsers are likely to react differently is if you put elements inside a table but outside the table cells, or if you nest links inside each other.
For your specific example it's not likely to cause any problems, but error correction in the browser might for example cause the browser to change from standards compliant mode into quirks mode, which could make your layout break down completely.
So, you should correct errors like this in the code, if not for anything else so to keep the error list in the validator short, so that you can spot more serious problems.
A couple of years ago, we got a report that one of our web apps wasn't displaying correctly in Firefox. It turned out that the page contained a tag that looked like
<div style="..." ... style="...">
When faced with a repeated style attribute, Internet Explorer combines both of the styles, while Firefox only uses one of them, hence the different behavior. I changed the tag to
<div style="...; ..." ...>
and sure enough, it fixed the problem! The moral of the story is that browsers have more consistent handling of valid HTML than of invalid HTML. So, fix your damn markup already! (Or use HTML Tidy to fix it.)
If & is used in HTML then you should escape it.
If & is used in JavaScript strings, e.g., an alert('This & that'); or document.href, you don't need to use it.
If you're using document.write then you should use it, e.g. document.write(<p>this & that</p>).
If you're really talking about the static text
<title>Foo & Bar</title>
stored in some file on the hard disk and served directly by a server, then yes: it probably doesn't need to be escaped.
However, since there is very little HTML content nowadays that's completely static, I'll add the following disclaimer that assumes that the HTML content is generated from some other source (database content, user input, web service call result, legacy API result, ...):
If you don't escape a simple &, then chances are you also don't escape a & or a or <b> or <script src="http://attacker.com/evil.js"> or any other invalid text. That would mean that you are at best displaying your content wrongly and more likely are suspectible to XSS attacks.
In other words: when you're already checking and escaping the other more problematic cases, then there's almost no reason to leave the not-totally-broken-but-still-somewhat-fishy standalone-& unescaped.
The link has a fairly good example of when and why you may need to escape & to &
https://jsfiddle.net/vh2h7usk/1/
Interestingly, I had to escape the character in order to represent it properly in my answer here. If I were to use the built-in code sample option (from the answer panel), I can just type in & and it appears as it should. But if I were to manually use the <code></code> element, then I have to escape in order to represent it correctly :)
I just uncovered this little gem of an IE9 bug. It seems that IE9 does not recognize a space preceding a period as a breaking point. As in a list of domain or file extensions. Open the following fiddle in IE9.
http://jsfiddle.net/cssguru/nNnzM/1/
I tried using escape characters, but it didn't help. Any suggestions on a workaround?
It is an annoying feature, but it is probably intentional and not regarded as a bug by the vendor. Instead, it is regarded as implementing Unicode line breaking rules (which are partly rather odd). According to those rules, the period (or FULL STOP as they call it) has line breaking class IS, infix numeric separator, and “When not used in a numeric context, infix separators are sentence-ending punctuation. Therefore they always prevent breaks before.”
To deal with such issues, it is probably best nowadays to insert U+200B ZERO WIDTH SPACE between a normal space and a period, e.g.
.web .shop .blog .nyc ...
U+200B is a control character that allows line breaks in a place where they would not otherwise be allowed.
Old IE versions (IE 6) may have difficulties with this, displaying a symbol of unrepresentable character in place of U+200B. An alternative method, the <wbr> tag, would not have this problem, but it seems that IE 8 and newer often fail to honor this age-old tag (perhaps because it never made its way to any standard, despite its usefulness).
I added a declaration for word-wrap in this update to your Fiddle, and that seems to have resolved the issue.
I was wondering, and was as of yet, unable to find any answers online, how to accomplish the following.
Let's say I have a string that contains the following:
my_string = "Hello, I am a string."
(in the preview window I see that this is actually formatting in BOLD and ITALIC instead of showing the "strong" and "i" tags)
Now, I would like to make this secure, using the html_escape() (or h()) method/function.
So I'd like to prevent users from inserting any javascript and/or stylesheets, however, I do still want to have the word "Hello" shown in bold, and the word "string" shown in italic.
As far as I can see, the h() method does not take any additional arguments, other than the piece of text itself.
Is there a way to escape only certain html tags, instead of all? Like either White or Black listing tags?
Example of what this might look like, of what I'm trying to say would be:
h(my_string, :except => [:strong, :i]) # => so basically, escape everything, but leave "strong" and "i" tags alone, do not escape these.
Is there any method or way I could accomplish this?
Thanks in advance!
Excluding specific tags is actually pretty hard problem. Especially the script tag can be inserted in very many different ways - detecting them all is very tricky.
If at all possible, don't implement this yourself.
Use the white list plugin or a modified version of it . It's superp!
You can have a look Sanitize as well(Seems better, never tried it though).
Have you considered using RedCloth or BlueCloth instead of actually allowing HTML? These methods provide quite a bit of formatting options and manage parsing for you.
Edit 1: I found this message when browsing around for how to remove HTML using RedCloth, might be of some use. Also, this page shows you how version 2.0.5 allows you to remove HTML. Can't seem to find any newer information, but a forum post found a vulnerability. Hopefully it has been fixed since that was from 2006, but I can't seem to find a RedCloth manual or documentation...
I would second Sanitize for removing HTML tags. It works really well. It removes everything by default and you can specify a whitelist for tags you want to allow.
Preventing XSS attacks is serious business, follow hrnt's and consider that there is probably an order of magnitude more exploits than that possible due to obscure browser quirks. Although html_escape will lock things down pretty tightly, I think it's a mistake to use anything homegrown for this type of thing. You simply need more eyeballs and peer review for any kind of robustness guarantee.
I'm the in the process of evaluating sanitize vs XssTerminate at the moment. I prefer the xss_terminate approach for it's robustness—scrubbing at the model level will be quite reliable in a regular Rails app where all user input goes through ActiveRecord, but Nokogiri and specifically Loofah seem to be a little more peformant, more actively maintained, and definitely more flexible and Ruby-ish.
Update I've just implemented a fork of ActsAsTextiled called ActsAsSanitiled that uses Santize (which has recently been updated to use nokogiri by the way) to guarantee safety and well-formedness of the RedCloth output, all without needing any helpers in your templates.
I have a lot of empty span tags and so on in my code, due to CSS image replacement techniques.
They trigger a HTML validation warning.
Should I care?
I agree with #Gumbo, but having lots of warnings that make no difference can hide the ones that you do need to pay attention to. If you can put something in the span that gets rid of the warning but doesn't break your UI, you should at least think about doing it.
I've made validation part of my workflow because it helps me catch mistakes early. And while I don't consider empty elements to be a problem, it negates some of the value of using a validator if I have to mentally parse a list of warnings each time and decide whether a warning is important or not. So I try to keep my pages both error- and warning-free so that a quick glance at the HTML Validator icon in the Firefox status bar only changes when there is a real problem. To that end I keep empty elements "unempty" by inserting an empty comment.
<span><!-- --></span>
(At least that works with the Tidy validator.)
Now, that being said, I don't think this is at all necessary for many situations. It is perfectly reasonably to think that adding eight extra characters to your code just to avoid a validator warning is ridiculous. But it works for me.
You should consider the behavior of the page for things like screen readers. It is common to actually put a few words describing the image in the tag that are then hidden by the image replacement.
See the CSS Zen Garden where you can see examples like H1 spans with text being replaced in CSS by images.
This will improve the not only the accessibility of your site, but also the search-ability.
An "empty" tag has a very specific definition in HTML:
<span/>
versus
<span></span>
The former is not permitted by the HTML 4.0 Strict DTD, so should be flagged by a validator. The only tags that can use the former syntax are those specifically identified as "EMPTY" in the DTD (eg, <br>).
The second form is valid HTML, and does not get flagged by the W3C validator.
So I have to assume that either (1) your validator is broken, or (2) you are using the tag incorrectly.
A warning is not an error. It’s just a reminder that you should improve something.
I suppose if bandwidth was an issue, those empty tags could be revisited to see if you could get them from appearing alogether.
Eric Meyer also says that empy tags are bad, semantically.
Warning don't mean it's wrong, but say it could, or sometimes should, be better.
In the same way
if("value"==variable)
is better than
if(variable=="value")