Using a "&" in <a></a> - html

Currently, I have:
Start Process
However, I ran this code through the W3 HTML validator (https://validator.w3.org), and it comes up with this:
& did not start a character reference. (& probably should have been escaped as &amp;.)
Is there another proper way to put a "&" into an <a></a> tag, or should I just leave it like how it is?

Handling ampersands (&) in URLs is explained in the Web Design Group's Common Validator Problems page:
Ampersands (&'s) in URLs
Another common error occurs when including a URL which contains an ampersand ("&"):
<!-- This is invalid! --> ...
This example generates an error for "unknown entity section" because the "&" is assumed to begin an entity reference. Browsers often recover safely from this kind of error, but real problems do occur in some cases. In this example, many browsers correctly convert &copy=3 to ©=3, which may cause the link to fail. Since &lang; is the HTML entity for the left-pointing angle bracket, some browsers also convert &lang=en to 〈=en. And one old browser even finds the entity &sect;, converting &section=2 to §ion=2.
To avoid problems with both validators and browsers, always use &amp; in place of & when writing URLs in HTML:
...
Note that replacing & with &amp; is only done when writing the URL in HTML, where "&" is a special character (along with "<" and ">"). When writing the same URL in a plain text email message or in the location bar of your browser, you would use "&" and not "&amp;". With HTML, the browser translates "&amp;" to "&" so the Web server would only see "&" and not "&amp;" in the query string of the request.
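A minimal Python sketch of the round trip described above, using only the standard library. The URL is an assumption pieced together from the parameters named in the quoted text (the foo.cgi path is purely illustrative). Note that html.unescape applies the lenient text-content decoding rules rather than the stricter HTML5 attribute-value rules, so the last call only imitates the legacy browser behaviour described in the quote.

```python
import html

# Assumed example URL; only the parameter names come from the quoted text.
RAW_URL = "foo.cgi?chapter=1&section=2&copy=3&lang=en"


def href_attribute(url: str) -> str:
    """Escape a URL for use inside a double-quoted HTML attribute."""
    return html.escape(url, quote=True)


def lenient_decode(markup: str) -> str:
    """Decode character references the way a lenient HTML text parser would."""
    return html.unescape(markup)


escaped = href_attribute(RAW_URL)
print(escaped)                  # every "&" has become "&amp;"
print(lenient_decode(escaped))  # decodes back to RAW_URL
print(lenient_decode(RAW_URL))  # unescaped input: "&sect" and "&copy" are decoded
```

The escaped form survives decoding intact, while the unescaped form loses its section and copy parameters.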

Related

Golang html.Parse rewriting href query strings to contain &amp;

I have the following code:
package main
import (
	"os"
	"strings"
	"golang.org/x/net/html"
)
func main() {
	myHtmlDocument := `<!DOCTYPE html>
<html>
<head>
</head>
<body>
WTF
</body>
</html>`
	doc, _ := html.Parse(strings.NewReader(myHtmlDocument))
	html.Render(os.Stdout, doc)
}
The html.Render function is producing the following output:
<!DOCTYPE html><html><head>
</head>
<body>
WTF
</body></html>
Why is it rewriting the query string and converting & to &amp; (in-between bar and baz)?
Is there a way to avoid this behavior?
I'm trying to do template transformation, and I don't want it mangling my URLs.
html.Parse wants to generate valid HTML, and the HTML spec states that an ampersand in an href attribute must be encoded.
https://www.w3.org/TR/xhtml1/guidelines.html#C_12
In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., &reg; for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&amp;").
For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user.
In this case, Go is actually making your HTML better and valid.
With that being said, browsers will unescape it, so the resulting URL, if it were to be clicked on, would still be the correct one (without the &amp;, just the &):
console.log(document.querySelector('a').href)
WTF
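The same unescaping can be checked outside a browser. Below is a small sketch using Python's standard html.parser; the HrefCollector class, the decoded_hrefs helper and the example URL used with it are hypothetical names chosen for illustration, not anything from the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the decoded href value of every <a> start tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs holds (name, value) pairs; values arrive already unescaped.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def decoded_hrefs(markup: str) -> list[str]:
    """Parse markup and return the href values as the parser sees them."""
    parser = HrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs


# Hypothetical link; the escaped "&amp;" reaches the application as a plain "&".
print(decoded_hrefs('<a href="http://example.com/?bar=1&amp;baz=2">WTF</a>'))
```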
EDIT: Since people are being pedantic in the comments, I'll note that in HTML5 you are no longer required to escape the ampersand; however, it is still always valid to escape it. On the other hand, there are still situations in which it is invalid not to: essentially, any time the ampersand is followed by alphanumerics and a semicolon that together do not form a named character reference:
An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more ASCII alphanumerics, followed by a U+003B SEMICOLON character (;), where these characters do not match any of the names given in the named character references section.
which means that a link like:
WTF
would be invalid, yet if it were
WTF
it would be valid.
So the parser sticks to a rule that is simpler to implement, and works in all versions of HTML, to make your HTML better and still valid.
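To make the "ambiguous ampersand" rule concrete, here is a rough Python sketch that flags such sequences in a string. It assumes the html5 table in html.entities is an adequate stand-in for the spec's list of named character references; the function name and sample strings are made up for illustration.

```python
import re
from html.entities import html5  # keys look like "amp;" or "copy;"

# "&", one or more ASCII alphanumerics, then ";"
_CANDIDATE = re.compile(r"&([0-9A-Za-z]+);")


def ambiguous_ampersands(text: str) -> list[str]:
    """Return every "&name;" sequence whose name is not a known reference."""
    return [
        match.group(0)
        for match in _CANDIDATE.finditer(text)
        if match.group(1) + ";" not in html5
    ]


print(ambiguous_ampersands("/?a=1&foo;b=2"))      # "&foo;" is flagged
print(ambiguous_ampersands("/?a=1&amp;foo;b=2"))  # nothing flagged once "&" is escaped
```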

Should I use &amp; in href="" or is & enough in HTML4 and HTML5?

Should I use &amp; in href="" or is & enough in HTML4 and HTML5?
Most browsers have no problem with this, but how should it be done?
Call()
Or
Call()
&amp; is correct. The bare ampersand generally works because if a browser sees an & that isn't followed by a valid entity reference, it passes it through, but there's no reason you should count on this behavior in code, especially if you can't guarantee the contents of the thing following the &. In particular, it's forbidden for &foo; to appear where "foo" is an alphanumeric string that isn't a valid named character (this is called an "ambiguous ampersand" in HTML5), and it's obviously undesirable for &foo; to appear where "foo" is a valid named character, because it will be parsed as that character.
See the relevant spec requirements: https://html.spec.whatwg.org/multipage/syntax.html#attributes-2

What other characters beside ampersand (&) should be encoded in HTML href/src attributes?

Is the ampersand the only character that should be encoded in an HTML attribute?
It's well known that this won't pass validation:
Because the ampersand should be &amp;. Here's a direct link to the validation fail.
This guy lists a bunch of characters that should be encoded, but he's wrong. If you encode the first "/" in http:// the href won't work.
In ASP.NET, is there a helper method already built to handle this? Stuff like Server.UrlEncode and HtmlEncode obviously don't work - those are for different purposes.
I can build my own simple extension method (like .ToAttributeView()) which does a simple string replace.
Other than standard URI encoding of the values, & is the only character related to HTML entities that you have to worry about simply because this is the character that begins every HTML entity. Take for example the following URL:
http://query.com/?q=foo&lt=bar&gt=baz
Even though there aren't trailing semi-colons, since &lt; is the entity for < and &gt; is the entity for >, some old browsers would translate this URL to:
http://query.com/?q=foo<=bar>=baz
So you need to specify & as &amp; to prevent this from occurring for links within an HTML parsed document.
The purpose of escaping characters is so that they won't be processed as arguments. So you actually don't want to encode the entire url, just the values you are passing via the querystring. For example:
http://example.com/?parameter1=<ENCODED VALUE>&parameter2=<ENCODED VALUE>
The url you showed is actually a perfectly valid url that will pass validation. However, the browser will interpret the & symbols as a break between parameters in the querystring. So your querystring:
?q=whatever&lang=en
Will actually be translated by the recipient as two parameters:
q = "whatever"
lang = "en"
For your url to work you just need to ensure that your values are being encoded:
?q=<ENCODED VALUE>&lang=<ENCODED VALUE>
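As a sketch of what "encoding the values" means in practice, here is how it could look with Python's urllib.parse. The sample search term is an assumption picked because it contains an ampersand, and build_query is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode


def build_query(params: dict[str, str]) -> str:
    """Percent-encode each value and join the pairs with "&"."""
    return urlencode(params)


# The "&" inside the value becomes %26, so it cannot be mistaken
# for the "&" that separates the two parameters.
query = build_query({"q": "fish & chips", "lang": "en"})
print(query)

# The recipient splits on the separators first, then decodes each value.
print(parse_qs(query))
```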
Edit: The common problems page from the W3C you linked to is talking about edge cases when urls are rendered in html and the & is followed by text that could be interpreted as an entity reference (&copy for example). Here is a test in jsfiddle showing the url:
http://jsfiddle.net/YjPHA/1/
In Chrome and Firefox the links work correctly, but IE renders &copy as ©, breaking the link. I have to admit I've never had a problem with this in the wild (it would only affect those entity references which don't require a semicolon, which is a pretty small subset).
To ensure you're safe from this bug you can HTML-encode any URLs you render to the page and you should be fine. If you're using ASP.NET, the HttpUtility.HtmlEncode method should work just fine.
You do not need HTML escaping here:
According to the HTML5 spec:
http://www.w3.org/TR/html5/tokenization.html#character-reference-in-attribute-value-state
&lang= should be parsed as an unrecognized character reference, and the value of the attribute should be used as is: http://domain.com/search?q=whatever&lang=en
For the reference: added question to HTML5 WG: http://lists.w3.org/Archives/Public/public-html/2011Sep/0163.html
In HTML attribute values, if you want ", & and a non-breaking space as a result, you should (as an author who is clear about intent) have &quot;, &amp; and &nbsp; in the markup.
For " though, you don't have to use &quot; if you use single quotes to encase your attribute values.
For HTML text nodes, in addition to the above, if you want < and > as a result, you should use &lt; and &gt;. (I'd even use these in attribute values too.)
For hfnames and hfvalues (and directory names in the path) for URIs, I'd use JavaScript's encodeURIComponent() (on a utf-8 page when encoding for use on a utf-8 page).
If I understand the question correctly, I believe this is what you want.

Render Apostrophe of Twitter feed is not correct in IE

Check out this page using IE : http://search.twitter.com/search?q=%23testvoorklant
You will find that the apostrophe is rendered as &apos;!
If I want to use these feeds in my website, how should I handle this problem?
Regards
The code on that page is generally broken. When validating the page the W3C validator finds 43 errors and 23 warnings.
Many of the errors occur because the page is declared as HTML 4.01 but contains a lot of XHTML code. Another frequent error is that the URLs contain & characters that are not encoded as &amp;.
The error that most likely causes the problem you see is that the code contains the entity &apos;, which doesn't exist in HTML 4. If a browser displays that unchanged, that is perfectly normal.
If you want to clean up that data, you can just replace those with apostrophes, or &#39; in the cases when it needs to be encoded.
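A possible way to do that clean-up, sketched in Python. The two helper names and the sample feed text are hypothetical; the idea is simply a string replacement applied to the feed text before it goes into your own page.

```python
def apostrophes_as_text(feed_text: str) -> str:
    """Replace the &apos; entity with a literal apostrophe."""
    return feed_text.replace("&apos;", "'")


def apostrophes_as_numeric_reference(feed_text: str) -> str:
    """Replace &apos; with the numeric reference, which HTML 4 also understands."""
    return feed_text.replace("&apos;", "&#39;")


sample = "It&apos;s a test"  # made-up feed text
print(apostrophes_as_text(sample))
print(apostrophes_as_numeric_reference(sample))
```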

Do I encode ampersands in <a href...>?

I'm writing code that automatically generates HTML, and I want it to encode things properly.
Say I'm generating a link to the following URL:
http://www.google.com/search?rls=en&q=stack+overflow
I'm assuming that all attribute values should be HTML-encoded. (Please correct me if I'm wrong.) So that means if I'm putting the above URL into an anchor tag, I should encode the ampersand as &amp;, like this:
<a href="http://www.google.com/search?rls=en&amp;q=stack+overflow">
Is that correct?
Yes, it is. HTML entities are parsed inside HTML attributes, and a stray & would create an ambiguity. That's why you should always write &amp; instead of just & inside all HTML attributes.
That said, only & and quotes need to be encoded. If you have special characters like é in your attribute, you don't need to encode those to satisfy the HTML parser.
It used to be the case that URLs needed special treatment for non-ASCII characters like é: you had to percent-encode them (giving %C3%A9 in this case), as required by RFC 1738. However, RFC 1738 has been superseded by RFC 3986 (URIs, Uniform Resource Identifiers) and RFC 3987 (IRIs, Internationalized Resource Identifiers), on which the WhatWG based its work to define how browsers should behave when they see a URL with non-ASCII characters in it since HTML5. It's therefore now safe to include non-ASCII characters in URLs, percent-encoded or not.
By current official HTML recommendations, the ampersand must be escaped, e.g. as &amp;, in contexts like this. However, browsers do not require it, and the HTML5 CR proposes to make this a rule, so that special rules apply in attribute values. Current HTML5 validators are outdated in this respect (see bug report with comments).
It will remain possible to escape ampersands in attribute values, but apart from validation with current tools, there is no practical need to escape them in href values (and there is a small risk of making mistakes if you start escaping them).
You have two standards concerning URLs in links (<a href).
The first standard is RFC 1866 (HTML 2.0) where in "3.2.1. Data Characters" you can read the characters which need to be escaped when used as the value for an HTML attribute. (Attributes themselves do not allow special characters at all, e.g. <a hr&ef="http://... is not allowed, nor is <a hr&amp;ef="http://....)
Later this has gone into the HTML 4 standard, the characters you need to escape are:
< to &lt;
> to &gt;
& to &amp;
" to &quot;
' to &#39; (&apos; is defined in XML and HTML5, but not in HTML 4)
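For comparison, here is what a stock escaping routine produces for these five characters. This sketch uses Python's html.escape; escape_table is a hypothetical helper name. Note that Python emits a hexadecimal numeric reference for the apostrophe, which is equivalent to the decimal form.

```python
import html

SPECIAL_CHARACTERS = ["<", ">", "&", '"', "'"]


def escape_table() -> dict[str, str]:
    """Map each special character to what html.escape emits for it."""
    return {ch: html.escape(ch, quote=True) for ch in SPECIAL_CHARACTERS}


for raw, escaped in escape_table().items():
    print(raw, "to", escaped)
```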
The other standard is RFC 3986 "Generic URI standard", where URLs are handled (this happens when the browser is about to follow a link because the user clicked on the HTML element).
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
It is important to escape those characters so the client knows whether they represent data or a delimiter.
Example unescaped:
https://example.com/?user=test&password&te&st&goto=https://google.com
Example, a fully legitimate URL
https://example.com/?user=test&password&te%26st&goto=https%3A%2F%2Fgoogle.com
Example fully legitimate URL in the value of an HTML attribute:
https://example.com/?user=test&amp;password&amp;te%26st&amp;goto=https%3A%2F%2Fgoogle.com
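The two layers can be sketched in Python to show the order in which they apply: percent-encode the data first (the URI layer), then HTML-escape the finished URL (the HTML layer). The helper names are hypothetical; the parameter values are the ones from the examples above.

```python
import html
from urllib.parse import quote


def percent_encode(component: str) -> str:
    """URI layer: encode every reserved character inside a data component."""
    return quote(component, safe="")


def build_url() -> str:
    """Join already-encoded components with literal "&" delimiters."""
    parts = [
        "user=test",
        "password",
        percent_encode("te&st"),  # this "&" is data, not a delimiter
        "goto=" + percent_encode("https://google.com"),
    ]
    return "https://example.com/?" + "&".join(parts)


def as_href(url: str) -> str:
    """HTML layer: escape the whole URL for use in an attribute value."""
    return html.escape(url, quote=True)


print(build_url())
print(as_href(build_url()))
```

Decoding happens in the reverse order: the HTML parser turns &amp; back into &, and only then does URL handling interpret %26 as data.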
Also important scenarios:
JavaScript code as a value:
<img src="..." onclick="window.location.href = &quot;https://example.com/?user=test&amp;password&amp;te%26st&amp;goto=https%3A%2F%2Fgoogle.com&quot;;"> (Yes, ;; is correct.)
JSON as a value:
...
Escaped things inside escaped things, double encoding, URL inside URL inside parameter, etc,...
http://x.com/?passwordUrl=http%3A%2F%2Fy.com%2F%3Fuser%3Dtest&password=""123
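For the JavaScript-as-a-value scenario, a rough Python sketch of generating the attribute value could look like this. It assumes that a JSON string literal is acceptable as a JavaScript string literal for this URL, and onclick_attribute is a hypothetical helper name.

```python
import html
import json


def onclick_attribute(url: str) -> str:
    """Wrap url in a JS assignment, then escape the result for an HTML attribute."""
    # Layer 1: JavaScript string literal (json.dumps adds the double quotes).
    script = "window.location.href = " + json.dumps(url) + ";"
    # Layer 2: HTML attribute value (quotes and ampersands become entities).
    return html.escape(script, quote=True)


url = "https://example.com/?user=test&password&te%26st&goto=https%3A%2F%2Fgoogle.com"
print(onclick_attribute(url))
```

The output ends in &quot;; which is where the doubled semicolon comes from: one closes the entity, the other ends the JavaScript statement.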
I am posting a new answer because I find zneak's answer does not have enough examples, does not show HTML and URI handling as different aspects and standards and has some minor things missing.
Yes, you should convert & to &amp;.
This HTML validator tool by W3C is helpful for questions like this. It will tell you the errors and warnings for a particular page.