HTML5 validation error a href (NFC) - html

I escaped all the special characters in the following URL, but the W3C validator still reports an error.
I checked all the NFC tutorials but cannot find where the error is. Any ideas?
URL
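<a href="http://www.example.de/index.php?cnid=efcb9a458fb823ba877ef53b7162598f&amp;ldtype=grid&amp;cl=alist&amp;tpl=&amp;fnc=executefilter&amp;fname=&amp;attrfilter[3a5d1ca314a5205fa7b7b3baa5d2f94e][2f143d22ce421269b5c7d01a160f6541]=2f143d22ce421269b5c7d01a160f6541">Asche</a>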
w3c-Error
Line 618, Column 441: Bad value http://www.example.de/index.php?cnid=efcb9a458fb823ba877ef53b7162598f&ldtype=grid&cl=alist&tpl=&fnc=executefilter&fname=&attrfilter[3a5d1ca314a5205fa7b7b3baa5d2f94e][2f143d22ce421269b5c7d01a160f6541]=2f143d22ce421269b5c7d01a160f6541 for attribute href on element a: Illegal character in query component.
…21269b5c7d01a160f6541]=2f143d22ce421269b5c7d01a160f6541">Asche</a></li>
Syntax of IRI reference
Any URL. For example: /hello, #canvas, or http://example.org/. Characters should be represented in NFC and spaces should be escaped as %20.

The characters [ and ] need to be %-encoded in a URL, as %5B and %5D, according to STD 66 (RFC 3986). Its Appendix A contains a syntax summary showing that the brackets are “gen-delims” characters, which are not allowed in a query part except when %-encoded.
You should have posted an HTML document, since that’s what validators work on. The following test document (which validates) contains the URL you mention, properly encoded:
<!doctype html>
<meta charset=utf-8>
<title></title>
<a href=
"http://www.example.de/index.php?cnid=efcb9a458fb823ba877ef53b7162598f&ldtype=grid&cl=alist&tpl=&fnc=executefilter&fname=&attrfilter%5B3a5d1ca314a5205fa7b7b3baa5d2f94e%5D%5B2f143d22ce421269b5c7d01a160f6541%5D=2f143d22ce421269b5c7d01a160f6541">foo</a>
Quite apart from this, the URL does not work; it causes a “Multiple Choices” response, which is rather odd (such a message should be issued when the server is doing some content negotiation that does not find an acceptable alternative, and a list of alternatives should be presented; but here it’s more or less a “Not Found” situation).
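As a rough illustration of the bracket encoding described in this answer, the following Python sketch percent-encodes the query part of a URL. The helper name encode_query is made up for this example, the hash values are shortened stand-ins for the ones in the question, and keeping =, & and % unescaped is an assumption that suits query strings of this shape; it is not a general-purpose URL normalizer.

```python
from urllib.parse import quote


def encode_query(url: str) -> str:
    """Percent-encode characters that may not appear raw in the query part."""
    base, sep, query = url.partition("?")
    # '=' and '&' delimit parameters and '%' starts an existing escape, so they
    # are kept as-is; everything else outside the unreserved set (including
    # '[' and ']') is percent-encoded.
    return base + sep + quote(query, safe="=&%")


# Shortened, hypothetical version of the URL from the question.
print(encode_query("http://www.example.de/index.php?cl=alist&attrfilter[3a5d][2f14]=2f14"))
```

When the result is written into an href attribute, the remaining & characters can additionally be written as &amp;, which is always valid.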

Related

Golang html.Parse rewriting href query strings to contain &amp;

I have the following code:
package main
import (
"os"
"strings"
"golang.org/x/net/html"
)
func main() {
myHtmlDocument := `<!DOCTYPE html>
<html>
<head>
</head>
<body>
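<a href="http://www.example.com/input?foo=bar&baz=quux">WTF</a>
</body>
</html>`
doc, _ := html.Parse(strings.NewReader(myHtmlDocument))
html.Render(os.Stdout, doc)
}
The html.Render function is producing the following output:
<!DOCTYPE html><html><head>
</head>
<body>
<a href="http://www.example.com/input?foo=bar&amp;baz=quux">WTF</a>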
</body></html>
Why is it rewriting the query string and converting & to &amp; (in-between bar and baz)?
Is there a way to avoid this behavior?
I'm trying to do template transformation, and I don't want it mangling my URLs.
html.Render wants to generate valid HTML, and the XHTML 1.0 guidelines state that an ampersand in an href attribute must be encoded:
https://www.w3.org/TR/xhtml1/guidelines.html#C_12
In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., &reg; for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&amp;").
For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user.
In this case, Go is actually making your HTML better and valid.
With that being said, browsers will unescape it, so the resulting URL, if it were clicked on, would still be the correct one (without the &amp;, just the &):
console.log(document.querySelector('a').href)
<a href="http://www.example.com/input?foo=bar&amp;baz=quux">WTF</a>
EDIT: Since people are being pedantic in the comments, I'll note that in HTML5 you are no longer required to escape the ampersand; however, it is still always valid to escape it. On the other hand, there are still situations in which it is invalid not to: essentially any time the ampersand is followed by alphanumerics and a semicolon that do not form a named character reference:
An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more ASCII alphanumerics, followed by a U+003B SEMICOLON character (;), where these characters do not match any of the names given in the named character references section.
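which means that a link like:
<a href="http://www.example.com/input?foo=bar&baz;=quux">WTF</a>
would be invalid, yet if it were
<a href="http://www.example.com/input?foo=bar&amp;baz;=quux">WTF</a>
it would be valid.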
So the parser sticks to a rule that is simpler to implement, and works in all versions of HTML, to make your HTML better and still valid.
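To check the claim that an escaped ampersand still yields the intended URL, here is a small Python sketch using the standard library's HTMLParser (not the Go package from the question). The class name HrefCollector, the helper hrefs_in and the example URL are hypothetical; the point is only that a parser hands back the decoded attribute value.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the decoded href values of all <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; values arrive already unescaped.
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v is not None)


def hrefs_in(markup: str) -> list:
    parser = HrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs


# The markup contains &amp;, the collected URL contains a plain &.
print(hrefs_in('<a href="http://www.example.com/input?foo=bar&amp;baz=quux">WTF</a>'))
```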

HTML Validation

My website is almost ready, but I notice that it looks different in Firefox and IE, while it looks fine in Chrome. I ran it through the HTML validator and got 2 errors and 2 warnings. These are the details:
Line 2, Column 13: there is no attribute "XMLNS"
<html xmlns="http://www.w3.org/1999/xhtml">
You have used the attribute named above in your document, but the document type you are using does not support that attribute for this element. This error is often caused by incorrect use of the "Strict" document type with a document that uses frames (e.g. you must use the "Transitional" document type to get the "target" attribute), or by using vendor proprietary extensions such as "marginheight" (this is usually fixed by using CSS to achieve the desired effect instead).
This error may also result if the element itself is not supported in the document type you are using, as an undefined element will have no supported attributes; in this case, see the element-undefined error message for further information.
How to fix: check the spelling and case of the element and attribute (remember XHTML is all lower-case), and/or check that they are both allowed in the chosen document type, and/or use CSS instead of this attribute. If you received this error when using the <embed> element to incorporate flash media in a Web page, see the FAQ item on valid flash.
Line 4, Column 76: NET-enabling start-tag requires SHORTTAG YES
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
For the current document, the validator interprets strings like <FOO /> according to legacy rules that break the expectations of most authors and thus cause confusing warnings and error messages from the validator. This interpretation is triggered by HTML 4 documents or other SGML-based HTML documents. To avoid the messages, simply remove the "/" character in such contexts. NB: If you expect <FOO /> to be interpreted as an XML-compatible "self-closing" tag, then you need to use XHTML or HTML5.
This warning and related errors may also be caused by an unquoted attribute value containing one or more "/". Example: <a href=http://w3c.org>W3C</a>. In such cases, the solution is to put quotation marks around the value.
Line 4, Column 77: character data is not allowed here
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
You have used character data somewhere it is not permitted to appear. Mistakes that can cause this error include:
putting text directly in the body of the document without wrapping it in a container element (such as a <p>aragraph</p>), or forgetting to quote an attribute value (where characters such as "%" and "/" are common, but cannot appear without surrounding quotes), or using XHTML-style self-closing tags (such as <meta ... />) in HTML 4.01 or earlier. To fix, remove the extra slash ('/') character. For more information about the reasons for this, see Empty elements in SGML, HTML, XML, and XHTML.
Line 44, Column 12: NET-enabling start-tag requires SHORTTAG YES
<br/>
For the current document, the validator interprets strings like <FOO /> according to legacy rules that break the expectations of most authors and thus cause confusing warnings and error messages from the validator. This interpretation is triggered by HTML 4 documents or other SGML-based HTML documents. To avoid the messages, simply remove the "/" character in such contexts. NB: If you expect <FOO /> to be interpreted as an XML-compatible "self-closing" tag, then you need to use XHTML or HTML5.
This warning and related errors may also be caused by an unquoted attribute value containing one or more "/". Example: <a href=http://w3c.org>W3C</a>. In such cases, the solution is to put quotation marks around the value.
As for Line 2, Column 13, I have added this to header.php:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
However, the error is still there and I really do not know how to fix it. On top of that, I am not a programmer and my HTML is very poor.
Thank you.
You are trying to validate a document with some XHTML features as HTML 4. You are not showing the first line of the HTML document, and this would be all-important. In general, you should show a complete HTML document that reproduces the issue. But in this case, the apparent reason is a wrong DOCTYPE. You should replace the existing HTML 4 DOCTYPE with one that declares some version of XHTML 1.0, if XHTML 1.0 is what you are trying to use.

Why is "&reg" being rendered as "®" without the bounding semicolon

I've been running into a problem that was revealed through our Google adwords-driven marketing campaign. One of the standard parameters used is "region". When a user searches and clicks on a sponsored link, Google generates a long URL to track the click and sends a bunch of stuff along in the referrer. We capture this for our records, and we've noticed that the "Region" parameter is coming through incorrectly. What should be
http://ravercats.com/meow?foo=bar&region=catnip
is instead coming through as:
http://ravercats.com/meow?foo=bar®ion=catnip
I've verified that this occurs in all browsers. It's my understanding that HTML entity syntax is defined as follows:
&VALUE;
where the leading boundary is the ampersand and the closing boundary is the semicolon. Seems straightforward enough. The problem is that this isn't being respected for the ® entity, and it's wreaking all kinds of havoc throughout our system.
Does anyone know why this is occurring? Is it a bug in the DTD? (I'm looking for the current HTML DTD to see if I can make sense of it) I'm trying to figure out what would be common across browsers to make this happen, thus my looking for the DTD.
Here is a proof you can use. Take this code, make an HTML file out of it and render it in a browser:
<html>
http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct
</html>
EDIT: To everyone who's suggesting that I need to escape the entire URL, the example URLs above are exactly that, examples. The real URL is coming directly from Google and I have no control over how it is constructed. These suggestions, while valid, don't answer the question: "Why is this happening".
Although valid character references always have a semicolon at the end, some invalid named character references without a semicolon are, for backward compatibility reasons, recognized by modern browsers' HTML parsers.
Either you know what that entire list is, or you follow the HTML5 rules for when & is valid without being escaped (e.g. when followed by a space) or otherwise always escape & as &amp; whenever in doubt.
For reference, the full list of named character references that are recognized without a semicolon is:
AElig, AMP, Aacute, Acirc, Agrave, Aring, Atilde, Auml, COPY, Ccedil,
ETH, Eacute, Ecirc, Egrave, Euml, GT, Iacute, Icirc, Igrave, Iuml, LT,
Ntilde, Oacute, Ocirc, Ograve, Oslash, Otilde, Ouml, QUOT, REG, THORN,
Uacute, Ucirc, Ugrave, Uuml, Yacute, aacute, acirc, acute, aelig,
agrave, amp, aring, atilde, auml, brvbar, ccedil, cedil, cent, copy,
curren, deg, divide, eacute, ecirc, egrave, eth, euml, frac12, frac14,
frac34, gt, iacute, icirc, iexcl, igrave, iquest, iuml, laquo, lt,
macr, micro, middot, nbsp, not, ntilde, oacute, ocirc, ograve, ordf,
ordm, oslash, otilde, ouml, para, plusmn, pound, quot, raquo, reg,
sect, shy, sup1, sup2, sup3, szlig, thorn, times, uacute, ucirc,
ugrave, uml, uuml, yacute, yen, yuml
However, it should be noted that, in an attribute value only, conforming HTML5 parsers do not process a named character reference from the above list as such when it is written without the semicolon and the next character is a = or an ASCII alphanumeric character.
For the full list of named character references with or without ending semicolons, see here.
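For anyone who would rather query such a list than memorize it, Python's standard library ships a table of HTML5 named character references. The sketch below assumes that this table mirrors the specification's list; the variable name legacy_names is just for illustration.

```python
from html.entities import html5

# The table holds names with a trailing ';' and, for the legacy ones, a second
# entry without it. The entries without ';' are the semicolon-less names.
legacy_names = sorted(name for name in html5 if not name.endswith(";"))

print(len(legacy_names))
print("reg" in legacy_names, "region" in legacy_names)
```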
This is a very messy business and depends on context (text content vs. attribute value).
Formally, by HTML specs up to and including HTML 4.01, an entity reference may appear without trailing semicolon, if the next character is not a name character. So e.g. &region= would be syntactically correct but undefined, as entity region has not been defined. XHTML makes the trailing semicolon required.
Browsers have traditionally played by other rules, though. Due to the common syntax of query URLs, they parse e.g. href="http://ravercats.com/meow?foo=bar&region=catnip" so that &region is not treated as an entity reference but as just text data. And authors mostly used such constructs, even though they are formally incorrect.
Contrary to what the question seems to be saying, href="http://ravercats.com/meow?foo=bar&region=catnip" actually works well. Problems arise when the string is not in an attribute value but inside text content, which is rather uncommon: we don’t normally write URLs in text. In text, &region= gets processed so that &reg is recognized as an entity reference (for “®”) and the rest is just character data. Such odd behavior is being made official in HTML5 CR, where clause 8.2.4.69 Tokenizing character references describes the “double standard”:
If the character reference is being consumed as part of an attribute,
and the last character matched is not a ";" (U+003B) character, and
the next character is either a "=" (U+003D) character or in the range
ASCII digits, uppercase ASCII letters, or lowercase ASCII letters,
then, for historical reasons, all the characters that were matched
after the U+0026 AMPERSAND character (&) must be unconsumed, and
nothing is returned.
Thus, in an attribute value, even &reg= would not be treated as containing a character reference, and still less &region=. (But reg_test= is a different case, due to the underscore character.)
In text content, other rules apply. The construct &region= causes then a parse error (by HTML5 CR rules), but with well-defined error handling: &reg is recognized as a character reference.
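The difference between the two contexts can be sketched in Python. html.unescape applies the text-content behavior; the helper unescape_attribute is a hypothetical, simplified rendering of the attribute rule quoted above (named references only, numeric references are left untouched), not a full tokenizer.

```python
import html
import re
from html.entities import html5

# '&' followed by an ASCII alphanumeric name and an optional ';'.
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def unescape_attribute(value: str) -> str:
    """Decode named references roughly the way the quoted rule treats attributes."""

    def replace(match):
        name, semicolon = match.group(1), match.group(2)
        if semicolon:
            # Properly terminated reference: decode it if the name is known.
            return html5.get(name + ";", match.group(0))
        if name in html5:
            # Legacy name without ';'. The regex already consumed every ASCII
            # alphanumeric, so only a following '=' can block decoding here.
            if value[match.end():match.end() + 1] != "=":
                return html5[name]
        return match.group(0)

    return _NAMED_REF.sub(replace, value)


url = "http://ravercats.com/meow?foo=bar&region=catnip"
print(html.unescape(url))       # text-content rules: &reg is decoded
print(unescape_attribute(url))  # attribute rules: the URL is left alone
```

With this sketch, &reg_test=fail is still decoded inside an attribute, because the underscore is neither = nor an ASCII alphanumeric.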
Maybe try replacing your & with &amp;? Ampersands must be escaped in HTML as well, because they are reserved for introducing entities.
1: The following markup is invalid in the first place (use the W3C Markup Validation Service to verify):
In the above example, the & character should be encoded as &amp;, like so:
2: Browsers are tolerant; they try to make sense of broken HTML. In your case, every string that could be read as an HTML entity is converted to the corresponding character.
Here is a simple workaround, although it may not work in all instances: move the region parameter to the front, so that it follows the ? rather than an &.
So from this:
http://ravercats.com/meow?status=Online&region=Atlantis
To This:
http://ravercats.com/meow?region=Atlantis&status=Online
Because the &reg as we know triggers the special character ®
Caveat: If you have no control over the order of your URL query string parameters then you'll have to change your variable name to something else.
Escape your output!
Simply put, you need to encode the URL for HTML to represent it accurately (ideally with a template engine's variable-escaping function, but barring that, with htmlspecialchars($url) or htmlentities($url) in PHP).
See your test case and then the correctly encoded html at this jsfiddle:
http://jsfiddle.net/tchalvakspam/Fp3W6/
Inactive code here:
<div>
Unescaped:
<br>
http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct
</div>
<div>
Correctly escaped:
<br>
http://foo.com/bar?foo=bar&amp;region=US&amp;register=lowpass&amp;reg_test=fail&amp;trademark=correct
</div>
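For readers not using PHP, the same "escape your output" step might look like this in Python. html.escape plays roughly the role of htmlspecialchars here; the function name render_link is made up for the example.

```python
import html


def render_link(url: str, text: str) -> str:
    """Build an <a> element with both the URL and the link text HTML-escaped."""
    # quote=True also escapes quotation marks, which matters inside an attribute.
    return '<a href="{}">{}</a>'.format(
        html.escape(url, quote=True),
        html.escape(text, quote=False),
    )


print(render_link("http://foo.com/bar?foo=bar&region=US&reg_test=fail", "test link"))
```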
It seems to me that what you have received from Google is not an actual URL but a variable that refers to a URL (query string). That's why it's being parsed as the registered trademark sign when rendered.
I would say you ought to URL-encode it and decode it whenever you process it, like any other variable containing special entities.
To prevent this from happening you should encode urls, which replaces characters like the ampersand with a % and a hexadecimal number behind it in the url.

W3 validator errors in Magento 1.7.0.2 header

I am validating my store with the W3C validator and I am getting some errors on this line:
<link href='http://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,700italic,300,700,400&subset=latin,latin-ext' rel='stylesheet' type='text/css'/>
The problem seems to lie in the &subset= part.
The w3 validator returns:
An entity reference was found in the document, but there is no
reference by that name defined. Often this is caused by misspelling
the reference name, unencoded ampersands, or by leaving off the
trailing semicolon (;). The most common cause of this error is
unencoded ampersands in URLs as described by the WDG in "Ampersands in
URLs".
Entity references start with an ampersand (&) and end with a semicolon
(;). If you want to use a literal ampersand in your document you must
encode it as "&amp;" (even inside URLs!). Be careful to end entity
references with a semicolon or your entity reference may get
interpreted in connection with the following text. Also keep in mind
that named entity references are case-sensitive; &Aelig; and &aelig;
are different characters.
If this error appears in some markup generated by PHP's session
handling code, this article has explanations and solutions to your
problem.
Note that in most documents, errors related to entity references will
trigger up to 5 separate messages from the Validator. Usually these
will all disappear when the original problem is fixed.
I have four questions:
Where is this line generated? I can't seem to find it.
What is the correct syntax for it?
Is the meta name="keywords" tag obsolete? I think so, but even the newest Magento version still auto-generates it. Can it be removed?
I would like to add custom SEO markup to the header. What would be the correct location for it, since the header is assembled from multiple locations?
Thank you in advance.
You need to encode & in links. See for example the question: Do I encode ampersands in <a href...>?
So & should become &amp; (as long as you don't want to reference a different entity).

What other characters beside ampersand (&) should be encoded in HTML href/src attributes?

Is the ampersand the only character that should be encoded in an HTML attribute?
It's well known that this won't pass validation:
Because the ampersand should be &amp;. Here's a direct link to the validation fail.
This guy lists a bunch of characters that should be encoded, but he's wrong. If you encode the first "/" in http:// the href won't work.
In ASP.NET, is there a helper method already built to handle this? Stuff like Server.UrlEncode and HtmlEncode obviously don't work - those are for different purposes.
I can build my own simple extension method (like .ToAttributeView()) which does a simple string replace.
Other than standard URI encoding of the values, & is the only character related to HTML entities that you have to worry about simply because this is the character that begins every HTML entity. Take for example the following URL:
http://query.com/?q=foo&lt=bar&gt=baz
Even though there aren't trailing semicolons, since &lt; is the entity for < and &gt; is the entity for >, some old browsers would translate this URL to:
http://query.com/?q=foo<=bar>=baz
So you need to specify & as &amp; to prevent this from occurring for links within an HTML parsed document.
The purpose of escaping characters is so that they won't be processed as arguments. So you actually don't want to encode the entire url, just the values you are passing via the querystring. For example:
http://example.com/?parameter1=<ENCODED VALUE>&parameter2=<ENCODED VALUE>
The url you showed is actually a perfectly valid url that will pass validation. However, the browser will interpret the & symbols as a break between parameters in the querystring. So your querystring:
?q=whatever&lang=en
Will actually be translated by the recipient as two parameters:
q = "whatever"
lang = "en"
For your url to work you just need to ensure that your values are being encoded:
?q=<ENCODED VALUE>&lang=<ENCODED VALUE>
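The two layers (URL-encoding the values, then HTML-escaping the finished URL for the page) could be combined as in this Python sketch. The function name build_href and the parameter values are hypothetical; urlencode percent-encodes each name and value, and html.escape then turns the separating & into &amp; for use in an attribute.

```python
import html
from urllib.parse import urlencode


def build_href(base: str, params: dict) -> str:
    """Return a URL whose values are URL-encoded and which is safe inside href."""
    # Step 1: percent-encode each name and value, joined with '&'.
    query = urlencode(params)
    # Step 2: escape the whole URL for the HTML attribute.
    return html.escape(base + "?" + query, quote=True)


print(build_href("http://example.com/", {"parameter1": "a&b", "parameter2": "x y"}))
```

The browser undoes step 2 when it parses the attribute, and the server undoes step 1 when it parses the query string, so the recipient sees the original values.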
Edit: The common problems page from the W3C you linked to is talking about edge cases when urls are rendered in html and the & is followed by text that could be interpreted as an entity reference (&copy for example). Here is a test in jsfiddle showing the url:
http://jsfiddle.net/YjPHA/1/
In Chrome and Firefox the link works correctly, but IE renders &copy as ©, breaking the link. I have to admit I've never had a problem with this in the wild (it would only affect those entity references which don't require a semicolon, which is a pretty small subset).
To ensure you're safe from this bug you can HTML encode any of your URLS you render to the page and you should be fine. If you're using ASP.NET the HttpUtility.HtmlEncode method should work just fine.
You do not need HTML escaping here:
According to the HTML5 spec:
http://www.w3.org/TR/html5/tokenization.html#character-reference-in-attribute-value-state
&lang= should be parsed as non-recognized character reference and value of the attribute should be used as it is: http://domain.com/search?q=whatever&lang=en
For the reference: added question to HTML5 WG: http://lists.w3.org/Archives/Public/public-html/2011Sep/0163.html
In HTML attribute values, if you want ", & and a non-breaking space as a result, you should (as an author who is clear about intent) have &quot;, &amp; and &nbsp; in the markup.
For " though, you don't have to use &quot; if you use single quotes to enclose your attribute values.
For HTML text nodes, in addition to the above, if you want < and > as a result, you should use &lt; and &gt;. (I'd even use these in attribute values too.)
For hfnames and hfvalues (and directory names in the path) for URIs, I'd use JavaScript's encodeURIComponent() (on a UTF-8 page when encoding for use on a UTF-8 page).
If I understand the question correctly, I believe this is what you want.