HTML escape codes and differences between W3C HTML4 and HTML5 validator - html

The W3C HTML 4 fragment validator accepts this code:
<a href='http://www.sparql.org/sparql?query=+PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0A+SELECT+%3FAgent%0D%0A+FROM+%3Chttp%3A%2F%2Fwww.w3.org%2F2012%2FpyRdfa%2Fextract%3Furi%3Dhttp%3A%2F%2Fontomatica.com%2Fpublic%2Ftest%2F2_infotext.html%3E%0D%0A+WHERE%0D%0A+{%0D%0A+%3Fs+foaf%3AAgent+%3FAgent+.%0D%0A+}%0D%0A&default-graph-uri=&output=text&stylesheet=%2Fxml-to-html.xsl' title='Click here to query the page using SPARQL'><img src='http://www.example.com/public/bin/logo_sparql.png' alt='Run SPARQL query'/></a>
When the same code is in a document submitted to the W3C validator using the HTML5 template, it reports:
Bad value for attribute href on element a: Illegal character in query: not a URL code point.
at the location SPARQL'><img
To my eye there is no char that must be escaped (e.g. a pipe char).
What do I change so the element is accepted by the HTML5 validator?

If you percent-encode { and } as %7B and %7D, the HTML5 validator accepts it:
<a href='http://www.sparql.org/sparql?query=+PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0A+SELECT+%3FAgent%0D%0A+FROM+%3Chttp%3A%2F%2Fwww.w3.org%2F2012%2FpyRdfa%2Fextract%3Furi%3Dhttp%3A%2F%2Fontomatica.com%2Fpublic%2Ftest%2F2_infotext.html%3E%0D%0A+WHERE%0D%0A+%7B%0D%0A+%3Fs+foaf%3AAgent+%3FAgent+.%0D%0A+%7D%0D%0A&default-graph-uri=&output=text&stylesheet=%2Fxml-to-html.xsl' title='Click here to query the page using SPARQL'><img src='http://www.example.com/public/bin/logo_sparql.png' alt='Run SPARQL query'/></a>
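As a rough illustration, the encoded query string can be generated instead of written by hand. This sketch assumes a JavaScript environment that provides the standard URLSearchParams class (modern browsers, Node.js). The shortened query and the variable names are illustrative and not taken from the original page.

```javascript
// Shortened, illustrative SPARQL query (not the full query from the question).
const sparql = "SELECT ?Agent\r\nWHERE\r\n{\r\n?s foaf:Agent ?Agent .\r\n}";

// URLSearchParams applies form-style encoding: spaces become "+",
// and characters such as { } ? : are percent-encoded.
const params = new URLSearchParams({ query: sparql, output: "text" });
const href = "http://www.sparql.org/sparql?" + params.toString();

console.log(href);
```

The resulting string carries %7B and %7D in place of the literal braces, which is the form the HTML5 validator accepts.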

Related

Which part of the HTML specification causes a URL within angle brackets to be parsed as an <http:> element with attributes?

Here is my HTML code.
<!DOCTYPE html>
<html>
<meta charset="UTF-8">
<head>
<title>Bar</title>
<script>
window.onload = function() {
console.log(document.body.innerHTML)
}
</script>
</head>
<body>
<http://www.example.com/foo/bar/baz.html>
</body>
</html>
I save this code in a file named bar.html and then open the page with Firefox or Chrome. This is the output I see in the console.
<http: www.example.com="" foo="" bar="" baz.html="">
</http:>
Now I understand that my code was incorrect because it had a URL enclosed within < and >.
I want to understand exactly how the browser parsed it as an http: tag, with parts of the URL interpreted as HTML attributes.
Is there some part of the HTML specification that leads to this kind of behavior? If so, could you please quote such parts of the HTML specification?
Everything you need to know is in section 8.2.4 (Tokenization) of the HTML5 specification. In particular:
1. Up to <http:, the parser is in the tag name state. The element's tag name is http:, including the colon, as evidenced by the </http:> end tag.
2. The first / switches the parser to the self-closing start tag state.
3. The second / causes a parse error, as the specification describes for the self-closing start tag state, and switches the parser to the before attribute name state.
4. The parser enters the attribute name state and continues consuming the URL. This is what causes parts of the path to be treated as attribute names.
5. When the parser reaches the next /, it switches back to the self-closing start tag state and repeats steps 2 and 3, except that this time it is not a second / but a different character (one that isn't >) that causes the parse error and switches the parser back to the before attribute name state.
6. Once the parser finally sees a >, it closes the start tag, emits it, and proceeds as normal.
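The steps above can be sketched as a toy state machine. This is a simplified model of only four tokenizer states, written for illustration. It is not the specification's full algorithm and not any browser's implementation: it ignores attribute values, character references, and end tags, and the function name is hypothetical.

```javascript
// Toy model of four tokenizer states; NOT a full HTML tokenizer.
function tokenizeStartTag(input) {
  if (input[0] !== "<") throw new Error("expected '<'");
  const token = { tagName: "", attributes: [] };
  let state = "tagName";
  let current = ""; // attribute name currently being collected
  let i = 1;
  const finishAttribute = () => {
    if (current !== "") { token.attributes.push(current); current = ""; }
  };
  while (i < input.length) {
    const c = input[i];
    switch (state) {
      case "tagName":
        if (c === "/") state = "selfClosingStartTag";
        else if (c === ">") return token;
        else if (c === " ") state = "beforeAttributeName";
        else token.tagName += c.toLowerCase();
        i++;
        break;
      case "selfClosingStartTag":
        if (c === ">") return token;
        // Anything else is a parse error: reconsume the same character
        // in the before attribute name state (note: no i++ here).
        state = "beforeAttributeName";
        break;
      case "beforeAttributeName":
        if (c === " ") i++;
        else if (c === "/") { state = "selfClosingStartTag"; i++; }
        else if (c === ">") return token;
        else state = "attributeName"; // reconsume as the start of a name
        break;
      case "attributeName":
        if (c === "/") { finishAttribute(); state = "selfClosingStartTag"; i++; }
        else if (c === ">") { finishAttribute(); return token; }
        else if (c === " ") { finishAttribute(); state = "beforeAttributeName"; i++; }
        else { current += c.toLowerCase(); i++; }
        break;
    }
  }
  return token; // input ended before '>'
}

const result = tokenizeStartTag("<http://www.example.com/foo/bar/baz.html>");
console.log(result.tagName);    // http:
console.log(result.attributes); // [ 'www.example.com', 'foo', 'bar', 'baz.html' ]
```

The tag name and attribute names match what the browser console printed in the question.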
As far as I know, before HTML5 the specification defined the HTML language but not how to parse invalid markup. Error recovery was left to browser implementers, who do their best to parse even HTML that contains errors.

W3C validation error for use of <html> in quoted string: 'document type does not allow element "html" here'

I receive the following error using HTML in a quoted string:
Error Line 63, Column 39: document type does not allow element "html" here
mywindow.document.write("<html><head><title>mydiv</title>");
The element named above was found in a context where it is not allowed. This could mean that you have incorrectly nested elements -- such as a "style" element in the "body" section instead of inside "head" -- or two elements that overlap (which is not allowed).
Am I doing something wrong or is the W3C validator giving bad results?
You wouldn't get that error in an HTML document.
Presumably, therefore, you are writing XHTML.
My first piece of advice is: Don't. XHTML is usually more trouble than it is worth. Use HTML 5 instead and that won't be invalid.
If you need to use XHTML then see the specification:
Differences from HTML: Script and Style elements
In XHTML, the script and style elements are declared as having #PCDATA
content. As a result, < and & will be treated as the start of markup,
and entities such as &lt; and &amp; will be recognized as entity
references by the XML processor to < and & respectively. Wrapping the
content of the script or style element within a CDATA marked section
avoids the expansion of these entities.
<script type="text/javascript">
<![CDATA[
... unescaped script content ...
]]>
</script>
HTML compatibility guidelines: Embedded Style Sheets and Scripts:
Use external style sheets if your style sheet uses < or & or ]]> or
--. Use external scripts if your script uses < or & or ]]> or --.
Note that XML parsers are permitted to silently remove the contents of
comments. Therefore, the historical practice of "hiding" scripts and
style sheets within "comments" to make the documents backward
compatible is likely to not work as expected in XML-based user agents.

Is it valid to escape html in a href attribute?

Assuming I have the following link:
<a href='http://google.com/bla'>http://google.com/bla</a>
Is this one also valid?
<a href='http:&#x2F;&#x2F;google.com&#x2F;bla'>http:&#x2F;&#x2F;google.com&#x2F;bla</a>
It works in Firefox, but I'm not sure if this is standardized behavior. I hope the question isn't super dumb!
Yes, it is perfectly valid to do that. In fact, the ampersand (&) character must be escaped as &amp; in order to be valid HTML, even inside the href attribute (and all attributes for that matter).
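A minimal sketch of that escaping, assuming the attribute value is assembled in JavaScript. The helper name escapeAttribute and the query string on the example URL are illustrative additions, not part of the original question.

```javascript
// Hypothetical helper: escapes the characters that are significant
// inside a quoted HTML attribute value.
function escapeAttribute(value) {
  return value
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Example URL with an assumed query string, to show the ampersand case.
const url = "http://google.com/bla?a=1&b=2";
const html = "<a href='" + escapeAttribute(url) + "'>link</a>";
console.log(html); // <a href='http://google.com/bla?a=1&amp;b=2'>link</a>
```

The browser decodes the character references when it reads the attribute, so the link still points at the original URL.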

Assigning Itemscope Attribute a Value in Rich Snippet

So we are using some rich snippets and they use the html5 markup scheme.
Our problem is the itemscope attribute doesn't have a value.
<div itemscope itemtype="http://schema.org/LocalBusiness">
This causes our old product's HTML validation to fail because it thinks it's an empty tag. Do Google and the markup rules work the same if you assign it a value of 1, like so?
<div itemscope="1" itemtype="http://schema.org/LocalBusiness">
It's a workaround for now until we can properly update our validation methods, but that is a longer-term project.
So basically, is there a proper syntax that keeps this valid for Google's Rich Snippet rules, HTML5, and validation engines that predate HTML5?
(This answer is basically copied from Peter Murray, specifically these two comments.)
The HTML5 spec allows for boolean attributes with a value of an empty string or the attribute name:
If the attribute is present, its value must either be the empty string or a value that is an ASCII case-insensitive match for the attribute's canonical name, with no leading or trailing whitespace.
So either this:
<div itemscope="" itemtype="http://schema.org/LocalBusiness">
or this:
<div itemscope="itemscope" itemtype="http://schema.org/LocalBusiness">
is valid HTML5.
To be sure that Google recognizes itemscope="itemscope" correctly, he (Peter Murray) created an example page and ran it through Google's rich snippet validator. From the results, you can see that Google picked up the data (an Event item) correctly.
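The rule quoted above from the HTML5 spec can be sketched as a small check. The function names are hypothetical, and this only models the quoted sentence. It is not a validator.

```javascript
// Lowercase only A-Z, which is what an "ASCII case-insensitive" match compares on.
function asciiLower(s) {
  return s.replace(/[A-Z]/g, (c) => c.toLowerCase());
}

// Hypothetical checker: a boolean attribute's value must be the empty string
// or match the attribute's own name, with no surrounding whitespace.
function isValidBooleanAttributeValue(name, value) {
  return value === "" || asciiLower(value) === asciiLower(name);
}

console.log(isValidBooleanAttributeValue("itemscope", ""));          // true
console.log(isValidBooleanAttributeValue("itemscope", "itemscope")); // true
console.log(isValidBooleanAttributeValue("itemscope", "1"));         // false
```

Under this reading, itemscope="1" is not a conforming value, while itemscope="" and itemscope="itemscope" are.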

Google +1 code validation error: missing attribute

My code: <g:plusone annotation="inline"></g:plusone>
I get this error:
there is no attribute "annotation"
element "g:plusone" undefined
Why?
You have two choices:
1. Change to using this style of markup: <div class="g-plusone" data-size="tall" ... ></div>
2. Add the XML namespace for Google's <g: syntax to the <html tag of the document. Now if only Google would share where that XML namespace is located... (So really, just try option 1.)
There is no attribute called "annotation" for any element recommended in the W3C standards. Google probably uses it for some backend processing. If you want the +1 element in this form, your code cannot be W3C standards-compliant.
See this thread: "So what if custom HTML attributes aren't valid XHTML?"