When are HTML entities left unparsed? - html

I've run into something that I find a bit strange... There is a website with a <table> of audio file links, author, etc.
The link looks like:
This & That
I found it strange that the & character is not written as &amp; in the source file. If I use the Chrome Dev Tools (or Firefox) and change the content to This &amp; That, the string &amp; is visible, instead of just the & character. Why is that?
The table is using TableSorter, which I thought could be doing this under the hood... but even if I open it with JavaScript disabled, the raw & character is still displayed, so that doesn't seem to be the culprit.

When are HTML entities left unparsed?
They aren't. The HTML is parsed as it's being read. This is completely independent of JavaScript.
I found it strange that the & character is not written as &amp; in the source file.
Indeed, it should be, but it works the way it is because browsers handle invalid markup gracefully, and in fact that particular aspect is now encoded in the specification: An ampersand that isn't ambiguous is taken as a literal ampersand.
If I use the Chrome Dev Tools (or Firefox) and change the content to This &amp; That, the string &amp; is visible, instead of just the & character. Why is that?
When you view the DOM through Chrome's devtools and similar, you're seeing the live DOM represented in valid form, not the actual source file, so naturally the browser shows you that & as &amp; (the entity it should have been in the source file).
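The parse-then-serialize behaviour can be sketched in Python. This is only an illustration: the standard-library html module stands in for the browser's parser and serializer here, which is an assumption, not what devtools actually run.

```python
import html

# Both source spellings decode to the same text content.
bare = html.unescape("This & That")
escaped = html.unescape("This &amp; That")
print(bare == escaped)      # True

# Serializing that text back to markup always yields the entity,
# which is the form a DOM inspector displays.
print(html.escape(bare))    # This &amp; That
```

So the source file and the inspector can disagree on spelling while describing the same text.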

This is independent of JavaScript.
&amp; and & have the same HTML number, &#38;, which is also the ASCII decimal code of the symbol. You can take a look here: http://www.ascii.cl/htmlcodes.htm
So I guess there is a problem with your code; they should be seen as the same symbol (&) on your screen.

Related

AEM/sightly : How to remove unicode in a page source code

When looking at the source code of my page, I can see that some special characters such as ' or & have been replaced by their Unicode values. This causes some problems SEO-wise, and I would like to make sure that Unicode symbols are rendered appropriately. Where do I start?
The page is rendered via AEM using Sightly as the templating engine.
You could use a different display context for your title, for example <title>${page.title @ context="html"}</title>, if that works for your application/site.

Telling the HTML editor the "&" mark is not part of the code

I have set up rules in my signature application, which uses an HTML editor on the back end. Now I am trying to write a normal (non-code) phrase containing an ampersand (&), but the HTML editor keeps reading it as HTML and converting it to &amp;
Now how do I tell the HTML that the "&" mark is not part of the HTML code?
Thanks. If you need more explanation, I can post some screenshots.
Now how do I tell the HTML that the "&" mark is not part of the HTML code?
You don't. & is a specific symbol with a specific meaning in HTML. It denotes the start of an entity in the code. In particular, it's one of four which according to the spec must always be specified as entities because the symbols carry specific meaning in HTML code.
To represent an & as text in HTML, you have to use the entity definition:
&amp;
For example:
This &amp; That
renders as:
This & That
If you want an HTML editor to conform to a custom HTML spec that you define, you're going to have to build your own HTML editor.
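For illustration, this is roughly the conversion such an editor performs when it stores typed text as markup. The sketch uses Python's standard html.escape; the helper name text_to_html and the sample signature are hypothetical.

```python
import html

def text_to_html(text: str) -> str:
    """Escape plain text so it can be placed inside an HTML element."""
    # quote=False leaves " and ' untouched, which is enough for element content.
    return html.escape(text, quote=False)

signature = "Sales & Support <team>"
markup = text_to_html(signature)
print(markup)                 # Sales &amp; Support &lt;team&gt;

# Decoding the markup gives back exactly what was typed.
print(html.unescape(markup) == signature)   # True
```

The stored &amp; is what makes the rendered page show a plain &.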

Should I use &amp; in href="" or is & enough in HTML4 and HTML5?

Should I use &amp; in href="" or is & enough in HTML4 and HTML5?
Most browsers have no problem with this, but how should it be done?
Call()
Or
Call()
&amp; is correct. The bare ampersand generally works because if a browser sees an & that isn't followed by a valid entity reference, it passes it through, but there's no reason you should count on this behavior in code, especially if you can't guarantee the contents of the thing following the &. In particular, it's forbidden for &foo; to appear where "foo" is an alphanumeric string that isn't a valid named character reference (this is called an "ambiguous ampersand" in HTML5), and it's obviously undesirable for &foo; to appear where "foo" is a valid named character reference, because it will be parsed as that character.
See the relevant spec requirements: https://html.spec.whatwg.org/multipage/syntax.html#attributes-2

Using a "&" in <a></a>

Currently, I have:
Start Process
However, I ran this code through the W3 HTML validator (https://validator.w3.org), and it comes up with this:
& did not start a character reference. (& probably should have been escaped as &amp;.)
Is there another proper way to put a "&" into an <a></a> tag, or should I just leave it like how it is?
Handling ampersands (&) in URLs is explained in the Web Design Group's Common Validator Problems page:
Ampersands (&'s) in URLs
Another common error occurs when including a URL which contains an ampersand ("&"):
<!-- This is invalid! --> ...
This example generates an error for "unknown entity section" because the "&" is assumed to begin an entity reference. Browsers often recover safely from this kind of error, but real problems do occur in some cases. In this example, many browsers correctly convert &copy=3 to ©=3, which may cause the link to fail. Since &lang; is the HTML entity for the left-pointing angle bracket, some browsers also convert &lang=en to 〈=en. And one old browser even finds the entity &sect;, converting &section=2 to §ion=2.
To avoid problems with both validators and browsers, always use &amp; in place of & when writing URLs in HTML:
...
Note that replacing & with &amp; is only done when writing the URL in HTML, where "&" is a special character (along with "<" and ">"). When writing the same URL in a plain text email message or in the location bar of your browser, you would use "&" and not "&amp;". With HTML, the browser translates "&amp;" to "&" so the Web server would only see "&" and not "&amp;" in the query string of the request.
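The failure mode described in the quote can be reproduced with Python's html.unescape, which applies the HTML5 rules for text content, including the legacy entities recognized without a trailing semicolon. The query string below is hypothetical, modeled on the parameter names in the quote. Note that the HTML5 rules for attribute values are more lenient when a semicolon-less name is followed by = or an alphanumeric character, so current browsers may leave such an href intact.

```python
import html

# Hypothetical URL using the parameter names mentioned in the quote.
raw = "foo.cgi?chapter=1&section=2&copy=3&lang=en"

# &sect and &copy are legacy entities that decode without a semicolon;
# &lang requires the semicolon, so it is left alone under HTML5 rules.
decoded = html.unescape(raw)
print(decoded)   # foo.cgi?chapter=1§ion=2©=3&lang=en

# Writing every & as &amp; makes the decoded result match the intended URL.
safe = raw.replace("&", "&amp;")
print(html.unescape(safe) == raw)   # True
```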

What other characters beside ampersand (&) should be encoded in HTML href/src attributes?

Is the ampersand the only character that should be encoded in an HTML attribute?
It's well known that this won't pass validation:
Because the ampersand should be &amp;. Here's a direct link to the validation fail.
This guy lists a bunch of characters that should be encoded, but he's wrong. If you encode the first "/" in http:// the href won't work.
In ASP.NET, is there a helper method already built to handle this? Stuff like Server.UrlEncode and HtmlEncode obviously don't work - those are for different purposes.
I can build my own simple extension method (like .ToAttributeView()) which does a simple string replace.
Other than standard URI encoding of the values, & is the only character related to HTML entities that you have to worry about simply because this is the character that begins every HTML entity. Take for example the following URL:
http://query.com/?q=foo&lt=bar&gt=baz
Even though there aren't trailing semi-colons, since &lt; is the entity for < and &gt; is the entity for >, some old browsers would translate this URL to:
http://query.com/?q=foo<=bar>=baz
So you need to specify & as &amp; to prevent this from occurring for links within an HTML parsed document.
The purpose of escaping characters is so that they won't be processed as arguments. So you actually don't want to encode the entire url, just the values you are passing via the querystring. For example:
http://example.com/?parameter1=<ENCODED VALUE>&parameter2=<ENCODED VALUE>
The url you showed is actually a perfectly valid url that will pass validation. However, the browser will interpret the & symbols as a break between parameters in the querystring. So your querystring:
?q=whatever&lang=en
Will actually be translated by the recipient as two parameters:
q = "whatever"
lang = "en"
For your url to work you just need to ensure that your values are being encoded:
?q=<ENCODED VALUE>&lang=<ENCODED VALUE>
Edit: The common problems page from the W3C you linked to is talking about edge cases when urls are rendered in html and the & is followed by text that could be interpreted as an entity reference (&copy for example). Here is a test in jsfiddle showing the url:
http://jsfiddle.net/YjPHA/1/
In Chrome and Firefox the links work correctly, but IE renders &copy as ©, breaking the link. I have to admit I've never had a problem with this in the wild (it would only affect those entity references which don't require a semicolon, which is a pretty small subset).
To ensure you're safe from this bug you can HTML encode any of the URLs you render to the page and you should be fine. If you're using ASP.NET the HttpUtility.HtmlEncode method should work just fine.
You do not need HTML escaping here:
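The two layers of encoding discussed in this answer can be sketched as follows. This is a Python analogue rather than the ASP.NET code the question asks about; build_href and the sample parameters are hypothetical, with html.escape playing the role that HttpUtility.HtmlEncode plays in the answer.

```python
import html
from urllib.parse import urlencode

def build_href(base: str, params: dict) -> str:
    """Return a URL that is safe to place inside an href attribute."""
    # Step 1: percent-encode the parameter names and values only.
    url = base + "?" + urlencode(params)
    # Step 2: HTML-escape the finished URL for use in markup.
    return html.escape(url, quote=True)

href = build_href("http://example.com/", {"q": "this & that", "lang": "en"})
print(f'<a href="{href}">search</a>')
# <a href="http://example.com/?q=this+%26+that&amp;lang=en">search</a>
```

The & inside the value becomes %26, while the & that separates parameters becomes &amp; only in the markup, so the server still receives a plain &.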
According to the HTML5 spec:
http://www.w3.org/TR/html5/tokenization.html#character-reference-in-attribute-value-state
&lang= should be parsed as a non-recognized character reference, and the value of the attribute should be used as it is: http://domain.com/search?q=whatever&lang=en
For the reference: added question to HTML5 WG: http://lists.w3.org/Archives/Public/public-html/2011Sep/0163.html
In HTML attribute values, if you want ", & and a non-breaking space as a result, you should (as an author who is clear about intent) have &quot;, &amp; and &nbsp; in the markup.
For " though, you don't have to use &quot; if you use single quotes to encase your attribute values.
For HTML text nodes, in addition to the above, if you want < and > as a result, you should use &lt; and &gt;. (I'd even use these in attribute values too.)
For hfnames and hfvalues (and directory names in the path) for URIs, I'd use JavaScript's encodeURIComponent() (on a utf-8 page when encoding for use on a utf-8 page).
If I understand the question correctly, I believe this is what you want.