HTML5: which is better - using a character entity vs using a character directly?


I've recently noticed a lot of high profile sites using characters directly in their source, eg:
<q>“Hi there”</q>
Rather than:
<q>&ldquo;Hi there&rdquo;</q>
Which of these is preferred? I've always used entities in the past, but using the character directly seems more readable, and would seem to be OK in a Unicode document.

If the encoding is UTF-8, the normal characters will work fine, and there is no reason not to use them. Browsers that don't support UTF-8 will have lots of other issues while displaying a modern webpage, so don't worry about that.
So it is easier and more readable to use the characters and I would prefer to do so.
It also saves a couple of bytes which is good, although there is much more to gain by using compression and minification.
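To put a number on the "couple of bytes", here is a small sketch that measures the quoted text from the question both ways. The variable names are just for illustration.

```python
# Compare the UTF-8 size of the literal characters with the entity form.
direct = "“Hi there”"
with_entities = "&ldquo;Hi there&rdquo;"

direct_bytes = len(direct.encode("utf-8"))         # each curly quote takes 3 bytes
entity_bytes = len(with_entities.encode("utf-8"))  # each named entity takes 7 bytes

print(direct_bytes, entity_bytes, entity_bytes - direct_bytes)
```

The saving is real but small, which is why compression and minification matter far more.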

The main advantage I can see with encoding characters is that they'll look right, even if the page is interpreted as ASCII.
For example, if your page is just a raw HTML file, the default settings on some servers would be to serve it as text/html; charset=ISO-8859-1 (the default in HTTP 1.1). Even if you set the meta tag for content-type, the HTTP header has higher priority.
Whether this matters depends on how likely the page is to be served by a misconfigured server.
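A rough way to see the failure mode is to simulate it: encode the page as UTF-8, then decode the bytes as ISO-8859-1, as a client trusting the wrong header would. This is only a sketch. The helper name served_as_latin1 is made up, and real browsers actually treat an ISO-8859-1 label as windows-1252, so the exact garbage they show differs slightly.

```python
import html

def served_as_latin1(text):
    """Simulate UTF-8 bytes being interpreted as ISO-8859-1 (hypothetical helper)."""
    return text.encode("utf-8").decode("iso-8859-1")

direct = "“Hi there”"
escaped = "&ldquo;Hi there&rdquo;"

garbled = served_as_latin1(direct)                  # curly quotes turn into mojibake
survived = html.unescape(served_as_latin1(escaped)) # pure ASCII, so nothing changes

print(repr(garbled))
print(survived)
```

The entity version is plain ASCII, so it decodes identically under both encodings.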

It is better to use characters directly. They make for easier-to-read code.
Google's HTML/CSS Style Guide advocates the same.

Use characters directly. They are easier to read in the source (which is important, as people do have to edit it!) and require less bandwidth.

The example given is definitely wrong, in theory as well as in practice, in HTML5 and in HTML 4. For example, the HTML5 description of the q element says:
“Quotation punctuation (such as quotation marks) that is quoting the contents of the element must not appear immediately before, after, or inside q elements; they will be inserted into the rendering by the user agent.”
That is, use either q markup or punctuation marks, not both. The latter is better on all practical accounts.
Regarding the issue of characters vs. entity references, the former are preferable for readability, but then you need to know how to save the data as UTF-8 and declare the encoding properly. It’s not rocket science, and usually better. But if your authoring environment is UTF-8 hostile, you need not be ashamed of using entity references.

Related

Html codes for ▐?

Is there a way to put a ▐ (ascii value of 222) in html character codes (e.g. &#222;)?
Is this possible? If not, is there some way to make sure it is reliably rendered by a browser?

Using the official W3 Character Reference sheet, we can find what you're looking for. You have several options:
█    &block;    &#x2588;    &#9608;
▮    &marker;    &#x25AE;    &#9646;
❘    &VerticalSeparator;    &#x2758;    &#10072;
You can then take this a step further by looking into various Unicode regions. This region has a few similar lines:
▌    &#x258C;
▍    &#x258D;
▎    &#x258E;
▏    &#x258F;
▐    &#x2590;
Note that you'll have to perform browser tests yourself, as not all browsers will be able to render these symbols. Ensuring that your page is in an appropriate UTF format (i.e. UTF-8) will help greatly.

I don't know from where you got "ascii value of 222". ASCII only goes to 127.
The character appears (to me) to be U+2590 RIGHT HALF BLOCK. In HTML you can use &#x2590;.
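If you want to double-check which numeric reference belongs to a character, a few lines of Python will do it. This is just a sketch, and numeric_refs is a made-up helper name.

```python
import html

def numeric_refs(ch):
    """Return the hex and decimal character references for one character."""
    cp = ord(ch)
    return f"&#x{cp:X};", f"&#{cp};"

hex_ref, dec_ref = numeric_refs("\u2590")  # U+2590 RIGHT HALF BLOCK
print(hex_ref, dec_ref)

# Code point 222 is a different character entirely: LATIN CAPITAL LETTER THORN.
thorn = html.unescape("&#222;")
print(thorn)
```

This also shows where the confusion in the question comes from: 222 is the code point of Þ, not of the block character.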

is there some way to make sure it is reliably rendered by a browser?
Yes: encode your document as UTF-8 (really – it’s the default on the web and the best choice nowadays) and include the character directly in the document.
Every modern text editor / IDE supports saving documents as UTF-8. To serve it to the browser, specify the encoding in the <head> section:
<meta charset="utf-8">
(This is HTML5; older versions are slightly different) and specify it in the HTTP header when serving the document from a server. Most servers are already configured to do this correctly.
HTML escape sequences, while still useful in certain scenarios, are by no means the easiest way of using arbitrary characters in HTML code.
Also, as others have noted, there’s some confusion here: ASCII only goes up to 127, there’s no ASCII character 222. Furthermore, ASCII is severely outdated and used almost nowhere nowadays. Most of the time, when somebody says “ASCII” they mean something else, and unfortunately they always mean different things. This is another reason to use Unicode and UTF-8 throughout: it avoids confusion.

I believe you have answered your own question. The other HTML entity for that character is &THORN;
Depending on your doctype, any compliant browser should render the character correctly.

Is it bad to set a webpage's Content Type?

I'm Brazilian and work / live here. For those who don't know, Portuguese has lots of words with accented letters. And I also know that, if you don't handle the charset correctly, all your accented letters become garbage when rendered in the browser.
So, in my daily work, I always see in my boss' code (which isn't the most carefully written kind) the ampersand pattern (if this has a name, please let me know). So, for example, these are all over the place:
Formul&aacute;rio
Relat&oacute;rio
Exclus&atilde;o
I, knowing you can set the charset of a web page both on the server and the client, have been doing this in my web pages (BTW we've been working with ASP.NET WebForms)...
<%@ Page (...) ContentType="text/html; charset=utf-8" %>
and then, in the <head>:
<meta charset="utf-8" />
But my boss saw this and said this was a bad practice. I googled a bit and found no resources saying it is, in fact, a bad practice. And there have been some times when my boss said a good practice was a bad one. He then told me to replace all my accented letters with their ampersand counterparts. If it really is a better practice, I'll do it.
So, TL;DR: Is it better to set the web page's Content Type or to use the "ampersand pattern"?

The W3C has an article on character encoding that is quite useful. Their take on this is pretty much:
You should always specify the encoding used for an HTML or XML page.
If you don't, you risk that characters in your content are incorrectly
interpreted. This is not just an issue of human readability,
increasingly machines need to understand your data too.
Further, according to the MDN article on the <meta> element, it is good practice to specify the charset, since it protects your users against certain cross-site scripting attacks:
It is good practice, and strongly recommended, to define the character
set using this attribute. If no character set is defined for a page,
several cross-scripting techniques may become practical to harm the
page user, like the UTF-7 fallback cross-scripting technique. Always
setting this meta will protect against these risks.
Even though there might be rare cases where it isn't possible to specify the content type, the general opinion appears to be that it is good practice to specify it. And if you do so properly, then there isn't much need for using the HTML representation of special characters (which in my opinion makes your code much harder to read).

"better" is somewhat subjective. There are pros and cons to both approaches.
Using a sensible character encoding:
Gives you more readable code
Gives you smaller code
Using character references:
Means you don't have to care about the encoding
Allows the page to be copied somewhere with wrong HTTP headers and still work

Should the percent symbol (%) always be HTML-escaped?

I know the percent symbol has to be URL-encoded when being passed around, but when I display it in the browser, is it also necessary to escape it like so: &#37;?

In URLs, the percent sign (%) has a special meaning, so it should be escaped. In HTML, it does not, so it is not necessary to escape it.
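The difference between the two kinds of escaping is easy to see with Python's standard library. This is only an illustration and the sample string is made up.

```python
import html
from urllib.parse import quote

text = "100% & rising"

url_form = quote(text)         # URL percent-encoding: % has a special meaning here
html_form = html.escape(text)  # HTML escaping: % is left alone, & is not

print(url_form)
print(html_form)
```

The percent sign is rewritten as %25 for the URL but passes through HTML escaping untouched.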

I agree with the chosen answer, but would like to qualify the statement “it is not necessary to escape it.”
If you have a need (or desire) to escape a percentage sign in HTML code, (and there are good reasons to do this with any potentially ambiguous character or symbol) then I would highly recommend using the percentage entity code &percnt; as opposed to any numeric code. (those I use when there is no entity name you could use)
That was the answer I was looking for when I found this page, because I forgot it loses the final "e".
We should probably all be using at least the entities kindly listed here. (whoever Webmasterish is; thank you)
Reasoning: Numeric codes (and particularly byte codes from unencoded characters) change with code–pages, on systems using different default languages, and / or different operating systems. (Windows and Mac using slightly different code sets for “English” being the classic, which still plagues plain–text eMail sent between Apple Mail and Outlook) This is slowing down, and should stop with UTF, but I'm still seeing it pop up.
If you're converting HTML to some other mark–up, (note, I used "–" not a "-", or even "−" for the same reason) such as LaTeX, DVI, PostScript or even MarkDown, then it's useful to completely squash any ambiguity… And those processes tend to happen on the information you least expect to be used in such a way when you initially write it. So just get used to doing it everywhere and be grateful to your former self for having had the foresight to do so. Probably years down the line, when you're looking to update formulae to be more readable by utilising MathJax or such, and keep picking up hyphenated words. <swearmarks>

I'd like to add this: if you use JavaScript in an href, you are in trouble too. Check this example:
http://jsfiddle.net/cs4MZ/
One of the workarounds might be using onclick instead of href.

If you're talking about in HTML text, visible to the reader, no. It can't do anything harmful, there.
...if you're talking about inside of HTML attributes, then yes, that would be good to consider.
URLs and HTML are different languages, as weird as that might seem, so they have different weaknesses.

Content type vs HTML encoding

I'm building a site and I've set its content type to use charset UTF-8. I'm also using HTML encoding for the special characters, i.e. instead of having á I've got &aacute;.
Now I wonder (still building the site) if it was really necessary to do both things. Looking for the answer I found this:
http://www.w3.org/International/questions/qa-escapes.en.php
It says that I should not use HTML encoding for any special characters except >, < and &. But the reason is that escapes
can make it difficult to read and maintain source code, and can also significantly increase file size.
I think that's true but very poor argument. Is it really THE SAME thing using the escapes and the special characters?

The article is in fact correct. If you have proper UTF-8 encoded data, there is no reason to use HTML entities for special characters on normal web pages any more.
I say "on normal web pages", because there are highly exotic borderline scenarios where using entities is still the safest bet (e.g. when serving JavaScript code to an external page with unknown encoding). But for serving pages to a browser, this doesn't apply.

When should one use HTML entities?

This has been confusing me for some time. With the advent of UTF-8 as the de-facto standard in web development I'm not sure in which situations I'm supposed to use the HTML entities and for which ones should I just use the UTF-8 character. For example,
em dash (—, &mdash;)
ampersand (&, &amp;)
3/4 fraction (¾, &frac34;)
Please do shed light on this issue. It will be appreciated.

Based on the comments I have received, I looked into this a little further. It seems that currently the best practice is to forgo using HTML entities and use the actual UTF-8 character instead. The reasons listed are as follows:
UTF-8 encodings are easier to read and edit for those who understand what the character means and know how to type it.
UTF-8 encodings are just as unintelligible as HTML entity encodings for those who don't understand them, but they have the advantage of rendering as special characters rather than hard to understand decimal or hex encodings.
As long as your page's encoding is properly set to UTF-8, you should use the actual character instead of an HTML entity. I read several documents about this topic, but the most helpful were:
UTF-8: The Secret of Character Encoding
Wikipedia Special Characters Help
From the UTF-8: The Secret of Character Encoding article:
Wikipedia is a great case study for an
application that originally used
ISO-8859-1 but switched to UTF-8 when
it became far too cumbersome to support
foreign languages. Bots will now
actually go through articles and
convert character entities to their
corresponding real characters for the
sake of user-friendliness and
searchability.
That article also gives a nice example involving Chinese encoding. Here is the abbreviated example for the sake of laziness:
UTF-8:
這兩個字是甚麼意思
HTML Entities:
&#36889;&#20841;&#20491;&#23383;&#26159;&#29978;&#40636;&#24847;&#24605;
The UTF-8 and HTML entity encodings are both meaningless to me, but at least the UTF-8 encoding is recognizable as a foreign language, and it will render properly in an edit box. The article goes on to say the following about the HTML entity-encoded version:
Extremely inconvenient for those of us
who actually know what character
entities are, totally unintelligible
to poor users who don't! Even the
slightly more user-friendly,
"intelligible" character entities like
&theta; will leave users who are
uninterested in learning HTML
scratching their heads. On the other
hand, if they see θ in an edit box,
they'll know that it's a special
character, and treat it accordingly,
even if they don't know how to write
that character themselves.
As others have noted, you still have to use HTML entities for reserved XML characters (ampersand, less-than, greater-than).
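As a sketch of what such a conversion could look like: decode every character reference, then re-escape only the reserved characters. The function name prefer_literals is made up, and the approach assumes the input is a piece of text content with no real tags in it, since any markup would be escaped as well.

```python
import html

def prefer_literals(fragment):
    """Turn character references into real characters, keeping &, < and > escaped.

    Assumes `fragment` is text content only, with no tags that must survive.
    """
    return html.escape(html.unescape(fragment), quote=False)

converted = prefer_literals("&theta; &amp; 3 &lt; 4")
print(converted)
```

Only the ampersand and the less-than sign stay escaped; the theta becomes a literal character.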

You don't generally need to use HTML character entities if your editor supports Unicode. Entities can be useful when:
Your keyboard does not support the character you need to type. For example, many keyboards do not have em-dash or the copyright symbol.
Your editor does not support Unicode (very common some years ago, but probably not today).
You want to make it explicit in the source what is happening. For example, the &nbsp; code is clearer than the corresponding white space character.
You need to escape HTML special characters like <, &, or ".

Entities may buy you some compatibility with brain-dead clients that don't understand encodings correctly. I don't believe that includes any current browsers, but you never know what other kinds of programs might be hitting you up.
More useful, though, is that HTML entities protect you from your own errors: if you misconfigure something on the server and you end up serving a page with an HTTP header that says it's ISO-8859-1 and a META tag that says it's UTF-8, at least your &mdash;es will always work.

I would not use UTF-8 for characters that are easily confused visually. For example, it is difficult to distinguish an emdash from a minus, or especially a non-breaking space from a space. For these characters, definitely use entities.
For characters that are easily understood visually (such as the chinese examples above), go ahead and use UTF-8 if you like.
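If you like this policy, it can be applied mechanically. The sketch below rewrites only a short, hand-picked set of look-alike characters. The mapping and the function name are assumptions for illustration, not a complete list of confusables.

```python
# Hypothetical mapping of easily confused characters to their named entities.
CONFUSABLE = {
    "\u00a0": "&nbsp;",   # no-break space, looks like a normal space
    "\u2013": "&ndash;",  # en dash
    "\u2014": "&mdash;",  # em dash
    "\u2212": "&minus;",  # minus sign, looks like a hyphen
}

def mark_confusables(text):
    """Replace look-alike characters with entities; leave everything else alone."""
    return "".join(CONFUSABLE.get(ch, ch) for ch in text)

marked = mark_confusables("10\u00a0km \u2212 5 ação")
print(marked)
```

Accented letters and other clearly visible characters stay as they are.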

Personally, I have done everything in UTF-8 for a long time. However, in an HTML page you always need to convert ampersand (&), greater-than (>) and less-than (<) characters to their equivalent entities: &amp;, &gt; and &lt;.
Also, if you intend to do some programming with UTF-8 text, there are a few things to watch for.
XML needs some extra lines to validate when using entities.
Some libraries do not play along nice with utf-8. For instance, PHP in some Linux distributions dropped full support for utf-8 in their regular expression libraries.
It is harder to limit the number of characters in a text that uses html entities, because a single entity uses many characters. Also there's always the risk of cutting the entity in half.
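One way around the truncation problem, sketched here with a made-up helper name, is to decode the entities first, cut the real characters, and escape again afterwards.

```python
import html

def truncate_text(escaped, limit):
    """Cut entity-encoded text to `limit` visible characters without splitting an entity."""
    text = html.unescape(escaped)
    return html.escape(text[:limit], quote=False)

escaped = "Relat&oacute;rio &amp; Exclus&atilde;o"

naive = escaped[:8]               # slices through the middle of &oacute;
safe = truncate_text(escaped, 8)  # counts real characters instead

print(naive)
print(safe)
```

The naive slice leaves a broken half-entity behind, while the decoded version yields eight real characters.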

HTML entities are useful when you want to generate content that is going to be included (dynamically) into pages with (several) different encodings. For example, we have white label content that is included both into ISO-8859-1 and UTF-8 encoded web pages...
If character set conversion from/to UTF-8 wasn't such a big unreliable mess (you always stumble over some characters and some tools that don't convert properly), standardizing on UTF-8 would be the way to go.
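For content that has to survive inside pages of unknown encoding, one option is to emit pure ASCII with numeric character references. Python's built-in xmlcharrefreplace error handler does this. The sample text is made up, and this sketch does not escape &, < or >, which would need html.escape first.

```python
import html

snippet = "Relatório: preço 10 €"

# Every non-ASCII character becomes a decimal character reference.
ascii_bytes = snippet.encode("ascii", "xmlcharrefreplace")
ascii_safe = ascii_bytes.decode("ascii")
print(ascii_safe)

# The bytes read the same whichever encoding the host page declares.
same_everywhere = ascii_bytes.decode("utf-8") == ascii_bytes.decode("iso-8859-1")
round_trip = html.unescape(ascii_safe)
```

The result is larger and harder to read, which is the trade-off discussed throughout this thread.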

If your pages are correctly encoded in utf-8 you should have no need for html entities, just use the characters you want directly.

All of the previous answers make sense to me.
In addition: it mostly depends on the editor you intend to use and on the document language. The minimum requirement for the editor is that it supports the document language. That means that if your text is in Japanese, beware of using an editor which cannot display the characters (i.e. no entities for the document text itself). If it's English, you can even use an old vim-like editor and use entities only for the relatively rare &copy; and friends.
Of course, &gt; for > and the other HTML specials still need escapes.
But even with the other Latin-1 languages (German, French etc.), writing &auml; is a pain in you know where...
In addition, I personally write entities for invisible characters and those which are looking similar to standard-ascii and are therefore easily confused. For example, there is u1173 (looking like a dash in some charsets) or u1175, which looks like the vertical bar. I'd use entities for those in any case.
