Is there a need to use HTML entities when using Unicode? - html

I am building a website for a German client, so the text on the website will regularly contain characters like:
ä
ö
ü
ß
Is it necessary for to convert all those characters to their HTML Entities while the website uses UTF-8 character encoding everywhere?
Or maybe there's no relation between the two areas?
When (if at all) should I convert those to their HTML Entities, then?

You should convert to HTML entity or character references when:
a. you are stuck with some editor or processing component that doesn't support Unicode properly;
b. you have manually-edited markup with confusable characters. For example, if you have a non-breaking-space that is important to lay out correctly, you might want to write it as or   so that it's obvious and doesn't get replaced with a normal space when someone edits the file.
Other than that, no, just go with the raw versions.

Related

Can I use HTML-entities without a fallback?

I am wondering, if I can use html-entities like
<h5><em>⇆</em> Headline</h5>
without any fallback if I use utf-8? (because on my systems this works totally fine). Are all these chars from http://dev.w3.org/html5/html-author/charref really all embedded into the utf-8-charset by default?
And how would I use it correctly, like this:
<h5><em>⇆</em> Headline</h5>
that
<h5><em>&lrarr; </em> Headline</h5>
or
<h5><em>⇆</em> Headline</h5>
There are two separate issues here:
get the browser to understand which character you want
render that character visually
For the first point, there are two options:
Embed the character directly as is, for which you will need to serve the HTML in an encoding that can encode that character. Yes, "⇆" is a Unicode character and can be encoded by any Unicode encoding. UTF-8 is the best choice here. The browser then simply needs to understand that the document is encoded in UTF-8 and it will be able to read and understand the character correctly. Set the appropriate HTTP header to denote the encoding.
Embed the character as an HTML entity. HTML entities is a way to embed any arbitrary character using only ASCII characters, e.g. &lrarr;. To encode this, your encoding of choice only needs to be able to encode &, l, r, a and ;, which are very standard characters in any encoding. This special sequence of characters is understood by the browser to mean the character "⇆". By embedding characters as HTML entities you can largely ignore the intricacies of managing encodings correctly, but it makes your source code rather unreadable. You should not do this in this day and age.
Whether you use named entities (&lrarr;) or refer to the character by its Unicode code (⇆) doesn't really matter, they both result in the same thing.
Having handled this, the character needs to be actually rendered as a glyph on screen. For this, an appropriate font is necessary. You'll have to test whether most of your target audience uses a system which has a font installed by default which contains this character. You can also provide your own font to the browser which contains this character as a web font.

HTML special characters for languages

Is there any way to type this word "हिन्दी中文(简体)" in html?
I see that there's codes for special characters in html for example for "العربية" العبية
But I can't find these codes for this "हिन्दी中文(简体)"
There are tools out there that can do this conversion from raw Unicode symbols to encoded HTML entities.
हिन्दी中文(简体)
Yes, you can write “हिन्दी中文(简体)” as “हिन्दी中文(简体)” in HTML. Naturally, you need a character encoding that lets you do that, primarily UTF-8, but that’s a good idea anyway.
You can write any character using a character reference like ह (for U+0939 DEVANAGARI LETTER HA, “ह”), but this increases the data size and makes the HTML code look very obscure.

HTML encoding of Japanese text

I'm making a static HTML page that displays courtesy text in multiple languages. I noticed that if I paste ウェブサイトのメンテナンスの下で into Expression Blend, that text appears the same in the code. I think it's bad for compatibility and should be replaced by proper HTML entities.
I have tried http://www.opinionatedgeek.com/DotNet/Tools/HTMLEncode/encode.aspx but it returns me the same Japanese text.
Is it correct, from the point of view of browser compatibility, to paste that Japanese right into the source code of an HTML page?
Else, what is the correct HTML encoding of that text? Or, better, is there any tool that I can use to convert non-ASCII characters to HTML entities, possibly online and possibly free?
I think it's bad for compatibility and should be replaced by proper
HTML entities.
Quite the opposite actually, your preference should be to not use html entities but rather correctly declare document encoding as UTF-8 and use the actual characters. There are quite a few compelling reasons to do so, but the real question is why not use it since it's a well- and widely supported standard?
Some of those points have been summarised previously:
UTF-8 encodings are easier to read and edit for those who understand
what the character means and know how to type it.
UTF-8 encodings are just as unintelligible as HTML entity encodings
for those who don't understand them, but they have the advantage of
rendering as special characters rather than hard to understand decimal
or hex encodings.
[For example] Wikipedia... actually go through articles and convert
character entities to their corresponding real characters for the sake
of user-friendliness and searchability.
As long as you mark your web-page as UTF-8, either in the http headers or the meta tags, having foreign characters in your web-pages should be a non-issue. Alternately you could encode/decode these strings using encodeURI/decodeURI functions in JavaScript
encodeURI('ウェブサイトのメンテナンスの下で')
//returns"%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7"
decodeURI("%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7")
//returns ウェブサイトのメンテナンスの下で
If you are looking for a tool to convert a bunch of static strings to unicode characters, you could simply use encodeURI/decodeURI functions from a web-page developer console (firebug for mozilla/firefox). Hope this helps!
HTML entities are only useful if you need to represent a character that cannot be represented in the encoding your document is saved in. For example, ASCII has no specification for how to represent "€". If you want to use that character in an ASCII encoded HTML document, you have to encode it as € or not use it at all.
If you are using a character encoding for your document that can represent all the characters you need though, like UTF-8, there's no need for HTML entities. You simply need to make sure the browser knows what encoding the document is in so it can interpret it correctly. This is really the preferable method, since it simply keeps the source code readable. It really makes no sense to want to work with HTML entities if you can simply work with the actual characters.
See http://kunststube.net/frontback for some more information.

Why can some HTML documents display special chars written plainly (e.g. as ä) without the need for codes (e.g. ä)

I'm making a little website with german and french content. Some of the documents display text correctly, even though all umlauts are written as äöü and not with codes. Other docs need the codes but I can't find the difference between the documents.
When trying to google for an answer, I can only find tons of code references but no explanation why some docs don't need them.
Any HTML document (or any text document for that matter) is encoded to a certain encoding - this is a mapping between the characters and the values representing them. Different encodings mean different characters.
Many pages use UTF-8 a Unicode encoding and they state so either in the HTTP header or in a Meta tag (Content-Type) on the page itself - such pages can use most characters directly.
You should read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
1) charset-declaration in the html-code (meta)
2) the encoding of your documents.
For example... if you're working with UTF-8 and there is ONE document (for example a js-file) in ISO 8859-1 then some browsers will show you the site in ISO 8859-1 wich destroys your äöüß, ...
Because, per the HTML specification:
Authoring tools (e.g., text editors) may encode HTML documents in the character encoding of their choice
Some documents use an encoding (such as iso‑8859‑1, or Windows‑1252, or utf‑8) that can represent the character ä directly; others use an encoding (such as us‑ascii) that cannot, and therefore need to use the character entity reference ä.

When should one use HTML entities?

This has been confusing me for some time. With the advent of UTF-8 as the de-facto standard in web development I'm not sure in which situations I'm supposed to use the HTML entities and for which ones should I just use the UTF-8 character. For example,
em dash (–, &emdash;)
ampersand (&, &)
3/4 fraction (¾, ¾)
Please do shed light on this issue. It will be appreciated.
Based on the comments I have received, I looked into this a little further. It seems that currently the best practice is to forgo using HTML entities and use the actual UTF-8 character instead. The reasons listed are as follows:
UTF-8 encodings are easier to read and edit for those who understand what the character means and know how to type it.
UTF-8 encodings are just as unintelligible as HTML entity encodings for those who don't understand them, but they have the advantage of rendering as special characters rather than hard to understand decimal or hex encodings.
As long as your page's encoding is properly set to UTF-8, you should use the actual character instead of an HTML entity. I read several documents about this topic, but the most helpful were:
UTF-8: The Secret of Character Encoding
Wikipedia Special Characters Help
From the UTF-8: The Secret of Character Encoding article:
Wikipedia is a great case study for an
application that originally used
ISO-8859-1 but switched to UTF-8 when
it became far too cumbersome to support
foreign languages. Bots will now
actually go through articles and
convert character entities to their
corresponding real characters for the
sake of user-friendliness and
searchability.
That article also gives a nice example involving Chinese encoding. Here is the abbreviated example for the sake of laziness:
UTF-8:
這兩個字是甚麼意思
HTML Entities:
這兩個字是甚麼意思
The UTF-8 and HTML entity encodings are both meaningless to me, but at least the UTF-8 encoding is recognizable as a foreign language, and it will render properly in an edit box. The article goes on to say the following about the HTML entity-encoded version:
Extremely inconvenient for those of us
who actually know what character
entities are, totally unintelligible
to poor users who don't! Even the
slightly more user-friendly,
"intelligible" character entities like
θ will leave users who are
uninterested in learning HTML
scratching their heads. On the other
hand, if they see θ in an edit box,
they'll know that it's a special
character, and treat it accordingly,
even if they don't know how to write
that character themselves.
As others have noted, you still have to use HTML entities for reserved XML characters (ampersand, less-than, greater-than).
You don't generally need to use HTML character entities if your editor supports Unicode. Entities can be useful when:
Your keyboard does not support the character you need to type. For example, many keyboards do not have em-dash or the copyright symbol.
Your editor does not support Unicode (very common some years ago, but probably not today).
You want to make it explicit in the source what is happening. For example, the code is clearer than the corresponding white space character.
You need to escape HTML special characters like <, &, or ".
Entities may buy you some compatibility with brain-dead clients that don't understand encodings correctly. I don't believe that includes any current browsers, but you never know what other kinds of programs might be hitting you up.
More useful, though, is that HTML entities protect you from your own errors: if you misconfigure something on the server and you end up serving a page with an HTTP header that says it's ISO-8859-1 and a META tag that says it's UTF-8, at least your —es will always work.
I would not use UTF-8 for characters that are easily confused visually. For example, it is difficult to distinguish an emdash from a minus, or especially a non-breaking space from a space. For these characters, definitely use entities.
For characters that are easily understood visually (such as the chinese examples above), go ahead and use UTF-8 if you like.
Personally I do everything in utf-8 since a long time, however, in an html page, you always need to convert ampersands (&), greater than (>) and lesser then (<) characters to their equivalent entities, &, > and <
Also, if you intend on doing some programming using utf-8 text, there are a few thing to watch for.
XML needs some extra lines to validate when using entities.
Some libraries do not play along nice with utf-8. For instance, PHP in some Linux distributions dropped full support for utf-8 in their regular expression libraries.
It is harder to limit the number of characters in a text that uses html entities, because a single entity uses many characters. Also there's always the risk of cutting the entity in half.
HTML entities are useful when you want to generate content that is going to be included (dynamically) into pages with (several) different encodings. For example, we have white label content that is included both into ISO-8859-1 and UTF-8 encoded web pages...
If character set conversion from/to UTF-8 wasn't such a big unreliable mess (you always stumble over some characters and some tools that don't convert properly), standardizing on UTF-8 would be the way to go.
If your pages are correctly encoded in utf-8 you should have no need for html entities, just use the characters you want directly.
All of the previous answers make sense to me.
In addition: It mostly depends on the editor you intent to use and the document language. As a minimum requirement for the editor is that it supports the document language. That means, that if your text is in japanese, beware of using an editor which does not show them (i.e. no entities for the document itself). If its english, you can even use an old vim-like editor and use entities only for the relative seldom © and friends.
Of course: > for > and other HTML-specials still need escapes.
But even with the other latin-1 languages (german, french etc.) writing ä is a pain in you know where...
In addition, I personally write entities for invisible characters and those which are looking similar to standard-ascii and are therefore easily confused. For example, there is u1173 (looking like a dash in some charsets) or u1175, which looks like the vertical bar. I'd use entities for those in any case.