HTML character entities and character encoding set - html

When including HTML entities in an HTML document, do the entities need to be from the same character encoding set that the document is specified to be using?
For example, if I am going to use the copyright sign in an HTML document that is specified as UTF-8, is it necessary to use the Unicode HTML entity (&#xA9;) or is it okay to use other entities, such as the ASCII HTML entity (&#169;)?
Please explain your answer. I am aware that it will "work", but is there a case where it will not work?
Thanks!

&#xA9; and &#169; specify the same character: 169 is equivalent to hexadecimal A9. Both specify a copyright symbol. Character references in HTML always refer to Unicode code points; this is covered in the HTML 4 standard. Thus, even if your character set changes, your entities still refer to the same characters.
This also means that you can encode characters that don't actually appear within your character set of choice. I just created a document in the ISO-8859-1 character set, but it includes a Greek lambda. Also, ASCII is not able to directly encode a copyright symbol, but it can through character entities.
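As a rough illustration of the two points above, here is a small Python sketch. It uses the standard library's html.unescape as a stand-in for what a browser does when it resolves character references; the variable names are only for illustration.

```python
import html

# Decimal, hexadecimal and named references all resolve to U+00A9.
decimal_ref = html.unescape("&#169;")
hex_ref = html.unescape("&#xA9;")
named_ref = html.unescape("&copy;")
same_character = decimal_ref == hex_ref == named_ref == "\u00a9"

# A Greek lambda (U+03BB) is not part of ISO-8859-1, so it cannot be
# stored directly in a document saved in that encoding...
try:
    "\u03bb".encode("iso-8859-1")
    lambda_fits_latin1 = True
except UnicodeEncodeError:
    lambda_fits_latin1 = False

# ...but its numeric reference is plain ASCII and still means the lambda.
lambda_ref = "&#955;"
lambda_ref_bytes = lambda_ref.encode("ascii")
lambda_char = html.unescape(lambda_ref)

print(same_character, lambda_fits_latin1, lambda_char == "\u03bb")
```

The reference names a code point, so it survives any change of document encoding as long as that encoding can carry ASCII.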
Edit: Reading the comments on the other answer, I want to clarify this a bit. If you are using UTF-8 as the character encoding for your document, you can, within the raw HTML source, write a copyright symbol just as-is. (You need to find some way to input it, of course: copy-paste being the usual.) UTF-8 will allow you to directly encode any symbol you want. ISO-8859-1 is much more limited, and ASCII even more so. For example, within my HTML, if my document is a UTF-8 document, I can do:
<p>Hi there. This document is ©2010. Good day!</p>
or:
<p>Hi there. This document is &copy;2010. Good day!</p>
or:
<p>Hi there. This document is &#169;2010. Good day!</p>
The first is only valid if the character set supports "©". The other two are always valid, but less readable. Whatever text editor you're using, if it is worth its salt, should be able to tell you what character set it is encoding the document in.
If you do this, you need to make sure your web server informs the client of the correct character set, or that your document declares it with something like:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I've used UTF-8 there as an example. XHTML served as XML should declare the character set in the opening <?xml ... ?> declaration.
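To see which encodings can hold the copyright sign as-is and which need a character reference, a quick Python sketch may help. It assumes nothing beyond the standard codecs; the "xmlcharrefreplace" error handler swaps any character the target encoding cannot represent for a decimal reference.

```python
text = "This document is \u00a92010."

# UTF-8 and ISO-8859-1 can both store the copyright sign directly.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

# ASCII cannot, so fall back to a numeric character reference.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

print(ascii_bytes)  # b'This document is &#169;2010.'
```

Whichever bytes you end up with, the charset you declare has to match the encoding you actually saved in.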

The beauty of the UTF-8 encoding is that you can actually just include the binary character. You don't need to encode it as an entity at all. Thusly: ©
Oh, you just want to know the difference between the two entities? There is none. One gives the code point in hex and the other in decimal.
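A small Python sketch of that relationship, for what it's worth. The variable names are made up for the example; the point is that both references are built from the same code point number, which is not the same thing as the bytes UTF-8 stores for the character.

```python
code_point = ord("\u00a9")            # 169, i.e. 0xA9

decimal_form = f"&#{code_point};"     # decimal numeric reference
hex_form = f"&#x{code_point:X};"      # hexadecimal numeric reference

# The raw character in a UTF-8 file is two bytes, neither form above.
utf8_bytes = "\u00a9".encode("utf-8")

print(decimal_form, hex_form, utf8_bytes)
```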

Related

Can I use HTML-entities without a fallback?

I am wondering, if I can use html-entities like
<h5><em>&lrarr;</em> Headline</h5>
without any fallback if I use utf-8? (because on my systems this works totally fine). Are all these chars from http://dev.w3.org/html5/html-author/charref really all embedded into the utf-8-charset by default?
And how would I use it correctly, like this:
<h5><em>⇆</em> Headline</h5>
that
<h5><em>&lrarr;</em> Headline</h5>
or
<h5><em>&#8646;</em> Headline</h5>
There are two separate issues here:
get the browser to understand which character you want
render that character visually
For the first point, there are two options:
Embed the character directly as is, for which you will need to serve the HTML in an encoding that can encode that character. Yes, "⇆" is a Unicode character and can be encoded by any Unicode encoding. UTF-8 is the best choice here. The browser then simply needs to understand that the document is encoded in UTF-8 and it will be able to read and understand the character correctly. Set the appropriate HTTP header to denote the encoding.
Embed the character as an HTML entity. HTML entities are a way to embed any arbitrary character using only ASCII characters, e.g. &lrarr;. To encode this, your encoding of choice only needs to be able to encode &, l, r, a and ;, which are very standard characters in any encoding. This special sequence of characters is understood by the browser to mean the character "⇆". By embedding characters as HTML entities you can largely ignore the intricacies of managing encodings correctly, but it makes your source code rather unreadable. You should not do this in this day and age.
Whether you use named entities (&lrarr;) or refer to the character by its Unicode code (&#x21C6;) doesn't really matter; they both result in the same thing.
Having handled this, the character needs to be actually rendered as a glyph on screen. For this, an appropriate font is necessary. You'll have to test whether most of your target audience uses a system which has a font installed by default which contains this character. You can also provide your own font to the browser which contains this character as a web font.
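The first issue (getting the character across) can be checked outside a browser. This is only a sketch: Python's html.unescape knows the HTML5 named references and is used here as a stand-in for the browser's parser. It says nothing about the second issue, whether a font with that glyph is installed.

```python
import html

# Named, decimal and hexadecimal references for the same arrow.
named = html.unescape("&lrarr;")
decimal = html.unescape("&#8646;")
hexadecimal = html.unescape("&#x21C6;")
all_equal = named == decimal == hexadecimal == "\u21c6"

# The reference itself is pure ASCII, so any common encoding can carry it.
reference_is_ascii = "&lrarr;".isascii()

# Embedding the character directly in UTF-8 costs three bytes instead.
direct_utf8 = "\u21c6".encode("utf-8")

print(all_equal, reference_is_ascii, len(direct_utf8))
```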

Why can some HTML documents display special chars written plainly (e.g. as ä) without the need for codes (e.g. &auml;)

I'm making a little website with german and french content. Some of the documents display text correctly, even though all umlauts are written as äöü and not with codes. Other docs need the codes but I can't find the difference between the documents.
When trying to google for an answer, I can only find tons of code references but no explanation why some docs don't need them.
Any HTML document (or any text document for that matter) is stored in a certain encoding, which is a mapping between characters and the byte values representing them. Different encodings map the same byte values to different characters.
Many pages use UTF-8, a Unicode encoding, and state so either in the HTTP header or in a meta tag (Content-Type) on the page itself. Such pages can use most characters directly.
You should read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
1) charset-declaration in the html-code (meta)
2) the encoding of your documents.
For example... if you're working with UTF-8 and there is ONE document (for example a js-file) in ISO 8859-1, then some browsers will show you the site in ISO 8859-1, which destroys your äöüß, ...
Because, per the HTML specification:
Authoring tools (e.g., text editors) may encode HTML documents in the character encoding of their choice
Some documents use an encoding (such as iso‑8859‑1, or Windows‑1252, or utf‑8) that can represent the character ä directly; others use an encoding (such as us‑ascii) that cannot, and therefore need to use the character entity reference &auml;.
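A short Python sketch of that difference, using the encodings named above. The dictionary and variable names are just for illustration. The last line shows what happens when a UTF-8 file is read as ISO-8859-1, which is the typical "destroyed umlaut" symptom.

```python
char = "\u00e4"  # ä

# Try to store the character directly in each encoding.
direct = {}
for name in ["iso-8859-1", "windows-1252", "utf-8", "us-ascii"]:
    try:
        direct[name] = char.encode(name)
    except UnicodeEncodeError:
        direct[name] = None  # not representable: an entity is needed

# Saved as UTF-8 but interpreted as ISO-8859-1: two wrong characters appear.
mojibake = char.encode("utf-8").decode("iso-8859-1")

print(direct)
print(mojibake)
```

So the documents that "need the codes" are most likely the ones whose real encoding and declared encoding disagree, or whose encoding cannot represent the umlauts at all.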

HTML Unicode Issue: How to display special characters

Currently, I have my webpage set to Unicode/UTF-8. When trying to display a special character (for example, em dash, double arrow, etc), it shows up as a question mark symbol. I cannot change these characters to the HTML entity equivalent. How can I circumvent this issue?
A question mark in a lozenge, �, indicates a character-level error: the data contains bytes that do not represent any character, according to the character encoding being applied. This typically happens when the document is declared as UTF-8 encoded but is really in iso-8859-1, windows-1252, or some similar encoding. Windows-1252 is a common default encoding used by various programs on Windows platforms. So you may need to open the file in your authoring program and re-save it as UTF-8 encoded.
If problems remain, please post the URL. Posting the code alone is not sufficient, since the character encoding is primarily specified in HTTP headers.
If you see a question mark in a small box, then it might be a font-level problem (lack of glyph in the fonts being used), but this would be very rare for common characters like the em dash. Different browsers have different ways of indicating character- or font-level problems.
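The character-level error is easy to reproduce. The following Python sketch assumes the em dash was saved by a program defaulting to windows-1252 and then read as UTF-8; errors="replace" mimics the browser substituting U+FFFD for bytes it cannot interpret.

```python
em_dash = "\u2014"

# What a windows-1252 editor writes to disk for an em dash: one byte.
cp1252_bytes = em_dash.encode("windows-1252")

# Read back as UTF-8, that byte is invalid and becomes U+FFFD.
shown = cp1252_bytes.decode("utf-8", errors="replace")

# The fix: decode with the real encoding, then re-save as UTF-8.
resaved = cp1252_bytes.decode("windows-1252").encode("utf-8")

print(cp1252_bytes, repr(shown), resaved)
```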
Make sure your document is set to the correct character encoding in the actual code editor, as well as in a meta tag in the document's head. Both are necessary. I spent hours trying to tweak HTML when the only problem was that I needed to set the text setting in Coda.
<head>
<meta charset="utf-8">
Make sure your characters are actually UTF-8 characters. They will probably look something like this:
&#174; or U+0020
http://www.kinsmancreative.com/transfer/char/index.php is a handy site for finding the decimal values of commonly used UTF-8 special characters if you need a reference.

HTML and character encoding vs HTML Entity

When writing an HTML document, is it acceptable to use the direct special character such as the capital letter C with a cedilla underneath as regular text: Ç or to use the HTML Entity name of this character, &Ccedil;?
I have seen both being used in practice, but surely there are rules governing the appropriate usage of this, as well as advantages to one way over another. For instance, this website maintains the raw-form of this character, but other websites may end up rendering it as a square block.
Real characters:
Are easier to type if your system is set up for a language that uses those characters
Produce more readable code
Save bytes
HTML entities:
Let you more or less forget about character encoding
Obviously, characters with special meaning in HTML (<, &, etc) still need to be represented by entities.
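For the characters that do have special meaning, here is a minimal Python sketch using the standard library's html.escape (with quote=False so only &, < and > are touched). The sample string is made up; note that Ç passes through unchanged because it has no special meaning in HTML.

```python
import html

raw = "if a < b & c > d then \u00c7"

# Only the markup-significant characters are turned into entities.
escaped = html.escape(raw, quote=False)

# Resolving the entities gives the original text back.
round_trip = html.unescape(escaped)

print(escaped)
```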
If you're using UTF-8 character encoding, then most entity characters (other than &, > and <) become redundant.
If you're not using UTF-8, then you need entities for every character your chosen encoding cannot represent.
It all depends on the character encoding of the document. If you're unsure of whether or not you should use the regular text or the encoded version, you could run your page through the W3C Validator.
Consider this code:
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>Stuff</title>
</head>
<body>
<p>©</p>
<p>&copy;</p>
</body>
</html>
The document encoding is declared as UTF-8, but if the file is actually saved in another encoding, validation returns an error:
Sorry, I am unable to validate this document because on line 7 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
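You can run a similar check locally. This is a rough sketch, not how the W3C Validator is implemented: first_invalid_utf8_line is a hypothetical helper that reports the first line whose bytes are not valid UTF-8, and the page below assumes the second paragraph uses the &copy; entity.

```python
from typing import Optional


def first_invalid_utf8_line(data: bytes) -> Optional[int]:
    """Return the 1-based number of the first line that is not valid UTF-8."""
    # Splitting on the newline byte is safe: 0x0A never occurs inside a
    # multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            return number
    return None


page = (
    "<html>\n<head>\n"
    '<meta http-equiv="content-type" content="text/html; charset=utf-8" />\n'
    "<title>Stuff</title>\n</head>\n<body>\n"
    "<p>\u00a9</p>\n"      # line 7: the raw character
    "<p>&copy;</p>\n"      # line 8: the entity (assumed form)
    "</body>\n</html>\n"
)

# Saved as ISO-8859-1 while declaring UTF-8: line 7 is flagged.
bad_line = first_invalid_utf8_line(page.encode("iso-8859-1"))
# Saved as UTF-8: nothing to report.
good_line = first_invalid_utf8_line(page.encode("utf-8"))

print(bad_line, good_line)
```

The entity on line 8 is fine either way, since it is pure ASCII; only the raw character depends on how the file was saved.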

What to do with umlauts (äöü) in metatags?

Declaring them as &xuml; etc. didn't work, and just writing them as they are leads to display errors.
What to do?
If your page is encoded as UTF-8, you should be able to use special characters directly (i.e. without converting them into their HTML entity counterparts) without problems. Note that if you declare the encoding in a content-type meta tag, you should put that tag at the very beginning of the head section.
Use an encoding which can encode the characters. I'd recommend UTF-8, which is generally the preferred solution for western languages.
Keep in mind that HTTP headers have precedence over <meta http-equiv=...>, but you should set both to ensure using the correct encoding when loading the document from non-HTTP sources (eg when saving the file locally).
You should never have to use HTML entities for those characters, since they have no special meaning in HTML. Just make sure the character encoding of the text you're outputting matches your charset header.
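The precedence of the HTTP header over the meta declaration can be sketched roughly like this. declared_charset is a hypothetical helper with deliberately simple regular expressions; real browsers follow a more detailed algorithm, so treat it as an illustration of the order of checks only.

```python
import re
from typing import Optional


def declared_charset(content_type_header: Optional[str], html_head: bytes) -> Optional[str]:
    """Return the declared charset, preferring the HTTP header over <meta>."""
    if content_type_header:
        match = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type_header, re.I)
        if match:
            return match.group(1).lower()
    # Covers both <meta charset=...> and <meta http-equiv=... content="...charset=...">.
    match = re.search(rb'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)', html_head, re.I)
    if match:
        return match.group(1).decode("ascii").lower()
    return None


head = b'<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'

# Header and meta disagree: the header wins.
from_header = declared_charset("text/html; charset=ISO-8859-1", head)
# No header (e.g. a locally saved file): the meta tag is used.
from_meta = declared_charset(None, head)

print(from_header, from_meta)
```

This is why a page can look fine when opened from disk and broken when served: the server may be sending a different charset than the one in your meta tag.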