HTML and character encoding vs HTML Entity

HTML and character encoding vs HTML Entity - html

When writing an HTML document, is it acceptable to use the direct special character such as the captial letter C with a cedilla underneath as regular text: Ç or to use the HTML Entity name of this charecter, &Ccedil ?
I have seen both being used in practice, but surely there are rules governing the appropriate usage of this, as well as advantages to one way over another. For instance, this website maintains the raw-form of this character, but other websites may end up rendering it as a square block.

Real characters:
Are easier to type if your system is set up for a language that uses those characters
Produce more readable code
Save bytes
HTML entities:
Let you more or less forget about character encoding
Obviously, characters with special meaning in HTML (<, &, etc) still need to be represented by entities.

If you're using UTF-8 character encoding, then most entity characters (other than &, > and <) become redundant.
If you're not using UTF-8, then you need entities for everything.

It all depends on the character encoding of the document. If you're unsure of whether or not you should use the the regular text or the encoding version, you could run your page through the W3C Validator.
Consider this code:
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>Stuff</title>
</head>
<body>
<p>©</p>
<p>©</p>
</body>
</html>
The document encoding is set to UTF-8 and when it's validated, it returns an error:
Sorry, I am unable to validate this document because on line 7 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

Related

Single quotes showing as diamond shaped question mark in browsers (no database or PHP)

I am working with a web page in which I switched the character set from iso-8859-1 to utf-8. The top of the page reads like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>[title of site]</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I am only using ASCII characters in the page, and since utf-8 encoding supersets ASCII, this should be fine. However, single quotes in the text are showing up as question marks surrounded by black diamonds. I have verified these are are ASCII single quotes (not straight quotes).
I've read much online that describes solutions to the problem that involve PHP, magic quotes, database configuration, etc. However, this is a flat HTML page that isn't being rendered by any programs.
Also, many who have this problem are told to switch to UTF-8 to fix the problem. This is exactly how I introduced the problem.
Please look at http://mch.blackcatwebinc.com/src/events.html to see this problem.

The only quotes in ASCII are the single quote ' (0x27 or 39) and the double quote " (0x22 or 33). What you have there is an 8-bit encoding that places quotes at 145 (0x91) and 146 (0x92) called CP1252; it's the standard 8-bit Western European encoding for Windows. If what you want is UTF-8, you need to convert that to UTF-8, since it's not valid UTF-8; valid UTF-8 uses multiple bytes for characters above 127 (0x7F), and places the opening and closing quotes at U+2018 and U+2019 respectively.

According to the W3C, the meta charset
should appear as close as possible to the top of the head element
From http://www.w3.org/International/questions/qa-html-encoding-declarations#metacontenttype
So, I might try to place the meta tag above the title.
Also, as mentioned in the first answer by #user1505373, UTF is always capitalized and there is no space after the = in any of the examples I saw.

Your source code is not saved in UTF-8 but Latin1 CP1252, and those quotes are not simple quotes but U+2019 RIGHT SINGLE QUOTATION MARKS (encoded in Latin1). Save the source file in UTF-8 and it'll work.

The simplest fix is to change UTF-8 to windows-1252 in the meta tag. This works, because the server announces no encoding in the Content-Type header, so browsers and other clients will use the one specified in a meta tag.
The name windows-1252 is the preferred MIME name for the 8-bit Windows Latin-1 encoding, also known as cp1252 and some other names (often misrepresented as “ANSI”).
As #deceze explains, the actual encoding of the data is windows-1252, not UTF-8. You can alternatively change the actual encoding to UTF-8 by saving the file with a suitable command in your authoring software. But what really matters is that the declared encoding matches the real one.
Yet another possibility is to use “escapes” for the apostrophe, such as ’. They work independently of encoding, but they make the source code less legible.

The only difference I see between your tag and the one on the site I'm working on is the space after the semicolon and that utf is lowercase on yours. Try capitalizing UTF.

All ASCII printable characters have their equivalent HTML Entity Code. Some of these characters are generally supported by most common OS typefaces, some are categorized as Symbols that bring us to your rendering issue.
What you supposedly have there is a closing single quote, and in order to get it rightly printed you should use it's entity code, or  respectively.
If it turns to be an opening single quote, then you should use  instead.
Note, there's no HTML Entity Name for the two ASCII characters (and some more) so you're required to opt the entity code variant.

HTML Unicode Issue: How to display special characters

Currently, I have my webpage set to Unicode/UTF-8. When trying to display a special character (for example, em dash, double arrow, etc), it shows up as a question mark symbol. I cannot change these characters to the HTML entity equivalent. How can I circumvent this issue?

A question mark in a lozenge, �, indicates a character-level error: the data contains bytes that do no represent any character, according to the character encoding being applied. This typically happens when the document is declared as UTF-8 encoded but is really in iso-8859-1, windows-1252, or some similar encoding. Windows-1252 is a common default encoding used by various programs on Windows platforms. So you may need to open the file in your authoring program and re-save it as UTF-8 encoded.
If problems remain, please post the URL. Posting the code alone is not sufficient, since the character encoding is primarily specified in HTTP headers.
If you see a question mark in a small box, then it might be a font-level problem (lack of glyph in the fonts being used), but this would be very rare for common characters like the em dash. Different browsers have different ways of indicating character- or font-level problems.

Make sure your document is set to the correct character encoding in the actual code editor, as well as in the doctype. Both are necessary. I spent hours trying to tweak HTML when the only problem was that I needed to set the text setting in Coda.
<head>
<meta charset="utf-8">
See the following screenshot:

Make sure your characters are actually UTF-8 characters. They will probably look something like this:
® or U+0020
http://www.kinsmancreative.com/transfer/char/index.php is a handy site for finding the decimal values of commonly used UTF-8 special characters if you need a reference.

how utf-8 identifies the different language character

I am really amazed to see the magic of utf-8 but couldn't understand the logic behind it. I went through several documents but still confused though i know the basic only.
please take a look first example. it converts from language character to utf-8. there are two text box, in first text box enter the chars, click the button and get the utf-8 values in second text box as utf-8.
please take a look of the second example . i have used the utf-8 char from the example 1 and put the value in html and here i really do not understand how it translates. as i tested three language chinese, Hindi and Russian.
used google translator to translate from english to several language
Hello = 您好(chinese)
Hello = नमस्ते (Hindi)
Hello = привет (Russian)
how does a web page identify the language character on the basis of utf-8 ? is it possible that different computer will show different character ?

The "magic" behind UTF-8 is called Unicode. It is one of several encodings of the standard.
Unicode does have character ranges that correspond to languages and many characters are specifically associated with a language.
I suggest reading this - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

UTF-8 is a variable-length byte encoding of Unicode, the character numbering system for all languages.
Internet web pages by default base on ISO-8859-1, so called Latin-1. Other charsets can be set by:
Header lines of text, preceding an empty line and then the HTML content text.
There a header line:
Content-Type: text/html; charset=UTF-8
A Java EE server needs to do for this:
response.setContentType("text/html; charset=UTF-8");
In the HTML head a meta tag
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
...

Foreign characters in website

I found a website that contains the string "donâ€™t". The obvious intent was the word "don't". I looked at the source expecting to see some character references, but didn't (it just shows the literal string "donâ€™t". A Google search yielded nothing (expect lots of other sites that have the same problem!). Can anyone explain what's happening here?
Edit: Here's the meta tag that was used:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Would this not cause the page to be served up as Latin-1 in the HTTP header?

In your browser, switch the page encoding to "UTF-8". You're seeing a right single quote character, which is encoded by the octets 0xE2 0x80 0x99 in UTF-8. In your charset, windows-1252, those 3 octets render as "â€™". The page should be explicitly specifying UTF-8 as its charset either in the HTTP headers or in an HTML <meta> tag, but it probably isn't.

According to Character encondings in HTML a lemme in wikipedia:
HTML (Hypertext Markup Language) has
been in use since 1991, but HTML 4.0
(December 1997) was the first
standardized version where
international characters were given
reasonably complete treatment. When an
HTML document includes special
characters outside the range of
seven-bit ASCII two goals are worth
considering: the information's
integrity, and universal browser
display.
I suppose the site you checked, isn't impelemented with this in mind.

This has all got to do with encoding. Take a look back at the source, is there a tag at the top specifying it (charset)? My guess is it'll be UTF8 - although it could be something completely different.

This thread explains all. A combination of using a weird UTF-8 apostrophe character (probably originating from a Word Document), on a server that probably reports its encoding as non-UTF-8, despite the page having UTF characters (and possible even correctly reporting its own encoding).

HTML character entities and character encoding set

When including HTML entities in an HTML document, do the entities need to be from the same character encoding set that the document is specified to be using?
For example, if I am going to use the copyright sign in an HTML document that is specified as UTF-8, is it necessary to use the Unicode HTML entity (©) or is it okay to use other entities, such as the ASCII HTML entity (©)?
Please explain your answer. I am aware that it will "work", but is there a case where it will not work?
Thanks!

© and © specify the same character - 169 is equivalent to hexadecimal A9. These both specify a copyright symbol. Character entities in HTML always refer to Unicode code points, this is covered in the HTML 4 Standard. Thus, even if your character set changes, your entities still refer to the same characters.
This also means that you can encode characters that don't actually appear within your character set of choice. I just created a document in the ISO-8859-1 character set, but it includes a Greek lambda. Also, ASCII is not able to directly encode a copyright symbol, but it can through character entities.
Edit: Reading the comments on the other answer, I want to clarify this a bit. If you are using UTF-8 as the character encoding for your document, you can, within the raw HTML source, write a copyright symbol just as-is. (You need to find some way to input it, of course: copy-paste being the usual.) UTF-8 will allow you to directly encode any symbol you want. ISO-8859-1 is much more limited, and ASCII even more so. For example, within my HTML, if my document is a UTF-8 document, I can do:
Hi there. This document is ©2010. Good day!
or:
Hi there. This document is ©2010. Good day!
or:
Hi there. This document is ©2010. Good day!
The first is only valid if the character set supports "©". The other two are always valid, but less readable. Whatever text editor you're using, if it is worth its weight, should be able to tell you what character set it is encoding the document in.
If you do this, you need to make sure your web server informs the client of the correct character set, or that your document declares it with something like:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I've used UTF-8 there as an example. XHTML should have the character set in the opening <?xml ... ?> tag.

The beauty of the UTF-8 encoding is that you can actually just include the binary character. You don't need to encode it as an entity at all. Thusly: ©
Oh, you just want to know the difference between the two entities? There is none. One describes the byte in Hex and the other in decimal.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008