Charset for foreign languages - html

I'm currently writing some HTML that contains Urdu, Farsi and Simplified Chinese characters. I'm having trouble finding good resources online on which charset to use:
<meta http-equiv="Content-Type" content="text/html; charset=???" />
Any suggestions?

UTF-8 can encode any character in any language in the Unicode standard, is ASCII-compatible and is well-supported these days. There's little reason not to use it for everything.
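As a rough illustration of the point above, the following Python sketch encodes short sample words in the three scripts from the question and checks that they survive a round trip through UTF-8. The sample words and the helper name `utf8_round_trip` are made up for this example; the last check shows the ASCII compatibility mentioned above.

```python
# Sample words are illustrative only; any Unicode text behaves the same way.
samples = {
    "Urdu": "اردو",
    "Farsi": "فارسی",
    "Simplified Chinese": "简体中文",
    "ASCII": "html",
}


def utf8_round_trip(text: str) -> bytes:
    """Encode text as UTF-8 and confirm it decodes back unchanged."""
    data = text.encode("utf-8")
    assert data.decode("utf-8") == text
    return data


encoded = {name: utf8_round_trip(text) for name, text in samples.items()}

# Pure ASCII text produces exactly the same bytes in UTF-8 as in ASCII.
ascii_compatible = encoded["ASCII"] == "html".encode("ascii")

for name, data in encoded.items():
    print(f"{name}: {len(data)} bytes")
```

Non-ASCII characters simply take more than one byte each, which is why the byte counts are larger than the character counts for the first three samples.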

I suggest using UTF-8, that can encode any Unicode character.
But apart from declaring the encoding in the document itself, it's more important that your file is actually encoded in UTF-8. So get yourself an editor that can handle this encoding properly, and also declare the encoding in the HTTP header, as it takes priority over the declaration in the document.
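One way to picture "encode the bytes as UTF-8 and say so in the HTTP header" is a tiny WSGI application. This is only a sketch using Python's standard `wsgiref` helpers; the names `PAGE`, `app` and `call_app` are invented for the example and the page content is a placeholder.

```python
from wsgiref.util import setup_testing_defaults

# Placeholder page; the meta tag repeats what the HTTP header already says.
PAGE = (
    "<!DOCTYPE html>\n"
    '<html><head><meta charset="utf-8"><title>سلام</title></head>\n'
    "<body>你好</body></html>\n"
)


def app(environ, start_response):
    """Serve PAGE as UTF-8 bytes and declare that encoding in the header."""
    body = PAGE.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app():
    """Call the app once without a network, returning headers and body."""
    environ = {}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured, body
```

The important part is that the same encoding name is used for `encode()` and in the `Content-Type` header; the meta tag is then just a fallback for when the file is opened without a server.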

UTF-8

Related

Displaying UTF-16 characters on web browser

I printed some UTF-16 encoded characters and tried to display them in Firefox, and it displayed them as �.
So I went to Tools->Encoding and changed the encoding from UTF-8 to UTF-16 (I also tried changing the charset directly in the HTML). However, when I did that, my page was completely flooded with symbols:
਍ℼ佄呃偙⁅瑨汭ാ㰊瑨汭ാഊ㰊敨摡ാ †ഠ †㰠楴汴㹥楬畮⁸‭楆敲潦⁸楤灳慬獹朠牡慢敧挠慨慲瑣牥⁳湩氠敩⁵景眠扥 瀠条⁥‭畓数⁲獕牥⼼楴汴㹥਍††氼湩敲㵬猢潨瑲畣⁴捩湯•牨晥∽瑨灴⼺振湤献瑳瑡捩渮瑥猯灵牥獵牥椯杭是癡捩湯椮潣㸢਍††氼湩敲㵬愢灰敬琭畯档椭潣≮栠敲㵦栢瑴㩰⼯摣⹮獳慴楴⹣敮............
How can web browsers display UTF-16 characters without wrecking the page?
The “flooded with symbols” excerpt looks like an HTML document that is UTF-8 encoded but treated as if it were UTF-16 encoded. Or it might contain mostly UTF-8 data with some UTF-16 encoded data thrown in, which won’t work.
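The first explanation can be checked with a few lines of Python. This sketch assumes the page began with a CRLF line break followed by `<!DOCTYPE html`, and that the bytes were read as little-endian UTF-16; under those assumptions it reproduces the first characters of the excerpt in the question. The function name `misread_as_utf16le` is made up for the example.

```python
def misread_as_utf16le(text: str) -> str:
    """Encode text as UTF-8, then decode the same bytes as UTF-16LE."""
    data = text.encode("utf-8")
    if len(data) % 2:
        data += b" "  # pad to an even length so every byte pair decodes
    return data.decode("utf-16-le", errors="replace")


# Assumed start of the page: a CRLF followed by the doctype.
garbled = misread_as_utf16le("\r\n<!DOCTYPE html")
print(garbled)
```

Every pair of ASCII bytes is fused into one 16-bit code unit, and most of those land in the CJK range, which is why the page looks like it was flooded with Chinese.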
If you save your data as properly UTF-16 encoded and declare the encoding in HTTP headers and/or meta tags, then some browsers will display it OK, some won’t. Search engines generally fail to process UTF-16, and UTF-16 is mostly not used and should not be used on the web, except by mutual agreement between consenting well-informed partners.
Firefox could not figure out the correct charset of your document.
For web pages, a meta tag in the head should be used to indicate the content's charset.
It should be placed at the beginning of the HTML file, indicating which charset the browser should use for the rest of the file.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
So the browser is charset blind until it reads that line. But using utf-8 is no problem. Because every character up to that point is encoded in utf-8 the same way it would be in ASCII (same goes for latin-1 and others). That's not the case in utf-16.
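A quick Python check makes this concrete. The sketch below takes the markup up to the charset declaration (the variable name `PREFIX` is just for illustration) and compares its bytes under several encodings.

```python
# Markup the browser has to read before it knows the declared charset.
PREFIX = (
    "<html>\n<head>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset='
)

ascii_bytes = PREFIX.encode("ascii")

# ASCII-compatible encodings produce identical bytes for this prefix.
same_in_utf8 = PREFIX.encode("utf-8") == ascii_bytes
same_in_latin1 = PREFIX.encode("latin-1") == ascii_bytes

# UTF-16 uses two bytes per character here, so the bytes differ.
utf16_bytes = PREFIX.encode("utf-16-le")
same_in_utf16 = utf16_bytes == ascii_bytes

print(same_in_utf8, same_in_latin1, same_in_utf16)
print(utf16_bytes[:8])
```

Because the UTF-16 bytes interleave a zero byte with every ASCII character, a parser that assumes an ASCII-compatible encoding cannot find the word `charset` in them.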
W3C says:
There are three different Unicode character encodings: UTF-8, UTF-16
and UTF-32. Of these three, only UTF-8 should be used for Web content.
So you should use utf-8. But if you still want to try something with utf-16, use a BOM at the beginning of your file. That gives your browser a better chance of figuring out the encoding and decoding the content properly.
This other answer is very succinct about utf-16 usage.
Joel, meanwhile, gives a full lesson on character encoding and on why HTML puts its declaration inside the content rather than only in header information.
Sending UTF-16 data as a Web page to browsers is an XSS risk in older browsers. (See another answer.) Don’t do it. Instead, convert the data to UTF-8 on the server and send UTF-8 over HTTP.
The way to make this work is for the page to say what encoding it's in. In the case of UTF-16, it also helps to include a BOM. The "flooded with Chinese" effect is most likely because your page is UTF-16LE but the browser treated it as UTF-16BE or vice versa...
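To show what the BOM buys you, here is a small Python sketch. It uses the BOM constants from the standard `codecs` module; the sample text and variable names are invented for the example.

```python
import codecs

TEXT = "<p>£ and 中文</p>"

# The same text in both byte orders, each prefixed with its BOM.
little = codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + TEXT.encode("utf-16-be")

# The generic "utf-16" decoder reads the BOM to pick the byte order.
decoded_little = little.decode("utf-16")
decoded_big = big.decode("utf-16")

# Without a BOM the reader has to guess, and a wrong guess swaps every
# byte pair, turning ASCII markup into unrelated characters.
wrong_guess = TEXT.encode("utf-16-le").decode("utf-16-be", errors="replace")

print(decoded_little == TEXT, decoded_big == TEXT)
print(wrong_guess)
```

This only illustrates the decoding step; it does not change the advice above that UTF-8 is the better choice for web pages.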

£ sign not working in utf-8 charset (HTML5)

I recently updated a page I'm working on to work in HTML 5. For some reason when I changed my headers the £ sign that is included on all of the prices is no longer recognised and is showing as a white '?' in a black diamond.
Can anyone explain how to fix this? I have a feeling it has something to do with the <meta charset="utf-8"> line in my head, but could be mistaken.
Any help would be much appreciated!
Thanks!
You need to actually encode your HTML document in UTF-8. <meta charset="utf-8"> tells the browser that the document is supposedly encoded in UTF-8 and that the browser should treat it as such. The Unicode replacement character � means an invalid UTF-8 byte sequence was found at that point, which means your document is not actually encoded in UTF-8.
If you tell the browser it's UTF-8, then it must be UTF-8 that you send. It sounds like you're not sending valid UTF-8 sequences. You can probably fix this by doing one of the following:
Make sure you're saving the script(s) as UTF-8 in your editor. (Recommended)
Save the script(s) as ISO-8859-1, and use utf8_encode() on any output.
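The failure and both fixes can be reproduced in Python. This sketch assumes the page was saved as ISO-8859-1 (Latin-1), which is a common cause but not the only possible one; the price string is invented for the example.

```python
PRICE = "£9.99"

# What ends up on disk if the editor saves the file as ISO-8859-1.
latin1_bytes = PRICE.encode("latin-1")

# What the browser does when told the bytes are UTF-8: the lone 0xA3 byte
# is not a valid UTF-8 sequence, so it becomes U+FFFD (the black diamond).
shown = latin1_bytes.decode("utf-8", errors="replace")

# Fix 1: save the file as UTF-8 in the first place.
utf8_bytes = PRICE.encode("utf-8")

# Fix 2: convert ISO-8859-1 bytes to UTF-8 on output, which is the
# conversion PHP's utf8_encode() performs.
converted = latin1_bytes.decode("latin-1").encode("utf-8")

print(latin1_bytes, utf8_bytes, shown)
```

In UTF-8 the pound sign takes two bytes (`C2 A3`), whereas ISO-8859-1 stores it as the single byte `A3`.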

Foreign characters in website

I found a website that contains the string "donâ€™t". The obvious intent was the word "don't". I looked at the source expecting to see some character references, but didn't (it just shows the literal string "donâ€™t"). A Google search yielded nothing (except lots of other sites that have the same problem!). Can anyone explain what's happening here?
Edit: Here's the meta tag that was used:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Would this not cause the page to be served up as Latin-1 in the HTTP header?
In your browser, switch the page encoding to "UTF-8". You're seeing a right single quote character, which is encoded by the octets 0xE2 0x80 0x99 in UTF-8. In your charset, windows-1252, those 3 octets render as "â€™". The page should be explicitly specifying UTF-8 as its charset either in the HTTP headers or in an HTML <meta> tag, but it probably isn't.
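The byte-level explanation above is easy to verify in Python, whose `cp1252` codec corresponds to windows-1252. The variable names are only for illustration.

```python
apostrophe = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# The three octets mentioned above.
utf8_bytes = apostrophe.encode("utf-8")

# Reading those octets as windows-1252 yields three separate characters.
misread = utf8_bytes.decode("cp1252")

# The same thing happening to a whole word, and how to undo it.
word = "don\u2019t".encode("utf-8").decode("cp1252")
repaired = word.encode("cp1252").decode("utf-8")

print(utf8_bytes.hex(" "), misread, word, repaired)
```

The repair step works here only because no information was lost; it is a diagnostic trick, not a substitute for declaring the right charset.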
According to the Wikipedia article Character encodings in HTML:
HTML (Hypertext Markup Language) has
been in use since 1991, but HTML 4.0
(December 1997) was the first
standardized version where
international characters were given
reasonably complete treatment. When an
HTML document includes special
characters outside the range of
seven-bit ASCII two goals are worth
considering: the information's
integrity, and universal browser
display.
I suppose the site you checked isn't implemented with this in mind.
This has all got to do with encoding. Take a look back at the source, is there a tag at the top specifying it (charset)? My guess is it'll be UTF8 - although it could be something completely different.
This thread explains it all. The cause is a combination of a typographic apostrophe encoded in UTF-8 (probably originating from a Word document) and a server that probably reports the encoding as something other than UTF-8, despite the page containing UTF-8 characters (and possibly even correctly declaring its own encoding).

HTML and character encoding vs HTML Entity

When writing an HTML document, is it acceptable to use the special character directly as regular text, such as the capital letter C with a cedilla underneath (Ç), or should I use the HTML entity name of this character, &Ccedil;?
I have seen both being used in practice, but surely there are rules governing the appropriate usage of this, as well as advantages to one way over another. For instance, this website maintains the raw-form of this character, but other websites may end up rendering it as a square block.
Real characters:
Are easier to type if your system is set up for a language that uses those characters
Produce more readable code
Save bytes
HTML entities:
Let you more or less forget about character encoding
Obviously, characters with special meaning in HTML (<, &, etc) still need to be represented by entities.
If you're using UTF-8 character encoding, then most entity characters (other than &, > and <) become redundant.
If you're not using UTF-8, then you need entities for every character your chosen encoding cannot represent.
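For comparison, this is roughly how the two approaches look when generated from Python, using the standard `html` module and the built-in `xmlcharrefreplace` error handler. It is a sketch of the idea, not a recommendation for either style.

```python
import html

# Characters with special meaning in HTML must be escaped in any encoding;
# the C-cedilla is left alone because it is an ordinary character.
escaped = html.escape("Ç < D & E")

# A named entity and the real character are the same thing once parsed.
parsed = html.unescape("&Ccedil;")

# If the output must be pure ASCII, characters outside ASCII can be
# written as numeric character references instead.
ascii_only = "Ç".encode("ascii", errors="xmlcharrefreplace")

print(escaped, parsed, ascii_only)
```

With UTF-8 output only the first step is needed; the numeric references in the last step are the fallback for encodings that cannot represent the character.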
It all depends on the character encoding of the document. If you're unsure whether you should use the regular text or the entity version, you could run your page through the W3C Validator.
Consider this code:
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>Stuff</title>
</head>
<body>
<p>©</p>
<p>&copy;</p>
</body>
</html>
The document declares UTF-8, but if the file is actually saved in a different encoding (ISO-8859-1, for example), the validator returns an error:
Sorry, I am unable to validate this document because on line 7 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
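A check similar in spirit to the validator's can be sketched in Python. This is not the validator's actual implementation; the function name `first_invalid_utf8_line` is hypothetical, and the sample assumes the file was saved as ISO-8859-1 with a raw copyright sign on line 7.

```python
from typing import Optional


def first_invalid_utf8_line(data: bytes) -> Optional[int]:
    """Return the 1-based line of the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Count the newlines before the offending byte to get its line.
        return data.count(b"\n", 0, err.start) + 1
    return None


DOC = "\n".join([
    "<html>",
    "<head>",
    '<meta http-equiv="content-type" content="text/html; charset=utf-8" />',
    "<title>Stuff</title>",
    "</head>",
    "<body>",
    "<p>©</p>",        # raw character: only valid if saved as UTF-8
    "<p>&copy;</p>",   # entity: plain ASCII, safe in any encoding
    "</body>",
    "</html>",
])

bad_line = first_invalid_utf8_line(DOC.encode("latin-1"))
good_line = first_invalid_utf8_line(DOC.encode("utf-8"))
print(bad_line, good_line)
```

The entity on line 8 never causes trouble because it consists only of ASCII characters; the raw character on line 7 is fine only when the bytes on disk match the declared encoding.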

How can I properly display German characters in HTML?

My pages contain German characters. I have typed the text between the HTML tags, but the browser displays some characters differently. Do I need to include anything in the HTML to properly display German characters?
<label> ausgefüllt </label>
It seems you need some basic explanations about something that unfortunately even most programmers don't understand properly.
Files like your HTML page are saved and transmitted over the Internet as a sequence of bytes, but you want them displayed as characters. In order to translate bytes into characters, you need a set of rules called a character encoding. Unfortunately, there are many different character encodings that have historically emerged to handle different languages. Most of them are based on the American ASCII encoding, but as soon as you have characters outside of ASCII such as German umlauts, you need to be very careful about which encoding you use.
The source of your problem is that in order to correctly decode an HTML file, the browser needs to know which encoding to use. You can tell it so in several ways:
The "Content-Type" HTTP header
The HTML META tag
The XML encoding attribute if you use XHTML
So you need to pick one encoding, save the HTML file using that encoding, and make sure that you declare that encoding in at least one of the ways listed above (and if you use more than one make damn sure they agree). As for what encoding to use, Germans often use ISO/IEC 8859-15, but UTF-8 is increasingly becoming the norm, and can handle any kind of non-ASCII characters at the same time.
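The "make sure they agree" part can be checked mechanically. The following Python sketch compares the charset in a Content-Type header value with the one in a meta tag. The function names and the regular expression are assumptions made for this example, and the pattern is far simpler than what browsers really do when scanning for a charset.

```python
import re
from email.message import Message
from typing import Optional

# Simplified pattern: finds charset=... inside a <meta ...> tag.
META_RE = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9._-]+)", re.IGNORECASE
)


def header_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


def meta_charset(html_bytes: bytes) -> Optional[str]:
    """Extract the charset declared in a meta tag near the top of the file."""
    match = META_RE.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def declarations_agree(content_type: str, html_bytes: bytes) -> bool:
    """True if the header and the meta tag do not contradict each other."""
    found = {header_charset(content_type), meta_charset(html_bytes)}
    found.discard(None)
    return len(found) <= 1


page = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>'
print(declarations_agree("text/html; charset=utf-8", page))
```

Note that this only compares the declarations with each other; whether the file's bytes really are in the declared encoding is a separate question.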
UTF-8 is your friend.
Try
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=utf-8">
and check which encoding your webserver sends in the header.
If you use PHP, you can send your own headers in this way (you have to put this before any other output):
<?php header('Content-Type: text/html; charset=utf-8'); ?>
Also double-check that you saved your document in UTF-8.
Try the solution in blog post German characters encoding issue (2012-05-10):
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
Have you tried &uuml; (ü) and &Uuml; (Ü)?
You can find how to type other letters here.
Declare <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=utf-8">
and when saving the file, for example in Notepad, choose UTF-8 as the encoding in the Save As dialog instead of just accepting the default.
This should render the characters ok.
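If the file is generated by a program rather than typed in an editor, the same rule applies: name the encoding explicitly when writing. A minimal Python sketch, with an invented file name and helper `save_utf8`:

```python
import tempfile
from pathlib import Path

PAGE = '<meta charset="utf-8">\n<label> ausgefüllt </label>\n'


def save_utf8(path: Path, text: str) -> bytes:
    """Write text as UTF-8 and return the raw bytes that landed on disk."""
    path.write_text(text, encoding="utf-8")
    return path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    raw = save_utf8(Path(tmp) / "page.html", PAGE)

# The umlaut is stored as the two-byte UTF-8 sequence C3 BC.
print(b"ausgef\xc3\xbcllt" in raw)
```

Leaving out `encoding="utf-8"` makes the result depend on the platform default, which is the scripted equivalent of saving from an editor without checking the encoding.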
You may try the utf8_encode() or utf8_decode() functions and check whether either of them works.
For example <?php echo utf8_encode('ausgefüllt'); ?>
Hope it works.
Sounds like a character encoding issue, in that the file is saved as a different character encoding to what the webserver is saying it is.
I don't like the use of HTML entities (like &uuml;); they are only needed when there is something wrong with your character set.
In short:
The RIGHT way is to fix your character set.
The EASY way is to just use entities. You may not ever see any problems with this.
Tracking down a character set error can be very difficult. If you give us a URL where we can see the problem, we can probably give you a good hint about where to look.
Save your file as UTF-8, and use this META:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>