HTML encoding: eastern european languages - html

My program is fetching messages from a database, which contains English, German and several Eastern European languages. My Python script sets the encoding via:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
and uses the values, which are fetched correctly from the database (I checked this in my logs).
Unfortunately all browsers I tested (IE8, Firefox 3.0.10, Opera 9.64) switch based on my local language settings to:
Western ISO-8859-1 in Firefox
Western European (Windows) in IE
Automatic in Opera
Everything works fine as soon as I switch the character encoding manually in the browser.
The same happens if I manually generate the HTML file using UTF-8 (tested with TextMate and jEdit), although both editors display the content correctly.
That works fine for English and German but not, for example, for Russian. How can I force the "correct" character encoding?
ANSWER
The following entry within the VirtualHost (Apache configuration) section did the trick for me:
AddDefaultCharset utf-8
Many thanks for pointing me in the right direction, that helped a lot!

When the document is transferred over HTTP, the HTTP header information is crucial:
[…] conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):
An HTTP "charset" parameter in a "Content-Type" field.
A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
The charset attribute set on an element that designates an external resource.
So make sure you declare the character encoding in the Content-Type header field and not just inside the document.
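As a rough illustration of declaring the encoding in the Content-Type header field, here is a minimal sketch in Python. The question does not say which framework the script uses, so this assumes a plain WSGI application; the names `app` and `call_app` are hypothetical helpers introduced only for this example.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Minimal WSGI app that declares UTF-8 in the HTTP header itself."""
    body = "<!DOCTYPE html><html><body><p>Здравствуйте</p></body></html>"
    payload = body.encode("utf-8")  # the bytes sent must match the declared charset
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ]
    start_response("200 OK", headers)
    return [payload]


def call_app(application):
    """Invoke the app once without a server and capture status, headers, body."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a fake but valid WSGI environ
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(application(environ, start_response))
    return captured["status"], captured["headers"], body
```

Calling `call_app(app)` returns the status line, the header dictionary, and the body bytes, which makes it easy to confirm that the charset parameter is really present in the response header and not only inside the document.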

Related

Why are Arabic and Chinese characters rendered correctly even without using meta charset="utf-8"?

I am very new to HTML. Recently I encountered the <meta charset="utf-8"> tag which ensures letters and characters are rendered properly in a browser.
But I was wondering why even if I do not specify UTF-8 all letters and characters are displayed perfectly anyway?
The page you're sending to the browser uses a specific character encoding (e.g. UTF-8). The browser must interpret the page in the correct encoding to read it correctly (i.e. as intended) and display the correct characters. There are several ways in which the browser determines what encoding to use, which it falls back to successively:
HTTP Content-Type header
HTML meta tags
any built-in heuristics
the browser/system default encoding
If the page displays correctly without an HTML meta tag, that means one of the other mechanisms caused the browser to choose to interpret the page as UTF-8. Probably your web server is outputting an HTTP Content-Type header, or your browser/system's default is UTF-8.
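The fallback order described above can be sketched as a small function. This is a simplification for illustration, not the real browser algorithm; the function name `pick_encoding`, its parameters, and the `windows-1252` default are assumptions made for this example.

```python
from typing import Optional


def pick_encoding(
    http_charset: Optional[str],
    meta_charset: Optional[str],
    sniffed: Optional[str],
    default: str = "windows-1252",  # assumed locale default, varies by browser/system
) -> str:
    """Return the first encoding that is available, in priority order."""
    # 1. HTTP Content-Type header, 2. HTML meta tag, 3. built-in heuristics
    for candidate in (http_charset, meta_charset, sniffed):
        if candidate:
            return candidate.strip().lower()
    # 4. browser/system default
    return default.lower()
```

For example, `pick_encoding(None, None, None)` falls all the way through to the default, while `pick_encoding("UTF-8", "ISO-8859-1", None)` returns `"utf-8"` because the header wins over the meta tag.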
This is because the default character encoding for HTML5 is UTF-8.
See also this documentation:
The default character-set for HTML5 is UTF-8.
Example <meta charset="UTF-8">
The Unicode Consortium developed the UTF-8 and UTF-16 standards, because the ISO-8859 character-sets are limited, and not compatible with a multilingual environment.
The Unicode Standard covers (almost) all the characters, punctuations, and symbols in the world.
All HTML5 and XML processors support UTF-8, UTF-16, Windows-1252, and
ISO-8859.

japanese character is not supported in localhost

I have a project which has Japanese characters in it. When I run the version that is already on the server (the live version), the Japanese characters are displayed. However, if I run the same files on localhost, with no changes in the code, the Japanese characters are displayed as something like "レストラン".
All files include . and I'm using Google Chrome.
What should I do to make it support Japanese characters?
Any help would be appreciated.
Add following:
<meta charset="Shift-JIS"/>
to your file if you use HTML5, or
<meta http-equiv="Content-Type" content="text/html; charset=Shift-JIS" />
if you use HTML 4.01.
The reason it works over the web server is that the server transmits HTTP headers with the correct encoding. With your local copy of the .html file there are no such headers, so the browser checks the <meta/> tags (which are missing), and if it fails to find any, it guesses the encoding, which in this case is UTF-8. (If you use a web server on localhost too, it might be misconfigured.) In any case, it's good practice to always include charset information.
@AshishAcharya I'm pretty sure OP uses Shift-JIS rather than UTF-8. The page is rendered as UTF-8 though.
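When it is unclear which pair of encodings is involved, one way to test a hypothesis is to reproduce the garbling. The sketch below assumes the word is レストラン stored as UTF-8 and decoded as Windows-1252 (`cp1252` in Python); the helper name `misdecode` is hypothetical. Under that assumption the result begins with the same "ãƒ¬" sequence quoted in the question.

```python
def misdecode(text: str, actual: str, assumed: str) -> str:
    """Encode text with its actual encoding, then decode it with the wrong one."""
    return text.encode(actual).decode(assumed)


original = "レストラン"

# UTF-8 bytes read as Windows-1252: every 3-byte character becomes 3 Latin symbols.
garbled = misdecode(original, "utf-8", "cp1252")

# Reversing the two steps recovers the text, which confirms the hypothesis.
restored = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(restored == original)
```

If a different pair (for example Shift-JIS bytes read as UTF-8) were the cause, the same helper would either raise a `UnicodeDecodeError` or produce a visibly different pattern.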

How come the following characters are displayed in ISO-8859-1?

I have the following html:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
</head>
<body>
会意字 / 會意字 huìyìzì
</body>
When I run it in Firefox, it displays the Chinese characters just fine. How come it works with the ISO-8859-1 character set? I thought you needed UTF-8?
I can't reproduce your successful rendering:
… but HTML 5 defines a fairly complex character encoding detection method which doesn't pay any attention to <meta> until step 9.
In general, you should avoid encodings other than UTF-8 and definitely should not lie about the encoding of the document.
The most probable explanation is that the document is in fact UTF-8 encoded and the browser treats it that way, despite the meta tag. According to the HTML5 encoding sniffing algorithm, which largely reflects browser behavior, the meta tag is ignored if any of the following is true:
The user has instructed (via e.g. a View → Encoding command) the browser to use a specific encoding.
The page starts with bytes that represent the Byte Order Mark in UTF-8 or UTF-16. In practice, it starts that way if the file was saved in an editor with a command like “Save as UTF-8 (with BOM)”.
HTTP headers specify an encoding in a Content-Type header.
You can find out which of these is the cause by using e.g. Rex Swain’s HTTP viewer. It lets you see both the HTTP response headers and the actual data as bytes. Developer Tools in browsers have similar features.
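The Byte Order Mark case can also be checked directly on the file's bytes. The following is a minimal sketch using the BOM constants from Python's standard `codecs` module; the function name `bom_encoding` is an assumption for this example, and only the UTF-8 and UTF-16 marks mentioned above are handled.

```python
import codecs
from typing import Optional


def bom_encoding(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16be"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16le"
    return None
```

Reading the first few bytes of the saved file in binary mode and passing them to `bom_encoding` shows whether the editor added a BOM that would take precedence over the meta tag.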

HTML5 Encoding & Cyrillic

Something that made me curious - supposedly the default character encoding in HTML5 is UTF-8. However if I have a plain simple HTML file with an HTML5 doctype like the code below, I get:
"hello" in Russian: "ЗдраÑтвуйте"
In Chrome 33+, Safari 6, IE11, etc.
<!DOCTYPE html>
<html>
<head></head>
<body>
<p>"hello" in Russian is "здраствуйте"</p>
</body>
</html>
What gives? Shouldn't the browser utilize the UTF-8 unicode standard and display the text correctly? I'm using Coda which is set to save html files with UTF-8 encoding by default so that's not the problem.
The text data in the example is UTF-8 encoded text misinterpreted as windows-1252 encoded. The reason is that the encoding has not been specified and browsers are forced to make a guess. To fix this, specify the encoding; see the W3C page Character encodings. Two simple ways that work independently of server settings, as long as the server does not send wrong encoding information in HTTP headers:
1) Save the file as UTF-8 with BOM (there is probably an option for this in your authoring program).
2) Add the following tag into the head part:
<meta charset=utf-8>
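If the authoring program has no "UTF-8 with BOM" option, the first approach can be approximated with a short script. This sketch relies on Python's `utf-8-sig` codec, which writes the BOM when encoding; the function names and the temporary file used in `demo` are assumptions for illustration.

```python
import os
import tempfile


def save_with_bom(path: str, html: str) -> None:
    """Write html as UTF-8 with a leading Byte Order Mark."""
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(html)


def starts_with_utf8_bom(path: str) -> bool:
    """Check the raw bytes of the file for the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"


def demo() -> bool:
    """Save a small page to a temporary directory and verify the BOM is there."""
    folder = tempfile.mkdtemp()
    path = os.path.join(folder, "hello.html")
    save_with_bom(path, '<p>"hello" in Russian is "здраствуйте"</p>')
    return starts_with_utf8_bom(path)
```

The second approach, the meta tag, needs no tooling, so the script is only useful when the BOM route is preferred or the editor cannot be configured.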
There is no single default encoding specified for HTML5. On the contrary, browsers are expected to make guesses when no encoding has been declared. This is a fairly complex process, described in 8.2.2.2 Determining the character encoding.
If you want to be sure which charset the browser will use, you must have the following in your page head:
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
otherwise you are at the mercy of local settings and the browser's automatic detection.

Html charset and support for special (national) characters

I have a website in HTML5. Most of the content there is in Czech, which has some special symbols like "ř, č, š" etc...
I searched the internet for recommended charsets and I got these answers: UTF-8, ISO 8859-2 and Windows-1250.
<meta http-equiv="Content-Type" content="text/html; charset=ISO 8859-2" />
I tried UTF-8, which didn't work at all, and then settled on ISO 8859-2. I tested my website on my computer in the latest versions of Chrome, Firefox, IE and Opera. Everything worked fine, but when I tested my website at http://browsershots.org/ , these characters were not displayed correctly (in the same browsers that I used for testing!).
How is that possible? How can I ensure that all characters are displayed correctly in all web browsers? Is it possible that the usage of HTML5 causes these problems (since it's not fully supported by all browsers, but I am not using any advanced functions)?
Thanks for any hints and troubleshooting tips!
If you are using HTML5, try this short charset declaration:
<meta charset="UTF-8">
Additionally, check your HTML file's encoding. You can do it in Notepad++, menu Encoding -> Encode in UTF-8.
The important thing is that the actual encoding of the data coincides with the declared encoding. From the description, it seems that the actual encoding is ISO-8859-2, so you should declare it. Note that the name of the encoding has no space but hyphens. (I wonder whether you used it with a space – I would expect browsers to ignore the tag then.) The following is the simplest declaration:
<meta charset=ISO-8859-2>
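Whether the actual bytes are at least compatible with a declared encoding can be checked with a short script. This is a minimal sketch with a hypothetical helper `decodes_as`; note that a successful decode is only a necessary condition, because single-byte encodings such as ISO-8859-2 accept any byte sequence and can silently produce the wrong characters.

```python
def decodes_as(data: bytes, declared: str) -> bool:
    """Return True if data can be decoded with the declared encoding at all."""
    try:
        data.decode(declared)
    except UnicodeDecodeError:
        return False
    return True


czech = "ř, č, š"
latin2_bytes = czech.encode("iso-8859-2")  # what an ISO-8859-2 editor would save
utf8_bytes = czech.encode("utf-8")         # what a UTF-8 editor would save

# ISO-8859-2 bytes are not valid UTF-8, so a UTF-8 declaration cannot be right.
print(decodes_as(latin2_bytes, "utf-8"))
# UTF-8 bytes decode "successfully" as ISO-8859-2, but to different characters.
print(utf8_bytes.decode("iso-8859-2") == czech)
```

In practice this means a failed UTF-8 decode is strong evidence that the file was saved in a legacy encoding, while a clean ISO-8859-2 decode still has to be checked by eye.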
I would not trust browsershots.org to get things like this right. Testing on actual browsers is more useful.
UTF-8 is the best-supported character set for international usage. If it does not display correctly, you should ensure that your file is saved in UTF-8 format. Even Notepad has a "UTF-8" option in its save dialog.
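Converting an existing file to UTF-8 can also be scripted instead of done through an editor's save dialog. The sketch below assumes the current encoding is known (ISO-8859-2 in this question); the function name `transcode` is hypothetical.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


# Bytes as an ISO-8859-2 editor would have saved them.
legacy = "ř, č, š".encode("iso-8859-2")

# After conversion the same text is stored as UTF-8.
converted = transcode(legacy, "iso-8859-2")
print(converted.decode("utf-8"))
```

After converting, the declaration in the page (and any HTTP header) has to be changed to UTF-8 as well, since the declared and actual encodings must still agree.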