Latin ISO recognizes character but UTF-8 does not in HTML document

I have the following code:
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<p>Schrödinger
</body>
When I run it in my browser I get:
Schr�dinger
When I change the encoding to latin ISO:
<html>
<head>
<meta charset="ISO-8859-1">
</head>
<body>
<p>Schrödinger
</body>
It works fine:
Schrödinger
Curiously, using the code snippet tool on this site, UTF-8 works fine:
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<p>Schrödinger
</body>
</html>
Using UTF-8 should work even better than Latin ISO (it supports more characters).
What can the problem be?
I tested in both Chrome and Firefox. I am using Windows 7 on an old PC.

You are right that UTF-8 can represent more characters than ISO-8859-1, but it also represents the same characters differently.
To understand what that means, you need to think about the binary representation that a computer uses for text. When you save some text to a file, what you are actually doing is writing some sequence of ones and zeroes to disk; when you load that file in a web browser, it has to look at that sequence of ones and zeroes and decide what to display.
A character encoding is the way that the browser decides what to display for each sequence of ones and zeroes.
In ISO-8859-1, the character "ö" is written as the single byte 11110110. In UTF-8, that same character is instead written as the two bytes 11000011 10110110, and the lone byte 11110110 would mean something else (in fact, because of the way UTF-8 works, it marks the start of a multi-byte sequence, so on its own it can't be displayed).
Your file contains 11110110, so the correct thing to do is tell the browser "read this as ISO-8859-1, please". Alternatively, you can open the file in an editor that "knows" both encodings and tell that editor to rewrite it as UTF-8, so the character will be saved as 11000011 10110110 instead.
This is what happens when you paste the character here: your browser knows that Stack Overflow wants the UTF-8 version, and converts it to 11000011 10110110 for you.
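If you want to see those byte sequences for yourself, here is a minimal sketch (plain Java, assuming Java 7+ for StandardCharsets; the class name is just illustrative) that prints the raw bytes of "ö" under both encodings:

import java.nio.charset.StandardCharsets;

public class OumlautBytes {
    public static void main(String[] args) {
        String s = "ö";

        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1); // one byte:  0xF6      (11110110)
        byte[] utf8   = s.getBytes(StandardCharsets.UTF_8);      // two bytes: 0xC3 0xB6 (11000011 10110110)

        System.out.printf("ISO-8859-1: %02X%n", latin1[0] & 0xFF);
        System.out.printf("UTF-8:      %02X %02X%n", utf8[0] & 0xFF, utf8[1] & 0xFF);
    }
}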

The encoding is basically how the data are represented in binary. The same character (e.g. ö) has a different binary representation depending on the charset: if your file is saved as Latin-1 and you declare the charset as Latin-1, the browser will decode it fine. If your file is saved as UTF-8 and you declare the charset as UTF-8, the browser will also decode it fine. But if you "lie" to the browser by telling it the file is UTF-8 when it is actually encoded in Latin-1, it will be unable to decode some characters correctly.
Basic ASCII characters usually have the same binary representation in either encoding, so they are generally fine, but with accented characters it matters that you declare the correct encoding.
You must declare the charset that matches how the file was actually written; it is not a wish for which character set you would like.
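To make the "lying to the browser" point concrete, here is a minimal sketch (plain Java, assuming Java 7+; the class name is illustrative) that encodes the same text with one charset and decodes it with the other, in both directions:

import java.nio.charset.StandardCharsets;

public class CharsetMismatch {
    public static void main(String[] args) {
        String text = "Schrödinger";

        // Saved as Latin-1 but decoded as UTF-8: the 0xF6 byte is invalid UTF-8.
        byte[] savedAsLatin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(savedAsLatin1, StandardCharsets.UTF_8));    // Schr�dinger

        // Saved as UTF-8 but decoded as Latin-1: each UTF-8 byte becomes its own character.
        byte[] savedAsUtf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(savedAsUtf8, StandardCharsets.ISO_8859_1)); // SchrÃ¶dinger
    }
}

Note that the ASCII characters survive in both directions, which is why only the accented character gets garbled.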

Here is a slightly different approach from the other answers, using a hands-on demonstration to recreate the issue, and then fix it.
(my example uses Notepad++).
1) Create a new text file, and before adding any data or saving it, change the encoding to ANSI (menu: Encoding > ANSI). This assumes UTF-8 is the default.
2) Enter the following text and save as "cat.htm".
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<div>Schrödinger</div>
</body>
</html>
3) Open the file with Firefox, Chrome, etc.
You will see Schr�dinger.
If you take the above example and change the file's encoding back to UTF-8 in Notepad++ (and reinstate the ö) then you get the expected output: Schrödinger. So, yes, it's all about how the source file was saved - the binary representation.
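If you would rather recreate the same experiment from code instead of Notepad++, a sketch like the following (plain Java; the file names cat-ansi.htm and cat-utf8.htm are arbitrary, and "ANSI" on a Western Windows system is assumed to mean Windows-1252) writes the same markup once per encoding. Opening the first file in a browser shows Schr�dinger, the second shows Schrödinger:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WriteBothEncodings {
    public static void main(String[] args) throws Exception {
        String markup = "<html><head><meta charset=\"UTF-8\"></head>"
                      + "<body><div>Schrödinger</div></body></html>";

        // Simulates saving the file as "ANSI" (Windows-1252) despite the UTF-8 declaration.
        Files.write(Paths.get("cat-ansi.htm"), markup.getBytes("windows-1252"));

        // The same markup actually saved as UTF-8 renders correctly.
        Files.write(Paths.get("cat-utf8.htm"), markup.getBytes(StandardCharsets.UTF_8));
    }
}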

Related

Why is it necessary to specify the character encoding in an HTML5 document if the default character encoding for HTML5 is UTF-8?

I have the following HTML5 document:
<!DOCTYPE html>
<html>
<head> </head>
<body>
<p>Beträge: 20€</p>
</body>
</html>
The output of the above code is as below (garbled):
BetrÃ¤ge: 20â‚¬
Then I tried the HTML5 code below:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<p>Beträge: 20€</p>
</body>
</html>
The above code gave me the following output, as I was expecting:
Beträge: 20€
As per my knowledge, the default character encoding for HTML5 is UTF-8. Being the default means it should not need to be specified explicitly inside a <meta> tag.
So, in my first code snippet I omitted <meta charset="UTF-8">, but I got a weird, unexpected result.
Then I tried adding <meta charset="UTF-8"> between the <head> tags, and it worked perfectly fine and I got the expected result.
So, my question is: since the default character encoding in HTML5 is supposed to be UTF-8, why does it not work when it is not specified explicitly?
Why is there a need to specify the character encoding "UTF-8" in an HTML5 document?
This answer relied on some now-obsolete documentation; see jon1000's answer for an update (thanks @blazee for pointing this out in the comments). I'll leave this answer here, because the part about how the string "Beträge: 20€" mutates in various encodings still seems accurate.
HTTP/1.1 specified that browsers should treat all text as ISO-8859-1 unless told otherwise (this referenced RFC 2616, which was later superseded; see jon1000's answer):
When no explicit charset
parameter is provided by the sender, media subtypes of the "text"
type are defined to have a default charset value of "ISO-8859-1"
At the same time, HTML5 specifies that
If the transport layer specifies an encoding, and it is supported, return that encoding with the confidence certain, and abort these steps.
So, HTTP/1.1 defaults to ISO-8859-1, and that transport-level default overrides everything else.
If you encode
Beträge: 20€
with UTF-8, and then decode it as ISO-8859-1, you get essentially the garbled output you saw:
BetrÃ¤ge: 20â¬
as the following code snippet demonstrates (Java, but the language doesn't really matter):
new String("Beträge: 20€".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1)
// result: BetrÃ¤ge: 20â¬
The browser actually does warn you about it. E.g. Firefox displays the following warning in the console:
The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must be declared in the document or in the transfer protocol.
To obtain the correct output, you have to manually override ISO-8859-1 with UTF-8 (in Firefox, it's under View -> Text Encoding -> Unicode, instead of "Western").
So, to conclude: I don't see where it even says that "the default character encoding for HTML5 is UTF-8". All it says seems to be:
Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings.
Because the statement "the default character encoding for HTML5 is UTF-8" is wrong. The statement is spread by various websites, but as Marcel Dopita writes in "Don’t be fooled by w3schools, UTF-8 is not the default HTML5 charset", it is wrong, and in fact the W3C recommendation has a "suggested default encoding" of Windows-1252 for English locales.
It is sometimes stated that "HTTP/1.1 defaults to ISO-8859-1". This was true in the 1999 standard (RFC 2616), but in the 2014 version (RFCs 7230-7235) the default charset has been removed, so the default behaviour is now specified only by the HTML5 recommendation. Also, even if the transport layer does specify "iso-8859-1", it is not a supported encoding in HTML5, and the encoding specification says it should be treated as a label for Windows-1252.
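The practical difference between the two labels is easy to see from code. The following sketch (plain Java; it assumes the JRE ships the windows-1252 charset, which standard desktop JREs do) decodes the UTF-8 bytes of "€" both ways, which is why a browser treating "iso-8859-1" as Windows-1252 shows "â‚¬" while a strict ISO-8859-1 decode shows only "â¬":

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EuroLabelDemo {
    public static void main(String[] args) {
        byte[] euroUtf8 = "€".getBytes(StandardCharsets.UTF_8); // E2 82 AC

        // Strict ISO-8859-1: 0x82 maps to an invisible C1 control character.
        System.out.println(new String(euroUtf8, StandardCharsets.ISO_8859_1));     // â¬

        // Windows-1252: 0x82 maps to ‚ (U+201A), which is what browsers actually display.
        System.out.println(new String(euroUtf8, Charset.forName("windows-1252"))); // â‚¬
    }
}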

Basics on encoding

Today I've started my first HTML page. Where is the page encoding stored exactly?
At first, é turned into é. Then I used my text editor to save the file with an encoding. "UTF-8" didn't work. Then I used "ISO 8859-1", which did work. How did my browser know it was encoded with "ISO 8859-1"?
I can't see it anywhere in my file, so I'm very curious about where the info is stored.
Apart from an optional byte order mark (BOM) at the very start, the encoding is not actually stored in the file itself; programs like Notepad++ detect it heuristically and usually provide a number of options to change and view it.
Additionally, you can provide a value by using the meta tag:
<meta charset="UTF-8"> (HTML5)
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"> (HTML4)
Those tags are used by browsers to parse your file. However, they do not define the encoding of the file itself (and that's what seems to be happening in your case: your file has encoding A, and the browser is trying to read encoding B), and browsers can ignore those declarations.
The default encoding can also be defined (and overwritten) by your server. A sample .htaccess encoding configuration:
AddDefaultCharset utf-8
AddType 'text/html; charset=utf-8' .html .htm .shtml
UTF-8 is the recommended encoding standard for the web.
The UTF-8 encoding for é is the two hex bytes C3 A9.
C3 A9, when interpreted as ISO-8859-1, is two characters: é.
Browsers tend to guess correctly at the encoding. Or you can explicitly tell the browser how to interpret the bytes. Try that out -- you will probably see the text change between é and é.
A third case is when "double encoding" occurs. That is, somehow the mojibake é is itself encoded as UTF-8 again, giving hex C3 83 C2 A9.
So, to really be sure of what is going on, you need to look at the hex.
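To get that hex yourself, here is a minimal sketch (plain Java; the class and helper names are just illustrative) that prints the single-encoded bytes, the mojibake you get from mis-decoding them, and the double-encoded bytes:

import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {
    public static void main(String[] args) {
        String original = "é";

        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(utf8));                          // C3 A9

        // Mis-decoding those bytes as ISO-8859-1 produces the mojibake "é" ...
        String mojibake = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(mojibake);                             // é

        // ... and re-encoding the mojibake as UTF-8 gives the "double encoded" bytes.
        byte[] doubleEncoded = mojibake.getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(doubleEncoded));                 // C3 83 C2 A9
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }
}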

Chinese text encoding missing characters when viewed in web browser

I have an HTML file which contains Chinese text. When I open the file in any web browser, there are characters which appear to be missing.
Here's an example copied from the browser window:
本函旨在邀請您參�� 定於
I know for a fact that all other characters seen here are correct aside from the missing ones (confirmed by a native Chinese speaker).
In the HTML header, I have a tag which signifies the file contains UTF-8 encoded characters:
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
I've already tried some other charsets in this META tag, but so far it seems any encoding method I try aside from UTF-8 ends up looking worse.
I also considered the possibility that it is a font issue, so I installed 3 different traditional Chinese fonts on my system and forced Chrome to use them. None of them made any difference - missing characters were still present.
If I open the HTML file with Notepad++, here's what I can see:
http://i.imgur.com/GoS07WX.png
If I select and copy-paste this text into regular MS Notepad, I get this:
本函旨在邀請您參劦nbsp;定於
So you can see here that the "xE5 x8A" visible in Notepad++ seems to have been replaced by 劦.
Is there any reason why the browser would be showing �� instead of 劦 in this scenario?
Look again at the HTML file.
I see the first 2 bytes of a character encoded in UTF-8, followed by ... let's imagine there was originally a \xA0, and this was mutated to &nbsp; when the file was created by applying global substitutions to the UTF-8-encoded data.
However, \xE5\x8A\xA0 decodes from UTF-8 to U+52A0, which is not the same as the alien character 劦 (U+52A6) ... so this is not quite a complete answer.
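As a small sketch of that byte-level reasoning (plain Java; the byte values come from the answer above, and the literal "&nbsp;" stands in for whatever the global substitution inserted), the complete three-byte sequence decodes to 加 (U+52A0), while the mutated sequence leaves the decoder with an incomplete character that it replaces:

import java.nio.charset.StandardCharsets;

public class TruncatedUtf8Demo {
    public static void main(String[] args) {
        // The complete three-byte sequence E5 8A A0 decodes to U+52A0 (加).
        byte[] complete = {(byte) 0xE5, (byte) 0x8A, (byte) 0xA0};
        System.out.println(new String(complete, StandardCharsets.UTF_8));

        // If the final 0xA0 byte has been replaced by the literal text "&nbsp;",
        // the leading E5 8A is an incomplete sequence and gets replaced with U+FFFD.
        byte[] mutated = {(byte) 0xE5, (byte) 0x8A, '&', 'n', 'b', 's', 'p', ';'};
        System.out.println(new String(mutated, StandardCharsets.UTF_8));
    }
}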

Special character not displaying as expected

I have the following simple HTML page:
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
<div>
méywe
</div>
</body>
</html>
When displaying it in Chrome or Firefox (I did not test other browsers), I see the following:
m�ywe
What did I miss? The HTML file is saved with UTF-8 encoding. The server is Apache. My machine is Windows 7 Pro. The text editor is UltraEdit.
Thanks!
Update
Initially, I used UltraEdit for editing this HTML file and I got the problem. Based on cmbuckley's input and installing Notepad++ (from Heatmanofurioso's suggestion), I considered the possibility that my file was somehow corrupt (even though it looks fine in both UltraEdit and Notepad). So I saved the file with Notepad in UTF-8 encoding. I still saw the problem (maybe due to caching?). Then I used UltraEdit to save it again. Now the page displays correctly in the browser and the problem is gone.
Lesson Learned
Keep two text editors around if that is your tool of choice, and try the other one when you see an unexplainable problem. No tool is perfect, even if you use it every day. In my case, Notepad++ fixed the UTF-8 issue with my file where UltraEdit somehow failed.
Thanks to folks for helping!!!
1 - Replace your
<meta charset="utf-8">
with
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
2 - Check if your HTML editor's encoding is set to UTF-8. This option is usually found in the menus at the top of the program, as in Notepad++.
3 - Check if your browser is compatible with your font, if you're somehow importing a font. Or try adding some CSS to set your fonts to a default, generally accepted one, like:
body
{
font-family: "Times New Roman", Times, serif;
}
Hope it helps :)
The reason the file was (most likely) saved with Windows-1252 encoding instead of UTF-8, resulting in the non-ASCII character being displayed wrongly in the browsers, was missing knowledge about UTF-8 detection in UltraEdit and perhaps also about the appropriate UTF-8 configuration.
How the currently latest version 22.10 of UltraEdit detects UTF-8 encoding is explained in detail in the user-to-user forum topic UTF-8 not recognized, largish file. This forum topic also contains recommendations on how best to configure UltraEdit for HTML writers who mainly use UTF-8 encoding for their HTML files. UTF-8 detection was greatly improved in UltraEdit v24.00, which detects UTF-8 encoded characters even in very large files when scrolling to a block containing a UTF-8 encoded character.
Unfortunately, the regular expression search used by the currently latest UltraEdit v22.10 and previous versions to detect a UTF-8 HTML character set declaration does not work for the short HTML5 variant, as reported in the forum topic Short UTF-8 charset declaration in HTML5 header. The reason is the double quote character between charset= and utf-8. I reported this by email to IDM Computer Solutions, Inc. when the referenced topic was created, with the suggestion to make a small change to the regular expression so that it also detects the short HTML5 UTF-8 declaration. UTF-8 detection was later updated by the developers of UltraEdit for UE v24.00 and UES v17.00, as a post on the referenced forum topic explains in detail.
However, when an HTML5 file is declared as UTF-8 encoded but UltraEdit loaded it as an ANSI file, the user can see the wrong loading in the status bar at the bottom of the main window. A small (less than 64 KB) UTF-8 encoded HTML file should result in
either U8- and the line terminator type (DOS/UNIX/MAC) being displayed, for users of UE < v19.00 or when using the basic status bar in later versions of UE,
or UTF-8 being selected in the encoding selector in the status bar, for users of UE v19.00 or later not using the basic status bar.
If this is not the case, the UltraEdit user can use
Save As from the File menu and select UTF-8 - NO BOM for Encoding (Windows Vista or later) or Format (Windows 2000/XP) to convert the file from ANSI to UTF-8 without a byte order mark, or
ASCII to UTF-8 (Unicode editing) from the Conversions submenu in the File menu to convert the file from ASCII/ANSI to UTF-8 without an immediate save, or
select Unicode - UTF-8 via the encoding selector in the status bar (UE v19.00 or later only), which also results in an immediate conversion from ASCII/ANSI to UTF-8 and enables Unicode editing.
For the last two options, the UTF-8 BOM settings at Advanced - Settings or Configuration - File Handling - Save determine whether the file is saved without or with a byte order mark on the next save.
Once the word méywe is saved into the file using UTF-8 encoding, resulting in the byte stream 6D C3 A9 79 77 65 (hexadecimal), which would be displayed as mÃ©ywe if the UTF-8 encoded file were opened in ASCII/ANSI mode (an option in the File - Open dialog) using Windows-1252 as the code page, UltraEdit automatically detects this file as UTF-8 encoded on the next opening, even though <meta charset="utf-8"> is not recognized, because there is now at least one UTF-8 encoded character in the first 64 KB of the file.
To answer the question:
What did I miss?
You missed saving the file as a UTF-8 encoded file after having opened or created it as an ANSI file (or, more precisely, a text file encoded with a single byte per character using a code page) while declaring it as UTF-8 encoded. This is a common problem for many users writing into an HTML file
<meta charset="utf-8">
or
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
or
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
or into an XML file
<?xml version="1.0" encoding="UTF-8"?>
or
<?xml version="1.0" encoding='utf-8'?>
and other variations depending on usage of ' or " and writing either UTF-8 or utf-8 (and other spellings) without really knowing what this string means for the applications interpreting the bytes of the file.
The forum topic What's the best default new file format? contains lots of useful information and links to web pages with useful information about text encoding, which encoding to use for which file types, and how to configure UltraEdit accordingly.
Check whether the server is sending a charset in the Content-Type header. The encoding specified there will take precedence over what you specify with the meta element.
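If you don't have browser dev tools handy, a quick sketch like this (plain Java; example.com is a placeholder for the page you are debugging) prints the Content-Type header the server actually sends:

import java.net.HttpURLConnection;
import java.net.URL;

public class CheckContentType {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/page.html"); // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD");
        // Something like "text/html; charset=ISO-8859-1" here overrides <meta charset="utf-8">.
        System.out.println(conn.getHeaderField("Content-Type"));
        conn.disconnect();
    }
}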
Changing font-family to Calibri (or any other generally accepted font) worked for me.
Example:
<span style="font-family:Calibri"># My_Text</span>
I am using an MS Access (.accdb) database with PHP. It had a problem displaying the "±" character: it was displaying "�".
I added the following line at the beginning of the PHP to get it right. My problem is solved now.
header('Content-type: text/html; charset=ASCII');
Another method is to use mb_convert_encoding($row, 'UTF-8', 'ASCII');
In that case the header declaration is not required.
In my case I converted the special characters to decimal NCRs (numeric character references) and it worked. I had to do this because using the meta tag did not work and I did not want to change my font.
There are many online Unicode-to-decimal or -hex converters.
Χαίρετε -> &#935;&#945;&#943;&#961;&#949;&#964;&#949;
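If you prefer to generate the references from code instead of an online converter, a minimal sketch (plain Java; the rule applied is simply "every code point above ASCII becomes &#n;") could look like this:

public class ToDecimalNcr {
    // Converts every non-ASCII code point to a decimal numeric character reference.
    static String toNcr(String input) {
        StringBuilder sb = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (cp < 128) {
                sb.appendCodePoint(cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNcr("Χαίρετε")); // &#935;&#945;&#943;&#961;&#949;&#964;&#949;
    }
}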
Replace meta charset="utf-8" with meta http-equiv="Content-Type" content="text/html; charset=utf-8". Maybe it will help.
Otherwise, what is your font?

Why does a diamond with a question mark in it � appear in my HTML?

I have an unordered list, and � often (but not always!) appears where I have two spaces between characters. What is causing this, and how do I prevent it?
This specific character � is usually the sign of an invalid (non-UTF-8) character showing up in output (like a page) that has been declared to be UTF-8. It often happens when
a database connection is not UTF-8 encoded (even if the tables are)
an HTML or script source file is stored in the wrong encoding (e.g. Windows-1252 instead of UTF-8) - make sure it's saved as a UTF-8 file. The setting is often in the "Save as..." dialog.
an online source (like a widget or an RSS feed) is fetched that isn't serving UTF-8
I had the same issue.
You can fix it by adding the following line in your template:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
It's a character-set issue. Get a tool that inspects the response headers of the server (like the Firebug extension if you're using Mozilla Firefox) to see what character set the server response is sending with the content. If the server's character-set and the HTML character set of the actual content don't match up, you will see some strange looking characters like those little black diamond squares.
I had the same issue when getting an HTML output from an XSLT. Along with Pradip's solution I was also able to resolve the issue using UTF-32.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-32" />