Today I've started my first HTML page. Where is the page encoding stored exactly?
At first, é turned into Ã©. Then I used my text editor to save the file with an encoding. "UTF-8" didn't work. Then I used "ISO 8859-1", which did work. How did my browser know it was encoded with "ISO 8859-1"?
I can't see it anywhere in my file, so I'm very curious about where the info is stored.
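The encoding is a property of the bytes in the file itself; a plain text file has no dedicated field for it, apart from an optional byte order mark at the very start. Notepad++ and similar programs usually provide a number of options to change and view it.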
Additionally, you can provide a value by using the meta tag:
<meta charset="UTF-8"> (HTML5)
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"> (HTML4)
Those tags are used by browsers to parse your file. However, they do not define the encoding of the file itself (and that's what seems to be happening in your case: your file has encoding A, and the browser is trying to read encoding B), and browsers can ignore those conditions.
The default encoding can also be defined (and overwritten) by your server. A sample .htaccess encoding configuration:
AddDefaultCharset utf-8
AddType 'text/html; charset=utf-8' .html .htm .shtml
UTF-8 is the recommended encoding standard for the web.
The UTF-8 encoding for é is the two hex bytes C3A9.
C3 A9, when interpreted as ISO 8859-1, is two characters: Ã©.
Browsers tend to guess the encoding correctly. Or you can explicitly tell the browser how to interpret the bytes. Try that out -- you will probably see the text change between é and Ã©.
A third case is when "double encoding" occurs. That is, somehow, the Ã© is itself encoded as UTF-8, giving hex C383 C2A9.
So, to really be sure of what is going on, you need to get the HEX.
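As a rough illustration of the three cases above, here is a minimal Python sketch. The variable names are only illustrative, and it assumes a Python 3.8+ interpreter (for `bytes.hex` with a separator).

```python
# Sketch: the same character as correct UTF-8, as misread Latin-1,
# and as "double encoded" UTF-8. Names here are illustrative only.
text = "\u00e9"  # é

utf8_bytes = text.encode("utf-8")               # correct UTF-8 bytes
misread = utf8_bytes.decode("iso-8859-1")       # what a Latin-1 reader sees: Ã©
double_encoded = misread.encode("utf-8")        # the misread text saved as UTF-8 again

print(utf8_bytes.hex(" ").upper())              # C3 A9
print([f"U+{ord(c):04X}" for c in misread])     # ['U+00C3', 'U+00A9']
print(double_encoded.hex(" ").upper())          # C3 83 C2 A9

# Undoing one round of double encoding: reverse the steps in order.
repaired = double_encoded.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(repaired == text)                         # True
```

Looking at the hex first tells you which of the three cases you are in before you start changing declarations.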
Related
I have a HTML file which contains Chinese text. When I open the file in any web browser, there are characters which appear to be missing.
Here's an example copied from the browser window:
本函旨在邀請您參�� 定於
I know for a fact that all other characters seen here are correct aside from the missing ones (confirmed by a native Chinese speaker).
In the HTML header, I have a tag which signifies the file contains UTF-8 encoded characters:
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
I've already tried some other charsets in this META tag, but so far it seems any encoding method I try aside from UTF-8 ends up looking worse.
I also considered the possibility that it is a font issue, so I installed 3 different traditional Chinese fonts on my system and forced Chrome to use them. None of them made any difference - missing characters were still present.
If I open the HTML file with Notepad++, here's what I can see:
http://i.imgur.com/GoS07WX.png
If I select and copy-paste this text into regular MS Notepad, I get this:
本函旨在邀請您參劦nbsp;定於
So you can see here that the "xE5 x8A" visible in Notepad++ seems to have been replaced by 劦.
Is there any reason why the browser would be showing �� instead of 劦 in this scenario?
Look again at the HTML file.
I see the first 2 bytes of a character encoded in UTF-8, followed by &nbsp; ... let's imagine there was originally a \xA0, and this was mutated to &nbsp; when the file was created by applying global substitutions to the UTF-8-encoded data.
However, \xE5\x8A\xA0 UTF-8 decodes to U+52A0 which is not the same as the alien character which is U+52A6 ... not close enough to an answer.
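The byte arithmetic in this answer can be checked with a few lines of Python. This is only a sketch of the hypothesis described above (a trailing \xA0 replaced by the text `&nbsp;`), not a confirmed diagnosis of the file; the variable names are made up for the example.

```python
# Sketch: what the three-byte sequence decodes to, and what happens
# when its last byte is replaced by the literal text "&nbsp;".
complete = b"\xe5\x8a\xa0"
decoded = complete.decode("utf-8")
print(f"U+{ord(decoded):04X}")                  # U+52A0

# The character Notepad showed (U+52A6) has a different final byte.
alien = "\u52a6"
print(alien.encode("utf-8").hex(" ").upper())   # E5 8A A6

# A truncated sequence followed by "&nbsp;" is not valid UTF-8, so a
# lenient decoder substitutes U+FFFD for the broken part.
damaged = b"\xe5\x8a&nbsp;"
lenient = damaged.decode("utf-8", errors="replace")
print("\ufffd" in lenient)                      # True
```

How many replacement characters a decoder emits for one broken sequence varies between implementations, which may be why the browser shows two.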
In Dreamweaver I have the option "Include Unicode Signature (BOM)".
If I check this box and save the HTML file, it looks good when viewed in the web browser. If not, it gives me strange symbols for Swedish letters like åäö.
If I serve this HTML file with strange letters using the response header "Content-Type: text/html; charset=utf-8" it still gives me strange symbols.
Q1) Does that mean that it's not a UTF-8 encoded file (the one without BOM that shows strange symbols)?
Q2) What makes a file UTF-8 encoded, is it just the Unicode signature (BOM)?
Q3) Should I or should I not add the Include Unicode Signature (BOM) in my files (HTML, Javascript, CSS, PHP)?
I know that I can add <meta charset="UTF-8"> in the HTML code or type AddDefaultCharset UTF-8 in my .htaccess. I just figure the optimal solution would be to have a response header that says "it's a UTF-8 encoded file" and then also actually serve a UTF-8 encoded file. Nothing else.
Q4) I thought HTML files were plain text-files. What other information is hidden in those files and how can I read this information?
The BOM is entirely optional for UTF-8. The Unicode consortium points out that it can create problems while offering no real advantage; the W3C says that it can be a substitute for other forms of declaring the encodings and should work on all modern browsers.
The BOM is only there to clarify the endianness of the encoding. Since UTF-8 only has one kind of endianness it is superfluous. It's only useful for UTF-16 and other encodings. A UTF-8 encoded file is UTF-8 encoded regardless of the presence of the BOM.
HTML files do not "hide" any other information, they're plain text.
My recommendation would be:
encode as UTF-8 without BOM
add the HTTP Content-Type header to denote the encoding of the file
also add the <meta> tag into the HTML itself as a fallback, should the file be interpreted outside of an HTTP context (meaning where no HTTP header exists because the file is not read over HTTP)
This gives you the best compatibility with the least potential for issues. If your characters are still appearing funny, then your file is not actually UTF-8 encoded or the HTTP header is not being set correctly.
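To check the two things this answer separates (whether a BOM is present, and whether the bytes are actually valid UTF-8), something like the following Python sketch could be used. The function names `inspect_utf8` and `strip_bom` are hypothetical helpers written for this example.

```python
import codecs


def inspect_utf8(data: bytes) -> dict:
    """Report whether data starts with a UTF-8 BOM and decodes as UTF-8."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {"has_bom": has_bom, "valid_utf8": valid}


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


swedish = "\u00e5\u00e4\u00f6"  # åäö
print(inspect_utf8(swedish.encode("utf-8")))
print(inspect_utf8(codecs.BOM_UTF8 + swedish.encode("utf-8")))
print(inspect_utf8(swedish.encode("iso-8859-1")))
```

For a real file you would pass in `open(path, "rb").read()`. If `valid_utf8` comes back false for the file saved without a BOM, the editor most likely saved it in a different encoding altogether.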
I'm having some trouble getting a special character properly encoded.
® keeps coming through instead of the registered trademark symbol. I've tried changing the meta tag to UTF-8 and Windows-1252, but it still comes through in the encoded format? Can I add a meta tag to fix this?
Make sure to save your file with the proper encoding:
Here is an example; on the left side, the file is saved with Windows-1252 encoding.
On the right side, it's saved with UTF-8 encoding
HTML options
For such characters, encoding with ISO-8859-1 might do it too, but UTF-8 is greatly encouraged.
Make sure your DOCTYPE is clearly defined : <!DOCTYPE HTML>.
Make sure your meta tag is written properly: <meta charset="UTF-8">.
PHP options
If you use PHP within your page, add the following at the beginning of the page:
<?php header('Content-Type: text/html; charset=utf-8'); ?>
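If the content is output from a database and is stored as ISO-8859-1, you might want to use utf8_encode() to convert it to UTF-8. Note that it only handles ISO-8859-1 input and is deprecated as of PHP 8.2; mb_convert_encoding() is the more general alternative.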
utf8_encode()
Encodes an ISO-8859-1 string to UTF-8
The information about encoding should correspond to the actual encoding. So instead of making guesses and trial and error, find out what the encoding really is. It seems to be UTF-8, and if declaring UTF-8 in a meta tag does not help, the probable culprit is an HTTP header that the server sends and that declares a different encoding, trumping the meta tag. Use e.g. an HTTP header viewer to check out the situation.
If the server announces iso-8859-1 or windows-1252 and if you cannot change this, then you just have to use that encoding instead of UTF-8. Then save the page in your authoring program as windows-1252 encoded.
validator.w3.org reports for www.besaltnlight.ca:
Character Encoding Override in effect!
The detected character encoding "utf-8" has been suppressed and "iso-8859-1" used instead.
The PHP code outputs iso-8859-1 and PHP sets that as the default character set.
What is causing this problem? Am I using the wrong doctype?
Oh, and would any of this cause quirks mode in IE?
Thanks for your help.
Gerry
The document is encoded in UTF-8. It has a byte order mark, smart quotes, and an ellipsis, all properly encoded in UTF-8. It begins with two byte order marks, which is invalid. You must remove one, and the validator also says that the presence of a BOM in a UTF-8 document may be confusing, so you may remove them both.
Since you’re outputting UTF-8, you must change the HTTP header to:
Content-type: text/html; charset=utf-8
Since you are missing that header, you force the browser to guess. Additionally, the meta tag must be changed to
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
for the same reason.
Your output starts with a Unicode byte order mark, encoded in UTF-8.
This is likely the first few bytes of your PHP file, or of any PHP file included by your main file. Your editor may not even show them. Interpreted as ISO-8859-1, the start of the output looks like ï»¿ï»¿<!DOCTYPE html, which is in fact two byte order marks, one after the other.
As said by jleedev, either make sure your files are really encoded in Latin-1, or declare the encoding as UTF-8.
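A quick way to see stray byte order marks at the start of some output is to count them in the raw bytes. The sketch below is in Python rather than PHP, and `count_leading_boms` is a hypothetical helper name; the sample bytes are constructed for the example, not taken from the site in question.

```python
import codecs


def count_leading_boms(data: bytes) -> int:
    """Count how many UTF-8 BOMs (EF BB BF) appear back to back at the start."""
    count = 0
    bom_len = len(codecs.BOM_UTF8)
    while data.startswith(codecs.BOM_UTF8, count * bom_len):
        count += 1
    return count


# Constructed sample: two BOMs followed by the doctype.
output = codecs.BOM_UTF8 * 2 + b"<!DOCTYPE html>"
print(count_leading_boms(output))               # 2

# The same bytes read as ISO-8859-1 give three visible characters per BOM.
as_latin1 = output.decode("iso-8859-1")
print([f"U+{ord(c):04X}" for c in as_latin1[:6]])
```

Running the same count over each included file separately should show which one contributes the extra BOM.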
I want to simply display the tick (✔) and cross (✘) symbols in a HTML page but it shows up as either a box or goop like âœ” - obviously something to do with the encoding.
I have set the meta tag to show utf-8 but obviously I'm missing something.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Edit/Solution: From comments made, using FireBug I found the headers being passed by my page were in fact "Content-Type: text/html" and not UTF-8. Looking at the file format using Notepad++ showed my file was formatted as "UTF-8 without BOM". Changing this to just UTF-8 the symbols now show correctly... but firebug still seems to indicate the same content-type.
You should ensure the HTTP server headers are correct.
In particular, the header:
Content-Type: text/html; charset=utf-8
should be present.
The meta tag is ignored by browsers if the HTTP header is present.
Also ensure that your file is actually encoded as UTF-8 before serving it, check/try the following:
Ensure your editor save it as UTF-8.
Ensure your FTP or any file transfer program does not mess with the file.
Try with HTML encoded entities, like &#uuu;.
To be really sure, hexdump the file and look at the character; for the ✔, it should be E2 9C 94.
Note: If you use an unicode character for which your system can't find a glyph (no font with that character), your browser should display a question mark or some block like symbol. But if you see multiple roman characters like you do, this denotes an encoding problem.
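The hexdump check, and the difference between a missing glyph and an encoding problem, can be sketched in Python as follows. The variable names are illustrative, and Windows-1252 is assumed as the wrong decoding purely for the demonstration.

```python
# Sketch: the UTF-8 bytes of the heavy check mark, and what they turn
# into when read with a single-byte encoding (Windows-1252 assumed here).
tick = "\u2714"  # ✔

tick_bytes = tick.encode("utf-8")
print(tick_bytes.hex(" ").upper())              # E2 9C 94

# One character becomes three unrelated ones: a sign of an encoding
# mismatch rather than a missing font.
mojibake = tick_bytes.decode("cp1252")
print(len(mojibake))                            # 3
print([f"U+{ord(c):04X}" for c in mojibake])    # ['U+00E2', 'U+0153', 'U+201D']
```

For a file on disk, `open(path, "rb").read()` gives the bytes to search for `E2 9C 94`.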
I know an answer has already been accepted, but wanted to point a few things out.
Setting the content-type and charset is obviously a good practice, doing it on the server is much better, because it ensures consistency across your application.
However, I would use UTF-8 only when the language of my application uses a lot of characters that are available only in the UTF-8 charset. If you want to show a Unicode character or symbol in one-off cases, you can do so without changing the charset of your page.
HTML renderers have always been able to display symbols which are not part of the encoding character set of the page, as long as you write the symbol as its numeric character reference (NCR). Sounds weird but it's true.
So, even if your HTML has a header that states it has an encoding of ANSI or any of the ISO charsets, you can display a check mark by using its HTML character reference, in decimal - &#10003; or in hex - &#x2713;
So it's a little difficult to understand why you are facing this issue on your pages. Can you check if the NCR value is correct? This is a good reference: http://www.fileformat.info/info/unicode/char/2713/index.htm
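If you want to double-check an NCR value, it can be derived from the character's code point. A small Python sketch, where `ncr` is a hypothetical helper name and the rest uses the standard library:

```python
import html


def ncr(ch: str) -> tuple:
    """Return the decimal and hexadecimal numeric character references for ch."""
    code_point = ord(ch)
    return f"&#{code_point};", f"&#x{code_point:X};"


check = "\u2713"  # ✓
decimal_ref, hex_ref = ncr(check)
print(decimal_ref, hex_ref)                     # &#10003; &#x2713;

# Both references resolve back to the same character.
print(html.unescape(decimal_ref) == html.unescape(hex_ref) == check)  # True

# Encoding text to plain ASCII while keeping symbols as NCRs.
print("done \u2713".encode("ascii", "xmlcharrefreplace"))  # b'done &#10003;'
```

Because the result is pure ASCII, it survives whatever single-byte charset the page is declared as.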
Make sure that you actually save the file as UTF-8, alternatively use HTML entities (&#nnn;) for the special characters.
Unlike proposed by Nicolas, the meta tag isn’t actually ignored by the browsers. However, the Content-Type HTTP header always has precedence over the presence of a meta tag in the document.
So make sure that you either send the correct encoding via the HTTP header, or don’t send this HTTP header at all (not recommended). The meta tag is mainly a fallback option for local documents which aren’t sent via HTTP traffic.
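The precedence described here can be written down as a short Python sketch. It is a deliberate simplification: it only models "HTTP header first, meta tag as fallback" and ignores everything else a real browser looks at (such as a BOM or content sniffing). The function names are hypothetical.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lowercased, or None when absent


def effective_charset(http_content_type: Optional[str],
                      meta_charset: Optional[str]) -> Optional[str]:
    """Simplified rule: the HTTP header wins; the meta tag is the fallback."""
    if http_content_type:
        from_header = charset_from_content_type(http_content_type)
        if from_header:
            return from_header
    return meta_charset.lower() if meta_charset else None


print(effective_charset("text/html; charset=iso-8859-1", "utf-8"))  # iso-8859-1
print(effective_charset("text/html", "UTF-8"))                      # utf-8
print(effective_charset(None, "utf-8"))                             # utf-8
```

The first call mirrors the situation where a server-sent charset overrides a correct meta tag.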
Using HTML entities should also be considered a workaround – that’s tiptoeing around the real problem. Configuring the web server properly prevents a lot of nuisance.
I think this is a file problem: you simply saved your file in a 1-byte encoding like latin-1. Google your editor and how to set files to utf-8.
I wonder why there are editors that don't default to utf-8.