Ideas to get around a Character Encoding mismatch - html

I get this validation warning when I try and validate my page.
The character encoding specified in the HTTP header (iso-8859-1) is
different from the value in the <meta> element (utf-8). I will use the
value from the HTTP header (iso-8859-1) for this validation.
Here is the encoding set for the file:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
If I try and use characters like ĉĵŝĝ (iso-8859-3 compatible), they are rendered incorrectly. I think this is an issue with my college's server, because it serves pages as Latin-1 (iso-8859-1).
Is there a way I can get round this (if the problem lies with the encoding set with the college's server)?
Thanks.

You can also set this option in your .htaccess file
<FilesMatch "\.(htm|html|xhtml|php)$">
AddDefaultCharset utf-8
</FilesMatch>
That way, all the files with the above extensions will be served with utf-8 instead of iso-8859-1
Bye

Saluton,
you could try .xhtml. Maybe the college's server then uses another header. The other way is to use HTML entities for the Esperanto letters &#...;.
As the HTTP headers are normative, and not the HTML meta, you are stuck when you cannot change the server's config.
Another solution proposes a local directory for your .html with a httpd.conf file.
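If you go the entity route, you do not have to look up each code by hand. Here is a minimal Python sketch (the function name is just something I made up for illustration) that turns every non-ASCII character into a decimal numeric character reference, so the page stays pure ASCII and survives whatever charset the server announces:

```python
def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#...; reference."""
    # "xmlcharrefreplace" is a standard Python codec error handler that
    # substitutes &#NNN; for any character the target encoding cannot hold.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# The Esperanto letters from the question
print(to_numeric_references("ĉĵŝĝ"))
```

Plain ASCII text passes through unchanged, so it should be safe to run this over a whole paragraph.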

Related

Basics on encoding

Today I've started my first HTML page. Where is the page encoding stored exactly?
At first, é turned into Ã©. Then I used my text editor to save the file with an encoding. "UTF-8" didn't work. Then I used "ISO 8859-1", which did work. How did my browser know it was encoded with "ISO 8859-1"?
I can't see it anywhere in my file, so I'm very curious about where the info is stored.
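A plain-text file does not store its encoding anywhere explicit; at most there is an optional byte order mark (BOM) at the very start of the file. Notepad++ and similar programs detect the encoding and usually provide a number of options to view and change it.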
Additionally, you can provide a value by using the meta tag:
<meta charset="UTF-8"> (HTML5)
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"> (HTML4)
Those tags tell browsers how to parse your file. However, they do not define the encoding of the file itself (which seems to be what is happening in your case: your file is saved in encoding A, and the browser is trying to read it as encoding B), and browsers can ignore those declarations.
The default encoding can also be defined (and overwritten) by your server. A sample .htaccess encoding configuration:
AddDefaultCharset utf-8
AddType 'text/html; charset=utf-8' .html .htm .shtml
UTF-8 is the recommended encoding standard for the web.
The UTF-8 encoding for é is the two hex bytes C3 A9.
C3 A9, when interpreted as ISO 8859-1, is two characters: Ã©.
Browsers tend to guess correctly at the encoding. Or you can explicitly tell it how to interpret the bytes. Try that out -- you will probably see the text change between é and Ã©.
A third case is when "double encoding" occurs. That is, somehow, the Ã© is seen as UTF-8, hex C383 C2A9.
So, to really be sure of what is going on, you need to get the HEX.
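If you have Python at hand, getting the hex takes a couple of lines. This is only an illustrative sketch (the variable names are mine), reproducing the three cases above: the correct UTF-8 bytes, the same bytes misread as ISO 8859-1, and the double-encoded result.

```python
# UTF-8 bytes for é
utf8_bytes = "é".encode("utf-8")
hex_once = utf8_bytes.hex()

# The same two bytes read as ISO 8859-1 become two characters
misread = utf8_bytes.decode("iso-8859-1")

# "Double encoding": the misread text is encoded as UTF-8 again
double_encoded_hex = misread.encode("utf-8").hex()

print(hex_once, misread, double_encoded_hex)
```

To inspect a real file, read it in binary mode (`open(path, "rb")`) and call `.hex()` on the bytes, so that no decoding happens before you look.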

What is the difference between the charset in http header and html meta?

You can send the charset in the HTTP response headers, and you can also define a charset in the HTML file itself.
What happens if the two charsets differ? How does the browser use the charset it received in the HTTP headers, and when does the charset declared in the HTML file itself matter?
The HTML 4.01 specification clearly says, in 5.2.2 Specifying the character encoding, that information in an HTTP header has precedence over a meta tag. HTML5 PR does not change this, but it adds, reflecting browser practice, in 8.2.2.2 Determining the character encoding that both of them are overridden by a Byte Order Mark (BOM) at the start of the HTML document (so if you have saved your .html file with “Save as UTF-8 with BOM”, it will be treated as UTF-8 no matter what).
A meta tag that specifies character encoding takes effect if the information is not provided in an HTTP header or with a BOM. A server might not include a charset parameter in the Content-Type header, or the HTML document might be opened locally so that there are no HTTP headers at all. When a user saves an HTML document on their own device, the HTTP headers are not saved. This is the main reason for using a meta tag to specify character encoding; but it should then specify the correct encoding, of course.
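The precedence described here can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the real HTML5 algorithm: the function name, the regular expression and the 1024-byte scan window are my own assumptions, only the UTF-8 BOM is checked, and real browsers do considerably more (UTF-16 BOMs, heuristics, user overrides).

```python
import codecs
import re
from typing import Optional

# Rough pattern for both <meta charset="..."> and the http-equiv form
_META_RE = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def effective_encoding(body: bytes, http_charset: Optional[str] = None) -> Optional[str]:
    """Pick an encoding in the order: BOM, then HTTP header, then meta tag."""
    # 1. A UTF-8 byte order mark overrides everything else
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # 2. The charset parameter of the HTTP Content-Type header
    if http_charset:
        return http_charset.lower()
    # 3. A meta declaration near the start of the document
    match = _META_RE.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # Nothing declared: the browser would have to guess
    return None


page = b'<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'
print(effective_encoding(page, "iso-8859-1"))
print(effective_encoding(page))
```

With the header present, the meta declaration never gets a say; drop the header (as when opening a local file) and the meta tag decides.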

UTF-8 html without BOM displays strange characters

I have some HTML which contains some foreign characters (€, ó, á). The HTML document is saved as UTF-8 without BOM. When I view the page in the browser the foreign characters seem to get replaced with strange character combinations (â‚¬, Ã³, Ã¡). It's only when I save my HTML document as UTF-8 with BOM that the characters then display properly.
I'd really rather not have to include a BOM in my files, but has anybody got any idea why it might do this? and a way to fix it? (other than including a BOM)
You are probably not specifying the correct character set in your HTML file. The BOM (thanks @Jukka) sends the browser into UTF-8 mode; in its absence, you need to use other means to declare the document UTF-8.
If you have access to your server configuration, you may want to make sure the server isn't sending the wrong character set info. See e.g. How to change the default encoding to UTF-8 for Apache?
If you have access only to your HTML, adding this meta tag in your document's head should do the trick:
<meta http-equiv='Content-Type' content='text/html; charset=utf-8'>
or, as @Mathias points out, the new HTML 5
<meta charset="utf-8">
(valid only if you use a HTML 5 doctype, against which there is no good argument any more even if you don't use HTML 5 markup.)
Insert <meta charset="utf-8"> in <head>.
Or set the header Content-Type: text/html;charset=utf-8 on the server-side.
You can also add this to .htaccess: AddDefaultCharset UTF-8. More info here: http://www.askapache.com/htaccess/setting-charset-in-htaccess.html
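To check which variant a given file really is, you can look at its first three bytes. A small Python sketch (the helper names are hypothetical, and Windows-1252 is my assumption for what the browser fell back to):

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the bytes start with the UTF-8 byte order mark EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if there is one."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


with_bom = "€ ó á".encode("utf-8-sig")   # "utf-8-sig" writes the BOM
without_bom = "€ ó á".encode("utf-8")

print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))

# Without a BOM or a declaration, a browser that falls back to
# Windows-1252 (an assumption here) shows the UTF-8 bytes as mojibake.
print(without_bom.decode("cp1252"))
```

For a real file, read it with `open(path, "rb").read()` and pass the bytes in.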

HTML in Russian

I have to design a Russian version of a website. I get the text from a translator. I paste it into the code in Dreamweaver, but it doesn't display correctly.
I have the usual head:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
What should I do?
You should change the encoding of your file to UTF-8. You can do this in Notepad's Save As dialog, or in Notepad++ (Encoding -> Encode in UTF-8).
The document http://www.mig-marketing.com/proves/nando/ru/ contains Russian text in an image only, but it links to http://www.mig-marketing.com/proves/nando/ru/firma.html which contains (in addition to text in an image) Russian text in ISO-8859-5 (= ISO Latin/Cyrillic) encoding. This encoding is declared in a meta tag, but the declaration has no effect, since HTTP headers take precedence over it, and they say
Content-Type: text/html; charset=ISO-8859-1
(You can conveniently check the HTTP response headers using Firefox with Web Developer Extension and selecting Information → View Response Headers.)
To fix this, contact the web server admin or try and fix it yourself, if the Apache settings allow the use of per-directory .htaccess files, in which case just create a file with that name (including the leading dot) in the directory containing the Russian files and enter the text
AddType text/html;charset=ISO-8859-5 html
This would then make the server send all .html files in that directory with HTTP headers that specify them as ISO-8859-5 encoded.
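If you would rather check the header value programmatically than through a browser extension, the standard library can pull the charset parameter out of a Content-Type value. A minimal sketch (the function name is my own; how you obtain the header string, for example from your HTTP client of choice, is left out):

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lower-cased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


# The value the server currently sends, and the one the AddType line asks for
print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html;charset=ISO-8859-5"))
```

A result of `None` means the server sends no charset at all, in which case the meta tag would take effect.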
Re-save all your files as UTF-8.
After trying so many things I discovered that the problem was in the server. I don't know exactly how, but when I told them that I needed a website in Russian they changed something and it works!
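Re-saving can also be scripted instead of done by hand in an editor. A rough Python sketch, assuming the files really are ISO-8859-5 as declared (the function name is made up for illustration); note that the meta tag and the server header would still have to be changed to say UTF-8 afterwards:

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# A sample word encoded the way the existing page is
sample = "Фирма".encode("iso-8859-5")
converted = transcode_to_utf8(sample, "iso-8859-5")
print(converted.decode("utf-8"))
```

Work on the raw bytes (files opened with `"rb"` and `"wb"`), and keep a backup, since decoding with the wrong source encoding silently produces garbage.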

Displaying unicode symbols in HTML

I want to simply display the tick (✔) and cross (✘) symbols in a HTML page but it shows up as either a box or goop âœ” - obviously something to do with the encoding.
I have set the meta tag to show utf-8 but obviously I'm missing something.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Edit/Solution: From comments made, using FireBug I found the headers being passed by my page were in fact "Content-Type: text/html" and not UTF-8. Looking at the file format using Notepad++ showed my file was formatted as "UTF-8 without BOM". Changing this to just UTF-8 the symbols now show correctly... but firebug still seems to indicate the same content-type.
You should ensure the HTTP server headers are correct.
In particular, the header:
Content-Type: text/html; charset=utf-8
should be present.
The meta tag is ignored by browsers if the HTTP header is present.
Also ensure that your file is actually encoded as UTF-8 before serving it, check/try the following:
Ensure your editor saves it as UTF-8.
Ensure your FTP or any file transfer program does not mess with the file.
Try with HTML encoded entities, like &#uuu;.
To be really sure, hexdump the file and look at the character; for the ✔, it should be E2 9C 94.
Note: If you use a Unicode character for which your system can't find a glyph (no font with that character), your browser should display a question mark or some block-like symbol. But if you see multiple roman characters like you do, this denotes an encoding problem.
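If you do not have a hexdump tool handy, Python can show the same thing. A quick sketch (variable names are mine; Windows-1252 is assumed as the encoding the browser wrongly applied):

```python
# The three UTF-8 bytes of the heavy check mark
tick_bytes = "✔".encode("utf-8")
tick_hex = " ".join(f"{b:02X}" for b in tick_bytes)
print(tick_hex)

# The same bytes read as Windows-1252 (assumed fallback) give three
# Latin-looking characters, which is the tell-tale sign of mojibake.
misread_tick = tick_bytes.decode("cp1252")
print(misread_tick)
```

For the actual file, read it with `open(path, "rb")` and search the bytes for `b"\xe2\x9c\x94"`.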
I know an answer has already been accepted, but wanted to point a few things out.
Setting the content-type and charset is obviously good practice; doing it on the server is even better, because it ensures consistency across your application.
However, I would use UTF-8 only when the language of my application uses a lot of characters that are available only in the UTF-8 charset. If you want to show a Unicode character or symbol in one-off cases, you can do so without changing the charset of your page.
HTML renderers have always been able to display symbols which are not part of the encoding character set of the page, as long as you write the symbol as its numeric character reference (NCR). Sounds weird, but it's true.
So, even if your HTML has a header that states it has an encoding of ANSI or any of the ISO charsets, you can display a check mark by using its HTML character reference, in decimal - &#10003; or in hex - &#x2713;
So it's a little difficult to understand why you are facing this issue on your pages. Can you check whether the NCR value is correct? This is a good reference: http://www.fileformat.info/info/unicode/char/2713/index.htm
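One way to double-check an NCR value is to compute it from the character itself and decode it back. A short Python sketch using only the standard library (the variable names are mine):

```python
import html

check_mark = "\u2713"  # the check mark from the reference page, U+2713

# Build the decimal and hexadecimal numeric character references
decimal_ref = f"&#{ord(check_mark)};"
hex_ref = f"&#x{ord(check_mark):X};"
print(decimal_ref, hex_ref)

# html.unescape turns a reference back into the character it stands for
print(html.unescape(decimal_ref) == check_mark)
```

Because the references are pure ASCII, they come out the same whichever of these charsets the page is served in.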
Make sure that you actually save the file as UTF-8, alternatively use HTML entities (&#nnn;) for the special characters.
Contrary to what Nicolas suggests, the meta tag isn’t actually ignored by the browsers. However, the Content-Type HTTP header always has precedence over the presence of a meta tag in the document.
So make sure that you either send the correct encoding via the HTTP header, or don’t send this HTTP header at all (not recommended). The meta tag is mainly a fallback option for local documents which aren’t sent via HTTP traffic.
Using HTML entities should also be considered a workaround – that’s tiptoeing around the real problem. Configuring the web server properly prevents a lot of nuisance.
I think this is a file problem: you simply saved your file in a 1-byte encoding like latin-1. Look up how to set your editor to save files as utf-8.
I wonder why there are editors that don't default to utf-8.