Odd HTML/XML encoding issue

I'm having some real issues with a site we're building on our bespoke content management system. The system renders all views via XSLT, which may be the problem.
The problem we're experiencing appears to be the result of character encoding mismatches, but I'm struggling to work out which part of the process is breaking down.
The issue does not occur in Firefox or Chrome. In IE the page is fine on the initial load and when it is refreshed; however, after using the 'back' or 'forward' button in IE, any Unicode characters show as a white question mark in a black diamond, which implies that the wrong character set is being used. We've also seen odd results in how Google indexes the page: it appears to index the DOCTYPE reference and the content of the head element rather than the page content, as would normally be the case.
All of the XSLT stylesheets are outputting UTF-16 and the XSLT files themselves are UTF-16 files (previously there was a mismatch). The site is serving the pages as UTF-16 and the HTML output has a meta tag setting the content type to use a charset of UTF-16.
I've checked the results using Fiddler to see what's coming from the server; however, Fiddler isn't logging a request/response when IE uses the back/forward buttons, so presumably IE has the pages cached somewhere.
Anyone got any ideas?

The site is serving the pages as UTF-16
Whoah! Don't do that.
There are several browser bugs to do with UTF-16 pages. I hadn't heard of this particular one before, but it's common for UTF-16 to break form handling, for example. UTF-16 is very rarely used on the web, and as a consequence it turns up a lot of little-known bugs in browsers and other agents (such as search engines, and tools written in one of the many scripting languages with poor Unicode support, like PHP).
the HTML output has a meta tag setting the content type to use a charset of UTF-16
This has no effect. If the browser fails to detect UTF-16 then, because UTF-16 is not ASCII-compatible, it won't even be able to read the meta tag.
On the web, always use an ASCII-compatible encoding—usually UTF-8. UTF-8 is by far the best-supported encoding, and is almost always smaller in size than UTF-16. UTF-16 offers pretty much no advantage and I would avoid it in every case.
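For anyone making the switch, here is a minimal sketch of serving a page as UTF-8, with the charset stated both in the Content-Type header and in a meta tag. Go and net/http are used purely as an illustration; the header value is what matters for whichever stack you use.

package main

import (
	"fmt"
	"net/http"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// The header is authoritative; the meta tag is a fallback for saved copies.
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	fmt.Fprint(w, `<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>UTF-8 example</title></head>
<body><p>“Curly quotes”, é and 日本語 all survive as UTF-8.</p></body>
</html>`)
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}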

Possibly IE is corrupting the files when they are read from the cache. Could be related to this (unfortunately unanswered) question:
Firefox & IE: Corrupted data when retrieved from cache
A few things you could check/try:
Make sure the encoding is specified in both the HTTP Content-Type header and the <?xml encoding=...?> declaration at the top of the XML.
Are you specifying the endianness of your UTF-16, or relying on the byte order mark? If the latter, try specifying it. I think Windows is usually fond of UTF-16LE.
Are you able to try another encoding, namely UTF-8?
Are you able to disable caching from the server end (if it's practical)? Pragma: no-cache, or its modern equivalent Cache-Control: no-store? (Sorry, it's been a while since I played with this stuff.)
Sorry, no real answer here, but too much to write as a comment. A rough sketch of the header-related points is below.
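To make the first and last points concrete, here is a rough sketch of a wrapper that pins the declared charset and disables caching. It is written in Go with net/http purely as an illustration; the header names and values are the part that carries over to any server.

package main

import "net/http"

// noStaleCache pins the declared charset and tells clients not to reuse a
// cached copy; Cache-Control is the modern replacement for Pragma: no-cache.
func noStaleCache(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html; charset=utf-8") // keep in sync with the <?xml?> declaration
		w.Header().Set("Cache-Control", "no-store, must-revalidate")
		w.Header().Set("Pragma", "no-cache") // for old HTTP/1.0 intermediaries
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/", http.FileServer(http.Dir(".")))
	http.ListenAndServe(":8080", noStaleCache(mux))
}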

Related

How do browsers determine the encoding used?

I do understand there are 2 ways to set the encoding:
By using Content-Type header.
By using meta tags in HTML
The Content-Type header is not mandatory and has to be set explicitly (the server side can set it if it wants), and the meta tag is also optional.
In case both of these are absent, how does the browser determine the encoding used for parsing the content?
They can guess it based on heuristics.
I don't know how good browsers are at encoding detection today, but MS Word did a very good job of it and recognises even charsets I had never heard of before. You can just open a *.txt file with a random encoding and see.
This algorithm usually involves statistical analysis of byte patterns, like frequency distribution of trigraphs of various languages encoded in each code page that will be detected; such statistical analysis can also be used to perform language detection.
https://en.wikipedia.org/wiki/Charset_detection
Firefox uses the Mozilla Charset Detectors. The way they work is explained here, and you can also change their heuristic preferences. The Mozilla Charset Detectors were even forked into uchardet, which works better and detects more languages.
[Update: As commented below, it moved to chardetng since Firefox 73.]
Chrome previously used the ICU detector but switched to CED almost 2 years ago.
None of the detection algorithms is perfect; they can guess incorrectly, because it's just guessing anyway!
This process is not foolproof, because it depends on statistical data.
That's how the famous "Bush hid the facts" bug occurred. Bad guessing also introduces a vulnerability into the system:
For all those skeptics out there, there is a very good reason why the character encoding should be explicitly stated. When the browser isn't told what the character encoding of a text is, it has to guess: and sometimes the guess is wrong. Hackers can manipulate this guess in order to slip XSS past filters and then fool the browser into executing it as active code. A great example of this is the Google UTF-7 exploit.
http://htmlpurifier.org/docs/enduser-utf8.html#fixcharset-none
As a result, the encoding should always be explicitly stated.
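If you want to see the non-statistical part of the algorithm in action, Go's golang.org/x/net/html/charset package exposes DetermineEncoding, which follows roughly the HTML spec's sniffing order: BOM first, then the Content-Type header, then a prescan of the first bytes for a meta tag. A small sketch (the outputs noted in comments are what I would expect, not guarantees):

package main

import (
	"fmt"

	"golang.org/x/net/html/charset"
)

func main() {
	body := []byte(`<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>café</body></html>`)

	// No charset in the Content-Type header: the meta prescan decides,
	// and the result is reported as not certain.
	_, name, certain := charset.DetermineEncoding(body, "text/html")
	fmt.Println(name, certain) // utf-8 false

	// An explicit charset in the header wins over the meta tag and is certain.
	// (iso-8859-1 is treated as a label for windows-1252, as browsers do.)
	_, name, certain = charset.DetermineEncoding(body, "text/html; charset=iso-8859-1")
	fmt.Println(name, certain) // windows-1252 true
}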
I've encountered problems with the output encoding of HTML. If you are creating a website or web service with e.g. Node.js or Go and you're not sure, just add a charset to the Content-Type header.
For example, in Go (where w is the http.ResponseWriter): w.Header().Set("Content-Type", "text/html; charset=GB18030")
It is set in the <head> like this:
<meta charset="UTF-8">
I think that if this is not set in the head, the browser will fall back to a default encoding.

HTML5: which is better - using a character entity vs using a character directly?

I've recently noticed a lot of high-profile sites using characters directly in their source, e.g.:
<q>“Hi there”</q>
Rather than:
<q>&ldquo;Hi there&rdquo;</q>
Which of these is preferred? I've always used entities in the past, but using the character directly seems more readable, and would seem to be OK in a Unicode document.
If the encoding is UTF-8, the normal characters will work fine, and there is no reason not to use them. Browsers that don't support UTF-8 will have lots of other issues while displaying a modern webpage, so don't worry about that.
So it is easier and more readable to use the characters and I would prefer to do so.
It also saves a couple of bytes which is good, although there is much more to gain by using compression and minification.
The main advantage I can see with encoding characters is that they'll look right, even if the page is interpreted as ASCII.
For example, if your page is just a raw HTML file, the default settings on some servers would be to serve it as text/html; charset=ISO-8859-1 (the default in HTTP 1.1). Even if you set the meta tag for content-type, the HTTP header has higher priority.
Whether this matters depends on how likely the page is to be served by a misconfigured server.
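A tiny illustration of that fallback benefit, sketched in Go (golang.org/x/text is used only to simulate a browser decoding UTF-8 bytes with the wrong charset):

package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	literal := "<q>\u201cHi there\u201d</q>"    // curly quotes stored as UTF-8 bytes
	entities := "<q>&ldquo;Hi there&rdquo;</q>" // plain ASCII

	// Simulate a browser that was told (or guessed) the wrong encoding and
	// decodes the UTF-8 bytes as Windows-1252/Latin-1.
	garbled, _ := charmap.Windows1252.NewDecoder().String(literal)

	fmt.Println(garbled)  // mojibake: the quotes come out as "â€œ"-style rubbish
	fmt.Println(entities) // unaffected: entities are ASCII and survive any 8-bit misreading
}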
It is better to use characters directly. They make for easier-to-read code.
Google's HTML style guide advocates for the same. The guide itself can be found here:
Google HTML/CSS Style guide.
Using characters directly. They are easier to read in the source (which is important as people do have to edit them!) and require less bandwidth.
The example given is definitely wrong, in theory as well as in practice, in HTML5 and in HTML 4. For example, the HTML5 discussion of q markup says:
“Quotation punctuation (such as quotation marks) that is quoting the contents of the element must not appear immediately before, after, or inside q elements; they will be inserted into the rendering by the user agent.”
That is, use either q markup or punctuation marks, not both. The latter is better on all practical accounts.
Regarding the issue of characters vs. entity references, the former are preferable for readability, but then you need to know how to save the data as UTF-8 and declare the encoding properly. It’s not rocket science, and usually better. But if your authoring environment is UTF-8 hostile, you need not be ashamed of using entity references.

How to display all non-English characters correctly in a web site?

It's annoying to see even the most professional sites do it wrong. Posted text turns into something that's unreadable. I don't have much information about encodings. I just want to know about the problem that's making such a basic thing so hard.
Does HTTP encoding limit some characters?
Do users need to send info about the charset/encoding they are using?
Assuming everything arrives at the server as-is, is the encoding used when saving that text causing the problem?
Is it something about browser implementations?
Do we need some JavaScript tricks to make it work?
Is there an absolute solution to this? It may have its limits, but Stack Overflow seems to make it work.
I suspect one needs to make sure that the whole stack handles the encoding with care:
Specify a web page font (CSS) that supports a wide range of international characters.
Specify correct lang and charset attributes in the HTML and make sure that the browser is using the correct encoding.
Make sure the HTTP requests are sent with the appropriate charset specified in the headers.
Make sure the content of the HTTP requests is decoded properly in your web request handler.
Configure your database/datastore with an internationalization-friendly encoding/collation (such as UTF-8/UTF-16) and not one that just supports Latin characters (the default in some DBs).
The first few are normally handled by the browser and web framework of choice, but if you screw up the DB encoding or use a font with a limited character set there will be no one to save you. A rough sketch of the request-handling steps is below.
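A minimal sketch of the request-handling steps, assuming a Go handler and a hypothetical form field name (the same idea applies to any stack): validate that incoming text really is UTF-8 before it reaches the database, and answer with an explicit charset.

package main

import (
	"fmt"
	"html"
	"net/http"
	"unicode/utf8"
)

func commentHandler(w http.ResponseWriter, r *http.Request) {
	if err := r.ParseForm(); err != nil {
		http.Error(w, "bad form", http.StatusBadRequest)
		return
	}
	text := r.PostFormValue("comment") // hypothetical field name
	if !utf8.ValidString(text) {
		http.Error(w, "not valid UTF-8", http.StatusBadRequest)
		return
	}
	// ...store text in a UTF-8 column (e.g. utf8mb4 in MySQL), not a Latin-1 one...

	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	fmt.Fprintf(w, "<p>You wrote: %s</p>", html.EscapeString(text))
}

func main() {
	http.HandleFunc("/comment", commentHandler)
	http.ListenAndServe(":8080", nil)
}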

Content type vs HTML encoding

I'm building a site and I've set its content type to use charset UTF-8. I'm also using HTML encoding for the special characters, i.e. instead of having á I've got &aacute;.
Now I wonder (still building the site) if it was really necessary to do both things. Looking for the answer I found this:
http://www.w3.org/International/questions/qa-escapes.en.php
It says that I should not use HTML encoding for any special characters except >, < and &. But the reason is that escapes
can make it difficult to read and maintain source code, and can also significantly increase file size.
I think that's true but a very poor argument. Is it really THE SAME thing to use the escapes and the special characters?
The article is in fact correct. If you have proper UTF-8 encoded data, there is no reason to use HTML entities for special characters on normal web pages any more.
I say "on normal web pages", because there are highly exotic borderline scenarios where using entities is still the safest bet (e.g. when serving JavaScript code to an external page with unknown encoding). But for serving pages to a browser, this doesn't apply.

HTML: How do I debug why a language does not display correctly?

I was recently asked why a Tumblr theme of mine does not display Vietnamese correctly on this site. How do I debug what the problem is?
I wonder if it's because of the use of a custom font or Cufón?
Maybe it's a character set issue? But UTF-8 should support most languages?
Debugging is difficult, especially if you don't read the language in question. There are some things you should check though:
1.) Fonts. This is the main cause of trouble. If you want to display a character, you must have that character in the selected font. Standard fonts may work on internationalised Windows, but there are also "Unicode" fonts (e.g. Arial Unicode MS) that you may want to specify explicitly.
2.) Encoding. Make sure the page is served in an appropriate character set. Check the charset in both the HTTP headers and the HTML meta tag; a quick way to check this is sketched below. UTF-8 is appropriate for most languages.
3.) Browser and OS support. It's pretty much a given these days that browsers support non-Latin character sets; however, it's possible the client has a very old or unusual browser. It can't hurt to find out which browser/OS combination they are using and what their "Regional Settings" are.
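When I need to check point 2 quickly, a throwaway script that fetches the page and reports what the server actually claims can help. Here is a sketch in Go; the URL argument is whatever page is misbehaving.

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
	"unicode/utf8"
)

func main() {
	resp, err := http.Get(os.Args[1]) // e.g. go run check.go https://example.tumblr.com/
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)

	fmt.Println("Content-Type header:  ", resp.Header.Get("Content-Type"))
	fmt.Println("Contains <meta charset:", bytes.Contains(bytes.ToLower(body), []byte("<meta charset")))
	fmt.Println("Body is valid UTF-8:   ", utf8.Valid(body))
}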