IE7 won't display bmp files with encoded filenames - html

I have a test page that displays two images. One is called hello.bmp and the other is called 徘吐驴欸觰.bmp (this is a random collection of Chinese characters; apologies if it means something weird). For the latter image, I use the percent-encoded filename in the page's HTML.
The HTML is pretty straightforward:
<img src="%E5%BE%98%E5%90%90%E9%A9%B4%E6%AC%B8%E8%A7%B0.bmp" />
<img src="hello.bmp" />
In Internet Explorer 7, the image with the encoded file path does not display (red X). All other browsers display it.
Does anyone know what would cause this? Can it be avoided?

Character encoding of file:/// URLs works differently across browsers on Windows.
Windows filenames are natively Unicode-based, so when you use a URL, which is byte-based, it has to convert that sequence of bytes to Unicode characters using an encoding. What encoding? There is no standard to say, but there are two obvious possibilities:
UTF-8, since it covers everything and is a popular default encoding, also used by the IRI standard for putting Unicode in URIs;
the (misleadingly-named) “ANSI” code page, which is an arbitrary default that varies from system to system. On a Western European Windows install it will be code page 1252 (which is similar to ISO-8859-1); on a Chinese Windows install it will be code page 936 (similar to GB2312).
The ANSI code page is a pain because you never know what it's going to be, it's never UTF-8, and if your filename contains characters that don't exist in ANSI—which will certainly be the case if you have the filename 徘吐驴欸觰.bmp on a Western Windows install—you can't access the file at all.
So which do the browsers use?
IE: ANSI code page
Safari/Opera: UTF-8
Chrome/Firefox: UTF-8, unless the bytes are not a valid UTF-8 sequence, in which case the ANSI code page is used instead.
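To make the ambiguity concrete, here is a minimal PHP sketch (PHP is not part of the question; it is only used here to show the two possible interpretations of the same percent-encoded bytes):
$bytes = rawurldecode('%E5%BE%98%E5%90%90%E9%A9%B4%E6%AC%B8%E8%A7%B0');
// Interpreted as UTF-8 (Safari, Opera, Chrome, Firefox): the intended filename.
echo $bytes, "\n"; // 徘吐驴欸觰 - append ".bmp" and the file is found
// Interpreted through a Western "ANSI" code page (IE 7). ISO-8859-1 is used below as a
// stand-in for code page 1252: the result is mojibake, a filename that does not exist
// on disk, hence the red X.
echo iconv('ISO-8859-1', 'UTF-8', $bytes), "\n";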
So in conclusion, you can't reliably use non-ASCII characters in file:/// URLs at all.
This is in contrast to HTTP. The IIS web server, for example, has the same UTF-8-with-fallback-to-ANSI behaviour as Chrome and Firefox. Non-ASCII characters via IRI and a suitably-configured server are fine, but not the local filesystem.
(On non-Windows platforms filenames are natively bytes, usually representing UTF-8-encoded characters, but still bytes. So there is no ambiguity between the filesystem names and the byte-based URL %-sequences.)
die ANSI code page die. Why won't Microsoft kill you? You have long outstayed your welcome. You ruin everything.


Weird behaviour HTML accented character [duplicate]

I'm setting up a new server and want to support UTF-8 fully in my web application. I have tried this in the past on existing servers and always seem to end up having to fall back to ISO-8859-1.
Where exactly do I need to set the encoding/charsets? I'm aware that I need to configure Apache, MySQL, and PHP to do this — is there some standard checklist I can follow, or perhaps troubleshoot where the mismatches occur?
This is for a new Linux server, running MySQL 5, PHP 5, and Apache 2.
Data Storage:
Specify the utf8mb4 character set on all tables and text columns in your database. This makes MySQL physically store and retrieve values encoded natively in UTF-8. Note that MySQL will implicitly use utf8mb4 encoding if a utf8mb4_* collation is specified (without any explicit character set).
In older versions of MySQL (< 5.5.3), you'll unfortunately be forced to use simply utf8, which only supports a subset of Unicode characters. I wish I were kidding.
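As a rough illustration (a sketch only; the host, database, credentials, table, and column names below are placeholders, not anything from the question):
$pdo = new PDO('mysql:host=localhost;dbname=example;charset=utf8mb4', 'user', 'password');
$pdo->exec("
    CREATE TABLE comments (
        id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        body TEXT NOT NULL
    ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
");
Existing tables can be converted with ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4, though be aware this can affect column byte lengths and index size limits.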
Data Access:
In your application code (e.g. PHP), in whatever DB access method you use, you'll need to set the connection charset to utf8mb4. This way, MySQL does no conversion from its native UTF-8 when it hands data off to your application and vice versa.
Some drivers provide their own mechanism for configuring the connection character set, which both updates its own internal state and informs MySQL of the encoding to be used on the connection—this is usually the preferred approach. In PHP:
If you're using the PDO abstraction layer with PHP ≥ 5.3.6, you can specify charset in the DSN:
$dbh = new PDO('mysql:host=localhost;dbname=example;charset=utf8mb4', $user, $pass); // host, dbname and credentials are placeholders
If you're using mysqli, you can call set_charset():
$mysqli->set_charset('utf8mb4'); // object oriented style
mysqli_set_charset($link, 'utf8mb4'); // procedural style
If you're stuck with plain mysql but happen to be running PHP ≥ 5.2.3, you can call mysql_set_charset.
If the driver does not provide its own mechanism for setting the connection character set, you may have to issue a query to tell MySQL how your application expects data on the connection to be encoded: SET NAMES 'utf8mb4'.
The same consideration regarding utf8mb4/utf8 applies as above.
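A sketch of that fallback (assuming $dbh is an already-open PDO handle):
$dbh->exec("SET NAMES 'utf8mb4'");
Note that a plain query like this only informs the server; the driver's own idea of the connection charset is not updated, which is why the driver-level mechanisms above are preferred when available.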
Output:
UTF-8 should be set in the HTTP header, such as Content-Type: text/html; charset=utf-8. You can achieve that either by setting default_charset in php.ini (preferred), or manually using the header() function.
If your application transmits text to other systems, they will also need to be informed of the character encoding. With web applications, the browser must be informed of the encoding in which data is sent (through HTTP response headers or HTML metadata).
When encoding the output using json_encode(), add JSON_UNESCAPED_UNICODE as a second parameter.
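For instance (the array contents here are arbitrary; JSON_UNESCAPED_UNICODE requires PHP 5.4 or later):
echo json_encode(array('name' => 'Jürgen', 'city' => '東京'), JSON_UNESCAPED_UNICODE);
// {"name":"Jürgen","city":"東京"} rather than {"name":"J\u00fcrgen","city":"\u6771\u4eac"}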
Input:
Browsers will submit data in the character set specified for the document, hence nothing particular has to be done on the input.
In case you have doubts about request encoding (in case it could be tampered with), you may verify every received string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously. There's really no way around this, as malicious clients can submit data in whatever encoding they want, and I haven't found a trick to get PHP to do this for you reliably.
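A minimal sketch of such a check (the error handling is only illustrative, and it only walks flat fields for brevity):
foreach ($_POST as $value) {
    if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
        header('HTTP/1.1 400 Bad Request');
        exit('Request contained invalid UTF-8.');
    }
}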
Other Code Considerations:
Obviously enough, all files you'll be serving (PHP, HTML, JavaScript, etc.) should be encoded in valid UTF-8.
You need to make sure that every time you process a UTF-8 string, you do so safely. This is, unfortunately, the hard part. You'll probably want to make extensive use of PHP's mbstring extension.
PHP's built-in string operations are not by default UTF-8 safe. There are some things you can safely do with normal PHP string operations (like concatenation), but for most things you should use the equivalent mbstring function.
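A quick illustration of the byte/character difference (assuming the source file itself is saved as UTF-8):
$s = 'héllo';                       // the "é" is two bytes in UTF-8
echo strlen($s);                    // 6 - counts bytes
echo mb_strlen($s, 'UTF-8');        // 5 - counts characters
echo mb_substr($s, 0, 2, 'UTF-8');  // "hé" - a byte-based substr() could cut the "é" in half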
To know what you're doing (read: not mess it up), you really need to know UTF-8 and how it works on the lowest possible level. Check out any of the links from utf8.com for some good resources to learn everything you need to know.
I'd like to add one thing to chazomaticus' excellent answer:
Don't forget the META tag either (like this, or the HTML4 or XHTML version of it):
<meta charset="utf-8">
That seems trivial, but IE7 has given me problems with that before.
I was doing everything right; the database, database connection and Content-Type HTTP header were all set to UTF-8, and it worked fine in all other browsers, but Internet Explorer still insisted on using the "Western European" encoding.
It turned out the page was missing the META tag. Adding that solved the problem.
Edit:
The W3C actually has a rather large section dedicated to I18N. They have a number of articles related to this issue – describing the HTTP, (X)HTML and CSS side of things:
FAQ: Changing (X)HTML page encoding to UTF-8
Declaring character encodings in HTML
Tutorial: Character sets & encodings in XHTML, HTML and CSS
Setting the HTTP charset parameter
They recommend using both the HTTP header and HTML meta tag (or XML declaration in case of XHTML served as XML).
In addition to setting default_charset in php.ini, you can send the correct charset using header() from within your code, before any output:
header('Content-Type: text/html; charset=utf-8');
Working with Unicode in PHP is easy as long as you realize that most of the string functions don't work with Unicode, and some might mangle strings completely. PHP considers "characters" to be 1 byte long. Sometimes this is okay (for example, explode() only looks for a byte sequence and uses it as a separator, so it doesn't matter what actual characters you look for). But other times, when the function is actually designed to work on characters, PHP has no idea that your text contains multi-byte Unicode characters.
A good library to check into is phputf8. This rewrites all of the "bad" functions so you can safely work on UTF-8 strings. There are extensions, like the mbstring extension, that try to do this for you too, but I prefer using the library because it's more portable (but I write mass-market products, so that's important for me). phputf8 can use mbstring behind the scenes, anyway, to increase performance.
Warning: This answer applies to PHP 5.3.5 and lower. Do not use it for PHP version 5.3.6 (released in March 2011) or later.
Compare with Palec's answer to PDO + MySQL and broken UTF-8 encoding.
I found an issue with someone using PDO and the answer was to use this for the PDO connection string:
$pdo = new PDO(
    'mysql:host=mysql.example.com;dbname=example_db',
    "username",
    "password",
    array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8")
);
In my case, I was using mb_split, which uses regular expressions. Therefore I also had to manually make sure the regular expression encoding was UTF-8 by doing mb_regex_encoding('UTF-8');
As a side note, I also discovered by running mb_internal_encoding() that the internal encoding wasn't UTF-8, and I changed that by running mb_internal_encoding("UTF-8");.
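Putting those together, the relevant calls look roughly like this ($text is a placeholder for whatever you are splitting):
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
$words = mb_split('\s+', $text); // mb_split() honours the regex encoding set above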
First of all, if you are on PHP before 5.3, then no. You've got a ton of problems to tackle.
I am surprised that no one has mentioned the intl extension, which has good support for Unicode, graphemes, string operations, localisation and much more; see below.
I will quote some information about Unicode support in PHP from Elizabeth Smith's slides at PHPBenelux'14:
INTL
Good:
Wrapper around ICU library
Standardised locales, set locale per script
Number formatting
Currency formatting
Message formatting (replaces gettext)
Calendars, dates, time zone and time
Transliterator
Spoofchecker
Resource bundles
Convertors
IDN support
Graphemes
Collation
Iterators
Bad:
Does not support zend_multibyte
Does not support HTTP input output conversion
Does not support function overloading
mb_string
Enables zend_multibyte support
Supports transparent HTTP in/out encoding
Provides some wrappers for functionality such as strtoupper
ICONV
Primary for charset conversion
Output buffer handler
mime encoding functionality
conversion
some string helpers (len, substr, strpos, strrpos)
Stream Filter stream_filter_append($fp, 'convert.iconv.ISO-2022-JP/EUC-JP')
DATABASES
MySQL: set the character set and collation on tables and the character set on the connection (specifying a collation alone is not enough). Also, don't use the old mysql extension; use mysqli or PDO.
postgresql: pg_set_client_encoding
sqlite(3): Make sure it was compiled with Unicode and intl support
Some other gotchas
You cannot use Unicode filenames with PHP on Windows unless you use a 3rd-party extension.
Send everything in ASCII if you are using exec, proc_open and other command line calls
Plain text is not plain text, files have encodings
You can convert files on the fly with the iconv filter
The only thing I would add to these amazing answers is to emphasize saving your files in UTF-8 encoding; I have noticed that browsers accept this property over the UTF-8 you declare in your code. Any decent text editor will show you this. For example, Notepad++ has a menu option for file encoding, and it shows you the current encoding and enables you to change it. For all my PHP files I use UTF-8 without a BOM.
Some time ago someone asked me to add UTF-8 support to a PHP and MySQL application designed by someone else. I noticed that all files were encoded in ANSI, so I had to use iconv to convert all files, change the database tables to use the UTF-8 character set and the utf8_general_ci collation, add 'SET NAMES utf8' to the database abstraction layer after the connection (if using PHP before 5.3.6; otherwise you can use charset=utf8 in the connection string), and change the string functions to use their PHP multibyte equivalents.
I recently discovered that using strtolower() can cause issues where the data is truncated after a special character.
The solution was to use
mb_strtolower($string, 'UTF-8');
The mb_ prefix stands for multibyte. These functions support more characters but are in general a little slower.
In PHP, you'll need to either use the multibyte functions, or turn on mbstring.func_overload. That way things like strlen will work if you have characters that take more than one byte.
You'll also need to identify the character set of your responses. You can either use AddDefaultCharset, as above, or write PHP code that returns the header. (Or you can add a META tag to your HTML documents.)
I have just gone through the same issue and found a good solution in the PHP manual.
I changed all my files' encoding to UTF-8 and then set the default encoding on my connection. This solved all the problems.
if (!$mysqli->set_charset("utf8")) {
    printf("Error loading character set utf8: %s\n", $mysqli->error);
} else {
    printf("Current character set: %s\n", $mysqli->character_set_name());
}
Unicode support in PHP is still a huge mess. While it's capable of converting an ISO-8859-1 string (the single-byte encoding its string functions implicitly assume) to UTF-8, it lacks the capability to work with Unicode strings natively, which means all the string processing functions will mangle and corrupt your strings.
So you have to either use a separate library for proper UTF-8 support, or rewrite all the string handling functions yourself.
The easy part is just specifying the charset in HTTP headers and in the database and such, but none of that matters if your PHP code doesn't output valid UTF-8. That's the hard part, and PHP gives you virtually no help there. (I think PHP 6 is supposed to fix the worst of this, but that's still a while away.)
If you want a MySQL server to decide the character set, and not PHP as a client (old behaviour; preferred, in my opinion), try adding skip-character-set-client-handshake to your my.cnf, under [mysqld], and restart mysql.
This may cause trouble in case you're using anything other than UTF-8.
The top answer is excellent. Here is what I had to do on a regular Debian, PHP, and MySQL setup:
// Storage
// Debian. Apparently already UTF-8
// Retrieval
// The MySQL database was stored in UTF-8,
// but apparently PHP was requesting ISO 8859-1. This worked:
// ***notice "utf8", without dash, this is a MySQL encoding***
mysql_set_charset('utf8');
// Delivery
// File *php.ini* did not have a default charset,
// (it was commented out, shared host) and
// no HTTP encoding was specified in the Apache headers.
// This made Apache send out a UTF-8 header
// (and perhaps made PHP actually send out UTF-8)
// ***notice "utf-8", with dash, this is a php encoding***
ini_set('default_charset','utf-8');
// Submission
// This worked in all major browsers once Apache
// was sending out the UTF-8 header. I didn’t add
// the accept-charset attribute.
// Processing
// Changed a few commands in PHP, like substr(),
// to mb_substr()
That was all!

Using HTML Symbol Entities instead of the actual symbol

Is there any particular reason I should use HTML symbol entities instead of the actual symbol (I mean the one which I can just type)? For example, the symbol /: the HTML entity code for it is &#47;.
Should I use the symbol's code or the symbol itself in my HTML code, and why?
Using an HTML entity reference allows the character to be represented as intended regardless of the encoding applied to the document. That is the benefit.
Rather than strictly using entities for all non-US-ASCII characters, feel free to use an encoding for your document that supports the document's target language, preferably one also supporting other languages, like UTF-8.
However, please avoid using any system-specific encoding, especially the default Windows encodings. It is often the case that Windows-1252 text is sent to other systems wrongly labelled as ISO-8859-1.
In the past there has certainly been less reliable support for numeric HTML entities than for named HTML entities (based on my own first-hand observation), but in theory a numeric HTML entity is still character-encoding independent and "safe", because the numeric value refers directly to a code point registered in the UCS (http://en.wikipedia.org/wiki/Universal_Character_Set) and is equivalent to its defined character name.
Caveat: the following describes my own experience, and yours may vary.
HTML documents that clients transfer to me for editing, with symbols directly embedded, are very often corrupted and cannot be recovered. This may be a weakness of U.S. infrastructure or a lack of knowledge on the part of my customers about how to send their documents. The infrastructure and people in a country whose primary language relies on non-ASCII characters would be much more likely to support and understand how to properly transfer their documents with no corruption.
If you are developing your own website and uploading the final copies of your own files to your server, then the risk of corruption is very small.
If you do not have control over your document from the point you edit it to the point that it is served to users, then you run the risk (perhaps not today, but certainly within recent years in the U.S., a likelihood more than mere risk) of having the document improperly converted at some point along the way and being permanently corrupted regardless of what encoding you attempt to view it in.
No.
Entities and character references are useful only if:
The character has special meaning in HTML at the point where you want to use the character (/ never will, it only has special meaning in places where you can't have a / as data anyway).
You can't type the character (e.g. because it doesn't appear on your keyboard).
You can't encode the file as UTF-8 (or in another encoding that includes it … and / appears in ASCII).
Unless you know for a fact that you will always be using the same software and computer system to edit your HTML, you will inevitably run into situations where you cannot edit your own code if you directly use symbols, regardless of what character encoding you specify in your document or with your HTTP headers. Only in a perfect world does the character encoding always properly transfer, and even then neither Macintosh nor Windows truly does it correctly.
If I open up a supposedly "properly" encoded document from either Macintosh or Windows in software that truly supports all available encoding systems, I see a message like this:
-=-J(DOS)**--F1 Top L3 (Text) ----------------------------------------
These default coding systems were tried to encode text
in the buffer:
(iso-2022-7bit-dos (284 . 4194194) (379 . 4194194) (462 . 4194195)
(492 . 4194196) (635 . 4194195) (640 . 4194196) (642 . 4194195) (772
. 4194196) (833 . 4194195) (839 . 4194196) (857 . 4194195))
(utf-8-dos (284 . 4194194) (379 . 4194194) (462 . 4194195) (492
. 4194196) (635 . 4194195) (640 . 4194196) (642 . 4194195) (772
. 4194196) (833 . 4194195) (839 . 4194196) (857 . 4194195))
However, each of them encountered characters it couldn't encode:
iso-2022-7bit-dos cannot encode these: \222 \222 \223 \224 \223 \224 \223 \224 \223 \224 ...
utf-8-dos cannot encode these: \222 \222 \223 \224 \223 \224 \223 \224 \223 \224 ...
Click on a character (or switch to this window by `C-x o'
and select the characters by RET) to jump to the place it appears,
where `C-u C-x =' will give information about it.
Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
to remove or modify the problematic characters,
or specify any other coding system (and risk losing
the problematic characters).
thai-tis620
Remember that as soon as the data is off of your server, e.g., placed in an email, there is no guarantee the encoding is passed along, and chances are that it is not. Byte-order marks and other invisible means of identifying documents do not work as promised, let alone transient methods such as HTTP headers, which are lost as soon as the document moves beyond the context of your own carefully configured HTTP server.
The guiding principle of HTML is that it is a plain text markup language that, when properly used, is universally compatible with any system supporting the most basic of text. HTML documents should use HTML entities for any characters outside of the normal 7-bit US-ASCII character set. Any other characters have different binary definitions depending on the encoding used and may even vary between single-byte and multi-byte representations.
Within non-HTML documents you can feel free to use raw symbols, because when you embed them within either their native file format or within HTML you can ensure that you specify the "right" character encoding, i.e., the one that will be recognized by the system where you authored it and any system compatible with that.

Is it advisable to have non-ascii characters in the URL?

We are currently working on an I18N project. I am wondering what the complications of having non-ASCII characters in the URL are. If it's not advisable, what are the alternatives to deal with this problem?
EDIT (in response to Maxym's answer):
The site is going to be local to a specific country, and I need not worry about the worldwide public accessing it. I understand that from a usability point of view it is really annoying. What are the other technical problems associated with this?
It is possible to use non-ASCII/non-Latin domain names using IDNA. Further, you can always use percent-encoding (like %20 for space) in URLs. RFC 3986 recommends UTF-8 encoding combined with percent-encoding:
the data should first be encoded as octets according to the UTF-8 character encoding; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. (...) For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
Modern clients (web browsers) are able to transform back and forth between percent encoding and Unicode, so the URL is transferred as ASCII but looks pretty for the user.
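In PHP, for instance, rawurlencode() produces exactly this kind of UTF-8 percent-encoding for a path segment (the word below is just an example):
echo 'http://de.wikipedia.org/wiki/' . rawurlencode('Fürth');
// http://de.wikipedia.org/wiki/F%C3%BCrth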
Make sure you're using a web framework/CMS that understands this encoding as well, to simplify URL input from webmasters/content editors.
I would say no. The reason is simple: if you rely on a worldwide public, it becomes a big problem for people to type your URL. I live in the "Cyrillic" world; it is possible to create Cyrillic URLs, but nobody has succeeded with that, because even we are too lazy to switch the keyboard language and are used to typing Latin...
Update:
I can't say much about alternatives, but sometimes languages have a formal or informal letter substitute; e.g. in German you can write Ö, but in a URL you would often see OE instead. You can also consider English words, or words with similar sounds (so people from your country can remember the spelling, and people from other countries aren't put off).
It depends on the target users... For example, Nürnberg.de is also reachable as nuernberg.de, for the sake of making it easily accessible to native German users (the German keyboard is their default and has all 4 extra symbols (öäüß) available to every German speaker), and do not forget that one of the goals of I18N is to provide a native-language feel to the end user. Mac and Linux users have even more intuitive input methods for I18N; for example, pressing Alt+u on a Mac adds an umlaut to a character.
I was just wondering what are the complications of having the non-ascii characters in the URL.
The way you laid out your question, it seems you are asking more about URIs than URLs, and about putting non-ASCII characters into the name (URN) part of the URI. There are no real complications in it if you know where and how to parse that part on the server (for example, on a Django-based server the path can be parsed and handled using the regexes in urls.py). All you need to keep in mind is that with the Web 2.0 (Ajax/JavaScript-based) evolution, everything mainly runs in UTF-8, as the JavaScript specification demands UTF-8 encoding, and UTF-8 has thus evolved into a de facto standard. Stick with UTF-8 and you will hardly face any complications in parsing URIs and working with them.
For example, check the URIs http://de.wikipedia.org/wiki/Fürth or http://hi.wikipedia.org/wiki/जर्मनी. Irrespective of the encoding you type in the address bar, the browser will translate it to UTF-8 and send it to the server.
Note: besides UTF-8, there are some symbols that have to be encoded using percent-encoding; more about it can be found here:
http://en.wikipedia.org/wiki/Percent-encoding
You can use non-ASCII characters in a URL, but it's ugly, because special characters must be encoded like this:
http://www.w3schools.com/tags/ref_urlencode.asp

How to store unicode data in a format that doesn't support utf-8

Okay, here's yet another character encoding question, demonstrating my ignorance of all things Unicode.
I am reading data out of Microsoft Excel .xls files and storing it in ESRI shapefiles (.shp). For versions of Excel > 5.0, text in Excel files is stored as Unicode. However, Unicode (and specifically UTF-8) support for shapefiles is inconsistent, and thus I think I should not use it at all. Shapefiles do support old-school codepages, however.
What is the best practice in a situation where you must convert a Unicode string to a string in an unknown but specific codepage?
As I understand it, a Unicode string can include characters from multiple "codepages". I would assume, therefore, that I must somehow estimate the "best" codepage to use, and then convert all non-supported characters into their closest approximation in that codepage (or the dreaded ?). Is this the usual approach?
I can definitely use more than just the system codepage. Because .shp files use the .dbf files to store their attribute data, at least all the codepages specified by the .dbf format should be supported (see the xBase format description). The supported codepages are: DOS USA, DOS Multilingual, Windows ANSI, Standard Macintosh, EE MS-DOS, Nordic MS-DOS, Russian MS-DOS, Icelandic MS-DOS, Kamenicky (Czech) MS-DOS, Mazovia (Polish) MS-DOS, Greek MS-DOS (437G), Turkish MS-DOS, Russian Macintosh, Eastern European Macintosh, Greek Macintosh, Windows EE, Russian Windows, Turkish Windows, Greek Windows
In addition, some applications support the use of a *.cpg file which specifies additional codepages to use (although I understand support for UTF-8, and I suspect many other codepages, is limited).
Because I am trying to develop a general purpose tool, I can't assume anything about the content of the Unicode in the .xls files.
What is the best practice in a situation where you must convert a Unicode string to a string in an unknown but specific codepage?
Depends on the file format. If it supports Unicode "escape sequences" like XML's &#x20AC; or JSON's \u20AC, then use those, and you won't lose any information. If not, a different approach is required.
I would assume, therefore, that I must somehow estimate the "best" codepage to use,
Generally, on a non-Unicode system, you'd convert characters into whatever the default encoding is, not an arbitrary code page.
Edit: So you do get a choice of code pages:
01h DOS USA code page 437
6Ah Greek MS-DOS (437G) code page 737
02h DOS Multilingual code page 850
64h EE MS-DOS code page 852
6Bh Turkish MS-DOS code page 857
67h Icelandic MS-DOS code page 861
65h Nordic MS-DOS code page 865
66h Russian MS-DOS code page 866
C8h Windows EE code page 1250
C9h Russian Windows code page 1251
03h Windows ANSI code page 1252
CBh Greek Windows code page 1253
CAh Turkish Windows code page 1254
04h Standard Macintosh code page 10000
98h Greek Macintosh code page 10006
96h Russian Macintosh code page 10007
68h Kamenicky (Czech) MS-DOS
69h Mazovia (Polish) MS-DOS
97h Eastern European Macintosh
To choose a code page, I would recommend:
Check if your data is plain ASCII. If so, it doesn't matter which code page you choose.
If not, try to find a code page that can exactly represent your data (or if you can't, one that minimizes the unrepresentable characters). Try code page 1252 first, then the other 125x code pages. Don't bother with the DOS code pages unless you have box-drawing characters.
and then convert all non-supported characters into their closest approximation in that codepage (or the dreaded ?). Is this the usual approach?
It's the approach we take at work when we need to convert a UTF-8 file into windows-1252 or into EBCDIC. I used Unidecode to help generate the "closest approximations".
We do, however, only replace letters and digits, not punctuation. Replacing curly quotes (“ ”) with straight quotes (") would break a few file formats.
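If you happen to be working in PHP, iconv's transliteration mode gives a rough feel for this kind of "closest approximation" mapping (a sketch only; it is not the Unidecode-based approach described above, and the exact substitutions depend on the iconv implementation):
$text = 'Łódź and “smart” quotes';
// 'Ł' and 'ź' have no cp1252 equivalent; //TRANSLIT substitutes approximations
// (e.g. "L", "z") or '?', and //IGNORE drops anything still unmappable.
// The result is a Windows-1252-encoded byte string.
echo iconv('UTF-8', 'Windows-1252//TRANSLIT//IGNORE', $text);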
What language is your text in? If the characters are mostly ASCII, it's probably best to write the original UTF-8 encoded text as such. A non-UTF-8-aware program will still read ASCII text correctly and display garbled ASCII for unknown characters.

How do I sanitize user input for proper content-encoding before I save it?

I've got an application where users input text into forms.
The data is saved into a MySQL database (collation: utf8_general_ci) and then output as XML (encoding: UTF-8).
The problem is that people tend to cut and paste their information from other sources, for instance Microsoft Word documents or PDFs.
This input text often has characters which are incorrect for the output encoding, things like "smart quotes", which come from a document in Windows-1252 encoding.
This causes problems, obviously, when transforming or otherwise working on the XML because the characters are illegal.
So, how to sanitise the input?
Previously, I've used some fairly brute-force methods, things like the "de-moronize" script which consists of a long list of search-and-replace operations.
Is this still the best way to do it? Is there any other way?
Can I just set the accept-charset attribute on the form and have the browser do it for me?
If so, which browsers will do that and are there likely to be any problems?
Also, how come my database is accepting these characters, which are reserved/control characters in UTF-8?
As you can see, I know enough about encodings to know I have a problem, but I'm now a bit out of my depth...
TIA
This input text often has characters which are incorrect for the output encoding, things like "smart quotes", which come from a document in Windows-1252 encoding
“Smart quotes” (bytes 147 and 148 in cp1252) are perfectly valid Unicode characters, U+201C and U+201D. Your application should be capable of handling them seamlessly; if not, you're doing something wrong and most likely all non-ASCII characters will fail.
Regardless of whether the characters came from someone typing them or someone pasting them in from Word, the browser should be submitting UTF-8-encoded characters to your application, which should be storing the same UTF-8 bytes to the database.
If the browser is not submitting in UTF-8, chances are you're failing to set the charset of the HTML page containing the form. This can be done using the:
Content-Type: text/html;charset=utf-8
HTTP header and/or the:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
element in <head>.
Can I just set the accept-charset attribute on the form and have the browser do it for me?
No, accept-charset is basically useless thanks to IE, which misinterprets it to mean “try using this charset if the one on the page can't encode the characters we want”, instead of “always use this charset”. This means if you use accept-charset you can end up with a mixture of encodings submitted at once, with no way to figure out which is which. Nice!
how come my database is accepting these characters, which are reserved/control characters in UTF-8?
In MySQL, utf8_general_ci is just a collation, used for comparison and ordering. The database is still storing the data as bytes and doesn't really care if they're not valid UTF-8 sequences.
It's a good idea to decode and check incoming UTF-8 sequences in your app anyway, because "overlong sequences", invalid in modern Unicode, can hide a ‘<’ character that will still be recognised by older browsers (at least IE6 pre-SP2, Opera 7).
ETA:
So, I entered a string containing byte 146
No, you entered the Unicode character U+2019. The browser deals with Unicode characters, not bytes, right up until the point it has to submit the serialised form to the server. It's then that it decides how to turn the characters into bytes, and if the page is being handled as UTF-8, it will always choose UTF-8.
(If it's not UTF-8, browsers tend to cheat in a non-standards-compliant way: for all characters that can't fit in the encoding, they'll encode them to HTML character references like ‘&#8217;’. This is wrong because you now can't tell the difference between a browser-escaped ‘&’ and a real, user-typed ‘&’, and it's insidiously wrong because if you then echo the reference as unescaped HTML it looks like you're getting it right, when in fact you've just made a big old security hole.)
It went into the database as 146
Really, a ‘\x92’ byte, not ‘\xC2\x92’, ‘\xE2\x80\x99’ or ‘&#146;’?
it came out when I produced the (UTF-8-encoded) XML, as 146. No complaints from the browser
Then it did not come out as a single 146-byte. A browser will complain when given a bare ‘\x92’ in an XML file. (Not an HTML file, in which invalid UTF-8 sequences come out as a missing-character glyph.)
I suspect it is coming out as a ‘&#146;’ character reference, which is well-formed (though the character U+0092 is part of the C1 control set, so won't render as anything useful). If this is what's happening, your form page is not being picked up as UTF-8 after all, and you're suffering the browser-auto-escaping-submission problem described above.
You might try the Perl Encode module. It supports conversion between a number of character sets, including UTF-8 of course. I just checked my install of Perl and it also supported "cp1252", which is just another name for Windows-1252 according to Wikipedia. You can check your own install with the following one-liner:
perl -MEncode -e 'print map {"$_\n"} Encode->encodings(":all");'
"Can I just set the accept-charset attribute on the form and have the browser do it for me?"
Only if you're prepared to trust "the browser" - that might be suitable in some applications, but in general it's leaving yourself wide open to mischief (or worse).
(Also see bobince's warnings about IE...)
Iain