I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
???? instead of Asian characters. Even for European text, I got Se?or for Señor.
Strange gibberish (Mojibake?), such as SeÃ±or for Señor, or æ–°æµªæ–°é—» for 新浪新闻.
Black diamonds, such as Se�or.
Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data? If so, how?
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte UTF-8 codes, which are needed by Emoji and some Chinese characters.
Outside of MySQL, "UTF-8" refers to the full encoding (1 to 4 bytes per character), hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
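For example, a table created along those lines might look like this (a minimal sketch; the table and column names are placeholders, and utf8mb4_unicode_520_ci assumes MySQL 5.6 or later):
CREATE TABLE greetings (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    msg VARCHAR(100) NOT NULL
) ENGINE=InnoDB
  DEFAULT CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_520_ci;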
Overview of what you should do
Have your editor, etc. set to UTF-8.
HTML forms should start like <form accept-charset="UTF-8">.
Have your bytes encoded as UTF-8.
Establish UTF-8 as the encoding being used in the client.
Have the column/table declared CHARACTER SET utf8mb4. (Check with SHOW CREATE TABLE.)
<meta charset=UTF-8> at the beginning of HTML
Stored Routines acquire the current charset/collation. They may need rebuilding.
UTF-8 all the way through
More details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
For a blank space (in any language): 20
For English: 4x, 5x, 6x, or 7x
For most of Western Europe, accented letters should be Cxyy
Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
Most of Asia: Exyyzz
Emoji and some of Chinese: F0yyzzww
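For example, against a hypothetical table containing Señor (tbl and col are placeholders):
SELECT col, HEX(col) FROM tbl WHERE col LIKE 'Se%';
-- Correctly stored UTF-8 for 'Señor' shows 5365C3B16F72 (the ñ is the pair C3B1)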
More details
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
The bytes to be stored are not encoded as utf8mb4. Fix this.
Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:
Case 1 (original bytes were not UTF-8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
Also, check that the connection during reading is UTF-8.
Mojibake (SeÃ±or for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
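The "check the connection" and "check the column" steps in the cases above can be done with something like this (a sketch; tbl is a placeholder):
SHOW VARIABLES LIKE 'character_set%';  -- client, connection, and results should be utf8mb4 (or utf8)
SHOW CREATE TABLE tbl;                 -- the column (or the table default) should say CHARSET utf8mb4 (or utf8)
SET NAMES utf8mb4;                     -- stopgap fix for the connection; better to set it at connect time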
If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were SeÃ±or.
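You can reproduce that arithmetic in SQL to see what the HEX test should show (a sketch, assuming the connection itself is utf8mb4):
SELECT HEX(CONVERT(CONVERT(BINARY 'é' USING latin1) USING utf8mb4));
-- é is C3A9; mislabelling those two bytes as latin1 and re-encoding them yields C383C2A9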
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
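As one illustration of the kind of repair behind that link, true Double Encoding can sometimes be undone by stripping the extra latin1-to-utf8 pass (a hedged sketch only; tbl and col are placeholders, the right fix depends on which of the 5 cases you actually have, and you should test on a copy of the data first):
UPDATE tbl SET col = CONVERT(BINARY CONVERT(col USING latin1) USING utf8mb4);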
I had similar issues with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one:
mysqli_set_charset($con,"utf8mb4");
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi (the PHP mysqli set_charset() function) when I was trying to get an INSERT from an HTML form working.
I was also searching for the same issue. It took me nearly one month to find the appropriate solution.
First of all, you will have to update your database's CHARACTER SET and COLLATION to utf8mb4, or at least to one that supports UTF-8 data.
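For example, the conversion can be done with ALTER statements along these lines (a sketch; my_db and my_table are placeholders, and CONVERT TO re-encodes the existing column data, so back up first):
ALTER DATABASE my_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;             -- default for new tables
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;  -- converts existing columns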
For Java:
While making a JDBC connection, add useUnicode=yes&characterEncoding=UTF-8 as parameters to the connection URL, and it will work.
For Python:
Before querying the database, try enforcing this on the cursor:
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
If it does not work, happy hunting for the right solution.
Set your code IDE's file encoding to UTF-8.
Add <meta charset="utf-8"> to the header of the web page where you collect the form data.
Check that your MySQL table definition looks like this:
CREATE TABLE your_table (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If you are using PDO, make sure it is set up like this:
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
$dbL = new PDO($pdo, $user, $pass, $options);
If you already have a large database with the above problem, you can try SIDU to export it with the correct charset and import it back as UTF-8.
Depending on how the server is set up, you have to change the encoding accordingly. From what you said, utf8 should work best. However, if you're getting weird characters, it might help to change the webpage encoding to ANSI.
This helped me when I was setting up PHP with MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++
Why won't MySQL recognize é and a lot more characters, including the em dash (—)? This is driving me nuts. I keep getting errors like Incorrect string value: '\xE9' for column
I am using MySQL 5.5.6, my tables are InnoDB, and they use the utf8 default collation.
I don't know if this is important, but I am doing a bulk insert from a CSV file which contains special characters, and my fields are of type TEXT.
I had a similar problem trying to SELECT ... WHERE table_col LIKE "%–%" (long dash). It turned out it wasn't working because the .php file that was sending the query wasn't in UTF-8 but in ANSI! Converting it to UTF-8 did the trick!
Your problem sounds like one I have dealt with in the past, and I concur with Synchro that the client connection settings may be where you need to look. You probably need to specify the UTF-8 character set when starting the connection.
I use PDO, and initiate the connection with this:
$this->dbConn = new PDO("mysql:host=$this->host;dbname=$this->dbname", $this->user, $this->pass, array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8"));
Before I started using PDO, I used this:
mysql_query("SET NAMES 'utf8'");
See http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
Just make sure the CSV file is in UTF-8 and not the default ANSI. To do this, open the CSV file in Notepad and, using the Save As option, ensure the encoding is UTF-8.
It's probably down to your PHP MySQL client's connection settings. Rob Allen's post can probably sort you out.
Rather than using a SET NAMES utf8 query, which the PHP docs explicitly warn against, there is a built-in function to do this for you in the mysqli extension: $mysqli->set_charset('utf8');.
An alternative explanation for bad characters if you're already doing this is that MySQL's utf8 charset isn't actually proper UTF-8... It only supports up to 3-byte characters and there are some increasingly common ones that use 4, specifically Emojis. Fortunately MySQL has a fix for this as of version 5.5.3: use the utf8mb4 charset instead.
On a related note, the sort order in the default utf8 charset (with the utf8_general_ci collation) has a number of problems that may affect you in, for example, German. The fix here is to use the utf8mb4_unicode_ci collation, which provides more accurate, though slightly slower, sorting.
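A quick way to see the German difference (a sketch; it assumes a server new enough to have the utf8mb4 collations):
SELECT _utf8mb4 'ß' = _utf8mb4 'ss' COLLATE utf8mb4_unicode_ci;  -- 1: ß compares equal to 'ss'
SELECT _utf8mb4 'ß' = _utf8mb4 'ss' COLLATE utf8mb4_general_ci;  -- 0: general_ci treats ß like a single 's'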
In order to get Hebrew text into the DB, I have to use SET NAMES 'hebrew'. For which other languages should I do the same?
No; you should not SET NAMES 'hebrew'. This will lock you into using a Hebrew-specific character set, making it impossible to store text in other non-Roman scripts.
Use SET NAMES 'utf8' to set MySQL to store text as Unicode. Always.
It is best to do it always; that's the only way to be sure your encoding is transmitted 100% correctly between database and client.
Always use SET NAMES with the character set your client is running on.
For example, if your client code gets strings and expects strings in UTF-8, set it to utf8; if you're running on latin1, set it to latin1.
Effectively, it tells the database what character set you want to communicate in: what character set your data uses and what character set you want for the results.
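Those three things correspond to the session variables that SET NAMES sets in one statement; roughly:
SET NAMES utf8mb4;
-- is shorthand for:
SET character_set_client = utf8mb4;      -- what your queries are encoded in
SET character_set_connection = utf8mb4;  -- what the server converts them to for processing
SET character_set_results = utf8mb4;     -- what result sets are sent back as
-- (it also sets collation_connection to the character set's default collation)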
I have a table with a field which contains strings in my MySQL database.
The MySQL version is 5.0.51a. The default character set for the table is 'utf8'.
Many of the strings have Unicode characters such as \xae and \u2122 (the registered symbol and trademark symbol, respectively).
For example, suppose I have a row with a field this value:
"Bing® Blang™ Blaow"
The default character set of my mysql command line client is "latin1".
If I issue a SELECT statement in the mysql client program from the command line without specifying a character set, the output of the title shows up like so:
"Bing® Blang Blaow"
The (R) symbol is correct but the (TM) symbol is missing. If I cut and paste this string from the console into TextMate, the (TM) symbol appears, but is half-way behind the g in the word "Blang".
I am assuming that the half-way-behind-the-g thing is a just a display error in TextMate (though if anyone can provide further detail that'd be great, but that's not really the important part).
The main thing I am inferring from the its-there-after-you-cut-and-paste behavior is that the data is in the database but there's something wrong with some sort of character set setting somewhere.
If I override the default encoding of the mysql client on the command line like so:
mysql --default-character-set=utf8
Then do the same select, the string comes out as:
"Bing® Blang™ Blaow"
which is to say that both the (R) and (TM) symbols appear and are in the right place but both are preceded by the unicode character \xae which is an A with a circumflex on top.
(Incidentally this is also how the data is displayed when I pull it out using python and display it on a web page, which is what my real problem is).
Anyway, what is going on here? Everything we have done recently has used UTF8 everywhere possible, but it's possible that some of these rows were inserted prior to that change which means they would've been using the latin1 default... however neither encoding seems to produce the right result?
If the rows were inserted when the default encoding on the table was latin1 before it was switched to utf8, then the encoding was switched (via alter table..) then would the encoding have actually been updated? Should one of the encodings work now? Will unicode ever stop kicking my ass?
There are quite a number of issues here:
About the characters
You indicate that the text has characters U+AE and U+2122 (® and ™ respectively). However, the results imply that the text has U+99 as the character after "Blang": When you set MySQL to output UTF8, then you see this "Â™" -- which is the UTF8 sequence for U+99 displayed on a terminal that is interpreting this byte stream as Windows-1252.
U+99 probably isn't what you wanted: in Unicode, that is an extended control character with no graphic representation. It just so happens that in Windows-1252, 0x99 is the encoding of the trademark symbol (U+2122).
(Please note that both MySQL and most web browsers have a common, "broken" behavior of using Windows-1252 when you choose Latin1. Sigh.)
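You can see that cp1252 behaviour directly (a small sketch): byte 0x99 labelled as latin1 converts to the trademark sign U+2122, not to the control character U+0099:
SELECT HEX(CONVERT(_latin1 X'99' USING utf8));  -- E284A2, the UTF-8 encoding of ™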
What's probably wrong
Your terminal isn't operating in the right character set. It is clearly operating in Windows-1252.
Programs should be connecting to the database in UTF-8. You can do that in the command line, as you've found, or by executing the statement SET NAMES utf8; on your database handle before doing anything else. Some other database APIs may have other ways of doing this, but there is no generic way for all SQL engines. SET NAMES ... is specific to MySQL, but it sets all the required character set variables (there are three!) at once.
The process that is inserting data into the database is taking user input and not correctly converting it from Windows-1252 into UTF-8 before inserting. This is how you got a U+99 into your database. Since I don't know how you are getting that data, I'm not sure what to fix, but here are several possibilities:
If the data comes from a web page form, be sure the page with the form is served in UTF-8, is properly marked as such (via the MIME Type, and the <meta> tag.) Be sure also, that the <form> tag is not specifying a different character set.
When converting the data, be sure that you use iconv or similar libraries to convert from the input character set to UTF-8. Even if you think the input is Latin1, do not try to do this by hand (for example, by zero-expanding every byte to 16 bits and then claiming it is UTF-16 - that won't work for Windows-1252!). Make absolutely certain that you know the character set of the source data. In particular, be sure to know whether it is Latin1 or Windows-1252.
Instead of converting the user input, you could connect to the database in the character set of the user input, and then just insert the raw byte data you get from the user. However, you must be sure to only do insertions this way: reading back data from the database with the user's character set in effect will lose information if other rows have data that can't be represented in that character set. It is possible to set up a MySQL connection so that you issue statements in one character set and read results back in another... but it isn't for the faint of heart, and future programmers will likely go nuts trying to understand why the code does this.
If, when you pull the data out with Python and display it in a web page, you see the string "Â™", then that is an indication that you are pulling the data out of the database correctly as UTF-8, but then putting it into a web page that is not correctly identified as UTF-8. It is probably just defaulting to Latin1, which as noted above will really be Windows-1252.
Nonetheless, even if you fix the display, note that the database has bad data in it, since U+99 isn't really the trademark symbol in a UTF-8 column. You'll need to clean up your data by reading all the data and replacing any characters in the range U+80 through U+9F with what they were likely to have been, assuming the data was really Windows-1252. If you're not certain what character set the data was in originally, then this data is, alas, just junk.
About changing character sets of tables
Converting the character set and collation of the table after inserting data will convert the columns, but, of course, any data already inserted will have already lost whatever characters the original character set couldn't represent.
Be careful to note the difference between ALTER TABLE foo CONVERT TO CHARACTER SET ... and ALTER TABLE foo CHARACTER SET ... The latter only changes the default character set for the table, and will not change any columns, even if they were set to the default at creation. (MySQL only uses the defaults at column creation time; it doesn't remember that a given column is "defaulted", nor does it keep it in sync with the table's default.)
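In other words (a sketch, reusing the placeholder name foo):
ALTER TABLE foo CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;  -- re-encodes existing column data and changes each column's charset
ALTER TABLE foo CHARACTER SET utf8 COLLATE utf8_general_ci;             -- only changes the table default used for columns added later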
I think it has to do with the settings of the mysql connection in your Python code.
Try setting conn.character_set_name or something like that; it depends on the MySQL connection library you are using.
In the case of MySQLdb, it should be something like this:
import new  # Python 2 only; monkey-patch the connection to report UTF-8
def character_set_name(*args, **kwargs): return 'utf-8'
conn.character_set_name = new.instancemethod(character_set_name, conn, conn.__class__)
Could it be that some of the columns have an explicitly different character set than the table default?
Something like this...?
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci