Mixing character sets and collations - mysql

I am setting up a new MySQL database, and going through the default settings that have been provided to me, I was surprised to find that it uses a mix of character sets and collations.
SHOW VARIABLES LIKE '%character_set%';
SHOW VARIABLES LIKE '%collation%';
produces:
+--------------------------+-------------------+
| Variable_name            | Value             |
+--------------------------+-------------------+
| character_set_client     | latin1            |
| character_set_connection | latin1            |
| character_set_database   | utf8              |
| character_set_server     | latin1            |
| character_set_system     | utf8              |
| collation_connection     | latin1_swedish_ci |
| collation_database       | utf8_general_ci   |
| collation_server         | latin1_swedish_ci |
+--------------------------+-------------------+
Is it risky to have a combination of latin1 and utf8 in these settings? I have always felt that potential problems will be avoided by using the same (preferably UTF-8/Unicode) everywhere.

Yes, you can get into trouble mixing character sets, and it's a big pain to clean it up. The risk is that you'll declare a table with latin1 encoding, and some app will try to store utf8 codes in it. The result is a mess.
Read http://www.mysqlperformanceblog.com/2013/10/16/utf8-data-on-latin1-tables-converting-to-utf8-without-downtime-or-double-encoding/ for the proper way to get your data back under control if this happens. If it sounds kind of complicated and risky to you, then you understand it! :-/
But it's far better to avoid the mess in the first place. Just use utf8 everywhere from day one. If you need to support the full range of Asian languages, then use utf8mb4.
Exception: if you have strings that you know will never need to store international symbols, for example a string of hexadecimal digits, then you can declare the character set ascii for individual columns in a MySQL table.
But use utf8 as the default at the database and table level, and also for the MySQL connection and on up into the application, Apache, HTTP, etc.

Related

Does characterEncoding in the connection string set how values are stored?

I'm about to change the encoding for a database from latin1 to utf8mb4.
Due to privacy restrictions, I don't know what the database to be converted contains. I'm worried that running the SQL below may change existing data.
ALTER TABLE table CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
However, the connection string from the Grails application contains useUnicode=true&characterEncoding=UTF-8; does this mean that even though latin1_swedish_ci is used for a column, the actual value that has been saved is UTF-8 encoded?
And since this value is UTF-8 encoded, there is no risk that the data will be affected by the change from latin1 to utf8mb4?
+--------------------------+-------------------+
| Variable_name | Value |
+--------------------------+-------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| collation_connection | utf8_general_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+--------------------------+-------------------+
That's Ώπα? That's its interpretation in UTF-8 (as the outside world calls it), which MySQL implements as utf8mb4 (the full equivalent) or utf8 (MySQL's partial, 3-bytes-per-character implementation of UTF-8).
It would not work well in latin1.
The encoding in the client and the encoding of a column in the database need not be the same. However, Greek from the client cannot be crammed into a latin1 column, hence the error message.
What ALTER TABLE table CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci; does is to change all the text columns in that table to be utf8-encoded and convert from whatever encoding is currently used (presumably latin1). This is fine for Western European characters, all of which exist (with different encodings) in both latin1 and utf8.
To handle Emoji and some Chinese characters, you may as well go to utf8mb4:
ALTER TABLE table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;
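The reason this conversion is safe for Western European text can be sketched in Python (illustrative, not MySQL internals): CONVERT TO CHARACTER SET amounts to decoding with the old charset and re-encoding with the new one.

```python
# 'é' exists in both latin1 and utf8, with different byte encodings.
latin1_bytes = "é".encode("latin-1")   # one byte: b'\xe9'
utf8_bytes = "é".encode("utf-8")       # two bytes: b'\xc3\xa9'

# CONVERT TO CHARACTER SET is effectively decode-then-re-encode,
# so the character itself is preserved:
assert latin1_bytes.decode("latin-1").encode("utf-8") == utf8_bytes
```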

When does mysql throw an error instead of coercing text into the default column format?

I'm working with two mysql servers, trying to understand why they behave differently.
I've created identical tables on each:
+------------+----------+-------------------+
| Field      | Type     | Collation         |
+------------+----------+-------------------+
| some_chars | char(45) | latin1_swedish_ci |
| some_text  | text     | latin1_swedish_ci |
+------------+----------+-------------------+
and I've set identical character set variables:
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   |
| character_set_connection | utf8   |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | utf8   |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
When I insert UTF-8 characters into the database on one server, I get an error:
DatabaseError: 1366 (HY000): Incorrect string value: '\xE7\xBE\x8E\xE5\x9B\xBD...'
The same insertion on the other server throws no error. The table just silently accepts the UTF-8 insertion and renders a bunch of ? marks where the UTF-8 characters should be.
Why is the behavior of the two servers different?
What command were you executing when you got the error?
Your data is obviously utf8 (good).
Your connection apparently is utf8 (good).
Your table/column is declared CHARACTER SET latin1? It should be utf8.
That is 美 - Chinese, correct? Some Chinese characters need 4-byte utf8. So you should use utf8mb4 instead of utf8 in all 3 cases listed above.
Other notes:
There is no substantive difference in this area in 5.6 versus 5.7.
Check SQL_MODE: with strict mode (STRICT_TRANS_TABLES or STRICT_ALL_TABLES) an unrepresentable character raises error 1366, while without it the server only warns and stores ? instead. A difference in SQL_MODE is the most likely reason your two servers behave differently.
VARCHAR is usually advisable over CHAR.
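The two observed behaviors can be mimicked in Python (a sketch of the idea, not of MySQL internals): rejecting characters the target charset cannot represent versus replacing each of them with ?.

```python
text = "美国"  # the characters from the error message (\xE7\xBE\x8E\xE5\x9B\xBD)

# Strict behavior: unrepresentable characters are rejected outright.
try:
    text.encode("latin-1")
    raise AssertionError("should not be representable in latin1")
except UnicodeEncodeError:
    print("rejected, like error 1366")

# Lenient behavior: each unrepresentable character becomes '?'.
lossy = text.encode("latin-1", errors="replace")
assert lossy == b"??"
```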

If I change the character set of a MySQL database, do I need to modify Ruby clients?

I have a collection of ruby scripts that access my MySQL database. I need to modify the character set for this database, specifically change the tables from Latin1 to UTF8. Do I need to modify my scripts at all? I've looked and I see I can set the character set for a connection, is this mandatory?
Part of my hesitation in thinking I do not need to make any adjustments is the settings the database has today. Looking at how character sets are already set up:
mysql> SHOW VARIABLES LIKE "%char%";
+--------------------------+-------------------------------------------+
| Variable_name | Value |
+--------------------------+-------------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /rdsdbbin/mysql-5.6.21.R1/share/charsets/ |
+--------------------------+-------------------------------------------+
wouldn't this suggest that clients are already set up to utilize a UTF8 character set?
MySQL manages encodings per client and per column. This means every column in which text is stored has an encoding setting, and every connected client individually has an encoding setting as well. Text is converted on the fly between the two as necessary. If a client sends UTF-8 data to be stored in an SJIS column, MySQL will make that conversion automatically (and the other way around on the way back out).
As such, it only really matters what encoding the client specifies when connecting to the database. If you're not specifying this explicitly in your Ruby code, you will get an implicit default setting. Changing the encoding of a MySQL column will not modify this default encoding. As such: nothing to do.
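That on-the-fly conversion can be sketched in Python (a hypothetical simulation of the server's behavior, using the SJIS example above):

```python
# Client connection charset: utf8; column charset: sjis.
wire_bytes = "美".encode("utf-8")        # what the client sends
text = wire_bytes.decode("utf-8")        # server decodes per connection charset
stored = text.encode("shift_jis")        # server re-encodes per column charset

assert stored != wire_bytes                   # different bytes at rest...
assert stored.decode("shift_jis") == text     # ...but the same character

# On the way back out, the conversion runs in reverse:
assert stored.decode("shift_jis").encode("utf-8") == wire_bytes
```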

Proper MySQL character set/collation variables in my.cnf?

I'm trying to switch my site over to UTF-8 completely, so I don't have to deal with utf8_encode() & utf8_decode() functions.
I have the collation of my tables set properly, and I'm temporarily using the query SET NAMES utf8 to override the my.cnf file.
My question is — there are a ton of character set and collation variables in my.cnf, and I suspect that some ought to be left alone... which ones should I change to achieve the effect of SET NAMES utf8?
(The collation of my tables is utf8_unicode_ci.)
+--------------------------+-------------------+
| character_set_client     | latin1            |
| character_set_connection | latin1            |
| character_set_database   | latin1            |
| character_set_filesystem | binary            |
| character_set_results    | latin1            |
| character_set_server     | latin1            |
| character_set_system     | utf8              |
| collation_connection     | latin1_swedish_ci |
| collation_database       | latin1_swedish_ci |
| collation_server         | latin1_swedish_ci |
+--------------------------+-------------------+
Well, collation is primarily for sorting, so unless you're storing a language with specific sorting needs, utf8_unicode_ci should be fine.
The character_set_* values are used for all other string operations internally: value checks in places like WHERE clauses or IF/CASE statements, string functions like CHAR_LENGTH(), REPLACE(), SUBSTRING(), and so on.
Generally speaking, they should all be the same (in this case, utf8) except for filesystem - I'd recommend keeping that at binary unless you have a specific need to move away from that.
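To make this the default rather than a per-connection SET NAMES override, the relevant my.cnf settings look roughly like this (a sketch; exact option spellings and section names vary slightly by MySQL version, so check your server's documentation):

```ini
[mysqld]
character-set-server = utf8
collation-server     = utf8_unicode_ci

[client]
default-character-set = utf8
```

Note that character_set_client, character_set_connection and character_set_results are negotiated per connection (which is exactly what SET NAMES sets), while character_set_database follows each database's own default and is changed with ALTER DATABASE ... CHARACTER SET utf8.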

Migrating data between two MySQL with different character_set%, messed up with utf8

Migrating Data from MySQL server1 to MySQL server2
server1 Ver 14.12 Distrib 5.0.51a, for debian-linux-gnu (x86_64) using readline 5.2
mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+------------------------------------------+
| Variable_name | Value |
+--------------------------+------------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /data/mysql/gabino/share/mysql/charsets/ |
+--------------------------+------------------------------------------+
8 rows in set
server2 Ver 14.12 Distrib 5.0.90, for pc-linux-gnu (x86_64) using readline 6.0
mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set
Server1's MySQL is the backend of a WordPress blog, and everything worked fine from the frontend until I (the unlucky guy) had to migrate the data. I logged into phpMyAdmin and the MySQL console, and from the backend it seems that every East-Asian character on server1 is messed up, both in SELECT queries in the console and in mysqldump files. The symptom: the Chinese character 看 turns into the three latin1 characters 看, which is the same result as SELECT _latin1'看'. The UTF-8 encoding of 看 is \xe7\x9c\x8b, so MySQL is displaying each byte as an individual latin1 character instead of rendering the 3 bytes as one Chinese character.
Even if I use the 'Data Transfer' function in Navicat 8 to copy the two databases from server1 to server2 identically, the new blog running on server2 gets messed-up characters. I tried various methods like SET NAMES utf8 etc. and still cannot get it done.
So how can I tell/force server1 MySQL to handle the latin1 characters as utf8 and get them displayed and dumped correctly?
Do a hex dump (ie: SELECT HEX(columnname) FROM table) on both servers and see if the data is the same. If it is, then you'll know that at least the data didn't get corrupted.
In this case, you just need to set the correct charset and collation for the server(s). If not, you'll probably have to re-do the data transfer, and this time around make sure the settings are correct.
Another thing is make sure the browser's encoding is set to utf-8.
EDIT: So, data did get corrupted in the transfer. C3A7C593E280B9 is the UTF-8 encoding of 看, i.e. of 看 double-encoded. This is probably because server1 is sending the data as latin1, and server2 re-encodes that into UTF-8.
You have to change the connection settings on server1 before transferring data. To do that, run these queries:
SET CHARACTER SET utf8; SET NAMES utf8;
Then try the data transfer again.
EDIT 2: Based on what you said, here's what I think is happening. The data sitting on your database is encoded in UTF-8. When PHP (Wordpress) fetches this data, it "thinks" it's encoded in latin1 (ISO-8859-1), which is (unfortunately) what PHP uses by default. PHP goes on to serve this data to the user's browser as if it was encoded in latin1, but sets the character encoding as UTF-8, and the user sees what he's supposed to see.
In short, it's a case of two wrongs making a right. You now have two options:
Fix the data. (ie: read it as UTF-8 and write it back as latin1)
Set server2's connection settings to the same as server1, which will result in data still being displayed correctly.