I'm about to change the encoding for a database from latin1 to utf8mb4.
Due to privacy restrictions, I don't know what the database to be converted contains. I'm worried that running the SQL below may change existing data.
ALTER TABLE table CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
However, the connection string from the Grails application contains useUnicode=true&characterEncoding=UTF-8. Does this mean that even though latin1_swedish_ci is used for a column, the actual value that has been saved is UTF-8 encoded?
And if the values are already UTF-8 encoded, is there no risk that the data will be affected by the change from latin1 to utf8mb4?
+--------------------------+-------------------+
| Variable_name            | Value             |
+--------------------------+-------------------+
| character_set_client     | utf8              |
| character_set_connection | utf8              |
| character_set_database   | latin1            |
| character_set_filesystem | binary            |
| character_set_results    | utf8              |
| character_set_server     | latin1            |
| character_set_system     | utf8              |
| collation_connection     | utf8_general_ci   |
| collation_database       | latin1_swedish_ci |
| collation_server         | latin1_swedish_ci |
+--------------------------+-------------------+
That's Ώπα? That's the interpretation of the bytes in UTF-8 (as the outside world calls it), utf8mb4 (MySQL's equivalent), or utf8 (MySQL's partial implementation of UTF-8).
It would not work well in latin1.
The encoding in the client and the encoding of a column in the database need not be the same. However, Greek in the client cannot be crammed into latin1 in the table, hence the error message.
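A minimal demonstration of that mismatch (table name and value are invented for illustration; in strict mode the INSERT fails, in non-strict mode you get ? characters plus a warning):

CREATE TABLE greek_demo (s VARCHAR(10) CHARACTER SET latin1);
SET NAMES utf8;                        -- client talks UTF-8
INSERT INTO greek_demo (s) VALUES ('Ώπα');
-- Error 1366 (Incorrect string value): Greek letters have no latin1
-- representation, so the client-to-column conversion cannot succeed.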
What ALTER TABLE table CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci; does is change all the text columns in that table to be utf8-encoded, converting the data from whatever encoding is currently used (presumably latin1). This is fine for Western European characters, all of which exist (with different byte encodings) in both latin1 and utf8.
To handle Emoji and some Chinese characters, you may as well go for utf8mb4:
ALTER TABLE table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;
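If you are unsure what the columns currently hold, a before-and-after check along these lines can help (value_col is a placeholder):

SHOW CREATE TABLE `table`;             -- shows column-level charset overrides
SELECT value_col, HEX(value_col) FROM `table` LIMIT 10;
-- latin1-stored Western European text shows one byte per character
-- (e.g. 'é' = E9); UTF-8 bytes stored by mistake show multi-byte
-- sequences instead (e.g. 'é' = C3A9).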
Related
I'm working with two MySQL servers, trying to understand why they behave differently.
I've created identical tables on each:
+----------------+------------+-------------------+
| Field          | Type       | Collation         |
+----------------+------------+-------------------+
| some_chars     | char(45)   | latin1_swedish_ci |
| some_text      | text       | latin1_swedish_ci |
+----------------+------------+-------------------+
and I've set identical character set variables:
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   |
| character_set_connection | utf8   |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | utf8   |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
When I insert UTF-8 characters into the database on one server, I get an error:
DatabaseError: 1366 (HY000): Incorrect string value: '\xE7\xBE\x8E\xE5\x9B\xBD...'
The same insertion on the other server throws no error. The table just silently accepts the UTF-8 insertion and renders a bunch of ? marks where the UTF-8 characters should be.
Why is the behavior of the two servers different?
What command were you executing when you got the error?
Your data is obviously utf8 (good).
Your connection apparently is utf8 (good).
Your table/column is declared CHARACTER SET latin1? It should be utf8.
That is 美 - Chinese, correct? Some Chinese characters need 4-byte utf8. So you should use utf8mb4 instead of utf8 in all 3 cases listed above.
Other notes:
There is no substantive difference in this area in 5.6 versus 5.7.
SQL_MODE is not relevant.
VARCHAR is usually advisable over CHAR.
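A sketch of what "utf8mb4 in all 3 cases" might look like (table and column names are illustrative):

CREATE TABLE demo (
  some_chars VARCHAR(45) CHARACTER SET utf8mb4,
  some_text  TEXT        CHARACTER SET utf8mb4
) DEFAULT CHARSET = utf8mb4;
SET NAMES utf8mb4;                     -- connection also declared utf8mb4
INSERT INTO demo (some_chars) VALUES ('美国');
-- the bytes from the error above (E7 BE 8E E5 9B BD) decode to 美国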
I have a collection of Ruby scripts that access my MySQL database. I need to modify the character set for this database, specifically to change the tables from latin1 to UTF-8. Do I need to modify my scripts at all? I've seen that I can set the character set for a connection; is this mandatory?
Part of the reason I suspect I do not need to make any adjustments is the settings the database already has. Looking at how character sets are currently set up:
mysql> SHOW VARIABLES LIKE "%char%";
+--------------------------+-------------------------------------------+
| Variable_name            | Value                                     |
+--------------------------+-------------------------------------------+
| character_set_client     | utf8                                      |
| character_set_connection | utf8                                      |
| character_set_database   | latin1                                    |
| character_set_filesystem | binary                                    |
| character_set_results    | utf8                                      |
| character_set_server     | latin1                                    |
| character_set_system     | utf8                                      |
| character_sets_dir       | /rdsdbbin/mysql-5.6.21.R1/share/charsets/ |
+--------------------------+-------------------------------------------+
Wouldn't this suggest that clients are already set up to use a UTF-8 character set?
MySQL manages encodings per client and per column. This means that every column in which text is stored has an encoding setting, and every connected client also has its own encoding setting. Text is converted on the fly between the two as necessary: if a client sends UTF-8 data to be stored in an SJIS column, MySQL makes the conversion automatically (and the reverse on the way back out).
So all that really matters is what encoding the client specifies when connecting to the database. If you're not specifying this explicitly in your Ruby code, you will get an implicit default setting. Changing the encoding of a MySQL column will not modify this default. As such: nothing to do.
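If you do want to pin the connection encoding explicitly rather than rely on the default, it boils down to one statement at the start of the session (roughly what driver encoding options issue on your behalf):

SET NAMES utf8;
-- Per the MySQL manual, this is shorthand for setting three session
-- variables: character_set_client, character_set_results, and
-- character_set_connection (which also resets collation_connection).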
I am setting up a new MySQL database, and going through the default settings that have been provided to me, I was surprised to find that it uses a mix of character sets and collations.
SHOW VARIABLES LIKE '%character_set%';
SHOW VARIABLES LIKE '%collation%';
produces:
+--------------------------+-------------------+
| character_set_client     | latin1            |
| character_set_connection | latin1            |
| character_set_database   | utf8              |
| character_set_server     | latin1            |
| character_set_system     | utf8              |
| collation_connection     | latin1_swedish_ci |
| collation_database       | utf8_general_ci   |
| collation_server         | latin1_swedish_ci |
+--------------------------+-------------------+
Is it risky to have a combination of latin1 and utf8 in these settings? I have always felt that potential problems are avoided by using the same character set (preferably UTF-8/Unicode) everywhere.
Yes, you can get into trouble mixing character sets, and it's a big pain to clean it up. The risk is that you'll declare a table with latin1 encoding, and some app will try to store utf8 codes in it. The result is a mess.
Read http://www.mysqlperformanceblog.com/2013/10/16/utf8-data-on-latin1-tables-converting-to-utf8-without-downtime-or-double-encoding/ for the proper way to get your data back under control if this happens. If it sounds kind of complicated and risky to you, then you understand it! :-/
But it's far better to avoid the mess in the first place. Just use utf8 everywhere from day one. If you need to support the full range of Asian languages, then use utf8mb4.
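For instance, a new database can get Unicode defaults from day one (the database name is a placeholder):

CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- tables created in mydb now inherit utf8mb4 unless they override it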
Exception: if you have strings that you know will never need to store international symbols, for example a string of hexadecimal digits, then you can declare the character set ascii for individual columns in a MySQL table.
But use utf8 as the default at the database and table level, and also for the MySQL connection and on up into the application, Apache, HTTP, etc.
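A sketch combining both points, assuming a table of hex digests (all names invented): utf8 as the table default, ascii for the hex-only column.

CREATE TABLE digests (
  id   INT PRIMARY KEY,
  md5  CHAR(32) CHARACTER SET ascii,   -- hex digits only, never international
  note VARCHAR(255)                    -- inherits the table default (utf8)
) DEFAULT CHARSET = utf8;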
I'm trying to switch my site over to UTF-8 completely, so I don't have to deal with utf8_encode() & utf8_decode() functions.
I have the collation of my tables set properly, and I'm temporarily using the query SET NAMES utf8 to override the my.cnf file.
My question is: there are a ton of character set and collation variables in my.cnf, and I suspect that some ought to be left alone... Which ones should I change to achieve the effect of SET NAMES utf8?
(The collation of my tables is utf8_unicode_ci.)
+--------------------------+-------------------+
| character_set_client     | latin1            |
| character_set_connection | latin1            |
| character_set_database   | latin1            |
| character_set_filesystem | binary            |
| character_set_results    | latin1            |
| character_set_server     | latin1            |
| character_set_system     | utf8              |
| collation_connection     | latin1_swedish_ci |
| collation_database       | latin1_swedish_ci |
| collation_server         | latin1_swedish_ci |
+--------------------------+-------------------+
Well, collation is primarily for sorting, so unless you're storing a language with specific sorting needs, utf8_unicode_ci should be fine.
The character_set_* values are used for all other string operations internally: value checks in places like WHERE clauses or IF/CASE statements, string functions like CHAR_LENGTH(), REPLACE(), SUBSTRING(), and that sort of stuff.
Generally speaking, they should all be the same (in this case, utf8) except for character_set_filesystem; I'd recommend keeping that at binary unless you have a specific need to change it.
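Concretely, SET NAMES utf8 is documented shorthand for three session assignments, so those are the variables your my.cnf needs to cover:

SET character_set_client = utf8;       -- how MySQL reads incoming statements
SET character_set_results = utf8;      -- how result sets are sent back
SET character_set_connection = utf8;   -- literals and comparisons; also
                                       -- resets collation_connection

In my.cnf, the corresponding options are typically character-set-server and collation-server under [mysqld], plus default-character-set under [client]; character_set_system is always utf8, and, as noted above, character_set_filesystem should stay binary.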
My setup:
MySQL 5.1
SHOW VARIABLES:
+--------------------------+-----------------------------------------------------+
| character_set_client     | utf8                                                |
| character_set_connection | utf8                                                |
| character_set_database   | utf8                                                |
| character_set_filesystem | binary                                              |
| character_set_results    | utf8                                                |
| character_set_server     | utf8                                                |
| character_set_system     | utf8                                                |
| character_sets_dir       | D:\Programme\MySQL\MySQL Server 5.1\share\charsets\ |
| collation_connection     | utf8_general_ci                                     |
| collation_database       | utf8_unicode_ci                                     |
| collation_server         | utf8_general_ci                                     |
+--------------------------+-----------------------------------------------------+
and even
| init_connect             | SET collation_connection = utf8_general_ci; SET NAMES utf8; |
The table *table* has character set utf8.
Tomcat 6.0.
The JDBC connector uses characterEncoding="utf8" and useUnicode="true".
Now, when I try
stmt.execute("UPDATE *table* SET *value*=\"ÿ\" WHERE ...)
it works, but for
stmt.execute("UPDATE *table* SET *value*=\"Ā\" WHERE ...)
I get an
java.sql.SQLException: Incorrect string value: '\xC4\x80' for column
'value' at row 1
Furthermore, it works for all characters up to ÿ, which can be encoded with a single byte, but as soon as two bytes are needed: bang!
Why is that so? And how can I get it to work?
After I added another two tables to check whether it was a MyISAM vs. InnoDB problem, it just worked on the new tables. Why?
In the new tables each column used the default charset, while in my existing tables the charset of each column was set to latin1. This happened because I copied the database from a non-UTF-8 MySQL instance and manually changed the table charset to UTF-8. BUT while copying, HeidiSQL had added a "CHARACTER SET latin1" to each column, which was not changed when I changed the table charset, AND it is not very easily visible in HeidiSQL that a column has an individual charset ...
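A way to spot and fix such hidden column-level charsets (the table name is a placeholder):

SHOW FULL COLUMNS FROM `table`;        -- the Collation column exposes
                                       -- per-column overrides like latin1
ALTER TABLE `table` CONVERT TO CHARACTER SET utf8;
-- CONVERT TO rewrites every text column, not just the table default,
-- so stray "CHARACTER SET latin1" attributes are removed as well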