Changing character set in MySQL - mysql

When I started my website I set all the table columns to "utf8_general_ci", thinking this would make everything be stored in UTF-8.
According to the mysql_client_encoding() function in PHP, I've been using latin1 for my connection all along.
I understand this isn't a new problem. My question is: how do I correctly update my database so that it is utf8, without affecting the data that already exists in my tables?
There are a bunch of answers on StackOverflow, but a lot of them I find vague. A couple of the more helpful ones were:
Query all data and update as UTF8 https://stackoverflow.com/a/2335254/158126
Use a script built to convert tables https://stackoverflow.com/a/13400547/158126
In your experience, what have you done to fix this issue and retain all user data in the MySQL tables?

For your situation, I'd suggest trying the following for each bad column (connected over a utf8 connection):
-- create a new column to store the latin1 bytes
alter table <table> add <column-latin1> varchar(<length>) character set latin1;
-- copy the utf8 data into the latin1 column without charset conversion
update <table> set <column-latin1> = binary <column-utf8>;
-- verify the latin1 data looks OK
select <column-latin1> from <table>;
-- copy the latin1 column back into the utf8 column WITH charset conversion
update <table> set <column-utf8> = <column-latin1>;
-- verify the new, properly encoded UTF8 data looks OK
select <column-utf8> from <table>;
-- remove the temporary column
alter table <table> drop column <column-latin1>;
And set your clients to use a UTF8 connection.
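For example, a quick way to check and set the connection character set from the SQL side (utf8mb4 is assumed here as the target; in PHP you would normally use mysqli_set_charset() or the charset option of the PDO DSN rather than a raw SET NAMES):
-- see what character sets the client/connection/results are currently using
SHOW VARIABLES LIKE 'character_set%';
-- declare that this client sends and expects UTF-8
SET NAMES utf8mb4;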

Related

MySQL: data being mangled while changing column to UTF8

I am migrating a MySQL database that was created in LATIN1 to UTF8. To do that, I am first changing each column to the corresponding binary type and then to UTF8:
ALTER TABLE clientes CHARACTER SET utf8;
ALTER TABLE clientes change nombre nombre varbinary(255);
ALTER TABLE clientes change nombre nombre varchar(255) character set utf8;
Since, according to all docs, that's the right way to prevent data from being mangled...
...However, the data still gets mangled. I'll just put two examples:
The word "Larrasoaña" gets truncated at the "ñ", with the error: Warning: #1366 Incorrect string value: '\xF1a' for column 'nombre'
The word "Jesús y María" gets truncated at the "ú", with the error Warning: #1366 Incorrect string value: '\xFAs y M...' for column 'nombre'
How did that data get in? Well, the DB is the backend to a PHP web app, which used UTF8 for everything (including connecting to the MySQL server with "SET NAMES UTF8")... except creating the database properly. So I assume all the data that was added in was in UTF8.
To summarize: it seems that I had UTF8 text stored in LATIN1 columns, and now that I try to change the columns to UTF8, the text gets truncated.
Why is this happening? What can I do?
EDIT: forgot to mention, I'm doing all of this from PhpMyAdmin, since I don't have command line access.
F1 and FA are the latin1 encodings of ñ and ú. You need to tell MySQL that the data is latin1. One way is via SET NAMES latin1.
But note... That is independent of the setting for the column you are trying to store the data into. And, these days, utf8mb4 is the preferred setting for text. MySQL will convert between the column's encoding and the client's encoding. But you must tell it the client's encoding via connection parameters (or SET NAMES).
The pair of ALTER TABLEs works for certain situations, not all situations! You probably wanted the first entry in http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
Table is CHARACTER SET latin1 and correctly encoded in latin1; want utf8mb4:
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;
I don't happen to know if your data is irreparably hosed. Please provide one of the lines, together with HEX.
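(A hypothetical way to pull one row out together with its hex, using the table and column names from the question:)
SELECT nombre, HEX(nombre) FROM clientes WHERE nombre LIKE 'Larra%' LIMIT 1;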
Hex
"Larrasoaña" is encoded as 4C61727261736F61F161, and "Jesús y María" as 4A6573FA732079204D6172ED6120
Those are latin1-encoded (or latin5 or dec8). If the table definition (SHOW CREATE TABLE) says latin1, then you could leave it alone. (latin1 handles Western European languages, but not Asian.)
If you want to convert all the text columns to utf8 or utf8mb4, do an ALTER like the one I presented above. Your 3-Alter approach will not work correctly; it assumes the bytes in the latin1 column are really UTF-8 bytes (which they aren't).
But... You must specify the client's encoding based on what the client wants. And it does not matter whether the client and the table agree since conversion will be provided.
Why the 3-step Alter fails
ALTER TABLE clientes CHARACTER SET utf8; -- This sets the default charset for new columns. It has no effect on the existing column definitions and any data in those columns.
ALTER TABLE clientes change nombre nombre varbinary(255); -- This says "forget about any text encoding". That is, F1 is now just a bunch of bits, not the latin1 representation of ñ.
ALTER TABLE clientes change nombre nombre varchar(255) character set utf8; -- This takes those varbinary bits and says "let's treat them as utf8". And that gives the error message, because F1 is not a valid utf8 encoding.
That procedure is appropriate if the bytes are already utf8 bytes. That is, if it were already the 2-byte C3B1 for ñ. (By the way, this usually manifests itself as 'Mojibake', displaying as Ã± when the bytes are interpreted as latin1.)
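For illustration, one way to see the difference directly in SQL, using UNHEX so the bytes are unambiguous:
SELECT CONVERT(UNHEX('F1') USING latin1);     -- ñ  (latin1 byte F1)
SELECT CONVERT(UNHEX('C3B1') USING utf8mb4);  -- ñ  (UTF-8 bytes C3 B1)
SELECT CONVERT(UNHEX('C3B1') USING latin1);   -- Ã± (UTF-8 bytes misread as latin1: Mojibake)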
The 1-Alter procedure...
ALTER TABLE clientes CONVERT TO CHARACTER SET utf8; (to convert the entire table) or ALTER TABLE clientes MODIFY nombre varchar(255) character set utf8; (to convert just one column). They do the following things:
For each text (char/varchar/text) column, it reads the data according to its current encoding (latin1, F1), converts it to utf8 (or utf8mb4) (C3B1) and writes back into the row. Meanwhile, it has changed the declaration to be CHARACTER SET utf8.
That is, it is the 'right' process for changing the CHARACTER SET without changing the "text". True, the encoding changed (F1 -> C3B1), but that is in keeping with the change to the CHARACTER SET.
Recovery
Your first 2 ALTERs worked, correct? Did the 3rd one succeed, fail, or leave a messed up table?
If it aborted, leaving varbinary in place, then do 2 more alters: First go back to latin1; then go straight to utf8.
If it left you with a messed up column, especially if rows are truncated, then you need to go back to a backup, or otherwise reload the data.
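For the first case (the ALTER aborted and nombre is still varbinary), a sketch of those two ALTERs, assuming the original length of 255 and utf8mb4 as the target:
ALTER TABLE clientes MODIFY nombre VARCHAR(255) CHARACTER SET latin1;
ALTER TABLE clientes MODIFY nombre VARCHAR(255) CHARACTER SET utf8mb4;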

"Convert to character set" doesn't convert tables having only integer columns to specified character set

I am working on 2 servers, each with similar configurations, including the MySQL variables specific to character set and collation; both are running MySQL server and client 5.6.x. By default all tables are in latin1, including tables with only integer columns. But when I run
ALTER TABLE `table_name` CONVERT TO CHARACTER SET `utf8` COLLATE `utf8_unicode_ci`
for all tables on each server, only one of the servers converts all tables to utf8.
What I already tried:
Converted the default database character set (character_set_database) to utf8 before running the above listed command
Solution already worked for me (but still unsure why it worked)
ALTER TABLE `table_name` CHARACTER SET = `utf8` COLLATE `utf8_unicode_ci`
Finally there are 2 questions:
Why is CONVERT TO CHARACTER SET working on one server and not on the other?
Why did the second statement work for me? It is similar to CONVERT TO CHARACTER SET, with the one difference I have come across: it doesn't implicitly convert all the columns to the specified character set.
Can someone please help me understand what is happening?
Thank you in advance.
IIRC, that was a bug that eventually was fixed. See bugs.mysql.com . (The bug probably existed since version 4.1, when CHARACTER SETs were really added.)
I prefer to be explicit in two places, thereby avoiding the issue you raise:
When doing CREATE TABLE, I explicitly say what CHARACTER SET I need. This avoids depending on the default established when the database was created, perhaps years ago.
When adding a column (ALTER TABLE ADD COLUMN ...), I check (via SHOW CREATE TABLE) to see if the table already has the desired charset. Even so, I might explicitly state CHARACTER SET for the column. Again, I don't trust the history of the table.
Note: I am performing these queries from explicit SQL, not from some UI that might be "helping" me.
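A sketch of what "being explicit" looks like (the table and column names are made up):
CREATE TABLE t (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci
) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;

ALTER TABLE t ADD COLUMN notes VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;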
Follow on
#HBK found http://bugs.mysql.com/bug.php?id=73153 . From it, I suspect this is what 'should be' done by the user:
ALTER TABLE ...
CONVERT TO ...
DEFAULT CHARACTER SET ...; -- Do this also
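Spelled out against the statement from the question (written as two ALTERs so it works even where combining the clauses is rejected):
ALTER TABLE `table_name` CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE `table_name` DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;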

Are there any downside effects when changing mysql table encoding?

I have a table encoded in latin1 and collate latin1_bin.
In my table there is a column comments of type TEXT. As you know, this column inherits the table's encoding and collation, but from now on I need to change it to utf8 and utf8_general_ci because I'm starting to store special characters in comments.
Would it cause any downside effect if I'd use a command like the following?
alter table notebooks modify comments text CHARACTER SET utf8 COLLATE utf8_general_ci;
Thank you for your answer.
Danger: I think that ALTER will destroy existing text.
Also, ... Your 'name' looks Chinese, so I would guess that you want to store Chinese characters? In that case, you should use utf8mb4, not just utf8. This is because some of the Chinese characters take 4 bytes (and are not in the Unicode BMP).
I believe you need 2 steps:
ALTER TABLE notebooks MODIFY comments BLOB;
ALTER TABLE notebooks MODIFY comments TEXT
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;
Otherwise the latin1 characters will be "converted" to utf8. But if you really have Chinese in the column, you do not have latin1. The 2-step alter, above, (1) turns off any knowledge of character set, and (2) establishes that the bytes are really utf8mb4-encoded.
To be safer, first do
RENAME TABLE notebooks TO old;
CREATE TABLE notebooks LIKE old;
INSERT INTO notebooks SELECT * FROM old;
Then do the two ALTERs and test the result. If there is trouble, you can RENAME to get back the old copy.
Specifying any collation that is not a direct binary comparison in the column's native character set will slow down comparisons. Whether it slows them down noticeably is another issue: looking up the rankings of two strings in a collation table and comparing the results is still much, MUCH faster than retrieving on-disk data from the database.
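A tiny illustration of that difference (the _latin1 introducers just pin down the character set of the literals):
SELECT _latin1'A' = _latin1'a' COLLATE latin1_bin;         -- 0: raw byte comparison
SELECT _latin1'A' = _latin1'a' COLLATE latin1_general_ci;  -- 1: ranking lookup, case-insensitive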

Changing character set of MySQL tables

I have a MySQL table CHINESE with DEFAULT CHARSET=latin1 and it has a column NAME with CHARACTER SET latin1. I have a huge amount of data stored in this table, around a million rows. And I want to execute the following commands on my database:
ALTER DATABASE <DATABASE_NAME> DEFAULT CHARACTER SET utf8
ALTER TABLE CHINESE DEFAULT CHARACTER SET utf8
ALTER TABLE CHINESE MODIFY NAME VARCHAR(30) CHARACTER SET utf8
Considering the fact that I have a huge amount of data stored in this database, should I run these commands? Will these commands lock the database in any way?
I am using Java to query and insert values in database. Will appending ?useUnicode=yes&characterEncoding=UTF-8 in URI string help me?
It will take a long time.
I think it is best to export a .sql dump, do the trans-coding there (replace latin1 with utf8), then import it back into a temp table. Finally, swap the names of the two tables.
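The final swap can be done atomically with RENAME TABLE; a sketch using the table from the question and hypothetical names for the temp and old copies:
RENAME TABLE CHINESE TO CHINESE_old, CHINESE_tmp TO CHINESE;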

phpmyadmin MySQL data stored as ascii values in certain table

My host has just updated phpMyAdmin to version 4.0.3. I don't know if it is related to the following problem.
I have a table 'users' which stores user data for the site, and all data is now being stored as numbers. Where I had a username of 'rich', it is now '72696368', which is its ASCII code (in hex).
Any ideas why this might have happened? I have a lot of tables and have checked them all, it is only the users table that has been modified. It is not critical as I can still log in and accept new users etc but I would like to know why this is happening.
Thanks a lot
EDIT: The collation is utf8_general_ci
Try adding this snippet to the script you are running, on the line in which you create the schema:
DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci
For example:
CREATE SCHEMA IF NOT EXISTS `my_schema` DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;