Utf8 and latin1 - mysql

Half of the tables in a database which has alot of data are set to Latin 1.
Within these Latin 1 tables most rows are set to utf8 if the row is expecting text input (anything not a integer).
Everything is in English .
How bad is my situation if I need to convert these Latin 1 tables to utf8?

First, a clarification: You said "most rows are set to utf8"; I assume you meant "most columns are set to utf8"?
The meaning of the latin1 on the table is just a default. It has no impact on performance, etc.
The only "harm" occurs if you do ALTER TABLE .. ADD COLUMN .. without specifying CHARACTER SET utf8.
You say that all the text is English? Then there is no encoding difference between latin1 and utf8. Problems can occur when you have accented letters, etc.
There is one performance issue: If you JOIN two tables together on a VARCHAR column, but the CHARACTER SET or COLLATION differs for that column in the two tables, it will be slower than if those settings are the same. (It does not sound like you have this problem.) Again, note that the table's default is not relevant, only the column settings themselves.
Yes, it would be 'cleaner' to have the table default set to utf8. This should be a way to do it without dumping, but by changing the tables one at a time:
ALTER TABLE t CONVERT TO CHARACTER SET utf8;
That will change the table default and any columns that are not already utf8.

mysqldump --add-drop-table database_to_correct | replace CHARSET=latin1 CHARSET=utf8 | iconv -f latin1 -t utf8 | mysql database_to_correct

Related

Incorrect string value error for unconventional characters

So I'm using a wrapper to fetch user data from instagram. I want to select the display names for users, and store them in a MYSQL database. I'm having issues inserting some of the display names, dealing with, specifically, an incorrect string value error:
Now, I've dealt with this issue before with accent marks, letters with umlauts, etc. The solution would be to change the collation to utf8_general_ci under the utf8 charset.
So as you can see, some of the display names I'm pulling have very unique characters that I'm not sure mySQL can recognize at all, i.e.:
แ›˜๐•ฐ๐–†๐–—๐–™๐– ๐•พ๐–•๐–Ž๐–—๐–Ž๐–™๐–š๐–˜๐‚‚ยฎ
So I receive:
Error Code: 1366. Incorrect string value: '\xF0\x9D\x99\x87\xF0\x9D...' for column 'dummy' at row 1
Here's my sql code
CREATE TABLE test_table(
id INT AUTO_INCREMENT,
dummy VARCHAR(255),
PRIMARY KEY(id)
);
INSERT INTO test_table (dummy)
VALUES ('แ›˜๐•ฐ๐–†๐–—๐–™๐– ๐•พ๐–•๐–Ž๐–—๐–Ž๐–™๐–š๐–˜๐‚‚ยฎ');
Any thoughts on a proper charset + collation pair that can handle characters like this? Not sure where to look for a solution, so I come here to see if anyone dealt with this.
P.S., I've tried utf8mb4 charset with utf8mb4_unicode_ci and utf8mb4_bin collations as well.
The characters you show require the column use the utf8mb4 encoding. Currently it seems your column is defined with the utf8mb3 encoding.
The way MySQL uses the name "utf8" is complicated, as described in https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html:
Note
Historically, MySQL has used utf8 as an alias for utf8mb3;
beginning with MySQL 8.0.28, utf8mb3 is used exclusively in the output
of SHOW statements and in Information Schema tables when this
character set is meant.
At some point in the future utf8 is expected to become a reference to
utf8mb4. To avoid ambiguity about the meaning of utf8, consider
specifying utf8mb4 explicitly for character set references instead of
utf8.
You should also be aware that the utf8mb3 character set is deprecated
and you should expect it to be removed in a future MySQL release.
Please use utf8mb4 instead.
You may have tried to change your table in the following way:
ALTER TABLE test_table CHARSET=utf8mb4;
But that only changes the default character set, to be used if you add new columns to the table subsequently. It does not change any of the current columns. To do that:
ALTER TABLE test_table MODIFY COLUMN dummy VARCHAR(255) CHARACTER SET utf8mb4;
Or to convert all string or TEXT columns in a table in one statement:
ALTER TABLE test_table CONVERT TO CHARACTER SET utf8mb4;
That would be ๐™‡ - L MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL L
It requires the utf8mb4 Character set to even represent it. "F0" is the clue; it is the first of 4 bytes in a 4-byte UTF-8 character. It cannot be represented in MySQL's "utf8". Collation is (mostly) irrelevant.
Most, not all, of the characters in แ›˜๐•ฐ๐–†๐–—๐–™๐– ๐•พ๐–•๐–Ž๐–—๐–Ž๐–™๐–š๐–˜๐‚‚ยฎ also need utf8mb4. They are "MATHEMATICAL BOLD FRAKTUR" letters.
(Meanwhile, Bill gives you more of an answer.)

Database conversion from latin1 to utf8mb4, what about indexes?

I noticed that my MODX database still uses latin1 character set in the database and in its tables. I would like to convert them to utf8mb4 and update collations accordingly.
Not totally sure how I should do this. Is this correct?
I alter every table to use utf8mb4 and utf8_unicode_ci?
I update the default character set and collation of the database.
Are indexes updated automatically? Is there something else I should be aware of?
A bonus question: what would be the most suitable latest utf8_unicode collation? Western languages should work.
Changing the default character sets of a table or a schema does not change the data in the column itself, it only changes the default to apply the next time you add a table or add a column to a table.
To convert current data, alter one table at a time:
ALTER TABLE <name> CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
The collation utf8mb4_0900_ai_ci is faster than earlier collations (at least according to the documentation), and it's the most current and accurate. This collation requires MySQL 8.0.
The most current collation in MySQL 5.7 is utf8_unicode_520_ci.
A table-conversion like this rebuilds the indexes, so there's nothing else you need to do.

MySQL: data being mangled while changing column to UTF8

I am migrating a MySQL database that was created in LATIN1 to UTF8. To do that, I am first changing each column to the corresponding binary type and then to UTF8:
ALTER TABLE clientes CHARACTER SET utf8;
ALTER TABLE clientes change nombre nombre varbinary(255);
ALTER TABLE clientes change nombre nombre varchar(255) character set utf8;
Since, according to all docs, that's the right way to prevent data from being mangled...
...However, the data still gets mangled. I'll just put two examples:
The word "Larrasoaรฑa" gets truncated at the "รฑ", with the error: Warning: #1366 Incorrect string value: '\xF1a' for column 'nombre'
The word "Jesรบs y Marรญa" gets truncated at the "รบ", with the error Warning: #1366 Incorrect string value: '\xFAs y M...' for column 'nombre'
How did that data get in? Well, the DB is the backend to a PHP web app, which used UTF8 for everything (including connecting to the MySQL server with "SET NAMES UTF8")... except creating the database properly. So I assume all the data that was added in was in UTF8.
To summarize: it seems that I had UTF8 text stored in LATIN1 columns, and now that I try to change the columns to UTF8, the text gets truncated.
Why is this happening? What can I do?
EDIT: forgot to mention, I'm doing all of this from PhpMyAdmin, since I don't have command line access.
F1 and FA are latin1 encodings. You need to tell MySQL that the data is latin1. One way is via SET NAMES latin1.
But note... That is independent of the setting for the column you are trying to store the data into. And, these days, utf8mb4 is the preferred setting for text. MySQL will convert between the column's encoding and the client's encoding. But you must tell it the client's encoding via connection parameters (or SET NAMES).
The pair of ALTER TABLEs works for certain situations, not all situations! You probably wanted the first entry in http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
Table is CHARACTER SET latin1 and correctly encoded in latin1; want
utf8mb4:
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;
I don't happen to know if your data is irreparably hosed. Please provide one of the lines, together with HEX.
Hex
"Larrasoaรฑa" is encoded as 4C61727261736F61F161, and "Jesรบs y Marรญa" as 4A6573FA732079204D6172ED6120
Those are latin1-encoded (or latin5 or dec8). If the table definition (SHOW CREATE TABLE) says latin1, then you could leave it alone. (latin1 handles Western European languages, but not Asian.)
If you want to convert all the text columns to utf8 or utf8mb4, do an ALTER like the one I presented above. Your 3-Alter approach will not work correctly; it assumes the bytes in the latin1 column are really UTF-8 bytes (which they aren't).
But... You must specify the client's encoding based on what the client wants. And it does not matter whether the client and the table agree since conversion will be provided.
Why the 3-step Alter fails
ALTER TABLE clientes CHARACTER SET utf8; -- This sets the default charset for new columns. It has no effect on the existing column definitions and any data in those columns.
ALTER TABLE clientes change nombre nombre varbinary(255); -- This says "forget about any text encoding". That is F1 is now just a bunch of bits, not the latin1 representation for รฑ.
ALTER TABLE clientes change nombre nombre varchar(255) character set utf8; -- This takes those varbinary bits and says "let's treat them as utf8. And that gives the error message because F1 is not a valid encoding for utf8.
That procedure is appropriate if the bytes are already utf8 bytes. That is, if it were already the 2-byte C3B1 for รฑ. (By the way, this usually manifests itself as 'Mojibake', displaying as รƒยฑ when interpreted as latin1.)
The 1-Alter procedure...
ALTER TABLE clientes CONVERT TO CHARACTER SET utf8; (to convert the entire table) or ALTER TABLE clientes MODIFY nombre varchar(255) character set utf8; (to convert just one column). They do the following things:
For each text (char/varchar/text) column, it reads the data according to its current encoding (latin1, F1), converts it to utf8 (or utf8mb4) (C3B1) and writes back into the row. Meanwhile, it has changed the declaration to be CHARACTER SET utf8.
That is, it is the 'right' process for changing the CHARACTER SET without changing the "text". True, the encoding changed (F1 -> C3B1), but that is in keeping with the change to the CHARACTER SET.
Recovery
Your first 2 ALTERs worked, correct? Did the 3rd one succeed, fail, or leave a messed up table?
If it aborted, leaving varbinary in place, then do 2 more alters: First go back to latin1; then go straight to utf8.
If it left you with a messed up column, especially if rows are truncated, then you need to go back to a backup, or otherwise reload the data.

Are there any downside effects when changing mysql table encoding?

I have a table encoded in latin1 and collate latin1_bin.
In my table there is a column comments of type 'TEXT', as you know this column inherits table's encoding and collation, but from now on I should change it to be utf8 and utf8_general_ci because I'mย starting to store special characters in comments.
Would it cause any downside effect if I'd use a command like the following?
alter table notebooks modify comments text CHARACTER SET utf8 COLLATE utf8_general_ci;
Thank you for your answer.
Danger I think that that ALTER will destroy existing text.
Also, ... Your 'name' looks Chinese, so I would guess that you want to store Chinese characters? In that case, you should use utf8mb4, not just utf8. This is because some of the Chinese characters take 4 bytes (and are not in the Unicode BMP).
I believe you need 2 steps:
ALTER TABLE notebooks MODIFY comments BLOB;
ALTER TABLE notebooks MODIFY comments TEXT
CHARACTER SET utf8mb4 COLLATE utf8mb4_general_520_ci;
Otherwise the latin1 characters will be "converted" to ut8. But if you really have Chinese in the column, you do not have latin1. The 2-step alter, above, does (1) turn off any knowledge of character set, and (2) establish that the bytes are really utf8mb4-encoded.
To be safer, first do
RENAME TABLE notebooks TO old;
CREATE TABLE notebooks LIKE old;
INSERT INTO notebooks SELECT * FROM old;
Then do the two ALTERs and test the result. If there is trouble, you can RENAME to get back the old copy.
Specifying any collating sequence that does not involve direct integral comparison of a NATIVE character set will slow down your query. Whether it will slow it down noticeably is another issue. Looking up the ranking of this, and the ranking of that, in a table and comparing the two results is much, MUCH faster than retrieving on-disk information from a database, wouldn't you imagine?

How to change the default collation of a table?

create table check2(f1 varchar(20),f2 varchar(20));
creates a table with the default collation latin1_general_ci;
alter table check2 collate latin1_general_cs;
show full columns from check2;
shows the individual collation of the columns as 'latin1_general_ci'.
Then what is the effect of the alter table command?
To change the default character set and collation of a table including those of existing columns (note the convert to clause):
alter table <some_table> convert to character set utf8mb4 collate utf8mb4_unicode_ci;
Edited the answer, thanks to the prompting of some comments:
Should avoid recommending utf8. It's almost never what you want, and often leads to unexpected messes. The utf8 character set is not fully compatible with UTF-8. The utf8mb4 character set is what you want if you want UTF-8. โ€“ Rich Remer Mar 28 '18 at 23:41
and
That seems quite important, glad I read the comments and thanks #RichRemer . Nikki , I think you should edit that in your answer considering how many views this gets. See here https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html and here What is the difference between utf8mb4 and utf8 charsets in MySQL? โ€“ Paulpro Mar 12 at 17:46
MySQL has 4 levels of collation: server, database, table, column.
If you change the collation of the server, database or table, you don't change the setting for each column, but you change the default collations.
E.g if you change the default collation of a database, each new table you create in that database will use that collation, and if you change the default collation of a table, each column you create in that table will get that collation.
It sets the default collation for the table; if you create a new column, that should be collated with latin_general_ci -- I think. Try specifying the collation for the individual column and see if that works. MySQL has some really bizarre behavior in regards to the way it handles this.
may need to change the SCHEMA not only table
ALTER SCHEMA `<database name>` DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci ;
as Rich said - utf8mb4
(mariaDB 10)