Why does changing the table collation in MySQL break with a unique index violation? - mysql

Problem:
Let's say that table customers (id, name, email, .. ) is encoded using utf-8 (utf8_general_ci collation).
This table also has a unique key constraint on column email.
When trying to convert the table to utf8mb4 using
ALTER TABLE customers CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
some entries cause the unique index constraint to flare up, because they are considered duplicates.
Example:
row1: (1, "strauss", "strauss#email.com")
row2: (10, "Strauss", "strauß#email.com")
The same happens if two emails differ only by a zero-width space character.
Tested with MySQL version 5.7.20.

In German, ß is considered basically equal to ss.
While the xxx_unicode_ci collations respect that, the xxx_general_ci collations do not, so you switched from a collation that considers "strauss#email.com" not the same as "strauß#email.com" to one that does.
(xxx_general_ci is not entirely literal about it either: it treats ß and a single s the same, so it would complain about "straus#email.com" versus "strauß#email.com" instead.)
See the documentation:
_general_ci Versus _unicode_ci Collations
For any Unicode character set, operations performed using the xxx_general_ci collation are faster than those for the xxx_unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, ß is equal to ss in German and some other languages. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
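You can see the difference directly, and also list the rows that will collide before running the conversion. A minimal sketch, assuming a utf8-encoded connection and the customers table from the question:

SELECT 'strauss' = 'strauß' COLLATE utf8_general_ci;  -- 0: general_ci maps ß to s, not ss
SELECT 'strauss' = 'strauß' COLLATE utf8_unicode_ci;  -- 1: unicode_ci expands ß to ss

-- Pre-flight check: emails that become duplicates under the target collation
SELECT CONVERT(email USING utf8mb4) COLLATE utf8mb4_unicode_ci AS normalized,
       COUNT(*) AS n
FROM customers
GROUP BY normalized
HAVING n > 1;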

Related

Incorrect string value error for unconventional characters

So I'm using a wrapper to fetch user data from Instagram. I want to select the display names for users and store them in a MySQL database. I'm having issues inserting some of the display names; specifically, I'm getting an incorrect string value error:
Now, I've dealt with this issue before with accent marks, letters with umlauts, etc. The solution would be to change the collation to utf8_general_ci under the utf8 charset.
So as you can see, some of the display names I'm pulling have very unusual characters that I'm not sure MySQL can recognize at all, e.g.:
ᛘ𝕰𝖆𝖗𝖙𝖍 𝕾𝖕𝖎𝖗𝖎𝖙𝖚𝖘𐂂®
So I receive:
Error Code: 1366. Incorrect string value: '\xF0\x9D\x99\x87\xF0\x9D...' for column 'dummy' at row 1
Here's my SQL code:
CREATE TABLE test_table (
    id INT AUTO_INCREMENT,
    dummy VARCHAR(255),
    PRIMARY KEY (id)
);
INSERT INTO test_table (dummy)
VALUES ('ᛘ𝕰𝖆𝖗𝖙𝖍 𝕾𝖕𝖎𝖗𝖎𝖙𝖚𝖘𐂂®');
Any thoughts on a proper charset + collation pair that can handle characters like this? Not sure where to look for a solution, so I come here to see if anyone has dealt with this.
P.S., I've tried utf8mb4 charset with utf8mb4_unicode_ci and utf8mb4_bin collations as well.
The characters you show require the column to use the utf8mb4 encoding. Currently it seems your column is defined with the utf8mb3 encoding.
The way MySQL uses the name "utf8" is complicated, as described in https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html:
Note

Historically, MySQL has used utf8 as an alias for utf8mb3; beginning with MySQL 8.0.28, utf8mb3 is used exclusively in the output of SHOW statements and in Information Schema tables when this character set is meant.

At some point in the future utf8 is expected to become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.

You should also be aware that the utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead.
You may have tried to change your table in the following way:
ALTER TABLE test_table CHARSET=utf8mb4;
But that only changes the default character set, to be used if you add new columns to the table subsequently. It does not change any of the current columns. To do that:
ALTER TABLE test_table MODIFY COLUMN dummy VARCHAR(255) CHARACTER SET utf8mb4;
Or to convert all string or TEXT columns in a table in one statement:
ALTER TABLE test_table CONVERT TO CHARACTER SET utf8mb4;
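A quick way to confirm what the conversion actually did is to inspect the per-column character sets:

SHOW FULL COLUMNS FROM test_table;
-- or query the data dictionary directly:
SELECT column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = DATABASE() AND table_name = 'test_table';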
That would be 𝙇, MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL L.
It requires the utf8mb4 character set even to represent it. "F0" is the clue: it is the first of 4 bytes in a 4-byte UTF-8 character, which cannot be represented in MySQL's "utf8". Collation is (mostly) irrelevant here.
Most, not all, of the characters in ᛘ𝕰𝖆𝖗𝖙𝖍 𝕾𝖕𝖎𝖗𝖎𝖙𝖚𝖘𐂂® also need utf8mb4. They are "MATHEMATICAL BOLD FRAKTUR" letters.
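You can see the 4-byte encoding directly; a quick check, assuming a utf8mb4 connection:

SELECT HEX('𝙇');  -- F09D9987: four bytes, beginning with F0
SELECT HEX('L');  -- 4C: the ordinary one-byte LATIN CAPITAL LETTER L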
(Meanwhile, Bill gives you more of an answer.)

Utf8 and latin1

Half of the tables in a database which has a lot of data are set to latin1.
Within these latin1 tables, most rows are set to utf8 if the row is expecting text input (anything that is not an integer).
Everything is in English.
How bad is my situation if I need to convert these Latin 1 tables to utf8?
First, a clarification: You said "most rows are set to utf8"; I assume you meant "most columns are set to utf8"?
The meaning of the latin1 on the table is just a default. It has no impact on performance, etc.
The only "harm" occurs if you do ALTER TABLE .. ADD COLUMN .. without specifying CHARACTER SET utf8.
You say that all the text is English? Then there is no encoding difference between latin1 and utf8. Problems can occur when you have accented letters, etc.
There is one performance issue: If you JOIN two tables together on a VARCHAR column, but the CHARACTER SET or COLLATION differs for that column in the two tables, it will be slower than if those settings are the same. (It does not sound like you have this problem.) Again, note that the table's default is not relevant, only the column settings themselves.
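To illustrate that JOIN caveat, a sketch with a hypothetical pair of tables (not from the question):

CREATE TABLE a (name VARCHAR(50) CHARACTER SET latin1, KEY (name));
CREATE TABLE b (name VARCHAR(50) CHARACTER SET utf8, KEY (name));
-- MySQL must convert the latin1 side to utf8 row by row to compare the values,
-- so the index on a.name cannot be used to drive this join:
EXPLAIN SELECT * FROM a JOIN b ON a.name = b.name;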
Yes, it would be 'cleaner' to have the table default set to utf8. Here is a way to do it without dumping, by changing the tables one at a time:
ALTER TABLE t CONVERT TO CHARACTER SET utf8;
That will change the table default and any columns that are not already utf8.
Alternatively, a dump-and-reload one-liner that rewrites the table definitions and re-encodes the data in one pass:
mysqldump --add-drop-table database_to_correct | replace CHARSET=latin1 CHARSET=utf8 | iconv -f latin1 -t utf8 | mysql database_to_correct

mysql character set utf-8 collation (dup key) not working properly

My database table's character set and collation are set to utf-8 and utf8_general_ci, respectively. I inserted a record with the value 'säî kîråñ' in a varchar column that has a unique constraint. When I try to insert 'sai kiran', I get a duplicate-entry error referring to the previously inserted row 'säî kîråñ'.
As you can see, the characters in the two strings are completely different in the utf8 character set, so I couldn't understand why it reports a duplicate entry.
I tried changing the collation to utf8_unicode_ci, but it made no difference. I also tried inserting directly in phpMyAdmin to rule out programming-language encoding problems, and the problem persists.
With utf8_general_ci I thought only 'a' and 'A' would be treated as the same, but from here I found that even 'a' and 'ä' are treated as the same. So utf8_bin is the solution for treating them as different. Good learning, though.
Because MySQL lacks a utf8_..._ci_as collation (case-insensitive but accent-sensitive), you should probably change the collation of (at least) that column to utf8_bin. Then, instead of all of these being equal, S=s=Ş=ş=Š=Š=š=š, they will all be different. Note, especially, that s and S will be distinct in utf8_bin.
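A minimal sketch of that change, assuming the unique column is called name on a table called mytable (neither name is given in the question):

ALTER TABLE mytable MODIFY COLUMN name VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin;
-- 'säî kîråñ' and 'sai kiran' no longer collide on the unique key,
-- but note that 'Sai Kiran' and 'sai kiran' now count as different values too.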

MYSQL COLLATE Performance

I have a table called users with a column firstname and with collation utf8_bin.
I want to know what happens under the hood when I execute a query like
SELECT * FROM `users` `a` WHERE `a`.`firstname` = 'sander' COLLATE utf8_general_ci
The column firstname isn't indexed; what happens to performance when this query is executed?
And what if the default collation were utf8_general_ci and the query were executed without COLLATE?
I want to know the impact on a big table (8 million+ records).
In this case, since the forced collation is defined over the same character set as the column's encoding, there won't be any performance impact.
However, if one forced a collation that is defined over a different character set, MySQL may have to transcode the column's values (which would have a performance impact); I think MySQL will do this automatically if the forced collation is over the Unicode character set, and any other situation will raise an "illegal mix of collations" error.
Note that the collation recorded against a column's definition is merely a hint to MySQL over which collation is preferred; it may or may not be used in a given expression, depending on the rules detailed under Collation of Expressions.
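To make that concrete, a sketch assuming a utf8 connection and the users table from the question:

-- An explicit COLLATE clause has the highest coercibility precedence, so the
-- comparison uses utf8_general_ci even though the column is declared utf8_bin:
SELECT * FROM users WHERE firstname = 'sander' COLLATE utf8_general_ci;
-- Forcing a collation from a different character set is rejected outright:
-- SELECT * FROM users WHERE firstname = 'sander' COLLATE latin1_general_ci;
-- ERROR 1253 (42000): COLLATION 'latin1_general_ci' is not valid for CHARACTER SET 'utf8'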

mysql utf8 encoding and unique keys

I am importing data from a user table (many users from many sites):
myisam default collation latin1_swedish....
Importing that data into an InnoDB table with utf8_general collation.
I have placed a unique key on the (username, site_id) combination, but this is failing for two users of the same site:
user 1 dranfog,
user 2 drånfog
If I run:
SELECT IF('å' = 'a', 'yep', 'nope');
directly on the target db with utf8 encoding, I get 'yep'.
Any tips on resolving this are most welcome. I was under the impression utf8 would treat these as different characters, but that seems not to be the case.
Collations that end with _ci are case (and accent) insensitive.
You could change the collation to utf8_bin to treat dranfog differently from drånfog.
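A quick sketch of that fix, assuming the target table is called users and the column is VARCHAR(64) (neither is specified in the question):

ALTER TABLE users MODIFY COLUMN username VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_bin;
-- Re-running the question's test under the binary collation:
SELECT IF('å' = 'a' COLLATE utf8_bin, 'yep', 'nope');  -- now returns 'nope'
-- dranfog and drånfog no longer violate the unique (username, site_id) key.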