mysql utf8 encoding and unique keys

I am importing data from a user table (many users from many sites):
MyISAM, default collation latin1_swedish_ci.
Importing that data into an InnoDB table with utf8_general_ci collation.
I have placed a unique key on the (username, site_id) combination, but it is failing for two users of the same site:
user 1 dranfog,
user 2 drånfog
If I run:
SELECT IF('å' = 'a', 'yep', 'nope');
directly on the target db with utf8 encoding, I get 'yep'.
Any tips on resolving this are most welcome. I was under the impression utf8 would treat these as different characters, but that seems not to be the case.

Collations that end with _ci are case (and accent) insensitive.
You could change the collation to utf8_bin (the actual collation name; there is no utf8_binary) to treat dranfog differently from drånfog.
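For example, a minimal sketch of that fix (the table and column definitions are assumptions for illustration, and a utf8 connection is assumed):
ALTER TABLE users MODIFY username VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_bin; -- binary collation: 'a' <> 'å'
SELECT IF('å' = 'a' COLLATE utf8_bin, 'yep', 'nope'); -- the earlier test now returns 'nope'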

Related

Why changing the table collation in mysql breaks with unique index violation?

Problem:
Let's say that a table customers (id, name, email, ...) is encoded using utf8 (utf8_general_ci collation).
This table also has a unique key constraint on column email.
When trying to change the collation to utf8mb4 using
ALTER TABLE customers CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;,
some entries cause the unique index constraint to flare up, because they are considered duplicates.
Example:
row1: (1, "strauss", "strauss#email.com")
row2: (10, "Strauss", "strauß#email.com")
The same happens if two emails differ only by a zero-width space character.
Tested with mysql version 5.7.20
In German, ß is considered basically equal to ss.
While the xxx_unicode_ci collations respect that, the xxx_general_ci ones do not, so you switched from a collation that considers "straus#email.com" different from "strauß#email.com" to one that considers them the same.
Actually, xxx_general_ci treats ß and s the same, so it would complain about "straus#email.com" and "strauß#email.com" instead.
See the documentation:
_general_ci Versus _unicode_ci Collations
For any Unicode character set, operations performed using the xxx_general_ci collation are faster than those for the xxx_unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, ß is equal to ss in German and some other languages. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
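A minimal illustration of the expansion difference described above (assumes a utf8mb4 connection, e.g. after SET NAMES utf8mb4):
SELECT 'ß' = 'ss' COLLATE utf8mb4_general_ci AS general_ci, -- 0: no expansions, one-to-one comparisons only
       'ß' = 'ss' COLLATE utf8mb4_unicode_ci AS unicode_ci; -- 1: ß expands to ss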

Utf8 and latin1

Half of the tables in a database which has a lot of data are set to latin1.
Within these latin1 tables most rows are set to utf8 if the row is expecting text input (anything that is not an integer).
Everything is in English.
How bad is my situation if I need to convert these latin1 tables to utf8?
First, a clarification: You said "most rows are set to utf8"; I assume you meant "most columns are set to utf8"?
The meaning of the latin1 on the table is just a default. It has no impact on performance, etc.
The only "harm" occurs if you do ALTER TABLE .. ADD COLUMN .. without specifying CHARACTER SET utf8.
You say that all the text is English? Then there is no encoding difference between latin1 and utf8; problems occur only when you have accented letters, etc.
There is one performance issue: If you JOIN two tables together on a VARCHAR column, but the CHARACTER SET or COLLATION differs for that column in the two tables, it will be slower than if those settings are the same. (It does not sound like you have this problem.) Again, note that the table's default is not relevant, only the column settings themselves.
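A hypothetical sketch of that pitfall (table and column names invented for the example):
CREATE TABLE t1 (name VARCHAR(50) CHARACTER SET latin1, KEY (name));
CREATE TABLE t2 (name VARCHAR(50) CHARACTER SET utf8, KEY (name));
-- In a join such as this, MySQL has to convert the latin1 side to utf8,
-- so the index on t1.name typically cannot be used for the lookup:
SELECT t1.name FROM t1 JOIN t2 ON t1.name = t2.name;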
Yes, it would be 'cleaner' to have the table default set to utf8. Here is a way to do it without dumping, by changing the tables one at a time:
ALTER TABLE t CONVERT TO CHARACTER SET utf8;
That will change the table default and any columns that are not already utf8.
Alternatively, a dump-and-reload one-liner:
mysqldump --add-drop-table database_to_correct | replace CHARSET=latin1 CHARSET=utf8 | iconv -f latin1 -t utf8 | mysql database_to_correct

"Convert to character set" doesn't convert tables having only integer columns to specified character set

I am working on 2 servers, each with similar configurations, including the MySQL variables specific to character set and collation; both are running MySQL server and client 5.6.x. By default all tables are in latin1, including tables with only integer columns. But when I run
ALTER TABLE `table_name` CONVERT TO CHARACTER SET `utf8` COLLATE `utf8_unicode_ci`
for all tables on each server, only one of the servers converts all tables to utf8.
What I already tried:
Converted the default database character set (character_set_database) to utf8 before running the command above.
A solution that already worked for me (though I am still unsure why it worked):
ALTER TABLE `table_name` CHARACTER SET = `utf8` COLLATE `utf8_unicode_ci`
Finally, there are 2 questions:
Why is CONVERT TO CHARACTER SET working on one server and not on the other?
The solution that worked for me is similar to CONVERT TO CHARACTER SET; the only difference I have come across is that it doesn't implicitly convert all the columns to the specified character set.
Can someone please help me understand what is happening?
Thank you in advance.
IIRC, that was a bug that eventually was fixed. See bugs.mysql.com. (The bug probably existed since version 4.1, when CHARACTER SETs were really added.)
I prefer to be explicit in two places, thereby avoiding the issue you raise:
When doing CREATE TABLE, I explicitly say what CHARACTER SET I need. This avoids depending on the default established when the database was created, perhaps years ago.
When adding a column (ALTER TABLE ADD COLUMN ...), I check (via SHOW CREATE TABLE) to see if the table already has the desired charset. Even so, I might explicitly state CHARACTER SET for the column. Again, I don't trust the history of the table.
Note: I am performing these queries from explicit SQL, not from some UI that might be "helping" me.
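As a sketch of being explicit in both places (all names here are hypothetical):
CREATE TABLE notes (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  body TEXT CHARACTER SET utf8
) DEFAULT CHARACTER SET utf8; -- explicit, rather than inheriting the database default
SHOW CREATE TABLE notes; -- later, before adding a column, check the table's charset
ALTER TABLE notes ADD COLUMN title VARCHAR(100) CHARACTER SET utf8; -- explicit again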
Follow on
#HBK found http://bugs.mysql.com/bug.php?id=73153 . From it, I suspect this is what 'should be' done by the user:
ALTER TABLE ...
CONVERT TO ...
DEFAULT CHARACTER SET ...; -- Do this also
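Spelled out concretely, that advice might look like the following (table name hypothetical; written as two statements to stay on safe syntactic ground):
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE my_table DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
SHOW CREATE TABLE my_table; -- verify that both the table default and the columns changed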

mysql character set utf8 collation (dup key) not working properly

My database table's character set and collation are set to utf8 and utf8_general_ci respectively. I inserted a record with the value 'säî kîråñ' in a varchar column that has a unique constraint. When I try to insert 'sai kiran' it gives a duplicate-entry error referring to the previously inserted row 'säî kîråñ'.
As you can see, the characters in the two strings are completely different in the utf8 character set; I couldn't understand why it reports a 'duplicate entry' error.
I tried changing the collation to utf8_unicode_ci, but to no avail. I tried inserting directly in phpMyAdmin to avoid programming-language encoding problems; the problem still persists.
With utf8_general_ci I thought only 'a' and 'A' would be treated the same, but from here I found that even 'a' and 'ä' are treated the same. So utf8_bin is the ultimate solution to treat them as different. Good learning, though.
Because of the lack of a utf8_..._ci_as collation, you should probably change the collation of (at least) that column to utf8_bin. Then, instead of all of these being equal (S=s=Ş=ş=Š=Š=š=š), they will all be different. Note, especially, that s and S will be distinct in utf8_bin.
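A small sketch showing the two behaviors side by side (assumes a utf8 connection; the ALTER uses an invented table and column definition):
SELECT 'säî kîråñ' = 'sai kiran' COLLATE utf8_general_ci AS ci_equal, -- 1: considered duplicates
       'säî kîråñ' = 'sai kiran' COLLATE utf8_bin AS bin_equal; -- 0: distinct
ALTER TABLE t MODIFY name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_bin;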

Does a utf8_unicode_cs collation exist?

Does anyone know if a utf8_unicode_cs collation for MySQL exists? So far, my searches have come up dry. If it simply doesn't exist yet, is it fairly straightforward to create one? Or can I somehow use utf8_unicode_ci or utf8_bin to "simulate" what one would expect from a utf8_unicode_cs collation?
I came across the same issue and after some Googling, it seems that MySQL doesn't include it. To "simulate it", as you put it:
1) To ensure case-sensitivity in the DB: set the table column to utf8_bin collation
This allows:
strict SELECTs: searching for "Joe" will NOT return rows with "joe" / "joE" / "jOe" / etc.
strict UNIQUE index: a column with a UNIQUE index will treat case differences as different values. For example, if a utf8_unicode_ci collation is used, inserting "Joe" into a table that already has "joe" will trigger a "Duplicate key" error. If utf8_bin is used, inserting "Joe" will work fine.
2) To get the proper ordering in results: add the collation to the SQL query:
SELECT ... ORDER BY column COLLATE utf8_unicode_ci
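Putting the two steps together, a hedged sketch (table name and data invented for the example):
CREATE TABLE people (
  name VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_bin UNIQUE
);
INSERT INTO people VALUES ('joe'), ('Joe'); -- both succeed under utf8_bin
SELECT name FROM people ORDER BY name COLLATE utf8_unicode_ci; -- linguistic ordering for display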
This is an old question but does not seem to be superseded by any other, so I thought it worth posting that things have changed.
MySQL version 8 now has the following collations for utf8mb4:
utf8mb4_0900_ai_ci
utf8mb4_0900_as_ci
utf8mb4_0900_as_cs
... and many language-specific variants of same.
(no _ai_cs as far as I know, but that would in any case be less useful: few reasons to group [a] and [a-acute] and then separately group [A] and [A-acute]).
The purpose of the original question's hypothetical "utf8_unicode_cs" is fulfilled by utf8mb4_0900_as_cs. (The 0900 means it uses Unicode v 9.0.0 as opposed to 4.0.0 used by utf8_unicode_ci.)
To use these you'd need to change the field from utf8 to utf8mb4 character set - but that's a generally good idea anyway because the old 3-byte-max encoding can't handle e.g. emoji and other non-BMP characters.
Source: https://dev.mysql.com/doc/refman/8.0/en/charset-mysql.html
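For example, a sketch of moving one column to the new collation (table and column names hypothetical):
ALTER TABLE t MODIFY name VARCHAR(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_cs; -- accent- and case-sensitive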