mysql character set utf 8 collation (dup key) not working properly - mysql

My database tables character-set and collation set to utf-8 and utf8_general_ci respectively. I inserted a record with value 'säî kîråñ' in varchar column. I have a unique column constraint on that. When I try to insert 'sai kiran' its giving duplicate entry referring to old inserted row of 'säî kîråñ'.
As you can see the characters in both the strings are completely different in utf8 character-set, couldn't understand why it is showing error as 'duplicate entry'.
I tried changing collation to utf8_unicode_ci but no use. I tried directly inserting in phpmyadmin to avoid prog lang encoding problems, still the problem persists.

In case of utf8_general_ci I thought only 'a' and 'A' would be treated same, but from here I found that even 'a' and 'ä' would be treated same. So utf8_bin is the ultimate solution to treat them different. Good learning though.

Because of the lack of a utf8_..._ci_as, collation, you should probably change the collation of (at least) that column to be utf8_bin. Then, instead of all of these being equal S=s=Ş=ş=Š=Š=š=š, they will all be different. Note, especially. that s and S will be distinct in utf8_bin.

Related

Incorrect string value error for unconventional characters

So I'm using a wrapper to fetch user data from instagram. I want to select the display names for users, and store them in a MYSQL database. I'm having issues inserting some of the display names, dealing with, specifically, an incorrect string value error:
Now, I've dealt with this issue before with accent marks, letters with umlauts, etc. The solution would be to change the collation to utf8_general_ci under the utf8 charset.
So as you can see, some of the display names I'm pulling have very unique characters that I'm not sure mySQL can recognize at all, i.e.:
ᛘ𝕰𝖆𝖗𝖙𝖍 𝕾𝖕𝖎𝖗𝖎𝖙𝖚𝖘𐂂®
So I receive:
Error Code: 1366. Incorrect string value: '\xF0\x9D\x99\x87\xF0\x9D...' for column 'dummy' at row 1
Here's my sql code
CREATE TABLE test_table(
id INT AUTO_INCREMENT,
dummy VARCHAR(255),
PRIMARY KEY(id)
);
INSERT INTO test_table (dummy)
VALUES ('ᛘ𝕰𝖆𝖗𝖙𝖍 𝕾𝖕𝖎𝖗𝖎𝖙𝖚𝖘𐂂®');
Any thoughts on a proper charset + collation pair that can handle characters like this? Not sure where to look for a solution, so I come here to see if anyone dealt with this.
P.S., I've tried utf8mb4 charset with utf8mb4_unicode_ci and utf8mb4_bin collations as well.
The characters you show require the column use the utf8mb4 encoding. Currently it seems your column is defined with the utf8mb3 encoding.
The way MySQL uses the name "utf8" is complicated, as described in https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html:
Note
Historically, MySQL has used utf8 as an alias for utf8mb3;
beginning with MySQL 8.0.28, utf8mb3 is used exclusively in the output
of SHOW statements and in Information Schema tables when this
character set is meant.
At some point in the future utf8 is expected to become a reference to
utf8mb4. To avoid ambiguity about the meaning of utf8, consider
specifying utf8mb4 explicitly for character set references instead of
utf8.
You should also be aware that the utf8mb3 character set is deprecated
and you should expect it to be removed in a future MySQL release.
Please use utf8mb4 instead.
You may have tried to change your table in the following way:
ALTER TABLE test_table CHARSET=utf8mb4;
But that only changes the default character set, to be used if you add new columns to the table subsequently. It does not change any of the current columns. To do that:
ALTER TABLE test_table MODIFY COLUMN dummy VARCHAR(255) CHARACTER SET utf8mb4;
Or to convert all string or TEXT columns in a table in one statement:
ALTER TABLE test_table CONVERT TO CHARACTER SET utf8mb4;
That would be 𝙇 - L MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL L
It requires the utf8mb4 Character set to even represent it. "F0" is the clue; it is the first of 4 bytes in a 4-byte UTF-8 character. It cannot be represented in MySQL's "utf8". Collation is (mostly) irrelevant.
Most, not all, of the characters in ᛘ𝕰𝖆𝖗𝖙𝖍 𝕾𝖕𝖎𝖗𝖎𝖙𝖚𝖘𐂂® also need utf8mb4. They are "MATHEMATICAL BOLD FRAKTUR" letters.
(Meanwhile, Bill gives you more of an answer.)

Why changing the table collation in mysql breaks with unique index violation?

Problem:
Let's say that table customers (id, name, email, .. ) is encoded using utf-8 (utf8_general_ci collation).
This table also has a unique key constraint on column email.
When trying to change the collation to utf8mb4 using
ALTER TABLE customers CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;,
some entries cause the unique index constraint to flare up, because they are considered duplicates.
Example:
row1: (1, "strauss" "strauss#email.com")
row2: (10, "Strauss" "strauß#email.com")
same happens if two emails differ only by a zero width space char.
Tested with mysql version 5.7.20
In German, ß is considered basically equal to ss.
While the xxx_unicode_ci collation respects that, the xxx_general_ci does not, so you switched from a collation that considers "straus#email.com" not the same as "strauß#email.com" to one that does.
Actually, xxx_general_ci treats ß and s the same, so it would complain about "straus#email.com" and "strauß#email.com" instead.
See the documentation:
_general_ci Versus _unicode_ci Collations
For any Unicode character set, operations performed using the xxx_general_ci collation are faster than those for the xxx_unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, ß is equal to ss in German and some other languages. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

"Convert to character set" doesn't convert tables having only integer columns to specified character set

I am working on 2 servers each having similar configurations, Including mysql variables specific to character set and collation and both are on running mysql server and client 5.6.x. By default all tables are in latin1 including tables with only integer columns, But when I run
ALTER TABLE `table_name` CONVERT TO CHARACTER SET `utf8` COLLATE `utf8_unicode_ci`
for all tables in each server only one of the servers is converting all tables to utf8.
What I already tried:
Converted the default database character (character_set_database) set to utf8 before running the above listed command
Solution already worked for me (but still unsure why it worked)
ALTER TABLE `table_name` CHARACTER SET = `utf8` COLLATE `utf8_unicode_ci`
Finally there are 2 questions:
CONVERT TO CHARACTER SET is working in one server and not in other
Solution already worked for me which is similar to CONVERT TO CHARACTER SET with only one difference I have come across is, it doesn't implicitly convert the all the columns to specified character set.
Can someone please help me understand what is happening?
Thank you in advance.
IIRC, that was a bug that eventually was fixed. See bugs.mysql.com . (The bug probably existed since version 4.1, when CHARACTER SETs were really added.)
I prefer to be explicit in two places, thereby avoiding the issue you raise:
When doing CREATE TABLE, I explicitly say what CHARACTER SET I need. This avoids depending on the default established when the database was created, perhaps years ago.
When adding a column (ALTER TABLE ADD COLUMN ...), I check (via SHOW CREATE TABLE) to see if the table already has the desired charset. Even so, I might explicitly state CHARACTER SET for the column. Again, I don't trust the history of the table.
Note: I am performing these queries from explicit SQL, not from some UI that might be "helping" me.
Follow on
#HBK found http://bugs.mysql.com/bug.php?id=73153 . From it, I suspect this is what 'should be' done by the user:
ALTER TABLE ...
CONVERT TO ...
DEFAULT CHARACTER SET ...; -- Do this also

MySQL considers 'е' and 'ё' equal, how do I set it to consider them different?

I have a table with a unique constraint on a varchar field. When I try to insert 'e' and 'ё' in two different rows I'm prompted with a unique constraint violation. Executing the following select shows that MySQL considers the letters equivalent in spite of their HEX values being D0B5 and D191 respectively.
select 'е' = 'ё',
hex('е'),
hex('ё');
Following a fair amount of Googling I came across this MySQL bug report which seems to deal with this issue. The very last response by Sveta Smirnova states that this behavior is by design and refers to the Collation chart for utf8_unicode_ci, European alphabets (MySQL 6.0.4).
How do I tell MySQL that 'е' is not equal to 'ё' for query purposes and how do I change the unique constraint to take note of this fact?
You may wish to check this answer: Is it possible to remove mysql table collation?
The behavior you're seeing is standard. In most cases it produces the best results. From a point of interest do you have an example of how this is causing a problem for you. Have you found two words which match except for the diacritic?
Either way the only thing you can do about it is to change the collation. This can be done at the server, database, table or even field level.
Rather than my duplicating the manual on how to do this; please follow this link: http://dev.mysql.com/doc/refman/5.7/en/charset-syntax.html
There's a listing here of the different collations supported: http://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html
If you need that for an especific field you could add a duplicate of the column with a different collation for prevent this issue.
ALTER TABLE yourTable ADD COLUMN `copiedColumn` VARCHAR(100) CHARACTER SET 'binary' COLLATE 'binary';
Also, you can change the collation of your column if you don't need the your current collation in this field
ALTER TABLE yourTable CHANGE COLUMN yourColumn yourColumn VARCHAR(100) CHARACTER SET 'binary' COLLATE 'binary';

Does a utf8_unicode_cs collation exist?

Does anyone know if a utf8_unicode_cs collation for MySQL exists? So far, my searches have come up dry. If it simply doesn't exist yet, is it fairly straight-forward to create one? Or somehow use utf8_unicode_ci or utf8_bin but "simulate" what one would expect from a utf8_unicode_cs collation?
I came across the same issue and after some Googling, it seems that MySQL doesn't include it. To "simulate it", as you put it,
1) To ensure case-sensitivity in the DB: set the table column to utf8_bin collation
This allows:
strict SELECTs: SELECT "Joe" will NOT return rows with "joe" / "joE" / "jOe" / etc
strict UNIQUE index: a column with a UNIQUE index will treat case differences as different values. For example, if a utf8_unicode_ci collation is used, inserting "Joe" on a table that already has "joe" will trigger a "Duplicate key" error. If ut8_bin is used, inserting "Joe" will work fine.
2) To get the proper ordering in results: add the collation to the SQL query:
SELECT ... ORDER BY column COLLATE utf8_unicode_ci
This is an old question but does not seem to be superseded by any other, so I thought it worth posting that things have changed.
MySQL version 8 now has the following collations for utf8mb4:
utf8mb4_0900_ai_ci
utf8mb4_0900_as_ci
utf8mb4_0900_as_cs
... and many language-specific variants of same.
(no _ai_cs as far as I know, but that would in any case be less useful: few reasons to group [a] and [a-acute] and then separately group [A] and [A-acute]).
The purpose of the original question's hypothetical "utf8_unicode_cs" is fulfilled by utf8mb4_0900_as_cs. (The 0900 means it uses Unicode v 9.0.0 as opposed to 4.0.0 used by utf8_unicode_ci.)
To use these you'd need to change the field from utf8 to utf8mb4 character set - but that's a generally good idea anyway because the old 3-byte-max encoding can't handle e.g. emoji and other non-BMP characters.
Source: https://dev.mysql.com/doc/refman/8.0/en/charset-mysql.html