Does a utf8_unicode_cs collation exist? - mysql

Does anyone know if a utf8_unicode_cs collation for MySQL exists? So far, my searches have come up dry. If it simply doesn't exist yet, is it fairly straightforward to create one? Or can I somehow use utf8_unicode_ci or utf8_bin to "simulate" what one would expect from a utf8_unicode_cs collation?

I came across the same issue, and after some Googling it seems that MySQL doesn't include it. To "simulate it", as you put it:
1) To ensure case-sensitivity in the DB: set the table column to utf8_bin collation
This allows:
strict SELECTs: SELECT "Joe" will NOT return rows with "joe" / "joE" / "jOe" / etc
strict UNIQUE index: a column with a UNIQUE index will treat case differences as distinct values. For example, with a utf8_unicode_ci collation, inserting "Joe" into a table that already has "joe" will trigger a "Duplicate key" error. With utf8_bin, inserting "Joe" will work fine.
2) To get the proper ordering in results: add the collation to the SQL query:
SELECT ... ORDER BY column COLLATE utf8_unicode_ci
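Putting both steps together, a minimal sketch (table and column names are hypothetical):

```sql
-- 1) A binary collation on the column makes comparisons and the
--    UNIQUE index case-sensitive.
CREATE TABLE users (
    username VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    UNIQUE KEY uk_username (username)
);

-- 'joe' and 'Joe' are distinct under utf8_bin, so both inserts succeed.
INSERT INTO users (username) VALUES ('joe'), ('Joe');

-- Strict SELECT: matches only the exact-case row.
SELECT * FROM users WHERE username = 'Joe';

-- 2) Override the collation per query for linguistic ordering.
SELECT username FROM users ORDER BY username COLLATE utf8_unicode_ci;
```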

This is an old question but does not seem to be superseded by any other, so I thought it worth posting that things have changed.
MySQL version 8 now has the following collations for utf8mb4:
utf8mb4_0900_ai_ci
utf8mb4_0900_as_ci
utf8mb4_0900_as_cs
... and many language-specific variants of same.
(no _ai_cs as far as I know, but that would in any case be less useful: few reasons to group [a] and [a-acute] and then separately group [A] and [A-acute]).
The purpose of the original question's hypothetical "utf8_unicode_cs" is fulfilled by utf8mb4_0900_as_cs. (The 0900 means it uses Unicode v 9.0.0 as opposed to 4.0.0 used by utf8_unicode_ci.)
To use these you'd need to change the field from utf8 to utf8mb4 character set - but that's a generally good idea anyway because the old 3-byte-max encoding can't handle e.g. emoji and other non-BMP characters.
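For example, converting a column and checking the comparison behaviour (names hypothetical; this assumes a utf8mb4 connection character set):

```sql
-- Re-encode the column as utf8mb4 with the accent- and case-sensitive
-- collation based on Unicode 9.0.0.
ALTER TABLE users
    MODIFY username VARCHAR(64)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_cs NOT NULL;

-- Both case and accents now distinguish values:
SELECT 'a' = 'A' COLLATE utf8mb4_0900_as_cs;  -- 0
SELECT 'a' = 'á' COLLATE utf8mb4_0900_as_cs;  -- 0
```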
Source: https://dev.mysql.com/doc/refman/8.0/en/charset-mysql.html

Related

What does [cs] mean in DDL for MariaDB

I have a DDL for a table
create table AIRPORT (
AIRP_CODE varchar(10)[cs] not null,
AIRP_NAME nvarchar(60)[cs] not null,
GEOR_ID_LOCATED integer not null,
PRCC_CONST integer,
AIRP_TIME_ZONE char(5),
AIRP_TRANSLATION mediumtext,
LCOUNT integer default 0
);
I am trying to figure out what [cs] means in it. I think it's for collation, but I am not sure how it works. The table DDL wasn't written by me and I can't figure it out.
In that position would be CHARACTER SET and/or COLLATION.
An "airport code" would best be CHARACTER SET ascii. Depending on whether you want to allow case folding, you could use COLLATE ascii_bin (disallow folding) or COLLATE ascii_general_ci (allow folding).
For the airport name, it would probably be best to use UTF-8:
AIRP_NAME varchar(60) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci not null
Note: NVARCHAR is a notation from non-MySQL vendors; for MySQL, the charset is what matters.
Perhaps you also want to specify a charset for AIRP_TRANSLATION? Again, utf8mb4 is probably appropriate.
(I have never seen "[cs]"; my advice is aimed at what should be specified in that context.)
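Putting that advice together, the DDL could be written in valid MySQL/MariaDB syntax roughly like this (a sketch; the [cs] markers are simply dropped):

```sql
CREATE TABLE AIRPORT (
    -- Airport codes are ASCII; ascii_bin keeps 'MAD' and 'mad' distinct.
    AIRP_CODE varchar(10) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
    -- Names may contain any script, so use 4-byte UTF-8.
    AIRP_NAME varchar(60) CHARACTER SET utf8mb4
        COLLATE utf8mb4_unicode_520_ci NOT NULL,
    GEOR_ID_LOCATED integer NOT NULL,
    PRCC_CONST integer,
    AIRP_TIME_ZONE char(5),
    AIRP_TRANSLATION mediumtext CHARACTER SET utf8mb4,
    LCOUNT integer DEFAULT 0
);
```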
That SQL code is invalid, period.
You must be missing information. Wherever you got the code from, there's possibly some documentation that explains what it is and how to use it. If it was handed to you by some other person, they didn't share that information with you.
Judging from the column names and types, I suspect the code comes from the README file of an airport database available for download, perhaps in CSV format, and it's just a recommended table structure you are meant to take as a starting point and adapt to your own system. My educated guess about [cs] is that it's an annotation indicating that those fields are case-sensitive, meaning that your application should use e.g. MAD and not mad.
In any case, having no further context it's impossible to tell.

"Convert to character set" doesn't convert tables having only integer columns to specified character set

I am working on 2 servers, each with a similar configuration, including the MySQL variables specific to character set and collation; both are running MySQL server and client 5.6.x. By default all tables are in latin1, including tables with only integer columns. But when I run
ALTER TABLE `table_name` CONVERT TO CHARACTER SET `utf8` COLLATE `utf8_unicode_ci`
for all tables on each server, only one of the servers converts all tables to utf8.
What I already tried:
Converted the default database character set (character_set_database) to utf8 before running the command above.
A solution that did work for me (though I'm still unsure why):
ALTER TABLE `table_name` CHARACTER SET = `utf8` COLLATE `utf8_unicode_ci`
Finally, there are 2 questions:
Why is CONVERT TO CHARACTER SET working on one server but not the other?
Why does the statement that worked for me behave differently? The only difference I have found is that it doesn't implicitly convert all the columns to the specified character set.
Can someone please help me understand what is happening?
Thank you in advance.
IIRC, that was a bug that eventually was fixed. See bugs.mysql.com . (The bug probably existed since version 4.1, when CHARACTER SETs were really added.)
I prefer to be explicit in two places, thereby avoiding the issue you raise:
When doing CREATE TABLE, I explicitly say what CHARACTER SET I need. This avoids depending on the default established when the database was created, perhaps years ago.
When adding a column (ALTER TABLE ADD COLUMN ...), I check (via SHOW CREATE TABLE) to see if the table already has the desired charset. Even so, I might explicitly state CHARACTER SET for the column. Again, I don't trust the history of the table.
Note: I am performing these queries from explicit SQL, not from some UI that might be "helping" me.
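For instance, being explicit in both places looks like this (names hypothetical):

```sql
-- State the charset/collation at creation time rather than relying
-- on the database default:
CREATE TABLE t (
    name VARCHAR(100)
) CHARACTER SET utf8 COLLATE utf8_unicode_ci;

-- And again when adding a column, even if SHOW CREATE TABLE already
-- shows the desired default:
ALTER TABLE t
    ADD COLUMN alias VARCHAR(100)
    CHARACTER SET utf8 COLLATE utf8_unicode_ci;
```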
Follow on
@HBK found http://bugs.mysql.com/bug.php?id=73153 . From it, I suspect this is what "should be" done by the user:
ALTER TABLE ...
CONVERT TO ...
DEFAULT CHARACTER SET ...; -- Do this also
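Spelled out, that workaround amounts to two statements (table name hypothetical):

```sql
-- Convert the existing text columns (a no-op on integer-only tables):
ALTER TABLE table_name
    CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

-- Also set the table default explicitly, which is the part the bug
-- report suggests adding:
ALTER TABLE table_name
    DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
```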

mysql character set utf 8 collation (dup key) not working properly

My database table's character set and collation are set to utf8 and utf8_general_ci respectively. I inserted a record with the value 'säî kîråñ' in a varchar column that has a unique constraint. When I try to insert 'sai kiran', I get a duplicate-entry error referring to the previously inserted row 'säî kîråñ'.
The characters in the two strings are completely different in the utf8 character set, so I can't understand why it reports a 'duplicate entry' error.
I tried changing the collation to utf8_unicode_ci, but it made no difference. I also tried inserting directly in phpMyAdmin to rule out programming-language encoding problems; the problem persists.
In the case of utf8_general_ci I thought only 'a' and 'A' would be treated the same, but from here I found that even 'a' and 'ä' are treated the same. So utf8_bin is the ultimate solution to treat them as different. Good learning, though.
Because of the lack of a utf8_..._as_ci (accent-sensitive, case-insensitive) collation, you should probably change the collation of (at least) that column to utf8_bin. Then, instead of all of these being equal (S = s = Ş = ş = Š = š), they will all be different. Note, especially, that s and S will be distinct under utf8_bin.
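A minimal illustration, assuming a utf8 connection character set (table/column names hypothetical):

```sql
-- The _ci collations fold both case and accents away:
SELECT 'säî kîråñ' = 'sai kiran' COLLATE utf8_general_ci;  -- 1
-- utf8_bin compares raw bytes, so the strings differ:
SELECT 'säî kîråñ' = 'sai kiran' COLLATE utf8_bin;         -- 0

-- Change the unique column accordingly:
ALTER TABLE t MODIFY name VARCHAR(100)
    CHARACTER SET utf8 COLLATE utf8_bin;
```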

MySQL considers 'е' and 'ё' equal, how do I set it to consider them different?

I have a table with a unique constraint on a varchar field. When I try to insert 'e' and 'ё' in two different rows I'm prompted with a unique constraint violation. Executing the following select shows that MySQL considers the letters equivalent in spite of their HEX values being D0B5 and D191 respectively.
select 'е' = 'ё',
hex('е'),
hex('ё');
Following a fair amount of Googling I came across this MySQL bug report which seems to deal with this issue. The very last response by Sveta Smirnova states that this behavior is by design and refers to the Collation chart for utf8_unicode_ci, European alphabets (MySQL 6.0.4).
How do I tell MySQL that 'е' is not equal to 'ё' for query purposes and how do I change the unique constraint to take note of this fact?
You may wish to check this answer: Is it possible to remove mysql table collation?
The behavior you're seeing is standard. In most cases it produces the best results. Out of interest, do you have an example of how this is causing a problem for you? Have you found two words that match except for the diacritic?
Either way, the only thing you can do about it is to change the collation. This can be done at the server, database, table, or even column level.
Rather than duplicating the manual on how to do this, please follow this link: http://dev.mysql.com/doc/refman/5.7/en/charset-syntax.html
There's a listing here of the different collations supported: http://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html
If you need that for a specific field, you could add a duplicate of the column with a different collation to prevent this issue.
ALTER TABLE yourTable ADD COLUMN `copiedColumn` VARCHAR(100) CHARACTER SET 'binary' COLLATE 'binary';
Also, you can change the collation of your column if you don't need your current collation on this field:
ALTER TABLE yourTable CHANGE COLUMN yourColumn yourColumn VARCHAR(100) CHARACTER SET 'binary' COLLATE 'binary';
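With a binary collation the two Cyrillic letters compare as different, so the unique constraint accepts both rows; a sketch with hypothetical names:

```sql
ALTER TABLE words MODIFY word VARCHAR(100)
    CHARACTER SET utf8 COLLATE utf8_bin;

SELECT 'е' = 'ё' COLLATE utf8_bin;  -- 0: different byte sequences

-- Both inserts now succeed despite the UNIQUE constraint:
INSERT INTO words (word) VALUES ('еж'), ('ёж');
```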

mysql utf8 encoding and unique keys

I am importing data from a user table (many users from many sites):
the source is a MyISAM table with the default latin1_swedish_ci collation,
and I am importing that data into an InnoDB table with utf8_general_ci.
I have placed a unique key on the (username, site_id) combination, but it is failing for 2 users of the same site:
user 1 dranfog,
user 2 drånfog
If I run:
SELECT IF('å' = 'a', 'yep', 'nope');
directly on the target db with utf8 encoding, I get 'yep'.
Any tips on resolving this are most welcome. I was under the impression utf8 would treat these as different characters, but that seems not to be the case.
Collations that end with _ci are case (and accent) insensitive.
You could change the collation to utf8_bin to treat dranfog differently from drånfog.
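i.e., assuming the column is named username (hypothetical):

```sql
ALTER TABLE users MODIFY username VARCHAR(64)
    CHARACTER SET utf8 COLLATE utf8_bin;

-- The accent now matters:
SELECT IF('å' = 'a' COLLATE utf8_bin, 'yep', 'nope');  -- 'nope'
```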