MySQL / MariaDB Case Insensitive Collation Still Case Sensitive?

Using MariaDB 10.0.36, I have a user table with the collation utf8_turkish_ci. Its user_login column, which stores each user's username, also uses utf8_turkish_ci and has a unique index.
My understanding is that a select statement should be case insensitive, but that doesn't appear to be the case for certain usernames.
For example, I have a user with the login GoDoIt.
This statement returns no records:
SELECT * FROM user WHERE user_login = 'godoit'
However, this works:
SELECT * FROM user WHERE user_login = 'GoDoIt'
I find this strange because the username of Eric works both ways.
SELECT * FROM user WHERE user_login = 'eric'
SELECT * FROM user WHERE user_login = 'Eric'
Both return the same result. So why would capitals in the middle of the string not work? I'm lowercasing the input username in PHP with strtolower before sending it to the database, and I guess this approach won't work with certain usernames.

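In Turkish, the dotless I and the dotted i are two separate letters: uppercase I pairs with lowercase ı, and uppercase İ pairs with lowercase i. The utf8_turkish_ci collation follows that rule, so I and i are not considered equal. That is why 'godoit' does not match 'GoDoIt', while 'eric' matches 'Eric', whose i is lowercase in both spellings.
See the collation chart here: http://collation-charts.org/mysql60/mysql604.utf8_turkish_ci.html
Note the separate entries for the dotless I and dotted i.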
Additional background here: https://en.wikipedia.org/wiki/Dotted_and_dotless_I
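The difference can be checked directly with literal comparisons. The following is a minimal sketch, assuming a connection character set of utf8 (on servers that default to utf8mb4 the collation names would need adjusting). The second statement shows one possible workaround, overriding the collation for a single comparison; that is an assumption about what the application needs, not something stated in the question.

```sql
-- Assumes the session uses utf8, so the literals can take a utf8_* collation.
SET NAMES utf8;

-- Under utf8_turkish_ci the dotless pair (I, ı) and the dotted pair (İ, i)
-- are each equal, but the two pairs are not equal to each other.
SELECT 'I' = 'i' COLLATE utf8_turkish_ci AS dotless_vs_dotted,  -- 0
       'I' = 'ı' COLLATE utf8_turkish_ci AS dotless_pair,       -- 1
       'İ' = 'i' COLLATE utf8_turkish_ci AS dotted_pair;        -- 1

-- Possible workaround: compare with a collation that treats I and i as equal.
SELECT * FROM user WHERE user_login = 'godoit' COLLATE utf8_general_ci;
```

Overriding the collation in the query generally prevents the unique index on user_login from being used for the lookup, so changing the column's collation may be the better fix if Turkish ordering is not actually required.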

(Too long for a Comment. Spencer's answer is good.)
The collation chart lists the letters, shows which ones are treated as equal, and gives their sort order. Here is the excerpt showing that the dotless I's are equal to each other but sort before the dotted I's:
utf8_turkish_ci I=ı Ħ=ħ i=Ì=Í=Î=Ï=ì=í=î=ï=Ĩ=ĩ=Ī=ī=Ĭ=ĭ=Į=į=İ ij=Ĳ=ĳ iz J=j=j́=Ĵ=ĵ jz
Some other things that are unusual about utf8_turkish_ci: Ö=ö is treated as a separate "letter" that comes between O and P. The same goes for Ç=ç, Ğ=ğ and Ş=ş.
Note: utf8mb4 and utf8 handle Turkish identically.
MySQL 6.0 died on the vine years ago; it looks like that link for the collation is out of date with respect to Ş:
mysql> SELECT 'Ş' = 'S' COLLATE utf8_turkish_ci;
+------------------------------------+
| 'Ş' = 'S' COLLATE utf8_turkish_ci |
+------------------------------------+
|                                  0 |
+------------------------------------+

Related

MySQL collation query results

Does anyone have an explanation why:
SELECT * FROM MY_TABLE WHERE 1 = 1 AND libelle COLLATE latin1_general_ci LIKE '%dég%'
returns 1 record (only the record with é), while
SELECT * FROM MY_TABLE WHERE 1 = 1 AND libelle COLLATE latin1_swedish_ci LIKE '%dég%'
returns 4 records (including the one above, of course)?
According to the MySQL documentation, latin1_general_ci is "Multilingual (Western European) case insensitive", so shouldn't it handle accents the same way latin1_swedish_ci does?
Thanks
Nicolas

I suspect you're misunderstanding what collation is.
Collation is a set of rules used in natural language (Swedish, English, Russian, Japanese...) to determine the dictionary order of words. In relational databases, this is used to sort data (e.g. ORDER BY clauses) and to compare data (e.g. WHERE clauses or unique indexes). A couple of examples:
If you need to order by country name in English you get this:
Canada
China
Colombia
However, in traditional Spanish ch used to be an independent letter so correct order was this:
Canada
Colombia
China
In Swedish, å is a separate letter so you can have a login name like ångström even if you already have angström. In other languages they'd be duplicates and would not be allowed.
Collation is not something you use to display emojis and other Unicode characters. That's just encoding (ISO-8859-1, UTF-8, UTF-16... whatever).
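To see that a collation changes comparison results and not just sort order, the same pair of characters can be compared under different collations. This is a small sketch, assuming a utf8 connection; CONVERT ... USING is used to obtain latin1 values because the latin1_* collations only apply to latin1 strings.

```sql
SET NAMES utf8;

-- é versus e, the case from the question.
SELECT CONVERT('é' USING latin1) = CONVERT('e' USING latin1) COLLATE latin1_swedish_ci AS swedish_ci,
       CONVERT('é' USING latin1) = CONVERT('e' USING latin1) COLLATE latin1_general_ci AS general_ci;

-- å versus a: a separate letter under the Swedish rules, an accented a elsewhere.
SELECT 'å' = 'a' COLLATE utf8_swedish_ci AS swedish_ci,   -- 0
       'å' = 'a' COLLATE utf8_general_ci AS general_ci;   -- 1
```

Going by the results reported in the question, the first query would be expected to return 1 for latin1_swedish_ci and 0 for latin1_general_ci.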

Find accented and non-accented variations of same word

I have a database table which represent people and the records have people's names in them. Some of the names have accented characters in them. Some do not. Some are non-accented duplicates of the accented version.
I need to generate a report of all of the potential duplicates by finding names that are the same (first, middle, last) except for the accents so that someone else can go through this list and verify which are true duplicates, and which are actually different people (I'm assuming they have some other way of knowing).
For example: Jose DISTINCT-LAST-NAME and José DISTINCT-LAST-NAME should be picked up as potential duplicates because they have the same characters, but one has an accented character.
How can this type of query be written in MySQL?
This question: How to remove accents in MySQL? is not the same. It is asking about de-accenting strings in-place and the poster already has a second column of data that has been de-accented. Also, the accepted answer to that question is to set the character set and collation. I have already set the character set and collation.
I am trying to generate a report that finds strings in different records that are the same except for their accents.

I found your question very interesting.
According to this article Accents in text searches, using "like" condition with some character collation adjustments will solve your problem. I have not tested this solution, so if it helps you, please come back and tell us.
Here is a similar question: Accent insensitive search query in MySQL,
according to that, you can use something like:
where 'José' like 'Jose' collate utf8_general_ci

Well, I found something that seems to work (the real query involves a few more other fields, but the same basic idea):
select distinct p1.person_id, p1.first_name, p1.last_name, p2.last_name
from people as p1, people as p2
where binary p1.last_name <> binary p2.last_name
and p1.last_name = p2.last_name
and p1.first_name = p2.first_name
order by p1.last_name, p1.first_name, p2.last_name;
The results look like this:
12345 Bob Jose José
56789 Bob José Jose
...
This makes sense as there are 2 records for Bob José and I know that in this case, it is the same person but one record is missing the accent.
The trick is to do both a binary and a non-binary comparison on the "last_name" field, as well as matching on all other fields. This way we can find everything that is "equal" but not binary-equal. This works because with the current character set and collation (utf8/utf8_general_ci), Jose and José are equal but are not binary-equal. You can try it out like this:
select 'Jose' = 'José', 'Jose' like 'José', binary 'Jose' = binary 'José';
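In the result above every pair shows up twice, once in each direction. A sketch of the same idea with explicit JOIN syntax that keeps one row per pair, assuming person_id is unique per record; the output column aliases are made up for illustration.

```sql
SELECT p1.person_id AS id_a,
       p2.person_id AS id_b,
       p1.first_name,
       p1.last_name AS last_name_a,
       p2.last_name AS last_name_b
FROM people AS p1
JOIN people AS p2
  ON p1.first_name = p2.first_name
 AND p1.last_name = p2.last_name                  -- equal under the column collation
 AND BINARY p1.last_name <> BINARY p2.last_name   -- but stored as different bytes
 AND p1.person_id < p2.person_id                  -- report each pair only once
ORDER BY p1.last_name, p1.first_name;
```

Because the column collation is also case insensitive, this reports pairs that differ only in letter case (Jose and JOSE) as well as pairs that differ in accents.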

The Bane of Character Encodings
There are a wide variety of character-sets and encodings that may be used in MySQL, and when dealing with encoding it is important to learn what you can about them. In particular, take a close look at the differences between:
utf8_unicode_ci
utf8_general_ci
utf8_unicode_520_ci
utf8mb4_general_ci
Some character sets are built to include as many printable characters as possible, to support a wider range of uses, while others are built with the intent of portability and compatibility between systems. In particular, the utf8_unicode_ci collation treats most accented characters as equal to their non-accented equivalents. Alternatively, the ascii character set with ascii_general_ci is even more restrictive, because it cannot store accented characters at all.
Take a look at the utf8_unicode_ci collation chart, and What's the difference between utf8_general_ci and utf8_unicode_ci .
The best answer is from a similar question, "How to remove accents in MySQL?"
If you set an appropriate collation for the column then the value
within the field will compare equal to its unaccented equivalent
naturally.
mysql> SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'é' = 'e';
+------------+
| 'é' = 'e' |
+------------+
|          1 |
+------------+
1 row in set (0.05 sec)
How to apply this to your situation?
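SELECT id, last_name
FROM people
WHERE last_name COLLATE utf8_unicode_ci IN
(
    -- one representative per group of names that are equal under utf8_unicode_ci;
    -- MIN() keeps the subquery valid when ONLY_FULL_GROUP_BY is enabled
    SELECT MIN(last_name)
    FROM people
    GROUP BY last_name COLLATE utf8_unicode_ci
    HAVING COUNT(last_name) > 1
)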

German umlauts and UTF8 collations, revisited

As I am sure a lot of people here are aware, having to deal with German umlauts and UTF8 collations can be problematic to say the least. Stuff like a = ä, o = ö, u = ü is not only capable of affecting the sort order of the results but the actual results as well. Here is an example which clearly demonstrates how things can go wrong by simply trying to make a distinction between a singular and plural version of a noun (Bademantel - singular, Bademäntel - plural).
CREATE TABLE keywords (
    id INT (11) PRIMARY KEY AUTO_INCREMENT,
    keyword VARCHAR (255) NOT NULL
) ENGINE = MyISAM DEFAULT CHARACTER SET = utf8 COLLATE = utf8_unicode_ci;
INSERT INTO keywords (keyword) VALUES ('Bademantel'), ('Bademäntel');
SELECT * FROM keywords WHERE keyword LIKE ('%Bademäntel%');
The result should be
+----+------------+
| id | keyword    |
+----+------------+
|  2 | Bademäntel |
+----+------------+
yet with utf8_unicode_ci the output is
+----+------------+
| id | keyword    |
+----+------------+
|  1 | Bademantel |
|  2 | Bademäntel |
+----+------------+
which is clearly not the required result.
The actual problem is tied to my current project. It involves writing a keyword parser that is supposed to replace every occurrence of a keyword on the website with a link to the appropriate product page. To avoid wasting resources, only distinct keywords are fetched, but using either
SELECT keyword FROM keywords GROUP BY keyword ORDER BY LENGTH(keyword) DESC
or
SELECT DISTINCT keyword FROM keywords ORDER BY LENGTH(keyword) DESC
will result in failing to process (link) all the non-umlaut versions of the words simply because they are not fetched during the query (i.e. all the keywords containing Bademäntel will be fetched but Bademantel will be omitted).
Now I realize that I have a couple of options to resolve this problem.
1) Use utf8_swedish_ci for the keywords table or during the queries which would effectively save me from having to modify a lot of existing code.
SELECT DISTINCT keyword COLLATE utf8_swedish_ci AS keyword FROM keywords ORDER BY LENGTH(keyword) DESC;
Unfortunately, I am reluctant to abandon utf8_unicode_ci because a) it offers the really nice feature of handling "Eszett" (ss and ß are considered the same), and b) it somehow simply feels wrong to use a Swedish collation to handle German-related stuff.
2) Modify the existing code to make use of utf8_bin.
SELECT DISTINCT keyword COLLATE utf8_bin AS keyword FROM keywords ORDER BY LENGTH(keyword) DESC;
This works as intended, but it has the nasty drawback that all comparisons become case-sensitive. If I relied on utf8_bin to solve the problem, I would have a hard time doing case-insensitive queries such as LIKE('%Mäntel%'), which would most definitely omit records like Bademäntel.
I know that this question pops up every now and then on SO, but some of the answers are now pretty old and I just want to know if some other solution has emerged in the meantime. I mean, I really can't get over the thought that a simple collation is allowed to completely change the results of a query. Sorting order yes, but the results themselves?
Sorry for a bit longer post and thanks in advance for any kind of advice or comment.

For anyone else encountering this problem, it's worth noting that since MySQL 5.6 there is official support for the utf8_german2_ci collation, which solves all of the above problems. Better late than never, I guess.
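As a rough illustration of what that collation does: the MySQL documentation describes utf8_german2_ci as following German phone-book rules, where ä, ö and ü compare equal to ae, oe and ue, and ß compares equal to ss. The sketch below assumes a utf8 connection, a server that ships this collation, and the keywords table from the question.

```sql
SET NAMES utf8;

-- ä is treated like ae, so the umlaut form no longer equals the plain-a form.
SELECT 'Bademäntel' = 'Bademantel'  COLLATE utf8_german2_ci AS umlaut_vs_plain,  -- 0
       'Bademäntel' = 'Bademaentel' COLLATE utf8_german2_ci AS umlaut_vs_ae,     -- 1
       'Straße'     = 'Strasse'     COLLATE utf8_german2_ci AS eszett_vs_ss;     -- 1

-- The DISTINCT query from the question, with the collation applied per query.
SELECT DISTINCT keyword COLLATE utf8_german2_ci AS keyword
FROM keywords
ORDER BY LENGTH(keyword) DESC;
```

With this collation, Bademantel and Bademäntel count as different values, so both are returned by the DISTINCT query, while comparisons stay case insensitive.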

You could force a binary comparison with the BINARY operator, as in WHERE BINARY keyword = 'Bademantel'. The result would be the expected one.
Check out this sqlfiddle, which shows this:
SELECT * FROM stackoverflow WHERE BINARY keyword = 'Bademantel';
| id |    keyword |
|----|------------|
|  1 | Bademantel |
SELECT * FROM stackoverflow WHERE keyword = 'Bademantel';
| id |    keyword |
|----|------------|
|  1 | Bademantel |
|  2 | Bademäntel |
More about this behavior here: What effects does using a binary collation have? and here: What is the best MySQL collation for German language
So for applications that deal with German umlauts, French accents, or special characters in Czech or Polish, you have to decide which behavior is best for your application.
Most cases will be fine with utf8_general_ci, but sometimes you have to use utf8_bin, as in your Bademantel case.
Accent-insensitive comparison isn't bad in itself; it can be helpful. For example, with utf8_unicode_ci, if you have stored a string like Straße, a search for Strasse will also return Straße (utf8_general_ci instead treats ß as a single s).
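If the goal is a search that ignores case but still distinguishes umlauts, one possible sketch is to lowercase both sides and compare them with a binary collation. This assumes the keywords table from the question and a utf8 connection; it is an illustration, not something proposed in the answers above.

```sql
SET NAMES utf8;

-- Lowercasing removes the case difference; utf8_bin then compares code points,
-- so ä and a stay distinct.
SELECT *
FROM keywords
WHERE LOWER(keyword) LIKE LOWER('%Mäntel%') COLLATE utf8_bin;
```

With the two rows from the question this should match Bademäntel but not Bademantel. Because the column is wrapped in LOWER() and the pattern starts with a wildcard, the query cannot use an index on keyword.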

MySQL - Characters matching

How would I get MySQL to be more strict with character matching?
A quick example of what I mean: say I have a table with a single column `name`. In this column I have two names, 'Jorge' and 'Jorgé'. The only difference between these names is the ´ over the e. If I run the query SELECT * FROM table WHERE name = 'Jorge', it returns
+--------+
| name   |
+--------+
| Jorge  |
| Jorgé  |
+--------+
and if I run the query SELECT * FROM table WHERE name = 'Jorgé', it returns the same result table. How would I set MySQL to be stricter, so that it would not return both names?
Thanks ahead.
Quick Edit: I'm using the UTF-8 character encoding

If you want to make sure that no similar characters (like e and é) are considered the same, you should use the utf8_bin collation on that column. I assume that you're using utf8_general_ci now, which will consider some similar characters to be the same. utf8_bin only matches on the exact same characters.
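Changing the column could look like the sketch below. The table name name_list, the VARCHAR(255) type and the NOT NULL constraint are assumptions made for illustration, since the question does not give the real definition; MODIFY needs the full column definition, so the actual type and constraints should be used instead.

```sql
-- Hypothetical table and column definition; adjust to the real schema.
ALTER TABLE name_list
  MODIFY name VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;

-- With a utf8_bin column this matches only the exact string 'Jorge'.
SELECT * FROM name_list WHERE name = 'Jorge';
```

Note that a binary collation also makes comparisons case-sensitive, so 'jorge' would no longer match 'Jorge' either.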

@G-Nugget is correct, but since you are looking at Spanish stuff you might also be interested in utf8_spanish_ci or utf8_spanish2_ci. They correspond to modern and traditional Spanish. "ñ" is considered a separate letter, and in traditional Spanish "ch" and "ll" are also treated as separate letters.
More here: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

MySQL DB selects records with and without umlauts. e.g: '.. where something = FÖÖ'

My table collation is "utf8_general_ci". If I run a query like:
SELECT * FROM mytable WHERE myfield = "FÖÖ"
I get results where:
... myfield = "FÖÖ"
... myfield = "FOO"
Is this the default for "utf8_general_ci"?
What collation should I use to only get records where myfield = "FÖÖ"?

SELECT * FROM table WHERE some_field LIKE ('%ö%' COLLATE utf8_bin)

A list of the collations offered by MySQL for Unicode character sets can be found here:
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
If you want to go all-out and require strings to be absolutely identical in order to test as equal, you can use utf8_bin (the binary collation). Otherwise, you may need to do some experimentation with the different collations on offer.
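One way to do that experimentation is to list what the server offers, check which collation a column actually uses, and then try literal comparisons. This is a sketch, assuming a utf8 connection and the table name mytable from the question; newer servers may report the character set as utf8mb3.

```sql
SET NAMES utf8;

-- Collations available for the utf8 character set on this server.
SHOW COLLATION WHERE Charset = 'utf8';

-- Collation currently assigned to each column of the table.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'mytable';

-- Try the comparison from the question under different collations.
SELECT 'FÖÖ' = 'FOO' COLLATE utf8_general_ci AS general_ci,  -- 1
       'FÖÖ' = 'FOO' COLLATE utf8_bin        AS bin,         -- 0
       'FÖÖ' = 'föö' COLLATE utf8_bin        AS bin_case;    -- 0
```

The last column is a reminder that utf8_bin is stricter than the question asks for: it separates Ö from O, but it also separates upper case from lower case.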

For Scandinavian letters you can use utf8_swedish_ci, for example.
Here is the character grouping for utf8_swedish_ci. It shows which characters are interpreted as the same.
http://collation-charts.org/mysql60/mysql604.utf8_swedish_ci.html
Here's the directory listing for other collations. I'm not sure which one corresponds to utf8_general_ci, though. http://collation-charts.org/mysql60/
