MySQL doesn't distinguish between characters "c" and "ç" in UNIQUE index [duplicate] - mysql

These two queries give me exactly the same result:
select * from topics where name='Harligt';
select * from topics where name='Härligt';
How is this possible? It seems that MySQL translates åäö to aao when it searches. Is there some way to turn this off?
I use UTF-8 encoding everywhere as far as I know. The same problem occurs both from the terminal and from PHP.

Yes, this is standard behaviour in the non-language-specific unicode collations.
From the MySQL manual, section 9.1.13.1, Unicode Character Sets:
To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect this has in comparisons or when doing searches, see Section 9.1.7.7, “Examples of the Effect of Collation”):
Ä = A
Ö = O
Ü = U
See also Examples of the effect of collation
You need to do one of the following:
use a collation that doesn't have this "feature" (namely utf8_bin, but that has other consequences, such as case-sensitive comparison)
use a different collation for the query only. This should work:
select * from topics where name='Harligt' COLLATE utf8_bin;
It becomes more difficult if you want a case-insensitive LIKE without the Ä = A umlaut conversion. I know of no MySQL collation that is case insensitive and does not do this kind of implicit umlaut conversion. If anybody knows one, I'd be interested to hear about it.
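The Ä = A equality can be illustrated outside the database. The following Python sketch only approximates an accent- and case-insensitive comparison by stripping combining marks; it is not MySQL's actual collation algorithm, and the function names are made up for this example.

```python
import unicodedata

def accent_insensitive_key(text):
    """Approximate an accent- and case-insensitive comparison key."""
    # NFD splits 'ä' into 'a' plus a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping the combining marks leaves only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def binary_equal(a, b):
    """Compare the raw UTF-8 bytes, roughly what a _bin collation does."""
    return a.encode("utf-8") == b.encode("utf-8")

# Equal under the accent-insensitive key, different byte for byte.
print(accent_insensitive_key("Härligt") == accent_insensitive_key("Harligt"))
print(binary_equal("Härligt", "Harligt"))
```

The first comparison corresponds to what happens under utf8_general_ci, the second to what COLLATE utf8_bin asks for.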
Related:
Looking for case insensitive MySQL collation where “a” != “ä”
MYSQL case sensitive search for utf8_bin field

Since you are in Sweden I'd recommend using the Swedish collation. Here's an example showing the difference it makes:
CREATE TABLE topics (name varchar(100) not null) CHARACTER SET utf8;
INSERT topics (name) VALUES ('Härligt');
select * from topics where name='Harligt';
'Härligt'
select * from topics where name='Härligt';
'Härligt'
ALTER TABLE topics MODIFY name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_swedish_ci;
select * from topics where name='Harligt';
<no results>
select * from topics where name='Härligt';
'Härligt'
Note that in this example I only changed the one column to Swedish collation, but you should probably do it for your entire database, all tables, all varchar columns.

While collations are one way of solving this, the much more straightforward way seems to me to be the BINARY keyword:
SELECT 'a' = 'ä', BINARY 'a' = 'ä'
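returns 1 | 0 under an accent-insensitive collation such as utf8_general_ci: the plain comparison treats the two as equal, while the BINARY comparison does not.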
In your case:
SELECT * FROM topics WHERE BINARY name='Härligt';
See also https://www.w3schools.com/sql/func_mysql_binary.asp

You want to check your collation settings; the collation is the property that determines which characters are treated as identical.
These two pages should help:
http://dev.mysql.com/doc/refman/5.1/en/charset-general.html
http://dev.mysql.com/doc/refman/5.1/en/charset-mysql.html

Here you can see some collation charts: http://collation-charts.org/mysql60/. I'm not sure which chart corresponds to utf8_general_ci, though.
Here is the chart for utf8_swedish_ci. It shows which characters it interprets as the same. http://collation-charts.org/mysql60/mysql604.utf8_swedish_ci.html

Related

mysql collation: case-preserving, case-insensitive but accent-sensitive [duplicate]

How can I perform accent-sensitive but case-insensitive utf8 search in mysql? Utf8_bin is case sensitive, and utf8_general_ci is accent insensitive.
If you want to distinguish "café" from "cafe", you can use:
Select word from table_words WHERE Hex(word) LIKE Hex("café");
This returns only 'café'.
Whereas if you use:
Select word from table_words WHERE Hex(word) LIKE Hex("cafe");
it returns only 'cafe'.
I'm using the latin1_german2_ci collation.
There doesn't seem to be one because case sensitivity is tough to do in Unicode.
There is a utf8_general_cs collation but it seems to be experimental, and according to this bug report, doesn't do what it's expected to when using LIKE.
If your data consists of western umlauts only (i.e. umlauts that are included in ISO-8859-1), you might be able to collate your search operation to latin1_german2_ci or create a separate search column with it (that specific collation is accent sensitive according to this page; latin1_general_ci might be as well, but I don't know and can't test right now).
You can use "hex" to make the search accent-sensitive. Then simply add lcase to make it case insensitive again. So that would give:
SELECT name FROM people WHERE HEX(LCASE(name)) = HEX(LCASE("René"))
Note that wrapping the column in functions like this prevents MySQL from using any index on it. If you want to avoid a full table scan and you have an index on "name", also search for the same thing without the hex and lcase:
SELECT name FROM people WHERE name = "René" and HEX(LCASE(name)) = HEX(LCASE("René"))
This way the index on "name" will be used to find for example only the rows "René" and "Rene" and then the comparison with the "hex" needs to be done only on those two rows instead of on the complete table.
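The same idea can be sketched in Python to see which rows survive the HEX(LCASE(...)) comparison. This is only an illustration: the function name is invented, Python's lower() is not guaranteed to match MySQL's LCASE for every character, and both strings are assumed to use the same (precomposed) Unicode form.

```python
def accent_sensitive_ci_equal(a, b):
    """Case-insensitive but accent-sensitive comparison, mirroring HEX(LCASE(x))."""
    # Lower-casing removes case differences but keeps the accents.
    # Comparing the hex of the UTF-8 bytes then compares the strings byte for byte.
    return a.lower().encode("utf-8").hex() == b.lower().encode("utf-8").hex()

names = ["René", "RENÉ", "Rene"]
matches = [n for n in names if accent_sensitive_ci_equal(n, "rené")]
print(matches)  # ['René', 'RENÉ']
```

"Rene" is rejected because its bytes differ from those of "rené" even after lower-casing.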

Get same MySQL search results using foreign and English characters

We have a MySQL database containing a table of authors. Some of the authors' names have non-English characters in them (for example, LÜTTGE).
Our client wants users to be able to find such records even if they don't enter the non-English character. So in the above example "LUTTGE" should also find that result. At the moment it only works if the user searches for the name using the non-English character, so "LÜTTGE" works but "LUTTGE" returns nothing.
The frontend to this is a web application written in CakePHP 2
Does anyone have any ideas on how to do this as I'm at a loss? Ideally we want to be able to do this within CakePHP/MySQL, and not use third party search systems.
The above is just one example in a table of thousands of records. So it's not just a case of substituting "U" with "Ü" - there are many other variants.
This can be handled by using the MySQL collation system.
For example, the following query returns a true (1) value:
SELECT 'LÜTTGE' COLLATE utf8_general_ci = 'LUTTGE'
Accordingly, if you set the column's character set to utf8 and its collation to utf8_general_ci you will get the result you mention with umlaut characters.
The default collation in MySQL reflects its Swedish origin: before MySQL 8.0 the server default is latin1_swedish_ci, and utf8_swedish_ci is its utf8 counterpart. In Swedish, Ü and U are not the same letter. You have probably used a Swedish collation for your columns.
The utf8_general_ci collation handles matching 'Eßen' to 'Esen' but not to 'Essen'. It handles matching 'LÜTTGE' to 'LUTTGE' but not to 'Luettge', unfortunately.
On the other hand, the utf8_german2_ci collation matches 'Eßen' to 'Essen' and 'LÜTTGE' to 'LUETTGE'. If your users are accustomed to using ASCII transliterations of German characters you may wish to explore your choices here. One of them is to use a query with OR
SELECT whatever
FROM table
WHERE ( namen COLLATE utf8_general_ci = 'LUTTGE'
OR namen COLLATE utf8_german2_ci = 'LUTTGE' )
It can get more complex if you need to handle Spanish, because Ñ is considered a different letter from N. You may need to do some explaining for your users.
Marcus suggested using the utf8_unicode_ci collation. That will handle things partially too. Here are the cases:
comparison               type           utf8_general_ci  utf8_german2_ci  utf8_unicode_ci  utf8_spanish_ci
'Eßen' to 'Esen'         substitute     match            no match         no match         no match
'Eßen' to 'Essen'        transliterate  no match         match            match            match
'LÜTTGE' to 'LUTTGE'     substitute     match            no match         match            match
'LÜTTGE' to 'LUETTGE'    transliterate  no match         match            no match         no match
'Niño' to 'Nino'         transliterate  match            match            match            no match
So you still need some extra work to handle transliterations.
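One way to do that extra work is to compute both the "substitute" and the "transliterate" form of each name in the application and compare on those. The Python sketch below is only an assumption about how this could look; the mapping table and function names are hypothetical and cover German only.

```python
import unicodedata

# Hypothetical German transliteration table; extend it for other languages.
GERMAN_TRANSLIT = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def substitute_key(text):
    """Drop accents and fold case: 'LÜTTGE' -> 'luttge', 'Eßen' -> 'esen'."""
    lowered = text.lower().replace("ß", "s")
    decomposed = unicodedata.normalize("NFD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def transliterate_key(text):
    """Spell out umlauts and ß: 'LÜTTGE' -> 'luettge', 'Eßen' -> 'essen'."""
    # NFC makes sure 'ü' is a single character before the table lookup.
    lowered = unicodedata.normalize("NFC", text).lower()
    for src, dst in GERMAN_TRANSLIT.items():
        lowered = lowered.replace(src, dst)
    return substitute_key(lowered)

def search_keys(text):
    """Both forms of a name; two names match if their key sets overlap."""
    return {substitute_key(text), transliterate_key(text)}

stored = search_keys("LÜTTGE")
print(bool(stored & search_keys("LUTTGE")), bool(stored & search_keys("Luettge")))
```

Storing such keys in a separate indexed column would keep the search inside MySQL, at the cost of maintaining that column whenever a name changes.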

Mysql order by on column with unicode characters

I am running a select query on mysql table and trying to order it by the "name" column in the table.
The name column contains both English character names and names with Latin character like â.
I am running into the below problem.
The query I run returns the results ordered in the below manner i.e.
Eg: if Name contains "archer", "aaakash", "â hayden", "bourne", "jason"
The results returned by the query is ordered as below
"aaakash", "archer", "â hayden", "bourne", "jason"
However I want to order it based on unicode code points (like below)
"aaakash", "archer", "bourne", "jason", "â hayden"
(See the difference in the position of â hayden in the orders)
What can I do to order the results based on the character's position in unicode character set?
To sort by Unicode code point, you probably need to use the utf8_bin collation.
More precisely, the _bin suffix means that strings are compared by the binary value of each character.
To override the default collation while ordering, use ORDER BY ... COLLATE. To paraphrase the documentation:
SELECT k
FROM t1
ORDER BY k COLLATE utf8_bin;
If your text column does not use utf8 encoding, you will have to CONVERT it:
SELECT k
FROM t1
ORDER BY CONVERT(k USING utf8) COLLATE utf8_bin;
Note that I used utf8 as an example here because it is the most common Unicode encoding. Your MySQL server probably supports other Unicode character sets as well, such as ucs2 (UCS-2, the fixed-width predecessor of UTF-16).
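The ordering requested in the question can be reproduced in Python with the names from the question. This is only an illustration of code point ordering, not of MySQL itself; it also shows why a binary comparison of UTF-8 data is enough, since UTF-8 byte order preserves code point order.

```python
names = ["archer", "aaakash", "â hayden", "bourne", "jason"]

# Python compares str values code point by code point.
by_code_point = sorted(names)

# UTF-8 preserves code point order, so sorting the encoded bytes gives the same result.
by_utf8_bytes = sorted(names, key=lambda s: s.encode("utf-8"))

print(by_code_point)  # ['aaakash', 'archer', 'bourne', 'jason', 'â hayden']
```

"â" is U+00E2, which lies above every ASCII letter, so "â hayden" ends up last.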

MYSQL 5.1.61 sorting for Central European languages in utf8

I have a problem with sorting a MySQL result:
SELECT * FROM table WHERE something ORDER BY column ASC
column is set to utf8_unicode_ci.
As a result, I first get the rows whose column starts with Bosnian letters, and then the others after that:
šablabl
šeblabla
čeblabla
aaaa
bbaa
bbb
ccc
MYSQL version is 5.1.61
Bgi is right. You need to use an appropriate collation. Unfortunately, MySQL doesn't have a Central European Unicode collation yet. MariaDB, the MySQL fork being maintained by MySQL's creators, does.
So you can convert your text from utf8 to latin2 and then order with a Central European collating sequence. For example:
SELECT *
FROM tab
ORDER BY CONVERT(text USING latin2) COLLATE latin2_croatian_ci
See this fiddle: http://sqlfiddle.com/#!2/c8dd4/1/0
This is because of the way Unicode is laid out. The basic Latin characters kept the same numeric values they had in ASCII, and characters from other alphabets were added after them. That means that if your alphabet has characters beyond the 26 ASCII letters, sorting by code point will not put them in the correct alphabetical order.
I think you should try changing the collation on your column (you may have to change the charset as well, but maybe not).
Use a Central European collation.
Good luck!
If that's really what you see you have found a bug: utf8_unicode_ci is supposed to consider š equivalent to s and č equivalent to c!
In any case it's true that MySQL does not have great support of utf8 collations for Central European languages: you get only Czech, Slovak, and Slovenian. If none of those work for you, I guess you'll have to create your own utf8 collation, or use a non-Unicode character set and use the collations available there.
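If no suitable collation is available on the server, the expected alphabetical order can also be produced in the application after fetching the rows. The Python sketch below is a simplified illustration under that assumption: the names ALPHABET, RANK and bosnian_key are made up, and the digraphs dž, lj and nj are deliberately ignored.

```python
# Bosnian Latin alphabet without the digraphs dž, lj and nj (a simplification).
ALPHABET = "abcčćdđefghijklmnoprsštuvzž"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

def bosnian_key(word):
    """Map each letter to its alphabet position; unknown characters sort last."""
    return [RANK.get(ch, len(ALPHABET) + ord(ch)) for ch in word.lower()]

rows = ["šablabl", "šeblabla", "čeblabla", "aaaa", "bbaa", "bbb", "ccc"]
ordered = sorted(rows, key=bosnian_key)
print(ordered)
```

With this key, č sorts directly after c and š directly after s, instead of at one end of the list.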
Older question and plenty of answers.
Maybe the way I deal with problems will help someone.
I use PDO. My DB is utf-8.
First - my db singleton code (relevant part of it). I set 'SET NAMES' to 'utf8' for all connections.
// Always talk to the server in utf8.
$attrib_array = array(PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8');
if (DB_HANDLER) {
    // Throw exceptions on errors when the handler flag is set.
    $attrib_array[PDO::ATTR_ERRMODE] = PDO::ERRMODE_EXCEPTION;
}
self::$instance = new PDO(DB_TYPE.':host='.DB_HOST.';dbname='.DB_NAME, DB_USER, DB_PASS, $attrib_array);
Second - my sorting looks something like this - collation depends on language (sample shows polish):
ORDER BY some_column COLLATE utf8_polish_ci DESC
To make things more streamlined I use a constant, which I define in lang translation file, so when file is pulled, proper collation constant is set. Of course I have 'utf8_general_ci' as default. Example:
define('MY_LOCALIZED_COLLATE', 'COLLATE utf8_polish_ci');
Now, my (relevant part of) query looks like this:
" ... ORDER BY some_column " . MY_LOCALIZED_COLLATE . " DESC" ;
Above works in most cases.
If the collation you need is missing, you can try adding one yourself.
For more detailed information about creating one, see http://dev.mysql.com/doc/refman/5.0/en/adding-collation.html
EDIT:
One more thing I noticed: if you have a list to sort in, e.g., Polish, you force the proper collation for sorting (as described above), and you sort on, e.g., an INT column, then you had better make sure the character set is set (e.g. to utf8), or you will get SQL errors such as:
"Syntax error or access violation: 1253 COLLATION 'utf8_polish_ci' is not valid for CHARACTER SET 'latin1'"
Strange, but true.
