I have MySQL 5.xx running on a Linux system. My application correctly writes characters such as ä, ö and å to the database and reads them back correctly. But when I use WHERE to search for the character 'ä', the query also returns strings that contain 'a'. Why does MySQL think that a is equal to ä?
Example query:
SELECT column FROM table WHERE field LIKE '%ä%';
MySQL uses collations to compare character values.
Collations are the sets of rules the database uses to decide which characters count as different and which count as equal when comparing.
Case sensitive collations distinguish between 'QUERY' and 'query', case insensitive do not.
Accent sensitive collations distinguish between 'résumé' and 'resume', accent insensitive do not.
In your column's default collation (most probably UTF8_GENERAL_CI), umlauted characters are indistinguishable from non-umlauted:
SELECT 'a' LIKE '%ä%'
---
1
To distinguish between them, use a binary collation (which treats characters with different Unicode code points as different characters):
SELECT 'a' LIKE '%ä%' COLLATE UTF8_BIN
---
0
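The difference between the two results above can be modeled outside the database. The following is a rough Python sketch, not MySQL's implementation: fold, equal_ci_ai and equal_bin are hypothetical names, and stripping combining marks only approximates what an accent-insensitive collation such as utf8_general_ci does.

```python
import unicodedata

def fold(text: str) -> str:
    """Drop accents and case: a rough stand-in for an accent- and
    case-insensitive collation (an approximation, not MySQL's weight tables)."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()

def equal_ci_ai(a: str, b: str) -> bool:
    """Compare the way a case- and accent-insensitive collation roughly would."""
    return fold(a) == fold(b)

def equal_bin(a: str, b: str) -> bool:
    """Compare the encoded bytes, as a binary collation would."""
    return a.encode("utf-8") == b.encode("utf-8")

print(equal_ci_ai("a", "ä"))   # insensitive comparison: treated as equal
print(equal_bin("a", "ä"))     # binary comparison: different bytes
```

Under the folding comparison 'a' and 'ä' are equal; under the byte comparison they are not, which mirrors the 1 and 0 returned by the two queries.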
Note: for many applications, collating a and ä as the same letter is considered a feature. My suggestion: be sure to double-check with your clients to determine which behavior is desired.
I might even follow up with a memo that says, "As we discussed on x date, the system will sort and find characters as follows..."
Related
This question is an extension of the following question - How to make mysql consider the control characters when doing string comparison?
Here is my query -
SELECT 'abc' < 'abcSOH' COLLATE utf8mb4_0900_bin;
Here SOH is Start of Heading, an ASCII control character with code 1. I expect this query to return 1, since the second string is 4 characters long. I even tried a space (ASCII code 32) and got the same result.
If you check this fiddle, you can see only the 'utf8mb4_0900_bin' collation gives the expected result. All other collations that I have tested give the opposite result.
https://dbfiddle.uk/mDLVWOZG
I have gone through the documentation and could not find the reason behind this. Can anyone please explain why this happens?
I am interested to know this because I would like to use a 1-byte character set (and corresponding collation) instead of a 4-byte character set because I have some legacy tables (converting to MySQL) that have a lot of columns and if I use a 4-byte character set, it gives an error that the row is too big.
Each column can have its own CHARACTER SET and COLLATION, but every row shares that column's setting.
CREATE TABLE provides only "defaults" for those settings -- these defaults are used if you don't override them when declaring the individual columns.
So, legacy columns may as well be declared with whatever antique charset was used. (Sorry, EBCDIC is not available.)
All the "printable" characters of ASCII are available in UTF-8 (MySQL's utf8/utf8mb3/utf8mb4). In fact, the binary encoding is identical.
The "control characters" -- well, stick with ascii or latin1 (perhaps with latin1_bin).
Any _bin collation says to simply look at the bits.
I do not know if control characters are turned into space (hex 20) when INSERTing into a UTF-8 column.
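One possible explanation for the fiddle's results, which the answer above does not state and which is therefore offered as an assumption: in MySQL 8.0 the utf8mb4_0900_* collations (including utf8mb4_0900_bin) have the NO PAD attribute, while most other collations are PAD SPACE and compare strings as if the shorter one were padded with trailing spaces. The Python sketch below models the two behaviors on raw bytes; compare_no_pad and compare_pad_space are hypothetical helper names, not MySQL functions.

```python
def compare_no_pad(a: bytes, b: bytes) -> int:
    """Plain byte-wise comparison: returns -1, 0 or 1.
    A string sorts before any longer string that it is a prefix of."""
    return (a > b) - (a < b)

def compare_pad_space(a: bytes, b: bytes) -> int:
    """Byte-wise comparison after padding the shorter operand with spaces (0x20)."""
    width = max(len(a), len(b))
    return compare_no_pad(a.ljust(width, b" "), b.ljust(width, b" "))

short, with_soh, with_space = b"abc", b"abc\x01", b"abc "

no_pad_soh = compare_no_pad(short, with_soh)          # 'abc' is a prefix, so it sorts first
pad_soh = compare_pad_space(short, with_soh)          # 'abc ' vs 'abc\x01': 0x20 > 0x01
pad_space = compare_pad_space(short, with_space)      # the trailing space is insignificant

print(no_pad_soh, pad_soh, pad_space)
```

In this model only the NO PAD comparison ranks 'abc' before 'abc' followed by SOH, which is consistent with utf8mb4_0900_bin being the only collation that returned 1 in the fiddle.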
When I search for LIKE %カナ it still brings up results for かな.
From the MySQL documentation (I'm on 8.0.26) under Language-Specific Collations:
For Japanese, the utf8mb4 character set includes utf8mb4_ja_0900_as_cs and
utf8mb4_ja_0900_as_cs_ks collations. Both collations are accent-sensitive and
case-sensitive. utf8mb4_ja_0900_as_cs_ks is also kana-sensitive and distinguishes
Katakana characters from Hiragana characters, whereas utf8mb4_ja_0900_as_cs treats
Katakana and Hiragana characters as equal for sorting.
Checking my column it shows the kana-sensitive collation:
SELECT COLUMN_NAME, COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS;
COLUMN_NAME   COLLATION_NAME
---
kana          utf8mb4_ja_0900_as_cs_ks
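As a rough illustration of what the quoted documentation means by kana sensitivity: Katakana and Hiragana letters sit at a fixed offset of 0x60 from each other in Unicode, so a kana-insensitive comparison can be modeled by folding one script onto the other. This is a Python sketch under that assumption, not MySQL's implementation; to_hiragana and the two comparison helpers are hypothetical names.

```python
def to_hiragana(text: str) -> str:
    """Map Katakana letters (U+30A1..U+30F6) to their Hiragana counterparts."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )

def equal_kana_insensitive(a: str, b: str) -> bool:
    """Treat Katakana and Hiragana as equal, like a collation without _ks."""
    return to_hiragana(a) == to_hiragana(b)

def equal_kana_sensitive(a: str, b: str) -> bool:
    """Keep the two scripts distinct, like a _ks collation is documented to do."""
    return a == b

print(equal_kana_insensitive("カナ", "かな"))
print(equal_kana_sensitive("カナ", "かな"))
```

With the column collation shown above, the second behavior is the one the documentation describes, which is why a LIKE match on かな is surprising.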
There are three(?) different pieces of code that MySQL uses for character comparisons: =, LIKE, REGEXP. They are, confusingly, not identical. And in some cases, they are deliberately different.
s LIKE 'abc' is turned into s = 'abc', adding to the confusion.
A collation ending in _as_cs makes 'e' <> 'E', while still ordering the two according to the collation's rules. This is unlike BINARY or a _bin collation, which blindly compares the bits.
The collation you are using is relatively new; it is not older than 8.0. If you find errors in the collation, please file a bug report at bugs.mysql.com and provide a simple testcase demonstrating the issue.
For a case-sensitive match, try LIKE BINARY.
Example:
SELECT name FROM users WHERE name LIKE BINARY 'John%';
I have a problem with sorting a MySQL result.
SELECT * FROM table WHERE something ORDER BY column ASC
column is set to utf8_unicode_ci.
As a result, I first get the rows where column starts with a Bosnian letter, and the other rows only after that:
šablabl
šeblabla
čeblabla
aaaa
bbaa
bbb
ccc
The MySQL version is 5.1.61.
Bgi is right. You need to use an appropriate collation. Unfortunately, MySQL doesn't have a Central European Unicode collation yet. MariaDB, the MySQL fork being maintained by MySQL's creators, does.
So you can convert your text from utf8 to latin2 and then order with a Central European collating sequence. For example:
SELECT *
FROM tab
ORDER BY CONVERT(text USING latin2) COLLATE latin2_croatian_ci
See this fiddle: http://sqlfiddle.com/#!2/c8dd4/1/0
It is because of the way Unicode is laid out. All the "normal" Latin characters kept the same numeric values they had in ASCII, and characters from other alphabets were added after them. That means that if your alphabet has characters beyond the 26 regular ASCII letters, they won't appear in the correct order when sorted by Unicode value.
I think you should try to change the collation on your column (maybe you'll have to change the charset also, but maybe not).
Use a Central European collation.
Good luck !!
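To make the point about code point order concrete, here is a small Python sketch that sorts the same words by raw code point and by an alphabet-aware key. The ALPHABET string is a simplified Bosnian/Croatian letter order that ignores the digraphs dž, lj and nj, and the words "dom" and "zebra" are hypothetical additions to the question's data; none of this is how MySQL implements its collations.

```python
# Simplified single-letter alphabet order (digraphs are deliberately ignored).
ALPHABET = "abcčćdđefghijklmnoprsštuvzž"
RANK = {letter: position for position, letter in enumerate(ALPHABET)}

def tailored_key(word: str) -> list:
    """Rank each letter by its position in ALPHABET; unknown characters sort last."""
    return [RANK.get(ch, len(ALPHABET) + ord(ch)) for ch in word.lower()]

words = ["šablabl", "zebra", "čeblabla", "dom", "ccc", "aaaa"]

by_code_point = sorted(words)                   # č and š land after z
by_alphabet = sorted(words, key=tailored_key)   # č follows c, š follows s

print(by_code_point)
print(by_alphabet)
```

Sorting by code point pushes č and š behind every ASCII letter, while the tailored key places them after c and s, which is what a language-specific collation is meant to provide.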
If that's really what you see you have found a bug: utf8_unicode_ci is supposed to consider š equivalent to s and č equivalent to c!
In any case it's true that MySQL does not have great support of utf8 collations for Central European languages: you get only Czech, Slovak, and Slovenian. If none of those work for you, I guess you'll have to create your own utf8 collation, or use a non-Unicode character set and use the collations available there.
Older question and plenty of answers.
Maybe the way I deal with problems will help someone.
I use PDO. My DB is utf-8.
First - my db singleton code (relevant part of it). I set 'SET NAMES' to 'utf8' for all connections.
$attrib_array = array(PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8');
if (DB_HANDLER) {
    $attrib_array[PDO::ATTR_ERRMODE] = PDO::ERRMODE_EXCEPTION;
}
self::$instance = new PDO(DB_TYPE.':host='.DB_HOST.';dbname='.DB_NAME, DB_USER, DB_PASS, $attrib_array);
Second - my sorting looks something like this - the collation depends on the language (the sample shows Polish):
ORDER BY some_column COLLATE utf8_polish_ci DESC
To make things more streamlined I use a constant, which I define in lang translation file, so when file is pulled, proper collation constant is set. Of course I have 'utf8_general_ci' as default. Example:
define('MY_LOCALIZED_COLLATE', 'COLLATE utf8_polish_ci');
Now, my (relevant part of) query looks like this:
" ... ORDER BY some_column " . MY_LOCALIZED_COLLATE . " DESC" ;
The above works in most cases.
If the collation you need is missing, you may try to add one yourself.
More detailed info about creating such set - see here: http://dev.mysql.com/doc/refman/5.0/en/adding-collation.html
EDIT:
Just one more thing I noticed:
if you have list to sort in e.g. Polish
and you have to force proper collation for sorting (as described above)
and you use e.g. INT column as sorting vector
... then you better have collation set (e.g. to UTF8), or you will get SQL errors, e.g.:
"Syntax error or access violation: 1253 COLLATION 'utf8_polish_ci' is not valid for CHARACTER SET 'latin1'"
... strange, but true
I have the following query in MySQL:
SELECT id FROM unicode WHERE `character` = 'a'
The table unicode contains each Unicode character along with an ID (its integer encoding value). Since the collation of the table is set to utf8_unicode_ci, I would have expected the above query to only return 97 (the letter 'a'). Instead, it returns 119 rows containing the IDs of many 'a'-like letters:
a A Ã ...
It seems to be ignoring both case and the multi-byte nature of the characters.
Any ideas?
As documented under Unicode Character Sets:
MySQL implements the xxx_unicode_ci collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
The full collation chart makes clear that, in this collation, most variations of a base letter are equivalent irrespective of their lettercase or accent/decoration.
If you want to only match exact letters, you should use a binary collation such as utf8_bin.
The collation of the table is part of the issue; MySQL with a _ci collation is treating all of those 'a's as variants of the same character.
Switching to a _cs collation will force the engine to distinguish 'a' from 'A', and 'á' from 'Á', but it may still treat 'a' and 'á' as the same character.
If you need exact comparison semantics, completely disregarding the equivalency of similar characters, you can use the BINARY comparison operators
SELECT id FROM unicode WHERE BINARY `character` = 'a'
The ci in the collation means case-insensitive. Switch to a case-sensitive collation (cs) to get the results you're looking for.
How can I perform an accent-sensitive but case-insensitive utf8 search in MySQL? utf8_bin is case sensitive, and utf8_general_ci is accent insensitive.
If you want to distinguish "café" from "cafe", you can use:
Select word from table_words WHERE Hex(word) LIKE Hex("café");
This will return 'café'.
Whereas if you use:
Select word from table_words WHERE Hex(word) LIKE Hex("cafe");
it will return 'cafe'.
I'm using the latin1_german2_ci Collation.
There doesn't seem to be one because case sensitivity is tough to do in Unicode.
There is a utf8_general_cs collation but it seems to be experimental, and according to this bug report, doesn't do what it's expected to when using LIKE.
If your data consists of western umlauts only (ie. umlauts that are included in ISO-8859-1), you might be able to collate your search operation to latin1_german2_ci or create a separate search column with it (that specific collation is accent sensitive according to this page; latin1_general_ci might be as well, I don't know and can't test right now).
You can use "hex" to make the search accent-sensitive. Then simply add lcase to make it case insensitive again. So that would give:
SELECT name FROM people WHERE HEX(LCASE(name)) = HEX(LCASE("René"))
You do throw all your indexes out of the window like that. If you want to avoid having to do a full table scan and you have an index on "name", also search for the same thing without the hex and lcase:
SELECT name FROM people WHERE name = "René" and HEX(LCASE(name)) = HEX(LCASE("René"))
This way the index on "name" will be used to find for example only the rows "René" and "Rene" and then the comparison with the "hex" needs to be done only on those two rows instead of on the complete table.
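The HEX(LCASE(...)) trick can be modeled in a few lines of Python to show why it is accent sensitive but case insensitive. This is only a sketch: hex_lcase and matches are hypothetical names, UTF-8 storage is assumed, and Python's lower() stands in for MySQL's LCASE.

```python
def hex_lcase(text: str) -> str:
    """Mimic HEX(LCASE(text)) for UTF-8 data: lowercase, encode, render as uppercase hex."""
    return text.lower().encode("utf-8").hex().upper()

def matches(name: str, wanted: str) -> bool:
    """Case differences vanish after lowercasing; accent differences survive in the bytes."""
    return hex_lcase(name) == hex_lcase(wanted)

names = ["René", "RENÉ", "Rene", "rené"]
hits = [name for name in names if matches(name, "René")]

print(hex_lcase("René"))
print(hits)
```

Lowercasing removes the case difference before the bytes are compared, while é and e still encode to different bytes, so "Rene" is the only candidate that drops out.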