I have the following query in MySQL:
SELECT id FROM unicode WHERE `character` = 'a'
The table unicode contains each unicode character along with an ID (it's integer encoding value). Since the collation of the table is set to utf8_unicode_ci, I would have expected the above query to only return 97 (the letter 'a'). Instead, it returns 119 rows containing the IDs of many 'a'-like letters:
a A Ã ...
It seems to be ignoring both case and the multi-byte nature of the characters.
Any ideas?
As documented under Unicode Character Sets:
MySQL implements the xxx_unicode_ci collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
The full collation chart makes clear that, in this collation, most variations of a base letter are equivalent irrespective of their lettercase or accent/decoration.
If you want to only match exact letters, you should use a binary collation such as utf8_bin.
The collation of the table is part of the issue; MySQL with a _ci collation is treating all of those 'a's as variants of the same character.
Switching to a _cs collation will force the engine to distinguish 'a' from 'A', and 'á' from 'Á', but it may still treat 'a' and 'á' as the same character.
If you need exact comparison semantics, completely disregarding the equivalency of similar characters, you can use the BINARY comparison operators
SELECT id FROM unicode WHERE BINARY character = 'a'
The ci in the collation means case-insensitive. Switch to a case-sensitive collation (cs) to get the results you're looking for.
Related
This question is an extension of the following question - How to make mysql consider the control characters when doing string comparison?
Here is my query -
SELECT 'abc' < 'abcSOH' COLLATE utf8mb4_0900_bin;
Here SOH is the Start Of Header which is an ASCII control character with ASCII code 1. My expectation is that this query will return 1 as the second string's length is 4. I have even tried with Space (ASCII code 32) with the same results!!
If you check this fiddle, you can see only the 'utf8mb4_0900_bin' collation gives the expected result. All other collations that I have tested give the opposite result.
https://dbfiddle.uk/mDLVWOZG
I have gone through the documentation and could not find the reason behind this. Can anyone please explain why is this?
I am interested to know this because I would like to use a 1-byte character set (and corresponding collation) instead of a 4-byte character set because I have some legacy tables (converting to MySQL) that have a lot of columns and if I use a 4-byte character set, it gives an error that the row is too big.
Each column can have its own CHARACTER SET and COLLATION. But different rows must agree.
CREATE TABLE provides only "defaults" for those settings -- these defaults are used if you don't override them when declaring the individual columns.
So, legacy columns may as well be declared with whatever antique charset was used. (Sorry, EBCDIC is not available.)
All the "printable" characters of ASCII are available in UTF-8 (MySQL's utf8/utf8mb3/utf8mb4). In fact, the binary encoding is identical.
The "control characters" -- well, stick with ascii or latin1 (perhaps with latin1_bin).
Any _bin collation says to simply look at the bits.
I do not know if control characters are turned into space (hex 20) when INSERTing into a UTF-8 column.
When I search for LIKE %カナ it still brings up results for かな.
From the MySQL documentation (I'm on 8.0.26) under Language-Specific Collations:
For Japanese, the utf8mb4 character set includes utf8mb4_ja_0900_as_cs and
utf8mb4_ja_0900_as_cs_ks collations. Both collations are accent-sensitive and
case-sensitive. utf8mb4_ja_0900_as_cs_ks is also kana-sensitive and distinguishes
Katakana characters from Hiragana characters, whereas utf8mb4_ja_0900_as_cs treats
Katakana and Hiragana characters as equal for sorting.
Checking my column it shows the kana-sensitive collation:
SELECT COLUMN_NAME, COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS;
COLUMN_NAME
COLLATION_NAME
kana
utf8mb4_ja_0900_as_cs_ks
There are three(?) different pieces of code that MySQL uses for character comparisons: =, LIKE, REGEXP. They are, confusingly, not identical. And in some cases, they are deliberately different.
s LIKE 'abc' is turned into s = 'abc', adding to the confusion.
A collation ending with _as_cs implies that 'e' <> 'E', but whether < or > applies is still honored. This is unlike BINARY or _bin collation, in which cases it blindly checks the bits.
The collation you are using is relatively new; it is not older than 8.0. If you find errors in the collation, please file a bug report at bugs.mysql.com and provide a simple testcase demonstrating the issue.
For case sensitive try to use LIKE BINARY.
Example:
SELECT name FROM users WHERE name LIKE BINARY 'John%';
These two querys gives me the exact same result:
select * from topics where name='Harligt';
select * from topics where name='Härligt';
How is this possible? Seems like mysql translates åäö to aao when it searches. Is there some way to turn this off?
I use utf-8 encoding everywhere as far as i know. The same problem occurs both from terminal and from php.
Yes, this is standard behaviour in the non-language-specific unicode collations.
9.1.13.1. Unicode Character Sets
To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect this has in comparisons or when doing searches, see Section 9.1.7.7, “Examples of the Effect of Collation”):
Ä = A
Ö = O
Ü = U
See also Examples of the effect of collation
You need to either
use a collation that doesn't have this "feature" (namely utf8_bin, but that has other consequences)
use a different collation for the query only. This should work:
select * from topics where name='Harligt' COLLATE utf8_bin;
it becomes more difficult if you want to do a case insensitive LIKE but not have the Ä = A umlaut conversion. I know no mySQL collation that is case insensitive and does not do this kind of implicit umlaut conversion. If anybody knows one, I'd be interested to hear about it.
Related:
Looking for case insensitive MySQL collation where “a” != “ä”
MYSQL case sensitive search for utf8_bin field
Since you are in Sweden I'd recommend using the Swedish collation. Here's an example showing the difference it makes:
CREATE TABLE topics (name varchar(100) not null) CHARACTER SET utf8;
INSERT topics (name) VALUES ('Härligt');
select * from topics where name='Harligt';
'Härligt'
select * from topics where name='Härligt';
'Härligt'
ALTER TABLE topics MODIFY name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_swedish_ci;
select * from topics where name='Harligt';
<no results>
select * from topics where name='Härligt';
'Härligt'
Note that in this example I only changed the one column to Swedish collation, but you should probably do it for your entire database, all tables, all varchar columns.
While collations are one way of solving this, the much more straightforward way seems to me to be the BINARY keyword:
SELECT 'a' = 'ä', BINARY 'a' = 'ä'
will return 1|0
In your case:
SELECT * FROM topics WHERE BINARY name='Härligt';
See also https://www.w3schools.com/sql/func_mysql_binary.asp
you want to check your collation settings, collation is the property that sets which characters are identical.
these 2 pages should help you
http://dev.mysql.com/doc/refman/5.1/en/charset-general.html
http://dev.mysql.com/doc/refman/5.1/en/charset-mysql.html
Here you can see some collation charts. http://collation-charts.org/mysql60/. I'm no sure which is the used utf8_general_ci though.
Here is the chart for utf8_swedish_ci. It shows which characters it interprets as the same. http://collation-charts.org/mysql60/mysql604.utf8_swedish_ci.html
After viewing my prod logs, I have some error mentionning :
[2012-08-31 15:56:43] request.CRITICAL: Doctrine\DBAL\DBALException:
An exception occurred while executing 'SELECT t0.username ....... FROM fos_user t0 WHERE t0.username = ?'
with params {"1":"Nrv\u29e7Kasi"}:
SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (latin1_swedish_ci,IMPLICIT)
and (utf8_general_ci,COERCIBLE) for operation '='
Alghout i have UTF-8 default under the doctrine cfg :
doctrine:
dbal:
charset: UTF8
It seems that all my MySQL Tables are in latin1_swedish_ci, so my question is :
Can I manually change the collation to utf8_general_ci for all my tables without any complications/precautions ?
It is helpful to understand the following definitions:
A character encoding details how each symbol is represented in binary (and therefore stored in the computer). For example, the symbol é (U+00E9, latin small letter E with acute) is encoded as 0xc3a9 in UTF-8 (which MySQL calls utf8) and 0xe9 in Windows-1252 (which MySQL calls latin1).
A character set is the alphabet of symbols that can be represented using a given character encoding. Confusingly, the term is also used to mean the same as character encoding.
A collation is an ordering on a character set, so that strings can be compared. For example: MySQL's latin1_swedish_ci collation treats most accented variations of a character as equivalent to the base character, whereas its latin1_general_ci collation will order them before the next base character but not equivalent (there are other, more significant, differences too: such as the order of characters like å, ä, ö and ß).
MySQL will decide which collation should be applied to a given expression as documented under Collation of Expressions: in particular, the collation of a column takes precedence over that of a string literal.
The WHERE clause of your query compares the following strings:
a value in fos_user.username, encoded in the column's character set (Windows-1252) and expressing a preference for its collation latin1_swedish_ci (with a coercibility value of 2); with
the string literal 'Nrv⧧Kasi', encoded in the connection's character set (UTF-8, as configured by Doctrine) and expressing a preference for the connection's collation utf8_general_ci (with a coercibility value of 4).
Since the first of these strings has a lower coercibility value than the second, MySQL attempts to perform the comparison using that string's collation: latin1_swedish_ci. To do so, MySQL attempts to convert the second string to latin1—but since the ⧧ character does not exist in that character set, the comparison fails.
Warning
One should pause for a moment to consider how the column is currently encoded: you are attempting to filter for records where fos_user.username is equal to a string that contains a character which cannot exist in that column!
If you believe that the column does contain such characters, then you probably wrote to the column whilst the connection character encoding was set to something (e.g. latin1) that caused MySQL to interpret the received byte sequence as characters which are all in the Windows-1252 character set.
If this is the case, before continuing any further you should fix your data!
convert such columns to the character encoding that was used on data insertion, if different to the incumbent encoding:
ALTER TABLE fos_users MODIFY username VARCHAR(123) CHARACTER SET foo;
drop the encoding information associated with such columns by converting them to the binary character set:
ALTER TABLE fos_users MODIFY username VARCHAR(123) CHARACTER SET binary;
associate with such columns the encoding in which data was actually transmitted by converting them to the relevant character set.
ALTER TABLE fos_users MODIFY username VARCHAR(123) CHARACTER SET bar;
Note that, if converting from a multi-byte encoding, you may need to increase the size of the column (or even change its type) in order to accomodate the maximum possible length of the converted string.
Once one is certain that the columns are correctly encoded, one could force the comparison to be conducted using a Unicode collation by either—
explicitly converting the value fos_user.username to a Unicode character set:
WHERE CONVERT(fos_user.username USING utf8) = ?
forcing the string literal to have a lower coercibility value than the column (will cause an implicit conversion of the column's value to UTF-8):
WHERE fos_user.username = ? COLLATE utf8_general_ci
Or one could, as you say, permanently convert the column(s) to a Unicode encoding and set its collation appropriately.
Can I manually change the collation to utf8_general_ci for all my tables without any complications/precautions ?
The principle consideration is that Unicode encodings take up more space than single-byte character sets, so:
more storage may be required;
comparisons may be slower; and
index prefix lengths may need to be adjusted (note that the maximum is in bytes, so may represent fewer characters than previously).
Also, be aware that, as documented under ALTER TABLE Syntax:
To change the table default character set and all character columns (CHAR, VARCHAR, TEXT) to a new character set, use a statement like this:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;
For a column that has a data type of VARCHAR or one of the TEXT types, CONVERT TO CHARACTER SET will change the data type as necessary to ensure that the new column is long enough to store as many characters as the original column. For example, a TEXT column has two length bytes, which store the byte-length of values in the column, up to a maximum of 65,535. For a latin1 TEXT column, each character requires a single byte, so the column can store up to 65,535 characters. If the column is converted to utf8, each character might require up to three bytes, for a maximum possible length of 3 × 65,535 = 196,605 bytes. That length will not fit in a TEXT column's length bytes, so MySQL will convert the data type to MEDIUMTEXT, which is the smallest string type for which the length bytes can record a value of 196,605. Similarly, a VARCHAR column might be converted to MEDIUMTEXT.
To avoid data type changes of the type just described, do not use CONVERT TO CHARACTER SET. Instead, use MODIFY to change individual columns.
Thats right. I ran into this problem and the best quick and fast solution is
CONVERT(fos_user.username USING utf8)
Simply convert table's character set by command as follows,
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8;
I have MySQL 5.xx running on Linux system. My application writes correctly ä, ö, å etc. characters to database and even gets these values correctly. But when I use WHERE to filter search for char 'ä', it will return also Strings that contain 'a' chars. Why MySQL thinks that a is equal to ä?
Example query:
SELECT column FROM table WHERE field='%ä%';
MySQL's uses collations to compare character values.
Collations are the sets of rules used by database to define which characters are different and which are not when comparing.
Case sensitive collations distinguish between 'QUERY' and 'query', case insensitive do not.
Accent sensitive collations distinguish between 'résumé' and 'resume', accent insensitive do not.
In your column's default collation (most probably UTF8_GENERAL_CI), umlauted characters are indistinguishable from non-umlauted:
SELECT 'a' LIKE '%ä%'
---
1
To distinguish between them, use binary collation (which treat all characters with different unicodes as different characters):
SELECT 'a' LIKE '%ä%' COLLATE UTF8_BIN
---
0
Note: for many applications, collating a and ä as the same letter is considered a feature. My suggestion: be sure to double-check with your clients to determine which behavior is desired.
I might even follow up with a memo that says, "As we discussed on x date, the system will sort and find characters as follows..."