I have a MySQL table with the utf8_general_ci collation where I keep data in different languages, mostly English, Turkish, Farsi, etc.
The problem is that the sql statement:
SELECT * FROM `qkw` WHERE `eword` = 'turk'
returns rows with both "turk" and "türk" values.
I have the same problem with indexing, which treats ü and u the same. Is this a bug in MySQL, or should I use a different encoding? Any suggestions?
Thanks
The different collations are documented here, including the effect you're seeing:
To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect this has in comparisons or when doing searches, see Section 10.1.7.8, “Examples of the Effect of Collation”):
Ä = A
Ö = O
Ü = U
If you don't want that, you can choose a collation from that list that does not see them as equivalent, for example utf8_swedish_ci.
Your best bet would probably be to use the utf8_turkish_ci collation.
It will distinguish between 'u' and 'ü' as you wish. As the _ci suffix indicates, it is a case-insensitive collation:
create table t (v varchar(255)
character set utf8
collate utf8_turkish_ci);
insert into t values ("turk"), ("türk"), ("top"), ("twin");
mysql> select * from t order by v;
+-------+
| v |
+-------+
| türk |
| top |
| turk |
| twin |
+-------+
mysql> select * from t where v = "turk";
+------+
| v |
+------+
| turk |
+------+
mysql> select * from t where v = "TURK";
+------+
| v |
+------+
| turk |
+------+
Because utf8_bin simply compares the binary code of each character, it produces slightly different results. Not only is it case sensitive, but the ordering is different as well:
mysql> alter table t change column v v varchar(255) collate utf8_bin;
Query OK, 4 rows affected (0.24 sec)
Records: 4 Duplicates: 0 Warnings: 0
mysql> select * from t order by v;
+-------+
| v |
+-------+
| top |
| turk |
| twin |
| türk |
+-------+
4 rows in set (0.00 sec)
mysql> select * from t where v = "turk";
+------+
| v |
+------+
| turk |
+------+
1 row in set (0.00 sec)
mysql> select * from t where v = "TURK";
Empty set (0.00 sec)
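To see why utf8_bin orders the rows this way, here is a minimal Python sketch (an illustration outside MySQL, not the server's actual implementation) that sorts the same four values by their raw UTF-8 bytes. The helper name binary_sort is made up for this example.

```python
# Illustration only: mimic a binary collation by comparing raw UTF-8 bytes.
# binary_sort is a hypothetical helper, not a MySQL or library function.

def binary_sort(values):
    """Sort strings by their UTF-8 byte sequences, as a *_bin collation would."""
    return sorted(values, key=lambda s: s.encode("utf-8"))

words = ["turk", "türk", "top", "twin"]
ordered = binary_sort(words)
print(ordered)

# 'ü' is encoded as the two bytes C3 BC, both larger than any ASCII byte,
# so "türk" sorts after "twin" when bytes are compared.
same_bytes = "TURK".encode("utf-8") == "turk".encode("utf-8")
print(same_bytes)  # upper and lower case have different bytes
```

The same byte comparison explains why the lookup for "TURK" comes back empty under utf8_bin.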
I have a table with a column that uses the cp1251_general_ci collation. I don't want to change the column's collation, but I want to get the data in utf8 encoding.
Is there a way to select the data so that it comes back as if the column used utf8_general_ci?
I.e., I need something like this:
SELECT CONVERT_TO_UTF8(weirdColumn) FROM weirdTable
Here's a demo table using the cp1251 encoding. I'll insert some Cyrillic characters into it.
mysql> CREATE TABLE weirdTable (weirdColumn text) ENGINE=InnoDB DEFAULT CHARSET=cp1251;
mysql> insert into weirdTable values ('ЂЃЉЌ');
mysql> select * from weirdTable;
+-------------+
| weirdColumn |
+-------------+
| ЂЃЉЌ |
+-------------+
Use MySQL's CONVERT() function to force the characters to a different encoding:
mysql> select convert(weirdColumn using utf8) as weirdColumnUtf8 from weirdTable;
+-----------------+
| weirdColumnUtf8 |
+-----------------+
| ЂЃЉЌ |
+-----------------+
Here's proof that the result has been converted to utf8. I create a table using metadata from the query result:
mysql> create table w2
as select convert(weirdColumn using utf8) as weirdColumnUtf8 from weirdTable;
Query OK, 1 row affected (0.07 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> show create table w2\G
*************************** 1. row ***************************
Table: w2
Create Table: CREATE TABLE `w2` (
`weirdColumnUtf8` longtext CHARACTER SET utf8
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
1 row in set (0.00 sec)
mysql> select * from w2;
+-----------------+
| weirdColumnUtf8 |
+-----------------+
| ЂЃЉЌ |
+-----------------+
On my MySQL instance, utf8mb4 is the default character encoding. That's okay; it's a superset of utf8, and the utf8 encoding is enough to store these characters. In general, though, if you are going to use utf8, there's no reason not to use utf8mb4 instead.
If you change the character encoding, you cannot keep the cp1251 collation. Collations are specific to encodings. But you can use one of the collations associated with utf8 or utf8mb4. You can see the available collations for a given character encoding:
mysql> SHOW COLLATION WHERE Charset = 'utf8';
+--------------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+--------------------------+---------+-----+---------+----------+---------+---------------+
...
| utf8_general_ci | utf8 | 33 | Yes | Yes | 1 | PAD SPACE |
| utf8_general_mysql500_ci | utf8 | 223 | | Yes | 1 | PAD SPACE |
...
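As a rough picture of what CONVERT(... USING utf8) does to the stored bytes, here is a small Python sketch (an illustration outside MySQL, under the assumption that the column holds plain cp1251 bytes). The characters stay the same; only their byte representation changes.

```python
# Illustration only: what a cp1251 -> utf8 conversion does to the bytes.
text = "ЂЃЉЌ"  # the sample value from the demo table

# How a cp1251 column would store the value: one byte per character.
cp1251_bytes = text.encode("cp1251")

# Re-encode: decode the cp1251 bytes to characters, then encode as UTF-8.
utf8_bytes = cp1251_bytes.decode("cp1251").encode("utf-8")

print(len(cp1251_bytes), cp1251_bytes.hex())
print(len(utf8_bytes), utf8_bytes.hex())

# Decoding the UTF-8 bytes yields the same characters as before.
round_trip_ok = utf8_bytes.decode("utf-8") == text
print(round_trip_ok)
```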
I need to search a text field in a database so that the same phrase matches regardless of accents or other special characters.
For example, if the value saved in the DB field is "I lòve mysql ánd query", I would like it to match searches for "I love mysql ánd query", "I love mysql and query", "I löve mysql ánd query", etc.
I was thinking of converting the phrases with a PHP function that I use for URL rewrites, which always flattens them to "I love mysql and query", but I'm not sure whether I can flatten them in the query itself.
Since your data is already written to the DB with accents, you can try using a DB collation that treats accented and unaccented characters as equal:
$connection->query("SET NAMES utf8 COLLATE utf8_general_ci");
You can read more about it here
The page above explains clearly what this collation will do for you:
mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE t1
(c1 CHAR(1) CHARACTER SET UTF8 COLLATE utf8_general_ci);
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO t1 VALUES ('a'),('A'),('À'),('á');
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
mysql> SELECT c1, HEX(c1), HEX(WEIGHT_STRING(c1)) FROM t1;
+------+---------+------------------------+
| c1 | HEX(c1) | HEX(WEIGHT_STRING(c1)) |
+------+---------+------------------------+
| a | 61 | 0041 |
| A | 41 | 0041 |
| À | C380 | 0041 |
| á | C3A1 | 0041 |
+------+---------+------------------------+
4 rows in set (0.00 sec)
You can also test it for yourself directly in the DB (test taken from here):
mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'a' = 'A', 'a' = 'À', 'a' = 'á';
+-----------+-----------+-----------+
| 'a' = 'A' | 'a' = 'À' | 'a' = 'á' |
+-----------+-----------+-----------+
| 1 | 1 | 1 |
+-----------+-----------+-----------+
1 row in set (0.06 sec)
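If you do end up flattening the phrases on the application side, as the question suggests, the idea can be sketched in Python as follows. This is only an assumption about what such a flattening function might look like (strip_accents is a made-up name, and the asker's PHP function is not shown); it relies on the standard unicodedata module.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks so that 'ò', 'ö' and 'o' all become 'o'."""
    # NFKD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

stored = "I lòve mysql ánd query"
searches = [
    "I love mysql ánd query",
    "I love mysql and query",
    "I löve mysql ánd query",
]

flattened = strip_accents(stored)
matches = [strip_accents(s) == flattened for s in searches]
print(flattened)
print(matches)
```

With an accent-insensitive collation such as utf8_general_ci, this kind of flattening is usually unnecessary, because the comparison in the query already treats these characters as equal.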
I'm having a difficult time sorting a char field in MySQL. The problem is that accented characters get mixed up with un-accented characters. For example:
Abc
Ábd
Acc
I thought it might have something to do with collation, so I changed the collation of my table to utf8 - utf8_bin after reading this post. Actually, I altered the table several times to various collations. No cigar.
I should also add that I don't mind the order of the sort, as long as it doesn't result in a mixed list. In other words, this is fine:
Ábd
Abc
Acc
and so is this:
Abc
Acc
Ábd
Looking forward to your response.
You just need to use a case-sensitive collation, for example utf8_general_cs.
Update:
I am sorry, it seems there is no utf8_general_cs. utf8_bin should work, though.
Also, you should change the collation of the specific field rather than that of the table (or make sure that the field actually uses the table defaults).
mysql> SELECT * FROM (
-> SELECT 'A' as l
-> UNION ALL
-> SELECT 'á' as l
-> UNION ALL
-> SELECT 'A' as l) ls
-> ORDER BY l;
+----+
| l |
+----+
| A |
| á |
| A |
+----+
3 rows in set (0.00 sec)
mysql> SELECT * FROM (
-> SELECT 'A' as l
-> UNION ALL
-> SELECT 'á' as l
-> UNION ALL
-> SELECT 'A' as l) ls
-> ORDER BY l COLLATE utf8_bin;
+----+
| l |
+----+
| A |
| A |
| á |
+----+
3 rows in set (0.00 sec)
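The difference between the two orderings can be mimicked outside MySQL. The Python sketch below is only an analogy (accent_insensitive_key is a made-up helper, not how MySQL computes weights): an accent-insensitive key produces the "mixed" list from the question, while a byte-wise key keeps accented entries apart.

```python
import unicodedata

def accent_insensitive_key(s):
    """Rough stand-in for a *_ci collation: ignore accents and case."""
    decomposed = unicodedata.normalize("NFKD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

names = ["Acc", "Ábd", "Abc"]

# Accents ignored: 'Ábd' sorts as if it were 'abd', between the other two.
mixed = sorted(names, key=accent_insensitive_key)

# Byte-wise, like utf8_bin: 'Á' starts with byte C3, after every ASCII letter.
separated = sorted(names, key=lambda s: s.encode("utf-8"))

print(mixed)
print(separated)
```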
@newtower had a good starting point, but neither he nor the 'Community' realized that it was inconclusive. So I am providing an answer that should 'fix' it:
unicode_ci (and virtually all other collations):
SET NAMES utf8 COLLATE utf8_unicode_ci;
SELECT GROUP_CONCAT(l SEPARATOR '=') AS gc
FROM (
SELECT 'A' as l UNION ALL
SELECT 'á' as l UNION ALL
SELECT 'A' as l ) ls
GROUP BY l
ORDER BY gc;
+--------+
| gc |
+--------+
| A=á=A |
+--------+
bin:
SET NAMES utf8 COLLATE utf8_bin;
SELECT GROUP_CONCAT(l SEPARATOR '=') AS gc
FROM (
SELECT 'A' as l UNION ALL
SELECT 'á' as l UNION ALL
SELECT 'A' as l ) ls
GROUP BY l
ORDER BY gc;
+------+
| gc |
+------+
| A=A |
| á |
+------+
(You could add a DISTINCT in the GROUP_CONCAT to avoid the dup A.)
And here is a full rundown of the utf8 collations (using that technique): http://mysql.rjweb.org/utf8_collations.html
Suggest you click "Affects Me" on https://bugs.mysql.com/bug.php?id=58797
I have a MySQL database with 30 rows in the customer_customer table, 5 of which have adm_name set to Mike.
mysql> select id from customer_customer where adm_name like '%mike%';
+----+
| id |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+----+
5 rows in set (0.00 sec)
Now I have changed the character set of my table to utf8:
mysql> ALTER TABLE customer_customer CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
Query OK, 30 rows affected (0.03 sec)
Records: 30 Duplicates: 0 Warnings: 0
If I run the same LIKE query again, MySQL does not return any records:
mysql> select id from customer_customer where adm_name like '%mike%';
Empty set (0.00 sec)
I am not able to understand this behavior. Has anyone come across this situation? Am I doing anything wrong?
You changed the collation to a binary one (utf8_bin). In that case the comparison is done byte by byte rather than character by character, so it is case sensitive and 'mike' no longer matches 'Mike'. Here is a good example and explanation of the BINARY operator:
mysql> SELECT 'a' = 'A';
-> 1
mysql> SELECT BINARY 'a' = 'A';
-> 0
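The effect on the LIKE '%mike%' search can be pictured with a short Python sketch. It is only an analogy: the sample names are hypothetical stand-ins for the adm_name column, and the two helper functions are made up to contrast a byte-wise match with a case-insensitive one.

```python
# Hypothetical sample values standing in for the adm_name column.
names = ["Mike", "Anna", "MIKE"]

def contains_binary(value, needle):
    """Substring test on exact characters, like LIKE under utf8_bin."""
    return needle in value

def contains_ci(value, needle):
    """Substring test ignoring case, like LIKE under a *_ci collation."""
    return needle.casefold() in value.casefold()

binary_hits = [n for n in names if contains_binary(n, "mike")]
ci_hits = [n for n in names if contains_ci(n, "mike")]

print(binary_hits)  # nothing matches: 'm' and 'M' are different bytes
print(ci_hits)
```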
Forgive my newbie question, but why does finding by '2' or '２' in MySQL return the same record?
For example:
Say I have a record with a string field named 'slug', and the value is '2'. The following two SQL statements return the same record:
SELECT * From articles WHERE slug='2'
SELECT * From articles WHERE slug='２'
It has to do with the collation of your database:
mysql> SHOW VARIABLES LIKE 'collation_%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
mysql> SELECT '2'='２';
+-----------+
| '2'='２' |
+-----------+
| 0 |
+-----------+
1 row in set (0.00 sec)
mysql> SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT '2'='２';
+-----------+
| '2'='２' |
+-----------+
| 1 |
+-----------+
1 row in set (0.00 sec)
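Assuming the second literal in the question is the fullwidth digit U+FF12 (an assumption; the character is written as an escape below so it cannot be confused with the ASCII digit), the following Python sketch shows why the two can compare as different or as equal. Python's NFKC normalization is used here only as an analogy for what a Unicode-aware collation does; it is not the mechanism MySQL uses.

```python
import unicodedata

ascii_two = "2"
fullwidth_two = "\uff12"  # FULLWIDTH DIGIT TWO, assumed to be the second literal

# Compared as raw characters (or bytes), the two are different.
plain_equal = ascii_two == fullwidth_two
print(plain_equal)
print(ascii_two.encode("utf-8").hex(), fullwidth_two.encode("utf-8").hex())

# Unicode defines the fullwidth digit as a compatibility variant of '2',
# so compatibility normalization folds it to the ASCII digit.
folded = unicodedata.normalize("NFKC", fullwidth_two)
folded_equal = folded == ascii_two
print(folded_equal)
```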
They should not return the same row for equality, but if you use LIKE you are probably getting the same row. With LIKE, MySQL will use fuzzy matching, so 2 and ２ will be the same (after all, they are both a form of 2, aren't they?).
What is the datatype of slug? I think it's a numeric one. If so, MySQL casts the value to an integer, and either '2' or ' 2 ' will become 2 anyway. This won't happen with string datatypes.