Related
I can find single chars using:
SELECT * FROM table WHERE column LIKE '%Â%' COLLATE utf8mb4_bin;
-- 1 result: 246.8 ± 11.7
I have tried a dozen or more REGEXP with COLLATE utf8mb4_bin, but none return what I need. Like:
SELECT * FROM table WHERE AND HEX(column) REGEXP "^[..]*" COLLATE utf8mb4_bin;
-- 24356, but ALL chars, not just utf8mb4_bin
SELECT * FROM table WHERE AND HEX(column) COLLATE utf8mb4_bin REGEXP "...";
-- 24045, but ALL chars, not just utf8mb4_bin
Some of the chars I can find using the first query above, one at a time (But I need a REGEXP like query to find all, as I don't know what else exists, and I have 30M rows): ┬╜ †… ╬ô├⌐┬╝
Column collation: utf8mb4_unicode_ci
Thanks! Lou
First, the "CHARACTER SET" (such as utf8mb4) is the encoding. You can search for such in the data.
On top of that, the "COLLATION" (such as utf8mb4_bin) is how the characters are compared. For that, you need to look in information_schema.COLUMNS.
To add to the confusion, ± looks like "Mojibake" for ±. That is where the encoding got messed up when inserting data. Not all "Mojibake" start with Â. This gives a list of Mojibake messes from "latin1" characters: http://mysql.rjweb.org/doc.php/charcoll#8_bit_encodings. In the middle of the second table is ±.
± is a rather unlikely case to home in on. Can you find another clue?
Here is a regexp to find rows with any 8-bit chars in the column colname:
SELECT * FROM tbl WHERE HEX(colname) RLIKE '^(..)*[89A-F].';
If the table is huge, it will take a long time. Also you might want to tack a LIMIT on the end. (That was found in the same document.)
As for analyzing the cause for Mojibake, see Trouble with UTF-8 characters; what I see is not what I stored And for fixing: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
Be sure to pick the case that fits your situation.
But, but, ... If Mojibake is involved, the data may be corrupted. Please provide SHOW CREATE TABLE table, SHOW VARIABLES LIKE 'char%'; I may need some more things to help you figure out what went wrong and how to fix it. (Unless my links suffice.)
An example of using that regexp:
mysql> SELECT city, country FROM `c2014`
WHERE HEX(city) RLIKE '^(..)*[89A-F].'
and id%1237 = 1 and length(city) < 20
LIMIT 7;
+------------------+---------+
| city | country |
+------------------+---------+
| Çernoleva | al |
| Kashta e Bardhë | al |
| Sharrëdushk | al |
| São Tomé | ao |
| Teca diá Ndala | ao |
| Gömür | az |
| Balèmyouré | bf |
+------------------+---------+
Another note: The output would be the same for either utf8mb4, utf8, or latin1. It actually looks for "non-Ascii" characters.
I have a function that returns five characters with mixed case. If I do a query on this string it will return the value regardless of case.
How can I make MySQL string queries case sensitive?
The good news is that if you need to make a case-sensitive query, it is very easy to do:
SELECT * FROM `table` WHERE BINARY `column` = 'value'
http://dev.mysql.com/doc/refman/5.0/en/case-sensitivity.html
The default character set and collation are latin1 and latin1_swedish_ci, so nonbinary string comparisons are case insensitive by default. This means that if you search with col_name LIKE 'a%', you get all column values that start with A or a. To make this search case sensitive, make sure that one of the operands has a case sensitive or binary collation. For example, if you are comparing a column and a string that both have the latin1 character set, you can use the COLLATE operator to cause either operand to have the latin1_general_cs or latin1_bin collation:
col_name COLLATE latin1_general_cs LIKE 'a%'
col_name LIKE 'a%' COLLATE latin1_general_cs
col_name COLLATE latin1_bin LIKE 'a%'
col_name LIKE 'a%' COLLATE latin1_bin
If you want a column always to be treated in case-sensitive fashion, declare it with a case sensitive or binary collation.
The answer posted by Craig White has a big performance penalty
SELECT * FROM `table` WHERE BINARY `column` = 'value'
because it doesn't use indexes. So, either you need to change the table collation like mention here https://dev.mysql.com/doc/refman/5.7/en/case-sensitivity.html.
OR
Easiest fix, you should use a BINARY of value.
SELECT * FROM `table` WHERE `column` = BINARY 'value'
E.g.
mysql> EXPLAIN SELECT * FROM temp1 WHERE BINARY col1 = "ABC" AND col2 = "DEF" ;
+----+-------------+--------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | temp1 | ALL | NULL | NULL | NULL | NULL | 190543 | Using where |
+----+-------------+--------+------+---------------+------+---------+------+--------+-------------+
VS
mysql> EXPLAIN SELECT * FROM temp1 WHERE col1 = BINARY "ABC" AND col2 = "DEF" ;
+----+-------------+-------+-------+---------------+---------------+---------+------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------------+---------+------+------+------------------------------------+
| 1 | SIMPLE | temp1 | range | col1_2e9e898e | col1_2e9e898e | 93 | NULL | 2 | Using index condition; Using where |
+----+-------------+-------+-------+---------------+---------------+---------+------+------+------------------------------------+
enter code here
1 row in set (0.00 sec)
Instead of using the = operator, you may want to use LIKE or LIKE BINARY
// this returns 1 (true)
select 'A' like 'a'
// this returns 0 (false)
select 'A' like binary 'a'
select * from user where username like binary 'a'
It will take 'a' and not 'A' in its condition
The most correct way to perform a case sensitive string comparison without changing the collation of the column being queried is to explicitly specify a character set and collation for the value that the column is being compared to.
select * from `table` where `column` = convert('value' using utf8mb4) collate utf8mb4_bin;
Why not use binary?
Using the binary operator is inadvisable because it compares the actual bytes of the encoded strings. If you compare the actual bytes of two strings encoded using the different character sets two strings that should be considered the same they may not be equal. For example if you have a column that uses the latin1 character set, and your server/session character set is utf8mb4, then when you compare the column with a string containing an accent such as 'café' it will not match rows containing that same string! This is because in latin1 é is encoded as the byte 0xE9 but in utf8 it is two bytes: 0xC3A9.
Why use convert as well as collate?
Collations must match the character set. So if your server or session is set to use the latin1 character set you must use collate latin1_bin but if your character set is utf8mb4 you must use collate utf8mb4_bin. Therefore the most robust solution is to always convert the value into the most flexible character set, and use the binary collation for that character set.
Why apply the convert and collate to the value and not the column?
When you apply any transforming function to a column before making a comparison it prevents the query engine from using an index if one exists for the column, which could dramatically slow down your query. Therefore it is always better to transform the value instead where possible. When a comparison is performed between two string values and one of them has an explicitly specified collation, the query engine will use the explicit collation, regardless of which value it is applied to.
Accent Sensitivity
It is important to note that MySql is not only case insensitive for columns using an _ci collation (which is typically the default), but also accent insensitive. This means that 'é' = 'e'. Using a binary collation (or the binary operator) will make string comparisons accent sensitive as well as case sensitive.
What is utf8mb4?
The utf8 character set in MySql is an alias for utf8mb3 which has been deprecated in recent versions because it does not support 4 byte characters (which is important for encoding strings like 🐈). If you wish to use the UTF8 character encoding with MySql then you should be using the utf8mb4 charset.
To make use of an index before using the BINARY, you could do something like this if you have large tables.
SELECT
*
FROM
(SELECT * FROM `table` WHERE `column` = 'value') as firstresult
WHERE
BINARY `column` = 'value'
The subquery would result in a really small case-insensitive subset of which you then select the only case-sensitive match.
You can use BINARY to case sensitive like this
select * from tb_app where BINARY android_package='com.Mtime';
unfortunately this sql can't use index, you will suffer a performance hit on queries reliant on that index
mysql> explain select * from tb_app where BINARY android_package='com.Mtime';
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| 1 | SIMPLE | tb_app | NULL | ALL | NULL | NULL | NULL | NULL | 1590351 | 100.00 | Using where |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+
Fortunately, I have a few tricks to solve this problem
mysql> explain select * from tb_app where android_package='com.Mtime' and BINARY android_package='com.Mtime';
+----+-------------+--------+------------+------+---------------------------+---------------------------+---------+-------+------+----------+-----------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+------+---------------------------+---------------------------+---------+-------+------+----------+-----------------------+
| 1 | SIMPLE | tb_app | NULL | ref | idx_android_pkg | idx_android_pkg | 771 | const | 1 | 100.00 | Using index condition |
+----+-------------+--------+------------+------+---------------------------+---------------------------+---------+-------+------+----------+-----------------------+
Following is for MySQL versions equal to or higher than 5.5.
Add to /etc/mysql/my.cnf
[mysqld]
...
character-set-server=utf8
collation-server=utf8_bin
...
All other collations I tried seemed to be case-insensitive, only "utf8_bin" worked.
Do not forget to restart mysql after this:
sudo service mysql restart
According to http://dev.mysql.com/doc/refman/5.0/en/case-sensitivity.html there is also a "latin1_bin".
The "utf8_general_cs" was not accepted by mysql startup. (I read "_cs" as "case-sensitive" - ???).
No need to changes anything on DB level, just you have to changes in SQL Query it will work.
Example -
"SELECT * FROM <TABLE> where userId = '" + iv_userId + "' AND password = BINARY '" + iv_password + "'";
Binary keyword will make case sensitive.
Excellent!
I share with you, code from a function that compares passwords:
SET pSignal =
(SELECT DECODE(r.usignal,'YOURSTRINGKEY') FROM rsw_uds r WHERE r.uname =
in_usdname AND r.uvige = 1);
SET pSuccess =(SELECT in_usdsignal LIKE BINARY pSignal);
IF pSuccess = 1 THEN
/*Your code if match*/
ELSE
/*Your code if don't match*/
END IF;
For those looking to do case sensitive comparison with a regular expression using RLIKE or REGEXP, you can instead use REGEXP_LIKE() with match type c like this:
SELECT * FROM `table` WHERE REGEXP_LIKE(`column`, 'value', 'c');
mysql is not case sensitive by default, try changing the language collation to latin1_general_cs
My OpenCart table collation is utf8_bin, unfortunately I can't search for product names with accent in their name. I searched on Google and just found that the collation must be utf8_general_ci for accent compatible and case insensitive search.
What If I add collate declaration to the search query?
SELECT *
FROM `address`
COLLATE utf8_general_ci
LIMIT 0 , 30
Does it have any (bad) side effect? I red about problems with indexing, performance? Or it is totally safe?
I'm afraid you have to consider the side effects on query performance, especially those using indexes. Here is a simple test:
mysql> create table aaa (a1 varchar(100) collate latin1_general_ci, tot int);
insert into aaa values('test1',3) , ('test2',4), ('test5',5);
mysql> create index aindex on aaa (a1);
Query OK, 0 rows affected (0.59 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> desc aaa;
+-------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| a1 | varchar(100) | YES | MUL | NULL | |
| tot | int(11) | YES | | NULL | |
+-------+--------------+------+-----+---------+-------+
2 rows in set (0.53 sec)
mysql> explain select * from aaa where a1='test1' ;
+----+-------------+-------+------+---------------+--------+---------+-------+--
----+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | r
ows | Extra |
+----+-------------+-------+------+---------------+--------+---------+-------+--
----+-----------------------+
| 1 | SIMPLE | aaa | ref | aindex | aindex | 103 | const |
1 | Using index condition |
+----+-------------+-------+------+---------------+--------+---------+-------+--
----+-----------------------+
1 row in set (0.13 sec)
mysql> explain select * from aaa where a1='test1' collate utf8_general_ci;
+----+-------------+-------+------+---------------+------+---------+------+-----
-+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows
| Extra |
+----+-------------+-------+------+---------------+------+---------+------+-----
-+-------------+
| 1 | SIMPLE | aaa | ALL | NULL | NULL | NULL | NULL | 3
| Using where |
+----+-------------+-------+------+---------------+------+---------+------+-----
-+-------------+
1 row in set (0.06 sec)
You can see that MySQL is stopping using the index on a1 when you search it using another collation, which can be a huge problem for you.
To make sure your indexes are being used for queries, you may have to change your column collation to the most frequently used one.
If practical, change the column definition(s).
ALTER TABLE tbl
MODIFY col VARCHAR(...) COLLATE utf8_general_ci ...;
(You should include anything else that was already in the column definition.) If you have multiple columns to modify, do them all in the same ALTER (for speed).
If, for some reason, you cannot do the ALTER, then, yes, you can tweak the SELECT to change the collation:
The SELECT you mentioned had no WHERE clause for filtering, so let me change the test case:
Let's say you have this, which will find only 'San Jose':
SELECT *
FROM tbl
WHERE city = 'San Jose'
To include San José:
SELECT *
FROM tbl
WHERE city COLLATE utf8_general_ci = 'San Jose'
If you might have "combining accents", consider using utf8_unicode_ci. More on Combining Diacriticals and More on your topic.
As for side effects? None except for on potentially big one: The index on the column cannot be used. In my second SELECT (above), INDEX(city) is useless. The ALTER avoids this performance penalty on the SELECT, but the one-time ALTER, itself, is costly.
In using of COLLATE in SQL statements, I don't find that usage, Anyway for explaining about your main question of effects of using collations I found some tips, but at first:
From dev.mysql.com:
Nonbinary strings (as stored in the CHAR, VARCHAR, and TEXT data types) have a character set and collation. A given character set can have several collations, each of which defines a particular sorting and comparison order for the characters in the set.
Collation is merely the ordering that is used for string comparisons—it has (almost) nothing to do with the character encoding that is used for data storage. I say almost because collations can only be used with certain character sets, so changing collation may force a change in the character encoding.
To the extent that the character encoding is modified, MySQL will correctly re-encode values to the new character set whether going from single to multi-byte or vice-versa. Beware that any values that become too large for the column will be truncated.[1]
The practical advantage of binary collation is its speed, as string comparison is very simple/fast. In general case, indexes with binary might not produce expected results for sort, however for exact matches they can be useful.[2]
With multiple operands, there can be ambiguity. For example:
SELECT x FROM T WHERE x = 'Y';
Should the comparison use the collation of the column x, or of the string literal 'Y'? Both x and 'Y' have collations, so which collation takes precedence?
Standard SQL resolves such questions using what used to be called “coercibility” rules. [3]
If you change the collation of a field, ORDER BY -[also in WHERE]- cannot use any INDEX; hence it could be surprisingly inefficient. [4]
Since the forced collation is defined over the same character set as the column's encoding, there won't be any performance impact(versus defining that collation as the column's default; whereas utf8_general_ci will almost certainly perform slower in comparisons than utf8_bin due the extra lookups/computation required).
However, if one forced a collation that is defined over a different character set, MySQL would have to transcode the column's values (which would have a performance impact).[5]
This might help: UTF-8: General? Bin? Unicode?
Please note that utf8_bin is also case sensitive. So I would go for altering table collation to utf8_general_ci and have peace of mind for the future.
I have a function that returns five characters with mixed case. If I do a query on this string it will return the value regardless of case.
How can I make MySQL string queries case sensitive?
The good news is that if you need to make a case-sensitive query, it is very easy to do:
SELECT * FROM `table` WHERE BINARY `column` = 'value'
http://dev.mysql.com/doc/refman/5.0/en/case-sensitivity.html
The default character set and collation are latin1 and latin1_swedish_ci, so nonbinary string comparisons are case insensitive by default. This means that if you search with col_name LIKE 'a%', you get all column values that start with A or a. To make this search case sensitive, make sure that one of the operands has a case sensitive or binary collation. For example, if you are comparing a column and a string that both have the latin1 character set, you can use the COLLATE operator to cause either operand to have the latin1_general_cs or latin1_bin collation:
col_name COLLATE latin1_general_cs LIKE 'a%'
col_name LIKE 'a%' COLLATE latin1_general_cs
col_name COLLATE latin1_bin LIKE 'a%'
col_name LIKE 'a%' COLLATE latin1_bin
If you want a column always to be treated in case-sensitive fashion, declare it with a case sensitive or binary collation.
The answer posted by Craig White has a big performance penalty
SELECT * FROM `table` WHERE BINARY `column` = 'value'
because it doesn't use indexes. So, either you need to change the table collation like mention here https://dev.mysql.com/doc/refman/5.7/en/case-sensitivity.html.
OR
Easiest fix, you should use a BINARY of value.
SELECT * FROM `table` WHERE `column` = BINARY 'value'
E.g.
mysql> EXPLAIN SELECT * FROM temp1 WHERE BINARY col1 = "ABC" AND col2 = "DEF" ;
+----+-------------+--------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | temp1 | ALL | NULL | NULL | NULL | NULL | 190543 | Using where |
+----+-------------+--------+------+---------------+------+---------+------+--------+-------------+
VS
mysql> EXPLAIN SELECT * FROM temp1 WHERE col1 = BINARY "ABC" AND col2 = "DEF" ;
+----+-------------+-------+-------+---------------+---------------+---------+------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------------+---------+------+------+------------------------------------+
| 1 | SIMPLE | temp1 | range | col1_2e9e898e | col1_2e9e898e | 93 | NULL | 2 | Using index condition; Using where |
+----+-------------+-------+-------+---------------+---------------+---------+------+------+------------------------------------+
enter code here
1 row in set (0.00 sec)
Instead of using the = operator, you may want to use LIKE or LIKE BINARY
// this returns 1 (true)
select 'A' like 'a'
// this returns 0 (false)
select 'A' like binary 'a'
select * from user where username like binary 'a'
It will take 'a' and not 'A' in its condition
The most correct way to perform a case sensitive string comparison without changing the collation of the column being queried is to explicitly specify a character set and collation for the value that the column is being compared to.
select * from `table` where `column` = convert('value' using utf8mb4) collate utf8mb4_bin;
Why not use binary?
Using the binary operator is inadvisable because it compares the actual bytes of the encoded strings. If you compare the actual bytes of two strings encoded using the different character sets two strings that should be considered the same they may not be equal. For example if you have a column that uses the latin1 character set, and your server/session character set is utf8mb4, then when you compare the column with a string containing an accent such as 'café' it will not match rows containing that same string! This is because in latin1 é is encoded as the byte 0xE9 but in utf8 it is two bytes: 0xC3A9.
Why use convert as well as collate?
Collations must match the character set. So if your server or session is set to use the latin1 character set you must use collate latin1_bin but if your character set is utf8mb4 you must use collate utf8mb4_bin. Therefore the most robust solution is to always convert the value into the most flexible character set, and use the binary collation for that character set.
Why apply the convert and collate to the value and not the column?
When you apply any transforming function to a column before making a comparison it prevents the query engine from using an index if one exists for the column, which could dramatically slow down your query. Therefore it is always better to transform the value instead where possible. When a comparison is performed between two string values and one of them has an explicitly specified collation, the query engine will use the explicit collation, regardless of which value it is applied to.
Accent Sensitivity
It is important to note that MySql is not only case insensitive for columns using an _ci collation (which is typically the default), but also accent insensitive. This means that 'é' = 'e'. Using a binary collation (or the binary operator) will make string comparisons accent sensitive as well as case sensitive.
What is utf8mb4?
The utf8 character set in MySql is an alias for utf8mb3 which has been deprecated in recent versions because it does not support 4 byte characters (which is important for encoding strings like 🐈). If you wish to use the UTF8 character encoding with MySql then you should be using the utf8mb4 charset.
To make use of an index before using the BINARY, you could do something like this if you have large tables.
SELECT
*
FROM
(SELECT * FROM `table` WHERE `column` = 'value') as firstresult
WHERE
BINARY `column` = 'value'
The subquery would result in a really small case-insensitive subset of which you then select the only case-sensitive match.
You can use BINARY to case sensitive like this
select * from tb_app where BINARY android_package='com.Mtime';
unfortunately this sql can't use index, you will suffer a performance hit on queries reliant on that index
mysql> explain select * from tb_app where BINARY android_package='com.Mtime';
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| 1 | SIMPLE | tb_app | NULL | ALL | NULL | NULL | NULL | NULL | 1590351 | 100.00 | Using where |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+
Fortunately, I have a few tricks to solve this problem
mysql> explain select * from tb_app where android_package='com.Mtime' and BINARY android_package='com.Mtime';
+----+-------------+--------+------------+------+---------------------------+---------------------------+---------+-------+------+----------+-----------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+------+---------------------------+---------------------------+---------+-------+------+----------+-----------------------+
| 1 | SIMPLE | tb_app | NULL | ref | idx_android_pkg | idx_android_pkg | 771 | const | 1 | 100.00 | Using index condition |
+----+-------------+--------+------------+------+---------------------------+---------------------------+---------+-------+------+----------+-----------------------+
Following is for MySQL versions equal to or higher than 5.5.
Add to /etc/mysql/my.cnf
[mysqld]
...
character-set-server=utf8
collation-server=utf8_bin
...
All other collations I tried seemed to be case-insensitive, only "utf8_bin" worked.
Do not forget to restart mysql after this:
sudo service mysql restart
According to http://dev.mysql.com/doc/refman/5.0/en/case-sensitivity.html there is also a "latin1_bin".
The "utf8_general_cs" was not accepted by mysql startup. (I read "_cs" as "case-sensitive" - ???).
No need to changes anything on DB level, just you have to changes in SQL Query it will work.
Example -
"SELECT * FROM <TABLE> where userId = '" + iv_userId + "' AND password = BINARY '" + iv_password + "'";
Binary keyword will make case sensitive.
Excellent!
I share with you, code from a function that compares passwords:
SET pSignal =
(SELECT DECODE(r.usignal,'YOURSTRINGKEY') FROM rsw_uds r WHERE r.uname =
in_usdname AND r.uvige = 1);
SET pSuccess =(SELECT in_usdsignal LIKE BINARY pSignal);
IF pSuccess = 1 THEN
/*Your code if match*/
ELSE
/*Your code if don't match*/
END IF;
For those looking to do case sensitive comparison with a regular expression using RLIKE or REGEXP, you can instead use REGEXP_LIKE() with match type c like this:
SELECT * FROM `table` WHERE REGEXP_LIKE(`column`, 'value', 'c');
mysql is not case sensitive by default, try changing the language collation to latin1_general_cs
I have a member search function where you can give parts of names and the return should be all members having at least one of username, firstname or lastname matching that input. The problem here is that some names have 'weird' characters like the é in Renée and the user doesn't wanna type the weird character but the normal ASCII substitute e.
In PHP I convert the input string to ASCII using iconv (just in case someone types weird characters). In the database however I should also convert the weird chars to ASCII (obviously) for the strings to match.
I tried the following:
SELECT
CONVERT(_latin1'Renée' USING ascii) t1,
CAST(_latin1'Renée' AS CHAR CHARACTER SET ASCII) t2;
(That's two tries.) Both don't work. Both have Ren?e as output. The question mark should be an e. It's alright if it outputs Ren?ee since I can just remove all question marks after the convert.
As you can imagine, the columns I want to query are encoded Latin1.
Thanks.
You don't need to convert anything. Your requirement is to compare two strings and ask if they're equal, ignoring accents; the database server can use a collation to do that for you:
Non-UCA collations have a one-to-one
mapping from character code to weight.
In MySQL, such collations are case
insensitive and accent insensitive.
utf8_general_ci is an example: 'a',
'A', 'À', and 'á' each have different
character codes but all have a weight
of 0x0041 and compare as equal.
mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'a' = 'A', 'a' = 'À', 'a' = 'á';
+-----------+-----------+-----------+
| 'a' = 'A' | 'a' = 'À' | 'a' = 'á' |
+-----------+-----------+-----------+
| 1 | 1 | 1 |
+-----------+-----------+-----------+
1 row in set (0.06 sec)
First off, it should work this way:
SELECT * FROM `test` WHERE `name` COLLATE utf8_general_ci LIKE '%renee%';
Where the test table is:
+-----+--------+
| id | name |
+-----+--------+
| 1 | Renée |
| 2 | Renêe |
| 3 | Renee |
+-----+--------+
What is your MySQL version, and how do you try to match things?
One of the other possible solutions is transliteration.
Related: PHP Transliteration
Transliterating the input should not be a problem, but transliterating the values from the permanent storage (e.g. db) real-time during the search may not be feasible. So you can add three more fields like: username_slug, firstname_slug and lastname_slug. When inserting/modifying a record, set the slug values appropriately. And when searching, search the transliterated input against that slug fields.
+------+----------+---------------+----------+---------------+ ...
| id | username | username_slug | lastname | lastname_slug | ...
+------+----------+---------------+----------+---------------+ ...
| 1 | Renée | renee | La Niña | la-nina | ...
| 2 | Renêe | renee | ... | ... | ...
| 3 | Renee | renee | ... | ... | ...
+------+----------+---------------+----------+---------------+ ...
A search for "renee" or "renèe" would match all of the records.
As a side effect, you may be able to use that fields for generating SEF (search engine friendly) links, hence they are named ,..._slug, e.g. example.com/users/renee. Of course, in that case you should check for the uniqueness of the slug field.
#vincebowdren answer above works, I'm just adding this as an answer for formatting purposes:
CREATE TABLE `members` (
`id` int(11) DEFAULT NULL,
`lastname` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL
);
insert into members values (1, 'test6ë');
select id from members where lastname like 'test6e%';
Yields
+------+
| id |
+------+
| 1 |
+------+
And using Latin1,
set names latin1;
CREATE TABLE `members2` (
`id` int(11) DEFAULT NULL,
`lastname` varchar(20) CHARACTER SET latin1 DEFAULT NULL
);
insert into members2 values (1, 'Renée');
select id from members2 where lastname like '%Renee%';
will yield:
+------+
| id |
+------+
| 1 |
+------+
Of course, the OP should have the same charset in the application (PHP), connection (MySQL on Linux used to default to latin1 in 5.0, but defaults to UTF8 in 5.1), and in the field datatype to have less unknowns. Collations take care of the rest.
EDIT: I wrote should to have a better control over everything, but the following also works:
set names latin1;
select id from members where lastname like 'test6ë%';
Because, once the connection charset is set, MySQL does the conversion internally. In this case, it will convert somehow convert and compare the UTF8 string (from DB) to the latin1 (from query).
EDIT 2: Some skepticism requires me to provide an even more convincing example:
Given the statements above, here what I did more. Make sure the terminal is in UTF8.
set names utf8;
insert into members values (5, 'Renée'), (6, 'Renêe'), (7, 'Renèe');
select members.id, members.lastname, members2.id, members2.lastname
from members inner join members2 using (lastname);
Remember that members is in utf8 and members2 is in latin1.
+------+----------+------+----------+
| id | lastname | id | lastname |
+------+----------+------+----------+
| 5 | Renée | 1 | Renée |
| 6 | Renêe | 1 | Renée |
| 7 | Renèe | 1 | Renée |
+------+----------+------+----------+
which proves with the correct settings, the collation does the work for you.
The CAST() operator in the context of character encodings translates from one method of character storage to another — it does not change the actual characters, which is what you are after. An é character is what it is in any character set, it is not an e. You need to convert accented characters to non-accented characters, which is a different issue and has been asked a number of times previously (normalizing accented characters in MySQL queries).
I am unsure if there is a way to do this directly in MySQL, short of having a translation table and going through letter by letter. It would most likely be easier to write a PHP script to go through the database and make the translations.