Case-insensitive Unicode collation in MySQL

I've got a database where we store usernames with a capital first letter of each name -- i.e., IsaacSparling. I'm trying to do case-insensitive autocomplete against my MySQL (v5.1.46) DB. The table has a charset of utf8 and a collation of utf8_unicode_ci. I've run these tests against the utf8_general_ci collation as well.
Plain ASCII text works fine:
mysql> select username from users where username like 'j%';
+----------------+
| username |
+----------------+
| J******** |
| J*********** |
| J************* |
+----------------+
3 rows in set (0.00 sec)
mysql> select username from users where username like 'J%';
+----------------+
| username |
+----------------+
| J******** |
| J*********** |
| J************* |
+----------------+
3 rows in set (0.00 sec)
(names redacted, but they're there).
However, when I try to do the same for unicode characters outside the ASCII set, no such luck:
mysql> select username from users where username like 'ø%';
Empty set (0.00 sec)
mysql> select username from users where username like 'Ø%';
+-------------+
| username |
+-------------+
| Ø********* |
+-------------+
1 row in set (0.00 sec)
Some investigation has led me to this: http://bugs.mysql.com/bug.php?id=19567 (tl;dr, this is a known bug with the unicode collations, and fixing it is at 'new feature' priority -- i.e., it won't be finished in any reasonable timeframe).
Has anybody discovered any effective workarounds that allow for case-insensitive searching for unicode characters in MySQL? Any thoughts appreciated!

Works fine for me with version 5.1.42-community
Maybe your mysql client did not send the unicode characters properly. I tested with SQLyog and it worked just fine with both the utf8_unicode_ci and utf8_general_ci collations.
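A quick sanity check is to look at the bytes the server actually received (a sketch; the ø/Ø literals are the ones from the question):
SHOW VARIABLES LIKE 'character_set%';
SELECT HEX('ø'), HEX('Ø');
-- C3B8 / C398 means the literals arrived as utf8; F8 / D8 means they arrived as latin1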

If what you care about is being able to order the field values by the text without regard to upper or lower case, I think the best thing you can do is, when addressing the field, to write LOWER(username) username instead of just username; you can then use ORDER BY on that field, referring to it by its alias.
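A sketch of that idea against the users table from the question; the same LOWER() trick can be applied to the autocomplete filter as well:
SELECT LOWER(username) AS username
FROM users
WHERE LOWER(username) LIKE LOWER('Ø%')  -- case-folded comparison for the autocomplete
ORDER BY username;                      -- the alias refers to the lowercased value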

Have you tried using CONVERT? Something like
WHERE `lastname` LIKE CONVERT( _utf8 'ø%' USING latin1 )
might work for you.

I just resolved the same problem using the query
show variables like '%char%';
My character_set_client was set to 'utf8', but character_set_connection and character_set_results were set to 'latin1'. As a result, UPPER(), LOWER() and LIKE did not behave as expected.
I just inserted the line
mysql_query("SET NAMES utf8");
right after connecting to get case-insensitive searching to work.
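The same fix can be applied directly in SQL, regardless of the client library (a sketch):
SET NAMES utf8;
SHOW VARIABLES LIKE 'character_set%';
-- character_set_client, character_set_connection and character_set_results should now all report utf8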

Related

MySQL 'set names latin1' seems to cause data to be stored as utf8

I have a table defined as follows:
mysql> show create table temptest;
+------------+-----------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+------------+-----------------------------------------------------------------------------------------------------------+
| temptest | CREATE TABLE `temptest` (
`mystring` varchar(100) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+------------+-----------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
When I use the mysql console (via mysql temptest) and insert a character with
insert into temptest values ("é");
I can see it is saved in latin1 encoding:
mysql> select hex(mystring) from temptest;
+---------------+
| hex(mystring) |
+---------------+
| E9 |
+---------------+
But if I issue a "set names latin1" and perform the same operation, I see it storing the same character in utf8 encoding.
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into temptest values ("é");
Query OK, 1 row affected (0.01 sec)
mysql> select hex(mystring) from temptest;
+---------------+
| hex(mystring) |
+---------------+
| E9 |
| C3A9 |
+---------------+
As far as I understand, "set names" shouldn't affect how mysql stores the data (https://dev.mysql.com/doc/refman/8.0/en/set-names.html). What am I missing here? Any insight into this would be greatly appreciated. Thank you.
SET NAMES latin1 declares that the encoding in your client is latin1.
But (apparently) it is actually utf8.
So, when you type é, the client generates the 2 bytes C3 A9.
Then those are sent as if they were latin1 to the server (mysqld).
The server says "Oh, I am getting some latin1 bytes, and I will be putting them into a latin1 column, so I don't need to transform them."
In go two latin1 characters Ã© (hex C3A9). This is called Mojibake.
If you do SET NAMES utf8 and SELECT the text, you will "see" Ã© and it will be 4 bytes (hex C383C2A9)!
Bottom line: Your client encoding was really utf8, so you should have said SET NAMES utf8 (or utf8mb4). Confused? Welcome to the club.
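A way to see this for yourself against the temptest table from the question (a sketch, assuming one row was inserted each way as described above):
SET NAMES utf8;  -- match what the terminal is really sending
SELECT mystring, HEX(mystring) FROM temptest;
-- E9   displays as é  (stored correctly as one latin1 character)
-- C3A9 displays as Ã© (utf8 bytes mis-stored as two latin1 characters)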

Searching emoji from varchar column returns different record

I'm using MySQL 5.6. The DB character set is utf8mb4.
When I search emoji as below, I got unexpected results.
mysql> SELECT id, hex(title) FROM tags WHERE title = 0xF09F9886;
+-----+------------+
| id | hex(title) |
+-----+------------+
| 165 | F09F9886 |
| 166 | F09F9884 |
+-----+------------+
It should return only id=165. Does anyone know why this happens?
I found how to fix it. It was a collation problem. I was using the default collation, which I presume is utf8mb4_general_ci; that collation gives all supplementary characters (including emoji) the same weight, so different emoji compare as equal. When I changed it to utf8mb4_bin, MySQL returned the right result.
You can change the collation as below.
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
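If altering the whole table is not an option, a binary collation can also be forced for a single query (a sketch against the tags table from the question, not part of the original answer):
SELECT id, HEX(title)
FROM tags
WHERE title = CONVERT(0xF09F9886 USING utf8mb4) COLLATE utf8mb4_bin;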

MySQL - incorrect string value on conversion from latin1 to utf8

So we originally had latin1 for our MySQL database (a long long time ago) and we are trying to convert to UTF8 before a more global outreach, but I'm having issues with the transition. Here's my MySQL:
/* First set as latin1 */
SET NAMES 'latin1';
/* We must change things to blob and then back again */
ALTER TABLE `address` CHANGE line_1 line_1 BLOB;
ALTER TABLE `address` CONVERT TO CHARACTER SET utf8;
ALTER TABLE `address` CHANGE line_1 line_1 VARCHAR(64);
And the error we are getting:
Incorrect string value: '\xF6gberg...' for column 'line_1' at row 7578
ALTER TABLE `address` CHANGE line_1 line_1 VARCHAR(64)
The method we are using is basically described through here:
http://www.percona.com/blog/2013/10/16/utf8-data-on-latin1-tables-converting-to-utf8-without-downtime-or-double-encoding/
Any ideas would be great. (Also, since I'm not an expert in MySQL, I'm not sure what kind of data you would need, so lemme know if you need anything additional.)
Update
I've tried
SET NAMES 'utf8';
SET NAMES 'utf8mb4';
And I tried using utf8mb4 as was described below. After switching to utf8mb4 (which I'll likely keep), the alteration of the address table still produced the same problem.
Update 2
So I tried looking at converting the string itself to see what's happening and noticed something super weird:
mysql> select line_1 from address where line_1 like '%berg%';
+------------------------+
| line_1 |
+------------------------+
| H�bergsgatan 97 |
+------------------------+
mysql> select CONVERT(line_1 USING utf8) from address where line_1 like '%berg%';
+----------------------------+
| CONVERT(line_1 USING utf8) |
+----------------------------+
| NULL |
+----------------------------+
mysql> select CONVERT(line_1 USING utf8mb4) from address where line_1 like '%berg%';
+-------------------------------+
| CONVERT(line_1 USING utf8mb4) |
+-------------------------------+
| NULL |
+-------------------------------+
mysql> select CONVERT(line_1 USING latin1) from address where line_1 like '%berg%';
+------------------------------+
| CONVERT(line_1 USING latin1) |
+------------------------------+
| Högbergsgatan 97 |
+------------------------------+
So it seems like utf8 isn't the proper encoding for this? o_O As I'm working with addresses I was able to look it up, and it seems the address is in Stockholm and is supposed to be "Högbergsgatan 97", which matches latin1. I tried the Swedish character encoding, but that seems to have failed as well:
mysql> select CONVERT(line_1 USING swe7) from address where addressid = 11065;
+----------------------------+
| CONVERT(line_1 USING swe7) |
+----------------------------+
| H?gbergsgatan 97 |
+----------------------------+
So I'm trying to see what I can do to rectify this.
Also, note that I had forgotten earlier to state that I'm using MySQL 5.6 (if that makes any difference)
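A way to narrow this down is to inspect the raw stored bytes (a sketch, reusing the address table and addressid from above). A single F6 byte is latin1 'ö', whereas a UTF-8 'ö' would be the two bytes C3 B6, and a lone F6 is not valid UTF-8, which would explain both the 'Incorrect string value' error and the NULLs from CONVERT(... USING utf8):
SELECT addressid, line_1, HEX(line_1)
FROM address
WHERE addressid = 11065;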

MySQL table file names encoding

Does anyone know what encoding is used here: #T0#g0#x0#y0#w0#u0#p0#q0#o0.MYD?
This is a file name corresponding to a table whose name uses Cyrillic letters.
This is MySQL's internal filename encoding, documented here.
You can convert it back to normal utf8 with a query like:
mysql> SELECT CONVERT(_filename'#T0#g0#x0#y0#w0#u0#p0#q0#o0' USING utf8);
+------------------------------------------------------------+
| CONVERT(_filename'#T0#g0#x0#y0#w0#u0#p0#q0#o0' USING utf8) |
+------------------------------------------------------------+
| Настройки |
+------------------------------------------------------------+
1 row in set (0.00 sec)

When I write special latin1 characters to an utf-8 encoded mysql table, is that data lost?

When I write special latin1 characters, for example
á, é, ã, ê
to a utf-8 encoded mysql table, is that data lost?
The charset for that table is utf-8.
Is there any way to get those latin1-encoded rows back so I can convert them to utf-8 and write them back (this time in the right way)?
Update
I think I wasn't very specific about what I meant by "data". By data I mean the special characters, not the row.
When selecting, I still get the row and the fields, but with '?' instead of the special latin1 characters. Is it possible to recover those '?' and transform them into the right utf8 characters?
If the whole database (or a whole table) is affected, you can first verify that it is a Latin1-as-UTF8 charset problem with SET NAMES Latin1:
mysql> select txt from tbl;
+-----------+
| txt |
+-----------+
| QuÃ©bec |
| QuÃ©bec |
+-----------+
2 rows in set (0.00 sec)
mysql> SET NAMES Latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select txt from tbl;
+---------+
| txt |
+---------+
| Québec |
| Québec |
+---------+
2 rows in set (0.00 sec)
If this verifies, i.e. you get the desired data when using default charset Latin-1, then you can dump the whole table forcing --default-character-set=latin1 so that a file will be created with the correct data, albeit with the wrong charset specification.
But now you can replace the header row stating
/*!40101 SET NAMES latin1 */;
with UTF8. Reimport the database and you're done.
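For reference, a sketch of that dump-and-reimport procedure from the command line (mydb and tbl are placeholder names):
mysqldump --default-character-set=latin1 mydb tbl > dump.sql
# edit dump.sql: change /*!40101 SET NAMES latin1 */; to utf8
mysql mydb < dump.sql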
If only some rows are affected, then it is much more difficult:
SELECT txt, CAST(CAST(txt AS CHAR CHARACTER SET Latin1) AS BINARY) AS utf8 FROM tbl;
+-----------+---------+
| txt | utf8 |
+-----------+---------+
| QuÃ©bec | Québec |
+-----------+---------+
1 row in set (0.00 sec)
...but you have the problem of locating the affected rows. Some of the code points you might find with
WHERE txt LIKE '%Ã%'
but for the others, you'll have to sample manually.
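Once the affected rows are located, the CAST trick above can be applied in place (a sketch, assuming the same tbl/txt names and that those rows really are Latin1-as-UTF8; try it on a copy of the table first):
UPDATE tbl
SET txt = CONVERT(CAST(CONVERT(txt USING latin1) AS BINARY) USING utf8)
WHERE txt LIKE '%Ã%';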
The data is not lost. See this SQLFiddle example
The additional affected rows (any row containing a byte outside the ASCII range) can be found using the following:
SELECT column
FROM table
WHERE NOT HEX(column) REGEXP '^([0-7][0-9A-F])*$'