MySQL UTF-8 Collation not working as I would expect [duplicate]

Possible Duplicate:
Looking for case insensitive MySQL collation where “a” != “ä”
I'm struggling with this utf8 nonsense. I created a test table:
CREATE TABLE `test` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(20) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;
I insert a single row:
INSERT INTO `test`(`name`) VALUES ('Cryptïc');
I query against the table:
SELECT `name` FROM `test` WHERE `name` LIKE 'Cryptic';
I get result set:
+---------+
| name |
+---------+
| Cryptïc |
+---------+
i should not equal ï. A little help?

Use utf8_bin instead of utf8_general_ci. Be aware that utf8_bin compares by byte value, so comparisons also become case sensitive.
With utf8_general_ci, similar characters (like i and ï) are treated as the same character in comparisons and sorting. The comparison is also case insensitive (hence the _ci), which means that i and I are also treated the same.
Other collations, like utf8_unicode_ci, sort better but still 'fail' on comparisons like this one.
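As a rough illustration of the difference (written in Python rather than SQL, and only an analogy: MySQL collations use their own weight tables, not this algorithm), the sketch below compares two strings after folding accents and case, and then again by raw UTF-8 bytes. The helper names are made up for this example.

```python
import unicodedata


def fold_accents_and_case(text):
    """Strip combining marks and fold case, roughly like an _ai_ci collation."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def equal_folded(a, b):
    """Accent-insensitive, case-insensitive comparison (analogy for *_general_ci)."""
    return fold_accents_and_case(a) == fold_accents_and_case(b)


def equal_binary(a, b):
    """Byte-for-byte comparison of the UTF-8 encodings (analogy for *_bin)."""
    return a.encode("utf-8") == b.encode("utf-8")


if __name__ == "__main__":
    print(equal_folded("Cryptïc", "Cryptic"))   # True: the accent is ignored
    print(equal_binary("Cryptïc", "Cryptic"))   # False: the bytes differ
    print(equal_binary("Cryptic", "cryptic"))   # False: binary is case sensitive too
```

The last line shows the trade-off of a binary collation: it separates i from ï, but it also separates C from c.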

Related

How do I do a case-insensitive MySQL query when columns use utf8mb4_bin collation?

I have a first column typed as varchar(190) that is using utf8mb4_bin collation.
When I run the following query, I get back only the rows where first is exactly Joe, as expected:
SELECT first, last FROM person WHERE first = 'Joe'
What I would like to get is Joe, joe, jOe, joE, jOE, JoE, JOE, and JOe. Basically a case-insensitive search on a case-sensitive field.
How do I do this?
CREATE TABLE `person` (
`id` int NOT NULL AUTO_INCREMENT,
`first` varchar(190) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL,
`middle` varchar(190) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL,
`last` varchar(190) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL,
`job` varchar(190) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_UNIQUE` (`id`),
UNIQUE KEY `names_unq` (`first`,`middle`,`last`,`job`),
KEY `index_job` (`job`),
KEY `index_first` (`first`,`job`),
KEY `index_first_last` (`first`,`last`,`job`),
KEY `index_middle` (`middle`,`job`),
KEY `index_last` (`job`,`last`)
) ENGINE=InnoDB AUTO_INCREMENT=99750823 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
You can specify a collation in a string comparison expression to override the collation used in the comparison. Read https://dev.mysql.com/doc/refman/8.0/en/charset-literal.html for more details on this.
CREATE TABLE `person` (
`first` text COLLATE utf8mb4_bin,
`last` text COLLATE utf8mb4_bin
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
mysql> select first, last from person where first = 'Joe';
+-------+-------+
| first | last |
+-------+-------+
| Joe | Grant |
+-------+-------+
mysql> select first, last from person where first = 'joe';
Empty set (0.00 sec)
mysql> select first, last from person where first = 'joe' collate utf8mb4_unicode_ci;
+-------+-------+
| first | last |
+-------+-------+
| Joe | Grant |
+-------+-------+
Use "collate utf8mb4_unicode_ci": it overrides the column's utf8mb4_bin collation for that one comparison, so the match becomes case insensitive.
The simplest way to do this is to use UPPER().
NB: This is not optimised (unless there is an index on UPPER(first), which is unlikely), but it is a quick fix for simple queries. I do not recommend it for large-scale or production queries without testing the speed/cost.
SELECT first, last FROM person WHERE UPPER(first) = 'JOE';
If you are matching a parameter, you might need to use UPPER() on both sides, as in
SELECT first, last FROM person WHERE UPPER(first) = UPPER(#name);

Storing emojies in mysql

I would like to store emojis in MySQL (version 5.7.18).
My table structure looks like this:
CREATE TABLE `message_message` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`message` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
`created_at` datetime(6) NOT NULL,
`is_read` tinyint(1) NOT NULL,
`chat_id` int(11) NOT NULL,
PRIMARY KEY (`id`)) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
I am trying to save emojis in the message field only, and I can see that they get saved as question marks (?☺️???).
Is there a way for me to read these values directly from the table? (Actually, I would like to see the emojis in the table viewer.) I am using SequelPro to view the table, if that matters.
The exact MySQL query that I am running:
INSERT INTO message_message(message, created_at, msg_sender_id, chat_id, is_read) VALUES ('💁👍', UTC_TIME(), 110, 164, False)
If I run select query on this table, it looks like this:
+---------------------------------------------------------------------+
| message |
+---------------------------------------------------------------------+
| 😁 |
| 😁💁👍 |
| 💁👍 |
| 💁👍 |
| 💁👍 |
| 💁👍
Does this look like the data is stored correctly?
Apparently, your data is stored correctly.
You provided this string F09F9281F09F918D as a result for SELECT hex(message) for the data inserted with
INSERT INTO message_message(message, created_at, msg_sender_id, chat_id, is_read) VALUES ('💁👍', UTC_TIME(), 110, 164, False)
And if one checks the UTF8 for both emojis:
F0 9F 92 81 for 💁
F0 9F 91 8D for 👍
then you would find that those exactly match with what you already have.
This means your code is correct. If you still have problems in your GUI application, the cause is that application's configuration or its Unicode support, which is somewhat off topic for Stack Overflow.
References:
https://unicode-table.com/en/1F481/
https://unicode-table.com/en/1F44D/
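The same check can be reproduced outside MySQL. The following is a minimal Python sketch (Python is used only for illustration; the variable and function names are assumptions of this example) that encodes the two code points from the reference pages above and compares the result with the hex string returned by SELECT hex(message).

```python
# Code points taken from the two reference pages above.
first_emoji = "\U0001F481"   # U+1F481
second_emoji = "\U0001F44D"  # U+1F44D

# Value reported by SELECT hex(message) in the question.
stored_hex = "F09F9281F09F918D"


def utf8_hex(text):
    """Return the UTF-8 encoding of text as an upper-case hex string."""
    return text.encode("utf-8").hex().upper()


if __name__ == "__main__":
    print(utf8_hex(first_emoji))    # F09F9281
    print(utf8_hex(second_emoji))   # F09F918D
    # True means the stored bytes are exactly the UTF-8 form of both emojis.
    print(utf8_hex(first_emoji + second_emoji) == stored_hex)
    # Each emoji needs 4 bytes in UTF-8, which MySQL's 3-byte utf8 cannot hold.
    print(len(first_emoji.encode("utf-8")))
```

If the stored hex had instead been a run of 3F bytes (the ASCII question mark), the data would have been damaged on the way in rather than only displayed incorrectly.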
I think your table collation must be properly configured too:
CREATE TABLE `message_message` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`message` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
`created_at` datetime(6) NOT NULL,
`is_read` tinyint(1) NOT NULL,
`chat_id` int(11) NOT NULL,
PRIMARY KEY (`id`)) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
Make sure your table uses CHARACTER SET utf8mb4 COLLATE utf8mb4_bin. To update this (in your case), the query would be:
ALTER TABLE message_message CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin
Make sure your database's default character set is utf8mb4. To check this, the query would be:
SELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = "DBNAME";

Best type of indexing when there is LIKE clause [duplicate]

This question already has answers here:
improve performance for LIKE clause
Here is my query:
SELECT name, usage_guidance, total_used_num
FROM tags
WHERE
( name LIKE CONCAT('%', ?, '%') OR
usage_guidance LIKE CONCAT(?, '%') )
AND name NOT IN ($in)
ORDER BY name LIKE CONCAT('%', ?, '%') DESC, name ASC
LIMIT 6
Which one is the best index?
tags(name,usage_guidance)
tags(usage_guidance,name)
tags(name)
tags(usage_guidance)
Or is there a better option? When LIKE comes in, I get confused about creating indexes, because LIKE '%something' can never benefit from an index. The query above also combines AND, OR and IN, which is why I'm asking for your opinion.
Here is my table structure:
CREATE TABLE `tags` (
`id` int(11) NOT NULL,
`name` varchar(50) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`usage_guidance` varchar(150) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`description` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`parent_id` int(11) UNSIGNED DEFAULT NULL,
`related` int(11) UNSIGNED DEFAULT NULL,
`total_used_num` int(11) UNSIGNED NOT NULL,
`date_time` int(11) UNSIGNED NOT NULL
)
ENGINE=InnoDB DEFAULT CHARSET=latin1;
And I'm trying to make an autocomplete suggestion query.
Yep, what you have here is a database killer.
A B-tree index can be used for column comparisons in expressions that
use the =, >, >=, <, <=, or BETWEEN operators. The index also can be
used for LIKE comparisons if the argument to LIKE is a constant string
that does not start with a wildcard character.
Source: http://dev.mysql.com/doc/refman/5.7/en/index-btree-hash.html
So that means your first LIKE, the one that starts with %, cannot use an index, and on top of that you have two LIKEs connected with an OR. If that's not enough, you have thrown in a NOT IN comparison as well.
Fortunately, the second LIKE expression isn't so bad, because it doesn't start with a wildcard. So your best hope is to create a composite index on (usage_guidance, name).
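To see why a trailing wildcard can use an ordered index while a leading wildcard cannot, here is a small Python sketch. It treats a sorted list as a stand-in for a B-tree index, which is an analogy and not how InnoDB is actually implemented; the function names and sample tags are invented for the example.

```python
import bisect


def prefix_search(sorted_names, prefix):
    """Like `name LIKE 'prefix%'`: two binary searches bound a contiguous range."""
    low = bisect.bisect_left(sorted_names, prefix)
    # Appending the highest code point gives an upper bound for the prefix range.
    high = bisect.bisect_left(sorted_names, prefix + "\U0010FFFF")
    return sorted_names[low:high]


def substring_search(names, needle):
    """Like `name LIKE '%needle%'`: ordering does not help, every row is checked."""
    return [name for name in names if needle in name]


if __name__ == "__main__":
    tags = sorted(["java", "javascript", "mysql", "php", "python"])
    print(prefix_search(tags, "java"))     # touches only the matching range
    print(substring_search(tags, "sql"))   # scans the whole list
```

The prefix search does work proportional to the logarithm of the table size plus the number of matches, while the substring search always does work proportional to the whole table.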
If you could post your SHOW CREATE TABLE and a few lines of sample data + the expected output, we might get an idea if there is a way to rewrite this query.

MySQL returns incorrect UTF8 extended characters in some cases only

Note: In the following question you may see ? or blocks instead of characters, this is because you don't have the appropriate font. Please ignore this.
Background
I have a table with data structured as follows:
CREATE TABLE `decomposition_dup` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`parent` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
`structure` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
`child` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
KEY `parent` (`parent`),
KEY `child` (`child`),
KEY `parent_2` (`parent`,`child`)
) ENGINE=InnoDB AUTO_INCREMENT=211929 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
And some example data:
INSERT INTO `decomposition_dup` (`id`, `parent`, `structure`, `child`) VALUES
(154647, '锦', 'a', '钅'),
(154648, '锦', 'a', '帛'),
(185775, '钅', 'd', '二'),
(185774, '钅', 'd', '㇟'),
(21195, '钅', 'd', '𠂉'),
(21178, '⻐', 'd', '乇'),
(21177, '⻐', 'd', '𠂉');
And the charsets are all set correctly.
Problem
It is very important to note that:
154647, 185775, 185774 & 21195 refer to this character: http://unicode.scarfboy.com/?s=%E9%92%85
21178 and 21177 refer to this character: http://unicode.scarfboy.com/?s=%E2%BB%90
As you can see, they are different characters. However, in some cases they are treated as the same character.
Case 1
When I run the following query, it only returns the correct child (i.e. doesn't return the similar-looking but different character child):
SELECT *
FROM decomposition_dup
WHERE parent = '锦'
This is correct behaviour.
Case 2
However, when I run the following query using 钅 (http://unicode.scarfboy.com/?s=%E9%92%85) it returns both the similar characters:
SELECT *
FROM decomposition_dup
WHERE parent = '钅'
This should only return 185775, 185774 & 21195.
Case 3
And when I run the following query using ⻐ (http://unicode.scarfboy.com/?s=%E2%BB%90) it also returns both the similar characters:
SELECT *
FROM decomposition_dup
WHERE parent = '⻐'
This should only return 21178 and 21177.
Case 4
If I replace = with LIKE for the broken queries (i.e. Case 2 and Case 3), they return correctly.
For example, the following query is the same as Case 3 but using LIKE:
SELECT *
FROM decomposition_dup
WHERE parent LIKE '⻐'
This returns the correct characters but slows down the query.
Question
Is this a bug in MySQL or is there something that I am overlooking when querying for UTF8 extended characters?
If you want them to be the same, set the COLLATION of the columns to utf8mb4_unicode_ci or utf8mb4_unicode_520_ci.
If you want them to be different, use utf8mb4_general_ci, instead:
mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
-> CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_general_ci AS general;
+---------+
| general |
+---------+
| 0 |
+---------+
mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
-> CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_unicode_ci AS unicode;
+---------+
| unicode |
+---------+
| 1 |
+---------+
mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
-> CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_unicode_520_ci AS unicode_520;
+-------------+
| unicode_520 |
+-------------+
| 1 |
+-------------+
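To confirm that the two byte sequences passed to UNHEX above really are different code points, so that any equality reported by MySQL comes from the collation and not from the data, here is a minimal Python sketch (Python is used only for illustration; the variable names are assumptions of this example).

```python
# The same hex strings that the queries above pass to UNHEX().
first_char = bytes.fromhex("e99285").decode("utf-8")
second_char = bytes.fromhex("e2bb90").decode("utf-8")


def code_point(char):
    """Format a single character as U+XXXX."""
    return f"U+{ord(char):04X}"


if __name__ == "__main__":
    print(code_point(first_char))     # U+9485
    print(code_point(second_char))    # U+2ED0
    print(first_char == second_char)  # False: the stored data is different
    # Both characters fit in 3 bytes of UTF-8.
    print(len(first_char.encode("utf-8")), len(second_char.encode("utf-8")))
```

Since the code points differ, the 1 returned under utf8mb4_unicode_ci and utf8mb4_unicode_520_ci reflects those collations giving both characters the same weight.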
From what I can make out, the problem lies on the SQL side of things. With some research you'll see that
MySQL's utf8 permits only the Unicode characters that can be
represented with 3 bytes in UTF-8.
So it might be the characters you are using in these queries.

mysql match against not return case insensitive results

I have two tables:
CREATE TABLE IF NOT EXISTS `test1` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`bucket_id` int(10) unsigned NOT NULL COMMENT 'folder this component belongs to',
`test1_name` varchar(81) NOT NULL COMMENT 'Name of this component',
`test1_desc` varchar(1024) NOT NULL COMMENT 'Component Description',
PRIMARY KEY (`id`),
FULLTEXT KEY `test1_search` (`test1_name`,`test1_desc`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=3 ;
CREATE TABLE IF NOT EXISTS `bucket` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`bkt_name` varchar(81) NOT NULL COMMENT 'The name of this bucket',
`bkt_desc` varchar(1024) NOT NULL COMMENT 'A description of this bucket',
`bkt_keywords` varchar(512) DEFAULT NULL COMMENT 'keywords for searches',
PRIMARY KEY (`id`),
FULLTEXT KEY `fldr_search` (`bkt_desc`,`bkt_keywords`,`bkt_name`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=8 ;
Bucket is just a holder while test1 contains all the things that would go into a bucket. For example:
INSERT INTO `bucket` (`id`, `bkt_name`, `bkt_desc`, `bkt_keywords`) VALUES
(1, 'Simpsons', 'The Simpsons Cartoon Family was first successful adult cartoon series', 'Homer, Marge, Lisa and Bart'),
(2, 'Griffins', 'The family from the popular family guy series', 'Peter, Lois, Meg, Chris, Stewie, Brian');
INSERT INTO `test1` (`id`, `bucket_id`, `test1_name`, `test1_desc`) VALUES
(1, 1, 'Homer Simpson', 'Homer the figurative head of the Simpsons Family and is the husband of Marge'),
(2, 2, 'Peter Griffin', 'Peter the figurative head of the Griffin family on the hit TV seriers The family Guy');
Now, using the following query, I want to find all buckets whose name, description or keywords contain the search term "family", or whose components contain the word "family".
So far, this is the query I have, and it's not returning mixed-case results: "Family" is not found while "family" is.
SELECT *
FROM bucket
RIGHT JOIN test1 ON test1.bucket_id = bucket.id
WHERE
bucket.isvisible > 0 AND
MATCH(bucket.bkt_keywords, bucket.bkt_desc, bucket.bkt_name)
AGAINST('family' IN BOOLEAN MODE) OR
MATCH(test1.test1_name, test1.test1_desc)
AGAINST('family' IN BOOLEAN MODE)
I should also add that all text fields have the collation of utf8_general_ci as does the entire table which is MyISAM.
I think your tables do not use utf8_general_ci as collation, but utf8_bin. I was able to reproduce the behaviour you describe after modifying the tables as follows:
ALTER TABLE test1 CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
ALTER TABLE bucket CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
You should perhaps set your tables' collation explicitly:
ALTER TABLE test1 CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE bucket CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
If the above changes anything, I would guess your server or session is actually set to use another collation by default (since the collation is not specified in your table definitions). This can be checked with:
SHOW GLOBAL VARIABLES LIKE 'collation_server';
SHOW SESSION VARIABLES LIKE 'collation_server';
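The answer is apparently to add parentheses around the two MATCH ... AGAINST clauses. AND binds more tightly than OR, so without them the isvisible check applies only to the first MATCH.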
SELECT *
FROM bucket
RIGHT JOIN test1 ON test1.bucket_id = bucket.id
WHERE bucket.isvisible > 0 AND
( MATCH(bucket.bkt_keywords, bucket.bkt_desc, bucket.bkt_name)
AGAINST('family' IN BOOLEAN MODE) OR
MATCH(test1.test1_name, test1.test1_desc)
AGAINST('family' IN BOOLEAN MODE) )
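The effect of the parentheses can be checked with plain booleans. Below is a minimal Python sketch (Python shares SQL's rule that AND binds more tightly than OR; the function and parameter names are invented for this example) where each argument stands for one condition of the WHERE clause.

```python
def where_without_parens(is_visible, bucket_match, test1_match):
    """Mirrors: isvisible > 0 AND bucket MATCH OR test1 MATCH."""
    return is_visible and bucket_match or test1_match


def where_with_parens(is_visible, bucket_match, test1_match):
    """Mirrors: isvisible > 0 AND (bucket MATCH OR test1 MATCH)."""
    return is_visible and (bucket_match or test1_match)


if __name__ == "__main__":
    # A hidden bucket whose component matches the search term.
    row = dict(is_visible=False, bucket_match=False, test1_match=True)
    print(where_without_parens(**row))  # True: the hidden row slips through
    print(where_with_parens(**row))     # False: the visibility check applies
```

The two forms differ only for rows that are not visible but match on the test1 side, which is the case the original query let through.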