Finding values case-insensitively with emojis - MySQL

I have a table with a varchar column that needs to store text values containing emojis:
CREATE TABLE `my_table` (
`id` bigint(11) NOT NULL AUTO_INCREMENT,
`value` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `value_idx` (`value`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Now I need to do selects on this table to find all values starting with a given prefix. The selects must be case insensitive and must match emojis as well. So far I have found four options, all of which have trade-offs:
I can use the utf8mb4_unicode_ci collation and do selects like
select * from my_table where value like 'prefix%'
It will find all values starting with the prefix, ignoring case, but will not find anything if the prefix contains emojis.
I can set the collation to utf8mb4_bin, and my selects will find values if the prefix contains emojis, but they will be case sensitive.
I can do
select * from my_table where LOWER(value) like 'prefix%'
and it will work case-insensitively and with emojis, but it will not use the index.
And finally, I can save all values in lower case and use the utf8mb4_bin collation, but saving in lower case is also a trade-off.
Is there a solution that lets me do LIKE selects that ignore the case of the prefix and still allow emojis in the prefix?
UPD: I have no problem storing emojis; the problem is finding them with a LIKE select while keeping a case-insensitive collation.

The solution is to use MySQL 5.6+ with the utf8mb4_unicode_520_ci collation, which does not treat all 4-byte characters as equal.

Related

Laravel: How to create table with case sensitive column (binary)?

I would like to use base62 unique identifiers, and my problem is that the columns are not case sensitive, so F1 is the same as f1 when I search for it. In MySQL I would simply do
CREATE TABLE USERS
(
USER_NAME VARCHAR(10) BINARY
);
So in Laravel it should look like
$table->string('base62_id', 10)->binary();
However, I don't think ->binary() exists in Laravel for this purpose. So how would I do that?
I understand this question is old, but in case anyone stumbles upon it: at the time of writing, Laravel is at version 8 and this is valid:
$table->string("case_sensitive_id")->charset("utf8")->collation("utf8_bin")->nullable();
This will achieve case sensitivity without any sort of alter statements.
So this is the answer:
DB::statement("ALTER TABLE `mytable` ADD `base62_id` VARCHAR( 10 ) CHARACTER SET utf8 COLLATE utf8_bin UNIQUE AFTER `id` , ADD INDEX ( `base62_id` )");
The key is to use
CHARACTER SET utf8 COLLATE utf8_bin
to make it case sensitive.
Thanks to my source: http://blog.birdhouse.org/2010/10/24/base62-urls-django/comment-page-1/

MySQL Case Sensitivity (or otherwise, how to store passwords correctly in MySQL)

CAUSE:
I have a table whose columns are all collated as utf8mb4_unicode_ci:
CREATE TABLE IF NOT EXISTS `users` (
`user_id` int(8) NOT NULL AUTO_INCREMENT,
`username` varchar(100) NOT NULL,
`pass_word` varchar(512) NOT NULL ,
...etc etc...
PRIMARY KEY (`user_id`),
UNIQUE KEY `email_addr` (`email_addr`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 AUTO_INCREMENT=989 ;
...Including the column storing the password hash (generated from password_hash) such as $2y$14$tFpExwd2TXm43Bd20P4nkMbL1XKxwF.VCpL.FXeVRaUO3FFxGJ4Di.
But I find that, because the column is case insensitive, a hash of $2y$14$tFpExwd2tXm43Bd20P4NKmbL1XKxwF.VCpL.FxEVRaUO3FFxGJ4DI would still allow access.
This means that storing the data in a case-insensitive manner makes potentially hundreds of collisions possible. Not good.
ISSUE:
Is there a way of forcing MySQL to treat the pass_word column as case sensitive when doing comparisons? I want to avoid editing every occurrence of the PHP/SQL querying, and instead simply set the database table column to compare in a case-sensitive manner by default.
The utf8mb4 character set does not give me any _cs options, and the only non-_ci option appears to be utf8mb4_bin.
So simple questions:
Does the UTF8mb4_bin character set & collation on MySQL treat standard comparisons case sensitively? [yes]
Does utf8mb4_bin suit what I want to do? Should I use another set, and if so, why?
Are there any issues in storing password_hash outputs in a MySQL utf8mb4_bin column?
Does this approach conveniently sidestep the need to edit the query SQL of each login query? Can I change the column type and then move on?
EDIT
As detailed by nj_, this is a silly issue that is not an issue at all, because the value of pass_word is never compared directly when logging in.
... It's been a long day.
If you're really that worried about the potential 2^55 collisions in your 62^55 address space, you can simply change the column type to BLOB, which is always case-sensitive.
CREATE TABLE IF NOT EXISTS `users` (
`user_id` int(8) NOT NULL AUTO_INCREMENT,
`username` varchar(100) NOT NULL,
`pass_word` BLOB NOT NULL ,
...etc etc...
PRIMARY KEY (`user_id`),
UNIQUE KEY `email_addr` (`email_addr`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 AUTO_INCREMENT=989 ;
Example:
INSERT INTO `users` (..., `pass_word`) VALUES (..., 'AbC');
SELECT * FROM `users` WHERE `pass_word` = 'AbC' LIMIT 0,1000; -- 1 hit
SELECT * FROM `users` WHERE `pass_word` = 'abc' LIMIT 0,1000; -- 0 hits
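For a sense of scale of the collision estimate above, here is a small sketch in plain Python (not MySQL). It assumes that lowercasing is a fair stand-in for a case-insensitive comparison of ASCII text, and it uses the two hashes from the question; the helper names `ci_equal` and `case_variants` are made up for this illustration.

```python
# Illustrative only: plain Python, not MySQL. Lowercasing stands in for a
# case-insensitive (_ci) comparison of ASCII text.
stored = "$2y$14$tFpExwd2TXm43Bd20P4nkMbL1XKxwF.VCpL.FXeVRaUO3FFxGJ4Di"
altered = "$2y$14$tFpExwd2tXm43Bd20P4NKmbL1XKxwF.VCpL.FxEVRaUO3FFxGJ4DI"


def ci_equal(a: str, b: str) -> bool:
    """Compare two ASCII strings while ignoring letter case."""
    return a.lower() == b.lower()


def case_variants(s: str) -> int:
    """Count the strings that differ from s only in letter case (s included)."""
    letters = sum(1 for ch in s if ch.isalpha())
    return 2 ** letters


print(ci_equal(stored, altered))  # the two hashes collide when case is ignored
print(stored == altered)          # an exact comparison tells them apart
print(case_variants(stored))      # size of the case-insensitive collision class
```

Each letter doubles the number of case variants, so the collision class is large in absolute terms but still tiny next to the space of all possible hashes.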
Case sensitivity is no problem in this case, because you cannot verify the password directly with SQL anyway. A correctly salted password hash cannot be searched for in the database. Search by username only and extract the stored hash from the database:
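$sql = 'SELECT * FROM users WHERE username = ?';
// prepare() returns a statement object; bind and execute on that object.
$stmt = $db->prepare($sql);
$stmt->bind_param('s', $_POST['username']);
$stmt->execute();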
Afterwards you can extract the hash from the row and check the entered password against the found hash with the password_verify() function:
// Check if the hash of the entered login password, matches the stored hash.
// The salt and the cost factor will be extracted from $existingHashFromDb.
$isPasswordCorrect = password_verify($password, $existingHashFromDb);
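The same flow (look the user up by name only, then verify the password in application code) can be sketched in Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for MySQL, `hashlib.pbkdf2_hmac` stands in for PHP's password_hash(), and the separate `salt` column, the `ITERATIONS` value and the function names are hypothetical.

```python
import hashlib
import hmac
import os
import sqlite3

ITERATIONS = 100_000  # hypothetical cost factor for this sketch


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a salted hash; a stand-in for PHP's password_hash()."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def check_login(db: sqlite3.Connection, username: str, password: str) -> bool:
    """Look the user up by name only, then verify the password in code."""
    row = db.execute(
        "SELECT salt, pass_word FROM users WHERE username = ?", (username,)
    ).fetchone()
    if row is None:
        return False
    salt, stored_hash = row
    # Constant-time, byte-exact comparison: the stored hash is never matched
    # by the database, so the column's collation plays no part here.
    return hmac.compare_digest(hash_password(password, salt), stored_hash)


# In-memory SQLite database as a stand-in for the MySQL users table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT UNIQUE, salt BLOB, pass_word BLOB)")
salt = os.urandom(16)
db.execute(
    "INSERT INTO users VALUES (?, ?, ?)",
    ("alice", salt, hash_password("s3cret", salt)),
)

print(check_login(db, "alice", "s3cret"))  # correct password
print(check_login(db, "alice", "S3CRET"))  # same letters, different case
```

Because the query filters on username alone, the pass_word column never appears in a WHERE clause, which is why its collation does not affect login.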

MySQL won't properly GROUP BY on emojis

I'm storing single emojis in a CHAR column in a MySQL database. The column's encoding is utf8mb4.
When I run this aggregate query, MySQL won't group by the emoji characters. It instead returns a single row with a single emoji and the count of all the rows in the database.
SELECT emoji, count(emoji) FROM emoji_counts GROUP BY emoji
Here's my table definition:
CREATE TABLE `emoji_counts` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`emoji` char(1) DEFAULT '',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Is there some special Unicode behavior I'll have to account for?
Turns out I needed to specify an expanded collation in the query, namely utf8mb4_unicode_520_ci.
This worked:
SELECT emoji, count(emoji) FROM emoji_counts group by emoji collate utf8mb4_unicode_520_ci;
EDIT: That collation isn't available on some server configs (including ClearDB's)... utf8mb4_bin also appears to work.
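Why the grouping collapses can be shown with a toy model in Python (not MySQL). It assumes, as I understand the documented behaviour of collations such as utf8mb4_general_ci and utf8mb4_unicode_ci, that every supplementary character (above U+FFFF) gets the same weight, 0xFFFD, while collations like utf8mb4_unicode_520_ci or utf8mb4_bin keep such characters distinct. The key functions and `group_counts` are invented for this sketch and are not real collation implementations.

```python
from collections import Counter

REPLACEMENT_WEIGHT = 0xFFFD  # assumed weight shared by all supplementary characters


def key_all_supplementary_equal(s: str) -> tuple:
    """Toy model of a collation that gives every character above U+FFFF one weight."""
    return tuple(
        REPLACEMENT_WEIGHT if ord(ch) > 0xFFFF else ord(ch) for ch in s.lower()
    )


def key_by_code_point(s: str) -> tuple:
    """Toy model of a collation that keeps supplementary characters distinct."""
    return tuple(ord(ch) for ch in s.lower())


def group_counts(values, key):
    """Count values per group, the way GROUP BY would under the given key."""
    return Counter(key(v) for v in values)


rows = ["😀", "😀", "🎉", "🚀", "🚀", "🚀"]

print(len(group_counts(rows, key_all_supplementary_equal)))  # one group for all rows
print(len(group_counts(rows, key_by_code_point)))            # one group per emoji
```

Under the first key every emoji lands in the same group, which matches the single row with the total count described in the question.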

Case sensitive uniqueness and case insensitive search

I have a table with a field a using encoding utf8 and collation utf8_unicode_ci:
CREATE TABLE dictionary (
a varchar(128) NOT NULL
) DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The collation utf8_unicode_ci is required for an efficient case-insensitive search with expansions and ligatures. For this purpose I have the index:
CREATE INDEX a_idx on dictionary(a);
Problem: Additionally, I must ensure that all stored values of the field a are unique, but in a case-sensitive way.
German example: "blühen" and "Blühen" must both be stored in the table. But adding "Blühen" a second time should not be possible.
Is there built-in functionality in MySQL to have both?
Unfortunately it seems not to be possible to set the collation for the index in MySQL 5.1.
Solutions to this problem include a uniqueness check before insert or a trigger. Both are far less elegant than using a unique index.
Well, there are 2 ways to accomplish this:
using _bin collation
change your datatype to VARBINARY
Case 1: using _bin collation
Create your table as follows:
CREATE TABLE `dictionary` (
`a` VARCHAR(128) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
UNIQUE KEY `idx_un_a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Please note:
the datatype of the column a
the UNIQUE index on column a
Case 2: using VARBINARY datatype
Create your table as follows:
CREATE TABLE `dictionary` (
`a` VARBINARY(128) NOT NULL,
UNIQUE KEY `idx_uniq_a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Please note:
the new datatype VARBINARY
the UNIQUE index on column a
Both of the above will serve your purpose. That is, they will both allow values like 'abc', 'Abc', 'ABC', 'aBc' etc., but will not allow the same value again if the case matches.
Please note that using a "_bin" collation is different from using a binary datatype. Feel free to refer to the following links:
The BINARY and VARBINARY datatypes
The _bin and binary Collations
I hope the above helps!
You can achieve this by adding an additional column, a_lower.
CREATE TABLE `dictionary` (
`a` VARCHAR(128) NOT NULL,
`a_lower` VARCHAR(128) NOT NULL,
UNIQUE KEY `idx_un_a_lower` (`a_lower`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Insert that goes like this:
insert into dictionary set a = x, a_lower = lower(x);
Select can now be case-insensitive:
select * from dictionary where a_lower like lower('search_term%')
Note that an indexed column can hold at most 191 characters here. MySQL allows an index key of at most 767 bytes (the InnoDB limit for the COMPACT and REDUNDANT row formats), so with the utf8mb4 character set, where a character can take up to 4 bytes, that is 767 / 4 = 191.75, rounded down to 191 characters. With the utf8 character set, which takes at most 3 bytes per character, the column can hold at most 767 / 3 = 255.67, rounded down to 255 characters.
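The arithmetic in the paragraph above can be checked with a few lines of Python. The 767-byte limit is taken from the text; the function name `max_indexed_chars` is made up for this sketch.

```python
# Arithmetic from the paragraph above; the 767-byte limit is taken from the text.
MAX_INDEX_BYTES = 767


def max_indexed_chars(bytes_per_char: int, limit: int = MAX_INDEX_BYTES) -> int:
    """Largest number of characters whose worst-case encoding fits in the limit."""
    return limit // bytes_per_char


print(max_indexed_chars(4))  # utf8mb4: up to 4 bytes per character -> 191
print(max_indexed_chars(3))  # utf8: up to 3 bytes per character -> 255
```

Integer division does the rounding down, because a partially indexed character does not count.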
SELECT * FROM dictionary WHERE a COLLATE utf8_general_ci = 'abc'
Try this. It worked for me.

Can MySQL automatically specify `_utf8` for inserts to UTF-8 columns?

I have a table like this, where one column is latin1, the other is UTF-8:
Create Table: CREATE TABLE `names` (
`name_english` varchar(255) character set latin1 NOT NULL,
`name_chinese` varchar(255) character set utf8 default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
When I do an insert, I have to type _utf8 before values being inserted into UTF-8 columns:
insert into names set name_english = "hooey", name_chinese = _utf8 "鬼佬";
However, since MySQL should know that name_chinese is a UTF-8 column, it should be able to know to use _utf8 automatically.
Is there any way to tell MySQL to use _utf8 automatically, so that when I'm programmatically making prepared statements, I don't have to worry about including it with the right parameters?
Why not use UTF-8 for the whole table?