Case sensitive search in Django, but ignored in MySQL

I have a field in a Django model for storing a unique (hash) value. It turns out that the database (MySQL/InnoDB) doesn't do a case-sensitive search on this type (VARCHAR), not even if I explicitly tell Django to do a case-sensitive search with Document.objects.get(hash__exact="abcd123"). So "abcd123" and "ABcd123" are both matched, which I don't want.
from django.db import models

class Document(models.Model):
    filename = models.CharField(max_length=120)
    hash = models.CharField(max_length=33)
I can change the hash field to a BinaryField, so that in the DB it becomes a LONGBLOB, and then the search is case sensitive (and works). However, this doesn't seem very efficient to me.
Is there a better way (in Django) to do this, like adding 'utf8 COLLATE'? Or what would be the correct field type in this situation?
(yes, I know I could use PostgreSQL instead..)

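The default collation for MySQL's default character set (latin1, before 8.0) is latin1_swedish_ci, which is case insensitive. Not sure why that is. Note that utf8 on its own still defaults to the case-insensitive utf8_general_ci, so to get case-sensitive comparisons you should create your database with a binary collation, like so:
CREATE DATABASE database_name CHARACTER SET utf8 COLLATE utf8_bin;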

As @dan-klasson mentioned, non-binary string comparison is case insensitive by default; notice the _ci at the end of latin1_swedish_ci, which stands for case-insensitive.
You can, as Dan mentioned, create the database with a case sensitive collation and character set.
You may also be interested to know that you can set a different collation on a single table, or even on a single column (for the same result). You may also change these collations after creation, for instance per table (utf8_bin is used here because a _ci collation such as utf8_general_ci would stay case insensitive):
ALTER TABLE documents__document CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
Additionally, if you would rather not change the database/table charset/collation, Django allows you to run a custom query using the raw method. So you may be able to work around the change by using something like the following, though I have not tested this myself (the collation has to be valid for the character set in use, and %s must stay unquoted so that Django can bind the parameter):
Document.objects.raw("SELECT * FROM documents__document WHERE hash LIKE %s COLLATE latin1_bin", ['abcd123'])
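For what it's worth, here is a minimal sketch of how the raw-query workaround above could be wrapped so that the table, column and collation names are checked before they are spliced into the SQL. The helper name case_sensitive_lookup is made up for illustration, the latin1_bin default is only taken from the answer above, and this variant puts COLLATE on the column (so the collation has to belong to the column's character set). I have only checked the string building, not run it against MySQL.

```python
import re

# Plain SQL identifiers only: identifiers cannot be bound as query
# parameters, so they are validated before being put into the statement.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def case_sensitive_lookup(table, column, value, collation="latin1_bin"):
    """Build (sql, params) for an exact, case-sensitive match.

    Hypothetical helper: the collation is an assumption and must belong
    to the character set of the column it is applied to.
    """
    for name in (table, column, collation):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    sql = f"SELECT * FROM `{table}` WHERE `{column}` COLLATE {collation} = %s"
    # The value itself stays a bound parameter; it is never formatted in.
    return sql, [value]


sql, params = case_sensitive_lookup("documents__document", "hash", "abcd123")
# With Django installed this pair could then be passed on as:
#   Document.objects.raw(sql, params)
```

The value is kept as a bound parameter, which is why %s appears unquoted in the generated statement.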

Related

Search With LIKE inside a JSON object in Mysql [duplicate]

I have a MySQL query, where I filter by a json field:
SELECT id, username
FROM (SELECT id,
Json_extract(payload, '$.username') AS username
FROM table1) AS tmp
WHERE username = 'userName1';
It returns 1 row, which looks like:
1, "userName1"
See the quotes that are not in the clause?
What I need is to make the WHERE clause case insensitive.
But when I do
WHERE username LIKE 'userName1';
it returns 0 rows. I don't understand why it works this way; the = comparison matches even though its operand doesn't have those double quotes.
If I do
WHERE username LIKE '%userName1%';
it now also returns the row, because the % wildcards absorb the quotes:
1, "userName1"
But when I do
WHERE username LIKE '%username1%';
it returns 0 rows, so unlike the usual MySQL LIKE it's somehow case sensitive.
What am I doing wrong and how to filter the json payload the case insensitive way?
EDIT=========================================
The guess is that COLLATE should be used here, but so far I don't understand how to make it work.
Default collation of MySQL is latin1_swedish_ci before 8.0 and utf8mb4_0900_ai_ci since 8.0. So non-binary string comparisons are case-insensitive by default in ordinary columns.
However, as mentioned in the MySQL manual for the JSON type:
"MySQL handles strings used in JSON context using the utf8mb4 character set and utf8mb4_bin collation."
Therefore, your JSON value is in utf8mb4_bin collation and you need to apply a case insensitive collation to either operand to make the comparison case insensitive.
E.g.
WHERE username COLLATE XXX LIKE '...'
where XXX should be a utf8mb4 collation (such as the utf8mb4_general_ci you've mentioned).
Or
WHERE username LIKE '...' COLLATE YYY
where YYY should be a collation that matches the character set of your connection.
For equality comparison, you should unquote the JSON value with JSON_UNQUOTE() or the unquoting extraction operator ->>.
E.g.
JSON_UNQUOTE(JSON_EXTRACT(payload, '$.username'))
Or simply
payload->>'$.username'
The JSON type and functions work quite differently from ordinary data types. It appears that you are new to them, so I would suggest reading the manual carefully before putting this into a production environment.
Okay, I was able to solve the case insensitivity by adding COLLATE utf8mb4_general_ci after the LIKE clause.
So the point here is to find a working collation, which in its turn can be found by researching the db you work with.
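To make the two effects discussed above (the leftover double quotes and the case-sensitive comparison) easier to see, here is a rough Python emulation. It does not talk to MySQL at all: json_extract, json_unquote and equals_ci are hypothetical stand-ins that only mimic the quoting of JSON_EXTRACT / JSON_UNQUOTE and a case-insensitive comparison.

```python
import json


def json_extract(payload: str, key: str) -> str:
    """Stand-in for JSON_EXTRACT: the value comes back still JSON-encoded."""
    return json.dumps(json.loads(payload)[key])


def json_unquote(value: str):
    """Stand-in for JSON_UNQUOTE: decode the JSON-encoded value."""
    return json.loads(value)


def equals_ci(left: str, right: str) -> bool:
    """Case-insensitive comparison, roughly what a _ci collation provides."""
    return left.casefold() == right.casefold()


payload = '{"username": "userName1"}'
extracted = json_extract(payload, "username")

# The extracted value keeps its double quotes, so a plain pattern
# without wildcards has nothing to match against.
quoted_matches = extracted == "userName1"                # False
# Unquoting first and then comparing case-insensitively does match.
unquoted_matches = equals_ci(json_unquote(extracted), "username1")  # True
```

In MySQL terms this corresponds to unquoting with ->> and then applying a case-insensitive utf8mb4 collation to the comparison.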

mysql collation: case-preserving, case-insensitive but accent-sensitive [duplicate]

How can I perform accent-sensitive but case-insensitive utf8 search in mysql? Utf8_bin is case sensitive, and utf8_general_ci is accent insensitive.
If you want to differentiate "café" from "cafe",
you may use:
Select word from table_words WHERE Hex(word) LIKE Hex("café");
This way it will return 'café'.
Otherwise, if you use:
Select word from table_words WHERE Hex(word) LIKE Hex("cafe");
it will return 'cafe'.
I'm using the latin1_german2_ci Collation.
There doesn't seem to be one because case sensitivity is tough to do in Unicode.
There is a utf8_general_cs collation but it seems to be experimental, and according to this bug report, doesn't do what it's expected to when using LIKE.
If your data consists of western umlauts only (i.e. umlauts that are included in ISO-8859-1), you might be able to collate your search operation to latin1_german2_ci, or create a separate search column with it (that specific collation is accent sensitive according to this page; latin1_general_ci might be as well, I don't know and can't test right now).
You can use "hex" to make the search accent-sensitive. Then simply add lcase to make it case insensitive again. So that would give:
SELECT name FROM people WHERE HEX(LCASE(name)) = HEX(LCASE("René"))
You do throw all your indexes out of the window like that. If you want to avoid having to do a full table scan and you have an index on "name", also search for the same thing without the hex and lcase:
SELECT name FROM people WHERE name = "René" and HEX(LCASE(name)) = HEX(LCASE("René"))
This way the index on "name" will be used to find for example only the rows "René" and "Rene" and then the comparison with the "hex" needs to be done only on those two rows instead of on the complete table.
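As an illustration of why the HEX(LCASE(...)) trick above keeps accents apart while ignoring case, here is a small Python emulation. hex_lcase and matches are hypothetical helpers, the encoding is assumed to be utf8, and Python's lower() is only an approximation of what LCASE does under a given MySQL character set.

```python
def hex_lcase(text: str) -> str:
    """Emulate HEX(LCASE(text)) for a utf8 string: lowercase, then hex-dump the bytes."""
    return text.lower().encode("utf-8").hex().upper()


def matches(stored: str, wanted: str) -> bool:
    """Case-insensitive but accent-sensitive comparison via the byte dumps."""
    return hex_lcase(stored) == hex_lcase(wanted)


# Lowercasing removes the case difference, while the accented letter
# keeps its own two-byte utf8 sequence (C3 A9), so "Rene" stays distinct.
same_word = matches("RENÉ", "René")      # True
other_word = matches("Rene", "René")     # False
dump = hex_lcase("René")                 # '72656EC3A9'
```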

MYSQL 5.1.61 sorting for Central European languages in utf8

I have a problem with sorting a MySQL result.
SELECT * FROM table WHERE something ORDER BY column ASC
The column is set to utf8_unicode_ci.
As a result I first get the rows whose column starts with Bosnian letters, and then the others after that:
šablabl
šeblabla
čeblabla
aaaa
bbaa
bbb
ccc
MYSQL version is 5.1.61
Bgi is right. You need to use an appropriate collation. Unfortunately, MySQL doesn't have a Central European Unicode collation yet. MariaDB, the MySQL fork being maintained by MySQL's creators, does.
So you can convert your text from utf8 to latin2 and then order with a Central European collating sequence. For example:
SELECT *
FROM tab
ORDER BY CONVERT(text USING latin2) COLLATE latin2_croatian_ci
See this fiddle: http://sqlfiddle.com/#!2/c8dd4/1/0
It is because of the way Unicode is laid out. All the "normal" Latin characters kept the same numerical values they had in ASCII, and characters from other cultures were added after them. That means that if your alphabet has characters other than the 26 regular ASCII ones, they won't appear in the correct order in Unicode.
I think you should try to change the collation on your column (maybe you'll have to change the charset as well, but maybe not).
Use a Central European collation.
Good luck!
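To illustrate the code point argument above, here is a small Python sketch that sorts a list first by raw code point and then with a hand-written alphabet. The alphabet string is a simplified Bosnian/Croatian Latin alphabet (the digraphs dž, lj and nj are ignored), and "dddd" and "zzz" are extra words added to the question's sample so that the difference shows up. This only mimics what a language-specific collation does; it is not how MySQL implements it.

```python
# Simplified alphabet order; digraphs (dž, lj, nj) are deliberately left out.
ALPHABET = "abcčćdđefghijklmnoprsštuvzž"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def collation_key(word: str):
    """Sort key: known letters by alphabet position, anything else after them by code point."""
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in word.lower()]


words = ["šablabl", "šeblabla", "čeblabla", "aaaa", "bbaa", "bbb", "ccc", "dddd", "zzz"]

# Raw code point order: č (U+010D) and š (U+0161) land after every ASCII letter.
by_code_point = sorted(words)
# Alphabet-aware order: č follows c, and š follows s.
by_alphabet = sorted(words, key=collation_key)
```

With the alphabet-aware key, "čeblabla" ends up between "ccc" and "dddd", and the š words come before "zzz".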
If that's really what you see you have found a bug: utf8_unicode_ci is supposed to consider š equivalent to s and č equivalent to c!
In any case it's true that MySQL does not have great support of utf8 collations for Central European languages: you get only Czech, Slovak, and Slovenian. If none of those work for you, I guess you'll have to create your own utf8 collation, or use a non-Unicode character set and use the collations available there.
Older question and plenty of answers.
Maybe the way I deal with problems will help someone.
I use PDO. My DB is utf-8.
First - my db singleton code (relevant part of it). I set 'SET NAMES' to 'utf8' for all connections.
$attrib_array = array(PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8');
if (DB_HANDLER) {
    $attrib_array[PDO::ATTR_ERRMODE] = PDO::ERRMODE_EXCEPTION;
}
self::$instance = new PDO(DB_TYPE.':host='.DB_HOST.';dbname='.DB_NAME, DB_USER, DB_PASS, $attrib_array);
Second - my sorting looks something like this; the collation depends on the language (the sample shows Polish):
ORDER BY some_column COLLATE utf8_polish_ci DESC
To make things more streamlined I use a constant, which I define in the language translation file, so when the file is loaded, the proper collation constant is set. Of course I have 'utf8_general_ci' as the default. Example:
define('MY_LOCALIZED_COLLATE', 'COLLATE utf8_polish_ci');
Now, my (relevant part of) query looks like this:
" ... ORDER BY some_column " . MY_LOCALIZED_COLLATE . " DESC" ;
Above works in most cases.
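The same idea (one collation per language, with a default) can be sketched in Python like this. The language codes, the COLLATIONS table and the order_by helper are assumptions made up for illustration; only utf8_polish_ci and utf8_general_ci come from the answer above, and utf8_czech_ci is added as a second example.

```python
import re

# Hypothetical language-to-collation table; extend it per supported language.
COLLATIONS = {
    "pl": "utf8_polish_ci",
    "cs": "utf8_czech_ci",
}
DEFAULT_COLLATION = "utf8_general_ci"

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def order_by(column: str, lang: str, descending: bool = False) -> str:
    """Build an ORDER BY fragment with a language-specific collation."""
    if not _IDENTIFIER.match(column):
        raise ValueError(f"unsafe column name: {column!r}")
    # Unknown languages fall back to the default instead of failing.
    collation = COLLATIONS.get(lang, DEFAULT_COLLATION)
    direction = "DESC" if descending else "ASC"
    return f"ORDER BY {column} COLLATE {collation} {direction}"


polish_clause = order_by("some_column", "pl", descending=True)
fallback_clause = order_by("some_column", "xx")
```

Looking the collation up in a fixed table means that only known collation names can ever reach the query.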
If the collation you need is missing, you may try to add one yourself.
For more detailed info about creating such a collation, see here: http://dev.mysql.com/doc/refman/5.0/en/adding-collation.html
EDIT:
Just one more thing I noticed:
if you have a list to sort in e.g. Polish,
and you have to force the proper collation for sorting (as described above),
and you use e.g. an INT column as the sort key,
... then you had better have the character set defined (e.g. as UTF8), or you will get SQL errors, e.g.:
"Syntax error or access violation: 1253 COLLATION 'utf8_polish_ci' is not valid for CHARACTER SET 'latin1'"
... strange, but true
