MySQL search with utf8_general_ci is case sensitive for FULLTEXT?

I set up a MyISAM table to do FULLTEXT searching.
I do not want searches to be case-sensitive.
My searches are along the lines of:
SELECT * FROM search WHERE MATCH (keywords) AGAINST ('+diversity +kitten' IN BOOLEAN MODE);
Let's say the keywords field I'm looking for has the value "my Diversity kitten".
I noticed the searches were case-sensitive.
I double-checked my collation on the search table, it was set to utf8_bin. D'oh!
I changed it to utf8_general_ci.
But my query is still case-sensitive!
Why?
Is there a server setting I need to change, too?
Is there something I need to do besides change the collation?
I did a "REPAIR TABLE search QUICK" to rebuild the FULLTEXT index, but that didn't do it either...
My searches are still case-sensitive. =(

Aha, figured it out for reals this time.
I believe my issue was using Navicat to update the collation. I have an older version of Navicat, so maybe it was a bug or something.
Doing:
ALTER TABLE search CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
fixed it correctly.
Always use command line, kids! =)

Hmm - that behavior doesn't match the manual:
By default, the search is performed in case-insensitive fashion. However, you can perform a case-sensitive full-text search by using a binary collation for the indexed columns. For example, a column that uses the latin1 character set can be assigned a collation of latin1_bin to make it case sensitive for full-text searches.
Which version of MySQL do you use? Can you provide some data that would allow replicating the problem on another machine?

Related

mysql utf8mb4_general_ci issue

I have an issue with Unicode chars in a utf8mb4_general_ci table
SELECT * FROM `t1` WHERE c1='musca'
returns
musca
muşca
muşcă
What I would like to know is whether this is a bug (it sounds like one) and whether it affects searching (it should). As it stands, I can't put a unique index on the column.
Is there anything I should do so that MySQL treats a and ă, and s and ş, as different characters? (Probably a and â, t and ţ, and i and î as well, but I haven't checked.)
Should I store unicode chars as &#226 &#259 &#351 &#355 &#238 ?
I will need to retrieve the exact match of the user input.
Edited to add:
The answer is in the comments: I should collate the columns as utf8mb4_0900_as_cs as Madhur Bhaiya explained and demonstrated
You need COLLATION utf8_romanian_ci (or utf8mb4_romanian_ci) on the table columns in question. It is the only collation that treats those 5 characters as a separate 'letter'.
Reference: http://mysql.rjweb.org/utf8_collations.html
That is available in most versions of MySQL/MariaDB. There is no need for utf8mb4_0900_as_cs, which implies MySQL 8.0.

Problem in enforcing case sensitivity in MySQL

I am aware that MySQL is case insensitive by default.
I also read about using the collation utf8_general_cs to enable case sensitivity. But I get an error saying the collation is not identified. Also when I query the collation for charset utf8, the resultset shows ci related collations only. So question number 1 would be, do we need to configure cs related collations? If so, then I would like some guidance over it. Or is it dependent on some particular database engine?
I also read about using the utf8_bin collation to make MySQL searches case sensitive. I did so: I set the schema collation to utf8_bin. But it didn't work. I restarted the MySQL service as well to ensure that the collation had been updated. Yet when I run
Select * from table where name like 'el%';
it returns names starting with 'EL' as well.
Note: I am preferably looking for options to set the collation at the database level.
MySQL server version 5.6.x
Column collation has precedence over database and table collations. If you've been making changes, it's possible that your column is currently using the value that was the default when the table was created. You should be able to spot it with a proper SQL tool or by running:
SELECT table_schema, column_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'your database name'
AND table_name = 'your table name';
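If you aren't willing to change the column collation, you can set it at expression level (utf8mb4_0900_as_cs requires MySQL 8.0; on 5.6, utf8_bin is the case-sensitive option for utf8 data):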
SELECT *
FROM foo
WHERE bar LIKE 'el%' COLLATE utf8mb4_0900_as_cs;
Collation affects sorting and character comparison so you'll have to read some docs to figure out which one suits your needs best (it isn't straightforward if you aren't a Unicode geek).
My projects are all in Spanish so I tend to use utf8mb4_spanish_ci a lot ;-)
I thought that normally worked?!
Anyway if this is a localised problem there are a few solutions:
First, one could use a regular expression, for example SELECT * FROM users WHERE name REGEXP BINARY '^el'. On 5.6, REGEXP is not case sensitive unless one operand is a binary string, hence the BINARY keyword.
Alternatively, surely you are accessing this through another language's wrapper for automation or handling, rather than through the MySQL console? I would suggest sorting it out in the language you're using for the wrapper, where this is easy to implement. Still select only the rows beginning with "el" in SQL, as this reduces the number of rows to check.
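The wrapper-side check suggested above could look roughly like this. It is only a sketch in Python: the function name and the sample rows are made up for illustration, and it assumes the rows have already been fetched with a case-insensitive LIKE 'el%'.

```python
def filter_case_sensitive(names, prefix):
    """Keep only the names that start with `prefix`, compared case-sensitively."""
    # str.startswith compares code points exactly, so 'EL...' does not match 'el'.
    return [name for name in names if name.startswith(prefix)]


# Hypothetical rows, as a case-insensitive LIKE 'el%' might return them.
rows = ["elephant", "ELEPHANT", "Elk", "elm"]
print(filter_case_sensitive(rows, "el"))
```

The database still does the coarse filtering, so the application only has to look at the rows that matched case-insensitively.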

Accent insensitive search query in MySQL

Is there any way to make search query accent insensitive?
The column's and table's collations are utf8_polish_ci and I don't want to change them.
example word : toruń
select * from pages where title like '%torun%'
It doesn't find "toruń". How can I do that?
You can change the collation at runtime in the sql query,
...where title like '%torun%' collate utf8_unicode_ci
but beware that changing the collation on the fly at runtime forgoes the possibility of mysql using an index, so performance on large tables may be terrible.
Or, you can copy the column to another column, such as searchable_title, but change the collation on it. It's actually common to do this type of stuff, where you copy data but have it in some slightly different form that's optimized for some specific workload/purpose. You can use triggers as a nice way to keep the duplicated columns in sync. This method has the potential to perform well, if indexed.
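One variant of the "copy in a slightly different form" idea is to fold the accents in the application before writing the duplicate column, instead of relying on a different collation for it. The sketch below is an assumption-laden illustration in Python, not the method described above: fold_accents is a made-up helper, and it only removes marks that Unicode can decompose, so a letter such as ł is left unchanged.

```python
import unicodedata


def fold_accents(text):
    """Remove combining marks after NFKD decomposition, e.g. 'toruń' becomes 'torun'."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining characters (accents) and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Value one might store in a hypothetical searchable_title column.
print(fold_accents("toruń"))
```

The same function would have to be applied to the search term, so that both sides of the comparison are folded the same way.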
Note - Make sure that your db really has those characters and not html entities.
Also, the character set of your connection matters. The above assumes it's set to utf8, for example via SET NAMES utf8.
If not, you need an introducer for the literal value
...where title like _utf8'%torun%' collate utf8_unicode_ci
and of course, the value in the single quotes must actually be utf8 encoded, even if the rest of the sql query isn't.
This won't work in extreme circumstances, but try changing the column collation to utf8_unicode_ci. Then accented characters will be equal to their non-accented counterparts.
You could try SOUNDEX:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
This compares two strings by how they sound, so it obviously returns many more results.

Accent-insensitive searches / problems with utf8_general_ci collation

Edit: if you're here because you're confused by the polish collation in MySQL, read this.
I'm trying to perform a full-text search on a table of Polish cities, and many of them contain accented characters. It's meant to be used in an ajax call for auto-completion, so it would be nice if the search was accent-insensitive. I've set the collation of the rows to utf8_polish_ci. Now, given the city "Zelów", I query the database like this:
SELECT * FROM cities WHERE MATCH( city ) AGAINST ("zelow")
but to no avail. Mysql returns an empty result. I've tried different accents, tried adding different collations to the query but nothing helped. I'm not sure how I should approach this because accent-sensitivity seems to be poorly documented. Any ideas?
EDIT
So I found out that the case-insensitive full-text searches are performed only IN BOOLEAN MODE, so the correct query would be
SELECT * FROM cities WHERE MATCH( city ) AGAINST ( "zelow" IN BOOLEAN MODE )
Previously I thought otherwise due to a misleading comment on dev.mysql.com. There might be more to it but I'm just really confused right now.
Anyway, as mentioned in the comments below, I have a UNIQUE index on the city column, so changing the collation of the table to the accent-insensitive utf8_general_ci is out of the question.
I realized however, that the following query works quite well on a table with utf8_polish_ci collation:
SELECT * FROM cities WHERE city LIKE 'zelow' COLLATE utf8_general_ci
It would seem now that the most reasonable solution would be to do a full-text search in a similar fashion:
SELECT * FROM cities WHERE MATCH( city ) AGAINST ( 'zelow' IN BOOLEAN MODE ) COLLATE utf8_general_ci
This however yields the following error:
#1253 - COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'
This is really starting to get on my nerves. Might as well abandon full-text search in favour of a simple where-like approach but it doesn't seem sensible in a table with almost 50k records which will be intensively queried...
LAST EDIT
Ok, the thing with boolean mode was partly bullshit. Only partly because it indeed works as I said, however, on a utf8_general_ci it works the other way around. I'm utterly perplexed and have no will to study this issue further. I decided to drop the UNIQUE index (no further cities will be added anyway so no need to make a big deal out of it) and stick with the utf8_general_ci table collation. I appreciate all the help, it steered me in the right direction.
Change your collation to utf8_general_ci. It ignores accents when searching and ordering but still stores them correctly.
MySQL is very flexible in the encoding/collation area, maybe too flexible. When changing your encoding/collation, make sure you are converting the table, not just changing the encoding/collation types.
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
You can also convert individual fields, so your table can have a collation setting of utf8_general_ci while one or more fields use some other collation. Based on the "binary" error you are seeing, it seems your text field might have a collation of utf8_bin (or be a blob). Can you post the result of SHOW CREATE TABLE?
Remember, the CHARACTER SET (encoding) is how the data is stored; the collation is how it is compared and sorted, and therefore how it is indexed. Not all combinations work.
My original problem, and question, might help a little:
Converting mysql tables from latin1 to utf8
If you try :
select * from cities where cityname like 'zelow'
Change your collation from binary to utf8_bin. utf8_bin should be compatible with utf8_general_ci, but will still allow you to store city names with differing accents.

MySQL MyISAM fulltext search - how to add '#' as a word character for utf8 charset?

I am using MyISAM full-text search. The table columns use the charset "utf8" with "utf8_general_ci" as the collation.
Now I want to implement a #HashTag system, so that if I search for "#HashTag", only rows that contain "#HashTag" show up, not rows that just contain "HashTag".
According to the comment in this MySQL documentation, it's easy to do for non-multibyte charsets, that is, charsets with fixed-width encoding.
But I could not find a good reference for how to do it for utf8 charset. Has anyone done this for utf8 charset columns? If yes, could you list the exact steps?
Also, I want to avoid recompiling MySQL if possible.
Not an answer to your question, but would it not be a good idea to parse out the hashtags at input time using a regular expression, and store them in a separate column? That might be easier (and faster) than bending MySQL into accepting # as a search character.
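A rough sketch of that input-time parsing, in Python. The regular expression and the function name are assumptions chosen for illustration: here a hashtag is taken to be "#" followed by one or more word characters, and a "#" in the middle of a word is ignored.

```python
import re

# Assumed definition of a hashtag: '#' not preceded by a word character,
# followed by one or more Unicode word characters.
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")


def extract_hashtags(text):
    """Return the hashtags found in `text`, in order, without duplicates."""
    tags = []
    for tag in HASHTAG_RE.findall(text):
        if tag not in tags:
            tags.append(tag)
    return tags


print(extract_hashtags("Loving #HashTag and #kitten, not HashTag or a#b"))
```

The extracted tags could then go into their own column or table, so that a search for "#HashTag" becomes a plain lookup and the full-text index on the original text stays untouched.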