mysql full text search strange behaviour with some words [duplicate] - mysql

I have a few columns with a FULLTEXT index and I'm testing some search strings. My database contains car components, so a typical search could be "Engine 1.6". The problem is that when the search string contains a period (like 1.6), the query returns no results.
Here are my variables:
+--------------------------+----------------+
| ft_boolean_syntax        | + -><()~*:""&| |
| ft_max_word_len          | 84             |
| ft_min_word_len          | 4              |
| ft_query_expansion_limit | 20             |
| ft_stopword_file         | (built-in)     |
+--------------------------+----------------+
I don't know why but even if the ft_min_word_len is 4, a search like "Engine 24V" works. The query for matching is like this:
WHERE MATCH(sdescr,udescr) AGAINST ('+engine +1.6' IN BOOLEAN MODE)

I spent the last day figuring out this issue. It happens because, by default, MySQL/MariaDB treats characters such as spaces (" "), periods ("."), and commas (",") as punctuation. Long story short, each character set classifies its characters (letter, digit, punctuation, and so on), and the full-text parser uses that classification to decide where one word ends and the next begins. The punctuation characters mentioned above act as word delimiters, so "1.6" is split into "1" and "6" rather than indexed as a single word.
To solve this issue, we need MySQL/MariaDB to treat those punctuation characters as word characters rather than as delimiters.
We are presented with three solutions in the MySQL documentation. The first one requires changing the source code and recompiling, which isn't a very viable option for me. The second and third options are good and aren't too hard to follow.
Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. You can edit the contents of the array in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes. For information about the array format, see Section 10.13.1, “Character Definition Arrays”.
Add a new collation for the character set used by the indexed columns, and alter the columns to use that collation. For general information about adding collations, see Section 10.14, “Adding a Collation to a Character Set”. For an example specific to full-text indexing, see Section 12.10.7, “Adding a User-Defined Collation for Full-Text Indexing”.
First things first:
We need to know which character we're trying to fix. Take a look at the link below and find the hex value of that character. In my case, it was 2E, the period.
https://www.eso.org/~ndelmott/ascii.html
Now, we need to find the collation files in the database server.
SSH into your server.
Log in to MySQL/MariaDB: mysql -u root -p
Run SHOW VARIABLES LIKE 'character_sets_dir';
The result is a one-row table whose value is a directory path. I was using Docker, so mine came back as usr/share/mysql/charsets.
At this point I opened a second terminal, but this isn't necessary.
Back in the server, outside of the MySQL/MariaDB command line:
Navigate to the directory path the previous query returned. You'll find an Index.xml as well as other XML files.
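Follow the first step in the MySQL documentation (Section 12.10.7, cited above), which adds the new collation to Index.xml.
NOTE: Before continuing to the second step, open latin1.xml and look closely at the <map> elements nested in <lower> and <upper>. Find the position of the hex value for the character you want to fix (2E in my case); that position tells you which entry to change in the <map> nested inside <ctype>. Keep in mind that the <ctype> array has one extra leading element compared with the other arrays.
Continue with the second step in the MySQL documentation and the ones after it, which declare the collation in latin1.xml and change the <ctype> entry for your character.
After the changes, restart your server.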
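Once the server is back up, it may be worth confirming that the new collation was actually loaded before altering any columns. A minimal check, assuming the collation was named latin1_fulltext_ci as in the ALTER TABLE statement used later in this answer:

```sql
-- Lists the collation if the server picked it up from Index.xml;
-- an empty result suggests the XML edits were not loaded.
SHOW COLLATION LIKE 'latin1_fulltext_ci';

-- Confirms which directory the server reads its character set files from.
SHOW VARIABLES LIKE 'character_sets_dir';
```

If the first statement returns nothing, the server error log is a reasonable next place to look.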
Assign the User-defined Collation to our database/table/column.
All we need to do is assign our collation to our database, table, or column. In my case, I just needed to assign it to two columns, so I ran the following command:
ALTER TABLE table_name MODIFY fulltext_column_one TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, MODIFY fulltext_column_two TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci;
Here are some links that might be helpful:
https://mariadb.com/kb/en/setting-character-sets-and-collations/
https://dev.mysql.com/doc/refman/8.0/en/charset-syntax.html
This should solve your problem if you don't have any existing data in the table.
If you do have existing data and you run the query above, you might get an error similar to the one below:
SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xE2\x80\x93 fr...' for column.
The issue here is that the existing data contains multi-byte UTF-8 characters (the bytes E2 80 93 in the message are the 3-byte UTF-8 encoding of an en dash), while latin1 stores one byte per character. To get around this, we convert the data to binary first and then to latin1.
Run the following query in the mysql/mariadb command line:
UPDATE table_name SET fulltext_column = CONVERT(CAST(CONVERT(fulltext_column USING utf8) AS BINARY) USING latin1);
You'll need to convert the values of every column that causes the issue. In my case, it was just one.
Then follow it with:
ALTER TABLE table_name MODIFY fulltext_column_one TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, MODIFY fulltext_column_two TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci;
We are done. We can now search a term with our character, and our database engine will match against it.
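One way to check the result is a throwaway table, modeled on the pattern the MySQL documentation uses for its hyphen example. This is only a sketch: the table name ft_period_test and its rows are made up for illustration, and it assumes the latin1_fulltext_ci collation from the steps above is loaded. Note that "1.6" is only three characters long, so it also has to satisfy the engine's minimum word length (three by default for InnoDB, four for MyISAM).

```sql
-- Hypothetical scratch table using the user-defined collation.
CREATE TABLE ft_period_test (
  descr TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci,
  FULLTEXT INDEX (descr)
) ENGINE=InnoDB;

INSERT INTO ft_period_test (descr) VALUES
  ('Engine 1.6 petrol'),
  ('Engine 2.0 diesel'),
  ('Engine 24V');

-- If the period is now a word character, only the 1.6 row should qualify.
SELECT descr
FROM ft_period_test
WHERE MATCH(descr) AGAINST ('+engine +1.6' IN BOOLEAN MODE);

DROP TABLE ft_period_test;
```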

InnoDB solves this problem, while MyISAM still has this behaviour.
MyISAM works with words like "Node.js" but not with words like "ASP.NET".
UPDATED: Later I found I might be wrong. MyISAM works with "Node.js" because MyISAM requires words of at least four characters ("Node" qualifies, while "ASP" and "NET" do not), whereas InnoDB requires at least three characters.
I found the following explanation:
Note: Some words are ignored in full-text searches.
The minimum word length for full-text searches is as follows:
Three characters for InnoDB search indexes.
Four characters for MyISAM search indexes.
Stop words are very common words, such as 'on', 'the' or 'it', that appear in almost every document. These words are ignored during searching.
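To see which limit applies on a given server, the relevant settings can be inspected directly. A small sketch; 'your_table' is a placeholder, not a name from the question:

```sql
-- MyISAM: minimum indexed word length (default 4).
SHOW VARIABLES LIKE 'ft_min_word_len';

-- InnoDB: minimum indexed token length (default 3).
SHOW VARIABLES LIKE 'innodb_ft_min_token_size';

-- Shows which storage engine the table uses (see the Engine column).
SHOW TABLE STATUS LIKE 'your_table';
```

Both variables can only be set at server startup, and existing FULLTEXT indexes have to be rebuilt after changing them.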

Related

MySQL strange characters replace with <BR

I inherited a MySQL table (MyISAM, utf8_general_ci collation) that contains a strange character which looks like this in phpMyAdmin: •
I assume this is a bullet point of some type?
When rendered on a HTML page it looks like this: �
How do I replace this value with a <BR><LI> so I can turn it into a line break with a properly formatted list item?
I've tried a standard UPDATE query, but it does not replace these values. I assume I need to escape them somehow?
Query attempted:
UPDATE `FL_Regs` SET `Remarks` = "<BR><LI>" WHERE `Remarks` = "•"
You did not show your query, so I'm only guessing.
If you're having a hard time with your client encoding characters for you (I imagine you may be using phpMyAdmin, which involves a lot of steps between your browser and the actual server), you can try giving the search string as a sequence of bytes.
It happens that • is U+2022, a character named "BULLET" in Unicode, whose UTF-8 encoding e2 80 a2 is being displayed as three separate single-byte characters. So you can use X'E280A2' instead of '•' in your query.
Typically:
> select X'E280A2';
+-----------+
| X'E280A2' |
+-----------+
| • |
+-----------+
If you want to better understand what's happening, you can use the HEX() function, first to check what MySQL receives when you send a bullet:
SELECT HEX('•');
Typically I get E280A2, which, as seen above, is the UTF-8 encoding of the BULLET character.
And to see what's actually stored in your table:
SELECT HEX(your_column) FROM your_table;
Try to limit the query to a single row to keep the output readable.
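Putting the byte-sequence idea to work, a replacement could look like the sketch below. It assumes HEX() showed E280A2 in the stored data; if it shows different bytes (for example a double-encoded sequence), substitute those instead. Note that the query attempted in the question compares the whole column with the character and overwrites the whole value, whereas REPLACE() only swaps the character inside the text.

```sql
-- Preview some of the affected rows first.
SELECT Remarks
FROM FL_Regs
WHERE INSTR(Remarks, CONVERT(X'E280A2' USING utf8)) > 0
LIMIT 10;

-- Replace every bullet inside the text, leaving the rest of each value intact.
UPDATE FL_Regs
SET Remarks = REPLACE(Remarks, CONVERT(X'E280A2' USING utf8), '<BR><LI>')
WHERE INSTR(Remarks, CONVERT(X'E280A2' USING utf8)) > 0;
```

Taking a backup of the table before running the UPDATE is a sensible precaution.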

MySQL Full-Text search for hashtags (including the # symbol in index)

I am pretty sure there should be a way to search for hashtags using Full-Text index in MyISAM table. Default setup would do the following:
textfield
hashtag
#hashtag
#two #hashtag #hashtag
SELECT * FROM table WHERE MATCH(textfield) AGAINST ('#hashtag')
> | hashtag |
> | #hashtag |
> | #two #hashtag #hashtag |
It should return only the 2nd and 3rd rows instead. It looks like the # symbol is treated as a word delimiter, so it is "removed" before the search begins. What should I do to enable indexing and searching for terms containing # as part of the word?
As documented under Fine-Tuning MySQL Full-Text Search:
You can change the set of characters that are considered word characters in several ways, as described in the following list. After making the modification, rebuild the indexes for each table that contains any FULLTEXT indexes. Suppose that you want to treat the hyphen character ('-') as a word character. Use one of these methods:
Modify the MySQL source: In storage/myisam/ftdefs.h, see the true_word_char() and misc_word_char() macros. Add '-' to one of those macros and recompile MySQL.
Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. You can edit the contents of the <ctype><map> array in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes. For information about the <ctype><map> array format, see Section 10.3.1, “Character Definition Arrays”.
Add a new collation for the character set used by the indexed columns, and alter the columns to use that collation. For general information about adding collations, see Section 10.4, “Adding a Collation to a Character Set”. For an example specific to full-text indexing, see Section 12.9.7, “Adding a Collation for Full-Text Indexing”.
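If changing the character set files is not an option, a different workaround (not one of the documented methods above) is to let the full-text index narrow the candidates and then filter on the literal # with LIKE. A rough sketch, using a hypothetical table name posts because the name in the question, table, is a reserved word:

```sql
-- The full-text match finds rows containing the word "hashtag";
-- the LIKE condition then keeps only rows where it is preceded by '#'.
SELECT *
FROM posts
WHERE MATCH(textfield) AGAINST ('hashtag')
  AND textfield LIKE '%#hashtag%';
```

The LIKE pattern also matches longer tags such as #hashtags, so it may need tightening depending on the data.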

MYSQL full text search numbers and character being ignored?

I'm trying to do a full text search. The database I'm querying has a lot of LCD screen sizes, e.g. 32".
I need a full text search because search phrases can be complex; we started with a LIKE comparison but it didn't cut it.
Here's what we've got.
SELECT stock_items.name AS si_name,
stock_makes.name AS sm_name,
stock_items.id AS si_id
FROM stock_items
LEFT JOIN stock_makes ON stock_items.make = stock_makes.id
WHERE MATCH (stock_items.name,
stock_makes.name) AGAINST ('+32"' IN BOOLEAN MODE)
AND stock_items.deleted != 1
With this search we get 0 results, although 32" appears multiple times in the fields.
We have modified the mysql config to allow us to search 3 characters or more (instead of the default four) and searches like +NEC work fine.
My guess is that full text search in MySQL ignores either the " character or maybe the numbers.
I don't have any control over the database data unfortunately or I'd replace the double quotes with something else.
Any other solutions?
MySQL ignores certain characters when indexing; " is one of them, I presume.
There are a few ways to change the default character settings, as described in the MySQL documentation:
Modify the MySQL source: In myisam/ftdefs.h, see the true_word_char() and misc_word_char() macros. Add '-' to one of those macros and recompile MySQL.
Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. You can edit the contents of the array in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes.
Add a new collation for the character set used by the indexed columns, and alter the columns to use that collation.
I used the second approach and that worked for us. The provided page has a detailed example in the comments at the end of the page (comment by John Navratil).
In any case you need to rebuild the index after you changed settings:
REPAIR TABLE tbl_name QUICK;
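REPAIR TABLE is supported for MyISAM but not for InnoDB, so for an InnoDB table the index has to be dropped and re-created instead. A sketch of both variants, using the stock_items table from the question; the index name ft_name is hypothetical, so look up the real one first:

```sql
-- MyISAM: rebuild the indexes in place.
REPAIR TABLE stock_items QUICK;

-- InnoDB: find the FULLTEXT index name, then drop and re-create it.
SHOW INDEX FROM stock_items;
ALTER TABLE stock_items DROP INDEX ft_name;
ALTER TABLE stock_items ADD FULLTEXT INDEX ft_name (name);
```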

MYSQL 5.1.61 sorting for Central European languages in utf8

I have a problem with sorting MYSQL result..
SELECT * FROM table WHERE something ORDER BY column ASC
column is set to utf8_unicode_ci..
As a result I first get rows which have column starting with Bosnian letters and then the others after that..
šablabl
šeblabla
čeblabla
aaaa
bbaa
bbb
ccc
MYSQL version is 5.1.61
Bgi is right. You need to use an appropriate collation. Unfortunately, MySQL doesn't have a Central European Unicode collation yet. MariaDB, the MySQL fork maintained by MySQL's creators, does.
So you can convert your text from utf8 to latin2 and then order with a Central European collating sequence. For example:
SELECT *
FROM tab
ORDER BY CONVERT(text USING latin2) COLLATE latin2_croatian_ci
See this fiddle: http://sqlfiddle.com/#!2/c8dd4/1/0
This is because of the way Unicode is laid out. All the "normal" Latin characters kept the same numeric values they had in ASCII, and characters from other alphabets were added after them. That means that if your alphabet has characters beyond the 26 regular ASCII letters, they won't appear in the correct order when sorted by Unicode value.
I think you should try to change the collation on your column (maybe you'll have to change the charset also, but maybe not).
Use a Central European collation.
Good luck !!
If that's really what you see you have found a bug: utf8_unicode_ci is supposed to consider š equivalent to s and č equivalent to c!
In any case it's true that MySQL does not have great support of utf8 collations for Central European languages: you get only Czech, Slovak, and Slovenian. If none of those work for you, I guess you'll have to create your own utf8 collation, or use a non-Unicode character set and use the collations available there.
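To try one of the language-specific utf8 collations without altering the column, it can be applied per query. A minimal sketch, reusing the tab table and text column from the example above; whether Slovenian ordering is close enough for Bosnian data is something to verify against your own rows:

```sql
-- See which utf8 collations this server actually provides.
SHOW COLLATION WHERE Charset = 'utf8';

-- Apply one of them to a single query without changing the column definition.
SELECT *
FROM tab
ORDER BY `text` COLLATE utf8_slovenian_ci;
```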
Older question and plenty of answers.
Maybe the way I deal with problems will help someone.
I use PDO. My DB is utf-8.
First - my db singleton code (relevant part of it). I set 'SET NAMES' to 'utf8' for all connections.
$attrib_array = array(PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8');
if (DB_HANDLER) {
    $attrib_array[PDO::ATTR_ERRMODE] = PDO::ERRMODE_EXCEPTION;
}
self::$instance = new PDO(DB_TYPE.':host='.DB_HOST.';dbname='.DB_NAME, DB_USER, DB_PASS, $attrib_array);
Second - my sorting looks something like this - the collation depends on the language (the sample shows Polish):
ORDER BY some_column COLLATE utf8_polish_ci DESC
To make things more streamlined I use a constant, which I define in lang translation file, so when file is pulled, proper collation constant is set. Of course I have 'utf8_general_ci' as default. Example:
define('MY_LOCALIZED_COLLATE', 'COLLATE utf8_polish_ci');
Now, my (relevant part of) query looks like this:
" ... ORDER BY some_column " . MY_LOCALIZED_COLLATE . " DESC" ;
Above works in most cases.
If the collation you need is missing, you can try to add one yourself.
For more detailed info about creating one, see here: http://dev.mysql.com/doc/refman/5.0/en/adding-collation.html
EDIT:
Just one more thing I noticed:
if you have a list to sort in e.g. Polish
and you have to force the proper collation for sorting (as described above)
and you use e.g. an INT column as the sort key
... then you had better make sure the character set matches the collation (e.g. UTF8), or you will get SQL errors, e.g.:
"Syntax error or access violation: 1253 COLLATION 'utf8_polish_ci' is not valid for CHARACTER SET 'latin1'"
... strange, but true
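One way around error 1253 is to convert the expression to utf8 before applying the collation, in the same spirit as the CONVERT ... USING example in the earlier answer. A sketch with hypothetical table and column names:

```sql
-- Converting the expression to utf8 first gives COLLATE a character set
-- that utf8_polish_ci is valid for.
SELECT *
FROM some_table
ORDER BY CONVERT(some_column USING utf8) COLLATE utf8_polish_ci DESC;
```

For a purely numeric sort key it is usually simpler to leave COLLATE off that column altogether, since converting a number to a string makes it sort as text.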