MySQL full-text search: numbers and characters being ignored?

I'm trying to do a full-text search; the database I'm querying has a lot of LCD screen sizes, e.g. 32".
I need a full-text search because a search phrase can be complex. We started with a LIKE comparison, but it didn't cut it.
Here's what we've got:
SELECT stock_items.name AS si_name,
stock_makes.name AS sm_name,
stock_items.id AS si_id
FROM stock_items
LEFT JOIN stock_makes ON stock_items.make = stock_makes.id
WHERE MATCH (stock_items.name,
stock_makes.name) AGAINST ('+32"' IN BOOLEAN MODE)
AND stock_items.deleted != 1
With this search we get 0 results, although 32" appears multiple times in the fields.
We have modified the MySQL config to allow us to search for words of 3 characters or more (instead of the default four), and searches like +NEC work fine.
My guess is that full-text search in MySQL ignores either a) the " character or b) the numbers.
Unfortunately, I don't have any control over the database data, or I'd replace the double quotes with something else.
Any other solutions?

MySQL ignores certain characters when indexing; " is one of them, I presume.
There are a few ways to change the default word-character settings, as described in the MySQL documentation on fine-tuning full-text search:
Modify the MySQL source: In myisam/ftdefs.h, see the true_word_char() and misc_word_char() macros. Add '-' to one of those macros and recompile MySQL.
Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. You can edit the contents of the array in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes.
Add a new collation for the character set used by the indexed columns, and alter the columns to use that collation.
I used the second approach and that worked for us. The provided page has a detailed example in the comments at the end of the page (comment by John Navratil).
In any case, you need to rebuild the index after changing the settings:
REPAIR TABLE tbl_name QUICK;
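A possible explanation for the zero results, sketched in Python under the assumption that the parser treats " as a non-word character: the token that remains, 32, has only two characters, which is below the configured minimum of 3. The helper name indexable_words and the regular expression are illustrative only; this is a rough approximation, not MySQL's actual parser.

```python
import re

# Assumption: letters, digits and "_" count as word characters.
# The real rules depend on the character set and the server version.
WORD_RE = re.compile(r"[0-9A-Za-z_]+")

def indexable_words(text, min_len=3):
    """Return the tokens long enough to be indexed (hypothetical helper)."""
    return [w for w in WORD_RE.findall(text) if len(w) >= min_len]

print(indexable_words('32"'))          # [] because "32" is shorter than 3
print(indexable_words('NEC 32" LCD'))  # ['NEC', 'LCD']
```

If this reading is right, making " a word character would turn 32" into a three-character token, which is why the character set change above can help.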

mysql full text search strange behaviour with some words [duplicate]

I have a few columns with a FULLTEXT index and I'm testing some search strings. My database contains car components, so a search could be, for example, "Engine 1.6". The problem is that when the string contains a period (like 1.6), the query returns no results.
Here are my variables:
+--------------------------+----------------+
| ft_boolean_syntax        | + -><()~*:""&| |
| ft_max_word_len          | 84             |
| ft_min_word_len          | 4              |
| ft_query_expansion_limit | 20             |
| ft_stopword_file         | (built-in)     |
+--------------------------+----------------+
I don't know why, but even though ft_min_word_len is 4, a search like "Engine 24V" works. The matching clause looks like this:
WHERE MATCH(sdescr,udescr) AGAINST ('+engine +1.6' IN BOOLEAN MODE)
I spent the last day figuring out this issue. It happens because, by default, MySQL/MariaDB treats characters such as the space (" "), period ("."), and comma (",") as punctuation rather than as word characters. Long story short, the full-text parser uses this classification to decide where words begin and end. The punctuation characters above act as word delimiters (they are not stopwords), so "1.6" is split into "1" and "6", both of which are shorter than the minimum word length.
To solve this issue, we need MySQL/MariaDB to treat those punctuation characters as word characters.
We are presented with three solutions in the MySQL documentation. The first one requires changing the source code and recompiling, which isn't a very viable option for me. The second and third options are good and aren't too hard to follow.
Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. You can edit the contents of the array in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes. For information about the array format, see Section 10.13.1, “Character Definition Arrays”.
Add a new collation for the character set used by the indexed columns, and alter the columns to use that collation. For general information about adding collations, see Section 10.14, “Adding a Collation to a Character Set”. For an example specific to full-text indexing, see Section 12.10.7, “Adding a User-Defined Collation for Full-Text Indexing”.
First things first:
We need to know which character we're trying to fix. Take a look at the link below and find the hex code of that character. In my case, it was 2E, the period.
https://www.eso.org/~ndelmott/ascii.html
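If you'd rather not use a lookup table, the hex code can be computed directly. A minimal Python sketch (hex_code and map_position are hypothetical helper names) that also gives the row and column of the character in a 16-column <map>, such as the ones inside <lower> and <upper>:

```python
def hex_code(ch):
    """Two-digit uppercase hex code of a single-byte character."""
    code = ord(ch)
    if code > 0xFF:
        raise ValueError("not a single-byte character")
    return format(code, "02X")

def map_position(ch):
    """(row, column), both zero-based, in a map with 16 entries per row."""
    return divmod(ord(ch), 16)

print(hex_code("."))      # 2E
print(map_position("."))  # (2, 14)
```

Note that the MySQL manual describes the ctype array as having 257 elements, indexed by character value + 1, so the position inside <ctype> is shifted by one entry compared with <lower> and <upper>.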
Now, we need to find the collation files in the database server.
SSH into your server.
Log in to your MySQL/MariaDB: mysql -u root -p
Run SHOW VARIABLES LIKE 'character_sets_dir';
The result should be a table whose value is a directory path. I was using docker, so mine came back as usr/share/mysql/charsets.
At this point, I opened a second terminal, but this isn't necessary.
Back in the server, outside of the MySQL/MariaDB command line:
Navigate to the directory path the previous query returned. You'll find an Index.xml as well as other XML files.
Follow the first step in the MySQL Documentation
NOTE: Before continuing the second step, open latin1.xml and look closely at the <map> nested in <lower> and <upper>. Find the HEX equivalent character to the one you want to fix, in my case, 2E. We can then map the correct spot in the <map> nested inside <ctype>.
Continue to the second step in the MySQL Documentation
After the changes, restart your server.
Assign the User-defined Collation to our database/table/column.
All we need to do is assign our collation to our database, table, or column. In my case, I just needed to assign it to two columns, so I ran the following command:
ALTER TABLE table_name MODIFY fulltext_column_one TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, MODIFY fulltext_column_two TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci;
Here are some links that might be helpful:
https://mariadb.com/kb/en/setting-character-sets-and-collations/
https://dev.mysql.com/doc/refman/8.0/en/charset-syntax.html
This should solve your problem if you don't have any existing data in the table.
If you do have existing data and you try to run the query above, you might get an error similar to the one below:
SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xE2\x80\x93 fr...' for column.
The issue here is that the existing data contains multi-byte UTF-8 characters (\xE2\x80\x93 is a three-byte sequence) that cannot be converted directly to the single-byte latin1 character set. To solve this, we convert the data to binary first and then to latin1.
Run the following query in the MySQL/MariaDB command line:
UPDATE table_name SET fulltext_column = CONVERT(CAST(CONVERT(fulltext_column USING utf8) AS BINARY) USING latin1);
You'll need to convert the values of every column that causes the issue. In my case, it was just one.
Then follow it with:
ALTER TABLE table_name MODIFY fulltext_column_one TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci, MODIFY fulltext_column_two TEXT CHARACTER SET latin1 COLLATE latin1_fulltext_ci;
We are done. We can now search a term with our character, and our database engine will match against it.
InnoDB solves this problem; MyISAM still has this behaviour.
MyISAM works with words like "Node.js" but not with words like "ASP.NET".
UPDATED: Later I found I might be wrong. MyISAM works with "Node.js" because MyISAM requires at least four characters per word while InnoDB requires at least three: "Node" has four characters, whereas "ASP" and "NET" have only three.
I found the following explanation:
Note: Some words are ignored in full-text searches.
The minimum word length for full-text searches is as follows:
Three characters for InnoDB search indexes.
Four characters for MyISAM search indexes.
Stop words are very common words, such as 'on', 'the' or 'it', that appear in almost every document. These types of words are ignored during searching.

How can you build a full-text index that treats underscores as word separators in InnoDB?

So that such queries would return a non-empty set:
SELECT * FROM mytable WHERE MATCH(name) AGAINST ('+some +text' IN BOOLEAN MODE);
From a table where the only record's name attribute is 'some_text'. Basically, I want to force InnoDB to treat underscores as delimiters when building the full-text index, just like it does with dots and hyphens. How can this be achieved natively in MySQL, or even with a third-party parser that does this by default?
Thank you
Edit: I'm aware that the easiest solution would be to duplicate the column, separate the words there as I wish, and build the index on that, but I'd rather not do that if it's not necessary because the table has millions of rows.
You are trying to change the characters that define a word. The place to look is in the documentation on fine-tuning the search. Specifically, you want to control what characters are allowed in a word -- and you want _ to be a non-word character.
One recommended method is to modify the character set file:
Suppose that you want to treat the hyphen character ('-') as a word character. Use one of these methods:
. . .
Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. You can edit the contents of the array in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes. For information about the array format, see Section 10.3.1, “Character Definition Arrays”.
The only downside is that this will affect all full text indexes.
An alternative is to define a second column where you replace underscores with spaces and build the full text index on that.
If you want an index-specific approach, then another option is to define your own collation.
Note: You may also need to be careful about the minimum word size. The default is 3 (InnoDB) or 4 (MyISAM); shorter words are ignored.
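The second-column alternative can be sketched in a few lines of Python. The column name name_search and the helper searchable_name are hypothetical; the SQL string uses MySQL's REPLACE() function to backfill existing rows in one statement.

```python
def searchable_name(name):
    """Replace underscores with spaces so each part is indexed as its own word."""
    return name.replace("_", " ")

# Assumed one-off backfill for existing rows; name_search is a hypothetical column
# that would carry the FULLTEXT index instead of name.
BACKFILL_SQL = "UPDATE mytable SET name_search = REPLACE(name, '_', ' ')"

print(searchable_name("some_text"))  # some text
```

New rows would need the same transformation at insert time, which is the duplication cost the question hoped to avoid.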

How to allow fulltext searching with hyphens in the search query

I have keywords like "some-or-other" where the hyphens matter in the search through my mysql database. I'm currently using the fulltext function.
Is there a way to escape the hyphen character?
I know that one option is to comment out #define HYPHEN_IS_DELIM in the myisam/ftdefs.h file, but unfortunately my host does not allow this. Is there another option out there?
Here's the code I have right now:
$search_input = $_GET['search_input'];
$keyword_safe = mysql_real_escape_string($search_input);
$keyword_safe_fix = "*'\"" . $keyword_safe . "\"'*";
$sql = "
SELECT *,
MATCH(coln1, coln2, coln3) AGAINST('$keyword_safe_fix') AS score
FROM table_name
WHERE MATCH(coln1, coln2, coln3) AGAINST('$keyword_safe_fix')
ORDER BY score DESC
";
From here http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
One way to find a word containing dashes or hyphens is to use a full-text search IN BOOLEAN MODE and to enclose the hyphenated word in double quotes.
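A minimal sketch of that approach in Python, assuming a driver that uses %s placeholders for bound parameters (as the common MySQL drivers for Python do). The helper name as_boolean_phrase is hypothetical; the table and column names are taken from the question.

```python
def as_boolean_phrase(term):
    """Wrap a search term in double quotes for AGAINST (... IN BOOLEAN MODE)."""
    # Embedded double quotes would end the phrase early, so drop them.
    cleaned = term.replace('"', " ").strip()
    return '"' + cleaned + '"'

SQL = (
    "SELECT *, MATCH(coln1, coln2, coln3) AGAINST (%s IN BOOLEAN MODE) AS score "
    "FROM table_name "
    "WHERE MATCH(coln1, coln2, coln3) AGAINST (%s IN BOOLEAN MODE) "
    "ORDER BY score DESC"
)

phrase = as_boolean_phrase("some-or-other")
params = (phrase, phrase)  # pass to cursor.execute(SQL, params)
print(phrase)              # "some-or-other"
```

Binding the phrase as a parameter avoids building the quoted string by hand. The words inside the phrase are still subject to the minimum word length and the stopword list.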
Or from here http://bugs.mysql.com/bug.php?id=2095
There is another workaround. It was recently added to the manual:
"Modify a character set file: This requires no recompilation. The true_word_char() macro uses a “character type” table to distinguish letters and numbers from other characters. You can edit the contents in one of the character set XML files to specify that '-' is a “letter.” Then use the given character set for your FULLTEXT indexes."
I have not tried it myself.
Edit: Here is some additional info from http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html
A phrase that is enclosed within double quote (“"”) characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in the FULLTEXT index for the words. Prior to MySQL 5.0.3, the engine then performed a substring search for the phrase in the records that were found, so the match must include nonword characters in the phrase. As of MySQL 5.0.3, nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example, "test phrase" matches "test, phrase" in MySQL 5.0.3, but not before.
If the phrase contains no words that are in the index, the result is empty. For example, if all words are either stopwords or shorter than the minimum length of indexed words, the result is empty.
Some people would suggest using the following query:
SELECT id
FROM texts
WHERE MATCH(text) AGAINST('well-known' IN BOOLEAN MODE)
HAVING text LIKE '%well-known%';
But with that approach you need many variants, depending on the full-text operators used. Task: implement a query like +well-known +(>35-hour <39-hour) working week*. Too complex!
And do not forget the default value of ft_min_word_len: a search for up-to-date matches only date.
Trick
Because of that I prefer a trick that makes constructions with HAVING etc. unnecessary:
Instead of adding the following text to your database table:
"The Up-to-Date Sorcerer" is a well-known science fiction short story.
copy the hyphenated words, without their hyphens, to the end of the text inside a comment:
"The Up-to-Date Sorcerer" is a well-known science fiction short story.<!-- UptoDate wellknown -->
If the user searches for up-to-date, remove the hyphens in the SQL query:
MATCH(text) AGAINST('uptodate ' IN BOOLEAN MODE)
That way your user can find up-to-date as one word instead of getting all results that contain only date (because ft_min_word_len drops up and to).
Of course before you echo the texts you should remove the <!-- ... --> comments.
Advantages
the query is simpler
the user is able to use all fulltext operators as usual
the query is faster.
If a user searches for -well-known +science, MySQL treats that as "must not include well, may include known, and must include science". This isn't what the user expected. The trick solves that, too (as the SQL query searches for -wellknown +science).
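The trick can be sketched in Python as three small helpers. All names here (with_search_bait, dehyphenate_query, for_display) are hypothetical, and the regular expression is only one way to recognise hyphenated words.

```python
import re

HYPHENATED = re.compile(r"\b\w+(?:-\w+)+\b")
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def with_search_bait(text):
    """Append the hyphenated words, minus hyphens, inside a trailing comment."""
    joined = [w.replace("-", "") for w in HYPHENATED.findall(text)]
    if not joined:
        return text
    return text + "<!-- " + " ".join(joined) + " -->"

def dehyphenate_query(query):
    """Join hyphenated search words; leading boolean operators are left alone."""
    return HYPHENATED.sub(lambda m: m.group(0).replace("-", ""), query)

def for_display(text):
    """Strip the comments again before showing the text to the user."""
    return COMMENT.sub("", text)

story = '"The Up-to-Date Sorcerer" is a well-known science fiction short story.'
print(with_search_bait(story))
print(dehyphenate_query("-well-known +science"))  # -wellknown +science
```

Store the output of with_search_bait, run user input through dehyphenate_query before it reaches AGAINST, and pass stored text through for_display before echoing it.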
It may be simpler to use the BINARY operator.
SELECT *
FROM your_table_name
WHERE BINARY your_column = BINARY "Foo-Bar%AFK+LOL"
http://dev.mysql.com/doc/refman/5.0/en/cast-functions.html#operator_binary
The BINARY operator casts the string following it to a binary string. This is an easy way to force a column comparison to be done byte by byte rather than character by character. This causes the comparison to be case sensitive even if the column is not defined as BINARY or BLOB. BINARY also causes trailing spaces to be significant.
My preferred solution to this is to remove the hyphen from the search term and from the data being searched. I keep two columns in my full-text table - search and return. search contains sanitised data with various characters removed, and is what the users' search terms are compared to, after my code has sanitised those as well.
Then I display the return column.
It does mean I have two copies of the data in my database, but for me that trade-off is well worth it. My FT table is only ~500k rows, so it's not a big deal in my use case.

How do you get your Fulltext boolean search to pick up the term C++?

So, I need to find out how to do a fulltext boolean search on a MySQL database to return a record containing the term "C++".
I have my SQL search string as:
SELECT *
FROM mytable
WHERE MATCH (field1, field2, field3)
AGAINST ("C++" IN BOOLEAN MODE)
Although all of my fields contain the string C++, it is never returned in the search results.
How can I modify MySQL to accommodate this? Is it possible?
The only solution I have found would be to escape the + character while entering my data, as something like "__plus", and then to modify my search to accommodate that, but this seems cumbersome and there has to be a better way.
How can I modify MySQL to accommodate this?
You'll have to change MySQL's idea of what a word is.
Firstly, the default minimum word length is 4. This means that no search term containing only words of fewer than 4 letters will ever match, whether that's ‘C++’ or ‘cpp’. You can configure this using the ft_min_word_len config option, e.g. in your my.cnf:
[mysqld]
ft_min_word_len=3
(Then stop/start MySQLd and rebuild fulltext indices.)
Secondly, ‘+’ is not considered a letter by MySQL. You can make it a letter, but then that means you won't be able to search for the word ‘fish’ in the string ‘fish+chips’, so some care is required. And it's not trivial: it requires recompiling MySQL or hacking an existing character set. See the section beginning “If you want to change the set of characters that are considered word characters...” in section 11.8.6 of the doc.
escape the + character during the process of entering my data as something like "__plus" and then modifying my search to accommodate
Yes, something like that is a common solution: you can keep your ‘real’ data (without the escaping) in a primary, definitive table — usually using InnoDB for ACID compliance. Then an auxiliary MyISAM table can be added, containing only the mangled words for fulltext search bait. You can also do a limited form of stemming using this approach.
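A minimal sketch of such mangling in Python, assuming the underscore counts as a word character for the parser in use. The helper name mangle is hypothetical. Only plus signs that directly follow a word character are rewritten, so a leading + boolean operator in a search string is left untouched.

```python
import re

# One or more "+" directly after a word character, as in "C++".
TRAILING_PLUS = re.compile(r"(?<=\w)\++")

def mangle(text):
    """Replace each such "+" with "__plus" so it survives indexing."""
    return TRAILING_PLUS.sub(lambda m: "__plus" * len(m.group(0)), text)

print(mangle("C++"))         # C__plus__plus
print(mangle("+C++ +java"))  # +C__plus__plus +java
```

The same function would be applied to the text stored in the auxiliary table and to each incoming search string. It also turns fish+chips into a single word, which is the caveat mentioned above.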
Another possibility is to detect searches that MySQL can't do, such as those with only short words, or unusual characters, and fall back to a simple-but-slow LIKE or REGEXP search for those searches only. In this case you will probably also want to remove the stoplist by setting ft_stopword_file to an empty string, since it's not practical to pick up everything in that as special too.
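One way to sketch that detection in Python. The helper names needs_fallback and like_pattern are hypothetical, the word-character rule is a simplification, and the LIKE escaping assumes the default backslash escape character.

```python
import re

# Assumption: letters, digits and "_" are the only word characters.
WORD_RE = re.compile(r"[0-9A-Za-z_]+")

def needs_fallback(query, min_len=4):
    """True when FULLTEXT is unlikely to handle the query as typed."""
    for term in query.split():
        words = WORD_RE.findall(term)
        # A term containing punctuation, or one that is too short,
        # would be split or dropped by the full-text parser.
        if len(words) != 1 or words[0] != term or len(term) < min_len:
            return True
    return False

def like_pattern(query):
    """Escape LIKE wildcards and wrap the query for a substring match."""
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return "%" + escaped + "%"

print(needs_fallback("C++"))     # True
print(needs_fallback("engine"))  # False
print(like_pattern("C++"))       # %C++%
```

When needs_fallback returns True, the application would run a LIKE query with like_pattern(query) bound as a parameter instead of the MATCH ... AGAINST query.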
From http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html:
A phrase that is enclosed within double quote (“"”) characters matches only rows that contain the phrase literally, as it was typed.
This means you can search for 'C++' using this query:
SELECT *
FROM mytable
WHERE MATCH (field1, field2, field3)
AGAINST ('"C++"' IN BOOLEAN MODE)
Usually escaped characters are used in the query not in the database data. Try escaping each "+" in your query.
Solution:
Change the my.ini file.
Put these two lines
ft_min_word_len = "1"
ft_stopword_file = ""
below
[mysqld]
then save the file and restart the MySQL server.
The my.ini file is shared by all sessions, so can we make these changes in my.ini for a single session only?