Improve MySQL InnoDB Full-text Search Performance - mysql

Background:
DB: MySQL 8.0, InnoDB engine;
Table size: About 2M rows of 2G data;
FULL_TEXT index column: sentence (TEXT data type)
https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html
Old Query using SQL LIKE:
SELECT * FROM books WHERE sentence LIKE '%This is a sample search input string%' and author_id = 5 and publisher_id = 23;
New Query using MySQL FULL_TEXT search:
SELECT * FROM books WHERE MATCH (sentence) AGAINST ('This is a sample search input string') and author_id = 5 and publisher_id = 23 LIMIT 1;
Problems:
I expected a large speed-up from switching from LIKE to full-text search (MATCH ... AGAINST), but my testing shows otherwise:
For an input string with fewer than 10 words, full-text search is faster than LIKE;
For an input string with about 25 words, full-text search can take 3+ seconds to return, which is similar to LIKE.
The longer the string, the slower full-text search gets; it can take more than 15 seconds.
Profiling the query:
https://dev.mysql.com/doc/refman/8.0/en/show-profile.html
By looking at the profiling result, 90% of the time is spent on "FULLTEXT initialization"
Optimizations I've tried that did not improve speed:
6.1 Rewrite query trying to use other indexes together with full-text index:
select * from books as b1 join books b2 on b1.author_id = b2.author_id and b1.publisher_id = b2.publisher_id WHERE b2.author_id = 5 and b2.publisher_id = 23 and MATCH (b1.sentence) AGAINST ('Sample input string') LIMIT 1;
6.2 Only select the document_id instead of the whole record:
SELECT id FROM books WHERE MATCH (sentence) AGAINST ('This is a sample search input string') and author_id = 5 and publisher_id = 23 LIMIT 1;
Questions:
Are there any other ways that I could try to improve the search speed? According to this doc: https://dev.mysql.com/doc/refman/8.0/en/fulltext-fine-tuning.html
I could try adding more stop words, running OPTIMIZE TABLE, moving the current table to a new one, or upgrading the hardware. But I'm not sure whether any of those methods are worth trying.

Toss the short words and any stop words.
Prefix the rest with '+' and add "IN BOOLEAN MODE":
AGAINST ('+sample +search +input +string' IN BOOLEAN MODE)
If you need the words exactly as in your LIKE, then do
WHERE MATCH(..) AGAINST(...)
AND ... LIKE '%This is a sample search input string%'
AND ...
The FULLTEXT search will be fast, but may find documents that have the words in some other order. The LIKE will be fast because it checks only a few rows (the ones that FT found).
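The "toss short words and stop words, prefix the rest with '+'" step can be done in application code before the query is sent. Below is a minimal Python sketch of that idea. The function name to_boolean_query is hypothetical, and the stop word set is only an illustrative subset; the list your server actually uses can be read from INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD.

```python
import re

# Illustrative subset of stop words (assumption: not the server's exact list).
STOPWORDS = {"a", "about", "an", "are", "as", "at", "be", "by", "for", "from",
             "how", "in", "is", "it", "of", "on", "or", "that", "the", "this",
             "to", "was", "what", "when", "where", "who", "will", "with"}

MIN_TOKEN_SIZE = 3  # InnoDB default for innodb_ft_min_token_size


def to_boolean_query(text, stopwords=STOPWORDS, min_len=MIN_TOKEN_SIZE):
    """Build a '+word +word' string for AGAINST(... IN BOOLEAN MODE)."""
    # Keep only runs of word characters so boolean operators typed by the
    # user (+, -, ", * and so on) cannot leak into the search expression.
    words = re.findall(r"\w+", text.lower())
    kept = []
    for w in words:
        # Drop short words, stop words, and repeated words.
        if len(w) >= min_len and w not in stopwords and w not in kept:
            kept.append(w)
    return " ".join("+" + w for w in kept)


query = to_boolean_query("This is a sample search input string")
print(query)
```

The resulting string should be passed to the query as a bound parameter rather than concatenated into the SQL text.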

Quick way to find a word using a SQL query

My current code is to try to find 2 words "Red Table" in Title:
SELECT `id`,`title`,`colors`, `child_value`, `vendor`,`price`, `image1`,`shipping`
FROM `databasename`.`the_table` WHERE
`display` = '1' and (`title` REGEXP '([[:blank:][:punct:]]|^)RED([[:blank:][:punct:]]|$)')
and (`title` REGEXP '([[:blank:][:punct:]]|^)TABLE([[:blank:][:punct:]]|$)')
The problem is that this is very slow, even though I added an index on the title column.
I just want to search for multiple words in one column (I would prefer in title AND description), but obviously I can't use LIKE because each word has to be delimited by a space, a dash, or the start or end of the string.
I tried chat or something like that but phpmyadmin said function doesn't exist.
Any suggestions?
A regular index cannot be used for REGEXP, or for LIKE with a leading wildcard. Use full-text search for this. You can create a FULLTEXT index on several columns and search them all in a single expression.
CREATE TABLE the_table(
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
title VARCHAR(200),
description TEXT,
...
FULLTEXT(title,description)
) ENGINE=InnoDB;
And write query like this:
SELECT * FROM the_table WHERE MATCH(title , description )
AGAINST('+RED +TABLE' IN BOOLEAN MODE) -- + means that a word must be present in each row that is returned
Read more about usage and options: MySQL Full text search
Plan A: MySQL pre-8.0: [[:<:]]RED[[:>:]] is a simplification. Those codes mean "word boundary", which encompasses space, punctuation, and start/end.
Plan B: MySQL 8.0: \\bRED\\b is the new phrasing of the Plan A.
Plan C: FULLTEXT(title) with AND MATCH(title) AGAINST('+RED +TABLE' IN BOOLEAN MODE)
Plan D: If you need specifically "RED TABLE" but not "BLUE TABLE AND A RED CHAIR", then use this technique:
AND MATCH(title) AGAINST('+RED +TABLE' IN BOOLEAN MODE)
AND title LIKE '%RED TABLE%';
Note on Plan D: The fulltext search is very efficient, hence it is done first. The other clauses are then applied to the few resulting rows. That is, the cost of LIKE (or a similar REGEXP) is mitigated by having to check far fewer rows.
Note: We have not discussed "red table" versus "red tables"
By choosing a suitable collation for the title column, you can either require the same case as the argument or have "Red" = "RED" = "red" = ...
Plan E: (To further fill out the discussion): FULLTEXT(title) with AND MATCH(title) AGAINST('+"red table"' IN BOOLEAN MODE) should match only "red" and "table" in that order and next to each other.
In general...
Use ENGINE=InnoDB, not MyISAM.
It is not really practical to set the minimum word length to 1 or 2. (3 is the default for InnoDB, where the setting is innodb_ft_min_token_size; the MyISAM equivalent is ft_min_word_len.)
If your hosting provider does not allow any my.cnf changes, change providers.
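To make the reasoning behind Plan D concrete, here is a small in-memory Python sketch of the two stages: a word-level filter standing in for the FULLTEXT match, followed by a substring check standing in for LIKE. The titles and helper names are made up for illustration, and this only imitates the filtering logic, not how MySQL executes the query.

```python
import re

# Made-up sample titles, for illustration only.
titles = [
    "Red table with drawer",
    "Blue table and a red chair",
    "Tired tablet stand",
    "Large RED TABLE, oak",
]


def has_all_words(title, words):
    """Stage 1: every word present as a whole word, in any order."""
    tokens = set(re.findall(r"\w+", title.lower()))
    return all(w.lower() in tokens for w in words)


def has_phrase(title, phrase):
    """Stage 2: plain substring test, like LIKE '%RED TABLE%'."""
    return phrase.lower() in title.lower()


# Stage 1 narrows the rows; stage 2 only runs on the survivors.
candidates = [t for t in titles if has_all_words(t, ["red", "table"])]
matches = [t for t in candidates if has_phrase(t, "red table")]
print(matches)
```

Note that "Tired tablet stand" would pass the substring test on its own ("ti-red table-t"), which is one reason to keep the word-level stage in front of it.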

Mysql fulltext search didn't find exact phrase

I have a table with a name column that has a fulltext index.
The table has many rows, and some of them contain the phrase "aberlour 18" right at the start.
But when I run a fulltext search with the exact name from the table:
SELECT * FROM `whiskybase_bottles` WHERE MATCH(`name`) AGAINST('Aberlour 18 year old')
the matching row only comes up in 6th place.
How can I improve it to find it first?
I don't want to use "LIKE" search because fulltext works better for my needs on other cases.
I already decreased the innodb_ft_min_token_size parameter from 3 to 2 so that the age statement is indexed, and recreated the index afterwards.
If you are trying to find rows that have "Aberlour 18 year old" included in the name but not necessarily as the entire name, you can use a quoted phrase with the + operator in a boolean full-text search. Without IN BOOLEAN MODE, your query runs in natural language mode, which matches rows containing any of the words and ranks them by relevance; it does not treat the input as a single phrase.
Your code would look something like this:
SELECT *
FROM `whiskybase_bottles`
WHERE MATCH(`name`) AGAINST('+"Aberlour 18 year old"' IN BOOLEAN MODE)
You can learn more about BOOLEAN MODE and the related operators here:
https://www.w3resource.com/mysql/mysql-full-text-search-functions.php
For a general search with a variable you can do something like this:
SELECT *
FROM `whiskybase_bottles`
WHERE MATCH(`name`) AGAINST(concat('+\"', search_phrase, '\"') IN BOOLEAN MODE)
If you want to rank the exact phrase at the top, use this code instead:
SELECT *,MATCH (name) AGAINST( 'Aberlour 18 year old' IN BOOLEAN MODE) as score
FROM `whiskybase_bottles`
WHERE MATCH(`name`) AGAINST('Aberlour 18 year old')
ORDER BY name = 'Aberlour 18 year old' DESC, score DESC
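The two-key ORDER BY above (exact name first, then relevance) can be pictured with a short Python sketch. The rows and scores below are invented for illustration, and the case-insensitive comparison assumes the column uses a case-insensitive collation.

```python
# Made-up rows and relevance scores, for illustration only.
rows = [
    {"name": "Aberlour 18 year old Sherry Cask", "score": 3.1},
    {"name": "Aberlour 12 year old", "score": 2.4},
    {"name": "Aberlour 18 year old", "score": 2.9},
]


def rank(rows, phrase):
    """Exact name matches first, then descending score."""
    # False sorts before True, so rows whose name equals the phrase come
    # first; negating the score gives descending order within each group.
    return sorted(
        rows,
        key=lambda r: (r["name"].lower() != phrase.lower(), -r["score"]),
    )


ranked = rank(rows, "Aberlour 18 year old")
for r in ranked:
    print(r["name"], r["score"])
```

The exact match is listed first even though another row has a higher relevance score, which is the effect of putting the equality test ahead of the score in the sort key.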

MYSQL REGEX search many words with no order condition

I am trying to use a regex in MySQL to match whole words in a JSON array string, but I don't want the regex to depend on word order because I don't know the order in advance.
So I first wrote my regex on regex101 (https://regex101.com/r/wNVyaZ/1) and then tried to convert it for MySQL.
WHERE `Wish`.`services` REGEXP '^([^>].*[[:<:]]Hygiène[[:>:]])([^>].*[[:<:]]Radiothérapie[[:>:]]).+';
WHERE `Wish`.`services` REGEXP '^([^>].*[[:<:]]Hygiène[[:>:]])([^>].*[[:<:]]Andrologie[[:>:]]).+';
The first query returns a result because "Hygiène" comes before "Radiothérapie", but in the second case "Andrologie" comes before "Hygiène" in the data, not after it as written in the query. The problem is that the query is generated automatically from a list of services chosen in no particular order, and I want to match whole words if they exist, whatever order they are in.
You can search for words in JSON like the following (I tested on MySQL 5.7):
select * from wish
where json_search(services, 'one', 'Hygiène') is not null
and json_search(services, 'one', 'Andrologie') is not null;
+------------------------------------------------------------+
| services |
+------------------------------------------------------------+
| ["Andrologie", "Angiologie", "Hygiène", "Radiothérapie"] |
+------------------------------------------------------------+
See https://dev.mysql.com/doc/refman/5.7/en/json-search-functions.html#function_json-search
If you can, use the JSON search queries (you need a MySQL with JSON support).
If it's advisable, consider changing the database structure and enter the various "words" as a related table. This would allow you much more powerful (and faster) queries.
JOIN has_service AS hh ON (hh.row_id = id)
JOIN services AS ss ON (hh.service_id = ss.id
AND ss.name IN ('Hygiène', 'Angiologie', ...))
Otherwise, in this context, consider that you're not really doing a regexp search, and you're doing a full table scan anyway (unless MySQL 8.0+ or PerconaDB 5.7+ (not sure) and an index on the full extent of the 'services' column), and several LIKE queries will actually cost you less:
WHERE (services LIKE '%"Hygiène"%'
OR services LIKE '%"Angiologie"%'
...)
or
IF(services LIKE '%"Hygiène"%', 1, 0)
+IF(services LIKE '%"Angiologie"%', 1, 0)
+ ... AS score
HAVING score > 0 -- or score=5 if you want only matches on all full five
ORDER BY score DESC;
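The order-independent score from the last query can be sketched in Python as a count of wanted services found in the JSON array. This is only an illustration of the counting logic; the function name match_score and the service "Cardiologie" are hypothetical.

```python
import json


def match_score(services_json, wanted):
    """Count how many wanted services appear in the JSON array, in any order."""
    present = set(json.loads(services_json))
    return sum(1 for w in wanted if w in present)


row = '["Andrologie", "Angiologie", "Hygiène", "Radiothérapie"]'

# Both services present, regardless of their position in the array.
score_all = match_score(row, ["Hygiène", "Andrologie"])
# Only one of the two present ("Cardiologie" is a made-up example).
score_some = match_score(row, ["Hygiène", "Cardiologie"])
print(score_all, score_some)
```

A row matches every requested service when its score equals the number of requested services, which corresponds to the "score=5" variant of the HAVING clause.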

mysql ngrams indexing example

As I have read in many places, ngram indexing can improve word searches.
In this old post it says that it could be adapted for mysql but it does not say how: levenshtein alternative
Can anyone give an example of how to use this technique in MySQL?
Can this technique be used to improve the performance of a levenshtein function in MySQL?
I need to find approximate text matches (like levenshtein).
In my tests I used the levenshtein() and levenshtein_ratio() functions from:
http://www.artfulsoftware.com/infotree/qrytip.php?id=552
SELECT *, levenshtein_ratio('stacoverflou',words_column) AS ratio
FROM my_table
ORDER BY ratio DESC
This improves performance (assuming the first letter is not misspelled):
SELECT *, levenshtein_ratio('stacoverflou',words_column) AS ratio
FROM my_table
WHERE words_column LIKE 's%'
ORDER BY ratio DESC
Also I found this php library for building ngrams:
https://gist.github.com/Xeoncross/5366393
But I have no idea how to use these ngrams in mysql
I finally came up with an algorithm myself:
Generate ngrams algorithm:
1. I build a words table with 3 columns: ngrams (fulltext), word (UNIQUE), lang
2. I use the Bigrams() function to make ngrams for each word
3. I add character padding to each ngram to get past the full-text index minimum word length, e.g. 'abcd' should be 'ab bc cd' but with padding it looks like 'abxx bcxx cdxx'
Search algorithm:
4. I take the words the client typed as correct and use them to search the real table with a MySQL fulltext search query using AGAINST('+word_1 +word_2 +word_n' IN BOOLEAN MODE)
5. If the score (ranking) > 0: mission accomplished, ngrams not used (show results to the client)
6. If the score (ranking) = 0 (maybe misspelled words), use the ngrams words table to retrieve the correct words
Retrieve correct words from ngrams algorithm:
7. For each word, generate ngrams and perform a fulltext search query using AGAINST('abxx bcxx cdxx' IN BOOLEAN MODE) on the words table (where we have the ngrams column) and retrieve the correct word. Note that there is no (+) prefix on the ngrams here
8. Rebuild the search as in step 4
9. If score > 0: mission accomplished -> show results -> END
10. If score is still 0, make another query, this time without the (+) word prefix and IN NATURAL LANGUAGE MODE -> show results -> END
Step 2 code:
// original from : https://gist.github.com/Xeoncross/5366393
// modified for working also with unicode characters
function Bigrams($word){
$ngrams = array();
$len = mb_strlen($word);
for($i=0;$i+1<$len;$i++){
$ngram = mb_substr($word, $i, 2);
while(mb_strlen($ngram) < 4){
$ngram .= "x";
}
$ngrams[$i]=$ngram;
}
return implode(" ",$ngrams);
}
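For readers not using PHP, the same padded-bigram generation can be written in Python roughly as follows. This is a port sketch of the function above, not part of the original answer; the parameter names pad_to and pad_char are my own.

```python
def bigrams(word, pad_to=4, pad_char="x"):
    """Return the word's bigrams, each padded to pad_to characters."""
    grams = []
    # Python strings are indexed by code point, so this is Unicode-safe
    # in the same way as the mb_* functions in the PHP version.
    for i in range(len(word) - 1):
        gram = word[i:i + 2]
        grams.append(gram.ljust(pad_to, pad_char))
    return " ".join(grams)


print(bigrams("abcd"))
print(bigrams("stacoverflou"))
```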
Step 4 code:
SELECT my_column,
( MATCH(my_full_text_column )
AGAINST('+word_1 +word_2 +word_n' IN BOOLEAN MODE)
) AS score
FROM my_table LIMIT 10
Step 7 code:
$word = "stacoverflou"; // Intentionally misspelled
$actual_word_ngrams = Bigrams($word);
// that returns
// stxx taxx acxx coxx ovxx vexx erxx rfxx flxx loxx ouxx
SELECT word,
( MATCH( ngrams )
AGAINST('$actual_word_ngrams' IN BOOLEAN MODE)
) AS score
FROM words
ORDER BY score DESC LIMIT 1
That returns stackoverflow, and it can be combined with the other words (if there are several) for a much more accurate search, as in step 4.
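Why the ngram lookup recovers the intended word can be seen with a simplified Python sketch that scores dictionary words by the number of bigrams they share with the misspelled input. This is an assumption-laden stand-in: MySQL's full-text relevance score is not a plain overlap count, and the word list and function names are invented for illustration.

```python
def bigram_set(word):
    """Set of adjacent character pairs in the word (no padding needed here)."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def closest_word(misspelled, dictionary):
    """Return the dictionary word sharing the most bigrams, or None."""
    target = bigram_set(misspelled)
    best, best_score = None, 0
    for candidate in dictionary:
        score = len(target & bigram_set(candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best


# Made-up dictionary standing in for the words table.
words = ["stackoverflow", "overflow", "stack", "underflow"]
print(closest_word("stacoverflou", words))
```

A misspelling changes only the few bigrams that touch the wrong letters, so the correct word still shares most of its bigrams with the input.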
END

ORDER BY in MySql based on LIKE condition

I am having difficulty sorting results based on a field in MySQL. For example, if I search for the word "Terms", I should get the results that start with 'Terms' first, then 'Terms and', then 'Terms and conditions', and so on.
Can anyone please help me fetch the search results according to these requirements in an efficient manner with a MySQL query?
SELECT * FROM your_table WHERE your_column LIKE "Terms%" ORDER BY your_column;
Based on the storage engine and mysql version you probably can use the full text search capabilities of MySQL. For example:
SELECT *, MATCH (your_column) AGAINST ('Terms' IN BOOLEAN MODE) AS relevance
FROM your_table
WHERE MATCH (your_column) AGAINST ('Terms' IN BOOLEAN MODE)
ORDER BY relevance DESC
You can find more info here: http://dev.mysql.com/doc/refman/5.5/en/fulltext-boolean.html
Or, if you don't want FTS, another possible solution orders strictly by the difference in string length:
SELECT * FROM your_table WHERE your_column LIKE "Terms%" ORDER BY ABS(LENGTH(your_column) - LENGTH('Terms'));
You are looking for fulltext search. Below is a very simple example:
SELECT id, name, MATCH (name) AGAINST ('>string string*' IN BOOLEAN MODE) AS score
FROM tablename WHERE MATCH (name) AGAINST ('>string string*' IN BOOLEAN MODE)
ORDER BY score DESC
The advantage of this is that you can control the value of words. This is very basic, you can 'up' some matches or words (or 'down' them)
In my example an exact match ('string') would get a higher score than the string with something attached ('string*'). The following line is even one step broader:
'string' > 'string*' > '*string*'
The documentation on fulltext search explains a lot. It's a long read, but complete and worth it.
Don't use a fulltext index if you are searching for a prefix string!
Using LIKE "Terms%", the optimizer will be able to use an index on your_column, if one exists:
SELECT * FROM your_table
WHERE your_column LIKE "Terms%"
ORDER BY CHAR_LENGTH(your_column),your_column
Note the ORDER BY clause: it first sorts by string length, and only uses alphabetical order to sort strings of equal length.
And please use CHAR_LENGTH rather than LENGTH: the former counts the number of characters, whereas the latter counts the number of bytes. With a variable-length encoding such as utf8, this makes a difference.
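The character-versus-byte distinction and the length-then-alphabetical ordering can be checked outside the database with a short Python sketch. The sample values and the function name order_prefix_matches are made up for illustration; len() on a Python string plays the role of CHAR_LENGTH, and the length of the UTF-8 encoding plays the role of LENGTH.

```python
def order_prefix_matches(values, prefix):
    """Keep values starting with prefix; sort by length, then alphabetically."""
    hits = [v for v in values if v.lower().startswith(prefix.lower())]
    return sorted(hits, key=lambda v: (len(v), v))


values = ["Terms and conditions", "Terms", "Privacy", "Terms and"]
ordered = order_prefix_matches(values, "Terms")
print(ordered)

# Character count vs byte count for a string with a non-ASCII letter.
char_len = len("Términos")                   # counts characters
byte_len = len("Términos".encode("utf-8"))   # counts bytes in UTF-8
print(char_len, byte_len)
```

Because "é" takes two bytes in UTF-8, a byte-based length would rank such strings as longer than they look, which is the ordering problem the note above warns about.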