I have a table of medical diagnostic codes that users are able to perform a keyword search against. I have a column of descriptive text as well as a column of synonyms, both of which are considered. Results are presented in an auto-suggest format and the current implementation of the query is too slow for deployment:
SELECT
ID AS data, CONCAT('[', ICD10, '] ', description) AS value,
MAX(MATCH(description) AGAINST("fracture forearm current init oth" IN BOOLEAN MODE) +
(MATCH(synonyms) AGAINST("fracture forearm current init oth" IN BOOLEAN MODE) * 0.5)) AS relevance
FROM Code
WHERE
(MATCH(description) AGAINST("fracture forearm current init oth" IN BOOLEAN MODE) OR
MATCH(synonyms) AGAINST ("fracture forearm current init oth" IN BOOLEAN MODE)) AND
isPCS = 0 AND
isEnabled = 1 AND
ICD10 IS NOT NULL AND
description IS NOT NULL
GROUP BY ID
ORDER BY relevance DESC
LIMIT 100
There are ~170K rows in the table, though the latter four static constraints reduce it to ~94K rows, of which ~16K have synonyms. A typical query takes 0.45 seconds on my desktop (i7-4770K) and about 0.75 seconds on our development server (a lower-end Xeon). Removing the ORDER BY keyword reduces it to 0.02 and 0.05 seconds, respectively.
I had expected that sorting the results would be trivial compared to the full-text search, but this doesn't appear to be the case. Am I missing a glaring inefficiency?
I'm also looking into eventually rebuilding this functionality on top of Lucene/Solr (opinions/suggestions welcomed), but I'd like to have a better understanding of this behaviour, and an optimised interim solution wouldn't hurt either.
If you use ORDER BY relevance with LIMIT 100, MySQL has to find all rows that match your condition, evaluate your relevance formula for each of them, do a filesort, and only then take the first 100.
If you don't order, MySQL only has to find any 100 rows that fit the conditions, and can stop execution there.
So it is not the filesort after finding the result that makes it slow; it is that it has to find all results before doing the filesort (and there are probably a lot more than 100 rows that have at least some of the words you are looking for).
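One way to check this on your own data is to count how many rows satisfy the WHERE clause, since that is the number of rows that has to be scored and sorted before the LIMIT can apply. A minimal sketch, reusing the table, columns and search string from the question (the actual count depends on your data):

```sql
-- How many rows must be scored and sorted before LIMIT 100 applies?
SELECT COUNT(*) AS matching_rows
FROM Code
WHERE
  (MATCH(description) AGAINST('fracture forearm current init oth' IN BOOLEAN MODE) OR
   MATCH(synonyms) AGAINST('fracture forearm current init oth' IN BOOLEAN MODE)) AND
  isPCS = 0 AND
  isEnabled = 1 AND
  ICD10 IS NOT NULL AND
  description IS NOT NULL;
```

If this returns thousands of rows, the ordered query has to do that much work every time, no matter how small the LIMIT is.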
But there is actually an optimization you can use here: use a fulltext index on both of your columns together:
CREATE FULLTEXT INDEX idxft_Code_descr_syn ON Code (description, synonyms);
and then directly search in both columns together and order by the fulltext relevance directly without recalculating:
SELECT
ID AS data, CONCAT('[', ICD10, '] ', description) AS value,
MATCH(description, synonyms)
AGAINST("fracture forearm current init oth" IN BOOLEAN MODE) AS relevance
FROM Code
WHERE
MATCH(description, synonyms)
AGAINST("fracture forearm current init oth" IN BOOLEAN MODE) AND
isPCS = 0 AND
isEnabled = 1 AND
ICD10 IS NOT NULL AND
description IS NOT NULL
ORDER BY relevance DESC
LIMIT 100
This will slightly change your relevance compared to your current order, because it will not weight the synonyms column differently from the description column. But since each result had been normalized for its own single column, your current weights may not have had the expected effect anyway.
The ORDER BY relevance will still require finding every matching row, but because of the way fulltext indexes work (they are designed to return results ordered by relevance), you will probably get a decent speed boost out of it. Any of the specialized search engines you mentioned will be faster than general-purpose MySQL, though; whether they are necessary for 170k rows is for you to test. More RAM might sometimes be worth a shot too, but that is an entirely different topic.
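If the weighting matters to you, one possible variation is to let the combined index do the filtering and compute the weighted score only for the rows that survive. This is an untested sketch, and it assumes you keep the two single-column fulltext indexes alongside the new combined one:

```sql
SELECT
  ID AS data, CONCAT('[', ICD10, '] ', description) AS value,
  -- weighted score, as in the original query
  MATCH(description) AGAINST('fracture forearm current init oth' IN BOOLEAN MODE) +
  (MATCH(synonyms) AGAINST('fracture forearm current init oth' IN BOOLEAN MODE) * 0.5) AS relevance
FROM Code
WHERE
  -- a single fulltext lookup on the combined index does the filtering
  MATCH(description, synonyms)
    AGAINST('fracture forearm current init oth' IN BOOLEAN MODE) AND
  isPCS = 0 AND
  isEnabled = 1 AND
  ICD10 IS NOT NULL AND
  description IS NOT NULL
ORDER BY relevance DESC
LIMIT 100;
```

Whether this beats the original is something to measure, because it still evaluates two extra MATCH expressions for every matching row.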
10 million rows, MySQL Server 5.7, and two indexes called "tagline" and "experience".
This statement takes < 1 second:
SELECT count(*) FROM pa
WHERE MATCH(tagline) AGAINST('"developer"' IN BOOLEAN MODE);
This statement also takes < 1 second:
SELECT count(*) FROM pa
WHERE MATCH(experience) AGAINST('"python"' IN BOOLEAN MODE);
This combined statement takes 30 seconds:
SELECT count(*) FROM pa
WHERE MATCH(tagline) AGAINST('"developer"' IN BOOLEAN MODE)
AND MATCH(experience) AGAINST('"python"' IN BOOLEAN MODE);
Similar problem outlined here. Essentially slight alterations to fulltext match make it useless:
https://medium.com/hackernoon/dont-waste-your-time-with-mysql-full-text-search-61f644a54dfa
Change the last one to
SELECT count(*) FROM pa
WHERE MATCH(tagline, experience) AGAINST('+developer +python' IN BOOLEAN MODE)
and add
FULLTEXT(tagline, experience)
(I am assuming you are using Engine=InnoDB.)
Be aware that when MATCH is used, it is performed first, before anything else. In your case, one MATCH was performed, and then it struggled to perform the other; there is no way to run a second MATCH efficiently.
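Spelled out, the two steps above might look like this. The index name ft_tagline_experience is made up for illustration. Note that a multi-column MATCH treats the listed columns as one document, so this version also counts rows where both words occur in the same column, which is not exactly what the original AND of two separate MATCHes asked for.

```sql
-- Combined fulltext index over both columns (the index name is hypothetical)
ALTER TABLE pa ADD FULLTEXT INDEX ft_tagline_experience (tagline, experience);

-- One MATCH against the combined index instead of two separate ones
SELECT count(*) FROM pa
WHERE MATCH(tagline, experience) AGAINST('+developer +python' IN BOOLEAN MODE);
```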
Went with Sphinx. https://www.youtube.com/watch?v=OP0c26k_iQc
Fairly easy way of upgrading the capabilities of MySQL without committing to a new stack.
I have a table with quite a lot of rows (~2M).
When I search it like this:
SELECT * FROM product WHERE
MATCH(name) AGAINST('+some' '+word' IN BOOLEAN MODE)
it works like a charm and finds what I need in less than 0.5s.
But when I search for 2 sets of words like this:
SELECT * FROM product WHERE
MATCH(name) AGAINST('+some' '-word' IN BOOLEAN MODE)
OR
MATCH(name) AGAINST('+something' '-other' IN BOOLEAN MODE)
the search sometimes takes over a minute.
I would expect it to be 2 times slower (it's 2 searches), maybe a bit more (you still have to compare results and remove duplicates, but if there are only a few results that should not take long), but not this much slower. After adding the OR it is slower than LIKE "%...%" OR LIKE "%...%".
Can anyone tell me what I am doing wrong?
Unfortunately for you, fulltext indexes have some limitations, and not being able to properly merge the results of two independent fulltext searches is one of them:
The Index Merge optimization algorithm has the following known limitations:
[...]
Index Merge is not applicable to full-text indexes.
Fortunately for you, fulltext searches can be more complex, so you can merge your searches yourself. Your second query can be written as a single search using:
SELECT * FROM product WHERE
MATCH(name) AGAINST('(+something -other) (+some -word)' IN BOOLEAN MODE)
This defines two search sets and matches if either of the two (...) groups matches, which is an OR.
Alternatively, you can use a UNION instead of an OR, which allows MySQL to actually run two independent fulltext searches and then combine the two results, which is basically what you had in mind:
SELECT * FROM product WHERE
MATCH(name) AGAINST('+some -word' IN BOOLEAN MODE)
UNION
SELECT * FROM product WHERE
MATCH(name) AGAINST('+something -other' IN BOOLEAN MODE)
This also works for more complicated situations, e.g. merging searches on different columns, but it will not be that easy if you want to do something other than an OR.
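For the different-columns case, a sketch could look like the following. The description column is hypothetical (it is not part of your question) and is assumed to have its own fulltext index:

```sql
-- Each branch can use its own fulltext index; UNION removes duplicate rows
SELECT * FROM product WHERE
MATCH(name) AGAINST('+some -word' IN BOOLEAN MODE)
UNION
SELECT * FROM product WHERE
MATCH(description) AGAINST('+something -other' IN BOOLEAN MODE);
```

UNION (as opposed to UNION ALL) is what removes rows that both searches return, so it is the part that gives you OR semantics without duplicates.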
I have a MySQL MyISAM table with a full-text index on the keywords column and 20 million rows. It works well when I search for rare words, for example:
SELECT count(*) FROM books WHERE MATCH(keywords) AGAINST ('+DUCK' IN BOOLEAN MODE)
(0.005s, 2k results)
But when I search for a more common term it is much slower:
SELECT count(*) FROM books WHERE MATCH(keywords) AGAINST ('+YES' IN BOOLEAN MODE)
(5s, 2 million results)
It makes sense because the last one returns many more rows, but then how can I pre-filter the rows before the text search? This doesn't work:
SELECT count(*) FROM books WHERE date > "2019-09-23" AND MATCH(keywords) AGAINST ('+YES' IN BOOLEAN MODE)
(5s, 0 results)
MyISAM's (and maybe InnoDB's) FULLTEXT will always do the MATCH first, then any other clauses. Hence, adding that extra filter does not help with speed.
Think of it this way... A FT index is constructed to test the entire table(s) for the MATCH clause. It is not ready to handle any filtering before it goes to work. So, you are stuck with FT first, then filter the results the other way but without benefit of any indexes.
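If you want to confirm this on your own server, EXPLAIN shows which access method was picked. A sketch using the query from the question; I would expect the type column to read fulltext, meaning the FT index drives the lookup and the date condition is only applied afterwards, but check what your version actually reports:

```sql
EXPLAIN
SELECT count(*) FROM books
WHERE date > '2019-09-23'
  AND MATCH(keywords) AGAINST ('+YES' IN BOOLEAN MODE);
```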
Table type: MyISAM
Rows: 120k
Data Length: 30MB
Index Length: 40MB
my.ini, MySQL 5.6.2 Windows
read_rnd_buffer_size = 512K
myisam_sort_buffer_size = 16M
Windows Server 2012, 12GB RAM, SSD 400MB/s
1 Slow Query:
SELECT article_id, title, author, content, pdate, MATCH(author, title, content)
AGAINST('Search Keyword') AS score FROM articles ORDER BY score DESC LIMIT 10;
Executing this query takes 352 ms and uses an index. After profiling, it shows that most of the time is spent on "Creating sort index". (Complete detail: http://pastebin.com/raw/jT58DCN5)
2 Faster Query:
SELECT article_id, title, author, content, pdate, MATCH(author, title, content)
AGAINST('Search Keyword') AS score FROM articles LIMIT 10;
Executing this query takes 23 ms and does a full table scan; I don't like full table scans.
The problem / question is, Query #1 is the one that I need to use, since the sorting is very important.
Is there anything I can do to speed up that query, or rewrite it, and achieve the same result (as #1)?
Appreciate any input and help.
Maybe you are simply expecting too much? 350ms for doing a
MATCH(author, title, content) AGAINST('Search Keyword')
ORDER BY
on 120k records doesn't sound too shabby to me; especially if content is 'bigish' ...
Keep in mind that for your "SLOW QUERY" to work, the system has to read every row, calculate the score, and then in the end sort all scores, figure out the highest 10 values, and return all relevant row information for them. If you leave out the ORDER BY, it simply picks the first 10 rows and only needs to calculate the score for those 10 rows.
That said, I think the EXPLAIN is a bit misleading in that it seems to blame everything on the SORT while most probably it is the MATCH that takes up most of the time. I'm guessing the MATCH() operator is executed in a 'lazy' manner and thus only is run when the data is asked for which in this case is while the sorting is happening.
To figure this out, simply add a new column score and split the query into two parts:
ALTER TABLE articles ADD COLUMN score FLOAT; -- temporary column, for this experiment only
UPDATE articles SET score = MATCH(author, title, content) AGAINST('Search Keyword'); -- will take about 300 ms, I guess
SELECT article_id, title, author, content, pdate, score FROM articles ORDER BY score DESC LIMIT 10; -- will take about 50 ms, I guess
Of course this is not a working solution, but if I'm right it will show you that your problem is not with the SORT but rather with the fulltext searching...
PS: you forgot to mention what the indexes are on the table, might be useful to know too. cf https://dev.mysql.com/doc/refman/5.7/en/innodb-fulltext-index.html
Try these variations:
AGAINST('word')
AGAINST('+word')
AGAINST('+word' IN BOOLEAN MODE)
Try
SELECT ... MATCH ...
FROM tbl
WHERE MATCH ... -- repeat the test here
...
The test is to eliminate the rows that do not match at all, thereby greatly diminishing the number of rows to sort. (Again, depending on + and BOOLEAN.)
(I usually use all three: +, BOOLEAN, and WHERE MATCH.)
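Applied to the query from the question, using all three might look like the sketch below. 'Search Keyword' is the placeholder from the question, and putting + in front of both words changes the meaning so that both must be present, which may or may not be what you want:

```sql
SELECT article_id, title, author, content, pdate,
       MATCH(author, title, content)
         AGAINST('+Search +Keyword' IN BOOLEAN MODE) AS score
FROM articles
-- repeat the test here so that non-matching rows never reach the sort
WHERE MATCH(author, title, content)
        AGAINST('+Search +Keyword' IN BOOLEAN MODE)
ORDER BY score DESC
LIMIT 10;
```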
key_buffer_size = 2G may help also.
You should consider moving to InnoDB, where FT is faster.
I've been looking for resources to explain how this query exactly sorts retrieved items by relevance, and haven't been able to find any.
Hopefully one of you can explain the logistics of it to me?
SELECT *, MATCH(body, subject) AGAINST ('words' IN BOOLEAN MODE) AS relevance
FROM `messages`
WHERE MATCH(body, subject) AGAINST ('words' IN BOOLEAN MODE)
ORDER BY relevance DESC
In this case, I know that the first half of this query searches through the messages.body and messages.subject columns for the search term "words". It then returns those results (regardless of the Boolean operators) in what is essentially a "random order" (ordered by what is found first, then found second, and so on).
What I don't understand, however, is how MySQL interprets the WHERE clause and the rest of the query. How does repeating the first half of code reorder the results by relevance?
For example, an ORDER BY clause that sorts a users.user_id column by desc. numerical order MAKES SENSE to me because each row/cell has a clear order (e.g. - 3 , 2 , 1, and so on)
But how does (going back to the original query) MySQL interpret these "word" results (words, obviously not having any values/numbers/clear-order) and sort them according to relevance?
Is it because the Boolean Full-text Search gives hidden numerical values to these search terms? Like if the AGAINST clause read:
AGAINST ('+apple -macintosh ~microsoft >windows' IN BOOLEAN MODE)
Like "apple" gets a value of 100, "macintosh" a value of -100, "microsoft" a value of 20, and "windows" a value of 40 (to reflect the Operator Effects)?
I know that this is oversimplifying the process (especially when considering if a column contains more than one of these search terms), but that is the best I got.
What I basically need, is a layman-terms explanation of the WHERE clause's (the 2nd half of query code's) effect on the query results as a whole.