How to improve mysql NATURAL LANGUAGE MODE search query?

This is my query:
SELECT * FROM myTable WHERE MATCH (name) AGAINST ("Apple M1" IN NATURAL LANGUAGE MODE)
If I search for Apple M1, the first result is Orange M1, and only in third position or lower do I get Apple M-1, which is the value I stored and which I assumed would come first.
My question is: is there a way to fine-tune the MySQL search?

The best way to improve a MySQL Natural Language Mode search is to use a Boolean full-text search instead. It works much like a Natural Language Mode search, but you can use additional modifiers to fine-tune your results, e.g. these two:
> <
These two operators are used to change a word's contribution to the relevance value that is assigned to a row. The > operator increases the contribution and the < operator decreases it.
There is one minor difference: a boolean mode search does not automatically order by relevance, so you have to order the results yourself.
SELECT * FROM myTable
WHERE MATCH (name) AGAINST (">Apple M1" IN BOOLEAN MODE)
ORDER BY MATCH (name) AGAINST (">Apple M1" IN BOOLEAN MODE) desc
And a remark: neither version of full-text search will find M-1 if you match against M1 (even with a minimum word length setting of 2). Both only look for exact (usually case-insensitive) word matches and do not look for similar words (unless you use *). They "just" weigh the combination of (exact) words by some algorithm and, if you use them, the modifiers.
Update: some additional clarification following the comments:
If you match against Apple M1, it returns rows that contain (case-insensitive) Apple or M1 in any order, so e.g. M1 apple, Apple M4, Apple M-1 and Orange M1. It will not find Apples M4 or Orange M-1, because they do not contain exactly those words. (Likewise, like '%M-1%' would not find Apple M1.) If you like, you can match against Apple* to find Apple and Apples, but the wildcard is only allowed at the end of the word; *Apple* is not possible, so you would have to use like '%Apple%' then.
These rows are then ordered by the scoring algorithm, which basically scores words that are less common in your texts higher than very common words. If you add >Apple, it gives Apple a higher value. The score is just a number; you can add it to your select, e.g. select ..., MATCH (name) AGAINST (">Apple M1" IN BOOLEAN MODE) as score, to get a feeling for it.
There are some other things to consider:
Only words that have a minimum length are added to the index. That length is given by innodb_ft_min_token_size for InnoDB or ft_min_word_len for MyISAM. So you should set it to e.g. 2 to include M1 (otherwise, this word will not have any effect in your search; since you found Orange M1 in your example, I assume it is set correctly).
The - character is usually treated as a hyphen, i.e. as a word separator. So M-1 in your text will be split up into two words, M and 1 (which may or may not be indexed depending on your minimum word length setting, so maybe set it to 1). You can change that behaviour by adding - to the character set (see Fine-Tuning MySQL Full-Text Search, the part beginning with Modify a character set file), but then a search for blue and/or green will no longer find blue-green.
The full-text search uses stopwords. These words are not included in your index. The list includes a and i, so even with a minimum word length of 1, you would not find them. You can edit that list.
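To get a feeling for how these three settings interact, here is a rough Python simulation. It is only an illustration, not MySQL's actual parser: the function name indexed_words, the tiny stopword set and the word_chars parameter are all made up for this sketch.

```python
import re

# Illustrative only: a rough stand-in for the built-in parser, not MySQL code.
# HYPOTHETICAL_STOPWORDS is a tiny sample; the real default lists are longer.
HYPOTHETICAL_STOPWORDS = {"a", "i"}

def indexed_words(text, min_len=3, word_chars=r"A-Za-z0-9_"):
    """Return the lower-cased words that would end up in the index."""
    # Anything outside word_chars acts as a separator (this is why M-1 splits).
    tokens = re.findall(f"[{word_chars}]+", text)
    return [
        t.lower()
        for t in tokens
        if len(t) >= min_len and t.lower() not in HYPOTHETICAL_STOPWORDS
    ]

# Default-like setting: both M and 1 are too short, so only apple survives.
print(indexed_words("Apple M-1", min_len=3))   # ['apple']
# Minimum length 1: M and 1 are indexed, but as two separate words.
print(indexed_words("Apple M-1", min_len=1))   # ['apple', 'm', '1']
# Hyphen added to the word characters: M-1 stays one word.
print(indexed_words("Apple M-1", min_len=2, word_chars=r"A-Za-z0-9_\-"))  # ['apple', 'm-1']
```

In none of the three configurations does the stored text produce the word m1, which is why a search for M1 alone does not reach Apple M-1.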
Some ideas for your potential M1/M-1 problem follow. Adjusting them to your exact requirements would need more information about your searches and data (and would maybe be another question):
You can rewrite user input that contains - so that the search query includes both versions: once with -, enclosed in "", and once without. So if the user enters Apple M-1, you would create a search for Apple M1 "M-1" (that works with or without a modified character set, but without a new character set, your minimum word length has to be 1). If the user enters M1, you should detect that and replace it with M1 "M-1" too.
Another alternative would be to store an additional column with clean, hyphenless words, add that column to the full-text index and then match (name, clean_name) against ("M1" ....
And you can of course combine like and match, e.g. if you detect a product number in your input, you can use something like where match(...) against(...) or product_id like 'M%1%', or where match(...) against(...) or product_id = 'M-1' or product_id = 'M1' or even where match(...) against(...) or name like '%M%1%', but the latter would probably be a lot slower and contain a lot of noise. And it might not score correctly, but at least it will be in the resultset.
But as I said, that would depend on your data and your requirements.
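The first idea (sending both the hyphenated and the hyphenless form of a term) could be sketched in application code roughly like this. The function name expand_search_terms and the "letters followed by digits" heuristic for detecting M1 are assumptions made for this sketch; adjust them to your real product numbers.

```python
import re

# Assumed heuristic: a run of letters directly followed by digits (e.g. M1)
# is treated as a product number that may also be stored with a hyphen.
MODEL_RE = re.compile(r"([A-Za-z]+)(\d+)")

def expand_search_terms(user_input):
    """Build a search string that covers both M1 and "M-1"."""
    parts = []
    for term in user_input.split():
        model = MODEL_RE.fullmatch(term)
        if "-" in term.strip("-"):
            # Hyphenated term: hyphenless form plus the quoted original.
            parts.append(term.replace("-", ""))
            parts.append(f'"{term}"')
        elif model:
            # Hyphenless product number: add the quoted hyphenated form.
            parts.append(term)
            parts.append(f'"{model.group(1)}-{model.group(2)}"')
        else:
            parts.append(term)
    return " ".join(parts)

print(expand_search_terms("Apple M-1"))  # Apple M1 "M-1"
print(expand_search_terms("Apple M1"))   # Apple M1 "M-1"
```

The result should be passed to AGAINST as a bound parameter. The sketch does not escape quotes or boolean operators that the user may have typed, so real code would need to handle those as well.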

Related

How to do fulltext search in multiple columns in MySQL, quickly?

I know this question has been asked several times, but let me explain.
I have a table with 450k records of users (id, first name, last name, address, phone number, etc ..).
I want to search users by their first name and/or their last name.
I used these queries :
SELECT * FROM correspondants WHERE nom LIKE 'Renault%' AND prénom LIKE 'r%';
and
SELECT * FROM correspondants WHERE CONCAT(nom, CHAR(32), prénom) LIKE 'Renault r%';
Both work well, but they take too long (1.5 s). This is my problem.
To fix it, I tried MATCH and AGAINST with a full-text index on both columns 'nom' and 'prénom':
SELECT * FROM correspondants WHERE MATCH(nom, prénom) AGAINST('Renault r');
It's very quick (0.000 s) but the result is bad: I don't get the rows I should.
For example, with LIKE, the results are:
88623 RENAULT Rémy
91736 RENAULT Robin
202269 RENAULT Régine
(3 results).
And with MATCH/AGAINST :
327380 RENAULT Luc
1559 RENAULT Marina
17280 RENAULT Anne
(...)
88623 RENAULT Rémy
91736 RENAULT Robin
202269 RENAULT Régine
(...)
436696 SEZNEC-RENAULT Helene
(...)
(115 results !)
What is the best way to do a quick and efficient text search on both columns with an "AND" search? (And what about indexes?)
Fulltext search doesn't do pattern-matching as LIKE string comparisons do. Fulltext search only searches for full words, not fragments like r%.
Also there's a minimum size of word, controlled by the ft_min_word_len configuration variable. To avoid making the fulltext index too large, it doesn't index words smaller than that variable. And therefore short words are ignored when you search, so r is ignored.
There's also no option in fulltext indexing to search for words in a specific position, like at the beginning of a string. So your search for renault may match in the middle of the string.
To solve these issues, you could do the following:
SELECT * FROM correspondants WHERE MATCH(nom, prénom) AGAINST('Renault')
AND CONCAT(nom, CHAR(32), prénom) LIKE 'Renault r%';
This would use the fulltext index to find a small subset of your 450,000 rows that have the word renault somewhere in the string. Then the second term in the search would be done without help from an index, but only against the subset of rows that match the first term.
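To see why the two conditions together give the expected three rows, here is a small in-memory Python simulation using the sample rows from the question. It only mimics the logic (whole-word match first, prefix filter second); it says nothing about how MySQL executes the query, and the helper names are made up.

```python
import re

# Sample rows taken from the question: (id, nom, prénom).
ROWS = [
    (327380, "RENAULT", "Luc"),
    (1559, "RENAULT", "Marina"),
    (17280, "RENAULT", "Anne"),
    (88623, "RENAULT", "Rémy"),
    (91736, "RENAULT", "Robin"),
    (202269, "RENAULT", "Régine"),
    (436696, "SEZNEC-RENAULT", "Helene"),
]

def word_match(row, word):
    """Stage 1: whole-word match anywhere in nom or prénom (like MATCH)."""
    _id, nom, prenom = row
    words = re.findall(r"\w+", f"{nom} {prenom}".lower())
    return word.lower() in words

def prefix_match(row, prefix):
    """Stage 2: case-insensitive LIKE 'prefix%' on 'nom prénom'."""
    _id, nom, prenom = row
    return f"{nom} {prenom}".lower().startswith(prefix.lower())

stage1 = [r for r in ROWS if word_match(r, "Renault")]
stage2 = [r for r in stage1 if prefix_match(r, "Renault r")]
print(len(stage1))             # 7, including SEZNEC-RENAULT
print([r[0] for r in stage2])  # [88623, 91736, 202269]
```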
That particular query is best done this way:
INDEX(nom, prénom)
WHERE nom = 'Renault' AND prénom LIKE 'R%'
I recommend that you add that index and add code to your application to handle different requests in different ways.
Do not hide an indexed column inside a function call, such as CONCAT(nom, ...). The query will not be able to use the index; instead it will check every row, performing the CONCAT for every row and then doing the LIKE. Very slow.
Except for cases of initials (as above), you should mostly avoid very short names. However, here is another case where you can make it work with extra code:
WHERE nom = 'Lu'
(with the same index). Note that using any flavor of MATCH is likely to be much less efficient.
So, if you are given a full last name, use WHERE nom = '...'. If you are given a prefix, then it might work to use WHERE nom LIKE 'Prefix%'. And so on.
FULLTEXT is best used for cases where you have full words scattered in longer text, which is not your case since you have nom and prénom split out.
Perhaps you should not use MATCH for anything in this schema.
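The "handle different requests in different ways" advice could look roughly like this in application code. This is a minimal sketch: the function names are hypothetical, and the %s placeholders assume a driver that uses the "format" parameter style, as the common Python MySQL drivers do.

```python
def escape_like(value):
    """Escape LIKE wildcards so user input is matched literally."""
    return value.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")

def build_name_query(nom, prenom_prefix=None, nom_is_prefix=False):
    """Return (sql, params) for a lookup that can use INDEX(nom, prénom)."""
    conditions, params = [], []
    if nom_is_prefix:
        # Only a prefix of the last name is known: a range scan on the index.
        conditions.append("nom LIKE %s")
        params.append(escape_like(nom) + "%")
    else:
        # Full last name known: an equality lookup on the index.
        conditions.append("nom = %s")
        params.append(nom)
    if prenom_prefix:
        conditions.append("prénom LIKE %s")
        params.append(escape_like(prenom_prefix) + "%")
    sql = "SELECT * FROM correspondants WHERE " + " AND ".join(conditions)
    return sql, tuple(params)

sql, params = build_name_query("Renault", prenom_prefix="r")
print(sql)     # SELECT * FROM correspondants WHERE nom = %s AND prénom LIKE %s
print(params)  # ('Renault', 'r%')
```

Neither column is wrapped in a function, so the composite index stays usable in every branch.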

Full text search in MySQL - multiple search terms, partial search terms, with apostrophe, without apostrophe

I am having an issue with a full text search that allows the user enter:-
multiple search terms
multiple partial terms
search terms with apostrophes
search terms without apostrophes
I have the search working for all but one instance - when the user keys in a partial search term or full search term without the apostrophe.
Within the database there is data saved with Bishop’s Stafford (with apostrophe) and Bishops Stafford (without apostrophe) and I need to fetch all records when the name is searched for.
For place name Bishop’s Stafford, hotel, suite
If the user keys in Bishop’s S, hotel, suite or indeed any version of Bishop’s Stafford containing the apostrophe – the search will work
USE `waf-bit`;
SELECT townDetails FROM town_overview
WHERE MATCH(townDetails) against('+Bishop\'s S* +hotel +suite' IN BOOLEAN MODE)
OR MATCH(townDetails) against('+Bishops S* +hotel +suite' IN BOOLEAN MODE)
ORDER BY townDetails ASC LIMIT 50 OFFSET 0
If however the user keys in Bishops S hotel, suite – or indeed any version of Bishops Stafford without the apostrophe – the search will not work, it will not find the version of Bishop’s Stafford with the apostrophe.
USE `waf-bit`;
SELECT townDetails FROM town_overview
WHERE MATCH(townDetails) against('+Bishops S* +hotel +suite' IN BOOLEAN MODE)
OR MATCH(townDetails) against('+Bishops S* +hotel +suite' IN BOOLEAN MODE)
ORDER BY townDetails ASC LIMIT 50 OFFSET 0
I do not want to use LIKE as the dataset is too large, I can set up a cross reference table that would indicate which place names contain apostrophes but if I can fix this in the search itself that is the preferred option.
Any help would be greatly appreciated.
My suggestion is to combine these two queries with UNION. UNION also removes duplicates automatically (use UNION ALL if you want to keep them).

Full text search order by closest match

SELECT user_id, user_name.fullname, live, likes,
MATCH (fullname, email, live) AGAINST (:search_I IN BOOLEAN MODE) AS relevance
FROM profile LEFT JOIN user_name ON user_id=user_id
WHERE MATCH (fullname, email, live) AGAINST (:search_II IN BOOLEAN MODE)
ORDER BY relevance DESC
bindValue(':search_I', $search...);
bindValue(':search_II', $search...);//PDO can't use same one twice
I have a query that uses full-text search, and I need to order the results with the closest match on top.
However, this query is not working; it does not order anything.
As a test, I searched for 123#hotmail.com.
There are 2 rows in my db, abc#hotmail.com and 123#hotmail.com.
It returns both rows but does not put the closest match (123#hotmail.com) on top.
Does anyone know where the problem is?
By default, MySQL full-text search ignores short words: the minimum word length is 4 for MyISAM (ft_min_word_len) and 3 for InnoDB (innodb_ft_min_token_size).
So, with the MyISAM default, your example of '123#hotmail.com' only matches on 'hotmail', and the two rows score the same.
You can change the default (and rebuild the index). But I'd suggest that you test with 'abcd#hotmail.com' instead.
EDIT:
The definition of a word is buried a bit in the documentation:
The MySQL FULLTEXT implementation regards any sequence of true word
characters (letters, digits, and underscores) as a word. That sequence
may also contain apostrophes (“'”), but not more than one in a row.
This means that aaa'bbb is regarded as one word, but aaa''bbb is
regarded as two words. Apostrophes at the beginning or the end of a
word are stripped by the FULLTEXT parser; 'aaa'bbb' would be parsed as
aaa'bbb.
Because of the where clause, you can see that there is a match to both email addresses. That match would have to be on 'hotmail'. The 'com' and email name get chopped off because of the default minimum word length.
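The word definition quoted above can be approximated with a regular expression. The following Python sketch is only an approximation for illustration (ASCII letters only, and a minimum length of 4, which is the MyISAM default for ft_min_word_len); the name fulltext_words is made up.

```python
import re

# Runs of letters, digits and underscores, optionally joined by single
# apostrophes. Leading or trailing apostrophes never become part of a word.
WORD_RE = re.compile(r"[A-Za-z0-9_]+(?:'[A-Za-z0-9_]+)*")

def fulltext_words(text, min_len=4):
    """Split text into words and drop those shorter than min_len."""
    return [w.lower() for w in WORD_RE.findall(text) if len(w) >= min_len]

# The apostrophe rules from the documentation quote:
print(fulltext_words("aaa'bbb", min_len=1))    # ["aaa'bbb"]
print(fulltext_words("aaa''bbb", min_len=1))   # ['aaa', 'bbb']
print(fulltext_words("'aaa'bbb'", min_len=1))  # ["aaa'bbb"]

# Both addresses from the question reduce to the same single word:
print(fulltext_words("123#hotmail.com"))   # ['hotmail']
print(fulltext_words("abc#hotmail.com"))   # ['hotmail']
print(fulltext_words("abcd#hotmail.com"))  # ['abcd', 'hotmail']
```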

Interesting MATCH AGAINST issue: not searching some words

I have a table like this:
|====brand=====|======title================|
|....Apple.....|...iPhone 5 32 GB..........|
|....Sony......|...Bluetooth Headset.......|
I am using full-text search, basically like this:
SELECT
*,
MATCH(brand,title) AGAINST ('some words') as score
FROM
table
When I search for headset, MySQL correctly gives a score of 1.812121.
But when I search for iphone, MySQL gives a score of 0.
P.s: ft_min_word_len = 2
From http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
A natural language search interprets the search string as a phrase in natural human language (a phrase in free text). There are no special operators. The stopword list applies. In addition, words that are present in 50% or more of the rows are considered common and do not match. Full-text searches are natural language searches if no modifier is given.
So the bit about common words sounds likely.

MySQL - FULLTEXT in BOOLEAN mode + Relevance using views field

I have the following table:
CREATE TABLE IF NOT EXISTS `search`
(
`id` BIGINT(16) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`string` TEXT NOT NULL,
`views` BIGINT(16) NOT NULL,
FULLTEXT(string)
) ENGINE=MyISAM;
It has a total of 5,395,939 entries. To perform a search on a term (like 'a'), I use the query:
SELECT * FROM `search` WHERE MATCH(string) AGAINST('+a*' IN BOOLEAN MODE) ORDER BY `views` DESC LIMIT 10
But it's really slow =(. The query above took 15.4423 seconds to perform. Obviously, it's fast without sorting by views, which takes less than 0.002s.
I'm using ft_min_word_len=1 and ft_stopword_file=
Is there any way to use the views as the relevance in the fulltext search, without making it too slow? I want the search term "a b" match "big apple", for example, but not "ibg apple" (just need the search prefixes to match).
Thanks
Since no one answered my question, I'm posting my solution (not the one I would expect to see if I was googling, since it isn't so easy to apply as a simple database-design would be, but it's still a solution to this problem).
I couldn't really solve it with any engine or function used by MySQL. Sorry =/.
So, I decided to develop my own software to do it (in C++, but you can apply it in any other language).
If what you are looking for is a method to search for word prefixes in small strings (the average length of my strings is 15), you can use the following algorithm:
1. Create a trie. Each word of each string is put on the trie.
Each leaf has a list of the ids that match that word.
2. Use a map/dictionary (or an array) to store the information
for each id (map[id] = information).
Searching for a string:
Note: The string will be in the format "word1 word2 word3...". If it contains symbols, like @, #, $, you might treat them as " " (spaces).
Example: "Rafael Perrella"
1. Search for the prefix "Rafael" in the trie. Put all the ids you
get in a set (a Binary-Search Tree that ignores repeated values).
Let's call this set "mainSet".
2. Search for the prefix "Perrella" in the trie. For each result,
put them in a second set (secSet) if and only if they are already
in the mainSet. Then, clear mainSet and do mainSet = secSet.
3. If there are still words left to search, repeat the second step
for all those words.
After these steps, you will have a set with all the results. Make a vector of (views, id) pairs and sort it in descending order. Then take as many results as you want (I limit it to 30).
Note: you can sort the words first to remove those with the same prefix (for example, in "Jan Ja Jan Ra" you only need "Jan Ra"). I will not explain about it since the algorithm is pretty obvious.
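My implementation is in C++, but the basic algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not my production code: the class and function names and the sample records are made up, and ids are stored on the node where a word ends, so a prefix search collects the ids of the whole subtree.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.ids = set()  # ids of the strings containing the word ending here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, record_id):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.add(record_id)

    def ids_with_prefix(self, prefix):
        """Ids of all strings that contain a word starting with prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return set()
        found, stack = set(), [node]
        while stack:  # walk the whole subtree below the prefix
            current = stack.pop()
            found |= current.ids
            stack.extend(current.children.values())
        return found

def build_index(records):
    """records: {id: (string, views)} -> (trie, info map)."""
    trie, info = Trie(), {}
    for record_id, (text, views) in records.items():
        info[record_id] = (text, views)
        for word in text.lower().split():
            trie.insert(word, record_id)
    return trie, info

def search(trie, info, query, limit=30):
    words = query.lower().split()
    if not words:
        return []
    main_set = trie.ids_with_prefix(words[0])
    for word in words[1:]:
        # Keep only ids already in main_set (the secSet step).
        main_set &= trie.ids_with_prefix(word)
    ranked = sorted(((info[i][1], i) for i in main_set), reverse=True)
    return [record_id for _views, record_id in ranked[:limit]]

# Hypothetical sample data: id -> (string, views).
records = {1: ("big apple", 50), 2: ("ibg apple", 70),
           3: ("a big boat", 10), 4: ("apple branch", 90)}
trie, info = build_index(records)
print(search(trie, info, "a b"))  # [4, 1, 3]
```

"ibg apple" (id 2) is left out because none of its words starts with b, which is exactly the behaviour asked for in the question.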
This algorithm may be bad sometimes (for example, if I search for "a b c d e f ... z", I will search the entire trie...). So, I made an improvement.
1. For each "id" in your map, create also a small trie, that will
contain the words of the string (include a trie for each m[id]...
m[id].trie?).
Then, to make a search:
1. Choose the longest word in the search string (it's not guaranteed,
but it is probably the word with the fewest results in the trie...).
2. Apply the step 1 of the old algorithm.
3. Make a vector with the ids in the mainSet.
4. Let's make the final vector. For each id in the vector you've created
in step 3, search in the trie of this id (m[id].trie?) for all words
in the search string. If it includes all words, it's a valid id and
you might include it in the final vector; else, just ignore this id.
5. Repeat step 4 until there are no more ids to verify. After that, just
sort the final vector by (views, id).
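The improved search can be sketched in Python as well. Again this is only an illustration with made-up names and sample data: every record carries its own small trie, and only the candidates found for the longest search word are verified against the remaining words.

```python
class Trie:
    """Minimal trie: each node has child nodes and a set of ids."""
    def __init__(self):
        self.children, self.ids = {}, set()

    def insert(self, word, record_id=None):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        if record_id is not None:
            node.ids.add(record_id)

    def find(self, prefix):
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def has_prefix(self, prefix):
        return self.find(prefix) is not None

    def ids_with_prefix(self, prefix):
        start = self.find(prefix)
        found, stack = set(), ([] if start is None else [start])
        while stack:
            node = stack.pop()
            found |= node.ids
            stack.extend(node.children.values())
        return found

def build_index(rows):
    """rows: {id: (string, views)} -> (global trie, m[id] with its own trie)."""
    global_trie, m = Trie(), {}
    for record_id, (text, views) in rows.items():
        small = Trie()
        for word in text.lower().split():
            global_trie.insert(word, record_id)
            small.insert(word)
        m[record_id] = {"text": text, "views": views, "trie": small}
    return global_trie, m

def search(global_trie, m, query, limit=30):
    words = query.lower().split()
    if not words:
        return []
    longest = max(words, key=len)  # probably the most selective word
    candidates = global_trie.ids_with_prefix(longest)
    final = [
        (m[i]["views"], i)
        for i in candidates
        if all(m[i]["trie"].has_prefix(w) for w in words)
    ]
    final.sort(reverse=True)
    return [record_id for _views, record_id in final[:limit]]

# Hypothetical sample data: id -> (string, views).
rows = {1: ("big apple", 50), 2: ("ibg apple", 70),
        3: ("a big boat", 10), 4: ("apple branch", 90)}
global_trie, m = build_index(rows)
print(search(global_trie, m, "app bi"))  # [1]
print(search(global_trie, m, "a b"))     # [4, 1, 3]
```

Only one walk through the big trie is needed per query; every other word is checked against the small per-record tries of the candidates.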
Now I use the database just as a way to easily store and load my data. All queries against this table are sent directly to this software. When I add or remove a record, I send the change both to the DB and to the software, so the two always stay in sync. It takes about 30s to load all the data, but then the queries are fast (0.03s for the slowest ones, 0.001s on average; this is on my own notebook, I have not tried it on dedicated hosting, where it might be much faster).