I know this question has been asked several times, but let me explain.
I have a table with 450k user records (id, first name, last name, address, phone number, etc.).
I want to search users by their first name and/or their last name.
I used these queries :
SELECT * FROM correspondants WHERE nom LIKE 'Renault%' AND prénom LIKE 'r%';
and
SELECT * FROM correspondants WHERE CONCAT(nom, CHAR(32), prénom) LIKE 'Renault r%';
It works well, but it takes too long (1.5 s). This is my problem.
To fix it, I tried MATCH ... AGAINST with a full-text index on both columns, 'nom' and 'prénom':
SELECT * FROM correspondants WHERE MATCH(nom, prénom) AGAINST('Renault r');
It's very quick (0.000 s) but the results are wrong: I don't get what I expect.
For example, with LIKE, the results are:
88623 RENAULT Rémy
91736 RENAULT Robin
202269 RENAULT Régine
(3 results).
And with MATCH/AGAINST :
327380 RENAULT Luc
1559 RENAULT Marina
17280 RENAULT Anne
(...)
88623 RENAULT Rémy
91736 RENAULT Robin
202269 RENAULT Régine
(...)
436696 SEZNEC-RENAULT Helene
(...)
(115 results !)
What is the best way to do a quick and efficient text search on both columns with an "AND" search? (And what about indexes?)
Fulltext search doesn't do pattern-matching as LIKE string comparisons do. Fulltext search only searches for full words, not fragments like r%.
Also there's a minimum size of word, controlled by the ft_min_word_len configuration variable. To avoid making the fulltext index too large, it doesn't index words smaller than that variable. And therefore short words are ignored when you search, so r is ignored.
There's also no choice in fulltext indexing to search for words in a specific position like at the beginning of a string. So your search for renault may be found in the middle of the string.
To solve these issues, you could do the following:
SELECT * FROM correspondants WHERE MATCH(nom, prénom) AGAINST('Renault')
AND CONCAT(nom, CHAR(32), prénom) LIKE 'Renault r%';
This would use the fulltext index to find a small subset of your 450,000 rows that have the word renault somewhere in the string. Then the second term in the search would be done without help from an index, but only against the subset of rows that match the first term.
That particular query is best done this way:
INDEX(nom, prénom)
WHERE nom = 'Renault' AND prénom LIKE 'R%'
I recommend that you add that index and add code to your application to handle different requests in different ways.
Do not hide an indexed column inside a function call such as CONCAT(nom, ...). The query will not be able to use the index; instead it will check every row, performing the CONCAT for each one and then doing the LIKE. Very slow.
Except for cases of initials (as above), you should mostly avoid very short names. However, here is another case where you can make it work with extra code:
WHERE nom = 'Lu'
(with the same index). Note that using any flavor of MATCH is likely to be much less efficient.
So, if you are given a full last name, use WHERE nom =. If you are given a prefix, then it might work to use WHERE nom LIKE 'Prefix%', and so on.
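The advice to handle different requests in different ways can be sketched in application code. This is only a minimal Python sketch under stated assumptions: the function names, the nom_is_prefix flag, and the %s placeholder style (as used by common MySQL drivers for Python) are hypothetical choices for illustration; only the table and column names come from the question.

```python
def escape_like(value):
    """Escape LIKE wildcards so user input is matched literally.

    Backslash is MySQL's default escape character for LIKE.
    """
    return value.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def build_name_query(nom, prenom_prefix=None, nom_is_prefix=False):
    """Return (sql, params) for a search that can use INDEX(nom, prénom).

    The %s placeholders are an assumption about the driver's parameter style.
    """
    clauses, params = [], []
    if nom_is_prefix:
        # Prefix search: a trailing wildcard still allows an index range scan.
        clauses.append("nom LIKE %s")
        params.append(escape_like(nom) + "%")
    else:
        # Full last name: plain equality on the leading index column.
        clauses.append("nom = %s")
        params.append(nom)
    if prenom_prefix:
        clauses.append("prénom LIKE %s")
        params.append(escape_like(prenom_prefix) + "%")
    sql = "SELECT * FROM correspondants WHERE " + " AND ".join(clauses)
    return sql, params
```

For example, build_name_query("Renault", "r") produces the WHERE clause nom = %s AND prénom LIKE %s with the parameters "Renault" and "r%". Neither column is wrapped in a function, so the composite index stays usable.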
FULLTEXT is best used for cases where you have full words scattered in longer text, which is not your case since you have nom and prénom split out.
Perhaps you should not use MATCH for anything in this schema.
I have a table which stores some data. This is my table structure:
Course    Location
Wolden    New York
Sertigo   Seattle
Monad     Chicago
Donner    Texas
I want to search that table with, for example, the keyword Sertigo Seattle, and get row number two as the result.
I have this query, but it doesn't work:
SELECT * FROM courses_data a WHERE CONCAT_WS(' ', a.Courses, a.Location) LIKE '%Sertigo Seattle%'
Does anyone know how to write a query that does this?
If you want to search against the course and location then use:
SELECT *
FROM courses_data
WHERE Course = 'Sertigo' AND Location = 'Seattle';
Efficient searching is usually implemented by preparing the search string before running the actual search:
You split the search string "Sertigo Seattle" into two words: "Sertigo" and "Seattle". You trim those words (remove enclosing white space characters). You might also want to normalize the words, perhaps converting them to all lower case to implement a case-insensitive search.
Then you run a search for the discrete words:
SELECT *
FROM courses_data
WHERE
(Course = 'Sertigo' AND Location = 'Seattle')
OR
(Course = 'Seattle' AND Location = 'Sertigo');
Of course that query is created using a prepared statement and parameter binding, using the extracted and trimmed words as dynamic parameters.
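The preparation steps described above (split, trim, normalize, then bind the words as parameters) might look like the following in application code. This is a hedged Python sketch: the function names and the %s placeholder style are assumptions for illustration, and it deliberately handles only the two-word case from the question.

```python
def prepare_terms(search):
    """Split the search string into trimmed, lower-cased words.

    str.split() with no argument splits on runs of whitespace and drops
    leading and trailing whitespace, so no separate trim step is needed.
    """
    return [word.lower() for word in search.split()]


def build_course_query(search):
    """Return (sql, params) matching the two words in either column order."""
    terms = prepare_terms(search)
    if len(terms) != 2:
        raise ValueError("this sketch only handles exactly two search words")
    first, second = terms
    # %s placeholders are an assumption about the driver's parameter style.
    sql = (
        "SELECT * FROM courses_data "
        "WHERE (Course = %s AND Location = %s) "
        "OR (Course = %s AND Location = %s)"
    )
    return sql, [first, second, second, first]
```

Lower-casing the words only gives a case-insensitive match if the columns use a case-insensitive collation. A multi-word value such as New York would need a different splitting rule than this sketch uses.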
This is much more efficient than a wildcard-based search with the LIKE operator, because the database engine can make use of the indexes you (hopefully) created for that table. You can check that by using the EXPLAIN feature MySQL offers.
Also it does make sense to measure performance: run different search approaches in a loop, say 1000 times, and take the required time. You will get a clear and meaningful example. Also monitoring CPU and memory usage in such a test is of interest.
This is my query
SELECT * FROM myTable WHERE MATCH (name) AGAINST ("Apple M1" IN NATURAL LANGUAGE MODE)
If I search for Apple M1, I get Orange M1 first, and only in third position or lower do I get Apple M-1, which is the value I stored and which I assumed would come first!
My question is: is there a way to fine-tune the MySQL search?
The best way to improve MySQL Natural Language Mode search is to use Boolean Full-Text Searches instead. It will do the same as Natural Language Mode search, but you can use additional modifiers to fine-tune your results, e.g. by
> <
These two operators are used to change a word's contribution to the relevance value that is assigned to a row. The > operator increases the contribution and the < operator decreases it.
There is one minor difference: boolean mode search does not order by relevance automatically, so you have to order the results yourself.
SELECT * FROM myTable
WHERE MATCH (name) AGAINST (">Apple M1" IN BOOLEAN MODE)
ORDER BY MATCH (name) AGAINST (">Apple M1" IN BOOLEAN MODE) desc
And a remark: both versions of full-text search will not find M-1 if you match against M1 (even with a minimum word length setting of 2). It only looks for exact (usually case-insensitive) word matches; it does not look for similar words (unless you use *). It will "just" weigh the combination of (exact) words by some algorithm and, if you use them, the modifiers.
Update Some additional clarification according to the comments:
If you match against Apple M1, it returns rows that contain (case-insensitive) Apple or M1 in any order, so e.g. M1 apple, Apple M4, Apple M-1 and Orange M1. It will not find Apples M4 or Orange M-1, because they are not exactly those words. Likewise, like '%M-1%' wouldn't find Apple M1 either. If you like, you can match against Apple* to find Apple and Apples, but the wildcard is only allowed at the end of the word; *Apple* is not possible, so you have to use like '%Apple%' for that.
These rows are then ordered by the scoring algorithm, that will basically score words that are less common in your texts higher than very common words. And if you add >Apple, it will give Apple a higher value. It will just be a number, you can add them to your select, e.g. select ..., MATCH (name) AGAINST (">Apple M1" IN BOOLEAN MODE) as score to get a feeling for that.
There are some other things to consider:
Only words that have a minimum length are added to the index. That length is given by innodb_ft_min_token_size for InnoDB or ft_min_word_len for MyISAM. So you should set it to e.g. 2 to include M1 (otherwise this word will have no effect on your search; since you found Orange M1 in your example, I assume it is set correctly).
The - character is usually considered a hyphen. So M-1 in your text will be split into two words, M and 1 (which may or may not be indexed depending on your minimum word length setting, so maybe set it to 1). You can change that behaviour by adding - to the character set (see Fine-Tuning MySQL Full-Text Search, the part beginning with Modify a character set file), but then a search for blue and/or green will no longer find blue-green.
The full-text search uses stopwords. These words are not included in your index. The list includes a and i, so even with a minimum word length of 1, you would not find them. You can edit that list.
Some ideas about your potential problem about M1/M-1. To adjust that to your exact requirements, you would have to add more information about your searches and data (and would be maybe another question), but some ideas:
You can rewrite user input that contains - by including both versions in your search query: once with -, but enclosed in "", and once without. So if the user enters Apple M-1, you would create a search for Apple M1 "M-1" (that would work with or without a modified character set, but without a new character set, your minimum word length has to be 1). If the user enters M1, you should detect that and replace it with M1 "M-1" too.
Another alternative would be to save an additional column with clean, hyphenless words and add that column to the full text index and then match (name, clean_name) against ("M1" ....
And you can of course combine like and match, e.g. if you detect a product number in your input, you can use something like where match(...) against(...) or product_id like 'M%1%', or where match(...) against(...) or product_id = 'M-1' or product_id = 'M1' or even where match(...) against(...) or name like '%M%1%', but the latter would probably be a lot slower and contain a lot of noise. And it might not score correctly, but at least it will be in the resultset.
But as I said, that would depend on your data and your requirements.
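The first idea (rewriting the user input so that both the hyphenated and the hyphenless form are searched) could be done in application code before the string is passed to AGAINST. The Python sketch below is only an illustration: the function name and, in particular, the regular expression that decides what counts as a product code without its hyphen (letters followed by digits) are assumptions, and it does not sanitize other boolean-mode operators or quotes in the input.

```python
import re

# Assumed shape of a code typed without its hyphen: letters then digits, e.g. M1.
CODE_WITHOUT_HYPHEN = re.compile(r"^([A-Za-z]+)(\d+)$")


def expand_hyphen_variants(user_input):
    """Add the missing variant (with or without "-") for each code-like word."""
    parts = []
    for word in user_input.split():
        if "-" in word:
            plain = word.replace("-", "")
            if plain:
                parts.append(plain)          # hyphenless form, e.g. M1
            parts.append('"%s"' % word)      # quoted original, e.g. "M-1"
            continue
        parts.append(word)
        match = CODE_WITHOUT_HYPHEN.match(word)
        if match:
            # Also search for the hyphenated spelling as a quoted phrase.
            parts.append('"%s-%s"' % match.groups())
    return " ".join(parts)
```

With this sketch, both Apple M-1 and Apple M1 are rewritten to Apple M1 "M-1", which is the search string suggested above.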
I have the following table:
CREATE TABLE IF NOT EXISTS `search`
(
`id` BIGINT(16) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`string` TEXT NOT NULL,
`views` BIGINT(16) NOT NULL,
FULLTEXT(string)
) ENGINE=MyISAM;
It has a total of 5,395,939 entries. To perform a search on a term (like 'a'), I use the query:
SELECT * FROM `search` WHERE MATCH(string) AGAINST('+a*' IN BOOLEAN MODE) ORDER BY `views` DESC LIMIT 10
But it's really slow =(. The query above took 15.4423 seconds to perform. Obviously, it's fast without sorting by views, which takes less than 0.002s.
I'm using ft_min_word_len=1 and ft_stopword_file=
Is there any way to use the views as the relevance in the fulltext search, without making it too slow? I want the search term "a b" match "big apple", for example, but not "ibg apple" (just need the search prefixes to match).
Thanks
Since no one answered my question, I'm posting my solution (not the one I would expect to see if I was googling, since it isn't so easy to apply as a simple database-design would be, but it's still a solution to this problem).
I couldn't really solve it with any engine or function used by MySQL. Sorry =/.
So, I decided to develop my own software to do it (in C++, but you can apply it in any other language).
If what you are looking for is a method to search for prefixes of words in small strings (the average length of my strings is 15), you can use the following algorithm:
1. Create a trie. Each word of each string is put on the trie.
Each leaf has a list of the ids that match that word.
2. Use a map/dictionary (or an array) to store the information
for each id (map[id] = information).
Searching for a string:
Note: The string will be in the format "word1 word2 word3...". If it has some symbols, like #, #, $, you might consider them as " " (spaces).
Example: "Rafael Perrella"
1. Search for the prefix "Rafael" in the trie. Put all the ids you
get in a set (a Binary-Search Tree that ignores repeated values).
Let's call this set "mainSet".
2. Search for the prefix "Perrella" in the trie. For each result,
put them in a second set (secSet) if and only if they are already
in the mainSet. Then, clear mainSet and do mainSet = secSet.
3. IF there are still words left to search, repeat the second step
for all those words.
After these steps, you will have a set with all the results. Make a vector of (views, id) pairs and sort it in descending order. Then just take the results you want; I've limited it to 30 results.
Note: you can sort the words first to remove those with the same prefix (for example, in "Jan Ja Jan Ra" you only need "Jan Ra"). I will not explain about it since the algorithm is pretty obvious.
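The steps above can be sketched compactly in Python (the original was written in C++, which is not shown here). The class and method names are hypothetical, and the sketch makes two simplifying assumptions that are not in the description: words are lower-cased, and ties in views are broken by ascending id.

```python
class TrieNode:
    __slots__ = ("children", "ids")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.ids = set()     # ids of records containing a word that ends here


class PrefixIndex:
    def __init__(self):
        self.root = TrieNode()
        self.records = {}    # id -> (string, views)

    def add(self, rec_id, text, views):
        """Put each word of the string on the trie and remember the record."""
        self.records[rec_id] = (text, views)
        for word in text.lower().split():
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.ids.add(rec_id)

    def _ids_with_prefix(self, prefix):
        """Collect the ids of every word that starts with the prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return set()
        found, stack = set(), [node]
        while stack:                       # walk the whole subtree
            current = stack.pop()
            found |= current.ids
            stack.extend(current.children.values())
        return found

    def search(self, query, limit=30):
        words = query.lower().split()
        if not words:
            return []
        main_set = self._ids_with_prefix(words[0])
        for word in words[1:]:
            if not main_set:
                break
            main_set &= self._ids_with_prefix(word)   # keep ids present in both
        ranked = sorted(main_set, key=lambda i: (-self.records[i][1], i))
        return [self.records[i][0] for i in ranked[:limit]]
```

With the records "big apple" (10 views), "ibg apple" (50 views) and "a big boat" (30 views), searching for "a b" returns "a big boat" and then "big apple"; "ibg apple" is excluded because none of its words starts with b.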
This algorithm may be bad sometimes (for example, if I search for "a b c d e f ... z", I will search the entire trie...). So, I made an improvement.
1. For each "id" in your map, create also a small trie, that will
contain the words of the string (include a trie for each m[id]...
m[id].trie?).
Then, to make a search:
1. Choose the longest word in the search string (it's not guaranteed,
but it is probably the word with the fewest results in the trie...).
2. Apply the step 1 of the old algorithm.
3. Make a vector with the ids in the mainSet.
4. Let's make the final vector. For each id in the vector you've created
in step 3, search in the trie of this id (m[id].trie?) for all words
in the search string. If it includes all words, it's a valid id and
you might include it in the final vector; else, just ignore this id.
5. Repeat step 4 until there are no more ids to verify. After that, just
sort the final vector for <views, id>.
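A hedged Python sketch of this improved search follows. All names are hypothetical. Note one deliberate substitution: instead of a small trie per id, each candidate is verified with a plain startswith check over its words, which plays the same role for short strings; the global trie is built from nested dictionaries to keep the example short.

```python
IDS = "$ids"  # marker key; cannot collide with single-character child keys


def build_trie(records):
    """records: dict id -> (text, views). Returns the global trie."""
    trie = {}
    for rec_id, (text, _views) in records.items():
        for word in text.lower().split():
            node = trie
            for ch in word:
                node = node.setdefault(ch, {})
            node.setdefault(IDS, set()).add(rec_id)
    return trie


def ids_with_prefix(trie, prefix):
    """Ids of all records containing a word that starts with the prefix."""
    node = trie
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return set()
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key == IDS:
                found |= value
            else:
                stack.append(value)
    return found


def search(records, trie, query, limit=30):
    words = query.lower().split()
    if not words:
        return []
    # Steps 1-3: only the longest word is looked up in the global trie.
    longest = max(words, key=len)
    candidates = ids_with_prefix(trie, longest)
    # Steps 4-5: verify every search word against each candidate's own words.
    final = []
    for rec_id in candidates:
        rec_words = records[rec_id][0].lower().split()
        if all(any(w.startswith(q) for w in rec_words) for q in words):
            final.append(rec_id)
    final.sort(key=lambda i: (-records[i][1], i))   # views descending
    return [records[i][0] for i in final[:limit]]
```

The point of the design is that a query with many short words no longer walks large parts of the trie once per word; only one trie lookup is made, and the remaining words are checked against the candidates only.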
Now, I use the database just as a way to easily store and load my data. All queries against this table are sent directly to this software. When I add or remove a record, I send the change both to the DB and to the software, so both always stay up to date. It takes about 30 s to load all the data, but then the queries are fast (0.03 s for the slowest ones, 0.001 s on average, on my own notebook; I didn't try it on dedicated hosting, where it might be much faster).
My user table has a column "name" which contains information like this:
Joe Lee
Angela White
I want to search for either first name or last name efficiently. First name is easy, I can do
SELECT * FROM user WHERE name LIKE "ABC%"
But for last name, if I do
SELECT * FROM user WHERE name LIKE "%ABC"
That would be extremely slow.
So I am thinking about counting the characters of the input, for example, "ABC" has 3 characters, and if I can search only the last three characters in name column, that would be great. So I want something like
SELECT * FROM user WHERE substring(name, end-3, end) LIKE "ABC%"
Is there anything in MySQL that can do this?
Thanks so much!
PS. I cannot do fulltext because our search engine doesn't support that.
The reason that
WHERE name LIKE '%ith'
is a slow way to look for 'John Smith' by last name is the same reason that
WHERE Right(name, InStr(name, ' ' )) LIKE 'smi%'
or any other expression on the column is slow. It defeats the use of the index for quick lookup and leaves the MySQL server doing a full table scan or full index scan.
If you were using Oracle (that is, if you worked for a formerly wealthy employer) you could use function indexes. As it is you have to add some extra columns or some other helping data to accelerate your search.
Your smartest move is to split your first and last names into separate columns. Several other people have pointed out good reasons for doing that.
If you can't do that you could try creating an extra column which contains the name string reversed, and create an index on that column. That column will have, for example, 'John Smith' stored as 'htimS nhoJ'. Then you can search as follows.
WHERE nameReversed LIKE CONCAT(REVERSE('ith'),'%')
This search will use the index and be decently fast. I've had good success with it.
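To make the shape of the reversed-column approach concrete, here is a small self-contained Python sketch. It uses an in-memory SQLite database purely as a stand-in so that it runs anywhere; it says nothing about MySQL's query plans, and the helper names are hypothetical. SQLite has no built-in REVERSE(), so the reversal is done in Python both when storing and when searching.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (name TEXT, nameReversed TEXT)")
conn.execute("CREATE INDEX idx_name_reversed ON user (nameReversed)")


def add_user(name):
    # Store the reversed copy alongside the name, e.g. 'John Smith' -> 'htimS nhoJ'.
    conn.execute(
        "INSERT INTO user (name, nameReversed) VALUES (?, ?)",
        (name, name[::-1]),
    )


def find_by_suffix(suffix):
    # A suffix of the name is a prefix of the reversed name,
    # so the pattern has its wildcard at the end only.
    pattern = suffix[::-1] + "%"
    rows = conn.execute(
        "SELECT name FROM user WHERE nameReversed LIKE ? ORDER BY name",
        (pattern,),
    )
    return [row[0] for row in rows]


for person in ("John Smith", "Joe Lee", "Angela White"):
    add_user(person)
```

Here find_by_suffix("ith") returns only John Smith. The sketch does not escape % or _ in the suffix, and the reversed column has to be kept in sync whenever a name changes.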
You're close. In MySQL you should be able to use InStr(str, substr), Char_Length(str) and Right(str, len) to do the following:
SELECT * FROM user WHERE Right(name, Char_Length(name) - InStr(name, " ")) LIKE "ABC%"
InStr(name, " ") returns the position of the first space character (you may have to play with the " " syntax). Right(str, len) returns the last len characters, so subtracting that position from the length of the name leaves only the part after the space, i.e. the last name (basically; problems arise when you have multiple names, multiple spaces etc). LIKE "ABC%" then matches a last name starting with ABC.
You cannot use a fixed length of 3, as you suggest, because last names longer or shorter than 3 characters would not be matched properly.
However, as Zane said, it's much better practice to use separate fields.
If it is a MyISAM table, you may use full-text search to do the same.
You can use the REGEXP operator:
SELECT * FROM user WHERE name REGEXP "ABC$"
http://dev.mysql.com/doc/refman/5.1/en/regexp.html
I have a MyISAM table with a FULLTEXT index and am trying to run:
SELECT
lk.id,
lk.address
FROM
lk
WHERE MATCH(lk.address)
AGAINST('235 regent street, london w1b 2et');
I get results, but only rows that contain the word "london" or the word "street". I know that 3-character words aren't indexed (ft_min_word_len), so "235", "w1b" and "2et" are ignored, but what about "regent"?
What is the STANDARD way of doing this, i.e. fuzzy matching an address?
Thanks
The answer is to use MATCH ... AGAINST('...' IN BOOLEAN MODE) and to add + in front of every word.
Or use other operators, as explained in:
http://dev.mysql.com/doc/refman/5.1/en/fulltext-boolean.html
It needs fine-tuning depending on the text you search for and how you obtained it.
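Adding + in front of every word is easy to do in application code. The Python sketch below is one possible way, with hypothetical names. It also drops words shorter than min_len, on the assumption that words too short to be in the index should not be marked as required; the default of 4 mirrors the default ft_min_word_len for MyISAM and should be changed to match your server setting.

```python
import re


def to_boolean_query(text, min_len=4):
    """Turn free text into a boolean-mode string with every word required."""
    # \w+ keeps letters, digits and underscores, so commas and other
    # punctuation never reach the boolean-mode parser.
    words = re.findall(r"\w+", text.lower())
    return " ".join("+" + word for word in words if len(word) >= min_len)
```

For the address in the question, to_boolean_query("235 regent street, london w1b 2et") gives +regent +street +london, which can then be bound as the parameter of AGAINST(... IN BOOLEAN MODE).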