MySQL - FULLTEXT in BOOLEAN mode + Relevance using views field

I have the following table:
CREATE TABLE IF NOT EXISTS `search`
(
`id` BIGINT(16) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`string` TEXT NOT NULL,
`views` BIGINT(16) NOT NULL,
FULLTEXT(string)
) ENGINE=MyISAM;
It has a total of 5,395,939 entries. To perform a search on a term (like 'a'), I use the query:
SELECT * FROM `search` WHERE MATCH(string) AGAINST('+a*' IN BOOLEAN MODE) ORDER BY `views` DESC LIMIT 10
But it's really slow =(. The query above took 15.4423 seconds. Without sorting by views it is fast, taking less than 0.002s.
I'm using ft_min_word_len=1 and ft_stopword_file=
Is there any way to use the views as the relevance in the fulltext search, without making it too slow? I want the search term "a b" to match "big apple", for example, but not "ibg apple" (only prefixes of words need to match).
Thanks

Since no one answered my question, I'm posting my solution (not the one I would expect to see if I was googling, since it isn't so easy to apply as a simple database-design would be, but it's still a solution to this problem).
I couldn't really solve it with any engine or function used by MySQL. Sorry =/.
So, I decided to develop my own software to do it (in C++, but you can apply it in any other language).
If what you are looking for is a method to search for prefixes of words in small strings (the average length of my strings is 15), you can use the following algorithm:
1. Create a trie. Each word of each string is put in the trie.
The node where a word ends holds a list of the ids that contain that word.
2. Use a map/dictionary (or an array) to store the information
for each id (map[id] = information).
Searching for a string:
Note: The string will be in the format "word1 word2 word3...". If it contains symbols such as #, $, you might treat them as " " (spaces).
Example: "Rafael Perrella"
1. Search for the prefix "Rafael" in the trie. Put all the ids you
get in a set (a Binary-Search Tree that ignores repeated values).
Let's call this set "mainSet".
2. Search for the prefix "Perrella" in the trie. For each result,
put them in a second set (secSet) if and only if they are already
in the mainSet. Then, clear mainSet and do mainSet = secSet.
3. If there are still words left to search, repeat the second step
for each of those words.
After these steps, you will have a set with all the results. Build a vector of (views, id) pairs and sort it in descending order, then take as many results as you want. I've limited it to 30 results.
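The steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions, not the original C++ program: the class and method names (`PrefixIndex`, `add`, `ids_with_prefix`, `search`) are made up here, words are lower-cased and split on whitespace, and Python sets stand in for the binary search tree sets.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.ids = set()     # ids of strings with a word ending at this node


class PrefixIndex:
    """Hypothetical in-memory index: a word trie plus a map id -> (string, views)."""

    def __init__(self):
        self.root = TrieNode()
        self.info = {}

    def add(self, rec_id, text, views):
        self.info[rec_id] = (text, views)
        for word in text.lower().split():
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.ids.add(rec_id)

    def ids_with_prefix(self, prefix):
        # Walk down to the node for the prefix, then collect its whole subtree.
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return set()
        found, stack = set(), [node]
        while stack:
            cur = stack.pop()
            found |= cur.ids
            stack.extend(cur.children.values())
        return found

    def search(self, query, limit=30):
        words = query.split()
        if not words:
            return []
        main_set = self.ids_with_prefix(words[0])
        for word in words[1:]:
            main_set &= self.ids_with_prefix(word)   # keep ids present in both sets
            if not main_set:
                break
        ranked = sorted(((self.info[i][1], i) for i in main_set), reverse=True)
        return [rec_id for _, rec_id in ranked[:limit]]
```

With "big apple" and "ibg apple" indexed, `search("a b")` returns only the id of "big apple", because "ibg" does not start with "b".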
Note: you can sort the words first to remove those that are a prefix of another word in the search (for example, in "Jan Ja Jan Ra" you only need "Jan Ra"). I won't explain it since the algorithm is pretty obvious.
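For completeness, here is one possible way to do that clean-up in Python. The function name `drop_redundant_prefixes` and the lower-casing are assumptions of this sketch. It relies on the fact that, after sorting, a word that is a prefix of another word is immediately followed by a word that starts with it.

```python
def drop_redundant_prefixes(query):
    """Remove duplicate words and words that are a prefix of another word."""
    words = sorted(set(query.lower().split()))
    kept = []
    for i, word in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        # If the next word starts with this one, this prefix is implied by it.
        if nxt is None or not nxt.startswith(word):
            kept.append(word)
    return kept
```

For "Jan Ja Jan Ra" this keeps "jan" and "ra": every string that has a word starting with "jan" also has a word starting with "ja".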
This algorithm may be bad sometimes (for example, if I search for "a b c d e f ... z", I will search the entire trie...). So, I made an improvement.
1. For each "id" in your map, also create a small trie that contains
the words of its string (that is, store a trie with each entry,
e.g. m[id].trie).
Then, to make a search:
1. Choose the longest word in the search string (it's not guaranteed,
but it is probably the word with the fewest results in the trie...).
2. Apply the step 1 of the old algorithm.
3. Make a vector with the ids in the mainSet.
4. Let's make the final vector. For each id in the vector you've created
in step 3, search in the trie of this id (m[id].trie?) for all words
in the search string. If it includes all words, it's a valid id and
you might include it in the final vector; else, just ignore this id.
5. Repeat step 4 until there are no more ids to verify. After that, just
sort the final vector by <views, id>.
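A rough Python sketch of this improved search is shown below. It is not the original implementation: the function name and parameters are hypothetical, the global trie lookup is passed in as a callable, and the per-id trie is replaced by a plain scan over the record's few words, which is a simpler stand-in for strings this short.

```python
def search_longest_first(query, ids_with_prefix, words_by_id, views_by_id, limit=30):
    """Fetch candidates for the longest word only, then verify every word per id."""
    prefixes = query.lower().split()
    if not prefixes:
        return []
    longest = max(prefixes, key=len)          # likely the most selective word
    final = []
    for rec_id in ids_with_prefix(longest):   # steps 2 and 3: candidate ids
        own_words = words_by_id[rec_id]
        # Step 4: each search word must be a prefix of some word of this record.
        if all(any(w.startswith(p) for w in own_words) for p in prefixes):
            final.append((views_by_id[rec_id], rec_id))
    final.sort(reverse=True)                  # step 5: sort by (views, id), descending
    return [rec_id for _, rec_id in final[:limit]]


# Hypothetical demo data; naive_ids_with_prefix stands in for the global trie.
strings = {1: "big apple", 2: "ibg apple", 3: "a bee"}
views_by_id = {1: 50, 2: 90, 3: 70}
words_by_id = {i: s.lower().split() for i, s in strings.items()}


def naive_ids_with_prefix(prefix):
    return {i for i, ws in words_by_id.items() if any(w.startswith(prefix) for w in ws)}


print(search_longest_first("a b", naive_ids_with_prefix, words_by_id, views_by_id))  # [3, 1]
```

Only one trie lookup is done per query; the remaining words are checked against the candidates, so a query like "a b c ... z" no longer walks the whole trie once per word.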
Now I use the database only as a way to easily store and load my data. All queries on this table go directly to this software. When I add or remove a record, I send the change to both the DB and the software, so both stay up to date. It takes about 30s to load all the data, but then the queries are fast (0.03s for the slowest ones, 0.001s on average; measured on my own notebook, not on dedicated hosting, where it might be much faster).

Related

Search in multiple column with at least 2 words in keyword

I have a table which stores some data. This is my table structure:
Course     Location
Wolden     New York
Sertigo    Seattle
Monad      Chicago
Donner     Texas
I want to search that table with a keyword such as Sertigo Seattle and get row number two as the result.
I have this query, but it doesn't work.
SELECT * FROM courses_data a WHERE CONCAT_WS(' ', a.Courses, a.Location) LIKE '%Sertigo Seattle%'
Does anyone know how to write a query that does what I need?
If you want to search against the course and location then use:
SELECT *
FROM courses_data
WHERE Course = 'Sertigo' AND Location = 'Seattle';
Efficient searching is usually implemented by preparing the search string before running the actual search:
You split the search string "Sertigo Seattle" into two words: "Sertigo" and "Seattle". You trim those words (remove enclosing white space characters). You might also want to normalize the words, perhaps converting them to lower case to implement a case-insensitive search.
Then you run a search for the discrete words:
SELECT *
FROM courses_data
WHERE
(Course = 'Sertigo' AND Location = 'Seattle')
OR
(Course = 'Seattle' AND Location = 'Sertigo');
Of course that query is created using a prepared statement and parameter binding, using the extracted and trimmed words as dynamic parameters.
This is much more efficient than a wildcard-based search with the LIKE operator, because the database engine can make use of the indexes you (hopefully) created for that table. You can check that with the EXPLAIN feature MySQL offers.
It also makes sense to measure performance: run the different search approaches in a loop, say 1000 times, and record the time each one takes. That gives you a clear and meaningful comparison. Monitoring CPU and memory usage during such a test is of interest too.
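The split, trim and bind approach described above might look like this in Python. To keep the example self-contained it uses an in-memory SQLite database from the standard library instead of MySQL; the helper names `make_demo_db` and `find_course`, the index name and the `COLLATE NOCASE` columns are assumptions of this sketch. It also assumes the search text consists of exactly two single-word values, so a location like "New York" would need extra handling.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway copy of the example table (SQLite stands in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE courses_data ("
        " Course TEXT COLLATE NOCASE,"      # case-insensitive comparisons
        " Location TEXT COLLATE NOCASE)"
    )
    conn.executemany(
        "INSERT INTO courses_data VALUES (?, ?)",
        [("Wolden", "New York"), ("Sertigo", "Seattle"),
         ("Monad", "Chicago"), ("Donner", "Texas")],
    )
    conn.execute("CREATE INDEX idx_course_location ON courses_data (Course, Location)")
    return conn


def find_course(conn, search_text):
    # split() with no arguments also trims surrounding and repeated whitespace.
    words = search_text.split()
    if len(words) != 2:
        raise ValueError("this sketch expects exactly two words")
    first, second = words
    sql = (
        "SELECT Course, Location FROM courses_data "
        "WHERE (Course = ? AND Location = ?) "
        "OR (Course = ? AND Location = ?)"
    )
    # The words are bound as parameters, never concatenated into the SQL text.
    return conn.execute(sql, (first, second, second, first)).fetchall()
```

Because both word orders are checked with plain equality, "Seattle Sertigo" finds the same row as "Sertigo Seattle".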

Speedup levenshtein query

I have a multi-user database management system of about 1 million records, its structure is as below:
Backend (MySQL)
"DNames" table
"Fullname" field
"ID" field
Frontend (MS Access)
"levenshtein" function
"lev" query
"lev_dist" field (calculated levenshtein distance using function above, sorted asc)
"Fullname" field
"ID" field
"srch" textbox in "result" form
My problem is that when I run the query (i.e. use the "srch" textbox) without sorting it's fast enough, but with sorting it takes about 30 to 90 seconds to complete (depending on PC specs). I need the sort to find the top 10 closest matches between the text in the "srch" textbox and the database, so how can I speed up the process? Is there a way to get it down to 5 seconds at most? This process may run from 5 PCs simultaneously. I tried using a MySQL levenshtein function, yet it took 2 minutes!
Would you accept a compromise? Find all words within a 'small' distance in perhaps 1ms (if the data is cached in the buffer_pool)?
Build a table with about 5M-10M rows (based on your 1M 'words'). It would have two columns -- F(word), word.
Lookup F(word) to get a list of possible words.
F(word) is a set of strings -- Take the 'word' and drop each letter, plus the original word. For example:
word --> ord, wrd, wod, wor, word
letter --> etter, ltter, leter, lettr, lette, letter
(Note that 'leter' occurs twice)
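As an illustration, F(word) could be generated like this in Python. The function names `deletion_variants` and `rows_for` are made up for this sketch; using a set removes duplicates such as the second 'leter'.

```python
def deletion_variants(word):
    """Return F(word): the word itself plus every string with one letter dropped."""
    variants = {word}
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])
    return variants


def rows_for(word):
    """Rows (F(word) entry, word) to load into the two-column lookup table."""
    return sorted((variant, word) for variant in deletion_variants(word))
```

A word of n letters yields at most n + 1 rows, which is where the estimate of 5M-10M rows for 1M words comes from.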
Table and query:
CREATE TABLE ricks_leven (
fword VARCHAR(22) NOT NULL, -- F(word)
word VARCHAR(22) NOT NULL, -- the desired word
PRIMARY KEY(fword, word)
) ENGINE=InnoDB;
SELECT word, COUNT(*) AS ct
FROM ricks_leven
WHERE fword IN ('etter', 'ltter', 'leter', 'lettr', 'lette', 'letter')
GROUP BY word
ORDER BY ct DESC
LIMIT 10;
A perfect match will automatically come first in the output. Some other "likely" misspellings may come next. I don't know if the Levenshtein distance orders the results in the same way.
This algorithm covers these common typos, all of which have a small Levenshtein distance:
any one-letter drop,
adjacent letter transposition (distance=2, but important),
a letter being added in any location.
A compromise between speed and completeness:
Use my technique. If you get some results, then quit.
Fall back onto slow Levenshtein search.
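The two-stage compromise can be sketched in Python with an in-memory dictionary standing in for the lookup table. All names here (`build_index`, `closest`, `levenshtein`) are hypothetical, and the fallback simply computes the distance against every word, which is the slow path the first stage tries to avoid.

```python
def deletion_variants(word):
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}


def build_index(words):
    """Map each F(word) entry to the words that produce it (the lookup table)."""
    index = {}
    for word in words:
        for variant in deletion_variants(word):
            index.setdefault(variant, set()).add(word)
    return index


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def closest(term, words, index, limit=10):
    # Stage 1: count shared F() entries, like GROUP BY word ORDER BY ct DESC.
    counts = {}
    for variant in deletion_variants(term):
        for word in index.get(variant, ()):
            counts[word] = counts.get(word, 0) + 1
    if counts:
        return sorted(counts, key=lambda w: (-counts[w], w))[:limit]
    # Stage 2: nothing found, fall back to the slow full Levenshtein scan.
    return sorted(words, key=lambda w: (levenshtein(term, w), w))[:limit]
```

An exact match shares every F() entry with itself, so it gets the highest count and comes first.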

How to do fulltext search in multiple columns in MySQL, quickly?

I know this question has been asked several times.. but , let me explain.
I have a table with 450k records of users (id, first name, last name, address, phone number, etc ..).
I want to search users by their first name and/or their last name.
I used these queries :
SELECT * FROM correspondants WHERE nom LIKE 'Renault%' AND prénom LIKE 'r%';
and
SELECT * FROM correspondants WHERE CONCAT(nom, CHAR(32), prénom) LIKE 'Renault r%';
It works well, but with a too high duration (1,5 s). This is my problem.
To fix it, I tried MATCH and AGAINST with a full text index on both columns 'nom' and 'prénom':
SELECT * FROM correspondants WHERE MATCH(nom, prénom) AGAINST('Renault r');
It's very quick (0,000 s ..) but result is bad, I don't obtain what I should have.
For example, with LIKE function, results are :
88623 RENAULT Rémy
91736 RENAULT Robin
202269 RENAULT Régine
(3 results).
And with MATCH/AGAINST :
327380 RENAULT Luc
1559 RENAULT Marina
17280 RENAULT Anne
(...)
88623 RENAULT Rémy
91736 RENAULT Robin
202269 RENAULT Régine
(...)
436696 SEZNEC-RENAULT Helene
(...)
(115 results !)
What is the best way to do a quick and efficient text search on both columns with a "AND" search ? (and what about indexes)
Fulltext search doesn't do pattern-matching as LIKE string comparisons do. Fulltext search only searches for full words, not fragments like r%.
Also there's a minimum size of word, controlled by the ft_min_word_len configuration variable. To avoid making the fulltext index too large, it doesn't index words smaller than that variable. And therefore short words are ignored when you search, so r is ignored.
There's also no choice in fulltext indexing to search for words in a specific position like at the beginning of a string. So your search for renault may be found in the middle of the string.
To solve these issues, you could do the following:
SELECT * FROM correspondants WHERE MATCH(nom, prénom) AGAINST('Renault')
AND CONCAT(nom, CHAR(32), prénom) LIKE 'Renault r%';
This would use the fulltext index to find a small subset of your 450,000 rows that have the word renault somewhere in the string. Then the second term in the search would be done without help from an index, but only against the subset of rows that match the first term.
That particular query is best done this way:
INDEX(nom, prénom)
WHERE nom = 'Renault' AND prénom LIKE 'R%'
I recommend that you add that index and add code to your application to handle different requests in different ways.
Do not hide an indexed column inside a function call such as CONCAT(nom, ...). The query will not be able to use the index; instead it will check every row, performing the CONCAT for each one and then doing the LIKE. Very slow.
Except for cases of initials (as above), you should mostly avoid very short names. However, here is another case where you can make it work with extra code:
WHERE nom = 'Lu'
(with the same index). Note that using any flavor of MATCH is likely to be much less efficient.
So, if you are given a full last name, use WHERE nom = 'Name'. If you are given a prefix, it might work to use WHERE nom LIKE 'Prefix%', and so on.
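The application-side dispatch suggested here could look something like the following Python sketch. The function name `build_name_query` and its parameters are hypothetical, and the `%s` placeholders assume a MySQL driver that uses that parameter style (as the common Python drivers do). The backslash escaping assumes MySQL's default LIKE escape character.

```python
def build_name_query(nom, prenom_prefix=None, nom_is_prefix=False):
    """Pick an index-friendly WHERE clause depending on what the user supplied."""

    def like_prefix(value):
        # Escape LIKE wildcards so user input is matched literally.
        escaped = value.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
        return escaped + "%"

    clauses, params = [], []
    if nom_is_prefix:
        clauses.append("nom LIKE %s")       # prefix: still a range scan on the index
        params.append(like_prefix(nom))
    else:
        clauses.append("nom = %s")          # full last name: plain equality
        params.append(nom)
    if prenom_prefix:
        clauses.append("prénom LIKE %s")
        params.append(like_prefix(prenom_prefix))
    sql = "SELECT * FROM correspondants WHERE " + " AND ".join(clauses)
    return sql, params
```

The returned pair would be passed to the driver's `execute(sql, params)`, so the values are bound rather than concatenated into the statement.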
FULLTEXT is best used for cases where you have full words scattered in longer text, which is not your case since you have nom and prénom split out.
Perhaps you should not use MATCH for anything in this schema.

Performance of LIKE 'xyz%' v/s LIKE '%xyz'

I was wondering how the LIKE operator actually work.
Does it simply start from first character of the string and try matching pattern, one character moving to the right? Or does it look at the placement of the %, i.e. if it finds the % to be the first character of the pattern, does it start from the right most character and starts matching, moving one character to the left on each successful match?
Not that I have any use case in my mind right now, just curious.
edit: made question narrow
If there is an index on the column, putting constant characters in the front will lead your dbms to use a more efficient searching/seeking algorithm. But even at the simplest form, the dbms has to test characters. If it is able to find it doesn't match early on, it can discard it and move onto the next test.
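To see why a leading constant helps, it can be useful to think of an index as a sorted list. The Python sketch below is only an analogy, not how any particular DBMS is implemented, and the function names are made up: a prefix pattern becomes a range that binary search can locate, while a suffix pattern leaves no choice but to test every value.

```python
from bisect import bisect_left


def prefix_matches(sorted_names, prefix):
    """LIKE 'prefix%' on a sorted list: two binary searches bound the range."""
    if not prefix:
        return list(sorted_names)
    lo = bisect_left(sorted_names, prefix)
    # Smallest string greater than every string that starts with the prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect_left(sorted_names, upper)
    return sorted_names[lo:hi]


def suffix_matches(names, suffix):
    """LIKE '%suffix': the sort order does not help, so every value is tested."""
    return [name for name in names if name.endswith(suffix)]
```

The first function touches O(log n) entries plus the matches; the second always touches all n.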
The LIKE search condition uses wildcards to search for patterns within a string. For example:
WHERE name LIKE 'Mickey%'
will locate all values that begin with 'Mickey' optionally followed by any number of characters. Under a case-insensitive, accent-insensitive collation the match ignores case and accents, and you can use multiple %, for example
WHERE name LIKE '%mouse%'
will return all values with 'mouse' (or 'Mouse' or 'mousé') in it.
The % is inclusive, meaning that
WHERE name like '%A%'
will return all values that start with 'A', contain 'A' or end with 'A'.
You can use _ (underscore) for any character on a single position:
WHERE name LIKE '_at%'
will give you all values with 'a' as the second letter and 't' as the third. The first letter can be anything. For example: 'Batman'
In T-SQL, if you use [] you can find values in a range.
WHERE name LIKE '[c-f]%'
will find any value beginning with a letter between c and f, inclusive, meaning it will return any value that starts with c, d, e or f. This [] is T-SQL only. Use [^ ] to find values not in a range.
Finding all values that contain a number:
WHERE name LIKE '%[0-9]%'
returns everything that has a number in it. Example: 'Godfather2'
If you are looking for all values with the 3rd position to be a '-' (dash) use two underscores:
WHERE name LIKE '__-%'
It will return for example: 'Lo-Res'
To find values that end in 'xyz', use:
WHERE name LIKE '%xyz'
returns anything that ends with 'xyz'
To find a % sign in a name, use brackets:
WHERE name LIKE '%[%]%'
will return for example: 'Top%Movies'
Searching for [ use brackets around it:
WHERE name LIKE '%[[]%'
gives results as: 'New York [NY]'
The database collation's sort order determines both case sensitivity and the sort order for the range of characters. You can optionally use COLLATE to specify the collation sort order used by the LIKE operator.
Usually the main performance bottleneck is IO. The efficiency of the LIKE operator matters only if your whole table fits in memory; otherwise IO will take most of the time.
AFAIK Oracle can use indexes for prefix matching (like 'abc%'), but these indexes cannot be used for more complex expressions.
Anyway, if you only have this kind of query, you should consider using a simple index on the related column. (Probably this is true for other RDBMSs as well.)
Otherwise the LIKE operator is generally slow, but most RDBMSs have some kind of full text searching solution. I think the main reason for the slowness is that LIKE is too general. Full text indexes usually have lots of different options which tell the database what you really want to search for, and with this additional information the DB can do its task more efficiently.
As a rule of thumb, if you want to search in a text field and performance may be an issue, consider your RDBMS's full text searching solution. If the real goal is not text searching but a kind of "design side effect" (for example xml/json/statuses stored in a field as text), then you should probably consider a more efficient way of storing the data (if there is any...).

mysql fulltext MATCH,AGAINST returning 0 results

I am trying to follow: http://dev.mysql.com/doc/refman/4.1/en/fulltext-natural-language.html
in an attempt to improve search queries, both in speed and the ability to order by score.
However when using this SQL ("skitt" is used as a search term just so I can try match Skittles).
SELECT
id,name,description,price,image,
MATCH (name,description)
AGAINST ('skitt')
AS score
FROM
products
WHERE
MATCH (name,description)
AGAINST ('skitt')
it returns 0 results. I am trying to find out why. I think I might have set my indexes up wrong, but I'm not sure; this is the first time I've strayed away from LIKE!
Here is my table structure and data:
Thank you!
By default certain words are excluded from the search. These are called stopwords. "a" is an example of a stopword. You could test your query by using a word that is not a stopword, or you can disable stopwords:
How can I write full search index query which will not consider any stopwords?
If you want to also match prefixes use the truncation operator in boolean mode:
*
The asterisk serves as the truncation (or wildcard) operator. Unlike the other operators, it should be appended to the word to be affected. Words match if they begin with the word preceding the * operator.
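If the search text comes from user input, the boolean-mode expression can be built in the application. The Python helper below is a hypothetical sketch: the name `to_boolean_prefix_query` is made up, and it assumes that every word should be required (`+`) and treated as a prefix (`*`). Any other characters, including boolean operators typed by the user, are dropped.

```python
import re


def to_boolean_prefix_query(text):
    """Turn free text into a '+word*' expression for AGAINST(... IN BOOLEAN MODE)."""
    words = re.findall(r"\w+", text)   # keeps letters, digits and underscores only
    return " ".join("+" + word + "*" for word in words)
```

For example, "skitt" becomes "+skitt*", which would then be bound as the parameter of AGAINST(... IN BOOLEAN MODE). Stopwords and the minimum word length still apply to what the index contains.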