I have read in several places that ngram indexing can improve word searches.
This old post says that it could be adapted for MySQL, but it does not say how: levenshtein alternative
Can anyone give an example of how to use this technique in MySQL?
Can this technique be used to improve the performance of the levenshtein function in MySQL?
I need to find approximate text matches (as levenshtein does).
In my tests I used the levenshtein() and levenshtein_ratio() functions from:
http://www.artfulsoftware.com/infotree/qrytip.php?id=552
SELECT *, levenshtein_ratio('stacoverflou',words_column) AS ratio
FROM my_table
ORDER BY ratio DESC
This improves performance (assuming the first letter is not misspelled):
SELECT *, levenshtein_ratio('stacoverflou',words_column) AS ratio
FROM my_table
WHERE words_column LIKE 's%'
ORDER BY ratio DESC
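To make the cost of the query above concrete, here is a minimal Python sketch of the same idea: compute an edit distance against every candidate, optionally after a first-letter filter like the `LIKE 's%'` clause. The `similarity` formula is an assumption for illustration (1 minus distance over the longer length); the stored `levenshtein_ratio()` function linked above may be defined differently.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    # Assumed ratio: 1 - distance / longer length (not necessarily the SQL function's formula).
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def rank(query, words, same_first_letter=True):
    # The first-letter filter mirrors WHERE words_column LIKE 's%':
    # it skips the expensive distance computation for most rows.
    candidates = [w for w in words if not same_first_letter or w[:1] == query[:1]]
    return sorted(candidates, key=lambda w: similarity(query, w), reverse=True)
```

Every surviving candidate still costs one full distance computation, which is why the filter helps and why an index-backed prefilter such as ngrams is attractive.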
Also I found this php library for building ngrams:
https://gist.github.com/Xeoncross/5366393
But I have no idea how to use these ngrams in mysql
I finally worked out an algorithm myself:
Generate ngrams algorithm:
1. I built a words table with 3 columns: ngrams (FULLTEXT), word (UNIQUE), lang.
2. I used the Bigrams() function to build the ngrams for each word.
3. I added character padding to each ngram to get past the full-text index minimum word length. For example, 'abcd' should be 'ab bc cd', but with padding it looks like 'abxx bcxx cdxx'.
Search algorithm:
4. I take the words the client typed as correct and use them to search the real table with a MySQL full-text query using AGAINST('+word_1 +word_2 +word_n' IN BOOLEAN MODE).
5. If the score (ranking) > 0: mission accomplished, the ngrams are not needed (show the results to the client).
6. If the score (ranking) = 0 (maybe misspelled words), use the ngrams words table to retrieve the correct words.
Retrieve correct words from ngrams algorithm:
7. For each word, generate its ngrams and perform a full-text query using AGAINST('abxx bcxx cdxx' IN BOOLEAN MODE) against the words table (the one with the ngrams column) to retrieve the correct word. Note that the ngrams have no (+) prefix here.
8. Rebuild the search as in step 4.
9. If score > 0: mission accomplished -> show results -> END
10. If the score is still 0, run another query, this time without the (+) prefix on the words and IN NATURAL LANGUAGE MODE -> show results -> END
Step 2 code:
// original from : https://gist.github.com/Xeoncross/5366393
// modified for working also with unicode characters
function Bigrams($word){
    $ngrams = array();
    $len = mb_strlen($word);
    for($i=0;$i+1<$len;$i++){
        $ngram = mb_substr($word, $i, 2);
        // pad each bigram to 4 characters so it passes the full-text minimum word length
        while(mb_strlen($ngram) < 4){
            $ngram .= "x";
        }
        $ngrams[$i]=$ngram;
    }
    return implode(" ",$ngrams);
}
Step 4 code:
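SELECT my_column,
( MATCH(my_full_text_column)
AGAINST('+word_1 +word_2 +word_n' IN BOOLEAN MODE)
) AS score
FROM my_table -- table name assumed; the original snippet omitted the FROM clause
ORDER BY score DESC
LIMIT 10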
Step 7 code:
$word = "stacoverflou"; // Intentionally misspelled
$actual_word_ngrams = Bigrams($word);
// that returns
// stxx taxx acxx coxx ovxx vexx erxx rfxx flxx loxx ouxx
SELECT word,
( MATCH( ngrams )
AGAINST('$actual_word_ngrams' IN BOOLEAN MODE)
) AS score
FROM words -- the ngrams words table from step 1 (table name assumed)
ORDER BY score DESC
LIMIT 1
That returns stackoverflow, which can then be combined with the other words (if there are several) for a much more accurate search, as in step 4.
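Why the unprefixed ngram query finds the right word can be sketched outside the database. The Python below is only an approximation: it counts shared padded bigrams, whereas MySQL's boolean-mode relevance score uses its own weighting. The function names and the sample dictionary are made up for illustration.

```python
def bigrams(word, pad="xx"):
    # Same output as the PHP Bigrams() function: overlapping 2-character
    # slices, each padded to 4 characters.
    return [word[i:i + 2] + pad for i in range(len(word) - 1)]


def best_match(misspelled, dictionary):
    # Score each dictionary word by how many distinct bigrams it shares with
    # the query. This stands in for a MATCH ... AGAINST without '+' prefixes,
    # where every matching term adds to the score.
    query = set(bigrams(misspelled))
    scored = [(len(query & set(bigrams(w))), w) for w in dictionary]
    score, word = max(scored, default=(0, None))
    return word if score > 0 else None


print(" ".join(bigrams("abcd")))
print(best_match("stacoverflou", ["stack", "overflow", "stackoverflow", "stockholm"]))
```

With this sample list, "stackoverflow" shares nine bigrams with the misspelled input, more than any other entry, so it wins even though two letters are wrong.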
END
Background:
DB: MySQL 8.0, InnoDB engine;
Table size: About 2M rows of 2G data;
FULL_TEXT index column: sentence (TEXT data type)
https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html
Old Query using SQL LIKE:
SELECT * FROM books WHERE sentence LIKE '%This is a sample search input string%' and author_id = 5 and publisher_id = 23;
New Query using MySQL FULL_TEXT search:
SELECT * FROM books WHERE MATCH (sentence) AGAINST ('This is a sample search input string') and author_id = 5 and publisher_id = 23 LIMIT 1;
Problems:
I expected a large speed boost from switching from LIKE to FULL_TEXT (MATCH ... AGAINST). But based on my testing, this isn't the case:
For an input string with <10 words, full_text search is faster than LIKE;
For an input string with ~25 words, the full_text search can take 3+ seconds to return, which is similar to LIKE.
The longer the string, the slower the full_text search gets; it can take more than 15s.
Profiling the query:
https://dev.mysql.com/doc/refman/8.0/en/show-profile.html
By looking at the profiling result, 90% of the time is spent on "FULLTEXT initialization"
Optimizations I've tried that haven't improved the speed:
6.1 Rewrite query trying to use other indexes together with full-text index:
select * from books as b1 join books b2 on b1.author_id = b2.author_id and b1.publisher_id = b2.publisher_id WHERE b2.author_id = 5 and b2.publisher_id = 23 and MATCH (b1.sentence) AGAINST ('Sample input string') LIMIT 1;
6.2 Only select the document_id instead of the whole record:
SELECT id FROM books WHERE MATCH (sentence) AGAINST ('This is a sample search input string') and author_id = 5 and publisher_id = 23 LIMIT 1;
Questions:
Are there any other ways that I could try to improve the search speed? According to this doc: https://dev.mysql.com/doc/refman/8.0/en/fulltext-fine-tuning.html
I could try adding more stop words, running OPTIMIZE TABLE, moving the current table to a new one, or upgrading the hardware. But I'm not sure whether those methods are worth trying at all.
Toss the short words and any stop words
Prefix the rest with '+' and add "IN BOOLEAN MODE"
AGAINST ('+sample +search +input +string' IN BOOLEAN MODE)
If you need the words exactly like your LIKE, then do
WHERE MATCH(..) AGAINST(...)
AND ... LIKE '%This is a sample search input string%'
AND ...
The FULLTEXT search will be fast, but may find documents that have the words in some other order. The LIKE will be fast because it checks only a few rows (the ones that FT found).
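The two preparation steps from this answer (toss short words and stop words, prefix the rest with '+') can be done in application code before the query is sent. A minimal Python sketch follows; the stop-word set is illustrative only, and `min_len=3` assumes the InnoDB default for `innodb_ft_min_token_size`, so adjust both to match the server.

```python
import re

# Illustrative stop words only; the server's real list may differ.
STOP_WORDS = {"this", "is", "a", "the", "and", "of", "to", "in"}


def boolean_query(text, min_len=3, stop_words=STOP_WORDS):
    """Build the string to pass to AGAINST(... IN BOOLEAN MODE)."""
    kept = []
    for word in re.findall(r"\w+", text.lower()):
        # Drop words the full-text index would not contain anyway,
        # and skip duplicates so each term is required only once.
        if len(word) >= min_len and word not in stop_words and word not in kept:
            kept.append(word)
    return " ".join("+" + word for word in kept)


print(boolean_query("This is a sample search input string"))
```

For the sample input this produces `+sample +search +input +string`, the same string as in the answer. Pass the result as a bound parameter rather than concatenating it into the SQL text.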
I am having difficulty sorting results based on a field in MySQL. For example, if I search for the word "Terms", I should get the result that is exactly 'Terms' first, then 'Terms and', then 'Terms and conditions', and so on.
Can anyone help with how to fetch the search results in this order efficiently using a MySQL query?
SELECT * FROM your_table WHERE your_column LIKE "Terms%" ORDER BY your_column;
Based on the storage engine and mysql version you probably can use the full text search capabilities of MySQL. For example:
SELECT *, MATCH (your_column) AGAINST ('Terms' IN BOOLEAN MODE) AS relevance
FROM your_table
WHERE MATCH (your_column) AGAINST ('Terms' IN BOOLEAN MODE)
ORDER BY relevance DESC
You can find more info here: http://dev.mysql.com/doc/refman/5.5/en/fulltext-boolean.html
Or, if you don't want FTS, another possible solution is to order strictly by the difference in string length:
SELECT * FROM your_table WHERE your_column LIKE "Terms%" ORDER BY ABS(LENGTH(your_column) - LENGTH('Terms'));
You are looking for fulltext search. Below is a very simple example:
SELECT id, name, MATCH (name) AGAINST ('string' > 'string*' IN BOOLEAN MODE) AS score
FROM tablename WHERE MATCH (name) AGAINST ('string' > 'string*' IN BOOLEAN MODE)
ORDER BY score DESC
The advantage of this is that you can control the value of words. This is very basic, you can 'up' some matches or words (or 'down' them)
In my example an exact match ('string') would get a higher score than the string with something attached ('string*'). The following line is even one step broader:
'string' > 'string*' > '*string*'
This documentation about fulltext search explains a lot. It's a long read, but worth it and complete.
Don't use fulltext index if you search for prefix string!
Using LIKE "Term%" the optimizer will be able to use a potential index on your_column:
SELECT * FROM your_table
WHERE your_column LIKE "Terms%"
ORDER BY CHAR_LENGTH(your_column),your_column
Note the ORDER BY clause: it sorts by string length first, and uses alphabetical order only to sort strings of equal length.
And please use CHAR_LENGTH and not LENGTH: the former counts the number of characters, whereas the latter counts the number of bytes. With a variable-length encoding such as utf8, this makes a difference.
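The difference between counting characters and counting bytes is easy to see outside the database. In this Python sketch, `len(s)` plays the role of CHAR_LENGTH and `len(s.encode("utf-8"))` the role of LENGTH on a utf8 column; the sample titles are made up, and Python's tie-break compares code points rather than using a MySQL collation.

```python
titles = ["Terms and conditions", "Terms and", "Termsééé", "Terms"]


def char_length(s):
    return len(s)                  # characters, like CHAR_LENGTH()


def byte_length(s):
    return len(s.encode("utf-8"))  # bytes, like LENGTH() on a utf8 column


# Sort by length first, then by the string itself for equal lengths.
by_chars = sorted(titles, key=lambda s: (char_length(s), s))
by_bytes = sorted(titles, key=lambda s: (byte_length(s), s))

print(by_chars)
print(by_bytes)
```

"Termsééé" has 8 characters but 11 bytes, so it sorts before "Terms and" (9 characters) when ordering by characters and after it when ordering by bytes.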
I have a very big table with strings.
Field "words":
- dog
- champion
- cat
- this is a cat
- pool
- champ
- boots
...
In my example, if a select query is looking for the given string "championship", it won't find it because this string is not in the table.
In that case, I want the query to return "champion" from the table, i.e. the longest string in the table that begins the given word "championship".
The possible match (if found) is the longest string in the table among championship, championshi, championsh, champions, ..., cham, cha, ch, or c.
Question: I want to return the longest string in the table that starts a given string.
I need high speed. Is there a way to create index and query in order to have fast execution of queries?
Here's one query that will return the specified result:
SELECT t.mycol
FROM mytable t
WHERE 'championship' LIKE CONCAT(t.mycol,'%')
ORDER
BY LENGTH(t.mycol) DESC
LIMIT 1
This query can't do an index range scan; it will have to be a full scan, but it may be able to use an index to satisfy the query.
If you can restrict the search to a finite number of leading letters that need to match to be considered a "hit", you could include another predicate. For example, to match at least 4 characters:
SELECT t.mycol
FROM mytable t
WHERE 'championship' LIKE CONCAT(t.mycol,'%')
AND t.mycol LIKE 'cham%'
ORDER
BY LENGTH(t.mycol) DESC
LIMIT 1
--or--
AND t.mycol >= 'cham'
AND t.mycol < 'chan'
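The reversed LIKE in the first query can be tried on the sample data from the question. This sketch uses Python's built-in sqlite3 module as a stand-in for MySQL, so string concatenation is written with `||` instead of CONCAT(); the table and column names follow the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (mycol TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO mytable VALUES (?)",
    [(w,) for w in ["dog", "champion", "cat", "this is a cat", "pool", "champ", "boots"]],
)


def longest_prefix(conn, given):
    # The given string is on the left of LIKE, the column on the right:
    # a row matches when its value is a prefix of the given string.
    row = conn.execute(
        "SELECT mycol FROM mytable "
        "WHERE ? LIKE mycol || '%' "
        "ORDER BY LENGTH(mycol) DESC LIMIT 1",
        (given,),
    ).fetchone()
    return row[0] if row else None


print(longest_prefix(conn, "championship"))
```

Both "champ" and "champion" are prefixes of "championship", and the ORDER BY picks the longer one. Note that a stored value containing `%` or `_` would act as a wildcard in this pattern.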
You are a little vague with 'the longest string in the table that begins the given word "championship".' Would "championing" count as a match?
Perhaps the following will help. If you have an index on words, then the following will return the last word before the given word. It should maximize the initial sequence of matches:
select words
from t
where words <= 'championship'
order by words desc
limit 1;
This isn't exactly what you are asking for, but it might work in practice.
EDIT:
If you are looking for an exact match, then the following should use an index on words effectively and return what you want:
select words
from t
where words in ('championship', 'championshi', 'championsh', 'champions', 'champion',
'champio', 'champi', 'champ', 'cham', 'cha', 'ch', 'c')
order by words desc
limit 1;
It is a bit brute force, but it should have the property of using the index to speed up the query.
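The IN list does not have to be written by hand. A small Python sketch that generates every prefix of the search string and a matching parameterized query; the `%s` placeholder style and the default table and column names are assumptions (they vary by driver and schema).

```python
def prefixes(word):
    # Longest first: 'championship', 'championshi', ..., 'ch', 'c'
    return [word[:n] for n in range(len(word), 0, -1)]


def prefix_query(word, table="t", column="words"):
    params = prefixes(word)
    # One placeholder per prefix; '%s' is the style used by several MySQL
    # drivers, but check the one in use.
    placeholders = ", ".join(["%s"] * len(params))
    sql = (
        f"SELECT {column} FROM {table} "
        f"WHERE {column} IN ({placeholders}) "
        f"ORDER BY {column} DESC LIMIT 1"
    )
    return sql, params


sql, params = prefix_query("championship")
print(sql)
print(params)
```

Because every candidate is a prefix of the same string, sorting them in descending order puts the longest one first, which is why ORDER BY ... DESC LIMIT 1 returns the longest match.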
Have a look at this article:
http://blog.fatalmind.com/2010/09/29/finding-the-best-match-with-a-top-n-query/
It explains the solution from this SO question:
How to use index efficienty in mysql query
The solution pattern looks like this:
select words
from (
select words
from yourtable
where words <= 'championship'
order by words desc
limit 1
) tmp
where 'championship' like concat (words, '%')
First off, there seems to be no way to get an exact match using a full-text search. This seems to be a highly discussed issue with the full-text search method, and there are lots of different solutions for achieving the desired result; however, most seem very inefficient. Since I'm forced to use full-text search due to the volume of my database, I recently had to implement one of these solutions to get more accurate results.
I could not use the ranking results from the full-text search because of how it works. For instance, if you searched for a movie called Toy Story and there was also a movie called The Story Behind Toy Story, that one would come up instead of the exact match because it contains the word Story twice as well as Toy.
I do track my own ranking, which I call "Popularity": each time a user accesses a record, the number goes up. I use this data point to weight my results to help determine what the user might be looking for.
I also have the issue that I sometimes need to fall back to a LIKE search and not return an exact match, i.e. searching Goonies should return The Goonies (most popular result).
So here is an example of my current stored procedure for achieving this:
DECLARE @Title varchar(255)
SET @Title = '"Toy Story"'
--need to remove quotes from parameter for LIKE search
DECLARE @Title2 varchar(255)
SET @Title2 = REPLACE(@Title, '"', '')
--get top 100 results using full-text search and sort them by popularity
SELECT TOP(100) id, title, popularity As Weight into #TempTable FROM movies WHERE CONTAINS(title, @Title) ORDER BY [Weight] DESC
--check if exact match can be found
IF EXISTS(select * from #TempTable where Title = @Title2)
    --return exact match
    SELECT TOP(1) * from #TempTable where Title = @Title2
ELSE
    --no exact match found, try using like with wildcards
    SELECT TOP(1) * from #TempTable where Title like '%' + @Title2 + '%'
DROP TABLE #TempTable
This stored procedure is executed about 5,000 times a minute, and crazily enough it's not bringing my server to its knees. But I really want to know whether there is a more efficient approach to this. Thanks.
You should use full text search CONTAINSTABLE to find the top 100 (possibly 200) candidate results and then order the results you found using your own criteria.
It sounds like you'd like to ORDER BY
exact match of the phrase (=)
the fully matched phrase (LIKE)
higher value for the Popularity column
the Rank from the CONTAINSTABLE
But you can toy around with the exact order you prefer.
In SQL that looks something like:
DECLARE @title varchar(255)
SET @title = '"Toy Story"'
--need to remove quotes from parameter for LIKE search
DECLARE @title2 varchar(255)
SET @title2 = REPLACE(@title, '"', '')
SELECT
    m.ID,
    m.title,
    m.Popularity,
    k.Rank
FROM Movies m
INNER JOIN CONTAINSTABLE(Movies, title, @title, 100) as [k]
    ON m.ID = k.[Key]
ORDER BY
    CASE WHEN m.title = @title2 THEN 0 ELSE 1 END,
    --wildcards are needed here, otherwise LIKE behaves like the = test above
    CASE WHEN m.title LIKE '%' + @title2 + '%' THEN 0 ELSE 1 END,
    m.popularity desc,
    --higher CONTAINSTABLE rank means more relevant
    k.rank desc
See SQLFiddle
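The four-level ordering above can be prototyped on a handful of rows before committing to the SQL. This Python sketch expresses the same priorities as a sort key; the row dictionaries and their values are invented for illustration, and it assumes that a higher full-text rank means a more relevant row.

```python
def rank_candidates(rows, phrase):
    wanted = phrase.lower()

    def key(row):
        title = row["title"].lower()
        return (
            0 if title == wanted else 1,   # 1. exact match first
            0 if wanted in title else 1,   # 2. then titles containing the phrase
            -row["popularity"],            # 3. then most popular
            -row["rank"],                  # 4. then highest full-text rank (assumed: higher is better)
        )

    return sorted(rows, key=key)


candidates = [
    {"title": "The Story Behind Toy Story", "popularity": 50, "rank": 90},
    {"title": "Toy Story", "popularity": 10, "rank": 80},
    {"title": "Toy Story 2", "popularity": 30, "rank": 85},
]
for row in rank_candidates(candidates, "Toy Story"):
    print(row["title"])
```

Here "Toy Story" comes first despite having the lowest popularity and rank, because the exact-match test outranks every other criterion.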
This will give you the movies that contain the exact phrase "Toy Story", ordered by their popularity.
SELECT
m.[ID],
m.[Popularity],
k.[Rank]
FROM [dbo].[Movies] m
INNER JOIN CONTAINSTABLE([dbo].[Movies], [Title], N'"Toy Story"') as [k]
ON m.[ID] = k.[Key]
ORDER BY m.[Popularity]
Note the above would also give you "The Goonies Return" if you searched "The Goonies".
I've got the feeling you don't really like the fuzzy part of the full-text search, but you do like the performance part.
Maybe this is a path: if you insist on getting the EXACT match before a weighted match, you could try hashing the value. For example, 'Toy Story' -> bring to lowercase -> toy story -> hash into 4de2gs5sa (with whatever hash you like) and perform a search on the hash.
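A minimal sketch of the normalize-then-hash idea in Python. SHA-1 and the whitespace collapsing are choices made here for illustration; the answer leaves both the hash and the normalization open. The resulting value would be stored in an indexed column and compared with `=`.

```python
import hashlib


def title_key(title):
    # Normalize: lowercase and collapse runs of whitespace, so that
    # 'Toy Story' and '  toy   STORY ' map to the same key.
    normalized = " ".join(title.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


print(title_key("Toy Story") == title_key("  toy   STORY "))
print(title_key("Toy Story") == title_key("The Story Behind Toy Story"))
```

An equality lookup on the hash column answers the exact-match question without touching the full-text index; anything that misses falls through to the weighted search.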
In Oracle I've used UTL_MATCH for similar purposes. (http://docs.oracle.com/cd/E11882_01/appdev.112/e25788/u_match.htm)
Even though using the Jaro Winkler algorithm, for instance, might take awhile if you compare the title column from table 1 and table 2, you can improve performance if you partially join the 2 tables. I have in some cases compared person names on table 1 with table 2 using Jaro Winkler, but limited results not just above a certain Jaro Winkler threshold, but also to names between the 2 tables where the first letter is the same. For instance I would compare Albert with Aden, Alfonzo, and Alberto, using Jaro Winkler, but not Albert and Frank (limiting the number of situations where the algorithm needs to be used).
Jaro Winkler may actually be suitable for movie titles as well. Although you are using SQL server (can't use the utl_match package) it looks like there is a free library called "SimMetrics" which has the Jaro Winkler algorithm among other string comparison metrics. You can find detail on that and instructions here: http://anastasiosyal.com/POST/2009/01/11/18.ASPX?#simmetrics
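The first-letter restriction described above is a form of blocking: bucket one list by initial letter, then run the expensive comparison only within the matching bucket. In this Python sketch, `difflib.SequenceMatcher` from the standard library is swapped in for Jaro-Winkler (it is a different similarity measure), and the 0.7 threshold is arbitrary.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocked_matches(names_a, names_b, threshold=0.7):
    # Bucket the second list by first letter so each name from the first
    # list is compared only against names sharing its initial.
    buckets = defaultdict(list)
    for name in names_b:
        if name:
            buckets[name[0].lower()].append(name)

    pairs = []
    for a in names_a:
        for b in buckets.get(a[:1].lower(), []):
            # Stand-in similarity; a Jaro-Winkler function would go here.
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs


for a, b, score in blocked_matches(["Albert"], ["Aden", "Alfonzo", "Alberto", "Frank"]):
    print(a, b, round(score, 2))
```

"Frank" is never scored against "Albert" at all, which is where the savings come from; the trade-off is that a typo in the first letter makes a pair invisible.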
I have a problem, or rather a problem understanding the behaviour of a quoted, hyphenated search string.
In my database there is a table with a column 'company'.
One of the entries in that column is: A-Z Electro
The following examples are simplified a lot (though the real query is much more complex) - but the effect is still the same.
When I do the following search, I don't get the row with the above mentioned company:
SELECT i.*
FROM my_table i
WHERE MATCH (i.company) AGAINST ('+\"A-Z\" +Electro*' IN BOOLEAN MODE)
GROUP BY i.uid ORDER BY i.company ASC LIMIT 0, 40;
If I do the following search, I get the row with the above-mentioned company (notice that I only changed the + to a - before "A-Z"):
SELECT i.*
FROM my_table i
WHERE MATCH (i.company) AGAINST ('-\"A-Z\" +Electro*' IN BOOLEAN MODE)
GROUP BY i.uid ORDER BY i.company ASC LIMIT 0, 40;
I also get the row, if I remove the operator completely:
SELECT i.*
FROM my_table i
WHERE MATCH (i.company) AGAINST ('\"A-Z\" +Electro*' IN BOOLEAN MODE)
GROUP BY i.uid ORDER BY i.company ASC LIMIT 0, 40;
Can anyone explain to me this behaviour? Because I would expect, when searching with a +, I should get the result too...
I just checked the table index with myisam_ftdump.
Two-character words are indexed properly, as there are entries like
14f2e8 0.7908264 ab
3a164 0.8613265 dv
There is also an entry:
de340 0.6801047 az
I suppose this should be the entry for A-Z - so the search should find this entry, shouldn't it?
The default value of ft_min_word_len is 4. See this link for information on that. In short, your system isn't indexing words of less than 4 characters.
Why is this important? Well:
A-Z is less than 4 characters long
...therefore it's not in the index
...but your first query +"A-Z" states it must be in the index in order for the match to succeed
The other two (match if it's not in the index, match if either this or that is in the index) work because it's not in the index.
The hyphen is a red herring - the reason is because "A-Z" is three characters long and your FT index ignores it.
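The reasoning in this answer can be followed with a toy model. The Python below is not MySQL's parser or scoring: it simply drops terms shorter than the minimum word length from a pretend index and then applies required (+), excluded (-) and optional terms, treating "A-Z" as a single three-character term the way the answer does.

```python
def indexed_terms(tokens, min_word_len=4):
    # Terms shorter than the minimum length never make it into the index.
    return {t.lower() for t in tokens if len(t) >= min_word_len}


def matches(index, required=(), excluded=(), optional=()):
    if any(t.lower() not in index for t in required):
        return False   # a '+' term that is missing rules the row out
    if any(t.lower() in index for t in excluded):
        return False   # a '-' term that is present rules the row out
    if required:
        return True    # optional terms only affect ranking here
    return any(t.lower() in index for t in optional)


index = indexed_terms(["A-Z", "Electro"])   # only 'electro' survives
print(matches(index, required=["A-Z", "Electro"]))                # +"A-Z" +Electro*
print(matches(index, required=["Electro"], excluded=["A-Z"]))     # -"A-Z" +Electro*
print(matches(index, required=["Electro"], optional=["A-Z"]))     # "A-Z" +Electro*
```

Under this model only the first query fails, matching what the question observed: a required term that was never indexed can never be satisfied.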