Speed up Levenshtein query - MySQL

I have a multi-user database management system with about 1 million records. Its structure is as follows:
Backend (MySQL)
  "DNames" table
    "Fullname" field
    "ID" field
Frontend (MS Access)
  "levenshtein" function
  "lev" query
    "lev_dist" field (calculated levenshtein distance using function above, sorted asc)
    "Fullname" field
    "ID" field
  "srch" textbox in "result" form
My problem is that when I run the query (i.e. use the "srch" textbox) without sorting, it is fast enough, but with sorting it takes about 30 to 90 seconds to complete (depending on PC specs). I need the sort to find the top 10 closest matches between the text in the "srch" textbox and the database, so how can I speed up the process? Is there a way to get it down to 5 seconds at most? This process may run from 5 PCs simultaneously. I tried using a MySQL levenshtein function, yet it took 2 minutes!

Would you accept a compromise? Find all words within a 'small' distance in perhaps 1ms (if the data is cached in the buffer_pool)?
Build a table with about 5M-10M rows (based on your 1M 'words'). It would have two columns -- F(word), word.
Lookup F(word) to get a list of possible words.
F(word) is a set of strings: the original word plus every string formed by dropping one letter from it. For example:
word --> ord, wrd, wod, wor, word
letter --> etter, ltter, leter, lettr, lette, letter
(Note that 'leter' occurs twice, once for each 't' dropped, but is listed once.)
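As a rough illustration of how F(word) could be generated when populating the table, here is a minimal Python sketch. The function name `deletion_variants` is made up for this example and is not part of the answer.

```python
def deletion_variants(word):
    """Return F(word): the word itself plus every one-letter deletion of it."""
    variants = {word}
    for i in range(len(word)):
        # Drop the character at position i.
        variants.add(word[:i] + word[i + 1:])
    return variants


print(sorted(deletion_variants("word")))
print(sorted(deletion_variants("letter")))
```

Because the result is a set, the duplicate 'leter' collapses to a single entry, which matches the (fword, word) primary key in the table below.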
Table and query:
CREATE TABLE ricks_leven (
fword VARCHAR(22) NOT NULL, -- F(word)
word VARCHAR(22) NOT NULL, -- the desired word
PRIMARY KEY(fword, word)
) ENGINE=InnoDB;
SELECT word, COUNT(*) AS ct
FROM ricks_leven
WHERE fword IN ('etter', 'ltter', 'leter', 'lettr', 'lette', 'letter')
GROUP BY word
ORDER BY ct DESC
LIMIT 10;
A perfect match will automatically come first in the output. Some other "likely" misspellings may come next. I don't know if the Levenshtein distance orders the results in the same way.
This algorithm covers these common typos, all of which have a small Levenshtein distance:
any one-letter drop,
adjacent letter transposition (distance=2, but important),
a letter being added in any location.
A compromise between speed and completeness:
Use my technique. If you get some results, then quit.
Fall back onto slow Levenshtein search.
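The two-stage compromise above can be sketched in Python as follows. This is only an in-memory illustration under stated assumptions: the dictionary built by `build_index` stands in for the ricks_leven table, all function names are hypothetical, and the alphabetical tie-break is an addition of this sketch (the SQL above leaves the order of ties unspecified).

```python
from collections import Counter, defaultdict


def deletion_variants(word):
    """F(word): the word plus every one-letter deletion of it."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}


def build_index(words):
    """In-memory stand-in for the (fword, word) table."""
    index = defaultdict(set)
    for word in words:
        for fword in deletion_variants(word):
            index[fword].add(word)
    return index


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def search(query, index, words, limit=10):
    """Try the fast lookup first; fall back to a full Levenshtein scan."""
    counts = Counter()
    for fword in deletion_variants(query):
        for word in index.get(fword, ()):
            counts[word] += 1
    if counts:
        # Same idea as GROUP BY word ORDER BY ct DESC LIMIT n.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        return [word for word, _ in ranked[:limit]]
    # Slow path: score every word by edit distance.
    return sorted(words, key=lambda w: (levenshtein(query, w), w))[:limit]


words = ["letter", "better", "latter", "ladder", "word"]
index = build_index(words)
print(search("lettre", index, words))  # transposition, found by the fast path
print(search("xyz", index, words, limit=1))  # nothing close, slow path
```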

Related

Search in multiple column with at least 2 words in keyword

I have a table which stores some data. This is my table structure:
Course    Location
Wolden    New York
Sertigo   Seatlle
Monad     Chicago
Donner    Texas
I want to search that table with a keyword such as Sertigo Seattle and get row number two as the result.
I have this query, but it doesn't work.
SELECT * FROM courses_data a WHERE CONCAT_WS(' ', a.Courses, a.Location) LIKE '%Sertigo Seattle%'
Does anyone know how to write a query that does what I need?
If you want to search against the course and location then use:
SELECT *
FROM courses_data
WHERE Course = 'Sertigo' AND Location = 'Seattle';
Efficient searching is usually implemented by preparing the search string before running the actual search:
You split the search string "Sertigo Seattle" into two words: "Sertigo" and "Seattle". You trim those words (remove enclosing white space characters). You might also want to normalize the words, perhaps converting them to all lower case to implement a case-insensitive search.
Then you run a search for the discrete words:
SELECT *
FROM courses_data
WHERE
(Course = 'Sertigo' AND Location = 'Seattle')
OR
(Course = 'Seattle' AND Location = 'Sertigo');
Of course that query is created using a prepared statement and parameter binding, using the extracted and trimmed words as dynamic parameters.
This is much more efficient than a wildcard-based search with the LIKE operator, because the database engine can make use of the indexes you (hopefully) created for that table. You can check that with the EXPLAIN statement MySQL offers.
It also makes sense to measure performance: run the different search approaches in a loop, say 1000 times, and record the time each takes. You will get a clear and meaningful comparison. Monitoring CPU and memory usage during such a test is also of interest.
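The split, trim and bind steps described above might look roughly like this in Python. To keep the sketch runnable without a server it uses an in-memory SQLite database as a stand-in for MySQL; the `COLLATE NOCASE` columns imitate a case-insensitive MySQL collation, and the helper names `split_keywords` and `find_course` are invented for the example. With a MySQL driver the placeholder would typically be `%s` rather than `?`.

```python
import sqlite3


def split_keywords(search_string):
    """Split on whitespace (which also trims) and lower-case each word."""
    return [word.lower() for word in search_string.split()]


def find_course(connection, search_string):
    """Match two keywords against (Course, Location) in either order."""
    words = split_keywords(search_string)
    if len(words) != 2:
        raise ValueError("this sketch expects exactly two keywords")
    first, second = words
    sql = (
        "SELECT Course, Location FROM courses_data "
        "WHERE (Course = ? AND Location = ?) "
        "   OR (Course = ? AND Location = ?)"
    )
    # Values are bound as parameters, never concatenated into the SQL text.
    return connection.execute(sql, (first, second, second, first)).fetchall()


# Sample data for the sketch (assumed, not the asker's real table).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE courses_data ("
    "Course TEXT COLLATE NOCASE, Location TEXT COLLATE NOCASE)"
)
connection.executemany(
    "INSERT INTO courses_data VALUES (?, ?)",
    [("Wolden", "New York"), ("Sertigo", "Seattle"),
     ("Monad", "Chicago"), ("Donner", "Texas")],
)
print(find_course(connection, "  seattle SERTIGO "))
```

Note that a two-word location such as "New York" would need a smarter split than this sketch attempts.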

Find a row in the table for string by the longest common tail (the right part) of the substring

There is a simplified table mytable with columns ('id', 'mycolumn') of int and varchar(255) respectively.
What's the fastest way to find the row whose mycolumn value shares the longest common right part (the longest tail) with a given string? The table is huge.
For example, for the string "aaabbbkr84"
we'd find 2 - "wergerbkr84" by the longest common tail "bkr84":
1 - "gbkyugbk9"
2 - "wergerbkr84"
3 - "gbkyugbk4".
To improve performance, my idea was to add another column holding the reversed strings. However, I'm not sure whether it helps or whether there is a better way. With that column I would run the following queries, one character longer each time, until one returns 0 results, and then step back to the previous one:
SELECT id, mycolumn FROM mytable
where reverted like '4%';
SELECT id, mycolumn FROM mytable
where reverted like '48%';
SELECT id, mycolumn FROM mytable
where reverted like '48r%';
SELECT id, mycolumn FROM mytable
where reverted like '48rk%';
SELECT id, mycolumn FROM mytable
where reverted like '48rkb%'; -- <-- looking for this one
SELECT id, mycolumn FROM mytable
where reverted like '48rkbb%'; -- 0 results.
0 results found, so take a step back to 48rkb, which means 2 - "wergerbkr84" is the answer.
A single query to find the row with the longest tail would be preferable (as you can see, I have a loop of queries above).
However, performance is the #1 priority.
Thanks a lot.
This may be a bit on the exotic side but it sounds like you have an exotic problem.
Have you tried storing the proper tree structure in SQL and doing lookups on it? For example, you could create a trie/prefix tree where there is an edge for each possible character choice among all strings in the database (going in reverse string order).
See the above image for an example of a partially filled-in trie on the three strings in your question (labeled 1, 2, 3). Nodes store references to all strings that are encoded from the edge labels as you traverse from the root. For the search string "48", follow edges labeled '4' and '8' and you will get a reference to the string labeled 2 ("wergerbkr84"). A search proceeds until no edges are labeled with the next character in the search string, at which point the strings with longest matching tail are stored at the current node. Search time is O(n) where n is the length of the search string.
See the following: How to represent a data tree in SQL? Note you will have to consider storage for the edge labels, which is not required in most tree structures.
Also note that there are compressed/space-optimized versions of tries that may be worth investigating depending on your storage requirements.
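To make the reversed-trie lookup concrete, here is a small in-memory Python sketch. It is not the SQL-stored tree suggested above, only an illustration of the traversal; the class name `TailTrie` and its methods are made up for the example, and keeping the ids at every node trades memory for a simple lookup.

```python
class TailTrie:
    """Trie over reversed strings; each node keeps the ids of the strings passing through it."""

    def __init__(self):
        self.root = {"children": {}, "ids": []}

    def insert(self, row_id, text):
        node = self.root
        for char in reversed(text):
            node = node["children"].setdefault(char, {"children": {}, "ids": []})
            node["ids"].append(row_id)

    def longest_tail(self, text):
        """Return (tail_length, ids) for the deepest node reachable from the end of text."""
        node, depth = self.root, 0
        for char in reversed(text):
            child = node["children"].get(char)
            if child is None:
                break  # no edge for the next character: stop at the current node
            node, depth = child, depth + 1
        return depth, list(node["ids"])


trie = TailTrie()
for row_id, text in [(1, "gbkyugbk9"), (2, "wergerbkr84"), (3, "gbkyugbk4")]:
    trie.insert(row_id, text)

print(trie.longest_tail("aaabbbkr84"))
```

The lookup walks at most one edge per character of the search string, which is where the O(n) search time comes from.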

Sphinx match by first letter

I need a simple explanation of why my queries fail to return the results I need.
Sphinx 2.0.8-id64-release (r3831)
Here is what I have in sphinx.conf:
SELECT
trackid,
title,
artistname,
SUBSTRING(REPLACE(TRIM(`artist_name`), 'the ', ''),1,3) AS artistname_init
....
sql_field_string = title
sql_field_string = artistname
sql_field_string = artistname_init
Additional settings:
docinfo = extern
charset_type = utf-8
min_prefix_len = 1
enable_star = 1
expand_keywords= 0
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z, A..Z->a..z, a..z
The source query works and I index my data without problems. However, I am failing to make Sphinx return any sensible results. I am using SphinxQL to query.
Example:
select
artistname, artistname_init from myindex
WHERE MATCH('@artistname_init ^t*')
GROUP BY artistname ORDER BY artistname_init ASC limit 0,10;
brings nothing related to the query.
I've tried everything I could think of, like:
MATCH('@artistname_init ^t*')
MATCH('@artistname_init[1] t')
MATCH('@artistname_init ^t$')
Can anyone please point out where my mistake is, and perhaps give me a query that will work for my case?
My target is to get results that follow this sorting order:
B (Single letter)
B-T (Single letter + non-alphabet sign after)
B as Blue (Single letter + space after)
Baccara (First letter of single word)
Bad Religion (First letter of several words)
The B (not counting "The ")
The B.Y.Z (Single letter + non-alphabet sign after not counting "The ")
The B 2 B (Single letter + space after not counting "The ")
The Boyzz (First letter of single word not counting "The ")
The Blue Boy (First letter of several words not counting "The ")
Or close to it.
There are a lot of moving parts in what you're trying to do, but I can at least answer the title portion of it. Sphinx offers field-level ranking factors to let you customize the WEIGHT() function – it should be much easier to order the matches the way you want, rather than trying to actually filter out entries that matched the query later than the 1st or 2nd word.
Here's an example, which will return all results with a word starting with "b", sorted by how early that word appears:
SELECT id, artistname, WEIGHT()
FROM myindex
WHERE MATCH('(@artistname (b*))')
ORDER BY WEIGHT() DESC
LIMIT 10
OPTION ranker=expr('sum(100 - min_hit_pos)');
If you want to filter out other cases like "Several other words then B", I think I'd suggest doing that in your application. For example, if the fourth result has the keyword in the 3rd word, only return the first 3 results. That, or actually create a new field in Sphinx without the leading "The", and then add a numeric attribute to the index to show that a word was removed (you can use numeric attributes in your ranker expressions).
As for ranking "B-t" more highly than "Bat", I'm not sure if that's possible without somehow changing Sphinx's concept of alphabetical order. You could try diving into the source code? ;)
One last note. For this particular kind of query, MySQL (I say MySQL because it's the common way of sourcing a Sphinx index) may actually work just as well. If you strip the leading "The", a B-tree index (which MySQL uses) is a perfectly good way of searching if you're sure you only want results where the query matches the beginning of the field. Sphinx's inverted indexes are kind of overkill for that sort of thing.
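As a rough picture of the "strip the leading The, then match the beginning of the field" idea, here is a Python sketch that does the same thing in application code. The names `sort_key` and `artists_starting_with` are hypothetical. Unlike the REPLACE() in the question's sphinx.conf, this drops only one leading "the ", and it makes no attempt to reproduce the exact ordering the asker listed.

```python
def sort_key(artist_name):
    """Lower-case the name and drop one leading 'the ' so 'The Boyzz' sorts under 'b'."""
    key = artist_name.strip().lower()
    if key.startswith("the "):
        key = key[4:]
    return key


def artists_starting_with(names, prefix):
    """Keep names whose key starts with prefix, ordered by that key."""
    prefix = prefix.lower()
    matches = [name for name in names if sort_key(name).startswith(prefix)]
    return sorted(matches, key=sort_key)


names = ["Bad Religion", "The Boyzz", "Abba", "Baccara", "The The"]
print(artists_starting_with(names, "b"))
```

In MySQL the equivalent would be a stored, indexed column holding the stripped name, so that a prefix match can use the B-tree index.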

MySQL - FULLTEXT in BOOLEAN mode + Relevance using views field

I have the following table:
CREATE TABLE IF NOT EXISTS `search`
(
`id` BIGINT(16) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`string` TEXT NOT NULL,
`views` BIGINT(16) NOT NULL,
FULLTEXT(string)
) ENGINE=MyISAM;
It has a total of 5,395,939 entries. To perform a search on a term (like 'a'), I use the query:
SELECT * FROM `search` WHERE MATCH(string) AGAINST('+a*' IN BOOLEAN MODE) ORDER BY `views` DESC LIMIT 10
But it's really slow =(. The query above took 15.4423 seconds to perform. Obviously, it's fast without sorting by views, which takes less than 0.002s.
I'm using ft_min_word_len=1 and ft_stopword_file=
Is there any way to use the views as the relevance in the fulltext search, without making it too slow? I want the search term "a b" to match "big apple", for example, but not "ibg apple" (only the search prefixes need to match).
Thanks
Since no one answered my question, I'm posting my solution (not the one I would expect to see if I was googling, since it isn't so easy to apply as a simple database-design would be, but it's still a solution to this problem).
I couldn't really solve it with any engine or function used by MySQL. Sorry =/.
So, I decided to develop my own software to do it (in C++, but you can apply it in any other language).
If what you are looking for is a method to search for word prefixes in short strings (the average length of my strings is 15), you can use the following algorithm:
1. Create a trie. Each word of each string is put on the trie.
Each leaf has a list of the ids that match that word.
2. Use a map/dictionary (or an array) to store the information
for each id (map[id] = information).
Searching for a string:
Note: The string will be in the format "word1 word2 word3...". If it has some symbols, like @, #, $, you might consider them as " " (spaces).
Example: "Rafael Perrella"
1. Search for the prefix "Rafael" in the trie. Put all the ids you
get in a set (a Binary-Search Tree that ignores repeated values).
Let's call this set "mainSet".
2. Search for the prefix "Perrella" in the trie. For each result,
put them in a second set (secSet) if and only if they are already
in the mainSet. Then, clear mainSet and do mainSet = secSet.
3. If there are still words left to search, repeat the second step
for all those words.
After these steps, you will have a set with all the results. Build a vector of (views, id) pairs and sort it in descending order. Then take as many results as you want; I've limited it to 30.
Note: you can sort the words first to remove those with the same prefix (for example, in "Jan Ja Jan Ra" you only need "Jan Ra"). I will not explain about it since the algorithm is pretty obvious.
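The answer's implementation is in C++ and is not shown; the following is only a Python sketch of the steps above, with invented names (`Node`, `PrefixIndex`, `ids_with_prefix`). It intersects plain Python sets instead of using a binary search tree, and the `views` dictionary plays the role of map[id].

```python
class Node:
    def __init__(self):
        self.children = {}
        self.ids = set()  # ids of records containing the word that ends here


class PrefixIndex:
    def __init__(self):
        self.root = Node()
        self.views = {}  # id -> view count

    def add(self, record_id, text, views):
        self.views[record_id] = views
        for word in text.lower().split():
            node = self.root
            for char in word:
                node = node.children.setdefault(char, Node())
            node.ids.add(record_id)

    def ids_with_prefix(self, prefix):
        """Collect the ids of every word that starts with prefix."""
        node = self.root
        for char in prefix.lower():
            node = node.children.get(char)
            if node is None:
                return set()
        found, stack = set(), [node]
        while stack:  # walk the whole subtree below the prefix
            current = stack.pop()
            found |= current.ids
            stack.extend(current.children.values())
        return found

    def search(self, query, limit=30):
        result = None
        for prefix in query.split():
            ids = self.ids_with_prefix(prefix)
            result = ids if result is None else result & ids
            if not result:
                return []
        if result is None:  # empty query
            return []
        ranked = sorted(result, key=lambda i: (self.views[i], i), reverse=True)
        return ranked[:limit]


index = PrefixIndex()
index.add(1, "big apple", 50)
index.add(2, "ibg apple", 90)
index.add(3, "apple banana", 70)
index.add(4, "a bee", 10)
print(index.search("a b"))
```

Record 2 ("ibg apple") is excluded because none of its words starts with "b", which is the behaviour the question asks for.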
This algorithm may be bad sometimes (for example, if I search for "a b c d e f ... z", I will search the entire trie...). So, I made an improvement.
1. For each "id" in your map, create also a small trie, that will
contain the words of the string (include a trie for each m[id]...
m[id].trie?).
Then, to make a search:
1. Choose the longest word in the search string (it's not guaranteed,
but it is probably the word with the fewest results in the trie...).
2. Apply the step 1 of the old algorithm.
3. Make a vector with the ids in the mainSet.
4. Let's make the final vector. For each id in the vector you've created
in step 3, search in the trie of this id (m[id].trie?) for all words
in the search string. If it includes all words, it's a valid id and
you might include it in the final vector; else, just ignore this id.
5. Repeat step 4 until there are no more ids to verify. After that, just
sort the final vector for <views, id>.
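A hedged Python sketch of this improved search follows. It simplifies one detail and says so: instead of a small trie per id, it keeps a plain list of words per id and checks prefixes with `startswith`, which should be adequate for strings of about 15 characters. All names (`build`, `ids_with_prefix`, `search`) are invented for the example.

```python
def build(records):
    """records: {id: (text, views)} -> (global trie, words per id)."""
    trie, words_by_id = {}, {}
    for record_id, (text, _views) in records.items():
        words = text.lower().split()
        words_by_id[record_id] = words
        for word in words:
            node = trie
            for char in word:
                node = node.setdefault(char, {})
            node.setdefault("", set()).add(record_id)  # "" marks end of word
    return trie, words_by_id


def ids_with_prefix(trie, prefix):
    node = trie
    for char in prefix:
        node = node.get(char)
        if node is None:
            return set()
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key == "":
                found |= value
            else:
                stack.append(value)
    return found


def search(records, trie, words_by_id, query, limit=30):
    prefixes = query.lower().split()
    if not prefixes:
        return []
    longest = max(prefixes, key=len)             # step 1
    candidates = ids_with_prefix(trie, longest)  # steps 2 and 3
    final = [                                    # steps 4 and 5
        record_id for record_id in candidates
        if all(any(w.startswith(p) for w in words_by_id[record_id])
               for p in prefixes)
    ]
    final.sort(key=lambda i: (records[i][1], i), reverse=True)
    return final[:limit]


records = {1: ("big apple", 50), 2: ("ibg apple", 90), 3: ("apple banana", 70)}
trie, words_by_id = build(records)
print(search(records, trie, words_by_id, "ap b"))
```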
Now, I use the database just as a way to easily store and load my data. All queries on this table go directly to this software. When I add or remove a record, I send the change both to the DB and to the software, so both always stay up to date. It takes about 30s to load all the data, but then the queries are fast (0.03s for the slowest ones, 0.001s on average; measured on my own notebook, I haven't tried it on dedicated hosting, where it might be much faster).

mysql fulltext MATCH,AGAINST returning 0 results

I am trying to follow: http://dev.mysql.com/doc/refman/4.1/en/fulltext-natural-language.html
in an attempt to improve search queries, both in speed and the ability to order by score.
However when using this SQL ("skitt" is used as a search term just so I can try match Skittles).
SELECT
id,name,description,price,image,
MATCH (name,description)
AGAINST ('skitt')
AS score
FROM
products
WHERE
MATCH (name,description)
AGAINST ('skitt')
it returns 0 results. I am trying to find out why. I think I might have set my indexes up wrong, but I'm not sure; this is the first time I've strayed away from LIKE!
Here is my table structure and data:
Thank you!
By default certain words are excluded from the search. These are called stopwords. "a" is an example of a stopword. You could test your query by using a word that is not a stopword, or you can disable stopwords:
How can I write full search index query which will not consider any stopwords?
If you want to also match prefixes use the truncation operator in boolean mode:
*
The asterisk serves as the truncation (or wildcard) operator. Unlike the other operators, it should be appended to the word to be affected. Words match if they begin with the word preceding the * operator.
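To show how the truncation operator could be applied to the query from the question, here is a small Python sketch that builds the boolean-mode search string. The helper name `boolean_prefix_query` is made up, the leading `+` (require every word) is an assumption borrowed from common usage rather than something the answer prescribes, and the `%s` placeholders assume a MySQL driver that uses that parameter style.

```python
import re


def boolean_prefix_query(search_terms):
    """Turn 'skitt' into '+skitt*' so every word must match as a prefix."""
    # Keep only word characters so user input cannot inject boolean operators.
    words = re.findall(r"\w+", search_terms)
    return " ".join(f"+{word}*" for word in words)


# SQL text for the products table from the question; the search string is bound twice.
SQL = (
    "SELECT id, name, description, price, image, "
    "MATCH (name, description) AGAINST (%s IN BOOLEAN MODE) AS score "
    "FROM products "
    "WHERE MATCH (name, description) AGAINST (%s IN BOOLEAN MODE)"
)
params = (boolean_prefix_query("skitt"),) * 2
print(params)
```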