There are two name columns in my table. (name1, name2)
I want to receive keywords as input and output them in the most similar order among the data including the keywords.
If the user inputs 'ed', we want the output ordered as 'ed', 'Ed Sheeran', 'Ahmedzidan'.
(The relative order of 'Ed Sheeran' and 'Ahmedzidan' may vary depending on the matching method.)
We want the exact match 'ed' to rank as most similar, immediately followed by the other names containing 'ed'.
I don't know how to do exact matching.
The keyword 'ed' above matches if it appears in either name1 or name2.
There is no priority between the two columns.
The method I am using now:
select
  (LENGTH(name1) - LENGTH('ed')) + (LENGTH(name2) - LENGTH('ed')) as score
from user
where name1 like '%ed%' or name2 like '%ed%'
order by score asc
Another way:
select
(CASE WHEN name1 = 'ed' or name2 = 'ed' THEN 4
WHEN name1 like 'ed%' or name2 like 'ed%' THEN 3
WHEN name1 like '%ed' or name2 like '%ed' THEN 2
WHEN name1 like '%ed%' or name2 like '%ed%' THEN 1
END
)
as score
from user
where name like '%ed%' or name2 like '%ed%'
order by score desc
However, both give results different from what I expected, and I don't know which one is faster.
I tried using a full-text index, but it seems to require too much sacrifice(?) to search for one or two letters.
It was also too slow while the user was typing longer keywords.
Example: keyword 'ed' -> 0.2s, keyword 'ed Sheeran' -> 5s.
What is the best way?
If the above two methods are the best, which one could be faster?
Let me discuss the performance impact of each part of the query:
The WHERE has OR and LIKE with a leading wildcard. Each of those forces the query to do a full scan, checking every row.
There is little need to discuss further; all other aspects (including the lengthy CASE) matter much less for speed. Things like POSITION versus the alternatives might shave off 1%.
If the table is huge (and cannot be cached in RAM), then this would help some: INDEX(name1, name2). The trick here is to turn a table scan into an index scan.
All work is done in the "buffer_pool" in RAM. When a table is bigger than RAM, and the query needs to look at all the rows, the processing must bump things out of the buffer_pool to load data from disk. I/O is likely to be the biggest factor in performance.
The table's BTree contains all the columns of all the rows. The INDEX mentioned contains one entry per row, holding name1, name2 and whatever column(s) comprise the PRIMARY KEY. That is, the index is likely to be smaller than the table. Hence the index might fit in RAM, whereas the data would have to be paged in. (Again, it's about I/O.)
I think you can use the POSITION function and order by it. There is no need for the SELECT CASE approach, because your query contains no real branching logic, and the duplicated LIKE comparisons waste time. Using POSITION you can get the result you want, if you just want 'Ed Sheeran' ordered first, followed by other 'ed' matches like 'Mr. Bambang Ed'.
SELECT name, POSITION('a' IN name) AS pos
FROM user
WHERE name LIKE '%a%'
ORDER BY pos ASC
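The idea can be sketched end to end. Below is a minimal demonstration using Python's sqlite3 as a stand-in for MySQL: SQLite has no POSITION(), so the equivalent INSTR(str, substr) is used instead (MySQL supports INSTR with the same argument order). The table and column names follow the question; the tie-break on LENGTH(name1) is an addition so the exact match 'ed' sorts before 'Ed Sheeran'.

```python
import sqlite3

# SQLite stand-in for the MySQL table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (name1 TEXT, name2 TEXT)")
conn.executemany(
    "INSERT INTO user (name1, name2) VALUES (?, ?)",
    [("ed", ""), ("Ed Sheeran", ""), ("Ahmedzidan", ""), ("Bob", "freddy")],
)

# Rank rows by the earliest position of the keyword in either column;
# NULLIF/COALESCE push non-matching columns to the end (999), and
# LENGTH(name1) breaks ties so the shorter (exact) match comes first.
rows = conn.execute(
    """
    SELECT name1, name2,
           MIN(COALESCE(NULLIF(INSTR(LOWER(name1), 'ed'), 0), 999),
               COALESCE(NULLIF(INSTR(LOWER(name2), 'ed'), 0), 999)) AS pos
    FROM user
    WHERE name1 LIKE '%ed%' OR name2 LIKE '%ed%'
    ORDER BY pos ASC, LENGTH(name1) ASC
    """
).fetchall()

print([r[0] for r in rows])  # exact match first, then by match position
```

Note the row ("Bob", "freddy") still matches and ranks by the position of 'ed' inside name2, which is the "either column, no priority" behaviour the question asks for.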
Related
I have three tables and I have to search them with a LIKE match. The query runs over 10,000 records. It works fine but takes 4 seconds to return results. What can I do to improve the speed and bring it down to 1 second?
profile_category_table
----------------------
restaurant
sea food restaurant
profile_keywords_table
----------------------
rest
restroom
r.s.t
company_profile_table
---------------------
maha restaurants
indian restaurants
Query:
SELECT name
FROM (
(SELECT PC_name AS name
FROM profile_category_table
WHERE PC_status=1
AND PC_parentid!=0
AND (regex_replace('[^a-zA-Z0-9\-]','',remove_specialCharacter(PC_name)) LIKE '%rest%')
GROUP BY PC_name)
UNION
(SELECT PROFKEY_name AS name
FROM profile_keywords_table
WHERE PROFKEY_status=1
AND (regex_replace('[^a-zA-Z0-9\-]','',remove_specialCharacter(PROFKEY_name)) LIKE '%rest%')
GROUP BY PROFKEY_name)
UNION
(SELECT COM_name AS name
FROM company_profile_table
WHERE COM_status=1
AND (regex_replace('[^a-zA-Z0-9\-]','',remove_specialCharacter(COM_name)) LIKE '%rest%')
GROUP BY COM_name))a
ORDER BY IF(name LIKE '%rest%',1,0) DESC LIMIT 0, 2
And I added an INDEX for those columns too.
If a user searches with the text 'rest' in the textbox, the auto-suggestion results should be:
results
restaurant
sea food restaurant
maha restaurants
indian restaurants
rest
restroom
r.s.t
I used regex_replace('[^a-zA-Z0-9-]','',remove_specialCharacter(COM_name)) to remove special characters from the field value and match it against the keyword.
There are lots of things you can consider:
The main killer of performance here is probably the regex_replace() ... LIKE '%FOO%'. Given that you are applying functions to the columns, indexes cannot take effect, leaving you with several full table scans. Not to mention that a regex replace is heavyweight. For the sake of optimization, you may:
Keep a separate column which stores the "sanitized" data, create an index on it, and rewrite your query as WHERE pc_name_sanitized LIKE '%FOO%'.
I am not sure if it is available in MySQL, but in many DBMSs there is a feature called a function-based index. You could consider using one to index the result of the regex replace.
However even after the above changes, you will find the performance is not very attractive. In most case, using like with wildcard at the front is avoiding indices to be used. If possible, try to do exact match, or have the beginning of string provided, e.g. where pc_name_sanitized like 'FOO%'
As other users mentioned, UNION is also a performance killer. Try to use UNION ALL instead if possible.
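The first suggestion, the pre-sanitized indexed column, can be sketched as follows. This uses Python's sqlite3 as a stand-in for MySQL, with Python's re.sub standing in for the regex_replace()/remove_specialCharacter() UDFs from the question; the sanitize() helper and index name are illustrative.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profile_category_table (PC_name TEXT, PC_name_sanitized TEXT)"
)
conn.execute(
    "CREATE INDEX idx_pc_sanitized ON profile_category_table (PC_name_sanitized)"
)

def sanitize(name):
    # Same idea as regex_replace('[^a-zA-Z0-9\-]', '', ...), applied once at
    # write time instead of per row on every query (lower-cased here as well).
    return re.sub(r"[^a-zA-Z0-9-]", "", name).lower()

for name in ["sea food restaurant", "r.s.t", "restroom", "maha restaurants"]:
    conn.execute(
        "INSERT INTO profile_category_table VALUES (?, ?)", (name, sanitize(name))
    )

# The query now compares plain stored values; the heavyweight regex work is
# no longer repeated on every row of every search.
rows = conn.execute(
    "SELECT PC_name FROM profile_category_table WHERE PC_name_sanitized LIKE '%rest%'"
).fetchall()

print(sorted(r[0] for r in rows))
```

The leading wildcard still forces a scan of the sanitized values, but the per-row function calls, the dominant cost in the original query, are gone.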
I'm going to say: don't filter in the query. Do that in whatever language you're programming in. regex_replace is a heavy operation regardless of the environment, and you're doing it several times per row on a query of 10,000 records, with a UNION of who knows how many more.
Rewrite it completely.
UNION statements are killing performance, and you're doing the LIKE on too many fields.
Moreover, you're searching a temporary table (SELECT field FROM (...subquery...)), so without any indexes, which is really slow (a guaranteed full-table scan over its rows).
Since you use UNION between all the queries, you can remove the GROUP BY in each of them; and since you only select rows containing "rest", you can also remove IF(name LIKE '%rest%',1,0) from the ORDER BY clause.
I recently implemented the UDFs of the Damerau–Levenshtein algorithms into MySQL, and was wondering if there is a way to combine the fuzzy matching of the Damerau–Levenshtein algorithm with the wildcard searching of the Like function? If I have the following data in a table:
ID | Text
---------------------------------------------
1 | let's find this document
2 | let's find this docment
3 | When the book is closed
4 | The dcument is locked
I want to run a query that would incorporate the Damerau–Levenshtein algorithm...
select text from table where damlev('Document',tablename.text) <= 5;
...with a wildcard match to return IDs 1, 2, and 4 in my query. I'm not sure of the syntax, whether this is possible, or whether I would have to approach it differently. The above select statement works fine in isolation, but not on individual words. I would have to change the SQL to...
select text from table where
damlev('let''s find this document', tablename.text) <= 5;
...which of course returns just ID 2. I'm hoping there is a way to combine the fuzzy and wildcard matching if I want all records returned that have the word "document", or a variation of it, appearing anywhere within the Text field.
In working with person names, and doing fuzzy lookups on them, what worked for me was to create a second table of words. Also create a third table that is an intersect table for the many to many relationship between the table containing the text, and the word table. When a row is added to the text table, you split the text into words and populate the intersect table appropriately, adding new words to the word table when needed. Once this structure is in place, you can do lookups a bit faster, because you only need to perform your damlev function over the table of unique words. A simple join gets you the text containing the matching words.
A query for a single word match would look something like this:
SELECT T.* FROM Words AS W
JOIN Intersect AS I ON I.WordId = W.WordId
JOIN Text AS T ON T.TextId = I.TextId
WHERE damlev('document',W.Word) <= 5
and two words would look like this (off the top of my head, so may not be exactly correct):
SELECT T.* FROM Text AS T
JOIN (SELECT I.TextId, COUNT(I.WordId) AS MatchCount FROM Words AS W
JOIN Intersect AS I ON I.WordId = W.WordId
WHERE damlev('john',W.Word) <= 2
OR damlev('smith',W.Word) <=2
GROUP BY I.TextId) AS Matches ON Matches.TextId = T.TextId
AND Matches.MatchCount = 2
The advantages here, at the cost of some database space, is that you only have to apply the time-expensive damlev function to the unique words, which will probably only number in the 10's of thousands regardless of the size of your table of text. This matters, because the damlev UDF will not use indexes - it will scan the entire table on which it's applied to compute a value for every row. Scanning just the unique words should be much faster. The other advantage is that the damlev is applied at the word level, which seems to be what you are asking for. Another advantage is that you can expand the query to support searching on multiple words, and can rank the results by grouping the matching intersect rows on TextId, and ranking on the count of matches.
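The word-table idea can be sketched in miniature without a database: the point is that the expensive distance function runs only over the unique words, and matches join back to the texts that contain them. This is a pure-Python sketch; the damlev() below is the optimal-string-alignment variant of Damerau-Levenshtein, so the UDF from the question may differ in detail, and the word-level threshold of 2 is illustrative.

```python
# Optimal-string-alignment Damerau-Levenshtein distance.
def damlev(a, b):
    a, b = a.lower(), b.lower()
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# The "Text" table from the question.
texts = {
    1: "let's find this document",
    2: "let's find this docment",
    3: "When the book is closed",
    4: "The dcument is locked",
}

# The "Words" and "Intersect" tables in one structure: each unique word maps
# to the set of text ids containing it, populated at insert time.
word_to_texts = {}
for text_id, text in texts.items():
    for word in text.replace("'", " ").split():
        word_to_texts.setdefault(word.lower(), set()).add(text_id)

# Lookup: run damlev only over the unique words, then join back to texts.
def fuzzy_search(term, max_dist):
    hits = set()
    for word, ids in word_to_texts.items():
        if damlev(term, word) <= max_dist:
            hits |= ids
    return sorted(hits)

print(fuzzy_search("document", 2))  # → [1, 2, 4]
```

Searching for "document" matches "document" (distance 0) and the misspellings "docment" and "dcument" (distance 1 each), returning exactly the IDs 1, 2 and 4 the question asks for, while text 3 is excluded.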
I am running the following query and however I change it, it still takes almost 5 seconds to run which is completely unacceptable...
The query:
SELECT cat1, cat2, cat3, PRid, title, genre, artist, author, actors, imageURL,
lowprice, highprice, prodcatID, description
from products
where title like '%' AND imageURL <> '' AND cat1 = 'Clothing and accessories'
order by userrating desc
limit 500
I've tried taking out the "like '%'", taking out the "imageURL <> ''", but still the same. I've tried returning only 1 column, still the same.
I have indexes on almost every column in the table, certainly all the columns mentioned in the query.
This is basically for a category listing. If I do a fulltext search for something in the title column which has a fulltext index, it takes less than a second.
Should I add another fulltext index to column cat1 and change the query focus to "match against" on that column?
Am I expecting too much?
The table has just short of 3 million rows.
You said you had an index on every column. Do you have an index such as?
alter table products add index (cat1, userrating)
If you don't, give it a try. Run that query and let me know if it runs faster.
Also, I assume you're actually setting some kind of filter instead of the bare % on the title field, right?
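A quick way to see the shape of that composite index at work is the sketch below, using Python's sqlite3 as a stand-in for MySQL with made-up rows. With (cat1, userrating) indexed, the engine can locate the category by equality and read entries already ordered by rating, stopping at the LIMIT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (title TEXT, cat1 TEXT, userrating REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Wool scarf", "Clothing and accessories", 4.7),
        ("Leather belt", "Clothing and accessories", 3.1),
        ("Rain jacket", "Clothing and accessories", 4.9),
        ("Cook book", "Books", 4.8),
    ],
)

# Equality column first, ORDER BY column second: the index entries for one
# category are contiguous and already sorted by rating, so no filesort is
# needed for the ORDER BY ... LIMIT.
conn.execute("CREATE INDEX idx_cat_rating ON products (cat1, userrating)")

rows = conn.execute(
    """
    SELECT title FROM products
    WHERE cat1 = 'Clothing and accessories'
    ORDER BY userrating DESC
    LIMIT 2
    """
).fetchall()

print([r[0] for r in rows])
```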
You should store cat1 as an integer rather than a string in these 3 million rows. You must also index correctly; if indexing every column could only help, the system would do it by default.
Apart from that, title LIKE '%' doesn't do anything. I guess you use it for searching, so that it becomes title LIKE 'search%'.
Do you use any sort of framework to fetch this? Getting 500 rows with a lot of columns can exhaust the system if your framework saves them into a large array. It may well not be the case, but:
Try running a ordinary $query = mysql_query() and while($row = mysql_fetch_object($query)).
I suggest to add an index with the columns queried: title, imageURL and cat1.
Second improvement: use the SQL server's cache; it will dramatically improve the speed.
Last improvement: if you query is always like that, only the values change, then use prepared statements.
Well, I am quite sure that a % as the first character in a LIKE clause gives you a full table scan for that column (in your case that full scan won't actually be executed on its own, because you already have restricting conditions in the AND clause).
Besides that, try to add an index on the cat1 column. Also, try to add other criteria to your query in order to reduce the size of the working dataset; the number of rows that match your query without the LIMIT clause might be too big as well.
I am trying to include a limitation in a MySQL SELECT query.
My database is structured in such a way that if a record is found in column one, then at most 5000 records with the same name can follow it.
Example:
mark
..mark repeated 5000 times
john
anna
..other millions of names
So in this table it would be more efficient to find the first 'mark' and continue searching at most 5000 rows down from that one.
Is it possible to do something like this?
Just make a btree index on the name column:
CREATE INDEX name ON your_table(name) USING BTREE
and mysql will silently do exactly what you want each time it looks for a name.
Try with:
SELECT name
FROM table
ORDER BY (name = 'mark') DESC
LIMIT 5000
Basically you sort 'mark' first, then the rest follow and get cut off by the LIMIT.
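That ordering trick can be demonstrated in a few lines; this sketch uses Python's sqlite3 as a stand-in for MySQL, with a secondary sort on name added only to make the order of the non-'mark' rows deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?)",
    [("anna",), ("mark",), ("john",), ("mark",), ("zoe",)],
)

# (name = 'mark') evaluates to 1 for matches and 0 otherwise, so sorting on
# it descending floats every 'mark' row to the top before the LIMIT applies.
rows = conn.execute(
    "SELECT name FROM names ORDER BY (name = 'mark') DESC, name ASC LIMIT 3"
).fetchall()

print([r[0] for r in rows])
```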
It's actually quite difficult to understand your desired output, but I think this might be heading in the right direction:
(SELECT name
FROM table
WHERE name = 'mark'
LIMIT 5000)
UNION
(SELECT name
FROM table
WHERE name != 'mark'
ORDER BY name)
This will first get up to 5000 records with the name 'mark', then get the remainder via the UNION; you can add a LIMIT to the second query if required.
For performance you should ensure that the columns used by ORDER BY and WHERE are indexed accordingly
If you make sure that the column is properly indexed, MySQL will take care of optimisation for you.
Edit:
Thinking about it, I figured that this answer is only useful if I specify how to do that. user nobody beat me to the punch: CREATE INDEX name ON your_table(name) USING BTREE
This is exactly what database indexes are designed to do; this is what they are for. MySQL will use the index itself to optimise the search.
This is going to be one of those questions but I need to ask it.
I have a large table which may or may not have one unique row. I therefore need a MySQL query that will just tell me TRUE or FALSE.
With my current knowledge, I see two options (pseudo code):
[id = primary key]
OPTION 1:
SELECT id FROM table WHERE x=1 LIMIT 1
... and then determine in PHP whether a result was returned.
OPTION 2:
SELECT COUNT(id) FROM table WHERE x=1
... and then just use the count.
Is either of these preferable for any reason, or is there perhaps an even better solution?
Thanks.
If the selection criterion is truly unique (i.e. yields at most one result), you are going to see massive performance improvement by having an index on the column (or columns) involved in that criterion.
create index my_unique_index on table(x)
If you want to enforce the uniqueness, that is not even an option, you must have
create unique index my_unique_index on table(x)
Having this index, querying on the unique criterion will perform very well, regardless of minor SQL tweaks like count(*), count(id), count(x), limit 1 and so on.
For clarity, I would write
select count(*) from table where x = ?
I would avoid LIMIT 1 for two other reasons:
It is non-standard SQL. I am not religious about that, use the MySQL-specific stuff where necessary (i.e. for paging data), but it is not necessary here.
If for some reason, you have more than one row of data, that is probably a serious bug in your application. With LIMIT 1, you are never going to see the problem. This is like counting dinosaurs in Jurassic Park with the assumption that the number can only possibly go down.
AFAIK, if you have an index on your ID column, both queries will perform more or less equally. The second query needs 1 less line of code in your program, but that's not going to make any performance impact either.
Personally I typically do the first one of selecting the id from the row and limiting to 1 row. I like this better from a coding perspective. Instead of having to actually retrieve the data, I just check the number of rows returned.
If I were to compare speeds, I would say not doing a COUNT in MySQL would be faster. I don't have any proof, but my guess is that MySQL has to get all the rows and then count how many there are. Although, on second thought, it would have to do that in the first option as well, so the code knows how many rows there are. But since you have COUNT(id) vs COUNT(*), I would say it might be slightly slower.
Intuitively, the first one could be faster, since it can abort the table (or index) scan when it finds the first value. But you should retrieve x, not id: if the engine is using an index on x, it doesn't need to go to the block where the row actually is.
Another option could be:
select exists(select 1 from mytable where x = ?) from dual
Which already returns a boolean.
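The EXISTS form can be verified quickly; this sketch uses Python's sqlite3 as a stand-in for MySQL (the table name and data are made up).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, x INTEGER)")
conn.executemany("INSERT INTO mytable (x) VALUES (?)", [(1,), (2,), (2,)])

# EXISTS stops at the first matching row, and the statement always returns
# exactly one row holding 1 or 0. (SQLite needs no FROM dual; in MySQL the
# FROM dual is optional as well.)
def row_exists(value):
    return bool(
        conn.execute(
            "SELECT EXISTS(SELECT 1 FROM mytable WHERE x = ?)", (value,)
        ).fetchone()[0]
    )

print(row_exists(1), row_exists(5))
```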
Typically, you use a GROUP BY ... HAVING clause to determine whether there are duplicate rows in a table. Suppose you have a table with an id and a name (id being the primary key), and you want to know whether name is unique or repeated. You would use
select name, count(*) as total from mytable group by name having total > 1;
The above returns each name that is repeated, along with how many times it occurs.
If you just want one query to get your answer as true or false, you can use a nested query, e.g.
select if(count(*) >= 1, True, False) from (select name, count(*) as total from mytable group by name having total > 1) a;
The above should return true, if your table has duplicate rows, otherwise false.
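Both queries from this answer can be exercised together; this sketch uses Python's sqlite3 as a stand-in for MySQL. SQLite has no IF() function, so the boolean version is written with EXISTS over the same grouped subquery instead, which MySQL accepts too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO mytable (name) VALUES (?)", [("anna",), ("mark",), ("mark",)]
)

# Which names repeat, and how often:
dupes = conn.execute(
    "SELECT name, COUNT(*) AS total FROM mytable GROUP BY name HAVING COUNT(*) > 1"
).fetchall()

# A single true/false answer: does any group have more than one row?
has_dupes = bool(
    conn.execute(
        "SELECT EXISTS(SELECT 1 FROM mytable GROUP BY name HAVING COUNT(*) > 1)"
    ).fetchone()[0]
)

print(dupes, has_dupes)
```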