How to improve search speeds in this situation? - mysql

I have a search implemented on my site; it runs the following queries:
SELECT COUNT(mov_id) AS total_things
FROM content
WHERE con_status = 1 AND con_incomplete = 0 AND con_type = 1
AND ((con_title) LIKE ('%search keyword%')
OR soundex(con_title) LIKE soundex('search keyword')
OR MATCH (con_title) AGAINST ('search keyword'));
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-------------+
| id | select_type | table  | type | possible_keys | key      | key_len | ref               | rows  | Extra       |
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-------------+
|  1 | SIMPLE      | movies | ref  | con_type      | con_type | 12      | const,const,const | 11804 | Using where |
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-------------+
64058 Queries
Total time: 200817, Average time: 3.13492459958163
Taking 2 to 25 seconds to complete
Rows analyzed 1882 - 12104
SELECT
con_id,
con_title,
con_desc,
MATCH (con_title) AGAINST ('search keyword') AS relevancy
FROM content
WHERE con_status = 1 AND con_incomplete = 0 AND con_type = 1
AND ((con_title) LIKE ('%search keyword%')
OR soundex(con_title) LIKE soundex('search keyword')
OR MATCH (con_title) AGAINST ('search keyword'))
ORDER BY relevancy DESC
LIMIT 0, 24;
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-----------------------------+
| id | select_type | table  | type | possible_keys | key      | key_len | ref               | rows  | Extra                       |
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-----------------------------+
|  1 | SIMPLE      | movies | ref  | con_type      | con_type | 12      | const,const,const | 11803 | Using where; Using filesort |
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-----------------------------+
78321 Queries
Total time: 200657, Average time: 2.56198209930925
Taking 2 to 16 seconds to complete
Rows analyzed 0 - 15752
This basically works like a crude "fuzzy search" to ignore typos people might make.
Unfortunately, it's very slow (even if I remove soundex() or FULLTEXT searching). How can I improve search speeds in this situation?

The part of the WHERE clause that hurts is the first % after LIKE. To speed it up, you could normalize the keywords, moving them to a separate table:
table moviekeywords: movieid, keyword
table movies: movieid, ...
This allows you to search through the moviekeywords table using an = condition, or at least a prefix match like LIKE 'humphrey%'. Both variants can be made extremely fast with an index.
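A minimal sketch of what that layout could look like (the column types, sizes, and the choice of primary key are my assumptions, not part of the original suggestion):

CREATE TABLE moviekeywords (
    movieid INT NOT NULL,
    keyword VARCHAR(64) NOT NULL,
    PRIMARY KEY (keyword, movieid)  -- keyword first, so prefix searches can use the index
);

-- Exact match, or a prefix match with the wildcard only at the end;
-- both can use the index on keyword.
SELECT DISTINCT movieid
FROM moviekeywords
WHERE keyword LIKE 'humphrey%';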

As long as you keep using soundex() and LIKE '%nnn%', you will be running a full scan of an intermediate result. To illustrate this: if you omitted your other predicates (on the con_status, con_incomplete, and con_type columns), you would always be running a full table scan.
I suggest dropping or scaling back your fuzzy predicates. For example, LIKE 'nnn%' will be MUCH faster than LIKE '%nnn%' (if that column is indexed), but of course your search results will not be as fuzzy; a sketch follows at the end of this answer. Perhaps make soundex() an advanced search option that does not always run.
If you can't compromise on any of those points, then at least make sure that your con_status, con_incomplete, and con_type columns are all indexed.
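To illustrate the scaled-back predicate, a hedged sketch using the columns from the question (it assumes an index on con_title exists; whether a prefix-only match is fuzzy enough is a product decision):

-- Prefix match only: unlike '%search keyword%', this can use an index on con_title.
SELECT con_id, con_title
FROM content
WHERE con_status = 1 AND con_incomplete = 0 AND con_type = 1
AND con_title LIKE 'search keyword%'
LIMIT 0, 24;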

Think about Andomar's solution again - most keyword searches allow you to specify multiple keywords. You can't do that with your current query. And there's no problem with "The Terminator" - for that, you'd just add one keyword, "Terminator".
And with an index on the keyword column, it will be fast.

I made my "fuzzy search" a fallback option if COUNT on the original stricter query returns no results. My results have been pretty fast so far using
SOUNDS LIKE ('blah')
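A sketch of that fallback flow, using the tables from the question (a SOUNDS LIKE b is shorthand for SOUNDEX(a) = SOUNDEX(b), so this matches the soundex branch of the original query):

-- Pass 1: the stricter query; serve these results if any rows come back.
SELECT con_id, con_title
FROM content
WHERE con_status = 1 AND con_incomplete = 0 AND con_type = 1
AND MATCH (con_title) AGAINST ('search keyword')
LIMIT 0, 24;

-- Pass 2, run only when pass 1 returns nothing: the fuzzy fallback.
SELECT con_id, con_title
FROM content
WHERE con_status = 1 AND con_incomplete = 0 AND con_type = 1
AND con_title SOUNDS LIKE 'search keyword'
LIMIT 0, 24;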

So it looks like you only have around 15,000 rows. If you don't expect your table to grow past a hundred thousand entries or so, maybe you should just keep all the titles in memory and avoid hitting the database until you know which entries you want.
That is, at startup and at periodic intervals, just query all the titles out of the database, split each one into words, and keep a mapping of words to row keys. This should take less than 1MB of memory, accessing it should be quite fast, and most importantly you can add whatever fuzzy matching or heuristic scoring mechanisms you like (without modifying your schema).
Just a thought.

Related

Why does the query execute so much slower when all the columns involved are the same and only the where condition changes?

I have this query:
SELECT 1 AS InputIndex,
IF(TRIM(DeviceInput1Name) = '', 0, IF(INSTR(DeviceInput1Name, '|') > 0, 2, 1)) AS InputType,
(SELECT Value1_1 FROM devicevalues WHERE DeviceID = devices.DeviceID ORDER BY ValueTime DESC LIMIT 1) AS InputValueLeft,
(SELECT Value1_2 FROM devicevalues WHERE DeviceID = devices.DeviceID ORDER BY ValueTime DESC LIMIT 1) AS InputValueRight
FROM devices
WHERE DeviceIMEI = 'Some_Search_Value';
This completes fairly quickly (in up to 0.01 seconds). However, running the same query with the WHERE clause changed to
WHERE DeviceIMEI = 'Some_Other_Search_Value';
makes it run for upwards of 14 seconds! Some search values finish very quickly, while others run way too long.
If I run EXPLAIN on either query, I get the following:
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
| id | select_type        | table        | type  | possible_keys | key        | key_len | ref   | rows | Extra       |
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
|  1 | PRIMARY            | devices      | ref   | DeviceIMEI    | DeviceIMEI | 28      | const |    1 | Using where |
|  3 | DEPENDENT SUBQUERY | devicevalues | index | DeviceID,More | ValueTime  | 9       | NULL  |    1 | Using where |
|  2 | DEPENDENT SUBQUERY | devicevalues | index | DeviceID,More | ValueTime  | 9       | NULL  |    1 | Using where |
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
Also, here's the actual number of records, just so it's clear:
mysql> select count(*) from devicevalues inner join devices using(DeviceID) where devices.DeviceIMEI = 'Some_Search_Value';
+----------+
| count(*) |
+----------+
|  1017946 |
+----------+
1 row in set (0.17 sec)
mysql> select count(*) from devicevalues inner join devices using(DeviceID) where devices.DeviceIMEI = 'Some_Other_Search_Value';
+----------+
| count(*) |
+----------+
|   306100 |
+----------+
1 row in set (0.04 sec)
Any ideas why changing a search value in the WHERE clause would cause the query to execute so slowly, even when the number of physical records to search through is lower?
Note: there is no need for you to rewrite the query; just explain why the above happens.
UPDATE: I have tried running two separate queries instead of one with dependent subqueries to get the information I need (first I select DeviceID from devices by DeviceIMEI, then select from devicevalues by DeviceID I got from the previous query) and all queries return instantly. I suppose the only solution is to run these queries in a transaction, so I'll be making a stored procedure to do this. This, however, still doesn't answer my question which puzzles me.
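For reference, a sketch of that two-query version (the literal DeviceID is a hypothetical stand-in for whatever the first query returns):

-- Query 1: resolve the IMEI to its DeviceID.
SELECT DeviceID FROM devices WHERE DeviceIMEI = 'Some_Search_Value';

-- Query 2: fetch the latest values once, instead of once per dependent subquery.
SELECT Value1_1, Value1_2
FROM devicevalues
WHERE DeviceID = 12345  -- hypothetical value taken from query 1
ORDER BY ValueTime DESC
LIMIT 1;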
I don't think that 1017946 is equivalent to the number of rows returned by your very first query. Your first query returns all rows from devices along with some correlated subqueries, while your count query returns all common rows between the two tables. If so, the problem might be a cardinality issue: 'Some_Other_Search_Value' matches a much larger proportion of the rows in your first query than 'Some_Search_Value', so MySQL chooses a table scan.
If I understand correctly, the query is the same, and only the searched value changes.
There are three real possibilities as far as I can see, the first much likelier than the others:
1. The fast query only appears to be fast, because its result is already in the MySQL query cache. Try disabling the cache, running the query with SQL_NO_CACHE, or running the slow query twice. If the second run takes 0.01s instead of 14s, you'll know this is the case.
2. One query has to examine far more records than the other. One IMEI may have lots of rows in devicevalues, another next to none. You do appear to be in such a condition, but what makes this explanation unlikely (apart from the times involved) is that it is the slower IMEI that actually has fewer matches.
3. The slow query is indeed slow. This means that a particular subset of the data is hard to locate or hard to retrieve. The former may be due to overdue reindexing or to filesystem fragmentation of very large indexes; the latter can be due to fragmentation of the tablespace, or to some other condition that splits up records (for example, a partitioned database: a search in a small partition will usually be faster than a search in a large one).
But the time differences involved aren't equal in the three cases, and a 1400x difference seems to me an unlikely consequence of (2) or (3). The first possibility seems far more likely.
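To test possibility (1), a sketch of the original query with the cache bypassed (SQL_NO_CACHE makes MySQL skip the query cache for this one statement, so repeated runs are measured cold rather than served from the cache):

SELECT SQL_NO_CACHE 1 AS InputIndex,
IF(TRIM(DeviceInput1Name) = '', 0, IF(INSTR(DeviceInput1Name, '|') > 0, 2, 1)) AS InputType,
(SELECT Value1_1 FROM devicevalues WHERE DeviceID = devices.DeviceID ORDER BY ValueTime DESC LIMIT 1) AS InputValueLeft,
(SELECT Value1_2 FROM devicevalues WHERE DeviceID = devices.DeviceID ORDER BY ValueTime DESC LIMIT 1) AS InputValueRight
FROM devices
WHERE DeviceIMEI = 'Some_Search_Value';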
Update: you do not seem to be using indexes for your subqueries. Do you have an index such as
CREATE INDEX dv_ndx ON devicevalues(DeviceID, ValueTime);
If you can, you can try a covering index:
CREATE INDEX dv_ndx ON devicevalues(DeviceID, ValueTime, Value1_1, Value1_2);

MySQL not picking the correct index for some queries

I'm running the following query on the table, changing only the values in the WHERE condition. With one set of values it uses one index; with the other set it picks another (wrong?) index.
Row count for query 1 is 402954; it takes approx 1.5 sec.
Row count for query 2 is 52097; it takes approx 35 sec.
Query 1 and query 2 are the same; only the values in the WHERE condition differ.
query 1
EXPLAIN SELECT
log_type,count(DISTINCT subscriber_id) AS distinct_count,
count(subscriber_id) as total_count
FROM campaign_logs
WHERE
domain = 'xxx' AND
campaign_id='123' AND
log_type IN ('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED') AND
log_time BETWEEN
CONVERT_TZ('2015-02-12 00:00:00','+05:30','+00:00') AND
CONVERT_TZ('2015-02-19 23:59:58','+05:30','+00:00')
GROUP BY log_type;
EXPLAIN of the above query:
+----+-------------+---------------+-------+------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------+------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------+--------+-------------+
| 1 | SIMPLE | campaign_logs | range | campaign_id_index,domain_index,log_type_index,log_time_index,campaignid_domain_logtype_logtime_index | campaignid_domain_logtype_logtime_index | 468 | NULL | 402954 | Using where |
+----+-------------+---------------+-------+------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------+--------+-------------+
query 2
EXPLAIN SELECT
log_type,count(DISTINCT subscriber_id) AS distinct_count,
count(subscriber_id) as total_count
FROM stats.campaign_logs
WHERE
domain = 'yyy' AND
campaign_id='345' AND
log_type IN ('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED') AND
log_time BETWEEN
CONVERT_TZ('2014-02-05 00:00:00','+05:30','+00:00') AND
CONVERT_TZ('2015-02-19 23:59:58','+05:30','+00:00')
GROUP BY log_type;
EXPLAIN of the above query:
+----+-------------+---------------+-------------+------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+-------+------------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------------+------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+-------+------------------------------------------------------------------------------+
| 1 | SIMPLE | campaign_logs | index_merge | campaign_id_index,domain_index,log_type_index,log_time_index,campaignid_domain_logtype_logtime_index | campaign_id_index,domain_index | 153,153 | NULL | 52097 | Using intersect(campaign_id_index,domain_index); Using where; Using filesort |
+----+-------------+---------------+-------------+------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+-------+------------------------------------------------------------------------------+
Query 1 is using the correct index because I have a composite index.
Query 2 is using an index merge, and it's taking a long time to execute.
Why is MySQL using different indexes for the same query?
I know we can specify USE INDEX in the query, but why is MySQL not picking the correct index in this case? Am I doing anything wrong?
No, you're not doing anything wrong.
As Chipmonkey stated in comments, sometimes MySQL will choose the wrong execution plan because of outdated table statistics. You can update the table statistics by performing ANALYZE TABLE.
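For this table that would simply be:

ANALYZE TABLE campaign_logs;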
Still, the MySQL optimizer isn't that sophisticated. It sees that in both cases it will have to visit the secondary index and then perform a lookup into the clustered index to get the actual table data. It estimated that the second query would have better selectivity using the two separate indexes merged together, so you can't blame it too much for guessing wrong.
I'm guessing that if you had a covering index, so that MySQL could satisfy the entire query from the index alone, it would favor that index over performing a merge.
Try adding subscriber_id to the end of your multi-column index to get a covering index.
Otherwise, use USE INDEX or FORCE INDEX, because that's what they're there for. You know more about the data than MySQL does.
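As a sketch, the two options might look like this (the new index name is hypothetical; the column order mirrors the existing compound index with subscriber_id appended):

-- Option 1: extend the compound index so it covers the whole query.
CREATE INDEX campaignid_domain_logtype_logtime_subscriber_index
ON campaign_logs (campaign_id, domain, log_type, log_time, subscriber_id);

-- Option 2: pin the plan to the existing compound index.
SELECT log_type,
count(DISTINCT subscriber_id) AS distinct_count,
count(subscriber_id) AS total_count
FROM campaign_logs FORCE INDEX (campaignid_domain_logtype_logtime_index)
WHERE domain = 'yyy' AND campaign_id = '345'
AND log_type IN ('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED')
AND log_time BETWEEN CONVERT_TZ('2014-02-05 00:00:00','+05:30','+00:00')
AND CONVERT_TZ('2015-02-19 23:59:58','+05:30','+00:00')
GROUP BY log_type;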
I suggest you try this:
Add this permutation of your compound index.
(campaign_id,domain,log_time,log_type,subscriber_id)
Change your query to remove the WHERE log_type IN() criterion, thus allowing the aggregate function to use all the records it finds in the range scan on log_time. Including subscriber_id in the index should allow the whole query to be satisfied directly from the index. That is, this is a covering index.
Finally, you can filter on your log_type values by wrapping the whole query in
SELECT *
FROM (/*the whole query*/) x
WHERE log_type IN
('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED')
ORDER BY log_type
This should give you better, and more predictable, performance.
(Unless the log_types you want are a tiny subset of the records, in which case please ignore this suggestion.)
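Assembled, the rewrite might look like this (a sketch that assumes the permuted index above has been created; the literal values are taken from query 2):

SELECT *
FROM (
    SELECT log_type,
    count(DISTINCT subscriber_id) AS distinct_count,
    count(subscriber_id) AS total_count
    FROM stats.campaign_logs
    WHERE domain = 'yyy' AND campaign_id = '345'
    AND log_time BETWEEN CONVERT_TZ('2014-02-05 00:00:00','+05:30','+00:00')
    AND CONVERT_TZ('2015-02-19 23:59:58','+05:30','+00:00')
    GROUP BY log_type
) x
WHERE log_type IN ('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED')
ORDER BY log_type;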

Can't optimise MySQL query

I am running a query to retrieve some game levels from a MySQL database. The query itself takes around 0.00025 seconds to execute on a database that contains 40 level strings. I thought that was satisfactory, until I got a message from the website host telling me to optimise the query below, or the script will be removed, since it is putting a lot of strain on their servers.
I tried optimising by using EXPLAIN and EXPLAIN EXTENDED and adjusting the columns accordingly (adding indexes), but I always get the same performance. I also noticed that MySQL didn't use indexes where they were available, but instead did a full table scan.
Results from EXPLAIN EXTENDED:
table    id  select_type  type  possible_keys   key      key_len  ref            rows  Extra
users    1   SIMPLE       ALL   PRIMARY,id      NULL     NULL     NULL           7     Using temporary; Using filesort
AllTime  1   SIMPLE       ref   PRIMARY,userid  PRIMARY  4        Test.users.id  1
query:
SELECT users.nickname, AllTime.userid, AllTime.id, AllTime.levelname, AllTime.levelstr
FROM AllTime
INNER JOIN users
ON AllTime.userid=users.id
ORDER BY AllTime.id DESC
LIMIT ($value_from_php),20;
The tables:
users
| id(int)                   | nickname(varchar) |
| (Primary, Auto_increment) |                   |
|---------------------------|-------------------|
| 1                         | username1         |
| 2                         | username2         |
| 3                         | username3         |
| ...                       | ...               |
and AllTime
| id(int)                   | userid(int) | levelname(varchar) | levelstr(text) |
| (Primary, Auto_increment) | (index)     |                    |                |
|---------------------------|-------------|--------------------|----------------|
| 1                         | 2           | levelname1         | levelstr1      |
| 2                         | 2           | levelname2         | levelstr2      |
| 3                         | 3           | levelname3         | levelstr3      |
| 4                         | 1           | levelname4         | levelstr4      |
| 5                         | 1           | levelname5         | levelstr5      |
| 6                         | 1           | levelname6         | levelstr6      |
| 7                         | 2           | levelname7         | levelstr7      |
Is there a way to optimize this query, or would I be better off calling two consecutive queries from PHP just to avoid the warning?
I am just learning MySQL, so please take that information into account when replying, thank you :)
I'm assuming you're using InnoDB.
For an INNER JOIN, MySQL typically starts with the table with the fewest rows, in this case users. However, since you just want the latest 20 AllTime records joined with the corresponding user records, you actually should start with AllTime: with the LIMIT, it will be the smaller data set.
Use STRAIGHT_JOIN to force the join order:
SELECT users.nickname, AllTime.userid, AllTime.id, AllTime.levelname,
AllTime.levelstr
FROM AllTime
STRAIGHT_JOIN users
ON users.id = AllTime.userid
ORDER BY AllTime.id DESC
LIMIT ($value_from_php),20;
It should be able to use the primary key on the AllTime table and follow it in descending order. It'll grab all the data on the same pages as it goes.
It should also use the primary key on the users table to grab the id and nickname. If there are more than just two columns, you might add a multi-column covering index on (id, nickname) to improve the speed.
If you can, convert the levelstr column to VARCHAR so that the data is stored on the same page as the rest of the data, otherwise, it has to go fetch the text columns separately. This assumes that your columns are under the 8000 byte row limit for InnoDB. There is no way to avoid the USING TEMPORARY unless you get rid of the text column.
Most likely, your host has identified this query by using the slow query log, which can identify all queries that don't use an index, or they may have red flagged it because of the Using temporary.
It doesn't look like the query has a problem.
- Review the application code; most likely the issue is there.
- Check the MySQL query execution plan; possibly you are missing an index.
- Make sure you cache data in the application and the database (FYI, sometimes you can load the whole database into application memory).
- Make sure you use a connection pool.
- Create a view (a very small chance of improvement).
- Try removing the ORDER BY clause (again, a very small chance it will improve performance).
The query itself takes around 0.00025 seconds ... I got a message from the website host telling me to optimise the below-mentioned query, or the script will be removed since it is pushing a lot of strain onto their servers.
Ask the website host for more details about why this query has been flagged for attention. A query that trivial is not going to cause strain on anything unless it is being called very frequently.
Find out how many times that query is being run. I will bet you a nickel that your site is getting hammered by a bot and that the query is being executed hundreds or thousands of times per minute. If so, then that's your real problem.
LIMIT ($value_from_php),20; -- if $value_from_php is huge, then the query is slow. This is because all the 'old' pages need to be scanned before getting to the 20 rows you need.
By "remembering where you left off" you can make every page equally fast. See this for further details: http://mysql.rjweb.org/doc.php/pagination

MySQL query with 2 joins, large keylen leads to 'Copying to tmp table on disk' process hanging forever

I'm sure I must be doing something stupid, but as is often the case I can't figure out what it is.
I'm trying to run this query:
SELECT `f`.`FrenchWord`, `f`.`Pronunciation`, `e`.`EnglishWord`
FROM (`FrenchWords` f)
INNER JOIN `FrenchEnglishMappings` m ON `m`.`FrenchForeignKey`=`f`.`id`
INNER JOIN `EnglishWords` e ON `e`.`id`=`m`.`EnglishForeignKey`
WHERE `f`.`Pronunciation` = '[whatever]';
When I run it, what happens seems quite weird to me. I get the results of the query fine, 2 rows in about 0.002 seconds.
However, I also get a huge spike in CPU and SHOW PROCESSLIST shows two identical processes for that query with state 'Copying to tmp table on disk'. These seem to keep running endlessly until I kill them or the system freezes.
None of the tables involved is big - between 100k and 600k rows each. tmp_table_size and max_heap_table_size are both 16777216.
Edit: EXPLAIN on the statement gives (after a further edit that reduced the key_len of Pronunciation to 112):
+----+-------------+-------+--------+-------------------------------------------------------------+-----------------+---------+----------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------------------------------------------------+-----------------+---------+----------------------------+------+----------------------------------------------+
| 1 | SIMPLE | f | ref | PRIMARY,Pronunciation | Pronunciation | 112 | const | 2 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | m | ref | tmpindex,CombinedIndex,FrenchForeignKey,EnglishForeignKey | tmpindex | 4 | dict.f.id | 1 | Using index |
| 1 | SIMPLE | e | eq_ref | PRIMARY,id | PRIMARY | 4 | dict.m.EnglishForeignKey | 1 | |
+----+-------------+-------+--------+-------------------------------------------------------------+-----------------+---------+----------------------------+------+----------------------------------------------+
I'd be grateful if someone could point out what might be causing this. What I really don't understand is what MySQL is doing - surely if the query is complete then it doesn't need to do anything else?
UPDATE
Thanks for all the responses. I learnt something from all of them. This query was made massively faster after following the advice of nrathaus. I added a PronunciationHash binary(16) column to FrenchWords that contains unhex( md5 ( Pronunciation ) ). That is indexed with a keylen of 16 (vs 600+ for the varchar index on Pronunciation), and queries are much faster now.
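For completeness, a sketch of the change described in this update (the index name is hypothetical; everything else follows the description above):

-- A 16-byte hash of the long Pronunciation string, indexed in place of the VARCHAR.
ALTER TABLE FrenchWords
ADD COLUMN PronunciationHash BINARY(16),
ADD INDEX PronunciationHash_idx (PronunciationHash);

UPDATE FrenchWords SET PronunciationHash = UNHEX(MD5(Pronunciation));

-- Queries then hash the search value the same way.
SELECT f.FrenchWord, f.Pronunciation, e.EnglishWord
FROM FrenchWords f
INNER JOIN FrenchEnglishMappings m ON m.FrenchForeignKey = f.id
INNER JOIN EnglishWords e ON e.id = m.EnglishForeignKey
WHERE f.PronunciationHash = UNHEX(MD5('[whatever]'));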
As shown by the EXPLAIN, your key size is HUGE: 602. This forces MySQL to write the intermediate data to disk.
You need to greatly reduce the key length; I believe the recommendation is to keep it below 128.
I suggest you create a column called MD5_FrenchWord which will contain the MD5 value of FrenchWord, and then use this column for the GROUP BY. This assumes that your GROUP BY is looking for similarities rather than the exact value.
You are misusing GROUP BY. This clause is entirely pointless unless you also have a summary function such as MAX(something) or COUNT(*) in your SELECT clause.
Try removing GROUP BY and see if it helps.
It's not clear what you're trying to do with GROUP BY. But you might try SELECT DISTINCT if you're trying to dedup your result set.
Looking further at this question, it seems like you might benefit from a couple of compound indexes.
First, can you make sure your table declarations have NOT NULL in as many columns as possible?
Second, you're retrieving Pronunciation, FrenchWord, and id from your FrenchWords table, so try this compound index on that table. Your query will then be able to get what it needs directly from the index, saving a bunch of disk I/O. Notice that Pronunciation is mentioned first in the compound index declaration, because that's the value you're searching for: this allows MySQL to do a lookup on the index and get the other information it needs directly from it, without thrashing back to the table itself.
(Pronunciation, FrenchWord, id)
You're retrieving EnglishWord from EnglishWords, looking it up by id. So the same reasoning applies to this compound index.
(id, EnglishWord)
Finally, I can't tell what your ORDER BY is for, once you use SELECT DISTINCT. You might try getting rid of it. But it probably makes no difference.
Give this a try. If your MySQL server is still thrashing after you make these changes, you have some kind of configuration problem.
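Expressed as DDL, those two compound indexes might look like this (the index names are hypothetical, and Pronunciation may need a length prefix, since indexing the full column is what produced the 600+ byte key in the first place):

CREATE INDEX Pronunciation_covering ON FrenchWords (Pronunciation, FrenchWord, id);
CREATE INDEX EnglishWord_covering ON EnglishWords (id, EnglishWord);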

How can I speed up this SQL query on MySQL 4.1?

I have a SQL query that takes a very long time to run on MySQL (it takes several minutes). The query is run against a table that has over 100 million rows, so I'm not surprised it's slow. In theory, though, it should be possible to speed it up as I really only want to get back the rows from the large table (let's call it A) that have a reference in another table, B.
So my query is:
SELECT id FROM A, B where A.ref = B.ref;
(A has over 100 million rows; B has just a few thousand).
I've added INDEXes:
alter table A add index(ref);
alter table B add index(ref);
But it's still very slow (several minutes -- I'd be happy with one minute).
Unfortunately, I'm stuck with MySQL 4.1.22, so I can't use views.
I'd rather not copy all of the relevant rows from A into a separate, smaller table, as the rows that I need will change from time to time. On the other hand, at the moment that's the only solution I can think of.
Any suggestions welcome!
EDIT: Here's the output of running EXPLAIN on my query:
+----+-------------+-------+------+---------------+-------+---------+----------------+-------+-------------+
| id | select_type | table | type | possible_keys | key   | key_len | ref            | rows  | Extra       |
+----+-------------+-------+------+---------------+-------+---------+----------------+-------+-------------+
|  1 | SIMPLE      | B     | ALL  | B_ref,ref     | NULL  | NULL    | NULL           | 16718 | Using where |
|  1 | SIMPLE      | A     | ref  | A_REF,ref     | A_ref | 4       | DATABASE.B.ref |  5655 |             |
+----+-------------+-------+------+---------------+-------+---------+----------------+-------+-------------+
(In redacting my original query example, I chose to use "ref" as my column name, which happens to be the same as one of the types, but hopefully that's not too confusing...)
The query optimizer is probably already doing the best that it can, but in the unlikely event that it's reading the giant table (A) first, you can explicitly tell it to read B first using the STRAIGHT_JOIN syntax:
SELECT STRAIGHT_JOIN id FROM B, A where B.ref = A.ref;
From the answers, it seems like you're doing the most efficient thing you can with the SQL. The A table seems to be the big problem; how about splitting it into three individual tables, kind of like a local version of sharding? Alternatively, is it worth denormalising the B table into the A table, assuming B doesn't have too many columns?
Finally, you could just have to buy a faster box to run it on - there's no substitute for horsepower!
Good luck.
SELECT id FROM A JOIN B ON A.ref = B.ref
You may be able to optimize further by using an appropriate type of join e.g. LEFT JOIN
http://en.wikipedia.org/wiki/Join_(SQL)