Ideas to improve this query indexes - mysql

I feel like the following query is too slow:
(1679.1ms)
SELECT `media_files` . *
FROM `media_files`
INNER JOIN `playlist_media_files` ON `media_files`.`id` = `playlist_media_files`.`media_file_id`
WHERE `media_files`.`type`
IN (
'AudioFile'
)
AND `playlist_media_files`.`playlist_id` =7
ORDER BY media_files.artist ASC , media_files.release_year ASC , media_files.album ASC , media_files.disc_number ASC , media_files.position ASC
EXPLAIN:
+----+-------------+----------------------+--------+---------------------------------------------------------------------------------------+-------------------------------------------+---------+---------------------------------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+--------+---------------------------------------------------------------------------------------+-------------------------------------------+---------+---------------------------------------------------------+------+---------------------------------+
| 1 | SIMPLE | playlist_media_files | ref | index_playlist_media_files_on_playlist_id,index_playlist_media_files_on_media_file_id | index_playlist_media_files_on_playlist_id | 4 | const | 3782 | Using temporary; Using filesort |
| 1 | SIMPLE | media_files | eq_ref | PRIMARY,index_media_files_on_type | PRIMARY | 4 | mydb.playlist_media_files.media_file_id | 1 | Using where |
+----+-------------+----------------------+--------+---------------------------------------------------------------------------------------+-------------------------------------------+---------+---------------------------------------------------------+------+---------------------------------+
Every column is indexed.
Any MySQL expert can tell how can it be improved by looking at the explain?
The multiple ORDER BY is killing the performance.
Edit: removed private URLs from comments
Update: it seems I can do something like concat(..fields..) AS sort for a late ORDER BY sort.

For this query you would probably benefit from having a composite index on (playlist_id,media_file_id) in the playlist_media_files table, which would let mysql to use only this index to know which row to read from media_files without having to read actual data from playlist_media_files to see what is the value of media_file_id for every row that satisfies playlist_id = 7 condition (a lot of them do).
You should see additional using index for the first row of explain output.
Mysql would still have to create temporary table to sort it by so many columns, but sorting 4k rows in memory is not so expensive.
So basically try:
ALTER TABLE `playlist_media_files`
ADD INDEX `playlist_media_composite` ( `playlist_id` , `media_file_id` ) ;
and see the results.
Edit: I tried to simulate the same problem on my test db, creating the same tables and filling them with random 400k rows using php, trying to get similar index cardinality.
Without composite index the same query has following execution plan:
1 SIMPLE playlist_media_files ref playlist_id,media_file_id playlist_id 4 const 3925 Using temporary; Using filesort
1 SIMPLE media_files eq_ref PRIMARY PRIMARY 4 test.playlist_media_files.media_file_id 1 Using where
And the average result is about:
Showing rows 0 - 29 ( 2,702 total, Query took 0.0359 sec)
After adding the composite index and doing ANALYZE TABLE playlist_media_files explain shows:
1 SIMPLE playlist_media_files ref playlist_id,media_file_id,playlist_media_composite playlist_media_composite 4 const 3925 Using index; Using temporary; Using filesort
1 SIMPLE media_files eq_ref PRIMARY PRIMARY 4 test.playlist_media_files.media_file_id 1 Using where
And the average result:
Showing rows 0 - 29 ( 2,702 total, Query took 0.0176 sec)
However in both cases sorting was done in memory (and still creating tmp table and sorting takes 80% of the time here) and looking at your profiling screenshot most of the time is lost on copying temporary table to disc. Thats where the difference comes from. My tables have only columns required for this query, and probably my random strings weren't as long as yours, while you have a lot more columns and you are selecting all of them, sorting only on few. So your temporary table doesn't fit in the memory and obviously doing things on disc has to be a lot slower.
So your main focus here should be either on increasing buffer sizes to accomodate your big select or limiting number of columns to select that maybe you don't need that much.

What's the meaning of the orderbys? Using it in more than one variable just doesn't make sense in this case.
Why not just order by one thing?
You might be having problems with database design normalization, are you familiar with that?

Related

Optimize query?

My query took 28.39 seconds to run. How can I optimize it?
explain SELECT distinct UNIX_TIMESTAMP(timestamp)*1000 as timestamp,count(a.sig_name) as counter from event a,network n where n.fsi='pays' and n.net=inet_ntoa(a.ip_src) group by date(timestamp) order by timestamp asc;
+----+-------------+-------+--------+---------------+---------+---------+--- ---+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+------+---------+---------------------------------+
| 1 | SIMPLE | a | ALL | NULL | NULL | NULL | NULL | 8177074 | Using temporary; Using filesort |
| 1 | SIMPLE | n | eq_ref | PRIMARY,fsi | PRIMARY | 77 | func | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+------+---------+---------------------------------+
So generally looking at your query, we find that table event a is examining 8,177,074 rows. That is likely the "root" of the slowness, so we want to look at how to reduce the search space using indexes.
The main condition on event a is
n.net=inet_ntoa(a.ip_src)
The problem here is that we need to perform a calculation (inet_ntoa) on every row of a.ip_src, so there is no alternative but to scan the entire table. A potentially better solution would be to invert the comparison and ensure that a.ip_src is indexed.
a.ip_src=inet_aton(n.net)
This will only be better if we are matching less rows in n than we are in a. If that is not the case, you should seriously consider caching the result of this function in the table and creating an index on that.
Lastly I am guessing the timestamp column is in event a, in which case an index will potentially help with ordering and grouping though may not. You could try a multi_column index on (ip_src,timestamp)
Make it a practice to introduce at-least index on columns which can be used in WHERE/JOIN clauses. I've used the at-least because in many cases one should try to use PRIMARY/FOREIGN KEY relations. So if something is already a primary/foriegn key there is no need to index it further.
The above query can be simply improved by introducing the INDEX through the following query:
ALTER TABLE events ADD INDEX idx_ev_ipsrc (ip_src);
Here idx_ev_ipsrc = Name of the index key, and ip_src is the column to be indexed.
Even further enhancement:
Introduce multi-colum index on network table using following query:
ALTER TABLE network ADD INDEX idx_net_fsi_net (fsi,net);
The above will result in even low number of rows.
Note: The above queries are for MySql and can be tailored for other DBs easily.

MySql not picking correct index for few queries

I'm running follwing query on the table, I'm changing values in the where condition, while running in one case it's taking one index and another case taking it's another(wrong??) index.
row count for query 1 is 402954 it's taking approx 1.5 sec
row count for query 2 is 52097 it's taking approx 35 sec
Both queries query 1 and query 2 are same , only I'm changing values in the where condition
query 1
EXPLAIN SELECT
log_type,count(DISTINCT subscriber_id) AS distinct_count,
count(subscriber_id) as total_count
FROM campaign_logs
WHERE
domain = 'xxx' AND
campaign_id='123' AND
log_type IN ('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED') AND
log_time BETWEEN
CONVERT_TZ('2015-02-12 00:00:00','+05:30','+00:00') AND
CONVERT_TZ('2015-02-19 23:59:58','+05:30','+00:00')
GROUP BY log_type;
EXPLAIN of above query
+----+-------------+---------------+-------+------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------+------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------+--------+-------------+
| 1 | SIMPLE | campaign_logs | range | campaign_id_index,domain_index,log_type_index,log_time_index,campaignid_domain_logtype_logtime_index | campaignid_domain_logtype_logtime_index | 468 | NULL | 402954 | Using where |
+----+-------------+---------------+-------+------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------+--------+-------------+
query 2
EXPLAIN SELECT
log_type,count(DISTINCT subscriber_id) AS distinct_count,
count(subscriber_id) as total_count
FROM stats.campaign_logs
WHERE
domain = 'yyy' AND
campaign_id='345' AND
log_type IN ('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED') AND
log_time BETWEEN
CONVERT_TZ('2014-02-05 00:00:00','+05:30','+00:00') AND
CONVERT_TZ('2015-02-19 23:59:58','+05:30','+00:00')
GROUP BY log_type;
explain of above query
+----+-------------+---------------+-------------+------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+-------+------------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------------+------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+-------+------------------------------------------------------------------------------+
| 1 | SIMPLE | campaign_logs | index_merge | campaign_id_index,domain_index,log_type_index,log_time_index,campaignid_domain_logtype_logtime_index | campaign_id_index,domain_index | 153,153 | NULL | 52097 | Using intersect(campaign_id_index,domain_index); Using where; Using filesort |
+----+-------------+---------------+-------------+------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+-------+------------------------------------------------------------------------------+
Query 1 is using correct index because I have composite index
Query 2 is using index merge , it's taking long time to execute
Why MySql using different indexes for same query
I know we can mention USE INDEX in the query , but why MySql is not picking correct index in this case??. am I doing anything wrong??
No, you're not doing anything wrong.
As Chipmonkey stated in comments, sometimes MySQL will choose the wrong execution plan because of outdated table statistics. You can update the table statistics by performing ANALYZE TABLE.
Still, MySQL optimizer isn't that sophisticated. It sees that in both cases, MySQL will have to visit both the secondary index and then perform a lookup to the clustered index to get the actual table data, so when it saw that perhaps the second query had better selectivity by using the two separate indexes and merging them, you can't blame it too much just because it guessed wrong.
I'm guessing that if you had a covering index so that MySQL could perform the entire query with just the index, it will favor that index over performing a merge.
Try adding subscriber_id to the end of your multi-column index to get a covering index.
Otherwise, use USE INDEX or FORCE INDEX, because that's what they're there for. You know more about the data than MySQL does.
I suggest you try this:
Add this permutation of your compound index.
(campaign_id,domain,log_time,log_type,subscriber_id)
Change your query to remove the WHERE log_type IN() criterion, thus allowing the aggregate function to use all the records it finds in the range scan on log_time. Including subscriber_id in the index should allow the whole query to be satisfied directly from the index. That is, this is a covering index.
Finally, you can filter on your log_type values by wrapping the whole query in
SELECT *
FROM (/*the whole query*/) x
WHERE log_type IN
('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED')
ORDER BY log_type
This should give you better, and more predictable, performance.
(Unless the log_types you want are a tiny subset of the records, in which case please ignore this suggestion.)

How do I remove temporary and filesort from my SQL query?

I have been trying to create an index in MySQL, but keep getting temporary and filesort whenever I run an explain on my query.
A simplified version of my tables looks like:
ordered_products
op_id INT UNSIGNED NOT NULL AUTO_INCREMENT
op_orderid INT UNSIGNED NOT NULL
op_orderdate TIMESTAMP NOT NULL
op_productid INT UNSIGNED NOT NULL
products
p_id INT UNSIGNED NOT NULL AUTO_INCREMENT
p_productname VARCHAR(128) NOT NULL
p_enabled TINYINT NOT NULL
The 'ordered_products' table currently has more than 1,000,000 rows and is a record of all products that have been ordered, as well as the orders that they belong to. This table grows rapidly.
The 'products' table currently has around 3,000 rows and contains a list of products that are for sale.
The site displays a list of the top products for a given period (normally the last 3 days) and my query looks like:
SELECT COUNT(op.op_productid) AS ProductCount, op.op_productid
FROM ordered_products op
LEFT JOIN products p ON op.op_productid=p.p_id
WHERE op.op_orderdate>='2014-03-08 00:00:00'
AND p.p_enabled=1
GROUP BY op.op_productid
ORDER BY ProductCount DESC, p.p_productname ASC
When I run that query, it normally takes around 800 milliseconds (0.8 seconds) to execute, which is ridiculous. We've remedied this with caching, however whenever the cache expires, we have a slowdown. I need to fix this.
I have tried to index the tables, but no matter what I try, I can't avoid temporary and filesort. The output from EXPLAIN is:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE p index PRIMARY,idx_enabled_id_name idx_enabled_id_name 782 \N 1477 Using where; Using index; Using temporary; Using filesort
1 SIMPLE op ref idx_pid_oid_date idx_pid_oid_date 4 test_store.p.p_id 9 Using where; Using index
If I remove the GROUP BY, the filesort disappears, however I need it to ensure the ProductCount value shows me every product count rather than a total sum of all products.
If I remove the GROUP BY and the ORDER BY ProductCount, both temporary and filesort disappear, but now I am left with a very bad result set.
Can anyone please help me solve this? I have tried a multitude of different indexes, and have tried rewriting the SQL numerous times, but can never succeed.
Any help would be greatly appreciated.
You can't get rid of the temp table and filesort while you are using ORDER BY on a calculated column ProductCount. There's no index for the calculated column, so it has to do do the sorting at the time of the query.
I tried experimentally to reproduce your results. I can put an index on op_productid and then the optimizer might use it to perform the GROUP BY.
mysql> EXPLAIN SELECT COUNT(op.op_productid) AS ProductCount, op.op_productid
FROM ordered_products op FORCE INDEX (op_productid) STRAIGHT_JOIN products p
ON op.op_productid=p.p_id
WHERE op.op_orderdate>='2014-03-08 00:00:00' AND p.p_enabled=1
GROUP BY op.op_productid ORDER BY null;
In my case, I had to use STRAIGHT_JOIN and FORCE INDEX to override the optimizer. But that might be due to my test environment, where I have only 1 or 2 rows per table for testing, and it throws off the optimizer's choices. In your real data, it might make a more sensible choice.
Also, don't use LEFT JOIN if you have conditions in the WHERE clause that make the join implicitly an inner join. Learn the types of joins and how they work -- don't always use LEFT JOIN by default.
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
| 1 | SIMPLE | op | index | op_productid | op_productid | 4 | NULL | 5 | Using where |
| 1 | SIMPLE | p | ALL | PRIMARY | NULL | NULL | NULL | 1 | Using where |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
Your only alternative is to store a denormalized table, where the counts are persisted. Then if your cache fails, it isn't an expensive query to refresh the cache.

MySQL query with 2 joins, large keylen leads to 'Copying to tmp table on disk' process hanging forever

I'm sure I must be doing something stupid, but as is often the case I can't figure out what it is.
I'm trying to run this query:
SELECT `f`.`FrenchWord`, `f`.`Pronunciation`, `e`.`EnglishWord`
FROM (`FrenchWords` f)
INNER JOIN `FrenchEnglishMappings` m ON `m`.`FrenchForeignKey`=`f`.`id`
INNER JOIN `EnglishWords` e ON `e`.`id`=`m`.`EnglishForeignKey`
WHERE `f`.`Pronunciation` = '[whatever]';
When I run it, what happens seems quite weird to me. I get the results of the query fine, 2 rows in about 0.002 seconds.
However, I also get a huge spike in CPU and SHOW PROCESSLIST shows two identical processes for that query with state 'Copying to tmp table on disk'. These seem to keep running endlessly until I kill them or the system freezes.
None of the tables involved is big - between 100k and 600k rows each. tmp_table_size and max_heap_table_size are both 16777216.
Edit: EXPLAIN on the statement gives:
+edit reduced keylen of Pronunciation to 112
+----+-------------+-------+--------+-------------------------------------------------------------+-----------------+---------+----------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------------------------------------------------+-----------------+---------+----------------------------+------+----------------------------------------------+
| 1 | SIMPLE | f | ref | PRIMARY,Pronunciation | Pronunciation | 112 | const | 2 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | m | ref | tmpindex,CombinedIndex,FrenchForeignKey,EnglishForeignKey | tmpindex | 4 | dict.f.id | 1 | Using index |
| 1 | SIMPLE | e | eq_ref | PRIMARY,id | PRIMARY | 4 | dict.m.EnglishForeignKey | 1 | |
+----+-------------+-------+--------+-------------------------------------------------------------+-----------------+---------+----------------------------+------+----------------------------------------------+
I'd be grateful if someone could point out what might be causing this. What I really don't understand is what MySQL is doing - surely if the query is complete then it doesn't need to do anything else?
UPDATE
Thanks for all the responses. I learnt something from all of them. This query was made massively faster after following the advice of nrathaus. I added a PronunciationHash binary(16) column to FrenchWords that contains unhex( md5 ( Pronunciation ) ). That is indexed with a keylen of 16 (vs 600+ for the varchar index on Pronunciation), and queries are much faster now.
As said by the EXPLAIN, you key size is HUGE : 602, this requires MySQL to write down the data.
You need to reduce (greatly) the keylen, I believe recommended is below 128.
I suggest you create a column called MD5_FrenchWord which will contain the MD5 value of FrenchWord. Then use this column for the GROUP BY. This assumes that you are looking for similarities, when you group by rather than the actual value
You are misusing GROUP BY. This clause is entirely pointless unless you also have a summary function such as MAX(something) or COUNT(*) in your SELECT clause.
Try removing GROUP BY and see if it helps.
It's not clear what you're trying to do with GROUP BY. But you might try SELECT DISTINCT if you're trying to dedup your result set.
Looking further at this question, it seems like you might benefit from a couple of compound indexes.
First, can you make sure your table declarations have NOT NULL in as many columns as possible?
Second, you're retrieving Pronunciation, FrenchWord, and id from your Frenchwords table, so try this compound index on that table. Your query will then be able to get what it needs directly from the index, saving a bunch of disk io. Notice that Pronunciation is mentioned first in the compound index declaration because that's the value you're searching for. This allows MySQL to do a lookup on the index, and get the other information it needs directly from the index, without thrashing back to the table itself.
(Pronunciation, FrenchWord, id)
You're retrieving Englishword from Englishwords looking it up by id. So, the same reasoning can apply to this compound index.
(id, Englishword)
Finally, I can't tell what your ORDER BY is for, once you use SELECT DISTINCT. You might try getting rid of it. But it probably makes no difference.
Give this a try. If your MySQL server is still thrashing after you make these changes, you have some kind of configuration problem.

How to improve search speeds in this situation?

I have a search implemented on my site, it runs the following queries:
SELECT COUNT(mov_id) AS total_things
FROM content
WHERE con_status = 1 AND con_incomplete = 0 AND con_type = 1
AND ((con_title) LIKE ('%search keyword%')
OR soundex(con_title) LIKE soundex('search keyword')
OR MATCH (con_title) AGAINST ('search keyword'));
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-------------+
| 1 | SIMPLE | movies | ref | con_type | con_type | 12 | const,const,const | 11804 | Using where |
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-------------+
64058 Queries
Total time: 200817, Average time: 3.13492459958163
Taking 2 to 25 seconds to complete
Rows analyzed 1882 - 12104
SELECT
con_id,
con_title,
con_desc,
MATCH (con_title) AGAINST ('search keyword') AS relevancy
FROM content
WHERE con_status = 1 AND con_incomplete = 0 AND con_type = 1
AND ((con_title) LIKE ('%search keyword%')
OR soundex(con_title) LIKE soundex('search keyword')
OR MATCH (con_title) AGAINST ('search keyword'))
ORDER BY relevancy DESC
LIMIT 0, 24;
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-----------------------------+
| 1 | SIMPLE | movies | ref | con_type | con_type | 12 | const,const,const | 11803 | Using where; Using filesort |
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-----------------------------+
78321 Queries
Total time: 200657, Average time: 2.56198209930925
Taking 2 to 16 seconds to complete
Rows analyzed 0 - 15752
This basically works like a ghetto "fuzzy search" to ignore typos people might make.
Unfortunately, its very slow (even if I remove soundex() or FULLTEXT searching. How to improve search speeds in this situation?
The part of the WHERE clause that hurts is the first % after LIKE. To speed it up, you could normalize the keywords, moving them to a separate table:
table moviekeywords: movieid, keyword
table movies: movieid, ...
This allows you to search through the moviekeywords table using an = condition, or at least like 'humphrey%'. Both variants can be made expremely fast with an index.
As long as you keep using soundex and LIKE(%nnn%) you will be running a full scan of all of an intermediate result. To illustrate this: If you omitted your other predicates (on con_status, con_incomplete AND con_type columns) you would always be running a full table scan.
I suggest dropping or scaling back your fuzzy predicates. For example, just running LIKE('nnn%') will be MUCH faster than %nnn% (if that column is indexed) but of course your search results will not be as fuzzy. Perhaps make soundex an advanced search option that does not always run.
If you can't compromise on any of those issues then at least make sure that your con_status, con_incomplete AND con_type columns are all indexed.
Think about Andomar's solution again - most keyword searches allow you to specify multiple keywords. You can't do that with your current query. And there's no problem with "The Terminator" - for that, you'd just add one keyword, "Terminator".
And with an index on the keyword column, it will be fast.
I made my "fuzzy search" a fallback option if COUNT on the original stricter query returns no results. My results have been pretty fast so far using
SOUNDS LIKE ('blah')
So it looks like you only have around 15,000 rows. If you don't expect your table to grow past a hundred thousand entries or so, maybe you should just keep all the titles in memory and avoid hitting the database until you know which entries you want.
That is, at startup and at periodic intervals, just query all the titles out of the database, split each one into words, and keep a mapping of words to row keys. This should take less than 1MB of memory, accessing it should be quite fast, and most importantly you can add whatever fuzzy matching or heuristic scoring mechanisms you like (without modifying your schema).
Just a thought.