JOINs being done in weird order; messing up ORDER BY? - mysql

Let's say I have three tables - customers, servers and payments. Each customer can have multiple servers and each server can have multiple payments. Let's also say I want to find the most recent payments and get info about the servers / customers those payments are attached to. Here's a query that could do this:
SELECT *
FROM payments p
JOIN customers c ON p.custID = c.custID
JOIN servers s ON s.serverID = p.serverID
WHERE c.hold = 0
AND c.archive = 0
ORDER BY p.paymentID DESC
LIMIT 10;
The problem is that when I run EXPLAIN on this query I get this:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE c ref PRIMARY,hold_archive hold_archive 3 const,const 28728 Using where; Using index; Using temporary; Using filesort
1 SIMPLE p ref custID custID 5 customers.custID 3 Using where
1 SIMPLE s eq_ref PRIMARY PRIMARY 4 payments.serverID 1 Using index
The problem is that the query takes a while to run. If I remove the ORDER BY it becomes 10x as fast. But I need the ORDER BY. Here's the EXPLAIN when I remove the ORDER BY:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE c ref PRIMARY,hold_archive hold_archive 3 const,const 28728 Using where; Using index
1 SIMPLE p ref custID custID 5 customers.custID 3 Using where
1 SIMPLE s eq_ref PRIMARY PRIMARY 4 payments.serverID 1 Using index
So the big difference here is that "Using temporary" and "Using filesort" are missing from the Extra column.
It seems like the reason, in this case, is that the column I'm doing the ORDER BY on doesn't belong to the first table listed in the EXPLAIN.
Another observation. If I remove one of the WHERE conditions (whilst keeping the ORDER BY) it speeds up similarly, but I need both conditions. Here's an example EXPLAIN of that:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE p index custID,serverID PRIMARY 4 NULL 10 Using where
1 SIMPLE c eq_ref PRIMARY,hold_archive PRIMARY 4 payments.custID 1 Using where
1 SIMPLE s eq_ref PRIMARY PRIMARY 4 payments.serverID 1 Using index
Here the ORDER BY column /is/ in the first table listed in the EXPLAIN. But why is MySQL rearranging the order in which the tables are JOINed, and how can I stop it from doing that? You can force indexes in MySQL, but it doesn't seem like that would help.
Any ideas?

10x faster -- It can find "any 10 rows" a lot faster than "find all possible rows, sort them, then deliver 10".
Having WHERE and ORDER BY hit different columns is hard to optimize.
What percentage of customers have hold=0 and archive=0? It sounds like a small percentage? How many rows are in each table?
Does anything else need INDEX(hold, archive)? If not, get rid of it. It seems to be only causing trouble here.
If hold=0 and archive=0 is common, then you would prefer the execution to go like your 3rd EXPLAIN -- that is, scan payments in descending order. With most of them matching the WHERE, it will usually need to read little more than 10 rows before finding 10 matching rows.
Another solution (other than getting rid of the index) is to change JOIN to STRAIGHT_JOIN in the query. This tells the Optimizer that you know better, and payments should be scanned first, customers second. That works well if my previous paragraph applies.
But the query will screw up (by being slow) if, say, you look for archive=1.
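As a rough sketch of the STRAIGHT_JOIN suggestion, the original query could be rewritten like this. The table and column names are taken from the question. Whether it helps depends on most customers having hold = 0 and archive = 0, which is an assumption here, so compare the EXPLAIN output before and after.

```sql
-- STRAIGHT_JOIN makes MySQL read the tables in the order written:
-- payments first (its PRIMARY KEY can be walked in descending order,
-- as in the 3rd EXPLAIN), then customers, then servers.
SELECT *
FROM payments p
STRAIGHT_JOIN customers c ON p.custID = c.custID
STRAIGHT_JOIN servers s ON s.serverID = p.serverID
WHERE c.hold = 0
  AND c.archive = 0
ORDER BY p.paymentID DESC
LIMIT 10;
```

If the plan comes out as hoped, payments should be the first row of the EXPLAIN and "Using temporary; Using filesort" should no longer appear.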

Related

Query speed drops on two "=" comparisons in WHERE clause

I have a music database with a table for releases and the release titles. This "releases_view" gets the title/title_id and the alternative title/alternative title_id of a track. This is the code of the view:
SELECT
t1.`title` AS title,
t1.`id` AS title_id,
t2.`title` AS title_alt,
t2.`id` AS title_alt_id
FROM
releases
LEFT JOIN titles t1 ON t1.`id`=`releases`.`title_id`
LEFT JOIN titles t2 ON t2.`id`=`releases`.`title_alt_id`
The title_id and title_alt_id fields in the joined tables are both int(11), title and title_alt are varchars.
The issue
This query will take less than 1 ms:
SELECT * FROM `releases_view` WHERE title_id=12345
This query will take less than 1 ms, too:
SELECT * FROM `releases_view` WHERE title_id=12345 OR title_alt_id!=54321
BUT: This query will take 0.2 s. It's 200 times slower!
SELECT * FROM `releases_view` WHERE title_id=20956 OR title_alt_id=38849
As soon as I have two comparisons using "=" in the WHERE clause, things get really slow (although all queries return only a couple of results).
Can you help me to understand what is going on?
EDIT
EXPLAIN shows "Using where" for title_alt_id, but I do not understand why. How can I avoid this?
EDIT
Here is the EXPLAIN dump.
id select_type table partitions type possible_keys key key_len ref rows Extra
1 SIMPLE releases NULL ALL NULL NULL NULL NULL 76802 Using temporary; Using filesort
1 SIMPLE t1 NULL eq_ref PRIMARY PRIMARY 4 db.releases.title_id 1
1 SIMPLE t2 NULL eq_ref PRIMARY PRIMARY 4 db.releases.title_alt_id 1 Using where
The query is "really slow" because the Optimizer does not handle OR well.
Plan A (of the Optimizer): Scan the entire table, evaluating the entire OR.
Plan B: "Index Merge Union" could be used for title_id = 20956 OR title_alt_id = 38849 if you have separate indexes on title_id and title_alt_id: use each index to get two lists of PRIMARY KEYs and "merge" the lists, then reach into the table to get *. Multiple steps, not cheap. So Plan B is rarely used.
title_id = 12345 OR title_alt_id != 54321 is a mystery, since it should return most of the table. Please provide EXPLAIN SELECT....
LEFT JOIN (as opposed to JOIN) needs to assume that the row may be missing in the 'right' table.
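A common workaround for a slow OR (this is a different technique from the Index Merge described above, not something the answer proposes) is to split the query into two single-condition queries and combine them with UNION. This is only a sketch; it assumes each branch on its own is as fast as the single-condition query in the question.

```sql
-- Each branch has a single "=" condition, so it can be resolved
-- the same way as the fast single-condition query.
SELECT * FROM `releases_view` WHERE title_id = 20956
UNION
SELECT * FROM `releases_view` WHERE title_alt_id = 38849;
```

Note that UNION removes duplicate rows, so a release matching both conditions is returned once. It also collapses rows that are identical in all four columns, which the OR version would return separately.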

Optimization of a Virtuemart Attribute Query

I have the select query below, which selects all the products matching a certain attribute from a Virtuemart table. The attribute table is rather large (almost 6000 rows). Is there any way to optimize the query, or is there any other approach that might help? I have already tried adding indexes to one and even two tables.
SELECT DISTINCT `jos_vm_product`.`product_id`,
`jos_vm_product_attribute`.`attribute_name`,
`jos_vm_product_attribute`.`attribute_value`,
`jos_vm_product_attribute`.`product_id`
FROM (`jos_vm_product`)
RIGHT JOIN `jos_vm_product_attribute`
ON `jos_vm_product`.`product_id` = `jos_vm_product_attribute`.`product_id`
WHERE ((`jos_vm_product_attribute`.`attribute_name` = 'Size')
AND ((`jos_vm_product_attribute`.`attribute_value` = '6.5')
OR (`jos_vm_product_attribute`.`attribute_value` = '10')))
GROUP BY `jos_vm_product`.`product_sku`
ORDER BY CONVERT(`jos_vm_product_attribute`.`attribute_value`, SIGNED INTEGER)
LIMIT 0, 24
Here is the results of the EXPLAIN table:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE jos_vm_product_attribute range idx_product_attribute_name,attribute_value,attribute_name attribute_value 765 NULL 333 Using where; Using temporary; Using filesort
1 SIMPLE jos_vm_product eq_ref PRIMARY PRIMARY 4 shoemark_com_shop.jos_vm_product_attribute.product_id
Any help would be greatly appreciated. Thanks.
Replacing the jos_vm_product_attribute.attribute_name index with a composite index on jos_vm_product_attribute.attribute_name and jos_vm_product_attribute.attribute_value (in that order) should help this query. Currently, it's only using an index in the WHERE condition for jos_vm_product_attribute.attribute_value, but this new index will be usable for both parts of the WHERE condition.
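A minimal sketch of that composite index follows. The index name idx_attr_name_value is made up for illustration, and the statement only adds the new index; which existing index to drop depends on the actual definitions, which are not shown in the question.

```sql
-- Composite index: attribute_name first (equality), attribute_value second
-- (the two OR'ed values), so both parts of the WHERE can use it.
-- If the storage engine rejects the key as too long, prefix lengths on the
-- columns may be needed (an assumption; the column types are not shown).
ALTER TABLE `jos_vm_product_attribute`
  ADD INDEX `idx_attr_name_value` (`attribute_name`, `attribute_value`);
```

After adding it, re-run the EXPLAIN and check whether the key column shows the new index.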

MySQL - joining tables on same table multiple times with different conditions takes forever

Simplifying, I have four tables.
ref_TagGroup (top-level descriptive containers for various tags)
ref_Tag (tags with name and unique tagIDs)
ref_Product
ref_TagMap (TagID,Container,ContainerType)
A fifth table, ref_ProductFamily exists but is not directly part of this query.
I use the ref_TagMap table to map tags to products, but also to map Tags to TagGroups and also to product families. The ContainerType is set to PROD/TAGGROUP/PRODFAM accordingly.
So, I want to return the tag group, tagname and the number of products AND product families that the tag is mapped to...so results like:
GroupName | TagName | TagHitCnt
My question is, why does the first query come back in milliseconds, the second query comes back in milliseconds but the third query (which is just adding an "OR" condition to include both tag to product and tag to family mappings) takes forever (well, over ten minutes anyway...I haven't let it run all night yet.)
QUERY 1:
SELECT ref_taggroup.groupname,ref_tag.tagname,COUNT(DISTINCT IFNULL(ref_product.familyid,ref_product.id + 100000000),ref_product.name) AS 'taghitcnt'
FROM (ref_taggroup,ref_tag,ref_product)
LEFT JOIN ref_tagmap GROUPMAP ON GROUPMAP.containerid=ref_taggroup.groupid
LEFT JOIN ref_tagmap PRODMAP ON PRODMAP.containerid=ref_product.id
WHERE
GROUPMAP.tagid=ref_tag.tagid AND GROUPMAP.containertype='TAGGROUP'
AND
PRODMAP.tagid=ref_tag.tagid AND PRODMAP.containertype='PROD'
GROUP BY tagname
ORDER BY groupname,tagname ;
QUERY 2:
SELECT ref_taggroup.groupname,ref_tag.tagname,COUNT(DISTINCT IFNULL(ref_product.familyid,ref_product.id + 100000000),ref_product.name) AS 'taghitcnt'
FROM (ref_taggroup,ref_tag,ref_product)
LEFT JOIN ref_tagmap GROUPMAP ON GROUPMAP.containerid=ref_taggroup.groupid
LEFT JOIN ref_tagmap PRODFAMMAP ON PRODFAMMAP.containerid=ref_product.familyid
WHERE
GROUPMAP.tagid=ref_tag.tagid AND GROUPMAP.containertype='TAGGROUP'
AND
PRODFAMMAP.tagid=ref_tag.tagid AND PRODFAMMAP.containertype='PRODFAM'
GROUP BY tagname
ORDER BY groupname,tagname ;
QUERY 3:
SELECT ref_taggroup.groupname,ref_tag.tagname,COUNT(DISTINCT IFNULL(ref_product.familyid,ref_product.id + 100000000),ref_product.name) AS 'taghitcnt'
FROM (ref_taggroup,ref_tag,ref_product)
LEFT JOIN ref_tagmap GROUPMAP ON GROUPMAP.containerid=ref_taggroup.groupid
JOIN ref_tagmap PRODMAP ON PRODMAP.containerid=ref_product.id
JOIN ref_tagmap PRODFAMMAP ON PRODFAMMAP.containerid=ref_product.familyid
WHERE
GROUPMAP.tagid=ref_tag.tagid AND GROUPMAP.containertype='TAGGROUP'
AND
((PRODMAP.tagid=ref_tag.tagid AND PRODMAP.containertype='PROD')
OR
(PRODFAMMAP.tagid=ref_tag.tagid AND PRODFAMMAP.containertype='PRODFAM' ))
GROUP BY tagname
ORDER BY groupname,tagname ;
--
To answer a question that may come up, the COUNT Distinct ifnull in the select is designed to return one record for large numbers of products that are grouped into families and one record for each 'standalone' product that isn't in a family as well. This code works well in other queries.
I've tried doing a UNION on the first two queries, and that works and comes back very quickly, but it's not practical for other reasons that I won't go into here.
What is the best way to do this? What am I doing wrong?
Thanks!
ADDING EXPLAIN OUTPUT
QUERY1
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE GROUPMAP ALL 5640 Using where; Using temporary; Using filesort
1 SIMPLE ref_tag ref PRIMARY PRIMARY 4 lsslave01.GROUPMAP.tagid 1 Using index
1 SIMPLE ref_taggroup ref PRIMARY PRIMARY 4 lsslave01.GROUPMAP.containerid 3 Using index
1 SIMPLE PRODMAP ALL 5640 Using where; Using join buffer
1 SIMPLE ref_product eq_ref PRIMARY PRIMARY 4 lsslave01.PRODMAP.containerid 1
QUERY2
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE GROUPMAP ALL 5640 Using where; Using temporary; Using filesort
1 SIMPLE ref_tag ref PRIMARY PRIMARY 4 lsslave01.GROUPMAP.tagid 1 Using index
1 SIMPLE ref_taggroup ref PRIMARY PRIMARY 4 lsslave01.GROUPMAP.containerid 3 Using index
1 SIMPLE PRODFAMMAP ALL 5640 Using where; Using join buffer
1 SIMPLE ref_product ref FixtureType FixtureType 5 lsslave01.PRODFAMMAP.containerid 39 Using where
QUERY3
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE GROUPMAP ALL 5640 Using where; Using temporary; Using filesort
1 SIMPLE ref_tag ref PRIMARY PRIMARY 4 lsslave01.GROUPMAP.tagid 1 Using index
1 SIMPLE ref_taggroup ref PRIMARY PRIMARY 4 lsslave01.GROUPMAP.containerid 3 Using index
1 SIMPLE PRODMAP ALL 5640 Using join buffer
1 SIMPLE PRODFAMMAP ALL 5640 Using where; Using join buffer
1 SIMPLE ref_product eq_ref PRIMARY,FixtureType PRIMARY 4 lsslave01.PRODMAP.containerid 1 Using where
One more update for anyone who is interested:
I finally let the third query above run to completion. It took right around 1000 seconds. Dividing this time by the time it takes each of the queries (1 or 2) to run, we get a number around 6000...which is very close to the size of the ref_tagmap table that we're using in our dev environment (much larger in production). So, it looks like we're running one query against each record in that table...but I still can't see why.
Any help would be much appreciated...and I mean seriously, seriously appreciated.
This is less an "answer" than a couple of observations/suggestions.
First, I'm curious whether you could GROUP BY on an integer ID instead of the tag name? I'd change the ref_TagMap.containertype field to hold tinyint enumerated values representing the three possible values of TAGGROUP, PROD and PRODFAM. An indexed tinyint field should be slightly faster than an index of string values. It probably won't make much difference though because it's the second conditional in the join clause and there isn't that much spread in the indexed values anyway.
Next is the observation/reminder that when the first half of an OR statement evaluates to FALSE often, then you're making MySQL evaluate both halves of the conditional every time. So you want to put the condition most likely to evaluate to TRUE first (aka prior to the OR).
I doubt either of those two issues is your real problem... though the issue in the second paragraph may play a part. It seems like the quickest way to a performant version of query 3 may be to simply populate a temp table with the results from the first two queries and pull from that temp table to get the results you're looking for from the third. Perhaps in doing so you'll discover why that third query is so slow.
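One way the temp table idea might look is sketched below. The names tmp_tag_hits and hitkey are hypothetical, and the joins are written as inner joins because the WHERE conditions in queries 1 and 2 already discard unmatched rows. It stores the un-aggregated rows rather than the two counts, since distinct counts from two queries cannot simply be added together.

```sql
-- Step 1: rows for tags mapped directly to products (as in query 1).
CREATE TEMPORARY TABLE tmp_tag_hits AS
SELECT ref_taggroup.groupname,
       ref_tag.tagname,
       IFNULL(ref_product.familyid, ref_product.id + 100000000) AS hitkey,
       ref_product.name
FROM ref_tagmap GROUPMAP
JOIN ref_taggroup ON ref_taggroup.groupid = GROUPMAP.containerid
JOIN ref_tag ON ref_tag.tagid = GROUPMAP.tagid
JOIN ref_tagmap PRODMAP ON PRODMAP.tagid = ref_tag.tagid
                       AND PRODMAP.containertype = 'PROD'
JOIN ref_product ON ref_product.id = PRODMAP.containerid
WHERE GROUPMAP.containertype = 'TAGGROUP';

-- Step 2: rows for tags mapped to product families (as in query 2).
INSERT INTO tmp_tag_hits
SELECT ref_taggroup.groupname,
       ref_tag.tagname,
       IFNULL(ref_product.familyid, ref_product.id + 100000000) AS hitkey,
       ref_product.name
FROM ref_tagmap GROUPMAP
JOIN ref_taggroup ON ref_taggroup.groupid = GROUPMAP.containerid
JOIN ref_tag ON ref_tag.tagid = GROUPMAP.tagid
JOIN ref_tagmap PRODFAMMAP ON PRODFAMMAP.tagid = ref_tag.tagid
                          AND PRODFAMMAP.containertype = 'PRODFAM'
JOIN ref_product ON ref_product.familyid = PRODFAMMAP.containerid
WHERE GROUPMAP.containertype = 'TAGGROUP';

-- Step 3: count distinct hits over the combined rows.
-- Grouping by groupname as well differs slightly from the original
-- GROUP BY tagname, but is accepted under ONLY_FULL_GROUP_BY.
SELECT groupname, tagname, COUNT(DISTINCT hitkey, name) AS taghitcnt
FROM tmp_tag_hits
GROUP BY groupname, tagname
ORDER BY groupname, tagname;
```

Each step can be timed separately, which may also show where the time goes in the original third query.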

Strange Performance Issues with INNER JOIN vs. LEFT JOIN

I was using a query that looked similar to this one:
SELECT `episodes`.*, IFNULL(SUM(`views_sum`.`clicks`), 0) as `clicks`
FROM `episodes`, `views_sum`
WHERE `views_sum`.`index` = "episode" AND `views_sum`.`key` = `episodes`.`id`
GROUP BY `episodes`.`id`
... which takes ~0.1s to execute. But it's problematic, because some episodes don't have a corresponding views_sum row, so those episodes aren't included in the result.
What I want is NULL values when a corresponding views_sum row doesn't exist, so I tried using a LEFT JOIN instead:
SELECT `episodes`.*, IFNULL(SUM(`views_sum`.`clicks`), 0) as `clicks`
FROM `episodes`
LEFT JOIN `views_sum` ON (`views_sum`.`index` = "episode" AND `views_sum`.`key` = `episodes`.`id`)
GROUP BY `episodes`.`id`
This query produces the same columns, and it also includes the few rows missing from the 1st query.
BUT, the 2nd query takes 10 times as long! A full second.
Why is there such a huge discrepancy between the execution times when the result is so similar? There's nowhere near 10 times as many rows — it's like 60 from the 1st query, and 70 from the 2nd. That's not to mention that the 10 additional rows have no views to sum!
Any light shed would be greatly appreciated!
(There are indexes on episodes.id, views_sum.index, and views_sum.key.)
EDIT:
I copy-pasted the SQL from above, and here are the EXPLAINs, in order:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE views_sum ref index,key index 27 const 6532 Using where; Using temporary; Using filesort
1 SIMPLE episodes eq_ref PRIMARY PRIMARY 4 db102914_itw.views_sum.key 1 Using where
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE episodes ALL NULL NULL NULL NULL 70 Using temporary; Using filesort
1 SIMPLE views_sum ref index,key index 27 const 6532
Here's the query I ultimately came up with, after many, many iterations. (The SQL_NO_CACHE flag is there so I can test execution times.)
SELECT SQL_NO_CACHE e.*, IFNULL(SUM(vs.`clicks`), 0) as `clicks`
FROM `episodes` e
LEFT JOIN
(SELECT * FROM `views_sum` WHERE `index` = "episode") vs
ON vs.`key` = e.`id`
GROUP BY e.`id`
Because the ON condition views_sum.index = "episode" is static, i.e., isn't dependent on the row it's joined to, I was able to get a massive performance boost by first using a subquery to limit the views_sum table before joining.
My query now takes ~0.2s. And what's even better, the time doesn't grow as you increase the offset of the query (unlike my first LEFT JOIN attempt). It stays the same, even if you do a sort on the clicks column.
You should have a combined index on views_sum.index and views_sum.key. Judging by the names, I suspect you will always use both fields together. Also, I would rewrite the first query to use a proper INNER JOIN clause instead of a filtered cartesian product.
I suspect the performance of both queries will be much closer together if you do this. And, more importantly, much faster than they are now.
edit: Thinking about it, I would probably add a third column to that index: views_sum.clicks, which probably can be used for the SUM. But remember that multi-column indexes can only be used left to right.
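Roughly, these two suggestions could look like the following. The index name idx_index_key_clicks is invented for illustration, and whether the third column pays off is something to verify with EXPLAIN.

```sql
-- Combined index, used left to right: `index` (constant), then `key` (join
-- column), then clicks so the SUM can be computed from the index alone.
ALTER TABLE `views_sum`
  ADD INDEX `idx_index_key_clicks` (`index`, `key`, `clicks`);

-- The first query rewritten with an explicit INNER JOIN.
SELECT e.*, IFNULL(SUM(vs.`clicks`), 0) AS `clicks`
FROM `episodes` e
INNER JOIN `views_sum` vs
        ON vs.`index` = 'episode' AND vs.`key` = e.`id`
GROUP BY e.`id`;
```

Changing INNER JOIN to LEFT JOIN in the second statement gives the variant that also returns episodes without views, and it can use the same index.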
It's all about the indexes. You'll have to play around with it a bit or post your database schema here. As a rough guess, I'd say you should make sure you have an index on views_sum.key.
Normally, a LEFT JOIN will be slower than an INNER JOIN or a CROSS JOIN because it has to view the first table differently. Put another way, the difference in time isn't related to the size of the result, but the full size of the left table.
I also wonder if you're asking MySQL to figure things out for you that you should be doing yourself. Specifically, that SUM() function would normally require a GROUP BY clause.

Why is this query using where instead of index?

EXPLAIN EXTENDED SELECT `board`.*
FROM `board`
WHERE `board`.`category_id` = '5'
AND `board`.`board_id` = '0'
AND `board`.`display` = '1'
ORDER BY `board`.`order` ASC
The output of the above query is
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE board ref category_id_2 category_id_2 9 const,const,const 4 100.00 Using where
I'm a little confused by this because I have an index that contains the columns that I'm using in the same order they're used in the query...:
category_id_2 BTREE No No
category_id 33 A
board_id 33 A
display 33 A
order 66 A
The output of EXPLAIN can sometimes be misleading.
For instance, filesort has nothing to do with files, using where does not mean you are using a WHERE clause, and using index can show up on the tables without a single index defined.
Using where just means there is some restricting clause on the table (WHERE or ON), and not all records will be returned. Note that LIMIT does not count as a restricting clause (though it can be).
Using index means that all information is returned from the index, without seeking the records in the table. This is only possible if all fields required by the query are covered by the index.
Since you are selecting *, this is impossible. Fields other than category_id, board_id, display and order are not covered by the index and should be looked up.
It is actually using index category_id_2.
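To see the difference, one could select only the columns that the index covers. This is a sketch based on the index listing in the question; the exact Extra text may vary by MySQL version.

```sql
-- Only columns contained in category_id_2 are selected, so MySQL can
-- answer the query from the index without reading the table rows.
EXPLAIN
SELECT `category_id`, `board_id`, `display`, `order`
FROM `board`
WHERE `category_id` = '5'
  AND `board_id` = '0'
  AND `display` = '1'
ORDER BY `board`.`order` ASC;
```

With a query like this, Extra would be expected to include "Using index", whereas SELECT `board`.* cannot show it because the remaining columns have to be fetched from the table.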
It's using the index category_id_2 properly, as shown by the key field of the EXPLAIN.
Using where just means that you're selecting only some rows by using the WHERE clause, so you won't get the entire table back ;)