JOIN vs UNION vs IN() - big tables and many WHERE conditions

I use MySQL 5.5 and I have 3 tables created for testing:
attributes (entity_id, cid, aid, value) - indexes: ALL
items (entity_id, price, currency) - indexes: entity_id
rates (currency_from, currency_to, rate) - indexes: NONE
I need to count the results for specified conditions (search by attributes) and select X rows ordered by some column.
The query should support searching in item attributes (attributes table).
I have a query like this at first:
SELECT i.entity_id, i.price * COALESCE(r.rate, 1) AS final_price
FROM items i
JOIN attributes a ON a.entity_id = i.entity_id
LEFT JOIN rates r ON i.currency = r.currency_from AND r.currency_to = 'EUR'
WHERE a.cid = 4 AND ( (a.aid >= 10 AND a.value > 2000) OR (a.aid <= 10 AND a.value > 5) )
HAVING final_price BETWEEN 0 AND 9000
ORDER BY final_price DESC
LIMIT 20
but it's quite slow on big tables. The WHERE conditions can get larger (up to 30 parameters) and sometimes use CAST(a.value AS SIGNED) so that BETWEEN works for range values.
For example:
SELECT
i.entity_id,
i.price * COALESCE(r.rate, 1) AS final_price
FROM
attributes a
JOIN items i
ON a.entity_id = i.entity_id
LEFT JOIN rates r
ON i.currency = r.currency_from
AND r.currency_to = 'EUR'
WHERE
a.cid = 4 AND (
(a.aid = 10 AND CAST(a.value AS SIGNED) BETWEEN 2000 AND 2014)
OR (a.aid = 121 AND CAST(a.value AS SIGNED) BETWEEN 40 AND 60)
OR (a.aid = 45 AND CAST(a.value AS SIGNED) BETWEEN 770 AND 1500)
OR (a.aid = 95 AND CAST(a.value AS SIGNED) BETWEEN 12770 AND 15500)
OR (a.aid = 98 AND a.value = 'some value')
OR (a.aid = 199 AND a.value = 'some another value')
OR (a.aid = 102 AND a.value = 1)
OR (a.aid = 112 AND a.value = 42) )
GROUP BY
i.entity_id
HAVING
COUNT(i.entity_id) = 7
AND final_price BETWEEN 0 AND 9000
ORDER BY
final_price DESC
LIMIT 20
I group by entity_id and require COUNT() to equal 7 (the number of attributes to search for), because I need to find items that have all of these attributes.
EXPLAIN for the base query (the first one):
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | a | ALL | entity_id,value | NULL | NULL | NULL | 379999 | Using where; Using temporary; Using filesort
1 | SIMPLE | i | eq_ref | PRIMARY | PRIMARY | 4 | testowa.a.entity_id | 1 | Using where
1 | SIMPLE | r | ALL | NULL | NULL | NULL | NULL | 2 |
I have read many topics comparing UNION, JOIN and IN(). The second option (JOIN) gives the best results, but it is still too slow.
Is there any way to get better performance here? Why is it so slow?
Should I think about moving some logic (splitting this query into 3 smaller ones) to backend (PHP/RoR) code?

I would restructure your query slightly and have the attributes table first
and then joined to the items. Also, I would have a covering index on the
items table via (entity_id, price) and an index on your attributes table
ON (cid, aid, value, entity_id), and your rates table index
ON (currency_from, currency_to, rate). This way, all are covering indexes
and the engine won't need to go to the raw data pages to get the data, it can
pull it from the indexes it is already using for the joining / criteria.
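To make the covering-index idea concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example runs anywhere; the sample rows and the index name idx_attr_cover are hypothetical). SQLite reports "USING COVERING INDEX" in its query plan when the index alone can answer the query; in MySQL the comparable signal is "Using index" in the Extra column of EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attributes (entity_id INTEGER, cid INTEGER, aid INTEGER, value TEXT);
    -- Same column order as the index suggested above: filter columns first,
    -- then the column the query needs to return.
    CREATE INDEX idx_attr_cover ON attributes (cid, aid, value, entity_id);
    INSERT INTO attributes VALUES (1, 4, 10, '2005'), (2, 4, 10, '1999'), (3, 5, 10, '2010');
""")

query = "SELECT entity_id FROM attributes WHERE cid = 4 AND aid = 10"

# EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); keep the detail text.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
rows = sorted(r[0] for r in conn.execute(query))

print(plan)
print(rows)
```

Every column the query touches is part of the index, so the table rows themselves never have to be read.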
SELECT
i.entity_id,
i.price * COALESCE(r.rate, 1) AS final_price
FROM
attributes a
JOIN items i
ON a.entity_id = i.entity_id
LEFT JOIN rates r
ON i.currency = r.currency_from
AND r.currency_to = 'EUR'
WHERE
a.cid = 4 AND ( (a.aid >= 10 AND a.value > 2000) OR (a.aid <= 10 AND a.value > 5) )
HAVING
final_price BETWEEN 0 AND 9000
ORDER BY
final_price DESC
LIMIT 20
So, although this would help the query you have provided, could you show another one with many more criteria? You mentioned there could be as many as 30 (or more). Seeing more of them might alter the query slightly.
As for your updated query with multiple criteria, I would add an IN() clause for all the "aid" values right after the "a.cid = 4". That way, a row whose "aid" is not one you are interested in is rejected before any of the "OR" conditions has to be evaluated, such as
a.cid = 4
AND a.aid in ( 10, 121, 45, 95, 98, 199, 102, 112 )
AND ( rest of the complex aid, casting and between criteria )
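The IN() pre-filter combined with the question's GROUP BY / HAVING COUNT check can be sketched as follows. This is only an illustration under stated assumptions: it runs on Python's sqlite3 rather than MySQL (so CAST ... AS INTEGER replaces CAST ... AS SIGNED), the sample rows are invented, only three of the conditions are used, and each entity is assumed to have at most one row per aid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attributes (entity_id INTEGER, cid INTEGER, aid INTEGER, value TEXT);
    INSERT INTO attributes VALUES
        (1, 4, 10, '2005'), (1, 4, 121, '50'), (1, 4, 98, 'some value'),
        (2, 4, 10, '2005'), (2, 4, 121, '99'), (2, 4, 98, 'some value'),
        (3, 4, 10, '1990'), (3, 4, 121, '50'), (3, 4, 7, 'unrelated');
""")

wanted = 3  # number of attribute conditions that must all match

sql = """
    SELECT entity_id
    FROM attributes
    WHERE cid = 4
      AND aid IN (10, 121, 98)          -- cheap pre-filter on the attribute ids
      AND (   (aid = 10  AND CAST(value AS INTEGER) BETWEEN 2000 AND 2014)
           OR (aid = 121 AND CAST(value AS INTEGER) BETWEEN 40 AND 60)
           OR (aid = 98  AND value = 'some value'))
    GROUP BY entity_id
    HAVING COUNT(*) = ?                 -- keep entities matching every condition
"""
matches = [r[0] for r in conn.execute(sql, (wanted,))]
print(matches)
```

Entity 2 fails the range check for aid 121 and entity 3 fails the one for aid 10, so only entity 1 reaches the required count.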

Optimize SQL query with many inner joins on same table

I'm stuck with a performance issue:
A shop has an article filter with categories "color", "size", "gender" and "feature". All those details are stored inside an article_criterias table, that looks like this:
The layout of article_criterias is shown below; the table has about 36,000 rows:
article_id | group | option | option_val
100 | "size" | "35" | 35.00
100 | "size" | "36" | 36.00
100 | "size" | "36½" | 36.50
100 | "color" | "40" | 40.00
100 | "color" | "50" | 50.00
100 | "gender" | "1" | 1.00
101 | "size" | "40" | 40.00
...
We have a SQL query that is built dynamically, based on which criteria are currently selected. The query is fine for 2-3 criteria, but gets very slow when more than 5 options are selected (each additional INNER JOIN roughly doubles the execution time).
How can we make this SQL faster, maybe even replacing the inner joins with a better-performing approach?
This is the query (the logic is correct, just the performance is bad):
-- This SQL is generated when the user selected the following criteria
-- gender: 1
-- color: 80 + 30
-- size 36 + 37 + 38 + 39 + 42 + 46
SELECT
criteria.group AS `key`,
criteria.option AS `value`
FROM articles
INNER JOIN article_criterias AS criteria ON articles.id = criteria.article_id
INNER JOIN article_criterias AS criteria_gender
ON criteria_gender.article_id = articles.id AND criteria_gender.group = "gender"
INNER JOIN article_criterias AS criteria_color1
ON criteria_color1.article_id = articles.id AND criteria_color1.group = "color"
INNER JOIN article_criterias AS criteria_size2
ON criteria_size2.article_id = articles.id AND criteria_size2.group = "size"
INNER JOIN article_criterias AS criteria_size3
ON criteria_size3.article_id = articles.id AND criteria_size3.group = "size"
INNER JOIN article_criterias AS criteria_size4
ON criteria_size4.article_id = articles.id AND criteria_size4.group = "size"
INNER JOIN article_criterias AS criteria_size5
ON criteria_size5.article_id = articles.id AND criteria_size5.group = "size"
INNER JOIN article_criterias AS criteria_size6
ON criteria_size6.article_id = articles.id AND criteria_size6.group = "size"
INNER JOIN article_criterias AS criteria_size7
ON criteria_size7.article_id = articles.id AND criteria_size7.group = "size"
WHERE
(criteria_gender.option IN ("1"))
AND (criteria_color1.option IN ("80", "30"))
AND (criteria_size2.option_val BETWEEN 35.500000 AND 36.500000)
AND (criteria_size3.option_val BETWEEN 36.500000 AND 37.500000)
AND (criteria_size4.option_val BETWEEN 37.500000 AND 38.500000)
AND (criteria_size5.option_val BETWEEN 38.500000 AND 39.500000)
AND (criteria_size6.option_val BETWEEN 41.500000 AND 42.500000)
AND (criteria_size7.option_val BETWEEN 45.500000 AND 46.500000)

Key/value tables are really a nuisance. However, in order to find the articles that match certain criteria, aggregate your data:
select
a.*,
ac.`group` AS `key`,
ac.`option` AS `value`
from articles a
join article_criterias ac on ac.article_id = a.id
where a.id in
(
select article_id
from article_criterias
group by article_id
having sum(`group` = 'gender' and `option` = '1') > 0
and sum(`group` = 'color' and `option` in ('30','80')) > 0
and sum(`group` = 'size' and option_val between 35.5 and 36.5) > 0
and sum(`group` = 'size' and option_val between 36.5 and 37.5) > 0
and sum(`group` = 'size' and option_val between 37.5 and 38.5) > 0
and sum(`group` = 'size' and option_val between 38.5 and 39.5) > 0
and sum(`group` = 'size' and option_val between 41.5 and 42.5) > 0
and sum(`group` = 'size' and option_val between 45.5 and 46.5) > 0
)
order by a.id, ac.`group`, ac.`option`;
This gets you all articles that are available for gender 1, colors 30 and/or 80, and all listed size ranges, along with all their options. (The size ranges are a bit strange, though; a size 36.5 would meet two ranges, for instance.) You get the idea: group by article_id and use HAVING in order to only get the article_ids that meet the criteria.
As to indexes you'll want
create index idx on article_criterias(article_id, `group`, `option`, option_val);
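The aggregate-and-filter pattern above can be tried out with a small sketch. It runs on Python's sqlite3 as a stand-in for MySQL; the columns are renamed to grp, opt and opt_val to sidestep reserved words, the sample rows are made up, and only two of the size ranges are checked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article_criterias (article_id INTEGER, grp TEXT, opt TEXT, opt_val REAL);
    INSERT INTO article_criterias VALUES
        (100, 'size', '36', 36.0), (100, 'size', '37', 37.0),
        (100, 'color', '30', 30.0), (100, 'gender', '1', 1.0),
        (101, 'size', '36', 36.0), (101, 'color', '30', 30.0), (101, 'gender', '1', 1.0),
        (102, 'size', '36', 36.0), (102, 'size', '37', 37.0),
        (102, 'color', '50', 50.0), (102, 'gender', '1', 1.0);
""")

# Each SUM counts the rows of an article that satisfy one criterion
# (a true comparison evaluates to 1, a false one to 0).
sql = """
    SELECT article_id
    FROM article_criterias
    GROUP BY article_id
    HAVING SUM(grp = 'gender' AND opt = '1') > 0
       AND SUM(grp = 'color'  AND opt IN ('30', '80')) > 0
       AND SUM(grp = 'size'   AND opt_val BETWEEN 35.5 AND 36.5) > 0
       AND SUM(grp = 'size'   AND opt_val BETWEEN 36.5 AND 37.5) > 0
    ORDER BY article_id
"""
article_ids = [r[0] for r in conn.execute(sql)]
print(article_ids)
```

Article 101 lacks a size in the second range and article 102 has the wrong color, so a single pass over the table leaves only article 100, with no self-joins needed.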

As suggested by @affan-pathan, adding an index did solve the issue:
CREATE INDEX text_option
ON `article_criterias` (`article_id`, `group`, `option`);
CREATE INDEX numeric_option
ON `article_criterias` (`article_id`, `group`, `option_val`);
Those two indexes cut the execution time of the above query from nearly 1 minute to less than 50 milliseconds!

I understand the indexes you created solved your problem, but just to play with an alternative that avoids the multiple INNER JOINs, can you try something like this? I tested with just three conditions. Your conditions go in the inner query. To select only the records that meet all conditions, change the last WHERE condition to the number of conditions you use (WHERE max = 3 here; with 5 conditions, write WHERE max = 5). I renamed the columns group and option for my own ease of use.
It's just an idea, so please run some tests, check the performance, and let me know.
CREATE TABLE CRITERIA (ARTICLE_ID INT, GROU VARCHAR(10), OPT VARCHAR(20), OPTION_VAL NUMERIC(12,2));
CREATE TABLE ARTICLES (ID INT);
INSERT INTO CRITERIA VALUES (100,'size','35',35);
INSERT INTO CRITERIA VALUES (100,'size','36',36);
INSERT INTO CRITERIA VALUES (100,'color','40',40);
INSERT INTO CRITERIA VALUES (100,'gender','1',1);
INSERT INTO CRITERIA VALUES (200,'size','36.2',36.2);
INSERT INTO CRITERIA VALUES (300,'size','36.2',36.2);
INSERT INTO ARTICLES VALUES (100);
INSERT INTO ARTICLES VALUES (200);
INSERT INTO ARTICLES VALUES (300);
-------------------------------------------------------
SELECT D.article_id, D.GROU, D.OPT
FROM (SELECT C.*
, @o:=CASE WHEN @h=ARTICLE_ID THEN @o ELSE cumul END max
, @h:=ARTICLE_ID AS a_id
FROM (SELECT article_id,
B.GROU, B.OPT,
@r:= CASE WHEN @g = B.ARTICLE_ID THEN @r+1 ELSE 1 END cumul,
@g:= B.ARTICLE_ID g
FROM CRITERIA B
CROSS JOIN (SELECT @g:=0, @r:=0) T1
WHERE (B.GROU='gender' AND B.OPT IN ('1'))
OR (B.GROU='color' AND B.OPT IN ('40', '30'))
OR (B.GROU='size' AND B.OPT BETWEEN 35.500000 AND 36.500000)
ORDER BY article_id
) C
CROSS JOIN (SELECT @o:=0, @h:=0) T2
ORDER BY ARTICLE_ID, CUMUL DESC) D
WHERE max=3
;
Output:
article_id GROU OPT
100 gender 1
100 color 40
100 size 36

How can I optimise mySQL to use JOINs instead of nested IN queries?

I have a query which combines a user's balance at a number of locations and uses a nested subquery to combine data from the customer_balance table and the merchant_groups table. There is a second piece of data required from the customer_balance table that is unique to each merchant.
I'd like to optimise my query to return a sum and a unique value i.e. the order of results is important.
For instance, there may be three merchants in a merchant_group:
id | group_id | group_member_id
1 12 36
2 12 70
3 12 106
The user may have a balance at 2 locations but not all in the customer_balance table:
id | group_member_id | user_id | balance | personal_note
1 36 420 1.00 "Likes chocolate"
2 70 420 20.00 null
Notice there isn't a 3rd row in the balance table.
What I'd like to end up with is the ability to pull the sum of the balance as well as the most appropriate personal_note.
So far I have this working in all situations with the following query:
SELECT sum(c.cash_balance) as cash_balance,n.customer_note FROM customer_balance AS c
LEFT JOIN (SELECT customer_note, user_id FROM customer_balance
WHERE user_id = 420 AND group_member_id = 36) AS n on c.user_id = n.user_id
WHERE c.user_id = 420 AND c.group_id IN (SELECT group_member_id FROM merchant_group WHERE group_id = 12)
I can change out the group_member_id appropriately and I will always get the combined balance as expected and the appropriate note. i.e. what I'm looking for is:
balance: 21.00
customer_note: "Likes Chocolate" OR null (depending on the group_member_id)
Is it possible to optimise this query without using resource heavy nested queries e.g. using a JOIN? (or some other method).
I have tried a number of options, but cannot get it working in all situations. The following is the closest I have gotten, except this doesn't return the correct note:
SELECT sum(cb.balance), cb.personal_note FROM customer_balance AS cb
LEFT JOIN merchant_group AS mg on mg.group_member_id = cb.group_member_id
WHERE cb.user_id = 420 && mg.group_id = 12
ORDER BY (mg.group_member_id = 106)
I also tried another option (but since lost the query) that works, but not when the group_member_id = 106 - because there was no record in one table (but this is a valid use case that I'd like to cater for).
Thanks!

This should be equivalent but without subselect
SELECT
sum(c.cash_balance) as cash_balance
, n.customer_note
FROM customer_balance AS c
LEFT JOIN customer_balance as n on ( c.user_id = n.user_id AND n.group_member_id = 36 AND n.user_id = 420 )
INNER JOIN merchant_group as mg on ( c.group_id = mg.group_member_id AND mg.group_id = 12)
WHERE c.user_id = 420
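A runnable sketch of the join-only approach, using the column names from the question's table listing (balance, personal_note, group_member_id) rather than those in the queries, and Python's sqlite3 in place of MySQL. The helper balance_and_note is hypothetical, MAX() is used only to pick the single joined note, and at most one balance row per user and merchant is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE merchant_group (id INTEGER, group_id INTEGER, group_member_id INTEGER);
    CREATE TABLE customer_balance (id INTEGER, group_member_id INTEGER, user_id INTEGER,
                                   balance REAL, personal_note TEXT);
    INSERT INTO merchant_group VALUES (1, 12, 36), (2, 12, 70), (3, 12, 106);
    INSERT INTO customer_balance VALUES
        (1, 36, 420, 1.00, 'Likes chocolate'),
        (2, 70, 420, 20.00, NULL);
""")

SQL = """
    SELECT SUM(c.balance), MAX(n.personal_note)
    FROM customer_balance AS c
    JOIN merchant_group AS mg
      ON mg.group_member_id = c.group_member_id AND mg.group_id = :group_id
    LEFT JOIN customer_balance AS n
      ON n.user_id = c.user_id AND n.group_member_id = :member_id
    WHERE c.user_id = :user_id
"""

def balance_and_note(member_id):
    """Return (total balance in the group, note stored for the given merchant)."""
    params = {"group_id": 12, "user_id": 420, "member_id": member_id}
    return conn.execute(SQL, params).fetchone()

for member in (36, 70, 106):
    print(member, balance_and_note(member))
```

Because the note comes from a LEFT JOIN, merchant 106 (no balance row at all) still yields the full sum with a null note.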

Is there better way to do this query?

SELECT *
FROM a
WHERE a.re_id = 3443499
AND a.id IN
(
SELECT b.rsp_id FROM b
WHERE b.f_id = 9
GROUP BY b.rsp_id
HAVING FIND_IN_SET(16, GROUP_CONCAT(b.o_id)) > 0
AND FIND_IN_SET(15, GROUP_CONCAT(b.o_id)) > 0
UNION
SELECT b.rsp_id FROM b
WHERE b.f_id = 4
GROUP BY b.rsp_id
HAVING FIND_IN_SET(5, GROUP_CONCAT(b.o_id)) > 0
)
ORDER BY id DESC
Here the "f_id" values are the keys of an array, and the values under each key are the ones passed as the first parameter of the FIND_IN_SET function.
For example
9=>(
16,
15
),
4=>(
5
)
Sample data for the two columns f_id and o_id in table b:
f_id o_id
9 15
9 18
9 23
4 5
3 8

The gist of this answer is that the current query does not run. So, fix the syntax and ask another question.
First, you could write the query so it is syntactically correct. The query will fail as written, because the first subquery returns at least two rows and the second only one.
Second, use UNION ALL instead of UNION, unless you specifically want to incur the overhead of removing duplicates.
Third, the ORDER BY will generate an error.
Fourth, the GROUP_CONCAT() is dangerous and unnecessary.
I'm not 100% sure this is the intention, but I would start with a query like this:
SELECT a.id, a.re_id
FROM a
WHERE a.re_id = 3443499 AND
a.id IN (SELECT b.rsp_id
FROM b
WHERE b.f_id = 9
GROUP BY b.rsp_id
HAVING MAX(b.o_id = 16) > 0 AND
MAX(b.o_id = 15) > 0
)
UNION ALL
SELECT b.rsp_id, NULL
FROM b
WHERE b.f_id = 4
GROUP BY b.rsp_id
HAVING MAX(b.o_id = 5) > 0
ORDER BY id;
Then, if you want this optimized, I would suggest asking another question, along with relevant information about the table structures and current performance.
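Two of the points above, replacing FIND_IN_SET(GROUP_CONCAT()) with MAX(condition) and choosing between UNION and UNION ALL, can be illustrated with a small sketch. It uses Python's sqlite3 instead of MySQL, and the rsp_id values are invented because the question's sample data does not include that column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE b (rsp_id INTEGER, f_id INTEGER, o_id INTEGER);
    INSERT INTO b VALUES (1, 9, 15), (1, 9, 16), (1, 4, 5), (2, 4, 5);
""")

# MAX(o_id = N) is 1 when at least one row in the group has o_id = N.
branch_9 = ("SELECT rsp_id FROM b WHERE f_id = 9 GROUP BY rsp_id "
            "HAVING MAX(o_id = 16) > 0 AND MAX(o_id = 15) > 0")
branch_4 = ("SELECT rsp_id FROM b WHERE f_id = 4 GROUP BY rsp_id "
            "HAVING MAX(o_id = 5) > 0")

union_rows = sorted(r[0] for r in conn.execute(f"{branch_9} UNION {branch_4}"))
union_all_rows = sorted(r[0] for r in conn.execute(f"{branch_9} UNION ALL {branch_4}"))

print(union_rows)      # duplicates removed
print(union_all_rows)  # duplicates kept
```

Inside an IN (...) subquery the duplicate does not change the result, which is why the cheaper UNION ALL is usually sufficient there.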

Why does this WHERE clause make my query 180 times slower?

The following query executes in 1.6 seconds:
SET @num :=0, @current_shop_id := NULL, @current_product_id := NULL;
#this query limits the results of the query within it by row number (so that only 250 products get displayed per store)
SELECT * FROM (
#this query adds row numbers to the query within it
SELECT *, @num := IF( @current_shop_id = shop_id, IF(@current_product_id=product_id,@num,@num+1), 0) AS row_number, @current_shop_id := shop_id AS shop_dummy, @current_product_id := product_id AS product_dummy FROM (
SELECT shop, shops.shop_id AS
shop_id, p1.product_id AS
product_id
FROM products p1 LEFT JOIN #this LEFT JOIN gets the favorites count for each product
(
SELECT fav3.product_id AS product_id, SUM(CASE
WHEN fav3.current = 1 AND fav3.closeted = 1 THEN 1
WHEN fav3.current = 1 AND fav3.closeted = 0 THEN -1
ELSE 0
END) AS favorites_count
FROM favorites fav3
GROUP BY fav3.product_id
) AS fav4 ON p1.product_id=fav4.product_id
INNER JOIN sex ON sex.product_id=p1.product_id AND
sex.sex=0 AND
sex.date >= SUBDATE(NOW(),INTERVAL 1 DAY)
INNER JOIN shops ON shops.shop_id = p1.shop_id
ORDER BY shop, sex.DATE, product_id
) AS testtable
) AS rowed_results WHERE
rowed_results.row_number>=0 AND
rowed_results.row_number<(7)
Adding AND shops.shop_id=86 to the final WHERE clause causes the query to execute in 292 seconds:
SET @num :=0, @current_shop_id := NULL, @current_product_id := NULL;
#this query limits the results of the query within it by row number (so that only 250 products get displayed per store)
SELECT * FROM (
#this query adds row numbers to the query within it
SELECT *, @num := IF( @current_shop_id = shop_id, IF(@current_product_id=product_id,@num,@num+1), 0) AS row_number, @current_shop_id := shop_id AS shop_dummy, @current_product_id := product_id AS product_dummy FROM (
SELECT shop, shops.shop_id AS
shop_id, p1.product_id AS
product_id
FROM products p1 LEFT JOIN #this LEFT JOIN gets the favorites count for each product
(
SELECT fav3.product_id AS product_id, SUM(CASE
WHEN fav3.current = 1 AND fav3.closeted = 1 THEN 1
WHEN fav3.current = 1 AND fav3.closeted = 0 THEN -1
ELSE 0
END) AS favorites_count
FROM favorites fav3
GROUP BY fav3.product_id
) AS fav4 ON p1.product_id=fav4.product_id
INNER JOIN sex ON sex.product_id=p1.product_id AND
sex.sex=0 AND
sex.date >= SUBDATE(NOW(),INTERVAL 1 DAY)
INNER JOIN shops ON shops.shop_id = p1.shop_id AND
shops.shop_id=86
ORDER BY shop, sex.DATE, product_id
) AS testtable
) AS rowed_results WHERE
rowed_results.row_number>=0 AND
rowed_results.row_number<(7)
I would have thought limiting the shops table with AND shops.shop_id=86 would reduce execution time. Instead, execution time appears to depend upon the number of rows in the products table with products.shop_id equal to the specified shops.shop_id. There are about 34K rows in the products table with products.shop_id=86, and execution time is 292 seconds. For products.shop_id=50, there are about 28K rows, and execution time is 210 seconds. For products.shop_id=175, there are about 2K rows, and execution time is 2.8 seconds. What is going on?
EXPLAIN EXTENDED for the 1.6 second query is:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1203 | 100.00 | Using where
2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 1203 | 100.00 |
3 | DERIVED | sex | ALL | product_id_2,product_id | NULL | NULL | NULL | 526846 | 75.00 | Using where; Using temporary; Using filesort
3 | DERIVED | p1 | eq_ref | PRIMARY,shop_id,shop_id_2,product_id,shop_id_3 | PRIMARY | 4 | mydatabase.sex.product_id | 1 | 100.00 |
3 | DERIVED | <derived4> | ALL | NULL | NULL | NULL | NULL | 14752 | 100.00 |
3 | DERIVED | shops | eq_ref | PRIMARY | PRIMARY | 4 | mydatabase.p1.shop_id | 1 | 100.00 |
4 | DERIVED | fav3 | ALL | NULL | NULL | NULL | NULL | 15356 | 100.00 | Using temporary; Using filesort
SHOW WARNINGS for this EXPLAIN EXTENDED is
-----+
| Note | 1003 | select `rowed_results`.`shop` AS `shop`,`rowed_results`.`shop_id` AS `shop_id`,`rowed_results`.`product_id` AS `product_id`,`rowed_results`.`row_number` AS `row_number`,`rowed_results`.`shop_dummy` AS `shop_dummy`,`rowed_results`.`product_dummy` AS `product_dummy` from (select `testtable`.`shop` AS `shop`,`testtable`.`shop_id` AS `shop_id`,`testtable`.`product_id` AS `product_id`,(#num:=if(((#current_shop_id) = `testtable`.`shop_id`),if(((#current_product_id) = `testtable`.`product_id`),(#num),((#num) + 1)),0)) AS `row_number`,(#current_shop_id:=`testtable`.`shop_id`) AS `shop_dummy`,(#current_product_id:=`testtable`.`product_id`) AS `product_dummy` from (select `mydatabase`.`shops`.`shop` AS `shop`,`mydatabase`.`shops`.`shop_id` AS `shop_id`,`mydatabase`.`p1`.`product_id` AS `product_id` from `mydatabase`.`products` `p1` left join (select `mydatabase`.`fav3`.`product_id` AS `product_id`,sum((case when ((`mydatabase`.`fav3`.`current` = 1) and (`mydatabase`.`fav3`.`closeted` = 1)) then 1 when ((`mydatabase`.`fav3`.`current` = 1) and (`mydatabase`.`fav3`.`closeted` = 0)) then -(1) else 0 end)) AS `favorites_count` from `mydatabase`.`favorites` `fav3` group by `mydatabase`.`fav3`.`product_id`) `fav4` on(((`mydatabase`.`p1`.`product_id` = `mydatabase`.`sex`.`product_id`) and (`fav4`.`product_id` = `mydatabase`.`sex`.`product_id`))) join `mydatabase`.`sex` join `mydatabase`.`shops` where ((`mydatabase`.`sex`.`sex` = 0) and (`mydatabase`.`p1`.`product_id` = `mydatabase`.`sex`.`product_id`) and (`mydatabase`.`shops`.`shop_id` = `mydatabase`.`p1`.`shop_id`) and (`mydatabase`.`sex`.`date` >= (now() - interval 1 day))) order by `mydatabase`.`shops`.`shop`,`mydatabase`.`sex`.`date`,`mydatabase`.`p1`.`product_id`) `testtable`) `rowed_results` where ((`rowed_results`.`row_number` >= 0) and (`rowed_results`.`row_number` < 7)) |
+------
EXPLAIN EXTENDED for the 292 second query is:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 36 | 100.00 | Using where
2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 36 | 100.00 |
3 | DERIVED | shops | const | PRIMARY | PRIMARY | 4 | | 1 | 100.00 | Using temporary; Using filesort
3 | DERIVED | p1 | ref | PRIMARY,shop_id,shop_id_2,product_id,shop_id_3 | shop_id | 4 | | 11799 | 100.00 |
3 | DERIVED | <derived4> | ALL | NULL | NULL | NULL | NULL | 14752 | 100.00 |
3 | DERIVED | sex | eq_ref | product_id_2,product_id | product_id_2 | 5 | mydatabase.p1.product_id | 1 | 100.00 | Using where
4 | DERIVED | fav3 | ALL | NULL | NULL | NULL | NULL | 15356 | 100.00 | Using temporary; Using filesort
SHOW WARNINGS for this EXPLAIN EXTENDED is
----+
| Note | 1003 | select `rowed_results`.`shop` AS `shop`,`rowed_results`.`shop_id` AS `shop_id`,`rowed_results`.`product_id` AS `product_id`,`rowed_results`.`row_number` AS `row_number`,`rowed_results`.`shop_dummy` AS `shop_dummy`,`rowed_results`.`product_dummy` AS `product_dummy` from (select `testtable`.`shop` AS `shop`,`testtable`.`shop_id` AS `shop_id`,`testtable`.`product_id` AS `product_id`,(#num:=if(((#current_shop_id) = `testtable`.`shop_id`),if(((#current_product_id) = `testtable`.`product_id`),(#num),((#num) + 1)),0)) AS `row_number`,(#current_shop_id:=`testtable`.`shop_id`) AS `shop_dummy`,(#current_product_id:=`testtable`.`product_id`) AS `product_dummy` from (select 'shop.nordstrom.com' AS `shop`,'86' AS `shop_id`,`mydatabase`.`p1`.`product_id` AS `product_id` from `mydatabase`.`products` `p1` left join (select `mydatabase`.`fav3`.`product_id` AS `product_id`,sum((case when ((`mydatabase`.`fav3`.`current` = 1) and (`mydatabase`.`fav3`.`closeted` = 1)) then 1 when ((`mydatabase`.`fav3`.`current` = 1) and (`mydatabase`.`fav3`.`closeted` = 0)) then -(1) else 0 end)) AS `favorites_count` from `mydatabase`.`favorites` `fav3` group by `mydatabase`.`fav3`.`product_id`) `fav4` on(((`fav4`.`product_id` = `mydatabase`.`p1`.`product_id`) and (`mydatabase`.`sex`.`product_id` = `mydatabase`.`p1`.`product_id`))) join `mydatabase`.`sex` join `mydatabase`.`shops` where ((`mydatabase`.`sex`.`sex` = 0) and (`mydatabase`.`sex`.`product_id` = `mydatabase`.`p1`.`product_id`) and (`mydatabase`.`p1`.`shop_id` = 86) and (`mydatabase`.`sex`.`date` >= (now() - interval 1 day))) order by 'shop.nordstrom.com',`mydatabase`.`sex`.`date`,`mydatabase`.`p1`.`product_id`) `testtable`) `rowed_results` where ((`rowed_results`.`row_number` >= 0) and (`rowed_results`.`row_number` < 7)) |
+-----
I am running MySQL client version: 5.1.56. The shops table has a primary index on shop_id:
Keyname | Type | Unique | Packed | Column | Cardinality | Collation | Null | Comment
PRIMARY | BTREE | Yes | No | shop_id | 163 | A | |
I have analyzed the shop table but this did not help.
I notice that if I remove the LEFT JOIN the difference in execution times drops to 0.12 seconds versus 0.28 seconds.
Cez's solution, namely to use the 1.6-second version of the query and remove irrelevant results by adding rowed_results.shop_dummy=86 to the outer query (as below), executes in 1.7 seconds. This circumvents the problem, but the mystery remains why the 292-second query is so slow.
SET @num :=0, @current_shop_id := NULL, @current_product_id := NULL;
#this query limits the results of the query within it by row number (so that only 250 products get displayed per store)
SELECT * FROM (
#this query adds row numbers to the query within it
SELECT *, @num := IF( @current_shop_id = shop_id, IF(@current_product_id=product_id,@num,@num+1), 0) AS row_number, @current_shop_id := shop_id AS shop_dummy, @current_product_id := product_id AS product_dummy FROM (
SELECT shop, shops.shop_id AS
shop_id, p1.product_id AS
product_id
FROM products p1 LEFT JOIN #this LEFT JOIN gets the favorites count for each product
(
SELECT fav3.product_id AS product_id, SUM(CASE
WHEN fav3.current = 1 AND fav3.closeted = 1 THEN 1
WHEN fav3.current = 1 AND fav3.closeted = 0 THEN -1
ELSE 0
END) AS favorites_count
FROM favorites fav3
GROUP BY fav3.product_id
) AS fav4 ON p1.product_id=fav4.product_id
INNER JOIN sex ON sex.product_id=p1.product_id AND sex.sex=0
INNER JOIN shops ON shops.shop_id = p1.shop_id
WHERE sex.date >= SUBDATE(NOW(),INTERVAL 1 DAY)
ORDER BY shop, sex.DATE, product_id
) AS testtable
) AS rowed_results WHERE
rowed_results.row_number>=0 AND
rowed_results.row_number<(7) AND
rowed_results.shop_dummy=86;

After the chat room, and actually creating tables/columns to match the query, I've come up with the following query.
I have started my inner-most query to be on the sex, product (for shop_id) and favorites table. Since you described that ProductX at ShopA = Product ID = 1 but same ProductX at ShopB = Product ID = 2 (example only), each product is ALWAYS unique per shop and never duplicated. That said, I can get the product and shop_id WITH the count of favorites (if any) at this query, yet group on just the product_id .. as shop_id won't change per product I am using MAX(). Since you are always looking by a date of "yesterday" and gender (sex=0 female), I would have the SEX table indexed on ( date, sex, product_id )... I would guess you are not adding 1000's of items every day... Products obviously would have an index on product_id (primary key), and favorites SHOULD have an index on product_id.
From that result (alias "sxFav") we can then do a direct join to the sex and products table by that "Product_ID" to get any additional information you may want, such as name of shop, date product added, product description, etc. This result is then ordered by the shop_id the product is being sold from, date and finally product ID (but you may consider grabbing a description column at inner query and using that as sort-by). This results in alias "PreQuery".
With the order being all proper by shop, we can now add the @MySQLVariable references to assign each product a row number, similar to how you originally attempted. However, the number only resets back to 1 when the shop ID changes.
SELECT
PreQuery.*,
@num := IF( @current_shop_id = PreQuery.shop_id, @num +1, 1 ) AS RowPerShop,
@current_shop_id := PreQuery.shop_id AS shop_dummy
from
( SELECT
sxFav.product_id,
sxFav.shop_id,
sxFav.Favorites_Count
from
( SELECT
sex.product_id,
MAX( p.shop_id ) shop_id,
SUM( CASE WHEN F.current = 1 AND F.closeted = 1 THEN 1
WHEN F.current = 1 AND F.closeted = 0 THEN -1
ELSE 0 END ) AS favorites_count
from
sex
JOIN products p
ON sex.Product_ID = p.Product_ID
LEFT JOIN Favorites F
ON sex.product_id = F.product_ID
where
sex.date >= subdate( now(), interval 1 day)
and sex.sex = 0
group by
sex.product_id ) sxFav
JOIN sex
ON sxFav.Product_ID = sex.Product_ID
JOIN products p
ON sxFav.Product_ID = p.Product_ID
order by
sxFav.shop_id,
sex.date,
sxFav.product_id ) PreQuery,
( select @num :=0,
@current_shop_id := 0 ) as SQLVars
Now, if you are looking for specific "paging" information (such as 7 entries per shop), wrap the ENTIRE query above in something like:
select * from ( entire query above ) AS paged where RowPerShop between 1 and 7
(or between 8 and 14, 15 and 21, etc. as needed)
or even, with pages counted from zero,
RowPerShop between RowsPerPage*PageYouAreShowing + 1 and RowsPerPage*(PageYouAreShowing + 1)
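The per-shop numbering and paging logic can be checked outside the database with a short Python sketch. It only emulates what the user-variable expression computes over rows already sorted by shop. The function names and sample rows are hypothetical, and pages are assumed to be counted from zero, so page p covers row numbers p*size + 1 through (p + 1)*size.

```python
def number_rows_per_shop(rows):
    """Assign 1, 2, 3, ... within each shop; rows must already be sorted by shop_id."""
    numbered = []
    current_shop_id = None
    num = 0
    for shop_id, product_id in rows:
        # Mirrors IF(current_shop_id = shop_id, num + 1, 1): restart at 1 on a new shop.
        num = num + 1 if shop_id == current_shop_id else 1
        current_shop_id = shop_id
        numbered.append((shop_id, product_id, num))
    return numbered


def page(numbered, rows_per_page, page_index):
    """Keep the rows whose per-shop number falls on the given zero-based page."""
    first = rows_per_page * page_index + 1
    last = rows_per_page * (page_index + 1)
    return [row for row in numbered if first <= row[2] <= last]


rows = [(50, "a"), (50, "b"), (50, "c"), (86, "d"), (86, "e")]
numbered = number_rows_per_shop(rows)
print(page(numbered, 2, 0))
print(page(numbered, 2, 1))
```

With two rows per page, the first page holds two products from each shop and the second page holds only the third product of shop 50.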

You should move the shops.shop_id=86 to the JOIN condition for shops. There is no reason to put it outside the JOIN; you run the risk of MySQL JOINing first, then filtering. A JOIN condition can do the same job that a WHERE clause does, especially if you are not referencing other tables.
....
INNER JOIN shops ON shops.shop_id = p1.shop_id AND shops.shop_id=86
....
Same thing with the sex join:
...
INNER JOIN sex ON sex.product_id = p1.product_id AND sex.sex = 0
AND sex.date >= SUBDATE(NOW(),INTERVAL 1 DAY)
...
Derived tables are great, but they have no indexes on them. Usually this doesn't matter since they are generally in RAM. But between filtering and sorting with no indexes, things can add up.
Note that in the second query, which takes much longer, the table processing order changes. The shops table is at the top in the slow query, and the p1 table retrieves 11799 rows instead of the 1 row in the fast query. It also no longer uses the primary key. That's likely where your problem is.
3 DERIVED p1 eq_ref PRIMARY,shop_id,shop_id_2,product_id,shop_id_3 PRIMARY 4 mydatabase.sex.product_id 1 100.00
3 DERIVED p1 ref PRIMARY,shop_id,shop_id_2,product_id,shop_id_3 shop_id 4 11799 100.00

Judging by the discussion, the query planner is performing badly when specifying the shop at a lower level.
Add rowed_results.shop_dummy=86 to the outer query to get the results that you are looking for.

Using Order By NULL in a UNION

I have a query (see below) that uses a custom-developed UDF to calculate whether or not certain points are within a polygon (first query in the UNION) or a circular shape (second query in the UNION).
select e.inquiry_match_type_id
, a.geo_boundary_id
, GeoBoundaryContains(c.tpi_geo_boundary_coverage_type_id, 29.287437, -95.055807, a.lat, a.lon, a.geo_boundary_vertex_id ) in_out
, e.inquiry_id
, e.external_id
, COALESCE(f.inquiry_device_id,0) inquiry_device_id
, b.external_info1
, b.external_info2
, b.geo_boundary_id
, b.geo_boundary_type_id
from geo_boundary_vertex a
join geo_boundary b on b.geo_boundary_id = a.geo_boundary_id
join trackpoint_index_geo_boundary_mem c on c.geo_boundary_id = b.geo_boundary_id
join trackpoint_index_mem d on d.trackpoint_index_id = c.trackpoint_index_id
join inquiry_mem e on e.inquiry_id = b.inquiry_id
left outer join inquiry_device_mem f on f.inquiry_id = e.inquiry_id and f.device_id = 3201
where d.trackpoint_index_id = 3127
and b.geo_boundary_type_id = 3
and e.expiration_date >= now()
group by
a.geo_boundary_id
UNION
select e.inquiry_match_type_id
, b.geo_boundary_id
, GeoBoundaryContains( c.tpi_geo_boundary_coverage_type_id, 29.287437, -95.055807, b.centroid_lat, b.centoid_lon, b.radius ) in_out
, e.inquiry_id
, e.external_id
, COALESCE(f.inquiry_device_id,0) inquiry_device_id
, b.external_info1
, b.external_info2
, b.geo_boundary_id
, b.geo_boundary_type_id
from geo_boundary b
join trackpoint_index_geo_boundary_mem c on c.geo_boundary_id = b.geo_boundary_id
join trackpoint_index_mem d on d.trackpoint_index_id = c.trackpoint_index_id
join inquiry_mem e on e.inquiry_id = b.inquiry_id
left outer join inquiry_device_mem f on f.inquiry_id = e.inquiry_id and f.device_id = 3201
where d.trackpoint_index_id = 3127
and b.geo_boundary_type_id = 2
and e.expiration_date >= now()
group by
b.geo_boundary_id
When I run an explain for the query I get the following:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1 | PRIMARY | d | const | PRIMARY | PRIMARY | 4 | const | 1 | Using temporary; Using filesort
1 | PRIMARY | c | ref | PRIMARY,fk_mtp_idx_geo_boundary_mtp_idx,fk_mtp_idx_geo_boundary_geo_boundary,fk_mtp_idx_geo_boundary_mtp_mem_idx,fk_mtp_idx_geo_boundary_geo_boundary_mem | fk_mtp_idx_geo_boundary_mtp_idx | 4 | const | 9 |
1 | PRIMARY | b | eq_ref | PRIMARY,fk_geo_boundary_inquiry,fk_geo_boundary_geo_boundary_type | PRIMARY | 4 | gothim.c.geo_boundary_id | 1 | Using where
1 | PRIMARY | e | eq_ref | PRIMARY | PRIMARY | 4 | gothim.b.inquiry_id | 1 | Using where
1 | PRIMARY | f | ref | fk_inquiry_device_mem_inquiry | fk_inquiry_device_mem_inquiry | 4 | gothim.e.inquiry_id | 2 |
1 | PRIMARY | a | ref | fk_geo_boundary_vertex_geo_boundary | fk_geo_boundary_vertex_geo_boundary | 4 | gothim.b.geo_boundary_id | 11 | Using where
2 | UNION | d | const | PRIMARY | PRIMARY | 4 | const | 1 | Using temporary; Using filesort
2 | UNION | c | ref | PRIMARY,fk_mtp_idx_geo_boundary_mtp_idx,fk_mtp_idx_geo_boundary_geo_boundary,fk_mtp_idx_geo_boundary_mtp_mem_idx,fk_mtp_idx_geo_boundary_geo_boundary_mem | fk_mtp_idx_geo_boundary_mtp_idx | 4 | const | 9 |
2 | UNION | b | eq_ref | PRIMARY,fk_geo_boundary_inquiry,fk_geo_boundary_geo_boundary_type | PRIMARY | 4 | gothim.c.geo_boundary_id | 1 | Using where
2 | UNION | e | eq_ref | PRIMARY | PRIMARY | 4 | gothim.b.inquiry_id | 1 | Using where
2 | UNION | f | ref | fk_inquiry_device_mem_inquiry | fk_inquiry_device_mem_inquiry | 4 | gothim.e.inquiry_id | 2 |
(null) | UNION RESULT | <union1,2> | ALL | (null) | (null) | (null) | (null) | (null) | Using filesort
12 record(s) selected [Fetch MetaData: 1ms] [Fetch Data: 5ms]
Now, I can split the query into two and use the ORDER BY NULL trick to get rid of the filesort; however, when I add ORDER BY NULL to the end of the UNION, it doesn't work.
I am considering splitting the query into two queries, or possibly rewriting it completely so that it doesn't use a UNION (though that is a bit more difficult, of course). The other thing working against me is that this is in production and I'd like to limit changes. I would have loved to just add ORDER BY NULL to the end of the query and be done with it, but it doesn't work with the UNION.
Any help would be greatly appreciated.

Normally, ORDER BY can be used for the individual queries within a UNION like this:
(
SELECT *
FROM table1, …
GROUP BY
id
ORDER BY
NULL
)
UNION ALL
(
SELECT *
FROM table2, …
GROUP BY
id
ORDER BY
NULL
)
However, as the docs state:
However, use of ORDER BY for individual SELECT statements implies nothing about the order in which the rows appear in the final result because UNION by default produces an unordered set of rows. Therefore, the use of ORDER BY in this context is typically in conjunction with LIMIT, so that it is used to determine the subset of the selected rows to retrieve for the SELECT, even though it does not necessarily affect the order of those rows in the final UNION result. If ORDER BY appears without LIMIT in a SELECT, it is optimized away because it will have no effect anyway.
Dropping a pointless ORDER BY is sensible, but here it backfires: the ORDER BY NULL is discarded while the implicit sorting done by GROUP BY is kept, so the filesort remains.
So for now, add a very high LIMIT to each individual query, which keeps the ORDER BY NULL from being optimized away:
(
SELECT *
FROM table1, …
GROUP BY
id
ORDER BY
NULL
LIMIT 100000000
)
UNION ALL
(
SELECT *
FROM table2, …
GROUP BY
id
ORDER BY
NULL
LIMIT 100000000
)
I'll report it as a bug to MySQL and hope they fix it in the next release; meanwhile, you can use this workaround.
Note that a similar trick (using TOP 100%) was used to force ordering of subqueries in SQL Server 2000; however, it stopped working in SQL Server 2005, where the optimizer ignores ORDER BY in subqueries with TOP 100%.
The LIMIT workaround is safe to use, though: even if the optimizer's behavior changes in a future release, your queries won't break; at worst they will be as slow as they are now.
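To check whether the workaround takes effect on your server, you could compare the EXPLAIN output of both forms. This is only a minimal sketch: the tables demo_a and demo_b are hypothetical stand-ins (not the tables from the question), and the plan you actually get depends on your MySQL version, indexes and data.

```sql
-- Hypothetical demo tables, deliberately without an index on id,
-- so that GROUP BY cannot be resolved through an index.
CREATE TABLE demo_a (id INT NOT NULL, val INT NOT NULL);
CREATE TABLE demo_b (id INT NOT NULL, val INT NOT NULL);

-- Form 1: ORDER BY NULL without LIMIT in each branch of the UNION.
EXPLAIN
(SELECT id, MAX(val) AS max_val FROM demo_a GROUP BY id ORDER BY NULL)
UNION ALL
(SELECT id, MAX(val) AS max_val FROM demo_b GROUP BY id ORDER BY NULL);

-- Form 2: the same branches with a very high LIMIT added.
EXPLAIN
(SELECT id, MAX(val) AS max_val FROM demo_a GROUP BY id ORDER BY NULL LIMIT 100000000)
UNION ALL
(SELECT id, MAX(val) AS max_val FROM demo_b GROUP BY id ORDER BY NULL LIMIT 100000000);
```

If the Extra column of the branch rows still shows "Using filesort" in the second form, the workaround is not helping on that version.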

Maybe try something like
SELECT *
FROM
(
[your entire query here]
) DerivedTable
ORDER BY NULL
I've never used MySQL so forgive me if I'm missing the plot :)
EDIT: What if you run each individual query separately (which, as you say, works) but insert the results into a temporary table? Then, at the end, just select from the temp table.
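A rough sketch of that temporary-table idea, written in MySQL syntax from the docs rather than from experience. The tables demo_left and demo_right and the staging table tmp_result are hypothetical names standing in for the two halves of the real query.

```sql
-- Hypothetical source tables standing in for the two halves of the UNION.
CREATE TABLE demo_left  (id INT NOT NULL, val INT NOT NULL);
CREATE TABLE demo_right (id INT NOT NULL, val INT NOT NULL);
INSERT INTO demo_left  VALUES (1, 10), (1, 20), (2, 5);
INSERT INTO demo_right VALUES (3, 7), (3, 9);

-- Staging table that collects both result sets.
CREATE TEMPORARY TABLE tmp_result (id INT NOT NULL, max_val INT NOT NULL);

-- Each half runs as its own statement, so ORDER BY NULL applies to it directly.
INSERT INTO tmp_result
SELECT id, MAX(val) FROM demo_left GROUP BY id ORDER BY NULL;

INSERT INTO tmp_result
SELECT id, MAX(val) FROM demo_right GROUP BY id ORDER BY NULL;

-- Read everything back; no UNION is involved.
SELECT id, max_val FROM tmp_result;

DROP TEMPORARY TABLE tmp_result;
```

Note that this behaves like UNION ALL: rows that appear in both halves are not de-duplicated unless you add a DISTINCT to the final SELECT.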

Have you tried changing the UNION to UNION ALL?
A UNION tries to remove duplicate rows. To do that, it has to sort the intermediate results, which might explain what you are seeing in your execution plan.
From MySQL Union:
By default the MySQL UNION removes all duplicate rows from the result set even if you don’t explicit using DISTINCT after the keyword UNION.
If you use UNION ALL explicitly, the duplicate rows remain in the result set. You only use this in the cases that you want to keep duplicate rows or you are sure that there is no duplicate rows in the result set.
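The difference is easy to see with two literal rows. This is just an illustration with constants, not the query from the question:

```sql
-- UNION removes duplicates: this returns a single row with n = 1.
SELECT 1 AS n UNION SELECT 1;

-- UNION ALL keeps duplicates: this returns two rows, both with n = 1.
SELECT 1 AS n UNION ALL SELECT 1;
```

If the two halves of your query can never produce identical rows, UNION ALL returns the same result without the de-duplication step.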
Edit
I doubt it will make any difference (it might even be worse), but could you try the following "equivalent" query:
select *
from (
select b.geo_boundary_id
, GeoBoundaryContains( c.tpi_geo_boundary_coverage_type_id, 29.287437, -95.055807, b.centroid_lat, b.centoid_lon, b.radius ) in_out
from geo_boundary b
join trackpoint_index_geo_boundary_mem c on c.geo_boundary_id = b.geo_boundary_id
where b.geo_boundary_type_id = 2
group by
b.geo_boundary_id
union all
select a.geo_boundary_id
, GeoBoundaryContains(c.tpi_geo_boundary_coverage_type_id, 29.287437, -95.055807, a.lat, a.lon, a.geo_boundary_vertex_id ) in_out
from geo_boundary_vertex a
join geo_boundary b on b.geo_boundary_id = a.geo_boundary_id
join trackpoint_index_geo_boundary_mem c on c.geo_boundary_id = b.geo_boundary_id
where b.geo_boundary_type_id = 3
group by
a.geo_boundary_id
) s
inner join (
select e.inquiry_match_type_id
, e.inquiry_id
, e.external_id
, COALESCE(f.inquiry_device_id,0) inquiry_device_id
, b.external_info1
, b.external_info2
, b.geo_boundary_id
, b.geo_boundary_type_id
from geo_boundary b
join trackpoint_index_geo_boundary_mem c on c.geo_boundary_id = b.geo_boundary_id
join trackpoint_index_mem d on d.trackpoint_index_id = c.trackpoint_index_id
join inquiry_mem e on e.inquiry_id = b.inquiry_id
left outer join inquiry_device_mem f on f.inquiry_id = e.inquiry_id and f.device_id = 3201
where d.trackpoint_index_id = 3127
and b.geo_boundary_type_id IN (2, 3)
and e.expiration_date >= now()
) r on r.geo_boundary_id = s.geo_boundary_id
