MySQL: Optimize Join

I am currently trying to optimize a MySQL statement. It takes about 10 seconds and outputs the average difference of two integers. The event table contains 6 columns and is indexed by its id and also by run_id combined with every other key.
The table holds 3308000 rows for run_id 37, 4162050 in total.
Most of the time seems to be spent on the join, so maybe there is a way to speed it up.
send.element_id and recv.element_id are unique; is there a way to express that in SQL that might lead to better performance?
|-------------------
|Column Type
|-------------------
|run_id int(11)
|element_id int(11)
|event_id int(11) PRIMARY
|event_time int(11)
|event_type varchar(20)
|event_data varchar(20)
The Query:
select avg(recv.event_time-send.event_time)
from
(
select element_id, event_time
from event
where run_id = 37 and event_type='SEND_FLIT'
) send,
(
select element_id, event_time
from event
where run_id = 37 and event_type='RECV_FLIT'
) recv
where recv.element_id = send.element_id
The Explain of the Query:
+----+-------------+------------+------+-----------------------------------------------------+-------------+---------+-------------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
+----+-------------+------------+------+-----------------------------------------------------+-------------+---------+-------------+--------+-----------------------+
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 499458 | NULL |
| 1 | PRIMARY | <derived2> | ref | <auto_key0> | <auto_key0> | 4 | element_id | 10 | NULL |
| 3 | DERIVED | event | ref | run_id,run_id_2,run_id_3,run_id_4,run_id_5,run_id_6 | run_id_5 | 26 | const,const | 499458 | Using index condition |
| 2 | DERIVED | event | ref | run_id,run_id_2,run_id_3,run_id_4,run_id_5,run_id_6 | run_id_5 | 26 | const,const | 562556 | Using index condition |
+----+-------------+------------+------+-----------------------------------------------------+-------------+---------+-------------+--------+-----------------------+

One way is to group by element_id and to use sum to determine the difference, which you can then pass to avg.
select avg(diff) from (
select
sum(case when event_type = 'SEND_FLIT' then -1 * event_time else event_time end)
as diff
from event
where run_id = 37
and event_type in ('SEND_FLIT','RECV_FLIT')
group by element_id
) t
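To sanity-check that this rewrite returns the same number as the original join, here is a minimal sketch. It runs against an in-memory SQLite database from Python rather than MySQL, and the sample rows are made up for illustration. Note that the equivalence relies on every element_id having exactly one SEND_FLIT and one RECV_FLIT row within the run; an element with only one of the two would be dropped by the join but would still contribute to the grouped average.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL event table (assumption: only the
# columns the two queries touch are modelled; the rows are invented).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE event (
        event_id   INTEGER PRIMARY KEY,
        run_id     INTEGER,
        element_id INTEGER,
        event_time INTEGER,
        event_type TEXT
    )
""")
conn.executemany(
    "INSERT INTO event (run_id, element_id, event_time, event_type) "
    "VALUES (?, ?, ?, ?)",
    [
        (37, 1, 100, "SEND_FLIT"), (37, 1, 130, "RECV_FLIT"),
        (37, 2, 200, "SEND_FLIT"), (37, 2, 250, "RECV_FLIT"),
        (38, 3, 300, "SEND_FLIT"), (38, 3, 900, "RECV_FLIT"),  # other run
    ],
)

# Original approach: pair each SEND row with its RECV row via a self-join.
join_avg = conn.execute("""
    SELECT AVG(recv.event_time - send.event_time)
    FROM event AS send
    JOIN event AS recv ON recv.element_id = send.element_id
    WHERE send.run_id = 37 AND send.event_type = 'SEND_FLIT'
      AND recv.run_id = 37 AND recv.event_type = 'RECV_FLIT'
""").fetchone()[0]

# Rewrite from the answer: one pass, negative SEND time plus positive RECV time.
grouped_avg = conn.execute("""
    SELECT AVG(diff) FROM (
        SELECT SUM(CASE WHEN event_type = 'SEND_FLIT'
                        THEN -event_time ELSE event_time END) AS diff
        FROM event
        WHERE run_id = 37
          AND event_type IN ('SEND_FLIT', 'RECV_FLIT')
        GROUP BY element_id
    ) AS t
""").fetchone()[0]

print(join_avg, grouped_avg)
```

On this sample both queries average the differences 30 and 50. Whether the grouped form is actually faster on the real table still has to be measured there.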

Related

How to perform a sum for all previous records

I've been trying to implement the solution here with the added flavour of updating existing records. As an MRE I'm looking to populate the sum_date_diff column in a table with the sum of all the differences between the current row date and the date of every previous row where the current row p1_id matches the previous row p1_id or p2_id. I have already filled out the expected result below:
+-----+------------+-------+-------+---------------+
| id_ | date_time | p1_id | p2_id | sum_date_diff |
+-----+------------+-------+-------+---------------+
| 1 | 2000-01-01 | 1 | 2 | Null |
| 2 | 2000-01-02 | 2 | 4 | 1 |
| 3 | 2000-01-04 | 1 | 3 | 3 |
| 4 | 2000-01-07 | 2 | 5 | 11 |
| 5 | 2000-01-15 | 2 | 3 | 35 |
| 6 | 2000-01-20 | 1 | 3 | 35 |
| 7 | 2000-01-31 | 1 | 3 | 68 |
+-----+------------+-------+-------+---------------+
My query so far looks like:
UPDATE test.sum_date_diff AS sdd0
JOIN
(SELECT
id_,
SUM(DATEDIFF(sdd1.date_time, sq.date_time)) AS sum_date_diff
FROM
test.sum_date_diff AS sdd1
LEFT OUTER JOIN (SELECT
sdd2.date_time AS date_time, sdd2.p1_id AS player_id
FROM
test.sum_date_diff AS sdd2 UNION ALL SELECT
sdd3.date_time AS date_time, sdd3.p2_id AS player_id
FROM
test.sum_date_diff AS sdd3) AS sq ON sq.date_time < sdd1.date_time
AND sq.player_id = sdd1.p1_id
GROUP BY sdd1.id_) AS master_sq ON master_sq.id_ = sdd0.id_
SET
sdd0.sum_date_diff = master_sq.sum_date_diff
This works as shown here.
However, on a table of 1.5m records the query has been hanging for the last hour. Even when I add a WHERE clause onto the bottom to restrict the update to a single record then it hangs for 5 mins+.
Here is the EXPLAIN statement for the query on the full table:
+----+-------------+---------------+------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+-------+---------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+-------+---------+----------+--------------------------------------------+
| 1 | UPDATE | sum_date_diff | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100 | NULL |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 4 | const | 10 | 100 | NULL |
| 2 | DERIVED | sum_date_diff | NULL | index | PRIMARY,ix__match_oc_history__date_time,ix__match_oc_history__p1_id,ix__match_oc_history__p2_id,ix__match_oc_history__date_time_players | ix__match_oc_history__date_time_players | 14 | NULL | 1484288 | 100 | Using index; Using temporary |
| 2 | DERIVED | <derived3> | NULL | ALL | NULL | NULL | NULL | NULL | 2968576 | 100 | Using where; Using join buffer (hash join) |
| 3 | DERIVED | sum_date_diff | NULL | index | NULL | ix__match_oc_history__date_time_players | 14 | NULL | 1484288 | 100 | Using index |
| 4 | UNION | sum_date_diff | NULL | index | NULL | ix__match_oc_history__date_time_players | 14 | NULL | 1484288 | 100 | Using index |
+----+-------------+---------------+------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+-------+---------+----------+--------------------------------------------+
Here is the CREATE TABLE statement:
CREATE TABLE `sum_date_diff` (
`id_` int NOT NULL AUTO_INCREMENT,
`date_time` datetime DEFAULT NULL,
`p1_id` int NOT NULL,
`p2_id` int NOT NULL,
`sum_date_diff` int DEFAULT NULL,
PRIMARY KEY (`id_`),
KEY `ix__sum_date_diff__date_time` (`date_time`),
KEY `ix__sum_date_diff__p1_id` (`p1_id`),
KEY `ix__sum_date_diff__p2_id` (`p2_id`),
KEY `ix__sum_date_diff__date_time_players` (`date_time`,`p1_id`,`p2_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1822120 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
The MySQL version is 8.0.26, running on a 2016 MacBook Pro (macOS Monterey) with 16 GB of RAM.
After reading around about boosting the RAM available to MySQL I've added the following to the standard my.cnf file:
innodb_buffer_pool_size = 8G
tmp_table_size=2G
max_heap_table_size=2G
I'm wondering if:
I've done something wrong
This is just a very slow task no matter what I do
There is a faster method
I'm hoping someone could enlighten me!
While it is possible to do calculations like this in SQL, it is messy. If the number of rows is not in the millions, I would fetch the necessary columns into my application and do the arithmetic there. (Loops are easier and faster in PHP/Java/etc. than in SQL.)
LEAD() and LAG() are possible, but they are not optimized well (at least in my experience). In an application language, it is easy and efficient to look things up in arrays.
The SELECT can (easily and efficiently) do any filtering and sorting so that the app only receives the necessary data.
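As a rough sketch of the application-side approach, assume the rows have been fetched as (id_, date, p1_id, p2_id) tuples and that date_time carries whole dates, as in the example table. A single pass in date order, keeping a running count and a running sum of day numbers per player, then gives the same figures without comparing every row to every earlier row. The function name and the tuple layout are my own choices, not something from the question.

```python
from collections import defaultdict
from datetime import date
from itertools import groupby


def sum_date_diffs(rows):
    """Return {id_: sum of day differences to earlier rows, or None}.

    For each row, the "earlier rows" are those with a strictly earlier date
    in which the row's p1_id appears as either p1_id or p2_id.
    """
    seen_count = defaultdict(int)     # player -> number of earlier appearances
    seen_day_total = defaultdict(int)  # player -> sum of their day ordinals
    result = {}

    ordered = sorted(rows, key=lambda r: r[1])
    # Handle rows sharing a date together, so that same-day rows do not
    # count as "previous" for one another (mirrors the strict < in the SQL).
    for day, same_day in groupby(ordered, key=lambda r: r[1]):
        same_day = list(same_day)
        d = day.toordinal()
        for id_, _, p1, _ in same_day:
            n = seen_count[p1]
            # sum(d - d_k) over n earlier days equals n*d - sum(d_k)
            result[id_] = n * d - seen_day_total[p1] if n else None
        for _, _, p1, p2 in same_day:
            for player in (p1, p2):
                seen_count[player] += 1
                seen_day_total[player] += d
    return result


# The example rows from the question.
rows = [
    (1, date(2000, 1, 1), 1, 2),
    (2, date(2000, 1, 2), 2, 4),
    (3, date(2000, 1, 4), 1, 3),
    (4, date(2000, 1, 7), 2, 5),
    (5, date(2000, 1, 15), 2, 3),
    (6, date(2000, 1, 20), 1, 3),
    (7, date(2000, 1, 31), 1, 3),
]
diffs = sum_date_diffs(rows)
print(diffs)
```

For the seven example rows this reproduces the expected sum_date_diff column (NULL, 1, 3, 11, 35, 35, 68). The results could then be written back in batches of UPDATE statements keyed on id_.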

TPCH Query Optimization

The following query has been running for 5 hours so far:
INSERT $LINEITEM_PUBLIC SELECT *
FROM LINEITEM
WHERE L_PARTKEY IN ( SELECT P_PARTKEY FROM $PART_PUBLIC )
AND L_SUPPKEY IN ( SELECT S_SUPPKEY FROM $SUPPLIER_PUBLIC )
AND L_ORDERKEY IN ( SELECT O_ORDERKEY FROM $ORDERS_PUBLIC );
I added all required indexes but nothing seems to be helping. The Query Explain Plan prints the following:
+----+-------------+------------------+------------+--------+--------------------------------+-------------+---------+--------------------------------+----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+--------+--------------------------------+-------------+---------+--------------------------------+----------+----------+-------------+
| 1 | INSERT | $LINEITEM_PUBLIC | NULL | ALL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 1 | SIMPLE | $ORDERS_PUBLIC | NULL | index | PRIMARY | O_ORDERDATE | 3 | NULL | 12826617 | 100.00 | Using index |
| 1 | SIMPLE | LINEITEM | NULL | ref | PRIMARY,LINEITEM_FK2,L_SUPPKEY | PRIMARY | 4 | TPCH.$ORDERS_PUBLIC.O_ORDERKEY | 3 | 100.00 | NULL |
| 1 | SIMPLE | $SUPPLIER_PUBLIC | NULL | eq_ref | PRIMARY | PRIMARY | 4 | TPCH.LINEITEM.L_SUPPKEY | 1 | 100.00 | Using index |
| 1 | SIMPLE | $PART_PUBLIC | NULL | eq_ref | PRIMARY | PRIMARY | 4 | TPCH.LINEITEM.L_PARTKEY | 1 | 100.00 | Using index |
+----+-------------+------------------+------------+--------+--------------------------------+-------------+---------+--------------------------------+----------+----------+-------------+
Any recommendations on how this query can be optimized?
Update:
The size of the tables in the previous query is as follows:
LINEITEM: 60M records
$ORDERS_PUBLIC: 13M records
$SUPPLIER_PUBLIC: 92K records
$PART_PUBLIC: 2M records
Make sure there is an index starting with O_ORDERKEY.
IN (SELECT ...) may be optimized poorly (depending on version); try this:
INSERT $LINEITEM_PUBLIC
SELECT l.*
FROM LINEITEM AS l
WHERE EXISTS( SELECT * FROM $PART_PUBLIC WHERE P_PARTKEY = L_PARTKEY )
AND EXISTS( SELECT * FROM $SUPPLIER_PUBLIC WHERE S_SUPPKEY = L_SUPPKEY )
AND EXISTS( SELECT * FROM $ORDERS_PUBLIC WHERE O_ORDERKEY = L_ORDERKEY );

MySQL confused about IN (CONST vs UNION vs SELECT FROM (UNION))

Can someone please explain why there is such a big difference between these queries?
The results of all of them are exactly the same.
Performance of query 1: very good, query 2: bad, query 3: good.
Why does the select from table test (id 1) in query 2 scan all rows? And why does possible_keys not contain PRIMARY, which is actually used?
Table:
CREATE TABLE `test` (
`id` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `test` ADD PRIMARY KEY (`id`);
Data:
DROP PROCEDURE IF EXISTS insert1000;
DELIMITER $$
CREATE PROCEDURE insert1000()
BEGIN
SET @i = 1;
WHILE @i < 1000 DO
INSERT INTO `test` VALUES (@i);
SET @i = @i + 1;
END WHILE;
END
$$
DELIMITER ;
CALL insert1000();
DROP PROCEDURE insert1000;
Query 1:
SELECT `id` FROM `test` WHERE `id` IN (2, 3)
Query 1 explanation:
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | test | range | PRIMARY | PRIMARY | 4 | NULL | 2 | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
Query 2:
SELECT `id` FROM `test` WHERE `id` IN (SELECT 2 UNION SELECT 3)
Query 2 explanation:
+------+--------------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
| 1 | PRIMARY | test | index | NULL | PRIMARY | 4 | NULL | 999 | Using where; Using index |
+------+--------------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
| 2 | DEPENDENT SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+------+--------------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
| 3 | DEPENDENT UNION | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+------+--------------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
| NULL | UNION RESULT | <union2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+------+--------------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
Query 3:
SELECT `id` FROM `test` WHERE `id` IN (SELECT * FROM (SELECT 2 UNION SELECT 3) AS `derived`)
Query 3 explanation:
+------+--------------+-------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------+-------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| 1 | PRIMARY | <subquery2> | ALL | distinct_key | NULL | NULL | NULL | 2 | |
+------+--------------+-------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| 1 | PRIMARY | test | eq_ref | PRIMARY | PRIMARY | 4 | derived.2 | 1 | Using where; Using index |
+------+--------------+-------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| 2 | MATERIALIZED | <derived3> | ALL | NULL | NULL | NULL | NULL | 2 | |
+------+--------------+-------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+------+--------------+-------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| 4 | UNION | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+------+--------------+-------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| NULL | UNION RESULT | <union3,4> | ALL | NULL | NULL | NULL | NULL | NULL | |
+------+--------------+-------------+--------+---------------+---------+---------+-----------+------+--------------------------+
The Inner workings of the MySQL optimizer...
While query 2 and query 3 both require a full table scan (can't use the index), their different syntax makes the optimizer use different strategies.
You can see it more clearly(ish) by running EXPLAIN EXTENDED SELECT ... and then running SHOW WARNINGS;.
Here's the extended plan for query 2:
select `test`.`id` AS `id`
from `test`
where <in_optimizer>(`test`.`id`,<exists>(select 2 having (<cache>(`test`.`id`) = <ref_null_helper>(2))
union
select 3 having (<cache>(`test`.`id`) = <ref_null_helper>(3))
))
The optimizer translates IN to EXISTS and then evaluates the two queries SELECT 2 and SELECT 3 against every row it scans in test, which is why they are listed as DEPENDENT in the plan.
Here's the extended plan for query 3:
select `test`.`id` AS `id`
from `test`
where <in_optimizer>(`test`.`id`,<exists>(select 1 from (select 2 AS `2` union select 3 AS `3`) `derived` where (<cache>(`test`.`id`) = `derived`.`2`)))
You can see that in this case the optimizer is running your original UNION to create a derived table with the values 2 and 3, and then compares this table once to the data it scans in table test.

Debugging Slow MySQL query with EXPLAIN

I have found an inefficient query in our system. content holds versions of slides, and this is supposed to select the highest version of a slide by id.
SELECT `content`.*
FROM (`content`)
JOIN (
SELECT max(version) as `version` from `content`
WHERE `slide_id` = '16901'
group by `slide_id`
) c ON `c`.`version` = `content`.`version`;
EXPLAIN
+----+-------------+------------------+------------+--------+--------------------------------------------------------------------------------+------------------------------------+---------+-------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+--------+--------------------------------------------------------------------------------+------------------------------------+---------+-------+------+----------+--------------------------+
| 1 | PRIMARY | <derived2> | NULL | system | NULL | NULL | NULL | NULL | 1 | 100.00 | NULL |
| 1 | PRIMARY | content | NULL | ref | PRIMARY,version | PRIMARY | 8 | const | 9703 | 100.00 | NULL |
| 2 | DERIVED | content | NULL | ref | PRIMARY,fk_content_slides_idx,thumbnail_asset_id,version,slide_id | fk_content_slides_idx | 8 | const | 1 | 100.00 | Using where; Using index |
+----+-------------+------------------+------------+--------+--------------------------------------------------------------------------------+------------------------------------+---------+-------+------+----------+--------------------------+
One big issue is that it returns almost all the slides in the system as the outer query does not filter by slide id. After adding that I get...
SELECT `content`.*
FROM (`content`)
JOIN (
SELECT max(version) as `version` from `content`
WHERE `slide_id` = '16901' group by `slide_id`
) c ON `c`.`version` = `content`.`version`
WHERE `slide_id` = '16901';
EXPLAIN
+----+-------------+------------------+------------+--------+--------------------------------------------------------------------------------+------------------------------------+---------+-------------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+--------+--------------------------------------------------------------------------------+------------------------------------+---------+-------------+------+----------+--------------------------+
| 1 | PRIMARY | <derived2> | NULL | system | NULL | NULL | NULL | NULL | 1 | 100.00 | NULL |
| 1 | PRIMARY | content | NULL | const | PRIMARY,fk_content_slides_idx,version,slide_id | PRIMARY | 16 | const,const | 1 | 100.00 | NULL |
| 2 | DERIVED | content | NULL | ref | PRIMARY,fk_content_slides_idx,thumbnail_asset_id,version,slide_id | fk_content_slides_idx | 8 | const | 1 | 100.00 | Using where; Using index |
+----+-------------+------------------+------------+--------+--------------------------------------------------------------------------------+------------------------------------+---------+-------------+------+----------+--------------------------+
That correctly reduces the number of rows to one, but doesn't really speed things up.
There are indexes on version, slide_id and a unique key on version AND slide_id.
Is there anything else I can do to speed this up?
Use LIMIT 1 (with ORDER BY version DESC) instead of MAX?
MySQL seems to take an index (version, slide_id) to join the tables. You should get a better result with
SELECT `content`.*
FROM `content`
FORCE INDEX FOR JOIN (fk_content_slides_idx)
join (
SELECT `slide_id`, max(version) as `version` from `content`
WHERE `slide_id` = '16901' group by `slide_id`
) c ON `c`.`slide_id` = `content`.`slide_id` and `c`.`version` = `content`.`version`
You need an index that has slide_id as its first column. I just guessed that this is fk_content_slides_idx; if not, use another one.
The part FORCE INDEX FOR JOIN (fk_content_slides_idx) is only there to enforce it; you should check whether MySQL picks that index by itself without forcing (it should).
You might get a slightly better result still with an index on (slide_id, version). Whether you see a difference depends on the amount of data (e.g. the number of versions per id). You should not pile up indexes, and you already have a lot on this table, but you can try it for fun.
Just a suggestion: I think you should drop the group by slide_id, because you are filtering by one slide_id only (16901).
SELECT `content`.*
FROM (`content`)
JOIN (
SELECT max(version) as `version` from `content`
WHERE `slide_id` = '16901'
) c ON `c`.`version` = `content`.`version`
WHERE `slide_id` = '16901';
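One of the replies above suggests LIMIT 1 instead of MAX. A minimal sketch of that variant follows, run against an in-memory SQLite database from Python rather than MySQL. The table is a cut-down stand-in: the body column and the (slide_id, version) primary key are assumptions for illustration, not the real schema.

```python
import sqlite3

# Cut-down stand-in for the content table (assumed columns and key).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE content (
        slide_id INTEGER,
        version  INTEGER,
        body     TEXT,
        PRIMARY KEY (slide_id, version)
    )
""")
conn.executemany(
    "INSERT INTO content VALUES (?, ?, ?)",
    [
        (16901, 1, "draft"),
        (16901, 2, "review"),
        (16901, 3, "final"),
        (16902, 7, "other slide"),  # must not leak into the result
    ],
)

# Filter on the slide first, then take the single highest version.
latest = conn.execute(
    "SELECT slide_id, version, body FROM content "
    "WHERE slide_id = ? ORDER BY version DESC LIMIT 1",
    (16901,),
).fetchone()
print(latest)
```

With an index that starts with slide_id and continues with version, the database can read the newest row for that slide straight from the end of the index range, so no derived table or self-join is needed. Whether that beats the MAX form on the real table would need checking with EXPLAIN.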

select takes 20 seconds, delete still running @ 30 minutes

This select query takes about 20 seconds to complete.
select Count(*)
from products as bad_rows
inner join (
select pid, MAX(last_updated_date) as maxdate
from products
group by pid
having count(*) > 1
) as good_rows on good_rows.pid= bad_rows.pid
and good_rows.maxdate <> bad_rows.last_updated_date
where bad_rows.available = 0
The delete, on the other hand, is still running after 30 minutes!
delete bad_rows
from products as bad_rows
inner join (
select pid, MAX(last_updated_date) as maxdate
from products
group by pid
having count(*) > 1
) as good_rows on good_rows.pid= bad_rows.pid
and good_rows.maxdate <> bad_rows.last_updated_date
where bad_rows.available = 0
Why?
Table Schema is as follows:
Explain for the select is as follows:
+----+-------------+------------+------+---------------+------+---------+------+-------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+------+---------+------+-------+--------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 6253 | |
| 1 | PRIMARY | bad_rows | ALL | NULL | NULL | NULL | NULL | 34603 | Using where; Using join buffer |
| 2 | DERIVED | products | ALL | NULL | NULL | NULL | NULL | 34603 | Using temporary; Using filesort |
+----+-------------+------------+------+---------------+------+---------+------+-------+--------------------------------+
OK, so I just googled the EXPLAIN results, which hinted that my query could be slow because there was no index on pid. It didn't actually say that; I just had a hunch from reading about the EXPLAIN output.
So I added an index on pid and voilà: the delete was over in 1 minute!
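For anyone who wants to see the effect of such an index in a plan, here is a small sketch. It uses an in-memory SQLite database from Python, so the plan text differs from MySQL's EXPLAIN output, and the table layout, the generated rows and the index name idx_products_pid are made up for illustration.

```python
import sqlite3

# Stand-in products table with generated rows (assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        id                INTEGER PRIMARY KEY,
        pid               INTEGER,
        last_updated_date TEXT,
        available         INTEGER
    )
""")
conn.executemany(
    "INSERT INTO products (pid, last_updated_date, available) VALUES (?, ?, ?)",
    [(n % 500, "2020-01-01", 0) for n in range(5000)],
)


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


lookup = "SELECT * FROM products WHERE pid = 42"

before = plan(lookup)  # no index on pid yet: the whole table is scanned
conn.execute("CREATE INDEX idx_products_pid ON products (pid)")
after = plan(lookup)   # the lookup can now go through idx_products_pid

print(before)
print(after)
```

The same idea applies to the join on pid in the delete above: without an index every row of the derived table has to be matched against the whole products table, which is what the ALL entries in the EXPLAIN output point at.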