Extreme query optimization with IN clause and subquery - mysql

My table currently has more than 15 million rows.
I need to run a query like this:
SELECT ch1.* FROM citizens_dynamic ch1
WHERE ch1.id IN (4369943, ..., 4383420, 4383700)
AND ch1.update_id_to = (
SELECT MAX(ch2.update_id_to)
FROM citizens_dynamic ch2
WHERE ch1.id = ch2.id AND ch2.update_id_to < 812
)
Basically, for every citizen in the IN clause, it finds the row with the highest update_id_to that is still lower than the specified value.
There is a PRIMARY key on two columns: update_id_to, id.
At the moment, this query runs in 0.9s (with 100 ids in the IN clause).
That is still too slow; my scripts would need to run for 3 days to complete.
Below you can see my EXPLAIN output.
The id index is just like the PRIMARY key, but with the columns reversed: id, update_id_to.
Do you have any ideas how to make it even faster?

I've found that MySQL tends to perform better with a JOIN than with correlated subqueries.
SELECT ch1.*
FROM citizens_dynamic AS ch1
JOIN (SELECT id, MAX(update_id_to) AS update_id_to
FROM citizens_dynamic
WHERE id IN (4369943, ..., 4383420, 4383700)
AND update_id_to < 812
GROUP BY id) AS ch2
ON ch1.id = ch2.id AND ch1.update_id_to = ch2.update_id_to
WHERE ch1.id IN (4369943, ..., 4383420, 4383700)
Also, see the other methods in this question:
Retrieving the last record in each group
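To make the join rewrite concrete, here is a minimal sketch that runs both forms against a tiny in-memory SQLite table and compares the results. SQLite only stands in for MySQL here, so it says nothing about timing; the payload column, the index name idx_id, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE citizens_dynamic ("
    " update_id_to INTEGER, id INTEGER, payload TEXT,"
    " PRIMARY KEY (update_id_to, id))"
)
# Hypothetical name for the reversed (id, update_id_to) index from the question.
conn.execute("CREATE INDEX idx_id ON citizens_dynamic (id, update_id_to)")
conn.executemany(
    "INSERT INTO citizens_dynamic VALUES (?, ?, ?)",
    [
        (100, 1, "a"), (500, 1, "b"), (900, 1, "c"),
        (200, 2, "d"), (811, 2, "e"),
        (812, 3, "f"),  # no version below the cutoff, so id 3 must not appear
    ],
)

CUTOFF = 812
IDS = (1, 2, 3)

# Correlated-subquery form, as in the question.
correlated = """
SELECT ch1.id, ch1.update_id_to, ch1.payload
FROM citizens_dynamic ch1
WHERE ch1.id IN (?, ?, ?)
  AND ch1.update_id_to = (
      SELECT MAX(ch2.update_id_to)
      FROM citizens_dynamic ch2
      WHERE ch2.id = ch1.id AND ch2.update_id_to < ?)
ORDER BY ch1.id
"""

# Join form: the derived table computes the per-id maximum below the cutoff,
# and the join matches on BOTH id and update_id_to.
joined = """
SELECT ch1.id, ch1.update_id_to, ch1.payload
FROM citizens_dynamic AS ch1
JOIN (SELECT id, MAX(update_id_to) AS update_id_to
      FROM citizens_dynamic
      WHERE id IN (?, ?, ?) AND update_id_to < ?
      GROUP BY id) AS ch2
  ON ch1.id = ch2.id AND ch1.update_id_to = ch2.update_id_to
ORDER BY ch1.id
"""

correlated_rows = conn.execute(correlated, (*IDS, CUTOFF)).fetchall()
joined_rows = conn.execute(joined, (*IDS, CUTOFF)).fetchall()
print(correlated_rows)
print(joined_rows)
```

Both forms only agree if the derived table applies the same cutoff and the join compares update_id_to as well as id; without those two conditions the join returns every version of each citizen.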

Related

Why two mysql selects executed separately are much faster than one combined?

I want to understand why two queries that take around 400ms in total when run separately take around 12 seconds when I combine them using a sub-select.
I have two InnoDB tables:
event: 99 914 rows
event_prizes: 24 540 770 rows
Below are my queries:
SELECT
id
FROM
event e
WHERE
e.status != 'SCHEDULED';
-- takes 130ms, returns 2406 rows
SELECT
id, count(*)
FROM
event_prizes
WHERE event_id in (
-- 2406 ids returned from the previous query
)
GROUP BY
id;
-- takes 270ms, returns the same amount of rows
On the other hand, when I run the query below:
SELECT
id, count(*)
FROM
event_prizes
WHERE event_id in (
SELECT
id
FROM
event e
WHERE
e.status != 'SCHEDULED'
)
GROUP BY
id;
-- takes 12 seconds
I guess in the second case MySQL does a full scan of the event_prizes table?
Is there a better way to write a single query for this case?
You can use an INNER JOIN instead of a sub-select:
SELECT ep.id, COUNT(*)
FROM event_prizes ep INNER JOIN event e ON ep.event_id = e.id
WHERE e.status <> 'SCHEDULED'
GROUP BY ep.id
Make sure you have:
a PRIMARY KEY on event.id
a PRIMARY KEY on event_prizes.id
a FOREIGN KEY on event_prizes.event_id
You can also try at least the following index:
event(status)
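As a rough check that the INNER JOIN is a drop-in replacement for the IN sub-select, the sketch below builds a toy version of both tables in SQLite and compares the two result sets. The index names and sample rows are invented, and SQLite is only a stand-in: it confirms that the results match, not that MySQL will pick a faster plan. The rewrite is equivalent because event.id is unique, so the join cannot duplicate event_prizes rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE event_prizes (
    id INTEGER PRIMARY KEY,
    event_id INTEGER NOT NULL REFERENCES event (id)
);
-- Hypothetical index names; SQLite does not index foreign keys by itself.
CREATE INDEX idx_event_status ON event (status);
CREATE INDEX idx_event_prizes_event_id ON event_prizes (event_id);
INSERT INTO event VALUES (1, 'SCHEDULED'), (2, 'FINISHED'), (3, 'CANCELLED');
INSERT INTO event_prizes VALUES (10, 1), (11, 2), (12, 2), (13, 3);
""")

# Sub-select form from the question.
in_subquery = """
SELECT id, COUNT(*)
FROM event_prizes
WHERE event_id IN (SELECT id FROM event e WHERE e.status != 'SCHEDULED')
GROUP BY id
ORDER BY id
"""

# INNER JOIN form from the answer.
inner_join = """
SELECT ep.id, COUNT(*)
FROM event_prizes ep INNER JOIN event e ON ep.event_id = e.id
WHERE e.status <> 'SCHEDULED'
GROUP BY ep.id
ORDER BY ep.id
"""

subquery_rows = conn.execute(in_subquery).fetchall()
join_rows = conn.execute(inner_join).fetchall()
print(subquery_rows)
print(join_rows)
```

On the real tables, running EXPLAIN on both forms in MySQL is the way to confirm whether the join avoids the full scan of event_prizes.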

Query is taking too long even with 1k results

I have run several tests to optimize the query below, but none of them helped.
What I tried:
Adding extra indexes
Changing the query logic by checking other attributes as well in the IN clause
Testing suggestions from online query optimization tools (EverSQL etc.)
Indexes I am using:
radacct (`_accttime`);
radacct (`username`);
radacct (`acctstoptime`,`_accttime`);
Complete query:
(SELECT *
FROM `radacct`
WHERE (radacct._accttime > NOW() - INTERVAL 1.2 HOUR)
AND radacct.acctstoptime IN
(SELECT MAX(radacct.acctstoptime)
FROM `radacct`
GROUP BY radacct.username) )
UNION
(SELECT *
FROM `radacct`
WHERE (radacct._accttime >= DATE_SUB(NOW(), INTERVAL 2 MONTH)
AND radacct.acctstoptime IS NULL) )
When I execute the SELECT statements above by themselves, they take only a few milliseconds.
The issue is with the IN clause; that is the query that takes ages.
As I see it, your problem is the dependent subquery in your IN. Apparently the optimizer doesn't realize that the subquery's result doesn't change from row to row (also, the query itself might be suboptimal). Essentially, the subquery is executed for each row, which is bad.
Now we have to find out which part causes it to be treated as dependent, because it isn't really. My first try would be to give the inner table a different alias (quoted, because INNER is a reserved word):
IN (SELECT MAX(`inner`.acctstoptime) FROM radacct AS `inner` GROUP BY `inner`.username)
If that isn't enough to make it independent, turn it into a full-blown join (INNER, so that non-joined rows [= non-max rows] are discarded from the result):
INNER JOIN (
SELECT MAX(`inner`.acctstoptime) AS maxstoptime, `inner`.username
FROM `radacct` AS `inner`
GROUP BY `inner`.username
) sub ON (sub.maxstoptime=radacct.acctstoptime)
Hope that does the trick.
Note: since the result should contain each user's row with their max acctstoptime, it might, on rare occasions, contain more than one row for a user: a row whose acctstoptime isn't the max for THAT user but happens to match the max of another user. In the join version, you can just add another condition to the ON clause. In the IN subquery, you would drop the explicit GROUP BY and add WHERE radacct.username=`inner`.username (which would indeed make it an explicitly dependent subquery, but the optimizer might be able to handle it).
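The cross-user match described in this note is easy to reproduce. The following sketch uses a three-row SQLite table with integer timestamps; the radacctid column and the sample values are assumptions made for the example. It compares a join on acctstoptime alone with a join on acctstoptime plus username.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE radacct (
    radacctid INTEGER PRIMARY KEY,   -- hypothetical surrogate key
    username TEXT NOT NULL,
    acctstoptime INTEGER             -- integers stand in for datetimes
);
INSERT INTO radacct (radacctid, username, acctstoptime) VALUES
    (1, 'alice', 10),   -- not alice's max, but equal to bob's max
    (2, 'alice', 20),
    (3, 'bob', 10);
""")

JOIN_TEMPLATE = """
SELECT r.radacctid
FROM radacct AS r
INNER JOIN (
    SELECT MAX(acctstoptime) AS maxstoptime, username
    FROM radacct
    GROUP BY username
) AS sub ON sub.maxstoptime = r.acctstoptime {extra_condition}
ORDER BY r.radacctid
"""

def matching_ids(extra_condition=""):
    """Return the radacctid values kept by the join."""
    sql = JOIN_TEMPLATE.format(extra_condition=extra_condition)
    return [row[0] for row in conn.execute(sql)]

# Joining on the timestamp alone also keeps alice's older session.
time_only = matching_ids()
# Adding the username comparison keeps exactly one row per user.
time_and_user = matching_ids("AND sub.username = r.username")

print(time_only)
print(time_and_user)
```

Row 1 survives the timestamp-only join because bob's maximum happens to be 10; the extra ON condition removes it.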
update: due to miscommunication ...
The resulting complete query with the join:
(SELECT DISTINCT radacct.*
FROM radacct
INNER JOIN (
SELECT MAX(`inner`.acctstoptime) AS maxstoptime, `inner`.username
FROM `radacct` AS `inner`
GROUP BY `inner`.username
) sub ON (sub.maxstoptime=radacct.acctstoptime)
WHERE (_accttime > NOW() - INTERVAL 1.2 HOUR)
)
UNION
(SELECT *
FROM `radacct`
WHERE (_accttime >= DATE_SUB(NOW(),INTERVAL 2 MONTH)
AND acctstoptime IS NULL)
)
You may still add the username comparison in the ON clause.
What this query does is remove the IN selector and force an intermediate result for the join (the max acctstoptime for each username). The join then matches a normal row to an intermediate result row if and only if its acctstoptime is the max for some user (or for THAT user, if you add the username comparison). A row that doesn't have the max acctstoptime, and thus has no join "partner", is discarded from the result (that is what INNER does; a LEFT JOIN would not be sufficient), leaving only the rows with a max acctstoptime in the first part of the union.

Optimize Query with MAX/MIN in HAVING clause

I have a query that is taking 17-20 seconds on our server, and I'd like to see what I can do to optimize it. We are on MySQL 5.6 and will be upgrading to 5.7 in the next couple of months.
Query:
SELECT pm.mid AS mid
FROM
pm_message pm
INNER JOIN pm_index pmi ON pmi.mid = pm.mid
GROUP BY pm.mid
HAVING (MIN(pmi.deleted) > 0 AND MAX(pmi.deleted) < '1535490002')
LIMIT 1000 OFFSET 0;
The mid column is a primary key in both pm_message and pm_index.
The tables have millions of records each:
select count(*) from pm_message;
3748290
select count(*) from pm_index;
6938947
Any suggestions for improving this query?
I'm wondering if making the 'deleted' column in the pm_index table an index would help?
I would completely rewrite the query, because you basically want a list of distinct mids whose deleted values all fall within a certain range. You do not need to display any data from the pm_index table, so I would use a correlated subquery with the NOT EXISTS operator. This way MySQL does not have to group and sort the entire pm_index table to get the mins and maxes.
SELECT pm.mid AS mid
FROM
pm_message pm
WHERE NOT EXISTS (SELECT 1 FROM pm_index WHERE pm_index.mid=pm.mid AND (pm_index.deleted<=0 OR pm_index.deleted>=1535490002))
The query would benefit from a multi-column index on the mid and deleted columns of the pm_index table.
This elaborates on Shadow's answer. Try using two NOT EXISTS clauses:
SELECT pm.mid
FROM pm_message pm
WHERE NOT EXISTS (SELECT 1
FROM pm_index pmi
WHERE pmi.mid = pm.mid AND
pmi.deleted <= 0
) AND
NOT EXISTS (SELECT 1
FROM pm_index pmi
WHERE pmi.mid = pm.mid AND
pmi.deleted >= 1535490002
) ;
And be sure you have an index on pm_index(mid, deleted). The index is very important. I'm breaking it into two clauses because OR can confuse the query optimizer.
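Two details decide whether a NOT EXISTS rewrite returns exactly the same mids as the HAVING MIN/MAX query: whether the bounds are inclusive, and what happens to messages with no pm_index rows at all. The sketch below checks both on a toy SQLite schema. The recipient column, the index name, the cutoff of 100, and the data are assumptions made for illustration; the helper not_exists_sql is likewise hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pm_message (mid INTEGER PRIMARY KEY);
CREATE TABLE pm_index (
    mid INTEGER NOT NULL,
    recipient INTEGER NOT NULL,      -- hypothetical second key column
    deleted INTEGER NOT NULL,
    PRIMARY KEY (mid, recipient)
);
CREATE INDEX idx_pm_index_mid_deleted ON pm_index (mid, deleted);
INSERT INTO pm_message VALUES (1), (2), (3), (4);
INSERT INTO pm_index VALUES
    (1, 1, 5), (1, 2, 10),    -- all values inside the range
    (2, 1, 0), (2, 2, 7),     -- one value exactly on the lower bound
    (3, 1, 5), (3, 2, 101);   -- one value above the cutoff
-- mid 4 has no pm_index rows at all
""")
CUTOFF = 100  # small stand-in for the timestamp in the question

HAVING_SQL = """
SELECT pm.mid
FROM pm_message pm
INNER JOIN pm_index pmi ON pmi.mid = pm.mid
GROUP BY pm.mid
HAVING MIN(pmi.deleted) > 0 AND MAX(pmi.deleted) < :cutoff
ORDER BY pm.mid
"""

def not_exists_sql(low_op, high_op, require_row=False):
    """Build the double NOT EXISTS query with the given comparison operators."""
    sql = f"""
    SELECT pm.mid
    FROM pm_message pm
    WHERE NOT EXISTS (SELECT 1 FROM pm_index pmi
                      WHERE pmi.mid = pm.mid AND pmi.deleted {low_op} 0)
      AND NOT EXISTS (SELECT 1 FROM pm_index pmi
                      WHERE pmi.mid = pm.mid AND pmi.deleted {high_op} :cutoff)
    """
    if require_row:
        # Mimic the INNER JOIN: the message must have at least one index row.
        sql += " AND EXISTS (SELECT 1 FROM pm_index pmi WHERE pmi.mid = pm.mid)"
    return sql + " ORDER BY pm.mid"

def mids(sql):
    return [row[0] for row in conn.execute(sql, {"cutoff": CUTOFF})]

having_mids = mids(HAVING_SQL)
strict_mids = mids(not_exists_sql("<", ">"))
inclusive_mids = mids(not_exists_sql("<=", ">="))
exact_mids = mids(not_exists_sql("<=", ">=", require_row=True))
print(having_mids, strict_mids, inclusive_mids, exact_mids)
```

With strict comparisons, mid 2 slips through because 0 is not less than 0. Even with inclusive bounds, mid 4 is returned because NOT EXISTS is trivially true for a message without index rows, which the original INNER JOIN would have dropped. Whether that matters depends on whether every pm_message row is guaranteed to have pm_index rows.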
Give this a try. It turns things inside out: it starts with pmi and minimizes how often pm is touched.
SELECT mid, MIN(deleted) AS mind, MAX(deleted) AS maxd
FROM pm_index AS pmi
GROUP BY mid
HAVING mind > 0
AND maxd < '1535490002'
AND EXISTS (
SELECT 1
FROM pm_message AS pm
WHERE pm.mid = pmi.mid
)
LIMIT 1000 OFFSET 0;
I doubt this will help much; the query seems to need to touch virtually all rows in both tables.
If all mid values in pmi definitely exist in pm, then my EXISTS clause can be removed. However, you say that both tables have PRIMARY KEY(mid)? I suspect pmi actually has a second column in its PK. Please provide SHOW CREATE TABLE.

How to make JOINS faster?

I had this query to start out with:
SELECT DISTINCT spentits.*
FROM `spentits`
WHERE (spentits.user_id IN
(SELECT following_id
FROM `follows`
WHERE `follows`.`follower_id` = '44'
AND `follows`.`accepted` = 1)
OR spentits.user_id = '44')
ORDER BY id DESC
LIMIT 15 OFFSET 0
This query takes 10ms to execute.
But once I add a simple join in:
SELECT DISTINCT spentits.*
FROM `spentits`
LEFT JOIN wishlist_items ON wishlist_items.user_id = 44 AND wishlist_items.spentit_id = spentits.id
WHERE (spentits.user_id IN
(SELECT following_id
FROM `follows`
WHERE `follows`.`follower_id` = '44'
AND `follows`.`accepted` = 1)
OR spentits.user_id = '44')
ORDER BY id DESC
LIMIT 15 OFFSET 0
The execution time increased by 11x; it now takes around 120ms. What's interesting is that if I remove either the LEFT JOIN clause or the ORDER BY id DESC, the time goes back to 10ms.
I am new to databases, so I don't understand this. Why does removing either one of these clauses speed it up 11x? And how can I keep the query as is but make it faster?
I have indexes on spentits.user_id, follows.follower_id, follows.accepted, and on primary ids of each table.
EXPLAIN:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | PRIMARY | spentits | index | index_spentits_on_user_id | PRIMARY | 4 | NULL | 15 | Using where; Using temporary
1 | PRIMARY | wishlist_items | ref | index_wishlist_items_on_user_id,index_wishlist_items_on_spentit_id | index_wishlist_items_on_spentit_id | 5 | spentit.spentits.id | 1 | Using where; Distinct
2 | SUBQUERY | follows | index_merge | index_follows_on_follower_id,index_follows_on_following_id,index_follows_on_accepted | index_follows_on_follower_id,index_follows_on_accepted | 5,2 | NULL | 566 | Using intersect(index_follows_on_follower_id,index_follows_on_accepted); Using where
You should also have an index on:
wishlist_items.spentit_id
because you are joining on that column.
The LEFT JOIN is easy to explain: conceptually, every entry on the left is combined with every entry on the right, and the join conditions (in your case: take all entries on the left and find matching ones on the right) are applied afterwards. So if your spentits table is large, it will take the server some time. I would suggest you get rid of your subquery and use three joins. Start with the smallest table to avoid handling large amounts of data.
In the second example, the subselect may be run for every spentits.user_id.
If you write it like this, it can be faster because the subselect runs only once:
SELECT DISTINCT spentits.*
FROM `spentits`
LEFT JOIN (SELECT following_id
FROM `follows`
WHERE `follows`.`follower_id` = '44'
AND `follows`.`accepted` = 1) AS `follow`
ON `follow`.following_id = spentits.user_id
LEFT JOIN wishlist_items ON wishlist_items.user_id = 44 AND wishlist_items.spentit_id = spentits.id
WHERE (`follow`.following_id IS NOT NULL
OR spentits.user_id = '44')
ORDER BY spentits.id DESC
LIMIT 15 OFFSET 0
As you can see, the subselect has moved to the FROM part of the query and creates a derived table.
This derived table is an inline view.
JOINs and inline views are often faster than a subselect in the WHERE part, though not in every case, so compare the EXPLAIN output of both forms.
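Here is a small sketch of one way to move the follows subselect into the FROM part as an inline view, checked against the original IN form on toy data in SQLite. The table contents are invented, user 44 is taken from the question, and SQLite only verifies that both forms return the same rows, not which one MySQL executes faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spentits (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
CREATE TABLE follows (
    follower_id INTEGER NOT NULL,
    following_id INTEGER NOT NULL,
    accepted INTEGER NOT NULL
);
CREATE TABLE wishlist_items (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    spentit_id INTEGER NOT NULL
);
INSERT INTO spentits VALUES (1, 44), (2, 1), (3, 2), (4, 1), (5, 9);
-- user 44 follows user 1 (accepted) and user 2 (not accepted)
INSERT INTO follows VALUES (44, 1, 1), (44, 2, 0), (7, 9, 1);
-- two wishlist rows for the same spentit force DISTINCT to do some work
INSERT INTO wishlist_items VALUES (1, 44, 2), (2, 44, 2);
""")

# Original form: subselect in the WHERE part.
in_form = """
SELECT DISTINCT spentits.*
FROM spentits
LEFT JOIN wishlist_items
  ON wishlist_items.user_id = :uid AND wishlist_items.spentit_id = spentits.id
WHERE (spentits.user_id IN
        (SELECT following_id FROM follows
         WHERE follows.follower_id = :uid AND follows.accepted = 1)
       OR spentits.user_id = :uid)
ORDER BY spentits.id DESC
LIMIT 15 OFFSET 0
"""

# Inline-view form: the subselect becomes a derived table in the FROM part.
inline_view_form = """
SELECT DISTINCT spentits.*
FROM spentits
LEFT JOIN (SELECT following_id FROM follows
           WHERE follows.follower_id = :uid AND follows.accepted = 1) AS follow
  ON follow.following_id = spentits.user_id
LEFT JOIN wishlist_items
  ON wishlist_items.user_id = :uid AND wishlist_items.spentit_id = spentits.id
WHERE (follow.following_id IS NOT NULL OR spentits.user_id = :uid)
ORDER BY spentits.id DESC
LIMIT 15 OFFSET 0
"""

in_rows = conn.execute(in_form, {"uid": 44}).fetchall()
inline_rows = conn.execute(inline_view_form, {"uid": 44}).fetchall()
print(in_rows)
print(inline_rows)
```

The LEFT JOIN to the inline view is needed (rather than an INNER JOIN) so that user 44's own rows, which have no match in follows, are still kept by the OR condition.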

Why doesn't this query run?

I have this query that isn't finishing (I think the server runs out of memory)
SELECT fOpen.*, fClose.*
FROM (
SELECT of.*
FROM fixtures of
JOIN (
SELECT MIN(id) id
FROM fixtures
GROUP BY matchId, period, type
) ofi ON ofi.id = of.id
) fOpen
JOIN (
SELECT cf.*
FROM fixtures cf
JOIN (
SELECT MAX(id) id
FROM fixtures
GROUP BY matchId, period, type
) cfi ON cfi.id = cf.id
) fClose ON fClose.matchId = fOpen.matchId AND fClose.period = fOpen.period AND fClose.type = fOpen.type
This is the EXPLAIN of it:
Those 2 subqueries 'of' and 'cf' take about 1.5s to run, if I run them separately.
'id' is a PRIMARY INDEX and there is a BTREE INDEX named 'matchPeriodType' that has those 3 columns in that order.
More info: MySQL 5.5, 512MB of server memory, and the table has about 400k records.
I tried to rewrite your query so that it is easier to read and can use your indexes. I hope I got it right; I could not test it without your data.
SELECT fOpen.*, fClose.*
FROM (
SELECT MIN(id) AS min_id, MAX(id) AS max_id
FROM fixtures
GROUP BY matchId, period, type
) ids
JOIN fixtures fOpen ON ( fOpen.id = ids.min_id )
JOIN fixtures fClose ON ( fClose.id = ids.max_id );
This one gets MIN(id) and MAX(id) per matchId, period, type (which should use your index) and joins the corresponding rows afterwards.
Appending id to your existing index matchPeriodType could also help, since the sub-query could then be executed using this index alone.
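The single-pass rewrite can be tried out on a handful of rows. This sketch uses SQLite with a made-up price column and sample fixtures, and it creates the matchPeriodType index with id appended as suggested. It only shows that each group yields one row pairing its opening (lowest id) and closing (highest id) fixture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fixtures (
    id INTEGER PRIMARY KEY,
    matchId INTEGER NOT NULL,
    period TEXT NOT NULL,
    type TEXT NOT NULL,
    price REAL NOT NULL              -- hypothetical payload column
);
-- The existing three-column index with id appended, as suggested.
CREATE INDEX matchPeriodType ON fixtures (matchId, period, type, id);
INSERT INTO fixtures VALUES
    (1, 100, 'FT', 'home', 1.5),
    (2, 100, 'FT', 'home', 1.6),
    (3, 100, 'FT', 'home', 1.7),
    (4, 100, 'HT', 'home', 2.0),
    (5, 200, 'FT', 'away', 3.0),
    (6, 200, 'FT', 'away', 3.2);
""")

# One grouped pass finds both the first and the last id of every group;
# the two joins then fetch the full rows for those ids by primary key.
open_close_sql = """
SELECT fOpen.id, fClose.id, fOpen.price, fClose.price
FROM (
    SELECT MIN(id) AS min_id, MAX(id) AS max_id
    FROM fixtures
    GROUP BY matchId, period, type
) ids
JOIN fixtures fOpen ON fOpen.id = ids.min_id
JOIN fixtures fClose ON fClose.id = ids.max_id
ORDER BY fOpen.id
"""

open_close_rows = conn.execute(open_close_sql).fetchall()
for row in open_close_rows:
    print(row)
```

A group with a single fixture, like match 100 / HT / home here, simply pairs that row with itself.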
I'm not sure how unique the matchId / period / type combination is. If it is unique, you are joining 400k records against 400k records, possibly without usable indexes on the derived tables.
However, it appears that the two main subselects might be unnecessary. You could just join fixtures against itself, and that against the subselects, to get the min and max.