Why are two MySQL SELECTs executed separately much faster than one combined query? - mysql

I want to understand why, when I run two queries separately, it takes around 400 ms in total, but when I combine them using a sub-select it takes around 12 seconds.
I have two InnoDB tables:
event: 99 914 rows
event_prizes: 24 540 770 rows
Below are my queries:
SELECT id
FROM event e
WHERE e.status != 'SCHEDULED';
-- takes 130 ms, returns 2406 rows

SELECT id, count(*)
FROM event_prizes
WHERE event_id IN (
    -- 2406 ids returned from the previous query
)
GROUP BY id;
-- takes 270 ms, returns the same number of rows
On the other hand, when I run the combined query below:
SELECT id, count(*)
FROM event_prizes
WHERE event_id IN (
    SELECT id
    FROM event e
    WHERE e.status != 'SCHEDULED'
)
GROUP BY id;
-- takes 12 seconds
I guess that in the second case MySQL does a full scan of the event_prizes table?
Is there any better way to write this as a single query?

You can use an INNER JOIN instead of a sub-select:
SELECT ep.id, COUNT(*)
FROM event_prizes ep
INNER JOIN event e ON ep.event_id = e.id
WHERE e.status <> 'SCHEDULED'
GROUP BY ep.id
Make sure you are using
a PRIMARY KEY on event.id
a PRIMARY KEY on event_prizes.id
a FOREIGN KEY on event_prizes.event_id
You can also try the following indices at least:
event(status)
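If any of these are missing, here is a sketch of the corresponding DDL (the constraint and index names are hypothetical):
ALTER TABLE event ADD PRIMARY KEY (id);
ALTER TABLE event_prizes ADD PRIMARY KEY (id);
-- In InnoDB this also creates an index on event_prizes.event_id if none exists:
ALTER TABLE event_prizes
    ADD CONSTRAINT fk_event_prizes_event
    FOREIGN KEY (event_id) REFERENCES event (id);
CREATE INDEX idx_event_status ON event (status);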

Related

Select most recent record grouped by 3 columns

I am trying to return the price of the most recent record grouped by ItemNum and FeeSched; Customer can be eliminated. I am having trouble understanding how I can do that reasonably.
The issue is that I am joining about 5 tables containing hundreds of thousands of rows to end up with this result set. The initial query takes about a minute to run, and there has been some trouble with timeout errors in the past. Since this will run on a client's workstation, it may run even slower, and I have no access to modify server settings to increase memory / timeouts.
Here is my data:
Customer  Price  ItemNum  FeeSched  Date
5         70.75  01202    12        12-06-2017
5         70.80  01202    12        06-07-2016
5         70.80  01202    12        07-21-2017
5         70.80  01202    12        10-26-2016
5         82.63  02144    61        12-06-2017
5         84.46  02144    61        06-07-2016
5         84.46  02144    61        07-21-2017
5         84.46  02144    61        10-26-2016
I don't have access to create temporary tables or views, and there is no such thing as a #variable in C-tree, but in most ways it acts like MySQL. I wanted to use something like GROUP BY ItemNum, FeeSched and select MAX(Date). The issue is that unless I put Price into the GROUP BY as well, I get an error.
I could run the query again only selecting ItemNum, FeeSched, Date and then doing an INNER JOIN, but with the query taking a minute to run each time, it seems there is a better way that maybe I don't know.
Here is the query I am running. It isn't really that complicated other than the amount of data it processes; the final result is about 50,000 rows. I can't share much about the database structure, as it is covered by an NDA.
SELECT DISTINCT
    CustomerNum,
    paid AS Price,
    ItemNum,
    n.pdate AS newest
FROM admin.fullproclog AS f
INNER JOIN (
    SELECT id, itemId, MAX(TO_CHAR(pdate, 'MM-DD-YYYY')) AS pdate
    FROM admin.fullproclog
    WHERE pdate > timestampadd(sql_tsi_year, -3, NOW())
    GROUP BY id, itemId
) AS n ON n.id = f.id AND n.itemId = f.itemId AND n.pdate = f.pdate
LEFT JOIN (
    SELECT itemId AS linkid, ItemNum
    FROM admin.itemlist
) AS codes ON codes.linkid = f.itemId AND ItemNum > 0
INNER JOIN (
    SELECT DISTINCT parent_id, MAX(ins1.feesched) AS CustomerNum
    FROM admin.customers AS p
    LEFT JOIN admin.feeschedule AS ins1 ON ins1.feescheduleid = p.primfeescheduleid
    LEFT JOIN admin.group AS c1 ON c1.insid = ins1.feesched
    WHERE status = 1
    GROUP BY parent_id
) AS ip ON ip.parent_id = f.parent_id
WHERE CustomerNum > 0 AND ItemNum > 0

UNION ALL

SELECT DISTINCT
    CustomerNum,
    secpaid AS Price,
    ItemNum,
    n.pdate AS newest
FROM admin.fullproclog AS f
INNER JOIN (
    SELECT id, itemId, MAX(TO_CHAR(pdate, 'MM-DD-YYYY')) AS pdate
    FROM admin.fullproclog
    WHERE pdate > timestampadd(sql_tsi_year, -3, NOW())
    GROUP BY id, itemId
) AS n ON n.id = f.id AND n.itemId = f.itemId AND n.pdate = f.pdate
LEFT JOIN (
    SELECT itemId AS linkid, ItemNum
    FROM admin.itemlist
) AS codes ON codes.linkid = f.itemId AND ItemNum > 0
INNER JOIN (
    SELECT DISTINCT parent_id, MAX(ins1.feesched) AS CustomerNum
    FROM admin.customers AS p
    LEFT JOIN admin.feeschedule AS ins1 ON ins1.feescheduleid = p.secfeescheduleid
    LEFT JOIN admin.group AS c1 ON c1.insid = ins1.feesched
    WHERE status = 1
    GROUP BY parent_id
) AS ip ON ip.parent_id = f.parent_id
WHERE CustomerNum > 0 AND ItemNum > 0
It seemed quite simple when I read the first three paragraphs, but I got a little confused when I read the whole question.
Whatever you have done to get the data posted above, once you have the data in that shape it is easy to retrieve "the most recent record grouped by ItemNum and FeeSched".
How to:
Firstly, sort the whole result set by Date DESC.
Secondly, select the fields you need from the sorted result set and group by ItemNum, FeeSched without any aggregate functions.
So, the query might be something like this:
SELECT t.Price, t.ItemNum, t.FeeSched, t.Date
FROM (SELECT * FROM table ORDER BY Date DESC) AS t
GROUP BY t.ItemNum, t.FeeSched;
How it works:
When your data is grouped and you select rows without aggregate functions, only the first row of each group is returned. Since you sorted all rows before grouping, that first row is exactly "the most recent record".
Let me know if you run into any problems or errors with this approach.
You can also try it like this:
SELECT Price, ItemNum, FeeSched, Date
FROM table
WHERE Date IN (SELECT MAX(Date) FROM table GROUP BY ItemNum, FeeSched, Customer);
The inner query returns the maximum date for each ItemNum, FeeSched, Customer group, and the IN clause fetches only the records with those maximum dates.
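A tighter variation (a sketch, using your_table as a stand-in for the real table name): a bare IN only compares dates, so a row whose date happens to equal the maximum of a different group would also slip through; joining back on the grouping columns pins each maximum to its own group.
SELECT t.Price, t.ItemNum, t.FeeSched, t.Date
FROM your_table AS t
INNER JOIN (
    -- one row per group with its latest date
    SELECT ItemNum, FeeSched, MAX(Date) AS MaxDate
    FROM your_table
    GROUP BY ItemNum, FeeSched
) AS latest
    ON latest.ItemNum = t.ItemNum
   AND latest.FeeSched = t.FeeSched
   AND latest.MaxDate = t.Date;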

Extreme query optimization with IN clause and subquery

My table currently has more than 15 million rows.
I need to run such query:
SELECT ch1.*
FROM citizens_dynamic ch1
WHERE ch1.id IN (4369943, ..., 4383420, 4383700)
  AND ch1.update_id_to = (
      SELECT MAX(ch2.update_id_to)
      FROM citizens_dynamic ch2
      WHERE ch1.id = ch2.id AND ch2.update_id_to < 812
  )
Basically, for every citizen in the IN clause it searches for the row whose update_id_to is closest to, but lower than, the specified value.
There is a PRIMARY KEY on the two columns (update_id_to, id).
At the moment, this query executes in 0.9 s (with 100 ids in the IN clause).
That is still too slow; I would need to run my scripts for three days to complete.
Below you can see my EXPLAIN output.
The id index is just like the PRIMARY KEY, but with the columns reversed: (id, update_id_to).
Do you have any ideas how to make it even faster?
I've found that MySQL tends to perform better with a JOIN than with a correlated subquery.
SELECT ch1.*
FROM citizens_dynamic AS ch1
JOIN (SELECT id, MAX(update_id_to) AS update_id_to
      FROM citizens_dynamic
      WHERE id IN (4369943, ..., 4383420, 4383700)
        AND update_id_to < 812
      GROUP BY id) AS ch2
  ON ch1.id = ch2.id AND ch1.update_id_to = ch2.update_id_to
WHERE ch1.id IN (4369943, ..., 4383420, 4383700)
Also, see the other methods in this question:
Retrieving the last record in each group

Poor Performance from MySQL JOIN - How to Make Improvements?

A bit of a generic question title but I have the following query:
SELECT t.from_number, COUNT(*) AS calls
FROM t
WHERE t.organisation_id = 999
AND t.direction = 'inbound'
AND t.start_time BETWEEN '2014-03-26' AND NOW()
AND t.from_number != ''
GROUP BY t.from_number
ORDER BY calls DESC LIMIT 20
and it executes in 488ms.
However, as well as retrieving the data from that table, I need to look up who the number belongs to.
SELECT t.from_number, COUNT(*) AS calls
FROM t
LEFT JOIN n on CONCAT('44', n.number) = t.from_number
WHERE t.organisation_id = 999
AND t.direction = 'inbound'
AND t.start_time BETWEEN '2014-03-26' AND NOW()
AND t.from_number != ''
GROUP BY t.from_number
ORDER BY calls DESC LIMIT 20
As soon as I add the JOIN, the query execution time jumps to anywhere from 8 to 12 seconds, and that is only to find the organisation the number belongs to; I'd need yet another join after that to retrieve the organisation name from the organisations table.
The cardinalities of t and n are > 2,000,000 and ~ 63,000 respectively and, as you can guess from the above, the numbers are stored slightly differently in each:
t stores numbers as 123456789, since the country code (44) is stored in a separate column, but n stores numbers as 44123456789, which is why I need the CONCAT. I didn't think this would affect performance since it's not in the WHERE clause.
As far as I can tell, I have indexed the important columns in each table.
Are there any suggestions on how I can improve the performance of queries when it comes to these tables?
Update
EXPLAIN output added
id  select_type  table  type         possible_keys                                      key                        key_len  ref   rows   Extra
1   SIMPLE       t      index_merge  organisation_id,start_time,direction,from_number  organisation_id,direction  4,13     NULL  4174   Using intersect(organisation_id,direction); Using where; Using temporary; Using filesort
1   SIMPLE       n      index        NULL                                               number                     768      NULL  62759  Using index
The problem is on the JOIN clause:
LEFT JOIN n on CONCAT('44', n.number) = t.from_number
It is joining the tables using the result of the function CONCAT('44', n.number).
Some databases (such as Oracle) can create an index based on a function, but others (such as MySQL) cannot. So it cannot use any index on table n to perform the join.
A solution would be to create a new column on n with the result of the used function and to index it.
You could use a code similar to:
ALTER TABLE n ADD COLUMN extended_number varchar(128) null;
UPDATE n
SET extended_number = CONCAT('44', number);
CREATE INDEX ext_numb_idx ON n (extended_number);
After this, modify the JOIN clause of the query:
SELECT t.from_number, COUNT(*) AS calls
FROM t
LEFT JOIN n on n.extended_number = t.from_number
WHERE t.organisation_id = 999
AND t.direction = 'inbound'
AND t.start_time BETWEEN '2014-03-26' AND NOW()
AND t.from_number != ''
GROUP BY t.from_number
ORDER BY calls DESC LIMIT 20
Then MySQL will use the newly created index and will execute the query much faster.
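As a side note, on MySQL 5.7 and later (newer than the setup discussed here, so treat the version as an assumption), a stored generated column keeps the derived value in sync automatically and can be indexed, which avoids the manual UPDATE step:
-- Sketch for MySQL 5.7+; column and index names as above:
ALTER TABLE n
    ADD COLUMN extended_number VARCHAR(128)
        AS (CONCAT('44', number)) STORED;
CREATE INDEX ext_numb_idx ON n (extended_number);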

How to make JOINS faster?

I had this query to start out with:
SELECT DISTINCT spentits.*
FROM `spentits`
WHERE (spentits.user_id IN
          (SELECT following_id
           FROM `follows`
           WHERE `follows`.`follower_id` = '44'
             AND `follows`.`accepted` = 1)
       OR spentits.user_id = '44')
ORDER BY id DESC
LIMIT 15 OFFSET 0
This query takes 10ms to execute.
But once I add a simple join in:
SELECT DISTINCT spentits.*
FROM `spentits`
LEFT JOIN wishlist_items ON wishlist_items.user_id = 44
                        AND wishlist_items.spentit_id = spentits.id
WHERE (spentits.user_id IN
          (SELECT following_id
           FROM `follows`
           WHERE `follows`.`follower_id` = '44'
             AND `follows`.`accepted` = 1)
       OR spentits.user_id = '44')
ORDER BY id DESC
LIMIT 15 OFFSET 0
The execution time increased elevenfold: now it takes around 120 ms to execute. What's interesting is that if I remove either the LEFT JOIN clause or the ORDER BY id DESC, the time goes back to 10 ms.
I am new to databases, so I don't understand this. Why does removing either one of these clauses speed it up 11x? And how can I keep the query as is but make it faster?
I have indexes on spentits.user_id, follows.follower_id, follows.accepted, and on primary ids of each table.
EXPLAIN:
1 PRIMARY spentits index index_spentits_on_user_id PRIMARY 4 NULL 15 Using where; Using temporary
1 PRIMARY wishlist_items ref index_wishlist_items_on_user_id,index_wishlist_items_on_spentit_id index_wishlist_items_on_spentit_id 5 spentit.spentits.id 1 Using where; Distinct
2 SUBQUERY follows index_merge index_follows_on_follower_id,index_follows_on_following_id,index_follows_on_accepted index_follows_on_follower_id,index_follows_on_accepted 5,2 NULL 566 Using intersect(index_follows_on_follower_id,index_follows_on_accepted); Using where
You should also have an index on wishlist_items.spentit_id, because you are joining over that column.
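A minimal sketch of that index (the index name is hypothetical):
CREATE INDEX idx_wishlist_items_spentit_id ON wishlist_items (spentit_id);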
The LEFT JOIN is easy to explain: conceptually, a cross product of all entries against all other entries is made, and the conditions of the join (in your case: take all entries on the left and find the fitting ones on the right) are applied afterwards. So if your spentits table is large, it will take the server some time. I would suggest you get rid of your subquery and use three joins, starting with the smallest table to avoid handling big amounts of data.
In the second example the subselect runs for every spentits.user_id.
If you write it like this it will be faster, because the subselect runs once:
SELECT DISTINCT spentits.*
FROM `spentits`
INNER JOIN (SELECT following_id
            FROM `follows`
            WHERE `follows`.`follower_id` = '44'
              AND `follows`.`accepted` = 1
            UNION
            SELECT '44') AS `follow`
        ON spentits.user_id = follow.following_id
LEFT JOIN wishlist_items ON wishlist_items.user_id = 44
                        AND wishlist_items.spentit_id = spentits.id
ORDER BY id DESC
LIMIT 15 OFFSET 0
As you can see, the subselect has moved to the FROM part of the query, where it creates an imaginary table (or view). This imaginary table is an inline view.
JOINs and inline views are consistently faster than a subselect in the WHERE part.

Why doesn't this query run?

I have this query that isn't finishing (I think the server runs out of memory):
SELECT fOpen.*, fClose.*
FROM (
    SELECT of.*
    FROM fixtures of
    JOIN (
        SELECT MIN(id) id
        FROM fixtures
        GROUP BY matchId, period, type
    ) ofi ON ofi.id = of.id
) fOpen
JOIN (
    SELECT cf.*
    FROM fixtures cf
    JOIN (
        SELECT MAX(id) id
        FROM fixtures
        GROUP BY matchId, period, type
    ) cfi ON cfi.id = cf.id
) fClose ON fClose.matchId = fOpen.matchId AND fClose.period = fOpen.period AND fClose.type = fOpen.type
This is the EXPLAIN of it:
Those two subqueries, 'of' and 'cf', take about 1.5 s to run if I run them separately.
'id' is a PRIMARY INDEX and there is a BTREE INDEX named 'matchPeriodType' that has those 3 columns in that order.
More info: MySQL 5.5, 512MB of server memory, and the table has about 400k records.
I tried to rewrite your query so that it is easier to read and can use your indexes. I hope I got it right; I could not test it without your data.
SELECT fOpen.*, fClose.*
FROM (
    SELECT MIN(id) AS min_id, MAX(id) AS max_id
    FROM fixtures
    GROUP BY matchId, period, type
) ids
JOIN fixtures fOpen ON ( fOpen.id = ids.min_id )
JOIN fixtures fClose ON ( fClose.id = ids.max_id );
This one gets MIN(id) and MAX(id) per matchId, period, type (should use your index) and joins the corresponding rows afterwards.
Appending id to your existing index matchPeriodType could also help, since the sub-query could then be executed with this index only.
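For example (a sketch, assuming the existing index is replaced in place rather than duplicated):
ALTER TABLE fixtures
    DROP INDEX matchPeriodType,
    ADD INDEX matchPeriodType (matchId, period, type, id);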
I am not sure how unique matchId / period / type is. If it is unique, you are joining 400k records against 400k records, possibly with the indexes being lost.
However, it appears that the two main subselects might be unnecessary: you could just join fixtures against itself and join that against the subselects to get the min and max.