40-second strange SQL performance puzzle - MySQL

I'm running a query to update a field on my users, like the following:
UPDATE Members SET abc = abc + 1 WHERE Members.id IN (
    SELECT DISTINCT(memberId) FROM Events WHERE Events.createdAt > "2017-08-06 13:10:00");
Shockingly, with about 500 members, this query runs for 40 seconds...
So, the breakdown:
SELECT DISTINCT(memberId) FROM Events WHERE Events.createdAt > "2017-08-06 13:10:00"
runs in 0.1 s, and only 39 rows are matched.
The total # of Members is only ~500. I don't understand why this can take that long... Am I missing something?
I'm running on RDS with MySQL 5.6.

Try replacing the IN with EXISTS:
UPDATE Members m
    SET abc = abc + 1
    WHERE EXISTS (SELECT 1
                  FROM Events e
                  WHERE e.memberId = m.id AND
                        e.createdAt > '2017-08-06 13:10:00'
                 );
For performance, you want an index on events(memberId, createdAt).
My guess is that MySQL runs the subquery once for every row in Members. This is consistent with your timing -- ~0.1 seconds * ~500 rows is about 50 seconds, not that far off of 40 seconds.
For SELECTs, this was fixed several versions ago. Perhaps this issue still exists in non-SELECT queries.
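In DDL terms, that index would be something like the following (the index name is just illustrative):
CREATE INDEX idx_events_member_created ON Events (memberId, createdAt);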
You can also write this as:
UPDATE Members m JOIN
       (SELECT DISTINCT e.memberId
        FROM Events e
        WHERE e.createdAt > '2017-08-06 13:10:00'
       ) e
       ON e.memberId = m.id
    SET abc = abc + 1;
Whether this is faster than the EXISTS version depends on the characteristics of your data. Without the suggested index, this version is likely to be faster.

Related

Optimizing Parameterized MySQL Queries

I have a query with a number of parameters which, if I run it from within MySQL Workbench, takes around a second to run.
If I take this query, get rid of the parameters, and substitute the values into the query instead, it takes about 22 seconds to run; the same happens if I convert it to a parameterized stored procedure and run that (it then also takes about 22 seconds).
I've enabled profiling in MySQL and can see a few things there. For example, it shows the number of rows examined, and there's an order-of-magnitude difference (20,000 vs 400,000), which I assume is the reason for the 20x increase in processing time.
The other difference in the profile is that the parameterized query sent from MySQL Workbench still has the parameters in it (e.g. where limit < @lim), while in the sproc the values have been set (where limit < 300).
I've tried this a number of different ways. I'm using JetBrains' DataGrip (as well as MySQL Workbench), and it behaves like MySQL Workbench (it sends the @ parameters through). I've tried executing the queries and the sproc from MySQL Workbench, DataGrip, Java (JDBC) and .NET. I've also tried prepared statements in Java, but I can't get anywhere near the performance of sending the 'raw' SQL to MySQL.
I feel like I'm missing something obvious here but I don't know what it is.
The query is relatively complex: it has a CTE, a couple of sub-selects and a couple of joins, but as I said it runs quickly when sent straight to MySQL.
My main question is why the query is 20x faster in one format than another.
Does the way the query is sent to MySQL (the '@' values sent through) have anything to do with this, and can I replicate it in a stored procedure?
Updated 1st Jan
Thanks for the comments. I didn't post the query originally, as I'm more interested in the general concepts around the use of variables/parameters and how I could take advantage of them (or not).
Here is the original query:
with tmp_bat as (select bd.MatchId,
bd.matchtype,
bd.playerid,
bd.teamid,
bd.opponentsid,
bd.inningsnumber,
bd.dismissal,
bd.dismissaltype,
bd.bowlerid,
bd.fielderid,
bd.score,
bd.position,
bd.notout,
bd.balls,
bd.minutes,
bd.fours,
bd.sixes,
bd.hundred,
bd.fifty,
bd.duck,
bd.captain,
bd.wicketkeeper,
m.hometeamid,
m.awayteamid,
m.matchdesignator,
m.matchtitle,
m.location,
m.tossteamid,
m.resultstring,
m.whowonid,
m.howmuch,
m.victorytype,
m.duration,
m.ballsperover,
m.daynight,
m.LocationId
from (select *
from battingdetails
where matchid in
(select id
from matches
where id in (select matchid from battingdetails)
and matchtype = @match_type
)) as bd
join matches m on m.id = bd.matchid
join extramatchdetails emd1
on emd1.MatchId = m.Id
and emd1.TeamId = bd.TeamId
join extramatchdetails emd2
on emd2.MatchId = m.Id
and emd2.TeamId = bd.TeamId
)
select players.fullname name,
teams.teams team,
'' opponents,
players.sortnamepart,
innings.matches,
innings.innings,
innings.notouts,
innings.runs,
HS.score highestscore,
HS.NotOut,
CAST(TRUNCATE(innings.runs / (CAST((Innings.Innings - innings.notOuts) AS DECIMAL)),
2) AS DECIMAL(7, 2)) 'Avg',
innings.hundreds,
innings.fifties,
innings.ducks,
innings.fours,
innings.sixes,
innings.balls,
CONCAT(grounds.CountryName, ' - ', grounds.KnownAs) Ground,
'' Year,
'' CountryName
from (select count(case when inningsnumber = 1 then 1 end) matches,
count(case when dismissaltype != 11 and dismissaltype != 14 then 1 end) innings,
LocationId,
playerid,
MatchType,
SUM(score) runs,
SUM(notout) notouts,
SUM(hundred) Hundreds,
SUM(fifty) Fifties,
SUM(duck) Ducks,
SUM(fours) Fours,
SUM(sixes) Sixes,
SUM(balls) Balls
from tmp_bat
group by MatchType, playerid, LocationId) as innings
JOIN players ON players.id = innings.playerid
join grounds on Grounds.GroundId = LocationId and grounds.MatchType = innings.MatchType
join
(select pt.playerid, t.matchtype, GROUP_CONCAT(t.name SEPARATOR ', ') as teams
from playersteams pt
join teams t on pt.teamid = t.id
group by pt.playerid, t.matchtype)
as teams on teams.playerid = innings.playerid and teams.matchtype = innings.MatchType
JOIN
(SELECT playerid,
LocationId,
MAX(Score) Score,
MAX(NotOut) NotOut
FROM (SELECT battingdetails.playerid,
battingdetails.score,
battingdetails.notout,
battingdetails.LocationId
FROM tmp_bat as battingdetails
JOIN (SELECT battingdetails.playerid,
battingdetails.LocationId,
MAX(battingdetails.Score) AS score
FROM tmp_bat as battingdetails
GROUP BY battingdetails.playerid,
battingdetails.LocationId,
battingdetails.playerid) AS maxscore
ON battingdetails.score = maxscore.score
AND battingdetails.playerid = maxscore.playerid
AND battingdetails.LocationId = maxscore.LocationId ) AS internal
GROUP BY internal.playerid, internal.LocationId) AS HS
ON HS.playerid = innings.playerid and hs.LocationId = innings.LocationId
where innings.runs >= @runs_limit
order by runs desc, KnownAs, SortNamePart
limit 0, 300;
Wherever you see '@match_type', I substitute a value ('t'). This query takes ~1.1 secs to run. The query with the hard-coded values rather than the variables is down to ~3.5 secs (see the other note below). The EXPLAIN for this query gives this:
1,PRIMARY,<derived7>,,ALL,,,,,219291,100,Using temporary; Using filesort
1,PRIMARY,players,,eq_ref,PRIMARY,PRIMARY,4,teams.playerid,1,100,
1,PRIMARY,<derived2>,,ref,<auto_key3>,<auto_key3>,26,"teams.playerid,teams.matchtype",11,100,Using where
1,PRIMARY,grounds,,ref,GroundId,GroundId,4,innings.LocationId,1,10,Using where
1,PRIMARY,<derived8>,,ref,<auto_key0>,<auto_key0>,8,"teams.playerid,innings.LocationId",169,100,
8,DERIVED,<derived3>,,ALL,,,,,349893,100,Using temporary
8,DERIVED,<derived14>,,ref,<auto_key0>,<auto_key0>,13,"battingdetails.PlayerId,battingdetails.LocationId,battingdetails.Score",10,100,Using index
14,DERIVED,<derived3>,,ALL,,,,,349893,100,Using temporary
7,DERIVED,t,,ALL,PRIMARY,,,,3323,100,Using temporary; Using filesort
7,DERIVED,pt,,ref,TeamId,TeamId,4,t.Id,65,100,
2,DERIVED,<derived3>,,ALL,,,,,349893,100,Using temporary
3,DERIVED,matches,,ALL,PRIMARY,,,,114162,10,Using where
3,DERIVED,m,,eq_ref,PRIMARY,PRIMARY,4,matches.Id,1,100,
3,DERIVED,emd1,,ref,"PRIMARY,TeamId",PRIMARY,4,matches.Id,1,100,Using index
3,DERIVED,emd2,,eq_ref,"PRIMARY,TeamId",PRIMARY,8,"matches.Id,emd1.TeamId",1,100,Using index
3,DERIVED,battingdetails,,ref,"TeamId,MatchId,match_team",match_team,8,"emd1.TeamId,matches.Id",15,100,
3,DERIVED,battingdetails,,ref,MatchId,MatchId,4,matches.Id,31,100,Using index; FirstMatch(battingdetails)
and the EXPLAIN for the query with the hardcoded values looks like this:
1,PRIMARY,<derived8>,,ALL,,,,,20097,100,Using temporary; Using filesort
1,PRIMARY,players,,eq_ref,PRIMARY,PRIMARY,4,HS.PlayerId,1,100,
1,PRIMARY,grounds,,ref,GroundId,GroundId,4,HS.LocationId,1,100,Using where
1,PRIMARY,<derived2>,,ref,<auto_key0>,<auto_key0>,30,"HS.LocationId,HS.PlayerId,grounds.MatchType",17,100,Using where
1,PRIMARY,<derived7>,,ref,<auto_key0>,<auto_key0>,46,"HS.PlayerId,innings.MatchType",10,100,Using where
8,DERIVED,matches,,ALL,PRIMARY,,,,114162,10,Using where; Using temporary
8,DERIVED,m,,eq_ref,"PRIMARY,LocationId",PRIMARY,4,matches.Id,1,100,
8,DERIVED,emd1,,ref,"PRIMARY,TeamId",PRIMARY,4,matches.Id,1,100,Using index
8,DERIVED,emd2,,eq_ref,"PRIMARY,TeamId",PRIMARY,8,"matches.Id,emd1.TeamId",1,100,Using index
8,DERIVED,<derived14>,,ref,<auto_key2>,<auto_key2>,4,m.LocationId,17,100,
8,DERIVED,battingdetails,,ref,"PlayerId,TeamId,Score,MatchId,match_team",MatchId,8,"matches.Id,maxscore.PlayerId",1,3.56,Using where
8,DERIVED,battingdetails,,ref,MatchId,MatchId,4,matches.Id,31,100,Using index; FirstMatch(battingdetails)
14,DERIVED,matches,,ALL,PRIMARY,,,,114162,10,Using where; Using temporary
14,DERIVED,m,,eq_ref,PRIMARY,PRIMARY,4,matches.Id,1,100,
14,DERIVED,emd1,,ref,"PRIMARY,TeamId",PRIMARY,4,matches.Id,1,100,Using index
14,DERIVED,emd2,,eq_ref,"PRIMARY,TeamId",PRIMARY,8,"matches.Id,emd1.TeamId",1,100,Using index
14,DERIVED,battingdetails,,ref,"TeamId,MatchId,match_team",match_team,8,"emd1.TeamId,matches.Id",15,100,
14,DERIVED,battingdetails,,ref,MatchId,MatchId,4,matches.Id,31,100,Using index; FirstMatch(battingdetails)
7,DERIVED,t,,ALL,PRIMARY,,,,3323,100,Using temporary; Using filesort
7,DERIVED,pt,,ref,TeamId,TeamId,4,t.Id,65,100,
2,DERIVED,matches,,ALL,PRIMARY,,,,114162,10,Using where; Using temporary
2,DERIVED,m,,eq_ref,PRIMARY,PRIMARY,4,matches.Id,1,100,
2,DERIVED,emd1,,ref,"PRIMARY,TeamId",PRIMARY,4,matches.Id,1,100,Using index
2,DERIVED,emd2,,eq_ref,"PRIMARY,TeamId",PRIMARY,8,"matches.Id,emd1.TeamId",1,100,Using index
2,DERIVED,battingdetails,,ref,"TeamId,MatchId,match_team",match_team,8,"emd1.TeamId,matches.Id",15,100,
2,DERIVED,battingdetails,,ref,MatchId,MatchId,4,matches.Id,31,100,Using index; FirstMatch(battingdetails)
Pointers on ways to improve my SQL are always welcome (I'm definitely not a database person), but I'd still like to understand whether I can use the SQL with the variables from code, and why that improves the performance by so much.
Update 2 1st Jan
AAArrrggghhh. My machine rebooted overnight and now the queries are generally running much quicker. It's still 1 sec vs 3 secs, but the 20x slowdown does seem to have disappeared.
In your WITH construct, you are overcomplicating things with select in ( select in ( select in )); that can be simplified to the innings CTE I have in my solution.
Also, you were joining to extramatchdetails TWICE, on the same match and team conditions, but never used either of those joins inside the WITH CTE, which renders that component useless. However, the matches table has hometeamid and awayteamid, which is what I THINK your actual intent was.
Also, your WITH CTE is pulling many columns that are never used in the subsequent results, such as captain and wicketkeeper.
So I have restructured: pre-query the batting details once up front, already summarized, and then join off that.
Hopefully this MIGHT be a better fit in function and performance for your needs.
with innings as
(
select
bd.matchId,
bd.matchtype,
bd.playerid,
m.locationId,
count(case when bd.inningsnumber = 1 then 1 end) matches,
count(case when bd.dismissaltype not in ( 11, 14 ) then 1 end) innings,
SUM(bd.score) runs,
SUM(bd.notout) notouts,
SUM(bd.hundred) Hundreds,
SUM(bd.fifty) Fifties,
SUM(bd.duck) Ducks,
SUM(bd.fours) Fours,
SUM(bd.sixes) Sixes,
SUM(bd.balls) Balls
from
battingDetails bd
join matches m
on bd.matchid = m.id
where
matchtype = @match_type
group by
bd.matchId,
bd.matchType,
bd.playerid,
m.locationId
)
select
p.fullname playerFullName,
p.sortnamepart,
CONCAT(g.CountryName, ' - ', g.KnownAs) Ground,
t.team,
i.matches,
i.innings,
i.runs,
i.notouts,
i.hundreds,
i.fifties,
i.ducks,
i.fours,
i.sixes,
i.balls,
CAST( TRUNCATE( i.runs / (CAST((i.Innings - i.notOuts) AS DECIMAL)), 2) AS DECIMAL(7, 2)) 'Avg',
hs.maxScore,
hs.maxNotOut,
'' opponents,
'' Year,
'' CountryName
from
innings i
JOIN players p
ON i.playerid = p.id
join grounds g
on i.locationId = g.GroundId
and i.matchType = g.matchType
join
(select
pt.playerid,
t.matchtype,
GROUP_CONCAT(t.name SEPARATOR ', ') team
from
playersteams pt
join teams t
on pt.teamid = t.id
group by
pt.playerid,
t.matchtype) as t
on i.playerid = t.playerid
and i.MatchType = t.matchtype
join
( select
i2.playerid,
i2.locationid,
max( i2.score ) maxScore,
max( i2.notOut ) maxNotOut
from
innings i2
group by
i2.playerid,
i2.LocationId ) HS
on i.playerid = HS.playerid
AND i.locationid = HS.locationid
where
i.runs >= @runs_limit
order by
i.runs desc,
g.KnownAs,
p.SortNamePart
limit
0, 300;
Now, I know you stated that performance is better after the server reboot, but what you DO have really does appear to be an overbloated query.
Not sure this is the correct answer but I thought I'd post this in case other people have the same issue.
The issue seems to be the use of CTEs in a stored procedure. I have a query that creates a CTE and then uses that CTE 8 times. If I run this query using interpolated variables it takes about 0.8 sec; if I turn it into a stored procedure and use the stored procedure parameters, it takes about a minute (between 45 and 63 seconds) to run!
I've found a couple of ways of fixing this. One is to use multiple temporary tables (8 in this case), since MySQL cannot re-use a temp table within a query. This gets the query time right down, but it just doesn't feel like a maintainable or scalable solution. The other fix is to leave the user variables in place and assign them from the stored procedure parameters; this also has no real performance issues. So my sproc looks like this:
create procedure bowling_individual_career_records_by_year_for_team_vs_opponent(IN team_id INT,
IN opponents_id INT)
begin
set @team_id = team_id;
set @opponents_id = opponents_id;
# use these variables in the SQL below
...
end
Not sure this is the best solution but it works for me and keeps the structure of the SQL the same as it was previously.
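For anyone who prefers the temporary-table route mentioned above, it would look roughly like this inside the procedure (a sketch only: the column list is trimmed, match_type stands in for the relevant procedure parameter, and the extra copies are needed because MySQL cannot reference the same temporary table more than once in a single query):
create temporary table tmp_bat_1 as
    select bd.matchid, bd.matchtype, bd.playerid, bd.score, bd.notout
    from battingdetails bd
    join matches m on m.id = bd.matchid
    where m.matchtype = match_type;
create temporary table tmp_bat_2 as select * from tmp_bat_1;
-- ...and so on, one copy per place the old tmp_bat CTE was referenced,
-- then join against tmp_bat_1 ... tmp_bat_8 in the main query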

Get Records From Table 1, Comparing With Multiple Records From Table 2

I have 2 tables (users and usages)
USERS TABLE
username usage
a 32
b 5
c 5
USAGES TABLE
username usage_added
a 7
b 7
c 7
a 30
I want to get all items from the USERS table that have usage BIGGER than X (in this case, let's say X is 30) AND where either NO RECORDS with the same username are found in the USAGES table, or the usage_added for this username in the USAGES table is SMALLER than X (30 in our case).
So in this case, it should return no records. I have a CodeIgniter query:
$this->db->select('users.username');
$this->db->from('users');
$this->db->join('usages', 'usages.username = users.username','left');
$this->db->where("(usages.email is NULL OR (usages.usage_added<30 AND usages.username=users.username))", NULL, FALSE);
$this->db->where("users.usage>30", NULL, FALSE);
Using the above query, I still get "username a" returned.
Normally it should not return user a, because user a already has a record with 30 added. But it seems the query compares against the first record (a = 7), decides 7 < 30, and shows the user again.
I hope this makes sense and somebody can help.
Written in SQL Server syntax, this query should work for you:
DECLARE @usage_limit int = 30;
SELECT A.username
FROM users as A
LEFT OUTER JOIN
(
SELECT username,
usage_added = sum(usage_added)
FROM usages
GROUP BY
username
) as B
ON A.username = B.username
WHERE A.usage > @usage_limit
AND (B.username is null OR B.usage_added < @usage_limit)
This returns no records.
Hope this helps!
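Since the question is about MySQL (via CodeIgniter), the same approach translated to MySQL syntax would look roughly like this (a user variable stands in for the T-SQL variable; a bound parameter from the application would work the same way):
SET @usage_limit = 30;
SELECT A.username
FROM users AS A
LEFT OUTER JOIN
(
    SELECT username,
           SUM(usage_added) AS usage_added
    FROM usages
    GROUP BY username
) AS B
    ON A.username = B.username
WHERE A.usage > @usage_limit
  AND (B.username IS NULL OR B.usage_added < @usage_limit);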
You seem to be describing logic like this:
select u.*
from users u
where u.usage > 30 or
      not exists (select 1
                  from usages us
                  where us.username = u.username and
                        us.usage_added > 30
                 );
You should replace the 30 with a parameter if it varies.
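For example, with a MySQL user variable standing in for that parameter:
SET @usage_limit = 30;
select u.*
from users u
where u.usage > @usage_limit or
      not exists (select 1
                  from usages us
                  where us.username = u.username and
                        us.usage_added > @usage_limit
                 );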

SQL statement running extremely slow

Okay, I have looked through several posts about SQL running slow and didn't see anything similar to this, so my apologies if I missed one. I was asked about a month ago to create a view that would allow the user to report what hasn't been billed yet, and the SQL joins 4 tables together. One has roughly 1.2 million records; the rest are between 80K and 250K. (Please note that this should only return around 100 records after the WHERE clauses.)
SELECT C.Cltnum + '.' + C.CltEng AS [ClientNum],
C.CPPLname,
w.WSCDesc,
MIN(w.Wdate) AS [FirstTDate],
w.WCodeCat,
w.WCodeSub,
SUM(w.Wbilled) AS [Billed],
SUM(w.Whours * w.Wrate) AS [Billable Wip],
sum(ar.[ARProgress]) AS [Progress],
w.Winvnum,
-- dbo.GetInvoiceAmount(w.Winvnum) AS [InvoiceAmount],
SUM(cb.cinvar) AS [AR Balance]
FROM dbo.WIP AS w
--Never join on a select statement
--join BillingAuditCatagoriesT bac on w.WCodeCat = bac.Catagory and w.WCodeSub = bac.Subcatagory
INNER JOIN dbo.Clients AS C ON w.WCltID = C.ID
JOIN dbo.ClientBuckets AS cb on c.cltnum = cb.cltnum
JOIN dbo.AcctsRec AS ar on ar.arapplyto = w.[Winvnum]
-- WHERE w.wcodecat = '1AUDT'
GROUP BY C.Cltnum, C.CltEng, C.CPPLname, w.WCodeCat, w.Wdate, w.Winvnum, w.WCodeSub, w.WSCDesc
So, where I think there may be a problem: Category is a varchar (values like xat, ACT, BID) and there are about 15 different categories; the same goes for SubCat. You will also notice that there are 3 functions involved, GetJamesProgress, which is essentially (SELECT SUM(Amount) FROM Progress WHERE inv = w.invnum), and likewise GetInvoiceAmount and GetJamesARBalance. I know this is bad practice, but when I join by invoice number instead it takes even longer than with them.
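For reference, the scalar function described above is roughly equivalent to a pre-aggregated join, sketched here with the names mentioned in the question (the real table and column names may differ, and ProgressTotal is just an illustrative alias):
SELECT w.Winvnum,
       pr.ProgressTotal
FROM dbo.WIP AS w
LEFT JOIN (SELECT inv,
                  SUM(Amount) AS ProgressTotal
           FROM Progress
           GROUP BY inv) AS pr
    ON pr.inv = w.Winvnum;
Joining one pre-aggregated derived table like this avoids running the subquery once per output row.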
Please help thanks so much!

MySQL query with multiple subqueries is too slow. How do I speed it up?

I have a MySQL query which does exactly what I want, but it takes anywhere between 110 and 130 seconds to process. The problem is that it works in tandem with software that times out 20 seconds after making the query.
Is there anything I can do to speed up the query? I'm considering moving the db over to another server, but are there any more elegant options before I go that route?
-- 1 Give me a list of IDs & eBayItemIDs
-- 2 where it is flagged as bottom tier
-- 3 Where it has been checked less than 168 times
-- 4 Where it has not been checked in the last hour
-- 5 Or where it was never checked but appears on the master list.
-- 1 Give me a list of IDs & eBayItemIDs
SELECT `id`, eBayItemID
FROM `eBayDD_Main`
-- 2 where it is flagged as bottom tier
WHERE `isBottomTier`='0'
-- 3 Where it has been checked less than 168 times
AND (`id` IN
(SELECT `mainid`
FROM `eBayDD_History`
GROUP BY `mainid`
HAVING COUNT(`mainID`) < 168)
-- 4 Where it has not been checked in the last hour
AND id IN
(SELECT `mainID`
FROM `eBayDD_History`
GROUP BY `mainID`
HAVING ((TIME_TO_SEC(TIMEDIFF(NOW(), MAX(`dateCollected`)))/60)/60) > 1))
-- 5 Or where it was never checked but appears on the master list.
OR (`id` IN
(SELECT `id`
FROM `eBayDD_Main`)
AND `id` NOT IN
(SELECT `mainID`
FROM `eBayDD_History`))
If I understand the logic correctly, you should be able to replace it with this:
select m.`id`, m.eBayItemID
from `eBayDD_Main` m left outer join
(select `mainid`, count(`mainID`) as cnt,
((TIME_TO_SEC(TIMEDIFF(NOW(), MAX(`dateCollected`)))/60)/60) as dc
from `eBayDD_History`
group by `mainid`
) hm
on m.id = hm.mainid
where (m.`isBottomTier` = '0' and hm.cnt < 168 and hm.dc > 1) or
      hm.mainid is null;

MySQL ranking in presence of indexes using variables

Using the classic trick of @N := @N + 1 to get the rank of items on some ordered column. Before ordering, I need to filter out some values from the base table by inner joining it with another table. So the query looks like this:
SET @N = 0;
SELECT
@N := @N + 1 AS rank,
fa.id,
fa.val
FROM
table1 AS fa
INNER JOIN table2 AS em
ON em.id = fa.id
AND em.type = "A"
ORDER BY fa.val ;
The issue is that if I don't have an index on em.type, everything works fine, but if I put an index on em.type then all hell breaks loose: the rank values, instead of following the order of the val column, come in the order the rows are stored in the em table.
Here are sample outputs.
Without index:
rank id val
1 05F8C7 55050.000000
2 05HJDG 51404.733458
3 05TK1Z 46972.008208
4 05F2TR 46900.000000
5 05F349 44433.412847
6 06C2BT 43750.000000
7 0012X3 42000.000000
8 05MMPK 39430.399658
9 05MLW5 39054.046383
10 062D20 35550.000000
With index:
rank id val
480 05F8C7 55050.000000
629 05HJDG 51404.733458
1603 05TK1Z 46972.008208
466 05F2TR 46900.000000
467 05F349 44433.412847
3534 06C2BT 43750.000000
15 0012X3 42000.000000
1109 05MMPK 39430.399658
1087 05MLW5 39054.046383
2544 062D20 35550.000000
I believe the use of indexes should be completely transparent and the output should not be affected by it. Is this a bug in MySQL?
This "trick" was a bomb waiting to explode. A clever optimizer will evaluate a query as it sees fit, optimizing for speed - that's why it's called an optimizer. I don't think this use of MySQL variables was ever documented to work the way you expect, but it was working.
It was working, up until recent improvements in the MariaDB optimizer, and it will probably break in mainstream MySQL as well, as there are several optimizer improvements in the (yet to be released, still beta) 5.6 version.
What you can do (until MySQL implements window functions) is use a self-join and grouping. Results will be consistent, no matter what future improvements are made to the optimizer. The downside is that it may not be very efficient:
SELECT
COUNT(*) AS rank,
fa.id,
fa.val
FROM
table1 AS fa
INNER JOIN table2 AS em
ON em.id = fa.id
AND em.type = 'A'
INNER JOIN
table1 AS fa2
INNER JOIN table2 AS em2
ON em2.id = fa2.id
AND em2.type = 'A'
ON fa2.id <= fa.id
-- assuming that `id` is the Primary Key of the table
GROUP BY fa.id
ORDER BY fa.val ;
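For completeness: on versions that do have window functions (MySQL 8.0+, MariaDB 10.2+), the ranking can be written without variables or self-joins, for example (a sketch; `rank` is quoted because RANK is a reserved word in MySQL 8.0):
SELECT ROW_NUMBER() OVER (ORDER BY fa.val) AS `rank`,
       fa.id,
       fa.val
FROM table1 AS fa
INNER JOIN table2 AS em
        ON em.id = fa.id
       AND em.type = 'A'
ORDER BY fa.val;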