I have a slow UPDATE statement and can't work out why
CREATE TEMPORARY TABLE ttShifts
(ShiftId INT NOT NULL,
 ttdtAdded DATETIME NOT NULL,
 ttdtBookingStart DATETIME NOT NULL,
 ttHoursNeeded INT NOT NULL,
 szHidden VARCHAR(255),
 PRIMARY KEY (ShiftId))
AS
(SELECT shiftId, dtAdded AS ttdtAdded, dtBookingStart AS ttdtBookingStart, HoursNeeded AS ttHoursNeeded
 FROM shifts
 WHERE shifts.lStatus = 0);
UPDATE ttShifts
SET szHidden = 'x'
WHERE szHidden IS NULL
  AND ShiftId IN (SELECT shiftid
                  FROM shifts, practices
                  WHERE shifts.PracticeId = practices.PracticeId
                    AND shifts.iBranch = practices.iBranch
                    AND practices.Healthboard NOT IN (SELECT Locname
                                                      FROM userlocationprefs
                                                      WHERE iUser = 82 AND Level = 0 AND fAcceptWork = true))
166 rows affected. (Query took 0.2899 seconds.)
EXPLAIN:
1 PRIMARY ttShifts index PRIMARY 4 297 Using where
2 DEPENDENT SUBQUERY shifts eq_ref PRIMARY PRIMARY 4 func 1
2 DEPENDENT SUBQUERY practices ALL PRIMARY 636 Using where; Using join buffer (flat, BNL join)
3 MATERIALIZED userlocationprefs ref PRIMARY PRIMARY 8 const,const 3 Using where
OK, so let's try switching this to use a join to eliminate the dependent subqueries
UPDATE ttShifts
JOIN shifts ON (ttShifts.ShiftId = shifts.shiftId)
JOIN practices ON (shifts.practiceId = practices.PracticeId AND shifts.iBranch = practices.iBranch)
SET szHidden = 'x'
WHERE szHidden IS NULL
  AND practices.Healthboard NOT IN (SELECT Locname
                                    FROM userlocationprefs
                                    WHERE iUser = 82 AND Level = 0 AND fAcceptWork = true);
166 rows affected. (Query took 0.4009 seconds.)
Right, so that takes longer
EXPLAIN:
1 PRIMARY ttShifts ALL PRIMARY 297 Using where
1 PRIMARY shifts eq_ref PRIMARY PRIMARY 4 ttShifts.shiftId 1
1 PRIMARY practices ALL 636 Using where
2 MATERIALIZED userlocationprefs ref PRIMARY PRIMARY 8 const,const 3 Using where
OK, so it must be the MATERIALIZED bit it's not doing efficiently for some reason, let's try swapping that to a straight equality check just as a test.
UPDATE ttShifts
JOIN shifts ON (ttShifts.ShiftId = shifts.shiftId)
JOIN practices ON (shifts.practiceId = practices.PracticeId AND shifts.iBranch = practices.iBranch)
SET szHidden = 'x'
WHERE szHidden IS NULL
  AND practices.Healthboard != 'X'
0.3493 seconds.
OK, not that then.
If I eliminate the update and make it a select...
SELECT *
FROM ttShifts
JOIN shifts ON (ttShifts.ShiftId = shifts.shiftId)
JOIN practices ON (shifts.practiceId = practices.PracticeId AND shifts.iBranch = practices.iBranch)
WHERE szHidden IS NULL
  AND practices.Healthboard NOT IN (SELECT Locname
                                    FROM userlocationprefs
                                    WHERE iUser = 82 AND Level = 0 AND fAcceptWork = true)
(166 rows, Query took 0.0159 seconds.)
So why is the UPDATE so bloody slow, and what can I do to speed it up?
Your first EXPLAIN output tells us that the engine needs to process roughly 297 * 1 * 636 * 3 = 566,676 rows, so yes, it will take a moment. The second EXPLAIN output is similar.
If I were you, I would focus on the entries marked ALL, as they represent full table scans.
Also, IN works best with a list of constant values, not a subquery, since a subquery can render the index useless. Try to replace the subquery with constant values if possible.
The second EXPLAIN is even worse because there are two table scans with no usable index.
There is no EXPLAIN output for the third UPDATE, but again I suspect the number of rows to process is still high because the WHERE clause filters on low-cardinality criteria (szHidden IS NULL and Healthboard != 'X').
Your last query compares the speed of SELECT vs UPDATE. Well, UPDATE is more expensive because it has to find the matching rows, write the new values, and update the indexes.
From what I can see, most of your problem comes from using low-cardinality columns as filtering criteria.
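For example, you could materialize the subquery result into its own temporary table and anti-join against it (a sketch only, untested; ttUserLocs is a made-up name):
CREATE TEMPORARY TABLE ttUserLocs AS
  (SELECT Locname
   FROM userlocationprefs
   WHERE iUser = 82 AND Level = 0 AND fAcceptWork = true);

-- LEFT JOIN anti-join in place of NOT IN
-- (equivalent only if Locname is never NULL).
UPDATE ttShifts
JOIN shifts ON ttShifts.ShiftId = shifts.shiftId
JOIN practices ON shifts.PracticeId = practices.PracticeId
              AND shifts.iBranch = practices.iBranch
LEFT JOIN ttUserLocs ON practices.Healthboard = ttUserLocs.Locname
SET szHidden = 'x'
WHERE szHidden IS NULL
  AND ttUserLocs.Locname IS NULL;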
For anyone else who runs across this, it seems the issue is with how the optimiser handles IN in an UPDATE vs a SELECT. I refactored things so the user's viewing preferences were stored in a separate table, testuserchosenpractices, which I could then join against.
I can't explain the original speed difference between UPDATE and SELECT (the SELECT was perfectly tolerable), but with the replacement, the UPDATE is faster than the original SELECT was!
UPDATE (SELECT 1) AS dummy, ttShifts
SET szHidden = 'x'
WHERE szHidden IS NULL
  AND ShiftId IN (SELECT shiftid
                  FROM shifts
                  JOIN practices ON (shifts.PracticeId = practices.PracticeId
                                     AND shifts.iBranch = practices.iBranch)
                  JOIN testuserchosenpractices ON (shifts.practiceid = testuserchosenpractices.practiceid
                                                   AND shifts.ibranch = testuserchosenpractices.ibranch
                                                   AND testuserchosenpractices.iUser = 82)
                                              AND szReason != '')
166 rows affected. (Query took 0.0066 seconds.)
(I found that the same question already exists, but I wasn't happy with the level of detail, so I came here for help; forgive me for my ignorance.)
DELETE FROM supportrequestresponse # ~3 million records
WHERE SupportRequestID NOT IN (
SELECT SR.SupportRequestID
FROM supportrequest AS SR # ~1 million records
)
Or
DELETE SRR
FROM supportrequestresponse AS SRR # ~3 million records
LEFT JOIN supportrequest AS SR
ON SR.SupportRequestID = SRR.SupportRequestID # ~1 million records
WHERE SR.SupportRequestID IS NULL
Specifics
Database: MySQL
SR.SupportRequestID is INTEGER PRIMARY KEY
SRR.SupportRequestID is INTEGER INDEX
SR.SupportRequestID & SRR.SupportRequestID are not in a FOREIGN KEY relationship
Both tables contain TEXT columns for subject and message
Both tables are InnoDB
Motive: I am planning to use this in a periodic clean-up job, likely to run once an hour or every two hours. It is very important to avoid lengthy operations and the resulting table locks, as this is a very busy database and I am already over quota with deadlocks!
EXPLAIN query 1
1 PRIMARY supportrequestresponse ALL 410 Using where
2 DEPENDENT SUBQUERY SR unique_subquery PRIMARY PRIMARY 4 func 1 Using index
EXPLAIN query 2
1 SIMPLE SRR ALL 410
1 SIMPLE SR eq_ref PRIMARY PRIMARY 4 SRR.SupportRequestID 1 Using where; Using index; Not exists
RUN #2
EXPLAIN query 1
1 PRIMARY supportrequestresponse ALL 157209473 Using where
2 DEPENDENT SUBQUERY SR unique_subquery PRIMARY PRIMARY 4 func 1 Using index; Using where; Full scan on NULL key
EXPLAIN query 2
1 SIMPLE SRR ALL 157209476
1 SIMPLE SR eq_ref PRIMARY PRIMARY 4 SRR.SupportRequestID 1 Using where; Using index; Not exists
I suspect it would be quicker to create a new table, retaining just the rows you wish to keep. Then drop the old table. Then rename the new table.
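Something along these lines (a sketch, untested; it assumes writes arriving during the copy can be paused or tolerated, and uses an atomic RENAME so there is no window with a missing table):
CREATE TABLE supportrequestresponse_new LIKE supportrequestresponse;

-- Keep only the responses that still have a parent request.
INSERT INTO supportrequestresponse_new
SELECT SRR.*
FROM supportrequestresponse AS SRR
JOIN supportrequest AS SR ON SR.SupportRequestID = SRR.SupportRequestID;

-- Swap the tables in one atomic step, then drop the old one.
RENAME TABLE supportrequestresponse TO supportrequestresponse_old,
             supportrequestresponse_new TO supportrequestresponse;
DROP TABLE supportrequestresponse_old;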
I don't quite know how to describe this, but the following worked as an answer in my case; an unbelievable one!
DELETE SRR
FROM supportrequestresponse AS SRR
LEFT JOIN (
SELECT SRR3.SupportRequestResponseID
FROM supportrequestresponse AS SRR3
LEFT JOIN supportrequest AS SR ON SR.SupportRequestID = SRR3.SupportRequestID
WHERE SR.SupportRequestID IS NULL
LIMIT 999
) AS SRR2 ON SRR2.SupportRequestResponseID = SRR.SupportRequestResponseID
WHERE SRR2.SupportRequestResponseID IS NOT NULL;
... # Same piece of SQL
... # Same piece of SQL
... #99 Same piece of SQL
A variant of the second pattern looks/feels more appropriate than letting MySQL match each row against a dynamic list, but that is a minor point. I simply limited the row selection to 999 rows at a time, which lets each DELETE finish in the blink of an eye; most importantly, I repeated the same piece of DELETE SQL 99 times, one after another!
This basically made it super comfortable for a cron job. The 99 separate statements let the database engine keep the tables unlocked, so other processes don't get stuck waiting for the DELETE to finish, while each individual DELETE takes very little time to complete. It's something like vehicles passing through a crossroads in zipper fashion.
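If pasting the statement 99 times ever becomes unwieldy, the same batching could be wrapped in a stored procedure that loops until nothing is left to delete (a sketch of an alternative, not what I actually ran; purge_orphan_responses is a made-up name):
DELIMITER //
CREATE PROCEDURE purge_orphan_responses()
BEGIN
  REPEAT
    -- Same 999-row batched DELETE as above.
    DELETE SRR
    FROM supportrequestresponse AS SRR
    LEFT JOIN (
      SELECT SRR3.SupportRequestResponseID
      FROM supportrequestresponse AS SRR3
      LEFT JOIN supportrequest AS SR ON SR.SupportRequestID = SRR3.SupportRequestID
      WHERE SR.SupportRequestID IS NULL
      LIMIT 999
    ) AS SRR2 ON SRR2.SupportRequestResponseID = SRR.SupportRequestResponseID
    WHERE SRR2.SupportRequestResponseID IS NOT NULL;
  UNTIL ROW_COUNT() = 0 END REPEAT;
END //
DELIMITER ;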
I'm trying to execute a select query over a fairly simple (but large) table and am getting over 10x slower performance when I don't join on a certain secondary table.
TableA is keyed on two columns, 'ID1' & 'ID2', and has a total of 10 numeric (int + dbl) columns.
TableB is keyed on 'ID1' and has a total of 2 numeric (int) columns.
SELECT
AVG(NULLIF(dollarValue, 0))
FROM
TableA
INNER JOIN
TableB
ON
TableA.ID1 = TableB.ID1
WHERE
TableA.ID2 = 5
AND
TableA.ID1 BETWEEN 15000 AND 20000
As soon as the join is removed, performance takes a major hit. The query above takes 0.016 seconds to run while the query below takes 0.216 seconds to run.
The end goal is to replace TableA's 'ID1' with TableB's 2nd (non-key) column and deprecate TableB.
SELECT
AVG(NULLIF(dollarValue, 0))
FROM
tableA
WHERE
ID2 = 5
AND
ID1 BETWEEN 15000 AND 20000
Both tables have indexes on their primary keys. The relationship between the two tables is One-to-Many. DB Engine is MyISAM.
Scenario 1 (fast):
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE TableB range PRIMARY PRIMARY 4 498 Using where; Using index
1 SIMPLE TableA eq_ref PRIMARY PRIMARY 8 schm.TableA.ID1,const 1
Scenario 2 (slow):
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE TableA range PRIMARY PRIMARY 8 288282 Using where
Row count and lack of any mention of an index in scenario 2 definitely stand out, but why would that be the case?
700 results from both queries -- same data.
Given your query, I'd say an index like this might be useful:
CREATE INDEX i ON tableA(ID2, ID1)
A possible reason why your first query is much faster is that you probably have only a few records in tableB, which makes the join predicate very selective compared to the range predicate.
I suggest reading up on indexes. Knowing two or three details about them will help you easily tune your queries just by choosing better indexes.
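To verify whether it helps, one could create the index and re-check the plan for the slow query (a sketch; table and column names as in the question):
-- With an index on (ID2, ID1), the equality on ID2 plus the range on ID1
-- can be satisfied by a single index range scan.
CREATE INDEX i ON tableA (ID2, ID1);

EXPLAIN
SELECT AVG(NULLIF(dollarValue, 0))
FROM tableA
WHERE ID2 = 5
  AND ID1 BETWEEN 15000 AND 20000;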
I wanted to find all hourly records that have a successor in a ~5m row table.
I tried:
SELECT DISTINCT (date_time)
FROM my_table
JOIN (SELECT DISTINCT (DATE_ADD( date_time, INTERVAL 1 HOUR)) date_offset
FROM my_table) offset_dates
ON date_time = date_offset
and
SELECT DISTINCT(date_time)
FROM my_table
WHERE date_time IN (SELECT DISTINCT(DATE_ADD(date_time, INTERVAL 1 HOUR))
FROM my_table)
The first one completes in a few seconds; the second hangs for hours.
I can understand that the former is better, but why such a huge performance gap?
-------- EDIT ---------------
Here are the EXPLAIN for both queries
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 1710 Using temporary
1 PRIMARY my_table ref PRIMARY PRIMARY 8 offset_dates.date_offset 555 Using index
2 DERIVED my_table index NULL PRIMARY 13 NULL 5644204 Using index; Using temporary
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY my_table range NULL PRIMARY 8 NULL 9244 Using where; Using index for group-by
2 DEPENDENT SUBQUERY my_table index NULL PRIMARY 13 NULL 5129983 Using where; Using index; Using temporary
In general, a query using a join will perform better than an equivalent query using IN (...), because the former can take advantage of indexes while the latter can't; the entire IN list must be scanned for each row which might be returned.
(Do note that some database engines perform better than others in this case; for example, SQL Server can produce equivalent performance for both types of queries.)
You can see what the MySQL query optimizer intends to do with a given SELECT query by prepending EXPLAIN to the query and running it. This will give you, among other things, a count of rows the engine will have to examine for each step in a query; multiply these counts to get the overall number of rows the engine will have to visit, which can serve as a rough estimate of likely performance.
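For instance, taking the plans shown in the question at face value: the JOIN plan visits roughly 1,710 * 555 ≈ 950,000 rows (plus one scan of ~5.6 million index entries to build the derived table), while the IN plan may re-evaluate its dependent subquery against ~5.1 million index entries for the outer rows, a worst case on the order of 9,244 * 5,129,983 ≈ 47 billion row visits. That difference alone would account for the gap.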
I would prefix both queries with EXPLAIN and then compare the difference in the access plans. You will probably find that the first query looks at far fewer rows than the second.
My hunch is that the JOIN is applied earlier than the WHERE clause. With the WHERE clause you are taking every record from my_table, applying an arithmetic function, and then sorting them, because SELECT DISTINCT usually requires a sort and sometimes creates a temporary table in memory or on disk. The number of rows examined is probably the product of the sizes of the two tables.
In the JOIN version, many of the rows that would be examined and sorted under the WHERE clause are probably eliminated beforehand. You likely end up looking at far fewer rows, and the database can take easier routes to get there.
But I think this post answers your question best: SQL fixed-value IN() vs. INNER JOIN performance
The IN clause is usually slow for huge tables. As far as I remember, the second statement you posted will simply loop through all rows of my_table (unless you have an index there), checking each row for a match against the WHERE clause. In general, IN is treated as a set of OR conditions, one per element of the set.
That's why, I think, the JOIN query is faster: it works from temporary tables created behind the scenes.
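Illustratively (the constants are made up, not from the question):
-- A constant IN list behaves roughly like a chain of ORs:
SELECT * FROM my_table
WHERE date_time IN ('2013-01-01 00:00:00', '2013-01-01 01:00:00');
-- is treated much like:
SELECT * FROM my_table
WHERE date_time = '2013-01-01 00:00:00'
   OR date_time = '2013-01-01 01:00:00';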
Here are some helpful links about that:
MySQL Query IN() Clause Slow on Indexed Column
inner join and where in() clause performance?
http://explainextended.com/2009/08/18/passing-parameters-in-mysql-in-list-vs-temporary-table/
Another thing to consider is that with the IN style, very little future optimization is possible compared to the JOIN. With the join you can possibly add an index, which, depending on the data set, might speed things up by 2, 5, or 10 times. With the IN, it's just going to run that query as written.
I have this MySQL query:
EXPLAIN EXTENDED
SELECT img.id as iid,img.*,users.*,img.ave_rate as arate,img.count_rate as cn
FROM images AS img
LEFT OUTER JOIN users on img.uid=users.id
WHERE img.id NOT IN (SELECT rates.imageid from rates WHERE rates.ip=1854604622)
GROUP BY iid
ORDER BY
iid DESC
LIMIT 30
Its output says this:
1 PRIMARY img index NULL PRIMARY 4 NULL 30 580 Using where
1 PRIMARY users eq_ref PRIMARY PRIMARY 4 twtgirls3.img.uid 1 100
2 DEPENDENT SUBQUERY rates ref imageid,ip ip 5 const 4 100 Using where
As you can see, the first row is using the PRIMARY key as the index, but in the Extra column we have "Using where". What does that mean? Does it mean the key is not used? We have the same condition in the third row....
And finally, what do you think about this query? Is it optimized?
If the Extra column also says Using where, it means the index is being used to perform lookups of key values. Without Using where, the optimizer may be reading the index to avoid reading data rows but not using it for lookups. For example, if the index is a covering index for the query, the optimizer may scan it without using it for lookups.
Source: https://dev.mysql.com/doc/refman/5.1/en/explain-output.html
I've done a lot of reading and Googling on this and I cannot find any satisfactory answer so I'd appreciate any help. Most answers I find come close to my situation but do not address it (and attempting to follow the solutions has not done me any good).
See Edit #2 below for the best example
[This was the original question but is not a great representation of what I'm asking.]
Say I have 2 tables, each with 4 columns:
key (int, auto increment)
c1 (a date)
c2 (a varchar of length 3)
c3 (also a varchar of length 3)
And I want to perform the following query:
SELECT t.c1, t.c2, COUNT(*)
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
GROUP BY t.c1, t.c2
Both key fields are indexed as primary keys. I want to get the number of rows returned in each grouping of c1, c2.
When I explain this query I get "using temporary; using filesort". The actual table I'm performing this query on is over 500,000 rows, so that means it's a time consuming query.
So my question is (assuming I'm not doing anything wrong in the query): is there a way to index this table to eliminate the temporary/filesort usage?
Thanks in advance for any help.
Edit
Here is the table definition (in this example both tables are identical - in reality they're not but I'm not sure it makes a difference at this point):
CREATE TABLE `test1` (
`key` int(11) NOT NULL auto_increment,
`c1` date NOT NULL,
`c2` varchar(3) NOT NULL,
`c3` varchar(3) NOT NULL,
PRIMARY KEY (`key`),
UNIQUE KEY `c1` (`c1`,`c2`),
UNIQUE KEY `c2_2` (`c2`,`c1`),
KEY `c2` (`c2`,`c3`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=utf8
Full EXPLAIN statement:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t ALL NULL NULL NULL NULL 2 Using temporary; Using filesort
1 SIMPLE t2 eq_ref PRIMARY PRIMARY 4 tracking.t.key 1 Using index
This is just for my sample tables. In my real tables the rows value for t is 500,000+ (every row in the table, though that could be related to something else).
Edit #2
Here is a more concrete example to better explain my situation.
Let's say I have data on Little League baseball games. I have two tables. One holds data on the games:
CREATE TABLE `ex_games` (
`game_id` int(11) NOT NULL auto_increment,
`home_team` int(11) NOT NULL,
`date` date NOT NULL,
PRIMARY KEY (`game_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
The other holds data on the at bats in each game:
CREATE TABLE `ex_atbats` (
`ab_id` int(11) NOT NULL auto_increment,
`game` int(11) NOT NULL,
`team` int(11) NOT NULL,
`player` int(11) NOT NULL,
`result` tinyint(1) NOT NULL,
PRIMARY KEY (`ab_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
So I have two questions. Let's start with the simple version: I want to return a list of games with a count of how many at bats are in each game. So I think I would do something like this:
SELECT date, home_team, COUNT(h.ab_id) FROM `ex_atbats` h
LEFT JOIN ex_games g ON g.game_id = h.game
GROUP BY g.game_id
This query uses filesort/temporary. Is there a better way to structure this or to index the tables to get rid of that?
Then, the trickier part: say I now want to not only include a count of the number of at bats, but also include a count of the number of at bats that were preceded by an at bat with the same result by the same team. I assume that would be something like:
SELECT g.date, g.home_team, COUNT(ab.ab_id), COUNT(ab2.ab_id) FROM `ex_atbats` ab
LEFT JOIN ex_games g ON g.game_id = ab.game
LEFT JOIN ex_atbats ab2 ON ab2.ab_id = ab.ab_id - 1 AND ab2.result = ab.result
GROUP BY g.game_id
Is that the correct way to structure that query? This also uses filesort/temporary.
So what is the optimal way to go about accomplishing these tasks?
Thanks again.
The phrases Using temporary and Using filesort are usually not related to the indexes used in the JOIN operation. There are numerous examples where all the indexes are in place (they show up in the key and key_len columns in EXPLAIN) but you still get Using temporary and Using filesort.
Check out what the manual says about Using temporary and Using filesort:
How MySQL Uses Internal Temporary Tables
ORDER BY Optimization
Having a combined index on all columns used in the GROUP BY clause may help to get rid of Using filesort in certain circumstances (see the sketch below). If you also issue an ORDER BY, you may need to add more complex indexes.
If you have a huge dataset, consider partitioning it on some criterion like date or timestamp, either by means of actual partitioning or by a simple WHERE clause.
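For the Edit #2 tables, the combined-index suggestion might look like this (a sketch; whether it actually removes Using filesort depends on the plan MySQL chooses):
-- Hypothetical index so the at-bats can be read in game order,
-- letting the GROUP BY walk the index instead of sorting.
ALTER TABLE ex_atbats ADD INDEX idx_game (game);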
First of all, the table definitions do matter. It's one thing to join using two primary keys, another to join using a primary key on one side and a non-unique key on the other, etc. It also matters which engine the tables use, as InnoDB treats primary keys differently than MyISAM does.
What I notice though is that on table test1, the (c1,c2) combination is Unique and the fields are not nullable. This allows your query to be rewritten as:
SELECT t.c1, t.c2, COUNT(*)
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
GROUP BY t.key
It will give the same results while using the same field for the JOIN and the GROUP BY. Note that MySQL allows you to use in the SELECT list fields that are not in the GROUP BY list, without having aggregate functions on them. This is not allowed in most other systems and is seen as a bug by some. In this situation though it is a very nice feature. Every row can be either identified by (key) or (c1,c2), so it shouldn't matter which of the two is used for the grouping.
Another thing to note is that when you use LEFT JOIN, it's common to use the joining column from the right side for the counting: COUNT(t2.key) and not COUNT(*). Your original query will give 1 in that column for records in test1 that do not match any record in test2, because it counts rows, while you probably want to count the related records in test2 and show 0 in those cases.
So, try this query and post the EXPLAIN:
SELECT t.c1, t.c2, COUNT(t2.key)
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
GROUP BY t.key
The indexes help with the join, but you still need to do a full sort in order to do the group by. Essentially, it still has to process every record in the set.
Adding a where clause and limiting the set would run faster, of course. It just won't get you the results you want.
There may be other options than doing a group by on the entire table. I notice you're doing a SELECT * - What are you trying to get out of the query?
SELECT DISTINCT c1, c2
FROM test t
LEFT JOIN test2 t2 ON t2.key = t.key
may run faster, for instance. (I realize this was just a sample query, but understand that it's hard to optimize when you don't know what the end goal is!)
EDIT - In doing some reading (http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html), I learned that, under the correct circumstances, indexes can help significantly with the group by.
What I'm seeing is that it needs to be a sorted index (like BTREE), not a HASH. Perhaps:
CREATE INDEX c1c2 ON t (c1, c2) USING BTREE;
might help.
For InnoDB it will work, as a secondary index carries your primary key by default. For MyISAM you have to make `key` the last column of your index. That gives the optimizer all keys in the same order and it can skip the sort. You cannot do any range queries on the index prefix then, or that puts you right back into filesort. I'm currently struggling with a similar problem.
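Concretely, for the tables above that would be something like (a sketch based on the paragraph above; the index name is made up):
-- InnoDB: a secondary index on (c1, c2) implicitly ends with the primary key.
-- MyISAM: append `key` explicitly so the optimizer sees the same ordering.
CREATE INDEX c1c2key ON test1 (c1, c2, `key`) USING BTREE;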