(I found that the same question already exists, but was not happy with the level of detail in the answers, so I came here for help; forgive my ignorance.)
DELETE FROM supportrequestresponse # ~3 million records
WHERE SupportRequestID NOT IN (
SELECT SR.SupportRequestID
FROM supportrequest AS SR # ~1 million records
)
Or
DELETE SRR
FROM supportrequestresponse AS SRR # ~3 million records
LEFT JOIN supportrequest AS SR
ON SR.SupportRequestID = SRR.SupportRequestID # ~1 million records
WHERE SR.SupportRequestID IS NULL
Specifics
Database: MySQL
SR.SupportRequestID is INTEGER PRIMARY KEY
SRR.SupportRequestID is INTEGER INDEX
SR.SupportRequestID & SRR.SupportRequestID are not in FOREIGN KEY relation
Both tables contain TEXT columns for subject and message
Both tables are InnoDB
Motive: I am planning to use this in a periodic clean-up job, likely once every hour or two. It is very important to avoid a lengthy operation, in order to avoid long-held table locks: this is a very busy database and I am already over quota on deadlocks!
EXPLAIN query 1
1 PRIMARY supportrequestresponse ALL 410 Using where
2 DEPENDENT SUBQUERY SR unique_subquery PRIMARY PRIMARY 4 func 1 Using index
EXPLAIN query 2
1 SIMPLE SRR ALL 410
1 SIMPLE SR eq_ref PRIMARY PRIMARY 4 SRR.SupportRequestID 1 Using where; Using index; Not exists
RUN #2
EXPLAIN query 1
1 PRIMARY supportrequestresponse ALL 157209473 Using where
2 DEPENDENT SUBQUERY SR unique_subquery PRIMARY PRIMARY 4 func 1 Using index; Using where; Full scan on NULL key
EXPLAIN query 2
1 SIMPLE SRR ALL 157209476
1 SIMPLE SR eq_ref PRIMARY PRIMARY 4 SRR.SupportRequestID 1 Using where; Using index; Not exists
I suspect it would be quicker to create a new table, retaining just the rows you wish to keep. Then drop the old table. Then rename the new table.
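That pattern might look roughly like the following; the table names come from the question, but the exact steps are an assumption, and the table is briefly unavailable during the swap:

```sql
-- Sketch of the copy/drop/rename approach (assumes no writes arrive
-- between the INSERT and the RENAME; best run in a maintenance window).
CREATE TABLE supportrequestresponse_new LIKE supportrequestresponse;

INSERT INTO supportrequestresponse_new
SELECT SRR.*
FROM supportrequestresponse AS SRR
JOIN supportrequest AS SR
  ON SR.SupportRequestID = SRR.SupportRequestID;  -- keep only matching rows

-- RENAME TABLE swaps both names in one atomic statement.
RENAME TABLE supportrequestresponse TO supportrequestresponse_old,
             supportrequestresponse_new TO supportrequestresponse;

DROP TABLE supportrequestresponse_old;
```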
I don't quite know how to describe this, but the following worked as an answer in my case; an unbelievable one!
DELETE SRR
FROM supportrequestresponse AS SRR
LEFT JOIN (
SELECT SRR3.SupportRequestResponseID
FROM supportrequestresponse AS SRR3
LEFT JOIN supportrequest AS SR ON SR.SupportRequestID = SRR3.SupportRequestID
WHERE SR.SupportRequestID IS NULL
LIMIT 999
) AS SRR2 ON SRR2.SupportRequestResponseID = SRR.SupportRequestResponseID
WHERE SRR2.SupportRequestResponseID IS NOT NULL;
... # same DELETE statement
... # same DELETE statement
... # 99th copy of the same DELETE statement
A fork of the second pattern looked/felt more appropriate than making MySQL match each row against a dynamic list, but that is the minor point. I limited the selection to only 999 rows at a time, which lets each DELETE finish in the blink of an eye; most importantly, I repeated the same DELETE statement 99 times, one after another!
This basically made it super comfortable for a cron job. The 99 separate statements let the database engine leave the tables unlocked between batches, so other processes don't get stuck waiting for the DELETE to finish, while each individual DELETE takes very little time. I find it something like vehicles passing through a crossroads in zipper fashion.
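Instead of pasting the same statement 99 times, the batching could be wrapped in a loop. This is a hedged sketch using a stored procedure and ROW_COUNT(), not the exact script from the answer; the procedure name is made up. (Note that multi-table DELETE does not accept LIMIT directly, which is why the LIMIT lives in the derived table, exactly as in the answer's query.)

```sql
DELIMITER //
CREATE PROCEDURE purge_orphan_responses()  -- hypothetical name
BEGIN
  REPEAT
    DELETE SRR
    FROM supportrequestresponse AS SRR
    JOIN (
      SELECT SRR3.SupportRequestResponseID
      FROM supportrequestresponse AS SRR3
      LEFT JOIN supportrequest AS SR
        ON SR.SupportRequestID = SRR3.SupportRequestID
      WHERE SR.SupportRequestID IS NULL
      LIMIT 999                     -- small batch => short-lived locks
    ) AS SRR2
      ON SRR2.SupportRequestResponseID = SRR.SupportRequestResponseID;
  UNTIL ROW_COUNT() = 0 END REPEAT; -- stop when a batch deletes nothing
END//
DELIMITER ;
```

Each iteration commits its own small batch (under autocommit), so other sessions can interleave between batches, much like the 99 hand-pasted statements.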
Related
I have a query that updates a field in a table using the primary key to locate the row. The table can contain many rows where the date/time field is initially NULL, and then is updated with a date/time stamp using NOW().
When I run the update statement on the table, I am getting a slow query log entry (3.38 seconds). The log indicates that 200,000 rows were examined. Why would that many rows be examined if I am using the PK to identify the row being updated?
The primary key is (item_id, customer_id). I have verified the PRIMARY KEY is correct in the MySQL table structure.
UPDATE cust_item
SET status = 'approved',
lstupd_dtm = NOW()
WHERE customer_id = '7301'
AND item_id = '12498';
I wonder if it's a hardware issue.
While the changes I've mentioned in comments might help slightly, in truth, I cannot replicate this issue...
I have a data set of roughly 1m rows...:
CREATE TABLE cust_item
(customer_id INT NOT NULL
,item_id INT NOT NULL
,status VARCHAR(12) NULL
,PRIMARY KEY(customer_id,item_id)
);
-- INSERT some random rows...
SELECT COUNT(*)
, SUM(customer_id = 358) dense
, SUM(item_id=12498) sparse
FROM cust_item;
+----------+-------+--------+
| COUNT(*) | dense | sparse |
+----------+-------+--------+
| 1047720 | 104 | 8 |
+----------+-------+--------+
UPDATE cust_item
SET status = 'approved'
WHERE item_id = '12498'
AND customer_id = '358';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
How long does it take to select the record, without the update?
If select is fast then you need to look into things that can affect update/write speed.
too many indexes on the table, don't forget filtered indexes and indexed views
the index pages have 0 fill factor and need to split to accommodate the data change.
referential constraints with cascade
triggers
slow write speed at the storage level
If the select is slow
old/bad statistics on the index
extreme fragmentation
columnstore index with too many open rowgroups
If the select speed improves significantly after the first time, you may be having some cold buffer performance issues. That could point to storage I/O problems as well.
You may also be having concurrency issues caused by another process locking the table momentarily.
Finally, any chance the tool executing the query is returning a false duration? For example, SQL Server Management Studio can occasionally be slow to return a large resultset, even if the server handled it very quickly.
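A quick way to apply the first check above: time the read path on its own, then look at the write plan. The values come from the question; note that EXPLAIN on an UPDATE requires MySQL 5.6.3 or later.

```sql
-- 1. Time the read path alone: if this is slow, suspect statistics,
--    fragmentation, or cold buffers rather than the write path.
SELECT status, lstupd_dtm
FROM cust_item
WHERE customer_id = 7301
  AND item_id = 12498;

-- 2. Check the plan for the write: for a PK lookup, type should be
--    'const' (or 'range') with key = PRIMARY, and rows should be 1.
EXPLAIN
UPDATE cust_item
SET status = 'approved', lstupd_dtm = NOW()
WHERE customer_id = 7301
  AND item_id = 12498;
```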
I have a slow UPDATE statement and can't work out why
CREATE TEMPORARY TABLE ttShifts
(ShiftId int NOT NULL,
 ttdtAdded datetime NOT NULL,
 ttdtBookingStart DATETIME NOT NULL,
 ttHoursNeeded int NOT NULL,
 szHidden varchar(255),
 PRIMARY KEY (ShiftID))
AS
(select shiftId, dtAdded as ttdtAdded, dtBookingStart as ttdtBookingStart, HoursNeeded as ttHoursNeeded
 from shifts where shifts.lStatus=0);
update ttShifts
set szHidden='x'
where szHidden is NULL
  and ShiftId in (
    select shiftid
    from shifts, practices
    where shifts.PracticeId=practices.PracticeId
      and shifts.iBranch = practices.iBranch
      and practices.Healthboard not in (
        select Locname from userlocationprefs
        where iUser=82 and Level=0 and fAcceptWork=true));
166 rows affected. (Query took 0.2899 seconds.)
EXPLAIN:
1 PRIMARY ttShifts index PRIMARY 4 297 Using where
2 DEPENDENT SUBQUERY shifts eq_ref PRIMARY PRIMARY 4 func 1
2 DEPENDENT SUBQUERY practices ALL PRIMARY 636 Using where; Using join buffer (flat, BNL join)
3 MATERIALIZED userlocationprefs ref PRIMARY PRIMARY 8 const,const 3 Using where
OK, so let's try switching this to use a join to eliminate the dependent subqueries
update ttShifts
join shifts on (ttShifts.ShiftID=shifts.shiftId)
join practices on (shifts.practiceId=practices.PracticeId and shifts.iBranch=practices.iBranch)
set szHidden='x'
where szHidden is NULL
  and practices.Healthboard not in (
    select Locname from userlocationprefs
    where iUser=82 and Level=0 and fAcceptWork=true);
166 rows affected. (Query took 0.4009 seconds.)
Right, so that takes longer
EXPLAIN:
1 PRIMARY ttShifts ALL PRIMARY 297 Using where
1 PRIMARY shifts eq_ref PRIMARY PRIMARY 4 ttShifts.shiftId 1
1 PRIMARY practices ALL 636 Using where
2 MATERIALIZED userlocationprefs ref PRIMARY PRIMARY 8 const,const 3 Using where
OK, so it must be the MATERIALIZED bit it's not doing efficiently for some reason, let's try swapping that to a straight equality check just as a test.
update ttShifts
join shifts on (ttShifts.ShiftID=shifts.shiftId)
join practices on (shifts.practiceId=practices.PracticeId and shifts.iBranch=practices.iBranch)
set szHidden='x'
where szHidden is NULL
  and practices.Healthboard!='X';
0.3493 seconds.
OK, not that then.
If I eliminate the update and make it a select...
select *
from ttShifts
join shifts on (ttShifts.ShiftID=shifts.shiftId)
join practices on (shifts.practiceId=practices.PracticeId and shifts.iBranch=practices.iBranch)
where szHidden is NULL
  and practices.Healthboard not in (
    select Locname from userlocationprefs
    where iUser=82 and Level=0 and fAcceptWork=true);
(166 rows, Query took 0.0159 seconds.)
So why is the UPDATE so bloody slow, and what can I do to speed it up?
Your first EXPLAIN output tells us that it may need to process 297 * 1 * 636 * 3 = 566,676 row combinations, so yes, it will take a moment. The second EXPLAIN output is similar.
If I were you, I would focus on the entries marked ALL, as they represent full table scans.
Also, IN works best with a list of constant values, not a subquery; a subquery can render the index useless. Try to replace the subquery with constant values if possible.
The second EXPLAIN is even worse, because there are two table scans with no usable index.
For the third UPDATE there is no EXPLAIN output, but again I think the number of rows to process is still high, because the WHERE clause filters on low-cardinality criteria (szHidden IS NULL and Healthboard != 'X').
Your last query compares the speed of SELECT vs UPDATE. Well, UPDATE is more expensive because it has to find the matching rows, write the new values, and update the indexes.
From what I can see, most of your problem is due to low-cardinality columns being used as filtering criteria.
For anyone else who runs across this, the issue seems to be how the optimiser handles IN with UPDATE vs SELECT. I refactored things so the user's viewing preferences were stored in a separate table, testuserchosenpractices, which I could then join.
I can't explain the original difference in speed between UPDATE and SELECT (the SELECT was perfectly tolerable), but with this replacement the UPDATE is faster than the original SELECT was!
update (select 1) as dummy, ttShifts
set szHidden='x'
where szHidden is NULL
  and ShiftId in (
    select shiftid
    from shifts
    join practices on (shifts.PracticeId=practices.PracticeId and shifts.iBranch = practices.iBranch)
    join testuserchosenpractices on (shifts.practiceid=testuserchosenpractices.practiceid
      and shifts.ibranch=testuserchosenpractices.ibranch
      and testuserchosenpractices.iUser=82)
      and szReason!='');
166 rows affected. (Query took 0.0066 seconds.)
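The separate preferences table might look roughly like this; the column list is inferred from the join conditions above, and the population query mirrors the original NOT IN logic (the szReason filter's home table isn't shown in the question, so it is omitted here):

```sql
-- Hypothetical definition of the pre-computed preferences table;
-- names are taken from the join conditions in the UPDATE above.
CREATE TABLE testuserchosenpractices
(iUser      int NOT NULL,
 practiceid int NOT NULL,
 ibranch    int NOT NULL,
 PRIMARY KEY (iUser, practiceid, ibranch));

-- Populated per user from the original NOT IN logic, so the
-- UPDATE only has to join against a small constant set of rows.
INSERT INTO testuserchosenpractices (iUser, practiceid, ibranch)
SELECT 82, p.PracticeId, p.iBranch
FROM practices p
WHERE p.Healthboard NOT IN
  (SELECT Locname FROM userlocationprefs
   WHERE iUser=82 AND Level=0 AND fAcceptWork=true);
```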
I'm trying to execute a select query over a fairly simple (but large) table and am getting over 10x slower performance when I don't join on a certain secondary table.
TableA is keyed on two columns, 'ID1' & 'ID2', and has a total of 10 numeric (int + dbl) columns.
TableB is keyed on 'ID1' and has a total of 2 numeric (int) columns.
SELECT
AVG(NULLIF(dollarValue, 0))
FROM
TableA
INNER JOIN
TableB
ON
TableA.ID1 = TableB.ID1
WHERE
TableA.ID2 = 5
AND
TableA.ID1 BETWEEN 15000 AND 20000
As soon as the join is removed, performance takes a major hit. The query above takes 0.016 seconds to run while the query below takes 0.216 seconds to run.
The end goal is to replace TableA's 'ID1' with TableB's 2nd (non-key) column and deprecate TableB.
SELECT
AVG(NULLIF(dollarValue, 0))
FROM
tableA
WHERE
ID2 = 5
AND
ID1 BETWEEN 15000 AND 20000
Both tables have indexes on their primary keys. The relationship between the two tables is One-to-Many. DB Engine is MyISAM.
Scenario 1 (fast):
id select_type table type possible_keys key key_len ref rows extra
1 SIMPLE TableB range PRIMARY PRIMARY 4 498 Using where; Using index
1 SIMPLE TableA eq_ref PRIMARY PRIMARY 8 schm.TableA.ID1,const 1
Scenario 2 (slow):
id stype table type possKey key key_len ref rows extra
1 SIMPLE TableA range PRIMARY PRIMARY 8 288282 Using where
Row count and lack of any mention of an index in scenario 2 definitely stand out, but why would that be the case?
700 results from both queries -- same data.
Given your query, I'd say an index like this might be useful:
CREATE INDEX i ON tableA(ID2, ID1)
A possible reason why your first query is much faster is that you probably have only a few records in tableB, which makes the join predicate very selective compared to the range predicate.
I suggest reading up on indexes. Knowing 2-3 details about them will help you easily tune your queries only by choosing better indexes.
I have the following tables (example)
t1 (20.000 rows, 60 columns, primary key t1_id)
t2 (40.000 rows, 8 columns, primary key t2_id)
t3 (50.000 rows, 3 columns, primary key t3_id)
t4 (30.000 rows, 4 columns, primary key t4_id)
sql query:
SELECT COUNT(*) AS count FROM (t1)
JOIN t2 ON t1.t2_id = t2.t2_id
JOIN t3 ON t2.t3_id = t3.t3_id
JOIN t4 ON t3.t4_id = t4.t4_id
I have created indexes on the columns involved in the joins (e.g. on t1.t2_id) and foreign keys where necessary. The query is slow (600 ms), and if I add WHERE clauses (e.g. WHERE t1.column10 = 1, where column10 has no index), it becomes much slower. Queries I run with SELECT * and LIMIT are fast, and I can't understand COUNT's behaviour. Any solution?
EDIT: EXPLAIN SQL ADDED
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t4 index PRIMARY user_id 4 NULL 5259 Using index
1 SIMPLE t2 ref PRIMARY,t4_id t4_id 4 t4.t4_id 1 Using index
1 SIMPLE t1 ref t2_id t2_id 4 t2.t2_id 1 Using index
1 SIMPLE t3 ref PRIMARY PRIMARY 4 t2.t2_id 1 Using index
where user_id is a column of t4 table
EDIT: I changed from InnoDB to MyISAM and got a speed increase, especially with WHERE clauses, but I still see times of 100-150 ms. The reason I want the count in my application is to show the user who is submitting a search form how many results to expect, via AJAX. Maybe there is a better solution, for example a temporary table that is updated every hour?
The count query is simply faster because of an INDEX ONLY SCAN, as stated in the query plan. The query you mention touches only indexed columns, which is why execution never needs to read the physical data: the whole query is answered from the indexes. When you add a clause involving columns that are not indexed (or indexed in a way that prevents index usage), the rows must be fetched from the heap table by physical address, which is very slow.
EDIT:
Another important thing is that those are PKs, so they are UNIQUE. The optimizer chooses to perform an INDEX RANGE SCAN on the first index, and then only checks whether the keys exist in the subsequent indexes (that's why the plan states only one row will be returned).
EDIT2:
Thanks to J. Bruni: since this is InnoDB, the primary key is in fact a clustered index, so the above isn't the whole truth. There is probably a full scan on the first table, and three subsequent index accesses to confirm the FK existence.
COUNT iterates over the whole result set and does not depend on indexes. Use EXPLAIN ANALYZE on your query to check how it is executed.
SELECT + LIMIT does not iterate over the whole result set, hence it's faster.
Regarding the COUNT(*) slow performance: are you using InnoDB engine? See:
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/
"SELECT COUNT(*)" is slow, even with where clause
The main information seems to be: "InnoDB uses clustered primary keys, so the primary key is stored along with the row in the data pages, not in separate index pages."
So, one possible solution is to create a separate index and force its usage through the USE INDEX hint in the SQL query. See this comment for a sample usage report:
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/comment-page-1/#comment-529049
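The forced-index approach might look roughly like this; the index name is made up, and whether it actually wins depends on the data:

```sql
-- A slim secondary index is often cheaper to scan than InnoDB's clustered
-- primary key, because its pages carry no row data. 'ix_count_helper'
-- is a hypothetical name for illustration only.
CREATE INDEX ix_count_helper ON t1 (t2_id);

SELECT COUNT(*) AS count
FROM t1 USE INDEX (ix_count_helper)
JOIN t2 ON t1.t2_id = t2.t2_id
JOIN t3 ON t2.t3_id = t3.t3_id
JOIN t4 ON t3.t4_id = t4.t4_id;
```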
Regarding the WHERE issue, the query will perform better if you put the condition in the JOIN clause, like this:
SELECT COUNT(t1.t1_id) AS count FROM (t1)
JOIN t2 ON (t1.column10 = 1) AND (t1.t2_id = t2.t2_id)
JOIN t3 ON t2.t3_id = t3.t3_id
JOIN t4 ON t3.t4_id = t4.t4_id
I have this MySQL query:
EXPLAIN EXTENDED
SELECT img.id as iid,img.*,users.*,img.ave_rate as arate,img.count_rate as cn
FROM images AS img
LEFT OUTER JOIN users on img.uid=users.id
WHERE img.id NOT IN (SELECT rates.imageid from rates WHERE rates.ip=1854604622)
GROUP BY iid
ORDER BY
iid DESC
LIMIT 30
Its output says this:
1 PRIMARY img index NULL PRIMARY 4 NULL 30 580 Using where
1 PRIMARY users eq_ref PRIMARY PRIMARY 4 twtgirls3.img.uid 1 100
2 DEPENDENT SUBQUERY rates ref imageid,ip ip 5 const 4 100 Using where
As you can see in the first row it is using the PRIMARY key as index but in the extra column we have "Using Where", what does it mean? does it mean that the key is not used? We have the same condition in the third row....
And finally, what do you think about this query? Is it optimized?
If the Extra column also says Using where, it means the index is being used to perform lookups of key values. Without Using where, the optimizer may be reading the index to avoid reading data rows but not using it for lookups. For example, if the index is a covering index for the query, the optimizer may scan it without using it for lookups.
Source: https://dev.mysql.com/doc/refman/5.1/en/explain-output.html