I'm trying to find the best way to delete millions of records from a MySQL database.
I have a table with a PK on ID and an index on the 'date' column, and my delete queries look like:
DELETE FROM table WHERE date < '<today - 6 months>';
It's generating a lot of replication delay on the slave.
I have two options:
DELETE FROM table WHERE date < '<today - 6 months>' LIMIT 1000;
or
add further indexing, or use the PK for deleting.
I would like to hear your opinions: would using LIMIT change the workload, and would using the PK (in combination with LIMIT) be better?
The best way to delete lots of rows, especially in a replication setup, is to walk through the table via the PRIMARY KEY in chunks of 1000 rows.
See http://mysql.rjweb.org/doc.php/deletebig for details.
If this is a recurring task, then a 'time series' PARTITIONing is even better. (Though it won't help until you set up partitioning.) The same link covers that, too.
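For illustration, a minimal sketch of that chunked PK walk, assuming the question's table/column names and an AUTO_INCREMENT id. The loop lives in application code: @left_off starts at the smallest id and the loop repeats until the SELECT no longer finds a chunk boundary:
-- find the id just past the next chunk of 1000 rows
SELECT id INTO @next_id
    FROM `table`
    WHERE id >= @left_off
    ORDER BY id
    LIMIT 1000, 1;
-- delete only within this chunk, so each statement stays small
DELETE FROM `table`
    WHERE id >= @left_off
      AND id < @next_id
      AND `date` < CURDATE() - INTERVAL 6 MONTH;
SET @left_off = @next_id;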
I have a database table that is around 700 GB with 1 billion rows; the data is approximately 500 GB and the index is 200 GB.
I am trying to delete all the data from before 2021.
There are roughly 298,970,576 rows from 2021, and 708,337,583 rows remaining.
To delete these rows, I am running the query below non-stop from my Python shell:
DELETE FROM table_name WHERE id < 1762163840 LIMIT 1000000;
id 1762163840 marks the start of the 2021 data. Deleting 1 million rows takes almost 1200-1800 seconds.
Is there any way I can speed this up? The current approach has been running for more than 15 days, not much data has been deleted so far, and it's going to take many more days.
I thought that I could make a table with just the ids of all the records I want to delete, then do an exact match like:
DELETE FROM table_name WHERE id IN (SELECT id FROM _tmp_table_name);
Will that be fast? Is it going to be faster than first making a new table with all the records to keep and then dropping the old one?
The database is set up on RDS; the instance class is db.r3.large (2 vCPU and 15.25 GB RAM), with only 4-5 connections running.
I would suggest recreating the data you want to keep -- if you have enough space:
create table keep_data as
select *
from table_name
where id >= 1762163840;
Then you can truncate the table and re-insert new data:
truncate table table_name;
insert into table_name
select *
from keep_data;
This will recreate the index.
The downside is that this will still take a while to re-insert the data (renaming keep_data would be faster). But it should be much faster than deleting the rows.
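A hedged sketch of that faster rename swap (note that CREATE TABLE ... AS SELECT does not copy indexes, so add them to keep_data first; triggers, foreign keys, and any rows written after the copy are also not carried over):
RENAME TABLE table_name TO old_table_name,
             keep_data  TO table_name;   -- atomic swap
DROP TABLE old_table_name;               -- near-instant, frees the old rows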
AND . . . this will give you the opportunity to partition the table so future deletes can be handled much faster. You should look into table partitioning if you have such a large table.
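For illustration, a minimal RANGE-partitioning sketch. It assumes a DATE/DATETIME column (created_at here is a placeholder) and that the partitioning column is part of every unique key, including the primary key, as MySQL requires:
ALTER TABLE table_name
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p2020 VALUES LESS THAN (TO_DAYS('2021-01-01')),
        PARTITION p2021 VALUES LESS THAN (TO_DAYS('2022-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );
-- a future purge is then a nearly instant metadata operation:
ALTER TABLE table_name DROP PARTITION p2020;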
Multiple techniques for big deletes: http://mysql.rjweb.org/doc.php/deletebig
It points out that LIMIT 1000000 is unnecessarily big and causes more locking than might be desirable.
In the long run, PARTITIONing would be beneficial; it mentions that, too.
If you use Gordon's technique (rebuilding the table with what you need), you lose access to the table for a long time; I provide an alternative that has essentially zero downtime.
id IN (SELECT ...) can be terribly slow -- both because of the inefficiency of IN (SELECT ...) and because the DELETE will hang on to a huge number of rows for transactional integrity.
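A hedged sketch of those smaller, PK-bounded chunks for this table. The loop runs in client code, advancing @lo from the smallest id until it reaches 1762163840; the step is in ids, not rows, so sparse ranges simply delete fewer rows per pass (the step size is tunable):
DELETE FROM table_name
    WHERE id >= @lo
      AND id < LEAST(@lo + 10000, 1762163840);
SET @lo = @lo + 10000;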
Can anyone help me rewrite this query to speed up its execution? It takes 37 seconds to execute.
DELETE FROM storefront_categories
WHERE userid IN (SELECT userid
                 FROM MASTER
                 WHERE expirydate < '2020-2-4');
At the same time, this query took only 4.69 seconds to execute:
DELETE FROM storefront_categories
WHERE userid NOT IN (SELECT userid FROM MASTER)
The table storefront_categories has 97K records, whereas MASTER has 40K records. We have created an index on the MASTER.expirydate field.
When deleting 40K rows, expect it to take time. The main cost (assuming adequate indexing and a decent query) is the overhead of transactional semantics of an "atomic" delete. This involves making a copy of each row being deleted, just in case there is a crash. That way, InnoDB can bring the database back to what it had been before the crash.
When deleting 40% of a table, it is much faster to copy the rows you want to keep into another table, then swap tables.
When deleting a large number of rows (regardless of the percentage), it is better to do it in chunks. And it is best to walk through the table based on the PRIMARY KEY.
I discuss both of those techniques, plus others, in http://mysql.rjweb.org/doc.php/deletebig
As for the query formulation:
It is version-dependent; old versions of MySQL did a poor job on some flavors.
NOT IN (SELECT ...) and NOT EXISTS tend to be the worst performers.
IN (SELECT ...) and/or EXISTS may be better.
"Multi-table DELETE is another option. It works like JOIN.
(Bottom line: You did not say what version you are running; I can't predict which formulation will be best.)
My blog avoids the formulation debate.
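For concreteness, a hedged sketch of that multi-table DELETE using the question's tables (logically equivalent to the IN/EXISTS formulations above):
DELETE sc
FROM storefront_categories AS sc
JOIN MASTER AS m ON m.userid = sc.userid
WHERE m.expirydate < '2020-02-04';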
The query looks fine as it is.
I would suggest the following indexes for optimization:
master(expirydate, userid)
storefront_categories(userid)
The first index is a covering index for the subquery on master: the database should be able to execute the subquery by looking at the index only (whereas with just expirydate in the index, it would still need to look at the table data to fetch the related userid).
The second index lets the database optimize the IN operation.
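As DDL, those suggestions would look something like this (the index names are placeholders):
CREATE INDEX idx_master_expiry_user ON master (expirydate, userid);
CREATE INDEX idx_sfc_userid ON storefront_categories (userid);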
I would try with EXISTS:
DELETE FROM storefront_categories
WHERE EXISTS (SELECT 1
              FROM MASTER M
              WHERE M.userid = storefront_categories.userid
                AND M.expirydate < '2020-02-04');
Indexes would matter here; I would expect indexes on storefront_categories(userid) and MASTER(userid, expirydate).
I would advise you to use NOT EXISTS with the correct index:
DELETE sc
FROM storefront_categories sc
WHERE NOT EXISTS (SELECT 1
                  FROM master m
                  WHERE m.userid = sc.userid
                    AND m.expirydate < '2020-02-04');
The index you want is on master(userid, expirydate). The order of the columns is important. For this version, an index on storefront_categories does not help.
Note that I changed the date format. I recommend using YYYY-MM-DD to avoid ambiguity -- and to use the full 10 characters.
I've been running a website that handles a large amount of data.
Users save data such as ip, id, and date to the server, and it is stored in a MySQL database. Each entry is stored as a single row in a table.
Right now there are approximately 24 million rows in the table.
Problem 1:
Things are getting slow now: a full table scan can take many minutes, even though I have already indexed the table.
Problem 2:
If a user is pulling data from the table with a SELECT, it can potentially block all other users' access to the site (as the table is locked) until the query is complete.
Our server
32 GB RAM
12-core / 24-thread CPU
the table uses the MyISAM engine
EXPLAIN SELECT SUM(impresn), SUM(rae), SUM(reve), `date`
FROM `publisher_ads_hits`
WHERE `date` BETWEEN '2015-05-01' AND '2016-04-02'
  AND userid = '168'
GROUP BY `date`
ORDER BY `date` DESC
To follow up on the comment from @Max P.: if you write to MyISAM tables, ALL SELECTs are blocked, because MyISAM only has a table-level lock. If you use InnoDB, there is a row-level lock that only locks the rows it needs. Also, show us the EXPLAIN of your queries; it is possible that you need to create some new indexes. MySQL generally uses only one index per table in a query, so if you use more fields in the WHERE condition, it can be useful to have a COMPOSITE INDEX over those fields.
According to EXPLAIN, the query doesn't use an index. Try adding a composite index on (userid, date).
If you have many UPDATE and DELETE operations, try changing the engine to InnoDB.
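A hedged sketch of both suggestions as DDL (the index name is a placeholder; combining them in one ALTER means a single table rebuild):
ALTER TABLE publisher_ads_hits
    ADD INDEX idx_user_date (userid, `date`),
    ENGINE = InnoDB;  -- row-level locks instead of MyISAM's table locks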
The basic problem is the full table scan. Some suggestions:
Partition the table based on date, and don't keep more than 6-12 months of data in the live system.
Add an index on userid.
I have a query like this which takes a really long time to run. The table is around 4 million rows.
DELETE FROM TABLE WHERE value_was IS NULL OR value_was <= value_now;
I'm hoping I could create an index on (value_was, value_now) so I could do something like:
DELETE FROM TABLE
WHERE ID1 IN (SELECT ID1 FROM TABLE WHERE value_was IS NULL)
   OR ID2 IN (SELECT ID2 FROM TABLE WHERE value_was <= value_now);
This table doesn't have a primary key; it has two composite keys. And I guess I cannot use the same table in a subquery. How do I improve the performance of the first query?
Thanks very much; any suggestion would be much appreciated.
Updated:
The DB is InnoDB.
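On the "same table in a subquery" point: MySQL does reject a DELETE whose subquery reads the table being deleted from (error 1093), but a common workaround, sketched here with the question's ID1, is to wrap the subquery in a derived table so it is materialized first. Caveat: with no unique key, this removes every row sharing a matching ID1:
DELETE FROM TABLE
WHERE ID1 IN (SELECT ID1
              FROM (SELECT ID1 FROM TABLE WHERE value_was IS NULL) AS tmp);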
Due to the way data is stored (as an internal linked list), InnoDB tables are inherently slow at huge DELETE operations. Changing the storage type to MyISAM should make the operation an awful lot faster - I've seen 100x improvements in similar situations.
I have to delete about 10K rows from a table that has more than 100 million rows, based on some criteria. When I execute the query, it takes about 5 minutes. I ran an explain plan (with the delete query converted to a SELECT *, since MySQL does not support EXPLAIN on DELETE) and found that MySQL uses the wrong index.
My question is: is there any way to tell MySQL which index to use during a delete? If not, what can I do? SELECT into a temp table, then delete using the temp table?
There is index hint syntax. (ETA: sadly, not for deletes.)
ETA:
Have you tried running ANALYZE TABLE $mytable?
If that doesn't pay off, I'm thinking you have two choices: drop the offending index before the delete and recreate it after, or JOIN the table you're deleting from to another table on the desired index, which should ensure that the desired index is used.
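A hedged sketch of that JOIN route: the multiple-table DELETE syntax, unlike the single-table form, accepts index hints, so even a "self" multi-table delete can force the index (all names below are placeholders):
DELETE t
FROM mytable AS t FORCE INDEX (desired_index)
WHERE t.created_at < '2010-01-01';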
I've never really come across a situation where MySQL chose the wrong index, but rather my understanding of how indexes worked was usually at fault.
You might want to check out this book: http://oreilly.com/catalog/9780596003067
It has a great section on how indexes work and other tuning options.
As stated in other answers, you can't give MySQL an index hint for a DELETE; the PRIMARY KEY index is the one you can rely on.
So your best option, if you have a PRIMARY KEY on the table, is to run a fast SELECT, then DELETE the corresponding rows. Preferably in a TRANSACTION, so that you don't delete the wrong rows.
Hence:
DELETE FROM table WHERE column_with_index = 0
will be rewritten as:
SELECT primary_key FROM table WHERE column_with_index = 0 => returns many rows
DELETE FROM table WHERE primary_key IN (?, ?, ?) => each ? is replaced by one of the SELECTed primary keys
If you don't have that many rows to delete, this way is more efficient.
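A hedged sketch of that two-step pattern inside a transaction; FOR UPDATE locks the selected rows so they can't change between the two statements (names are the placeholders from above):
START TRANSACTION;
SELECT primary_key FROM table WHERE column_with_index = 0 FOR UPDATE;
-- application code collects the returned keys, then:
DELETE FROM table WHERE primary_key IN (?, ?, ?);
COMMIT;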
For example, I just hit a case on the same table, with the same data:
7,499,067 rows analyzed by the DELETE: 12 seconds
vs.
6 rows analyzed by the SELECT using a good index: 0.10 seconds
0 rows to be deleted in the end