I am trying to delete more than 2 million records from a table with a MySQL query (no joins). The table has around 80 million records.
I used set autocommit=0; and it is taking a long time to complete. Will this be faster if I run the query with autocommit=1?
I'm assuming your table is InnoDB. For those 2 million rows, it needs to keep track of the undo log entries for each modification. These build up in memory and will eventually spill to disk. That's why it's taking a long time. If you do it in chunks, that prevents the undo log from spilling to disk and lets MySQL keep track of fewer undo log entries at once, making things more efficient.
The autocommit happens at the end of your query, so changing it wouldn't do anything.
The best way to figure out what your chunk size should be is by experimenting. Something like
delete from table1 limit 1000;
Then keep doubling it until you come up with the best rows-deleted per time ratio.
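If it helps, here is a rough, hypothetical way to measure that ratio for a single chunk (table1 is the placeholder name from above; add your own WHERE clause, and note that NOW(6) needs MySQL 5.6+):
SET @t0 = NOW(6);
DELETE FROM table1 LIMIT 1000;            -- one chunk at the size being tested
SELECT ROW_COUNT() AS rows_deleted,       -- rows removed by the DELETE above
       TIMESTAMPDIFF(MICROSECOND, @t0, NOW(6)) / 1000000 AS seconds_taken;
Double the LIMIT, rerun, and compare seconds_taken per row until the improvement flattens out.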
I am assuming you are trying to run 2 million individual delete statements.
If you try bulk deletes using the primary key or ranges to delete 100-1000 records at a time it will be much much faster.
Examples:
DELETE FROM Table WHERE ID > 0 AND ID < 1000;
OR
DELETE FROM Table WHERE ID IN (1,2,3,4,5 .... 1000);
You can adjust the number of records to delete to your liking, increasing it quite a bit if you want. On high load production servers I usually run scripts with smaller ranges like this maybe 100 times before sleeping for a bit and then continuing with another loop.
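As a hypothetical sketch of such a script, written as a MySQL stored procedure (mytable, the integer primary key ID, and the batch/sleep values are placeholders to adjust):
DELIMITER //
CREATE PROCEDURE delete_id_ranges(IN start_id BIGINT, IN end_id BIGINT,
                                  IN batch INT, IN pause_secs INT)
BEGIN
  DECLARE lo BIGINT;
  SET lo = start_id;
  WHILE lo < end_id DO
    DELETE FROM mytable WHERE ID >= lo AND ID < lo + batch;  -- one small range per statement
    SET lo = lo + batch;
    DO SLEEP(pause_secs);                                    -- pause between batches on a busy server
  END WHILE;
END //
DELIMITER ;
-- e.g. CALL delete_id_ranges(1, 2000000, 1000, 1);
This version sleeps after every batch; sleeping only every 100 batches or so, as described above, is a straightforward variation.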
I always have autocommit turned on for this type of thing. Managing a transaction to delete 2 million records would add a lot of overhead.
Also, please ensure the column you use for the bulk/range deleting is either the primary key or has an index.
It won't be any faster if you change the autocommit variable. MySQL always builds the old row image even when autocommit is on, because if the user interrupts the query it must have the old image to roll back.
For MySQL:
What's the difference between a DROP PARTITION vs a DELETE WHERE query?
Which to use when?
My scenario: the simple matter of deleting data older than a month from a few of my tables, at the end of every month. Tables fill at the slow rate of around 5 entries every second.
Pros / Cons
PARTITIONing with InnoDB requires me to disable my FOREIGN KEYs. So, DELETEing seems better for me. What, if any, advantages would PARTITIONing give me? Is it only the query's execution speed, a.k.a. performance? My deletion query would run only once every month so I don't see a problem with execution time.
For what it's worth, dropping a partition is a data definition language statement. It happens quickly.
DELETE is a data manipulation statement. If you use InnoDB (you probably do) it's transactional. What does that mean?
When you issue the statement, for example,
DELETE FROM tbl WHERE start_date < CURDATE() - INTERVAL 1 MONTH
it means that other programs accessing your database will see either all the rows you're deleting, before your DELETE transaction, or none of them. The operation is called atomic or indivisible -- it appears to happen all at once.
If you delete many rows, this can put a big burden on your database server. It has to accumulate a transaction log containing all the deleted rows, then commit the transaction log all at once, possibly locking out other access.
Your question says you must delete about 13 megarows each month. If you do that with just one DELETE statement, you will put a heavy burden on your database. To reduce the burden when deleting unwanted historical rows, do the DELETE operation in chunks. How? You have a couple of choices.
#Akina suggested this: Do the deletion often enough that you don't delete a large number of rows at once, or
Do the deletion in batches of 1000 rows with a statement like this:
DELETE FROM tbl
WHERE start_date < CURDATE() - INTERVAL 1 MONTH
LIMIT 1000;
and repeat the statement until it deletes no rows.
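A hypothetical way to automate that repetition is a stored procedure that loops until ROW_COUNT() reports nothing left to delete (tbl and start_date are the names from the statement above; the procedure name is made up):
DELIMITER //
CREATE PROCEDURE purge_old_history()
BEGIN
  DECLARE deleted INT DEFAULT 1;
  WHILE deleted > 0 DO
    DELETE FROM tbl
    WHERE start_date < CURDATE() - INTERVAL 1 MONTH
    LIMIT 1000;
    SET deleted = ROW_COUNT();   -- becomes 0 once no old rows remain
  END WHILE;
END //
DELIMITER ;
With autocommit on, each 1000-row DELETE commits on its own, so no single huge transaction builds up.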
"5 entries every second" = about 400K/day or 13M/month
DELETING 3M rows in a single statement:
Very slow for that many rows. (Not bad for under 1K rows)
Blocks most activity on the table
Builds a very big list of things for potential "rollback" (in case of power failure)
Scheduled DELETE
Why wait for the end of the month? Do up to 1000 every minute; that will keep up with much less overhead. Be sure to have a suitable index, else it won't work efficiently.
Rather than a monthly task, have a separate task that is continually running, deleting up to 200 rows, then moving on to the next table; eventually repeating. (If it is not keeping up, increase the "LIMIT 200"; if it is too invasive, add a SLEEP in the loop.)
Do not use cron or EVENT to schedule the delete. If, for whatever reason, a Delete run fails to finish before the next Delete, the job could become a train wreck. OTOH, a continually-running job needs a "keep-alive" task to restart it if it dies for any unforeseen reason.
DROP PARTITION
Because of how PARTITIONs are implemented as separate 'tables', DROP PARTITION is similar to DROP TABLE.
Very fast, regardless of the number of rows in the partition. (Well, the OS may show a slight sluggishness for huge files.)
Easy to do if using PARTITION BY RANGE(..); see the sketch below.
I recommend that the number of partitions be between 20 and 50; adjust the deletion frequency accordingly. (1-month retention --> daily partitions; 3-month retention --> weekly partitions; 1-year retention --> monthly or weekly; etc.)
When partitioning a table, rethink all the indexes. You may be able to improve a few queries by making use of partition pruning. (But don't expect much.)
More info: Partition
PARTITIONing conflicts with FOREIGN KEYS and some UNIQUE keys. This puts a burden on the programmer to worry about (or ignore) the loss of those constraints.
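For illustration, a minimal sketch of a PARTITION BY RANGE layout and the matching DROP PARTITION (hypothetical table and column names; monthly partitions on a DATETIME column):
CREATE TABLE log_entries (
    id         BIGINT NOT NULL AUTO_INCREMENT,
    created_at DATETIME NOT NULL,
    payload    VARCHAR(255),
    PRIMARY KEY (id, created_at)      -- the partition column must be part of every unique key
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION pfuture  VALUES LESS THAN MAXVALUE  -- reorganize this one when adding new months
);
ALTER TABLE log_entries DROP PARTITION p2024_01;  -- drops a month of data almost instantly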
Here's my blog on other big-deletion techniques.
I have a delete query which deletes rows in chunks (2000 rows per chunk):
Delete from Table1 where last_refresh_time < {time value}
Here I want to delete the rows in the table which have not been refreshed for the last 5 days.
Usually the delete will be around 10 million rows. This process runs once per day during a small off-peak window.
The query executes a little faster on the master, but because of ROW-based replication the slave lags heavily, as the slave's SQL thread deletes each row one by one from the relay log data.
We use the READ COMMITTED isolation level.
Is it okay to change this query's transaction alone to STATEMENT-based replication?
Will we face any issues?
The MySQL documentation mentions the following; can someone explain whether other transactions' INSERTs will be affected?
If you are using InnoDB tables and the transaction isolation level is READ COMMITTED or READ UNCOMMITTED, only row-based logging can be used. It is possible to change the logging format to STATEMENT, but doing so at runtime leads very rapidly to errors because InnoDB can no longer perform inserts
If other transactions' INSERTs are affected, can we change the isolation level to REPEATABLE READ for this DELETE transaction alone? Is that recommended?
Please share your views and suggestions for this lag issue.
MySQL - InnoDB engine - 5.7.18
Don't do a single DELETE that removes 10M rows. Or 1M. Not even 100K.
Do the delete online. Yes, it is possible, and usually preferable.
Write a script that walks through the table 200 rows at a time. DELETE and COMMIT any "old" rows in that 200. Sleep for 1 second, then move on to the next 200. When it hits the end of the table, simply start over. (1K rows per chunk may be OK.) Walk through the table via the PRIMARY KEY so that the effort to 'chunk' is minimized. Note that 200 rows plus a 1-second delay will let you get through the table in about a day, effectively as fast as your current code, but with much less interference.
More details: http://mysql.rjweb.org/doc.php/deletebig Note, especially, how it is careful to touch only N rows (N=200 or whatever) of the table per pass.
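A hypothetical sketch of that walk as a stored procedure, using the Table1/last_refresh_time names from the question and assuming a positive integer primary key named id:
DELIMITER //
CREATE PROCEDURE purge_stale_rows()
BEGIN
  DECLARE lo BIGINT DEFAULT 0;
  DECLARE hi BIGINT;
  purge_loop: LOOP
    SET hi = NULL;
    -- find the PK value roughly 200 rows ahead of the current position
    SELECT id INTO hi FROM Table1 WHERE id > lo ORDER BY id LIMIT 200, 1;
    IF hi IS NULL THEN
      -- tail of the table: delete whatever still qualifies, then stop
      DELETE FROM Table1
      WHERE id > lo AND last_refresh_time < NOW() - INTERVAL 5 DAY;
      LEAVE purge_loop;
    END IF;
    DELETE FROM Table1
    WHERE id > lo AND id <= hi
      AND last_refresh_time < NOW() - INTERVAL 5 DAY;  -- autocommit commits each small chunk
    SET lo = hi;
    DO SLEEP(1);                                       -- lets the replica apply the row events
  END LOOP;
END //
DELIMITER ;
Re-CALL it (or wrap the call in an outer loop with a keep-alive) to make it the continually-running task described above.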
My suggestion helps avoid Replica lag in these ways
Lower count (200 vs 2000). That many 'events' will be dumped into the replication stream all at once. Hence, other events are stuck behind them.
Touch only 200 rows -- by using the PRIMARY KEY, careful use of LIMIT, etc.
"Sleep" between chunks -- The Primary primes the cache with an initial SELECT that is not replicated. Hence, in Row Based Replication, the Replica is likely to be caught off guard (rows to delete have not been cached). The Sleep gives it a chance to finish the deletes and handle other replication items before the next chunk comes.
Discussion: With Row Based Replication (which is preferable), a 10M DELETE will ship 10M 1-row deletes to the Replicas. This clogs replication, delays replication, etc. By breaking it into small chunks, such overhead has a reasonable impact on replication.
Don't worry about isolation mode, etc, simply commit each small chunk. 100 rows will easily be done in less than a second. Probably 1K will be that fast. 10M will certainly not.
You said "refreshed". Does this mean that the processing updates a timestamp in the table? And this happens at 'random' times for 'random' rows? And such an update can happen multiple times for a given row? If that is what you mean, then I do not recommend PARTITIONing, which is also discussed in the link above.
Note that I do not depend on an index on that timestamp, much less suggest partitioning by that timestamp. I want to avoid the overhead of updating such an index so rapidly. Walking through the table via the PK is a very good alternative.
Do you really need the READ COMMITTED isolation level? It's not MySQL's default, and it gives weaker isolation guarantees.
But anyway:
For this query you can change the session isolation level to REPEATABLE READ and use MIXED mode for binlog_format.
With that you will get STATEMENT-based replication for this session only.
Maybe that table's usage would be a better fit for a NoSQL tool like MongoDB with a TTL index.
I'm running an ETL process and streaming data into a MySQL table.
Now it is being written over a web connection (fairly fast one) -- so that can be a bottleneck.
Anyway, it's a basic insert/ update function. It's a list of IDs as the primary key/ index .... and then a few attributes.
If a new ID is found, insert, otherwise, update ... you get the idea.
Currently doing an "update, else insert" function based on the ID (indexed) is taking 13 rows/ second (which seems pretty abysmal, right?). This is comparing 1000 rows to a database of 250k records, for context.
When doing a "pure" insert everything approach, for comparison, already speeds up the process to 26 rows/ second.
The thing with the pure "insert" approach is that I can have 20 parallel connections "inserting" at once ... (20 is max allowed by web host) ... whereas any "update" function cannot have any parallels running.
Thus 26 x 20 = 520 r/s. Quite greater than 13 r/s, especially if I can rig something up that allows even more data pushed through in parallel.
My question is ... given the massive benefit of inserting vs. updating, is there a way to duplicate the 'update' functionality (I only want the most recent insert of a given ID to survive) .... by doing a massive insert, then running a delete function after the fact, that deletes duplicate IDs that aren't the 'newest' ?
Is this something easy to implement, or something that comes up often?
What else I can do to ensure this update process is faster? I know getting rid of the 'web connection' between the ETL tool and DB is a start, but what else? This seems like it would be a fairly common problem.
Ultimately there are 20 columns, max of probably varchar(50) ... should I be getting a lot more than 13 rows processed/ second?
There are many possible 'answers' to your questions.
13/second -- a lot that can be done...
INSERT ... ON DUPLICATE KEY UPDATE ... ('IODKU') is usually the best way to do "update, else insert" (unless I don't know what you mean by it).
Batched inserts are much faster than inserting one row at a time. Optimal is around 100 rows per batch, giving roughly a 10x speedup. IODKU can (usually) be batched, too; see the VALUES() pseudo-function (a sketch follows this list).
BEGIN;...lots of writes...COMMIT; cuts back significantly on transaction overhead.
Using a "staging" table for gathering things up update can have a significant benefit. My blog discussing that. That also covers batch "normalization".
Building Summary Tables on the fly interferes with high speed data ingestion. Another blog covers Summary tables.
Normalization can be used for de-dupping, hence shrinking the disk footprint. This can be important for decreasing I/O for the 'Fact' table in Data Warehousing. (I am referring to your 20 x VARCHAR(50).)
RAID striping is a hardware help.
A battery-backed write cache on a RAID controller makes writes seem instantaneous.
SSDs speed up I/O.
If you provide some more specifics (SHOW CREATE TABLE, SQL, etc), I can be more specific.
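For example, a hypothetical batched IODKU (placeholder table and column names; id is the PRIMARY KEY or a UNIQUE key):
INSERT INTO target (id, attr1, attr2)
VALUES
    (101, 'a', 'x'),
    (102, 'b', 'y'),
    (103, 'c', 'z')
ON DUPLICATE KEY UPDATE
    attr1 = VALUES(attr1),   -- take the value from the incoming row that hit the duplicate
    attr2 = VALUES(attr2);
One statement like this with ~100 value tuples replaces ~100 round trips.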
Do it in the DBMS, and wrap it in a transaction.
To explain:
Load your data into a temporary table in MySQL in the fastest way possible. Bulk load, insert, do whatever works. Look at "load data infile".
Outer-join the temporary table to the target table, and INSERT those rows where the PK column of the target table is NULL.
Outer-join the temporary table to the target table, and UPDATE those rows where the PK column of the target table is NOT NULL.
Wrap steps 2 and 3 in a BEGIN/COMMIT (or START TRANSACTION/COMMIT) pair for a transaction. The default behaviour is probably autocommit, which would mean you're doing a LOT of database work after every insert/update. Use transactions properly, and the work is only done once for each block.
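A hedged sketch of steps 2 and 3 inside that transaction, with placeholder names (staging is the temporary table, target the real table, id the primary key):
START TRANSACTION;
-- step 2: ids not yet in target get inserted
INSERT INTO target (id, attr1, attr2)
SELECT s.id, s.attr1, s.attr2
FROM staging s
LEFT JOIN target t ON t.id = s.id
WHERE t.id IS NULL;
-- step 3: ids that already exist get updated in place
UPDATE target t
JOIN staging s ON s.id = t.id
SET t.attr1 = s.attr1,
    t.attr2 = s.attr2;
COMMIT;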
I have a MySQL database with approx. 1 TB of data. The table fuelinjection_stroke has approx. 1,000,000,000 rows. DBID is the primary key and is automatically incremented by one with each insert.
I am trying to delete the first 1,000,000 rows using a very simple statement:
Delete from fuelinjection_stroke where DBID < 1000000;
This query is taking a very long time (>24 h) on my dedicated 8-core Xeon server (32 GB memory, SAS storage).
Any idea whether the process can be sped up?
I believe your table becomes locked. I've faced the same problem and found that deleting 10k records at a time is pretty fast. So you might want to write a simple script/program which deletes records in chunks.
DELETE FROM fuelinjection_stroke WHERE DBID < 1000000 LIMIT 10000;
And keep executing it until it deletes everything
Are you space deprived? Is down time impossible?
If not, you could fit in a new INT column length 1 and default it to 1 for "active" (or whatever your terminology is) and 0 for "inactive". Actually, you could use 0 through 9 as 10 different states if necessary.
Adding this new column will take a looooooooong time, but once it's over, your UPDATEs should be lightning fast as long as you do it off the PRIMARY (as you do with your DELETE) and you don't index this new column.
The reason InnoDB takes so long to DELETE on such a massive table as yours is the clustered index. It physically orders your table based on your PRIMARY KEY (or the first UNIQUE key it finds, or an internal row ID if it can find neither), so when you pull rows out, it has to reorganize the affected pages of the table on disk. So it's not the DELETE itself that's taking so long; it's the physical reorganization after the rows are removed.
When you create a new INT column with a default value, the space will be filled, so when you UPDATE it, there's no need for physical reordering across your huge table.
I'm not sure exactly what your schema is, but using a column for a row's state is much faster than DELETEing; however, it will take more space.
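A hedged sketch of that idea on the table from the question (the column name active is made up; reads would then filter on active = 1):
-- one-time schema change (slow on a huge table, but done only once)
ALTER TABLE fuelinjection_stroke ADD COLUMN active INT(1) NOT NULL DEFAULT 1;
-- instead of DELETE, flip the flag using the PRIMARY key
UPDATE fuelinjection_stroke SET active = 0 WHERE DBID < 1000000;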
Try setting values:
innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT (for non-windows machine)
innodb_buffer_pool_size=25GB (currently it is close to 21GB)
innodb_doublewrite=0
innodb_support_xa=0
innodb_thread_concurrency=0...1000 (try different values, beginning with 200)
References:
MySQL docs for description of different variables.
MySQL Server Setting Tuning
MySQL Performance Optimization basics
http://bugs.mysql.com/bug.php?id=28382
What indexes do you have?
I think your issue is that the delete is rebuilding the index on every iteration.
I'd drop the indexes if there are any, do the delete, then re-add the indexes. It'll be far faster (I think).
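A hedged sketch of that sequence; the secondary index and column names are hypothetical, and the PRIMARY KEY stays in place because the DELETE's WHERE uses it:
-- drop secondary indexes first
ALTER TABLE fuelinjection_stroke DROP INDEX idx_sensor, DROP INDEX idx_recorded_at;
-- do the big delete, still driven by the primary key
DELETE FROM fuelinjection_stroke WHERE DBID < 1000000;
-- recreate the indexes afterwards
ALTER TABLE fuelinjection_stroke
    ADD INDEX idx_sensor (sensor_id),
    ADD INDEX idx_recorded_at (recorded_at);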
I was having the same problem, and my table has several indices that I didn't want to have to drop and recreate. So I did the following:
create table keepers
select * from origTable where {clause to retrieve rows to preserve};
truncate table origTable;
insert into origTable select null, keepers.col2, ... keepers.col(last) from keepers;
drop table keepers;
About 2.2 million rows were processed in about 3 minutes.
Your database may be checking foreign keys for records that need to be modified (ON DELETE cascades and the like).
But I-Conica's answer is a good point (+1). Deleting a single record and updating a lot of indexes, repeated 100,000 times, is inefficient. Just drop the index, delete all the records, and create it again.
And of course, check whether there is any kind of lock in the database. One user or application can lock a record or table, and your query will wait until the user releases the resource or it reaches a timeout. One way to check whether your database is doing real work or just waiting is to launch the query from a connection that sets the --innodb_lock_wait_timeout parameter to a few seconds. If it fails, at least you know that the query is OK and that you need to find and release that lock. Examples of locks are SELECT * FROM XXX FOR UPDATE and uncommitted transactions.
For tables that long, I'd rather use MyISAM, especially if not many transactions are needed.
I don't know the exact answer to your question, but here is another way to write the delete; please try this:
delete from fuelinjection_stroke where DBID in
(
    select DBID from
    (
        select DBID from fuelinjection_stroke
        order by DBID asc
        limit 1000000
    ) as t   -- derived table works around MySQL's restrictions on LIMIT in IN subqueries and on reusing the target table
);
Just finished rewriting many queries as batch queries - no more DB calls inside of foreach loops!
One of these new batch queries, an INSERT IGNORE into a pivot table, is taking 1-4 seconds each time. It is fairly large (~100 rows per call) and the table is also > 2 million rows.
This is the current bottleneck in my program. Should I consider something like locking the table (never done this before, but I have heard it is ... dangerous), or are there other options I should look at first?
As it is a pivot table, there is a unique key made up of both of the columns I am inserting.
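Simplified, the batch looks roughly like this (hypothetical table and column names; the composite UNIQUE key is what lets INSERT IGNORE skip duplicates):
CREATE TABLE IF NOT EXISTS user_tag (
    user_id INT NOT NULL,
    tag_id  INT NOT NULL,
    UNIQUE KEY uq_user_tag (user_id, tag_id)
);
INSERT IGNORE INTO user_tag (user_id, tag_id)
VALUES (1, 10), (1, 11), (2, 10), (2, 12);   -- ~100 value tuples per call in practice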
Are you using indexes? Indexing the correct columns speeds things up immensely. If you are doing a lot of updating and inserting, sometimes it makes sense to disable indexes until you're finished, since re-indexing takes time. I don't understand how locking the table would help. Is this table in use by other users or applications? That would be the main reason locking would increase speed.