For MySQL:
What's the difference between a DROP PARTITION vs a DELETE WHERE query?
Which to use when?
My Scenario:
It is the simple matter of deleting data older than a month from a few of my tables, at the end of every month. The tables fill at a slow rate of around 5 entries per second.
Pros / Cons
PARTITIONing with InnoDB requires me to disable my FOREIGN KEYs. So, DELETEing seems better for me. What, if any, advantages would PARTITIONing give me? Is it only the query's execution speed, a.k.a. performance? My deletion query would run only once every month so I don't see a problem with execution time.
For what it's worth, dropping a partition is a data definition language statement. It happens quickly.
DELETE is a data manipulation statement. If you use InnoDB (you probably do) it's transactional. What does that mean?
When you issue the statement, for example,
DELETE FROM tbl WHERE start_date < CURDATE() - INTERVAL 1 MONTH
it means that other programs accessing your database will see either all the rows you're deleting (as they were before your DELETE transaction committed) or none of them. The operation is called atomic or indivisible -- it appears to happen all at once.
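A quick sketch of that atomicity, using the same hypothetical tbl table (the explicit transaction only makes the boundary visible; with autocommit, the single DELETE behaves the same way):

START TRANSACTION;
DELETE FROM tbl WHERE start_date < CURDATE() - INTERVAL 1 MONTH;
-- Other sessions still see every one of those rows at this point ...
COMMIT;
-- ... and see none of them once the COMMIT completes.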
If you delete many rows, this can put a big burden on your database server. It has to accumulate a transaction log containing all the deleted rows, then commit the transaction log all at once, possibly locking out other access.
Your question says you must delete about 13 megarows each month. If you do that with just one DELETE statement, you will put a heavy burden on your database. To reduce the burden when deleting unwanted historical rows, do the DELETE operation in chunks. How? You have a couple of choices.
@Akina suggested this: Do the deletion often enough that you don't delete a large number of rows at once, or
Do the deletion in batches of 1000 rows with a statement like this:
DELETE FROM tbl
WHERE start_date < CURDATE() - INTERVAL 1 MONTH
LIMIT 1000;
and repeat the statement until it deletes no rows.
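If you want to script that repetition inside MySQL itself, a minimal sketch might look like this (purge_old_rows is a hypothetical name; each 1000-row DELETE commits on its own under autocommit):

DELIMITER //
CREATE PROCEDURE purge_old_rows()
BEGIN
  REPEAT
    DELETE FROM tbl
    WHERE start_date < CURDATE() - INTERVAL 1 MONTH
    LIMIT 1000;
  UNTIL ROW_COUNT() = 0 END REPEAT;
END//
DELIMITER ;

CALL purge_old_rows();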
"5 entries every second" = about 400K/day or 13M/month
DELETEing 13M rows in a single statement:
Very slow for that many rows. (Not bad for under 1K rows)
Blocks most activity on the table
Builds a very big list of things for potential "rollback" (in case of power failure)
Scheduled DELETE
Why wait for the end of the month? Do up to 1000 every minute; that will keep up with much less overhead. Be sure to have a suitable index, else it won't work efficiently.
Rather than a monthly task, have a separate task that is continually running, deleting up to 200 rows, then moving on to the next table; eventually repeating. (If it is not keeping up, increase the "LIMIT 200"; if it is too invasive, add a SLEEP in the loop.)
Do not use cron or EVENT to schedule the delete. If, for whatever reason, a Delete run fails to finish before the next Delete, the job could become a train wreck. OTOH, a continually-running job needs a "keep-alive" task to restart it if it dies for any unforeseen reason.
DROP PARTITION
Because of how PARTITIONs are implemented as separate 'tables', DROP PARTITION is similar to DROP TABLE.
Very fast, regardless of the number of rows in the partition. (Well, the OS may show a slight sluggishness for huge files.)
Easy to do if using PARTITION BY RANGE(..).
I recommend that the number of partitions be between 20 and 50; adjust the deletion frequency accordingly. (1-month retention --> daily partitions; 3-month retention --> weekly partitions; 1-year retention --> monthly or weekly; etc.)
When partitioning a table, rethink all the indexes. You may be able to improve a few queries by making use of partition pruning. (But don't expect much.)
More info: Partition
PARTITIONing conflicts with FOREIGN KEYS and some UNIQUE keys. This puts a burden on the programmer to worry about (or ignore) the loss of those constraints.
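As a rough sketch of what daily PARTITION BY RANGE maintenance looks like (the table log_example, the DATETIME column dt, and the partition names are all hypothetical; the partition column must appear in every unique key, which is why the PRIMARY KEY includes dt):

CREATE TABLE log_example (
  id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  dt  DATETIME NOT NULL,
  msg VARCHAR(255),
  PRIMARY KEY (id, dt)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(dt)) (
  PARTITION p20240101 VALUES LESS THAN (TO_DAYS('2024-01-02')),
  PARTITION p20240102 VALUES LESS THAN (TO_DAYS('2024-01-03')),
  PARTITION pfuture   VALUES LESS THAN MAXVALUE
);

-- Dropping a whole day of rows is a quick DDL operation:
ALTER TABLE log_example DROP PARTITION p20240101;

-- Splitting the next day's partition out of pfuture:
ALTER TABLE log_example REORGANIZE PARTITION pfuture INTO (
  PARTITION p20240103 VALUES LESS THAN (TO_DAYS('2024-01-04')),
  PARTITION pfuture   VALUES LESS THAN MAXVALUE
);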
Here's my blog on other big-deletion techniques.
Related
I have a delete query which deletes rows in chunks (each chunk 2000 rows):
DELETE FROM Table1 WHERE last_refresh_time < {time value}
Here I want to delete the rows in the table that have not been refreshed in the last 5 days.
Usually the delete removes around 10 million rows. This process runs once per day, during a small off-peak window.
This query executes reasonably fast on the master, but due to ROW_BASED_REPLICATION the slave lags heavily, because the slave's SQL_THREAD deletes each row one by one from the RELAY_LOG data.
We use the READ_COMMITTED isolation level.
Is it okay to change this query's transaction alone to STATEMENT_BASED replication?
Will we face any issues?
The MySQL documentation says the following; can someone explain whether other transactions' INSERTs would be affected?
If you are using InnoDB tables and the transaction isolation level is READ COMMITTED or READ UNCOMMITTED, only row-based logging can be used. It is possible to change the logging format to STATEMENT, but doing so at runtime leads very rapidly to errors because InnoDB can no longer perform inserts
If other transactions' INSERTs are affected, can we change the ISOLATION LEVEL to REPEATABLE_READ for this DELETE transaction alone? Is it recommended to do this?
Please share your views and suggestions on this lag issue.
MySQL - InnoDB engine - 5.7.18
Don't do a single DELETE that removes 10M rows. Or 1M. Not even 100K.
Do the delete online. Yes, it is possible, and usually preferable.
Write a script that walks through the table 200 rows at a time. DELETE and COMMIT any "old" rows in that 200. Sleep for 1 second, then move on to the next 200. When it hits the end of the table, simply start over. (1K rows in a chunk may be OK.) Walk through the table via the PRIMARY KEY so that the effort to 'chunk' is minimized. Note that the 200 rows plus 1-second delay will let you get through the table in about a day, effectively as fast as your current code, but with much less interference.
More details: http://mysql.rjweb.org/doc.php/deletebig Note, especially, how it is careful to touch only N rows (N=200 or whatever) of the table per pass.
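A rough sketch of that walk as a stored procedure, assuming Table1 has an integer PRIMARY KEY named id and using the 5-day cutoff from the question:

DELIMITER //
CREATE PROCEDURE purge_table1()
BEGIN
  DECLARE low BIGINT DEFAULT 0;
  DECLARE high BIGINT;
  DECLARE maxid BIGINT;

  SELECT MAX(id) INTO maxid FROM Table1;
  WHILE low <= maxid DO
    -- Find the id that ends this 200-row chunk (walking the PRIMARY KEY; not replicated).
    SET high = NULL;
    SELECT id INTO high FROM Table1
    WHERE id >= low ORDER BY id LIMIT 1 OFFSET 199;
    IF high IS NULL THEN SET high = maxid; END IF;

    -- Delete only the "old" rows inside this chunk; with autocommit on, it commits by itself.
    DELETE FROM Table1
    WHERE id BETWEEN low AND high
      AND last_refresh_time < NOW() - INTERVAL 5 DAY;

    SET low = high + 1;
    DO SLEEP(1);  -- give the replica a chance to catch up
  END WHILE;
END//
DELIMITER ;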
My suggestion helps avoid replica lag in these ways:
Lower count (200 vs 2000). Otherwise that many 'events' are dumped into the replication stream all at once, and other events get stuck behind them.
Touch only 200 rows at a time -- by using the PRIMARY KEY, careful use of LIMIT, etc.
"Sleep" between chunks -- The Primary primes the cache with an initial SELECT that is not replicated. Hence, in Row Based Replication, the Replica is likely to be caught off guard (rows to delete have not been cached). The Sleep gives it a chance to finish the deletes and handle other replication items before the next chunk comes.
Discussion: With Row Based Replication (which is preferable), a 10M DELETE will ship 10M 1-row deletes to the Replicas. This clogs replication, delays replication, etc. By breaking it into small chunks, such overhead has a reasonable impact on replication.
Don't worry about isolation mode, etc, simply commit each small chunk. 100 rows will easily be done in less than a second. Probably 1K will be that fast. 10M will certainly not.
You said "refreshed". Does this mean that the processing updates a timestamp in the table? And this happens at 'random' times for 'random' rows? And such an update can happen multiple times for a given row? If that is what you mean, then I do not recommend PARTITIONing, which is also discussed in the link above.
Note that I do not depend on an index on that timestamp, much less suggest partitioning by that timestamp. I want to avoid the overhead of updating such an index so rapidly. Walking through the table via the PK is a very good alternative.
Do you really need the READ_COMMITTED isolation level? It is not the default, and it gives you weaker isolation guarantees.
But anyway:
For this query you can change the session isolation level to REPEATABLE_READ and use MIXED mode for binlog_format.
With that you will get statement-based replication for this session only.
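A hedged sketch of those two session-level changes (changing binlog_format per session needs the SUPER privilege on 5.7, and id is assumed to be the table's primary key; in MIXED mode the server still falls back to row logging for statements it considers non-deterministic, which is why the ORDER BY is there):

SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET SESSION binlog_format = 'MIXED';

DELETE FROM Table1
WHERE last_refresh_time < NOW() - INTERVAL 5 DAY
ORDER BY id
LIMIT 2000;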
Maybe that table's usage pattern would be a better fit for a NoSQL tool like MongoDB with a TTL index.
We recently switched our tables to use InnoDB (from MyISAM) specifically so we could take advantage of the ability to make updates to our database while still allowing SELECT queries to occur (i.e. by not locking the entire table for each INSERT).
We have a cycle that runs weekly and INSERTs approximately 100 million rows using "INSERT INTO ... ON DUPLICATE KEY UPDATE ...".
We are fairly pleased with the current update performance of around 2000 insert/updates per second.
However, while this process is running, we have observed that regular queries take a very long time.
For example, this took about 5 minutes to execute:
SELECT itemid FROM items WHERE itemid = 950768
(When the INSERTs are not happening, the above query takes several milliseconds.)
Is there any way to force SELECT queries to take a higher priority? Otherwise, are there any parameters that I could change in the MySQL configuration that would improve the performance?
We would ideally perform these updates when traffic is low, but anything more than a couple seconds per SELECT query would seem to defeat the purpose of being able to simultaneously update and read from the database. I am looking for any suggestions.
We are using Amazon's RDS as our MySQL server.
Thanks!
I imagine you have already solved this nearly a year later :) but I thought I would chime in. According to MySQL's documentation on internal locking (as opposed to explicit, user-initiated locking):
Table updates are given higher priority than table retrievals. Therefore, when a lock is released, the lock is made available to the requests in the write lock queue and then to the requests in the read lock queue. This ensures that updates to a table are not “starved” even if there is heavy SELECT activity for the table. However, if you have many updates for a table, SELECT statements wait until there are no more updates.
So it sounds like your SELECT is getting queued up until your inserts/updates finish (or at least there's a pause.) Information on altering that priority can be found on MySQL's Table Locking Issues page.
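For what it's worth, that write-over-read priority applies to table-level-locking engines such as MyISAM; if that is what is actually queuing your SELECTs, the Table Locking Issues page describes mitigations roughly along these lines (a sketch; the price column is made up):

-- Give the bulk writes lower priority than waiting readers (table-level-locking engines only):
INSERT LOW_PRIORITY INTO items (itemid, price)
VALUES (950768, 9.99)
ON DUPLICATE KEY UPDATE price = VALUES(price);

-- Or cap how many write locks may be granted ahead of pending reads:
SET GLOBAL max_write_lock_count = 100;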
I am trying to delete more than 2 million records from a table with a single MySQL query (no joins). The table has around 80 million records.
I used SET autocommit=0; and it is taking a long time to complete. Will this be faster if I run the query with autocommit=1?
I'm assuming your table is InnoDB. For those 2 million rows, it needs to keep track of the undo log entries for each modification. These build up in memory and will eventually spill to disk; that's why it's taking a long time. If you do it in chunks, that prevents the spill to disk and lets MySQL track fewer undo log entries at a time, making things more efficient.
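If you want to watch that undo history grow while the DELETE runs, one way (a sketch; the counter is available in 5.6+) is to check the "History list length" line of SHOW ENGINE INNODB STATUS, or query the metrics table:

SELECT name, `count`
FROM information_schema.INNODB_METRICS
WHERE name = 'trx_rseg_history_len';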
The autocommit happens at the end of your query so it wouldn't do anything.
The best way to figure out what your chunk size should be is by experimenting. Something like
DELETE FROM table1 LIMIT 1000;
Then keep doubling it until you come up with the best rows-deleted per time ratio.
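A sketch of that experiment (add your actual WHERE condition back into each statement; the idea is to compare rows deleted against the time each one takes):

DELETE FROM table1 LIMIT 1000;  -- note the execution time the client reports
DELETE FROM table1 LIMIT 2000;
DELETE FROM table1 LIMIT 4000;
-- Stop doubling once rows-deleted per second stops improving.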
I am assuming you are trying to run 2 million individual delete statements.
If you try bulk deletes using the primary key or ranges to delete 100-1000 records at a time it will be much much faster.
Examples:
DELETE FROM Table WHERE ID > 0 AND ID < 1000;
OR
DELETE FROM Table WHERE ID IN (1,2,3,4,5 .... 1000);
You can adjust the number of records to delete to your liking, increasing it quite a bit if you want. On high load production servers I usually run scripts with smaller ranges like this maybe 100 times before sleeping for a bit and then continuing with another loop.
I always have autocommit turned on for this type of thing. Managing a transaction to delete 2 million records would add a lot of overhead.
Also, please ensure the column you use for the bulk/range deleting is either the primary key or has an index on it.
It won't be any faster if you change the value of the autocommit variable. MySQL always builds the old image even if autocommit is true, because if the user interrupts the query it must have the old image to roll back.
Currently we have a database and a script which does 2 UPDATEs, 1 SELECT, and 1 INSERT.
The problem is that we have 20,000 people who run this script every hour, which causes MySQL to run at 100% CPU.
The INSERT is for logging; we want to log all the data to our MySQL server, but as the table scales up the application becomes slower and slower. We are running on InnoDB, but some people say it should be MyISAM. What should we use? We do sometimes pull data out of this log table for statistical purposes, but only 40-50 times a day.
Our solution is to use Gearman [http://gearman.org/] to defer the inserts to the database. But what about the updates?
We need to update 2 tables: one is the customer table, to update the balance (balance = balance - 1), and the other is to update a count in another table.
How should we make this faster and more CPU efficient?
Thank you
but as the table scales up the application becomes slower and slower
This usually means that you're missing an index somewhere.
MyISAM is not good: in addition to being non-ACID-compliant, it'll lock the whole table to do an insert -- which kills concurrency.
Read the MySQL documentation carefully:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
Especially "innodb_flush_log_at_trx_commit" -
http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html
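For example, a hedged sketch of relaxing the per-commit flush (the trade-off is that an OS crash can lose up to roughly a second of committed transactions):

-- 2 = write the redo log at each commit, but flush it to disk only about once per second.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;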
I would stay away from MyISAM as it has concurrency issues when mixing SELECT and INSERT statements. If you can keep your insert tables small enough to stay in memory, they'll go much faster. Batching your updates in a transaction will help them go faster as well. Setting up a test environment and tuning for your actual job is important.
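For instance, a sketch of batching the two updates described in the question into one transaction (the table and column names are assumptions):

START TRANSACTION;
UPDATE customers SET balance = balance - 1 WHERE customer_id = 42;
UPDATE counters  SET cnt = cnt + 1 WHERE counter_id = 7;
COMMIT;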
You may also want to look into partitioning to rotate your logs. You'd drop the old partition and create a new one for the current data. This is much faster than deleting the old rows.
I have a table in MySQL 5 (InnoDB) that is used as a daemon processing queue, so it is accessed very often. It is typical to have around 250,000 records inserted per day. When I select records to be processed, they are read using a FOR UPDATE query to eliminate race conditions (everything is transaction-based).
Now I am developing a "queue archive" and I have stumbled into a serious deadlock problem. I need to delete "executed" records from the table as they are being processed (live), yet the table deadlocks every once in a while if I do so (two to three times per day).
I thought of moving towards delayed deletion (once per day at low-load times), but this will not eliminate the problem, only make it less obvious.
Is there a common-practice in dealing with high-load tables in MySQL?
InnoDB locks all rows it examines, not only those requested.
See this question for more details.
You need to create an index that would exactly match your search condition to get rid of unnecessary locks, and make sure it is used.
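For example, if the daemon picks up work with something like the hypothetical query below, a composite index that matches the WHERE and ORDER BY keeps InnoDB from locking rows it merely scanned past (queue, status, and id are made-up names):

-- Hypothetical worker query:
-- SELECT * FROM queue WHERE status = 'pending' ORDER BY id LIMIT 10 FOR UPDATE;
ALTER TABLE queue ADD INDEX idx_status_id (status, id);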
Unfortunately, DML queries in MySQL do not accept hints.