Purging data from MySQL tables

I have a cron set up to take a backup of production MySQL tables and am looking to purge data from the tables at regular intervals. I have to delete data across multiple tables referenced by IDs.
Some background: I need to delete about 2 million rows, and my app will be continuously reading/writing to my DB (it shouldn't usually access the rows being deleted, though).
My question is how I should structure my delete query with respect to the following parameters:
Delete in a single bulk query vs. deleting in batches?
Delete across different tables in a single transaction vs. deleting without using any transaction. Will there be any table-level locks if I use deletes in transactions, even if I delete in batches?
I do not have any partitions set up; would fragmentation be an issue?

Assumptions:
Isolation level: REPEATABLE READ -- the default MySQL isolation level.
The delete query you have is based on a range, not the primary key.
Deleting all rows in one transaction:
You will have one very long transaction and a large set of locks. This will increase replication lag; replication lag is bad, and a new DC makes it really bad. Holding that many locks will also reduce your write throughput. (Under the SERIALIZABLE isolation level, even read throughput might suffer.)
Deleting in batches:
Better than deleting everything at once, but because the deletes operate on a range, each delete takes more locks (gap locks and next-key locks). So batched deletes on a range have the same problems, just smaller.
Between deleting everything at once and deleting in batches, batches are preferable.
Another way of doing it (we need to delete rows older than some time):
1. Have a daemon which runs every configured interval and:
i. selects the PKs: select pk from table where purge-time < your-purge-time -- no locks
ii. deletes based on PK, using multiple threads -- row-level locks, small transactions (across tables)
This approach ensures small transactions and only row-level locks (a delete based on the primary key takes only row locks). The query is also simple, so you can re-run it even if only part of the deletes succeeded, and I feel that making these atomic is not a requirement. A sketch of the two steps is below.
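A minimal sketch of those two steps in SQL, assuming a parent table orders (primary key id, purge column created_at) and a child table order_items keyed by order_id; the table and column names, the 30-day cutoff, and the 1,000-row batch size are all illustrative:

-- i. Collect the candidate primary keys; a plain SELECT takes no locks here.
CREATE TABLE purge_ids (id BIGINT PRIMARY KEY);
INSERT INTO purge_ids
  SELECT id FROM orders WHERE created_at < NOW() - INTERVAL 30 DAY;

-- ii. Delete by primary key, one small batch per transaction.
--     The daemon repeats this block until purge_ids is empty.
SET @upper := (SELECT id FROM purge_ids ORDER BY id LIMIT 1 OFFSET 999);
SET @upper := IFNULL(@upper, (SELECT MAX(id) FROM purge_ids));

START TRANSACTION;
  DELETE FROM order_items WHERE order_id IN (SELECT id FROM purge_ids WHERE id <= @upper);
  DELETE FROM orders      WHERE id       IN (SELECT id FROM purge_ids WHERE id <= @upper);
  DELETE FROM purge_ids   WHERE id <= @upper;
COMMIT;

Because every batch is keyed on primary-key values that were collected up front, separate worker threads can take disjoint ID ranges in parallel, as suggested above.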
Or
Reduce your isolation level to READ COMMITTED; then even with batch deletes you should be fine. Under READ COMMITTED, only row locks are taken, even when rows are accessed via a secondary index.
Or
If your model allows it, shard based on time and drop the DB itself :)
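For illustration only, a sketch of what that can look like with one table per time period (all names here are made up); purging then becomes a cheap DROP instead of millions of row deletes:

CREATE TABLE events_2024_06 LIKE events_template;   -- open a new shard for the new period
DROP TABLE IF EXISTS events_2024_01;                -- purging = dropping the expired shard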

Related

MySQL Replication lag in slave due to Delete query - Row Based Replication

I have a delete query which deletes rows in chunks (each chunk 2000 rows):
Delete from Table1 where last_refresh_time < {time value}
Here I want to delete the rows in the table which have not been refreshed for the last 5 days.
Usually the delete will be around 10 million rows. This process runs once per day during a small off-peak window.
This query executes a little faster on the master, but due to row-based replication the slave lags heavily, since the slave's SQL thread deletes each row one by one from the relay log data.
We use the READ_COMMITTED isolation level.
Is it okay to change this query's transaction alone to statement-based replication?
Will we face any issues?
The MySQL documentation mentions the following; can someone explain whether other transactions' INSERTs get affected?
If you are using InnoDB tables and the transaction isolation level is READ COMMITTED or READ UNCOMMITTED, only row-based logging can be used. It is possible to change the logging format to STATEMENT, but doing so at runtime leads very rapidly to errors because InnoDB can no longer perform inserts.
If other transactions' INSERTs get affected, can we change the isolation level to REPEATABLE_READ for this DELETE transaction alone? Is it recommended to do it like this?
Please share your views and suggestions for this lag issue.
MySQL - InnoDB engine - 5.7.18
Don't do a single DELETE that removes 10M rows. Or 1M. Not even 100K.
Do the delete online. Yes, it is possible, and usually preferable.
Write a script that walks through the table 200 rows at a time. DELETE and COMMIT any "old" rows in that 200. Sleep for 1 second, then move on to the next 200. When it hits the end of the table, simply start over. (1K rows in a chunk may be OK.) Walk through the table via the PRIMARY KEY so that the effort to 'chunk' is minimized. Note that the 200 rows plus 1-second delay will let you get through the table in about 1 day, effectively as fast as your current code, but with much less interference.
More details: http://mysql.rjweb.org/doc.php/deletebig -- note, especially, how it is careful to touch only N rows (N = 200 or whatever) of the table per pass.
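A rough sketch of one pass of that loop as a MySQL stored procedure. The table and column names are the ones from the question; the 200-row chunk and 1-second sleep are the values discussed above, and the integer primary key id and 5-day cutoff are assumptions:

DELIMITER $$
CREATE PROCEDURE purge_old_rows()
BEGIN
  DECLARE last_id BIGINT DEFAULT 0;
  DECLARE next_id BIGINT;

  chunk_loop: LOOP
    -- Find the upper bound of the next 200-row chunk by walking the PRIMARY KEY.
    SET next_id = NULL;
    SELECT id INTO next_id
      FROM Table1
     WHERE id > last_id
     ORDER BY id
     LIMIT 1 OFFSET 199;

    IF next_id IS NULL THEN
      -- Fewer than 200 rows left: clean up the tail and stop this pass.
      DELETE FROM Table1
       WHERE id > last_id
         AND last_refresh_time < NOW() - INTERVAL 5 DAY;
      COMMIT;
      LEAVE chunk_loop;
    END IF;

    -- Delete only the "old" rows inside this small PK range, then commit.
    DELETE FROM Table1
     WHERE id > last_id AND id <= next_id
       AND last_refresh_time < NOW() - INTERVAL 5 DAY;
    COMMIT;

    SET last_id = next_id;
    DO SLEEP(1);   -- give the replica time to apply this chunk before the next one
  END LOOP;
END$$
DELIMITER ;

A cron job or daemon would simply CALL purge_old_rows() and start over when a pass finishes; the same loop could just as easily live in an external script.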
My suggestion helps avoid replica lag in these ways:
Lower count (200 vs 2000). With the bigger chunk, that many 'events' get dumped into the replication stream all at once, and other events are stuck behind them.
Touch only 200 rows -- by walking the PRIMARY KEY, careful use of LIMIT, etc.
"Sleep" between chunks -- The Primary primes the cache with an initial SELECT that is not replicated. Hence, in Row Based Replication, the Replica is likely to be caught off guard (rows to delete have not been cached). The Sleep gives it a chance to finish the deletes and handle other replication items before the next chunk comes.
Discussion: With Row Based Replication (which is preferable), a 10M DELETE will ship 10M 1-row deletes to the Replicas. This clogs replication, delays replication, etc. By breaking it into small chunks, such overhead has a reasonable impact on replication.
Don't worry about isolation mode, etc, simply commit each small chunk. 100 rows will easily be done in less than a second. Probably 1K will be that fast. 10M will certainly not.
You said "refreshed". Does this mean that the processing updates a timestamp in the table? And this happens at 'random' times for 'random' rows? And such an update can happen multiple times for a given row? If that is what you mean, then I do not recommend PARTITIONing, which is also discussed in the link above.
Note that I do not depend on an index on that timestamp, much less suggest partitioning by that timestamp. I want to avoid the overhead of updating such an index so rapidly. Walking through the table via the PK is a very good alternative.
Do you really need the READ_COMMITTED isolation level? It is not the standard default, and it weakens the ACID isolation guarantees.
But anyway:
For this query you can change the session isolation level to REPEATABLE_READ and use MIXED mode for binlog_format.
With that you will get statement-based replication for this session only.
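A minimal sketch of doing that for the purge session only (the table and column names are the ones from the question; setting binlog_format at the session level requires the SUPER privilege on 5.7):

SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;   -- applies only to this connection
SET SESSION binlog_format = 'MIXED';                       -- reverts when the session ends
DELETE FROM Table1 WHERE last_refresh_time < NOW() - INTERVAL 5 DAY;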
Maybe that table's usage would be a better fit for a NoSQL tool like MongoDB with a TTL index.

MySQL - Update table rows without locking the rows

I have a requirement where we need to update rows without holding locks on them while updating.
Here are the details of the requirement: we will be running a batch process on a table every 5 minutes: update blogs set is_visible=1 where some conditions. This query has to run over millions of records, so we don't want to block all of those rows for writes during the update.
I totally understand the implications of not having write locks, which is fine for us because the is_visible column will be updated only by this batch process; no other thread will update this column. On the other hand, there will be a lot of updates to other columns of the same table which we don't want to block.
First of all, if you use MySQL's default InnoDB storage engine, then there is no way you can update data without row locks, short of setting the transaction isolation level down to READ UNCOMMITTED by running:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
However, I don't think the database behavior would be what you expect, since dirty reads are allowed in this case. READ UNCOMMITTED is rarely useful in practice.
To complement the answer from @Tim, it is indeed a good idea to have a unique index on the column used in the WHERE clause. However, please note as well that there is no absolute guarantee that the optimizer will actually choose an execution plan that uses the created index. It may or may not work, depending on the case.
For your case, what you could do is split the long transaction into multiple short transactions. Instead of updating millions of rows in one shot, scanning only thousands of rows each time would be better. The X locks are released when each short transaction commits or rolls back, giving concurrent updates the opportunity to go ahead.
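A hedged sketch of such short transactions, batching by the primary key. This assumes blogs has an integer primary key id; the 5,000-row batch size is arbitrary, and <some conditions> stands for the filter from the question:

SET @last_id := 0;

-- Repeat this block, advancing @last_id by the batch size, until it passes MAX(id).
START TRANSACTION;
UPDATE blogs
   SET is_visible = 1
 WHERE id > @last_id
   AND id <= @last_id + 5000
   AND <some conditions>;
COMMIT;                               -- the X locks on this batch are released here
SET @last_id := @last_id + 5000;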
By the way, I assume that your batch has lower priority than the other online processes, thus it could be scheduled out of peak hours to further minimize the impact.
P.S. The IX lock is not on the record itself, but attached to the higher-granularity table object. And even with REPEATABLE READ transaction isolation level, there is no gap lock when the query uses a unique index.
Best practice is to always acquire a specific lock when there is a chance that an update could happen concurrently with other transactions. If your storage engine is MyISAM, then MySQL will lock the entire table during an update, and there isn't much you can do about that. If the storage engine is InnoDB, then it is possible that MySQL would only put an exclusive IX lock on the records targeted by the update, but there are caveats to this being the case. The first thing you would do to try to achieve this would be a SELECT ... FOR UPDATE:
SELECT * FROM blogs WHERE <some conditions> FOR UPDATE;
In order to ensure that InnoDB only locks the records being updated, there needs to be a unique index on the column which appears in the WHERE clause. In the case of your query, assuming id were the column involved, it would have to be a primary key, or else you would need to create a unique index:
CREATE UNIQUE INDEX idx ON blogs (id);
Even with such an index, InnoDB may still apply gap locks on the records in between index values, to ensure that the REPEATABLE READ contract is enforced.
So, you may add an index on the column(s) involved in your WHERE clause to optimize the update on InnoDB.

Can range lock in SQL be acquired in share mode

I have a query such as
Select count(*) from log where num = ?;
If I set the isolation level to serializable, then the range lock will be acquired for the where clause.
My question is: can other transactions also acquire the range lock in shared mode to read the count as above, or is the range lock exclusive, so that all other transactions have to wait until the current transaction commits before executing the same read query?
Background: I am trying to implement a view counter for a heavy-traffic website. To reduce I/O to the database, I created a log table so that every time there is a view, I only write a new row to the log table. Once in a while, I (randomly) decide whether to clear the log table and add the number of rows in the log table to a column of a view-count table. This means I have to be careful with interleaving transactions.
The statements below are relevant only to SQL Server and were made before the OP made clear this was really about MySQL, about which I know nothing. I'm leaving it here since it (and the resulting discussion) might be of some use nevertheless, but it is not a complete, relevant answer to the question.
SELECT statements only ever acquire shared locks, on all isolation levels (unless overridden with a table hint). And shared locks are always compatible with each other (see Lock Compatibility), so there's no problem if other transactions want to acquire shared (range) locks as well. So yes, you can have any number of queries performing SELECT COUNT(*) in parallel and they will never block each other.
This doesn't mean other transactions don't have to wait. In particular, a DELETE query must eventually acquire an exclusive lock, and it will have to wait if the SELECT is holding a shared lock. Normally this is not an issue since the engine releases locks as soon as possible. When it does become an issue, you'll want to look at solutions like snapshot isolation, which uses optimistic concurrency and conflict detection rather than locking. Under that model, a SELECT will never block any other query (save those that want table locks). Of course, this isn't free; the row versioning it uses takes up disk space and I/O.

Dummies guide to locking in innodb

The typical documentation on locking in InnoDB is way too confusing. I think it would be of great value to have a "dummies guide to InnoDB locking".
I will start, and I will gather all responses as a wiki:
The column needs to be indexed before row-level locking applies.
EXAMPLE: delete from mytable where column1=10 will lock up the table unless column1 is indexed.
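For example (the table name mytable is made up):

DELETE FROM mytable WHERE column1 = 10;              -- no index on column1: the scan locks every row it examines
ALTER TABLE mytable ADD INDEX idx_column1 (column1);
DELETE FROM mytable WHERE column1 = 10;              -- with the index: only the matching records are locked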
Here are my notes from working with MySQL support on a recent, strange locking issue (version 5.1.37):
All rows and index entries traversed to get to the rows being changed will be locked. It's covered at:
http://dev.mysql.com/doc/refman/5.1/en/innodb-locks-set.html
"A locking read, an UPDATE, or a DELETE generally set record locks on every index record that is scanned in the processing of the SQL statement. It does not matter whether there are WHERE conditions in the statement that would exclude the row. InnoDB does not remember the exact WHERE condition, but only knows which index ranges were scanned. ... If you have no indexes suitable for your statement and MySQL must scan the entire table to process the statement, every row of the table becomes locked, which in turn blocks all inserts by other users to the table."
That is a MAJOR headache if true.
It is. A workaround that is often helpful is to do:
UPDATE whichevertable SET whatever = something WHERE primarykey IN (SELECT primarykey FROM whichevertable WHERE constraints ORDER BY primarykey);
The inner select doesn't need to take locks, and the update then has less work to do. The ORDER BY clause ensures that the update is done in primary key order to match InnoDB's physical order, the fastest way to do it.
Where large numbers of rows are involved, as in your case, it can be better to store the select result in a temporary table with a flag column added. Then select from the temporary table where the flag is not set to get each batch. Run updates with a limit of say 1000 or 10000 and set the flag for the batch after the update. The limits will keep the amount of locking to a tolerable level while the select work will only have to be done once. Commit after each batch to release the locks.
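A sketch of that work-table approach, keeping the placeholder names from the query above (whichevertable, primarykey, whatever, something, constraints) and adding a made-up work table and a 10,000-row batch size:

-- Do the (lock-free) select work once, into a work table with a flag column.
CREATE TABLE work_batch (pk BIGINT NOT NULL PRIMARY KEY, done TINYINT NOT NULL DEFAULT 0);
INSERT INTO work_batch (pk)
  SELECT primarykey FROM whichevertable WHERE constraints;

-- Repeat until no rows with done = 0 remain; commit after every batch to release locks.
SET @upper := (SELECT pk FROM work_batch WHERE done = 0 ORDER BY pk LIMIT 1 OFFSET 9999);
SET @upper := IFNULL(@upper, (SELECT MAX(pk) FROM work_batch WHERE done = 0));

START TRANSACTION;
  UPDATE whichevertable
     SET whatever = something
   WHERE primarykey IN (SELECT pk FROM work_batch WHERE done = 0 AND pk <= @upper);
  UPDATE work_batch SET done = 1 WHERE done = 0 AND pk <= @upper;
COMMIT;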
You can also speed this work up by doing a select sum of an unindexed column before doing each batch of updates. This will load the data pages into the buffer pool without taking locks. Then the locking will last for a shorter timespan because there won't be any disk reads.
This isn't always practical but when it is it can be very helpful. If you can't do it in batches you can at least try the select first to preload the data, if it's small enough to fit into the buffer pool.
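For example, reusing the batch bounds from the sketch above, a warm-up read like this pulls the needed pages into the buffer pool without taking any locks (some_unindexed_column is illustrative):

SELECT SUM(some_unindexed_column)
  FROM whichevertable
 WHERE primarykey IN (SELECT pk FROM work_batch WHERE done = 0 AND pk <= @upper);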
If possible use the READ COMMITTED transaction isolation mode. See:
http://dev.mysql.com/doc/refman/5.1/en/set-transaction.html
To get that reduced locking requires use of row-based binary logging (rather than the default statement based binary logging).
Two known issues:
Subqueries can be less than ideally optimised sometimes. In this case it was an undesirable dependent subquery - the suggestion I made to use a subquery turned out to be unhelpful compared to the alternative in this case because of that.
Deletes and updates do not have the same range of query plans as select statements so sometimes it's hard to properly optimise them without measuring the results to work out exactly what they are doing.
Both of these are gradually improving. This bug is one example where we've just improved the optimisations available for an update, though the changes are significant and it's still going through QA to be sure it doesn't have any great adverse effects:
http://bugs.mysql.com/bug.php?id=36569

Prevent read when updating the table

In MySQL:
Every minute I empty the table and fill it with new data. I want users not to be able to read data during the fill process; before or after is OK.
How do I achieve this?
Is transaction the way?
Assuming you use a transactional engine (usually InnoDB), clear and refill the table in the same transaction.
Be sure that your readers use the READ COMMITTED or higher transaction isolation level (the default is REPEATABLE READ, which is higher).
That way readers will continue to be able to read the old contents of the table during the update.
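A minimal sketch, assuming an InnoDB table snapshot that is rebuilt from a query on source_data every minute (both names are illustrative). Note the use of DELETE rather than TRUNCATE: TRUNCATE is DDL and commits implicitly, which would defeat the single transaction.

START TRANSACTION;
DELETE FROM snapshot;                        -- clear the old contents
INSERT INTO snapshot (id, val)
SELECT id, val FROM source_data;             -- refill with the new data
COMMIT;                                      -- readers switch to the new contents here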
There are a few things to be careful of:
If the table is so big that it exhausts the rollback area - this is possible if you update the whole of (say) a 1M row table. Of course this is tunable but there are limits
If the transaction fails part way through and gets rolled back - rolling back big transactions is VERY inefficient in InnoDB (it is optimised for commits, not rollbacks)
Be careful of deadlocks and lock wait timeouts, which are more likely if you use big transactions.
You can LOCK your table for the duration of your operation:
http://dev.mysql.com/doc/refman/5.1/en/lock-tables.html
A table lock protects only against inappropriate reads or writes by other sessions. The session holding the lock, even a read lock, can perform table-level operations such as DROP TABLE. Truncate operations are not transaction-safe, so an error occurs if the session attempts one during an active transaction or while holding a table lock.
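A minimal sketch of that approach (snapshot and source_data are illustrative names). Every table touched while the lock is held must be named in the LOCK TABLES statement, and readers simply block until UNLOCK TABLES runs:

LOCK TABLES snapshot WRITE, source_data READ;
DELETE FROM snapshot;
INSERT INTO snapshot (id, val) SELECT id, val FROM source_data;
UNLOCK TABLES;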
I don't know enough about the internal row-versioning mechanisms of MySQL (or indeed, whether there is one), but other databases (Oracle, PostgreSQL, and more recently, SQL Server) have invested a lot of effort into allowing writers not to block readers, insofar as readers have access to the version of the rows that existed immediately before the update/write process started. Once the update is committed, that version of the rows becomes the one made available to all readers, thereby avoiding the bottleneck that the above behaviour in MySQL will introduce.
This policy ensures that table locking is deadlock free. There are, however, other things you need to be aware of about this policy: If you are using a LOW_PRIORITY WRITE lock for a table, it means only that MySQL waits for this particular lock until there are no other sessions that want a READ lock. When the session has gotten the WRITE lock and is waiting to get the lock for the next table in the lock table list, all other sessions wait for the WRITE lock to be released. If this becomes a serious problem with your application, you should consider converting some of your tables to transaction-safe tables.
You can load your data into a shadow table as slowly as you like, then instantly swap the shadow and actual with RENAME TABLE:
truncate table shadow; # make sure it is clean to start with
insert into shadow .....; # lots of inserts etc against shadow table
rename table active to temp, shadow to active, temp to shadow;
truncate table shadow; # throw away the old active data
The rename statement is atomic. The intermediate name "temp" is used so that the names of shadow and active can be swapped in a single statement.
This should work with all storage engines.
Rename table - MySQL Manual