I have a MySQL database (InnoDB, if that matters) and I want to add a lot of rows. I want to do this on a production database so there can be no downtime. Each time (about once a day) I want to add about 1M rows to the database, in batches of 10k (from some tests I ran this seemed to be the optimal batch size to minimize time). While I'm doing these inserts the table needs to be readable. What is the "correct" way to do this? For starters you can assume there are no indexes.
Option A: https://dev.mysql.com/doc/refman/5.7/en/commit.html
SET autocommit = 0;   -- optional here; START TRANSACTION already suspends autocommit
START TRANSACTION;
INSERT INTO my_table (etc etc batch insert);
INSERT INTO my_table (etc etc batch insert);
INSERT INTO my_table (etc etc batch insert);
INSERT INTO my_table (etc etc batch insert);
(more)
COMMIT;
Option B:
CREATE TABLE my_table_temp LIKE my_table;
INSERT INTO my_table_temp SELECT * FROM my_table;   -- copy the existing rows
INSERT INTO my_table_temp (etc etc batch insert);
INSERT INTO my_table_temp (etc etc batch insert);
INSERT INTO my_table_temp (etc etc batch insert);
INSERT INTO my_table_temp (etc etc batch insert);
(more)
RENAME TABLE my_table TO my_table_old;
RENAME TABLE my_table_temp TO my_table;
I've used the second method before and it works. There's only a tiny window where something might go wrong: the time it takes to rename the tables.
But my confusion is: if this were the best solution, then what's the point of START TRANSACTION/COMMIT? Surely that was invented to take care of the thing I'm describing, no?
Bonus question: What if we have indexes? My case is easily adaptable: just disable the indexes on the temp table and re-enable them after the inserts finish, before the rename. What about option A? That seems hard to reconcile with inserting into an indexed table.
then what's the point of START TRANSACTION/COMMIT? Surely that was invented to take care of the thing I'm describing, no?
Yes, exactly. In InnoDB, thanks to its MVCC architecture, writers never block readers. You don't have to worry about bulk inserts blocking readers.
The exception is if you're doing locking reads with SELECT...FOR UPDATE or SELECT...LOCK IN SHARE MODE. Those might conflict with INSERTs, depending on the data you're selecting, and whether it requires gap locks where the new data is being inserted.
Likewise LOAD DATA INFILE does not block non-locking readers of the table.
You might like to see the results I got for bulk loading data in my presentation, Load Data Fast!
There's only a tiny window where something might go wrong: the time it takes to rename the tables.
It's not necessary to do the table-swapping for bulk INSERT, but for what it's worth, if you ever do need to do that, you can do multiple table renames in one statement. The operation is atomic, so there's no chance any concurrent transaction can sneak in between.
RENAME TABLE my_table TO my_table_old, my_table_temp TO my_table;
Re your comments:
what if I have indexes?
Let the indexes be updated incrementally as you do the INSERT or LOAD DATA INFILE. InnoDB will do this while other concurrent reads are using the index.
There is overhead to updating an index during INSERTs, but it's usually preferable to let the INSERT take a little longer instead of disabling the index.
If you disable the index, then all concurrent clients cannot use it. Other queries will slow down. Also, when you re-enable the index, this will lock the table and block other queries while it rebuilds the index. Avoid this.
why do I need to wrap the thing in "START TRANSACTION/COMMIT"?
The primary purpose of a transaction is to group changes that should be committed as one change, so that no other concurrent query sees the change in a partially-complete state. Ideally, you'd do all the INSERTs for your bulk load in one transaction.
The secondary purpose of the transaction is to reduce overhead. If you rely on autocommit instead of explicitly starting and committing, you're still using transactions—but autocommit implicitly starts and commits one transaction for every INSERT statement. The overhead of starting and committing is small, but it adds up if you do it 1 million times.
There's also a practical, physical reason to reduce the number of individual transactions. InnoDB by default does a filesystem sync after each commit, to ensure data is safely stored on disk. This is important to prevent data loss if you have a crash. But a filesystem sync isn't free. You can only do a finite number of syncs per second (this varies based on what type of disk you use). So if you are trying to do 1 million syncs for individual transactions, but your disk can only physically do 100 syncs per second (this is typical for a single non-SSD hard disk), then your bulk load will take a minimum of 10,000 seconds. This is a good reason to group your bulk INSERT into batches.
So for both logical reasons of atomic updates, and physical reasons of being kind to your hardware, use transactions when you have some bulk work to do.
However, I don't want to scare you into using transactions to group things inappropriately. Do commit your work promptly after you do some other type of UPDATE. Leaving a transaction hanging open for an unbounded amount of time is not a good idea either. MySQL can handle the rate of commits of ordinary day-to-day work. I am suggesting batching work when you need to do a bunch of bulk changes in rapid succession.
I think the best way is LOAD DATA INFILE.
I have a delete query which deletes rows in chunks (each chunk 2,000 rows):
DELETE FROM Table1 WHERE last_refresh_time < {time value}
I want to delete the rows in the table which have not been refreshed for the last 5 days.
Usually the delete covers around 10 million rows. This process runs once per day, during a short off-peak window.
The query executes quickly enough on the master, but because of row-based replication the slave lags heavily, since the slave's SQL thread deletes each row one by one from the relay log data.
We use the READ COMMITTED isolation level.
Is it okay to switch this query's transaction alone to statement-based replication? Will we face any issues?
The MySQL manual mentions the following; can someone explain whether other transactions' INSERTs will be affected?
If you are using InnoDB tables and the transaction isolation level is READ COMMITTED or READ UNCOMMITTED, only row-based logging can be used. It is possible to change the logging format to STATEMENT, but doing so at runtime leads very rapidly to errors because InnoDB can no longer perform inserts.
If other transactions' INSERTs are affected, can we change the isolation level to REPEATABLE READ for this DELETE transaction alone? Is that recommended?
Please share your views and suggestions for this lag issue.
MySQL - InnoDB engine - 5.7.18
Don't do a single DELETE that removes 10M rows. Or 1M. Not even 100K.
Do the delete online. Yes, it is possible, and usually preferable.
Write a script that walks through the table 200 rows at a time. DELETE and COMMIT any "old" rows in that 200. Sleep for 1 second, then move on to the next 200. When it hits the end of the table, simply start over. (1K rows per chunk may be OK.) Walk through the table via the PRIMARY KEY so that the effort to 'chunk' is minimized. Note that the 200 rows plus 1-second delay will let you get through the table in about 1 day, effectively as fast as your current code, but with much less interference.
More details: http://mysql.rjweb.org/doc.php/deletebig Note, especially, how it is careful to touch only N rows (N=200 or whatever) of the table per pass.
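A minimal sketch of that loop as a stored procedure, assuming Table1 has an AUTO_INCREMENT PRIMARY KEY column id (the table name, last_refresh_time, and the 5-day cutoff come from the question; everything else is illustrative):
DELIMITER //
CREATE PROCEDURE purge_old_rows()
BEGIN
  DECLARE last_id  BIGINT DEFAULT 0;
  DECLARE upper_id BIGINT;
  DECLARE done     INT DEFAULT 0;
  REPEAT
    -- find the PK value ~200 rows ahead, so each pass touches only that slice
    SELECT MAX(id) INTO upper_id
      FROM (SELECT id FROM Table1 WHERE id > last_id ORDER BY id LIMIT 200) AS chunk;
    IF upper_id IS NULL THEN
      SET done = 1;                -- reached the end of the table; the job can simply start over later
    ELSE
      DELETE FROM Table1
       WHERE id > last_id AND id <= upper_id
         AND last_refresh_time < NOW() - INTERVAL 5 DAY;
      COMMIT;                      -- each chunk is its own small transaction
      SET last_id = upper_id;
      DO SLEEP(1);                 -- give the replica time to catch up before the next chunk
    END IF;
  UNTIL done END REPEAT;
END //
DELIMITER ;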
My suggestion helps avoid replica lag in these ways:
Lower row count per chunk (200 vs 2000). With a big chunk, that many row 'events' get dumped into the replication stream all at once, and other events are stuck behind them.
Touch only 200 rows per pass -- by walking the PRIMARY KEY and careful use of LIMIT, etc.
"Sleep" between chunks -- The Primary primes its cache with an initial SELECT that is not replicated. Hence, with Row Based Replication, the Replica is likely to be caught off guard (the rows to delete have not been cached). The sleep gives it a chance to finish the deletes and handle other replication items before the next chunk arrives.
Discussion: With Row Based Replication (which is preferable), a 10M DELETE will ship 10M 1-row deletes to the Replicas. This clogs replication, delays replication, etc. By breaking it into small chunks, such overhead has a reasonable impact on replication.
Don't worry about isolation mode, etc, simply commit each small chunk. 100 rows will easily be done in less than a second. Probably 1K will be that fast. 10M will certainly not.
You said "refreshed". Does this mean that the processing updates a timestamp in the table? And this happens at 'random' times for 'random' rows? And such an update can happen multiple times for a given row? If that is what you mean, then I do not recommend PARTITIONing, which is also discussed in the link above.
Note that I do not depend on an index on that timestamp, much less suggest partitioning by that timestamp. I want to avoid the overhead of updating such an index so rapidly. Walking through the table via the PK is a very good alternative.
Do you really need the READ COMMITTED isolation level? It is not InnoDB's default, and it provides weaker isolation than REPEATABLE READ.
But anyway:
For this query's session you can change the isolation level to REPEATABLE READ and use MIXED mode for binlog_format.
With that you will get statement-based replication for this session only.
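As a rough sketch of that setup, assuming the session has the privilege to change binlog_format (SUPER, or BINLOG_ADMIN on newer versions):
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET SESSION binlog_format = 'MIXED';
DELETE FROM Table1 WHERE last_refresh_time < NOW() - INTERVAL 5 DAY;
Be aware that under MIXED mode, statements MySQL flags as unsafe (for example data-modifying statements with LIMIT) may still fall back to row-based logging.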
Maybe that table's usage pattern would be a better fit for a NoSQL tool like MongoDB with a TTL index.
For a MySQL database:
I want to insert rows as fast as possible.
Inserts will be executed in a multithreaded way; let's say around 200 threads.
There are two ways I could do it:
1) Use simple INSERT statements, each wrapped in its own transaction.
There is a nice MySQL batch-insert syntax
(INSERT INTO t () VALUES (), (), () ...) but it can't be used, because every single row must be independent in terms of transactions. In other words, if a problem occurs with the operation I want to roll back only the one affected row, not all rows from the batch.
And that brings us to the second way:
2) A single thread can do batch inserts of fake data: totally empty rows except for the auto-incremented IDs. These inserts are so fast that we can practically ignore their cost (about 40 ns/row) compared with single INSERTs.
After the batch insert, the client can read LAST_INSERT_ID() and ROW_COUNT(), i.e. the 'range' of inserted IDs. The next step is to UPDATE those rows by ID, filling in the data we originally wanted to insert. The updates will be executed in a multithreaded way. The end result is the same.
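A rough sketch of that second approach, with a hypothetical table t(id AUTO_INCREMENT PRIMARY KEY, payload):
INSERT INTO t (payload) VALUES (NULL), (NULL), (NULL);   -- placeholder rows; IDs are auto-assigned
SELECT LAST_INSERT_ID() AS first_id, ROW_COUNT() AS n;   -- first ID and size of the allocated 'range'
-- each worker thread then fills in one row by ID, in its own small transaction:
UPDATE t SET payload = 'real data' WHERE id = 42;        -- 42 stands in for an ID from that range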
And now I want to ask: which way will be faster - single inserts, or batch insert + updates.
There are some indexes in the table.
None of the above.
You should be doing batch inserts. If a BatchUpdateException occurs, you can catch it and find out which inserts failed. You can however still commit what you have so far, and then continue from the point the batch failed (this is driver dependent, some drivers will execute all statements and inform you which ones failed).
The answer depends on the major cause of errors and what you want to do with the failed rows. INSERT IGNORE may be sufficient:
INSERT IGNORE . . .
This will ignore errors in the batch but insert the valid rows. It gets tricky if you want to catch the errors and do something about them.
If the errors are caused by duplicate keys (either unique or primary), then ON DUPLICATE KEY UPDATE is probably the best solution.
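For example, with a hypothetical table t(id PRIMARY KEY, col1), a duplicate-tolerant batch might look like:
INSERT INTO t (id, col1)
VALUES (1, 'a'), (2, 'b'), (3, 'c')
ON DUPLICATE KEY UPDATE col1 = VALUES(col1);   -- rows whose key already exists are updated instead of raising an error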
Plan A:
If there are secondary INDEXes, then the batch-insert + lots of updates is likely to be slower because it will need to insert index rows, then change them. OTOH, since secondary index operations are done in the "change buffer" and hence delayed, you may not notice the overhead immediately.
Do not use 200 threads for doing the multi-threaded inserts or updates. For 5.7, 64 might be the limit; for 5.6 it might be 48. YMMV. These numbers come from Oracle bragging about how they improved the multi-threaded aspects of MySQL. Beyond those numbers, throughput flat-lined and latency went through the roof. You should experiment with your situation, not trust those numbers.
Plan B:
If failed rows are rare, then be optimistic. Batch INSERTs, say, 64 at a time. If a failure occurs, redo them in 8 batches of 8. If any of those fail, then degenerate to one at a time. I have no idea what pattern is optimal. (64-8-1 or 64-16-4-1 or 25-5-1 or ...) Anyway it depends on your frequency of failure and number of rows to insert.
However, I will impart this bit of advice... Beyond about 100 rows per batch, you are well into "diminishing returns", so don't bother with large batches that might fail. I have measured that 100 rows/batch gives about 90% of the maximal speed.
Another tip (for any Plan):
innodb_flush_log_at_trx_commit = 2
sync_binlog = 0
Caution: These help with speed (perhaps significantly), but run the risk of lost data in a power failure.
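If you want to try them at runtime before putting them in my.cnf, both are dynamic server variables (setting them requires SUPER or, in 8.0, SYSTEM_VARIABLES_ADMIN):
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
SET GLOBAL sync_binlog = 0;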
Instead of doing ALTER TABLE, I prefer to create a new table, copy the data to it, and then switch over to using it. When doing so in InnoDB I always have a hard time performing:
INSERT INTO new_huge_tbl (SELECT * FROM old_huge_tbl)
Because of the nature of transactions, if at any point I need to stop this operation, the rollback isn't easy, to say the least. Is there any way I can perform this operation in InnoDB without it being one big transaction?
No, it's not possible to avoid the transactional overhead in a simple way. You have perhaps two options:
In your own application, use many smaller transactions (of e.g. 10k rows each) to copy the data in small batches.
Use an existing tool which does the copy for you using the same strategy. I could suggest pt-archiver from the Percona Toolkit.
Internally, when doing table copies for e.g. ALTER TABLE, InnoDB does in fact do exactly that, batching the copy into many smaller transactions.
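For the first of those options, a minimal sketch, assuming old_huge_tbl has an AUTO_INCREMENT primary key id that is copied as-is and new_huge_tbl starts out empty, is to repeat a pair of statements like this until the INSERT copies zero rows:
SET @last_id = 0;
-- repeat until ROW_COUNT() returns 0:
INSERT INTO new_huge_tbl
  SELECT * FROM old_huge_tbl
   WHERE id > @last_id
   ORDER BY id
   LIMIT 10000;
SELECT MAX(id) INTO @last_id FROM new_huge_tbl;   -- advance to the last copied ID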
I need to copy the content of one table to another. So I started using:
INSERT new_table SELECT * FROM old_table
However, I am getting the following error now:
1297, "Got temporary error 233 'Out of operation records in transaction coordinator (increase MaxNoOfConcurrentOperations)' from NDBCLUSTER"
I think I have an understanding why this occurs: My table is huge, and MySQL tries to take a snapshot in time (lock everything and make one large transaction out of it).
However, my data is fairly static and there is no other concurrent session that would modify the data. How can I tell MySQL to copy one row at a time, or in smaller chunks, without locking the whole thing?
Edit note: I already know that I can just read the whole table row-by-row into memory/file/dump and write back. I am interested to know if there is an easy way (maybe setting isolation level?). Note that the engine is InnoDB.
Data Migration is one of the few instances where a CURSOR can make sense, as you say, to ensure that the number of locks stays sane.
Use a cursor in conjunction with a transaction, committing after every row or after every N rows (e.g. using a counter with modulo).
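A rough sketch of that pattern, assuming a source table old_table(id, val) and a session with autocommit disabled (all names are illustrative):
DELIMITER //
CREATE PROCEDURE copy_in_chunks()
BEGIN
  DECLARE done  INT DEFAULT 0;
  DECLARE v_id  BIGINT;
  DECLARE v_val VARCHAR(255);
  DECLARE n     INT DEFAULT 0;
  DECLARE cur CURSOR FOR SELECT id, val FROM old_table;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
  OPEN cur;
  copy_loop: LOOP
    FETCH cur INTO v_id, v_val;
    IF done THEN LEAVE copy_loop; END IF;
    INSERT INTO new_table (id, val) VALUES (v_id, v_val);
    SET n = n + 1;
    IF n MOD 1000 = 0 THEN
      COMMIT;                -- keep each transaction (and its operation count) small
    END IF;
  END LOOP;
  CLOSE cur;
  COMMIT;
END //
DELIMITER ;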
SELECT the data from InnoDB into an outfile and LOAD DATA INFILE into the cluster.
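Sketched out, with illustrative paths and delimiters (the file must be accessible to the server, and secure_file_priv may restrict where it can be written or read):
SELECT *
  INTO OUTFILE '/tmp/old_table.csv'
  FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  FROM old_table;
LOAD DATA INFILE '/tmp/old_table.csv'
  INTO TABLE new_table
  FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  LINES TERMINATED BY '\n';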
Currently we have a database and a script which does 2 UPDATEs, 1 SELECT, and 1 INSERT.
The problem is that we have 20,000 people who run this script every hour, which causes MySQL to run at 100% CPU.
The INSERT is for logging: we want to log all the data to our MySQL server, but as the table scales up, the application becomes slower and slower. We are running on InnoDB, but some people say it should be MyISAM. What should we use? We do sometimes pull data out of this log table for statistical purposes, but only 40-50 times a day.
Our solution is to use Gearman (http://gearman.org/) to defer the INSERT to the database. But what about the updates?
We need to update 2 tables: one for the customer, to update the balance (balance = balance - 1), and the other to update a count in another table.
How should we make this faster and more CPU efficient?
Thank you
but as the table scale up, application become slower and slower
This usually means that you're missing an index somewhere.
MyISAM is not good: in addition to being non-ACID-compliant, it'll lock the whole table to do an insert -- which kills concurrency.
Read the MySQL documentation carefully:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
Especially "innodb_flush_log_at_trx_commit" -
http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html
I would stay away from MyISAM as it has concurrency issues when mixing SELECT and INSERT statements. If you can keep your insert tables small enough to stay in memory, they'll go much faster. Batching your updates in a transaction will help them go faster as well. Setting up a test environment and tuning for your actual job is important.
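For example, grouping the per-run statements into one transaction pays the commit/fsync cost once instead of three or four times (table and column names here are hypothetical):
START TRANSACTION;
UPDATE customers SET balance = balance - 1 WHERE customer_id = 123;
UPDATE counters  SET hits = hits + 1 WHERE counter_id = 45;
INSERT INTO request_log (customer_id, action, created_at) VALUES (123, 'debit', NOW());
COMMIT;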
You may also want to look into partitioning to rotate your logs. You'd drop the old partition and create a new one for the current data. This is much faster than deleting the old rows.
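A sketch of that rotation, assuming the log table were RANGE-partitioned by day (names and dates are illustrative):
ALTER TABLE request_log DROP PARTITION p20240601;   -- discards the oldest day almost instantly
ALTER TABLE request_log ADD PARTITION
  (PARTITION p20240702 VALUES LESS THAN (TO_DAYS('2024-07-03')));
Dropping a partition is a metadata/file operation, which is why it is so much cheaper than DELETEing the same rows.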