I'm using MySQL and want to insert rows as fast as possible.
The inserts will run in multiple threads, say about 200.
I see two ways to do it:
1) Use a simple INSERT command, with each INSERT wrapped in its own transaction.
MySQL has a nice batch insert solution
(INSERT INTO t() VALUES (),(),()...) but I can't use it, because every single row must be independent in terms of transactions. In other words, if an operation runs into problems, I want to roll back only the one affected row, not all rows in the batch.
That brings us to the second way:
2) A single thread can batch-insert placeholder rows that are completely empty except for their auto-incremented IDs. These inserts are so fast (about 40 nanoseconds/row) that their cost can be ignored compared with single inserts.
After the batch insert, the client can get LAST_INSERT_ID and ROW_COUNT, i.e. the 'range' of inserted IDs. The next step is to UPDATE each row with the data we originally wanted to insert, using an ID from that 'range'. The updates will run in multiple threads. The result will be the same.
So my question is: which way will be faster, single inserts or batch insert + updates?
There are some indexes in the table.
None of the above.
You should be doing batch inserts. If a BatchUpdateException occurs, you can catch it and find out which inserts failed. You can however still commit what you have so far, and then continue from the point the batch failed (this is driver dependent, some drivers will execute all statements and inform you which ones failed).
The answer depends on the major cause of errors and what you want to do with the failed transactions. INSERT IGNORE may be sufficient:
INSERT IGNORE . . .
This will ignore errors in the batch but insert the valid data. This is tricky if you want to catch the errors and do something about them.
If the errors are caused by duplicate keys (either unique or primary), then ON DUPLICATE KEY UPDATE is probably the best solution.
Plan A:
If there are secondary INDEXes, then the batch-insert + lots of updates is likely to be slower because it will need to insert index rows, then change them. OTOH, since secondary index operations are done in the "Change buffer", hence delayed, you may not notice the overhead immediately.
Do not use 200 threads for doing the multi-threaded inserts or updates. For 5.7, 64 might be the limit; for 5.6 it might be 48. YMMV. These numbers come from Oracle bragging about how they improved the multi-threaded aspects of MySQL. Beyond those numbers, throughput flat-lined and latency went through the roof. You should experiment with your situation, not trust those numbers.
Plan B:
If failed rows are rare, then be optimistic. Batch INSERTs, say, 64 at a time. If a failure occurs, redo them in 8 batches of 8. If any of those fail, then degenerate to one at a time. I have no idea what pattern is optimal. (64-8-1 or 64-16-4-1 or 25-5-1 or ...) Anyway it depends on your frequency of failure and number of rows to insert.
However, I will impart this bit of advice... Beyond 100 rows per batch, you are well into "diminishing returns", so don't bother with a large batch that might fail. I have measured that 100/batch is about 90% of the maximal speed.
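The optimistic fallback in Plan B can be sketched roughly as follows. This is a minimal sketch, not a tuned implementation: `insert_batch` is a hypothetical callable (not part of any MySQL driver) that is assumed to insert and commit one batch, and to roll back and raise an exception if any row in it is rejected. The default sizes 64-8-1 are just the first pattern mentioned above.

```python
from typing import Callable, List, Sequence, Tuple


def insert_with_fallback(
    rows: Sequence,
    insert_batch: Callable[[Sequence], None],
    sizes: Tuple[int, ...] = (64, 8, 1),
) -> List:
    """Insert rows optimistically; on failure, retry in smaller batches.

    insert_batch is a hypothetical callable that inserts and commits one
    batch, and raises (after rolling back) if any row is rejected.
    Returns the rows that still failed at the smallest batch size.
    """
    failed: List = []

    def attempt(chunk: Sequence, level: int) -> None:
        try:
            insert_batch(chunk)
        except Exception:
            if level + 1 >= len(sizes):
                # Already at the smallest size: give up on this chunk.
                failed.extend(chunk)
                return
            step = sizes[level + 1]
            # Redo the failed chunk in smaller pieces.
            for i in range(0, len(chunk), step):
                attempt(chunk[i:i + step], level + 1)

    for i in range(0, len(rows), sizes[0]):
        attempt(rows[i:i + sizes[0]], 0)
    return failed
```

With failures being rare, almost every batch of 64 goes through on the first attempt, and only the batches containing a bad row pay for the extra round trips.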
Another tip (for any Plan):
innodb_flush_log_at_trx_commit = 2
sync_binlog = 0
Caution: These help with speed (perhaps significantly), but run the risk of lost data in a power failure.
I have a MySQL database (InnoDB, if that matters) and I want to add a lot of rows. I want to do this on a production database so there can be no downtime. Each time (about once a day) I want to add about 1M rows to the database, in batches of 10k (from some tests I ran this seemed to be the optimal batch size to minimize time). While I'm doing these inserts the table needs to be readable. What is the "correct" way to do this? For starters you can assume there are no indexes.
Option A: https://dev.mysql.com/doc/refman/5.7/en/commit.html
START TRANSACTION;
INSERT INTO my_table (etc etc batch insert);
INSERT INTO my_table (etc etc batch insert);
INSERT INTO my_table (etc etc batch insert);
INSERT INTO my_table (etc etc batch insert);
(more)
COMMIT;
SET autocommit = 0;
Option B
copy my_table into my_table_temp
INSERT INTO my_table_temp (etc etc batch insert);
INSERT INTO my_table_temp (etc etc batch insert);
INSERT INTO my_table_temp (etc etc batch insert);
INSERT INTO my_table_temp (etc etc batch insert);
(more)
RENAME TABLE my_table TO my_table_old;
RENAME TABLE my_table_temp TO my_table;
I've used the second method before and it works. There's only a tiny amount of time where something might be wrong which is the time it takes to rename the tables.
But my confusion is: if this were the best solution, then what's the point of START TRANSACTION/COMMIT? Surely that was invented to take care of the thing I'm describing, no?
Bonus question: What if we have indexes? My case is easily adaptable: just turn off the indexes in the temp table and turn them back on after the inserts have finished and before the rename. What about option A? It seems hard to reconcile with doing inserts with indexes.
then what's the point of START TRANSACTION/COMMIT? Surely that was invented to take care of the thing I'm describing, no?
Yes, exactly. In InnoDB, thanks to its MVCC architecture, writers never block readers. You don't have to worry about bulk inserts blocking readers.
The exception is if you're doing locking reads with SELECT...FOR UPDATE or SELECT...LOCK IN SHARE MODE. Those might conflict with INSERTs, depending on the data you're selecting, and whether it requires gap locks where the new data is being inserted.
Likewise LOAD DATA INFILE does not block non-locking readers of the table.
You might like to see the results I got for bulk loading data in my presentation, Load Data Fast!
There's only a tiny amount of time where something might be wrong which is the time it takes to rename the tables.
It's not necessary to do the table-swapping for bulk INSERT, but for what it's worth, if you ever do need to do that, you can do multiple table renames in one statement. The operation is atomic, so there's no chance any concurrent transaction can sneak in between.
RENAME TABLE my_table TO my_table_old, my_table_temp TO my_table;
Re your comments:
what if I have indexes?
Let the indexes be updated incrementally as you do the INSERT or LOAD DATA INFILE. InnoDB will do this while other concurrent reads are using the index.
There is overhead to updating an index during INSERTs, but it's usually preferable to let the INSERT take a little longer instead of disabling the index.
If you disable the index, then all concurrent clients cannot use it. Other queries will slow down. Also, when you re-enable the index, this will lock the table and block other queries while it rebuilds the index. Avoid this.
why do I need to wrap the thing in "START TRANSACTION/COMMIT"?
The primary purpose of a transaction is to group changes that should be committed as one change, so that no other concurrent query sees the change in a partially-complete state. Ideally, we'd do all your INSERTs for your bulk-load in one transaction.
The secondary purpose of the transaction is to reduce overhead. If you rely on autocommit instead of explicitly starting and committing, you're still using transactions—but autocommit implicitly starts and commits one transaction for every INSERT statement. The overhead of starting and committing is small, but it adds up if you do it 1 million times.
There's also a practical, physical reason to reduce the number of individual transactions. InnoDB by default does a filesystem sync after each commit, to ensure data is safely stored on disk. This is important to prevent data loss if you have a crash. But a filesystem sync isn't free. You can only do a finite number of syncs per second (this varies based on what type of disk you use). So if you are trying to do 1 million syncs for individual transactions, but your disk can only physically do 100 syncs per second (this is typical for a single hard disk of the non-SSD type), then your bulk load will take a minimum of 10,000 seconds. This is a good reason to group your bulk INSERT into batches.
So for both logical reasons of atomic updates, and physical reasons of being kind to your hardware, use transactions when you have some bulk work to do.
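To make the sync arithmetic concrete, here is a small back-of-the-envelope helper. It only counts one sync per commit and ignores every other cost, so treat the result as a lower bound. The function name is made up for illustration; the 100 syncs per second and the 10k batch size are the figures used in this thread.

```python
def min_load_seconds(total_rows: int, rows_per_commit: int,
                     syncs_per_second: float) -> float:
    """Lower bound on load time if every commit costs one filesystem sync."""
    commits = -(-total_rows // rows_per_commit)  # ceiling division
    return commits / syncs_per_second


# One commit per row on a disk that manages 100 syncs per second:
print(min_load_seconds(1_000_000, 1, 100))       # 10000.0
# Same data committed in batches of 10,000 rows:
print(min_load_seconds(1_000_000, 10_000, 100))  # 1.0
```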
However, I don't want to scare you into using transactions to group things inappropriately. Do commit your work promptly after you do some other type of UPDATE. Leaving a transaction hanging open for an unbounded amount of time is not a good idea either. MySQL can handle the rate of commits of ordinary day-to-day work. I am suggesting batching work when you need to do a bunch of bulk changes in rapid succession.
I think that the best way is LOAD DATA INFILE
I have a MySQL table that keeps gaining new records every 5 seconds.
The questions are:
Can I run a query on this data set that may take more than 5 seconds?
If a SELECT statement takes more than 5s, will it affect the scheduled INSERT statement?
What happens when an INSERT statement is invoked while a SELECT is still running? Will the SELECT get the newly inserted records?
I'll go over your questions and some of the comments you added later.
Can I run a query on this data set that may take more than 5 seconds?
Can you? Yes. Should you? It depends. In a MySQL configuration I set up, any query taking longer than 3 seconds was considered slow and logged accordingly. In addition, you need to keep in mind the frequency of the queries you intend to run.
For example, if you try to run a 10 second query every 3 seconds, you can probably see how things won't end well. If you run a 10 second query every few hours or so, then it becomes more tolerable for the system.
That being said, slow queries can often benefit from optimizations, such as not scanning the entire table (e.g. searching by primary key), and using the EXPLAIN keyword to get the database's query planner to tell you how it intends to execute the query internally (e.g. is it using PKs, FKs, indices, or is it scanning all table rows?).
If a SELECT statement takes more than 5s, will it affect the scheduled INSERT statement?
"Affect" in what way? If you mean "prevent insert from actually inserting until the select has completed", that depends on the storage engine. MyISAM and InnoDB, for example, differ in their locking policies: MyISAM tends to lock entire tables while InnoDB tends to lock specific rows. InnoDB is also ACID-compliant, which means it can provide certain integrity guarantees. You should read the docs on this for more details.
What happens when an INSERT statement is invoked while a SELECT is still running? Will the SELECT get the newly inserted records?
Part of "what happens" is determined by how the specific storage engine behaves. Regardless of what happens, the database is designed to answer application queries in a way that's consistent.
As an example, if the select statement were to lock an entire table, then the insert statement would have to wait until the select has completed and the lock has been released, meaning that the app would see the results prior to the insert's update.
I understand that locking database can prevent messing up the SELECT statement.
It can also put a potentially unacceptable performance bottleneck, especially if, as you say, the system is inserting lots of rows every 5 seconds, and depending on the frequency with which you're running your queries, and how efficiently they've been built, etc.
what is the good practice to do when I need the data for calculations while those data will be updated within short period?
My recommendation is to simply accept the fact that the calculations are based on a snapshot of the data at the specific point in time the calculation was requested and to let the database do its job of ensuring the consistency and integrity of said data. When the app requests data, it should trust that the database has done its best to provide the most up-to-date piece of consistent information (i.e. not providing a row where some columns have been updated, but others yet haven't).
With new rows coming in at the frequency you mentioned, reasonable users will understand that the results they're seeing are based on data available at the time of request.
All of your questions are related to table locking.
The answers depend on how the database is configured.
Read : http://www.mysqltutorial.org/mysql-table-locking/
Performing a SELECT statement while an INSERT statement is running
If you want to run a SELECT statement while an INSERT is in progress, you should open a new connection for each check and close it afterwards. For example, if I am inserting lots of records and want to find out with a SELECT query whether the last record has been inserted, I have to open and close the connection inside a for loop or while loop.
# send a request to store data
insert statement working  # takes a long time
# select statement in a while loop
while True:
    cnx.open()
    select statement
    cnx.close()
    # break out of the while loop once you get the result
I'm running an ETL process and streaming data into a MySQL table.
Now it is being written over a web connection (fairly fast one) -- so that can be a bottleneck.
Anyway, it's a basic insert/ update function. It's a list of IDs as the primary key/ index .... and then a few attributes.
If a new ID is found, insert, otherwise, update ... you get the idea.
Currently doing an "update, else insert" function based on the ID (indexed) is taking 13 rows/ second (which seems pretty abysmal, right?). This is comparing 1000 rows to a database of 250k records, for context.
Doing a "pure" insert-everything approach, for comparison, already speeds the process up to 26 rows/second.
The thing with the pure "insert" approach is that I can have 20 parallel connections "inserting" at once ... (20 is max allowed by web host) ... whereas any "update" function cannot have any parallels running.
Thus 26 x 20 = 520 r/s. That is far more than 13 r/s, especially if I can rig something up that allows even more data to be pushed through in parallel.
My question is ... given the massive benefit of inserting vs. updating, is there a way to duplicate the 'update' functionality (I only want the most recent insert of a given ID to survive) .... by doing a massive insert, then running a delete function after the fact, that deletes duplicate IDs that aren't the 'newest' ?
Is this something easy to implement, or something that comes up often?
What else I can do to ensure this update process is faster? I know getting rid of the 'web connection' between the ETL tool and DB is a start, but what else? This seems like it would be a fairly common problem.
Ultimately there are 20 columns, max of probably varchar(50) ... should I be getting a lot more than 13 rows processed/ second?
There are many possible 'answers' to your questions.
13/second -- a lot that can be done...
INSERT ... ON DUPLICATE KEY UPDATE ... ('IODKU') is usually the best way to do "update, else insert" (unless I don't know what you mean by it).
Batched inserts are much faster than inserting one row at a time. Optimal is around 100 rows, giving a 10x speedup. IODKU can (usually) be batched, too; see the VALUES() pseudo function.
BEGIN;...lots of writes...COMMIT; cuts back significantly on the overhead for transaction.
Using a "staging" table to gather rows up before applying them can have a significant benefit. My blog discusses that; it also covers batch "normalization".
Building Summary Tables on the fly interferes with high speed data ingestion. Another blog covers Summary tables.
Normalization can be used for de-dupping, hence shrinking the disk footprint. This can be important for decreasing I/O for the 'Fact' table in Data Warehousing. (I am referring to your 20 x VARCHAR(50).)
RAID striping is a hardware help.
A Battery-Backed Write Cache on a RAID controller makes writes seem instantaneous.
SSDs speed up I/O.
If you provide some more specifics (SHOW CREATE TABLE, SQL, etc), I can be more specific.
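As a rough illustration of the batched IODKU mentioned above, the sketch below builds one multi-row statement that uses the VALUES() pseudo function. The helper name and the table and column names in the usage line are hypothetical. Identifiers are assumed to be trusted (they are interpolated directly), while row values are returned separately for the %s placeholders that Python MySQL drivers generally accept.

```python
from typing import List, Sequence, Tuple


def build_iodku(table: str, columns: Sequence[str], key: str,
                rows: Sequence[Sequence]) -> Tuple[str, List]:
    """Build one multi-row INSERT ... ON DUPLICATE KEY UPDATE statement.

    table, columns and key are assumed to be trusted identifiers.
    Returns the SQL text and a flat parameter list for the placeholders.
    """
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    values_clause = ", ".join([row_placeholder] * len(rows))
    # On a duplicate key, overwrite every non-key column with the new value.
    updates = ", ".join(f"{c} = VALUES({c})" for c in columns if c != key)
    sql = (f"INSERT INTO {table} ({', '.join(columns)}) "
           f"VALUES {values_clause} "
           f"ON DUPLICATE KEY UPDATE {updates}")
    params = [value for row in rows for value in row]
    return sql, params


# Hypothetical table and columns, two rows in one statement:
sql, params = build_iodku("items", ["id", "name", "price"], "id",
                          [(1, "a", 10), (2, "b", 20)])
```

The resulting pair would be passed to the driver's execute call inside a BEGIN ... COMMIT block. Newer MySQL 8.0 releases also offer a row-alias form in place of VALUES(), so check which one your server version prefers.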
Do it in the DBMS, and wrap it in a transaction.
To explain:
1) Load your data into a temporary table in MySQL in the fastest way possible. Bulk load, insert, do whatever works. Look at "load data infile".
2) Outer-join the temporary table to the target table, and INSERT those rows where the PK column of the target table is NULL.
3) Outer-join the temporary table to the target table, and UPDATE those rows where the PK column of the target table is NOT NULL.
4) Wrap steps 2 and 3 in a begin/commit (or start transaction/commit) pair for a transaction. The default behaviour is probably autocommit, which will mean you're doing a LOT of database work after every insert/update. Use transactions properly, and the work is only done once for each block.
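Steps 2 and 3 can be tried out with a small runnable sketch. To keep it self-contained it uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption of this example and not part of the answer. The table and column names are made up, and the UPDATE is written with a correlated subquery instead of a join so that it works in SQLite. In MySQL the same two statements would sit between START TRANSACTION and COMMIT.

```python
import sqlite3


def merge_staging(conn: sqlite3.Connection) -> None:
    """Apply rows from staging to target inside a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        # Step 2: insert staging rows whose id is missing from target.
        conn.execute(
            "INSERT INTO target (id, val) "
            "SELECT s.id, s.val FROM staging s "
            "LEFT JOIN target t ON t.id = s.id "
            "WHERE t.id IS NULL"
        )
        # Step 3: update target rows that also exist in staging.
        conn.execute(
            "UPDATE target SET val = "
            "(SELECT s.val FROM staging s WHERE s.id = target.id) "
            "WHERE id IN (SELECT id FROM staging)"
        )


conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT);"
    "CREATE TABLE staging (id INTEGER PRIMARY KEY, val TEXT);"
    "INSERT INTO target VALUES (1, 'old'), (2, 'keep');"
    "INSERT INTO staging VALUES (1, 'new'), (3, 'added');"
)
merge_staging(conn)
rows = conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()
```

After the merge, id 1 carries the staged value, id 2 is untouched, and id 3 has been added.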
I don't have a testing environment for this yet. But before I think too much about solutions I'd like to know if people think this would be a problem.
I will have 10-20 java processes connected to a MySql db via JDBC. Each will be inserting unique records, all going to the same table. The rate of inserts will be on the order of 1000's per second.
My expectation is that some process will attempt to insert and encounter a table lock while another process is inserting, and that this will result in a JDBC exception and a failed insert.
Clearly if you increase the insert rate sufficiently there eventually will be a point where some buffer somewhere fills up faster than it can be emptied. When such a buffer hits its maximum capacity and can't contain any more data some of your insert statements will have to fail. This will result in an exception being thrown at the client.
However, assuming you have high-end hardware I don't imagine this should happen with 1000 inserts per second, but it does depend on the specific hardware, how much data there is per insert, how many indexes you have, what other queries are running on the system simultaneously, etc.
Regardless of whether you are doing 10 inserts per second or 1000 you still shouldn't blindly assume that every insert will succeed - there's always a chance that an insert will fail because of some network communication error or some other problem. Your application should be able to correctly handle the situation where an insert fails.
Use InnoDB as it supports reads and writes at the same time. MyISAM will usually lock the table during the insert, but give preference to SELECT statements. This can cause issues if you're doing reporting or visualization of the data while trying to do inserts.
If you have a natural primary key (no auto_increment), using it will help avoid some deadlock issues. (Newer versions have fixed this.)
http://www.mysqlperformanceblog.com/2007/09/26/innodb-auto-inc-scalability-fixed/
You might also want to see if you can queue your writes in memory and send them to the database in batches. Lots of little inserts will be much slower than doing batches in transactions.
Good presentation on getting the most out of the MySQL Connector/J JDBC driver:
http://assets.en.oreilly.com/1/event/21/Connector_J%20Performance%20Gems%20Presentation.pdf
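The idea of queuing writes in memory and sending them in batches might look roughly like this. It is a single-threaded sketch: `write_batch` is a hypothetical callable that is assumed to send one batch to the database in a single transaction, and the default batch size of 100 is only a placeholder to tune.

```python
from typing import Callable, List, Sequence


class BatchBuffer:
    """Collect rows in memory and hand them off in fixed-size batches."""

    def __init__(self, write_batch: Callable[[Sequence], None],
                 batch_size: int = 100) -> None:
        self._write_batch = write_batch  # hypothetical: one transaction per call
        self._batch_size = batch_size
        self._rows: List = []

    def add(self, row) -> None:
        """Buffer one row and flush automatically once the batch is full."""
        self._rows.append(row)
        if len(self._rows) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; call once more before shutting down."""
        if self._rows:
            # If write_batch raises, the rows stay buffered for a retry.
            self._write_batch(self._rows)
            self._rows = []
```

With several writer threads, each thread would need its own buffer or a lock around `add` and `flush`.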
What engine do you use? That can make a difference.
http://dev.mysql.com/doc/refman/5.5/en/concurrent-inserts.html
I'm trying to insert about 500 million rows of garbage data into a database for testing. Right now I have a PHP script looping through a few SELECT/INSERT statements each inside a TRANSACTION -- clearly this isn't the best solution. The tables are InnoDB (row-level locking).
I'm wondering if I (properly) fork the process, will this speed up the INSERT process? At the rate it's going, it will take 140 hours to complete. I'm concerned about two things:
If INSERT statements must acquire a write lock, then will it render forking useless, since multiple processes can't write to the same table at the same time?
I'm using SELECT...LAST_INSERT_ID() (inside a TRANSACTION). Will this logic break when multiple processes are INSERTing into the database? I could create a new database connection for each fork, so I hope this would avoid the problem.
How many processes should I be using? The queries themselves are simple, and I have a regular dual-core dev box with 2GB RAM. I set up my InnoDB to use 8 threads (innodb_thread_concurrency=8), but I'm not sure if I should be using 8 processes or if this is even a correct way to think about matching.
Thanks for your help!
The MySQL documentation has a discussion on efficient insertion of a large number of records. It seems that the clear winner is usage of the LOAD DATA INFILE command, followed by inserts that insert multiple values lists.
1) Yes, there will be lock contention, but InnoDB is designed to handle multiple threads trying to insert. Sure, they won't insert simultaneously, but it will handle serializing the inserts for you. Just make sure you explicitly close your transactions, and do it ASAP. This will ensure you get the best possible insert performance.
2) No, this logic will not break provided you have one connection per thread, since LAST_INSERT_ID() is connection-specific.
3) This is one of those things that you just need to benchmark to figure out. Actually, I would make the program self-adjust: run 100 inserts with 8 threads and record the execution times, then try again with half as many and twice as many. Whichever one is faster, benchmark more thread count values around that number.
In general, you should always just go ahead and benchmark this kind of stuff to see which is faster. In the amount of time it takes you to think about it and write it up, you could probably already have preliminary numbers.
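The self-adjusting idea from point 3 could be sketched like this. `measure` is a hypothetical callable that is assumed to run a fixed number of inserts with the given number of workers and return the elapsed seconds. The sketch only halves and doubles, so it does not yet probe the values in between.

```python
from typing import Callable, Dict


def tune_worker_count(measure: Callable[[int], float], start: int = 8,
                      max_rounds: int = 10) -> int:
    """Search for a fast worker count by repeatedly halving and doubling.

    measure(n) is a hypothetical callable: run a fixed number of inserts
    with n workers and return the elapsed time in seconds.
    """
    timings: Dict[int, float] = {}

    def timed(n: int) -> float:
        # Benchmark each worker count only once.
        if n not in timings:
            timings[n] = measure(n)
        return timings[n]

    best = start
    for _ in range(max_rounds):
        candidates = {best, max(1, best // 2), best * 2}
        fastest = min(candidates, key=timed)
        if timed(fastest) >= timed(best):
            break  # neither neighbour is faster; stop here
        best = fastest
    return best
```

Real timings are noisy, so in practice each measurement would probably need to be repeated a few times before trusting the comparison.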