I have a system with two processes, one of which does a single insert, and the other a bulk insert. Obviously the second process is faster, and I'm working on migrating the first process to a bulk insert mechanism, but I was stumped this morning by a question from a colleague about "why bulk insert would be faster than single inserts".
So indeed, why is bulk insert faster than single insert?
Also, are there differences between bulk and single inserts in MySQL and HBase, given that their database architectures are completely different? I am using both for my project, and am wondering if there are differences in the bulk and single inserts for these two databases.
As far as I know, this also depends on the HBase configuration. A bulk insert normally means passing a List of Puts together; in that case the insert (called flushing in the HBase layer) is done automatically when you call table.put with the whole list. A single insert might instead wait for other insert calls so that a batch flush can be done in the middle layer, although this too depends on the configuration.
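For reference, a minimal sketch of what "a List of Puts together" looks like with the standard HBase client API - the table name "events" and the column family "cf" here are just placeholders:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseBatchPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) { // "events" is a made-up table name

            // Build a batch of Puts on the client side...
            List<Put> puts = new ArrayList<>();
            for (int i = 0; i < 10_000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value-" + i));
                puts.add(put);
            }

            // ...and send them in one call instead of 10,000 separate round trips.
            table.put(puts);
        }
    }
}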
Another reason may be the nature of the work itself: Map and Reduce run more efficiently when each job carries more work. With a bulk load, the movement of file chunks is decided once for all the input; with individual inserts, that decision has to be made over and over, and it becomes the crucial cost.
In short - the bulkload operation bypasses the regular write path. That's why it is fast.
So, what happens during the normal write path when you do a simple row-by-row put operation?
The data is written to both the WAL and the memstore, and when the memstore is full the data is flushed to a new HFile. A bulkload, in contrast, writes StoreFiles directly into the running HBase cluster. No intermediate steps...
Quick tip - if you don't want to use bulkload (it is often done in short bursts, which puts an additional burden on the cluster), you can turn off writing to the WAL using Put.setWriteToWal(false) to save some time.
But this will increase your chances of data loss.
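As a rough sketch of that tip (the exact method depends on the client version - older clients exposed Put.setWriteToWAL(false), newer 1.x+ clients express the same idea through setDurability):

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SkipWalPutExample {
    // Column family "cf" and qualifier "col" are placeholders.
    static Put buildFastButRiskyPut(String rowKey) {
        Put put = new Put(Bytes.toBytes(rowKey));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
        // Skip the WAL: faster, but the edit is lost if the region server dies before a memstore flush.
        put.setDurability(Durability.SKIP_WAL);
        return put;
    }
}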
I have a quick question regarding migrating large data sample sets from my local device to an Amazon Aurora RDS (no DMS approach).
So basically I am working on a proof of concept and I need to populate an Amazon Aurora DB with 2 million rows of data. I have generated an SQL file with 2 million INSERT commands. Now I need to get this SQL file into the RDS instance. What is the best (by best I mean fastest) option to do this? Can anyone suggest one?
Something to consider if your data was in S3 at some point: you could skip a few steps and load directly from S3.
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.LoadFromS3.html
Obviously, this only applies if it makes sense for your data pipeline.
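For illustration, the Aurora-MySQL-specific statement can be issued from any client. Here is a rough Java/JDBC sketch - the endpoint, credentials, bucket, table, and columns are placeholders, and the cluster also needs the IAM role and S3-load parameters described in the linked docs:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AuroraLoadFromS3Example {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint, credentials, bucket, and table names.
        String url = "jdbc:mysql://my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com:3306/mydb";
        try (Connection conn = DriverManager.getConnection(url, "admin", "secret");
             Statement stmt = conn.createStatement()) {
            // Aurora MySQL extension: the cluster pulls the file straight from S3,
            // so the 2 million rows never travel through your local machine.
            stmt.execute(
                "LOAD DATA FROM S3 's3://my-bucket/my-data.csv' " +
                "INTO TABLE t1 " +
                "FIELDS TERMINATED BY ',' " +
                "LINES TERMINATED BY '\\n' " +
                "(x, s)");
        }
    }
}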
The answer depends on a few different things, like which database engine (PostgreSQL or MySQL) and the server settings. Here are some general things to consider. All of these work by running the mysql, psql, or whichever client program with the option for 'run the statements in this file'.
Don't run 2 million separate INSERT statements. Use multiple rows in the VALUES clause of each one, e.g.
postgres=> create table t1 (x int, s varchar);
postgres=> insert into t1 values (1, 'one'), (2, 'two'), (3, 'three');
Since you have control over generating the text of the INSERT statements, you might bundle 1000 rows into each one.
Also, don't do 2 million COMMITs, as would happen if you did 2 million INSERT statements with 'autocommit' turned on. Start a transaction, do N inserts, then commit. Rinse and repeat. I'm not sure offhand what the ideal value of N is. Since you already reduced the number of INSERT statements in step 1, maybe each transaction only has a few of these gigantic inserts in it.
I don't think you want to do the whole thing in one single transaction, though, just because of the possibility of overloading memory. The right balance between the number of VALUES per INSERT and the number of INSERTs per transaction is something I don't have a recommendation for at hand. It could also depend on how many columns are in each INSERT, how long the string values are, and so on.
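To make points 1 and 2 concrete, here is a rough Java/JDBC sketch that generates synthetic rows instead of reading your file. The endpoint, credentials, table, and the 1000-rows-per-INSERT / 10-INSERTs-per-COMMIT numbers are all placeholders to tune:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ChunkedInsertLoader {
    // Illustrative values: 1000 rows per INSERT, 10 INSERTs (10,000 rows) per COMMIT.
    static final int ROWS_PER_INSERT = 1000;
    static final int INSERTS_PER_COMMIT = 10;

    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://my-aurora-endpoint:3306/mydb"; // placeholder endpoint
        try (Connection conn = DriverManager.getConnection(url, "admin", "secret");
             Statement stmt = conn.createStatement()) {

            conn.setAutoCommit(false); // one COMMIT per group of statements, not per row

            int insertsInTxn = 0;
            for (int chunkStart = 0; chunkStart < 2_000_000; chunkStart += ROWS_PER_INSERT) {
                // Build one multi-row INSERT for this chunk.
                StringBuilder sql = new StringBuilder("INSERT INTO t1 (x, s) VALUES ");
                for (int i = 0; i < ROWS_PER_INSERT; i++) {
                    if (i > 0) sql.append(',');
                    int rowId = chunkStart + i;
                    sql.append('(').append(rowId).append(",'row-").append(rowId).append("')");
                }
                stmt.execute(sql.toString());

                // Commit every N multi-row INSERTs instead of after every statement.
                if (++insertsInTxn == INSERTS_PER_COMMIT) {
                    conn.commit();
                    insertsInTxn = 0;
                }
            }
            conn.commit(); // flush any final partial transaction
        }
    }
}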
You can start up multiple sessions and do these transactions & inserts in parallel. No reason to wait until row 1000 is finished inserting before starting on row 50,000 or row 750,000. That means you'll split all these statements across multiple files. One of the strengths of Aurora is handling a lot of concurrent connections like this.
Lastly, another Aurora-specific technique. (Well, it would work for RDS databases too.) Modify the DB instance to a higher-capacity instance class, do the data loading, then modify it back to the original instance class. Certain operations like data loading and engine upgrades benefit from having lots of cores and lots of memory - that can give you huge time savings. Which can be worth it to pay for a few minutes of 8xlarge or whatever, even if after that your queries run fine with a much smaller instance class.
If you don't mind rewriting the data into CSV form or something other than actual INSERT statements, check out the mysqlimport command for MySQL, or the \copy command for PostgreSQL. (\copy takes the data off your local machine and so works for Aurora, whereas COPY assumes the data is on a file on the server, which you don't have ssh or ftp access to with Aurora.)
I am trying to create a web application whose primary objective is to insert request data into a database.
Here is my problem: one request by itself contains 10,000 to 100,000 data sets of information
(Each data set needs to be inserted separately as a row in the database)
I may get multiple requests to this application concurrently, so it's necessary for me to make the inserts fast.
I am using a MySQL database. Which approach is better for me: LOAD DATA or batch INSERT, or is there a better way than these two?
How will your application retrieve this information?
- There will be another background, thread-based Java application that will select records from this table, process them one by one, and delete them.
Can you queue your requests (batches) so your system will handle them one batch at a time?
- For now we are thinking of inserting it into the database straight away, but yes, if this approach is not feasible enough we may think of queuing the data.
Do retrievals of information need to be concurrent with insertion of new data?
- Yes, we are keeping it concurrent.
Here are some answers to your questions, Ollie Jones.
Thank you!
Ken White's comment mentioned a couple of useful SO questions and answers for handling bulk insertion. For the record volume you are handling, you will enjoy the best success by using MyISAM tables and LOAD DATA INFILE data loading, from source files in the same file system that's used by your MySQL server.
What you're doing here is a kind of queuing operation. You receive these batches (you call them "requests") of records (you call them "data sets"). You put them into a big bucket (your MySQL table). Then you take them out of the bucket one at a time.
You haven't described your problem completely, so it's possible my advice is wrong.
Is each record ("data set") independent of all the others?
Does the order in which the records are processed matter? Or would you obtain the same results if you processed them in a random order? In other words, do you have to maintain an order on the individual records?
What happens if you receive two million-row batches ("requests") at approximately the same time? Assuming you can load ten thousand records a second (that's fast!) into your MySQL table, this means it will take 200 seconds to load both batches completely. Will you try to load one batch completely before beginning to load the second?
Is it OK to start processing and deleting the rows in these batches before the batches are completely loaded?
Is it OK for a record to sit in your system for 200 or more seconds before it is processed? How long can a record sit? (this is called "latency").
Given the volume of data you're mentioning here, if you're going into production with live data you may want to consider using a queuing system like ActiveMQ rather than a DBMS.
It may also make sense simply to build a multi-threaded Java app to load your batches of records, deposit them into a Queue object in RAM (a ConcurrentLinkedQueue instance may be suitable) and process them one by one. This approach will give you much more control over the performance of your system than you will have by using a MySQL table as a queue.
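A minimal sketch of that idea: batches land in an in-memory ConcurrentLinkedQueue and a background worker drains them one at a time. The record type (a Java 16+ record) and the processing step are placeholders:

import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class InMemoryBatchQueue {
    // Placeholder record type for one "data set" in a request.
    public record DataSet(String payload) {}

    private final Queue<DataSet> queue = new ConcurrentLinkedQueue<>();

    // Called by the web request handler: enqueue the whole batch, return immediately.
    public void acceptBatch(List<DataSet> batch) {
        queue.addAll(batch);
    }

    // Started once; one or more worker threads drain the queue.
    public void startWorker() {
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                DataSet next = queue.poll();
                if (next == null) {
                    try { Thread.sleep(10); } catch (InterruptedException e) { return; } // nothing queued yet
                    continue;
                }
                process(next); // placeholder for the real processing + delete step
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    private void process(DataSet dataSet) {
        System.out.println("processed " + dataSet.payload());
    }
}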
We've got a constant stream of simple updates to a single MySQL table (storing user activity information). Let's say we group these into batch updates each second.
I want a ballpark idea of when MySQL on a typical 4-core 8GB box will start having an issue keeping up with the updates coming in each second. E.g., roughly how many rows of updates can I apply at one batch per second?
This is a thought exercise to decide if I should get going with MySQL in the early days of our applications release (simplify development), or if MySQL's likely to bomb so soon as to make it not worth even venturing down that path.
The only way you can get a decent figure is through benchmarking your specific use case. There are just too many variables and there is no way around that.
It shouldn't take too long either: if you just knock up a bash script or a small demo app and hammer it with JMeter, that can give you a good idea.
I used JMeter when trying to benchmark a similar use case. The difference was that I was looking at write throughput in terms of the number of INSERTs. The most useful thing that came out while I was playing was the 'innodb_flush_log_at_trx_commit' parameter. If you are using InnoDB and don't need ACID compliance for your use case, try changing it to 0. This makes a huge difference to INSERT throughput and will likely do the same in your UPDATE use case. Note, though, that with this setting changes only get flushed to disk once per second, so if your server gets a power cut or something, you could lose a second's worth of data.
On my Quad Core 8GB Machine for my use case:
innodb_flush_log_at_trx_commit=1 resulted in 80 INSERTs per second
innodb_flush_log_at_trx_commit=0 resulted in 2000 INSERTs per second
These figures will probably bear no relevance to your use case - which is why you need to benchmark it yourself.
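If you want to flip that setting from a test client rather than my.cnf, it is a dynamic global variable (the account needs SUPER / SYSTEM_VARIABLES_ADMIN). A rough Java/JDBC sketch with placeholder connection details:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FlushLogSettingToggle {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/test"; // placeholder connection details
        try (Connection conn = DriverManager.getConnection(url, "root", "secret");
             Statement stmt = conn.createStatement()) {
            // 0 = flush the redo log roughly once per second instead of at every commit.
            // Much higher write throughput, but up to about a second of committed data
            // can be lost on a crash or power cut, as the answer above warns.
            stmt.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 0");
        }
    }
}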
A lot of it depends on the quality of the code which you use to push to the DB.
If you write your batch to insert a single value per INSERT request (i.e.,
INSERT INTO table (field) VALUES (value_1);
INSERT INTO table (field) VALUES (value_2);
...
INSERT INTO table (field) VALUES (value_n);
), your performance will crash and burn.
If you insert multiple values using a single INSERT (i.e.
INSERT INTO table (field) VALUES (value_1), (value_2), ..., (value_n);
), you'll find that you could easily insert many records per second.
As an example, I wrote a quick app which needed to add the details of a request for an LDAP account to a holding DB. Inserting one field at a time (i.e., LDAP_field, LDAP_value), execution of the whole script took tens of seconds. When I concatenated the values into a single INSERT request, execution time of the script went down to about 2 seconds from start to finish. This included the overhead of starting and committing a transaction.
Hope this helps
It's not easy to give a general answer to this question. The numbers you ask for depend heavily not only on the hardware of your database server and on MySQL itself, but also on server/client configuration, the network and - equally important - on your database/table design.
Generally speaking, with a naked MySQL setup on a state-of-the-art server and update statements using unique keys, I don't have issues below 200 update statements per second if I fire them from localhost; at least that's what I get on my six-year-old WinXP test environment. A naked installation on a new system will scale way higher than this. If you are thinking much bigger, one server isn't the way to go. MySQL can be tweaked and scaled out in various ways, which is why many companies rely heavily on it.
Just some basics:
- If the fields you want to update have huge index files, the update statements are a lot slower, since each statement has to write not only the data but also the index information.
- If your update statement cannot use an index, it might take the server longer to find the rows it has to update.
- Slow memory and/or slow hard disks might also slow down overall server performance.
- A slow network connection slows down communication between client and server.
There are whole books written about it, so I'll stop here and advise some further reading, if you're interested!
I'm trying to insert about 500 million rows of garbage data into a database for testing. Right now I have a PHP script looping through a few SELECT/INSERT statements each inside a TRANSACTION -- clearly this isn't the best solution. The tables are InnoDB (row-level locking).
I'm wondering: if I (properly) fork the process, will this speed up the INSERT process? At the rate it's going, it will take 140 hours to complete. I'm concerned about a few things:
If INSERT statements must acquire a write lock, then will it render forking useless, since multiple processes can't write to the same table at the same time?
I'm using SELECT...LAST_INSERT_ID() (inside a TRANSACTION). Will this logic break when multiple processes are INSERTing into the database? I could create a new database connection for each fork, so I hope this would avoid the problem.
How many processes should I be using? The queries themselves are simple, and I have a regular dual-core dev box with 2GB RAM. I set up my InnoDB to use 8 threads (innodb_thread_concurrency=8), but I'm not sure if I should be using 8 processes or if this is even a correct way to think about matching.
Thanks for your help!
The MySQL documentation has a discussion on efficient insertion of a large number of records. It seems that the clear winner is usage of the LOAD DATA INFILE command, followed by INSERT statements with multiple VALUES lists.
1) Yes, there will be lock contention, but InnoDB is designed to handle multiple threads trying to insert. Sure, they won't insert simultaneously, but it will handle serializing the inserts for you. Just make sure you explicitly close your transactions, and do it ASAP. This will ensure you get the best possible insert performance.
2) No, this logic will not break provided you have one connection per thread, since LAST_INSERT_ID() is connection-specific (see the sketch after this answer).
3) This is one of those things that you just need to benchmark to figure out. Actually, I would make the program self-adjust: run 100 inserts with 8 threads and record the execution times, then try again with half as many and twice as many threads. Whichever is faster, benchmark more thread-count values around that number.
In general, you should always just go ahead and benchmark this kind of thing to see which is faster. In the amount of time it takes you to think about it and write it up, you could probably already have preliminary numbers.
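The question is about PHP, but to illustrate point 2) in Java/JDBC terms: give each worker its own connection, and the generated keys / LAST_INSERT_ID() seen on that connection only ever reflect that worker's inserts. The table name and connection details below are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class PerConnectionInsertWorker implements Runnable {
    private final String url; // e.g. "jdbc:mysql://localhost:3306/test" - placeholder

    public PerConnectionInsertWorker(String url) {
        this.url = url;
    }

    @Override
    public void run() {
        // Each worker opens its own connection, so the generated keys it reads back
        // belong to its own inserts, never another worker's.
        try (Connection conn = DriverManager.getConnection(url, "root", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO garbage (payload) VALUES (?)", // placeholder table with an auto-increment id
                     Statement.RETURN_GENERATED_KEYS)) {
            conn.setAutoCommit(false);
            for (int i = 0; i < 1000; i++) {
                ps.setString(1, "row-" + i);
                ps.executeUpdate();
                try (ResultSet keys = ps.getGeneratedKeys()) {
                    if (keys.next()) {
                        long id = keys.getLong(1); // this connection's last insert id
                        // ... use id for any follow-up SELECT/INSERT in the same transaction
                    }
                }
            }
            conn.commit();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}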
I have written a program in C to parse large XML files and then create files with insert statements. Some other process would ingest the files into a MySQL database.
This data will serve as an indexing service so that users can find documents easily.
I have chosen InnoDB for its ability to do row-level locking. The C program will be generating anywhere from 500 to 5 million insert statements on a given invocation.
What is the best way to get all this data into the database as quickly as possible? The other thing to note is that the DB is on a separate server. Is it worth moving the files over to that server to speed up inserts?
EDIT: This table won't really be updated, but rows will be deleted.
Use the mysqlimport tool or the LOAD DATA INFILE command.
Temporarily disable indices that you don't need for data integrity.
I'd do at least these things, according to this link:
- Move the files there and connect over the Unix socket
- Generate a LOAD DATA INFILE file instead of the INSERTs (see the sketch after this list)
- Disable indexes during the loading
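A rough sketch of the LOAD DATA step driven from a client program (Java/JDBC here; the file path, table, and columns are placeholders, and Connector/J needs allowLoadLocalInfile=true for the LOCAL variant):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadDataInfileExample {
    public static void main(String[] args) throws Exception {
        // allowLoadLocalInfile=true lets Connector/J stream a client-side file;
        // drop LOCAL and the flag if the file already lives on the DB server.
        String url = "jdbc:mysql://db-server:3306/docs?allowLoadLocalInfile=true"; // placeholder host/db
        try (Connection conn = DriverManager.getConnection(url, "loader", "secret");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "LOAD DATA LOCAL INFILE '/tmp/documents.tsv' " + // placeholder file produced by the C parser
                "INTO TABLE document_index " +                   // placeholder table
                "FIELDS TERMINATED BY '\\t' " +
                "LINES TERMINATED BY '\\n' " +
                "(doc_id, title, body)");                        // placeholder columns
        }
    }
}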
MySQL with the standard table formats is wonderfully fast as long as it's a write-only table, so the first question is whether you are going to be updating or deleting. If not, don't go with InnoDB - there's no need for locking if you are just appending. You can truncate or rename the output file periodically to deal with table size.
1. Make sure you use a transaction.
Transactions eliminate the repeated INSERT, sync-to-disk cycle; instead, all the disk I/O is performed when you COMMIT the transaction.
2. Make sure to utilize connection compression
Raw text sent over a gzip-compressed stream can mean as much as a 90% bandwidth saving in some cases.
3. Use the multi-row ("parallel") INSERT notation where possible
INSERT INTO TableName(Col1,Col2) VALUES (1,1),(1,2),(1,3)
(Less text to send, shorter action.)
If you can't use LOAD DATA INFILE like others have suggested, use prepared queries for inserts.
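A sketch of the prepared-statement route (host, credentials, table, and batch sizes are placeholders; rewriteBatchedStatements=true makes Connector/J rewrite the batch into multi-row INSERTs on the wire, and useCompression=true covers the compression tip above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PreparedBatchInsert {
    public static void main(String[] args) throws Exception {
        // rewriteBatchedStatements=true: the driver collapses the batch into multi-row INSERTs.
        // useCompression=true: compress the client/server stream.
        String url = "jdbc:mysql://db-server:3306/mydb"
                   + "?rewriteBatchedStatements=true&useCompression=true"; // placeholder host/db
        try (Connection conn = DriverManager.getConnection(url, "loader", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO TableName (Col1, Col2) VALUES (?, ?)")) {
            conn.setAutoCommit(false); // tip 1: one transaction around the whole batch
            for (int i = 0; i < 100_000; i++) {
                ps.setInt(1, 1);
                ps.setInt(2, i);
                ps.addBatch();
                if (i % 10_000 == 0) {
                    ps.executeBatch(); // send in chunks so the client-side batch doesn't grow unbounded
                }
            }
            ps.executeBatch();
            conn.commit();
        }
    }
}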
It really depends on the engine. If you're using InnoDB, do use transactions (you can't avoid them - with autocommit, each statement is implicitly in its own transaction), but make sure they're neither too big nor too small.
If you're using MyISAM, transactions are meaningless. You may achieve better insert speed by disabling and enabling indexes, but that is only good on an empty table.
If you start with an empty table, that's generally best.
LOAD DATA is a winner either way.