I'm trying to insert about 500 million rows of garbage data into a database for testing. Right now I have a PHP script looping through a few SELECT/INSERT statements each inside a TRANSACTION -- clearly this isn't the best solution. The tables are InnoDB (row-level locking).
I'm wondering: if I (properly) fork the process, will this speed up the INSERT process? At the rate it's going, it will take 140 hours to complete. I'm concerned about three things:
If INSERT statements must acquire a write lock, then will it render forking useless, since multiple processes can't write to the same table at the same time?
I'm using SELECT...LAST_INSERT_ID() (inside a TRANSACTION). Will this logic break when multiple processes are INSERTing into the database? I could create a new database connection for each fork, so I hope this would avoid the problem.
How many processes should I be using? The queries themselves are simple, and I have a regular dual-core dev box with 2GB RAM. I set up my InnoDB to use 8 threads (innodb_thread_concurrency=8), but I'm not sure if I should be using 8 processes or if this is even a correct way to think about matching.
Thanks for your help!
The MySQL documentation has a discussion on efficient insertion of a large number of records. It seems that the clear winner is usage of the LOAD DATA INFILE command, followed by inserts that insert multiple values lists.
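To make the LOAD DATA INFILE route concrete, here is a minimal Python sketch that writes the garbage rows to a CSV file and builds the matching statement. The table name, the three columns (id, label, score), and the file layout are all made up for illustration; adjust them to your schema, and note that LOCAL has to be enabled on both client and server for this form to work.

```python
import csv
import random
import string


def write_garbage_csv(path, row_count, seed=0):
    """Write row_count rows of random test data to a CSV file."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as handle:
        # Use a plain "\n" terminator so it matches the LOAD DATA clause below.
        writer = csv.writer(handle, lineterminator="\n")
        for row_id in range(1, row_count + 1):
            label = "".join(rng.choice(string.ascii_lowercase) for _ in range(12))
            writer.writerow([row_id, label, rng.randint(0, 1_000_000)])


def load_data_statement(path, table):
    """Build a LOAD DATA statement for a file written by write_garbage_csv.

    The column list (id, label, score) is a hypothetical schema.
    """
    return (
        f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
        "(id, label, score)"
    )
```

You would generate the file once, then run the returned statement through whatever MySQL client you already use.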
1) Yes, there will be lock contention, but InnoDB is designed to handle multiple threads trying to insert. The inserts won't all happen simultaneously, but InnoDB will serialize them for you. Just make sure you explicitly commit your transactions, and do it as soon as possible. This keeps lock times short and gives you the best possible insert performance.
2) No, this logic will not break, provided you have one connection per thread, since LAST_INSERT_ID() is connection-specific.
3) This is one of those things that you just need to benchmark to figure out. Actually, I would make the program self-adjust. Run 100 inserts with 8 threads and record the execution time. Then try again with half as many and twice as many. Whichever one is fastest, benchmark more thread counts around that number.
In general, you should always just go ahead and benchmark this kind of thing to see which is faster. In the time it takes you to think about it and write it up, you could probably already have preliminary numbers.
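The self-adjusting idea in 3) could look roughly like the Python sketch below. It assumes you supply a measure(workers) function (hypothetical, not shown) that runs a fixed batch of inserts with that many workers and returns the elapsed seconds; the search itself is just one simple way to "try half and double, then probe around the winner".

```python
def find_best_worker_count(measure, start=8):
    """Search for the worker count with the lowest measured time.

    measure(workers) is a caller-supplied benchmark returning elapsed seconds.
    Returns the best count found and every timing that was taken.
    """
    timings = {}

    def timed(workers):
        # Benchmark each worker count only once.
        if workers not in timings:
            timings[workers] = measure(workers)
        return timings[workers]

    # First pass: half as many, the starting count, and twice as many.
    best = min((max(1, start // 2), start, start * 2), key=timed)

    # Refinement: probe neighbours of the current best with a shrinking step.
    step = max(1, best // 2)
    while step >= 1:
        candidates = [c for c in (best - step, best, best + step) if c >= 1]
        best = min(candidates, key=timed)
        step //= 2
    return best, timings
```

Real timings are noisy, so in practice you would want measure to average a few runs before trusting the result.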
I have a server that receives data from thousands of locations all over the world. Periodically, that server connects to my DB server and inserts records with multi-insert, some 11,000 rows at a time per multi, and up to 6 insert statements. When this happens, all 6 process lock the table being inserted into.
What I am trying to figure out is what causes the locking? Am I better off limiting my multi-insert to, say 100 rows at a time and doing them end to end? What do I use for guidelines?
The DB server has 100GB RAM and 12 processors. It is very lightly used, but when these inserts come in, everything freezes up for a couple of minutes, which disrupts people running reports, etc.
Thanks for any advice. I know I need to stagger the inserts, I am just asking what is a recommended way to do this.
UPDATE: I was incorrect. I spoke to the programmer and he said that there is a Perl program running that sends single inserts to the server, as rapidly as it can, NOT a multi-insert. There are (currently) 6 of these Perl processes running simultaneously. One of them is doing 91,000 inserts, one at a time. Perhaps, since we have a lot of RAM, a multi-insert would be better?
Your question lacks a bunch of details about how the system is structured. In addition, if you have a database running on a server with 100 Gbytes of RAM, you should have access to a professional DBA, and not rely on internet forums.
But, as lad2025 suggests in a comment, staging tables can probably solve your problem. Your locking is probably caused by indexes, or possibly by triggers. The suggestion would be to load the data into a staging table. Then, leisurely load the data from the staging table into the final table.
One possibility is doing 11,000 inserts, say one per second (that would require about three hours). Although there is more overhead in doing the inserts, each would be its own transaction and the locking times would be very short.
Of course, only inserting 1 record at a time may not be optimal. Perhaps 10 or 100 or even 1000 would suffice. You can manage the inserts using the event scheduler.
And, this assumes that the locking scales according to the volume of the input data. That is an assumption, but I think a reasonable one in the absence of other information.
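As a rough illustration of moving rows from the staging table in small pieces, the Python sketch below splits an id range into batches and builds one short INSERT ... SELECT per batch. The names staging_table and final_table, and the assumption that the staging table has a dense integer id column, are mine and not from the question.

```python
def id_ranges(first_id, last_id, batch_size):
    """Yield inclusive (low, high) id ranges covering first_id..last_id."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    low = first_id
    while low <= last_id:
        high = min(low + batch_size - 1, last_id)
        yield low, high
        low = high + 1


def copy_statement(low, high):
    """Build one small copy statement (hypothetical table names)."""
    return (
        "INSERT INTO final_table SELECT * FROM staging_table "
        f"WHERE id BETWEEN {low} AND {high}"
    )
```

Each statement would run as its own transaction, so the final table is only locked for the duration of one small batch; the batch size (1, 10, 100, 1000) is the knob to experiment with.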
We've got a constant stream of simple updates to a single MySQL table (storing user activity information). Let's say we group these into batch updates each second.
I want a ballpark idea of when MySQL on a typical 4-core 8GB box will start having trouble keeping up with the updates coming in each second. E.g. how many rows of updates can I make at one batch per second?
This is a thought exercise to decide if I should get going with MySQL in the early days of our applications release (simplify development), or if MySQL's likely to bomb so soon as to make it not worth even venturing down that path.
The only way you can get a decent figure is through benchmarking your specific use case. There are just too many variables and there is no way around that.
It shouldn't take too long either: just knock up a bash script or a small demo app and hammer it with JMeter, and that can give you a good idea.
I used JMeter when trying to benchmark a similar use case. The difference was that I was looking at write throughput for a number of INSERTs. The most useful thing that came out of my experiments was the 'innodb_flush_log_at_trx_commit' param. If you are using InnoDB and don't need full ACID durability for your use case, then change it to 0. This makes a huge difference to INSERT throughput and will likely do the same in your UPDATE use case. Note, though, that with this setting changes only get flushed to disk once per second, so if your server suffers a power cut or similar, you could lose a second's worth of data.
On my Quad Core 8GB Machine for my use case:
innodb_flush_log_at_trx_commit=1 resulted in 80 INSERTS per second
innodb_flush_log_at_trx_commit=0 resulted in 2000 INSERTS per second
These figures will probably bear no relevance to your use case - which is why you need to benchmark it yourself.
A lot of it depends on the quality of the code which you use to push to the DB.
If you write your batch to insert a single value per INSERT request, i.e.
INSERT INTO table (field) VALUES (value_1);
INSERT INTO table (field) VALUES (value_2);
...
INSERT INTO table (field) VALUES (value_n);
your performance will crash and burn.
If you insert multiple values using a single INSERT, i.e.
INSERT INTO table (field) VALUES (value_1), (value_2), ..., (value_n);
you'll find that you can easily insert many records per second.
As an example, I wrote a quick app which needed to add the details of a request for an LDAP account to a holding DB. Inserting one field at a time (i.e., LDAP_field, LDAP_value), execution of the whole script took tens of seconds. When I concatenated the values into a single INSERT request, execution time of the script went down to about 2 seconds from start to finish. This included the overhead of starting and committing a transaction.
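For what it's worth, concatenating the values doesn't have to mean gluing strings together by hand. Here is a small Python sketch that builds a single multi-row INSERT with %s placeholders (the style the common Python MySQL drivers accept) plus a flat parameter list. The table and column names in the usage line are invented, and identifiers are not escaped, so only pass trusted names.

```python
def build_multi_insert(table, columns, rows):
    """Build one multi-row INSERT and its flat parameter list.

    Table and column names are inserted as-is; only pass trusted identifiers.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    width = len(columns)
    for row in rows:
        if len(row) != width:
            raise ValueError("every row must have one value per column")

    # One "(%s, %s, ...)" group per row, joined into a single VALUES list.
    row_placeholder = "(" + ", ".join(["%s"] * width) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([row_placeholder] * len(rows))
    )
    params = [value for row in rows for value in row]
    return sql, params
```

For example, build_multi_insert("ldap_requests", ["field", "value"], rows) gives you one statement and one parameter list to hand to cursor.execute, instead of one round trip per row.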
Hope this helps
It's not easy to give a general answer to this question. The numbers you ask for depend heavily not only on the hardware of your database server and on MySQL itself, but also on server/client configuration, the network and, equally important, your database/table design.
Generally speaking, with a bare MySQL setup on a state-of-the-art server and update statements that use unique keys, I don't have issues below 200 update statements per second if I fire them from localhost; at least, that's what I get on my six-year-old Windows XP test environment. A bare installation on a new system will scale much higher. If you are thinking much bigger, a single server isn't the way to go. MySQL can be tuned and scaled out in several ways, which is why many companies rely heavily on it.
Just some basics:
If the fields you want to update have huge indexes, the update statements are a lot slower, since each statement has to write not only the data but also the index entries.
If your update statement cannot use an index, the server may take longer to locate the rows it has to update.
Slow memory and/or slow hard disks can also slow down overall server performance.
A slow network connection slows down communication between client and server.
There are whole books written about it, so I'll stop here and advise some further reading, if you're interested!
Currently we have a database and a script that runs 2 UPDATEs, 1 SELECT, and 1 INSERT.
The problem is that 20,000 people run this script every hour, which causes MySQL to run at 100% CPU.
The INSERT is for logging: we want to log all the data to MySQL, but as the table scales up, the application becomes slower and slower. We are running on InnoDB, but some people say it should be MyISAM. What should we use? We do sometimes pull data out of this log table for statistical purposes, but only 40 to 50 times a day.
Our solution is to use Gearman [http://gearman.org/] to delay the inserts to the database. But what about the updates?
We need to update 2 tables: one for the customer, to update the balance (balance = balance - 1), and another to update a count.
How can we make this faster and more CPU efficient?
Thank you
"but as the table scales up, the application becomes slower and slower"
This usually means that you're missing an index somewhere.
MyISAM is not good: in addition to being non ACID compliant, it'll lock the whole table to do an insert -- which kills concurrency.
Read the MySQL documentation carefully:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
Especially "innodb_flush_log_at_trx_commit" -
http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html
I would stay away from MyISAM as it has concurrency issues when mixing SELECT and INSERT statements. If you can keep your insert tables small enough to stay in memory, they'll go much faster. Batching your updates in a transaction will help them go faster as well. Setting up a test environment and tuning for your actual job is important.
You may also want to look into partitioning to rotate your logs. You'd drop the old partition and create a new one for the current data. This is much faster than deleting the old rows.
I have a large quantity of data in a production database that I want to update with batches of data while the data in the table is still available for end user use. The updates could be insertions of new rows or updates of existing rows. The specific table is approximately 50M rows, and the updates will be between 100k and 1M rows per "batch". What I would like to do is an insert/replace with low priority. In other words, I want the database to slowly do the batch import without impacting the performance of other queries that are hitting the same disk spindles concurrently. To complicate this, the updated data is heavily indexed: 8 B-tree indexes across multiple columns to facilitate various lookups, which adds quite a bit of overhead to the import.
I've thought about batching the inserts down into 1-2k record blocks, then having the external script that loads the data just pause for a couple of seconds between each insert, but that's really kind of hokey IMHO. Plus, during a 1M record batch, I really don't want to add 500-1000 2-second pauses, and with them 20-40 minutes of extra load time, if it's not needed. Anyone have ideas on a better way to do this?
I've dealt with a similar scenario using InnoDB and hundreds of millions of rows. Batching with a throttling mechanism is the way to go if you want to minimize risk to end users. I'd experiment with different pause times and see what works for you. With small batches you have the benefit that you can adjust accordingly. You might find that you don't need any pause if you run this all sequentially. If your end users are using more connections then they'll naturally get more resources.
If you're using MyISAM there's a LOW_PRIORITY option for UPDATE. If you're using InnoDB with replication, be sure to check that the replica isn't getting too far behind because of the extra load. Apparently replication runs in a single thread, and that turned out to be the bottleneck for us. Consequently we programmed our throttling mechanism to check how far behind replication was and pause as needed.
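A throttling loop of the kind described above might be sketched like this in Python. This is not the code we ran, just the general shape: apply_batch and get_lag_seconds are callables you would supply yourself (for instance, the lag could come from the Seconds_Behind_Master column of SHOW SLAVE STATUS), and the 5-second threshold and 1-second pause are arbitrary starting points.

```python
import time


def run_throttled(batches, apply_batch, get_lag_seconds,
                  max_lag=5.0, pause=1.0, sleep=time.sleep):
    """Apply batches one at a time, pausing while replication lags too far.

    apply_batch and get_lag_seconds are caller-supplied (hypothetical) hooks.
    Returns how many pauses were taken, which is handy for tuning.
    """
    pauses = 0
    for batch in batches:
        # Wait until the replica has caught up enough before the next batch.
        while get_lag_seconds() > max_lag:
            sleep(pause)
            pauses += 1
        apply_batch(batch)
    return pauses
```

Because the loop only sleeps when the lag is actually high, a quiet server gets the whole import at full speed with no fixed pauses.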
An INSERT DELAYED might be what you need. From the linked documentation:
Each time that delayed_insert_limit rows are written, the handler checks whether any SELECT statements are still pending. If so, it permits these to execute before continuing.
Check this link: http://dev.mysql.com/doc/refman/5.0/en/server-status-variables.html What I would do is write a script that executes your batch updates only when MySQL is showing Threads_running or Connections under a certain number. Hopefully you have some sort of test server where you can determine what a good threshold might be for either of those status variables. There are plenty of other server status variables to look at in there too. Maybe control the executions by the Innodb_data_pending_writes number? Let us know what works for you, it's an interesting question!
I don't have a testing environment for this yet. But before I think too much about solutions I'd like to know if people think this would be a problem.
I will have 10-20 java processes connected to a MySql db via JDBC. Each will be inserting unique records, all going to the same table. The rate of inserts will be on the order of 1000's per second.
My expectation is that some process will attempt to insert and encounter a table lock while another process is inserting, and that this will result in a JDBC exception and cause that insert to fail.
Clearly if you increase the insert rate sufficiently there eventually will be a point where some buffer somewhere fills up faster than it can be emptied. When such a buffer hits its maximum capacity and can't contain any more data some of your insert statements will have to fail. This will result in an exception being thrown at the client.
However, assuming you have high-end hardware I don't imagine this should happen with 1000 inserts per second, but it does depend on the specific hardware, how much data there is per insert, how many indexes you have, what other queries are running on the system simultaneously, etc.
Regardless of whether you are doing 10 inserts per second or 1000 you still shouldn't blindly assume that every insert will succeed - there's always a chance that an insert will fail because of some network communication error or some other problem. Your application should be able to correctly handle the situation where an insert fails.
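Handling a failed insert usually comes down to a bounded retry. The sketch below shows the idea in Python rather than Java, purely to keep it short; do_insert is a callable you supply, and the attempt count, backoff delays and the choice of which exceptions to retry are all assumptions you would tune for your driver.

```python
import time


def insert_with_retry(do_insert, attempts=3, base_delay=0.1,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call do_insert, retrying with exponential backoff on failure.

    Re-raises the last error once all attempts are used up, so the caller
    can log the row or park it somewhere for later.
    """
    for attempt in range(1, attempts + 1):
        try:
            return do_insert()
        except retry_on:
            if attempt == attempts:
                raise
            # Back off: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

In a real application you would narrow retry_on to the transient errors your driver raises (lost connection, lock wait timeout, deadlock) rather than retrying everything.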
Use InnoDB as it supports reads and writes at the same time. MyISAM will usually lock the table during the insert, but give preference to SELECT statements. This can cause issues if you're doing reporting or visualization of the data while trying to do inserts.
If you have a natural primary key (no auto_increment), using it will help avoid some deadlock issues. (Newer versions have fixed this.)
http://www.mysqlperformanceblog.com/2007/09/26/innodb-auto-inc-scalability-fixed/
You might also want to see if you can queue your writes in memory and send them to the database in batches. Lots of little inserts will be much slower than doing batches in transactions.
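Queuing writes in memory can be as simple as the buffer sketched below (in Python for brevity; the same shape works in Java). The flush callback is hypothetical: it would typically send the rows as one multi-row INSERT or one JDBC batch inside a single transaction. The 500-row limit is an arbitrary default.

```python
class WriteBuffer:
    """Collect rows in memory and hand them to flush() in batches."""

    def __init__(self, flush, max_rows=500):
        self._flush = flush        # caller-supplied: writes one batch to the DB
        self._max_rows = max_rows
        self._rows = []

    def add(self, row):
        """Queue one row, flushing automatically when the buffer is full."""
        self._rows.append(row)
        if len(self._rows) >= self._max_rows:
            self.flush()

    def flush(self):
        """Send any queued rows now; does nothing if the buffer is empty."""
        if self._rows:
            pending, self._rows = self._rows, []
            self._flush(pending)
```

Remember to call flush() on shutdown (and probably on a timer) so that a slow trickle of rows doesn't sit in memory indefinitely.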
Good presentation on getting the most out of the MySQL Connector/J JDBC driver:
http://assets.en.oreilly.com/1/event/21/Connector_J%20Performance%20Gems%20Presentation.pdf
What engine do you use? That can make a difference.
http://dev.mysql.com/doc/refman/5.5/en/concurrent-inserts.html