MySQL multiple insert performance

I have about 30,000 records that I need to insert into a MySQL table. I group the data into batches of 1,000 rows and build multi-row inserts like this:
INSERT INTO `table_name` VALUES (data1), (data2), ..., (data1000);
How can I optimize the performance of these inserts? Can I insert more than 1,000 records at a time? Each row is about 1 KB in size. Thanks.

Try wrapping your bulk insert inside a transaction.
START TRANSACTION;
INSERT INTO `table_name` VALUES (data1), (data2), ..., (data1000);
COMMIT;
That might improve performance. I'm not sure whether MySQL can partially commit a bulk insert, though; if it can't, this likely won't help much.
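As a rough sketch (the table and values are just the placeholders from the question), you can also turn autocommit off and commit several 1,000-row batches in one go, rather than one commit per batch:
SET autocommit = 0;
START TRANSACTION;
-- each INSERT carries up to ~1000 rows, as in the question
INSERT INTO `table_name` VALUES (data1), (data2), ..., (data1000);
INSERT INTO `table_name` VALUES (data1001), (data1002), ..., (data2000);
-- ... more batches ...
COMMIT;
SET autocommit = 1;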
Remember that even at 1.5 seconds, 30,000 records at ~1 KB each is about 30 MB, i.e. roughly a 20 MB/s commit rate, so depending on your hardware setup you could actually be drive-limited.
My advice then would be to investigate an SSD, change your RAID setup, or get faster mechanical drives (there are plenty of online articles on the pros and cons of running a SQL database on an SSD).

You need to check the MySQL server configuration, and specifically the buffer sizes and related settings.
You can remove indexes from the table, if there are any, to make the load faster, then create the indexes once the data is in.
Look here; you should find everything you need:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
As it says there, one INSERT statement with multiple VALUES lists is much faster than many separate INSERT statements.
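For example, a rough sketch of both ideas, assuming a hypothetical `events` table with a secondary index named `idx_created_at`:
-- drop the secondary index before the load...
ALTER TABLE events DROP INDEX idx_created_at;

-- ...load with multi-row INSERTs...
INSERT INTO events (id, created_at, payload) VALUES
    (1, '2024-01-01 00:00:00', 'row 1'),
    (2, '2024-01-01 00:00:01', 'row 2'),
    (3, '2024-01-01 00:00:02', 'row 3');

-- ...and recreate the index once the data is in
ALTER TABLE events ADD INDEX idx_created_at (created_at);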

Is this a one-off operation?
If so, just generate a single SQL statement per data element and execute them all on the server. 30,000 rows really shouldn't take very long, and it's the simplest way to get your data in.

Related

Fastest Way To Add Large Data Samples to an Aurora RDS

I have a quick question about migrating large sample data sets from my local device to an Amazon Aurora RDS instance (no DMS approach).
Basically I am working on a proof of concept and I need to populate an Amazon Aurora DB with 2 million rows of data. I have generated an SQL file with 2 million INSERT commands. Now I need to get this SQL file into the RDS instance. What is the best (by best I mean fastest) option to do this, can anyone suggest?
Something to consider if your data is (or was at some point) loaded into S3: you could skip a few steps and load directly from S3.
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.LoadFromS3.html
Obviously, this only applies if it makes sense for your data pipeline.
The answer depends on a few different things, like which database engine (PostgreSQL or MySQL) and the server settings. Here are some general things to consider. All of these work by running the mysql, psql, or whichever client program with the option for 'run the statements in this file'.
Don't have 2 million INSERT statements. Use multiple values in the VALUES clause for each one, e.g.
postgres=> create table t1 (x int, s varchar);
postgres=> insert into t1 values (1, 'one'), (2, 'two'), (3, 'three');
Since you have control over generating the text of the INSERT statements, you might bundle 1000 rows into each one.
Also, don't do 2 million COMMITs, as would happen if you did 2 million INSERT statements with 'autocommit' turned on. Start a transaction, do N inserts, then commit. Rinse and repeat. I'm not sure offhand what the ideal value of N is. Since you already reduced the number of INSERT statements in step 1, maybe each transaction only has a few of these gigantic inserts in it.
I don't think you want to do the whole thing in one single transaction, though, because of the possibility of overloading memory. The right balance between the number of VALUES per INSERT and the number of INSERTs per transaction is something I don't have a recommendation for offhand; it can also depend on how many columns are in each INSERT, how long the string values are, and so on.
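As a rough sketch of that batching, reusing the t1 table from above (the row counts per INSERT and per transaction here are arbitrary placeholders, not tuned values):
BEGIN;
INSERT INTO t1 VALUES (1, 'one'), (2, 'two'), (3, 'three');    -- ~1000 rows each in practice
INSERT INTO t1 VALUES (4, 'four'), (5, 'five'), (6, 'six');
-- ... a few more multi-row INSERTs ...
COMMIT;

BEGIN;
-- ... next batch of multi-row INSERTs ...
COMMIT;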
You can start up multiple sessions and do these transactions & inserts in parallel. No reason to wait until row 1000 is finished inserting before starting on row 50,000 or row 750,000. That means you'll split all these statements across multiple files. One of the strengths of Aurora is handling a lot of concurrent connections like this.
Lastly, another Aurora-specific technique. (Well, it would work for RDS databases too.) Modify the DB instance to a higher-capacity instance class, do the data loading, then modify it back to the original instance class. Certain operations like data loading and engine upgrades benefit from having lots of cores and lots of memory, which can give you huge time savings. It can be worth paying for a few minutes of an 8xlarge or whatever, even if after that your queries run fine on a much smaller instance class.
If you don't mind rewriting the data into CSV form or something other than actual INSERT statements, check out the mysqlimport command for MySQL, or the \copy command for PostgreSQL. (\copy takes the data off your local machine and so works for Aurora, whereas COPY assumes the data is on a file on the server, which you don't have ssh or ftp access to with Aurora.)
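On the MySQL side, mysqlimport is essentially a wrapper around LOAD DATA ... INFILE; if you go the CSV route, the underlying statement looks roughly like this (the path, table, and column list are placeholders, and the server and client must allow LOCAL):
LOAD DATA LOCAL INFILE '/path/to/data.csv'
INTO TABLE t1
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(x, s);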

MySQL bulk parametrized inserts performance

I am looking for someone more knowledgeable about the MySQL query optimizer / cache.
I have to perform bulk insert operations on datasets of changing size (so the number of rows will differ each time). I wanted to know: assuming these queries are parametrized, will the bulk insert query be cached / reused? Basically, is MySQL smart enough to see that these inserts are all the same and cache a single insert query (whether it's one INSERT per row or one multi-row INSERT), then reuse the optimized, cached query?
To give you an idea of the amount of data involved: it's around 500-600 batches of bulk inserts, and each bulk insert has an unknown number of rows (anything between 1 and 100,000). They are executed one after another.
So will that bulk insert be cached and reused? If not, would splitting the bulk insert into smaller batches (like a single INSERT query with X rows) change anything?
I'm not looking for another way of doing this for now, simply because it is not possible given how the application is written, so I am limited to parametrized vs non-parametrized queries and one bulk insert vs multiple batches. This is then called X times for each "object" I have, which, as I said, may even be 500 or 600 times (but I can't change that for now).
EDIT
I thought that this description may be vague, so here it is as a list:
Step into iteration of loop over objects
Identify data to be inserted for that object
Create a bulk insert query (non-parametrized currently, but I want to change that)
Execute bulk query
Go to another iteration
So what I really can change for now is points 3 and 4.
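For reference, a minimal sketch of what a parametrized bulk insert for steps 3-4 could look like using MySQL server-side prepared statements (the table, columns, and values are hypothetical); note that the statement text, and therefore the prepared statement, changes whenever the row count changes:
-- a 3-row batch; a batch with a different row count is a different statement
PREPARE bulk_ins FROM 'INSERT INTO my_table (a, b) VALUES (?, ?), (?, ?), (?, ?)';
SET @a1 = 1, @b1 = 'x', @a2 = 2, @b2 = 'y', @a3 = 3, @b3 = 'z';
EXECUTE bulk_ins USING @a1, @b1, @a2, @b2, @a3, @b3;
DEALLOCATE PREPARE bulk_ins;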

Uploading enormous amount of data to MySQL server

I have to upload about 16 million records to a MySQL 5.1 server on a shared webspace which does not permit LOAD DATA functionality. The table is an InnoDB table. I have not assigned any keys yet.
Therefore, I use a Python script to convert my CSV file (2.5 GB in size) into an SQL file with individual INSERT statements. I've launched the SQL file, and the process is incredibly slow; it feels like only 1,000-1,500 lines are processed every minute!
In the meantime, I read about bulk inserts, but did not find any reliable source telling how many records one insert statement can have. Do you know?
Is it an advantage to have no keys and add them later?
Would a transaction around all the inserts help speed up the process? In fact, there's just a single connection (mine) working with the database at this time.
If you use the INSERT ... VALUES ... syntax to insert multiple rows in a single statement, your query size is limited by the max_allowed_packet value rather than by the number of rows.
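So it's worth checking that limit first; raising it needs admin privileges (which a shared webspace may not grant) and new connections to pick up the change:
SHOW VARIABLES LIKE 'max_allowed_packet';

-- example only; requires SUPER privileges, value here is 64 MB
SET GLOBAL max_allowed_packet = 64 * 1024 * 1024;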
Concerning keys: it's a good practice to define keys before any data manipulations. Actually, when you build a model you must think of keys, relations, indexes etc.
It's better to define indexes before you insert the data as well. CREATE INDEX works quite slowly on huge datasets, but postponing index creation is not a huge disadvantage.
To make your inserts faster, turn autocommit mode off so that many inserts share one commit, and do not run concurrent requests on your tables.

Problematic performance with continuous UPDATE / INSERT in Mysql

Currently we have a database and a script which does 2 UPDATEs, 1 SELECT, and 1 INSERT.
The problem is that we have 20,000 people who run this script every hour, which causes MySQL to run at 100% CPU.
The INSERT is for logging: we want to log all the data to MySQL, but as the table scales up the application becomes slower and slower. We are running on InnoDB, but some people say it should be MyISAM. What should we use? We do sometimes pull data out of this log table for statistical purposes, but only 40-50 times a day.
Our solution is to use Gearman [http://gearman.org/] to delay the inserts to the database. But what about the updates?
We need to update 2 tables: one to update the customer's balance (balance = balance - 1), and the other to update a count in another table.
How should we make this faster and more CPU-efficient?
Thank you
but as the table scales up the application becomes slower and slower
This usually means that you're missing an index somewhere.
MyISAM is not good: in addition to being non ACID compliant, it'll lock the whole table to do an insert -- which kills concurrency.
Read the MySQL documentation carefully:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
Especially "innodb_flush_log_at_trx_commit" -
http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html
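For example, a quick sketch of relaxing the per-commit flush (note this trades up to about a second of durability after a crash for write throughput, so weigh that carefully for the balance updates):
-- 1 = flush the log to disk at every commit (default, safest)
-- 2 = write at commit, flush to disk roughly once per second
SET GLOBAL innodb_flush_log_at_trx_commit = 2;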
I would stay away from MyISAM as it has concurrency issues when mixing SELECT and INSERT statements. If you can keep your insert tables small enough to stay in memory, they'll go much faster. Batching your updates in a transaction will help them go faster as well. Setting up a test environment and tuning for your actual job is important.
You may also want to look into partitioning to rotate your logs: you'd drop the old partition and create a new one for the current data. This is much faster than deleting the old rows.
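A rough sketch of what such a range-partitioned log table could look like (the table, columns, and date ranges are made up; note that the partitioning column must be part of the primary key):
CREATE TABLE access_log (
    id        BIGINT NOT NULL AUTO_INCREMENT,
    logged_at DATETIME NOT NULL,
    message   VARCHAR(255),
    PRIMARY KEY (id, logged_at)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(logged_at)) (
    PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- dropping an old partition is near-instant, unlike DELETEing millions of rows
ALTER TABLE access_log DROP PARTITION p202401;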

Fork MySQL INSERT INTO (InnoDB)

I'm trying to insert about 500 million rows of garbage data into a database for testing. Right now I have a PHP script looping through a few SELECT/INSERT statements each inside a TRANSACTION -- clearly this isn't the best solution. The tables are InnoDB (row-level locking).
I'm wondering if I (properly) fork the process, will this speed up the INSERT process? At the rate it's going, it will take 140 hours to complete. I'm concerned about two things:
If INSERT statements must acquire a write lock, then will it render forking useless, since multiple processes can't write to the same table at the same time?
I'm using SELECT...LAST_INSERT_ID() (inside a TRANSACTION). Will this logic break when multiple processes are INSERTing into the database? I could create a new database connection for each fork, so I hope this would avoid the problem.
How many processes should I be using? The queries themselves are simple, and I have a regular dual-core dev box with 2GB RAM. I set up my InnoDB to use 8 threads (innodb_thread_concurrency=8), but I'm not sure if I should be using 8 processes or if this is even a correct way to think about matching.
Thanks for your help!
The MySQL documentation has a discussion on efficient insertion of a large number of records. It seems that the clear winner is the LOAD DATA INFILE command, followed by INSERT statements that use multiple VALUES lists.
1) Yes, there will be lock contention, but InnoDB is designed to handle multiple threads trying to insert. Sure, they won't insert simultaneously, but it will handle serializing the inserts for you. Just make sure you explicitly close your transactions, and do it ASAP; this will ensure you get the best possible insert performance.
2) No, this logic will not break provided you have one connection per thread, since LAST_INSERT_ID() is connection-specific (see the sketch after this answer).
3) This is one of those things that you just need to benchmark to figure out. Actually, I would make the program self-adjust: run 100 inserts with 8 threads and record the execution times, then try again with half as many and twice as many threads. Whichever is fastest, benchmark more thread-count values around that number.
In general, you should always just go ahead and benchmark this kind of stuff to see which is faster. In the amount of time it takes you to think about it and write it up, you could probably already have preliminary numbers.
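To illustrate point 2, a minimal sketch (the parent/child tables are hypothetical) of the pattern that stays safe as long as each forked process uses its own connection:
START TRANSACTION;
INSERT INTO parent (name) VALUES ('garbage row');
SET @parent_id = LAST_INSERT_ID();   -- scoped to this connection only
INSERT INTO child (parent_id, payload) VALUES (@parent_id, 'more garbage');
COMMIT;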