I am consuming a high-rate data stream and doing the following steps to store the data in a MySQL database. For each newly arriving item:
(1) Parse incoming item.
(2) Execute several "INSERT ... ON DUPLICATE KEY UPDATE" statements.
I have used INSERT ... ON DUPLICATE KEY UPDATE to eliminate one additional round-trip to the database.
While trying to improve the overall performance, I have considered doing bulk updates in the following way:
(1) Parse incoming item.
(2) Generate SQL statement with "INSERT ... ON DUPLICATE KEY UPDATE" and append to a file.
(3) Periodically flush the SQL statements in the file to the database.
Two questions:
(1) Will this have a positive impact on the database load?
(2) How should I flush the statements to the database so that indices are only rebuilt after the complete flush? (Using transactions?)
UPDATE: I am using Perl DBI + MySQL MyISAM.
Thanks in advance for any comments.
If your data does not need to go into the database immediately you can cache your insert data somewhere, then issue one larger insert statement, e.g.
insert into table_name (x, y, z) values (x1, y1, z1), (x2, y2, z2), ... (xN, yN, zN) on duplicate key update ...;
To be clear, I would maintain a list of pending inserts, in this case a list of (x, y, z) triplets. Then once your list exceeds some threshold (N) you generate the insert statement and issue it.
I have no accurate timing figures for you, but this increased performance roughly 10 times when compared to inserting each row individually.
I also haven't played with the value of N, but I found 1000 to work nicely. I expect the optimal value is affected by hardware and database settings.
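For illustration, here is a minimal sketch of that pending-list idea in Python (the original question mentions Perl DBI; MySQL Connector/Python, the table name, and the updated columns below are assumptions, not from the question):

import mysql.connector  # assumed driver; any client with parameter binding works the same way

conn = mysql.connector.connect(user="app", password="secret", database="streamdb")
cur = conn.cursor()

BATCH_SIZE = 1000        # the threshold N discussed above; tune for your hardware
pending = []             # list of (x, y, z) triplets waiting to be flushed

def flush():
    if not pending:
        return
    # Build one multi-row statement: INSERT ... VALUES (...), (...) ON DUPLICATE KEY UPDATE
    placeholders = ", ".join(["(%s, %s, %s)"] * len(pending))
    sql = ("INSERT INTO table_name (x, y, z) VALUES " + placeholders +
           " ON DUPLICATE KEY UPDATE y = VALUES(y), z = VALUES(z)")
    cur.execute(sql, [v for row in pending for v in row])
    conn.commit()
    pending.clear()

def add_row(x, y, z):
    pending.append((x, y, z))
    if len(pending) >= BATCH_SIZE:
        flush()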
Hope this helps (I am also using MyISAM).
You don't say what kind of database access environment (Perl DBI? JDBC? ODBC?) you're running in, or what kind of table storage engine (MyISAM? InnoDB?) you're using.
First of all, you're right to pick INSERT ... ON DUPLICATE KEY UPDATE. Good move, unless you can guarantee unique keys.
Secondly, if your database access environment allows it, you should use prepared statements. You definitely won't get good performance if you write a bunch of statements into a file, and then make a database client read the file once again. Do the INSERT operations directly from the software package that consumes the incoming data stream.
Thirdly, pick the right kind of table storage engine. MyISAM inserts are going to be faster than InnoDB, so if you're logging data and retrieving it later that will be a win. But InnoDB has better transactional integrity. If you're really handling tonnage of data, and you don't need to read it very often, consider the ARCHIVE storage engine.
Finally, consider doing a START TRANSACTION at the beginning of a batch of INSERT ... commands, then doing a COMMIT and another START TRANSACTION after a fixed number of rows, like 100 or so. If you're using InnoDB, this will speed things up a lot. If you're using MyISAM or ARCHIVE, it won't matter.
Your big wins will come from the prepared statement stuff and the best choice of storage engine.
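As a rough illustration of the prepared-statement and batched-COMMIT advice, here is a hedged sketch in Python with MySQL Connector/Python rather than Perl DBI (connection details, table, and column names are invented, and incoming_items() is a hypothetical stand-in for the data stream):

import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="streamdb")
conn.autocommit = False            # we control the transaction boundaries ourselves
cur = conn.cursor(prepared=True)   # the statement is parsed once and re-executed

sql = ("INSERT INTO items (k, x, y) VALUES (%s, %s, %s) "
       "ON DUPLICATE KEY UPDATE x = VALUES(x), y = VALUES(y)")

BATCH = 100                        # COMMIT every ~100 rows, as suggested above
for i, item in enumerate(incoming_items(), start=1):   # incoming_items() is hypothetical
    cur.execute(sql, item)
    if i % BATCH == 0:
        conn.commit()              # matters for InnoDB; harmless for MyISAM or ARCHIVE
conn.commit()                      # flush the final partial batch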
Related
I have a quick question regarding migrating large data sample sets from my local device to an Amazon Aurora RDS (no DMS approach).
So basically I am working on a Proof of Concept and I need to populate an Amazon Aurora DB with 2 million rows of data. I have generated an SQL file with 2 million INSERT commands. Now I need to get this SQL file into the RDS instance. What is the best (by best I mean fastest) option to do this? Can anyone suggest one?
Something to consider if your data was loaded into S3 at some point: you could skip a few steps and load directly from S3.
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.LoadFromS3.html
Obviously, this only applies if it makes sense for your data pipeline.
The answer depends on a few different things, like which database engine (PostgreSQL or MySQL) and the server settings. Here are some general things to consider. All of these work by running the mysql, psql, or whichever client program with the option for 'run the statements in this file'.
Don't have 2 million INSERT statements. Use multiple values in the VALUES clause for each one, e.g.
postgres=> create table t1 (x int, s varchar);
postgres=> insert into t1 values (1, 'one'), (2, 'two'), (3, 'three');
Since you have control over generating the text of the INSERT statements, you might bundle 1000 rows into each one.
Also, don't do 2 million COMMITs, as would happen if you did 2 million INSERT statements with 'autocommit' turned on. Start a transaction, do N inserts, then commit. Rinse and repeat. I'm not sure offhand what the ideal value of N is. Since you already reduced the number of INSERT statements in step 1, maybe each transaction only has a few of these gigantic inserts in it.
I think you don't want to do the whole thing in one single transaction, though, just because of the possibility of overloading memory. The right balance of the number of VALUES per INSERT and the number of INSERTs per transaction is something I don't have a recommendation for at hand. That could also depend on how many columns are in each INSERT, how long the string values are, etc.
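As one possible way to put those two points together when generating the file, here is a hedged Python sketch (the file name, table layout, and load_rows() helper are invented for illustration; the emitted SQL uses plain BEGIN/COMMIT so it works for both engines):

ROWS_PER_INSERT = 1000      # bundle 1000 rows into each INSERT, as suggested above
INSERTS_PER_TXN = 10        # then commit every 10 of those INSERTs

rows = load_rows()          # hypothetical: returns a list of (x, s) tuples

with open("bulk_load.sql", "w") as f:
    f.write("BEGIN;\n")
    for i in range(0, len(rows), ROWS_PER_INSERT):
        chunk = rows[i:i + ROWS_PER_INSERT]
        # naive quoting for the sketch only; a real loader should escape values properly
        values = ", ".join("(%d, '%s')" % (x, s.replace("'", "''")) for x, s in chunk)
        f.write("INSERT INTO t1 (x, s) VALUES %s;\n" % values)
        if (i // ROWS_PER_INSERT + 1) % INSERTS_PER_TXN == 0:
            f.write("COMMIT;\nBEGIN;\n")
    f.write("COMMIT;\n")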
You can start up multiple sessions and do these transactions & inserts in parallel. No reason to wait until row 1000 is finished inserting before starting on row 50,000 or row 750,000. That means you'll split all these statements across multiple files. One of the strengths of Aurora is handling a lot of concurrent connections like this.
Lastly, another Aurora-specific technique. (Well, it would work for RDS databases too.) Modify the DB instance to a higher-capacity instance class, do the data loading, then modify it back to the original instance class. Certain operations like data loading and engine upgrades benefit from having lots of cores and lots of memory - that can give you huge time savings. Which can be worth it to pay for a few minutes of 8xlarge or whatever, even if after that your queries run fine with a much smaller instance class.
If you don't mind rewriting the data into CSV form or something other than actual INSERT statements, check out the mysqlimport command for MySQL, or the \copy command for PostgreSQL. (\copy takes the data off your local machine and so works for Aurora, whereas COPY assumes the data is on a file on the server, which you don't have ssh or ftp access to with Aurora.)
I have to upload about 16 million records to a MySQL 5.1 server on a shared webspace which does not permit LOAD DATA functionality. The table is an Innodb table. I have not assigned any keys yet.
Therefore, I use a Python script to convert my CSV file (2.5 GB in size) to an SQL file with individual INSERT statements. I've launched the SQL file, and the process is incredibly slow; it looks like only 1000-1500 lines are processed every minute!
In the meantime, I read about bulk inserts, but did not find any reliable source telling how many records one insert statement can have. Do you know?
Is it an advantage to have no keys and add them later?
Would a transaction around all the inserts help speed up the process? In fact, there's just a single connection (mine) working with the database at this time.
If you use the INSERT ... VALUES ... syntax to insert multiple rows in a single request, your query size is limited by the max_allowed_packet value rather than by the number of rows.
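Since you are already generating the SQL from a Python script, here is a hedged sketch of keeping each multi-row INSERT below max_allowed_packet (the 16 MB figure, table, and column names are assumptions; check SHOW VARIABLES LIKE 'max_allowed_packet' on your server):

MAX_PACKET = 16 * 1024 * 1024   # assumed server value of max_allowed_packet
HEADROOM = 1024 * 1024          # leave some slack for protocol overhead

def write_inserts(out, value_tuples):
    # value_tuples: iterable of already-escaped strings like "(1, 'abc')" (hypothetical format)
    prefix = "INSERT INTO mytable (a, b) VALUES "
    batch, size = [], len(prefix)
    for tup in value_tuples:
        if batch and size + len(tup) + 2 > MAX_PACKET - HEADROOM:
            out.write(prefix + ", ".join(batch) + ";\n")   # flush one statement
            batch, size = [], len(prefix)
        batch.append(tup)
        size += len(tup) + 2    # account for the ", " separator
    if batch:
        out.write(prefix + ", ".join(batch) + ";\n")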
Concerning keys: it's a good practice to define keys before any data manipulations. Actually, when you build a model you must think of keys, relations, indexes etc.
It's better to define indexes before you insert data as well. CREATE INDEX works quite slowly on huge datasets, but postponing index creation is not a huge disadvantage.
To make your inserts faster try to turn autocommit mode on and do not run concurrent requests on your tables.
I am writing a bot that collects web pages, but some pages are large, around 1-2 MB (for example, pantip.com). How can I speed this up?
It takes 4-9 seconds to update a row when the page is over 1 MB.
Thanks in advance.
You definitely need to increase max_allowed_packet and restart MySQL. Something like:
[mysqld]
max_allowed_packet=256M
One of the silent killers of MySQL is the MySQL Packet which is governed by max_allowed_packet.
Understanding what the MySQL Packet is may clarify this.
According to page 99 of "Understanding MySQL Internals" (ISBN 0-596-00957-7), here are paragraphs 1-3 explaining it:
MySQL network communication code was written under the assumption that queries are always reasonably short, and therefore can be sent to and processed by the server in one chunk, which is called a packet in MySQL terminology. The server allocates the memory for a temporary buffer to store the packet, and it requests enough to fit it entirely. This architecture requires a precaution to avoid having the server run out of memory---a cap on the size of the packet, which this option accomplishes.
The code of interest in relation to this option is found in sql/net_serv.cc. Take a look at my_net_read(), then follow the call to my_real_read() and pay particular attention to net_realloc().
This variable also limits the length of a result of many string functions. See sql/field.cc and sql/item_strfunc.cc for details.
Given this explanation, making bulk INSERTs will load/unload a MySQL Packet rather quickly. This is especially true when max_allowed_packet is too small for the given load of data coming at it.
CONCLUSION
In most installs of MySQL, I usually set this to 256M or 512M. You should experiment with larger values with data loads involving BLOB and TEXT fields.
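For completeness, a small hedged sketch of checking the server's current value from a client before sending a big statement (MySQL Connector/Python assumed; note that bound parameters also add to the final packet size):

import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="test")
cur = conn.cursor()
cur.execute("SHOW VARIABLES LIKE 'max_allowed_packet'")
_, value = cur.fetchone()
max_packet = int(value)

def safe_execute(sql):
    # refuse to send a statement the server would reject anyway
    if len(sql.encode("utf-8")) >= max_packet:
        raise ValueError("statement larger than max_allowed_packet; split the batch")
    cur.execute(sql)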
Use LOAD DATA instead of INSERT for bulk inserts.
Are you using individual statements for each record? You might want to look at LOAD DATA INFILE for a bulk update.
Tips for fast insertion:
Use the LOAD DATA INFILE syntax to let MySQL parse it and insert it, even if you have to mangle it and feed it after the manipulation (see the sketch after these tips).
Use this insert syntax:
insert into table (col1, col2) values (val1, val2), (val3, val4), ...
Remove all keys/indexes prior to insertion.
Do it on the fastest machine you've got (IO-wise mainly, but RAM and CPU also matter). That goes for both the DB server and the inserting client; remember you'll be paying the IO price twice (once reading, once inserting).
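A minimal sketch of the LOAD DATA tip from the list above, driven from Python with MySQL Connector/Python (file, table, and column names are made up; both the server and the connection must allow local infile):

import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="test",
                               allow_local_infile=True)   # client-side opt-in
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/rows.csv'
    INTO TABLE mytable
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    (col1, col2)
""")
conn.commit()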
MySQL Inserting large data sets from file with Java
We can help you better if you show us the queries, the table information (SHOW CREATE TABLE), information about the server, the MySQL settings and maybe some example data.
But in general:
Try using transactions: START TRANSACTION, lots of inserts, then COMMIT will be faster than lots of standalone inserts;
You can do INSERTs into a cache, into temporary tables, or into files, and only later load them into your permanent table (see the sketch after these points).
You should take care to setup your server correctly, using the correct size for your buffers, and the correct table storage engine.
You can partition a table over multiple disks, to speed up inserts.
You can disable triggers and indexes during the inserts.
If you insert multiple rows at once, do one INSERT with multiple VALUES.
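As an illustration of the temporary-table point above, a hedged sketch with MySQL Connector/Python (table and column names are invented; executemany batches the rows client-side):

import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="test")
cur = conn.cursor()

# Stage rows in a temporary table with the same structure as the permanent one
cur.execute("CREATE TEMPORARY TABLE staging LIKE permanent_table")

rows = [(1, "a"), (2, "b"), (3, "c")]     # placeholder data
cur.executemany("INSERT INTO staging (id, val) VALUES (%s, %s)", rows)

# One bulk copy into the permanent table, committed once
cur.execute("INSERT INTO permanent_table SELECT * FROM staging")
conn.commit()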
I have a write-intensive application running on EC2. Any thoughts on how to optimize it to be able to make several thousand concurrent writes to the MySQL DB?
Write scaling is a hard problem. Perhaps the secret to write scaling is in read scaling. That is, cache reads as much as possible, so that the writes get all the throughput.
Having said that, there are a bunch of things one can do:
1) Start with the data model. Design a data model so that you never delete or update a table; the only operation is an insert. Use Effective Date, Effective Sequence and Effective Status to implement the Insert, Update and Delete operations using just the insert command. This concept is called the append-only model. Check out RethinkDB.
2) Set the Concurrent Insert flag to 1. This makes sure that the tables keep inserting while reads are in progress.
3) When you have only Inserts at the tail, you may not need row-level locks. So, use MyISAM (this is not to take anything away from InnoDB, which I will come to later).
4) If all this does not do much, create a replica table using the MEMORY engine. If you have a table called MY_DATA, create a table called MY_DATA_MEM with the MEMORY engine.
5) Redirect all Inserts to the MEM table. Create a View that UNIONS both tables and use that view as your Read Source.
6) Write a daemon that periodically moves MEM contents to the Main table and deletes from the Mem table. It may be ideal to implement the MOVE operation as a Delete trigger on the Mem table (I am hoping triggers are possible on Memory Engine, not entirely sure).
7) Do not do any deletes or updates on the MEM table (they are slow). Also pay attention to the cardinality of the keys in your table (HASH vs. B-Tree: low cardinality -> HASH, high cardinality -> B-Tree).
8) Even if all the above does not work, ditch jdbc/odbc. Move to InnoDB and use Handler Socket interface to do the direct inserts (Google for Yoshinori-San MySQL)
I have not used HS myself, but the benchmarks are impressive. There is even a Java HS project on Google Code.
Hope that helps.
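A rough SQL sketch of steps 4 through 6, run here through MySQL Connector/Python (the MY_DATA names follow the answer; everything else is assumed, and note the MEMORY engine cannot hold BLOB/TEXT columns):

import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="test")
cur = conn.cursor()

cur.execute("CREATE TABLE MY_DATA_MEM LIKE MY_DATA")
cur.execute("ALTER TABLE MY_DATA_MEM ENGINE = MEMORY")
cur.execute("""CREATE OR REPLACE VIEW MY_DATA_ALL AS
               SELECT * FROM MY_DATA UNION ALL SELECT * FROM MY_DATA_MEM""")

def move_mem_to_main():
    # The periodic "daemon" step: copy staged rows, then clear the staging table.
    # Neither MyISAM nor MEMORY is transactional, so rows arriving between the two
    # statements could be lost; that is why the answer suggests a trigger instead.
    cur.execute("INSERT INTO MY_DATA SELECT * FROM MY_DATA_MEM")
    cur.execute("DELETE FROM MY_DATA_MEM")
    conn.commit()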
I don't have a testing environment for this yet. But before I think too much about solutions I'd like to know if people think this would be a problem.
I will have 10-20 Java processes connected to a MySQL DB via JDBC. Each will be inserting unique records, all going to the same table. The rate of inserts will be on the order of thousands per second.
My expectation is that some process will attempt to insert and encounter a table lock while another process is inserting, and that this will result in a JDBC exception and that insert failing.
Clearly if you increase the insert rate sufficiently there eventually will be a point where some buffer somewhere fills up faster than it can be emptied. When such a buffer hits its maximum capacity and can't contain any more data some of your insert statements will have to fail. This will result in an exception being thrown at the client.
However, assuming you have high-end hardware I don't imagine this should happen with 1000 inserts per second, but it does depend on the specific hardware, how much data there is per insert, how many indexes you have, what other queries are running on the system simultaneously, etc.
Regardless of whether you are doing 10 inserts per second or 1000 you still shouldn't blindly assume that every insert will succeed - there's always a chance that an insert will fail because of some network communication error or some other problem. Your application should be able to correctly handle the situation where an insert fails.
Use InnoDB as it supports reads and writes at the same time. MyISAM will usually lock the table during the insert, but give preference to SELECT statements. This can cause issues if you're doing reporting or visualization of the data while trying to do inserts.
If you have a natural primary key (no auto_increment), using it will help avoid some deadlock issues. (Newer versions have fixed this.)
http://www.mysqlperformanceblog.com/2007/09/26/innodb-auto-inc-scalability-fixed/
You might also want to see if you can queue your writes in memory and send them to the database in batches. Lots of little inserts will be much slower than doing batches in transactions.
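The question is JDBC-based, but the queue-and-batch idea is language-agnostic; here is a minimal Python sketch (MySQL Connector/Python assumed, table and column names invented) of a single writer thread draining an in-memory queue:

import queue
import threading
import mysql.connector

writes = queue.Queue()        # producer threads call writes.put((a, b))
BATCH = 500

def writer():
    conn = mysql.connector.connect(user="app", password="secret", database="test")
    cur = conn.cursor()
    while True:
        batch = [writes.get()]                          # block until at least one row arrives
        while len(batch) < BATCH and not writes.empty():
            batch.append(writes.get_nowait())
        cur.executemany("INSERT INTO events (a, b) VALUES (%s, %s)", batch)
        conn.commit()                                   # one transaction per batch

threading.Thread(target=writer, daemon=True).start()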
Good presentation on getting the most out of the MySQL Connector/J JDBC driver:
http://assets.en.oreilly.com/1/event/21/Connector_J%20Performance%20Gems%20Presentation.pdf
What engine do you use? That can make a difference.
http://dev.mysql.com/doc/refman/5.5/en/concurrent-inserts.html