I have to upload about 16 million records to a MySQL 5.1 server on a shared webspace which does not permit LOAD DATA functionality. The table is an InnoDB table. I have not assigned any keys yet.
Therefore, I use a Python script to convert my CSV file (2.5 GB in size) to an SQL file with individual INSERT statements. I've launched the SQL file, and the process is incredibly slow: it feels like only 1000-1500 lines are processed every minute!
In the meantime, I read about bulk inserts, but did not find any reliable source telling how many records one insert statement can have. Do you know?
Is it an advantage to have no keys and add them later?
Would a transaction around all the insert help speed up the process? In fact, there's just a single connection (mine) working with the database at this time.
If you use the INSERT ... VALUES ... syntax to insert multiple rows in a single statement, the query size is limited by the max_allowed_packet value rather than by the number of rows.
Concerning keys: it's a good practice to define keys before any data manipulations. Actually, when you build a model you must think of keys, relations, indexes etc.
It's better to define indexes before you insert data as well, since CREATE INDEX works quite slowly on huge datasets. That said, postponing index creation is not a huge disadvantage.
To make your inserts faster, try turning autocommit mode off (so that you commit once per batch rather than once per row), and do not run concurrent requests against your tables.
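Since the asker already generates the SQL file from Python, here is a minimal sketch of how the conversion could emit multi-row INSERT statements that stay under a byte budget chosen below max_allowed_packet. The function names, the 1,000,000-byte default and the hand-written quoting are assumptions for illustration; the quoting only handles backslashes and single quotes under MySQL's default SQL mode, so a real script should prefer the escaping routine of its database driver.

```python
import csv
import io

def sql_literal(value):
    """Quote one CSV field as a string literal (simplified, illustrative escaping)."""
    escaped = value.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"

def multi_row_inserts(csv_file, table, max_bytes=1_000_000):
    """Yield INSERT statements, each at most max_bytes long when encoded as UTF-8."""
    prefix = f"INSERT INTO `{table}` VALUES "
    tuples, size = [], len(prefix)
    for row in csv.reader(csv_file):
        tup = "(" + ", ".join(sql_literal(v) for v in row) + ")"
        # +2 covers the ", " separator (or the closing ";"), so size is an upper bound.
        tup_size = len(tup.encode("utf-8")) + 2
        if tuples and size + tup_size > max_bytes:
            yield prefix + ", ".join(tuples) + ";"
            tuples, size = [], len(prefix)
        tuples.append(tup)
        size += tup_size
    if tuples:
        yield prefix + ", ".join(tuples) + ";"

# Tiny demonstration with an in-memory CSV and a deliberately small budget.
sample = io.StringIO("1,one\n2,O'Neil\n3,three\n")
statements = list(multi_row_inserts(sample, "t1", max_bytes=60))
for stmt in statements:
    print(stmt)
```

With a realistic budget, each statement carries as many rows as fit, so the number of rows per INSERT adapts to the row size instead of being a fixed count.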
I have a quick question regarding migrating large data sample sets from my local device to an Amazon Aurora RDS (no DMS approach).
So basically I am working on a Proof of Concept and I need to populate an Amazon Aurora DB with 2 Million rows of data. I have generated an SQL file with 2 Million INSERT commands. Now I need to get this sql file to the RDS. What is the best (by best I mean fastest) option to do this, can anyone suggest?
Something to consider: if your data was loaded into S3 at some point, you could skip a few steps and load directly from S3.
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.LoadFromS3.html
Obviously, this only applies if it makes sense for your data pipeline.
The answer depends on a few different things, like which database engine (PostgreSQL or MySQL) and the server settings. Here are some general things to consider. All of these work by running the mysql, psql, or whichever client program with the option for 'run the statements in this file'.
Don't have 2 million INSERT statements. Use multiple values in the VALUES clause for each one, e.g.
postgres=> create table t1 (x int, s varchar);
postgres=> insert into t1 values (1, 'one'), (2, 'two'), (3, 'three');
Since you have control over generating the text of the INSERT statements, you might bundle 1000 rows into each one.
Also, don't do 2 million COMMITs, as would happen if you did 2 million INSERT statements with 'autocommit' turned on. Start a transaction, do N inserts, then commit. Rinse and repeat. I'm not sure offhand what the ideal value of N is. Since you already reduced the number of INSERT statements in step 1, maybe each transaction only has a few of these gigantic inserts in it.
I think you don't want to do the whole thing in one single transaction though, just because of the possibility of overloading memory. I don't have a recommendation at hand for the right balance between the number of VALUES per INSERT and the number of INSERTs per transaction. That could also depend on how many columns are in each INSERT, how long the string values are, etc.
You can start up multiple sessions and do these transactions & inserts in parallel. No reason to wait until row 1000 is finished inserting before starting on row 50,000 or row 750,000. That means you'll split all these statements across multiple files. One of the strengths of Aurora is handling a lot of concurrent connections like this.
Lastly, another Aurora-specific technique. (Well, it would work for RDS databases too.) Modify the DB instance to a higher-capacity instance class, do the data loading, then modify it back to the original instance class. Certain operations like data loading and engine upgrades benefit from having lots of cores and lots of memory, and that can give you huge time savings. It can be worth paying for a few minutes of 8xlarge or whatever, even if after that your queries run fine with a much smaller instance class.
If you don't mind rewriting the data into CSV form or something other than actual INSERT statements, check out the mysqlimport command for MySQL, or the \copy command for PostgreSQL. (\copy takes the data off your local machine and so works for Aurora, whereas COPY assumes the data is on a file on the server, which you don't have ssh or ftp access to with Aurora.)
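The transaction and parallel-session suggestions above can be sketched as two small helpers that post-process already-generated INSERT statements. This is only an illustration: the helper names and the batch sizes are made up here, and the right values depend on the engine and the data.

```python
def wrap_in_transactions(statements, per_transaction):
    """Yield SQL lines with every `per_transaction` statements wrapped in one transaction."""
    batch = []
    for stmt in statements:
        batch.append(stmt)
        if len(batch) == per_transaction:
            yield from ["START TRANSACTION;", *batch, "COMMIT;"]
            batch = []
    if batch:  # final, possibly shorter, transaction
        yield from ["START TRANSACTION;", *batch, "COMMIT;"]

def split_round_robin(statements, parts):
    """Distribute statements over `parts` lists, one list per parallel session."""
    shards = [[] for _ in range(parts)]
    for i, stmt in enumerate(statements):
        shards[i % parts].append(stmt)
    return shards

# Five placeholder statements, two sessions, two INSERTs per transaction.
inserts = [f"INSERT INTO t1 VALUES ({i});" for i in range(5)]
for n, shard in enumerate(split_round_robin(inserts, 2)):
    print(f"-- contents of file {n}")
    for line in wrap_in_transactions(shard, 2):
        print(line)
```

Each shard would be written to its own file and fed to a separate client session, so every session commits its own batches independently.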
I have data containing about 30 000 records. And I need to insert this data into MySQL table. I group this data in packages by 1000 and create multiple inserts like this:
INSERT INTO `table_name` VALUES (data1), (data2), ..., (data1000);
How can I optimize the performance of this insert? Can I insert more than 1000 records at a time? Each row contains about 1 KB of data. Thanks.
Try wrapping your bulk insert inside a transaction.
START TRANSACTION;
INSERT INTO `table_name` VALUES (data1), (data2), ..., (data1000);
COMMIT;
That might improve performance. I'm not sure whether MySQL can partially commit a bulk insert, though; if it can't, this likely won't help much.
Remember that even at 1.5 seconds for 30,000 records of ~1 KB each, you're committing at about 20 MB/s, so you could actually be drive-limited depending on your hardware setup.
The advice then would be to investigate an SSD, change your RAID setup, or get faster mechanical drives (there are plenty of online articles on the pros and cons of running a SQL database on an SSD).
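For reference, the 20 MB/s figure follows from simple arithmetic. The sketch below takes "about 1 KB" as exactly 1000 bytes and uses the 1.5 second figure quoted above; both are rounded inputs, so the result is an estimate.

```python
records = 30_000
bytes_per_record = 1_000   # "about 1 KB", taken here as 1000 bytes
elapsed_seconds = 1.5      # elapsed time quoted in the answer

# Total payload divided by elapsed time, expressed in megabytes per second.
throughput_mb_s = records * bytes_per_record / elapsed_seconds / 1_000_000
print(throughput_mb_s)
```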
You need to check the MySQL server configuration, specifically the buffer sizes.
You can remove indexes from the table, if any, to make it faster. Create the indexes once the data is in.
Look here, you should get all you need.
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
One insert statement with multiple values, it says, is much faster than multiple insert statements.
Is this a one-off operation?
If so, just generate a single SQL statement per data element and execute them all on the server. 30,000 really shouldn't take very long, and you will have the simplest means of inputting your data.
I'm writing a bot that collects web pages, but some pages are large, around 1-2 MB (for example pantip.com). How can I speed this up?
It takes 4-9 seconds to update a row when the page is over 1 MB.
Thanks in advance.
You definitely need to increase max_allowed_packet and restart MySQL. Something like
[mysqld]
max_allowed_packet=256M
One of the silent killers of MySQL is the MySQL Packet which is governed by max_allowed_packet.
Understanding what the MySQL Packet is may clarify this.
According to page 99 of "Understanding MySQL Internals" (ISBN 0-596-00957-7), here are paragraphs 1-3 explaining it:
MySQL network communication code was written under the assumption that queries are always reasonably short, and therefore can be sent to and processed by the server in one chunk, which is called a packet in MySQL terminology. The server allocates the memory for a temporary buffer to store the packet, and it requests enough to fit it entirely. This architecture requires a precaution to avoid having the server run out of memory---a cap on the size of the packet, which this option accomplishes.
The code of interest in relation to this option is found in sql/net_serv.cc. Take a look at my_net_read(), then follow the call to my_real_read() and pay particular attention to net_realloc().
This variable also limits the length of a result of many string functions. See sql/field.cc and sql/item_strfunc.cc for details.
Given this explanation, making bulk INSERTs will load/unload a MySQL Packet rather quickly. This is especially true when max_allowed_packet is too small for the given load of data coming at it.
CONCLUSION
In most installs of MySQL, I usually set this to 256M or 512M. You should experiment with larger values with data loads involving BLOB and TEXT fields.
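As a rough way to judge whether a page of a given size can fit in one packet, the sketch below compares a worst-case statement size against max_allowed_packet. The helper names and the 1024-byte allowance for the rest of the statement are assumptions; the factor of two reflects that backslash escaping of a string literal can at most double the payload. The 1 MB limit in the example is only a hypothetical small setting.

```python
def worst_case_statement_bytes(payload_bytes, overhead_bytes=1024):
    """Upper bound on statement size: every payload byte escaped, plus SQL overhead."""
    return 2 * payload_bytes + overhead_bytes

def fits_in_packet(payload_bytes, max_allowed_packet):
    """True if even the worst-case statement stays within max_allowed_packet."""
    return worst_case_statement_bytes(payload_bytes) <= max_allowed_packet

MB = 1024 * 1024
page = 2 * MB  # a 2 MB web page, as in the question

print(fits_in_packet(page, 1 * MB))    # hypothetical small limit
print(fits_in_packet(page, 256 * MB))  # the value suggested in the answer
```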
Use LOAD DATA instead of INSERT for bulk inserts.
Are you using individual statements for each record? You might want to look at LOAD DATA INFILE for a bulk load.
Tips for fast insertion:
Use the LOAD DATA INFILE syntax to let MySQL parse it and insert it, even if you have to mangle it and feed it after the manipulation.
Use this insert syntax:
insert into table (col1, col2) values (val1, val2), (val3, val4), ...
Remove all keys/indexes prior to insertion.
Do it on the fastest machine you've got (IO-wise mainly, but RAM and CPU also matter). This applies to both the DB server and the inserting client; remember you'll be paying the IO price twice (once for reading, once for inserting).
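For the LOAD DATA INFILE tip, the "mangle it and feed it" step could look like the sketch below, which writes rows in the format LOAD DATA expects by default: tab-separated fields, newline-terminated lines, backslash as the escape character and \N for NULL. The function names are hypothetical, and the sketch assumes the statement is run without custom FIELDS or LINES clauses.

```python
import io

def escape_field(value):
    """Render one value for a default-format LOAD DATA file."""
    if value is None:
        return "\\N"  # LOAD DATA reads \N as NULL
    text = str(value)
    # Escape the escape character first, then the field and line terminators.
    return text.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")

def write_load_data_file(rows, out):
    """Write rows as tab-separated lines to the file-like object `out`; return the row count."""
    count = 0
    for row in rows:
        out.write("\t".join(escape_field(v) for v in row) + "\n")
        count += 1
    return count

# Demonstration with an in-memory buffer instead of a real file.
buffer = io.StringIO()
written = write_load_data_file([(1, "a\tb", None), (2, "plain", "x")], buffer)
print(written)
print(buffer.getvalue(), end="")
```

The resulting file would then be loaded with a statement of the form LOAD DATA INFILE '...' INTO TABLE ..., using whatever path and table apply on your server.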
MySQL Inserting large data sets from file with Java
We can help you better if you show us the queries, the table information (SHOW CREATE TABLE), information about the server, the MySQL settings and maybe some example data.
But in general:
Try using transactions: START TRANSACTION, lots of inserts, then COMMIT will be faster than lots of individually committed inserts;
You can do INSERTs in a cache, in temporary tables or in files, and only later load them into your permanent table.
You should take care to set up your server correctly, using the correct size for your buffers, and the correct table storage engine.
You can partition a table over multiple disks, to speed up inserts.
You can disable triggers and indexes during the inserts.
If you insert multiple rows at once, do one INSERT with multiple VALUES.
I am consuming a high-rate data stream and doing the following steps to store the data in a MySQL database for each newly arriving item:
(1) Parse incoming item.
(2) Execute several "INSERT ... ON DUPLICATE KEY UPDATE"
I have used INSERT ... ON DUPLICATE KEY UPDATE to eliminate one additional round-trip to the database.
While trying to improve the overall performance, I have considered doing bulk updates in the following way:
(1) Parse incoming item.
(2) Generate SQL statement with "INSERT ... ON DUPLICATE KEY UPDATE" and append to a file.
Periodically flush the SQL statements in the file to the database.
Two questions:
(1) Will this have a positive impact on the database load?
(2) How should I flush the statements to the database so that indices are only reconstructed after the complete flush? (Using transactions?)
UPDATE: I am using Perl DBI + MySQL MyISAM.
Thanks in advance for any comments.
If your data does not need to go into the database immediately you can cache your insert data somewhere, then issue one larger insert statement, e.g.
insert into table_name (x, y, z) values (x1, y1, z1), (x2, y2, z2), ... (xN, yN, zN) on duplicate key update ...;
To be clear, I would maintain a list of pending inserts, in this case a list of (x, y, z) triplets. Then once your list exceeds some threshold (N), you generate the insert statement and issue it.
I have no accurate timing figures for you, but this increased performance roughly 10 times when compared to inserting each row individually.
I also haven't played with the value of N, but I found 1000 to work nicely. I expect the optimal value is affected by hardware and database settings.
Hope this helps (I am also using MyISAM).
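The pending-list idea above can be sketched as follows. The thread itself uses Perl DBI; Python is used here only for illustration. The class name is made up, `execute` stands for any callable taking a SQL string and a parameter list (for example the execute method of a cursor from a MySQL driver that uses %s placeholders), and the `z = VALUES(z)` update clause is a placeholder, since the answer does not say what the update should do.

```python
class PendingInserts:
    """Collect (x, y, z) rows and issue one multi-row upsert per `threshold` rows."""

    def __init__(self, execute, threshold=1000):
        self.execute = execute      # callable(sql, params)
        self.threshold = threshold
        self.rows = []

    def add(self, x, y, z):
        self.rows.append((x, y, z))
        if len(self.rows) >= self.threshold:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        placeholders = ", ".join(["(%s, %s, %s)"] * len(self.rows))
        sql = (
            f"INSERT INTO table_name (x, y, z) VALUES {placeholders} "
            "ON DUPLICATE KEY UPDATE z = VALUES(z)"  # hypothetical update clause
        )
        params = [value for row in self.rows for value in row]
        self.execute(sql, params)
        self.rows = []

# Demonstration with a stand-in for cursor.execute that just records the calls.
issued = []
pending = PendingInserts(lambda sql, params: issued.append((sql, params)), threshold=2)
pending.add(1, 2, 3)
pending.add(4, 5, 6)   # reaches the threshold and triggers a flush
pending.add(7, 8, 9)
pending.flush()        # flush the remainder, e.g. at shutdown or on a timer
print(len(issued))
```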
You don't say what kind of database access environment (PERL DBI? JDBC? ODBC?) you're running in, or what kind of table storage engine (MyISAM? InnoDB?) you're using.
First of all, you're right to pick INSERT ... ON DUPLICATE KEY UPDATE. Good move, unless you can guarantee unique keys, in which case a plain INSERT is enough.
Secondly, if your database access environment allows it, you should use prepared statements. You definitely won't get good performance if you write a bunch of statements into a file, and then make a database client read the file once again. Do the INSERT operations directly from the software package that consumes the incoming data stream.
Thirdly, pick the right kind of table storage engine. MyISAM inserts are going to be faster than InnoDB, so if you're logging data and retrieving it later that will be a win. But InnoDB has better transactional integrity. If you're really handling tonnage of data, and you don't need to read it very often, consider the ARCHIVE storage engine.
Finally, consider doing a START TRANSACTION at the beginning of a batch of INSERT ... commands, then doing a COMMIT and another START TRANSACTION after a fixed number of rows, like 100 or so. If you're using InnoDB, this will speed things up a lot. If you're using MyISAM or ARCHIVE, it won't matter.
Your big wins will come from the prepared statement stuff and the best choice of storage engine.
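A minimal sketch of the "parameterized statement plus a COMMIT every N rows" pattern, written against Python's DB-API. The standard-library sqlite3 module is used purely as a stand-in so the example runs without a server; with a MySQL driver the placeholders would typically be %s rather than ?, and the function name and the batch size of 100 are assumptions.

```python
import sqlite3
from itertools import islice

def insert_in_batches(conn, sql, rows, batch_size=100):
    """Run a parameterized INSERT over `rows`, committing once per batch."""
    rows = iter(rows)
    cursor = conn.cursor()
    total = 0
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        cursor.executemany(sql, batch)  # same statement, many parameter sets
        conn.commit()                   # one commit per batch, not per row
        total += len(batch)
    return total

# Stand-in database; a real run would connect to MySQL instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (x INTEGER PRIMARY KEY, s TEXT)")
inserted = insert_in_batches(
    conn, "INSERT INTO t1 VALUES (?, ?)", ((i, str(i)) for i in range(250))
)
print(inserted)
```

Because the rows are consumed lazily, the same function can be fed directly from the code that parses the incoming stream, without writing statements to an intermediate file.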
I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on simple blank table (4 integer columns with 1 primary key). Setup as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely-decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially because the query requires the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys. If I did, the table would become over 1000 times larger, and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was for speeding-up all the index lookups and table-changes that occur for every INSERT.
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that when your current method is taking days). You may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable or add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
Sorry to keep throwing comments at you (last one, probably).
I just found this article, which provides an example of converting a large table from MyISAM to InnoDB. While this isn't what you are doing, the author uses an intermediate MEMORY table and describes going from memory to InnoDB in an efficient way: ordering the table in memory the way that InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM it might be worth a look, since you already have a "correct" memory table built.
I don't use MySQL but use SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id field column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22 GB file I have into staging in around 16 minutes) instead of working row by row?
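The staging-table process described above can be sketched end to end with Python's built-in sqlite3 module, used here only as a runnable stand-in for the real server. The table and column names (prod, staging, merged, k, val) are hypothetical, and duplicates are collapsed by summing rather than deleted, on the assumption that the goal is the additive val=val+whatever update from the question.

```python
import sqlite3

def load_via_staging(conn, rows):
    """Load (k, val) rows into prod through an unconstrained staging table."""
    # 1. Dump the raw rows into a staging table with no keys or constraints.
    conn.execute("CREATE TEMP TABLE staging (k TEXT, val INTEGER)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    # 2. Collapse duplicates inside staging, away from the production table.
    conn.execute(
        "CREATE TEMP TABLE merged AS "
        "SELECT k, SUM(val) AS val FROM staging GROUP BY k"
    )
    # 3. Update the keys that already exist in prod ...
    conn.execute(
        "UPDATE prod SET val = val + "
        "(SELECT m.val FROM merged m WHERE m.k = prod.k) "
        "WHERE k IN (SELECT k FROM merged)"
    )
    # 4. ... then insert the keys that do not exist yet.
    conn.execute(
        "INSERT INTO prod (k, val) "
        "SELECT k, val FROM merged WHERE k NOT IN (SELECT k FROM prod)"
    )
    conn.execute("DROP TABLE staging")
    conn.execute("DROP TABLE merged")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prod (k TEXT PRIMARY KEY, val INTEGER)")
conn.execute("INSERT INTO prod VALUES ('a', 10)")
load_via_staging(conn, [("a", 1), ("b", 2), ("a", 3), ("b", 5)])
result = dict(conn.execute("SELECT k, val FROM prod"))
print(result)
```

The production table is touched only in steps 3 and 4, once per distinct key, which is the point of doing the de-duplication in staging first.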