I have some data in CSV files. The volume of data is huge (around 65GB), and I want to insert it all into a database so that it can be queried later.
The CSV files themselves are pretty simple: there are only 5 columns, so essentially all the data will be inserted into a single table.
I have tried inserting this data into a MySQL database, but it is taking a very long time: almost 6 hours to insert just 1.3GB (my processor is a Core i5 at 2.9 GHz, with 4GB of DDR3 RAM).
The loading needs to finish fairly quickly, so all the inserts should be done within 4-5 days.
Which database will give the best performance in this case, assuming a reasonable query speed on the data is acceptable afterwards?
Also, are there other steps or practices I should follow?
You probably don't even need to import it. You can create a table with ENGINE=CSV.
mysql> create table mycsv(id int not null) engine=csv;
Query OK, 0 rows affected (0.02 sec)
Then go into your data directory, remove mycsv.CSV, and move/copy/symlink your CSV file in as mycsv.CSV. Go back to mysql, type flush tables;, and you're good to go. (NOTE: it may not work with \r\n line endings, so you may need to convert those to \n first.)
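For a 5-column file, a minimal sketch of the whole sequence might look like this (the column names and types are assumptions, since I don't know your schema; note the CSV engine requires every column to be NOT NULL and doesn't support indexes):

create table mycsv (
    col1 int not null,
    col2 int not null,
    col3 varchar(255) not null,
    col4 varchar(255) not null,
    col5 datetime not null
) engine=csv;
-- outside mysql: replace <datadir>/<your_db>/mycsv.CSV with (a symlink to) your 65GB file
flush tables;
select count(*) from mycsv;  -- sanity check that the file is readable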
If you are using InnoDB, the problem is that it has to keep an undo log entry for every row inserted, and this takes a lot of resources and a long time. It's better to do it in smaller batches so it can do most of the undo log tracking in memory. The undo log is there in case you Ctrl-C the load in the middle and it needs to roll back; once a batch has been loaded, InnoDB doesn't need to keep track of it anymore. If you do it all at once, it has to keep track of all those undo log entries, probably having to go to disk -- and that's a killer.
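A minimal sketch of the batched approach, assuming the big file has been split into chunks (for example with the split utility) and that you're loading with LOAD DATA INFILE; the file and table names are placeholders:

SET unique_checks = 0;       -- optional: skip unique-key checks during the bulk load
SET foreign_key_checks = 0;  -- optional: skip foreign-key checks during the bulk load
-- with autocommit on, each LOAD DATA is its own (smaller) transaction,
-- so the undo log for each batch stays manageable
LOAD DATA INFILE '/data/chunk_000.csv' INTO TABLE mytable
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
LOAD DATA INFILE '/data/chunk_001.csv' INTO TABLE mytable
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
-- ...one statement per chunk, driven from a shell script...
SET unique_checks = 1;
SET foreign_key_checks = 1;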
I prefer to use MyISAM for data if I know I'm not going to do row-level locking, for example if I want to run one long program to analyze the data. The table is locked, but I only need one program running on it. Plus, you can always use MERGE tables: they take MyISAM tables and let you group them together into one table. I like doing this for log files where each table holds a month of data; then I have a merge table for the year. The merge table doesn't copy the data, it just points to each of the MyISAM tables.
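A minimal sketch of the merge-table setup, with made-up monthly log tables (names and columns are purely illustrative; all underlying tables must be MyISAM with identical structure):

CREATE TABLE log_2012_01 (ts DATETIME NOT NULL, msg VARCHAR(255) NOT NULL) ENGINE=MyISAM;
CREATE TABLE log_2012_02 (ts DATETIME NOT NULL, msg VARCHAR(255) NOT NULL) ENGINE=MyISAM;
CREATE TABLE log_2012 (ts DATETIME NOT NULL, msg VARCHAR(255) NOT NULL)
  ENGINE=MERGE UNION=(log_2012_01, log_2012_02) INSERT_METHOD=LAST;
SELECT COUNT(*) FROM log_2012;  -- reads span both underlying tables, no data is copied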
Related
I have a table with a few billion rows of data and I am trying to build 5 indexes on it at once. The table format is MyISAM to save space. Once I build the indexes this will be a static table; I just need it to be read-only.
I created the indexes using this command:
alter table links8 add index(uid,tid), add index (date), add index (tid), add index (userid), add index (updated,uid,tid,userid,date);
The command has been running for over 45 days. You read that right: 45 DAYS. I can see that the temp files are still being accessed, it isn't a dead query.
My question is: wtf? Seems like it should take a few hours at most to sort and build an index even with a few billion rows.
Since I have a static table, is there another storage engine that makes sense to use? InnoDB takes up way too much space.
45 days doesn't seem right: in that time MySQL is bound to be doing something, and that something is consuming RAM or storage, likely both, which means you should have run out of one of them at some point.
I'd assume it's RAM, because that's usually where things get scarce ;)
Now, you're absolutely right: sorting a few billion values in memory shouldn't take ages. Sorting a few billion values that are the concatenated tuples (updated,uid,tid,userid,date), though, most likely doesn't happen in RAM. Assuming updated and date are of type datetime, they take 8 bytes each; uid, tid and userid would normally be 32-bit ints, but since your table has more than 2^32 entries (I'm assuming that), unique IDs would be 8 bytes long, too. So one (updated,uid,tid,userid,date) tuple would be 40 bytes long.
Now throw in, let's say, 5 billion of these and you get 200 GB of pure row data that you'll need to sort to build an index. Assuming you're not doing this on some huge machine, you obviously need to swap parts of this data out to disk -- since you see temporary files appear, my wild guess is that this is happening, and that MySQL is actively doing it itself. Sorting algorithms that work on chunks of the rows iteratively are much slower, because first you sort all the chunks, then you merge them into something better sorted than before, then you re-partition your data and sort the chunks again... with storing to and loading from disk in between.
By the way, a memory operation that lasts 45 days is likely to be hit by memory bit errors if no corrective measures are taken (basically: use ECC RAM for this kind of task, or you end up with an index built over corrupted data).
MySQL themselves suggest that you just build a special MD5 index that takes the hash of your search tuple and looks that up instead, since sorting 128-bit (16-byte) MD5 hashes is easier than sorting 40-byte (320-bit) composite rows.
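A rough sketch of that hash-column idea, using the column names from your ALTER TABLE (the separator, the CHAR(32) storage, and the literal values in the WHERE clause are my assumptions; you could store UNHEX(MD5(...)) in a BINARY(16) to halve the index size, and note this only helps exact-match lookups on the full tuple, not range scans):

ALTER TABLE links8 ADD COLUMN tuple_md5 CHAR(32) NOT NULL DEFAULT '';
UPDATE links8 SET tuple_md5 = MD5(CONCAT_WS('#', updated, uid, tid, userid, date));
ALTER TABLE links8 ADD INDEX (tuple_md5);
-- lookups then go through the single 32-character index:
SELECT * FROM links8
 WHERE tuple_md5 = MD5(CONCAT_WS('#', '2013-01-01 00:00:00', 42, 7, 99, '2013-01-02 00:00:00'));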
I found a better solution.
I created a new table with the indexes already in place, then issued an INSERT ... SELECT from one table to the other. The way this works is that MySQL fills up the MYD (raw data) file first and creates the indexes after that. Once it had started creating the indexes, I killed the query. Then, on the filesystem, I used myisamchk to repair the table manually.
That command looked like this:
myisamchk --force --fast --update-state --key_buffer_size=2000M --sort_buffer_size=2000M --read_buffer_size=10M --write_buffer_size=10M TABLE.MYI
And the whole thing took less than 12 hours and the data looks good!
UPDATE:
Here is the flow summarized.
create table2 identical to table1, with indexes;
insert into table2 select * from table1;
once the MYD file is fully populated and it starts building the MYI file, kill the query
then shut down mysql, run the myisamchk command, and restart mysql
OR
copy table2.MYD and table2.MYI to table3.MYD and table3.MYI, then run myisamchk, then copy table2.frm to table3.frm and change the permissions. When it's all done, you should be able to access table3 without restarting mysql.
At my office we have a legacy accounting system that stores all of its data in plaintext files (TXT extension) with fixed-width records. Each data file is named e.g., FILESALE.TXT. My goal is to bring this data into our MySQL server for read-only usage by many other programs that can't interface with the legacy software. Each file is essentially one table.
There are about 20 files in total that I need to access, roughly 1GB of total data. Each line might be 350-400 characters wide and have 30-40 columns. After pulling the data in, no MySQL table is much bigger than 100MB.
The legacy accounting system can modify any row in the text file, delete old rows (it has a deleted record marker -- 0x7F), and add new rows at any time.
For several years now I have been running a cron job every 5 minutes that:
Checks each data file for last modification time.
If the file is not modified, skip it. Otherwise:
Parse the data file, clean up any issues (very simple checks only), and spit out a tab-delimited file of the columns I need (some of the columns I just ignore).
TRUNCATE the table and import the new data into our MySQL server like this:
START TRANSACTION;
TRUNCATE legacy_sales;
LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales;
COMMIT;
The cron script runs each file check and parse in parallel, so the whole updating process doesn't really take very long. The biggest table (changed infrequently) takes ~30 seconds to update, but most of the tables take less than 5 seconds.
This has been working ok, but there are some issues. I guess it messes with database caching, so each time I have to TRUNCATE and LOAD a table, other programs that use the MySQL database are slow at first. Additionally, when I switched to running the updates in parallel, the database can be in a slightly inconsistent state for a few seconds.
This whole process seems horribly inefficient! Is there a better way to approach this problem? Any thoughts on optimizations or procedures that might be worth investigating? Any neat tricks from anyone who faced a similar situation?
Thanks!
Couple of ideas:
If the rows in the text files have a modification timestamp, you could update your script to keep track of when it runs, and then only process the records that have been modified since the last run.
If the rows in the text files have a field that can act as a primary key, you could maintain a fingerprint cache for each row, keyed by that id. Use it to detect when a row changes and skip unchanged rows. That is, in the loop that reads the text file, calculate the SHA1 (or whatever) hash of the whole row, and compare it to the hash from your cache. If they match, the row hasn't changed, so skip it. Otherwise, update/insert the MySQL record and then store the new hash value in the cache. The cache could be a GDBM file, a memcached server, a fingerprint field in your MySQL tables, whatever (see the sketch after this list). This leaves unchanged rows untouched (and thus still cached) on MySQL.
Perform updates inside a transaction to avoid inconsistencies.
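A minimal sketch of the fingerprint-field variant, against the legacy_sales table from the question (sale_id, col_a and col_b are hypothetical column names, with sale_id assumed to be the primary key):

ALTER TABLE legacy_sales ADD COLUMN row_hash CHAR(40) NOT NULL DEFAULT '';

-- for each parsed line, the importer would run something like:
INSERT INTO legacy_sales (sale_id, col_a, col_b, row_hash)
VALUES (12345, 'foo', 'bar', SHA1('...the raw fixed-width line...'))
ON DUPLICATE KEY UPDATE
    col_a    = VALUES(col_a),
    col_b    = VALUES(col_b),
    row_hash = VALUES(row_hash);

-- the importer can first SELECT row_hash for that sale_id and skip the statement
-- entirely when the hash matches, leaving the row (and its caches) untouched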
Two things come to mind and I won't go into too much detail but feel free to ask questions:
A service that offloads the processing of the file to an application server and then just populates the MySQL table. You could even build in some intelligence by checking for duplicate records, rather than truncating the entire table.
Offload the processing to another MySQL server and replicate / transfer it over.
I agree with alex's tips. If you can, update only the modified fields, and do mass updates with transactions and grouped multi-row inserts. An additional benefit of transactions is faster updates.
If you are concerned about downtime, instead of truncating the table, insert into a new table and then rename it.
For improved performance, make sure you have proper indexing on the fields.
Look at database-specific performance tips, such as:
_ delayed inserts (INSERT DELAYED) in MySQL can improve performance
_ caches can be optimized
_ even if you do not have unique rows, you may (or may not) be able to MD5 the rows
I'm loading a small data file consisting of around 1K rows into a MyISAM table
{
id INT(8),
text TEXT(or VARCHAR(1000))
}
The LOAD DATA INFILE takes around 2 seconds. I've seen MySQL load more than 10K rows per second on average when loading large files, and I roughly know there are costs such as opening/closing tables. Can someone help me understand what exactly happens in these 2 seconds, and whether it is possible to get it under a second? My program runs in a time-critical environment. Thanks.
Somebody asked a similar question here
http://forums.mysql.com/read.php?144,558753,558753.
Looks like it has not been well answered yet.
Scenario Description
The whole MySQL setup is for some academic projects and holds around 300GB of databases for various projects. Most of these databases, if not all, use the MyISAM engine. They contain imported dumps and intermediate tables produced during experiments. There have been delete and update operations on these tables, but they are all idle now. I have a project that generates result tuples which are inserted into a table in one of the databases. The table starts out empty, and the schema is very simple, containing only the two columns I pasted above. Now, if I set ENGINE=MyISAM, it always takes 2s to insert ~1K rows; however, if I switch to ENGINE=InnoDB, it takes 0.01s. I installed a fresh MySQL on another machine, created the table with ENGINE=MyISAM, and inserted the same number of rows: it only took 0.01s.
At 1k rows, you may find that multi-inserts are faster. Some benchmarking should help. This should be helpful as well:
http://dev.mysql.com/doc/refman/5.5/en/optimizing-innodb.html
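For reference, a multi-row insert is just one INSERT statement with several value tuples; a minimal sketch against the two-column schema from the question (the table name is a placeholder):

INSERT INTO mytable (id, text) VALUES
  (1, 'first row'),
  (2, 'second row'),
  (3, 'third row');
-- batching a few hundred rows per statement means the per-statement overhead
-- (parsing, locking, index flushing) is paid once per batch instead of per row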
Looking for some help and advice please from Super Guru MySQL/PHP pros who can spare a moment of their time.
I have a web application in PHP/MySQL which has grown over the years and gets a lot of searches. It's hitting bottlenecks now when the various daily data dumps of new rows get processed using MySQL LOAD DATA INFILE.
It's a large MyISAM table with about 1.5 million rows, and all the SELECT queries run against it. When these take place during the LOAD DATA INFILE of about 600k rows (and the deletion of outdated data), they just get backed up and take 30+ minutes to be freed up, making any of those searches fruitless.
I need to come up with a way to get that table updated while retaining the ability to provide SELECT results in a reasonable timeframe.
I'm completely out of ideas and haven't been able to come up with a solution myself, as it's the first time I've encountered this sort of issue.
Any helpful advice, solutions or pointers from similar past experiences would be greatly appreciated as I would love to learn to resolve this sort of problem.
Many thanks everyone for your time! J
You can use the CONCURRENT keyword with LOAD DATA INFILE. This way, while you load the data, the table is still able to serve SELECTs.
Concerning the delete, this is more complicated. I would personally add a column called 'status' INT(1), which defines whether the row is active or not (= deleted), and then partition the table with a rule based on this status column.
This way, it will be easier to delete all rows where status=0 :P I haven't tested this last solution; I may do that in the near future.
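A rough, untested sketch of that status-column idea (the table name is a placeholder; note that MySQL partitioning requires the partitioning column to be part of every unique key on the table, and TRUNCATE PARTITION needs MySQL 5.5+):

ALTER TABLE mytable ADD COLUMN status INT(1) NOT NULL DEFAULT 1;
ALTER TABLE mytable
  PARTITION BY LIST (status) (
    PARTITION p_active  VALUES IN (1),
    PARTITION p_deleted VALUES IN (0)
  );
-- "deleting" becomes UPDATE ... SET status = 0, and purging is then cheap:
ALTER TABLE mytable TRUNCATE PARTITION p_deleted;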
The CONCURRENT keyword will only work if your table is optimized (no free blocks in the middle of the data file). If there is any free space, the LOAD DATA INFILE will lock the table.
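Putting those two pieces together, the load step might look like this (a sketch; the file path and table name are placeholders):

OPTIMIZE TABLE mytable;   -- defragments the MyISAM data file so there are no holes
LOAD DATA CONCURRENT INFILE '/tmp/new_rows.data'
  INTO TABLE mytable;     -- SELECTs can keep reading the table while this runs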
MyISAM doesn't support row-level locking, so operations like mysqldump are forced to lock the entire table to guarantee a consistent dump. Your only practical options are to switch to another table type (like InnoDB) that supports row-level locking, and/or to split your dump up into smaller pieces. The small dumps will still lock the table while they're dumping/reloading, but the lock periods will be shorter.
A hairier option would be to have "live" and "backup" tables. Do the dump/load operations on the backup table. When they're complete, swap it out for the live table (rename the tables, or have your code dynamically change which table it's using). If you can live with a short window of potentially stale data, this could be a better option.
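The swap itself can be done atomically with RENAME TABLE (a sketch; the table names are placeholders):

-- load the fresh data into mytable_backup while mytable_live keeps serving SELECTs, then:
RENAME TABLE mytable_live   TO mytable_old,
             mytable_backup TO mytable_live;
-- mytable_old can now be truncated and reused as the next backup table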
You should switch your table's storage engine from MyISAM to InnoDB. InnoDB provides row-level locking (as opposed to MyISAM's table-level locking), meaning that while one query is busy updating or inserting a row, another query can update a different row at the same time.
I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially because the query requires the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys: the table would become over 1000 times larger if I did, and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: there are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was to speed up all the index lookups and table changes that occur for every INSERT.
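For reference, that pattern looks roughly like this (a sketch; the table and column names are placeholders, with the hash assumed to be the primary key):

INSERT INTO counts (hash, val)
VALUES (123456789, 1)
ON DUPLICATE KEY UPDATE val = val + VALUES(val);
-- if the hash already exists, its val is incremented instead of a new row being inserted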
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that, given that your current method takes days), but you may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable/re-add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was overall much faster.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
Sorry to keep throwing comments at you (last one, probably).
I just found this article, which provides an example of converting a large table from MyISAM to InnoDB. While that isn't exactly what you are doing, the author uses an intermediate MEMORY table and describes going from memory to InnoDB in an efficient way: ordering the table in memory the way InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM, it might be worth a look, since you already have a "correct" memory table built.
I don't use MySQL but SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the duplicates from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects the production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22-gig file I have into staging in around 16 minutes) instead of working row by row?
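A rough MySQL translation of that staging approach, simplified to a two-column version of the schema from the question (all table and column names are assumptions, and the final step uses ON DUPLICATE KEY UPDATE as the MySQL-ish equivalent of the separate update/insert passes):

-- 1. bulk-load the raw data into an unindexed staging table
CREATE TABLE staging (hash BIGINT NOT NULL, val INT NOT NULL) ENGINE=MyISAM;
LOAD DATA INFILE '/tmp/raw_rows.csv' INTO TABLE staging
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- 2. collapse duplicates once, in the staging area, instead of row by row
CREATE TABLE staging_dedup (hash BIGINT NOT NULL PRIMARY KEY, val INT NOT NULL) ENGINE=MyISAM;
INSERT INTO staging_dedup (hash, val)
SELECT hash, SUM(val) FROM staging GROUP BY hash;

-- 3. apply the collapsed rows to the production table in a single pass
INSERT INTO prod (hash, val)
SELECT hash, val FROM staging_dedup
ON DUPLICATE KEY UPDATE val = val + VALUES(val);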