MySQL load data infile - acceleration?

Sometimes I have to re-import data for a project, which means reading about 3.6 million rows into a MySQL table (currently InnoDB, but I am not actually limited to this engine). "LOAD DATA INFILE..." has proved to be the fastest solution; however, it comes with a trade-off:
- When importing without keys, the import itself takes about 45 seconds, but the key creation afterwards takes ages (it has already been running for 20 minutes...).
- Importing with the keys defined on the table makes the import itself much slower.
There are keys over three fields of the table, all referencing numeric fields.
Is there any way to accelerate this?
Another issue: when I terminate the process that started a slow query, the query keeps running on the database server. Is there any way to terminate the query without restarting mysqld?
Thanks a lot
DBa
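On the second question, a runaway statement can normally be stopped from another connection instead of restarting mysqld; a minimal sketch (the thread id 123 is just a placeholder you would read off SHOW PROCESSLIST):
SHOW PROCESSLIST;        -- find the Id of the thread running the slow query
KILL QUERY 123;          -- terminates the statement but keeps the connection
-- KILL CONNECTION 123;  -- or drop the whole connection
Bear in mind that with InnoDB the killed statement still has to roll back whatever it had already changed, so it may take a while to actually go away.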

If you're using InnoDB and bulk loading, here are a few tips:
Sort your CSV file into the primary key order of the target table: remember, InnoDB uses a clustered primary key, so it will load faster if the file is sorted!
Typical LOAD DATA INFILE I use:
truncate <table>;
set autocommit = 0;
load data infile <path> into table <table>...
commit;
Other optimisations you can use to boost load times:
set unique_checks = 0;
set foreign_key_checks = 0;
set sql_log_bin = 0;
Split the CSV file into smaller chunks.
Typical import stats I have observed during bulk loads:
3.5 - 6.5 million rows imported per minute
210 - 400 million rows per hour
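Putting those pieces together, a combined load session might look roughly like this. It is only a sketch: the table name, file path, and column list are placeholders, not anything taken from the answer above.
-- run in one session; assumes the CSV is already sorted by primary key
set unique_checks = 0;
set foreign_key_checks = 0;
set sql_log_bin = 0;          -- only if the load does not need to be replicated
set autocommit = 0;

truncate my_table;            -- hypothetical table name

load data infile '/tmp/my_data.csv'   -- hypothetical path
into table my_table
fields terminated by ',' lines terminated by '\n'
(col1, col2, col3);           -- hypothetical column list

commit;

set unique_checks = 1;
set foreign_key_checks = 1;
set sql_log_bin = 1;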

This blog post is almost 3 years old, but it's still relevant and has some good suggestions for optimizing the performance of "LOAD DATA INFILE":
http://www.mysqlperformanceblog.com/2007/05/24/predicting-how-long-data-load-would-take/

InnoDB is a pretty good engine. However, it relies heavily on being tuned. One thing to note is that if your inserts are not in increasing primary key order, InnoDB can take a bit longer than MyISAM. This can easily be worked around by setting a larger innodb_buffer_pool_size. My suggestion is to set it to 60-70% of total RAM on a dedicated MySQL machine.
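As a rough illustration, on a dedicated box with, say, 8 GB of RAM that would translate into something like the following in my.cnf; the figure is hypothetical, and on older MySQL versions changing it requires a server restart.
[mysqld]
# roughly 60-70% of RAM on a dedicated MySQL machine -- hypothetical value
innodb_buffer_pool_size = 5G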

Related

Why LOAD DATA INFILE MYSQL is slow [duplicate]

Speed up tokudb "alter table ... engine=TokuDB"

I'm trying to convert a 400 million row InnoDB table to the TokuDB engine. When I start with "ALTER TABLE ... ENGINE=TokuDB", things run really fast at the beginning; watching SHOW PROCESSLIST, I see it reading in about 1 million rows every 10 seconds. But once I reach about 19-20 million rows, it slows down to more like 10k rows every few seconds.
Are there any MySQL or TokuDB variables that affect the speed at which an ALTER TABLE to TokuDB runs? I tried tmp_table_size and some others, but can't seem to get past that hurdle.
Any ideas?
The solution for me was to export with "into outfile" and re-import with "load data infile".
This was several orders of magnitude faster for me (110 million records). Every time I modify a large TokuDB table with ALTER TABLE it takes forever (~30k rows/sec); a full export and import has been much quicker (~500k rows/sec), dropping ALTER TABLE times from hours to minutes. This holds both when converting from InnoDB and when altering a native TokuDB table (any ALTER TABLE).
select a.*, calcfields from table1 a into outfile 'temp.txt';
create table table2 .....;
load data infile 'temp.txt' into table table2 (field1, field2, ...);
PS: experiment with the CREATE TABLE using row_format=tokudb_lzma or row_format=tokudb_uncompressed. You can try all 3 ways pretty quickly (you need to do an OS-level directory listing to see the resulting size). I find offline index builds help too:
set tokudb_create_index_online=off;
create clustering index field1 on table2 (field1);   -- much faster
Multiple clustering indexes can make a world of difference once you learn when to use them.
I was using GUI tools that run ALTER TABLE for index changes, waiting hours each time; doing this by hand is far more productive (I had spent days getting nowhere via the GUI, and was done in 30 minutes).
Using 5.5.30-tokudb-7.0.1-MariaDB and VERY HAPPY.
Hopefully this helps others when experimenting. Obviously very late for the original asker.
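For anyone piecing that recipe together, the full dump-and-reload sequence might look roughly like the sketch below. The table definition, column names, and file path are hypothetical stand-ins for the parts elided above.
-- 1. dump the existing table to a flat file
select a.* from old_table a into outfile '/tmp/temp.txt';

-- 2. create the target table as TokuDB; compression is worth experimenting with
create table new_table (
  field1 int not null,
  field2 varchar(50),
  primary key (field1)
) engine=TokuDB row_format=tokudb_lzma;

-- 3. reload the data
load data infile '/tmp/temp.txt' into table new_table (field1, field2);

-- 4. build clustering/secondary indexes offline afterwards
set tokudb_create_index_online=off;
create clustering index idx_field2 on new_table (field2);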
The only existing response was not at all constructive for me (the question itself was).
Here are the important variables; make sure they are set globally prior to starting the operation, or locally within the session executing the storage engine change:
tokudb_load_save_space : the default is off and it should be left alone unless you are low on disk space.
tokudb_cache_size : if unset, TokuDB will allocate 50% of RAM for its own caching mechanism; we generally recommend leaving this setting alone. As you are running on an existing server, you need to make sure that you aren't over-committing memory between TokuDB, InnoDB, and MyISAM.
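As a rough illustration of the memory split on a mixed server, something like the following in my.cnf; the sizes are hypothetical, and tokudb_cache_size in particular is normally set in the configuration file and picked up at server startup rather than changed at runtime.
[mysqld]
# hypothetical sizing for a 16 GB server running both InnoDB and TokuDB
tokudb_cache_size       = 6G
innodb_buffer_pool_size = 6G
tokudb_load_save_space  = OFF   # default; enable only if disk space is tight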

Uploading enormous amount of data to MySQL server

I have to upload about 16 million records to a MySQL 5.1 server on a shared webspace which does not permit LOAD DATA functionality. The table is an InnoDB table. I have not assigned any keys yet.
Therefore, I use a Python script to convert my CSV file (2.5 GB in size) into an SQL file with individual INSERT statements. I've launched the SQL file, and the process is incredibly slow; it feels like only 1000-1500 lines are processed every minute!
In the meantime, I've read about bulk inserts, but did not find any reliable source saying how many records one INSERT statement can have. Do you know?
Is it an advantage to have no keys and add them later?
Would a transaction around all the inserts help speed up the process? In fact, there's just a single connection (mine) working with the database at this time.
If you use the INSERT ... VALUES ... syntax to insert multiple rows in a single request, your query size is limited by the max_allowed_packet value rather than by a number of rows.
Concerning keys: it's good practice to define keys before any data manipulation. Actually, when you build a model you must think about keys, relations, indexes etc.
It's better to define indexes before you insert data as well. CREATE INDEX works quite slowly on huge datasets, but postponing index creation is not a huge disadvantage.
To make your inserts faster, turn autocommit mode off (commit in batches instead) and do not run concurrent requests against your tables.
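As an illustration of the multi-row syntax, a sketch with made-up table and column names; the practical batch size is whatever keeps each statement under max_allowed_packet, not a fixed row count.
-- one statement inserting many rows; its total size must stay under max_allowed_packet
set autocommit = 0;

insert into my_table (id, name, amount) values
  (1, 'alpha', 10.50),
  (2, 'beta',  22.00),
  (3, 'gamma',  7.25);
-- ...repeat with the next batch of a few thousand rows per statement...

commit;

-- check (and, if possible, raise) the limit if statements are rejected as too large
show variables like 'max_allowed_packet';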

improving performance of mysql load data infile

I'm trying to bulk-load around 12 million records into an InnoDB table in a (local) MySQL instance using LOAD DATA INFILE (from CSV) and finding that it's taking a very long time to complete.
The primary key type is UUID and the keys are unsorted in the data files.
I've split the data into files containing 100,000 records each and import them as:
mysql -e 'ALTER TABLE customer DISABLE KEYS;'
for file in *.csv
do
  mysql -e "SET sql_log_bin=0; SET FOREIGN_KEY_CHECKS=0; SET UNIQUE_CHECKS=0;
    SET AUTOCOMMIT=0; LOAD DATA INFILE '${file}' INTO TABLE customer
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'; COMMIT"
done
This works fine for the first few hundred thousand records, but then the insert time for each subsequent load seems to keep growing (from around 7 seconds to around 2 minutes per load before I killed it).
I'm running on a machine with 8GB RAM and have set the InnoDB parameters to:
innodb_buffer_pool_size =1024M
innodb_additional_mem_pool_size =512M
innodb_log_file_size = 256M
innodb_log_buffer_size = 256M
I've also tried loading a single CSV containing all rows with no luck - this ran in excess of 2 hours before I killed it.
Is there anything else that could speed this up as this seems like an excessive time to only load 12m records?
If you know the data is "clean", then you can drop indexes on the affected tables prior to the import and then re-add them after it is complete.
Otherwise, each record causes an index-recalc, and if you have a bunch of indexes, this can REALLY slow things down.
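A sketch of that approach, assuming a hypothetical secondary index on an email column (the UUID primary key itself can't be dropped this cheaply, since removing it rebuilds the table):
-- before the import: drop secondary indexes
ALTER TABLE customer DROP INDEX idx_customer_email;

-- ...run the LOAD DATA INFILE loop from the question...

-- after the import: rebuild them in one pass
ALTER TABLE customer ADD INDEX idx_customer_email (email);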
It's always hard to tell what the cause of a performance issue is, but here are my 2 cents:
Your key being a UUID means it is randomly distributed, which makes the index hard to maintain. The reason is that keys are stored by range in file system blocks, so random UUIDs arriving one after another make the server read and write blocks scattered across the file without leveraging the cache. I don't know if you can change the key, but you could maybe sort the UUIDs in the input file and see if that helps.
FYI, to understand this issue better I would take a look at this blog post and maybe read the book High Performance MySQL; it has a nice chapter about the InnoDB clustered index.
Good luck!
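One way to present the rows to InnoDB in primary-key order without re-sorting the CSV files would be to stage them first; a sketch only, in which the table, column, and file names are placeholders and the UUID column is assumed to be called id.
-- staging table with the same columns but no primary key, so load order doesn't matter
CREATE TABLE customer_stage LIKE customer;
ALTER TABLE customer_stage DROP PRIMARY KEY;

LOAD DATA INFILE '/path/to/all_rows.csv' INTO TABLE customer_stage
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- copy into the real table in clustered-index order (one big sort instead of random inserts)
INSERT INTO customer
SELECT * FROM customer_stage ORDER BY id;

DROP TABLE customer_stage;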

Generating a massive 150M-row MySQL table

I have a C program that mines a huge data source (20 GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be lost when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially because the query requires the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys: the table would become over 1000 times larger if I did, and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: the INSERT query UPDATEs a record if the primary key (a hash) already exists in the table.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16 GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: there are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was to speed up all the index lookups and table changes that occur for every INSERT.
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that when your current method takes days), but you may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable/add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
Sorry to keep throwing comments at you (last one, probably).
I just found this article, which provides an example of converting a large table from MyISAM to InnoDB. While this isn't what you are doing, the author uses an intermediate MEMORY table and describes going from MEMORY to InnoDB in an efficient way: ordering the table in memory the way InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM, it might be worth a look since you already have a "correct" MEMORY table built.
I don't use MySQL but SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull my 22 GB file into staging in around 16 minutes) instead of working row by row?
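Translated to MySQL, that staging approach might look roughly like the sketch below; the table and column names are made up, it assumes the C program can write a CSV instead of INSERT statements, and the four integer columns from the question are collapsed to two for brevity.
-- 1. bulk-load the raw rows into an unindexed staging table
CREATE TABLE staging (
  hash_key INT NOT NULL,
  val      INT NOT NULL
) ENGINE=MyISAM;

LOAD DATA INFILE '/tmp/raw_rows.csv' INTO TABLE staging
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- 2. collapse duplicates once inside staging, instead of one
--    ON DUPLICATE KEY UPDATE round trip per generated row
CREATE TABLE staging_agg (
  hash_key INT NOT NULL PRIMARY KEY,
  val_sum  INT NOT NULL
) ENGINE=MyISAM;

INSERT INTO staging_agg
SELECT hash_key, SUM(val) FROM staging GROUP BY hash_key;

-- 3. one set-based merge into the production table
INSERT INTO prod (hash_key, val)
SELECT hash_key, val_sum FROM staging_agg
ON DUPLICATE KEY UPDATE val = val + staging_agg.val_sum;

DROP TABLE staging, staging_agg;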