Efficiently load a CSV file into MySQL

I have a CSV file of around 20 GB with about 60 million rows that I would like to load into a MySQL table.
I defined the table with a composite primary key of (col_a, col_b) before starting any load.
I started the load as below:
LOAD DATA LOCAL INFILE '/mypath/mycsv.csv'
INTO TABLE mytable
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 0 LINES
(@v1v, @v2v, @v3v, @v4v, @v5v, etc...)
SET
col_a = nullif(@v1v,''),
col_b = nullif(@v2v,''),
col_c = nullif(@v3v,''),
col_d = nullif(@v4v,''),
col_e = nullif(@v5v,''),
etc...,
load_dttm = NOW();
This seemed to work fine until the dataset reached around 10 GB, at which point loading slowed significantly. What looked like it might take an hour has been running all night, and the table has not grown much.
Are there more efficient ways of loading "large" (depending on your definition of the word) CSV files into MySQL?
My immediate thoughts are:
1) Should I remove the composite primary key and only apply it after the load?
2) Should I break the CSV down into smaller chunks?
As I understand it, MySQL is mainly limited by system constraints, which should not be an issue in my case: I am using a Red Hat Linux server with "MemTotal: 396779348 kB" and terabytes of disk space.
This is my first time using MySQL, so please bear that in mind in any answers.

It turns out my issue was that the /var/lib/mysql directory had not been allocated enough space. It seems that MySQL slows down rather than throwing an error when space becomes low while processing a LOAD DATA command. To resolve this, I moved the datadir by following "How to change MySQL data directory?".
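Because the load slowed down instead of failing, checking free space on the data directory before starting may catch this earlier. A minimal shell sketch, assuming a df that supports -P and -k; the function names (free_kb, file_kb, has_room) and the default 2x headroom factor are made up for illustration and are not MySQL guidance:

```shell
#!/bin/sh
# free_kb DIR: free space, in KiB, on the filesystem that holds DIR.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# file_kb FILE: size of FILE in KiB, rounded up.
file_kb() {
    bytes=$(wc -c < "$1")
    echo $(( (bytes + 1023) / 1024 ))
}

# has_room FILE DIR [FACTOR]: succeed if DIR has at least FACTOR times
# the size of FILE free. FACTOR defaults to 2, an arbitrary safety margin.
has_room() {
    need=$(( $(file_kb "$1") * ${3:-2} ))
    [ "$(free_kb "$2")" -ge "$need" ]
}
```

For example, has_room /mypath/mycsv.csv /var/lib/mysql || echo "datadir may be too small" >&2 would warn before the load starts. How much room a load really needs depends on indexes and logs, so the factor is only a rough guard.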

Related

LOAD DATA LOCAL statement taking a little while on MySql RDS

I have a medium-sized database (1.6 GB), and I am running the following statement on RDS MySQL:
LOAD DATA LOCAL INFILE "E:/csms/Query/Tratado_interno/Dataset_Raw.csv"
INTO TABLE Eventtracks_Take.Prod
FIELDS TERMINATED BY '|'
ENCLOSED BY '"'
IGNORE 1 ROWS;
I was under the impression that loads like this run very fast; in the tutorials I have seen, they take around 1 second. Mine take around 3.6 minutes. Any ideas on the probable cause? My EC2 instance is a t3.large, so I would not expect that to be the cause. I also sometimes get warnings that data in a specific column was truncated, even though I am sure the schema gives that column enough characters. I would appreciate any theory.

Why LOAD DATA INFILE MYSQL is slow [duplicate]

Sometimes I have to re-import data for a project, reading about 3.6 million rows into a MySQL table (currently InnoDB, but I am not really limited to this engine). "LOAD DATA INFILE ..." has proved to be the fastest solution, but it has a tradeoff:
- When importing without keys, the import itself takes about 45 seconds, but the key creation takes ages (already running for 20 minutes...).
- Importing with the keys already on the table makes the import much slower.
There are keys over 3 fields of the table, referencing numeric fields.
Is there any way to accelerate this?
Another issue is: when I terminate the process which has started a slow query, it continues running on the database. Is there any way to terminate the query without restarting mysqld?
Thanks a lot
DBa
If you're using InnoDB and bulk loading, here are a few tips:
- Sort your CSV file into the primary key order of the target table. InnoDB uses clustered primary keys, so the data loads faster if it's sorted.
- A typical LOAD DATA INFILE sequence I use:
truncate <table>;
set autocommit = 0;
load data infile <path> into table <table>...
commit;
- Other optimisations you can use to boost load times:
set unique_checks = 0;
set foreign_key_checks = 0;
set sql_log_bin=0;
- Split the CSV file into smaller chunks.
Typical import stats I have observed during bulk loads:
3.5 - 6.5 million rows imported per min
210 - 400 million rows per hour
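The sorting tip above could look like this in shell for a comma-separated file whose first two columns are numeric and form the primary key. That layout, the function name sort_csv_by_pk, and the absence of a header row or of quoted fields containing commas or newlines are all assumptions for illustration:

```shell
#!/bin/sh
# sort_csv_by_pk IN OUT: write IN to OUT ordered by columns 1 and 2,
# both compared numerically. Assumes no header row and no quoted fields
# that contain commas or newlines.
sort_csv_by_pk() {
    LC_ALL=C sort -t, -k1,1n -k2,2n -o "$2" "$1"
}
```

The sorted output file is then the one to pass to LOAD DATA INFILE. If the key columns are not numeric, dropping the n modifiers gives a plain byte-order sort instead, which may or may not match the column's collation.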
This blog post is almost 3 years old, but it's still relevant and has some good suggestions for optimizing the performance of "LOAD DATA INFILE":
http://www.mysqlperformanceblog.com/2007/05/24/predicting-how-long-data-load-would-take/
InnoDB is a pretty good engine, but it relies heavily on being tuned. If your inserts are not in order of increasing primary key, InnoDB can take a bit longer than MyISAM. This can easily be overcome by setting a higher innodb_buffer_pool_size. My suggestion is to set it to 60-70% of your total RAM on a dedicated MySQL machine.
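To turn the 60-70% suggestion into a concrete number, the arithmetic can be done from the MemTotal figure in /proc/meminfo (Linux only). A small sketch; pool_size_mb and mem_total_kb are hypothetical helper names, and the percentage is the rule of thumb from the answer above, not a measured optimum:

```shell
#!/bin/sh
# pool_size_mb TOTAL_KB PCT: PCT percent of TOTAL_KB (KiB), in whole MiB.
# 102400 = 100 (percent) * 1024 (KiB per MiB).
pool_size_mb() {
    echo $(( $1 * $2 / 102400 ))
}

# mem_total_kb: MemTotal from /proc/meminfo, in KiB (Linux only).
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}
```

For instance, echo "innodb_buffer_pool_size = $(pool_size_mb "$(mem_total_kb)" 60)M" prints a line that could go into the MySQL config file. For a machine reporting 8388608 kB (8 GB), 60% works out to 4915M.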

MySQL Optimization for LOAD DATA INFILE

I see programmers everywhere discussing optimisation for the fastest LOAD DATA INFILE inserts, but they rarely explain their choice of values, and optimisation depends on the environment and on the actual needs.
So I would like some explanation of what the best values in my MySQL config file would be to reach the fastest insert possible.
My machine: an Intel dual-core @ 3.30 GHz, 4 GB DDR4 RAM (Windows 7 says "2.16 GB available" because of reserved memory).
My backup.csv file is plain text with about 5 billion entries, so it is a huge file of about 500 GB. Each line follows this pattern (but with hexadecimal strings of length 64):
"sdlfkjdlfkjslfjsdlfkjslrtrtykdjf";"dlksfjdrtyrylkfjlskjfssdlkfjslsdkjf"
There are only two fields in my table, and the first one has a unique index.
ROW_FORMAT is set to FIXED to save space, and for the same reason the field type is BINARY(32).
I'm using the MyISAM engine (because InnoDB requires much more space) on MySQL version 5.1.41.
Here is the code I plan to use for now:
ALTER TABLE verification DISABLE KEYS;
LOCK TABLES verification WRITE;
LOAD DATA INFILE 'G:\\backup.csv'
IGNORE INTO TABLE verification
FIELDS TERMINATED BY ';' ENCLOSED BY '"' LINES TERMINATED BY '\r\n'
(@myhash, @myverif) SET hash = UNHEX(@myhash), verif = UNHEX(@myverif);
UNLOCK TABLES;
ALTER TABLE verification ENABLE KEYS;
As you can see, the LOAD DATA INFILE command takes the plain-text values and converts them from hexadecimal text to binary with UNHEX (both are hexadecimal hashes, after all).
I have heard about buffer sizes and the other values in the MySQL config file. What should I change, and what would the best values be?
As you can see, I already lock the table and disable keys to speed things up.
I also read in the documentation:
myisamchk --keys-used=0 -rq /var/lib/mysql/dbName/tblName
Doing that before the insert would also speed it up. But what exactly is tblName? (I have a .frm file, a .MYD and a .MYI, so which one am I supposed to point to?)
Here are the last few short hints I read about optimisation
EDIT: I forgot to mention that everything is on localhost.
So, I finally managed to insert my 500 GB database of more than 3 billion entries in about 5 hours.
I tried many ways, and while rebuilding the primary index I kept getting stuck on this error: ERROR 1034 (HY000): Duplicate key 1 for record at 2229897540 against new record at 533925080.
Here is how I completed the insert:
1) Sort the .csv file. I used sort.exe from GNU CoreUtils (I'm on Windows). Keep in mind that sorting needs 1.5x the size of the csv file as free space for temporary files (so 2.5x in total, counting the .csv file itself).
2) Create the table, with indexes and all.
3) Execute mysqladmin flush-tables -u a_db_user -p
4) Execute myisamchk --keys-used=0 -rq /var/lib/mysql/dbName/tblName
5) Insert the data (DO NOT USE ALTER TABLE tblname DISABLE KEYS; !!!):
LOCK TABLES verification WRITE;
LOAD DATA INFILE 'G:\\backup.csv'
IGNORE INTO TABLE verification
FIELDS TERMINATED BY ';'
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(@myhash, @myverif) SET hash = UNHEX(@myhash), verif = UNHEX(@myverif);
UNLOCK TABLES;
6) When the data is inserted, rebuild the indexes by executing myisamchk --key_buffer_size=1024M --sort_buffer_size=1024M -rqq /var/lib/mysql/dbName/tblName
(Note the -rqq: doubling the q makes myisamchk try to repair possible duplicate-key errors instead of just stopping after many hours of waiting.)
7) Execute mysqladmin flush-tables -u a_db_user -p
And I was done!
I noticed a huge boost in speed when the .csv file is on a different drive from the database. The same goes for the sort operation: put the temp files on another drive, so that reads and writes do not compete for the same disk.
Credit for this solution goes to its original source.
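The sort step and the duplicate-key error can both be looked at from the shell before loading. A rough sketch, assuming GNU sort (for the -T option), the ';'-separated two-field layout shown in the question, and hashes written in one consistent letter case; sort_by_hash and count_dup_hashes are illustrative names, and counting duplicates up front is an addition here, not part of the procedure above:

```shell
#!/bin/sh
# sort_by_hash IN OUT TMPDIR: sort IN on its first ';'-separated field,
# keeping sort's temporary files in TMPDIR (ideally on another drive).
sort_by_hash() {
    LC_ALL=C sort -t ';' -k1,1 -T "$3" -o "$2" "$1"
}

# count_dup_hashes SORTED: number of distinct first-field values that
# occur more than once in an already sorted file.
count_dup_hashes() {
    cut -d ';' -f1 "$1" | uniq -d | wc -l | tr -d ' '
}
```

A non-zero count from count_dup_hashes on the sorted file would suggest that a unique index on the first column will hit duplicates during the rebuild.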
I'm pretty sure tblName is just verification, not verification.MYD or either of the other two. .MYD is data, .MYI is indexes, .frm is schema.
How long are the strings? Are they hex? If they are 32 hex digits, don't you want BINARY(16) for the output of UNHEX?
The long part of the process will probably be ENABLE KEYS, which is when it builds the index. Run SHOW PROCESSLIST; while it is running. If the state says something like "Repair with keycache", kill it; it will take forever. If it says something like "Repair by sorting", that is good: it is sorting, then loading the index efficiently.
You can save 5GB of disk space by setting myisam_data_pointer_size=5 before starting the process. Seems like there is also myisam_index_pointer_size, but it may be defaulted to 5, which is probably correct for your case. (I encountered that setting once on ver 4.0 in about 2004; but never again.)
I don't think key_buffer_size will matter during the load and indexing -- since you really want it not to use the key_buffer. Don't set it so high that you run out of RAM. Swapping is terrible for performance.

improving performance of mysql load data infile

I'm trying to bulk load around 12 million records into an InnoDB table in a (local) MySQL using LOAD DATA INFILE (from CSV), and I'm finding that it takes a very long time to complete.
The primary key type is UUID and the keys are unsorted in the data files.
I've split the data file into files containing 100000 records each and import them as:
mysql -e 'ALTER TABLE customer DISABLE KEYS;'
for file in *.csv; do
  mysql -e "SET sql_log_bin=0; SET FOREIGN_KEY_CHECKS=0; SET UNIQUE_CHECKS=0;
  SET AUTOCOMMIT=0; LOAD DATA INFILE '${file}' INTO TABLE table
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'; COMMIT"
done
This works fine for the first few hundred thousand records, but then the insert time for each subsequent load keeps growing (from around 7 seconds to around 2 minutes per load before I killed it).
I'm running on a machine with 8GB RAM and have set the InnoDB parameters to:
innodb_buffer_pool_size =1024M
innodb_additional_mem_pool_size =512M
innodb_log_file_size = 256M
innodb_log_buffer_size = 256M
I've also tried loading a single CSV containing all rows with no luck - this ran in excess of 2 hours before I killed it.
Is there anything else that could speed this up as this seems like an excessive time to only load 12m records?
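The chunking described in this question can be done with the standard split utility. A minimal sketch; split_csv and the chunk_ prefix are hypothetical names, and it assumes the file has no header row and no quoted fields spanning several lines:

```shell
#!/bin/sh
# split_csv FILE LINES PREFIX: cut FILE into pieces of at most LINES
# lines each, named PREFIXaa, PREFIXab, ... (split's default suffixes).
split_csv() {
    split -l "$2" "$1" "$3"
}
```

For example, split_csv data.csv 100000 chunk_ (with data.csv standing in for the real file) would produce chunk_aa, chunk_ab, and so on. The default two-letter suffixes allow up to 676 pieces; split's -a option raises that limit.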
If you know the data is "clean", then you can drop indexes on the affected tables prior to the import and then re-add them after it is complete.
Otherwise, each record causes an index-recalc, and if you have a bunch of indexes, this can REALLY slow things down.
It's always hard to tell what causes performance issues, but here are my 2 cents:
Your key, being a UUID, is randomly distributed, which makes the index hard to maintain. Keys are stored by range in file system blocks, so a stream of random UUIDs makes the OS read and write blocks all over the file system without benefiting from the cache. I don't know if you can change the key, but you could try sorting the UUIDs in the input file and see if that helps.
FYI, to understand this issue better I would take a look at this blog post and maybe read the book High Performance MySQL, which has a nice chapter about the InnoDB clustered index.
Good Luck!
