I see programmers everywhere discussing optimisation for the fastest LOAD DATA INFILE inserts, but they never explain their choice of values much, and optimisation depends on the environment and on the actual needs.
So I would like some explanation of what the best values in my MySQL config file would be to reach the fastest insert possible, please.
My setup: an Intel dual-core @ 3.30 GHz, 4 GB DDR4 RAM (Windows 7 reports only 2.16 GB available because of reserved memory).
My backup.csv file is plain text with about 5 billion entries, so it is a huge 500 GB file, with rows shaped like this (but with 64-character hexadecimal strings):
"sdlfkjdlfkjslfjsdlfkjslrtrtykdjf";"dlksfjdrtyrylkfjlskjfssdlkfjslsdkjf"
There are only two fields in my table, and the first one has a unique index.
ROW_FORMAT is set to FIXED to save space, and for the same reason both fields are typed as BINARY(32).
I'm using the MyISAM engine (because InnoDB requires much more space!). MySQL version is 5.1.41.
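For clarity, the table definition looks roughly like this (a rough reconstruction of what I described above, NOT NULL assumed):
CREATE TABLE verification (
  hash  BINARY(32) NOT NULL,
  verif BINARY(32) NOT NULL,
  UNIQUE KEY (hash)
) ENGINE=MyISAM ROW_FORMAT=FIXED;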
Here is the code I planned to use for now:
ALTER TABLE verification DISABLE KEYS;
LOCK TABLES verification WRITE;
LOAD DATA INFILE 'G:\\backup.csv'
IGNORE INTO TABLE verification
FIELDS TERMINATED BY ';' ENCLOSED BY '"' LINES TERMINATED BY '\r\n'
(@myhash, @myverif) SET hash = UNHEX(@myhash), verif = UNHEX(@myverif);
UNLOCK TABLES;
ALTER TABLE verification ENABLE KEYS;
As you can see, the LOAD DATA INFILE command takes the plain-text values and converts them with UNHEX (both are hexadecimal hashes anyway, so...).
I have heard about the buffer sizes and all those values in the MySQL config file. What should I change, and what would be the best values, please?
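To be concrete, these are the kinds of [mysqld] settings I mean; the values below are just placeholders I have seen mentioned elsewhere, not values I know to be right:
[mysqld]
key_buffer_size = 1024M           # MyISAM index cache
myisam_sort_buffer_size = 1024M   # used when sorting indexes during a repair/rebuild
bulk_insert_buffer_size = 256M    # used by MyISAM bulk inserts such as LOAD DATA
read_buffer_size = 8M             # per-thread sequential read buffer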
As you can see, I already lock the table and disable keys to speed things up.
I also read in the documentation:
myisamchk --keys-used=0 -rq /var/lib/mysql/dbName/tblName
Doing that before the insert is supposed to speed it up as well. But what exactly is tblName? (I have a .frm file, a .MYD and a .MYI, so which one am I supposed to point to?)
Here are the last short hints I read about optimisation.
EDIT: I forgot to mention that everything is on localhost.
So, I finally managed to insert my 500 GB database of more than 3 billion entries in something like 5 hours.
I tried many ways, and while rebuilding the primary index I kept getting stuck on this error: ERROR 1034 (HY000): Duplicate key 1 for record at 2229897540 against new record at 533925080.
I will now explain how I managed to complete the insert:
I sorted my .csv file with GNU CoreUtils sort.exe (I'm on Windows). Keep in mind that doing so needs about 1.5x the size of your CSV file as free space for temporary files (so, counting the .csv file itself, 2.5x in total).
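The sort command itself looked roughly like this (the temp directory and the output file name are just examples; -T lets you point the temporary files at another drive, as I note further down):
sort.exe -T D:\tmp -o G:\backup_sorted.csv G:\backup.csv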
You create the table, with indexes and all.
Execute mysqladmin flush-tables -u a_db_user -p
Execute myisamchk --keys-used=0 -rq /var/lib/mysql/dbName/tblName
Insert the data (DO NOT USE ALTER TABLE tblname DISABLE KEYS; !!!):
LOCK TABLES verification WRITE;
LOAD DATA INFILE 'G:\\backup.csv'
IGNORE INTO TABLE verification
FIELDS TERMINATED BY ';'
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(@myhash, @myverif) SET hash = UNHEX(@myhash), verif = UNHEX(@myverif);
UNLOCK TABLES;
Once the data is inserted, rebuild the indexes by executing myisamchk --key_buffer_size=1024M --sort_buffer_size=1024M -rqq /var/lib/mysql/dbName/tblName
(Note the -rqq: doubling the q makes it ignore possible duplicate-key errors by trying to repair them, instead of simply aborting after many hours of waiting!)
Execute mysqladmin flush-tables -u a_db_user -p
And I was done!
I noticed a huge boost in speed when the .csv file is on a different drive than the database, and the same goes for the sort operation: put the temporary files on yet another drive (reads and writes are faster when the data isn't all in the same place).
The source of this again was here: credits to this solution.
I'm pretty sure it is verification, not verification.MYD or the other two: .MYD is the data, .MYI is the indexes, .frm is the schema.
How long are the strings? Are they hex? If they are 32 hex digits, don't you want BINARY(16) for the output of UNHEX?
The long part of the process will probably be ENABLE KEYS, which is when it will be building the index. Do SHOW PROCESSLIST; while it is running -- if it says "using keybuffer", kill it, it will take forever. If it says something like "building by repair", that is good -- it is sorting, then loading the index efficiently.
You can save 5 GB of disk space by setting myisam_data_pointer_size=5 before starting the process. It seems there is also a myisam_index_pointer_size, but it may default to 5, which is probably correct for your case. (I encountered that setting once, on version 4.0 around 2004, but never again.)
I don't think key_buffer_size will matter during the load and indexing -- since you really want it not to use the key buffer. Don't set it so high that you run out of RAM; swapping is terrible for performance.
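If you do try the pointer-size idea, a minimal sketch; it only applies to tables created after the variable is set, so do it before your CREATE TABLE:
SET GLOBAL myisam_data_pointer_size = 5;
-- or, equivalently, put myisam_data_pointer_size = 5 under [mysqld] in my.cnf and restart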
Related
I'm running a query which imports 500k records into a table from a .CSV file.
The query runs for 1h15min, which I think is way too much time for this. I was expecting about 10 to 20 minutes.
I have already done an insert of 35k records and it took about 30 seconds.
Is there a way of speeding up the process?
The statement I'm calling is the one below:
LOAD DATA LOCAL INFILE 'c:/users/migue/documents/www/mc-cron/excels/BD_YP_vFinal_2.csv'
INTO TABLE leads
FIELDS TERMINATED BY ';' ENCLOSED BY '"'
LINES TERMINATED BY '\n';
The speed of importing rows into the database mainly depends on the structure of the database, rather than on the imported data, or on the specific command that you use to initiate the import. So, it would be a lot more useful to show us the CREATE TABLE statement that creates your leads table instead of showing us your LOAD DATA command.
A few suggestions for fast inserts:
Try disabling any and all constraints on the table, and re-enabling them afterwards.
This includes unique constraints, foreign keys, everything.
Try dropping any triggers before the import, and once the import is done:
put the triggers back
manually perform the operations that the triggers would have performed if they were active during the import
Try dropping all indexes before the import, (yes, this also includes the primary key index,) and re-creating them once the import is complete.
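As a rough sketch of the last suggestion (the index name idx_example and its column are made up; take the real ones from SHOW CREATE TABLE leads):
ALTER TABLE leads DROP INDEX idx_example;
-- repeat for every secondary index; DROP PRIMARY KEY too if you go that far
LOAD DATA LOCAL INFILE 'c:/users/migue/documents/www/mc-cron/excels/BD_YP_vFinal_2.csv'
INTO TABLE leads
FIELDS TERMINATED BY ';' ENCLOSED BY '"'
LINES TERMINATED BY '\n';
ALTER TABLE leads ADD INDEX idx_example (example_column);
-- and ADD PRIMARY KEY (...) again if you dropped it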
I have a CSV file around 20 GB in size, with about 60 million rows, that I would like to load into a MySQL table.
I have defined my table in mysql with a composite primary key of (col_a, col_b) prior to starting any load.
I have initiated my load as below:
LOAD DATA LOCAL INFILE '/mypath/mycsv.csv'
INTO TABLE mytable
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 0 LINES
(@v1v, @v2v, @v3v, @v4v, @v5v, etc...)
SET
col_a = NULLIF(@v1v, ''),
col_b = NULLIF(@v2v, ''),
col_c = NULLIF(@v3v, ''),
col_d = NULLIF(@v4v, ''),
col_e = NULLIF(@v5v, ''),
etc...,
load_dttm = NOW();
This seemed to work fine until the dataset got to around 10 GB in size, at which point the loading slowed down significantly, and what looked like it might take an hour had been running all night without growing much.
Are there more efficient ways of loading (depending on your definition of the word) "large" CSVs into MySQL?
My immediate thoughts are:
1) Should I remove my composite primary key and only apply it after the load (see the sketch after this list)?
2) Should I break the CSV down into smaller chunks?
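For option 1, I imagine something like this (just a sketch, not tested):
ALTER TABLE mytable DROP PRIMARY KEY;
-- run the LOAD DATA LOCAL INFILE statement above
ALTER TABLE mytable ADD PRIMARY KEY (col_a, col_b);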
As I understand it, MySQL is mainly limited by system constraints, which should not be an issue in my case: I am using a Linux Red Hat server with "MemTotal: 396779348 kB"! and terabytes of space.
This is my first time using MySQL, so please bear this in mind in any answers.
My issue, it turns out, was due to the /var/lib/mysql directory not having enough space allocated. It seems that MySQL will slow down rather than throw an error when space becomes low while processing a LOAD DATA command. To resolve this I moved the datadir, following How to change MySQL data directory?
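For anyone else hitting this, the move itself was roughly the following (the new location /data/mysql is just an example, and the service name plus any SELinux/AppArmor rules vary by distribution):
sudo service mysqld stop
sudo cp -a /var/lib/mysql /data/mysql
sudo chown -R mysql:mysql /data/mysql
# point datadir = /data/mysql in my.cnf, then:
sudo service mysqld start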
I use the SELECT * INTO OUTFILE option in MySQL to back up the data into tab-separated text files; I call this statement against each table.
Then I use LOAD DATA INFILE to import the data into MySQL for each table.
I have not yet used any locking or disabled keys while performing these operations.
Now I face some issues:
While the backup is running, other updates and selects get slow.
It takes too much time to import the data for huge tables.
How can I improve the method to solve the above issues?
Is mysqldump an option? I see that it uses insert statements, so before I try it, I wanted to request advice.
Does using locks and disabling keys before each "load data" improve import speed?
If you have a lot of databases/tables, it will definitely be much easier for you to use mysqldump, since you only need to run it once per database (or even once for all databases, if you do a full backup of your system). Also, it has the advantage that it also backs up your table structure (something you cannot do using only select *).
The speed is probably similar, but it would be best to test both and see which one works best in your case.
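A rough way to time both approaches on one database (user, database and file names are placeholders):
time mysqldump -u a_db_user -p mydb > mydb.sql   # dump as INSERT statements
time mysql -u a_db_user -p mydb < mydb.sql       # restore that dump
# compare with the timings of your current per-table
# SELECT ... INTO OUTFILE / LOAD DATA INFILE round trip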
Someone here tested the options, and mysqldump proved to be faster in his case. But again, YMMV.
If you're concerned about speed, also take a look at the mysqldump/mysqlimport combination. As mentioned here, it is faster than mysqldump alone.
As for locks and disable keys, I am not sure, so I will let someone else answer that part :)
Using mysqldump is important if you want your data backup to be consistent. That is, the data dumped from all tables represents the same instant in time.
If you dump tables one by one, they are not in sync, so you could have data for one table that references rows in another table that aren't included in the second table's backup. When you restore, it won't be pretty.
For performance, I'm using:
mysqldump --single-transaction --tab=/path/to/dump_dir mydatabase
This dumps, for each table, one .sql file with the table definition and one .txt file with the data, into the directory given to --tab.
Then when I import, I run the .sql files to define tables:
mysqladmin create mydatabase
cat *.sql | mysql mydatabase
Then I import all the data files:
mysqlimport --local --use-threads=4 mydatabase *.txt
In general, running mysqlimport is faster than running the insert statements output by default by mysqldump. And running mysqlimport with multiple threads should be faster too, as long as you have the CPU resources to spare.
Using locks when you restore does not help performance.
Disabling keys is intended to defer index creation until after the data is fully loaded and keys are re-enabled, but this helps only for non-unique indexes on MyISAM tables. And you shouldn't use MyISAM tables anyway.
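For reference, mysqldump's default output already wraps each table's inserts in that pattern; it looks roughly like this (table name and values are illustrative):
/*!40000 ALTER TABLE `mytable` DISABLE KEYS */;
INSERT INTO `mytable` VALUES (1,'a'),(2,'b');
/*!40000 ALTER TABLE `mytable` ENABLE KEYS */;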
For more information, read:
https://dev.mysql.com/doc/refman/5.7/en/mysqldump.html
https://dev.mysql.com/doc/refman/5.7/en/mysqlimport.html
I'm trying to bulk load around 12 million records into an InnoDB table in a (local) MySQL instance using LOAD DATA INFILE (from CSV), and I'm finding it takes a very long time to complete.
The primary key type is UUID and the keys are unsorted in the data files.
I've split the data file into files containing 100000 records and import it as:
mysql -e 'ALTER TABLE customer DISABLE KEYS;'
for file in *.csv; do
  mysql -e "SET sql_log_bin=0; SET FOREIGN_KEY_CHECKS=0; SET UNIQUE_CHECKS=0;
  SET AUTOCOMMIT=0; LOAD DATA INFILE '${file}' INTO TABLE customer
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'; COMMIT"
done
This works fine for the first few hundred thousand records but then the insert time for each subsequent load seems to keep growing (from around 7 seconds to around 2 minutes per load before I killed it.)
I'm running on a machine with 8GB RAM and have set the InnoDB parameters to:
innodb_buffer_pool_size =1024M
innodb_additional_mem_pool_size =512M
innodb_log_file_size = 256M
innodb_log_buffer_size = 256M
I've also tried loading a single CSV containing all rows with no luck - this ran in excess of 2 hours before I killed it.
Is there anything else that could speed this up as this seems like an excessive time to only load 12m records?
If you know the data is "clean", then you can drop indexes on the affected tables prior to the import and then re-add them after it is complete.
Otherwise, each record causes an index-recalc, and if you have a bunch of indexes, this can REALLY slow things down.
It's always hard to tell what the cause of a performance issue is, but here are my 2 cents:
Your key being a UUID means it is randomly distributed, which makes the index hard to maintain. The reason is that keys are stored sorted by range in file-system blocks, so random UUIDs arriving one after another make the OS read and write blocks all over the file without leveraging the cache. I don't know if you can change the key, but you could maybe sort the UUIDs in the input file and see if that helps (see the sketch below).
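A sketch of that pre-sort with GNU sort, assuming the UUID is the first comma-separated field (adjust -t/-k to your layout); afterwards you can load the single sorted file, or re-split it with split:
sort -t, -k1,1 *.csv > sorted.csv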
FYI, to understand this issue better I would take a look at this blog post and maybe read the book High Performance MySQL; it has a nice chapter about the InnoDB clustered index.
Good Luck!
Is there anything better (faster or smaller) than pages of plain text CREATE TABLE and INSERT statements for dumping MySql databases? It seems awfully inefficient for large amounts of data.
I realise that the underlying database files can be copied, but I assume they will only work in the same version of MySql that they come from.
Is there a tool I don't know about, or a reason for this lack?
Not sure if this is what you're after, but I usually pipe the output of mysqldump directly to gzip or bzip2 (etc). It tends to be considerably faster than writing the uncompressed dump to disk first, and the output files are much smaller thanks to the compression.
mysqldump --all-databases (other options) | gzip > mysql_dump-2010-09-23.sql.gz
It's also possible to dump to XML with the --xml option if you're looking for "portability" at the expense of consuming (much) more disk space than the gzipped SQL...
Sorry, there is no binary dump for MySQL. However, MySQL's binary logs exist specifically for backup and replication purposes: http://dev.mysql.com/doc/refman/5.5/en/binary-log.html . They are not hard to configure. Only changes such as updates and deletes are logged, so each log file (created automatically by MySQL) is also an incremental backup of the changes in the DB. This way you can save a full snapshot of the DB from time to time (once a month?), store just the log files in between, and in case of a crash restore the latest snapshot and replay the logs.
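A minimal sketch of turning the binary log on in my.cnf (the base name mysql-bin and the retention period are just examples):
[mysqld]
log-bin = mysql-bin
expire_logs_days = 30   # automatically purge binary logs older than 30 days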
It's worth noting that MySQL has a special syntax for doing bulk inserts. From the manual:
INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);
This inserts 3 rows in a single operation, so loading this way isn't as inefficient as one statement per row: instead of 129 bytes across 3 INSERT statements, this is 59 bytes, and that advantage only grows with the number of rows.
I've never tried this, but aren't MySQL tables just binary files on the hard drive? Couldn't you just copy the table files themselves? Presumably that's essentially what you are asking for.
I don't know how to stitch that together, but it seems to me a copy of /var/lib/mysql would do the trick.
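A rough sketch of that approach, assuming you can afford to stop the server while copying (paths and the service name are examples; the copy is only guaranteed consistent with the server stopped):
sudo service mysql stop
cp -a /var/lib/mysql /backup/mysql-copy
sudo service mysql start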