I'm running a query which imports 500k records into a table from a .CSV file.
The query runs for 1h15min, which I think is way too long. I was expecting about 10-20 minutes.
I've already done an insert of 35k records and it took about 30 seconds.
Is there a way of speeding up the process?
The statement I'm calling is the one below:
LOAD DATA LOCAL INFILE 'c:/users/migue/documents/www/mc-cron/excels/BD_YP_vFinal_2.csv'
INTO TABLE leads
FIELDS TERMINATED BY ';' ENCLOSED BY '"'
LINES TERMINATED BY '\n';
The speed of importing rows into the database mainly depends on the structure of the database rather than on the imported data or on the specific command you use to initiate the import. So it would be a lot more useful to show us the CREATE TABLE statement that creates your leads table than the LOAD DATA command.
A few suggestions for fast inserts:
Try disabling any and all constraints on the table, and re-enabling them afterwards.
This includes unique constraints, foreign keys, everything.
Try dropping any triggers before the import, and once the import is done:
put the triggers back
manually perform the operations that the triggers would have performed if they were active during the import
Try dropping all indexes before the import (yes, this also includes the primary key index) and re-creating them once the import is complete; see the sketch below.
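A rough sketch of the whole sequence, assuming InnoDB; the idx_email index and email column are made-up examples, and the session flags assume you can tolerate unchecked data during the load:
-- session-level switches; only safe if the incoming data is already clean
SET unique_checks = 0;
SET foreign_key_checks = 0;

-- hypothetical secondary index; list the real ones with SHOW INDEX FROM leads
ALTER TABLE leads DROP INDEX idx_email;

LOAD DATA LOCAL INFILE 'c:/users/migue/documents/www/mc-cron/excels/BD_YP_vFinal_2.csv'
INTO TABLE leads
FIELDS TERMINATED BY ';' ENCLOSED BY '"'
LINES TERMINATED BY '\n';

-- rebuild the index and turn the checks back on
ALTER TABLE leads ADD INDEX idx_email (email);
SET unique_checks = 1;
SET foreign_key_checks = 1;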
Related
I see programmers everywhere discussing optimisation for the fastest possible LOAD DATA INFILE inserts, but they rarely explain their choice of values, and optimisation depends on the environment and on the actual needs.
So I would like some explanation of the best values to put in my MySQL config file to reach the fastest insert possible, please.
My config: an Intel dual-core @ 3.30 GHz, 4 GB DDR4 RAM (Windows 7 reports "2.16 GB available" though, because of reserved memory).
My backup.csv file is plain text with about 5 billion entries, so it is a huge 500 GB file, formatted like the line below (but with hexadecimal strings of length 64):
"sdlfkjdlfkjslfjsdlfkjslrtrtykdjf";"dlksfjdrtyrylkfjlskjfssdlkfjslsdkjf"
There are only two fields in my table, and the first one is a UNIQUE index.
ROW_FORMAT is set to FIXED to save space, and for the same reason the field type is BINARY(32).
I'm using the MyISAM engine (because InnoDB requires much more space!). (MySQL version 5.1.41)
Here is the code I planned to use for now:
ALTER TABLE verification DISABLE KEYS;
LOCK TABLES verification WRITE;
LOAD DATA INFILE 'G:\\backup.csv'
IGNORE INTO TABLE verification
FIELDS TERMINATED BY ';' ENCLOSED BY '"' LINES TERMINATED BY '\r\n'
(@myhash, @myverif) SET hash = UNHEX(@myhash), verif = UNHEX(@myverif);
UNLOCK TABLES;
ALTER TABLE verification ENABLE KEYS;
As you can see, the LOAD DATA INFILE command takes the plain-text values and UNHEXes them into binary (both fields are hexadecimal hashes anyway, so...).
I have heard about the buffer sizes and all the other values in the MySQL config file. What should I change, and what would the best values be, please?
As you can see, I already lock the table and disable keys to speed things up.
I also read in the documentation:
myisamchk --keys-used=0 -rq /var/lib/mysql/dbName/tblName
Doing that before the insert should speed it up as well. But what exactly is tblName? (I have a .frm file, a .MYD and a .MYI, so which one am I supposed to point to?)
Here are the last short hints I read about optimisation.
EDIT: forgot to mention, everything runs on localhost.
So, I finally managed to insert my 500 GB database of more than 3 billion entries in something like 5 hours.
I tried many ways, and while rebuilding the primary index I was stuck with this error: ERROR 1034 (HY000): Duplicate key 1 for record at 2229897540 against new record at 533925080.
I will now explain how I managed to complete my insert:
I sorted my .csv file with GNU CoreUtils' sort.exe (I'm on Windows). Keep in mind that to do that you need free space equal to 1.5x your csv file for temporary files (so, counting the .csv file itself, 2.5x in total).
You create the table, with indexes and all.
Execute mysqladmin flush-tables -u a_db_user -p
Execute myisamchk --keys-used=0 -rq /var/lib/mysql/dbName/tblName
Insert the data (DO NOT USE ALTER TABLE tblname DISABLE KEYS; !!!):
LOCK TABLES verification WRITE;
LOAD DATA INFILE 'G:\\backup.csv'
IGNORE INTO TABLE verification
FIELDS TERMINATED BY ';'
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(@myhash, @myverif) SET hash = UNHEX(@myhash), verif = UNHEX(@myverif);
UNLOCK TABLES;
Once the data is inserted, rebuild the indexes by executing myisamchk --key_buffer_size=1024M --sort_buffer_size=1024M -rqq /var/lib/mysql/dbName/tblName
(note the -rqq: doubling the q will ignore possible duplicate-key errors by trying to repair them, instead of just stopping after many hours of waiting!)
Execute mysqladmin flush-tables -u a_db_user -p
And I was done!
I noticed a huge boost in speed when the .csv file is on a different drive than the database, and the same goes for the sort operation: put the temp files on another drive. (Read/write speed is better when both sets of data are not in the same place.)
The source of this was, again, found here: credits to this solution.
I'm pretty sure it is just verification, not verification.MYD or either of the other two. .MYD is data, .MYI is indexes, .frm is schema.
How long are the strings? Are they hex? If they are 32 hex digits, don't you want BINARY(16) for the output of the UNHEX?
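To illustrate the size question (a quick check, independent of the OP's exact schema): UNHEX halves the length, so 32 hex digits fit in BINARY(16), while 64 hex digits need BINARY(32).
-- 32 hex characters unhex to 16 bytes; 64 hex characters unhex to 32 bytes
SELECT LENGTH(UNHEX(REPEAT('ab', 16))) AS bytes_from_32_hex,  -- returns 16
       LENGTH(UNHEX(REPEAT('ab', 32))) AS bytes_from_64_hex;  -- returns 32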
The long part of the process will probably be ENABLE KEYS, which is when it builds the index. Do SHOW PROCESSLIST; while it is running. If it says "using keybuffer", kill it; it will take forever. If it says something like "building by repair", that is good: it is sorting, then loading the index efficiently.
You can save 5GB of disk space by setting myisam_data_pointer_size=5 before starting the process. Seems like there is also myisam_index_pointer_size, but it may be defaulted to 5, which is probably correct for your case. (I encountered that setting once on ver 4.0 in about 2004; but never again.)
I don't think key_buffer_size will matter during the load and indexing -- since you really want it not to use the key_buffer. Don't set it so high that you run out of RAM. Swapping is terrible for performance.
I'm working on database optimization where there is a bulk insert from a .csv file (around 3,800 records) every 15 minutes.
For this, I'm running a mis.sql file through cron. This file contains nine (09) MySQL queries that perform duplicate removal on the table targeted by the bulk insert, plus inner-join inserts, deletes and updates (ALTER, DELETE, INSERT & UPDATE).
Recently, a problem has been occurring with a query that runs just prior to the bulk insert. The query is:
ALTER IGNORE TABLE pb ADD UNIQUE INDEX(hn, time);
ERROR 1069 (42000): Too many keys specified; max 64 keys allowed
On encountering the above error, all the subsequent queries are skipped. I then checked table pb and found that there are 64 unique index keys created with the same cardinality, along with 02 index keys and 01 primary key.
While trying to remove one of the unique indexes, it takes far too much time (almost 15 minutes for 979,618 records), and in the end the index is not removed.
Is there any solution to this problem?
The first thing: why is there an ALTER TABLE command in that script at all? New data should change the data, not the database design. So while INSERT, UPDATE and DELETE are valid operations in such a script, ALTER TABLE doesn't belong there. Remove it.
As to deleting the index: that should only take a fraction of a second. There is nothing to build or rebuild, only something to remove.
DROP INDEX index_name ON tbl_name;
The only reason I can think of for this taking so long is that there isn't even a short time slice in which no inserts, updates or deletes take place. So maybe you'll have to stop your job for a moment (or run it on an empty file), drop all those unnecessary indexes (keep only one), and start your job again.
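A minimal sketch of that cleanup, assuming the redundant indexes were all created on (hn, time); the names hn_2, hn_3, ... are a guess based on MySQL's auto-naming, so check SHOW INDEX first:
-- see what is actually there
SHOW INDEX FROM pb;

-- drop the redundant copies, keeping a single UNIQUE(hn, time); index names are assumptions
ALTER TABLE pb
  DROP INDEX hn_2,
  DROP INDEX hn_3,
  DROP INDEX hn_4;   -- ...and so on up to the last duplicate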
This is a two part question.
The first is what architecture should I use for the following issue?
And the second is the how, i.e. what commands should I use?
I have some log files I want to read into a database. The log files contain fields that are unnecessary (because they can be calculated from other fields).
Approach 1: Should I parse each line of the log file and insert it into the database?
Con: The log entries have to be unique, so I need to first do a SELECT, check if the LogItemID exists, and then INSERT if it doesn’t. This seems to be a high overhead activity and at some point this will be done on an hourly basis.
Approach 2: Or do I use LOAD DATA INFILE (can I even use that in PHP?) and just load the log file into a temporary table, then move the records into the permanent table?
Con: Even in this method though, I will still have to go through the cycle of SELECT, then INSERT.
Approach 3: Or is there a better way? Is there a command to bulk-copy records from one table to another with selected fields? Will REPLACE INTO ... ON DUPLICATE UPDATE work (I don't want to UPDATE if the item exists, just ignore it) as long as LogItemID is set to UNIQUE? Either way, I need to throw the extraneous fields out. Which of these approaches is better, not just easier, but from the standpoint of writing good, scalable code?
P.S. Unrelated, but part of the Architecture issue here is this...
If I have StartTime, EndTime and Interval (EndTime-StartTime), which should I keep - the first two or Interval? And Why?
Edit: To clarify why I did not want to store all three fields - the issue is of course normalization and therefore not good practice. For audit reasons, perhaps I'll store them. Perhaps in another table?
TIA
LOAD DATA INFILE is going to be a lot faster than running individual inserts.
You could load into a separate, temporary table and then run an INSERT ... SELECT from the temporary table into your actual store, but it's not clear why you would need to do that. To "skip" some fields in the CSV, just assign those to dummy user-defined variables; there's no need to load those fields into the temporary table.
I'd define a UNIQUE key (constraint) and just use INSERT IGNORE; that will be a lot faster than running a separate SELECT, and faster than a REPLACE. (This assumes you have no need to update the existing row; you just want to "ignore" the new row.)
LOAD DATA INFILE 'my.csv'
IGNORE
INTO TABLE mytable
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
( mycol
, #dummy2
, #dummy3
, #mm_dd_yyyy
, somecol
)
SET mydatecol = STR_TO_DATE(#mm_dd_yyyy,'%m-%d-%Y')
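And if you do go the temporary-table route mentioned above, the second step is just an INSERT IGNORE ... SELECT; a sketch with hypothetical staging and target table names:
-- staging_log and log_items are made-up names; LogItemID must have a UNIQUE key in log_items
INSERT IGNORE INTO log_items (LogItemID, start_time, end_time)
SELECT LogItemID, start_time, end_time
FROM staging_log;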
If you have start, end and duration, go ahead and store all three. There's redundancy there; the main issues are performance and update anomalies. (If you update end, should you also update duration?) If I don't have a need to do updates, I'd just store all three. I could calculate duration from start_time and end_time, but having the column stored would allow me to add an index and get better performance on queries looking for durations of less than 10 minutes, or whatever. Absent the column, I'd be forced to evaluate an expression for every row in the table, and that gets expensive on large sets.
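A small sketch of what storing all three could look like; the table and column names are made up, not taken from the question:
-- duration stored redundantly so it can be indexed
CREATE TABLE log_events (
  LogItemID  BIGINT UNSIGNED NOT NULL,
  start_time DATETIME NOT NULL,
  end_time   DATETIME NOT NULL,
  duration_s INT UNSIGNED NOT NULL,   -- end_time - start_time, in seconds
  UNIQUE KEY uq_logitem (LogItemID),
  KEY idx_duration (duration_s)
);

-- a query like this can then use idx_duration instead of evaluating an expression per row:
SELECT * FROM log_events WHERE duration_s < 600;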
You could use perl to parse out a subset of the csv fields you want to load, then use the command 'uniq' to remove duplicates, then use LOAD DATA INFILE to load the result.
Typically, loading data into a temporary table and then traversing it is slower than preprocessing the data ahead of time. As for the LogItemID, if you set it to UNIQUE, the inserts should fail when you load subsequent matching lines.
When it comes to deciding whether to store StartTime + Interval (more typically called Duration) or StartTime and EndTime, it really depends on how you plan to use the resulting table. If you store the duration and are constantly computing the end time, it might be better to just store the start/end. If you believe the duration will be commonly used, store it. Depending on how big the database is, you might decide to just store all three; one more field may not add much overhead.
I am very new to the database game so forgive my ignorance.
I am loading millions of rows into a simply structured MySQL database table:
SQLStr = "LOAD DATA LOCAL INFILE 'f:/Smallscale/02 MMToTxt/flat.txt'
INTO TABLE `GMLObjects` FIELDS TERMINATED BY ','
LINES STARTING BY 'XXX';"
At the moment the table has a unique constraint on one field (no duplicates allowed).
However, I am wondering if it would be quicker to remove the no-duplicates rule and deal with the duplicates later, either with ALTER TABLE or SELECT DISTINCT, or some such query.
What are your thoughts?
P.S the database engine is InnoDB
You can not "alter" a table with duplicates into a table without.
MySQL can not know which of the rows it should remove. It would mean a lot of work and trouble to delete them later, and it would produce deleted entries without any benefit.
So avoid duplicates as early as possible.
Why load duplicates into your DB in the first place?
Avoid them as early as possible. That is better for performance, and you don't have to write complicated queries.
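One way to keep the unique constraint and still let the bulk load run straight through is the IGNORE modifier, which makes MySQL skip rows that would violate the unique key. This is just a sketch adapted from the statement in the question, not something prescribed above:
LOAD DATA LOCAL INFILE 'f:/Smallscale/02 MMToTxt/flat.txt'
IGNORE
INTO TABLE `GMLObjects` FIELDS TERMINATED BY ','
LINES STARTING BY 'XXX';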
Issue: hundreds of tables with identical schemas. Some of these have duplicated data that needs to be removed. My usual strategy for this (sketched in SQL after the list below) is:
walk list of tables - for each do
create temp table with unique key on all fields
insert ignore select * from old table
truncate original table
insert select * back into original table
drop or clean temp table
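A minimal SQL version of that per-table sequence, assuming a hypothetical table t with columns col_a, col_b, col_c; adjust the names and the unique key to your schema:
-- 1. temp copy with a unique key across all fields
CREATE TABLE tmp_t LIKE t;
ALTER TABLE tmp_t ADD UNIQUE KEY uq_all_fields (col_a, col_b, col_c);

-- 2. de-duplicate while copying
INSERT IGNORE INTO tmp_t SELECT * FROM t;

-- 3. put the de-duplicated data back
TRUNCATE TABLE t;
INSERT INTO t SELECT * FROM tmp_t;

-- 4. clean up
DROP TABLE tmp_t;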
For smaller tables this works fine. Unfortunately, the tables I'm cleaning often have hundreds of millions of records, so my jobs and client connections are timing out while I'm running this. (Since there are hundreds of these tables, I'm using Perl to walk the list and clean each one; this is where the timeout happens.)
Some options I'm looking into:
mysqldump - fast, but I don't see how to do the subsequent 'insert ignore' step
into outfile / load infile - also fast, but I'm running from a remote host and 'into outfile' creates all the files on the MySQL server, which is hard to clean up
do the insert/select in blocks of 100K records - this avoids the DB timeout, but it's pretty slow
I'm sure there is a better way. Suggestions?
If an SQL query to find the duplicates can complete without timing out, you should be able to do a SELECT with COUNT() and a GROUP BY, with a HAVING clause that restricts the output to only the groups with duplicate data (HAVING COUNT(*) > 1). The results of this SELECT can be placed INTO a temporary table, which can then be joined with the primary table for the DELETE query.
This approach uses the set-operation strengths of SQL/MySQL -- no need for Perl coding.
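A sketch of that approach for a single table; the table name t, the duplicate-defining columns col_a/col_b and the surrogate key id are all assumptions:
-- collect one surviving id per duplicated value combination
CREATE TEMPORARY TABLE dupes AS
SELECT col_a, col_b, MIN(id) AS keep_id
FROM t
GROUP BY col_a, col_b
HAVING COUNT(*) > 1;

-- delete every duplicate row except the one we chose to keep
DELETE t
FROM t
JOIN dupes d ON t.col_a = d.col_a AND t.col_b = d.col_b
WHERE t.id <> d.keep_id;

DROP TEMPORARY TABLE dupes;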