Am looking for advice about whether there is any way to speed up the import of about 250 GB of data into a MySQL table (InnoDB) from eight source CSV files of approx. 30 GB each. The CSVs have no duplicates within themselves, but do contain duplicates between files -- in fact some individual records appear in all 8 CSV files. So those duplicates need to be removed at some point in the process.

My current approach creates an empty table with a primary key, and then uses eight "LOAD DATA INFILE [...] IGNORE" statements to sequentially load each CSV file while dropping duplicate entries. It works great on small sample files. But with the real data, the first file takes about 1 hour to load, the second takes more than 2 hours, the third more than 5, and the fourth more than 9 hours, which is where I'm at right now. It appears that as the table grows, the time required to compare the new data to the existing data increases... which of course makes sense. But with four more files to go, it looks like it might take another 4 or 5 days to complete if I just let it run its course.
Would I be better off importing everything with no indexes on the table and then removing duplicates afterward? Or should I import each of the 8 CSVs into separate temporary tables and then do a union query to create a new consolidated table without duplicates? Or are those approaches going to take just as long?
Plan A
You have a column you can dedup on; let's call it name.
CREATE TABLE New (
    name ...,
    ...
    PRIMARY KEY (name)   -- no other indexes
) ENGINE=InnoDB;
Then, one CSV at a time:
* Sort the csv by name (this makes any caching work better)
* LOAD DATA ... (a rough sketch follows below)
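For illustration, one iteration might look something like this; the file path, the delimiters, and the columns other than name are placeholders, and the file has already been sorted by name outside MySQL (e.g. with the Unix sort utility):

-- file1_sorted.csv was pre-sorted by name outside MySQL
LOAD DATA INFILE '/path/file1_sorted.csv'
    IGNORE INTO TABLE New
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    (name, col2, col3);   -- col2, col3 stand in for the real columns

Repeat the same statement for each of the eight files; IGNORE silently skips any row whose name already exists in New.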
Yes, something similar to Plan A could be done with temp tables, but it might not be any faster.
Plan B
Sort all the csv files together, dropping duplicates as you go (the Unix "sort" utility can probably do this in a single command), then load the merged result in one pass.
Plan B is probably fastest, since it is extremely efficient in I/O.
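As a rough illustration of Plan B (the file names, delimiter, and sort command are assumptions; sort -u only drops lines that are identical in their entirety, so keeping IGNORE still protects against rows that share a name but differ elsewhere):

-- Outside MySQL, merge-sort the eight files and drop duplicate lines,
-- e.g. something like:  sort -u file1.csv file2.csv ... file8.csv > all_sorted.csv
-- Then a single pass loads the merged file into the table with PRIMARY KEY (name):
LOAD DATA INFILE '/path/all_sorted.csv'
    IGNORE INTO TABLE New
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';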
Related
I have a basic question about database design.
I have a lot of files that I have to read and insert into a database. Each file has a few thousand lines, and each line has about 30 fields (of these types: smallint, int, bigint, varchar, json). Of course I use multiple threads along with bulk inserts to increase insert speed (in the end I have 30-40 million records).
After inserting, I want to run some sophisticated analysis, and that performance is important to me.
Once I have parsed each line's fields and am ready to insert, I have 3 approaches:
1- One big table:
In this case I create one big table with 30 columns and store all of the files' fields in it. So there is one huge table that I want to run a lot of analysis on.
2- A fairly large table (A) and some small tables (B):
In this case I create some small tables consisting of the columns whose values are largely identical across records once they are separated from the other columns. These small tables have only a few hundred or a few thousand records instead of 30 million. In the fairly large table (A), I omit the columns that I moved into the other tables and use a foreign key in their place. In the end I have a table (A) with 20 columns and 30 million records, and some tables (B) with 2-3 columns and 100-50,000 records each. To analyze table A, I then have to use some joins in my SELECTs (a rough sketch of this layout appears after option 3).
3- Just a fairly large table:
In this case I create a fairly large table like table A above (with 20 columns), but instead of using foreign keys I use a mapping between source values and destination values (something like a foreign key, but with a small difference). For example, I have 3 columns c1, c2, c3 that in case 2 I would put in another table B and reach via a foreign key; here I instead assign a specific number to each distinct (c1, c2, c3) combination at insert time and keep the mapping between the combination and its assigned value in the application code. So this table looks exactly like table A in case 2, but there is no need for a join in the SELECT.
While insert time is important, the analysis time matters even more to me, so I would like your opinion about which of these cases is better, and I would also be glad to see other solutions.
From a design perspective, 30 to 40 million rows is not that bad a number. Performance depends entirely on how you design your database.
If you are using SQL Server, you could consider putting the large table on a separate database filegroup. I have worked on a similar case where we had around 1.8 billion records in a single table.
For the analysis, if you are not going to look at the entire data set in one shot, you could consider partitioning the data. You could use a partition scheme based on your needs; one example would be to split the data into yearly partitions, which will help if your analysis is limited to a year's worth of data (just an example).
The major decisions would be denormalization/normalization based on your needs and, of course, clustered/nonclustered indexing of the data. Again, this will depend on what sort of analysis queries you will be running.
A single thread can INSERT one row at a time and finish 40M rows in a day or two. With LOAD DATA, you can do it in perhaps an hour or less.
But is loading the real question? For grouping, summing, etc., the question is about SELECT. For "analytics", the question is not one of table structure. Have a single table for the raw data, plus one or more "summary tables" to make the SELECTs really fast for your typical queries.
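No column details were given, so this is only a generic, made-up sketch of the raw-table-plus-summary-table idea:

-- Raw data stays in one table (columns invented for illustration)
CREATE TABLE raw_data (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    event_date DATE NOT NULL,
    category VARCHAR(50) NOT NULL,
    amount DECIMAL(12,2) NOT NULL
    -- ... the other ~26 columns ...
) ENGINE=InnoDB;

-- One summary table per typical query pattern, pre-aggregated
CREATE TABLE daily_summary (
    event_date DATE NOT NULL,
    category VARCHAR(50) NOT NULL,
    row_count INT UNSIGNED NOT NULL,
    total_amount DECIMAL(16,2) NOT NULL,
    PRIMARY KEY (event_date, category)
) ENGINE=InnoDB;

-- Refreshed after each bulk load:
INSERT INTO daily_summary
SELECT event_date, category, COUNT(*), SUM(amount)
FROM raw_data
GROUP BY event_date, category
ON DUPLICATE KEY UPDATE
    row_count = VALUES(row_count),
    total_amount = VALUES(total_amount);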
Until you give more details about the data, I cannot give more details about a custom solution.
Partitioning (vertical or horizontal) is unlikely to help much in MySQL. (Again, details needed.)
Normalization shrinks the data, which leads to faster processing. But, it sounds like the dataset is so small that it will all fit in RAM?? (I assume your #2 is 'normalization'?)
Beware of over-normalization.
I have an external CSV file that I need to import into a MySQL database: the CSV has 473 columns (144k rows), which in my opinion is too many columns for one single table.
The problem: I was thinking of doing some normalization and splitting the data into more tables, but this will require extra work whenever a new CSV is released (with more or fewer columns).
Is it okay if I keep the structure of the CSV/table intact and have one big table? What is the performance impact of each approach on MySQL/Doctrine?
The data:
I don't have ownership of this data to split it into more tables: it comes from public government resources as-is, with no duplicate columns, so there's no obvious way to split it :( I must take it as it is... Any additional categorization/splitting is extra work and may change on the next update of the data.
Digging deeper into the CSV data, I found an interesting kind of organization: it can be split into 18 different tables (providers).
Each table has its own columns (some columns exist in multiple tables), but the largest one has around 180 columns.
That is how far I can split the data: since I don't have ownership of the CSV, I cannot go ahead and just group similar columns/tables.
Good day everyone, I need a suggestion on how to efficiently upload data from fairly large files into a MySQL database. I have two files, 5.85 GB and 6 GB of data. For uploading I have used LOAD DATA LOCAL INFILE. The first file is still uploading (36 hours so far). The current index size is 7.2 GB. I have two questions:
1) The data is formatted like: {string, int, int, int, int}. I do not need the int values, so I created a table with one field of type varchar(128), and my query is LOAD DATA LOCAL INFILE "file" INTO TABLE tale. Will the data be correct (I mean, only the strings, without the int fields)?
2) The larger the index, the longer the next load takes. Am I doing something wrong? What I ultimately need is fast search within those strings (especially on the last word). All of the strings contain exactly 5 words; does it make any sense to put each word in a separate column (n rows, 5 columns)?
Please, any suggestions.
Can you drop the index for now and recreate it once the data is loaded into the table? I think this will work.
Recreating the index will take time.
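A rough sketch of that advice, combined with a column list that discards the four int fields during the load; the table, column, and index names (tale, text_col, ix_text), the file path, and the comma delimiter are all placeholders:

-- Drop the secondary index before loading
ALTER TABLE tale DROP INDEX ix_text;

-- Load only the string; the int fields go into throw-away user variables
LOAD DATA LOCAL INFILE '/path/file'
    INTO TABLE tale
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    (text_col, @skip1, @skip2, @skip3, @skip4);

-- Recreate the index once the data is in (this step will still take a while)
ALTER TABLE tale ADD INDEX ix_text (text_col);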
I've got 2 tables: csv (a CSV dump) and items (the primary data table), with 7M and 15M rows respectively. I need to update a column in items using values that exist in table csv.
Both tables have a commonly indexed join ID (a VARCHAR(255)).
An UPDATE query with a join on the mutual ID column (indexed) still takes multiple days to run. After researching it I believe the inefficiency is in MySQL scanning the csv table and making per-row random-access queries against the items table.
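For reference, the statement being described is of roughly this shape (column names are invented; the real join column is an indexed VARCHAR(255) in both tables):

UPDATE items
JOIN csv ON csv.join_id = items.join_id
SET items.some_column = csv.some_column;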
Even though there are indexes, they don't fit in memory, so the required 7M random-access lookups are killing performance.
Are there "typical" ways of addressing this kind of issue?
Update:
We're basically taking multiple catalogs of "items" and storing them in our items table (this is a bit of a simplification for discussion). Each of, say, 10 catalogs will have 7M items (with some duplicates across catalogs that we normalize to 1 row in our items table). We need to compare and verify changes to those 10 catalogs daily (UPDATEs with joins between two big tables, or some other such mechanism).
In reality we have an items table and an items_map table, but there's no need to discuss that additional level of abstraction here. I'd be happy to find a way to perform an update between the csv dump table and an items table (given that they both have a common ID that's indexed in both tables). But given that the items table might have 20M rows and the csv table might have 7M rows, the indexes don't fit in memory and we're hammering the drive with random seeks, I believe.
Well, I finally put this query on an 8-core box with 12 GB of RAM dedicated to InnoDB, and it did complete, but only after 7 hours!
Our solution: We're moving this process off MySQL. We'll use MapReduce (Hadoop) to maintain the entire large table in flat-file format, do the major update processing in parallel, and then finally use LOAD DATA INFILE to refresh the table in one quick pass (~daily).
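The final load step might look something like the following; the file path, delimiter, and the use of REPLACE (which overwrites existing rows that share the primary key) are assumptions about how that daily refresh could be written:

LOAD DATA INFILE '/path/hadoop_output.csv'
    REPLACE INTO TABLE items
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';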
I have 3 very large tables with clustered indexes on composite keys. There are no updates, only inserts. New inserts will fall outside the existing index range, but they will not arrive in clustered index order, and these tables get a lot of inserts (hundreds to thousands per second).

What I would like to do is DBREINDEX with fill factor = 100, then set a fill factor of 5 and have that fill factor applied ONLY to inserts. Right now a fill factor applies to the whole table. Is there a way to have a fill factor that applies to inserts (or inserts and updates) only? I don't care about select speed at this time; I am loading data. When the data load is complete I will DBREINDEX at 100. A fill factor of 10 versus 30 doubles the rate at which new data is inserted. This load will take a couple of days, and the system cannot go live until the data is loaded. The clustered indexes are aligned with the dominant query used by the end-user application.
My practice is to DBREINDEX daily, but the problem is that now that the tables are getting large, a DBREINDEX at fill factor 10 takes a long time. I have considered loading into "daily" tables and then inserting that data daily, sorted by the clustered index, into the production tables.
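For reference, the two rebuilds being described would look roughly like this in T-SQL; the table and index names are made up:

-- During the load: rebuild with a low fill factor to leave room for out-of-order inserts
DBCC DBREINDEX ('dbo.BigTable', ' ', 10);

-- After the load completes: pack the pages full
DBCC DBREINDEX ('dbo.BigTable', ' ', 100);

-- Newer per-index equivalent:
ALTER INDEX IX_BigTable_Clustered ON dbo.BigTable
    REBUILD WITH (FILLFACTOR = 100);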
If you've read this far, here's even more detail. The indexes are all composite, and I am running 6 instances of the parser on an 8-core server (after a lot of testing, that seems to give the best throughput). The data out of a SINGLE parser is in PK order, and I am doing the inserts 990 values at a time (just under the SQL Server per-statement VALUES limit). The 3 active tables only share data via a foreign key relationship with a single, relatively inactive 4th table. My thought at this time is to have holding tables for each parser and then have another process that polls those tables for the next complete insert and moves the data into the production table in PK order. That is going to be a lot of work. I hope someone has a better idea.
The parses start in PK order but rarely finish in PK order. Some individual parses are so large that I cannot hold all the data in memory until the end. Right now the SQL insert is slightly faster than the parse that creates the data. Within an individual parse I run the insert asynchronously and go on parsing, but I don't start the next insert until the prior one is complete.
I agree you should have holding tables for the parser data and only insert into the main tables when you're ready. I implemented something similar in a former life (data was quasi-hashed into 10 tables based on mod 10 of the unique ID, then rolled into the primary table later, primarily to speed up the load). If you're going to use holding tables, I see no need to have them at anything but FF = 100. The fewer pages you have to use, the better.
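A minimal sketch of the holding-table hand-off, with invented table and key names (one holding table per parser, same columns as production, kept at FF = 100 since it is emptied after each hand-off):

INSERT INTO dbo.Production
SELECT *
FROM dbo.Holding_Parser1
ORDER BY Key1, Key2;   -- feed rows to the clustered index in key order

TRUNCATE TABLE dbo.Holding_Parser1;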
Apparently, too, you should test the difference between permanent tables, #temp tables, and table-valued parameters. :-)