So far, every data load I have done has been on enterprise-grade hardware, so I guess I never realized that 3.4 million records counts as big. Now to the question...
I have a local MySQL server on a Windows 7, 64-bit machine with 4 GB of RAM. I am importing a CSV through the standard 'Import into Table' functionality that ships with the Developer package.
My data has around 3,422,000 rows and 18 columns: 3 columns of type double and the rest all text. The CSV file is about 500 MB.
Both the data source (CSV) and the destination (MySQL) are on the same machine, so I guess there is no network bottleneck.
It took almost 7 hours to load 200,000 records. At this speed it might take me 4 days to load the entire data set. Given the widespread popularity of MySQL, I think there has to be a better way to make the data load faster.
I have the data only in CSV format, and the rudimentary approach I can think of is to split it into several blocks and load them separately.
Can you please suggest what optimizations I can do to speed this up?
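For reference, here is a minimal sketch of what a bulk load with MySQL's LOAD DATA LOCAL INFILE could look like from Python using mysql-connector-python. The table name, column layout, file path and line terminator are assumptions, and local_infile must be enabled on the server; the same statement can also be run directly from the mysql client or Workbench.

    # Sketch only: bulk-load the CSV with LOAD DATA LOCAL INFILE instead of
    # row-by-row inserts; table name and file path are hypothetical.
    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(host="localhost", user="user", password="secret",
                                   database="mydb", allow_local_infile=True)
    cur = conn.cursor()
    cur.execute("""
        LOAD DATA LOCAL INFILE 'C:/data/mydata.csv'
        INTO TABLE mytable
        FIELDS TERMINATED BY ',' ENCLOSED BY '"'
        LINES TERMINATED BY '\\r\\n'   -- Windows line endings assumed
        IGNORE 1 LINES                 -- skip the header row
    """)
    conn.commit()
    conn.close()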
Related
I'm new to SSIS and I need your suggestions. I have created an SSIS package which retrieves around 5 million records from source server A and saves the data into a destination server. This process takes nearly 3 hours to complete. Is there any other way to reduce that time? I have tried increasing the buffer size, but it is still the same.
Thanks in Advance.
There are many factors influencing the speed of execution, both hardware and software. Based on the structure of the database, a solution can be determined.
In a test project, I transferred 40 million records in 30 minutes on a system with 4 GB of RAM.
We need to do the initial data copy of a table that has 4+ billion records from source MySQL (5.5) to target SQL Server (2014). The table in question is pretty wide, with 55 columns; however, none of them are LOBs. I'm looking for options for copying this data in the most efficient way possible.
We've tried loading via Attunity Replicate (which has worked wonderfully for tables not this large), but if the initial data copy with Attunity Replicate fails, it starts over from scratch ... losing whatever time was spent copying the data. With patching and the possibility of this table taking 3+ months to load, Attunity wasn't the solution.
We've also tried smaller batch loads with a linked server. This is working but doesn't seem efficient at all.
Once the data is copied we will be using Attunity Replicate to handle CDC.
For something like this I think SSIS would be the simplest option. It's designed for large inserts, as big as 1 TB. In fact, I'd recommend the MSDN article 'We Loaded 1TB in 30 Minutes and so can you'.
Doing simple things like dropping indexes and performing other optimizations like partitioning would make your load faster. While 30 minutes isn't a feasible time to shoot for, it would be a very straightforward task to have an SSIS package run outside of business hours.
My business doesn't have a load on the scale you do, but we do refresh databases of more than 100M rows nightly, which takes no more than 45 minutes even though it is poorly optimized.
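As a rough illustration of the "drop/disable indexes before the load, rebuild after" idea mentioned above, here is what the SQL Server side of that could look like driven from Python with pyodbc. The server, database and table names are hypothetical, and the bulk load itself (e.g. the SSIS package) runs between the two steps.

    # Sketch only: disable nonclustered indexes on the destination table before
    # the bulk load and rebuild them afterwards. The clustered index must stay
    # enabled, otherwise the table becomes inaccessible.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=destserver;"
        "DATABASE=StagingDB;Trusted_Connection=yes;", autocommit=True)
    cur = conn.cursor()

    cur.execute("""
        SELECT name FROM sys.indexes
        WHERE object_id = OBJECT_ID('dbo.BigTable')
          AND type_desc = 'NONCLUSTERED' AND is_disabled = 0
    """)
    index_names = [row.name for row in cur.fetchall()]

    for name in index_names:
        cur.execute(f"ALTER INDEX [{name}] ON dbo.BigTable DISABLE")

    # ... run the SSIS package / bulk load here ...

    for name in index_names:
        cur.execute(f"ALTER INDEX [{name}] ON dbo.BigTable REBUILD")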
One of the most efficient ways to load huge data sets is to read them in chunks.
I have answered many similar questions for SQLite, Oracle, DB2 and MySQL. You can refer to one of them to get more information on how to do that using SSIS:
Reading Huge volume of data from Sqlite to SQL Server fails at pre-execute (SQLite)
SSIS failing to save packages and reboots Visual Studio (Oracle)
Optimizing SSIS package for millions of rows with Order by / sort in SQL command and Merge Join (MySQL)
Getting top n to n rows from db2 (DB2)
On the other hand, there are many other suggestions, such as dropping indexes on the destination table and recreating them after the insert, creating the needed indexes on the source table, and using the fast-load option to insert data ...
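Outside of SSIS, the same chunking idea can be sketched in a few lines of Python with pandas; the file name, table name and connection string below are assumptions, not part of the linked answers.

    # Sketch of chunked loading: read the source 50,000 rows at a time and
    # append each chunk to the destination table instead of loading it all at once.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(
        "mssql+pyodbc://user:password@destserver/StagingDB"
        "?driver=ODBC+Driver+17+for+SQL+Server")

    for chunk in pd.read_csv("huge_file.csv", chunksize=50_000):
        chunk.to_sql("destination_table", engine, if_exists="append", index=False)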
Before diving into the actual coding, I am trying to understand the logistics around Spark.
I have server logs split into 10 CSVs of around 2 GB each.
I am looking for a way to extract some data, e.g. how many failures occurred in a 30-minute period, per server.
(The logs have entries from multiple servers, i.e. there is no predefined ordering by time or by server.)
Is that something I could do with Spark?
If yes, would that mean I need a box with 20+ GB of RAM?
When I operate on RDDs in Spark, does it take the full dataset into account? E.g. would an operation such as ordering by timestamp and server id execute against the full 20 GB dataset?
Thanks!
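As a rough sketch rather than an authoritative answer: Spark processes data partition by partition and can spill to disk, so the full 20 GB generally does not need to fit in RAM. Assuming hypothetical column names (timestamp, server_id, status), counting failures per server per 30-minute window could look like this with the PySpark DataFrame API:

    # Sketch only: count failures per server per 30-minute window with PySpark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("failure-counts").getOrCreate()

    logs = (spark.read
            .option("header", "true")
            .csv("/path/to/logs/*.csv")                    # reads all 10 files
            .withColumn("ts", F.to_timestamp("timestamp")))

    failures = (logs
                .where(F.col("status") == "FAILURE")       # hypothetical failure marker
                .groupBy("server_id", F.window("ts", "30 minutes"))
                .count())

    failures.show(truncate=False)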
Every month, I do some analysis on a customer database. My predecessor would create a segment in Eloqua (Our CRM) for each country, and then spend about 10 (tedious, slow) hours refreshing them all. When I took over, I knew that I wouldn't be able to do it in Excel (we had over 10 million customers) so I used Access.
This process has worked pretty well. We're now up to 12 million records, and it's still going strong. However, when importing the master list of customers prior to doing any work on it, the database is inflating. This month it hit 1.3 GB.
Now, I'm not importing ALL of my columns - only 3. And Access freezes if I try to do my manipulations on a linked table. What can I do to reduce the size of my database during import? My source files are linked CSVs with only the bare minimum of columns; after I import the data, my next steps have to be:
Manipulate the data to get counts instead of individual lines
Store the manipulated data (only a few hundred KB)
Empty my imported table
Compress and Repair
This wouldn't be a problem, but I have to do all of this 8 times (8 segments, each showing a different portion of the database), and the 2 GB limit is looming on the horizon.
An alternate question might be: How can I simulate / re-create the "Linked Table" functionality in MySQL/MariaDB/something else free?
For such a big number of records, MS Access with its 2 GB limit is not a good solution for data storage. I would use MySQL as the backend:
Create the table in MySQL and link it to MS Access.
Import the CSV data directly into the MySQL table using MySQL's native import features. Of course Access can be used for the data import, but it will be slower.
Use Access for data analysis, using this linked MySQL table as a regular table.
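To make the idea concrete, here is a hedged sketch of doing the "counts instead of individual lines" step inside MySQL, so that Access only ever links to the small summary table. The table and column names (customers, country, segment, customer_counts) are hypothetical, done here with mysql-connector-python.

    # Sketch only: rebuild a small summary table in MySQL each month; the 12M-row
    # customers table never has to live inside the .accdb file.
    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(host="localhost", user="user",
                                   password="secret", database="crm")
    cur = conn.cursor()

    cur.execute("DROP TABLE IF EXISTS customer_counts")
    cur.execute("""
        CREATE TABLE customer_counts AS
        SELECT country, segment, COUNT(*) AS n_customers
        FROM customers
        GROUP BY country, segment
    """)
    conn.commit()
    conn.close()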
You could import the CSV to a (new/empty) separate Access database file.
Then, in your current application, link the table from that file. Access will not freeze during your operations as it will when linking text files directly.
I have a 2 GB CSV file with 9M records that I import into MongoDB using the native mongoimport tool. It imports the CSV at a rate of 8K records per second, and the overall time taken is 10 minutes. The import speed is quite reasonable, but it is much slower than MySQL's LOAD DATA INFILE (which takes only 2 minutes to insert all of the records into the database). While this is acceptable (MongoDB is built for JSON-type objects, and its speed-ups are generally in querying rather than inserting), I would like to know if there is some way to increase the number of inserts per second that mongoimport achieves.
I have only one computer with 8 GB RAM and 4 cores.
Thanks.
Since the majority of the time is likely spent serializing JSON objects into BSON (MongoDB's native format), you will likely get a faster import if you can split up your file and have several parallel jobs, each running mongoimport on a separate file.
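A minimal sketch of that parallel-import idea: split the CSV into pieces and launch one mongoimport process per piece (with 4 cores you would probably cap this at around 4 concurrent processes). The file paths, field names and database/collection names are assumptions.

    # Sketch only: split the CSV (minus its header) into chunk files and run one
    # mongoimport per chunk in parallel.
    import subprocess
    from itertools import islice

    FIELDS = "field1,field2,field3"    # column names, since the chunks carry no header
    CHUNK_ROWS = 1_000_000

    chunk_files = []
    with open("big.csv", "r", encoding="utf-8") as src:
        next(src)                      # skip the header row
        i = 0
        while True:
            rows = list(islice(src, CHUNK_ROWS))
            if not rows:
                break
            path = f"chunk_{i}.csv"
            with open(path, "w", encoding="utf-8") as out:
                out.writelines(rows)
            chunk_files.append(path)
            i += 1

    procs = [
        subprocess.Popen([
            "mongoimport", "--db", "mydb", "--collection", "mycoll",
            "--type", "csv", "--fields", FIELDS, "--file", path,
        ])
        for path in chunk_files
    ]
    for p in procs:
        p.wait()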