Every month, I do some analysis on a customer database. My predecessor would create a segment in Eloqua (our CRM) for each country, and then spend about 10 (tedious, slow) hours refreshing them all. When I took over, I knew that I wouldn't be able to do it in Excel (we had over 10 million customers), so I used Access.
This process has worked pretty well. We're now up to 12 million records, and it's still going strong. However, when importing the master list of customers prior to doing any work on it, the database is inflating. This month it hit 1.3 GB.
Now, I'm not importing ALL of my columns - only 3. And Access freezes if I try to do my manipulations on a linked table. What can I do to reduce the size of my database during import? My source files are linked CSVs with only the bare minimum of columns; after I import the data, my next steps have to be:
Manipulate the data to get counts instead of individual lines
Store the manipulated data (only a few hundred KB)
Empty my imported table
Compact and Repair
This wouldn't be a problem, but I have to do all of this 8 times (8 segments, each showing a different portion of the database), and the 2 GB limit is looming on the horizon.
An alternate question might be: How can I simulate / re-create the "Linked Table" functionality in MySQL/MariaDB/something else free?
For that many records, MS Access with its 2 GB limit is not a good solution for data storage. I would use MySQL as the backend:
Create a table in MySQL and link it to MS Access
Import the CSV data directly into the MySQL table using MySQL's native import features (a sketch follows below). Access can of course be used for the import, but it will be slower.
Use Access for your data analysis, treating the linked MySQL table as a regular table.
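A minimal sketch of that import, assuming a hypothetical customers table with placeholder names for your three imported columns (LOAD DATA LOCAL INFILE requires local_infile to be enabled on both client and server):

    -- Hypothetical table for the three imported columns; adjust names and types.
    CREATE TABLE customers (
        email   VARCHAR(255),
        country VARCHAR(64),
        segment VARCHAR(64)
    ) ENGINE=InnoDB;

    -- MySQL's native bulk load is far faster than pushing rows through Access.
    LOAD DATA LOCAL INFILE '/path/to/customers.csv'
    INTO TABLE customers
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES
    (email, country, segment);

Once the table is linked into Access, your count queries run against the data stored in MySQL, so the raw rows no longer count against the 2 GB .accdb limit.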
You could import the CSV to a (new/empty) separate Access database file.
Then, in your current application, link the table from that file. Access will not freeze during your operations the way it does when the text files are linked directly.
I've got a production database with a wp_options table reportedly totalling around 951,679,500,288 bytes (900 GB+) in data length. However, when I export the database and examine it locally, it comes to only a few MB (usually 3-7 MB).
There are about 2,000-10,000 rows of data in this table. The reason for this fluctuation is that a great deal of transient cache data is stored in this table, and a cron job is scheduled to remove it routinely. That's why there is a discrepancy in the number of rows in the two screenshots. Otherwise, I have checked numerous times and the non-transient data is all exactly the same and present in both environments.
It's as if there's almost a TB of garbage data hiding in this table that I can't access or see, and it's only present on production. Staging and local environments with the same database operate just fine without the missing ~TB of data.
[Screenshots omitted: summary of the table on production, summary of the same table on local, and a comparison of both database sizes.]
What could be causing the export of a SQL file to disregard 900 GB of data? I've exported SQL and CSV via Adminer as well as using the 'wp db export' command.
And how could there be 900GB of data on production that I cannot see or account for other than when it calculates the total data length of the table?
It seems like deleted rows have not been purged. You can try OPTIMIZE TABLE.
Some WP plugins create "options" but fail to clean up after themselves. Suggest you glance through that huge table to see what patterns you find in the names. (Yeah, that will be challenging.) Then locate the plugin, and bury it more than 6 feet under.
OPTIMIZE TABLE might clean it up. But you probably don't know what the setting of innodb_file_per_table was when the table was created. So, I can't predict whether it will help a lot, take a long time, not help at all, or even crash.
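If it helps, a rough way to do that scan and the cleanup might look like the following (these are the standard wp_options columns; the prefix grouping is only a heuristic for spotting the offending plugin):

    -- Which option_name prefixes take the most space?
    SELECT SUBSTRING_INDEX(option_name, '_', 2) AS prefix,
           COUNT(*)                              AS row_count,
           SUM(LENGTH(option_value))             AS approx_bytes
    FROM wp_options
    GROUP BY prefix
    ORDER BY approx_bytes DESC
    LIMIT 20;

    -- Reclaim space from deleted/expired rows. For InnoDB this rebuilds the table,
    -- so expect it to take time and to need free disk space.
    OPTIMIZE TABLE wp_options;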
I have to work with a database which exceeds the 2 GB limit. I have tried splitting the database as well, but I'm not able to do it. Please suggest what I can do to solve this problem.
When faced with the inherent 2GB limit on the size of your MS Access database, there are several steps that you can undertake, depending on how aggressively you need to reduce the size of the database:
Step 0: Compact the Database
The obvious first step, but I'll mention it here in case it has been overlooked.
Step 1: Splitting the Database
The operation of splitting the database will separate the 'front end' data (queries, reports, forms, macros etc.) from the 'back end' data (the tables).
To do this, go to Database Tools > Move Data > Access Database
This operation will export all tables in your current database to a separate .accdb database file, and will then link the tables from the new .accdb database file back into the original database.
As a result of this operation, the size of the back end database will be marginally reduced, since it no longer contains the definitions of the various front end objects, along with resources such as images which may have been used on reports/forms and which may have contributed more towards the overall size of the original database.
But since the vast majority of the data within the file will be stored in the database tables, you will only see marginal gains in database size following this operation.
If this initial step does not significantly reduce the size of the back end database below the 2GB limit, the next step might be:
Step 2: Dividing the Backend Database
The in-built operation offered by MS Access to split the database into a separate frontend and backend database will export all tables from the original database into a single backend database file, and will then relink such tables into the front end database file.
However, if the resulting backend database is still approaching the 2GB limit, you can divide the backend database further into separate smaller chunks - each with its own 2GB limit.
To do this, export the larger tables from your backend database into a separate .accdb database file, and then link the tables in this new database file to your frontend database in place of the originals.
Taking this process to the limit would result in each table residing within its own separate .accdb database file.
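As a rough illustration, a large table can be pushed into its own back-end file with Access SQL along these lines (hypothetical table and file names; the target .accdb must already exist, and the table is then linked back into the front end):

    SELECT * INTO tblOrders IN 'C:\Data\Orders_BE.accdb'
    FROM tblOrders;

After the copy, delete the original table from the old back end and compact it to recover the space.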
Step 3: Dividing Table Data
This is the very last resort and the feasibility of this step will depend on the type of data you are working with.
If you are working with dated data, you might consider exporting all data dated prior to a specific cutoff date into a separate table within a separate .accdb database file, and then linking the two separate tables into your frontend database (such that you have a 'live' table and an 'archive' table).
Note however that you will not be able to union the data from the two tables within a single query, as the 2GB MS Access limit applies to the amount of data that MS Access is able to manipulate within the single .accdb file, not just the data which may be stored in the tables.
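By way of illustration, the live/archive split described above might be done with Access SQL like this (hypothetical table, column, date and file names; the target Archive.accdb must already exist):

    SELECT * INTO tblSalesArchive IN 'C:\Data\Archive.accdb'
    FROM tblSales
    WHERE SaleDate < #1/1/2023#;

    DELETE FROM tblSales
    WHERE SaleDate < #1/1/2023#;

Run a Compact and Repair on the live back end afterwards so the deleted rows actually release their space.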
Step 4: Migrate to another RDBMS
If you're frequently hitting the 2GB limit imposed by an MS Access database and find yourself sabotaging the functionality of your database as a result of having to dice up the data into ever smaller chunks, consider opting for a more heavyweight database management system, such as SQL Server, for example.
You can split your database into several files. The feature exists in the menu Database Tools > Move data.
You can read the Microsoft documentation about it.
But prepare yourself to migrate to a new RDBMS in the near future, because you are reaching the system's limits...
We have a need to do the initial data copy on a table that has 4+ billion records to target SQL Server (2014) from source MySQL (5.5). The table in question is pretty wide with 55 columns, however none of them are LOB. I'm looking for options for copying this data in the most efficient way possible.
We've tried loading via Attunity Replicate (which has worked wonderfully for tables not this large), but if the initial data copy with Attunity Replicate fails, it starts over from scratch ... losing whatever time was spent copying the data. With patching and the possibility of this table taking 3+ months to load, Attunity wasn't the solution.
We've also tried smaller batch loads with a linked server. This is working but doesn't seem efficient at all.
Once the data is copied we will be using Attunity Replicate to handle CDC.
For something like this I think SSIS would be the simplest option. It's designed for large inserts, as big as 1 TB. In fact, I'd recommend the MSDN article "We loaded 1TB in 30 Minutes and so can you".
Doing simple things like dropping indexes and performing other optimizations like partitioning will make your load faster (a small sketch follows below). While 30 minutes isn't a feasible time to shoot for here, it would be a very straightforward task to have an SSIS package run outside of business hours.
My business doesn't have a load on the scale you do, but we do refresh databases of more than 100M rows nightly, and that doesn't take more than 45 minutes, even though it's poorly optimized.
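To illustrate the index suggestion, a minimal T-SQL sketch on the destination side might look like this (table and index names are hypothetical):

    -- Disable nonclustered indexes so the bulk insert only maintains the base table.
    -- (Do not disable the clustered index; that makes the table inaccessible.)
    ALTER INDEX IX_BigTable_Col1 ON dbo.BigTable DISABLE;
    ALTER INDEX IX_BigTable_Col2 ON dbo.BigTable DISABLE;

    -- ... run the SSIS data flow here, with the OLE DB destination in fast-load mode ...

    -- Rebuild everything once the load has finished.
    ALTER INDEX ALL ON dbo.BigTable REBUILD;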
One of the most efficient ways to load huge volumes of data is to read them in chunks.
I have answered many similar questions for SQLite, Oracle, DB2 and MySQL. You can refer to one of them to get more information on how to do that using SSIS:
Reading Huge volume of data from Sqlite to SQL Server fails at pre-execute (SQLite)
SSIS failing to save packages and reboots Visual Studio (Oracle)
Optimizing SSIS package for millions of rows with Order by / sort in SQL command and Merge Join (MySQL)
Getting top n to n rows from db2 (DB2)
On the other hand, there are many other suggestions, such as dropping the indexes on the destination table and recreating them after the insert, creating the needed indexes on the source table, and using the fast-load option to insert the data.
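As a sketch of what one chunk of the source read could look like on the MySQL side (assuming an auto-increment primary key named id; the package substitutes the last key value it saw after each pass):

    -- First chunk starts at 0; subsequent chunks start from the last id already copied.
    SET @last_id = 0;

    SELECT *
    FROM source_table
    WHERE id > @last_id
    ORDER BY id
    LIMIT 1000000;

Keyset paging like this stays fast at any depth, whereas LIMIT ... OFFSET gets slower as the offset grows into the billions.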
My database is nearly 10GB in size (12 tables roughly equal in size).
What is the proper way of moving this kind of data to a different server?
My thought is to break each table down into several files, each containing around 100,000 rows of the given table, and then loop through all the files on the new machine.
Please let there be a more efficient way; this sounds exhausting.
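For reference, the chunked approach I have in mind would look roughly like this (assuming MySQL; the table name and paths are placeholders, and INTO OUTFILE needs the FILE privilege plus a writable secure_file_priv location):

    -- Export one 100,000-row chunk from the old server...
    SELECT *
    FROM big_table
    WHERE id >= 1 AND id < 100001
    INTO OUTFILE '/var/lib/mysql-files/big_table_chunk_001.csv'
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

    -- ...and load it on the new server, repeating for each chunk/file.
    LOAD DATA INFILE '/var/lib/mysql-files/big_table_chunk_001.csv'
    INTO TABLE big_table
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';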
You can export the database in tar.gz format. When you export the database, you may need to extend the execution time limit.
I exported a 3 GB database to a tar.gz archive of about 400 MB.
I've read a few CSV vs. database debates, and in many cases people recommended the DB solution over CSV.
However, it has never been exactly the same setup as mine.
So here is the setup.
- Every hour, around 50 CSV files are generated, representing performance groups from around 100 hosts
- Each performance group has from 20 to 100 counters
- I need to extract data to create a number of predefined reports (e.g. daily for certain counters and nodes) - this should be relatively static
- I need to extract data ad hoc when needed (e.g. for investigation purposes) based on a variable time period, host, and counter
- In total around 100MB a day (in all 50 files)
Possible solutions?
1) Keep it in CSV
- To create a master CSV file for each performance group and every hour just append the latest CSV file
- To generate my reports using just scripts with shell commands (grep, sed, cut, awk)
2) Load it into a database (e.g. MySQL)
- To create tables mirroring the performance groups and load the CSV files into them
- To generate my reports using SQL queries (a rough sketch follows below)
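To make option 2 concrete, here is a minimal sketch with hypothetical table, counter, and file names:

    -- One table per performance group, one row per host and sample time,
    -- one column per counter (up to ~100 counter columns in the widest groups).
    CREATE TABLE perf_cpu (
        host        VARCHAR(64) NOT NULL,
        sample_time DATETIME    NOT NULL,
        counter_1   DOUBLE,
        counter_2   DOUBLE,
        PRIMARY KEY (host, sample_time)
    );

    -- Append each hourly CSV as it arrives.
    LOAD DATA LOCAL INFILE '/data/perf_cpu_2024-01-01_12.csv'
    INTO TABLE perf_cpu
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES;

    -- Example "daily report for certain counters and nodes".
    SELECT host, DATE(sample_time) AS report_day, AVG(counter_1), MAX(counter_2)
    FROM perf_cpu
    WHERE host IN ('host01', 'host02')
      AND sample_time >= '2024-01-01' AND sample_time < '2024-01-02'
    GROUP BY host, DATE(sample_time);

With the primary key on (host, sample_time), the date and host filters in both the predefined and ad hoc reports are index-supported, which is where a database should beat grepping a year of flat files.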
When I tried to simulate this using just shell commands on the CSV files, it was very fast.
I worry that database queries would be slower (considering the amount of data).
I also know that databases don't like overly wide tables; in my scenario I would need 100+ columns in some cases.
It will be read-only most of the time (only appending new files).
I'd like to keep data for a year, so it would be around 36 GB. Would the database solution still perform OK (a 1-2 core VM with 2-4 GB of memory is expected)?
I haven't simulated the database solution, which is why I'd like to ask whether you have any views/experience with a similar scenario.