I’ve read a few CSV vs. database debates, and in many cases people recommended the database solution over CSV.
However, it has never been exactly the same setup as mine.
So here is the setup.
- Every hour, around 50 CSV files are generated, each representing a performance group, from around 100 hosts
- Each performance group has from 20 to 100 counters
- I need to extract data to create a number of predefined reports (e.g. daily for certain counters and nodes) - this should be relatively static
- I need to extract data ad hoc when needed (e.g. for investigation purposes) based on a variable time period, host, and counter
- In total around 100MB a day (in all 50 files)
Possible solutions?
1) Keep it in CSV
- To create a master CSV file for each performance group and append the latest CSV file to it every hour
- To generate my reports using just scripts with shell commands (grep, sed, cut, awk)
2) Load it into a database (e.g. MySQL)
- To create tables mirroring the performance groups and load the CSV files into those tables
- To generate my reports using SQL queries (rough sketch below)
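To make option 2 concrete, this is roughly what I have in mind (table, counter, host and file names below are made up for illustration; MySQL syntax assumed):

-- Hypothetical table for one performance group; the real counter columns would differ.
CREATE TABLE perf_cpu (
    host        VARCHAR(64) NOT NULL,
    sample_time DATETIME    NOT NULL,
    counter_1   DOUBLE,
    counter_2   DOUBLE,
    -- ... up to 100+ counter columns ...
    PRIMARY KEY (host, sample_time)
);

-- Hourly load of the latest CSV file for this group.
LOAD DATA INFILE '/data/csv/cpu_latest.csv'
INTO TABLE perf_cpu
FIELDS TERMINATED BY ',' IGNORE 1 LINES;

-- Example ad-hoc report: daily average of one counter for one host over a chosen period.
SELECT DATE(sample_time) AS day, AVG(counter_1) AS avg_counter_1
FROM perf_cpu
WHERE host = 'host042'
  AND sample_time >= '2024-01-01' AND sample_time < '2024-02-01'
GROUP BY DATE(sample_time);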
When I tried to simulate this using just shell commands on the CSV files, it was very fast.
I worry that database queries would be slower (considering the amount of data).
I also know that databases don’t like very wide tables – in my scenario I would need 100+ columns in some cases.
It will be read-only most of the time (only appending new files).
I’d like to keep the data for a year, so it would be around 36GB. Would the database solution still perform OK (1-2 core VM, 2-4GB memory expected)?
I haven’t simulated the database solution, which is why I’d like to ask whether you have any views/experience with a similar scenario.
I've got a production database with a wp_options table reportedly totalling around 951,679,500,288 bytes (900GB+) in data length. However, when I export the database and examine it locally, it comes to only a few MB (usually 3-7MB).
There are about 2,000-10,000 rows of data in this table. The reason for this fluctuation is that a large amount of transient cache data is stored in this table, and a cron job is scheduled to remove it routinely. That's why there is a discrepancy in the number of rows in the two screenshots. Otherwise, I have checked numerous times and the non-transient data is all exactly the same and represented in both environments.
It's like there's almost a TB of garbage data hiding in this table that I can't access or see and it's only present on production. Staging and local environments with the same database operate just fine without the missing ~TB of data.
(Screenshots omitted: summary of the table on production, summary of the same table on local, and a comparison of both database sizes.)
What could be causing the export of a SQL file to disregard 900GB of data? I've exported SQL and CSV via Adminer as well as using the 'wp db export' command.
And how could there be 900GB of data on production that I cannot see or account for other than when it calculates the total data length of the table?
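For reference, the size figures above can be reproduced with a query like the following against information_schema (standard MySQL; only the wp_options table name is assumed), which also shows how much of the total is free/unused space rather than live rows:

SELECT table_name,
       table_rows,
       data_length,
       index_length,
       data_free
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'wp_options';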
It seems like deleted rows have not been purged. You can try OPTIMIZE TABLE.
Some WP plugins create "options" but fail to clean up after themselves. Suggest you glance through that huge table to see what patterns you find in the names. (Yeah, that will be challenging.) Then locate the plugin, and bury it more than 6 feet under.
OPTIMIZE TABLE might clean it up. But you probably don't know what the setting of innodb_file_per_table was when the table was created. So, I can't predict whether it will help a lot, take a long time, not help at all, or even crash.
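A rough sketch of both suggestions, assuming the standard wp_options columns (option_name, option_value):

-- Group option names by prefix to spot a plugin that isn't cleaning up after itself.
SELECT SUBSTRING_INDEX(option_name, '_', 2) AS prefix,
       COUNT(*)                  AS how_many,
       SUM(LENGTH(option_value)) AS approx_bytes
FROM wp_options
GROUP BY prefix
ORDER BY approx_bytes DESC
LIMIT 20;

-- Then try to reclaim the space left behind by purged rows (this rebuilds the table).
OPTIMIZE TABLE wp_options;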
I have to work with a database which exceeds the 2GB limit. I have tried splitting the database as well, but I'm not able to do it. Please suggest what I can do to solve this problem.
When faced with the inherent 2GB limit on the size of your MS Access database, there are several steps that you can undertake, depending on how aggressively you need to reduce the size of the database:
Step 0: Compact the Database
The obvious first step, but I'll mention it here in case it has been overlooked.
Step 1: Splitting the Database
The operation of splitting the database will separate the 'front end' data (queries, reports, forms, macros etc.) from the 'back end' data (the tables).
To do this, go to Database Tools > Move Data > Access Database
This operation will export all tables in your current database to a separate .accdb database file, and will then link the tables from the new .accdb database file back into the original database.
As a result of this operation, the size of the back end database will be marginally reduced, since it no longer contains the definitions of the various front end objects, along with resources such as images which may have been used on reports/forms and which may have contributed more towards the overall size of the original database.
But since the vast majority of the data within the file will be stored in the database tables, you will only see marginal gains in database size following this operation.
If this initial step does not significantly reduce the size of the back end database below the 2GB limit, the next step might be:
Step 2: Dividing the Backend Database
The in-built operation offered by MS Access to split the database into a separate frontend and backend database will export all tables from the original database into a single backend database file, and will then relink such tables into the front end database file.
However, if the resulting backend database is still approaching the 2GB limit, you can divide the backend database further into separate smaller chunks - each with its own 2GB limit.
To do this, export the larger tables from your backend database into a separate .accdb database file, and then link this new separate database file to your frontend database in place of the original table.
Taking this process to the limit would result in each table residing within its own separate .accdb database file.
Step 3: Dividing Table Data
This is the very last resort and the feasibility of this step will depend on the type of data you are working with.
If you are working with dated data, you might consider exporting all data dated prior to a specific cutoff date into a separate table within a separate .accdb database file, and then linking the two separate tables into your frontend database (such that you have a 'live' table and an 'archive' table).
Note however that you will not be able to union the data from the two tables within a single query, as the 2GB MS Access limit applies to the amount of data that MS Access is able to manipulate within the single .accdb file, not just the data which may be stored in the tables.
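As an illustration only, with a hypothetical Orders table and OrderDate column, the archive step could look something like this in Access SQL (the path and the cutoff date are placeholders):

SELECT * INTO Orders_Archive IN 'C:\Data\Archive.accdb'
FROM Orders
WHERE OrderDate < #01/01/2020#;

DELETE FROM Orders
WHERE OrderDate < #01/01/2020#;

The Orders_Archive table in Archive.accdb would then be linked back into the frontend alongside the live Orders table.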
Step 4: Migrate to another RDBMS
If you're frequently hitting the 2GB limit imposed by an MS Access database and find yourself sabotaging the functionality of your database as a result of having to dice up the data into ever smaller chunks, consider opting for a more heavyweight database management system, such as SQL Server, for example.
You can split your database into several files. The feature exists in the menu Database Tools > Move data.
You can read the Microsoft documentation about it.
But prepare yourself to migrate to a new RDBMS in the near future, because you are reaching the system's limits.
In my primary role, I handle laboratory testing data files that can contain upwards of 2000 parameters for each unique test condition. These files are generally stored and processed as CSV formatted files, but that becomes very unwieldy when working with 6000+ files with 100+ rows each.
I am working towards a future database storage and query solution to improve access and efficiency, but I am stymied by the row length limitation of MySQL (specifically MariaDB 5.5.60 on RHEL 7.5). I am using MYISAM instead of InnoDB, which has allowed me to get to around 1800 mostly-double formatted data fields. This version of MariaDB forces dynamic columns to be numbered, not named, and I cannot currently upgrade to MariaDB 10+ due to administrative policies.
Should I be looking at a NoSQL database for this application, or is there a better way to handle this data? How do others handle many-variable data sets, especially numeric data?
For an example of the CSV files I am trying to import, see below. The identifier I have been using is an amalgamation of TEST, RUN, TP forming a 12-digit unsigned bigint key.
Example File:
RUN ,TP ,TEST ,ANGLE ,SPEED ,...
1.000000E+00,1.000000E+00,5.480000E+03,1.234567E+01,6.345678E+04,...
Example key:
548000010001 <-- Test = 5480, Run = 1, TP = 1
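For reference, the key derivation is effectively TEST * 100000000 + RUN * 10000 + TP. If TEST, RUN, and TP are stored as numeric columns, that could be expressed in SQL as something like the following (staging_import is just a placeholder table name, and this assumes RUN and TP never exceed 9999):

SELECT CAST(TEST AS UNSIGNED) * 100000000
     + CAST(RUN  AS UNSIGNED) * 10000
     + CAST(TP   AS UNSIGNED)          AS test_key
FROM staging_import;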
I appreciate any input you have.
The complexity comes from the fact that you have to handle a huge amount of data, not from the fact that it is split over many files with many rows.
Using a database storage & query system will superficially hide some of this complexity, but at the expense of complexity at several other levels, as you have already experienced, including obstacles that are out of your control, like changing versions and conservative admins. Database storage & query systems are made for other application scenarios, where they have advantages that are not pertinent to your case.
You should seriously consider leaving your data in files, i.e. using your file system as your database storage system. Possibly, transcribe your CSV input into a modern self-documenting data format like YAML or HDF5. For queries, you may be better off writing scripts or programs that directly access those files, instead of writing SQL queries.
Every month, I do some analysis on a customer database. My predecessor would create a segment in Eloqua (Our CRM) for each country, and then spend about 10 (tedious, slow) hours refreshing them all. When I took over, I knew that I wouldn't be able to do it in Excel (we had over 10 million customers) so I used Access.
This process has worked pretty well. We're now up to 12 million records, and it's still going strong. However, when importing the master list of customers prior to doing any work on it, the database is inflating. This month it hit 1.3 GB.
Now, I'm not importing ALL of my columns - only 3. And Access freezes if I try to do my manipulations on a linked table. What can I do to reduce the size of my database during import? My source files are linked CSVs with only the bare minimum of columns; after I import the data, my next steps have to be:
Manipulate the data to get counts instead of individual lines
Store the manipulated data (only a few hundred KB)
Empty my imported table
Compress and Repair
This wouldn't be a problem, but I have to do all of this 8 times (8 segments, each showing a different portion of the database), and the 2GB limit is looming on the horizon.
An alternate question might be: How can I simulate / re-create the "Linked Table" functionality in MySQL/MariaDB/something else free?
For such a big number of records, MS Access with its 2GB limit is not a good solution for data storage. I would use MySQL as the backend:
Create a table in MySQL and link it to MS Access
Import the CSV data directly into the MySQL table using MySQL's native import features (see the sketch below). Of course, Access can be used for the data import, but it will be slower.
Use Access for data analysis, using the linked MySQL table as a regular table.
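A minimal sketch of the first two steps, with made-up table and column names, using MySQL's LOAD DATA as the native import:

-- 1) Table in MySQL; link it into Access afterwards as an ODBC linked table.
CREATE TABLE customers_import (
    customer_id BIGINT NOT NULL,
    country     VARCHAR(2),
    segment     VARCHAR(32),
    PRIMARY KEY (customer_id)
);

-- 2) Bulk-load the CSV straight into MySQL (LOCAL reads the file from the client machine).
LOAD DATA LOCAL INFILE '/path/to/master_list.csv'
INTO TABLE customers_import
FIELDS TERMINATED BY ',' IGNORE 1 LINES;

-- The "counts instead of individual lines" step can then be a simple aggregate,
-- run either directly in MySQL or through the linked table in Access.
SELECT country, segment, COUNT(*) AS customers
FROM customers_import
GROUP BY country, segment;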
You could import the CSV to a (new/empty) separate Access database file.
Then, in your current application, link the table from that file. Access will not freeze during your operations as it will when linking text files directly.
I am in the process of setting up a MySQL server to store some data, but realized (after reading a bit this weekend) I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and then sending it to a shared queue to process/analyze. The data is about 5 billion rows (although it's very small data: an ID number in one column and a dictionary of ints in another). Most of the performance reports I have seen show insert speeds of 60 to 100k/second, which would take over 10 hours. We need the data in very quickly so we can work on it that day, and then we may discard it (or archive the table to S3 or something).
What can I do? I have 8 servers at my disposal(in addition to the database server), can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time but I'm also thinking maybe I can load the data onto each of them and then somehow try to merge all the separated data into one server?
I was going to use MySQL with InnoDB (I can use any other settings if that helps), but it's not finalized, so if MySQL doesn't work, is there something else that will? (I have used HBase before but was looking for a MySQL solution first; it seems more widely used and easier to get help with in case I have problems.)
Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple mySQL server instances won't help with loading speed. What will make a difference is fast processor chips and very fast disk IO subsystems on your mySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use a MEMORY access method for your big table, which will be very fast indeed. (But if that will work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using myISAM rather than InnoDB for this table; myISAM's lack of transaction semantics makes it faster to load. myISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
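For example (table and column names are illustrative), with one MyISAM table per day:

-- Today's table; keep indexes to the minimum the queries actually need.
CREATE TABLE events_20240116 (
    id      BIGINT UNSIGNED NOT NULL,
    payload TEXT
) ENGINE=MyISAM;

-- "Getting rid" of yesterday: rename it out of the way instantly,
-- then archive/drop it whenever convenient.
RENAME TABLE events_20240115 TO events_20240115_done;
DROP TABLE events_20240115_done;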
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the mySQL server to read a file from the mySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also trickier to set up in production: your shared queue needs access to the mySQL server's file system to write the data files for loading.
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.
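Putting those last two points together for a MyISAM table (the file path and table name are placeholders; the file must live where the mySQL server can read it):

-- Skip non-unique index maintenance during the bulk load (MyISAM).
ALTER TABLE events_20240116 DISABLE KEYS;

-- Server-side bulk load, read directly from the server's file system.
LOAD DATA INFILE '/var/lib/mysql-files/batch_20240116.csv'
INTO TABLE events_20240116
FIELDS TERMINATED BY ',';

-- Rebuild the indexes once, after the whole batch is in.
ALTER TABLE events_20240116 ENABLE KEYS;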