I have a general question about the best way to set up my tables to deal with large-volume data that I import on a daily basis.
I will import 10 CSV files containing thousands of records each day, so this table will grow rapidly.
It consists of 15 or so columns, ranging from tiny and medium ints to 30-character varchars.
There is no ID field; I can combine 6 columns to form a primary key, which would be a varchar of total length about 45.
Once it's imported I need to report on this data at summary levels through a web front end, so I expect to have to build reporting tables from it after importing.
Within this data many fields repeat themselves in each day's import (date, region, customer, etc.); only about half the columns each day are specific to the record.
Questions:
Should I import it all immediately into one table as a dump table?
Should I transform the data during the import process and split the import across different tables?
Should I form an ID field from the columns I can combine, to get a unique key during the import?
Should I use an auto-increment ID field for this?
What sort of table should this be (InnoDB, etc.)?
My fear is that this table will become overloaded, which will make extracting to reporting tables harder and harder as it grows.
Advice really helpful. Thanks.
Having an auto-increment ID is usually more helpful than not having one.
To ensure data integrity you can put a unique index on the 6 columns that make up your natural key.
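A minimal sketch of that layout, assuming InnoDB and invented table/column names (substitute your real six key columns):

-- Surrogate auto-increment PK plus a unique index over the six natural-key
-- columns, so duplicate rows are rejected on import.
CREATE TABLE daily_import (
  id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  import_date DATE NOT NULL,
  region      VARCHAR(10) NOT NULL,
  customer    VARCHAR(30) NOT NULL,
  product     VARCHAR(30) NOT NULL,
  batch_no    SMALLINT UNSIGNED NOT NULL,
  line_no     SMALLINT UNSIGNED NOT NULL,
  -- ...remaining measure columns...
  PRIMARY KEY (id),
  UNIQUE KEY natural_key (import_date, region, customer, product, batch_no, line_no)
) ENGINE=InnoDB;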
MySQL is pretty comfortable with millions of records in a database if you have enough RAM.
If you are still worried about millions of records, just aggregate your data on a monthly basis into another table. If you can't, add more RAM.
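A monthly rollup could look something like this (table and column names are hypothetical):

-- Aggregate one month of raw rows into a smaller reporting table.
INSERT INTO monthly_summary (month_start, region, customer, total_value, row_count)
SELECT '2016-01-01', region, customer, SUM(value), COUNT(*)
FROM   daily_import
WHERE  import_date >= '2016-01-01' AND import_date < '2016-02-01'
GROUP  BY region, customer;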
Transform as much of your data as possible during import, as long as it doesn't hurt performance. Transforming the data once it's already imported adds unnecessary load to the MySQL server, so avoid it if you can.
MyISAM is (was?) usually better for statistical data, the kind that doesn't get UPDATEd often, but InnoDB has caught up in the past few years (have a look at Percona's XtraDB engine) and is now basically the same performance-wise.
I think the most important point here is to define your data retention requirements; it's rare that you have to retain daily resolution after a year or two.
Aggregate into lower-resolution frames and archive the raw data (mysqldump piped through bzip2 is quite efficient) if you think you might still need daily resolution in the future.
I need to store sensor data from various locations (different factories with different rooms with each different sensors). Data is being downloaded in regular intervals from a device on site in the factories that collects the data transmitted from all sensors.
The sensor data looks like this:
collecting_device_id, sensor_id, type, value, unit, timestamp
Type could be temperature, unit could be degrees_celsius. collecting_device_id will identify the factory.
There are quite a lot of different things (==types) being measured.
I will collect around 500 million to 750 million rows and then perform analyses on them.
Here's the question for storing the data in a SQL database (let's say MySQL InnoDB on AWS RDS, large machine if necessary):
When considering query performance for future queries, is it better to store this data in one huge table just like it comes from the sensors? Or to distribute it across tables (tables for factories, temperatures, humidities, …, everything normalized)? Or to have a wide table with different fields for the data points?
Yes, I know, it's hard to say "better" without knowing the queries. Here's more info and a few things I have thought about:
There's no constant data stream as data is uploaded in chunks every 2 days (a lot of writes when uploading, the rest of the time no writes at all), so I would guess that index maintenance won't be a huge issue.
I will try to reduce the amount of data being inserted upfront (data that can easily be replicated later on, data that does not add additional information, …)
Queries that should be performed are not defined yet (I know, designing the query makes a big difference in terms of performance). It's exploratory work (so we don't know ahead what will be asked and cannot easily pre-compute values), so one time you want to compare data points of one type in a time range to data points of another type, the other time you might want to compare rooms in factories, calculate correlations, find duplicates, etc.
If I had multiple tables and normalized everything, the queries would need a lot of joins (which would probably make everything quite slow).
Queries mostly need to be performed on the whole ~500-million-row database, rarely on separately downloaded subsets.
There will be very few users (<10); most of them will execute these "complex" queries.
Is a SQL database a good choice at all? Would there be a big difference in terms of performance for this use case to use a NoSQL system?
In this setup with this amount of data, will I have queries that never "come back"? (considering the query is not too stupid :-))
Don't pre-optimize. If you don't know the queries then you don't know the queries. It is too easy to make choices now that will slow down some subset of queries. When you know how the data will be queried, you can optimize then; it is easy to normalize after the fact (pull out temperature data into a related table, for example). For now I suggest you put it all in one table.
You might consider partitioning the data by date, or by another dimension that might be useful (recording device, maybe?). Data of this size is often partitioned if you have the resources.
After you think about the queries, you will possibly realize that you don't really need all the datapoints. Instead, max/min/avg/etc for, say, 10-minute intervals may be sufficient. And you may want to "alarm" on "over-temp" values. This should not involve the database, but should involve the program receiving the sensor data.
So, I recommend not storing all the data; instead only store summarized data. This will greatly shrink the disk requirements. (You could store the 'raw' data to a plain file in case you are worried about losing it. It will be adequately easy to reprocess the raw file if you need to.)
If you do decide to store all the data in table(s), then I recommend these tips:
High speed ingestion (includes tips on Normalization)
Summary Tables
Data Warehousing
Time series partitioning (if you plan to delete 'old' data) (partitioning is painful to add later)
750M rows -- per day? Per decade? Per month? Not too much of a challenge.
Since you receive a batch every other day, it is quite easy to load each batch into a temp table; do normalization, summarization, etc.; then store the results in the Summary table(s) and finally copy them to the 'Fact' table (if you choose to keep the raw data in a table).
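A rough sketch of that flow, with every table and column name invented for illustration:

-- Load one uploaded batch into a staging table.
LOAD DATA INFILE '/tmp/batch.csv'
INTO TABLE staging_readings
FIELDS TERMINATED BY ','
IGNORE 1 LINES;

-- Normalize: register any sensors not seen before.
INSERT IGNORE INTO sensors (sensor_id, type, unit)
SELECT DISTINCT sensor_id, type, unit FROM staging_readings;

-- Summarize per sensor per hour, then append the raw rows to the Fact table.
INSERT INTO readings_summary (sensor_id, hour_start, value_sum, value_count)
SELECT sensor_id, DATE_FORMAT(ts, '%Y-%m-%d %H:00:00'), SUM(value), COUNT(*)
FROM   staging_readings
GROUP  BY sensor_id, DATE_FORMAT(ts, '%Y-%m-%d %H:00:00');

INSERT INTO fact_readings (collecting_device_id, sensor_id, ts, value, unit)
SELECT collecting_device_id, sensor_id, ts, value, unit FROM staging_readings;
TRUNCATE TABLE staging_readings;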
In reading my tips, you will notice that avg is not summarized; instead sum and count are. If you need standard deviation, also, keep sum-of-squares.
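For instance, if the summary table keeps value_sum, value_count and (optionally) value_sumsq per sensor per hour, averages and standard deviations over any period can be derived later (hypothetical names, matching the sketch above):

-- Average and population standard deviation from sum, count and sum-of-squares.
SELECT sensor_id,
       SUM(value_sum) / SUM(value_count) AS avg_value,
       SQRT(SUM(value_sumsq) / SUM(value_count)
            - POW(SUM(value_sum) / SUM(value_count), 2)) AS std_dev
FROM   readings_summary
WHERE  hour_start >= '2016-01-01' AND hour_start < '2016-02-01'
GROUP  BY sensor_id;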
If you fail to include all the Summary Tables you ultimately need, it is not too difficult to re-process the Fact table (or Fact files) to populate the new Summary Table. This is a one-time task. After that, the summarization of each chunk should keep the table up to date.
The Fact table should be Normalized (for space); the Summary tables should be somewhat denormalized (for performance). Exactly how much denormalization depends on size, speed, etc., and cannot be predicted at this level of discussion.
"Queries on 500M rows" -- Design the Summary tables so that all queries can be done against them, instead. A starting rule-of-thumb: Any Summary table should have one-tenth the number of rows as the Fact table.
Indexes... The Fact table should have only a primary key. (The first 100M rows will work nicely; the last 100M will run so slowly. This is a lesson you don't want to have to learn 11 months into the project; so do pre-optimize.) The Summary tables should have whatever indexes make sense. This also makes querying a Summary table faster than the Fact table. (Note: Having a secondary index on a 500M-rows table is, itself, a non-trivial performance issue.)
NoSQL either forces you to re-invent SQL, or depends on brute-force full-table scans. Summary tables are the real solution. In one (albeit extreme) case, I sped up a 1-hour query to 2 seconds by using a Summary table. So, I vote for SQL, not NoSQL.
As for whether to "pre-optimize": I say it is a lot easier than rebuilding a 500M-row table. That brings up another issue: start with the minimal datasize for each field. Look at MEDIUMINT (3 bytes), UNSIGNED (an extra bit), CHARACTER SET ascii (utf8 or utf8mb4 only for columns that need it), NOT NULL (NULL costs a bit), etc.
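A hedged illustration of those size choices (column names and sizes are guesses at your data):

-- Small datatypes keep a 500M-row Fact table compact.
CREATE TABLE fact_readings (
  collecting_device_id MEDIUMINT UNSIGNED NOT NULL,  -- 3 bytes
  sensor_id            SMALLINT UNSIGNED  NOT NULL,  -- 2 bytes
  ts                   DATETIME           NOT NULL,
  value                FLOAT              NOT NULL,  -- or DECIMAL if roundoff matters
  unit                 CHAR(3) CHARACTER SET ascii NOT NULL,
  PRIMARY KEY (collecting_device_id, sensor_id, ts)
) ENGINE=InnoDB;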
Sure, it is possible to have 'queries that never come back'. This one never comes back even with only 100 rows in table a: SELECT * FROM a AS a1 JOIN a AS a2 JOIN a AS a3 JOIN a AS a4 JOIN a AS a5. The result set has 10 billion rows.
I'm writing a C# program where I'm looking at ~5300 stock tickers. I'm storing the data in a MySQL database with the following fields: date, tickername, closingPrice, movingaverage50, movingaverage200, ... and a few others. Each stock can have up to 15300 different datapoints. So the total database will be 5300x15300x6 or so different fields.
My question is, is there a more efficient way to store all this data other than one big table? Would breaking the data up into different tables, say by decade, buy me anything? Is there some link/website where I should go to get a general feel of what considerations I should look at to design a database to be as fast as possible, or does the MySQL database itself make it efficient?
I'm currently reading in 5,500 Excel files one at a time to fill my C# objects with data, and that takes around 15 minutes... I'm assuming that once I get MySQL going, that will be cut way down.
Thanks for any help; this is more of a fishing for a place to get started thinking about database design I guess.
This is too long for a comment.
In general, it is a bad idea to store multiple tables with the same format. That becomes a maintenance problem and has dire consequences for certain types of queries. So, one table is preferred.
The total number of values is 486,540,000 (roughly 81 million rows of six columns). This is pretty large but not extraordinary.
The question about data layout depends not only on the data but how it is being used. My guess is that the use of indexes and perhaps partitions would solve your performance problems.
Processing 5,500 Excel files in 15 minutes seems pretty good. Whether the database will be significantly faster depends on the volume of data between the server and the application. If you are reading the "Excel" files as CSV text files, then the database may not be a big gain. If you are reading through Excel, then it might be better.
Note: with a database, you may be able to move processing from C# into the database. This allows the database to take advantage of parallel processing, which can open other avenues for performance improvement.
One table.
PRIMARY KEY(ticker, date) -- This makes getting historical info about a single ticker efficient because of the clustering.
PARTITION BY RANGE (TO_DAYS(date)) -- This leads to all the INSERT activity being in a single partition. That partition is of finite size, so the random accesses needed to insert 5300 new rows scattered around it every night will probably still be in cache.
Partition by month, or something of approximately that size: small enough for a partition to be cached, but not so small that you end up with an unwieldy number of partitions. (It is good to keep a table under 50 partitions. This 'limitation' may lift with the "native partitions" coming in 5.7.)
If you already have several months' data in a table, put that in a single, oversized, partition; there is no advantage in splitting it up by month.
Minimize column sizes: 2-byte SMALLINT UNSIGNED for ticker_id, linked to a normalization table of tickers; 3-byte DATE; etc. Volume can be too big for INT UNSIGNED; either use FLOAT (with some roundoff error) or some DECIMAL. Prices are tricky: rounding errors with FLOAT, excessive size with DECIMAL. You need at least DECIMAL(9,4) (5 bytes) for US tickers, worse if you go back to the days of fractional pricing (e.g., 5-9/16).
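Pulling those points together, a hedged sketch (table name, column names, and partition boundaries are all illustrative):

-- Composite clustered PK, minimal column sizes, monthly partitions so the
-- nightly inserts land in one small, cacheable partition.
CREATE TABLE quotes (
  ticker_id   SMALLINT UNSIGNED NOT NULL,   -- points at a small tickers lookup table
  quote_date  DATE NOT NULL,
  close_price DECIMAL(9,4) NOT NULL,
  volume      FLOAT NOT NULL,
  ma50        DECIMAL(9,4) NULL,
  ma200       DECIMAL(9,4) NULL,
  PRIMARY KEY (ticker_id, quote_date)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(quote_date)) (
  PARTITION p2016_01 VALUES LESS THAN (TO_DAYS('2016-02-01')),
  PARTITION p2016_02 VALUES LESS THAN (TO_DAYS('2016-03-01')),
  PARTITION pmax     VALUES LESS THAN MAXVALUE
);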
Think through the computation of moving averages; this may be the most intensive activity.
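If you do push the moving-average computation into the database rather than C#, a window function can do it in one pass (requires MySQL 8.0 or later; names follow the sketch above):

-- 50-day simple moving average per ticker.
SELECT ticker_id,
       quote_date,
       AVG(close_price) OVER (
           PARTITION BY ticker_id
           ORDER BY quote_date
           ROWS BETWEEN 49 PRECEDING AND CURRENT ROW
       ) AS ma50
FROM quotes;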
I have a CSV file with 74 columns and about 60K rows. The contents of this CSV file have to be imported to a MySQL database, every month.
After the data is inserted, the end-user can query the contents of the MySQL database with predefined filters.
Putting everything in a single table would mean faster inserts, but slower reads. Splitting the content in multiple tables (with foreign keys) would mean slower inserts, faster reads and, I think, higher chance of failure.
What do you think is the best option for me, or are there any other possibilities?
If all the data relationships (between the buses, clients, and trips) are 1 to 1 and information is not being duplicated throughout your CSV, you can go with a single table for these reasons:
Simplest conversion from CSV to database: each column in the CSV will correspond to one column in the database.
Anyone who works on the database after you will know exactly what data is where, because it will "look like" the CSV.
Your main concern, "reads being slower", won't be a big problem, because when you query the database you ask for only the data you want and filter out the columns you don't (e.g. SELECT departure, arrival, distance FROM bustrips WHERE distance > 1000).
However, if you look at the data and there is a massive amount of duplication in the CSV (possibly from more than one client riding on the same trip, the same bus being used for more than one trip, etc.), I would create a new table for each block of unique data. One example I can already see would be a new table for buses:
Bus_ID;
Numberplate;
Handicap;
Odometer reading;
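In SQL, that split might look roughly like this (table and column names are guesses based on the fields above):

-- Unique bus data moves to its own table; each trip references it by key.
CREATE TABLE buses (
  bus_id           INT UNSIGNED NOT NULL AUTO_INCREMENT,
  numberplate      VARCHAR(15) NOT NULL,
  handicap_access  BOOLEAN NOT NULL,
  odometer_reading INT UNSIGNED NOT NULL,
  PRIMARY KEY (bus_id)
);

CREATE TABLE bustrips (
  trip_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
  bus_id    INT UNSIGNED NOT NULL,
  departure DATETIME NOT NULL,
  arrival   DATETIME NOT NULL,
  distance  INT UNSIGNED NOT NULL,
  PRIMARY KEY (trip_id),
  FOREIGN KEY (bus_id) REFERENCES buses (bus_id)
);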
I hope this helps you make the decision. It is not about "easy read" vs. "easy write"; it is about information clarity through reduction of redundancy.
Without looking at your columns, I can almost guarantee that multiple tables is the way to go.
It will reduce human error by reducing redundancy, and as a bonus, any update to, say, a client's address can be made once in the clients table instead of having to be updated on every line item they've been involved with.
You'll also notice that insertions become easier, as entire lines of data covered in another table can be summed up by referencing a single foreign key!
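As a tiny illustration, assuming a hypothetical clients table split out the same way:

-- One UPDATE corrects the address everywhere; trip rows only store client_id.
UPDATE clients
SET    address = '12 New Street'
WHERE  client_id = 42;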
If database insertion time does become a big problem, you could always take a little bit of time to write a macro to do it for you.
So here's the deal. I've designed a schema which stores daily stock quote data. I have two tables (among others), "todayData" and "historicalData", with the same structure. Both tables use InnoDB as their storage engine. There is no FK between the two tables; they are independent.
If I need to see data for today, I query the today table, and if I need to generate reports or trending analysis etc., I rely on the historical table. At midnight, today's data is moved to the historical table.
The problem is that the historical table will be mammoth in a few weeks (> 10 GB and counting), and needless to say, serving this data from a single table seems unworkable.
What should I do to make sure the reports generated off the historical table will be fast and responsive?
People have suggested partitioning etc., but I would like to know: are there any other ways to do this?
Thank you
Bo
There is no silver bullet for Big Data. Everything depends on the data and how it is used (access patterns, etc.). First, make sure the table is properly indexed, your queries are optimal, and you have enough memory. If you still have too much data to fit on a single server, shard/partition it, but mind the access patterns when you choose the shard key: if you have to query multiple partitions for a single report, that's bad. If you really have to, make sure you can query them in parallel, which is not currently possible with the built-in partitioning (so you would need app-level sharding logic).
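If you do go the partitioning route, a hedged sketch against the historical table (this assumes a quote_date column that is part of the primary key; names and boundaries are illustrative):

-- Monthly partitions let reports over a date range scan only the
-- relevant partitions (partition pruning).
ALTER TABLE historicalData
PARTITION BY RANGE (TO_DAYS(quote_date)) (
  PARTITION p2013_01 VALUES LESS THAN (TO_DAYS('2013-02-01')),
  PARTITION p2013_02 VALUES LESS THAN (TO_DAYS('2013-03-01')),
  PARTITION pmax     VALUES LESS THAN MAXVALUE
);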
I have a question about table design and performance. I have a number of analytical machines that produce varying amounts of data (which have been stored in text files up to this point via the dos programs which run the machines). I have decided to modernise and create a new database to store all the machine results in.
I have created separate tables to store results by type e.g. all results from the balance machine get stored in the balance results table etc.
I have a common results table format for each machine which is as follows:
ClientRequestID PK
SampleNumber PK
MeasureDtTm
Operator
AnalyteName
UnitOfMeasure
Value
A typical ClientRequest might have 50 samples which need to be tested by various machines. Each machine records only 1 line per sample, so there are approximately 50 rows per table associated with any given ClientRequest.
This is fine for all machines except one!
It measures 20-30 analytes per sample (and just spits them out in one long row), whereas on all the other machines I only ever measure 1 analyte per RequestID/SampleNumber.
If I stick to this format, this machine will generate over a million rows per year, because every sample can have as many as 30 measurements.
My other tables will only grow at a rate of 3000-5000 rows per year.
So after all that, my question is this:
Am I better off sticking to the common format for this table and having bucketloads of rows, or is it better to just add extra columns to represent each analyte, so that it generates only 1 row per sample (like the other tables)? The machine can only ever measure a max of 30 analytes (and at $250k per machine, I won't be getting another in my lifetime).
All I am worried about is reporting performance and online editing. In both cases, the PK: RequestID and SampleNumber remain the same, so I guess it's just a matter of what would load quicker. I know the multiple column approach is considered woeful from a design perspective, but would it yield better performance in this instance?
BTW the database is MS Jet / Access 2010
Any help would be greatly appreciated!
Millions of rows in a Jet/ACE database are not a problem if the rows have few columns.
However, my concern is how these records are inserted -- is this real-time data collection? If so, I'd suggest this is probably more than Jet/ACE can handle reliably.
I'm an experienced Access developer who is a big fan of Jet/ACE, but from what I know about your project, if I were starting it, I'd definitely choose a server database from the get-go. Not because Jet/ACE likely can't handle it right now, but because I'm thinking in terms of 10 years down the road, when this app might still be in use (remember Y2K, which was mostly a problem of apps that were designed with planned obsolescence in mind but were never replaced).
You can decouple the AnalyteName column from the 'common results' table:
-- Table Common Results
ClientRequestID PK
SampleNumber PK
MeasureDtTm
Operator
UnitOfMeasure
Value
-- Table Results Analyte
ClientRequestID PK
SampleNumber PK
AnalyteName
You join on the PK (Request + Sample). That way you don't needlessly duplicate the rest of the columns, you can avoid the join in queries that don't need the AnalyteName, you can support extra analytes, and the design is overall saner. Unless you really start having a performance problem, this is the approach I'd follow.
Heck, even if you start having performance problems, I'd first move to a real database to see if that fixes the problems before adding columns to the results table.
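A hedged sketch of that join, using hypothetical table names for the two layouts above (the syntax works in Access SQL as well as in a server database):

-- Pull the analyte name alongside the common results only when needed.
SELECT cr.ClientRequestID, cr.SampleNumber, cr.MeasureDtTm,
       cr.UnitOfMeasure, cr.Value, ra.AnalyteName
FROM CommonResults AS cr
INNER JOIN ResultsAnalyte AS ra
        ON ra.ClientRequestID = cr.ClientRequestID
       AND ra.SampleNumber = cr.SampleNumber;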