How to consolidate data before saving to database? - mysql

I have processes that retrieve CPU, memory, and disk usage from a couple of machines and store the data in a MySQL database. I would like to draw line charts from it, like CPU usage vs. date or memory usage vs. date.
After a couple of months, my database has millions of rows. Selecting data for a period of time for charting takes a long time. (DB normalization has been applied.)
Are there any known statistical methods to consolidate data to minute or hour granularity before saving it to the database?
Or what keywords should I search for on the Internet for consolidating these kinds of monitoring data?
Thanks.

It is better to keep the full data in storage and a consolidated data set for building graphs and the like. You can use triggers, cron jobs, materialized views, and other kinds of scripts to create a separate set of consolidated data.
Another kind of solution is to create a temporary set of unconsolidated data and then compress it down to what you need in your main data table.
As for statistical methods, apply common sense. Sometimes you just need the AVG value, sometimes MAX or MIN (I'd guess that you want to see MAX CPU usage, MIN disk space left, MAX RAM usage, and so on).
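For example, here is a minimal MySQL sketch of that kind of hourly consolidation (the table and column names are hypothetical; adapt them to your schema). A cron job or a MySQL event can run the INSERT once per hour for the hour that has just closed:

CREATE TABLE metrics_hourly (
  host_id   INT         NOT NULL,
  metric    VARCHAR(32) NOT NULL,
  hour_ts   DATETIME    NOT NULL,  -- start of the hour
  avg_value DOUBLE,
  min_value DOUBLE,
  max_value DOUBLE,
  PRIMARY KEY (host_id, metric, hour_ts)
);

-- Consolidate the last closed hour from a raw table
-- metrics_raw(host_id, metric, ts, val).
INSERT INTO metrics_hourly (host_id, metric, hour_ts, avg_value, min_value, max_value)
SELECT host_id,
       metric,
       DATE_FORMAT(ts, '%Y-%m-%d %H:00:00'),
       AVG(val), MIN(val), MAX(val)
FROM   metrics_raw
WHERE  ts >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
  AND  ts <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
GROUP  BY host_id, metric, DATE_FORMAT(ts, '%Y-%m-%d %H:00:00')
ON DUPLICATE KEY UPDATE
  avg_value = VALUES(avg_value),
  min_value = VALUES(min_value),
  max_value = VALUES(max_value);

Charting then reads metrics_hourly instead of the raw table, and the raw rows can be expired after a retention window.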

Related

How to store historic time series data

we're storing a bunch of time series data from several measurement devices.
All devices may provide different dimensions (energy, temp, etc.)
Currently we're using MySQL to store all this data in different tables (according to the dimension) in the format
idDevice, DateTime, val1, val2, val3
We're also aggregating this data from minute -> hour -> day -> month -> year each time we insert new data.
This is running quite fine, but we're running out of disk space as we grow, and in general I doubt that an RDBMS is the right answer for keeping archive data.
So we're thinking of moving old/cold data to Amazon S3 and writing some fancy getter that can retrieve this data.
So here is my question: what would be a good data format to support the following needs:
The data must be extensible: once in a while a device will provide more data than in the past -> the count of rows can grow/increase.
The data must be updatable. When a customer delivers historic data, we need to be able to update it for the past.
We're using PHP -> it would be nice to have connectors/classes :)
I've had a look at HDF5, but it seems there is no PHP lib.
We're also willing to have a look at cloud-based time-series databases.
Thank you in advance!
B
You might consider moving to a dedicated time-series database. I work for InfluxDB and our product meets most of your requirements right now, although it is still a pre-1.0 release.
We're also aggregating this data from minute -> hour -> day -> month -> year each time we insert new data
InfluxDB has built-in tools to automatically downsample and expire data. All you do is write the raw points and set up a few queries and retention policies; InfluxDB handles the rest internally.
The data must be extensible: once in a while a device will provide more data than in the past -> the count of rows can grow/increase.
As long as historic writes are fairly infrequent they are no problem for InfluxDB. If you are frequently writing non-sequential data, the write performance can slow down, but only while the non-sequential points are being replicated.
InfluxDB is not quite schema-less, but the schema cannot be pre-defined, and is derived from the points inserted. You can add new tags (metadata) or fields (metrics) simply by writing a new point that includes them, and you can automatically compose or decompose series by excluding or including the relevant tags when querying.
The data must be updatable. When a customer delivers historic data, we need to be able to update it for the past.
InfluxDB silently overwrites points when a new matching point comes in. (Matching means same series and timestamp, to the nanosecond.)
We're using PHP -> it would be nice to have connectors/classes :)
There are a handful of PHP libraries out there for InfluxDB 0.9. None are officially supported but likely one fits your needs enough to extend or fork.
You haven't specified what you want in enough detail.
Do you care about latency? If not, just write all your data points to per-interval files in S3, then periodically collect and process them. (No Hadoop needed; a simple script downloading the new files should usually be plenty fast enough.) This is how logging to S3 works.
The really nice part about this is that you will never outgrow S3 or have to do any maintenance. If you prefix your files correctly, you can easily grab a day's worth of data or the last hour of data. Then you do your day/week/month roll-ups on that data and store only the roll-ups in a regular database.
Do you need the old data at high resolution? You can use Graphite to roll up your data automatically. The downside is that it loses resolution as it ages. But the upside is that your data is a fixed size and never grows, and writes can be handled quickly. You can even combine the above approach and send data to Graphite for quick viewing, but keep the data in S3 for other uses down the road.
I haven't researched the various TSDBs extensively, but here is a nice HN thread about it. InfluxDB is nice, but new. Cassandra is more mature, but the tooling to use it as a TSDB isn't all there yet.
How much new data do you have? Most tools will handle 10,000 datapoints per second easily, but not all of them can scale beyond that.
I'm with the team that develops Axibase Time-Series Database. It's a non-relational database that allows you to efficiently store timestamped measurements with various dimensions. You can also store device properties (id, location, type, etc) in the same database for filtering and grouped aggregations.
ATSD doesn't delete raw data by default. Each sample takes 3.5+ bytes per time:value tuple. Period aggregations are performed at request time, and the list of functions includes MIN, MAX, AVG, SUM, COUNT, PERCENTILE(n), STANDARD_DEVIATION, FIRST, LAST, DELTA, RATE, WAVG, and WTAVG, as well as some additional functions for computing threshold violations per period.
Backfilling historical data is fully supported except that the timestamp has to be greater than January 1, 1970. Time precision is milliseconds or seconds.
As for deployment options, you could host this database on AWS. It runs on most Linux distributions. We could run some storage efficiency and throughput tests for you if you want to post sample data from your dataset here.

mysql harddrive efficiency with millions of rows

I have a program that receives about 20 arbitrary measurements per second from some source. Each measurement has a type, timestamp, min, avg, and max value. I then need to create up to X aggregates of each measurement type.
The program can be set up with hundreds of sources at the same time; this results in a lot of data that I need to store quickly and retrieve quickly.
The system that this will run on has no memory/storage/CPU limitations, but there is another service on it that is writing to the HDD at almost the limit of its capability. For this question, let's assume that this is a "top of the line" HDD and that I won't be able to upgrade to an SSD.
What I'm doing right now is generating a table per measurement type (20 per source) as new measurement types are encountered, partitioned along the timestamp value of each measurement. I'm doing this so as not to fragment the measurement data across the HDD, which allows me to insert or query data with a minimum amount of 'seeking.'
Does this make sense? I have no need to do any joins or complex queries; it's all either straightforward batch inserts or a single-measurement-type query by timestamp range.
How does MySQL store the data in the tables across the HDD? How can I better design the DB to minimize HDD seeks during inserts and queries?
You are asking general questions that can be answered by reading the documentation or browsing knowledge-base articles via Google or whatever search engine you prefer. If you are using the MyISAM engine, which is the default, then each table is stored as three files in a database-specific directory, with the big ones being the .MYD file for the row data and the .MYI file for all of the indexes.
The most important thing you can do is get your configuration parameters right so that MySQL can optimise access and caching. MySQL will do a better job than you can realistically expect to do yourself. See http://dev.mysql.com/doc/refman/5.1/en/option-files.html for more info, and compare the settings in the my-small.cnf and my-large.cnf files you will find on your system, as that section discusses.
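For instance (a sketch only; the 256 MB figure is a placeholder, not a recommendation), you can check how well the MyISAM key cache is doing and resize it at runtime, then persist the chosen value in my.cnf:

SHOW VARIABLES LIKE 'key_buffer_size';
-- Key_reads (from disk) vs. Key_read_requests (from cache) indicates the index cache hit rate.
SHOW GLOBAL STATUS LIKE 'Key_read%';

-- Resize the MyISAM index cache for the running server; add the same
-- setting to my.cnf so it survives a restart.
SET GLOBAL key_buffer_size = 256 * 1024 * 1024;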

Redis vs MySQL for Financial Data?

I realize that this question is pretty well discussed, however I would like to get your input in the context of my specific needs.
I am developing a realtime financial database that grabs stock quotes from the net multiple times a minute and stores them in a database. I am currently working with SQLAlchemy over MySQL, but I came across Redis and it looks interesting, especially because of its performance, which is crucial in my application. I know that MySQL can be fast too; I just feel like implementing heavy caching is going to be a pain.
The data I am saving is by far mostly decimal values. I am also doing a significant amount of division and multiplication with these decimal values (in a different application).
In terms of data size, I am grabbing about 10,000 symbols multiple times a minute. This amounts to about 3 TB of data a year.
I am also concerned by Redis's key quantity limitation (2^32). Is Redis a good solution here? What other factors can help me make the decision either toward MySQL or Redis?
Thank you!
Redis is an in-memory store. All the data must fit in memory. So unless you have 3 TB of RAM per year of data, it is not the right option. The 2^32 limit is not really an issue in practice, because you would probably have to shard your data anyway (i.e. use multiple instances), and because the limit is actually 2^32 keys with 2^32 items per key.
If you have enough memory and still want to use (sharded) Redis, here is how you can store time series space-efficiently: https://github.com/antirez/redis-timeseries
You may also want to patch Redis in order to add a proper time series data structure. See Luca Sbardella's implementation at:
https://github.com/lsbardel/redis
http://lsbardel.github.com/python-stdnet/contrib/redis_timeseries.html
Redis is excellent for aggregating statistics in real time and storing the result of these calculations (i.e. DIRT applications). However, storing historical data in Redis is much less interesting, since it offers no query language to perform offline calculations on that data. B-tree-based stores supporting sharding (MongoDB, for instance) are probably more convenient than Redis for storing large time series.
Traditional relational databases are not so bad for storing time series. People have dedicated entire books to this topic:
Developing Time-Oriented Database Applications in SQL
Another option you may want to consider is using a bigdata solution:
storing massive ordered time series data in bigtable derivatives
IMO the main point (whatever the storage engine) is to evaluate the access patterns to this data. What do you want to use the data for? How will you access it once it has been stored? Do you need to retrieve all the data related to a given symbol? Do you need to retrieve the evolution of several symbols over a given time range? Do you need to correlate values of different symbols by time? etc.
My advice is to try to list all these access patterns. The choice of a given storage mechanism will only be a consequence of this analysis.
Regarding MySQL usage, I would definitely consider table partitioning because of the volume of the data. Depending on the access patterns, I would also consider the ARCHIVE engine. This engine stores data in compressed flat files. It is space efficient. It can be used with partitioning, so even though it does not index the data, it can be efficient at retrieving a subset of data if the partition granularity is carefully chosen.
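As a rough sketch of that combination (the table and column names here are invented; check that your MySQL version supports partitioned ARCHIVE tables):

CREATE TABLE quote_archive (
  symbol VARCHAR(12)   NOT NULL,
  ts     DATETIME      NOT NULL,
  price  DECIMAL(18,6) NOT NULL,
  volume BIGINT
) ENGINE=ARCHIVE
PARTITION BY RANGE (TO_DAYS(ts)) (
  PARTITION p2015_01 VALUES LESS THAN (TO_DAYS('2015-02-01')),
  PARTITION p2015_02 VALUES LESS THAN (TO_DAYS('2015-03-01')),
  PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- ARCHIVE compresses rows and has no secondary indexes, but a query
-- restricted to a narrow date range only scans the matching partitions:
SELECT AVG(price) FROM quote_archive
WHERE ts >= '2015-02-10' AND ts < '2015-02-11' AND symbol = 'ABC';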
You should consider Cassandra or Hbase. Both allow contiguous storage and fast appends, so that when it comes to querying, you get huge performance. Both will easily ingest tens of thousands of points per second.
The key point is that along one of your query dimensions (usually by ticker), you're accessing disk (SSD or spinning) contiguously. You're not having to hit indices millions of times. You can model things in Mongo/SQL to get similar performance, but it's more hassle, and you get it "for free" out of the box with the columnar guys, without having to do any client-side shenanigans to merge blobs together.
My experience with Cassandra is that it's 10x faster than MongoDB, which is already much faster than most relational databases, for the time series use case, and as data size grows, its advantage over the others grows too. That's true even on a single machine. Here is where you should start.
The only negative on Cassandra, at least, is that with a big cluster you sometimes don't have consistency for a few seconds, so you either need to force it, slowing it down, or accept that the very latest print will sometimes be a few seconds old. On a single machine there will be no consistency problems, and you'll get the same columnar benefits.
I'm less familiar with HBase, but it claims to be more consistent (there will be a cost elsewhere - CAP theorem); however, setting up the HBase stack is much more of a commitment.
You should first check the features that Redis offers in terms of data selection and aggregation. Compared to an SQL database, Redis is limited.
In fact, 'Redis vs MySQL' is usually not the right question, since they are apples and pears. If you are refreshing the data in your database (and also removing it regularly), check out MySQL partitioning. See e.g. the answer I wrote to 'What is the best way to delete old rows from MySQL on a rolling basis?':
Check out MySQL Partitioning:
Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.
See e.g. this post to get some ideas on how to apply it:
Using Partitioning and Event Scheduler to Prune Archive Tables
And this one:
Partitioning by dates: the quick how-to
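A condensed sketch of what those posts describe, assuming a quotes table already partitioned by day (the table and partition names are hypothetical):

-- Expiring a day of data is a near-instant metadata operation,
-- unlike DELETE ... WHERE ts < ..., which has to rewrite rows:
ALTER TABLE quotes DROP PARTITION p20150601;

-- New daily partitions are carved out of the catch-all partition in advance:
ALTER TABLE quotes REORGANIZE PARTITION pfuture INTO (
  PARTITION p20150603 VALUES LESS THAN (TO_DAYS('2015-06-04')),
  PARTITION pfuture   VALUES LESS THAN MAXVALUE
);

In practice both statements are generated inside a stored procedure (the partition names change every day) and run from a scheduled event, which is the kind of automation the linked "Event Scheduler" post covers.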

handling/compressing large datasets in multiple tables

In an application at our company we collect statistical data from our servers (load, disk usage, and so on). Since there is a huge amount of data and we don't need all of it at all times, we've had a "compression" routine that takes the raw data, calculates min, max, and average for a number of data points, stores these new values in the same table, and removes the old ones after some weeks.
Now I'm tasked with rewriting this compression routine. The new routine must keep all uncompressed data we have for one year in one table and the "compressed" data in another table. My main concerns are how to handle the data that is continuously written to the database and whether or not to use a "transaction table" (my own term, since I can't come up with a better one; I'm not talking about the commit/rollback transaction functionality).
As of now our data collectors insert all information into a table named ovak_result, and the compressed data will end up in ovak_resultcompressed. But are there any specific benefits or drawbacks to creating a table called ovak_resultuncompressed and just using ovak_result as "temporary storage"? ovak_result would be kept minimal, which would be good for the compression routine, but I would need to continually shuffle all data from one table into another, and there would be constant reading, writing, and deleting in ovak_result.
Are there any mechanisms in MySQL to handle these kinds of things?
(Please note: we are talking about quite large datasets here (about 100 M rows in the uncompressed table and about 1-10 M rows in the compressed table). Also, I can do pretty much what I want with both software and hardware configurations, so if you have any hints or ideas involving MySQL configuration or hardware set-up, just bring them on.)
Try reading about the ARCHIVE storage engine.
Re your clarification. Okay, I didn't get what you meant from your description. Reading more carefully, I see you did mention min, max, and average.
So what you want is a materialized view that updates aggregate calculations for a large dataset. Some RDBMS brands such as Oracle have this feature, but MySQL doesn't.
One experimental product that tries to solve this is called FlexViews (http://code.google.com/p/flexviews/). This is an open-source companion tool for MySQL. You define a query as a view against your raw dataset, and FlexViews continually monitors the MySQL binary logs, and when it sees relevant changes, it updates just the rows in the view that need to be updated.
It's pretty effective, but it has a few limitations in the types of queries you can use as your view, and it's also implemented in PHP code, so it's not fast enough to keep up if you have really high traffic updating your base table.
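If FlexViews turns out to be too heavy for this, the low-tech alternative is a scheduled job (cron or a MySQL event) that aggregates and then trims in one pass. A rough sketch using the table names from the question; the column names (metric_id, sampled_at, val) and the hourly granularity are invented:

-- Compute the cutoff once so the INSERT and DELETE agree exactly.
SET @cutoff = NOW() - INTERVAL 1 YEAR;

-- Roll up raw rows older than the one-year window into the compressed table...
INSERT INTO ovak_resultcompressed (metric_id, period_start, min_val, max_val, avg_val)
SELECT metric_id,
       DATE_FORMAT(sampled_at, '%Y-%m-%d %H:00:00'),
       MIN(val), MAX(val), AVG(val)
FROM   ovak_result
WHERE  sampled_at < @cutoff
GROUP  BY metric_id, DATE_FORMAT(sampled_at, '%Y-%m-%d %H:00:00');

-- ...then remove them from the raw table.
DELETE FROM ovak_result
WHERE sampled_at < @cutoff;

With ~100 M rows, that DELETE is the painful part; partitioning ovak_result by date and dropping whole partitions instead avoids it.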

Database architecture for millions of new rows per day

I need to implement a custom-developed web analytics service for a large number of websites. The key entities here are:
Website
Visitor
Each unique visitor will have a single row in the database with information like landing page, time of day, OS, browser, referrer, IP, etc.
I will need to do aggregated queries on this database such as 'COUNT all visitors who have Windows as OS and came from Bing.com'
I have hundreds of websites to track, and the number of visitors for those websites ranges from a few hundred a day to a few million a day. In total, I expect this database to grow by about a million rows per day.
My questions are:
1) Is MySQL a good database for this purpose?
2) What could be a good architecture? I am thinking of creating a new table for each website. Or perhaps starting with a single table and then spawning a new table (daily) if the number of rows in an existing table exceeds 1 million (is my assumption correct?). My only worry is that if a table grows too big, SQL queries can get dramatically slow. So, what is the maximum number of rows I should store per table? Moreover, is there a limit on the number of tables that MySQL can handle?
3) Is it advisable to do aggregate queries over millions of rows? I'm ready to wait a couple of seconds to get results for such queries. Is that good practice, or is there any other way to do aggregate queries?
In a nutshell, I am trying to design a large-scale, data-warehouse-like setup which will be write-heavy. If you know of any published case studies or reports, that would be great!
If you're talking about larger volumes of data, then look at MySQL partitioning. For these tables, partitioning by date/time would certainly help performance. There's a decent article about partitioning here.
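A rough sketch of such a partitioned table for this use case (the column names are invented to match the example query in the question):

CREATE TABLE visits (
  site_id    INT      NOT NULL,
  visited_at DATETIME NOT NULL,
  os         VARCHAR(32),
  referrer   VARCHAR(255)
)
PARTITION BY RANGE (TO_DAYS(visited_at)) (
  PARTITION p20120101 VALUES LESS THAN (TO_DAYS('2012-01-02')),
  PARTITION p20120102 VALUES LESS THAN (TO_DAYS('2012-01-03')),
  PARTITION pfuture   VALUES LESS THAN MAXVALUE
);

-- "COUNT all visitors who have Windows as OS and came from Bing.com",
-- restricted to one day, only scans that day's partition
-- (check with EXPLAIN PARTITIONS):
SELECT COUNT(*) FROM visits
WHERE visited_at >= '2012-01-01' AND visited_at < '2012-01-02'
  AND os = 'Windows' AND referrer LIKE '%bing.com%';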
Look at creating two separate databases: one for all raw data for the writes, with minimal indexing; a second for reporting, using the aggregated values; and either a batch process to update the reporting database from the raw-data database, or replication to do that for you.
EDIT
If you want to be really clever with your aggregation reports, create a set of aggregation tables ("today", "week to date", "month to date", "by year"). Aggregate from raw data to "today" either daily or in "real time"; aggregate from "by day" to "week to date" on a nightly basis; from "week to date" to "month to date" on a weekly basis, etc. When executing queries, join (UNION) the appropriate tables for the date ranges you're interested in.
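A sketch of how two of those tiers might fit together (the table and column names are invented for illustration):

CREATE TABLE agg_today (
  site_id       INT         NOT NULL,
  os            VARCHAR(32) NOT NULL,
  referrer_host VARCHAR(64) NOT NULL,
  visit_date    DATE        NOT NULL,
  visitors      INT         NOT NULL,
  PRIMARY KEY (site_id, os, referrer_host, visit_date)
);
-- agg_by_day, agg_week_to_date, etc. share the same shape.

-- Nightly: fold the finished day into the next tier;
-- re-running the fold simply overwrites that day's rows.
INSERT INTO agg_by_day
SELECT * FROM agg_today
ON DUPLICATE KEY UPDATE visitors = VALUES(visitors);

-- A report covering finished days plus today UNIONs the tiers:
SELECT os, SUM(visitors) AS visitors
FROM (
  SELECT os, visitors FROM agg_by_day
  WHERE visit_date >= '2012-01-01' AND visit_date < CURDATE()
  UNION ALL
  SELECT os, visitors FROM agg_today
  WHERE visit_date = CURDATE()
) AS t
GROUP BY os;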
EDIT #2
Rather than one table per client, we work with one database schema per client. Depending on the size of the client, we might have several schemas in a single database instance, or a dedicated database instance per client. We use separate schemas for raw data collection, and for aggregation/reporting for each client. We run multiple database servers, restricting each server to a single database instance. For resilience, databases are replicated across multiple servers and load balanced for improved performance.
Some suggestions, in a database-agnostic fashion.
The simplest rationale is to distinguish between read-intensive and write-intensive tables. It is probably a good idea to create two parallel schemas: a daily/weekly schema and a history schema, with partitioning applied appropriately. A batch job can update the history schema with data from the daily/weekly schema. In the history schema you can again create separate data tables per website (depending on the data volume).
If all you are interested in is the aggregation stats alone (which may not be true), it is a good idea to have summary tables (monthly, daily) in which summaries like total unique visitors, repeat visitors, etc. are stored; these summary tables are updated at the end of the day. This enables on-the-fly computation of stats without waiting for the history database to be updated.
You should definitely consider splitting the data by site across databases or schemas - this not only makes it much easier to back up, drop, etc. an individual site/client, but also eliminates much of the hassle of making sure no customer can see any other customer's data by accident or through poor coding. It also makes it easier to make choices about partitioning, over and above database table-level partitioning by time or client.
Also, you said that the data volume is 1 million rows per day; that's not particularly heavy and doesn't require huge grunt power to log/store, nor indeed to report on (though if you were generating 500 reports at midnight you might hit a logjam). However, you also said that some sites had 1M visitors daily, so perhaps your figure is too conservative?
Lastly, you didn't say whether you want real-time reporting a la Chartbeat/Opentracker etc. or a cyclical refresh like Google Analytics - this will have a major bearing on your storage model from day one.
M
You really should test your way forward with simulated environments as close as possible to the live environment, using "real fake" data (correct format & length). Benchmark queries and variants of table structures. Since you seem to know MySQL, start there. It shouldn't take you long to set up a few scripts that bombard your database with queries. Studying how your database behaves with your kind of data will help you see where the bottlenecks will occur.
Not a solution but hopefully some help on the way, good luck :)