How to store historic time series data - MySQL

We're storing a bunch of time series data from several measurement devices.
All devices may provide different dimensions (energy, temp, etc.).
Currently we're using MySQL to store all this data in different tables (one per dimension) in the format
idDevice, DateTime, val1, val2, val3
We're also aggregating this data from minute -> hour -> day -> month -> year each time we insert new data.
This is running quite fine, but we're running out of disk space as we grow, and in general I doubt that an RDBMS is the right answer for keeping archive data.
So we're thinking of moving old/cold data to Amazon S3 and writing some fancy getter that can retrieve this data.
So here comes my question: what could be a good data format to support the following needs?
The data must be extensible: once in a while a device will provide more data than in the past -> the count of rows can grow/increase.
The data must be updatable: when a customer delivers historic data, we need to be able to update the past.
We're using PHP -> would be nice to have connectors/classes :)
I've had a look at HDF5, but it seems there is no PHP lib.
We're also willing to have a look at cloud-based time-series databases.
Thank you in advance!
B

You might consider moving to a dedicated time-series database. I work for InfluxDB and our product meets most of your requirements right now, although it is still pre-1.0.
We're also aggregating this data from minute -> hour -> day -> month -> year each time we insert new data
InfluxDB has built-in tools to automatically downsample and expire data. All you do is write the raw points and set up a few queries and retention policies; InfluxDB handles the rest internally.
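For concreteness, here's a rough sketch of that setup using 0.9-era InfluxQL sent through the influxdb Python client; the database, measurement, and policy names are made up, and you'd adjust the durations to your own minute/hour/day scheme:

    # Sketch: expire raw points after 30 days, keep hourly means for two years.
    # Database/measurement/policy names are illustrative.
    from influxdb import InfluxDBClient

    client = InfluxDBClient(host='localhost', port=8086, database='metrics')

    # Raw points live in the default policy for 30 days only.
    client.query("CREATE RETENTION POLICY raw ON metrics DURATION 30d REPLICATION 1 DEFAULT")

    # Downsampled data is kept for two years.
    client.query("CREATE RETENTION POLICY hourly ON metrics DURATION 104w REPLICATION 1")

    # Continuous query: InfluxDB computes the hourly mean in the background.
    client.query(
        'CREATE CONTINUOUS QUERY cq_energy_1h ON metrics BEGIN '
        'SELECT mean(value) AS value INTO metrics.hourly.energy_1h '
        'FROM energy GROUP BY time(1h), device END'
    )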
The data must be extensible: once in a while a device will provide more data than in the past -> the count of rows can grow/increase
As long as historic writes are fairly infrequent, they are no problem for InfluxDB. If you are frequently writing non-sequential data, write performance can slow down, but only while the non-sequential points are being replicated.
InfluxDB is not quite schema-less, but the schema cannot be pre-defined, and is derived from the points inserted. You can add new tags (metadata) or fields (metrics) simply by writing a new point that includes them, and you can automatically compose or decompose series by excluding or including the relevant tags when querying.
The data must be updatable: when a customer delivers historic data, we need to be able to update the past.
InfluxDB silently overwrites points when a new matching point comes in. (Matching means the same series and timestamp, to the nanosecond.)
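As a small hedged sketch with the Python client: writing a corrected historic point just means writing it again, and any extra field comes along without a schema change (the measurement, tag, and field names below are invented):

    from influxdb import InfluxDBClient

    client = InfluxDBClient(host='localhost', port=8086, database='metrics')

    # Same series (measurement + tags) and same timestamp -> the stored point
    # is overwritten; the extra 'quality' field simply extends the schema.
    client.write_points([{
        "measurement": "energy",
        "tags": {"device": "dev42"},
        "time": "2015-06-01T12:00:00Z",
        "fields": {"value": 17.3, "quality": 1}
    }])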
We're using PHP -> would be nice to have connectors/classes :)
There are a handful of PHP libraries out there for InfluxDB 0.9. None are officially supported but likely one fits your needs enough to extend or fork.

You haven't specified what you want in enough detail.
Do you care about latency? If not, just write all your data points to per-interval files in S3, then periodically collect and process them. (No Hadoop needed; a simple script that downloads the new files is usually plenty fast.) This is how logging to S3 works.
The really nice part about this is you will never outgrow S3 or do any maintenance. If you prefix your files correctly, you can grab a day's worth of data or the last hour of data easily. Then you do your day/week/month roll-ups on that data, then store only the roll-ups in a regular database.
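To make the prefixing idea concrete, here's a rough boto3 sketch; the bucket name and key layout are made up, but with keys like raw/YYYY/MM/DD/HH/... a single prefix selects exactly one hour or one day of files:

    # Sketch: per-interval files in S3, keyed so that a prefix selects a time range.
    import boto3
    from datetime import datetime, timezone

    s3 = boto3.client("s3")
    BUCKET = "my-timeseries-archive"   # illustrative

    def upload_interval(device_id, payload, ts=None):
        # payload: bytes, e.g. a CSV chunk covering one interval
        ts = ts or datetime.now(timezone.utc)
        key = ts.strftime("raw/%Y/%m/%d/%H/") + f"{device_id}.csv"
        s3.put_object(Bucket=BUCKET, Key=key, Body=payload)

    def list_day(day):   # day = "2015/06/01"
        resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"raw/{day}/")
        return [obj["Key"] for obj in resp.get("Contents", [])]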
Do you need the old data at high resolution? You can use Graphite to roll up your data automatically. The downside is that it loses resolution as it ages. But the upside is that your data is a fixed size and never grows, and writes can be handled quickly. You can even combine the above approach and send data to Graphite for quick viewing, but keep the data in S3 for other uses down the road.
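If you do send data to Graphite, feeding it is just the plaintext protocol on port 2003 ("metric-path value timestamp"); a minimal sketch, with the host and metric path as placeholders:

    # Sketch: push one datapoint to Graphite/carbon over the plaintext protocol.
    import socket, time

    def send_to_graphite(path, value, host="graphite.local", port=2003):
        line = f"{path} {value} {int(time.time())}\n"
        with socket.create_connection((host, port)) as sock:
            sock.sendall(line.encode("ascii"))

    send_to_graphite("devices.dev42.energy", 17.3)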
I haven't researched the various TSDBs extensively, but here is a nice HN thread about it. InfluxDB is nice, but new. Cassandra is more mature, but the tooling to use it as a TSDB isn't all there yet.
How much new data do you have? Most tools will handle 10,000 datapoints per second easily, but not all of them can scale beyond that.

I'm with the team that develops Axibase Time-Series Database. It's a non-relational database that allows you to efficiently store timestamped measurements with various dimensions. You can also store device properties (id, location, type, etc) in the same database for filtering and grouped aggregations.
ATSD doesn't delete raw data by default. Each sample takes 3.5+ bytes per time:value tuple. Period aggregations are performed at request time, and the list of functions includes MIN, MAX, AVG, SUM, COUNT, PERCENTILE(n), STANDARD_DEVIATION, FIRST, LAST, DELTA, RATE, WAVG, and WTAVG, as well as some additional functions for computing threshold violations per period.
Backfilling historical data is fully supported, except that the timestamp has to be later than January 1, 1970. Time precision is milliseconds or seconds.
As for deployment options, you could host this database on AWS. It runs on most Linux distributions. We could run some storage efficiency and throughput tests for you if you want to post sample data from your dataset here.

Related

Day-wise aggregation considering client's timezone from millions of rows

Suppose I have a table where visitors' (website visitor) information is stored, and that the table structure consists of the following fields:
ID
visitor_id
visit_time (stored as milliseconds since '1970-01-01 00:00:00' UTC)
Millions of rows are in this table and it's still growing.
In that case, if I want to see a report (day vs. visitors) from any timezone, then one solution is:
Solution #1:
Get the timezone of the report viewer (i.e. client)
Aggregate the data from this table considering the client's timezone
Show the result day wise
But in that case performance will degrade. Another solution may be the following:
Solution #2:
Using Pre-aggregated tables / summary tables where client's timezone is ignored
But in either case there is a trade-off between performance and correctness.
Solution #1 ensures correctness and Solution #2 ensures better performance.
I want to know what is the best practice in this particular scenario?
The issue of handling time comes up a fair amount when you get into distributed systems, users and matching events between various sources of data.
I would strongly suggest that you ensure all logging systems use UTC. This allows collection from any variety of servers (which are all hopefully kept synchronized with respect to their view of the current UTC time) located anywhere in the world.
Then, as requests come in, you can convert from the user's timezone to UTC. At this point you have the same decision -- perform a real-time query or access some previously summarized data.
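In practice "one day in the client's timezone" is just a half-open UTC interval, which you can compute once and then use in an ordinary range query on visit_time. A sketch (the table name visits and the Python 3.9 zoneinfo module are assumptions; pytz works similarly):

    # Sketch: translate a client-local calendar day into a UTC millisecond range.
    from datetime import datetime, timedelta
    from zoneinfo import ZoneInfo

    def day_range_utc_ms(day_str, tz_name):
        tz = ZoneInfo(tz_name)
        start_local = datetime.strptime(day_str, "%Y-%m-%d").replace(tzinfo=tz)
        end_local = start_local + timedelta(days=1)
        return int(start_local.timestamp() * 1000), int(end_local.timestamp() * 1000)

    start_ms, end_ms = day_range_utc_ms("2015-06-01", "Asia/Dhaka")
    sql = ("SELECT COUNT(DISTINCT visitor_id) FROM visits "
           "WHERE visit_time >= %s AND visit_time < %s")
    # cursor.execute(sql, (start_ms, end_ms))   # with any MySQL client/driver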
Whether or not you want to aggregate the data in advance will depend on a number of things: whether aggregation can reduce the amount of data kept, how much processing is needed to support queries, how often queries will be performed, and even the cost of building a system versus the amount of use it might see.
With respect to best practices -- keep the display characteristics (e.g. time zone) independent from the processing of the data.
If you haven't already, be sure you consider the lifetime of the data you are keeping. Will you need ten years of back data available? Hopefully not. Do you have a strategy for culling old data when it is no longer required? Do you know how much data you'll have if you store every record (estimate with various traffic growth rates)?
Again, a best practice for larger data sets is to understand how you are going to deal with the size and how you are going to manage that data over time as it ages. This might involve long term storage, deletion, or perhaps reduction to summarized form.
Oh, and to slip in a Matrix analogy, what is really going to bake your noodle in terms of "correctness" is the fact that correctness is not at issue here. Every timezone has a different view of traffic during a "day" in their own zone and every one of them is "correct". Even those oddball time zones that differ from yours by an adjustment that isn't measured only in hours.

Karma Tracking Reports: Should I use redis-only or a redis-mysql hybrid?

Background: I'm building a web-app that consists of a tool and an accompanying reporting system to track the total usage of the tool. I want to show the user reports based on daily usage, monthly usage, yearly usage and total usage, all in terms of minutes. Think minutes of usage = "karma" points.
I'm planning on implementing this usage tracking in redis. Now I could
1) increment multiple counters at the same time (Daily, Monthly, Yearly).
Or
2) I could just keep 2 sets of records:
a) Total Karma (simple Redis counter)
b) A row in MySQL with the Karma and the Date, and use SQL queries to generate the reports for annual Karma and monthly Karma.
The advantage of option b) is that it won't clutter up Redis with a whole lot of denormalized data. But that might not be a disadvantage IF it's trivial to port this data to MySQL when need be.
Any thoughts?
There's no reason why you can't use redis as a primary data store. However, keep the following in mind:
Your working set should fit in memory. Otherwise it becomes a mess.
You'll need to back up your redis data regularly and treat it just as important as a MySQL backup.
If you see your redis data growing larger than a single instance, I suggest looking at presharding.
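For what it's worth, option 1 from the question (bumping the daily/monthly/yearly/total counters together) is only a few lines with redis-py; the key naming scheme here is made up:

    # Sketch: increment all karma counters for a user in one atomic pipeline.
    import redis
    from datetime import date

    r = redis.Redis(host="localhost", port=6379)

    def add_karma(user_id, minutes, today=None):
        today = today or date.today()
        pipe = r.pipeline()
        pipe.incrby(f"karma:{user_id}:day:{today:%Y%m%d}", minutes)
        pipe.incrby(f"karma:{user_id}:month:{today:%Y%m}", minutes)
        pipe.incrby(f"karma:{user_id}:year:{today:%Y}", minutes)
        pipe.incrby(f"karma:{user_id}:total", minutes)
        pipe.execute()

    add_karma("user42", 15)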

Redis vs MySQL for Financial Data?

I realize that this question is pretty well discussed, however I would like to get your input in the context of my specific needs.
I am developing a realtime financial database that grabs stock quotes from the net multiple times a minute and stores it in a database. I am currently working with SQLAlchemy over MySQL, but I came across Redis and it looks interesting. It looks good especially because of its performance, which is crucial in my application. I know that MySQL can be fast too, I just feel like implementing heavy caching is going to be a pain.
The data I am saving is by far mostly decimal values. I am also doing a significant amount of divisions and multiplications with these decimal values (in a different application).
In terms of data size, I am grabbing about 10,000 symbols multiple times a minute. This amounts to about 3 TB of data a year.
I am also concerned by Redis's key quantity limitation (2^32). Is Redis a good solution here? What other factors can help me make the decision either toward MySQL or Redis?
Thank you!
Redis is an in-memory store. All the data must fit in memory. So unless you have 3 TB of RAM per year of data, it is not the right option. The 2^32 limit is not really an issue in practice, because you would probably have to shard your data anyway (i.e. use multiple instances), and because the limit is actually 2^32 keys with 2^32 items per key.
If you have enough memory and still want to use (sharded) Redis, here is how you can store space efficient time series: https://github.com/antirez/redis-timeseries
You may also want to patch Redis in order to add a proper time series data structure. See Luca Sbardella's implementation at:
https://github.com/lsbardel/redis
http://lsbardel.github.com/python-stdnet/contrib/redis_timeseries.html
Redis is excellent for aggregating statistics in real time and storing the result of these calculations (i.e. DIRT applications). However, storing historical data in Redis is much less interesting, since it offers no query language to perform offline calculations on this data. Btree-based stores supporting sharding (MongoDB, for instance) are probably more convenient than Redis for storing large time series.
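If you do keep a (sharded) time series in Redis, one common layout - not necessarily what the projects linked above use - is a sorted set per symbol with the timestamp as the score, so range queries by time stay cheap:

    # Sketch: sorted set per symbol; member = "timestamp:value", score = timestamp.
    import redis, time

    r = redis.Redis()

    def record_quote(symbol, price, ts=None):
        ts = ts or time.time()
        r.zadd(f"quotes:{symbol}", {f"{ts}:{price}": ts})

    def quotes_between(symbol, start_ts, end_ts):
        return r.zrangebyscore(f"quotes:{symbol}", start_ts, end_ts)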
Traditional relational databases are not so bad to store time series. People have dedicated entire books to this topic:
Developing Time-Oriented Database Applications in SQL
Another option you may want to consider is using a bigdata solution:
storing massive ordered time series data in bigtable derivatives
IMO the main point (whatever the storage engine) is to evaluate the access patterns to these data. What do you want to use these data for? How will you access these data once they have been stored? Do you need to retrieve all the data related to a given symbol? Do you need to retrieve the evolution of several symbols in a given time range? Do you need to correlate values of different symbols by time? etc ...
My advice is to try to list all these access patterns. The choice of a given storage mechanism will only be a consequence of this analysis.
Regarding MySQL usage, I would definitely consider table partitioning because of the volume of the data. Depending on the access patterns, I would also consider the ARCHIVE engine. This engine stores data in compressed flat files. It is space efficient. It can be used with partitioning, so even though it does not index the data, it can be efficient at retrieving a subset of data if the partition granularity is carefully chosen.
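As a rough illustration of the ARCHIVE + partitioning combination (table and column names are invented, partition boundaries are arbitrary, and note that MySQL 8.0 restricts native partitioning to InnoDB/NDB, so this applies to the 5.x era discussed here):

    # Sketch: compressed ARCHIVE table partitioned by month, created through
    # mysql.connector; connection details are placeholders.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="market")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE quote_archive (
            symbol VARCHAR(12)   NOT NULL,
            ts     DATETIME      NOT NULL,
            price  DECIMAL(18,6) NOT NULL
        ) ENGINE=ARCHIVE
        PARTITION BY RANGE (TO_DAYS(ts)) (
            PARTITION p2015_01 VALUES LESS THAN (TO_DAYS('2015-02-01')),
            PARTITION p2015_02 VALUES LESS THAN (TO_DAYS('2015-03-01'))
        )
    """)
    conn.close()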
You should consider Cassandra or Hbase. Both allow contiguous storage and fast appends, so that when it comes to querying, you get huge performance. Both will easily ingest tens of thousands of points per second.
The key point is along one of your query dimensions (usually by ticker), you're accessing disk (ssd or spinning), contiguously. You're not having to hit indices millions of times. You can model things in Mongo/SQL to get similar performance, but it's more hassle, and you get it "for free" out of the box with the columnar guys, without having to do any client side shenanigans to merge blobs together.
My experience with Cassandra is that it's 10x faster than MongoDB, which is already much faster than most relational databases, for the time series use case, and as data size grows, its advantage over the others grows too. That's true even on a single machine. Here is where you should start.
The only negative with Cassandra, at least, is that with a big cluster you sometimes don't have consistency for a few seconds, so you either need to force it (slowing it down) or accept that the very latest print will sometimes be a few seconds old. On a single machine there will be zero consistency problems, and you'll get the same columnar benefits.
I'm less familiar with HBase, but it claims to be more consistent (there will be a cost elsewhere - CAP theorem). It is, however, much more of a commitment to set up the HBase stack.
You should first check the features that Redis offers in terms of data selection and aggregation. Compared to an SQL database, Redis is limited.
In fact, 'Redis vs MySQL' is usually not the right question, since they are apples and pears. If you are refreshing the data in your database (also removing regularly), check out MySQL partitioning. See e.g. the answer I wrote to What is the best way to delete old rows from MySQL on a rolling basis?
Check out MySQL Partitioning:
Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.
See e.g. this post to get some ideas on how to apply it:
Using Partitioning and Event Scheduler to Prune Archive Tables
And this one:
Partitioning by dates: the quick how-to
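Pruning then becomes a metadata operation rather than a row-by-row DELETE; a sketch of what a scheduled job (event scheduler or cron) might run, reusing the made-up names from the sketch further up:

    # Sketch: rolling retention by adding next month's partition and dropping
    # the oldest one. Table and partition names are illustrative.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="market")
    cur = conn.cursor()
    cur.execute("ALTER TABLE quote_archive ADD PARTITION ("
                "PARTITION p2015_03 VALUES LESS THAN (TO_DAYS('2015-04-01')))")
    cur.execute("ALTER TABLE quote_archive DROP PARTITION p2015_01")
    conn.close()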

handling/compressing large datasets in multiple tables

In an application at our company we collect statistical data from our servers (load, disk usage and so on). Since there is a huge amount of data and we don't need all of it at all times, we've had a "compression" routine that takes the raw data, calculates min, max and average for a number of data points, stores these new values in the same table and removes the old ones after some weeks.
Now I'm tasked with rewriting this compression routine, and the new routine must keep all uncompressed data we have for one year in one table and "compressed" data in another table. My main concerns now are how to handle the data that is continuously written to the database and whether or not to use a "transaction table" (my own term, since I can't come up with a better one; I'm not talking about the commit/rollback transaction functionality).
As of now our data collectors insert all information into a table named ovak_result, and the compressed data will end up in ovak_resultcompressed. But are there any specific benefits or drawbacks to creating a table called ovak_resultuncompressed and just using ovak_result as "temporary storage"? ovak_result would be kept minimal, which would be good for the compressing routine, but I would need to shuffle all data from one table into another continually, and there would be constant reading, writing and deleting in ovak_result.
Are there any mechanisms in MySQL to handle these kind of things?
(Please note: We are talking about quite large datasets here (about 100 M rows in the uncompressed table and about 1-10 M rows in the compressed table). Also, I can do pretty much what I want with both software and hardware configurations so if you have any hints or ideas involving MySQL configurations or hardware set-up, just bring them on.)
Try reading about the ARCHIVE storage engine.
Re your clarification. Okay, I didn't get what you meant from your description. Reading more carefully, I see you did mention min, max, and average.
So what you want is a materialized view that updates aggregate calculations for a large dataset. Some RDBMS brands such as Oracle have this feature, but MySQL doesn't.
One experimental product that tries to solve this is called FlexViews (http://code.google.com/p/flexviews/). This is an open-source companion tool for MySQL. You define a query as a view against your raw dataset, and FlexViews continually monitors the MySQL binary logs, and when it sees relevant changes, it updates just the rows in the view that need to be updated.
It's pretty effective, but it has a few limitations in the types of queries you can use as your view, and it's also implemented in PHP code, so it's not fast enough to keep up if you have really high traffic updating your base table.

Efficient and scalable storage for JSON data with NoSQL databases

We are working on a project which should collect journal and audit data and store it in a datastore for archive purposes and some views. We are not quite sure which datastore would work for us.
we need to store small JSON documents, about 150 bytes, e.g. audit:{timestamp:'86346512', host:'foo', username:'bar', task:'foo', result:0} or journal:{timestamp:'86346512', host:'foo', terminalid:1, type:'bar', rc:0}
we are expecting about one million entries per day, about 150 MB data
data will be stored and read but never modified
data should be stored in an efficient way, e.g. the binary format used by Apache Avro
after a retention time data may be deleted
custom queries, such as 'get audit for user and time period' or 'get journal for terminalid and time period'
replicated database for failsafe operation
scalable
Currently we are evaluating NoSQL databases like Hadoop/Hbase, CouchDB, MongoDB and Cassandra. Are these databases the right datastore for us? Which of them would fit best?
Are there better options?
One million inserts / day is about 10 inserts / second. Most databases can deal with this, and it's well below the max insertion rate we get from Cassandra on reasonable hardware (50k inserts / sec).
Your requirement "after a retention time data may be deleted" fits Cassandra's column TTLs nicely - when you insert data you can specify how long to keep it for, then background merge processes will drop that data when it reaches that timeout.
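With CQL and the Python driver that could look roughly like this (keyspace, table, and column names are made up); the TTL is attached per insert, so the retention time travels with the data:

    # Sketch: insert an audit record that Cassandra expires after ~90 days.
    from cassandra.cluster import Cluster
    from datetime import datetime, timezone

    session = Cluster(["127.0.0.1"]).connect("logging")

    session.execute(
        "INSERT INTO audit (username, event_time, host, task, result) "
        "VALUES (%s, %s, %s, %s, %s) USING TTL 7776000",   # 90 days in seconds
        ("bar", datetime.now(timezone.utc), "foo", "foo", 0),
    )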
"data should stored in an efficient way, e.g. binary format used by Apache Avro" - Cassandra (like many other NOSQL stores) treats values as opaque byte sequences, so you can encode you values how ever you like. You could also consider decomposing the value into a series of columns, which would allow you to do more complicated queries.
custom queries, such as 'get audit for user and time period' - in Cassandra, you would model this by having the row key to be the user id and the column key being the time of the event (most likely a timeuuid). You would then use a get_slice call (or even better CQL) to satisfy this query
or 'get journal for terminalid and time period' - as above, have the row key be terminalid and column key be timestamp. One thing to note is that in Cassandra (like many join-less stores), it is typical to insert the data more than once (in different arrangements) to optimise for different queries.
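In present-day CQL terms, the "row key = user, column key = event time" layout described above is simply a table with the user as the partition key and the time as a clustering column; a hedged sketch (names invented, and a timeuuid clustering column would additionally avoid collisions between events sharing a timestamp):

    # Sketch: audit events partitioned by user and clustered by time, so
    # "get audit for user X in period Y" reads one contiguous slice.
    from cassandra.cluster import Cluster
    from datetime import datetime

    session = Cluster(["127.0.0.1"]).connect("logging")

    session.execute("""
        CREATE TABLE IF NOT EXISTS audit (
            username   text,
            event_time timestamp,
            host       text,
            task       text,
            result     int,
            PRIMARY KEY ((username), event_time)
        ) WITH CLUSTERING ORDER BY (event_time DESC)
    """)

    rows = session.execute(
        "SELECT * FROM audit WHERE username = %s "
        "AND event_time >= %s AND event_time < %s",
        ("bar", datetime(2015, 6, 1), datetime(2015, 6, 2)),
    )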
Cassandra has a very sophisticated replication model, where you can specify different consistency levels per operation. Cassandra is also a very scalable system with no single point of failure or bottleneck. This is really the main difference between Cassandra and things like MongoDB or HBase (not that I want to start a flame war!)
Having said all of this, your requirements could easily be satisfied by a more traditional database and simple master-slave replication; nothing here is too onerous.
Avro supports schema evolution and is a good fit for this kind of problem.
If your system does not require low latency data loads, consider receiving the data to files in a reliable file system rather than loading directly into a live database system. Keeping a reliable file system (such as HDFS) running is simpler and less likely to have outages than a live database system. Also, separating the responsibilities ensures that your query traffic won't ever impact the data collection system.
If you will only have a handful of queries to run, you could leave the files in their native format and write custom MapReduce jobs to generate the reports you need. If you want a higher-level interface, consider running Hive over the native data files. Hive will let you run arbitrary friendly SQL-like queries over your raw data files. Or, since you only have 150 MB/day, you could just batch load it into read-only compressed MySQL tables.
If for some reason you need the complexity of an interactive system, HBase or Cassandra might be good fits, but beware that you'll spend a significant amount of time playing "DBA", and 150 MB/day is so little data that you probably don't need the complexity.
We're using Hadoop/HBase, and I've looked at Cassandra, and they generally use the row key as the means to retrieve data the fastest, although of course (in HBase at least) you can still have it apply filters on the column data, or do it client side. For example, in HBase, you can say "give me all rows starting from key1 up to, but not including, key2".
So if you design your keys properly, you could get everything for 1 user, or 1 host, or 1 user on 1 host, or things like that. But, it takes a properly designed key. If most of your queries need to be run with a timestamp, you could include that as part of the key, for example.
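For example, with row keys laid out as "<username>|<epoch-millis>", the "from key1 up to key2" scan looks like this through happybase (a Thrift-based Python client; table name, host, and key layout are all assumptions):

    # Sketch: HBase range scan over user+timestamp row keys via happybase.
    import happybase

    connection = happybase.Connection("hbase-thrift.local")
    table = connection.table("audit")

    # Everything for user "bar" between two epoch-millisecond timestamps:
    start = b"bar|1433116800000"
    stop  = b"bar|1433203200000"
    for key, data in table.scan(row_start=start, row_stop=stop):
        print(key, data)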
How often do you need to query the data/write the data? If you expect to run your reports and it's fine if it takes 10, 15, or more minutes (potentially), but you do a lot of small writes, then HBase w/Hadoop doing MapReduce (or using Hive or Pig as higher level query languages) would work very well.
If your JSON data has variable fields, then a schema-less model like Cassandra could suit your needs very well. I'd expand the data into columns rather than storing it in binary format; that will make it easier to query. At the given data rate, it would take you about 20 years to fill a 1 TB disk, so I wouldn't worry about compression.
For the example you gave, you could create two column families, Audit and Journal. The row keys would be TimeUUIDs (i.e. timestamp + MAC address to turn them into unique keys). Then the audit row you gave would have four columns, host:'foo', username:'bar', task:'foo', and result:0. Other rows could have different columns.
A range scan over the row keys would allow you to query efficiently over time periods (assuming you use ByteOrderedPartitioner). You could then use secondary indexes to query on users and terminals.