Database design - How much data to store, performance vs quality - MySQL

There is some value, x, which I am recording every 30 seconds, currently into a database with three fields:
ID
Time
Value
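For concreteness, a minimal sketch of that table in MySQL, using the column types mentioned further down (int id, datetime, float); the names here are illustrative, not the actual schema:

    CREATE TABLE samples_30s (
        id      INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- surrogate key
        `time`  DATETIME     NOT NULL,                 -- when the sample was taken
        `value` FLOAT        NOT NULL,                 -- the recorded value x
        PRIMARY KEY (id),
        KEY idx_time (`time`)                          -- assumed index for time-range queries
    ) ENGINE=InnoDB;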
I am then creating a mobile app which will use that data to plot charts in views of:
Last hour
Last 24 hours
Last 7 days
Last 30 days
Last year
Obviously, saving every 30 seconds for the last year and then sending that data to a mobile device will be too much (it would mean sending 1051200 values).
My second thought was that perhaps I could use MySQL's average function: for example, compute an average for every 7 days (creating 52 points for a year) and send those points. This would work, but MySQL would still be trawling through the raw rows to compute the averages on every request, and if many users connect it's going to be bad.
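For illustration, that weekly-average query might look something like the following (samples_30s being the assumed name of the 30s table):

    -- One average per calendar week over the last year (roughly 52 rows)
    SELECT YEARWEEK(`time`) AS yearweek,
           MIN(`time`)      AS week_start,
           AVG(`value`)     AS avg_value
    FROM   samples_30s
    WHERE  `time` >= NOW() - INTERVAL 1 YEAR
    GROUP  BY YEARWEEK(`time`)
    ORDER  BY yearweek;

Even with an index on time, every request scans the last year's ~1 million rows, which is exactly the repeated work the roll-up idea below is meant to avoid.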
So, simply put, if these are my views then I do not need to keep all of that data. Nobody will care what x was a year ago to a precision of 30 seconds, and that's fine. I should be able to use triggers to maintain some averages.
I'm looking for someone to check whether what I have below is reasonable:
Store values every 30s in a table (this will be used for the hour view, 120 points)
When there are 120 rows in the 30s table (120 * 30s = 60 mins = 1 hour), use a trigger to store the first half hour as an average in a "half hour average" table and remove the first 60 entries from the 30s table (see the sketch after this list). This new table will need an id, start time, end time and value. The half hour averages will be used for the 24 hour view (48 data points).
When the half hour table has more than 24 entries (12 hours), store the first 12 (6 hours' worth) as an average in a 6 hour average table and then remove them from the half hour table. This 6 hour average will be used for the 7 day view (28 data points).
When there are 8 entries in the 6 hour table, remove the first 4 and store this as an average day, to be used in the 30 day view (30 data points).
When there are 14 entries in the day table, remove the first 7 and store their average in a week table; this will be used for the year view (52 data points).
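As a rough sketch of step 2 (the other roll-ups would follow the same pattern): MySQL will not let a trigger on the 30s table delete rows from that same table, so the pruning is easier to express as a scheduled event (or an external cron job). Table and column names here are assumed:

    -- Assumed half hour roll-up table
    CREATE TABLE half_hour_avg (
        id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
        start_time DATETIME     NOT NULL,   -- first sample in the bucket
        end_time   DATETIME     NOT NULL,   -- last sample in the bucket
        `value`    FLOAT        NOT NULL,   -- average over the bucket
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;

    -- Requires the event scheduler: SET GLOBAL event_scheduler = ON;
    -- (From the mysql client, wrap the BEGIN...END body in a DELIMITER change.)
    CREATE EVENT rollup_half_hours
    ON SCHEDULE EVERY 30 MINUTE
    DO
    BEGIN
        DECLARE cutoff DATETIME;
        SET cutoff = NOW() - INTERVAL 1 HOUR;   -- keep the newest hour at full 30s resolution

        -- Average everything older than the cutoff into half-hour buckets...
        INSERT INTO half_hour_avg (start_time, end_time, `value`)
        SELECT MIN(`time`), MAX(`time`), AVG(`value`)
        FROM   samples_30s
        WHERE  `time` < cutoff
        GROUP  BY FLOOR(UNIX_TIMESTAMP(`time`) / 1800);   -- 1800 s = one half hour

        -- ...then drop the raw rows that have just been rolled up.
        DELETE FROM samples_30s WHERE `time` < cutoff;
    END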
This doesn't seem like the best way to me, as it seems to be more complicated than I would imagine it should be.
The alternative is to keep all of the data and let MySQL compute averages as and when needed. This will create a monstrously huge database, and I have no idea about the performance yet. The id is an int, time is a datetime and value is a float. Is 1051200 records too many? Now is a good time to add that I would like to run this on a Raspberry Pi; if that isn't feasible, I do have my main machine which I could use.

Your proposed design looks OK. Perhaps there are more elegant ways of doing this, but your proposal should work too.
RRD (http://en.wikipedia.org/wiki/Round-Robin_Database) is a specialised database designed to do all of this automatically, and it should be much more performant than MySQL for this particular purpose.
An alternative is the following: keep only the original table (1051200 records), but have a trigger that regenerates the last hour/day/year etc. views every time a new record is added (i.e. every 30 seconds) and store/cache the result somewhere. Then your number-crunching workload is independent of the number of requests/clients you have to serve.
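A minimal sketch of that caching idea, with assumed names; the cache has to live in a separate table because a MySQL trigger cannot modify the table it fires on. Only the hour view is shown, the other views would be analogous:

    -- Assumed cache table: one pre-computed value per chart view
    CREATE TABLE view_cache (
        view_name  VARCHAR(20) NOT NULL PRIMARY KEY,   -- 'hour', 'day', 'week', ...
        avg_value  FLOAT       NOT NULL,
        updated_at DATETIME    NOT NULL
    ) ENGINE=InnoDB;

    -- Recompute the last-hour average every time a sample is inserted
    -- (cheap here, since inserts only happen every 30 seconds)
    CREATE TRIGGER refresh_hour_cache
    AFTER INSERT ON samples_30s
    FOR EACH ROW
        REPLACE INTO view_cache (view_name, avg_value, updated_at)
        SELECT 'hour', AVG(`value`), NOW()
        FROM   samples_30s
        WHERE  `time` >= NOW() - INTERVAL 1 HOUR;

Only a single headline value per view is cached here to keep the sketch short; caching the full down-sampled point series works the same way with one cached row per chart point, and the mobile app then reads those small cached rows instead of triggering any aggregation of its own.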
1051200 records may or may not be too many. Test on your Raspberry Pi to find out.

Let me give a suggestion on the physical layout of your table, regardless of whether you decide to keep all data or "prune" it from time to time...
Since you generate a new row "every 30 seconds", Time can serve as a natural key without fear of exceeding the resolution of the underlying data type and causing duplicate keys. You don't need ID in this scenario [1], so your table is simply:
Time (PK)
Value
And since InnoDB tables are clustered, not having secondary indexes [2] means the whole table is stored in a single B-tree, which is as efficient as it gets from a storage and querying perspective. On top of that, Value is automatically covered, which may not have been the case in your original design unless you specifically designed your index(es) for that.
Using time as a key can be tricky in general, but I think it may be worth it in this particular case.
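A sketch of that layout (same illustrative table name as before, now without the surrogate id); DATETIME has one-second resolution, comfortably finer than the 30 second sampling interval:

    -- The whole table is one clustered B-tree keyed on `time`
    CREATE TABLE samples_30s (
        `time`  DATETIME NOT NULL,
        `value` FLOAT    NOT NULL,
        PRIMARY KEY (`time`)
    ) ENGINE=InnoDB;

    -- Chart queries become simple range scans of the clustered index
    SELECT `time`, `value`
    FROM   samples_30s
    WHERE  `time` >= NOW() - INTERVAL 1 HOUR
    ORDER  BY `time`;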
[1] Unless there are other tables that reference it through FOREIGN KEYs, or you have already written too much code that depends on it.
[2] Which would be necessary in the original design to support efficient aggregation.

Related

Elastic Search Performance Solution

We have records taken every 30 seconds, and over the past few years quite a few have built up. We are creating a windowing search on those records based on some interval of time.
For Example:
The user inputs a 15 minute increment between records, based on the current time. If the time now is 8:53, the next points would be 9:08, 9:23, etc., within 1 hour.
The user wants records between two dates (which could be as much as 3 years of records at a 15 minute increment).
The records are not guaranteed to fall exactly at those times; we need to return the closest record to each 15 minute point.
There are several other query conditions that filter these records, but the main issue is the problem above.
I have already built something that meets this goal, but it can likely be refined and made more performant. The way the UI queries and applies the incremented records can probably also be improved.
Does anyone have an idea for a solution? Thanks in advance.
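Not an Elasticsearch answer as such, but the windowing logic itself is compact enough to write down; here it is as a MySQL 8 sketch, where the table readings, the column ts and the literal timestamps (the user's anchor and range) are all assumed. It keeps, for each 15 minute point, the single record closest to it; in Elasticsearch the usual shape of the same idea is a date_histogram aggregation with a top_hits sub-aggregation of size 1.

    -- Keep, for each 15-minute point after the anchor, the record nearest to it
    SELECT *
    FROM (
        SELECT r.*,
               ROW_NUMBER() OVER (
                   -- which 15-minute point (0, 1, 2, ...) this record is nearest to
                   PARTITION BY ROUND((UNIX_TIMESTAMP(r.ts)
                                       - UNIX_TIMESTAMP('2023-05-01 08:53:00')) / 900)
                   -- rank records in that bucket by their distance to the point itself
                   ORDER BY ABS((UNIX_TIMESTAMP(r.ts) - UNIX_TIMESTAMP('2023-05-01 08:53:00'))
                                - 900 * ROUND((UNIX_TIMESTAMP(r.ts)
                                               - UNIX_TIMESTAMP('2023-05-01 08:53:00')) / 900))
               ) AS rn
        FROM   readings r
        WHERE  r.ts >= '2023-05-01 08:53:00'
          AND  r.ts <  '2023-05-01 09:53:00'   -- "within 1 hour" in the example above
    ) ranked
    WHERE rn = 1
    ORDER  BY ts;

For a 3 year range at a 15 minute step that is still on the order of 100,000 points, so pre-computing these representative records into a side table (much like the roll-up tables discussed in the question at the top) may end up being the bigger win.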

Database design for statistics of views/likes/downloads?

So I am trying to work out a database design to keep track of views, likes and downloads. The number of entries to track is expected to be 1m or more, so normally I would just track each entry daily, but with 1m entries I am having concerns about performance and maybe even disk size.
The customer's wish is to be able to show top statistics for ranges like last week, last month and last year. So I am not sure whether I should roll the data up, adding daily numbers into weeks or months and deleting everything that is no longer relevant, or keep it more flexible by tracking all of it and being able to query whatever statistics are needed freely.
Database: MySQL
I have to save the counters once a day.
I think you need to track it daily, and at the end of the week roll up and delete the daily stats; do the same for the weekly data in the second week, and so on for the months and the year, if you only want to keep the latest stats. You can create a scheduled job to do this. I hope this helps.
Since the operation only appends data, store it in time-series form.
Whenever click/view data comes in, count it and store the result at the smallest dimension you can; if you can summarise it hourly, that is the best choice for performance. Whenever you need a higher dimension, just sum the smaller buckets up (see the sketch after this list). Don't compute the statistic on demand: scanning 10 million rows just to count one day's data is a heavy operation.
With this approach you save two things:
Storage: you can archive data older than some threshold (for example, older than 3 months), so the database stays compact
Performance
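A sketch of that write path, with assumed table and column names: one counter row per item, metric and hour, bumped as events arrive, and summed on demand for larger ranges.

    -- Smallest stored dimension: one row per item, metric and hour
    CREATE TABLE stats_hourly (
        item_id    INT UNSIGNED NOT NULL,
        metric     ENUM('view','like','download') NOT NULL,
        hour_start DATETIME NOT NULL,                  -- timestamp truncated to the hour
        cnt        INT UNSIGNED NOT NULL DEFAULT 0,
        PRIMARY KEY (item_id, metric, hour_start)
    ) ENGINE=InnoDB;

    -- On every incoming event: bump the counter for the current hour
    INSERT INTO stats_hourly (item_id, metric, hour_start, cnt)
    VALUES (42, 'view', DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00'), 1)
    ON DUPLICATE KEY UPDATE cnt = cnt + 1;

    -- Higher dimensions are just sums, e.g. the most viewed items of the last month
    SELECT item_id, SUM(cnt) AS views
    FROM   stats_hourly
    WHERE  metric = 'view' AND hour_start >= NOW() - INTERVAL 1 MONTH
    GROUP  BY item_id
    ORDER  BY views DESC
    LIMIT  10;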

Filter records from a database on a minimum time interval for making graph

We have a MySQL database table with statistical data that we want to present as a graph, with timestamp used as the x axis. We want to be able to zoom in and out of the graph between resolutions of, say, 1 day and 2 years.
In the zoomed-out state we will not want to fetch all data from the table, since that would mean too much data being shipped through the servers, and the graph resolution will be good enough with less data anyway.
In MySQL you can write queries that only select e.g. every tenth value, which could be usable in this case. However, the intervals between values stored in the database aren't consistent; two values can be separated by as little as 10 minutes or as much as 6 hours, possibly more.
So the issue is that it is difficult to calculate a good stepping interval for the query: if we skip to every tenth value for some resolution, that may work for series with 10 minutes between points, but for 6 hour intervals we would throw away too much and the graph would end up with too low a resolution for comfort.
My impression is that MySQL can't make the stepping interval depend on time, so that it would skip rows that are, for example, within five minutes of an already included row.
One solution could be to set 6 hours as a minimal resolution requirement for the graph, so we don't throw away values unless 6 hours is represented by a sufficiently small distance in the graph. I fear that this may result in too much data being read and sent through the system if the interval actually is smaller.
Another solution is to have more intelligence in the Java code, reading sets of data iteratively from low resolution and downwards until the data is good enough.
Any ideas for a solution that would enable us to get optimal resolution in one read, without too large result sets being read from the database, while not putting too much load on the database? I'm having wild ideas about installing an intermediate NoSQL component to store the values in, that might support time intervals the way I want - not sure if that actually is an option in the organisation.
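One way to make the step depend on time rather than on row position is to bucket by elapsed time and return one point per bucket. A rough MySQL sketch, where 21600 seconds (6 hours) stands in for whatever step the current zoom level implies, and the table and column names are assumed:

    -- One point per 6-hour bucket across the zoomed range
    SELECT FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(ts) / 21600) * 21600) AS bucket_start,
           AVG(stat_value)                                          AS avg_value,
           COUNT(*)                                                 AS rows_in_bucket
    FROM   stats
    WHERE  ts BETWEEN '2022-01-01' AND '2024-01-01'
    GROUP  BY FLOOR(UNIX_TIMESTAMP(ts) / 21600)
    ORDER  BY bucket_start;

Sparse stretches simply produce fewer buckets and dense stretches are averaged down, so the result size is bounded by range/step however unevenly the raw rows are spaced, and the step can be computed per request from the current zoom level.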

Good design for a DB that decreases resolution of accuracy over time

I have something like 20,000 data points in a database and I want to display them on the Google annotated graph. I think around 2,000 points would be a good number to actually use on the graph, so I want to use averages instead of the real number of data points I have.
This data counts the frequency of something at a certain time; it would be like Table(frequency, datetime).
So for the first week the datetime will have an interval of 10 minutes, and frequency will be an average of all the frequencies in that 10 minute interval. Similarly, for the month after that I will have a datetime interval of an hour, and so on.
I think this is something you can see on Google Finance too: after some time the resolution of the data points decreases, even when you zoom in.
So what would be a good design for this? Is there already a tool that exists to do something like this?
I already thought of (though it might not be good) a giant table of all 20,000 points and several smaller tables that represent each time interval (1 week, 1 month etc) that are built through queries to the larger table and constantly updated and trimmed with new averages.
Keep the raw data in the db in one table. Have a second reporting table which you populate from the raw table with a script or query. The transformation that populates the reporting table can group and average the buckets however you want (a sketch follows below). The important thing is to not transform your data on initial insert -- keep all your raw data. That way you can always roll back or rebuild if you mess something up.
ETL. Learn it. Love it. Live it.
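A minimal sketch of that raw-plus-reporting split, assuming the Table(frequency, datetime) shape from the question and the 10 minute buckets mentioned for the most recent week:

    -- Raw data stays untouched in raw_freq(`datetime`, frequency);
    -- the reporting table holds only the 10-minute averages.
    CREATE TABLE report_10min (
        bucket_start DATETIME NOT NULL PRIMARY KEY,
        avg_freq     FLOAT    NOT NULL
    ) ENGINE=InnoDB;

    -- Run from a script, cron job or MySQL event to (re)build the last week's buckets
    INSERT INTO report_10min (bucket_start, avg_freq)
    SELECT FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(`datetime`) / 600) * 600),   -- 600 s = 10 min
           AVG(frequency)
    FROM   raw_freq
    WHERE  `datetime` >= NOW() - INTERVAL 7 DAY
    GROUP  BY FLOOR(UNIX_TIMESTAMP(`datetime`) / 600)
    ON DUPLICATE KEY UPDATE avg_freq = VALUES(avg_freq);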

What is a more efficient way to keep a daily ranking log for each user with MySQL?

I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the purpose of generating a graph showing how a user's rank has changed over the past 30 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: If you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 - 200000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility is, though the specifics are somewhat beyond my realm of expertise, you seem to have an ideal candidate for an OLAP cube, here: you have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
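If the MySQL event scheduler is available, that clean-up doesn't even need an external cron job; a sketch, assuming the RankHistory table has a date column named rank_date:

    -- Purge ranking rows older than 90 days, once per day
    CREATE EVENT purge_rank_history
    ON SCHEDULE EVERY 1 DAY
    DO
        DELETE FROM RankHistory
        WHERE rank_date < CURRENT_DATE - INTERVAL 90 DAY;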
Another option: can you create some "roll-up" aggregate per user based on whatever the criteria is... counts, sales, whatever, all stored by user + date of activity? Then you could keep your pre-aggregated rollups in a much smaller table for however long a history you need. Triggers or nightly procedures can run a query for the day and append the results to the daily summary. Then your queries and graphs can go against that without running into performance issues. This would also make it easier to move such records to a historical database archive.
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
Having 222k+ users, I would seriously consider how important the "Ranking" really is when you have someone in 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000 does it really matter???