Average calculation procedure - MySQL

I'm implementing a MySQL database for saving logged energy consumption data from a smart home application. The data should then be plotted with a JavaScript framework. Unfortunately the usage gets logged every 8 seconds, so there is far too much information to plot a one-year consumption graph.
The data is saved in a simple table with its timestamp, device id and the consumption at that specific time.
I'm hoping to be able to automatically aggregate the given data into minute, hour and finally day average values.
After some research I came across some queries/procedures that calculate average values for specific intervals. Unfortunately this isn't much help to me, as I have data over a period of three years and I don't want to create the intervals by hand.
Ideally a procedure in MySQL would aggregate the given device values by their timestamps, calculate an average value and save it in a separate table.
Does anyone have an idea how I could implement this?

SELECT AVG(consumption) AS minute_average,
       DATE_FORMAT(log_date, '%m/%d/%y %H:%i') AS minute
FROM data
GROUP BY DATE_FORMAT(log_date, '%m/%d/%y %H:%i');

SELECT AVG(consumption) AS hour_average,
       DATE_FORMAT(log_date, '%m/%d/%y %H') AS hour
FROM data
GROUP BY DATE_FORMAT(log_date, '%m/%d/%y %H');

SELECT AVG(consumption) AS day_average,
       DATE_FORMAT(log_date, '%m/%d/%y') AS day
FROM data
GROUP BY DATE_FORMAT(log_date, '%m/%d/%y');
Note: you could just as easily calculate any other aggregate, such as SUM or standard deviation.
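Since you also want the results saved in a separate table, a minimal sketch of that last step could look like the following; the hourly_average table, its columns and the device_id grouping are assumptions, not part of your posted schema:

CREATE TABLE hourly_average (
  device_id       INT NOT NULL,
  hour_start      DATETIME NOT NULL,
  avg_consumption DOUBLE NOT NULL,
  PRIMARY KEY (device_id, hour_start)
);

-- Aggregate the raw 8-second readings into hourly averages per device.
INSERT INTO hourly_average (device_id, hour_start, avg_consumption)
SELECT device_id,
       DATE_FORMAT(log_date, '%Y-%m-%d %H:00:00') AS hour_start,
       AVG(consumption)
FROM data
GROUP BY device_id, hour_start
ON DUPLICATE KEY UPDATE avg_consumption = VALUES(avg_consumption);

You could wrap the INSERT ... SELECT in a scheduled MySQL EVENT (or a cron job) so the summary table refreshes automatically; minute and day tables work the same way with a different DATE_FORMAT mask.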

Related

How to performantly record datetimes for analysis purposes?

In order to analyze dates and times I am creating a MySQL table where I want to keep the time information. Some example analyses will be stuff like:
Items per day/week/month/year
Items per weekday
Items per hour
etc.
Now, with regard to performance, which way should I record the date in my data table:
date type: Unix timestamp?
date type: datetime?
or keep the date components in separate fields, e.g. year, month and day each in its own column?
The last one, for example, would be handy if I'm analysing by weekday; I wouldn't have to perform WEEKDAY(item.date) on MySQL but could simply use WHERE item.weekday = :w.
Based on your usage, you want to use the native datetime format. Unix formats are most useful when the major operations are (1) ordering; (2) taking differences in seconds/minutes/hours/days; and (3) adding seconds/minutes/hours/days. They need to be converted to internal date time formats to get the month or week day, for instance.
You also have a potential indexing issue. If you want to select ranges of days, hours, months and so on for your results, then you want an index on the column. For this purpose an index on a datetime is probably sufficient.
If the summaries are by hour, you might find it helpful to store the date component in a date field and the hour in a separate column. That would be particularly helpful if you are combining hours from different days.
Whether you break out other components of the date, such as weekday and month, for indexing purposes would depend on the volume of data in the table, performance requirements, and the queries you are planning on running. I would not be inclined to do this, except as a later optimization.
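As a rough illustration of that layout (all table and column names here are made up for the sketch, not taken from your schema):

CREATE TABLE items (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  item_dt   DATETIME NOT NULL,          -- full timestamp, indexed for range scans
  item_date DATE NOT NULL,              -- date component, for day-level grouping
  item_hour TINYINT UNSIGNED NOT NULL,  -- 0-23, lets you combine hours across days
  INDEX idx_item_dt (item_dt),
  INDEX idx_date_hour (item_date, item_hour)
);

-- "Items per hour of day, across all days" then avoids calling HOUR() on every row:
SELECT item_hour, COUNT(*) FROM items GROUP BY item_hour;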
The rule of thumb is: store things as they should be stored, and don't do performance tweaks until you actually hit a bottleneck. If you store your date as separate fields, you'll eventually stumble upon a situation where you need the date as a whole inside the database (e.g. an update query for a particular range of time), and that becomes painful: a condition covering 3 April 2015 to 15 May 2015 spread over year/month/day columns gets enormous.
You should keep your dates as a date type. This gives you maximum flexibility, (most probably) better query readability, and keeps all your options for working with them open. The only thing I can really recommend is storing the same date split into year/month/day in additional columns; of course, this bloats your database and requires extreme caution in update scenarios, but it lets you use either form of the data in your queries.
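If you do go for the broken-out columns, one way to reduce the update-caution problem is to let the database maintain them for you with generated columns. A sketch only, assuming MySQL 5.7+ (MariaDB has the same feature, with PERSISTENT as the older keyword for STORED) and invented names:

CREATE TABLE items_split (
  item_dt    DATETIME NOT NULL,
  item_year  SMALLINT AS (YEAR(item_dt))  STORED,  -- kept in sync automatically
  item_month TINYINT  AS (MONTH(item_dt)) STORED,
  item_day   TINYINT  AS (DAY(item_dt))   STORED,
  INDEX idx_year_month (item_year, item_month)
);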

Database design - How much data to store, performance vs quality

There is some value, x, which I am recording every 30 seconds, currently into a database with three fields:
ID
Time
Value
I am then creating a mobile app which will use that data to plot charts in views of:
Last hour
Last 24 hours.
7 Day
30 Day
Year
Obviously, saving every 30 seconds for the last year and then sending that data to a mobile device will be too much (it would mean sending 1051200 values).
My second thought was that perhaps I could use the average function in MySQL, for example collecting averages over every 7 days (creating 52 points for a year) and sending those points. This would work, but MySQL would still be trawling through the data computing averages, and if many users connect, that's going to be bad.
So simply put, if these are my views, then I do not need to keep track of all that data. Nobody should care what x was a year ago to the precision of every 30 seconds, so that is fine. I should be able to use "triggers" to create some averages.
I'm looking for someone to check that what I have below is reasonable:
Store values every 30s in a table (this will be used for the hour view, 120 points)
When there are 120 rows in the 30s table (120 * 30s = 60 mins = 1 hour), use a trigger to average the first half hour into a "half hour average" table and remove the first 60 entries from the 30s table. This new table will need an id, start time, end time and value. The half hour averages will be used for the 24 hour view (48 data points).
When the half hour table has more than 24 entries (12 hours), average the first 12 (12 * 30 min = 6 hours) into a "6 hour average" table and then remove them from the half hour table. The 6 hour averages will be used for the 7 day view (28 data points).
When there are 8 entries in the 6 hour table, remove the first 4 and store this as an average day, to be used in the 30 day view (30 data points).
When there are 14 entries in the day view, remove the first 7 and store in a week table, this will be used for the year view.
This doesn't seem like the best way to me, as it seems to be more complicated than I would imagine it should be.
The alternative is to keep all of the data and let MySQL compute averages as and when needed. This will create a monstrously huge database, and I have no idea about the performance yet. The id is an int, time is a datetime and value is a float. Is 1051200 records too many? Now is a good time to add that I would like to run this on a Raspberry Pi; if that's not feasible, I do have my main machine which I could use.
Your proposed design looks OK. Perhaps there are more elegant ways of doing this, but your proposal should work too.
RRD (http://en.wikipedia.org/wiki/Round-Robin_Database) is a specialised database designed to do all of this automatically, and it should be much more performant than MySQL for this specialised purpose.
An alternative is the following: keep only the original table (1051200 records), but have a trigger that generates the last hour/day/year etc views every time a new record is added (e.g. every 30 seconds) and store/cache the result somewhere. Then your number-crunching workload is independent of the number of requests/clients you have to serve.
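A minimal sketch of that trigger-maintained cache, assuming the raw table is called readings with the Time and Value columns from the question (the cache table and trigger names are invented). Keeping a running sum and count avoids re-scanning the raw table on every insert:

CREATE TABLE hourly_cache (
  hour_start DATETIME PRIMARY KEY,
  sum_value  DOUBLE NOT NULL,
  cnt        INT NOT NULL
);

CREATE TRIGGER readings_after_insert AFTER INSERT ON readings
FOR EACH ROW
  INSERT INTO hourly_cache (hour_start, sum_value, cnt)
  VALUES (DATE_FORMAT(NEW.`Time`, '%Y-%m-%d %H:00:00'), NEW.`Value`, 1)
  ON DUPLICATE KEY UPDATE sum_value = sum_value + NEW.`Value`, cnt = cnt + 1;

-- The hourly average is then sum_value / cnt, with no scan of the raw table.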
1051200 records may or may not be too many. Test in your Raspberry Pi to find out.
Let me give a suggestion on the physical layout of your table, regardless of whether you decide to keep all data or "prune" it from time to time...
Since you generate a new row "every 30 seconds", Time can serve as a natural key without fear of exceeding the resolution of the underlying data type and causing duplicate keys. You don't need ID in this scenario (1), so your table is simply:
Time (PK)
Value
And since InnoDB tables are clustered, not having secondary indexes (2) means the whole table is stored in a single B-Tree, which is as efficient as it gets from a storage and querying perspective. On top of that, Value is automatically covered, which may not have been the case in your original design unless you specifically designed your index(es) for that.
Using time as key can be tricky in general, but I think may be worth it in this particular case.
(1) Unless there are other tables that reference it through FOREIGN KEYs, or you have already written too much code that depends on it.
(2) Which would be necessary in the original design to support efficient aggregation.
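For concreteness, the layout described above is just the following (the table name readings is assumed):

CREATE TABLE readings (
  `Time`  DATETIME NOT NULL,
  `Value` FLOAT NOT NULL,
  PRIMARY KEY (`Time`)
) ENGINE=InnoDB;

-- A view window is then a single clustered-index range scan that already covers Value:
SELECT `Time`, `Value` FROM readings WHERE `Time` >= NOW() - INTERVAL 1 HOUR;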

SQL, querying by date intervals

I've got a dataset that I want to be able to slice up by date interval. It's a bunch of scraped web data, and each item has a Unix-style millisecond timestamp as well as a standard UTC datetime.
I'd like to be able to query the dataset, picking out the rows that are closest to various time intervals:
e.g.: Every hour, once a day, once a week, etc.
There is no guarantee that the timestamps are going to fall evenly on the interval times, otherwise I'd just do a mod query on the timestamp.
Is there a way to do this with SQL commands that doesn't involve stored procs or some sort of pre-computed support tables?
I use the latest MariaDB.
EDIT:
The marked answer doesn't quite answer my specific question but it is a decent answer to the more generalized problem so I went ahead and marked it.
I was specifically looking for a way to query a set of data where the timestamp is highly variable and to grab out rows that are reasonably close to periodic time intervals. E.g.: get all the rows that are the closest to being on 24 hour intervals from right now.
I ended up using a modulus query to solve the problem: timestamp % interval < average spacing between data points. This occasionally grabs extra points and misses a few but was good enough for my graphing application.
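For reference, such a modulus query might look roughly like this; the table name scraped_data and column ts_ms are invented, and the 30000 ms tolerance is only a placeholder for the average spacing between your data points:

SELECT *
FROM scraped_data
WHERE ts_ms % (24 * 60 * 60 * 1000) < 30000;  -- roughly one row per 24-hour interval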
And then I got sick of the node-mysql library crashing all the time, so I moved to MongoDB.
You say you want 'closest to various time intervals' but then say 'every hour/day/week', so the actual implementation will depend on what you really want. You can, however, use a host of standard date/time functions to group records; for example, count by day:
SELECT DATE(your_DateTime) AS Dt, COUNT(something) AS CT
FROM yourTable
GROUP BY DATE(your_DateTime)
Count by Hour:
SELECT DATE(your_DateTime) AS Dt, HOUR(your_DateTime) AS Hr, COUNT(something) AS CT
FROM yourTable
GROUP BY DATE(your_DateTime), HOUR(your_DateTime)
See the full list of supported date and time functions here:
https://mariadb.com/kb/en/date-and-time-functions/

Good design for a DB that decreases data resolution over time

I have something like 20,000 data points in a database and I want to display them on the Google annotated graph. I think around 2,000 points would be a good number to actually plot, so I want to use averages instead of the full set of data points I have.
This data counts the frequency of something at a certain time. It would be like Table(frequency, datetime).
So for the first week the datetime will have an interval of every 10 minutes, and frequency will be the average of all the frequencies in that 10-minute interval. Similarly, for the month after that I will have a datetime interval of an hour, etc.
I think this is something you can see on google finance too, after some time the resolution of the datapoints decreases even when you zoom in.
So what would be a good design for this? Is there already a tool that exists to do something like this?
I already thought of (though it might not be good) a giant table of all 20,000 points and several smaller tables, one per time interval (1 week, 1 month etc.), built by queries against the larger table and constantly updated and trimmed with new averages.
Keep the raw data in the db in one table. Have a second reporting table which you populate from the raw table with a script or query. The transformation that populates the reporting table can group and average the buckets however you want. The important thing is to not transform your data on initial insert; keep all your raw data. That way you can always roll back or rebuild if you mess something up.
ETL. Learn it. Love it. Live it.
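A sketch of that raw-to-reporting transformation, assuming the raw table is raw_counts(frequency, dt) and the reporting table holds 10-minute buckets (all names here are invented):

CREATE TABLE report_10min (
  bucket_start  DATETIME PRIMARY KEY,
  avg_frequency DOUBLE NOT NULL
);

-- Re-run periodically (cron, or a MySQL EVENT) to refresh the reporting table.
INSERT INTO report_10min (bucket_start, avg_frequency)
SELECT FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(dt) / 600) * 600) AS bucket_start,
       AVG(frequency)
FROM raw_counts
WHERE dt >= NOW() - INTERVAL 7 DAY
GROUP BY bucket_start
ON DUPLICATE KEY UPDATE avg_frequency = VALUES(avg_frequency);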

Storing dates in train schedule MySQL

I have created a train schedule database in MySQL. There are several thousand routes for each day, but with a few exceptions most of the routes are the same for every working day and differ on weekends.
At the moment I basically update my SQL tables at midnight each day to get the departures for the next 24 hours. This is, however, very inconvenient, so I need a way to store dates in my tables so that I don't have to do this every day.
I tried to create a separate table where I stored dates for each route number (route numbers are reset each day), but this made my query so slow that it was impossible to use. Does this mean I would have to store my departure and arrival times as datetimes? In that case the main table containing routes would have several million entries.
Or is there another way?
My routetable looks like this:
StnCode (referenced in separate Station table)
DepTime
ArrTime
Routenumber
legNumber
How were you storing the dates? A single date/time field? That'd certainly be the most compact representation, but also the most difficult to index and scan, especially if you're doing queries of the following type:
SELECT ...
WHERE MONTH(DepTime) = 4 AND DAY(DepTime) = 19;
Such a construct would require a full table scan to tear apart each date field and extract the month/day. For such a case, it'd be better to denormalize a bit and split the datetime into separate year/month/day/hour/minute fields and place indexes on them. A bit more of a hassle to maintain, but it would also speed up querying by specific time parts immensely.
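A rough sketch of that split, assuming the route table is called route and using invented column names; the new columns have to be filled in when rows are inserted (or backfilled once):

ALTER TABLE route
  ADD COLUMN dep_month TINYINT UNSIGNED,
  ADD COLUMN dep_day   TINYINT UNSIGNED,
  ADD INDEX idx_dep_month_day (dep_month, dep_day);

-- The earlier query can then use the index instead of a full scan:
SELECT * FROM route WHERE dep_month = 4 AND dep_day = 19;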
Instead of storing schedules in terms of dates, you can store them against days of the week (Sun, Mon, Tue, etc.). This eliminates storing dates for the routes. You can treat the routes as predetermined, and thus fixed: the number of passenger trains is around 8,000, the days are fixed (7), and the routes (roughly 50-1000, e.g. 1, 1A) are as published in the railway books.
This avoids storing huge combinations of train schedules in the db, since every date is translated into one of the weekdays and we are not missing any data.
You can create a table for storing the days, which will have at most 7 rows.
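One possible sketch of the weekday approach, assuming the main route table is called route and joining on the Routenumber column from the question (the route_day table and its columns are invented):

CREATE TABLE route_day (
  Routenumber INT NOT NULL,
  day_of_week TINYINT NOT NULL,   -- 1 = Sunday ... 7 = Saturday, matching DAYOFWEEK()
  PRIMARY KEY (Routenumber, day_of_week)
);

-- Departures for a given calendar date:
SELECT r.*
FROM route r
JOIN route_day d ON d.Routenumber = r.Routenumber
WHERE d.day_of_week = DAYOFWEEK('2015-04-19');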
I would suggest modelling the database in such a way that each station is a touch point, not just a station id. You can also introduce a hub concept in the design to group the 3-4 stations that belong to the same city.
Each station is a touch point supported by facilities such as boarding point, halt point, etc., because not all stations are boarding points for all trains. Facilities are what is available at the different stations, and not all facilities are available for all trains.
For example, Kazipet is a station which is also a junction, but a few trains on a few routes pass through and halt at the station without allowing new passengers to board there, while allowing boarding on the reverse route.