I've got a dataset that I want to be able to slice up by date interval. It's a bunch of scraped web data, and each item has a Unix-style millisecond timestamp as well as a standard UTC datetime.
I'd like to be able to query the dataset, picking out the rows that are closest to various time intervals:
e.g.: Every hour, once a day, once a week, etc.
There is no guarantee that the timestamps are going to fall evenly on the interval times, otherwise I'd just do a mod query on the timestamp.
Is there a way to do this with SQL commands that doesn't involve stored procs or some sort of pre-computed support tables?
I use the latest MariaDB.
EDIT:
The marked answer doesn't quite answer my specific question but it is a decent answer to the more generalized problem so I went ahead and marked it.
I was specifically looking for a way to query a set of data where the timestamp is highly variable and to grab out rows that are reasonably close to periodic time intervals. E.g.: get all the rows that are the closest to being on 24 hour intervals from right now.
I ended up using a modulus query to solve the problem: timestamp % interval < average spacing between data points. This occasionally grabs extra points and misses a few but was good enough for my graphing application.
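Roughly, the query looked like this (the table name, column name, and numbers here are just illustrative; this is the 24-hour case with an average spacing of about 5 minutes between rows):

SELECT *
FROM scraped_items
WHERE ts_ms % (24 * 60 * 60 * 1000) < 5 * 60 * 1000;  -- interval in ms; threshold ≈ average gap between points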
And then I got sick of the node-mysql library crashing all the time, so I moved to MongoDB.
You say you want 'closest to various time intervals' but then say 'every hour/day/week', so the actual implementation will depend on what you really want. But you can use a host of standard date/time functions to group records. For example, count by day:
SELECT DATE(your_DateTime) AS Dt, COUNT(something) AS CT
FROM yourTable
GROUP BY DATE(your_DateTime)
Count by Hour:
SELECT DATE(your_DateTime) AS Dt,HOUR(your_DateTime) AS Hr, COUNT(something) AS CT
FROM yourTable
GROUP BY DATE(your_DateTime), HOUR(your_DateTime)
See the full list of supported date and time functions here:
https://mariadb.com/kb/en/date-and-time-functions/
I have an attribute in MYSQL database called "dueDate". I want to update the record on the due date at 11:59PM.
Is there a way to create an event or cronjob that could do that?
You can set an event, or cronjob, for any particular time, if your system administration allows you to use either facility. It's easy to look up how to create a MySQL repeating event.
But this is a brittle way of dealing with time dependencies in your business rules. What do I mean by "brittle"? For one thing, if something goes wrong and the job doesn't run, your business rules are fouled up and need to be repaired. For another, cronjobs and events don't run at precise times of day; they run on or after that time of day, and they can take a while to start.
So, I suggest you use rules in a query to enforce your business rules. Suppose, for example, your original desire is to set a column called is_overdue to 1 at the end of the due date. Instead, use a query like this to compute your is_overdue column.
SELECT whatever, dueDate,
IF(NOW() >= dueDate + INTERVAL 1 DAY, 1, 0) AS is_overdue
FROM table ...
This has the advantage that it will always be correct, down to the millisecond, and won't depend on the running of a brittle background job.
Events and cronjobs are better used for purging of stale records. For example, you can get rid of any records that have been expired for 30 days or more by using this kind of query in them.
DELETE FROM table WHERE dueDate <= CURDATE() - INTERVAL 30 DAY
If your cronjob / event fails to run on a particular day, the next day's run will still do the cleanup correctly.
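For instance, a sketch of a nightly purge event (the event and table names are assumed, and the event scheduler must be enabled with event_scheduler = ON):

CREATE EVENT purge_expired_rows
  ON SCHEDULE EVERY 1 DAY
  STARTS CURRENT_DATE + INTERVAL 1 DAY   -- first run at the next midnight
DO
  DELETE FROM your_table WHERE dueDate <= CURDATE() - INTERVAL 30 DAY;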
Edit. The point of this suggestion is to compute the time-dependent column (is_overdue in the example). If you follow this suggestion, you won't update the table at all. Instead, you'll use the suggested query whenever you retrieve the is_overdue value.
Pro tip. When you want to do something at a time <= the last moment of a particular day, you're better off doing it at a time < the first moment of the next day. That is, for best results use
WHERE dueDate < '2017-11-17' + INTERVAL 1 DAY
in place of
WHERE dueDate <= '2017-11-17 23:59:59'
Why? The last moment of a day is hard to express precisely, especially if your system's timing uses subsecond precision. But the first moment of the next day is easy to express precisely.
In order to analyze dates and times I am creating a MySQL table where I want to keep the time information. Some example analyses will be stuff like:
Items per day/week/month/year
Items per weekday
Items per hour
etc.
Now, with regard to performance, which way should I record the time in my data table:
date type: Unix timestamp?
date type: datetime?
or keep each date component in its own column, e.g. year, month, and day in separate fields?
The last one, for example, would be handy if I'm analysing by weekday; I wouldn't have to perform WEEKDAY(item.date) on MySQL but could simply use WHERE item.weekday = :w.
Based on your usage, you want to use the native datetime format. Unix formats are most useful when the major operations are (1) ordering; (2) taking differences in seconds/minutes/hours/days; and (3) adding seconds/minutes/hours/days. They need to be converted to internal date time formats to get the month or week day, for instance.
You also have a potential indexing issue. If you want to select ranges of days, hours, months and so on for your results, then you want an index on the column. For this purpose an index on a datetime is probably sufficient.
If the summaries are by hour, you might find it helpful to store the date component in a date field and the hour in a separate column. That would be particularly helpful if you are combining hours from different days.
Whether you break out other components of the date, such as weekday and month, for indexing purposes would depend on the volume of data in the table, performance requirements, and the queries you are planning on running. I would not be inclined to do this, except as a later optimization.
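To make the indexing point concrete, here is a sketch (table and column names are assumed): a plain index on the datetime column is enough to support range queries like this one.

CREATE INDEX idx_item_created ON item (created_at);

SELECT DATE(created_at) AS d, COUNT(*) AS items
FROM item
WHERE created_at >= '2015-04-01'
  AND created_at < '2015-05-01'
GROUP BY DATE(created_at);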
The rule of thumb is: store things as they should be stored, and don't do performance tweaks until you're actually hitting a bottleneck. If you store your date as separate fields, you'll eventually stumble upon a situation where you need the date as a whole inside your database (e.g. an update query for a particular range of time), and that will be hell: a condition covering 3 April 2015 through 15 May 2015 would be enormous.
You should keep your dates in a native date/datetime type. This will give you maximum flexibility, (most probably) query readability, and will keep all of your options for working with them open. The only thing I can really recommend is storing the same date broken out into year/month/day in additional columns; of course, this will bloat your database and require extreme caution in update scenarios, but it will let you use either form of the source data in your queries.
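If you do go that route, generated columns (available in MySQL 5.7+ and recent MariaDB) are one way to keep the broken-out parts from drifting out of sync with the real datetime; a sketch with assumed names:

CREATE TABLE item (
  id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  created_at DATETIME NOT NULL,
  created_y  SMALLINT AS (YEAR(created_at)) STORED,   -- derived, never set by hand
  created_m  TINYINT  AS (MONTH(created_at)) STORED,
  created_d  TINYINT  AS (DAY(created_at)) STORED,
  KEY idx_created_at (created_at)
);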
I'm implementing a MySQL database for saving logged energy consumption data from a smart home application. The data should then be plotted within a JavaScript framework. Unfortunately the usage gets logged every 8 seconds, and there's too much information to plot a year-long consumption graph.
The data gets saved in a simple table by its time, device id and consumption at that specific time.
I'm hoping to be able to automatically aggregate the given data into minute, hour and finally daily average values.
After some research I came across some queries/procedures to calculate average values of specific intervals. Unfortunately this isn't much help to me as I have data over a period of three years and I don't want to create the given intervals by hand.
Ideally the procedure in mysql should be able to aggregate the given device values by it's time and calculate an average value and save it in a separate table.
Does anyone have an idea how I could implement it?
select avg(consumption) minute_average, date_format(log_date,'%m/%d/%y %H:%i') minute from data
group by date_format(log_date,'%m/%d/%y %H:%i');
select avg(consumption) hour_average, date_format(log_date,'%m/%d/%y %H') hour from data
group by date_format(log_date,'%m/%d/%y %H');
select avg(consumption) day_average, date_format(log_date,'%m/%d/%y') day from data
group by date_format(log_date,'%m/%d/%y');
note: you could just as easily calculate any aggregate like sum or standard deviation as well.
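If you want to materialize one of these into a separate table, as asked, something like this sketch would do it (the summary table name is assumed; rerunning it refreshes existing rows):

CREATE TABLE IF NOT EXISTS hourly_average (
  hour_start DATETIME PRIMARY KEY,
  avg_consumption DOUBLE
);

INSERT INTO hourly_average (hour_start, avg_consumption)
SELECT DATE_FORMAT(log_date, '%Y-%m-%d %H:00:00') AS hour_start, AVG(consumption)
FROM data
GROUP BY hour_start
ON DUPLICATE KEY UPDATE avg_consumption = VALUES(avg_consumption);

Add the device id to the key and the GROUP BY if you need per-device averages.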
There is some value, x, which I am recording every 30 seconds, currently into a database with three fields:
ID
Time
Value
I am then creating a mobile app which will use that data to plot charts in views of:
Last hour
Last 24 hours.
7 Day
30 Day
Year
Obviously, saving every 30 seconds for the last year and then sending that data to a mobile device will be too much (it would mean sending 1051200 values).
My second thought was perhaps I could use the average function in MySQL, for example, collect averages for every 7 days (creating 52 points for a year), and send those points. This would work, but MySQL would still be trawling through the data computing averages, and if many users connect, that's going to be bad.
So simply put, if these are my views, then I do not need to keep track of all that data. Nobody should care what x was a year ago to the precision of every 30 seconds, so this is fine. I should be able to use "triggers" to create some averages.
I'm looking for someone to check what I have below is reasonable:
Store values every 30s in a table (this will be used for the hour view, 120 points)
When there are 120 rows in the 30s table (120 * 30s = 60 mins = 1 hour), use a trigger to store the first half hour in a "half hour average" table and remove the first 60 entries from the 30s table. This new table will need an id, start time, end time and value. This half hour average will be used for the 24 hour view (48 data points).
When the half hour table has more than 24 entries (12 hours), store the first 6 hours (12 entries) as a single average in a 6 hour average table and remove them. This 6 hour average will be used for the 7 day view (28 data points).
When there are 8 entries in the 6 hour table, remove the first 4 and store them as an average day, to be used in the 30 day view (30 data points).
When there are 14 entries in the day table, remove the first 7 and store them as a week in a week table; this will be used for the year view.
This doesn't seem like the best way to me, as it seems to be more complicated than I would imagine it should be.
The alternative is to keep all of the data and let MySQL find averages as and when needed. This would create a monstrously huge database, and I have no idea about the performance yet. The id is an int, time is a datetime, and value is a float. Is 1051200 records too many? Now is a good time to add that I would like to run this on a Raspberry Pi, but if that's not feasible, I do have my main machine which I could use.
Your proposed design looks OK. Perhaps there are more elegant ways of doing this, but your proposal should work too.
RRD (http://en.wikipedia.org/wiki/Round-Robin_Database) is a specialised database designed to do all of this automatically, and it should be much more performant than MySQL for this specialised purpose.
An alternative is the following: keep only the original table (1051200 records), but have a trigger that generates the last hour/day/year etc views every time a new record is added (e.g. every 30 seconds) and store/cache the result somewhere. Then your number-crunching workload is independent of the number of requests/clients you have to serve.
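A rough sketch of that idea for an hourly view (all table and column names are assumed): instead of re-scanning the raw table on every insert, keep a running sum and count per hour, so the average is always ready to read.

CREATE TABLE hour_cache (
  hour_start DATETIME PRIMARY KEY,
  total DOUBLE NOT NULL,
  n INT NOT NULL
);

CREATE TRIGGER cache_hourly AFTER INSERT ON samples
FOR EACH ROW
  INSERT INTO hour_cache (hour_start, total, n)
  VALUES (DATE_FORMAT(NEW.sample_time, '%Y-%m-%d %H:00:00'), NEW.sample_value, 1)
  ON DUPLICATE KEY UPDATE total = total + NEW.sample_value, n = n + 1;

The hourly averages are then just SELECT hour_start, total / n FROM hour_cache.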
1051200 records may or may not be too many. Test on your Raspberry Pi to find out.
Let me give a suggestion on the physical layout of your table, regardless on whether you decide to keep all data or "prune" it from time to time...
Since you generate a new row "every 30 seconds", Time can serve as a natural key without fear of exceeding the resolution of the underlying data type and causing duplicate keys. You don't need ID in this scenario [1], so your table is simply:
Time (PK)
Value
And since InnoDB tables are clustered, not having secondary indexes [2] means the whole table is stored in a single B-Tree, which is as efficient as it gets from a storage and querying perspective. On top of that, Value is automatically covered, which may not have been the case in your original design unless you specifically designed your index(es) for that.
Using time as a key can be tricky in general, but I think it may be worth it in this particular case.
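A sketch of what that layout looks like in DDL (table and column names assumed):

CREATE TABLE readings (
  reading_time  DATETIME NOT NULL,
  reading_value FLOAT NOT NULL,
  PRIMARY KEY (reading_time)
) ENGINE=InnoDB;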
[1] Unless there are other tables that reference it through FOREIGN KEYs, or you have already written too much code that depends on it.
[2] Which would be necessary in the original design to support efficient aggregation.
I have something like 20,000 data points in a database and I want to display it on the google annotated graph. I think around 2000 points would be a good number to actually use the graph for, so I want to use averages instead of the real amount of data points I have.
This data counts the frequency of something at a certain time. It would be like Table(frequency, datetime).
So for the first week I will have datetime at an interval of every 10 minutes, and frequency will be an average of all the frequencies in that 10-minute interval. Similarly, for the month after that I will have a datetime interval of an hour, etc.
I think this is something you can see on google finance too, after some time the resolution of the datapoints decreases even when you zoom in.
So what would be a good design for this? Is there already a tool that exists to do something like this?
I already thought of (though it might not be good) a giant table of all 20,000 points and several smaller tables that represent each time interval (1 week, 1 month etc) that are built through queries to the larger table and constantly updated and trimmed with new averages.
Keep the raw data in the db in one table. Have a second reporting table which you populate from the raw table with a script or query. The transformation that populates the reporting table can group and average the buckets however you want. The important thing is to not transform your data on initial insert; keep all your raw data. That way you can always roll back or rebuild if you mess something up.
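For example, the first-week rollup described in the question could be populated with something like this sketch (table and column names are assumed): 10-minute buckets, each holding the averaged frequency.

INSERT INTO report_10min (bucket_start, avg_frequency)
SELECT FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(event_time) / 600) * 600) AS bucket_start,  -- 600 s = 10 min
       AVG(frequency)
FROM raw_points
GROUP BY bucket_start;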
ETL. Learn it. Love it. Live it.