Query database in weekly interval - mysql

I have a database with a created_at column containing the datetime in Y-m-d H:i:s format.
The latest datetime entry is 2011-09-28 00:10:02.
I need the query to be relative to the latest datetime entry.
The first value in the query should be the latest datetime entry.
The second value in the query should be the entry closest to 7 days from the first value.
The third value should be the entry closest to 7 days from the second value.
REPEAT #3.
What I mean by "closest to 7 days from":
The following are dates, the interval I desire is a week, in seconds a week is 604800 seconds.
7 days from the first value is equal to 1316578202 (1317183002-604800)
the value closest to 1316578202 (7 days) is... 1316571974
unix timestamp | Y-m-d H:i:s
1317183002 | 2011-09-28 00:10:02 -> appear in query (first value)
1317101233 | 2011-09-27 01:27:13
1317009182 | 2011-09-25 23:53:02
1316916554 | 2011-09-24 22:09:14
1316836656 | 2011-09-23 23:57:36
1316745220 | 2011-09-22 22:33:40
1316659915 | 2011-09-21 22:51:55
1316571974 | 2011-09-20 22:26:14 -> closest to 7 days from 1317183002 (first value)
1316499187 | 2011-09-20 02:13:07
1316064243 | 2011-09-15 01:24:03
1315967707 | 2011-09-13 22:35:07 -> closest to 7 days from 1316571974 (second value)
1315881414 | 2011-09-12 22:36:54
1315794048 | 2011-09-11 22:20:48
1315715786 | 2011-09-11 00:36:26
1315622142 | 2011-09-09 22:35:42
I would really appreciate any help, I have not been able to do this via mysql and no online resources seem to deal with relative date manipulation such as this. I would like the query to be modular enough to be able to change the interval weekly, monthly, or yearly. Thanks in advance!
Answer #1 Reply:
SELECT
UNIX_TIMESTAMP(created_at)
AS unix_timestamp,
(
SELECT MIN(UNIX_TIMESTAMP(created_at))
FROM my_table
WHERE created_at >=
(
SELECT max(created_at) - 7
FROM my_table
)
)
AS `random_1`,
(
SELECT MIN(UNIX_TIMESTAMP(created_at))
FROM my_table
WHERE created_at >=
(
SELECT MAX(created_at) - 14
FROM my_table
)
)
AS `random_2`
FROM my_table
WHERE created_at =
(
SELECT MAX(created_at)
FROM my_table
)
Returns:
unix_timestamp | random_1 | random_2
1317183002 | 1317183002 | 1317183002
Answer #2 Reply:
RESULT SET:
This is the result set for a yearly interval:
id | created_at | period_index | period_timestamp
267 | 2010-09-27 22:57:05 | 0 | 1317183002
1 | 2009-12-10 15:08:00 | 1 | 1285554786
I desire this result:
id | created_at | period_index | period_timestamp
626 | 2011-09-28 00:10:02 | 0 | 0
267 | 2010-09-27 22:57:05 | 1 | 1317183002
I hope this makes more sense.

It's not exactly what you asked for, but the following example is pretty close....
Example 1:
select
floor(timestampdiff(SECOND, tbl.time, most_recent.time)/604800) as period_index,
unix_timestamp(max(tbl.time)) as period_timestamp
from
tbl
, (select max(time) as time from tbl) most_recent
group by period_index
gives results:
+--------------+------------------+
| period_index | period_timestamp |
+--------------+------------------+
| 0 | 1317183002 |
| 1 | 1316571974 |
| 2 | 1315967707 |
+--------------+------------------+
This breaks the dataset into groups based on "periods", where (in this example) each period is 7-days (604800 seconds) long. The period_timestamp that is returned for each period is the 'latest' (most recent) timestamp that falls within that period.
The period boundaries are all computed based on the most recent timestamp in the database, rather than computing each period's start and end time individually based on the timestamp of the period before it. The difference is subtle - your question requests the latter (iterative approach), but I'm hoping that the former (approach I've described here) will suffice for your needs, since SQL doesn't lend itself well to implementing iterative algorithms.
If you really do need to determine each period based on the timestamp in the previous period, then your best bet is going to be an iterative approach -- either using a programming language of your choice (like php), or by building a stored procedure that uses a cursor.
Edit #1
Here's the table structure for the above example.
CREATE TABLE `tbl` (
`id` int(10) unsigned NOT NULL auto_increment PRIMARY KEY,
`time` datetime NOT NULL
)
Edit #2
Ok, first: I've improved the original example query (see revised "Example 1" above). It still works the same way, and gives the same results, but it's cleaner, more efficient, and easier to understand.
Now... the query above is a group-by query, meaning it shows aggregate results for the "period" groups as I described above - not row-by-row results like a "normal" query. With a group-by query, you're limited to using aggregate columns only. Aggregate columns are those columns that are named in the group by clause, or that are computed by an aggregate function like MAX(time)). It is not possible to extract meaningful values for non-aggregate columns (like id) from within the projection of a group-by query.
Unfortunately, mysql doesn't generate an error when you try to do this. Instead, it just picks a value at random from within the grouped rows, and shows that value for the non-aggregate column in the grouped result. This is what's causing the odd behavior the OP reported when trying to use the code from Example #1.
Fortunately, this problem is fairly easy to solve. Just wrap another query around the group query, to select the row-by-row information you're interested in...
Example 2:
SELECT
entries.id,
entries.time,
periods.idx as period_index,
unix_timestamp(periods.time) as period_timestamp
FROM
tbl entries
JOIN
(select
floor(timestampdiff( SECOND, tbl.time, most_recent.time)/31536000) as idx,
max(tbl.time) as time
from
tbl
, (select max(time) as time from tbl) most_recent
group by idx
) periods
ON entries.time = periods.time
Result:
+-----+---------------------+--------------+------------------+
| id | time | period_index | period_timestamp |
+-----+---------------------+--------------+------------------+
| 598 | 2011-09-28 04:10:02 | 0 | 1317183002 |
| 996 | 2010-09-27 22:57:05 | 1 | 1285628225 |
+-----+---------------------+--------------+------------------+
Notes:
Example 2 uses a period length of 31536000 seconds (365-days). While Example 1 (above) uses a period of 604800 seconds (7-days). Other than that, the inner query in Example 2 is the same as the primary query shown in Example 1.
If a matching period_time belongs to more than one entry (i.e. two or more entries have the exact same time, and that time matches one of the selected period_time values), then the above query (Example 2) will include multiple rows for the given period timestamp (one for each match). Whatever code consumes this result set should be prepared to handle such an edge case.
It's also worth noting that these queries will perform much, much better if you define an index on your datetime column. For my example schema, that would look like this:
ALTER TABLE tbl ADD INDEX idx_time ( time )

If you're willing to go for the closest that is after the week is out then this'll work. You can extend it to work out the closest but it'll look so disgusting it's probably not worth it.
select unix_timestamp
, ( select min(unix_tstamp)
from my_table
where sql_tstamp >= ( select max(sql_tstamp) - 7
from my_table )
)
, ( select min(unix_tstamp)
from my_table
where sql_tstamp >= ( select max(sql_tstamp) - 14
from my_table )
)
from my_table
where sql_tstamp = ( select max(sql_tstamp)
from my_table )

Related

Is it possible in MySQL to find the Min/Max but remove outliers first?

I have a table that holds scan datetime values. I am wanting to find the start and stop scan time of the users from the main portion of scanning. The issue is that a user may perform some checks before or after the bulk of the scanning and generate a few more scans. The data might look as below.
....
| 2020-04-01 19:48:05 |
| 2020-04-01 19:48:22 |
| 2020-04-01 19:48:23 |
| 2020-04-01 19:48:48 |
| 2020-04-01 19:48:49 |
| 2020-04-01 20:45:33 |
+---------------------+
If I group by the date and grab the min/max of these values my time elapsed will be much large than the actual. In the case above the max would add almost 1 hour of extra time, which was not really spent scanning.
SELECT date, MIN(datetime), MAX(datetime) FROM table GROUP BY date
There might be 1 extra scan or there might be several scans at the beginning or the end of the data so throwing out the first and last data points is not really an option.
Hmmm . . . I think this is a gap and islands problem. You need some definition of when an outlier occurs. Say it is 5 minutes:
select min(datetime), max(datetime), count(*) as num_scans
from (select t.*,
sum(case when prev_datetime > datetime - interval 5 minute then 0 else 1 end) over (order by datetime) as grp
from (select t.*,
lag(datetime) over (order by datetime) as prev_datetime
from t
) t
) t
group by grp;
I'm not sure how you distinguish actual scans from the outliers. Perhaps if there is more than one row or so. If that is the case, you can remove the outliers with logic such as having count(*) > 1.

Find the period where the number of occurrences is the highest

Given a table "events_log" in this form :
| id | started_at | duration |
| 1 | 2017-06-01 09:00:00 | 80 |
| 1 | 2017-06-01 09:01:00 | 40 |
| 1 | 2017-06-01 09:01:23 | 20 |
I want to know when the most events were occuring (with a minute precision) :
|period |count|
| 2017-06-01 09:00:00 | 1 |
| 2017-06-01 09:01:00 | 3 |
In reality, there a millions of events to handle.
My solution is to :
Create a temporary table with event start grouped by minute
LEFT JOINing it with the events between each period
See http://sqlfiddle.com/#!9/8546a/1
But performance is terrible ...
Is there a better way to do it ?
I would think group by, something like this:
select date_format(started_at, '%Y-%m-%d %h:%i') as yyyymmddhhmi, count(*)
from t
group by yyyymmddhhmi
order by count(*) desc
limit 10;
Performance will not be great.
Here is modified version of your code. It will scan through events_log table twice. Once when building event_starts helper table and second time when selecting all events that are happening in specified interval. Also note added index, that will significantly speed up execution. It might be also reason why your original query was so slow.
CREATE TABLE events_log (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,started_at DATETIME,duration INT(11));
INSERT INTO events_log (started_at, duration) VALUES ('2017-06-01 09:00:00', 80);
INSERT INTO events_log (started_at, duration) VALUES ('2017-06-01 09:01:00', 40);
INSERT INTO events_log (started_at, duration) VALUES ('2017-06-01 09:01:23', 20);
CREATE /* TEMPORARY */ TABLE tmp_event_starts AS (
select DISTINCT DATE_ADD(started_at, INTERVAL -SECOND(started_at) SECOND) AS period_start
from events_log
);
create index idx_tmp_event_starts
on tmp_event_starts (period_start);
select period_start, count(*), group_concat(id) from events_log as log
join tmp_event_starts as per
on per.period_start >= DATE_ADD(started_at, INTERVAL -SECOND(started_at) SECOND)
and per.period_start <= DATE_ADD(started_at, INTERVAL -SECOND(started_at)+duration SECOND)
group by period_start
;
If you have lot of events happening in same minute and there are not minutes without events, then you might consider generating helper table as a sequence of minutes independent of data. In MySql it is quite uneasy task, but some hints can be found in this blog post Calendar Tables: An Invaluable Database Tool
It will also allow for generating helper table in advance and thus speeding up execution of query itself significantly.
You might also consider adding ended_at columnt to your event_log table, which will eliminate need for conversion during query execution.

SQL calculating difference between columns

I'm a bit of a newby at SQL and I don't really understand what to do here, so any help is really appreciated. I have a table full of readings from different readers, there's like 500.000 of them, so I can't do this by hand.
I received the table without the difference in it. I managed to calculate it, but there's a bit of a problem there...
It looks a bit like this:
reader_id | date | reading | difference
1 | 01-01-2013 | 205 | 0
1 | 02-01-2013 | 210 | 5
1 | 03-01-2013 | 213 | 3
... | ... | ... | ...
1 | 31-12-2013 | 2451 | 4
2 | 01-01-2013 | 8543 | 6092
2 | 02-01-2013 | 8548 | 5
reader_id and date form the primary key. The combination is unique.
How can I make sure I don't get the difference calculated when the last column contained a different reader_id?
When querying my data with a query like this one, the data get skewed by the incorrect difference between the two reader_ids:
SELECT AVG(difference), reader_id FROM table GROUP BY reader_id
For
I just want to get the average difference for each reader.
your query is perfectly good. I think you got something wrong in your difference calculation. The first value for reader_id=2, 6092, is the difference of the last reading from reader1 and the first reading from reader 2, i don't think that makes sense. If i'm not mistaken, the difference value is the current day reading - previous day reading. Therefore you should set the difference value of the first reading of each reader to 0.
You can do this with the following query:
UPDATE table t INNER JOIN (SELECT reader_id, min(date) as first_day FROM table GROUP BY reader_id) as tmp ON tmp.reader_id=t.reader_id AND tmp.first_day=t.date SET t.difference=0
Then
SELECT AVG(difference), reader_id FROM table GROUP BY reader_id
will do what you expect.
If you simply want the average difference, you can use the following query:
SELECT
meter_id,
MAX(reading) - MIN(reading) / COUNT(*) average_difference
FROM table
GROUP BY meter_id
ORDER BY meter_id;
It works on the logic that the the total difference for a given meter_id should be equal to MAX(reading) - MIN(reading).

get the SUM between two given dates

if i want to get the total_consumption over a range of dates, how would i do that?
I thought i could do:
SELECT id, SUM(consumption)
FROM consumption_info
WHERE date_time BETWEEN 2013-09-15 AND 2013-09-16
GROUP BY id;
however this returns: Empty set, 2 warnings(0.00 sec)
---------------------------------------
id | consumption | date_time |
=======================================|
1 | 5 | 2013-09-15 21:35:03 |
2 | 5 | 2013-09-15 24:35:03 |
3 | 7 | 2013-09-16 11:25:23 |
4 | 3 | 2013-09-16 20:15:23 |
----------------------------------------
any ideas what i'm doing wrong here?
thanks in advance
You're missing quotes around the date strings: the WHERE clause should actually be written as...
BETWEEN '2013-09-15' AND '2013-09-16'
The irony is that 2013-09-15 is a valid SQL expression - it means 2013 minus 09 minus 15. Obviously, there's no date lying in between the corresponding results; hence an empty set in return
Yet there might be another, more subtle error here: you probably should have used this clause...
BETWEEN '2013-09-15 00:00:00' AND '2013-09-16 23:59:59'
... instead. Without setting the time explicitly it'll be set to '00:00:00' on both dates (as DATETIME values are compared here).
While it's obviously ok for the starting date, it's not so for the ending one - unless, of course, exclusion of all the records for any time of that day but midnight is actually the desired outcome.
SELECT SUM(consumption)
FROM consumption_info
WHERE date_time >= 2013-09-15 AND date_time <= 2013-09-16;
or
SELECT SUM(consumption)
FROM consumption_info
WHERE date_time BETWEEN 2013-09-15 AND 2013-09-16;
Its better to use CAST when comparing the date function.
SELECT id, SUM(consumption)
FROM consumption_info
WHERE date_time
BETWEEN CAST('2013-09-15' AS DATETIME)
AND CAST('2013-09-16' AS DATETIME)
GROUP BY id;

How can I get the difference between the individual maximum values of different days?

I am new in MySQL, I am trying to find:
The difference between a given day's maximum value occurred and the previous day's maximum value.
I was able to get the maximum values for dates via:
select max(`bundle_count`), `Production_date`
from `table`
group by `Production_date`
But I don't know how to use SQL to calculate the differences between maximums for two given dates.
am expecting output like this
Please help me.
Update 1: Here is a fiddle, http://sqlfiddle.com/#!2/818ad/2, that I used for testing.
Update 2: Here is a fiddle, http://sqlfiddle.com/#!2/3f78d/10 that I used for further refining/fixing, based on Sandy's comments.
Update 3: For some reason the case where there is no previous day was not being dealt with correctly. I thought it was. However, I've updated to make sure that works (a bit cumbersome--but it appears to be right. Last fiddle: http://sqlfiddle.com/#!2/3f78d/45
I think #Grijesh conceptually got you the main thing you needed via the self-join of the input data (so make sure you vote up his answer!). I've cleaned up his query a bit on syntax (building off of his query!):
SELECT
DATE(t1.`Production_date`) as theDate,
MAX( t1.`bundle_count` ) AS 'max(bundle_count)',
MAX( t1.`bundle_count` ) -
IF(
EXISTS
(
SELECT date(t2.production_date)
FROM input_example t2
WHERE t2.machine_no = 1 AND
date_sub(date(t1.production_date), interval 1 day) = date(t2.production_date)
),
(
SELECT MAX(t3.bundle_count)
FROM input_example t3
WHERE t3.machine_no = 1 AND
date_sub(date(t1.production_date), interval 1 day) = date(t3.production_date)
GROUP BY DATE(t3.production_date)
), 0
)
AS Total_Bundles_Used
FROM `input_example` t1
WHERE t1.machine_no = 1
GROUP BY DATE( t1.`production_date` )
Note 1: I think #Grijesh and I were cleaning up the query syntax issues at the same time. It's encouraging that we ended up with very similar versions after we were both doing cleanup. My version differs in using IFNULL() for when there is no preceding data. I also ended up with a DATE_SUB, and I made sure to reduce various dates to mere dates without time component, via DATE()
Note 2: I originally had not fully understood your source tables, so I thought I needed to implement a running count in the query. But upon better inspection, it's clear that your source data already has a running count, so I took that stuff back out.
I am not sure but you need something like this, Hope it will be helpful to you upto some extend:
Try this:
SELECT t1.`Production_date` ,
MAX(t1.`bundle_count`) - MAX(t2.`bundle_count`) ,
COUNT(t1.`bundle_count`)
FROM `table_name` AS t1
INNER JOIN `table_name` AS t2
ON ABS(DATEDIFF(t1.`Production_date` , t2.`Production_date`)) = 1
GROUP BY t1.`Production_date`
EDIT
I create a table name = 'table_name', as below,
mysql> SELECT * FROM `table_name`;
+---------------------+--------------+
| Production_date | bundle_count |
+---------------------+--------------+
| 2004-12-01 20:37:22 | 1 |
| 2004-12-01 20:37:22 | 2 |
| 2004-12-01 20:37:22 | 3 |
| 2004-12-02 20:37:22 | 2 |
| 2004-12-02 20:37:22 | 5 |
| 2004-12-02 20:37:22 | 7 |
| 2004-12-03 20:37:22 | 6 |
| 2004-12-03 20:37:22 | 7 |
| 2004-12-03 20:37:22 | 2 |
| 2004-12-04 20:37:22 | 1 |
| 2004-12-04 20:37:22 | 9 |
+---------------------+--------------+
11 rows in set (0.00 sec)
My query: to find difference in bundle_count between two consecutive dates:
SELECT t1.`Production_date` ,
MAX(t2.`bundle_count`) - MAX(t1.`bundle_count`) ,
COUNT(t1.`bundle_count`)
FROM `table_name` AS t1
INNER JOIN `table_name` AS t2
ON ABS(DATEDIFF(t1.`Production_date` , t2.`Production_date`)) = 1
GROUP BY t1.Production_date;
its output:
+---------------------+-------------------------------------------------+--------------------------+
| Production_date | MAX(t2.`bundle_count`) - MAX(t1.`bundle_count`) | COUNT(t1.`bundle_count`) |
+---------------------+-------------------------------------------------+--------------------------+
| 2004-12-01 20:37:22 | 4 | 9 |
| 2004-12-02 20:37:22 | 0 | 18 |
| 2004-12-03 20:37:22 | 2 | 15 |
| 2004-12-04 20:37:22 | -2 | 6 |
+---------------------+-------------------------------------------------+--------------------------+
4 rows in set (0.00 sec)
This is PostgreSQL syntax (sorry; it's what I'm familiar with) but should fundamentally work in either database. Note this doesn't exactly run in PostgreSQL either because group is not a valid table name (it's a reserved keyword). The approach is a self-join as others have mentioned but I've used a view to handle the max-by-day and the difference as separate steps.
create view max_by_day as
select
date_trunc('day', production_date) as production_date,
max(bundle_count) as bundle_count
from
group
group by
date_trunc('day', production_date);
select
today.production_date as production_date,
today.bundle_count,
today.bundle_count - coalesce(yesterday.bundle_count, 0)
from
max_by_day as today
left join max_by_day yesterday on (yesterday.production_date = today.production_date - '1 day'::interval)
order by
production_date;
PostgreSQL also has a construct called window functions which is useful for this and a bit easier to understand. Just had to stick in a bit of advocacy for a superior database. :-P
select
date_trunc('day', production_date),
max(bundle_count),
max(bundle_count) - lag(max(bundle_count), 1, 0)
over
(order by date_trunc('day', production_date))
from
group
group by
date_trunc('day', production_date);
These two approaches differ in how they handle missing days in the data - the first will treat it as a 0, the second will use the previous day which is present. There wasn't a case like this in your sample so I don't know if this is something you care about.