MySQL window function with parameter based frame size

I come from MS SQL Server and I'm relatively new to MySQL / MariaDB 10 (at least in a deeper way than just "SELECT * FROM [Table]"). I have searched for several hours on Google and StackOverflow, but I haven't found a solution to my problem yet. If it's relevant in any way: I use MySQL Workbench for writing my code.
The Background
I have a new data logging project for saving and displaying data from several temperature and humidity sensors within the house. I save it in following table:
ID  Time                 Device  Temperature  Humidity
1   2022-01-09 13:34:00  1       20.1         52.3
2   2022-01-09 13:35:00  1       20.0         52.3
3   2022-01-09 13:36:00  1       20.1         52.4
4   2022-01-09 13:37:00  1       20.1         52.5
5   2022-01-09 13:38:00  1       20.0         52.5
6   2022-01-09 13:39:00  1       20.1         52.6
I query the needed data for a chart using a stored procedure. The temperature values are rounded to 0.1°C, which has the disadvantage that they naturally flip back and forth by 0.1 even when the temperature is pretty stable. So I thought of a moving average to smooth the values over the last 10 minutes, which works perfectly with an average window function.
Here is a simplified version of my procedure:
CREATE PROCEDURE `stpGetSensorData`(sensorId INT, startDate VARCHAR(8))
BEGIN
DECLARE FromDate DATE;
SET FromDate = STR_TO_DATE(startDate, '%Y%m%d');
SELECT
L.ID,
L.Time,
L.Device,
AVG(L.Temperature) OVER (ORDER BY L.Time ROWS BETWEEN 10 PRECEDING AND 0 FOLLOWING) AS Temperature,
AVG(L.Humidity) OVER (ORDER BY L.Time ROWS BETWEEN 10 PRECEDING AND 0 FOLLOWING) AS Humidity
FROM
LoggedData AS L
WHERE
Device = sensorId
AND Time < DATE_ADD(FromDate, INTERVAL 1 DAY)
AND Time >= FromDate
ORDER BY Time DESC;
END
The challenge
Now I would like to let the end user decide the size of the window, i.e. an average over the last 5, 10, 30, 60, ... minutes. But when I try to use a parameter in the window function, it leads to the error: "averageRows is not valid at this position".
Here is the code:
CREATE PROCEDURE `stpGetSensorData`(sensorId INT, startDate VARCHAR(8), averageRows INT)
BEGIN
DECLARE FromDate DATE;
SET FromDate = STR_TO_DATE(startDate, '%Y%m%d');
SELECT
L.ID,
L.Time,
L.Device,
AVG(L.Temperature) OVER (ORDER BY L.Time ROWS BETWEEN averageRows PRECEDING AND 0 FOLLOWING) AS Temperature,
AVG(L.Humidity) OVER (ORDER BY L.Time ROWS BETWEEN averageRows PRECEDING AND 0 FOLLOWING) AS Humidity
FROM
LoggedData AS L
WHERE
Device = sensorId
AND Time < DATE_ADD(FromDate, INTERVAL 1 DAY)
AND Time >= FromDate
ORDER BY Time DESC;
END
I guess it's possible to solve this using Dynamic SQL, but I try to avoid Dynamic SQL wherever possible and thought there must be a 'normal' solution as well and I'm just too blind to see it.
Any smart ideas?
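One way to avoid both the frame-bound restriction and Dynamic SQL might be to replace the window frame with a correlated subquery whose time range comes from an ordinary parameter. The sketch below only illustrates the idea: it is written in Python against an in-memory SQLite database as a stand-in for MariaDB, the function name `moving_average` and its `minutes` parameter are assumptions, and the table is cut down to the columns needed. It averages over a time span rather than a row count, which gives the same smoothing as long as there is one reading per minute. In MariaDB the same comparison would presumably be written with something like `L.Time - INTERVAL averageMinutes MINUTE` inside the procedure, which is untested here.

```python
import sqlite3

# In-memory stand-in for the LoggedData table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LoggedData ("
    "ID INTEGER PRIMARY KEY, Time TEXT, Device INTEGER, Temperature REAL)"
)
conn.executemany(
    "INSERT INTO LoggedData VALUES (?, ?, ?, ?)",
    [
        (1, "2022-01-09 13:34:00", 1, 20.1),
        (2, "2022-01-09 13:35:00", 1, 20.0),
        (3, "2022-01-09 13:36:00", 1, 20.1),
        (4, "2022-01-09 13:37:00", 1, 20.1),
        (5, "2022-01-09 13:38:00", 1, 20.0),
        (6, "2022-01-09 13:39:00", 1, 20.1),
    ],
)


def moving_average(conn, device, minutes):
    """Average each reading with the readings of the preceding `minutes` minutes."""
    sql = """
        SELECT L.ID, L.Time,
               (SELECT AVG(P.Temperature)
                  FROM LoggedData AS P
                 WHERE P.Device = L.Device
                   AND P.Time <= L.Time
                   AND P.Time > datetime(L.Time, ?)) AS Temperature
          FROM LoggedData AS L
         WHERE L.Device = ?
         ORDER BY L.Time
    """
    # The window size is an ordinary bound parameter, e.g. '-10 minutes'.
    return conn.execute(sql, (f"-{minutes} minutes", device)).fetchall()


# Hypothetical call: smooth device 1 over a 2-minute span.
smoothed = moving_average(conn, device=1, minutes=2)
```

With a 2-minute span, each row is averaged with the one reading before it; the first row has no predecessor and keeps its own value.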

Related

SQL - sum data for all time, 30 days and 90 days for multiple columns individually

BACKGROUND:
I have data that looks like this
date src subsrc subsubsrc param1 param2
2020-02-01 src1 ksjd dfd8 47 31
2020-02-02 src1 djsk zmnc 44 95
2020-02-03 src2 skdj awes 92 100
2020-02-04 src2 mxsf kajs 80 2
2020-02-05 src3 skdj asio 46 53
2020-02-06 src3 dekl jdqo 19 18
2020-02-07 src3 dskl dqqq 69 18
2020-02-08 src4 sqip riow 64 46
2020-02-09 src5 ss01 qwep 34 34
I am trying to aggregate for all time, last 30 days and last 90 days (no rolling sum)
So my final data would look like this:
src subsrc subsubsrc p1_all p1_30 p1_90 p2_all p2_30 p2_90
src1 ksjd dfd8 7 1 7 98 7 98
src1 djsk zmnc 0 0 0 0 0 0
src2 skdj awes 12 12 12 4 4 4
src2 mxsf kajs 6 6 6 31 31 31
src3 skdj asio 0 0 0 0 0 0
src3 dekl jdqo 20 20 20 17 17 17
src3 dskl dqqq 3 3 3 4 4 4
src4 sqip qwep 0 0 0 0 0 0
src5 ss01 qwes 15 15 15 2 2 2
ABOUT DATA:
This is only dummy data and therefore incorrect.
There are tens of thousands of rows in my data.
There are a dozen of src columns that make up the key for the table.
There are a dozen of param columns that I have to sum for 30 and 90 and all time.
Also there are null values in param columns.
Also there might be multiple rows for the same day and src column.
New data is being added every day and the query is probably going to be run every day to get the latest 30, 90, all time data.
WHAT I HAVE TRIED:
This is what I have come up with:
SELECT src, subsubsrc, subsubsrc,
SUM(param1) as param1_all,
SUM(CASE WHEN DATE_DIFF(CURRENT_DATE,date,day) <= 30 THEN param1 END) as param1_30,
SUM(CASE WHEN DATE_DIFF(CURRENT_DATE,date,day) <= 90 THEN param1 END) as param1_90,
SUM(param2) as param2_all,
SUM(CASE WHEN DATE_DIFF(CURRENT_DATE,date,day) <= 30 THEN param2 END) as param2_30,
SUM(CASE WHEN DATE_DIFF(CURRENT_DATE,date,day) <= 90 THEN param2 END) as param2_90,
FROM `MY_TABLE`
GROUP BY src
ORDER BY src
This actually works but I can anticipate how long this query is going to become for multiple sources and even more param columns.
I have been trying something called "Filtered aggregate functions (or manual pivot)" explained HERE. But I am unable to understand/implement it for my case.
Also I have looked at dozens of answers and most of them are running sums for each day OR are complicated cases of this basic calculation. Maybe I am not searching it correctly.
As you can see I am a newbie in SQL and would really appreciate any help.
Your query looks quite good; conditional aggregation is the canonical method to pivot a dataset.
One way to possibly increase performance would be to change the date filter in the conditional expressions: using a date function precludes the use of an index.
Instead, you could phrase this as:
select
src,
subsrc,
subsubsrc,
sum(param1) as param1_all,
sum(case when date >= current_date - interval 30 day then param1 end) as param1_30,
sum(case when date >= current_date - interval 90 day then param1 end) as param1_90,
sum(param2) as param2_all,
sum(case when date >= current_date - interval 30 day then param2 end) as param2_30,
sum(case when date >= current_date - interval 90 day then param2 end) as param2_90
from my_table
group by src, subsrc, subsubsrc
order by src, subsrc, subsubsrc
For performance, the following index may be helpful: (src, subsrc, subsubsrc, date).
Note that I included all three non-aggregated columns (src, subsrc, subsubsrc) in the group by clause: starting with MySQL 5.7, this is mandatory by default (although you can play around with sql modes to alter that behavior), and most other databases implement the same constraint.
Your first approach isn't a bad one if you are able to build the query programmatically. One alternative might be to create side tables for the 30 and 90 day cases first so you can effectively select all columns from each. This could also be done in sub-queries but there are performance considerations.
Some untested pseudo code to hopefully clarify:
SELECT
    t_all.src,
    t_all.param1_all,
    -- other "all" sums here
    t30.param1_30,
    -- other "30" sums here
    t90.param1_90
    -- other "90" sums here
FROM (
    -- Aggregate each range before joining: joining the raw rows on src
    -- would multiply them and inflate every sum.
    SELECT src, SUM(param1) AS param1_all
    FROM MY_TABLE
    GROUP BY src
) AS t_all
LEFT JOIN (
    SELECT src, SUM(param1) AS param1_30
    FROM MY_TABLE
    WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY src
) AS t30 ON t30.src = t_all.src
LEFT JOIN (
    SELECT src, SUM(param1) AS param1_90
    FROM MY_TABLE
    WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
    GROUP BY src
) AS t90 ON t90.src = t_all.src
ORDER BY t_all.src
Note the date conditions have been switched to not use a function on the date column but instead compare to a date value. Your original approach would defeat any index on date (which you will want to make this more efficient).
If you first put these sub-queries into side tables that have a key on src the joins will be more efficient too. You could even group/sum directly into those side tables first rather than creating larger copies of your data, and then join the aggregated data together.
Your code looks good. Under the hood, your RDBMS needs to loop over all records and do some calculations. One thing that you can improve is that you are calculating date differences for every record. It would make sense to calculate the dates 30 days ago and 90 days ago once, beforehand, and only compare each row's date against those.
Since you already know that the number of rows and parameters will increase in the future, it makes sense to create a cron job which daily computes this in the following manner:
the first time it calculates the values, it should store all the results along with the date it was running at (maybe into a table dedicated for this analytics)
on subsequent days you can calculate the all time sum by loading the items which were created since the last check
you will still need to calculate the 30 and 90 day stuff, but that would be much less of a problem than calculating this for all time
If you do this properly and have daily information, then later on you will be able to analyze trends in history as well.
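The incremental part of that job could look roughly like the sketch below. It is written in Python with an in-memory SQLite database standing in for the real server; the table names `totals` and `job_state` and the function `run_daily_job` are made up for illustration, and only one param column is shown. It assumes rows never arrive with a date older than the last run, otherwise they would be missed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (date TEXT, src TEXT, param1 INTEGER);
    -- Hypothetical analytics tables: stored all-time sums and the last processed date.
    CREATE TABLE totals (src TEXT PRIMARY KEY, param1_all INTEGER NOT NULL);
    CREATE TABLE job_state (last_date TEXT NOT NULL);
    INSERT INTO job_state VALUES ('');
""")


def run_daily_job(conn, run_date):
    """Fold rows dated after the previous run, up to run_date, into the stored totals."""
    (last_date,) = conn.execute("SELECT last_date FROM job_state").fetchone()
    new_sums = conn.execute(
        "SELECT src, COALESCE(SUM(param1), 0) FROM my_table"
        " WHERE date > ? AND date <= ? GROUP BY src",
        (last_date, run_date),
    ).fetchall()
    for src, subtotal in new_sums:
        # Create the row on first sight of a src, then add the new subtotal.
        conn.execute("INSERT OR IGNORE INTO totals VALUES (?, 0)", (src,))
        conn.execute(
            "UPDATE totals SET param1_all = param1_all + ? WHERE src = ?",
            (subtotal, src),
        )
    conn.execute("UPDATE job_state SET last_date = ?", (run_date,))
    conn.commit()


# First run processes everything so far; the second run only sees the new day.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [("2020-02-01", "src1", 47), ("2020-02-02", "src1", 44)],
)
run_daily_job(conn, "2020-02-02")
conn.execute("INSERT INTO my_table VALUES ('2020-02-03', 'src2', 92)")
conn.execute("INSERT INTO my_table VALUES ('2020-02-03', 'src1', NULL)")
run_daily_job(conn, "2020-02-03")
totals = conn.execute("SELECT src, param1_all FROM totals ORDER BY src").fetchall()
```

The 30 and 90 day sums would still be recomputed on each run, since rows drop out of those ranges every day.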
I'd recommend you use 3 different queries for that:
Sum for all time
Sum for 30 days
Sum for 90 days
Because when you're trying to do all-in-1 query then you end up with full table scan because of CASE-WHEN-END (BTW there is compact form IF() in MySQL). This is extremely non-optimal.
If you split it into 3 different queries and add an index to the date column then it won't do full-scan for the 2nd and 3rd query. Only for the 1st query, which can be optimised separately (for example by caching).
Also this approach: DATE_DIFF(CURRENT_DATE,date,day) <= 90
should be changed to: date >= 'date-90-days-ago' (where 'date-90-days-ago' is a fixed date)
Thus you won't have to compute the difference of 2 dates for every row. You'll just have to compute 2 dates, 30 days ago and 90 days ago, and compare all other dates to these two. This approach will benefit from the date column index.
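As a small illustration of comparing against two fixed dates, the sketch below computes both cutoffs once in Python and passes them to the query as parameters. SQLite stands in for the real database here, the function name `summarize` and the sample rows are invented for the example, and only `param1` is shown.

```python
import sqlite3
from datetime import date, timedelta


def summarize(conn, today):
    """Sum param1 per src for all time, the last 30 days and the last 90 days."""
    # Compute both cutoff dates once instead of a date difference per row.
    cut30 = (today - timedelta(days=30)).isoformat()
    cut90 = (today - timedelta(days=90)).isoformat()
    sql = """
        SELECT src,
               SUM(param1) AS param1_all,
               SUM(CASE WHEN date >= ? THEN param1 END) AS param1_30,
               SUM(CASE WHEN date >= ? THEN param1 END) AS param1_90
          FROM my_table
         GROUP BY src
         ORDER BY src
    """
    return conn.execute(sql, (cut30, cut90)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (date TEXT, src TEXT, param1 INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        ("2020-02-01", "src1", 47),   # inside both ranges
        ("2019-12-20", "src1", 44),   # inside the 90-day range only
        ("2019-06-01", "src1", 10),   # all-time only
        ("2020-02-03", "src2", 92),
    ],
)
result = summarize(conn, date(2020, 2, 10))
```

Because the dates are stored as ISO strings, a plain comparison against the cutoff is enough and leaves the date column untouched by any function.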

Iterate Over Date Mysql Loop

I've written a stored procedure to iterate over every week for three years. It doesn't work though and returns a vague error message.
#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '' at line 18
DELIMITER $$
CREATE PROCEDURE loop_three_years()
BEGIN
declare y INT default 2016;
declare m int default 4;
declare d int default 20;
WHILE y <= 2019 DO
WHILE YEARWEEK(concat(y, '-', m, '-', d)) <= 53 DO
WHILE m < 12 DO
WHILE (m = 2 and d <= 29) OR (d <=30 and m in(4, 6,9,11)) OR ( m in(1,3,5,7,8,10,12) AND d <= 31) DO
set d = d + 7;
SELECT YEARWEEK(concat(y, '-', m, '-', d));
END WHILE;
set d=1;
END WHILE;
set m = 1;
SET y = y + 1;
END WHILE;
END
$$
When I run the parts on their own as minimal examples they work, so I'm not sure what the issue is with my reassembly. I'm also not sure if there's a better way to do this. (The select is just for testing; it will be an insert when I use the real code.)
Slightly Altered from a previous solution
You can build your own dynamic calendar / list using ANY other table in your system that has at least as many records as you need to fake row numbers. The query below will use MySQL @ variables which work like an inline program and declaration. I can start the list with a given date... such as your 2016-04-20 and then, on each iteration, add 1 week using date-based functions. No need for me to know or care which months have 28, 29 (leap year), 30 or 31 days.
The table reference below of "AnyTableThatHasAtLeast156Records" is just that.. Any table in your database that has at least 156 records (52 weeks per year, 3 years)
select
YEARWEEK( @startDate ) WeekNum,
@startDate as StartOfWeek,
@startDate := date_add( @startDate, interval 1 week ) EndOfWeek
from
( select @startDate := '2016-04-20') sqlv,
AnyTableThatHasAtLeast156Records
limit
156
This will give you a list of 156 records (provided your "anyTable…" has at least 156 records), all at once. If you need to join this to some other transaction table, you could do so by making the above a JOIN table. A benefit here: since I included the start and end of each week, those can be part of your join condition.
Example:
record WeekNum StartOfWeek EndOfWeek
1 ?? 2016-04-20 2016-04-27
2 ?? 2016-04-27 2016-05-04
3 ?? 2016-05-04 2016-05-11
4 ?? 2016-05-11 2016-05-18... etc
By adding 1 week to the starting point, you can see that it would do Ex: Monday to Monday. And the JOIN Condition below I have LESS THAN the EndOfWeek. This would account for any transactions UP TO but not including the ending date... such as transactions on 2016-04-26 11:59:59PM (hence LESS than 2016-04-27, as 04/27 is the beginning of the next week's cycle of transactions)
select
Cal.WeekNum,
YT.YourColumns
from
YourTransactionTable YT
JOIN ( aboveCalendarQuery ) Cal
on YT.TransactionDate >= Cal.StartOfWeek
AND YT.TransactionDate < Cal.EndOfWeek
where
whatever else
You could even do sum() with group by such as by WeekNum if that is what you intend.
Hopefully this is a much more accurate and efficient way to build out your calendar to run with and linking to transactions if you so needed to.
Response from comment.
You could, by doing a join to a ( select 1 union select 2 union … select 156 ), but it's your choice. The ONLY reason for the "AnyTable…" is that I am sure any reasonable database with transactions would easily have 156 records or more. Its sole purpose is to provide a row for each iteration so the rows can be created dynamically.
It is also much more sound than the looping mechanism you ran into to begin with. Nothing wrong with that, especially for learning purposes, but if there are more efficient ways, doesn't that make more sense?
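Another way to get the 156 rows without any helper table is a recursive common table expression, assuming the server version supports them. The sketch below shows the idea in Python with SQLite as a stand-in; the function name `week_calendar` and the column names are illustrative, and the MySQL/MariaDB version would use `date_add(..., interval 1 week)` instead of SQLite's `date(..., '+7 days')`.

```python
import sqlite3


def week_calendar(conn, start_date, weeks):
    """Return (n, start_of_week, end_of_week) rows, one per week."""
    sql = """
        WITH RECURSIVE cal(n, start_of_week) AS (
            SELECT 1, date(?)
            UNION ALL
            SELECT n + 1, date(start_of_week, '+7 days')
              FROM cal
             WHERE n < ?
        )
        SELECT n, start_of_week, date(start_of_week, '+7 days') AS end_of_week
          FROM cal
         ORDER BY n
    """
    return conn.execute(sql, (start_date, weeks)).fetchall()


conn = sqlite3.connect(":memory:")
calendar = week_calendar(conn, "2016-04-20", 156)
```

Each row's end date is the next row's start date, so the same "greater or equal start, less than end" join condition applies.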
Per feedback from comment
I don't know exactly what other table you are trying to insert into, but yes, you can use this for all 3000 things. Provide more of what you are trying to do and I can adjust... In the meantime, something like this...
insert into YourOtherTable
( someField,
AnotherField,
WeekNum
)
select
x.someField,
x.AnotherField,
z.WeekNum
from
Your3000ThingTable x
JOIN (select
YEARWEEK( @startDate ) WeekNum,
@startDate as StartOfWeek,
@startDate := date_add( @startDate, interval 1 week ) EndOfWeek
from
( select @startDate := '2016-04-20') sqlv,
AnyTableThatHasAtLeast156Records
limit
156 ) z
on 1=1
where
x.SomeCondition...
By joining the select of 156 records on 1=1 (which is always true), it will return 156 entries for whatever record is in the Your3000ThingTable. So, if you have an inventory item table with
Item Name
1 Thing1
2 Thing2
3 Thing3
Your final insert would be
Item Name WeekNum
1 Thing1 1
1 Thing1 2
1 Thing1 ...
1 Thing1 156
2 Thing2 1
2 Thing2 2
2 Thing2 ...
2 Thing2 156
3 Thing3 1
3 Thing3 2
3 Thing3 ...
3 Thing3 156
And to pre-confirm what you THINK would happen, just try the select/join on 1=1 and you'll see all the records the query WOULD be inserting into your destination table.

How to SELECT all rows within a certain date/time range with a certain timestamp step size in MySQL?

I have a table that contains sensor data with a column timestamp that holds the unix timestamp of the time the sensor measurement has been taken.
Now I would like to SELECT all measurements within a certain date/time range with a specific time step.
I figured the first part out myself like you can see in my posted code snippet below.
// With $date_start and $date_stop in the format: '2010-10-01 12:00:00'
$result = mysqli_query($connection, "SELECT sensor_1
FROM sensor_table
WHERE timestamp >= UNIX_TIMESTAMP($date_start)
AND timestamp < UNIX_TIMESTAMP($date_stop)
ORDER BY timestamp");
Now is there a convenient way in MySQL to include a time step size into the same SELECT query?
My table contains thousands of measurements over months with one measurement taken every 5 seconds.
Now let's say I would like to SELECT measurements in between 2010-10-01 12:00:00 and 2010-10-02 12:00:00 but in this date/time range only SELECT one measurement every 10 minutes? (as my table contains measurements taken every 5 seconds).
Any smart ideas how to solve this in a single query?
(also other ideas are very welcome :))
Since you take one measurement every 5 seconds, the difference between $date_start and the first matching measurement cannot be greater than 4 seconds. We then take one entry every 600 seconds (allowing for some discrepancy from clock to clock...)
SELECT sensor_1
FROM sensor_table
WHERE timestamp >= UNIX_TIMESTAMP($date_start)
AND
timestamp < UNIX_TIMESTAMP($date_stop)
AND
((timestamp - UNIX_TIMESTAMP($date_start)) % 600) BETWEEN 0 AND 4
ORDER BY timestamp;
It is not elegant, but you can do:
SELECT s.sensor_1
FROM sensor_table s
WHERE s.timestamp >= UNIX_TIMESTAMP($date_start) AND
      s.timestamp < UNIX_TIMESTAMP($date_stop) AND
      -- timestamp already holds a unix timestamp, so it is used directly
      s.timestamp = (SELECT MIN(s2.timestamp)
                     FROM sensor_table s2
                     WHERE s2.timestamp >= 60 * 10 * FLOOR(s.timestamp / (60 * 10)) AND
                           s2.timestamp < 60 * 10 * (1 + FLOOR(s.timestamp / (60 * 10)))
                    )
ORDER BY s.timestamp;
This selects the first in each 10 minute period.
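The "first row per 10-minute bucket" idea can also be written as a join against the minimum timestamp of each bucket. The following is a minimal sketch in Python with SQLite standing in for MySQL; the function name `downsample`, the sample data and the start value are invented, and timestamps are assumed to be unique. Note that `/` between integers is integer division in SQLite, whereas MySQL would need `DIV` or `FLOOR()` for the same grouping.

```python
import sqlite3

BUCKET_SECONDS = 600  # 10 minutes


def downsample(conn, start_ts, stop_ts):
    """Return the first measurement of every 10-minute bucket in [start_ts, stop_ts)."""
    sql = """
        SELECT s.timestamp, s.sensor_1
          FROM sensor_table AS s
          JOIN (SELECT MIN(timestamp) AS first_ts
                  FROM sensor_table
                 WHERE timestamp >= ? AND timestamp < ?
                 GROUP BY (timestamp - ?) / ?
               ) AS b ON s.timestamp = b.first_ts
         ORDER BY s.timestamp
    """
    params = (start_ts, stop_ts, start_ts, BUCKET_SECONDS)
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_table (timestamp INTEGER PRIMARY KEY, sensor_1 REAL)")

# 30 minutes of made-up data: one reading every 5 seconds, 2 seconds off the grid.
start = 1_285_934_400
conn.executemany(
    "INSERT INTO sensor_table VALUES (?, ?)",
    [(start + 2 + 5 * i, float(i)) for i in range(360)],
)
sampled = downsample(conn, start, start + 1800)
```

Buckets are counted from the start of the requested range, so a small clock offset in the readings does not cause a bucket to be skipped.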
I think that you could use a simple loop in a stored procedure (sketched here in MySQL syntax, stepping through the range 10 minutes at a time):
CREATE TABLE StoreValuesId
(
    valueId BIGINT PRIMARY KEY
);
DELIMITER $$
CREATE PROCEDURE procedure_store(IN date_start DATETIME, IN date_stop DATETIME)
BEGIN
    -- Walk the range in 10-minute (600-second) steps.
    DECLARE step_start BIGINT DEFAULT UNIX_TIMESTAMP(date_start);
    DECLARE range_end BIGINT DEFAULT UNIX_TIMESTAMP(date_stop);
    WHILE step_start < range_end DO
        -- Store the timestamp of the first measurement in this step, if there is one.
        INSERT INTO StoreValuesId (valueId)
        SELECT MIN(timestamp)
        FROM sensor_table
        WHERE timestamp >= step_start
          AND timestamp < LEAST(step_start + 600, range_end)
        HAVING MIN(timestamp) IS NOT NULL;
        SET step_start = step_start + 600;
    END WHILE;
END$$
DELIMITER ;
Then again the syntax might be wrong but I hope you'll get the idea (haven't played with sql in a while)

SQL Query for a given time interval?

If I have this dataset below:
Timestamp Clicks
1:40:11 5
2:40:13 10
3:42:56 20
4:42:23 30
7:45:59 23
9:45:34 24
10:47:23 24
12:47:12 24
So from the data above the minutes range go from 40-47 but skips 41, 43, 44, and 46 in that range.
I want to find the average number of clicks per minute in that range (40-47) and put a zero value for the minutes that are not within the range (41, 43, 44, and 46).
So the result should be like this:
Minute Clicks
40 8
41 0
42 25
43 0
44 0
45 24
46 0
47 24
Any ideas on how to achieve something like this?
You only need 60 series, so you can create a table with 60 rows which contains the 60 existing minutes:
[table serie]
minute
0
1
2
3
4
5
…
Then use left join to create simple query like this:
select a.minute, IF(avg(b.Clicks),avg(b.Clicks),0) as avg_click from serie a
left join my_dataset b on a.`minute`*1 = SUBSTRING(b.Timestamp,-5,2)*1
group by minute
SUBSTRING(b.Timestamp,-5,2) will give you the minute from the end (to avoid wrong substring from the beginning if the HOUR has only 1 char).
We need to force comparison to INT by using *1 to CAST.
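To make the series-table approach concrete, here is a small runnable sketch using the data from the question. It uses Python with SQLite as a stand-in for MySQL, so `substr` and `CAST` replace `SUBSTRING` and the `*1` trick; the function name `clicks_per_minute` is made up, and the averages are left unrounded (the question's expected output shows them rounded).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serie (minute INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE my_dataset (Timestamp TEXT, Clicks INTEGER)")

# The series table holds every minute of the hour, 0 to 59.
conn.executemany("INSERT INTO serie VALUES (?)", [(m,) for m in range(60)])
conn.executemany(
    "INSERT INTO my_dataset VALUES (?, ?)",
    [
        ("1:40:11", 5), ("2:40:13", 10), ("3:42:56", 20), ("4:42:23", 30),
        ("7:45:59", 23), ("9:45:34", 24), ("10:47:23", 24), ("12:47:12", 24),
    ],
)


def clicks_per_minute(conn, first_minute, last_minute):
    """Average clicks per minute-of-hour, with 0 for minutes that have no rows."""
    sql = """
        SELECT a.minute, COALESCE(AVG(b.Clicks), 0) AS avg_click
          FROM serie AS a
          LEFT JOIN my_dataset AS b
            ON a.minute = CAST(substr(b.Timestamp, -5, 2) AS INTEGER)
         WHERE a.minute BETWEEN ? AND ?
         GROUP BY a.minute
         ORDER BY a.minute
    """
    return conn.execute(sql, (first_minute, last_minute)).fetchall()


averages = clicks_per_minute(conn, 40, 47)
```

The left join keeps every minute of the series, and `COALESCE` turns the NULL average of an unmatched minute into 0.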
I would start with something like this
declare @StartTime DateTime = (select MIN(Timestamp) from tablename)
declare @EndTime DateTime = (select MAX(Timestamp) from tablename)
declare @CurrentMinute DateTime = @StartTime
declare @ResultTable table (Minute int, Clicks int)
while @CurrentMinute <= @EndTime
begin
    insert into @ResultTable (Minute, Clicks)
    select DatePart(Minute, @CurrentMinute) as Minute,
           -- average of the matching rows, or 0 when no row matches this minute
           isnull((select avg(Clicks) from tablename where DatePart(Minute, Timestamp) = DatePart(Minute, @CurrentMinute)), 0)
    -- move on to the next minute, otherwise the loop never ends
    set @CurrentMinute = DateAdd(Minute, 1, @CurrentMinute)
end
select * from @ResultTable
This works by finding the lowest and highest times and initializing the current-minute variable to the start time. The while loop then runs until the end time and inserts one row into the result table for every minute. If no row matches a given minute, the subquery returns NULL and isnull inserts 0 for the clicks of that row.

How to get data back from Mysql for days that have no statistics

I want to get the number of Registrations back from a time period (say a week), which isn't that hard to do, but I was wondering if it is in any way possible in MySQL to return a zero for days that have no registrations.
An example:
DATA:
ID_Profile datCreate
1 2009-02-25 16:45:58
2 2009-02-25 16:45:58
3 2009-02-25 16:45:58
4 2009-02-26 10:23:39
5 2009-02-27 15:07:56
6 2009-03-05 11:57:30
SQL:
SELECT
DAY(datCreate) as RegistrationDate,
COUNT(ID_Profile) as NumberOfRegistrations
FROM tbl_profile
WHERE DATE(datCreate) > DATE_SUB(CURDATE(),INTERVAL 9 DAY)
GROUP BY RegistrationDate
ORDER BY datCreate ASC;
In this case the result would be:
RegistrationDate NumberOfRegistrations
25 3
26 1
27 1
5 1
Obviously I'm missing a couple of days in between. Currently I'm solving this in my php code, but I was wondering if MySQL has any way to automatically return 0 for the missing days/rows. This would be the desired result:
RegistrationDate NumberOfRegistrations
25 3
26 1
27 1
28 0
1 0
2 0
3 0
4 0
5 1
This way we can use MySQL to solve any problems concerning the number of days in a month instead of relying on php code to calculate for each month how many days there are, since MySQL has this functionality built in.
Thanks in advance
No, but one workaround would be to create a single-column table with a date primary key, preloaded with dates for each day. You'd have dates from your earliest starting point right through to some far off future.
Now, you can LEFT JOIN your statistical data against it - then you'll get nulls for those days with no data. If you really want a zero rather than null, use IFNULL(colname, 0)
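A minimal runnable sketch of that calendar-table join, using the rows from the question, is shown below. It is written in Python with SQLite as a stand-in for MySQL; the function name `registrations_per_day` is an assumption, and the calendar is filled from Python for brevity. Since `COUNT(ID_Profile)` only counts non-NULL values, the days without registrations come out as 0 directly.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calendar (dt TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE tbl_profile (ID_Profile INTEGER PRIMARY KEY, datCreate TEXT)")

# Preload the calendar with one row per day of 2009.
day = date(2009, 1, 1)
while day <= date(2009, 12, 31):
    conn.execute("INSERT INTO calendar VALUES (?)", (day.isoformat(),))
    day += timedelta(days=1)

conn.executemany(
    "INSERT INTO tbl_profile VALUES (?, ?)",
    [
        (1, "2009-02-25 16:45:58"), (2, "2009-02-25 16:45:58"),
        (3, "2009-02-25 16:45:58"), (4, "2009-02-26 10:23:39"),
        (5, "2009-02-27 15:07:56"), (6, "2009-03-05 11:57:30"),
    ],
)


def registrations_per_day(conn, first_day, last_day):
    """Count registrations for every calendar day in the range, including empty days."""
    sql = """
        SELECT c.dt, COUNT(p.ID_Profile) AS NumberOfRegistrations
          FROM calendar AS c
          LEFT JOIN tbl_profile AS p
            ON c.dt = date(p.datCreate)   -- compare on the date part only
         WHERE c.dt BETWEEN ? AND ?
         GROUP BY c.dt
         ORDER BY c.dt
    """
    return conn.execute(sql, (first_day, last_day)).fetchall()


counts = registrations_per_day(conn, "2009-02-25", "2009-03-05")
```

The month boundary needs no special handling, because the calendar table simply has no row for a 29th of February in 2009.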
Thanks to Paul Dixon I found the solution. Anyone interested in how I solved this read on:
First create a stored procedure I found somewhere to populate a table with all dates from this year.
CREATE Table calendar(dt date not null);
CREATE PROCEDURE sp_calendar(IN start_date DATE, IN end_date DATE, OUT result_text TEXT)
BEGIN
SET @begin = 'INSERT INTO calendar(dt) VALUES ';
SET @date = start_date;
SET @max = SUBDATE(end_date, INTERVAL 1 DAY);
SET @temp = '';
REPEAT
SET @temp = concat(@temp, '(''', @date, '''), ');
SET @date = ADDDATE(@date, INTERVAL 1 DAY);
UNTIL @date > @max
END REPEAT;
SET @temp = concat(@temp, '(''', @date, ''')');
SET result_text = concat(@begin, @temp);
END
call sp_calendar('2009-01-01', '2010-01-01', @z);
select @z;
Then change the query to add the left join:
SELECT
DAY(dt) as RegistrationDate,
COUNT(ID_Profile) as NumberOfRegistrations
FROM calendar
LEFT JOIN
tbl_profile ON calendar.dt = DATE(tbl_profile.datCreate)
WHERE dt BETWEEN DATE_SUB(CURDATE(),INTERVAL 6 DAY) AND CURDATE()
GROUP BY RegistrationDate
ORDER BY dt ASC
And we're done.
Thanks all for the quick replies and solution.