Group a sequence of lines [SQL] - mysql

Is there a way to group a sequence of rows in SQL (MySQL 5.1.73)?
Let me explain: I have a query that returns this:
hour|start_date         |end_date           |
----|-------------------|-------------------|
  10|2022-02-01 10:11:18|2022-02-01 10:50:18|
  11|2022-02-01 11:30:31|2022-02-01 11:38:12|
  13|2022-02-17 13:55:09|2022-02-17 13:58:38|
  14|2022-02-17 14:51:09|2022-02-17 14:57:59|
And I would like to convert it to this:
hour|start_date         |end_date           |
----|-------------------|-------------------|
  10|2022-02-01 10:11:18|2022-02-01 11:38:12|
  13|2022-02-17 13:55:09|2022-02-17 14:57:59|
In other words, I would like to merge all the rows whose hours are consecutive.
My current query groups by hour, like this:
SELECT hour( date ) as hour, MIN(date) as start_date , MAX(date) as end_date
FROM test_tbl
GROUP BY hour( date ) , date( date )
order by date, hour( date ) ;
But after doing this query, I would like to group the lines whose hours follow each other (10,11 => 10)...

EDIT: the following answer only works with MySQL version 8+
with tbl_by_hour as (
    SELECT hour( date ) as hour, MIN(date) as start_date , MAX(date) as end_date
    FROM test_tbl
    GROUP BY hour( date ) , date( date )
)
select
    min(hour) as hour,
    min(start_date) as start_date,
    max(end_date) as end_date
from (
    select tab1.*,
        -- running count of breaks: rows that share a count belong to the same run of hours
        sum(case when prev_hour is null or prev_hour = hour - 1 then 0 else 1 end) over(order by start_date) grp
    from (
        -- order by start_date (not hour) so that rows from different days are not interleaved
        select hour, start_date, end_date, lag(hour) over(order by start_date) prev_hour from tbl_by_hour
    ) as tab1
) as tab2
group by grp
order by start_date
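
To make the lag / running-sum idea easier to follow outside SQL, here is a minimal Python sketch of the same steps. The function name group_consecutive_hours and the tuple layout are assumptions made for illustration; the sample rows are the ones from the question. Like the query, it only compares hours, so it assumes the rows are already sorted by start_date.

```python
from itertools import accumulate

# Rows from the question: (hour, start_date, end_date), sorted by start_date.
rows = [
    (10, "2022-02-01 10:11:18", "2022-02-01 10:50:18"),
    (11, "2022-02-01 11:30:31", "2022-02-01 11:38:12"),
    (13, "2022-02-17 13:55:09", "2022-02-17 13:58:38"),
    (14, "2022-02-17 14:51:09", "2022-02-17 14:57:59"),
]


def group_consecutive_hours(rows):
    """Merge rows whose hours follow each other (hypothetical helper)."""
    # Equivalent of LAG(hour): the previous row's hour, None for the first row.
    prev_hours = [None] + [r[0] for r in rows[:-1]]
    # 0 when the row continues the run, 1 when it starts a new one.
    flags = [0 if p is None or p == r[0] - 1 else 1
             for p, r in zip(prev_hours, rows)]
    # Equivalent of SUM(...) OVER (ORDER BY ...): a running group number.
    groups = list(accumulate(flags))

    merged = {}
    for grp, (hour, start, end) in zip(groups, rows):
        if grp not in merged:
            merged[grp] = [hour, start, end]
        else:
            current = merged[grp]
            current[0] = min(current[0], hour)
            current[1] = min(current[1], start)  # ISO timestamps sort as text
            current[2] = max(current[2], end)
    return [tuple(v) for v in merged.values()]


result = group_consecutive_hours(rows)
```

With the sample rows this produces the two merged rows the question asks for.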

You can probably do something like this:
SELECT MIN(hours), dates, MIN(start_date), MAX(end_date), tn
FROM
  (SELECT *,
          CEIL(rownum/5) AS tn
   FROM
     (SELECT *,
             CASE WHEN dates=@dt
                       AND hours=@hr+1
                  THEN @rn := @rn+1
                  WHEN dates=@dt
                       AND hours > @hr+1
                  THEN @rn := @rn+20
                  ELSE @rn := 1
             END AS rownum,
             @dt := dates,
             @hr := hours
      FROM
        (SELECT hour(date) as hours, date(date) dates,
                MIN(date) as start_date , MAX(date) as end_date
         FROM test_tbl t
         GROUP BY dates, hours) v
      CROSS JOIN (SELECT @rn := 0, @dt := NULL, @hr := 0) r
      ORDER BY dates, hours) s
  ) w
GROUP BY dates, tn;
I took your original query as the base and turned it into a subquery.
Then I CROSS JOIN it with a subquery of variables to generate a custom row numbering. The rules for the row number are:
If the row is on the same date and its hour is exactly +1 from the previous row, continue the numbering.
If the row is on the same date and its hour is more than +1 from the previous row, take the last number and increment it by 20.
If the date is different, restart the numbering at 1.
After generating the row numbering, I wrap that in a subquery, divide the row number by 5 and apply the ceiling function (CEIL), so that rows in the same run get the same CEIL(rownum/5) value and can be treated as one group. This is the part I find unconvincing, but it works.
Lastly, I wrap that in another subquery and select MIN(hours), dates, MIN(start_date), MAX(end_date), tn with GROUP BY dates, tn.
It's not a convincing solution because the final step (generating the tn column) relies on a heuristic rather than on something certain; for example, a run of more than five consecutive hours gives rownum values above 5 and would be split into two groups. I usually prefer a solution that covers all possible scenarios with something concrete. However, I ran extensive tests on the current query with more varied data and so far it returns good results. Also, you said your MySQL version is 5.1, so I'm not sure this will work there. Version 5.5+ is probably the lowest version of MySQL available in online fiddles.
Here's a demo fiddle

Related

I'm using the IN operator yet I still get the error: 'Subquery returns more than 1 row'

I'm trying to solve this challenge: https://www.hackerrank.com/challenges/sql-projects/problem.
I tried the following:
SELECT
(SELECT start_date
FROM projects
WHERE
(SELECT DATE_ADD(start_date, INTERVAL -1 DAY)) NOT IN (SELECT start_date FROM projects)
ORDER BY start_date ASC) AS start_date,
(SELECT end_date
FROM projects
WHERE
(SELECT DATE_ADD(end_date, INTERVAL 1 DAY)) NOT IN (SELECT end_date FROM projects)
ORDER BY end_date ASC) AS end_date
FROM projects p
ORDER BY DATEDIFF(end_date, start_date) ASC, start_date ASC
Nonetheless, I got the following error, despite using the NOT IN operator: 'Subquery returns more than 1 row'.
However, when I tried executing only this part of the code:
SELECT start_date
FROM projects p
WHERE (SELECT DATE_ADD(start_date, INTERVAL -1 DAY)) NOT IN (SELECT start_date FROM projects)
ORDER BY start_date ASC
It worked fine.
What could be the problem?
The two subqueries for start_date and end_date can return different numbers of rows, and in any case the database engine does not allow this kind of "parallel query" in the SELECT list.
In this case you should first get all the dates involved and then left join the subqueries:
select t1.start_date, t2.end_date
from (
    SELECT start_date
    FROM projects
    WHERE DATE_ADD(start_date, INTERVAL -1 DAY) NOT IN (SELECT start_date FROM projects)
    UNION
    SELECT end_date
    FROM projects
    WHERE DATE_ADD(end_date, INTERVAL 1 DAY) NOT IN (SELECT end_date FROM projects)
) t
left join (
    SELECT start_date
    FROM projects
    WHERE DATE_ADD(start_date, INTERVAL -1 DAY) NOT IN (SELECT start_date FROM projects)
) t1 on t.start_date = t1.start_date
left join (
    SELECT end_date
    FROM projects
    WHERE DATE_ADD(end_date, INTERVAL 1 DAY) NOT IN (SELECT end_date FROM projects)
) t2 on t.start_date = t2.end_date
order by t.start_date
You select project rows. Per project row you select a start date. The query for this start date looks like this:
(SELECT start_date ... ORDER BY start_date ASC)
Do you really think it is one start_date you are selecting here? Why then the ORDER BY clause? This subquery returns multiple rows, and this is why you are getting the error.
This query does not select one start date, but all start dates for which the previous date does not exist in the table. It doesn't even relate to the project row in the main query.
It seems you want to find all start dates that have no predecessor and all end dates that have no successor. These are two data sets you can select from. So the subqueries don't belong in the SELECT clause, where you say which columns to select, but in the FROM clause, where you say which data sets to select from.
You would then have to join the two sets. The join criterion would be the rows' positions in the ordered data sets (first start date belongs to first end date, second start date belongs to second end date, ...). For this you need a way to number these data rows.
Such a task is easy to solve with ROW_NUMBER, which has only been available since MySQL 8.
SELECT s.start_date, e.end_date
FROM
(
  SELECT start_date, ROW_NUMBER() OVER (ORDER BY start_date) AS rn
  FROM projects
  WHERE start_date - INTERVAL 1 DAY NOT IN (SELECT start_date FROM projects)
) s
JOIN
(
  SELECT end_date, ROW_NUMBER() OVER (ORDER BY end_date) AS rn
  FROM projects
  WHERE end_date + INTERVAL 1 DAY NOT IN (SELECT end_date FROM projects)
) e USING (rn)
ORDER BY s.start_date;
This kind of problem is called gaps & islands. There are other ways to solve it, but I think the above query builds directly on yours and is thus easy to understand.
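
For readers who want to see the pairing logic outside SQL, here is a small Python sketch of the same idea: collect the start dates with no predecessor, collect the end dates with no successor, sort both, and pair them by position (which is what the rn join does). The sample data and the function name islands are made up for illustration, and the sketch assumes one-day tasks as in the challenge.

```python
from datetime import date, timedelta

# Hypothetical one-day tasks: (start_date, end_date).
projects = [
    (date(2015, 10, 1), date(2015, 10, 2)),
    (date(2015, 10, 2), date(2015, 10, 3)),
    (date(2015, 10, 3), date(2015, 10, 4)),
    (date(2015, 10, 13), date(2015, 10, 14)),
    (date(2015, 10, 14), date(2015, 10, 15)),
    (date(2015, 10, 28), date(2015, 10, 29)),
]


def islands(projects):
    """Pair each island's first start date with its last end date."""
    one_day = timedelta(days=1)
    starts = {s for s, _ in projects}
    ends = {e for _, e in projects}
    # Start dates whose previous day is not a start date.
    island_starts = sorted(s for s in starts if s - one_day not in starts)
    # End dates whose next day is not an end date.
    island_ends = sorted(e for e in ends if e + one_day not in ends)
    # Same role as joining on ROW_NUMBER(): first with first, second with second.
    return list(zip(island_starts, island_ends))


result = islands(projects)
```

Each tuple in result is one project: its first start date and its last end date.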
Here is another answer that may explain better what you are doing.
You can:
select
start_date,
end_date,
start_date - interval 1 day as prev_day,
1 as one
from projects;
The select clause contains what you want to select from a projects row. For the first row you will get its start date, end date, its start date minus one day, and a 1 we call "one" here. For the second row you will get its start date (which is probably another start date than the one of the first row), its end date, its start date minus one day, and again a 1 we call "one".
You can
select
(select start_date) as start_date,
(select end_date) as end_date,
(select start_date - interval 1 day) as prev_day,
(select 1) as one
from projects;
which doesn't change anything and only obfuscates things. (This is what you do here: (SELECT DATE_ADD(end_date, INTERVAL 1 DAY)).)
You cannot
select
(select start_date from projects) as start_date,
(select end_date from projects) as end_date,
(select start_date - interval 1 day from projects) as prev_day,
(select 1 from projects) as one
from projects;
because here you are not selecting one value for the first project row's start date, but all start dates from the table. Same for its end date etc. of course, same for the second row etc. This is what you are doing here:
SELECT
(SELECT start_date FROM projects ...) AS start_date,
(SELECT end_date FROM projects ...) AS end_date
FROM projects p
and this is why you are getting the error "Subquery returns more than 1 row".

collect_set() distinct users by day from last 90 days only when user is older than last 90 days

So far I have been able to collect_set() everyone who is active with no problem:
with aux as(
select date
,collect_set(user_id) over(
partition by feature
order by cast(timestamp(date) as float)
range between (-90*60*60*24) following and 0 preceding
) as user_id
,feature
--
from (
select date
,feature
,collect_set(user_id) as user_id
--
from table
--
group by date, feature
)
)
--
select date
,array_distinct(flatten(user_id))
,feature
--
from aux
The problem is that I now have to keep only the users whose accounts are older than the last 90 days.
I tried this, and it didn't work:
select date
,collect_set(case when user_created_at < date - interval 90 day
then user_id end) over(
partition by feature
order by cast(timestamp(date) as float)
range between (-90*60*60*24) following and 0 preceding
) as teste
,feature
from table
The reason it didn't work is that the filter inside collect_set() is evaluated against each row's own date instead of against the date at the end of the 90-day window,
so the result contains more users than expected.
How can I get this right?
For reference, I'm using this query to verify whether the result is correct:
select
count(distinct user_id) as total
,count(distinct case when user_created_at < date('2020-04-30') - interval 90 day then user_id end)
,count(distinct case when user_created_at >= date('2020-04-30') - interval 90 day then user_id end)
--
from table
--
where 1=1
and date >= date('2020-04-30') - interval 90 day
and date <= '2020-04-30'
and feature = 'a_feature'
A pretty ugly workaround:
select data
,feature
,collect_set(cus.client_id) as client
from (
select data
,explode(array_distinct(flatten(cliente))) as client
,feature
from(
select data
,collect_set(cliente) over(
partition by feature
order by cast(timestamp(data) as float)
range between (-90*60*60*24) following and 0 preceding
) as cliente
,feature
from (
select data
,feature
,collect_set(client_id) as cliente
from da_pandora.ds_transaction dtr
--
group by data, feature
)
)
)as dtr
left join costumer as cus
on cus.client_id = dtr.client and date(client_created_at) < data - interval 90 day
group by data, feature
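
As a cross-check of the intended result, here is a minimal Python sketch of the rule for a single reference day: first collect everyone active in the 90 days up to that day, then keep only the users created before that window. The events, the creation dates and the function name old_active_users are all hypothetical; the filter mirrors the verification query from the question.

```python
from datetime import date, timedelta

# Hypothetical activity rows (date, user_id) and user creation dates.
events = [
    (date(2020, 1, 15), "u1"),  # outside the 90-day window of 2020-04-30
    (date(2020, 4, 1), "u1"),
    (date(2020, 4, 20), "u2"),
    (date(2020, 4, 30), "u3"),
]
created = {
    "u1": date(2019, 1, 1),
    "u2": date(2020, 3, 1),   # created inside the window, must be dropped
    "u3": date(2019, 6, 1),
}


def old_active_users(events, created, as_of, days=90):
    """Users active in the window ending at as_of and created before it."""
    cutoff = as_of - timedelta(days=days)
    # Step 1: everyone active in [cutoff, as_of], regardless of age.
    active = {user for day, user in events if cutoff <= day <= as_of}
    # Step 2: filter by account age relative to the window's reference day,
    # not relative to each activity row's own date.
    return {user for user in active if created[user] < cutoff}


result = old_active_users(events, created, date(2020, 4, 30))
```

The key point is that the age filter is applied after the window has been collected, against the window's reference day.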

MYSQL max() and group by error:only_full_group_by

I have a question about a MySQL query that has been logging errors since I upgraded to MySQL 5.7.
The error is "only_full_group_by", which is widely discussed on Stack Overflow.
Many answers say not to disable this option but to improve the SQL query instead.
The query I'm using returns the minimum and maximum values of a counter per hour.
SELECT MAX( counter ) AS max,
MIN( counter ) AS min,
DATE_FORMAT(date_time, '%H:%i') AS dt
FROM table1
WHERE date_time >= NOW() - INTERVAL 1 DAY
GROUP BY YEAR(date_time), MONTH(date_time), DAY(date_time), HOUR(date_time)
As I understand from the error message, one of the items from the SELECT clause is missing from the GROUP BY clause. But however I reorder, remove, or add items, I don't get the result I got before the upgrade to MySQL 5.7.
I tried wrapping the main query in a subquery to improve it, but somehow I can't recreate the results.
What is it I'm missing?
MySQL isn't able to determine the functional dependence ... between the expressions in the GROUP BY clause, and the expressions in the SELECT list.
The non-aggregate expression in the SELECT list (DATE_FORMAT(date_time, '%H:%i')) includes a minutes component. The GROUP BY clause is going to collapse the rows into groups by just hour. So the value of the minutes is indeterminate... we know it's going to come from some row in the group, but there's no guarantee which one.
(The question reference to ONLY_FULL_GROUP_BY seems to indicate that we've got some understanding of indeterminate values...)
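
To illustrate why the minutes are indeterminate, here is a small Python sketch with made-up readings: grouping by hour leaves several candidate '%H:%i' labels in one group, and wrapping the label in MIN makes the choice deterministic. The data and variable names are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical readings: (date_time, counter).
readings = [
    (datetime(2024, 1, 1, 10, 5), 3),
    (datetime(2024, 1, 1, 10, 40), 7),
    (datetime(2024, 1, 1, 11, 15), 9),
]

# GROUP BY YEAR, MONTH, DAY, HOUR.
groups = defaultdict(list)
for ts, counter in readings:
    key = (ts.year, ts.month, ts.day, ts.hour)
    # Python's %M is minutes, like MySQL's %i.
    groups[key].append((ts.strftime("%H:%M"), counter))

# Every label a non-aggregated dt column could legally come from.
candidates = {k: sorted({label for label, _ in v}) for k, v in groups.items()}

# MAX(counter), MIN(counter), MIN(dt): one well-defined row per group.
summary = {
    k: (max(c for _, c in v), min(c for _, c in v), min(l for l, _ in v))
    for k, v in groups.items()
}
```

The 10 o'clock group has two candidate labels, which is exactly the ambiguity ONLY_FULL_GROUP_BY rejects.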
The fix that needs the fewest changes would be to wrap that expression in a MIN or MAX function.
SELECT MAX(t.counter) AS `max`
, MIN(t.counter) AS `min`
, MIN(DATE_FORMAT(t.date_time,'%H:%i')) AS `dt`
FROM table1 t
WHERE t.date_time >= NOW() - INTERVAL 1 DAY
GROUP
BY YEAR(t.date_time)
, MONTH(t.date_time)
, DAY(t.date_time)
, HOUR(t.date_time)
ORDER
BY YEAR(t.date_time)
, MONTH(t.date_time)
, DAY(t.date_time)
, HOUR(t.date_time)
If we want rows returned in a particular order, we should include an ORDER BY clause, and not rely on MySQL-specific extension or behavior of GROUP BY (which may disappear in future releases.)
It's a bit odd to be doing a GROUP BY year, month, day and not including those values in the SELECT list. (It's not invalid to do that, just kind of strange.) The conditions in the WHERE clause guarantee that date_time spans no more than 24 hours.
My preference would be to do the GROUP BY on the same expression as the non-aggregate in the SELECT list. If I ever needed more than 24 hours, I'd include the date component:
SELECT MAX(t.counter) AS `max`
, MIN(t.counter) AS `min`
, DATE_FORMAT(t.date_time,'%Y-%m-%d %H:00') + INTERVAL 0 DAY AS `dt`
FROM table1 t
WHERE t.date_time >= NOW() - INTERVAL 1 DAY
GROUP
BY DATE_FORMAT(t.date_time,'%Y-%m-%d %H:00') + INTERVAL 0 DAY
ORDER
BY DATE_FORMAT(t.date_time,'%Y-%m-%d %H:00') + INTERVAL 0 DAY
--or--
if we always know it's just one day's worth of date_time, and we only want to return the hour, then we can group by just the hour. The same expression as in the SELECT list.
SELECT MAX(t.counter) AS `max`
, MIN(t.counter) AS `min`
, DATE_FORMAT(t.date_time,'%H:00') AS `dt`
FROM table1 t
WHERE t.date_time >= NOW() - INTERVAL 1 DAY
GROUP
BY DATE_FORMAT(t.date_time,'%H:00')
, DATE_FORMAT(t.date_time,'%Y-%m-%d %H')
ORDER
BY DATE_FORMAT(t.date_time,'%Y-%m-%d %H')
SELECT MAX( counter ) AS max,
MIN( counter ) AS min,
YEAR(date_time) AS g_year,
MONTH(date_time)AS g_month,
DAY(date_time) AS g_day,
HOUR(date_time) AS g_hour
FROM table1
WHERE date_time >= NOW() - INTERVAL 1 DAY
GROUP BY g_year, g_month, g_day, g_hour
Or you can get rid of redundant data if you always do it for 1 day:
SELECT MAX( counter ) AS max,
MIN( counter ) AS min,
DAY(date_time) AS g_day,
HOUR(date_time) AS g_hour
FROM table1
WHERE date_time >= NOW() - INTERVAL 1 DAY
GROUP BY g_day, g_hour

Find number of "active" rows each month for multiple months in one query

I have a MySQL database in which each row contains an activate date and a deactivate date. Together they give the period of time during which the object the row represents was active.
activate deactivate id
2015-03-01 2015-05-10 1
2013-02-04 2014-08-23 2
I want to find the number of rows that were active at any time during each month. Ex.
Jan: 4
Feb: 2
Mar: 1
etc...
I figured out how to do this for a single month, but I'm struggling with how to do it for all 12 months in a year in a single query. The reason I would like it in a single query is for performance, as information is used immediately and caching wouldn't make sense in this scenario. Here's the code I have for a month at a time. It checks if the activate date comes before the end of the month in question and that the deactivate date was not before the beginning of the period in question.
SELECT * from tblName WHERE activate <= DATE_SUB(NOW(), INTERVAL 1 MONTH)
AND deactivate >= DATE_SUB(NOW(), INTERVAL 2 MONTH)
If anybody has any idea how to change this and do grouping such that I can do this for an indefinite number of months I'd appreciate it. I'm at a loss as to how to group.
If you have a table of months that you care about, you can do:
select m.*,
(select count(*)
from table t
where t.activate_date <= m.month_end and
t.deactivate_date >= m.month_start
) as Actives
from months m;
If you don't have such a table handy, you can create one on the fly:
select m.*,
(select count(*)
from table t
where t.activate_date <= m.month_end and
t.deactivate_date >= m.month_start
) as Actives
from (select date('2015-01-01') as month_start, date('2015-01-31') as month_end union all
select date('2015-02-01') as month_start, date('2015-02-28') as month_end union all
select date('2015-03-01') as month_start, date('2015-03-31') as month_end union all
select date('2015-04-01') as month_start, date('2015-04-30') as month_end
) m;
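
The overlap test used in both queries (activated on or before the month's end, deactivated on or after the month's start) can be sketched in Python as follows. The two rows are the ones from the question; the function name active_per_month is an assumption for illustration.

```python
import calendar
from datetime import date

# Rows from the question: (activate, deactivate).
rows = [
    (date(2015, 3, 1), date(2015, 5, 10)),
    (date(2013, 2, 4), date(2014, 8, 23)),
]


def active_per_month(rows, year):
    """Count rows active at any time during each month of the given year."""
    counts = {}
    for month in range(1, 13):
        month_start = date(year, month, 1)
        # monthrange returns (weekday of first day, number of days in month).
        month_end = date(year, month, calendar.monthrange(year, month)[1])
        counts[month] = sum(
            1 for activate, deactivate in rows
            if activate <= month_end and deactivate >= month_start
        )
    return counts


counts_2015 = active_per_month(rows, 2015)
counts_2014 = active_per_month(rows, 2014)
```

Generating the month boundaries in a loop plays the same role as the months table or the UNION ALL list in the queries above.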
EDIT:
A potentially faster way is to calculate a cumulative sum of activations and deactivations and then take the maximum per month:
select year(d), month(d), max(cumes)
from (select d, (@s := @s + inc) as cumes
      from (select activate_date as d, 1 as inc from table t union all
            select deactivate_date, -1 as inc from table t
           ) t cross join
           (select @s := 0) param
      order by d
     ) s
group by year(d), month(d);
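
Here is a rough Python sketch of that running-total idea, using the two rows from the question. The variable names are made up. Note one limitation it makes visible: only months that contain at least one activation or deactivation event show up in the result, so months in the middle of an active period would still need to be filled in.

```python
from datetime import date
from itertools import accumulate

# Rows from the question: (activate, deactivate).
rows = [
    (date(2015, 3, 1), date(2015, 5, 10)),
    (date(2013, 2, 4), date(2014, 8, 23)),
]

# +1 on each activation, -1 on each deactivation, in date order.
events = sorted(
    [(activate, 1) for activate, _ in rows]
    + [(deactivate, -1) for _, deactivate in rows]
)

# Running total, like (@s := @s + inc) over the ordered events.
running = list(accumulate(inc for _, inc in events))

# MAX(cumes) per (year, month).
peak = {}
for (day, _), total in zip(events, running):
    key = (day.year, day.month)
    peak[key] = max(peak.get(key, total), total)
```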

Calculating a Moving Average MySQL?

Good Day,
I am using the following code to calculate the 9 Day Moving average.
SELECT SUM(close)
FROM tbl
WHERE date <= '2002-07-05'
AND name_id = 2
ORDER BY date DESC
LIMIT 9
But it does not work because it first calculates all of the returned fields before the limit is called. In other words it will calculate all the closes before or equal to that date, and not just the last 9.
So I need to calculate the SUM from the returned select, rather than calculate it straight.
IE. Select the SUM from the SELECT...
Now how would I go about doing this and is it very costly or is there a better way?
If you want the moving average for each date, then try this:
SELECT date, SUM(close),
       (select avg(close) from tbl t2 where t2.name_id = t.name_id and datediff(t.date, t2.date) between 0 and 8
       ) as mvgAvg
FROM tbl t
WHERE date <= '2002-07-05' and
      name_id = 2
GROUP BY date
ORDER BY date DESC
It uses a correlated subquery to average the rows from the 9 days ending at each date.
Starting from MySQL 8, you should use window functions for this. Using the window RANGE clause, you can create a logical window over an interval, which is very powerful. Something like this:
SELECT
date,
close,
AVG (close) OVER (ORDER BY date DESC RANGE INTERVAL 9 DAY PRECEDING)
FROM tbl
WHERE date <= DATE '2002-07-05'
AND name_id = 2
ORDER BY date DESC
For example:
WITH t (date, `close`) AS (
SELECT DATE '2020-01-01', 50 UNION ALL
SELECT DATE '2020-01-03', 54 UNION ALL
SELECT DATE '2020-01-05', 51 UNION ALL
SELECT DATE '2020-01-12', 49 UNION ALL
SELECT DATE '2020-01-13', 59 UNION ALL
SELECT DATE '2020-01-15', 30 UNION ALL
SELECT DATE '2020-01-17', 35 UNION ALL
SELECT DATE '2020-01-18', 39 UNION ALL
SELECT DATE '2020-01-19', 47 UNION ALL
SELECT DATE '2020-01-26', 50
)
SELECT
date,
`close`,
COUNT(*) OVER w AS c,
SUM(`close`) OVER w AS s,
AVG(`close`) OVER w AS a
FROM t
WINDOW w AS (ORDER BY date DESC RANGE INTERVAL 9 DAY PRECEDING)
ORDER BY date DESC
Leading to:
date |close|c|s |a |
----------|-----|-|---|-------|
2020-01-26| 50|1| 50|50.0000|
2020-01-19| 47|2| 97|48.5000|
2020-01-18| 39|3|136|45.3333|
2020-01-17| 35|4|171|42.7500|
2020-01-15| 30|4|151|37.7500|
2020-01-13| 59|5|210|42.0000|
2020-01-12| 49|6|259|43.1667|
2020-01-05| 51|3|159|53.0000|
2020-01-03| 54|3|154|51.3333|
2020-01-01| 50|3|155|51.6667|
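
To check how that frame is built, here is a short Python sketch that recomputes the c, s and a columns from the same sample data. With ORDER BY date DESC, "9 DAY PRECEDING" covers rows dated from the current row's date up to 9 days later, which is what the loop below assumes. The function name range_window is hypothetical.

```python
from datetime import date, timedelta

# Same sample data as the CTE above: (date, close).
data = [
    (date(2020, 1, 1), 50), (date(2020, 1, 3), 54), (date(2020, 1, 5), 51),
    (date(2020, 1, 12), 49), (date(2020, 1, 13), 59), (date(2020, 1, 15), 30),
    (date(2020, 1, 17), 35), (date(2020, 1, 18), 39), (date(2020, 1, 19), 47),
    (date(2020, 1, 26), 50),
]


def range_window(data, days=9):
    """Return (date, count, sum, avg) per row, newest first."""
    out = []
    for current, _ in sorted(data, reverse=True):
        # Frame: rows dated within [current, current + days].
        frame = [close for day, close in data
                 if current <= day <= current + timedelta(days=days)]
        out.append((current, len(frame), sum(frame),
                    round(sum(frame) / len(frame), 4)))
    return out


out = range_window(data)
```

The tuples in out line up row by row with the result table above.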
Use something like
SELECT
sum(close) as sum,
avg(close) as average
FROM (
SELECT
(close)
FROM
tbl
WHERE
date <= '2002-07-05'
AND name_id = 2
ORDER BY
date DESC
LIMIT 9 ) temp
The inner query returns only the 9 most recent filtered rows (descending order plus LIMIT 9), and the outer query then computes the sum and average over just those rows.
The query you gave doesn't work because the sum is calculated first and the LIMIT clause is applied only after the sum has already been calculated, so you get the sum of all matching rows.
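
The difference in evaluation order can be shown with a tiny Python sketch. The closing prices are made up and listed newest first; the variable names are assumptions for illustration.

```python
# Hypothetical closing prices, newest first (as ORDER BY date DESC would give).
closes_desc = [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

# Original query: SUM collapses everything into a single row first,
# then LIMIT 9 is applied to that one row and changes nothing.
aggregated_rows = [sum(closes_desc)]
sum_then_limit = aggregated_rows[:9][0]

# Subquery version: take the 9 newest rows first, then aggregate them.
last_nine = closes_desc[:9]
limit_then_sum = sum(last_nine)
average = limit_then_sum / len(last_nine)
```

Here sum_then_limit covers all twelve rows, while limit_then_sum covers only the nine newest.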
Another technique is to create a helper table of integers:
CREATE TABLE `tinyint_asc` (
`value` tinyint(3) unsigned NOT NULL default '0',
PRIMARY KEY (value)
) ;
INSERT INTO `tinyint_asc` VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95),(96),(97),(98),(99),(100),(101),(102),(103),(104),(105),(106),(107),(108),(109),(110),(111),(112),(113),(114),(115),(116),(117),(118),(119),(120),(121),(122),(123),(124),(125),(126),(127),(128),(129),(130),(131),(132),(133),(134),(135),(136),(137),(138),(139),(140),(141),(142),(143),(144),(145),(146),(147),(148),(149),(150),(151),(152),(153),(154),(155),(156),(157),(158),(159),(160),(161),(162),(163),(164),(165),(166),(167),(168),(169),(170),(171),(172),(173),(174),(175),(176),(177),(178),(179),(180),(181),(182),(183),(184),(185),(186),(187),(188),(189),(190),(191),(192),(193),(194),(195),(196),(197),(198),(199),(200),(201),(202),(203),(204),(205),(206),(207),(208),(209),(210),(211),(212),(213),(214),(215),(216),(217),(218),(219),(220),(221),(222),(223),(224),(225),(226),(227),(228),(229),(230),(231),(232),(233),(234),(235),(236),(237),(238),(239),(240),(241),(242),(243),(244),(245),(246),(247),(248),(249),(250),(251),(252),(253),(254),(255);
After that, you can use it like this:
select
    date_add(tbl.date, interval tinyint_asc.value day) as mydate,
    count(*),
    sum(myvalue)
from tbl
inner join tinyint_asc on tinyint_asc.value <= 30 -- for a 30 day moving average
where date_add(tbl.date, interval tinyint_asc.value day) between '2016-01-01' and current_date()
group by mydate
This query is fast:
select date, name_id,
case @i when name_id then @i:=name_id else (@i:=name_id)
and (@n:=0)
and (@a0:=0) and (@a1:=0) and (@a2:=0) and (@a3:=0) and (@a4:=0) and (@a5:=0) and (@a6:=0) and (@a7:=0) and (@a8:=0)
end as a,
case @n when 9 then @n:=9 else @n:=@n+1 end as n,
@a0:=@a1,@a1:=@a2,@a2:=@a3,@a3:=@a4,@a4:=@a5,@a5:=@a6,@a6:=@a7,@a7:=@a8,@a8:=close,
(@a0+@a1+@a2+@a3+@a4+@a5+@a6+@a7+@a8)/@n as av
from tbl,
(select @i:=0, @n:=0,
@a0:=0, @a1:=0, @a2:=0, @a3:=0, @a4:=0, @a5:=0, @a6:=0, @a7:=0, @a8:=0) a
where name_id=2
order by name_id, date
If you need an average over 50 or 100 values, it's tedious to write, but
worth the effort. The speed is close to the ordered select.