Sum values while the column value is the same - mysql

I am trying to build a timeline. My table has a type column plus date_start and date_end columns, like so:
+------+---------------------+---------------------+----------+
| Type | Start | End | Diff |
+------+---------------------+---------------------+----------+
| 1 | 2020-11-23 23:40:00 | 2020-11-23 23:41:00 | 00:01:00 |
| 1 | 2020-11-23 23:42:00 | 2020-11-23 23:43:00 | 00:01:00 |
| 1 | 2020-11-23 23:44:00 | 2020-11-23 23:45:00 | 00:01:00 |
| 2 | 2020-11-23 23:46:00 | 2020-11-23 23:47:00 | 00:01:00 |
| 2 | 2020-11-23 23:48:00 | 2020-11-23 23:49:00 | 00:01:00 |
| 1 | 2020-11-23 23:50:00 | 2020-11-23 23:51:00 | 00:01:00 |
| 1 | 2020-11-23 23:52:00 | 2020-11-23 23:53:00 | 00:01:00 |
+------+---------------------+---------------------+----------+
I need to sum the differences for as long as the type value stays the same as in the previous row. Whenever the type value changes, a new output row should start, giving a result like this:
+------+----------+
| Type | Diff |
+------+----------+
| 1 | 00:03:00 |
| 2 | 00:02:00 |
| 1 | 00:02:00 |
+------+----------+
How can I achieve such grouping and sum result in MySQL?
PS: Don't worry about the time arithmetic; an example that uses integers is perfectly fine.

Use a variable to assign a block number, and then aggregate:
drop table if exists t;
create table t
( Type int, Startdt datetime, Enddt datetime, Diff time);
insert into t values
( 1 ,'2020-11-23 23:40:00' ,'2020-11-23 23:41:00' , '00:01:00' ),
( 1 ,'2020-11-23 23:42:00' ,'2020-11-23 23:43:00' , '00:01:00' ),
( 1 ,'2020-11-23 23:44:00' ,'2020-11-23 23:45:00' , '00:01:00' ),
( 2 ,'2020-11-23 23:46:00' ,'2020-11-23 23:47:00' , '00:01:00' ),
( 2 ,'2020-11-23 23:48:00' ,'2020-11-23 23:49:00' , '00:01:00' ),
( 1 ,'2020-11-23 23:50:00' ,'2020-11-23 23:51:00' , '00:01:00' ),
( 1 ,'2020-11-23 23:52:00' ,'2020-11-23 23:53:00' , '00:01:00' );
select type,block,sec_to_time(sum(time_to_sec(diff)))
from
(
select t.*,
if(type <> @p, @b:=@b+1,@b:=@b) block,
@p:=type p
from t
cross join (select @b:=0,@p:=0) b
order by startdt,type
) s
group by s.block,s.type;
+------+-------+-------------------------------------+
| type | block | sec_to_time(sum(time_to_sec(diff))) |
+------+-------+-------------------------------------+
| 1 | 1 | 00:03:00 |
| 2 | 2 | 00:02:00 |
| 1 | 3 | 00:02:00 |
+------+-------+-------------------------------------+
3 rows in set (0.148 sec)
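The block-numbering idea above can also be written with window functions, which MySQL supports from version 8.0. The following is only a sketch: it runs against an in-memory SQLite database through Python's sqlite3 module as a stand-in for MySQL, and it stores the difference as integer seconds in a made-up diff_sec column (the question says integers are fine). LAG flags each row whose type differs from the previous row, and a running SUM of those flags gives the block number.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; window functions need SQLite 3.25+.
conn = sqlite3.connect(":memory:")
# diff_sec is a hypothetical column: the Diff value expressed in seconds.
conn.execute("CREATE TABLE t (type INTEGER, startdt TEXT, diff_sec INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "2020-11-23 23:40:00", 60),
        (1, "2020-11-23 23:42:00", 60),
        (1, "2020-11-23 23:44:00", 60),
        (2, "2020-11-23 23:46:00", 60),
        (2, "2020-11-23 23:48:00", 60),
        (1, "2020-11-23 23:50:00", 60),
        (1, "2020-11-23 23:52:00", 60),
    ],
)

query = """
WITH flagged AS (
    -- is_new is 1 when the type differs from the previous row
    -- (or when there is no previous row), otherwise 0
    SELECT type, startdt, diff_sec,
           CASE WHEN type = LAG(type) OVER (ORDER BY startdt)
                THEN 0 ELSE 1 END AS is_new
    FROM t
),
blocks AS (
    -- the running total of the flags numbers each consecutive run
    SELECT type, diff_sec,
           SUM(is_new) OVER (ORDER BY startdt) AS block
    FROM flagged
)
SELECT type, block, SUM(diff_sec) AS total_sec
FROM blocks
GROUP BY block, type
ORDER BY block
"""
rows = conn.execute(query).fetchall()
print(rows)  # one (type, block, total_sec) tuple per consecutive run
```

In MySQL the same query shape should work with sec_to_time(sum(time_to_sec(diff))) in place of the integer sum.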

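-- If one total per type is enough (this does not split consecutive runs),
-- a plain GROUP BY works. your_table is a placeholder for the real table name.
SELECT c.type, SEC_TO_TIME(SUM(TIME_TO_SEC(TIMEDIFF(c.`end`, c.`start`)))) AS dif
FROM your_table c
GROUP BY c.type;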

Related

How to select all records for only the first 50 distinct values in a column

I am trying to create a classifier model for a dataset, but I have too many distinct values for my target variable. If I run something like this:
Create or replace model `model_name`
options (model_type="AUTOML_CLASSIFIER", input_label_cols=["ORIGIN_AIRPORT"]) as
select DAY_OF_WEEK, ARRIVAL_TIME, ARRIVAL_DELAY, ORIGIN_AIRPORT
from `table_name`
limit 1000
I end up getting
Error running query
Classification model currently only supports classification with up to 50 unique labels and the label column had 111 unique labels.
So how can I select, for example, all rows that have one of the first 50 values of ORIGIN_AIRPORT?
SELECT * FROM TABLE_NAME AS T1 LEFT OUTER JOIN (SELECT DISTINCT
COLUMN_NAME FROM TABLE_NAME ORDER BY COLUMN_NAME LIMIT 50) AS T2 ON
T1.COLUMN_NAME = T2.COLUMN_NAME
The inner query fetches the first 50 distinct values, and the outer query matches every row against them with the T1.COLUMN_NAME = T2.COLUMN_NAME condition. Because this is a left outer join, every row is returned, and T2.COLUMN_NAME is NULL for rows whose value is not among those 50. Switching to an inner join returns only the rows that match.
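As a rough illustration of the join approach, here is a sketch that keeps only the rows whose label is among the first N distinct values. It uses an inner join rather than the left outer join, runs on SQLite through Python's sqlite3 module instead of BigQuery, and the flights table, its rows, and N = 2 are all made up for the example (the question would use 50).

```python
import sqlite3

# In-memory SQLite stands in for the real warehouse; the table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin_airport TEXT, arrival_delay INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("JFK", 5), ("LAX", 12), ("ATL", 0), ("JFK", 7), ("SEA", 3), ("ATL", 9)],
)

keep = 2  # number of distinct labels to keep; the question would use 50

query = """
SELECT f.origin_airport, f.arrival_delay
FROM flights AS f
JOIN (
    -- the first N distinct labels, in alphabetical order
    SELECT DISTINCT origin_airport
    FROM flights
    ORDER BY origin_airport
    LIMIT ?
) AS top ON top.origin_airport = f.origin_airport
ORDER BY f.origin_airport, f.arrival_delay
"""
rows = conn.execute(query, (keep,)).fetchall()
print(rows)  # only rows for the first `keep` airports remain
```

Which 50 labels count as "first" depends on the ORDER BY in the inner query; alphabetical order is just the simplest choice here.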
Given a table of values (origin_airport), with unique identifiers (id) and date, find the minimum date for each unique value (origin_airport) to decide which N origin_airport values are to be returned.
Return all rows which match the first 3 unique origin_airport values (densely ranked, by min(date) per origin_airport).
Updated: to use columns that more closely match the model, with origin_airport and a date column for ordering.
Full working test case
The test data:
CREATE TABLE airportlogs (
origin_airport int
, id int primary key auto_increment
, date date DEFAULT NULL
);
INSERT INTO airportlogs (origin_airport) VALUES
( 1 )
, ( 1 )
, ( 8 )
, ( 8 )
, ( 8 )
, ( 7 )
, ( 7 )
, ( 6 )
, ( 5 )
, ( 4 )
, ( 3 )
, ( 3 )
, ( 7 )
, ( 7 )
, ( 1 )
, ( 8 )
, ( 3 )
, ( 1 )
;
-- Create some dates to use for ordering.
-- Ordering can be as complicated as we need.
UPDATE airportlogs SET date = current_date + INTERVAL +id DAY;
-- Intermediate calculation to show the MIN(date) per origin_airport
WITH nvals (origin_airport, mdate) AS (
SELECT origin_airport, MIN(date) AS mdate FROM airportlogs GROUP BY origin_airport
)
SELECT *
FROM nvals
ORDER BY mdate
;
+----------------+------------+
| origin_airport | mdate |
+----------------+------------+
| 1 | 2021-08-05 |
| 8 | 2021-08-07 |
| 7 | 2021-08-10 |
| 6 | 2021-08-12 |
| 5 | 2021-08-13 |
| 4 | 2021-08-14 |
| 3 | 2021-08-15 |
+----------------+------------+
-- Calculation of ordered rank for the unique origin_airport values
-- by MIN(date) per origin_airport.
WITH nvals0 (origin_airport, date, mdate) AS (
SELECT origin_airport
, date
, MIN(date) OVER (PARTITION BY origin_airport) AS mdate
FROM airportlogs
)
, nvals (origin_airport, date, mdate, r) AS (
SELECT origin_airport
, date
, mdate
, DENSE_RANK() OVER (ORDER BY mdate) AS r
FROM nvals0
)
SELECT *
FROM nvals
ORDER BY r, date
;
Result:
+----------------+------------+------------+---+
| origin_airport | date | mdate | r |
+----------------+------------+------------+---+
| 1 | 2021-08-05 | 2021-08-05 | 1 |
| 1 | 2021-08-06 | 2021-08-05 | 1 |
| 1 | 2021-08-19 | 2021-08-05 | 1 |
| 1 | 2021-08-22 | 2021-08-05 | 1 |
| 8 | 2021-08-07 | 2021-08-07 | 2 |
| 8 | 2021-08-08 | 2021-08-07 | 2 |
| 8 | 2021-08-09 | 2021-08-07 | 2 |
| 8 | 2021-08-20 | 2021-08-07 | 2 |
| 7 | 2021-08-10 | 2021-08-10 | 3 |
| 7 | 2021-08-11 | 2021-08-10 | 3 |
| 7 | 2021-08-17 | 2021-08-10 | 3 |
| 7 | 2021-08-18 | 2021-08-10 | 3 |
| 6 | 2021-08-12 | 2021-08-12 | 4 |
| 5 | 2021-08-13 | 2021-08-13 | 5 |
| 4 | 2021-08-14 | 2021-08-14 | 6 |
| 3 | 2021-08-15 | 2021-08-15 | 7 |
| 3 | 2021-08-16 | 2021-08-15 | 7 |
| 3 | 2021-08-21 | 2021-08-15 | 7 |
+----------------+------------+------------+---+
The final solution:
WITH min_date (origin_airport, date, mdate) AS (
SELECT origin_airport
, date
, MIN(date) OVER (PARTITION BY origin_airport) AS mdate
FROM airportlogs
)
, ranks (origin_airport, date, mdate, r) AS (
SELECT origin_airport
, date
, mdate
, DENSE_RANK() OVER (ORDER BY mdate) AS r
FROM min_date
)
SELECT *
FROM ranks
WHERE r <= 3
ORDER BY r, date
;
The final result:
+----------------+------------+------------+---+
| origin_airport | date | mdate | r |
+----------------+------------+------------+---+
| 1 | 2021-08-05 | 2021-08-05 | 1 |
| 1 | 2021-08-06 | 2021-08-05 | 1 |
| 1 | 2021-08-19 | 2021-08-05 | 1 |
| 1 | 2021-08-22 | 2021-08-05 | 1 |
| 8 | 2021-08-07 | 2021-08-07 | 2 |
| 8 | 2021-08-08 | 2021-08-07 | 2 |
| 8 | 2021-08-09 | 2021-08-07 | 2 |
| 8 | 2021-08-20 | 2021-08-07 | 2 |
| 7 | 2021-08-10 | 2021-08-10 | 3 |
| 7 | 2021-08-11 | 2021-08-10 | 3 |
| 7 | 2021-08-17 | 2021-08-10 | 3 |
| 7 | 2021-08-18 | 2021-08-10 | 3 |
+----------------+------------+------------+---+
There are a number of other solutions.
The poster didn't mention the logic for this ordering. But with the above window function behavior, that's trivial to specify.

Mysql SELECT query group by with sum and avg

When I add "GROUP BY date", the SUM and AVG functions do not give the result I expect.
Here is a table
| date | calories |
|-------------------------|
| 2021-03-28 | 42.50 |
| 2021-03-30 | 500.00 |
| 2021-03-31 | 35.00 |
| 2021-04-01 | 200.00 |
| 2021-04-01 | 35.00 |
Here is my query:
SELECT CONCAT(round(IF(avg(up.calories), avg(up.calories), 0), 2), "kcal") as avg, CONCAT(round(IF(SUM(up.calories), SUM(up.calories), 0), 2), "kcal") as total_burned
FROM `tbl` as `up`
WHERE `date` BETWEEN "2021-03-28" AND "2021-04-03"
AND `calories` != '0'
GROUP BY `date`
Below is my query result
| avg | total_burned |
|-----------------------------|
| 42.50 | 42.50 |
| 500.00 | 500.00 |
| 35.00 | 35.00 |
| 235.00 | 235.00 |
But what I actually want is this kind of result:
| avg | total_burned |
|-----------------------------|
| 203.13 | 812.50 |
Roll your own average: divide the total by the number of distinct dates.
DROP TABLE IF EXISTS t;
create table t( date date, calories decimal(10,2));
insert into t values
( '2021-03-28' , 42.50 ),
( '2021-03-30' , 500.00 ),
( '2021-03-31' , 35.00 ),
( '2021-04-01' , 200.00 ),
( '2021-04-01' , 35.00 );
select sum(calories) sumcal,sum(calories) / count(distinct date) calcavg, avg(calories)
from t;
+--------+------------+---------------+
| sumcal | calcavg | avg(calories) |
+--------+------------+---------------+
| 812.50 | 203.125000 | 162.500000 |
+--------+------------+---------------+
1 row in set (0.002 sec)
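An equivalent way to get the per-day average, sketched below, is to total the calories per date in a derived table and then aggregate those daily totals once. This is only an illustration: it runs on SQLite through Python's sqlite3 module rather than MySQL, and the per_day alias and day_total column are names chosen for the example.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; dates are stored as ISO text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date TEXT, calories REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("2021-03-28", 42.50),
        ("2021-03-30", 500.00),
        ("2021-03-31", 35.00),
        ("2021-04-01", 200.00),
        ("2021-04-01", 35.00),
    ],
)

query = """
SELECT SUM(day_total) AS total_burned,
       AVG(day_total) AS avg_per_day
FROM (
    -- one row per date, as the original GROUP BY produced
    SELECT date, SUM(calories) AS day_total
    FROM t
    WHERE date BETWEEN '2021-03-28' AND '2021-04-03'
      AND calories <> 0
    GROUP BY date
) AS per_day
"""
total_burned, avg_per_day = conn.execute(query).fetchone()
print(total_burned, avg_per_day)
```

The outer query has no GROUP BY, so it collapses the daily rows into a single row.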

MySQL query to find *best* row per group where *best* is a complex metrics

My table foobar has the following columns:
val: tinyint NOT NULL
date: timestamp NOT NULL
type: enum('A', 'B', 'C') NOT NULL
extra: tinyint NOT NULL
For each type I would like to find the row that matches an arbitrary condition on the columns (e.g. extra > 12 AND val > 0), that minimizes val and, in case of equal val, minimizes date. I assume that for each type such a row exists and is unique. Finally, I'd like the result (as many rows as there are different type values) to be ordered by val, date.
If foobar contains the following rows:
+------+---------------------+------+-------+
| val | date | type | extra |
+------+---------------------+------+-------+
| -1 | 2014-04-10 00:00:00 | A | 40 |
| 1 | 2014-04-15 00:00:00 | A | 15 |
| 2 | 2014-04-12 00:00:00 | A | 77 |
| 1 | 2014-04-11 00:00:00 | A | 2 |
| 1 | 2014-04-14 00:00:00 | A | 22 |
| 1 | 2014-04-10 00:00:00 | B | 40 |
| 1 | 2014-04-15 00:00:00 | B | 15 |
| 1 | 2014-04-12 00:00:00 | B | 77 |
| 1 | 2014-04-11 00:00:00 | B | 2 |
| 1 | 2014-04-14 00:00:00 | B | 22 |
| 4 | 2014-04-10 00:00:00 | C | 40 |
| 3 | 2014-04-15 00:00:00 | C | 15 |
| 3 | 2014-04-12 00:00:00 | C | 77 |
| 1 | 2014-04-11 00:00:00 | C | 2 |
| 3 | 2014-04-14 00:00:00 | C | 22 |
+------+---------------------+------+-------+
the query shall return:
+------+---------------------+------+-------+
| val | date | type | extra |
+------+---------------------+------+-------+
| 1 | 2014-04-10 00:00:00 | B | 40 |
| 1 | 2014-04-14 00:00:00 | A | 22 |
| 3 | 2014-04-12 00:00:00 | C | 77 |
+------+---------------------+------+-------+
This seems to work:
SELECT a.* FROM (
SELECT MIN(val * 4294967296 + UNIX_TIMESTAMP(date)) AS score
FROM foobar WHERE extra > 12 AND val > 0
GROUP BY type
) AS b
INNER JOIN foobar AS a
ON a.val * 4294967296 + UNIX_TIMESTAMP(a.date) = b.score
ORDER BY val, date;
but I find it over-complicated and I suspect that there must be a better way. Moreover, transforming my multi-columns criteria in a single numeric value (val * 4294967296 + UNIX_TIMESTAMP(date)) works in this simple case but may be more difficult in more complex scenarios.
Are there other, more generic schemes, that would do the same?
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(val INT SIGNED NOT NULL
,date TIMESTAMP NOT NULL
,type CHAR(1) NOT NULL
,extra TINYINT NOT NULL
,PRIMARY KEY(val,date,type)
);
INSERT INTO my_table VALUES
(-1,'2014-04-10 00:00:00','A',40),
( 1,'2014-04-15 00:00:00','A',15),
( 2,'2014-04-12 00:00:00','A',77),
( 1,'2014-04-11 00:00:00','A', 2),
( 1,'2014-04-14 00:00:00','A',22),
( 1,'2014-04-10 00:00:00','B',40),
( 1,'2014-04-15 00:00:00','B',15),
( 1,'2014-04-12 00:00:00','B',77),
( 1,'2014-04-11 00:00:00','B', 2),
( 1,'2014-04-14 00:00:00','B',22),
( 4,'2014-04-10 00:00:00','C',40),
( 3,'2014-04-15 00:00:00','C',15),
( 3,'2014-04-12 00:00:00','C',77),
( 1,'2014-04-11 00:00:00','C', 2),
( 3,'2014-04-14 00:00:00','C',22);
SELECT a.*
FROM my_table a
JOIN
( SELECT x.val
, x.type
, MIN(x.date) date
FROM my_table x
JOIN
( SELECT MIN(val) val
, type
FROM my_table
WHERE extra > 12
AND val > 0
GROUP
BY type
) y
ON y.type = x.type
AND y.val = x.val
WHERE x.extra > 12
GROUP
BY val
, type
) b
ON b.val = a.val
AND b.type = a.type
AND b.date = a.date;
+-----+---------------------+------+-------+
| val | date | type | extra |
+-----+---------------------+------+-------+
| 1 | 2014-04-14 00:00:00 | A | 22 |
| 1 | 2014-04-10 00:00:00 | B | 40 |
| 3 | 2014-04-12 00:00:00 | C | 77 |
+-----+---------------------+------+-------+
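On servers with window functions (MySQL 8.0 and later), a more generic scheme is to rank the rows within each type by the same columns you would sort on, and keep rank 1. The sketch below assumes that approach is acceptable; it runs on SQLite through Python's sqlite3 module as a stand-in for MySQL and uses the foobar data from the question. The rn column and ranked alias are names chosen for the example.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; window functions need SQLite 3.25+.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foobar (val INTEGER, date TEXT, type TEXT, extra INTEGER)")
conn.executemany(
    "INSERT INTO foobar VALUES (?, ?, ?, ?)",
    [
        (-1, "2014-04-10 00:00:00", "A", 40), (1, "2014-04-15 00:00:00", "A", 15),
        (2, "2014-04-12 00:00:00", "A", 77), (1, "2014-04-11 00:00:00", "A", 2),
        (1, "2014-04-14 00:00:00", "A", 22), (1, "2014-04-10 00:00:00", "B", 40),
        (1, "2014-04-15 00:00:00", "B", 15), (1, "2014-04-12 00:00:00", "B", 77),
        (1, "2014-04-11 00:00:00", "B", 2), (1, "2014-04-14 00:00:00", "B", 22),
        (4, "2014-04-10 00:00:00", "C", 40), (3, "2014-04-15 00:00:00", "C", 15),
        (3, "2014-04-12 00:00:00", "C", 77), (1, "2014-04-11 00:00:00", "C", 2),
        (3, "2014-04-14 00:00:00", "C", 22),
    ],
)

query = """
SELECT val, date, type, extra
FROM (
    -- rank the qualifying rows inside each type: lowest val first, then earliest date
    SELECT f.*,
           ROW_NUMBER() OVER (PARTITION BY type ORDER BY val, date) AS rn
    FROM foobar AS f
    WHERE extra > 12 AND val > 0
) AS ranked
WHERE rn = 1
ORDER BY val, date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

A more complex "best" metric then only changes the ORDER BY inside the OVER clause, with no need to pack several columns into one number.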

Include NULL in SQL Join when using WHERE

I have the following two tables:
Table TempUser22 : 57,000 rows:
+------+-----------+
| Id | Followers |
+------+-----------+
| 874 | 55542 |
| 1081 | 330624 |
| 1378 | 17919 |
| 1621 | 920 |
| 1688 | 255463 |
| 2953 | 751 |
| 3382 | 204466 |
| 3840 | 273489 |
| 4145 | 376 |
| ... | ... |
+------+-----------+
Table temporal_users : 10,000,000 rows total, 3200 rows Where Date=2010-12-31:
+---------------------+---------+--------------------+
| Date | User_Id | has_original_tweet |
+---------------------+---------+--------------------+
| 2008-02-22 12:00:00 | 676493 | 2 |
| 2008-02-22 12:00:00 | 815263 | 1 |
| 2008-02-22 12:00:00 | 6245822 | 1 |
| 2008-02-22 12:00:00 | 8854092 | 1 |
| 2008-02-23 12:00:00 | 676493 | 2 |
| 2008-02-23 12:00:00 | 815263 | 1 |
| 2008-02-23 12:00:00 | 6245822 | 1 |
| 2008-02-23 12:00:00 | 8854092 | 1 |
| 2008-02-24 12:00:00 | 676493 | 2 |
| ............. | ... | .. |
+---------------------+---------+--------------------+
I am running the following join query on these tables:
SELECT sum(has_original_tweet), b.Id
FROM temporal_users AS a
RIGHT JOIN TempUser22 AS b
ON a.User_ID = b.Id
GROUP BY b.Id;
Which returns 57,000 rows as expected, with NULL answers on the first field:
+-------------------------+------+
| sum(has_original_tweet) | Id |
+-------------------------+------+
| NULL | 874 |
| NULL | 1081 |
| 135 | 1378 |
| 164 | 1621 |
| 652 | 1688 |
| 691 | 2953 |
| NULL | 3382 |
| NULL | 3840 |
| NULL | 4145 |
| ... | .... |
+-------------------------+------+
However, when adding the WHERE line specifying a date as below:
SELECT sum(has_original_tweet), b.Id
FROM temporal_users AS a
RIGHT JOIN TempUser22 AS b
ON a.User_ID = b.Id
WHERE a.Date BETWEEN '2010-12-31-00:00:00' AND '2010-12-31-23:59:59'
GROUP BY b.Id;
I receive the following answer, of only 3200 rows, and without any NULL in the first field.
+-------------------------+---------+
| sum(has_original_tweet) | Id |
+-------------------------+---------+
| 1 | 797194 |
| 1 | 815263 |
| 0 | 820678 |
| 1 | 1427511 |
| 0 | 4653731 |
| 1 | 5933862 |
| 2 | 7530552 |
| 1 | 7674072 |
| 1 | 8149632 |
| .. | .... |
+-------------------------+---------+
My question is: for a given date, how can I get a result of 57,000 rows, one for each user in TempUser22, with NULL where the user has no has_original_tweet rows in temporal_users for that date?
Thanks.
SELECT b.Id, SUM(a.has_original_tweet) s
FROM TempUser22 b
LEFT JOIN temporal_users a ON b.Id = a.User_Id
AND a.Date BETWEEN '2010-12-31-00:00:00' AND '2010-12-31-23:59:59'
GROUP BY b.Id;
+----+------+
| Id | s |
+----+------+
| 1 | NULL |
| 2 | 1 |
| 3 | NULL |
| 4 | 3 |
| 5 | NULL |
| 6 | NULL |
+----+------+
For debugging, I used:
CREATE TEMPORARY TABLE TempUser22(Id INT, Followers INT)
SELECT 1 Id, 10 Followers UNION ALL
SELECT 2, 20 UNION ALL
SELECT 3, 30 UNION ALL
SELECT 4, 40 UNION ALL
SELECT 5, 50 UNION ALL
SELECT 6, 60
;
CREATE TEMPORARY TABLE temporal_users(`Date` DATETIME, User_Id INT, has_original_tweet INT)
SELECT '2008-02-22 12:00:00' `Date`, 1 User_Id, 1 has_original_tweet UNION ALL
SELECT '2008-12-31 12:00:00', 2, 1 UNION ALL
SELECT '2010-12-31 12:00:00', 2, 1 UNION ALL
SELECT '2012-12-31 12:00:00', 2, 1 UNION ALL
SELECT '2008-12-31 12:00:00', 4, 9 UNION ALL
SELECT '2010-12-31 12:00:00', 4, 1 UNION ALL
SELECT '2010-12-31 12:00:00', 4, 2 UNION ALL
SELECT '2012-12-31 12:00:00', 4, 9
;
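That's because a comparison against NULL never evaluates to true, so the WHERE clause discards the rows that the outer join filled with NULL.
You can use COALESCE in your WHERE clause:
WHERE coalesce(a.Date, 'some-date-in-the-range') BETWEEN '2010-12-31-00:00:00' AND '2010-12-31-23:59:59'
This makes the NULL dates pass the filter. Note that users who only have rows on other dates are still filtered out, because their a.Date is not NULL and falls outside the range; putting the date condition in the ON clause avoids that.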
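To see how the three placements of the date condition differ, the sketch below runs them side by side on the small debugging data set from this thread. It uses SQLite through Python's sqlite3 module as a stand-in for MySQL; the run helper and the {join_filter}/{where_filter} placeholders exist only for this example.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the data mirrors the debugging tables above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TempUser22 (Id INTEGER, Followers INTEGER)")
conn.executemany("INSERT INTO TempUser22 VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60)])
conn.execute("CREATE TABLE temporal_users (Date TEXT, User_Id INTEGER, has_original_tweet INTEGER)")
conn.executemany("INSERT INTO temporal_users VALUES (?, ?, ?)", [
    ("2008-02-22 12:00:00", 1, 1), ("2008-12-31 12:00:00", 2, 1),
    ("2010-12-31 12:00:00", 2, 1), ("2012-12-31 12:00:00", 2, 1),
    ("2008-12-31 12:00:00", 4, 9), ("2010-12-31 12:00:00", 4, 1),
    ("2010-12-31 12:00:00", 4, 2), ("2012-12-31 12:00:00", 4, 9),
])

RANGE = "BETWEEN '2010-12-31 00:00:00' AND '2010-12-31 23:59:59'"
BASE = """
SELECT b.Id, SUM(a.has_original_tweet)
FROM TempUser22 AS b
LEFT JOIN temporal_users AS a ON a.User_Id = b.Id {join_filter}
{where_filter}
GROUP BY b.Id
ORDER BY b.Id
"""

def run(join_filter="", where_filter=""):
    """Hypothetical helper: fill in the filter placement and fetch all rows."""
    sql = BASE.format(join_filter=join_filter, where_filter=where_filter)
    return conn.execute(sql).fetchall()

# Date condition in ON: every user is kept, unmatched users get NULL.
in_on = run(join_filter="AND a.Date " + RANGE)
# Date condition in WHERE: the NULL-extended rows are filtered out.
in_where = run(where_filter="WHERE a.Date " + RANGE)
# COALESCE in WHERE: users with no rows at all survive, user 1 (other dates only) does not.
with_coalesce = run(where_filter="WHERE COALESCE(a.Date, '2010-12-31 12:00:00') " + RANGE)

print(in_on)
print(in_where)
print(with_coalesce)
```

Only the ON placement returns one row per user in TempUser22, which is what the question asks for.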

TIMESTAMPDIFF Sum Case error

+-----------+-----------+--------+
| punchtime | punchdate | emp_id |
+-----------+-----------+--------+
| 9:51:00 | 4/1/2016 | 2 |
| 12:59:00 | 4/1/2016 | 2 |
| 10:28:00 | 4/1/2016 | 5 |
| 14:13:00 | 4/1/2016 | 5 |
| 9:56:00 | 4/1/2016 | 10 |
| 15:31:00 | 4/1/2016 | 10 |
| 10:08:00 | 5/1/2016 | 2 |
| 18:09:00 | 5/1/2016 | 2 |
| 10:15:00 | 5/1/2016 | 5 |
| 18:32:00 | 5/1/2016 | 5 |
| 10:11:00 | 6/1/2016 | 2 |
| 18:11:00 | 6/1/2016 | 2 |
| 10:25:00 | 6/1/2016 | 5 |
| 18:28:00 | 6/1/2016 | 5 |
| 10:19:00 | 6/1/2016 | 10 |
| 18:26:00 | 6/1/2016 | 10 |
+-----------+-----------+--------+
For each emp_id, I need to count the days on which the time between the first and last punch is less than 4 hours. I am trying the code below, but it's not working.
SELECT
a.emp_id,
sum( case when TIMESTAMPDIFF(hour, min(a.punchtime),
max(a.punchtime))< 4 then 1 else 0 end ) as 'Half Day'
FROM machinedata a
GROUP BY
a.emp_id
I am getting an error: #1111 - Invalid use of group function
Desired output:
+--------+----------+
| emp_id | Half Day |
+--------+----------+
| 2 | 1 |
| 8 | 0 |
| 10 | 0 |
+--------+----------+
Your data set and desired result do not accord, so I'm going to ignore it...
Instead consider the following...
Note both the way in which I have presented the problem, and the construction of the solution.
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(employee_id INT NOT NULL
,punchtime DATETIME NOT NULL
,PRIMARY KEY(employee_id,punchtime)
);
INSERT INTO my_table VALUES
( 2,'2016/01/04 09:51:00'),
( 2,'2016/01/04 12:59:00'),
( 5,'2016/01/04 10:28:00'),
( 5,'2016/01/04 14:13:00'),
(10,'2016/01/04 09:56:00'),
(10,'2016/01/04 15:31:00'),
( 2,'2016/01/05 10:08:00'),
( 2,'2016/01/05 18:09:00'),
( 5,'2016/01/05 10:15:00'),
( 5,'2016/01/05 18:32:00'),
( 2,'2016/01/06 10:11:00'),
( 2,'2016/01/06 18:11:00'),
( 5,'2016/01/06 10:25:00'),
( 5,'2016/01/06 18:28:00'),
(10,'2016/01/06 10:19:00'),
(10,'2016/01/06 18:26:00');
SELECT employee_id
, SUM(diff < 14400 ) half
FROM
( SELECT x.*
, DATE(x.punchtime) dt
, TIME_TO_SEC(MAX(y.punchtime)) - TIME_TO_SEC(MIN(x.punchtime)) diff
FROM my_table x
JOIN my_table y
ON y.employee_id = x.employee_id
AND DATE(y.punchtime) = DATE(x.punchtime)
GROUP
BY x.employee_id
, dt
) n
GROUP
BY employee_id;
+-------------+------+
| employee_id | half |
+-------------+------+
| 2 | 1 |
| 5 | 1 |
| 10 | 0 |
+-------------+------+
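The same two-step idea (first one row per employee and day, then a count per employee) can also be sketched without the self-join, since MIN and MAX of punchtime are both available in a single grouped pass. This is only an illustration under assumptions: it runs on SQLite through Python's sqlite3 module instead of MySQL, so the span is computed with strftime('%s', ...) rather than TIME_TO_SEC, and span_sec and per_day are names chosen for the example.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; punch times are stored as ISO text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (employee_id INTEGER, punchtime TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)", [
    (2, "2016-01-04 09:51:00"), (2, "2016-01-04 12:59:00"),
    (5, "2016-01-04 10:28:00"), (5, "2016-01-04 14:13:00"),
    (10, "2016-01-04 09:56:00"), (10, "2016-01-04 15:31:00"),
    (2, "2016-01-05 10:08:00"), (2, "2016-01-05 18:09:00"),
    (5, "2016-01-05 10:15:00"), (5, "2016-01-05 18:32:00"),
    (2, "2016-01-06 10:11:00"), (2, "2016-01-06 18:11:00"),
    (5, "2016-01-06 10:25:00"), (5, "2016-01-06 18:28:00"),
    (10, "2016-01-06 10:19:00"), (10, "2016-01-06 18:26:00"),
])

query = """
SELECT employee_id,
       SUM(span_sec < 14400) AS half   -- 14400 seconds = 4 hours
FROM (
    -- one row per employee and day: seconds between first and last punch
    SELECT employee_id,
           date(punchtime) AS dt,
           strftime('%s', MAX(punchtime)) - strftime('%s', MIN(punchtime)) AS span_sec
    FROM my_table
    GROUP BY employee_id, date(punchtime)
) AS per_day
GROUP BY employee_id
ORDER BY employee_id
"""
rows = conn.execute(query).fetchall()
print(rows)  # (employee_id, number of days shorter than 4 hours)
```

Doing the MIN/MAX in the inner query and the SUM in the outer one is what avoids nesting one aggregate inside another, which is the cause of error #1111.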