MySQL SELECT query with GROUP BY, SUM and AVG - mysql

If I add GROUP BY `date`, the SUM and AVG functions do not give me the totals I expect.
Here is the table:
| date       | calories |
|------------|----------|
| 2021-03-28 |    42.50 |
| 2021-03-30 |   500.00 |
| 2021-03-31 |    35.00 |
| 2021-04-01 |   200.00 |
| 2021-04-01 |    35.00 |
Here is my query:
SELECT CONCAT(round(IF(avg(up.calories), avg(up.calories), 0), 2), "kcal") as avg,
       CONCAT(round(IF(SUM(up.calories), SUM(up.calories), 0), 2), "kcal") as total_burned
FROM `tbl` as `up`
WHERE `date` BETWEEN "2021-03-28" AND "2021-04-03"
AND `calories` != '0'
GROUP BY `date`
Below is my query result
| avg    | total_burned |
|--------|--------------|
| 42.50  | 42.50        |
| 500.00 | 500.00       |
| 35.00  | 35.00        |
| 235.00 | 235.00       |
But actually, I want this type of result:
| avg    | total_burned |
|--------|--------------|
| 203.13 | 812.50       |

Roll your own
DROP TABLE IF EXISTS t;
create table t( date date, calories decimal(10,2));
insert into t values
( '2021-03-28' , 42.50 ),
( '2021-03-30' , 500.00 ),
( '2021-03-31' , 35.00 ),
( '2021-04-01' , 200.00 ),
( '2021-04-01' , 35.00 );
select sum(calories) sumcal, sum(calories) / count(distinct date) calcavg, avg(calories)
from t;
+--------+------------+---------------+
| sumcal | calcavg    | avg(calories) |
+--------+------------+---------------+
| 812.50 | 203.125000 |    162.500000 |
+--------+------------+---------------+
1 row in set (0.002 sec)
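Applied back to the original table and formatting, a minimal sketch (assuming the same `tbl`/`up` names from the question): drop the GROUP BY so the aggregates cover the whole date range in a single row, and divide the total by COUNT(DISTINCT date) for the per-day average.
SELECT CONCAT(ROUND(IFNULL(SUM(up.calories) / COUNT(DISTINCT up.date), 0), 2), 'kcal') AS avg,
       CONCAT(ROUND(IFNULL(SUM(up.calories), 0), 2), 'kcal') AS total_burned
FROM `tbl` AS `up`
WHERE `date` BETWEEN '2021-03-28' AND '2021-04-03'
AND `calories` != '0';
For the sample data this should return 203.13kcal and 812.50kcal, matching the expected result.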

Related

How to select all records for only the first 50 distinct values in a column

I am trying to create a classifier model for a dataset, but I have too many distinct values for my target variable. If I run something like this:
Create or replace model `model_name`
options (model_type="AUTOML_CLASSIFIER", input_label_cols=["ORIGIN_AIRPORT"]) as
select DAY_OF_WEEK, ARRIVAL_TIME, ARRIVAL_DELAY, ORIGIN_AIRPORT
from `table_name`
limit 1000
I end up getting
Error running query
Classification model currently only supports classification with up to 50 unique labels and the label column had 111 unique labels.
So how can I select, for example, all rows that have one of the first 50 values of ORIGIN_AIRPORT?
Select * from TABLE_NAME as T1
left outer join (
    SELECT distinct COLUMN_NAME
    from TABLE_NAME
    Order by COLUMN_NAME
    limit 50
) as T2 on T1.COLUMN_NAME = T2.COLUMN_NAME
The inner query fetches 50 distinct values; the outer query then matches those values via T1.COLUMN_NAME = T2.COLUMN_NAME and returns all the records (rows whose value is not in that 50-value list come back with NULL in the joined columns).
Given a table of values (origin_airport), with unique identifiers (id) and date, find the minimum date for each unique value (origin_airport) to decide which N origin_airport values are to be returned.
Return all rows which match the first 3 unique origin_airport values (densely ranked, by min(date) per origin_airport).
Updated: to use columns that more closely match the model, with origin_airport and a date column for ordering.
Full working test case
The test data:
CREATE TABLE airportlogs (
origin_airport int
, id int primary key auto_increment
, date date DEFAULT NULL
);
INSERT INTO airportlogs (origin_airport) VALUES
( 1 )
, ( 1 )
, ( 8 )
, ( 8 )
, ( 8 )
, ( 7 )
, ( 7 )
, ( 6 )
, ( 5 )
, ( 4 )
, ( 3 )
, ( 3 )
, ( 7 )
, ( 7 )
, ( 1 )
, ( 8 )
, ( 3 )
, ( 1 )
;
-- Create some dates to use for ordering.
-- Ordering can be as complicated as we need.
UPDATE airportlogs SET date = current_date + INTERVAL +id DAY;
-- Intermediate calculation to show the MIN(date) per origin_airport
WITH nvals (origin_airport, mdate) AS (
SELECT origin_airport, MIN(date) AS mdate FROM airportlogs GROUP BY origin_airport
)
SELECT *
FROM nvals
ORDER BY mdate
;
+----------------+------------+
| origin_airport | mdate      |
+----------------+------------+
|              1 | 2021-08-05 |
|              8 | 2021-08-07 |
|              7 | 2021-08-10 |
|              6 | 2021-08-12 |
|              5 | 2021-08-13 |
|              4 | 2021-08-14 |
|              3 | 2021-08-15 |
+----------------+------------+
-- Calculation of ordered rank for the unique origin_airport values
-- by MIN(date) per origin_airport.
WITH nvals0 (origin_airport, date, mdate) AS (
SELECT origin_airport
, date
, MIN(date) OVER (PARTITION BY origin_airport) AS mdate
FROM airportlogs
)
, nvals (origin_airport, date, mdate, r) AS (
SELECT origin_airport
, date
, mdate
, DENSE_RANK() OVER (ORDER BY mdate) AS r
FROM nvals0
)
SELECT *
FROM nvals
ORDER BY r, date
;
Result:
+----------------+------------+------------+---+
| origin_airport | date       | mdate      | r |
+----------------+------------+------------+---+
|              1 | 2021-08-05 | 2021-08-05 | 1 |
|              1 | 2021-08-06 | 2021-08-05 | 1 |
|              1 | 2021-08-19 | 2021-08-05 | 1 |
|              1 | 2021-08-22 | 2021-08-05 | 1 |
|              8 | 2021-08-07 | 2021-08-07 | 2 |
|              8 | 2021-08-08 | 2021-08-07 | 2 |
|              8 | 2021-08-09 | 2021-08-07 | 2 |
|              8 | 2021-08-20 | 2021-08-07 | 2 |
|              7 | 2021-08-10 | 2021-08-10 | 3 |
|              7 | 2021-08-11 | 2021-08-10 | 3 |
|              7 | 2021-08-17 | 2021-08-10 | 3 |
|              7 | 2021-08-18 | 2021-08-10 | 3 |
|              6 | 2021-08-12 | 2021-08-12 | 4 |
|              5 | 2021-08-13 | 2021-08-13 | 5 |
|              4 | 2021-08-14 | 2021-08-14 | 6 |
|              3 | 2021-08-15 | 2021-08-15 | 7 |
|              3 | 2021-08-16 | 2021-08-15 | 7 |
|              3 | 2021-08-21 | 2021-08-15 | 7 |
+----------------+------------+------------+---+
The final solution:
WITH min_date (origin_airport, date, mdate) AS (
SELECT origin_airport
, date
, MIN(date) OVER (PARTITION BY origin_airport) AS mdate
FROM airportlogs
)
, ranks (origin_airport, date, mdate, r) AS (
SELECT origin_airport
, date
, mdate
, DENSE_RANK() OVER (ORDER BY mdate) AS r
FROM min_date
)
SELECT *
FROM ranks
WHERE r <= 3
ORDER BY r, date
;
The final result:
+----------------+------------+------------+---+
| origin_airport | date       | mdate      | r |
+----------------+------------+------------+---+
|              1 | 2021-08-05 | 2021-08-05 | 1 |
|              1 | 2021-08-06 | 2021-08-05 | 1 |
|              1 | 2021-08-19 | 2021-08-05 | 1 |
|              1 | 2021-08-22 | 2021-08-05 | 1 |
|              8 | 2021-08-07 | 2021-08-07 | 2 |
|              8 | 2021-08-08 | 2021-08-07 | 2 |
|              8 | 2021-08-09 | 2021-08-07 | 2 |
|              8 | 2021-08-20 | 2021-08-07 | 2 |
|              7 | 2021-08-10 | 2021-08-10 | 3 |
|              7 | 2021-08-11 | 2021-08-10 | 3 |
|              7 | 2021-08-17 | 2021-08-10 | 3 |
|              7 | 2021-08-18 | 2021-08-10 | 3 |
+----------------+------------+------------+---+
There are a number of other solutions.
The poster didn't say how the first N values should be chosen; with the window-function approach above, whatever ordering is wanted is trivial to plug into the DENSE_RANK() ORDER BY.
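Back on the original question's table, a hypothetical adaptation of the same idea (assuming the `table_name` columns from the question, and ranking airports alphabetically since no ordering was specified):
WITH ranked AS (
  SELECT DAY_OF_WEEK, ARRIVAL_TIME, ARRIVAL_DELAY, ORIGIN_AIRPORT
       , DENSE_RANK() OVER (ORDER BY ORIGIN_AIRPORT) AS r
  FROM `table_name`
)
SELECT DAY_OF_WEEK, ARRIVAL_TIME, ARRIVAL_DELAY, ORIGIN_AIRPORT
FROM ranked
WHERE r <= 50;
This keeps every row for the first 50 distinct ORIGIN_AIRPORT values, so the label column stays within the 50-label limit; a different rule for picking the airports only changes the ORDER BY inside DENSE_RANK().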

Sum values while the column value is the same

I am trying to build a timeline. My table has a type column, a start date and an end date, like so:
+------+---------------------+---------------------+----------+
| Type | Start               | End                 | Diff     |
+------+---------------------+---------------------+----------+
|    1 | 2020-11-23 23:40:00 | 2020-11-23 23:41:00 | 00:01:00 |
|    1 | 2020-11-23 23:42:00 | 2020-11-23 23:43:00 | 00:01:00 |
|    1 | 2020-11-23 23:44:00 | 2020-11-23 23:45:00 | 00:01:00 |
|    2 | 2020-11-23 23:46:00 | 2020-11-23 23:47:00 | 00:01:00 |
|    2 | 2020-11-23 23:48:00 | 2020-11-23 23:49:00 | 00:01:00 |
|    1 | 2020-11-23 23:50:00 | 2020-11-23 23:51:00 | 00:01:00 |
|    1 | 2020-11-23 23:52:00 | 2020-11-23 23:53:00 | 00:01:00 |
+------+---------------------+---------------------+----------+
I need to sum the differences while the type value stays the same as in the previous row. Once the type value changes, a new line starts, giving a result like this:
+------+----------+
| Type | Diff     |
+------+----------+
|    1 | 00:03:00 |
|    2 | 00:02:00 |
|    1 | 00:02:00 |
+------+----------+
How can I achieve such grouping and sum result in MySQL?
PS: Don't bother with the time logic; setting up an example using integers is perfectly OK.
Use a variable to assign a block number and then aggregate
drop table if exists t;
create table t
( Type int, Startdt datetime, Enddt datetime, Diff time);
insert into t values
( 1 ,'2020-11-23 23:40:00' ,'2020-11-23 23:41:00' , '00:01:00' ),
( 1 ,'2020-11-23 23:42:00' ,'2020-11-23 23:43:00' , '00:01:00' ),
( 1 ,'2020-11-23 23:44:00' ,'2020-11-23 23:45:00' , '00:01:00' ),
( 2 ,'2020-11-23 23:46:00' ,'2020-11-23 23:47:00' , '00:01:00' ),
( 2 ,'2020-11-23 23:48:00' ,'2020-11-23 23:49:00' , '00:01:00' ),
( 1 ,'2020-11-23 23:50:00' ,'2020-11-23 23:51:00' , '00:01:00' ),
( 1 ,'2020-11-23 23:52:00' ,'2020-11-23 23:53:00' , '00:01:00' );
select type, block, sec_to_time(sum(time_to_sec(diff)))
from
(
  select t.*,
         if(type <> @p, @b:=@b+1, @b:=@b) block,
         @p:=type p
  from t
  cross join (select @b:=0, @p:=0) b
  order by startdt, type
) s
group by s.block, s.type;
+------+-------+--------------------------------------+
| type | block | sec_to_time(sum(time_to_sec(diff)))  |
+------+-------+--------------------------------------+
|    1 |     1 | 00:03:00                              |
|    2 |     2 | 00:02:00                              |
|    1 |     3 | 00:02:00                              |
+------+-------+--------------------------------------+
3 rows in set (0.148 sec)
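On MySQL 8+ the same block numbering can be written with window functions instead of user variables; a sketch against the same table t:
WITH marked AS (
  SELECT t.*, LAG(type) OVER (ORDER BY startdt) AS prev_type
  FROM t
),
blocks AS (
  SELECT marked.*,
         -- running count of type changes gives the block number
         SUM(CASE WHEN prev_type IS NULL OR prev_type <> type THEN 1 ELSE 0 END)
             OVER (ORDER BY startdt) AS block
  FROM marked
)
SELECT type, block, SEC_TO_TIME(SUM(TIME_TO_SEC(diff))) AS diff
FROM blocks
GROUP BY block, type
ORDER BY block;
It returns the same three blocks (00:03:00, 00:02:00, 00:02:00) as the user-variable version.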
SELECT t.type, sum(ta.dif)
FROM TABLE_NAME t
JOIN (SELECT c.type, TIMEDIFF(c.end, c.start) as dif
      FROM TABLE_NAME c) ta ON ta.type = t.type
group by t.type

How to get the current balance of each source and then get the percentage of it from the total sum of all sources

SELECT source, SUM(deposit), SUM(distribute), SUM(deposit)-SUM(distribute),
       (SUM(deposit)-SUM(distribute)) / (SUM(SUM(deposit)-SUM(distribute)) * 100 as percentage
FROM tbl_sourceofFunds
GROUP BY source
It keeps saying "#1111 - Invalid use of group function".
source | deposit    | withdraw  |
-------|------------|-----------|
A      | 300,000.00 |           |
B      | 300,000.00 |           |
C      | 220,000.00 |           |
A      |            | 53,300.00 |
A      | 20,000.00  |           |
B      |            | 3,700.00  |
C      |            | 5,100.00  |
What I mean is to get:
source | sum.deposit | sum.withdraw | balance    | percentage |
-------|-------------|--------------|------------|------------|
A      | 320,000.00  | 53,300       | 266,700.00 | 34.284612  |
B      | 300,000.00  | 3,700        | 296,300.00 | 38.089729  |
C      | 220,000.00  | 5,100        | 214,900.00 | 27.625659  |
You can get the total across the entire table by including it as a subquery:
DROP TABLE IF EXISTS sourcefunds;
CREATE TABLE sourcefunds(source VARCHAR(1), deposit DECImal (10,2), distribute decimal(10,2));
insert into sourcefunds values
('A' , 320000.00 , null ),
('B' , 300000.00 , null ),
('C' , 220000.00 , null ),
('A' , null , 53300.00 ),
('A' , 20000.00 , null ),
('B' , null , 3700.00 ),
('C' , null , 5100.00 );
SELECT source, SUM(deposit), SUM(distribute), SUM(deposit)-SUM(distribute),
(SUM(deposit)-SUM(distribute)) / (select sum(deposit) - sum(distribute) from sourcefunds) * 100 as percentage
FROM sourcefunds
GROUP BY source;
+--------+--------------+-----------------+------------------------------+------------+
| source | SUM(deposit) | SUM(distribute) | SUM(deposit)-SUM(distribute) | percentage |
+--------+--------------+-----------------+------------------------------+------------+
| A      |    340000.00 |        53300.00 |                    286700.00 |  35.931821 |
| B      |    300000.00 |         3700.00 |                    296300.00 |  37.134979 |
| C      |    220000.00 |         5100.00 |                    214900.00 |  26.933200 |
+--------+--------------+-----------------+------------------------------+------------+
3 rows in set (0.00 sec)
And if you want a grand total, use ROLLUP:
SELECT source, SUM(deposit), SUM(distribute), SUM(deposit)-SUM(distribute),
(SUM(deposit)-SUM(distribute)) / (select sum(deposit) - sum(distribute) from sourcefunds) * 100 as percentage
FROM sourcefunds
GROUP BY source with rollup;
+--------+--------------+-----------------+------------------------------+------------+
| source | SUM(deposit) | SUM(distribute) | SUM(deposit)-SUM(distribute) | percentage |
+--------+--------------+-----------------+------------------------------+------------+
| A      |    340000.00 |        53300.00 |                    286700.00 |  35.931821 |
| B      |    300000.00 |         3700.00 |                    296300.00 |  37.134979 |
| C      |    220000.00 |         5100.00 |                    214900.00 |  26.933200 |
| NULL   |    860000.00 |        62100.00 |                    797900.00 | 100.000000 |
+--------+--------------+-----------------+------------------------------+------------+
4 rows in set (0.00 sec)
Try this
Select source, deposit, withdrew,
       deposit - withdrew as balance,
       (deposit - withdrew) / (Select sum(deposit) - sum(distribute) from tbl_sourceofFunds) * 100 as percentage
From (
    SELECT source,
           SUM(deposit) as deposit,
           SUM(distribute) as withdrew
    FROM tbl_sourceofFunds
    GROUP BY source
) x
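On MySQL 8+ you can also get the grand total with a window function over the grouped rows instead of a scalar subquery; a sketch against the same sourcefunds test table:
SELECT source,
       SUM(deposit) AS deposit,
       SUM(distribute) AS withdraw,
       SUM(deposit) - SUM(distribute) AS balance,
       (SUM(deposit) - SUM(distribute))
           / SUM(SUM(deposit) - SUM(distribute)) OVER () * 100 AS percentage
FROM sourcefunds
GROUP BY source;
The outer SUM(...) OVER () runs after the GROUP BY, which avoids the nested-aggregate error (#1111) the original query hit.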

Include NULL in SQL Join when using WHERE

I have the following two tables:
Table TempUser22 : 57,000 rows:
+------+-----------+
| Id   | Followers |
+------+-----------+
|  874 |     55542 |
| 1081 |    330624 |
| 1378 |     17919 |
| 1621 |       920 |
| 1688 |    255463 |
| 2953 |       751 |
| 3382 |    204466 |
| 3840 |    273489 |
| 4145 |       376 |
|  ... |       ... |
+------+-----------+
Table temporal_users: 10,000,000 rows total, 3,200 rows where Date = 2010-12-31:
+---------------------+---------+--------------------+
| Date                | User_Id | has_original_tweet |
+---------------------+---------+--------------------+
| 2008-02-22 12:00:00 |  676493 |                  2 |
| 2008-02-22 12:00:00 |  815263 |                  1 |
| 2008-02-22 12:00:00 | 6245822 |                  1 |
| 2008-02-22 12:00:00 | 8854092 |                  1 |
| 2008-02-23 12:00:00 |  676493 |                  2 |
| 2008-02-23 12:00:00 |  815263 |                  1 |
| 2008-02-23 12:00:00 | 6245822 |                  1 |
| 2008-02-23 12:00:00 | 8854092 |                  1 |
| 2008-02-24 12:00:00 |  676493 |                  2 |
| .............       |     ... |                 .. |
+---------------------+---------+--------------------+
I am running the following join query on these tables:
SELECT sum(has_original_tweet), b.Id
FROM temporal_users AS a
RIGHT JOIN TempUser22 AS b
ON a.User_ID = b.Id
GROUP BY b.Id;
This returns 57,000 rows as expected, with NULL values in the first field:
+-------------------------+------+
| sum(has_original_tweet) | Id   |
+-------------------------+------+
|                    NULL |  874 |
|                    NULL | 1081 |
|                     135 | 1378 |
|                     164 | 1621 |
|                     652 | 1688 |
|                     691 | 2953 |
|                    NULL | 3382 |
|                    NULL | 3840 |
|                    NULL | 4145 |
|                     ... | .... |
+-------------------------+------+
However, when adding the WHERE line specifying a date as below:
SELECT sum(has_original_tweet), b.Id
FROM temporal_users AS a
RIGHT JOIN TempUser22 AS b
ON a.User_ID = b.Id
WHERE a.Date BETWEEN '2010-12-31-00:00:00' AND '2010-12-31-23:59:59'
GROUP BY b.Id;
I receive the following result of only 3,200 rows, without any NULLs in the first field.
+-------------------------+---------+
| sum(has_original_tweet) | Id      |
+-------------------------+---------+
|                       1 |  797194 |
|                       1 |  815263 |
|                       0 |  820678 |
|                       1 | 1427511 |
|                       0 | 4653731 |
|                       1 | 5933862 |
|                       2 | 7530552 |
|                       1 | 7674072 |
|                       1 | 8149632 |
|                      .. |    .... |
+-------------------------+---------+
My question is: for a given date, how do I get a 57,000-row result with one row per user in TempUser22, with NULL when has_original_tweet is not present in temporal_users for that date?
Thanks.
SELECT b.Id, SUM(a.has_original_tweet) s
FROM TempUser22 b
LEFT JOIN temporal_users a ON b.Id = a.User_Id
AND a.Date BETWEEN '2010-12-31-00:00:00' AND '2010-12-31-23:59:59'
GROUP BY b.Id;
Id   s
1    null
2    1
3    null
4    3
5    null
6    null
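The difference from the original query is that the date filter moved from the WHERE clause into the join's ON clause: the outer join still returns every TempUser22 row, and users with no matching temporal_users row for that date simply get a NULL sum.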
For debugging, I used:
CREATE TEMPORARY TABLE TempUser22(Id INT, Followers INT)
SELECT 1 Id, 10 Followers UNION ALL
SELECT 2, 20 UNION ALL
SELECT 3, 30 UNION ALL
SELECT 4, 40 UNION ALL
SELECT 5, 50 UNION ALL
SELECT 6, 60
;
CREATE TEMPORARY TABLE temporal_users(`Date` DATETIME, User_Id INT, has_original_tweet INT)
SELECT '2008-02-22 12:00:00' `Date`, 1 User_Id, 1 has_original_tweet UNION ALL
SELECT '2008-12-31 12:00:00', 2, 1 UNION ALL
SELECT '2010-12-31 12:00:00', 2, 1 UNION ALL
SELECT '2012-12-31 12:00:00', 2, 1 UNION ALL
SELECT '2008-12-31 12:00:00', 4, 9 UNION ALL
SELECT '2010-12-31 12:00:00', 4, 1 UNION ALL
SELECT '2010-12-31 12:00:00', 4, 2 UNION ALL
SELECT '2012-12-31 12:00:00', 4, 9
;
That's because rows where a.Date is NULL are always filtered out by that WHERE condition.
You can use a coalesce in your where clause.
WHERE coalesce(a.Date, 'some-date-in-the-range') BETWEEN '2010-12-31-00:00:00' AND '2010-12-31-23:59:59'
With this instead, you force NULL dates to be treated as falling inside the range.

TIMESTAMPDIFF Sum Case error

+-----------+-----------+--------+
| punchtime | punchdate | emp_id |
+-----------+-----------+--------+
| 9:51:00   | 4/1/2016  |      2 |
| 12:59:00  | 4/1/2016  |      2 |
| 10:28:00  | 4/1/2016  |      5 |
| 14:13:00  | 4/1/2016  |      5 |
| 9:56:00   | 4/1/2016  |     10 |
| 15:31:00  | 4/1/2016  |     10 |
| 10:08:00  | 5/1/2016  |      2 |
| 18:09:00  | 5/1/2016  |      2 |
| 10:15:00  | 5/1/2016  |      5 |
| 18:32:00  | 5/1/2016  |      5 |
| 10:11:00  | 6/1/2016  |      2 |
| 18:11:00  | 6/1/2016  |      2 |
| 10:25:00  | 6/1/2016  |      5 |
| 18:28:00  | 6/1/2016  |      5 |
| 10:19:00  | 6/1/2016  |     10 |
| 18:26:00  | 6/1/2016  |     10 |
+-----------+-----------+--------+
I need to count, per emp_id, the days where the time between punches is less than 4 hours, and count that over the whole table. I am trying the code below but it's not working.
SELECT
a.emp_id,
sum( case when TIMESTAMPDIFF(hour, min(a.punchtime),
max(a.punchtime))< 4 then 1 else 0 end ) as 'Half Day'
FROM machinedata a
GROUP BY
a.emp_id
I am getting an error: #1111 - Invalid use of group function.
Desired output -
+--------+----------+
| emp_id | Half Day |
+--------+----------+
|      2 |        1 |
|      8 |        0 |
|     10 |        0 |
+--------+----------+
Your data set and desired result do not accord, so I'm going to ignore it...
Instead consider the following...
Note both the way in which I have presented the problem, and the construction of the solution.
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(employee_id INT NOT NULL
,punchtime DATETIME NOT NULL
,PRIMARY KEY(employee_id,punchtime)
);
INSERT INTO my_table VALUES
( 2,'2016/01/04 09:51:00'),
( 2,'2016/01/04 12:59:00'),
( 5,'2016/01/04 10:28:00'),
( 5,'2016/01/04 14:13:00'),
(10,'2016/01/04 09:56:00'),
(10,'2016/01/04 15:31:00'),
( 2,'2016/01/05 10:08:00'),
( 2,'2016/01/05 18:09:00'),
( 5,'2016/01/05 10:15:00'),
( 5,'2016/01/05 18:32:00'),
( 2,'2016/01/06 10:11:00'),
( 2,'2016/01/06 18:11:00'),
( 5,'2016/01/06 10:25:00'),
( 5,'2016/01/06 18:28:00'),
(10,'2016/01/06 10:19:00'),
(10,'2016/01/06 18:26:00');
SELECT employee_id
, SUM(diff < 14400 ) half
FROM
( SELECT x.*
, DATE(x.punchtime) dt
, TIME_TO_SEC(MAX(y.punchtime)) - TIME_TO_SEC(MIN(x.punchtime)) diff
FROM my_table x
JOIN my_table y
ON y.employee_id = x.employee_id
AND DATE(y.punchtime) = DATE(x.punchtime)
GROUP
BY x.employee_id
, dt
) n
GROUP
BY employee_id;
+-------------+------+
| employee_id | half |
+-------------+------+
|           2 |    1 |
|           5 |    1 |
|          10 |    0 |
+-------------+------+
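The self-join isn't strictly required; a simpler sketch of the same idea (assuming the my_table created above) first collapses each employee/day to one row, then counts the short days:
SELECT employee_id
     , SUM(day_seconds < 4 * 3600) AS half
FROM
( SELECT employee_id
       , DATE(punchtime) dt
       , TIMESTAMPDIFF(SECOND, MIN(punchtime), MAX(punchtime)) AS day_seconds
  FROM my_table
  GROUP BY employee_id, DATE(punchtime)
) n
GROUP BY employee_id;
For this data set it returns the same three rows as above.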