sql: group by multiple correlated fields (date, weekday, month) - mysql

I am working on a SQL task. The goal is to know how many flights there are on average, for a given day in a given month from the flights table.
Input table:
flights
id BIGINT
dep_day_of_week varchar (255)
dep_month varchar (255)
dep_date text
An example of the flights table. There could be multiple entries for the same date.
id dep_day_of_week dep_month dep_date
1 Thursday January 4/7/2005 15:24:00
2 Friday February 5/6/2005 12:12:12
3 Friday February 5/6/2005 15:12:12
I read a solution as follows:
SELECT a.dep_month,
a.dep_day_of_week,
AVG(a.flight_count) AS average_flights
FROM (
SELECT dep_month, dep_day_of_week, dep_date,
COUNT(*) AS flight_count
FROM flights
GROUP BY 1,2,3
) a
GROUP BY 1,2
ORDER BY 1,2;
My question is about the subquery, which calculates the number of flights per day:
SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
FROM flights
GROUP BY 1,2,3
dep_month, dep_day_of_week and dep_date are three correlated attributes, and dep_date has the finest resolution of the three. So I expected GROUP BY 1,2,3 to behave the same as GROUP BY 3.
To examine the possible differences, I wrapped the subquery in count(*) to count the groups it produces:
Select count(*) from (
SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
FROM flights
GROUP BY 1,2,3 -- or: GROUP BY 3
) t
In the output, the counts for GROUP BY 1,2,3 and GROUP BY 3 are 447 and 441, respectively. Why is there a difference between these two grouping methods?
Updates:
Thanks to @trincot's excellent answer. I used his suggested code and found an inconsistency in the input database.
SELECT dep_date, count(distinct dep_month), count(distinct dep_day_of_week)
FROM flights
GROUP BY dep_date
HAVING count(distinct dep_month) > 1
OR count(distinct dep_day_of_week) > 1
Output:
dep_date count(distinct dep_month) count(distinct dep_day_of_week)
1/16/2001 1 2
10/25/2003 1 2
2/23/2000 1 2
3/29/2001 1 2
4/3/2001 1 2
5/13/2000 1 2
Specifically, the database assigns Monday to 1/16/2001 8:25:00 and Tuesday to 1/16/2001 7:56:00. That is the reason for the inconsistency.

As the date field has a time component, the count(*) in your subquery is going to be 1 almost every time: any difference in the time component creates a new group, so your groups are actually per second rather than per day.
You could get your results without a subquery, like this:
select dep_month,
dep_day_of_week,
count(*) /
count(distinct substring_index(dep_date, ' ', 1)) avg_flights
from flights
group by dep_month,
dep_day_of_week
This counts all the flight records, and divides that by the number of different dates these flights are on. The date is extracted by only taking the part before the space.
Note that this means that when you don't have a record at all for a certain date, this day will not count in the average and might give a false impression. For instance, if in January there is only one Friday for which you have flights (let's say 10 of them), but there are 4 Fridays in January, you will still get an average of 10, even though 2.5 would be more reasonable.
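To make the effect of the time component concrete, here is a minimal sketch using Python's built-in sqlite3 module and the three sample rows from the question. It is an illustration under assumptions, not the original MySQL setup: SQLite has no substring_index, so substr/instr is swapped in to cut the date at the first space, and 1.0 * is added because SQLite would otherwise do integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (id INTEGER, dep_day_of_week TEXT, dep_month TEXT, dep_date TEXT)"
)
# The three sample rows from the question.
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?, ?)",
    [
        (1, "Thursday", "January", "4/7/2005 15:24:00"),
        (2, "Friday", "February", "5/6/2005 12:12:12"),
        (3, "Friday", "February", "5/6/2005 15:12:12"),
    ],
)

# Grouping on the raw dep_date (with time) yields one group per timestamp.
groups_per_timestamp = conn.execute(
    "SELECT COUNT(*) FROM (SELECT dep_date FROM flights GROUP BY dep_date)"
).fetchone()[0]

# Cutting the value at the first space groups by calendar date instead.
# substr/instr is the SQLite stand-in for MySQL's substring_index(dep_date, ' ', 1).
avg_rows = conn.execute(
    """
    SELECT dep_month,
           dep_day_of_week,
           1.0 * COUNT(*)
               / COUNT(DISTINCT substr(dep_date, 1, instr(dep_date, ' ') - 1)) AS avg_flights
    FROM flights
    GROUP BY dep_month, dep_day_of_week
    ORDER BY dep_month, dep_day_of_week
    """
).fetchall()

print(groups_per_timestamp)
print(avg_rows)
```

With these rows the two Friday flights share one calendar date, so they count as two flights on one day rather than as two separate one-flight groups.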
About the difference in count
You state that this query returns 447 records:
Select count(*) from (
SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
FROM flights
GROUP BY 1,2,3) t
And this only 441:
Select count(*) from (
SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
FROM flights
GROUP BY 3) t
This seems to indicate that you have identical dates in multiple records that nevertheless differ in one of the first two columns, which would be an inconsistency. You can find out with this query:
SELECT dep_date, count(distinct dep_month), count(distinct dep_day_of_week)
FROM flights
GROUP BY dep_date
HAVING count(distinct dep_month) > 1
OR count(distinct dep_day_of_week) > 1
In a healthy data set, this query should return 0 records. If it returns records, you'll get the dates for which the month is not correctly set in at least one record, or the day of the week is not correctly set in at least one record.
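A small sketch of how one inconsistent date produces the extra groups, again using Python's sqlite3 module rather than MySQL. The three rows are hypothetical (date-only values, with one date labeled both Monday and Tuesday, similar to what the question's update reports), and the columns are named explicitly instead of using positional GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (id INTEGER, dep_day_of_week TEXT, dep_month TEXT, dep_date TEXT)"
)
# Hypothetical rows: 1/16/2001 appears with two different weekdays.
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?, ?)",
    [
        (1, "Monday", "January", "1/16/2001"),
        (2, "Tuesday", "January", "1/16/2001"),
        (3, "Friday", "February", "2/23/2001"),
    ],
)

# Equivalent of GROUP BY 1,2,3: every distinct (month, weekday, date) is a group.
groups_all = conn.execute(
    """
    SELECT COUNT(*) FROM (
        SELECT dep_month, dep_day_of_week, dep_date
        FROM flights
        GROUP BY dep_month, dep_day_of_week, dep_date
    )
    """
).fetchone()[0]

# Equivalent of GROUP BY 3: only the date defines a group.
groups_date = conn.execute(
    "SELECT COUNT(*) FROM (SELECT dep_date FROM flights GROUP BY dep_date)"
).fetchone()[0]

# The consistency check from the answer, unchanged apart from the ORDER BY.
inconsistent = conn.execute(
    """
    SELECT dep_date, COUNT(DISTINCT dep_month), COUNT(DISTINCT dep_day_of_week)
    FROM flights
    GROUP BY dep_date
    HAVING COUNT(DISTINCT dep_month) > 1
        OR COUNT(DISTINCT dep_day_of_week) > 1
    ORDER BY dep_date
    """
).fetchall()

print(groups_all, groups_date, inconsistent)
```

Each inconsistent date adds one group to the three-column grouping, which is the same kind of gap as 447 versus 441 in the question.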

Related

Getting the number of users for this year and last year in SQL

My table is like this:
root_tstamp userId
2022-01-26T00:13:24.725+00:00 d2212
2022-01-26T00:13:24.669+00:00 ad323
2022-01-26T00:13:24.629+00:00 adfae
2022-01-26T00:13:24.573+00:00 adfa3
2022-01-26T00:13:24.552+00:00 adfef
... ...
2021-01-26T00:12:24.725+00:00 d2212
2021-01-26T00:15:24.669+00:00 daddfe
2021-01-26T00:14:24.629+00:00 adfda
2021-01-26T00:12:24.573+00:00 466eff
2021-01-26T00:12:24.552+00:00 adfafe
I want to get the number of users in the current year and in previous year like below using SQL.
Date Users previous_year
2022-01-01 10 5
2022-01-02 20 15
The code is written as follows.
select CAST(root_tstamp as DATE) as Date,
count(DISTINCT userid) as users,
count(Distinct case when CAST(root_tstamp as DATE) = dateadd(MONTH,-12,CAST(root_tstamp as DATE)) then userid end) as previous_year
FROM table1
But it returns 0 for previous_year values.
How can I fix that?
Possible solution for SQL Server:
WITH cte AS ( SELECT 2022 [year]
UNION ALL
SELECT 2021 )
SELECT cte.[year],
COUNT(DISTINCT test.userId) current_users_amount,
COUNT(DISTINCT CASE WHEN YEAR(test.root_tstamp) < cte.[year]
THEN test.userId
END) previous_users_amount
FROM test
JOIN cte ON YEAR(test.root_tstamp) <= cte.[year]
GROUP BY cte.[year]
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=88b78aad9acd965bdbac4c85a0b81927
This query (for MySQL) returns the number of unique userIds per day where root_tstamp is in the current year, together with the number of unique userIds for the same day last year. If there is no record for a day in the current year, nothing will be displayed for that day. If there are rows for the current year but no rows for the same day last year, the last-year column will show 0, because COUNT ignores the NULLs produced by the LEFT JOIN.
SELECT cast(ty.root_tstamp as date) as Dte,
COUNT(DISTINCT ty.userId) as users_this_day,
count(distinct lysd.userid) as users_sameday_lastyear
FROM test ty
left join
test lysd
on cast(lysd.root_tstamp as date)=date_add(cast(ty.root_tstamp as date), interval -1 year)
WHERE YEAR(ty.root_tstamp) = year(current_date())
GROUP BY Dte
If you wish to show output rows for calendar days even when there are no rows in the current year and/or last year, then you also need to introduce a calendar table (let's hope that is not what you need).
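The self-join idea can be tried out with a sketch like the following, written against Python's sqlite3 module instead of MySQL. Several things here are assumptions for illustration: the timestamps are stored as plain 'YYYY-MM-DD HH:MM:SS' text, SQLite's date(x, '-1 year') replaces date_add(..., interval -1 year), the year is fixed to 2022 so the result does not depend on the current date, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (root_tstamp TEXT, userId TEXT)")
# Hypothetical rows: two users on 2022-01-26, three on 2021-01-26,
# and one on 2022-01-27 with nothing on the same day a year earlier.
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [
        ("2022-01-26 00:13:24", "d2212"),
        ("2022-01-26 00:13:25", "ad323"),
        ("2022-01-27 08:00:00", "adfae"),
        ("2021-01-26 00:12:24", "d2212"),
        ("2021-01-26 00:15:24", "daddfe"),
        ("2021-01-26 00:14:24", "adfda"),
    ],
)

rows = conn.execute(
    """
    SELECT date(ty.root_tstamp)        AS dte,
           COUNT(DISTINCT ty.userId)   AS users_this_day,
           COUNT(DISTINCT lysd.userId) AS users_sameday_lastyear
    FROM test ty
    LEFT JOIN test lysd
           ON date(lysd.root_tstamp) = date(ty.root_tstamp, '-1 year')
    WHERE strftime('%Y', ty.root_tstamp) = '2022'  -- fixed year, assumption for the demo
    GROUP BY dte
    ORDER BY dte
    """
).fetchall()

print(rows)
```

The join multiplies rows (every 2022 row pairs with every matching 2021 row), which is why both counts need DISTINCT. For the day without a match last year, the count comes out as 0.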

How can I optimize the query below which uses three levels of select statements?

How to optimize the below query:
I have two tables, 'calendar_table' and 'consumption'. I use this query to calculate the monthly consumption for each year.
The calendar table has day, month and year for the years 2005 - 2009, and the consumption table has billed consumption data per monthly bill cycle. The query counts the number of days on each bill and uses that to find the consumption for each month.
SELECT id,
date_from as bill_start_date,
theYear as Year,
MONTHNAME(STR_TO_DATE(theMonth, '%m')) as month,
sum(DaysOnBill),
TotalDaysInTheMonth,
sum(perDayConsumption * DaysOnBill) as EstimatedConsumption
FROM
(
SELECT
id,
date_from,
theYear,
theMonth, # use theMonth for displaying the month as a number
COUNT(*) AS DaysOnBill,
TotalDaysInTheMonth,
perDayConsumption
FROM
(
SELECT
c.id,
c.date_from as date_from,
ct.dt,
y AS theYear,
month AS theMonth,
DAY(LAST_DAY(ct.dt)) as TotalDaysInTheMonth,
perDayConsumption
FROM
consumption AS c
INNER JOIN
calendar_table AS ct
ON ct.dt >= c.date_from
AND ct.dt<= c.date_to
) AS allDates
GROUP BY
id,
date_from,
theYear,
theMonth ) AS estimates
GROUP BY
id,
theYear,
theMonth;
It takes around 1000 seconds to go through around 1 million records. Can something be done to make it faster?
The query is a bit dubious: it pretends to do one grouping first and then build on that with another, which actually isn't the case.
First the bill gets joined with all its days. Then we group by bill plus month and year, thus getting a monthly view of the data. This could be done in one pass, but the query joins first and then uses the result as a derived table which gets aggregated. Finally the results are taken again and "another" group is built, which is actually the same as before (bill plus month and year), and some pseudo aggregations are done (e.g. sum(perDayConsumption * DaysOnBill), which is the same as perDayConsumption * DaysOnBill, as SUM sums one record only here).
This can simply be written as:
SELECT
c.id,
c.date_from as bill_start_date,
ct.y AS Year,
MONTHNAME(STR_TO_DATE(ct.month, '%m')) as month,
COUNT(*) AS DaysOnBill,
DAY(LAST_DAY(ct.dt)) as TotalDaysInTheMonth,
SUM(c.perDayConsumption) as EstimatedConsumption
FROM consumption AS c
INNER JOIN calendar_table AS ct ON ct.dt BETWEEN c.date_from AND c.date_to
GROUP BY
c.id,
ct.y,
ct.month;
I don't know whether this will be faster, or whether MySQL's optimizer already sees through your query and boils it down to this anyhow.
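For what it's worth, the single-pass version can be checked on a tiny data set. The sketch below uses Python's sqlite3 module, a made-up bill that spans a month boundary, and a 15-day calendar generated in Python; all values are hypothetical, and it says nothing about MySQL performance.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE consumption (id INTEGER, date_from TEXT, date_to TEXT, perDayConsumption REAL);
    CREATE TABLE calendar_table (dt TEXT, y INTEGER, month INTEGER);
    """
)
# One hypothetical bill running from 30 January to 2 February at 2.0 units per day.
conn.execute("INSERT INTO consumption VALUES (1, '2005-01-30', '2005-02-02', 2.0)")

# Calendar rows for 2005-01-25 through 2005-02-08 (ISO text sorts chronologically).
start = date(2005, 1, 25)
for offset in range(15):
    d = start + timedelta(days=offset)
    conn.execute(
        "INSERT INTO calendar_table VALUES (?, ?, ?)", (d.isoformat(), d.year, d.month)
    )

# Join each bill to its days and aggregate once per bill, year and month.
monthly = conn.execute(
    """
    SELECT c.id,
           ct.y,
           ct.month,
           COUNT(*)                 AS days_on_bill,
           SUM(c.perDayConsumption) AS estimated_consumption
    FROM consumption AS c
    JOIN calendar_table AS ct ON ct.dt BETWEEN c.date_from AND c.date_to
    GROUP BY c.id, ct.y, ct.month
    ORDER BY c.id, ct.y, ct.month
    """
).fetchall()

print(monthly)
```

The bill covers two days in January and two in February, so each month gets two days and 4.0 units, the same result the nested version would compute as perDayConsumption * DaysOnBill.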

SQL WHERE IF clause issue

I have a SQL/Java code issue. The basic layout is as follows: a MySQL database with a table. In this table there are multiple columns. One column consists of names. An associated column holds months. In the third column there are counts. So a sample table would be
john - january - 5
john - january - 6
mary - january - 5
Alex - February- 5
John - February - 6
John - February - 4
Mary - February - 3
John - march - 4
The table continues to month May.
So John appears in five months, Mary in 3, and Alex in one. Currently, my SQL query somewhat looks like this.
select name, sum(count)/4
from table where (category ='something'
AND month not like 'May') group by name;
Basically, what this query is supposed to do is just display each name with the average counts per month. Hence, the sum will be divided by four (because I exclude May, so it must divide Jan-April/4). However, the issue is that some names only appear in one month (or two or three).
This means for that name, the sum of the counts would only be divided by that specific number, to get the average counts over the months. How would I go about this? I feel as if this will require some if statement in a where clause. Kind of like where if the count of the distinct (because months may repeat) is a certain number, then divide the sum(count) by that number for each name?
Also, I think it may not be a where if clause issue. I've read some forums where possibly some use of case could be utilized?
If you need the average per month, you can GROUP BY name and month and use the AVG function:
SELECT `name`, `month`, avg(`count`)
FROM `table`
WHERE `category` ='something' AND `month` NOT LIKE 'May'
GROUP BY `name`, `month`;
If you need the average over the whole period, just GROUP BY name and AVG the count:
SELECT `name`, avg(`count`)
FROM `table`
WHERE `category` ='something' AND `month` NOT LIKE 'May'
GROUP BY `name`;
And another option, if you don't like AVG:
SELECT `name`, sum(`count`)/(SELECT count(*) FROM `table` AS `t2` WHERE `category` ='something' AND `month` NOT LIKE 'May' and `t1`.`name` = `t2`.`name`)
FROM `table` AS `t1`
WHERE `category` ='something' AND `month` NOT LIKE 'May'
GROUP BY `name`;
But I would stay with AVG.
Actually, I prefer to use != instead of NOT LIKE; it improves readability.
Just for completeness' sake, here is a WORKING FIDDLE. Using the AVG function is the way to go, as it will do the average per person per month. Look at John in January: his counts are 5 and 6, so the average is 5.5.
SELECT
person,
month,
avg(counter)
FROM testing
where
(
category ='something'
AND month <> 'May'
)
GROUP BY person, month;
If you want to see the data in one line, as it sounds like from your post, then you can do this. ANOTHER FIDDLE
SELECT
person,
group_concat(month),
group_concat(average_count)
FROM(
SELECT
person,
month,
avg(counter) as average_count
FROM testing
where
(
category ='something'
AND month <> 'May'
)
GROUP BY person, month
) as t
group by person;
Try this:
SELECT name, SUM(count) / COUNT(DISTINCT month)
FROM table
WHERE month != 'May'
AND category = 'something'
GROUP BY name
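The difference between dividing by the number of distinct months and a plain AVG over rows can be seen in a short sketch. It uses Python's sqlite3 module and the sample rows from the question, with some assumptions: the table and column names (counts, cnt) are made up to avoid reserved words, name and month spellings are normalized because SQLite compares text case-sensitively, a May row is added to show the filter, and 1.0 * avoids SQLite's integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counts (name TEXT, month TEXT, cnt INTEGER, category TEXT)")
# Sample rows from the question (spelling normalized), plus one May row to be excluded.
conn.executemany(
    "INSERT INTO counts VALUES (?, ?, ?, 'something')",
    [
        ("John", "January", 5),
        ("John", "January", 6),
        ("Mary", "January", 5),
        ("Alex", "February", 5),
        ("John", "February", 6),
        ("John", "February", 4),
        ("Mary", "February", 3),
        ("John", "March", 4),
        ("John", "May", 10),
    ],
)

rows = conn.execute(
    """
    SELECT name,
           1.0 * SUM(cnt) / COUNT(DISTINCT month) AS avg_per_month,
           AVG(cnt)                               AS avg_per_row
    FROM counts
    WHERE month != 'May'
      AND category = 'something'
    GROUP BY name
    ORDER BY name
    """
).fetchall()

for name, per_month, per_row in rows:
    print(name, round(per_month, 2), per_row)
```

John has 25 counts spread over five rows but only three distinct months, so the per-month figure is 25/3 while AVG gives 25/5. For names with one row per month the two numbers coincide.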

MYSQL select average number of entries

I have a table that has a unique key each time a user creates a case:
id|doctor_id|created_dt
--|---------|-----------
1|23 |datetimestamp
2|23 |datetimestamp
3|17 |datetimestamp
How can I select and return the average amount of entries a user has per month?
I have tried this:
SELECT avg (id)
FROM `cases`
WHERE created_dt BETWEEN DATE_SUB(CURDATE(),INTERVAL 90 DAY) AND CURDATE()
and doctor_id = 17
But this returns a ridiculously large value that cannot be true.
To clarify: I am trying to get something like doctor id 17 has an average of 2 entries per month into this table.
I think you were thrown off by the idea of "averaging". You don't want the average id, or average user_id. You want the average number of entries into the table, so you would use COUNT():
SELECT count(id)/3 AS AverageMonthlyCases
FROM `cases`
WHERE created_dt BETWEEN DATE_SUB(CURDATE(),INTERVAL 90 DAY) AND CURDATE()
group by doctor_id
Since you have a 90 day interval, you want to count the number of rows per 30 days, or the count/3.
SELECT doctor_id, AVG(cnt)
FROM (
SELECT COUNT(id) cnt, doctor_id
FROM cases
WHERE created_dt BETWEEN <yourDateInterval>
GROUP BY doctor_id, year(created_dt), month(created_dt)
) t
GROUP BY doctor_id
Since you need the average number of entries, applying AVG to the id column is not really applicable, because AVG is SUM()/COUNT() and obviously you do not need that (why would you need the SUM of ids).
You need something like this
SELECT
doctor_id,
DATE_FORMAT(created_dt,'%m-%Y') AS month,
COUNT(id) AS visits
FROM `cases`
GROUP BY
`doctor_id`,
DATE_FORMAT(created_dt,'%m-%Y')
ORDER BY
`doctor_id` ASC,
DATE_FORMAT(created_dt,'%m-%Y') ASC
to get visits per month per doctor. If you want to average it, you can then use something like
SELECT
doctor_id,
SUM(visits)/COUNT(month) AS `average`
FROM (
SELECT
doctor_id,
DATE_FORMAT(created_dt,'%m-%Y') AS month,
COUNT(id) AS visits
FROM `cases`
GROUP BY
`doctor_id`,
DATE_FORMAT(created_dt,'%m-%Y')
ORDER BY
`doctor_id` ASC,
DATE_FORMAT(created_dt,'%m-%Y') ASC
) t1
GROUP BY
doctor_id
Obviously you can add your WHERE clauses. This query works across multiple years (i.e. it will not count January 2013 and January 2014 as one month).
Also, if a doctor has "blank" months in which he did not have any patients, those months are not counted (a 0 can distort an average).
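Here is a minimal sketch of the two-step idea (count per doctor and month, then average those counts), run through Python's sqlite3 module. The rows are hypothetical, and SQLite's strftime('%Y-%m', ...) stands in for the MySQL date formatting, so treat it as an illustration rather than a drop-in query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (id INTEGER, doctor_id INTEGER, created_dt TEXT)")
# Hypothetical rows: doctor 23 has cases in three different months, doctor 17 in one.
conn.executemany(
    "INSERT INTO cases VALUES (?, ?, ?)",
    [
        (1, 23, "2013-01-05 10:00:00"),
        (2, 23, "2013-01-20 11:00:00"),
        (3, 17, "2013-01-07 09:30:00"),
        (4, 23, "2013-02-03 14:00:00"),
        (5, 23, "2014-01-10 16:45:00"),
    ],
)

averages = conn.execute(
    """
    SELECT doctor_id, AVG(visits) AS avg_per_month
    FROM (
        -- one row per doctor and calendar month (year included, so
        -- January 2013 and January 2014 stay separate)
        SELECT doctor_id,
               strftime('%Y-%m', created_dt) AS ym,
               COUNT(id)                     AS visits
        FROM cases
        GROUP BY doctor_id, ym
    ) AS per_month
    GROUP BY doctor_id
    ORDER BY doctor_id
    """
).fetchall()

print(averages)
```

Doctor 23 has four cases across three months, giving 4/3 per month. Months without any case do not appear in the inner result, so they do not pull the average down.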
Use this to get each doctor's total number of cases, grouped by month.
Select monthname(created_dt), doctor_id, count(id) as total from cases group by 1,2 order by 1
You can also use GROUP_CONCAT() in a nested query to build a pivot-like table, where each doctor_id gets its own column.

sql multiple columns plus sum of each column

Using MySQL, I am counting the occurrence of several events (fields) over a time span of years. I then display this in columns by year. My query works perfect when grouped by year. I now want to add a final column which displays the aggregate of the years. How do I include the total of columns query?
Event 2008 2009 2010 2011 total
A 0 2 0 1 3
B 1 2 3 0 6
etc.
Here is the real query:
select
count(*) as total_docs,
YEAR(field_document_date_value) as doc_year,
field_document_facility_id_value as facility,
IF(count(IF(field_document_type_value ='LIC809',1, NULL)) >0,count(IF(field_document_type_value ='LIC809',1, NULL)),'-') as doc_type_LIC809,
IF(count(IF(field_document_type_value ='LIC9099',1, NULL)) >0,count(IF(field_document_type_value ='LIC9099',1, NULL)),'-') as doc_type_LIC9099,
IF(count(field_document_f1_value) >0,count(field_document_f1_value),'-') as substantial_compliance,
IF(count(field_document_f2_value) >0,count(field_document_f2_value),'-') as deficiencies_sited,
IF(count(field_document_f3_value) >0,count(field_document_f3_value),'-') as admin_outcome_809,
IF(count(field_document_f4_value) >0,count(field_document_f4_value),'-') as unfounded,
IF(count(field_document_f5_value) >0,count(field_document_f5_value),'-') as substantiated,
IF(count(field_document_f6_value) >0,count(field_document_f6_value),'-') as inconclusive,
IF(count(field_document_f7_value) >0,count(field_document_f7_value),'-') as further_investigation,
IF(count(field_document_f8_value) >0,count(field_document_f8_value),'-') as admin_outcome_9099,
IF(count(field_document_type_a_value) >0,count(field_document_type_a_value),'-') as penalty_type_a,
IF(count(field_document_type_b_value) >0,count(field_document_type_b_value),'-') as penalty_type_b,
IF(sum(field_document_civil_penalties_value) >0,CONCAT('$',sum(field_document_civil_penalties_value)),'-') as total_penalties,
IF(count(field_document_noncompliance_value) >0,count(field_document_noncompliance_value),'-') as total_noncompliance
from rcfe_content_type_facility_document
where YEAR(field_document_date_value) BETWEEN year(NOW()) -9 AND year(NOW())
and field_document_facility_id_value = :facility
group by doc_year
You cannot GROUP rows twice in one SELECT, so you can count rows either per year or in total. You can UNION two SELECTs (one grouped by year, the second not grouped, giving the total) to overcome this limitation, but I think it is better to compute the total from the yearly results in your script, if there is one.
Simplified example:
SELECT by_year.amount, years.date_year FROM
-- generating years pseudo table
(
SELECT 2008 AS date_year
UNION ALL SELECT 2009
UNION ALL SELECT 2010
UNION ALL SELECT 2011
) AS years
-- joining with yearly stat data
LEFT JOIN
(
SELECT SUM(value_field) AS amount, YEAR(date_field) AS date_year FROM data
GROUP BY YEAR(date_field)
) AS by_year USING(date_year)
-- appending total
UNION ALL SELECT SUM(value_field) AS amount, 'total' AS date_year FROM data
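A stripped-down version of the UNION ALL approach can be run with Python's sqlite3 module. This sketch drops the generated years table and just appends a total row to the per-year sums. The table and column names follow the simplified example, the rows are made up, and strftime('%Y', ...) replaces MySQL's YEAR().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (value_field INTEGER, date_field TEXT)")
# Hypothetical rows spread over three years.
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [
        (1, "2008-03-01"),
        (2, "2009-05-10"),
        (2, "2009-11-30"),
        (3, "2010-07-04"),
    ],
)

rows = conn.execute(
    """
    SELECT strftime('%Y', date_field) AS date_year, SUM(value_field) AS amount
    FROM data
    GROUP BY date_year
    UNION ALL
    SELECT 'total', SUM(value_field)
    FROM data
    ORDER BY 1  -- text sort: the year strings come before 'total'
    """
).fetchall()

print(rows)
```

The first SELECT produces one row per year and the second adds a single ungrouped row, so the total does not need a second pass in application code.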
WITH ROLLUP is your friend:
http://dev.mysql.com/doc/refman/5.7/en/group-by-modifiers.html
Use your original query and simply add this to the last line:
GROUP BY doc_year WITH ROLLUP
That will add a final cumulative row to your query's result set.