Impala/SQL: Can I have a different time period for each group? - mysql

I have the following table:
id | timestamp | team
----------------------------
1 | 2016-05-06 | A
2 | 2016-03-02 | A
3 | 2015-12-01 | A
4 | 2016-07-05 | B
5 | 2016-06-30 | B
6 | 2016-06-28 | B
7 | 2016-04-05 | C
8 | 2016-04-02 | C
9 | 2016-01-02 | C
I want to group by team and find the last timestamp for each team, so I did:
select team, max(timestamp) from my_table group by team
It's all working fine so far. However, now I want to find out how many distinct ids fall in the last month for each team. For example, for team A the last month runs from 2016-04-07 to 2016-05-06, so the count is 1. For team B the last month runs from 2016-06-06 to 2016-07-05, so the count is 3. And for team C the last month runs from 2016-03-06 to 2016-04-05, so the count is 2. My expected output should look like:
team | max(timestamp) | count_in_last_month
------------------------------------------------
A | 2016-05-06 | 1
B | 2016-07-05 | 3
C | 2016-04-05 | 2
Can this be derived with an Impala query? Thanks!

Join the original table with the subquery that gets the max timestamp.
SELECT t1.team, t2.month_end, COUNT(DISTINCT t1.id) AS count_in_last_month
FROM my_table AS t1
JOIN (SELECT team, MAX(timestamp) AS month_end
      FROM my_table
      GROUP BY team) AS t2
  ON t1.team = t2.team
 AND t1.timestamp BETWEEN DATE_SUB(t2.month_end, INTERVAL 1 MONTH) AND t2.month_end
GROUP BY t1.team, t2.month_end
DEMO
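To try the join locally, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Impala. It rests on a few assumptions: SQLite has no DATE_SUB, so date(month_end, '-1 month') is used instead; the timestamp column is renamed ts; and the helper name count_in_last_month is made up for illustration.

```python
import sqlite3

# Sample rows from the question: (id, ts, team).
ROWS = [
    (1, "2016-05-06", "A"), (2, "2016-03-02", "A"), (3, "2015-12-01", "A"),
    (4, "2016-07-05", "B"), (5, "2016-06-30", "B"), (6, "2016-06-28", "B"),
    (7, "2016-04-05", "C"), (8, "2016-04-02", "C"), (9, "2016-01-02", "C"),
]

# Same shape as the answer's query; date(..., '-1 month') is the SQLite
# replacement assumed here for DATE_SUB(..., INTERVAL 1 MONTH).
QUERY = """
SELECT t1.team, t2.month_end, COUNT(DISTINCT t1.id) AS count_in_last_month
FROM my_table AS t1
JOIN (SELECT team, MAX(ts) AS month_end
      FROM my_table
      GROUP BY team) AS t2
  ON t1.team = t2.team
 AND t1.ts BETWEEN date(t2.month_end, '-1 month') AND t2.month_end
GROUP BY t1.team, t2.month_end
ORDER BY t1.team
"""


def count_in_last_month(rows):
    """Load the rows into an in-memory table and run the per-team query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (id INTEGER, ts TEXT, team TEXT)")
    conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in count_in_last_month(ROWS):
        print(row)
```

Each team gets its own window because the upper and lower bounds both come from that team's row in the subquery, not from a single global date.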

Related

Sum childrens in two tables of a table

I have 3 tables:
a(id, date, ckey), b(id, a.ckey, hht, hha), c(id, a.ckey, date_ini, date_fin)
where b keeps all the activities to be done and their respective hours in two columns (hht, hha), while c stores the activities carried out with their start and end timestamps (the hours executed are the difference between the two).
Now I need to know, for each record in a, how many hours it has been assigned (b) and how many hours have been completed (c).
Currently I have this:
a:
+----------+----------+------------+
| id | date | ckey |
+----------+----------+------------+
| 1 |2018-01-20| 18 |
|----------|----------|------------|
b:
+----------+----------+--------+--------+
| id | a.ckey | hht | hha |
+----------+----------+--------+--------+
| 1 | 18 | 2 | 3 |
| 2 | 18 | 2 | 5 |
| 3 | 18 | 0 | 7 |
+----------+----------+--------+--------+
c:
+----------+----------+----------------------+----------------------+
| id | a.ckey | date_ini | date_fin |
+----------+----------+----------------------+----------------------+
| 1 | 18 | 2019-01-23 13:30:00 | 2019-01-23 14:00:00 |
| 1 | 18 | 2019-01-23 14:00:00 | 2019-01-23 14:30:00 |
+----------+----------+----------------------+----------------------+
I need this:
+----------+----------+----------------------+----------------------+
| id | a.ckey | hours | hours2 |
+----------+----------+----------------------+----------------------+
| 1 | 18 | 19 | 1 |
+----------+----------+----------------------+----------------------+
I get this:
+----------+----------+----------------------+----------------------+
| id | a.ckey | hours | hours2 |
+----------+----------+----------------------+----------------------+
| 1 | 18 | 38 | 37.5 |
+----------+----------+----------------------+----------------------+
This is my query:
SELECT
(b.hht+b.hha) AS hours,
(SUM(b.hht+b.hha) -
FORMAT(IFNULL((TIMESTAMPDIFF(MINUTE, c.date_ini, c.date_fin)/60),0),2)) AS hours2
FROM a
LEFT JOIN b ON a.key=b.akey
INNER JOIN c ON a.key=c.akey
GROUP BY a.ckey
Because tables b and c each have multiple rows per ckey value, joining them directly pairs every b row with every c row, and the duplicated rows inflate the sums. Do the aggregation inside subqueries so that each side contributes one row per key:
SELECT a.id, a.key, b.hours, FORMAT(c.minutes/60, 2) AS hours2
FROM a
LEFT JOIN (SELECT akey, SUM(hht+hha) AS hours
           FROM b
           GROUP BY akey) b ON b.akey = a.key
LEFT JOIN (SELECT akey, SUM(TIMESTAMPDIFF(MINUTE, date_ini, date_fin)) AS minutes
           FROM c
           GROUP BY akey) c ON c.akey = a.key
ORDER BY a.id
Output:
id key hours hours2
1 18 19 1.00
Demo on SQLFiddle
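The row duplication is easy to reproduce. The sketch below is an illustration in Python's sqlite3, not the original fiddle. It assumes ckey as the join column in all three tables, uses strftime('%s', ...) in place of TIMESTAMPDIFF, and the function name hours_per_record is hypothetical.

```python
import sqlite3


def hours_per_record():
    """Return (naive_sum, pre_aggregated_rows) for the question's sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (id INTEGER, date TEXT, ckey INTEGER);
        CREATE TABLE b (id INTEGER, ckey INTEGER, hht REAL, hha REAL);
        CREATE TABLE c (id INTEGER, ckey INTEGER, date_ini TEXT, date_fin TEXT);
        INSERT INTO a VALUES (1, '2018-01-20', 18);
        INSERT INTO b VALUES (1, 18, 2, 3), (2, 18, 2, 5), (3, 18, 0, 7);
        INSERT INTO c VALUES
            (1, 18, '2019-01-23 13:30:00', '2019-01-23 14:00:00'),
            (2, 18, '2019-01-23 14:00:00', '2019-01-23 14:30:00');
    """)
    # Naive join: each of the 3 rows in b pairs with each of the 2 rows in c,
    # so every b row is counted twice in the sum.
    naive = conn.execute("""
        SELECT SUM(b.hht + b.hha)
        FROM a
        JOIN b ON b.ckey = a.ckey
        JOIN c ON c.ckey = a.ckey
    """).fetchone()[0]
    # Pre-aggregated join: each derived table has one row per ckey.
    fixed = conn.execute("""
        SELECT a.id, a.ckey, b.hours, c.minutes / 60.0 AS hours2
        FROM a
        LEFT JOIN (SELECT ckey, SUM(hht + hha) AS hours
                   FROM b
                   GROUP BY ckey) b ON b.ckey = a.ckey
        LEFT JOIN (SELECT ckey,
                          SUM((CAST(strftime('%s', date_fin) AS INTEGER)
                               - CAST(strftime('%s', date_ini) AS INTEGER)) / 60) AS minutes
                   FROM c
                   GROUP BY ckey) c ON c.ckey = a.ckey
        ORDER BY a.id
    """).fetchall()
    conn.close()
    return naive, fixed


if __name__ == "__main__":
    naive, fixed = hours_per_record()
    print("naive sum:", naive)
    print("pre-aggregated:", fixed)
```

The naive sum reproduces the doubled figure of 38 from the question, while the pre-aggregated version gives 19 assigned hours and 1 completed hour.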
You're doing an m-to-n join; try UNION ALL instead:
select ckey, sum(hours) as hours, sum(hours2) as hours2
from
(
  SELECT ckey, (b.hht+b.hha) AS hours, NULL as hours2
  FROM b
  UNION ALL
  SELECT ckey, NULL AS hours,
         IFNULL(TIMESTAMPDIFF(MINUTE, c.date_ini, c.date_fin)/60, 0) as hours2
  FROM c
) as dt
group by ckey
If you actually need columns from table a, put this select in a derived table and join to it.
Please check this:
SELECT
  (SELECT SUM(hha + hht) FROM b WHERE b.ckey = a.ckey) AS hours,
  FORMAT((SELECT SUM(TIMESTAMPDIFF(MINUTE, c.date_ini, c.date_fin)/60) FROM c WHERE c.ckey = a.ckey), 2) AS hours2
FROM a
Fiddle

How to join two tables with average function and where clause? SQL

I have two tables below with the following information
project.analytics
| proj_id | list_date | state
| 1 | 03/05/10 | CA
| 2 | 04/05/10 | WA
| 3 | 03/05/10 | WA
| 4 | 04/05/10 | CA
| 5 | 03/05/10 | WA
| 6 | 04/05/10 | CA
employees.analytics
| employee_id | proj_id | worked_date
| 20 | 1 | 3/12/10
| 30 | 1 | 3/11/10
| 40 | 2 | 4/15/10
| 50 | 3 | 3/16/10
| 60 | 3 | 3/17/10
| 70 | 4 | 4/18/10
What query can I write to determine the average number of unique employees who worked on a project in the first 7 days after it was listed, grouped by month and state?
Desired output:
| list_date | state | # Unique Employees of projects first 7 day list
| March | CA | 1
| April | WA | 2
| July | WA | 2
| August | CA | 1
My Attempt
select
month(list_date),
state_name,
count(*) as Projects
from projects
group by
month(list_date),
state_name;
I understand the next steps are to subtract list_date from worked_date and, if the value is less than 7, average the count of employees from the second table, but I'm not sure which functions to use.
You could use a CASE with a DISTINCT to COUNT the unique employees that worked within the first 7 days of the list_date.
Once you have that total of employees per project, then you can calculate those averages per month & state.
SELECT
  MONTHNAME(list_date) as `ListMonth`,
  state,
  AVG(TotalUniqEmp7Days) AS `Average Unique Employees of projects first 7 day list`
FROM
(
  SELECT
    proj.proj_id,
    proj.list_date,
    proj.state,
    COUNT(DISTINCT CASE
                     WHEN emp.worked_date BETWEEN proj.list_date and DATE_ADD(proj.list_date, INTERVAL 6 DAY)
                     THEN emp.employee_id
                   END) AS TotalUniqEmp7Days
    -- , COUNT(DISTINCT emp.employee_id) AS TotalUniqEmp
  FROM project.analytics proj
  LEFT JOIN employees.analytics emp ON emp.proj_id = proj.proj_id
  GROUP BY proj.proj_id, proj.list_date, proj.state
) AS ProjectTotals
GROUP BY YEAR(list_date), MONTH(list_date), MONTHNAME(list_date), state;
A Sql Fiddle test can be found here
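The two-step approach (count per project, then average per month and state) can be sketched with Python's sqlite3. The sketch makes several assumptions: the sample dates are read as MM/DD/YY and rewritten in ISO form, the tables are called projects and employees, strftime('%Y-%m', ...) stands in for MONTHNAME, and the function name average_unique_employees is hypothetical.

```python
import sqlite3

PROJECTS = [
    (1, "2010-03-05", "CA"), (2, "2010-04-05", "WA"), (3, "2010-03-05", "WA"),
    (4, "2010-04-05", "CA"), (5, "2010-03-05", "WA"), (6, "2010-04-05", "CA"),
]
EMPLOYEES = [
    (20, 1, "2010-03-12"), (30, 1, "2010-03-11"), (40, 2, "2010-04-15"),
    (50, 3, "2010-03-16"), (60, 3, "2010-03-17"), (70, 4, "2010-04-18"),
]

QUERY = """
SELECT strftime('%Y-%m', list_date) AS list_month,
       state,
       AVG(emp_7_days) AS avg_unique_employees
FROM (
    -- Step 1: unique employees per project inside the first 7 days.
    -- The CASE yields NULL outside the window, and COUNT ignores NULLs.
    SELECT p.proj_id, p.list_date, p.state,
           COUNT(DISTINCT CASE
                            WHEN e.worked_date BETWEEN p.list_date
                                                   AND date(p.list_date, '+6 days')
                            THEN e.employee_id
                          END) AS emp_7_days
    FROM projects p
    LEFT JOIN employees e ON e.proj_id = p.proj_id
    GROUP BY p.proj_id, p.list_date, p.state
) AS project_totals
-- Step 2: average those per-project counts by month and state.
GROUP BY list_month, state
ORDER BY list_month, state
"""


def average_unique_employees():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE projects (proj_id INTEGER, list_date TEXT, state TEXT)")
    conn.execute("CREATE TABLE employees (employee_id INTEGER, proj_id INTEGER, worked_date TEXT)")
    conn.executemany("INSERT INTO projects VALUES (?, ?, ?)", PROJECTS)
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", EMPLOYEES)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in average_unique_employees():
        print(row)
```

With this sample data only project 1 has an employee inside its 7-day window, so projects with no qualifying work pull their group's average down to zero; that is the effect of the LEFT JOIN keeping them in the result.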
I think this is the code that you want:
select
  p.list_date, p.state,
  emp.no_of_unique_emp
from project.analytics p
inner join (
  select
    t.proj_id,
    count(t.employee_id) as no_of_unique_emp
  from (
    -- keep only work done within the first 7 days of the listing
    select distinct e.employee_id, e.proj_id
    from employees.analytics e
    inner join project.analytics p2 on p2.proj_id = e.proj_id
    where datediff(e.worked_date, p2.list_date) between 0 and 6
  ) t
  group by t.proj_id
) emp
on emp.proj_id = p.proj_id

Get the most recent submission from a team in a submissions table (SQL)

I have a table that can be simplified as:
ID |team_id | submission file | date
========================================
1 | 1756 | final_project.c |2018-06-22 19:00:00
2 | 1923 | asdf.c |2018-06-22 16:00:00
3 | 1756 | untitled.c |2018-06-21 20:00:00
4 | 1923 | my_project.c |2018-06-21 14:00:00
5 | 1756 | untitled.c |2018-06-21 08:00:00
6 | 1814 | my_project.c |2018-06-20 12:00:00
This is a table of people submitting their projects to me, but I only want each student's most recent submission; each student has a unique team_id.
How do I retrieve the most recent row for each team_id so that my result looks like this:
ID |team_id | submission file | date
========================================
1 | 1756 | final_project.c |2018-06-22 19:00:00
2 | 1923 | asdf.c |2018-06-22 16:00:00
6 | 1814 | my_project.c |2018-06-20 12:00:00
Thank you for the help!
A correlated subquery will do what you want:
select t.*
from table t -- replace table with your table name, e.g. Projecttble
where id = (select t1.id
            from table t1 -- replace table with your table name
            where t1.team_id = t.team_id
            order by t1.date desc
            limit 1
           );
You could use a self join to pick most recent row per team
select a.*
from your_table a
join (
select team_id, max(date) date
from your_table
group by team_id
) b on a.team_id = b.team_id and a.date = b.date
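As a quick check of the self-join, here is a small sketch in Python's sqlite3 using the question's six rows. The table name submissions, the alias max_date, and the function name latest_per_team are assumptions for illustration.

```python
import sqlite3

SUBMISSIONS = [
    (1, 1756, "final_project.c", "2018-06-22 19:00:00"),
    (2, 1923, "asdf.c", "2018-06-22 16:00:00"),
    (3, 1756, "untitled.c", "2018-06-21 20:00:00"),
    (4, 1923, "my_project.c", "2018-06-21 14:00:00"),
    (5, 1756, "untitled.c", "2018-06-21 08:00:00"),
    (6, 1814, "my_project.c", "2018-06-20 12:00:00"),
]

# Join each row to its team's maximum date; only the latest rows match.
QUERY = """
SELECT a.id, a.team_id, a.submission_file, a.date
FROM submissions a
JOIN (SELECT team_id, MAX(date) AS max_date
      FROM submissions
      GROUP BY team_id) b
  ON a.team_id = b.team_id AND a.date = b.max_date
ORDER BY a.id
"""


def latest_per_team(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE submissions "
        "(id INTEGER, team_id INTEGER, submission_file TEXT, date TEXT)"
    )
    conn.executemany("INSERT INTO submissions VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in latest_per_team(SUBMISSIONS):
        print(row)
```

One caveat of this technique: if a team has two rows with exactly the same latest date, both rows are returned, because the join condition cannot tell them apart.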
EDIT: Yogesh beat me to the MySQL 5 answer while I was AFK. You'll want to join your subquery on both the team_id and the submission_file in case you get multiple file submissions from a team on the same day.
Depending on what version of MySQL you are using, this can be done different ways.
SETUP
CREATE TABLE t1 (ID int, team_id int, submission_file varchar(30), the_date date) ;
INSERT INTO t1 (ID,team_id,submission_file,the_date)
SELECT 1, 1756, 'final_project.c', '2018-06-20 19:00:00' UNION ALL
SELECT 2, 1923, 'asdf.c', '2018-06-22 16:00:00' UNION ALL /**/
SELECT 3, 1756, 'untitled.c', '2018-06-21 20:00:00' UNION ALL /**/
SELECT 4, 1923, 'my_project.c', '2018-06-21 14:00:00' UNION ALL /**/
SELECT 5, 1756, 'untitled.c', '2018-06-21 08:00:00' UNION ALL
SELECT 6, 1814, 'my_project.c', '2018-06-20 12:00:00' UNION ALL/**/
SELECT 7, 1756, 'final_project.c', '2018-06-21 19:00:00' UNION ALL
SELECT 8, 1756, 'final_project.c', '2018-06-22 00:00:00' /**/
;
QUERIES
If you are using MySQL 5.x or lower, then you'll want to use a correlated subquery with a LIMIT on it to pull up just the rows you want.
/* MySQL <8 */
SELECT a.*
FROM t1 a
WHERE a.id = (
SELECT b.id
FROM t1 b
WHERE b.team_id = a.team_id
AND b.submission_file = a.submission_file
ORDER BY b.the_date DESC
LIMIT 1
) ;
ID | team_id | submission_file | the_date
-: | ------: | :-------------- | :---------
2 | 1923 | asdf.c | 2018-06-22
3 | 1756 | untitled.c | 2018-06-21
4 | 1923 | my_project.c | 2018-06-21
6 | 1814 | my_project.c | 2018-06-20
8 | 1756 | final_project.c | 2018-06-22
MySQL 8 added window functions (FINALLY), and this makes a problem like this MUCH easier to solve, and likely much more efficient, too. You can number the rows within each group with a ROW_NUMBER() window function and keep only the first one.
/* MySQL 8+ */
SELECT s1.ID, s1.team_id, s1.submission_file, s1.the_date
FROM (
SELECT ID, team_id, submission_file, the_date
, ROW_NUMBER() OVER (PARTITION BY team_id, submission_file ORDER BY the_date DESC) AS rn
FROM t1
) s1
WHERE rn = 1
;
ID | team_id | submission_file | the_date
-: | ------: | :-------------- | :---------
8 | 1756 | final_project.c | 2018-06-22
3 | 1756 | untitled.c | 2018-06-21
6 | 1814 | my_project.c | 2018-06-20
2 | 1923 | asdf.c | 2018-06-22
4 | 1923 | my_project.c | 2018-06-21
db<>fiddle here
NOTE: After re-reading the OP, the intent may be different from what I originally read. My queries return the most recent version of every unique submission_file name that a team submitted. So if a team submitted 3 files, you will get the most recent version of each of those 3 files. If you remove submission_file from the MySQL 5 subquery and from the MySQL 8 PARTITION BY, the queries return only the single most recent file a team submitted, regardless of name.
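The variant described in the NOTE, partitioning by team_id only, can be sketched with Python's sqlite3, which supports window functions from SQLite 3.25 onwards. The timestamps are stored as full text values here rather than DATE, the ID tie-breaker in the ORDER BY is an addition, and the function name latest_file_per_team is hypothetical.

```python
import sqlite3

# Rows from the answer's SETUP section, keeping the time part.
SETUP_ROWS = [
    (1, 1756, "final_project.c", "2018-06-20 19:00:00"),
    (2, 1923, "asdf.c", "2018-06-22 16:00:00"),
    (3, 1756, "untitled.c", "2018-06-21 20:00:00"),
    (4, 1923, "my_project.c", "2018-06-21 14:00:00"),
    (5, 1756, "untitled.c", "2018-06-21 08:00:00"),
    (6, 1814, "my_project.c", "2018-06-20 12:00:00"),
    (7, 1756, "final_project.c", "2018-06-21 19:00:00"),
    (8, 1756, "final_project.c", "2018-06-22 00:00:00"),
]

# PARTITION BY team_id only: one numbered sequence per team, newest first.
QUERY = """
SELECT s1.ID, s1.team_id, s1.submission_file, s1.the_date
FROM (
    SELECT ID, team_id, submission_file, the_date,
           ROW_NUMBER() OVER (PARTITION BY team_id
                              ORDER BY the_date DESC, ID DESC) AS rn
    FROM t1
) s1
WHERE s1.rn = 1
ORDER BY s1.team_id
"""


def latest_file_per_team(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t1 "
        "(ID INTEGER, team_id INTEGER, submission_file TEXT, the_date TEXT)"
    )
    conn.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in latest_file_per_team(SETUP_ROWS):
        print(row)
```

Unlike a join on MAX(the_date), ROW_NUMBER() always yields exactly one row per partition, so ties on the date do not produce duplicates.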

MySQL - Update table with row number per group

Sample Data
id | order_id | instalment_num | date_due
---------------------------------------------------------
1 | 10000 | 1 | 2010-07-09 00:00:00
2 | 10000 | 1 | 2010-09-06 11:39:56
3 | 10001 | 1 | 2014-04-25 15:46:52
4 | 10002 | 1 | 2010-01-11 00:00:00
5 | 10003 | 1 | 2010-01-04 00:00:00
6 | 10003 | 1 | 2016-05-31 00:00:00
7 | 10003 | 1 | 2010-01-08 00:00:00
8 | 10003 | 1 | 2010-01-06 09:06:26
9 | 10004 | 1 | 2010-01-11 11:25:07
10 | 10004 | 1 | 2010-01-12 07:06:42
Desired Result
id | order_id | instalment_num | date_due
---------------------------------------------------------
1 | 10000 | 1 | 2010-07-09 00:00:00
2 | 10000 | 2 | 2010-09-06 11:39:56
3 | 10001 | 1 | 2014-04-25 15:46:52
4 | 10002 | 1 | 2010-01-11 00:00:00
5 | 10003 | 1 | 2010-01-04 00:00:00
8 | 10003 | 2 | 2010-01-06 09:06:26
7 | 10003 | 3 | 2010-01-08 00:00:00
6 | 10003 | 4 | 2016-05-31 00:00:00
9 | 10004 | 1 | 2010-01-11 11:25:07
10 | 10004 | 2 | 2010-01-12 07:06:42
As you can see, I have an instalment_num column which should show the number/index of each row belonging to the order_id, determined by the date_due ASC, id ASC order.
How can I update the instalment_num column like this?
Additional Notes
The date_due column is not unique, and there may be many ids or order_ids with the exact same timestamp.
If the timestamp is the same for two rows belonging to the same order_id, it should order them by id as a fallback.
I require a query which will update this column.
This is how I would do it:
SELECT a.id,
a.order_id,
COUNT(b.id)+1 AS instalment_num,
a.date_due
FROM sample_data a
LEFT JOIN sample_data b ON a.order_id=b.order_id AND (a.date_due>b.date_due OR (a.date_due=b.date_due AND a.id>b.id))
GROUP BY a.id, a.order_id, a.date_due
ORDER BY a.order_id, a.date_due, a.id
UPDATE version attempt:
UPDATE sample_data
LEFT JOIN (SELECT a.id,
COUNT(b.id)+1 AS instalment_num
FROM sample_data a
JOIN sample_data b ON a.order_id=b.order_id AND (a.date_due>b.date_due OR (a.date_due=b.date_due AND a.id>b.id))
GROUP BY a.id) c ON c.id=sample_data.id
SET sample_data.instalment_num=c.instalment_num
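The attempt above leaves the first instalment of each order as NULL, because the inner join in the subquery drops rows that have no earlier row. For the numbering to begin with 1, let each row also match itself and count the matches:
UPDATE sample_data
LEFT JOIN (SELECT a.id,
                  COUNT(b.id) AS instalment_num
           FROM sample_data a
           JOIN sample_data b ON a.order_id = b.order_id AND (a.date_due > b.date_due OR (a.date_due = b.date_due AND a.id >= b.id))
           GROUP BY a.id) c ON c.id = sample_data.id
SET sample_data.instalment_num = c.instalment_num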
You are trying to achieve what ROW_NUMBER with a partition would do in something like SQL Server or Oracle. You can simulate this with an appropriate query:
SELECT t.id, t.order_id,
       (
         SELECT 1 + COUNT(*)
         FROM sampleData
         WHERE (date_due < t.date_due OR (date_due = t.date_due AND id < t.id)) AND
               order_id = t.order_id
       ) AS instalment_num,
       t.date_due
FROM sampleData t
ORDER BY t.order_id, t.date_due, t.id
This query numbers instalment_num by date_due in ascending order. In the case of a tie in date_due, it orders by id in ascending order.
Follow the link below for a demo:
SQLFiddle
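The counting logic can also drive the UPDATE itself. The sketch below uses Python's sqlite3 and is only an illustration: SQLite accepts a correlated subquery against the table being updated, whereas MySQL generally rejects that form, which is why the MySQL answers join to a derived table instead. The function name renumber_instalments is hypothetical.

```python
import sqlite3

SAMPLE = [
    (1, 10000, 1, "2010-07-09 00:00:00"), (2, 10000, 1, "2010-09-06 11:39:56"),
    (3, 10001, 1, "2014-04-25 15:46:52"), (4, 10002, 1, "2010-01-11 00:00:00"),
    (5, 10003, 1, "2010-01-04 00:00:00"), (6, 10003, 1, "2016-05-31 00:00:00"),
    (7, 10003, 1, "2010-01-08 00:00:00"), (8, 10003, 1, "2010-01-06 09:06:26"),
    (9, 10004, 1, "2010-01-11 11:25:07"), (10, 10004, 1, "2010-01-12 07:06:42"),
]

# For each row, count the rows of the same order that sort at or before it
# by (date_due, id). The row itself is included, so numbering starts at 1.
UPDATE = """
UPDATE sample_data
SET instalment_num = (
    SELECT COUNT(*)
    FROM sample_data AS b
    WHERE b.order_id = sample_data.order_id
      AND (b.date_due < sample_data.date_due
           OR (b.date_due = sample_data.date_due AND b.id <= sample_data.id))
)
"""


def renumber_instalments(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sample_data "
        "(id INTEGER, order_id INTEGER, instalment_num INTEGER, date_due TEXT)"
    )
    conn.executemany("INSERT INTO sample_data VALUES (?, ?, ?, ?)", rows)
    conn.execute(UPDATE)
    result = conn.execute(
        "SELECT id, instalment_num FROM sample_data ORDER BY id"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in renumber_instalments(SAMPLE):
        print(row)
```

The subquery reads only order_id, date_due and id, none of which the UPDATE changes, so the result does not depend on the order in which rows are updated.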
select
  sub.order_id, sub.date_due,
  @group_rn := case
      when @group_order_id = sub.order_id then @group_rn + 1
      else 1
    end as instalment_num,
  @group_order_id := sub.order_id
FROM (select @group_rn := 0, @group_order_id := 0) init,
  (select *
   from the_table
   order by order_id, date_due, id) sub

MySQL - how to select id where min/max dates difference is more than 3 years

I have a table like this:
| id | date | user_id |
----------------------------------------------------
| 1 | 2008-01-01 | 10 |
| 2 | 2009-03-20 | 15 |
| 3 | 2008-06-11 | 10 |
| 4 | 2009-01-21 | 15 |
| 5 | 2010-01-01 | 10 |
| 6 | 2011-06-01 | 10 |
| 7 | 2012-01-01 | 10 |
| 8 | 2008-05-01 | 15 |
I'm looking for a way to select the user_id values where the difference between the MIN and MAX dates is more than 3 years. For the above data I should get:
| user_id |
-----------------------
| 10 |
Can anyone help?
SELECT user_id
FROM mytable
GROUP BY user_id
HAVING MAX(`date`) > (MIN(`date`) + INTERVAL '3' YEAR);
Tested here: http://sqlize.com/MC0618Yg58
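For a local check of the HAVING approach, here is a minimal sketch in Python's sqlite3. It assumes SQLite's date(x, '+3 years') as the counterpart of MySQL's + INTERVAL 3 YEAR, renames the date column to dt, and uses a made-up function name, users_spanning_three_years.

```python
import sqlite3

ROWS = [
    (1, "2008-01-01", 10), (2, "2009-03-20", 15), (3, "2008-06-11", 10),
    (4, "2009-01-21", 15), (5, "2010-01-01", 10), (6, "2011-06-01", 10),
    (7, "2012-01-01", 10), (8, "2008-05-01", 15),
]

# Compare each user's latest date with their earliest date plus 3 years.
QUERY = """
SELECT user_id
FROM mytable
GROUP BY user_id
HAVING MAX(dt) > date(MIN(dt), '+3 years')
ORDER BY user_id
"""


def users_spanning_three_years(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER, dt TEXT, user_id INTEGER)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    result = [r[0] for r in conn.execute(QUERY)]
    conn.close()
    return result


if __name__ == "__main__":
    print(users_spanning_three_years(ROWS))
```

Adding a calendar interval to MIN handles leap years without any day arithmetic, which is the main advantage over comparing a day count against a multiple of 365.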
Similar to bernie's approach, I'd keep date formats native. I'd also list the MAX first to avoid an ABS call (this ensures a positive number is always returned).
SELECT user_id
FROM my_table
GROUP BY user_id
HAVING DATEDIFF(MAX(date), MIN(date)) > 3 * 365
DATEDIFF just returns the delta (in days) between two given dates, so 3 * 365 approximates three years and ignores leap days.
SELECT user_id
FROM (SELECT user_id, MIN(date) m0, MAX(date) m1
      FROM mytable
      GROUP BY user_id) AS t
WHERE EXTRACT(YEAR FROM m1) - EXTRACT(YEAR FROM m0) > 3
SELECT DISTINCT A.USER_ID FROM mytable AS A
JOIN mytable AS B
ON A.USER_ID = B.USER_ID
WHERE DATEDIFF(A.DATE, B.DATE) > 3 * 365