I have this table below and want to get the min value of quantity, max value of quantity, first value of quantity and last value of quantity. The new table should be grouped by date with a 1 day interval.
id item quantity date
1 xLvCm 2 2020-01-10 19:15:03
1 UBizL 4 2020-01-10 20:16:41
1 xLvCm 1 2020-01-10 21:21:12
1 xLvCm 3 2020-01-11 11:14:00
1 UBizL 1 2020-01-11 15:01:10
1 moJEe 4 2020-01-12 00:15:50
1 moJEe 1 2020-01-12 02:11:23
1 UBizL 1 2020-01-12 04:16:17
1 KiZoX 3 2020-01-13 10:10:02
1 KiZoX 2 2020-01-13 19:05:40
1 KiZoX 1 2020-01-13 20:14:33
This is the expected table result
min(quantity) max(quantity) first(quantity) last(quantity) date
1 4 2 1 2020-01-10 19:15:03
1 3 3 1 2020-01-11 11:14:00
1 4 4 1 2020-01-12 00:15:50
1 3 3 1 2020-01-13 10:10:02
The SQL query I have tried is
SELECT MIN(quantity), MAX(quantity), FIRST(quantity), LAST(quantity) FROM tablename GROUP BY date
I can't figure out how to include the first and last values of quantity and group by day (like 10, 11, 12, 13) instead of date like (2020-01-10 19:15:03)
It is important to state which database you are using, because the available functionality differs between them. But if you were using Snowflake, this is something I would try:
select distinct day(date) as day_of_month,
       min(quantity) over (partition by day(date) order by date range between unbounded preceding and unbounded following) as min_quantity,
       max(quantity) over (partition by day(date) order by date range between unbounded preceding and unbounded following) as max_quantity,
       last_value(quantity) over (partition by day(date) order by date range between unbounded preceding and unbounded following) as last_quantity,
       first_value(quantity) over (partition by day(date) order by date range between unbounded preceding and unbounded following) as first_quantity
from demo_db.staging.test
It is important to note that this is a costly query. If your table is huge this might take too long.
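One caveat with this version: day(date) returns only the day of the month, so rows from different months would fall into the same group. If your data can span more than one month, a sketch of the same idea partitioned by the full calendar date (still assuming Snowflake and the same demo_db.staging.test table) would be:
select distinct
       to_date(date) as day_date,
       min(quantity) over (partition by to_date(date)) as min_quantity,
       max(quantity) over (partition by to_date(date)) as max_quantity,
       first_value(quantity) over (partition by to_date(date) order by date
                                   rows between unbounded preceding and unbounded following) as first_quantity,
       last_value(quantity) over (partition by to_date(date) order by date
                                   rows between unbounded preceding and unbounded following) as last_quantity
from demo_db.staging.test;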
A common approach to this problem is to use window functions and aggregation. Here is one method:
SELECT date(date), MIN(quantity), MAX(quantity),
MAX(CASE WHEN seqnum_a = 1 THEN quantity END) as first_quantity,
MAX(CASE WHEN seqnum_d = 1 THEN quantity END) as last_quantity
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY date(date) ORDER BY date) as seqnum_a,
ROW_NUMBER() OVER (PARTITION BY date(date) ORDER BY date desc) as seqnum_d
FROM tablename t
) t
GROUP BY date(date);
Try this:
select A.minquantity, A.maxquantity, B.firstquantity, C.lastquantity, A.date from (
    (select min(quantity) as minquantity, max(quantity) as maxquantity, date(date) as date
     from Test group by date(date)) A
    join
    (select date(date) as date, quantity as firstquantity
     from Test where date in (select min(date) from Test group by date(date))) B
    on A.date = B.date
    join
    (select date(date) as date, quantity as lastquantity
     from Test where date in (select max(date) from Test group by date(date))) C
    on A.date = C.date
);
Output:
1 4 2 1 2020-01-10
1 3 3 1 2020-01-11
1 4 4 1 2020-01-12
1 3 3 1 2020-01-13
I am trying to create a query for getting the current streak in MySQL based on status
ID  Dated       Status
1   2022-03-08  1
2   2022-03-09  1
3   2022-03-10  0
4   2022-03-11  1
5   2022-03-12  0
6   2022-03-13  1
7   2022-03-14  1
8   2022-03-16  1
9   2022-03-18  0
10  2022-03-19  1
11  2022-03-20  1
In the above table the current streak should be 2 (i.e. the rows for 2022-03-19 and 2022-03-20) based on status 1. Any help or ideas would be greatly appreciated!
WITH cte AS (
SELECT SUM(Status) OVER (ORDER BY Dated DESC) s1,
SUM(NOT Status) OVER (ORDER BY Dated DESC) s2
FROM yourTable
)
SELECT MAX(s1)
FROM cte
WHERE NOT s2;
SELECT DATEDIFF(MAX(CASE WHEN Status THEN Dated END),
MAX(CASE WHEN NOT Status THEN Dated END))
FROM yourTable
and so on...
This is a gaps and islands problem. In your case, you want the island of status 1 records which occurs last. We can use the difference in row numbers method, assuming you are using MySQL 8+.
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY Dated) rn1,
ROW_NUMBER() OVER (PARTITION BY Status ORDER BY Dated) rn2
FROM yourTable
),
cte2 AS (
SELECT *, RANK() OVER (ORDER BY rn1 - rn2 DESC) rnk
FROM cte
WHERE Status = 1
)
SELECT ID, Dated, Status
FROM cte2
WHERE rnk = 1
ORDER BY Dated;
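If you only need the length of that current streak (2 for the sample data), a minimal follow-up on the same CTEs might look like this; it simply counts the rows of the last island (the yourTable name is still assumed):
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY Dated) rn1,
              ROW_NUMBER() OVER (PARTITION BY Status ORDER BY Dated) rn2
    FROM yourTable
),
cte2 AS (
    SELECT *, RANK() OVER (ORDER BY rn1 - rn2 DESC) rnk
    FROM cte
    WHERE Status = 1
)
-- each Status = 1 island has a constant rn1 - rn2, and the last island has the largest value,
-- so rnk = 1 keeps only its rows and COUNT(*) gives the streak length
SELECT COUNT(*) AS current_streak
FROM cte2
WHERE rnk = 1;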
We can use two one-row CTEs to find the latest date where the status was not the same as the latest one, and then count the records after it.
**Schema (MySQL v8.0)**
create table t(
ID int,
Dated date,
Status int);
insert into t values
(1,'2022-03-08',1),
(2,'2022-03-09',1),
(3,'2022-03-10',0),
(4,'2022-03-11',1),
(5,'2022-03-12',0),
(6,'2022-03-13',1),
(7,'2022-03-14',1),
(8,'2022-03-16',1),
(9,'2022-03-18',0),
(10,'2022-03-19',1),
(11,'2022-03-20',1);
---
**Query #1**
with latest as (
    select dated as lastDate,
           status as lastStatus
    from t
    order by dated desc
    limit 1
),
lastDiff as (
    select max(dated) as diffDate
    from t, latest
    where not status = lastStatus
)
select count(*)
from t, lastDiff
where dated > diffDate;
| count(*) |
| -------- |
| 2 |
---
[View on DB Fiddle](https://www.db-fiddle.com/)
We could also consider using datediff() to find the number of days the streak has lasted, which might be more interesting than count(), since there are days with no record at all.
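For example, a minimal sketch reusing the latest and lastDiff CTEs from above (same table t; note this measures days since the last row with a different status, which is an assumption about how you want the streak defined):
with latest as (
    select dated as lastDate,
           status as lastStatus
    from t
    order by dated desc
    limit 1
),
lastDiff as (
    select max(dated) as diffDate
    from t, latest
    where not status = lastStatus
)
-- number of days between the last status change and the latest record
select datediff(latest.lastDate, lastDiff.diffDate) as streak_days
from latest, lastDiff;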
Trying to get the 2nd transaction month details for all the customers
Date User_id amount
2021-11-01 1 100
2021-11-21 1 200
2021-12-20 2 110
2022-01-20 2 200
2022-02-04 1 50
2022-02-21 1 100
2022-03-22 2 200
For every customer, get all the records in the month of their 2nd transaction (there can be multiple transactions in a month, or even on a single day, by a particular user).
Expected Output
Date User_id amount
2022-02-04 1 50
2022-02-21 1 100
2022-01-20 2 200
You can use dense_rank:
select Date, User_id, amount from
(select *, dense_rank() over(partition by User_id order by year(Date), month(date)) r
from table_name) t
where r = 2;
If dense_rank is an option, you can:
with cte1 as (
select *, extract(year_month from date) as yyyymm
from t
), cte2 as (
select *, dense_rank() over (partition by user_id order by yyyymm) as dr
from cte1
)
select *
from cte2
where dr = 2
Note that it is possible to write the above using a single CTE.
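For instance, a sketch of the single-CTE form (same table t and column names assumed):
with cte as (
    select *, dense_rank() over (partition by user_id
                                 order by extract(year_month from date)) as dr
    from t
)
select *
from cte
where dr = 2;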
If the gap in a date chain is more than 30 days, it needs to be logged as a separate date chain, so the same id can have multiple date chains. For example, this is my input table.
id date
1 2021-01-01
1 2021-01-02
1 2021-01-03
1 2021-01-10
1 2021-01-20
1 2021-03-20
1 2021-03-21
1 2021-03-22
1 2021-04-02
Output
id start_date end_date
1 2021-01-01 2021-01-20
1 2021-03-20 2021-04-02
Does anyone know how to do this in SQL or pandas?
You can use lag() to identify where new date chains start, then use a cumulative sum and aggregation:
select id, min(date) as start_date, max(date) as end_date
from (select t.*,
             sum(case when prev_date >= date - interval 30 day then 0 else 1 end) over (partition by id order by date) as grp
      from (select t.*,
                   lag(date) over (partition by id order by date) as prev_date
            from t
           ) t
     ) t
group by id, grp;
You can use groupby with the time difference between consecutive rows (needs sorting) to form the groups, and agg to get the first/last date:
df = df.sort_values(by='date')
(df.groupby(df['date'].diff().gt(pd.Timedelta('30d')).cumsum())
['date'].agg(start_date='first', end_date='last')
)
output:
start_date end_date
date
0 2021-01-01 2021-01-20
1 2021-03-20 2021-04-02
To ensure keeping the "id", you can add "id" to the groupby:
df = df.sort_values(by='date')
(df.groupby(['id', df['date'].diff().gt(pd.Timedelta('30d')).cumsum()])
['date'].agg(start_date='first', end_date='last')
.droplevel(1) # to remove the "date" group
# .reset_index() # uncomment to get "id" as column
)
output:
start_date end_date
id
1 2021-01-01 2021-01-20
1 2021-03-20 2021-04-02
Original Data:
ID Date Original_col
A 2021-04-10 1
B 2021-03-01 1
B 2021-05-01 1
C 2021-03-01 1
C 2021-03-02 2
C 2021-03-03 3
C 2021-05-07 1
Result data:
ID Date Result_col
A 2021-04-10 1
B 2021-03-01 1
B 2021-05-01 1
C 2021-03-01 3
C 2021-05-07 1
For ID = 'C', the records with dates from '2021-03-01' to '2021-03-03' are grouped together; only the start date '2021-03-01' and the max value '3' are kept. The record with date = '2021-05-07' stands on its own because no later consecutive records follow it.
There are no strict restrictions on the date period; I need to group records together whenever they are consecutive according to Original_col.
You can identify the periods by subtracting an enumerated value. This is constant for "adjacent" days. The rest is just aggregation:
select id, min(date), max(original_col) as result_col
from (select t.*,
row_number() over (partition by id order by date) as seqnum
from t
) t
group by id, (date - interval seqnum day);
If original_col is really enumerating the adjacent dates, then you don't even need a subquery:
select id, min(date), max(original_col) as result_col
from t
group by id, (date - interval original_col day);
However, I don't know if the values are just coincidences in the sample data in the question.
Suppose I have the following set in a table:
empid  start_time  end_time
1      8           9
1      9           10
1      11          12
1      12          13
1      13          14
1      14          15
I want an SQL query (or an SQL process) that converts the previous set into the following one:
empid  start_time  end_time
1      8           10
1      11          15
It means that if the end_time of a record equals the start_time of the next record, we should remove one record and update the other with the new value (of course without touching the main table).
This is a type of gaps-and-islands problem. In this case, you can use lag to see where an "island" starts, then use a cumulative sum to assign the same number within an island and aggregate:
select empid, min(start_time), max(end_time)
from (select t.*,
sum(case when prev_end_time = start_time then 0 else 1 end) over (partition by empid order by start_time) as island
from (select t.*,
lag(end_time) over (partition by empid order by start_time) as prev_end_time
from t
) t
) t
group by empid, island;
Here is a db<>fiddle.
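If you want to try it quickly, a minimal setup matching the sample data might look like this (the table name t and integer start_time/end_time columns are assumptions):
create table t (
    empid int,
    start_time int,
    end_time int
);

insert into t values
    (1, 8, 9), (1, 9, 10),
    (1, 11, 12), (1, 12, 13), (1, 13, 14), (1, 14, 15);
Running the query above against this data returns (1, 8, 10) and (1, 11, 15), matching the expected output.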