I apologize if this question has already been asked; I attempted a search but could not find a relevant thread.
I have been given a semi-large data source (~15m records) which I need to perform some analysis on to determine user behavior. The data source includes fields for the User ID, date of the transaction, and a flag to indicate if the transaction had a certain characteristic. Obviously I'm simplifying here to get to the core of the question. The number of transactions by user will vary quite a bit (from 1 to 200+), the date distribution will vary, and the distribution of flags will vary.
Consider the following table:
ID User ID Date Flag
1 1 2015-01-03 Y
2 1 2015-03-15 N
3 1 2015-07-20 N
4 1 2015-11-18 N
5 1 2015-11-29 N
6 2 2015-02-16 Y
7 2 2015-03-03 N
8 2 2015-06-10 Y
9 2 2015-08-10 Y
How would one go about querying this data to isolate records based upon the characteristics of other records for the same user before or after?
For example:
How would one identify records with a 'Y' flag which are followed by three other records (ordered by date) for the same User ID with an 'N' flag? [Would return 1 in the above table]
How would one identify User IDs where 50% or more of their transactions with 'Y' flags occur in the first 20% of their transactions? [Would return User ID 1 in the above table]
I hope the question is clear enough.
Edit: The answer below is correct; however, the answerer did not know that I am using MySQL as the database (I added the tag after the answer was posted). MySQL versions before 8.0 do not support these functions; either Oracle or SQL Server would be able to run them.
This question assumes a reasonable database that supports window/analytic functions.
The first question can be handled using lead():
select t.*
from (select t.*,
             lead(flag, 1) over (partition by user_id order by date) as flag_1,
             lead(flag, 2) over (partition by user_id order by date) as flag_2,
             lead(flag, 3) over (partition by user_id order by date) as flag_3
      from t
     ) t
where flag = 'Y' and flag_1 = 'N' and flag_2 = 'N' and flag_3 = 'N';
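A minimal sketch for trying the lead() query on the sample rows from the question. It assumes Python's built-in sqlite3 module backed by SQLite 3.25 or later (for window functions) rather than the asker's database; the table name t and the column names id, user_id, date and flag are assumptions made for illustration.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, user_id integer, date text, flag text)")
conn.executemany(
    "insert into t values (?, ?, ?, ?)",
    [
        (1, 1, "2015-01-03", "Y"),
        (2, 1, "2015-03-15", "N"),
        (3, 1, "2015-07-20", "N"),
        (4, 1, "2015-11-18", "N"),
        (5, 1, "2015-11-29", "N"),
        (6, 2, "2015-02-16", "Y"),
        (7, 2, "2015-03-03", "N"),
        (8, 2, "2015-06-10", "Y"),
        (9, 2, "2015-08-10", "Y"),
    ],
)

# lead(flag, k) looks k rows ahead within the same user, ordered by date.
query = """
select id
from (select t.*,
             lead(flag, 1) over (partition by user_id order by date) as flag_1,
             lead(flag, 2) over (partition by user_id order by date) as flag_2,
             lead(flag, 3) over (partition by user_id order by date) as flag_3
      from t
     ) t
where flag = 'Y' and flag_1 = 'N' and flag_2 = 'N' and flag_3 = 'N'
"""
matching_ids = [row[0] for row in conn.execute(query)]
print(matching_ids)
```

Only record 1 qualifies on this data, because every other 'Y' row has fewer than three following rows or is followed by another 'Y'.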
The second also uses window functions:
select user_id
from (select t.*,
row_number() over (partition by user_id order by date) as seqnum,
count(*) over (partition by user_id) as cnt
from t
) t
group by user_id
having sum(case when flag = 'Y' and seqnum/0.2 <= cnt then 1 else 0 end) >=
0.5 * sum(case when flag = 'Y' then 1 else 0 end);
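The second query can be checked the same way. This is a sketch under the same assumptions (Python's sqlite3 with SQLite 3.25 or later, and a hypothetical table t with columns id, user_id, date and flag), not a statement about how the asker's database behaves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, user_id integer, date text, flag text)")
conn.executemany(
    "insert into t values (?, ?, ?, ?)",
    [
        (1, 1, "2015-01-03", "Y"),
        (2, 1, "2015-03-15", "N"),
        (3, 1, "2015-07-20", "N"),
        (4, 1, "2015-11-18", "N"),
        (5, 1, "2015-11-29", "N"),
        (6, 2, "2015-02-16", "Y"),
        (7, 2, "2015-03-03", "N"),
        (8, 2, "2015-06-10", "Y"),
        (9, 2, "2015-08-10", "Y"),
    ],
)

# seqnum is the position of the row within the user's history, cnt the
# history length; seqnum / 0.2 <= cnt keeps the first 20% of the rows.
query = """
select user_id
from (select t.*,
             row_number() over (partition by user_id order by date) as seqnum,
             count(*) over (partition by user_id) as cnt
      from t
     ) t
group by user_id
having sum(case when flag = 'Y' and seqnum / 0.2 <= cnt then 1 else 0 end) >=
       0.5 * sum(case when flag = 'Y' then 1 else 0 end)
order by user_id
"""
early_flag_users = [row[0] for row in conn.execute(query)]
print(early_flag_users)
```

User 2 has four transactions, so no row falls inside the first 20% and the user is filtered out.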
So, the answer to your question is basically: Learn about window (analytic) functions.
It has been really hard for me to write a title for the question, but here is my situation: I have a bot that saves from a public API the prices of gas tagged by gas station id, time of price change, and type of fuel (it can be, for example, "Gas", "Diesel", "Natural Gas", "Special Gas", etc.).
The table gets updated every day: every time the manager of a station communicates a new price, a new record is added. I need to keep track of price variations, which is why I am not updating the record for that particular fuel type but adding a new record.
The table looks like this (more or less, the real table is a little bit more complex):
id  station_id  updated_timestamp  price  type_of_fuel
1   1           2023-01-19         1.00   Gas
2   1           2023-01-19         1.20   Diesel
3   2           2023-01-19         1.05   Gas
4   1           2023-01-20         1.10   Gas
5   2           2023-01-20         1.10   Gas
6   1           2023-01-21         1.15   Gas
One of my use cases is to return the historical data, and that is no problem, but I also need to show only the latest available price for each station and each type of fuel. In other words, I expect to receive something like
id  station_id  updated_timestamp  price  type_of_fuel
2   1           2023-01-19         1.20   Diesel
5   2           2023-01-20         1.10   Gas
6   1           2023-01-21         1.15   Gas
Starting from a basic query
SELECT *
FROM TABLE
I understood that a JOIN with a subquery might be the solution, something like:
SELECT *
FROM TABLE AS T
INNER JOIN (
    SELECT id, station_id, MAX(updated_timestamp)
    FROM TABLE
    GROUP BY station_id
) AS SUB ON T.id = SUB.id
But this is not working as expected. Any idea? I'd like to be able to write this query, and to understand why it works.
Thank you in advance.
Since MySQL 8, you can use the ROW_NUMBER window function for this kind of task.
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY station_id, type_of_fuel ORDER BY id DESC) AS rn
FROM tab
)
SELECT id, station_id, updated_timestamp, price, type_of_fuel
FROM cte
WHERE rn = 1
Check the demo here.
In MySQL 5.X you are forced to compute aggregation separately. Since you want the last price for each station_id and type of fuel, you need to look for the last id, in partition <station_id, type_of_fuel>. Then join back on matching ids.
SELECT tab.*
FROM tab
INNER JOIN (SELECT MAX(id) AS id
FROM tab
GROUP BY station_id, type_of_fuel) cte
ON tab.id = cte.id
Check the demo here.
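As a quick check that the two variants agree, here is a sketch that runs both on the sample rows. It assumes Python's built-in sqlite3 module with SQLite 3.25 or later standing in for MySQL; the table name tab follows the answer, and the ORDER BY clauses are added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tab (id INTEGER PRIMARY KEY, station_id INTEGER, "
    "updated_timestamp TEXT, price REAL, type_of_fuel TEXT)"
)
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "2023-01-19", 1.00, "Gas"),
        (2, 1, "2023-01-19", 1.20, "Diesel"),
        (3, 2, "2023-01-19", 1.05, "Gas"),
        (4, 1, "2023-01-20", 1.10, "Gas"),
        (5, 2, "2023-01-20", 1.10, "Gas"),
        (6, 1, "2023-01-21", 1.15, "Gas"),
    ],
)

# Variant 1: number the rows of each (station, fuel) group, newest first.
window_query = """
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY station_id, type_of_fuel ORDER BY id DESC) AS rn
    FROM tab
)
SELECT id, station_id, updated_timestamp, price, type_of_fuel
FROM cte
WHERE rn = 1
ORDER BY id
"""

# Variant 2: find the highest id per group, then join back to fetch the row.
join_query = """
SELECT tab.*
FROM tab
INNER JOIN (SELECT MAX(id) AS id
            FROM tab
            GROUP BY station_id, type_of_fuel) cte
        ON tab.id = cte.id
ORDER BY tab.id
"""

latest_window = conn.execute(window_query).fetchall()
latest_join = conn.execute(join_query).fetchall()
print(latest_join)
```

Both queries return the rows with ids 2, 5 and 6, which is the expected output listed in the question.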
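Note: If the "ID" field gives you a correct ordering with respect to your data, it's preferable to use it, since conditions on integers make join operations more efficient than conditions on dates. If that's not the case, you are forced to use dates instead. Because a timestamp, unlike an id, is not unique across stations and fuel types, the subquery must then also return its grouping columns and the join must match on them, changing:
SELECT MAX(id) AS id to SELECT station_id, type_of_fuel, MAX(updated_timestamp) AS updated_timestamp
ON tab.id = cte.id to ON tab.station_id = cte.station_id AND tab.type_of_fuel = cte.type_of_fuel AND tab.updated_timestamp = cte.updated_timestamp.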
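Hope this works. Note that it takes MAX(price) independently of the latest row, so it returns the latest price only when that price is also the highest one for the station and fuel type, as happens in this sample.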
with tsample as (
select '1' as id,'1' as station_id,'2023-01-19' as updated_time,'1.00' as price,'Gas' as type union all
select '2' as id,'1' as station_id,'2023-01-19' as updated_time,'1.20' as price,'Diesel' as type union all
select '3' as id,'2' as station_id,'2023-01-19' as updated_time,'1.05' as price,'Gas' as type union all
select '4' as id,'1' as station_id,'2023-01-20' as updated_time,'1.10' as price,'Gas' as type union all
select '5' as id,'2' as station_id,'2023-01-20' as updated_time,'1.10' as price,'Gas' as type union all
select '6' as id,'1' as station_id,'2023-01-21' as updated_time,'1.15' as price,'Gas' as type
)
select station_id,type,max(price),max(updated_time),max(id) from tsample group by 1,2
I have the following SQL table
username  Month
292       10
123       12
123       1
123       2
123       4
345       6
345       7
I want to query it to get each username's login streak, i.e. the count of sequential months, so the end result I am looking for looks like this:
username  Streak
292       1
123       3
345       2
How can I achieve this, taking into account the Month 12 --> Month 1 issue?
Appreciate your help.
This would give you the result you want:
select username, count(*)
from (
select
username
, month_1
, coalesce(nullif(lead(month_1)
over (partition by username
order by coalesce(nullif(month_1,12),0))
- coalesce(nullif(month_1,12),0),-1),1) as MonthsTillNext
from login_tab
) Step1
where MonthsTillNext=1
group by username
It works by calculating, for each row, the difference from the next row, where the next row is the next month in ascending order, treating 12 as 0 (refer to the ambiguity I mentioned in my comment). It then keeps only the rows for consecutive months and counts them.
Beware though, in addition to the anomaly around month:12, there is another case not considered: if the months for the user are 1,2,3 and 6,7,8 this would count as Streak:6; is it what you wanted?
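To make the two caveats concrete, here is a sketch of the longest-streak logic done outside the database, assuming the months of one user have already been fetched into Python. The function name longest_streak is hypothetical; it treats month 1 as the successor of month 12 and counts separate runs separately.

```python
def longest_streak(months):
    """Length of the longest run of consecutive months, with 12 followed by 1."""
    present = set(months)
    if len(present) == 12:
        return 12  # every month is present, so the run wraps all the way around
    best = 0
    for month in present:
        previous = 12 if month == 1 else month - 1
        if previous in present:
            continue  # not the first month of a run
        # Walk forward from the start of the run until a month is missing.
        length, current = 1, month
        while True:
            following = 1 if current == 12 else current + 1
            if following not in present:
                break
            current = following
            length += 1
        best = max(best, length)
    return best


streaks = {
    292: longest_streak([10]),
    123: longest_streak([12, 1, 2, 4]),
    345: longest_streak([6, 7]),
}
split_runs = longest_streak([1, 2, 3, 6, 7, 8])
print(streaks, split_runs)
```

On the sample data this gives 1, 3 and 2, and the months 1,2,3 and 6,7,8 yield a streak of 3 rather than 6.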
One way would be with a recursive CTE, like
WITH RECURSIVE cte (username, month, cnt) AS
(
SELECT username, month, 1
FROM test
UNION ALL
SELECT test.username, test.month, cte.cnt+1
FROM cte INNER JOIN test
ON cte.username = test.username AND CASE WHEN cte.month = 12 THEN 1 ELSE cte.month + 1 END = test.month
)
SELECT username, MAX(cnt)
FROM cte
GROUP BY username
ORDER BY username
The idea is that the CTE (named cte in my example) recursively joins back to the table on a condition where the user is the same and the month is the next one. So for user 345, you have:
Username  Month  Cnt
345       6      1
345       7      1
345       7      2
The rows with cnt=1 are from the original table (with the extra cnt column hardcoded to 1), the row with cnt=2 is from the recursive part of the query (which found a match and used cnt+1 for its cnt). The query then selects the maximum for each user.
The join uses a CASE statement to handle 12 being followed by 1.
You can see it working with your sample data in this fiddle.
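For readers without access to the fiddle, the recursive query can also be tried locally. This sketch assumes Python's built-in sqlite3 module (SQLite also supports WITH RECURSIVE) and a table named test with integer columns username and month, as in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (username INTEGER, month INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [(292, 10), (123, 12), (123, 1), (123, 2), (123, 4), (345, 6), (345, 7)],
)

# Each recursion step extends a chain by the following month (12 wraps to 1)
# and increments cnt; the longest chain per user is the streak.
query = """
WITH RECURSIVE cte (username, month, cnt) AS
(
    SELECT username, month, 1
    FROM test
    UNION ALL
    SELECT test.username, test.month, cte.cnt + 1
    FROM cte INNER JOIN test
      ON cte.username = test.username
     AND CASE WHEN cte.month = 12 THEN 1 ELSE cte.month + 1 END = test.month
)
SELECT username, MAX(cnt)
FROM cte
GROUP BY username
ORDER BY username
"""
streaks = conn.execute(query).fetchall()
print(streaks)
```

One thing to keep in mind: if a user has logins in all twelve months, the wrap-around join never stops finding a next month, so the recursion would need an additional stopping condition.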
The one shared by @EdmCoff is quite elegant.
Another one, without recursion and just using conditional logic:
with data_cte as
(
select username, month_1,
       -- lead/lag are ordered by month_1 so that they see the neighbouring months
       case when (count(month_1) over (partition by username) = 1) then 1
            when (lead(month_1) over (partition by username order by month_1) - month_1) = 1
              or (month_1 - lag(month_1) over (partition by username order by month_1)) = 1 then 1
            when (month_1 = 12 and min(month_1) over (partition by username) = 1) then 1
       end cnt
from login_tab
)
select username, count(cnt) from data_cte group by username
DB Fiddle here.
Given a database table that contains a list of race times, I need to be able to identify which of the performances are faster than earlier finish times for that athlete at a specific distance, i.e. it was their best time at the time of the performance.
Also, would it be better to update this and store it as a boolean in an additional column at rest, rather than trying to calculate it when doing a SELECT? The database isn't populated in chronological order, so I'm not sure if a TRIGGER would help. I was thinking of a query that runs on the whole table after any inserts/updates. I appreciate this may have a performance impact, so it could be run periodically rather than on each row update.
This is on a MySQL 5.6.47 server.
Example table
athleteId date distance finishTime
1 2020-01-04 5K 30:00
1 2020-01-11 5K 30:09
1 2020-01-18 5K 29:45
1 2020-01-25 5K 29:32
1 2020-02-01 5K 31:18
1 2020-02-02 10K 1:06:07
1 2020-02-08 5K 28:25
1 2020-02-23 10K 1:06:02
1 2020-02-23 10K 1:07:30
Expected output
athleteId date distance finishTime isPersonalBest
1 2020-01-04 5K 30:00 Y
1 2020-01-11 5K 30:09 N
1 2020-01-18 5K 29:45 Y
1 2020-01-25 5K 29:32 Y
1 2020-02-01 5K 31:18 N
1 2020-02-02 10K 1:06:07 Y
1 2020-02-08 5K 28:25 Y
1 2020-02-23 10K 1:06:02 Y
1 2020-02-23 10K 1:07:30 N
The data is just an example. The actual finish times are stored in seconds. There will be many more athletes and different event distances. If a performance is the first for that athlete at that distance, it would be classed as a personal best.
If you are running MySQL 8.0, you can use window functions:
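select
    t.*,
    -- the frame is empty for an athlete's first race at a distance, so min() is
    -- NULL there; comparing with >= and returning 'N' lets that row fall through to 'Y'
    case when finishTime >= min(finishTime) over(
        partition by athleteId, distance
        order by date
        rows between unbounded preceding and 1 preceding
    )
        then 'N'
        else 'Y'
    end isPersonalBest
from mytable t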
In earlier versions, one option is a correlated subquery:
select
t.*,
case when exists(
select 1
from mytable t1
where
t1.athleteId = t.athleteId
and t1.distance = t.distance
and t1.date < t.date
and t1.finishTime <= t.finishTime
)
then 'N'
else 'Y'
end isPersonalBest
from mytable t
I wouldn't recommend actually storing this derived information. Instead, you can use the above query to create a view.
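A sketch of what such a view could look like, assuming Python's built-in sqlite3 module as a stand-in for MySQL and finish times stored in seconds, as the question describes. The view name personal_bests is hypothetical; the correlated subquery is the one from this answer, so it does not rely on window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (athleteId INTEGER, date TEXT, distance TEXT, finishTime INTEGER)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (1, "2020-01-04", "5K", 1800),   # 30:00
        (1, "2020-01-11", "5K", 1809),   # 30:09
        (1, "2020-01-18", "5K", 1785),   # 29:45
        (1, "2020-01-25", "5K", 1772),   # 29:32
        (1, "2020-02-01", "5K", 1878),   # 31:18
        (1, "2020-02-02", "10K", 3967),  # 1:06:07
        (1, "2020-02-08", "5K", 1705),   # 28:25
        (1, "2020-02-23", "10K", 3962),  # 1:06:02
        (1, "2020-02-23", "10K", 4050),  # 1:07:30
    ],
)

# A row is a personal best unless an earlier race by the same athlete over
# the same distance was at least as fast.
conn.execute("""
CREATE VIEW personal_bests AS
SELECT t.*,
       CASE WHEN EXISTS (
                SELECT 1
                FROM mytable t1
                WHERE t1.athleteId = t.athleteId
                  AND t1.distance = t.distance
                  AND t1.date < t.date
                  AND t1.finishTime <= t.finishTime
            )
            THEN 'N' ELSE 'Y'
       END AS isPersonalBest
FROM mytable t
""")

flags = [
    row[0]
    for row in conn.execute(
        "SELECT isPersonalBest FROM personal_bests ORDER BY athleteId, date, finishTime"
    )
]
print(flags)
```

Because the flag is computed whenever the view is read, rows inserted out of chronological order are handled without any trigger or batch update.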
You can use a cumulative min in MySQL 8+:
select t.*,
(case when finishTime >=
min(finishTime) over (partition by athleteid, distance
order by date
rows between unbounded preceding and 1 preceding
)
then 'N' else 'Y'
end) as isPersonalBest
from t;
Here is a db<>fiddle.
In earlier versions, you could use not exists:
select t.*,
(case when not exists (select 1
from t t2
where t2.athleteid = t.athleteid and
t2.distance = t.distance and
t2.date < t.date and
t2.finishTime <= t.finishTime
)
then 'Y' else 'N'
end) as isPersonalBest
from t;
I have a data table like this
id typeid date
12 exited 01-06-2017
1 approved 05-06-2017
7 attended 08-06-2017
9 admitted 10-06-2017
45 approved 12-06-2017
67 admitted 16-06-2017
The answer I want would be something like this:
difference(days)
5
4
I want to calculate the date difference between approved and admitted (wherever they are, so I think we have to use a looping statement). I want to write a stored procedure in MySQL (version: 5.6) which returns the result in any form (maybe a table having these results).
This is actually the sort of problem for which window functions are very well suited, but since you are using version 5.6, this isn't a possibility. Here is one way to do this:
SELECT
DATEDIFF(
(SELECT t2.date FROM yourTable t2
WHERE t2.typeid = 'admitted' AND t2.date > t1.date
ORDER BY t2.date LIMIT 1),
t1.date) AS difference
FROM yourTable t1
WHERE
typeid = 'approved'
ORDER BY
date;
The logic in the above query is that we restrict to only those records which are of approved type. For each such record, using a correlated subquery, we then seek ahead in time and find the nearest record which is of admitted type. Then, we take the difference between those two dates.
Check the working demo link below.
Demo
If you are concerned about performance, you can assign a value to each row which is the cumulative number of "approved" rows. Then use this for aggregation:
select max(case when typeid = 'approved' then date end) as approved_date,
max(case when typeid = 'admitted' then date end) as admitted_date,
datediff(max(case when typeid = 'admitted' then date end),
max(case when typeid = 'approved' then date end)
) as diff
from (select t.*,
(@cnt := @cnt + (typeid = 'approved')) as grp
from (select t.* from t order by date) t cross join
(select @cnt := 0) params
) t
group by grp;
This can take advantage of an index on (date) for assigning grp. Then it just needs to do a group by.
Using a correlated subquery can become quite expensive as the size of the data grows. So for larger data, this should be much more efficient.
In either case, using window functions (available in MySQL 8+) is much, much the preferred solution.
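For illustration, one possible window-function formulation is sketched below. It is an assumption about how such a query could look, not code from either answer, and it runs on Python's built-in sqlite3 module (SQLite 3.25 or later), so it uses julianday() where MySQL 8 would use DATEDIFF. The sample dates are stored in ISO format so that they sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (id INTEGER, typeid TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?)",
    [
        (12, "exited", "2017-06-01"),
        (1, "approved", "2017-06-05"),
        (7, "attended", "2017-06-08"),
        (9, "admitted", "2017-06-10"),
        (45, "approved", "2017-06-12"),
        (67, "admitted", "2017-06-16"),
    ],
)

# For every row, look only at the rows after it and take the earliest
# 'admitted' date; then keep the 'approved' rows and subtract the dates.
query = """
SELECT CAST(julianday(next_admitted) - julianday(date) AS INTEGER) AS difference
FROM (SELECT typeid, date,
             MIN(CASE WHEN typeid = 'admitted' THEN date END)
                 OVER (ORDER BY date
                       ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) AS next_admitted
      FROM yourTable
     ) t
WHERE typeid = 'approved'
ORDER BY date
"""
differences = [row[0] for row in conn.execute(query)]
print(differences)
```

The table is scanned once, instead of running one subquery per approved row.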
A person gets a 10% commission for purchases made by his referred friends.
There are two tables :
Reference table
Transaction table
Reference Table
Person_id Referrer_id
3 1
4 1
5 1
6 2
Transaction Table
Person_id Amount Action Date
3 100 Purchase 10-20-2011
4 200 Purchase 10-21-2011
6 400 Purchase 12-15-2011
3 200 Purchase 12-30-2011
1 50 Commision 01-01-2012
1 10 Cm_Bonus 01-01-2012
2 20 Commision 01-01-2012
How to get the following Resultset for Referrer_Person_id=1
Month Ref_Pur Earn_Comm Todate_Earn_Comm BonusRecvd Paid Due
10-2011 300 30 30 0 0 30
11-2011 0 0 30 0 0 30
12-2011 200 20 50 0 0 50
01-2012 0 0 50 10 50 0
Labels used above are:
Ref_Pur = Total purchases by referred friends for that month
Earn_Comm = 10% commission earned for that month
Todate_Earn_Comm = Running total of commission earned up to that month
MySQL code that I wrote:
SELECT dx1.month,
dx1.ref_pur,
dx1.earn_comm,
( @cum_earn := @cum_earn + dx1.earn_comm ) as todate_earn_comm
FROM
(
select date_format(`date`,'%Y-%m') as month,
sum(amount) as ref_pur ,
(sum(amount)*0.1) as earn_comm
from transaction tr, reference rf
where tr.person_id=rf.person_id and
tr.action='Purchase' and
rf.referrer_id=1
group by date_format(`date`,'%Y-%m')
order by date_format(`date`,'%Y-%m')
)as dx1
JOIN (select @cum_earn:=0)e;
How can I extend the query to also include the BonusRecvd, Paid and Due columns, which do not depend on the reference table?
And how can I also generate a row for the month '11-2011', even though no transaction occurred in that month?
If you want to include commission payments and bonuses into the results, you'll probably need to include corresponding rows (Action IN ('Commision', 'Cm_Bonus')) into the initial dataset you are using to calculate the results on. Or, at least, that's what I would do, and it might be like this:
SELECT t.Amount, t.Action, t.Date
FROM Transaction t LEFT JOIN Reference r ON t.Person_id = r.Person_id
WHERE r.Referrer_id = 1 AND t.Action = 'Purchase'
OR t.Person_id = 1 AND t.Action IN ('Commision', 'Cm_Bonus')
And when calculating monthly SUMs, you can use CASE expressions to distinguish among Amounts related to different types of Action. This is what the corresponding part of the query might look like:
…
IFNULL(SUM(CASE Action WHEN 'Purchase' THEN Amount END) , 0) AS Ref_Pur,
IFNULL(SUM(CASE Action WHEN 'Purchase' THEN Amount END) * 0.1, 0) AS Earn_Comm,
IFNULL(SUM(CASE Action WHEN 'Cm_Bonus' THEN Amount END) , 0) AS BonusRecvd,
IFNULL(SUM(CASE Action WHEN 'Commision' THEN Amount END) , 0) AS Paid
…
When calculating the Due values, you can initialise another variable and use it quite similarly to @cum_earn, except you'll also need to subtract Paid, something like this:
(@cum_due := @cum_due + Earn_Comm - Paid) AS Due
One last problem seems to be missing months. To address it, I would do the following:
1. Get the first and the last date from the subset to be processed (as obtained by the query at the beginning of this post).
2. Get the corresponding month for each of the dates (i.e. another date which is merely the first of the same month).
3. Using a numbers table, generate a list of months covering the two calculated in the previous step.
4. Filter out the months that are present in the subset to be processed and use the remaining ones to add dummy transactions to the subset.
As you can see, the "subset to be processed" needs to be touched twice when performing these steps. So, for efficiency, I would insert that subset into a temporary table and use that table, instead of executing the same (sub)query several times.
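The month-filling steps can be illustrated outside the database as well. The following is only a sketch of the idea in Python, with a hypothetical helper month_range; the answer itself performs these steps in SQL with a numbers table.

```python
from datetime import date


def month_range(first, last):
    """List every month from first to last (inclusive) as 'MM-YYYY' labels."""
    months = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        months.append(f"{month:02d}-{year}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


# First and last transaction dates of the subset for Referrer_id = 1.
all_months = month_range(date(2011, 10, 20), date(2012, 1, 1))

# Months that actually have transactions in the sample data.
present = {"10-2011", "12-2011", "01-2012"}

# The remaining months are the ones that need a dummy (zero) transaction.
missing = [m for m in all_months if m not in present]
print(all_months, missing)
```

For the sample data the only month without transactions is 11-2011, which is the row the asker wants to see with zeros.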
A numbers table mentioned in Step #3 is a tool that I would recommend always keeping handy. You would only need to initialise it once, and its uses for you may turn out numerous, if you pardon the pun. Here's but one way to populate a numbers table:
CREATE TABLE numbers (n int);
INSERT INTO numbers (n) SELECT 0;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
/* repeat as necessary; every repeated line doubles the number of rows */
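To see the doubling in action, here is a sketch that runs the same statements through Python's built-in sqlite3 module (an assumption made for convenience; the statements above are written for MySQL) and records the row count after each repetition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n int)")
conn.execute("INSERT INTO numbers (n) SELECT 0")

# Each run appends a shifted copy of the current contents, doubling the table.
doubling = (
    "INSERT INTO numbers (n) "
    "SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s"
)

row_counts = []
for _ in range(8):
    conn.execute(doubling)
    row_counts.append(conn.execute("SELECT COUNT(*) FROM numbers").fetchone()[0])

values = [row[0] for row in conn.execute("SELECT n FROM numbers ORDER BY n")]
print(row_counts)
```

After the eight repetitions shown above, the table holds the 256 integers from 0 to 255.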
And that seems to be it. I will not post a complete solution here to spare you the chance to try to use the above suggestions in your own way, in case you are keen to. But if you are struggling or just want to verify that they can be applied to the required effect, you can try this SQL Fiddle page for a complete solution "in action".