How to get a Cohort analysis using MySQL? - mysql

Here is my current query:
SELECT DATEDIFF(created_at, '2020-07-01') DIV 6 period,
user_id FROM transactions
WHERE DATE(created_at) >= '2020-07-01'
GROUP BY user_id, DATEDIFF(created_at, '2020-07-01') DIV 6
ORDER BY period
It returns the users that have had at least one transaction in each period (one period = 6 days). Here is a simplified version of the current output:
// res_table
+--------+---------+
| period | user_id |
+--------+---------+
| 0 | 1111 |
| 0 | 2222 |
| 0 | 3333 |
| 1 | 7777 |
| 1 | 1111 |
| 2 | 2222 |
| 2 | 1111 |
| 2 | 8888 |
| 2 | 3333 |
+--------+---------+
Now I need to know, for each period, how many of its users had at least one transaction again in each later period (in marketing terms, I'm trying to picture the retention rate as a cohort chart). So the calculation has to compare every period with every later period, like a self-join.
Here is the expected result:
+---------+---------+------------+
| periodX | periodY | percentage |
+---------+---------+------------+
| 0 | 0 | 100% | -- It means 3 users exist in period 0 and logically all of them exist in period 0. So 3/3=100%
| 0 | 1 | 33% | -- It means 3 users exist in period 0, and just 1 of them exists in period 1. So 1/3=33%
| 0 | 2 | 66% | -- It means 3 users exist in period 0, and just 2 of them exist in period 2. So 2/3=66%
| 1 | 1 | 100% | -- It means 1 user (only #7777; #1111 is ignored because it already appeared in a previous period) exists in period 1 and logically it exists in period 1. So 1/1=100%
| 1 | 2 | 0% |
| 2 | 2 | 100% |
+---------+---------+------------+
Is it possible to do this using MySQL purely?

You can use window functions:
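SELECT first_period, period, COUNT(*),
-- divide by the cohort size, i.e. the count in the cohort's own first period
COUNT(*) / FIRST_VALUE(COUNT(*)) OVER (PARTITION BY first_period ORDER BY period) as ratio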
FROM (SELECT DATEDIFF(created_at, '2020-07-01') DIV 6 period,
user_id,
MIN(MIN(DATEDIFF(created_at, '2020-07-01') DIV 6) OVER (PARTITION BY user_id)) as first_period
FROM transactions
WHERE DATE(created_at) >= '2020-07-01'
GROUP BY user_id, DATEDIFF(created_at, '2020-07-01') DIV 6
) u
GROUP BY first_period, period
ORDER BY first_period, period;
This does not include missing periods. That is a little trickier, because you need to enumerate all of them:
with periods as (
select 0 as period union all
select 1 as period union all
select 2 as period
)
select p1.period, p2.period, COUNT(u.user_id)
from periods p1 join
periods p2
on p1.period <= p2.period left join
(SELECT DATEDIFF(created_at, '2020-07-01') DIV 6 period,
user_id,
MIN(MIN(DATEDIFF(created_at, '2020-07-01') DIV 6) OVER (PARTITION BY user_id)) as first_period
FROM transactions
WHERE DATE(created_at) >= '2020-07-01'
GROUP BY user_id, DATEDIFF(created_at, '2020-07-01') DIV 6
) u
ON p1.period = u.first_period AND p2.period = u.period
GROUP BY p1.period, p2.period;
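To check the cohort arithmetic without a MySQL 8 server, here is a minimal sketch that runs the same idea on the sample res_table rows. It uses Python's built-in sqlite3 as a stand-in for MySQL and avoids window functions; the names firsts, sizes, COHORT_SQL and cohort_retention are made up for this illustration and are not part of the answer above.

```python
import sqlite3

# Sample rows copied from res_table in the question: (period, user_id).
ROWS = [(0, 1111), (0, 2222), (0, 3333), (1, 7777), (1, 1111),
        (2, 2222), (2, 1111), (2, 8888), (2, 3333)]

COHORT_SQL = """
WITH firsts AS (          -- each user's first active period (their cohort)
    SELECT user_id, MIN(period) AS first_period
    FROM res_table
    GROUP BY user_id
),
sizes AS (                -- number of users in each cohort
    SELECT first_period, COUNT(*) AS cohort_size
    FROM firsts
    GROUP BY first_period
)
SELECT f.first_period, r.period, COUNT(*) AS active, s.cohort_size
FROM res_table r
JOIN firsts f ON f.user_id = r.user_id
JOIN sizes  s ON s.first_period = f.first_period
GROUP BY f.first_period, r.period, s.cohort_size
ORDER BY f.first_period, r.period
"""


def cohort_retention(rows):
    """Return {(first_period, period): percent of the cohort active in period}."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE res_table (period INTEGER, user_id INTEGER)")
    conn.executemany("INSERT INTO res_table VALUES (?, ?)", rows)
    result = {}
    for first_period, period, active, size in conn.execute(COHORT_SQL):
        result[(first_period, period)] = round(100 * active / size)
    conn.close()
    return result


if __name__ == "__main__":
    for key, pct in cohort_retention(ROWS).items():
        print(key, f"{pct}%")
```

Note that with the rows exactly as listed in the question, users 1111, 2222 and 3333 all appear in period 2, so the (0, 2) cell comes out as 3 of 3 here. Pairs with no activity, such as (1, 2), are simply absent, which is the gap the enumerated-periods query is meant to fill.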

Related

How do I subtract aggregated counts based on different states two columns in one table?

Using MariaDB, I am trying to get a monthly total of items that were created minus items that were deleted that month, for each month. If no items were deleted, the total should be just the number of items that were created that month. If more items were deleted than created, the total should be a negative number.
The table has a created_at column which is never null and a deleted_at column which is set once the item has been 'deleted'
To illustrate, the (simplified) schema is like this:
TABLE Items:
+----------------------------------------------------------------------------+
| idItem | created_at | deleted_at |
+----------------------------------------------------------------------------+
| 1 | 2020-03-20T04:28:41.000+00:00 | 2021-07-27T02:36:05.000+00:00 |
| 2 | 2020-03-20T04:28:41.000+00:00 | 2021-07-27T02:36:05.000+00:00 |
| 3 | 2021-03-02T21:39:10.000+00:00 | ∅ |
| 4 | 2021-03-05T21:13:13.000+00:00 | ∅ |
| 5 | 2021-06-08T13:49:11.000+00:00 | 2021-07-27T02:36:05.000+00:00 |
| 6 | 2021-07-13T02:36:05.000+00:00 | ∅ |
| 7 | 2021-09-17T21:12:13.000+00:00 | ∅ |
+----------------------------------------------------------------------------+
The information I need is the monthly total that have not been deleted, like so:
+-----------------------------------+
| total_existing | during_month |
+-----------------------------------+
| 2 | 2020-03 | -- two were added
+-----------------------------------+
| 4 | 2021-03 | -- another two were created
+-----------------------------------+
| 5 | 2021-06 | -- another was added
+-----------------------------------+
| 3 | 2021-07 | -- three deleted, one added
+-----------------------------------+
| 4 | 2021-09 | -- one added
+-----------------------------------+
Ultimately, I need to display the total for each month.
I've tried this but it's not right.
SELECT
count(created.idItem) AS monthly_created_count,
count(deleted.idItem) AS monthly_deleted_count,
count(created.idItem) - count(deleted.idItem) as total,
DATE_FORMAT(created.created_at, '%Y-%m') as created_month ,
DATE_FORMAT(deleted.deleted_at, '%Y-%m') as deleted_month
FROM
Item created
LEFT JOIN
Item deleted
ON
DATE_FORMAT(deleted.deleted_at, '%Y-%m') = DATE_FORMAT(created.created_at, '%Y-%m')
GROUP BY DATE_FORMAT(created.created_at, '%Y-%m'), DATE_FORMAT(deleted.deleted_at, '%Y-%m')
I keep thinking I'm so close, but when we look at the rows where the deleted_at dates are set, it's obvious I'm off the mark.
If you're looking for a cumulative total of rows created/deleted, one approach is to COUNT the number of records created and deleted per year-month separately, then combine the two sets of counts with UNION ALL and calculate the net change for each month:
SELECT t.YearMonth
, SUM(t.TotalCreated) - SUM(t.TotalDeleted) AS TotalExisting
FROM (
SELECT DATE_FORMAT(created_at, '%Y-%m') AS YearMonth
, COUNT(*) AS TotalCreated
, 0 AS TotalDeleted
FROM Item
GROUP BY DATE_FORMAT(created_at, '%Y-%m')
UNION ALL
SELECT DATE_FORMAT(deleted_at, '%Y-%m') AS YearMonth
, 0 AS TotalCreated
, COUNT(*) AS TotalDeleted
FROM Item
WHERE deleted_at IS NOT NULL
GROUP BY DATE_FORMAT(deleted_at, '%Y-%m')
) t
GROUP BY t.YearMonth
ORDER BY t.YearMonth
Results:
YearMonth | TotalExisting
:-------- | ------------:
2020-03 | 2
2021-03 | 2
2021-06 | 1
2021-07 | -2
2021-09 | 1
Then wrap those statements in a CTE and use a Window Function to calculate the rolling total:
See also db<>fiddle
WITH cte AS (
SELECT t.YearMonth
, SUM(t.TotalCreated) - SUM(t.TotalDeleted) AS TotalExisting
FROM (
SELECT DATE_FORMAT(created_at, '%Y-%m') AS YearMonth
, COUNT(*) AS TotalCreated
, 0 AS TotalDeleted
FROM Item
GROUP BY DATE_FORMAT(created_at, '%Y-%m')
UNION ALL
SELECT DATE_FORMAT(deleted_at, '%Y-%m') AS YearMonth
, 0 AS TotalCreated
, COUNT(*) AS TotalDeleted
FROM Item
WHERE deleted_at IS NOT NULL
GROUP BY DATE_FORMAT(deleted_at, '%Y-%m')
) t
GROUP BY t.YearMonth
ORDER BY t.YearMonth
)
SELECT YearMonth, SUM(TotalExisting) OVER (ORDER BY YearMonth) AS TotalExisting
FROM cte;
Final Results:
YearMonth | TotalExisting
:-------- | ------------:
2020-03 | 2
2021-03 | 4
2021-06 | 5
2021-07 | 3
2021-09 | 4
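As a quick sanity check of the two steps (net change per month, then a running sum), here is a small sketch on the question's seven rows. It uses Python's built-in sqlite3 instead of MariaDB, so substr() stands in for DATE_FORMAT(..., '%Y-%m') and the running total is accumulated in Python rather than with a window function. The timestamps are shortened to plain dates, and NET_SQL and monthly_totals are names invented for this example.

```python
import sqlite3
from itertools import accumulate

# (idItem, created_at, deleted_at) from the question, dates only.
ITEMS = [
    (1, "2020-03-20", "2021-07-27"),
    (2, "2020-03-20", "2021-07-27"),
    (3, "2021-03-02", None),
    (4, "2021-03-05", None),
    (5, "2021-06-08", "2021-07-27"),
    (6, "2021-07-13", None),
    (7, "2021-09-17", None),
]

NET_SQL = """
SELECT t.YearMonth, SUM(t.TotalCreated) - SUM(t.TotalDeleted) AS NetChange
FROM (
    SELECT substr(created_at, 1, 7) AS YearMonth,
           COUNT(*) AS TotalCreated, 0 AS TotalDeleted
    FROM Item
    GROUP BY substr(created_at, 1, 7)
    UNION ALL
    SELECT substr(deleted_at, 1, 7), 0, COUNT(*)
    FROM Item
    WHERE deleted_at IS NOT NULL
    GROUP BY substr(deleted_at, 1, 7)
) t
GROUP BY t.YearMonth
ORDER BY t.YearMonth
"""


def monthly_totals(items):
    """Return [(year_month, items existing at the end of that month), ...]."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Item (idItem INTEGER, created_at TEXT, deleted_at TEXT)")
    conn.executemany("INSERT INTO Item VALUES (?, ?, ?)", items)
    net = conn.execute(NET_SQL).fetchall()
    conn.close()
    months = [month for month, _ in net]
    running = accumulate(change for _, change in net)  # rolling total
    return list(zip(months, running))


if __name__ == "__main__":
    for month, total in monthly_totals(ITEMS):
        print(month, total)
```

Months in which nothing was created or deleted do not appear in the output, the same as in the SQL answer.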

Get appropriate date with GROUP BY

I have a table where I track the duration of watched films by a user for each day.
Now I would like to calculate a unique view count based on date.
So the conditions are:
For each user max view count is 1
View = 1 if one user's SUM(duration) >= 120
Date should be fixed once SUM(duration) reaches 120
But the issue is here to get a correct date row. For example row1.duration + row2.duration >= 120 and thus view count = 1 should be applied for 2021-10-16
| id | user_id | duration | created_at | film_id |
+----+---------+----------+------------+---------+
| 1 | 1 | 80 | 2021-10-15 | 1 |
| 2 | 1 | 70 | 2021-10-16 | 1 |
| 3 | 1 | 200 | 2021-10-17 | 2 |
| 4 | 2 | 50 | 2021-10-18 | 1 |
| 5 | 2 | 90 | 2021-10-18 | 1 |
| 6 | 3 | 140 | 2021-10-18 | 2 |
| 7 | 4 | 10 | 2021-10-19 | 3 |
Expected result:
| cnt | created_at |
+-------+------------+
| 0 | 2021-10-15 |
| 1 | 2021-10-16 |
| 0 | 2021-10-17 |
| 2 | 2021-10-18 |
| 0 | 2021-10-19 |
This is what I tried, but it chooses the first date and leaves out the dates with a 0 count.
Here is the fiddle with populated data
SELECT count(*) AS cnt,
created_at
FROM
(SELECT user_id,
sum(duration) AS total,
created_at
FROM watch_time
GROUP BY user_id) AS t
WHERE t.total >= 120
GROUP BY created_at;
Is there any way to make this work in SQL, or should it be done at the application level?
Thanks in advance!
Update:
Version: AWS RDS MySQL 5.7.33
But I'm ok to switch to Postgres if that can help.
I'd also appreciate a way to get MIN(date) while still listing all dates (including those with 0 views), ideally something better than this one:
SELECT IFNULL(cnt, 0) as cnt,
t3.created_at
FROM
(SELECT count(*) AS cnt,
created_at
FROM
(SELECT user_id,
sum(duration) AS total,
created_at
FROM watch_time
GROUP BY user_id) AS t
WHERE t.total >= 120
GROUP BY created_at) AS t2
RIGHT JOIN
(SELECT distinct(created_at)
FROM watch_time) AS t3
ON t2.created_at = t3.created_at;
which returns:
| cnt | created_at |
+-------+------------+
| 1 | 2021-10-15 |
| 0 | 2021-10-16 |
| 0 | 2021-10-17 |
| 2 | 2021-10-18 |
| 0 | 2021-10-19 |
But I'm not sure whether the date (2021-10-15) is picked arbitrarily or whether it is always the lowest date.
Update 2:
Is it possible to include the film_id as well? Like considering user_id, film_id as a unique view instead of only grouping by user_id.
So in this case:
row1 & row2 both have user_id: 1 and film_id: 1, and the result is 1 view, because the sum of their durations is >= 120. So the date in this case will be 2021-10-16.
But row3 has user_id: 1 and film_id: 2, and with duration >= 120 it also counts as 1 view, with date 2021-10-17.
| id | user_id | duration | created_at | film_id |
+----+---------+----------+------------+---------+
| 1 | 1 | 80 | 2021-10-15 | 1 |
| 2 | 1 | 70 | 2021-10-16 | 1 |
| 3 | 1 | 200 | 2021-10-17 | 2 |
| 4 | 2 | 50 | 2021-10-18 | 1 |
| 5 | 2 | 90 | 2021-10-18 | 1 |
| 6 | 3 | 140 | 2021-10-18 | 2 |
| 7 | 4 | 10 | 2021-10-19 | 3 |
Expected result:
| cnt | created_at |
+-------+------------+
| 0 | 2021-10-15 |
| 1 | 2021-10-16 |
| 1 | 2021-10-17 |
| 2 | 2021-10-18 |
| 0 | 2021-10-19 |
Your count logic can be implemented with MySQL user variables: the query orders the table rows by user_id and created_at and computes a running sum row by row.
http://sqlfiddle.com/#!9/569088/14
SELECT created_at, SUM(CASE WHEN duration >= 120 THEN 1 ELSE 0 END) counts
FROM (
SELECT user_id, created_at,
CASE WHEN @UID != user_id THEN @SUM_TIME := 0 WHEN @SUM_TIME >= 120 AND @DT != created_at THEN @SUM_TIME := 0 - duration ELSE 0 END SX,
@SUM_TIME := @SUM_TIME + duration AS duration,
@UID := user_id,
@DT := created_at
FROM watch_time
JOIN ( SELECT @SUM_TIME :=0, @DT := NOW(), @UID := '' ) t
ORDER BY user_id, created_at
) f
GROUP BY created_at
I think I misunderstood the requirement in my first attempt.
Second attempt
MySQL >= 8.0 (or PostgreSQL) using window functions
I know you are working with MySQL 5.7; I add an answer for it next.
I am not sure I understand your requirement correctly. Do you want the cumulative sum of watch time per user, and to count one view on the day a user's total first reaches 120 minutes?
First, I get the cumulative sum per user ordered by date (subquery cte). In subquery cte1, a CASE expression sets 1 the first time a user reaches 120 minutes (view column). Finally I group by created_at (date) and count() the ones in the view column:
WITH cte AS (SELECT *, SUM(duration) OVER (PARTITION BY user_id ORDER BY created_at ASC, film_id) as cum_duration
FROM watch_time),
cte1 AS (SELECT *, CASE WHEN cum_duration >= 120 AND COALESCE(LAG(cum_duration) OVER (PARTITION BY user_id ORDER BY created_at ASC), 0) < 120 THEN 1 END AS view
FROM cte)
SELECT created_at, COUNT(view) AS cnt
FROM cte1
GROUP BY created_at;
| created_at | cnt |
+------------+-----+
| 2021-10-15 | 0 |
| 2021-10-16 | 1 |
| 2021-10-17 | 0 |
| 2021-10-18 | 2 |
| 2021-10-19 | 0 |
MySQL 5.7
I get the cumulative sum for each user with a self-join and keep only the dates where the cumulative duration is >= 120, then I group by user_id and take MIN(created_at). Finally I group by min_created_at and count the records.
SELECT min_created_at AS date, count(*) AS cnt
FROM (SELECT user_id, MIN(created_at) AS min_created_at
FROM (SELECT wt1.user_id, wt1.created_at, SUM(wt2.duration) AS cum_duration
FROM (SELECT user_id, created_at, SUM(duration) AS duration FROM watch_time GROUP BY user_id, created_at) wt1
INNER JOIN (SELECT user_id, created_at, SUM(duration) AS duration FROM watch_time GROUP BY user_id, created_at) wt2 ON wt1.user_id = wt2.user_id AND wt1.created_at >= wt2.created_at
GROUP BY wt1.user_id, wt1.created_at
HAVING SUM(wt2.duration) >= 120) AS sq
GROUP BY user_id) AS sq2
GROUP BY min_created_at;
| date | cnt |
+------------+-----+
| 2021-10-16 | 1 |
| 2021-10-18 | 2 |
You can JOIN my query (RIGHT JOIN) with the original table (GROUP BY created_at) to get the rest of the dates with count equal to 0.
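A minimal sketch of that last step, assuming the question's sample rows: the self-join query is joined back to the distinct dates so that days without a view show up with a count of 0. It runs on Python's built-in sqlite3 rather than MySQL, and it uses CTEs (daily, crossed, first_hit) only for readability; on MySQL 5.7 they would have to be inlined as derived tables. VIEWS_SQL and views_per_day are names invented here.

```python
import sqlite3

# (id, user_id, duration, created_at, film_id) from the question.
WATCH_TIME = [
    (1, 1, 80, "2021-10-15", 1),
    (2, 1, 70, "2021-10-16", 1),
    (3, 1, 200, "2021-10-17", 2),
    (4, 2, 50, "2021-10-18", 1),
    (5, 2, 90, "2021-10-18", 1),
    (6, 3, 140, "2021-10-18", 2),
    (7, 4, 10, "2021-10-19", 3),
]

VIEWS_SQL = """
WITH daily AS (           -- watch time per user per day
    SELECT user_id, created_at, SUM(duration) AS duration
    FROM watch_time
    GROUP BY user_id, created_at
),
crossed AS (              -- days on which the running total is already >= 120
    SELECT d1.user_id, d1.created_at
    FROM daily d1
    JOIN daily d2 ON d2.user_id = d1.user_id
                 AND d2.created_at <= d1.created_at
    GROUP BY d1.user_id, d1.created_at
    HAVING SUM(d2.duration) >= 120
),
first_hit AS (            -- first such day per user
    SELECT user_id, MIN(created_at) AS hit_date
    FROM crossed
    GROUP BY user_id
)
SELECT d.created_at, COUNT(f.user_id) AS cnt
FROM (SELECT DISTINCT created_at FROM watch_time) d
LEFT JOIN first_hit f ON f.hit_date = d.created_at
GROUP BY d.created_at
ORDER BY d.created_at
"""


def views_per_day(rows):
    """Return [(created_at, number of users whose total first reached 120)]."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE watch_time (id INTEGER, user_id INTEGER, "
        "duration INTEGER, created_at TEXT, film_id INTEGER)")
    conn.executemany("INSERT INTO watch_time VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(VIEWS_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for day, cnt in views_per_day(WATCH_TIME):
        print(day, cnt)
```

COUNT(f.user_id) ignores the NULLs produced by the LEFT JOIN, which is what yields 0 for days with no first hit. For the per-film variant from Update 2, grouping and joining on (user_id, film_id) instead of user_id alone should be the only change needed.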
First attempt
I understood that you want to count one view each time a user reaches 120 minutes within a single day.
First, I get the total watch time per user and date (subquery sq); then a CASE expression sets 1 each time a user reaches 120 minutes on a date, and I group by created_at (date) and count() the ones:
SELECT created_at, COUNT(CASE WHEN total_duration >= 120 THEN 1 END) cnt
FROM (SELECT created_at, user_id, SUM(duration) AS total_duration
FROM watch_time
GROUP BY created_at, user_id) AS sq
GROUP BY created_at;
Output (with sample data from the question):
| created_at | cnt |
+------------+-----+
| 2021-10-15 | 0 |
| 2021-10-16 | 0 |
| 2021-10-17 | 1 |
| 2021-10-18 | 2 |
| 2021-10-19 | 0 |

Query date column from join table

Here is the schema:
Customer (Customer_ID, Name, Address, Phone),
Porder (Customer_ID, Pizza_ID, Quantity, Order_Date),
Pizza (Pizza_ID, Name, Price).
I want to get all customers that ordered a pizza in the last 30 days, based on the Order_Date & who spent the most money in the last 30 days. Can these be combined into one?
Here is what I am trying and I am not sure about DATEDIFF or how the query would calculate the total money.
SELECT customer.customer_ID, customer.name FROM customer
JOIN porder ON customer.customer_ID = porder.customer_ID
GROUP BY customer.customer_ID, customer.name
WHERE DATEDIFF(porder.porder_date,getdate()) between 0 and 30
Who spent the most money last 30 days?
SELECT porder.customer_ID, porder.pizza_id, porder.quantity FROM order
JOIN pizza ON porder.pizza_ID = pizza.pizza_ID
GROUP BY porder.customer_ID
WHERE MAX((porder.quantity * pizza.price)) && DATEDIFF(porder.porder_date,getdate()) between 0 and 30
Remember that functions are black boxes to the query optimizer, so it is better to make the query fit the index, not the other way around.
WHERE DATEDIFF(porder.order_date,getdate()) between 0 and 30
can be rewritten so that the query can use a plain index on order_date:
WHERE porder.order_date >= CURRENT_DATE - INTERVAL 30 DAY
Who spent the most money in the last 30 days
SELECT
o.customer_id, SUM(p.price * o.quantity) AS total_spent
FROM
porder o -- the schema's table is Porder; ORDER is a reserved word
INNER JOIN pizza p
ON o.pizza_id = p.pizza_id
WHERE
o.order_date >= CURRENT_DATE - INTERVAL 30 DAY
GROUP BY o.customer_id
ORDER BY total_spent DESC
LIMIT 1
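On the "can these be combined into one?" part: one possible reading is a single query that lists every customer with an order in the last 30 days together with their total, sorted so the top spender comes first. The sketch below tries that on Python's built-in sqlite3 with made-up customers, prices and orders (none of this data is from the question); the cutoff date is passed in as a parameter instead of using CURRENT_DATE so the result is reproducible.

```python
import sqlite3
from datetime import date, timedelta

SPEND_SQL = """
SELECT c.customer_id, c.name, SUM(p.price * o.quantity) AS total_spent
FROM customer c
JOIN porder o ON o.customer_id = c.customer_id
JOIN pizza  p ON p.pizza_id = o.pizza_id
WHERE o.order_date >= ?          -- sargable: plain comparison on order_date
GROUP BY c.customer_id, c.name
ORDER BY total_spent DESC
"""


def recent_spenders(today):
    """Customers with orders in the 30 days before `today`, biggest spender first."""
    def days_ago(n):
        return (today - timedelta(days=n)).isoformat()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE pizza (pizza_id INTEGER, name TEXT, price INTEGER)")
    conn.execute("CREATE TABLE porder (customer_id INTEGER, pizza_id INTEGER, "
                 "quantity INTEGER, order_date TEXT)")
    # Hypothetical sample data, for illustration only.
    conn.executemany("INSERT INTO customer VALUES (?, ?)",
                     [(1, "Ann"), (2, "Bob"), (3, "Cy")])
    conn.executemany("INSERT INTO pizza VALUES (?, ?, ?)",
                     [(1, "Margherita", 10), (2, "Funghi", 12)])
    conn.executemany("INSERT INTO porder VALUES (?, ?, ?, ?)", [
        (1, 1, 2, days_ago(5)),    # Ann: 2 x 10
        (1, 2, 1, days_ago(10)),   # Ann: 1 x 12
        (2, 2, 1, days_ago(3)),    # Bob: 1 x 12
        (3, 1, 10, days_ago(60)),  # Cy: outside the 30-day window
    ])
    rows = conn.execute(SPEND_SQL, (days_ago(30),)).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in recent_spenders(date.today()):
        print(row)
```

The first row of the result is the biggest spender; adding LIMIT 1 would reduce it to the query shown above.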
Something to think about once you've sorted out your tables, and separated order details from orders.
SELECT * FROM ints;
+---+
| i |
+---+
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
+---+
SELECT x.*
, IF(x.i = y.maxi,1,0) is_biggest
FROM ints x
LEFT
JOIN (SELECT MAX(i) maxi FROM ints) y
ON y.maxi = x.i;
+---+------------+
| i | is_biggest |
+---+------------+
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 0 |
| 9 | 1 |
+---+------------+

Get the balance of my users in the same table

Help please, I have a table like this:
| ID | userId | amount | type |
-------------------------------------
| 1 | 10 | 10 | expense |
| 2 | 10 | 22 | income |
| 3 | 3 | 25 | expense |
| 4 | 3 | 40 | expense |
| 5 | 3 | 63 | income |
I'm looking for a way to retrieve the balance of each user with a single query.
The hard part is that the amounts have to be added for incomes and subtracted for expenses.
This would be the result table:
| userId | balance |
--------------------
| 10 | 12 |
| 3 | -2 |
You need to get the income and expense totals separately using subqueries, then join them so you can subtract the expenses from the income:
SELECT a.UserID,
(b.totalIncome - a.totalExpense) `balance`
FROM
(
SELECT userID, SUM(amount) totalExpense
FROM myTable
WHERE type = 'expense'
GROUP BY userID
) a INNER JOIN
(
SELECT userID, SUM(amount) totalIncome
FROM myTable
WHERE type = 'income'
GROUP BY userID
) b on a.userID = b.userid
SQLFiddle Demo
This is easiest to do with a single group by:
select user_id,
sum(case when type = 'income' then amount else - amount end) as balance
from t
group by user_id
You could have 2 sub-queries, each grouped by id: one sums the incomes, the other the expenses. Then you could join these together, so that each row had an id, the sum of the expenses and the sum of the income(s), from which you can easily compute the balance.
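One difference between the two approaches is worth seeing on data: the INNER JOIN of two subqueries only returns users that have both an income and an expense row, while the single GROUP BY keeps every user. A small sketch on Python's built-in sqlite3, using the question's five rows plus one hypothetical row (ID 6, a user with expenses only) added for this illustration:

```python
import sqlite3

ROWS = [
    (1, 10, 10, "expense"),
    (2, 10, 22, "income"),
    (3, 3, 25, "expense"),
    (4, 3, 40, "expense"),
    (5, 3, 63, "income"),
    (6, 7, 15, "expense"),  # hypothetical: user 7 has no income rows
]

SINGLE_GROUP_BY = """
SELECT userId,
       SUM(CASE WHEN type = 'income' THEN amount ELSE -amount END) AS balance
FROM myTable
GROUP BY userId
ORDER BY userId
"""

JOINED_SUBQUERIES = """
SELECT a.userId, b.totalIncome - a.totalExpense AS balance
FROM (SELECT userId, SUM(amount) AS totalExpense
      FROM myTable WHERE type = 'expense' GROUP BY userId) a
INNER JOIN
     (SELECT userId, SUM(amount) AS totalIncome
      FROM myTable WHERE type = 'income' GROUP BY userId) b
  ON a.userId = b.userId
ORDER BY a.userId
"""


def balances(sql, rows=ROWS):
    """Run one of the balance queries against an in-memory copy of myTable."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE myTable (ID INTEGER, userId INTEGER, amount INTEGER, type TEXT)")
    conn.executemany("INSERT INTO myTable VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(balances(SINGLE_GROUP_BY))    # includes user 7
    print(balances(JOINED_SUBQUERIES))  # user 7 is dropped by the INNER JOIN
```

If the join form is preferred, an outer join with COALESCE(..., 0) on the missing side would be one way to keep such users.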

MySQL pro rata counting of distinct values in last period

This one is quite tricky; I've been scratching my head all day.
I have a table containing billing periods
ID | DATEBEGIN | DATEEND | CUSTOMER_ID
=======+===========+==========+=================
1 | 1/1/2011 | 30/1/2011 | 1
I have a table containing 'sub customers'
ID | NAME
=======+===========
1 | JOHN
2 | JACK
3 | Elwood
I have a table containing items purchased on a subscription (wholesale account)
ID | DATE | SUBCUSTOMER_ID | CUSTOMER ID
=======+===========+================+==============
1 | 15/1/2011 | 1 | 1
2 | 18/1/2011 | 1 | 1
3 | 25/1/2011 | 2 | 1
4 | 28/1/2011 | 3 | 1
I want to count the 'credits' to deduct from their account. The subscription is per 'sub customer'.
So at the end of the billing period (30/1/2011 in the first table), I need to count the distinct sub customers (there are 3). Each is charged pro rata from the first purchase they make during the billing period.
Days Having Subscription | SUBCUSTOMER_ID | Pro Rata Rate | CUSTOMER_ID
==========================+===================+==================+==============
3 | 3 | 3/30 | 1
6 | 2 | 6/30 | 1
16 | 1 | 16/30 | 1
The output should therefore be
CUSTOMER_ID | BILLING CREDITS
============+========================
1 | 25/30
I have to count it pro rata; otherwise it would be unfair to bill a full period even if they purchased an item only 1 day before the billing date.
SELECT customerId, SUM(DATEDIFF(dateend, fp) + 1) / (DATEDIFF(dateend, datebegin) + 1)
FROM (
SELECT b.*, MIN(date) AS fp
FROM billing b
JOIN purchase p
ON p.customerId = b.customerId
AND p.date >= b.datebegin
AND p.date < b.dateend + INTERVAL 1 DAY
GROUP BY
b.id, p.subcustomerId
) q
GROUP BY
customerId
SELECT customer_id, sum(pro_rated_date) / max(days_in_billing_cycle)
FROM (SELECT min(date),
subcustomer_id,
datebegin,
dateend,
items.customer_id,
datediff(dateend, datebegin) + 1 AS days_in_billing_cycle,
datediff(dateend, min(date)) + 1 AS pro_rated_date
FROM items
JOIN
billing_period
ON items.date BETWEEN billing_period.datebegin
AND billing_period.dateend
AND items.customer_id = billing_period.customer_id
GROUP BY subcustomer_id) AS billing
GROUP BY customer_id
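To check the day counting against the question's expected 25/30, here is a sketch on the sample data using Python's built-in sqlite3. SQLite has no DATEDIFF, so julianday() differences stand in for it; dates are rewritten in ISO format, and the table and column names (billing, purchase, purchase_date, subcustomer_id) are assumptions loosely based on the queries above.

```python
import sqlite3

BILLING = [(1, "2011-01-01", "2011-01-30", 1)]   # id, datebegin, dateend, customer_id
PURCHASES = [                                    # id, purchase_date, subcustomer_id, customer_id
    (1, "2011-01-15", 1, 1),
    (2, "2011-01-18", 1, 1),
    (3, "2011-01-25", 2, 1),
    (4, "2011-01-28", 3, 1),
]

PRO_RATA_SQL = """
SELECT q.customer_id, SUM(q.days_subscribed) AS credit_days, q.period_days
FROM (
    -- one row per sub customer and billing period, counted from the first purchase
    SELECT b.id AS billing_id, b.customer_id, p.subcustomer_id,
           CAST(julianday(b.dateend) - julianday(MIN(p.purchase_date)) AS INTEGER) + 1
               AS days_subscribed,
           CAST(julianday(b.dateend) - julianday(b.datebegin) AS INTEGER) + 1
               AS period_days
    FROM billing b
    JOIN purchase p
      ON p.customer_id = b.customer_id
     AND p.purchase_date BETWEEN b.datebegin AND b.dateend
    GROUP BY b.id, b.customer_id, p.subcustomer_id, b.datebegin, b.dateend
) q
GROUP BY q.customer_id, q.billing_id, q.period_days
"""


def billing_credits(billing=BILLING, purchases=PURCHASES):
    """Return [(customer_id, credit_days, period_days)] per billing period."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE billing (id INTEGER, datebegin TEXT, "
                 "dateend TEXT, customer_id INTEGER)")
    conn.execute("CREATE TABLE purchase (id INTEGER, purchase_date TEXT, "
                 "subcustomer_id INTEGER, customer_id INTEGER)")
    conn.executemany("INSERT INTO billing VALUES (?, ?, ?, ?)", billing)
    conn.executemany("INSERT INTO purchase VALUES (?, ?, ?, ?)", purchases)
    result = conn.execute(PRO_RATA_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for customer_id, credit_days, period_days in billing_credits():
        print(customer_id, f"{credit_days}/{period_days}")
```

Both the first-purchase day and the last day of the period are counted (the + 1), which is what gives 16, 6 and 3 days for the three sub customers.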