How to do cohort analysis in MySQL

I have a table called order_star_member:
create table order_star_member(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
users_id INT(11) NOT NULL,
createdAt datetime NOT NULL,
total_price_star_member decimal(10,2) NOT NULL,
PRIMARY KEY (id)
);
INSERT INTO order_star_member(users_id, createdAt, total_price_star_member)
VALUES
(15, '2021-01-01', 350000),
(15, '2021-01-02', 400000),
(16, '2021-01-02', 700000),
(15, '2021-02-01', 350000),
(16, '2021-02-02', 700000),
(15, '2021-03-01', 350000),
(16, '2021-03-01', 850000),
(17, '2021-03-03', 350000);
I want to find the users who are star members in March 2021, together with the month in which each of them first became a star member. A user whose transactions in a month total >= 700000 is called a star member.
My query so far:
SELECT COUNT(users_id) count_star_member,
       year_and_month DIV 100 `year`,
       year_and_month MOD 100 `month`
FROM (SELECT users_id,
             MIN(year_and_month) year_and_month
      FROM ( SELECT users_id,
                    DATE_FORMAT(createdAt, '%Y%m') year_and_month,
                    SUM(total_price_star_member) month_price
             FROM order_star_member
             GROUP BY users_id,
                      DATE_FORMAT(createdAt, '%Y%m')
             HAVING month_price >= 700000 ) starrings
      GROUP BY users_id
      HAVING SUM(year_and_month = '202103') > 0 ) first_starrings
GROUP BY year_and_month
ORDER BY `year`, `month`;
+-------------------+------+-------+
| count_star_member | year | month |
+-------------------+------+-------+
| 1 | 2021 | 1 |
+-------------------+------+-------+
Explanation: in March 2021 there is only one star member, users_id 16, whose first star-member month is January 2021, so the result for March 2021 is as shown above.
But starting from March 2021, the threshold for a star member changes from 700,000 to 350,000.
I want to find the star members in March and the month of each one's first transaction. If the first transaction is in a month before March 2021, the star member must be a user whose transactions are >= 700,000; if the first transaction is in March 2021, as I said, a user whose transactions are >= 350,000 qualifies.
Thus my updated expectation:
+-------------------+------+-------+
| count_star_member | year | month |
+-------------------+------+-------+
| 2 | 2021 | 1 |
| 1 | 2021 | 3 |
+-------------------+------+-------+
Explanation: users 15, 16, and 17 are star members in March 2021. Users 15 and 16 first became star members in January 2021 (before March 2021, when the requirement to become a star member was 700,000), while user 17 is also a star member because the first transaction, in March 2021, is 350,000.

My understanding is that to determine the final output you need two things:
A user's first transaction
The users who are star members for the requested month, where the cumulative monthly transaction amount must be >= 700000 before March 2021 and >= 350000 from March 2021 onward
If that is correct, and since you are using a MySQL version below 8.0 (where it could be done in one statement), your solution is as follows.
You need a rules table or some other configuration of rules (we'll call it SMLimitDef), which would look like this when entered directly in a table:
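-- column types are an assumption; only the INSERT was given
create table SMLimitDef(sEffDate date not null, eEffDate date not null, priceLimit decimal(10,2) not null);
insert into SMLimitDef(sEffDate,eEffDate,priceLimit)
VALUES('1980-01-01','2021-02-28',700000),
('2021-03-01','2999-12-31',350000);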
Next, you need a query or view that finds each user's first transaction (called vFirstUserTransMatch), which would look something like this:
create view vFirstUserTransMatch as
SELECT *, month(osm.createdAt) as createMonth, year(osm.createdAt) as createYear
FROM order_star_member osm
where createdAt = (select MIN(createdAt) from order_star_member b
                   where b.users_id = osm.users_id
                  );
Next, you need a summary view or query that totals transactions per user per month:
create view vOSMSummary as
SELECT users_id,month(osm.createdAt) as createMonth, year(osm.createdAt) as createYear, sum(total_price_star_member) as totalPrice
FROM order_star_member osm
group by users_id,month(osm.createdAt), year(osm.createdAt);
Next you need a query that puts it all together based on your criteria:
select osm.*, futm.createMonth as firstMonth, futm.createYear as firstYear
from vOSMSummary osm
inner join vFirstUserTransMatch futm
    on osm.users_id = futm.users_id
where exists (select 'x' from SMLimitDef c
              -- compare the first day of the summarized month with the rule's effective range;
              -- testing month and year separately only matches when the range happens to line up
              where (MAKEDATE(osm.createYear, 1) + INTERVAL (osm.createMonth - 1) MONTH)
                    between c.sEffDate and c.eEffDate
                and osm.totalPrice >= c.priceLimit
             )
  and osm.createMonth = 3 and osm.createYear = 2021;
Lastly, you can do your summary:
SELECT COUNT(users_id) count_star_member,
       firstYear `year`,
       firstMonth `month`
FROM (
    select osm.*, futm.createMonth as firstMonth, futm.createYear as firstYear
    from vOSMSummary osm
    inner join vFirstUserTransMatch futm
        on osm.users_id = futm.users_id
    where exists (select 'x' from SMLimitDef c
                  -- first day of the summarized month must fall inside the rule's effective range
                  where (MAKEDATE(osm.createYear, 1) + INTERVAL (osm.createMonth - 1) MONTH)
                        between c.sEffDate and c.eEffDate
                    and osm.totalPrice >= c.priceLimit
                 )
      and osm.createMonth = 3 and osm.createYear = 2021
) d
group by firstYear, firstMonth;
Like I said, if you were using MySQL 8, everything could go in one query using WITH clauses (common table expressions). For your version, views keep things readable and simple; otherwise you can still embed the SQL of those views directly in the final query.
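To illustrate the single-statement variant mentioned above, here is a minimal sketch. It is not the answerer's fiddle: it runs the same idea against an in-memory SQLite database from Python, so strftime stands in for MySQL's DATE_FORMAT/YEAR/MONTH, the column types are assumptions, and the CTE names (first_trans, monthly, stars) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_star_member (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    users_id INTEGER NOT NULL,
    createdAt TEXT NOT NULL,
    total_price_star_member NUMERIC NOT NULL
);
INSERT INTO order_star_member (users_id, createdAt, total_price_star_member) VALUES
    (15, '2021-01-01', 350000), (15, '2021-01-02', 400000),
    (16, '2021-01-02', 700000), (15, '2021-02-01', 350000),
    (16, '2021-02-02', 700000), (15, '2021-03-01', 350000),
    (16, '2021-03-01', 850000), (17, '2021-03-03', 350000);
-- rules table from the answer; column types are an assumption
CREATE TABLE SMLimitDef (sEffDate TEXT, eEffDate TEXT, priceLimit NUMERIC);
INSERT INTO SMLimitDef VALUES
    ('1980-01-01', '2021-02-28', 700000),
    ('2021-03-01', '2999-12-31', 350000);
""")

COHORT_SQL = """
WITH first_trans AS (          -- month of each user's first transaction
    SELECT users_id, strftime('%Y-%m', MIN(createdAt)) AS first_month
    FROM order_star_member
    GROUP BY users_id
),
monthly AS (                   -- total spend per user per month
    SELECT users_id,
           strftime('%Y-%m', createdAt) AS ym,
           SUM(total_price_star_member) AS total_price
    FROM order_star_member
    GROUP BY users_id, strftime('%Y-%m', createdAt)
),
stars AS (                     -- months where the rule in force is met
    SELECT m.users_id, m.ym
    FROM monthly m
    JOIN SMLimitDef r
      ON m.ym || '-01' BETWEEN r.sEffDate AND r.eEffDate
     AND m.total_price >= r.priceLimit
)
SELECT COUNT(*) AS count_star_member, f.first_month
FROM stars s
JOIN first_trans f ON f.users_id = s.users_id
WHERE s.ym = ?
GROUP BY f.first_month
ORDER BY f.first_month
"""

def star_cohorts(month):
    """Count the star members of `month` ('YYYY-MM') by first-transaction month."""
    return conn.execute(COHORT_SQL, (month,)).fetchall()

print(star_cohorts("2021-03"))  # [(2, '2021-01'), (1, '2021-03')]
```

In MySQL 8 the same shape should carry over with DATE_FORMAT(createdAt, '%Y-%m') in place of strftime and CONCAT in place of ||, but I have only run the SQLite version.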

This is probably what you need:
SELECT min_year, min_month, COUNT(users_id)
FROM (
    SELECT osm2.users_id, YEAR(min_createdAt) min_year, MONTH(min_createdAt) min_month, SUM(total_price_star_member) sum_price
    FROM (
        SELECT users_id, MIN(createdAt) min_createdAt
        FROM order_star_member
        GROUP BY users_id
    ) AS osm1
    JOIN order_star_member osm2 ON osm1.users_id = osm2.users_id
    WHERE DATE_FORMAT(osm2.createdAt, '%Y%m') = DATE_FORMAT(osm1.min_createdAt, '%Y%m')
    GROUP BY osm2.users_id, min_createdAt
) t1
WHERE users_id IN (
    SELECT users_id
    FROM (
        SELECT users_id, DATE_FORMAT(createdAt, '%Y-%m-01') month_createdAt
        FROM order_star_member
        WHERE DATE_FORMAT(createdAt, '%Y%m') = '202103'
        GROUP BY users_id, DATE_FORMAT(createdAt, '%Y-%m-01')
        HAVING SUM(total_price_star_member) >= (
            CASE
                WHEN date(month_createdAt) < date '2021-03-01' THEN 700000
                ELSE 350000
            END
        )
    ) t3
) AND
-- "before March 2021" means an earlier year, or 2021 with a month before March
(((min_year < 2021 OR (min_year = 2021 AND min_month < 3)) AND t1.sum_price >= 700000) OR
 ((min_year = 2021 AND min_month = 3) AND t1.sum_price >= 350000))
GROUP BY min_year, min_month
First you find the MIN(createdAt) for each member, with:
SELECT users_id, MIN(createdAt) min_createdAt
FROM order_star_member
GROUP BY users_id
Then you compute the SUM of all the total_price_star_member in the month of the min_createdAt date:
SELECT osm2.users_id, YEAR(min_createdAt) min_year, MONTH(min_createdAt) min_month, SUM(total_price_star_member) sum_price
FROM osm1
JOIN order_star_member osm2 ON osm1.users_id = osm2.users_id
WHERE DATE_FORMAT(osm2.createdAt, '%Y%m') = DATE_FORMAT(osm1.min_createdAt, '%Y%m')
GROUP BY osm2.users_id, min_createdAt
Next you filter on the month you are interested in. HAVING can only use expressions that can be computed from the GROUP BY columns or from aggregates, so you also need to select DATE_FORMAT(createdAt, '%Y-%m-01'); the HAVING clause can then use it to pick the minimum total price for star membership.
SELECT users_id
FROM (
    SELECT users_id, DATE_FORMAT(createdAt, '%Y-%m-01') month_createdAt
    FROM order_star_member
    WHERE DATE_FORMAT(createdAt, '%Y%m') = '202103'
    GROUP BY users_id, DATE_FORMAT(createdAt, '%Y-%m-01')
    HAVING SUM(total_price_star_member) >= (
        CASE
            WHEN date(month_createdAt) < date '2021-03-01' THEN 700000
            ELSE 350000
        END
    )
) t3
Finally, you also check min_month and min_year, then group by these attributes and COUNT how many members are in each group.
SELECT min_year, min_month, COUNT(users_id)
FROM t1
WHERE users_id IN (...) AND
(((min_year < 2021 OR (min_year = 2021 AND min_month < 3)) AND t1.sum_price >= 700000) OR
 ((min_year = 2021 AND min_month = 3) AND t1.sum_price >= 350000))
GROUP BY min_year, min_month
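A side note on the year/month check: testing the year and the month separately and joining the two tests with OR treats January 2022 as "before March 2021", because 1 < 3. A small sketch of one way around that, assuming you fold year and month into a single sortable number (the same idea as the DATE_FORMAT(createdAt, '%Y%m') key in the question). The function names are made up for the illustration.

```python
def period_key(year, month):
    """Fold (year, month) into one sortable integer, e.g. (2021, 3) -> 202103."""
    return year * 100 + month

# Month in which the star-member threshold drops from 700000 to 350000.
RULE_CHANGE = period_key(2021, 3)

def required_total(year, month):
    """Monthly total needed for star membership under the rule in force."""
    return 700000 if period_key(year, month) < RULE_CHANGE else 350000

def naive_is_before(year, month):
    """Year and month compared separately; kept only to show where it goes wrong."""
    return year < 2021 or month < 3

print(required_total(2021, 2))   # 700000
print(required_total(2021, 3))   # 350000
print(naive_is_before(2022, 1))  # True, although January 2022 is after March 2021
```

With the period key, one comparison covers every year, so the rule keeps working when the data grows past 2021.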
I did not immediately understand what your goal is, and I am not sure I understand it now, which is why I have changed this query a few times; you may be able to simplify it.

Related

count with more than 1 having clause

So I have a case from my previous question, how to count with more than 1 having clause in MySQL.
Assume I have dummy data like this:
CREATE TABLE order_star_member ( users_id INT,
createdAt DATE,
total_price_star_member DECIMAL(10,2) );
INSERT INTO order_star_member VALUES
(12,'2019-01-01',100000),
(12,'2019-01-10',100000),
(12,'2019-01-20',100000),
(12,'2019-02-10',100000),
(12,'2019-02-15',300000),
(12,'2019-02-21',500000),
(13,'2019-01-02',900000),
(13,'2019-01-11',300000),
(13,'2019-01-18',400000),
(13,'2019-02-06',100000),
(13,'2019-02-08',900000),
(13,'2019-02-14',400000),
(14,'2019-01-21',500000),
(14,'2019-01-23',200000),
(14,'2019-01-24',300000),
(14,'2019-02-08',100000),
(14,'2019-02-09',200000),
(14,'2019-02-14',100000),
(15, '2019-03-04',1000000),
(14, '2019-03-04', 300000),
(14, '2019-03-04', 350000),
(13, '2019-03-04', 400000),
(15, '2019-01-23', 620000),
(15, '2019-02-01', 650000),
(12, '2019-03-03', 750000),
(16, '2019-03-04', 650000),
(17, '2019-03-03', 670000),
(18, '2019-02-02', 450000),
(19, '2019-03-03', 750000);
SELECT * from order_star_member;
and then I summarize the data per month:
-- summary per-month data
SELECT users_id,
SUM( CASE WHEN MONTHNAME(createdAt) = 'January'
THEN total_price_star_member
END ) total_price_star_member_January,
SUM( CASE WHEN MONTHNAME(createdAt) = 'February'
THEN total_price_star_member
END ) total_price_star_member_February,
SUM( CASE WHEN MONTHNAME(createdAt) = 'March'
THEN total_price_star_member
END ) total_price_star_member_March
FROM order_star_member
GROUP BY users_id
ORDER BY 1;
In my previous question I have a table called order_star_member, which contains createdAt as the date of the transaction, users_id as the buyer, and total_price_star_member as the amount of the transaction. In this case I want to find the buyers who are star members (transactions in a month totalling >= 600000) and find out where they come from. The dummy data for order_star_member runs from January 2019 until March 2019, and the problem was solved by @Akina with this query:
SELECT COUNT(users_id) count_star_member,
year_and_month DIV 100 `year`,
year_and_month MOD 100 `month`
FROM (SELECT users_id,
MIN(year_and_month) year_and_month
FROM ( SELECT users_id,
DATE_FORMAT(createdAt, '%Y%m') year_and_month,
SUM(total_price_star_member) month_price
FROM order_star_member
GROUP BY users_id,
DATE_FORMAT(createdAt, '%Y%m')
HAVING month_price >= 600000 ) starrings
GROUP BY users_id
HAVING SUM(year_and_month = '201903') > 0 ) first_starrings
GROUP BY year_and_month
ORDER BY `year`, `month`;
Explanation: I want to find the distribution of the star members in March (users_id whose transactions total >= 600000 in a month) by the month in which each users_id first reached >= 600,000 before March (if the users_id reaches it for the first time in March, the users_id counts toward the March-to-March statistic).
But from April 2019, to be a star member your transactions must total >= 700,000 instead of 600,000. So I want to find the users whose transactions in April 2019 total >= 700,000 (star members) and whose first transaction was before April 2019 with a monthly total of >= 600,000.
So which part of this query should I change to find the users whose total transactions in April are >= 700,000 and whose first transaction (if it was before April) was >= 600,000?

Including and excluding specific records

I want to find the buyers who meet a special condition (in this case, transactions >= 600000, called star members).
In this case, I want to find the star members (transactions >= 600000) who exist in January 2020 and March 2020, excluding any star member who also qualifies in February 2020.
Here's my query:
SELECT users_id
FROM order_star_member
GROUP BY users_id
HAVING SUM(CASE WHEN MONTHNAME(createdAt) = 'January'
THEN total_price_star_member END) >= 600000
AND SUM(CASE WHEN MONTHNAME(createdAt) = 'March'
THEN total_price_star_member END) >= 600000
AND NOT EXISTS (SELECT 1 FROM order_star_member
GROUP BY users_id
having sum(case when monthname(createdAt) = 'February'
THEN total_price_star_member END) >= 600000);
And here's my fiddle:
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=2c85037215fe71f700b51c8fd3a5ae76
In my fiddle, the expected result is users_id 15, because that id ordered in January and March but not in February.

First, in the inner query t, we group by month to determine all the star members.
The outer query groups by users_id. Each user's score is the sum of their star_member flags.
For February (m=2, February being the second month, on the first line of the query below), a user who is a star_member gets a penalty (-100), an arbitrary value that the SUM cannot overcome.
The only way a month_score of 2 can exist is if a user has star_member true (1) for both January and March but not February.
SELECT users_id, SUM(IF(m=2 and star_member, -100, star_member)) as month_score
FROM
(SELECT users_id,
MONTH(createdAt) as m,
SUM(total_price_star_member) >= 600000 as star_member
FROM order_star_member
WHERE createdAt BETWEEN '20190101' AND '20190331'
GROUP BY users_id, m
) t
GROUP BY users_id
HAVING month_score=2
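To see the penalty trick in action, here is a small sketch against an in-memory SQLite database driven from Python. The three users and their amounts are made-up data, and since SQLite has no MONTH(), strftime('%m', ...) stands in for it and a CASE expression stands in for IF().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_star_member (users_id INTEGER, createdAt TEXT, total_price_star_member NUMERIC);
-- hypothetical data: user 1 is a star in Jan and Mar only, user 2 in all three months,
-- user 3 in Jan only
INSERT INTO order_star_member VALUES
    (1, '2019-01-05', 600000), (1, '2019-02-05', 100000), (1, '2019-03-05', 700000),
    (2, '2019-01-05', 600000), (2, '2019-02-05', 650000), (2, '2019-03-05', 700000),
    (3, '2019-01-05', 600000), (3, '2019-03-05', 200000);
""")

SCORE_SQL = """
SELECT users_id
FROM (SELECT users_id,
             -- a star month counts 1, except a star February, which counts -100
             SUM(CASE WHEN m = 2 AND star_member THEN -100 ELSE star_member END) AS month_score
      FROM (SELECT users_id,
                   CAST(strftime('%m', createdAt) AS INTEGER) AS m,
                   SUM(total_price_star_member) >= 600000 AS star_member
            FROM order_star_member
            WHERE createdAt BETWEEN '2019-01-01' AND '2019-03-31'
            GROUP BY users_id, strftime('%m', createdAt)) t
      GROUP BY users_id) scored
WHERE month_score = 2
ORDER BY users_id
"""

def star_jan_mar_not_feb():
    """Users who are star members in January and March but not in February."""
    return [row[0] for row in conn.execute(SCORE_SQL)]

print(star_jan_mar_not_feb())  # [1]
```

User 2 ends up with 1 - 100 + 1 = -98 and user 3 with 1, so only user 1 reaches a score of exactly 2.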

How to sum columns from two tables if the month match and group by month

Currently I achieved to do it with a single table with this query:
SELECT EXTRACT(MONTH FROM date) as month, SUM(total) as total FROM invoices GROUP BY month ORDER BY month ASC
But now I'm going crazy trying to return the same result from two columns, let's say total1 and total2, grouped by month; if there is no invoice in a month in one of the columns, the result should be zero.
Tables structure and expected result:
invoices | payments
date     | date
total    | income

month | totalInvoices | totalPayments
1     | 10005         | 8017
2     | 756335        | 5019
3     | 541005        | 8017
4     | 34243         | 8870
How do I achieve this? Any suggestions?

You need a third element to the query structure which provides a complete list of all relevant years/months. This might be an existing table or a subquery, but the overall query will follow the structure outlined below:
CREATE TABLE invoices
(`id` int, `invdate` datetime, `invtotal` numeric);
INSERT INTO invoices
(`id`, `invdate`, `invtotal`)
VALUES
(1, '2017-01-21 00:00:00', 12.45);
CREATE TABLE payments
(`id` int, `paydate` datetime, `paytotal` numeric);
INSERT INTO payments
(`id`, `paydate`, `paytotal`)
VALUES
(1, '2017-02-21 00:00:00', 12.45);
select
ym.year, ym.month, inv.invtotal, pay.paytotal
from (
SELECT
EXTRACT(YEAR FROM invdate) as year
, EXTRACT(MONTH FROM invdate) as month
FROM invoices
UNION
SELECT
EXTRACT(YEAR FROM paydate) as year
, EXTRACT(MONTH FROM paydate) as month
FROM payments
) ym
left join (
SELECT
EXTRACT(YEAR FROM invdate) as year
, EXTRACT(MONTH FROM invdate) as month
, SUM(invtotal) as invtotal
FROM invoices
GROUP BY year, month
) inv on ym.year = inv.year and ym.month = inv.month
left join (
SELECT
EXTRACT(YEAR FROM paydate) as year
, EXTRACT(MONTH FROM paydate) as month
, SUM(paytotal) as paytotal
FROM payments
GROUP BY year, month
) pay on ym.year = pay.year and ym.month = pay.month;
year | month | invtotal | paytotal
-----|-------|----------|---------
2017 | 1     | 12       | null
2017 | 2     | null     | 12
In my example the "third element" is the subquery ym. This may be too inefficient for your actual query, but it should serve to show how to coordinate data over disparate time ranges.
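The question asked for zero rather than null in months where one side has no rows. A minimal sketch of the same month-list-plus-two-left-joins structure with COALESCE added, run against in-memory SQLite from Python. The sample amounts are made up, the derived table name months is my own, and strftime stands in for EXTRACT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoices (id INTEGER, invdate TEXT, invtotal NUMERIC);
CREATE TABLE payments (id INTEGER, paydate TEXT, paytotal NUMERIC);
-- hypothetical sample rows
INSERT INTO invoices VALUES (1, '2017-01-21', 100), (2, '2017-01-25', 50), (3, '2017-03-02', 30);
INSERT INTO payments VALUES (1, '2017-02-21', 80), (2, '2017-03-05', 20);
""")

TOTALS_SQL = """
SELECT months.ym,
       COALESCE(inv.total, 0) AS totalInvoices,   -- 0 when the month has no invoices
       COALESCE(pay.total, 0) AS totalPayments    -- 0 when the month has no payments
FROM (SELECT strftime('%Y-%m', invdate) AS ym FROM invoices
      UNION
      SELECT strftime('%Y-%m', paydate) FROM payments) months
LEFT JOIN (SELECT strftime('%Y-%m', invdate) AS ym, SUM(invtotal) AS total
           FROM invoices
           GROUP BY strftime('%Y-%m', invdate)) inv ON inv.ym = months.ym
LEFT JOIN (SELECT strftime('%Y-%m', paydate) AS ym, SUM(paytotal) AS total
           FROM payments
           GROUP BY strftime('%Y-%m', paydate)) pay ON pay.ym = months.ym
ORDER BY months.ym
"""

def monthly_totals():
    """One row per month that appears in either table, with missing sides as 0."""
    return conn.execute(TOTALS_SQL).fetchall()

print(monthly_totals())
# [('2017-01', 150, 0), ('2017-02', 0, 80), ('2017-03', 30, 20)]
```

In the MySQL query above, the equivalent change would be wrapping inv.invtotal and pay.paytotal in COALESCE(..., 0) in the outer select list.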

MySQL: how to select record with latest date before a certain date

Assume a simple table as follows, containing status events for two users. A status_id of 1 makes them 'active'; anything else makes them inactive.
I need to find all those users that became inactive within one year of, for example, 2015-05-01 (not including that date).
CREATE TABLE user_status(
user_id INT,
status_id INT,
date_assigned VARCHAR(10) );
INSERT INTO user_status( user_id, status_id, date_assigned)
VALUES
(1234, 1, '2015-01-01'), -- 1234 becomes active (status id = 1)
(1234, 2, '2015-07-01'), -- 1234 de-activated for reason 2
(5678, 1, '2015-02-01'), -- 5678 becomes active (status id = 1)
(5678, 3, '2015-04-01'), -- 5678 de-activated for reason 3
(5678, 5, '2015-06-01'); -- 5678 de-activated for reason 5
Using the query
SELECT t1.*
FROM user_status t1
WHERE t1.date_assigned = (SELECT MIN(t2.date_assigned) -- the first occurrence
FROM user_status t2
WHERE t2.user_id = t1.user_id -- for this user
AND t2.status_id <> 1 -- where status not active
AND t2.date_assigned BETWEEN -- within 1 yr of given date
'2015-05-01' + INTERVAL 1 DAY -- (not including that date)
AND
'2015-05-01' + INTERVAL 1 YEAR
)
I can get the result
user_id status_id date_assigned
1234 2 2015-07-01
5678 5 2015-06-01
This is sort of right, but user 5678 should not be there: although they had an inactive event within the date range, they were already inactive before the range began, so they did not become inactive within that range.
I need to add a bit to my query along the lines of 'only show me those users who had an inactive event where the previous status_id for that user was 1', i.e. they were active at the time the inactive event happened.
Can anyone help me get the syntax correct?

You can add NOT EXISTS to your query:
SELECT t1.*
FROM user_status t1
WHERE t1.date_assigned = (SELECT MIN(t2.date_assigned) -- the first occurrence
FROM user_status t2
WHERE t2.user_id = t1.user_id -- for this user
AND t2.status_id <> 1 -- where status not active
AND t2.date_assigned BETWEEN -- within 1 yr of given date
'2015-05-01' + INTERVAL 1 DAY -- (not including that date)
AND
'2015-05-01' + INTERVAL 1 YEAR
)
AND NOT EXISTS (SELECT 1 -- such a record should not exist
FROM user_status t3
WHERE t3.user_id = t1.user_id -- for this user
AND t3.status_id <> 1 -- where status is not active
AND t3.date_assigned < -- before the examined period
'2015-05-01' + INTERVAL 1 DAY )
Edit:
You can use the following query that also considers the case of having multiple activation dates:
SELECT *
FROM user_status
WHERE (user_id, date_assigned) IN (
-- get last de-activation date
SELECT t1.user_id, MAX(t1.date_assigned)
FROM user_status AS t1
JOIN (
-- get last activation date
SELECT user_id, MAX(date_assigned) AS activation_date
FROM user_status
WHERE status_id = 1
GROUP BY user_id
) AS t2 ON t1.user_id = t2.user_id AND t1.date_assigned > t2.activation_date
GROUP BY t1.user_id -- qualified, because both t1 and t2 expose a user_id column
HAVING MAX(date_assigned) BETWEEN '2015-05-01' + INTERVAL 1 DAY AND '2015-05-01' + INTERVAL 1 YEAR AND
MIN(date_assigned) > '2015-05-01' + INTERVAL 1 DAY)
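Another way to express "the previous status was active" is to look up the status of the most recent earlier event for the same user. A minimal sketch of that idea, not taken from either answer, run against in-memory SQLite from Python; SQLite's date(:start, '+1 year') stands in for MySQL's '2015-05-01' + INTERVAL 1 YEAR, and the aliases cur and prev are my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_status (user_id INTEGER, status_id INTEGER, date_assigned TEXT);
INSERT INTO user_status VALUES
    (1234, 1, '2015-01-01'), (1234, 2, '2015-07-01'),
    (5678, 1, '2015-02-01'), (5678, 3, '2015-04-01'), (5678, 5, '2015-06-01');
""")

BECAME_INACTIVE_SQL = """
SELECT cur.user_id, cur.status_id, cur.date_assigned
FROM user_status cur
WHERE cur.status_id <> 1                                  -- an inactive event
  AND cur.date_assigned > :start                          -- after the given date
  AND cur.date_assigned <= date(:start, '+1 year')        -- within one year of it
  AND 1 = (SELECT prev.status_id                          -- the event just before it
           FROM user_status prev                          -- must have been 'active'
           WHERE prev.user_id = cur.user_id
             AND prev.date_assigned < cur.date_assigned
           ORDER BY prev.date_assigned DESC
           LIMIT 1)
ORDER BY cur.user_id, cur.date_assigned
"""

def became_inactive(start):
    """Inactive events in (start, start + 1 year] whose previous event was status 1."""
    return conn.execute(BECAME_INACTIVE_SQL, {"start": start}).fetchall()

print(became_inactive("2015-05-01"))  # [(1234, 2, '2015-07-01')]
```

User 5678 drops out because the event before 2015-06-01 has status 3, not 1. The correlated ORDER BY ... LIMIT 1 subquery should also be valid in MySQL, but I have only run it in SQLite.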

Self-join solution: finding the first time the status changed within your date criteria:
select a.user_id,b.status_id,max(b.date_assigned)
from user_status a
inner join user_status b
on a.user_id=b.user_id
and a.date_assigned <b.date_assigned
where b.status_id >1 and a.status_id=1
group by a.user_id,b.status_id
having max(b.date_assigned)> '2015-05-01'
and max(b.date_assigned) <='2016-05-01'

Mysql : Finding empty time blocks between two dates and times?

I want to find a user's availability from a database table:
primary id | UserId | startdate | enddate
1 | 42 | 2014-05-18 09:00 | 2014-05-18 10:00
2 | 42 | 2014-05-18 11:00 | 2014-05-18 12:00
3 | 42 | 2014-05-18 14:00 | 2014-05-18 16:00
4 | 42 | 2014-05-18 18:00 | 2014-05-18 19:00
Let's consider the data above to be the user's busy time. I want to find the free time blocks in the table between a start time and an end time:
BETWEEN 2014-05-18 11:00 AND 2014-05-18 19:00;
Let me add the table schema here to avoid confusion:
Create Table availability (
pid int not null,
userId int not null,
StartDate datetime,
EndDate datetime
);
Insert Into availability values
(1, 42, '2013-10-18 09:00', '2013-10-18 10:00'),
(2, 42, '2013-10-18 11:00', '2013-10-18 12:00'),
(3, 42, '2013-10-18 14:00', '2013-11-18 16:00'),
(4, 42, '2013-10-18 18:00', '2013-11-18 19:00');
REQUIREMENT:
I want to find free gap records like:
'2013-10-27 10:00' to '2013-10-28 11:00' - User is available for 1 hour and
'2013-10-27 12:00' to '2013-10-28 14:00' - User is available for 2 hours and
the available start times are '2013-10-27 10:00' and '2013-10-27 12:00' respectively.

Here you go
SELECT t1.userId,
t1.enddate, MIN(t2.startdate),
MIN(TIMESTAMPDIFF(HOUR, t1.enddate, t2.startdate))
FROM availability t1
JOIN availability t2 ON t1.UserId=t2.UserId
AND t2.startdate > t1.enddate AND t2.pid > t1.pid
WHERE
t1.endDate >= '2013-10-18 09:00'
AND t2.startDate <= '2013-11-18 19:00'
GROUP BY t1.UserId, t1.endDate
http://sqlfiddle.com/#!2/50d693/1

Using your data, the easiest way is to list the hours when someone is free. The following gets a list of hours when someone is available:
select (StartTime + interval n.n hour) as FreeHour
from (select cast('2014-05-18 11:00' as datetime) as StartTime,
cast('2014-05-18 19:00' as datetime) as EndTime
) var join
(select 0 as n union all select 1 union all select 2 union all select 3 union all select 4 union all
select 5 union all select 6 union all select 7 union all select 8 union all select 9
) n
on StartTime + interval n.n hour <= EndTime
where not exists (select 1
from availability a
where StartTime + interval n.n hour < a.EndDate and
StartTime + interval n.n hour >= a.StartDate
);
EDIT:
The general solution to your problem requires denormalizing the data. The basic query is:
select thedate, sum(isstart) as isstart, @cum := @cum + sum(isstart) as cum
from ((select StartDate as thedate, 1 as isstart
       from availability a
       where userid = 42
      ) union all
      (select EndDate as thedate, -1 as isstart
       from availability a
       where userid = 42
      ) union all
      (select cast('2014-05-18 11:00' as datetime), 0 as isstart
      ) union all
      (select cast('2014-05-18 19:00' as datetime), 0 as isstart
      )
     ) t cross join
     (select @cum := 0) var
group by thedate
order by thedate
You then just choose the rows where cum = 0. The challenge is getting the next date from this list. In MySQL (before 8.0) that is a real pain, because you cannot use a CTE, a view, or a window function here, so you have to repeat the query. This is why I think the first approach is probably better for your situation.
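If the gap logic ends up in application code instead, the "get the next date" step becomes easy. Here is a minimal sketch in Python, assuming the busy rows have already been fetched as (start, end) pairs. It is a plain sweep over the sorted intervals rather than the user-variable query above, and free_blocks is a made-up name.

```python
from datetime import datetime

def free_blocks(busy, window_start, window_end):
    """Return the (start, end) gaps inside the window not covered by a busy interval."""
    gaps = []
    cursor = window_start              # earliest moment not yet accounted for
    for start, end in sorted(busy):
        if end <= cursor:              # interval lies entirely before the cursor
            continue
        if start >= window_end:        # interval lies entirely after the window
            break
        if start > cursor:             # uncovered stretch before this interval
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < window_end:            # free time left at the end of the window
        gaps.append((cursor, window_end))
    return gaps

def at(hour):
    """Helper for the sample day used in the question's first table."""
    return datetime(2014, 5, 18, hour, 0)

busy = [(at(9), at(10)), (at(11), at(12)), (at(14), at(16)), (at(18), at(19))]
for start, end in free_blocks(busy, at(11), at(19)):
    print(start, "to", end)
# 2014-05-18 12:00:00 to 2014-05-18 14:00:00
# 2014-05-18 16:00:00 to 2014-05-18 18:00:00
```

Overlapping busy intervals are handled by the max() on the cursor, so the input does not need to be merged first.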

The core query can be this. You can dress it up as you like, but I'd handle all of that in the presentation layer...
SELECT a.enddate 'Available From'
     , MIN(b.startdate) 'To'
  FROM availability a
  JOIN availability b
    ON b.userid = a.userid
   AND b.startdate > a.enddate
 GROUP
    BY a.enddate
HAVING a.enddate < MIN(b.startdate)
For times outside the 'booked' range, you have to extend this a little with a UNION, or again handle the logic at the application level.
