Calculate churn rate using a query instead of a stored procedure - mysql

I am writing queries for some KPIs (Key Performance Indicators) to track user engagement. One such KPI is "Churn Rate", which I am calculating for a given month by:
Churn rate = (Total users deleted in month)/(Total users on the 1st of month)
I am using a users table with the following columns:
created_at, deleted_at
My process is to get all relevant months of user activity (in this case, based on "created_at" column, since we are getting several new users per month. We also have an activity log table which might technically be more accurate to use but doesn't go back as far) and then loop over them in a stored procedure. For each month, I'm calculating who was deleted that month and who was active on the first of that month (created on or before the 1st of the month and either not deleted or deleted after the first of that month). Then I'm dividing them to find churn rate and inserting into a temporary table. Here is my stored procedure:
DROP PROCEDURE IF EXISTS ChurnRate;
DELIMITER $$
CREATE PROCEDURE ChurnRate()
BEGIN
DECLARE start_date DATETIME;
DECLARE end_date DATETIME;
DECLARE cur_date DATETIME;
DECLARE current_month VARCHAR(255);
DECLARE end_month VARCHAR(255);
DECLARE deleted_count BIGINT;
DECLARE active_user_count BIGINT;
DECLARE churn_rate FLOAT;
SELECT created_at FROM users ORDER BY created_at ASC LIMIT 1 INTO start_date;
SELECT created_at FROM users ORDER BY created_at DESC LIMIT 1 INTO end_date;
SET cur_date = start_date;
SET current_month = SUBSTR(cur_date,1,7);
SET end_month = SUBSTR(end_date,1,7);
DROP TEMPORARY TABLE IF EXISTS churn_table;
CREATE TEMPORARY TABLE churn_table
(
user_month VARCHAR(255),
deleted_count BIGINT,
active_user_count BIGINT,
churn_rate FLOAT
);
loop_label: LOOP
SELECT COUNT(U.id) FROM users AS U WHERE SUBSTR(U.deleted_at,1,7) = current_month INTO deleted_count;
SELECT COUNT(U.id) FROM users AS U
WHERE (U.deleted_at >= DATE_ADD(DATE_ADD(LAST_DAY(cur_date),INTERVAL 1 DAY),INTERVAL -1 MONTH) OR U.deleted_at IS NULL)
AND SUBSTR(U.created_at,1,7) <= current_month
INTO active_user_count;
INSERT INTO churn_table (user_month, deleted_count, active_user_count, churn_rate) VALUES (current_month, deleted_count, active_user_count, (deleted_count/active_user_count));
SET cur_date = DATE_ADD(cur_date, INTERVAL 1 MONTH);
SET current_month = SUBSTR(cur_date,1,7);
IF current_month <= end_month THEN
ITERATE loop_label;
END IF;
LEAVE loop_label;
END LOOP;
SELECT * FROM churn_table;
END$$
DELIMITER ;
CALL ChurnRate();
Here is a sample of some data that was produced:
user_month   churn_rate_percentage
2019-12      0
2020-01      0.0396982
2020-02      0
2020-03      0
2020-04      0
2020-05      0.112116
2020-06      0.59691
2020-07      0.26689
2020-08      0.144374
2020-09      0.141767
2020-10      0.125
2020-11      0.272904
2020-12      0.14937
My problem is this: I am using an API that requires this to be a select query. I have previously tried writing select queries for this, but they have been flawed. Grouping by "deleted_at" will not work because we will not show months for which no users have been deleted. Grouping by "created_at" and using subqueries ends up being extremely slow, as we have about 50k users. Is there a clean, efficient way to write this as a select query without affecting performance?
If there is not, I will have to write a cron job to run this procedure and export the data.
Thank you

You shouldn't use loops in SQL; that is often an indication you are doing something wrong.
Here is how to do this in a single query:
-- recursive CTE to create list of months of interest
WITH RECURSIVE base_months (d, y, m) AS
(
  -- anchor row: first day of the month in which the earliest user was created
  SELECT CAST(DATE_FORMAT(MIN(created_at), '%Y-%m-01') AS DATE),
         YEAR(MIN(created_at)), MONTH(MIN(created_at))
  FROM users
  UNION ALL
  SELECT DATE_ADD(d, INTERVAL 1 MONTH), YEAR(DATE_ADD(d, INTERVAL 1 MONTH)), MONTH(DATE_ADD(d, INTERVAL 1 MONTH))
  FROM base_months
  -- stop once the current month has been generated
  WHERE DATE_ADD(d, INTERVAL 1 MONTH) <= CURDATE()
)
SELECT
  b.y AS year,
  b.m AS month,
  COUNT(u.created_at) AS total_user,
  SUM(CASE WHEN MONTH(u.deleted_at) = b.m AND YEAR(u.deleted_at) = b.y THEN 1 ELSE 0 END) AS left_this_month
FROM base_months b
-- for each month join to the users table
JOIN users u ON u.created_at < b.d AND (u.deleted_at > b.d OR u.deleted_at IS NULL)
GROUP BY b.y, b.m
If this isn't clear: first we use a recursive CTE to generate all the months and years of interest. You could instead run a non-recursive query on the table with a GROUP BY if you only want months in which at least one user was created, but that would skip every month with no new users, which is probably not what you want.
Then I join that back to the users table with filters on the join to only include the rows we want to count for the given year and month. We use group by and aggregation functions to find the results.
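To make the months-then-join idea concrete, here is a minimal sketch that runs against SQLite through Python's built-in sqlite3 module rather than MySQL, so it can be tried without a server. The sample rows, the `churn_by_month` helper and the explicit `last_month` parameter (used instead of CURDATE()) are assumptions made for illustration. The sketch also uses a LEFT JOIN, which differs from the query above, so that a month with no active users still shows up.

```python
import sqlite3

# Hypothetical sample data: (created_at, deleted_at); None means never deleted.
USERS = [
    ("2020-01-05", None),
    ("2020-01-20", "2020-03-15"),
    ("2020-02-10", "2020-03-02"),
    ("2020-02-25", None),
]

CHURN_SQL = """
WITH RECURSIVE months(month_start) AS (
    -- first day of the month of the earliest signup
    SELECT date(MIN(created_at), 'start of month') FROM users
    UNION ALL
    SELECT date(month_start, '+1 month') FROM months
    WHERE month_start < :last_month
)
SELECT strftime('%Y-%m', m.month_start) AS user_month,
       COUNT(u.created_at) AS active_on_first,
       -- a comparison yields 1/0 (or NULL) in SQLite, so SUM counts the true rows
       COALESCE(SUM(u.deleted_at >= m.month_start
                    AND u.deleted_at < date(m.month_start, '+1 month')), 0) AS deleted_in_month
FROM months m
LEFT JOIN users u
       ON u.created_at < m.month_start
      AND (u.deleted_at >= m.month_start OR u.deleted_at IS NULL)
GROUP BY m.month_start
ORDER BY m.month_start
"""


def churn_by_month(users, last_month):
    """Return (month, active_on_first, deleted_in_month, churn) per month."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (created_at TEXT, deleted_at TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", users)
    rows = conn.execute(CHURN_SQL, {"last_month": last_month}).fetchall()
    conn.close()
    # Divide in Python so a month with zero active users gives None, not an error.
    return [(month, active, deleted, deleted / active if active else None)
            for month, active, deleted in rows]


if __name__ == "__main__":
    for row in churn_by_month(USERS, "2020-03-01"):
        print(row)
```

Note that, like the query above, this only counts deletions among users who were active on the first of the month; users created and deleted within the same month are not counted.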

Looping is likely to be terribly slow.
Is this how you decide if a user exists on Nov 1, 2020?
WHERE created_at < '2020-11-01'
AND (deleted_at >= '2020-11-01' OR deleted_at IS NULL)
Hence, a COUNT(*) with that test would give that count?
For deletions for that month:
WHERE LEFT(deleted_at, 7) = '2020-11'
Putting those together into a single query for all months:
SELECT m.yyyymm,
  ( SELECT COUNT(*)
      FROM users
      WHERE created_at < m.month_start
        AND (deleted_at >= m.month_start OR deleted_at IS NULL)
  ) AS active_users,
  ( SELECT COUNT(*)
      FROM users
      WHERE deleted_at >= m.month_start
        AND deleted_at < m.month_start + INTERVAL 1 MONTH
  ) AS deleted_users
FROM ( SELECT DISTINCT LEFT(created_at, 7) AS yyyymm,
         CAST(CONCAT(LEFT(created_at, 7), '-01') AS DATE) AS month_start
       FROM users
     ) AS m
ORDER BY m.yyyymm
That gives you 3 columns; check it out. To get the churn:
SELECT m.yyyymm,
  ( SELECT ... ) / ( SELECT ... ) AS churn
FROM ( SELECT DISTINCT ... ) AS m
ORDER BY m.yyyymm

Related

Select while instead of While select causing issue

I'm working on a ranged date query, and trying to adjust the rules for the loop but I have a bit of a problem:
Take the following:
DROP PROCEDURE
IF EXISTS test;
CREATE PROCEDURE test ( IN start_date DATE ) BEGIN
DECLARE group_name VARCHAR ( 10 ) DEFAULT 'clientA';
DECLARE service_name VARCHAR ( 10 ) DEFAULT 'serviceA';
WHILE ( start_date < CURDATE( ) && SUBDATE( start_date, INTERVAL - 2 WEEK ) < CURDATE( ) ) DO
SELECT start_date AS 'Start Day', SUBDATE( start_date, INTERVAL - 2 WEEK ) AS 'End Day';
SET start_date = SUBDATE( start_date, INTERVAL - 2 WEEK );
END WHILE;
END;
This selects a start and end date from a starting point up to today:
CALL test ( '2019-08-29' );
Returns 5 results:
08/29 & 09/12
09/12 & 09/26
09/26 & 10/10
10/10 & 10/24
10/24 & 11/7
This is what I want, but rather than 5 separate result sets, I want each of these as a row in one result set. I think the best way to do this is via a sub-query, with the inner query running the loop and doing the selects and the outer query serving as a wrapper to combine them into one set.
I have the following code:
DROP PROCEDURE
IF EXISTS test;
CREATE PROCEDURE test ( IN start_date DATE ) BEGIN
DECLARE group_name VARCHAR ( 10 ) DEFAULT 'clientA';
DECLARE service_name VARCHAR ( 10 ) DEFAULT 'serviceA';
SELECT * FROM (WHILE ( start_date < CURDATE( ) && SUBDATE( start_date, INTERVAL - 2 WEEK ) < CURDATE( ) ) DO
SELECT start_date AS 'Start Day', SUBDATE( start_date, INTERVAL - 2 WEEK ) AS 'End Day';
SET start_date = SUBDATE( start_date, INTERVAL - 2 WEEK );
END WHILE;
)
END;
But that gives me:
1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'AS 'Result' FROM (
WHILE ( start_date < CURDATE( ) && SUBDATE( start_date, IN' at line 4
I feel like there's something small wrong with my syntax here but I'm having difficulty understanding exactly what. Any guidance would be great!
You can create a dynamic list of dates by using MySQL user-defined variables (@variables) and joining to any table that has at least as many rows as you expect in the result set, e.g. if you need 5, 10 or 1000 records in the dynamic result.
select
-- whatever latest date is BECOMES the Begin Date
@beginDT BeginDate,
-- now, add 2 weeks to the @beginDT variable and save as the END Date
@beginDT := date_add( @beginDT, interval 2 week ) EndDate
from
-- pick any table that has as many 2-week cycles as you expect.
-- ex: if you wanted 1 yr, you would need any table with 26 or 27 records
AnyTableWithManyRecords,
-- start the query with your starting date, alias sqlvars is just place-holder
-- and will only prepare the variable and be one row for rest of query
( select @beginDT := '2019-08-29' ) sqlvars
having
-- having will stop until your maximum date of interest
BeginDate < curdate()
-- but limit to 100 so you don't query against table of millions of records.
-- how many records do you REALLY need to go through... again, 26 biweekly = 1 year
-- this limit of 100 would allow for almost 4 yrs worth
limit 100;
Then, if you wanted data from some other table, you could join to the above result set as its own derived table, such as
select
SOT.WhateverColumns
from
( above query ) MyDates
JOIN SomeOtherTable SOT
on MyDates.BeginDate <= SOT.SomeDate
AND SOT.SomeDate < MyDates.EndDate
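For comparison, the same list of two-week periods can also be produced as a single result set with a recursive common table expression, on servers that support one. The following is only a sketch: it runs on SQLite via Python's built-in sqlite3 module rather than on MySQL, and the `two_week_periods` helper and its explicit `today` parameter (standing in for CURDATE()) are made up for the example.

```python
import sqlite3

RANGES_SQL = """
WITH RECURSIVE periods(start_day, end_day) AS (
    SELECT date(:start), date(:start, '+14 days')
    UNION ALL
    -- the end of one period is the start of the next
    SELECT end_day, date(end_day, '+14 days') FROM periods
    WHERE end_day < :today
)
SELECT start_day, end_day
FROM periods
WHERE end_day < :today   -- keep only periods that ended before today
ORDER BY start_day
"""


def two_week_periods(start, today):
    """Return one (start_day, end_day) row per two-week period ending before `today`."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(RANGES_SQL, {"start": start, "today": today}).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for start_day, end_day in two_week_periods("2019-08-29", "2019-11-08"):
        print(start_day, end_day)
```

With a start date of 2019-08-29 and 2019-11-08 as "today", this returns the same five periods listed in the question, as rows of one result.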

SUM subquery with condition depends on parent query columns returns NULL

I'm trying to calculate the sum of deal prices for each day. Here is what I do:
SET @symbols_set = "A,B,C,D";
DROP TABLE IF EXISTS temp_deals;
CREATE TABLE temp_deals AS SELECT Deal, TimeMsc, Price, VolumeExt, Symbol FROM deals WHERE TimeMsc >= "2019-04-01" AND TimeMsc <= "2019-06-30" AND FIND_IN_SET(Symbol, @symbols_set) > 0;
SELECT
DATE_FORMAT(TimeMsc, "%d/%m/%Y") AS Date,
Symbol,
(SELECT SUM(Price) FROM temp_deals dap WHERE dap.TimeMsc BETWEEN Date AND Date + INTERVAL 1 DAY AND dap.Symbol = Symbol) AS AvgPrice
FROM temp_deals
ORDER BY Date;
DROP TABLE IF EXISTS temp_deals;
But in the result I get NULL in the AvgPrice column, and I can't understand what I'm doing wrong.
It looks like I can't pass the parent query's column to the subquery. Am I right?
Qualify your column names. But mostly, don't use a string for comparing dates:
SELECT DATE_FORMAT(d.TimeMsc, '%d/%m/%Y') AS Date,
d.Symbol,
(SELECT SUM(dap.Price)
FROM temp_deals dap
WHERE dap.TimeMsc >= d.TimeMsc AND
dap.TimeMsc < d.TimeMsc + INTERVAL 2 DAY AND -- not sure if you want 1 day or 2 day
dap.Symbol = d.Symbol
) AS AvgPrice
FROM temp_deals d
ORDER BY d.TimeMsc;
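To see why qualifying the column names matters, here is a small sketch using Python's built-in sqlite3 module (SQLite, not MySQL, though an unqualified name is resolved in the innermost scope in both). The sample deals and the `sum_per_day` helper are invented for the example, and it sums per calendar day, which may not be the exact window the query above uses.

```python
import sqlite3

# Hypothetical sample rows: (TimeMsc, Symbol, Price)
DEALS = [
    ("2019-04-01 10:00:00", "A", 10.0),
    ("2019-04-01 15:00:00", "A", 20.0),
    ("2019-04-01 11:00:00", "B", 5.0),
    ("2019-04-02 09:00:00", "A", 7.0),
]


def sum_per_day(symbol_test):
    """Run the per-day sum with the given symbol comparison inside the subquery."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE temp_deals (TimeMsc TEXT, Symbol TEXT, Price REAL)")
    conn.executemany("INSERT INTO temp_deals VALUES (?, ?, ?)", DEALS)
    # symbol_test is a fixed string from this script, never user input.
    rows = conn.execute(f"""
        SELECT DISTINCT date(d.TimeMsc) AS day, d.Symbol,
               (SELECT SUM(dap.Price)
                  FROM temp_deals dap
                 WHERE date(dap.TimeMsc) = date(d.TimeMsc)
                   AND {symbol_test}) AS total
        FROM temp_deals d
        ORDER BY day, d.Symbol
    """).fetchall()
    conn.close()
    return rows


# Unqualified: `Symbol` resolves to dap.Symbol, so the test is always true
# and every symbol gets the total for the whole day.
unqualified = sum_per_day("dap.Symbol = Symbol")
# Qualified: the subquery is correlated with the outer row's symbol.
qualified = sum_per_day("dap.Symbol = d.Symbol")

if __name__ == "__main__":
    print(unqualified)
    print(qualified)
```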

SQL Query issue with joining data don't exist in table [duplicate]

I've got a SQL Server CE 3.5 table (Transactions) with the following Schema:
ID
Transaction_Date
Category
Description
Amount
Query:
SELECT Transaction_Date, SUM(Amount)
FROM Transactions
GROUP BY Transaction_Date;
I'm trying to do a SUM(Amount) and group by transaction_date just so I can get the total amount for each day but I want to get back values even for days there were no transactions so basically the record for a day with no transactions would just have $0.00 for amount.
Thanks for the help!
You need a Calendar table to select over the dates. Alternatively, if you have a Numbers table, you could turn that effectively into a Calendar table. Basically, it's just a table with every date in it. It's easy enough to build and generate the data for it and it comes in very handy for these situations. Then you would simply use:
SELECT
C.calendar_date,
SUM(T.amount)
FROM
Calendar C
LEFT OUTER JOIN Transactions T ON
T.transaction_date = C.calendar_date
GROUP BY
C.calendar_date
ORDER BY
C.calendar_date
A few things to keep in mind:
If you're sending this to a front-end or reporting engine then you should just send the dates that you have (your original query) and have the front end fill in the $0.00 days itself if that's possible.
Also, I've assumed here that the date is an exact date value with no time component (hence the "=" in the join). Your calendar table could include a "start_time" and "end_time" so that you can use BETWEEN for working with dates that include a time portion. That saves you from having to strip off time portions and potentially ruining index usage. You could also just calculate the start and end points of the day when you use it, but since it's a prefilled work table it's easier IMO to include a start_time and end_time.
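As a quick illustration of the calendar-plus-LEFT-JOIN pattern, the sketch below builds the calendar on the fly and fills days without transactions with 0. It runs on SQLite through Python's built-in sqlite3 module, not on SQL Server CE, and the `daily_totals` helper, the lowercase table and column names, and the sample amounts are assumptions for the example.

```python
import sqlite3

DAILY_SQL = """
WITH RECURSIVE calendar(calendar_date) AS (
    SELECT date(:first_day)
    UNION ALL
    SELECT date(calendar_date, '+1 day') FROM calendar
    WHERE calendar_date < :last_day
)
SELECT c.calendar_date,
       COALESCE(SUM(t.amount), 0) AS total   -- days without rows become 0
FROM calendar c
LEFT JOIN transactions t ON t.transaction_date = c.calendar_date
GROUP BY c.calendar_date
ORDER BY c.calendar_date
"""


def daily_totals(transactions, first_day, last_day):
    """Return (date, total) for every day in the range, including empty days."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (transaction_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", transactions)
    rows = conn.execute(DAILY_SQL, {"first_day": first_day, "last_day": last_day}).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    sample = [("2010-01-01", 12.5), ("2010-01-01", 7.5), ("2010-01-03", 4.0)]
    for row in daily_totals(sample, "2010-01-01", "2010-01-04"):
        print(row)
```

A permanent Calendar table would replace the recursive part; the LEFT JOIN and COALESCE stay the same.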
You'll need to upper and lower bound your statement somehow, but perhaps this will help.
DECLARE @Start smalldatetime, @End smalldatetime
SELECT @Start = 'Jan 1 2010', @End = 'Jan 18 2010';
--- make a CTE of range of dates we're interested in
WITH Cal AS (
SELECT CalDate = convert(datetime, @Start)
UNION ALL
SELECT CalDate = dateadd(d,1,convert(datetime, CalDate)) FROM Cal WHERE CalDate < @End
)
SELECT CalDate AS TransactionDate, ISNULL(SUM(Amount),0) AS TransactionAmount
FROM Cal AS C
LEFT JOIN Transactions AS T On C.CalDate = T.Transaction_Date
GROUP BY CalDate ;
Once you have a Calendar table (more on that later), you can left join it to your data, restricted to the date range your data covers, to fill in missing dates:
SELECT c.CalendarDate, COALESCE(SUM(t.Amount), 0)
FROM (SELECT CalendarDate FROM Calendar
WHERE CalendarDate >= (SELECT MIN(Transaction_Date) FROM Transactions) AND
CalendarDate <= (SELECT MAX(Transaction_Date) FROM Transactions)) c
LEFT JOIN
Transactions t ON t.Transaction_Date = c.CalendarDate
GROUP BY c.CalendarDate
To create a calendar table, you can use a CTE:
WITH CalendarTable
AS
(
SELECT CAST('20090601' as datetime) AS [date]
UNION ALL
SELECT DATEADD(dd, 1, [date])
FROM CalendarTable
WHERE DATEADD(dd, 1, [date]) <= '20090630' /* last date */
)
SELECT [date] FROM CalendarTable
OPTION (MAXRECURSION 0);
Combining the two, we have
WITH CalendarTable
AS
(
SELECT MIN(Transaction_Date) AS [date] FROM Transactions
UNION ALL
SELECT DATEADD(dd, 1, [date])
FROM CalendarTable
WHERE DATEADD(dd, 1, [date]) <= (SELECT MAX(Transaction_Date) FROM Transactions)
)
SELECT c.[date], COALESCE(SUM(t.Amount), 0)
FROM CalendarTable c
LEFT JOIN
Transactions t ON t.Transaction_Date = c.[date]
GROUP BY c.[date]
OPTION (MAXRECURSION 0);
Not sure if any of this works with CE.
With common table expressions
DECLARE @StartDate DATETIME
DECLARE @EndDate DATETIME
SET @StartDate = '2010-07-10'
SET @EndDate = '2010-07-20'
;WITH Dates AS (
SELECT @StartDate AS DateValue
UNION ALL
SELECT DateValue + 1
FROM Dates
WHERE DateValue + 1 <= @EndDate
)
SELECT Dates.DateValue, ISNULL(SUM(Transactions.Amount), 0)
FROM Dates
LEFT JOIN Transactions ON
Dates.DateValue = Transactions.Transaction_Date
GROUP BY Dates.DateValue;
With loop + temporary table
DECLARE @StartDate DATETIME
DECLARE @EndDate DATETIME
SET @StartDate = '2010-07-10'
SET @EndDate = '2010-07-20'
SELECT @StartDate AS DateValue INTO #Dates
-- the start date is already in #Dates, so stop once the end date has been inserted
WHILE @StartDate < @EndDate
BEGIN
SET @StartDate = @StartDate + 1
INSERT INTO #Dates VALUES (@StartDate)
END
SELECT Dates.DateValue, ISNULL(SUM(Transactions.Amount), 0)
FROM #Dates AS Dates
LEFT JOIN Transactions ON
Dates.DateValue = Transactions.Transaction_Date
GROUP BY Dates.DateValue;
DROP TABLE #Dates
If you want dates that don't have transactions to appear, you can add a dummy transaction with an amount of zero for each day. It won't interfere with SUM and would do what you want.

Find number of "active" rows each month for multiple months in one query

I have a MySQL table in which each row contains an activate and a deactivate date. These mark the period of time during which the object the row represents was active.
activate deactivate id
2015-03-01 2015-05-10 1
2013-02-04 2014-08-23 2
I want to find the number of rows that were active at any time during each month. Ex.
Jan: 4
Feb: 2
Mar: 1
etc...
I figured out how to do this for a single month, but I'm struggling with how to do it for all 12 months in a year in a single query. The reason I would like it in a single query is for performance, as information is used immediately and caching wouldn't make sense in this scenario. Here's the code I have for a month at a time. It checks if the activate date comes before the end of the month in question and that the deactivate date was not before the beginning of the period in question.
SELECT * from tblName WHERE activate <= DATE_SUB(NOW(), INTERVAL 1 MONTH)
AND deactivate >= DATE_SUB(NOW(), INTERVAL 2 MONTH)
If anybody has any idea how to change this and do grouping such that I can do this for an indefinite number of months I'd appreciate it. I'm at a loss as to how to group.
If you have a table of months that you care about, you can do:
select m.*,
(select count(*)
from tblName t
where t.activate <= m.month_end and
t.deactivate >= m.month_start
) as Actives
from months m;
If you don't have such a table handy, you can create one on the fly:
select m.*,
(select count(*)
from tblName t
where t.activate <= m.month_end and
t.deactivate >= m.month_start
) as Actives
from (select date('2015-01-01') as month_start, date('2015-01-31') as month_end union all
select date('2015-02-01') as month_start, date('2015-02-28') as month_end union all
select date('2015-03-01') as month_start, date('2015-03-31') as month_end union all
select date('2015-04-01') as month_start, date('2015-04-30') as month_end
) m;
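The overlap test in these queries (activated on or before the month's end, and deactivated on or after the month's start) can be tried out with the sketch below. It uses SQLite through Python's built-in sqlite3 module instead of MySQL, builds the months table in Python, and reuses the two sample rows from the question; the `month_bounds` and `active_per_month` helpers are names invented for this example.

```python
import calendar
import sqlite3

# Sample rows from the question: (activate, deactivate, id)
ROWS = [
    ("2015-03-01", "2015-05-10", 1),
    ("2013-02-04", "2014-08-23", 2),
]

ACTIVE_SQL = """
SELECT m.month_start,
       (SELECT COUNT(*)
          FROM tblName t
         WHERE t.activate <= m.month_end
           AND t.deactivate >= m.month_start) AS actives
FROM months m
ORDER BY m.month_start
"""


def month_bounds(year):
    """Return (first_day, last_day) ISO date strings for each month of `year`."""
    return [
        (f"{year}-{m:02d}-01", f"{year}-{m:02d}-{calendar.monthrange(year, m)[1]:02d}")
        for m in range(1, 13)
    ]


def active_per_month(rows, year):
    """Return (month_start, actives) for the 12 months of `year`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tblName (activate TEXT, deactivate TEXT, id INTEGER)")
    conn.execute("CREATE TABLE months (month_start TEXT, month_end TEXT)")
    conn.executemany("INSERT INTO tblName VALUES (?, ?, ?)", rows)
    conn.executemany("INSERT INTO months VALUES (?, ?)", month_bounds(year))
    result = conn.execute(ACTIVE_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for month_start, actives in active_per_month(ROWS, 2015):
        print(month_start, actives)
```

Months with no active rows still appear with a count of 0, because the outer query is driven by the months table rather than by the data.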
EDIT:
A potentially faster way is to calculate a cumulative sum of activations and deactivations and then take the maximum per month:
select year(d), month(d), max(cumes)
from (select d, (@s := @s + inc) as cumes
from (select activate as d, 1 as inc from tblName union all
select deactivate, -1 as inc from tblName
) t cross join
(select @s := 0) param
order by d
) s
group by year(d), month(d);

Need SQL query to find users with activity in two different periods for each day

I'm trying to do some analysis on usage of our web based app.
I have a table with the following columns:
email address
activity date
I want to create a query that answers this question:
For each day in the past 180 days, how many people who did an activity between 60 and 30 days prior ALSO did an activity between 30 and 0 days prior.
I already have this working as a stored procedure where I literally loop over the past 180 days (using a date table with 1 row per day), but this is kinda slow as I'm doing 180 queries.
I also tried my hand at doing it with one query with the IN clause but it took about 5 minutes to complete (the table only has about 2,000 rows total so I'm guessing it was HIGHLY un-optimized)
How would I do this with one query (or even a stored proc) that's optimized?
Here is the current stored proc (which works but is slow) if it helps:
BEGIN
DECLARE mydate DATE;
DECLARE period1 INT;
DECLARE period2 INT;
DECLARE done INT;
DECLARE cur CURSOR FOR SELECT date_value from dim_date order by date_value DESC;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
SET done = 0;
OPEN cur;
REPEAT
FETCH cur INTO mydate;
IF NOT done THEN
REPLACE INTO churn (payment_received,period2,period1,churn_name)
select
mydate,
count(distinct(case when (sales.payment_received BETWEEN DATE_SUB(mydate,INTERVAL p2 month) AND DATE_SUB(mydate,INTERVAL p1 month)) then email end)) AS period2,
(
select count(distinct(case when (sales.payment_received BETWEEN DATE_SUB(mydate,INTERVAL p1 month) AND mydate) then email end))
from sales where subscription = 1 AND email in (select email from sales where sales.payment_received BETWEEN DATE_SUB(mydate,INTERVAL p2 month) AND DATE_SUB(mydate,INTERVAL p1 month) )
)
AS period1,
churn_name as cname
from sales
where subscription = 1;
END IF;
UNTIL done END REPEAT;
CLOSE cur;
END;;
Thanks!
Step 1) Get users with activity in the last month (DISTINCT because we don't care how many times they were active in the last month, just whether they were active at all):
SELECT DISTINCT email
FROM sales
WHERE payment_received BETWEEN DATE_SUB(NOW(), INTERVAL 1 MONTH) AND NOW()
Step 2) Get users with activity 1-2 months ago:
SELECT DISTINCT email
FROM sales
WHERE payment_received BETWEEN DATE_SUB(NOW(), INTERVAL 2 MONTH) AND DATE_SUB(NOW(), INTERVAL 1 MONTH)
Step 3) Join these into one result set:
SELECT M1.email
FROM (
SELECT DISTINCT email
FROM sales
WHERE payment_received BETWEEN DATE_SUB(NOW(), INTERVAL 1 MONTH) AND NOW()
) M1,
(
SELECT DISTINCT email
FROM sales
WHERE payment_received BETWEEN DATE_SUB(NOW(), INTERVAL 2 MONTH) AND DATE_SUB(NOW(), INTERVAL 1 MONTH)
) M2
WHERE M1.email = M2.email
I'm going to go ahead and assume that dim_date is a calendar table (very handy things). It might also be nice to know what indices (if any) you have, but at 2,000 rows a decent RDBMS will likely load the entire table into memory regardless, so that probably isn't a factor.
Unfortunately, any way you look at it, this type of analysis is going to take time. I'm fairly certain translating this into a completely set-based approach will speed things up, but I don't have an instance to really test against.
I'd start by re-writing the statement like so:
SELECT Dim_Date.date_value,
COUNT(DISTINCT Period_2.email), COUNT(DISTINCT Period_1.email),
Period_1.churn_name
FROM Dim_Date
JOIN Sales Period_2
ON Period_2.payment_received >= DATE_SUB(Dim_Date.date_value, INTERVAL 60 DAY)
AND Period_2.payment_received < DATE_SUB(Dim_Date.date_value, INTERVAL 30 DAY)
AND Period_2.subscription = 1
LEFT JOIN Sales Period_1
ON Period_1.payment_received >= DATE_SUB(Dim_Date.date_value, INTERVAL 30 DAY)
AND Period_1.payment_received < Dim_Date.date_value
AND Period_1.subscription = 1
AND Period_1.email = Period_2.email
AND Period_1.churn_name = Period_2.churn_name
WHERE Dim_Date.date_value >= DATE_SUB(CURRENT_DATE, INTERVAL 180 DAY)
AND Dim_Date.date_value < CURRENT_DATE
GROUP BY Dim_Date.date_value, Period_1.churn_name
This statement should run, but is otherwise untested.
(...I'm not sure what I was thinking here originally, I wasn't correlating the two sets per-user...)
One thing - you don't seem to have subscription = 1 as a condition on the inner-most subquery; I didn't know if that was deliberate, or an oversight. I've also assumed that churn_name should be correlated, whatever that is.
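The set-based idea (join each date to the 60-to-30-day window, then left join the same user in the 30-to-0-day window) can be checked on a tiny data set with the sketch below. It runs on SQLite via Python's built-in sqlite3 module rather than MySQL, leaves out the subscription and churn_name columns for brevity, and uses invented sample rows and a hypothetical `retention_by_day` helper.

```python
import sqlite3

# Hypothetical sample data: (email, payment_received)
SALES = [
    ("a@example.com", "2020-01-10"),
    ("a@example.com", "2020-02-05"),
    ("b@example.com", "2020-01-15"),
    ("c@example.com", "2020-02-10"),
]

RETENTION_SQL = """
SELECT d.date_value,
       COUNT(DISTINCT p2.email) AS period2_users,   -- active 60 to 30 days before
       COUNT(DISTINCT p1.email) AS retained_users   -- of those, also active in the last 30 days
FROM dim_date d
LEFT JOIN sales p2
       ON p2.payment_received >= date(d.date_value, '-60 days')
      AND p2.payment_received <  date(d.date_value, '-30 days')
LEFT JOIN sales p1
       ON p1.email = p2.email
      AND p1.payment_received >= date(d.date_value, '-30 days')
      AND p1.payment_received <  d.date_value
GROUP BY d.date_value
ORDER BY d.date_value
"""


def retention_by_day(sales, days):
    """Return (date, period2_users, retained_users) for each date in `days`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (email TEXT, payment_received TEXT)")
    conn.execute("CREATE TABLE dim_date (date_value TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
    conn.executemany("INSERT INTO dim_date VALUES (?)", [(d,) for d in days])
    rows = conn.execute(RETENTION_SQL).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in retention_by_day(SALES, ["2020-02-20", "2020-03-15"]):
        print(row)
```

Both joins here are LEFT JOINs, so a date with nobody in the earlier window still produces a row with zero counts; that is a choice made for the sketch, not something the statement above does.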