Optimizing cohort analysis on Google BigQuery - mysql

I'm attempting to perform a cohort analysis on a very large table. I have a test table with ~30M rows (more than double that in production). The query fails in BigQuery with a "resources exceeded" error, and it's a tier 18 query (tier 1 is $5, so it's a $90 query!).
The query:
with cohort_active_user_count as (
select
DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
count(distinct `BQ_TABLE`.bot_user_id) as count,
`BQ_TABLE`.bot_id as bot_id
from `BQ_TABLE`
group by created_at, bot_id
)
select created_at, period as period,
active_users, retained_users, retention, bot_id
from (
select
DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(`BQ_TABLE`.created_at, '-05:00'), DAY) as period,
max(cohort_size.count) as active_users, -- all equal in group
count(distinct future_message.bot_user_id) as retained_users,
count(distinct future_message.bot_user_id) / max(cohort_size.count) as retention,
`BQ_TABLE`.bot_id as bot_id
from `BQ_TABLE`
left join `BQ_TABLE` as future_message on
`BQ_TABLE`.bot_user_id = future_message.bot_user_id
and `BQ_TABLE`.created_at < future_message.created_at
and TIMESTAMP_ADD(`BQ_TABLE`.created_at, interval 720 HOUR) >= future_message.created_at
and `BQ_TABLE`.bot_id = future_message.bot_id
left join cohort_active_user_count as cohort_size on
DATE(`BQ_TABLE`.created_at, '-05:00') = cohort_size.created_at
and `BQ_TABLE`.bot_id = cohort_size.bot_id
group by 1, 2, bot_id) t
where period is not null
and bot_id = 80
order by created_at, period, bot_id
From my understanding of BigQuery, the joins are causing a major performance hit because each BigQuery node needs to process them. The table is partitioned by day, which I'm not yet taking advantage of in this query, but I know the query will still need to be optimized beyond that.
How can this query be optimized or exclude the use of joins to allow BigQuery to process more efficiently in parallel?

Step #1
Try the query below.
I moved the JOIN on cohort_active_user_count outside the inner SELECT, as I think that join is one of the main reasons the query is so expensive. And as you can see, it uses JOIN instead of LEFT JOIN for that one, since a LEFT JOIN is not needed here.
Please test it and let us know the result.
WITH cohort_active_user_count AS (
SELECT
DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
COUNT(DISTINCT BQ_TABLE.bot_user_id) AS COUNT,
BQ_TABLE.bot_id AS bot_id
FROM BQ_TABLE
GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
cohort_size.count AS active_users, retained_users,
retained_users / cohort_size.count AS retention, t.bot_id
FROM (
SELECT
DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(BQ_TABLE.created_at, '-05:00'), DAY) AS period,
COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
BQ_TABLE.bot_id AS bot_id
FROM BQ_TABLE
LEFT JOIN BQ_TABLE AS future_message
ON BQ_TABLE.bot_user_id = future_message.bot_user_id
AND BQ_TABLE.created_at < future_message.created_at
AND TIMESTAMP_ADD(BQ_TABLE.created_at, interval 720 HOUR) >= future_message.created_at
AND BQ_TABLE.bot_id = future_message.bot_id
GROUP BY 1, 2, bot_id
HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size
ON t.created_at = cohort_size.created_at
AND t.bot_id = cohort_size.bot_id
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id
Step #2
The "further optimization" below is based on the assumption that your BQ_TABLE is raw data with multiple entries for the same bot_user_id/bot_id on the same day, which makes the LEFT JOIN in the inner SELECT much more expensive.
I propose aggregating that first, as is done below. In addition to drastically reducing the size of the JOIN, it also eliminates all of those TIMESTAMP-to-DATE conversions on every joined row.
WITH BQ_TABLE_AGG AS (
SELECT bot_id, bot_user_id, DATE(BQ_TABLE.created_at, '-05:00') AS created_at
FROM BQ_TABLE
GROUP BY 1, 2, 3
),
cohort_active_user_count AS (
SELECT
created_at,
COUNT(DISTINCT bot_user_id) AS COUNT,
bot_id AS bot_id
FROM BQ_TABLE_AGG
GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
cohort_size.count AS active_users, retained_users,
retained_users / cohort_size.count AS retention, t.bot_id
FROM (
SELECT
BQ_TABLE_AGG.created_at AS created_at,
DATE_DIFF(future_message.created_at, BQ_TABLE_AGG.created_at, DAY) AS period,
COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
BQ_TABLE_AGG.bot_id AS bot_id
FROM BQ_TABLE_AGG
LEFT JOIN BQ_TABLE_AGG AS future_message
ON BQ_TABLE_AGG.bot_user_id = future_message.bot_user_id
AND BQ_TABLE_AGG.created_at < future_message.created_at
AND DATE_ADD(BQ_TABLE_AGG.created_at, INTERVAL 30 DAY) >= future_message.created_at
AND BQ_TABLE_AGG.bot_id = future_message.bot_id
GROUP BY 1, 2, bot_id
HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size
ON t.created_at = cohort_size.created_at
AND t.bot_id = cohort_size.bot_id
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id

If you don't want to enable a higher billing tier given the costs, here are a couple of suggestions that might help to reduce the CPU requirements:
Use INNER JOINs rather than LEFT JOINs if you can. INNER JOINs should generally be less CPU-intensive, but then again you won't get unmatched rows like you would with LEFT JOINs.
Use APPROX_COUNT_DISTINCT(expr) instead of COUNT(DISTINCT expr). You won't get an exact count, but it's less CPU-intensive and may be "good enough" depending on your needs.
You could also consider manually breaking the query into stages of computation, e.g. write the WITH clause statement to a table, then use that in the subsequent query. I don't know what the specific cost tradeoffs would be, though.
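For illustration, here is a rough sketch of the last two suggestions applied to the cohort CTE. The destination name my_dataset.cohort_counts is just a placeholder, and the DDL form assumes current BigQuery standard SQL (you could equally set a destination table on the job instead):
-- Approximate distinct counts: cheaper, but not exact
SELECT
  DATE(created_at, '-05:00') AS created_at,
  APPROX_COUNT_DISTINCT(bot_user_id) AS count,
  bot_id
FROM `BQ_TABLE`
GROUP BY created_at, bot_id;
-- Materialize the WITH clause into its own table, then run the join against that
-- much smaller table in a second query
CREATE OR REPLACE TABLE my_dataset.cohort_counts AS
SELECT
  DATE(created_at, '-05:00') AS created_at,
  COUNT(DISTINCT bot_user_id) AS count,
  bot_id
FROM `BQ_TABLE`
GROUP BY created_at, bot_id;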

Why is it tagged MySQL?
In MySQL, I would change
max(cohort_size.count) as active_users, -- all equal in group
to
( SELECT max(count) FROM cohort_active_user_count WHERE ... ) as active_users,
and remove the JOIN to that table. Without doing this, you risk inflating the COUNT(...) values!
Also move the division to get retention into the outside query.
Once you have done that, you can also move the other JOIN into a subquery:
( SELECT count(distinct future_message.bot_user_id)
FROM ... WHERE ... ) as retained_users,
I would have these indexes. Note that created_at needs to be last.
cohort_active_user_count: INDEX(bot_id, created_at)
future_message: INDEX(bot_id, bot_user_id, created_at)
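For reference, a sketch of that DDL. Note that cohort_active_user_count is a CTE in the original query, so it would have to be materialized as a real table before it can be indexed, and the future_message index really belongs on the base table it aliases (index names are arbitrary):
ALTER TABLE cohort_active_user_count ADD INDEX idx_bot_created (bot_id, created_at);
ALTER TABLE BQ_TABLE ADD INDEX idx_bot_user_created (bot_id, bot_user_id, created_at);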

Related

order by with union in SQL is not working

Is it possible to order data that comes from many SELECTs UNIONed together, such as the statement below?
In this statement, the vouchers data is not showing in the same sequence in which I saved it to the database. I also tried "ORDER BY v_payments.payment_id ASC", but that did not work either.
( SELECT order_id as id, order_date as date, ... , time FROM orders WHERE client_code = '$searchId' AND order_status = 1 AND order_date BETWEEN '$start_date' AND '$end_date' ORDER BY time)
UNION
( SELECT vouchers.voucher_id as id, vouchers.payment_date as date, v_payments.account_name as name, ac_balance as oldBalance, v_payments.debit as debitAmount, v_payments.description as descriptions,
vouchers.v_no as v_no, vouchers.v_type as v_type, v_payments.credit as creditAmount, time, zero as tax, zero as freightAmount FROM vouchers INNER JOIN v_payments
ON vouchers.voucher_id = v_payments.voucher_id WHERE v_payments.client_code = '$searchId' AND voucher_status = 1 AND vouchers.payment_date BETWEEN '$start_date' AND '$end_date' ORDER BY v_payments.payment_id ASC , time )
UNION
( SELECT return_id as id, return_date as date, ... , time FROM w_return WHERE client_code = '$searchId' AND w_return_status = 1 AND return_date BETWEEN '$start_date' AND '$end_date' ORDER BY time)
Wrap the sub-select queries in the union within a SELECT
SELECT id, name
FROM
(
SELECT id, name FROM fruits
UNION
SELECT id, name FROM vegetables
)
foods
ORDER BY name
If you want the order to only apply to one of the sub-selects, use parentheses as you are doing.
Note that depending on your DB, the syntax may differ here. And if that's the case, you may get better help by specifying what DB server (MySQL, SQL Server, etc.) you are using and any error messages that result.
You need to put the ORDER BY at the end of the statement, i.e. you are ordering the final result set after UNIONing the 3 intermediate result sets.
To use an ORDER BY or LIMIT clause to sort or limit the entire UNION result, parenthesize the individual SELECT statements and place the ORDER BY or LIMIT after the last one. See link below:
ORDER BY and LIMIT in Unions
(SELECT a FROM t1 WHERE a=10 AND B=1)
UNION
(SELECT a FROM t2 WHERE a=11 AND B=2)
ORDER BY a LIMIT 10;

Speed up Mysql Query using a split function

I am trying to speed up a MySQL query.
In a column called "MISC", I first have to extract a "traceID" value that will be used to match a row in another table.
Example of the MISC column:
PFFCC_Strip/fkk49322/PMethod=Diners/CardType=Diners/9999******9999/2010/TraceId=7122910
I am extracting the value "7122910" as traceID and finding the corresponding row with a LEFT JOIN. Since the traceId value is unique, only one matching row should be present in each table.
I cannot set indexes on the tables to speed up the process. Is there any approach that could make this query run faster? As it is, it takes a few seconds to run, which is not acceptable.
select *
from
(select TraceID,PP,UDef2, Payment_Method, Approved, TransactionID, Amount
from pr) pr
left join
(select
PAYMENT_ID as Payment_ID_omega, TRANSACTION_TYPE,
REQUESTED_AMOUNT, AMOUNT, `STATUS` as StatusRef_omega,
REQUEST_DATE, Agent,
if (locate('TraceId=',MISC)>0, SUBSTRING_INDEX(MISC,'TraceId=',-1),'') as traceID
from BankingActivity ) omega
on pr.TraceID = omega.traceID
having
(REQUEST_DATE BETWEEN DATE_ADD(DATE(NOW()), INTERVAL -1 DAY) AND NOW())
ORDER BY pr.TraceID DESC
You can place your filter inside the subquery, before the join; that should make a difference. You should also have indexes on pr(TraceID) and BankingActivity(REQUEST_DATE, traceID); a sketch of those follows the query below. For a more optimised query, please post the execution plan.
select * from(select TraceID
,PP
,UDef2
,Payment_Method
,Approved
,TransactionID
,Amount
from pr) pr
left join (select PAYMENT_ID as Payment_ID_omega
,TRANSACTION_TYPE
,REQUESTED_AMOUNT
,AMOUNT
,`STATUS` as StatusRef_omega
,REQUEST_DATE
,Agent
,if (locate('TraceId=', MISC) > 0, SUBSTRING_INDEX(MISC,'TraceId=',-1),'') as traceID
from BankingActivity
WHERE REQUEST_DATE BETWEEN DATE_ADD(DATE(NOW()), INTERVAL -1 DAY) AND NOW()) omega
on pr.TraceID = omega.traceID
ORDER BY pr.TraceID DESC
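If indexes do become an option, here is a sketch of what the suggestion above could look like. Since traceID is derived from MISC in the query, BankingActivity would need it as a real or generated column before it can be indexed (MySQL 5.7+ supports stored generated columns); the column size and index names are assumptions:
ALTER TABLE pr ADD INDEX idx_traceid (TraceID);
ALTER TABLE BankingActivity
  ADD COLUMN traceID VARCHAR(32)
    AS (IF(LOCATE('TraceId=', MISC) > 0, SUBSTRING_INDEX(MISC, 'TraceId=', -1), '')) STORED,
  ADD INDEX idx_reqdate_traceid (REQUEST_DATE, traceID);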

mysql max and min subquery using date range

I have the following query:
SELECT
(Date + INTERVAL -(WEEKDAY(Date)) DAY) `Date`,
I would like to use a subquery here to get the oldest and newest inventory from the max and min Date:
(select sellable from clabDevelopment.fba_history_daily where Date =
max(Date))
max(Date), min(Date),
ASIN,
ItemSKU,
it.avgInv,
kt.Account, kt.Country, SUM(Sessions) `Sessions`, avg(Session_Pct)`Session_Pct`,
sum(Page_Views)`Page_Views`, avg(Page_Views_Pct)`Page_Views_Pct`, avg(Buy_Box_Pct)`Buy_Box_Pct`,
sum(Units_Ordered)`Units_Ordered`, sum(Units_Ordered_B2B) `Units_Ordered_B2B`,
avg(Unit_Session_Pct)`Unit_Session_Pct`, avg(Unit_Session_Pct_B2B)`Unit_Session_Pct_B2B`,
sum(Ordered_Product_Sales)`Ordered_Product_Sales`, sum(Total_Order_Items) `Total_Order_Items`, sum(Actual_Sales) `Actual_Sales`,
sum(Orders) `Orders`, sum(PPC_Revenue) `PPC_Revenue`, sum(PPC_Orders) `PPC_Orders`,
sum(Revenue)`Revenue`, sum(Sales_Tax_Collected) `Sales_Tax_Collected`, sum(Total_Ad_Spend) `Total_Ad_Spend`, sum(Impressions) `Impressions`,
sum(Profit_after_Fees_before_Costs) `Profit_after_Fees_before_Cost`
FROM clabDevelopment.KPI_kpireport as kt
left outer join
(SELECT Month(Date) as mnth, sku, account, country, avg(sellable)`avgInv` FROM clabDevelopment.`fba_history_daily`
where sellable >= 0
group by Month(Date), sku, account, country) as it
on kt.ItemSKU = it.SKU
and kt.Account = it.account
and kt.Country = it.country
and it.mnth = Month(kt.Date)
WHERE kt.Country = 'USA' or kt.Country = 'CAN'
GROUP BY Account, Country,(Date + INTERVAL -(WEEKDAY(Date)) DAY), ItemSKU
ORDER BY Date desc
The sub-query would be from the same table I am joining at the bottom, except that there I group by month. So I want to run this subquery and grab the value under sellable for the date of max(Date):
(select sellable from clabDevelopment.fba_history_daily where Date = max(Date))
When I do it this way I get an "invalid use of group function" error.
Without knowing your schema and the engine/db it is difficult to understand the problem. But here is a best guess with the following schema:
fba_history_daily
- Date
- sku
- account
- country
- sellable
KPI_kpireport
- Date
- Account
- Country
- ItemSKU
- ASIN
The following query would give you what you're looking for. It uses GROUP_CONCAT to build the required results through aggregation. With the nested-query join, MySQL might be building an in-memory temporary table to sort through those records, which would not be optimal. You can check this using EXPLAIN; you would see "Using temporary" in the details.
SELECT
(Date + INTERVAL -(WEEKDAY(Date)) DAY) `Date`,
ASIN,
ItemSKU,
-- MIN
SUBSTRING_INDEX(GROUP_CONCAT(it.sellable ORDER BY it.Date ASC), ',', 1) AS minSellable,
-- MAX
SUBSTRING_INDEX(GROUP_CONCAT(it.sellable ORDER BY it.Date DESC), ',', 1) AS maxSellable,
-- AVG
AVG(it.sellable) avgInv,
kt.Account, kt.Country, SUM(Sessions) `Sessions`, avg(Session_Pct)`Session_Pct`,
sum(Page_Views)`Page_Views`, avg(Page_Views_Pct)`Page_Views_Pct`, avg(Buy_Box_Pct)`Buy_Box_Pct`,
sum(Units_Ordered)`Units_Ordered`, sum(Units_Ordered_B2B) `Units_Ordered_B2B`,
avg(Unit_Session_Pct)`Unit_Session_Pct`, avg(Unit_Session_Pct_B2B)`Unit_Session_Pct_B2B`,
sum(Ordered_Product_Sales)`Ordered_Product_Sales`, sum(Total_Order_Items) `Total_Order_Items`, sum(Actual_Sales) `Actual_Sales`,
sum(Orders) `Orders`, sum(PPC_Revenue) `PPC_Revenue`, sum(PPC_Orders) `PPC_Orders`,
sum(Revenue)`Revenue`, sum(Sales_Tax_Collected) `Sales_Tax_Collected`, sum(Total_Ad_Spend) `Total_Ad_Spend`, sum(Impressions) `Impressions`,
sum(Profit_after_Fees_before_Costs) `Profit_after_Fees_before_Cost`
FROM KPI_kpireport as kt
left outer join fba_history_daily it on
kt.ItemSKU = it.SKU
and kt.Account = it.account
and kt.Country = it.country
and Month(it.Date) = Month(kt.Date)
and it.sellable >= 0
WHERE kt.Country = 'USA' or kt.Country = 'CAN'
GROUP BY Account, Country,(Date + INTERVAL -(WEEKDAY(Date)) DAY), ItemSKU
ORDER BY Date desc
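To make the GROUP_CONCAT/SUBSTRING_INDEX trick more concrete, here is a minimal, self-contained illustration on a hypothetical table demo(grp, dt, val): ordering inside GROUP_CONCAT and then taking the first comma-separated element emulates "the value at the earliest/latest date" per group.
SELECT
  grp,
  SUBSTRING_INDEX(GROUP_CONCAT(val ORDER BY dt ASC), ',', 1)  AS first_val, -- value at MIN(dt)
  SUBSTRING_INDEX(GROUP_CONCAT(val ORDER BY dt DESC), ',', 1) AS last_val   -- value at MAX(dt)
FROM demo
GROUP BY grp;
Keep in mind that GROUP_CONCAT output is truncated at group_concat_max_len (1024 bytes by default), so very large groups may need that variable raised.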

sql calculate change and percent by year

I have a data set that simulates the rate of return for a trading account. There is an entry for each day showing the balance and the open equity. I want to calculate the yearly, quarterly, or monthly change and percent gain or loss. I have this working for daily data, but for some reason I can't seem to get it to work for yearly data.
The code for daily data follows:
SELECT b.`Date`, b.Open_Equity, delta,
concat(round(delta_p*100,4),'%') as delta_p
FROM (SELECT *,
(Open_Equity - @pequity) as delta,
(Open_Equity - @pequity)/@pequity as delta_p,
(@pequity:= Open_Equity)
FROM tim_account_history p
CROSS JOIN
(SELECT @pequity:= NULL
FROM tim_account_history
ORDER by `Date` LIMIT 1) as a
ORDER BY `Date`) as b
ORDER by `Date` ASC
Grouping by YEAR(Date) doesn't seem to make the desired difference. I have tried everything I can think of, but it still seems to return the daily rate of change even if I group by month or year, etc. I think I'm not using windowing correctly, but I can't seem to figure it out. If anyone knows of a good book about this sort of query, I'd appreciate that also.
Thanks. (sqlfiddle example)
Using what Lolo contributed, I added some code so the data comes from the last day of the year instead of the first. I also just need the Open_Equity, not the sum.
I'm still not certain I understand why this works, but it does give me what I was looking for. Using another SELECT statement as a FROM (a derived table) seems to be the key here; I don't think I would have come up with this without Lolo's help. Thank you.
SELECT b.`yyyy`, b.Open_Equity,
concat('$',round(delta, 2)) as delta,
concat(round(delta_p*100,4),'%') as delta_p
FROM (SELECT *,
(Open_Equity - @pequity) as delta,
(Open_Equity - @pequity)/@pequity as delta_p,
(@pequity:= Open_Equity)
FROM (SELECT (EXTRACT(YEAR FROM `Date`)) as `yyyy`,
(SUBSTRING_INDEX(GROUP_CONCAT(CAST(`Open_Equity` AS CHAR) ORDER BY `Date` DESC), ',', 1 )) AS `Open_Equity`
FROM tim_account_history GROUP BY `yyyy` ORDER BY `yyyy` DESC) p
CROSS JOIN
(SELECT @pequity:= NULL) as a
ORDER BY `yyyy` ) as b
ORDER by `yyyy` ASC
Try this:
SELECT b.`Date`, b.Open_Equity, delta,
concat(round(delta_p*100,4),'%') as delta_p
FROM (SELECT *,
(Open_Equity - @pequity) as delta,
(Open_Equity - @pequity)/@pequity as delta_p,
(@pequity:= Open_Equity)
FROM (SELECT YEAR(`Date`) `Date`, SUM(Open_Equity) Open_Equity FROM tim_account_history GROUP BY YEAR(`Date`)) p
CROSS JOIN
(SELECT @pequity:= NULL) as a
ORDER BY `Date` ) as b
ORDER by `Date` ASC

Checking for maximum length of consecutive days which satisfy specific condition

I have a MySQL table with the structure:
beverages_log(id, users_id, beverages_id, timestamp)
I'm trying to compute the maximum streak of consecutive days during which a user (with id 1) logs a beverage (with id 1) at least 5 times each day. I'm pretty sure that this can be done using views as follows:
CREATE or REPLACE VIEW daycounts AS
SELECT count(*) AS n, DATE(timestamp) AS d FROM beverages_log
WHERE users_id = '1' AND beverages_id = 1 GROUP BY d;
CREATE or REPLACE VIEW t AS SELECT * FROM daycounts WHERE n >= 5;
SELECT MAX(streak) AS current FROM ( SELECT DATEDIFF(MIN(c.d), a.d)+1 AS streak
FROM t AS a LEFT JOIN t AS b ON a.d = ADDDATE(b.d,1)
LEFT JOIN t AS c ON a.d <= c.d
LEFT JOIN t AS d ON c.d = ADDDATE(d.d,-1)
WHERE b.d IS NULL AND c.d IS NOT NULL AND d.d IS NULL GROUP BY a.d) allstreaks;
However, repeatedly creating views for different users every time I run this check seems pretty inefficient. Is there a way in MySQL to perform this computation in a single query, without creating views or repeatedly calling the same subqueries a bunch of times?
This solution seems to perform quite well as long as there is a composite index on users_id and beverages_id -
SELECT *
FROM (
SELECT t.*, IF(@prev + INTERVAL 1 DAY = t.d, @c := @c + 1, @c := 1) AS streak, @prev := t.d
FROM (
SELECT DATE(timestamp) AS d, COUNT(*) AS n
FROM beverages_log
WHERE users_id = 1
AND beverages_id = 1
GROUP BY DATE(timestamp)
HAVING COUNT(*) >= 5
) AS t
INNER JOIN (SELECT @prev := NULL, @c := 1) AS vars
) AS t
ORDER BY streak DESC LIMIT 1;
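For reference, a sketch of the composite index mentioned above (the index name is arbitrary):
ALTER TABLE beverages_log ADD INDEX idx_user_beverage (users_id, beverages_id);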
Why not include users_id in the daycounts view and group by users_id and date?
Also include users_id in view t.
Then, when you are querying against t, add the users_id to the WHERE clause.
That way you don't have to recreate your views for every single user; you just need to remember to include it in your WHERE clause.
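A sketch of what those parameter-free views could look like, reusing the definitions from the question:
CREATE OR REPLACE VIEW daycounts AS
SELECT users_id, beverages_id, DATE(timestamp) AS d, COUNT(*) AS n
FROM beverages_log
GROUP BY users_id, beverages_id, d;
CREATE OR REPLACE VIEW t AS
SELECT * FROM daycounts WHERE n >= 5;
-- then filter at query time instead of re-creating the views, e.g.
-- ... FROM t AS a ... WHERE a.users_id = 1 AND a.beverages_id = 1 ...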
That's a little tricky. I'd start with a view to summarize events by day:
CREATE VIEW BView AS
SELECT users_id AS UserID, beverages_id AS BevID, CAST(timestamp AS DATE) AS EventDate, COUNT(*) AS NumEvents
FROM beverages_log
GROUP BY users_id, beverages_id, CAST(timestamp AS DATE)
I'd then use a Dates table (just a table with one row per day; very handy to have) to examine all possible date ranges and throw out any with a gap. This will probably be slow as hell, but it's a start:
SELECT
UserID, BevID, MAX(StreakLength) AS StreakLength
FROM
(
SELECT
B1.UserID, B1.BevID, B1.EventDate AS StreakStart, DATEDIFF(EndDate.Date, StartDate.Date) AS StreakLength
FROM
BView AS B1
INNER JOIN Dates AS StartDate ON B1.EventDate = StartDate.Date
INNER JOIN Dates AS EndDate ON EndDate.Date > StartDate.Date
WHERE
B1.NumEvents >= 5
-- Exclude this potential streak if there's a day with no activity
AND NOT EXISTS (SELECT * FROM Dates AS MissedDay WHERE MissedDay.Date > StartDate.Date AND MissedDay.Date <= EndDate.Date AND NOT EXISTS (SELECT * FROM BView AS B2 WHERE B1.UserID = B2.UserID AND B1.BevID = B2.BevID AND MissedDay.Date = B2.EventDate))
-- Exclude this potential streak if there's a day with less than five events
AND NOT EXISTS (SELECT * FROM BView AS B2 WHERE B1.UserID = B2.UserID AND B1.BevID = B2.BevID AND B2.EventDate > StartDate.Date AND B2.EventDate <= EndDate.Date AND B2.NumEvents < 5)
) AS X
GROUP BY
UserID, BevID
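If you don't already have a Dates calendar table, here is a minimal sketch for populating one in MySQL. The date range and the use of information_schema.columns as a row generator are assumptions; any table with at least as many rows as the number of days you need would work:
CREATE TABLE Dates (`Date` DATE NOT NULL PRIMARY KEY);
INSERT INTO Dates (`Date`)
SELECT '2015-01-01' + INTERVAL seq DAY
FROM (
  SELECT @row := @row + 1 AS seq
  FROM information_schema.columns, (SELECT @row := -1) AS init
  LIMIT 3650  -- roughly ten years of days
) AS numbers;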