MySQL max and min subquery using date range

I have the following query:
SELECT
(Date + INTERVAL -(WEEKDAY(Date)) DAY) `Date`,
I would like to use a subquery here to get the oldest and newest inventory from the max and min Date:
(select sellable from clabDevelopment.fba_history_daily where Date =
max(Date))
max(Date), min(Date),
ASIN,
ItemSKU,
it.avgInv,
kt.Account, kt.Country, SUM(Sessions) `Sessions`, avg(Session_Pct)`Session_Pct`,
sum(Page_Views)`Page_Views`, avg(Page_Views_Pct)`Page_Views_Pct`, avg(Buy_Box_Pct)`Buy_Box_Pct`,
sum(Units_Ordered)`Units_Ordered`, sum(Units_Ordered_B2B) `Units_Ordered_B2B`,
avg(Unit_Session_Pct)`Unit_Session_Pct`, avg(Unit_Session_Pct_B2B)`Unit_Session_Pct_B2B`,
sum(Ordered_Product_Sales)`Ordered_Product_Sales`, sum(Total_Order_Items) `Total_Order_Items`, sum(Actual_Sales) `Actual_Sales`,
sum(Orders) `Orders`, sum(PPC_Revenue) `PPC_Revenue`, sum(PPC_Orders) `PPC_Orders`,
sum(Revenue)`Revenue`, sum(Sales_Tax_Collected) `Sales_Tax_Collected`, sum(Total_Ad_Spend) `Total_Ad_Spend`, sum(Impressions) `Impressions`,
sum(Profit_after_Fees_before_Costs) `Profit_after_Fees_before_Cost`
FROM clabDevelopment.KPI_kpireport as kt
left outer join
(SELECT Month(Date) as mnth, sku, account, country, avg(sellable)`avgInv` FROM clabDevelopment.`fba_history_daily`
where sellable >= 0
group by Month(Date), sku, account, country) as it
on kt.ItemSKU = it.SKU
and kt.Account = it.account
and kt.Country = it.country
and it.mnth = Month(kt.Date)
WHERE kt.Country = 'USA' or kt.Country = 'CAN'
GROUP BY Account, Country,(Date + INTERVAL -(WEEKDAY(Date)) DAY), ItemSKU
ORDER BY Date desc
The subquery would be against the same table I am joining at the bottom, except there I group by month. So I want to run this subquery and grab the value of sellable at max(Date):
(select sellable from clabDevelopment.`fba_history_daily` where Date = max(Date))
When I do it this way I get "Invalid use of group function".

Without knowing your schema and the engine/db version it is difficult to pin down the problem, but here is a best guess, assuming the following schema:
fba_history_daily
- Date
- sku
- account
- country
- sellable
KPI_kpireport
- ItemSKU
- Account
- Country
- Date
- ASIN
The following query would give you what you're looking for. It uses GROUP_CONCAT to build the required results through aggregation. With the nested-query join, MySQL might build a temporary table in memory to sort through those records, which would not be optimal; you can check this with EXPLAIN, where you would see Using temporary in the details.
SELECT
(Date + INTERVAL -(WEEKDAY(Date)) DAY) `Date`,
ASIN,
ItemSKU,
-- MIN
SUBSTRING_INDEX(GROUP_CONCAT(it.sellable ORDER BY it.Date ASC), ',', 1) AS minSellable,
-- MAX
SUBSTRING_INDEX(GROUP_CONCAT(it.sellable ORDER BY it.Date DESC), ',', 1) AS maxSellable,
-- AVG
AVG(it.sellable) avgInv,
kt.Account, kt.Country, SUM(Sessions) `Sessions`, avg(Session_Pct)`Session_Pct`,
sum(Page_Views)`Page_Views`, avg(Page_Views_Pct)`Page_Views_Pct`, avg(Buy_Box_Pct)`Buy_Box_Pct`,
sum(Units_Ordered)`Units_Ordered`, sum(Units_Ordered_B2B) `Units_Ordered_B2B`,
avg(Unit_Session_Pct)`Unit_Session_Pct`, avg(Unit_Session_Pct_B2B)`Unit_Session_Pct_B2B`,
sum(Ordered_Product_Sales)`Ordered_Product_Sales`, sum(Total_Order_Items) `Total_Order_Items`, sum(Actual_Sales) `Actual_Sales`,
sum(Orders) `Orders`, sum(PPC_Revenue) `PPC_Revenue`, sum(PPC_Orders) `PPC_Orders`,
sum(Revenue)`Revenue`, sum(Sales_Tax_Collected) `Sales_Tax_Collected`, sum(Total_Ad_Spend) `Total_Ad_Spend`, sum(Impressions) `Impressions`,
sum(Profit_after_Fees_before_Costs) `Profit_after_Fees_before_Cost`
FROM KPI_kpireport as kt
left outer join fba_history_daily it on
kt.ItemSKU = it.SKU
and kt.Account = it.account
and kt.Country = it.country
and Month(it.Date) = Month(kt.Date)
and it.sellable >= 0
WHERE kt.Country = 'USA' or kt.Country = 'CAN'
GROUP BY Account, Country,(Date + INTERVAL -(WEEKDAY(Date)) DAY), ItemSKU
ORDER BY Date desc
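One thing worth knowing about the GROUP_CONCAT approach: MySQL caps the concatenated string at group_concat_max_len (1024 bytes by default). Since SUBSTRING_INDEX only reads the first element here, the cap is harmless for this exact query, but it becomes a silent bug if you ever extend the pattern to pick later elements from large groups. If in doubt, raise the limit for the session first:
-- default is 1024 bytes; raise it so long concatenations are not truncated
SET SESSION group_concat_max_len = 1000000;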

Related

Advanced Select Query - FIFO

I have a table with columns id, item, qty, and expiration date. I have to select items up to a given total qty, prioritized by expiration date: the item that expires first must be selected first. The query works fine when items have different expiration dates, but when items share the same expiration date it does not return any rows. Please check out the query below.
SELECT t.ID,
t.itemcode,
t.qty,
t.expdate,
t.total,
t.prev_total,
CASE WHEN t.total > 500 THEN 500 - t.prev_total ELSE t.qty END AS total
FROM
(
SELECT t1.ID,
t1.itemcode,
t1.qty,
t1.expdate,
(SELECT SUM(t2.qty) FROM put_in t2
WHERE t2.expdate <= t1.expdate AND t2.itemcode = 'ITEM01') AS total,
COALESCE((SELECT SUM(t2.qty) FROM put_in t2
WHERE t2.expdate < t1.expdate AND t2.itemcode = 'ITEM01'), 0) AS prev_total
FROM put_in t1
WHERE t1.itemcode = 'ITEM01'
) t
WHERE t.total - t.qty < 500 AND
t.itemcode = 'ITEM01'
ORDER BY t.expdate;
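A likely culprit: with t2.expdate <= t1.expdate, rows sharing an expiration date are all counted into each other's running totals, so total jumps past the cutoff for every tied row at once. A common fix is to make the ordering strict by breaking ties on the primary key. Here is a sketch of the inner query (untested against your data, and assuming ID is unique and ascending in insertion order):
SELECT t1.ID,
       t1.itemcode,
       t1.qty,
       t1.expdate,
       -- running total up to and including this row, ordered by (expdate, ID)
       (SELECT SUM(t2.qty)
        FROM put_in t2
        WHERE t2.itemcode = 'ITEM01'
          AND (t2.expdate < t1.expdate
               OR (t2.expdate = t1.expdate AND t2.ID <= t1.ID))) AS total,
       -- running total of all strictly earlier rows
       COALESCE((SELECT SUM(t2.qty)
                 FROM put_in t2
                 WHERE t2.itemcode = 'ITEM01'
                   AND (t2.expdate < t1.expdate
                        OR (t2.expdate = t1.expdate AND t2.ID < t1.ID))), 0) AS prev_total
FROM put_in t1
WHERE t1.itemcode = 'ITEM01'
ORDER BY t1.expdate, t1.ID;
The outer query stays the same; each row now occupies a distinct position in the FIFO order, so ties no longer collapse.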

Optimizing cohort analysis on Google BigQuery

I'm attempting to perform a cohort analysis on a very large table. I have a test table with ~30M rows (over double in production). The query fails in BigQuery stating "resources exceeded.." and it's a tier 18 query (tier 1 is $5, so it's a $90 query!)
The query:
with cohort_active_user_count as (
select
DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
count(distinct `BQ_TABLE`.bot_user_id) as count,
`BQ_TABLE`.bot_id as bot_id
from `BQ_TABLE`
group by created_at, bot_id
)
select created_at, period as period,
active_users, retained_users, retention, bot_id
from (
select
DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(`BQ_TABLE`.created_at, '-05:00'), DAY) as period,
max(cohort_size.count) as active_users, -- all equal in group
count(distinct future_message.bot_user_id) as retained_users,
count(distinct future_message.bot_user_id) / max(cohort_size.count) as retention,
`BQ_TABLE`.bot_id as bot_id
from `BQ_TABLE`
left join `BQ_TABLE` as future_message on
`BQ_TABLE`.bot_user_id = future_message.bot_user_id
and `BQ_TABLE`.created_at < future_message.created_at
and TIMESTAMP_ADD(`BQ_TABLE`.created_at, interval 720 HOUR) >= future_message.created_at
and `BQ_TABLE`.bot_id = future_message.bot_id
left join cohort_active_user_count as cohort_size on
DATE(`BQ_TABLE`.created_at, '-05:00') = cohort_size.created_at
and `BQ_TABLE`.bot_id = cohort_size.bot_id
group by 1, 2, bot_id) t
where period is not null
and bot_id = 80
order by created_at, period, bot_id
Here is the desired output: [screenshot omitted]
From my understanding of BigQuery, the joins are causing a major performance hit because each BigQuery node needs to process them. The table is partitioned by day, which I'm not yet making use of in this query, but I know it will still need to be optimized.
How can this query be optimized or exclude the use of joins to allow BigQuery to process more efficiently in parallel?
Step #1
Try the query below.
I moved the JOIN on cohort_active_user_count outside the inner SELECT, as I think it is one of the main reasons the query is expensive. And as you can see, it uses JOIN instead of LEFT JOIN for this one, as LEFT is not needed here.
Please test and let us know the result.
WITH cohort_active_user_count AS (
SELECT
DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
COUNT(DISTINCT BQ_TABLE.bot_user_id) AS COUNT,
BQ_TABLE.bot_id AS bot_id
FROM BQ_TABLE
GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
cohort_size.count AS active_users, retained_users,
retained_users / cohort_size.count AS retention, t.bot_id
FROM (
SELECT
DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(BQ_TABLE.created_at, '-05:00'), DAY) AS period,
COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
BQ_TABLE.bot_id AS bot_id
FROM BQ_TABLE
LEFT JOIN BQ_TABLE AS future_message
ON BQ_TABLE.bot_user_id = future_message.bot_user_id
AND BQ_TABLE.created_at < future_message.created_at
AND TIMESTAMP_ADD(BQ_TABLE.created_at, interval 720 HOUR) >= future_message.created_at
AND BQ_TABLE.bot_id = future_message.bot_id
GROUP BY 1, 2, bot_id
HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size
ON t.created_at = cohort_size.created_at
AND t.bot_id = cohort_size.bot_id
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id
Step #2
The "further optimization" below is based on the assumption that your BQ_TABLE is raw data with multiple entries for the same bot_user_id/bot_id on the same day, which greatly increases the cost of the LEFT JOIN in the inner SELECT.
I propose aggregating this first, as done below. In addition to drastically reducing the size of the JOIN, it also eliminates all those conversions from TIMESTAMP to DATE on each joined row.
WITH BQ_TABLE_AGG AS (
SELECT bot_id, bot_user_id, DATE(BQ_TABLE.created_at, '-05:00') AS created_at
FROM BQ_TABLE
GROUP BY 1, 2, 3
),
cohort_active_user_count AS (
SELECT
created_at,
COUNT(DISTINCT bot_user_id) AS COUNT,
bot_id AS bot_id
FROM BQ_TABLE_AGG
GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
cohort_size.count AS active_users, retained_users,
retained_users / cohort_size.count AS retention, t.bot_id
FROM (
SELECT
BQ_TABLE_AGG.created_at AS created_at,
DATE_DIFF(future_message.created_at, BQ_TABLE_AGG.created_at, DAY) AS period,
COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
BQ_TABLE_AGG.bot_id AS bot_id
FROM BQ_TABLE_AGG
LEFT JOIN BQ_TABLE_AGG AS future_message
ON BQ_TABLE_AGG.bot_user_id = future_message.bot_user_id
AND BQ_TABLE_AGG.created_at < future_message.created_at
AND DATE_ADD(BQ_TABLE_AGG.created_at, INTERVAL 30 DAY) >= future_message.created_at
AND BQ_TABLE_AGG.bot_id = future_message.bot_id
GROUP BY 1, 2, bot_id
HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size
ON t.created_at = cohort_size.created_at
AND t.bot_id = cohort_size.bot_id
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id
If you don't want to enable a higher billing tier given the costs, here are a couple of suggestions that might help to reduce the CPU requirements:
Use INNER JOINs rather than LEFT JOINs if you can. INNER JOINs should generally be less CPU-intensive, but then again you won't get unmatched rows like you would with LEFT JOINs.
Use APPROX_COUNT_DISTINCT(expr) instead of COUNT(DISTINCT expr). You won't get an exact count, but it's less CPU-intensive and may be "good enough" depending on your needs.
You could also consider manually breaking the query into stages of computation, e.g. write the WITH clause statement to a table, then use that in the subsequent query. I don't know what the specific cost tradeoffs would be, though.
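As a concrete illustration of the second suggestion, here is how the cohort CTE could be rewritten (a sketch; APPROX_COUNT_DISTINCT is based on HyperLogLog++ and returns an estimate rather than an exact count):
WITH cohort_active_user_count AS (
  SELECT
    DATE(created_at, '-05:00') AS created_at,
    -- estimated distinct count: much cheaper than COUNT(DISTINCT ...) at this scale
    APPROX_COUNT_DISTINCT(bot_user_id) AS count,
    bot_id
  FROM BQ_TABLE
  GROUP BY created_at, bot_id
)
SELECT * FROM cohort_active_user_count  -- the rest of the query is unchanged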
Why is it tagged MySQL?
In MySQL, I would change
max(cohort_size.count) as active_users, -- all equal in group
to
( SELECT max(count) FROM cohort_active_user_count WHERE ... ) as active_users,
and remove the JOIN to that table. Without doing this, you risk inflating the COUNT(...) values!
Also move the division to get retention into the outside query.
Once you have done that, you can also move the other JOIN into a subquery:
( SELECT count(distinct future_message.bot_user_id)
FROM ... WHERE ... ) as retained_users,
I would have these indexes. Note that created_at needs to be last.
cohort_active_user_count: INDEX(bot_id, created_at)
future_message: INDEX(bot_id, bot_user_id, created_at)

COUNT() the number of MAX() occurrences in MySQL

I have a donation database, and in one of the reports I run against it I would like to include the number of donations that equal the month's maximum donation. For example, the month's highest donation may be $100, but there may be 5 people who each donated $100; I would like to get that count.
My current query is:
SELECT SUM(mc_gross) AS Donations,
SUM(mc_fee) AS Fees,
COUNT(payment_date) AS DonationCount,
COUNT(DISTINCT payer_email) AS DonatorCount,
MAX(mc_gross) AS MaxDonation,
@MaxD := MAX(mc_gross),
(
SELECT COUNT(*)
FROM #__paypal_donations
WHERE MONTH(payment_date) = MONTH(CURDATE())
AND YEAR(payment_date) = YEAR(CURDATE())
AND mc_gross = @MaxD
) as MaxDonationMultiplier,
AVG(mc_gross) AS AverageDonation
FROM #__paypal_donations
WHERE MONTH(payment_date) = MONTH(CURDATE())
AND YEAR(payment_date) = YEAR(CURDATE())
So I think I may be close, but it looks like either the value I am storing in @MaxD for use in my subquery is not working, or the comparison mc_gross = @MaxD itself is not working, because if I replace @MaxD with a literal value I get a proper count.
You cannot depend on the order of assignment of expressions in MySQL. That makes a query such as yours quite dangerous. Fortunately, you can easily solve this problem with a correlated subquery:
SELECT SUM(mc_gross) AS Donations, SUM(mc_fee) AS Fees, COUNT(payment_date) AS DonationCount,
COUNT(DISTINCT payer_email) AS DonatorCount, MAX(mc_gross) AS MaxDonation,
(SELECT COUNT(*)
FROM #__paypal_donations pd2
WHERE MONTH(pd2.payment_date) = MONTH(pd.payment_date) AND
      YEAR(pd2.payment_date) = YEAR(pd.payment_date) AND
pd2.mc_gross = MAX(mc_gross)
) as MaxDonationMultiplier,
AVG(mc_gross) AS AverageDonation
FROM #__paypal_donations pd
WHERE MONTH(payment_date) = MONTH(CURDATE()) AND
YEAR(payment_date) = YEAR(CURDATE());
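For what it's worth, on MySQL 8.0+ you can avoid the correlated subquery entirely with a window function. A minimal sketch against the same table (assuming MySQL 8.0 or later):
SELECT SUM(mc_gross) AS Donations,
       SUM(mc_fee) AS Fees,
       COUNT(payment_date) AS DonationCount,
       COUNT(DISTINCT payer_email) AS DonatorCount,
       MAX(mc_gross) AS MaxDonation,
       -- mc_gross = max_gross evaluates to 1 or 0, so the SUM counts the ties
       SUM(mc_gross = max_gross) AS MaxDonationMultiplier,
       AVG(mc_gross) AS AverageDonation
FROM (SELECT pd.*,
             MAX(mc_gross) OVER () AS max_gross  -- the month's maximum, computed once
      FROM #__paypal_donations pd
      WHERE MONTH(payment_date) = MONTH(CURDATE())
        AND YEAR(payment_date) = YEAR(CURDATE())) t;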

Get last data for contracts

I want to select the latest information about clients' balances from a MySQL database. I wrote the following script:
SELECT *
FROM
(SELECT
contract_balance.cid,
/*contract_balance.yy,
contract_balance.mm,*/
contract_balance.expenses,
contract_balance.revenues,
contract_balance.expenses + contract_balance.revenues AS total,
(CAST(CAST(CONCAT(contract_balance.yy,'-',contract_balance.mm,'-01')AS CHAR) AS DATE)) AS dt
FROM contract_balance
/*WHERE
CAST(CAST(CONCAT(contract_balance.yy,'-',contract_balance.mm,'-01')AS CHAR) AS DATE) < '2013-11-01'
LIMIT 100*/
) AS tmp
WHERE tmp.dt = (
SELECT MAX(b.dt)
FROM tmp AS b
WHERE tmp.cid = b.cid
)
But the server returns:
Table 'clientsdatabase.tmp' doesn't exist
How can I change this code to get the required data?
Try this one. In your subquery you are trying to take the MAX of (CAST(CAST(CONCAT(contract_balance.yy,'-',contract_balance.mm,'-01') AS CHAR) AS DATE)) AS dt, but the aliased table tmp does not exist inside the subquery. The simplest fix is to compute MAX(dt) directly and GROUP BY contract_balance.cid (the contract id); I guess it will fulfill your needs:
SELECT
contract_balance.cid,
contract_balance.expenses,
contract_balance.revenues,
contract_balance.expenses + contract_balance.revenues AS total,
MAX((CAST(CAST(CONCAT(contract_balance.yy,'-',contract_balance.mm,'-01')AS CHAR) AS DATE))) AS dt
FROM contract_balance
GROUP BY contract_balance.cid
Try this:
SELECT *
FROM (SELECT cb.cid, cb.expenses, cb.revenues, cb.expenses + cb.revenues AS total,
(CAST(CAST(CONCAT(cb.yy,'-',cb.mm,'-01')AS CHAR) AS DATE)) AS dt
FROM contract_balance cb ORDER BY dt DESC
) AS A
GROUP BY A.cid
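Note that both answers select non-aggregated columns alongside GROUP BY cid, which relies on MySQL's lenient pre-ONLY_FULL_GROUP_BY behavior and may return expenses/revenues from an arbitrary row. A more portable sketch of this greatest-row-per-group pattern, under the same assumed schema, joins the table back to the per-cid maximum date:
SELECT cb.cid,
       cb.expenses,
       cb.revenues,
       cb.expenses + cb.revenues AS total,
       CAST(CONCAT(cb.yy, '-', cb.mm, '-01') AS DATE) AS dt
FROM contract_balance cb
JOIN (SELECT cid,
             -- latest balance month per contract
             MAX(CAST(CONCAT(yy, '-', mm, '-01') AS DATE)) AS max_dt
      FROM contract_balance
      GROUP BY cid) m
  ON m.cid = cb.cid
 AND CAST(CONCAT(cb.yy, '-', cb.mm, '-01') AS DATE) = m.max_dt;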

Optimize SQL Server Query

I have to write a query that gets the total cost of the previous month and compares it to the current month to calculate the percentage difference.
This is the script:
create table #calc
(
InvoiceDate Date,
TotalCost decimal (12,2)
)
insert into #calc values ('2013-07-01', 9470.36)
insert into #calc values ('2013-08-01', 11393.81)
and this is the query:
select InvoiceDate,
TotalCost,
PrevTotalCost,
(CASE WHEN (PrevTotalCost = 0)
THEN 0
ELSE (((TotalCost - PrevTotalCost) / PrevTotalCost) * 100.0)
END) AS PercentageDifference
from (
select a.InvoiceDate, a.TotalCost,
isnull((select b.TotalCost
from #calc b
where InvoiceDate = (select MAX(InvoiceDate)
from #calc c
where c.InvoiceDate < a.InvoiceDate)), 0) as PrevTotalCost
from #calc a) subq
Is there a more efficient way of getting the previous month?
Using a ranking function, which shifts the burden onto sorts rather than table scans, seems the fastest when no indexes are used. The query below processed 6575 records in under a second:
SELECT
Main.InvoiceDate,
Main.TotalCost,
PreviousTotalCost=Previous.TotalCost,
PercentageDifference=
CASE WHEN COALESCE(Previous.TotalCost,0) = 0 THEN 0
ELSE (((Main.TotalCost - Previous.TotalCost) / Previous.TotalCost) * 100.00)
END
FROM
(
SELECT
InvoiceDate,
TotalCost,
OrderInGroup=ROW_NUMBER() OVER (ORDER BY InvoiceDate DESC)
FROM
Test
)AS Main
LEFT OUTER JOIN
(
SELECT
InvoiceDate,
TotalCost,
OrderInGroup=ROW_NUMBER() OVER (ORDER BY InvoiceDate DESC)
FROM
Test
)AS Previous ON Previous.OrderInGroup=Main.OrderInGroup+1
Using nested loops, as happens when the previous invoice cost is fetched in a SELECT subquery, proves the slowest: 6575 rows in 30 seconds.
SELECT
X.InvoiceDate,
X.TotalCost,
X.PreviousTotalCost,
PercentageDifference=
CASE WHEN COALESCE(X.PreviousTotalCost,0) = 0 THEN 0
ELSE (((X.TotalCost - X.PreviousTotalCost) / X.PreviousTotalCost) * 100.00)
END
FROM
(
SELECT
InvoiceDate,
TotalCost,
PreviousTotalCost=(SELECT TotalCost FROM Test WHERE InvoiceDate=(SELECT MAX(InvoiceDate) FROM Test WHERE InvoiceDate<Main.InvoiceDate))
FROM
Test AS Main
)AS X
Your query processed 6575 records in 20 seconds, with the biggest cost coming from the nested loops for the inner join:
select InvoiceDate,
TotalCost,
PrevTotalCost,
(CASE WHEN (PrevTotalCost = 0)
THEN 0
ELSE (((TotalCost - PrevTotalCost) / PrevTotalCost) * 100.0)
END) AS PercentageDifference
from (
select a.InvoiceDate, a.TotalCost,
isnull((select b.TotalCost
from Test b
where InvoiceDate = (select MAX(InvoiceDate)
from Test c
where c.InvoiceDate < a.InvoiceDate)), 0) as PrevTotalCost
from Test a) subq
Using indexes would be a big plus unless you are required to use temp tables.
Hope this helps :)
SELECT
`current`.`InvoiceDate`,
`current`.`TotalCost`,
`prev`.`TotalCost` AS `PrevTotalCost`,
(`current`.`TotalCost` - `prev`.`TotalCost`) AS `CostDifference`
FROM dates `current`
LEFT JOIN
dates `prev`
ON `prev`.`InvoiceDate` <= DATE_FORMAT(`current`.`InvoiceDate` - INTERVAL 1 MONTH, '%Y-%m-01');
Screenshot of the results I got: http://cl.ly/image/0b3z2x1f2H1n
I think this might be what you're looking for.
Edit: I wrote this query in MySQL, so it's possible you may need to alter a couple of minor syntax things for your server.
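As a side note for the SQL Server side of this question: on SQL Server 2012 and later, LAG() retrieves the previous month's row without any self-join, sidestepping the nested-loop cost entirely. A minimal sketch against the #calc table from the question:
SELECT InvoiceDate,
       TotalCost,
       PrevTotalCost,
       CASE WHEN PrevTotalCost = 0 THEN 0
            ELSE ((TotalCost - PrevTotalCost) / PrevTotalCost) * 100.0
       END AS PercentageDifference
FROM (SELECT InvoiceDate,
             TotalCost,
             -- previous row in date order; defaults to 0 when there is no previous row
             LAG(TotalCost, 1, 0) OVER (ORDER BY InvoiceDate) AS PrevTotalCost
      FROM #calc) AS subq
ORDER BY InvoiceDate DESC;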