Deleting Outliers and Computing the Average Using MySQL - mysql

I am computing the average lead time after outliers are removed. I have both the z-scores and the standard deviation to do the calculation.
I have used this sql query:
SELECT *
FROM (
SELECT
ROUND(AVG(DATEDIFF(shipped_date, order_date)),2) AS actual_ave_lead_time,
DATEDIFF(shipped_date, order_date) - AVG(DATEDIFF(shipped_date, order_date))/
STDDEV(DATEDIFF(shipped_date, order_date)) AS zscore
FROM orders
) AS score_table
WHERE zscore BETWEEN zscore<1.96 AND >-.96;
I am expecting to get the overall average of the actual_ave_lead_time.

To calculate your z-scores you need to run the calculation z = (x - μ) / σ per order. The innermost sub-query calculates the AVG() and STDDEV() for the full set of orders; that is then joined to the orders to calculate each order's z-score, which can then be used to exclude the outliers from the outermost AVG(). Note that the symmetric 95% cutoff is ±1.96; the -.96 in your WHERE clause looks like a typo, so -1.96 is used below -
SELECT AVG(lead_time)
FROM (
SELECT
DATEDIFF(shipped_date, order_date) AS lead_time,
(DATEDIFF(shipped_date, order_date) - avg_lead_time) / stddev_lead_time AS zscore
FROM orders o
CROSS JOIN (
SELECT
AVG(DATEDIFF(shipped_date, order_date)) AS avg_lead_time,
STDDEV(DATEDIFF(shipped_date, order_date)) AS stddev_lead_time
FROM orders
) s
HAVING zscore BETWEEN -1.96 AND 1.96
) t;
If you are using MySQL 8 you could use the aggregate functions as window functions -
SELECT AVG(lead_time)
FROM (
SELECT
DATEDIFF(shipped_date, order_date) AS lead_time,
(DATEDIFF(shipped_date, order_date) - AVG(DATEDIFF(shipped_date, order_date)) OVER()) / STDDEV(DATEDIFF(shipped_date, order_date)) OVER() AS zscore
FROM orders
) t
WHERE zscore BETWEEN -1.96 AND 1.96;
Without example data, this is untested but I think it is correct.
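If you want to sanity-check it yourself, here is a minimal, hypothetical test table with only the two columns the queries use. The 47-day order gets a z-score of about 2.4, so it should be excluded, leaving an average of 3.5 -
CREATE TABLE orders (
order_id INT AUTO_INCREMENT PRIMARY KEY,
order_date DATE NOT NULL,
shipped_date DATE NOT NULL
);
-- six typical 3-4 day orders plus one extreme 47-day order
INSERT INTO orders (order_date, shipped_date) VALUES
('2023-01-01', '2023-01-04'), ('2023-01-02', '2023-01-05'),
('2023-01-03', '2023-01-06'), ('2023-01-04', '2023-01-08'),
('2023-01-05', '2023-01-09'), ('2023-01-06', '2023-01-10'),
('2023-01-07', '2023-02-23');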

Related

Display SUM and LAST_VALUE grouped by Year

In my last forum question I asked how to display values at the max date; now I want to sum some values grouped by year, but unfortunately a single value doesn't follow the max date.
Here is my table:
And my query:
SELECT
SUM(pembelian) AS Buying,
SUM(penjualan) AS Selling,
SUM(penjualan)-SUM(pembelian) AS Benefit,
saldo,
MAX(tgl_lap)
FROM laporan GROUP BY DATE_FORMAT(tgl_lap,'%Y')
The results:
As you can see, the result works for most values, but a single value (saldo) doesn't follow the max date: the 2020 row should be '23581800' and the 2021 row should be '35639800' according to the table.
So what I have missed?
The following query solves the problem:
SELECT Buying, Selling, Benefit, saldo, last_tgl_lap
FROM laporan
JOIN (
SELECT
SUM(pembelian) AS Buying,
SUM(penjualan) AS Selling,
SUM(penjualan)-SUM(pembelian) AS Benefit,
MAX(tgl_lap) last_tgl_lap
FROM laporan
GROUP BY YEAR(tgl_lap)
) aggregated ON aggregated.last_tgl_lap = tgl_lap;
See the example at SQLize.online
If your MySQL version is 8.0 or greater, you can use window functions like:
SELECT
Pembelian AS Buying,
Penjualan AS Selling,
Penjualan - Pembelian AS Benefit,
Saldo,
LastDate
FROM (
SELECT
SUM(pembelian) OVER (PARTITION BY YEAR(tgl_lap) ORDER BY tgl_lap ASC) AS Pembelian,
SUM(penjualan) OVER (PARTITION BY YEAR(tgl_lap) ORDER BY tgl_lap ASC) AS Penjualan,
LAST_VALUE(saldo) OVER (PARTITION BY YEAR(tgl_lap) ORDER BY tgl_lap ASC) AS Saldo,
LAST_VALUE(tgl_lap) OVER (PARTITION BY YEAR(tgl_lap) ORDER BY tgl_lap ASC) AS LastDate,
ROW_NUMBER() OVER (PARTITION BY YEAR(tgl_lap) ORDER BY tgl_lap DESC) AS row_num
FROM laporan
) tbl
WHERE row_num = 1;
Fiddle on SQLize.online
Because the MySQL mode ONLY_FULL_GROUP_BY is disabled, your query does not throw an error even though you have used the non-aggregated column saldo in the SELECT clause.
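If you want to check whether that mode is enabled on your server, or turn it on for the current session, a quick check looks like this (both are standard MySQL system variables):
SELECT @@sql_mode; -- look for ONLY_FULL_GROUP_BY in the list
SET SESSION sql_mode = CONCAT(@@sql_mode, ',ONLY_FULL_GROUP_BY'); -- enable it for this session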
Update after Clarification from OP
Another alternative, if you can use it, is the window function first_value for saldo:
select sum(pembelian) as Buying,
sum(penjualan) as Selling,
sum(penjualan)-sum(pembelian) as Benefit,
max(saldo) as Saldo,
max(tgl_lap) as tgl_lap
from
( select id_lap,pembelian,penjualan,tgl_lap,
first_value(saldo) over
(partition by date_format(tgl_lap,'%Y') order by tgl_lap desc) as saldo
from laporan
) l
group by date_format(tgl_lap,'%Y')
Your query is malformed. You have saldo in the SELECT, but it is not in the GROUP BY. You should be getting an error, and MySQL finally conforms to the SQL standard and to other databases by generating one.
In MySQL 8.0, I would recommend conditional aggregation:
SELECT SUM(pembelian) AS Buying, SUM(penjualan) AS Selling,
SUM(penjualan)-SUM(pembelian) AS Benefit,
MAX(CASE WHEN seqnum = 1 THEN saldo END) as saldo,
MAX(tgl_lap)
FROM (SELECT l.*,
ROW_NUMBER() OVER (PARTITION BY YEAR(tgl_lap) ORDER BY tgl_lap DESC) as seqnum
FROM laporan l
) l
GROUP BY YEAR(tgl_lap);
Note that I replaced DATE_FORMAT() with YEAR(). It just seems clearer to me to use the appropriate date function when it is available.
In older versions, there is a hack to get the latest saldo value in each year:
SELECT SUM(pembelian) AS Buying, SUM(penjualan) AS Selling,
SUM(penjualan)-SUM(pembelian) AS Benefit,
SUBSTRING_INDEX(GROUP_CONCAT(saldo ORDER BY tgl_lap DESC), ',', 1) + 0 as saldo,
MAX(tgl_lap)
FROM laporan l
GROUP BY YEAR(tgl_lap);
This concatenates the saldo values into a string and then takes the first element. The only caveat is that GROUP_CONCAT() is limited to 1024 bytes by default.
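If that limit is a concern, it can be raised per session via the group_concat_max_len system variable (the effective ceiling is max_allowed_packet):
SET SESSION group_concat_max_len = 1000000; -- default is 1024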

How to add three numbers from three different tables in one row (MySQL, phpMyAdmin)

I'm helping a friend with a query; it needs to show the percentage that each of these tables contributes to the month's transactions: purchases, expenses, and sales.
So I need to show this:
Sales Percentage | Purchases Percentage | Expenses Percentage
10%              | 50%                  | 40%
One of the problems is that these three tables don't have any relationship (no foreign keys).
To make this possible, obviously I must get the total of each one and add them together to form the base for the percentages.
So, how can I add everything in one SELECT clause?
Greetings.
Well, after much thinking, I finally got it.
This is the final answer; I hope it can be useful:
select
sum(total) as total_transacciones,
(select count(id_venta) from venta where fecha__venta between '2019-01-01' and '2019-04-02') as total_ventas,
(select round((count(id_venta) / sum(total) * 100),2) from venta where fecha__venta between '2019-01-01' and '2019-04-02') as porcentaje_ventas,
(select count(id_compra) from compra where fecha_compra between '2019-01-01' and '2019-04-02') as total_compras,
(select round((count(id_compra) / sum(total) * 100),2) from compra where fecha_compra between '2019-01-01' and '2019-04-02') as porcentaje_compras,
(select count(id_gastos) from gastos where fecha_cr between '2019-01-01' and '2019-04-02') as total_gastos,
(select round((count(id_gastos) / sum(total) * 100),2) from gastos where fecha_cr between '2019-01-01' and '2019-04-02') as porcentaje_gastos
from
(
select count(id_venta) as total from venta
UNION ALL
select count(id_compra) as total from compra
UNION ALL
select count(id_gastos) as total from gastos
) t
A capture of the result:
https://i.stack.imgur.com/XdkxP.png
Greetings.
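For what it's worth, the same figures can be computed more directly with scalar subqueries. This is an untested sketch using the table and column names from the query above; note that the original denominator (the UNION ALL of counts) is not date-filtered while the numerators are, so this version filters both consistently -
SELECT
total_ventas + total_compras + total_gastos AS total_transacciones,
total_ventas,
ROUND(total_ventas / (total_ventas + total_compras + total_gastos) * 100, 2) AS porcentaje_ventas,
total_compras,
ROUND(total_compras / (total_ventas + total_compras + total_gastos) * 100, 2) AS porcentaje_compras,
total_gastos,
ROUND(total_gastos / (total_ventas + total_compras + total_gastos) * 100, 2) AS porcentaje_gastos
FROM (
SELECT
(SELECT COUNT(*) FROM venta WHERE fecha__venta BETWEEN '2019-01-01' AND '2019-04-02') AS total_ventas,
(SELECT COUNT(*) FROM compra WHERE fecha_compra BETWEEN '2019-01-01' AND '2019-04-02') AS total_compras,
(SELECT COUNT(*) FROM gastos WHERE fecha_cr BETWEEN '2019-01-01' AND '2019-04-02') AS total_gastos
) t;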

Optimizing cohort analysis on Google BigQuery

I'm attempting to perform a cohort analysis on a very large table. I have a test table with ~30M rows (more than double that in production). The query fails in BigQuery stating "resources exceeded", and it's a tier 18 query (tier 1 is $5, so it's a $90 query!).
The query:
with cohort_active_user_count as (
select
DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
count(distinct `BQ_TABLE`.bot_user_id) as count,
`BQ_TABLE`.bot_id as bot_id
from `BQ_TABLE`
group by created_at, bot_id
)
select created_at, period as period,
active_users, retained_users, retention, bot_id
from (
select
DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(`BQ_TABLE`.created_at, '-05:00'), DAY) as period,
max(cohort_size.count) as active_users, -- all equal in group
count(distinct future_message.bot_user_id) as retained_users,
count(distinct future_message.bot_user_id) / max(cohort_size.count) as retention,
`BQ_TABLE`.bot_id as bot_id
from `BQ_TABLE`
left join `BQ_TABLE` as future_message on
`BQ_TABLE`.bot_user_id = future_message.bot_user_id
and `BQ_TABLE`.created_at < future_message.created_at
and TIMESTAMP_ADD(`BQ_TABLE`.created_at, interval 720 HOUR) >= future_message.created_at
and `BQ_TABLE`.bot_id = future_message.bot_id
left join cohort_active_user_count as cohort_size on
DATE(`BQ_TABLE`.created_at, '-05:00') = cohort_size.created_at
and `BQ_TABLE`.bot_id = cohort_size.bot_id
group by 1, 2, bot_id) t
where period is not null
and bot_id = 80
order by created_at, period, bot_id
Here is the desired output:
From my understanding of BigQuery, the joins are causing a major performance hit because each BigQuery node needs to process them. The table is partitioned by day, which I'm not yet making use of in this query, but I know it will still need to be optimized.
How can this query be optimized or exclude the use of joins to allow BigQuery to process more efficiently in parallel?
Step #1
Try the query below.
I moved the JOIN on cohort_active_user_count outside the inner SELECT, as I think it is one of the main reasons the query is expensive. And as you can see, it uses JOIN instead of LEFT JOIN, since LEFT is not needed here.
Please test and let us know the result.
WITH cohort_active_user_count AS (
SELECT
DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
COUNT(DISTINCT BQ_TABLE.bot_user_id) AS COUNT,
BQ_TABLE.bot_id AS bot_id
FROM BQ_TABLE
GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
cohort_size.count AS active_users, retained_users,
retained_users / cohort_size.count AS retention, t.bot_id
FROM (
SELECT
DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(BQ_TABLE.created_at, '-05:00'), DAY) AS period,
COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
BQ_TABLE.bot_id AS bot_id
FROM BQ_TABLE
LEFT JOIN BQ_TABLE AS future_message
ON BQ_TABLE.bot_user_id = future_message.bot_user_id
AND BQ_TABLE.created_at < future_message.created_at
AND TIMESTAMP_ADD(BQ_TABLE.created_at, interval 720 HOUR) >= future_message.created_at
AND BQ_TABLE.bot_id = future_message.bot_id
GROUP BY 1, 2, bot_id
HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size
ON t.created_at = cohort_size.created_at
AND t.bot_id = cohort_size.bot_id
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id
Step #2
The "further optimization" below is based on the assumption that your BQ_TABLE is raw data with multiple entries for the same bot_user_id/bot_id on the same day, which greatly inflates the cost of the LEFT JOIN in the inner SELECT.
I propose aggregating it first, as done below. In addition to drastically reducing the size of the JOIN, this also eliminates all those TIMESTAMP-to-DATE conversions on each joined row.
WITH BQ_TABLE_AGG AS (
SELECT bot_id, bot_user_id, DATE(BQ_TABLE.created_at, '-05:00') AS created_at
FROM BQ_TABLE
GROUP BY 1, 2, 3
),
cohort_active_user_count AS (
SELECT
created_at,
COUNT(DISTINCT bot_user_id) AS COUNT,
bot_id AS bot_id
FROM BQ_TABLE_AGG
GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
cohort_size.count AS active_users, retained_users,
retained_users / cohort_size.count AS retention, t.bot_id
FROM (
SELECT
BQ_TABLE_AGG.created_at AS created_at,
DATE_DIFF(future_message.created_at, BQ_TABLE_AGG.created_at, DAY) AS period,
COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
BQ_TABLE_AGG.bot_id AS bot_id
FROM BQ_TABLE_AGG
LEFT JOIN BQ_TABLE_AGG AS future_message
ON BQ_TABLE_AGG.bot_user_id = future_message.bot_user_id
AND BQ_TABLE_AGG.created_at < future_message.created_at
AND DATE_ADD(BQ_TABLE_AGG.created_at, INTERVAL 30 DAY) >= future_message.created_at
AND BQ_TABLE_AGG.bot_id = future_message.bot_id
GROUP BY 1, 2, bot_id
HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size
ON t.created_at = cohort_size.created_at
AND t.bot_id = cohort_size.bot_id
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id
If you don't want to enable a higher billing tier given the costs, here are a couple of suggestions that might help to reduce the CPU requirements:
Use INNER JOINs rather than LEFT JOINs if you can. INNER JOINs should generally be less CPU-intensive, but then again you won't get unmatched rows like you would with LEFT JOINs.
Use APPROX_COUNT_DISTINCT(expr) instead of COUNT(DISTINCT expr). You won't get an exact count, but it's less CPU-intensive and may be "good enough" depending on your needs.
You could also consider manually breaking the query into stages of computation, e.g. write the WITH clause statement to a table, then use that in the subsequent query. I don't know what the specific cost tradeoffs would be, though.
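For example, the WITH clause could be materialized once and then referenced by the main query. A hypothetical sketch (my_dataset is a placeholder, and the column is renamed user_count to avoid the COUNT keyword):
CREATE OR REPLACE TABLE my_dataset.cohort_active_user_count AS
SELECT
DATE(created_at, '-05:00') AS created_at,
COUNT(DISTINCT bot_user_id) AS user_count,
bot_id
FROM `BQ_TABLE`
GROUP BY 1, 3;
The main query would then JOIN to my_dataset.cohort_active_user_count instead of recomputing the aggregate on every run.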
Why is it tagged MySQL?
In MySQL, I would change
max(cohort_size.count) as active_users, -- all equal in group
to
( SELECT max(count) FROM cohort_active_user_count WHERE ... ) as active_users,
and remove the JOIN to that table. Without doing this, you risk inflating the COUNT(...) values!
Also move the division to get retention into the outside query.
Once you have done that, you can also move the other JOIN into a subquery:
( SELECT count(distinct future_message.bot_user_id)
FROM ... WHERE ... ) as retained_users,
I would have these indexes. Note that created_at needs to be last.
cohort_active_user_count: INDEX(bot_id, created_at)
future_message: INDEX(bot_id, bot_user_id, created_at)
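As DDL, that might look like the sketch below. The table names are hypothetical, since in the query above cohort_active_user_count is a CTE and future_message is a self-join alias, so in practice both indexes belong on the materialized base tables:
-- hypothetical table names; adapt to the real base tables
ALTER TABLE cohort_active_user_count ADD INDEX idx_bot_created (bot_id, created_at);
ALTER TABLE future_message ADD INDEX idx_bot_user_created (bot_id, bot_user_id, created_at);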

get rows without duplicates

I have a query I wrote a while ago that returns a count. I would now like to see the rows that are being counted but I can't seem to get my query right. Here is the query that returns the count.
select count(distinct f_Shipmentnumber) from t_shipment shipment
join t_Pilot pilot on pilot.f_PilotID=shipment.f_Pilot_ID
where pilot.f_ProviderID='12' and shipment.f_ShipmentType=2
and shipment.f_date > DATEADD(yy, DATEDIFF(yy,0,getdate()), 0)
And here is what I have come up with, but it returns duplicate shipment numbers:
select * from t_shipment shipment
join t_Pilot pilot on pilot.f_PilotID=shipment.f_Pilot_ID
where pilot.f_ProviderID='12' and shipment.f_ShipmentType=2
and shipment.f_date > DATEADD(yy, DATEDIFF(yy,0,getdate()), 0)
Any help would be great. Thanks!!
Based on your answer in comments, I assume when there is more than one pilot associated with a shipment, you don't care which pilot gets returned in the results. In that case, you can solve this with a CTE and the Row_Number() function:
WITH cte AS (
select *
, ROW_NUMBER() OVER (PARTITION BY f_Shipmentnumber ORDER BY f_Shipmentnumber) AS rn
from t_shipment shipment
join t_Pilot pilot on pilot.f_PilotID=shipment.f_Pilot_ID
where pilot.f_ProviderID='12' and shipment.f_ShipmentType=2
and shipment.f_date > DATEADD(yy, DATEDIFF(yy,0,getdate()), 0)
)
SELECT * FROM CTE WHERE rn=1
You have to use GROUP BY, but then you have to group by all the other columns and decide which value you want to return for each remaining column. Something like this:
select f_Shipmentnumber, f_ProviderID, f_ShipmentType, MAX(f_date) from t_shipment shipment
join t_Pilot pilot on pilot.f_PilotID=shipment.f_Pilot_ID
where pilot.f_ProviderID='12' and shipment.f_ShipmentType=2
and shipment.f_date > DATEADD(yy, DATEDIFF(yy,0,getdate()), 0)
group by f_Shipmentnumber,f_ProviderID,f_ShipmentType

sql calculate change and percent by year

I have a data set that simulates the rate of return of a trading account. There is an entry for each day showing the balance and the open equity. I want to calculate the yearly, quarterly, or monthly change and percent gain or loss. I have this working for daily data, but for some reason I can't seem to get it to work for yearly data.
The code for daily data follows:
SELECT b.`Date`, b.Open_Equity, delta,
concat(round(delta_p*100,4),'%') as delta_p
FROM (SELECT *,
(Open_Equity - @pequity) as delta,
(Open_Equity - @pequity)/@pequity as delta_p,
(@pequity:= Open_Equity)
FROM tim_account_history p
CROSS JOIN
(SELECT @pequity:= NULL
FROM tim_account_history
ORDER by `Date` LIMIT 1) as a
ORDER BY `Date`) as b
ORDER by `Date` ASC
Grouping by YEAR(Date) doesn't seem to make the desired difference. I have tried everything I can think of, but it still returns the daily rate of change even when grouping by month or year. I think I'm not using windowing correctly, but I can't seem to figure it out. If anyone knows of a good book about this sort of query, I'd appreciate that also.
Thanks.
sqlfiddle example
Using what Lolo contributed, I added some code so the data comes from the last day of the year instead of the first. I also need just the Open_Equity, not the sum.
I'm still not certain I understand why this works, but it does give me what I was looking for. Using another SELECT statement as a FROM seems to be the key here; I don't think I would have come up with this without Lolo's help. Thank you.
SELECT b.`yyyy`, b.Open_Equity,
concat('$',round(delta, 2)) as delta,
concat(round(delta_p*100,4),'%') as delta_p
FROM (SELECT *,
(Open_Equity - @pequity) as delta,
(Open_Equity - @pequity)/@pequity as delta_p,
(@pequity:= Open_Equity)
FROM (SELECT (EXTRACT(YEAR FROM `Date`)) as `yyyy`,
(SUBSTRING_INDEX(GROUP_CONCAT(CAST(`Open_Equity` AS CHAR) ORDER BY `Date` DESC), ',', 1 )) AS `Open_Equity`
FROM tim_account_history GROUP BY `yyyy` ORDER BY `yyyy` DESC) p
CROSS JOIN
(SELECT @pequity:= NULL) as a
ORDER BY `yyyy` ) as b
ORDER by `yyyy` ASC
Try this:
SELECT b.`Date`, b.Open_Equity, delta,
concat(round(delta_p*100,4),'%') as delta_p
FROM (SELECT *,
(Open_Equity - @pequity) as delta,
(Open_Equity - @pequity)/@pequity as delta_p,
(@pequity:= Open_Equity)
FROM (SELECT YEAR(`Date`) `Date`, SUM(Open_Equity) Open_Equity FROM tim_account_history GROUP BY YEAR(`Date`)) p
CROSS JOIN
(SELECT @pequity:= NULL) as a
ORDER BY `Date` ) as b
ORDER by `Date` ASC
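For what it's worth, on MySQL 8.0+ the same yearly change can be computed without user variables by using the LAG() window function. An untested sketch built from the same columns:
SELECT yyyy, Open_Equity,
concat('$', round(Open_Equity - LAG(Open_Equity) OVER w, 2)) as delta,
concat(round((Open_Equity - LAG(Open_Equity) OVER w)
/ LAG(Open_Equity) OVER w * 100, 4), '%') as delta_p
FROM (
-- last Open_Equity of each year, same trick as above
SELECT YEAR(`Date`) AS yyyy,
SUBSTRING_INDEX(GROUP_CONCAT(Open_Equity ORDER BY `Date` DESC), ',', 1) + 0 AS Open_Equity
FROM tim_account_history
GROUP BY yyyy
) y
WINDOW w AS (ORDER BY yyyy)
ORDER BY yyyy;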