MySQL Ignoring Outliers - mysql

I have to present some data to work colleagues and I am having issues analysing it in MySQL.
I have 1 table called 'payments'. Each payment has columns for:
Client (our client e.g. a bank)
Amount_gbp (the GBP equivalent of the value of the transaction)
Currency
Origin_country
Client_type (individual or company)
I have written pretty simple queries like:
SELECT
AVG(amount_GBP),
COUNT(client) AS '#Of Results'
FROM payments
WHERE client_type = 'individual'
AND amount_gbp IS NOT NULL
AND currency = 'TRY'
AND country_origin = 'GB'
AND date_time BETWEEN '2017/1/1' AND '2017/9/1'
But what I really need to do is eliminate outliers from the average AND/OR only include results within a number of standard deviations from the mean.
For example, ignore the top/bottom 10 results or 2% of results, etc.
AND/OR ignore any results that fall outside of 2 STDEVs from the mean.
Can anyone help?

--- EDITED ANSWER -- TRY AND LET ME KNOW ---
Your best bet is to create a TEMPORARY table with the avg and std_dev values and compare against them. Let me know if that is not feasible:
CREATE TEMPORARY TABLE payment_stats AS
SELECT
AVG(amount_gbp) as avg_gbp,
STDDEV(amount_gbp) as std_gbp,
(SELECT MIN(srt.amount_gbp) as max_gbp
FROM (SELECT amount_gbp
FROM payments
<... repeat where no p. ...>
ORDER BY amount_gbp DESC
LIMIT <top_numbers to ignore>
) srt
) max_g,
(SELECT MAX(srt.amount_gbp) as min_gbp
FROM (SELECT amount_gbp
FROM payments
<... repeat where no p. ...>
ORDER BY amount_gbp ASC
LIMIT <top_numbers to ignore>
) srt
) min_g
FROM payments
WHERE client_type = 'individual'
AND amount_gbp IS NOT NULL
AND currency = 'TRY'
AND country_origin = 'GB'
AND date_time BETWEEN '2017/1/1' AND '2017/9/1';
You can then compare against the temp table:
SELECT
AVG(p.amount_gbp) as avg_gbp,
COUNT(p.client) AS '#Of Results'
-- MySQL cannot reopen a TEMPORARY table more than once in the same query,
-- so join the one-row stats table once instead of using four subqueries
FROM payments p
CROSS JOIN payment_stats ps
WHERE
p.amount_gbp >= (ps.avg_gbp - ps.std_gbp*2)
AND p.amount_gbp <= (ps.avg_gbp + ps.std_gbp*2)
AND p.amount_gbp > ps.min_g
AND p.amount_gbp < ps.max_g
AND p.client_type = 'individual'
AND p.amount_gbp IS NOT NULL
AND p.currency = 'TRY'
AND p.country_origin = 'GB'
AND p.date_time BETWEEN '2017/1/1' AND '2017/9/1';
-- Later on
DROP TEMPORARY TABLE payment_stats;
Notice I had to repeat the WHERE conditions. Also change *2 to whatever <factor> you need!
Phew!
Each comparison checks a different stat.
Let me know if this is better.
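If all you need is the two-standard-deviation filter (without the top/bottom trim), a single statement with a derived stats table may be enough, so no temporary table is required. A minimal sketch reusing the same filters as above:
SELECT
    AVG(p.amount_gbp) AS avg_gbp,
    COUNT(p.client) AS '#Of Results'
FROM payments p
CROSS JOIN (
    -- one-row derived table with the mean and standard deviation of the filtered set
    SELECT AVG(amount_gbp) AS avg_gbp, STDDEV(amount_gbp) AS std_gbp
    FROM payments
    WHERE client_type = 'individual'
      AND amount_gbp IS NOT NULL
      AND currency = 'TRY'
      AND country_origin = 'GB'
      AND date_time BETWEEN '2017/1/1' AND '2017/9/1'
) s
WHERE p.amount_gbp BETWEEN s.avg_gbp - 2 * s.std_gbp AND s.avg_gbp + 2 * s.std_gbp
  AND p.client_type = 'individual'
  AND p.amount_gbp IS NOT NULL
  AND p.currency = 'TRY'
  AND p.country_origin = 'GB'
  AND p.date_time BETWEEN '2017/1/1' AND '2017/9/1';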

Related

Selecting rows until a column value isn't the same

SELECT product.productID
, product.Name
, product.date
, product.status
FROM product
INNER JOIN shelf ON product.sheldID=shelf.shelfID
WHERE product.weekID = $ID
AND product.date < '$day'
OR (product.date = '$day' AND shelf.expire <= '$time' )
ORDER BY concat(product.date,shelf.expire)
I am trying to stop the SQL statement at a specific value, e.g. "bad".
I have tried using max-date, but I am finding it hard as I am building the timestamp in the query (combining date/time).
This example table shows that 3 results should be returned; if the status "bad" were the first result, then no results should be returned. (They are ordered by date and time.)
ProductID Date status
1 2017-03-27 Good
2 2017-03-27 Good
3 2017-03-26 Good
4 2017-03-25 Bad
5 2017-03-25 Good
Think I may have fixed it; I added this to my while loop.
The query gives the results in order from present to past using date and time. This while loop checks if the status column of that row is equal to 'bad'; if it is, it does something (might be able to use an array to fill it up with data). If not, then the loop is broken.
I know it doesn't seem ideal but it works lol
while ($row = mysqli_fetch_assoc($result)) {
    // Count consecutive 'bad' rows; stop at the first row that is not 'bad'
    if ($row['status'] == "bad") {
        $counter += 1;
    } else {
        break;
    }
}
I will provide an answer just with your output as if it were just one table. It will give you the main idea of how to solve your problem.
Basically, I created a column called ord that works as a row_number (MySQL doesn't support it yet, AFAIK). Then I get the minimum ord value for a 'Bad' status and select everything from the data where ord is less than that.
select y.*
from (select ProductID, dt, status, @rw := @rw + 1 ord
      from product, (select @rw := 0) a
      order by dt desc) y
where y.ord < (select min(ord) ord
               from (select ProductID, status, @rin := @rin + 1 ord
                     from product, (select @rin := 0) a
                     order by dt desc) x
               where status = 'Bad');
Result will be:
ProductID dt status ord
-------------------------------------
1 2017-03-27 Good 1
2 2017-03-27 Good 2
3 2017-03-26 Good 3
Also tested with the use case where the Bad status is the first result, no results will be returned.
See it working here: http://sqlfiddle.com/#!9/28dda/1
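On MySQL 8.0 or later, window functions can replace the user-variable row numbering. A minimal sketch of the same idea, assuming the column names above:
-- Requires MySQL 8.0+ for ROW_NUMBER(); stops before the first 'Bad' row
select y.*
from (select ProductID, dt, status,
             row_number() over (order by dt desc) as ord
      from product) y
where y.ord < (select min(x.ord)
               from (select status,
                            row_number() over (order by dt desc) as ord
                     from product) x
               where x.status = 'Bad');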

how to know the source table from a complex sql query

Please find below the query; I want to know the actual table where I can find these columns (pin_ein, engineer_pin, transaction_date, EFFECTIVE_WEEK).
select count(1),pin_ein,engineer_pin,transaction_date,EFFECTIVE_WEEK from
(SELECT vw1.pin_ein, vw1.engineer_pin,
vw1.transaction_date, vw1.effective_week,
vw1.stores_tools_cost,
(vw1.stores_total_cost - vw1.stores_tools_cost
) stores_total_cost_excl_tools,
vw1.item_count, vw1.stores_visit_count,
CASE
   WHEN vw1.stores_total_cost < 5
   THEN vw1.stores_visit_count
END stores_low_cost_visit_count,
vw1.actual_ouc ouc, er.eng_name engineer_name,
CASE
WHEN c.home_parked IS NOT NULL
THEN c.home_parked
ELSE 'N'
END home_parker,
CASE
WHEN c.home_parked IS NOT NULL
THEN c.commute_time
ELSE -99
END commute_time,
vw1.stores_com_cost ---v9.3---
FROM (SELECT pin_ein, engineer_pin, actual_ouc,
             transaction_date, effective_week,
             NVL(SUM(CASE
                        WHEN (cow LIKE '%TOOL%' OR cow LIKE '%TOOLE%')
                        THEN transaction_value
                     END), 0) stores_tools_cost,
             SUM(transaction_value) stores_total_cost,
             SUM(transaction_quantity) item_count,
             COUNT(DISTINCT sta_code) stores_visit_count,
             NVL(SUM(CASE
                        WHEN cow IN (SELECT cow FROM orbit_odw.stores_cow_ref)
                        THEN transaction_value
                     END), 0) stores_com_cost ---v9.3---
FROM orbit_odw.stores_transaction_dtls
WHERE effective_week BETWEEN 201543
AND 201610
/***Ver 6.0---last 13 weeks data to be considered***/
AND transaction_date
BETWEEN to_date('19-10-2015','dd-mm-yyyy')
AND to_date('06-03-2016','dd-mm-yyyy')
/***Ver 6.0---last 13 weeks data to be considered***/
GROUP BY pin_ein,
engineer_pin,
actual_ouc,
transaction_date,
effective_week) vw1,
(SELECT *
FROM orbit_odw.dim_wms_rmdm
WHERE current_status = 1) er,
(SELECT engineer_ein, commute_time,
home_parked
FROM orbit_odw.eng_parking_at_home_dtls
WHERE rec_end_date > SYSDATE
AND home_parked = 'Y') c
WHERE TO_CHAR (vw1.pin_ein) = er.ein
AND vw1.pin_ein = c.engineer_ein(+))
GROUP BY pin_ein, engineer_pin, transaction_date, EFFECTIVE_WEEK
HAVING COUNT(1) > 1;
Please help..
thanks in advance
Basically, if I understand your problem correctly, you need to understand where your outer SELECT is fetching its data from. There is a simple set of rules used to trace this. The steps are as follows:
1. Check the column output, i.e. in your case it's pin_ein, engineer_pin, transaction_date, EFFECTIVE_WEEK.
2. Check the inline view or table from which your outer query is fetching the data. In your case it's the only inline view you have used --> so it's a bit easy to identify :P
3. Now identify how your inline view VW1 is populating its data. In your case the table orbit_odw.stores_transaction_dtls is used to populate the required fields.
Hope this much information is sufficient. Also, for simple queries you can always go to the ALL_TAB_COLUMNS system view to identify a table's columns easily.
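A minimal sketch of that dictionary lookup, using the column names from your query (Oracle stores unquoted identifiers in upper case):
-- Find which tables/views own these columns (Oracle data dictionary)
SELECT owner, table_name, column_name
  FROM all_tab_columns
 WHERE column_name IN ('PIN_EIN', 'ENGINEER_PIN', 'TRANSACTION_DATE', 'EFFECTIVE_WEEK')
 ORDER BY owner, table_name, column_name;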

query optimization for mysql

I have the following query which takes about 28 seconds on my machine. I would like to optimize it and know if there is any way to make it faster by creating some indexes.
select rr1.person_id as person_id, rr1.t1_value, rr2.t0_value
from (select r1.person_id, avg(r1.avg_normalized_value1) as t1_value
from (select ma1.person_id, mn1.store_name, avg(mn1.normalized_value) as avg_normalized_value1
from matrix_report1 ma1, matrix_normalized_notes mn1
where ma1.final_value = 1
and (mn1.normalized_value != 0.2
and mn1.normalized_value != 0.0 )
and ma1.user_id = mn1.user_id
and ma1.request_id = mn1.request_id
and ma1.request_id = 4 group by ma1.person_id, mn1.store_name) r1
group by r1.person_id) rr1
,(select r2.person_id, avg(r2.avg_normalized_value) as t0_value
from (select ma.person_id, mn.store_name, avg(mn.normalized_value) as avg_normalized_value
from matrix_report1 ma, matrix_normalized_notes mn
where ma.final_value = 0 and (mn.normalized_value != 0.2 and mn.normalized_value != 0.0 )
and ma.user_id = mn.user_id
and ma.request_id = mn.request_id
and ma.request_id = 4
group by ma.person_id, mn.store_name) r2
group by r2.person_id) rr2
where rr1.person_id = rr2.person_id
Basically, it aggregates data depending on the request_id and final_value (0 or 1). Is there a way to simplify it for optimization? And it would be nice to know which columns should be indexed. I created an index on user_id and request_id, but it doesn't help much.
There are about 4907424 rows on matrix_report1 and 335740 rows on matrix_normalized_notes table. These tables will grow as we have more requests.
First, the others are right about formatting your samples better. Also, explaining in plain language what you are trying to do is a benefit; sample data and expected results are even better.
However, that said, I think it can be significantly simplified. Your two derived queries are almost completely identical, with the exception of the one field "final_value" = 1 or 0 respectively. Since each query results in one record per "person_id", you can just do the average based on a CASE/WHEN and remove the rest.
To help optimize the query, your matrix_report1 table should have an index on ( request_id, final_value, user_id ). Your matrix_normalized_notes table should have an index on ( request_id, user_id, store_name, normalized_value ).
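A minimal sketch of those indexes (the index names below are just placeholders):
-- Column lists match the suggestion above; pick names that fit your conventions
CREATE INDEX ix_report1_req_final_user
    ON matrix_report1 (request_id, final_value, user_id);
CREATE INDEX ix_notes_req_user_store_value
    ON matrix_normalized_notes (request_id, user_id, store_name, normalized_value);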
Since your outer query is averaging the per-store averages, you do need to keep it nested. The following should help.
SELECT
r1.person_id,
avg(r1.ANV1) as t1_value,
avg(r1.ANV0) as t0_value
from
( select
ma1.person_id,
mn1.store_name,
avg( case when ma1.final_value = 1
then mn1.normalized_value end ) as ANV1,
avg( case when ma1.final_value = 0
then mn1.normalized_value end ) as ANV0
from
matrix_report1 ma1
JOIN matrix_normalized_notes mn1
ON ma1.request_id = mn1.request_id
AND ma1.user_id = mn1.user_id
AND NOT mn1.normalized_value in ( 0.0, 0.2 )
where
ma1.request_id = 4
AND ma1.final_Value in ( 0, 1 )
group by
ma1.person_id,
mn1.store_name) r1
group by
r1.person_id
Notice the inner query is pulling all transactions for the final value as either a zero OR one. But then, the AVG is based on a case/when of the respective value for the normalized value. When the condition is NOT the 1 or 0 respectively, the result is NULL and is thus not considered when the average is computed.
So at this point, it is already grouped on a per-person, per-store basis with Avg1 and Avg0 set. Now, roll these values up directly per person regardless of the store. Again, NULL values are not considered as part of the average computation, so if Store "A" doesn't have a value in Avg1, it will not skew the results; similarly if Store "B" doesn't have a value in Avg0.
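A quick way to see that AVG() skips NULLs (throwaway literal values, not your data):
-- Returns 15.0000, not 10.0000: the NULL row is ignored by AVG()
SELECT AVG(v)
FROM (SELECT 10 AS v UNION ALL SELECT NULL UNION ALL SELECT 20) t;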

Need help to make one mysql query to get expected result for my requirement

I am facing a few issues writing a MySQL query to get the result I need. I am actually getting the appropriate result using this existing query, but it is not written in an appropriate way. Here is my query:
SELECT c.ID, c.chn_name,c.chn_logo,
(SELECT ID FROM tv_showtime WHERE showtime<='2013-02-18 10:28:35' AND status='Enable' AND chn_id=c.ID ORDER BY ID DESC Limit 0,1) as currentshowid,
(SELECT tv_showtime FROM tv_showtime WHERE showtime<='2013-02-18 10:28:35' AND status='Enable' AND chn_id=c.ID ORDER BY ID DESC Limit 0,1) as currentshowtime ,
(SELECT tv_showtime FROM tv_showtime WHERE showtime >'2013-02-18 10:28:35' AND status='Enable' AND chn_id=c.ID ORDER BY ID ASC Limit 0,1) as nextshowtime
FROM tv_channels AS c
WHERE c.status="Enable"
ORDER BY c.chn_name
LIMIT 0,10
Here, there are only two tables, named "tv_channels" and "tv_showtime". I need one record for each channel at a time (for the current time). So suppose there are 12 channels and approx. 30 records for each channel (this may vary per channel), and I only need to display channels with a current show. (More clarification: only channels which have a current show time and/or a next show time will be displayed.)
Problem: I need more field values from "tv_showtime" to display other required values. If I keep going this way, I will have to write more inner SELECT queries, and that will slow my page load down. So can you suggest or advise any other way to write this query, please?
Database table detail:
tv_channels [ID, chn_name, [other required fields]],
tv_showtime [ID, chn_id, showtime, show_name, hits, last_ip [and few more fields]]
Please let me know if you need further detail to understand this question.
Any help or suggestion will be appreciated. Thanks.
As someone else asked, you didn't respond about an "end time" for each show, so I had to go on the premise that the show time is when the show starts. That said, how do you determine which show is currently running for a given channel based on CURTIME() (instead of a fixed time value)?
Get each channel and the MAXIMUM SHOW Time that exists PRIOR TO the current time...
Likewise, how to get the NEXT Show? Get each channel with the MINIMUM SHOW time that STARTS AFTER the current time.
So, if I had the following records for 1 channel and the current time is 2:15pm
Channel ShowTime Show_Name
1 12:30pm Show "X"
1 01:00pm Show "B"
1 01:30pm Show "C"
1 02:00pm Show "D" <- Current Show
1 02:30pm Show "Y" <- Next Show
1 03:00pm Show "Z"
The current show running is the latest one PRIOR to 2:15 (Show "D", starting at 2pm),
and the NEXT show is the first one AFTER the current time (Show "Y", starting at 2:30pm). The above will work even if the rows are not in sequential order, as I am using MIN() and MAX() respectively to get the time.
So, I start with the channel table and do a left-join to each separate pre-aggregate query for detecting the current show and next show times respectively, joining on the channel ID. Each of these COULD return at most one record, provided there IS a record that qualifies under the CURTIME() condition in the WHERE clause.
From THAT, I am re-joining THOSE result sets back to the actual tv schedule table AGAIN, but this time, on the channel AND the time that matched the corresponding current or next time.
So now, I have everything lined up ready to go with respective aliases for content. Now, I just grab the columns I want to present.
Since the joins are all LEFT-JOINs, each side COULD have NULL values, so you might want to adjust the query to prevent nulls using COALESCE(), such as I've sampled...
SELECT
TC.ID,
TC.Chn_Name,
TC.Chn_Logo,
COALESCE( CurShowTimeDetail.ShowTime, 'no time' ) CurShowTime,
COALESCE( CurShowTimeDetail.Show_Name, '' ) CurShowName,
COALESCE( CurShowTimeDetail.Hits, 0 ) CurHits,
COALESCE( NextShowTimeDetail.ShowTime, 'no time' ) NextShowTime,
COALESCE( NextShowTimeDetail.Show_Name, '' ) NextShowName,
COALESCE( NextShowTimeDetail.Hits, 0 ) NextHits
from
TV_Channels TC
LEFT JOIN ( SELECT
ST.chn_id,
MAX( ST.showtime ) CurShowTime
from
tv_showtime ST
where
ST.ShowTime < CURTIME()
group by
ST.chn_id ) CurrentShow
ON TC.ID = CurrentShow.Chn_ID
LEFT JOIN tv_showtime CurShowTimeDetail
ON CurrentShow.Chn_ID = CurShowTimeDetail.Chn_ID
AND CurrentShow.CurShowTime = CurShowTimeDetail.ShowTime
LEFT JOIN ( SELECT
ST.chn_id,
MIN( ST.showtime ) NextShowTime
from
tv_showtime ST
where
ST.ShowTime > CURTIME()
group by
ST.chn_id ) NextShow
ON TC.ID = NextShow.Chn_ID
LEFT JOIN tv_showtime NextShowTimeDetail
ON NextShow.Chn_ID = NextShowTimeDetail.Chn_ID
AND NextShow.NextShowTime = NextShowTimeDetail.ShowTime
To select the last (or first) record from a table by some order, you may LEFT JOIN the table with itself as "any next (or previous) element", and add a condition that no such element exists.
SELECT c.ID, c.chn_name, c.chn_logo
, curr_sh.ID AS currentshowid, curr_sh.showtime AS currentshowtime -- Continue with desired columns
, next_sh.showtime AS nextshowtime -- Continue with desired columns
FROM tv_channels AS c
LEFT JOIN tv_showtime AS curr_sh
ON curr_sh.chn_id = c.ID
AND curr_sh.showtime <= '2013-02-18 10:28:35'
AND curr_sh.status='Enable'
LEFT JOIN tv_showtime AS curr_next_sh
ON curr_next_sh.chn_id = curr_sh.chn_id
AND curr_next_sh.showtime > curr_sh.showtime
AND curr_next_sh.showtime <= '2013-02-18 10:28:35'
AND curr_next_sh.status = 'Enable'
LEFT JOIN tv_showtime AS next_sh
ON next_sh.chn_id = c.ID
AND next_sh.showtime > '2013-02-18 10:28:35'
AND next_sh.status='Enable'
LEFT JOIN tv_showtime AS next_prev_sh
ON next_prev_sh.chn_id = next_sh.chn_id
AND next_prev_sh.showtime < next_sh.showtime
AND next_prev_sh.showtime > '2013-02-18 10:28:35'
AND next_prev_sh.status = 'Enable'
WHERE c.status = 'Enable'
AND curr_next_sh.ID IS NULL -- This gives us only the latest current show
AND next_prev_sh.ID IS NULL -- This gives us only the earliest next show
AND (curr_sh.ID IS NOT NULL OR next_sh.ID IS NOT NULL) -- This gives us 'which has current show time and/or next show time'
ORDER BY c.chn_name
LIMIT 0,10
But I'm not sure about performance, and whether this solution is optimal.
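Performance in either approach will mostly come down to how tv_showtime is indexed. As a hedged suggestion (the index name is made up), a composite index covering the channel, status, and time columns used above should help the per-channel lookups:
-- Serves the chn_id joins plus the showtime comparisons and status filter
CREATE INDEX idx_showtime_chn_time_status
    ON tv_showtime (chn_id, showtime, status);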

SQL : How to calculate limited rows and reset the counter

I am dealing with an issue and need some expert advice on how to achieve this. My SQL query generates output with two columns: the first column displays an id (e.g. abc-123 in the following table) and the next column displays the corresponding result for that id stored in the DB, which is pass or fail.
What I need to implement: when the resolution is pass, it should display the success rate for that run of attempts. In the following example, abc-123 failed the first time but abc-456 passed on the next attempt, so the success rate is 50%. The counter should then reset and move to the next row where there is a pass, so it should show 100%. Again, when the code hits a pass, the counter resets, moves on, and displays 33% because there are two fails and one pass at the end. How can this be achieved in SQL? (id and resolution are column names.)
date        id        resolution   success rate
6/6/2012    abc-123   fail         50%
6/7/2012    abc-456   pass
6/8/2012    abc-789   pass         100%
6/9/2012    abc-799   fail         33%
6/10/2012   abc-800   fail
6/1/2012    abc-900   pass
Thanks
SELECT
*
FROM
table
INNER JOIN
(
SELECT
MIN(g.id) AS first_id,
MAX(g.id) AS last_id,
COUNT(*) AS group_size
FROM
table AS p
INNER JOIN
table AS g
ON g.id > COALESCE(
(SELECT MAX(id) FROM table WHERE id < p.id AND resolution = 'pass'),
''
)
AND g.id <= p.id
WHERE
p.resolution = 'pass'
GROUP BY
p.id
)
AS groups
ON table.id >= groups.first_id
AND table.id <= groups.last_id
There's more than one way to do it:
SELECT st.*,
       @prev := @counter + 1,
       @counter := CASE
                      WHEN st.resolution = 'pass'
                      THEN 0
                      ELSE @counter + 1
                   END c,
       CASE WHEN @counter = 0
            THEN CONCAT(FORMAT(100/@prev, 2), '%')
            ELSE '-'
       END res
FROM so_test st, (SELECT @counter := 0) sc
Here's a proof of concept.