how to know the source table from a complex sql query - mysql

Please find below the query. I want to know the actual table in which I can find these columns (pin_ein, engineer_pin, transaction_date, EFFECTIVE_WEEK).
select count(1), pin_ein, engineer_pin, transaction_date, EFFECTIVE_WEEK
from (SELECT vw1.pin_ein, vw1.engineer_pin,
             vw1.transaction_date, vw1.effective_week,
             vw1.stores_tools_cost,
             (vw1.stores_total_cost - vw1.stores_tools_cost) stores_total_cost_excl_tools,
             vw1.item_count, vw1.stores_visit_count,
             CASE
               WHEN vw1.stores_total_cost < 5 THEN vw1.stores_visit_count
             END stores_low_cost_visit_count,
             vw1.actual_ouc ouc, er.eng_name engineer_name,
             CASE
               WHEN c.home_parked IS NOT NULL THEN c.home_parked
               ELSE 'N'
             END home_parker,
             CASE
               WHEN c.home_parked IS NOT NULL THEN c.commute_time
               ELSE -99
             END commute_time,
             vw1.stores_com_cost ---v9.3---
      FROM (SELECT pin_ein, engineer_pin, actual_ouc,
                   transaction_date, effective_week,
                   NVL(SUM(CASE
                             WHEN (cow LIKE '%TOOL%' OR cow LIKE '%TOOLE%')
                             THEN transaction_value
                           END), 0) stores_tools_cost,
                   SUM(transaction_value) stores_total_cost,
                   SUM(transaction_quantity) item_count,
                   COUNT(DISTINCT sta_code) stores_visit_count,
                   NVL(SUM(CASE
                             WHEN cow IN (SELECT cow FROM orbit_odw.stores_cow_ref)
                             THEN transaction_value
                           END), 0) stores_com_cost ---v9.3---
            FROM orbit_odw.stores_transaction_dtls
            WHERE effective_week BETWEEN 201543 AND 201610
              /***Ver 6.0---last 13 weeks data to be considered***/
              AND transaction_date BETWEEN to_date('19-10-2015','dd-mm-yyyy')
                                       AND to_date('06-03-2016','dd-mm-yyyy')
              /***Ver 6.0---last 13 weeks data to be considered***/
            GROUP BY pin_ein, engineer_pin, actual_ouc,
                     transaction_date, effective_week) vw1,
           (SELECT *
            FROM orbit_odw.dim_wms_rmdm
            WHERE current_status = 1) er,
           (SELECT engineer_ein, commute_time, home_parked
            FROM orbit_odw.eng_parking_at_home_dtls
            WHERE rec_end_date > SYSDATE
              AND home_parked = 'Y') c
      WHERE TO_CHAR(vw1.pin_ein) = er.ein
        AND vw1.pin_ein = c.engineer_ein(+))
group by pin_ein, engineer_pin, transaction_date, EFFECTIVE_WEEK
having count(1) > 1;
Please help..
thanks in advance

If I understand your problem correctly, you need to work out where your outer SELECT is fetching its data from. There is a simple set of steps for that:
1. Check the output columns. In your case these are pin_ein, engineer_pin, transaction_date and EFFECTIVE_WEEK.
2. Check the inline view or table from which your outer query is fetching the data. In your case the outer query reads from a single inline view, and inside it all four columns are taken from vw1, so it is fairly easy to identify.
3. Check how the inline view vw1 is populated. In your case the table orbit_odw.stores_transaction_dtls supplies the required fields.
Hope this is the information you need. For simple queries you can also look a column up in the ALL_TAB_COLUMNS data dictionary view (Oracle) to find which table it belongs to.
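As a rough illustration of the catalog lookup mentioned above, here is a minimal sketch in Python using SQLite's built-in catalog as a stand-in. The two tables are cut-down, hypothetical versions of the ones in the question, and the helper name tables_with_columns is made up for this example. On MySQL the same idea would be a query such as SELECT table_name FROM information_schema.columns WHERE column_name = 'pin_ein'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified versions of two tables from the question.
conn.executescript("""
    CREATE TABLE stores_transaction_dtls (
        pin_ein INTEGER, engineer_pin TEXT,
        transaction_date TEXT, effective_week INTEGER,
        transaction_value REAL
    );
    CREATE TABLE dim_wms_rmdm (ein TEXT, eng_name TEXT, current_status INTEGER);
""")

def tables_with_columns(conn, wanted):
    """Return {column: [tables that define it]} using SQLite's catalog."""
    found = {name: [] for name in wanted}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        cols = {row[1].lower()
                for row in conn.execute(f'PRAGMA table_info("{table}")')}
        for name in wanted:
            if name.lower() in cols:
                found[name].append(table)
    return found

result = tables_with_columns(
    conn, ["pin_ein", "engineer_pin", "transaction_date", "effective_week"])
for column, tables in result.items():
    print(column, "->", tables)
```

This only tells you which base tables define a column with that name. For a derived column (an alias or an expression in an inline view) you still have to walk the query from the outside in, as in the steps above.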

Related

MySQL Ignoring Outliers

I have to present some data to work colleagues and I am having issues analysing it in MySQL.
I have 1 table called 'payments'. Each payment has columns for:
Client (our client e.g. a bank)
Amount_gbp (the GBP equivalent of the value of the transaction)
Currency
Origin_country
Client_type (individual or company)
I have written pretty simple queries like:
SELECT
AVG(amount_GBP),
COUNT(client) AS '#Of Results'
FROM payments
WHERE client_type = 'individual'
AND amount_gbp IS NOT NULL
AND currency = 'TRY'
AND country_origin = 'GB'
AND date_time BETWEEN '2017/1/1' AND '2017/9/1'
But what I really need to do is eliminate outliers from the average and/or only include results within a number of standard deviations of the mean.
For example, ignore the top/bottom 10 results or 2% of results, etc.,
and/or ignore any results that fall more than 2 standard deviations from the mean.
Can anyone help?
Edited answer: your best bet is to create a TEMPORARY table holding the avg and std_dev values and compare against them. Let me know if that is not feasible:
CREATE TEMPORARY TABLE payment_stats AS
SELECT
    AVG(p.amount_gbp) AS avg_gbp,
    STDDEV(p.amount_gbp) AS std_gbp,
    -- upper cutoff: the smallest of the top-N amounts
    (SELECT MIN(srt.amount_gbp) AS max_gbp
     FROM (SELECT amount_gbp
           FROM payments
           <... repeat where no p. ...>
           ORDER BY amount_gbp DESC
           LIMIT <top_numbers to ignore>
          ) srt
    ) max_g,
    -- lower cutoff: the largest of the bottom-N amounts
    (SELECT MAX(srt.amount_gbp) AS min_gbp
     FROM (SELECT amount_gbp
           FROM payments
           <... repeat where no p. ...>
           ORDER BY amount_gbp ASC
           LIMIT <top_numbers to ignore>
          ) srt
    ) min_g
FROM payments p
WHERE client_type = 'individual'
  AND amount_gbp IS NOT NULL
  AND currency = 'TRY'
  AND country_origin = 'GB'
  AND date_time BETWEEN '2017/1/1' AND '2017/9/1';
You can then compare against the temp table
SELECT
AVG(p.amount_gbp) as avg_gbp,
COUNT(p.client) AS '#Of Results'
FROM payments p
WHERE
p.amount_gbp >= (SELECT (avg_gbp - std_gbp*2)
FROM payment_stats)
AND p.amount_gbp <= (SELECT (avg_gbp + std_gbp*2)
FROM payment_stats)
AND p.amount_gbp > (SELECT min_g FROM payment_stats)
AND p.amount_gbp < (SELECT max_g FROM payment_stats)
AND p.client_type = 'individual'
AND p.amount_gbp IS NOT NULL
AND p.currency = 'TRY'
AND p.country_origin = 'GB'
AND p.date_time BETWEEN '2017/1/1' AND '2017/9/1';
-- Later on
DROP TEMPORARY TABLE payment_stats;
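Notice that I had to repeat the WHERE condition. Also change *2 to whatever <factor> you need.
Each comparison checks a different statistic: the first two keep rows within <factor> standard deviations of the mean, and the last two drop the top and bottom N rows.
Let me know if this is better.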
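To see the standard-deviation filter in action, here is a small sketch in Python with SQLite standing in for MySQL. The sample amounts are invented. SQLite has no STDDEV(), so the population standard deviation (which is what MySQL's STDDEV() returns) is derived here from AVG(x*x) and AVG(x); that substitution is an assumption of this sketch, not part of the answer above.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (client TEXT, amount_gbp REAL)")
# Invented sample data: eight ordinary payments and one obvious outlier.
amounts = [100, 102, 98, 101, 99, 103, 97, 100, 5000]
conn.executemany("INSERT INTO payments VALUES ('bank', ?)",
                 [(a,) for a in amounts])

# Step 1: mean and population standard deviation over all rows.
mean, mean_sq = conn.execute(
    "SELECT AVG(amount_gbp), AVG(amount_gbp * amount_gbp) FROM payments"
).fetchone()
std = math.sqrt(mean_sq - mean * mean)

# Step 2: average only the rows within `factor` standard deviations.
factor = 2
trimmed_avg, kept = conn.execute(
    "SELECT AVG(amount_gbp), COUNT(*) FROM payments "
    "WHERE amount_gbp BETWEEN ? AND ?",
    (mean - factor * std, mean + factor * std),
).fetchone()

print(f"raw mean={mean:.2f}, std={std:.2f}, trimmed mean={trimmed_avg:.2f}, rows kept={kept}")
```

Keep in mind that a large outlier inflates the standard deviation itself, so on real data a single pass with a fixed factor may keep more rows than you expect.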

query optimization for mysql

I have the following query which takes about 28 seconds on my machine. I would like to optimize it and know if there is any way to make it faster by creating some indexes.
select rr1.person_id as person_id, rr1.t1_value, rr2.t0_value
from (select r1.person_id, avg(r1.avg_normalized_value1) as t1_value
from (select ma1.person_id, mn1.store_name, avg(mn1.normalized_value) as avg_normalized_value1
from matrix_report1 ma1, matrix_normalized_notes mn1
where ma1.final_value = 1
and (mn1.normalized_value != 0.2
and mn1.normalized_value != 0.0 )
and ma1.user_id = mn1.user_id
and ma1.request_id = mn1.request_id
and ma1.request_id = 4 group by ma1.person_id, mn1.store_name) r1
group by r1.person_id) rr1
,(select r2.person_id, avg(r2.avg_normalized_value) as t0_value
from (select ma.person_id, mn.store_name, avg(mn.normalized_value) as avg_normalized_value
from matrix_report1 ma, matrix_normalized_notes mn
where ma.final_value = 0 and (mn.normalized_value != 0.2 and mn.normalized_value != 0.0 )
and ma.user_id = mn.user_id
and ma.request_id = mn.request_id
and ma.request_id = 4
group by ma.person_id, mn.store_name) r2
group by r2.person_id) rr2
where rr1.person_id = rr2.person_id
Basically, it aggregates data depending on the request_id and final_value (0 or 1). Is there a way to simplify it for optimization? And it would be nice to know which columns should be indexed. I created an index on user_id and request_id, but it doesn't help much.
There are about 4907424 rows on matrix_report1 and 335740 rows on matrix_normalized_notes table. These tables will grow as we have more requests.
First, the others are right that your sample should be better formatted. Explaining in plain language what you are trying to do also helps, and sample data with the expected results is even better.
However, that said, I think it can be significantly simplified. Your two subqueries are almost identical; the only difference is whether "final_value" is 1 or 0. Since each subquery produces 1 record per "person_id", you can compute both averages in one pass with CASE/WHEN and remove the rest.
To help optimize the query, your matrix_report1 table should have an index on ( request_id, final_value, user_id ). Your matrix_normalized_notes table should have an index on ( request_id, user_id, store_name, normalized_value ).
Since your outer query averages the per-store averages, you do need to keep the nesting. The following should help.
SELECT
r1.person_id,
avg(r1.ANV1) as t1_value,
avg(r1.ANV0) as t0_value
from
( select
ma1.person_id,
mn1.store_name,
avg( case when ma1.final_value = 1
then mn1.normalized_value end ) as ANV1,
avg( case when ma1.final_value = 0
then mn1.normalized_value end ) as ANV0
from
matrix_report1 ma1
JOIN matrix_normalized_notes mn1
ON ma1.request_id = mn1.request_id
AND ma1.user_id = mn1.user_id
AND NOT mn1.normalized_value in ( 0.0, 0.2 )
where
ma1.request_id = 4
AND ma1.final_Value in ( 0, 1 )
group by
ma1.person_id,
mn1.store_name) r1
group by
r1.person_id
Notice the inner query is pulling all transactions for the final value as either a zero OR one. But then, the AVG is based on a case/when of the respective value for the normalized value. When the condition is NOT the 1 or 0 respectively, the result is NULL and is thus not considered when the average is computed.
So at this point, the data is already grouped per person and store, with ANV1 and ANV0 set. Now roll these values up per person regardless of the store. Again, NULL values are not considered in the average computation. So if Store "A" doesn't have a value in ANV1, it does not skew the results. The same goes if Store "B" doesn't have a value in ANV0.
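The NULL-skipping behaviour of AVG over a CASE expression can be checked with a tiny example. This is a sketch in Python using SQLite; the single scores table is a hypothetical, already-joined stand-in for matrix_report1 and matrix_normalized_notes, and the values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (person_id INTEGER, store_name TEXT,
                         final_value INTEGER, normalized_value REAL);
    INSERT INTO scores VALUES
        (1, 'A', 1, 0.4), (1, 'A', 1, 0.6),  -- store A has no final_value = 0 rows
        (1, 'B', 0, 0.8), (1, 'B', 1, 1.0);
""")

rows = conn.execute("""
    SELECT person_id, AVG(anv1) AS t1_value, AVG(anv0) AS t0_value
    FROM (SELECT person_id, store_name,
                 -- CASE without ELSE yields NULL, which AVG ignores
                 AVG(CASE WHEN final_value = 1 THEN normalized_value END) AS anv1,
                 AVG(CASE WHEN final_value = 0 THEN normalized_value END) AS anv0
          FROM scores
          GROUP BY person_id, store_name) r1
    GROUP BY person_id
""").fetchall()

person_id, t1_value, t0_value = rows[0]
print(person_id, t1_value, t0_value)
```

Store A contributes NULL for anv0, so t0_value is the average of store B's value alone rather than being dragged down by a zero.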

How to make a select that returns 4 totals from same table but with different filters

I'm trying to build a report in SSRS that shows some totals from the same table. I know I can use nested SELECTs, but I've heard that could hurt performance and make it slow. That is why I decided to use stored procedures, but I'm not very familiar with them (I have only written some basic SPs), so some help would be appreciated:
This is what I need to get:
|--------------|------------------------- TOTALS AND PERCENTAGES ----------------------|
|COMPANY | PACKAGES | WEIGHT | PACKAGE_DELIVERED |% DELIVERED | ONTIME |% ONTIME |
These are the queries I wrote in a previous version of the report (using ASP):
SELECT COMPANY_NAME, COUNT(ID) AS PACKAGES, SUM(WEIGHT) AS WEIGHT
FROM PACKAGE
WHERE ACTUAL_DELIVERY_DATE BETWEEN 'X' AND 'Y'
GROUP BY COMPANY_CODE, COMPANY_NAME
Then I put the results in arrays and then make a new select to get the rest of information adding the COMPANY as filter:
SELECT COMPANY_CODE, ESTIMATED_DELIVERY_DATE, ACTUAL_DELIVERY_DATE
FROM PACKAGE
WHERE ACTUAL_DELIVERY_DATE BETWEEN 'X' AND 'Y'
AND STATUS = 'DELIVERED'
AND COMPANY_CODE = 'DHL'
ORDER BY STATUS
For every row
PACKAGES_DELIVERED = + 1
IF ACTUAL_DELIVERY_DATE < ESTIMATED_DELIVERY_DATE THEN ONTIME = + 1
Next
Then I calculate the percentages and show all together in a table.
Can somebody help me put all this in a stored procedure, or maybe suggest another idea?
Thanks in advance.
I would add the following columns to the original SELECT, using SUM over a CASE expression:
, SUM ( CASE WHEN STATUS = 'DELIVERED' THEN 1 ELSE 0 END ) AS PACKAGES_DELIVERED
, SUM ( CASE WHEN STATUS = 'DELIVERED' AND ACTUAL_DELIVERY_DATE < ESTIMATED_DELIVERY_DATE THEN 1 ELSE 0 END ) AS ONTIME
This doesn't seem complex enough to bother with a stored procedure.
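Put together, the single query with the extra SUM(CASE ...) columns could look roughly like the sketch below (Python with SQLite, invented sample rows). The percentage columns are my assumption about what the report needs: % delivered is taken over all packages and % on time over delivered packages, which the question does not spell out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE package (
        id INTEGER PRIMARY KEY, company_code TEXT, company_name TEXT,
        weight REAL, status TEXT,
        estimated_delivery_date TEXT, actual_delivery_date TEXT);
    INSERT INTO package (company_code, company_name, weight, status,
                         estimated_delivery_date, actual_delivery_date) VALUES
        ('DHL', 'DHL', 2.0, 'DELIVERED', '2024-01-10', '2024-01-09'),
        ('DHL', 'DHL', 3.0, 'DELIVERED', '2024-01-10', '2024-01-12'),
        ('DHL', 'DHL', 1.0, 'RETURNED',  '2024-01-15', '2024-01-15'),
        ('DHL', 'DHL', 4.0, 'DELIVERED', '2024-01-11', '2024-01-10'),
        ('UPS', 'UPS', 5.0, 'DELIVERED', '2024-01-05', '2024-01-06'),
        ('UPS', 'UPS', 2.5, 'RETURNED',  '2024-01-07', '2024-01-08');
""")

rows = conn.execute("""
    SELECT company_name,
           COUNT(id)   AS packages,
           SUM(weight) AS weight,
           SUM(CASE WHEN status = 'DELIVERED' THEN 1 ELSE 0 END) AS packages_delivered,
           100.0 * SUM(CASE WHEN status = 'DELIVERED' THEN 1 ELSE 0 END)
                 / COUNT(id) AS pct_delivered,
           SUM(CASE WHEN status = 'DELIVERED'
                     AND actual_delivery_date < estimated_delivery_date
                    THEN 1 ELSE 0 END) AS ontime,
           -- NULLIF avoids division by zero when nothing was delivered
           100.0 * SUM(CASE WHEN status = 'DELIVERED'
                             AND actual_delivery_date < estimated_delivery_date
                            THEN 1 ELSE 0 END)
                 / NULLIF(SUM(CASE WHEN status = 'DELIVERED' THEN 1 ELSE 0 END), 0)
                 AS pct_ontime
    FROM package
    WHERE actual_delivery_date BETWEEN ? AND ?
    GROUP BY company_code, company_name
    ORDER BY company_name
""", ("2024-01-01", "2024-01-31")).fetchall()

for row in rows:
    print(row)
```

One pass over the table produces every column of the report, so the per-company loop from the ASP version is no longer needed.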

summation in mysql does not work properly and returns 0

I have a SQL query (copyable code below).
The problem is that if the second SELECT statement, the one with DataitemType=3, returns NULL, then the whole calculation becomes 0. For example, the first SELECT gives 100 and the second returns NULL. Adding them should result in 100, but it gives back 0!
Can anyone explain the reason, and what to do to get rid of that?
Here is the code:
select( (
SELECT sum(Sentiment)
FROM entity_epoch_data
WHERE EpochID IN
(SELECT ID
FROM epoch
WHERE StartDateTime>='2013-11-1'
AND EndDateTime<='2013-11-30')
AND EntityID =86
AND DataitemType=0
)+
(SELECT sum(Sentiment)
FROM entity_epoch_data
WHERE EpochID IN
(SELECT ID
FROM epoch
WHERE StartDateTime>='2013-11-1'
AND EndDateTime<='2013-11-30')
AND EntityID =86
AND DataitemType=3)
)
Just wrap each subquery that produces a value in IFNULL, like IFNULL((select sum()......), 0), and it should work fine.
But a little piece of advice: you should improve that query.
I believe this query does the same thing:
SELECT sum(entity_epoch_data.Sentiment)
FROM entity_epoch_data INNER JOIN epoch
ON entity_epoch_data.EpochID = epoch.id
WHERE epoch.StartDateTime>='2013-11-1'
and epoch.EndDateTime<='2013-11-30'
AND entity_epoch_data.EntityID =86
and entity_epoch_data.DataitemType in (0,3)
You are summing the sums for DataitemType 3 and 0, so it can be a single query with a join.
Use CASE expressions instead of the separate subqueries you're using.
select sum(case when StartDateTime>='2013-11-1' and EndDateTime<='2013-11-30' and DataitemType=0
                then Sentiment else 0 end) as Sentiment_0,
       sum(case when StartDateTime>='2013-11-1' and EndDateTime<='2013-11-30' and DataitemType=3
                then Sentiment else 0 end) as Sentiment3
from (tables_joined)
where EntityID =86
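In SQL, arithmetic with NULL yields NULL (which your client may be showing as 0). You can use a CASE expression to avoid that problem. Note that this only handles NULL values inside the matching rows; if no rows match at all, SUM still returns NULL and needs IFNULL around it.
SELECT SUM(CASE WHEN sentiment IS NULL THEN 0 ELSE sentiment END) FROM ....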
If your DB does not support CASE inside the SUM() function create a VIEW that uses CASE to substitute the null values with 0.
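The behaviour is easy to reproduce. Below is a minimal sketch in Python with SQLite in place of MySQL; the epoch join is left out and the rows are invented to keep it short. COALESCE is used because it is standard SQL; in MySQL, IFNULL(x, 0) does the same job for two arguments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity_epoch_data (EntityID INTEGER, DataitemType INTEGER,
                                    Sentiment INTEGER);
    -- Invented rows: nothing exists for DataitemType = 3.
    INSERT INTO entity_epoch_data VALUES (86, 0, 60), (86, 0, 40);
""")

# Template for one of the two scalar subqueries from the question.
sum_for = ("(SELECT SUM(Sentiment) FROM entity_epoch_data "
           "WHERE EntityID = 86 AND DataitemType = {})")

# SUM over zero rows is NULL, and 100 + NULL is NULL.
naive = conn.execute(
    f"SELECT {sum_for.format(0)} + {sum_for.format(3)}").fetchone()[0]

# Wrapping each subquery turns the missing sum into 0.
fixed = conn.execute(
    f"SELECT COALESCE({sum_for.format(0)}, 0) + COALESCE({sum_for.format(3)}, 0)"
).fetchone()[0]

# A single SUM over both types never adds NULL to a number.
combined = conn.execute(
    "SELECT SUM(Sentiment) FROM entity_epoch_data "
    "WHERE EntityID = 86 AND DataitemType IN (0, 3)").fetchone()[0]

print(naive, fixed, combined)
```

Python shows the NULL result as None; a client that coerces NULL to a number would display 0 instead, which may be what happened in the question.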

MAX with extra criteria

I have the following part of a query I'm working on in MYSQL.
SELECT
MAX(CAST(MatchPlayerBatting.BatRuns AS SIGNED)) AS HighestScore
FROM
MatchPlayerBatting
It returns the correct result. However there is another column I need it to work off.
That is if the maximum value it finds also has a value of "not out" within "BatHowOut", it should show the result as for example 96* rather than just 96.
How could this be done?
To help make the data concrete, consider two cases:
BatRuns BatHowOut
96 not out
96 lbw
BatRuns BatHowOut
96 not out
102 lbw
For the first data, the answer should be '96*'; for the second, '102'.
You can achieve this using a self-join like this:
SELECT t1.ID
, CONCAT(t1.BatRuns,
CASE WHEN t1.BatHowOut = 'Not Out' THEN '*' ELSE '' END
) AS HighScore
FROM MatchPlayerBatting t1
JOIN
(
SELECT MAX(CAST(BatRuns AS SIGNED)) AS HighestScore
FROM MatchPlayerBatting
) t2
ON CAST(t1.BatRuns AS SIGNED) = t2.HighestScore
See this sample SQLFiddle where the highest score is "Not Out"
See this other sample SQLFiddle where the highest score is "Out"
See this other sample SQLFiddle with two highest scores
How about ordering the scores in descending order and selecting only the first record?
select concat(BatRuns , case when BatHowOut = 'not out' then '*' else '' end)
from mytable
order by cast(BatRuns as signed) desc,
(case when BatHowOut = 'not out' then 1 else 2 end)
limit 1;
Sample here.
If you want to find the highest score for each player, here is a solution that may not be elegant, but is quite effective.
select PlayerID,
case when runs != round(runs)
then concat(round(runs),'*')
else
round(runs)
end highest_score
from (select PlayerID,
max(cast(BatRuns as decimal) +
case when BatHowOut = 'not out' then 0.1 else 0 end
) runs
from MatchPlayerBatting
group by PlayerID) max_runs;
This takes advantage of the fact that runs can never be fractions, only whole numbers. When there is a tie for the highest score and one of the tied innings is unbeaten,
adding 0.1 to the unbeaten score makes it the highest. The fraction can later be rounded away and replaced with *.
Sample here.
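For what it's worth, the ORDER BY ... LIMIT 1 approach can be checked against the two data sets from the question. This is a sketch in Python with SQLite rather than MySQL, so it uses || instead of CONCAT and CAST(... AS INTEGER) instead of CAST(... AS SIGNED); the helper highest_score is just a name made up for the test.

```python
import sqlite3

QUERY = """
    SELECT BatRuns || CASE WHEN BatHowOut = 'not out' THEN '*' ELSE '' END
    FROM MatchPlayerBatting
    ORDER BY CAST(BatRuns AS INTEGER) DESC,
             -- on a tie, prefer the unbeaten innings
             CASE WHEN BatHowOut = 'not out' THEN 1 ELSE 2 END
    LIMIT 1
"""

def highest_score(rows):
    """Load the rows into a fresh in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    # BatRuns is stored as text, as the CAST in the question suggests.
    conn.execute("CREATE TABLE MatchPlayerBatting (BatRuns TEXT, BatHowOut TEXT)")
    conn.executemany("INSERT INTO MatchPlayerBatting VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchone()[0]
    conn.close()
    return result

first = highest_score([("96", "not out"), ("96", "lbw")])
second = highest_score([("96", "not out"), ("102", "lbw")])
print(first, second)
```

The CAST in the ORDER BY matters: sorted as text, '96' would rank above '102'.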