MySQL Select Results Excluding Outliers Using AVG and STD Conditions - mysql

I'm trying to write a query that excludes values beyond 6 standard deviations from the mean of the result set. I expect this can be done elegantly with a subquery, but I'm getting nowhere and in every similar case I've read the aim seems to be just a little different. My result set seems to get limited to a single row, I'm guessing due to calling the aggregate functions. Conceptually, this is what I'm after:
SELECT t.Result FROM
(SELECT Result, AVG(Result) avgr, STD(Result) stdr
FROM myTable WHERE myField=myCondition limit=75) as t
WHERE t.Result BETWEEN (t.avgr-6*t.stdr) AND (t.avgr+6*t.stdr)
I can get it to work by replacing each use of the STD or AVG value (i.e. t.avgr) with its own select statement, as in:
(SELECT AVG(Result) FROM myTable WHERE myField=myCondition limit=75)
However this seems waay more messy than I expect it needs to be (I've a few conditions). At first I thought specifying a HAVING clause was necessary, but as I learn more it doesn't seem to be quite what I'm after. Am I close? Is there some snazzy way to access the value of aggregate functions for use in conditions (without needing to return the aggregate values)?

Yes, your subquery is an aggregate query with no GROUP BY clause, therefore its result is a single row. When you select from that, you cannot get more than one row. Moreover, it is a MySQL extension that you can include the Result field in the subquery's selection list at all, as it is neither a grouping column nor an aggregate function of the groups (so what does it even mean in that context unless, possibly, all the relevant column values are the same?).
You should be able to do something like this to compute the average and standard deviation once, together, instead of per-result:
SELECT t.Result FROM
myTable AS t
CROSS JOIN (
SELECT AVG(Result) avgr, STD(Result) stdr
FROM myTable
WHERE myField = myCondition
) AS stats
WHERE
t.myField = myCondition
AND t.Result BETWEEN (stats.avgr-6*stats.stdr) AND (stats.avgr+6*stats.stdr)
LIMIT 75
Note that you will want to be careful that the statistics are computed over the same set of rows that you are selecting from, hence the duplication of the myField = myCondition predicate. It is also why the LIMIT clause has been removed from the subquery and now appears only on the outer query.
You can add more statistics to the aggregate subquery, provided that they are all computed over the same set of rows, or you can join additional statistics computed over different rows via a separate subquery. Do ensure that all your statistics subqueries return exactly one row each, else you will get duplicate (or no) results.
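The CROSS JOIN approach above can be tried out with a small sketch in Python. It uses the standard sqlite3 module as a stand-in for MySQL, which is an assumption made only to keep the example runnable. SQLite has no STD(), so a population standard deviation aggregate is registered under that name to mimic MySQL's STD(). The sample data and the function name select_without_outliers are hypothetical.

```python
import sqlite3
import statistics


class PopulationStd:
    """SQLite aggregate standing in for MySQL's STD() (population std. dev.)."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:  # aggregates skip NULLs
            self.values.append(value)

    def finalize(self):
        return statistics.pstdev(self.values) if self.values else None


def select_without_outliers(conn, condition, k=6, max_rows=75):
    """Return Result values within k standard deviations of the filtered mean."""
    sql = """
        SELECT t.Result
        FROM myTable AS t
        CROSS JOIN (
            SELECT AVG(Result) AS avgr, STD(Result) AS stdr
            FROM myTable
            WHERE myField = :cond
        ) AS stats
        WHERE t.myField = :cond
          AND t.Result BETWEEN stats.avgr - :k * stats.stdr
                           AND stats.avgr + :k * stats.stdr
        LIMIT :max_rows
    """
    params = {"cond": condition, "k": k, "max_rows": max_rows}
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.create_aggregate("STD", 1, PopulationStd)
conn.execute("CREATE TABLE myTable (myField TEXT, Result REAL)")

# Made-up data: one extreme value plus 99 ordinary ones for condition 'a'.
rows = [("a", 1000.0)] + [("a", float(9 + i % 3)) for i in range(99)]
rows.append(("b", 1e6))  # different condition: must not affect the statistics
conn.executemany("INSERT INTO myTable VALUES (?, ?)", rows)

kept = select_without_outliers(conn, "a", max_rows=1000)
print(len(kept), max(kept))
```

With only a handful of rows no value can ever lie 6 population standard deviations from the mean (the largest possible z-score for n rows is sqrt(n - 1)), which is why the sample set here has 100 rows.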

I created a UDF that doesn't calculate exactly the way you asked (it discards a percentage of the results from the top and bottom instead of using the standard deviation), but it might be useful for you (or someone else) anyway. It matches the Excel function referenced here: https://support.office.com/en-us/article/trimmean-function-d90c9878-a119-4746-88fa-63d988f511d3
https://github.com/StirlingMarketingGroup/mysql-trimmean
Usage
`trimmean` ( `NumberColumn`, double `Percent` [, integer `Decimals` = 4 ] )
`NumberColumn`
The column of values to trim and average.
`Percent`
The fractional number of data points to exclude from the calculation. For example, if percent = 0.2, 4 points are trimmed from a data set of 20 points (20 x 0.2): 2 from the top and 2 from the bottom of the set.
`Decimals`
Optionally, the number of decimal places to output. Default is 4.
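To make the trimming rule concrete, here is a rough Python sketch of a trimmed mean following the description above. It is not the UDF's actual code. Rounding the number of dropped points down to an even count is taken from Excel's documented TRIMMEAN behaviour and is only assumed to apply to the UDF as well.

```python
def trimmean(values, percent, decimals=4):
    """Mean of `values` after dropping `percent` of the points,
    split evenly between the top and bottom of the sorted data."""
    if not 0 <= percent < 1:
        raise ValueError("percent must be in [0, 1)")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    # Total points to drop, rounded down to an even count, then halved per side.
    per_side = int(len(ordered) * percent) // 2
    kept = ordered[per_side:len(ordered) - per_side]
    return round(sum(kept) / len(kept), decimals)


# 20 points, percent = 0.2: 4 points are trimmed, 2 from each end.
print(trimmean(range(1, 21), 0.2))  # 10.5
```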

Related

Getting the correct row data when using MySQL aggregate function MIN? [duplicate]

Now, as I understand when you use aggregate functions such as AVG, SUM etc you have to keep in mind that any other fields you SELECT that aren't also involved in an aggregate function will be indeterminate, for example:
SELECT AVG(amount), name, `desc` FROM some_table;
I understand this and this is because the value coming from the aggregate function isn't tied to any one row and hence the other fields selected are indeterminate.
However, if you use a different type of aggregate function such as MIN or MAX where what they retrieve is tied to a certain row then is it safe to assume that any other fields selected that aren't within an aggregate function can be determined? ... as the result would be tied to a specific row of data unlike the other aggregate function results?
For example:
SELECT MIN(media_id),
auction_id,
media_url
FROM auction_media
WHERE auction_id IN( 119925, 124660, 124663, 129078,
129094, 134395, 149753, 152221,
154733, 154737, 154742, 157694,
161411, 165965, 165973 )
AND media_type = 1
AND upload_in_progress = 0
GROUP BY auction_id;
If I am right in my thinking this would always return the correct media_url right?
However, if you use a different type of aggregate function such as MIN
or MAX where what they retrieve is tied to a certain row then is it
safe to assume that any other fields selected that aren't within an
aggregate function can be determined?
Nope. For one, multiple rows can have the min or max value; for another, there is nothing stopping a query from selecting MIN(a), MAX(a), AVG(a), and SUM(a) all at once (and I highly doubt MySQL would over-complicate its query engine to take advantage of "if the query has only one aggregate...").
Note: I am fairly certain the only reason MySQL originally even allowed such queries was for short hand in situations like:
SELECT a.*, SUM(b.X)
FROM a INNER JOIN b ON a.PK = b.a_PK
GROUP BY a.PK;
where the query author knows the non-aggregated fields can be determined by virtue of the grouping, not the aggregated value(s).
MIN and MAX are no more tied to any row than AVG or SUM are. All 4 of them are the result of aggregating multiple rows, whether all rows (like your first query), or the rows in a group (like your second query).
If I am right in my thinking this would always return the correct media_url right?
No. What if your data is:
auction_id media_id media_url
119925 3 http://google.com
119925 5 http://yahoo.com
119925 3 http://bing.com
Your query SELECT MIN(media_id), auction_id, media_url GROUP BY auction_id would return 3 for MIN(media_id), and 119925 for auction_id, but what media_url would it return?
media_url is still indeterminate.
You see, there is nothing in the data that says that media_url is in any way related to media_id.
You might (think you) know that the denormalized media_url is always the same for a particular media_id, but that doesn't matter to the SQL engine.
No. The unaggregated columns (that are not in the group by) in an aggregation query come from arbitrary and indeterminate rows. This awkward behavior is why the syntax is not allowed in most databases and why the most recent versions of MySQL "turn-it-off" by default. So your query would return an error.
Here is one way to do what you want:
SELECT am.*
FROM auction_media am
WHERE am.auction_id IN (119925, 124660, 124663, 129078,
129094, 134395, 149753, 152221,
154733, 154737, 154742, 157694,
161411, 165965, 165973 ) AND
am.media_type = 1 AND am.upload_in_progress = 0 AND
am.media_id = (SELECT MIN(am2.media_id)
FROM auction_media am2
WHERE am2.auction_id = am.auction_id AND am2.media_type = am.media_type AND am2.upload_in_progress = am.upload_in_progress
);
For performance you want an index on auction_media(auction_id, media_type, upload_in_progress, media_id) and auction_media(media_type, upload_in_progress, auction_id).
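As a quick check of the correlated-subquery idea, the following Python sketch runs the same shape of query against an in-memory SQLite database (used here only as a convenient stand-in for MySQL). The table contents and the example.com URLs are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE auction_media (
        media_id INTEGER, auction_id INTEGER, media_url TEXT,
        media_type INTEGER, upload_in_progress INTEGER
    )
""")
conn.executemany(
    "INSERT INTO auction_media VALUES (?, ?, ?, ?, ?)",
    [
        (5, 119925, "http://yahoo.com", 1, 0),
        (3, 119925, "http://google.com", 1, 0),
        (2, 119925, "http://example.com/video", 2, 0),    # wrong media_type
        (9, 124660, "http://bing.com", 1, 0),
        (7, 124660, "http://example.com/pending", 1, 1),  # still uploading
    ],
)

# For each outer row, the subquery finds the smallest media_id among rows
# of the same auction that pass the same filters; only that row is kept.
first_media = conn.execute("""
    SELECT am.media_id, am.auction_id, am.media_url
    FROM auction_media am
    WHERE am.auction_id IN (119925, 124660)
      AND am.media_type = 1
      AND am.upload_in_progress = 0
      AND am.media_id = (
          SELECT MIN(am2.media_id)
          FROM auction_media am2
          WHERE am2.auction_id = am.auction_id
            AND am2.media_type = am.media_type
            AND am2.upload_in_progress = am.upload_in_progress
      )
    ORDER BY am.auction_id
""").fetchall()
print(first_media)
```

Because whole rows are selected rather than aggregated, media_url always comes from the same row as the minimum media_id. If two rows in an auction shared that minimum, both would be returned.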

Not selecting duplicates in join / where query

I've been trying to learn MySQL, and I'm having some trouble creating a join query to not select duplicates.
Basically, here's where I'm at :
SELECT atable.phonenumber, btable.date
FROM btable
LEFT JOIN atable ON btable.id = atable.id
WHERE btable.country_id = 4
However, in my database, there is the possibility of having duplicate rows in column atable.phonenumber.
For example (added asterisks for clarity)
phonenumber | date
-------------|-----------
*555-681-2105 | 2015-08-12
555-425-5161 | 2015-08-15
331-484-7784 | 2015-08-17
*555-681-2105 | 2015-08-25
.. and so on.
I tried using SELECT DISTINCT but that doesn't work. I also was looking through other solutions which recommended GROUP BY, but that threw an error, most likely because of my WHERE clause and condition. Not really sure how I can easily accomplish this.
DISTINCT applies to the whole row being returned, essentially saying "I want only unique rows". Any column value may participate in making the row unique.
You are getting phone numbers duplicated because you're only looking at that column in isolation. The database is looking at phone number and also date. The rows you posted have different dates, and hence the rows are different.
I suggest you do as the commenter recommended and decide what you want to do with the dates. If you want the latest date for a phone number, do this:
SELECT atable.phonenumber, max(btable.date)
FROM btable
LEFT JOIN atable ON btable.id = atable.id
WHERE btable.country_id = 4
GROUP BY atable.phonenumber
When you write a query that uses grouping, you will get a set of rows where there is only one row per combination of values for the columns in the group by list. In this case, only unique phone numbers. But, because you want other values as well (i.e. date) you MUST use what's called an aggregate function to specify what you want to do with all the various values that aren't part of the unique set. Sometimes it will be MAX or MIN, sometimes it will be SUM, COUNT, AVG and so on.
If you're familiar with hash tables or dictionaries from elsewhere in programming, this is what a group by is: it maps a set of values (a key) to a list of rows that have those key values, and then the aggregating function is applied to the values in the list associated with the key.
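The dictionary analogy can be spelled out in a few lines of Python. This is only an illustration of the idea, using the sample rows from the question; it says nothing about how MySQL implements grouping internally.

```python
from collections import defaultdict

rows = [
    ("555-681-2105", "2015-08-12"),
    ("555-425-5161", "2015-08-15"),
    ("331-484-7784", "2015-08-17"),
    ("555-681-2105", "2015-08-25"),
]

# GROUP BY phonenumber: map each key to the list of values sharing that key.
groups = defaultdict(list)
for phonenumber, date in rows:
    groups[phonenumber].append(date)

# MAX(date): collapse each group's list to a single value.
# ISO-formatted date strings sort chronologically, so max() works on them.
latest = {phonenumber: max(dates) for phonenumber, dates in groups.items()}
print(latest["555-681-2105"])  # 2015-08-25
```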
The simple rule when using group by is to write queries thus:
SELECT
List,
of,
columns,
you,
want,
in,
unique,
combination,
FN(List),
FN(of),
FN(columns),
FN(you),
FN(want),
FN(aggregating)
FROM table
GROUP BY
List,
of,
columns,
you,
want,
in,
unique,
combination
I.e. you can copy and paste the non-aggregated columns from your select list into your group by list. Note that MySQL does not add the grouping for you: if you use one or more aggregate functions like MAX in your select list but omit the GROUP BY clause, all rows are treated as a single group and you get one row back. Any non-aggregated columns then either raise an error (when ONLY_FULL_GROUP_BY is enabled, the default in recent versions) or take their values from an arbitrary row. There are also other things you can do with a group by, such as WITH ROLLUP in MySQL (other databases also offer CUBE and GROUPING SETS). Also, you can group on a column, if that column is used in a deterministic function, without having to group on the result of the deterministic function. Whether there is any point to doing so is a debate for another time :)
You should add GROUP BY, and an aggregate to the date field, something like this:
SELECT atable.phonenumber, MAX(btable.date)
FROM btable
LEFT JOIN atable ON btable.id = atable.id
WHERE btable.country_id = 4
GROUP BY atable.phonenumber
This will return the maximum date, that is, the latest date for each phone number.

Where clause with one column and multiple criteria returning one row instead of 13

I have a simple query with a few rows and multiple criteria in the where clause but it is only returning one row instead of 13. No joins and the syntax was triple checked and appears to be free of errors.
Query:
select column1, column2, column3
from mydb
where onecolumn in (number1, number2....number13)
Results:
returns one row of data associated with a random number in the where clause
I've spent a big part of the day trying to figure this one out and am now out of ideas. Please help...
Absent a more detailed test case, and the actual SQL statement that is actually running, this question cannot be answered. Here are some "ideas"...
Our first guess is that the rows you think are going to satisfy the predicates aren't actually satisfying all of the conditions.
Our second guess is that you've got an aggregate expression (COUNT(), MAX(), SUM()) in the SELECT list that's causing an implicit GROUP BY. This is a common "gotcha"... the non-standard MySQL extension to GROUP BY which allows non-aggregates to appear in the SELECT list, which are not also included as expressions in the GROUP BY clause. This same gotcha appears when the GROUP BY clause is omitted entirely, and an aggregate is included in the SELECT list.
But the question doesn't make any mention of an aggregate expression in the SELECT list.
Our third guess is another issue that beginners frequently overlook: the order of precedence of operations, especially AND and OR. For example, consider the expressions:
a AND b OR c
a AND ( b OR c )
( a AND b ) OR c
consider those while we sing-along, Sesame Street style,...: "One of these things is not like the others, one of these things just doesn't belong..."
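For anyone who would rather check than sing, here is a small Python sketch that evaluates the three expressions with a = 0, b = 0, c = 1. It uses an in-memory SQLite database as a stand-in; like MySQL, SQLite gives AND higher precedence than OR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
row = conn.execute(
    "SELECT :a AND :b OR :c, :a AND (:b OR :c), (:a AND :b) OR :c",
    {"a": 0, "b": 0, "c": 1},
).fetchone()
print(row)  # (1, 0, 1)
```

The unparenthesized form agrees with the third expression, so the second one is the odd one out.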
A fourth guess... if it wasn't for the row being returned having a value of onecolumn as a random number in the IN list... if it was instead the first number in the IN list, we'd be very suspicious that the IN list actually contains a single string value that looks like a list of values, but is actually not.
The two expressions in the SELECT list look very similar, but they are very different:
SELECT t.n IN (2,3,5,7) AS n_in_list
, t.n IN ('2,3,5,7') AS n_in_string
FROM ( SELECT 2 AS n
UNION ALL SELECT 3
UNION ALL SELECT 5
) t
The first expression is comparing n to each value in a list of four values.
The second expression is equivalent to t.n IN (2).
This is a frequent trip up when neophytes are dynamically creating SQL text, thinking that they can pass in a string value and that MySQL will see the commas within the string as part of the SQL statement.
(But this doesn't explain how the row returned could match a seemingly random value in the list.)
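When the SQL text is built dynamically, the usual remedy is one placeholder per value. The Python sketch below shows the difference using the standard sqlite3 module; the table t and its contents are made up. One caveat about the stand-in: SQLite does not coerce the string '2,3,5,7' to the number 2 the way MySQL does, so the single-string variant should match nothing here rather than just the first value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(2,), (3,), (5,), (11,)])

wanted = [2, 3, 5, 7]

# One bound string: the database sees a single value, not four list items,
# so this cannot match all of 2, 3 and 5.
as_string = conn.execute(
    "SELECT n FROM t WHERE n IN (?)", (",".join(map(str, wanted)),)
).fetchall()

# One placeholder per value: the commas are part of the SQL text itself.
placeholders = ", ".join("?" * len(wanted))
as_list = conn.execute(
    f"SELECT n FROM t WHERE n IN ({placeholders}) ORDER BY n", wanted
).fetchall()

print(len(as_string), as_list)
```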
Those are all just guesses. Those are some of the most frequent trip ups we see, but we're just guessing. It could be something else entirely. In its current form, there is no definitive "answer" to the question.

Rails ActiveRecord "maximum(:column)" ignores order

I am trying to retrieve the maximum value of a column using ActiveRecord, but after I order and limit the values.
My query is:
max_value = current_user.books.order('created_at DESC').limit(365).maximum(:price)
Yet the resulting query is:
(243.0ms) SELECT MAX(`books`.`price`) AS max_id FROM `books` WHERE `books`.`user_id` = 2 LIMIT 365
The order is ignored completely and as a result the maximum value comes from the first 365 records instead of the last 365 records.
There's a curious line in the active record code (active_record/relation/calculations.rb) which removes the ordering. I say curious because it refers specifically to postgres:
# Postgresql doesn't like ORDER BY when there are no GROUP BY
relation = reorder(nil)
You should be able to use pluck to achieve what you want. It can select a single attribute which can be a reference to an aggregate function:
q = current_user.books.order('created_at DESC').limit(365)
max_value = q.pluck("max(price)").first
pluck will return an array of values so you need the first to get the first one (and only one in this case). If there are no results then it will return nil.
According to the Rails guides, maximum returns the maximum value of the field across the table, so I suppose ActiveRecord tries to optimize your query and ends up changing the order in which your chained methods take effect.
Could you try: First query the 365 rows you want, and then get the maximum?
max_value = (current_user.books.order('created_at DESC').limit(365)).maximum(:price)
I have found the solution thanks to #RubyOnRails on freenode:
max_value = current_user.books.order('created_at DESC').limit(365).pluck(:price).max
Of course the drawback is that this will grab all 365 prices and calculate the max locally. But I'll survive.
The most effective way is to use a subquery, something like this:
current_user.books.where(id: current_user.books.order('created_at DESC').limit(365)).maximum(:price)
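At the SQL level, the goal of these answers is to take the maximum over only the most recent rows, which a derived table expresses directly. The Python sketch below shows that query against an in-memory SQLite database. It is not ActiveRecord code, and the table contents and the helper name max_recent_price are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "price REAL, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO books (user_id, price, created_at) VALUES (?, ?, ?)",
    [
        (2, 99.0, "2020-01-01"),   # oldest and most expensive
        (2, 10.0, "2020-01-02"),
        (2, 30.0, "2020-01-03"),
        (2, 20.0, "2020-01-04"),
        (1, 500.0, "2020-01-05"),  # another user's book
    ],
)


def max_recent_price(conn, user_id, limit):
    """Highest price among the user's `limit` most recently created books."""
    sql = """
        SELECT MAX(price) FROM (
            SELECT price FROM books
            WHERE user_id = ?
            ORDER BY created_at DESC
            LIMIT ?
        ) AS recent
    """
    return conn.execute(sql, (user_id, limit)).fetchone()[0]


print(max_recent_price(conn, 2, 3))  # 30.0
```

Here ORDER BY and LIMIT are applied inside the derived table before MAX runs, so the oldest (and priciest) book is ignored when only the latest three are considered.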

subtraction in query (SUM)

I have this query, which produces this (correct) output:
15
44
Query:
SELECT T.numContribuinte,
T.numero,
SUM(C.valor - T.valorTotalChamadas) AS saldo
FROM telemovel T
JOIN CARREGAMENTO C ON C.numero = T.numero
GROUP BY T.numContribuinte, T.numero
HAVING saldo > 0
ORDER BY T.numero DESC
If I remove the word sum the output will be:
15
15
My question is: why does removing the SUM produce this difference in the output?
The reason for the difference is that by design, MySQL allows columns in the SELECT to not be stated in the GROUP BY or aggregate functions (MAX, MIN, COUNT, etc). The caveat to this functionality is the values returned are arbitrary -- they can't be guaranteed to be consistent every time.
The support is in line with what's dictated by ANSI, but few (SQLite only to my knowledge) support this behavior. Others require the column to either be mentioned in the GROUP BY or enclosed in an aggregate function.
When you GROUP BY some columns, you ask MySQL to take all rows with identical values in those columns, and replace those rows with only one row in the result set. MySQL needs to know from which of those many rows you want each column's value to be in the one row returned. You must use an aggregate function to describe that, like MIN to select the smallest value, MAX to select the largest value, or SUM to select the sum of all the values being replaced.
If you fail to specify an aggregate function, MySQL will take the value from any row it wants. Which row it takes the value from may be different when you run the same query more than once -- the behavior is not defined.
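A small Python sketch can make the difference visible. The data below is invented so that the totals come out as 15 and 44 like in the question (the real tables are not shown), and SQLite is used as a stand-in for MySQL. The first query lists the per-row differences inside each group; the second sums them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE telemovel (
        numContribuinte INTEGER, numero INTEGER, valorTotalChamadas INTEGER
    );
    CREATE TABLE carregamento (numero INTEGER, valor INTEGER);
    INSERT INTO telemovel VALUES (100, 1, 5), (200, 2, 5);
    INSERT INTO carregamento VALUES (1, 20), (2, 20), (2, 34);
""")

# Every joined row before grouping: numero 2 contributes two rows.
per_row = conn.execute("""
    SELECT T.numero, C.valor - T.valorTotalChamadas
    FROM telemovel T
    JOIN carregamento C ON C.numero = T.numero
    ORDER BY T.numero, C.valor
""").fetchall()

# With SUM, all rows of a group are combined into one value.
summed = conn.execute("""
    SELECT T.numContribuinte, T.numero,
           SUM(C.valor - T.valorTotalChamadas) AS saldo
    FROM telemovel T
    JOIN carregamento C ON C.numero = T.numero
    GROUP BY T.numContribuinte, T.numero
    HAVING SUM(C.valor - T.valorTotalChamadas) > 0
    ORDER BY T.numero
""").fetchall()

print(per_row)  # [(1, 15), (2, 15), (2, 29)]
print(summed)   # [(100, 1, 15), (200, 2, 44)]
```

Without SUM, the group for numero 2 still has to collapse to a single row, so the database reports the difference from just one of its rows (15 or 29 in this data), and which one is not defined.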