Rails ActiveRecord "maximum(:column)" ignores order - mysql

I am trying to retrieve the maximum value of a column using ActiveRecord, but after I order and limit the values.
My query is:
max_value = current_user.books.order('created_at DESC').limit(365).maximum(:price)
Yet the resulting query is:
(243.0ms) SELECT MAX(`books`.`price`) AS max_id FROM `books` WHERE `books`.`user_id` = 2 LIMIT 365
The order is ignored completely and as a result the maximum value comes from the first 365 records instead of the last 365 records.

There's a curious line in the active record code (active_record/relation/calculations.rb) which removes the ordering. I say curious because it refers specifically to postgres:
# Postgresql doesn't like ORDER BY when there are no GROUP BY
relation = reorder(nil)
You should be able to use pluck to achieve what you want. It can select a single attribute which can be a reference to an aggregate function:
q = current_user.books.order('created_at DESC').limit(365)
max_value = q.pluck("max(price)").first
pluck will return an array of values so you need the first to get the first one (and only one in this case). If there are no results then it will return nil.

According to the Rails guides, maximum returns the maximum value of your table for this field, so I suppose Active Record tries to optimize your query and ends up disregarding the order in which your chained methods are meant to execute.
Could you try: First query the 365 rows you want, and then get the maximum?
max_value = (current_user.books.order('created_at DESC').limit(365)).maximum(:price)

I have found the solution thanks to #RubyOnRails on freenode:
max_value = current_user.books.order('created_at DESC').limit(365).pluck(:price).max
Of course the drawback is that this will grab all 365 prices and calculate the max locally. But I'll survive.

The best and most efficient way is to use a subquery. Do something like this:
current_user.books.where(id: current_user.books.order('created_at DESC').limit(365)).maximum(:price)
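At the SQL level, the subquery idea can be sketched as a derived table: pick the latest rows first, then aggregate over them. This is a minimal sketch, not the query Rails generates. It uses Python's built-in sqlite3 as a stand-in for MySQL, the data is made up, and the helper name max_price_of_latest is hypothetical. A derived table is used because MySQL has historically rejected LIMIT inside an IN (...) subquery.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, user_id INTEGER, price REAL, created_at INTEGER)"
)
# Hypothetical data: the oldest book (created_at = 1) is the most expensive.
rows = [(2, 500.0, 1), (2, 10.0, 2), (2, 30.0, 3), (2, 20.0, 4)]
conn.executemany("INSERT INTO books (user_id, price, created_at) VALUES (?, ?, ?)", rows)


def max_price_of_latest(conn, user_id, n):
    """MAX(price) over the n most recently created books of one user."""
    sql = """
        SELECT MAX(price) FROM (
            SELECT price FROM books
            WHERE user_id = ?
            ORDER BY created_at DESC
            LIMIT ?
        ) AS latest
    """
    return conn.execute(sql, (user_id, n)).fetchone()[0]


# LIMIT applied to the aggregate query only limits its single result row,
# so the maximum is still taken over every book of the user.
naive = conn.execute(
    "SELECT MAX(price) FROM books WHERE user_id = ? LIMIT 3", (2,)
).fetchone()[0]
latest_three = max_price_of_latest(conn, 2, 3)
```

With this data, naive sees the expensive old book, while latest_three only considers the three newest rows.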

Related

MySQL Select Results Excluding Outliers Using AVG and STD Conditions

I'm trying to write a query that excludes values beyond 6 standard deviations from the mean of the result set. I expect this can be done elegantly with a subquery, but I'm getting nowhere and in every similar case I've read the aim seems to be just a little different. My result set seems to get limited to a single row, I'm guessing due to calling the aggregate functions. Conceptually, this is what I'm after:
SELECT t.Result FROM
(SELECT Result, AVG(Result) avgr, STD(Result) stdr
FROM myTable WHERE myField=myCondition limit=75) as t
WHERE t.Result BETWEEN (t.avgr-6*t.stdr) AND (t.avgr+6*t.stdr)
I can get it to work by replacing each use of the STD or AVG value (i.e. t.avgr) with its own select statement, as in:
(SELECT AVG(Result) FROM myTable WHERE myField=myCondition limit=75)
However, this seems way messier than I expect it needs to be (I've a few conditions). At first I thought specifying a HAVING clause was necessary, but as I learn more it doesn't seem to be quite what I'm after. Am I close? Is there some snazzy way to access the value of aggregate functions for use in conditions (without needing to return the aggregate values)?
Yes, your subquery is an aggregate query with no GROUP BY clause, therefore its result is a single row. When you select from that, you cannot get more than one row. Moreover, it is a MySQL extension that you can include the Result field in the subquery's selection list at all, as it is neither a grouping column nor an aggregate function of the groups (so what does it even mean in that context unless, possibly, all the relevant column values are the same?).
You should be able to do something like this to compute the average and standard deviation once, together, instead of per-result:
SELECT t.Result FROM
myTable AS t
CROSS JOIN (
SELECT AVG(Result) avgr, STD(Result) stdr
FROM myTable
WHERE myField = myCondition
) AS stats
WHERE
t.myField = myCondition
AND t.Result BETWEEN (stats.avgr-6*stats.stdr) AND (stats.avgr+6*stats.stdr)
LIMIT 75
Note that you will want to be careful that the statistics are computed over the same set of rows that you are selecting from, hence the duplication of the myField = myCondition predicate, and also the move of the LIMIT clause to the outer query only.
You can add more statistics to the aggregate subquery, provided that they are all computed over the same set of rows, or you can join additional statistics computed over different rows via a separate subquery. Do ensure that all your statistics subqueries return exactly one row each, else you will get duplicate (or no) results.
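The cross-join pattern above can be tried out with a small sketch. It runs on Python's built-in sqlite3 rather than MySQL, so a few things are assumptions: SQLite has no STD(), so the variance is computed as AVG(x*x) - AVG(x)^2 and squared deviations are compared instead; the sample data is made up; and the multiplier K is 2 rather than 6 so that a ten-row sample actually contains an outlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (myField TEXT, Result REAL)")
# Hypothetical readings; 100 is the obvious outlier.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
conn.executemany("INSERT INTO myTable VALUES ('a', ?)", [(v,) for v in values])

K = 2  # the question uses 6; 2 keeps this tiny sample illustrative

# The one-row stats subquery is joined to every candidate row, so the
# aggregates are computed once and are available to the WHERE clause.
sql = """
    SELECT t.Result
    FROM myTable AS t
    CROSS JOIN (
        SELECT AVG(Result) AS avgr,
               AVG(Result * Result) - AVG(Result) * AVG(Result) AS varr
        FROM myTable
        WHERE myField = :cond
    ) AS stats
    WHERE t.myField = :cond
      AND (t.Result - stats.avgr) * (t.Result - stats.avgr) <= :k * :k * stats.varr
"""
kept = [row[0] for row in conn.execute(sql, {"cond": "a", "k": K})]
```

Comparing squared deviations against K*K times the variance is equivalent to the BETWEEN test on avgr plus or minus K standard deviations, and avoids needing a square-root function.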
I created a UDF that doesn't calculate exactly the way you asked (it discards a percent of the results from the top and bottom, instead of using std), but it might be useful for you (or someone else) anyway. It matches the Excel function referenced here: https://support.office.com/en-us/article/trimmean-function-d90c9878-a119-4746-88fa-63d988f511d3
https://github.com/StirlingMarketingGroup/mysql-trimmean
Usage
`trimmean` ( `NumberColumn`, double `Percent` [, integer `Decimals` = 4 ] )
`NumberColumn`
The column of values to trim and average.
`Percent`
The fractional number of data points to exclude from the calculation. For example, if percent = 0.2, 4 points are trimmed from a data set of 20 points (20 x 0.2): 2 from the top and 2 from the bottom of the set.
`Decimals`
Optionally, the number of decimal places to output. Default is 4.
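To make the trimming rule concrete, here is a rough Python sketch of a trimmed mean. It is not the UDF's source code: the function below is a hypothetical re-implementation, and the rule that floor(n * percent / 2) points are dropped from each end is an assumption chosen to match the "20 points, percent = 0.2" example in the usage text. The Decimals rounding parameter is left out.

```python
import math


def trimmean(values, percent):
    """Mean of the values after dropping an equal number of points from each end.

    Assumption: floor(n * percent / 2) points are dropped per side.
    """
    if not 0 <= percent < 1:
        raise ValueError("percent must be in [0, 1)")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    per_side = math.floor(len(ordered) * percent / 2)
    kept = ordered[per_side:len(ordered) - per_side]
    return sum(kept) / len(kept)


# 20 points with percent = 0.2: two points are dropped from each end,
# leaving the values 3 through 18.
result = trimmean(range(1, 21), 0.2)
```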

sorting a query result from DB by custom function

I want to return a sorted list from DB. The function that I want to use might look like
(field1_value * w1 + field2_value * w2) / ( 1 + currentTime-createTime(field3_value))
It is for sort by popularity feature of my application.
I wonder how other people do this kind of sorting in a DB (say MySQL).
I'm going to implement this in django eventually, but any comment on general direction/strategy to achieve this is most welcomed.
Do I define a function and calculate the score for rows on every request?
Do I set aside a field for this score and calculate the score at a regular interval?
Or does using time as a variable of the sorting function look bad?
How do other sites implement 'sort by popularity'?
I put the time variable because I wanted newer posts get more attention.
Do I define a function and calculate score for rows for every requests?
You could, but it's not necessary: you can simply provide that expression to your ORDER BY clause. Note that the current time has to stay in the denominator: it sits inside a division, so dropping it would change the order of the results.
ORDER BY (field1 * w1 + field2 * w2) / (1 + UNIX_TIMESTAMP() - UNIX_TIMESTAMP(field3)) DESC
Alternatively, if your query is selecting such a rating you can merely ORDER BY the aliased column name:
ORDER BY rating DESC
Do I set aside a field for this score and calculate score in a regular interval?
Because the score depends on the current time, a stored value goes stale and would have to be recalculated at a regular interval. In exchange, if you were to store the result of the above expression in its own field, then performing the ORDER BY operation would be very much faster (especially if that new field were suitably indexed).
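A small sketch of ordering directly by the question's formula, (field1 * w1 + field2 * w2) / (1 + currentTime - createTime). It uses Python's built-in sqlite3 instead of MySQL or Django. The table name posts, the column created_at, the weights, and the helper ranked_ids are all hypothetical, and the current time is passed in as a parameter so the behaviour is repeatable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, field1 REAL, field2 REAL, created_at INTEGER)"
)
# Hypothetical posts; created_at is a Unix-style timestamp in seconds.
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [(1, 10, 0, 0), (2, 5, 0, 999), (3, 10, 0, 990)],
)


def ranked_ids(conn, now, w1=1.0, w2=2.0):
    """Post ids ordered by (field1*w1 + field2*w2) / (1 + now - created_at), best first."""
    sql = """
        SELECT id FROM posts
        ORDER BY (field1 * :w1 + field2 * :w2) / (1 + :now - created_at) DESC
    """
    return [row[0] for row in conn.execute(sql, {"w1": w1, "w2": w2, "now": now})]


order_now = ranked_ids(conn, now=1000)      # shortly after the newest post
order_later = ranked_ids(conn, now=100000)  # long after all posts
```

Shortly after posting, the fresh post 2 leads despite its lower raw score. Much later, the age differences barely matter and the raw score dominates, so the ranking produced by this formula shifts as time passes.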

Two similar MySQL queries give different results

I have a database that holds readings for devices. I am trying to write a query that can select the latest reading from a device. I have two queries that are seemingly the same and that I'd expect to give the same results; however they do not. The queries are as follows:
First query:
select max(datetime), reading
from READINGS
where device_id = '1234567890'
Second query:
select datetime, reading
from READINGS
where device_id = '1234567890' and datetime = (select max(datetime)
from READINGS
where device_id = '1234567890')
They both give different results for the reading attribute. The second one gives the right result, but why does the first give something different?
This is MySQL behaviour at work. When you use grouping, the columns you select must either appear in the GROUP BY or be wrapped in an aggregate function, e.g. min(), max(). Mixing aggregates and normal columns is not allowed in most other database flavours.
The first query will just return the reading from an arbitrary row in the group (often the first in the sense of where it appears on the file system), which is most likely wrong.
The second query correlates the reading with the maximum timestamp, leading to the correct result.
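The second query's pattern can be checked with a minimal sketch. It uses Python's built-in sqlite3 with made-up rows, and the helper name latest_reading is hypothetical. SQLite treats bare columns next to MAX() differently from MySQL, so only the correlated form is shown here, not the first query's behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE READINGS (device_id TEXT, datetime TEXT, reading REAL)")
# Hypothetical rows, deliberately inserted out of chronological order.
conn.executemany(
    "INSERT INTO READINGS VALUES (?, ?, ?)",
    [
        ("1234567890", "2024-01-01 00:00:00", 3.0),
        ("1234567890", "2024-01-03 00:00:00", 7.5),
        ("1234567890", "2024-01-02 00:00:00", 5.0),
        ("other", "2024-01-04 00:00:00", 9.9),
    ],
)


def latest_reading(conn, device_id):
    """Rows of one device whose datetime equals that device's maximum datetime."""
    sql = """
        SELECT datetime, reading FROM READINGS
        WHERE device_id = :dev
          AND datetime = (SELECT MAX(datetime) FROM READINGS WHERE device_id = :dev)
    """
    return conn.execute(sql, {"dev": device_id}).fetchall()


latest = latest_reading(conn, "1234567890")
```

The reading comes from the same row as the maximum datetime because the subquery only supplies the timestamp to match against; the reading column is never mixed into the aggregate.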
It is because you are not using a GROUP BY reading clause, which you should be using in both queries.
This is normal on MySQL. See the documentation on this:
If you use a group function in a statement containing no GROUP BY clause, it is equivalent to grouping on all rows.
Also, read http://dev.mysql.com/doc/refman/5.0/en/group-by-hidden-columns.html
You can use the EXPLAIN and EXPLAIN EXTENDED commands to learn more about your queries.

What's wrong with this MySQL query

I have the following SQL query. It seems to run OK, but I am concerned that it may not perform as expected as my site grows. I would like some feedback on how effective and efficient this query really is:
select * from articles where category_id=XX AND city_id=XXX GROUP BY user_id ORDER BY created_date DESC LIMIT 10;
Basically, what I am trying to achieve is to get the newest articles by created_date, limited to 10. Articles must only be selected if the following criteria are met:
City ID must equal the given value
Category ID must equal the given value
Only one article per user must be returned
Articles must be sorted by date and only the top 10 latest articles must be returned
You've got a GROUP BY clause which only contains one column, but you are pulling all the columns there are without aggregating them. Do you realise that the values returned for the columns not specified in GROUP BY and not aggregated are not guaranteed?
You are also referencing such a column in the ORDER BY clause. Since the values of that column aren't guaranteed, you have no guarantee what rows are going to be returned with subsequent invocations of this script even in the absence of changes to the underlying table.
So, I would at least change the ORDER BY clause to something like this:
ORDER BY MAX(created_date)
or this:
ORDER BY MIN(created_date)
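One way to get deterministic whole rows for the criteria in the question is to join each article to its user's latest created_date. This is a sketch under assumptions, not a tuned query: it runs on Python's built-in sqlite3 rather than MySQL, the id column and the sample rows are made up, and a user with two articles sharing the same latest created_date would still appear twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "category_id INTEGER, city_id INTEGER, created_date TEXT)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 5, 9, "2024-01-01"),
        (2, 1, 5, 9, "2024-01-05"),
        (3, 2, 5, 9, "2024-01-03"),
        (4, 2, 6, 9, "2024-01-09"),  # different category, must be ignored
    ],
)

# The grouped subquery finds each user's newest date within the filter;
# the join then picks the full article row that carries that date.
sql = """
    SELECT a.id, a.user_id, a.created_date
    FROM articles AS a
    JOIN (
        SELECT user_id, MAX(created_date) AS latest
        FROM articles
        WHERE category_id = :cat AND city_id = :city
        GROUP BY user_id
    ) AS newest ON newest.user_id = a.user_id AND newest.latest = a.created_date
    WHERE a.category_id = :cat AND a.city_id = :city
    ORDER BY a.created_date DESC
    LIMIT 10
"""
rows = conn.execute(sql, {"cat": 5, "city": 9}).fetchall()
```

Every selected column now comes from one specific row per user, so repeated runs return the same articles as long as the table is unchanged.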
Some potential improvements (for best query performance):
Make sure you have an index on all columns you query. Note: check whether you really need an index on every column, because indexes have a negative performance impact whenever the DB has to build the index. For more details, take a look here: http://dev.mysql.com/doc/refman/5.1/en/optimization-indexes.html
SELECT * selects all columns of the table. SELECT only the ones you really require.

SQL combine COUNT and AVG query with SELECT

I need to get the average rating and the total number of ratings for a particular user and then select all single ratings (rating_value, rating_text, creator) as well:
$rating_query = mysql_query("SELECT COUNT(1) as rating_count
,AVG(rating_value), rating_value, rating_text, creator
FROM user_rating WHERE rated_user = $user_id");
This query would return the COUNT(1) result and the AVG(rating_value) for every row, but I only need those values once.
Is there any way to do this without making 2 separate queries?
There may be a trick I'm not aware of, but I don't think that's possible to do in a single query. You could try using a GROUP BY clause if that would make sense for you, but I'm guessing it probably doesn't from the column names you're using. Any relation requires a single atomic value at any given row and column, even if that value is null. What you are requesting is that columns 1 and 2 in every row but the first have no value, and again I don't think this is possible.
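For comparison, a sketch of what a single statement can do: joining a one-row aggregate subquery attaches the count and average to every rating row. This is consistent with the point above, since the aggregate values are repeated on each row rather than returned once. It uses Python's built-in sqlite3 with made-up data, and the aliases rating_avg and stats are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_rating (rated_user INTEGER, rating_value REAL, "
    "rating_text TEXT, creator TEXT)"
)
conn.executemany(
    "INSERT INTO user_rating VALUES (?, ?, ?, ?)",
    [(7, 3, "ok", "ann"), (7, 5, "great", "bob"), (7, 4, "good", "cy"), (8, 1, "bad", "dee")],
)

# The stats subquery yields exactly one row, so the cross join adds two
# columns to each rating row without multiplying the rows.
sql = """
    SELECT stats.rating_count, stats.rating_avg,
           r.rating_value, r.rating_text, r.creator
    FROM user_rating AS r
    CROSS JOIN (
        SELECT COUNT(*) AS rating_count, AVG(rating_value) AS rating_avg
        FROM user_rating
        WHERE rated_user = :uid
    ) AS stats
    WHERE r.rated_user = :uid
    ORDER BY r.creator
"""
rows = conn.execute(sql, {"uid": 7}).fetchall()
```

The application can read the count and average from the first row and ignore the repeated copies on the rest.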