Sorting a query result from the DB by a custom function - MySQL

I want to return a sorted list from the DB. The function I want to sort by might look like:
(field1_value * w1 + field2_value * w2) / (1 + currentTime - createTime(field3_value))
It is for the sort-by-popularity feature of my application.
I wonder how other people do this kind of sorting in the DB (say, MySQL).
I'm going to implement this in Django eventually, but any comment on the general direction/strategy to achieve this is most welcome.
Do I define a function and calculate the score for every row on every request?
Do I set aside a field for this score and recalculate it at a regular interval?
Or does using time as a variable of the sorting function look bad?
How do other sites implement 'sort by popularity'?
I put in the time variable because I wanted newer posts to get more attention.

Do I define a function and calculate the score for every row on every request?
You could, but it's not necessary: you can simply provide that expression in your ORDER BY clause (the constant 1 + currentTime part of the denominator doesn't affect the order of the results, so I have removed it):
ORDER BY (field1 * w1 + field2 * w2) / UNIX_TIMESTAMP(field3) DESC
Alternatively, if your query is selecting such a rating you can merely ORDER BY the aliased column name:
ORDER BY rating
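For example (a minimal sketch, assuming a posts table, illustrative weights 0.7 and 0.3, and a TIMESTAMP column field3 holding the creation time):
SELECT id,
       (field1 * 0.7 + field2 * 0.3) / UNIX_TIMESTAMP(field3) AS rating
FROM posts
ORDER BY rating DESC;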
Do I set aside a field for this score and recalculate it at a regular interval?
I don't know why you would need to recalculate at a regular interval (as mentioned above, the constant part of the denominator has no effect on the order of results), but if you were to store the result of the above expression in its own field, then performing the ORDER BY operation would be very much faster (especially if that new field were suitably indexed).
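If you went that route, the setup might look like this (a sketch; the column name, index name, and weights are illustrative):
ALTER TABLE posts ADD COLUMN popularity_score DOUBLE NOT NULL DEFAULT 0;
CREATE INDEX idx_posts_popularity ON posts (popularity_score);

-- run this from a scheduled job (cron, or a MySQL EVENT) at whatever interval suits you
UPDATE posts
SET popularity_score = (field1 * 0.7 + field2 * 0.3) / UNIX_TIMESTAMP(field3);

SELECT id FROM posts ORDER BY popularity_score DESC LIMIT 20;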


Usage of weighting in pure SQL

My front-end (SourcePawn) currently does the following:
float fPoints = 0.0;
float fWeight = 1.0;
while (results.FetchRow())
{
    fPoints += (results.FetchFloat(0) * fWeight);
    fWeight *= 0.95;
}
In case you don't understand this code, it goes through the resultset of this query:
SELECT points FROM table WHERE auth = 'authentication_id' AND points > 0.0 ORDER BY points DESC;
The resultset is floating numbers, sorted by points from high to low.
My front-end takes 100% of the first row, then 95% of the second, and the weight drops by a further 5% (multiplicatively) for each subsequent row. It all adds up into fPoints, which is my 'sum' variable.
What I'm looking for is a way to replicate this code in pure SQL and receive the sum (called fPoints in my front-end), so I will be able to run it for a table that has over 10,000 rows in one query instead of 10,000.
I'm very lost. I don't know where to start and guidance of any kind would be very nice.
You can do this using variables:
SELECT points,
       (points * (@f := 0.95 * @f) / 0.95) as fPoints
FROM table t CROSS JOIN
     (SELECT @f := 1.0) params
WHERE auth = 'authentication_id' AND points > 0.0
ORDER BY points DESC;
A note about the calculation: the value of @f starts at 1. Because we are dealing with variables, the assignment and the use of the variable need to be in the same expression, since MySQL does not guarantee the order of evaluation of expressions.
So, the 0.95 * @f reduces the value by 5%. However, that is the weight for the next iteration. The / 0.95 undoes that to get the right value for the current iteration.
While I'm glad the answer Gordon Linoff provides works for you, you should understand it's quite specific. ORDER BY, per the SQL standard, has no effect on how a query is processed, and SQL does not recognize "iteration" in a SELECT statement. So the idea of "reducing a variable on each iteration", where the iteration order is governed by ORDER BY, has no basis in standard SQL. You might want to check whether it's guaranteed by MySQL, just for your own edification.
To achieve the effect you want in a standard way, proceed as follows.
Create a table Percentiles( Percentile int not null, Factor float not null )
Populate that table with your factors (20 rows).
Write a view or CTE that ranks your points in descending order. Let us call the rank column rank.
Then join your view to Percentiles:
SELECT auth, sum(points * factor) as weight
FROM "your view" as t join percentiles as p
     ON t.rank = percentile
WHERE points > 0.0
GROUP BY auth
That query is simple, and its intent obvious. It might even be faster. Most important, it will definitely work, and doesn't depend on any idiosyncrasies of your current DBMS.
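On MySQL 8+, the ranking view could be a CTE using ROW_NUMBER(); here is a sketch of the whole approach (table and column names follow the question, the rank column is spelled rnk because RANK is a reserved word in MySQL 8, and the Percentiles rows are assumed to hold 1.0, 0.95, 0.9025, ... as Factor values):
WITH ranked AS (
    SELECT auth, points,
           ROW_NUMBER() OVER (PARTITION BY auth ORDER BY points DESC) AS rnk
    FROM `table`
    WHERE points > 0.0
)
SELECT r.auth, SUM(r.points * p.Factor) AS weight
FROM ranked AS r
JOIN Percentiles AS p ON p.Percentile = r.rnk
GROUP BY r.auth;
Note that the inner join also caps the sum at however many factor rows you populated.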

MySQL Select Results Excluding Outliers Using AVG and STD Conditions

I'm trying to write a query that excludes values beyond 6 standard deviations from the mean of the result set. I expect this can be done elegantly with a subquery, but I'm getting nowhere and in every similar case I've read the aim seems to be just a little different. My result set seems to get limited to a single row, I'm guessing due to calling the aggregate functions. Conceptually, this is what I'm after:
SELECT t.Result FROM
    (SELECT Result, AVG(Result) avgr, STD(Result) stdr
     FROM myTable WHERE myField = myCondition LIMIT 75) as t
WHERE t.Result BETWEEN (t.avgr - 6*t.stdr) AND (t.avgr + 6*t.stdr)
I can get it to work by replacing each use of the STD or AVG value (i.e. t.avgr) with its own select statement:
(SELECT AVG(Result) FROM myTable WHERE myField = myCondition LIMIT 75)
However this seems way more messy than it needs to be (I have a few conditions). At first I thought specifying a HAVING clause was necessary, but as I learn more it doesn't seem to be quite what I'm after. Am I close? Is there some snazzy way to access the value of aggregate functions for use in conditions (without needing to return the aggregate values)?
Yes, your subquery is an aggregate query with no GROUP BY clause, therefore its result is a single row. When you select from that, you cannot get more than one row. Moreover, it is a MySQL extension that you can include the Result field in the subquery's selection list at all, as it is neither a grouping column nor an aggregate function of the groups (so what does it even mean in that context unless, possibly, all the relevant column values are the same?).
You should be able to do something like this to compute the average and standard deviation once, together, instead of per-result:
SELECT t.Result
FROM myTable AS t
CROSS JOIN (
    SELECT AVG(Result) avgr, STD(Result) stdr
    FROM myTable
    WHERE myField = myCondition
) AS stats
WHERE t.myField = myCondition
  AND t.Result BETWEEN (stats.avgr - 6 * stats.stdr) AND (stats.avgr + 6 * stats.stdr)
LIMIT 75
Note that you will want to be careful that the statistics are computed over the same set of rows that you are selecting from, hence the duplication of the myField = myCondition predicate, and also the relocation of the LIMIT clause to the outer query only.
You can add more statistics to the aggregate subquery, provided that they are all computed over the same set of rows, or you can join additional statistics computed over different rows via a separate subquery. Do ensure that all your statistics subqueries return exactly one row each, else you will get duplicate (or no) results.
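For instance, a second set of statistics over different rows would join in as its own one-row subquery (a sketch; myOtherCondition and the final comparison are hypothetical, just to show the shape):
SELECT t.Result
FROM myTable AS t
CROSS JOIN (
    SELECT AVG(Result) avgr, STD(Result) stdr
    FROM myTable WHERE myField = myCondition
) AS stats
CROSS JOIN (
    SELECT AVG(Result) other_avgr
    FROM myTable WHERE myField = myOtherCondition
) AS stats2
WHERE t.myField = myCondition
  AND t.Result BETWEEN (stats.avgr - 6 * stats.stdr) AND (stats.avgr + 6 * stats.stdr)
  AND t.Result <= stats2.other_avgr
LIMIT 75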
I created a UDF that doesn't calculate exactly the way you asked (it discards a percentage of the results from the top and bottom instead of using the standard deviation), but it might be useful for you (or someone else) anyway. It matches the Excel TRIMMEAN function referenced here: https://support.office.com/en-us/article/trimmean-function-d90c9878-a119-4746-88fa-63d988f511d3
https://github.com/StirlingMarketingGroup/mysql-trimmean
Usage
`trimmean` ( `NumberColumn`, double `Percent` [, integer `Decimals` = 4 ] )
`NumberColumn`
The column of values to trim and average.
`Percent`
The fractional number of data points to exclude from the calculation. For example, if percent = 0.2, 4 points are trimmed from a data set of 20 points (20 x 0.2): 2 from the top and 2 from the bottom of the set.
`Decimals`
Optionally, the number of decimal places to output. Default is 4.
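A hypothetical call against the question's table (the UDF has to be compiled and installed first; it is an aggregate function, so it returns one trimmed mean per group):
SELECT `trimmean`(Result, 0.2) FROM myTable WHERE myField = myCondition;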

Rails ActiveRecord "maximum(:column)" ignores order

I am trying to retrieve the maximum value of a column using ActiveRecord, but after I order and limit the values.
My query is:
max_value = current_user.books.order('created_at DESC').limit(365).maximum(:price)
Yet the resulting query is:
(243.0ms) SELECT MAX(`books`.`price`) AS max_id FROM `books` WHERE `books`.`user_id` = 2 LIMIT 365
The order is ignored completely and as a result the maximum value comes from the first 365 records instead of the last 365 records.
There's a curious line in the ActiveRecord code (active_record/relation/calculations.rb) which removes the ordering. I say curious because it refers specifically to Postgres:
# Postgresql doesn't like ORDER BY when there are no GROUP BY
relation = reorder(nil)
You should be able to use pluck to achieve what you want. It can select a single attribute which can be a reference to an aggregate function:
q = current_user.books.order('created_at DESC').limit(365)
max_value = q.pluck("max(price)").first
pluck will return an array of values, so you need first to get the first (and, in this case, only) one. If there are no results then it will return nil.
According to the Rails guides, maximum returns the maximum value of your table for this field, so I suppose ActiveRecord tries to optimize your query and ends up messing up the order in which your chained methods are applied.
Could you try the following: first query the 365 rows you want, and then get the maximum?
max_value = (current_user.books.order('created_at DESC').limit(365)).maximum(:price)
I have found the solution thanks to #RubyOnRails on freenode:
max_value = current_user.books.order('created_at DESC').limit(365).pluck(:price).max
Of course the drawback is that this will grab all 365 prices and calculate the max locally. But I'll survive.
The best and most effective way is to use a subquery; do something like this:
current_user.books.where(id: current_user.books.order('created_at DESC').limit(365)).maximum(:price)
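For reference, the raw SQL equivalent looks roughly like this (a sketch using the question's table; a derived-table join is used because older MySQL versions reject LIMIT inside an IN subquery):
SELECT MAX(b.price)
FROM books AS b
JOIN (
    SELECT id FROM books
    WHERE user_id = 2
    ORDER BY created_at DESC
    LIMIT 365
) AS latest ON latest.id = b.id;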

Please explain this MySQL query.

SELECT * FROM dogs order by rand(dayofyear(CURRENT_DATE)) LIMIT 1
It seems to me that it orders the table by a random number, and this number changes every day. This is a guess, as it'll take me a day to find out if this is true!
How can I change this query to order a database by a new random number every minute rather than every day? I tried this:
SELECT * FROM dogs order by rand(minuteofhour(CURRENT_DATE)) LIMIT 1
but it didn't work :(
Thanks for your time!
A random number generator (RNG) usually needs a 'seed value', a value that is used to generate the random numbers. If the seed value is always the same, the sequence of random numbers is always the same. Here the seed is DAYOFYEAR(CURRENT_DATE), which explains why the order changes every day.
The easiest way to solve your problem (change it to every minute) is to find a seed value that changes every minute. A good one would be ROUND(UNIX_TIMESTAMP()/60).
SELECT * FROM dogs order by rand(ROUND(UNIX_TIMESTAMP()/60)) LIMIT 1
I am not good at MySQL, but are you sure there is a MINUTEOFHOUR() function in MySQL?
The idea of the query is to pick a random record from the database.
You can do this by:
SELECT * FROM dogs order by rand(20) LIMIT 1
It will order by a random sequence seeded with the constant 20, so the 'random' order will be the same on every run.
Use a combination of MySQL's MINUTE() and NOW() functions: NOW() returns the current date and time, and MINUTE() extracts the minute value from it.
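Putting that together (a sketch; note the seed repeats every hour, since MINUTE() only cycles through 0-59):
SELECT * FROM dogs ORDER BY RAND(MINUTE(NOW())) LIMIT 1;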

MySQL Data - Best way to implement paging?

My iPhone application connects to my PHP web service to retrieve data from a MySQL database, a request can return up to 500 results.
What is the best way to implement paging and retrieve 20 items at a time?
Let's say I receive the first 20 entries from my database, how can I now request the next 20 entries?
From the MySQL documentation:
The LIMIT clause can be used to constrain the number of rows returned by the SELECT statement. LIMIT takes one or two numeric arguments, which must both be nonnegative integer constants (except when using prepared statements).
With two arguments, the first argument specifies the offset of the first row to return, and the second specifies the maximum number of rows to return. The offset of the initial row is 0 (not 1):
SELECT * FROM tbl LIMIT 5,10; # Retrieve rows 6-15
To retrieve all rows from a certain offset up to the end of the result set, you can use some large number for the second parameter. This statement retrieves all rows from the 96th row to the last:
SELECT * FROM tbl LIMIT 95,18446744073709551615;
With one argument, the value specifies the number of rows to return from the beginning of the result set:
SELECT * FROM tbl LIMIT 5; # Retrieve first 5 rows
In other words, LIMIT row_count is equivalent to LIMIT 0, row_count.
For 500 records efficiency is probably not an issue, but if you have millions of records then it can be advantageous to use a WHERE clause to select the next page:
SELECT *
FROM yourtable
WHERE id > 234374
ORDER BY id
LIMIT 20
The "234374" here is the id of the last record from the prevous page you viewed.
This will enable an index on id to be used to find the first record. If you use LIMIT offset, 20 you could find that it gets slower and slower as you page towards the end. As I said, it probably won't matter if you have only 200 records, but it can make a difference with larger result sets.
Another advantage of this approach is that if the data changes between the calls you won't miss records or get a repeated record. This is because adding or removing a row means that the offset of all the rows after it changes. In your case it's probably not important - I guess your pool of adverts doesn't change too often and anyway no-one would notice if they get the same ad twice in a row - but if you're looking for the "best way" then this is another thing to keep in mind when choosing which approach to use.
If you do wish to use LIMIT with an offset (and this is necessary if a user navigates directly to page 10000 instead of paging through pages one by one) then you could read this article about late row lookups to improve performance of LIMIT with a large offset.
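The late-row-lookup trick from that article looks roughly like this (a sketch: the inner query pages over the indexed id column alone, and the join then fetches the full rows for just that page):
SELECT t.*
FROM yourtable AS t
JOIN (
    SELECT id
    FROM yourtable
    ORDER BY id
    LIMIT 200000, 20
) AS page ON page.id = t.id
ORDER BY t.id;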
Define the OFFSET for the query. For example:
page 1 - (records 01-10): offset = 0, limit = 10;
page 2 - (records 11-20): offset = 10, limit = 10;
and use the following query:
SELECT column FROM table LIMIT {someLimit} OFFSET {someOffset};
example for page 2:
SELECT column FROM table
LIMIT 10 OFFSET 10;
There's literature about it:
Optimized Pagination using MySQL, which makes the distinction between counting the total number of rows and paginating.
Efficient Pagination Using MySQL, by Yahoo Inc. at the Percona Performance Conference 2009. The Percona MySQL team also provides it as a YouTube video: Efficient Pagination Using MySQL (video).
The main problem comes with the use of large OFFSETs. These sources avoid OFFSET with a variety of techniques, ranging from id range selections in the WHERE clause to some kind of caching or pre-computed pages.
There are suggested solutions at Use the INDEX, Luke:
"Paging Through Results".
"Pagination done the right way".
This tutorial shows a great way to do pagination.
Efficient Pagination Using MySQL
In short, avoid using OFFSET or large LIMITs.
You can also do:
SELECT SQL_CALC_FOUND_ROWS * FROM tbl limit 0, 20
The row count of the select statement (without the LIMIT) is captured in the same select statement, so you don't need to query the table size again.
You get the row count using SELECT FOUND_ROWS();
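Run in sequence, that looks like this (a sketch):
SELECT SQL_CALC_FOUND_ROWS * FROM tbl LIMIT 0, 20;
SELECT FOUND_ROWS(); -- the row count the first statement would have produced without the LIMIT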
Query 1: SELECT * FROM yourtable WHERE id > 0 ORDER BY id LIMIT 500
Query 2: SELECT * FROM tbl LIMIT 0,500;
Query 1 runs faster with small or medium record counts; once the number of records reaches 5,000 or higher, the results are similar.
Result for 500 records:
Query 1 took 9.9999904632568 milliseconds
Query 2 took 19.999980926514 milliseconds
Result for 8,000 records:
Query 1 took 129.99987602234 milliseconds
Query 2 took 160.00008583069 milliseconds
Here's how I'm solving this problem using node.js and a MySQL database.
First, let's declare our variables!
const Key = payload.Key,
      NumberToShowPerPage = payload.NumberToShowPerPage,
      Offset = payload.PageNumber * NumberToShowPerPage;
NumberToShowPerPage is obvious; the offset is the page number multiplied by the number of items to show per page.
Now the SQL query...
pool.query("SELECT * FROM TableName WHERE Key = ? ORDER BY CreatedDate DESC LIMIT ? OFFSET ?", [Key, NumberToShowPerPage, Offset], (err, rows, fields) => {});
I'll break this down a bit.
pool is a pool of MySQL connections. It comes from the mysql Node package; you can create a connection pool using mysql.createPool.
The ?s are replaced by the variables in the array [Key, NumberToShowPerPage, Offset] in sequential order. This is done to prevent SQL injection.
See the () => {} at the end? That's an arrow function. Whatever you want to do with the data, put that logic between the braces.
Key = ? is something I'm using to select rows by a certain foreign key. You would likely remove that if you don't use foreign key constraints.
Hope this helps.
If you want to do this in a stored procedure, you can try this:
SELECT * FROM tbl LIMIT 0, 20;
Unfortunately, using expressions for the LIMIT arguments doesn't work there, so you have to execute a prepared statement or just give the begin and end values to the procedure.
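A minimal sketch of the prepared-statement route (the procedure and parameter names are illustrative):
DELIMITER //
CREATE PROCEDURE page_tbl(IN p_offset INT, IN p_limit INT)
BEGIN
    -- LIMIT arguments can be bound as placeholders in a prepared statement
    SET @o = p_offset;
    SET @l = p_limit;
    PREPARE stmt FROM 'SELECT * FROM tbl LIMIT ?, ?';
    EXECUTE stmt USING @o, @l;
    DEALLOCATE PREPARE stmt;
END //
DELIMITER ;

CALL page_tbl(0, 20);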