Usage of weighting in pure SQL - mysql

My front-end (SourcePawn) currently does the following:
float fPoints = 0.0;
float fWeight = 1.0;
while(results.FetchRow())
{
fPoints += (results.FetchFloat(0) * fWeight);
fWeight *= 0.95;
}
In case you don't understand this code, it goes through the resultset of this query:
SELECT points FROM table WHERE auth = 'authentication_id' AND points > 0.0 ORDER BY points DESC;
The result set consists of floating-point numbers, sorted by points from high to low.
My front-end takes 100% of the first row, then 95% of the second one, and the weight drops by a further 5% every iteration. It all adds up into fPoints, which is my 'sum' variable.
What I'm looking for is a way to replicate this code in pure SQL and receive the sum that is called fPoints in my front-end, so that I can run it for a table that has over 10,000 rows in one query instead of 10,000.
I'm very lost. I don't know where to start and guidance of any kind would be very nice.

You can do this using variables:
SELECT points,
       (points * (@f := 0.95 * @f) / 0.95) as fPoints
FROM table t CROSS JOIN
     (SELECT @f := 1.0) params
WHERE auth = 'authentication_id' AND points > 0.0
ORDER BY points DESC;
A note about the calculation. The value of @f starts at 1. Because we are dealing with variables, the assignment and the use of the variable need to be in the same expression -- MySQL does not guarantee the order of evaluation of expressions.
So, the 0.95 * @f reduces the value by 5%. However, that is for the next iteration. The / 0.95 undoes it to get the right value for this iteration.
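If what you want is the single total rather than the per-row running values, one option is to wrap the query in an outer SUM. This is only a sketch: it relies on the same user-variable trick, and MySQL does not strictly guarantee that an ORDER BY inside a derived table is honored, so treat it with the same caution as the query above (table and column names are the placeholders from the question).
SELECT SUM(w.fPoints) AS total
FROM (
    SELECT points,
           (points * (@f := 0.95 * @f) / 0.95) AS fPoints
    FROM `table` t CROSS JOIN
         (SELECT @f := 1.0) params
    WHERE auth = 'authentication_id' AND points > 0.0
    ORDER BY points DESC
) AS w;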

While I'm glad the answer Gordon Linoff provides works for you, you should understand it's quite specific. ORDER BY, per the SQL standard, has no effect on how a query is processed, and SQL does not recognize "iteration" in a SELECT statement. So the idea of "reducing a variable on each iteration", where the iteration order is governed by ORDER BY, has no basis in standard SQL. You might want to check whether it's guaranteed by MySQL, just for your own edification.
To achieve the effect you want in a standard way, proceed as follows.
Create a table Percentiles( Percentile int not null, Factor float not null )
Populate that table with your factors (20 rows).
Write a view or CTE that ranks your points in descending order. Let us call the rank column rank.
Then join your view to Percentiles:
SELECT auth, sum(points * factor) as weight
FROM "your view" as t join percentiles as p
ON t.rank = p.percentile
WHERE points > 0.0
GROUP BY auth
That query is simple, and its intent obvious. It might even be faster. Most important, it will definitely work, and doesn't depend on any idiosyncrasies of your current DBMS.
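For illustration, here is a hedged sketch of what the ranking step and the final join could look like on MySQL 8+, where ROW_NUMBER is available (on older versions the rank would have to be emulated, e.g. with a user variable). The factor values assume the multiplicative 0.95 weighting from the question, and all names are placeholders:
-- Weight table: Factor = 0.95^(Percentile - 1), 20 rows in total
INSERT INTO Percentiles (Percentile, Factor)
VALUES (1, 1.0), (2, 0.95), (3, 0.9025) /* ... up to 20 rows */;

WITH ranked AS (
    SELECT auth, points,
           ROW_NUMBER() OVER (PARTITION BY auth ORDER BY points DESC) AS `rank`
    FROM `table`
    WHERE points > 0.0
)
SELECT r.auth, SUM(r.points * p.Factor) AS fPoints
FROM ranked AS r
JOIN Percentiles AS p ON p.Percentile = r.`rank`
GROUP BY r.auth;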

Related

MySQL slow query, joined table by points inside circle

I'm trying to get this query working, but unfortunately it's pretty slow, so I'm guessing there could be a better query for getting the result I'm looking for.
Select samples.X, samples.Y, samples.id, samples.Provnr, samples.customer_id, avg(lerhalter.lerhalt) from samples
left outer join lerhalter
on SQRT(POW(samples.X - lerhalter.x , 2) + POW(samples.Y - lerhalter.y, 2)) < 100
where samples.customer_id = 900417
group by samples.provnr
I have the table samples, and I'd like to get all of a customer's samples and then join the lerhalter table. There can be more than one matching row for each sample when I do the join, therefore I'd like to get the average value of the lerhalt column.
I think I get the result that I'm after, but the query can take up to 10s for a customer with only 100 samples, and there are customers with 2000 samples.
So I have to get a better query time.
Any suggestions?
A small speed-up would be to leave out the SQRT function. SQRT() is expensive in terms of computing time, and you can simply adjust the right-hand side of your comparison to 100 x 100 = 10,000:
Select samples.X, samples.Y, samples.id, samples.Provnr, samples.customer_id, avg(lerhalter.lerhalt) from samples
left outer join lerhalter
on (POW(samples.X - lerhalter.x , 2) + POW(samples.Y - lerhalter.y, 2)) < 10000
where samples.customer_id = 900417
group by samples.provnr
Also, are you sure you need a LEFT OUTER JOIN? Could an INNER JOIN be used instead?
Next question: are the X and Y coordinates integer values? If not, can they be converted to integers? Integer calculations are usually a lot faster than floating-point operations.
Finally, you are clearly computing a Euclidean distance. Is that really needed? Could another distance measure do a sufficiently good job? Maybe city-block distance is good enough for your needs? That would further speed things up a lot.
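One more hedged suggestion: adding a cheap bounding-box pre-filter to the join condition lets MySQL discard most candidate pairs before evaluating the squared-distance expression, and (assuming lerhalter.x and lerhalter.y are indexed) gives the optimizer something to work with. A sketch based on the query above:
Select samples.X, samples.Y, samples.id, samples.Provnr, samples.customer_id, avg(lerhalter.lerhalt)
from samples
left outer join lerhalter
    on lerhalter.x between samples.X - 100 and samples.X + 100
   and lerhalter.y between samples.Y - 100 and samples.Y + 100
   and (POW(samples.X - lerhalter.x, 2) + POW(samples.Y - lerhalter.y, 2)) < 10000
where samples.customer_id = 900417
group by samples.provnr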

MySQL Select Results Excluding Outliers Using AVG and STD Conditions

I'm trying to write a query that excludes values beyond 6 standard deviations from the mean of the result set. I expect this can be done elegantly with a subquery, but I'm getting nowhere, and in every similar case I've read the aim seems to be just a little different. My result set seems to get limited to a single row, presumably because of the aggregate function calls. Conceptually, this is what I'm after:
SELECT t.Result FROM
(SELECT Result, AVG(Result) avgr, STD(Result) stdr
 FROM myTable WHERE myField=myCondition LIMIT 75) as t
WHERE t.Result BETWEEN (t.avgr-6*t.stdr) AND (t.avgr+6*t.stdr)
I can get it to work by replacing each use of the STD or AVG value (i.e. t.avgr) with its own select statement, as in:
(SELECT AVG(Result) FROM myTable WHERE myField=myCondition LIMIT 75)
However this seems way more messy than it needs to be (I have a few conditions). At first I thought specifying a HAVING clause was necessary, but as I learn more it doesn't seem to be quite what I'm after. Am I close? Is there some snazzy way to access the value of aggregate functions for use in conditions (without needing to return the aggregate values)?
Yes, your subquery is an aggregate query with no GROUP BY clause, therefore its result is a single row. When you select from that, you cannot get more than one row. Moreover, it is a MySQL extension that you can include the Result field in the subquery's selection list at all, as it is neither a grouping column nor an aggregate function of the groups (so what does it even mean in that context unless, possibly, all the relevant column values are the same?).
You should be able to do something like this to compute the average and standard deviation once, together, instead of per-result:
SELECT t.Result FROM
myTable AS t
CROSS JOIN (
SELECT AVG(Result) avgr, STD(Result) stdr
FROM myTable
WHERE myField = myCondition
) AS stats
WHERE
t.myField = myCondition
AND t.Result BETWEEN (stats.avgr-6*stats.stdr) AND (stats.avgr+6*stats.stdr)
LIMIT 75
Note that you will want to be careful that the statistics are computed over the same set of rows that you are selecting from, hence the duplication of the myField = myCondition predicate and the relocation of the LIMIT clause to the outer query.
You can add more statistics to the aggregate subquery, provided that they are all computed over the same set of rows, or you can join additional statistics computed over different rows via a separate subquery. Do ensure that all your statistics subqueries return exactly one row each, else you will get duplicate (or no) results.
I created a UDF that doesn't calculate things exactly the way you asked (it discards a percentage of the results from the top and bottom instead of using the standard deviation), but it might be useful for you (or someone else) anyway. It matches the Excel function referenced here: https://support.office.com/en-us/article/trimmean-function-d90c9878-a119-4746-88fa-63d988f511d3
https://github.com/StirlingMarketingGroup/mysql-trimmean
Usage
`trimmean` ( `NumberColumn`, double `Percent` [, integer `Decimals` = 4 ] )
`NumberColumn`
The column of values to trim and average.
`Percent`
The fractional number of data points to exclude from the calculation. For example, if percent = 0.2, 4 points are trimmed from a data set of 20 points (20 x 0.2): 2 from the top and 2 from the bottom of the set.
`Decimals`
Optionally, the number of decimal places to output. Default is 4.
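Based on the usage above, a query along these lines should trim 20% of the rows (10% from each end) before averaging; the column and condition come from the question, and everything else follows the documented signature:
SELECT trimmean(Result, 0.2) AS trimmed_avg
FROM myTable
WHERE myField = myCondition;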

SQL Error (1242): Subquery returns more than 1 row - Not fixable with (IN) or a JOIN (I think?)

All - I'm trying to do a basic (in theory) on-time completion report. I'd like it to list
assigned_to_id | Percent on-time (as a percent - but this is not important now)
I figure I can tell MySQL to get a count of all tasks and a count of all tasks marked closed on a date prior to the due date, and give me that number... Seems simple?
I'm a sysadmin, not a SQL developer, so excuse the grossness to follow!
I've got
select issues.assigned_to_id, tb1.percent from (
Select
(select count(*) from issues where issues.due_date >= date(issues.closed_on) group by issues.assigned_to_id)/
(select count(*) from issues group by issues.assigned_to_id) as percent
from issues)
as tb1
group by tb1.percent;
It's been mixed up a bit by me trying to solve the multiple-rows issue, so it may be even worse than when I started - but if I could get a list of users with their percentages, that would be great!
I'd love to use something like a "for each", but I know that doesn't exist.
Thanks!
You've got a division operation, e.g. (foo) / (bar), and both the numerator and denominator are subqueries. Since you're expecting to take those subqueries and divide their results, they MUST return a SINGLE value each, e.g. 1 / 2.
The error message indicates that one (or probably both) is returning a multi-row result, so in effect you're trying to do something like (1,2,3) / (4,5,6), which is not a valid math operation, and you end up with your error message.
Fix the subqueries so they return only a SINGLE value each.
MySQL has an equivalent of SQL Server's CROSS/OUTER APPLY (LATERAL derived tables, available as of MySQL 8.0.14) which matches this case. In SQL Server syntax it looks like
SELECT T.*, Data.Value FROM [Table] T OUTER APPLY ...
and in MySQL the same thing is written with LEFT JOIN LATERAL (...). You can try to use that.
What you should probably be doing is using an IF statement:
SELECT
assigned_to_id,
SUM(IF(due_date >= date(closed_on), 1, 0))/SUM(1) AS percent
FROM issues
GROUP BY assigned_to_id
ORDER BY percent DESC
Note here I am grouping by assigned_to_id and ordering by percent. This allows you to calculate the percentage for each assigned_to_id group and order those groups by percent.
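As a side note, since MySQL comparisons evaluate to 1 or 0, the same ratio can be written a little more compactly with AVG. One caveat with this variant: rows where closed_on is NULL are ignored by AVG, whereas the IF form counts them as not on time, so only use it if closed_on is always populated.
SELECT
    assigned_to_id,
    AVG(due_date >= DATE(closed_on)) AS percent
FROM issues
GROUP BY assigned_to_id
ORDER BY percent DESC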
If we have to rewrite your query exactly, try this:
select issues.assigned_to_id,
       ((select count(*) from issues where issues.due_date >= date(issues.closed_on)) /
        (select count(*) from issues)) as perc
from issues
group by issues.assigned_to_id;
I think you want something like this, a list of assignee ids and the percentage done on time.
select distinct issues.assigned_to_id, done_on_time.c / issue_count.c as percent
from issues
join
(select issues.assigned_to_id, count(*) as c
from issues
where issues.due_date >= date(issues.closed_on)
group by issues.assigned_to_id) as done_on_time
on issues.assigned_to_id = done_on_time.assigned_to_id
join
(select issues.assigned_to_id, count(*) as c
from issues
group by issues.assigned_to_id) as issue_count
on issues.assigned_to_id = issue_count.assigned_to_id

sorting a query result from DB by custom function

I want to return a sorted list from the DB. The function that I want to sort by might look like
(field1_value * w1 + field2_value * w2) / (1 + currentTime - createTime(field3_value))
It is for the sort-by-popularity feature of my application.
I wonder how other people do this kind of sorting in the DB (say MySQL).
I'm going to implement this in Django eventually, but any comment on the general direction/strategy to achieve this is most welcome.
Do I define a function and calculate the score for every row on every request?
Do I set aside a field for this score and calculate the score at a regular interval?
Or does using time as a variable of the sorting function look bad?
How do other sites implement 'sort by popularity'?
I put the time variable in because I want newer posts to get more attention.
Do I define a function and calculate the score for every row on every request?
You could, but it's not necessary: you can simply provide that expression to your ORDER BY clause, letting MySQL evaluate the current time at query time:
ORDER BY (field1 * w1 + field2 * w2) / (1 + UNIX_TIMESTAMP() - UNIX_TIMESTAMP(field3)) DESC
Alternatively, if your query is selecting such a rating you can merely ORDER BY the aliased column name:
ORDER BY rating
Do I set aside a field for this score and calculate the score at a regular interval?
You only need to recalculate on an interval if you decide to store the score instead of computing it at query time. But if you were to store the result of the above expression in its own field (refreshing it periodically, since it depends on the current time), then performing the ORDER BY operation would be very much faster, especially if that new field were suitably indexed.
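A minimal sketch of that denormalized approach, assuming a hypothetical posts table with the field1/field2/field3 columns from the question and illustrative weights (none of these names come from the original post):
ALTER TABLE posts
    ADD COLUMN popularity DOUBLE NOT NULL DEFAULT 0,
    ADD INDEX idx_posts_popularity (popularity);

-- refresh on a schedule (cron, a Django management command, etc.)
UPDATE posts
SET popularity = (field1 * 0.7 + field2 * 0.3)
                 / (1 + UNIX_TIMESTAMP() - UNIX_TIMESTAMP(field3));

-- reads become a cheap, index-friendly sort
SELECT * FROM posts ORDER BY popularity DESC LIMIT 20;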

"Inverse" Limit?

I'm using MySQL to store financial stuff, and using the data to build, among other things, registers of all the transactions for each account. For performance reasons - and to keep the user from being overwhelmed by a gargantuan table - I paginate the results.
Now, as part of the register, I display a running balance for the account. So if I'm displaying 20 transactions per page, and I'm displaying the second page, I use the data as follows:
Transactions 0 - 19: Ignore them - they're more recent than the page being looked at.
Transactions 20 - 39: Select everything from these - they'll be displayed.
Transactions 40 - ??: Sum the amounts from these so the running balance is accurate.
It's that last one that's annoying me. It's easy to select the first 40 transactions using a LIMIT clause, but is there something comparable for everything but the first 40? Something like "LIMIT -40"?
I know I can do this with a COUNT and a little math, but the actual query is a bit ugly (multiple JOINs and GROUP BYs), so I'd rather issue it as few times as possible. And this seems useful enough to be included in SQL - and I just don't know about it. Does anybody else?
The documentation says:
The LIMIT clause can be used to constrain the number of rows returned
by the SELECT statement. LIMIT takes one or two numeric arguments,
which must both be nonnegative integer constants, with these
exceptions:
Within prepared statements, LIMIT parameters can be specified
using ? placeholder markers.
Within stored programs, LIMIT parameters can be specified using
integer-valued routine parameters or local variables as of MySQL
5.5.6.
With two arguments, the first argument specifies the offset of the
first row to return, and the second specifies the maximum number of
rows to return. The offset of the initial row is 0 (not 1):
SELECT * FROM tbl LIMIT 5,10; # Retrieve rows 6-15
To retrieve all rows from a certain offset up to the end of the result
set, you can use some large number for the second parameter. This
statement retrieves all rows from the 96th row to the last:
SELECT * FROM tbl LIMIT 95,18446744073709551615;
Next time, please use the documentation as your first port of call.
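Applied to the running-balance case in the question, that "large second argument" form can be wrapped in a derived table and summed in one query. A sketch only; transactions, amount and created_at are illustrative names, not from the original post:
SELECT SUM(amount) AS balance_before_page
FROM (
    SELECT amount
    FROM transactions
    ORDER BY created_at DESC
    LIMIT 40, 18446744073709551615  -- skip the 40 newest rows, keep the rest
) AS older;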
You can hack it this way:
select sel.*
from
(
    SELECT @rownum := @rownum + 1 rownum, t.*
    FROM (SELECT @rownum := 0) r, YourTableOrYourSubSelect t
) sel
where rownum > 40
It's kinda like having Oracle's rownum in MySQL.
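The same hack can produce the "everything but the first 40" sum directly. This is a sketch with illustrative table and column names, and it assumes MySQL honors the ORDER BY of the inner derived table while numbering rows, which is not strictly guaranteed:
SELECT SUM(sel.amount) AS balance_before_page
FROM (
    SELECT @rownum := @rownum + 1 AS rownum, ordered.*
    FROM (SELECT @rownum := 0) r,
         (SELECT * FROM transactions ORDER BY created_at DESC) ordered
) sel
WHERE sel.rownum > 40;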