MySQL - Calculating Percentile Ranks on the fly

I have a MySQL SELECT query which uses 20 different comparisons within the same table. Here's an example:
SELECT * FROM mytable
WHERE (col1 > (col2 * 0.25))
AND (col5 < col10) .......
I'm trying to calculate percentile ranks based on the order of a column called SCORE within the SELECT results returned. I've tried using incremental row numbers and COUNT(*) to get the stock's rank and the total number of results returned, but I'm not sure how to assign the same rank where some of the results have the same SCORE.
Here's the formula that I'm trying to calculate:
((COUNT(lower scores) + (COUNT(same/tied scores) / 2)) * 100) / COUNT(total results)
How do I find the number of lower scores, same/tied scores and total scores within the same result row for calculating percentiles on the fly?
I'm trying to avoid using stored procedures because I want my application's admins to be able to tailor the SELECT statement within my application's admin area as needed.
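To make the target formula concrete, here is a minimal Python sketch of the arithmetic, not a MySQL solution. It assumes that the "same/tied" count includes the row itself, and the function name and sample scores are made up for illustration.

```python
from collections import Counter


def percentile_ranks(scores):
    """Map each distinct score to (lower + ties / 2) * 100 / total."""
    total = len(scores)
    counts = Counter(scores)
    ranks = {}
    lower = 0  # number of scores strictly below the current one
    for score in sorted(counts):
        ties = counts[score]  # assumption: ties include the row itself
        ranks[score] = (lower + ties / 2) * 100 / total
        lower += ties
    return ranks


sample = [50, 70, 70, 90]  # hypothetical scores
result = percentile_ranks(sample)
print(result)  # {50: 12.5, 70: 50.0, 90: 87.5}
```

Any query-based solution can be checked against a small reference like this before it is trusted on a large table.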

Using Shlomi's code (from the other answer), here's the code that I came up with to calculate percentile ranks (in case anyone wants to calculate these in the future):
SELECT
  c.id, c.score, ROUND(((@rank - rank) / @rank) * 100, 2) AS percentile_rank
FROM
  (SELECT
     *,
     @prev:=@curr,
     @curr:=a.score,
     @rank:=IF(@prev = @curr, @rank, @rank + 1) AS rank
   FROM
     (SELECT id, score FROM mytable) AS a,
     (SELECT @curr:= null, @prev:= null, @rank:= 0) AS b
   ORDER BY score DESC) AS c;

Here's a post (of mine) which explains ranking during SELECT: SQL: Rank without Self Join.
It uses user defined variables which are accessed and assigned even as the rows are being iterated.
Using the same logic, it could be extended to include numbers of total scores, distinct scores etc. As a preview, here's a typical query:
SELECT
  score_id, student_name, score,
  @prev := @curr,
  @curr := score,
  @rank := IF(@prev = @curr, @rank, @rank+1) AS rank
FROM
  score,
  (SELECT @curr := null, @prev := null, @rank := 0) sel1
ORDER BY score DESC
;
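For readers unfamiliar with the variable trick, the logic it relies on (keep the rank on a tie, otherwise increment it) can be sketched in plain Python. This is only an illustration of the intended result, assuming the rows really are visited in descending score order; the function name and sample data are hypothetical.

```python
def dense_rank_desc(scores):
    """Assign 1, 2, 3, ... in descending score order; tied scores share a rank."""
    rank = 0
    prev = None  # plays the role of @prev; assumes no score is itself None
    ranked = []
    for curr in sorted(scores, reverse=True):
        if curr != prev:
            rank += 1  # new score value: move to the next rank
        ranked.append((curr, rank))
        prev = curr
    return ranked


ranked = dense_rank_desc([70, 90, 50, 70])  # hypothetical scores
print(ranked)  # [(90, 1), (70, 2), (70, 2), (50, 3)]
```

The Python version sorts explicitly before iterating, which is exactly the guarantee the SQL version depends on the server to provide.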

The responses from Shlomi and Zishan (which uses Shlomi's code) definitely do not give accurate results, as I discovered by examining the results on a large table of mine. As answered elsewhere, it is apparently impossible to calculate percentile ranks in a single MySQL query:
SQL rank percentile
The Shlomi Noach approach using user-defined variables does - at first - look like it's working fine for the top couple percent of rankings, but it quickly degenerates for the lower-ranking rows in your table. Look at your data results for yourself, as I did.
See this blog post by Roland Bouman about why Shlomi's approach using user-defined variables within a single SQL statement doesn't work, with a proposed better solution:
http://rpbouman.blogspot.com/2009/09/mysql-another-ranking-trick.html
So then I adapted Bouman's code for this purpose and here's my solution, which necessarily combines PHP and MySQL:
Step 1: Calculate and store the absolute rank for each row by submitting the following two queries:
SET @@group_concat_max_len := @@max_allowed_packet;
UPDATE mytable INNER JOIN (SELECT ID, FIND_IN_SET(
        score,
        (SELECT GROUP_CONCAT(
            DISTINCT score
            ORDER BY score DESC
          )
         FROM mytable)
      ) AS rank
    FROM mytable) AS a
  ON mytable.ID=a.ID
SET mytable.rank = a.rank;
Step 2: Fetch the total number of rows (and store the result in a PHP variable $total)
SELECT COUNT(ID) FROM mytable
Step 3: Use a PHP loop to iterate through the table to use the absolute rank for each row to calculate the row's percentile rank:
3a) Loop through:
SELECT ID, rank FROM mytable
while storing those row values as $ID and $rank in PHP
3b) For each row run:
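// The derived table returns one row without an ID column, so it is cross-joined;
// the target row is selected in the WHERE clause, which must come after SET.
$sql = 'UPDATE mytable CROSS JOIN (
SELECT (100*COUNT(ID)/'.$total.') AS percentile
FROM mytable
WHERE rank >= '.$rank.'
) a
SET mytable.percentile = a.percentile
WHERE mytable.ID='.$ID;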
This is probably not the most efficient process, but it is definitely accurate. Since the 'score' value is not updated very often in my case, I run the above script as a cron batch operation to keep the percentile ranks up-to-date.
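On servers that support window functions (MySQL 8.0 and later), the formula from the question can probably be written as a single statement instead of a batch job. The sketch below is not the author's method: it runs the query through Python's built-in sqlite3 module (SQLite 3.25+ assumed) as a stand-in for MySQL, against a hypothetical mytable(id, score) with made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany(
    "INSERT INTO mytable (id, score) VALUES (?, ?)",
    [(1, 50), (2, 70), (3, 70), (4, 90)],  # hypothetical data
)

# (lower scores + tied scores / 2) * 100 / total, computed per row
query = """
SELECT id, score,
       ((RANK() OVER (ORDER BY score) - 1)
        + COUNT(*) OVER (PARTITION BY score) / 2.0)
       * 100.0 / COUNT(*) OVER () AS percentile_rank
FROM mytable
ORDER BY id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

RANK() minus one counts the rows with a strictly lower score, the partitioned COUNT gives the number of tied rows (including the row itself), and the unpartitioned COUNT gives the total.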

Related

What is the other way to get result without using MYSQL ROW_NUMBER Function with PARTITION and multiple group by to reset row count

select
Id,request_id,key_skill_id,
ROW_NUMBER() OVER (PARTITION BY skill_id
ORDER BY request_id,skill_id) sequence
from report
where id= x
GROUP by request_id, skill_id
order by request_id,skill_id ;
I tried to write something like the following, but the result is not the same:
select
  id,
  request_id,
  @skill_id :=skill_id as skill_id,
  @row_number :=
    CASE
      WHEN @skill_id = skill_id THEN @row_number+1
      ELSE 1
    END AS row_number
from report,
(SELECT @row_number := 0, @skill_id := '') as t
where id =x
GROUP by request_id, skill_id
order by request_id, skill_id;
The original window function strikes me as a bit odd but I confess that I don't use these functions too frequently being confined to MySQL 5.7 myself. The PARTITION BY clause specifies the key_skill_id column so re-numbering 1, 2, 3, etc. will be done on those rows with identical key_skill_id column values. But then there is a final ORDER clause at the very end of the SQL that re-sorts the results so that rows with the same key_skill_id will not in general be together (unless, for example, there was only a single value of feedback_request_id being selected).
To do the initial numbering of the rows, however, the table must first be sorted by key_skill_id and then feedback_request_id. The purpose of the GROUP BY clause in the original SQL is to function as an equivalent of a SELECT DISTINCT query, which can't be used because the added row number column guarantees that each row is distinct. The GROUP BY works because it is applied before the ROW_NUMBER window function is evaluated, whereas the filtering implied by SELECT DISTINCT would be applied afterwards.
Given you have provided no table definitions, data, expected output, etc. I was unable to test the following. This is my best guess:
select
  x.*,
  @row_number :=
    CASE
      WHEN @key_skill_id = x.key_skill_id THEN @row_number+1
      ELSE 1
    END AS sequence,
  @key_skill_id := x.key_skill_id /* remember the current group for the next row */
from (
  select distinct /* to emulate group by */
    candidateId,
    feedback_id,
    key_skill_id
  from newFeedbackReport
  where candidate_id = 2501
  order by key_skill_id, feedback_request /* this is not a mistake */
) x,
(SELECT @row_number := 0, @key_skill_id := '') as t
order by feedback_request_id, key_skill_id;

using FORCE INDEX to ensure the table is ordered with GROUP BY and ORDER BY before calculating user variables

I am trying to sum the nth highest rows.
I am calculating a cycling league table where the fastest rider at an event gets 50 points, the 2nd fastest gets 49 points, and so on. There are 10 events in the league, but only a rider's 8 best results are used (this means a rider can miss up to 2 events without a catastrophic descent down the leaderboard).
First I need a table where each rider's results from all events in the league are grouped together and listed in order of highest points, with a sequential number calculated so I can sum the 8 (or fewer) best results.
So I used this SELECT:
set @r := 0, @rn := 0 ;
SELECT
  t.*,
  @rn := if(@r = t.id_rider, @rn + 1, 1) as seqnum,
  @r := t.id_rider as dummy_rider
from results as t
ORDER BY t.id_rider, t.points desc
where the table results is a view as below:
SELECT
a.id_rider,
b.id_event,
b.race_no,
b.id_race,
b.id_race_type,
b.`position`,
c.id_league,
(51 - b.`position`) AS points
FROM
wp_dtk_start_sheet a
JOIN wp_dtk_position_results b ON a.id_event = b.id_event AND a.race_no = b.race_no
JOIN wp_dtk_league_races c ON b.id_race = c.id_race
WHERE
c.id_league = 1
AND b.`position` IS NOT NULL
This does not work, as the seqnum is 1 for all results. If I export the view into Excel and create a test table with the same columns and data, it works OK. I believe what is going wrong is that the rows are not being sorted by ORDER BY t.id_rider, t.points desc before the variables are evaluated.
This reference: https://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/ states: "This technique is pretty much non-deterministic, because it relies on things that you and I don’t get to control directly, such as which indexes MySQL decides to use for grouping."
The reference suggests forcing an index, so I tried forcing the index on id_rider:
set @r := 0, @rn := 0 ;
SELECT
  a.id_rider,
  c.id_league,
  (51- b.`position`) as points,
  @rn := if(@r = a.id_rider, @rn + 1, 1) as seqnum,
  @r := a.id_rider as 'set r'
from wp_dtk_start_sheet as a force index (id_rider)
join wp_dtk_position_results as b on a.id_event = b.id_event and a.race_no = b.race_no
join wp_dtk_league_races as c on b.id_race = c.id_race
where c.id_league = 1 and b.`position` is not null
ORDER BY a.id_rider, points desc
This did not work; I got seqnum = 1 for all rows, as before.
My table structure is as below:
table a - wp_dtk_start_sheet
table b - wp_dtk_position_results
table c -wp_dtk_league_races
This Stack Overflow answer was also very helpful, but it has the same problem:
Sum Top 10 Values
Can anyone help? Perhaps I am going about this all the wrong way?
The solution is much more clear if you use window functions. This allows you to specify the order of rows within each group for purposes of row-numbering.
SELECT t.*
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY id_rider ORDER BY points DESC) AS seqnum
FROM results
) AS t
WHERE t.seqnum <= 8;
Support for window functions in MySQL was introduced in version 8.0, so you might have to upgrade. But it's been part of the MySQL product since 2018.
Bill's answer works brilliantly but I have also combined it into one statement as well, this is the combined select command:
Select
t.id_rider,
sum(points) as total
from
(SELECT
a.id_rider,
c.id_league,
(51- b.`position`) as points,
ROW_NUMBER() OVER (PARTITION BY id_rider ORDER BY points DESC) AS seqnum
from wp_dtk_start_sheet as a
join wp_dtk_position_results as b on a.id_event = b.id_event and a.race_no = b.race_no
join wp_dtk_league_races as c on b.id_race = c.id_race
where c.id_league = 1 and b.`position` is not null ) as t
where seqnum <= 8
group by id_rider
order by total desc
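When validating the league totals on a small export of the data, it can help to have an independent reference. The following Python sketch computes the same "sum of each rider's best N results" outside the database; the function name, the (id_rider, points) tuple layout, and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict
import heapq


def league_totals(results, best_n=8):
    """Sum each rider's best_n highest point scores, highest total first."""
    by_rider = defaultdict(list)
    for id_rider, points in results:
        by_rider[id_rider].append(points)
    totals = {
        rider: sum(heapq.nlargest(best_n, pts))  # fewer than best_n results is fine
        for rider, pts in by_rider.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (id_rider, points) rows; best_n=2 keeps the example short.
sample = [(1, 50), (1, 49), (1, 40), (2, 48), (2, 47)]
table = league_totals(sample, best_n=2)
print(table)  # [(1, 99), (2, 95)]
```

heapq.nlargest plays the role of the `seqnum <= 8` filter: it keeps only the top results per rider before they are summed.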

Calculating Running totals across rows and grouping by ID

I want to compute running row totals across a table; however, the totals must start over for new IDs.
https://imgur.com/a/YgQmYQA
My code:
set @csum := 0;
select ID, name, marks, (@rt := @rt + marks) as Running_total from students order by ID;
The output returns the totals but doesn't break or start over for new IDs.
Bro try this... It is tested on MSSQL..
select ID, name, marks,
marks + isnull(SUM(marks) OVER ( PARTITION BY ID ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) ,0) as Running_total
from students
You need to partition your running total by ID. A running total also needs a column to order by, since that order determines how the values accumulate. Assuming the running total under each ID is based on the order of marks:
Approach 1: It can be written in a simple query if your DBMS supports Analytical Functions
SELECT ID
,name
,marks
,Running_total = SUM(marks) OVER (PARTITION BY ID ORDER BY marks ASC)
FROM students
Approach 2: You can make use of OUTER APPLY if your database version / DBMS itself does not support Analytical Functions
SELECT S.ID
,S.name
,S.marks
,Running_total = OA.runningtotalmarks
FROM students S
OUTER APPLY (
SELECT runningtotalmarks = SUM(SI.marks)
FROM students SI
WHERE SI.ID = S.ID
AND SI.marks <= S.marks
) OA;
Note: The above queries have been tested on MS SQL Server.
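Since the question is about MySQL rather than SQL Server, here is a hedged sketch of the same partitioned running total written with standard window-function syntax, which MySQL 8.0+ also accepts. It is executed through Python's sqlite3 module (SQLite 3.25+ assumed) as a stand-in, and the students rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (ID INTEGER, name TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO students (ID, name, marks) VALUES (?, ?, ?)",
    [(1, "a", 10), (1, "b", 20), (2, "c", 5), (2, "d", 15)],  # hypothetical rows
)

# The total restarts for each ID because the window is partitioned by ID.
query = """
SELECT ID, name, marks,
       SUM(marks) OVER (PARTITION BY ID ORDER BY marks
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS Running_total
FROM students
ORDER BY ID, marks
"""
running = conn.execute(query).fetchall()
for row in running:
    print(row)
```

The explicit ROWS frame makes tied marks accumulate row by row instead of being summed together, which the default frame would do.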

Add a Column in Query

Here is my query:
SET @rank=0;
SELECT @rank := @rank +1 AS rank_id, name, SUM(points) AS points
FROM battle_points
WHERE category = 'test'
AND user_id !=0
GROUP BY user_id
ORDER BY points DESC;
I'd like to add a column rank based on the total points. With this query, the points are fine but the rank_id virtual column doesn't match up.
For example, the top user with the most points has rank 26, yet the rank_id column has a value of 24.
How do I match up the rank_id column with the points column?
Note: while I am fully versed in PHP, I need a solution for MySQL only.
You are on the right path, but you need to put the main query in a subquery so that the ordering occurs before the rank calculation, like so:
SET @rank=0;
SELECT @rank := @rank +1 AS rank_id, mainQ.*
FROM (
SELECT name, SUM(points) AS points
FROM battle_points
WHERE category = 'test'
AND user_id !=0
GROUP BY user_id
ORDER BY points DESC
) AS mainQ
;
Edit: Qualified * to mainQ.*.
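The point of the subquery is the order of operations: aggregate and sort first, then number the rows. A small Python sketch of that order, using a hypothetical (user_id, points) row layout and made-up data, might look like this:

```python
from collections import defaultdict


def rank_by_total(rows):
    """Aggregate points per user, sort by total descending, then number the rows."""
    totals = defaultdict(int)
    for user_id, points in rows:
        totals[user_id] += points  # equivalent of SUM(points) ... GROUP BY user_id
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    # Numbering happens last, on the already sorted totals.
    return [(rank_id, user_id, total)
            for rank_id, (user_id, total) in enumerate(ordered, start=1)]


ranked = rank_by_total([(7, 10), (8, 30), (7, 5), (9, 20)])  # hypothetical rows
print(ranked)  # [(1, 8, 30), (2, 9, 20), (3, 7, 15)]
```

If the numbering were applied while the raw rows were still being read, as in the original query, the rank_id values would reflect read order rather than total points.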

(How) can I number query result groups by row/result order in a single query?

I have a query that currently returns data with the following attributes:
A number A which is guaranteed to be unique in the result (not in the source table); the result is ordered by A, but the values of A in the result are not necessarily continuous.
A key B which is repeated for multiple rows, tagging them as part of the same group. It comes from the same table as A.
Example:
+--+-+-+
|id|A|B|
+--+-+-+
| 5|1|2|
|15|3|2|
|12|4|5|
|66|6|5|
| 2|7|2|
+--+-+-+
I've seen answers here which explain how to return the row number in the result. What I do need, however, is to obtain a (preferably 1-based) order number while keeping a distinct count for each B. In the following table, C is the desired result:
+--+-+-+-+
|id|A|B|C|
+--+-+-+-+
| 5|1|2|1|
|15|3|2|2|
|12|4|5|1|
|66|6|5|2|
| 2|7|2|3|
+--+-+-+-+
This goes a little beyond my current SQL skill, so I'll be thankful for any pointers. Including pointers to existing answers!
EDIT: Both answers below work equally well in terms of results (with a dummy wrapping query used for sorting). Thank you all for the help. Which would be the most efficient query? Consider that in my specific use case, the number of rows returned from the original query is never very large (let's say up to 50 rows, and even that is a stretch of the imagination). Also, the original query has joins used for fetching data from other relations, although they are not relevant for sorting or filtering. Finally, it is possible for all results to have the same B, or for every one of them to have a distinct B; it can go either way or anywhere in between.
What you basically want is the RANK() function. However, since it's not available in MySQL, you can simulate it with:
SELECT *
FROM (
  SELECT a, b, (CASE b
                  WHEN @partition THEN @rank := @rank + 1
                  ELSE @rank := 1 AND @partition := b END) AS c
  FROM tbl, (SELECT @rank := 0, @partition := '') tmp
  ORDER BY b, a
) tmp
ORDER BY a
DEMO (SQL Fiddle).
select p.*, @i := if(@lastB != p.B, 1, @i + 1)
,@lastB := p.B as B
from table_name p,
(select @i := 0) vt1,
(select @lastB := null) vt2
order by B;
Try this code. (Not tested)
EDIT
demo with sqlfiddle http://sqlfiddle.com/#!2/412df/13/2
This is not going to be very efficient as your query has to be calculated twice and then a group by as well:
SELECT
q.* ,
COUNT(*) AS c -- the "Rank"
FROM
yourQuery AS q
JOIN
yourQuery AS qq
ON qq.B = q.B
AND qq.A <= q.A
GROUP BY
q.A ;
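To check any of these queries against the example from the question, the desired column C can be reproduced with a per-group counter. This Python sketch is only a reference implementation of the expected output; the function name and the (id, A, B) tuple layout are assumptions.

```python
from collections import defaultdict


def number_within_groups(rows):
    """Walk rows in order of A and give each row a 1-based counter per B."""
    counters = defaultdict(int)
    numbered = []
    for row_id, a, b in sorted(rows, key=lambda r: r[1]):
        counters[b] += 1  # next sequence number for this group
        numbered.append((row_id, a, b, counters[b]))
    return numbered


# (id, A, B) rows taken from the example table in the question
rows = [(5, 1, 2), (15, 3, 2), (12, 4, 5), (66, 6, 5), (2, 7, 2)]
result = number_within_groups(rows)
for row in result:
    print(row)
```

For the roughly 50 rows mentioned in the question, any of the approaches is likely to be fast enough, so correctness against a reference like this matters more than raw efficiency.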