Related to this question.
Let's say we want to solve the grouped ranking problem in MySQL. We have a table in which each row represents an entity that belongs to a group. We want to assign each entity a rank based on an attribute, computed separately for each group. Later we could do various manipulations with the rank, like asking for the first 10 entities of each group that also satisfy another condition, etc.
For example, the entities could be programmers who belong to different "groups" according to their favorite programming language. Each programmer has a reputation (say, on a forum). We want to add an extra field holding the programmer's rank by descending reputation, computed independently for each group.
gid | repu | name
1   | 1    | john
1   | 3    | anna
2   | 2    | scot
2   | 1    | leni
to become
gid | repu | name | rank
1   | 3    | anna | 1
1   | 1    | john | 2
2   | 2    | scot | 1
2   | 1    | leni | 2
Now let's also require that we do not use solutions based on session variables. Yes, they work pretty well, but they clearly violate MySQL's rule that a session variable should not be both read and written in the same statement. (See here)
Now a proposed solution in this post says
-- SOL #1 (SELF-JOIN)
SELECT a.*, count(*) as row_number FROM test a
JOIN test b ON a.gid = b.gid AND a.repu <= b.repu
GROUP BY a.gid, a.repu
Which pretty much does the thing. Some questions I have: is this legit SQL, or does it violate any standard or MySQL quirk? Is it guaranteed to work on MySQL?
Another solution that I read here, which is more like black magic to me but seems more elegant, is
-- SOL #2 (SUBQUERY)
SELECT t.* ,
( SELECT COUNT(*) + 1
FROM test
WHERE repu > t.repu AND gid = t.gid
) AS rank
FROM test AS t
ORDER BY gid ASC, rank ASC
This uses a sub query that references an outer table, and does the trick also. Could anybody explain how this one works ?
Also, the same questions here as for solution #1.
Plus any comments on evaluating the performance/compatibility of two proposed solutions.
EDIT: Additional methods, for reference
From this post, one variation of the session variable method. WARNING: This is what I want to avoid. Notice that in a single statement the @rank and @partition session variables are both read (in the CASE, after WHEN and THEN) and written (in the CASE, after THEN and ELSE, and also in the subquery that initializes the variables).
-- SOL #3 (SESSION VARIABLES / ANTIPATTERN)
SELECT t.*, ( CASE gid
WHEN @partition THEN @rank := @rank + 1
ELSE @rank := 1 AND @partition := gid END ) AS rank
FROM test t,
(SELECT @rank := 0, @partition := '') tmp
ORDER BY gid ASC, repu DESC
Also, here is the set-based solution, rather complicated, posted by a fellow user below.
-- SOL #4 (SET BASED)
SELECT x.*, FIND_IN_SET(CONCAT(x.gid,':',x.repu), y.c) rank
FROM test x
JOIN (
SELECT GROUP_CONCAT(DISTINCT CONCAT(gid,':',repu) ORDER BY gid, repu DESC) c
FROM test GROUP BY gid
) y ON FIND_IN_SET(CONCAT(x.gid,':',x.repu), y.c)
JOIN is legit MySQL syntax. If it wasn't working, I doubt anyone would have marked it as the answer.
As for the subquery, it will likely be slower than the first solution. Looking at the EXPLAIN output for both would be a great way to understand how these queries are executed.
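To see how the two queries arrive at the same ranks, here is a minimal sketch that runs both against the sample data. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption of this sketch, not something from the question), and the alias `rnk` and the inner alias `i` are made up for the example.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (gid INTEGER, repu INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [(1, 1, "john"), (1, 3, "anna"), (2, 2, "scot"), (2, 1, "leni")],
)

# SOL #1: join each row to the rows of its own group that have an equal or
# higher reputation, then count the matches per row.
self_join = """
    SELECT a.name, COUNT(*) AS rnk
    FROM test a
    JOIN test b ON a.gid = b.gid AND a.repu <= b.repu
    GROUP BY a.gid, a.repu, a.name
"""

# SOL #2: for each outer row t, the subquery is evaluated with t's values;
# it counts the rows of the same group with a strictly higher reputation,
# and adding one turns that count into a rank.
subquery = """
    SELECT t.name,
           (SELECT COUNT(*) + 1
              FROM test i
             WHERE i.gid = t.gid AND i.repu > t.repu) AS rnk
    FROM test t
"""

ranks_join = dict(conn.execute(self_join).fetchall())
ranks_subquery = dict(conn.execute(subquery).fetchall())
print(ranks_join)
print(ranks_subquery)
```

On data with ties the two differ: for reputations 5, 5, 3 in one group the subquery gives ranks 1, 1, 3, while the self-join count (grouped per row as above) gives 2, 2, 3.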
There's another way to achieve the same:-
-- SOL #3: Answer with 30 votes in this post:
ROW_NUMBER() in MySQL
Related
I have a table that saves user scores in games, structured like this:
user_id (int)
game_id (int)
score (int)
I want to add another column to this table which will be a virtual column that holds the ranking of user for a game (like a table).
For example
user_id | game_id | score | rank (virtual)
1       | 1       | 50    | 1
2       | 1       | 48    | 2
3       | 1       | 40    | 3
2       | 2       | 80    | 1
1       | 2       | 50    | 2
3       | 2       | 32    | 3
As you can see, the rank column is virtually calculated by the points in each game.
Is it even possible? And if so, what should I write in the virtuality expression field?
This doesn't account for ties, but may get you started...
SELECT x.*
, CASE WHEN @prev = game_id THEN @i:=@i+1 ELSE @i:=1 END rank
, @prev := game_id
FROM my_table x
, (SELECT @prev:=null,@i:=0) vars
ORDER
BY game_id, score DESC;
No, it is not possible to use a generated column for calculating rank. As MySQL documentation on generated columns says:
Subqueries, parameters, variables, stored functions, and user-defined functions are not permitted.
To calculate the rank, you would have to determine the position of the record using a certain ordering within the entire table. This would require an expression that can check other records. There are no such functions within MySQL and subqueries are not allowed.
The only way to make this work is via a view, where you are allowed to use subqueries to calculate the ranking. (User variables are not an option there: a MySQL view definition cannot refer to user-defined variables.)
Note that this may change with MySQL v8.0 because it has rank() and dense_rank() functions. You would have to experiment to see whether these functions are allowed in generated columns (I'm not sure if they are deterministic).
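A rough sketch of the view route, using a correlated subquery for the rank. It runs on Python's built-in sqlite3 module as a stand-in for MySQL, and the names `scores`, `ranked_scores` and `rnk` are assumptions made up for the example (the question does not name the table).

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scores (user_id INTEGER, game_id INTEGER, score INTEGER)"
)
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [(1, 1, 50), (2, 1, 48), (3, 1, 40), (2, 2, 80), (1, 2, 50), (3, 2, 32)],
)

# The view adds a rank column: one plus the number of rows in the same game
# with a strictly higher score. Tied scores therefore share a rank.
conn.execute("""
    CREATE VIEW ranked_scores AS
    SELECT s.user_id, s.game_id, s.score,
           (SELECT COUNT(*) + 1
              FROM scores i
             WHERE i.game_id = s.game_id AND i.score > s.score) AS rnk
    FROM scores s
""")

rows = conn.execute(
    "SELECT user_id, game_id, score, rnk FROM ranked_scores "
    "ORDER BY game_id, rnk"
).fetchall()
for row in rows:
    print(row)
```

The rank is recomputed every time the view is queried, so it stays current as scores change, at the cost of one subquery evaluation per row.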
I'm getting grey hair by now...
I have a table like this.
ID - Place - Person
1 - London - Anna
2 - Stockholm - Johan
3 - Gothenburg - Anna
4 - London - Nils
And I want to get the result where all the different persons are included, but I want to choose which Place to order by.
For example. I want to get a list where they are ordered by LONDON and the rest will follow, but distinct on PERSON.
Output like this:
ID - Place - Person
1 - London - Anna
4 - London - Nils
2 - Stockholm - Johan
Tried this:
SELECT ID, Person
FROM users
ORDER BY FIELD(Place,'London'), Person ASC
But it gives me:
ID - Place - Person
1 - London - Anna
4 - London - Nils
3 - Gothenburg - Anna
2 - Stockholm - Johan
And I really don't want Anna, or any person, to be in the result more than once.
This is one way to get the specified output, but this uses MySQL specific behavior which is not guaranteed:
SELECT q.ID
, q.Place
, q.Person
FROM ( SELECT IF(p.Person<=>@prev_person,0,1) AS r
, @prev_person := p.Person AS person
, p.Place
, p.ID
FROM users p
CROSS
JOIN (SELECT @prev_person := NULL) i
ORDER BY p.Person, !(p.Place<=>'London'), p.ID
) q
WHERE q.r = 1
ORDER BY !(q.Place<=>'London'), q.Person
This query uses an inline view to return all the rows in a particular order, by Person, so that all of the 'Anna' rows are together, followed by all the 'Johan' rows, etc. The set of rows for each person is ordered with Place='London' first, then by ID.
The "trick" is to use a MySQL user variable to compare the values from the current row with values from the previous row. In this example, we're checking if the 'Person' on the current row is the same as the 'Person' on the previous row. Based on that check, we return a 1 if this is the "first" row we're processing for a person, otherwise we return a 0.
The outermost query processes the rows from the inline view, and excludes all but the "first" row for each Person (the 0 or 1 we returned from the inline view.)
(This isn't the only way to get the resultset. But this is one way of emulating analytic functions which are available in other RDBMS.)
For comparison, in databases other than MySQL, we could use SQL something like this:
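-- rn is computed in a derived table because a column alias
-- cannot be referenced in the WHERE clause of the same SELECT
SELECT s.ID
, s.Place
, s.Person
FROM ( SELECT ROW_NUMBER() OVER (PARTITION BY t.Person ORDER BY
         CASE WHEN t.Place='London' THEN 0 ELSE 1 END, t.ID) AS rn
       , t.ID
       , t.Place
       , t.Person
       FROM users t
     ) s
WHERE s.rn=1
ORDER BY CASE WHEN s.Place='London' THEN 0 ELSE 1 END, s.Person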
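For anyone who wants to try the window-function approach, here is a small runnable sketch on the sample data. It uses Python's built-in sqlite3 module rather than MySQL, which is an assumption of this sketch and needs SQLite 3.25 or newer for ROW_NUMBER(); MySQL 8.0 accepts the same window function. The row number is computed in a derived table so the outer query can filter on it.

```python
import sqlite3

# In-memory SQLite database (assumption: SQLite >= 3.25 for window functions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (ID INTEGER, Place TEXT, Person TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "London", "Anna"),
        (2, "Stockholm", "Johan"),
        (3, "Gothenburg", "Anna"),
        (4, "London", "Nils"),
    ],
)

# Number each person's rows, London rows first and then by ID,
# and keep only the first row per person.
query = """
    SELECT s.ID, s.Place, s.Person
    FROM (
        SELECT t.ID, t.Place, t.Person,
               ROW_NUMBER() OVER (
                   PARTITION BY t.Person
                   ORDER BY CASE WHEN t.Place = 'London' THEN 0 ELSE 1 END,
                            t.ID
               ) AS rn
        FROM users t
    ) s
    WHERE s.rn = 1
    ORDER BY CASE WHEN s.Place = 'London' THEN 0 ELSE 1 END, s.Person
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Unlike the user-variable version, the result here does not depend on the order in which expressions are evaluated.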
Followup
At the beginning of the answer, I referred to MySQL behavior that was not guaranteed. I was referring to the usage of MySQL User-Defined variables within a SQL statement.
Excerpts from MySQL 5.5 Reference Manual http://dev.mysql.com/doc/refman/5.5/en/user-variables.html
"As a general rule, other than in SET statements, you should never assign a value to a user variable and read the value within the same statement."
"For other statements, such as SELECT, you might get the results you expect, but this is not guaranteed."
"the order of evaluation for expressions involving user variables is undefined."
Try this:
SELECT ID, Place, Person
FROM users
GROUP BY Person
ORDER BY FIELD(Place,'London') DESC, Person ASC;
You want to use group by instead of distinct:
SELECT ID, Person
FROM users
GROUP BY ID, Person
ORDER BY MAX(FIELD(Place, 'London')), Person ASC;
The GROUP BY does the same thing as SELECT DISTINCT. But, you are allowed to mention other fields in clauses such as HAVING and ORDER BY.
I'm curious how to create table2 of the same structure with the same data as table1, but with order by the column frequency.
Or, the equivalent of this problem is: to change the id of rows in the table properly.
It doesn't matter, whether by ASC, or DESC.
As result, the table1:
**id - name - frequency**
1 - John - 33
2 - Paul - 127
3 - Andy - 74
Should become table2:
**id - name - frequency**
1 - Paul - 127
2 - Andy - 74
3 - John - 33
What's the shortest way to do that?
Also, I would be interested in the query that's fastest for huge tables (although performance is not so important for me).
Like this?
CREATE TABLE b SELECT col FROM a ORDER BY col
Be aware, there is no way to guarantee row order in a database (other than physically). You must always use an ORDER BY.
Reference
For this, you need to create the new id. Here is a MySQL way to do it:
create table table2 as
select @rn := @rn + 1 as id, name, frequency
from table1 cross join (select @rn := 0) const
order by frequency desc
Problem
Suppose I have this table tab (fiddle available).
| g | a | b | v |
---------------------
| 1 | 3 | 5 | foo |
| 1 | 4 | 7 | bar |
| 1 | 2 | 9 | baz |
| 2 | 1 | 1 | dog |
| 2 | 5 | 2 | cat |
| 2 | 5 | 3 | horse |
| 2 | 3 | 8 | pig |
I'm grouping rows by g, and for each group I want one value from column v. However, I don't want any value, but I want the value from the row with maximal a, and from all of those, the one with maximal b. In other words, my result should be
| 1 | bar |
| 2 | horse |
Current solution
I know of a query to achieve this:
SELECT grps.g,
(SELECT v FROM tab
WHERE g = grps.g
ORDER BY a DESC, b DESC
LIMIT 1) AS r
FROM (SELECT DISTINCT g FROM tab) grps
Question
But I consider this query rather ugly, mostly because it uses a dependent subquery, which feels like a real performance killer. So I wonder whether there is an easier solution to this problem.
Expected answers
The most likely answer I expect to this question would be some kind of add-on or patch for MySQL (or MariaDB) which does provide a feature for this. But I'll welcome other useful inspirations as well. Anything which works without a dependent subquery would qualify as an answer.
If your solution only works for a single ordering column, i.e. couldn't distinguish between cat and horse, feel free to suggest that answer as well as I expect it to be still useful to the majority of use cases. For example, 100*a+b would be a likely way to order the above data by both columns while still using only a single expression.
I have a few pretty hackish solutions in mind, and might add them after a while, but I'll first look and see whether some nice new ones pour in first.
Benchmark results
As it is pretty hard to compare the various answers just by looking at them, I've run some benchmarks on them. This was run on my own desktop, using MySQL 5.1. The numbers won't compare to any other system, only to one another. You probably should be doing your own tests with your real-life data if performance is crucial to your application. When new answers come in, I might add them to my script, and re-run all the tests.
100,000 items, 1,000 groups to choose from, InnoDb:
0.166s for MvG (from question)
0.520s for RichardTheKiwi
2.199s for xdazz
19.24s for Dems (sequential sub-queries)
48.72s for acatt
100,000 items, 50,000 groups to choose from, InnoDb:
0.356s for xdazz
0.640s for RichardTheKiwi
0.764s for MvG (from question)
51.50s for acatt
too long for Dems (sequential sub-queries)
100,000 items, 100 groups to choose from, InnoDb:
0.163s for MvG (from question)
0.523s for RichardTheKiwi
2.072s for Dems (sequential sub-queries)
17.78s for xdazz
49.85s for acatt
So it seems that my own solution so far isn't all that bad, even with the dependent subquery. Surprisingly, the solution by acatt, which uses a dependent subquery as well and which I therefore would have considered about the same, performs much worse. Probably something the MySQL optimizer can't cope with. The solution RichardTheKiwi proposed seems to have good overall performance as well. The other two solutions depend heavily on the structure of the data. With many small groups, xdazz' approach outperforms all others, whereas the solution by Dems performs best (though still not exceptionally well) for few large groups.
SELECT g, a, b, v
FROM (
SELECT *,
@rn := IF(g = @g, @rn + 1, 1) rn,
@g := g
FROM (select @g := null, @rn := 0) x,
tab
ORDER BY g, a desc, b desc, v
) X
WHERE rn = 1;
Single pass. All the other solutions look O(n^2) to me.
This approach doesn't use a subquery.
SELECT t1.g, t1.v
FROM tab t1
LEFT JOIN tab t2 ON t1.g = t2.g AND (t1.a < t2.a OR (t1.a = t2.a AND t1.b < t2.b))
WHERE t2.g IS NULL
Explanation:
The LEFT JOIN works on the basis that when t1 is the top row of its group (maximal a, and maximal b among those), there is no t2 row that beats it, so the t2 columns of the joined row will be NULL.
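To make the join condition concrete, here is a small sketch that first lists, for group 1, which rows find a "better" partner, and then runs the full query. It uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption of this sketch.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (g INTEGER, a INTEGER, b INTEGER, v TEXT)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?, ?)",
    [
        (1, 3, 5, "foo"), (1, 4, 7, "bar"), (1, 2, 9, "baz"),
        (2, 1, 1, "dog"), (2, 5, 2, "cat"), (2, 5, 3, "horse"),
        (2, 3, 8, "pig"),
    ],
)

join_condition = """
    t1.g = t2.g
    AND (t1.a < t2.a OR (t1.a = t2.a AND t1.b < t2.b))
"""

# Every row of group 1 paired with each row of the group that beats it.
# A row that nothing beats still appears once, with NULL on the t2 side.
pairs = conn.execute(f"""
    SELECT t1.v, t2.v
    FROM tab t1
    LEFT JOIN tab t2 ON {join_condition}
    WHERE t1.g = 1
    ORDER BY t1.v, t2.v
""").fetchall()
print(pairs)

# Keeping only the unmatched rows leaves the top row of each group.
winners = conn.execute(f"""
    SELECT t1.g, t1.v
    FROM tab t1
    LEFT JOIN tab t2 ON {join_condition}
    WHERE t2.g IS NULL
    ORDER BY t1.g
""").fetchall()
print(winners)
```

The intermediate pair list is also why this approach gets expensive for large groups: the join produces roughly half of all row pairs within each group before the filter throws almost all of them away.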
Many RDBMS have constructs that are particularly suited to this problem. MySQL isn't one of them.
This leads you to three basic approaches.
Check each record to see if it is one you want, using a correlated sub-query in an EXISTS clause. (@acatt's answer. I understand that MySQL doesn't always optimise this very well, but ensure that you have a composite index on (g,a,b) before assuming that it won't.)
Do a half cartesian product to fulfil the same check. Any record which does not join is a target record. Where each group ('g') is large, this can quickly degrade performance (if there are 10 records for each unique value of g, this will yield ~50 records and discard 49; for a group size of 100 it yields ~5000 records and discards 4999), but it is great for small group sizes. (@xdazz's answer.)
Or use multiple sub-queries to determine the MAX(a) and then the MAX(b)...
Multiple sequential sub-queries...
SELECT
yourTable.*
FROM
(SELECT g, MAX(a) AS a FROM yourTable GROUP BY g ) AS searchA
INNER JOIN
(SELECT g, a, MAX(b) AS b FROM yourTable GROUP BY g, a) AS searchB
ON searchA.g = searchB.g
AND searchA.a = searchB.a
INNER JOIN
yourTable
ON yourTable.g = searchB.g
AND yourTable.a = searchB.a
AND yourTable.b = searchB.b
Depending on how MySQL optimises the second sub-query, this may or may not be more performant than the other options. It is, however, the longest (and potentially least maintainable) code for the given task.
Assuming a composite index on all three search fields (g, a, b), I would presume it to be best for large group sizes of g. But that should be tested.
For small group sizes of g, I'd go with @xdazz's answer.
EDIT
There is also a brute force approach.
Create an identical table, but with an AUTO_INCREMENT column as an id.
Insert your table into this clone, ordered by g, a, b.
The id's can then be found with SELECT g, MAX(id).
This result can then be used to look-up the v values you need.
This is unlikely to be the best approach. If it is, it is effectively a condemnation of the MySQL optimiser's ability to deal with this type of problem.
That said, every engine has its weak spots. So, personally, I try everything until I think I understand how the RDBMS is behaving and can make my choice :)
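The brute force steps above can be sketched roughly as follows. This runs on Python's built-in sqlite3 module instead of MySQL (an assumption of this sketch), with SQLite's INTEGER PRIMARY KEY AUTOINCREMENT playing the role of AUTO_INCREMENT. The names `tab_sorted` and `m` are made up for the example, and the sketch relies on ids being handed out in insertion order.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (g INTEGER, a INTEGER, b INTEGER, v TEXT)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?, ?)",
    [
        (1, 3, 5, "foo"), (1, 4, 7, "bar"), (1, 2, 9, "baz"),
        (2, 1, 1, "dog"), (2, 5, 2, "cat"), (2, 5, 3, "horse"),
        (2, 3, 8, "pig"),
    ],
)

# Step 1: a clone of the table with an auto-incrementing id.
conn.execute("""
    CREATE TABLE tab_sorted (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        g INTEGER, a INTEGER, b INTEGER, v TEXT
    )
""")

# Step 2: copy the rows in (g, a, b) order, so that within each group
# the row with the highest (a, b) receives the highest id.
conn.execute("""
    INSERT INTO tab_sorted (g, a, b, v)
    SELECT g, a, b, v FROM tab ORDER BY g, a, b
""")

# Steps 3 and 4: find MAX(id) per group and look up v for those ids.
rows = conn.execute("""
    SELECT s.g, s.v
    FROM tab_sorted s
    JOIN (SELECT g, MAX(id) AS id FROM tab_sorted GROUP BY g) m
      ON m.id = s.id
    ORDER BY s.g
""").fetchall()
print(rows)
```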
EDIT
Example using ROW_NUMBER(). (Oracle, SQL Server, PostgreSQL, etc)
SELECT
*
FROM
(
SELECT
ROW_NUMBER() OVER (PARTITION BY g ORDER BY a DESC, b DESC) AS sequence_id,
*
FROM
yourTable
)
AS data
WHERE
sequence_id = 1
This can be solved using a correlated query:
SELECT g, v
FROM tab t
WHERE NOT EXISTS (
SELECT 1
FROM tab
WHERE g = t.g
AND (a > t.a
OR (a = t.a AND b > t.b))
)
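Since AND binds tighter than OR, the grouping of the conditions inside the NOT EXISTS matters. The sketch below runs the check with and without parentheses around the OR branch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and it adds one hypothetical extra row (group 3, 'emu') to the sample data so that the difference becomes visible; both are assumptions of this sketch.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (g INTEGER, a INTEGER, b INTEGER, v TEXT)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?, ?)",
    [
        (1, 3, 5, "foo"), (1, 4, 7, "bar"), (1, 2, 9, "baz"),
        (2, 1, 1, "dog"), (2, 5, 2, "cat"), (2, 5, 3, "horse"),
        (2, 3, 8, "pig"),
        (3, 4, 9, "emu"),  # hypothetical row, not part of the question's data
    ],
)

template = """
    SELECT t.g, t.v
    FROM tab t
    WHERE NOT EXISTS (
        SELECT 1 FROM tab i WHERE {condition}
    )
    ORDER BY t.g
"""

# OR branch parenthesised: both comparisons are restricted to t's own group.
grouped = conn.execute(template.format(
    condition="i.g = t.g AND (i.a > t.a OR (i.a = t.a AND i.b > t.b))"
)).fetchall()

# No parentheses: parsed as (same group AND a greater) OR (a equal AND b
# greater), so the tie-break on b also matches rows from other groups.
ungrouped = conn.execute(template.format(
    condition="i.g = t.g AND i.a > t.a OR (i.a = t.a AND i.b > t.b)"
)).fetchall()

print(grouped)
print(ungrouped)
```

In the ungrouped form 'bar' is dropped from group 1 because 'emu' in group 3 has the same a and a larger b, so group 1 ends up with no row at all.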
The actual question is a little more complex than that, so here goes.
I have a website which reviews games. Ratings/reviews are posted for each game, and so I have a MySQL database to handle it all.
Thing is, I'd really like a page that showed what score (out of 10) meant what, and to illustrate it would have the game that was last reviewed as an example. I can always do it without, but this would be cooler.
So the query should return something like this (but running from 10 to 0):
|---------------*----------------*-----------------*-----------------|
* game.gameName | game.gameImage | review.ourScore | review.postedOn *
|---------------*----------------*-----------------*-----------------|
| Top Game | img | 10 | (unix timestamp)|
| NearlyTop Game| img | 9 | (unix timestamp)|
| Great Game | img | 8 | (unix timestamp)|
|---------------*----------------*-----------------*-----------------|
The information is in two tables, game and review. I think you'd use MAX() to find out the last timestamp and corresponding game information, but as far as complex queries go, I'm in way over my head.
Of course this could be done with 10 simple SELECTs but I'm sure there must be a way to do this in one query.
Thanks for any help.
Here is an ugly solution I found:
This query simply gets the IDs and scores of the reviews that you want to look at. I have included it so that you can understand what the trick is, without getting distracted by other stuff:
SELECT * FROM
(SELECT reviewID, ourScore FROM review ORDER BY postedOn DESC) as `r`
GROUP BY ourScore
ORDER BY ourScore DESC;
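This exploits MySQL's 'GROUP BY' behavior. When the grouping is done, if the source rows in a group have different values for a column that is not grouped on, then in practice the value of the topmost source row is used. Be aware that MySQL does not guarantee this: the documentation says the server is free to choose any value from the group, and with ONLY_FULL_GROUP_BY enabled the query is rejected. So if you had rows in this order: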
reviewId Score
1 3
0 3
2 3
Then after you group by score, the reviewId is 1 because that row was on the top:
reviewId Score
1 3
So we want to put the most recent review on the top before we do the group by. Since ordering is always done after grouping within a single SELECT statement, I had to make a subquery to accomplish this. Now we just dress up this query a little bit to get all the fields you wanted:
SELECT `r`.*, game.gameName, game.gameImage FROM
(SELECT reviewID, ourScore, postedOn, gameID FROM review ORDER BY postedOn DESC) as `r`
JOIN game ON `r`.gameID = game.gameID
GROUP BY ourScore
ORDER BY ourScore DESC;
That should work.
SELECT DISTINCT game.gameName, game.gameImage, review.ourScore FROM game
LEFT JOIN review
ON game.ID = review.gameID
ORDER BY review.postedOn
LIMIT 10
Or something like that, check out how to use the Distinct first, I'm not sure on the syntax, and you may have to tell the ORDER BY DESC or ASC depending on what you want.
Well..
SELECT game.gameName, game.gameImage, review.ourScore
FROM game
LEFT JOIN review ON game.gameID = review.gameID
GROUP BY review.ourScore DESC
LIMIT 10
returns a list of games grouped by each individual score. But this isn't what I want, I want the game that is last posted - this is why the timestamp is important. With that query, MySQL returns the first result it can find.
I think this would work:
select g.gameName, g.gameImage, r.ourScore, r.postedOn
from game g, review r
where g.gameId = r.gameId
and r.postedOn = (select max(sr.postedOn)
from review sr where sr.ourScore = r.ourScore)
group by r.ourScore
order by r.ourScore desc;
Edit: above SQL was corrected after David Grayson's comment. I think this query is pretty easy to understand but probably performs poorly compared with his solution.
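Here is a minimal sketch of that last query on some made-up data, to show what the correlated MAX() subquery selects. It runs on Python's built-in sqlite3 module as a stand-in for MySQL; the column types, the sample games and the timestamps are all assumptions invented for the example.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game (gameID INTEGER PRIMARY KEY, gameName TEXT, gameImage TEXT)"
)
conn.execute(
    "CREATE TABLE review (reviewID INTEGER PRIMARY KEY, gameID INTEGER, "
    "ourScore INTEGER, postedOn INTEGER)"
)
# Hypothetical sample rows: two games scored 10, one game scored 8.
conn.executemany(
    "INSERT INTO game VALUES (?, ?, ?)",
    [(1, "Top Game", "top.png"),
     (2, "Older Top Game", "old.png"),
     (3, "Great Game", "great.png")],
)
conn.executemany(
    "INSERT INTO review VALUES (?, ?, ?, ?)",
    [(1, 2, 10, 1000), (2, 1, 10, 2000), (3, 3, 8, 1500)],
)

# For each review r, the subquery finds the newest timestamp among all
# reviews with the same score; r is kept only if it is that newest review.
query = """
    SELECT g.gameName, g.gameImage, r.ourScore, r.postedOn
    FROM game g
    JOIN review r ON g.gameID = r.gameID
    WHERE r.postedOn = (SELECT MAX(sr.postedOn)
                          FROM review sr
                         WHERE sr.ourScore = r.ourScore)
    ORDER BY r.ourScore DESC
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The sketch leaves out the GROUP BY, so two reviews with the same score and exactly the same timestamp would both be returned; the GROUP BY in the query above is what collapses such a tie to a single row.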