Retrieve distinct values without reducing number of results - mysql

I'm writing a MySQL request for retrieving data from a list of questions.
The table looks like this :
-----------------------------------------------------
| id | answer_name | rating | question_id | answers |
-----------------------------------------------------
Where several rows can have the same answer_name value, since several questions can be asked about the same answer.
Now, for retrieving the data I use a LIMIT clause which is calculated from ratings and the total number of rows.
For example, if I wanna get the data between 80% and 100% of rating, and there are 100 rows, I would use ORDER BY rating LIMIT 80, 20.
My problem is the following: I need to retrieve rows with distinct values in the answer_name column, but a GROUP BY clause reduces the number of rows through aggregation. The top percentages then return nothing, because the LIMIT offset points past the end of the reduced result set.
Does anyone know if there is a way to keep the number of results the same and still retrieve distinct results for the answer_name column?
EDIT :
Here are some sample rows and expected output :
game_data table :
-----------------------------------------------------
| id | answer_name | rating | question_id | answers |
|----|-------------|--------|-------------|---------|
| 1 | A. Merkel | 40 | 1 | [1,2,3] |
| 2 | A. Merkel | 45 | 2 | [2,3,4] |
| 3 | B. Clinton | 55 | 1 | [2,5,8] |
| 4 | B. Clinton | 50 | 2 | [3,5,8] |
| 5 | L. Messi | 17 | 4 | [7,8,9] |
| 6 | L. Messi | 18 | 5 | [7,8,9] |
| 7 | L. Messi | 25 | 6 | [7,8,9] |
| 8 | D. Beckham | 21 | 4 | [6,7,8] |
| 9 | D. Beckham | 52 | 5 | [6,7,8] |
| 10 | D. Beckham | 41 | 6 | [6,7,8] |
-----------------------------------------------------
Where answers is an array of ids referring to another table.
Let's say I wanna retrieve the 50% to 80% of the table, ordered by rating.
SELECT id FROM game_data GROUP BY answer_name ORDER BY rating LIMIT 5, 3
Here the problem is the GROUP BY answer_name is gonna reduce the number of rows of the table, and therefore instead of returning 3 results, will return an empty set.
Also, I want the row selected by the GROUP BY clause to be randomly chosen.

Using group by like this goes against pretty much every instinct, but you said you want random values, so it's good enough.
select * from (
    select q.*, @rank := @rank + 1 as rnk  -- aliased as rnk because RANK is a reserved word in MySQL 8.0+
    from (
        select * from game_data
        group by answer_name
        order by rating desc
    ) q, (select @rank := 0) qq
) qqq
where rnk between (@rank * .5) and (@rank * .8)
How does it work? First (in the innermost query) we group by your answer_name, to get your distinct results, and we order it by the rating as required.
Then in the query wrapping around that one, we give those results a ranking from 1 to however many rows are in the result. Once this level of the query completes, we know our best answer is answer 1, and our 'worst' answer is the last value of our @rank variable.
Then we get to the outermost query. We can use that @rank variable to determine our percentages, which we use to filter the where clause.
In all likelihood this will give you the same results each time you run the same query, but the values chosen are indeterminate - so it could change. If you want truly random (ie changes with each execution) that's a different kettle of fish altogether.
(note, this bit: , (select @rank := 0) qq is purely to initialise the variable)
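
The three steps described above can also be sketched without user variables. What follows is only a minimal sketch, run through Python's sqlite3 module as a stand-in for MySQL. Keeping the lowest id per answer_name is an assumption made here so the output is reproducible (the question asked for a random pick), and the names picked, ranked, rnk and total are illustrative, not from the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game_data (id INTEGER PRIMARY KEY, answer_name TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO game_data VALUES (?, ?, ?)",
    [(1, "A. Merkel", 40), (2, "A. Merkel", 45), (3, "B. Clinton", 55),
     (4, "B. Clinton", 50), (5, "L. Messi", 17), (6, "L. Messi", 18),
     (7, "L. Messi", 25), (8, "D. Beckham", 21), (9, "D. Beckham", 52),
     (10, "D. Beckham", 41)],
)

query = """
WITH picked AS (            -- step 1: one row per answer_name (lowest id, an assumption)
    SELECT g.id, g.answer_name, g.rating
    FROM game_data AS g
    JOIN (SELECT MIN(id) AS id FROM game_data GROUP BY answer_name) AS m
      ON m.id = g.id
),
ranked AS (                 -- step 2: rank 1..N by rating, highest first
    SELECT p.*,
           (SELECT COUNT(*) FROM picked AS q WHERE q.rating > p.rating) + 1 AS rnk,
           (SELECT COUNT(*) FROM picked) AS total
    FROM picked AS p
)
SELECT id, answer_name, rating, rnk   -- step 3: keep the 50%..80% band of ranks
FROM ranked
WHERE rnk BETWEEN total * 0.5 AND total * 0.8
ORDER BY rnk
"""
window_rows = conn.execute(query).fetchall()
for row in window_rows:
    print(row)
```

With the sample rows there are four distinct names, so the band covers ranks 2 to 3.2 and two rows come back. Because the percentages are applied to the number of distinct rows rather than to the full table, the window cannot fall past the end of the result.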

It's simple.
Use GROUP BY id instead of answer_name, because GROUP BY does not return duplicate values.
SELECT * FROM game_data GROUP BY id ORDER BY rating

Related

Leaderboard position SQL optimization

I'm offering an experience leaderboard for a Discord bot I actively develop with stuff like profile cards showing one's rank. The SQL query I'm currently using works flawlessly, however I notice that this query takes a rather long processing time.
SELECT id,
       discord_id,
       discord_tag,
       xp,
       level
FROM (SELECT @rank := @rank + 1 AS id,
             discord_id,
             discord_tag,
             xp,
             level
      FROM profile_xp,
           (SELECT @rank := 0) r
      ORDER BY xp DESC) t
WHERE discord_id = '12345678901';
The table isn't too big (roughly 20k unique records), but this query is taking anywhere between 300-450ms on average, which piles up relatively fast with a lot of concurrent requests.
I was wondering if this query can be optimized to increase performance. I've isolated this to this query, the rest of the MySQL server is responsive and swift.
I'd be happy about any hint and thanks in advance! :)
You're scanning 20,000 rows to assign "row numbers" then selecting exactly one row from it. You can use aggregation instead:
SELECT *, (
SELECT COUNT(*)
FROM profile_xp AS x
WHERE xp > profile_xp.xp
) + 1 AS rnk
FROM profile_xp
WHERE discord_id = '12345678901'
This will give you rank of the player. For dense rank use COUNT(DISTINCT xp). Create an index on xp column if necessary.
Not an answer; too long for a comment:
I usually write this kind of thing exactly the same way that you have done, because it's quick and easy, but actually there's a technical flaw with this method - although it only becomes apparent in certain situations.
By way of illustration, consider the following:
DROP TABLE IF EXISTS ints;
CREATE TABLE ints (i INT NOT NULL PRIMARY KEY);
INSERT INTO ints VALUES
(0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
Your query:
SELECT a.*
, @i:=@i+1 rank
FROM ints a
JOIN (SELECT @i:=0) vars
ORDER
BY RAND() DESC;
+---+------+
| i | rank |
+---+------+
| 3 | 4 |
| 2 | 3 |
| 5 | 6 |
| 1 | 2 |
| 7 | 8 |
| 9 | 10 |
| 4 | 5 |
| 6 | 7 |
| 8 | 9 |
| 0 | 1 |
+---+------+
Look, the result set isn't 'random' at all. rank always corresponds to i (here rank = i + 1).
Now compare that with the following:
SELECT a.*
, @i:=@i+1 rank
FROM
( SELECT * FROM ints ORDER by RAND() DESC) a
JOIN (SELECT @i:=0) vars;
+---+------+
| i | rank |
+---+------+
| 5 | 1 |
| 2 | 2 |
| 8 | 3 |
| 7 | 4 |
| 4 | 5 |
| 6 | 6 |
| 0 | 7 |
| 1 | 8 |
| 3 | 9 |
| 9 | 10 |
+---+------+
Assuming discord_id is the primary key for the table, and you're just trying to get one entry's "rank", you should be able to take a different approach.
SELECT px.discord_id, px.discord_tag, px.xp, px.level
, 1 + COUNT(leaders.xp) AS rank
, 1 + COUNT(DISTINCT leaders.xp) AS altRank
FROM profile_xp AS px
LEFT JOIN profile_xp AS leaders ON px.xp < leaders.xp
WHERE px.discord_id = '12345678901'
GROUP BY px.discord_id, px.discord_tag, px.xp, px.level
;
Note I have "rank" and "altRank". rank should give you a similar position to what you were originally looking for. Your original results could have fluctuated for ties; this rank will always put tied players at their highest tied position. If 3 records tie for 2nd place, those (queried separately with this) will show 2nd place, and the next xp down would show 5th place (assuming 1 in 1st, 2,3,4 in 2nd, 5 in 5th). The altRank would "close the gaps", putting 5 in the 3rd place "group".
I would also recommend an index on xp to speed this up further.
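
The tie behaviour described above can be checked on a tiny data set. This is a sketch only, using Python's sqlite3 module as a stand-in for MySQL; the five players and their xp values are made up to match the "1 in 1st, three tied for 2nd" scenario, and the aliases rnk and alt_rnk plus the index name are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profile_xp (discord_id TEXT PRIMARY KEY, xp INTEGER)")
conn.execute("CREATE INDEX idx_profile_xp_xp ON profile_xp (xp)")  # index on xp, as recommended
conn.executemany(
    "INSERT INTO profile_xp VALUES (?, ?)",
    [("a", 900), ("b", 700), ("c", 700), ("d", 700), ("e", 500)],  # hypothetical rows
)

RANK_SQL = """
SELECT 1 + COUNT(leaders.xp)          AS rnk,      -- players strictly ahead, plus one
       1 + COUNT(DISTINCT leaders.xp) AS alt_rnk   -- distinct xp values ahead, plus one
FROM profile_xp AS px
LEFT JOIN profile_xp AS leaders ON px.xp < leaders.xp
WHERE px.discord_id = ?
GROUP BY px.discord_id
"""

def rank_of(discord_id):
    """Return (rnk, alt_rnk) for a single player."""
    return conn.execute(RANK_SQL, (discord_id,)).fetchone()

ranks = {player: rank_of(player) for player in "abcde"}
print(ranks)
```

The LEFT JOIN matters for the top player: with no rows ahead, the join still yields one row with NULLs, and COUNT ignores NULLs, so the rank comes out as 1 rather than the query returning nothing.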

How do I combine two queries on the same table to get a single result set in MySQL

I am not very good at sql but I am getting there. I have searched stackoverflow but I can't seem to find the solution and I hope someone out there can help me. I have a table (users) with data like the following. The book_id column is a key to another table that contains a book the user is subscribed to.
|--------|---------------------|------------------|
| id | book_id | name |
|--------|---------------------|------------------|
| 1 | 1 | jim |
| 2 | 1 | joyce |
| 3 | 1 | mike |
| 4 | 1 | eleven |
| 5 | 2 | max |
| 6 | 2 | dustin |
| 7 | 2 | lucas |
|--------|---------------------|------------------|
I have a function in my PHP code that returns two random users from a specific book id (either 1 or 2). Query one returns the result in column 1 and result two returns the results in column 2 like:
|---------------------|------------------|
| 1 | 2 |
|---------------------|------------------|
| jim | max |
| joyce | dustin |
|---------------------|------------------|
I have achieved this by running two separate queries as seen below. I want to know if it's possible to achieve this functionality with one query and how.
$random_users_with_book_id_1 = SELECT name FROM users WHERE book_id=1 LIMIT 2
$random_users_with_book_id_2 = SELECT name FROM users WHERE book_id=2 LIMIT 2
Again, I apologise if it's too specific. The query below has been closest to what I was trying to achieve.:
SELECT a.name AS book_id_1, b.name AS book_id_2
FROM users a, users b
WHERE a.book_id=1 AND b.book_id = 2
LIMIT 2
EDIT: I have created a fiddle to play around with this. I appreciate any help! Thank you!! http://sqlfiddle.com/#!9/7fcbca/1
It is easy actually :)
you can use UNION like this:
SELECT * FROM (
(SELECT * FROM user WHERE n_id=1 LIMIT 2)
UNION
(SELECT * FROM user WHERE n_id=2 LIMIT 2))
collection;
If you read the documentation, you can use parentheses to group the individual queries and then apply the UNION between them. Without the parentheses, the LIMIT 2 would apply to the combined result and show only the first two rows. Ref. "To apply ORDER BY or LIMIT to an individual SELECT, place the clause inside the parentheses that enclose the SELECT:"
If you want to combine the queries in MySQL, you can just use parentheses:
(SELECT name
FROM users
WHERE n_id = 1
LIMIT 2
) UNION ALL
(SELECT name
FROM users
WHERE n_id = 2
LIMIT 2
);
First, only use UNION if you specifically want to incur the overhead of removing duplicates. Otherwise, use UNION ALL.
Second, this does not return random rows. This returns arbitrary rows. In many cases, this might be two rows near the beginning of the data. If you want random rows, then use ORDER BY rand():
(SELECT name
FROM users
WHERE n_id = 1
ORDER by rand()
LIMIT 2
) UNION ALL
(SELECT name
FROM users
WHERE n_id = 2
ORDER BY rand()
LIMIT 2
);
There are other methods that are more efficient, but this should be fine for up to a few thousand rows.
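
The UNION ALL queries return the names as one column, while the question asked for one column per book. One possible way to get there, sketched here under assumptions, is to keep the book id in each branch and pair the rows up in application code. The sketch uses Python's sqlite3 module as a stand-in for MySQL and PHP, so it writes RANDOM() instead of rand() and wraps each limited branch in a subquery, because SQLite does not accept parentheses around the members of a UNION. It uses the book_id column name from the question's table rather than n_id.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, book_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, 1, "jim"), (2, 1, "joyce"), (3, 1, "mike"), (4, 1, "eleven"),
     (5, 2, "max"), (6, 2, "dustin"), (7, 2, "lucas")],
)

# Each branch sorts randomly and applies its own LIMIT before the union.
query = """
SELECT * FROM (SELECT book_id, name FROM users WHERE book_id = 1 ORDER BY RANDOM() LIMIT 2)
UNION ALL
SELECT * FROM (SELECT book_id, name FROM users WHERE book_id = 2 ORDER BY RANDOM() LIMIT 2)
"""

names_by_book = defaultdict(list)
for book_id, name in conn.execute(query):
    names_by_book[book_id].append(name)

# Pair the two lists row by row to get one column per book.
paired = list(zip(names_by_book[1], names_by_book[2]))
print(paired)
```

Carrying book_id through the union is what lets the caller tell the two branches apart; without it the four names arrive as one undifferentiated list.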

sql - Why doesn't MAX() of SUM() work?

I am trying to understand why the SQL command MAX(SUM(col)) gives a syntax error. I have the two tables below:
+--------+--------+---------+-------+
| pname | rollno | address | score |
+--------+--------+---------+-------+
| A | 1 | CCU | 1234 |
| B | 2 | CCU | 2134 |
| C | 3 | MMA | 4321 |
| D | 4 | MMA | 1122 |
| E | 5 | CCU | 1212 |
+--------+--------+---------+-------+
Personnel Table
+--------+-------+----------+
| rollno | marks | sub |
+--------+-------+----------+
| 1 | 90 | SUB1 |
| 1 | 88 | SUB2 |
| 2 | 89 | SUB1 |
| 2 | 95 | SUB2 |
| 3 | 99 | SUB1 |
| 3 | 99 | SUB2 |
| 4 | 82 | SUB1 |
| 4 | 79 | SUB2 |
| 5 | 92 | SUB1 |
| 5 | 75 | SUB2 |
+--------+-------+----------+
Results Table
Essentially I have a details table and a results table. I want to find the name and marks of the candidate who has got the highest score in SUB1 and SUB2 combined. Basically the person with the highest aggregate marks.
I can find the summation of SUB1 and SUB2 for all candidates using the following query-:
select p.pname, sum(r.marks) from personel p,
result r where p.rollno=r.rollno group by p.pname;
It gives the following output-:
+--------+--------------+
| pname | sum(r.marks) |
+--------+--------------+
| A | 178 |
| B | 167 |
| C | 184 |
| D | 198 |
| E | 161 |
+--------+--------------+
This is fine but I need the output to be only D | 198 as he is the highest scorer. Now when I modify query like the following it fails-:
select p.pname, max(sum(r.marks)) from personel p,
result r where p.rollno=r.rollno group by p.pname;
In MySQL I get the error of Invalid Group Function.
Now searching on SO I did get my correct answer which uses derived tables. I get my answer by using the following query-:
SELECT
pname, MAX(max_sum)
FROM
(SELECT
p.pname AS pname, SUM(r.marks) AS max_sum
FROM
personel p, result r
WHERE
p.rollno = r.rollno
GROUP BY p.pname) a;
But my question is Why doesn't MAX(SUM(col)) work ?
I don't understand why max can't compute the value returned by SUM(). Now an answer on SO stated that since SUM() returns only a single value so MAX() find its meaningless to compute the value of one value, but I have tested the following query -:
select max(foo) from a;
on the Table "a" which has only one row with only one column called foo that holds an integer value. So if MAX() can't compute single values then how did this work ?
Can someone explain to me how the query processor executes the query and why I get the error of invalid group function ? From the readability point of view using MAX(SUM(col)) is perfect but it doesn't work out that way. I want to know why.
Are MAX and SUM never to be used together? I am asking because I have seen queries like MAX(COUNT(col)). I don't understand how that works and not this.
Aggregate functions require an argument that provides a value for each row in the group. The result of another aggregate function does not do that: it is a single value for the whole group.
It's not very sensical anyway. Suppose MySQL accepted MAX(SUM(col)) -- what would it mean? Well, the SUM(col) yields the sum of all non-NULL values of column col over all rows in the relevant group, which is a single number. You could take the MAX() of that to be that same number, but what would be the point?
Your approach using a subquery is different, at least in principle, because it aggregates twice. The inner aggregation, in which you perform the SUM(), computes a separate sum for each value of p.pname. The outer query then computes the maximum across all rows returned by the subquery (because you do not specify a GROUP BY in the outer query). If that's what you want, that's how you need to specify it.
The error is 1111: invalid use of group function. As for why specifically MySQL has this problem I can really only say it is part of the underlying engine itself. SELECT MAX(2) does work (in spite of a lack of a GROUP BY) but SELECT MAX(SUM(2)) does not work.
This error will occur when grouping/aggregating functions such as MAX are used in the wrong spot such as in a WHERE clause. SELECT SUM(MAX(2)) also does not work.
You can imagine that MySQL attempts to aggregate both simultaneously rather than doing things in an order of operations, i.e. it does not SUM first and then get the MAX. This is why you need to do the queries as separate steps.
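
The "separate steps" idea can be sketched as follows. This is only an illustration, run through Python's sqlite3 module as a stand-in for MySQL (SQLite rejects the nested form too, with its own error message rather than MySQL's 1111). The single-table result(pname, marks) layout and its rows are hypothetical, simplified from the two tables in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (pname TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO result VALUES (?, ?)",
    [("X", 90), ("X", 88), ("Y", 99), ("Y", 99), ("Z", 82), ("Z", 79)],  # made-up rows
)

# Nesting one aggregate directly inside another is rejected.
try:
    conn.execute("SELECT pname, MAX(SUM(marks)) FROM result GROUP BY pname").fetchall()
    nested_error = None
except sqlite3.OperationalError as exc:
    nested_error = str(exc)
print("nested aggregate:", nested_error)

# Step 1 (derived table): one SUM per pname. Step 2 (outer query): MAX over those sums.
highest_total = conn.execute("""
    SELECT MAX(total)
    FROM (SELECT pname, SUM(marks) AS total FROM result GROUP BY pname) AS sums
""").fetchone()[0]

# To get the name as well, sort the per-name sums and keep the first row.
top_scorer = conn.execute("""
    SELECT pname, SUM(marks) AS total
    FROM result
    GROUP BY pname
    ORDER BY total DESC
    LIMIT 1
""").fetchone()
print(highest_total, top_scorer)
```

The ORDER BY ... LIMIT 1 form is used for the name because an outer MAX() alone only guarantees the maximum value, not that a non-aggregated pname selected next to it belongs to the same row.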
Try something like this:
select max(rs.marksums) maxsum from
(
select p.pname, sum(r.marks) marksums from personel p,
result r where p.rollno=r.rollno group by p.pname
) rs
with temp_table (name, max_marks) as
(select p.pname, sum(r.marks) from personel p, result r where p.rollno = r.rollno group by p.pname)
select * from temp_table where max_marks = (select max(max_marks) from temp_table);
I didn't run this. But try this one. Hope it will work :)

What is SQL to select a property and the max number of occurrences of a related property?

I have a table like this:
Table: p
+----+------+
| id | w_id |
+----+------+
|  5 |    8 |
|  5 |   10 |
|  5 |    8 |
|  5 |   10 |
|  5 |    8 |
|  6 |    5 |
|  6 |    8 |
|  6 |   10 |
|  6 |   10 |
|  7 |    8 |
|  7 |   10 |
+----+------+
What is the best SQL to get the following result? :
+----+----------------+
| id | most_used_w_id |
+----+----------------+
|  5 |              8 |
|  6 |             10 |
|  7 |              8 |
+----+----------------+
In other words, to get, per id, the most frequent related w_id.
Note that on the example above, id 7 is related to 8 once and to 10 once.
So, either (7, 8) or (7, 10) will do as result. If it is not possible to
pick up one, then both (7, 8) and (7, 10) on result set will be ok.
I have come up with something like:
select counters2.p_id as id, counters2.w_id as most_used_w_id
from (
select p.id as p_id,
w_id,
count(w_id) as count_of_w_ids
from p
group by id, w_id
) as counters2
join (
select p_id, max(count_of_w_ids) as max_counter_for_w_ids
from (
select p.id as p_id,
w_id,
count(w_id) as count_of_w_ids
from p
group by id, w_id
) as counters
group by p_id
) as p_max
on p_max.p_id = counters2.p_id
and p_max.max_counter_for_w_ids = counters2.count_of_w_ids
;
but I am not sure at all whether this is the best way to do it. And I had to repeat the same sub-query two times.
Any better solution?
Try to use User defined variables
select id,w_id
FROM
( select T.*,
    if(@id<>id,1,0) as row,
    @id:=id FROM
    (
      select id,W_id, Count(*) as cnt FROM p Group by ID,W_id
    ) as T,(SELECT @id:=0) as T1
  ORDER BY id,cnt DESC
) as T2
WHERE Row=1
Formal SQL
In fact, your solution is correct in terms of standard SQL. Why? Because you have to join values from the original data to the grouped data, so your query cannot be simplified. MySQL allows mixing non-grouped columns with group functions, but that is totally unreliable, so I do not recommend relying on it.
MySQL
Since you're using MySQL, you can use variables. I'm not a big fan of them, but for your case they may be used to simplify things:
SELECT
  c.*,
  IF(@id!=id, @i:=1, @i:=@i+1) AS num,
  @id:=id AS gid
FROM
  (SELECT id, w_id, COUNT(w_id) AS w_count
   FROM t
   GROUP BY id, w_id
   ORDER BY id DESC, w_count DESC) AS c
  CROSS JOIN (SELECT @i:=-1, @id:=-1) AS init
HAVING
  num=1;
So for your data result will look like:
+------+------+---------+------+------+
| id | w_id | w_count | num | gid |
+------+------+---------+------+------+
| 7 | 8 | 1 | 1 | 7 |
| 6 | 10 | 2 | 1 | 6 |
| 5 | 8 | 3 | 1 | 5 |
+------+------+---------+------+------+
Thus, you've found your id and corresponding w_id. The idea is - to count rows and enumerate them, paying attention to the fact, that we're ordering them in subquery. So we need only first row (because it will represent data with highest count).
This may be replaced with single GROUP BY id - but, again, server is free to choose any row in that case (it will work because it will take first row, but documentation says nothing about that for common case).
One little nice thing about this is - you can select, for example, 2-nd by frequency or 3-rd, it's very flexible.
Performance
To increase performance, you can create an index on (id, w_id); it will be used for ordering and grouping records. The variables and HAVING, however, will produce a row-by-row scan of the set derived by the inner GROUP BY. That is not as bad as a full scan of the original data, but it is still a drawback of doing this with variables. On the other hand, doing it with a JOIN and subquery as in your query won't be much different, because a temporary table is created for the subquery result set too.
But to be certain, you'll have to test. And keep in mind - you already have valid solution, which, by the way, isn't bound to DBMS-specific stuff and is good in terms of common SQL.
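
If the repeated sub-query is the main annoyance, a common table expression lets the join-based query name it once and reference it twice. MySQL supports WITH from version 8.0 onwards. The sketch below runs the idea through Python's sqlite3 module as a stand-in for MySQL, on the rows of table p from the question; the names counters, cnt and max_cnt are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE p (id INTEGER, w_id INTEGER)")
conn.executemany(
    "INSERT INTO p VALUES (?, ?)",
    [(5, 8), (5, 10), (5, 8), (5, 10), (5, 8),
     (6, 5), (6, 8), (6, 10), (6, 10),
     (7, 8), (7, 10)],
)

query = """
WITH counters AS (                      -- written once, referenced twice
    SELECT id, w_id, COUNT(*) AS cnt
    FROM p
    GROUP BY id, w_id
)
SELECT c.id, c.w_id AS most_used_w_id
FROM counters AS c
JOIN (SELECT id, MAX(cnt) AS max_cnt FROM counters GROUP BY id) AS m
  ON m.id = c.id AND m.max_cnt = c.cnt   -- keep only the most frequent w_id per id
ORDER BY c.id, c.w_id
"""
most_used = conn.execute(query).fetchall()
print(most_used)
```

For id 7 both w_id values occur once, so both rows are returned, which the question states is acceptable.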
Try this query
select p_id, ccc , w_id from
(
select p.id as p_id,
w_id, count(w_id) ccc
from p
group by id,w_id order by id,ccc desc) xxx
group by p_id having max(ccc)
You can also use this code if you do not want to rely on the first record of non-grouping columns
select p_id, ccc , w_id from
(
select p.id as p_id,
w_id, count(w_id) ccc
from p
group by id,w_id order by id,ccc desc) xxx
group by p_id having ccc=max(ccc);

Values in same row of groupwise maximum

I've got a table with the most common colors in images. It looks something like this:
file | color | count
---------------------
1 | ffefad | 166
1 | 443834 | 84
2 | 74758a | 3874
2 | abcdef | 228
2 | 876543 | 498
3 | 543432 | 3382
3 | abcdef | 483
I'm trying to get the most common color for each image. So I'd like my result to be:
file | color | count
---------------------
1 | ffefad | 166
2 | 74758a | 3874
3 | 543432 | 3382
So my problem seems to be that I need to GROUP BY the file column, but MAX() the count column. But simply
SELECT h.file, h.color, MAX(h.count) FROM histogram h GROUP BY h.file
isn't working because it's indeterminate, so the color result won't match the row from the count result.
SELECT h.file, h.color, MAX(h.count) FROM histogram h GROUP BY h.file, h.color
fixes the determinacy, but now every row is "unique" and all rows are returned.
I can't figure out a way to do a subquery or join, since the only "correct" values I can figure to get, file and count, are not distinct by themselves.
Perhaps I need a saner schema? It's "my" table so I can change that if need be.
SELECT tbl.file, tbl.color, tbl.count
FROM tbl
LEFT JOIN tbl as lesser
ON lesser.file = tbl.file
AND tbl.count < lesser.count
WHERE lesser.file IS NULL
order by tbl.file
select h.file, max(h.count)
FROM histogram h
GROUP BY h.file
This will give the max(count) by file. Turn it into a subquery and inner join so it acts as a filter.
select h.file, h.color, h.count
from histogram h inner join
(select file, max(count) as maxcount
FROM histogram
GROUP BY file) a
on a.file = h.file and a.maxcount = h.count
This will respond with 2 rows if there are more than 1 colour with the same max count.
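
Both approaches from the answers above, the self LEFT JOIN and the join back to the per-file maximum, can be compared on the sample histogram. This is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL; the alias bigger is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE histogram (file INTEGER, color TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO histogram VALUES (?, ?, ?)",
    [(1, "ffefad", 166), (1, "443834", 84), (2, "74758a", 3874),
     (2, "abcdef", 228), (2, "876543", 498), (3, "543432", 3382),
     (3, "abcdef", 483)],
)

# Approach 1: anti-join. Keep a row only if no row of the same file has a larger count.
anti_join = conn.execute("""
    SELECT h.file, h.color, h.count
    FROM histogram AS h
    LEFT JOIN histogram AS bigger
      ON bigger.file = h.file AND bigger.count > h.count
    WHERE bigger.file IS NULL
    ORDER BY h.file
""").fetchall()

# Approach 2: compute MAX(count) per file, then join back to fetch the matching row.
max_join = conn.execute("""
    SELECT h.file, h.color, h.count
    FROM histogram AS h
    JOIN (SELECT file, MAX(count) AS maxcount FROM histogram GROUP BY file) AS a
      ON a.file = h.file AND a.maxcount = h.count
    ORDER BY h.file
""").fetchall()

print(anti_join)
print(max_join)
```

On this data the two queries return the same three rows. Both also return every tied row when several colors share the maximum count for a file.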