Fast group rank() function - mysql

There are various ways people try to emulate MSSQL RANK() or ROW_NUMBER() functions in MySQL, but all of them I've tried so far are slow.
I have a table that looks like this:
CREATE TABLE ratings
(`id` int, `category` varchar(1), `rating` int)
;
INSERT INTO ratings
(`id`, `category`, `rating`)
VALUES
(3, '*', 54),
(4, '*', 45),
(1, '*', 43),
(2, '*', 24),
(2, 'A', 68),
(3, 'A', 43),
(1, 'A', 12),
(3, 'B', 22),
(4, 'B', 22),
(4, 'C', 44)
;
Except the real table has 220,000 records and about 90,000 unique ids.
First, I want to rank the ids within each category other than *, where a higher rating gets a lower rank number.
SELECT g1.id,
g1.category,
g1.rating,
Count(*) AS rank
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'
GROUP BY g1.id,
g1.category,
g1.rating
ORDER BY g1.category,
rank
Output:
id category rating rank
2 A 68 1
3 A 43 2
1 A 12 3
4 B 22 1
3 B 22 2
4 C 44 1
Then I wanted to take the smallest rank an id had, and average that with the rank they have within the * category. Giving a total query of:
SELECT X1.id,
(X1.rank + X2.minrank) / 2 AS OverallRank
FROM
(SELECT g1.id,
g1.category,
g1.rating,
Count(*) AS rank
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category = '*'
GROUP BY g1.id,
g1.category,
g1.rating
ORDER BY g1.category,
rank) X1
JOIN
(SELECT id,
Min(rank) AS MinRank
FROM
(SELECT g1.id,
g1.category,
g1.rating,
Count(*) AS rank
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'
GROUP BY g1.id,
g1.category,
g1.rating
ORDER BY g1.category,
rank) X
GROUP BY id) X2 ON X1.id = X2.id
ORDER BY overallrank
Giving me
id OverallRank
3 1.5000
4 1.5000
2 2.5000
1 3.0000
This query is correct and gives the output I want, but it just hangs on my real table of 220,000 records. How can I optimize it? My real table has indexes on (id, rating), (category), and (id, category).
Edit:
Result of SHOW CREATE TABLE ratings:
CREATE TABLE `rating` (
`id` int(11) NOT NULL,
`category` varchar(255) NOT NULL,
`rating` int(11) NOT NULL DEFAULT '1500',
`rd` int(11) NOT NULL DEFAULT '350',
`vol` float NOT NULL DEFAULT '0.06',
`wins` int(11) NOT NULL,
`losses` int(11) NOT NULL,
`streak` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`streak`,`rd`,`id`,`category`),
UNIQUE KEY `id_category` (`id`,`category`),
KEY `rating` (`rating`,`rd`),
KEY `streak_idx` (`streak`),
KEY `category_idx` (`category`),
KEY `id_rating_idx` (`id`,`rating`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
The PRIMARY KEY is the most common use case of queries to this table, that is why it's the clustered key. It's worth noting that the server is a raid 10 of SSDs with a 9GB/s FIO random read. So I don't suspect the indices not being clustered will affect much.
Output of (select count(distinct category) from ratings) is 50
In case the problem lies in the data itself or in an oversight on my part, I have included an export of the entire table. It is only 200KB zipped: https://www.dropbox.com/s/p3iv23zi0uzbekv/ratings.zip?dl=0
The first query takes 27 seconds to run

You can use temporary tables with an AUTO_INCREMENT column to generate ranks (row number).
For example - to generate ranks for the '*' category:
drop temporary table if exists tmp_main_cat_rank;
create temporary table tmp_main_cat_rank (
rank int unsigned auto_increment primary key,
id int NOT NULL
) engine=memory
select null as rank, id
from ratings r
where r.category = '*'
order by r.category, r.rating desc, r.id desc;
This runs in something like 30 msec, while your approach with the selfjoin takes 45 seconds on my machine. Even with a new index on (category, rating, id) it still takes 14 seconds to run.
Generating ranks per group (per category) is a bit more complicated. We can still use an AUTO_INCREMENT column, but will need to calculate and subtract an offset per category:
drop temporary table if exists tmp_pos;
create temporary table tmp_pos (
pos int unsigned auto_increment primary key,
category varchar(50) not null,
id int NOT NULL
) engine=memory
select null as pos, category, id
from ratings r
where r.category <> '*'
order by r.category, r.rating desc, r.id desc;
drop temporary table if exists tmp_cat_offset;
create temporary table tmp_cat_offset engine=memory
select category, min(pos) - 1 as `offset`
from tmp_pos
group by category;
select t.id, min(t.pos - o.offset) as min_rank
from tmp_pos t
join tmp_cat_offset o using(category)
group by t.id
This runs in about 220 msec. The selfjoin solution takes 42 sec or 13 sec with the new index.
Now you just need to combine the last query with the first temp table, to get your final result:
select t1.id, (t1.min_rank + t2.rank) / 2 as OverallRank
from (
select t.id, min(t.pos - o.offset) as min_rank
from tmp_pos t
join tmp_cat_offset o using(category)
group by t.id
) t1
join tmp_main_cat_rank t2 using(id);
Overall runtime is ~280 msec without an additional index and ~240 msec with an index on (category, rating, id).
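To make the offset idea easier to follow, here is a minimal sketch of the same logic outside the database, in plain Python. It is not part of the original answer: the function names are made up for illustration, and `enumerate` stands in for the AUTO_INCREMENT column. It uses the sample rows from the question.

```python
# Sample rows from the question: (id, category, rating).
ratings = [
    (3, "*", 54), (4, "*", 45), (1, "*", 43), (2, "*", 24),
    (2, "A", 68), (3, "A", 43), (1, "A", 12),
    (3, "B", 22), (4, "B", 22),
    (4, "C", 44),
]


def rank_within_categories(rows):
    """Return {(id, category): rank} using one sort plus a per-category offset."""
    # Same ordering as the temp-table INSERT: category, rating desc, id desc.
    ordered = sorted(rows, key=lambda r: (r[1], -r[2], -r[0]))
    first_pos = {}  # min(pos) per category, as in tmp_cat_offset
    ranks = {}
    # pos plays the role of the AUTO_INCREMENT column (1-based).
    for pos, (id_, category, _rating) in enumerate(ordered, start=1):
        first_pos.setdefault(category, pos)
        offset = first_pos[category] - 1
        ranks[(id_, category)] = pos - offset
    return ranks


def overall_rank(rows, main_category="*"):
    """Average the rank in main_category with the best rank elsewhere."""
    main_rank = {}
    min_rank = {}
    for (id_, category), rank in rank_within_categories(rows).items():
        if category == main_category:
            main_rank[id_] = rank
        else:
            min_rank[id_] = min(rank, min_rank.get(id_, rank))
    # Inner join on id, like the final query.
    return {id_: (main_rank[id_] + min_rank[id_]) / 2
            for id_ in main_rank if id_ in min_rank}


result = overall_rank(ratings)
```

On the sample data this gives 1.5 for ids 3 and 4, 2.5 for id 2 and 3.0 for id 1, the same values as the OverallRank output in the question. The point is that one sort replaces the row-by-row comparison of the selfjoin.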
A note on the selfjoin approach: It's an elegant solution and performs fine with a small group size. It's fast with an average group size <= 2. It can be acceptable for a group size of 10. But you have an average group size of 447 (count(*) / count(distinct category)). That means every row is joined with 447 other rows (on average). You can see the impact by removing the group by clause:
SELECT Count(*)
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'
The result is more than 10M rows.
However - with an index on (category, rating, id) your query runs in 33 seconds on my machine.
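A rough way to estimate that blow-up without running the query, as a sketch in Python. It assumes (rating, id) is unique within each category, so a row with rank k matches exactly k rows of its own category (itself included). The function name and the "50 groups of 447" figure are illustrative assumptions, not measurements from the real table.

```python
from collections import Counter


def selfjoin_pairs(categories):
    """Rows produced by the >= selfjoin: n * (n + 1) / 2 per category of size n."""
    sizes = Counter(categories)
    return sum(n * (n + 1) // 2 for n in sizes.values())


# Non-* rows of the question's sample data: A has 3 rows, B has 2, C has 1.
sample = ["A", "A", "A", "B", "B", "C"]
pairs = selfjoin_pairs(sample)

# Hypothetical case: 50 groups of exactly 447 rows each.
uniform = 50 * (447 * 448 // 2)
```

For the sample this is 6 + 3 + 1 = 10 joined rows; for the hypothetical uniform case it is about 5 million. Since the count grows quadratically with group size, uneven group sizes push the total higher for the same number of rows, which fits the "more than 10M" observed above.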

Related

get max, min, count and mode (occurrence)

I have an items table, and I want a query that returns, for each item category, the minimum price, the maximum price, the most frequent max_price (the mode), and the number of items, ignoring NULL prices. Here is my items table:
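| id | category | min_price | max_price |
|----|----------|-----------|-----------|
| 1 | kids | 10 | 100 |
| 2 | adult | 20 | 200 |
| 3 | both | null | null |
| 4 | adult | 20 | 100 |
| 5 | adult | 50 | 100 |
| 6 | adult | 50 | 200 |
| 7 | kids | 20 | 100 |
| 8 | both | 20 | 100 |
| 9 | kids | null | null |
| 10 | adult | 10 | 500 |
| 11 | misc | null | null |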
I want the query to return this result:
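| category | min_price | max_price | price_mode | no_items |
|----------|-----------|-----------|------------|----------|
| kids | 10 | 100 | 100 | 3 |
| adult | 20 | 500 | 200 | 5 |
| both | 20 | 100 | 100 | 2 |
| misc | null | null | null | 1 |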
To explain further: for adult, the lowest price is 20 and the highest is 500. The max_price values 100 and 200 each occur twice; in that case I want the higher one as the price_mode, which is 200 here. no_items is just the count of how many times adult appears in the table.
I'm honestly struggling to get the mode and to group it correctly to get the output I want.
Below are the commands to create the table and fill it with data. I tried to put it in SqlFiddle but that isn't working for me, and I don't know why.
CREATE TABLE IF NOT EXISTS `items` (
`id` int(6) unsigned NOT NULL,
`category` TEXT NOT NULL,
`min_price` FLOAT DEFAULT NULL,
`max_price` FLOAT DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `items` (`id`, `category`, `min_price`, `max_price`) VALUES
('kids', 10, 100),
('adult', 20, 200),
('both', null, null),
('adult', 20, 100),
('adult', 50, 100),
('adult', 50, 200),
('kids', 20, 100),
('both', 20, 100),
('kids', null, null),
('adult', 10, 500),
('misc', null, null);
Your create table + insert data syntax doesn't work in the fiddle because your VALUES rows supply only 3 values, whereas the INSERT lists 4 columns:
INSERT INTO `items` (`id`, `category`, `min_price`, `max_price`) VALUES
('kids' , 10 , 100),
/*where's the value for `id`?*/
...
If you remove id from the INSERT syntax, it won't work as well because you've set it as PRIMARY KEY so it can't be empty. What you can do in addition to removing id from INSERT is to define AUTO_INCREMENT on the id column:
CREATE TABLE IF NOT EXISTS `items` (
`id` int(6) unsigned NOT NULL AUTO_INCREMENT,
....
Now, to get the expected result on your price_mode, you may want to try using GROUP_CONCAT() with ORDER and define which of the data in there that you want to return. Let's say you do GROUP_CONCAT(max_price ORDER BY max_price DESC) to return the set with max_price in descending order like this:
SELECT category,
MIN(min_price),
MAX(max_price),
GROUP_CONCAT(max_price ORDER BY max_price DESC),
COUNT(*)
FROM items
GROUP BY category;
Then you'll get a result like this:
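| category | MIN(min_price) | MAX(max_price) | GROUP_CONCAT(max_price ORDER BY max_price DESC) | COUNT(*) |
|----------|----------------|----------------|--------------------------------------------------|----------|
| adult | 10 | 500 | 500,200,200,100,100 | 5 |
| both | 20 | 100 | 100 | 2 |
| kids | 10 | 100 | 100,100 | 3 |
| misc | NULL | NULL | NULL | 1 |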
So, there's a consistent pattern in the GROUP_CONCAT() result that you can probably work with. Assuming that you want the second largest value in the set, you can apply SUBSTRING_INDEX() twice to get it like this:
SELECT category,
MIN(min_price) AS min_price,
MAX(max_price) AS max_price,
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(max_price ORDER BY max_price DESC),',',2),',',-1)
AS price_mode,
COUNT(*) AS no_items
FROM items
GROUP BY category;
This returns the following result:
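| category | min_price | max_price | price_mode | no_items |
|----------|-----------|-----------|------------|----------|
| adult | 10 | 500 | 200 | 5 |
| both | 20 | 100 | 100 | 2 |
| kids | 10 | 100 | 100 | 3 |
| misc | NULL | NULL | NULL | 1 |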
The following is an updated suggestion after getting further clarification:
SELECT i.category,
MIN(i.min_price),
MAX(i.max_price),
v2.mp AS price_mode,
COUNT(DISTINCT i.id)
FROM items i
LEFT JOIN
(SELECT cat,
mp,
cnt,
CASE WHEN cat = @cat
THEN @rownum := @rownum + 1
ELSE @rownum:=1 END AS rownum,
@cat := cat
FROM
(SELECT category cat,
max_price mp,
COUNT(*) cnt
FROM items
GROUP BY category,
max_price) v1
CROSS JOIN (SELECT @rownum := 1,
@cat := NULL) seq
WHERE mp IS NOT NULL
ORDER BY cat, cnt DESC, mp DESC) v2
ON i.category=v2.cat
AND v2.rownum=1
GROUP BY i.category, v2.mp;
The query starts by computing COUNT(*) for each combination of category and max_price. It then assigns a custom row number to each of those rows, with a WHERE condition that filters out the rows whose max_price is NULL. The crucial part is probably ORDER BY cat, cnt DESC, mp DESC, since the row numbers are assigned in that order; without it, the numbering will be wrong. Finally, the items table is LEFT JOINed to that result with the condition ON i.category=v2.cat AND v2.rownum=1. It's important that v2.rownum=1 is placed in the ON condition rather than in WHERE so that the misc row is still returned, since the subquery produces no row for misc with the present sample data.
Here's an updated fiddle for reference, including the sample of 3 adult=NULL.
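For checking the expected price_mode values independently of the SQL, here is a small Python sketch of the same rule: most frequent non-NULL max_price per category, with ties going to the highest price. The function name is made up for illustration; the rows are the question's sample data without the id column.

```python
from collections import Counter

# (category, min_price, max_price); None stands for NULL.
items = [
    ("kids", 10, 100), ("adult", 20, 200), ("both", None, None),
    ("adult", 20, 100), ("adult", 50, 100), ("adult", 50, 200),
    ("kids", 20, 100), ("both", 20, 100), ("kids", None, None),
    ("adult", 10, 500), ("misc", None, None),
]


def price_modes(rows):
    """Return {category: mode of max_price}, or None when every price is NULL."""
    counts = {}
    for category, _min_price, max_price in rows:
        counter = counts.setdefault(category, Counter())
        if max_price is not None:  # same effect as WHERE mp IS NOT NULL
            counter[max_price] += 1
    modes = {}
    for category, counter in counts.items():
        if counter:
            # Highest count wins; ties are broken by the highest price,
            # mirroring ORDER BY cnt DESC, mp DESC.
            modes[category] = max(counter, key=lambda p: (counter[p], p))
        else:
            modes[category] = None
    return modes


modes = price_modes(items)
```

For adult, 100 and 200 both occur twice, so the tie-break picks 200; misc has only NULL prices, so its mode is None.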
Maybe this query will help
with maximumvaluecounts
as ( select
count(max_price) as c, category, max_price
from yourtable
group by category, max_price -- one row per (category, max_price) pair
),
maximumcountpercategory
as ( select category,max(c) as c
from maximumvaluecounts
group by category
),
modes as ( select m1.category, max(m2.max_price) as modevalue -- highest price wins a tie
from maximumcountpercategory m1
join maximumvaluecounts m2
on m1.category=m2.category
and m1.c=m2.c
group by m1.category
)
, others as (
select
category,
min(min_price) as min_price,
max(max_price) as max_price,
count(*) as no_items -- count every row, including those with NULL prices
from yourtable
group by category
)
select o.*, m.modevalue as price_mode
from others o join
modes m on o.category=m.category

GROUP BY and custom order

I've read through the answers on MySQL order by before group by but applying it to my query ends up with a subquery in a subquery for a rather simple case so I'm wondering if this can be simplified:
Schema with sample data
For brevity I've omitted the other fields on the members table. Also, there's many more tables joined in the actual application but those are straightforward to join. It's the membership_stack table that's giving me issues.
CREATE TABLE members (
id int unsigned auto_increment,
first_name varchar(255) not null,
PRIMARY KEY(id)
);
INSERT INTO members (id, first_name)
VALUES (1, 'Tyler'),
(2, 'Marissa'),
(3, 'Alex'),
(4, 'Parker');
CREATE TABLE membership_stack (
id int unsigned auto_increment,
member_id int unsigned not null,
sequence int unsigned not null,
team varchar(255) not null,
`status` varchar(255) not null,
PRIMARY KEY(id),
FOREIGN KEY(member_id) REFERENCES members(id)
);
-- Algorithm to determine correct team:
-- 1. Only consider rows with the highest sequence number
-- 2. Order statuses and pick the first one found:
-- (active, completed, cancelled, abandoned)
INSERT INTO membership_stack (member_id, sequence, team, status)
VALUES (1, 1, 'instinct', 'active'),
(1, 1, 'valor', 'abandoned'),
(2, 1, 'valor', 'active'),
(2, 2, 'mystic', 'abandoned'),
(2, 2, 'valor', 'completed'),
(3, 1, 'instinct', 'completed'),
(3, 2, 'valor', 'active');
I can't change the database schema because the data is synchronized with an external data source.
Query
This is what I have so far:
SELECT m.id, m.first_name, ms.sequence, ms.team, ms.status
FROM membership_stack AS ms
JOIN (
SELECT member_id, MAX(sequence) AS sequence
FROM membership_stack
GROUP BY member_id
) AS t1
ON ms.member_id = t1.member_id
AND ms.sequence = t1.sequence
RIGHT JOIN members AS m
ON ms.member_id = m.id
ORDER BY m.id, FIELD(ms.status, 'active', 'completed', 'cancelled', 'abandoned');
This works as expected but members may appear multiple times if their "most recent sequence" involves more than one team. What I need to do is aggregate again on id and select the FIRST row in each group.
However that poses some issues:
There is no FIRST() function in MySQL
This entire resultset would become a subtable (subquery), which isn't a big deal here but the queries are quite big on the application.
It needs to be compatible with ONLY_FULL_GROUP_BY mode as it is enabled on MySQL 5.7 by default. I haven't checked but I doubt that FIELD(ms.status, 'active', 'completed', 'cancelled', 'abandoned') is considered a functionally dependent field on this resultset. The query also needs to be compatible with MySQL 5.1 as that is what we are running at the moment.
Goal
| id | first_name | sequence | team | status |
|----|------------|----------|----------|-----------|
| 1 | Tyler | 1 | instinct | active |
| 2 | Marissa | 2 | valor | completed |
| 3 | Alex | 2 | valor | active |
| 4 | Parker | NULL | NULL | NULL |
What can I do about this?
Edit: It has come to my attention that some members don't belong to any team. These members should be included in the resultset with null values for those fields. Question updated to reflect new information.
You can use a correlated subquery in the WHERE clause with LIMIT 1:
SELECT m.id, m.first_name, ms.sequence, ms.team, ms.status
FROM members AS m
JOIN membership_stack AS ms ON ms.member_id = m.id
WHERE ms.id = (
SELECT ms1.id
FROM membership_stack AS ms1
WHERE ms1.member_id = ms.member_id
ORDER BY ms1.sequence desc,
FIELD(ms1.status, 'active', 'completed', 'cancelled', 'abandoned'),
ms1.id asc
LIMIT 1
)
ORDER BY m.id;
Demo: http://rextester.com/HGU18448
Update
To include members who have no entries in the membership_stack table you should use a LEFT JOIN, and move the subquery condition from the WHERE clause to the ON clause:
SELECT m.id, m.first_name, ms.sequence, ms.team, ms.status
FROM members AS m
LEFT JOIN membership_stack AS ms
ON ms.member_id = m.id
AND ms.id = (
SELECT ms1.id
FROM membership_stack AS ms1
WHERE ms1.member_id = ms.member_id
ORDER BY ms1.sequence desc,
FIELD(ms1.status, 'active', 'completed', 'cancelled', 'abandoned'),
ms1.id asc
LIMIT 1
)
ORDER BY m.id;
Demo: http://rextester.com/NPI79503
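The ordering used inside the subquery can be checked against the Goal table with a short Python sketch. This is only an illustration of the selection rule, not a replacement for the query: the helper names are hypothetical, the ids in membership_stack are assumed to follow the INSERT order, and every status is assumed to appear in the list (MySQL's FIELD() would return 0 for an unknown status, while `list.index` raises).

```python
STATUS_ORDER = ["active", "completed", "cancelled", "abandoned"]

members = [(1, "Tyler"), (2, "Marissa"), (3, "Alex"), (4, "Parker")]

# (id, member_id, sequence, team, status)
membership_stack = [
    (1, 1, 1, "instinct", "active"),
    (2, 1, 1, "valor", "abandoned"),
    (3, 2, 1, "valor", "active"),
    (4, 2, 2, "mystic", "abandoned"),
    (5, 2, 2, "valor", "completed"),
    (6, 3, 1, "instinct", "completed"),
    (7, 3, 2, "valor", "active"),
]


def sort_key(row):
    """ORDER BY sequence DESC, FIELD(status, ...), id ASC."""
    row_id, _member_id, sequence, _team, status = row
    return (-sequence, STATUS_ORDER.index(status), row_id)


def current_team(member_rows, stack_rows):
    """One row per member; members without stack rows get None fields."""
    best = {}
    for row in stack_rows:
        member_id = row[1]
        if member_id not in best or sort_key(row) < sort_key(best[member_id]):
            best[member_id] = row  # equivalent of LIMIT 1 per member
    result = []
    for member_id, first_name in member_rows:
        row = best.get(member_id)
        if row is None:  # LEFT JOIN behaviour
            result.append((member_id, first_name, None, None, None))
        else:
            result.append((member_id, first_name, row[2], row[3], row[4]))
    return result


teams = current_team(members, membership_stack)
```

On the sample data this reproduces the Goal table, including the all-NULL row for Parker.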
I would do this using variables.
You are looking for the one membership_stack row that is maximal for your special ordering. I'm focusing just on that. The join back to members is trivial.
select ms.*
from (select ms.*,
(@rn := if(@m = member_id, @rn + 1,
if(@m := member_id, 1, 1)
)
) as rn
from membership_stack ms cross join
(select @m := -1, @rn := 0) params
order by member_id, sequence desc,
field(ms.status, 'active', 'completed', 'cancelled', 'abandoned')
) ms
where rn = 1;
The variables are how the logic is implemented. The ordering is key to getting the right result.
EDIT:
MySQL is quite finicky about LIMIT in subqueries. It is possible that this will work:
select ms.*
from membership_stack ms
where (sequence, status) = (select ms2.sequence, ms2.status
from membership_stack ms2
where ms2.member_id = ms.member_id
order by ms2.member_id, ms2.sequence desc,
field(ms2.status, 'active', 'completed', 'cancelled', 'abandoned')
limit 1
);

get the id of the row with the least value, group by an other column

I ran into a problem trying to pull the one action per user that has the lowest priority. The priority is an integer computed from the contents of other columns.
This is the initial query :
SELECT
CASE
...
END AS dummy_priority,
id,
user_id
FROM
actions
Result :
id user_id priority
1 2345 1
2 2345 3
3 2999 5
4 2999 2
5 3000 10
Desired result :
id user_id priority
1 2345 1
4 2999 2
5 3000 10
Following what i want i tried
SELECT x.id, x.user_id, MIN(x.priority)
FROM (
SELECT
CASE
...
END AS priority,
id,
user_id
FROM
actions
) x
GROUP BY x.user_id
Which didn't work
Error Code: 1055. Expression #1 of SELECT list is not in GROUP BY
clause and contains nonaggregated column 'x.id' which is not
functionally dependent on columns in GROUP BY clause;
Most examples of this I found were extracting just the user_id and priority and then doing an inner join with both of them to get the row, but I can't do that since (priority, user_id) isn't unique
A simple verifiable example would be
CREATE TABLE `actions` (
`id` int(11) NOT NULL,
`user_id` int(11) DEFAULT NULL,
`priority` int(11) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO `actions` (`id`, `user_id`, `priority`) VALUES
(1, 2345, 1),
(2, 2345, 3),
(3, 2999, 5),
(4, 2999, 2),
(5, 3000, 10);
How can I extract the desired result? (Please keep in mind that this table is actually a subquery.)
The proper way to do this would involve a subquery of some sort . . . and that would require repeating the case definition.
Here is another method, using the substring_index()/group_concat() trick:
SELECT SUBSTRING_INDEX(GROUP_CONCAT(x.id ORDER BY x.priority), ',', 1) as id,
x.user_id, MIN(x.priority)
FROM (SELECT (CASE ...
END) AS priority,
id, user_id
FROM actions a
) x
GROUP BY x.user_id;
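What the query is meant to select can be sketched in Python as follows. This is an illustration only: the function name is made up, and ties on priority are broken by the smallest id, which is an assumption (the GROUP_CONCAT above only orders by priority, so its tie-break is not defined). It also assumes no priority is NULL.

```python
# (id, user_id, priority), from the verifiable example in the question.
actions = [
    (1, 2345, 1),
    (2, 2345, 3),
    (3, 2999, 5),
    (4, 2999, 2),
    (5, 3000, 10),
]


def lowest_priority_action(rows):
    """Return one (id, user_id, priority) row per user with the lowest priority."""
    best = {}
    for action_id, user_id, priority in rows:
        candidate = (priority, action_id)  # priority first, id as tie-break
        if user_id not in best or candidate < best[user_id]:
            best[user_id] = candidate
    return [(action_id, user_id, priority)
            for user_id, (priority, action_id) in sorted(best.items())]


picked = lowest_priority_action(actions)
```

This yields exactly one row per user even when (user_id, priority) is not unique, which is the property the question asks for.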
And that proper way in full...
SELECT x...
, CASE...x... priority
FROM my_table x
JOIN
( SELECT user_id
, MIN(CASE...) priority
FROM my_table
GROUP
BY user_id
) y
ON y.user_id = x.user_id
AND y.priority = CASE...x...;
This should work ...
SELECT act.id, act.user_id, act.priority FROM actions act
INNER JOIN
(SELECT
user_id, MIN(priority) AS priority
FROM
actions
GROUP BY user_id) pri
ON act.user_id = pri.user_id AND act.priority = pri.priority

Wrong data output in SQL request

I have a table named payments
CREATE TABLE payments (
`id` INT AUTO_INCREMENT PRIMARY KEY NOT NULL,
`student_id` INT NOT NULL,
`datetime` DATETIME NOT NULL,
`amount` FLOAT DEFAULT 0,
INDEX `student_id` (`student_id`)
);
I need a query that finds every student_id whose total payment is less than the largest total. (More than one student can share the largest total.)
Assume, for instance, this test data:
== Dumping data for table payments
id-student_id-datetime-amount
|1|4|2015-06-11 00:00:00|2
|2|5|2015-06-01 00:00:00|6
|3|1|2015-06-03 00:00:00|8
|4|2|2015-06-02 00:00:00|9
|5|4|2015-06-09 00:00:00|6
|6|5|2015-06-06 00:00:00|3
|7|2|2015-06-05 00:00:00|6
|8|3|2015-06-09 00:00:00|12
|14|1|2015-06-01 00:00:00|0
|15|1|2015-06-03 00:00:00|7
|16|6|2015-06-02 00:00:00|0
|17|6|2015-06-07 00:00:00|0
|18|6|2015-06-05 00:00:00|0
Next query shows all students with their sum payments
SELECT `student_id`, SUM(amount) as `sumamount`
FROM `payments`
GROUP BY `student_id`
ORDER BY `sumamount` DESC
Here is the correct output of this query, ordered by sumamount:
student_id sumamount
1 15
2 15
3 12
5 9
4 8
6 0
BUT the problem is when I try to get the user who paid less than the biggest one it gives me the wrong answer
Here is the query to get the second user:
SELECT `student_id`, SUM(amount) as `sumamount`
FROM `payments`
GROUP BY `student_id`
HAVING `sumamount` < MAX(sumamount)
ORDER BY `sumamount` DESC
Here is the result
student_id sumamount
3 12
4 8
6 0
As we can see, student_id = 5 is missing and I have no idea why.
You need to calculate MAX(sumamount) in a subquery, so that MAX is not grouped by student_id.
SELECT `student_id`, SUM(amount) as `sumamount`, maxsum
FROM `payments`
CROSS JOIN (SELECT MAX(sumamount) AS maxsum
FROM (SELECT SUM(amount) AS sumamount
FROM payments
GROUP BY student_id) t1) t2
GROUP BY `student_id`
HAVING `sumamount` < maxsum
ORDER BY `sumamount` DESC
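The two-step logic (first total per student, then one overall maximum, then the filter) can be sketched in Python to check the expected rows. The function name is hypothetical and the data is the question's test data reduced to (student_id, amount); it assumes at least one payment row exists.

```python
from collections import defaultdict

# (student_id, amount) from the test data in the question.
payments = [
    (4, 2), (5, 6), (1, 8), (2, 9), (4, 6), (5, 3), (2, 6),
    (3, 12), (1, 0), (1, 7), (6, 0), (6, 0), (6, 0),
]


def below_top_total(rows):
    """Students whose summed amount is strictly less than the largest sum."""
    totals = defaultdict(int)
    for student_id, amount in rows:
        totals[student_id] += amount      # inner GROUP BY student_id
    top = max(totals.values())            # MAX over the per-student sums
    below = [(s, t) for s, t in totals.items() if t < top]
    return sorted(below, key=lambda pair: -pair[1])  # ORDER BY sumamount DESC


result = below_top_total(payments)
```

Students 1 and 2 both total 15 and are excluded; student 5 (total 9) is kept, which is the row the original HAVING clause lost.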

Sorting so that rows with matching column do stick together

I have a table with timestamped rows: say, some feed with authors:
CREATE TEMPORARY TABLE `feed` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
`author` VARCHAR(255) NOT NULL,
`tm` DATETIME NOT NULL
);
I'd like to sort by tm DESC but in such a way that rows from one author do stick together.
For instance, having
INSERT INTO `feed` VALUES
( 5, 'peter', NOW() + INTERVAL 1 SECOND ),
( 4, 'helen', NOW() - INTERVAL 1 SECOND ),
( 3, 'helen', NOW() - INTERVAL 2 SECOND ),
( 2, 'peter', NOW() - INTERVAL 10 SECOND ),
( 1, 'peter', NOW() - INTERVAL 11 SECOND );
The result set should be sorted by tm DESC, but all peter posts go first because his post is the most recent one. The next set of rows should originate from the author with the 2nd most recent post. And so on.
5 peter
2 peter
1 peter
4 helen
3 helen
First we sort authors by recent post, descending. Then, having this "rating", we sort the feed with authors sorted by recent post.
Create an inline view that calculates the max tm per author, and then join to it.
SELECT f.*
FROM feed f
INNER JOIN (SELECT MAX(TM) MAXTM,
author
FROM Feed
GROUP BY Author)m
ON f.author = m.author
ORDER BY m.MAXTM DESC,
f.author,
f.tm DESC
You could try something like this:
select *
from feed
order by
(select max(tm) from feed f2 where f2.author = feed.author) desc,
tm desc
This sorts first by the time of the most recent post of the author, then by tm.
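The same two-level ordering can be sketched in Python to check the expected sequence of ids. This is only an illustration: the function name is made up, tm is represented as a number of seconds relative to "now" rather than a DATETIME, and the author name is used as a tie-break between authors whose latest posts have the same tm.

```python
# (id, author, tm); tm is an offset in seconds relative to "now".
feed = [
    (5, "peter", 1),
    (4, "helen", -1),
    (3, "helen", -2),
    (2, "peter", -10),
    (1, "peter", -11),
]


def grouped_feed(rows):
    """Sort by each author's latest tm (desc), then author, then tm (desc)."""
    latest = {}
    for _id, author, tm in rows:
        latest[author] = max(tm, latest.get(author, tm))
    return sorted(rows, key=lambda r: (-latest[r[1]], r[1], -r[2]))


ordered_ids = [row[0] for row in grouped_feed(feed)]
```

All of peter's posts come first because his newest post is the most recent overall, followed by helen's, each group newest first.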
SELECT *
FROM `feed`
LEFT JOIN (
SELECT
@rownum:=@rownum+1 AS `rowid`,
`author`,
MAX(`tm`) AS `max_tm`
FROM (SELECT @rownum:=0) r, `feed`
GROUP BY `author`
ORDER BY `max_tm` DESC
) `feedsort` ON(`feed`.`author` = `feedsort`.`author`)
ORDER BY
`feedsort`.`rowid` ASC,
`feed`.`tm` DESC;
This solves the problem but I'm sure there's a better solution