Improve SQL query performance - mysql

I have three tables where I store actual person data (person), teams (team) and entries (athlete). The schema of the three tables is:
In each team there might be two or more athletes.
I'm trying to create a query to produce the most frequent pairs, meaning people who play in teams of two. I came up with the following query:
SELECT p1.surname, p1.name, p2.surname, p2.name, COUNT(*) AS freq
FROM person p1, athlete a1, person p2, athlete a2
WHERE
p1.id = a1.person_id AND
p2.id = a2.person_id AND
a1.team_id = a2.team_id AND
a1.team_id IN
( SELECT team.id
FROM team, athlete
WHERE team.id = athlete.team_id
GROUP BY team.id
HAVING COUNT(*) = 2 )
GROUP BY p1.id
ORDER BY freq DESC
Obviously this is a resource consuming query. Is there a way to improve it?

SELECT team.id
FROM team, athlete
WHERE team.id = athlete.team_id
GROUP BY team.id
HAVING COUNT(*) = 2
Performance Tip 1: You only need the athlete table here.
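To illustrate the tip, here is a minimal sketch. It uses Python's built-in sqlite3 module only so that it runs without a MySQL server (an assumption of this example), and the sample rows are invented. Grouping the athlete table alone finds the same two-member teams as the version that also joins team.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team (id INTEGER PRIMARY KEY);
    CREATE TABLE athlete (id INTEGER PRIMARY KEY, person_id INT, team_id INT);
    INSERT INTO team (id) VALUES (1), (2), (3), (4);
    -- Invented data: team 1 has three athletes, teams 2 and 3 have two, team 4 has none.
    INSERT INTO athlete (person_id, team_id) VALUES
        (1, 1), (2, 1), (3, 1),
        (3, 2), (4, 2),
        (1, 3), (5, 3);
""")

# The subquery from the question: joins team although only team_id is needed.
with_join = conn.execute("""
    SELECT team.id
    FROM team, athlete
    WHERE team.id = athlete.team_id
    GROUP BY team.id
    HAVING COUNT(*) = 2
    ORDER BY team.id
""").fetchall()

# The same result from the athlete table alone.
athlete_only = conn.execute("""
    SELECT team_id
    FROM athlete
    GROUP BY team_id
    HAVING COUNT(*) = 2
    ORDER BY team_id
""").fetchall()

print(with_join, athlete_only)
```

Both queries return teams 2 and 3 here; the second one simply avoids touching the team table.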

You might consider the following approach which uses triggers to maintain counters in your team and person tables so you can easily find out which teams have 2 or more athletes and which persons are in 2 or more teams.
(note: I've removed the surrogate id key from your athlete table in favour of a composite key which will better enforce data integrity. I've also renamed athlete to team_athlete)
drop table if exists person;
create table person
(
person_id int unsigned not null auto_increment primary key,
name varchar(255) not null,
team_count smallint unsigned not null default 0
)
engine=innodb;
drop table if exists team;
create table team
(
team_id int unsigned not null auto_increment primary key,
name varchar(255) not null,
athlete_count smallint unsigned not null default 0,
key (athlete_count)
)
engine=innodb;
drop table if exists team_athlete;
create table team_athlete
(
team_id int unsigned not null,
person_id int unsigned not null,
primary key (team_id, person_id), -- note clustered composite PK
key person(person_id) -- added index
)
engine=innodb;
delimiter #
create trigger team_athlete_after_ins_trig after insert on team_athlete
for each row
begin
update team set athlete_count = athlete_count+1 where team_id = new.team_id;
update person set team_count = team_count+1 where person_id = new.person_id;
end#
delimiter ;
insert into person (name) values ('p1'),('p2'),('p3'),('p4'),('p5');
insert into team (name) values ('t1'),('t2'),('t3'),('t4');
insert into team_athlete (team_id, person_id) values
(1,1),(1,2),(1,3),
(2,3),(2,4),
(3,1),(3,5);
select * from team_athlete;
select * from person;
select * from team;
select * from team where athlete_count >= 2;
select * from person where team_count >= 2;
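The counter idea can be tried out with the sketch below. It is written against Python's built-in sqlite3 module purely so it runs without a MySQL server, so the DDL is simplified (no engine, unsigned types or delimiter commands). The delete trigger is an assumed addition that is not part of the schema above; it shows one way the counters could be kept correct when a row is removed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                         team_count INT NOT NULL DEFAULT 0);
    CREATE TABLE team (team_id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                       athlete_count INT NOT NULL DEFAULT 0);
    CREATE TABLE team_athlete (team_id INT NOT NULL, person_id INT NOT NULL,
                               PRIMARY KEY (team_id, person_id));

    -- Same logic as the insert trigger above.
    CREATE TRIGGER team_athlete_after_ins AFTER INSERT ON team_athlete
    BEGIN
        UPDATE team SET athlete_count = athlete_count + 1 WHERE team_id = NEW.team_id;
        UPDATE person SET team_count = team_count + 1 WHERE person_id = NEW.person_id;
    END;

    -- Assumed addition (hypothetical trigger name): decrement on removal.
    CREATE TRIGGER team_athlete_after_del AFTER DELETE ON team_athlete
    BEGIN
        UPDATE team SET athlete_count = athlete_count - 1 WHERE team_id = OLD.team_id;
        UPDATE person SET team_count = team_count - 1 WHERE person_id = OLD.person_id;
    END;

    INSERT INTO person (name) VALUES ('p1'),('p2'),('p3'),('p4'),('p5');
    INSERT INTO team (name) VALUES ('t1'),('t2'),('t3'),('t4');
    INSERT INTO team_athlete (team_id, person_id) VALUES
        (1,1),(1,2),(1,3),
        (2,3),(2,4),
        (3,1),(3,5);
""")

team_counts_sql = "SELECT team_id, athlete_count FROM team ORDER BY team_id"
after_insert = conn.execute(team_counts_sql).fetchall()

# Remove person 3 from team 1; the delete trigger adjusts both counters.
conn.execute("DELETE FROM team_athlete WHERE team_id = 1 AND person_id = 3")
after_delete = conn.execute(team_counts_sql).fetchall()
person3_teams = conn.execute(
    "SELECT team_count FROM person WHERE person_id = 3").fetchone()[0]

print(after_insert, after_delete, person3_teams)
```

Without a matching delete (and possibly update) trigger the counters would drift as soon as rows are removed, which is the main maintenance cost of this design.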
EDIT
Added the following, as I initially misunderstood the question:
Create a view which only includes teams of 2 persons.
drop view if exists teams_with_2_players_view;
create view teams_with_2_players_view as
select
t.team_id,
ta.person_id,
p.name as person_name
from
team t
inner join team_athlete ta on t.team_id = ta.team_id
inner join person p on ta.person_id = p.person_id
where
t.athlete_count = 2;
Now use the view to find the most frequently occurring person pairs.
select
p1.person_id as p1_person_id,
p1.person_name as p1_person_name,
p2.person_id as p2_person_id,
p2.person_name as p2_person_name,
count(*) as counter
from
teams_with_2_players_view p1
inner join teams_with_2_players_view p2 on
p2.team_id = p1.team_id and p2.person_id > p1.person_id
group by
p1.person_id, p2.person_id
order by
counter desc;
Hope this helps :)
EDIT 2 checking performance
select count(*) as counter from person;
+---------+
| counter |
+---------+
| 10000 |
+---------+
1 row in set (0.00 sec)
select count(*) as counter from team;
+---------+
| counter |
+---------+
| 450000 |
+---------+
1 row in set (0.08 sec)
select count(*) as counter from team where athlete_count = 2;
+---------+
| counter |
+---------+
| 112644 |
+---------+
1 row in set (0.03 sec)
select count(*) as counter from team_athlete;
+---------+
| counter |
+---------+
| 1124772 |
+---------+
1 row in set (0.21 sec)
explain
select
p1.person_id as p1_person_id,
p1.person_name as p1_person_name,
p2.person_id as p2_person_id,
p2.person_name as p2_person_name,
count(*) as counter
from
teams_with_2_players_view p1
inner join teams_with_2_players_view p2 on
p2.team_id = p1.team_id and p2.person_id > p1.person_id
group by
p1.person_id, p2.person_id
order by
counter desc
limit 10;
+----+-------------+-------+--------+---------------------+-------------+---------+---------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------+-------------+---------+---------------------+-------+----------------------------------------------+
| 1 | SIMPLE | t | ref | PRIMARY,t_count_idx | t_count_idx | 2 | const | 86588 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | t | eq_ref | PRIMARY,t_count_idx | PRIMARY | 4 | foo_db.t.team_id | 1 | Using where |
| 1 | SIMPLE | ta | ref | PRIMARY,person | PRIMARY | 4 | foo_db.t.team_id | 1 | Using index |
| 1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 4 | foo_db.ta.person_id | 1 | |
| 1 | SIMPLE | ta | ref | PRIMARY,person | PRIMARY | 4 | foo_db.t.team_id | 1 | Using where; Using index |
| 1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 4 | foo_db.ta.person_id | 1 | |
+----+-------------+-------+--------+---------------------+-------------+---------+---------------------+-------+----------------------------------------------+
6 rows in set (0.00 sec)
select
p1.person_id as p1_person_id,
p1.person_name as p1_person_name,
p2.person_id as p2_person_id,
p2.person_name as p2_person_name,
count(*) as counter
from
teams_with_2_players_view p1
inner join teams_with_2_players_view p2 on
p2.team_id = p1.team_id and p2.person_id > p1.person_id
group by
p1.person_id, p2.person_id
order by
counter desc
limit 10;
+--------------+----------------+--------------+----------------+---------+
| p1_person_id | p1_person_name | p2_person_id | p2_person_name | counter |
+--------------+----------------+--------------+----------------+---------+
| 221 | person 221 | 739 | person 739 | 5 |
| 129 | person 129 | 249 | person 249 | 5 |
| 874 | person 874 | 877 | person 877 | 4 |
| 717 | person 717 | 949 | person 949 | 4 |
| 395 | person 395 | 976 | person 976 | 4 |
| 415 | person 415 | 828 | person 828 | 4 |
| 287 | person 287 | 470 | person 470 | 4 |
| 455 | person 455 | 860 | person 860 | 4 |
| 13 | person 13 | 29 | person 29 | 4 |
| 1 | person 1 | 743 | person 743 | 4 |
+--------------+----------------+--------------+----------------+---------+
10 rows in set (2.02 sec)

Should there be an additional constraint a1.person_id != a2.person_id, to avoid creating a pair with the same player? This may not affect the final ordering of the results but will affect the accuracy of the count.
If possible, you can add a column called athlete_count (with an index) to the team table and update it whenever a player is added to or removed from a team. This avoids the subquery, which has to go through the entire athlete table to find the two-player teams.
UPDATE1:
Also, if I am understanding the original query correctly, when you group by p1.id you only get the number of times a player played in a two-player team and not the count of the pair itself. You may have to GROUP BY p1.id, p2.id.

REVISION BASED on EXACTLY TWO PER TEAM
The innermost query pre-aggregates teams of exactly TWO people, reducing each team to a single row by taking MIN() and MAX() of the person IDs. That way each pair is always stored in low-high order, so the same two people compare equal across teams. The next level counts how many teams each Mate1/Mate2 pair shares, and the outer query joins to person to get their names.
SELECT STRAIGHT_JOIN
p1.surname,
p1.name,
p2.surname,
p2.name,
TeamAggregates.CommonTeams
from
( select PreQueryTeams.Mate1,
PreQueryTeams.Mate2,
count(*) CommonTeams
from
( SELECT team_id,
min( person_id ) mate1,
max( person_id ) mate2
FROM
athlete
group by
team_id
having count(*) = 2 ) PreQueryTeams
group by
PreQueryTeams.Mate1,
PreQueryTeams.Mate2 ) TeamAggregates,
person p1,
person p2
where
TeamAggregates.Mate1 = p1.id
and TeamAggregates.Mate2 = p2.id
order by
TeamAggregates.CommonTeams DESC
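As a rough check of the MIN()/MAX() idea, the sketch below runs an equivalent query through Python's built-in sqlite3 module (chosen only so it runs without a MySQL server; STRAIGHT_JOIN is MySQL-specific and is left out). The people, team numbers and the derived-table aliases pre and agg are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, surname TEXT, name TEXT);
    CREATE TABLE athlete (id INTEGER PRIMARY KEY, person_id INT, team_id INT);
    INSERT INTO person VALUES (1,'Adams','Ann'), (2,'Brown','Bob'), (3,'Clark','Cy');
    INSERT INTO athlete (person_id, team_id) VALUES
        (1,10), (2,10),           -- team 10: persons 1 and 2
        (2,11), (1,11),           -- team 11: same pair, inserted in reverse order
        (1,12), (3,12),           -- team 12: persons 1 and 3
        (1,13), (2,13), (3,13);   -- team 13 has three members, so it is ignored
""")

rows = conn.execute("""
    SELECT p1.surname, p2.surname, agg.common_teams
    FROM (SELECT mate1, mate2, COUNT(*) AS common_teams
          FROM (SELECT team_id,
                       MIN(person_id) AS mate1,
                       MAX(person_id) AS mate2
                FROM athlete
                GROUP BY team_id
                HAVING COUNT(*) = 2) AS pre
          GROUP BY mate1, mate2) AS agg
    JOIN person p1 ON p1.id = agg.mate1
    JOIN person p2 ON p2.id = agg.mate2
    ORDER BY agg.common_teams DESC, p1.id, p2.id
""").fetchall()

print(rows)
```

Because each two-person team collapses to one (low, high) row, the pair 1/2 is counted twice regardless of the order in which its members were inserted.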
ORIGINAL ANSWER FOR TEAMS WITH ANY NUMBER OF TEAMMATES
I would do it as follows. The inner prequery joins all possible combinations of people on each individual team; requiring person1 < person2 prevents pairing a person with themselves and also prevents counting the same pair a second time in reverse order. For example:
athlete person team
1 1 1
2 2 1
3 3 1
4 4 1
5 1 2
6 3 2
7 4 2
8 1 3
9 4 3
So, from team 1 you would get person pairs of
1,2 1,3 1,4 2,3 2,4 3,4
and NOT get reversed duplicates such as
2,1 3,1 4,1 3,2 4,2 4,3
nor same person
1,1 2,2 3,3 4,4
Then from team 2, you would have pairs of
1,3 1,4 3,4
Finally in team 3 the single pair of
1,4
Thus teammates 1,4 have occurred in 3 common teams.
SELECT STRAIGHT_JOIN
p1.surname,
p1.name,
p2.surname,
p2.name,
PreQuery.CommonTeams
from
( select
a1.Person_ID Person_ID1,
a2.Person_ID Person_ID2,
count(*) CommonTeams
from
athlete a1,
athlete a2
where
a1.Team_ID = a2.Team_ID
and a1.Person_ID < a2.Person_ID
group by
1, 2
having CommonTeams > 1 ) PreQuery,
person p1,
person p2
where
PreQuery.Person_ID1 = p1.id
and PreQuery.Person_ID2 = p2.id
order by
PreQuery.CommonTeams DESC
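The worked example above can be reproduced with a short sketch. It loads the nine sample rows into an in-memory SQLite database via Python's sqlite3 module (an assumption made so the example runs without MySQL) and counts pairs with the person1 < person2 self-join. The HAVING filter is omitted here so that every pair is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE athlete (id INTEGER PRIMARY KEY, person_id INT, team_id INT);
    -- The sample rows from the table above: (athlete, person, team).
    INSERT INTO athlete VALUES
        (1,1,1), (2,2,1), (3,3,1), (4,4,1),
        (5,1,2), (6,3,2), (7,4,2),
        (8,1,3), (9,4,3);
""")

pairs = conn.execute("""
    SELECT a1.person_id, a2.person_id, COUNT(*) AS common_teams
    FROM athlete a1
    JOIN athlete a2
      ON a1.team_id = a2.team_id
     AND a1.person_id < a2.person_id   -- no self-pairs, no reversed duplicates
    GROUP BY a1.person_id, a2.person_id
    ORDER BY common_teams DESC, a1.person_id, a2.person_id
""").fetchall()

for person1, person2, common_teams in pairs:
    print(person1, person2, common_teams)
```

The first row is the pair 1,4 with 3 common teams, matching the hand count above.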

Here are some tips to improve SQL SELECT query performance:
Use SET NOCOUNT ON; it helps to decrease network traffic and thus
improves performance.
Use the fully qualified procedure name (e.g.
database.schema.objectname).
Use sp_executesql instead of EXECUTE for dynamic queries.
Don't use SELECT *; use SELECT column1, column2, ... for IF EXISTS
or SELECT operations.
Avoid naming user stored procedures like sp_procedureName, because
if a stored procedure name starts with sp_, SQL Server first
searches the master database, which can slow the query down.

Related

How do I update scores in table without using a ranking function

Table name is: result
ID Name score position
1 John 40 0
2 Ali 79 0
3 Ben 50 0
4 Joe 79 0
How can I update the result table to get the table below without using RANK(), which my server does not support? I need MySQL code that handles ties as shown in the table below.
ID Name score position
1 John 40 4
2 Ali 79 1
3 Ben 50 3
4 Joe 79 1
In MySQL prior to version 8 try using the multiple table update syntax:
UPDATE scores t
LEFT JOIN (
SELECT t1.id, COUNT(*) + 1 AS new_position
FROM scores t1
JOIN scores t2 ON t1.score < t2.score
GROUP BY t1.id
) agg ON t.id = agg.id
SET t.position = COALESCE(agg.new_position, 1)
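The ranking logic in that derived table can be checked with the sketch below. It uses Python's built-in sqlite3 module so that it runs without a MySQL server, and it names the table result as in the question. Since multi-table UPDATE syntax differs between databases, the sketch computes the positions with the same LEFT JOIN and then writes them back row by row, which is an assumption of this example rather than the method shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE result (id INTEGER PRIMARY KEY, name TEXT,
                         score INT, position INT DEFAULT 0);
    INSERT INTO result (id, name, score) VALUES
        (1,'John',40), (2,'Ali',79), (3,'Ben',50), (4,'Joe',79);
""")

# For each row, count the rows with a strictly higher score and add 1.
# Rows with the top score have no match in agg, so COALESCE gives them 1.
new_positions = conn.execute("""
    SELECT COALESCE(agg.new_position, 1) AS new_position, t.id
    FROM result t
    LEFT JOIN (SELECT t1.id, COUNT(*) + 1 AS new_position
               FROM result t1
               JOIN result t2 ON t1.score < t2.score
               GROUP BY t1.id) agg ON t.id = agg.id
""").fetchall()

# Write the computed positions back, one row at a time.
conn.executemany("UPDATE result SET position = ? WHERE id = ?", new_positions)

rows = conn.execute(
    "SELECT id, name, score, position FROM result ORDER BY id").fetchall()
print(rows)
```

Tied scores share a position and the next position is skipped, which is the behaviour the question asks for.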
Lots of ways to skin this particular animal. How about...
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(ID SERIAL PRIMARY KEY
,Name VARCHAR(12) NOT NULL
,score INT NOT NULL
);
INSERT INTO my_table VALUES
(1,'John',40),
(2,'Ali',79),
(3,'Ben',50),
(4,'Joe',79);
SELECT id
, name
, score
, FIND_IN_SET(score, scores) rank
FROM my_table
CROSS
JOIN
( SELECT GROUP_CONCAT(score ORDER BY score DESC) scores
FROM my_table
) scores
+----+------+-------+------+
| id | name | score | rank |
+----+------+-------+------+
| 1 | John | 40 | 4 |
| 2 | Ali | 79 | 1 |
| 3 | Ben | 50 | 3 |
| 4 | Joe | 79 | 1 |
+----+------+-------+------+
I've not provided an UPDATE, because you wouldn't normally store derived data.
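You can use a correlated subquery as follows (note that MySQL itself typically rejects a subquery on the table being updated with error 1093, so this form suits databases that allow it):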
update your_table t
set t.position = (select count(*) + 1 from your_table tt
where tt.score > t.score)

Join two tables matching multiple ID's to names

Fiddle here: http://sqlfiddle.com/#!9/53d3c/2/0
I have two tables, one containing Member Names and their ID Number. Let's call that table Names:
CREATE TABLE Names (
ID int,
Title text
);
INSERT INTO Names
VALUES (11,'Chad'),
(10,'Deb'),
(34,'Steph'),
(13,'Chris'),
(98,'Peter'),
(33,'Daniel'),
(78,'Christine'),
(53,'Yolanda')
;
My second table contains meeting information, where someone is a Coach and someone is a Player. Each entry is a separate line (i.e. Meeting_ID 1 has two entries, one for the coach, one for the participant). Further, there is a column identifier for if that row is for a coach or player.
CREATE TABLE Meeting_Data (
Meeting_ID int,
Player_ID int,
Coach_ID int,
field_id int
);
INSERT INTO Meeting_Data
VALUES (1,0,11,2),
(1,10,0,1),
(2,34,0,1),
(2,0,13,2),
(3,98,0,1),
(3,0,33,2),
(4,78,0,1),
(4,0,53,2)
;
What I'm trying to do is create a table that puts each Meeting on one row, and then puts the ID#s and Names of the people meeting. When I attempt this, I get one column to pull successfully and then one column of (null) values.
SELECT Meeting_ID,
Max(CASE
WHEN field_id = 1 THEN Player_ID
END) AS Player_ID,
Max(CASE
WHEN field_id = 2 THEN Coach_ID
END) AS Coach_ID,
Player_Names.Title as Player_Names,
Coach_Names.Title as Coach_Names
FROM Meeting_Data
LEFT JOIN Names Player_Names
ON Player_ID = Player_Names.ID
LEFT JOIN Names Coach_Names
ON Coach_ID = Coach_Names.ID
GROUP BY Meeting_ID
Which results in:
| Meeting_ID | Player_ID | Coach_ID | Player_Names | Coach_Names |
|------------|-----------|----------|--------------|-------------|
| 1 | 10 | 11 | Deb | (null) |
| 2 | 34 | 13 | Steph | (null) |
| 3 | 98 | 33 | Peter | (null) |
| 4 | 78 | 53 | Christine | (null) |
How about something like this (http://sqlfiddle.com/#!9/53d3c/52/0):
SELECT Meeting_ID, Player_ID, Coach_ID, Players.Title, Coaches.Title
FROM (
SELECT Meeting_ID,
MAX(Player_ID) as Player_ID,
MAX(Coach_ID) as Coach_ID
FROM Meeting_Data
GROUP BY Meeting_ID
) meeting
LEFT JOIN Names Players ON Players.ID = meeting.Player_ID
LEFT JOIN Names Coaches ON Coaches.ID = meeting.Coach_ID
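To see why aggregating first helps, the sketch below loads the sample data from the question into an in-memory SQLite database (Python's sqlite3 module, used only so the example runs without a MySQL server) and runs the same shape of query. Each meeting is collapsed to one row before the names are joined, so both lookups see real IDs rather than the 0 placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Names (ID INT, Title TEXT);
    INSERT INTO Names VALUES
        (11,'Chad'), (10,'Deb'), (34,'Steph'), (13,'Chris'),
        (98,'Peter'), (33,'Daniel'), (78,'Christine'), (53,'Yolanda');

    CREATE TABLE Meeting_Data (Meeting_ID INT, Player_ID INT, Coach_ID INT, field_id INT);
    INSERT INTO Meeting_Data VALUES
        (1,0,11,2), (1,10,0,1),
        (2,34,0,1), (2,0,13,2),
        (3,98,0,1), (3,0,33,2),
        (4,78,0,1), (4,0,53,2);
""")

rows = conn.execute("""
    SELECT meeting.Meeting_ID, meeting.Player_ID, meeting.Coach_ID,
           Players.Title AS Player_Name, Coaches.Title AS Coach_Name
    FROM (
        -- One row per meeting; MAX() skips the 0 placeholders.
        SELECT Meeting_ID,
               MAX(Player_ID) AS Player_ID,
               MAX(Coach_ID)  AS Coach_ID
        FROM Meeting_Data
        GROUP BY Meeting_ID
    ) meeting
    LEFT JOIN Names Players ON Players.ID = meeting.Player_ID
    LEFT JOIN Names Coaches ON Coaches.ID = meeting.Coach_ID
    ORDER BY meeting.Meeting_ID
""").fetchall()

for row in rows:
    print(row)
```

This relies on the placeholder being 0 and on real IDs being positive; with NULL placeholders MAX() would work the same way.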

mysql: How to exclude rows from table which exist in table_alias with good performance?

I have a query with NOT EXISTS and it runs very slowly on a big database:
SELECT *
FROM
(
SELECT * FROM profiles ORDER BY id DESC
/* I need this order HERE! More info: https://stackoverflow.com/q/43516402/2051938 */
) AS users
WHERE
NOT EXISTS (
SELECT *
FROM request_for_friendship
WHERE
(
request_for_friendship.from_id = 1
AND
request_for_friendship.to_id = users.id
)
OR
(
request_for_friendship.from_id = users.id
AND
request_for_friendship.to_id = 1
)
)
LIMIT 0 , 1;
I think I need to select from request_for_friendship with a WHERE first and then check NOT EXISTS against that result, like this:
SELECT users.*
FROM
(
SELECT * FROM profiles ORDER BY id DESC
) AS users,
(
SELECT *
FROM request_for_friendship
WHERE
request_for_friendship.from_id = 1
OR
request_for_friendship.to_id = 1
) AS exclude_table
WHERE
NOT EXISTS
(
SELECT *
FROM exclude_table /* #1146 - Table 'join_test.exclude_table' doesn't exist */
WHERE
request_for_friendship.from_id = users.id
OR
request_for_friendship.to_id = users.id
)
LIMIT 0 , 1;
But it doesn't work: #1146 - Table 'join_test.exclude_table' doesn't exist
My tables:
1) profiles
+----+---------+
| id | name |
+----+---------+
| 1 | WILLIAM |
| 2 | JOHN |
| 3 | ROBERT |
| 4 | MICHAEL |
| 5 | JAMES |
| 6 | DAVID |
| 7 | RICHARD |
| 8 | CHARLES |
| 9 | JOSEPH |
| 10 | THOMAS |
+----+---------+
2) request_for_friendship
+----+---------+-------+
| id | from_id | to_id |
+----+---------+-------+
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 1 | 8 |
| 5 | 4 | 1 |
| 6 | 9 | 1 |
+----+---------+-------+
How can I do something like this, or something better for performance?
P.S. I need to get only 1 row from the table.
Demo: http://rextester.com/DTA64368
I've already tried LEFT JOIN, but I have a problem with the ordering; see: mysql: how to save ORDER BY after LEFT JOIN without reorder?
First, do not use subqueries unnecessarily. Second, split the NOT EXISTS into two conditions:
SELECT p.*
FROM profiles p
WHERE NOT EXISTS (SELECT 1
FROM request_for_friendship rff
WHERE rff.from_id = 1 AND
rff.to_id = p.id
) AND
NOT EXISTS (SELECT 1
FROM request_for_friendship rff
WHERE rff.to_id = 1 AND
rff.from_id = p.id
)
ORDER BY id DESC;
This can now make use of two indexes: request_for_friendship(to_id, from_id) and request_for_friendship(from_id, to_id). Each index is needed for one of the NOT EXISTS conditions.
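A small sketch of the split NOT EXISTS on the sample data from the question follows. It uses Python's built-in sqlite3 module only so that it runs without a MySQL server; the index names and the :me parameter are made up, and SQLite's planner is not MySQL's, so this checks the result rather than the speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profiles (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE request_for_friendship (id INTEGER PRIMARY KEY, from_id INT, to_id INT);
    -- The two composite indexes suggested above (hypothetical names).
    CREATE INDEX rff_from_to ON request_for_friendship (from_id, to_id);
    CREATE INDEX rff_to_from ON request_for_friendship (to_id, from_id);

    INSERT INTO profiles VALUES
        (1,'WILLIAM'), (2,'JOHN'), (3,'ROBERT'), (4,'MICHAEL'), (5,'JAMES'),
        (6,'DAVID'), (7,'RICHARD'), (8,'CHARLES'), (9,'JOSEPH'), (10,'THOMAS');
    INSERT INTO request_for_friendship VALUES
        (1,1,2), (2,1,3), (3,1,8), (5,4,1), (6,9,1);
""")

unrelated = conn.execute("""
    SELECT p.id
    FROM profiles p
    WHERE NOT EXISTS (SELECT 1
                      FROM request_for_friendship rff
                      WHERE rff.from_id = :me AND rff.to_id = p.id)
      AND NOT EXISTS (SELECT 1
                      FROM request_for_friendship rff
                      WHERE rff.to_id = :me AND rff.from_id = p.id)
    ORDER BY p.id DESC
""", {"me": 1}).fetchall()

print(unrelated)      # all profiles with no request to or from profile 1
print(unrelated[0])   # the single row the question asks for
```

Note that profile 1 itself appears in the list, since it has no request to or from itself; adding p.id <> :me would exclude it if that is not wanted.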
I still think there are ways to optimize this, as 'in' is generally slower.
SELECT *
FROM profiles p
WHERE NOT EXISTS (SELECT 1
FROM request_for_friendship
WHERE (request_for_friendship.from_id,
request_for_friendship.to_id)
in ((1,p.id),
(p.id,1))
)
Get rid of the id in request_for_friendship. It wastes space and performance. The table has a "natural" PRIMARY KEY, which I will get to in a moment.
Since the relationship seems to be commutative, let's make use of that by sorting the from and to: put the smaller id in from_id and the larger in to_id. See the LEAST() and GREATEST() functions.
Then you need only one EXISTS(), not two. And have
PRIMARY KEY(from_id, to_id)
Now to rethink the purpose of the query... You are looking for the highest id that is not "related" to id #1, correct? That sounds like a LEFT JOIN.
SELECT p.*
FROM profiles AS p
LEFT JOIN request_for_friendship AS r ON r.to_id = p.id AND r.from_id = 1
WHERE r.to_id IS NULL
ORDER BY p.id DESC
LIMIT 1;
This may run at about the same speed as the EXISTS version; both walk through profiles from the highest id, reaching into the other table to see if a row is there.
If there is no such id, then the entire profiles table will be scanned, plus the same number of probes into the other table.

Can query be optimized: Get a records max date then join the max date's values

I've created a query that returns the results I want but I feel there must be a better way to do this. Any guidance would be appreciated.
I am trying to get all items for a specific meeting and join their max meeting date < X and join the max date's committee acronym. X is the current meeting date.
I've tried a few different queries but none, other than the one below, returned the expected results all the time.
You can see this query in action by going to rextester.
DROP TABLE IF EXISTS `committees`;
CREATE TABLE committees
(`id` int, `acronym` varchar(4))
;
INSERT INTO committees
(`id`, `acronym`)
VALUES
(1, 'Com1'),
(2, 'Com2'),
(3, 'Com3')
;
DROP TABLE IF EXISTS `meetings`;
CREATE TABLE meetings
(`id` int, `date` datetime, `committee_id` int)
;
INSERT INTO meetings
(`id`, `date`, `committee_id`)
VALUES
(1, '2017-01-01 00:00:00', 1),
(2, '2017-02-02 00:00:00', 2),
(3, '2017-03-03 00:00:00', 2)
;
DROP TABLE IF EXISTS `agenda_items`;
CREATE TABLE agenda_items
(`id` int, `name` varchar(6))
;
INSERT INTO agenda_items
(`id`, `name`)
VALUES
(1, 'Item 1'),
(2, 'Item 2'),
(3, 'Item 3')
;
DROP TABLE IF EXISTS `join_agenda_items_meetings`;
CREATE TABLE join_agenda_items_meetings
(`id` int, `agenda_item_id` int, `meeting_id` int)
;
INSERT INTO join_agenda_items_meetings
(`id`, `agenda_item_id`, `meeting_id`)
VALUES
(1, 1, 1),
(2, 1, 2),
(3, 2, 1),
(4, 3, 2),
(5, 2, 1),
(6, 1, 3)
;
SELECT agenda_items.id,
meetings.id,
meetings.date,
sub_one.max_date,
sub_two.acronym
FROM agenda_items
LEFT JOIN (SELECT ai.id AS ai_id,
me.id AS me_id,
Max(me.date) AS max_date
FROM agenda_items AS ai
JOIN join_agenda_items_meetings AS jaim
ON jaim.agenda_item_id = ai.id
JOIN meetings AS me
ON me.id = jaim.meeting_id
WHERE me.date < '2017-02-02'
GROUP BY ai_id) sub_one
ON sub_one.ai_id = agenda_items.id
LEFT JOIN (SELECT agenda_items.id AS age_id,
meetings.date AS meet_date,
committees.acronym AS acronym
FROM agenda_items
JOIN join_agenda_items_meetings
ON join_agenda_items_meetings.agenda_item_id = agenda_items.id
JOIN meetings
ON meetings.id = join_agenda_items_meetings.meeting_id
JOIN committees
ON committees.id = meetings.committee_id
WHERE meetings.date) sub_two
ON sub_two.age_id = agenda_items.id
AND sub_one.max_date = sub_two.meet_date
JOIN join_agenda_items_meetings
ON agenda_items.id = join_agenda_items_meetings.agenda_item_id
JOIN meetings
ON meetings.id = join_agenda_items_meetings.meeting_id
WHERE meetings.id = 2;
REVIEW / TESTING OF ANSWERS (REVISED):
I've revised the testing based on the comments made.
Since I put a bounty on this question I felt I should show how I'm evaluating the answers and give some feedback. Overall I'm very grateful to all who have helped out, thank you.
For testing, I reviewed the queries against:
the initial rextester
a modified version of the initial rextester with all 4 queries for 2 separate datasets
a larger data set from my actual database
My Original Query with EXPLAIN
+----+-------------+---------------------------+------+----------------------------------------------+
| id | select_type | table | rows | Extra |
+----+-------------+---------------------------+------+----------------------------------------------+
| 1 | PRIMARY | meetings | 1 | |
| 1 | PRIMARY | join_agenda_item_meetings | 1976 | Using where; Using index |
| 1 | PRIMARY | agenda_items | 1 | Using index |
| 1 | PRIMARY | <derived2> | 1087 | |
| 1 | PRIMARY | <derived3> | 2202 | |
| 3 | DERIVED | join_agenda_item_meetings | 1976 | Using index |
| 3 | DERIVED | meetings | 1 | Using where |
| 3 | DERIVED | committees | 1 | |
| 3 | DERIVED | agenda_items | 1 | Using index |
| 2 | DERIVED | jaim | 1976 | Using index; Using temporary; Using filesort |
| 2 | DERIVED | me | 1 | Using where |
| 2 | DERIVED | ai | 1 | Using index |
+----+-------------+---------------------------+------+----------------------------------------------+
12 rows in set (0.02 sec)
Paul Spiegel's answers.
The initial answer works and seems to be the most efficient option presented, much more than mine.
Paul Spiegel's first query pulls the fewest rows, is shorter and more readable than mine. It also doesn't need to reference a date which will be nicer when writing it as well.
+----+--------------------+-------+------+--------------------------+
| id | select_type | table | rows | Extra |
+----+--------------------+-------+------+--------------------------+
| 1 | PRIMARY | m1 | 1 | |
| 1 | PRIMARY | am1 | 1976 | Using where; Using index |
| 1 | PRIMARY | am2 | 1 | Using index |
| 1 | PRIMARY | m2 | 1 | |
| 2 | DEPENDENT SUBQUERY | am3 | 1 | Using index |
| 2 | DEPENDENT SUBQUERY | m3 | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | c3 | 1 | Using where |
+----+--------------------+-------+------+--------------------------+
7 rows in set (0.00 sec)
This query also returns the correct results when adding DISTINCT to the select statement. This query does not perform as well as the first though (but it is close).
+----+-------------+------------+------+--------------------------+
| id | select_type | table | rows | Extra |
+----+-------------+------------+------+--------------------------+
| 1 | PRIMARY | <derived2> | 5 | Using temporary |
| 1 | PRIMARY | am | 1 | Using index |
| 1 | PRIMARY | m | 1 | |
| 1 | PRIMARY | c | 1 | Using where |
| 2 | DERIVED | m1 | 1 | |
| 2 | DERIVED | am1 | 1787 | Using where; Using index |
| 2 | DERIVED | am2 | 1 | Using index |
| 2 | DERIVED | m2 | 1 | |
+----+-------------+------------+------+--------------------------+
8 rows in set (0.00 sec)
Stefano Zanini's answer
This query does return the expected results when using DISTINCT. Judging by EXPLAIN and the number of rows examined, it is more efficient than my original query, but Paul Spiegel's is just a bit better.
+----+-------------+------------+------+---------------------------------+
| id | select_type | table | rows | Extra |
+----+-------------+------------+------+---------------------------------+
| 1 | PRIMARY | me | 1 | Using temporary; Using filesort |
| 1 | PRIMARY | rel | 1787 | Using where; Using index |
| 1 | PRIMARY | <derived2> | 1087 | |
| 1 | PRIMARY | rel2 | 1 | Using index |
| 1 | PRIMARY | me2 | 1 | Using where |
| 1 | PRIMARY | co | 1 | |
| 2 | DERIVED | t1 | 1787 | Using index |
| 2 | DERIVED | t2 | 1 | Using where |
+----+-------------+------------+------+---------------------------------+
8 rows in set (0.00 sec)
EoinS' answer
As noted in the comments, this answer works if meetings are sequential, but they may not be unfortunately.
This one is a bit crazy.. Let's do it step by step:
The first step is a basic join
set @meeting_id = 2;
select am1.meeting_id,
am1.agenda_item_id,
m1.date as meeting_date
from meetings m1
join join_agenda_items_meetings am1 on am1.meeting_id = m1.id
where m1.id = @meeting_id;
We select the meeting (id = 2) and the corresponding agenda_item_ids. This will already return the rows we need with the first three columns.
The next step is to get the last meeting date for every agenda item. We join the first query with the join table and the corresponding meetings, excluding the meeting with id = 2 (am2.meeting_id <> am1.meeting_id). We only want meetings with a date before the current meeting (m2.date < m1.date). From all those meetings we only want the latest date for each agenda item, so we group by the agenda item and select max(m2.date):
select am1.meeting_id,
am1.agenda_item_id,
m1.date as meeting_date,
max(m2.date) as max_date
from meetings m1
join join_agenda_items_meetings am1 on am1.meeting_id = m1.id
left join join_agenda_items_meetings am2
on am2.agenda_item_id = am1.agenda_item_id
and am2.meeting_id <> am1.meeting_id
left join meetings m2
on m2.id = am2.meeting_id
and m2.date < m1.date
where m1.id = @meeting_id
group by m1.id, am1.agenda_item_id;
This way we get the fourth column (max_date).
Last step is to select the acronym of the meeting with the last date (max_date). And this is the crazy part - We can use a correlated subquery in the SELECT clause. And we can use max(m2.date) for the correlation:
select c3.acronym
from meetings m3
join join_agenda_items_meetings am3 on am3.meeting_id = m3.id
join committees c3 on c3.id = m3.committee_id
where am3.agenda_item_id = am2.agenda_item_id
and m3.date = max(m2.date)
The final query would be:
select am1.meeting_id,
am1.agenda_item_id,
m1.date as meeting_date,
max(m2.date) as max_date,
( select c3.acronym
from meetings m3
join join_agenda_items_meetings am3 on am3.meeting_id = m3.id
join committees c3 on c3.id = m3.committee_id
where am3.agenda_item_id = am2.agenda_item_id
and m3.date = max(m2.date)
) as acronym
from meetings m1
join join_agenda_items_meetings am1 on am1.meeting_id = m1.id
left join join_agenda_items_meetings am2
on am2.agenda_item_id = am1.agenda_item_id
and am2.meeting_id <> am1.meeting_id
left join meetings m2
on m2.id = am2.meeting_id
and m2.date < m1.date
where m1.id = @meeting_id
group by m1.id, am1.agenda_item_id;
http://rextester.com/JKK60222
To be honest, I was surprised that you can use max(m2.date) in the subquery.
Another solution - Use the second query in a subquery (derived table). Join committees over meetings and the join table using max_date. Only keep rows with an acronym and rows without a max_date.
select t.*, c.acronym
from (
select am1.meeting_id,
am1.agenda_item_id,
m1.date as meeting_date,
max(m2.date) as max_date
from meetings m1
join join_agenda_items_meetings am1 on am1.meeting_id = m1.id
left join join_agenda_items_meetings am2
on am2.agenda_item_id = am1.agenda_item_id
and am2.meeting_id <> am1.meeting_id
left join meetings m2
on m2.id = am2.meeting_id
and m2.date < m1.date
where m1.id = @meeting_id
group by m1.id, am1.agenda_item_id
) t
left join join_agenda_items_meetings am
on am.agenda_item_id = t.agenda_item_id
and t.max_date is not null
left join meetings m
on m.id = am.meeting_id
and m.date = t.max_date
left join committees c on c.id = m.committee_id
where t.max_date is null or c.acronym is not null;
http://rextester.com/BBMDFL23101
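The derived-table variant can be tried on the sample data from the question with the sketch below. It runs the query through Python's built-in sqlite3 module, which is an assumption made so the example works without a MySQL server; the MySQL user variable is replaced by a bound parameter named :meeting_id, and an ORDER BY is added so the output order is fixed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE committees (id INT, acronym TEXT);
    CREATE TABLE meetings (id INT, date TEXT, committee_id INT);
    CREATE TABLE join_agenda_items_meetings (id INT, agenda_item_id INT, meeting_id INT);

    INSERT INTO committees VALUES (1,'Com1'), (2,'Com2'), (3,'Com3');
    INSERT INTO meetings VALUES
        (1,'2017-01-01 00:00:00',1),
        (2,'2017-02-02 00:00:00',2),
        (3,'2017-03-03 00:00:00',2);
    INSERT INTO join_agenda_items_meetings VALUES
        (1,1,1), (2,1,2), (3,2,1), (4,3,2), (5,2,1), (6,1,3);
""")

query = """
    SELECT t.meeting_id, t.agenda_item_id, t.meeting_date, t.max_date, c.acronym
    FROM (
        -- Step 2 from above: latest earlier meeting date per agenda item.
        SELECT am1.meeting_id,
               am1.agenda_item_id,
               m1.date      AS meeting_date,
               MAX(m2.date) AS max_date
        FROM meetings m1
        JOIN join_agenda_items_meetings am1 ON am1.meeting_id = m1.id
        LEFT JOIN join_agenda_items_meetings am2
               ON am2.agenda_item_id = am1.agenda_item_id
              AND am2.meeting_id <> am1.meeting_id
        LEFT JOIN meetings m2
               ON m2.id = am2.meeting_id
              AND m2.date < m1.date
        WHERE m1.id = :meeting_id
        GROUP BY m1.id, am1.agenda_item_id
    ) t
    LEFT JOIN join_agenda_items_meetings am
           ON am.agenda_item_id = t.agenda_item_id
          AND t.max_date IS NOT NULL
    LEFT JOIN meetings m
           ON m.id = am.meeting_id
          AND m.date = t.max_date
    LEFT JOIN committees c ON c.id = m.committee_id
    WHERE t.max_date IS NULL OR c.acronym IS NOT NULL
    ORDER BY t.agenda_item_id
"""

rows = conn.execute(query, {"meeting_id": 2}).fetchall()
for row in rows:
    print(row)
```

For meeting 2, agenda item 1 was last seen in meeting 1 (committee Com1), while agenda item 3 has no earlier meeting and so keeps NULLs in the last two columns.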
Using your schema I used the below query, assuming that all meetings entries are sequential:
set @mymeeting = 2;
select j.agenda_item_id, m.id, m.date, mp.date, c.acronym
from meetings m
left join join_agenda_items_meetings j on j.meeting_id = m.id
left join join_agenda_items_meetings jp on jp.meeting_id = m.id - 1 and jp.agenda_item_id = j.agenda_item_id
left join meetings mp on mp.id = jp.meeting_id
left join committees c on mp.committee_id = c.id
where m.id = @mymeeting;
I create a variable just to make it easy to change meetings on the fly.
Here is a functional example in Rextester
Thanks for making your schema so easy to reproduce!
I found this problem quite challenging, and the results I achieved are not jaw-dropping, but I managed to get rid of one of the sub-queries and maybe a few joins, and this is the result:
select distinct me.ID, me.DATE, rel.AGENDA_ITEM_ID, sub.MAX_DATE, co.ACRONYM
from MEETINGS me
join JOIN_AGENDA_ITEMS_MEETINGS rel /* Note 1*/
on me.ID = rel.MEETING_ID
left join (
select t1.AGENDA_ITEM_ID, max(t2.DATE) MAX_DATE
from JOIN_AGENDA_ITEMS_MEETINGS t1
join MEETINGS t2
on t2.ID = t1.MEETING_ID
where t2.DATE < '2017-02-02'
group by t1.AGENDA_ITEM_ID
) sub
on rel.AGENDA_ITEM_ID = sub.AGENDA_ITEM_ID /* Note 2 */
left join JOIN_AGENDA_ITEMS_MEETINGS rel2
on rel2.AGENDA_ITEM_ID = rel.AGENDA_ITEM_ID /* Note 3 */
left join MEETINGS me2
on rel2.MEETING_ID = me2.ID and
sub.MAX_DATE = me2.DATE /* Note 4 */
left join COMMITTEES co
on co.ID = me2.COMMITTEE_ID
where me.ID = 2 and
(sub.MAX_DATE is null or me2.DATE is not null) /* Note 5 */
order by rel.AGENDA_ITEM_ID, rel2.MEETING_ID;
Notes
1) You don't need the join with AGENDA_ITEMS, since the ID is already available in the relationship table.
2) Up to this point we have the current meeting, its agenda items and their "calculated" max date.
3) We get all meetings of each agenda item...
4) ...so that we can pick the meeting whose date matches the max date we calculated previously.
5) This condition is needed because all the joins from rel2 onward have to be left joins (some agenda items may have no previous meeting and hence MAX_DATE = null), but without it me2 would give some agenda items undesired meetings.

What's the most efficient way to structure a 2-dimensional MySQL query?

I have a MySQL database with the following tables and fields:
Student (id)
Class (id)
Grade (id, student_id, class_id, grade)
The student and class tables are indexed on id (primary keys). The grade table is indexed on id (primary key) and student_id, class_id and grade.
I need to construct a query which, given a class ID, gives a list of all other classes and the number of students who scored more in that other class.
Essentially, given the following data in the grades table:
student_id | class_id | grade
--------------------------------------
1 | 1 | 87
1 | 2 | 91
1 | 3 | 75
2 | 1 | 68
2 | 2 | 95
2 | 3 | 84
3 | 1 | 76
3 | 2 | 88
3 | 3 | 71
Querying with class ID 1 should yield:
class_id | total
-------------------
2 | 3
3 | 1
Ideally I'd like this to execute in a few seconds, as I'd like it to be part of a web interface.
The issue I have is that in my database, I have over 1300 classes and 160,000 students. My grade table has almost 15 million rows and as such, the query takes a long time to execute.
Here's what I've tried so far along with the times each query took:
-- I manually stopped execution after 2 hours
SELECT c.id, COUNT(*) AS total
FROM classes c
INNER JOIN grades a ON a.class_id = c.id
INNER JOIN grades b ON b.grade < a.grade AND
a.student_id = b.student_id AND
b.class_id = 1
WHERE c.id != 1
GROUP BY c.id
-- I manually stopped execution after 20 minutes
SELECT c.id,
(
SELECT COUNT(*)
FROM grades g
WHERE g.class_id = c.id AND g.grade > (
SELECT grade
FROM grades
WHERE student_id = g.student_id AND
class_id = 1
)
) AS total
FROM classes c
WHERE c.id != 1;
-- 1 min 12 sec
CREATE TEMPORARY TABLE temp_blah (student_id INT(11) PRIMARY KEY, grade INT);
INSERT INTO temp_blah SELECT student_id, grade FROM grades WHERE class_id = 1;
SELECT c.id,
(
SELECT COUNT(*)
FROM grades g
INNER JOIN temp_blah t ON g.student_id = t.student_id
WHERE g.class_id = c.id AND t.grade < g.grade
) AS total
FROM classes c
WHERE c.id != 1;
-- Same thing but with joins instead of a subquery - 1 min 54 sec
SELECT c.id,
COUNT(*) AS total
FROM classes c
INNER JOIN grades g ON c.id = g.class_id
INNER JOIN temp_blah t ON g.student_id = t.student_id
WHERE c.id != 1
GROUP BY c.id;
I also considered creating a 2D table, with students as rows and classes as columns, however I can see two issues with this:
MySQL implements a maximum column count (4096) and maximum row size (in bytes) which may be exceeded by this approach
I can't think of a good way to query that structure to get the results I need
I also considered performing these calculations as background jobs and storing the results somewhere, but for the information to remain current (it must), they would need to be recalculated every time a student, class or grade record was created or updated.
Does anyone know a more efficient way to construct this query?
EDIT: Create table statements:
CREATE TABLE `classes` (
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1331 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci$$
CREATE TABLE `students` (
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=160803 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci$$
CREATE TABLE `grades` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`student_id` int(11) DEFAULT NULL,
`class_id` int(11) DEFAULT NULL,
`grade` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_grades_on_student_id` (`student_id`),
KEY `index_grades_on_class_id` (`class_id`),
KEY `index_grades_on_grade` (`grade`)
) ENGINE=InnoDB AUTO_INCREMENT=15507698 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci$$
Output of explain on the most efficient query (the 1 min 12 sec one):
id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | PRIMARY | c | range | PRIMARY | PRIMARY | 4 | | 683 | Using where; Using index
2 | DEPENDENT SUBQUERY | g | ref | index_grades_on_student_id,index_grades_on_class_id,index_grades_on_grade | index_grades_on_class_id | 5 | mydb.c.id | 830393 | Using where
2 | DEPENDENT SUBQUERY | t | eq_ref | PRIMARY | PRIMARY | 4 | mydb.g.student_id | 1 | Using where
Another edit - explain output for sgeddes's suggestion:
+----+-------------+------------+--------+---------------+------+---------+------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+------+---------+------+----------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 14953992 | Using where; Using temporary; Using filesort |
| 2 | DERIVED | <derived3> | system | NULL | NULL | NULL | NULL | 1 | Using filesort |
| 2 | DERIVED | G | ALL | NULL | NULL | NULL | NULL | 15115388 | |
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+----+-------------+------------+--------+---------------+------+---------+------+----------+----------------------------------------------+
I think this should work for you using SUM and CASE:
SELECT C.Id,
SUM(
CASE
WHEN G.Grade > C2.Grade THEN 1 ELSE 0
END
)
FROM Class C
INNER JOIN Grade G ON C.Id = G.Class_Id
LEFT JOIN (
SELECT Grade, Student_Id, Class_Id
FROM Class
JOIN Grade ON Class.Id = Grade.Class_Id
WHERE Class.Id = 1
) C2 ON G.Student_Id = C2.Student_Id
WHERE C.Id <> 1
GROUP BY C.Id
--EDIT--
In response to your comment, here is another attempt that should be much faster:
SELECT
Class_Id,
SUM(CASE WHEN Grade > minGrade THEN 1 ELSE 0 END)
FROM
(
SELECT
Student_Id,
@classToCheck:=
IF(G.Class_Id = 1, Grade, @classToCheck) minGrade ,
Class_Id,
Grade
FROM Grade G
JOIN (SELECT @classToCheck:= 0) t
ORDER BY Student_Id, IF(Class_Id = 1, 0, 1)
) t
WHERE Class_Id <> 1
GROUP BY Class_ID
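To make the idea behind the ordered subquery easier to follow, here is a rough Python sketch of the same scan: sort the rows by student with the class being checked first, carry that grade forward in a variable, and compare every other grade against it. This is only an emulation on the sample data from the question, not MySQL itself; the function name `count_higher_by_scan` and the constant `SAMPLE_GRADES` are made up for the illustration.

```python
# Sample rows from the question: (student_id, class_id, grade).
SAMPLE_GRADES = [
    (1, 1, 87), (1, 2, 91), (1, 3, 75),
    (2, 1, 68), (2, 2, 95), (2, 3, 84),
    (3, 1, 76), (3, 2, 88), (3, 3, 71),
]


def count_higher_by_scan(rows, base_class_id):
    """Emulate the single ordered pass that the user-variable query relies on."""
    # Same ordering as ORDER BY Student_Id, IF(Class_Id = 1, 0, 1):
    # each student's base-class row comes before their other rows.
    ordered = sorted(rows, key=lambda r: (r[0], 0 if r[1] == base_class_id else 1))

    carried = 0   # plays the role of the user variable, initialised to 0
    totals = {}   # class_id -> number of students who scored higher there
    for _student_id, class_id, grade in ordered:
        if class_id == base_class_id:
            carried = grade  # remember this student's base-class grade
            continue         # the outer WHERE Class_Id <> 1 drops this row
        # SUM(CASE WHEN Grade > minGrade THEN 1 ELSE 0 END), grouped by class
        totals[class_id] = totals.get(class_id, 0) + (1 if grade > carried else 0)
    return totals


print(count_higher_by_scan(SAMPLE_GRADES, 1))
```

Two caveats that the sketch shares with the query as written: a student with no row for the class being checked is compared against the previous student's carried grade, and MySQL documents the evaluation order of user variables assigned inside a SELECT as undefined, so it is worth checking the results against a plain join on real data.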
Can you give this a try on the original data as well? It needs only one join :)
select
final.class_id, count(*) as total
from
(
select * from
(select student_id as p_student_id, grade as p_grade from table1 where class_id = 1) as partial
inner join table1 on table1.student_id = partial.p_student_id
where table1.class_id <> 1 and table1.grade > partial.p_grade
) as final
group by
final.class_id;
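If you want to check the single-join approach against the sample data from the question without a MySQL server, something like the following should work. It is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption on my part, and it says nothing about timing on 15 million rows). The table name `table1` follows the query above, the derived table is written without the outer wrapper, and the helper name `students_scoring_higher` is hypothetical.

```python
import sqlite3

# Sample rows from the question: (student_id, class_id, grade).
SAMPLE_GRADES = [
    (1, 1, 87), (1, 2, 91), (1, 3, 75),
    (2, 1, 68), (2, 2, 95), (2, 3, 84),
    (3, 1, 76), (3, 2, 88), (3, 3, 71),
]

# One join: the base-class grades against every other grade of the same student.
SINGLE_JOIN_QUERY = """
SELECT g.class_id, COUNT(*) AS total
FROM (SELECT student_id, grade FROM table1 WHERE class_id = ?) AS base
INNER JOIN table1 AS g ON g.student_id = base.student_id
WHERE g.class_id <> ? AND g.grade > base.grade
GROUP BY g.class_id
ORDER BY g.class_id
"""


def students_scoring_higher(rows, base_class_id):
    """Load rows into an in-memory SQLite table and run the single-join query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE table1 (student_id INTEGER, class_id INTEGER, grade INTEGER)"
        )
        conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", rows)
        return conn.execute(
            SINGLE_JOIN_QUERY, (base_class_id, base_class_id)
        ).fetchall()
    finally:
        conn.close()


print(students_scoring_higher(SAMPLE_GRADES, 1))  # [(2, 3), (3, 1)]
```

Note that, because this is an inner join with COUNT(*), a class in which no student scored higher does not appear in the result at all, rather than appearing with a total of 0.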