table a
no name
2001 jon
2002 jonny
2003 mik
2004 mike
2005 mikey
2006 tom
2007 tomo
2008 tommy
table b
code name credits courseCode
A2 JAVA 25 wer
A3 php 25 wer
A4 oracle 25 wer
B2 p.e 50 oth
B3 sport 50 oth
C2 r.e 25 rst
C3 science 25 rst
C4 networks 25 rst
table c
studentNumber grade coursecode
2003 68 A2
2003 72 A3
2003 53 A4
2005 48 A2
2005 52 A3
2002 20 A2
2002 30 A3
2002 50 A4
2008 90 B2
2007 73 B2
2007 63 B3
SELECT a.num, a.Fname,
b.courseName, b.cMAXscore, b.cCode, c.stuGrade
FROM a
INNER JOIN c
ON a.no = c.no
INNER JOIN b
ON c.moduleCode = b.cCode
INNER JOIN b
ON SUM(b.cMAXscore) / (c.stuGrade)
AND b.cMAXscore = c.stug=Grade
GROUP BY a.Fname, b.cMAXscore, b.cCode, b.courseName,c.stuGrade
"calculate and display every student name(a.Fname) and their ID number(a.num) along with their grade (c.grade) versus the coursse name(b.courseName) and the courses max score(b.cMAXscoure). "
I cant figure out how to divide the MAX by the grade, can someone help?
From the specification, it doesn't look like an aggregate function or a GROUP BY would be necessary. But the specification is ambiguous. There's no table definitions (beyond the unfortunate names and some column references).
Definitions of the tables, along with example data and an example of the desired resultset would go a long ways to removing the ambiguity.
Based on the join predicates in the OP query, I'd suggest something like this query, as a starting point:
SELECT a.Fname
, a.num
, c.grade
, b.courseName
, b.cMAXsource
FROM a
JOIN c
ON c.no = a.no
JOIN b
ON b.cCode = c.moduleCode
ORDER
BY a.Fname
, a.num
, c.grade
, b.courseName
, b.cMAXsource
It seems like that would return the specified result (based on my interpretation of the vague specification.) If that's insufficient i.e. if that doesn't return the desired resultset, then in what way does the desired result differ from the result from this query?
(For more help with your question, I suggest you setup a sqlfiddle example with tables and example data. That will make it easier for someone to help you.)
FOLLOWUP
Based on the additional information provided in the question (table definitions and example data...
To get the maximum (highest) grade for a given course, you could use a query like this:
SELECT MAX(c.grade)
FROM c
WHERE c.coursecode = 'A2'
To get the highest grade for all courses:
SELECT c.coursecode
, MAX(c.grade) AS max_grade
FROM c
GROUP BY c.coursecode
ORDER BY c.coursecode
To match the highest grade for each course to each student grade, use that previous query as an inline view in another query. Something like this:
SELECT g.studentNumber
, g.grade
, g.coursecode
, h.coursecode
, h.highest_grade
FROM c g
JOIN ( SELECT c.coursecode
, MAX(c.grade) AS highest_grade
FROM c
GROUP BY c.coursecode
) h
ON h.coursecode = g.coursecode
To perform a calculation, you can use an expression in the SELECT list of the outer query.
For example, to divide the value of one column by another, you can use the division operator:
SELECT g.studentNumber AS student_number
, g.grade AS student_grade
, g.coursecode AS student_coursecode
, h.coursecode
, h.highest_grade
, g.grade / h.highest_grade AS `student_grade_divided_by_highest_grade`
FROM c g
JOIN ( SELECT c.coursecode
, MAX(c.grade) AS highest_grade
FROM c
GROUP BY c.coursecode
) h
ON h.coursecode = g.coursecode
If you want to also return the name of the student, you can perform a join operation to (the unfortunately named) table a. Assuming that studentnumber is UNIQUE in a :
LEFT
JOIN a
ON a.studentnumber = c.studentnumber
And include a.Fname AS student_first_name in the SELECT list.
If you also need columns from table b, then join that table as well. Assuming that coursecode is UNIQUE in b:
LEFT
JOIN b
ON b.coursecode = g.courscode
Then b.credits can be referenced in an expression in the SELECT list.
Beyond that, you need to be a little more explicit about what result should be returned by the query.
If you are after a "total overall grade" for a student, you'd need to specify how that result should be obtained.
Without knowing table definations it is very hard to provide solution to your problem.
Here is my version of what you are trying to look for:
DECLARE #Student TABLE
(StudentID INT IDENTITY,
FirstName VARCHAR(255),
LastName VARCHAR(255)
);
DECLARE #Course TABLE
(CourseID INT IDENTITY,
CourseCode VARCHAR(25),
CourseName VARCHAR(255),
MaxScore INT
);
DECLARE #Grade TABLE
(ID INT IDENTITY,
CourseID INT,
StudentID INT,
Score INT
);
--Student
insert into #Student(FirstName, LastName)
values ('Test', 'B')
insert into #Student(FirstName, LastName)
values ('Test123', 'K')
--Course
insert into #Course(CourseCode, CourseName, MaxScore)
values ('MAT101', 'MATH',100.00)
insert into #Course(CourseCode, CourseName, MaxScore)
values ('ENG101', 'ENGLISH',100.00)
--Grade
insert into #Grade(CourseID, StudentID, Score)
values (1, 1,93)
insert into #Grade(CourseID, StudentID, Score)
values (1, 1,65)
insert into #Grade(CourseID, StudentID, Score)
values (1, 1,100)
insert into #Grade(CourseID, StudentID, Score)
values (2, 1,100)
insert into #Grade(CourseID, StudentID, Score)
values (2, 1,69)
insert into #Grade(CourseID, StudentID, Score)
values (2, 1,95)
insert into #Grade(CourseID, StudentID, Score)
values (1, 2,100)
insert into #Grade(CourseID, StudentID, Score)
values (1, 2,65)
insert into #Grade(CourseID, StudentID, Score)
values (1, 2,100)
insert into #Grade(CourseID, StudentID, Score)
values (2, 2,100)
insert into #Grade(CourseID, StudentID, Score)
values (2, 2,88)
insert into #Grade(CourseID, StudentID, Score)
values (2, 2,96)
SELECT a.StudentID,
a.FirstName,
a.LastName,
c.CourseCode,
SUM(b.Score) AS 'StudentScore',
SUM(c.MaxScore) AS 'MaxCourseScore',
SUM(CAST(b.Score AS DECIMAL(5, 2))) / SUM(CAST(c.MaxScore AS DECIMAL(5, 2))) AS 'Grade'
FROM #Student a
INNER JOIN #Grade b ON a.StudentID = b.StudentID
INNER JOIN #Course c ON c.CourseID = b.CourseID
GROUP BY a.StudentID,
a.FirstName,
a.LastName,
c.CourseCode;
The problem statement doesn't say anything about dividing by the max, I think you're misunderstanding it.
You need to write a subquery that gets the maximum score for each class, using MAX and GROUP BY. You can then join this with the other tables.
SELECT s.name AS student_name, c.name AS course_name, g.grade, m.max_grade
FROM student AS s
JOIN grade AS g ON s.no = g.studentNumber
JOIN course AS c ON c.code = g.courseCode
JOIN (SELECT courseCode, MAX(grade) AS max_grade
FROM grade
GROUP BY courseCode) AS m
ON m.courseCode = c.courseCode
If you did need to divide the grade by the maximum, you can use g.grade/m.max_grade.
Related
Given a company table companylist and a import-export table trades, I want to find the total exports and imports per country. I want to sort them by country name and print 0s instead of nulls if exports/imports are 0. All countries need to be present in output.
Companylist table =>
name country
abc corp congo
arcus t.g. ghana
bob timbuktu
ddr ltd ghana
none at all nothingland
xyz corp bubbleland
Y zap timbuktu
trades table
id seller buyer value
20120125 bob arcus t.g. 100
20120216 abc corp ddr ltd 30
20120217 abc corp ddr ltd 50
20121107 abc corp bob 10
20123112 arcus t.g. Y zap 30
The tables DDL -
create table if not exists companylist (name varchar(30) not null, country varchar(30) not null, unique(name));
truncate table companylist;
create table if not exists trades (id integer not null, seller varchar(30) not null, buyer varchar(30) not null, value integer
not null, unique(id));
truncate table trades;
insert into companylist(name,country) values ('bob','timbuktu');
insert into companylist(name,country) values ('Y zap','timbuktu');
insert into companylist(name,country) values ('ddr ltd','ghana');
insert into companylist(name,country) values ('arcus t.g.','ghana');
insert into companylist(name,country) values ('abc corp','congo');
insert into companylist(name, country) values ('xyz corp', 'bubbleland');
insert into companylist(name,country) values ('none at all','nothingland');
insert into trades(id,seller,buyer,value) values (20121107,'abc corp','bob',10);
insert into trades(id,seller,buyer,value) values (20123112,'arcus t.g.','Y zap',30);
insert into trades(id,seller,buyer,value) values (20120125,'bob','arcus t.g.',100);
insert into trades(id,seller,buyer,value) values (20120216,'abc corp','ddr ltd',30);
insert into trades(id,seller,buyer,value) values (20120217,'abc corp','ddr ltd',50);
So far i have
select name, country, coalesce(sum(b.value),0) as export, coalesce(sum(c.value),0) as import
from companylist a left join trades b on a.name = b.seller
left join trades c on a.name = c.buyer
group by a.country order by a.country ASC;
I think this works, but does someone have a more elegant / better / different solution? I am learning sql so any feedback helps.
Conditional aggregation is an option:
SELECT country,
SUM(CASE WHEN a.name = b.seller
THEN b.value
ELSE 0
END) as export,
SUM(CASE WHEN a.name = b.buyer
THEN b.value
ELSE 0
END) as import
FROM companylist a
LEFT join trades b ON a.name IN (b.seller, b.buyer)
GROUP BY a.country
ORDER BY a.country ASC;
fiddle
PS. companylist.name makes no sense in output - it was removed.
I would recommend union all and group by:
select c.country, sum(sales) as exports, sum(buys) as imports
from countrylist c left join
((select t.seller as name, value as sales, 0 as buys
from trades t
) union all
(select t.buyer, 0 as sales, value as buys
from trades t
)
) bs
on c.name = bs.name
group by c.country;
This should be much more efficient than Akina's solution which uses a function in the on clause (essentially an or).
TABLE [tbl_hobby]
person_id (int) , hobby_id(int)
has many records. I want to get a SQL query to find all pairs of personid who have the same hobbies( same hobby_id ).
If A has hobby_id 1, B has too, if A doesn't have hobby_id 2, B doesn't have too, we will output A & B 's person_ids.
If A and B and C reach the limits, we output A & B , B & C, A & C.
I've finished in a very very very stupid method, multiple joins the table itself and multiple sub-queries. And of course be laughed by leader.
Is there any high performance method in a SQL for this question?
I have been thinking hard for this since 36 hrs ago......
sample data in mysql dump
CREATE TABLE `tbl_hobby` (
`person_id` int(11) NOT NULL,
`hobby_id` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `tbl_hobby` (`person_id`, `hobby_id`) VALUES
(1, 1),(1, 2),(1, 3),(1, 4),(1, 5),(2, 2),
(2, 3),(2, 4),(3, 1),(3, 2),(3, 3),(3, 4),
(4, 1),(4, 3),(4, 4),(5, 1),(5, 5),(5, 9),
(6, 2),(6, 3),(6, 4),(7, 1),(7, 3),(7, 7),
(8, 2),(8, 3),(8, 4),(9, 1),(9, 2),(9, 3),
(9, 4),(10, 1),(10, 5),(10, 9),(10, 11);
COMMIT;
Expert result: (2 and 6 and 8 same, 3 and 9 same)
2,6
2,8
6,8
3,9
Order of result records and order of the two number in one record is not important. Result record in one column or in two columns are all accepted since it can be easily concated or seperated.
Aggregate per person to get strings of their hobbies. Then aggregate per hobby list find out which belong to more than one person.
select hobbies, group_concat(person_id order by person_id) as persons
from
(
select person_id, group_concat(hobby_id order by hobby_id) as hobbies
from tbl_hobby
group by person_id
) persons
group by hobbies
having count(*) > 1
order by hobbies;
This gives a a list of persons per hobby. Which is the easiest way to output a solution as we would otherwise have to build all possible pairs.
UPDATE: If you want pairs, you'll have to query the table twice:
select p1.person_id as person 1, p2.person_id as person2
from
(
select person_id, group_concat(hobby_id order by hobby_id) as hobbies
from tbl_hobby
group by person_id
) p1
join
(
select person_id, group_concat(hobby_id order by hobby_id) as hobbies
from tbl_hobby
group by person_id
) p2 on p2.person_id > p1.person_id and p2.hobbies = p1.hobbies
order by person1, person2;
Alternative version, without using any proprietary string handling:
select distinct t1.person_id, t2.person_id
from tbl_hobby t1
join tbl_hobby t2
on t1.person_id < t2.person_id
where 2 = all (select count(*)
from tbl_hobby
where person_id in (t1.person_id, t2.person_id)
group by hobby_id);
Perhaps less efficient, but portable!
Using MySQL, how do I get the total items and total revenue for each manager's team? Suppose I have these 3 different tables (parent-child-grandchild):
Employee1 is under Supervisor1, and they are both under Manager1, and so on, but real data are in random arrangement. Color-coded the numbers to visualize which gets added.
I want my query to output the total items and total revenue of each manager's team like:
To easily create the table:
DROP TABLE IF EXISTS manager;
CREATE TABLE manager (id int, name varchar(55), no_of_items int, revenue int);
INSERT INTO manager (id, name, no_of_items, revenue)
VALUES
(1 , 'Manager1' , 10 , 100),
(2 , 'Manager2' , 20 , 200),
(3 , 'Manager3' , 30 , 300);
DROP TABLE IF EXISTS supervisor;
CREATE TABLE supervisor (id int, name varchar(55), manager_id int, no_of_items int, revenue int);
INSERT INTO supervisor (id, name, manager_id, no_of_items, revenue)
VALUES
(4 , 'Sup1' , 1, 100 , 1000),
(5 , 'Sup2' , 2, 200 , 2000),
(6 , 'Sup3' , 3, 300 , 3000);
DROP TABLE IF EXISTS employee;
CREATE TABLE employee (id int, name varchar(55), supervisor_id int, no_of_items int, revenue int);
INSERT INTO employee (id, name, supervisor_id, no_of_items, revenue)
VALUES
(7 , 'Emp1' , 4, 400 , 4000),
(8 , 'Emp2' , 5, 500 , 5000),
(9 , 'Emp3' , 4, 600 , 6000);
SQL Fiddle
Utilizing a combination of Nested Subqueries and UNION ALL, you can use the following:
SELECT inner_nest.manager_id,
inner_nest.name,
SUM(inner_nest.total_items) AS total_items,
SUM(inner_nest.total_revenue) AS total_revenue
FROM (
SELECT id as manager_id,
name,
SUM(no_of_items) AS total_items,
SUM(revenue) AS total_revenue
FROM manager
GROUP BY id
UNION ALL
SELECT m.id as manager_id,
m.name,
SUM(s.no_of_items) AS total_items,
SUM(s.revenue) AS total_revenue
FROM manager m
INNER JOIN supervisor s ON s.manager_id = m.id
GROUP BY m.id
UNION ALL
SELECT m.id as manager_id,
m.name,
SUM(e.no_of_items) AS total_items,
SUM(e.revenue) AS total_revenue
FROM manager m
INNER JOIN supervisor s ON s.manager_id = m.id
INNER JOIN employee e ON e.supervisor_id = s.id
GROUP BY m.id
) AS inner_nest
GROUP BY inner_nest.manager_id
Try this: http://sqlfiddle.com/#!9/7a0aef/20
select id,name,COALESCE(sum(distinct m),0)+COALESCE(sum(distinct s),0)+COALESCE(sum(e),0)
as total_item,COALESCE(sum(distinct mv),0)+COALESCE(sum(distinct sv),0)+COALESCE(sum(ev),0)
as total_revenue
from
(
select m.id,m.name,m.no_of_items as m,s.no_of_items as s,e.no_of_items as e,
m.revenue as mv,s.revenue as sv,e.revenue as ev
from manager m left join supervisor s
on m.id=s.manager_id
left join employee e on s.id=e.supervisor_id)a
group by id,name
A single query with the appropriate joins does the trick... In pseudo code (sorry, I'm typing on my phone)
Select a.manager_id, (a.total_items+b.total_items+c.total_items) as totalitems, (a.total_revenue+b.total_revenue+c.total_revenue) as totalrevenue
From parent_table a
Join child_table b on a.manager_id=b.manager.id
Join grandchild_table c on b.sup_id=c.sup_id
Group by a.manager_id
Some adjustments may be needed, but this should definitely point you in your way
SELECT StudentID, Fname, LName, S_LessonNumber, LessonName, Date, Cost
FROM STUDENT_2
JOIN LESSON ON S_LessonNumber = LessonNumber
NATURAL JOIN STUDENT_1
WHERE StudentID = '1001'
The resulting table I get with this query is as follows,
When attempting to display the total amount paid, and the total number of lessons taken, using the following query, I was only able return one row.
SELECT StudentID, Fname, LName, S_LessonNumber, LessonName, Date,
Cost,COUNT( DISTINCT S_LessonNumber ) , SUM( Cost )
FROM STUDENT_2
JOIN LESSON ON S_LessonNumber = LessonNumber
NATURAL JOIN STUDENT_1
WHERE StudentID = '1001'
Is there a way that I can return all 4 rows with the values for COUNT(DISTINCT S_LessonNumber) and SUM(Cost) repeated.
The desired output is as follows:
StudentID FName LName S_LessonNumber LessonName Date Cost COUNT SUM
1001 Hannibal Lecter 7 C--- --- 15 4 60
1001 Hannibal Lecter 6 Wa-- --- 15 4 60
1001 Hannibal Lecter 5 Tri-- --- 15 4 60
1001 Hannibal Lecter 1 Cha- --- 15 4 60
Aggregate functions will always return 1 row. If using subqueries is not a problem, you can do:
SELECT StudentID, Fname, LName, S_LessonNumber, LessonName, Date, Cost,
(SELECT COUNT( DISTINCT S_LessonNumber ) FROM STUDENT_2 JOIN LESSON ON S_LessonNumber = LessonNumber NATURAL JOIN STUDENT_1 WHERE StudentID = '1001') AS COUNT,
(SELECT SUM( Cost ) FROM STUDENT_2 JOIN LESSON ON S_LessonNumber = LessonNumber NATURAL JOIN STUDENT_1 WHERE StudentID = '1001') AS SUM
FROM STUDENT_2
JOIN LESSON ON S_LessonNumber = LessonNumber
NATURAL JOIN STUDENT_1
WHERE StudentID = '1001'
Check it here: http://rextester.com/SBUQ82088
create table if not exists fstudents (id int, fname text, lname text);
create table if not exists fstudents2 (student_id int, lesson_number int, lesson_date date, cost int);
create table if not exists flessons (number int, name text);
insert into fstudents values (1001, 'Anibal', 'Lecter');
insert into flessons values (1, 'Cha Cha');
insert into flessons values (2, 'Waltz');
insert into flessons values (3, 'Country');
insert into flessons values (4, 'Triple2');
insert into fstudents2 values (1001, 1, '2016-10-06', 15);
insert into fstudents2 values (1001, 2, '2016-10-07', 15);
insert into fstudents2 values (1001, 3, '2016-10-08', 15);
insert into fstudents2 values (1001, 4, '2016-10-09', 15);
Use a subquery to count an sum, and then join it to main query.
select st2.student_id, st.fname, st.lname, ls.name, st2.lesson_date, st2.cost, agg.courses, agg.total_cost
from fstudents2 st2
join fstudents st on st2.student_id = st.id
join flessons ls on st2.lesson_number = ls.number
join (select st3.student_id, count(st3.lesson_number) courses, sum(st3.cost) total_cost
from fstudents2 st3
group by st3.student_id) agg on agg.student_id = st2.student_id
where
st2.student_id = 1001;
I have to find a pair of students who take exactly the same classes from table that has studentID and courseID.
studentID | courseID
1 1
1 2
1 3
2 1
3 1
3 2
3 3
Query should return (1, 3).
The result also should not have duplicate rows such as (1,3) and (3,1).
Given sample data:
CREATE TABLE student_course (
student_id integer,
course_id integer,
PRIMARY KEY (student_id, course_id)
);
INSERT INTO student_course (student_id, course_id)
VALUES (1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3) ;
Use array aggregation
One option is to use a CTE to join on the ordered lists of courses each student is taking:
WITH student_coursearray(student_id, courses) AS (
SELECT student_id, array_agg(course_id ORDER BY course_id)
FROM student_course
GROUP BY student_id
)
SELECT a.student_id, b.student_id
FROM student_coursearray a INNER JOIN student_coursearray b ON (a.courses = b.courses)
WHERE a.student_id > b.student_id;
array_agg is actually part of the SQL standard, as is the WITH common-table expression syntax. Neither are supported by MySQL so you'll have to express this a different way if you want to support MySQL.
Find missing course pairings per-student
Another way to think about this would be "for every student pairing, find out if one is taking a class the other is not". This would lend its self to a FULL OUTER JOIN, but it's pretty awkward to express. You have to determine the pairings of student IDs of interest, then for each pairing do a full outer join across the set of classes each takes. If there are any null rows then one took a class the other didn't, so you can use that with a NOT EXISTS filter to exclude such pairings. That gives you this monster:
WITH student_id_pairs(left_student, right_student) AS (
SELECT DISTINCT a.student_id, b.student_id
FROM student_course a
INNER JOIN student_course b ON (a.student_id > b.student_id)
)
SELECT left_student, right_student
FROM student_id_pairs
WHERE NOT EXISTS (
SELECT 1
FROM (SELECT course_id FROM student_course WHERE student_id = left_student) a
FULL OUTER JOIN (SELECT course_id FROM student_course b WHERE student_id = right_student) b
ON (a.course_id = b.course_id)
WHERE a.course_id IS NULL or b.course_id IS NULL
);
The CTE is optional and may be replaced by a CREATE TEMPORARY TABLE AS SELECT ... or whatever if your DB doesn't support CTEs.
Which to use?
I'm very confident that the array approach will perform better in all cases, particularly because for a really large data set you can take the WITH expression, create a temporary table from the query instead, add an index on (courses, student_id) to it and do crazy-fast equality searching that'll well and truly pay off the cost of the index creation time. You can't do that with the subquery joins approach.
select courses,group_concat(studentID) from
(select studentID,
group_concat(courseID order by courseID) as courses
from Table1 group by studentID) abc
group by courses having courses like('%,%');
fiddle
Test case:
I created a somewhat realistic test case:
CREATE TEMP TABLE student_course (
student_id integer
,course_id integer
,PRIMARY KEY (student_id, course_id)
);
INSERT INTO student_course
SELECT *
FROM (VALUES (1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3)) v
-- to include some non-random values in test
UNION ALL
SELECT DISTINCT student_id, normal_rand((random() * 30)::int, 1000, 35)::int
FROM generate_series(4, 5000) AS student_id;
DELETE FROM student_course WHERE random() > 0.9; -- create some dead tuples
ANALYZE student_course; -- needed for temp table
Note the use of normal_rand() to populate the dummy table with a normal distribution of values. It's shipped with the tablefunc module, and since i am going to use that further down anyway ...
Also note the bold emphasis on the numbers I am going to manipulate for the benchmark to simulate various test cases.
Plain SQL
The question is rather basic and unclear. Find the first two students with matching courses? Or find all? Find couples of them or groups of students sharing the same courses?
Craig answers to:
Find all couples sharing the same courses.
C1 - Craig's first query
Plain SQL With a CTE and grouping by arrays, slightly formatted:
WITH student_coursearray(student_id, courses) AS (
SELECT student_id, array_agg(course_id ORDER BY course_id)
FROM student_course
GROUP BY student_id
)
SELECT a.student_id, b.student_id
FROM student_coursearray a
JOIN student_coursearray b ON (a.courses = b.courses)
WHERE a.student_id < b.student_id
ORDER BY a.student_id, b.student_id;
The second query in Craig's answer dropped out of the race right away. With more than just a few rows, performance quickly deteriorates badly. The CROSS JOIN is poison.
E1 - Improved version
There is one major weakness, ORDER BY per aggregate is a bad performer, so I rewrote with ORDER BY in a subquery:
WITH cte AS (
SELECT student_id, array_agg(course_id) AS courses
FROM (SELECT student_id, course_id FROM student_course ORDER BY 1, 2) sub
GROUP BY student_id
)
SELECT a.student_id, b.student_id
FROM cte a
JOIN cte b USING (courses)
WHERE a.student_id < b.student_id
ORDER BY 1,2;
E2 - Alternative interpretation of question
I think the generally more useful case is:
Find all students sharing the same courses.
So I return arrays of students with matching courses.
WITH s AS (
SELECT student_id, array_agg(course_id) AS courses
FROM (SELECT student_id, course_id FROM student_course ORDER BY 1, 2) sub
GROUP BY student_id
)
SELECT array_agg(student_id)
FROM s
GROUP BY courses
HAVING count(*) > 1
ORDER BY array_agg(student_id);
F1 - Dynamic PL/pgSQL function
To make this generic and fast I wrapped it into a plpgsql function with dynamic SQL:
CREATE OR REPLACE FUNCTION f_same_set(_tbl regclass, _id text, _match_id text)
RETURNS SETOF int[] AS
$func$
BEGIN
RETURN QUERY EXECUTE format(
$f$
WITH s AS (
SELECT %1$I AS id, array_agg(%2$I) AS courses
FROM (SELECT %1$I, %2$I FROM %3$s ORDER BY 1, 2) s
GROUP BY 1
)
SELECT array_agg(id)
FROM s
GROUP BY courses
HAVING count(*) > 1
ORDER BY array_agg(id)
$f$
,_id, _match_id, _tbl
);
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_same_set('student_course', 'student_id', 'course_id');
Works for any table with numeric columns. It's trivial to extend for other data types, too.
crosstab()
For a relatively small number of courses (and arbitrarily big number of students) crosstab() provided by the additional tablefunc module is another option in PostgreSQL. More general info here:
PostgreSQL Crosstab Query
Simple case
A simple case for the simple example in the question, much like explained in the linked answer:
SELECT array_agg(student_id)
FROM crosstab('
SELECT student_id, course_id, TRUE
FROM student_course
ORDER BY 1'
,'VALUES (1),(2),(3)'
)
AS t(student_id int, c1 bool, c2 bool, c3 bool)
GROUP BY c1, c2, c3
HAVING count(*) > 1;
F2 - Dynamic crosstab function
For the simple case, the crosstab variant was faster, so I build a plpgsql function with dynamic SQL and included it in the test. Functionally identical with F1.
CREATE OR REPLACE FUNCTION f_same_set_x(_tbl regclass, _id text, _match_id text)
RETURNS SETOF int[] AS
$func$
DECLARE
_ids int[]; -- for array of match_ids (course_id in example)
BEGIN
-- Get list of match_ids
EXECUTE format(
'SELECT array_agg(DISTINCT %1$I ORDER BY %1$I) FROM %2$s',_match_id, _tbl)
INTO _ids;
-- Main query
RETURN QUERY EXECUTE format(
$f$
SELECT array_agg(%1$I)
FROM crosstab('SELECT %1$I, %2$I, TRUE FROM %3$s ORDER BY 1'
,'VALUES (%4$s)')
AS t(student_id int, c%5$s bool)
GROUP BY c%6$s
HAVING count(*) > 1
ORDER BY array_agg(student_id)
$f$
,_id
,_match_id
,_tbl
,array_to_string(_ids, '),(') -- values
,array_to_string(_ids, ' bool,c') -- column def list
,array_to_string(_ids, ',c') -- names
);
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_same_set_x('student_course', 'student_id', 'course_id');
Benchmark
I tested on my small PostgreSQL test server.
PostgreSQL 9.1.9 on Debian Linux on an ~ 6 years old AMD Opteron Server. I ran 5 test sets with the above settings and each of the presented queries. Best of 5 with EXPLAIN ANALYZE.
I used these values for the bold numbers in the above test case to populate:
nr. of students / max. nr. of courses / standard deviation (results in more distinct course_ids)
1. 1000 / 30 / 35
2. 5000 / 30 / 50
3. 10000 / 30 / 100
4. 10000 / 10 / 10
5. 10000 / 5 / 5
C1
1. Total runtime: 57 ms
2. Total runtime: 315 ms
3. Total runtime: 663 ms
4. Total runtime: 543 ms
5. Total runtime: 2345 ms (!) - deteriorates with many pairs
E1
1. Total runtime: 46 ms
2. Total runtime: 251 ms
3. Total runtime: 529 ms
4. Total runtime: 338 ms
5. Total runtime: 734 ms
E2
1. Total runtime: 45 ms
2. Total runtime: 245 ms
3. Total runtime: 515 ms
4. Total runtime: 218 ms
5. Total runtime: 143 ms
F1 victor
1. Total runtime: 14 ms
2. Total runtime: 77 ms
3. Total runtime: 166 ms
4. Total runtime: 80 ms
5. Total runtime: 54 ms
F2
1. Total runtime: 62 ms
2. Total runtime: 336 ms
3. Total runtime: 1053 ms (!) crosstab() deteriorates with many distinct values
4. Total runtime: 195 ms
5. Total runtime: 105 ms (!) but performs well with fewer distinct values
The PL/pgSQL function with dynamic SQL, sorting rows in a subquery is clear victor.
Naive relational division implementation, with CTE:
WITH pairs AS (
SELECT DISTINCT a.student_id AS aaa
, b.student_id AS bbb
FROM student_course a
JOIN student_course b ON a.course_id = b.course_id
)
SELECT *
FROM pairs p
WHERE p.aaa < p.bbb
AND NOT EXISTS (
SELECT * FROM student_course nx1
WHERE nx1.student_id = p.aaa
AND NOT EXISTS (
SELECT * FROM student_course nx2
WHERE nx2.student_id = p.bbb
AND nx2.course_id = nx1.course_id
)
)
AND NOT EXISTS (
SELECT * FROM student_course nx1
WHERE nx1.student_id = p.bbb
AND NOT EXISTS (
SELECT * FROM student_course nx2
WHERE nx2.student_id = p.aaa
AND nx2.course_id = nx1.course_id
)
)
;
The same, without CTE's:
SELECT *
FROM (
SELECT DISTINCT a.student_id AS aaa
, b.student_id AS bbb
FROM student_course a
JOIN student_course b ON a.course_id = b.course_id
) p
WHERE p.aaa < p.bbb
AND NOT EXISTS (
SELECT * FROM student_course nx1
WHERE nx1.student_id = p.aaa
AND NOT EXISTS (
SELECT * FROM student_course nx2
WHERE nx2.student_id = p.bbb
AND nx2.course_id = nx1.course_id
)
)
AND NOT EXISTS (
SELECT * FROM student_course nx1
WHERE nx1.student_id = p.bbb
AND NOT EXISTS (
SELECT * FROM student_course nx2
WHERE nx2.student_id = p.aaa
AND nx2.course_id = nx1.course_id
)
)
;
The non-CTE version is faster, obviously.
Process to get this done in mysql
Create table student_course_agg
(
student_id int,
courses varchar(150)
);
INSERT INTO student_course_agg
select studentID ,GROUP_CONCAT(courseID ORDER BY courseID) courses
FROM STUDENTS
GROUP BY 1;
SELECT master.student_id m_student_id,child.student_id c_student_id
FROM student_course_agg master
JOIN student_course_ag child
ON master.student_id<child.student_id and master.courses=child.courses;
Direct query.
SELECT master.student_id m_student_id,child.student_id c_student_id
FROM (select studentID ,GROUP_CONCAT(courseID ORDER BY courseID) courses
FROM STUDENTS
GROUP BY 1) master
JOIN (select studentID ,GROUP_CONCAT(courseID ORDER BY courseID) courses
FROM STUDENTS
GROUP BY 1) child
ON master.studentID <child.studentID and master.courses=child.courses;