SQL multiple JOINs or subqueries but avoid cartesian product

SQL multiple JOINs or subqueries but avoid cartesian product - mysql

I want to realize an SQL database for a game. There are a number of players that participate in different tournaments. For each tournament, a player has a separate account. All games are listed in one large table in which the tournament accounts are used to describe winner, loser, along with the score of the game.
The schema is given in http://sqlfiddle.com/#!9/55378a or here again
CREATE TABLE `players` (
`id` int NOT NULL,
`name` varchar(5),
PRIMARY KEY (`id`)
);
CREATE TABLE `tournamentAccounts` (
`tId` int NOT NULL,
`playerId` int NOT NULL,
`handicap` int NOT NULL DEFAULT 10,
PRIMARY KEY (`tId`)
);
CREATE TABLE `games` (
`gameId` int NOT NULL,
`winnerTId` int NOT NULL,
`loserTId` int NOT NULL,
`score` int NOT NULL DEFAULT 0,
PRIMARY KEY (`gameId`)
);
INSERT INTO `players` (`id`, `name`) VALUES
(1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO `tournamentAccounts` (`tId`, `playerId`, `handicap`) VALUES
(1, 1, 10), (2, 1, 2), (3, 2, 0);
INSERT INTO `games` (`gameId`, `winnerTId`, `loserTId`, `score`) VALUES
(1, 1, 3, 3), (2, 1, 3, 2), (3, 3, 1, 6);
What I want to achieve: List for a specific player all tournament scores, i.e. handicap + scorepoints of won games - scorepoints of lost games. For the given inputs, the result set should contain two rows with total scores 9 (for tId=1) and 2 (for tId=2), respectively. The example here is simplified, as in my example there are more conditions to match between the tournamentAccounts and games tables (e.g. time slots etc.), but I guess I can extend it myself once I understood the basic approach :-)
My approaches until now failed as I cannot get a nice JOIN or subqueries to work (I would like to avoid stored procedures).
Attempt 1: straight forward join
SELECT t.*, (t.handicap +COALESCE(SUM(w.score),0) -COALESCE(SUM(l.score),0)) AS score
FROM tournamentAccounts t
LEFT JOIN games w ON w.winnerTId = t.tId
LEFT JOIN games l ON l.loserTId = t.tId
WHERE playerId = 1
GROUP BY t.tId
Although this returns the correct number of rows, the double LEFT JOIN causes a cartesian product as it seems: the two won games are joined with the lost game into two datasets, hence 10 + 3 - 6 + 2 - 6. This effect obviously becomes worse the more matching rows I have in the games table.
Attempt 2: UNION with JOIN (similar to sql avoid cartesian product)
SELECT SUM(COALESCE(x.aa,0))
FROM
((SELECT -l.score AS aa FROM games l LEFT JOIN tournamentAccounts t ON l.loserTId = t.tId WHERE t.playerId = 1)
UNION
(SELECT w.score AS aa FROM games w LEFT JOIN tournamentAccounts t ON w.winnerTId = t.tId WHERE t.playerId = 1)) x
With this I get the proper score value summed up, however it is not yet combined with the corresponding handicap value, and also I don't know how to extend from here to cover all tournament accounts of that player (here, I just took a small snapshot of data) in an SQL manner.

I would just make the games portion of your query into a union, not the whole thing:
SELECT t.*, (t.handicap +COALESCE(SUM(win_score),0) -COALESCE(SUM(loss_score),0)) AS score
FROM tournamentAccounts t
LEFT JOIN (
SELECT w.winnerTId AS tId, w.score AS win_score, 0 AS loss_score FROM games w
UNION ALL
SELECT l.loserTId, 0, l.score FROM games l
) games_won_or_lost ON games_won_or_lost.tId=t.tId
WHERE playerId = 1
GROUP BY t.tId
The other alternative is to undo the effects of the cartesian product. You know the win score is too high by a factor of the number of lost games, so replace SUM(w.score) with ROUND(SUM(w.score)/GREATEST(COUNT(DISTINCT l.gameId),1)). And similarly, SUM(l.score) becomes ROUND(SUM(l.score)/GREATEST(COUNT(DISTINCT w.gameId),1)).
fiddle

How about following:-
SELECT t.*, (t.handicap + coalesce(wscore,0) - coalesce(lscore,0)) AS score
FROM tournamentAccounts t
LEFT JOIN (
select sum(score) wscore, winnerTId wid
from games
group by winnerTid
) as w ON w.wid = t.tid
left join (
select sum(score) lscore, loserTid lid
from games
group by loserTid
) as l ON l.lid = t.tid
where playerId = 1
I got the result as
tId playerId handicap score
1 1 10 9
2 1 2 2

Related

SQL question. Find the two person having same hobbies in one table

TABLE [tbl_hobby]
person_id (int) , hobby_id(int)
has many records. I want to get a SQL query to find all pairs of personid who have the same hobbies( same hobby_id ).
If A has hobby_id 1, B has too, if A doesn't have hobby_id 2, B doesn't have too, we will output A & B 's person_ids.
If A and B and C reach the limits, we output A & B , B & C, A & C.
I've finished in a very very very stupid method, multiple joins the table itself and multiple sub-queries. And of course be laughed by leader.
Is there any high performance method in a SQL for this question?
I have been thinking hard for this since 36 hrs ago......
sample data in mysql dump
CREATE TABLE `tbl_hobby` (
`person_id` int(11) NOT NULL,
`hobby_id` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `tbl_hobby` (`person_id`, `hobby_id`) VALUES
(1, 1),(1, 2),(1, 3),(1, 4),(1, 5),(2, 2),
(2, 3),(2, 4),(3, 1),(3, 2),(3, 3),(3, 4),
(4, 1),(4, 3),(4, 4),(5, 1),(5, 5),(5, 9),
(6, 2),(6, 3),(6, 4),(7, 1),(7, 3),(7, 7),
(8, 2),(8, 3),(8, 4),(9, 1),(9, 2),(9, 3),
(9, 4),(10, 1),(10, 5),(10, 9),(10, 11);
COMMIT;
Expert result: (2 and 6 and 8 same, 3 and 9 same)
2,6
2,8
6,8
3,9
Order of result records and order of the two number in one record is not important. Result record in one column or in two columns are all accepted since it can be easily concated or seperated.

Aggregate per person to get strings of their hobbies. Then aggregate per hobby list find out which belong to more than one person.
select hobbies, group_concat(person_id order by person_id) as persons
from
(
select person_id, group_concat(hobby_id order by hobby_id) as hobbies
from tbl_hobby
group by person_id
) persons
group by hobbies
having count(*) > 1
order by hobbies;
This gives a a list of persons per hobby. Which is the easiest way to output a solution as we would otherwise have to build all possible pairs.
UPDATE: If you want pairs, you'll have to query the table twice:
select p1.person_id as person 1, p2.person_id as person2
from
(
select person_id, group_concat(hobby_id order by hobby_id) as hobbies
from tbl_hobby
group by person_id
) p1
join
(
select person_id, group_concat(hobby_id order by hobby_id) as hobbies
from tbl_hobby
group by person_id
) p2 on p2.person_id > p1.person_id and p2.hobbies = p1.hobbies
order by person1, person2;

Alternative version, without using any proprietary string handling:
select distinct t1.person_id, t2.person_id
from tbl_hobby t1
join tbl_hobby t2
on t1.person_id < t2.person_id
where 2 = all (select count(*)
from tbl_hobby
where person_id in (t1.person_id, t2.person_id)
group by hobby_id);
Perhaps less efficient, but portable!

AVG with LIMIT and GROUP BY

I'm looking to make a SQL query, but I can't do it... and I can't find an example like mine.
I have a simple table People with 3 columns, 7 records :
I'd like to get for each team, the average points of 2 bests people.
My Query:
SELECT team
, (SELECT AVG(point)
FROM People t2
WHERE t1.team = t2.team
ORDER
BY point DESC
LIMIT 2) as avg
FROM People t1
GROUP
BY team
Current result: (average on all people of each team)
Apparently, it's not possible to use a limit into subquery. "ORDER BY point DESC LIMIT 2" is ignored.
Result expected:
I want the average points of 2 bests people (with highest points) for each team, not the average points of all people of each team.
How can I do that? If anyone has any idea..
I'm on MySQL Database
Link of Fiddle : http://sqlfiddle.com/#!9/8c80ef/1
Thanks !

You can try this.
try to make a order number by a subquery, which order by point desc.
then only get top 2 row by each team, if you want to get other top number just modify the number in where clause.
CREATE TABLE `People` (
`id` int(11) NOT NULL,
`name` varchar(20) NOT NULL,
`team` varchar(20) NOT NULL,
`point` int(4) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
INSERT INTO `People` (`id`, `name`, `team`, `point`) VALUES
(1, 'Luc', 'Jupiter', 10),
(2, 'Marie', 'Saturn', 0),
(3, 'Hubert', 'Saturn', 0),
(4, 'Albert', 'Jupiter', 50),
(5, 'Lucy', 'Jupiter', 50),
(6, 'William', 'Saturn', 20),
(7, 'Zeus', 'Saturn', 40);
ALTER TABLE `People`
ADD PRIMARY KEY (`id`);
ALTER TABLE `People`
MODIFY `id` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=8;
Query 1:
SELECT team,avg(point) totle
FROM People t1
where (
select count(*)
from People t2
where t2.id >= t1.id and t1.team = t2.team
order by t2.point desc
) <=2 ## if you want to get other `top` number just modify this number
group by team
Results:
| team | totle |
|---------|-------|
| Jupiter | 50 |
| Saturn | 30 |

This is a pain in MySQL. If you want the two highest point values, you can do:
SELECT p.team, AVG(p2.point)
FROM people p
WHERE p.point >= (SELECT DISTINCT p2.point
FROM people p2
WHERE p2.team = p.team
ORDER BY p2.point DESC
LIMIT 1, 1 -- get the second one
);
Ties make this tricky, and your question isn't clear on what to do about them.

Fast group rank() function

There are various ways people try to emulate MSSQL RANK() or ROW_NUMBER() functions in MySQL, but all of them I've tried so far are slow.
I have a table that looks like this:
CREATE TABLE ratings
(`id` int, `category` varchar(1), `rating` int)
;
INSERT INTO ratings
(`id`, `category`, `rating`)
VALUES
(3, '*', 54),
(4, '*', 45),
(1, '*', 43),
(2, '*', 24),
(2, 'A', 68),
(3, 'A', 43),
(1, 'A', 12),
(3, 'B', 22),
(4, 'B', 22),
(4, 'C', 44)
;
Except it has 220,000 records. There are about 90,000 unique id's.
I wanted to rank the id's first by looking at the categories which were not * where a higher rating is a lower rank.
SELECT g1.id,
g1.category,
g1.rating,
Count(*) AS rank
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'
GROUP BY g1.id,
g1.category,
g1.rating
ORDER BY g1.category,
rank
Output:
id category rating rank
2 A 68 1
3 A 43 2
1 A 12 3
4 B 22 1
3 B 22 2
4 C 44 1
Then I wanted to take the smallest rank an id had, and average that with the rank they have within the * category. Giving a total query of:
SELECT X1.id,
(X1.rank + X2.minrank) / 2 AS OverallRank
FROM
(SELECT g1.id,
g1.category,
g1.rating,
Count(*) AS rank
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category = '*'
GROUP BY g1.id,
g1.category,
g1.rating
ORDER BY g1.category,
rank) X1
JOIN
(SELECT id,
Min(rank) AS MinRank
FROM
(SELECT g1.id,
g1.category,
g1.rating,
Count(*) AS rank
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'
GROUP BY g1.id,
g1.category,
g1.rating
ORDER BY g1.category,
rank) X
GROUP BY id) X2 ON X1.id = X2.id
ORDER BY overallrank
Giving me
id OverallRank
3 1.5000
4 1.5000
2 2.5000
1 3.0000
This query is correct and the output I want, but it just hangs on my real table of 220,000 records. How can I optimize it? My real table has an index on id,rating and category and id,category
Edit:
Result of SHOW CREATE TABLE ratings:
CREATE TABLE `rating` (
`id` int(11) NOT NULL,
`category` varchar(255) NOT NULL,
`rating` int(11) NOT NULL DEFAULT '1500',
`rd` int(11) NOT NULL DEFAULT '350',
`vol` float NOT NULL DEFAULT '0.06',
`wins` int(11) NOT NULL,
`losses` int(11) NOT NULL,
`streak` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`streak`,`rd`,`id`,`category`),
UNIQUE KEY `id_category` (`id`,`category`),
KEY `rating` (`rating`,`rd`),
KEY `streak_idx` (`streak`),
KEY `category_idx` (`category`),
KEY `id_rating_idx` (`id`,`rating`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
The PRIMARY KEY is the most common use case of queries to this table, that is why it's the clustered key. It's worth noting that the server is a raid 10 of SSDs with a 9GB/s FIO random read. So I don't suspect the indices not being clustered will affect much.
Output of (select count(distinct category) from ratings) is 50
In the interest that this could be how the data is or an oversight on me, I am included the export of the entire table. It is only 200KB zipped: https://www.dropbox.com/s/p3iv23zi0uzbekv/ratings.zip?dl=0
The first query takes 27 seconds to run

You can use temporary tables with an AUTO_INCREMENT column to generate ranks (row number).
For example - to generate ranks for the '*' category:
drop temporary table if exists tmp_main_cat_rank;
create temporary table tmp_main_cat_rank (
rank int unsigned auto_increment primary key,
id int NOT NULL
) engine=memory
select null as rank, id
from ratings r
where r.category = '*'
order by r.category, r.rating desc, r.id desc;
This runs in something like 30 msec. While your approach with the selfjoin takes 45 seconds on my machine. Even with a new index on (category, rating, id) it still takes 14 seconds to run.
To generate ranks per group (per category) is a bit more complicated. We can still use an AUTO_INCREMENT column, but will need to calculate and substract an offset per category:
drop temporary table if exists tmp_pos;
create temporary table tmp_pos (
pos int unsigned auto_increment primary key,
category varchar(50) not null,
id int NOT NULL
) engine=memory
select null as pos, category, id
from ratings r
where r.category <> '*'
order by r.category, r.rating desc, r.id desc;
drop temporary table if exists tmp_cat_offset;
create temporary table tmp_cat_offset engine=memory
select category, min(pos) - 1 as `offset`
from tmp_pos
group by category;
select t.id, min(t.pos - o.offset) as min_rank
from tmp_pos t
join tmp_cat_offset o using(category)
group by t.id
This runs in about 220 msec. The selfjoin solution takes 42 sec or 13 sec with the new index.
Now you just need to combine the last query with the first temp table, to get your final result:
select t1.id, (t1.min_rank + t2.rank) / 2 as OverallRank
from (
select t.id, min(t.pos - o.offset) as min_rank
from tmp_pos t
join tmp_cat_offset o using(category)
group by t.id
) t1
join tmp_main_cat_rank t2 using(id);
Overall runtime is ~280 msec without an additional index and ~240 msec with an index on (category, rating, id).
A note to the selfjoin approach: It's an elegant solution and performs fine with a small group size. It's fast with an average group size <= 2. It can be acceptable for a group size of 10. But you have an average group size 447 (count(*) / count(distinct category)). That means every row is joined with 447 other rows (on average). You can see the impact by removing the group by clause:
SELECT Count(*)
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'
The result is more than 10M rows.
However - with an index on (category, rating, id) your query runs in 33 seconds on my machine.

GROUP BY and custom order

I've read through the answers on MySQL order by before group by but applying it to my query ends up with a subquery in a subquery for a rather simple case so I'm wondering if this can be simplified:
Schema with sample data
For brevity I've omitted the other fields on the members table. Also, there's many more tables joined in the actual application but those are straightforward to join. It's the membership_stack table that's giving me issues.
CREATE TABLE members (
id int unsigned auto_increment,
first_name varchar(255) not null,
PRIMARY KEY(id)
);
INSERT INTO members (id, first_name)
VALUES (1, 'Tyler'),
(2, 'Marissa'),
(3, 'Alex'),
(4, 'Parker');
CREATE TABLE membership_stack (
id int unsigned auto_increment,
member_id int unsigned not null,
sequence int unsigned not null,
team varchar(255) not null,
`status` varchar(255) not null,
PRIMARY KEY(id),
FOREIGN KEY(member_id) REFERENCES members(id)
);
-- Algorithm to determine correct team:
-- 1. Only consider rows with the highest sequence number
-- 2. Order statuses and pick the first one found:
-- (active, completed, cancelled, abandoned)
INSERT INTO membership_stack (member_id, sequence, team, status)
VALUES (1, 1, 'instinct', 'active'),
(1, 1, 'valor', 'abandoned'),
(2, 1, 'valor', 'active'),
(2, 2, 'mystic', 'abandoned'),
(2, 2, 'valor', 'completed'),
(3, 1, 'instinct', 'completed'),
(3, 2, 'valor', 'active');
I can't change the database schema because the data is synchronized with an external data source.
Query
This is what I have so far:
SELECT m.id, m.first_name, ms.sequence, ms.team, ms.status
FROM membership_stack AS ms
JOIN (
SELECT member_id, MAX(sequence) AS sequence
FROM membership_stack
GROUP BY member_id
) AS t1
ON ms.member_id = t1.member_id
AND ms.sequence = t1.sequence
RIGHT JOIN members AS m
ON ms.member_id = m.id
ORDER BY m.id, FIELD(ms.status, 'active', 'completed', 'cancelled', 'abandoned');
This works as expected but members may appear multiple times if their "most recent sequence" involves more than one team. What I need to do is aggregate again on id and select the FIRST row in each group.
However that poses some issues:
There is no FIRST() function in MySQL
This entire resultset would become a subtable (subquery), which isn't a big deal here but the queries are quite big on the application.
It needs to be compatible with ONLY_FULL_GROUP_BY mode as it is enabled on MySQL 5.7 by default. I haven't checked but I doubt that FIELD(ms.status, 'active', 'completed', 'cancelled', 'abandoned') is considered a functionally dependent field on this resultset. The query also needs to be compatible with MySQL 5.1 as that is what we are running at the moment.
Goal
| id | first_name | sequence | team | status |
|----|------------|----------|----------|-----------|
| 1 | Tyler | 1 | instinct | active |
| 2 | Marissa | 2 | valor | completed |
| 3 | Alex | 2 | valor | active |
| 4 | Parker | NULL | NULL | NULL |
What can I do about this?
Edit: It has come to my attention that some members don't belong to any team. These members should be included in the resultset with null values for those fields. Question updated to reflect new information.

You can use a correlated subquery in the WHERE clause with LIMIT 1:
SELECT m.id, m.first_name, ms.sequence, ms.team, ms.status
FROM members AS m
JOIN membership_stack AS ms ON ms.member_id = m.id
WHERE ms.id = (
SELECT ms1.id
FROM membership_stack AS ms1
WHERE ms1.member_id = ms.member_id
ORDER BY ms1.sequence desc,
FIELD(ms1.status, 'active', 'completed', 'cancelled', 'abandoned'),
ms1.id asc
LIMIT 1
)
ORDER BY m.id;
Demo: http://rextester.com/HGU18448
Update
To include members who have no entries in the membership_stack table you should use a LEFT JOIN, and move the subquery condition from the WHERE clause to the ON clause:
SELECT m.id, m.first_name, ms.sequence, ms.team, ms.status
FROM members AS m
LEFT JOIN membership_stack AS ms
ON ms.member_id = m.id
AND ms.id = (
SELECT ms1.id
FROM membership_stack AS ms1
WHERE ms1.member_id = ms.member_id
ORDER BY ms1.sequence desc,
FIELD(ms1.status, 'active', 'completed', 'cancelled', 'abandoned'),
ms1.id asc
LIMIT 1
)
ORDER BY m.id;
Demo: http://rextester.com/NPI79503

I would do this using variables.
You are looking for the one membership_stack row that is maximal for your special ordering. I'm focusing just on that. The join back to members is trivial.
select ms.*
from (select ms.*,
(#rn := if(#m = member_id, #rn + 1,
if(#m := member_id, 1, 1)
)
) as rn
from membership_stack ms cross join
(select #m := -1, #rn := 0) params
order by member_id, sequence desc,
field(ms.status, 'active', 'completed', 'cancelled', 'abandoned')
) ms
where rn = 1;
The variables is how the logic is implemented. The ordering is key to getting the right result.
EDIT:
MySQL is quite finicky about LIMIT in subqueries. It is possible that this will work:
select ms.*
from membership_stack ms
where (sequence, status) = (select ms2.sequence, ms2.status
from membership_stack ms2
where ms2.member_id = ms.member_id
order by ms2.member_id, ms2.sequence desc,
field(ms2.status, 'active', 'completed', 'cancelled', 'abandoned')
limit 1
);

Retrieve one id referred by two ids

I'm stuck with the following problem:
SQL query for the table:
CREATE TABLE IF NOT EXISTS `thread_users` (
`thread_id` bigint(20) unsigned NOT NULL,
`user_id` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`thread_id`,`user_id`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Let's say that I have those data:
INSERT INTO `thread_users` (`thread_id`, `user_id`) VALUES
(1, 1),
(1, 2),
(2, 1),
(2, 2),
(2, 3),
(3, 1),
(3, 4);
I need to retrieve the thread_id referred by only 2 ids: X & Y (both known).
With the above data, I want to be able to retrieve the thread_id where only user_id = 1 & user_id = 2 are present.
What i Know for sure about this table:
If a thread is composed by only 2 users, there is no other threads containing only those two ids. (It's check outside mysql before the insertion)
A user can't be present in a thread more than once. (primary key)
What i have thinking of to resolve this problem:
Sum up (user_id 1 + user_id 2) search for SUMs equal to that result + (user_id = X OR user_id = Y). But i haven't been able to write correctly this query AND I also need to check the number of user_id in that thread...
Obviously: searching id where the number of user_id on threads are equal to 2 and where user_id are equals to X & Y.
Thanks for the help guys!

SELECT tu1.thread_id
FROM thread_users AS tu1
INNER JOIN thread_users AS tu2
ON tu1.thread_id = tu2.thread_id
AND tu1.user_id <> tu2.user_id
LEFT OUTER JOIN thread_users AS tu3
ON tu1.thread_id = tu3.thread_id
AND tu1.user_id <> tu3.user_id
AND tu2.user_id <> tu3.user_id
WHERE tu1.user_id = 1
AND tu2.user_id = 2
AND tu3.user_id IS NULL

Something like this
SELECT thread_id FROM thread_users WHERE user_id IN(1,2) GROUP BY thread_id
HAVING COUNT(user_id)=2
SQL fiddle

I think this is a bit simpler than the JOIN example, and also a bit faster:
SELECT `thread_id`, GROUP_CONCAT(`user_id`) AS this_match FROM `thread_users`
GROUP BY `thread_id` HAVING this_match = '1,2'

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

SQL multiple JOINs or subqueries but avoid cartesian product - mysql

Related

SQL question. Find the two person having same hobbies in one table

AVG with LIMIT and GROUP BY

Fast group rank() function

GROUP BY and custom order

Retrieve one id referred by two ids

Categories

Resources