MySQL - How to GROUP BY more columns and SUM() each group? - mysql

Currently my query looks like this:
SELECT alias.name killer,
SUM(kill_stats.amount) amount
FROM kill_stats
JOIN pickup ON pickup.logfile = 'CTFCL-20130813-1456-shutdown2'
JOIN account ON account.steam_id = '0:1:705272'
JOIN player victim ON victim.pickup_id = pickup.id AND victim.account_id = account.id
JOIN player killer ON killer.pickup_id = pickup.id AND kill_stats.killer_id = killer.id
JOIN alias ON killer.alias_id = alias.id
WHERE kill_stats.victim_id = victim.id AND NOT killer.team_id = victim.team_id
GROUP BY kill_stats.killer_id
ORDER BY amount DESC
kill_stats table layout:
CREATE TABLE `kill_stats` (
`killer_id` int(11) UNSIGNED NOT NULL,
`victim_id` int(11) UNSIGNED NOT NULL,
`weapon_id` int(11) UNSIGNED NOT NULL,
`conced` bit(1) NOT NULL,
`fc` bit(1) NOT NULL,
`amount` int(11) UNSIGNED NOT NULL,
PRIMARY KEY(`killer_id`, `victim_id`, `weapon_id`, `conced`, `fc`),
CONSTRAINT `Ref_Killer` FOREIGN KEY (`killer_id`)
REFERENCES `player`(`id`)
ON DELETE CASCADE
ON UPDATE CASCADE,
CONSTRAINT `Ref_Victim` FOREIGN KEY (`victim_id`)
REFERENCES `player`(`id`)
ON DELETE CASCADE
ON UPDATE CASCADE,
CONSTRAINT `Ref_Weapon` FOREIGN KEY (`weapon_id`)
REFERENCES `weapon`(`id`)
ON DELETE CASCADE
ON UPDATE CASCADE
)
ENGINE=INNODB
CHARACTER SET utf8
COLLATE utf8_unicode_ci ;
Here a more readable view replacing the foreign keys with dummy data:
+-----------+-----------+-----------+--------+----+--------+
| killer_id | victim_id | weapon_id | conced | fc | amount |
+-----------+-----------+-----------+--------+----+--------+
| Josephine | Frank | RPG | NO | NO | 14 |
+-----------+-----------+-----------+--------+----+--------+
| Josephine | Frank | Shotgun | YES | NO | 5 |
+-----------+-----------+-----------+--------+----+--------+
| Josephine | Frank | Shotgun | NO | NO | 3 |
+-----------+-----------+-----------+--------+----+--------+
| Miguel | Frank | Knife | NO | NO | 1 |
+-----------+-----------+-----------+--------+----+--------+
Using this example table the query above would return a table like this:
+-----------+--------+
| killer | amount |
+-----------+--------+
| Josephine | 22 |
+-----------+--------+
| Miguel | 1 |
+-----------+--------+
What I would like the output to be is:
+-----------+-------------+--------------+-----------------+
| killer | total_kills | conced_kills | victim_had_flag |
+-----------+-------------+--------------+-----------------+
| Josephine | 22 | 5 | 0 |
+-----------+-------------+--------------+-----------------+
| Miguel | 1 | 0 | 0 |
+-----------+-------------+--------------+-----------------+
Showing how often a certain player was killed by other players, the total amount of times that they killed him, the amount of conced kills and how often the player carried the flag when he got killed by them.
I'm not really sure how to achieve that. I have tried GROUP BY kill_stats.killer_id, kill_stats.conced, but the result is:
+-----------+--------+
| killer | amount |
+-----------+--------+
| Josephine | 17 | -> the ones with kill_stats.conced = NO
+-----------+--------+
| Josephine | 5 | -> the ones with kill_stats.conced = YES
+-----------+--------+
| Miguel | 1 | -> the ones with kill_stats.conced = NO (only row for that `killer_id`)
+-----------+--------+
I get multiple rows for killer_id and I want one row per killer_id holding all the data.
Ike Walker's solution was almost what I have been looking for, the final query to make it work as I wanted is:
SELECT alias.name killer,
SUM(kill_stats.amount) as total_kills ,
SUM(CASE WHEN kill_stats.conced then kill_stats.amount ELSE 0 END) as conced_kills ,
SUM(CASE WHEN kill_stats.fc then kill_stats.amount ELSE 0 END) as victim_had_flag
FROM kill_stats
JOIN pickup ON pickup.logfile = 'CTFCL-20130813-1456-shutdown2'
JOIN account ON account.steam_id = '0:1:705272'
JOIN player victim ON victim.pickup_id = pickup.id AND victim.account_id = account.id
JOIN player killer ON killer.pickup_id = pickup.id AND kill_stats.killer_id = killer.id
JOIN alias ON killer.alias_id = alias.id
WHERE kill_stats.victim_id = victim.id AND NOT killer.team_id = victim.team_id
GROUP BY kill_stats.killer_id
ORDER BY total_kills DESC
The difference from his solution, which put me on the right track, is that I needed to SUM() kill_stats.amount into the new columns (conced_kills and victim_had_flag) for each row that has the corresponding BIT(1) column set to 1, rather than count those rows.
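The distinction between summing amount and counting rows can be checked on the sample table from the question. This is only a sketch: it runs the SQL through Python's sqlite3 module as a stand-in for MySQL, stores the names directly instead of the foreign keys, and represents the BIT(1) flags as 0/1 integers. The conced_rows column is added purely for comparison.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for the MySQL kill_stats table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE kill_stats (
    killer TEXT, victim TEXT, weapon TEXT,
    conced INTEGER, fc INTEGER, amount INTEGER
);
INSERT INTO kill_stats VALUES
    ('Josephine', 'Frank', 'RPG',     0, 0, 14),
    ('Josephine', 'Frank', 'Shotgun', 1, 0, 5),
    ('Josephine', 'Frank', 'Shotgun', 0, 0, 3),
    ('Miguel',    'Frank', 'Knife',   0, 0, 1);
""")

rows = conn.execute("""
    SELECT killer,
           SUM(amount) AS total_kills,
           -- weighted: add up amount only where the flag is set
           SUM(CASE WHEN conced = 1 THEN amount ELSE 0 END) AS conced_kills,
           SUM(CASE WHEN fc = 1 THEN amount ELSE 0 END) AS victim_had_flag,
           -- unweighted: number of rows with the flag set
           SUM(conced) AS conced_rows
    FROM kill_stats
    GROUP BY killer
    ORDER BY total_kills DESC
""").fetchall()

print(rows)  # [('Josephine', 22, 5, 0, 1), ('Miguel', 1, 0, 0, 0)]
```

Josephine has one conced row that represents five kills, which is why summing 1 per row and summing amount give different answers.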

The standard way to do this is by combining SUM() with CASE for your secondary counts.
Here's your example query rewritten this way to give you the output you are looking for:
SELECT alias.name killer,
SUM(kill_stats.amount) as total_kills ,
SUM(CASE WHEN kill_stats.conced then 1 ELSE 0 END) as conced_kills ,
SUM(CASE WHEN kill_stats.fc then 1 ELSE 0 END) as victim_had_flag
FROM kill_stats
JOIN pickup ON pickup.logfile = 'CTFCL-20130813-1456-shutdown2'
JOIN account ON account.steam_id = '0:1:705272'
JOIN player victim ON victim.pickup_id = pickup.id AND victim.account_id = account.id
JOIN player killer ON killer.pickup_id = pickup.id AND kill_stats.killer_id = killer.id
JOIN alias ON killer.alias_id = alias.id
WHERE kill_stats.victim_id = victim.id AND NOT killer.team_id = victim.team_id
GROUP BY kill_stats.killer_id
ORDER BY total_kills DESC

Try changing your query's SELECT list to this:
SELECT alias.name killer,
SUM(kill_stats.amount) amount,
SUM(kill_stats.conced) conced_kills,
SUM(kill_stats.fc) victim_had_flag
... and then pick up with the FROM kill_stats and finish it exactly as you posted it. I'm assuming that conced will be 1 if true and 0 if false; same with fc. I'm also assuming that fc indicates "victim had flag".

Related

How can I efficiently store a 2-way "like" system similar to Tinder?

On Tinder, when 2 members like each other, they are a "match" and are able to communicate. If only one member likes another, then it's not a match.
I'm trying to store this "Like" system in MySQL but can't figure out the best way to do it that's efficient. This is my setup right now.
mysql> desc likes_likes;
+--------------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+----------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| from_user_id | int(11) | NO | MUL | NULL | |
| to_user_id | int(11) | NO | MUL | NULL | |
| value | int(11) | NO | | NULL | |
| created_at | datetime | NO | | NULL | |
| updated_at | datetime | YES | | NULL | |
+--------------+----------+------+-----+---------+----------------+
6 rows in set (0.00 sec)
To find my matches, I would query something like...
SELECT to_user_id FROM likes_likes WHERE from_user_id = my_id AND value = 1 AND .... I don't know how to join the same table from here.
How do I perform the query on this table? If it's not efficient, what's a better structure to store this model?
1 is like, 0 is not like. Those are the only 2 values.
SELECT A.from_user_id AS userA, B.from_user_id AS userB
FROM likes_likes A
JOIN likes_likes B
ON A.from_user_id = B.to_user_id
AND A.to_user_id = B.from_user_id
AND A.id <> B.id
WHERE A.value = 1
AND B.value = 1
To find matches you can use a regular join with alias:
SELECT l1.from_user_id user1, l2.from_user_id user2
FROM likes_likes l1
INNER JOIN likes_likes l2 ON
l2.from_user_id = l1.to_user_id AND
l1.from_user_id = l2.to_user_id AND
l1.value = 1 AND l2.value = 1
The first condition pairs each opinion that user1 expressed about user2 with the opinions that user2 has expressed about anyone.
The second condition narrows that down to user2's opinion about user1, so that we now have two people who have each expressed an opinion about the other.
The last two checks make sure that they both like each other :)
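The self-join can be tried on a few rows to see which likes count as a match. This is a minimal sketch that runs the SQL through Python's sqlite3 module in place of MySQL; the sample rows and the my_id variable are made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; only the columns the join needs are kept.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE likes_likes (
    id INTEGER PRIMARY KEY,
    from_user_id INTEGER NOT NULL,
    to_user_id INTEGER NOT NULL,
    value INTEGER NOT NULL
);
INSERT INTO likes_likes (from_user_id, to_user_id, value) VALUES
    (1, 2, 1),  -- 1 likes 2
    (2, 1, 1),  -- 2 likes 1: mutual
    (1, 3, 1),  -- 1 likes 3
    (3, 1, 0),  -- 3 does not like 1: no match
    (4, 1, 1);  -- 4 likes 1, but 1 has no row for 4
""")

my_id = 1  # hypothetical current user
matches = conn.execute("""
    SELECT l1.to_user_id
    FROM likes_likes l1
    JOIN likes_likes l2
      ON l2.from_user_id = l1.to_user_id   -- the other person's row...
     AND l2.to_user_id = l1.from_user_id   -- ...pointing back at me
    WHERE l1.from_user_id = ?
      AND l1.value = 1 AND l2.value = 1
    ORDER BY l1.to_user_id
""", (my_id,)).fetchall()

print(matches)  # [(2,)]
```

Filtering on l1.from_user_id returns each match once from the current user's point of view; without that filter every mutual pair appears twice, once per direction.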
Here's a way using group by least(),greatest() to get each unique pair of users into a group and then checking if there are 2 rows per group
select least(from_user_id,to_user_id), greatest(from_user_id,to_user_id)
from likes_likes
where value = 1
-- and my_id in (from_user_id,to_user_id)
group by least(from_user_id,to_user_id), greatest(from_user_id,to_user_id)
having count(*) = 2
If it's possible to have multiple likes from the same user to another user (i.e. user 'A' likes user 'B' twice) then use having count(distinct from_user_id) = 2
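The effect of a duplicated like on the two HAVING variants can be seen in a small sketch. It uses Python's sqlite3 module instead of MySQL, so the two-argument scalar MIN() and MAX() of SQLite stand in for LEAST() and GREATEST(); the sample rows are invented.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; MIN(a, b)/MAX(a, b) replace LEAST/GREATEST.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE likes_likes (from_user_id INTEGER, to_user_id INTEGER, value INTEGER);
INSERT INTO likes_likes VALUES
    (1, 2, 1), (2, 1, 1),             -- mutual
    (1, 3, 1), (3, 1, 0),             -- one-sided
    (4, 1, 1),                        -- one-sided
    (2, 5, 1), (5, 2, 1), (5, 2, 1);  -- mutual, with a duplicated like
""")

QUERY = """
    SELECT MIN(from_user_id, to_user_id) AS user_lo,
           MAX(from_user_id, to_user_id) AS user_hi
    FROM likes_likes
    WHERE value = 1
    GROUP BY MIN(from_user_id, to_user_id), MAX(from_user_id, to_user_id)
    HAVING {condition}
    ORDER BY user_lo, user_hi
"""

by_row_count = conn.execute(QUERY.format(condition="COUNT(*) = 2")).fetchall()
by_distinct = conn.execute(
    QUERY.format(condition="COUNT(DISTINCT from_user_id) = 2")
).fetchall()

print(by_row_count)  # [(1, 2)]          the pair (2, 5) has three rows and is missed
print(by_distinct)   # [(1, 2), (2, 5)]
```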
Do you actually need value? If there is no row there is no like. From this query you should get 1 for a match and 0 for no mutual match.
SELECT
COUNT(*)
FROM
likes_likes i_like_you
JOIN likes_likes you_like_me ON i_like_you.to_user_id = you_like_me.from_user_id
AND i_like_you.from_user_id = you_like_me.to_user_id
WHERE
i_like_you.from_user_id = @my_id
AND you_like_me.from_user_id = @your_id
Is there any reason for id? It seems like the pair (from_user_id, to_user_id) should be UNIQUE, hence could be the 'natural' PRIMARY KEY.
I have yet to see any good argument for needing value.
So the table has shrunk to
CREATE TABLE likes_likes (
from_user_id ...,
to_user_id ...,
created_at ...,
updated_at ...,
PRIMARY KEY(from_user_id, to_user_id) -- serves as the necessary INDEX.
) ENGINE=InnoDB;
SELECT A.from_user_id AS userA,
B.from_user_id AS userB
FROM likes_likes A
JOIN likes_likes B
ON A.from_user_id = B.to_user_id
AND A.to_user_id = B.from_user_id
(I'm assuming you disallow a person liking himself.)

Sorting left join results on large open schema tables

I am designing an open schema database with the following table definitions
mysql> desc orders;
+-------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| json | text | NO | | NULL | |
+-------+---------+------+-----+---------+----------------+
mysql> desc ordersnames;
+-------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(330) | NO | UNI | NULL | |
+-------+--------------+------+-----+---------+----------------+
with an index on name
mysql> desc orderskeys;
+-----------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| reference | int(11) | NO | MUL | NULL | |
| nameref | int(11) | NO | MUL | NULL | |
| value | varchar(330) | NO | | NULL | |
+-----------+--------------+------+-----+---------+----------------+
with indices on:
reference,nameref,value
nameref,value
reference
Every JSON field (one dimension only) has one entry in the orderskeys table, where nameref references the field name as defined in ordersnames.
I would typically query like this:
SELECT
orderskeysdeliveryPostcode.value deliveryPostcode,
orders.ID,
orderskeysCN.value CN
FROM
orders
JOIN ordersnames as ordersnamesCN
on ordersnamesCN.name = 'CN'
JOIN orderskeys as orderskeysCN
on orderskeysCN.nameref = ordersnamesCN.ID
and orderskeysCN.reference = orders.ID
and orderskeysCN.value = '10094'
JOIN ordersnames as ordersnamesdeliveryPostcode
on ordersnamesdeliveryPostcode.name = 'deliveryPostcode'
JOIN orderskeys as orderskeysdeliveryPostcode
on orderskeysdeliveryPostcode.nameref = ordersnamesdeliveryPostcode.ID
and orderskeysdeliveryPostcode.reference = orders.ID
order by deliveryPostcode
limit 0,1000
yielding a result set like this
+------------------+--------+-------+
| deliveryPostcode | ID | CN |
+------------------+--------+-------+
| NULL | 251018 | 10094 |
| NULL | 157153 | 10094 |
| NULL | 95419 | 10094 |
| B-5030 | 172944 | 10094 |
+------------------+--------+-------+
-> lightning fast even with 400k + orders records
However, not all records contain all fields, so the above query will not return the records that lack a 'deliveryPostcode' field. I therefore have to query like this:
SELECT
orderskeysdeliveryPostcode.value deliveryPostcode,
orders.ID,
orderskeysCN.value CN
FROM
orders
JOIN ordersnames as ordersnamesCN
on ordersnamesCN.name = 'CN'
JOIN orderskeys as orderskeysCN
on orderskeysCN.nameref = ordersnamesCN.ID
and orderskeysCN.reference = orders.ID
and orderskeysCN.value = '10094'
JOIN ordersnames as ordersnamesdeliveryPostcode
on ordersnamesdeliveryPostcode.name = 'deliveryPostcode'
LEFT JOIN orderskeys as orderskeysdeliveryPostcode
on orderskeysdeliveryPostcode.nameref = ordersnamesdeliveryPostcode.ID
and orderskeysdeliveryPostcode.reference = orders.ID
limit 0,1000
-> equally fast, but as soon as I add an ORDER BY clause on the key value from a left joined table, mysql wants to do the sorting externally (temporary, filesort) instead of using an existing index.
SELECT
orderskeysdeliveryPostcode.value deliveryPostcode,
orders.ID,
orderskeysCN.value CN
FROM
orders
JOIN ordersnames as ordersnamesCN
on ordersnamesCN.name = 'CN'
JOIN orderskeys as orderskeysCN
on orderskeysCN.nameref = ordersnamesCN.ID
and orderskeysCN.reference = orders.ID
and orderskeysCN.value = '10094'
JOIN ordersnames as ordersnamesdeliveryPostcode
on ordersnamesdeliveryPostcode.name = 'deliveryPostcode'
LEFT JOIN orderskeys as orderskeysdeliveryPostcode
on orderskeysdeliveryPostcode.nameref = ordersnamesdeliveryPostcode.ID
and orderskeysdeliveryPostcode.reference = orders.ID
ORDER BY deliveryPostCode
limit 0,1000
-> very slow ...
In fact the sorting operation itself is not much different, as all NULL values for column deliveryPostcode would be at the beginning (ASC) or the end (DESC), while the rest of the dataset would have the same order as with JOIN instead of LEFT JOIN.
How can I query (and order) such tables efficiently? Do I need different relations or indices ?
Much obliged ...
With INNER JOINs, to reduce the number of lookups, MySQL is going to start with the table with the fewest rows (see the EXPLAIN result to see which table MySQL starts with).
If you order by anything other than a column in that first table, or there is no index to satisfy the ORDER BY clause on that first table, MySQL is going to have to do a filesort.
The use of a temporary table is much more likely when text columns are involved, and not just an in-memory temporary table, but a dreadful on-disk temporary table.
Use STRAIGHT_JOIN to force the order that MySQL performs inner joins.
I am not sure what logic you intend in some parts of your query, and I think it can still be optimized.
But to resolve the immediate issue, try switching to a RIGHT JOIN for now:
SELECT
orderskeysdeliveryPostcode.value deliveryPostcode,
o.id,
o.CN
FROM orderskeys as orderskeysdeliveryPostcode
INNER JOIN ordersnames as ord_n
on ord_n.id = orderskeysdeliveryPostcode.nameref
AND ord_n.name = 'deliveryPostcode'
RIGHT JOIN (
SELECT
orders.ID,
orderskeysCN.CN
FROM
orders
LEFT JOIN
(SELECT
orderskeys.value as CN,
orderskeys.reference
FROM
orderskeys
INNER JOIN ordersnames as ordersnamesCN
ON ordersnamesCN.id = orderskeys.nameref
AND ordersnamesCN.name = 'CN'
WHERE orderskeys.value = '12209'
) as orderskeysCN
ON
orderskeysCN.reference = orders.ID
limit 0,1000
) as o
on
orderskeysdeliveryPostcode.reference = o.ID
ORDER BY deliveryPostCode;
and here is sqlfiddle we can play with. Just need you to add data inserts there.

What's the most efficient way to structure a 2-dimensional MySQL query?

I have a MySQL database with the following tables and fields:
Student (id)
Class (id)
Grade (id, student_id, class_id, grade)
The student and class tables are indexed on id (primary keys). The grade table is indexed on id (primary key) and student_id, class_id and grade.
I need to construct a query which, given a class ID, gives a list of all other classes and the number of students who scored more in that other class.
Essentially, given the following data in the grades table:
student_id | class_id | grade
--------------------------------------
1 | 1 | 87
1 | 2 | 91
1 | 3 | 75
2 | 1 | 68
2 | 2 | 95
2 | 3 | 84
3 | 1 | 76
3 | 2 | 88
3 | 3 | 71
Querying with class ID 1 should yield:
class_id | total
-------------------
2 | 3
3 | 1
Ideally I'd like this to execute in a few seconds, as I'd like it to be part of a web interface.
The issue I have is that in my database, I have over 1300 classes and 160,000 students. My grade table has almost 15 million rows and as such, the query takes a long time to execute.
Here's what I've tried so far along with the times each query took:
-- I manually stopped execution after 2 hours
SELECT c.id, COUNT(*) AS total
FROM classes c
INNER JOIN grades a ON a.class_id = c.id
INNER JOIN grades b ON b.grade < a.grade AND
a.student_id = b.student_id AND
b.class_id = 1
WHERE c.id != 1
GROUP BY c.id
-- I manually stopped execution after 20 minutes
SELECT c.id,
(
SELECT COUNT(*)
FROM grades g
WHERE g.class_id = c.id AND g.grade > (
SELECT grade
FROM grades
WHERE student_id = g.student_id AND
class_id = 1
)
) AS total
FROM classes c
WHERE c.id != 1;
-- 1 min 12 sec
CREATE TEMPORARY TABLE temp_blah (student_id INT(11) PRIMARY KEY, grade INT);
INSERT INTO temp_blah SELECT student_id, grade FROM grades WHERE class_id = 1;
SELECT c.id,
(
SELECT COUNT(*)
FROM grades g
INNER JOIN temp_blah t ON g.student_id = t.student_id
WHERE g.class_id = c.id AND t.grade < g.grade
) AS total
FROM classes c
WHERE c.id != 1;
-- Same thing but with joins instead of a subquery - 1 min 54 sec
SELECT c.id,
COUNT(*) AS total
FROM classes c
INNER JOIN grades g ON c.id = g.class_id
INNER JOIN temp_blah t ON g.student_id = t.student_id AND t.grade < g.grade
WHERE c.id != 1
GROUP BY c.id;
I also considered creating a 2D table, with students as rows and classes as columns, however I can see two issues with this:
MySQL implements a maximum column count (4096) and maximum row size (in bytes) which may be exceeded by this approach
I can't think of a good way to query that structure to get the results I need
I also considered performing these calculations as background jobs and storing the results somewhere, but for the information to remain current (it must), they would need to be recalculated every time a student, class or grade record was created or updated.
Does anyone know a more efficient way to construct this query?
EDIT: Create table statements:
CREATE TABLE `classes` (
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1331 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci$$
CREATE TABLE `students` (
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=160803 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci$$
CREATE TABLE `grades` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`student_id` int(11) DEFAULT NULL,
`class_id` int(11) DEFAULT NULL,
`grade` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_grades_on_student_id` (`student_id`),
KEY `index_grades_on_class_id` (`class_id`),
KEY `index_grades_on_grade` (`grade`)
) ENGINE=InnoDB AUTO_INCREMENT=15507698 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci$$
Output of explain on the most efficient query (the 1 min 12 sec one):
id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | PRIMARY | c | range | PRIMARY | PRIMARY | 4 | | 683 | Using where; Using index
2 | DEPENDENT SUBQUERY | g | ref | index_grades_on_student_id,index_grades_on_class_id,index_grades_on_grade | index_grades_on_class_id | 5 | mydb.c.id | 830393 | Using where
2 | DEPENDENT SUBQUERY | t | eq_ref | PRIMARY | PRIMARY | 4 | mydb.g.student_id | 1 | Using where
Another edit - explain output for sgeddes suggestion:
+----+-------------+------------+--------+---------------+------+---------+------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+------+---------+------+----------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 14953992 | Using where; Using temporary; Using filesort |
| 2 | DERIVED | <derived3> | system | NULL | NULL | NULL | NULL | 1 | Using filesort |
| 2 | DERIVED | G | ALL | NULL | NULL | NULL | NULL | 15115388 | |
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+----+-------------+------------+--------+---------------+------+---------+------+----------+----------------------------------------------+
I think this should work for you using SUM and CASE:
SELECT C.Id,
SUM(
CASE
WHEN G.Grade > C2.Grade THEN 1 ELSE 0
END
)
FROM Class C
INNER JOIN Grade G ON C.Id = G.Class_Id
LEFT JOIN (
SELECT Grade, Student_Id, Class_Id
FROM Class
JOIN Grade ON Class.Id = Grade.Class_Id
WHERE Class.Id = 1
) C2 ON G.Student_Id = C2.Student_Id
WHERE C.Id <> 1
GROUP BY C.Id
--EDIT--
In response to your comment, here is another attempt that should be much faster:
SELECT
Class_Id,
SUM(CASE WHEN Grade > minGrade THEN 1 ELSE 0 END)
FROM
(
SELECT
Student_Id,
@classToCheck:=
IF(G.Class_Id = 1, Grade, @classToCheck) minGrade ,
Class_Id,
Grade
FROM Grade G
JOIN (SELECT @classToCheck:= 0) t
ORDER BY Student_Id, IF(Class_Id = 1, 0, 1)
) t
WHERE Class_Id <> 1
GROUP BY Class_ID
And more sample fiddle.
Can you give this a try on the original data as well! It is only one join :)
select
final.class_id, count(*) as total
from
(
select * from
(select student_id as p_student_id, grade as p_grade from table1 where class_id = 1) as partial
inner join table1 on table1.student_id = partial.p_student_id
where table1.class_id <> 1 and table1.grade > partial.p_grade
) as final
group by
final.class_id;
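The single self-join can be checked against the sample grades from the question. The sketch below runs the SQL through Python's sqlite3 module rather than MySQL, and the target_class variable is a made-up name; it says nothing about how the query behaves on 15 million rows.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the rows are the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE grades (student_id INTEGER, class_id INTEGER, grade INTEGER);
INSERT INTO grades VALUES
    (1, 1, 87), (1, 2, 91), (1, 3, 75),
    (2, 1, 68), (2, 2, 95), (2, 3, 84),
    (3, 1, 76), (3, 2, 88), (3, 3, 71);
""")

target_class = 1  # hypothetical parameter: the class to compare against
rows = conn.execute("""
    SELECT other.class_id, COUNT(*) AS total
    FROM grades AS base
    JOIN grades AS other
      ON other.student_id = base.student_id   -- same student
     AND other.class_id <> base.class_id      -- a different class
     AND other.grade > base.grade             -- scored more there
    WHERE base.class_id = ?
    GROUP BY other.class_id
    ORDER BY other.class_id
""", (target_class,)).fetchall()

print(rows)  # [(2, 3), (3, 1)]
```

On the real table, a composite index such as (class_id, student_id, grade) might let both sides of the join be read from the index alone, but that is an untested assumption.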

how to group by with a sql subqueries

I can't think clearly at the moment, I want to return counts by station_id, an example of output would be:
station 1 has 3 fb post, 6 linkedin posts, 5 email posts
station 2 has 3 fb post, 6 linkedin posts, 5 email posts
So I need to group by the station id, my table structure is
CREATE TABLE IF NOT EXISTS `posts` (
`post_id` bigint(11) NOT NULL auto_increment,
`station_id` varchar(25) NOT NULL,
`user_id` varchar(25) NOT NULL,
`dated` datetime NOT NULL,
`type` enum('fb','linkedin','email') NOT NULL,
PRIMARY KEY (`post_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=x ;
The query I have so far reports station 0 as having 2 linkedin posts when it has only one (there are 2 in the whole table):
SELECT Station_id, (select count(*) FROM posts WHERE type = 'linkedin') AS linkedin_count, (select count(*) FROM posts WHERE type = 'fb') AS fb_count, (select count(*) FROM posts WHERE type = 'email') AS email_count FROM `posts` GROUP BY station_id;
Or, the fastest way, avoiding joins and subselects to get it in the exact format you want:
SELECT
station_id,
SUM(CASE WHEN type = 'linkedin' THEN 1 ELSE 0 END) AS 'linkedin',
SUM(CASE WHEN type = 'fb' THEN 1 ELSE 0 END) AS 'fb',
SUM(CASE WHEN type = 'email' THEN 1 ELSE 0 END) AS 'email'
FROM posts
GROUP BY station_id;
Outputs:
+------------+----------+------+-------+
| station_id | linkedin | fb | email |
+------------+----------+------+-------+
| 1 | 3 | 2 | 5 |
| 2 | 2 | 0 | 1 |
+------------+----------+------+-------+
You may also want to put an index on there to speed it up
ALTER TABLE posts ADD INDEX (station_id, type);
Explain output:
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
| 1 | SIMPLE | posts | index | NULL | station_id | 28 | NULL | 13 | Using index |
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
As implied by gnif's answer, having three correlated subqueries carries a performance overhead. Depending on the DBMS you're using, it could perform similarly to joining the table to itself three times.
gnif's methodology ensures that the table is only scanned once, without the need for joins, correlated subqueries, etc.
The immediately obvious down-side of gnif's answer is that you don't ever get records for 0's. If there are no fb types, you just don't get a record. If that is not an issue, I'd go with his answer. If it is an issue, however, here is a version with similar methodology to gnif, but matching your output format...
SELECT
station_id,
SUM(CASE WHEN type = 'linkedin' THEN 1 ELSE 0 END) AS linkedin_count,
SUM(CASE WHEN type = 'fb' THEN 1 ELSE 0 END) AS fb_count,
SUM(CASE WHEN type = 'email' THEN 1 ELSE 0 END) AS email_count
FROM
posts
GROUP BY
station_id
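The two output shapes can be compared side by side on a small data set. This sketch uses Python's sqlite3 module in place of MySQL and invents 13 rows that match the counts in the example result tables in this thread.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; user_id and dated are left out for brevity.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (post_id INTEGER PRIMARY KEY, station_id TEXT, type TEXT);
INSERT INTO posts (station_id, type) VALUES
    ('1', 'fb'), ('1', 'fb'),
    ('1', 'linkedin'), ('1', 'linkedin'), ('1', 'linkedin'),
    ('1', 'email'), ('1', 'email'), ('1', 'email'), ('1', 'email'), ('1', 'email'),
    ('2', 'linkedin'), ('2', 'linkedin'),
    ('2', 'email');
""")

# One row per station, one column per type; missing types show up as 0.
pivot = conn.execute("""
    SELECT station_id,
           SUM(CASE WHEN type = 'linkedin' THEN 1 ELSE 0 END) AS linkedin_count,
           SUM(CASE WHEN type = 'fb'       THEN 1 ELSE 0 END) AS fb_count,
           SUM(CASE WHEN type = 'email'    THEN 1 ELSE 0 END) AS email_count
    FROM posts
    GROUP BY station_id
    ORDER BY station_id
""").fetchall()

# One row per (station, type) that actually occurs; no row for station 2 / fb.
long_form = conn.execute("""
    SELECT station_id, type, COUNT(*)
    FROM posts
    GROUP BY station_id, type
    ORDER BY station_id, type
""").fetchall()

print(pivot)      # [('1', 3, 2, 5), ('2', 2, 0, 1)]
print(long_form)  # [('1', 'email', 5), ('1', 'fb', 2), ('1', 'linkedin', 3),
                  #  ('2', 'email', 1), ('2', 'linkedin', 2)]
```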
Give this a go:
SELECT station_id, type, count(*) FROM posts GROUP BY station_id, type
The output format will be a little different from what you're attempting to get, but it should provide the statistics you're trying to retrieve. Also, since it's a single query, it is much faster.
-- Edit, added example result set
+------------+----------+----------+
| station_id | type | count(*) |
+------------+----------+----------+
| 1 | fb | 2 |
| 1 | linkedin | 3 |
| 1 | email | 5 |
| 2 | linkedin | 2 |
| 2 | email | 1 |
+------------+----------+----------+
try this:
SELECT p.Station_id,
(select count(*) FROM posts WHERE type = 'linkedin' and station_id=p.station_id) AS linkedin_count,
(select count(*) FROM posts WHERE type = 'fb' and station_id=p.station_id) AS fb_count,
(select count(*) FROM posts WHERE type = 'email' and station_id=p.station_id) AS email_count
FROM `posts` p GROUP BY station_id

Improve SQL query performance

I have three tables where I store actual person data (person), teams (team) and entries (athlete). The schema of the three tables is:
In each team there might be two or more athletes.
I'm trying to create a query to produce the most frequent pairs, meaning people who play in teams of two. I came up with the following query:
SELECT p1.surname, p1.name, p2.surname, p2.name, COUNT(*) AS freq
FROM person p1, athlete a1, person p2, athlete a2
WHERE
p1.id = a1.person_id AND
p2.id = a2.person_id AND
a1.team_id = a2.team_id AND
a1.team_id IN
( SELECT team.id
FROM team, athlete
WHERE team.id = athlete.team_id
GROUP BY team.id
HAVING COUNT(*) = 2 )
GROUP BY p1.id
ORDER BY freq DESC
Obviously this is a resource consuming query. Is there a way to improve it?
SELECT id
FROM team, athlete
WHERE team.id = athlete.team_id
GROUP BY team.id
HAVING COUNT(*) = 2
Performance Tip 1: You only need the athlete table here.
You might consider the following approach which uses triggers to maintain counters in your team and person tables so you can easily find out which teams have 2 or more athletes and which persons are in 2 or more teams.
(note: I've removed the surrogate id key from your athlete table in favour of a composite key which will better enforce data integrity. I've also renamed athlete to team_athlete)
drop table if exists person;
create table person
(
person_id int unsigned not null auto_increment primary key,
name varchar(255) not null,
team_count smallint unsigned not null default 0
)
engine=innodb;
drop table if exists team;
create table team
(
team_id int unsigned not null auto_increment primary key,
name varchar(255) not null,
athlete_count smallint unsigned not null default 0,
key (athlete_count)
)
engine=innodb;
drop table if exists team_athlete;
create table team_athlete
(
team_id int unsigned not null,
person_id int unsigned not null,
primary key (team_id, person_id), -- note clustered composite PK
key person(person_id) -- added index
)
engine=innodb;
delimiter #
create trigger team_athlete_after_ins_trig after insert on team_athlete
for each row
begin
update team set athlete_count = athlete_count+1 where team_id = new.team_id;
update person set team_count = team_count+1 where person_id = new.person_id;
end#
delimiter ;
insert into person (name) values ('p1'),('p2'),('p3'),('p4'),('p5');
insert into team (name) values ('t1'),('t2'),('t3'),('t4');
insert into team_athlete (team_id, person_id) values
(1,1),(1,2),(1,3),
(2,3),(2,4),
(3,1),(3,5);
select * from team_athlete;
select * from person;
select * from team;
select * from team where athlete_count >= 2;
select * from person where team_count >= 2;
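The counter-maintenance idea can be exercised end to end in a small sketch. It uses Python's sqlite3 module in place of MySQL, so the trigger syntax is SQLite's rather than the delimiter-based form above. The AFTER DELETE trigger is an addition that the script above does not contain; it is included on the assumption that the counters should stay correct when an athlete leaves a team.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the delete trigger is a hypothetical addition.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                     team_count INTEGER NOT NULL DEFAULT 0);
CREATE TABLE team (team_id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                   athlete_count INTEGER NOT NULL DEFAULT 0);
CREATE TABLE team_athlete (team_id INTEGER NOT NULL, person_id INTEGER NOT NULL,
                           PRIMARY KEY (team_id, person_id));

CREATE TRIGGER team_athlete_after_ins AFTER INSERT ON team_athlete
FOR EACH ROW
BEGIN
    UPDATE team SET athlete_count = athlete_count + 1 WHERE team_id = NEW.team_id;
    UPDATE person SET team_count = team_count + 1 WHERE person_id = NEW.person_id;
END;

CREATE TRIGGER team_athlete_after_del AFTER DELETE ON team_athlete
FOR EACH ROW
BEGIN
    UPDATE team SET athlete_count = athlete_count - 1 WHERE team_id = OLD.team_id;
    UPDATE person SET team_count = team_count - 1 WHERE person_id = OLD.person_id;
END;

INSERT INTO person (name) VALUES ('p1'), ('p2'), ('p3'), ('p4'), ('p5');
INSERT INTO team (name) VALUES ('t1'), ('t2'), ('t3'), ('t4');
INSERT INTO team_athlete (team_id, person_id) VALUES
    (1, 1), (1, 2), (1, 3),
    (2, 3), (2, 4),
    (3, 1), (3, 5);
""")

teams = conn.execute(
    "SELECT team_id, athlete_count FROM team ORDER BY team_id").fetchall()
busy_people = conn.execute(
    "SELECT person_id FROM person WHERE team_count >= 2 ORDER BY person_id").fetchall()
print(teams)        # [(1, 3), (2, 2), (3, 2), (4, 0)]
print(busy_people)  # [(1,), (3,)]

# Removing person 3 from team 1 decrements both counters.
conn.execute("DELETE FROM team_athlete WHERE team_id = 1 AND person_id = 3")
team1 = conn.execute("SELECT athlete_count FROM team WHERE team_id = 1").fetchone()
person3 = conn.execute("SELECT team_count FROM person WHERE person_id = 3").fetchone()
print(team1, person3)  # (2,) (1,)
```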
EDIT
Added the following as initially misunderstood question:
Create a view which only includes teams of 2 persons.
drop view if exists teams_with_2_players_view;
create view teams_with_2_players_view as
select
t.team_id,
ta.person_id,
p.name as person_name
from
team t
inner join team_athlete ta on t.team_id = ta.team_id
inner join person p on ta.person_id = p.person_id
where
t.athlete_count = 2;
Now use the view to find the most frequently occurring person pairs.
select
p1.person_id as p1_person_id,
p1.person_name as p1_person_name,
p2.person_id as p2_person_id,
p2.person_name as p2_person_name,
count(*) as counter
from
teams_with_2_players_view p1
inner join teams_with_2_players_view p2 on
p2.team_id = p1.team_id and p2.person_id > p1.person_id
group by
p1.person_id, p2.person_id
order by
counter desc;
Hope this helps :)
EDIT 2 checking performance
select count(*) as counter from person;
+---------+
| counter |
+---------+
| 10000 |
+---------+
1 row in set (0.00 sec)
select count(*) as counter from team;
+---------+
| counter |
+---------+
| 450000 |
+---------+
1 row in set (0.08 sec)
select count(*) as counter from team where athlete_count = 2;
+---------+
| counter |
+---------+
| 112644 |
+---------+
1 row in set (0.03 sec)
select count(*) as counter from team_athlete;
+---------+
| counter |
+---------+
| 1124772 |
+---------+
1 row in set (0.21 sec)
explain
select
p1.person_id as p1_person_id,
p1.person_name as p1_person_name,
p2.person_id as p2_person_id,
p2.person_name as p2_person_name,
count(*) as counter
from
teams_with_2_players_view p1
inner join teams_with_2_players_view p2 on
p2.team_id = p1.team_id and p2.person_id > p1.person_id
group by
p1.person_id, p2.person_id
order by
counter desc
limit 10;
+----+-------------+-------+--------+---------------------+-------------+---------+---------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------+-------------+---------+---------------------+-------+----------------------------------------------+
| 1 | SIMPLE | t | ref | PRIMARY,t_count_idx | t_count_idx | 2 | const | 86588 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | t | eq_ref | PRIMARY,t_count_idx | PRIMARY | 4 | foo_db.t.team_id | 1 | Using where |
| 1 | SIMPLE | ta | ref | PRIMARY,person | PRIMARY | 4 | foo_db.t.team_id | 1 | Using index |
| 1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 4 | foo_db.ta.person_id | 1 | |
| 1 | SIMPLE | ta | ref | PRIMARY,person | PRIMARY | 4 | foo_db.t.team_id | 1 | Using where; Using index |
| 1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 4 | foo_db.ta.person_id | 1 | |
+----+-------------+-------+--------+---------------------+-------------+---------+---------------------+-------+----------------------------------------------+
6 rows in set (0.00 sec)
select
p1.person_id as p1_person_id,
p1.person_name as p1_person_name,
p2.person_id as p2_person_id,
p2.person_name as p2_person_name,
count(*) as counter
from
teams_with_2_players_view p1
inner join teams_with_2_players_view p2 on
p2.team_id = p1.team_id and p2.person_id > p1.person_id
group by
p1.person_id, p2.person_id
order by
counter desc
limit 10;
+--------------+----------------+--------------+----------------+---------+
| p1_person_id | p1_person_name | p2_person_id | p2_person_name | counter |
+--------------+----------------+--------------+----------------+---------+
| 221 | person 221 | 739 | person 739 | 5 |
| 129 | person 129 | 249 | person 249 | 5 |
| 874 | person 874 | 877 | person 877 | 4 |
| 717 | person 717 | 949 | person 949 | 4 |
| 395 | person 395 | 976 | person 976 | 4 |
| 415 | person 415 | 828 | person 828 | 4 |
| 287 | person 287 | 470 | person 470 | 4 |
| 455 | person 455 | 860 | person 860 | 4 |
| 13 | person 13 | 29 | person 29 | 4 |
| 1 | person 1 | 743 | person 743 | 4 |
+--------------+----------------+--------------+----------------+---------+
10 rows in set (2.02 sec)
Should there be an additional constraint a1.person_id != a2.person_id, to avoid creating a pair with the same player? This may not affect the final ordering of the results but will affect the accuracy of the count.
If possible you can add a column called athlete_count (with an index) in the team table which can be updated whenever a player is added or removed to a team and this can avoid the subquery which needs to go through the entire athlete table for finding the two player teams.
UPDATE1:
Also, if I am understanding the original query correctly, when you group by p1.id you only get the number of times a player played in a two player team and not the count of the pair itself. You may have to Group BY p1.id, p2.id.
REVISION BASED on EXACTLY TWO PER TEAM
By the inner-most pre-aggregate of exactly TWO people, I can get each team with personA and PersonB to a single row per team using MIN() and MAX(). This way, the person's IDs will always be in low-high pair setup to be compared for future teams. Then, I can query the COUNT by the common Mate1 and Mate2 across ALL teams and directly get their Names.
SELECT STRAIGHT_JOIN
p1.surname,
p1.name,
p2.surname,
p2.name,
TeamAggregates.CommonTeams
from
( select PreQueryTeams.Mate1,
PreQueryTeams.Mate2,
count(*) CommonTeams
from
( SELECT team_id,
min( person_id ) mate1,
max( person_id ) mate2
FROM
athlete
group by
team_id
having count(*) = 2 ) PreQueryTeams
group by
PreQueryTeams.Mate1,
PreQueryTeams.Mate2 ) TeamAggregates,
person p1,
person p2
where
TeamAggregates.Mate1 = p1.id
and TeamAggregates.Mate2 = p2.id
order by
TeamAggregates.CommonTeams DESC
ORIGINAL ANSWER FOR TEAMS WITH ANY NUMBER OF TEAMMATES
I would do it as follows. The inner prequery first joins all possible combinations of people on each individual team; requiring person1 < person2 eliminates pairing a person with themselves and also prevents the reversed duplicate of each pair. For example:
athlete person team
1 1 1
2 2 1
3 3 1
4 4 1
5 1 2
6 3 2
7 4 2
8 1 3
9 4 3
So, from team 1 you would get person pairs of
1,2 1,3 1,4 2,3 2,4 3,4
and NOT get reversed duplicates such as
2,1 3,1 4,1 3,2 4,2 4,3
nor same person
1,1 2,2 3,3 4,4
Then from team 2, you would have pairs of
1,3 1,4 3,4
Finally in team 3 the single pair of
1,4
thus teammates 1,4 have occurred in 3 common teams.
SELECT STRAIGHT_JOIN
p1.surname,
p1.name,
p2.surname,
p2.name,
PreQuery.CommonTeams
from
( select
a1.Person_ID Person_ID1,
a2.Person_ID Person_ID2,
count(*) CommonTeams
from
athlete a1,
athlete a2
where
a1.Team_ID = a2.Team_ID
and a1.Person_ID < a2.Person_ID
group by
1, 2
having CommonTeams > 1 ) PreQuery,
person p1,
person p2
where
PreQuery.Person_ID1 = p1.id
and PreQuery.Person_ID2 = p2.id
order by
PreQuery.CommonTeams DESC
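Both pairing strategies from this answer can be checked against the nine sample athlete rows listed above. This is a sketch using Python's sqlite3 module in place of MySQL; it leaves out the person table and the name lookup and only computes the pair counts.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; rows are the athlete/person/team sample above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE athlete (id INTEGER PRIMARY KEY, person_id INTEGER, team_id INTEGER);
INSERT INTO athlete (id, person_id, team_id) VALUES
    (1, 1, 1), (2, 2, 1), (3, 3, 1), (4, 4, 1),
    (5, 1, 2), (6, 3, 2), (7, 4, 2),
    (8, 1, 3), (9, 4, 3);
""")

# Teams of any size: pair every two teammates, lower person_id first.
any_size = conn.execute("""
    SELECT a1.person_id, a2.person_id, COUNT(*) AS common_teams
    FROM athlete a1
    JOIN athlete a2
      ON a1.team_id = a2.team_id
     AND a1.person_id < a2.person_id
    GROUP BY a1.person_id, a2.person_id
    HAVING COUNT(*) > 1
    ORDER BY common_teams DESC, a1.person_id, a2.person_id
""").fetchall()

# Teams of exactly two: MIN/MAX collapses each team into one (low, high) pair.
exactly_two = conn.execute("""
    SELECT mate1, mate2, COUNT(*) AS common_teams
    FROM (SELECT team_id,
                 MIN(person_id) AS mate1,
                 MAX(person_id) AS mate2
          FROM athlete
          GROUP BY team_id
          HAVING COUNT(*) = 2) AS two_person_teams
    GROUP BY mate1, mate2
    ORDER BY common_teams DESC
""").fetchall()

print(any_size)     # [(1, 4, 3), (1, 3, 2), (3, 4, 2)]
print(exactly_two)  # [(1, 4, 1)]
```

Only team 3 has exactly two members, so the second query sees the pair (1, 4) once, while the first counts it in all three teams.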
Here are some tips for improving SELECT query performance (these apply to SQL Server):
Use SET NOCOUNT ON; it reduces network traffic and so improves performance.
Use fully qualified object names (e.g. database.schema.objectname).
Use sp_executesql instead of EXECUTE for dynamic queries.
Don't use SELECT *; list the columns (SELECT column1, column2, ...) in IF EXISTS or SELECT operations.
Avoid naming user stored procedures like sp_procedureName: when a stored procedure name starts with sp_, SQL Server searches the master database first, which can slow the call down.