Finding collaborations in MySQL - mysql

I have a database of people and projects. How can I find the names of people who collaborated with a given person, and on how many projects?
For example, I want to find the collaborators of Jimmy from the database:
+----------+--------+
| project | person |
+----------+--------+
| datamax | Jimmy |
| datamax | Ashley |
| datamax | Martin |
| cocoplus | Jimmy |
| cocoplus | Ashley |
| glassbox | Jimmy |
| glassbox | Martin |
| powerbin | Jimmy |
| powerbin | Ashley |
+----------+--------+
The result would look something like this:
Jimmy's collaborations:
+--------+----------------+
| person | collaborations |
+--------+----------------+
| Ashley | 3 |
| Martin | 2 |
+--------+----------------+

Join the table with itself, group by the person field:
SELECT u2.person, COUNT(u1.project) AS collaborations
FROM users u1
JOIN users u2 ON u2.project = u1.project
WHERE u1.person != u2.person AND u1.person = 'Jimmy'
GROUP BY u2.person;
The query selects the projects in which Jimmy participated from u1. The join condition then restricts the rows from u2 to those same projects. Rows where both sides refer to the same person (Jimmy paired with himself) are removed by the WHERE clause. Finally, the result set is grouped by person, and the COUNT function returns the number of rows per group, that is, the number of shared projects.
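If you want to try the self-join without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module. SQLite standing in for MySQL, the collaborators helper, and the extra ORDER BY (added only to make the output order deterministic) are my assumptions, not part of the original query.

```python
import sqlite3

# Sample rows from the question: (project, person).
rows = [
    ("datamax", "Jimmy"), ("datamax", "Ashley"), ("datamax", "Martin"),
    ("cocoplus", "Jimmy"), ("cocoplus", "Ashley"),
    ("glassbox", "Jimmy"), ("glassbox", "Martin"),
    ("powerbin", "Jimmy"), ("powerbin", "Ashley"),
]


def collaborators(conn, person):
    """Return (collaborator, shared_project_count) pairs for `person`."""
    query = """
        SELECT u2.person, COUNT(u1.project) AS collaborations
        FROM users u1
        JOIN users u2 ON u2.project = u1.project
        WHERE u1.person != u2.person AND u1.person = ?
        GROUP BY u2.person
        ORDER BY collaborations DESC, u2.person
    """
    return conn.execute(query, (person,)).fetchall()


# In-memory SQLite database used here instead of MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (project TEXT NOT NULL, person TEXT NOT NULL)")
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)

result = collaborators(conn, "Jimmy")
print(result)  # [('Ashley', 3), ('Martin', 2)]
```

The person is passed as a bound parameter rather than concatenated into the SQL string, which is how you would normally call such a query from application code.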
Performance
Note that an index on the person and project columns (either one composite index or two separate indexes) will significantly improve the performance of the query above. The exact index configuration depends on the table structure. For a table with two varchar fields for person and project, I think the following prefix indexes are enough:
ALTER TABLE users ADD INDEX `project` (`project`(10));
ALTER TABLE users ADD INDEX `person` (`person`(10));
Normalization
However, I would rather store persons and projects in separate tables with their numeric IDs. A third table could play the role of connector: person_id - project_id. In other words, I recommend normalization. With normalized tables, you will not need to build bloated indexes for the text fields.
Normalized tables may look as follows:
CREATE TABLE users (
id int unsigned NOT NULL AUTO_INCREMENT,
name varchar(200) NOT NULL DEFAULT '',
PRIMARY KEY(`id`),
-- This index is needed, if you want to fetch users by names
INDEX name (name(8))
);
CREATE TABLE projects (
id int unsigned NOT NULL AUTO_INCREMENT,
name varchar(100) NOT NULL DEFAULT '',
PRIMARY KEY(`id`)
);
CREATE TABLE collaborations (
project_id int unsigned NOT NULL DEFAULT 0,
user_id int unsigned NOT NULL DEFAULT 0,
PRIMARY KEY(`project_id`, `user_id`)
);
The query for the normalized structures will look a little bit more complex:
-- In practice, the user ID is retrieved from the calling process
-- (such as POST/GET HTTP requests, for instance).
SET @user_id := (SELECT id FROM users WHERE name LIKE 'Jimmy');
SELECT u.name person, COUNT(p.id) collaborations
FROM collaborations c
JOIN collaborations c2 USING(project_id)
JOIN users u ON u.id = c2.user_id
JOIN projects p ON p.id = c2.project_id
WHERE c.user_id = @user_id AND c.user_id != c2.user_id
GROUP BY c2.user_id;
But it will be fast, and the space required for the indexes will be significantly smaller, especially for large data sets.
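For anyone who wants to check the normalized version quickly, here is a rough sketch with Python's sqlite3 module. The explicit ids, the collaborators_by_id helper, the dropped join to projects, and the use of SQLite instead of MySQL are my assumptions for illustration; the three-table layout follows the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE collaborations (
        project_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL,
        PRIMARY KEY (project_id, user_id)
    );
    -- Same data as the question, with ids chosen for illustration.
    INSERT INTO users (id, name) VALUES (1, 'Jimmy'), (2, 'Ashley'), (3, 'Martin');
    INSERT INTO projects (id, name) VALUES
        (1, 'datamax'), (2, 'cocoplus'), (3, 'glassbox'), (4, 'powerbin');
    INSERT INTO collaborations (project_id, user_id) VALUES
        (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1), (3, 3), (4, 1), (4, 2);
""")


def collaborators_by_id(conn, user_id):
    """Count shared projects per collaborator using only integer keys."""
    query = """
        SELECT u.name AS person, COUNT(*) AS collaborations
        FROM collaborations c
        JOIN collaborations c2 ON c2.project_id = c.project_id
        JOIN users u ON u.id = c2.user_id
        WHERE c.user_id = ? AND c2.user_id != c.user_id
        GROUP BY c2.user_id, u.name
        ORDER BY collaborations DESC, u.name
    """
    return conn.execute(query, (user_id,)).fetchall()


# Look the id up once; in practice it would come from the calling process.
jimmy_id = conn.execute(
    "SELECT id FROM users WHERE name = ?", ("Jimmy",)
).fetchone()[0]
normalized_result = collaborators_by_id(conn, jimmy_id)
print(normalized_result)  # [('Ashley', 3), ('Martin', 2)]
```

The sketch leaves out the join to projects because the count only needs the connector table; that is a simplification on my side.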
Original answer
To fetch the total number of projects for each person, use the COUNT function with a GROUP BY clause:
SELECT person, COUNT(*) AS collaborations
FROM users
GROUP BY person;

Related

How to get one extra record for LEFT JOIN to represent a record not include on the left joined table

I have a database with two tables. One table (shops) has an admin user column, and the other (users) holds users with fewer privileges. I plan to LEFT JOIN the users table. When I retrieve the data, the record for the admin user must be on a separate row with NULL values for the columns of the left-joined table, followed by the records of the less-privileged users (the records of the left-joined table), if any. I am using MySQL.
I have looked into UNION, but I don't think it can help. Please see below the results I need.
Thank you.
SELECT *
FROM shops LEFT JOIN users USING(shop_id)
WHERE shop_id = 1 AND (admin_id = 1 OR user_id = 1);
+---------+----------+---------+
| shop_id | admin_id | user_id |
+---------+----------+---------+
| 1 | 1 | NULL | <-- Need this one extra record
| 1 | 1 | 1 |
| 1 | 1 | 2 |
| 1 | 1 | 3 |
+---------+----------+---------+
Here is an example structure of the databases and some sample data:
CREATE SCHEMA test DEFAULT CHARACTER SET utf8 ;
USE test;
CREATE TABLE admin(
admin_id INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY(admin_id)
);
CREATE TABLE shops(
shop_id INT NOT NULL AUTO_INCREMENT,
admin_id INT NOT NULL,
PRIMARY KEY(shop_id),
CONSTRAINT fk_shop_admin FOREIGN KEY(admin_id) REFERENCES admin (admin_id)
);
CREATE TABLE users(
user_id INT NOT NULL AUTO_INCREMENT,
shop_id INT NOT NULL,
PRIMARY KEY(user_id),
CONSTRAINT fk_user_shop FOREIGN KEY(shop_id) REFERENCES shops (shop_id)
);
-- Sample data
INSERT INTO admin() VALUES ();
INSERT INTO shops(admin_id) VALUES (1);
INSERT INTO users(shop_id) VALUES (1),(1),(1);
I think you need union all:
select s.shop_id, s.admin_id, null as user_id
from shops s
where s.shop_id = 1
union all
select s.shop_id, s.admin_id, u.user_id
from shops s join
users u
on s.shop_id = u.shop_id
where s.shop_id = 1;
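Here is a small runnable sketch of the UNION ALL idea, using Python's sqlite3 module in place of MySQL (an assumption on my part). The shop_rows helper and the trailing ORDER BY, which puts the NULL row first, are additions for illustration; the admin table is left out because the query does not touch it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript("""
    CREATE TABLE shops (shop_id INTEGER PRIMARY KEY, admin_id INTEGER NOT NULL);
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, shop_id INTEGER NOT NULL);
    -- Sample data from the question.
    INSERT INTO shops (shop_id, admin_id) VALUES (1, 1);
    INSERT INTO users (user_id, shop_id) VALUES (1, 1), (2, 1), (3, 1);
""")


def shop_rows(conn, shop_id):
    """One admin-only row (user_id NULL) followed by one row per user."""
    query = """
        SELECT s.shop_id, s.admin_id, NULL AS user_id
        FROM shops s
        WHERE s.shop_id = :shop_id
        UNION ALL
        SELECT s.shop_id, s.admin_id, u.user_id
        FROM shops s
        JOIN users u ON s.shop_id = u.shop_id
        WHERE s.shop_id = :shop_id
        ORDER BY user_id
    """
    return conn.execute(query, {"shop_id": shop_id}).fetchall()


extra_row_result = shop_rows(conn, 1)
print(extra_row_result)  # [(1, 1, None), (1, 1, 1), (1, 1, 2), (1, 1, 3)]
```

NULL sorts before other values in ascending order in both SQLite and MySQL, so the admin-only row comes out first.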
Put your WHERE condition in the ON clause:
SELECT *
FROM shops LEFT JOIN users on shops.shop_id=users.shop_id and (admin_id = 1 OR user_id = 1)
WHERE shops.shop_id = 1

Two LEFT JOINs in SQL does not preserve data

This query:
SELECT contacts.name, accounts.account
FROM contacts
LEFT JOIN deals
ON contacts.id = deals.contact_id
LEFT JOIN
accounts ON accounts.deal_id = deals.id;
returns:
+------+-------------------+
| name | account |
+------+-------------------+
| Bob | fun deal account |
| Bob | NULL |
| John | NULL |
+------+-------------------+
But I expected:
+------+-------------------+
| name | account |
+------+-------------------+
| Bob | fun deal account |
| Bob | fun deal account |
| John | NULL |
+------+-------------------+
The first LEFT JOIN behaves correctly. Since there are two deals for Bob, Bob correctly shows up twice in the result set. But the second LEFT JOIN does not behave as I expected: the account should have been carried over to both Bob records, but instead there is a NULL for the second Bob.
The schema:
CREATE TABLE contacts(
id int AUTO_INCREMENT,
name VARCHAR(50),
Primary Key(id)
);
INSERT INTO contacts(name) VALUES('Bob');
INSERT INTO contacts(name) VALUES('John');
CREATE TABLE deals(
id int AUTO_INCREMENT,
name VARCHAR(20),
contact_id int,
FOREIGN KEY(contact_id) REFERENCES contacts(id),
Primary Key(id)
);
INSERT INTO deals(name, contact_id) VALUES('cool deal',1);
INSERT INTO deals(name, contact_id) VALUES('another cool deal',1);
CREATE TABLE accounts(
id int AUTO_INCREMENT,
account VARCHAR(50),
deal_id int,
FOREIGN KEY(deal_id) REFERENCES deals(id),
PRIMARY KEY (id)
);
INSERT INTO accounts(account, deal_id) VALUES('fun deal account', 1);
Why doesn't the second LEFT JOIN give desired behavior and how can I get the 'fun deal account' account to show up for both Bobs?
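Bob has two deals, but deals.id is AUTO_INCREMENT, so the two deals get ids 1 and 2. The only account row has deal_id = 1, so 'fun deal account' matches only the first row in the deals table, 'cool deal'.
For it to show up for both deals, you also need to add INSERT INTO accounts(account, deal_id) VALUES('fun deal account', 2);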
In case of doubts, decompose your query.
The first LEFT JOIN could be this:
SELECT contacts.id as contact_id, contacts.name, deals.id as deals_id, deals.name
FROM contacts
LEFT JOIN deals ON contacts.id = deals.contact_id
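Which results in:
+------------+------+----------+-------------------+
| contact_id | name | deals_id | name              |
+------------+------+----------+-------------------+
| 1          | Bob  | 1        | cool deal         |
| 1          | Bob  | 2        | another cool deal |
| 2          | John | NULL     | NULL              |
+------------+------+----------+-------------------+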
The second LEFT JOIN is:
LEFT JOIN accounts ON accounts.deal_id = deals.id
So the result is logical given your data: you have only one account, with deal_id = 1, so it matches only the row where deals.id = 1 ("cool deal").
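The decomposition above can be replayed step by step. This is a minimal sketch with Python's sqlite3 module standing in for MySQL (my assumption); the ORDER BY clauses and the variable names first_join and both_joins are only there to make the comparison easy to read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript("""
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE deals (
        id INTEGER PRIMARY KEY,
        name TEXT,
        contact_id INTEGER REFERENCES contacts(id)
    );
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        account TEXT,
        deal_id INTEGER REFERENCES deals(id)
    );
    -- Sample data from the question; ids are assigned 1, 2, ... in order.
    INSERT INTO contacts (name) VALUES ('Bob'), ('John');
    INSERT INTO deals (name, contact_id)
        VALUES ('cool deal', 1), ('another cool deal', 1);
    INSERT INTO accounts (account, deal_id) VALUES ('fun deal account', 1);
""")

# Step 1: only the first LEFT JOIN.
first_join = conn.execute("""
    SELECT contacts.name, deals.id, deals.name
    FROM contacts
    LEFT JOIN deals ON contacts.id = deals.contact_id
    ORDER BY contacts.id, deals.id
""").fetchall()

# Step 2: add the second LEFT JOIN on accounts.deal_id = deals.id.
both_joins = conn.execute("""
    SELECT contacts.name, deals.id, accounts.account
    FROM contacts
    LEFT JOIN deals ON contacts.id = deals.contact_id
    LEFT JOIN accounts ON accounts.deal_id = deals.id
    ORDER BY contacts.id, deals.id
""").fetchall()

for row in both_joins:
    print(row)
```

Selecting deals.id next to the account makes it visible that the NULL belongs to deal 2, which has no account row.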
I think the mistake is in the last part of your query. The query you wanted is:
SELECT contacts.name, accounts.account
FROM contacts
LEFT JOIN deals ON contacts.id = deals.contact_id
LEFT JOIN accounts ON accounts.deal_id = deals.contact_id
Using "accounts.deal_id = deals.contact_id" instead of "accounts.deal_id = deals.id" is the deal (pun intended) that gives your expected result.

select statement with only rows which have set true in second table

I have two tables:
activity
id | user_id | time | activity_id
1 | 1 | | 3
2 | 1 | | 1
and preferences
user_id | running | cycling | driving
1 | TRUE | FALSE | FALSE
I need this result set:
id | user_id | time |
2 | 1 | |
I only need rows from the first table whose activity is set to TRUE in the preferences table.
For example, the activity_id for running is 1, which is set to TRUE in the preferences table, so that row is returned while the others are not.
If you can edit the schema, it would be better like this:
activity
id | name
1 | running
2 | cycling
3 | driving
user_activity
id | user_id | time | activity_id
1 | 1 | | 3
2 | 1 | | 1
preferences
user_id | activity_id
1 | 1
A row in preferences indicates a TRUE value from your schema. No row indicates a FALSE.
Then your query would simply be:
SELECT ua.id, ua.user_id, ua.time
FROM user_activity ua
JOIN preferences p ON ua.user_id = p.user_id
AND ua.activity_id = p.activity_id
If you want to see the activity name in the results:
SELECT ua.id, ua.user_id, ua.time, activity.name
FROM user_activity ua
JOIN preferences p ON ua.user_id = p.user_id
AND ua.activity_id = p.activity_id
JOIN activity ON ua.activity_id = activity.id
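To see the suggested schema in action, here is a small sketch using Python's sqlite3 module. SQLite instead of MySQL and the variable name preferred are my assumptions; the tables and rows mirror the example above, with the empty time values stored as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript("""
    CREATE TABLE activity (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE user_activity (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        time TEXT,
        activity_id INTEGER NOT NULL REFERENCES activity(id)
    );
    -- A row here means TRUE for that user and activity; no row means FALSE.
    CREATE TABLE preferences (
        user_id INTEGER NOT NULL,
        activity_id INTEGER NOT NULL REFERENCES activity(id),
        PRIMARY KEY (user_id, activity_id)
    );
    INSERT INTO activity (id, name)
        VALUES (1, 'running'), (2, 'cycling'), (3, 'driving');
    INSERT INTO user_activity (id, user_id, time, activity_id)
        VALUES (1, 1, NULL, 3), (2, 1, NULL, 1);
    INSERT INTO preferences (user_id, activity_id) VALUES (1, 1);
""")

# Keep only the activities that the user has marked as preferred.
preferred = conn.execute("""
    SELECT ua.id, ua.user_id, ua.time, activity.name
    FROM user_activity ua
    JOIN preferences p ON ua.user_id = p.user_id
                      AND ua.activity_id = p.activity_id
    JOIN activity ON ua.activity_id = activity.id
    ORDER BY ua.id
""").fetchall()
print(preferred)  # [(2, 1, None, 'running')]
```

Adding a preference is then a plain INSERT into preferences rather than a schema change for every new activity.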
What I would probably do is join the tables on a common column (user_id looks like the common column in this case). That gives you access to the columns of both tables in the WHERE clause of the query.
Which type of join to use depends on what information you want from preferences.
Handy Visual Guide for joins
So you could query:
SELECT *
FROM activity
LEFT JOIN preferences ON activity.user_id = preferences.user_id
WHERE preferences.columnIWantToBeTrue = true
I'm using a LEFT JOIN since you mentioned you want the values from the first table based on the second table.
Mike B has the right answer. The relational model relates rows together by common values.
You've got a table named activity with an id column which looks like the primary key. The column name activity_id would typically be the name of a column in another table that is a foreign key to the activity table, referencing activity.id.
It looks like you've used the activity_id column in the activity table as a reference to either "running", "cycling" or "driving".
It's possible to match activity.activity_id = 1 with "running", but this is a bizarre design.
Here's an example query:
SELECT a.id
, a.user_id
, a.time
FROM activity a
JOIN preferences p
ON p.user_id = a.user_id
AND ( ( p.running = 'TRUE' AND a.activity_id = 1 )
OR ( p.cycling = 'TRUE' AND a.activity_id = 2 )
OR ( p.driving = 'TRUE' AND a.activity_id = 3 )
)
But, again, this is a bizarre design.
As a start, each table in your database should have rows that represent either an entity (a person, place, thing, concept or event that can be uniquely identified, is important, and we need to store information about), or a relationship between the entities.
From the limited information we have about your use case, the entities appear to be "user", an "activity_type" (running, cycling, driving), an "activity" (an amount of time, for a user and an activity_type) and some user "preference" about which activity_types the user prefers.
See the answer from Mark B for a possible schema design.

Should I redesign my tables or can I make this work?

Right now I'm working on expanding my website to new functionality. I want to enable notifications from different sources. Similar to groups and people on facebook. Here is my table layout right now.
course_updates
id | CRN (id of course) | update_id
------------------------------------
courses
id | course_name | course_subject | course_number
-------------------------------------------------
users
id | name | facebook_name
---------------------------------------------------
user_updates
id | user_id | update_id
------------------------
updates
id | timestamp | updateObj
---------------------------
What I would like to do is take course_updates and user_updates in one query and join them with updates, along with the related information from the other tables. For course_updates I would want course_name, course_subject, etc., and for user_updates I would want the user name and Facebook name. This honestly probably belongs in two separate queries, but I would like to order everything by the timestamp of the updates table, and I feel like sorting everything in PHP would be inefficient. What is the best way to do this? If I were to use something like a UNION, I would need a way to distinguish between notification types, because user_updates and course_updates can both store a reference to the same row in updates. Any ideas?
You might not need the updates table at all. You can add timestamp columns to the course_updates and user_updates tables:
CREATE TABLE course_updates
(
`id` int,
`CRN` int,
`timestamp` datetime -- or timestamp type
);
CREATE TABLE user_updates
(
`id` int,
`user_id` int,
`timestamp` datetime -- or timestamp type
);
Now, to get an ordered result set with the same columns for all updates, you might find it convenient to pack the details of each update type into a delimited string (using CONCAT_WS()) in one column (let's call it details), add a column that distinguishes the update type (let's call it obj_type), and use UNION ALL:
SELECT 'C' obj_type, u.id, u.timestamp,
CONCAT_WS('|',
c.id,
c.course_name,
c.course_subject,
c.course_number) details
FROM course_updates u JOIN courses c
ON u.CRN = c.id
UNION ALL
SELECT 'U' obj_type, u.id, u.timestamp,
CONCAT_WS('|',
s.id,
s.name,
s.facebook_name) details
FROM user_updates u JOIN users s
ON u.user_id = s.id
ORDER BY timestamp DESC
Sample output:
| OBJ_TYPE | ID | TIMESTAMP | DETAILS |
-------------------------------------------------------------------------
| C | 3 | July, 30 2013 22:00:00+0000 | 3|Course3|Subject3|1414 |
| U | 2 | July, 11 2013 14:00:00+0000 | 1|Name1|FB Name1 |
| U | 2 | July, 11 2013 14:00:00+0000 | 3|Name3|FB Name3 |
...
Here is SQLFiddle demo
You can then easily split (explode) the details values while you iterate over the result set in PHP.
I don't think you should mix both of those concepts (user and course) together in a query. They have different number of columns and relate to different concepts.
I think you really should use two queries. One for users and one for courses.
SELECT courses.course_name, courses.course_subject, courses.course_number,
updates.updateObj,updates.timestamp
FROM courses, updates, course_updates
WHERE courses.id = course_updates.CRN
AND course_updates.update_id = updates.id
ORDER BY updates.timestamp;
SELECT users.name,users.facebook_name,updates.updateObj,updates.timestamp
FROM users ,updates, user_updates
WHERE users.id = user_updates.user_id
AND user_updates.update_id = updates.id
ORDER BY updates.timestamp;
If you are going to merge the two tables, you need to keep two things in mind:
Number of columns should ideally be the same
There should be a way to distinguish the source of the data.
Here is one way you could do this:
SELECT * FROM
(SELECT courses.course_name AS name, courses.course_subject AS details,
updates.updateObj AS updateObj, updates.timestamp AS timestamp,
'course' AS type
FROM courses, updates, course_updates
WHERE courses.id = course_updates.CRN
AND course_updates.update_id = updates.id
UNION ALL
SELECT users.name AS name, users.facebook_name AS details,
updates.updateObj AS updateObj, updates.timestamp AS timestamp,
'user' AS type
FROM users, updates, user_updates
WHERE users.id = user_updates.user_id
AND user_updates.update_id = updates.id) AS out_table
ORDER BY out_table.timestamp DESC;
The type column lets you distinguish between user and course updates, and your front end could use it to colour the rows differently. The course id does not appear in this result, but you can add it. Just keep in mind that you will then have to add a dummy value to the user SELECT statement so that both queries return the same number of columns. Note that if an update refers to both a user and a course, it will appear twice.
You could also order by type to differentiate user and course data.
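Here is a rough, runnable sketch of the merged feed with a type column, using Python's sqlite3 module instead of MySQL (an assumption). The table layout follows the question; the sample rows ('Databases', 'Ann' and so on) are made up purely for illustration, and I used explicit JOINs instead of the comma syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript("""
    CREATE TABLE courses (id INTEGER PRIMARY KEY, course_name TEXT,
                          course_subject TEXT, course_number TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, facebook_name TEXT);
    CREATE TABLE updates (id INTEGER PRIMARY KEY, timestamp TEXT, updateObj TEXT);
    CREATE TABLE course_updates (id INTEGER PRIMARY KEY, CRN INTEGER, update_id INTEGER);
    CREATE TABLE user_updates (id INTEGER PRIMARY KEY, user_id INTEGER, update_id INTEGER);
    -- Made-up rows, for illustration only.
    INSERT INTO courses VALUES (1, 'Databases', 'CS', '101');
    INSERT INTO users VALUES (1, 'Ann', 'ann.fb');
    INSERT INTO updates VALUES
        (1, '2013-07-11 14:00:00', 'user note'),
        (2, '2013-07-30 22:00:00', 'course note');
    INSERT INTO course_updates VALUES (1, 1, 2);
    INSERT INTO user_updates VALUES (1, 1, 1);
""")

# Both branches return the same five columns; 'type' tells them apart.
feed = conn.execute("""
    SELECT c.course_name AS name, c.course_subject AS details,
           up.updateObj AS updateObj, up.timestamp AS timestamp,
           'course' AS type
    FROM course_updates cu
    JOIN courses c ON c.id = cu.CRN
    JOIN updates up ON up.id = cu.update_id
    UNION ALL
    SELECT u.name, u.facebook_name, up.updateObj, up.timestamp, 'user'
    FROM user_updates uu
    JOIN users u ON u.id = uu.user_id
    JOIN updates up ON up.id = uu.update_id
    ORDER BY timestamp DESC
""").fetchall()

for row in feed:
    print(row)
```

The newest update comes first, and the last element of each row says which branch of the UNION ALL produced it.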

MySQL - Selecting related items in a table

I'm tracking user visits to course pages on our website. I'm doing this so that for any given course (aka product) I can pull up a list of the top other course pages that users visited, who also visited the current page - just like Amazon's "Customers Who Viewed This Item Also Viewed" feature.
What I have is working, but as the data collected continues to grow, the query times are getting considerably slower and slower. I've now got approx 300k records and the queries are taking 2+ seconds each. We're expecting to start trimming the data when we reach about 2M records, but given the performance problems we're currently facing, I don't think this will be possible. I would like to know if there is a better approach to how I'm doing this.
Here are the gory details...
I've got a simple three column InnoDB table containing the user id, course number and a timestamp. The user id and course number fields are indexed, as is the user id/course number combined. Here's the table schema:
CREATE TABLE IF NOT EXISTS `coursetracker` (
`user` varchar(38) NOT NULL COMMENT 'user guid',
`course` char(8) NOT NULL COMMENT 'subject code and course number',
`visited` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'last visited time',
UNIQUE KEY `ndx_user_course` (`user`,`course`),
KEY `ndx_user` (`user`),
KEY `ndx_course` (`course`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='tracking user visits to courses';
Data in the table looks like this:
user | course | visited
=======================================|==========|====================
{00001A4C-1DE0-C4FB-0770-A758A167B97E} | OFFC2000 | 2013-01-19 23:18:03
{00001FB0-179E-1E28-F499-65451E5C1465} | FSCT8481 | 2013-01-30 13:12:29
{0000582C-5959-EF2B-0637-B5326A504F95} | COMP1409 | 2013-01-13 16:09:42
{0000582C-5959-EF2B-0637-B5326A504F95} | COMP2051 | 2013-01-13 16:20:41
{0000582C-5959-EF2B-0637-B5326A504F95} | COMP2870 | 2013-01-13 16:25:41
{0000582C-5959-EF2B-0637-B5326A504F95} | COMP2920 | 2013-01-13 16:24:40
{00012C64-2CA1-66DD-5DDC-B3714BFC91C3} | COMM0005 | 2013-02-18 21:32:36
{00012C64-2CA1-66DD-5DDC-B3714BFC91C3} | COMM0029 | 2013-02-18 21:34:04
{00012C64-2CA1-66DD-5DDC-B3714BFC91C3} | COMM0030 | 2013-02-18 21:34:50
{00019F46-6664-28DD-BCCD-FA6810B4EBB8} | COMP1409 | 2013-01-16 15:48:49
A sample query that I'm using to get the related courses to any given course (COMP1409 in this example), looks like this:
SELECT `course`,
count(`course`) c
FROM `coursetracker`
WHERE `user` IN
(SELECT `user`
FROM `coursetracker`
WHERE `course` = 'COMP1409')
AND `course` != 'COMP1409'
GROUP BY `course`
ORDER BY c DESC LIMIT 10
The results of this query look like this:
course | c
=========|====
COMP1451 | 470
COMP1002 | 367
COMP2613 | 194
COMP1850 | 158
COMP1630 | 156
COMP2617 | 126
COMP2831 | 119
COMP2614 | 95
COMP1911 | 79
COMP1288 | 76
So, everything above works exactly as I'd like, other than the performance. The table is so simple that there's nothing left to index. The SQL query results in the data that I'm looking for. I'm out of ideas on how to do this faster. I'd appreciate any feedback on the approach.
You could try with a join instead:
SELECT c1.`course`,
count(c1.`course`) as c
FROM `coursetracker` c1
INNER JOIN `coursetracker` c2
ON c1.`user` = c2.`user`
WHERE c2.`course` = 'COMP1409'
AND c1.`course` != 'COMP1409'
GROUP BY c1.`course`
ORDER BY c DESC LIMIT 10
Hard to tell without seeing your EXPLAIN output, but maybe joining the table to itself will be faster?
SELECT c1.`course`, count(c1.`course`) AS c
FROM `coursetracker` c1
INNER JOIN `coursetracker` c2 ON c1.`user` = c2.`user`
WHERE c2.`course` = 'COMP1409'
AND c1.`course` != 'COMP1409'
GROUP BY c1.`course`
ORDER BY c DESC LIMIT 10
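Before comparing timings, it may be worth checking that the subquery form and the self-join form return the same rows. Below is a minimal sketch with Python's sqlite3 module (SQLite instead of MySQL is an assumption, so it says nothing about MySQL performance). The short user ids u1 to u5 are stand-ins for the GUIDs in the sample data, and the extra sort on course is only there to make ties deterministic.

```python
import sqlite3

# (user, course) pairs following the sample rows; "u1".."u5" replace the GUIDs.
visits = [
    ("u1", "OFFC2000"),
    ("u2", "FSCT8481"),
    ("u3", "COMP1409"), ("u3", "COMP2051"), ("u3", "COMP2870"), ("u3", "COMP2920"),
    ("u4", "COMM0005"), ("u4", "COMM0029"), ("u4", "COMM0030"),
    ("u5", "COMP1409"),
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE coursetracker (
        "user" TEXT NOT NULL,
        course TEXT NOT NULL,
        UNIQUE ("user", course)
    )
""")
conn.executemany("INSERT INTO coursetracker VALUES (?, ?)", visits)

SUBQUERY = """
    SELECT course, COUNT(course) AS c
    FROM coursetracker
    WHERE "user" IN (SELECT "user" FROM coursetracker WHERE course = ?)
      AND course != ?
    GROUP BY course
    ORDER BY c DESC, course LIMIT 10
"""

SELF_JOIN = """
    SELECT c1.course, COUNT(c1.course) AS c
    FROM coursetracker c1
    INNER JOIN coursetracker c2 ON c1."user" = c2."user"
    WHERE c2.course = ? AND c1.course != ?
    GROUP BY c1.course
    ORDER BY c DESC, c1.course LIMIT 10
"""


def related(conn, sql, course):
    """Run one of the two query forms for the given course."""
    return conn.execute(sql, (course, course)).fetchall()


via_subquery = related(conn, SUBQUERY, "COMP1409")
via_join = related(conn, SELF_JOIN, "COMP1409")
print(via_subquery == via_join)  # True
```

The two forms agree here because the unique key on (user, course) guarantees at most one matching c2 row per user, so the join cannot inflate the counts.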