Can I be selective on what rows I join on in MySQL - mysql

Suppose I have two tables, people and emails. emails has a person_id, an address, and an is_primary:
people:
id
emails:
person_id
address
is_primary
To get all email addresses per person, I can do a simple join:
select * from people join emails on people.id = emails.person_id
What if I only want (at most) one row from the right table for each row in the left table? And, if a particular person has multiple emails and one is marked as is_primary, is there a way to prefer which row to use when joining?
So, if I have
people:
------
| id |
------
| 1  |
| 2  |
| 3  |
| 4  |
------
emails:
-----------------------------------------
| id | person_id | address | is_primary |
-----------------------------------------
| 1  | 1         | a#b.c   | true       |
| 2  | 1         | b#b.c   | false      |
| 3  | 2         | c#b.c   | true       |
| 4  | 4         | d#b.c   | false      |
-----------------------------------------
is there a way to get this result:
------------------------------------------------
| people.id | emails.id | address | is_primary |
------------------------------------------------
| 1         | 1         | a#b.c   | true       | // chosen over b#b.c because it's primary
| 2         | 3         | c#b.c   | true       |
| 3         | null      | null    | null       | // no email for person 3
| 4         | 4         | d#b.c   | false      | // no primary email for person 4
------------------------------------------------

You have slightly misunderstood how left/right joins work.
This join
select * from people join emails on people.id = emails.person_id
will get you every column from both tables for all records that match your ON condition.
The left join
select * from people left join emails on people.id = emails.person_id
will give you every record from people, regardless of whether there's a corresponding record in emails. When there isn't, the columns from the emails table will just be NULL.
If a person has multiple emails, the result will contain multiple records for this person. Beginners often wonder why the data appears duplicated.
If you want to restrict the data to the rows where is_primary has the value 1, you can do so in the WHERE clause when you're doing an inner join (your first query, although you omitted the inner keyword).
When you have a left/right join query, you have to put this filter in the ON clause. If you put it in the WHERE clause, you implicitly turn the left/right join into an inner join, because the WHERE clause filters out the NULL rows that I mentioned above. Or you could write the query like this (note that it still drops people whose only emails are non-primary, because those rows are neither 1 nor NULL):
select * from people left join emails on people.id = emails.person_id
where (emails.is_primary = 1 or emails.is_primary is null)
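To see the difference between the three filter placements side by side, here is a minimal sketch that loads the question's sample data and runs each variant. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is runnable anywhere), with true/false stored as 1/0; the variable names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY);
    CREATE TABLE emails (id INTEGER PRIMARY KEY, person_id INTEGER,
                         address TEXT, is_primary INTEGER);
    INSERT INTO people VALUES (1), (2), (3), (4);
    INSERT INTO emails VALUES (1, 1, 'a#b.c', 1), (2, 1, 'b#b.c', 0),
                              (3, 2, 'c#b.c', 1), (4, 4, 'd#b.c', 0);
""")

base = ("SELECT p.id, e.address FROM people p "
        "LEFT JOIN emails e ON p.id = e.person_id")

# Filter in the ON clause: every person survives, non-matches become NULL.
on_rows = conn.execute(base + " AND e.is_primary = 1 ORDER BY p.id").fetchall()

# Filter in the WHERE clause: NULL-extended rows are discarded,
# so the left join behaves like an inner join.
where_rows = conn.execute(base + " WHERE e.is_primary = 1 ORDER BY p.id").fetchall()

# WHERE with an IS NULL escape: person 3 (no emails) is kept, but person 4
# (whose only email is non-primary) is still filtered out.
where_null_rows = conn.execute(
    base + " WHERE (e.is_primary = 1 OR e.is_primary IS NULL) ORDER BY p.id"
).fetchall()

print(on_rows)
print(where_rows)
print(where_null_rows)
```

Only the ON-clause variant returns all four people; the other two return two and three rows respectively.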
EDIT after clarification:
Paul Spiegel's answer is good, therefore my upvote, but I'm not sure if it performs well, since it has a dependent subquery. So I created this query. It may depend on your data though. Try both answers.
select
p.*,
coalesce(e1.address, e2.address) AS address
from people p
left join emails e1 on p.id = e1.person_id and e1.is_primary = 1
left join (
select person_id, address
from emails e
where id = (select min(id) from emails where emails.is_primary = 0 and emails.person_id = e.person_id)
) e2 on p.id = e2.person_id

Use a correlated subquery with LIMIT 1 in the ON clause of the LEFT JOIN:
select *
from people p
left join emails e
on e.person_id = p.id
and e.id = (
select e1.id
from emails e1
where e1.person_id = e.person_id
order by e1.is_primary desc, -- true first
e1.id -- If e1.is_primary is ambiguous
limit 1
)
order by p.id
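As a quick check of the query above against the question's sample data, the following sketch runs it through Python's sqlite3 module. SQLite is used here only as a convenient stand-in for MySQL (an assumption), and true/false are stored as 1/0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY);
    CREATE TABLE emails (id INTEGER PRIMARY KEY, person_id INTEGER,
                         address TEXT, is_primary INTEGER);
    INSERT INTO people VALUES (1), (2), (3), (4);
    INSERT INTO emails VALUES (1, 1, 'a#b.c', 1), (2, 1, 'b#b.c', 0),
                              (3, 2, 'c#b.c', 1), (4, 4, 'd#b.c', 0);
""")

rows = conn.execute("""
    SELECT p.id, e.id, e.address, e.is_primary
    FROM people p
    LEFT JOIN emails e
      ON e.person_id = p.id
     AND e.id = (
         SELECT e1.id
         FROM emails e1
         WHERE e1.person_id = e.person_id
         ORDER BY e1.is_primary DESC,  -- primary emails first
                  e1.id                -- lowest id breaks ties
         LIMIT 1
     )
    ORDER BY p.id
""").fetchall()

for row in rows:
    print(row)
```

Each person appears exactly once: the primary email where one exists, the lowest-id email otherwise, and NULLs for a person with no emails.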

Related

SQL left join: how to return the newest from tableB and grouped by another field

I've been trying for two days, without luck.
I have the following simplified tables in my database:
customers:
| id | name |
| 1 | andrea |
| 2 | marco |
| 3 | giovanni |
access:
| id | name_id | date |
| 1 | 1 | 5000 |
| 2 | 1 | 4000 |
| 3 | 2 | 1500 |
| 4 | 2 | 3000 |
| 5 | 2 | 1000 |
| 6 | 3 | 6000 |
| 7 | 3 | 2000 |
I want to return all the names with their last access date.
At first I tried simply with
SELECT * FROM customers LEFT JOIN access ON customers.id =
access.name_id
But I got 7 rows instead of the 3 I expected. So I understood that I need to use a GROUP BY statement, as follows:
SELECT * FROM customers LEFT JOIN access ON customers.id =
access.name_id GROUP BY customers.id
As far as I know, GROUP BY picks an arbitrary row from each group. In fact, I got inconsistent access dates across several tests.
Instead, I need to group every customer id with its corresponding latest access! How can this be done?
You have to get the latest date from the access table with a group by on name_id, then join this result with the customers table. Here is the query:
select c.id, c.name, a.last_access_date from customers c left join
(select name_id, max(date) as last_access_date from access group by name_id) a
on c.id=a.name_id;
Here is a DEMO on sqlfiddle.
I think this is what you'd like to achieve:
SELECT c.id, c.name, max(a.date) last_access
FROM customers c
LEFT JOIN access a ON c.id = a.name_id
GROUP BY c.id, c.name
The LEFT JOIN will return all entries in the customers table regardless of whether the join criterion (c.id = a.name_id) is satisfied. This means that you might get some NULL entries.
Example:
Simply add a new row in the customers table (id: 4, name: manuela). The output will have 4 rows and the newest row will be (id: 4, last_access: null)
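The two approaches discussed so far (aggregating after the join, and joining against a pre-aggregated derived table) can be compared with a small sketch. It uses Python's sqlite3 module in place of MySQL (an assumption for runnability) and includes the extra customer "manuela" from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE access (id INTEGER PRIMARY KEY, name_id INTEGER, date INTEGER);
    INSERT INTO customers VALUES (1, 'andrea'), (2, 'marco'),
                                 (3, 'giovanni'), (4, 'manuela');
    INSERT INTO access VALUES (1, 1, 5000), (2, 1, 4000), (3, 2, 1500),
                              (4, 2, 3000), (5, 2, 1000), (6, 3, 6000),
                              (7, 3, 2000);
""")

# Variant 1: join first, then aggregate per customer.
grouped = conn.execute("""
    SELECT c.id, c.name, MAX(a.date) AS last_access
    FROM customers c
    LEFT JOIN access a ON c.id = a.name_id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()

# Variant 2: aggregate access first, then join the one-row-per-customer result.
derived = conn.execute("""
    SELECT c.id, c.name, a.last_access
    FROM customers c
    LEFT JOIN (SELECT name_id, MAX(date) AS last_access
               FROM access
               GROUP BY name_id) a ON c.id = a.name_id
    ORDER BY c.id
""").fetchall()

print(grouped)
print(derived)
```

Both variants return one row per customer, with NULL as the last access for the customer that has no access rows.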
I would do this using a correlated subquery in the ON clause:
SELECT a.*, c.*
FROM customers c LEFT JOIN
access a
ON c.id = a.name_id AND
a.DATE = (SELECT MAX(a2.date) FROM access a2 WHERE a2.name_id = a.name_id);
If this statement is true:
I need to group every customer id with its corresponding latest access! How this can be done?
Then you can simply do:
select a.name_id, max(a.date)
from access a
group by a.name_id;
You do not need the customers table because:
All customers are in access, so the left join is not necessary.
You need no columns from customers.

Optimization SQL for getting data from two joined tables (usernames for user-from-id and user-to-id msgs from two tables)

I have table "msgs" with messages between users (their ids):
+--------+-------------+------------+---------+---------+
| msg_id |user_from_id | user_to_id | message | room_id |
+--------+-------------+------------+---------+---------+
| 1 | 1 | 4 |Hello! | 2 |
| 2 | 1 | 5 |Hi there | 1 |
| 3 | 2 | 1 |CU soon | 2 |
| 4 | 3 | 7 |nice... | 1 |
+--------+-------------+------------+---------+---------+
I also have two tables with users names.
Table: user1
+--------+----------+
|user_id |user_name |
+--------+----------+
| 5 | Ann |
| 6 | Sam |
| 7 | Michael |
+--------+----------+
Table: user2
+--------+----------+
|user_id |user_name |
+--------+----------+
| 1 | John |
| 2 | Alice |
| 3 | Tom |
| 4 | Jane |
+--------+----------+
I need to get the usernames for the two user IDs in every row. Each user ID can be in either the first or the second table of usernames.
I wrote this SQL query:
SELECT DISTINCT
m.msg_id,
m.user_from_id,
CASE WHEN c1.user_name IS NULL THEN c3.user_name ELSE c1.user_name END AS from_name,
m.user_to_id,
CASE WHEN c2.user_name IS NULL THEN c4.user_name ELSE c2.user_name END AS to_name,
m.message
FROM msgs m
LEFT JOIN users1 c1 ON c1.user_id=m.user_from_id
LEFT JOIN users1 c2 ON c2.user_id=m.user_to_id
LEFT JOIN users2 c3 ON c3.user_id=m.user_from_id
LEFT JOIN users2 c4 ON c4.user_id=m.user_to_id
WHERE m.room_id=1
LIMIT 0, 8
It works.
Executing the query to get the raw data without usernames (without any join) takes about 0.1 sec. But joining just one usernames table (user1 or user2 only) is enough to push it to about 6.2 sec. I have quite a lot of rows in these tables: 35K rows in msgs, 0.5K in user1, 25K in user2.
Executing the query with both tables joined (with all this data) is impossible.
How can I optimize this query? I just need the usernames for the user_ids in the "msgs" table.
There are potentially many differences between the queries with and without the joins. I am going to assume that the ids have the appropriate indexes -- primary keys automatically do. If not, then check that.
The obvious solution is to use the original query as a subquery:
SELECT m.msg_id, m.user_from_id,
(CASE WHEN c1.user_name IS NULL THEN c3.user_name ELSE c1.user_name
END) AS from_name,
m.user_to_id,
(CASE WHEN c2.user_name IS NULL THEN c4.user_name ELSE c2.user_name
END) AS to_name,
m.message
FROM (SELECT m.*
FROM msgs m
WHERE m.room_id = 1
LIMIT 0, 8
) m LEFT JOIN
users1 c1
ON c1.user_id = m.user_from_id LEFT JOIN
users1 c2
ON c2.user_id = m.user_to_id LEFT JOIN
users2 c3
ON c3.user_id = m.user_from_id LEFT JOIN
users2 c4
ON c4.user_id = m.user_to_id;
For most data structures, the distinct is also unnecessary.
This also assumes (reasonably) that user_id is unique in the users tables.
Also, use of LIMIT without ORDER BY is highly discouraged. The particular rows you get are indeterminate and might change from one execution to the next.
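The "limit first, then join" idea can be tried on the sample rows from the question. The sketch below is only an illustration under a few assumptions: it runs on Python's sqlite3 module rather than MySQL, it uses the table names users1/users2 from the query (the question's prose calls them user1/user2), it adds an ORDER BY msg_id to make the LIMIT deterministic, and it writes the CASE expressions as the equivalent COALESCE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE msgs (msg_id INTEGER PRIMARY KEY, user_from_id INTEGER,
                       user_to_id INTEGER, message TEXT, room_id INTEGER);
    CREATE TABLE users1 (user_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE users2 (user_id INTEGER PRIMARY KEY, user_name TEXT);
    INSERT INTO msgs VALUES (1, 1, 4, 'Hello!', 2), (2, 1, 5, 'Hi there', 1),
                            (3, 2, 1, 'CU soon', 2), (4, 3, 7, 'nice...', 1);
    INSERT INTO users1 VALUES (5, 'Ann'), (6, 'Sam'), (7, 'Michael');
    INSERT INTO users2 VALUES (1, 'John'), (2, 'Alice'), (3, 'Tom'), (4, 'Jane');
""")

rows = conn.execute("""
    SELECT m.msg_id,
           COALESCE(c1.user_name, c3.user_name) AS from_name,
           COALESCE(c2.user_name, c4.user_name) AS to_name,
           m.message
    FROM (SELECT *
          FROM msgs
          WHERE room_id = 1
          ORDER BY msg_id   -- deterministic page of messages
          LIMIT 8) m        -- at most 8 rows reach the joins
    LEFT JOIN users1 c1 ON c1.user_id = m.user_from_id
    LEFT JOIN users1 c2 ON c2.user_id = m.user_to_id
    LEFT JOIN users2 c3 ON c3.user_id = m.user_from_id
    LEFT JOIN users2 c4 ON c4.user_id = m.user_to_id
    ORDER BY m.msg_id
""").fetchall()

for row in rows:
    print(row)
```

Because the derived table is limited before the joins, the name lookups run against at most eight message rows, whatever the size of msgs.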

Incorrect results from three table join

I have these three tables in my database:
tblCustomer (id,name,address)
tblLoan (id,customerId,LoanAmount,date)
tblPayment (id,customerId,ReceivedAmount,date)
I want to find the total loanAmount for a customer and how much they have paid.
I wrote this query:
SELECT c.fname, SUM(l.amount), SUM(p.amount)
FROM tblCustomer c
JOIN tblLoan l ON (l.customerId = c.id)
JOIN tblPayment p ON (p.customerId = c.id)
WHERE c.id = 3;
It returns results but they are incorrect.
First, as others have mentioned, your query as posted is likely invalid because the column names do not match your tables, but you said you got incorrect results, so I would assume that's not your actual problem, since you were able to run your query.
The problem that I think you are most likely having is that by joining the two tables together like that, rows appear twice for each customer. Am I correct in assuming that your 'incorrect' results are double what you would expect? Let me illustrate for those who don't understand. Consider this data set, with shortened column values:
tblCustomer:
| id | name |
+----+------+
| 1 | Adam |
| 2 | John |
| 3 | Jane |
tblLoan, and for simplicity we'll say the payment table looks the same:
| customerID | loanAmount |
+------------+------------+
| 1 | 100 |
| 2 | 200 |
| 3 | 300 |
| 3 | 300 |
| 2 | 200 |
If I perform the following query (without summing values, just getting the values I want):
SELECT c.id, c.name, l.loanAmount, p.receivedAmount
FROM tblCustomer c
JOIN tblLoan l ON l.customerid = c.id
JOIN tblPayment p ON p.customerid = c.id
WHERE c.id = 3;
It returns this result set:
| id | name | loanAmount | receivedAmount |
+----+------+------------+----------------+
| 3 | Jane | 300 | 300 |
| 3 | Jane | 300 | 300 |
| 3 | Jane | 300 | 300 |
| 3 | Jane | 300 | 300 |
So notice that because we're joining two tables based on a relationship to a third table, we're actually creating a Cartesian product, which is causing the problem. So, what I recommend you do is use subqueries for these two tables. One subquery will pull the loan values, one the payment values, and you can join those together on the id value.
It will look like this:
SELECT t.id, t.totalLoan, w.totalReceived
FROM(SELECT c.id, SUM(l.loanAmount) AS totalLoan
FROM tblCustomer c
JOIN tblLoan l ON l.customerid = c.id
WHERE c.id = 3) t
JOIN(SELECT c.id, SUM(p.receivedAmount) AS totalReceived
FROM tblCustomer c
JOIN tblPayment p ON p.customerid = c.id
WHERE c.id = 3) w
ON t.id = w.id;
And this should give you the values you want. Here is what I tested on SQL Fiddle.
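The row multiplication is easy to reproduce with the sample data above. The following sketch uses Python's sqlite3 module rather than MySQL (an assumption for runnability) and assumes the payment rows mirror the loan rows, as the answer suggests. It compares the naive double join with a variant that aggregates each child table before joining.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblCustomer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tblLoan (customerId INTEGER, loanAmount INTEGER);
    CREATE TABLE tblPayment (customerId INTEGER, receivedAmount INTEGER);
    INSERT INTO tblCustomer VALUES (1, 'Adam'), (2, 'John'), (3, 'Jane');
    INSERT INTO tblLoan VALUES (1, 100), (2, 200), (3, 300), (3, 300), (2, 200);
    INSERT INTO tblPayment VALUES (1, 100), (2, 200), (3, 300), (3, 300), (2, 200);
""")

# Naive: 2 loans x 2 payments = 4 joined rows, so each sum is doubled.
naive = conn.execute("""
    SELECT SUM(l.loanAmount), SUM(p.receivedAmount)
    FROM tblCustomer c
    JOIN tblLoan l ON l.customerId = c.id
    JOIN tblPayment p ON p.customerId = c.id
    WHERE c.id = 3
""").fetchone()

# Pre-aggregated: each derived table has one row per customer,
# so the joins cannot multiply rows.
fixed = conn.execute("""
    SELECT c.id, l.totalLoan, p.totalReceived
    FROM tblCustomer c
    LEFT JOIN (SELECT customerId, SUM(loanAmount) AS totalLoan
               FROM tblLoan GROUP BY customerId) l ON l.customerId = c.id
    LEFT JOIN (SELECT customerId, SUM(receivedAmount) AS totalReceived
               FROM tblPayment GROUP BY customerId) p ON p.customerId = c.id
    WHERE c.id = 3
""").fetchone()

print(naive)
print(fixed)
```

With two loans and two payments of 300 each, the naive query reports 1200 for both sums, while the pre-aggregated query reports 600.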
FYI, YOUR COLUMN NAMES ARE WRONG!!!
There is no such column named fname in table tblCustomer
There is no such column named amount in table tblLoan
There is no such column named amount in table tblPayment
You won't get the right result if you don't have the appropriate column names. Even when using aliases, your column names should be EXACTLY THE SAME as in your database table. That's because you are aliasing TABLES in JOIN queries, not COLUMNS.
So, re-write your query in the following way:
SELECT c.name, SUM(l.LoanAmount), SUM(p.ReceivedAmount)
FROM tblCustomer c
JOIN tblLoan l ON l.customerId = c.id
JOIN tblPayment p ON p.customerId = c.id
WHERE c.id = 3
Note that there's no need to get brackets around the ON clause in JOIN.

MySQL selective GROUP BY, using the maximal value

I have the following (simplified) three tables:
user_reservations:
id | user_id |
1 | 3 |
1 | 3 |
user_kar:
id | user_id | szak_id |
1 | 3 | 1 |
2 | 3 | 2 |
szak:
id | name |
1 | A |
2 | B |
Now I would like to count the reservations of the user by the 'szak' name, but I want to have every user counted for only one szak. In this case, user_id 3 has 2 'szak', and if I write a query something like:
SELECT s.name, COUNT(*) FROM user_reservations r
LEFT JOIN user_kar k ON k.user_id = r.user_id
LEFT JOIN szak s ON s.id = k.szak_id
GROUP BY s.name
It will return two rows:
A | 2 |
B | 2 |
However, I want every reservation to be counted for only one szak (let's say the highest id only). I tried MAX(k.id) with HAVING, but it seems ineffective.
I would like to know if there is a supported method for that in MySQL, or should I first pick all the user IDs on the backend side, check their maximum kar.user_id, count only with those, remove them from the id list once the given szak is counted, and then build the data back together on the backend side?
Thanks for the help - I was googling around for like 2 hours, but so far, I found no solution, so maybe you could help me.
Something like this?
SELECT sz.name,
Count(*)
FROM (SELECT r.user_id,
Ifnull(Max(k.szak_id), -1) AS max_szak_id
FROM user_reservations r
LEFT OUTER JOIN user_kar k
ON k.user_id = r.user_id
GROUP BY r.user_id) t
LEFT OUTER JOIN szak sz
ON sz.id = t.max_szak_id
GROUP BY sz.name;
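Note that the query above groups the inner result by user, so its outer COUNT(*) counts users per szak. If the goal is to count reservation rows, one possible variant is to pick the highest szak_id per user in a derived table and join the reservations against it. The sketch below shows this with Python's sqlite3 module instead of MySQL (an assumption for runnability); the two reservations are given ids 1 and 2 for illustration, whereas the question lists id 1 twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_reservations (id INTEGER, user_id INTEGER);
    CREATE TABLE user_kar (id INTEGER, user_id INTEGER, szak_id INTEGER);
    CREATE TABLE szak (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO user_reservations VALUES (1, 3), (2, 3);
    INSERT INTO user_kar VALUES (1, 3, 1), (2, 3, 2);
    INSERT INTO szak VALUES (1, 'A'), (2, 'B');
""")

rows = conn.execute("""
    SELECT s.name, COUNT(*) AS reservations
    FROM user_reservations r
    LEFT JOIN (SELECT user_id, MAX(szak_id) AS szak_id  -- one szak per user
               FROM user_kar
               GROUP BY user_id) k ON k.user_id = r.user_id
    LEFT JOIN szak s ON s.id = k.szak_id
    GROUP BY s.name
""").fetchall()

print(rows)
```

Both reservations of user 3 are attributed to szak B only, since B has the higher id.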

Joining 3 tables with n:m relationship, want to see nonmatching rows

For this problem, consider the following 3 tables:
Event
id (pk)
title
Event_Category
event_id (pk, fk)
category_id (pk, fk)
Category
id (pk)
description
Pretty trivial I guess... :) Each event can fall into zero or more categories, in total there are 4 categories.
In my application, I want to view and edit the categories for a specific event. Graphically, the event will be shown together with ALL categories and a checkbox indicating whether the event falls into the category. Changing and saving the choice will result in modification of the intermediate table Event_Category.
But first: how to select this for a specific event? The query I need will in fact always return 4 rows, the number of categories present.
The following returns only the entries for the categories that the event with id=11 falls into. Experimenting with outer joins did not give more rows in the result.
SELECT e.id, c.omschrijving
FROM Event e
INNER JOIN Event_Categorie ec ON e.id = ec.event_id
INNER JOIN Categorie c ON c.id = ec.categorie_id
WHERE e.id = 11
Or should I start with the Category table in the query? Hope for some hints :)
TIA, Klaas
UPDATE:
Yes I did but still have not found the answer. But I have simplified the issue by omitting the Event table from the query because this table is only used to view the Event descriptions.
SELECT * from Categorie c LEFT JOIN Event_Categorie ec ON c.id = ec.categorie_id WHERE ec.event_id = 11;
The simplified 2-table query only uses the lookup table and the link table but still returns only 2 rows instead of the total of 4 rows in the Categorie table.
My guess would be that the WHERE clause is applied after the joining, so the rows not joined to the link table are excluded. In my application I solved the issues by using a subquery but I still would like to know what is the best solution.
What you want is the list of all categories, plus information about whether that category is in the list of categories of your event.
So, you can do:
SELECT
*
FROM
Category
LEFT JOIN Event_Category ON category_id = id
WHERE
event_id = 11
and event_id column will be NULL on the categories that are not part of your event.
You can also create a column (named has_category below) that you will use to see if the event has this category instead of comparing with NULL:
SELECT
*,
event_id IS NOT NULL AS has_category
FROM
Category
LEFT JOIN Event_Category ON category_id = id
WHERE
event_id = 11
EDIT: This seems exactly what you say you are doing on your edit. I tested it and it seems correct. Are you sure you are running this query, and that rows with NULL are not somehow ignored?
The query
SELECT * FROM Categorie;
returns 4 rows:
+----+--------------+-------------------------------------+--------------------------------------+
| id | omschrijving | afbeelding | afbeelding_klein |
+----+--------------+-------------------------------------+--------------------------------------+
| 1 | Creatief | images/categorieen/creatief420k.jpg | images/categorieen/creatief190k.jpg |
| 2 | Sportief | images/categorieen/sportief420k.jpg | images/categorieen/sportief190kr.jpg |
| 4 | Culinair | images/categorieen/culinair420k.jpg | images/categorieen/culinair190k.jpg |
| 5 | Spirit | images/categorieen/spirit420k.jpg | images/categorieen/spirit190k.jpg |
+----+--------------+-------------------------------------+--------------------------------------+
4 rows in set (0.00 sec)
BUT:
The query
SELECT *
FROM Categorie
LEFT JOIN Event_Categorie ON categorie_id = id
WHERE event_id = 11;
returns 2 rows:
+----+--------------+-------------------------------------+-------------------------------------+----------+--------------+
| id | omschrijving | afbeelding | afbeelding_klein | event_id | categorie_id |
+----+--------------+-------------------------------------+-------------------------------------+----------+--------------+
| 1 | Creatief | images/categorieen/creatief420k.jpg | images/categorieen/creatief190k.jpg | 11 | 1 |
| 4 | Culinair | images/categorieen/culinair420k.jpg | images/categorieen/culinair190k.jpg | 11 | 4 |
+----+--------------+-------------------------------------+-------------------------------------+----------+--------------+
2 rows in set (0.00 sec)
So I still need the subquery... and the LEFT JOIN is not effective in showing all rows of the Categorie table, regardless of whether there is a match with the link table.
This query, however, does what I want it to do:
SELECT *
FROM Categorie c
LEFT JOIN (SELECT * FROM Event_Categorie ec WHERE ec.event_id = 11 ) AS subselect ON subselect.categorie_id = c.id;
Result:
+----+--------------+-------------------------------------+--------------------------------------+----------+--------------+
| id | omschrijving | afbeelding | afbeelding_klein | event_id | categorie_id |
+----+--------------+-------------------------------------+--------------------------------------+----------+--------------+
| 1 | Creatief | images/categorieen/creatief420k.jpg | images/categorieen/creatief190k.jpg | 11 | 1 |
| 2 | Sportief | images/categorieen/sportief420k.jpg | images/categorieen/sportief190kr.jpg | NULL | NULL |
| 4 | Culinair | images/categorieen/culinair420k.jpg | images/categorieen/culinair190k.jpg | 11 | 4 |
| 5 | Spirit | images/categorieen/spirit420k.jpg | images/categorieen/spirit190k.jpg | NULL | NULL |
+----+--------------+-------------------------------------+--------------------------------------+----------+--------------+
4 rows in set (0.00 sec)
The issue is that you have filtered the results by the eventid. As you can see in your results, two of the categories (Sportief and Spirit) do not have events. So the correct SQL statement (using SQL Server syntax; some translation may be required) is:
SELECT *
FROM Categorie
LEFT JOIN Event_Categorie ON categorie_id = id
WHERE (event_id IS NULL) OR (event_id = 11);
Finally I found the right query; no subselect is necessary. The WHERE clause is applied after the join and is therefore not part of the join anymore. The solution is to extend the ON clause with an extra condition. Now all 4 rows are returned, with NULL for the non-matching categories!
SELECT *
FROM Categorie
LEFT JOIN Event_Categorie ON categorie_id = id AND event_id = 11;
So the bottom line is that putting an extra condition in the ON clause has a different effect than filtering out rows by the same condition in the WHERE clause!
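For the checkbox use case, the ON-clause condition can be combined with the has_category flag suggested earlier in the thread. The sketch below is an illustration using Python's sqlite3 module rather than MySQL (an assumption), with only the id and omschrijving columns. It also adds a hypothetical event 12 linked to category 2, which is not in the thread's data, to show that the WHERE ... IS NULL OR ... variant loses a category as soon as another event uses it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Categorie (id INTEGER PRIMARY KEY, omschrijving TEXT);
    CREATE TABLE Event_Categorie (event_id INTEGER, categorie_id INTEGER,
                                  PRIMARY KEY (event_id, categorie_id));
    INSERT INTO Categorie VALUES (1, 'Creatief'), (2, 'Sportief'),
                                 (4, 'Culinair'), (5, 'Spirit');
    -- Event 12 is hypothetical; it is not part of the thread's data.
    INSERT INTO Event_Categorie VALUES (11, 1), (11, 4), (12, 2);
""")

# Condition in ON: all categories are returned, flagged 1/0 for event 11.
flags = conn.execute("""
    SELECT c.id, c.omschrijving, ec.event_id IS NOT NULL AS has_category
    FROM Categorie c
    LEFT JOIN Event_Categorie ec
      ON ec.categorie_id = c.id AND ec.event_id = 11
    ORDER BY c.id
""").fetchall()

# Condition in WHERE with an IS NULL escape: category 2 joins to event 12,
# so its row is neither NULL nor 11 and gets filtered out.
or_null = conn.execute("""
    SELECT c.id
    FROM Categorie c
    LEFT JOIN Event_Categorie ec ON ec.categorie_id = c.id
    WHERE ec.event_id IS NULL OR ec.event_id = 11
    ORDER BY c.id
""").fetchall()

print(flags)
print(or_null)
```

The first query always yields one row per category, which is what a list of checkboxes needs; the second returns only three of the four categories with this data.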