I am studying select queries for MySQL join functions.
I am slightly confused by the query below. As I understand it, the statement joins columns from multiple tables using the ON clause and then filters the result set with the WHERE clause.
Is this correct? What other functionality does this provide? Are there better alternatives?
The tables, attributes, and schema are not relevant to this question; I am only asking about how ON and WHERE interact. Thanks in advance for any insight you can provide.
SELECT DISTINCT Movies.title
FROM Rentals
INNER JOIN Customers
INNER JOIN Copies
INNER JOIN Movies ON Rentals.customerNum=Customers.customerNum
AND Rentals.barcode=Copies.barcode
AND Copies.movieNum=Movies.movieNum
WHERE Customers.givenName='Chad'
AND Customers.familyName='Black';
INNER JOIN (and the outer joins) are binary operators that should be followed by an ON clause. Your particular syntax works in MySQL, but will not work in any other database (because it is missing two ON clauses).
I would recommend writing the query as:
SELECT DISTINCT m.title
FROM Movies m JOIN
Copies co
ON co.movieNum = m.movieNum JOIN
Rentals r
ON r.barcode = co.barcode JOIN
Customers c
ON c.customerNum = r.customerNum
WHERE c.givenName = 'Chad' AND
c.familyName = 'Black';
You should always put the JOIN conditions in the ON clause, with one ON per JOIN. This also introduces table aliases, which make the query easier to write and to read.
The WHERE clause has additional filtering conditions. These could also be in ON clauses, but I think the query reads better with them in the WHERE clause. You can glance at the query and see: "We are getting something from a bunch of tables for Chad Black".
Ordinary inner JOIN operations only generate result rows for table rows matching their ON condition. They suppress any rows that don't match. That means you can move the contents of ON clauses to your WHERE clause and get the same result set. Still, don't do that; JOINs are easier to understand when they have ON clauses.
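To see the equivalence for inner joins in practice, here is a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for MySQL, and the table contents are made up for illustration; only the table and column names come from the question.

```python
import sqlite3

# Hypothetical data; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customerNum INTEGER PRIMARY KEY, givenName TEXT);
    CREATE TABLE Rentals (customerNum INTEGER, barcode TEXT);
    INSERT INTO Customers VALUES (1, 'Chad'), (2, 'Dana');
    INSERT INTO Rentals VALUES (1, 'A1'), (1, 'B2'), (2, 'C3');
""")


def barcodes(sql):
    """Run a query and return the barcodes it yields, sorted."""
    return sorted(row[0] for row in conn.execute(sql))


# Join condition in ON, extra filter in WHERE (the recommended style).
join_in_on = barcodes("""
    SELECT r.barcode
    FROM Rentals r
    JOIN Customers c ON c.customerNum = r.customerNum
    WHERE c.givenName = 'Chad'
""")

# Same conditions, all moved into WHERE.
join_in_where = barcodes("""
    SELECT r.barcode
    FROM Rentals r
    CROSS JOIN Customers c
    WHERE c.customerNum = r.customerNum
      AND c.givenName = 'Chad'
""")

print(join_in_on, join_in_where)
```

Both lists contain the same barcodes, which is the point: for inner joins the choice between ON and WHERE is about readability, not about the result.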
If you use LEFT JOIN, a kind of outer join, you get rows from the first table you mention that don't match any rows in the second table according to the ON clause.
SELECT a.name, b.name
FROM a
LEFT JOIN b ON a.a_id = b.a_id
gives you a result set containing all rows of a, with NULL values in b.name wherever the ON condition did not match.
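A small sketch of that behaviour, again assuming SQLite via Python's sqlite3 as a stand-in for MySQL. The rows in a and b are invented for the example.

```python
import sqlite3

# Hypothetical rows: 'second' has no matching row in b.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (b_id INTEGER PRIMARY KEY, a_id INTEGER, name TEXT);
    INSERT INTO a VALUES (1, 'first'), (2, 'second');
    INSERT INTO b VALUES (10, 1, 'child of first');
""")

# LEFT JOIN keeps every row of a; unmatched rows get NULL (None) for b.name.
left_rows = conn.execute("""
    SELECT a.name, b.name
    FROM a
    LEFT JOIN b ON a.a_id = b.a_id
    ORDER BY a.a_id
""").fetchall()

# INNER JOIN suppresses the row of a that has no match in b.
inner_rows = conn.execute("""
    SELECT a.name, b.name
    FROM a
    INNER JOIN b ON a.a_id = b.a_id
    ORDER BY a.a_id
""").fetchall()

print(left_rows)
print(inner_rows)
```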
I've a query like below,
SELECT
c.testID
FROM a
INNER JOIN b ON a.id=b.ID
INNER JOIN c ON b.r_ID=c.id
WHERE c.test IS NOT NULL;
Can this query be optimized further? I want the inner join between the three tables to happen only for rows that meet the where clause.
Logically, the WHERE clause works as a filter on the data that remains after all the JOINs,
whereas if you put the same restriction in the JOIN clause itself, the idea is to avoid filtering after the join. That is, you join on already-filtered data instead.
SELECT c.testID
FROM a
INNER JOIN b ON a.id = b.ID
INNER JOIN c ON b.r_ID = c.id AND c.test IS NOT NULL;
Moreover, you should create an index on the column test in table c to speed up the query.
Also, learn to run the EXPLAIN command on your queries to check what the optimizer actually does.
Try the following:
SELECT
c.testID
FROM c
INNER JOIN b ON c.test IS NOT NULL AND b.r_ID=c.testID
INNER JOIN a ON a.id=b.r_ID;
I changed the order of the joins and conditions so that the first condition to be evaluated is c.test IS NOT NULL.
Disclaimer: You should use the EXPLAIN command to see how the query is actually executed.
I'm fairly sure that even this minor change may make no difference, because the MySQL optimizer processes every query and may well choose the same plan.
See the MySQL Documentation: Optimizing Queries with EXPLAIN
Three queries Compared
Have a look at the following fiddle:
https://www.db-fiddle.com/f/fXsT8oMzJ1H31FwMHrxR3u/0
I ran three different queries and in the end, MySQL optimized and ran them the same way.
Three Queries:
EXPLAIN SELECT
c.testID
FROM c
INNER JOIN b ON c.test IS NOT NULL AND b.r_ID=c.testID
INNER JOIN a ON a.id=b.r_ID;
EXPLAIN SELECT c.testID
FROM a
INNER JOIN b ON a.id = b.r_id
INNER JOIN c ON b.r_ID = c.testID AND c.test IS NOT NULL;
EXPLAIN SELECT
c.testID
FROM a
INNER JOIN b ON a.id=b.r_ID
INNER JOIN c ON b.r_ID=c.testID
WHERE c.test IS NOT NULL;
All tables should have a PRIMARY KEY. Assuming that id is the PRIMARY KEY for the tables that it is in, then you need these secondary keys for maximal performance:
c: INDEX(test, testID, id) -- `test` must be first
b: INDEX(r_ID)
Both of those are useful and "covering".
Another thing to note: b and a are virtually unused in the query, so you may as well write the query this way:
SELECT c.testID
FROM c
WHERE c.test IS NOT NULL;
At that point, all you need is INDEX(test, testID).
I suspect you "simplified" your query by leaving out some uses of a and b. Well, I simplified it from there, just as the Optimizer should have done. (However, elimination of tables is an optimization that it does not do; it figures that is something the user would have done.)
On the other hand, b and a are not totally useless. The JOINs verify that there are corresponding rows, possibly many such rows, in those tables. Again, I think you had some other purpose.
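A rough sketch of why the simplified query is not strictly identical to the joined one. It assumes SQLite via Python's sqlite3 in place of MySQL, the column layout used in the three EXPLAIN queries above, and invented rows.

```python
import sqlite3

# Hypothetical rows: testID 1 is referenced twice in b, testID 2 not at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (ID INTEGER PRIMARY KEY, r_ID INTEGER);
    CREATE TABLE c (testID INTEGER PRIMARY KEY, test TEXT);
    INSERT INTO a VALUES (1);
    INSERT INTO b VALUES (1, 1), (2, 1);
    INSERT INTO c VALUES (1, 'x'), (2, 'y'), (3, NULL);
""")

# The joined form: rows of c survive only if a and b have matching rows,
# and they are repeated once per match.
joined = [row[0] for row in conn.execute("""
    SELECT c.testID
    FROM a
    INNER JOIN b ON a.id = b.r_ID
    INNER JOIN c ON b.r_ID = c.testID
    WHERE c.test IS NOT NULL
    ORDER BY c.testID
""")]

# The simplified form: every non-NULL row of c, exactly once.
simplified = [row[0] for row in conn.execute("""
    SELECT c.testID
    FROM c
    WHERE c.test IS NOT NULL
    ORDER BY c.testID
""")]

print(joined)      # testID 1 twice, testID 2 missing
print(simplified)  # testID 1 and 2, once each
```

So dropping a and b is only safe if you do not rely on the joins to check for (or multiply by) matching rows.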
I have a query that looks like this:
select `adverts`.*
from `adverts`
inner join `advert_category` on `advert_category`.`advert_id` = `adverts`.`id`
inner join `advert_location` on `adverts`.`id` = `advert_location`.`advert_id`
where `advert_location`.`location_id` = ?
and `advert_category`.`category_id` = ?
order by `updated_at` desc
The problem here is that I have a huge database and this query is absolutely ravaging it.
What I really need is to do the first join and then apply the where clause. That would whittle the result down from around 100k rows to fewer than 10k. Then I want to do the other join to narrow the result again, so I can apply the advert_location condition to the category items.
Doing it as is just isn't viable.
So, how do I go about using a join and a where condition, and then after getting that response doing a further join with a where condition?
Thanks
This is your query, written a bit simpler so I can read it:
select a.*
from adverts a inner join
advert_category ac
on ac.advert_id = a.id inner join
advert_location al
on al.advert_id = a.id
where al.location_id = ? and
ac.category_id = ?
order by a.updated_at desc;
I am speculating that advert_category and advert_location have multiple rows per advert. In that case, you are getting a Cartesian product for each advert.
A better way to write the query uses exists:
select a.*
from adverts a
where exists (select 1
from advert_location al
where al.advert_id = a.id and al.location_id = ?
) and
exists (select 1
from advert_category ac
where ac.advert_id = a.id and ac.category_id = ?
)
order by a.updated_at desc;
For this version, you want indexes on advert_location(advert_id, location_id), advert_category(advert_id, category_id), and probably adverts(updated_at, id).
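Here is a small sketch of the per-advert Cartesian product and of the EXISTS form. It assumes SQLite via Python's sqlite3 as a stand-in for MySQL, and the rows and the ids 101 and 10 are invented; the table and column names follow the question.

```python
import sqlite3

# Hypothetical data: advert 1 has 2 categories and 3 locations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE adverts (id INTEGER PRIMARY KEY, title TEXT, updated_at TEXT);
    CREATE TABLE advert_category (advert_id INTEGER, category_id INTEGER);
    CREATE TABLE advert_location (advert_id INTEGER, location_id INTEGER);
    INSERT INTO adverts VALUES (1, 'bike', '2020-01-02'), (2, 'car', '2020-01-01');
    INSERT INTO advert_category VALUES (1, 10), (1, 11), (2, 10);
    INSERT INTO advert_location VALUES (1, 100), (1, 101), (1, 102), (2, 101);
""")

# Without filters, the two joins pair every category row with every
# location row of the same advert: 2*3 rows for advert 1, 1*1 for advert 2.
unfiltered_join_rows = conn.execute("""
    SELECT COUNT(*)
    FROM adverts a
    INNER JOIN advert_category ac ON ac.advert_id = a.id
    INNER JOIN advert_location al ON al.advert_id = a.id
""").fetchone()[0]

# The EXISTS form only asks whether a matching row exists, so each advert
# appears at most once regardless of how many link rows it has.
exists_ids = [row[0] for row in conn.execute("""
    SELECT a.id
    FROM adverts a
    WHERE EXISTS (SELECT 1 FROM advert_location al
                  WHERE al.advert_id = a.id AND al.location_id = ?)
      AND EXISTS (SELECT 1 FROM advert_category ac
                  WHERE ac.advert_id = a.id AND ac.category_id = ?)
    ORDER BY a.updated_at DESC
""", (101, 10))]

print(unfiltered_join_rows, exists_ids)
```

Whether the EXISTS form is actually faster on your data depends on the indexes and the optimizer, so check it with EXPLAIN.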
You can write the 1st join in a Derived Table including a WHERE-condition and then do the 2nd join (but a decent optimizer might resolve the Derived Table again and do what it thinks is best based on statistics):
select adverts.*
from
(
select `adverts`.*
from `adverts`
inner join `advert_category`
on `advert_category`.`advert_id` =`adverts`.`id`
where `advert_category`.`category_id` = ?
) as adverts
inner join `advert_location`
on `adverts`.`id` = `advert_location`.`advert_id`
where `advert_location`.`location_id` = ?
order by `updated_at` desc
MySQL will reorder inner joins for you during optimization, regardless of how you wrote them in your query. Inner join is the same in either direction (in algebra this is called commutative), so this is safe to do.
You can see the result of join reordering if you use EXPLAIN on your query.
If you don't like the order MySQL chose for your joins, you can override it with this kind of syntax:
from `adverts`
straight_join `advert_category` ...
https://dev.mysql.com/doc/refman/5.7/en/join.html says:
STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer processes the tables in a suboptimal order.
Once the optimizer has decided on the join order, it always does one join at a time, in that order. This is called the nested-loop join method.
There isn't really any way to "do the join then do the where clause". Conditions are combined together when looking up rows for joined tables. But this is a good thing, because you can then create a compound index that helps match rows based on both join conditions and where conditions.
PS: When asking a query optimization question, you should include the EXPLAIN output, and also run SHOW CREATE TABLE <tablename> for each table and include the result. Then we don't have to guess at the columns and indexes in your tables.
The query below is copied from the book High Performance MySQL:
select film.film_id from sakila.film
left outer join sakila.film_actor using(film_id)
where film_actor.film_id is null;
I could not understand what it is doing.
Does the where clause filter film_actor before joining? If so, how does the join work (film_id is already null, so how does it join with film using film_id)?
It's a standard SQL pattern for finding parent rows that have no children, in this case films that don't have an actor.
It works because a missed left join has all nulls in the joined row, and the where clause is evaluated after the join is made. Requiring that a column which can't actually be null in the joined table is null returns only the missed joins.
Note also that you don't need distinct, because there is only ever one such row returned for missed joins.
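To make the two steps visible, here is a minimal sketch that first shows the raw left join and then the filtered version. It assumes SQLite via Python's sqlite3 in place of MySQL, uses ON instead of USING (equivalent here), and the film rows are invented.

```python
import sqlite3

# Hypothetical miniature of the sakila tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
    INSERT INTO film VALUES (1, 'two actors'), (2, 'no actors'), (3, 'one actor');
    INSERT INTO film_actor VALUES (100, 1), (101, 1), (100, 3);
""")

# Step 1: the left join on its own. Film 2 has no match, so its
# film_actor columns come back as NULL (None).
joined = conn.execute("""
    SELECT film.film_id, film_actor.actor_id
    FROM film
    LEFT OUTER JOIN film_actor ON film_actor.film_id = film.film_id
    ORDER BY film.film_id, film_actor.actor_id
""").fetchall()

# Step 2: the WHERE clause runs on the joined rows and keeps only the
# NULL-extended ones, i.e. films without any actor.
films_without_actors = conn.execute("""
    SELECT film.film_id
    FROM film
    LEFT OUTER JOIN film_actor ON film_actor.film_id = film.film_id
    WHERE film_actor.film_id IS NULL
""").fetchall()

print(joined)
print(films_without_actors)
```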
Is it better to put an SQL condition in the JOIN clause or in the WHERE clause? Is the SQL engine optimized for either way? Does it depend on the engine?
Is it always possible to replace a condition in the JOIN clause with a condition in the WHERE clause?
Here is an example to illustrate what I mean by condition:
SELECT role_.name
FROM user_role
INNER JOIN user ON user_role.user_id_ = user.id AND
user_role.user_id_ = #user_id
INNER JOIN role ON user_role.role_id = role_.id
vs.
SELECT role_.name
FROM user_role
INNER JOIN user ON user_role.user_id_ = user.id
INNER JOIN role ON user_role.role_id = role_.id
WHERE user.id = #user_id
A condition in the JOIN clause and the same condition in the WHERE clause are equivalent if INNER JOIN is used.
With any other kind of JOIN, such as LEFT or RIGHT, there is an extra step after rows are matched on the ON condition: the outer rows, i.e. the non-matching rows, are added back.
The WHERE condition is applied afterwards and simply filters out all non-matching rows.
See this thread
Having the non-key condition in the join clause is not only OK, it is preferable, especially in this query, because you can avoid some joins to other tables that are further joined to the table to which the ON clause belongs.
The WHERE clause is evaluated after all joins have been made: it's a filter on the result set. But by putting the condition in the join clause, you can stop rows from being joined at the time they're being joined.
In your case it makes no difference, because you don't have any following tables, but I use this technique often to gain performance in my queries.
By looking at the plan generated for both the queries we can see that having the condition in the INNER JOIN or WHERE clause generates the same plan.
But the problem with putting the condition in the WHERE clause is that you can't handle OUTER JOINs that way: a WHERE condition on the outer-joined table discards the NULL-extended rows.
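The difference with outer joins can be sketched like this. It assumes SQLite via Python's sqlite3 as a stand-in for MySQL; the tables app_user and user_role and their rows are hypothetical, loosely based on the question.

```python
import sqlite3

# Hypothetical data: ann is an admin, bob is only a guest.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_user (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_role (user_id INTEGER, role_name TEXT);
    INSERT INTO app_user VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO user_role VALUES (1, 'admin'), (2, 'guest');
""")

# Condition in ON: every user is kept; users without an admin role
# get NULL (None) for role_name.
condition_in_on = conn.execute("""
    SELECT u.name, ur.role_name
    FROM app_user u
    LEFT JOIN user_role ur ON ur.user_id = u.id AND ur.role_name = 'admin'
    ORDER BY u.id
""").fetchall()

# Condition in WHERE: it runs after the outer rows were added, so it
# removes bob and the LEFT JOIN behaves like an INNER JOIN.
condition_in_where = conn.execute("""
    SELECT u.name, ur.role_name
    FROM app_user u
    LEFT JOIN user_role ur ON ur.user_id = u.id
    WHERE ur.role_name = 'admin'
    ORDER BY u.id
""").fetchall()

print(condition_in_on)
print(condition_in_where)
```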
I have something similar to the following:
SELECT c.id
FROM contact AS c
WHERE c.id IN (SELECT s.contact_id
FROM sub_table AS s
LEFT JOIN contact_sub AS c2 ON (s.id = c2.sub_field)
WHERE c2.phone LIKE '535%')
ORDER BY c.name
The problem is that the query takes a very, very long time (>2 minutes), but if I take the subquery, run it separately, implode the ids and insert them into the main query, it runs in well under 1 second, including the data retrieval and implosion.
I have checked the explains on both methods and keys are being used appropriately and the same ways. The subquery doesn't return more than 200 IDs.
What could be causing the subquery method to take so much longer?
BTW, I know the query above can be written with joins, but the query I have can't be--this is just a simplified version.
Using MySQL 5.0.22.
Sounds suspiciously like MySQL bug #32665: Query with dependent subquery is too slow.
What happens if you try it like this?
SELECT c.id
FROM contact AS c
INNER JOIN (SELECT s.contact_id
FROM sub_table AS s
LEFT JOIN contact_sub AS c2 ON (s.id = c2.sub_field)
WHERE c2.phone LIKE '535%') subq ON subq.contact_id=c.id
ORDER BY c.name
This assumes that s.contact_id is unique in the subquery's result. You can add DISTINCT to the subquery if it is not.
I always use uncorrelated subqueries this way rather than using the IN operator in the where clause.
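A quick sketch of why the uniqueness caveat matters when replacing IN with a join to a derived table. It assumes SQLite via Python's sqlite3 rather than MySQL 5.0, and the rows are invented; it only illustrates the result sets, not the timing.

```python
import sqlite3

# Hypothetical data: contact 1 has two sub_table rows with a 535 phone.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sub_table (id INTEGER PRIMARY KEY, contact_id INTEGER);
    CREATE TABLE contact_sub (sub_field INTEGER, phone TEXT);
    INSERT INTO contact VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO sub_table VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO contact_sub VALUES (1, '5351111'), (2, '5352222'), (3, '6140000');
""")

SUBQUERY = """
    SELECT {distinct} s.contact_id
    FROM sub_table AS s
    LEFT JOIN contact_sub AS c2 ON (s.id = c2.sub_field)
    WHERE c2.phone LIKE '535%'
"""


def ids(sql):
    """Return the contact ids produced by a query, in query order."""
    return [row[0] for row in conn.execute(sql)]


# Original form: IN never duplicates rows of the outer table.
with_in = ids("SELECT c.id FROM contact AS c WHERE c.id IN ("
              + SUBQUERY.format(distinct="") + ") ORDER BY c.name")

# Join to the derived table: one output row per matching subquery row.
with_join = ids("SELECT c.id FROM contact AS c INNER JOIN ("
                + SUBQUERY.format(distinct="")
                + ") subq ON subq.contact_id = c.id ORDER BY c.name")

# Adding DISTINCT to the subquery restores the IN semantics.
with_distinct_join = ids("SELECT c.id FROM contact AS c INNER JOIN ("
                         + SUBQUERY.format(distinct="DISTINCT")
                         + ") subq ON subq.contact_id = c.id ORDER BY c.name")

print(with_in, with_join, with_distinct_join)
```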
Have you checked the Execution Plan for the query? This will usually show you the problem.
Can't you do another join instead of a subquery?
SELECT c.id
FROM contact AS c
JOIN sub_table AS s on c.id = s.contact_id
LEFT JOIN contact_sub AS cs ON (s.id = cs.sub_field)
WHERE cs.phone LIKE '535%'
ORDER BY c.name
Since the subquery is referring to a field sub_field in the outer select, it has to be run once for each row in the outer table - the results for the inner query will change with each row in the outer table.
It's a correlated subquery. It runs once for each row in the outer select. (I think. You have two tables with the same correlation name, I'm assuming that's a typo. That you say it can't be rewritten as a join means it's correlated. )
Ok, I'm going to give you something to try. You say that the subquery is not correlated, and that you still can't join on it. And that if you take the output of the subquery, and lexically substitute that for the subquery, the main query runs much faster.
So try this: make the subquery into a view: create view foo followed by the text of the subquery. Then rewrite the main query to get rid of the "IN" clause and instead join to the view.
How's the timing on that?