INNER JOIN condition in WHERE clause or ON clause? - mysql

I mistyped a query today, but it still worked and gave the intended result. I meant to run this query:
SELECT e.id FROM employees e JOIN users u ON u.email=e.email WHERE u.id='139840'
but I accidentally ran this query
SELECT e.id FROM employees e JOIN users u ON u.email=e.email AND u.id='139840'
(note the AND instead of WHERE in the last clause)
and both returned the correct employee id from the user id.
What is the difference between these 2 queries? Does the second form only join members of the 2 tables meeting the criteria, whereas the first one would join the entire table, and then run the query? Is one more or less efficient than the other? Is it something else I am missing?
Thanks!

For inner joins like this they are logically equivalent. However, you can run in to situations where a condition in the join clause means something different than a condition in the where clause.
As a simple illustration, imagine you do a left join like so;
select x.id
from x
left join y
on x.id = y.id
;
Here we're taking all the rows from x, regardless of whether there is a matching id in y. Now let's say our join condition grows - we're not just looking for matches in y based on the id but also on id_type.
select x.id
from x
left join y
on x.id = y.id
and y.id_type = 'some type'
;
Again this gives all the rows in x regardless of whether there is a matching (id, id_type) in y.
This is very different, though:
select x.id
from x
left join y
on x.id = y.id
where y.id_type = 'some type'
;
In this situation, we're picking all the rows of x and trying to match to rows from y. Now for rows for which there is no match in y, y.id_type will be null. Because of that, y.id_type = 'some type' isn't satisfied, so those rows where there is no match are discarded, which effectively turned this in to an inner join.
Long story short: for inner joins it doesn't matter where the conditions go but for outer joins it can.

In the case of an INNER JOIN, the two queries are semantically the same, meaning they are guaranteed to have the same results. If you were using an OUTER join, the meaning of the two queries could be very different, with different results.
Performance-wise, I would expect that these two queries would result in the same execution plan. However, the query engine might surprise you. The only way to know is to view the execution plans for the two queries.

The optimizer will treat them the same. You can do an EXPLAIN to prove it to yourself.
Therefore, write the one that is clearer.
SELECT e.id
FROM employees e JOIN users u ON u.email=e.email
WHERE u.id='139840'

If it were an outer join instead of inner, you'd get unintended results, but when using an inner join it makes no real difference whether you use additional join criteria instead of a WHERE clause.
Performance-wise they are most likely identical, but can't be certain.

I brought this up with my colleagues on our team at work. This response is a bit SQL Server centered and not MySQL. However, the optimizer should have similarities in operation between SQL and MySQL..
Some thoughts:
Essentially, if you have to add a WHERE, there are additional table scans done to verify equality for each condition (This goes up by orders of magnitude with an AND or dataset, an OR, the decision is cast at the first true condition) – if you have one id pointer in the example given it is extremely quick conversely, if you have to find all of the records that belong to a company or department it becomes more obscure as you may have multiples of records. If you can apply the equals condition, it is far more effective when working with an AuditLog or EventLog table that has zillions of rows. One would not really see the large benefits of this on small tables (at around 200,000 rows or so).
From: Allesandro Alpi
http://suxstellino.wordpress.com/2013/01/07/sql-server-logical-query-processing-summary/
From: Itzik Ben-Gan
http://tsql.solidq.com/books/insidetsql2008/Logical%20Query%20Processing%20Poster.pdf

Related

MySQL performance - cross join vs left join

I am wondering how MySQL (or its underlying engine) processes the queries.
There are two set queries below (one uses left join and the other one uses cross join), which eventually will give the same result.
My question is, how come the processing time of the two sets of queries are similar?
What I expected is that the first set query will run quicker because the computer is dealing with left join so the size of the "table" won't be expanding, while the second set of queries makes the size of the "table" (what I assume is that the computer needs to get the result of the cross-join from multiple tables before it can go ahead and do the where clause) relatively larger.
select s.*, a.score as score_01, b.score as score_02
from student s
left join (select \* from sc where cid = '01') a using (sid)
left join (select \* from sc where cid = '02') b using (sid)
where a.score > b.score;
select s.*, a.score as score_01, b.score as score_02
from student s
,(select * from sc where cid = '01') a
,(select * from sc where cid = '02') b
where a.score > b.score and a.sid = b.sid and s.sid = a.sid;
I tried both sets of queries and expected the processing time for the first set query will be shorter, but it is not the case.
Add this to sc:
INDEX(sid, cid, score)
Better yet, if you have a useless id on side replace it with
PRIMARY KEY(sid, cid)`
(Assuming that pair is Unique.)
With either of those fixes, I expect both of your queries run at similar speed, and faster than currently.
For further discussion, please provide SHOW CREATE TABLE.
Addressing some of the Comments
MySQL ignores the keywords INNER, OUTER, and CROSS. So, it up to the WHERE to figure whether it is "inner" or "outer".
MySQL throws the ON and WHERE conditions together (except when it matters for LEFT), then decides what is used for filtering (WHERE) so it may be able to do that first. Then other conditions (which belonged in ON) help it get to the 'next' table.
So... Please use ON to say how the tables are related; use WHERE for filtering. (And don't use the old comma-join.)
That is, MySQL will [usually] look at one table at a time, doing a "Nested Loop Join" (NLJ) to get to the next.
There are many possible ways to evaluate a JOIN; MySQL ponders which one might be best, then uses that.
The order of non-LEFT JOINs does not matter, nor does the order of expressions AND'd together in WHERE.
In some situations, a HAVING expression can (and is) moved to the WHERE clause.
Although FROM comes before WHERE, the two get somewhat tangled up together. But, in general, the clauses are required to be in a certain order, and that order is logically the order that things have to happen in.
It is up to the Optimizer to combine steps. For example
WHERE a = 1
ORDER BY b
and the table has INDEX(a,b) -- The index will be used to do both, essentially at the same time. Ditto for
SELECT a, MAX(b)
...
GROUP BY a
ORDER BY a
can hop through the BTree index on (a,b) and deliver the results without an extra sort pass for either the GROUP BY or ORDER BY.
SELECT x is executed after WHERE y = 'abc' -- Well, in some sense it is. But if you have INDEX(y,x), the Optimizer is smart enough to grab the x values while it is performing the WHERE.
When a WHERE references more than one table of a JOIN, the Optimizer has a quandary. Which table should it start its NLJ with? It has some statistics to help make the decision, but it does not always get it right. It will usually
filter on one of the tables
NLJ to get to the next table, meanwhile throwing in any WHERE clauses for that table in with the ON clause.
Repeat for other tables.
When there is both a WHERE and an ORDER BY, the Optimizer will usually filter filter, then sort. But sometimes (not always correctly) it will decide to use an index for the ORDER BY (thereby eliminating the sort) and filter as it reads the table. LIMIT, which is logically done last further muddies the decision.
MySQL does not have FULL OUTER JOIN. It can be simulated with two JOIN and a UNION. (It is only very rarely needed.)

SQLJoin Results

Select * from a join b on a.id=b.id and a.vol<5
Select * from a join b on a.id=b.id where a.vol<5
Do they produce the same results?
If they don't produce the same results, a has 1000 rows, b jas 100 rows, how many rows will each produce?
I would say yes, it does.
A "Join" implies an "Inner Join" so it doesn't matter if you have an "and" in the join or a "Where" after the join.
It would be different if it was an "outer Join" Specifying a "Where" with an outer joined table will turn the join into an "Inner Join" or simply "Join"
Hope that made sense
For an INNER JOIN, like the simple query you have here, they are the same.
For an OUTER JOIN, they might not be the same.
For example, take these two queries:
select * from orders o left join orderlines ol on ol.order_id = o.id where o.id=12345
and
select * from orders o left join orderlines ol on ol.order_id = o.id and o.id=12345
The first query will give you data on order #12345 and it's lines, if any. The second query will give you data from all orders, but only order #12345 will have any item data.
This also illustrates how the two options have different semantic meanings. Even if they produce the same results, the two queries from your question have different semantic meanings, which might be important as an application grows over time.
I think you satisfied from answers but I want to mention about another side of this usage.
This two method generates the same result but compiler uses the different techniques to get the result.
Of course, different technique generates different results. But when ? It is very hard to illustrate the stiation but I will try to explain.
Think that we have two table but first table has isDeleted column for records. This application does not deletes the rows and get just updates the IsDeleted column and ignored that records.
In first case if you do not filter records in ON operator and you filtered it in where criteria. These records will be included in other joins and you will calculate the result wrong. Think that you joined this table Amounts table. The result is wrong because deleted records included and then you filtered them in where criteria.
This difference can lead to very big mistakes specially in queries which has many joins.
I wish I succeded the explanation. I m not good at. :)

MySQL creating temp table then join faster than left join

I have a LEFT JOIN that is very expensive:
    select X.c1, COUNT(Y.c3) from X LEFT JOIN Y on X.c1=Y.c2 group by X.c1;
After several minutes (20+), it still does not finish. But I want all rows in X. So I really do need a LEFT JOIN at some point.
It appears that I can hack my way around this to return the result set I am looking for by using a temp table in less than two minutes. I first trim down table Y so that it only contains rows in the join.
CREATE TEMPORARY TABLE IF NOT EXISTS table2 AS 
(select X.c1 as t, COUNT(Y.c2) as c from X
INNER JOIN Y where X.c1=Y.c2 group by X.c1);
select X.c1, table2.c from X 
LEFT JOIN table2 on X.c1 = table2.t; 
This finishes in under two minutes.
My questions are:
1) Are they equivalent?
2) Why is the second so much faster (why doesn't MySQL do this kind of optimization), meaning, do I need to do these kinds mysql?
EDIT: additional info: C1, C2 are BIGINTS. C1 is unique but there can be many C2s that all point to the same C1. As far as I know, I have not indexed any tables. X.C1 is an _id column that Y.c2 refers to.
Try indexing X.c1 and Y.c2 and running your original query.
It's hard to tell why your 1st query runs slower without the indexes without comparing the query plans from both queries (you can get the query plan by running your queries with explain at the beginning) but I suspect it's because the 2nd table contains many rows that do not have a corresponding row in the 1st table.
If x.c1 is unique, then I would suggest writing the query as:
select X.c1,
(select COUNT(Y.c3)
from Y
where X.c1 = Y.c2
)
from X;
For this query, you want an index on Y(c2, c3).
The reason why a left join might take longer is if many rows do not match. In that case, the group by is aggregating by many more rows than it really needs to. And no, MySQL does not attempt this type of optimization.

Where is better to put 'on' conditions in multiple joins? (mysql)

I have multiple joins including left joins in mysql. There are two ways to do that.
I can put "ON" conditions right after each join:
select * from A join B ON(A.bid=B.ID) join C ON(B.cid=C.ID) join D ON(c.did=D.ID)
I can put them all in one "ON" clause:
select * from A join B join C join D ON(A.bid=B.ID AND B.cid=C.ID AND c.did=D.ID)
Which way is better?
Is it different if I need Left join or Right join in my query?
For simple uses MySQL will almost inevitably execute them in the same manner, so it is a manner of preference and readability (which is a great subject of debate).
However with more complex queries, particularly aggregate queries with OUTER JOINs that have the potential to become disk and io bound - there may be performance and unseen implications in not using a WHERE clause with OUTER JOIN queries.
The difference between a query that runs for 8 minutes, or .8 seconds may ultimately depend on the WHERE clause, particularly as it relates to indexes (How MySQL uses Indexes): The WHERE clause is a core part of providing the query optimizer the information it needs to do it's job and tell the engine how to execute the query in the most efficient way.
From How MySQL Optimizes Queries using WHERE:
"This section discusses optimizations that can be made for processing
WHERE clauses...The best join combination for joining the tables is
found by trying all possibilities. If all columns in ORDER BY and
GROUP BY clauses come from the same table, that table is preferred
first when joining."
For each table in a join, a simpler WHERE is constructed to get a fast
WHERE evaluation for the table and also to skip rows as soon as
possible
Some examples:
Full table scans (type = ALL) with NO Using where in EXTRA
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id AND cr.role = cr2.role AND cr.util = 1000
[Err] Out of memory
Uses where to optimize results, with index (Using where,Using index):
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id
WHERE cr.role = cr2.role
AND cr.util = 1000
515661 rows in set (0.124s)
****Combination of ON/WHERE - Same result - Same plan in EXPLAIN*******
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id
AND cr.role = cr2.role
WHERE cr.util = 1000
515661 rows in set (0.121s)
MySQL is typically smart enough to figure out simple queries like the above and will execute them similarly but in certain cases it will not.
Outer Join Query Performance:
As both LEFT JOIN and RIGHT JOIN are OUTER JOINS (Great in depth review here) the issue of the Cartesian product arises, the avoidance of Table Scans must be avoided, so that as many rows as possible not needed for the query are eliminated as fast as possible.
WHERE, Indexes and the query optimizer used together may completely eliminate the problems posed by cartesian products when used carefully with aggregate functions like AVERAGE, GROUP BY, SUM, DISTINCT etc. orders of magnitude of decrease in run time is achieved with proper indexing by the user and utilization of the WHERE clause.
Finally
Again, for the majority of queries, the query optimizer will execute these in the same manner - making it a manner of preference but when query optimization becomes important, WHERE is a very important tool. I have seen some performance increase in certain cases with INNER JOIN by specifying an indexed col as an additional ON..AND ON clause but I could not tell you why.
Put the ON clause with the JOIN it applies to.
The reasons are:
readability: others can easily see how the tables are joined
performance: if you leave the conditions later in the query, you'll get way more joins happening than need to - it's like putting the conditions in the where clause
convention: by following normal style, your code will be more portable and less likely to encounter problems that may occur with unusual syntax - do what works

MySQL max-value – when and how often

If I have a query like
SELECT (MAX(b.A)/a.A) AS c
FROM a
INNER JOIN b
ON a.b_id = b.id
GROUP by a.id
does MySQL evaluate the value for "MAX(b.A)" for every row or only once?
It's just of interest to me if there is room for performance improvement or not.
Thanks!
UPDATE
OK let's move on to a real world example: I want to calculate the proportional value of a users likes compared to max-user-likes.
The query to only read the max value of users.likes (which is indexed) takes 0.0003
SELECT MAX(likes)
FROM users
So I now know the value of max-user-likes, let's say it's 10000 so I could query like this which takes 0.0007s:
SELECT (users.likes/10000) AS weight
FROM posts
INNER JOIN users ON posts.author_id = users.id
So one would expect to have both queries together to be something like 0.0003 + 0.0007s, but it takes 0.3s:
SELECT (users.likes/(SELECT MAX(likes) FROM users)) AS weight
FROM posts
INNER JOIN users ON posts.author_id = users.id
So something seems still wrong with my database - any suggestions?
Since you have no GROUP BY clause, the result will only have one row and you can't know from which row the value of a.A will be. The value of MAX(b.A) will be only evaluated once.
When you have a GROUP BY clause, MAX(b.A) will be evaluated for every group.
In general, the expression within an aggregation function is evaluated and checked for NULL for each row. There certainly can be a optimization for MIN and MAX in case of walking through an index, but I doubt that.
BTW, you can easily check this, when you execute MAX(id) on a large table. You will see that the execution time is the same as for COUNT(id) (and might be much more than COUNT(*) depending on the engine).