MySQL Left Join and Right Join Optimization - mysql

I've been looking up some documentation about this topic here: https://dev.mysql.com/doc/refman/5.7/en/left-join-optimization.html
But I don't understand the following example:
The join optimizer calculates the order in which to join tables. The table read order forced by LEFT JOIN or STRAIGHT_JOIN helps the join optimizer do its work much more quickly, because there are fewer table permutations to check. This means that if you execute a query of the following type, MySQL does a full scan on b because the LEFT JOIN forces it to be read before d:
SELECT *
FROM a JOIN b LEFT JOIN c ON (c.key=a.key)
LEFT JOIN d ON (d.key=a.key)
WHERE b.key=d.key;
The fix in this case is reverse the order in which a and b are listed in the FROM clause:
SELECT *
FROM b JOIN a LEFT JOIN c ON (c.key=a.key)
LEFT JOIN d ON (d.key=a.key)
WHERE b.key=d.key;
Why does the order make a difference for optimization? Are JOIN and LEFT JOIN executed in some particular order?

I suspect the first quote is not quite correct. I have seen LEFT JOIN turned into JOIN and then the tables touched in the 'wrong' order.
Anyway, don't worry about the work the optimizer needs to do. In thousands of slow JOINs, I have identified only one case where the cost of picking the order was important. And it was a case of multiple joins to a single table; yet another drawback of EAV schema. Anyway, there is a simple setting to avoid that problem.
LEFT/RIGHT/plain JOINs are semantically done left-to-right (regardless of the order the optimizer chooses to touch the tables).
If you are concerned about the ordering, you can add parentheses. For example:
FROM (a JOIN b ON ...) JOIN (c JOIN d ON ...) ON ...
If you are using "commajoin" (FROM a,b...), don't. Its precedence relative to JOIN changed long ago; the workaround was to add parentheses so that the same SQL would work in versions before and after the change.
Don't use LEFT unless you need it to get NULLs for missing 'right' rows. It just confuses readers into thinking that you expect NULLs.

This example is wrong in many ways, and it is not clear to me what it is trying to convey. Apologies for that. I will file a bug with the documentation team.
Some clarifications:
For the given query, the last LEFT JOIN will be converted to an inner join. This is because the WHERE clause, WHERE b.key=d.key, implies that d.key cannot be NULL. Hence, any extra rows produced by LEFT JOIN compared to INNER JOIN would be filtered out by the WHERE clause. (The principles of this transformation are described in the paragraph following the given example.)
The ON clause of the first LEFT JOIN, ON (c.key=a.key), makes table c dependent on table a, but not on table b. Hence, the only requirement wrt join order is that table a is processed before table c. The order in which tables a and b are listed in the query will not change that.
Tables b and d may be processed in any order, both wrt each other and wrt the other tables of the query.
This paragraph seems to recommend LEFT JOIN as a mechanism to reduce number of "table permutations to check". This is not meaningful since changing from INNER JOIN to LEFT JOIN may change the semantics of the query. For this purpose STRAIGHT_JOIN should be used instead.
For most join queries, execution time by far exceeds optimization time. Reducing the number of "table permutations to check" may cause potentially more efficient permutations to not be explored. Hence, LEFT JOIN should not be used unless it is required to get the wanted semantics.
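The point about the WHERE clause turning the last LEFT JOIN into an inner join can be checked on toy data. The following is only a sketch: it uses Python's sqlite3 module rather than MySQL, the table contents are made up, and the column is named k instead of key. It shows the semantic equivalence only, not what the MySQL optimizer does internally.

```python
import sqlite3

# Tiny in-memory stand-ins for tables a, b and d (contents are invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (k INTEGER);
    CREATE TABLE b (k INTEGER);
    CREATE TABLE d (k INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (2);
    INSERT INTO d VALUES (1);
""")

QUERY = """
    SELECT a.k, b.k, d.k
    FROM a JOIN b
    {join} d ON d.k = a.k
    {where}
    ORDER BY a.k, b.k
"""

def run(join, where=""):
    """Run the query with the given join type and optional WHERE clause."""
    return conn.execute(QUERY.format(join=join, where=where)).fetchall()

# With the null-rejecting WHERE clause, both join types give the same rows.
left_filtered = run("LEFT JOIN", "WHERE b.k = d.k")
inner_filtered = run("INNER JOIN", "WHERE b.k = d.k")

# Without it, LEFT JOIN keeps the NULL-extended rows that INNER JOIN drops.
left_unfiltered = run("LEFT JOIN")
inner_unfiltered = run("INNER JOIN")

print(left_filtered == inner_filtered)              # True
print(len(left_unfiltered), len(inner_unfiltered))  # 6 2
```

The rows that only the LEFT JOIN produces have d.k set to NULL, and b.k = d.k can never be true for them, which is why the filter removes exactly those rows.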

Related

SQL a left join b -> b right join a: difference in order

I sometimes read that it is equivalent to write "a left join b" and "b right join a". I thought I understood this, but I read in a book that this is not the case. It says that the result tuples are the same but they might be in a different order. I could not find an explanation for that. I also tried to reproduce such a difference in order on my local MySQL Server, but I could not.
The only difference seems to be the order of attributes.
Can anyone explain to me when or why a difference in tuple order occurs?
This is more complicated than it sounds. First:
select *
from a left join b on . . . ;
and:
select *
from b right join a on . . . ;
Are likely to produce result sets that differ in two ways:
The columns are in a different order.
The rows may be in a different order.
Neither of these affects the equivalence of the result set from a set-theory perspective. But they could have practical effects. In general, if you care about ordering, then respectively:
List the columns explicitly.
Include an order by.
The more important point is that left join and right join are not interchangeable when there are multiple joins, because joins always associate from left to right regardless of type.
In the following, I'm leaving out the on clauses. Consider:
from a left join b left join c
You would think that the equivalent with right join is:
from c right join b right join a
But, the joins are grouped so the first is interpreted as:
from (a left join b) left join c
The second is:
from (c right join b) right join a
But the equivalent with right joins is:
from c right join (b right join a)
In both cases, every row from a will be in the result set. But the results can differ depending on the overlap among the three tables.
I sometimes read that the order in which the tuples are returned is insignificant. The order in which a real life database returns your records, may change because the engine decides it has found a better path, using an index or not, because a block of data has been moved, ... There are big differences between the relational theory and the database of your choice. I don't mean MySQL with that.

LEFT JOIN or INNER JOIN?

I have the following tables. All fields are NOT NULL.
tb_post
id
account_id
created_at
content
tb_account
id
name
I want to select the latest post along with the name. Should I use INNER JOIN or LEFT JOIN? From my understanding both produce the same results. But which is more correct or faster?
SELECT p.content, a.name
FROM tb_post AS p
[INNER or LEFT] JOIN tb_account AS a
ON a.id = p.account_id
ORDER BY p.created_at DESC
LIMIT 50
A LEFT JOIN is absolutely not faster than an INNER JOIN. In fact, it's slower; by definition, an outer join (LEFT JOIN or RIGHT JOIN) has to do all the work of an INNER JOIN plus the extra work of null-extending the results. It would also be expected to return more rows, further increasing the total execution time simply due to the larger size of the result set.
(And even if a LEFT JOIN were faster in specific situations due to some difficult-to-imagine confluence of factors, it is not functionally equivalent to an INNER JOIN, so you cannot simply go replacing all instances of one with the other!)
Better go for INNER JOIN.
In my view the correct one is INNER JOIN, because it returns a result set that includes only matched rows, whereas LEFT JOIN returns all rows from the left table. In this case I think INNER JOIN returns only the data that actually needs to be processed.
You have to ask yourself two questions.
1) Is there any chance that at some point in your application lifetime, there will be posts with an empty or invalid account_id?
If not, it doesn't matter.
If yes...
2) Would it be desirable to include posts without an associated account in the result of the query? If yes, use LEFT JOIN, if no, use INNER JOIN.
I personally don't think speed is very relevant: the difference between them is what they do.
They happen to give the same result in your case, but that does not mean they can be interchanged, because choosing the one or the other still tells the other guy that reads your code something.
I tend to think like this:
INNER JOIN - the two tables are basically ONE set, we just need to combine two sources.
LEFT JOIN - the left table is the source, and optionally we may have additional information (in the right table).
So if I would read your code and see a LEFT JOIN, that's the impression you give me about your data model.
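To make the two questions above concrete, here is a small sketch using Python's sqlite3 module (not MySQL). The table layout follows the question; the rows, including the post with a non-existent account_id of 99, are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb_account (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE tb_post (
        id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL,
        created_at TEXT NOT NULL,
        content TEXT NOT NULL
    );
    INSERT INTO tb_account VALUES (1, 'alice');
    INSERT INTO tb_post VALUES (1, 1, '2024-01-01', 'first');
    -- Hypothetical orphan: account 99 does not exist.
    INSERT INTO tb_post VALUES (2, 99, '2024-01-02', 'orphan');
""")

QUERY = """
    SELECT p.content, a.name
    FROM tb_post AS p
    {join} tb_account AS a ON a.id = p.account_id
    ORDER BY p.created_at DESC
    LIMIT 50
"""

inner_rows = conn.execute(QUERY.format(join="INNER JOIN")).fetchall()
left_rows = conn.execute(QUERY.format(join="LEFT JOIN")).fetchall()

print(inner_rows)  # [('first', 'alice')]
print(left_rows)   # [('orphan', None), ('first', 'alice')]
```

As long as every post has a valid account, the two result sets are identical; they only diverge once an orphan row like the one above exists.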

Difference between FROM and JOIN tables

I'm working through the JOIN tutorial on SQL zoo.
Let's say I'm about to execute the code below:
SELECT a.stadium, COUNT(g.matchid)
FROM game a
JOIN goal g
ON g.matchid = a.id
GROUP BY a.stadium
As it happens, it produces the same output as the code below:
SELECT a.stadium, COUNT(g.matchid)
FROM goal g
JOIN game a
ON g.matchid = a.id
GROUP BY a.stadium
So then, when does it matter which table you assign at FROM and which one you assign at JOIN?
When you are using an INNER JOIN like you are here, the order doesn't matter. That is because you are connecting two tables on a common index, so the order in which you use them is up to you. You should pick an order that is most logical to you, and easiest to read. A habit of mine is to put the table I'm selecting from first. In your case, you're selecting information about a stadium, which comes from the game table, so my preference would be to put that first.
In other joins, however, such as LEFT OUTER JOIN and RIGHT OUTER JOIN the order will matter. That is because these joins will select all rows from one table. Consider for example I have a table for Students and a table for Projects. They can exist independently, some students may have an associated project, but not all will.
If I want to get all students and project information while still seeing students without projects, I need a LEFT JOIN:
SELECT s.name, p.project
FROM student s
LEFT JOIN project p ON p.student_id = s.id;
Note here that the LEFT JOIN refers to the table in the FROM clause, so ALL students are selected. This also means that p.project will be null for some rows. Order matters here.
If I took the same concept with a RIGHT JOIN, it will select all rows from the table in the join clause. So if I changed the query to this:
SELECT s.name, p.project
FROM student s
RIGHT JOIN project p ON p.student_id = s.id;
This will return all rows from the project table, regardless of whether or not it has a match for students. This means that in some rows, s.name will be null. Similar to the first example, because I've made project the outer joined table, p.project will never be null (assuming it isn't in the original table). In the first example, s.name should never be null.
In the case of outer joins, order will matter. Thankfully, you can think intuitively with LEFT and RIGHT joins. A left join will return all rows in the table to the left of that statement, while a right join returns all rows from the right of that statement. Take this as a rule of thumb, but be careful. You might want to develop a pattern to be consistent with yourself, as I mentioned earlier, so these queries are easier for you to understand later on.
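The student/project example can be tried out with a sketch like the one below. It uses Python's sqlite3 module with invented rows and a hypothetical id column on project. Because RIGHT JOIN is only available in newer SQLite versions (3.39 and later), the second query is written as the equivalent LEFT JOIN with the tables swapped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE project (id INTEGER PRIMARY KEY, project TEXT, student_id INTEGER);
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO project VALUES (10, 'Robot', 1), (11, 'Unassigned', NULL);
""")

# Every student is kept; Bob has no project, so p.project is NULL for him.
all_students = conn.execute("""
    SELECT s.name, p.project
    FROM student s
    LEFT JOIN project p ON p.student_id = s.id
    ORDER BY s.id
""").fetchall()

# Same meaning as "student s RIGHT JOIN project p": every project is kept.
all_projects = conn.execute("""
    SELECT s.name, p.project
    FROM project p
    LEFT JOIN student s ON p.student_id = s.id
    ORDER BY p.id
""").fetchall()

print(all_students)  # [('Ann', 'Robot'), ('Bob', None)]
print(all_projects)  # [('Ann', 'Robot'), (None, 'Unassigned')]
```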
When you only JOIN 2 tables, usually the order does not matter: MySQL scans the tables in the optimal order.
When you scan more than 2 tables, the order could matter:
SELECT ...
FROM a
JOIN b ON ...
JOIN c ON ...
Also, MySQL tries to scan the tables in the order it estimates to be fastest. But if a join is slow, it is possible that MySQL is scanning them in a non-optimal order. You can verify this with EXPLAIN. In this case, you can force the join order by adding the STRAIGHT_JOIN keyword.
The order doesn't always matter; I usually just order the tables in a way that makes sense to someone reading the query.
Sometimes the order does matter. Try it with LEFT JOIN and RIGHT JOIN.
In this instance you are using an INNER JOIN, if you're expecting a match on a common ID or foreign key, it probably doesn't matter too much.
You would however need to specify the tables the correct way round if you were performing an OUTER JOIN, as not all records in this type of join are guaranteed to match via the same field.
Yes, it will matter when you use another join type such as LEFT JOIN or RIGHT JOIN.
Currently you are using a plain JOIN (an INNER JOIN), which returns only the rows that match in both tables; if a row in the joined table has no match, it is excluded from the result.
If you use a LEFT / RIGHT {OUTER} join then the result will be different.

Explain which table to choose "FROM" in a JOIN statement

I'm new to SQL and am having trouble understanding why there's a FROM keyword in a JOIN statement if I use dot notation to select the tables.columns that I want. Does it matter which table I choose out of the two? I didn't see any explanation for this in w3schools definition on which table is the FROM table. In the example below, how do I know which table to choose for the FROM? Since I essentially already selected which table.column to select, can it be either?
For example:
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
INNER JOIN Orders
ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;
The order doesn't matter in an INNER JOIN.
However, it does matter in LEFT JOIN and RIGHT JOIN. In a LEFT JOIN, the table in the FROM clause is the primary table; the result will contain every row selected from this table, while rows named in the LEFT JOIN table can be missing (these columns will be NULL in the result). RIGHT JOIN is similar but the reverse: rows can be missing in the table named in FROM.
For instance, if you change your query to use LEFT JOIN, you'll see customers with no orders. But if you swapped the order of the tables and used a LEFT JOIN, you wouldn't see these customers. You would see orders with no customer (although such rows probably shouldn't exist).
The FROM clause refers to the join, not to a single table. Joining the tables creates a set of rows from which you then select columns.
For an inner join it does not matter which table is in the from clause and which is in the join clause.
For outer joins it of course does matter, as the table in the outer join is allowed to have "missing" records.
It does not matter for inner joins: the optimizer will figure out the proper sequence of reading the tables, regardless of your choice for the ordering.
For directional outer joins, it does matter, because these are not symmetric. In a left outer join, the table whose rows you want to keep in full goes first, in the FROM position; for a right outer join it is the other way around.
For full outer joins it does not matter again, because the tables in full outer joins are used symmetrically to each other.
In situations when ordering does not matter you pick the order to be "natural" to the reader of your SQL statement, whatever that means for your model. SQL queries very quickly become rather hard to read, so the proper ordering of your tables is important for human readers of your queries.
Well in your current example the from operator can be applied on both tables.
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers,Orders
WHERE Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;
->Will work like your code
The comma will join the two tables.
From just means which table you are retrieving data from.
In your example, you joined the two tables using different syntax.
it could also have been :
SELECT Customers.CustomerName, Orders.OrderID
FROM Orders
INNER JOIN Customers
ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;
all the code written will generate same results

sql joins as venn diagram

I've had trouble understanding joins in sql and came upon this image which I think might help me. The problem is that I don't fully understand it. For example, the join in the top right corner of the image, which colors the full B circle red but only the overlap from A. The image makes it seem like circle B is the primary focus of the sql statement, but the sql statement itself, by starting with A (select from A, join B), conveys the opposite impression to me, namely that A would be the focus of the sql statement.
Similarly, the image below that only includes data from the B circle, so why is A included at all in the join statement?
Question: Working clockwise from the top right and finishing in the center, can someone provide more information about the representation of each sql image, explaining
a) why a join would be necessary in each case (for example, especially in situations where no data's taken from A or B i.e. where only A or B but not both is colored)
b) and any other detail that would clarify why the image is a good representation of the sql
I agree with Cade about the limitations of Venn diagrams here. A more apposite visual representation might be this.
Tables
SELECT A.Colour, B.Colour FROM A CROSS JOIN B
The cross join (or cartesian product) produces a result with every combination of the rows from the two tables. Each table has 4 rows so this produces 16 rows in the result.
SELECT A.Colour, B.Colour FROM A INNER JOIN B ON A.Colour = B.Colour
The inner join logically returns all rows from the cross join that match the join condition. In this case five do.
SELECT A.Colour, B.Colour FROM A INNER JOIN B ON A.Colour NOT IN ('Green','Blue')
The inner join condition need not necessarily be an equality condition and it need not reference columns from both (or even either) of the tables. Evaluating A.Colour NOT IN ('Green','Blue') on each row of the cross join returns.
An inner join condition of 1=1 would evaluate to true for every row in the cross join so the two are equivalent.
SELECT A.Colour, B.Colour FROM A LEFT OUTER JOIN B ON A.Colour = B.Colour
Outer Joins are logically evaluated in the same way as inner joins except that if a row from the left table (for a left join) does not join with any rows from the right hand table at all it is preserved in the result with NULL values for the right hand columns.
SELECT A.Colour, B.Colour FROM A LEFT OUTER JOIN B ON A.Colour = B.Colour WHERE B.Colour IS NULL
This simply restricts the previous result to only return the rows where B.Colour IS NULL. In this particular case these will be the rows that were preserved as they had no match in the right hand table and the query returns the single red row not matched in table B. This is known as an anti semi join.
It is important to select a column for the IS NULL test that is either not nullable or for which the join condition ensures that any NULL values will be excluded in order for this pattern to work correctly and avoid just bringing back rows which happen to have a NULL value for that column in addition to the un matched rows.
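The caveat about which column to test with IS NULL can be illustrated with a small sketch. It uses Python's sqlite3 module; the two-row tables and the nullable Note column are invented for this example and are not the tables from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (Colour TEXT NOT NULL);
    CREATE TABLE B (Colour TEXT NOT NULL, Note TEXT);  -- Note is nullable
    INSERT INTO A VALUES ('Red'), ('Green');
    INSERT INTO B VALUES ('Green', NULL);
""")

QUERY = """
    SELECT A.Colour
    FROM A LEFT OUTER JOIN B ON A.Colour = B.Colour
    WHERE {column} IS NULL
    ORDER BY A.Colour
"""

# Testing the join column: only the unmatched row from A is returned.
safe = conn.execute(QUERY.format(column="B.Colour")).fetchall()

# Testing a nullable column: the matched 'Green' row slips through as well,
# because its Note is NULL even though the join found a partner row.
unsafe = conn.execute(QUERY.format(column="B.Note")).fetchall()

print(safe)    # [('Red',)]
print(unsafe)  # [('Green',), ('Red',)]
```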
SELECT A.Colour, B.Colour FROM A RIGHT OUTER JOIN B ON A.Colour = B.Colour
Right outer joins act similarly to left outer joins except they preserve non matching rows from the right table and null extend the left hand columns.
SELECT A.Colour, B.Colour FROM A FULL OUTER JOIN B ON A.Colour = B.Colour
Full outer joins combine the behaviour of left and right joins and preserve the non matching rows from both the left and the right tables.
I think your main underlying confusion is that when (for example) only A is highlighted in red, you're taking that to mean "the query only returns data from A", but in fact it means "the query only returns data for those cases where A has a record". The query might still contain data from B. (For cases where B does not have a record, the query will substitute NULL.)
Similarly, the image below that only includes data from the B circle, so why is A included at all in the join statement?
If you mean — the image where A is entirely in white, and there's a red crescent-shape for the part of B that doesn't overlap with A, then: the reason that A appears in the query is, A is how it finds the records in B that need to be excluded. (If A didn't appear in the query, then Venn diagram wouldn't have A, it would only show B, and there'd be no way to distinguish the desired records from the unwanted ones.)
The image makes it seem like circle B is the primary focus of the sql statement, but the sql statement itself, by starting with A (select from A, join B), conveys the opposite impression to me, namely that A would be the focus of the sql statement.
Quite right. For this reason, RIGHT JOINs are relatively uncommon; although a query that uses a LEFT JOIN can nearly always be re-ordered to use a RIGHT JOIN instead (and vice versa), usually people will write their queries with LEFT JOIN and not with RIGHT JOIN.
Venn diagrams are suitable for representing set operations such as UNION, INTERSECT, EXCEPT etc.
To the extent only that those set operations like EXCEPT are simulated with things like LEFT JOIN WHERE rhs.KEY is NULL, this diagram is accurate.
Otherwise it is misleading. For instance, any join can cause rows to multiply if the join criteria are not 1:1. But sets are only allowed to contain distinct members, so those cannot be represented as set operations.
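A short sketch of the row-multiplication point, using Python's sqlite3 module with invented tables lhs and rhs: one row on the left matches three rows on the right, so the join returns three rows, whereas a true set operation such as INTERSECT on the keys returns one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lhs (k INTEGER, v TEXT);
    CREATE TABLE rhs (k INTEGER, w TEXT);
    INSERT INTO lhs VALUES (1, 'x');
    INSERT INTO rhs VALUES (1, 'p'), (1, 'q'), (1, 'r');
""")

# The join criteria are 1:3 here, so the single lhs row is repeated.
joined = conn.execute("""
    SELECT lhs.v, rhs.w
    FROM lhs INNER JOIN rhs ON rhs.k = lhs.k
    ORDER BY rhs.w
""").fetchall()

# A set operation works on distinct members and yields one key.
intersected = conn.execute(
    "SELECT k FROM lhs INTERSECT SELECT k FROM rhs"
).fetchall()

print(joined)       # [('x', 'p'), ('x', 'q'), ('x', 'r')]
print(intersected)  # [(1,)]
```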
Then there is the CROSS JOIN or INNER JOIN ON 1 = 1 - this is neither analogous to the INNER JOIN as shown in this diagram, nor can the set which is produced be really described by a Venn diagram. Not to mention all the other possible triangular joins, self and anti-joins like:
lhs INNER JOIN rhs ON rhs.VALUE < lhs.VALUE (triangular)
or
SELF self1
INNER JOIN SELF self2
ON self2.key <> self1.key
AND self1.type = self2.type
(self cross and anti-join to find all similar family members except yourself - self1 and self2 are the same set and the result is a proper subset)
Sticking to joins on keys may be fine for the first few minutes of a tutorial, but this can lead to a poor path for learning what joins are about. I think this is what you have found.
This idea that Venn Diagrams can represent JOINs generally this way needs to go away.
When you do a join, it is likely that your two tables might not match up perfectly. Specifically, there could be some rows in A that don't match up to anything in B, or duplicate rows in A that match up with a single row in B, and vice-versa.
When this happens, you have a choice:
for each A, take a single B that works, if there is one. (upper left)
take each pair that fully matches (discard any that are missing either A or B--center)
for each B, take a single A that works, if there is one (upper right)
take EVERYTHING (lower left)
Center left and right are technically joins, but pointless ones; they could probably be more efficiently written SELECT <select_list> FROM TableA A WHERE A.Key NOT IN (SELECT B.Key FROM TableB B) (or the opposite).
In direct answer to your confusion, RIGHT JOIN says "the following expression is the focus of this query".
Lower right is rather strange, and I see no reason why you would want that. It returns the results from the two outer middle queries, mixed together with NULLs in all of the columns for the opposite table.
For the right join, yes the syntax can be confusing, but yes it is what it seems to be. When you say "TableA RIGHT JOIN TableB", it is indeed saying that TableB is the main table that you are referring to and TableA is just hanging on where it has matching records. This does read weird in queries, because TableA is listed first so your brain automatically assigns more priority to it, even though TableB is really the more important table in the query. For this reason, you rarely actually see right joins in real code.
So instead of A and B, let's take two things that are easy to keep track of. Suppose we have two tables for people's info, ShoeSize and IQ. You have ShoeSize info for some people and IQ info for some people, and a PersonID on both tables that you can join on.
Clockwise from top right (even though this starts with some of the more complicated and contrived cases):
ShoeSize RIGHT JOIN IQ -> give me all of the IQ information. Include any ShoeSize information for those people if we have it.
ShoeSize RIGHT JOIN IQ WHERE ShoeSize.PersonID IS NULL -> Give me all of the IQ info, but only for people who don't have any shoe size info
ShoeSize FULL OUTER JOIN IQ WHERE ShoeSize.PersonID IS NULL OR IQ.PersonID IS NULL -> Give me the shoe size info only for people who don't have IQ info, plus the IQ info for people who don't have shoe size info
ShoeSize FULL OUTER JOIN IQ -> Give me everything, all shoe sizes and all IQ data. If any ShoeSizes and IQ records have the same PersonID, include them in one row.
ShoeSize LEFT JOIN IQ WHERE IQ.PersonID IS NULL -> Give me all of the shoe size info, but only for people that don't have IQ info
ShoeSize LEFT JOIN IQ -> Give me all of the shoe size info. Include any IQ information for those people if we have it.
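One detail matters when running the filtered variants: in SQL a missing match has to be tested with IS NULL, because a comparison with = NULL is never true. A minimal sketch with Python's sqlite3 module, using invented rows and hypothetical Size and Score columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ShoeSize (PersonID INTEGER, Size INTEGER);
    CREATE TABLE IQ (PersonID INTEGER, Score INTEGER);
    INSERT INTO ShoeSize VALUES (1, 42), (2, 38);
    INSERT INTO IQ VALUES (1, 120);
""")

QUERY = """
    SELECT ShoeSize.PersonID
    FROM ShoeSize LEFT JOIN IQ ON IQ.PersonID = ShoeSize.PersonID
    WHERE IQ.PersonID {predicate}
"""

# "= NULL" evaluates to NULL (unknown) for every row, so nothing is returned.
with_equals = conn.execute(QUERY.format(predicate="= NULL")).fetchall()

# "IS NULL" finds person 2, who has shoe size info but no IQ info.
with_is_null = conn.execute(QUERY.format(predicate="IS NULL")).fetchall()

print(with_equals)   # []
print(with_is_null)  # [(2,)]
```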