I have 2 tables, customer and order1. I want to know which of the following queries is more efficient:
select cust_name,ISBN from customer,order1 where customer.cust_no=order1.cust_no;
,
select cust_name,ISBN from customer inner join order1 on customer.cust_no=order1.cust_no;
and
select cust_name,ISBN from customer natural join order1;
I've read that an inner join takes the cartesian product of the two tables and then returns only the rows that match the 'on' condition. Does a natural join operate in the same way as an inner join? Also, how are inline queries more efficient than joins?
These three queries should do the same thing. You could verify by checking the execution plan, but any differences between them should be negligible.
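To check the "same thing" claim on a small data set, here is a minimal sketch in Python using the standard-library sqlite3 module. SQLite is only a stand-in for MySQL here, and the column types and sample rows are assumptions made up for illustration; the table and column names come from the question.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_no INTEGER PRIMARY KEY, cust_name TEXT);
    CREATE TABLE order1 (order_no INTEGER PRIMARY KEY, cust_no INTEGER, ISBN TEXT);
    -- Hypothetical sample rows; customer 3 has no orders.
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO order1 VALUES (10, 1, 'isbn-a'), (11, 1, 'isbn-b'), (12, 2, 'isbn-c');
""")

queries = {
    "comma": "SELECT cust_name, ISBN FROM customer, order1 "
             "WHERE customer.cust_no = order1.cust_no",
    "inner": "SELECT cust_name, ISBN FROM customer INNER JOIN order1 "
             "ON customer.cust_no = order1.cust_no",
    # cust_no is the only column name the two tables share,
    # so NATURAL JOIN matches on it alone.
    "natural": "SELECT cust_name, ISBN FROM customer NATURAL JOIN order1",
}

# Sort each result so that row order does not affect the comparison.
results = {name: sorted(conn.execute(sql).fetchall()) for name, sql in queries.items()}

for name, rows in results.items():
    print(name, rows)
```

Note that NATURAL JOIN only stays equivalent while cust_no is the sole shared column name; adding another same-named column to both tables would silently change the join condition.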
According to MySQL 5.7 Reference Manual:
Natural joins and joins with USING, including outer join variants, are processed according to the SQL:2003 standard. The goal was to align the syntax and semantics of MySQL with respect to NATURAL JOIN and JOIN ... USING according to SQL:2003. However, these changes in join processing can result in different output columns for some joins. Also, some queries that appeared to work correctly in older versions (prior to 5.0.12) must be rewritten to comply with the standard.
These changes have five main aspects:
The way that MySQL determines the result columns of NATURAL or USING join operations (and thus the result of the entire FROM clause).
Expansion of SELECT * and SELECT tbl_name.* into a list of selected columns.
Resolution of column names in NATURAL or USING joins.
Transformation of NATURAL or USING joins into JOIN ... ON.
Resolution of column names in the ON condition of a JOIN ... ON.
Also, note that:
The conditional_expr used with ON is any conditional expression of the form that can be used in a WHERE clause. Generally, you should use the ON clause for conditions that specify how to join tables, and the WHERE clause to restrict which rows you want in the result set.
And finally to answer your question regarding sub queries, from Rewriting Subqueries as Joins:
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone. Prior to SQL-92, outer joins did not exist, so subqueries were the only way to do certain things. Today, MySQL Server and many other modern database systems offer a wide range of outer join types.
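As an illustration of that kind of rewrite, the sketch below compares a NOT IN subquery with a LEFT JOIN ... IS NULL form. It uses Python's standard-library sqlite3 module as a stand-in for MySQL; table1, table2 and their rows are hypothetical, and the id columns are declared NOT NULL because the two forms can differ when NULLs are present. It only shows that the results match, not which form is faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables; NOT NULL keeps NOT IN and the join form equivalent.
    CREATE TABLE table1 (id INTEGER NOT NULL);
    CREATE TABLE table2 (id INTEGER NOT NULL);
    INSERT INTO table1 VALUES (1), (2), (3), (4);
    INSERT INTO table2 VALUES (2), (4);
""")

# Subquery form: ids in table1 that do not appear in table2.
subquery_rows = sorted(conn.execute(
    "SELECT id FROM table1 WHERE id NOT IN (SELECT id FROM table2)"
).fetchall())

# Join form: keep every table1 row, then keep only those with no match.
join_rows = sorted(conn.execute(
    "SELECT table1.id FROM table1 "
    "LEFT JOIN table2 ON table1.id = table2.id "
    "WHERE table2.id IS NULL"
).fetchall())

print(subquery_rows, join_rows)
```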
For simplicity, assume all relevant fields are NOT NULL.
You can do:
SELECT
table1.this, table2.that, table2.somethingelse
FROM
table1, table2
WHERE
table1.foreignkey = table2.primarykey
AND (some other conditions)
Or else:
SELECT
table1.this, table2.that, table2.somethingelse
FROM
table1 INNER JOIN table2
ON table1.foreignkey = table2.primarykey
WHERE
(some other conditions)
Do these two work in the same way in MySQL?
INNER JOIN is ANSI syntax that you should use.
It is generally considered more readable, especially when you join lots of tables.
It can also be easily replaced with an OUTER JOIN whenever a need arises.
The WHERE syntax is more relational model oriented.
A result of two tables JOINed is a cartesian product of the tables to which a filter is applied which selects only those rows with joining columns matching.
It's easier to see this with the WHERE syntax.
As for your example, in MySQL (and in SQL generally) these two queries are synonyms.
Also, note that MySQL also has a STRAIGHT_JOIN clause.
Using this clause, you can control the JOIN order: which table is scanned in the outer loop and which one is in the inner loop.
You cannot control this in MySQL using WHERE syntax.
Others have pointed out that INNER JOIN helps human readability, and that's a top priority, I agree.
Let me try to explain why the join syntax is more readable.
A basic SELECT query is this:
SELECT stuff
FROM tables
WHERE conditions
The SELECT clause tells us what we're getting back; the FROM clause tells us where we're getting it from, and the WHERE clause tells us which ones we're getting.
JOIN is a statement about the tables, how they are bound together (conceptually, actually, into a single table).
Any query elements that control the tables - where we're getting stuff from - semantically belong to the FROM clause (and of course, that's where JOIN elements go). Putting joining elements into the WHERE clause conflates the which and the where-from; that's why the JOIN syntax is preferred.
Applying conditional statements in ON / WHERE
Here I have explained the logical query processing steps.
Reference: Inside Microsoft® SQL Server™ 2005 T-SQL Querying
Publisher: Microsoft Press
Pub Date: March 07, 2006
Print ISBN-10: 0-7356-2313-9
Print ISBN-13: 978-0-7356-2313-2
Pages: 640
(8) SELECT (9) DISTINCT (11) TOP <top_specification> <select_list>
(1) FROM <left_table>
(3) <join_type> JOIN <right_table>
(2) ON <join_condition>
(4) WHERE <where_condition>
(5) GROUP BY <group_by_list>
(6) WITH {CUBE | ROLLUP}
(7) HAVING <having_condition>
(10) ORDER BY <order_by_list>
The first noticeable aspect of SQL that is different than other programming languages is the order in which the code is processed. In most programming languages, the code is processed in the order in which it is written. In SQL, the first clause that is processed is the FROM clause, while the SELECT clause, which appears first, is processed almost last.
Each step generates a virtual table that is used as the input to the following step. These virtual tables are not available to the caller (client application or outer query). Only the table generated by the final step is returned to the caller. If a certain clause is not specified in a query, the corresponding step is simply skipped.
Brief Description of Logical Query Processing Phases
Don't worry too much if the description of the steps doesn't seem to make much sense for now. These are provided as a reference. Sections that come after the scenario example will cover the steps in much more detail.
(1) FROM: A Cartesian product (cross join) is performed between the first two tables in the FROM clause, and as a result, virtual table VT1 is generated.
(2) ON: The ON filter is applied to VT1. Only rows for which the <join_condition> is TRUE are inserted to VT2.
(3) OUTER (join): If an OUTER JOIN is specified (as opposed to a CROSS JOIN or an INNER JOIN), rows from the preserved table or tables for which a match was not found are added to the rows from VT2 as outer rows, generating VT3. If more than two tables appear in the FROM clause, steps 1 through 3 are applied repeatedly between the result of the last join and the next table in the FROM clause until all tables are processed.
(4) WHERE: The WHERE filter is applied to VT3. Only rows for which the <where_condition> is TRUE are inserted to VT4.
(5) GROUP BY: The rows from VT4 are arranged in groups based on the column list specified in the GROUP BY clause. VT5 is generated.
(6) CUBE | ROLLUP: Supergroups (groups of groups) are added to the rows from VT5, generating VT6.
(7) HAVING: The HAVING filter is applied to VT6. Only groups for which the <having_condition> is TRUE are inserted to VT7.
(8) SELECT: The SELECT list is processed, generating VT8.
(9) DISTINCT: Duplicate rows are removed from VT8. VT9 is generated.
(10) ORDER BY: The rows from VT9 are sorted according to the column list specified in the ORDER BY clause. A cursor is generated (VC10).
(11) TOP: The specified number or percentage of rows is selected from the beginning of VC10. Table VT11 is generated and returned to the caller.
Therefore, (INNER JOIN) ON will filter the data (the data count of VT will be reduced here itself) before applying the WHERE clause. The subsequent join conditions will be executed with filtered data which improves performance. After that, only the WHERE condition will apply filter conditions.
(Applying conditional statements in ON / WHERE will not make much difference in few cases. This depends on how many tables you have joined and the number of rows available in each join tables)
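The logical ordering described above (ON first, then outer rows added back, then WHERE) has a visible effect on results once an OUTER JOIN is involved. Here is a minimal sketch in Python with the standard-library sqlite3 module, used as a stand-in for SQL Server or MySQL; the customers and orders tables and their rows are hypothetical. It demonstrates logical results only, not what the optimizer physically does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data: customer 2 has no order with status = 1.
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (1, 1, 1), (2, 1, 0), (3, 2, 0);
""")

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# INNER JOIN: the extra predicate gives the same rows in ON or in WHERE.
inner_on = run("SELECT c.id, o.id FROM customers c JOIN orders o "
               "ON o.customer_id = c.id AND o.status = 1 ORDER BY c.id")
inner_where = run("SELECT c.id, o.id FROM customers c JOIN orders o "
                  "ON o.customer_id = c.id WHERE o.status = 1 ORDER BY c.id")

# LEFT JOIN: a predicate in ON is applied before unmatched customers are
# added back as outer rows, so customer 2 survives with a NULL order.
left_on = run("SELECT c.id, o.id FROM customers c LEFT JOIN orders o "
              "ON o.customer_id = c.id AND o.status = 1 ORDER BY c.id")

# LEFT JOIN: a predicate in WHERE is applied after the outer rows exist,
# so the NULL-extended row for customer 2 is filtered out again.
left_where = run("SELECT c.id, o.id FROM customers c LEFT JOIN orders o "
                 "ON o.customer_id = c.id WHERE o.status = 1 ORDER BY c.id")

print(inner_on, inner_where, left_on, left_where)
```

So for inner joins the placement is a readability choice, while for outer joins it changes which rows come back.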
The implicit join ANSI syntax is older, less obvious, and not recommended.
In addition, the relational algebra allows interchangeability of the predicates in the WHERE clause and the INNER JOIN, so even INNER JOIN queries with WHERE clauses can have the predicates rearranged by the optimizer.
I recommend you write the queries in the most readable way possible.
Sometimes this includes making the INNER JOIN relatively "incomplete" and putting some of the criteria in the WHERE simply to make the lists of filtering criteria more easily maintainable.
For example, instead of:
SELECT *
FROM Customers c
INNER JOIN CustomerAccounts ca
ON ca.CustomerID = c.CustomerID
AND c.State = 'NY'
INNER JOIN Accounts a
ON ca.AccountID = a.AccountID
AND a.Status = 1
Write:
SELECT *
FROM Customers c
INNER JOIN CustomerAccounts ca
ON ca.CustomerID = c.CustomerID
INNER JOIN Accounts a
ON ca.AccountID = a.AccountID
WHERE c.State = 'NY'
AND a.Status = 1
But it depends, of course.
Implicit joins (which is what your first query is known as) become much much more confusing, hard to read, and hard to maintain once you need to start adding more tables to your query. Imagine doing that same query and type of join on four or five different tables ... it's a nightmare.
Using an explicit join (your second example) is much more readable and easy to maintain.
I'll also point out that using the older syntax is more subject to error. If you use inner joins without an ON clause, you will get a syntax error. If you use the older syntax and forget one of the join conditions in the where clause, you will get a cross join. The developers often fix this by adding the distinct keyword (rather than fixing the join because they still don't realize the join itself is broken) which may appear to cure the problem but will slow down the query considerably.
Additionally for maintenance if you have a cross join in the old syntax, how will the maintainer know if you meant to have one (there are situations where cross joins are needed) or if it was an accident that should be fixed?
Let me point you to this question to see why the implicit syntax is bad if you use left joins.
Sybase *= to Ansi Standard with 2 different outer tables for same inner table
Plus (personal rant here), the standard using the explicit joins is over 20 years old, which means implicit join syntax has been outdated for those 20 years. Would you write application code using a syntax that has been outdated for 20 years? Why do you want to write database code that is?
The SQL:2003 standard changed some precedence rules so that a JOIN statement takes precedence over a "comma" join. This can actually change the results of your query depending on how it is set up. This caused problems for some people when MySQL 5.0.12 switched to adhering to the standard.
So in your example, your queries would work the same. But if you added a third table:
SELECT ... FROM table1, table2 JOIN table3 ON ... WHERE ...
Prior to MySQL 5.0.12, table1 and table2 would be joined first, then table3. Now (5.0.12 and on), table2 and table3 are joined first, then table1. It doesn't always change the results, but it can and you may not even realize it.
I never use the "comma" syntax anymore, opting for your second example. It's a lot more readable anyway, the JOIN conditions are with the JOINs, not separated into a separate query section.
They have a different human-readable meaning.
However, depending on the query optimizer, they may have the same meaning to the machine.
You should always code to be readable.
That is to say, if this is a built-in relationship, use the explicit join. If you are matching on weakly related data, use the WHERE clause.
I know you're talking about MySQL, but anyway:
In Oracle 9 explicit joins and implicit joins would generate different execution plans. AFAIK that has been solved in Oracle 10+: there's no such difference anymore.
If you are often programming dynamic stored procedures, you will fall in love with your second example (using where). If you have various input parameters and lots of morph mess, then that is the only way. Otherwise, they both will run the same query plan so there is definitely no obvious difference in classic queries.
ANSI join syntax is definitely more portable.
I'm going through an upgrade of Microsoft SQL Server, and I would also mention that the =* and *= syntax for outer joins in SQL Server is not supported (without compatibility mode) for 2005 SQL server and later.
I have two points for the implicit join (The second example):
Tell the database what you want, not what it should do.
You can write all tables in a clear list that is not cluttered by join conditions. That makes it much easier to see which tables are mentioned. The conditions all come in the WHERE part, where they are also lined up one below the other. Using the JOIN keyword mixes up tables and conditions.
I have multiple joins including left joins in mysql. There are two ways to do that.
I can put "ON" conditions right after each join:
select * from A join B ON(A.bid=B.ID) join C ON(B.cid=C.ID) join D ON(C.did=D.ID)
I can put them all in one "ON" clause:
select * from A join B join C join D ON(A.bid=B.ID AND B.cid=C.ID AND C.did=D.ID)
Which way is better?
Is it different if I need Left join or Right join in my query?
For simple uses MySQL will almost inevitably execute them in the same manner, so it is a matter of preference and readability (which is a great subject of debate).
However with more complex queries, particularly aggregate queries with OUTER JOINs that have the potential to become disk and io bound - there may be performance and unseen implications in not using a WHERE clause with OUTER JOIN queries.
The difference between a query that runs for 8 minutes and one that runs for 0.8 seconds may ultimately depend on the WHERE clause, particularly as it relates to indexes (How MySQL uses Indexes): the WHERE clause is a core part of providing the query optimizer with the information it needs to do its job and tell the engine how to execute the query in the most efficient way.
From How MySQL Optimizes Queries using WHERE:
"This section discusses optimizations that can be made for processing
WHERE clauses...The best join combination for joining the tables is
found by trying all possibilities. If all columns in ORDER BY and
GROUP BY clauses come from the same table, that table is preferred
first when joining."
For each table in a join, a simpler WHERE is constructed to get a fast
WHERE evaluation for the table and also to skip rows as soon as
possible
Some examples:
Full table scans (type = ALL) with NO Using where in EXTRA
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id AND cr.role = cr2.role AND cr.util = 1000
[Err] Out of memory
Uses where to optimize results, with index (Using where,Using index):
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id
WHERE cr.role = cr2.role
AND cr.util = 1000
515661 rows in set (0.124s)
Combination of ON/WHERE (same result, same plan in EXPLAIN):
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id
AND cr.role = cr2.role
WHERE cr.util = 1000
515661 rows in set (0.121s)
MySQL is typically smart enough to figure out simple queries like the above and will execute them similarly but in certain cases it will not.
Outer Join Query Performance:
As both LEFT JOIN and RIGHT JOIN are OUTER JOINs (great in-depth review here), the issue of the Cartesian product arises. Table scans must be avoided, so that as many rows as possible that are not needed for the query are eliminated as fast as possible.
WHERE, indexes and the query optimizer used together may completely eliminate the problems posed by Cartesian products when used carefully with aggregate functions and clauses such as AVG, SUM, GROUP BY and DISTINCT. Run time can decrease by orders of magnitude with proper indexing by the user and use of the WHERE clause.
Finally
Again, for the majority of queries, the query optimizer will execute these in the same manner, making it a matter of preference, but when query optimization becomes important, WHERE is a very important tool. I have seen some performance increase in certain cases with INNER JOIN by specifying an indexed column as an additional ON ... AND condition, but I could not tell you why.
Put the ON clause with the JOIN it applies to.
The reasons are:
readability: others can easily see how the tables are joined
performance: if you leave the conditions until later in the query, you'll get way more joins happening than needed - it's like putting the conditions in the where clause
convention: by following normal style, your code will be more portable and less likely to encounter problems that may occur with unusual syntax - do what works
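For plain inner joins, the two layouts from the question can be compared directly. The sketch below uses Python's standard-library sqlite3 module as a stand-in for MySQL (an assumption: both engines accept JOIN without ON for inner joins, but other databases may not), with the A, B, C, D tables from the question and hypothetical rows. It covers inner joins only; with LEFT or RIGHT joins, moving conditions between ON clauses can change the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Tables named as in the question; rows are hypothetical.
    CREATE TABLE A (id INTEGER PRIMARY KEY, bid INTEGER);
    CREATE TABLE B (id INTEGER PRIMARY KEY, cid INTEGER);
    CREATE TABLE C (id INTEGER PRIMARY KEY, did INTEGER);
    CREATE TABLE D (id INTEGER PRIMARY KEY);
    INSERT INTO A VALUES (1, 10), (2, 11), (3, 99);
    INSERT INTO B VALUES (10, 20), (11, 21);
    INSERT INTO C VALUES (20, 30), (21, 98);
    INSERT INTO D VALUES (30);
""")

# One ON clause per join.
per_join = sorted(conn.execute(
    "SELECT A.id, B.id, C.id, D.id FROM A "
    "JOIN B ON (A.bid = B.id) "
    "JOIN C ON (B.cid = C.id) "
    "JOIN D ON (C.did = D.id)"
).fetchall())

# All conditions collected in a single ON clause at the end.
single_on = sorted(conn.execute(
    "SELECT A.id, B.id, C.id, D.id FROM A JOIN B JOIN C JOIN D "
    "ON (A.bid = B.id AND B.cid = C.id AND C.did = D.id)"
).fetchall())

print(per_join, single_on)
```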
Is Order of Joins important if there are
multiple joins
3rd join depends on 2nd join (lets assume and is the case in this question)
I am unable to come to a conclusion on this. I had multiple queries with the above criteria. Some of them seem to work, some do not produce the proper result (not sure if it's because of the joins), and some actually throw an error.
Does anyone have any specific idea on this?
Order of joins is important if you are using OUTER joins in your query (LEFT OUTER JOIN, RIGHT OUTER JOIN, LEFT JOIN, or RIGHT JOIN notation).
If you're using only INNER joins, it should not matter as long as they all relate to each other via some chain of ON conditions. This holds true whether you have 3 or 30 inner joins linked together.
The query optimizer will juggle them around anyhow based on the optimal execution plan based on indexes and such.
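A small experiment makes the distinction concrete. This is a minimal sketch in Python with the standard-library sqlite3 module as a stand-in database; the tables a, b, c and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical chain: c references b, b references a; a.id = 2 has no b row.
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER);
    CREATE TABLE c (id INTEGER PRIMARY KEY, b_id INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (1, 1);
    INSERT INTO c VALUES (1, 1);
""")

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Inner joins: writing the tables in a different order gives the same rows.
inner_abc = run("SELECT a.id, b.id, c.id FROM a "
                "JOIN b ON b.a_id = a.id JOIN c ON c.b_id = b.id")
inner_cba = run("SELECT a.id, b.id, c.id FROM c "
                "JOIN b ON c.b_id = b.id JOIN a ON b.a_id = a.id")

# Outer joins: swapping the tables changes which side is preserved.
left_ab = run("SELECT a.id, b.id FROM a LEFT JOIN b ON b.a_id = a.id ORDER BY a.id")
left_ba = run("SELECT a.id, b.id FROM b LEFT JOIN a ON b.a_id = a.id ORDER BY a.id")

print(inner_abc, inner_cba, left_ab, left_ba)
```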
What is the difference between the query
SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo
FROM Persons
INNER JOIN Orders
ON Persons.P_Id=Orders.P_Id
ORDER BY Persons.LastName
and this one
SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo
FROM Persons, Orders
WHERE Persons.P_Id=Orders.P_Id
ORDER BY Persons.LastName
There is a small difference in syntax, but both queries are doing a join on the P_Id fields of the respective tables.
In your second example, this is an implicit join, which you are constraining in your WHERE clause to the P_Id fields of both tables.
The join is explicit in your first example and the join clause contains the constraint instead of in an additional WHERE clause.
They are basically equivalent. In general, the JOIN keyword enables you to be more explicit about the direction (LEFT, RIGHT) and type (INNER, OUTER, CROSS) of your join.
This SO posting has a good explanation of the differences in ANSI SQL compliance, and bears similarities to the question asked here.
While (as it has been stated) both queries will produce the same result, I find that it is always a good idea to explicitly state your JOINs. It's much easier to understand, especially when there are non-JOIN-related evaluations in the WHERE clause.
Explicitly stating your JOIN also prevents you from inadvertently querying a Cartesian product. In your 2nd query above, if you (for whatever reason) forgot to include your WHERE clause, your query would run without JOIN conditions and return a result set of every row in Persons matched with every row in Orders...probably not something that you want.
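To see how quickly the accidental Cartesian product grows, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in database. The Persons and Orders tables follow the question; the O_Id column and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data: 3 persons, 4 orders.
    CREATE TABLE Persons (P_Id INTEGER PRIMARY KEY, LastName TEXT, FirstName TEXT);
    CREATE TABLE Orders (O_Id INTEGER PRIMARY KEY, OrderNo INTEGER, P_Id INTEGER);
    INSERT INTO Persons VALUES (1, 'Hansen', 'Ola'), (2, 'Svendson', 'Tove'),
                               (3, 'Pettersen', 'Kari');
    INSERT INTO Orders VALUES (1, 77895, 3), (2, 44678, 3), (3, 22456, 1), (4, 24562, 1);
""")

# Implicit join with its WHERE condition: one row per matching order.
joined = conn.execute(
    "SELECT Persons.LastName, Orders.OrderNo FROM Persons, Orders "
    "WHERE Persons.P_Id = Orders.P_Id"
).fetchall()

# Same query with the WHERE clause forgotten: every person paired with every order.
forgotten = conn.execute(
    "SELECT Persons.LastName, Orders.OrderNo FROM Persons, Orders"
).fetchall()

print(len(joined), len(forgotten))
```

With 3 persons and 4 orders the difference is 4 rows versus 12; on production-sized tables the same mistake multiplies into millions of rows.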
The difference is in syntax, but not in the semantics.
The explicit JOIN syntax:
is considered more readable and
allows you to cleanly and in standard way specify whether you want INNER, LEFT/RIGHT OUTER or a CROSS join. This is in contrast to using DBMS-specific syntax, such as old Oracle's Persons.P_Id = Orders.P_Id(+) syntax for left outer join, for example.
When I'm selecting data from multiple tables I used to use JOINS a lot and recently I started to use another way but I'm unsure of the impact in the long run.
Examples:
SELECT * FROM table_1 LEFT JOIN table_2 ON (table_1.column = table_2.column)
So this is your basic LEFT JOIN across tables but take a look at the query below.
SELECT * FROM table_1,table_2 WHERE table_1.column = table_2.column
Personally, if I were joining across, let's say, 7 tables of data, I would prefer to do this over JOINs.
But are there any pros and cons in regards to the 2 methods ?
Second method is a shortcut for INNER JOIN.
SELECT * FROM table_1 INNER JOIN table_2 ON table_1.column = table_2.column
Will only select records that match the condition in both tables (LEFT JOIN will select all records from table on the left, and matching records from table on the right)
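The difference between the two queries in the question shows up as soon as one table has rows without a match. Here is a minimal sketch in Python with the standard-library sqlite3 module as a stand-in for MySQL; the sample rows are hypothetical, and the join column is named col rather than column to avoid clashing with an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data: value 1 exists only in table_1, value 4 only in table_2.
    CREATE TABLE table_1 (col INTEGER);
    CREATE TABLE table_2 (col INTEGER);
    INSERT INTO table_1 VALUES (1), (2), (3);
    INSERT INTO table_2 VALUES (2), (3), (4);
""")

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# LEFT JOIN keeps every table_1 row; unmatched rows get NULL (None) on the right.
left_rows = run("SELECT table_1.col, table_2.col FROM table_1 "
                "LEFT JOIN table_2 ON table_1.col = table_2.col "
                "ORDER BY table_1.col")

# Comma join with WHERE behaves like INNER JOIN: only matching rows remain.
comma_rows = run("SELECT table_1.col, table_2.col FROM table_1, table_2 "
                 "WHERE table_1.col = table_2.col ORDER BY table_1.col")

inner_rows = run("SELECT table_1.col, table_2.col FROM table_1 "
                 "INNER JOIN table_2 ON table_1.col = table_2.col "
                 "ORDER BY table_1.col")

print(left_rows, comma_rows, inner_rows)
```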
Quote from http://dev.mysql.com/doc/refman/5.0/en/join.html
[...] we consider each comma in a list of table_reference items as equivalent to an inner join
And
INNER JOIN and , (comma) are semantically equivalent in the absence of a join condition: both produce a Cartesian product between the specified tables (that is, each and every row in the first table is joined to each and every row in the second table).
However, the precedence of the comma operator is less than that of INNER JOIN, CROSS JOIN, LEFT JOIN, and so on. If you mix comma joins with the other join types when there is a join condition, an error of the form Unknown column 'col_name' in 'on clause' may occur. Information about dealing with this problem is given later in this section.
In general there are quite a few things mentioned there, that should make you consider not using commas.
The first method is the ANSI/ISO version of the Join. The second method is the older format (pre-89) to produce the equivalent of an Inner Join. It does this by cross joining all the tables you list and then narrowing the Cartesian product in the Where clause to produce the equivalent of an inner join.
I would strongly recommend against the second method.
It is harder for other developers to read
It breaks the rule of least astonishment to other developers who will wonder whether you simply did not know any better or if there was some specific reason for not using the ANSI/ISO format.
It will cause you grief when you start trying to use that format with something other than Inner Joins.
It makes it harder to discern your intent especially in a large query with many tables. Are all of these tables supposed to be inner joins? Did you miss something in the Where clause and create a cross join? Did you intend to make a cross join? Etc.
There is simply no reason to use the second format and in fact many database systems are ending support for that format.
ANSI Syntax
Both queries are JOINs, and both use ANSI syntax but one is older than the other.
A join written with the JOIN keyword uses ANSI-92 syntax. ANSI-89 syntax is when you have the tables comma separated in the FROM clause, and the criteria that join them are found in the WHERE clause. When comparing INNER JOINs, there is no performance difference - this:
SELECT *
FROM table_1 t1, table_2 t2
WHERE t1.column = t2.column
...will produce the same query plan as:
SELECT *
FROM TABLE_1 t1
JOIN TABLE_2 t2 ON t2.column = t1.column
Apples to Oranges
Another difference is that the two queries are not identical - a LEFT [OUTER] JOIN will produce all rows from TABLE_1, and references to TABLE_2 in the output will be NULL if there's no match based on the JOIN criteria (specified in the ON clause). The second example is an INNER JOIN, which will only produce rows that have matching records in TABLE_2. Here's a link to a visual representation of JOINs to reinforce the difference...
Pros/Cons
The main reason to use ANSI-92 syntax is because ANSI-89 doesn't have any OUTER JOIN (LEFT, RIGHT, FULL) support. ANSI-92 syntax was specifically introduced to address this shortcoming, because vendors were implementing their own, custom syntax. Oracle used (+); SQL Server used an asterisk on the side of the equals in the join criteria (IE: t1.column =* t2.column).
The next reason to use ANSI-92 syntax is that it's more explicit and more readable, while separating what is being used for joining tables from actual filtering.
I personally feel the explicit join syntax (A JOIN B, A LEFT JOIN B) is preferable. Both because it's more explicit about what you're doing, and because if you use implicit join syntax for inner joins, you still have to use the explicit syntax for outer joins and thus your SQL formatting will be inconsistent.