Inner Join vs. Natural Join, speed-wise? - mysql

I'm having a discussion with a colleague of mine, and we're stuck on the point of whether an inner join is faster than a natural join simply due to the fact that the ON clause is specified in an inner join, and thus the SQL engine need not compare table structures to determine common columns.
Please share your thoughts/answers.

It is going to depend on the DBMS, and to some extent on the tables joined, and the difference is generally not going to be measurable. With an INNER JOIN, there will be two copies of the common columns; with a NATURAL JOIN, there will be just one copy of the common columns.
If the intermediate results have to be written to disk or sorted, there will be more data to be managed, and there is slightly more data to be returned to the client over the client-server connection, so an INNER JOIN might take slightly longer than a NATURAL JOIN. If the common columns are big enough (long character fields, for example), this size difference could be significant.
The difference in time to prepare the two types of join is completely negligible. Both have to read the information for all the columns in both tables involved in the join, and the processing of that data is substantially similar.
During execution, the copying of the data to form the intermediate results might be slightly more complex for a NATURAL JOIN, because you don't simply take a copy of each row. But you don't necessarily take a full copy of each row for an INNER JOIN either, so the difference is again negligible. And, as before, there is slightly less data to be copied in aggregate for the NATURAL JOIN, so there is still a small advantage to it.
So, overall, a NATURAL JOIN might be slightly quicker than an INNER JOIN, but the difference is still likely to be negligible.
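The point about one copy versus two copies of the common columns can be checked directly. The following is a minimal sketch, assuming SQLite (via Python's built-in sqlite3 module) as a stand-in for whatever DBMS is in use; the customers/invoices tables and their columns are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical two-table schema sharing one column name (customer_id).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE invoices  (invoice_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO invoices  VALUES (10, 1, 50.0), (11, 1, 75.0);
""")

def column_names(sql):
    """Return the result-set column names for a query."""
    return [d[0] for d in conn.execute(sql).description]

inner_cols = column_names(
    "SELECT * FROM customers c INNER JOIN invoices i "
    "ON c.customer_id = i.customer_id"
)
natural_cols = column_names("SELECT * FROM customers NATURAL JOIN invoices")

print(inner_cols)    # customer_id appears twice
print(natural_cols)  # customer_id appears once
```

This only shows the shape of the result set; it says nothing about timing, which, as argued above, is unlikely to differ measurably.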

I can't imagine it being so expensive to find common columns that you could support that hypothesis.

In the exceptional case a natural join can be as speedy as an inner join.

Related

Is this SQL statement making a join? [duplicate]

I develop against Oracle databases. When I need to write SQL by hand (rather than use an ORM like Hibernate), I use a WHERE condition instead of a JOIN.
For example (this is simplistic, just to illustrate the style):
Select *
from customers c, invoices i, shipment_info si
where c.customer_id = i.customer_id
and i.amount > 999.99
and i.invoice_id = si.invoice_id(+) -- added to show a replacement for a join
order by i.amount, c.name
I learned this style from an OLD Oracle DBA. I have since learned that this is not standard SQL syntax. Other than being non-standard and much less portable across databases, are there any other repercussions to using this format?
I don't like the style because it makes it harder to determine which WHERE clauses are for simulating JOINs and which ones are for actual filters, and I don't like code that makes it unnecessarily difficult to determine the original intent of the programmer.
The biggest issue that I have run into with this format is the tendency to forget some join's WHERE clause, thereby resulting in a cartesian product. This is particularly common (for me, at least) when adding a new table to the query. For example, suppose an ADDRESSES table is thrown into the mix and your mind is a bit forgetful:
SELECT *
FROM customers c, invoices i, addresses a
WHERE c.customer_id = i.customer_id
AND i.amount > 999.99
ORDER BY i.amount, c.name
Boom! Cartesian product! :)
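To make the effect of the forgotten condition concrete, here is a small sketch, assuming SQLite through Python's sqlite3 module rather than Oracle; the table contents are invented, and the a.customer_id column used in the corrected query is an assumption about how ADDRESSES would relate to CUSTOMERS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE invoices  (invoice_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE addresses (address_id INTEGER, customer_id INTEGER, city TEXT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO invoices  VALUES (10, 1, 1500.0), (11, 2, 2500.0), (12, 2, 10.0);
    INSERT INTO addresses VALUES (100, 1, 'Oslo'), (101, 2, 'Rome'), (102, 2, 'Lima');
""")

def row_count(sql):
    """Run a query and return how many rows it produces."""
    return len(conn.execute(sql).fetchall())

# The join condition for addresses is missing, as in the query above.
forgotten = row_count("""
    SELECT * FROM customers c, invoices i, addresses a
    WHERE c.customer_id = i.customer_id
      AND i.amount > 999.99
""")

# Same query with the (assumed) addresses join condition restored.
fixed = row_count("""
    SELECT * FROM customers c, invoices i, addresses a
    WHERE c.customer_id = i.customer_id
      AND a.customer_id = c.customer_id
      AND i.amount > 999.99
""")

print(forgotten, fixed)
```

With the condition missing, every qualifying customer/invoice pair is repeated once per row in addresses, whether or not the address belongs to that customer.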
The old-style join is flat out wrong in some cases (outer joins are the culprit). Although the two styles are more or less equivalent for inner joins, the old style can generate incorrect results with outer joins, especially if columns on the outer side can be null. This is because, with the older syntax, the join conditions are not logically evaluated until the entire result set has been constructed. It is therefore simply not possible to express a condition on a column from the outer side of a join that will filter records when that column can also be null because there is no matching record.
As an example:
Select all Customers, and the sum of the sales of Widgets on all their Invoices in the month Of August, where the Invoice has been processed (Invoice.ProcessDate is Not Null)
using new ANSI-92 Join syntax
Select c.name, Sum(d.Amount)
From customer c
Left Join Invoice I
On i.custId = c.custId
And i.SalesDate Between '8/1/2009'
and '8/31/2009 23:59:59'
And i.ProcessDate Is Not Null
Left Join InvoiceDetails d
On d.InvoiceId = i.InvoiceId
And d.Product = 'widget'
Group By c.Name
Try doing this with the old syntax. Because all the conditions in the WHERE clause are evaluated BEFORE the 'outer' rows are added back in, all the unprocessed Invoice rows will be added back into the final result set. So this is not possible with the old syntax: anything that attempts to filter out the invoices with null ProcessDate values will eliminate customers. The only alternative is to use a correlated subquery.
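The difference between filtering in the ON clause and filtering in the WHERE clause can be seen on a reduced version of the example. This is a sketch only, assuming SQLite via Python's sqlite3 module; the schema is cut down to the columns needed and the three customers are made up. It counts invoices instead of summing detail amounts to keep the data small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (custId INTEGER, name TEXT);
    CREATE TABLE invoice  (invoiceId INTEGER, custId INTEGER, processDate TEXT);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    -- Ann has a processed invoice, Bob an unprocessed one, Cy has none.
    INSERT INTO invoice VALUES (10, 1, '2009-08-15'), (11, 2, NULL);
""")

# Filter placed in the ON clause: it restricts which invoices match,
# but every customer still appears in the result.
filter_in_on = conn.execute("""
    SELECT c.name, COUNT(i.invoiceId)
    FROM customer c
    LEFT JOIN invoice i
      ON i.custId = c.custId
     AND i.processDate IS NOT NULL
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# Same filter placed in WHERE: it runs after the outer join, so the
# null-extended rows (and unprocessed invoices) are discarded entirely.
filter_in_where = conn.execute("""
    SELECT c.name, COUNT(i.invoiceId)
    FROM customer c
    LEFT JOIN invoice i
      ON i.custId = c.custId
    WHERE i.processDate IS NOT NULL
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(filter_in_on)
print(filter_in_where)
```

The second query loses Bob and Cy, which is the "eliminates customers" behavior described above.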
Some people will say that this style is less readable, but that's a matter of habit. From a performance point of view, it doesn't matter, since the query optimizer takes care of that.
"I have since learned that this is not standard SQL syntax."
That's not quite true. The "a, b WHERE" syntax is from the ANSI-89 standard; the "a JOIN b ON" syntax is ANSI-92. However, the 89 syntax is deprecated, which means you should not use it for new queries.
Also, there are some situations where the older style lacks expressive power, especially with regard to outer joins or complex queries.
It can be a pain going through the where clause trying to pick out join conditions. For anything more than one join the old style is absolute evil. And once you know the new style, you may as well just keep using it.
This is a standard SQL syntax, just an older standard than JOIN. There's a reason that the syntax has evolved and you should use the newer JOIN syntax because:
It's more expressive, clearly indicating which tables are JOINed, the JOIN order, which conditions apply to which JOIN, and separating out the filtering WHERE conditions from the JOIN conditions.
It supports LEFT, RIGHT, and FULL OUTER JOINs, which the WHERE syntax does not.
I don't think you'll find the WHERE-type JOIN substantially less portable than the JOIN syntax.
As long as you don't use the ANSI natural join feature I'm OK with it.
I found this quote by – ScottCher, I totally agree:
I find the WHERE syntax easier to read than INNER JOIN - I guess its like Vegemite. Most people in the world probably find it disgusting but kids brought up eating it love it.
It really depends on habits, but I have always found Oracle's comma-separated syntax more natural. The first reason is that I think using (INNER) JOIN diminishes readability. The second is about flexibility. In the end, a join is a cartesian product by definition. You do not necessarily have to restrict the results based on IDs of both tables. Although very seldom, one might well need the cartesian product of two tables. Restricting them based on IDs is just a very reasonable practice, but NOT A RULE. However, if you use the JOIN keyword in e.g. SQL Server, it won't let you omit the ON clause. Suppose you want to create a combination list. You have to do it like this:
SELECT *
FROM numbers
JOIN letters
ON 1=1
Apart from that, I find the (+) syntax of Oracle also very reasonable. It is a nice way to say, "Add this record to the resultset too, even if it is null." It is way better than the RIGHT/LEFT JOIN syntax, because in fact there is no left or right! When you want to join 10 tables with several different types of outer joins, it gets confusing which table is on the "left hand side" and which one on the right.
By the way, as a more general comment, I don't think SQL portability exists in the practical world any more. Standard SQL is so limited, and the expressiveness of DBMS-specific syntax is so often needed, that I don't think 100% portable SQL code is an achievable goal. The most obvious evidence for my observation is the good old row number problem. Just search any forum for "sql row number", including SO, and you will see hundreds of posts asking how it can be achieved in a specific DBMS. Limiting the number of returned rows is a similar, related example.
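For the combination list mentioned above, standard SQL also offers an explicit CROSS JOIN, which states the intent without a dummy ON 1=1 condition. The sketch below assumes SQLite through Python's sqlite3 module, and the numbers/letters tables are given made-up single columns n and l; it simply checks that the three spellings return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE numbers (n INTEGER);
    CREATE TABLE letters (l TEXT);
    INSERT INTO numbers VALUES (1), (2), (3);
    INSERT INTO letters VALUES ('a'), ('b');
""")

queries = {
    "join_on_true": "SELECT n, l FROM numbers JOIN letters ON 1=1 ORDER BY n, l",
    "cross_join":   "SELECT n, l FROM numbers CROSS JOIN letters ORDER BY n, l",
    "comma":        "SELECT n, l FROM numbers, letters ORDER BY n, l",
}

# Run each spelling of the cartesian product and collect its rows.
results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}

for name, rows in results.items():
    print(name, len(rows))
```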
This is Transact SQL syntax, and I'm not quite sure how "unportable" it is - it is the main syntax used in Sybase, for example (Sybase supports ANSI syntax as well) as well as many other databases (if not all).
The main benefit of ANSI syntax is that it allows you to write some fairly tricky chained joins that T-SQL prohibits.
Speaking as someone who writes automated sql query transformers (inline view expansions, grafted joins, union factoring) and thinks of SQL as a data structure to manipulate: the non-JOIN syntax is far less pain to manipulate.
I can't speak to "harder to read" complaints; JOIN looks like a lunge toward relational algebra operators. Don't go there :-)
Actually, this syntax is more portable than a JOIN, because it will work with pretty much any database, whereas not everybody supports the JOIN syntax (Oracle Lite doesn't, for example [unless this has changed recently]).

Want to Understand the performance of join in MYSQL

There are different types of join in MySQL, such as:
1. JOIN
2. LEFT JOIN
3. RIGHT JOIN
4. INNER JOIN
5. LEFT OUTER JOIN
6. RIGHT OUTER JOIN
I want to know which one performs better in a query, and how to decide which one is suitable for a given query. For example, JOIN and INNER JOIN fetch the same data, so which one is suitable in that case?
I will repeat what @Dai said in the comments: joins should be chosen based on which operation you need, not on performance. The answers to this question cover what the different types of joins are. In particular I like this visual explanation.
Analyzing why a query is slow is usually done with EXPLAIN. It will tell you the plan for the query, and you can determine things like whether it's doing a full table scan and which columns might need to be indexed. Here is a good writeup of how to use EXPLAIN.
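As a rough illustration of that workflow, the sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module. This is an assumption made to keep the example self-contained: MySQL's EXPLAIN produces differently shaped output, but the idea of spotting a full scan and then checking the plan again after adding an index is the same. The orders table and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, i % 100, i * 1.5) for i in range(1000)],
)

def plan(sql):
    """Return the textual plan steps SQLite reports for a query."""
    # The human-readable description is the last column of each plan row.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT * FROM orders WHERE customer_id = 42"

plan_before = plan(query)   # no index yet: expect a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)    # the planner can now search via the index

print(plan_before)
print(plan_after)
```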

query optimizer operator choice - nested loops vs hash match (or merge)

One of my stored procedures was taking too long to execute. Looking at the query execution plan, I was able to locate the operation taking too long. It was a nested loop physical operator that had an outer table (65991 rows) and an inner table (19223 rows). On the nested loop it showed estimated rows = 1,268,544,993 (65991 multiplied by 19223).
I read a few articles on physical operators used for joins and got a bit confused about whether nested loop or hash match would have been better for this case. From what I could gather:
Hash Match - is used by the optimizer when no useful indexes are available, one table is substantially smaller than the other, or the tables are not sorted on the join columns. Also, a hash match might indicate that a more efficient join method (nested loops or merge join) could be used.
Question: Would hash match be better than nested loops in this scenario?
Thanks
ABSOLUTELY. A hash match would be a huge improvement. Creating the hash on the smaller 19,223 row table then probing into it with the larger 65,991 row table is a much smaller operation than the nested loop requiring 1,268,544,993 row comparisons.
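The difference in work can be sketched outside the database. The following is a toy model in Python, not how SQL Server implements either operator: it joins two small in-memory lists on their first element and counts row comparisons for the nested loop versus build-plus-probe steps for the hash join. The row counts (300 and 200) are made up to keep the run short.

```python
from collections import defaultdict

def nested_loop_join(outer, inner):
    """Compare every outer row with every inner row."""
    comparisons = 0
    matches = []
    for o in outer:
        for i in inner:
            comparisons += 1
            if o[0] == i[0]:
                matches.append((o, i))
    return matches, comparisons

def hash_join(build, probe):
    """Hash the build input once, then look up each probe row."""
    operations = 0
    table = defaultdict(list)
    for b in build:                      # build phase: one step per row
        operations += 1
        table[b[0]].append(b)
    matches = []
    for p in probe:                      # probe phase: one lookup per row
        operations += 1
        for b in table.get(p[0], []):
            matches.append((p, b))
    return matches, operations

outer = [(k % 150, "outer-%d" % k) for k in range(300)]   # larger input
inner = [(k, "inner-%d" % k) for k in range(200)]         # smaller input

nl_matches, nl_cost = nested_loop_join(outer, inner)
hj_matches, hj_cost = hash_join(build=inner, probe=outer)

print(nl_cost, hj_cost)
```

The nested loop cost grows with the product of the two row counts, while the hash join cost grows with their sum, which is why the gap becomes so large at 65,991 by 19,223 rows.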
The only reason the server would choose the nested loops is that it badly underestimated the number of rows involved. Do your tables have statistics on them, and if so, are they being updated regularly? Statistics are what enable the server to choose good execution plans.
If you've properly addressed statistics and are still having a problem you could force it to use a HASH join like so:
SELECT *
FROM
TableA A -- the smaller table
LEFT HASH JOIN TableB B -- the larger table
ON A.Id = B.Id -- placeholder join columns; substitute your own
Please note that the moment you do this it will also force the join order. This means you have to arrange all your tables correctly so that their join order makes sense. Generally you would examine the execution plan the server already has and alter the order of your tables in the query to match. If you're not familiar with how to do this, the basics are that each "left" input comes first, and in graphical execution plans, the left input is the upper one. A complex join involving many tables may have to group joins together inside parentheses, or use RIGHT JOIN in order to get the execution plan to be optimal (swap left and right inputs, but introduce the table at the correct point in the join order).
It is generally best to avoid using join hints and forcing join order, so do whatever else you can first! You could look into the indexes on the tables, fragmentation, reducing column sizes (such as using varchar instead of nvarchar where Unicode is not required), or splitting the query into parts (insert to a temp table first, then join to that).
I would not recommend trying to "fix" the plan by forcing the hints in one direction or another. Instead, you need to look at your indexes, statistics and the T-SQL code to understand why you have a Table Spool loading up 1.2 billion rows from 19,000.

What's better: joins or multiple sub-select statements as part of one query

Performance wise, what is better?
If I have 3 or 4 join statements in my query or use embedded select statements to pull the same information from my database as part of one query?
I would say joins are better because:
They are easier to read.
You have more control over whether you want to do an inner, left/right outer join or full outer join
join statements cannot be so easily abused to create query abominations
with joins it is easier for the query optimizer to create a fast query (if the inner select is simple, it might work out the same, but with more complicated stuff joins will work better).
embedded selects can only simulate left/right outer joins.
Sometimes you cannot do stuff using joins, in that case (and only then) you'll have to fall back on an inner select.
It rather depends on your database: sizes of tables particularly, but also the memory parameters and sometimes even how the tables are indexed.
On less than current versions of MySQL, there was a real possibility of a query with a sub-select being considerably slower than a query that would return the same results structured with a join. (In the MySQL 4.1 days, I have seen the difference to be greater than an order of magnitude.) As a result, I prefer to build queries with joins.
That said, there are some types of queries that are extremely difficult to build with a join and a sub-select is the only way to really do it.
Assuming the database engine does absolutely no optimization, I would say it depends on how consistent you need your data to be. If you're doing multiple SELECT statements on a busy database, where the data you are looking at may change rapidly, you may run into issues where your data does not match up, between queries.
Assuming your data contains no inter-dependencies, then multiple queries will work fine. However, if your data requires consistency, use a single query.
This viewpoint boils down to keeping your data transactionally safe. Consider the situation where you have to pull a total of all accounts receivable, which is kept in a separate table from the monetary transaction amounts. If someone were to add another transaction in between your two queries, the accounts receivable total would not match the sum of the transaction amounts.
Most databases will optimize both queries below into the same plan, so whether you do:
select A.a1, B.b1 from A left outer join B on A.id = B.a_id
or
select A.a1, (select B.b1 from B where B.a_id = A.id) as b1 from A
It ends up being the same. However, in most cases for non-trivial queries you'd better stick with joins whenever you can, especially since some types of joins (such as an inner join) are not possible to achieve using sub-selects.
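The equivalence of the two queries above can be checked on a small data set. This is a sketch assuming SQLite via Python's sqlite3 module and made-up rows; note the extra assumption, enforced here with a UNIQUE constraint, that each row in A has at most one matching row in B. Without that, the join returns one row per match while the scalar sub-select does not, so the two forms stop being interchangeable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY, a1 TEXT);
    CREATE TABLE B (a_id INTEGER UNIQUE, b1 TEXT);  -- at most one B row per A row
    INSERT INTO A VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO B VALUES (1, 'p'), (3, 'q');
""")

join_rows = conn.execute("""
    SELECT A.a1, B.b1
    FROM A LEFT OUTER JOIN B ON A.id = B.a_id
    ORDER BY A.id
""").fetchall()

subselect_rows = conn.execute("""
    SELECT A.a1, (SELECT B.b1 FROM B WHERE B.a_id = A.id) AS b1
    FROM A
    ORDER BY A.id
""").fetchall()

print(join_rows)
print(subselect_rows)
```

Whether the optimizer actually produces the same plan for both is DBMS-specific; this only confirms that the results agree.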

MySQL FULLTEXT Search Across >1 Table

As a more general case of this question because I think it may be of interest to more people...What's the best way to perform a fulltext search on two tables? Assume there are three tables, one for programs (with submitter_id) and one each for tags and descriptions with object_id: foreign keys referring to records in programs. We want the submitter_id of programs with certain text in their tags OR descriptions. We have to use MATCH AGAINST for reasons that I won't go into here. Don't get hung up on that aspect.
programs
    id
    submitter_id
tags_programs
    object_id
    text
descriptions_programs
    object_id
    text
The following works and executes in 20 ms or so:
SELECT p.submitter_id
FROM programs p
WHERE p.id IN
(SELECT t.object_id
FROM tags_programs t
WHERE MATCH (t.text) AGAINST ('china')
UNION ALL
SELECT d.object_id
FROM descriptions_programs d
WHERE MATCH (d.text) AGAINST ('china'))
but I tried to rewrite this as a JOIN as follows and it runs for a very long time. I have to kill it after 60 seconds.
SELECT p.id
FROM descriptions_programs d, tags_programs t, programs p
WHERE (d.object_id=p.id AND MATCH (d.text) AGAINST ('china'))
OR (t.object_id=p.id AND MATCH (t.text) AGAINST ('china'))
Just out of curiosity I replaced the OR with AND. That also runs in a few milliseconds, but it's not what I need. What's wrong with the second query above? I can live with the UNION and subselects, but I'd like to understand.
Join after the filters (e.g. join the results), don't try to join and then filter.
The reason is that you lose use of your fulltext index.
Clarification in response to the comment: I'm using the word join generically here, not as JOIN but as a synonym for merge or combine.
I'm essentially saying you should use the first (faster) query, or something like it. The reason it's faster is that each of the subqueries is sufficiently uncluttered that the db can use that table's full text index to do the select very quickly. Joining the two (presumably much smaller) result sets (with UNION) is also fast. This means the whole thing is fast.
The slow version winds up walking through lots of data testing it to see if it's what you want, rather than quickly winnowing the data down and only searching through rows you are likely to actually want.
Just in case you don't know: MySQL has a built-in statement called EXPLAIN that can be used to see what's going on under the surface. There are a lot of articles about this, so I won't go into any detail, but for each table it provides an estimate of the number of rows it will need to process. If you look at the "rows" column in the EXPLAIN result for the second query, you'll probably see that the number of rows is quite large, and certainly a lot larger than for the first one.
The net is full of warnings about using subqueries in MySQL, but it turns out that many times the developer is smarter than the MySQL optimizer. Filtering results in some manner before joining can cause major performance boosts in many cases.
If you join both tables you end up having lots of records to inspect. Just as an example, if both tables have 100,000 records, fully joining them gives you 10,000,000,000 records (10 billion!).
If you replace the OR with AND, then you allow the engine to filter out all records from table descriptions_programs that don't match 'china', and only then join with tags_programs.
Anyway, that's not what you need, so I'd recommend sticking to the UNION way.
The union is the proper way to go. The join will pull in both full text indexes at once and can multiply the number of checks actually performed.
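The row multiplication described in these answers can be reproduced on a tiny data set. This is a sketch under clear assumptions: it uses SQLite via Python's sqlite3 module, and LIKE '%china%' stands in for MATCH ... AGAINST, since MySQL FULLTEXT indexes are not available there. It therefore shows the shape of the results, not the index behavior or timings. All rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE programs (id INTEGER PRIMARY KEY, submitter_id INTEGER);
    CREATE TABLE tags_programs (object_id INTEGER, text TEXT);
    CREATE TABLE descriptions_programs (object_id INTEGER, text TEXT);
    INSERT INTO programs VALUES (1, 501), (2, 502), (3, 503);
    INSERT INTO tags_programs VALUES (1, 'travel'), (2, 'china'), (3, 'food');
    INSERT INTO descriptions_programs VALUES
        (1, 'a trip to china'), (2, 'cooking'), (3, 'hiking');
""")

# Filter each table first, then combine the small result sets.
union_rows = conn.execute("""
    SELECT p.id
    FROM programs p
    WHERE p.id IN
        (SELECT t.object_id FROM tags_programs t
         WHERE t.text LIKE '%china%'
         UNION ALL
         SELECT d.object_id FROM descriptions_programs d
         WHERE d.text LIKE '%china%')
""").fetchall()

# Three-way comma join with OR: each matching program is paired with
# every row of the table that the matching branch does not constrain.
or_join_rows = conn.execute("""
    SELECT p.id
    FROM descriptions_programs d, tags_programs t, programs p
    WHERE (d.object_id = p.id AND d.text LIKE '%china%')
       OR (t.object_id = p.id AND t.text LIKE '%china%')
""").fetchall()

print(sorted(union_rows))
print(len(or_join_rows), sorted(set(or_join_rows)))
```

Both queries find the same two programs, but the OR join returns each of them once per row of the unconstrained table, which is the blow-up that makes it slow on real data.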