SQL: difference between INNER JOIN and INNER SELECT in particular case - mysql

I have started learning SQL, and I find that we can often achieve the same result with either JOINs or inner SELECT statements (subqueries).
Question 1 (broad): When are JOINs faster than inner selects, and vice versa?
Question 2 (narrow): Can you explain what causes the performance difference between the three queries below?
P.S. There is a very nice site that estimates query performance, but I can't understand its estimation results.
Query1:
SELECT DISTINCT maker
FROM Product pro INNER JOIN Printer pri
on pro.model = pri.model
Query2:
SELECT DISTINCT maker
FROM Product
WHERE model IN (
SELECT model FROM Printer
)
Query3:
SELECT distinct maker
FROM Product pro, Printer pri
WHERE pro.model = pri.model

When the server evaluates a JOIN, it matches the join condition by scanning only the columns needed to look up values in the other table and filters out everything else; this is usually done with a dedicated join operation.
When you have a subquery, the server needs to evaluate the plan for the subquery before it can match the join condition, so unless the subquery pays for that extra effort by filtering out a lot of noise, you get better performance without it.
Servers are quite smart: they try to trim away everything they don't need in order to evaluate the join, and then they try to use every index they can to get the best performance. Here "best" means the best plan they can find in a limited amount of time, so that planning time itself doesn't kill performance.
Added after the OP's comment:
The O(n) estimate depends on the complexity of the query and the subquery. If you are interested in how the query plan is built, you will have to dig through the documentation of your database of choice, and you will probably not find much if the database is not open source.
In layman's terms:
a simple join is evaluated on one level, the main query plan
a subquery is evaluated on two levels, the subquery plan and the main query plan.
Some database IDEs can display a visual representation of the total plan, which usually helps to understand some of these points (I don't know if MySQL has that).
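If no visual tool is at hand, the plan can also be requested as text. The sketch below is only an illustration: it uses Python's built-in sqlite3 module instead of MySQL so that it is self-contained (in MySQL you would prefix the query with EXPLAIN rather than EXPLAIN QUERY PLAN), and the table definitions and rows are made up. It runs the three queries from the question, checks that they return the same makers, and prints the plan the engine chose for each.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data; the question does not show the real tables.
conn.executescript("""
    CREATE TABLE Product (maker TEXT, model TEXT PRIMARY KEY);
    CREATE TABLE Printer (model TEXT PRIMARY KEY, price INTEGER);
    INSERT INTO Product VALUES ('A', 'p1'), ('A', 'p2'), ('B', 'p3'), ('C', 'c1');
    INSERT INTO Printer VALUES ('p1', 100), ('p2', 150), ('p3', 90);
""")

queries = {
    "Query1 (INNER JOIN)": """
        SELECT DISTINCT maker
        FROM Product pro INNER JOIN Printer pri ON pro.model = pri.model""",
    "Query2 (IN subquery)": """
        SELECT DISTINCT maker
        FROM Product
        WHERE model IN (SELECT model FROM Printer)""",
    "Query3 (comma join)": """
        SELECT DISTINCT maker
        FROM Product pro, Printer pri
        WHERE pro.model = pri.model""",
}

results = {}
for name, sql in queries.items():
    # Sort so the comparison does not depend on row order.
    results[name] = sorted(row[0] for row in conn.execute(sql))
    print(name, "->", results[name])
    # The last column of each plan row is the human-readable step.
    for plan_row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("    ", plan_row[-1])
```

The plan text differs between engines and versions, so treat the printed steps as something to read rather than to rely on; what carries over is the habit of comparing the plans of the alternative formulations.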

Query1 is faster in general, but the RDBMS could optimize Query2 to give approximately the same performance.
If the IN subquery is rather complicated, with dependencies on the main table(s), it could be executed for each retrieved row to check the condition.

Normally, an INNER JOIN joins the values of two different tables, whereas an inner SELECT selects a particular value from a different table and uses the result to produce a single output.

Related

In SQL, is it true that the shorter the query, the faster it runs?

I'm very curious to know, because when I run these two queries in MySQL, the shorter one runs faster:
SELECT FirstName, LastName, City, State
FROM Person
LEFT JOIN Address ON Person.PersonId = Address.PersonId;
and
SELECT FirstName, LastName, City, State
FROM Person
LEFT OUTER JOIN Address ON Person.PersonId = Address.PersonId;
In addition, I want to ask whether the same thing happens in other RDBMSs such as Postgres, MS SQL Server, Oracle and SQLite.
LEFT OUTER JOIN is just a synonym for LEFT JOIN, and the execution plan for both of your queries should be identical. Therefore, I would attribute any difference in performance to things other than the use of OUTER in the left join syntax. You should check what other tasks might be running on your database during the test, and also what other processes might be running on the OS.
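The claim that both spellings produce the same plan can be checked directly. The following is a minimal sketch using Python's built-in sqlite3 module (not MySQL, so it only illustrates the idea); the table definitions, the index name idx_address_person and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data for illustration only.
conn.executescript("""
    CREATE TABLE Person (PersonId INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
    CREATE TABLE Address (AddressId INTEGER PRIMARY KEY, PersonId INTEGER,
                          City TEXT, State TEXT);
    CREATE INDEX idx_address_person ON Address (PersonId);
    INSERT INTO Person VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray');
    INSERT INTO Address VALUES (10, 1, 'Austin', 'TX');
""")

template = """
    SELECT FirstName, LastName, City, State
    FROM Person
    {join} Address ON Person.PersonId = Address.PersonId
    ORDER BY Person.PersonId
"""
short_sql = template.format(join="LEFT JOIN")
long_sql = template.format(join="LEFT OUTER JOIN")

def plan(sql):
    # Keep only the human-readable detail column of each plan row.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_short, plan_long = plan(short_sql), plan(long_sql)
rows_short = conn.execute(short_sql).fetchall()
rows_long = conn.execute(long_sql).fetchall()

print(plan_short == plan_long)  # same plan steps for both spellings
print(rows_short == rows_long)  # same rows for both spellings
```

If the plans are identical, any timing difference between two runs has to come from outside the query text: caching, concurrent load, or measurement noise.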
First, the performance of a query is not based on how long it is. There are many "obvious" reasons for this, but perhaps the most obvious is that the amount of data is determined by the size of the tables in the FROM clause, which has nothing to do with how many characters are in their names.
Second, LEFT JOIN and LEFT OUTER JOIN are synonyms. No doubt the OUTER takes a few more compute cycles to parse on a modern computer, but the difference would be measured in fractions of a microsecond (probably) and is practically immeasurable.
Some of the things that affect the performance of queries are:
The size of the data.
Operations such as GROUP BY and ORDER BY.
JOINs, particularly without indexes.
And much, much more.
No, you cannot correlate the length of a SQL statement's text to its execution duration. It's more accurate to correlate a statement's execution duration to its execution plan with respect to the cardinality of the plan's row source operations.
Existence proof:
Take a billion row table, T, with a single column primary key, P
The full table scan query
select * from T
will take longer to execute than the unique index scan query (followed by a table access by index rowid):
select * from T where P = :pval
Except during some odd maintenance operations (e.g. a global index rebuild over a partitioned table), the shorter query will no doubt take longer to execute than the longer query.
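The existence proof can be reproduced on a small scale. This sketch uses Python's built-in sqlite3 module and 10,000 rows instead of a billion (both are assumptions made to keep it runnable anywhere); the payload column is invented. It does not time anything, it only shows that the shorter statement gets a full scan while the longer one gets a keyed lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (P INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?)",
    ((i, "row %d" % i) for i in range(10000)),
)

def plan(sql, params=()):
    # Return the human-readable plan steps for a statement.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

scan_plan = plan("select * from T")
lookup_plan = plan("select * from T where P = :pval", {"pval": 42})

print("shorter query:", scan_plan)    # a SCAN step: every row is visited
print("longer query: ", lookup_plan)  # a SEARCH step: lookup by primary key
```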

How to convert a left join to a subquery?

I'm a beginner in MySQL. I have written a query that uses left joins to get the columns shown below, and I want to convert it to use subqueries. Please help me out.
SELECT b.service_status,
s.b2b_acpt_flag,
b2b.b2b_check_in_report,
b2b.b2b_swap_flag
FROM user_booking_tb AS b
LEFT JOIN b2b.b2b_booking_tbl AS b2b ON b.booking_id=b2b.gb_booking_id
LEFT JOIN b2b.b2b_status AS s ON b2b.b2b_booking_id = s.b2b_booking_id
WHERE b.booking_id='$booking_id'
In this case I would actually recommend the join, which should generally be quicker as long as you have proper indexes on the joining columns in both tables.
Even with subqueries, you will still want those same joins.
The size and nature of your actual data will affect performance, so to know for sure it is best to test both options and measure the results. Beware, however, that the optimal query can change as your tables grow.
SELECT b.service_status,
-- b2b_status is reached through b2b_booking_tbl, as in the original joins
(SELECT s.b2b_acpt_flag
FROM b2b.b2b_booking_tbl AS b2b
JOIN b2b.b2b_status AS s ON b2b.b2b_booking_id = s.b2b_booking_id
WHERE b2b.gb_booking_id = b.booking_id) AS b2b_acpt_flag,
(SELECT b2b_check_in_report FROM b2b.b2b_booking_tbl WHERE gb_booking_id = b.booking_id) AS b2b_check_in_report,
(SELECT b2b_swap_flag FROM b2b.b2b_booking_tbl WHERE gb_booking_id = b.booking_id) AS b2b_swap_flag
FROM user_booking_tb AS b
WHERE b.booking_id='$booking_id'
To dig into how this query works, you are effectively performing 3 additional queries for each and every row returned by the main query.
If b.booking_id='$booking_id' is unique, this is an extra 3 queries, but if there may be multiple entries, this could multiply and become quite slow.
Each of these extra queries will be fast: no network overhead, a single row, hopefully matched on a primary key. So 3 extra queries add only nominal overhead, as long as the number of rows is low.
A join results in a single query across 2 indexed tables, which will often shave off a few milliseconds.
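To check that a subquery rewrite really returns what the joins return, both forms can be run side by side on a few rows. The sketch below uses Python's built-in sqlite3 module; the column types, the sample data and the absence of the b2b schema prefix are assumptions made to keep it self-contained. It also binds the booking id as a parameter instead of interpolating '$booking_id' into the SQL string, and its subquery form follows the same path as the original joins (b2b_status is reached through b2b_booking_tbl).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified versions of the tables in the question.
conn.executescript("""
    CREATE TABLE user_booking_tb (booking_id INTEGER PRIMARY KEY, service_status TEXT);
    CREATE TABLE b2b_booking_tbl (b2b_booking_id INTEGER PRIMARY KEY,
                                  gb_booking_id INTEGER UNIQUE,
                                  b2b_check_in_report INTEGER,
                                  b2b_swap_flag INTEGER);
    CREATE TABLE b2b_status (b2b_booking_id INTEGER PRIMARY KEY, b2b_acpt_flag INTEGER);
    INSERT INTO user_booking_tb VALUES (1, 'Completed'), (2, 'Pending');
    INSERT INTO b2b_booking_tbl VALUES (501, 1, 1, 0);
    INSERT INTO b2b_status VALUES (501, 1);
""")

join_sql = """
    SELECT b.service_status, s.b2b_acpt_flag, b2b.b2b_check_in_report, b2b.b2b_swap_flag
    FROM user_booking_tb AS b
    LEFT JOIN b2b_booking_tbl AS b2b ON b.booking_id = b2b.gb_booking_id
    LEFT JOIN b2b_status AS s ON b2b.b2b_booking_id = s.b2b_booking_id
    WHERE b.booking_id = ?
"""

subquery_sql = """
    SELECT b.service_status,
           (SELECT s.b2b_acpt_flag
            FROM b2b_booking_tbl AS b2b
            JOIN b2b_status AS s ON b2b.b2b_booking_id = s.b2b_booking_id
            WHERE b2b.gb_booking_id = b.booking_id) AS b2b_acpt_flag,
           (SELECT b2b_check_in_report FROM b2b_booking_tbl
            WHERE gb_booking_id = b.booking_id) AS b2b_check_in_report,
           (SELECT b2b_swap_flag FROM b2b_booking_tbl
            WHERE gb_booking_id = b.booking_id) AS b2b_swap_flag
    FROM user_booking_tb AS b
    WHERE b.booking_id = ?
"""

comparison = {}
for booking_id in (1, 2):  # booking 2 has no B2B rows at all
    joined = conn.execute(join_sql, (booking_id,)).fetchall()
    nested = conn.execute(subquery_sql, (booking_id,)).fetchall()
    comparison[booking_id] = (joined, nested)
    print(booking_id, joined, nested)
```

The two forms only agree while each scalar subquery matches at most one row, which is why gb_booking_id is declared UNIQUE in this sketch; with duplicates the join returns extra rows, and how the subquery form behaves depends on the engine.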
Another instance where a subquery may work is where you are filtering the results rather than adding extra columns to output.
SELECT b.*
FROM user_booking_tb AS b
WHERE b.booking_id in (SELECT booking_id FROM othertable WHERE this=this and that=that)
Which is more efficient depends on how large the typical list of booking_ids is.

What is a "point-in-select" in MySQL?

I was given this query to update a report, and it was taking a long time to run on my computer.
select
c.category_type, t.categoryid, t.date, t.clicks
from transactions t
join category c
on c.category_id = t.categoryid
I asked the DBA if there were any issues with the query, and the DBA optimized the query in this manner:
select
(select category_type
from category c where c.category_id = t.categoryid) category_type,
categoryid,
date, clicks
from transactions t
He described the first subquery as a "point-in-select". I have never heard of this before. Can someone explain this concept?
I want to note that the two queries are not the same, unless the following is true:
transactions.categoryid is always present in category.
category has no duplicate values of category_id.
In practice, these would be true (in most databases). The first query should be using a left join version for closer equivalence:
select c.category_type, t.categoryid, t.date, t.clicks
from transactions t left join
category c
on c.category_id = t.categoryid;
Still not exactly the same, but more similar.
Finally, both versions should make use of an index on category(category_id), and I would expect the performance to be very similar in MySQL.
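The difference between the versions shows up as soon as a transaction refers to a category that does not exist. A small sketch using Python's built-in sqlite3 module, with made-up rows and assumed column types (the category id 99 is deliberately missing from category):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: category 99 has no row in the category table.
conn.executescript("""
    CREATE TABLE category (category_id INTEGER PRIMARY KEY, category_type TEXT);
    CREATE TABLE transactions (categoryid INTEGER, date TEXT, clicks INTEGER);
    INSERT INTO category VALUES (1, 'books'), (2, 'music');
    INSERT INTO transactions VALUES
        (1, '2020-01-01', 5), (2, '2020-01-02', 7), (99, '2020-01-03', 3);
""")

inner_rows = conn.execute("""
    select c.category_type, t.categoryid, t.date, t.clicks
    from transactions t
    join category c on c.category_id = t.categoryid
    order by t.date
""").fetchall()

subquery_rows = conn.execute("""
    select (select category_type
            from category c where c.category_id = t.categoryid) category_type,
           categoryid, date, clicks
    from transactions t
    order by date
""").fetchall()

left_rows = conn.execute("""
    select c.category_type, t.categoryid, t.date, t.clicks
    from transactions t
    left join category c on c.category_id = t.categoryid
    order by t.date
""").fetchall()

print(len(inner_rows))             # the inner join drops the unmatched transaction
print(len(subquery_rows))          # the scalar subquery keeps it, with a NULL type
print(subquery_rows == left_rows)  # the left join matches the subquery form here
```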
Your DBA's query is not the same, as others noted, and afaik nonstandard SQL. Yours is much preferable just for its simplicity alone.
It's usually not advantageous to re-write queries for performance. It can help sometimes, but the DBMS is supposed to execute logically equivalent queries equivalently. Failure to do so is a flaw in the query planner.
Performance issues are often a function of physical design. In your case, I would look for indexes on the category and transactions tables that contain categoryid as the first column. If neither exists, your join is O(mn), because the category table must be scanned for each transaction row.
Not being a MySQL user, I can only advise you to get query planner output and look for indexing opportunities.

mysql efficient select query for each user (use join or not)

I have the following job select query (the actual query is massive with many joins)
Main query:
SELECT id FROM job LIMIT 20
When a user logs in, I want to select whether that user has saved each job (many users, but one main result).
My question is: which method is more efficient (thinking in terms of query cache, buffer pool, etc.)?
method 1:
SELECT id, Member
FROM job AS t1 LEFT JOIN Member AS t2 ON (t1.id=t2.Job) LIMIT 20
(i.e. if the user hasn't saved the job, Member is returned as NULL)
method 2:
Use the main query for the main result, then run a select for each result (i.e. loop over SQL queries):
SELECT Member FROM Member WHERE Member=1 AND Job=(each job id)
Generally speaking, it's more efficient to bundle into a single query to reduce round-trips to the DB, query parsing cost, etc. However, if bundling into a single query means returning a bunch of data you don't need, then it may not be more efficient than individual queries.
The method with the left join is almost guaranteed to be more efficient.
While the underlying logic is essentially the same (ie: you're comparing to the same column against the same value) and while the LEFT JOIN query will be more expensive to parse, when using LEFT JOIN, you're only doing 1 round-trip to the server while with the looped query you're doing multiple round trips to achieve the same result.
Generally speaking, it's difficult to give generalized answers to such questions since much of it depends on your specific setup and data, but in this case you should see the single LEFT JOIN query being significantly faster.
I've benchmarked such queries in the past, and the LEFT JOIN approach was always faster.
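The two methods can be compared side by side on a toy data set. The sketch below uses Python's built-in sqlite3 module with an in-memory database, so there is no real network round trip to measure; it only counts statements. The table definitions and rows are assumptions, and the LEFT JOIN adds a `t2.Member = ?` condition (not present in the question's method 1) so that only the logged-in user's saved jobs are matched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: Member(Member, Job) records which member saved which job.
conn.executescript("""
    CREATE TABLE job (id INTEGER PRIMARY KEY);
    CREATE TABLE Member (Member INTEGER, Job INTEGER, PRIMARY KEY (Member, Job));
    INSERT INTO job VALUES (1), (2), (3);
    INSERT INTO Member VALUES (1, 2), (2, 1), (2, 3);
""")
member_id = 1  # the logged-in user (assumed value)

# Method 1: a single statement.
joined = conn.execute("""
    SELECT t1.id, t2.Member
    FROM job AS t1
    LEFT JOIN Member AS t2 ON t1.id = t2.Job AND t2.Member = ?
    ORDER BY t1.id
    LIMIT 20
""", (member_id,)).fetchall()

# Method 2: the main query plus one extra statement per job.
looped = []
statements = 1
for (job_id,) in conn.execute("SELECT id FROM job ORDER BY id LIMIT 20").fetchall():
    row = conn.execute(
        "SELECT Member FROM Member WHERE Member = ? AND Job = ?",
        (member_id, job_id),
    ).fetchone()
    statements += 1
    looped.append((job_id, row[0] if row else None))

print(joined == looped)  # both methods produce the same rows
print(statements)        # method 2 needed this many statements, method 1 needed one
```

Against a real server each of those extra statements is a separate round trip, which is where the looped version tends to lose time.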

Should criteria be duplicated on subqueries

I have a query which actually runs two queries on a table. I query the whole table, compute a datediff, and then use a subquery which tells me the sum of hours each unit spent in certain operational steps. The main query limits the results to the REP depot, so technically I don't need to put that same criterion on the subquery, since repair_order is unique.
Would it be faster, slower or no difference to apply the depot filter on the subquery?
SELECT
*,
DATEDIFF(date_shipped, date_received) as htg_days,
(SELECT SUM(t3.total_days) FROM report_tables.cycle_time_days as t3 WHERE t1.repair_order=t3.repair_order AND (operation='MFG' OR operation='ENG' OR operation='ENGH' OR operation='HOLD') GROUP BY t3.repair_order) as subt_days
FROM
report_tables.cycle_time_days as t1
WHERE
YEAR(t1.date_shipped)=2010
AND t1.depot='REP'
GROUP BY
repair_order
ORDER BY
date_shipped;
I run into this in a lot of situations, but I never know whether it would be better to put the filter in the subquery, the main query, or both.
In this example, it would actually alter the query if you moved your WHERE clause to filter by REP into the subquery. So it wouldn't be about performance at that point, it would be about getting the same result set. In general, though, if you will get the same exact result set by moving a WHERE clause elsewhere in a complex query, it is better to do so at the most atomic level possible, ie, in the subquery. Then the subquery returns a smaller result set to the main query before the main query has to process it.
The answer to your question will vary depending on your schema, the complexity of your queries, the reliability of your data, etc. A general rule of thumb is to try to process the least amount of data possible, which generally means filtering it at the lowest level possible as well.
When you want to optimize a query the absolute number one place to start is to use the EXPLAIN output to see what optimizations the query parser was able to figure out and check to see what the weakest link is in the query plan. Resolve that, rinse, repeat.
You can also use EXPLAIN's "extended" keyword to see the actual query it built to run, which reveals more about how it uses your criteria. In some cases, it will optimize away duplicate conditions between parent and subqueries. In other cases, it may push the conditions down from the parent into the subquery. In some cases, for (too) complex queries, I've seen it repeat a condition that was specified only once in the query. Thankfully, you don't have to guess: MySQL's explain plan will reveal all, albeit sometimes in cryptic ways.
I usually use a derived table as a "driver or aggregating" query, then join that result back onto whatever table I want to pull data from:
select
t1.*,
datediff(t1.date_shipped, t1.date_received) as htg_days,
subt_days.total_days
from
cycle_time_days as t1
inner join
(
-- aggregating/driver query
select
repair_order,
sum(total_days) as total_days
from
cycle_time_days
where
year(date_shipped) = 2010 and depot = 'REP' and
operation in ('MFG','ENG','ENGH','HOLD') -- covering index on date, depot, op ???
group by
repair_order -- indexed ??
having
total_days > 14 -- added for demonstration purposes
order by
total_days desc limit 10
) as subt_days on t1.repair_order = subt_days.repair_order
order by
t1.date_shipped;