Optimizing Queries in MySQL

Is Query 1 any better optimized than Query 2, even slightly, for example on a larger database? Or am I just doubling the work with the additional WHERE clause?
Query 1:
SELECT sample_data
FROM table1 INNER JOIN table2 ON table1.key = table2.key
WHERE table1.key = table2.key;
Query 2:
SELECT sample_data
FROM table1 INNER JOIN table2 ON table1.key = table2.key;
I ask because I read an article saying that using filters in JOIN clauses improves performance.

Is Query 1 more optimized say for example for a larger database than Query 2?
No, it is not more optimized. Query 2 is the correct way to handle the JOIN. Query 1 does the same thing, but with extra verbiage for the MySQL server software to scrub out as it figures out how to satisfy your query.
The advice in the Adobe documentation about filtering both tables in a join does not relate to the join's ON condition. Their example says to do this:
SELECT whatever, whatever
FROM table1
JOIN table2 ON table2.table1_id = table1.table1_id
WHERE table1.date >= '2021-01-01'
AND table2.date >= '2021-01-01' /* THIS LINE IS WHAT THEY SUGGEST */
Their suggestion, from 2015, has to do with filtering non-join attributes from both tables. It's a suggestion to use to optimize a query if it just isn't fast enough for you. And, in my experience, it's not a very good suggestion. Ignore it, at least for now. More recent MySQL versions have gotten more efficient.
Let me add to this. SQL is a so-called "declarative" language. You declare what you want and the MySQL server figures out how to get it for you. SQL software is getting really good at doing that; keep in mind that MySQL is now a quarter century old. In that time its programmers have been continuously making it smarter at figuring out how to get stuff. You probably can't outsmart it. But you may need to add indexes when your tables get really big. https://use-the-index-luke.com/
Other languages are "procedural": you, as a programmer, spell out a procedure for getting what you want. You don't need to do that for SQL.

I like to put it this way:
ON is where you specify how the tables are related.
WHERE is for filtering.
That makes it easy for a human reading the query to understand it.
In reality (for MySQL), JOIN (aka INNER JOIN) treats ON and WHERE identically. That is, there is no performance difference. Your Query 1 unnecessarily specifies the "relation" twice.
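The equivalence of Query 1 and Query 2 can be checked on toy data. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the rows are made up, and the join column is named k instead of key to sidestep reserved-word quoting. It confirms that the results match; it says nothing about MySQL's execution plan, which you would inspect with EXPLAIN.

```python
import sqlite3

# SQLite stands in for MySQL here; tables and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (k INTEGER, sample_data TEXT);
    CREATE TABLE table2 (k INTEGER);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (2), (3), (4);
""")

# Query 1: the join condition repeated in WHERE.
query1 = """
    SELECT sample_data
    FROM table1 INNER JOIN table2 ON table1.k = table2.k
    WHERE table1.k = table2.k
    ORDER BY sample_data
"""

# Query 2: the join condition stated once, in ON.
query2 = """
    SELECT sample_data
    FROM table1 INNER JOIN table2 ON table1.k = table2.k
    ORDER BY sample_data
"""

rows1 = conn.execute(query1).fetchall()
rows2 = conn.execute(query2).fetchall()
print(rows1 == rows2)  # the redundant WHERE does not change the result
conn.close()
```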
Also, MySQL's Optimizer is smart enough to realize when two columns have the same value. For example,
SELECT ...
FROM a
JOIN bb ON a.foo = bb.foo
WHERE a.foo = 123
If the Optimizer decides that starting with the filter bb.foo = 123 is more optimal, it will do so. Note: This is not the same as the example in the article you pointed to, which joins on one thing (id) but filters on another (date). The two queries there are not equivalent!
LEFT JOIN necessarily treats ON and WHERE differently. (But that is another topic.)
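To illustrate why LEFT JOIN is different, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL. The tables a and bb come from the example above; the tag column and the data are assumptions added for the demonstration. The same predicate gives different rows depending on whether it sits in ON or in WHERE.

```python
import sqlite3

# SQLite stands in for MySQL; the tag column and all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (foo INTEGER);
    CREATE TABLE bb (foo INTEGER, tag TEXT);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO bb VALUES (1, 'x'), (2, 'y');
""")

# Predicate in ON: it only decides which bb rows match; every row of a is kept.
in_on = conn.execute("""
    SELECT a.foo, bb.tag
    FROM a LEFT JOIN bb ON a.foo = bb.foo AND bb.tag = 'x'
    ORDER BY a.foo
""").fetchall()

# Predicate in WHERE: it runs after the outer join and discards the NULL-extended rows.
in_where = conn.execute("""
    SELECT a.foo, bb.tag
    FROM a LEFT JOIN bb ON a.foo = bb.foo
    WHERE bb.tag = 'x'
    ORDER BY a.foo
""").fetchall()

print(in_on)
print(in_where)
conn.close()
```

With an INNER JOIN both placements would return only the single matching row, which is why the distinction matters for outer joins only.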

Related

SQL: INNER JOIN or WHERE? [duplicate]

For simplicity, assume all relevant fields are NOT NULL.
You can do:
SELECT
table1.this, table2.that, table2.somethingelse
FROM
table1, table2
WHERE
table1.foreignkey = table2.primarykey
AND (some other conditions)
Or else:
SELECT
table1.this, table2.that, table2.somethingelse
FROM
table1 INNER JOIN table2
ON table1.foreignkey = table2.primarykey
WHERE
(some other conditions)
Do these two work the same way in MySQL?
INNER JOIN is ANSI syntax that you should use.
It is generally considered more readable, especially when you join lots of tables.
It can also be easily replaced with an OUTER JOIN whenever a need arises.
The WHERE syntax is more relational model oriented.
The result of joining two tables is the Cartesian product of the tables with a filter applied that keeps only the rows whose joining columns match.
It's easier to see this with the WHERE syntax.
As for your example, in MySQL (and in SQL generally) these two queries are synonyms.
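A quick way to convince yourself that the two forms are synonyms is to run both on the same data. The sketch below uses Python's sqlite3 module rather than MySQL, keeps the table and column names from the question, and substitutes a made-up filter (table2.that = 'x') for "some other conditions".

```python
import sqlite3

# SQLite stands in for MySQL; the rows and the extra filter are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table2 (primarykey INTEGER PRIMARY KEY, that TEXT, somethingelse TEXT);
    CREATE TABLE table1 (this TEXT, foreignkey INTEGER NOT NULL);
    INSERT INTO table2 VALUES (10, 'x', 'p'), (20, 'y', 'q');
    INSERT INTO table1 VALUES ('r1', 10), ('r2', 20), ('r3', 30);
""")

# Comma (implicit) join: the join condition lives in WHERE.
implicit_rows = conn.execute("""
    SELECT table1.this, table2.that, table2.somethingelse
    FROM table1, table2
    WHERE table1.foreignkey = table2.primarykey
      AND table2.that = 'x'
    ORDER BY table1.this
""").fetchall()

# Explicit join: the join condition lives in ON, the filter stays in WHERE.
explicit_rows = conn.execute("""
    SELECT table1.this, table2.that, table2.somethingelse
    FROM table1 INNER JOIN table2
      ON table1.foreignkey = table2.primarykey
    WHERE table2.that = 'x'
    ORDER BY table1.this
""").fetchall()

print(implicit_rows == explicit_rows)
conn.close()
```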
Also, note that MySQL also has a STRAIGHT_JOIN clause.
Using this clause, you can control the JOIN order: which table is scanned in the outer loop and which one is in the inner loop.
You cannot control this in MySQL using WHERE syntax.
Others have pointed out that INNER JOIN helps human readability, and that's a top priority, I agree.
Let me try to explain why the join syntax is more readable.
A basic SELECT query is this:
SELECT stuff
FROM tables
WHERE conditions
The SELECT clause tells us what we're getting back; the FROM clause tells us where we're getting it from, and the WHERE clause tells us which ones we're getting.
JOIN is a statement about the tables, how they are bound together (conceptually, actually, into a single table).
Any query elements that control the tables - where we're getting stuff from - semantically belong to the FROM clause (and of course, that's where JOIN elements go). Putting joining-elements into the WHERE clause conflates the which and the where-from, that's why the JOIN syntax is preferred.
Applying conditional statements in ON / WHERE
Here I have explained the logical query processing steps.
Reference: Inside Microsoft® SQL Server™ 2005 T-SQL Querying
Publisher: Microsoft Press
Pub Date: March 07, 2006
Print ISBN-10: 0-7356-2313-9
Print ISBN-13: 978-0-7356-2313-2
Pages: 640
Inside Microsoft® SQL Server™ 2005 T-SQL Querying
(8) SELECT (9) DISTINCT (11) TOP <top_specification> <select_list>
(1) FROM <left_table>
(3) <join_type> JOIN <right_table>
(2) ON <join_condition>
(4) WHERE <where_condition>
(5) GROUP BY <group_by_list>
(6) WITH {CUBE | ROLLUP}
(7) HAVING <having_condition>
(10) ORDER BY <order_by_list>
The first noticeable aspect of SQL that is different than other programming languages is the order in which the code is processed. In most programming languages, the code is processed in the order in which it is written. In SQL, the first clause that is processed is the FROM clause, while the SELECT clause, which appears first, is processed almost last.
Each step generates a virtual table that is used as the input to the following step. These virtual tables are not available to the caller (client application or outer query). Only the table generated by the final step is returned to the caller. If a certain clause is not specified in a query, the corresponding step is simply skipped.
Brief Description of Logical Query Processing Phases
Don't worry too much if the description of the steps doesn't seem to make much sense for now. These are provided as a reference. Sections that come after the scenario example will cover the steps in much more detail.
(1) FROM: A Cartesian product (cross join) is performed between the first two tables in the FROM clause, and as a result, virtual table VT1 is generated.
(2) ON: The ON filter is applied to VT1. Only rows for which the <join_condition> is TRUE are inserted to VT2.
(3) OUTER (join): If an OUTER JOIN is specified (as opposed to a CROSS JOIN or an INNER JOIN), rows from the preserved table or tables for which a match was not found are added to the rows from VT2 as outer rows, generating VT3. If more than two tables appear in the FROM clause, steps 1 through 3 are applied repeatedly between the result of the last join and the next table in the FROM clause until all tables are processed.
(4) WHERE: The WHERE filter is applied to VT3. Only rows for which the <where_condition> is TRUE are inserted to VT4.
(5) GROUP BY: The rows from VT4 are arranged in groups based on the column list specified in the GROUP BY clause. VT5 is generated.
(6) CUBE | ROLLUP: Supergroups (groups of groups) are added to the rows from VT5, generating VT6.
(7) HAVING: The HAVING filter is applied to VT6. Only groups for which the <having_condition> is TRUE are inserted to VT7.
(8) SELECT: The SELECT list is processed, generating VT8.
(9) DISTINCT: Duplicate rows are removed from VT8. VT9 is generated.
(10) ORDER BY: The rows from VT9 are sorted according to the column list specified in the ORDER BY clause. A cursor is generated (VC10).
(11) TOP: The specified number or percentage of rows is selected from the beginning of VC10. Table VT11 is generated and returned to the caller.
Therefore, in this logical model, the (INNER JOIN) ON condition filters the data (reducing the row count of the virtual table) before the WHERE clause is applied, and subsequent joins operate on the already-filtered rows. Only after that does the WHERE clause apply its filter conditions. Keep in mind that this is the logical order; the optimizer may execute the steps in a different physical order as long as the result is the same.
(Whether a condition is placed in ON or in WHERE will not make much difference in some cases. It depends on how many tables you have joined and the number of rows in each joined table.)
The implicit join ANSI syntax is older, less obvious, and not recommended.
In addition, the relational algebra allows interchangeability of the predicates in the WHERE clause and the INNER JOIN, so even INNER JOIN queries with WHERE clauses can have the predicates rearranged by the optimizer.
I recommend you write the queries in the most readable way possible.
Sometimes this includes making the INNER JOIN relatively "incomplete" and putting some of the criteria in the WHERE simply to make the lists of filtering criteria more easily maintainable.
For example, instead of:
SELECT *
FROM Customers c
INNER JOIN CustomerAccounts ca
ON ca.CustomerID = c.CustomerID
AND c.State = 'NY'
INNER JOIN Accounts a
ON ca.AccountID = a.AccountID
AND a.Status = 1
Write:
SELECT *
FROM Customers c
INNER JOIN CustomerAccounts ca
ON ca.CustomerID = c.CustomerID
INNER JOIN Accounts a
ON ca.AccountID = a.AccountID
WHERE c.State = 'NY'
AND a.Status = 1
But it depends, of course.
Implicit joins (which is what your first query is known as) become much much more confusing, hard to read, and hard to maintain once you need to start adding more tables to your query. Imagine doing that same query and type of join on four or five different tables ... it's a nightmare.
Using an explicit join (your second example) is much more readable and easy to maintain.
I'll also point out that the older syntax is more prone to error. In standard SQL and in most databases, an inner join without an ON clause is a syntax error (MySQL is an exception: it accepts INNER JOIN without ON and returns a cross join). If you use the older syntax and forget one of the join conditions in the WHERE clause, you silently get a cross join. Developers often "fix" this by adding the DISTINCT keyword (rather than fixing the join, because they still don't realize the join itself is broken), which may appear to cure the problem but will slow down the query considerably.
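The forgotten-condition trap is easy to reproduce. The following sketch uses Python's sqlite3 module as a stand-in; the customers and orders tables and their rows are hypothetical. It shows the row count exploding when the join condition is missing, and DISTINCT hiding the duplicates while still returning a customer who has no orders at all.

```python
import sqlite3

# SQLite stands in for MySQL; both tables are made up for the demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (100, 1), (101, 1), (102, 2);
""")

# Comma join with its condition: one row per order.
correct = conn.execute("""
    SELECT c.name FROM customers c, orders o
    WHERE c.id = o.customer_id
""").fetchall()

# Same query with the condition forgotten: a silent cross join (3 x 3 rows).
forgotten = conn.execute("""
    SELECT c.name FROM customers c, orders o
""").fetchall()

# DISTINCT removes the duplicates but not the wrong rows:
# Cy has no orders, yet still shows up.
masked = conn.execute("""
    SELECT DISTINCT c.name FROM customers c, orders o
    ORDER BY c.name
""").fetchall()

print(len(correct), len(forgotten), masked)
conn.close()
```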
Additionally for maintenance if you have a cross join in the old syntax, how will the maintainer know if you meant to have one (there are situations where cross joins are needed) or if it was an accident that should be fixed?
Let me point you to this question to see why the implicit syntax is bad if you use left joins.
Sybase *= to Ansi Standard with 2 different outer tables for same inner table
Plus (personal rant here), the standard using the explicit joins is over 20 years old, which means implicit join syntax has been outdated for those 20 years. Would you write application code using a syntax that has been outdated for 20 years? Why do you want to write database code that is?
The SQL:2003 standard changed some precedence rules so that a JOIN statement takes precedence over a "comma" join. This can actually change the results of your query depending on how it is set up. This caused some problems for some people when MySQL 5.0.12 switched to adhering to the standard.
So in your example, your queries would work the same. But if you added a third table:
SELECT ... FROM table1, table2 JOIN table3 ON ... WHERE ...
Prior to MySQL 5.0.12, table1 and table2 would be joined first, then table3. Now (5.0.12 and on), table2 and table3 are joined first, then table1. It doesn't always change the results, but it can and you may not even realize it.
I never use the "comma" syntax anymore, opting for your second example. It's a lot more readable anyway, the JOIN conditions are with the JOINs, not separated into a separate query section.
They have a different human-readable meaning.
However, depending on the query optimizer, they may have the same meaning to the machine.
You should always code to be readable.
That is to say, if this is a built-in relationship, use the explicit join. If you are matching on weakly related data, use the WHERE clause.
I know you're talking about MySQL, but anyway:
In Oracle 9 explicit joins and implicit joins would generate different execution plans. AFAIK that has been solved in Oracle 10+: there's no such difference anymore.
If you are often programming dynamic stored procedures, you will fall in love with your second example (using where). If you have various input parameters and lots of morph mess, then that is the only way. Otherwise, they both will run the same query plan so there is definitely no obvious difference in classic queries.
ANSI join syntax is definitely more portable.
I'm going through an upgrade of Microsoft SQL Server, and I would also mention that the =* and *= syntax for outer joins in SQL Server is not supported (without compatibility mode) for 2005 SQL server and later.
I have two points for the implicit join (The second example):
Tell the database what you want, not what it should do.
You can write all tables in a clear list that is not cluttered by join conditions. Then you can much easier read what tables are all mentioned. The conditions come all in the WHERE part, where they are also all lined up one below the other. Using the JOIN keyword mixes up tables and conditions.

In mysql which inner join sql is most effective and best?

1.
select t01.uname, t02.deptname
from user t01, department t02
where t01.deptid = t02.deptid
and t01.uid = '001'
2.
select t01.uname, t02.deptname
from user t01, department t02
where t01.uid = '001'
and t01.deptid = t02.deptid
3.
select t01.uname, t02.deptname
from user t01 inner join department t02 on t01.deptid = t02.deptid
and t01.uid = '001'
4.
select t01.uname, t02.deptname
from user t01 inner join department t02 on t01.deptid = t02.deptid
where t01.uid = '001'
My MySQL version is 5.1.
All of those are functionally equivalent. Even the separation between WHERE clause and JOIN condition will not change the results when working entirely with INNER joins (it can matter with OUTER joins). Additionally, all of those should work out into the exact same query plan (effectively zero performance difference). The order in which you include items does not matter. The query engine is free to optimize as it sees best fit within the functional specification of the query. Even when you identify specific behavior with regards to order, you shouldn't count on it. The specification allows for tomorrow's patch to change today's behavior in this area. Remember: the whole point of SQL is to be set-based and declarative: you tell the database what you want it to do, not how you want it to do it.
Now that correctness and performance are out of the way, we're down to matters of style: things like programmer productivity and readability/maintainability of the code. In that regard, option #4 in that list is by far the best choice, with #3 the next best, especially as you start to get into more complicated queries. Just don't use the A,B syntax anymore; it's been obsolete since the 1992 version of the SQL standard. Always write out the full INNER JOIN (or LEFT JOIN/RIGHT JOIN/CROSS JOIN etc).
All that said, while order does (or, at least, should) not matter to performance, I do find it helpful when I'm writing SQL to use a convention in my approach that does dictate the order. This helps me identify errors or false assumptions later when debugging and troubleshooting. This general guide that I try to follow is to behave as if the order does matter, and then with that in mind try to keep the working set of memory needed by the database to fulfill the query as small as possible for as long as possible: start with smaller tables first and then join to the larger; when considering table size, take into account conditions in the WHERE clause that match up with an index; prefer the inner joins before outer when you have the choice; list join conditions to favor indexes (especially primary/clustered keys) first, and other conditions on the join second.

SELECT with JOIN where joined row is NULL

I am trying to select rows from a table which don't have a correspondence in the other table.
For this purpose, I'm currently using LEFT JOIN and WHERE joined_table.any_column IS NULL, but I don't think that's the fastest way.
SELECT * FROM main_table mt LEFT JOIN joined_table jt ON mt.foreign_id=jt.id WHERE jt.id IS NULL
This query works, but as I said, I'm looking for a faster alternative.
Your query is a standard query for this:
SELECT *
FROM main_table mt LEFT JOIN
joined_table jt
ON mt.foreign_id=jt.id
WHERE jt.id IS NULL;
You can try this as well:
SELECT mt.*
FROM main_table mt
WHERE not exists (select 1 from joined_table jt where mt.foreign_id = jt.id);
In some versions of MySQL, it might produce a better execution plan.
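As a sanity check that the two forms select the same rows, here is a minimal sketch using Python's sqlite3 module in place of MySQL. The table names follow the question; the columns and data are assumptions. It compares results only, not speed, so on a real MySQL server you would still compare the two with EXPLAIN.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are invented, including one NULL foreign_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE joined_table (id INTEGER PRIMARY KEY);
    CREATE TABLE main_table (id INTEGER PRIMARY KEY, foreign_id INTEGER);
    INSERT INTO joined_table VALUES (10), (30);
    INSERT INTO main_table VALUES (1, 10), (2, 20), (3, 30), (4, NULL);
""")

# Anti-join written as LEFT JOIN ... IS NULL.
left_join_rows = conn.execute("""
    SELECT mt.id
    FROM main_table mt
    LEFT JOIN joined_table jt ON mt.foreign_id = jt.id
    WHERE jt.id IS NULL
    ORDER BY mt.id
""").fetchall()

# The same anti-join written with NOT EXISTS.
not_exists_rows = conn.execute("""
    SELECT mt.id
    FROM main_table mt
    WHERE NOT EXISTS (SELECT 1 FROM joined_table jt WHERE mt.foreign_id = jt.id)
    ORDER BY mt.id
""").fetchall()

print(left_join_rows == not_exists_rows)
conn.close()
```

Both forms also return the row whose foreign_id is NULL, since a NULL never matches any jt.id.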
In my experience with MSSQL, the syntax you used (usually) produces the exact same query plan as the WHERE NOT EXISTS() syntax. However, this is MySQL, so I can't be sure about performance!
That said, I'm a much bigger fan of using the WHERE NOT EXISTS() syntax for the following reasons :
it's easier to read. If you speak a bit of English anyone can deduce the meaning of the query
it's more foolproof, I've seen people test for NULL on a NULL-able field
it can't have side effects like 'doubled-records' due to the JOIN. If the referenced field is unique there is no problem, but again I've seen situations where people chose 'insufficient keys' causing the main-table to get multiple hits against the joined table... and of course they solved it again using DISTINCT (aarrgg!!! =)
As for performance, make sure to have a (unique) index on the referenced field(s) and if possible put a FK-relationship between both tables. Query-wise I doubt you can squeeze much more out of it.
My 2 cents.
The query that you are running is usually the fastest option; just make sure that you have an index for both mt.foreign_id and jt.id.
You mentioned that this query is more complex, so it might be possible that the problem is in another part of the query. You should check the execution plan to see what is wrong and fix it.

MySQL Query Efficiency Suggestion

Although this question is specific to MySQL, I wouldn't mind knowing if this answer applies to SQL engines in general.
Also, since this isn't a syntax query, I'm using pseudo-SQL for brevity/clarity.
Let's say C[1]..C[M] are a set of criteria (separated by AND or OR) and Q[1]..Q[N] are another set (separated by OR). I want to use C[1]...C[M] to filter a table and from this filtered table, I want all the rows matching Q[1]...Q[N].
If I were to do:
SELECT ... FROM ... WHERE (C[1]...C[M]) AND (Q[1]...Q[N])
Would this be automatically optimized so that C[1]...C[M] is evaluated only once and each Q[i] is run against this cached result? If not, should I then split the query into two like so:
INSERT INTO TEMP ... SELECT ... FROM ... WHERE C[1]...C[M]
SELECT ... FROM TEMP WHERE Q[1]...Q[N]
This is the job of the internal query optimizer to calculate the best order for compiling the joins according to filters.
For instance in:
SELECT *
FROM
table1
INNER JOIN table2 ON table1.id = table2.id AND table2.column = Y
INNER JOIN table3 ON table3.id2 = table2.id2 AND table3.column = Z
WHERE
table1.column = X
MySQL (like Oracle, SQL Server, etc.) would try to work out each intermediate result set beforehand to give the best performance, and here the engine actually does a pretty good job.
However, everything relies on the statistics it has about the tables and on the indexes you've set up in your schema. These two points (apart from filling the tables with data) are the only ones you can influence to help the optimizer make good decisions, by giving it accurate information.
I hope it helped.
ps: have a look at this. This is about operators and order of precedence in queries compilation under oracle yet it is probably a good thing to know anyway:
http://ezinearticles.com/?Oracle-SQL---The-Importance-of-Order-of-Precedence&id=1597846

Performance of nested queries

I want to ask a question about database queries, for the case where the WHERE clause of a query depends on the result of another query. For example:
select ? from ? where ? = (select ? from ?)
This is a simple example, so it is easy to write. But for more complex cases, I want to know which approach performs best: a join, separate queries, nested queries, or something else?
Thank you for answers.
Best Regards.
You should test it. These things depend a lot on the details of the query and of the indices it can use.
In my experience JOINs tend to be faster than nested queries in MySQL. In some cases MySQL isn't very smart and appears to run the subquery for every row produced by the outer query.
You can read more about these things in the official documentation:
Optimizing subqueries: http://dev.mysql.com/doc/refman/5.6/en/optimizing-subqueries.html
Rewriting subqueries as joins: http://dev.mysql.com/doc/refman/5.6/en/rewriting-subqueries.html
This is case dependent. If the inner query returns very few rows, you should go for it. The inner query is executed first, and its result set is then used in the outer query.
Meanwhile joins give you a Cartesian product which is again a heavy operation.
As Mitch and Joni stated, it depends. But generally a join will offer the best performance. You're trying to avoid running the nested query for each row of the outer query. A good query optimizer may do this for you anyway, by interpreting what you're trying to do and essentially "fixing" your mistake. But with the vast majority of queries, you should be writing it as a join in the first place. That way you're being explicit about what you're trying to do and you're fully understanding yourself what is being done, and what the most efficient way to do the work is.
I EXPECT the joins to be quicker, mainly because you have an equivalence and an explicit JOIN. Still, use EXPLAIN to see the differences in how the SQL engine will interpret them.
I would not expect these to be so different. Where you can get real, large performance gains from using joins instead of subqueries is with correlated subqueries.
Since almost everyone is saying that joins will give the optimal performance I just logged in to say the exact opposite experience I had.
So some days back I was writing a query over 3-4 tables that held a huge amount of data. I wrote a big SQL query with joins, and it took around 2-3 hours to execute. Then I restructured it: I created a nested SELECT query and put as many WHERE constraints as I could inside the nested one to make it as strict as possible, and the performance improved by more than 90%. It now takes less than 4 minutes to run.
This is just my experience, and maybe joins are better in theory. I just wanted to share it. It's better to try out different things; learning more about the tables, their indexes, etc. would help a lot.
Update:
And I just found out what I did is actually suggested in this optimization reference page of MySQL. http://dev.mysql.com/doc/refman/5.6/en/optimizing-subqueries.html
Pasting it here for quick reference:
Replace a join with a subquery. For example, try this:
SELECT DISTINCT column1 FROM t1 WHERE t1.column1 IN (SELECT column1 FROM t2);
Instead of this:
SELECT DISTINCT t1.column1 FROM t1, t2 WHERE t1.column1 = t2.column1;
Move clauses from outside to inside the subquery. For example, use this query:
SELECT * FROM t1 WHERE s1 IN (SELECT s1 FROM t1 UNION ALL SELECT s1 FROM t2);
Instead of this query:
SELECT * FROM t1 WHERE s1 IN (SELECT s1 FROM t1) OR s1 IN (SELECT s1 FROM t2);
For another example, use this query:
SELECT (SELECT column1 + 5 FROM t1) FROM t2;
Instead of this query:
SELECT (SELECT column1 FROM t1) + 5 FROM t2;
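The first pair of queries in that quote (IN subquery versus DISTINCT over a join) can be compared on toy data. This sketch uses Python's sqlite3 module as a stand-in for MySQL, with made-up rows that contain duplicates on both sides. It shows that both forms return the same values and that the join form builds extra intermediate rows for DISTINCT to collapse, which is one plausible reason the subquery form can be cheaper. Actual timings depend on the server, version, and indexes.

```python
import sqlite3

# SQLite stands in for MySQL; the duplicate-heavy rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (column1 INTEGER);
    CREATE TABLE t2 (column1 INTEGER);
    INSERT INTO t1 VALUES (1), (2), (2), (3);
    INSERT INTO t2 VALUES (2), (2), (3), (4);
""")

# Subquery form from the quoted manual page.
subquery_rows = conn.execute("""
    SELECT DISTINCT column1 FROM t1
    WHERE t1.column1 IN (SELECT column1 FROM t2)
    ORDER BY column1
""").fetchall()

# Join form from the quoted manual page.
join_rows = conn.execute("""
    SELECT DISTINCT t1.column1 FROM t1, t2
    WHERE t1.column1 = t2.column1
    ORDER BY t1.column1
""").fetchall()

# Rows the join produces before DISTINCT removes the duplicates.
join_rows_before_distinct = conn.execute("""
    SELECT COUNT(*) FROM t1, t2 WHERE t1.column1 = t2.column1
""").fetchone()[0]

print(subquery_rows == join_rows, join_rows_before_distinct)
conn.close()
```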