I want to ask a question about database queries. In case of query such like where clause of the query is coming from the another query. For example
select ? from ? where ? = select ? from ?
This is the simple example so it is easy to write this. But for the more complex case, i want to know what is the best way in case of performance. Join? seperate queries? nested or another?
Thank you for answers.
Best Regards.
You should test it. These things depend a lot on the details of the query and of the indices it can use.
In my experience JOINs tend to be faster than nested queries in MySQL. In some cases MySQL isn't very smart and appears to run the subquery for every row produced by the outer query.
You can read more about these things in the official documentation:
Optimizing subqueries: http://dev.mysql.com/doc/refman/5.6/en/optimizing-subqueries.html
Rewriting subqueries as joins: http://dev.mysql.com/doc/refman/5.6/en/rewriting-subqueries.html
This is case dependent. In case you have a very less result in the inner query you should go for it. The flow works in the manner where in the inner query is executed first and the result set is being used in the outer query.
Meanwhile joins give you a Cartesian product which is again a heavy operation.
As Mitch and Joni stated, it depends. But generally a join will offer the best performance. You're trying to avoid running the nested query for each row of the outer query. A good query optimizer may do this for you anyway, by interpreting what you're trying to do and essentially "fixing" your mistake. But with the vast majority of queries, you should be writing it as a join in the first place. That way you're being explicit about what you're trying to do and you're fully understanding yourself what is being done, and what the most efficient way to do the work is.
I EXPECT the joins to be quicker, mainly because you have an equivalence and an explicit JOIN. Still use explain to see the differences in how the SQl engine will interpret them.
I would not expect these to be so different, where you can get real, large performance gains in using joins instead of subqueries is when you use correlated subqueries.
Since almost everyone is saying that joins will give the optimal performance I just logged in to say the exact opposite experience I had.
So some days back I was writing a query for 3-4 tables which had huge amount of data. I wrote a big sql query with joins and it was taking around 2-3 hours to execute it. Then I restructured it, created a nested select query, put as many where constraints as I can inside the nested one & made it as stricter as possible and then the performance improved by >90%, it now takes less than 4 mins to run.
This is just my experience and may be theoretically joins are better. I just felt to share my experience. Its better to try out different things, getting additional knowledge about the tables, it's indexes etc would help a lot.
Update:
And I just found out what I did is actually suggested in this optimization reference page of MySQL. http://dev.mysql.com/doc/refman/5.6/en/optimizing-subqueries.html
Pasting it here for quick reference:
Replace a join with a subquery. For example, try this:
SELECT DISTINCT column1 FROM t1 WHERE t1.column1 IN ( SELECT column1
FROM t2);
Instead of this:
SELECT DISTINCT t1.column1 FROM t1, t2 WHERE t1.column1 =
t2.column1;
Move clauses from outside to inside the subquery. For example, use
this query:
SELECT * FROM t1 WHERE s1 IN (SELECT s1 FROM t1 UNION ALL SELECT s1
FROM t2); Instead of this query:
SELECT * FROM t1 WHERE s1 IN (SELECT s1 FROM t1) OR s1 IN (SELECT s1
FROM t2); For another example, use this query:
SELECT (SELECT column1 + 5 FROM t1) FROM t2; Instead of this query:
SELECT (SELECT column1 FROM t1) + 5 FROM t2;
Related
Is Query 1 more optimized say for example for a larger database than Query 2 even by slight or am I just doubling the work with an additional WHERE clause?
Query 1:
SELECT sample_data
FROM table1 INNER JOIN table2 ON table1.key = table2.key
WHERE table1.key = table2.key;
Query 2:
SELECT sample_data
FROM table1 INNER JOIN table2 ON table1.key = table2.key;
Because I read this article saying that using filters in JOIN clauses improve the performance..:
Is Query 1 more optimized say for example for a larger database than Query 2?
No, it is not more optimized. Query 2 is the correct way to handle the JOIN. Query 1 does the same thing, but with extra verbiage for the MySQL server software to scrub out as it figures out how to satisfy your query.
The advice at the Adobe documentation about filtering both tables in a join does not relate to the join's ON-condition. Their example says to do this...
SELECT whatever, whatever
FROM table1
JOIN table2 ON table2.table1_id = table1.table1_id
WHERE table1.date >= '2021-01-01'
AND table2.date >= '2021-01-01' /* THIS LINE IS WHAT THEY SUGGEST */
Their suggestion, from 2015, has to do with filtering non-join attributes from both tables. It's a suggestion to use to optimize a query if it just isn't fast enough for you. And, in my experience, it's not a very good suggestion. Ignore it, at least for now. More recent MySQL versions have gotten more efficient.
Let me add to this. SQL is a so-called "declarative" language. You declare what you want and the MySQL server figures out how to get it for you. SQL software is getting really good at doing that; keep in mind that MySQL is now a quarter century old. In that time its programmers have been continuously making it smarter at figuring out how to get stuff. You probably can't outsmart it. But you may need to add indexes when your tables get really big. https://use-the-index-luke.com/
Other languages are "procedural": you, as a programmer, spell out a procedure for getting what you want. You don't need to do that for SQL.
I like to put it this way:
ON is where you specify how the tables are related.
WHERE is for filtering.
That makes it easy for a human reading the query to understand it.
In reality (for MySQL), JOIN (aka INNER JOIN) treats ON and WHERE identically. That is, there is no performance difference. Your Query 1 unnecessarily specifies the "relation" twice.
Also, MySQL's Optimizer is smart enough to realize when two columns have the same value. For example,
SELECT ...
FROM a
JOIN bb ON a.foo = bb.foo
WHERE a.foo = 123
If the Optimizer decides that starting with the filter bb.foo = 123 is more optimal, it will do so. Note: This is not the same as the example you showed; it joins on one thing (id) but filters on another (date). The two queries there are not equivalent!
LEFT JOIN, necessarily treats ON and WHERE differently. (But that is another topic.)
I'm trying to understand the difference between EXISTS and ALL in MySQL. Let me give you an example:
SELECT *
FROM table1
WHERE NOT EXISTS (
SELECT *
FROM table2
WHERE table2.val < table1.val
);
SELECT *
FROM table1
WHERE val <= ALL( SELECT val FROM table2 );
A quote from MySQL docs:
Traditionally, an EXISTS subquery starts with SELECT *, but it could
begin with SELECT 5 or SELECT column1 or anything at all. MySQL
ignores the SELECT list in such a subquery, so it makes no difference. [1]
Reading this, it seems to me that mysql should be able to translate both queries to the same relational algebra expression. Both queries are just a simple comparison between values from two tables. However, that doesn't seem to be the case. I tried both queries and the second one performs much better than the first one.
How are this queries exactly handled by the optimizer?
Why the optimizer can't make the first query perform as the second one?
Is it always more efficient to use an ALL/ANY/SOME condition?
The queries in your question are not equivalent, so they will have different execution plans regardless of how well they're optimized. If you used NOT val > ANY(...) then it would be equivalent.
You should always use EXPLAIN to see the execution plan of a query and realize that the execution plan can change as your data changes. Testing and understanding the execution plan will help you determine which methods perform better. There is no hard and fast rule for ALL/ANY/SOME and they're often optimized down to an EXISTS.
I have big DB. It's about 1 mln strings. I need to do something like this:
select * from t1 WHERE id1 NOT IN (SELECT id2 FROM t2)
But it works very slow. I know that I can do it using "JOIN" syntax, but I can't understand how.
Try this way:
select *
from t1
left join t2 on t1.id1 = t2.id
where t2.id is null
First of all you should optimize your indexes in both tables, and after that you should use join
There are different ways a dbms can deal with this task:
It can select id2 from t2 and then select all t1 where id1 is not in that set. You suggest this using the IN clause.
It can select record by record from t1 and look for each record if it finds a match in t2. You would suggest this using the EXISTS clause.
You can outer join the table then throw away all matches and stay with the non-matching entries. This may look like a bad way, especially when there are many matches, because you would get big intermediate data and then throw most of it away. However, depending on how the dbms works, it can be rather fast, for example when it applies hash join techniques.
It all depends on table sizes, number of matches, indexes, etc. and on what the dbms makes of your query. There are dbms that are able to completely re-write your query to find the best execution plan.
Having said all this, you can just try different things:
the IN clause with (SELECT DISTINCT id2 FROM t2). DISTINCT can reduce the intermediate result significantly and really speed up your query. (But maybe your dbms does that anyhow to get a good execution plan.)
use an EXISTS clause and see if that is faster
the outer join suggested by Parado
Although this question is specific to MySQL, I wouldn't mind knowing if this answer applies to SQL engines in general.
Also, since this isn't a syntax query, I'm using psuedo-SQL for brevity/clarity.
Let's say C[1]..C[M] are a set of criteria (separated by AND or OR) and Q[1]..Q[N] are another set (separated by OR). I want to use C[1]...C[M] to filter a table and from this filtered table, I want all the rows matching Q[1]...Q[N].
If I were to do:
SELECT ... FROM ... WHERE (C[1]...C[M]) AND (Q[1]...Q[N])
Would this be automatically optimized so that C[1]...C[M] is found only once and each Q[i] is run against this cache'ed result? If not, should I then split the query into two like so:
INSERT INTO TEMP ... SELECT ... FROM ... WHERE C[1]...C[N]
SELECT ... FROM TEMP WHERE Q[1]...Q[N]
This is the job of the internal query optimizer to calculate the best order for compiling the joins according to filters.
For instance in:
SELECT *
FROM
table1
INNER JOIN table2 ON table1.id = table2.id AND table2.column = Y
INNER JOIN table3 ON table3.id2 = table2.id2 AND table3.column = Z
WHERE
table1.column = X
Mysql (/oracle/sqlserver etc...) would try to compute beforehand each intermediary resultset to provide the best performances, and actually here the engine is doing a pretty good job.
However, everything relies on the statistics it actually has about the tables and the indexes you've setup in your architecture. These 2 points (apart from filling up the tables with datas...) are the only ones you can influence to help the optimizer to make good decisions by giving it the right and accurate information.
I hope it helped.
ps: have a look at this. This is about operators and order of precedence in queries compilation under oracle yet it is probably a good thing to know anyway:
http://ezinearticles.com/?Oracle-SQL---The-Importance-of-Order-of-Precedence&id=1597846
I was wondering which is faster an INNER JOIN or INNER SELECT with IN?
select t1.* from test1 t1
inner join test2 t2 on t1.id = t2.id
where t2.id = 'blah'
OR
select t1.* from test1 t1
where t1.id IN (select t2.id from test2 t2 where t2.id = 'blah')
Assuming id is key, these queries mean the same thing, and a decent DBMS will execute them in the exact same way. Unfortunately MySQL doesn't, as can be seen by expanding the "View Execution Plan" link in this SQL Fiddle. Which one will be faster probably depends on the size of tables - if TABLE1 has very few rows, then IN has a chance for being faster, while JOIN will likely be faster in all other cases.
This is a peculiarity of MySQL's query optimizer. I've never seen Oracle, PostgreSQL or MS SQL Server execute such simple equivalent queries differently.
If you have to guess, INNER JOIN is likely to be more efficient than an IN (SELECT ...), but that can vary from one query to another.
The EXPLAIN keyword is one of your best friends. Type EXPLAIN in front of your complete SELECT query and MySQL will give you some basic information about how it will execute the query. It'll tell you where it's using file sorts, where it's using indices you've created (and where it's ignoring them), and how many rows it will probably have to examine to fulfill the request.
If all else is equal, use the INNER JOIN mostly because it's more predictable and thus easier to understand to a new developer coming in. But of course if you see a real advantage to the IN (SELECT ...) form, use it!
Though you'd have to check the execution plan on whatever RDBS you're inquiring about, I would guess the inner join would be faster or at least the same. Perhaps someone will correct me if I'm wrong.
The nested select will most likely run the entire inner query anyway, and build a hash table of possible values from test2. If that query returns a million rows, you've incurred the cost of loading that data into memory no matter what.
With the inner join, if test1 only has 2 rows, it will probably just do 2 index scans on test2 for the id values of each of those rows, and not have to load a million rows into memory.
It's also possible that a more modern database system can optimize the first scenario since it has statistics on each table, however at the very best case, the inner join would be the same.
In most of the cases JOIN is much faster than sub query but sub-query is more readable than JOIN.
RDBMS creates an execution plan against JOIN so it can be predict that what data should be loaded to be processed. This definitely saves time. On the other hand for the sub-query it run all the queries and load all their data to do the processing.
For more details please check this link.