EXISTS vs ALL, ANY, SOME - mysql

I'm trying to understand the difference between EXISTS and ALL in MySQL. Let me give you an example:
SELECT *
FROM table1
WHERE NOT EXISTS (
SELECT *
FROM table2
WHERE table2.val < table1.val
);
SELECT *
FROM table1
WHERE val <= ALL( SELECT val FROM table2 );
A quote from MySQL docs:
Traditionally, an EXISTS subquery starts with SELECT *, but it could
begin with SELECT 5 or SELECT column1 or anything at all. MySQL
ignores the SELECT list in such a subquery, so it makes no difference. [1]
Reading this, it seems to me that MySQL should be able to translate both queries to the same relational algebra expression. Both queries are just a simple comparison between values from two tables. However, that doesn't seem to be the case. I tried both queries and the second one performs much better than the first one.
How exactly are these queries handled by the optimizer?
Why can't the optimizer make the first query perform as well as the second one?
Is it always more efficient to use an ALL/ANY/SOME condition?

The queries in your question are not equivalent, so they will have different execution plans regardless of how well they're optimized. If you used NOT val > ANY(...) then it would be equivalent.
You should always use EXPLAIN to see the execution plan of a query and realize that the execution plan can change as your data changes. Testing and understanding the execution plan will help you determine which methods perform better. There is no hard and fast rule for ALL/ANY/SOME and they're often optimized down to an EXISTS.
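One concrete way the two forms from the question can disagree is NULL handling. The sketch below is not MySQL code: it emulates SQL's three-valued logic in plain Python (None stands for NULL/UNKNOWN), and the sample values are made up for illustration.

```python
# Emulates SQL three-valued logic; None stands for NULL / UNKNOWN.

def sql_le(a, b):
    """a <= b under SQL rules: UNKNOWN (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a <= b


def sql_lt(a, b):
    """a < b under SQL rules: UNKNOWN (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a < b


def le_all(x, values):
    """x <= ALL (values): FALSE if any comparison is FALSE,
    otherwise UNKNOWN if any comparison is UNKNOWN, otherwise TRUE."""
    results = [sql_le(x, v) for v in values]
    if any(r is False for r in results):
        return False
    if any(r is None for r in results):
        return None
    return True


def not_exists_smaller(x, values):
    """NOT EXISTS (SELECT ... WHERE v < x): only a TRUE comparison yields a row."""
    return not any(sql_lt(v, x) is True for v in values)


def keeps_row(predicate_result):
    """A WHERE clause keeps a row only when its predicate is TRUE."""
    return predicate_result is True


table2_vals = [2, None]  # made-up contents of table2.val, including a NULL
x = 1                    # made-up table1.val

print(keeps_row(not_exists_smaller(x, table2_vals)))  # True: row is returned
print(keeps_row(le_all(x, table2_vals)))              # False: row is filtered out
```

With no NULLs on either side the two predicates agree, so whether this matters depends on your data and column definitions.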

Related

why does sub query with group by do full scan twice?

I am testing two types of queries.
The first type looks like below:
explain select * from ord_order;
explain select * from (select * from ord_order) as tbl;
These two execution plans show the same behavior (full scan once).
However, the second type looks like below:
explain select * from ord_order
group by bundle_or_order_number;
explain select * from
(select * from ord_order
group by bundle_or_order_number) as tbl;
The second query does the full scan twice!
Can someone explain it? Thanks.
First, it doesn't matter because your queries are malformed. Don't use select * with group by. It just doesn't make sense, and current versions of MySQL reject it by default (ONLY_FULL_GROUP_BY has been part of the default SQL mode since MySQL 5.7.5). The question is: which rows do the columns come from? You should be using aggregation functions.
Why are the two queries different? In MySQL lingo, the difference is whether or not the derived table (a subquery in the from clause) is materialized. Whether or not subqueries are materialized depends on the nature of the subquery and what MySQL decides to do in the version you are using.
You can read about optimizing derived tables in the documentation.

MySQL performance query vs view with 'explain' output

I'm trying to understand why a direct query takes ~0.5s to run but a view using the same query takes ~10s to run. MySQL v5.6.27.
Direct Query:
select
a,b,
(select count(*) from TableA i3 where i3.b = i.a) as e,
func1(a) as f, func2(a) as g
from TableA i
where i.b = -1 and i.a > 1500;
Direct Query 'explain' Results:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,PRIMARY,i,range,PRIMARY,PRIMARY,4,\N,3629,Using where
2,DEPENDENT SUBQUERY,i3,ALL,\N,\N,\N,\N,7259,Using where
The View's definition/query is the same without the 'where' clause...
select
a,b,
(select count(*) from TableA i3 where i3.b = i.a) as e,
func1(a) as f, func2(a) as g
from TableA i;
Query against the view:
select * from ViewA t where t.b = -1 and t.a > 1500;
Query of View 'explain' results:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,PRIMARY,<derived2>,ALL,\N,\N,\N,\N,7259,Using where
2,DERIVED,i,ALL,\N,\N,\N,\N,7259,\N
3,DEPENDENT SUBQUERY,i3,ALL,\N,\N,\N,\N,7259,Using where
Why does the query against the view end up performing 3 full table scans whereas the direct query performs ~1.5?
The short answer is: the MySQL optimizer is not clever enough to do it.
When processing a view, MySQL can either merge the view or create a temporary table for it:
For MERGE, the text of a statement that refers to the view and the view definition are merged such that parts of the view definition replace corresponding parts of the statement.
For TEMPTABLE, the results from the view are retrieved into a temporary table, which then is used to execute the statement.
This applies in a very similar way to derived tables and subqueries too.
The behaviour you are seeking is the merge. This is the default value, and MySQL will try to use it whenever possible. If it is not possible (or rather: if MySQL thinks it is not possible), MySQL has to evaluate the complete view, no matter if you only need one row out of it. This obviously takes more time, and is what happens in your view.
There is a list of things that prevent MySQL from using the merge algorithm:
MERGE cannot be used if the view contains any of the following constructs:
Aggregate functions (SUM(), MIN(), MAX(), COUNT(), and so forth)
DISTINCT
GROUP BY
HAVING
LIMIT
UNION or UNION ALL
Subquery in the select list
Assignment to user variables
Refers only to literal values (in this case, there is no underlying table)
You can test whether MySQL will merge the view: try to create it while specifying the merge algorithm:
create algorithm=merge view viewA as ...
If MySQL doesn't think it can merge the view, you get the warning
1 warning(s): 1354 View merge algorithm can't be used here for now (assumed undefined algorithm)
In your case, the subquery in the select list is preventing the merge. This is not because it would be impossible to do. You have already proven that it is possible to merge it, just by rewriting it.
But the MySQL optimizer didn't see that possibility. It is not specific to views: it will not merge it either if you use the unmerged view code directly: explain select * from (select a, b, ... from TableA i) as ViewA where .... You would have to test this on MySQL 5.7, as MySQL 5.6 will not merge in this situation on principle (in a query, it assumes you want a temporary table here, even for very simple derived tables that could be merged). MySQL 5.7 will try to merge by default, although it won't work with your view.
As the optimizer improves, it will in some situations merge even when there is a subquery in the select list, so there are some exceptions to that list. MariaDB, which is based on MySQL, is actually a lot better at the merge optimization and would merge your view just as you did by hand, so it is possible for a machine to do it.
So to summarize: the MySQL optimizer is currently not clever enough to do it.
And you unfortunately cannot do much about it, except testing whether MySQL accepts algorithm=merge, and then avoiding views that MySQL cannot merge and merging them yourself instead.
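To make "merging by hand" concrete, here is a minimal sketch that checks the hand-merged query returns the same rows as the query against the view. It runs on SQLite through Python's sqlite3 module purely because that is self-contained, so it says nothing about MySQL's execution plan. The sample rows are made up, and func1/func2 are left out because their definitions are not given in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (a INTEGER PRIMARY KEY, b INTEGER);
    -- Made-up sample rows.
    INSERT INTO TableA (a, b) VALUES
        (1000, -1), (1600, -1), (1700, -1),
        (1800, 1600), (1900, 1600), (2000, 1700);

    -- The view from the question, without func1/func2 (definitions unknown).
    CREATE VIEW ViewA AS
    SELECT a, b,
           (SELECT count(*) FROM TableA i3 WHERE i3.b = i.a) AS e
    FROM TableA i;
""")

# Filter applied on top of the view.
via_view = conn.execute(
    "SELECT * FROM ViewA t WHERE t.b = -1 AND t.a > 1500 ORDER BY a"
).fetchall()

# Same filter pushed into the view's query by hand.
merged_by_hand = conn.execute("""
    SELECT a, b,
           (SELECT count(*) FROM TableA i3 WHERE i3.b = i.a) AS e
    FROM TableA i
    WHERE i.b = -1 AND i.a > 1500
    ORDER BY a
""").fetchall()

print(via_view == merged_by_hand)  # True
print(merged_by_hand)              # [(1600, -1, 2), (1700, -1, 1)]
```

Equal results only confirm that the rewrite is safe; use EXPLAIN on your own MySQL version to see whether it actually avoids the temporary table.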

How can I EXPLAIN several consecutive queries without executing them?

Suppose I have a pair of arbitrary SQL queries, each one depending upon the former ones, e.g.
CREATE VIEW v1 ( c3 ) AS SELECT c1 + c2 FROM t1;
SELECT sum(c3) FROM v1;
DROP VIEW v1;
(but note I am not asking about these specific queries - this is just an example; assume I get the queries from a file and do not know them in advance.)
Now, I want to get my DBMS to EXPLAIN its plan for all of my queries (or an arbitrary query in the middle, it's the same problem essentially) - but I do not want it to actually execute any of them.
Is this possible with (1) MySQL? (2) PostgreSQL? (3) MonetDB?
PostgreSQL
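You can wrap the statements in a transaction and roll it back, as follows. Note that EXPLAIN ANALYZE actually executes the statement (the ROLLBACK then discards its effects), whereas plain EXPLAIN only plans it.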
BEGIN;
EXPLAIN ANALYZE ...;
ROLLBACK;
Refer to the documentation, which answers your question.
MonetDB
Similarly to the above, except that the transaction-related statements are BEGIN TRANSACTION and ROLLBACK statements (assuming that you are in auto-commit mode to begin with).
Refer to this.
MySQL
MySQL's EXPLAIN by itself does what you need. No need to ROLLBACK.
Refer to this answer.
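The transaction-and-rollback idea can be sketched as below. This is an illustration on SQLite (via Python's sqlite3 module), chosen only because it is self-contained and its DDL is transactional; EXPLAIN QUERY PLAN is SQLite's own syntax, not PostgreSQL's or MonetDB's. MySQL, by contrast, documents CREATE VIEW among the statements that cause an implicit commit, so the same trick should not be relied on there for DDL.

```python
import sqlite3

# isolation_level=None turns off the module's implicit transactions,
# so BEGIN and ROLLBACK below are issued exactly as written.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t1 (c1 INTEGER, c2 INTEGER)")

conn.execute("BEGIN")
# The dependency (the view) is created inside the transaction...
conn.execute("CREATE VIEW v1 AS SELECT c1 + c2 AS c3 FROM t1")
# ...so the dependent query can be planned without being run.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT sum(c3) FROM v1").fetchall()
conn.execute("ROLLBACK")

# After the rollback the view is gone again.
views_left = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'view'"
).fetchall()

for row in plan:
    print(row)
print(views_left)  # []
```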
When you execute a view, the underlying SELECT query is executed, obviously. So in PostgreSQL the actual execution plan is based on this:
-- Common use of the view
SELECT sum(c3) FROM v1;
becomes
-- Expansion of the view into plain SQL
SELECT sum(c3) FROM (SELECT c1 + c2 AS c3 FROM t1) v1;
becomes
-- Flattening by the query planner, this is what actually gets executed
SELECT sum(c1 + c2) FROM t1;
So the answer is:
EXPLAIN SELECT sum(c1 + c2) FROM t1;
This most certainly works for PostgreSQL and most likely for all other DBMSes too, but check their docs on how the query planner works.
If your view definition is very complex, just take the query on the view you want to evaluate and paste the entire view definition in brackets () just before the name of the view (which then effectively becomes an alias for a sub-query). The query planner will do the rest for you.
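As a quick sanity check of the expansion steps above, the following sketch runs all three forms and compares their results. It uses SQLite through Python's sqlite3 module as a stand-in, with made-up rows, so it confirms the queries are equivalent in their results, not that any particular planner flattens them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (c1 INTEGER, c2 INTEGER);
    INSERT INTO t1 (c1, c2) VALUES (1, 2), (3, 4), (5, 6);  -- made-up rows
    CREATE VIEW v1 AS SELECT c1 + c2 AS c3 FROM t1;
""")

queries = [
    # Common use of the view
    "SELECT sum(c3) FROM v1",
    # View definition pasted in as a sub-query, aliased with the view's name
    "SELECT sum(c3) FROM (SELECT c1 + c2 AS c3 FROM t1) v1",
    # Flattened form
    "SELECT sum(c1 + c2) FROM t1",
]

results = [conn.execute(q).fetchone()[0] for q in queries]
print(results)  # [21, 21, 21]
```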

MySQL Query Efficiency Suggestion

Although this question is specific to MySQL, I wouldn't mind knowing if this answer applies to SQL engines in general.
Also, since this isn't a syntax query, I'm using pseudo-SQL for brevity/clarity.
Let's say C[1]..C[M] are a set of criteria (separated by AND or OR) and Q[1]..Q[N] are another set (separated by OR). I want to use C[1]...C[M] to filter a table and from this filtered table, I want all the rows matching Q[1]...Q[N].
If I were to do:
SELECT ... FROM ... WHERE (C[1]...C[M]) AND (Q[1]...Q[N])
Would this be automatically optimized so that C[1]...C[M] is found only once and each Q[i] is run against this cached result? If not, should I then split the query into two like so:
INSERT INTO TEMP ... SELECT ... FROM ... WHERE C[1]...C[M]
SELECT ... FROM TEMP WHERE Q[1]...Q[N]
It is the job of the internal query optimizer to work out the best order for evaluating the joins according to the filters.
For instance in:
SELECT *
FROM
table1
INNER JOIN table2 ON table1.id = table2.id AND table2.column = Y
INNER JOIN table3 ON table3.id2 = table2.id2 AND table3.column = Z
WHERE
table1.column = X
MySQL (or Oracle, SQL Server, etc.) would try to compute each intermediate result set beforehand to provide the best performance, and the engine actually does a pretty good job here.
However, everything relies on the statistics it has about the tables and on the indexes you have set up in your architecture. These two points (apart from filling the tables with data) are the only ones you can influence to help the optimizer make good decisions, by giving it correct and accurate information.
I hope it helped.
ps: have a look at this. This is about operators and order of precedence in queries compilation under oracle yet it is probably a good thing to know anyway:
http://ezinearticles.com/?Oracle-SQL---The-Importance-of-Order-of-Precedence&id=1597846
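For reference, the two formulations from the question (one combined WHERE clause versus a temporary table filtered in a second step) can be checked against each other with a small sketch like the one below. The table, columns, and criteria are made up for illustration, and SQLite (via Python's sqlite3 module) stands in for MySQL, so this only shows that the results agree; which form is faster is for EXPLAIN and measurement on your own data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table and rows.
    CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT, price INTEGER, color TEXT);
    INSERT INTO items (category, price, color) VALUES
        ('tool', 10, 'red'), ('tool', 50, 'blue'), ('toy', 20, 'red'),
        ('tool', 30, 'green'), ('toy', 5, 'blue');
""")

# C criteria: category = 'tool' AND price < 40
# Q criteria: color = 'red' OR color = 'green'
single_query = conn.execute("""
    SELECT id FROM items
    WHERE (category = 'tool' AND price < 40)
      AND (color = 'red' OR color = 'green')
    ORDER BY id
""").fetchall()

# Two-step variant: materialize the C-filtered rows, then apply Q.
conn.execute("""
    CREATE TEMP TABLE filtered AS
    SELECT * FROM items WHERE category = 'tool' AND price < 40
""")
two_step = conn.execute("""
    SELECT id FROM filtered
    WHERE color = 'red' OR color = 'green'
    ORDER BY id
""").fetchall()

print(single_query)              # [(1,), (4,)]
print(single_query == two_step)  # True
```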

Performance of nested queries

I want to ask a question about database queries, for the case where the where clause of a query comes from another query. For example:
select ? from ? where ? = select ? from ?
This is a simple example, so it is easy to write. But for more complex cases, I want to know which approach performs best: a join, separate queries, nested queries, or something else?
Thank you for answers.
Best Regards.
You should test it. These things depend a lot on the details of the query and of the indices it can use.
In my experience JOINs tend to be faster than nested queries in MySQL. In some cases MySQL isn't very smart and appears to run the subquery for every row produced by the outer query.
You can read more about these things in the official documentation:
Optimizing subqueries: http://dev.mysql.com/doc/refman/5.6/en/optimizing-subqueries.html
Rewriting subqueries as joins: http://dev.mysql.com/doc/refman/5.6/en/rewriting-subqueries.html
This is case dependent. If the inner query returns very few rows, go for the nested query. The inner query is executed first, and its result set is then used by the outer query.
Meanwhile, joins give you a Cartesian product, which is again a heavy operation.
As Mitch and Joni stated, it depends. But generally a join will offer the best performance. You're trying to avoid running the nested query for each row of the outer query. A good query optimizer may do this for you anyway, by interpreting what you're trying to do and essentially "fixing" your mistake. But with the vast majority of queries, you should be writing it as a join in the first place. That way you're being explicit about what you're trying to do and you're fully understanding yourself what is being done, and what the most efficient way to do the work is.
I EXPECT the joins to be quicker, mainly because you have an equivalence and an explicit JOIN. Still, use EXPLAIN to see the differences in how the SQL engine will interpret them.
I would not expect these to be so different. Where you can get real, large performance gains from using joins instead of subqueries is with correlated subqueries.
Since almost everyone is saying that joins will give the optimal performance I just logged in to say the exact opposite experience I had.
Some days back I was writing a query over 3-4 tables which held a huge amount of data. I wrote a big SQL query with joins and it took around 2-3 hours to execute. Then I restructured it: I created a nested select query, put as many where constraints as I could inside the nested one, and made it as strict as possible. The performance improved by >90%; it now takes less than 4 minutes to run.
This is just my experience, and maybe joins are better in theory. I just wanted to share it. It's better to try out different things; getting additional knowledge about the tables, their indexes, etc. would help a lot.
Update:
And I just found out what I did is actually suggested in this optimization reference page of MySQL. http://dev.mysql.com/doc/refman/5.6/en/optimizing-subqueries.html
Pasting it here for quick reference:
Replace a join with a subquery. For example, try this:
SELECT DISTINCT column1 FROM t1 WHERE t1.column1 IN (SELECT column1 FROM t2);
Instead of this:
SELECT DISTINCT t1.column1 FROM t1, t2 WHERE t1.column1 = t2.column1;
Move clauses from outside to inside the subquery. For example, use this query:
SELECT * FROM t1 WHERE s1 IN (SELECT s1 FROM t1 UNION ALL SELECT s1 FROM t2);
Instead of this query:
SELECT * FROM t1 WHERE s1 IN (SELECT s1 FROM t1) OR s1 IN (SELECT s1 FROM t2);
For another example, use this query:
SELECT (SELECT column1 + 5 FROM t1) FROM t2;
Instead of this query:
SELECT (SELECT column1 FROM t1) + 5 FROM t2;
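Before swapping a join for a subquery as in the first pair quoted above, it is worth confirming that both forms return the same rows on your data. A minimal sketch of such a check follows, using SQLite through Python's sqlite3 module as a stand-in and made-up rows with duplicates on both sides. It checks results only, not speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (column1 INTEGER);
    CREATE TABLE t2 (column1 INTEGER);
    -- Made-up rows, with duplicates on both sides.
    INSERT INTO t1 VALUES (1), (2), (2), (3);
    INSERT INTO t2 VALUES (2), (2), (3), (4);
""")

with_subquery = sorted(conn.execute(
    "SELECT DISTINCT column1 FROM t1 "
    "WHERE t1.column1 IN (SELECT column1 FROM t2)"
).fetchall())

with_join = sorted(conn.execute(
    "SELECT DISTINCT t1.column1 FROM t1, t2 "
    "WHERE t1.column1 = t2.column1"
).fetchall())

print(with_subquery)               # [(2,), (3,)]
print(with_subquery == with_join)  # True
```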