Simplified example 1:
SELECT * FROM (
SELECT x, y, z FROM table1
WHERE x = 'test'
-- union, etc, etc, complicated stuff...
) AS t
-- union, etc, etc, complicated stuff...
Simplified example 2:
SELECT * FROM (
SELECT x, y, z FROM table1
-- union, etc, etc, complicated stuff...
) AS t
-- union, etc, etc, complicated stuff...
WHERE x = 'test'
Which of the above is more common? Which is more performant, or recommended for other reasons? Does it help to filter the results "early", before doing the union and similar operations? Thanks.
In MySQL you definitely want the filtering condition in the subquery. MySQL materializes subqueries. The smaller the subquery the faster the query.
In addition, MySQL may be able to use an index for the condition.
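To convince yourself that moving the filter inside changes only the amount of work and not the result, here is a small sketch. It runs on Python's built-in sqlite3 module as a stand-in engine (an assumption made only to keep the example runnable; it says nothing about MySQL's timings), and table2 and the sample rows are made up. Note that when the subquery contains a UNION, the early filter has to be repeated in every branch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (x TEXT, y INTEGER, z INTEGER);
    CREATE TABLE table2 (x TEXT, y INTEGER, z INTEGER);  -- hypothetical second table
    INSERT INTO table1 VALUES ('test', 1, 1), ('other', 2, 2);
    INSERT INTO table2 VALUES ('test', 3, 3), ('other', 4, 4), ('test', 1, 1);
""")

# Example 1 style: filter each branch before the UNION.
early = conn.execute("""
    SELECT * FROM (
        SELECT x, y, z FROM table1 WHERE x = 'test'
        UNION
        SELECT x, y, z FROM table2 WHERE x = 'test'
    ) AS t
    ORDER BY y
""").fetchall()

# Example 2 style: build the whole UNION first, then filter.
late = conn.execute("""
    SELECT * FROM (
        SELECT x, y, z FROM table1
        UNION
        SELECT x, y, z FROM table2
    ) AS t
    WHERE x = 'test'
    ORDER BY y
""").fetchall()

print(early)
print(late)
```

Both forms return the same rows; the difference is how many rows the engine has to carry through the union before the filter is applied.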
Why isn't MySQL able to consistently optimize queries in the format of WHERE <indexed_field> IN (<subquery>)?
I have a query as follows:
SELECT
*
FROM
t1
WHERE
t1.indexed_field IN (select val from ...)
AND (...)
The subquery select val from ... runs very quickly. The problem is MySQL is doing a full table scan to get the required rows from t1 -- even though t1.indexed_field is indexed.
I've gotten around this by changing the query to an inner join:
SELECT
*
FROM
t1
INNER JOIN
(select val from ...) vals ON (vals.val = t1.indexed_field)
WHERE
(...)
Explain shows that this works perfectly -- MySQL is now able to use the indexed_field index when joining to the subquery table.
My question is: Why isn't MySQL able to optimize the first query? Intuitively, doing where <indexed_field> IN (<subquery>) seems like quite an easy optimization -- do the subquery, use the index to grab the rows.
No or Yes.
Old versions of MySQL did a very poor job of optimizing IN ( SELECT ... ). It seemed to re-evaluate the subquery repeatedly.
New versions are turning it into EXISTS ( SELECT 1 ... ) or perhaps a LEFT JOIN.
Please provide
Version
SHOW CREATE TABLE
EXPLAIN SELECT ...
I found this sample interview question and answer posted on toptal, reproduced here. But I don't really understand the code. How can a UNION ALL turn into a UNION (distinct) like that? Also, why is this code faster?
QUESTION
Write a SQL query using UNION ALL (not UNION) that uses the WHERE clause to eliminate duplicates. Why might you want to do this?
You can avoid duplicates using UNION ALL and still run much faster than UNION DISTINCT (which is actually the same as UNION) by running a query like this:
ANSWER
SELECT * FROM mytable WHERE a=X UNION ALL SELECT * FROM mytable WHERE b=Y AND a!=X
The key is the AND a!=X part. This gives you the benefits of the UNION (a.k.a., UNION DISTINCT) command, while avoiding much of its performance hit.
But in the example, the first query has a condition on column a, whereas the second query has a condition on column b. This probably came from a query that's hard to optimize:
SELECT * FROM mytable WHERE a=X OR b=Y
This query is hard to optimize with simple B-tree indexing. Does the engine search an index on column a? Or on column b? Either way, searching the other term requires a table-scan.
Hence the trick of using UNION to separate into two queries for one term each. Each subquery can use the best index for each search term. Then combine the results using UNION.
But the two subsets may overlap, because some rows where b=Y may also have a=X in which case such rows occur in both subsets. Therefore you have to do duplicate elimination, or else see some rows twice in the final result.
SELECT * FROM mytable WHERE a=X
UNION DISTINCT
SELECT * FROM mytable WHERE b=Y
UNION DISTINCT is expensive because typical implementations sort the rows to find duplicates, just as SELECT DISTINCT does.
It also feels like even more "wasted" work if the two subsets of rows you are unioning have a lot of rows occurring in both subsets. That is a lot of rows to eliminate.
But there's no need to eliminate duplicates if you can guarantee that the two sets of rows are already distinct. That is, if you guarantee there is no overlap. If you can rely on that, then it would always be a no-op to eliminate duplicates, and therefore the query can skip that step, and therefore skip the costly sorting.
If you change the queries so that they are guaranteed to select non-overlapping subsets of rows, that's a win.
SELECT * FROM mytable WHERE a=X
UNION ALL
SELECT * FROM mytable WHERE b=Y AND a!=X
These two sets are guaranteed to have no overlap. If the first set has rows where a=X and the second set has rows where a!=X then there can be no row that is in both sets.
The second query therefore only catches some of the rows where b=Y, but any row where a=X AND b=Y is already included in the first set.
So the query achieves an optimized search for two OR terms, without producing duplicates, and requiring no UNION DISTINCT operation.
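The equivalence described above can be checked on a toy table. This is only a sketch: it uses Python's built-in sqlite3 module rather than MySQL, the id column and the sample rows are made up, and plain UNION stands in for UNION DISTINCT (they mean the same thing). The columns are declared NOT NULL on purpose: if a could be NULL, a != X would not be true for those rows and the second half would silently drop them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER NOT NULL, b INTEGER NOT NULL);
    INSERT INTO mytable (id, a, b) VALUES
        (1, 1, 2),   -- a=X and b=Y: matches both halves
        (2, 1, 9),   -- a=X only
        (3, 7, 2),   -- b=Y only
        (4, 7, 9);   -- neither
""")
X, Y = 1, 2

def ids(sql, params):
    """Run the query and return the sorted list of id values."""
    return sorted(row[0] for row in conn.execute(sql, params))

with_or = ids("SELECT * FROM mytable WHERE a = ? OR b = ?", (X, Y))

union_distinct = ids("""
    SELECT * FROM mytable WHERE a = ?
    UNION
    SELECT * FROM mytable WHERE b = ?
""", (X, Y))

# The trick: the second half excludes rows the first half already returned.
union_all_filtered = ids("""
    SELECT * FROM mytable WHERE a = ?
    UNION ALL
    SELECT * FROM mytable WHERE b = ? AND a != ?
""", (X, Y, X))

# Without the extra condition, the overlapping row comes back twice.
union_all_naive = ids("""
    SELECT * FROM mytable WHERE a = ?
    UNION ALL
    SELECT * FROM mytable WHERE b = ?
""", (X, Y))

print(with_or, union_distinct, union_all_filtered, union_all_naive)
```

The first three queries return rows 1, 2 and 3 once each; only the naive UNION ALL returns row 1 twice.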
The simplest way is like this, especially if you have many columns:
SELECT *
INTO table2
FROM table1
UNION
SELECT *
FROM table1
ORDER BY column1
I guess this is right (Oracle):
select distinct * from (
select * from test_a
union all
select * from test_b
);
The query in the question is correct only if the table has a unique identifier (a primary key). Otherwise each select can itself return many identical rows, which UNION ALL would keep.
To understand why it can be faster, let's look at how a database executes UNION ALL and UNION.
The first simply concatenates the results of two independent queries. These queries can be processed in parallel and their rows sent to the client as they arrive.
The second is concatenation plus duplicate elimination. To deduplicate the records from the two queries, the database needs to hold all of them in memory or, if memory is not enough, store them in a temporary table and then select the unique ones. This is where the performance degradation can come from. Databases are pretty smart and deduplication algorithms are well developed, but for large result sets it can still be a problem.
UNION ALL plus an additional WHERE condition can be faster if an index is used for the filtering.
So that is where the performance magic comes from.
I guess it will work
select col1 From (
select row_number() over (partition by col1 order by col1) as b, col1
from (
select col1 From u1
union all
select col1 From u2 ) a
) x
where x.b =1
This will also do the same trick:
select * from (
select * from table1
union all
select * from table2
) a group by
columns
having count(*) >= 1
or
select * from table1
union all
select * from table2 b
where not exists (select 1 from table1 a where a.col1 = b.col1)
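A quick check of the NOT EXISTS variant against plain UNION, as a sketch only. It runs on Python's built-in sqlite3 module rather than any particular server, and the sample rows are made up. It assumes col1 identifies a row and that neither table contains duplicates of its own, because NOT EXISTS only removes table2 rows that also appear in table1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 INTEGER);
    CREATE TABLE table2 (col1 INTEGER);
    INSERT INTO table1 VALUES (1), (2);
    INSERT INTO table2 VALUES (2), (3);   -- 2 overlaps with table1
""")

with_union = sorted(conn.execute(
    "SELECT * FROM table1 UNION SELECT * FROM table2"
).fetchall())

# UNION ALL, but the second half skips rows already present in table1.
with_not_exists = sorted(conn.execute("""
    SELECT * FROM table1
    UNION ALL
    SELECT * FROM table2 b
    WHERE NOT EXISTS (SELECT 1 FROM table1 a WHERE a.col1 = b.col1)
""").fetchall())

print(with_union)
print(with_not_exists)
```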
Problem Summary
Using MySql 5.6, I'm noticing that combined Select queries (e.g. select x.a from X x where x.b in (select y.b from Y y where y.c = 'something')) are way slower than doing two separate queries, using the results of the first query in the in clause of the second query. And my attempts at using Join statements instead of nested queries (influenced by other posts on this site) don't produce any performance improvements.
I know this is a common issue with MySql and I've read many postings here on SO about this issue and tried some of the solutions which, apparently, worked for other posters, but not for me.
This query:
select ADSH_ from SECSub where Symbol_='MSFT';
is fast and produces this result:
'0001193125-10-015598'
'0001193125-10-090116'
'0001193125-10-171791'
There are actually 21 results, but I've trimmed them for this posting to 3.
Here's some additional info:
show indexes from SECSub;
produces:
And
explain select * from SECSub where Symbol_='MSFT';
produces:
Querying a different table using the results of the first query, like this:
select * from SECNum where ADSH_ in (
'0001193125-10-015598',
'0001193125-10-090116',
'0001193125-10-171791');
Is similarly fast (.094 seconds). The actual query's in clause utilized the 21 results from the first query, but again I've trimmed them for this posting to 3.
And this:
show indexes from SECNum;
produces:
And
explain select * from SECNum where ADSH_ in (
'0001193125-10-015598',
'0001193125-10-090116',
'0001193125-10-171791');
produces:
But this combined query:
select *
from SECNum
where ADSH_ in (select ADSH_
from SECSub sub
where Symbol_='MSFT');
Is very slow, taking 151 seconds (compared to about 0.1 second for the previous query).
explain select * from SECNum where ADSH_ in (select ADSH_ from SECSub sub where Symbol_='MSFT');
produces:
So, after reading a few similar posts on SO I thought I'd try to re-cast the combined query as a Join operation:
Join Attempt 1
select *
from SECNum num
inner join SECSub sub on num.ADSH_ = sub.ADSH_
where sub.Symbol_ = 'MSFT';
This result, which took 158 seconds, was even slower than using the combined query, which took 151 seconds.
explain select * from SECNum num inner join SECSub sub on num.ADSH_ = sub.ADSH_ where sub.Symbol_ = 'MSFT';
produced:
Join Attempt 2
select *
from (select sub.ADSH_
from SECSub sub
where sub.Symbol_='MSFT') SubSelect
join SECNum num on SubSelect.ADSH_ = num.ADSH_;
This result clocked in at 151 seconds, the same as my combined query.
explain select * from (select sub.ADSH_ from SECSub sub where sub.Symbol_='MSFT') SubSelect join SECNum num on SubSelect.ADSH_ = num.ADSH_;
produced:
So obviously, I don't know what I'm doing (yet). Any suggestions on how to write a query that produces the same results as my combined query, or any of these Join queries, that runs as fast as the case where I have two separate queries (which was around 0.1 seconds)?
Let me start with this query:
select *
from SECNum
where ADSH_ in (select ADSH_
from SECSub sub
where Symbol_ = 'MSFT');
The optimal index on this would be the composite index SECSub(Symbol_, ADSH_). I am guessing that because this index is not available, MySQL is making the wrong choice. It is doing a full table scan and checking for the where condition, rather than using the index to look up the appropriate rows. A covering index (with the two columns) should put the MySQL optimizer on the right path.
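To illustrate what "covering" means here, a minimal sketch follows. It uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN, not MySQL, so the plan text is SQLite's; the index name idx_sub_symbol_adsh and the Other_ filler column are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SECSub (ADSH_ TEXT, Symbol_ TEXT, Other_ TEXT);  -- Other_ is a made-up filler column
    CREATE INDEX idx_sub_symbol_adsh ON SECSub (Symbol_, ADSH_);  -- hypothetical index name
    INSERT INTO SECSub VALUES ('0001193125-10-015598', 'MSFT', 'x');
""")

# The subquery needs only Symbol_ (to filter) and ADSH_ (to return),
# so an index holding both columns can answer it without touching the table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT ADSH_ FROM SECSub WHERE Symbol_ = 'MSFT'"
).fetchall()
detail = " ".join(row[-1] for row in plan)  # last column holds the plan text
print(detail)
```

In MySQL the corresponding check would be to create the index on SECSub (Symbol_, ADSH_) and run EXPLAIN on the subquery; "Using index" in the Extra column indicates that the index covers the query.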
Sometimes, IN with a subquery is not optimized so well (although I thought this was fixed in 5.6). Also try the query with exists:
select *
from SECNum sn
where exists (select ADSH_
from SECSub sub
where sub.Symbol_ = 'MSFT' AND
sub.ADSH_ = sn.ADSH_
);
IN ( SELECT ... ) does not optimize well. In fact, until 5.6 it optimizes very poorly. 5.6 adds a technique that helps. But generally it is better to turn it into a JOIN, even with 5.6.
FROM ( SELECT ... ) a
JOIN ( SELECT ... ) b ON ...
Before 5.6, that performs very poorly because neither subquery has an index, hence lots of table scans of one of the tmp tables. 5.6 (or is it 5.7?) 'discovers' the optimal index for subqueries, thereby helping significantly.
FROM tbl
JOIN ( SELECT ... ) x ON ...
will always (at least before 5.6) perform the subquery first, into a temporary table. Then it will do a NLJ (Nested Loop Join). So, it behooves you to have an index in tbl for whatever column(s) are in the ON clause. And make it a compound index if there are multiple columns.
Compound indexes are often better than single-column indexes. Keep in mind that MySQL almost never uses two indexes for one table in a single SELECT. (The exception is "index merge".)
Whenever asking a performance question, please provide SHOW CREATE TABLE.
With these principles, you should be able to write better-performing queries without having to experiment so much.
First, I tried @Gordon Linoff's suggestion (or implied suggestion) to add a composite index on SECSub consisting of Symbol_ and ADSH_. That made no difference in the performance of any of the queries I tried.
While struggling with this performance issue I noticed that SECNum.ADSH_ was defined with character set latin1 while SECSub.ADSH_ was defined with character set utf8 (collation utf8_general_ci).
I then suspected that when I created the second query by copying and pasting the output from the first query:
select * from SECNum where ADSH_ in (
'0001193125-10-015598',
'0001193125-10-090116',
'0001193125-10-171791');
the literal strings in the in clause were using character set latin1, since they were typed (well, copied and pasted) from within MySQL Workbench, and that this might explain why this query is so fast.
After doing this:
alter table SECSub convert to character set latin1;
The combined query (the subquery) was fast (under 1 second) and for the first time, the explain showed that the query was using the index. The same was true for the variations using Join.
I suppose if I had included in my original question the actual table definitions, someone would have pointed out to me that there was an inconsistency in character sets assigned to table columns that participate in indexes and queries. Lesson learned. Next time I post, I'll include the table definitions (at least for those columns participating in indexes and queries that I'm asking about).
I am running a complicated and costly query to find the MIN() values of a function grouped by another attribute. But I don't just need the value, I need the entry that produces it + the value.
My current pseudoquery goes something like this:
SELECT MIN(COSTLY_FUNCTION(a.att1,a.att2,$v1,$v2)) FROM (prefiltering) as a GROUP BY a.group_att;
but I want a.* and MIN(COSTLY_FUNCTION(a.att1,a.att2,$v1,$v2)) as my result.
The only way I can think of is using this ugly beast:
SELECT a1.*, COSTLY_FUNCTION(a1.att1,a1.att2,$v1,$v2)
FROM (prefiltering) as a1
WHERE COSTLY_FUNCTION(a1.att1,a1.att2,$v1,$v2) =
(SELECT MIN(COSTLY_FUNCTION(a.att1,a.att2,$v1,$v2)) FROM (prefiltering) as a GROUP BY a.group_att)
But now I am executing the prefiltering_query 2 times and have to run the costly function twice. This is ridiculous and I hope that I am doing something seriously wrong here.
Possible solution?:
Just now I realize that I could create a temporary table containing:
(SELECT a1.*, COSTLY_FUNCTION(a1.att1,a1.att2,$v1,$v2) as complex FROM (prefiltering) as a1)
and then run the MIN() as subquery and compare it at greatly reduced cost. Is that the way to go?
A problem with your temporary table solution is that I can't see any way to avoid using it twice in the same query.
However, if you're willing to use an actual permanent table (perhaps with ENGINE = MEMORY), it should work.
You can also move the subquery into the FROM clause, where it might be more efficient:
CREATE TABLE temptable ENGINE = MEMORY
SELECT a1.*,
COSTLY_FUNCTION(a1.att1,a1.att2,$v1,$v2) AS complex
FROM prefiltering AS a1;
CREATE INDEX group_att_complex USING BTREE
ON temptable (group_att, complex);
SELECT a2.*
FROM temptable AS a2
NATURAL JOIN (
SELECT group_att, MIN(complex) AS complex
FROM temptable GROUP BY group_att
) AS a3;
DROP TABLE temptable;
(You can try it without the index too, but I suspect it'll be faster with it.)
Edit: Of course, if one temporary table won't do, you could always use two:
CREATE TEMPORARY TABLE temp1
SELECT *, COSTLY_FUNCTION(att1,att2,$v1,$v2) AS complex
FROM prefiltering;
CREATE INDEX group_att_complex ON temp1 (group_att, complex);
CREATE TEMPORARY TABLE temp2
SELECT group_att, MIN(complex) AS complex
FROM temp1 GROUP BY group_att;
SELECT temp1.* FROM temp1 NATURAL JOIN temp2;
(Again, you may want to try it with or without the index; when I ran EXPLAIN on it, MySQL didn't seem to want to use the index for the final query at all, although that might be just because my test data set was so small. Anyway, here's a link to SQLize if you want to play with it; I used CONCAT() to stand in for your expensive function.)
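Here is a small sketch of the two-temporary-table approach that also counts how often the expensive function runs. It is not MySQL: it uses Python's built-in sqlite3 module, registers a cheap Python function under the name COSTLY_FUNCTION as a stand-in, drops the $v1/$v2 parameters for brevity, and the id column and sample rows are made up.

```python
import sqlite3

call_count = {"n": 0}

def costly_function(att1, att2):
    """Stand-in for COSTLY_FUNCTION; records how often the engine calls it."""
    call_count["n"] += 1
    return att1 * att2

conn = sqlite3.connect(":memory:")
conn.create_function("COSTLY_FUNCTION", 2, costly_function)
conn.executescript("""
    CREATE TABLE prefiltering (id INTEGER, group_att TEXT, att1 INTEGER, att2 INTEGER);
    INSERT INTO prefiltering VALUES
        (1, 'g1', 2, 5), (2, 'g1', 1, 3), (3, 'g2', 4, 4), (4, 'g2', 6, 1);

    -- Step 1: evaluate the function once per row and keep the result.
    CREATE TEMPORARY TABLE temp1 AS
        SELECT *, COSTLY_FUNCTION(att1, att2) AS complex FROM prefiltering;

    -- Step 2: the per-group minimum is computed from the stored values.
    CREATE TEMPORARY TABLE temp2 AS
        SELECT group_att, MIN(complex) AS complex FROM temp1 GROUP BY group_att;
""")

# Step 3: join back to get the full rows that produce each minimum.
rows = conn.execute(
    "SELECT temp1.* FROM temp1 NATURAL JOIN temp2 ORDER BY id"
).fetchall()

print(rows)
print(call_count["n"])
```

The function runs once per input row (four times here), and the final join only compares stored values. If several rows in a group tie for the minimum, all of them are returned.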
You can use the HAVING clause to get columns in addition to that MIN value. For example:
SELECT a.*, COSTLY_FUNCTION(a.att1,a.att2,$v1,$v2) FROM (prefiltering) as a GROUP BY a.group_att HAVING MIN(COSTLY_FUNCTION(a.att1,a.att2,$v1,$v2)) = COSTLY_FUNCTION(a.att1,a.att2,$v1,$v2);
I have a query like:
select * from (select ... ) t1 join (select ... ) t2 on t1._ = t2._
where the join subselects are identical. Is there an easy way to name this select so that I can use it both times? I tried this:
select * from (select ... ) t1 join t1 t2 on t1._ = t2._
but it gave an error. Any ideas?
If the cost of acquiring the rows in your subselect is significant, you may consider storing the intermediate result in a temporary table and then reference that twice in your select.
But you'd better measure this, because storing the intermediate result also has a cost...
Can you share your query? Maybe you don't need to reference it twice after all?
CREATE VIEW MyCommonSelect (Col1, Col2. . .) AS
SELECT Col1, Col2. . .
Depending on exactly what your query looks like, you may be able to name the subqueries internally, but something like this tends to indicate that the subquery represents database logic that (in my opinion — others disagree) deserves its own name.
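As a sketch of the view approach, the shared select gets a name once and is then joined to itself under two aliases. The example runs on Python's built-in sqlite3 module; the items table, its columns and the join condition are made up for illustration, since the original query was not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER, parent_id INTEGER, active INTEGER);  -- made-up table
    INSERT INTO items VALUES (1, NULL, 1), (2, 1, 1), (3, 1, 0), (4, 2, 1);

    -- The select that would otherwise be written out twice.
    CREATE VIEW MyCommonSelect (Col1, Col2) AS
        SELECT id, parent_id FROM items WHERE active = 1;
""")

# Reference the same named select twice, under two aliases.
pairs = conn.execute("""
    SELECT t1.Col1, t2.Col1
    FROM MyCommonSelect t1
    JOIN MyCommonSelect t2 ON t2.Col2 = t1.Col1
    ORDER BY t1.Col1, t2.Col1
""").fetchall()

print(pairs)
```

A view only names the query; the engine may still evaluate it once per reference, so if the select is expensive it is worth measuring this against the temporary-table approach.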