Why isn't MySQL able to consistently optimize queries in the format of WHERE <indexed_field> IN (<subquery>)?
I have a query as follows:
SELECT
*
FROM
t1
WHERE
t1.indexed_field IN (select val from ...)
AND (...)
The subquery select val from ... runs very quickly. The problem is MySQL is doing a full table scan to get the required rows from t1 -- even though t1.indexed_field is indexed.
I've gotten around this by changing the query to an inner join:
SELECT
*
FROM
t1
INNER JOIN
(select val from ...) vals ON (vals.val = t1.indexed_field)
WHERE
(...)
Explain shows that this works perfectly -- MySQL is now able to use the indexed_field index when joining to the subquery table.
My question is: Why isn't MySQL able to optimize the first query? Intuitively, doing where <indexed_field> IN (<subquery>) seems like quite an easy optimization -- do the subquery, use the index to grab the rows.
No, or yes, depending on your version.
Old versions of MySQL did a very poor job of optimizing IN ( SELECT ... ). It seemed to re-evaluate the subquery repeatedly.
New versions are turning it into EXISTS ( SELECT 1 ... ) or perhaps a LEFT JOIN.
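The rewrite that newer optimizers apply can be imitated by hand. A sketch, using the table and column names from the question (other_table here is a stand-in for whatever the subquery selects from):

select *
from t1
where exists (select 1
              from other_table o
              where o.val = t1.indexed_field);

Setting NULL edge cases aside, this is equivalent to the IN form; MySQL evaluates the EXISTS once per candidate row of t1, so an index on other_table.val makes each probe cheap. Older versions effectively forced this shape on you, which is why the explicit join often worked better.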
Please provide
Version
SHOW CREATE TABLE
EXPLAIN SELECT ...
Related
Problem Summary
Using MySQL 5.6, I'm noticing that combined SELECT queries (e.g. select x.a from X x where x.b in (select y.b from Y y where y.c = 'something')) are much slower than running two separate queries and using the results of the first query in the IN clause of the second. And my attempts at using JOIN statements instead of nested queries (influenced by other posts on this site) don't produce any performance improvement.
I know this is a common issue with MySql and I've read many postings here on SO about this issue and tried some of the solutions which, apparently, worked for other posters, but not for me.
This query:
select ADSH_ from SECSub where Symbol_='MSFT';
is fast and produces this result:
'0001193125-10-015598'
'0001193125-10-090116'
'0001193125-10-171791'
There are actually 21 results, but I've trimmed them for this posting to 3.
Here's some additional info:
show indexes from SECSub;
produces:
And
explain select * from SECSub where Symbol_='MSFT';
produces:
Querying a different table using the results of the first query, like this:
select * from SECNum where ADSH_ in (
'0001193125-10-015598',
'0001193125-10-090116',
'0001193125-10-171791');
is similarly fast (0.094 seconds). The actual query's in clause utilized the 21 results from the first query, but again I've trimmed them for this posting to 3.
And this:
show indexes from SECNum;
produces:
And
explain select * from SECNum where ADSH_ in (
'0001193125-10-015598',
'0001193125-10-090116',
'0001193125-10-171791');
produces:
But this combined query:
select *
from SECNum
where ADSH_ in (select ADSH_
from SECSub sub
where Symbol_='MSFT');
is very slow, taking 151 seconds (compared to about 0.1 seconds for the previous query).
explain select * from SECNum where ADSH_ in (select ADSH_ from SECSub sub where Symbol_='MSFT');
produces:
So, after reading a few similar posts on SO, I thought I'd try to re-cast the combined query as a Join operation:
Join Attempt 1
select *
from SECNum num
inner join SECSub sub on num.ADSH_ = sub.ADSH_
where sub.Symbol_ = 'MSFT';
This result, which took 158 seconds, was even slower than using the combined query, which took 151 seconds.
explain select * from SECNum num inner join SECSub sub on num.ADSH_ = sub.ADSH_ where sub.Symbol_ = 'MSFT';
produced:
Join Attempt 2
select *
from (select sub.ADSH_
from SECSub sub
where sub.Symbol_='MSFT') SubSelect
join SECNum num on SubSelect.ADSH_ = num.ADSH_;
This result clocked in at 151 seconds, the same as my combined query.
explain select * from (select sub.ADSH_ from SECSub sub where sub.Symbol_='MSFT') SubSelect join SECNum num on SubSelect.ADSH_ = num.ADSH_;
produced:
So obviously, I don't know what I'm doing (yet). Any suggestions on how to write a query that produces the same results as my combined query, or any of these Join queries, that runs as fast as the case where I have two separate queries (which was around 0.1 seconds)?
Let me start with this query:
select *
from SECNum
where ADSH_ in (select ADSH_
from SECSub sub
where Symbol_ = 'MSFT');
The optimal index on this would be the composite index SECSub(Symbol_, ADSH_). I am guessing that because this index is not available, MySQL is making the wrong choice: it is doing a full table scan and checking the where condition, rather than using the index to look up the appropriate rows. A covering index (with the two columns) should put the MySQL optimizer on the right path.
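A minimal sketch of that suggestion (the index name is made up; the table and column names come from the question):

ALTER TABLE SECSub ADD INDEX idx_symbol_adsh (Symbol_, ADSH_);

Because the index contains both Symbol_ and ADSH_, the subquery can be answered entirely from the index, without touching the table rows at all.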
Sometimes, IN with a subquery is not optimized so well (although I thought this was fixed in 5.6). Also try the query with exists:
select *
from SECNum sn
where exists (select ADSH_
from SECSub sub
where sub.Symbol_ = 'MSFT' AND
sub.ADSH_ = sn.ADSH_
);
IN ( SELECT ... ) does not optimize well. In fact, until 5.6 it optimizes very poorly. 5.6 adds a technique that helps. But generally it is better to turn it into a JOIN, even with 5.6.
FROM ( SELECT ... ) a
JOIN ( SELECT ... ) b ON ...
Before 5.6, that performs very poorly because neither subquery has an index, hence lots of table scans of one of the tmp tables. 5.6 (or is it 5.7?) 'discovers' the optimal index for subqueries, thereby helping significantly.
FROM tbl
JOIN ( SELECT ... ) x ON ...
will always (at least before 5.6) perform the subquery first, into a temporary table. Then it will do a NLJ (Nested Loop Join). So, it behooves you to have an index in tbl for whatever column(s) are in the ON clause. And make it a compound index if there are multiple columns.
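A sketch of that advice; tbl, its columns, and the index name here are hypothetical:

-- For: FROM tbl JOIN ( SELECT ... ) x ON x.a = tbl.a AND x.b = tbl.b
-- give tbl a compound index covering both ON columns, so each probe
-- of the nested loop join is an index lookup rather than a scan:
ALTER TABLE tbl ADD INDEX idx_a_b (a, b);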
Compound indexes are often better than single-column indexes. Keep in mind that MySQL almost never uses two indexes in a single SELECT ("index merge" is the rare exception).
Whenever asking a performance question, please provide SHOW CREATE TABLE.
With these principles, you should be able to write better-performing queries without having to experiment so much.
First, I tried @Gordon Linoff's suggestion (or implied suggestion) to add a composite index on SECSub consisting of Symbol_ and ADSH_. That made no difference in the performance of any of the queries I tried.
While struggling with this performance issue I noticed that SECNum.ADSH_ was defined with character set latin1 while SECSub.ADSH_ was defined with character set utf8 (collation utf8_general_ci).
I then suspected that, when I created the second query by copying and pasting the output from the first query:
select * from SECNum where ADSH_ in (
'0001193125-10-015598',
'0001193125-10-090116',
'0001193125-10-171791');
the literal strings in the in clause were using character set latin1, since they were typed (well, copied and pasted) from within MySQL Workbench, and that might explain why this query is so fast.
After doing this:
alter table SECSub convert to character set latin1;
the combined query (the one with the subquery) was fast (under 1 second) and, for the first time, the explain showed that the query was using the index. The same was true for the variations using Join.
I suppose if I had included in my original question the actual table definitions, someone would have pointed out to me that there was an inconsistency in character sets assigned to table columns that participate in indexes and queries. Lesson learned. Next time I post, I'll include the table definitions (at least for those columns participating in indexes and queries that I'm asking about).
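A quick way to spot this kind of mismatch up front is to compare the column character sets directly in information_schema (the column name here is the one from this question):

SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = DATABASE()
  AND column_name = 'ADSH_';

If the rows disagree on character_set_name, comparing the columns forces a character-set conversion on one side, which defeats the index.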
About query optimizations, I'm wondering if statements like one below get optimized:
select *
from (
select *
from table1 t1
join table2 t2 using (entity_id)
order by t2.sort_order, t1.name
) as foo -- main query of object
where foo.name = ?; -- inserted
Consider that the query is taken care of by a dependency object, which just (rightly?) allows one to tack on a WHERE condition. I'm thinking that at least not a lot of data gets pulled into your favorite language, but I'm having second thoughts about whether that's an adequate optimization; maybe the database is still taking some time going through the query.
Or is it better to take that query out and write a separate query method that has the where and maybe a LIMIT 1 clause, too?
In MySQL, no.
The predicate in an outer query does not get "pushed" down into the inline view query.
The query in the inline view is processed first, independent of the outer query. (MySQL will optimize that view query just like it would optimize that query if you submitted that separately.)
The way that MySQL processes this query: the inline view query gets run first, and the result is materialized as a 'derived table'. That is, the result set from that query gets stored as a temporary table, in memory in some cases (if it's small enough and doesn't contain any columns that aren't supported by the MEMORY engine); otherwise, it's spun out to disk as a table using the MyISAM storage engine.
Once the derived table is populated, then the outer query runs.
(Note that the derived table does not have any indexes on it. That's true in MySQL versions before 5.6; I think there are some improvements in 5.6 where MySQL will actually create an index.)
Clarification: indexes on derived tables: As of MySQL 5.6.3 "During query execution, the optimizer may add an index to a derived table to speed up row retrieval from it." Reference: http://dev.mysql.com/doc/refman/5.6/en/subquery-optimization.html
Also, I don't think MySQL "optimizes out" any unneeded columns from the inline view. If the inline view query is a SELECT *, then all of the columns will be represented in the derived table, whether those are referenced in the outer query or not.
This can lead to some significant performance issues, especially when we don't understand how MySQL processes a statement. (And the way that MySQL processes a statement is significantly different from other relational databases, like Oracle and SQL Server.)
You may have heard a recommendation to "avoid using views in MySQL". The reasoning behind this general advice (which applies to both "stored" views and "inline" views) is the significant performance issues that can be unnecessarily introduced.
As an example, for this query:
SELECT q.name
FROM ( SELECT h.*
FROM huge_table h
) q
WHERE q.id = 42
MySQL does not "push" the predicate id=42 down into the view definition. MySQL first runs the inline view query, and essentially creates a copy of huge_table, as an un-indexed MyISAM table. Once that is done, then the outer query will scan the copy of the table, to locate the rows satisfying the predicate.
If we instead re-write the query to "push" the predicate into the view definition, like this:
SELECT q.name
FROM ( SELECT h.*
FROM huge_table h
WHERE h.id = 42
) q
We expect a much smaller resultset to be returned from the view query, and the derived table should be much smaller. MySQL will also be able to make effective use of an index ON huge_table (id). But there's still some overhead associated with materializing the derived table.
If we eliminate the unnecessary columns from the view definition, that can be more efficient (especially if there are a lot of columns, there are any large columns, or any columns with datatypes not supported by the MEMORY engine):
SELECT q.name
FROM ( SELECT h.name
FROM huge_table h
WHERE h.id = 42
) q
And it would be even more efficient to eliminate the inline view entirely:
SELECT q.name
FROM huge_table q
WHERE q.id = 42
I can't speak for MySQL - not to mention the fact that it probably varies by storage engine and MySQL version, but for PostgreSQL:
PostgreSQL will flatten this into a single query. The inner ORDER BY isn't a problem, because adding or removing a predicate cannot affect the ordering of the remaining rows.
It'll get flattened to:
select *
from table1 t1
join table2 t2 using (entity_id)
where t1.name = ?
order by t2.sort_order, t1.name;
then the join predicate will get internally converted, producing a plan corresponding to the SQL:
select t1.col1, t1.col2, ..., t2.col1, t2.col2, ...
from table1 t1, table2 t2
where
t1.entity_id = t2.entity_id
and t1.name = ?
order by t2.sort_order, t1.name;
Example with a simplified schema:
regress=> CREATE TABLE demo1 (id integer primary key, whatever integer not null);
CREATE TABLE
regress=> INSERT INTO demo1 (id, whatever) SELECT x, x FROM generate_series(1,100) x;
INSERT 0 100
regress=> EXPLAIN SELECT *
FROM (
SELECT *
FROM demo1
ORDER BY id
) derived
WHERE whatever % 10 = 0;
QUERY PLAN
-----------------------------------------------------------
Sort (cost=2.51..2.51 rows=1 width=8)
Sort Key: demo1.id
-> Seq Scan on demo1 (cost=0.00..2.50 rows=1 width=8)
Filter: ((whatever % 10) = 0)
Planning time: 0.173 ms
(5 rows)
... which is the same plan as:
EXPLAIN SELECT *
FROM demo1
WHERE whatever % 10 = 0
ORDER BY id;
QUERY PLAN
-----------------------------------------------------------
Sort (cost=2.51..2.51 rows=1 width=8)
Sort Key: id
-> Seq Scan on demo1 (cost=0.00..2.50 rows=1 width=8)
Filter: ((whatever % 10) = 0)
Planning time: 0.159 ms
(5 rows)
If there was a LIMIT, OFFSET, a window function, or certain other things that prevent qualifier push-down/pull-up/flattening in the inner query then PostgreSQL would recognise that it can't safely flatten it. It'd evaluate the inner query either by materializing it or by iterating over its output and feeding that to the outer query.
The same applies for a view. PostgreSQL will in-line and flatten views into the containing query where it is safe to do so.
Hi, I have this query but it's giving me an error of "Operand should contain 1 column(s)"; not sure why?
Select *,
(Select *
FROM InstrumentModel
WHERE InstrumentModel.InstrumentModelID=Instrument.InstrumentModelID)
FROM Instrument
According to your query, you wanted to get data from the Instrument and InstrumentModel tables. In your case, the subquery returns every column of InstrumentModel, but a subquery in the select list is expected to return a single value. To fetch results from both tables by matching rows you can use a join, or you can select particular fields as tableName.fieldName and put your matching condition in the WHERE clause,
like :
select Instrument.x,InstrumentModel.y
from instrument,instrumentModel
where instrument.x=instrumentModel.y
You can use a join to select from 2 connected tables
select *
from Instrument i
join InstrumentModel m on m.InstrumentModelID = i.InstrumentModelID
When you use subqueries in the column list, they need to return exactly one value. You can read more in the documentation.
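For instance, a version of the original query that satisfies the one-value rule would select a single column; a sketch (the Name column is an assumption, not from the original question):

SELECT i.*,
       (SELECT m.Name
        FROM InstrumentModel m
        WHERE m.InstrumentModelID = i.InstrumentModelID) AS ModelName
FROM Instrument i;

This works as long as InstrumentModelID is unique in InstrumentModel, so the subquery returns at most one row per outer row.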
as a user commented in the documentation, using subqueries like this can ruin your performance:
when the same subquery is used several times, mysql does not use this fact to optimize the query, so be careful not to run into performance problems.
example:
SELECT
col0,
(SELECT col1 FROM table1 WHERE table1.id = table0.id),
(SELECT col2 FROM table1 WHERE table1.id = table0.id)
FROM
table0
WHERE ...
the join of table0 with table1 is executed once for EACH subquery, leading to very bad performance for this kind of query.
Therefore you should rather join the tables, as described by the other answer.
Running the following works great:
SELECT email FROM User WHERE empNum IN (126,513,74)
However, this takes a very long time to reply (no errors) using:
SELECT email FROM table1 WHERE empNum IN (
SELECT empNum FROM table2 WHERE accomp = 'onhold' GROUP BY empNum
)
What is causing this?
How about that one?
SELECT DISTINCT table1.email
FROM table1
INNER JOIN table2 USING(empNum)
WHERE table2.accomp = 'onhold'
You should probably make an index on table2.accomp if you use that query often enough:
CREATE INDEX accomp ON table2 (accomp);
or maybe
CREATE INDEX accomp ON table2 (empNum,accomp);
To perform some crude (but deciding) benchmarks:
log in mysql console
clear the query cache(*):
RESET QUERY CACHE;
run the slow query and write down the timing
create an index
clear the query cache
run the slow query and write down the timing
drop the index
create the other index
clear the cache
run the slow query one more time
compare the timings and keep the best index (by dropping the current one and creating the correct one if necessary)
(*) You will need the relevant privileges to run that command
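The steps above can be sketched as a script, using the slow query from the question (timings are whatever your client reports):

RESET QUERY CACHE;
SELECT email FROM table1 WHERE empNum IN (
    SELECT empNum FROM table2 WHERE accomp = 'onhold' GROUP BY empNum
);  -- note the timing (baseline)

CREATE INDEX accomp ON table2 (accomp);
RESET QUERY CACHE;
SELECT email FROM table1 WHERE empNum IN (
    SELECT empNum FROM table2 WHERE accomp = 'onhold' GROUP BY empNum
);  -- note the timing with the single-column index

DROP INDEX accomp ON table2;
CREATE INDEX accomp ON table2 (empNum, accomp);
RESET QUERY CACHE;
SELECT email FROM table1 WHERE empNum IN (
    SELECT empNum FROM table2 WHERE accomp = 'onhold' GROUP BY empNum
);  -- compare, and keep whichever index won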
I think the join statement you need is:
SELECT email FROM table1
INNER JOIN table2
ON table1.empNum=table2.empNum
AND table2.accomp = 'onhold'
I am running a complicated and costly query to find the MIN() values of a function grouped by another attribute. But I don't just need the value, I need the entry that produces it + the value.
My current pseudoquery goes something like this:
SELECT MIN(COSTLY_FUNCTION(a.att1,a.att2,$v1,$v2)) FROM (prefiltering) as a GROUP BY a.group_att;
but I want a.* and MIN(COSTLY_FUNCTION(a.att1,a.att2,$v1,$v2)) as my result.
The only way I can think of is using this ugly beast:
SELECT a1.*, COSTLY_FUNCTION(a1.att1,a1.att2,$v1,$v2)
FROM (prefiltering) as a1
WHERE COSTLY_FUNCTION(a1.att1,a1.att2,$v1,$v2) =
(SELECT MIN(COSTLY_FUNCTION(a.att1,a.att2,$v1,$v2)) FROM (prefiltering) as a GROUP BY a.group_att)
But now I am executing the prefiltering_query 2 times and have to run the costly function twice. This is ridiculous and I hope that I am doing something seriously wrong here.
Possible solution?:
Just now I realize that I could create a temporary table containing:
(SELECT a1.*, COSTLY_FUNCTION(a1.att1,a1.att2,$v1,$v2) as complex FROM (prefiltering) as a1)
and then run the MIN() as subquery and compare it at greatly reduced cost. Is that the way to go?
A problem with your temporary table solution is that I can't see any way to avoid using it twice in the same query.
However, if you're willing to use an actual permanent table (perhaps with ENGINE = MEMORY), it should work.
You can also move the subquery into the FROM clause, where it might be more efficient:
CREATE TABLE temptable ENGINE = MEMORY
SELECT a1.*,
COSTLY_FUNCTION(a1.att1,a1.att2,$v1,$v2) AS complex
FROM prefiltering AS a1;
CREATE INDEX group_att_complex USING BTREE
ON temptable (group_att, complex);
SELECT a2.*
FROM temptable AS a2
NATURAL JOIN (
SELECT group_att, MIN(complex) AS complex
FROM temptable GROUP BY group_att
) AS a3;
DROP TABLE temptable;
(You can try it without the index too, but I suspect it'll be faster with it.)
Edit: Of course, if one temporary table won't do, you could always use two:
CREATE TEMPORARY TABLE temp1
SELECT *, COSTLY_FUNCTION(att1,att2,$v1,$v2) AS complex
FROM prefiltering;
CREATE INDEX group_att_complex ON temp1 (group_att, complex);
CREATE TEMPORARY TABLE temp2
SELECT group_att, MIN(complex) AS complex
FROM temp1 GROUP BY group_att;
SELECT temp1.* FROM temp1 NATURAL JOIN temp2;
(Again, you may want to try it with or without the index; when I ran EXPLAIN on it, MySQL didn't seem to want to use the index for the final query at all, although that might be just because my test data set was so small. Anyway, here's a link to SQLize if you want to play with it; I used CONCAT() to stand in for your expensive function.)
You can use the HAVING clause to get columns in addition to that MIN value. For example:
SELECT a.*, COSTLY_FUNCTION(a.att1,a.att2,$v1,$v2)
FROM (prefiltering) as a
GROUP BY a.group_att
HAVING MIN(COSTLY_FUNCTION(a.att1,a.att2,$v1,$v2)) = COSTLY_FUNCTION(a.att1,a.att2,$v1,$v2);