About query optimizations, I'm wondering if statements like one below get optimized:
select *
from (
select *
from table1 t1
join table2 t2 using (entity_id)
order by t2.sort_order, t1.name
) as foo -- main query of object
where foo.name = ?; -- inserted
Consider that the query is handled by a dependency object which just (rightly?) allows one to tack on a WHERE condition. I'm thinking that at least not a lot of data gets pulled into your favorite language, but I'm having second thoughts about whether that's an adequate optimization; maybe the database is still spending time working through the whole inner query.
Or is it better to take that query out and write a separate query method that has the where and maybe a LIMIT 1 clause, too?
In MySQL, no.
The predicate in an outer query does not get "pushed" down into the inline view query.
The query in the inline view is processed first, independent of the outer query. (MySQL will optimize that view query just like it would optimize that query if you submitted that separately.)
The way that MySQL processes this query: the inline view query gets run first, and the result is materialized as a 'derived table'. That is, the result set from that query gets stored as a temporary table, in memory in some cases (if it's small enough, and doesn't contain any columns that aren't supported by the MEMORY engine). Otherwise, it's spun out to disk as a MyISAM table, using the MyISAM storage engine.
Once the derived table is populated, then the outer query runs.
(Note that the derived table does not have any indexes on it. That's true in MySQL versions before 5.6; I think there are some improvements in 5.6 where MySQL will actually create an index.)
Clarification: indexes on derived tables: As of MySQL 5.6.3 "During query execution, the optimizer may add an index to a derived table to speed up row retrieval from it." Reference: http://dev.mysql.com/doc/refman/5.6/en/subquery-optimization.html
Also, I don't think MySQL "optimizes out" any unneeded columns from the inline view. If the inline view query is a SELECT *, then all of the columns will be represented in the derived table, whether those are referenced in the outer query or not.
This can lead to some significant performance issues, especially when we don't understand how MySQL processes a statement. (And the way that MySQL processes a statement is significantly different from other relational databases, like Oracle and SQL Server.)
You may have heard a recommendation to "avoid using views in MySQL". The reasoning behind this general advice (which applies to both "stored" views and "inline" views) is the significant performance issues that can be unnecessarily introduced.
As an example, for this query:
SELECT q.name
FROM ( SELECT h.*
FROM huge_table h
) q
WHERE q.id = 42
MySQL does not "push" the predicate id=42 down into the view definition. MySQL first runs the inline view query, and essentially creates a copy of huge_table, as an un-indexed MyISAM table. Once that is done, then the outer query will scan the copy of the table, to locate the rows satisfying the predicate.
If we instead re-write the query to "push" the predicate into the view definition, like this:
SELECT q.name
FROM ( SELECT h.*
FROM huge_table h
WHERE h.id = 42
) q
We expect a much smaller resultset to be returned from the view query, and the derived table should be much smaller. MySQL will also be able to make effective use of an index ON huge_table (id). But there's still some overhead associated with materializing the derived table.
If we eliminate the unnecessary columns from the view definition, that can be more efficient (especially if there are a lot of columns, there are any large columns, or any columns with datatypes not supported by the MEMORY engine):
SELECT q.name
FROM ( SELECT h.name
FROM huge_table h
WHERE h.id = 42
) q
And it would be even more efficient to eliminate the inline view entirely:
SELECT q.name
FROM huge_table q
WHERE q.id = 42
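To check what a given MySQL version actually does with such an inline view, EXPLAIN is the quickest test. A sketch against the hypothetical huge_table from above:

```sql
-- If the derived table is materialized, EXPLAIN lists a row with
-- select_type DERIVED; if the view was merged into the outer query
-- (as newer versions can do), no DERIVED row appears.
EXPLAIN
SELECT q.name
FROM ( SELECT h.name
       FROM huge_table h
       WHERE h.id = 42
     ) q;
```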
I can't speak for MySQL (not to mention that the behavior probably varies by storage engine and MySQL version), but for PostgreSQL:
PostgreSQL will flatten this into a single query. The inner ORDER BY isn't a problem, because adding or removing a predicate cannot affect the ordering of the remaining rows.
It'll get flattened to:
select *
from table1 t1
join table2 t2 using (entity_id)
where t1.name = ?
order by t2.sort_order, t1.name;
then the join predicate will get internally converted, producing a plan corresponding to the SQL:
select t1.col1, t1.col2, ..., t2.col1, t2.col2, ...
from table1 t1, table2 t2
where
t1.entity_id = t2.entity_id
and t1.name = ?
order by t2.sort_order, t1.name;
Example with a simplified schema:
regress=> CREATE TABLE demo1 (id integer primary key, whatever integer not null);
CREATE TABLE
regress=> INSERT INTO demo1 (id, whatever) SELECT x, x FROM generate_series(1,100) x;
INSERT 0 100
regress=> EXPLAIN SELECT *
FROM (
SELECT *
FROM demo1
ORDER BY id
) derived
WHERE whatever % 10 = 0;
QUERY PLAN
-----------------------------------------------------------
Sort (cost=2.51..2.51 rows=1 width=8)
Sort Key: demo1.id
-> Seq Scan on demo1 (cost=0.00..2.50 rows=1 width=8)
Filter: ((whatever % 10) = 0)
Planning time: 0.173 ms
(5 rows)
... which is the same plan as:
EXPLAIN SELECT *
FROM demo1
WHERE whatever % 10 = 0
ORDER BY id;
QUERY PLAN
-----------------------------------------------------------
Sort (cost=2.51..2.51 rows=1 width=8)
Sort Key: id
-> Seq Scan on demo1 (cost=0.00..2.50 rows=1 width=8)
Filter: ((whatever % 10) = 0)
Planning time: 0.159 ms
(5 rows)
If there was a LIMIT, OFFSET, a window function, or certain other things that prevent qualifier push-down/pull-up/flattening in the inner query then PostgreSQL would recognise that it can't safely flatten it. It'd evaluate the inner query either by materializing it or by iterating over its output and feeding that to the outer query.
The same applies for a view. PostgreSQL will in-line and flatten views into the containing query where it is safe to do so.
Related
I was running a query of this kind:
SELECT
-- fields
FROM
table1 JOIN table2 ON (table1.c1 = table2.c1 OR table1.c2 = table2.c2)
WHERE
-- conditions
But the OR made it very slow, so I split it into 2 queries:
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
This works much better, but now I am going through the tables twice, so I was wondering if there is any further optimization, for instance getting the set of entries that satisfies the condition (table1.c1 = table2.c1 OR table1.c2 = table2.c2) and then querying on it. That would bring me back to the first thing I was doing, but maybe there is another solution I don't have in mind. So is there anything more to do with it, or is it already optimal?
Splitting the query into two separate ones is usually better in MySQL, since it rarely uses an "index OR" operation (Index Merge in MySQL lingo).
There are a few items I would concentrate on for further optimization, all related to indexing:
1. Filter the rows faster
The predicates in the WHERE clause should be optimized to retrieve the fewest rows. They should be analyzed in terms of selectivity, so you can create indexes that produce the data with as little post-filtering as possible (fewer reads).
2. Join access
Retrieving related rows should be optimized as well. Based on selectivity, you need to decide which table is more selective and use it as the driving table, treating the other one as the nested-loop table. For the latter, you should create an index that retrieves rows in an optimal way.
3. Covering Indexes
Last but not least, if your query is still slow, there's one more thing you can do: use covering indexes. That is, expand your indexes to include all the columns the query references from the driving and/or secondary tables. This way the InnoDB engine won't need to read the secondary index and then the clustered index for each row, but a single structure.
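As a sketch (index and column names here are hypothetical; the real column lists must match your actual WHERE conditions and selected fields), covering indexes for the two UNION branches might look like:

```sql
-- Join column first, then the filtered/selected columns, so InnoDB can
-- answer each branch from the index alone without touching the rows.
ALTER TABLE table2 ADD INDEX idx_c1_cover (c1, some_field);
ALTER TABLE table2 ADD INDEX idx_c2_cover (c2, some_field);
```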
Test
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION ALL
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
/* add one more condition which eliminates the rows selected by 1st subquery */
AND table1.c1 != table2.c1
Copied from the comments:
Nico Haase > What do you mean by "test"?
The OP shows query patterns only, so I cannot predict whether the technique is effective or not; I suggest the OP test my variant on his structure and data.
Nico Haase > what you've changed
I have added one more condition to 2nd subquery - see added comment in the code.
Nico Haase > and why?
It replaces UNION DISTINCT with UNION ALL, eliminating the sort of the combined rowset that would otherwise be needed to remove duplicates.
Problem Summary
Using MySQL 5.6, I'm noticing that combined SELECT queries (e.g. select x.a from X x where x.b in (select y.b from Y y where y.c = 'something')) are way slower than doing two separate queries, using the results of the first query in the in clause of the second. And my attempts at using JOIN statements instead of nested queries (influenced by other posts on this site) don't produce any performance improvement.
I know this is a common issue with MySql and I've read many postings here on SO about this issue and tried some of the solutions which, apparently, worked for other posters, but not for me.
This query:
select ADSH_ from SECSub where Symbol_='MSFT';
is fast and produces this result:
'0001193125-10-015598'
'0001193125-10-090116'
'0001193125-10-171791'
There are actually 21 results, but I've trimmed them for this posting to 3.
Here's some additional info:
show indexes from SECSub;
produces:
And
explain select * from SECSub where Symbol_='MSFT';
produces:
Querying a different table using the results of the first query, like this:
select * from SECNum where ADSH_ in (
'0001193125-10-015598',
'0001193125-10-090116',
'0001193125-10-171791');
Is similarly fast (.094 seconds). The actual query's in clause utilized the 21 results from the first query, but again I've trimmed them for this posting to 3.
And this:
show indexes from SECNum;
produces:
And
explain select * from SECNum where ADSH_ in (
'0001193125-10-015598',
'0001193125-10-090116',
'0001193125-10-171791');
produces:
But this combined query:
select *
from SECNum
where ADSH_ in (select ADSH_
from SECSub sub
where Symbol_='MSFT');
Is very slow, taking 151 seconds (compared to about 0.1 second for the previous query).
explain select * from SECNum where ADSH_ in (select ADSH_ from SECSub sub where Symbol_='MSFT');
produces:
So, after reading a few similar posts on SO, I thought I'd try to re-cast the combined query as a Join operation:
Join Attempt 1
select *
from SECNum num
inner join SECSub sub on num.ADSH_ = sub.ADSH_
where sub.Symbol_ = 'MSFT';
This result, which took 158 seconds, was even slower than using the combined query, which took 151 seconds.
explain select * from SECNum num inner join SECSub sub on num.ADSH_ = sub.ADSH_ where sub.Symbol_ = 'MSFT';
produced:
Join Attempt 2
select *
from (select sub.ADSH_
from SECSub sub
where sub.Symbol_='MSFT') SubSelect
join SECNum num on SubSelect.ADSH_ = num.ADSH_;
This result clocked in at 151 seconds, the same as my combined query.
explain select * from (select sub.ADSH_ from SECSub sub where sub.Symbol_='MSFT') SubSelect join SECNum num on SubSelect.ADSH_ = num.ADSH_;
produced:
So obviously, I don't know what I'm doing (yet). Any suggestions on how to write a query that produces the same results as my combined query, or any of these Join queries, that runs as fast as the case where I have two separate queries (which was around 0.1 seconds)?
Let me start with this query:
select *
from SECNum
where ADSH_ in (select ADSH_
from SECSub sub
where Symbol_ = 'MSFT');
The optimal index for this would be the composite index SECSub(Symbol_, ADSH_). I am guessing that because this index is not available, MySQL is making the wrong choice: it is doing a full table scan and checking the where condition, rather than using an index to look up the appropriate rows. A covering index (with the two columns) should put the MySQL optimizer on the right path.
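A minimal sketch of that composite index (the index name is arbitrary):

```sql
-- Covers the subquery: MySQL can locate the Symbol_='MSFT' rows and
-- return their ADSH_ values from the index alone.
ALTER TABLE SECSub ADD INDEX idx_symbol_adsh (Symbol_, ADSH_);
```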
Sometimes, in with a subquery is not optimized so well (although I thought this was fixed in 5.6). Also try the query with a correlated exists:
select *
from SECNum sn
where exists (select ADSH_
              from SECSub sub
              where sub.Symbol_ = 'MSFT' AND
                    sub.ADSH_ = sn.ADSH_
             );
IN ( SELECT ... ) does not optimize well. In fact, until 5.6 it optimizes very poorly. 5.6 adds a technique that helps. But generally it is better to turn it into a JOIN, even with 5.6.
FROM ( SELECT ... ) a
JOIN ( SELECT ... ) b ON ...
Before 5.6, that performs very poorly because neither subquery has an index, hence lots of table scans of one of the temp tables. 5.6 (or is it 5.7?) 'discovers' the optimal index for subqueries, which helps significantly.
FROM tbl
JOIN ( SELECT ... ) x ON ...
will always (at least before 5.6) perform the subquery first, into a temporary table. Then it will do an NLJ (Nested Loop Join). So, it behooves you to have an index on tbl for whatever column(s) are in the ON clause. And make it a compound index if there are multiple columns.
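A sketch of that pattern with hypothetical table and column names: the derived table x is materialized first, then joined back with a nested loop, so tbl wants a compound index on the ON columns:

```sql
SELECT t.*
FROM tbl t
JOIN ( SELECT col_a, col_b
       FROM other_tbl
       WHERE some_filter = 1 ) x
  ON t.col_a = x.col_a
 AND t.col_b = x.col_b;

-- Compound index so each nested-loop lookup into tbl is an index probe
-- instead of a table scan:
ALTER TABLE tbl ADD INDEX idx_a_b (col_a, col_b);
```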
Compound indexes are often better than single-column indexes. Keep in mind that MySQL almost never uses two indexes in a single SELECT ("index merge").
Whenever asking a performance question, please provide SHOW CREATE TABLE.
With these principles, you should be able to write better-performing queries without having to experiment so much.
First, I tried @Gordon Linoff's suggestion (or implied suggestion) to add a composite index on SECSub consisting of Symbol_ and ADSH_. That made no difference in the performance of any of the queries I tried.
While struggling with this performance issue I noticed that SECNum.ADSH_ was defined with character set latin1, while SECSub.ADSH_ was defined with character set utf8 (collation utf8_general_ci).
I then suspected that when I created the second query by copy and pasting the output from the first query:
select * from SECNum where ADSH_ in (
'0001193125-10-015598',
'0001193125-10-090116',
'0001193125-10-171791');
That the literal strings in the in clause were using character set latin1, since they were typed (well, copied and pasted) from within MySQL Workbench, and that might explain why this query is so fast.
After doing this:
alter table SECSub convert to character set latin1;
The combined query (the subquery) was fast (under 1 second) and for the first time, the explain showed that the query was using the index. The same was true for the variations using Join.
I suppose if I had included in my original question the actual table definitions, someone would have pointed out to me that there was an inconsistency in character sets assigned to table columns that participate in indexes and queries. Lesson learned. Next time I post, I'll include the table definitions (at least for those columns participating in indexes and queries that I'm asking about).
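One way to catch such a mismatch before it bites is to compare the character sets of the columns involved. This query against information_schema is a sketch (it assumes the tables live in the current schema):

```sql
SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = DATABASE()
  AND column_name = 'ADSH_';
-- Any difference in character_set_name between SECSub and SECNum forces
-- a conversion when the columns are compared, defeating index use.
```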
I have a table and a query that looks like below. For a working example, see this SQL Fiddle.
SELECT o.property_B, SUM(o.score1), w.score
FROM o
INNER JOIN
(
SELECT o.property_B, SUM(o.score2) AS score FROM o GROUP BY property_B
) w ON w.property_B = o.property_B
WHERE o.property_A = 'specific_A'
GROUP BY property_B;
With my real data, this query takes 27 seconds. However, if I first create w as a temporary table and index property_B, it altogether takes ~1 second.
CREATE TEMPORARY TABLE w AS
SELECT o.property_B, SUM(o.score2) AS score FROM o GROUP BY property_B;
ALTER TABLE w ADD INDEX `property_B_idx` (property_B);
SELECT o.property_B, SUM(o.score1), w.score
FROM o
INNER JOIN w ON w.property_B = o.property_B
WHERE o.property_A = 'specific_A'
GROUP BY property_B;
DROP TABLE IF EXISTS w;
Is there a way to combine the best of these two queries? I.e. a single query with the speed advantages of the indexing in the subquery?
EDIT
After Mehran's answer below, I read this piece of explanation in the MySQL documentation:
As of MySQL 5.6.3, the optimizer more efficiently handles subqueries in the FROM clause (that is, derived tables):
...
For cases when materialization is required for a subquery in the FROM clause, the optimizer may speed up access to the result by adding an index to the materialized table. If such an index would permit ref access to the table, it can greatly reduce amount of data that must be read during query execution. Consider the following query:
SELECT * FROM t1
JOIN (SELECT * FROM t2) AS derived_t2 ON t1.f1=derived_t2.f1;
The optimizer constructs an index over column f1 from derived_t2 if doing so would permit the use of ref access for the lowest cost execution plan. After adding the index, the optimizer can treat the materialized derived table the same as a usual table with an index, and it benefits similarly from the generated index. The overhead of index creation is negligible compared to the cost of query execution without the index. If ref access would result in higher cost than some other access method, no index is created and the optimizer loses nothing.
First of all, you need to know that creating a temporary table is absolutely a feasible solution, but only in cases where no other choice is applicable, which is not true here!
In your case, you can easily boost your query as FrankPl pointed out, because your sub-query and main query are both grouping by the same field, so you don't need any sub-queries. I'm going to copy and paste FrankPl's solution for the sake of completeness:
SELECT o.property_B, SUM(o.score1), SUM(o.score2)
FROM o
GROUP BY property_B;
Yet it doesn't mean it's impossible to come across a scenario in which you wish you could index a sub-query. In such cases you've got two choices. The first is using a temporary table, as you pointed out yourself, holding the results of the sub-query. This solution is advantageous since it has been supported by MySQL for a long time; it's just not feasible if there's a huge amount of data involved.
The second solution is using MySQL version 5.6 or above. In recent versions of MySQL new algorithms are incorporated so an index defined on a table used within a sub-query can also be used outside of the sub-query.
[UPDATE]
For the edited version of the question I would recommend the following solution:
SELECT o.property_B, SUM(IF(o.property_A = 'specific_A', o.score1, 0)), SUM(o.score2)
FROM o
GROUP BY property_B
HAVING SUM(IF(o.property_A = 'specific_A', o.score1, 0)) > 0;
But you need to work on the HAVING part. You might need to change it according to your actual problem.
I am not really that familiar with MySQL; I have mostly worked with Oracle.
If you want a where-clause in the SUM, you can use decode or case.
it would look something like that
SELECT o.property_B, SUM(decode(property_A, 'specific_A', o.score1, 0)), SUM(o.score2)
FROM o
GROUP BY property_B;
or with case
SELECT o.property_B, SUM(CASE
WHEN property_A = 'specific_A' THEN o.score1
ELSE 0
END ),
SUM(o.score2)
FROM o
GROUP BY property_B;
I do not see why you would need the join at all. I would assume that
SELECT o.property_B, SUM(o.score1), SUM(o.score2)
FROM o
GROUP BY property_B;
should give what you want, but with a much simpler, and hence easier to optimize, statement.
It should be the duty of MySQL to optimize your query, and I don't think there is a way to create an index on the fly. However, you can try to force the use of an index on o.property_B (if you have one). See http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
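A sketch of such a hint, assuming an index named property_B_idx exists on o (adjust the name to whatever SHOW INDEX FROM o reports):

```sql
SELECT o.property_B, SUM(o.score1), w.score
FROM o FORCE INDEX (property_B_idx)
INNER JOIN
(
  SELECT o.property_B, SUM(o.score2) AS score FROM o GROUP BY property_B
) w ON w.property_B = o.property_B
WHERE o.property_A = 'specific_A'
GROUP BY property_B;
```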
Also, you can merge the create and alter statements, if you prefer.
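Merged, the temporary-table variant might look like this (MySQL lets you declare an index directly in CREATE TABLE ... SELECT):

```sql
-- One statement instead of CREATE + ALTER; the index is built as the
-- table is populated.
CREATE TEMPORARY TABLE w (INDEX property_B_idx (property_B)) AS
SELECT o.property_B, SUM(o.score2) AS score
FROM o
GROUP BY property_B;
```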
[Summary of the question: 2 SQL statements produce same results, but at different speeds. One statement uses JOIN, other uses IN. JOIN is faster than IN]
I tried 2 kinds of SELECT statements on 2 tables, named booking_record and inclusions. The table inclusions has a many-to-one relation with table booking_record.
(Table definitions not included for simplicity.)
First statement: (using IN clause)
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)
Second statement: (using JOIN)
SELECT
id,
agent,
source
FROM
booking_record
JOIN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
) inclusions
ON
id = foreign_key_booking_record
With 300,000+ rows in the booking_record table and 6,100,000+ rows in the inclusions table, the 2nd statement delivered 127 rows in just 0.08 seconds, but the 1st statement took nearly 21 minutes for the same records.
Why JOIN is so much faster than IN clause?
This behavior is well-documented. See here.
The short answer is that until MySQL version 5.6.6, MySQL did a poor job of optimizing these types of queries. What would happen is that the subquery would be run once for every row in the outer query. Lots and lots of overhead, running the same query over and over. You could improve this by using good indexing and removing the distinct from the in subquery.
This is one of the reasons that I prefer exists instead of in, if you care about performance.
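For the query in the question, the exists form would be a sketch like the following (column names taken from the question; the correlated exists lets MySQL probe inclusions once per booking_record row instead of re-running the subquery as an uncorrelated list):

```sql
SELECT id, agent, source
FROM booking_record br
WHERE EXISTS ( SELECT 1
               FROM inclusions i
               WHERE i.foreign_key_booking_record = br.id
                 AND i.foreign_key_bill IS NULL
                 AND i.invoice_closure <> FALSE );
```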
EXPLAIN should give you some clues (MySQL EXPLAIN syntax).
I suspect that the IN version constructs a list which is then scanned for each item (IN is generally considered a very inefficient construct; I only use it if I have a short list of items to enter manually).
The JOIN is more likely constructing a temp table for the results, making it more like normal JOINs between tables.
You should explore this by using EXPLAIN, as said by Ollie.
But in advance, note that the second command has one more filter: id = foreign_key_booking_record.
Check if this has the same performance:
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
id = foreign_key_booking_record -- new filter
AND
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)
Let's say I have a query of the form:
SELECT a, b, c, d
FROM table1
WHERE a IN (
SELECT x
FROM table2
WHERE some_condition);
Now the query for the IN can return a huge number of records.
Assuming that a is the primary key, so an index is used, is this the best way to write such a query?
Or it is more optimal to loop over each of the records returned by the subquery?
For me it is clear that when I do a where a = X, I just do an index (tree) traversal.
But I am not sure how an IN (especially over a huge data set) would traverse/utilize an index.
The MySQL optimizer isn't really ready (yet) to handle this correctly. You should rewrite this kind of query as an INNER JOIN and index correctly; this will be the fastest method, assuming t1.a and t2.x are unique.
Something like this:
SELECT
a
, b
, c
, d
FROM
table1 as t1
INNER JOIN
table2 as t2
ON t1.a = t2.x
WHERE
t2.some_condition ....
And make sure that t1.a and t2.x have PRIMARY or UNIQUE indexes
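As a sketch, assuming the columns are not already keyed:

```sql
ALTER TABLE table1 ADD PRIMARY KEY (a);        -- or a UNIQUE index
ALTER TABLE table2 ADD UNIQUE INDEX idx_x (x);
```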
Having 1 query instead of a loop will definitely be more efficient (and by nature consistent; to get consistent results with a loop you will, in general, have to use a serializable transaction). One can argue in favour of EXISTS vs IN; as far as I remember, MySQL generates (or at least it was true up to 5.1)...
Efficiency of utilizing the index on a depends on the number and order of the subquery results (assuming the optimizer chooses to grab results from the subquery first and then compare them with a). In my understanding, the fastest option is to perform a merge join, which requires both resultsets sorted by the same key; however, that may not be possible due to differing sort orders. Then I guess it's the optimizer's decision whether to sort or to use a loop join. You can rely on its choice, or try using hints and see if they make a difference.