Let's say I have a query of the form:
SELECT a, b, c, d
FROM table1
WHERE a IN (
    SELECT x
    FROM table2
    WHERE some_condition);
Now the subquery for the IN can return a huge number of records.
Assuming that a is the primary key, so an index is used, is this the best way to write such a query?
Or is it more efficient to loop over each of the records returned by the subquery?
It is clear to me that when I do a WHERE a = X, I just do an index (tree) traversal.
But I am not sure how an IN (especially over a huge data set) would traverse/utilize an index.
The MySQL optimizer isn't really ready (yet) to handle this correctly. You should rewrite this kind of query as an INNER JOIN and index it correctly; this will be the fastest method, assuming t1.a and t2.x are unique.
Something like this:
SELECT
    a
  , b
  , c
  , d
FROM
    table1 AS t1
INNER JOIN
    table2 AS t2
      ON t1.a = t2.x
WHERE
    t2.some_condition ....
And make sure that t1.a and t2.x have PRIMARY or UNIQUE indexes.
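If those indexes are missing, a minimal sketch of adding them in MySQL syntax (assuming neither table already defines them; the index name below is arbitrary):

ALTER TABLE table1 ADD PRIMARY KEY (a);
ALTER TABLE table2 ADD UNIQUE INDEX idx_table2_x (x);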
Having one query instead of a loop will definitely be more efficient (and by nature consistent; to get consistent results with a loop, you would in general have to use a serializable transaction). One can argue in favour of EXISTS vs IN; as far as I remember, MySQL generates (or at least this was true up to 5.1)...
The efficiency of utilizing the index on a depends on the number and order of the subquery results (assuming the optimizer chooses to grab the results from the subquery first and then compare them with a). In my understanding, the fastest option is a merge join, which requires both result sets to be sorted by the same key; however, that may not be possible due to different sort orders. Then I guess it's the optimizer's decision whether to sort or to use a loop join. You can rely on its choice, or try using hints and see if it makes a difference.
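As one concrete way to experiment with hints, MySQL's STRAIGHT_JOIN modifier forces the tables to be joined in the order they are written; this is only a sketch to compare against the optimizer's default choice, not a recommendation:

SELECT STRAIGHT_JOIN t1.a, t1.b, t1.c, t1.d
FROM table2 AS t2
INNER JOIN table1 AS t1 ON t1.a = t2.x
WHERE t2.some_condition;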
Related
It seems to me that you can do the same thing in a SQL query using either NOT EXISTS, NOT IN, or LEFT JOIN WHERE IS NULL. For example:
SELECT a FROM table1 WHERE a NOT IN (SELECT a FROM table2)
SELECT a FROM table1 WHERE NOT EXISTS (SELECT * FROM table2 WHERE table1.a = table2.a)
SELECT table1.a FROM table1 LEFT JOIN table2 ON table1.a = table2.a WHERE table2.a IS NULL
I'm not sure if I got all the syntax correct, but these are the general techniques I've seen. Why would I choose to use one over the other? Does performance differ...? Which one of these is the fastest / most efficient? (If it depends on implementation, when would I use each one?)
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: PostgreSQL
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL
In a nutshell:
NOT IN is a little bit different: it never matches if there is but a single NULL in the list (see the example after this list).
In MySQL, NOT EXISTS is a little bit less efficient
In SQL Server, LEFT JOIN / IS NULL is less efficient
In PostgreSQL, NOT IN is less efficient
In Oracle, all three methods are the same.
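To illustrate the NULL point with a small sketch (hypothetical data): if table2.a contains even one NULL, the NOT IN query returns no rows at all, because a NOT IN (1, 2, NULL) evaluates to UNKNOWN rather than TRUE, while NOT EXISTS is unaffected:

SELECT a FROM table1
WHERE a NOT IN (SELECT a FROM table2);  -- empty result if table2.a contains a NULL

SELECT a FROM table1 t1
WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t2.a = t1.a);  -- the expected anti-join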
If the database is good at optimising the query, the first two will be transformed to something close to the third.
For simple situations like the ones in your question, there should be little or no difference, as they will all be executed as joins. In more complex queries, the database might not be able to turn the NOT IN and NOT EXISTS queries into joins, in which case the queries will get a lot slower. On the other hand, a join may also perform badly if there is no index that can be used, so just because you use a join doesn't mean that you are safe. You would have to examine the execution plan of the query to tell if there may be any performance problems.
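For example, in MySQL you can prefix the query with EXPLAIN to see whether indexes are used and how the tables are joined:

EXPLAIN SELECT table1.a
FROM table1
LEFT JOIN table2 ON table1.a = table2.a
WHERE table2.a IS NULL;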
Assuming you are avoiding nulls, they are all ways of writing an anti-join using Standard SQL.
An obvious omission is the equivalent using EXCEPT:
SELECT a FROM table1
EXCEPT
SELECT a FROM table2
Note in Oracle you need to use the MINUS operator (arguably a better name):
SELECT a FROM table1
MINUS
SELECT a FROM table2
Speaking of proprietary syntax, there may also be non-Standard equivalents worth investigating depending on the product you are using e.g. OUTER APPLY in SQL Server (something like):
SELECT t1.a
FROM table1 t1
OUTER APPLY
(
    SELECT t2.a
    FROM table2 t2
    WHERE t2.a = t1.a
) AS dt1
WHERE dt1.a IS NULL;
When you need to insert data into a table with a multi-field primary key, consider that it will be much faster (I tried it in Access, but I think it holds for any database) not to check first that "no records with such key values exist in the table", but rather just to insert into the table; the excess records (duplicates by the key) will simply not be inserted twice.
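In MySQL, for instance, this idea can be expressed with INSERT IGNORE, which silently skips rows that would violate the primary key instead of raising an error (the table and column names here are made up for illustration):

INSERT IGNORE INTO target (key_part1, key_part2, payload)
SELECT key_part1, key_part2, payload
FROM staging;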
From a performance perspective, try to avoid inverse keywords like NOT IN and NOT EXISTS, because to find the non-matching items the DBMS may need to run through all the available rows and discard the matching ones.
I have two tables, a and b, with 2 million and 3.2 million records respectively. I'm trying to get the ids from a that do not exist in b. I have written the query below:
select a.id from a where not exists (select b.id from b where a.id = b.id)
This is taking a long time. Is there any better way to get the results faster?
Update: I just looked into the table structure for both tables and found that a.id has a decimal datatype while b.id has varchar as its datatype.
Will this difference in datatype cause any issues?
Could you try a LEFT JOIN with an IS NULL check? It will return the ids which exist in TableA but not in TableB.
SELECT T1.Id
FROM TableA T1
LEFT JOIN TableB T2 ON T2.Id = T1.Id
WHERE T2.Id IS NULL
While you could write your query using an anti-join, it probably would not affect the performance much, and in fact the underlying execution plan could even be the same. The only way I can see to speed up your query would be to add an index to the b table:
CREATE INDEX idx ON b (id);
But if b.id is a primary key, then it should already be part of the clustered index. In this case, your current performance might be as good as you can get.
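Regarding the datatype mismatch mentioned in the update: comparing a decimal column to a varchar column forces an implicit conversion, which typically prevents index use. A hedged sketch of making the conversion explicit (the exact cast syntax varies by DBMS):

SELECT a.id
FROM a
LEFT JOIN b ON CAST(a.id AS CHAR(32)) = b.id  -- casting the a side leaves b.id indexable
WHERE b.id IS NULL;

Permanently aligning the column datatypes, as the asker ultimately did, is still the cleaner fix.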
(This is mostly a comment, but it's a bit long.)
Please take some time to read some of the many questions about query optimization here on SO. The ones which get downvoted and closed omit table/index definitions and explain plans. The ones which receive upvotes include these, along with cardinality, performance, and result metrics.
The join to table a in your subquery is redundant. When you remove the second reference to that table, you end up with a simpler query. Then you can use a NOT IN or a left join.
But the performance is still going to suck. Wherever possible you should try to avoid painting yourself into a corner like this in your data design.
Thanks for your valuable answers; I found the way. It got resolved by giving the lookup ids the same datatype; after that I got the results in 22 seconds.
About query optimization, I'm wondering if statements like the one below get optimized:
select *
from (
    select *
    from table1 t1
    join table2 t2 using (entity_id)
    order by t2.sort_order, t1.name
) as foo -- main query of object
where foo.name = ?; -- inserted
Consider that the query is taken care of by a dependency object, which just (rightly?) allows one to tack on a WHERE condition. I'm thinking that at least not a lot of data gets pulled into your favorite language, but I'm having second thoughts about whether that's an adequate optimization; maybe the database is still taking some time going through the query.
Or is it better to take that query out and write a separate query method that has the where and maybe a LIMIT 1 clause, too?
In MySQL, no.
The predicate in an outer query does not get "pushed" down into the inline view query.
The query in the inline view is processed first, independent of the outer query. (MySQL will optimize that view query just as it would if you submitted it separately.)
The way that MySQL processes this query: the inline view query gets run first, and the result is materialized as a 'derived table'. That is, the result set from that query gets stored as a temporary table, in memory in some cases (if it's small enough and doesn't contain any columns that aren't supported by the MEMORY engine); otherwise, it's spun out to disk as a MyISAM table, using the MyISAM storage engine.
Once the derived table is populated, then the outer query runs.
(Note that the derived table does not have any indexes on it. That's true in MySQL versions before 5.6; I think there are some improvements in 5.6 where MySQL will actually create an index.)
Clarification: indexes on derived tables: As of MySQL 5.6.3 "During query execution, the optimizer may add an index to a derived table to speed up row retrieval from it." Reference: http://dev.mysql.com/doc/refman/5.6/en/subquery-optimization.html
Also, I don't think MySQL "optimizes out" any unneeded columns from the inline view. If the inline view query is a SELECT *, then all of the columns will be represented in the derived table, whether those are referenced in the outer query or not.
This can lead to some significant performance issues, especially when we don't understand how MySQL processes a statement. (And the way that MySQL processes a statement is significantly different from other relational databases, like Oracle and SQL Server.)
You may have heard a recommendation to "avoid using views in MySQL". The reasoning behind this general advice (which applies to both "stored" views and "inline" views) is the significant performance issues that can be unnecessarily introduced.
As an example, for this query:
SELECT q.name
FROM ( SELECT h.*
FROM huge_table h
) q
WHERE q.id = 42
MySQL does not "push" the predicate id=42 down into the view definition. MySQL first runs the inline view query, and essentially creates a copy of huge_table, as an un-indexed MyISAM table. Once that is done, then the outer query will scan the copy of the table, to locate the rows satisfying the predicate.
If we instead re-write the query to "push" the predicate into the view definition, like this:
SELECT q.name
FROM ( SELECT h.*
FROM huge_table h
WHERE h.id = 42
) q
We expect a much smaller resultset to be returned from the view query, and the derived table should be much smaller. MySQL will also be able to make effective use of an index ON huge_table (id). But there's still some overhead associated with materializing the derived table.
If we eliminate the unnecessary columns from the view definition, that can be more efficient (especially if there are a lot of columns, there are any large columns, or any columns with datatypes not supported by the MEMORY engine):
SELECT q.name
FROM ( SELECT h.name
FROM huge_table h
WHERE h.id = 42
) q
And it would be even more efficient to eliminate the inline view entirely:
SELECT q.name
FROM huge_table q
WHERE q.id = 42
I can't speak for MySQL - not to mention the fact that it probably varies by storage engine and MySQL version, but for PostgreSQL:
PostgreSQL will flatten this into a single query. The inner ORDER BY isn't a problem, because adding or removing a predicate cannot affect the ordering of the remaining rows.
It'll get flattened to:
select *
from table1 t1
join table2 t2 using (entity_id)
where t1.name = ?
order by t2.sort_order, t1.name;
then the join predicate will get internally converted, producing a plan corresponding to the SQL:
select t1.col1, t1.col2, ..., t2.col1, t2.col2, ...
from table1 t1, table2 t2
where
t1.entity_id = t2.entity_id
and t1.name = ?
order by t2.sort_order, t1.name;
Example with a simplified schema:
regress=> CREATE TABLE demo1 (id integer primary key, whatever integer not null);
CREATE TABLE
regress=> INSERT INTO demo1 (id, whatever) SELECT x, x FROM generate_series(1,100) x;
INSERT 0 100
regress=> EXPLAIN SELECT *
FROM (
    SELECT *
    FROM demo1
    ORDER BY id
) derived
WHERE whatever % 10 = 0;
                        QUERY PLAN
-----------------------------------------------------------
 Sort  (cost=2.51..2.51 rows=1 width=8)
   Sort Key: demo1.id
   ->  Seq Scan on demo1  (cost=0.00..2.50 rows=1 width=8)
         Filter: ((whatever % 10) = 0)
 Planning time: 0.173 ms
(5 rows)
... which is the same plan as:
EXPLAIN SELECT *
FROM demo1
WHERE whatever % 10 = 0
ORDER BY id;
                        QUERY PLAN
-----------------------------------------------------------
 Sort  (cost=2.51..2.51 rows=1 width=8)
   Sort Key: id
   ->  Seq Scan on demo1  (cost=0.00..2.50 rows=1 width=8)
         Filter: ((whatever % 10) = 0)
 Planning time: 0.159 ms
(5 rows)
If there was a LIMIT, OFFSET, a window function, or certain other things that prevent qualifier push-down/pull-up/flattening in the inner query then PostgreSQL would recognise that it can't safely flatten it. It'd evaluate the inner query either by materializing it or by iterating over its output and feeding that to the outer query.
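A quick way to see this for yourself (a sketch reusing the demo1 table above): adding a LIMIT to the inner query blocks the flattening, and the plan will show the subquery evaluated on its own with the filter applied on top:

EXPLAIN SELECT *
FROM (
    SELECT *
    FROM demo1
    ORDER BY id
    LIMIT 10
) derived
WHERE whatever % 10 = 0;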
The same applies for a view. PostgreSQL will in-line and flatten views into the containing query where it is safe to do so.
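For example (a sketch; the view name is made up): PostgreSQL expands the view body into the referencing query and plans it exactly as if the SELECT had been written inline:

CREATE VIEW demo1_ordered AS
SELECT * FROM demo1 ORDER BY id;

EXPLAIN SELECT * FROM demo1_ordered WHERE whatever % 10 = 0;
-- same plan as the flattened queries above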
I have a big DB, about 1 million rows. I need to do something like this:
select * from t1 WHERE id1 NOT IN (SELECT id2 FROM t2)
But it runs very slowly. I know that I can do it using JOIN syntax, but I can't understand how.
Try this way:
select *
from t1
left join t2 on t1.id1 = t2.id2
where t2.id2 is null
First of all you should optimize your indexes in both tables, and after that use a join.
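For example (a sketch, assuming the column names from the question and that no such indexes exist yet):

create index idx_t1_id1 on t1 (id1);
create index idx_t2_id2 on t2 (id2);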
There are different ways a dbms can deal with this task:
It can select id2 from t2 and then select all rows of t1 where id1 is not in that set. You suggest this by using the NOT IN clause.
It can go through t1 record by record and check for each record whether a match exists in t2. You would suggest this by using the NOT EXISTS clause.
You can outer join the tables, then throw away all matches and keep the non-matching entries. This may look like a bad approach, especially when there are many matches, because you build a big intermediate result and then throw most of it away. However, depending on how the dbms works, it can be rather fast, for example when it applies hash join techniques.
It all depends on table sizes, number of matches, indexes, etc. and on what the dbms makes of your query. There are dbms that are able to completely re-write your query to find the best execution plan.
Having said all this, you can just try different things:
the NOT IN clause with (SELECT DISTINCT id2 FROM t2). DISTINCT can reduce the intermediate result significantly and really speed up your query. (But maybe your dbms does that anyhow to get a good execution plan.)
use a NOT EXISTS clause and see if that is faster (a sketch follows this list)
the outer join suggested by Parado
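The NOT EXISTS variant from the second bullet would look something like this (column names taken from the question):

select *
from t1
where not exists (select 1 from t2 where t2.id2 = t1.id1);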
Although this question is specific to MySQL, I wouldn't mind knowing if this answer applies to SQL engines in general.
Also, since this isn't a syntax question, I'm using pseudo-SQL for brevity/clarity.
Let's say C[1]..C[M] are a set of criteria (separated by AND or OR) and Q[1]..Q[N] are another set (separated by OR). I want to use C[1]...C[M] to filter a table and from this filtered table, I want all the rows matching Q[1]...Q[N].
If I were to do:
SELECT ... FROM ... WHERE (C[1]...C[M]) AND (Q[1]...Q[N])
Would this be automatically optimized so that C[1]...C[M] is evaluated only once and each Q[i] is run against this cached result? If not, should I then split the query into two, like so:
INSERT INTO TEMP ... SELECT ... FROM ... WHERE C[1]...C[M]
SELECT ... FROM TEMP WHERE Q[1]...Q[N]
It is the job of the internal query optimizer to calculate the best order for evaluating the joins according to the filters.
For instance in:
SELECT *
FROM
table1
INNER JOIN table2 ON table1.id = table2.id AND table2.column = Y
INNER JOIN table3 ON table3.id2 = table2.id2 AND table3.column = Z
WHERE
table1.column = X
MySQL (or Oracle, SQL Server, etc.) will try to compute each intermediate result set in the best order to provide the best performance, and here the engine actually does a pretty good job.
However, everything relies on the statistics it has about the tables and on the indexes you've set up in your schema. These two points (apart from filling up the tables with data...) are the only ones you can influence to help the optimizer make good decisions, by giving it the right and accurate information.
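For example, in MySQL you can refresh the statistics the optimizer relies on with ANALYZE TABLE (a sketch using the tables from the example above):

ANALYZE TABLE table1, table2, table3;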
I hope it helped.
PS: have a look at this. It is about operators and order of precedence in query compilation under Oracle, yet it is probably a good thing to know anyway:
http://ezinearticles.com/?Oracle-SQL---The-Importance-of-Order-of-Precedence&id=1597846