In most cases, when I tried to remove an OR condition and replace it with a UNION (which handles each of the conditions separately), the query performed significantly better, as those parts of the query were index-able again.
Is there a rule of thumb (and maybe some documentation to support it) on when this 'trick' stops being useful? Will it be useful for 2 OR conditions? For 10? As the number of UNIONs increases, the UNION DISTINCT part may have its own overhead.
What would be your rule of thumb on this?
Small example of the transformation:
SELECT
    a, b
FROM
    tbl
WHERE
    a = 1 OR b = 2
Transformed to:
(SELECT
    tbl.a, tbl.b
FROM
    tbl
WHERE
    tbl.b = 2)
UNION DISTINCT
(SELECT
    tbl.a, tbl.b
FROM
    tbl
WHERE
    tbl.a = 1)
I suggest there is no useful Rule of Thumb (RoT). Here is why...
As you imply, more UNIONs mean slower work, while more ORs do not (at least not by much). The SELECTs of a UNION are costly because they are run separately. I would estimate that a UNION of N SELECTs takes about N+1 or N+2 units of time, where one indexed SELECT takes 1 unit of time. In contrast, multiple ORs do little to slow down the query, since fetching all the rows of the table is the costly part.
How fast each SELECT of a UNION runs depends on how good its index is and how few rows it fetches. This can vary significantly, which makes it hard to devise a RoT.
A UNION starts by generating a temp table into which each SELECT adds the rows it finds; this is some overhead. In newer versions (MySQL 5.7.3 / MariaDB 10.1), there are limited situations where the temp table can be avoided. (This eliminates the hypothetical +1 or +2, thereby adding more complexity to devising a RoT.)
If it is UNION DISTINCT (the default) instead of UNION ALL, there needs to be a dedup pass, probably involving a sort over the temp table. Note: this means that even the new versions cannot avoid the temp table. UNION DISTINCT precisely mimics the OR, yet you may know that ALL would give the same answer.
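For instance, if you can add a guard that makes the two arms mutually exclusive, UNION ALL is safe and skips the dedup pass entirely. A sketch on the example table above (note the IS NULL guard, so rows with a NULL b are not lost):
(SELECT a, b FROM tbl WHERE b = 2)
UNION ALL
(SELECT a, b FROM tbl WHERE a = 1 AND (b <> 2 OR b IS NULL))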
Related
I was playing around with SQLite and I ran into an odd performance issue with CROSS JOINs on very small data sets. Any cross join I do in SQLite takes about 3x as long as, or longer than, the same cross join in MySQL. For example, here is the query for 3,000 rows in MySQL:
select COUNT(*) from (
select * from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
Does SQLite use a different algorithm than other client-server databases for cross joins or other types of joins? I have had a lot of luck using SQLite on a single table/database, but whenever I join tables, it seems to become a bit more problematic.
Does SQLite use a different algorithm than other client-server databases for cross joins or other types of joins?
Yes. The algorithm used by SQLite is very simple. In SQLite, joins are executed as nested loop joins. The database goes through one table, and for each row, searches matching rows from the other table.
SQLite is unable to figure out how to use an index to speed up the join, and without indices a k-way join takes time proportional to N^k. MySQL, for example, creates some "ghostly" indexes which help the iteration process to be faster.
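When a join does have a join condition, you can usually get the speed back in SQLite by creating the index yourself rather than relying on the planner. A sketch with hypothetical tables t1 and t2 joined on a column join_col:
-- turns the inner loop into an index lookup, roughly O(N log M) instead of O(N*M)
CREATE INDEX idx_t2_join_col ON t2(join_col);
SELECT * FROM t1 JOIN t2 ON t1.join_col = t2.join_col;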
It has been commented already by Shawn that this question would need much more details in order to get a really accurate answer.
However, as a general answer, please be aware that this note in the SQLite documentation states that the algorithm used to perform CROSS JOINs may be suboptimal (by design!), and that their usage is generally discouraged:
Side note: Special handling of CROSS JOIN. There is no difference between the "INNER JOIN", "JOIN" and "," join operators. They are completely interchangeable in SQLite. The "CROSS JOIN" join operator produces the same result as the "INNER JOIN", "JOIN" and "," operators, but is handled differently by the query optimizer in that it prevents the query optimizer from reordering the tables in the join. An application programmer can use the CROSS JOIN operator to directly influence the algorithm that is chosen to implement the SELECT statement. Avoid using CROSS JOIN except in specific situations where manual control of the query optimizer is desired. Avoid using CROSS JOIN early in the development of an application as doing so is a premature optimization. The special handling of CROSS JOIN is an SQLite-specific feature and is not a part of standard SQL.
This clearly indicates that the SQLite query planner handles CROSS JOINs differently than other RDBMS.
Note: nevertheless, I am unsure that this really applies to your use case, where both derived tables being joined have the same number of records.
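One cheap experiment is to swap the operator and compare timings, since (per the quote above) JOIN and CROSS JOIN return the same rows and differ only in whether the planner may reorder the tables:
select COUNT(*) from (
select * from main_s limit 3000
) x join (
select * from main_s limit 3000
) x2 group by x.territory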
Why MySQL might be faster: It uses the optimization that it calls "Using join buffer (Block Nested Loop)".
But... There are many things that are "wrong" with the query. I would hate for you to draw a conclusion on comparing DB engines based on your findings.
It could be that one DB will create an index to help with the join, even if none was already there.
SELECT * probably hauls around all the columns, unless the Optimizer is smart enough to toss all the columns except for territory.
A LIMIT without an ORDER BY gives you an arbitrary set of rows. You might think that the resultset is necessarily 3000 rows with the value 3000 in each, but it is perfectly valid to come up with other results. (Depending on what you ORDER BY, it still may not be deterministic.)
Having a COUNT(*) without a column saying what it is counting (territory) seems unrealistic.
You have the same subquery twice. Some engines may be smart enough to evaluate it only once; or you could reformulate it with WITH to (possibly) give the Optimizer a big hint to do so. (I think the example below shows how it would be reformulated in MySQL 8.0 or MariaDB 10.2; I don't know about SQLite.)
If you are pitting one DB against the other, please use multiple queries that relate to your application.
This is not necessarily a "small" dataset, since the intermediate table (unless optimized away) has 9,000,000 rows.
I doubt if I have written more than one cross join in a hundred queries, maybe a thousand. Its performance is hardly worth worrying about.
WITH w AS ( SELECT territory FROM main_s LIMIT 3000 )
SELECT COUNT(*)
FROM w AS x1
JOIN w AS x2
GROUP BY x1.territory;
As noted above, using CROSS JOIN in SQLite prevents the optimiser from reordering tables, so you can influence the order in which the nested loops that perform the join are executed.
However, that's a red herring here, as you are limiting the rows in both sub-selects to 3000 and it's the same table, so there is no optimisation to be had there anyway.
Let's see what your query actually does:
select COUNT(*) from (
select * from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
You say: produce an intermediate result set of 9 million rows (3000 x 3000), group them on x.territory, and return the count for each group.
So let's say the row size of your table is 100 bytes.
You say, for each of 3000 rows of 100 bytes, give me 3000 rows of 100 bytes.
Hence you get 9 million rows of 200 bytes each, an intermediate result set of 1.8GB.
So here are some optimisations you could make.
select COUNT(*) from (
select territory from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
You don't use anything other than territory from x, so select just that. Let's assume it is 8 bytes, so now you create an intermediate result set of:
9M x 108 = 972MB
So we nearly halve the amount of data. Let's try the same for x2.
But wait, you are not using any data fields from x2. You are just using it to multiply the result set by 3000. If we do this directly we get:
select COUNT(*) * 3000 from (
select territory from main_s limit 3000
) group by territory
The intermediate result set is now:
3000 x 8 = 24KB which is now 0.001% of the original.
Further, now that SELECT * is not being used, it's possible the optimiser will be able to use an index on main_s that includes territory as a covering index (meaning it doesn't need to lookup the row to get territory).
This is done when there is a WHERE clause: the planner will try to choose a covering index that also satisfies the query without row lookups. But the documentation is not explicit about whether this also happens when there is no WHERE clause.
If you determined a covering index was not being used (assuming one exists), then, counterintuitively (because sorting takes time), you could use ORDER BY territory in the sub-select to cause the covering index to be used.
select COUNT(*) * 3000 from (
select territory from main_s order by territory limit 3000
) group by territory
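To verify which plan you are actually getting, EXPLAIN QUERY PLAN is the quickest check (a sketch; the index name is hypothetical):
CREATE INDEX idx_main_s_territory ON main_s(territory);
EXPLAIN QUERY PLAN select territory from main_s order by territory limit 3000;
-- look for "USING COVERING INDEX idx_main_s_territory" in the plan output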
Check the optimiser documentation here:
https://www.sqlite.org/draft/optoverview.html
To summarise:
The optimiser uses the structure of your query to look for hints and clues about how the query may be optimised to run quicker.
These clues take the form of keywords such as WHERE clauses, ORDER BY, JOIN (ON), etc.
Your query as written provides none of these clues.
If I understand your question correctly, you are interested in why other SQL systems are able to optimise your query as written.
The most likely reasons seem to be:
Ability to eliminate unused columns from sub selects (likely)
Ability to use covering indexes without WHERE or ORDER BY (likely)
Ability to eliminate unused sub selects (unlikely)
But this is a theory that would need testing.
SQLite uses CROSS JOIN as a flag that tells the query planner not to reorder the tables in a join. The docs are quite clear:
Programmers can force SQLite to use a particular loop nesting order for a join by using the CROSS JOIN operator instead of just JOIN, INNER JOIN, NATURAL JOIN, or a "," join. Though CROSS JOINs are commutative in theory, SQLite chooses to never reorder the tables in a CROSS JOIN. Hence, the left table of a CROSS JOIN will always be in an outer loop relative to the right table.
https://www.sqlite.org/optoverview.html#crossjoin
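So the flag is useful precisely when you know more than the planner, e.g. to force the smaller table to drive the outer loop. A sketch with hypothetical tables small and big:
SELECT *
FROM small CROSS JOIN big
WHERE small.id = big.small_id; -- small is guaranteed to be the outer loop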
I have two tables with ~1M rows, indexed by their ids.
The following query...
SELECT t.* FROM transactions t
INNER JOIN integration it ON it.id_trans = t.id_trans
WHERE t.id_trans = '5440073'
OR it.id_integration = '439580587'
This query takes about 30s. But ...
SELECT ... WHERE t.id_trans = '5440073'
takes less than 100ms and
SELECT ... WHERE it.id_integration = '439580587'
also takes less than 100ms. Even
SELECT ... WHERE t.id_trans = '5440073'
UNION
SELECT ... WHERE it.id_integration = '439580587'
takes less than 100ms.
Why does the OR clause take so much time when each of its parts is so fast?
Why is OR so slow, but UNION is so fast?
Do you understand why UNION is fast? Because it can use two separate indexes to good advantage, and gather some result rows from each part of the UNION, then combine the results together.
But why can't OR do that? Simply put, the Optimizer is not smart enough to try that angle.
In your case, the tests are on different tables; this leads to radically different query plans (see EXPLAIN SELECT ...) for the two parts of the UNION. Each can be well optimized, so each is fast.
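You can see those two plans side by side by running EXPLAIN on each arm (a sketch; the actual output depends on your version, indexes, and data):
EXPLAIN SELECT t.* FROM transactions t WHERE t.id_trans = '5440073';
EXPLAIN SELECT t.* FROM transactions t
JOIN integration it ON it.id_trans = t.id_trans
WHERE it.id_integration = '439580587';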
Assuming each part delivers only a few rows, the subsequent overhead of UNION is minor -- namely to gather the two small sets of rows, dedup them (if you use UNION DISTINCT instead of UNION ALL), and deliver the results.
Meanwhile, the OR query effectively gathers all combinations of the two tables, then filters them based on the two parts of the OR. The intermediate stage may involve a huge temp table, only for most of its rows to be tossed.
(Another example of inflate-deflate is JOINs + GROUP BY. The workarounds are different.)
I would suggest writing the query using UNION ALL:
SELECT t.*
FROM transactions t
WHERE t.id_trans = '5440073'
UNION ALL
SELECT t.*
FROM transactions t
JOIN integration it ON it.id_trans = t.id_trans
WHERE t.id_trans <> '5440073' -- excludes rows the first arm already returned, so UNION ALL cannot produce duplicates
  AND it.id_integration = '439580587';
Note: If the ids are really numbers (and not strings), then drop the single quotes. Mixing types can sometimes confuse the optimizer and prevent the use of indexes.
There are quite a few "why is my GROUP BY so slow" questions on SO, most of them seem to be resolved with indexes.
My situation is different. Indeed I GROUP BY on non-indexed data but this is on purpose and it's not something I can change.
However, when I compare the performance of GROUP BY with the performance of a similar query without a GROUP BY (that also doesn't use indexes) - the GROUP BY query is slower by an order of magnitude.
Slow query:
SELECT someFunc(col), COUNT(*) FROM tbl WHERE col2 = 42 GROUP BY someFunc(col)
The results are something like this:
someFunc(col)   COUNT(*)
=========================
a                 100000
b                  80000
c                     20
d                     10
Fast(er) query:
SELECT 'a', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'a'
UNION
SELECT 'b', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'b'
UNION
SELECT 'c', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'c'
UNION
SELECT 'd', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'd'
This query yields the same results and is about ten times faster despite actually running multiple separate queries.
I realize that they are not the same from MySQL's point of view, because MySQL doesn't know in advance that someFunc(col) can only yield four different values, but it still seems like too big a difference.
I'm thinking that this has to do with some work GROUP BY is doing behind the scenes (creating temporary tables and stuff like that).
Are there configuration parameters that I could tweak to make the GROUP BY faster?
Is there a way to hint MySQL to do things differently within the query itself? (e.g. refrain from creating a temporary table).
EDIT:
In fact what I referred to as someFunc(col) above is actually a JSON_EXTRACT(). I just tried to copy the specific data being extracted into its own (unindexed) column and it makes GROUP BY extremely fast, and indeed faster than the alternative UNIONed queries.
The question remains: why? JSON_EXTRACT() might be slow but it should be just as slow with the four queries (in fact slower because more rows are scanned). Also, I've read that MySQL JSON is designed for fast reads.
The difference I'm seeing is between more than 200 seconds (GROUP BY with JSON_EXTRACT()) and 1-2 seconds (GROUP BY on a CONCAT() on an actual unindexed column).
First, for this query:
SELECT someFunc(col), COUNT(*)
FROM tbl
WHERE col2 = 42
GROUP BY someFunc(col);
You should have an index on tbl(col2, col). This is a covering index for the query so it should improve performance.
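For concreteness, that suggestion is (a sketch; note that if col is really a JSON column, as the EDIT suggests, MySQL cannot index it directly and you would index a generated column instead):
CREATE INDEX idx_tbl_col2_col ON tbl(col2, col); -- covers both the WHERE filter and the column fed to someFunc()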
Small note: The second version should use UNION ALL rather than UNION. The performance difference for eliminating duplicates is small on 4 rows, but UNION is a bad habit in these cases.
I'm not sure what would cause 10x performance slow down. I can readily think of two things that would make the second version faster.
First, this query is calling someFunc() twice for each row being processed. If this is an expensive operation, then that would account for half the increase in query load. This could be much larger if the first version is calling someFunc() on all rows, rather than just on matching rows.
To see if this is an issue, you can try:
SELECT someFunc(col) as someFunc_col, COUNT(*)
FROM tbl
WHERE col2 = 42
GROUP BY someFunc_col;
Second, doing 4 smaller GROUP BYs is going to be a bit faster than doing 1 bigger one. This is because GROUP BY uses a sort, and sorting is O(n log(n)). So sorting 100,000 rows and then 80,000 rows should be faster than sorting 180,000 at once. In your case the two big groups each hold roughly half the data. This might account for up to a 50% difference (although I would be surprised if it were that large).
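A rough back-of-envelope check on that last estimate (assuming a comparison-based sort):
180,000 x log2(180,000) ≈ 180,000 x 17.5 ≈ 3.1M comparisons
(100,000 x 16.6) + (80,000 x 16.3) ≈ 3.0M comparisons
so splitting the sort saves only a few percent, which supports the suspicion that this effect alone cannot explain a 10x gap.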
I have two queries, each with its own EXPLAIN results:
One:
SELECT *
FROM notifications
WHERE id = 5204 OR seen = 3
Benchmark (for 10,000 rows): 0.861
Two:
SELECT h.* FROM ((SELECT n.* from notifications n WHERE id = 5204)
UNION ALL
(SELECT n.* from notifications n WHERE seen = 3)) h
Benchmark (for 10,000 rows): 2.064
The results of the two queries above are identical. Also, I have these two indexes on the notifications table:
notifications(id) -- this is PK
notifications(seen)
As you know, OR usually prevents effective use of indexes; that's why I wrote the second query (using UNION). But after some tests I found that using OR is still much faster than using UNION. So I'm confused, and I really cannot choose the best option for my case.
Based on some logical and reasonable explanations, using UNION should be better, but the benchmark results say using OR is better. Could you please help me decide which approach I should use?
The query plan for the OR case appears to indicate that MySQL is indeed using indexes, so evidently yes, it can do, at least in this case. That seems entirely reasonable, because there is an index on seen, and id is the PK.
Based on some logical and reasonable explanations, using UNION should be better, but the benchmark results say using OR is better.
If "logical and reasonable explanations" are contradicted by reality, then it is safe to assume that the logic is flawed or the explanations are wrong or inapplicable. Performance is notoriously difficult to predict; performance testing is essential where speed is important.
Could you please help me decide which approach I should use?
You should use the one that tests faster on input that adequately models that which the program will see in real use.
Note also, however, that your two queries are not semantically equivalent: if the row with id = 5204 also has seen = 3 then the OR query will return it once, but the UNION ALL query will return it twice. It is pointless to choose between correct code and incorrect code on any basis other than which one is correct.
index_merge, as the name suggests, combines the primary keys from two indexes using Sort Merge Join or Sort Merge Union (for AND and OR conditions, respectively), and then looks up the rest of the values in the table by PK.
For this to work, the conditions on both indexes must be such that each index yields primary keys in order (yours do).
You can find the strict definition of the conditions in the docs, but in a nutshell, you should filter by all parts of the index with an equality condition, plus possibly <, =, or > on the PK.
If you have an index on (col1, col2, col3), this should be col1 = :val1 AND col2 = :val2 AND col3 = :val3 [ AND id > :id ] (the part in the square brackets is not necessary).
The following conditions won't work:
col1 = :val1 -- you omit col2 and col3
col1 = :val1 AND col2 = :val2 AND col3 > :val3 -- you can only use equality on key parts
As a free side effect, your output is sorted by id.
You could achieve similar results using this:
SELECT *
FROM (
SELECT 5204 id
UNION ALL
SELECT id
FROM mytable
WHERE seen = 3
AND id <> 5204
) q
JOIN mytable m
ON m.id = q.id
, except that in earlier versions of MySQL the derived table would have to be materialized, which would definitely make the query performance worse, and your results would no longer be ordered by id.
In short, if your query allows index_merge(union), go for it.
The answer is contained in your question. The EXPLAIN output for OR says Using union(PRIMARY, seen) - that means that the index_merge optimization is being used and the query is actually executed by unioning results from the two indexes.
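For reference, that check looks like this on any OR query you are considering rewriting (output abridged and version-dependent):
EXPLAIN SELECT * FROM notifications WHERE id = 5204 OR seen = 3;
-- type: index_merge, key: PRIMARY,seen, Extra: Using union(PRIMARY,seen); Using where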
So MySQL can use index in some cases and it does in this one. But the index_merge is not always available or is not used because the statistics of the indexes say it won't be worth it. In those cases OR may be a lot slower than UNION (or not, you need to always check both versions if you are not sure).
In your test you "got lucky" and MySQL did the right optimization for you automatically. It is not always so.
I am optimizing a query which involves a UNION ALL of two queries.
Both queries have been thoroughly optimized and run in less than one second separately.
However, when I perform the union of both, it takes around 30 seconds to calculate everything.
I won't bother you with the specific queries, since they are as optimized as they get, so let's call them Optimized_Query_1 and Optimized_Query_2.
Number of rows from Optimized_Query_1 is roughly 100K
Number of rows from Optimized_Query_2 is roughly 200K
SELECT * FROM (
Optimized_Query_1
UNION ALL
Optimized_Query_2
) U
ORDER BY START_TIME ASC
I do require the results to be in order, but I find that the query takes as much time with or without the ORDER BY at the end, so that shouldn't make a difference.
Apparently the UNION ALL creates a temporary table in memory, from which the final results are then returned. Is there any way to work around this?
Thanks
You can't optimize UNION ALL itself: it simply stacks the two result sets on top of each other. Compared to UNION, where an extra step is required to remove duplicates, UNION ALL is a straight stacking of the two result sets. The ORDER BY is likely what is taking the additional time.
You can try creating a VIEW out of this query.
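A minimal sketch of that suggestion, reusing the placeholders from the question (the view name is hypothetical):
CREATE VIEW combined_v AS
Optimized_Query_1
UNION ALL
Optimized_Query_2;

SELECT * FROM combined_v ORDER BY START_TIME ASC;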