I am optimizing a query which involves a UNION ALL of two queries.
Both queries have been thoroughly optimized, and each runs in under one second on its own.
However, when I perform the union of both, it takes around 30 seconds to calculate everything.
I won't bother you with the specific queries, since they are as optimized as they can get, so let's call them Optimized_Query_1 and Optimized_Query_2.
Number of rows from Optimized_Query_1 is roughly 100K
Number of rows from Optimized_Query_2 is roughly 200K
SELECT * FROM (
Optimized_Query_1
UNION ALL
Optimized_Query_2
) U
ORDER BY START_TIME ASC
I do need the results to be in order, but the query takes just as long with or without the ORDER BY at the end, so that should make no difference.
Apparently the UNION ALL creates a temporary table in memory, from which the final results are then returned. Is there any way to work around this?
Thanks
You can't optimize UNION ALL itself. It simply stacks the two result sets on top of each other; unlike UNION, it needs no extra step to remove duplicates. The ORDER BY is likely taking additional time.
You can try creating a VIEW out of this query.
I have two tables with ~1M rows, indexed by their IDs.
The following query...
SELECT t.* FROM transactions t
INNER JOIN integration it ON it.id_trans = t.id_trans
WHERE t.id_trans = '5440073'
OR it.id_integration = '439580587'
This query takes about 30s. But ...
SELECT ... WHERE t.id_trans = '5440073'
takes less than 100ms and
SELECT ... WHERE it.id_integration = '439580587'
also takes less than 100ms. Even
SELECT ... WHERE t.id_trans = '5440073'
UNION
SELECT ... WHERE it.id_integration = '439580587'
takes less than 100ms.
Why does the OR clause take so much time when each part on its own is so fast?
Why is OR so slow, but UNION is so fast?
Do you understand why UNION is fast? Because it can use two separate indexes to good advantage, and gather some result rows from each part of the UNION, then combine the results together.
But why can't OR do that? Simply put, the Optimizer is not smart enough to try that angle.
In your case, the tests are on different tables; this leads to radically different query plans (see EXPLAIN SELECT ...) for the two parts of the UNION. Each can be well optimized, so each is fast.
Assuming each part delivers only a few rows, the subsequent overhead of UNION is minor -- namely to gather the two small sets of rows, dedup them (if you use UNION DISTINCT instead of UNION ALL), and deliver the results.
Meanwhile, the OR query effectively gathers all combinations of rows from the two tables, then filters them based on the two parts of the OR. The intermediate stage may involve a huge temp table, only to have most of the rows tossed.
(Another example of inflate-deflate is JOINs + GROUP BY. The workarounds are different.)
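To see the difference described above, one can run EXPLAIN on both forms of the query from the question. This is only a sketch: the question elides the column list of the shorter queries (SELECT ...), so t.* is assumed here, and the actual plans depend on the indexes and the MySQL version in use.

```sql
-- OR form: look for a table with type = ALL (full scan) or a very large "rows" estimate.
EXPLAIN
SELECT t.*
FROM transactions t
INNER JOIN integration it ON it.id_trans = t.id_trans
WHERE t.id_trans = '5440073'
   OR it.id_integration = '439580587';

-- UNION form: each branch is planned separately and can use its own index.
EXPLAIN
SELECT t.*
FROM transactions t
INNER JOIN integration it ON it.id_trans = t.id_trans
WHERE t.id_trans = '5440073'
UNION
SELECT t.*
FROM transactions t
INNER JOIN integration it ON it.id_trans = t.id_trans
WHERE it.id_integration = '439580587';
```

If the OR form shows a full scan on one of the tables while both UNION branches show index lookups, that matches the explanation given here.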
I would suggest writing the query using UNION ALL:
SELECT t.*
FROM transactions t
WHERE t.id_trans = '5440073'
UNION ALL
SELECT t.*
FROM transactions t JOIN
integration it
ON it.id_trans = t.id_trans
WHERE t.id_trans <> '5440073' AND
it.id_integration = '439580587';
Note: If the ids are really numbers (and not strings), then drop the single quotes. Mixing types can sometimes confuse the optimizer and prevent the use of indexes.
In most cases, when I removed an OR condition and replaced it with a UNION (with each of the conditions in its own SELECT), the query performed significantly better, because those parts of the query could use indexes again.
Is there a rule of thumb (and maybe some documentation to support it) on when this 'trick' stops being useful? Will it be useful for 2 OR conditions? For 10 OR conditions? The number of UNIONs increases with each condition, and the UNION DISTINCT step may have its own overhead.
What would be your rule of thumb on this?
Small example of the transformation:
SELECT
a, b
FROM
tbl
WHERE
a = 1 OR b = 2
Transformed to:
(SELECT
tbl.a, tbl.b
FROM
tbl
WHERE
tbl.b = 2)
UNION DISTINCT
(SELECT
tbl.a, tbl.b
FROM
tbl
WHERE
tbl.a = 1)
I suggest there is no useful Rule of Thumb (RoT). Here is why...
As you imply, more UNIONs mean slower work, while more ORs do not (at least not much). The SELECTs of a UNION are costly because they are executed separately. I would estimate that a UNION of N SELECTs takes about N+1 or N+2 units of time, where one indexed SELECT takes 1 unit of time. In contrast, multiple ORs do little to slow down the query, since fetching all rows of the table is the costly part.
How fast each SELECT of a UNION runs depends on how good the index is and how few rows are fetched. This can vary significantly. (Hence, it makes it hard to devise a RoT.)
A UNION starts by generating a temp table into which each SELECT adds the rows it finds. This is some overhead. In newer versions (5.7.3 / MariaDB 10.1), there are limited situations where the temp table can be avoided. (This eliminates the hypothetical +1 or +2, thereby adding more complexity into devising a RoT.)
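Whether a given UNION goes through a temp table can be checked with EXPLAIN. The sketch below reuses the tbl example from the question. As far as I know, when the temp table is used MySQL's EXPLAIN output includes an extra row with select_type UNION RESULT (usually with "Using temporary" in the Extra column), and that row is absent when the temp table is avoided. Treat that as something to verify on your own server version rather than a guarantee.

```sql
-- UNION DISTINCT: expect a UNION RESULT row, since dedup needs the temp table.
EXPLAIN
(SELECT tbl.a, tbl.b FROM tbl WHERE tbl.b = 2)
UNION DISTINCT
(SELECT tbl.a, tbl.b FROM tbl WHERE tbl.a = 1);

-- UNION ALL with no outer ORDER BY: on newer versions the UNION RESULT row
-- may be missing, which indicates the temp table was skipped.
EXPLAIN
(SELECT tbl.a, tbl.b FROM tbl WHERE tbl.b = 2)
UNION ALL
(SELECT tbl.a, tbl.b FROM tbl WHERE tbl.a = 1);
```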
If it is UNION DISTINCT (the default) instead of UNION ALL, there needs to be a dedup pass, probably involving a sort over the temp table. Note: This means that even the new versions cannot avoid the temp table. UNION DISTINCT precisely mimics the OR, yet you may know that ALL would give the same answer.
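When it is not already known that the two branches are disjoint, UNION ALL can still be made to return each matching row once by excluding the first branch's rows from the second. A minimal sketch using the tbl example from the question; the null-safe comparison is my addition, on the assumption that column a may contain NULLs (a plain a <> 1 would silently drop rows where a IS NULL).

```sql
(SELECT tbl.a, tbl.b
 FROM tbl
 WHERE tbl.a = 1)
UNION ALL
(SELECT tbl.a, tbl.b
 FROM tbl
 WHERE tbl.b = 2
   AND NOT (tbl.a <=> 1));  -- <=> is MySQL's NULL-safe equality operator
```

Each row satisfying a = 1 OR b = 2 lands in exactly one branch, so no dedup pass is needed.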
Hi all,
I have 2 similar, very large tables (1M rows each) with the same layout. I would like to union them and sort by a common column, start. I would also like to put a condition on start, i.e. start>X.
The problem is that the view does not use the index on start, so the cost rises sharply: a simple query takes about 15 seconds, and adding a LIMIT does not fix it because the results are cut off first.
CREATE VIEW CDR AS
(SELECT start, duration, clid FROM cdr_md ORDER BY start LIMIT 1000)
UNION ALL
(SELECT start, duration, clid FROM cdr_1025 ORDER BY start LIMIT 1000)
ORDER BY start ;
A query such as:
SELECT * FROM CDR WHERE start>10
does not return the expected results, because the LIMIT cuts off the rows before the condition is applied.
The expected results are what a definition like this would give:
CREATE VIEW CDR AS
(SELECT start, duration, clid FROM cdr_md WHERE start>X ORDER BY start LIMIT 1000)
UNION ALL
(SELECT start, duration, clid FROM cdr_1025 WHERE start>X ORDER BY start LIMIT 1000)
ORDER BY start ;
Is there a way to avoid this problem?
Thanks all
Fabrizio
i have 2 similar table ... with the same layout
This is contrary to the Principle of Orthogonal Design.
Don't do it. At least not without very good reason—with suitable indexes, 1 million records per table is easily enough for MySQL to handle without any need for partitioning; and even if one did need to partition the data, there are better ways than this manual kludge (which can give rise to ambiguous, potentially inconsistent data and lead to redundancy and complexity in your data manipulation code).
Instead, consider combining your tables into a single one with suitable columns to distinguish the records' differences. For example:
CREATE TABLE cdr_combined AS
SELECT *, 'md' AS orig FROM cdr_md
UNION ALL
SELECT *, '1025' AS orig FROM cdr_1025
;
DROP TABLE cdr_md, cdr_1025;
If you will always be viewing your data along the previously "partitioned" axis, include the distinguishing columns as index prefixes and performance will generally improve versus having separate tables.
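As a sketch of that index-prefix suggestion: CREATE TABLE ... AS SELECT does not copy indexes from the source tables, so the combined table needs its own. The index name idx_orig_start and the literal values below are illustrative assumptions, not taken from the question.

```sql
-- Distinguishing column first, then the column used for filtering and sorting.
ALTER TABLE cdr_combined
  ADD INDEX idx_orig_start (orig, start);

-- A query along the former "partition" axis can use the index for both
-- the range condition and the ordering.
SELECT start, duration, clid
FROM cdr_combined
WHERE orig = 'md'
  AND start > 10
ORDER BY start
LIMIT 1000;
```

For queries that span both origins and only filter on start, a separate index on start alone would probably serve better than this composite one.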
You then won't need to perform any UNION and your VIEW definition effectively becomes:
CREATE VIEW CDR AS
SELECT start, duration, clid FROM cdr_combined ORDER BY start;
However, be aware that queries on views may not always perform as well as using the underlying tables directly. As documented under Restrictions on Views:
View processing is not optimized:
It is not possible to create an index on a view.
Indexes can be used for views processed using the merge algorithm. However, a view that is processed with the temptable algorithm is unable to take advantage of indexes on its underlying tables (although indexes can be used during generation of the temporary tables).
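To make the quoted restriction concrete, here is a hedged sketch of requesting the merge algorithm for the single-table view. It assumes the cdr_combined table from above; a view containing UNION or LIMIT (like the original CDR definition) cannot be merged, which is one reason the combined table helps. If MySQL cannot merge a view, it falls back to its own choice of algorithm and issues a warning.

```sql
CREATE OR REPLACE ALGORITHM = MERGE VIEW CDR AS
SELECT start, duration, clid
FROM cdr_combined;

-- With a merged view, the condition and LIMIT are applied to the underlying
-- table, so an index on start can be used.
SELECT start, duration, clid
FROM CDR
WHERE start > 10
ORDER BY start
LIMIT 1000;
```

The ORDER BY is placed in the outer query here rather than in the view definition, so each caller controls the ordering.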
[site_list] ~100,000 rows... 10mb in size.
site_id
site_url
site_data_most_recent_record_id
[site_list_data] ~ 15+ million rows and growing... about 600mb in size.
record_id
site_id
site_connect_time
site_speed
date_checked
columns in bold are unique index keys.
I need to return 50 most recently updated sites AND the recent data that goes with it - connect time, speed, date...
This is my query:
SELECT SQL_CALC_FOUND_ROWS
site_list.site_url,
site_list_data.site_connect_time,
site_list_data.site_speed,
site_list_data.date_checked
FROM site_list
LEFT JOIN site_list_data
ON site_list.site_data_most_recent_record_id = site_list_data.record_id
ORDER BY site_list_data.date_checked DESC
LIMIT 50
Without the ORDER BY and SQL_CALC_FOUND_ROWS (which I need for pagination), the query takes about 1.5 seconds; with them it takes 2 seconds or more. That is not good enough: the page that shows this data gets 20K+ pageviews/day, and the query is apparently too heavy (the server almost dies when I put it live) and too slow.
MySQL experts, how would you do this? What if the table grew to 100 million records? Caching this huge result into a temp table every 30 seconds is the only other solution I have come up with.
You need to add a heuristic to the query: gate it so that it only looks at a bounded range, and you will get reasonable performance. As written, it is effectively sorting your site_list_data table by date descending -- the ENTIRE table.
So, if you know that the top 50 will be within the last day or week, add a "and date_checked > <boundary_date>" to the query. Then it should reduce the overall result set first, and THEN sort it.
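A minimal sketch of such a gated query, assuming date_checked is a DATETIME or TIMESTAMP column and that the 50 most recent sites always fall within the last day (the one-day window is an assumption to adjust for your data):

```sql
SELECT
    site_list.site_url,
    site_list_data.site_connect_time,
    site_list_data.site_speed,
    site_list_data.date_checked
FROM site_list
INNER JOIN site_list_data
    ON site_list.site_data_most_recent_record_id = site_list_data.record_id
WHERE site_list_data.date_checked > NOW() - INTERVAL 1 DAY  -- boundary date
ORDER BY site_list_data.date_checked DESC
LIMIT 50;
```

An INNER JOIN is used because a WHERE condition on site_list_data discards the unmatched rows of a LEFT JOIN anyway. SQL_CALC_FOUND_ROWS is left out, since with the gate it would only count rows inside the window.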
SQL_CALC_FOUND_ROWS is slow; use COUNT instead.
A couple of observations.
Both ORDER BY and SQL_CALC_FOUND_ROWS are going to add to the cost of your query. ORDER BY clauses can potentially be improved with appropriate indexing -- do you have an index on your date_checked column? This could help.
What is your exact need for SQL_CALC_FOUND_ROWS? Consider replacing this with a separate query that uses COUNT instead. This can be vastly better assuming your Query Cache is enabled.
And if you can use COUNT, consider replacing your LEFT JOIN with an INNER JOIN as this will help performance as well.
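Put together, these observations might look like the following sketch. The index name idx_date_checked is hypothetical; check SHOW INDEX FROM site_list_data first in case date_checked is already indexed.

```sql
-- Index to support ORDER BY date_checked.
ALTER TABLE site_list_data
  ADD INDEX idx_date_checked (date_checked);

-- Page query without SQL_CALC_FOUND_ROWS.
SELECT
    site_list.site_url,
    site_list_data.site_connect_time,
    site_list_data.site_speed,
    site_list_data.date_checked
FROM site_list
INNER JOIN site_list_data
    ON site_list.site_data_most_recent_record_id = site_list_data.record_id
ORDER BY site_list_data.date_checked DESC
LIMIT 50;

-- Separate total for pagination.
SELECT COUNT(*) AS total_sites
FROM site_list
INNER JOIN site_list_data
    ON site_list.site_data_most_recent_record_id = site_list_data.record_id;
```

Note that switching to INNER JOIN drops sites that have no data record yet, so it is only equivalent to the original query if every site has one.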
Good luck.
My problem is this:
select * from
(
select * from barcodesA
UNION ALL
select * from barcodesB
)
as barcodesTotal, boxes
where barcodesTotal.code=boxes.code;
Table barcodesA has 4000 entries
Table barcodesB has 4000 entries
Table boxes has about 180,000 entries
It takes 30 seconds to process the query.
Another problematic query:
select * from
viewBarcodesTotal, boxes
where viewBarcodesTotal.code=boxes.code;
viewBarcodesTotal contains the UNION ALL from both barcodes tables. It also takes forever.
Meanwhile,
select * from barcodesA , boxes where barcodesA.code=boxes.code
UNION ALL
select * from barcodesB , boxes where barcodesB.code=boxes.code
This one takes <1second.
The question is obviously: why? Is my code bugged? Is MySQL bugged?
I have to migrate from Access to MySQL, and I would have to rewrite all my code if the first option is bugged.
Add an index on boxes.code if you don't already have one. Joining 8000 records (4K+4K) to the 180,000 will benefit from an index on the 180K side of the equation.
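For example (the index name idx_boxes_code is just an illustrative choice):

```sql
-- Check which indexes already exist on boxes.
SHOW INDEX FROM boxes;

-- Add one on the join column if it is missing.
CREATE INDEX idx_boxes_code ON boxes (code);
```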
Also, be explicit and specify the fields you need back in your SELECT statements. Using * in a production query is bad form, as it lets you avoid thinking about which fields you get back (and how big they might be). On top of that, the two tables you are UNIONing in your example, barcodesA and barcodesB, could have different data types and column orders.
The REASON for the performance difference...
The first query says: first, build a complete union of EVERY record in A with EVERY record in B, THEN join it to boxes on the code. The result of the union has no index for the join to be optimized against.
In your SECOND query, each table is joined to boxes individually, so each join IS optimized (judging by its speed there apparently IS an index, but I would make sure both barcodes tables have an index on the "code" column).
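A sketch of the fast form with explicit joins and the suggested indexes. Only the code column is known from the question, so the select list is deliberately minimal and the index names are hypothetical; add whichever columns you actually need, in the same order in both branches.

```sql
CREATE INDEX idx_barcodesA_code ON barcodesA (code);
CREATE INDEX idx_barcodesB_code ON barcodesB (code);

-- Each branch joins an indexed base table to boxes before the results are stacked.
SELECT a.code
FROM barcodesA AS a
INNER JOIN boxes AS bx ON bx.code = a.code
UNION ALL
SELECT b.code
FROM barcodesB AS b
INNER JOIN boxes AS bx ON bx.code = b.code;
```

Running EXPLAIN on this and on the derived-table version should show whether the join in each case uses an index or scans.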