SQL – Eliminating duplicates in UNION ALL using a WHERE clause - mysql

On toptal.com I found this question to which they provided the following solution:
SELECT * FROM mytable WHERE a=X UNION ALL SELECT * FROM mytable WHERE b=Y AND a!=X
Can someone explain the usage of X and Y here? Are these variables? Does it also work this way in MySQL? I can't seem to get this kind of query running on a test server.
Also, they make the following claim:
The key is the AND a!=X part. This gives you the benefits of the UNION (a.k.a., UNION DISTINCT) command, while avoiding much of its performance hit.
But if this were true, why would anyone ever use UNION DISTINCT? In particular, why wouldn't it be implemented using this supposedly more efficient way?

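X and Y are just placeholders in that pseudocode for the real filter values, which is why the query won't run as written; substitute literals or parameters. Both branches read the same table, so the only rows that can show up in both are those with a = X. You might know that, but the query optimiser might not, so excluding those rows in the second branch and using UNION ALL skips the de-duplication work that UNION DISTINCT has to do. That only applies in certain circumstances, depending on the data: it assumes the table holds no duplicate rows to begin with and that a is never NULL, which is why it can't replace UNION DISTINCT in general.
It's not a well-written question.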

You can use an OR in the WHERE clause like this:
SELECT *
FROM mytable
WHERE a=X OR (b=Y AND a!=X);
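To check the claims above on a small data set, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows and the values X = 1 and Y = 2 are made up for illustration; the table has a primary key, so it contains no duplicate rows, and a is never NULL.

```python
import sqlite3

# Hypothetical sample data; X = 1 and Y = 2 stand in for the placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO mytable (a, b) VALUES (?, ?)",
    [(1, 2), (1, 3), (5, 2), (5, 3)],
)
X, Y = 1, 2

# UNION (DISTINCT): the engine removes duplicates itself.
union_distinct = conn.execute(
    "SELECT * FROM mytable WHERE a = ? UNION SELECT * FROM mytable WHERE b = ?",
    (X, Y),
).fetchall()

# UNION ALL with the extra filter: duplicates never enter the second branch.
union_all_filtered = conn.execute(
    "SELECT * FROM mytable WHERE a = ? "
    "UNION ALL SELECT * FROM mytable WHERE b = ? AND a != ?",
    (X, Y, X),
).fetchall()

# Single SELECT with OR, as suggested in the second answer.
single_or = conn.execute(
    "SELECT * FROM mytable WHERE a = ? OR (b = ? AND a != ?)",
    (X, Y, X),
).fetchall()

# Plain UNION ALL without the filter returns the overlapping row twice.
plain_union_all = conn.execute(
    "SELECT * FROM mytable WHERE a = ? UNION ALL SELECT * FROM mytable WHERE b = ?",
    (X, Y),
).fetchall()

print(sorted(union_distinct) == sorted(union_all_filtered) == sorted(single_or))
print(len(plain_union_all))
```

On this data the first three queries return the same three rows, while the unfiltered UNION ALL returns four. Whether the filtered form is actually faster on MySQL depends on the data and indexes, so compare the EXPLAIN output there.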

Why does a subquery with GROUP BY do a full scan twice?

I am testing two types of queries.
The first type looks like below:
explain select * from ord_order;
explain select * from (select * from ord_order) as tbl;
These two execution plans show the same behavior (one full scan).
However, the second type looks like below:
explain select * from ord_order
group by bundle_or_order_number;
explain select * from
(select * from ord_order
group by bundle_or_order_number) as tbl;
The second query does the full scan twice!
Can someone explain it? Thanks.
First, it doesn't matter because your queries are malformed. Don't use select * with group by. It just doesn't make sense, and current versions of MySQL reject it by default (ONLY_FULL_GROUP_BY has been part of the default SQL mode since MySQL 5.7.5). The question is: which rows do the non-grouped columns come from? You should be using aggregation functions.
Why are the two queries different? In MySQL lingo, the difference is whether or not the derived table (a subquery in the from clause) is materialized. Whether or not subqueries are materialized depends on the nature of the subquery and what MySQL decides to do in the version you are using.
You can read about optimizing derived tables in the documentation.
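As an illustration of the "use aggregation functions" advice, here is a minimal sketch run through Python's sqlite3 module as a stand-in for MySQL. Only ord_order and bundle_or_order_number come from the question; the amount column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only bundle_or_order_number is taken from the question.
conn.execute(
    "CREATE TABLE ord_order ("
    "id INTEGER PRIMARY KEY, bundle_or_order_number TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO ord_order (bundle_or_order_number, amount) VALUES (?, ?)",
    [("B1", 10), ("B1", 15), ("B2", 7)],
)

# Every selected column is either the grouping key or an aggregate,
# so there is no ambiguity about which row a value comes from.
grouped_sql = """
    SELECT bundle_or_order_number,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_amount
    FROM ord_order
    GROUP BY bundle_or_order_number
"""

direct = sorted(conn.execute(grouped_sql).fetchall())
# The same query wrapped as a derived table returns the same rows.
derived = sorted(conn.execute(f"SELECT * FROM ({grouped_sql}) AS tbl").fetchall())

print(direct)
print(direct == derived)
```

The results are identical either way; whether MySQL materializes the derived table is a separate question that only EXPLAIN on your MySQL version can answer.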

EXISTS vs ALL, ANY, SOME

I'm trying to understand the difference between EXISTS and ALL in MySQL. Let me give you an example:
SELECT *
FROM table1
WHERE NOT EXISTS (
SELECT *
FROM table2
WHERE table2.val < table1.val
);
SELECT *
FROM table1
WHERE val <= ALL( SELECT val FROM table2 );
A quote from MySQL docs:
Traditionally, an EXISTS subquery starts with SELECT *, but it could
begin with SELECT 5 or SELECT column1 or anything at all. MySQL
ignores the SELECT list in such a subquery, so it makes no difference. [1]
Reading this, it seems to me that MySQL should be able to translate both queries to the same relational algebra expression. Both queries are just a simple comparison between values from two tables. However, that doesn't seem to be the case. I tried both queries and the second one performs much better than the first one.
How exactly are these queries handled by the optimizer?
Why can't the optimizer make the first query perform as well as the second one?
Is it always more efficient to use an ALL/ANY/SOME condition?
The queries in your question are not equivalent, so they will have different execution plans regardless of how well they're optimized. If you used NOT val > ANY(...) then it would be equivalent.
You should always use EXPLAIN to see the execution plan of a query and realize that the execution plan can change as your data changes. Testing and understanding the execution plan will help you determine which methods perform better. There is no hard and fast rule for ALL/ANY/SOME and they're often optimized down to an EXISTS.

Most effective way to use group function in another column

I have a query that looks something like this:
SELECT COUNT(DISTINCT A) as a_distinct,
COUNT(DISTINCT B) as b_distinct,
COUNT(DISTINCT A)/COUNT(DISTINCT B) as a_b_ratio
FROM
sometable_ab
As we can see, this looks very inefficient, as the aggregate functions are run twice even though they have already been calculated. The only solution I could think of is breaking it into two queries. Is that the only possible solution, or is there a better, more efficient one? I am using Redshift, which is mostly based on PostgreSQL, but even a MySQL solution would be acceptable, as I cannot think of a way to do this efficiently in any database.
If you are worried about the performance impact, just use a subquery:
SELECT a_distinct, b_distinct, a_distinct / b_distinct as a_b_ratio
FROM (SELECT COUNT(DISTINCT A) as a_distinct,
COUNT(DISTINCT B) as b_distinct
FROM sometable_ab
) ab
For most aggregation functions, this would be irrelevant, but count(distinct) can be a performance hog.
This is ANSI standard SQL and should work in any database you mention.
Using a subquery still counts as one query for any RDBMS. More importantly, count() never returns NULL, but 0 if no row is found (or no non-null value for the given expression in any row). This would lead you straight into a division by zero exception. Fix it with NULLIF (also standard SQL). You'll get NULL in this case.
SELECT *, a_distinct / NULLIF(b_distinct, 0) AS a_b_ratio
FROM (
SELECT count(DISTINCT a) AS a_distinct
, count(DISTINCT b) AS b_distinct
FROM sometable_ab
) sub;
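A minimal sketch of the NULLIF approach, run through Python's sqlite3 module as a stand-in for the databases mentioned; the sample rows are made up. The 1.0 * factor is my addition, not part of the answers above: in PostgreSQL-style engines (and SQLite) dividing two integer counts truncates the result, whereas MySQL's / operator already returns a decimal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable_ab (a INTEGER, b INTEGER)")

# Each COUNT(DISTINCT ...) is computed once in the subquery and reused outside.
ratio_sql = """
    SELECT a_distinct,
           b_distinct,
           1.0 * a_distinct / NULLIF(b_distinct, 0) AS a_b_ratio
    FROM (
        SELECT COUNT(DISTINCT a) AS a_distinct,
               COUNT(DISTINCT b) AS b_distinct
        FROM sometable_ab
    ) AS sub
"""

# Empty table: both counts are 0, NULLIF turns the divisor into NULL.
empty_result = conn.execute(ratio_sql).fetchone()

# Hypothetical rows: three distinct a values, two distinct non-NULL b values.
conn.executemany(
    "INSERT INTO sometable_ab VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 20), (3, None)],
)
filled_result = conn.execute(ratio_sql).fetchone()

print(empty_result)   # ratio is None (SQL NULL) instead of an error
print(filled_result)
```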

Performance of nested queries

I want to ask a question about database queries. Consider a query whose WHERE clause depends on the result of another query. For example:
select ? from ? where ? = (select ? from ?)
This is a simple example, so it is easy to write. But for more complex cases, I want to know which approach performs best. A join? Separate queries? Nested queries, or something else?
Thank you for answers.
Best Regards.
You should test it. These things depend a lot on the details of the query and of the indices it can use.
In my experience JOINs tend to be faster than nested queries in MySQL. In some cases MySQL isn't very smart and appears to run the subquery for every row produced by the outer query.
You can read more about these things in the official documentation:
Optimizing subqueries: http://dev.mysql.com/doc/refman/5.6/en/optimizing-subqueries.html
Rewriting subqueries as joins: http://dev.mysql.com/doc/refman/5.6/en/rewriting-subqueries.html
This is case dependent. If the inner query returns very few rows, a nested query is a good choice: the inner query is executed first and its result set is then used by the outer query.
Joins, meanwhile, can in the worst case produce a Cartesian product, which is a heavy operation.
As Mitch and Joni stated, it depends. But generally a join will offer the best performance. You're trying to avoid running the nested query for each row of the outer query. A good query optimizer may do this for you anyway, by interpreting what you're trying to do and essentially "fixing" your mistake. But with the vast majority of queries, you should be writing it as a join in the first place. That way you're being explicit about what you're trying to do and you're fully understanding yourself what is being done, and what the most efficient way to do the work is.
I EXPECT the joins to be quicker, mainly because you have an equivalence and an explicit JOIN. Still, use EXPLAIN to see the differences in how the SQL engine will interpret them.
I would not expect these to be so different. Where you can get real, large performance gains from using joins instead of subqueries is with correlated subqueries.
Since almost everyone is saying that joins will give the optimal performance, I just logged in to share the exact opposite experience.
Some days back I was writing a query over 3-4 tables that held a huge amount of data. I wrote a big SQL query with joins and it took around 2-3 hours to execute. Then I restructured it: I created a nested select query, put as many WHERE constraints as I could inside the nested one, and made it as strict as possible. The performance improved by more than 90%; it now takes less than 4 minutes to run.
This is just my experience, and maybe theoretically joins are better. I just wanted to share it. It's better to try out different things; getting additional knowledge about the tables, their indexes, etc. would help a lot.
Update:
And I just found out what I did is actually suggested in this optimization reference page of MySQL. http://dev.mysql.com/doc/refman/5.6/en/optimizing-subqueries.html
Pasting it here for quick reference:
Replace a join with a subquery. For example, try this:
SELECT DISTINCT column1 FROM t1 WHERE t1.column1 IN (SELECT column1 FROM t2);
Instead of this:
SELECT DISTINCT t1.column1 FROM t1, t2 WHERE t1.column1 = t2.column1;
Move clauses from outside to inside the subquery. For example, use this query:
SELECT * FROM t1 WHERE s1 IN (SELECT s1 FROM t1 UNION ALL SELECT s1 FROM t2);
Instead of this query:
SELECT * FROM t1 WHERE s1 IN (SELECT s1 FROM t1) OR s1 IN (SELECT s1 FROM t2);
For another example, use this query:
SELECT (SELECT column1 + 5 FROM t1) FROM t2;
Instead of this query:
SELECT (SELECT column1 FROM t1) + 5 FROM t2;
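To see that the first pair of queries from the quoted reference returns the same rows, here is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL; the sample values are made up. It says nothing about which form is faster on your data, which is what EXPLAIN on the real server is for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (column1 INTEGER)")
conn.execute("CREATE TABLE t2 (column1 INTEGER)")
# Hypothetical sample values, with repeats on both sides.
conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (2,), (3,)])
conn.executemany("INSERT INTO t2 VALUES (?)", [(2,), (2,), (3,), (4,)])

subquery_sql = (
    "SELECT DISTINCT column1 FROM t1 "
    "WHERE t1.column1 IN (SELECT column1 FROM t2)"
)
join_sql = (
    "SELECT DISTINCT t1.column1 FROM t1, t2 "
    "WHERE t1.column1 = t2.column1"
)

subquery_rows = sorted(conn.execute(subquery_sql).fetchall())
join_rows = sorted(conn.execute(join_sql).fetchall())

# Without DISTINCT the join repeats a t1 row once per matching t2 row,
# which is the extra work the subquery form can avoid.
join_without_distinct = conn.execute(
    "SELECT t1.column1 FROM t1, t2 WHERE t1.column1 = t2.column1"
).fetchall()

print(subquery_rows == join_rows)
print(len(join_without_distinct))

# SQLite's counterpart of EXPLAIN; the output format is engine-specific.
for sql in (subquery_sql, join_sql):
    print(conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall())
```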

Striping of results from MySQL UNION

I am trying to conceptualize how to set up a UNION of 3 tables that will allow for ordering in a striped fashion.
Top 5 from the UNION of Tables A,B,C
with results ordered like so:
A
B
C
A
B
C
....
Is this sort of thing possible with SQL and more specifically MySQL?
Personally, I would pull the three queries out separately, and then process through them in your favourite programming language. The queries should run faster like this, as they would not be so complex.
It should be possible using only SQL though, by adding a couple of columns to the output of each of the three queries, and then wrapping the whole lot in an outer select such as:
SELECT * FROM ( <PUT THE FULL UNION HERE> ) AS u ORDER BY row_count, table_name
Ordering by row_count first is what interleaves the tables; ordering by table_name first would list all of A, then all of B, then all of C.
You'd need to alter each of the queries like this:
SET @rowCount = 0;
SELECT 'A' AS table_name, (@rowCount := @rowCount + 1) AS row_count, <remaining fields>
FROM table_A;
Now, I'm not totally sure of the syntax for the incrementing row counter, but I've seen it done elsewhere (probably on StackOverflow somewhere), so maybe someone else can help with that part? (Or you may find the answer by searching this site...)
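One way to sidestep the counter syntax is the ROW_NUMBER() window function, which is a different technique from the user variable above. A minimal sketch follows, run through Python's sqlite3 module as a stand-in; it assumes MySQL 8.0 or later (or SQLite 3.25 or later) for window-function support, and the id and title columns and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: three rows each, titled A1..A3, B1..B3, C1..C3.
for name in ("table_A", "table_B", "table_C"):
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        f"INSERT INTO {name} (title) VALUES (?)",
        [(f"{name[-1]}{i}",) for i in range(1, 4)],
    )

# Number the rows within each table, then sort by that number first
# so the output alternates A, B, C, A, B, ...
striped_sql = """
    SELECT table_name, title
    FROM (
        SELECT 'A' AS table_name,
               ROW_NUMBER() OVER (ORDER BY id) AS row_count,
               title
        FROM table_A
        UNION ALL
        SELECT 'B', ROW_NUMBER() OVER (ORDER BY id), title FROM table_B
        UNION ALL
        SELECT 'C', ROW_NUMBER() OVER (ORDER BY id), title FROM table_C
    ) AS u
    ORDER BY row_count, table_name
    LIMIT 5
"""

rows = conn.execute(striped_sql).fetchall()
for row in rows:
    print(row)
```

The ORDER BY inside each OVER clause decides which row counts as "first" for its table; replace id with whatever ranking column applies to your data.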