Most effective way to use group function in another column - mysql

I have a query that looks something like this:
SELECT COUNT(DISTINCT A) as a_distinct,
COUNT(DISTINCT B) as b_distinct,
COUNT(DISTINCT A)/COUNT(DISTINCT B) as a_b_ratio
FROM
sometable_ab
As we can see, this looks very inefficient because the aggregate functions are run twice even though they have already been calculated. The only solution I could think of is breaking it into two queries. Is that the only possible solution, or is there a better, more efficient one? I am using Redshift, which is mostly based on PostgreSQL, but even a MySQL solution would be acceptable, as I cannot think of a way to do this efficiently in any database.

If you are worried about the performance impact, just use a subquery:
SELECT a_distinct, b_distinct, a_distinct / b_distinct as a_b_ratio
FROM (SELECT COUNT(DISTINCT A) as a_distinct,
COUNT(DISTINCT B) as b_distinct
FROM sometable_ab
) ab
For most aggregation functions, this would be irrelevant, but count(distinct) can be a performance hog.
This is ANSI standard SQL and should work in any database you mention.

Using a subquery still counts as one query for any RDBMS. More importantly, count() never returns NULL, but 0 if no row is found (or no non-null value for the given expression in any row). This would lead you straight into a division by zero exception. Fix it with NULLIF (also standard SQL). You'll get NULL in this case.
SELECT *, a_distinct / NULLIF(b_distinct, 0) AS a_b_ratio
FROM (
SELECT count(DISTINCT a) AS a_distinct
, count(DISTINCT b) AS b_distinct
FROM sometable_ab
) sub;
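For what it's worth, the combined query can be tried end to end. The sketch below uses Python's built-in sqlite3 module only because it runs anywhere; the sample rows are made up, and the `1.0 *` factor is my own addition, since dividing two integer counts performs integer division in PostgreSQL-family databases (and in SQLite). Note that SQLite already returns NULL for a division by zero, so NULLIF matters mainly for engines that raise an error instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable_ab (a INTEGER, b INTEGER)")

# The subquery computes each COUNT(DISTINCT ...) once; the outer query reuses
# the results. "1.0 *" forces a non-integer ratio (an assumption about intent).
RATIO_SQL = """
SELECT a_distinct, b_distinct,
       1.0 * a_distinct / NULLIF(b_distinct, 0) AS a_b_ratio
FROM (
    SELECT count(DISTINCT a) AS a_distinct,
           count(DISTINCT b) AS b_distinct
    FROM sometable_ab
) sub
"""

# Empty table: both counts are 0 and NULLIF turns the divisor into NULL.
empty_row = conn.execute(RATIO_SQL).fetchone()

# Made-up sample data: 3 distinct values of a, 2 distinct values of b.
conn.executemany(
    "INSERT INTO sometable_ab VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 20), (3, 20)],
)
filled_row = conn.execute(RATIO_SQL).fetchone()

print(empty_row)   # (0, 0, None)
print(filled_row)  # (3, 2, 1.5)
```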

Related

Using "dynamic" variables in a MySQL query [duplicate]

I have a question regarding aliases in SQL. If I want to use an alias within the same query, can I do so? For example:
Consider Table name xyz with column a and b
select (a/b) as temp , temp/5 from xyz
Is this possible in some way ?
You are talking about giving an identifier to an expression in a query and then reusing that identifier in other parts of the query?
That is not possible in Microsoft SQL Server, which is where nearly all of my SQL experience lies. However, you can do the following.
SELECT temp, temp / 5
FROM (
SELECT (a/b) AS temp
FROM xyz
) AS T1
Obviously that example isn't particularly useful, but if you were using the expression in several places it may be more useful. It can come in handy when the expressions are long and you want to group on them too because the GROUP BY clause requires you to re-state the expression.
In MSSQL you also have the option of creating computed columns which are specified in the table schema and not in the query.
You can also use the WITH clause in Oracle. Similar statements are available in other databases too. Here is the one we use for Oracle.
with t
as (select a/b as temp
from xyz)
select temp, temp/5
from t
/
This can have a performance advantage, particularly if you have complex queries involving several nested queries, because the WITH clause can be evaluated once and reused in subsequent statements.
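To compare the derived-table and WITH forms side by side, here is a small sketch using Python's built-in sqlite3 module (chosen only for convenience; it is not one of the databases discussed above). The alias `ratio`, the output name `ratio_fifth`, and the sample rows are my own illustrative choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyz (a REAL, b REAL)")
conn.executemany("INSERT INTO xyz VALUES (?, ?)", [(10.0, 4.0), (20.0, 4.0)])

# Derived table: the inner SELECT names the expression, the outer one reuses it.
derived = conn.execute("""
    SELECT ratio, ratio / 5 AS ratio_fifth
    FROM (SELECT a / b AS ratio FROM xyz) AS t1
    ORDER BY ratio
""").fetchall()

# Common table expression: same idea, with the named query written up front.
cte = conn.execute("""
    WITH t AS (SELECT a / b AS ratio FROM xyz)
    SELECT ratio, ratio / 5 AS ratio_fifth
    FROM t
    ORDER BY ratio
""").fetchall()

print(derived)  # [(2.5, 0.5), (5.0, 1.0)]
print(cte)      # same rows as the derived-table version
```

Both forms return the same rows; whether one is faster than the other depends on the database and is best checked against its execution plan.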
Not possible in the same SELECT clause, assuming your SQL product is compliant with entry level Standard SQL-92.
Expressions (and their correlation names) in the SELECT clause come into existence 'all at once'; there is no left-to-right evaluation that you seem to hope for.
As per @Josh Einstein's answer here, you can use a derived table as a workaround (hopefully using a more meaningful name than 'temp' and providing one for the temp/5 expression -- bear in mind the person who will inherit your code).
Note that code you posted would work on the MS Access Database Engine (and would assign a meaningless correlation name such as Expr1 to your second expression) but then again it is not a real SQL product.
It's possible, I guess:
SELECT (A/B) as temp, (temp/5)
FROM xyz,
(SELECT numerator_field as A, Denominator_field as B FROM xyz),
(SELECT (numerator_field/denominator_field) as temp FROM xyz);
This is now available in Amazon Redshift
E.g.
select clicks / impressions as probability, round(100 * probability, 1) as percentage from raw_data;
Ref:
https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/
You might find W3Schools "SQL Alias" to be of good help.
Here is an example from their tutorial:
SELECT po.OrderID, p.LastName, p.FirstName
FROM Persons AS p,
Product_Orders AS po
WHERE p.LastName='Hansen' AND p.FirstName='Ola'
Regarding using the Alias further in the query, depending on the database you are using it might be possible.

The efficiency of HAVING vs subquery and why

When I studied the SQL HAVING tutorial, it said: HAVING is the “clean” way to filter a query that has been aggregated, but this is also commonly done using a subquery.
Sometimes a HAVING clause is equivalent to a subquery, as in these two queries:
select account_id, sum(total_amt_usd) as sum_amount
from demo.orders
group by account_id
having sum(total_amt_usd) >= 250000
select *
from (
select account_id, sum(total_amt_usd) as sum_amount
from demo.orders
group by account_id
) as subtable
where sum_amount >= 250000
I want to know which one is recommended and the reason why this one is faster or more efficient than the other.
As with any performance question, you should try it on your data. But, the two should be essentially equivalent. If you are interested in such questions, then you should learn how to read execution plans.
Just one note about MySQL. MySQL tends to materialize subqueries. This might incur a little extra overhead by writing the group by results before filtering them, but you probably would not notice the difference.
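As a quick way to "try it on your data", the sketch below runs both forms against a tiny made-up table and compares the results. It uses Python's built-in sqlite3 module, so the table is plain `orders` rather than `demo.orders`, and the plans it prints are SQLite's, not those of MySQL or any other engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, total_amt_usd INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 200000), (1, 100000), (2, 50000), (3, 250000)],  # made-up rows
)

HAVING_SQL = """
    SELECT account_id, sum(total_amt_usd) AS sum_amount
    FROM orders
    GROUP BY account_id
    HAVING sum(total_amt_usd) >= 250000
    ORDER BY account_id
"""

SUBQUERY_SQL = """
    SELECT *
    FROM (
        SELECT account_id, sum(total_amt_usd) AS sum_amount
        FROM orders
        GROUP BY account_id
    ) AS subtable
    WHERE sum_amount >= 250000
    ORDER BY account_id
"""

having_rows = conn.execute(HAVING_SQL).fetchall()
subquery_rows = conn.execute(SUBQUERY_SQL).fetchall()
print(having_rows == subquery_rows)  # True: both forms return the same rows

# Print each plan to see how this particular engine executes the two forms.
for sql in (HAVING_SQL, SUBQUERY_SQL):
    for step in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print(step)
```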

SQL – Eliminating duplicates in UNION ALL using a WHERE clause

On toptal.com I found this question to which they provided the following solution:
SELECT * FROM mytable WHERE a=X UNION ALL SELECT * FROM mytable WHERE b=Y AND a!=X
Can someone explain the usage of X and Y here? Are these variables? Does it also work this way in MySQL? I can't seem to get this kind of query running on a test server.
Also, they make the following claim:
The key is the AND a!=X part. This gives you the benefits of the UNION (a.k.a., UNION DISTINCT) command, while avoiding much of its performance hit.
But if this were true, why would anyone ever use UNION DISTINCT? In particular, why wouldn't it be implemented using this supposedly more efficient way?
The x is just a placeholder in that pseudocode for the 'real' filter. You might know that the field a is the only one that might be duplicated on both sides of your union, but the query optimiser might not, so doing the union in that way is more performance friendly. That answer would only apply in certain circumstances, depending on the context of the data.
It's not a well written question.
You can use an OR in the WHERE clause like this:
SELECT *
FROM mytable
WHERE a=X OR (b=Y AND a!=X);
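To see where the variants agree and where they do not, here is a small sketch with made-up data, using Python's built-in sqlite3 module and bound parameters in place of X and Y. The `id` column and the NULL row are my own additions: they show one case where the `a != X` trick and a plain UNION return different rows, because `a != X` is not true when `a` is NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, 1, 5), (2, 1, 7), (3, 2, 7), (4, None, 7)],  # row 4 has a NULL in a
)
params = {"x": 1, "y": 7}  # stand-ins for the X and Y placeholders

def ids(sql):
    """Return the sorted ids of the rows a query produces."""
    return sorted(row[0] for row in conn.execute(sql, params))

union_all_ids = ids("""
    SELECT * FROM mytable WHERE a = :x
    UNION ALL
    SELECT * FROM mytable WHERE b = :y AND a != :x
""")
or_ids = ids("SELECT * FROM mytable WHERE a = :x OR (b = :y AND a != :x)")
union_ids = ids("""
    SELECT * FROM mytable WHERE a = :x
    UNION
    SELECT * FROM mytable WHERE b = :y
""")

print(union_all_ids)  # [1, 2, 3]
print(or_ids)         # [1, 2, 3]
print(union_ids)      # [1, 2, 3, 4]
```

Row 2 matches both filters but appears once in every variant. Row 4 only survives the plain UNION, which is one reason the rewrite is not a general replacement for it.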

Does MySQL eliminate common subexpressions between SELECT and HAVING/GROUP BY clause

I often see people answer MySQL questions with queries like this:
SELECT DAY(date), other columns
FROM table
GROUP BY DAY(date);
SELECT somecolumn, COUNT(*)
FROM table
HAVING COUNT(*) > 1;
I always like to give the column an alias and refer to that in the GROUP BY or HAVING clause, e.g.
SELECT DAY(date) AS day, other columns
FROM table
GROUP BY day;
SELECT somecolumn, COUNT(*) AS c
FROM table
HAVING c > 1;
Is MySQL smart enough to notice that the expressions in the later clauses are the same as in SELECT, and only do it once? I'm not sure how to test this -- EXPLAIN doesn't show any difference, but it doesn't seem to show how it's doing the grouping or filtering in the first place; it seems mainly useful for optimizing joins and WHERE clauses.
I tend to be pessimistic about MySQL optimization, so I like to give it all the help I can.
I think this can be tested using sleep() function,
for example take a look at this demo: http://sqlfiddle.com/#!2/0bc1b/1
Select * FROM t;
| X |
|---|
| 1 |
| 2 |
| 2 |
SELECT x+sleep(1)
FROM t
GROUP BY x+sleep(1);
SELECT x+sleep(1) As name
FROM t
GROUP BY name;
Both queries take about 3000 ms (3 seconds) to execute.
There are 3 records in the table, and the query sleeps for 1 second per record,
so the expression is evaluated only once for each record, not twice.
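The same kind of experiment can be done without timing, by counting calls to an instrumented function. The sketch below does this with Python's built-in sqlite3 module and a hypothetical user-defined function `traced`; it measures SQLite, not MySQL, so it only illustrates the technique and says nothing about MySQL's optimizer.

```python
import sqlite3

calls = {"n": 0}

def traced(x):
    """Return x unchanged, counting how often the engine calls us."""
    calls["n"] += 1
    return x

conn = sqlite3.connect(":memory:")
conn.create_function("traced", 1, traced)  # hypothetical helper, registered as SQL function
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (2,)])

def count_calls(sql):
    """Run a query and report its rows plus the number of traced() calls."""
    calls["n"] = 0
    rows = conn.execute(sql).fetchall()
    return rows, calls["n"]

rows_expr, n_expr = count_calls("SELECT traced(x) FROM t GROUP BY traced(x)")
rows_alias, n_alias = count_calls("SELECT traced(x) AS name FROM t GROUP BY name")

# Compare the two counts: equal counts mean the alias made no difference here.
print("repeated expression:", n_expr, "calls")
print("alias in GROUP BY:  ", n_alias, "calls")
```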
After consulting with one of the MySQL engineers, I proffer this lengthy answer.
Caching - no part of a query is 'remembered' for later use in that (or subsequent) query. (Contrast: the Query cache.)
Common subexpression elimination - no. This is a common Compiler technique, but MySQL does not use it. Example: (a-b)*(a-b) will do the subtract twice.
Removal of a constant from a loop - yes, with limitations. This is another Compiler technique.
A variety of SQL-centric hacks - yes; see below.
Re-evaluation of a subquery - it depends. Also, the Optimizer is gradually getting better.
VIEWs - it depends. There are still cases where a VIEW is destined to perform worse than the equivalent SELECT. Example: no condition pushdown into a UNION in a VIEW. Actually, this is more a matter of delayed action.
I think that some newer versions of MariaDB have a "subquery cache".
(Caveat: I do not have 100% confidence in any of my answer, but I do believe that most of it is correct, as of MySQL 5.7, MariaDB 10.1, etc)
Think of a multi-row SELECT as a loop. Many, maybe all, "deterministic" expressions are evaluated once. Example: Constant date expressions, even involving function calls. But...
NOW() is specifically evaluated once at the beginning of a query. Furthermore, the value is passed to Slaves when replicating. That is, by the time the query is stored on a slave, NOW() could be out of date. (SYSDATE() is another animal.)
Especially with the advent of only_full_group_by, GROUP BY needs to know if it matches the SELECT expressions. So, this looks for similar code.
HAVING and ORDER BY can use aliases from the SELECT list (unlike WHERE and GROUP BY). So SELECT expr AS x ... HAVING expr seems to reevaluate expr, but SELECT expr AS x ... HAVING x seems to reach for the already-evaluated expr.
The Windowing functions of MariaDB 10.2 have some pretty severe restrictions on where they can/cannot be reused; I don't have a complete picture of them yet.
Generally, none of this matters -- the re-evaluation of an expression (DATE(date) or even COUNT(*)) will get the same answer. Furthermore, the rummaging through the rows is usually much more costly than expression evaluation. So, unless you have a good stopwatch, you won't tell the difference.

Is there a tool that helps me understand why a certain row was included in the result of an SQL query?

I'm maintaining a legacy app where the SQL queries look like the devil's handiwork - before I give up and rewrite the whole damn thing I'm hoping to use better tools to do a surgical change :)
I want to know which "WHERE" clause resulted in the inclusion of certain rows when there are a lot of ORs. Even better, why certain rows were excluded from the result set.
(Specifically I am using MySQL)
Move your where conditions into case statements. Break up each 'OR' into its own column and the result will show you which returned true and false.
I haven't heard of a tool that would do this for you.
This is a variation on Pirion's approach.
Instead of using a case statement, put each where condition in the select list. So, a condition like:
select t.*
from t
where A or B or C or D
would become:
select t.*,
(Acomp) as A,
(Bcomp) as B,
(Ccomp) as C,
(Dcomp) as D
from t
where Acomp or Bcomp or Ccomp or Dcomp
MySQL has the nice feature that a boolean is returned as a 0 (false) or 1 (true). This will allow you to see all the conditions that a given row matches.
You might then be able to simplify the logic by removing or combining conditions.
If the conditions are computationally intensive or very long (such as using subqueries), you might want to do this using a subquery:
select t.*
from (select t.*,
(Acomp) as A,
(Bcomp) as B,
(Ccomp) as C,
(Dcomp) as D
from t
) t
where A or B or C or D;
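Here is a minimal runnable sketch of the flag-column idea, using Python's built-in sqlite3 module (SQLite also returns comparisons as 0 or 1). The table layout and the three conditions are made up for illustration. Leaving out the WHERE clause in the first query also shows the excluded rows, with every flag at 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, price INTEGER, status TEXT, qty INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, 150, "new", 1), (2, 50, "vip", 20), (3, 50, "new", 1)],  # made-up rows
)

# One flag column per OR-ed condition; no WHERE, so excluded rows are visible too.
flags = conn.execute("""
    SELECT id,
           (price > 100)    AS cond_a,
           (status = 'vip') AS cond_b,
           (qty >= 10)      AS cond_c
    FROM t
    ORDER BY id
""").fetchall()

# Subquery form: compute the flags once, then filter on them.
matched_ids = [row[0] for row in conn.execute("""
    SELECT id
    FROM (SELECT id,
                 (price > 100)    AS cond_a,
                 (status = 'vip') AS cond_b,
                 (qty >= 10)      AS cond_c
          FROM t) flagged
    WHERE cond_a OR cond_b OR cond_c
    ORDER BY id
""")]

for row in flags:
    print(row)       # (1, 1, 0, 0), then (2, 0, 1, 1), then (3, 0, 0, 0)
print(matched_ids)   # [1, 2]
```

Row 3 has all flags at 0, which is the answer to "why was this row excluded".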