How can I apply arithmetic operations to aggregated columns in MySQL?

TL;DR
Is there a way to use aggregated results in arithmetic operations?
Details
I want to take two aggregated columns (SUM(..), COUNT(..)) and combine them arithmetically, e.g.:
-- doesn't work
SELECT
SUM(x) AS x,
COUNT(y) AS y,
(x / y) AS x_per_y -- Problem HERE
FROM
my_tab
GROUP BY groupable_col;
That doesn't work, but I've found this does:
SELECT
SUM(x) AS x,
COUNT(y) AS y,
SUM(x) / COUNT(y) AS x_per_y -- notice the repeated aggregate
FROM
my_tab
GROUP BY groupable_col;
But if I need many columns that operate on aggregates, this quickly becomes very repetitive, and I'm not sure how to tell whether MySQL can optimize it so that the aggregates aren't calculated multiple times.
I've searched SO for a while now, as well as asked some pros, and the best alternative I can come up with is nested selects, ~~which my db doesn't support~~.
EDIT: it did support them; I had been doing something wrong and ruled out nested selects prematurely.
Also, the MySQL documentation seems to support it, but I can't get something like this to work (example at the very bottom of the link):
https://dev.mysql.com/doc/refman/5.5/en/group-by-handling.html

One way is to use a subquery:
select x,
y,
x / y as x_per_y
from (
select SUM(x) as x,
COUNT(y) as y
from my_tab
group by groupable_col
) t
Also note that the value of COUNT(y) can be zero (when all y values are NULL).
MySQL handles this case automatically and produces NULL when the denominator is zero.
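A quick way to see this for yourself on a MySQL server:
SELECT 1 / 0;  -- returns NULL rather than raising an error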
Some DBMSes throw a divide-by-zero error in this case, which is usually handled by producing NULL explicitly:
select x,
y,
case when y > 0 then x / y end as x_per_y
from (
select SUM(x) as x,
COUNT(y) as y
from my_tab
group by groupable_col
) t

Related

MySQL: Where can AND / OR / XOR be used as anything other than logical operators in WHERE clauses?

I have noticed that the MySQL logical binary operators (AND, OR and XOR) do not always act as such. In particular, AND can appear as part of a BETWEEN ... AND ... expression.
Besides BETWEEN, is there any other case where any of these three tokens (words) can appear as part of a WHERE clause but not act as a logical operator?
The AND in BETWEEN .. AND is not the same thing as the AND operator; it just reuses the same keyword. I'm looking at jOOQ's parser source code, and I can tell you there are at least the following (far from exhaustive; see the sketch after this list):
AND
In window function frame clauses, such as ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
In the MERGE statement's WHEN clauses (not supported by MySQL 8 yet), such as WHEN [ NOT ] MATCHED [ AND ... ] THEN
In the X BETWEEN Y AND Z predicate
In the temporal query FOR PORTION OF .. BETWEEN .. AND syntax (not supported by MySQL 8 yet)
OR
In CREATE OR REPLACE statements, such as CREATE OR REPLACE VIEW
NOT
NOT NULL predicate
NOT LIKE predicate
IF NOT EXISTS clause in DDL and a lot of other DDL clauses, like SET NOT NULL, DROP NOT NULL, etc. etc.
Other tokens
There are also "soft matches", which may be relevant if your "parser" is not smart enough to actually parse the SQL language. These may include functions like BIT_AND()
Conclusion for your use case
SQL is not a trivial language to parse. A minimalistic parser cannot easily transform all sorts of boolean expressions (or other things) to produce "equivalent" unions. This is very hard! Your case may not be correct, depending on the projection. E.g. these two queries are not the same:
-- May produce duplicate values for col
SELECT col
FROM t
WHERE (a OR b) AND x AND y
-- Does not produce duplicate values for col
SELECT col
FROM t
WHERE a AND x AND y
UNION
SELECT col
FROM t
WHERE b AND x AND y
UNION ALL cannot be used here because then you'd get duplicates that you didn't get before. You'd have to produce this query (akin to what Oracle does when it applies the "concatenation transformation" e.g. via the /*+USE_CONCAT*/ hint):
SELECT col
FROM t
WHERE a AND x AND y
UNION ALL
SELECT col
FROM t
WHERE b AND x AND y AND NOT (a AND x AND y) -- Exclude previous UNION subquery predicate here
This will get more complicated as your boolean expressions get more complex.
But have you really gained anything? Hard to say. Have you possibly broken your query? Probably, because what happens if you already have a UNION? Or ORDER BY? Or DISTINCT? Or LIMIT?

GROUP BY Syntax MySQL - Leaving out a groupable column

I have Table A with columns X,Y,Z.
X is an FK, Y is a description. Each X has exactly one corresponding Y. So if X stays the same over multiple records, Y stays the same too.
So there may be any number of records where X and Y are the same.
Now I'm running the following query:
SELECT X, Y
FROM A
GROUP BY X;
Will this query work?
Y is supposed to be grouped alongside X, but I didn't explicitly specify it in the query.
Does MySQL still implicitly act this way, though? And is this behavior reliable/standardized?
Furthermore, will the results vary based on the datatype of Y? For example, is there a difference if Y is VARCHAR, CHAR or INT? In the case of an INT, will the result be a SUM() of the grouped records?
Is the behavior MySQL will exhibit in such a case standardized, and where can I look it up?
Each X has exactly one corresponding Y
SELECT X, Y FROM A GROUP BY X;
Will this query work?
Technically, what happens when you run this query under MySQL depends on whether the SQL mode ONLY_FULL_GROUP_BY is enabled or not:
if it is enabled, the query errors out: all non-aggregated columns must appear in the GROUP BY clause (you need to add Y to the GROUP BY clause);
otherwise, the query executes and gives you an arbitrary value of Y for each X; but since Y is functionally dependent on X, the value is actually predictable, so this is OK.
Generally, although the SQL standard does recognize the notion of a functionally dependent column, it is good practice to always include all non-aggregated columns in the GROUP BY clause. It is also a requirement in most databases other than MySQL (and, starting with MySQL 5.7, ONLY_FULL_GROUP_BY is enabled by default). This also protects you from various pitfalls and unpredictable behaviors.
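If you are not sure which behavior your server will exhibit, you can inspect the session's SQL mode directly; a minimal sketch (the REPLACE one-liner is a common way to drop the mode for the current session only, though keeping it enabled and writing explicit queries is the better habit):
SELECT @@SESSION.sql_mode;  -- is ONLY_FULL_GROUP_BY in the list?
SET SESSION sql_mode = REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', '');  -- drop it for this session only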
Using ANY_VALUE() makes the query both valid and explicit about its purpose:
SELECT X, ANY_VALUE(Y) FROM A GROUP BY X;
Note that if you only want the distinct combinations of X, Y, it is simpler to use SELECT DISTINCT:
SELECT DISTINCT X, Y FROM A;
Your query will work if Y is functionally dependent on X (depending on the SQL mode being used), but if you are trying to get distinct X, Y pairs from the table, it is better to use DISTINCT. GROUP BY is meant to be used with aggregate functions.
So you should use:
SELECT DISTINCT X, Y
FROM A;
A sample case where you would use GROUP BY is with an aggregate function:
SELECT X, Y, COUNT(*)
FROM A
GROUP BY X, Y;

Where in sub-query or in main query, what is preferable?

Simplified example 1:
SELECT * FROM (
SELECT x, y, z FROM table1
WHERE x = 'test'
-- union, etc, etc, complicated stuff...
) AS t
-- union, etc, etc, complicated stuff...
Simplified example 2:
SELECT * FROM (
SELECT x, y, z FROM table1
-- union, etc, etc, complicated stuff...
) AS t
-- union, etc, etc, complicated stuff...
WHERE x = 'test'
Which of the above is more popular? Is it more performant? Is it recommended for other reasons? Does it help to filter the results "early", before doing the union and similar operations? Thanks.
In MySQL you definitely want the filtering condition in the subquery. MySQL materializes subqueries. The smaller the subquery, the faster the query.
In addition, MySQL may be able to use an index for the condition.
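A hedged way to verify this on your own schema is to compare the plans of the two forms (assuming an index on table1.x exists):
EXPLAIN
SELECT * FROM (
SELECT x, y, z FROM table1
WHERE x = 'test'
) AS t;
-- versus EXPLAIN on the version that filters outside the derived table:
-- the first form materializes only the rows matching x = 'test' and may use the index on x.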

Conditional probability p(y|x) in SQL

How do I calculate conditional probability in vendor-agnostic SQL code while reading a precomputed table (histogram) just once?
Let's imagine we have a query which returns a histogram relation. The histogram contains the following attributes: {x, y, cnt}, where cnt is the count of occurrences of the nominal attributes x and y. Calculation of the histogram is time consuming.
Once we have the histogram, we want to calculate the conditional probability p(y|x). A possible way to do that is to take p(y|x) = count(y,x) / count(x), as outlined in the following query:
with histogram as (
-- Long and time consuming subquery returning {x, y, cnt}
), x_count as (
select x
, sum(cnt) as cnt
from histogram
group by x
)
select y
, x
, cnt/x_count.cnt as probability
from histogram
join x_count
using(x)
However, common table expressions (CTEs) are not portable (e.g. MySQL does not work with them). Is there a way to rewrite the CTE so that:
The same query can be executed without change on MySQL, MSSQL and PostgreSQL?
The histogram relation is calculated just once?
All I can think of is to materialize the histogram into a table, process it, and then delete it.
First, just because you declare something as a CTE does not mean that it is run only once. For instance, SQL Server does not materialize CTEs, so using your logic it would run the histogram once for each reference. It is the same as a view.
In addition, the using clause is not supported by all databases.
So, the one thing that you could do that is vendor-agnostic is to use a view. There is a slight hitch, because dropping a view that already exists is vendor-specific. But the following would generally work to express the query:
create view histogram as -- you might want to give this a more unique name
-- Long and time consuming subquery returning {x, y, cnt}
select h.y, h.x, h.cnt / total.cnt as probability
from histogram h join
(select x, sum(cnt) as cnt
from histogram
group by x
) total
on h.x = total.x;
drop view histogram;
Of course, this runs the histogram query multiple times. So, you could solve this using temporary tables:
create table histogram (
x ??, -- I don't know what the types are
y ??,
cnt ??
);
insert into histogram (x, y, cnt)
select . . . ; -- your complicated query here
select h.y, h.x, h.cnt * 1.0 / total.cnt as probability
from histogram h join
(select x, sum(cnt) as cnt
from histogram
group by x
) total
on h.x = total.x;
drop table histogram;
Unfortunately, dropping an existing table is database specific. This does meet your requirements, though.
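For what it's worth, DROP TABLE IF EXISTS is accepted by MySQL, PostgreSQL and SQL Server 2016+, which may be close enough to portable for your particular list of targets (an assumption about your versions, not a guarantee):
drop table if exists histogram;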
My advice would be to drop MySQL from the requirement -- it is rather degraded from the perspective of ANSI functionality. Then simply do:
select h.*, cnt * 1.0 / sum(cnt) over (partition by x) as probability
from histogram h;
(The * 1.0 is because some databases do integer division and cnt sounds like it might be an integer.)
This would be the simplest way to represent the query without re-calculating histogram. And, it will work in a lot of databases -- SQL Server, Postgres, Oracle, Teradata, DB2, BigQuery, RedShift, Hive. In fact, I think it will work in pretty much all current versions of what is commonly called a "database" except MySQL, SQLite, and MS Access.

increasing performance on a SELECT query with large 3D point data set

I have a large dataset (around 1.9 million rows) of 3D points that I'm selecting from. The statement I use most often is similar to:
SELECT * FROM points
WHERE x > 100 AND x < 200
AND y > 100 AND y < 200
AND z > 100 AND z < 200
AND otherParameter > 10
I have indices on x, y, and z, as well as on otherParameter. I've also tried adding a multi-part index on x, y, z, but that hasn't helped.
Any advice on how to make this SELECT query quicker?
B-Tree indexes won't help much for such a query.
What you need is an R-Tree index and a minimal bounding parallelepiped query over it.
Unfortunately, MySQL does not support R-Tree indexes over 3D points, only 2D. However, you may create an index over, say, X and Y together, which will be more selective than any of the B-Tree indexes on X and Y alone:
ALTER TABLE points ADD xy POINT;
UPDATE points
SET xy = Point(x, y);
ALTER TABLE points MODIFY xy POINT NOT NULL;
CREATE SPATIAL INDEX sx_points_xy ON points (xy);
SELECT *
FROM points
WHERE MBRContains(LineString(Point(100, 100), Point(200, 200)), xy)
AND z BETWEEN 100 and 200
AND otherParameter > 10;
This is only possible if your table is MyISAM.
I don't have MySQL to test, but I'm curious how efficient its INTERSECT is:
select points.*
from points
join
(
select id from points where x > 100 AND x < 200
intersect
select id from points where y > 100 AND y < 200
intersect
select id from points where z > 100 AND z < 200
) as keyset
on points.id = keyset.id
Not necessarily recommending this -- but it's something to try, especially if you have separate indexes on x, y, and z.
EDIT: Since MySQL doesn't support INTERSECT, the query above could be rewritten using JOINs of inline views, as sketched below. Each view would contain a keyset, and each view would have the advantage of the separate indexes you have placed on x, y, and z. The performance would depend on the number of keys returned and on the intersect/join algorithm.
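A sketch of that rewrite (assuming points has a primary key id, as in the query above, and the separate single-column indexes you already have):
select p.*
from points p
join (select id from points where x > 100 and x < 200) kx on kx.id = p.id
join (select id from points where y > 100 and y < 200) ky on ky.id = p.id
join (select id from points where z > 100 and z < 200) kz on kz.id = p.id
where p.otherParameter > 10;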
I first tested the intersect approach (in SQLite) to see if there were ways to improve performance in spatial queries short of using their R-Tree module. INTERSECT was actually slower than using a single non-composite index on one of the spatial values and then scanning the subset of the base table to get the other spatial values. But the results can vary depending on the size of the database. After the table has reached gargantuan size and disk I/O becomes more important as a performance factor, it may be more efficient to intersect discrete keysets, each of which has been instantiated from an index, than to do a scan of the base table subsequent to an initial fetch-from-index.