I have Table A with columns X,Y,Z.
X is an FK, Y is a description. Each X has exactly one corresponding Y. So if X stays the same over multiple records, Y stays the same too.
So there may be any number of records where X and Y are the same.
Now I'm running the following query:
SELECT X, Y
FROM A
GROUP BY X;
Will this query work?
Y is supposed to be grouped alongside X, but I didn't explicitly specify it in the query.
Does MySQL still implicitly act this way though? And is this behavior reliable/standardized?
Furthermore, will the results vary based on the datatype of Y? For example, is there a difference if Y is VARCHAR, CHAR or INT? In case of an INT, will the result be a SUM() of the grouped records?
Is the behavior MySQL exhibits in such a case standardized, and where can I look it up?
Each X has exactly one corresponding Y
SELECT X, Y FROM A GROUP BY X;
Will this query work?
Technically, what happens when you run this query under MySQL depends on whether the SQL mode ONLY_FULL_GROUP_BY is enabled:
if it is enabled, the query errors: all non-aggregated columns must appear in the GROUP BY clause (you need to add Y to the GROUP BY clause)
else, the query executes, and gives you an arbitrary value of Y for each X; but since Y is functionally dependent on X, the value is actually predictable, so this is OK.
Generally, although the SQL standard does recognize the notion of functionally dependent columns, it is good practice to always include all non-aggregated columns in the GROUP BY clause. It is also a requirement in most databases other than MySQL (and, starting with MySQL 5.7, ONLY_FULL_GROUP_BY is enabled by default). This also protects you from various pitfalls and unpredictable behaviors.
Using ANY_VALUE() makes the query both valid and explicit about its purpose:
SELECT X, ANY_VALUE(Y) FROM A GROUP BY X;
Note that if you only want the distinct combinations of X, Y, it is simpler to use SELECT DISTINCT:
SELECT DISTINCT X, Y FROM A;
Your query will work if Y is functionally dependent on X (depending on SQL mode being used), but if you are trying to get distinct X,Y pairs from the table, it is better to use DISTINCT. The GROUP BY is meant to be used with the aggregate functions.
So you should use:
SELECT DISTINCT X, Y
FROM A;
A sample case where you would use GROUP BY would be with an aggregate function:
SELECT X, Y, COUNT(*)
FROM A
GROUP BY X, Y;
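To see the difference between these forms on concrete data, here is a minimal sketch. It runs the SQL through Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is runnable anywhere), and the sample rows are made up. With Y functionally dependent on X, SELECT DISTINCT and GROUP BY X, Y return the same pairs, and GROUP BY only earns its keep once an aggregate such as COUNT(*) is added.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (X INTEGER, Y TEXT, Z INTEGER)")

# Hypothetical sample data: every X always carries the same Y.
conn.executemany(
    "INSERT INTO A (X, Y, Z) VALUES (?, ?, ?)",
    [(1, "one", 10), (1, "one", 11), (2, "two", 20), (2, "two", 21), (2, "two", 22)],
)

# Distinct pairs, no aggregate needed.
distinct_rows = conn.execute("SELECT DISTINCT X, Y FROM A ORDER BY X").fetchall()

# Same pairs via GROUP BY with every non-aggregated column listed.
grouped_rows = conn.execute(
    "SELECT X, Y FROM A GROUP BY X, Y ORDER BY X"
).fetchall()

# GROUP BY with an aggregate: number of records per pair.
counted_rows = conn.execute(
    "SELECT X, Y, COUNT(*) FROM A GROUP BY X, Y ORDER BY X"
).fetchall()

print(distinct_rows)
print(grouped_rows)
print(counted_rows)
```

Listing both X and Y in the GROUP BY keeps the query valid whether or not ONLY_FULL_GROUP_BY is enabled.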
I have noticed that the MySQL logical binary operators (AND, OR and XOR) do not always act as such. In particular, AND can appear as part of a BETWEEN ... AND ... expression.
Besides BETWEEN, is there any other case where any of these three tokens (words) can appear as part of a WHERE clause but not act as a logical operator?
The AND in BETWEEN .. AND is not the same thing as the AND operator. It just reuses the same keyword. I'm looking at jOOQ's parser source code, and I can tell you there are at least the following (far from exhaustive):
AND
In window function frame clauses, such as ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
In the MERGE statement's WHEN clauses (not supported by MySQL 8 yet), such as WHEN [ NOT ] MATCHED [ AND ... ] THEN
In the X BETWEEN Y AND Z predicate
In the temporal query FOR PORTION OF .. BETWEEN .. AND syntax (not supported by MySQL 8 yet)
OR
In CREATE OR REPLACE statements, such as CREATE OR REPLACE VIEW
NOT
NOT NULL predicate
NOT LIKE predicate
IF NOT EXISTS clause in DDL and a lot of other DDL clauses, like SET NOT NULL, DROP NOT NULL, etc. etc.
Other tokens
There are also "soft matches", which may be relevant if your "parser" is not smart enough to actually parse the SQL language. These may include functions like BIT_AND()
Conclusion for your use case
SQL is not a trivial language to parse. A minimalistic parser cannot easily transform all sorts of boolean expressions (or other things) to produce "equivalent" unions. This is very hard! Your case may not be correct, depending on the projection. E.g. these two queries are not the same:
-- May produce duplicate values for col
SELECT col
FROM t
WHERE (a OR b) AND x AND y
-- Does not produce duplicate values for col
SELECT col
FROM t
WHERE a AND x AND y
UNION
SELECT col
FROM t
WHERE b AND x AND y
UNION ALL cannot be used here because then you'd get too many duplicates that you didn't get before. You'd have to produce this query (akin to what Oracle does when it applies the "concatenation transformation" e.g. via the /*+USE_CONCAT*/ hint):
SELECT col
FROM t
WHERE a AND x AND y
UNION ALL
SELECT col
FROM t
WHERE b AND x AND y AND NOT (a AND x AND y) -- Exclude previous UNION subquery predicate here
This will get more complicated as your boolean expressions get more complex.
But have you really gained anything? Hard to say. Have you possibly broken your query? Probably, because what happens if you already have a UNION? Or ORDER BY? Or DISTINCT? Or LIMIT?
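The duplicate behaviour of the three rewrites can be checked on a small table. This is only a sketch: it uses Python's sqlite3 module instead of MySQL, the rows are invented, and all predicate columns are declared NOT NULL. That last assumption matters, because with NULLs the NOT (...) term evaluates to NULL and the exclusive rewrite would silently drop rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# a, b, x, y play the role of the boolean predicates (1 = true, 0 = false).
conn.execute(
    "CREATE TABLE t (col INTEGER NOT NULL, a INTEGER NOT NULL, b INTEGER NOT NULL,"
    " x INTEGER NOT NULL, y INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 1, 1),  # satisfies both a and b
        (1, 1, 0, 1, 1),  # same col value, satisfies a only
        (2, 0, 1, 1, 1),  # satisfies b only
        (3, 0, 0, 1, 1),  # satisfies neither a nor b
        (4, 1, 1, 0, 1),  # filtered out by x
    ],
)


def run(sql):
    """Return the col values of a query as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


original = run("SELECT col FROM t WHERE (a OR b) AND x AND y")

# UNION removes duplicates that the original query keeps.
union_rewrite = run(
    "SELECT col FROM t WHERE a AND x AND y"
    " UNION SELECT col FROM t WHERE b AND x AND y"
)

# Plain UNION ALL returns rows matching both a and b twice.
union_all_naive = run(
    "SELECT col FROM t WHERE a AND x AND y"
    " UNION ALL SELECT col FROM t WHERE b AND x AND y"
)

# UNION ALL with the first branch excluded from the second one.
union_all_exclusive = run(
    "SELECT col FROM t WHERE a AND x AND y"
    " UNION ALL SELECT col FROM t WHERE b AND x AND y AND NOT (a AND x AND y)"
)

print(original, union_rewrite, union_all_naive, union_all_exclusive)
```

Only the last form returns the same multiset of rows as the original query on this data.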
I'm trying to add features to a preexisting application and I came across a MySQL view something like this:
SELECT
AVG(table_name.col1),
AVG(table_name.col2),
AVG(table_name.col3),
table_name.personID,
table_name.col4
FROM table_name
GROUP BY table_name.personID;
OK so there's a few aggregate functions. You can select personID because you're grouping by it. But it also is selecting a column that is not in an aggregate function and is not a part of the GROUP BY clause. How is this possible??? Does it just pick a random value because the values definitely aren't unique per group?
Where I come from (MSSQL Server), that's an error. Can someone explain this behavior to me and why it's allowed in MySQL?
It's true that this feature permits some ambiguous queries, and silently returns a result set with an arbitrary value picked from that column. In practice, it tends to be the value from the row within the group that is physically stored first.
These queries aren't ambiguous if you only choose columns that are functionally dependent on the column(s) in the GROUP BY criteria. In other words, if there can be only one distinct value of the "ambiguous" column per value that defines the group, there's no problem. This query would be illegal in Microsoft SQL Server (and under SQL-92 rules), even though it cannot logically result in ambiguity:
SELECT AVG(table1.col1), table1.personID, persons.col4
FROM table1 JOIN persons ON (table1.personID = persons.id)
GROUP BY table1.personID;
Also, MySQL has an SQL mode to make it behave per the standard: ONLY_FULL_GROUP_BY
FWIW, SQLite also permits these ambiguous GROUP BY clauses, but it chooses the value from the last row in the group.†
† At least in the version I tested. What it means to be arbitrary is that either MySQL or SQLite could change their implementation in the future, and have some different behavior. You should therefore not rely on the behavior staying the way it is currently in ambiguous cases like this. It's better to rewrite your queries to be deterministic and not ambiguous. That's why MySQL 5.7 now enables ONLY_FULL_GROUP_BY by default.
I should have Googled for just a bit longer... It seems I found my answer.
MySQL extends the use of GROUP BY so that you can use nonaggregated columns or calculations in the SELECT list that do not appear in the GROUP BY clause. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. For example, you do not need to group on customer.name in the following query
In standard SQL, you would have to add customer.name to the GROUP BY clause. In MySQL, the name is redundant.
Still, that just seems... wrong.
Let's say you have a query like this:
SELECT g, v
FROM t
GROUP BY g;
In this case, for each possible value for g, MySQL picks one of the corresponding values of v.
However, which one is chosen depends on the circumstances.
I read somewhere that for each group of g, the first value of v is kept, in the order in which the records were inserted into the table t.
This is quite ugly, because the records in a table should be treated as a set where the order of the elements should not matter. This is so "mysql-ish"...
If you want to determine which value for v to keep, you need to apply a subselect for t like this:
SELECT g, v
FROM (
SELECT *
FROM t
ORDER BY g, v DESC
) q
GROUP BY g;
This way you define which order the records of the subquery are processed by the external query, thus you can trust which value of v it will pick for the individual values of g.
However, if you need some WHERE conditions then be very careful. If you add the WHERE condition to the subquery then it will keep the behaviour, it will always return the value you expect:
SELECT g, v
FROM (
SELECT *
FROM t
WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9'
ORDER BY g, v DESC
) q
GROUP BY g;
This is what you expect, the subselect filters and orders the table. It keeps the records where g has the given value and the external query returns that g and the first value for v.
However, if you add the same WHERE condition to the outer query then you get a non-deterministic result:
SELECT g, v
FROM (
SELECT *
FROM t
-- WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9'
ORDER BY g, v DESC
) q
WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9'
GROUP BY g;
Surprisingly, you may get different values for v when executing the same query again and again which is... strange. The expected behaviour is to get all the records in the appropriate order from the subquery, filtering them in the outer query and then picking the same as it picked in the previous example. But it does not.
It picks a value for v seemingly randomly. The same query returned different values for v if I executed more (~20) times, but the distribution was not uniform.
If instead of adding an outer WHERE, you specify a HAVING condition like this:
SELECT g, v
FROM (
SELECT *
FROM t
-- WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9'
ORDER BY g, v DESC
) q
-- WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9'
GROUP BY g
HAVING g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9';
Then you get a consistent behaviour again.
CONCLUSION
I would suggest not relying on this technique at all. If you really want or need to, avoid WHERE conditions in the outer query: put the condition in the inner query if you can, or use a HAVING clause in the outer query.
I tested it with this data:
CREATE TABLE t (
v INT,
g VARCHAR(36)
);
INSERT INTO t VALUES (1, '737a8783-110c-447e-b4c2-1cbb7c6b72c9');
INSERT INTO t VALUES (2, '737a8783-110c-447e-b4c2-1cbb7c6b72c9');
in MySQL 5.6.41.
Maybe it is just a bug that gets/got fixed in newer versions, please give feedback if you have experience with newer versions.
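If the goal of the ORDER BY ... DESC subquery is simply "the largest v per g", an aggregate expresses that directly and does not depend on row order at all. This is a different technique from the subquery trick above, not a fix for it. A minimal sketch with the same two test rows, run through Python's sqlite3 module as a stand-in for MySQL (an assumption for runnability):

```python
import sqlite3

G = "737a8783-110c-447e-b4c2-1cbb7c6b72c9"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER, g TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, G), (2, G)])

# MAX(v) names the wanted value explicitly, so the result no longer
# depends on insertion order, subquery ordering, or where the filter sits.
rows = conn.execute(
    "SELECT g, MAX(v) AS v FROM t WHERE g = ? GROUP BY g", (G,)
).fetchall()

print(rows)
```

Because every selected column is either grouped or aggregated, the query is also valid with ONLY_FULL_GROUP_BY enabled.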
SELECT *
FROM personel
WHERE p_id IN (
SELECT MIN(dbo.personel.p_id)
FROM personel
GROUP BY dbo.personel.p_adi
);
TL;DR
Is there a way to use aggregated results in arithmetic operations?
Details
I want to take two aggregated columns (SUM(..), COUNT(..)) and combine them arithmetically, e.g.:
-- doesn't work
SELECT
SUM(x) AS x,
COUNT(y) AS y,
(x / y) AS x_per_y -- Problem HERE
FROM
my_tab
GROUP BY groupable_col;
That doesn't work, but I've found this does:
SELECT
SUM(x) AS x,
COUNT(y) AS y,
SUM(x) / COUNT(y) AS x_per_y -- notice the repeated aggregate
FROM
my_tab
GROUP BY groupable_col;
But if I need many columns that operate on aggregates, it quickly becomes very repetitive, and I'm not sure how to tell whether or not MySQL can optimize so that I'm not calculating aggregates multiple times.
I've searched SO, for a while now, as well as asked some pros, and the best alternative I can come up with is nested selects, ~~which my db doesn't support.~~
EDIT: it did support them, I had been doing something wrong, and ruled out nested selects prematurely
Also, MySQL documentation seems to support it, but I can't get something like this to work (example at very bottom of link)
https://dev.mysql.com/doc/refman/5.5/en/group-by-handling.html
One way is to use a subquery:
select x,
y,
x / y as x_per_y
from (
select SUM(x) as x,
COUNT(y) as y
from my_tab
group by groupable_col
) t
Also note that the value of COUNT(y) can be zero (when all y are NULL).
MySQL handles this case automatically and produces NULL when the denominator is zero.
Some DBMSes throw a divide-by-zero error in this case, which is usually avoided by explicitly producing NULL:
select x,
y,
case when y > 0 then x / y end as x_per_y
from (
select SUM(x) as x,
COUNT(y) as y
from my_tab
group by groupable_col
) t
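Here is a small runnable sketch of the guarded version, including a group where every y is NULL. It goes through Python's sqlite3 module rather than MySQL (an assumption for runnability), the rows are invented, and groupable_col is added to the select list so the output is readable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_tab (groupable_col TEXT, x REAL, y INTEGER)")
conn.executemany(
    "INSERT INTO my_tab VALUES (?, ?, ?)",
    [
        ("a", 10.0, 1),
        ("a", 20.0, 2),
        ("b", 5.0, None),  # y is NULL for every row of group "b"
        ("b", 7.0, None),
    ],
)

# The aggregates are computed once in the derived table and reused
# by name in the outer query; CASE guards the zero denominator.
rows = conn.execute(
    """
    SELECT groupable_col, x, y,
           CASE WHEN y > 0 THEN x / y END AS x_per_y
    FROM (
        SELECT groupable_col, SUM(x) AS x, COUNT(y) AS y
        FROM my_tab
        GROUP BY groupable_col
    ) t
    ORDER BY groupable_col
    """
).fetchall()

for row in rows:
    print(row)
```

Group "b" has COUNT(y) = 0, so its x_per_y comes back as NULL (None in Python) instead of raising an error.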
How to calculate conditional probability in vendor agnostic SQL code while reading a precomputed table (histogram) just once?
Let's imagine we have a query which returns a histogram relation. The histogram contains the following attributes: {x, y, cnt}, where cnt is the count of occurrences of nominal attributes x and y. Calculating the histogram is time consuming.
Once we have the histogram, we want to calculate the conditional probability p(y|x). One way to do that is to take p(y|x) = count(y,x) / count(x), as outlined in the following query:
with histogram as (
-- Long and time consuming subquery returning {x, y, cnt}
), x_count as (
select x
, sum(cnt) as cnt
from histogram
group by x
)
select y
, x
, histogram.cnt / x_count.cnt as probability
from histogram
join x_count
using(x)
However, common table expressions (CTEs) are not portable (e.g. MySQL versions before 8.0 do not support them). Is there a way to rewrite the CTE so that:
The same query can be executed without change at MySQL, MSSQL and PostgreSQL?
Relation histogram is calculated just once?
All I can think of is to materialize the histogram into a table, process it, and then delete it.
First, just because you declare something as a CTE does not mean that it is run only once. For instance, SQL Server does not materialize CTEs, so using your logic it would run the histogram once for each reference. It is the same as a view.
In addition, the using clause is not supported by all databases.
So, the one thing that you could do that is vendor agnostic is to use a view. There is a slight hitch, because dropping a view that already exists is vendor-specific. But the following would generally work to express the query:
create view histogram as -- you might want to give this a more unique name
select . . . ; -- long and time-consuming query returning {x, y, cnt}
select h.y, h.x, h.cnt / total.cnt as probability
from histogram h join
(select x, sum(cnt) as cnt
from histogram
group by x
) total
on h.x = total.x;
drop view histogram;
Of course, this runs the histogram query multiple times. So, you could solve this using temporary tables:
create table histogram (
x ??, -- I don't know what the types are
y ??,
cnt ??
);
insert into histogram (x, y, cnt)
select . . . ; -- your complicated query here
select h.y, h.x, h.cnt * 1.0 / total.cnt as probability
from histogram h join
(select x, sum(cnt) as cnt
from histogram
group by x
) total
on h.x = total.x;
drop table histogram;
Unfortunately, dropping an existing table is database specific. This does meet your requirements, though.
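As a worked example of the materialize-then-join approach, here is a minimal sketch using Python's sqlite3 module as a stand-in database (an assumption for runnability). The histogram rows and their labels are invented; in practice they would come from the expensive query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Materialized histogram; in practice filled by the expensive query.
conn.execute("CREATE TABLE histogram (x TEXT, y TEXT, cnt INTEGER)")
conn.executemany(
    "INSERT INTO histogram (x, y, cnt) VALUES (?, ?, ?)",
    [
        ("sunny", "walk", 3),
        ("sunny", "bus", 1),
        ("rainy", "walk", 1),
        ("rainy", "bus", 4),
    ],
)

# p(y|x) = count(y, x) / count(x); the * 1.0 avoids integer division.
rows = conn.execute(
    """
    SELECT h.y, h.x, h.cnt * 1.0 / total.cnt AS probability
    FROM histogram h
    JOIN (SELECT x, SUM(cnt) AS cnt
          FROM histogram
          GROUP BY x) total
      ON h.x = total.x
    ORDER BY h.x, h.y
    """
).fetchall()

for y, x, probability in rows:
    print(f"p({y}|{x}) = {probability}")

conn.execute("DROP TABLE histogram")
```

The probabilities for each x add up to 1, which is a quick sanity check for this kind of query.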
My advice would be to drop MySQL from the requirement -- it is rather degraded from the perspective of ANSI functionality. Then simply do:
select h.*, cnt * 1.0 / sum(cnt) over (partition by x) as probability
from histogram h;
(The * 1.0 is because some databases do integer division and cnt sounds like it might be an integer.)
This would be the simplest way to represent the query without re-calculating histogram. And, it will work in a lot of databases -- SQL Server, Postgres, Oracle, Teradata, DB2, BigQuery, RedShift, Hive. In fact, I think it will work in pretty much all current versions of what is commonly called a "database" except MySQL, SQLite, and MS Access. (MySQL has since added window functions in 8.0, and SQLite in 3.25.)