MySQL - Selecting a Column not in Group By

I'm trying to add features to a preexisting application and I came across a MySQL view something like this:
SELECT
AVG(table_name.col1),
AVG(table_name.col2),
AVG(table_name.col3),
table_name.personID,
table_name.col4
FROM table_name
GROUP BY table_name.personID;
OK so there's a few aggregate functions. You can select personID because you're grouping by it. But it also is selecting a column that is not in an aggregate function and is not a part of the GROUP BY clause. How is this possible??? Does it just pick a random value because the values definitely aren't unique per group?
Where I come from (MSSQL Server), that's an error. Can someone explain this behavior to me and why it's allowed in MySQL?

It's true that this feature permits some ambiguous queries, and silently returns a result set with an arbitrary value picked from that column. In practice, it tends to be the value from the row within the group that is physically stored first.
These queries aren't ambiguous if you only choose columns that are functionally dependent on the column(s) in the GROUP BY criteria. In other words, if there can be only one distinct value of the "ambiguous" column per value that defines the group, there's no problem. This query would be illegal in Microsoft SQL Server (and under SQL-92 rules), even though it cannot logically result in ambiguity as long as persons.id is unique:
SELECT AVG(table1.col1), table1.personID, persons.col4
FROM table1 JOIN persons ON (table1.personID = persons.id)
GROUP BY table1.personID;
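For engines that reject the query above, a minimal sketch of a portable form is to list the functionally dependent column in the GROUP BY as well. It assumes persons.id is unique, so the extra column does not change the groups; the alias avg_col1 is only illustrative.

```sql
-- Assumes persons.id is unique (e.g. the primary key), so each personID
-- maps to exactly one col4 and the extra GROUP BY column leaves the groups unchanged.
SELECT AVG(table1.col1) AS avg_col1,  -- hypothetical alias
       table1.personID,
       persons.col4
FROM table1
JOIN persons ON table1.personID = persons.id
GROUP BY table1.personID, persons.col4;
```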
Also, MySQL has an SQL mode to make it behave per the standard: ONLY_FULL_GROUP_BY
FWIW, SQLite also permits these ambiguous GROUP BY clauses, but it chooses the value from the last row in the group.†
† At least in the version I tested. What it means to be arbitrary is that either MySQL or SQLite could change their implementation in the future, and have some different behavior. You should therefore not rely on the behavior staying the way it currently is in ambiguous cases like this. It's better to rewrite your queries to be deterministic and not ambiguous. That's why MySQL 5.7 now enables ONLY_FULL_GROUP_BY by default.
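As a rough sketch of what that looks like in practice, the statements below turn the mode on for the current session and then show two rewrites of the view from the question that the mode accepts. Taking MIN(col4) is only one possible choice, used here as an illustration; ANY_VALUE() is available from MySQL 5.7 and states explicitly that any value from the group is acceptable.

```sql
-- Append ONLY_FULL_GROUP_BY to the current session's modes
-- (CONCAT_WS skips the NULL that NULLIF produces when sql_mode is empty).
SET SESSION sql_mode = CONCAT_WS(',', NULLIF(@@SESSION.sql_mode, ''), 'ONLY_FULL_GROUP_BY');

-- Deterministic: pick a well-defined value of col4 per group.
SELECT AVG(col1), AVG(col2), AVG(col3), personID, MIN(col4) AS col4
FROM table_name
GROUP BY personID;

-- Explicitly arbitrary: still accepted under ONLY_FULL_GROUP_BY (MySQL 5.7+).
SELECT AVG(col1), AVG(col2), AVG(col3), personID, ANY_VALUE(col4) AS col4
FROM table_name
GROUP BY personID;
```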

I should have Googled for just a bit longer... It seems I found my answer.
MySQL extends the use of GROUP BY so that you can use nonaggregated columns or calculations in the SELECT list that do not appear in the GROUP BY clause. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. For example, you do not need to group on customer.name in the following query
In standard SQL, you would have to add customer.name to the GROUP BY clause. In MySQL, the name is redundant.
Still, that just seems... wrong.

Let's say you have a query like this:
SELECT g, v
FROM t
GROUP BY g;
In this case, for each possible value of g, MySQL picks one of the corresponding values of v.
Which one it picks, however, depends on the circumstances.
I read somewhere that for each group of g, the first value of v is kept, in the order in which the records were inserted into the table t.
This is quite ugly, because the records in a table should be treated as a set where the order of the elements should not matter. This is so "mysql-ish"...
If you want to determine which value for v to keep, you need to apply a subselect for t like this:
SELECT g, v
FROM (
SELECT *
FROM t
ORDER BY g, v DESC
) q
GROUP BY g;
This way you define which order the records of the subquery are processed by the external query, thus you can trust which value of v it will pick for the individual values of g.
However, if you need some WHERE conditions then be very careful. If you add the WHERE condition to the subquery then it will keep the behaviour, it will always return the value you expect:
SELECT g, v
FROM (
SELECT *
FROM t
WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9'
ORDER BY g, v DESC
) q
GROUP BY g;
This is what you expect, the subselect filters and orders the table. It keeps the records where g has the given value and the external query returns that g and the first value for v.
However, if you add the same WHERE condition to the outer query then you get a non-deterministic result:
SELECT g, v
FROM (
SELECT *
FROM t
-- WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9'
ORDER BY g, v DESC
) q
WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9'
GROUP BY g;
Surprisingly, you may get different values for v when executing the same query again and again which is... strange. The expected behaviour is to get all the records in the appropriate order from the subquery, filtering them in the outer query and then picking the same as it picked in the previous example. But it does not.
It picks a value for v seemingly at random. When I executed the same query repeatedly (~20 times), it returned different values for v, and the distribution was not uniform.
If instead of adding an outer WHERE, you specify a HAVING condition like this:
SELECT g, v
FROM (
SELECT *
FROM t1
-- WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9'
ORDER BY g, v DESC
) q
-- WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9'
GROUP BY g
HAVING g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9';
Then you get a consistent behaviour again.
CONCLUSION
I would suggest not relying on this technique at all. If you really want or need to, then avoid WHERE conditions in the outer query. Put the condition in the inner query if you can, or use a HAVING clause in the outer query.
I tested it with this data:
CREATE TABLE t1 (
v INT,
g VARCHAR(36)
);
INSERT INTO t1 VALUES (1, '737a8783-110c-447e-b4c2-1cbb7c6b72c9');
INSERT INTO t1 VALUES (2, '737a8783-110c-447e-b4c2-1cbb7c6b72c9');
in MySQL 5.6.41.
Maybe it is just a bug that gets/got fixed in newer versions, please give feedback if you have experience with newer versions.
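If the goal is simply the largest v per g, a sketch that avoids the ordered-subquery trick entirely is to state the choice with an aggregate. The second query shows one way to fetch the whole row that carries that value. Both use the t1 test table above and should not depend on row order or on where the filter is placed; the names max_v and m are only illustrative.

```sql
-- Largest v per group, chosen explicitly rather than by row order.
SELECT g, MAX(v) AS v
FROM t1
WHERE g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9'
GROUP BY g;

-- Whole row(s) holding the largest v per group: join back to the grouped result.
SELECT t1.*
FROM t1
JOIN (
    SELECT g, MAX(v) AS max_v   -- hypothetical alias
    FROM t1
    GROUP BY g
) m ON m.g = t1.g AND m.max_v = t1.v
WHERE t1.g = '737a8783-110c-447e-b4c2-1cbb7c6b72c9';
```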

SELECT *
FROM personel
WHERE p_id IN (
SELECT MIN(p_id)
FROM personel
GROUP BY p_adi
);

Related

MYSQL - Non deterministic order when using LIMIT and GROUP BY despite using ORDER BY id

Notes about the database
It was generated using Prisma so unfortunately the column names in the many-to-many tables are named "A" and "B". "A" refers to the table which comes first in the alphabet and "B" the second. For example, in _ReadingToWord, "A" refers to Reading.id and "B" refers to Word.id because "r" comes before "w" in the alphabet.
The problem
I have the below query that uses a limit statement to implement paging.
The problem I am having is that the result order is non-deterministic. (If I execute the query a bunch of times, some of the time the order will be different).
I am ordering by id which is a primary key so I thought that should ensure a consistent order.
Can anyone explain why the ordering is non-deterministic and how to fix it?
select * from (
SELECT w.id,
hiragana,
group_concat( distinct(concat(coalesce(r.downStep, -1) + 1 , "," ,r.katakana)) order by r.downStep SEPARATOR ' ')
from Hiragana a join _HiraganaToWord b on a.id = b.A join
Word w on w.id = b.B join _ReadingToWord rtw on w.id = rtw.B join
Reading r on r.id = rtw.A
WHERE hiragana like "あ%"
group by w.id
)
as groupQuery
order by length(hiragana), hiragana, id asc limit 600,5;
Sample runs
You are experiencing one of the subtle side-effects of disabling only_full_group_by:
If ONLY_FULL_GROUP_BY is disabled, a MySQL extension to the standard SQL use of GROUP BY permits the select list, HAVING condition, or ORDER BY list to refer to nonaggregated columns even if the columns are not functionally dependent on GROUP BY columns. This causes MySQL to accept the preceding query. In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic, which is probably not what you want.
If you would enable that mode, you would get an error like
Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'a.hiragana' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
and searching on stackoverflow for that error message will give you lots and lots of examples for this problem.
So in your query
SELECT w.id, a.hiragana,
...
group by w.id
...
order by hiragana
the values for hiragana are not necessarily deterministic. If, for the same w.id, there are several values for a.hiragana, MySQL can pick any of those. And if you order by that non-deterministically chosen value, you can get different orders. MySQL doesn't actually pick a random row, just doesn't care which one it is, so oftentimes, you get the same (which can make this harder to spot), but not always.
It doesn't have to be the entry with id 31752 for which MySQL has picked a different value for hiragana (it can be any of the previous 600 rows), but I would check that value first - if it has a 2nd value that also starts with "あ" but would be ordered after the value for 47348 (or is longer), it might immediately make things clearer.
You can technically fix this by picking a deterministic value there, e.g. the min or max value:
select * from (
SELECT w.id,
min(hiragana) as hiragana,
...
group by w.id
) as groupQuery
order by length(hiragana), hiragana, id asc limit 600,5;
You have to check if that is what you are actually trying to do (e.g., if there are several choices for hiragana, you don't care which one is chosen, as long as it is a deterministic one) and if this fits your required result. Other choices might be group by w.id, a.hiragana or group by w.id, a.id, or maybe you need to completely rewrite your query (as it may not cover this case).
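As a sketch of the group by w.id, a.hiragana option, here is the original query grouped by both columns, so that every selected column is either a grouping column or an aggregate. The alias readings is made up for illustration. Note that a word with several matching hiragana values now yields one row per value, which may or may not be what you want.

```sql
SELECT *
FROM (
    SELECT w.id,
           a.hiragana,
           GROUP_CONCAT(
               DISTINCT CONCAT(COALESCE(r.downStep, -1) + 1, ',', r.katakana)
               ORDER BY r.downStep
               SEPARATOR ' '
           ) AS readings  -- hypothetical alias
    FROM Hiragana a
    JOIN _HiraganaToWord b ON a.id = b.A
    JOIN Word w ON w.id = b.B
    JOIN _ReadingToWord rtw ON w.id = rtw.B
    JOIN Reading r ON r.id = rtw.A
    WHERE a.hiragana LIKE 'あ%'
    -- Every non-aggregated column in the SELECT list is now a grouping column.
    GROUP BY w.id, a.hiragana
) AS groupQuery
ORDER BY LENGTH(hiragana), hiragana, id
LIMIT 600, 5;
```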

Correct format for Select in SQL Server

I have what should be a simple query for any database and which always runs in MySQL but not in SQL Server
select
tagalerts.id,
ts,
assetid,
node.zonename,
battlevel
from tagalerts, node
where
ack=0 and
tagalerts.nodeid=node.id
group by assetid
order by ts desc
The error is:
column tagalerts.id is invalid in the select list because it is not contained in either an aggregate function or the group by clause.
It is not a simple case of adding tagalerts.id to the group by clause because the error repeats for ts and for assetid etc, implying that all the selects need to be in a group or in aggregate functions... either of which will result in a meaningless and inaccurate result.
Splitting the select into a subquery to sort and group correctly (which again works fine with MySQL, as you would expect) makes matters worse
SELECT * from
(select
tagalerts.id,
ts,
assetid,
node.zonename,
battlevel
from tagalerts, node
where
ack=0 and
tagalerts.nodeid=node.id
order by ts desc
)T1
group by assetid
Now the error is: the ORDER BY clause is invalid in views, inline functions, derived tables and expressions unless TOP etc. is used.
The 'correct output' should be:
id    ts                assetid  zonename   battlevel
1234  a datetime        1569     Reception  0
3182  another datetime  1572     Reception  0
Either I am reading SQL Server's rules entirely wrong or this is a major flaw with that database.
How can I write this to work on both systems?
In most databases you can't just include columns that aren't in the GROUP BY without using an aggregate function.
MySql is an exception to that. But MS SQL Server isn't.
So you could keep that GROUP BY with only the "assetid".
But then use the appropriate aggregate functions for all the other columns.
Also, use the JOIN syntax for heaven's pudding sake.
A SQL like select * from table1, table2 where table1.id2 = table2.id is using a syntax from the previous century.
SELECT
MAX(ta.id) AS id,
MAX(ta.ts) AS ts,
ta.assetid,
MAX(node.zonename) AS zonename,
MAX(ta.battlevel) AS battlevel
FROM tagalerts AS ta
JOIN node ON node.id = ta.nodeid
WHERE ta.ack = 0
GROUP BY ta.assetid
-- Order by the aggregate: the bare column ta.ts is not valid here in SQL Server
ORDER BY MAX(ta.ts) DESC;
Another trick to use in MS SQL Server is the window function ROW_NUMBER.
But this is probably not what you need.
Example:
SELECT id, ts, assetid, zonename, battlevel
FROM
(
SELECT
ta.id,
ta.ts,
ta.assetid,
node.zonename,
ta.battlevel,
ROW_NUMBER() OVER (PARTITION BY ta.assetid ORDER BY ta.ts DESC) AS rn
FROM tagalerts AS ta
JOIN node ON node.id = ta.nodeid
WHERE ta.ack = 0
) q
WHERE rn = 1
ORDER BY ts DESC;
I strongly suspect this query is WRONG even in MySql.
We're missing a lot of details (sample data, and we don't know which table all of the columns belong to), but what I do know is you're grouping by assetid, where it looks like one assetid value could have more than one ts (timestamp) value in the group. It also looks like you're counting on the order by ts desc to ensure both that you see recent timestamps in the results first and that each assetid group uses the most recent possible ts timestamp for that group.
MySql only guarantees the former, not the latter. Nothing in this query guarantees that each assetid is using the most recent timestamp available. You could be seeing the wrong timestamps, and then also using those wrong timestamps for the order by. This is the problem the Sql Server rule is there to stop. MySql violates the SQL standard to allow you to write that wrong query.
Instead, you need to look at each column and either add it to the group by (best when all of the values are known to be the same, anyway) or wrap it in an aggregate function like MAX(), MIN(), AVG(), etc, so there is a deterministic result for which value from the group is used.
If all of the values for a column in a group are the same, then there's no problem adding it to the group by. If the values are different, you want to be precise about which one is chosen for the result set.
While I'm here, the tagalerts, node join syntax has been obsolete for more than 20 years now. It's also good practice to use an alias with every table and prefix every column with the alias. I mention these to explain why I changed it for my code sample below, though I only prefix columns where I am confident in which table the column belongs to.
This query should run on both databases:
SELECT ta.assetid, MAX(ta.id) "id", MAX(ta.ts) "ts",
MAX(n.zonename) "zonename", MAX(battlevel) "battlevel"
FROM tagalerts ta
INNER JOIN node n ON ta.nodeid = n.id
WHERE ack = 0
GROUP BY ta.assetid
ORDER BY ts DESC
There is also a concern here that the results may combine values from different records in the joined node table. So if battlevel is part of the node table, you might see a result that matches a zonename with a battlevel that never occurs in any record in the data. In Sql Server, this is easily fixed by using APPLY to match only one node record to each tagalert. Older MySql versions don't support this (APPLY or an equivalent has been in every other major database since at least 2012; MySql added LATERAL derived tables in 8.0.14), but you can simulate it in this case with two JOINs, where the first join is a subquery that uses GROUP BY to determine values that will uniquely identify the needed node record, and the second join is to the node table to actually produce that record. Unfortunately, we need to know more about the tables in question to actually write this code for you.
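Under some guessed assumptions about the schema, the two-JOIN idea could look roughly like the sketch below. It assumes that ts, battlevel and ack belong to tagalerts, that zonename belongs to node, and that (assetid, ts) is unique among unacknowledged alerts; if two alerts for one asset share the latest timestamp, both rows come back. The names latest and latest_ts are made up for illustration.

```sql
SELECT ta.id, ta.ts, ta.assetid, n.zonename, ta.battlevel
FROM (
    -- Step 1: find the most recent unacknowledged timestamp per asset.
    SELECT assetid, MAX(ts) AS latest_ts   -- hypothetical alias
    FROM tagalerts
    WHERE ack = 0
    GROUP BY assetid
) latest
-- Step 2: join back to fetch the complete alert row for that timestamp.
INNER JOIN tagalerts ta
    ON ta.assetid = latest.assetid
   AND ta.ts = latest.latest_ts
-- Step 3: pick up the zone name that belongs to that same alert.
INNER JOIN node n ON n.id = ta.nodeid
WHERE ta.ack = 0
ORDER BY ta.ts DESC;
```

Because every output column comes from one matched alert row and its node, the result cannot mix values from different records, and the statement uses only syntax that both MySql and Sql Server accept.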

Order by showing bad results in MariaDB

In my query, I need to get the previous row along with the current row and then join a few tables. I get the previous row by using SQL variables. On my development server (MySQL 5.7) everything works fine, but on my production server (MariaDB 10) the previous-row results are mixed up. Only the previous-row part built with SQL variables is wrong; the other parts of the query work fine. At first I thought the problem was in the SQL variables part, but now I realize it is caused by the ORDER BY clause.
My query:
SELECT
customers.title,
calendar.start_time,
calendar.hours_per_time,
calendar.self_certification,
calendar.bulletin_certification,
calendar.extra,
DATE_FORMAT(calendar.date, '%d-%m') AS day_month,
TIME_FORMAT(calendar.start_time, '%H:%i') AS hours_min,
@previous_start AS previous_start,
@previous_start := calendar.start_time,
@previous_end AS previous_end,
@previous_end := calendar.hours_per_time
FROM
(SELECT @previous_start := '00:00', @previous_end := '0.00') AS calendar_prev, calendar
INNER JOIN relationships ON calendar.relation_id = relationships.relation_id
INNER JOIN customers ON customers.customer_id = relationships.customer_id
WHERE relationships.user_id = '$user_id'
AND DATE_FORMAT(calendar.date, '%m-%Y') = '$date'
ORDER BY calendar.date, hours_min ASC
If I remove hours_min from the ORDER BY clause, everything works fine on both servers, but then I lose my ordering.
This is my result from the development server; the green part comes from the SQL variables, and here it works fine:
And here is the result from the production server, with the bad results in the red part:
So how can I keep my order and still have good results? Order is only needed by date(Dato) and hours_min(Fra kl.) columns.
Wrap the query as an inline view, and specify a different ORDER BY on the outer query.
As a simple demonstration of the pattern:
SELECT v.fee
, v.fo
, v.fi
FROM (
SELECT t.fee
, t.fi
, t.fo
FROM t
ORDER BY t.fi ASC, t.fo ASC
) v
ORDER BY v.fee DESC
We can process rows "in order" in the query inside the inline view, using an ORDER BY clause on the inner SELECT statement.
The ORDER BY on the outer query can reorder the results returned by the inner query.
NOTES:
The MySQL Reference manual cautions that the behavior of user-defined variables that are set and read within the same statement is not guaranteed. With that said, we do observe a consistent behavior.
It's an order of operations issue. That is, we are carefully constructing our SQL in such a way that the MySQL execution plan gets us a predictable order of operations.
What we have discovered is that the ORDER BY is being processed before the expressions in the SELECT list are evaluated.
So, if we need to process rows "in order" such that the user defined variables contain values from the "previous" row when the expressions in the SELECT list are evaluated, then we need to have an ORDER BY that gets the rows in the desired order.
If we want the resulting rows in a different order, we need another ORDER BY operation to be processed later. And we can get that using an inline view (what MySQL refers to as a "derived table"). That's because MySQL materializes that derived table "v" before the outer query is processed.
The SELECT list of the outer query can specify a different order of columns or omit columns. The order of expressions in the SELECT list of the inner query can be dictated by the order of operations required when working with the user-defined variables: the assignments that save the current row to the user-defined variables have to happen AFTER the user-defined variables are evaluated.
Also, I would recommend ditching the comma syntax for the join operation, and replace that with the JOIN keyword. The CROSS keyword is optional, but it does serve as an indication to the future reader that the omission of the ON clause is intentional, and not an oversight.
The INNER keyword is also optional; it has no effect, and my preference is to omit that.
FROM (SELECT @previous_start := '00:00', @previous_end := '0.00') calendar_prev
CROSS
JOIN calendar
JOIN relationships
ON calendar.relation_id = relationships.relation_id
Combining ORDER BY with variables in MariaDB is a bit tricky. Can you make a copy of the table already ordered by calendar.date, hours_min ASC and run the query on that copy without the ORDER BY?
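On servers that support window functions (MariaDB 10.2 or later, MySQL 8.0 or later; the exact production version is not stated in the question), a different technique from the user-variable approach is LAG(), which reads the previous row without depending on evaluation order. The sketch below rests on that version assumption and also assumes that ordering by calendar.start_time gives the same order as hours_min. COALESCE supplies the same defaults for the first row that the variable initialization used.

```sql
SELECT
    customers.title,
    calendar.start_time,
    calendar.hours_per_time,
    calendar.self_certification,
    calendar.bulletin_certification,
    calendar.extra,
    DATE_FORMAT(calendar.date, '%d-%m') AS day_month,
    TIME_FORMAT(calendar.start_time, '%H:%i') AS hours_min,
    -- Value from the previous row in (date, start_time) order; default for the first row.
    COALESCE(LAG(calendar.start_time)
             OVER (ORDER BY calendar.date, calendar.start_time), '00:00') AS previous_start,
    COALESCE(LAG(calendar.hours_per_time)
             OVER (ORDER BY calendar.date, calendar.start_time), '0.00') AS previous_end
FROM calendar
JOIN relationships ON calendar.relation_id = relationships.relation_id
JOIN customers ON customers.customer_id = relationships.customer_id
WHERE relationships.user_id = '$user_id'
  AND DATE_FORMAT(calendar.date, '%m-%Y') = '$date'
ORDER BY calendar.date, calendar.start_time;
```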

Save order of SELECT result in complex query

I need to sort selected_booking by cost first and then assign the index i to every row. My variant doesn't work properly (outer SELECT breaks the order):
SELECT (@i:=@i + 1) AS i, selected_booking.*
FROM (SELECT * FROM booking ORDER BY cost DESC) AS selected_booking;
Is there any way to save the order of inner selection when doing outer one?
Q: Is there any way to save the order of inner selection when doing outer selection?
A: Absent an ORDER BY clause on the outer query, MySQL is free to return the rows in any order it chooses.
If you want rows from the inline view (derived table) returned in a specific order, you need to specify that in the outer query... you'd need to add an ORDER BY clause on the outer query.
NOTE: The behavior of user-defined variables as in your query is not guaranteed, the MySQL Reference Manual warns of this. But in spite of that warning, we do observe repeatable behavior in MySQL 5.1 and 5.5.
It's not at all clear why you need an inline view (aka a derived table, in the MySQL vernacular) in the example you give.
It seems like this query would return the result you seem to want:
SET @i = 0 ;
SELECT @i:=@i+1 AS i
, b.*
FROM booking b
ORDER BY b.cost DESC ;
Alternatively, you could do this in a single statement, and initialize #i within the query, rather than a separate SET statement.
SELECT @i:=@i+1 AS i
, b.*
FROM booking b
JOIN (SELECT @i:=0) i
ORDER BY b.cost DESC
(This initialization works, again, because of the way MySQL processes inline views: the inline view query is run BEFORE the outer query. This isn't guaranteed behavior, and may change in a future release; it may have already changed in 5.6.)
NOTE: For improved performance of this query, if a suitable index is available with cost as the leading column, e.g.
... ON booking (cost)
that may allow MySQL to use that index to return rows in order and avoid a "Using filesort" operation.
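On MySQL 8.0 or later (an assumption; the versions discussed above predate it), the window function ROW_NUMBER() is an alternative technique that numbers the rows without user-defined variables. A minimal sketch:

```sql
-- Number the bookings from most to least expensive.
-- Rows with equal cost get consecutive numbers in an unspecified order.
SELECT ROW_NUMBER() OVER (ORDER BY b.cost DESC) AS i,
       b.*
FROM booking b
ORDER BY i;
```

Ordering the outer query by i rather than by cost keeps the displayed order consistent with the assigned numbers even when several bookings share the same cost.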