Strange behavior of MySQL aggregate function - mysql

In MySQL, the following works:
Case 1:
Select X , MAX(Y) from table
but in MS SQL Server you will get the error
"it is not contained in either an aggregate function or the GROUP BY clause."
The proper way would be
Select X, MAX(Y) from table group by X
Worse still, in MySQL you can write:
Case 2:
Select X, Y, MAX(Z) from table group by X
My question is: how does MySQL determine Y in the case above?
What about the X value in Case 1?
Why does MySQL allow such strange behavior?

MySQL's documentation on GROUP BY handling explains in great detail how and why MySQL behaves as it does under certain configuration settings when you use GROUP BY.
As @Mihai already pointed out in his comment, MySQL has the ONLY_FULL_GROUP_BY SQL mode, which governs whether to
permit queries for which the select list, HAVING condition, or ORDER
BY list refer to nonaggregated columns that are neither named in the
GROUP BY clause nor are functionally dependent on (uniquely determined
by) GROUP BY columns.
The reason for allowing such a relaxation of the syntax is that in many cases tables or views contain fields that are functionally dependent on other fields. In simple words: one field's value determines the other field's value. With the relaxed syntax you only have to include in the GROUP BY clause the field(s) that determine the value of the other fields.
If you use the relaxed syntax, but the functional dependency does not exist, then
the server is free to choose any value from each group, so unless they
are the same, the values chosen are indeterminate, which is probably
not what you want.
In practice, MySQL picks the first value it encounters for such fields when scanning the data, so it is not completely random. However, relying on this feature is a bit suicidal, since MySQL can change this behaviour at any time without notice.
As I noted already in a comment, MySQL is not unique in this approach. Sybase also allows this relaxed syntax:
Transact-SQL extensions to group by and having
Transact-SQL extensions to standard SQL make displaying data more
flexible, by allowing references to columns and expressions that are
not used for creating groups or summary calculations:
A select list that includes aggregates can include extended columns that are not arguments of aggregate functions and are not
included in the group by clause. An extended column affects the
display of final results, since additional rows are displayed.
Its behaviour is different from MySQL's, though.
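If the intent behind Case 2 is "the Y that belongs to the row with the largest Z for each X", one way to say so explicitly is to join the table back to its per-group maximum. This is only a sketch: the table name t and its columns X, Y, Z are placeholders standing in for the question's unnamed table.

```sql
-- Hypothetical table t(X, Y, Z), used for illustration only.
-- For each X, return the Y taken from the row that holds the largest Z.
SELECT t.X, t.Y, t.Z
FROM t
JOIN (
    SELECT X, MAX(Z) AS max_z   -- one row per X with its maximum Z
    FROM t
    GROUP BY X
) AS m
  ON m.X = t.X
 AND m.max_z = t.Z;
```

If several rows share the maximum Z within one X, all of them are returned, so a further tie-breaking rule may be needed.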

Related

Order of variable assignment evaluation in SELECT can differ from order of rows returned. Under what conditions can this occur?

I was recently trying to use a user-defined variable to capture some information from the last row returned in my result set.
What I mean is, for example if I have a list of names from 'Aaron' to 'Zzarx',
SELECT @n:=Name FROM people ORDER BY Name;
SELECT @n;
The second SELECT should return 'Zzarx'.
That's the simple case. It works as expected; variable assignment reliably occurs in the same order as rows are sent to the client, so the last assignment corresponds to the last returned row.
But strange things seem to happen when the query is more complicated:
SELECT DISTINCT IFNULL(@n:=Name,'unknown') FROM people ORDER BY <some non-indexed expression> LIMIT 10;
SELECT @n;
Executing something like this on MariaDB v10.3.16, I get a final value of @n (from the second SELECT) that doesn't correspond to any of the rows returned by the first SELECT! (Note that Name is a NOT NULL column, so the IFNULL() is actually redundant, but it is still necessary to trigger this behaviour.)
Note that it only seems to happen when ALL of the following hold:
SELECT DISTINCT
ORDER BY can't use an index
The variable assignment happens inside some expression
My theory is that:
SELECT DISTINCT forces early evaluation of the returned column expressions.
ORDER BY (non-indexed expression) forces an explicit sort operation after column data has been evaluated.
The SQL engine is smart enough to recognize the simple SELECT @var := (expression) pattern and evaluate @var only as the row is sent to the client, but can't make that optimization if the @var:=... assignment is embedded inside a larger expression, as in the IFNULL() in my example.
However, this is all only guesswork.
The manual page on user-defined variables doesn't really say anything useful in this regard (neither MySQL's nor MariaDB's).
It seems to me that using a @variable to capture something from the last-returned row in a multi-row query is a useful and probably quite commonplace trick, but now I'm not sure whether or when I can rely on it. Similarly for lots of row-numbering and other clever schemes I've seen that utilize @variables in the result set part of a SELECT.
Does someone here on SO have any definitive information on how this is supposed to work, and specifically, under what conditions will the order of evaluations of row variable-assignment expressions be guaranteed to correspond to the actual order of rows returned?
...Because this seems to be quite an important thing to know!
Another, slightly less pathological example:
Say table t has 1000 rows:
SET @n:=0;
SELECT @n:=@n+1 FROM t ORDER BY 1 DESC LIMIT 5;
SELECT @n;
Returned result sets are:
1000
999
998
997
996
and
1000
Note that once again, the final value of @n does NOT correspond to the last row returned, and indeed given the semantics of the query, in this case it can't.
Although you are not using 8.0.13, the following will be coming soon. You have found a reason why it is coming.
----- 2018-10-22 8.0.13 General Availability -- -- Important Change -----
Setting user variables in statements other than SET is now deprecated due to issues that included those listed here:
The order of evaluation for expressions involving user variables was
undefined.
The default result type of a variable is based on its type at the
beginning of the statement, which could have unintended effects when a
variable holding a value of one type at the beginning of a statement
was assigned a new value of a different type in the same statement.
HAVING, GROUP BY, and ORDER BY clauses, when referring to a variable
that was assigned a value in the select expression list, did not work
as expected because the expression was evaluated on the client and so
it was possible for stale column values from a previous row to be
used.
Syntax such as SELECT @var, @var:=@var+1 is still accepted in MySQL
8.0 for backward compatibility, but is subject to removal in a future release.
-- From the "change log".
Think of DISTINCT as similar to GROUP BY.
SELECT @v := ... FROM t ORDER BY x;
Case 1: INDEX(x) but the Optimizer may choose to fetch the rows, then sort them.
Case 2: INDEX(x) and the Optimizer chooses to fetch the rows based on the index.
SELECT @v := ... FROM t GROUP BY w ORDER BY x;
This almost certainly requires generating a temp table (for ordering), maybe two (one for grouping and one for ordering). The only rational way to run the query is to evaluate the expressions (including @v) in the SELECT, gather the results, then proceed to grouping and ordering. So, the evaluation order is not likely to be that of x. But it might mimic w.
What about PARTITIONing? Currently, there is no parallelism in MySQL's evaluation of a SELECT. But, what if that came into existence? Let's take an 'obvious' case -- separate threads working on separate PARTITIONs of the table. All bets are off in the order of evaluation.
Once that is implemented, then how about splitting up even a non-partitioned SELECT to get some parallelism?
You are not going to win the argument.
Yes, it may stay "deprecated" for a long time. Or maybe there will be a sql_mode that runs queries the "old" way. Or the existence of @variables inhibits certain optimizations (in favor of predictability). Etc.
May I suggest that you write a "feature request" at bugs.mysql.com , stating what you would like to see. (You could also do it at mariadb.com, but they look at the former.)
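For the two use cases in the question (numbering rows and capturing a value from the last row), there are forms that do not depend on the evaluation order of := assignments. The following is a sketch, assuming MySQL 8.0+ or MariaDB 10.2+ for the window function and reusing the question's people table and Name column.

```sql
-- Row numbering without user variables: the ordering is stated
-- inside the window definition rather than implied by scan order.
SELECT Name,
       ROW_NUMBER() OVER (ORDER BY Name) AS rn
FROM people
ORDER BY Name;

-- Capture the value of the "last" row explicitly: sort in reverse
-- and keep exactly one row, then store it with SELECT ... INTO.
SELECT Name INTO @n
FROM people
ORDER BY Name DESC
LIMIT 1;

SELECT @n;
```

The second statement asks for the last name directly instead of hoping the final assignment coincides with the final row sent to the client.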

What does MySQL do if no aggregating function specified?

Recently I found that, even though patientID values are duplicated in my Samples table, the following query works
SELECT * FROM Samples GROUP BY patientID
and returns multiple values for multiple columns.
What aggregation function does it use by default?
First, this is badly formed SQL and you should simply not use it.
But what does it do? It returns a result set with one row per PatientId. The additional columns specified by the SELECT * come from indeterminate rows in the data. There is no guarantee that the extra columns even come from the same row.
In practice, the values seem to come from the first row encountered. However, MySQL is quite explicit that you cannot depend on this behavior. In general, you should avoid using aggregation statements that have unaggregated columns in the SELECT that are not in the GROUP BY. Other databases do not support this syntax (unless the GROUP BY keys form a unique/primary key on the data being aggregated).
MySQL doesn't appear to use an aggregation function at all. The records chosen in this case are indeterminate, as the documentation states:
In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate, which is probably not what you want.
But you might be wondering why this feature even exists in the first place. If you are writing a query where you know that all the values in a column are the same, then this feature can possibly save you some work by not having to write a join or subquery to make the GROUP BY strictly compliant.
You have to use an aggregate function such as SUM, AVG, COUNT
dev.mysql GROUP BY
None. If the ONLY_FULL_GROUP_BY SQL mode is not enabled, then, as the documentation puts it:
MySQL extension to the standard SQL use of GROUP BY permits the select list, HAVING condition, or ORDER BY list to refer to nonaggregated columns even if the columns are not functionally dependent on GROUP BY columns. This causes MySQL to accept the preceding query. In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate, which is probably not what you want.
This SQL mode is enabled by default only from v5.7.5 onward.
Since you've not specified the version of the MySQL server, there are two possible answers.
Prior to MySQL 5.7.5, the above query is valid, but with the following caveat for all the columns that are neither listed in GROUP BY nor aggregated:
The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.
(https://dev.mysql.com/doc/refman/5.6/en/group-by-handling.html)
Since MySQL 5.7.5, this behaviour was changed and MySQL implements the SQL99 standard:
SQL99 and later permits such nonaggregates per optional feature T301 if they are functionally dependent on GROUP BY columns
(https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html)
So some columns could be valid, however the query itself is not valid, since not all columns are functionally dependent on the patientID column (there could be both blood and skin sample).
In general, it is bad practice to use SELECT * in an aggregating query without defining what to do with all the resulting columns.
TL;DR;
MySQL prior to 5.7.5 will execute the query and the result is unpredictable; MySQL 5.7.5 and later will throw an error under the default sql_mode.
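To see which case applies on a given server, and to state explicitly what should happen to each non-grouped column, something like the following can be used. This is a sketch: sampleType and collectedAt are hypothetical column names (the question does not list the columns of Samples), and ANY_VALUE() is available in MySQL 5.7.5 and later.

```sql
-- Check whether ONLY_FULL_GROUP_BY is part of the current session's mode.
SELECT @@SESSION.sql_mode;

-- One row per patient, with an explicit rule for every other column.
-- sampleType and collectedAt are hypothetical columns for illustration.
SELECT patientID,
       COUNT(*)              AS sample_count,      -- how many samples
       MIN(collectedAt)      AS first_collected,   -- a deterministic choice
       ANY_VALUE(sampleType) AS some_sample_type   -- explicitly "any one will do"
FROM Samples
GROUP BY patientID;
```

ANY_VALUE() does not make the result deterministic; it only documents in the query that an arbitrary value is acceptable, which keeps the statement valid when ONLY_FULL_GROUP_BY is enabled.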

Rewrite a group-by over a randomly-ordered sub-query using only one select

Here's the thing. I have 3 tables, and I'm running this query:
select t.nomefile, t.tipo_nome, t.ordine
from
(select nomefile, tipo_nome, categorie.ordine
from t_gallerie_immagini_new as immagini
join t_gallerie_new as collezioni on collezioni.id=immagini.id_ref
join t_gallerie_tipi as categorie on collezioni.type=categorie.id
order by RAND()
) as t
group by t.tipo_nome
order by t.ordine
The query runs against 3 tables, all in 1-N relationships, which need to be joined; I then take 1 random row for each distinct value in the higher-level table. This query works just fine. The problem is that I'm being asked to rewrite it USING ONLY ONE SELECT. I've come up with another way of doing this with only one select, but according to SQL syntax the GROUP BY must come before the ORDER BY, so it's pointless to order by random when you already have only the first record for each value in the higher-level table.
Does anyone have a clue how to write this query using only one select?
Generally, if I am not much mistaken, an ORDER BY clause in the subquery of a query like this has to do with a technique that allows you to pull non-GROUP BY columns (in the outer query) according the order specified. And so you may be out of luck here, because that means the subquery is important to this query.
Well, because in this specific case the order chosen is BY RAND() and not by a specific column/set of columns, you may have a very rough equivalent by doing both the joins and the grouping on the same level, like this:
select nomefile, tipo_nome, categorie.ordine
from t_gallerie_immagini_new as immagini
join t_gallerie_new as collezioni on collezioni.id=immagini.id_ref
join t_gallerie_tipi as categorie on collezioni.type=categorie.id
group by tipo_nome
order by categorie.ordine
You must understand, though, why this is not an exact equivalent. The thing is, MySQL does allow you to pull non-GROUP BY columns in a GROUP BY query, but if they are not correlated to the GROUP BY columns, then the values returned would be... no, not random, the term used by the manual is indeterminate. On the other hand, the technique mentioned in the first paragraph takes advantage of the fact that if the row set is ordered explicitly and unambiguously prior to grouping, then the non-GROUP BY column values will always be the same*. So indeterminateness has to do with the fact that "normally" rows are not ordered explicitly before grouping.
Now you can probably see the difference. The original version orders the rows explicitly. Even if it's BY RAND(), it is intentionally so, to ensure (as much as possible) different results in the output most of the times. But the modified version is "robbed" of the explicit ordering, and so you are likely to get identical results for many executions in a row, even if they are kind of "random".
So, in general, I consider your problem unsolvable for the above stated reasons, and if you choose to use something like the suggested modified version, then just be aware that it is likely to behave slightly differently from the original.
* The technique may not be well documented, by the way, and may have been found rather empirically than by following manuals.
I was not able to understand the reasons behind the request to rewrite this query; however, I found out that there is a solution which uses the word "select" only once. Here's the query:
SELECT g.type, SUBSTRING_INDEX(GROUP_CONCAT(
i.nomefile ORDER BY
RAND()),',',1) nomefile
FROM t_gallerie_new g JOIN t_gallerie_immagini_new i ON g.id=i.id_ref
GROUP BY g.type;
I'm posting it for anyone interested in this question.
NOTE: Using GROUP_CONCAT has a couple of downsides. It is not recommended for medium or large tables, since it can increase the load on the server. Also, the string returned by GROUP_CONCAT is limited in length (1024 by default), so a MySQL server parameter has to be raised to receive a longer string from this function.
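The parameter mentioned in the note is the group_concat_max_len system variable. A sketch of raising it for the current session and running the same query follows; the value 1000000 is an arbitrary example, and the '|' separator is an assumption that no file name contains that character (the default comma separator would split a file name that itself contains a comma).

```sql
-- Raise the GROUP_CONCAT length limit for this session only.
-- The value is an arbitrary example, not a recommendation.
SET SESSION group_concat_max_len = 1000000;

-- Same idea as the query above, with an explicit separator that is
-- assumed never to occur inside nomefile.
SELECT g.type,
       SUBSTRING_INDEX(
           GROUP_CONCAT(i.nomefile ORDER BY RAND() SEPARATOR '|'),
           '|', 1) AS nomefile
FROM t_gallerie_new AS g
JOIN t_gallerie_immagini_new AS i ON g.id = i.id_ref
GROUP BY g.type;
```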

SQL row return order

I had only used SQL rarely until recently, when I began using it daily. I notice that if no "order by" clause is used:
When selecting part of a table, the rows returned appear to be in the same order as they appear if I select the whole table
The order of rows returned by selecting from a join seems to be determined by the leftmost member of the join.
Is this behaviour a standard thing one can count on in the most common databases (MySQL, Oracle, PostgreSQL, SQLite, SQL Server)? (I don't really even know whether one can truly count on it in SQLite.) How strictly is it honored, if so (e.g. if one uses "group by", would the individual groups each have that ordering)?
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Whilst some RDBMSes will return rows in specific orders in some situations even when an ORDER BY clause is omitted, such behaviour should never be relied upon.
Section 20.2 <direct select statement: multiple rows>, subsection "General Rules" of
the SQL-92 specification:
4) If an <order by clause> is not specified, then the ordering of
the rows of Q is implementation-dependent.
If you want order, include an ORDER BY. If you don't include an ORDER BY, you're telling SQL Server:
I don't care what order you return the rows, just return the rows
Since you don't care, SQL Server is going to decide how to return the rows in what it deems will be the most efficient manner possible right now (or according to the last time the plan for this specific query was cached). Therefore you should not rely on the behavior you observe. It can change from one run of the query to the next, with data changes, statistics changes, index changes, service packs, cumulative updates, upgrades, etc.
For PostgreSQL, if you omit the ORDER BY clause you could run the exact same query 100 times while the database is not being modified, and get one run in the middle in a different order than the others. In fact, each run could be in a different order.
One reason this could happen is that if the plan chosen involves a sequential scan of a table's heap, and there is already a seqscan of that table's heap in progress, your query will start its scan at whatever point the other scan has already reached, to reduce the need for disk access.
As other answers have pointed out, if you want the data in a certain order, specify that order. PostgreSQL will take the requested order into consideration in choosing a plan, and may use an index that provides data in that order, if that works out to be cheaper than getting the rows some other way and then sorting them.
GROUP BY provides no guarantee of order; PostgreSQL might sort the data to do the grouping, or it might use a hash table and return the rows in order of the number generated by the hashing algorithm (i.e., pretty random). And that might change from one run to the next.
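In other words, a grouped query needs its own ORDER BY if the order of the groups matters. A minimal sketch, using a hypothetical orders(id, customer_id, created_at) table that is not part of the question:

```sql
-- Hypothetical table orders(id, customer_id, created_at).
-- GROUP BY only forms the groups; ORDER BY fixes the output order.
-- customer_id is added as a tie-breaker so that groups with equal
-- counts also come back in a defined order.
SELECT customer_id,
       COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
ORDER BY order_count DESC, customer_id;
```

Without the tie-breaker, rows with the same order_count could still be returned in any relative order.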
It never ceased to amaze me when I was a DBA that this feature of SQL was so often thought of as quirky. Consider a simple program that runs against a text file and produces some output. If the program never changes, and the data never changes, you'd expect the output to never change.
As for this:
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Not strictly true - on every RDBMS I've ever worked on (Oracle, Informix, SQL Server, DB2 to name a few) a DISTINCT clause also has the same effect as an ORDER BY as finding unique values involves a sort by definition.
EDIT (6/2/14):
Create a simple table
For DISTINCT and ORDER BY, both the plan and the cost is the same since it is ostensibly the same operation to be performed
And not surprisingly, the effect is thus the same

Is MySQL breaking the standard by allowing selecting columns that are not part of the group by clause?

I am used to Microsoft technologies including SQL Server. Today I ran across a Q&A where the following passage from the MySQL documentation was quoted:
Standard SQL would reject your query because you can not SELECT
non-aggregate fields that are not part of the GROUP BY clause in an
aggregate query. MySQL extends the use of GROUP BY so that the select
list can refer to nonaggregated columns not named in the GROUP BY
clause. This means that the preceding query is legal in MySQL. You
can use this feature to get better performance by avoiding unnecessary
column sorting and grouping. However, this is useful primarily when
all values in each nonaggregated column not named in the GROUP BY are
the same for each group. The server is free to choose any value from
each group, so unless they are the same, the values chosen are
indeterminate.
Is MySQL breaking the standard by allowing this? How? What is the result of allowing this?
Standard SQL would reject your query because you can not SELECT non-aggregate fields that are not part of the GROUP BY clause in an aggregate query
This is correct, up to 1992.
But it is plainly wrong, from 2003 and beyond.
From SQL-2003 standard, 6IWD6-02-Foundation-2011-01.pdf, from http://www.wiscorp.com/, paragraph-7.12 (query specification), page 398:
If T is a grouped table, then let G be the set of grouping columns of T. In each ((value expression)) contained
in ((select list)) , each column reference that references a column of T shall reference some column C that
is functionally dependent on G or shall be contained in an aggregated argument of a ((set function specification))
whose aggregation query is QS
Now MySQL has implemented this feature by allowing not only columns that are functionally dependent on the grouping columns but all columns. This causes problems for users who do not understand how grouping works and get indeterminate results where they don't expect them.
But you are right to say that MySQL has added a feature that conflicts with SQL-standards (although you seem to think that for the wrong reason). It's not entirely accurate as they have added a SQL-standard feature but not in the best way (more like the easy way) but it does conflict with the latest standards.
To answer your question, the reason for this MySQL feature (extension) is, I suppose, to be in accordance with the latest SQL standards (2003+). Why they chose to implement it this way (not fully compliant), we can only speculate.
As @Quassnoi and @Johan answered with examples, it's mainly a performance and maintainability issue. But one can't easily change the RDBMS to be clever enough (Skynet excluded) to recognize functionally dependent columns, so MySQL developers made a choice:
We (MySQL) give you (MySQL users) this feature which is in SQL-2003 standards. It improves speed in certain GROUP BY queries but there's a catch. You have to be careful (and not the SQL engine) so columns in the SELECT and HAVING lists are functionally dependent on the GROUP BY columns. If not, you may get indeterminate results.
If you want to disable it, you can set sql_mode to ONLY_FULL_GROUP_BY.
It's all in the MySQL docs: Extensions to GROUP BY (5.5), although not in the above wording but as in your quote (they even forgot to mention that it's a deviation from standard SQL-2003 while not standard SQL-92). This kind of choice is common, I think, in all software, other RDBMSs included. Such choices are made for performance, backward compatibility and a lot of other reasons. Oracle famously treats '' as NULL, for example, and SQL Server probably has some quirks, too.
There is also this blog post by Peter Bouman, where MySQL developers' choice is defended: Debunking GROUP BY myths.
In 2011, as @Mark Byers informed us in a comment (in a related question at DBA.SE), PostgreSQL 9.1 added a new feature (release date: September 2011) designed for this purpose. It is more restrictive than MySQL's implementation and closer to the standard.
Later, in 2015, MySQL announced that in version 5.7 the behaviour was improved to conform with the standard and actually recognize functional dependencies (even better than the Postgres implementation). The documentation: MySQL Handling of GROUP BY (5.7) and another blog post by Peter Bouman: MySQL 5.7.5: GROUP BY respects functional dependencies!
Is MySQL breaking the standard by allowing this? How?
It lets you write a query like this:
SELECT a.*, COUNT(*)
FROM a
JOIN b
ON b.a = a.id
GROUP BY
a.id
Other systems would require you to add all columns from a to the GROUP BY list, which makes the query larger, less maintainable and less efficient.
In this form (with grouping by the PK), this does not contradict the standard since every column in a is functionally dependent on its primary key.
However, MySQL does not really check the functional dependency and lets you select columns not functionally dependent on the grouping set. This can yield indeterminate results and should not be relied upon. The only thing guaranteed is that the column values belong to some of the records sharing the grouping expression (not even to one record!).
This behavior can be disabled by setting sql_mode to ONLY_FULL_GROUP_BY.
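A sketch of switching the mode on for the current session without discarding the other modes already set, followed by the kind of query it affects. The column b.id is hypothetical (the answer only shows b.a), and a.id is assumed to be the primary key of a.

```sql
-- Append ONLY_FULL_GROUP_BY to whatever modes are already active.
-- NULLIF/CONCAT_WS avoid a leading comma when sql_mode is empty.
SET SESSION sql_mode =
    CONCAT_WS(',', NULLIF(@@SESSION.sql_mode, ''), 'ONLY_FULL_GROUP_BY');

-- Grouping by the primary key of a: in MySQL 5.7.5 and later this is
-- accepted, because every column of a is functionally dependent on a.id.
SELECT a.*, COUNT(*)
FROM a
JOIN b ON b.a = a.id
GROUP BY a.id;

-- b.id (hypothetical column) is not determined by a.id, so this
-- statement is rejected while ONLY_FULL_GROUP_BY is enabled.
SELECT a.id, b.id, COUNT(*)
FROM a
JOIN b ON b.a = a.id
GROUP BY a.id;
```

Versions before 5.7.5 do not detect functional dependencies, so with the mode enabled there the first query would be rejected as well.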
Short answer
It's a speed hack
That is enabled by default, but that can be disabled with this setting: https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html
Long answer
The reason for the non-standard shorthand group by clause is that it's a speed hack.
MySQL lets the programmer determine whether the selected fields are functionally dependent on the group by clause.
The DB does not do any testing, but just selects the first result that it finds as the value of the field.
This results in considerable speed ups.
Consider this code:
SELECT f1, f2, f3, f4 FROM t1 GROUP BY f2
-- invalid in most SQL flavors, valid in MySQL
MySQL will just select the first value it finds, spending a minimum amount of time.
f1,f3, f4 will be from the same row, but this relation will fall apart if multiple tables with joins are involved.
To do something similar in SQL Server you'd have to write
SELECT MIN(f1), f2, MIN(f3), MIN(f4) FROM t1 GROUP BY f2
-- valid SQL, but really a hack
The DB will now have to examine all results to find the minimum value, huffing and puffing.
f1, f3, f4 will most likely have no relation to each other and will not be from the same row.
If however you do:
SELECT id as `primary_key`, count(*) as rowcount, count(f2) as f2count, f2, f3, f4
FROM t1
GROUP BY id
All the rest of the fields will be functionally dependent on id.
Rowcount will always be 1, and f2count will be either 0 (if f2 is null) or 1.
On joins, where lots of tables are involved, in a 1-n configuration like so:
Example:
Website 1 -> n Topics 1 -> n Threads 1 -> n Posts 1 -> 1 Person.
And you do a complicated select involving all tables and just do a GROUP BY posts.id
Obviously all other fields are functionally dependent on posts.id (and ONLY on posts.id).
So it makes no sense to list more fields in the group by clause, or to force you to use aggregate functions.
In order to speed things up, MySQL does not force you to do this.
But you do need to understand the concept of functional dependency, the relations between the tables, and the join you've written, so it puts a lot of burden on the programmer.
However using:
SELECT
p.id, MIN(p.f2)
,MIN(t.id), MIN(t.other)
,MIN(tp.id), ....
,MIN(w.id), .....
,MIN(pe.id), ...
FROM posts p
INNER JOIN threads t ON (p.thread_id = t.id)
INNER JOIN topic tp ON (t.topic_id = tp.id)
INNER JOIN website w ON (w.id = tp.website_id)
INNER JOIN person pe ON (pe.id = p.person_id)
GROUP BY p.id
Puts exactly the same mental burden on the programmer.
All the big DBMSs have their own flavours and extensions; otherwise why would there ever be more than one of them?
Following the SQL Standards stringently is nice and all, but providing extensions with more functionality is even better. The quote from the documentation explains how this functionality is useful.
There isn't much of a conflict in this case, so I don't really see the issue.