What does MySQL do if no aggregate function is specified?

Recently I found that, even though patientID values are duplicated in my Samples table, the following query works
SELECT * FROM Samples GROUP BY patientID
and returns multiple values for multiple columns.
What aggregation function does it use by default?

First, this is badly formed SQL and you should simply not use it.
But what does it do? It returns a result set with one row per patientID. The additional columns specified by the SELECT * come from indeterminate rows in the data. There is no guarantee that the extra columns even come from the same row.
In practice, the values seem to come from the first row encountered. However, MySQL is quite explicit that you cannot depend on this behavior. In general, you should avoid using aggregation statements that have unaggregated columns in the SELECT that are not in the GROUP BY. Other databases do not support this syntax (unless the GROUP BY keys form a unique/primary key on the data being aggregated).
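A minimal sketch of what a fully specified version of the query could look like, where every output column is either a GROUP BY key or an explicit aggregate. Python's built-in sqlite3 module is used here only so the SQL can be run as written; the columns sampleID and sampleType are assumptions, since the question does not show the table's schema.

```python
import sqlite3

# Hypothetical schema: the question only tells us that Samples has a patientID column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Samples (sampleID INTEGER PRIMARY KEY, patientID INTEGER, sampleType TEXT)"
)
conn.executemany(
    "INSERT INTO Samples VALUES (?, ?, ?)",
    [(1, 10, "blood"), (2, 10, "skin"), (3, 20, "blood")],
)

# Every selected column is either a GROUP BY key or wrapped in an aggregate,
# so each value in the result is fully determined by the data.
explicit_rows = conn.execute(
    """
    SELECT patientID,
           COUNT(*)        AS sampleCount,
           MIN(sampleID)   AS firstSampleID,
           MAX(sampleType) AS lastTypeAlphabetically
    FROM Samples
    GROUP BY patientID
    ORDER BY patientID
    """
).fetchall()
print(explicit_rows)
```

Which aggregate to use for each column is a decision the query author has to make; the point is only that nothing is left for the server to pick.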

MySQL doesn't appear to use an aggregation function at all. The records chosen in this case are indeterminate, as the documentation states:
In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate, which is probably not what you want.
But you might be wondering why this feature even exists in the first place. If you are writing a query where you know that all the values in a column will be the same within each group, then this feature can save you some work, because you do not have to write a join or subquery to make the GROUP BY strictly compliant.

You have to use an aggregate function such as SUM, AVG, or COUNT.
dev.mysql GROUP BY

None. If the ONLY_FULL_GROUP_BY SQL mode is not enabled, MySQL accepts the query. From the documentation:
MySQL extension to the standard SQL use of GROUP BY permits the select list, HAVING condition, or ORDER BY list to refer to nonaggregated columns even if the columns are not functionally dependent on GROUP BY columns. This causes MySQL to accept the preceding query. In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate, which is probably not what you want.
This SQL mode is enabled by default only from v5.7.5 onward.

Since you've not specified the version of the MySQL server, there are two possible answers.
Prior to MySQL 5.7.5, the above query is valid, but with the following caveat for all columns that are neither listed in GROUP BY nor aggregated:
The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.
(https://dev.mysql.com/doc/refman/5.6/en/group-by-handling.html)
As of MySQL 5.7.5, this behaviour has changed and MySQL follows the SQL99 standard:
SQL99 and later permits such nonaggregates per optional feature T301 if they are functionally dependent on GROUP BY columns
(https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html)
So some columns could be valid; however, the query itself is not valid, since not all columns are functionally dependent on the patientID column (a patient could have both a blood and a skin sample).
In general, it is bad practice to use SELECT * and to leave undefined what should happen to the resulting columns in an aggregating query.
TL;DR
Before 5.7.5, MySQL executes the query and the result is unpredictable. From 5.7.5 on, with the default ONLY_FULL_GROUP_BY mode, MySQL raises an error.

Related

Strange behavior of MySQL aggregate function

In MySQL, the following works:
Case 1:
Select X , MAX(Y) from table
but in MS SQL Server, you will get
"it is not contained in either an aggregate function or the GROUP BY clause."
the proper way will be
Select X, MAX(Y) from table group by X
Even worse, in MySQL you can write:
Case 2:
Select X, Y, MAX(Z) from table group by X
My question is: how does MySQL determine Y in the case above?
What about the X value in case 1?
Why does MySQL allow such strange behavior?
MySQL's documentation on GROUP BY handling explains in great detail how and why MySQL behaves the way it does under certain configuration settings when you use GROUP BY.
As @Mihai already pointed out in his comment, MySQL has the ONLY_FULL_GROUP_BY SQL mode, which governs whether to
permit queries for which the select list, HAVING condition, or ORDER
BY list refer to nonaggregated columns that are neither named in the
GROUP BY clause nor are functionally dependent on (uniquely determined
by) GROUP BY columns.
The reason for allowing this relaxed syntax is that in many cases tables or views contain fields that are functionally dependent on other fields. In simple words: one field's value determines the other field's value. With the relaxed syntax you only have to include the field(s) that determine the value of the other fields in the GROUP BY clause.
If you use the relaxed syntax, but the functional dependency does not exist, then
the server is free to choose any value from each group, so unless they
are the same, the values chosen are indeterminate, which is probably
not what you want.
In practice, MySQL picks the first value it encounters for such fields when scanning the data, so it is not completely random. However, relying on this feature is a bit suicidal, since MySQL can change this behaviour at any time without notice.
As I noted already in a comment, mysql is not unique with this approach. Sybase also allows this relaxed syntax:
Transact-SQL extensions to group by and having
Transact-SQL extensions to standard SQL make displaying data more
flexible, by allowing references to columns and expressions that are
not used for creating groups or summary calculations:
A select list that includes aggregates can include extended columns that are not arguments of aggregate functions and are not
included in the group by clause. An extended column affects the
display of final results, since additional rows are displayed.
Its behaviour is different from MySQL's, though.
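For Case 2, one way to make Y well defined is to say explicitly which row it should come from. The sketch below assumes the intent is "the Y from the row that holds the largest Z for each X"; that intent, the table name t and the sample data are assumptions, and Python's sqlite3 module is used only to make the SQL runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "t" is a hypothetical table name; X, Y, Z follow the question's Case 2.
conn.execute("CREATE TABLE t (X TEXT, Y TEXT, Z INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("a", "y1", 5), ("a", "y2", 9), ("b", "y3", 4), ("b", "y4", 1)],
)

# Step 1 (derived table m): the largest Z per X.
# Step 2 (join): fetch the row that actually holds that Z, so Y is well defined.
# If several rows tie on the maximum Z, all of them are returned.
rows_with_max_z = conn.execute(
    """
    SELECT t.X, t.Y, t.Z
    FROM t
    JOIN (SELECT X, MAX(Z) AS maxZ FROM t GROUP BY X) AS m
      ON m.X = t.X AND m.maxZ = t.Z
    ORDER BY t.X
    """
).fetchall()
print(rows_with_max_z)
```

This form does not rely on any relaxed GROUP BY handling, so it should behave the same way with ONLY_FULL_GROUP_BY enabled or disabled.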

Rewrite a group-by over a randomly-ordered sub-query using only one select

Here's the thing. I have 3 tables, and I'm running this query:
select t.nomefile, t.tipo_nome, t.ordine
from
(select nomefile, tipo_nome, categorie.ordine
from t_gallerie_immagini_new as immagini
join t_gallerie_new as collezioni on collezioni.id=immagini.id_ref
join t_gallerie_tipi as categorie on collezioni.type=categorie.id
order by RAND()
) as t
group by t.tipo_nome
order by t.ordine
It's applied to 3 tables, all in 1-N relationships, which need to be joined so that one random result is taken for each distinct value in the highest-level table. This query works just fine; the problem is that I'm being asked to rewrite it USING ONLY ONE SELECT. I've come up with another way of doing this with only one select, but according to SQL syntax the GROUP BY must come before the ORDER BY, so it's pointless to order by random when you already have only the first record for each value in the higher-level table.
Someone has a clue on how to write this query using only one select?
Generally, if I am not much mistaken, an ORDER BY clause in the subquery of a query like this has to do with a technique that allows you to pull non-GROUP BY columns (in the outer query) according to the order specified. And so you may be out of luck here, because that means the subquery is important to this query.
Well, because in this specific case the order chosen is BY RAND() and not by a specific column/set of columns, you may have a very rough equivalent by doing both the joins and the grouping on the same level, like this:
select nomefile, tipo_nome, categorie.ordine
from t_gallerie_immagini_new as immagini
join t_gallerie_new as collezioni on collezioni.id=immagini.id_ref
join t_gallerie_tipi as categorie on collezioni.type=categorie.id
group by tipo_nome
order by categorie.ordine
You must understand, though, why this is not an exact equivalent. The thing is, MySQL does allow you to pull non-GROUP BY columns in a GROUP BY query, but if they are not correlated to the GROUP BY columns, then the values returned would be... no, not random, the term used by the manual is indeterminate. On the other hand, the technique mentioned in the first paragraph takes advantage of the fact that if the row set is ordered explicitly and unambiguously prior to grouping, then the non-GROUP BY column values will always be the same*. So indeterminateness has to do with the fact that "normally" rows are not ordered explicitly before grouping.
Now you can probably see the difference. The original version orders the rows explicitly. Even if it's BY RAND(), it is intentionally so, to ensure (as much as possible) different results in the output most of the time. But the modified version is "robbed" of the explicit ordering, and so you are likely to get identical results for many executions in a row, even if they are kind of "random".
So, in general, I consider your problem unsolvable for the above stated reasons, and if you choose to use something like the suggested modified version, then just be aware that it is likely to behave slightly differently from the original.
* The technique may not be well documented, by the way, and may have been found rather empirically than by following manuals.
I was not able to understand the reasons behind the request to rewrite this query. However, I found out that there is a solution which uses the "select" word only once. Here's the query:
SELECT g.type, SUBSTRING_INDEX(GROUP_CONCAT(
i.nomefile ORDER BY
RAND()),',',1) nomefile
FROM t_gallerie_new g JOIN t_gallerie_immagini_new i ON g.id=i.id_ref
GROUP BY g.type;
for anyone interested in this question.
NOTE: The use of GROUP_CONCAT has a couple of downsides. It is not recommended on medium or large tables, since it can increase the server-side load. Also, the string returned by GROUP_CONCAT is limited in length (1024 by default), so the group_concat_max_len setting on the MySQL server has to be raised to receive a longer string.

SQL row return order

I had only used SQL rarely until recently, when I began using it daily. I notice that if no "order by" clause is used:
When selecting part of a table, the rows returned appear to be in the same order as when I select the whole table.
The order of rows returned when selecting from a join seems to be determined by the leftmost member of the join.
Is this behaviour a standard thing one can count on in the most common databases (MySql, Oracle, PostgreSQL, Sqlite, Sql Server)? (I don't really even know whether one can truly count on it in sqlite). How strictly is it honored if so (e.g. if one uses "group by" would the individual groups each have that ordering)?
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Whilst some RDBMSes will return rows in specific orders in some situations even when an ORDER BY clause is omitted, such behaviour should never be relied upon.
Section 20.2 <direct select statement: multiple rows>, subsection "General Rules" of
the SQL-92 specification:
4) If an <order by clause> is not specified, then the ordering of
the rows of Q is implementation-dependent.
If you want order, include an ORDER BY. If you don't include an ORDER BY, you're telling SQL Server:
I don't care what order you return the rows, just return the rows
Since you don't care, SQL Server is going to decide how to return the rows in what it deems will be the most efficient manner possible right now (or according to the last time the plan for this specific query was cached). Therefore you should not rely on the behavior you observe. It can change from one run of the query to the next, with data changes, statistics changes, index changes, service packs, cumulative updates, upgrades, etc.
For PostgreSQL, if you omit the ORDER BY clause you could run the exact same query 100 times while the database is not being modified, and get one run in the middle in a different order than the others. In fact, each run could be in a different order.
One reason this could happen is that if the plan chosen involves a sequential scan of a table's heap, and there is already a seqscan of that table's heap in progress, your query will start its scan at whatever point the other scan is already at, to reduce the need for disk access.
As other answers have pointed out, if you want the data in a certain order, specify that order. PostgreSQL will take the requested order into consideration in choosing a plan, and may use an index that provides data in that order, if that works out to be cheaper than getting the rows some other way and then sorting them.
GROUP BY provides no guarantee of order; PostgreSQL might sort the data to do the grouping, or it might use a hash table and return the rows in order of the number generated by the hashing algorithm (i.e., pretty random). And that might change from one run to the next.
It never ceased to amaze me when I was a DBA that this feature of SQL was so often thought of as quirky. Consider a simple program that runs against a text file and produces some output. If the program never changes, and the data never changes, you'd expect the output to never change.
As for this:
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Not strictly true - on every RDBMS I've ever worked on (Oracle, Informix, SQL Server, DB2 to name a few) a DISTINCT clause also has the same effect as an ORDER BY as finding unique values involves a sort by definition.
EDIT (6/2/14):
Create a simple table
For DISTINCT and ORDER BY, both the plan and the cost are the same, since it is ostensibly the same operation to be performed
And not surprisingly, the effect is thus the same

SQL: What is the default Order By of queries?

What is the default order of a query when no ORDER BY is used?
There is no such order present. Taken from http://forums.mysql.com/read.php?21,239471,239688#msg-239688
Do not depend on order when ORDER BY is missing.
Always specify ORDER BY if you want a particular order -- in some situations the engine can eliminate the ORDER BY because of how it
does some other step.
GROUP BY forces ORDER BY. (This is a violation of the standard. It can be avoided by using ORDER BY NULL.)
SELECT * FROM tbl -- this will do a "table scan". If the table has
never had any DELETEs/REPLACEs/UPDATEs, the records will happen to be
in the insertion order, hence what you observed.
If you had done the same statement with an InnoDB table, they would
have been delivered in PRIMARY KEY order, not INSERT order. Again,
this is an artifact of the underlying implementation, not something to
depend on.
There's none. Depending on what you query and how your query was optimised, you can get any order. There's even no guarantee that two queries which look the same will return results in the same order: if you don't specify it, you cannot rely on it.
I've found SQL Server to be almost random in its default order (depending on age and complexity of the data), which is good as it forces you to specify all ordering.
(I vaguely remember Oracle being similar to SQL Server in this respect.)
MySQL by default seems to order by the record structure on disk (which can include out-of-sequence entries due to deletions and optimisations), but it often initially fools developers into not bothering to use ORDER BY clauses, because the data appears to default to primary-key ordering, which is not the case!
I was surprised to discover today that MySQL 5.6 and 4.1 implicitly sub-order records, which have been sorted on a column with a limited resolution, in opposite directions. Some of my results have identical sort values and the overall order is unpredictable. In my case, the rows were sorted DESC by a datetime column and some of the entries fell in the same second, so they couldn't be explicitly ordered. On MySQL 5.6 they are selected in one order (the order of insertion), but in 4.1 they are selected backwards! This led to a very annoying deployment bug.
I haven't found documentation on this change, but I found notes on implicit group order in MySQL:
By default, MySQL sorts all GROUP BY col1, col2, ... queries as if you specified ORDER BY col1, col2, ... in the query as well.
However:
Relying on implicit GROUP BY sorting in MySQL 5.5 is deprecated. To achieve a specific sort order of grouped results, it is preferable to use an explicit ORDER BY clause.
So in agreement with the other answers - never rely on default or implicit ordering in any database.
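A small sketch of the usual fix for the tie problem described above: add a unique column as a final sort key so the order is total. The events table and its columns are hypothetical, and Python's sqlite3 module is used only so the query can be run as written; the same ORDER BY clause is plain standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: "created" has one-second resolution, so ties are possible.
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, "2014-01-01 10:00:00"),
        (2, "2014-01-01 10:00:01"),
        (3, "2014-01-01 10:00:01"),  # same second as id 2
    ],
)

# Sorting by "created" alone leaves the relative order of ids 2 and 3 unspecified.
# Adding the unique id as a second sort key makes the order total.
ordered_ids = [
    row[0]
    for row in conn.execute("SELECT id FROM events ORDER BY created DESC, id DESC")
]
print(ordered_ids)
```

With the tie-breaker in place, the result no longer depends on insertion order, storage engine or server version.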
The default ordering will depend on indexes used in the query and in what order they are used. It can change as the data/statistics change and the optimizer chooses different plans.
If you want the data in a specific order, use ORDER BY

Is MySQL breaking the standard by allowing selecting columns that are not part of the group by clause?

I am used to Microsoft technologies including SQL Server. Today I ran across a Q&A where the following passage from the MySQL documentation was quoted:
Standard SQL would reject your query because you can not SELECT
non-aggregate fields that are not part of the GROUP BY clause in an
aggregate query. MySQL extends the use of GROUP BY so that the select
list can refer to nonaggregated columns not named in the GROUP BY
clause. This means that the preceding query is legal in MySQL. You
can use this feature to get better performance by avoiding unnecessary
column sorting and grouping. However, this is useful primarily when
all values in each nonaggregated column not named in the GROUP BY are
the same for each group. The server is free to choose any value from
each group, so unless they are the same, the values chosen are
indeterminate.
Is MySQL breaking the standard by allowing this? How? What is the result of allowing this?
Standard SQL would reject your query because you can not SELECT non-aggregate fields that are not part of the GROUP BY clause in an aggregate query
This is correct, up to 1992.
But it is plainly wrong, from 2003 and beyond.
From SQL-2003 standard, 6IWD6-02-Foundation-2011-01.pdf, from http://www.wiscorp.com/, paragraph-7.12 (query specification), page 398:
If T is a grouped table, then let G be the set of grouping columns of T. In each <value expression> contained
in <select list>, each column reference that references a column of T shall reference some column C that
is functionally dependent on G or shall be contained in an aggregated argument of a <set function specification>
whose aggregation query is QS
Now, MySQL has implemented this feature by allowing not only columns that are functionally dependent on the grouping columns, but all columns. This causes problems for users who do not understand how grouping works and get indeterminate results where they don't expect them.
But you are right to say that MySQL has added a feature that conflicts with SQL-standards (although you seem to think that for the wrong reason). It's not entirely accurate as they have added a SQL-standard feature but not in the best way (more like the easy way) but it does conflict with the latest standards.
To answer your question, the reason for this MySQL feature (extension) is, I suppose, to be in accordance with the latest SQL standards (2003+). Why they chose to implement it this way (not fully compliant), we can only speculate.
As @Quassnoi and @Johan answered with examples, it's mainly a performance and maintainability issue. But one can't easily change the RDBMS to be clever enough (Skynet excluded) to recognize functionally dependent columns, so MySQL developers made a choice:
We (MySQL) give you (MySQL users) this feature which is in SQL-2003 standards. It improves speed in certain GROUP BY queries but there's a catch. You have to be careful (and not the SQL engine) so columns in the SELECT and HAVING lists are functionally dependent on the GROUP BY columns. If not, you may get indeterminate results.
If you want to disable it, you can set sql_mode to ONLY_FULL_GROUP_BY.
It's all in the MySQL docs: Extensions to GROUP BY (5.5) - although not in the above wording but as in your quote (they even forgot to mention that it's a deviation from standard SQL-2003 while not standard SQL-92). These kinds of choices are common, I think, in all software, other RDBMSs included. They are made for performance, backward compatibility and a lot of other reasons. Oracle famously treats '' as NULL, for example, and SQL Server probably has some quirks of its own, too.
There is also this blog post by Roland Bouman, where MySQL developers' choice is defended: Debunking GROUP BY myths.
In 2011, as @Mark Byers informed us in a comment (in a related question at DBA.SE), PostgreSQL 9.1 added a new feature (release date: September 2011) designed for this purpose. It is more restrictive than MySQL's implementation and closer to the standard.
Later, in 2015, MySQL announced that in version 5.7 the behaviour was improved to conform with the standard and actually recognize functional dependencies (even better than the Postgres implementation). The documentation: MySQL Handling of GROUP BY (5.7), and another blog post by Roland Bouman: MySQL 5.7.5: GROUP BY respects functional dependencies!
Is MySQL breaking the standard by allowing this? How?
It lets you write a query like that:
SELECT a.*, COUNT(*)
FROM a
JOIN b
ON b.a = a.id
GROUP BY
a.id
Other systems would require you to add all columns from a to the GROUP BY list, which makes the query larger, less maintainable and less efficient.
In this form (with grouping by the PK), this does not contradict the standard since every column in a is functionally dependent on its primary key.
However, MySQL does not really check the functional dependency and lets you select columns not functionally dependent on the grouping set. This can yield indeterminate results and should not be relied upon. The only thing guaranteed is that the column values belong to some of the records sharing the grouping expression (not even to one record!).
This behavior can be disabled by setting sql_mode to ONLY_FULL_GROUP_BY.
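To illustrate the point about grouping by the primary key, here is a minimal sketch comparing the short form with the portable form that lists every non-aggregated column. The name column and the sample rows are assumptions (the answer only shows a.id and b.a), and Python's sqlite3 module is used only to make the queries runnable. Note that SQLite, like MySQL without ONLY_FULL_GROUP_BY, accepts the short form without checking functional dependency, so this only shows that the results coincide when the grouping key really is the primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical columns: only a.id and b.a appear in the answer's query.
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE b (a INTEGER REFERENCES a(id))")
conn.executemany("INSERT INTO a VALUES (?, ?)", [(1, "x"), (2, "y")])
conn.executemany("INSERT INTO b VALUES (?)", [(1,), (1,), (2,)])

# Short form: group by the primary key only.
short_form = conn.execute(
    "SELECT a.id, a.name, COUNT(*) FROM a JOIN b ON b.a = a.id "
    "GROUP BY a.id ORDER BY a.id"
).fetchall()

# Portable form: every non-aggregated column is listed in GROUP BY.
portable_form = conn.execute(
    "SELECT a.id, a.name, COUNT(*) FROM a JOIN b ON b.a = a.id "
    "GROUP BY a.id, a.name ORDER BY a.id"
).fetchall()

print(short_form == portable_form)
```

Because name is determined by id, adding it to the GROUP BY list does not change the groups, only the length of the query.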
Short answer
It's a speed hack
It is enabled by default in MySQL versions before 5.7.5, and it can be disabled via the ONLY_FULL_GROUP_BY SQL mode: https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html
Long answer
The reason for the non-standard shorthand group by clause is that it's a speed hack.
MySQL lets the programmer determine whether the selected fields are functionally dependent on the group by clause.
The DB does not do any testing, but just selects the first result that it finds as the value of the field.
This results in considerable speed ups.
Consider this code:
SELECT f1, f2, f3, f4 FROM t1 GROUP BY f2
-- invalid in most SQL flavors, valid in MySQL
MySQL will just select the first value it finds, spending a minimum amount of time.
f1, f3, f4 will be from the same row, but this relation will fall apart if multiple tables with joins are involved.
To do something similar in SQL Server you'd have to write
SELECT MIN(f1), f2, MIN(f3), MIN(f4) FROM t1 GROUP BY f2
-- valid SQL, but really a hack
The DB will now have to examine all results to find the minimum value, huffing and puffing.
f1, f3, f4 will most likely have no relation to each other and will not be from the same row.
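A tiny sketch of why the MIN() workaround mixes values from different rows. The column names follow the answer; the two sample rows are made up for illustration, and Python's sqlite3 module is used only so the SQL can be run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (f1 INTEGER, f2 TEXT, f3 INTEGER, f4 TEXT)")
# Two made-up rows that share the same f2, i.e. they fall into one group.
all_rows = [(1, "g", 9, "p"), (2, "g", 3, "q")]
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?)", all_rows)

# Each MIN() is evaluated independently over the group.
combined = conn.execute(
    "SELECT MIN(f1), f2, MIN(f3), MIN(f4) FROM t1 GROUP BY f2"
).fetchone()

print(combined)               # f1 comes from the first row, f3 from the second
print(combined in all_rows)   # the combined tuple is not a row of the table
```

The query is valid everywhere, but its output describes no actual record, which is the trade-off the answer is pointing at.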
If however you do:
SELECT id as `primary_key`, count(*) as rowcount, count(f2) as f2count, f2, f3, f4
FROM t1
GROUP BY id
All the rest of the fields will be functionally dependent on id.
Rowcount will always be 1, and f2count will be either 0 (if f2 is null) or 1.
On joins, where lots of tables are involved, in a 1-n configuration like so:
Example:
Website 1 -> n Topics 1 -> n Threads 1 -> n Posts 1 -> 1 Person.
And you do a complicated select involving all tables and just do a GROUP BY posts.id
Obviously all other fields are functionally dependent on posts.id (and ONLY on posts.id).
So it makes no sense to list more fields in the group by clause, or to force you to use aggregate functions.
In order to speed things up, MySQL does not force you to do this.
But you do need to understand the concept of functional dependency and the relations in the tables and the join you've written, so it puts a lot of burden on the programmer.
However using:
SELECT
p.id, MIN(p.f2)
,MIN(t.id), MIN(t.other)
,MIN(tp.id), ....
,MIN(w.id), .....
,MIN(pe.id), ...
FROM posts p
INNER JOIN threads t ON (p.thread_id = t.id)
INNER JOIN topic tp ON (t.topic_id = tp.id)
INNER JOIN website w ON (w.id = tp.website_id)
INNER JOIN person pe ON (pe.id = p.person_id)
GROUP BY p.id
Puts exactly the same mental burden on the programmer.
All the big DBMSs have their own flavours and extensions; otherwise why would there ever be more than one of them?
Following the SQL Standards stringently is nice and all, but providing extensions with more functionality is even better. The quote from the documentation explains how this functionality is useful.
There isn't much of a conflict in this case, so I don't really see the issue.