My understanding was that plain (non-aggregated) columns can appear next to aggregate functions in a select list only if a GROUP BY naming those columns follows.
But then why is the following working?
mysql> select payee,sum(amount) from checks;
+---------+-------------+
| payee   | sum(amount) |
+---------+-------------+
| Ma Bell |      893.76 |
+---------+-------------+
1 row in set (0.00 sec)
This behavior is a MySQL "extension":
MySQL extends the use of GROUP BY so that the select list can refer
to nonaggregated columns not named in the GROUP BY clause.
However, this behavior is actually a configurable setting in MySQL:
ONLY_FULL_GROUP_BY
Do not permit queries for which the select list or (as of MySQL
5.1.10) HAVING list refers to nonaggregated columns that are not named in the GROUP BY clause.
It is best to respect the group by and add all non-aggregated columns, especially if there's the possibility that you might someday migrate to a server that has ONLY_FULL_GROUP_BY turned on.
EDIT: The manual explains this better than I do; note the details: http://dev.mysql.com/doc/refman/5.0/en/group-by-extensions.html
This is intended to work. Without a GROUP BY, the aggregate function runs over all rows returned by the query. I think the reason you see little written about this is that it is rarely what you are actually trying to accomplish. You frequently drop the GROUP BY when you have a WHERE clause that you know returns only rows from the one group you were planning to group by anyway, i.e. if your query is:
select payee,sum(amount) from checks where payee = 'Ma Bell'
The group by in the following is technically redundant:
select payee,sum(amount) from checks where payee = 'Ma Bell' group by payee
Personally, I typically include the GROUP BY clause because I THINK it is more consistently supported across platforms... not 100% sure of that though.
Again, about your query above I would ask: even though it technically works, are you getting the result you are after without a WHERE clause?
https://www.mysqltutorial.org/tryit/query/mysql-inner-join/#2
Hi folks!
I wonder why, after I delete the GROUP BY orderNumber, it fetches only one row:
Is it a mistake in their "tutorial" database, or is it correct MySQL behavior? If it's correct, then why does it produce exactly this result?
SQL "aggregate functions" including SUM(), COUNT(), MIN(), MAX() among others require a frame to aggregate over. Typically that is one or more other columns to apply the SUM() or other aggregate onto, and GROUP BY is how you specify that frame.
An aggregate query with no GROUP BY implies you are taking the SUM() of all rows matched by the query's WHERE clause filter.
MySQL is unlike most other RDBMS in that it allows you to remove the GROUP BY with unaggregated columns in SELECT and still get some rowset back from your query. In Oracle, MS SQL Server, or PostgreSQL, the query without the GROUP BY would be an error. They would also treat it as an error if you used GROUP BY orderNumber while still including status in the SELECT list. A GROUP BY should include every column which is in the SELECT list that isn't being used in the aggregate SUM(), COUNT(), MIN(), MAX(), etc.
But MySQL is lenient about its presence and instead tries to guess over which frame to apply your SUM() aggregate. Some of the time it can get the answer you were actually expecting, but most other times the values it gives you for the non-aggregated columns are essentially indeterminate. It will collapse several possible values down to just one, and you have no way to pick which one you get.
That is the query result you are seeing. MySQL chose orderNumber = 10100 and status = 'Shipped' to go with your SUM() even though they are not specifically related to that sum. The sum in your result 9604190.61 is the sum of quantityOrdered * priceEach for ALL rows in that table despite what the orderNumber says.
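One way to check this yourself is to sum the detail rows directly, with no join and no grouping. This is only a sketch, and it assumes every row in orderdetails belongs to exactly one row in orders (as the tutorial schema suggests). Under that assumption it should give the same grand total:

```sql
-- Grand total over every detail row; no orderNumber or status is involved.
SELECT SUM(quantityOrdered * priceEach) AS grand_total
FROM orderdetails;
```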
Documentation on MySQL's GROUP BY handling
So the most reliable version of your query, the only one that would work outside of MySQL, and the one whose results you can actually predict, would be:
SELECT
    T1.orderNumber,
    status,
    SUM(quantityOrdered * priceEach) total
FROM
    orders AS T1
INNER JOIN
    orderdetails AS T2 ON T1.orderNumber = T2.orderNumber
GROUP BY
    T1.orderNumber, /* qualified: both tables have this column */
    status /* added */
;
Note that the tutorial omitted status from the GROUP BY even though it is in SELECT. That would be an error in most other RDBMS.
MySQL's default handling of this misfeature has changed with recent versions. Prior to 5.7, the ONLY_FULL_GROUP_BY mode was disabled by default, arguably causing a lot of developers to grow dependent on the grouping behavior. In recent versions, ONLY_FULL_GROUP_BY is enabled by default and prevents queries with a missing or incomplete GROUP BY.
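To see which behavior a given server or connection will give you, you can inspect the sql_mode system variable. A minimal sketch (the column aliases are just illustrative names):

```sql
-- SQL modes for the server default and for the current connection.
SELECT @@GLOBAL.sql_mode  AS global_mode,
       @@SESSION.sql_mode AS session_mode;

-- 1 if ONLY_FULL_GROUP_BY is active for this connection, otherwise 0.
SELECT FIND_IN_SET('ONLY_FULL_GROUP_BY', @@SESSION.sql_mode) > 0
       AS only_full_group_by_on;
```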
I have a table emp with following structure and data:
name dept salary
----- ----- -----
Jack a 2
Jill a 1
Tom b 2
Fred b 1
When I execute the following SQL:
SELECT * FROM emp GROUP BY dept
I get the following result:
name dept salary
----- ----- -----
Jill a 1
Fred b 1
On what basis did the server decide to return Jill and Fred and exclude Jack and Tom?
I am running this query in MySQL.
Note 1: I know the query doesn't make sense on its own. I am trying to debug a problem with a 'GROUP BY' scenario. I am trying to understand the default behavior for this purpose.
Note 2: I am used to writing the SELECT clause same as the GROUP BY clause (minus the aggregate fields). When I came across the behavior described above, I started wondering if I can rely on this for scenarios such as:
select the rows from emp table where the salary is the lowest/highest in the dept.
E.g., SQL statements like this work on MySQL:
SELECT A.*, MIN(A.salary) AS min_salary FROM emp AS A GROUP BY A.dept
I didn't find any material describing why such SQL works, more importantly if I can rely on such behavior consistently. If this is a reliable behavior then I can avoid queries like:
SELECT A.* FROM emp AS A WHERE A.salary = (
SELECT MAX(B.salary) FROM emp B WHERE B.dept = A.dept)
Read MySQL documentation on this particular point.
In a nutshell, MySQL allows omitting some columns from the GROUP BY for performance purposes. However, this works only if the omitted columns all have the same value within each group; otherwise the values returned by the query are indeed indeterminate, as others in this post correctly guessed. Note that adding an ORDER BY clause does not make the behavior deterministic again.
Although not at the core of the issue, this example shows how using * rather than an explicit enumeration of desired columns is often a bad idea.
Excerpt from MySQL 5.0 documentation:
When using this feature, all rows in each group should have the same values
for the columns that are omitted from the GROUP BY part. The server is free
to return any value from the group, so the results are indeterminate unless
all values are the same.
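For the "lowest salary in each dept" case from the question, a version that does not depend on this behavior could look like the sketch below. It uses only the emp columns shown in the question; the aliases e, m and min_salary are illustrative:

```sql
-- Step 1 (derived table m): the minimum salary per department.
-- Step 2: join back to emp to fetch the full rows that carry that salary.
SELECT e.name, e.dept, e.salary
FROM emp AS e
INNER JOIN (
    SELECT dept, MIN(salary) AS min_salary
    FROM emp
    GROUP BY dept
) AS m
    ON m.dept = e.dept
   AND m.min_salary = e.salary;
```

If two employees in a department share the lowest salary, both rows are returned, which is usually what you want to know about anyway.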
This is a bit late, but I'll put this up for future reference.
The GROUP BY takes the first row that has a duplicate and discards any rows that match after it in the result set. So if Jack and Tom have the same department, whoever appears first in a normal SELECT will be the resulting row in the GROUP BY.
If you want to control what appears first in the list, you need to do an ORDER BY. However, SQL does not allow ORDER BY to come before GROUP BY, as it will throw an exception. The best workaround for this issue is to do the ORDER BY in a subquery and then a GROUP BY in the outer query. Here's an example:
SELECT * FROM (SELECT * FROM emp ORDER BY name) as foo GROUP BY dept
This is the best performing technique I've found. I hope this helps someone out.
As far as I know, for your purposes the specific rows returned can be considered to be random.
Ordering only takes place after GROUP BY is done
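You can put a:
SET SESSION sql_mode=(SELECT REPLACE(@@sql_mode,'ONLY_FULL_GROUP_BY',''));
before your query to turn off ONLY_FULL_GROUP_BY for the current connection and get MySQL's permissive, non-standard GROUP BY behavior back. (SET GLOBAL changes the default only for connections opened afterwards.)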
I find that the best thing to do is to consider this type of query unsupported. In most other database systems, you can't include columns that aren't either in the GROUP BY clause or in an aggregate function in the HAVING, SELECT or ORDER BY clauses.
Instead, consider that your query reads:
SELECT ANY(name), dept, ANY(salary)
FROM emp
GROUP BY dept;
...since this is what's going on.
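ANY() above is pseudocode. MySQL 5.7.5 and later do have a real function with that meaning, ANY_VALUE(), which is accepted even when ONLY_FULL_GROUP_BY is enabled. A sketch of the same query written with it:

```sql
-- ANY_VALUE() tells the server explicitly that any value from the group will do.
-- Which name/salary comes back for each dept is still not defined.
SELECT ANY_VALUE(name)   AS name,
       dept,
       ANY_VALUE(salary) AS salary
FROM emp
GROUP BY dept;
```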
Hope this helps....
I think ANSI SQL requires that the select includes only fields from the GROUP BY clause, plus aggregate functions.
MySQL appears to return some row here, possibly the last one the server read or any row it had at hand, but don't rely on that.
This would select the most recent row for each person:
SELECT * FROM emp
WHERE ID IN
(
    SELECT
        MAX(ID) AS ID
    FROM
        emp
    GROUP BY
        name
)
If you are grouping by department, does the other data matter? I know SQL Server will not even allow this query. If this situation can arise, it sounds like there might be other issues.
Try using ORDER BY to pick the row that you want.
SELECT * FROM emp GROUP BY dept ORDER BY name ASC;
Will return the following:
name dept salary
----- ----- -----
jack a 2
fred b 1
For MySQL's GROUP BY, what are the criteria for picking one row out of many? For example, if I use GROUP BY user_id, would it choose the row in some order or in some random way?
For example this table
id  user_id  message  created_at
--  -------  -------  -------------------
1   1        a        2016-08-25 07:00:15
2   2        c        2016-08-25 08:00:15
3   1        b        2016-08-25 09:46:15
4   2        d        2016-08-25 10:49:12
How will GROUP BY user_id decide which row to take for user_id=1, row 1 or row 3? I couldn't find any solution.
It will find the one specified in the aggregation (MAX(), MIN() etc.) statement, as you should only select grouped or aggregated columns when using GROUP BY.
Otherwise it is not determined which value will be chosen, it is pretty random.
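As a sketch of what "specified in the aggregation" means for the sample data above (yourtable is a placeholder name, since the question does not name the table):

```sql
-- Every selected column is either the grouping column or an aggregate,
-- so each output value is fully determined.
SELECT user_id,
       MIN(created_at) AS first_created_at,
       MAX(created_at) AS last_created_at,
       COUNT(*)        AS message_count
FROM yourtable
GROUP BY user_id;
-- For the four sample rows:
--   user_id 1: 2016-08-25 07:00:15, 2016-08-25 09:46:15, 2
--   user_id 2: 2016-08-25 08:00:15, 2016-08-25 10:49:12, 2
```

Note that first_created_at and last_created_at can come from different rows; an aggregate returns a value, not a row.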
Also see the MySQL manual:
https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
MySQL 5.7.5 and up implements detection of functional dependence. If
the ONLY_FULL_GROUP_BY SQL mode is enabled (which it is by default),
MySQL rejects queries for which the select list, HAVING condition, or
ORDER BY list refer to nonaggregated columns that are neither named in
the GROUP BY clause nor are functionally dependent on them.
So since MySQL 5.7.5 you have to explicitly disable ONLY_FULL_GROUP_BY before MySQL will execute those queries.
Before that, MySQL allowed those queries but, as mentioned, the values it chose for the nonaggregated and nongrouped fields were effectively random.
Group by works on a specific field. If you group by user_id and SELECT any other column then that column from that particular GROUP will be selected randomly.
That is why it is not recommended to SELECT the field which is not in GROUP BY clause.
How will GROUP BY user_id decide which row to take for user_id=1, row 1
or row 3? I couldn't find any solution.
Yes it will take other fields randomly.
If you have a query like
select user_id from yourtable group by user_id
then it does not matter from which record the values come from. However, if you have a query like
select user_id, created_at from yourtable group by user_id
where you have a field in the select list that is not subject of an aggregate function (max(), min(), etc), then as MySQL documentation on MySQL Handling of GROUP BY says:
In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate, which is probably not what you want.
In reality, MySQL will pick the value for such fields from the 1st record it encounters while assembling the resultset.
Please also note that unless such fields are functionally dependent on the fields in the GROUP BY, the query violates the SQL standard. In MySQL you can use the only_full_group_by sql mode setting (also included in the ANSI combination mode) to determine whether MySQL accepts such queries at all. In more recent versions of MySQL this sql mode is turned on by default, preventing you from running such queries without changing the settings.
The GROUP BY clause does not return rows from the database. It generates values using the rows filtered by the WHERE clause.
There are three types of columns that are valid in the expressions present in the SELECT clause of a query that contains a GROUP BY clause:
columns that also appear in the GROUP BY clause;
columns that are functionally dependent on the columns that appear in the GROUP BY clause;
any column, when it is used as the argument of an aggregate function.
A GROUP BY query whose columns present in the SELECT clause do not follow the rules above is invalid SQL.
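A sketch showing all three kinds of column in one query. The users table (with id as its primary key and a name column) is hypothetical and not part of the question; yourtable stands for the message table from the question:

```sql
-- Assumed schema: users(id PRIMARY KEY, name),
--                 yourtable(id, user_id, message, created_at).
SELECT u.id,                                  -- 1. named in GROUP BY
       u.name,                                -- 2. functionally dependent on u.id (primary key)
       MAX(m.created_at) AS last_message_at   -- 3. argument of an aggregate function
FROM users AS u
INNER JOIN yourtable AS m ON m.user_id = u.id
GROUP BY u.id;
```

MySQL 5.7.5 and later accept this with ONLY_FULL_GROUP_BY enabled because they detect the dependence of u.name on the primary key. Servers that do not detect functional dependence would need u.name added to the GROUP BY.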
Up to version 5.7.5, MySQL allows invalid GROUP BY queries. It is explained in the documentation that for the columns that do not follow the rules above, "the server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate, which is probably not what you want."
Since version 5.7.5, MySQL rejects such invalid queries by default. Other RDBMSes (SQL Server, Oracle, etc.) do not allow them either because, well, they are invalid SQL.
It seems like in version 5.7 of MySQL, they added one nasty thing which was (or still is) a real headache for those who deal with SQL Server.
The thing is: MySQL throws an error, when you try to SELECT DISTINCT rows for one set of columns and want to ORDER BY another set of columns. Previously, in version 5.6 and even in some builds of version 5.7 you could do this, but now it is prohibited (at least by default).
I hope there exists some configuration, some variable that we could set to make it work. But unfortunately I do not know that nasty variable. I hope someone knows that.
EDIT
This is some typical query in my case that worked literally for years (until the last build of MySQL 5.7):
SELECT DISTINCT a.attr_one, a.attr_two, a.attr_three, b.attr_four FROM table_one a
LEFT JOIN table_two b ON b.some_idx = a.idx
ORDER BY b.id_order
And, indeed, if I now include b.id_order in the SELECT part (as MySQL suggests doing), then what I get will be rubbish.
In most cases, a DISTINCT clause can be considered as a special case of GROUP BY. For example,
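A sketch of that equivalence, using two columns of table_one from the question (an illustration, not the manual's own example):

```sql
-- Both statements return the same set of rows:
-- one row per distinct (attr_one, attr_two) pair.
SELECT DISTINCT attr_one, attr_two
FROM table_one;

SELECT attr_one, attr_two
FROM table_one
GROUP BY attr_one, attr_two;
```

Seen this way, ordering a DISTINCT result by a column that is not in the select list raises the same question as ordering a GROUP BY result by a non-grouped column: which of the collapsed rows should supply the sort value?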
ONLY_FULL_GROUP_BY
MySQL 5.7.5 and up implements detection of functional dependence. If the ONLY_FULL_GROUP_BY SQL mode is enabled (which it is by default), MySQL rejects queries for which the select list, HAVING condition, or ORDER BY list refer to nonaggregated columns that are neither named in the GROUP BY clause nor are functionally dependent on them. (Before 5.7.5, MySQL does not detect functional dependency and ONLY_FULL_GROUP_BY is not enabled by default. For a description of pre-5.7.5 behavior )
If ONLY_FULL_GROUP_BY is disabled, a MySQL extension to the standard SQL use of GROUP BY permits the select list, HAVING condition, or ORDER BY list to refer to nonaggregated columns even if the columns are not functionally dependent on GROUP BY columns. This causes MySQL to accept the preceding query. In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate, which is probably not what you want. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Result set sorting occurs after values have been chosen, and ORDER BY does not affect which value within each group the server chooses. Disabling ONLY_FULL_GROUP_BY is useful primarily when you know that, due to some property of the data, all values in each nonaggregated column not named in the GROUP BY are the same for each group.
For more, see http://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmode_only_full_group_by
For your particular case:
SELECT DISTINCT attr_one,
                attr_two,
                attr_three,
                attr_four
FROM
    (SELECT a.attr_one,
            a.attr_two,
            a.attr_three,
            b.attr_four
     FROM table_one a
     LEFT JOIN table_two b ON b.some_idx = a.idx
     ORDER BY b.id_order) tmp
I have read the post at the link you mentioned, and it looks like it gives a clear explanation of why the error is thrown and how to avoid it.
In your case you may want to try the following (not tested of course):
SELECT a.attr_one, a.attr_two, a.attr_three, b.attr_four
FROM table_one a
LEFT JOIN table_two b ON b.some_idx = a.idx
GROUP BY a.attr_one, a.attr_two, a.attr_three, b.attr_four
ORDER BY max(b.id_order)
You should choose whether to use ORDER BY max(b.id_order), ORDER BY min(b.id_order), or some other aggregate function.
I have a table called child like this
+---------+-----+
| name    | age |
+---------+-----+
| Alfredo |   5 |
| Maria   |   6 |
+---------+-----+
When I run SELECT `name` FROM `child` I get both rows. No problem. It is what I expected.
But if I run SELECT `name`, MAX(`age`) FROM `child` I get:
+---------+------------+
| name    | MAX(`age`) |
+---------+------------+
| Alfredo |          6 |
+---------+------------+
This result is strange to me. I expected both rows like before, so why does it output just one row? Why is Alfredo output when Maria is the one who is 6 years old? Where can I find documentation about this behaviour?
You need to use GROUP BY to get more than one row. Otherwise the aggregate function MAX() is applied on all rows. Notice, that Alfredo's age is actually 5. The name is the group in this case.
MySQL is kind of special here, since it doesn't follow ANSI-Standard SQL. Usually an error is thrown, when you don't specify a column from the select clause in the group by clause or apply an aggregate function on it. MySQL allows this (this will be changed in future versions, btw) and displays a random row from this group. So don't do this.
To get two rows in your example, you'd have to do
SELECT name, MAX(age) FROM your_table GROUP BY name;
Each name is a "group". If you would have another Alfredo with age 25 in your table, the result would be Alfredo - 25 and Maria - 6.
It gets more complicated than this when you want to get the row which belongs to the group-wise maximum. Here are some examples how to solve this.
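For the two-row child table in the question, one common way to get the name that actually belongs to the maximum age is a scalar subquery. A minimal sketch:

```sql
-- The subquery computes the overall maximum age once;
-- the outer query returns the row(s) that have that age.
SELECT `name`, `age`
FROM `child`
WHERE `age` = (SELECT MAX(`age`) FROM `child`);
```

With the sample data this returns Maria, 6. If several children share the maximum age, all of them are returned.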
More info to read.
To be on the safe side, you can have MySQL reject such queries by enabling the only_full_group_by sql_mode. Ask your administrator if you don't have the rights to do so.
Use of SQL aggregate functions should be accompanied by a GROUP BY clause. Here's a good place to start: https://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html
You should use SQL aggregate functions like AVG, MAX, etc. only together with GROUP BY. Otherwise you will get undefined behaviour like this.
Here if you write max(age) only, everything looks good and you get 6, but now you also ask it to print the name(with no condition, i.e. asking it to print all names while max will only be one row), so it tries to do something intelligent and printing the first row is what it does in your case.
MAX() is an aggregate function to be used with GROUP BY. When the GROUP BY clause is missing, any RDBMS will produce a single group from all the selected rows and it will return a single row.
When grouping is involved, the expressions that appear in the SELECT clause are evaluated independently. There is no relationship between name and MAX(age). MAX(age) is the maximum value of column age from the rows filtered by the WHERE clause (all the rows in your case).
The standard SQL language does not allow SELECTing columns that are not dependent on the GROUP BY columns or used in aggregate functions.
MySQL allows this before version 5.7.5. Starting with version 5.7.5 it adheres to the standard and rejects such queries with errors. The old behaviour can still be achieved using configuration.
As explained in the documentation, for SELECT columns that are neither dependent on the GROUP BY columns nor used in aggregate functions, "the server is free to choose any value from each group". This is undefined behaviour.
Back to your query:
SELECT `name`, MAX(`age`) FROM `child`
It has no WHERE clause, so all the rows are included. Then, because of MAX(age) (which is an aggregate function), MySQL creates a group that contains all the filtered rows (all the rows) and evaluates each of the expressions from the SELECT clause.
MAX(age) is very clear: it evaluates to the maximum value found in the column age of the rows from the group. That is 6 and nothing more. No reference to the row it was extracted from is kept.
Selecting a name is affected by the undefined behaviour exposed above. The server will select any value and, this time, it seems it preferred to pick the value from the first row. It could be different on another server. It could be different on the same server after you add, remove or update a row on that table. It just cannot be predicted.
Why this behaviour?
Why the server doesn't get the value from the same row where it got the value of MAX(age)? Is it that difficult to accomplish? -- This is how a lot of beginners think when they start working with SQL.
The short answer is: because there is no such row.
Let's say SQL should select name from the same row where it found MAX(`age`).
Let's put more aggregate functions in the query:
SELECT `name`, MAX(`age`), MIN(`age`), AVG(`age`), COUNT(*) FROM `child`
If the above assertion were correct, SQL should get name from the same row that contains MAX(age) (row #2). What if there are two rows containing that value?
But, at the same time, it should get name from the same row that contains MIN(age) (ahem, this is row #1).
Or, it should get it from the row where it finds AVG(age) (which is 5.5; oops, there is no such row).
What about the row that contains COUNT(*) in column... errr... in what column should it check for COUNT(*)? Btw, COUNT(*) is not an age or a name, it is just a number. It doesn't make any sense to compare it with values you store in the table.