In this book I'm currently reading while following a course on databases, the following example of an illegal query using an aggregate operator is given:
Find the name and age of the oldest sailor.
Consider the following attempt to answer this query:
SELECT S.sname, MAX(S.age)
FROM Sailors S
The intent is for this query to return not only the maximum age but
also the name of the sailors having that age. However, this query is
illegal in SQL--if the SELECT clause uses an aggregate operation, then
it must use only aggregate operations unless the query contains a GROUP BY clause!
Some time later while doing an exercise using MySQL, I faced a similar problem, and made a mistake similar to the one mentioned. However, MySQL didn't complain and just spit out some tables which later turned out not to be what I needed.
Is the query above really illegal in SQL, but legal in MySQL, and if so, why is that?
In what situation would one need to make such a query?
Further elaboration of the question:
The question isn't about whether or not all attributes mentioned in a SELECT should also be mentioned in a GROUP BY.
It's about why the above query, using atributes together with aggregate operations on attributes, without any GROUP BY is legal in MySQL.
Let's say the Sailors table looked like this:
+----------+------+
| sname | age |
+----------+------+
| John Doe | 30 |
| Jane Doe | 50 |
+----------+------+
The query would then return:
+----------+------------+
| sname | MAX(S.age) |
+----------+------------+
| John Doe | 50 |
+----------+------------+
Now who would need that? John Doe ain't 50, he's 30!
As stated in the citation from the book, this is a first attempt to get the name and age of the oldest sailor, in this example, Jane Doe at the age of 50.
SQL would say this query is illegal, but MySQL just proceeds and spits out "garbage".
Who would need this kind of result?
Why does MySQL allow this little trap for newcomers?
By the way, it is default MySQL behavior. But it can be changed by setting ONLY_FULL_GROUP_BY server mode in the my.ini file or in the session -
SET sql_mode = 'ONLY_FULL_GROUP_BY';
SELECT * FROM sakila.film_actor GROUP BY actor_id;
Error: 'sakila.film_actor.film_id' isn't in GROUP BY
ONLY_FULL_GROUP_BY - Do not permit queries for which the select list refers to nonaggregated columns that are not named in the GROUP BY clause.
Is the query above really illegal in SQL, but legal in MySQL
Yes
if so, why is that
I don't know the reasons for the design decisions made in MySQL, but considering that you can get the actual related data from the same row(s) as the aggregate came from (e.g., MAX or MIN) with only slightly more work, I don't see any advantage in returning additional column data from arbitrary rows.
I strongly dislike this "feature" in MySQL and it trips up many people who learn aggregates on MySQL and then move to a different dbms, and suddenly realize they never quite knew what they were doing.
Based on a link which a_horse_with_no_name provided in a comment, I have arrived at my own answer:
It seems that the MySQL way of using GROUP BY differs from the SQL way, in order to permit leaving out columns, from the GROUP BY clause, when they are functionally dependant on other included columns anyways.
Lets say we have a table displaying the activity of a bank account.
It's not a very thought-out table, but it's the only one we have, and that will have to do.
Instead of keeping track of an amount, we imagine an account starts at '0', and all transactions to it is recorded instead, so the amount is the sum of the transactions. The table could look like this:
+------------+----------+-------------+
| costumerID | name | transaction |
+------------+----------+-------------+
| 1337 | h4x0r | 101 |
| 42 | John Doe | 500 |
| 1337 | h4x0r | -101 |
| 42 | John Doe | -200 |
| 42 | John Doe | 500 |
| 42 | John Doe | -200 |
+------------+----------+-------------+
It is clear that the 'name' is functionally dependant on the 'costumerID'.
(The other way around would also be possible in this example.)
What if we wanted to know the costumerID, name and current amount of each customer?
In such a situation, two very similar queries would return the following right result:
+------------+----------+--------+
| costumerID | name | amount |
+------------+----------+--------+
| 42 | John Doe | 600 |
| 1337 | h4x0r | 0 |
+------------+----------+--------+
This query can be executed in MySQL, and is legal according to SQL.
SELECT costumerID, name, SUM(transaction) AS amount
FROM Activity
GROUP BY costumerID, name
This query can be executed in MySQL, and is NOT legal according to SQL.
SELECT costumerID, name, SUM(transaction) AS amount
FROM Activity
GROUP BY costumerID
The following line would make the query return and error instead, since it would now have to follow the SQL way of using aggregation operations and GROUP BY:
SET sql_mode = 'ONLY_FULL_GROUP_BY';
The argument for allowing the second query in MySQL, seems to be that it is assumed that all columns mentioned in SELECT, but not mentioned in GROUP BY, are either used inside an aggregate operation, (the case with 'transaction'), or are functionally dependent on other included columns, (the case with 'name'). In the case of 'name', we can be sure that the correct 'name' is chosen for all group entries, since it is functionally dependant on 'costumerID', and therefore there is only one possibly name for each group of costumerID's.
This way of using GROUP BY seems flawed tough, since it doesn't do any further checks on what is left out from the GROUP BY clause. People can pick and choose columns from their SELECT statement to put in their GROUP BY clause as they see fit, even if it makes no sense to include or leave out any particular column.
The Sailor example illustrates this flaw very well.
When using aggregation operators (possibly in conjunction with GROUP BY), each group entry in the returned set has only one value for each of its columns. In the case of Sailors, since the GROUP BY clause is left out, the whole table is put into one single group entry. This entry needs a name and a maximum age. Choosing a maximum age for this entry is a no-brainer, since MAX(S.age) only returns one value. In the case of S.sname though, wich is only mentioned in SELECT, there are now as many choices as there are unique sname's in the whole Sailor table, (in this case two, John and Jane Doe). MySQL doens't have any clue which to choose, we didn't give it any, and it didn't hit the brakes in time, so it has to just pick whatever comes first, (Jane Doe). If the two rows were switched, it would actually give "the right answer" by accident. It just seems plain dumb that something like this is allowed in MySQL, that the result of a query using GROUP BY could potententially depend on the ordering of the table, if something is left out in the GROUP BY clause. Apparently, that's just how MySQL rolls. But still couldn't it at least have the courtesy of warning us when it has no clue what it's doing because of a "flawed" query? I mean, sure, if you give the wrong instructions to a program, it probably wouldn't (or shouldn't) do as you want, but if you give unclear instructions, I certainly wouldn't want it to just start guessing or pick whatever comes first... -_-'
MySQL allows this non-standard SQL syntax because there is at least one specific case in which it makes the SQL nominally easier to write. That case is when you're joining two tables which have a PRIMARY / FOREIGN KEY relationship (whether enforced by the database or not) and you want an aggregate value from the FOREIGN KEY side and multiple columns from the PRIMARY KEY side.
Consider a system with Customer and Orders tables. Imagine you want all the fields from the customer table along with the total of the Amount field from the Orders table. In standard SQL you would write:
SELECT C.CustomerID, C.FirstName, C.LastName, C.Address, C.City, C.State, C.Zip, SUM(O.Amount)
FROM Customer C INNER JOIN Orders O ON C.CustomerID = O.CustomerID
GROUP BY C.CustomerID, C.FirstName, C.LastName, C.Address, C.City, C.State, C.Zip
Notice the unwieldy GROUP BY clause, and imagine what it would look like if there were more columns you wanted from customer.
In MySQL, you could write:
SELECT C.CustomerID, C.FirstName, C.LastName, C.Address, C.City, C.State, C.Zip, SUM(O.Amount)
FROM Customer C INNER JOIN Orders O ON C.CustomerID = O.CustomerID
GROUP BY C.CustomerID
or even (I think, I haven't tried it):
SELECT C.*, SUM(O.Amount)
FROM Customer C INNER JOIN Orders O ON C.CustomerID = O.CustomerID
GROUP BY C.CustomerID
Much easier to write. In this particular case it's safe as well, since you know that only one row from the Customer table will contribute to each group (assuming CustomerID is PRIMARY or UNIQUE KEY).
Personally, I'm not a big fan of this exception to standard SQL syntax (since there are many cases where it's not safe to use this syntax and rely on getting values from any particular row in the group), but I can see where it makes certain kinds of queries easier and (in the case of my second MySQL example) possible.
Related
Here is the SQL problem.
Table: Countries
+---------------+---------+
| Column Name | Type |
+---------------+---------+
| country_id | int |
| country_name | varchar |
+---------------+---------+
country_id is the primary key for this table.
Each row of this table contains the ID and the name of one country.
Table: Weather
+---------------+------+
| Column Name | Type |
+---------------+------+
| country_id | int |
| weather_state | int |
| day | date |
+---------------+------+
(country_id, day) is the primary key for this table.
Each row of this table indicates the weather state in a country for one day.
Write an SQL query to find the type of weather in each country for November 2019.
The type of weather is:
Cold if the average weather_state is less than or equal 15,
Hot if the average weather_state is greater than or equal to 25, and
Warm otherwise.
Return result table in any order.
One of the MySQL solutions is as follows:
SELECT country_name, CASE WHEN AVG(weather_state) <= 15 THEN 'Cold' WHEN AVG(weather_state) >= 25 THEN 'Hot'
ELSE 'Warm'
END AS weather_type
FROM Weather w
JOIN Countries c
ON w.country_id = c.country_id
AND LEFT(w.day, 7) = '2019-11'
GROUP BY w.country_id
How does the "case when AVG(weather_state)" get executed, if the group by gets executed after the select statement?
How does the "case when AVG(weather_state)" get executed, if the group by gets executed after the select statement?
AVG(weather_state) computes the per-group average of column weather_state. It and other aggregate functions can be used in a select clause, from which you can conclude that the grouping defined by a group by clause must be visible in the context where the select clause is evaluated. In this sense, at least, group by gets executed before select. Pretty much everything else does too.
It is possible for an aggregate query to be identifiable only from the select clause. In such cases, the select clause needs to be parsed before it is known that grouping (all rows into a single group) is to be performed. This is the closest I can think of to the execution-order claim you asserted, but it is not at all well characterized as group by being executed after select.
MySQL's implementation details surely present a more complicated picture, but the fact remains that MySQL does provide correct SQL semantics in this regard. Therefore, even if you look at the details, they cannot reasonably be characterized as executing the group by after the select. Whoever told you that was wrong, or at least their lesson was very misleading, or else you misunderstood them.
I have the following Personnel table:
+---------+----------+-------------+
| name | dept_nbr | job_title |
+---------+----------+-------------+
| Michael | 14 | Programmer |
| Kumar | 14 | Programmer |
| Dave | 14 | Programmer |
| Jane | 14 | Manager |
| Carol | 37 | Programmer |
| Joe | 37 | Programmer |
| John | 59 | CEO |
+---------+----------+-------------+
Problem: Find all dept_nbr's (departments) that have fewer than 3 programmers.
Working query:
SELECT DISTINCT dept_nbr
FROM Personnel AS P1
WHERE (SELECT COUNT(P2.dept_nbr)
FROM Personnel AS P2
WHERE P1.dept_nbr = P2.dept_nbr AND P2.job_title = 'Programmer') < 3;
Result:
37
59
Notes:
Department 14 is correctly not included as it has 3 programmers (3 is equal to but not fewer than 3). Department 59 has zero programmers, and is also correctly included in the results.
My question:
When the above query executes, how does a generic SQL engine proceed? From what I have read, SQL execution order is (roughly): From, Where, Group By, Having, and Select. So, is the following correct?
1 - The Outer Query passes each row of the Personnel table as P1 into the Inner query.
2.a - The Inner Query scans the entire Personnel table as P2, row by row, looking for rows that satisfy the condition "P1.dept_nbr = P2.dept_nbr AND P2.job_title = 'Programmer'".
2.b – Once the Inner Query is done with the entire table, it COUNTs the matching dept_nbr values and returns it to the Outer Query.
3 – In the Outer Query, if the count returned from the Inner Query satisfies the condition "WHERE (Inner Query Count Result) < 3", the corresponding dept_nbr for the P1 row is SELECTed.
4 – Following all rows processed by the Outer Query, the Outer Query does a DISTINCT on the results and displays the unique dept_nbr values.
Is my understanding above correct? Specifically, does the outer query do the DISTINCT at the very end (step #4)? It seems that in this way, the inner query does redundant scanning (for example, it processes dept_nbr = 14 four times, when it really has the answer in the first pass).
I tested the above query on sqlfiddle.com w/ MySQL 5.6.
When the above query executes, how does a generic SQL engine proceed?
From what I have read, SQL execution order is (roughly): From, Where,
Group By, Having, and Select.
This statement is -- generally -- not correct. SQL is parsed in the order that you describe. However, the execution is determined by the optimizer and might have little to do with the original query. Remember: SQL is a descriptive language, not a procedural language. It describes the result set, not the specific steps for calculating it.
That said, MySQL's execution plan is much closer to the query than most other databases (particularly more advanced databases with better optimizers). And, almost any database is going to proceed in the steps you describe for this query. The aggregation in the subquery limits the choices for optimization.
If you want to eliminate the redundancy, then do the select distinct before the filtering:
SELECT dept_nbr
FROM (SELECT DISTINCT dept_nbr FROM Personnel P1) P1
WHERE (SELECT COUNT(P2.dept_nbr)
FROM Personnel AS P2
WHERE P1.dept_nbr = P2.dept_nbr AND P2.job_title = 'Programmer'
) < 3;
You can also do this more simply with just an aggregation:
select dept_nbr
from personnel
group by dept_nbr
having sum(job_title = 'Programmer') < 3;
Add EXPLAIN (or EXPLAIN EXTENDED) before your query and it should give you the explain plan which will detail exactly the steps in order of your query. This is a very useful tool when trying to optimize queries.
I have a remarks table which can be linked to any number of other items in a system, in the case of this example we'll use bookings, enquiries and referrals.
Thus in the remarks table we have columns
remark_id | datetime | text | booking_id | enquiry_id | referral_id
1 | 2014-06-28 | abc | 0 | 8 | 0
2 | 2014-06-27 | def | 3 | 0 | 0
2 | 2014-05-31 | ghi | 0 | 0 | 10
Etc...
Each of the item tables will have a field called name. Thus when I want to select a remark the likelihood is I'll need this name.
I'd like to achieve this with a single query, getting a 2d array as follows:
['remark_id'=>1, 'datetime'=>'2014-06-28', 'text'=>'abc', 'name'=>'Harold']
However the query I'd expect to use would be
SELECT r.remark_id,r.datetime,r.text
,b.name AS book,rr.name AS referral,e.name AS enquiry
FROM remarks AS r
LEFT JOIN bookings AS b ON b.book_id=r.book_id
LEFT JOIN referrals AS rr ON rr.referral_id=r.referral_id
LEFT JOIN enquiries AS e ON e.enquiry_id=r.enquiry_id
Leaving me with the output
['remark_id'=>1, 'datetime'=>'2014-06-28', 'text'=>'abc', 'book'=>'Harold', 'referral'=>'', 'enquiry'=>'']
And more processing to do before or during rendering it to a view.
Is there a way to write a query such that it would fill a field from the first NOT NULL string it encountered in one of the joined tables?
Please only suggest using a different database system if you know that MySQL doesn't provide any way to do what I'm asking. If it's the case it can't be done there's no business sense in rewriting the system anyway, but I'd like to ask!
Two ways I can think of:
use UNION:
SELECT remark_id, datetime, text, name
FROM remarks
JOIN bookings ON (remarks.book_id = bookings.book_id)
UNION
SELECT remark_id, datetime, text, name
FROM remarks
JOIN referrals ON (remarks.referral_id = referrals.referral_id)
UNION
SELECT remark_id, datetime, text, name
FROM remarks
JOIN enquiries ON (remarks.enquiry_id = enquiries.enquiry_id)</code>
use IFNULL (probably much slower):
SELECT r.remark_id,r.datetime,r.text,
IFNULL(b.name,IFNULL(rr.name,e.name)) AS name
FROM remarks AS r
LEFT JOIN bookings AS b ON b.book_id=r.book_id
LEFT JOIN referrals AS rr ON rr.referral_id=r.referral_id
LEFT JOIN enquiries AS e ON e.enquiry_id=r.enquiry_id</code>
Variant 2 is really much slower because of the LEFT JOINs.
Also, generally I would not recommend using 0 as value for non-existent links, rather use NULL. This will allow MySQL to speed up the join.
one way to achieve this is with nested if statements:
if(b.name is not null, b.name, if(rr.name is not null, rr.name, e.name)) as name
one drawback is that this gives an implicit priority to books? not sure if that would be an issue.
perhaps the main drawback, though, is that this is kind of "magical" and has goofy syntax so it might be more clear to just handle those cases in the controller after all.
Seems quite messy that you have multiple unused columns for each entry, unless I'm not understanding correctly. If you add more tables, you'd have to adjust each of the views so that it would filter out the new table.
I'd be tempted to redesign your structure so that each of the tables has a remarkgroup_id column, then add the following remark table
remark_id, remarkgroup_id, date, message
This would clean up the extra unused columns and allow you to use simple joining logic.
I have the following (intentionally denormalized for demonstrating purposes) sample CARS table:
| CAR_ID | OWNER_ID | OWNER_NAME | COLOR |
|--------|----------|------------|-------|
| 1 | 1 | John | White |
| 2 | 1 | John | Black |
| 3 | 2 | Mike | White |
| 4 | 2 | Mike | Black |
| 5 | 2 | Mike | Brown |
| 6 | 3 | Tony | White |
If I wanted to count the amount of cars per owner and return this:
| OWNER_ID | OWNER_NAME | TOTAL |
|----------|------------|-------|
| 1 | John | 2 |
| 2 | Mike | 3 |
| 3 | Tony | 1 |
I know I can write the following query:
SELECT owner_id, owner_name, COUNT(*) total FROM cars
GROUP BY owner_id, owner_name
However, removing owner_name from the GROUP BY clause gives me the same results.
What is the difference between those 2 queries?
Under what circumstances should I group by all non-agreggated fields in the SELECT statement and in which ones shouldn't I?
Can you give an example in which this grouping would return different results when removing a non-aggregated field and explain why?
The first thing to make clear is that SQL is not MySQL.
In standard SQL it is not allowed to group by a subset of the non-aggregated fields. The reason is very simple. Suppose I'm running this query:
SELECT color, owner_name, COUNT(*) FROM cars
GROUP BY color
That query would not make any sense. Even trying to explain it would be impossible. For sure it is selecting colors and counting the amount of cars per color. However, it is also adding the owner_name field and there can be many owners for a given color, as it is the case of the White color. So if there can be many owner_name values for a single color which happens to be the only field in the GROUP BY clause... then which owner_name will be returned?
If it is needed to return an owner_name then some kind of criteria should be added to only select one of them, e.g., the first one alphabetically, which in this case would be John. That criteria would result in adding an aggregate function MIN(owner_name) and then the query will make sense again as it will be grouping by, at least, all the non-agreggated fields in the select statement.
As you can see, there is a clear and practical reason for standard SQL to be inflexible in the grouping. If it wasn't, you could face awkward situations in which the value for a column will be unpredictable, and that is not a nice word, particularly if the query being run is showing you your bank account transactions.
Having said that, then why would MySQL allow queries that might not make sense? And even worse, the error in the query above could be just syntactically detected! The short answer is: performance. The long answer is that there are certain situations in which, based on data relations, getting an unpredictable value from the group will result in a predictable value.
If you haven't figured it out yet, the only way in which you can predict the value you'll get from taking an unpredictable element from a group will be if all the elements in the group are the same. A clear example of this situation is in the sample query in your very same question. Look at how owner_id and owner_name relates in the table. It is clear that given any owner_id, e.g. 2, you can only have one distinct owner_name. Even having many rows, by choosing any, you will get Mike as the result. In formal database jargon this can be explained as owner_id functionally determines owner_name.
Let's take a closer look at that fully working MySQL query:
SELECT owner_id, owner_name, COUNT(*) total FROM cars
GROUP BY owner_id
Given any owner_id this would return the same owner_name, so adding it to the GROUP BY clause will not result in more rows returned. Even adding an aggregated function MAX(owner_name) will not result in less rows returned. The resulting data will be exacly the same. In both cases, the query would be immediately turned into a legal standard SQL query as at least all the non-aggregated fields would be grouped by. So there are 3 approaches to get the same results.
However, as I mentioned before, this non-standard grouping has a performance advantage. You can check this so underrated link in which this is explained for more detail but I'm going to cite the most important part:
You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. [...] The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.
One thing that is worth mentioning is that the results are not necessarily wrong but rather indeterminate. In other words, getting the expected results does not mean you have written the right query. Writing the right query will always give you the expected results.
As you can see, it might be worth applying this MySQL extension to the GROUP BY clause. Anyway, if this is not 100% clear yet then there is a rule of thumb that will make sure that your grouping will always be correct: Always group, at least, by all the non-aggregated fields in the select clause. You might be wasting a few CPU cycles in certain situations but it is better than returning indeterminate results. If you're still terrified about not grouping correctly then changing the ONLY_FULL_GROUP_BY SQL mode could be a last resort :)
May your grouping be correct and performant... or at least correct.
Lets say I have a plant table:
id fruit
1 banana
2 apple
3 orange
I can do these
SELECT * FROM plant ORDER BY id;
SELECT * FROM plant ORDER BY fruit DESC;
which does the obvious thing.
But I was bitten by this, what does this do?
SELECT * FROM plant ORDER BY SUM(id);
SELECT * FROM plant ORDER BY COUNT(fruit);
SELECT * FROM plant ORDER BY COUNT(*);
SELECT * FROM plant ORDER BY SUM(1) DESC;
All these return just the first row (which is with id = 1).
What's happening underhood?
What are the scenarios where aggregate function will come in handy in ORDER BY?
Your results are more clear if you actually select the aggregate values instead of columns from the table:
SELECT SUM(id) FROM plant ORDER BY SUM(id)
This will return the sum of all id's. This is of course a useless example because the aggregation will always create only one row, hence no need for ordering. The reason you get a row qith columns in your query is because MySQL picks one row, not at random but not deterministic either. It just so happens that it is the first column in the table in your case, but others may get another row depending on storage engine, primary keys and so on. Aggregation only in the ORDER BY clause is thus not very useful.
What you usually want to do is grouping by a certain field and then order the result set in some way:
SELECT fruit, COUNT(*)
FROM plant
GROUP BY fruit
ORDER BY COUNT(*)
Now that's a more interesting query! This will give you one row for each fruit together with the total count for that fruit. Try adding some more apples and the ordering will actually start making sense:
Complete table:
+----+--------+
| id | fruit |
+----+--------+
| 1 | banana |
| 2 | apple |
| 3 | orange |
| 4 | apple |
| 5 | apple |
| 6 | banana |
+----+--------+
The query above:
+--------+----------+
| fruit | COUNT(*) |
+--------+----------+
| orange | 1 |
| banana | 2 |
| apple | 3 |
+--------+----------+
All these queries will all give you a syntax error on any SQL platform that complies with SQL standards.
SELECT * FROM plant ORDER BY SUM(id);
SELECT * FROM plant ORDER BY COUNT(fruit);
SELECT * FROM plant ORDER BY COUNT(*);
SELECT * FROM plant ORDER BY SUM(1) DESC;
On PostgreSQL, for example, all those queries will raise the same error.
ERROR: column "plant.id" must appear in the GROUP BY clause or be
used in an aggregate function
That means you're using a domain aggregate function without using GROUP BY. SQL Server and Oracle return similar error messages.
MySQL's GROUP BY is known to be broken in several respects, at least as far as standard behavior is concerned. But the queries you posted were a new broken behavior to me, so +1 for that.
Instead of trying to understand what it's doing under the hood, you're probably better off learning to write standard GROUP BY queries. MySQL will process standard GROUP BY statements correctly, as far as I know.
Earlier versions of MySQL docs warned you about GROUP BY and hidden columns. (I don't have a reference, but this text is cited all over the place.)
Do not use this feature if the columns you omit from the GROUP BY part
are not constant in the group. The server is free to return any value
from the group, so the results are indeterminate unless all values are
the same.
More recent versions are a little different.
You can use this feature to get better performance by avoiding
unnecessary column sorting and grouping. However, this is useful
primarily when all values in each nonaggregated column not named in
the GROUP BY are the same for each group. The server is free to choose
any value from each group, so unless they are the same, the values
chosen are indeterminate.
Personally, I don't consider indeterminate a feature in SQL.
When you use an aggregate like that, the query gets an implicit group by where the entire result is a single group.
Using an aggregate in order by is only useful if you also have a group by, so that you can have more than one row in the result.