Could someone explain why the following query throws an error, if I am trying to get the names of all customers along with the total number of customers?
SELECT name, COUNT(*)
FROM CUSTOMER
I know that selecting columns along with an aggregate function requires a GROUP BY statement containing all the column names, but I don't understand the logical principle behind this.
edit:
http://sqlfiddle.com/#!2/90233/595
I guess 'error' isn't quite right, but notice how the current query returns Alison 9 as the only result.
I don't understand why it doesn't return:
Alison 9
Alison 9
Alison 9
Alison 9
Jason 9
...
(This is a new answer based on the comment and looking at the fiddle.)
The issue here is how MySQL handles aggregate functions -- which is non-standard and different from most other databases.
MySQL lets you select a non-aggregated column next to an aggregate function (count() is an example of an aggregate function) without a group by. Most other SQL implementations reject such a query. When you have a group by, you have to say the range in the group by (for example, group by name). Also, every selected column has to be in the range or be the result of an aggregate function.
Since you don't have a range, MySQL assumes the whole table, and since you have a column that is neither the result of an aggregate function nor in the range (in this case name), MySQL does something to make that column the result of an aggregate function. It does not apply max(); it picks an arbitrary value from the group. So the SQL that effectively gets executed is
SELECT ANY_VALUE(name), COUNT(*)
FROM CUSTOMER
Thus you only see one name.
mysql documentation - http://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
After reading the above I see that mysql will use the default aggregate function ANY_VALUE() for columns which are not in the range.
If you just want the total number of customers on each row you could do this
SELECT DISTINCT NAME, COUNT(NAME) OVER () AS CustomerCount
FROM CUSTOMER
In this way you don't need the GROUP BY syntax. Under the covers it is probably doing the same thing as @GordonLinoff's answer.
I added this because maybe it makes it clearer how group by works.
Select name, Count(*) as 'CountCustomers'
FROM CUSTOMER
Group by name
Order by name
Think of it as giving an instruction of which field to aggregate by. For example, if you had a field with the State of the Customer, you could group by State which would give a count of customers by state.
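To make the "which field to aggregate by" idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in database (not MySQL), and the state column and the sample rows are made up for the example.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in; the data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO customer (name, state) VALUES (?, ?)",
    [("Alison", "CA"), ("Alison", "NY"), ("Jason", "CA")],
)

# Grouping by name: one output row per distinct name.
by_name = conn.execute(
    "SELECT name, COUNT(*) FROM customer GROUP BY name ORDER BY name"
).fetchall()

# Grouping by state: one output row per distinct state.
by_state = conn.execute(
    "SELECT state, COUNT(*) FROM customer GROUP BY state ORDER BY state"
).fetchall()

print(by_name)
print(by_state)
```

The same table gives a different set of rows depending on the column named in GROUP BY, because that column decides which rows are counted together.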
Also, note you can have multiple aggregate functions in the same select using the "over (partition by" construct.
If you want the names along with the total number of customers, then use a window function:
select name, count(*) as NumCustomersWithName,
sum(count(*)) over () as NumCustomers
from customer
group by name;
Edit:
You actually seem to want:
select name, count(*) over () as NumCustomers
from customer;
In MySQL, you would do this with a subquery:
select name, cnt
from customer cross join
(select count(*) as cnt from customer) x;
The reason your query doesn't work is because it is an aggregation query that returns exactly one row. When you use aggregation functions without a GROUP BY, then the query always returns exactly one row.
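Here is a small sketch of that behaviour, assuming an in-memory SQLite database (via Python's sqlite3 module) as a stand-in for MySQL and three made-up customer rows. It shows the aggregate alone collapsing to one row, and the cross join repeating the total on every row.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT)")
conn.executemany(
    "INSERT INTO customer (name) VALUES (?)",
    [("Alison",), ("Alison",), ("Jason",)],
)

# An aggregate without GROUP BY treats the whole table as one group.
one_row = conn.execute("SELECT COUNT(*) FROM customer").fetchall()

# Cross joining the one-row total back onto the table keeps every name.
per_row = conn.execute(
    """
    SELECT name, cnt
    FROM customer
    CROSS JOIN (SELECT COUNT(*) AS cnt FROM customer) x
    ORDER BY name
    """
).fetchall()

print(one_row)
print(per_row)
```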
Related
I read that it is not correct to use an aggregate function along with the GROUP BY clause.
Below are some points which I found in the book.
When using the GROUP BY clause with a SELECT statement:
• The SELECT’s columnlist must include a combination of column names and aggregate functions.
• The GROUP BY clause’s columnlist must include all nonaggregate function columns specified in
the SELECT’s columnlist. If required, you could also group by any aggregate function columns that
appear in the SELECT’s columnlist.
• The GROUP BY clause columnlist can include any columns from the tables in the FROM clause of the
SELECT statement, even if they do not appear in the SELECT’s columnlist.
However, when I use the GROUP BY clause along with an aggregate function (I used MySQL), there is no error generated and I get the answer expected. Why is this?
As an example see this code:
SELECT MONTHS * SALARY FROM EMPLOYEE GROUP BY MONTHS * SALARY;
SELECT COUNT(*) FROM EMPLOYEE GROUP BY COUNT(*);
I get no error generated. I find the points given in the book more logical because at the beginning, the table does not have a column MONTHS * SALARY. So how does this happen? And why does this observation contradict with the points given in the book?
My database is called: (training_session)
I am trying to print out some information from my data, but I do not want to have any duplicates. I still get them somehow; could someone tell me what I am doing wrong?
SELECT DISTINCT athlete_id AND duration FROM training_session
SELECT DISTINCT athlete_id, duration FROM training_session
It works perfectly if I use only one column, but when I add another, it does not work.
I think you misunderstood the use of DISTINCT.
There is big difference between using DISTINCT and GROUP BY.
Both have a similar goal, but they serve different purposes.
You use DISTINCT if you want to show a set of columns without repeating any row. That means you don't care about calculations or group function aggregates. DISTINCT will show different results as you keep adding more columns to your SELECT (if the table has many columns).
You use GROUP BY if you want one row per distinct combination of certain selected columns and you use group functions to calculate data related to each group. Therefore, you use GROUP BY if you want to use group functions.
Please check group functions you can use in this link.
https://dev.mysql.com/doc/refman/8.0/en/group-by-functions.html
EDIT 1:
It seems like you are trying to get the "latest" of a certain athlete, I'll assume the current scenario if there is no ID.
Here is my alternate solution:
SELECT a.athlete_id ,
( SELECT b.duration
FROM training_session as b
WHERE b.athlete_id = a.athlete_id -- connect
ORDER BY [latest column to sort] DESC
LIMIT 1
) last_duration
FROM training_session as a
GROUP BY a.athlete_id
ORDER BY a.athlete_id
This syntax is called an in-select subquery (a scalar subquery in the SELECT list). With the help of LIMIT 1, it returns only the topmost record. An in-select subquery must return at most one row, or else it raises an error.
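A runnable sketch of the same pattern follows. It assumes a hypothetical session_date column in place of the "[latest column to sort]" placeholder, uses made-up rows, and runs on SQLite through Python's sqlite3 module rather than on MySQL.

```python
import sqlite3

# In-memory SQLite stand-in; session_date and all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE training_session (athlete_id INTEGER, duration INTEGER, session_date TEXT)"
)
conn.executemany(
    "INSERT INTO training_session VALUES (?, ?, ?)",
    [(1, 30, "2024-01-01"), (1, 45, "2024-01-05"), (2, 30, "2024-01-02")],
)

# For each athlete, the subquery returns the duration of the newest session.
latest = conn.execute(
    """
    SELECT a.athlete_id,
           (SELECT b.duration
            FROM training_session AS b
            WHERE b.athlete_id = a.athlete_id
            ORDER BY b.session_date DESC
            LIMIT 1) AS last_duration
    FROM training_session AS a
    GROUP BY a.athlete_id
    ORDER BY a.athlete_id
    """
).fetchall()

print(latest)
```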
MySQL's DISTINCT clause is used to filter out duplicate rows.
If your query was SELECT DISTINCT athlete_id FROM training_session then your output would be:
athlete_id
----------
1
2
3
4
5
6
As soon as you add another column to your query (in your example, the column called duration), DISTINCT applies to the combination of both columns, so every distinct pair of athlete_id and duration is returned. Hence the results you're getting. In other words, the query is working correctly.
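A short sketch of the difference, assuming made-up rows and SQLite (through Python's sqlite3 module) as a stand-in for MySQL:

```python
import sqlite3

# In-memory SQLite stand-in; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE training_session (athlete_id INTEGER, duration INTEGER)")
conn.executemany(
    "INSERT INTO training_session VALUES (?, ?)",
    [(1, 30), (1, 30), (1, 45), (2, 30)],
)

# DISTINCT over one column: each athlete_id appears once.
one_column = conn.execute(
    "SELECT DISTINCT athlete_id FROM training_session ORDER BY athlete_id"
).fetchall()

# DISTINCT over two columns: each (athlete_id, duration) pair appears once,
# so athlete 1 is listed twice because it has two different durations.
two_columns = conn.execute(
    "SELECT DISTINCT athlete_id, duration FROM training_session "
    "ORDER BY athlete_id, duration"
).fetchall()

print(one_column)
print(two_columns)
```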
I have 2 tables called Orders and Salesperson shown below:
And I want to retrieve the names of all salespeople that have more than 1 order from the tables above.
Running the following query produces an error:
SELECT Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id
HAVING COUNT( salesperson_id ) >1
The error is:
Column 'Name' is invalid in the select list because it is
not contained in either an aggregate function or
the GROUP BY clause.
From the error and from searching on Google, I understand that the error occurs because the Name column must be either part of the GROUP BY clause or inside an aggregate function.
I also tried to understand why the selected column has to be in the GROUP BY clause or part of an aggregate function, but I didn't understand it clearly.
So, how to fix this error?
SELECT max(Name) as Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id
HAVING COUNT( salesperson_id ) >1
The basic idea is that columns that are not in the GROUP BY clause need to be in an aggregate function. Here, since the name is probably the same for every salesperson_id, min or max makes no real difference (the result is the same).
Example:
Looking at your data, you have 3 entries for Dan (7). When the join is created, the row with Dan (Name) gets multiplied by 3 (one Dan for every order Number), and then the server does not know which "Dan" to pick, because to the server those are 3 separate rows even though they are semantically the same.
also try this so that you see what I am talking about:
SELECT Orders.Number, Salesperson.Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
As far as the query goes, INNER JOIN is a better solution since it is the standard. For this simple query it should not matter, although in some cases INNER JOIN can produce better results. As far as I know this is mostly a legacy concern, since these days the server should produce pretty much the same execution plan for both.
For code clarity I would stick with INNER JOIN
Assuming the name is unique to the salesperson.id then simply add it to your group by clause
GROUP BY salesperson_id, salesperson.Name
Otherwise use any Agg function
Select Min(Name)
The reason for this is that SQL doesn't know whether there are multiple names per salesperson.id.
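Both variants can be tried side by side. This is only a sketch: it runs on SQLite via Python's sqlite3 module rather than on the database from the question, and the rows (including the salesperson "Abe") are made up, apart from Dan having ID 7 and three orders.

```python
import sqlite3

# In-memory SQLite stand-in; table contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Salesperson (ID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE Orders (Number INTEGER, salesperson_id INTEGER)")
conn.executemany("INSERT INTO Salesperson VALUES (?, ?)", [(1, "Abe"), (7, "Dan")])
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?)", [(10, 1), (20, 7), (30, 7), (40, 7)]
)

# Variant 1: add Name to the GROUP BY clause.
grouped = conn.execute(
    """
    SELECT Salesperson.Name
    FROM Orders
    INNER JOIN Salesperson ON Orders.salesperson_id = Salesperson.ID
    GROUP BY Orders.salesperson_id, Salesperson.Name
    HAVING COUNT(*) > 1
    """
).fetchall()

# Variant 2: wrap Name in an aggregate function.
aggregated = conn.execute(
    """
    SELECT MIN(Salesperson.Name) AS Name
    FROM Orders
    INNER JOIN Salesperson ON Orders.salesperson_id = Salesperson.ID
    GROUP BY Orders.salesperson_id
    HAVING COUNT(*) > 1
    """
).fetchall()

print(grouped)
print(aggregated)
```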
For readability and correctness, I usually split aggregate queries into two parts:
The aggregate query
Any additional queries to support fields not contained in aggregate functions
So:
1. Aggregate query - salespeople with more than 1 order
SELECT salesperson_id
FROM ORDERS
GROUP BY salesperson_id
HAVING COUNT(Number) > 1
2. Use the aggregate as a subquery (basically a select joining onto another select) to join on any additional fields:
SELECT *
FROM Salesperson SP
INNER JOIN
(
SELECT salesperson_id
FROM ORDERS
GROUP BY salesperson_id
HAVING COUNT(Number) > 1
) AGG_QUERY
ON AGG_QUERY.salesperson_id = SP.ID
There are other approaches, such as selecting the additional fields via aggregation functions (as shown by the other answers). These get the code written quickly so if you are writing the query under time pressure you may prefer that approach. If the query needs to be maintained (and hence readable) I would favour subqueries.
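The two-step approach can be sketched end to end as follows. The sample rows are made up (only Dan with ID 7 and three orders comes from the question), and the sketch runs on SQLite through Python's sqlite3 module instead of the original database.

```python
import sqlite3

# In-memory SQLite stand-in; table contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Salesperson (ID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE Orders (Number INTEGER, salesperson_id INTEGER)")
conn.executemany("INSERT INTO Salesperson VALUES (?, ?)", [(1, "Abe"), (7, "Dan")])
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?)", [(10, 1), (20, 7), (30, 7), (40, 7)]
)

# Step 1 (inner query): IDs of salespeople with more than one order.
# Step 2 (outer query): join those IDs back to Salesperson for the other fields.
rows = conn.execute(
    """
    SELECT SP.ID, SP.Name
    FROM Salesperson SP
    INNER JOIN (
        SELECT salesperson_id
        FROM Orders
        GROUP BY salesperson_id
        HAVING COUNT(Number) > 1
    ) AGG_QUERY ON AGG_QUERY.salesperson_id = SP.ID
    """
).fetchall()

print(rows)
```

Because the aggregation happens entirely inside the subquery, the outer SELECT can list any Salesperson column without touching the GROUP BY.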
I'm reading a book on SQL (Sams Teach Yourself SQL in 10 Minutes) and it's quite good despite its title. However, the chapter on GROUP BY confuses me:
"Grouping data is a simple process. The selected columns (the column list following
the SELECT keyword in a query) are the columns that can be referenced in the GROUP
BY clause. If a column is not found in the SELECT statement, it cannot be used in the
GROUP BY clause. This is logical if you think about it—how can you group data on a
report if the data is not displayed? "
How come when I ran this statement in MySQL it works?
select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
You're right, MySQL does allow you to create queries that are ambiguous and have arbitrary results. MySQL trusts you to know what you're doing, so it's your responsibility to avoid queries like that.
You can make MySQL enforce GROUP BY in a more standard way:
mysql> SET SQL_MODE=ONLY_FULL_GROUP_BY;
mysql> select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
ERROR 1055 (42000): 'test.EMPLOYEE_PAY_TBL.EMP_ID' isn't in GROUP BY
Because the book is wrong.
The columns in the group by have only one relationship to the columns in the select according to the ANSI standard. If a column is in the select, with no aggregation function, then it (or the expression it is in) needs to be in the group by statement. MySQL actually relaxes this condition.
This is even useful. For instance, if you want to select rows with the highest id for each group from a table, one way to write the query is:
select t.*
from table t
where t.id in (select max(id)
from table t
group by thegroup
);
(Note: There are other ways to write such a query, this is just an example.)
EDIT:
The query that you are suggesting:
select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
would work in MySQL but probably not in any other database (unless BONUS happens to be a poorly named primary key on the table, but that is another matter). It will produce one row for each value of BONUS. For each row, it will get an arbitrary EMP_ID and SALARY from rows in that group. The documentation actually says "indeterminate", but I think arbitrary is easier to understand.
What you should really know about this type of query is simply not to use it. All the "bare" columns in the SELECT (that is, with no aggregation functions) should be in the GROUP BY. This is required in most databases. Note that this is the inverse of what the book says. There is no problem doing:
select EMP_ID
from EMPLOYEE_PAY_TBL
group by EMP_ID, BONUS;
Except that you might get multiple rows back for the same EMP_ID with no way to distinguish among them.
I have a table named forms with the following structure-
GROUP | FORM | FILEPATH
====================================
SomeGroup | SomeForm1 | SomePath1
SomeGroup | SomeForm2 | SomePath2
------------------------------------
I use the following query-
SELECT * FROM forms GROUP BY 'GROUP'
It returns only the first row-
GROUP | FORM | FILEPATH
====================================
SomeGroup | SomeForm1 | SomePath1
------------------------------------
Shouldn't it return both (or all of it)? Or am I (possibly) wrong?
As the manual states:
In standard SQL, a query that includes a GROUP BY clause cannot refer to nonaggregated columns in the select list that are not named in the GROUP BY clause. For example, this query is illegal in standard SQL because the name column in the select list does not appear in the GROUP BY:
SELECT o.custid, c.name, MAX(o.payment)
FROM orders AS o, customers AS c
WHERE o.custid = c.custid
GROUP BY o.custid;
For the query to be legal, the name column must be omitted from the select list or named in the GROUP BY clause.
MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. This means that the preceding query is legal in MySQL. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.
In your case, MySQL is correctly performing the grouping operation, but (since you select all columns including those by which you are not grouping the query) gives you an indeterminate one record from each group.
It only returns one row, because the values of your GROUP column are the same ... that's basically how GROUP BY works.
Btw, when using GROUP BY it's good form to use aggregate functions for the other columns, such as COUNT(), MIN(), MAX(). In MySQL it usually returns the first row of each group if you just specify the column names; other databases will not like that though.
Your code:
SELECT * FROM forms GROUP BY 'GROUP'
isn't very "good" SQL. MySQL lets you get away with it and returns only one value for all columns not mentioned in the GROUP BY clause. Almost any other database would not perform this query. As a rule, any column that is not part of the grouping condition must be used with an aggregate function.
As far as MySQL is concerned, I just solved my problem by trial and error.
I had the same problem 10 minutes ago. I was using a MySQL statement something like this:
SELECT * FROM forms GROUP BY 'ID'; -- returns only one row
However, using a statement like the following would yield the same result:
SELECT ID FROM forms GROUP BY 'ID'; -- returns only one row
The following was my solution:
SELECT ID FROM forms GROUP BY ID; -- returns more than one row (with one column of field "ID") grouped by ID
or
SELECT * FROM forms GROUP BY ID; -- returns more than one row (with columns of all fields) grouped by ID
or
SELECT * FROM forms GROUP BY `ID`; -- returns more than one row (with columns of all fields) grouped by ID
Lesson: Do not put single quotes around the column name. A quoted name is treated as a string literal, so every row ends up in the same group. Remove the quotes from the column name and it will group by its value. However, you can use backtick escapes, e.g. `ID`.
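The effect of the quotes can be checked with a small sketch. It uses SQLite through Python's sqlite3 module as a stand-in for MySQL, with made-up rows, and counts the rows per group so the difference is visible.

```python
import sqlite3

# In-memory SQLite stand-in; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forms (ID INTEGER, FORM TEXT)")
conn.executemany(
    "INSERT INTO forms VALUES (?, ?)",
    [(1, "SomeForm1"), (1, "SomeForm2"), (2, "SomeForm3")],
)

# 'ID' in single quotes is a constant string, so all rows share one group.
quoted = conn.execute("SELECT COUNT(*) FROM forms GROUP BY 'ID'").fetchall()

# ID without quotes is the column, so there is one group per distinct value.
unquoted = conn.execute(
    "SELECT ID, COUNT(*) FROM forms GROUP BY ID ORDER BY ID"
).fetchall()

print(quoted)
print(unquoted)
```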
Thank you everyone for pointing out the obvious mistake I was too blind to see. I finally replaced GROUP BY with ORDER BY and included a WHERE clause to get my desired result. That is what I was intending to use all along. Silly me.
My final query becomes this-
SELECT * FROM forms WHERE GROUP='SomeGroup' ORDER BY 'GROUP'
SELECT * FROM forms GROUP BY `GROUP`
it's strange that your query does work
The above result is kind of correct, but not quite.
All columns you select which are not part of the GROUP BY statement have to be aggregated by some function (see the list of aggregation functions in the MySQL documentation). Most often they are used together with numeric columns.
Besides this, your query will return one output row for every (combination of) attributes in the columns referenced in the GROUP BY statement. In your case there is just one distinct value in the GROUP column, namely "SomeGroup", so the output will only contain one row for this value.
A GROUP BY clause should only be needed if you apply group functions, say max, min, avg, sum, etc., in the query expressions. Your query does not use any such functions, meaning you do not actually need a GROUP BY clause. If you still use such a clause, you will receive only the first record from each group of results.
Hence the output of your query is correct.
Query result is perfect; it will return only one row.