I am having trouble designing a query that does the following:
List employee names, employee numbers, and their respective total earningPerProject using the following database schema:
department(primary key(deptName), deptName, deptCity)
employee(primary key(empNum), empName, empCity)
project(primary key(projectNum), projectName, budget)
worksOn(foreign key(empNum), foreign key(projectNum), deptNum, jobTitle, startDate, earningPerProject)
I am able to display the employee names and employee numbers but when it comes to the total of the earningPerProject for each employee I am lost.
Some employees are listed more than once, and I realize I have to use the aggregate functions SUM() and COUNT(), but I haven't figured out a way to do this successfully.
Here is what I have so far:
SELECT DISTINCT(empName), employee.empNum, earningPerProject FROM employee, worksOn
WHERE worksOn.empNum = employee.empNum;
Could someone assist me with some hints or example queries? I am not sure how I would go about doing this.
Here you must use the GROUP BY clause and SUM() to compute the total earningPerProject for each employee.
DISTINCT is not necessary. In your code you used DISTINCT(empName) which looks like you want to eliminate duplicate employee names in the result. It is possible to have two employees with the same name so retrieving only unique names could leave some employees out of your results. This is why we use things like empNum as a primary key instead of names. You actually want to retrieve the distinct combos of empNum and empName.
You are correct that there can be duplicate empNum in the worksOn table because a given employee could work on multiple projects. The GROUP BY will group together all rows having the same empNum and empName and combine them into a single row thus eliminating the need for DISTINCT. (More below)
Here I have modified your query to include the SUM() and GROUP BY.
SELECT employee.empNum, employee.empName, SUM(worksOn.earningPerProject)
FROM employee, worksOn
WHERE employee.empNum = worksOn.empNum
GROUP BY employee.empNum, employee.empName;
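If you want to try the query above without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up, SQLite stands in for your real database, and I left out the columns the query doesn't touch. Two employees share the name Ann on purpose, to show why grouping on empNum matters.

```python
import sqlite3

# In-memory database with invented sample data (assumption: not your real rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (empNum INTEGER PRIMARY KEY, empName TEXT, empCity TEXT);
    CREATE TABLE worksOn (empNum INTEGER, projectNum INTEGER, earningPerProject REAL);
    INSERT INTO employee VALUES (1, 'Ann', 'Oslo'), (2, 'Bob', 'Rome'), (3, 'Ann', 'Lima');
    INSERT INTO worksOn VALUES (1, 10, 100), (1, 11, 250), (2, 10, 300), (3, 12, 50);
""")

# One output row per employee; SUM() adds up that employee's worksOn rows.
totals = conn.execute("""
    SELECT employee.empNum, employee.empName, SUM(worksOn.earningPerProject)
    FROM employee, worksOn
    WHERE employee.empNum = worksOn.empNum
    GROUP BY employee.empNum, employee.empName
    ORDER BY employee.empNum
""").fetchall()

for row in totals:
    print(row)
```

Both Anns keep their own row and their own total because the grouping includes empNum.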
JOIN
The syntax used in your FROM clause (FROM employee, worksOn), where you list the tables to be joined on the same line separated by commas, is known as an implicit join. SQL-92 introduced the explicit JOIN syntax as an alternative; the implicit form is still valid SQL, but it is no longer considered best practice according to Join (SQL).
Best practice dictates that you switch to using the new syntax known as the explicit join by using the JOIN keyword with the added ON keyword to describe the link between the tables.
The new JOIN syntax is functionally equivalent to the old implicit join syntax. Both produce the same results.
SELECT employee.empNum, employee.empName, SUM(worksOn.earningPerProject)
FROM employee
JOIN worksOn ON employee.empNum = worksOn.empNum
GROUP BY employee.empNum, employee.empName;
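To check the equivalence yourself, this small sketch runs both join styles against the same made-up data and compares the results. It uses Python's sqlite3 module as a stand-in for MySQL, and the rows (including Cy, who has no projects) are my own invention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (empNum INTEGER PRIMARY KEY, empName TEXT);
    CREATE TABLE worksOn (empNum INTEGER, projectNum INTEGER, earningPerProject REAL);
    INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO worksOn VALUES (1, 10, 100), (1, 11, 250), (2, 10, 300);
""")

# Old-style implicit join: tables listed with commas, link in WHERE.
implicit = conn.execute("""
    SELECT employee.empNum, employee.empName, SUM(worksOn.earningPerProject)
    FROM employee, worksOn
    WHERE employee.empNum = worksOn.empNum
    GROUP BY employee.empNum, employee.empName
    ORDER BY employee.empNum
""").fetchall()

# Explicit join: same link, expressed with JOIN ... ON.
explicit = conn.execute("""
    SELECT employee.empNum, employee.empName, SUM(worksOn.earningPerProject)
    FROM employee
    JOIN worksOn ON employee.empNum = worksOn.empNum
    GROUP BY employee.empNum, employee.empName
    ORDER BY employee.empNum
""").fetchall()

print(implicit == explicit)
```

Note that both forms are inner joins, so an employee with no worksOn rows (Cy here) does not appear in either result.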
DISTINCT
DISTINCT is a SQL keyword that eliminates duplicate result rows based on the expressions in your SELECT list. If you request only one expression (SELECT empCity FROM employee) it returns the unique values for that expression (it only shows each city once). If you request more than one expression it returns unique combinations of those expressions.
Many database engines use GROUP BY to calculate DISTINCT results so using them together is usually redundant.
Your query includes some unfortunately legal SQL syntax. You put parentheses around empName which gave SELECT DISTINCT (empName), employee.empNum, .... This syntax is misleading because DISTINCT is a keyword and not a function and the parentheses here are not used by DISTINCT. When DISTINCT is used it applies to all expressions in the SELECT. In this case removing the parentheses does not change the meaning, though it does make the query clearer.
These three queries are equivalent:
SELECT DISTINCT empName, employee.empNum, ...
SELECT DISTINCT (empName), employee.empNum, ...
SELECT DISTINCT empName, (employee.empNum), ...
Parentheses in SQL can be used to group expressions and are typically used to force the order of evaluation when dealing with operators such as <, >, =, *, /. Placing parentheses around a single expression does not change its value. While you thought you were using DISTINCT for just empName you really were just wrapping the expression empName in parentheses which effectively did nothing.
You can test this by running this query
SELECT empName FROM employee
and this query
SELECT (empName) FROM employee
and you will see the same results.
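Here is a rough sketch of the same experiment with DISTINCT in the picture, using Python's sqlite3 module and invented rows (SQLite is an assumption here, standing in for your database). It shows that the parentheses change nothing and that DISTINCT works on the whole row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (empNum INTEGER PRIMARY KEY, empName TEXT);
    INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Ann');
""")

# DISTINCT applies to the (empName, empNum) combination in both queries.
rows_plain = conn.execute(
    "SELECT DISTINCT empName, empNum FROM employee ORDER BY empNum").fetchall()
rows_paren = conn.execute(
    "SELECT DISTINCT (empName), empNum FROM employee ORDER BY empNum").fetchall()

# Only when empName is the sole expression do duplicate names collapse.
names_only = conn.execute(
    "SELECT DISTINCT empName FROM employee ORDER BY empName").fetchall()

print(rows_plain == rows_paren)
print(rows_plain)
print(names_only)
```

Both Anns survive in the first two queries because their empNum values differ, so the rows are not duplicates.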
Related
I was doing some experiments with the DISTINCT keyword and some particular situations caught my attention.
First of all I noticed that I can put some parentheses with DISTINCT, for example:
SELECT DISTINCT(NAME) FROM EMPLOYEE;
is ok, while
SELECT DISTINCT(NAME, SURNAME) FROM EMPLOYEE;
gives me an error. Why?
And what's the sense of allowing operations like this one?
SELECT DISTINCT(NAME), COUNT(SURNAME) FROM EMPLOYEE;
DISTINCT (or DISTINCTROW) is not a function; it is an option of the SELECT statement that tells MySQL to strip duplicate rows from the resultset generated by the query. By default it returns all the generated rows, including the duplicates.
It can also be used together with COUNT() aggregate function (as COUNT(DISTINCT expr)) and it has the same meaning: it ignores the duplicates.
Because DISTINCT is not a function, DISTINCT(NAME) is interpreted as DISTINCT followed by the expression (NAME) (which is the same as NAME).
SELECT DISTINCT(NAME, SURNAME) FROM EMPLOYEE doesn't work because (NAME, SURNAME) is not a valid MySQL expression.
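As a small illustration of the COUNT(DISTINCT expr) form mentioned above, here is a sketch using Python's sqlite3 module. SQLite is only a stand-in for MySQL and the three rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (NAME TEXT, SURNAME TEXT);
    INSERT INTO EMPLOYEE VALUES ('Ann', 'Lee'), ('Ann', 'Ray'), ('Bob', 'Lee');
""")

# COUNT(NAME) counts every non-NULL name; COUNT(DISTINCT NAME) ignores repeats.
all_names, distinct_names = conn.execute(
    "SELECT COUNT(NAME), COUNT(DISTINCT NAME) FROM EMPLOYEE").fetchone()

print(all_names, distinct_names)
```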
DISTINCT is nothing really -- more of a modifier. It is used in two places. It can modify aggregation functions. The only one worth using is COUNT(DISTINCT).
In your case, you are using it as a modifier to SELECT. It simply says that each entire row should be distinct.
So, when you say:
SELECT DISTINCT (name, surname)
This is the same as getting all distinct rows of (name, surname) from this query:
SELECT (name, surname)
MySQL does not recognize this syntax. In some databases, this represents a tuple or row constructor. But, because MySQL does not recognize this syntax, you are getting an error.
As a note: SELECT DISTINCT can be quite powerful but it is often not necessary. More typically, you can do a GROUP BY with the column you actually want "distinct"ed. For instance:
SELECT name, surname, COUNT(*)
FROM t
GROUP BY name, surname
ORDER BY COUNT(*) DESC;
This query actually shows the duplicates and how often they occur.
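A quick way to see that in action is the sketch below, which runs the same GROUP BY through Python's sqlite3 module. The table t and its rows are placeholders I made up; one (name, surname) pair is inserted twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (name TEXT, surname TEXT);
    INSERT INTO t VALUES ('Ann', 'Lee'), ('Bob', 'Ray'), ('Ann', 'Lee');
""")

# One row per distinct (name, surname), plus how many times it occurs.
dupes = conn.execute("""
    SELECT name, surname, COUNT(*)
    FROM t
    GROUP BY name, surname
    ORDER BY COUNT(*) DESC
""").fetchall()

for row in dupes:
    print(row)
```

Unlike SELECT DISTINCT, the result tells you that the Ann Lee row was present twice.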
Could someone explain why the following query throws an error, if I am trying to get the names of all customers along with the total number of customers?
SELECT name, COUNT(*)
FROM CUSTOMER
I know that selecting columns along with an aggregate function requires a GROUP BY statement containing all the column names, but I don't understand the logical principle behind this.
edit:
http://sqlfiddle.com/#!2/90233/595
I guess 'error' isn't quite right, but notice how the current query returns Allison 9 as the only result.
I don't understand why it doesn't return:
Alison 9
Alison 9
Alison 9
Alison 9
Jason 9
...
(This is a new answer based on the comment and looking at the fiddle.)
The issue here is how MySQL handles aggregate functions, which is non-standard and different from most other databases.
Every SQL implementation lets you use an aggregate function (COUNT() is an example) without a GROUP BY; the whole table is then treated as a single group. What standard SQL does not allow is mixing an aggregate with a plain column that is neither inside an aggregate function nor listed in the GROUP BY. When you do have a GROUP BY, you state the grouping columns (for example GROUP BY name), and every selected column has to be one of those grouping columns or the result of an aggregate function.
Since you don't have a GROUP BY, MySQL treats the whole table as one group, and since name is neither aggregated nor a grouping column, MySQL picks a value for it from some row in the group. Which row is not specified, so the query behaves as if it were written like this:
SELECT ANY_VALUE(name), COUNT(*)
FROM CUSTOMER
Thus you only see one name.
mysql documentation - http://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
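After reading the above I see that MySQL returns an indeterminate value for columns that are not in the GROUP BY, which is what ANY_VALUE() spells out explicitly. Note that from MySQL 5.7.5 the ONLY_FULL_GROUP_BY mode is enabled by default, so the original query is rejected unless that mode is turned off.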
If you just want the total number of customers on each row you could do this
SELECT DISTINCT NAME, COUNT(NAME) OVER () AS CustomerCount
FROM CUSTOMER
In this way you don't need the GROUP BY syntax. Under the covers it is probably doing the same thing as @GordonLinoff's answer.
I added this because maybe it makes it clearer how group by works.
Select name, Count(*) as 'CountCustomers'
FROM CUSTOMER
Group by name
Order by name
Think of it as giving an instruction of which field to aggregate by. For example, if you had a field with the State of the Customer, you could group by State which would give a count of customers by state.
Also, note you can have multiple aggregate functions in the same select using the "over (partition by" construct.
If you want the names along with the total number of customers, then use a window function:
select name, count(*) as NumCustomersWithName,
sum(count(*)) over () as NumCustomers
from customer
group by name;
Edit:
You actually seem to want:
select name, count(*) over () as NumCustomers
from customer;
In MySQL versions before 8.0, which do not support window functions, you would do this with a subquery:
select name, cnt
from customer cross join
(select count(*) as cnt from customer) x;
The reason your query doesn't work is because it is an aggregation query that returns exactly one row. When you use aggregation functions without a GROUP BY, then the query always returns exactly one row.
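To make the "exactly one row" point concrete, here is a minimal sketch with Python's sqlite3 module and invented customer rows (SQLite is an assumption, standing in for MySQL). It runs the bare aggregate and then the cross join variant that repeats the total on every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (name TEXT);
    INSERT INTO customer VALUES ('Alison'), ('Alison'), ('Jason');
""")

# An aggregate with no GROUP BY collapses the whole table into a single row.
one_row = conn.execute("SELECT COUNT(*) FROM customer").fetchall()

# Cross joining against that single row attaches the total to every customer row.
per_row = conn.execute("""
    SELECT name, cnt
    FROM customer CROSS JOIN
         (SELECT COUNT(*) AS cnt FROM customer) x
    ORDER BY name
""").fetchall()

print(one_row)
print(per_row)
```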
I'm reading a book on SQL (Sams Teach Yourself SQL in 10 Minutes) and it's quite good despite its title. However, the chapter on GROUP BY confuses me.
"Grouping data is a simple process. The selected columns (the column list following
the SELECT keyword in a query) are the columns that can be referenced in the GROUP
BY clause. If a column is not found in the SELECT statement, it cannot be used in the
GROUP BY clause. This is logical if you think about it—how can you group data on a
report if the data is not displayed? "
How come when I ran this statement in MySQL it works?
select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
You're right, MySQL does allow you to create queries that are ambiguous and have arbitrary results. MySQL trusts you to know what you're doing, so it's your responsibility to avoid queries like that.
You can make MySQL enforce GROUP BY in a more standard way:
mysql> SET SQL_MODE=ONLY_FULL_GROUP_BY;
mysql> select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
ERROR 1055 (42000): 'test.EMPLOYEE_PAY_TBL.EMP_ID' isn't in GROUP BY
Because the book is wrong.
The columns in the group by have only one relationship to the columns in the select according to the ANSI standard. If a column is in the select, with no aggregation function, then it (or the expression it is in) needs to be in the group by statement. MySQL actually relaxes this condition.
Grouping by a column that is not in the SELECT is even useful. For instance, if you want to select rows with the highest id for each group from a table, one way to write the query is:
select t.*
from table t
where t.id in (select max(id)
from table t
group by thegroup
);
(Note: There are other ways to write such a query, this is just an example.)
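Here is a runnable sketch of that pattern using Python's sqlite3 module. The table name items and its rows are hypothetical (the query above uses table as a placeholder, which is a reserved word), and SQLite stands in for your database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, thegroup TEXT);
    INSERT INTO items VALUES (1, 'a'), (2, 'a'), (3, 'b');
""")

# The subquery groups by thegroup without selecting it, which is legal:
# it only needs to return the highest id found in each group.
latest = conn.execute("""
    SELECT i.*
    FROM items i
    WHERE i.id IN (SELECT MAX(id)
                   FROM items
                   GROUP BY thegroup)
    ORDER BY i.id
""").fetchall()

print(latest)
```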
EDIT:
The query that you are suggesting:
select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
would work in MySQL but probably not in any other database (unless BONUS happens to be a poorly named primary key on the table, but that is another matter). It will produce one row for each value of BONUS. For each row, it will get an arbitrary EMP_ID and SALARY from rows in that group. The documentation actually says "indeterminate", but I think arbitrary is easier to understand.
What you should really know about this type of query is simply not to use it. All the "bare" columns in the SELECT (that is, with no aggregation functions) should be in the GROUP BY. This is required in most databases. Note that this is the inverse of what the book says. There is no problem doing:
select EMP_ID
from EMPLOYEE_PAY_TBL
group by EMP_ID, BONUS;
Except that you might get multiple rows back for the same EMP_ID with no way to distinguish among them.
I am trying to GROUP BY a GROUP_CONCAT in a MySQL query. The trio of values that is generated by this GROUP_CONCAT is the only unique identifier that I have to describe the group I want to apply the GROUP BY to.
When I do the following :
SELECT [...] GROUP_CONCAT(DISTINCT xxx) as supsku
[...]
GROUP BY supsku
it says :
Can't group on 'supsku'
Thanks a lot
One way to go is to try a subselect:
SELECT t.* FROM (
SELECT [...] GROUP_CONCAT(DISTINCT xxx) as supsku
[...]
) t
GROUP BY supsku
You can't group by a column whose contents don't exist until after the groups are formed. That's a chicken-and-egg problem.
By analogy, suppose I ask you to scratch off some lottery tickets, but scratch them only if the total value of the winning tickets is more than $100? Obviously, you can't know what the winning values are before you scratch the lottery tickets, so you can't know if you should scratch them or not.
The answer from @MKhalidJunaid shows part of the solution -- using a subquery to produce a partial result with the strings formed into groups. Then embed that as a derived table subquery to be further processed by an outer query with a GROUP BY.
But the problem with that solution is that we don't know how to group the strings in the inner subquery. Without a valid GROUP BY in the subquery, the default is to treat the whole table as one group, and therefore GROUP_CONCAT will return one row with one string.
So you need to think about defining your problem better. There must be some other grouping criterion you have in mind.
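To illustrate what the two-level version looks like once an inner grouping criterion exists, here is a sketch with Python's sqlite3 module. Everything in it is an assumption on my part: the table product_sku, the column product used as the inner grouping key, and the sample rows. SQLite's GROUP_CONCAT stands in for MySQL's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_sku (product INTEGER, sku TEXT);
    INSERT INTO product_sku VALUES
        (1, 'A'), (1, 'A'), (2, 'A'), (3, 'B'), (4, 'C'), (4, 'D');
""")

# Inner query: build one concatenated string per product (the assumed
# grouping key). Outer query: group those finished strings.
groups = conn.execute("""
    SELECT supsku, COUNT(*)
    FROM (SELECT product, GROUP_CONCAT(DISTINCT sku) AS supsku
          FROM product_sku
          GROUP BY product) t
    GROUP BY supsku
    ORDER BY supsku
""").fetchall()

# SQLite does not guarantee the order of elements inside a concatenated
# string, so only the group sizes are printed here.
print([count for _, count in groups])
```

Products 1 and 2 end up in the same outer group because both concatenate to the same string. Without the inner GROUP BY product, the subquery would return a single row, as described above.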
/* erroneous query */
select dept name, ID, avg (salary)
from instructor
group by dept name;
I know that every non-aggregated function must appear in group by if it appears in select. However this query still runs in mySQL.
should it be:
/* erroneous query */
select dept name, ID, avg (salary)
from instructor
group by dept name, **ID**;
Because I ran the both queries and they give the exact same answers!
MySQL will allow you to leave non-aggregated columns out of your GROUP BY, which is just a terrible idea to me. This can give some very unpredictable results. Here's a link to the documentation:
Clicky!
it should be:
select [dept name], ID, AVG(salary)
from instructor
group by [dept name]
Now it would be more instructive to show the columns defined in your table, but you CANNOT have spaces in a column name without quoting the column name, like I did above. (Square brackets are SQL Server syntax; in MySQL you would use backticks instead.)
From the MySQL documentation on this particular point:
In standard SQL, a query that includes a GROUP BY clause cannot refer
to non-aggregated columns in the select list that are not named in the
GROUP BY clause. For example, this query is illegal in standard SQL
because the name column in the select list does not appear in the
GROUP BY ...
MySQL extends the use of GROUP BY so that the select list can refer to
nonaggregated columns not named in the GROUP BY clause. This means
that the preceding query is legal in MySQL. You can use this feature
to get better performance by avoiding unnecessary column sorting and
grouping. However, this is useful primarily when all values in each
nonaggregated column not named in the GROUP BY are the same for each
group.
So, roughly speaking, the omitted columns get added automatically.
However, note that it is not exactly the same. Have a look at this example.
SELECT name, address, MAX(age) FROM customers GROUP BY name, address;
might give you something different from:
SELECT name, address, MAX(age) FROM customers GROUP BY name;
Check this Fiddle.
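If the fiddle is unavailable, the difference can be reproduced with the sketch below, which uses Python's sqlite3 module and made-up rows. SQLite is an assumption here: like MySQL with the extension enabled, it accepts the second query, but the address it reports for a group is engine-specific, so the sketch only compares the number of rows and the MAX(age) values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT, address TEXT, age INTEGER);
    INSERT INTO customers VALUES
        ('Ann', 'Oslo', 30), ('Ann', 'Rome', 40), ('Bob', 'Lima', 25);
""")

# Grouping on both columns: one row per (name, address) pair.
full = conn.execute("""
    SELECT name, address, MAX(age) FROM customers
    GROUP BY name, address ORDER BY name, address
""").fetchall()

# Grouping on name only: one row per name, address is whatever the engine picks.
partial = conn.execute("""
    SELECT name, address, MAX(age) FROM customers
    GROUP BY name ORDER BY name
""").fetchall()

print(len(full), len(partial))
print([(name, age) for name, _, age in partial])
```

The two Ann rows stay separate in the first result and merge into one in the second, so the queries are not interchangeable.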
Your statement that "I know that every non-aggregated function must appear in group by if it appears in select" is, as far as I know, correct. I am not a SQL guru but that's my understanding too. I would have expected a syntax error to be flagged if your statement does not meet that condition. However, if it gives the same result, then one possibility is that you have the same value in all rows for the ID field, or whatever field it is that is missing from the group by list. Just check different values and see. Also, it may help to expressly use "as" for aliases rather than blanks.
MySQL extends the use of GROUP BY so you can select nonaggregated columns, not named in the group by clause:
SELECT dept_name, ID, avg(salary)
FROM instructor
GROUP BY dept_name;
The previous query is perfectly valid in MySQL, while other DBMSs will raise an error because ID is not present in the GROUP BY clause.
However, if there is more than one ID for each dept_name, the value of ID returned by MySQL will be indeterminate.
You can configure MySQL to disable this extension by enabling the ONLY_FULL_GROUP_BY SQL mode.