DISTINCT as a clause or a function - mysql

I was doing some experiments with the DISTINCT keyword and some particular situations caught my attention.
First of all I noticed that I can put parentheses after DISTINCT, for example:
SELECT DISTINCT(NAME) FROM EMPLOYEE;
is ok, while
SELECT DISTINCT(NAME, SURNAME) FROM EMPLOYEE;
gives me an error. Why?
And what's the sense of allowing operations like this one?
SELECT DISTINCT(NAME), COUNT(SURNAME) FROM EMPLOYEE;

DISTINCT (or DISTINCTROW) is not a function; it is an option of the SELECT statement that tells MySQL to strip duplicate rows from the resultset generated by the query. By default it returns all the generated rows, including the duplicates.
It can also be used with the COUNT() aggregate function (as COUNT(DISTINCT expr)), where it has the same meaning: duplicates are ignored.
Because DISTINCT is not a function, DISTINCT(NAME) is interpreted as DISTINCT followed by the expression (NAME) (which is the same as NAME).
SELECT DISTINCT(NAME, SURNAME) FROM EMPLOYEE doesn't work because (NAME, SURNAME) is not a valid MySQL expression.
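As a rough illustration of the points above (a sketch against the EMPLOYEE table from the question, not checked against any particular MySQL version), the multi-column case works once the parentheses are dropped:

```sql
-- DISTINCT applies to the whole select list, so list the columns without parentheses.
SELECT DISTINCT NAME, SURNAME FROM EMPLOYEE;

-- Parentheses around a single column are just a parenthesized expression.
SELECT DISTINCT (NAME) FROM EMPLOYEE;   -- same as SELECT DISTINCT NAME

-- Inside COUNT(), DISTINCT makes the aggregate ignore duplicate values.
SELECT COUNT(DISTINCT NAME) FROM EMPLOYEE;

-- MySQL also accepts several expressions here; it counts the distinct
-- combinations in which none of the expressions is NULL.
SELECT COUNT(DISTINCT NAME, SURNAME) FROM EMPLOYEE;
```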

DISTINCT is nothing really -- more of a modifier. It is used in two places. It can modify aggregation functions. The only one worth using is COUNT(DISTINCT).
In your case, you are using it as a modifier to SELECT. It simply says that each entire row should be distinct.
So, when you say:
SELECT DISTINCT (name, surname)
This is the same as getting all distinct rows of (name, surname) from this query:
SELECT (name, surname)
MySQL does not recognize this syntax. In some databases, this represents a tuple or row constructor. But, because MySQL does not recognize this syntax, you are getting an error.
As a note: SELECT DISTINCT can be quite powerful but it is often not necessary. More typically, you can do a GROUP BY with the column you actually want "distinct"ed. For instance:
SELECT name, surname, COUNT(*)
FROM t
GROUP BY name, surname
ORDER BY COUNT(*) DESC;
This query actually shows the duplicates and how often they occur.

Related

How to find items with multiple different names

I'm having trouble with a very simple sql query. I want to identify all items that have more than one name. Here's what I'm currently doing:
select group_concat(distinct name) names
from table
group by master_id
having names like '%,%'
Unfortunately, a lot of names contain a comma, so the above doesn't work well. What would be the correct way to write this query?
Here is a correct version of your query:
SELECT
master_id,
GROUP_CONCAT(DISTINCT name) names
FROM yourTable
GROUP BY master_id
HAVING COUNT(DISTINCT name) > 1;
The reason we need to count distinct in the HAVING clause is that a logical item in the aggregated string is a distinct name.
The correct solution would be:
… HAVING COUNT(name) > 1
In a query using GROUP BY, aggregate functions such as COUNT(), MIN(), MAX(), and GROUP_CONCAT() operate on all values of a column within each group of rows.
You could also include COUNT(name) in the select list to return the number of names for each master_id. Note that COUNT(name) counts every non-NULL row, so if the same name can appear more than once for a master_id, use COUNT(DISTINCT name) instead.
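To see how the two HAVING conditions can differ, here is a small sketch. The table name items and the sample rows are made up for illustration; only master_id and name come from the question.

```sql
-- Hypothetical sample data; the table name `items` is an assumption.
CREATE TABLE items (master_id INT, name VARCHAR(50));
INSERT INTO items (master_id, name) VALUES
  (1, 'Smith, John'),
  (1, 'Smith, John'),   -- the same name stored twice
  (2, 'Acme'),
  (2, 'Acme, Inc.');    -- a genuinely different name

SELECT master_id,
       COUNT(name)          AS name_rows,
       COUNT(DISTINCT name) AS distinct_names,
       -- A separator that cannot occur in the data avoids the comma ambiguity.
       GROUP_CONCAT(DISTINCT name SEPARATOR ' | ') AS names
FROM items
GROUP BY master_id
ORDER BY master_id;
-- master_id 1: name_rows = 2, distinct_names = 1
-- master_id 2: name_rows = 2, distinct_names = 2
```

With this data, HAVING COUNT(name) > 1 keeps both items, while HAVING COUNT(DISTINCT name) > 1 keeps only master_id 2.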

GROUP BY clause in MySQL groups records with different values

MySQL's GROUP BY clause groups records even when they have different values.
However, I would like it to behave as DB2 SQL does, so that records which do not contain exactly the same information are not grouped.
Currently in MySQL for:
id Name
A Amanda
A Ana
GROUP BY id would return one record, picked arbitrarily (unless aggregate functions are used, of course).
In DB2 SQL, however, the same GROUP BY id would not group those rows: it returns 2 records and never picks one of the values at random when grouping without aggregate functions.
First, id is a bad name for a column that is not the primary key of a table. But that is not relevant to your question.
This query:
select id, name
from t
group by id;
returns an error in almost any database other than MySQL. The problem is that name is not in the group by and is not the argument of an aggregation function. The failure is ANSI-standard behavior, not honored by MySQL.
A typical way to write the query is:
select id, max(name)
from t
group by id;
This should work in all databases (assuming name is not some obscure type where max() doesn't work).
Or, if you want each name, then:
select id, name
from t
group by id, name;
or the simpler:
select distinct id, name
from t;
In MySQL, you can get the ANSI standard behavior by setting ONLY_FULL_GROUP_BY for the database/session. MySQL will then return an error, as DB2 does in this case.
MySQL 5.7.5 and later have ONLY_FULL_GROUP_BY set by default.
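A minimal sketch of switching the mode on for the current session, assuming the table t from the queries above and that id is not a primary or unique key of t:

```sql
-- Inspect the SQL mode currently in effect for this session.
SELECT @@SESSION.sql_mode;

-- Add ONLY_FULL_GROUP_BY to whatever modes are already set.
SET SESSION sql_mode = CONCAT(@@SESSION.sql_mode, ',ONLY_FULL_GROUP_BY');

-- With the mode enabled, the following statement is rejected, because name
-- is neither in the GROUP BY list nor inside an aggregate function:
--   SELECT id, name FROM t GROUP BY id;

-- If an arbitrary name per id is really what you want, MySQL 5.7 and later
-- let you say so explicitly with ANY_VALUE().
SELECT id, ANY_VALUE(name) AS some_name
FROM t
GROUP BY id;
```

The SET SESSION form only affects the current connection; making the change permanent is a server configuration matter and is not shown here.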
GROUP BY in MySQL groups the records according to the listed fields. Think of it this way: one row per group is kept and the others do not show up. It is useful, for example, to count how many times each id is repeated in the table:
select count(id), id from table group by id
To achieve your purpose, however, you can group by multiple fields, something along the lines of:
select * from table group by id, name
I do not think there is an automatic way to do this, but using
GROUP BY id, name
would give you the solution you are looking for.

MySQL employee database

I am having trouble designing a query that does the following:
List employee names, employee numbers, and their respective total earningPerProject using the following database schema:
department(primary key(deptName), deptName, deptCity)
employee(primary key(empNum), empName, empCity)
project(primary key(projectNum), projectName, budget)
worksOn(foreign key(empNum), foreign key(projectNum), deptNum, jobTitle, startDate, earningPerProject)
I am able to display the employee names and employee numbers but when it comes to the total of the earningPerProject for each employee I am lost.
Some employees are listed more than once, and I realize I have to use the aggregate functions SUM() and COUNT(), but I haven't figured out a way to do this successfully.
Here is what I have so far:
SELECT DISTINCT(empName), employee.empNum, earningPerProject FROM employee, worksOn
WHERE worksOn.empNum = employee.empNum;
Could someone assist me with some hints or example queries. I am not sure how I would go about doing this.
Here you must use the GROUP BY clause and SUM() to compute the total earningPerProject for each employee.
DISTINCT is not necessary. In your code you used DISTINCT(empName) which looks like you want to eliminate duplicate employee names in the result. It is possible to have two employees with the same name so retrieving only unique names could leave some employees out of your results. This is why we use things like empNum as a primary key instead of names. You actually want to retrieve the distinct combos of empNum and empName.
You are correct that there can be duplicate empNum in the worksOn table because a given employee could work on multiple projects. The GROUP BY will group together all rows having the same empNum and empName and combine them into a single row thus eliminating the need for DISTINCT. (More below)
Here I have modified your query to include the SUM() and GROUP BY.
SELECT employee.empNum, employee.empName, SUM(worksOn.earningPerProject)
FROM employee, worksOn
WHERE employee.empNum = worksOn.empNum
GROUP BY employee.empNum, employee.empName;
JOIN
The syntax used in your FROM clause (FROM employee, worksOn), where you list the tables to be joined on the same line, separated by commas, is known as an implicit join. SQL-92 introduced the explicit JOIN syntax as an alternative, and the comma form has been discouraged ever since, according to Join (SQL).
Best practice dictates that you switch to using the new syntax known as the explicit join by using the JOIN keyword with the added ON keyword to describe the link between the tables.
The new JOIN syntax is functionally equivalent to the old implicit join syntax. Both produce the same results.
SELECT employee.empNum, employee.empName, SUM(worksOn.earningPerProject)
FROM employee
JOIN worksOn ON employee.empNum = worksOn.empNum
GROUP BY employee.empNum, employee.empName;
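To make the effect of the grouping concrete, here is a small self-contained sketch. The tables are cut down to the columns the query needs, and the sample rows (including two employees who share a name) are invented for illustration.

```sql
-- Reduced versions of the question's tables; the rows are hypothetical.
CREATE TABLE employee (empNum INT PRIMARY KEY, empName VARCHAR(50));
CREATE TABLE worksOn (empNum INT, projectNum INT, earningPerProject DECIMAL(10,2));

INSERT INTO employee (empNum, empName) VALUES
  (1, 'Ann'), (2, 'Ann'), (3, 'Bob');          -- two different employees named Ann
INSERT INTO worksOn (empNum, projectNum, earningPerProject) VALUES
  (1, 10, 100.00), (1, 11, 50.00),             -- employee 1 works on two projects
  (2, 10, 70.00),
  (3, 12, 20.00);

SELECT e.empNum, e.empName, SUM(w.earningPerProject) AS totalEarnings
FROM employee AS e
JOIN worksOn AS w ON e.empNum = w.empNum
GROUP BY e.empNum, e.empName
ORDER BY e.empNum;
-- 1, Ann, 150.00
-- 2, Ann,  70.00
-- 3, Bob,  20.00
```

Grouping by empName alone would merge the two Anns into one row, which is why empNum belongs in the GROUP BY list.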
DISTINCT
DISTINCT is a SQL keyword that eliminates duplicate result rows based on the expressions in your SELECT list. If you request only one expression (SELECT DISTINCT empCity FROM employee) it returns the unique values for that expression (it only shows each city once). If you request more than one expression it returns unique combinations of those expressions.
Many database engines use GROUP BY to calculate DISTINCT results so using them together is usually redundant.
Your query includes some unfortunately legal SQL syntax. You put parentheses around empName, which gave SELECT DISTINCT (empName), employee.empNum, .... This syntax is misleading because DISTINCT is a keyword and not a function, and the parentheses here are not used by DISTINCT. When DISTINCT is used it applies to all expressions in the SELECT. In this case removing the parentheses does not change the meaning, though it does make the query clearer.
These three queries are equivalent:
SELECT DISTINCT empName, employee.empNum, ...
SELECT DISTINCT (empName), employee.empNum, ...
SELECT DISTINCT empName, (employee.empNum), ...
Parentheses in SQL can be used to group expressions and are typically used to force the order of evaluation when dealing with operators such as <, >, =, *, /. Placing parentheses around a single expression does not change its value. While you thought you were using DISTINCT for just empName you really were just wrapping the expression empName in parentheses which effectively did nothing.
You can test this by running this query
SELECT empName FROM employee
and this query
SELECT (empName) FROM employee
and you will see the same results.

Why is the following SQL query erroneous?

/* erroneous query */
select dept name, ID, avg (salary)
from instructor
group by dept name;
I know that every non-aggregated function must appear in group by if it appears in select. However, this query still runs in MySQL.
should it be:
/* erroneous query */
select dept name, ID, avg (salary)
from instructor
group by dept name, ID;
I ask because I ran both queries and they give exactly the same answers!
MySQL will allow you to leave non-aggregated columns out of your GROUP BY, which seems like a terrible idea to me. This can produce very unpredictable results. The MySQL documentation on GROUP BY handling describes this behavior.
it should be:
select [dept name], ID, AVG(salary)
from instructor
group by [dept name]
It would be more instructive to see the columns defined in your table, but you CANNOT have spaces in a column name unless the name is quoted, as I did above. (Square brackets are the SQL Server quoting style; MySQL quotes identifiers with backticks instead, as in `dept name`.)
From the MySQL documentation on this particular point:
In standard SQL, a query that includes a GROUP BY clause cannot refer
to non-aggregated columns in the select list that are not named in the
GROUP BY clause. For example, this query is illegal in standard SQL
because the name column in the select list does not appear in the
GROUP BY ...
MySQL extends the use of GROUP BY so that the select list can refer to
nonaggregated columns not named in the GROUP BY clause. This means
that the preceding query is legal in MySQL. You can use this feature
to get better performance by avoiding unnecessary column sorting and
grouping. However, this is useful primarily when all values in each
nonaggregated column not named in the GROUP BY are the same for each
group.
So, roughly speaking, it can look as if the omitted columns were added automatically.
However, note that it is not exactly the same. Have a look at this example.
SELECT name, address, MAX(age) FROM customers GROUP BY name, address;
might give you a different result than:
SELECT name, address, MAX(age) FROM customers GROUP BY name;
Check this Fiddle.
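A small sketch of how the two queries can diverge. The sample rows are invented, and the second query assumes ONLY_FULL_GROUP_BY is disabled; with the mode enabled MySQL rejects it instead.

```sql
-- Hypothetical data: one customer name with two different addresses.
CREATE TABLE customers (name VARCHAR(50), address VARCHAR(100), age INT);
INSERT INTO customers (name, address, age) VALUES
  ('Ann', '1 Oak St', 30),
  ('Ann', '2 Elm St', 40);

-- Two groups, one per (name, address) pair, so two rows come back:
-- ('Ann', '1 Oak St', 30) and ('Ann', '2 Elm St', 40).
SELECT name, address, MAX(age) FROM customers GROUP BY name, address;

-- One group for 'Ann', so a single row comes back. MAX(age) is 40, but the
-- address is whichever value MySQL happens to pick for the group.
SELECT name, address, MAX(age) FROM customers GROUP BY name;
```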
Your statement that "I know that every non-aggregated function must appear in group by if it appears in select" is, as far as I know, correct. I am not a SQL guru, but that's my understanding too. I would have expected an error to be flagged if your statement did not meet that condition. However, if both queries give the same result, one possibility is that all rows have the same value in the ID field, or whatever field is missing from the GROUP BY list. Try different values and see. Also, it may help to use AS explicitly for aliases rather than blanks.
MySQL extends the use of GROUP BY so you can select nonaggregated columns, not named in the group by clause:
SELECT dept_name, ID, avg(salary)
FROM instructor
GROUP BY dept_name;
The previous query is perfectly valid in MySQL (with ONLY_FULL_GROUP_BY disabled), while other DBMSs will raise an error because ID is not present in the GROUP BY clause.
However, if there is more than one ID for a dept_name, the value of ID returned by MySQL is indeterminate.
You can configure MySQL to disable this extension by enabling the ONLY_FULL_GROUP_BY SQL mode.

Scope of COUNT(DISTINCT ..) when used with GROUP BY

I'm doing something like the following (example: getting distinct people named "Mark" by state):
Select count(distinct FirstName) FROM table
GROUP BY State
I think the group by query organization is done first, such that the distinct is only relative to each "group by"? Basically, can "Mark" show up as a "distinct" count in each group? This would "scope" my distinct expression to the group by rows only, I believe...
This may actually depend on where DISTINCT is used. For example, SELECT DISTINCT COUNT( would be different than SELECT COUNT(DISTINCT.
In this case, it will work as you want and get a count of distinct names in each group (even if the names are not distinct across groups).
Your understanding is correct. Group by says, essentially, to take a group of rows and aggregate them into one row (based on the criteria). All aggregation functions -- including count(distinct) -- summarize values in this group.
As a note, you are using the word "scope". Just so you know, this has a particular meaning in SQL. It refers to the portions of the query where a column or table alias is understood by the compiler.
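The per-group behavior described in these answers can be checked with a small sketch. The table name people and its rows are invented for illustration; FirstName and State come from the question.

```sql
-- Hypothetical sample data.
CREATE TABLE people (FirstName VARCHAR(50), State CHAR(2));
INSERT INTO people (FirstName, State) VALUES
  ('Mark', 'TX'), ('Mark', 'TX'), ('Ann', 'TX'),
  ('Mark', 'CA');

-- COUNT(DISTINCT ...) is evaluated separately inside each group,
-- so 'Mark' is counted once in TX and once again in CA.
SELECT State, COUNT(DISTINCT FirstName) AS distinct_names
FROM people
GROUP BY State
ORDER BY State;
-- CA: 1
-- TX: 2

-- Without GROUP BY, the whole table is a single group.
SELECT COUNT(DISTINCT FirstName) FROM people;   -- 2 (Mark, Ann)

-- SELECT DISTINCT COUNT(...) is different: it counts every row per group
-- (3 for TX, 1 for CA) and then removes duplicate result rows.
SELECT DISTINCT COUNT(FirstName) FROM people GROUP BY State;
```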