Remove Duplicate record from Mysql Table using Group By - mysql

I have a table structure and data below.
I need to remove duplicate record from the table list. My confusion is that when I am firing query
SELECT * FROM `table` GROUP BY CONCAT(`name`,department)
then giving me correct list(12 records).
Same query when I am using the subquery:
SELECT *
FROM `table` WHERE id IN (SELECT id FROM `table` GROUP BY CONCAT(`name`,department))
It returning all record which is wrong.
So, My question is why group by in subquery is not woking.

Actually as Tim mentioned in his answer that it to get first unique record by group by clause is not a standard feature of sql but mysql allows it till mysql5.6.16 version but from 5.6.21 it has been changed.
Just change mysql version in your sql fiddle and check that you will get what you want.

In the query
SELECT * FROM `table` GROUP BY CONCAT(`name`,department)
You are selecting the id column, which is a non-aggregate column. Many RDBMS would give you an error, but MySQL allows this for performance reasons. This means MySQL has to choose which record to retain in the result set. Based on the result set in your original problem, it appears that MySQL is retaining the id of the first duplicate record, in cases where a group has more than one member.
In the query
SELECT *
FROM `table`
WHERE id IN
(
SELECT id FROM `table` GROUP BY CONCAT(`name`,department)
)
you are also selecting a non-aggregate column in the subquery. It appears that MySQL actually decides which id value to be retained in the subquery based on the id value in the outer query. That is, for each id value in table, MySQL performs the subquery and then selectively chooses to retain a record in the group if two id values match.
You should avoid using a non-aggregate column in a query with GROUP BY, because it is a violation of the ANSI standard, and as you have seen here it can result in unexpected results. If you give us more information about what result set you want, we can give you a correct query which will avoid this problem.
I welcome anyone who has documentation to support these observations to either edit my question or post a new one.

You can JOIN the grouped ids with that of table ids, so that you can get desired results.
Example:
SELECT t.* FROM so_q32175332 t
JOIN ( SELECT id FROM so_q32175332
GROUP BY CONCAT( name, department ) ) f
ON t.id = f.id
ORDER BY CONCAT( name, department );
Here order by was added just to compare directly the * results on group.
Demo on SQL Fiddle: http://sqlfiddle.com/#!9/d715a/1

Related

GROUP BY clause in MySQL groups records with different values

MySQL GROUP BY clause groups records even when they have different values.
However I would like it to as with DB2 SQL so that if records not contain exactly the same information they are not grouped.
Currently in MySQL for:
id Name
A Amanda
A Ana
the Group by id would return 1 record randomly (unless aggregation clauses used of course)
However in DB2 SQL the same Group by id would not group those: returning 2 records and never doing such a thing as picking randomly one of the values when grouping without using aggregation functions.
First, id is a bad name for a column that is not the primary key of a table. But that is not relevant to your question.
This query:
select id, name
from t
group by id;
returns an error in almost any database other than MySQL. The problem is that name is not in the group by and is not the argument of an aggregation function. The failure is ANSI-standard behavior, not honored by MySQL.
A typical way to write the query is:
select id, max(name)
from t
group by id;
This should work in all databases (assuming name is not some obscure type where max() doesn't work).
Or, if you want each name, then:
select id, name
from t
group by id, name;
or the simpler:
select distinct id, name
from t;
In MySQL, you can get the ANSI standard behavior by setting ONLY_FULL_GROUP_BY for the database/session. MySQL will then return an error, as DB2 does in this case.
The most recent versions of MySQL have ONLY_FULL_GROUP_BY set by default.
Group by in mysql will group the records according to the set fields. Think of it as: It gets one and the others will not show up. It has uses, for example, to count how many times that ID is repeated on the table:
select count(id), id from table group by id
You can, however, to achieve your purpose, group by multiple fields, something among the lines of:
select * from table group by id, name
I do not think there is an automated way to do this but using
GROUP BY id, name
Would give you the solution you are looking for

Subquery returns more rows than straight same query in MySQL

I want to remove duplicates based on the combination of listings.product_id and listings.channel_listing_id
This simple query returns 400.000 rows (the id's of the rows I want to keep):
SELECT id
FROM `listings`
WHERE is_verified = 0
GROUP BY product_id, channel_listing_id
While this variation returns 1.600.000 rows, which are all records on the table, not only is_verified = 0:
SELECT *
FROM (
SELECT id
FROM `listings`
WHERE is_verified = 0
GROUP BY product_id, channel_listing_id
) AS keepem
I'd expect them to return the same amount of rows.
What's the reason for this? How can I avoid it (in order to use the subselect in the where condition of the DELETE statement)?
EDIT: I found that doing a SELECT DISTINCT in the outer SELECT "fixes" it (it returns 400.000 records as it should). I'm still not sure if I should trust this subquery, for there is no DISTINCT in the DELETE statement.
EDIT 2: Seems to be just a bug in the way phpMyAdmin reports the total count of the rows.
Your query as it stands is ambiguous. Suppose you have two listings with the same product_id and channel_id. Then what id is supposed to be returned? The first, the second? Or both, ignoring the GROUP request?
What if there is more than one id with different product and channel ids?
Try removing the ambiguity by selecting MAX(id) AS id and adding DISTINCT.
Are there any foreign keys to worry about? If not, you could pour the original table into a copy, empty the original and copy back in it the non-duplicates only. Messier, but you only do SELECTs or DELETEs guaranteed to succeed, and you also get to keep a backup.
Assign aliases in order to avoid field reference ambiguity:
SELECT
keepem.*
FROM
(
SELECT
innerStat.id
FROM
`listings` AS innerStat
WHERE
innerStat.is_verified = 0
GROUP BY
innerStat.product_id,
innerStat.channel_listing_id
) AS keepem

SQL: Group By is mismatching records

I'm trying to get the highest version within a group. My query:
SELECT
rubric_id,
max(version) as version,
group_id
FROM
rubrics
WHERE
client_id = 1
GROUP BY
group_id
The Data:
The Results:
The rubric of ID 2 does not have a version of 2, why is this being mismatched? What do I need to do to correct this?
Edit, not a duplicate:
This is not a duplicate of SQL Select only rows with Max Value on a Column , which is a post I have read and referenced before writing this. My question is not how to find the max, my question is why is the version not matched to the correct ID
MySQL is confusing you by letting you get away with having a column in your select that isn't in your group by. To resolve the issue, make sure you don't select any field that isn't in the group by.
Instead of trying to get everything in one statement, you will need to use a subquery to find the max_version_id and then join to it.
SELECT T.*
FROM rubrics T
JOIN
(
SELECT
group_id,
max(version) as max_version
FROM
rubrics
GROUP BY
group_id
) dedupe
on T.group_id = dedupe.group_id
and T.version_id = dedupe.max_version_id
WHERE
T.client_id = 1
Edit: So MySQL allows it, but I don't think it's a good practise to use it.
You are trying to query non-aggregated data from an aggregated query. You should not do that.
A GROUP BY takes the field it should make group of rows with (in your case, what you say with your GROUP BY is: give me a result per different group_id) and gives a result (the aggregated data) based on the grouping.
Here, you try to access non aggregated data (rubric_id in your case). For some reason, the query does not crash and picks a "random" id in your aggregated data.

Why does MySQL allow you to group by columns that are not selected

I'm reading a book on SQL (Sams Teach Yourself SQL in 10 Minutes) and its quite good despite its title. However the chapter on group by confuses me
"Grouping data is a simple process. The selected columns (the column list following
the SELECT keyword in a query) are the columns that can be referenced in the GROUP
BY clause. If a column is not found in the SELECT statement, it cannot be used in the
GROUP BY clause. This is logical if you think about it—how can you group data on a
report if the data is not displayed? "
How come when I ran this statement in MySQL it works?
select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
You're right, MySQL does allow you to create queries that are ambiguous and have arbitrary results. MySQL trusts you to know what you're doing, so it's your responsibility to avoid queries like that.
You can make MySQL enforce GROUP BY in a more standard way:
mysql> SET SQL_MODE=ONLY_FULL_GROUP_BY;
mysql> select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
ERROR 1055 (42000): 'test.EMPLOYEE_PAY_TBL.EMP_ID' isn't in GROUP BY
Because the book is wrong.
The columns in the group by have only one relationship to the columns in the select according to the ANSI standard. If a column is in the select, with no aggregation function, then it (or the expression it is in) needs to be in the group by statement. MySQL actually relaxes this condition.
This is even useful. For instance, if you want to select rows with the highest id for each group from a table, one way to write the query is:
select t.*
from table t
where t.id in (select max(id)
from table t
group by thegroup
);
(Note: There are other ways to write such a query, this is just an example.)
EDIT:
The query that you are suggesting:
select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
would work in MySQL but probably not in any other database (unless BONUS happens to be a poorly named primary key on the table, but that is another matter). It will produce one row for each value of BONUS. For each row, it will get an arbitrary EMP_ID and SALARY from rows in that group. The documentation actually says "indeterminate", but I think arbitrary is easier to understand.
What you should really know about this type of query is simply not to use it. All the "bare" columns in the SELECT (that is, with no aggregation functions) should be in the GROUP BY. This is required in most databases. Note that this is the inverse of what the book says. There is no problem doing:
select EMP_ID
from EMPLOYEE_PAY_TBL
group by EMP_ID, BONUS;
Except that you might get multiple rows back for the same EMP_ID with no way to distinguish among them.

How do records get ordered in a mysql group by?

so suppose I do
SELECT * FROM table t GROUP BY t.id
Suppose there are multiple rows in the table with the same id, only one row of that id will ultimately come out...I suppose mysql will order the results that have the same id and the return the first one or something....my question is...how does mysql perform this ordering and is there a way that I can control its ordering so that for instance, it uses a certain field etc?
You can GROUP BY several different columns to arrange the order for which the result is grouped.
SELECT * FROM table t GROUP BY t.id, t.foo, t.bar
In strictly correct SQL, when you use GROUP BY all the values being selected must be either columns named in the GROUP BY clause or aggregate functions. MySQL allows you to violate this rule, unless you set the only_full_group_by mode. But although it allows you to perform such queries, it doesn't specify which row will be selected for each grouped column.
If you want to select a row that corresponds to the max or min of some other column, you can do something like this:
SELECT a.*
FROM table a
JOIN (SELECT id, max(somecol) maxcol
FROM table
GROUP BY id) b
ON a.id = b.id
AND a.somecol = b.maxcol
Note that this can still return multiple rows per ID if they both have the max value. You can add a final GROUP BY clause and it will select one of them arbitrarily.
When using this MySQL extension to the standard SQL GROUP BY functionality you cannot control which values for the non-aggregated, non-GROUP BY columns are selected by the server. The MySQL documentation discusses this specific case and states clearly that
The server is free to choose any value from each group, so unless they
are the same, the values chosen are indeterminate. Furthermore, the
selection of values from each group cannot be influenced by adding an
ORDER BY clause.