Delete duplicate rows in MySQL based on contents of another table

I have a MySQL (5.4) table that has some rows with duplicate fields (2-5 copies sometimes) that I'd like to remove, leaving only one. But it's not as simple as just picking the highest or lowest id. The duplicates I'd like to remove are those that don't have corresponding entries in another table.
Table tb_email_to_members has email_id (auto-incrementing) and email_address (and other fields that aren't relevant). For example:
email_id   email_address
-------------------------
1          arnold@foo.com
2          foo@foo.com
3          foo@foo.com
4          foo@foo.com
5          jeanluc@foo.com
Table tb_tx has tx_id (auto-incrementing) and frn_email_id (and other fields that aren't relevant), where tb_tx.frn_email_id matches up with tb_email_to_members.email_id. For example:
tx_id   frn_email_id
--------------------------
100     5
101     2
102     19
103     19
104     19
105     1
I want to remove rows where email_address is duplicated one or more times in tb_email_to_members, but only when there are NO rows containing frn_email_id in tb_tx for the email_id that comes from tb_email_to_members. I need to make sure to leave one row of the duplicates, even if none of them have corresponding entries in tb_tx. In the examples above, I want to remove rows 3 and 4 from tb_email_to_members, since only row 2 exists in tb_tx.
(In essence, tb_email_to_members maps email addresses to user accounts in yet another table, and tb_tx maps orders to those email addresses from tb_email_to_members.)
I can find the duplicates easily, and I see lots of code for deleting duplicates, but not with the tweak of needing to delete only certain duplicates based on the failure of a lookup from another table. Suggestions?

@MHardwick and @ShadowRay almost got it right. The following also checks that the email exists more than once in tb_email_to_members:
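DELETE FROM tb_email_to_members
WHERE email_id NOT IN (SELECT frn_email_id FROM tb_tx)
-- The derived table (AS dup) lets MySQL read the table it is deleting from (avoids error 1093).
AND email_address IN (SELECT email_address FROM
    (SELECT email_address FROM tb_email_to_members GROUP BY email_address HAVING COUNT(email_address) > 1) AS dup);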
And obviously changing DELETE to SELECT * will show you what exactly you're about to delete.
Bonus points for knowing tb is short for tidbits?
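To make the preview-then-delete idea concrete, here is a minimal sketch that runs the query above against the sample rows from the question. It uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption made purely so the example is self-contained. The inner query is wrapped in a derived table, which MySQL needs when a DELETE reads from its own target table and which SQLite accepts as well.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; rows copied from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb_email_to_members (email_id INTEGER PRIMARY KEY, email_address TEXT);
    CREATE TABLE tb_tx (tx_id INTEGER PRIMARY KEY, frn_email_id INTEGER NOT NULL);
    INSERT INTO tb_email_to_members VALUES
        (1, 'arnold@foo.com'), (2, 'foo@foo.com'), (3, 'foo@foo.com'),
        (4, 'foo@foo.com'), (5, 'jeanluc@foo.com');
    INSERT INTO tb_tx VALUES (100, 5), (101, 2), (102, 19), (103, 19), (104, 19), (105, 1);
""")

# Shared WHERE clause: not referenced in tb_tx AND address occurs more than once.
condition = """
    WHERE email_id NOT IN (SELECT frn_email_id FROM tb_tx)
      AND email_address IN (
          SELECT email_address FROM (
              SELECT email_address FROM tb_email_to_members
              GROUP BY email_address HAVING COUNT(*) > 1) AS dup)
"""

# Preview first: same WHERE clause, SELECT instead of DELETE.
to_delete = [row[0] for row in conn.execute(
    "SELECT email_id FROM tb_email_to_members" + condition + " ORDER BY email_id")]

conn.execute("DELETE FROM tb_email_to_members" + condition)
remaining = [row[0] for row in conn.execute(
    "SELECT email_id FROM tb_email_to_members ORDER BY email_id")]

print(to_delete)   # ids the DELETE removes
print(remaining)   # ids left afterwards
```

Note that if an address were duplicated and none of its rows appeared in tb_tx, this query would remove every copy, so it does not cover the "always leave one" requirement by itself.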

This should answer your question:
DELETE FROM tb_email_to_members WHERE email_id NOT IN (select frn_email_id FROM tb_tx);

This, I think, does exactly what you want. It removes only the duplicate entries from tb_email_to_members where there is no related row in tb_tx, and leaves all of the originals.
Note that you didn't say anything about removing entries from tb_tx, so the duplicates in that table are left alone (in your example content, rows 102-104).
The approach I'm using here essentially does this, in pseudo code:
DELETE FROM table WHERE id_col IN (
SUBQUERY that selects an id column and applies a WHERE filter that makes sure each id is NOT in (
another SUBQUERY which only selects the first item from each grouping, very similar to the first SUBQUERY
)
)
There's another SUBQUERY in there (line 2) wrapping the whole thing up, which prevents MySQL from complaining that you can't select from and modify a table at the same time.
Note: this is likely to be slow if your data set is large. Back up your tables before deleting a lot of data manually!
I realize this is a rather complex query, but it does work.
DELETE FROM tb_email_to_members WHERE email_id IN (
    SELECT * FROM (
        SELECT ids.eid FROM (
            SELECT tb_email_to_members.email_id AS eid, dup.email_id AS eid2, dup.email_address, frn_email_id
            FROM tb_email_to_members
            LEFT JOIN (
                SELECT email_id, email_address FROM tb_email_to_members
                GROUP BY email_address
                HAVING count(email_id) > 1) AS dup
            ON tb_email_to_members.email_address = dup.email_address
            INNER JOIN tb_tx tx ON dup.email_id = tx.frn_email_id
        ) AS ids
        WHERE ids.eid NOT IN (
            SELECT tb_email_to_members.email_id AS eid FROM tb_email_to_members
            LEFT JOIN (
                SELECT email_id, email_address FROM tb_email_to_members
                GROUP BY email_address
                HAVING count(email_id) > 1) AS dup
            ON tb_email_to_members.email_address = dup.email_address
            INNER JOIN tb_tx tx ON dup.email_id = tx.frn_email_id
            GROUP BY dup.email_id
        )
    ) AS foo
)
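For comparison, here is a shorter formulation of the same goal. It is not the query above but a different, hedged sketch: delete every row that tb_tx does not reference, except the lowest email_id of any address that has no referenced rows at all. It runs on Python's sqlite3 module as a stand-in for MySQL, and rows 6 and 7 are hypothetical additions to the question's sample data so that the "no copy is referenced" case gets exercised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb_email_to_members (email_id INTEGER PRIMARY KEY, email_address TEXT);
    CREATE TABLE tb_tx (tx_id INTEGER PRIMARY KEY, frn_email_id INTEGER NOT NULL);
    INSERT INTO tb_email_to_members VALUES
        (1, 'arnold@foo.com'), (2, 'foo@foo.com'), (3, 'foo@foo.com'),
        (4, 'foo@foo.com'), (5, 'jeanluc@foo.com'),
        (6, 'bar@foo.com'), (7, 'bar@foo.com');  -- hypothetical: duplicated, never referenced
    INSERT INTO tb_tx VALUES (100, 5), (101, 2), (102, 19), (103, 19), (104, 19), (105, 1);
""")

conn.execute("""
    DELETE FROM tb_email_to_members
    WHERE email_id NOT IN (SELECT frn_email_id FROM tb_tx)   -- never delete referenced rows
      AND email_id NOT IN (
          SELECT keep_id FROM (
              -- one keeper (lowest id) per address that has no referenced row at all
              SELECT MIN(m.email_id) AS keep_id
              FROM tb_email_to_members m
              LEFT JOIN tb_tx t ON t.frn_email_id = m.email_id
              GROUP BY m.email_address
              HAVING COUNT(t.tx_id) = 0
          ) AS keepers)
""")

remaining = [row[0] for row in conn.execute(
    "SELECT email_id FROM tb_email_to_members ORDER BY email_id")]
print(remaining)
```

Rows 3 and 4 go because row 2 is the referenced copy, and row 7 goes because row 6 is kept as the single survivor of an unreferenced address. The sketch assumes frn_email_id is never NULL, since NOT IN behaves differently when the subquery returns NULLs.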

Related

Split MySQL row into multiple rows based on a value in the table

I need to split my MySQL table rows into multiple rows based on the number of students applying for the course with certain course code.
Basically, I need to split the poorly designed MySQL table into rows so that each student ends up in its own row, generated with the relevant course_code and a unique id (id_new). Notice that the total number of students in course1_students and course2_students will equal the number of rows in the needed table.
Appreciate any help!
To split each row into two rows, get two copies of each row in the FROM via a cross join of the poorly designed table with a two-row constant table. To get courseX_students rows for each row of that result, cross join with a table of integers >= 0 and filter on courseX_students. Insert the resulting rows into a table with an auto-incremented id_new. (Because you haven't given the exact meaning of the tables, I'll assume that a courseX_students value of 0 produces no output rows.)
select id, name, 1 as course_students,
    case which
        when 1 then course1_code
        when 2 then course2_code
    end as course_code
from numbers
cross join (select 1 as which union select 2) w
cross join poorly
where n <
    case which
        when 1 then course1_students
        when 2 then course2_students
    end
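A runnable sketch of the cross-join approach, using Python's sqlite3 module in place of MySQL. The column layout of poorly, the contents of numbers, the split_rows target table and the sample row are all assumptions, since the question's table images are not available here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed shape of the poorly designed table: one row holds two courses.
    CREATE TABLE poorly (id INTEGER, name TEXT,
                         course1_students INTEGER, course1_code TEXT,
                         course2_students INTEGER, course2_code TEXT);
    INSERT INTO poorly VALUES (1, 'Spring intake', 2, 'MATH101', 1, 'PHYS201');

    -- Helper table of integers >= 0; it must reach the largest student count.
    CREATE TABLE numbers (n INTEGER PRIMARY KEY);
    INSERT INTO numbers VALUES (0), (1), (2), (3), (4);

    -- Target table; id_new is generated automatically.
    CREATE TABLE split_rows (id_new INTEGER PRIMARY KEY AUTOINCREMENT,
                             id INTEGER, name TEXT,
                             course_students INTEGER, course_code TEXT);
""")

conn.execute("""
    INSERT INTO split_rows (id, name, course_students, course_code)
    SELECT id, name, 1,
           CASE which WHEN 1 THEN course1_code WHEN 2 THEN course2_code END
    FROM numbers
    CROSS JOIN (SELECT 1 AS which UNION SELECT 2) w
    CROSS JOIN poorly
    WHERE n < CASE which WHEN 1 THEN course1_students
                         WHEN 2 THEN course2_students END
""")

codes = sorted(row[0] for row in conn.execute("SELECT course_code FROM split_rows"))
print(codes)  # one entry per student
```

With 2 students on the first course and 1 on the second, the insert produces three rows, each with its own id_new.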
CREATE TABLE my_new_table AS
SELECT id
, name
, course1_students course_students
, course1_code course_code
FROM my_table
UNION
SELECT id
, name
, course2_students
, course2_code
FROM my_table;
ALTER TABLE my_new_table ADD PRIMARY KEY(id,course_code);
id_new is redundant (and name probably belongs in a separate table)

MYSQL Deduplicate and remove the duplicate row with least data

I am working on a MYSQL database which has the following three columns: emails, name, surname.
What I need to do is deduplicate the emails. I know I can find them with a query such as one of these (they only select, they don't delete):
select distinct emails, name, surname from emails;
or
select emails, name, surname from emails group by emails having count(*) >= 2;
However, I also need to make sure that when a duplicate email address is found, the row that is kept is the one that has a name and/or surname value.
For example:
| id | emails      | name | surname |
| 1  | bob@bob.com | bob  | paulson |
| 2  | bob@bob.com |      |         |
In this case I would like to keep the first result and delete the second.
I have been looking into using 'case' or 'if' statements but am not experienced with using those. I tried expanding the above functions with those statements but to no avail.
Could anyone point me in the right direction?
PS: The first column in the table is an auto-incremented id value, in case that helps
UPDATE 1: So far @Bohemian's answer below is working great, but it fails in one case: a duplicate email address where one row has a name but no surname and the next row has a surname but no name. It keeps both records. All that needs to change is that one of these two records gets deleted, no matter which.
UPDATE 2: @Bohemian's answer is great, but after more testing I've found that it has a fundamental flaw: it only works when one of the duplicate email rows has both the name and surname fields filled in (like the first entry in the table above). If there are duplicates of an email but none of the rows has both name and surname filled in, all those rows are ignored and not deduplicated.
The last step for this query would be to work out how to delete the duplicates that don't meet the current necessary conditions. If one row has just name and the other just surname, it really doesn't matter which gets deleted as the email is the important thing to keep.
You could use this DELETE query, which is generic and can be easily adapted to support more fields:
DELETE tablename.*
FROM
    tablename LEFT JOIN (
        SELECT MIN(id) min_id
        FROM
            tablename t INNER JOIN (
                SELECT
                    emails, MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
                FROM
                    tablename
                GROUP BY
                    emails) m
            ON t.emails=m.emails
            AND ((t.name IS NOT NULL) + (t.surname IS NOT NULL))=m.max_non_nulls
        GROUP BY
            t.emails) ids
    ON tablename.id=ids.min_id
WHERE
    ids.min_id IS NULL
This query returns the maximum number of non null fields, for every email:
SELECT
emails,
MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
FROM
tablename
GROUP BY
emails
I'm then joining this query with tablename, to get the minimum ID for every email that has the maximum number of non null fields:
SELECT MIN(id) min_id
FROM
tablename t INNER JOIN (
SELECT
emails, MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
FROM
tablename
GROUP BY
emails) m
ON t.emails=m.emails
AND ((t.name IS NOT NULL) + (t.surname IS NOT NULL))=m.max_non_nulls
GROUP BY
t.emails
and then I'm deleting all rows that have an ID that is not returned by this query.
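Here is a small sketch of that selection logic on sample data, run through Python's sqlite3 module as a stand-in for MySQL. SQLite has no multi-table DELETE, so the final step is written as NOT IN over the same "minimum id with the most non-null fields" subquery, which is an adaptation rather than the exact statement above. The sample rows are made up, and empty fields are assumed to be stored as NULL rather than as empty strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablename (id INTEGER PRIMARY KEY, emails TEXT, name TEXT, surname TEXT);
    INSERT INTO tablename VALUES
        (1, 'bob@bob.com', 'bob', 'paulson'),
        (2, 'bob@bob.com', NULL,  NULL),
        (3, 'amy@amy.com', 'amy', NULL),     -- name only
        (4, 'amy@amy.com', NULL,  'lee'),    -- surname only
        (5, 'zed@zed.com', NULL,  NULL);     -- not a duplicate
""")

conn.execute("""
    DELETE FROM tablename
    WHERE id NOT IN (
        -- lowest id among the rows with the most non-null fields, per email
        SELECT MIN(t.id)
        FROM tablename t
        INNER JOIN (
            SELECT emails,
                   MAX((name IS NOT NULL) + (surname IS NOT NULL)) AS max_non_nulls
            FROM tablename
            GROUP BY emails) m
          ON t.emails = m.emails
         AND ((t.name IS NOT NULL) + (t.surname IS NOT NULL)) = m.max_non_nulls
        GROUP BY t.emails)
""")

remaining = [row[0] for row in conn.execute("SELECT id FROM tablename ORDER BY id")]
print(remaining)
```

Row 1 wins for bob because it has both fields. For amy the two rows tie with one field each, so the lower id is kept, which covers the case raised in the question's updates.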
This is easy with MySQL's multiple-table delete syntax:
delete b
from mytable a
join mytable b
on a.email = b.email
and a.id != b.id
where a.name is not null
and a.surname is not null
Delete records with a duplicate email:
delete
from duplicate_email where id in(
select id from (
select id, email from duplicate_email group by email having count(id) > 1) as id
)
There is one limitation: each run removes only one row per duplicated email. That is enough when an email appears exactly twice, but if it appears three or more times you need to repeat the query until it reports zero rows deleted.
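The "repeat until nothing is deleted" step can be sketched as a loop. This uses Python's sqlite3 module as a stand-in for MySQL, with made-up sample rows. The inner query picks MAX(id) explicitly instead of a bare id, which is a deliberate change so that the row removed on each pass is well defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE duplicate_email (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO duplicate_email (email) VALUES
        ('a@example.com'), ('a@example.com'), ('a@example.com'),
        ('b@example.com'), ('b@example.com'), ('c@example.com');
""")

passes = 0
while True:
    # Each pass removes one row (the highest id) per email that is still duplicated.
    cur = conn.execute("""
        DELETE FROM duplicate_email
        WHERE id IN (
            SELECT MAX(id) FROM duplicate_email
            GROUP BY email
            HAVING COUNT(id) > 1)
    """)
    if cur.rowcount == 0:   # nothing deleted: no duplicates are left
        break
    passes += 1

remaining = [row[0] for row in conn.execute("SELECT id FROM duplicate_email ORDER BY id")]
print(passes, remaining)
```

The address that appears three times needs two passes, which is exactly the limitation described above.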

Delete from a table matching one criteria where there are rows in same table matching different criteria?

Sorry for the mega title... I was trying to be descriptive enough. I've got a table that contains event attendance data that has some erroneous data in it. The table definition is kind of like this:
id (row id)
date
company_name
attendees
It ended up with some cases where for a given date, there are two entries matching a company_name and date but one has attendees=0 and the other has attendees>0. In those cases, I want to discard the ones where attendees=0.
I know you can't join on the same table while deleting, so please consider this query to be pseudocode that shows what I want to accomplish.
DELETE FROM attendance a WHERE a.attendees=0 AND a.date IN (SELECT b.date FROM attendance b WHERE b.attendees > 0 AND b.company_name = a.company_name);
I also tried to populate a temporary table with the ids of the rows I want to delete, but that query hangs because of the IN (SELECT ...) clause. My table has thousands of rows so that just maxes out the CPU and then times out.
This ugly thing should work (the aliased derived table is what avoids the "You can't specify target table for update in FROM clause" error):
DELETE FROM attendance
WHERE (attendees, date, company_name)
IN (SELECT c.a, c.d, c.c
FROM
(SELECT MIN(attendees) a, date d, company_name c
FROM attendance
GROUP BY date, company_name
HAVING COUNT(*) > 1) as c);
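As a cross-check of the intended outcome, here is a sketch that deletes only the attendees=0 rows that have a sibling with attendees>0 for the same date and company. It runs on Python's sqlite3 module, which accepts a correlated subquery on the table being deleted from. MySQL does not, so this form is an assumption for illustration and not a drop-in replacement for the query above. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attendance (id INTEGER PRIMARY KEY, date TEXT,
                             company_name TEXT, attendees INTEGER);
    INSERT INTO attendance (date, company_name, attendees) VALUES
        ('2013-01-01', 'Acme',   0),   -- id 1: erroneous, has a sibling with 12
        ('2013-01-01', 'Acme',   12),  -- id 2
        ('2013-01-01', 'Globex', 0),   -- id 3: only entry for that day, keep
        ('2013-01-02', 'Acme',   5),   -- id 4: both entries > 0, keep both
        ('2013-01-02', 'Acme',   7);   -- id 5
""")

conn.execute("""
    DELETE FROM attendance
    WHERE attendees = 0
      AND EXISTS (
          SELECT 1 FROM attendance b
          WHERE b.date = attendance.date
            AND b.company_name = attendance.company_name
            AND b.attendees > 0)
""")

remaining = [row[0] for row in conn.execute("SELECT id FROM attendance ORDER BY id")]
print(remaining)
```

Unlike a MIN(attendees) approach, this leaves groups alone when every entry has attendees greater than 0.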

How to: Find and update all the entries where the value in one column shows up more than once

I have a table with the following columns:
subid - id of the resource
authorid - id of the author
ordering - order of author within citation
The table belongs to an application where users can submit resources and cite multiple authors. Authors can cite primary and secondary authors in their submissions and usually do.
There is one case where a user (call him user 111) submitted all entries listing himself as the primary and the actual author as secondary. Unfortunately that person has left the project so it has fallen to me to fix this (I have to do it purely in sql).
I am trying to figure out how to build a query to do the following:
Find all entries
    where the subid value shows up more than once in the table
    where at least one of the authorid values is 111
    where the ordering for 111 is greater than the ordering for any users that are not 111
and update them so
    the not(111) author has ordering of '0'
    and the 111 author has ordering '1'
Try this solution:
UPDATE tbl a
INNER JOIN
(
    SELECT subid
    FROM tbl
    GROUP BY subid
    HAVING COUNT(*) > 1 AND SUM(authorid = 111) > 0
) b ON a.subid = b.subid
SET a.ordering = (a.authorid = 111)
Replace tbl with your actual table name.
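A quick way to try the idea on sample data is the sketch below, which uses Python's sqlite3 module as a stand-in for MySQL. Because UPDATE with a JOIN is MySQL syntax, the join is rewritten here as a subid IN (...) filter, and the rows are made up. The boolean expression authorid = 111 evaluates to 1 or 0, which is what sets the new ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (subid INTEGER, authorid INTEGER, ordering INTEGER);
    INSERT INTO tbl VALUES
        (1, 111, 0), (1, 222, 1),   -- 111 wrongly listed first
        (2, 333, 0),                -- single author, untouched
        (3, 111, 0);                -- 111 alone on this resource, untouched
""")

conn.execute("""
    UPDATE tbl
    SET ordering = (authorid = 111)          -- 1 for author 111, 0 for everyone else
    WHERE subid IN (
        SELECT subid FROM tbl
        GROUP BY subid
        HAVING COUNT(*) > 1                  -- subid appears more than once
           AND SUM(authorid = 111) > 0)      -- and author 111 is among the rows
""")

rows = conn.execute(
    "SELECT subid, authorid, ordering FROM tbl ORDER BY subid, authorid").fetchall()
print(rows)
```

Like the query above, this does not check the current relative ordering of author 111 before updating, so it rewrites every multi-author resource that includes 111.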

Delete Duplicate email addresses from Table in MYSQL

I have a table with columns for ID, firstname, lastname, address, email and so on.
Is there any way to delete duplicate email addresses from the TABLE?
Additional information (from comments):
If there are two rows with the same email address, one has a normal firstname and lastname but the other has 'Instant' in the firstname, so I can distinguish between them. I just want to delete the one with firstname 'Instant'.
Note, some records where the firstname='Instant' will have just 1 email address. I don't want to delete just one unique email address, so I can't just delete everything where firstname='Instant'.
Please help me out.
DELETE n1 FROM customers n1, customers n2 WHERE n1.ID > n2.ID AND n1.email = n2.email
DELETE FROM table WHERE id NOT IN (SELECT MIN(id) FROM table GROUP BY email)
This keeps the lowest (first inserted) id for every email.
While MiPnamic's answer is essentially correct, it doesn't solve the problem of which record you keep and which you throw away (and how you sort out related records). The short answer is that this cannot be done programmatically.
A query like this:
SELECT email, MAX(ID), MAX(firstname), MAX(lastname), MAX(address)
FROM customers
GROUP BY email
makes it even worse, since you are potentially selecting a mixture of fields from the duplicate rows. You'd need to do something like:
SELECT csr2.*
FROM customers csr2
WHERE ID IN (
SELECT MAX(id)
FROM customers csr
GROUP BY email
);
This gets a unique set of existing rows. Of course you still need to sort out all the related records (hint: those are the IDs in the customers table not returned by the query above).
I don't know if this will work in MYSQL (I haven't used it)... but you should be able to do something like the following snippets.
I'd suggest you run them in order to get a feel for if the right data is being selected. If it does work, then you probably want to create a constraint on the column.
Get all of the duplicate e-mail addresses:
SELECT
EMAILADDRESS, COUNT(1)
FROM
TABLE
GROUP BY EMAILADDRESS
HAVING COUNT(1) > 1
Then determine the IDs of the rows that have those addresses:
SELECT
ID
FROM
TABLE
WHERE
EMAILADDRESS IN (
SELECT
EMAILADDRESS
FROM
TABLE
GROUP BY EMAILADDRESS
HAVING COUNT(1) > 1
)
Then finally, delete the rows, based on the above and other constraints:
DELETE
FROM
TABLE
WHERE
ID IN (
SELECT
ID
FROM
TABLE
WHERE
EMAILADDRESS IN (
SELECT
EMAILADDRESS
FROM
TABLE
GROUP BY EMAILADDRESS
HAVING COUNT(1) > 1
)
)
AND FIRSTNAME = 'Instant'
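The final step can be checked on a few rows with the sketch below, which uses Python's sqlite3 module as a stand-in. The customers table, its columns and the sample rows are assumptions standing in for the TABLE, EMAILADDRESS and FIRSTNAME placeholders above, and the nested ID lookup is collapsed into a single condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, firstname TEXT,
                            lastname TEXT, email TEXT);
    INSERT INTO customers VALUES
        (1, 'Ann',     'Lee', 'ann@example.com'),
        (2, 'Instant', '',    'ann@example.com'),   -- duplicate, should go
        (3, 'Instant', '',    'solo@example.com'),  -- unique address, must stay
        (4, 'Bob',     'Ray', 'bob@example.com');
""")

conn.execute("""
    DELETE FROM customers
    WHERE firstname = 'Instant'
      AND email IN (
          SELECT email FROM customers
          GROUP BY email
          HAVING COUNT(1) > 1)
""")

remaining = [row[0] for row in conn.execute("SELECT id FROM customers ORDER BY id")]
print(remaining)
```

In MySQL the subquery that reads from the table being deleted would need to be wrapped in a derived table, as MySQL rejects a DELETE that selects directly from its own target.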
1. Duplicate the table structure.
2. Put a unique key on the email column of the new table (just to be safe).
3. Do an INSERT into the new table, SELECTing data from the old one grouped by email address.
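A sketch of the copy-into-a-new-table approach, using Python's sqlite3 module as a stand-in for MySQL. The customers_dedup name, the column set and the rows are hypothetical, and the grouping step is written as an explicit MIN(id) per email so that it is clear which copy gets carried over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, firstname TEXT, email TEXT);
    INSERT INTO customers VALUES
        (1, 'Ann',     'ann@example.com'),
        (2, 'Instant', 'ann@example.com'),
        (3, 'Bob',     'bob@example.com');

    -- Same structure as the old table, plus a unique key on email.
    CREATE TABLE customers_dedup (id INTEGER PRIMARY KEY, firstname TEXT,
                                  email TEXT UNIQUE);

    -- Copy one row per email address (here: the lowest id).
    INSERT INTO customers_dedup
    SELECT id, firstname, email FROM customers
    WHERE id IN (SELECT MIN(id) FROM customers GROUP BY email);
""")

copied = [row[0] for row in conn.execute("SELECT id FROM customers_dedup ORDER BY id")]

# The unique key now rejects any further duplicate address.
try:
    conn.execute("INSERT INTO customers_dedup VALUES (99, 'Dup', 'ann@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(copied, duplicate_rejected)
```

Once the copy looks right, the old table can be swapped out for the new one, and the unique key keeps duplicates from coming back.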
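Another way to dedupe, using forsvarir's answer above but modifying it a bit. This way you can choose which record to keep through the ORDER BY inside the PARTITION BY clause. Note that the syntax below (BEGIN TRAN, bracketed names) is SQL Server's, so it would need adapting for MySQL: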
BEGIN TRAN
DELETE
FROM [TABLE]
WHERE
ID IN (
SELECT a.ID
FROM
(
SELECT ROW_NUMBER() OVER(PARTITION BY Email ORDER BY Email) [RowNum], ID, Email
FROM [TABLE]
WHERE Email IN
(
SELECT
Email
FROM
[TABLE]
GROUP BY Email
HAVING COUNT(1) > 1
)
) a
WHERE a.RowNum > 1
)
--COMMIT TRAN
--ROLLBACK TRAN
You can follow this MySQL query:
DELETE p1
FROM Person p1, Person p2
WHERE p1.email = p2.email
AND p1.id> p2.id;