Delete duplicate email addresses from a table in MySQL

I have a table with columns for ID, firstname, lastname, address, email and so on.
Is there any way to delete duplicate email addresses from the TABLE?
Additional information (from comments):
If there are two rows with the same email address, one has a normal firstname and lastname but the other has 'Instant' in the firstname, so I can distinguish between them. I just want to delete the one with firstname 'Instant'.
Note that some records with firstname='Instant' have an email address that appears only once. I don't want to delete those unique email addresses, so I can't just delete everything where firstname='Instant'.
Please help me out.

DELETE n1 FROM customers n1, customers n2 WHERE n1.ID > n2.ID AND n1.email = n2.email

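DELETE FROM table WHERE id NOT IN (SELECT min_id FROM (SELECT MIN(id) AS min_id FROM table GROUP BY email) AS t)
This keeps the lowest (first inserted) id for every email. The extra derived table is there because MySQL does not let a subquery select directly from the table being deleted from.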

While MiPnamic's answer is essentially correct, it doesn't solve the problem of which record you keep and which you throw away (and how you sort out related records). The short answer is that this cannot be done programmatically.
A query like this:
SELECT email, MAX(ID), MAX(firstname), MAX(lastname), MAX(address)
FROM customers
GROUP BY email
makes it even worse, since you are potentially selecting a mixture of fields from the duplicate rows. You'd need to do something like:
SELECT csr2.*
FROM customers csr2
WHERE ID IN (
SELECT MAX(id)
FROM customers csr
GROUP BY email
);
to get a unique set of existing rows. Of course you still need to sort out all the related records (hint: those are the IDs in the customers table not returned by the query above).

I don't know if this will work in MYSQL (I haven't used it)... but you should be able to do something like the following snippets.
I'd suggest you run them in order to get a feel for if the right data is being selected. If it does work, then you probably want to create a constraint on the column.
Get all of the duplicate e-mail addresses:
SELECT
EMAILADDRESS, COUNT(1)
FROM
TABLE
GROUP BY EMAILADDRESS
HAVING COUNT(1) > 1
Then determine the IDs of the rows that have those addresses:
SELECT
ID
FROM
TABLE
WHERE
EMAILADDRESS IN (
SELECT
EMAILADDRESS
FROM
TABLE
GROUP BY EMAILADDRESS
HAVING COUNT(1) > 1
)
Then finally, delete the rows, based on the above and other constraints:
DELETE
FROM
TABLE
WHERE
ID IN (
SELECT
ID
FROM
TABLE
WHERE
EMAILADDRESS IN (
SELECT
EMAILADDRESS
FROM
TABLE
GROUP BY EMAILADDRESS
HAVING COUNT(1) > 1
)
)
AND FIRSTNAME = 'Instant'
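A minimal sketch of the last step, run against an in-memory SQLite database through Python's sqlite3 module so it can be tried without a MySQL server. The customers table, its three columns and the function name are assumptions for illustration. The duplicate-email subquery is wrapped in a derived table (AS dup), which is the usual way to let MySQL accept a subquery on the table being deleted from.

```python
import sqlite3


def delete_instant_duplicates(rows):
    """Delete 'Instant' rows only when their email occurs more than once.

    rows: iterable of (id, firstname, email) tuples (hypothetical schema).
    Returns the ids that remain, in ascending order.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, firstname TEXT, email TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

    # The innermost SELECT lists every email that appears on two or more rows.
    # Only 'Instant' rows whose email is in that list are removed.
    conn.execute(
        """
        DELETE FROM customers
        WHERE firstname = 'Instant'
          AND email IN (
              SELECT email FROM (
                  SELECT email
                  FROM customers
                  GROUP BY email
                  HAVING COUNT(*) > 1
              ) AS dup
          )
        """
    )
    remaining = [r[0] for r in conn.execute("SELECT id FROM customers ORDER BY id")]
    conn.close()
    return remaining
```

Note that if every row for an email has firstname 'Instant', this removes all of them, so it is worth running the inner SELECT on its own first.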

Duplicate the table structure.
Put a unique key on the email column of the new table (just to be safe).
Do an INSERT into the new table, selecting data from the old one and grouping by the email address.
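The three steps above can be sketched as follows. This is only an illustration using Python's sqlite3 with an in-memory database in place of MySQL. The table names customers and customers_clean, the index name and the rule "keep the lowest id per email" are assumptions, not something the answer specifies.

```python
import sqlite3


def copy_without_duplicates(rows):
    """Copy one row per email into a new table that has a unique key on email.

    rows: iterable of (id, firstname, email) tuples (hypothetical schema).
    Returns the (id, email) pairs that end up in the new table.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, firstname TEXT, email TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

    # Step 1: duplicate the structure.
    # (In MySQL: CREATE TABLE customers_clean LIKE customers.)
    conn.execute(
        "CREATE TABLE customers_clean (id INTEGER PRIMARY KEY, firstname TEXT, email TEXT)"
    )
    # Step 2: unique key on email, so a duplicate can never be inserted again.
    conn.execute("CREATE UNIQUE INDEX uq_clean_email ON customers_clean (email)")
    # Step 3: copy exactly one row per email, here the one with the lowest id.
    conn.execute(
        """
        INSERT INTO customers_clean (id, firstname, email)
        SELECT id, firstname, email
        FROM customers
        WHERE id IN (SELECT MIN(id) FROM customers GROUP BY email)
        """
    )
    result = conn.execute("SELECT id, email FROM customers_clean ORDER BY id").fetchall()
    conn.close()
    return result
```

Selecting whole rows by MIN(id) rather than grouping the outer query avoids mixing column values from different duplicate rows.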

Another way to dedupe, using forsvarir's answer above but modifying it a bit (SQL Server syntax). This way you can keep whichever record you choose within each partition:
BEGIN TRAN
DELETE
FROM [TABLE]
WHERE
ID IN (
SELECT a.ID
FROM
(
SELECT ROW_NUMBER() OVER(PARTITION BY Email ORDER BY Email) [RowNum], ID, Email
FROM [TABLE]
WHERE Email IN
(
SELECT
Email
FROM
[TABLE]
GROUP BY Email
HAVING COUNT(1) > 1
)
) a
WHERE a.RowNum > 1
)
--COMMIT TRAN
--ROLLBACK TRAN

You can follow this MySQL query:
DELETE p1
FROM Person p1, Person p2
WHERE p1.email = p2.email
AND p1.id > p2.id;

Related

Students records in MySQL are entered again in more than one table after entering the email wrong. How to fix this issue?

There are three tables - student, exams, sports.
STUDENT : sid(primary, auto increment), email, fname, lname, address, standard
EXAMS : eid(primary, auto increment), sid(foreign), ename, date, result
SPORTS : spid(primary, auto increment), sid(foreign), spname, date, score
Data was added to the DB, but after some time someone on my project realized that he had entered some emails wrong. Instead of editing the emails, he added new entries for some of those students, but not all of the data (for some students he missed adding entries to the exams/sports tables). He did that for random students.
I tried this query to get a clear understanding
SELECT a.sid, a.fname,
CASE WHEN EXISTS (SELECT * FROM EXAMS e WHERE a.sid = e.sid) THEN 'YES' ELSE 'NO' END,
CASE WHEN EXISTS (SELECT * FROM SPORTS s WHERE a.sid = s.sid) THEN 'YES' ELSE 'NO' END
FROM STUDENT a;
How do I find which records I need to delete and which I need to update?
Since a second copy of the data was inserted instead of the emails being updated, you can use the other attributes in the STUDENT table to find students that have more than one copy. Assuming the other attributes were entered correctly, you can check using this:
SELECT fname, lname, address, standard, MAX(sid)
FROM STUDENT c GROUP BY fname, lname, address, standard HAVING count(*) > 1;
As sid is AUTO_INCREMENT, for the same student the later-inserted IDs are the duplicates. The query above finds those; use it to delete those IDs from your STUDENT table, and then delete the same IDs from the other tables.
Note: if a foreign key constraint is declared, you might need to delete from EXAMS and SPORTS first.
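A rough sketch of that procedure, using Python's sqlite3 and an in-memory database rather than MySQL. The reduced column lists, the sample rows and the function names are assumptions for illustration. It follows the answer's assumption that the later sid is the one to remove, and it deletes from the child tables first so that a foreign key constraint is not violated.

```python
import sqlite3


def build_demo():
    """Create a small hypothetical STUDENT/EXAMS/SPORTS database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE STUDENT (sid INTEGER PRIMARY KEY, email TEXT, fname TEXT,
                              lname TEXT, address TEXT, standard TEXT);
        CREATE TABLE EXAMS (eid INTEGER PRIMARY KEY,
                            sid INTEGER REFERENCES STUDENT (sid), ename TEXT);
        CREATE TABLE SPORTS (spid INTEGER PRIMARY KEY,
                             sid INTEGER REFERENCES STUDENT (sid), spname TEXT);
        INSERT INTO STUDENT VALUES (1, 'old@x.com', 'Ann', 'Lee', '1 Road', '5');
        INSERT INTO STUDENT VALUES (2, 'bob@x.com', 'Bob', 'Ray', '2 Road', '6');
        INSERT INTO STUDENT VALUES (3, 'new@x.com', 'Ann', 'Lee', '1 Road', '5');
        INSERT INTO EXAMS VALUES (1, 1, 'math');
        INSERT INTO EXAMS VALUES (2, 3, 'math');
        INSERT INTO SPORTS VALUES (1, 1, 'chess');
        """
    )
    return conn


def find_reentered_students(conn):
    """Return the highest sid of every student that was entered more than once."""
    rows = conn.execute(
        """
        SELECT MAX(sid)
        FROM STUDENT
        GROUP BY fname, lname, address, standard
        HAVING COUNT(*) > 1
        ORDER BY 1
        """
    )
    return [r[0] for r in rows]


def delete_students(conn, sids):
    """Delete the given students, child rows first to satisfy foreign keys."""
    for sid in sids:
        conn.execute("DELETE FROM EXAMS WHERE sid = ?", (sid,))
        conn.execute("DELETE FROM SPORTS WHERE sid = ?", (sid,))
        conn.execute("DELETE FROM STUDENT WHERE sid = ?", (sid,))
```

If the later row is the one holding the corrected email, that email would need to be copied onto the kept row before the delete. Also, MAX(sid) returns only one sid per student, so a student entered three times needs a second pass.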

How to find duplicate fields using SQL?

I have a table person(id, iin, name, done) and a table err_person(id, iin, surname, name). How can I find duplicate values in the 'iin' field? If duplicates exist, copy those rows to the err_person table and set the flag person.done=1 for them.
person table
desired results: err_person
To find duplicate values in a field:
SELECT iin
FROM person
GROUP BY iin
HAVING COUNT(*) > 1
You may wish to nest this query:
SELECT * FROM person WHERE iin IN (
SELECT iin
FROM person
GROUP BY iin
HAVING COUNT(*) > 1)
And so, to insert this into the err_person table you could do something like this (noting that the person table does not have a surname field):
INSERT INTO err_person (id, iin, name)
SELECT id, iin, name FROM person WHERE iin IN (
SELECT iin
FROM person
GROUP BY iin
HAVING COUNT(*) > 1)
Finally, a separate query would have to be run to change the done field. The problem with a nested query here is that you're updating a table that you're trying to look at, thus a simple effort to use an update query that looks at a subquery will fail - because both are based on the person table. A temporary table might be a better option here.
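One way the temporary-table idea might look, sketched with Python's sqlite3 and an in-memory database instead of MySQL. The temporary table name dup_iin, the function name and the sample data are assumptions. The duplicated iin values are stored once, then used both for the copy into err_person and for the update of person.done, so the UPDATE never has to read person in a subquery.

```python
import sqlite3


def flag_duplicate_iins(conn):
    """Copy rows with a duplicated iin to err_person and set person.done = 1."""
    # 1. Materialise the duplicated iin values in a temporary table.
    conn.execute(
        """
        CREATE TEMPORARY TABLE dup_iin AS
        SELECT iin FROM person GROUP BY iin HAVING COUNT(*) > 1
        """
    )
    # 2. Copy the affected rows (person has no surname column, so it stays NULL).
    conn.execute(
        """
        INSERT INTO err_person (id, iin, name)
        SELECT id, iin, name FROM person
        WHERE iin IN (SELECT iin FROM dup_iin)
        """
    )
    # 3. Flag the same rows; the subquery reads dup_iin, not person.
    conn.execute("UPDATE person SET done = 1 WHERE iin IN (SELECT iin FROM dup_iin)")
    conn.execute("DROP TABLE dup_iin")


def build_demo():
    """Create hypothetical person / err_person tables with one duplicated iin."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, iin TEXT, name TEXT, done INTEGER)")
    conn.execute("CREATE TABLE err_person (id INTEGER, iin TEXT, surname TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO person VALUES (?, ?, ?, 0)",
        [(1, "111", "Ann"), (2, "222", "Bob"), (3, "111", "Cid")],
    )
    return conn
```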

Delete duplicate rows in MySQL based on contents of another table

I have a MySQL (5.4) table that has some rows with duplicate fields (2-5 copies sometimes) that I'd like to remove, leaving only one. But it's not as simple as just picking the highest or lowest id. The duplicates I'd like to remove are those that don't have corresponding entries in another table.
Table tb_email_to_members has email_id (auto-incrementing) and email_address (and other fields that aren't relevant). For example:
email_id   email_address
-------------------------
1          arnold@foo.com
2          foo@foo.com
3          foo@foo.com
4          foo@foo.com
5          jeanluc@foo.com
Table tb_tx has tx_id (auto-incrementing) and frn_email_id (and other fields that aren't relevant), where tb_tx.frn_email_id matches up with tb_email_to_members.email_id. For example:
tx_id   frn_email_id
--------------------------
100     5
101     2
102     19
103     19
104     19
105     1
I want to remove rows where email_address is duplicated one or more times in tb_email_to_members, but only when there are NO rows containing frn_email_id in tb_tx for the email_id that comes from tb_email_to_members. I need to make sure to leave one row of the duplicates, even if none of them have corresponding entries in tb_tx. In the examples above, I want to remove rows 3 and 4 from tb_email_to_members, since only row 2 exists in tb_tx.
(In essence, tb_email_to_members maps email addresses to user accounts in yet another table, and tb_tx maps orders to those email addresses from tb_email_to_members.)
I can find the duplicates easily, and I see lots of code for deleting duplicates, but not with the tweak of needing to delete only certain duplicates based on the failure of a lookup from another table. Suggestions?
@MHardwick and @ShadowRay almost got it right. The following also checks to make sure the email exists more than once in tb_email_to_members:
DELETE FROM tb_email_to_members
WHERE email_id NOT IN (SELECT frn_email_id FROM tb_tx)
AND email_address IN (SELECT email_address FROM (SELECT email_address FROM tb_email_to_members GROUP BY email_address HAVING COUNT(email_address) > 1) AS dup);
And obviously changing DELETE to SELECT * will show you what exactly you're about to delete.
Bonus points for knowing tb is short for tidbits?
This should answer your question:
DELETE FROM tb_email_to_members WHERE email_id NOT IN (select frn_email_id FROM tb_tx);
This, I think, does exactly what you want. It removes only the duplicate entries from tb_email_to_members where there is no related row in tb_tx, and leaves all of the originals.
Note that you didn't say anything about removing entries from tb_tx, so the duplicates in that table are left alone (in your example content, rows 102-104).
The approach I'm using here essentially does this, in pseudo code:
DELETE FROM table WHERE id_col IN (
SUBQUERY that selects an id column and applies a WHERE filter that makes sure each id is NOT in (
another SUBQUERY which only selects the first item from each grouping, very similar to the first SUBQUERY
)
)
There's another SUBQUERY in there (line 2) wrapping the whole thing up, which prevents MySQL from complaining that you can't select from and modify a table at the same time.
Note: this is likely to be slow if your data set is large. Back up your tables before deleting a lot of data manually!
I realize this is a rather complex query, but it does work.
DELETE FROM tb_email_to_members WHERE email_id IN (
SELECT * FROM (
SELECT ids.eid FROM (
SELECT tb_email_to_members.email_id AS eid, dup.email_id AS eid2, dup.email_address, frn_email_id
FROM tb_email_to_members
LEFT JOIN (
SELECT email_id, email_address FROM tb_email_to_members
GROUP BY email_address
HAVING count(email_id) > 1) AS dup
ON tb_email_to_members.email_address = dup.email_address
INNER JOIN tb_tx tx ON dup.email_id = tx.frn_email_id
) AS ids
WHERE ids.eid NOT IN (
SELECT tb_email_to_members.email_id AS eid FROM tb_email_to_members
LEFT JOIN (
SELECT email_id, email_address FROM tb_email_to_members
GROUP BY email_address
HAVING count(email_id) > 1) AS dup
ON tb_email_to_members.email_address = dup.email_address
INNER JOIN tb_tx tx ON dup.email_id = tx.frn_email_id
GROUP BY dup.email_id
)
) AS foo
)
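For comparison, here is a minimal sketch of the rule stated in the question, run with Python's sqlite3 on the example data instead of MySQL. The function names and the tie-break are assumptions: when none of the duplicates has a row in tb_tx, the one with the lowest email_id is kept. The ids are selected first and deleted afterwards, which also shows exactly what is about to be removed.

```python
import sqlite3

# A row is removed when it is not referenced by tb_tx AND some other row with
# the same address is either referenced by tb_tx or has a lower email_id.
DOOMED_SQL = """
    SELECT m.email_id
    FROM tb_email_to_members AS m
    WHERE NOT EXISTS (SELECT 1 FROM tb_tx AS t WHERE t.frn_email_id = m.email_id)
      AND EXISTS (
          SELECT 1
          FROM tb_email_to_members AS o
          WHERE o.email_address = m.email_address
            AND o.email_id <> m.email_id
            AND (o.email_id < m.email_id
                 OR EXISTS (SELECT 1 FROM tb_tx AS t2
                            WHERE t2.frn_email_id = o.email_id))
      )
    ORDER BY m.email_id
"""


def remove_unreferenced_duplicates(conn):
    """Delete duplicate addresses that have no tb_tx row; return the deleted ids."""
    doomed = [row[0] for row in conn.execute(DOOMED_SQL)]
    conn.executemany(
        "DELETE FROM tb_email_to_members WHERE email_id = ?",
        [(email_id,) for email_id in doomed],
    )
    return doomed


def build_demo():
    """Load the example rows from the question into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tb_email_to_members (email_id INTEGER PRIMARY KEY, email_address TEXT)"
    )
    conn.execute("CREATE TABLE tb_tx (tx_id INTEGER PRIMARY KEY, frn_email_id INTEGER)")
    conn.executemany(
        "INSERT INTO tb_email_to_members VALUES (?, ?)",
        [(1, "arnold@foo.com"), (2, "foo@foo.com"), (3, "foo@foo.com"),
         (4, "foo@foo.com"), (5, "jeanluc@foo.com")],
    )
    conn.executemany(
        "INSERT INTO tb_tx VALUES (?, ?)",
        [(100, 5), (101, 2), (102, 19), (103, 19), (104, 19), (105, 1)],
    )
    return conn
```

On the question's data this selects rows 3 and 4, matching the desired result.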

MYSQL Deduplicate and remove the duplicate row with least data

I am working on a MYSQL database which has the following three columns: emails, name, surname.
What I need to do is deduplicate the emails where I know I can use a function such as this one (this query just to sort not delete):
select distinct emails, name, surname from emails;
or
select emails, name, surname from emails group by emails having count(*) >= 2;
However, I also need to make sure that when a duplicate email address is found, the one kept is the one that has a name and/or surname value.
For example:
| id | emails      | name | surname |
| 1  | bob@bob.com | bob  | paulson |
| 2  | bob@bob.com |      |         |
In this case I would like to keep the first result and delete the second.
I have been looking into using 'case' or 'if' statements but am not experienced with using those. I tried expanding the above functions with those statements but to no avail.
Could anyone point me in the right direction?
PS: The first column in the table is an auto-incremented id value, in case that helps
UPDATE 1: So far @Bohemian's answer below is working great, but it fails in one case: a duplicate email address where one row has a name but no surname and the next row has a surname but no name. It keeps both records. All that needs to change is that one of these two records gets deleted, no matter which.
UPDATE 2: @Bohemian's answer is great, but after more testing I've found that it has a fundamental flaw: it works only when one of the duplicate email rows has data in both the name and surname fields (like the first entry in the table above). If there are duplicates of an email but none of the rows has both the name and surname fields filled in, then all those rows are ignored and not deduplicated.
The last step for this query would be to work out how to delete the duplicates that don't meet the current necessary conditions. If one row has just name and the other just surname, it really doesn't matter which gets deleted as the email is the important thing to keep.
You could use this DELETE query, which is generic and can be easily adapted to support more fields:
DELETE tablename.*
FROM
tablename LEFT JOIN (
SELECT MIN(id) min_id
FROM
tablename t INNER JOIN (
SELECT
emails, MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
FROM
tablename
GROUP BY
emails) m
ON t.emails=m.emails
AND ((t.name IS NOT NULL) + (t.surname IS NOT NULL))=m.max_non_nulls
GROUP BY
t.emails) ids
ON tablename.id=ids.min_id
WHERE
ids.min_id IS NULL
Please see fiddle here.
This query returns the maximum number of non null fields, for every email:
SELECT
emails,
MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
FROM
tablename
GROUP BY
emails
I'm then joining this query with tablename, to get the minimum ID for every email that has the maximum number of non null fields:
SELECT MIN(id) min_id
FROM
tablename t INNER JOIN (
SELECT
emails, MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
FROM
tablename
GROUP BY
emails) m
ON t.emails=m.emails
AND ((t.name IS NOT NULL) + (t.surname IS NOT NULL))=m.max_non_nulls
GROUP BY
t.emails
and then I'm deleting all rows that have an ID that is not returned by this query.
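A small sketch to try this logic on toy data, using Python's sqlite3 in place of MySQL. The SELECT is the "minimum ID with the maximum number of non-null fields" query from above; the function name and sample rows are assumptions, and missing values are stored as NULL as the query expects. The sketch reads the ids to keep first and deletes the rest afterwards, instead of using the LEFT JOIN delete.

```python
import sqlite3

# Per email: the lowest id among the rows with the most non-null name fields.
KEEP_SQL = """
    SELECT MIN(t.id) AS min_id
    FROM tablename AS t
    INNER JOIN (
        SELECT emails,
               MAX((name IS NOT NULL) + (surname IS NOT NULL)) AS max_non_nulls
        FROM tablename
        GROUP BY emails
    ) AS m
      ON t.emails = m.emails
     AND ((t.name IS NOT NULL) + (t.surname IS NOT NULL)) = m.max_non_nulls
    GROUP BY t.emails
"""


def dedupe_keep_most_complete(conn):
    """Delete every row that KEEP_SQL does not select; return the deleted ids."""
    keep = {row[0] for row in conn.execute(KEEP_SQL)}
    all_ids = [row[0] for row in conn.execute("SELECT id FROM tablename ORDER BY id")]
    doomed = [i for i in all_ids if i not in keep]
    conn.executemany("DELETE FROM tablename WHERE id = ?", [(i,) for i in doomed])
    return doomed


def build_demo():
    """Hypothetical rows covering the cases raised in the question's updates."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tablename (id INTEGER PRIMARY KEY, emails TEXT, name TEXT, surname TEXT)"
    )
    conn.executemany(
        "INSERT INTO tablename VALUES (?, ?, ?, ?)",
        [
            (1, "bob@bob.com", "bob", "paulson"),
            (2, "bob@bob.com", None, None),
            (3, "amy@x.com", "amy", None),     # name only
            (4, "amy@x.com", None, "smith"),   # surname only
            (5, "solo@x.com", None, None),     # not a duplicate
        ],
    )
    return conn
```

If the blanks are stored as empty strings rather than NULL, the IS NOT NULL tests would count them as filled in, so the data would have to be normalised first.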
This is easy with mysql's multiple-table delete syntax:
delete b
from mytable a
join mytable b
on a.email = b.email
and a.id != b.id
where a.name is not null
and a.surname is not null
Delete record with duplicate email id
delete
from duplicate_email where id in(
select id from (
select id, email from duplicate_email group by email having count(id) > 1) as id
)
There is one problem, though: a single run removes only one row per duplicated email, so it fully deduplicates only emails that appear exactly twice. If an email appears three or more times, repeat the query until zero records are deleted.

Create Contacts Database Which Refers to Users Without Duplicates

My question is similar to (but at the same time completely different from) this question:
Contacts Database
The question is simple: How can I create a Contacts database table which stores a user id and contact id without duplicating keys.
For example, if I have a table called Contacts, it would have a column user_id, and a column contact_id.
Once I do that, it should be as simple as inserting the user and the added contact. Once that is done though, how do I select all of a user's contacts? Also, how do I narrow down the contact entry enough to delete it if need be?
I ended up just creating a table with two foreign keys and then selecting them based on either of the fields.
For example (pseudo code--no specific language, just English):
Table Contact:
user = ForeignKey(from user table)
contact = ForeignKey(from user table)
Then whenever I need something from them, I'll check if the user field contains what I want and then I'll check if the contact field has what I want. This way I don't have to repeat records and I can still find what I need.
Thanks for your answers.
Similar to the question in the link. You would have 3 tables.
Table 1
User_ID
Name
PK(User_ID)
Table 2
Contact_id
Address
Phone_Number
etc...
PK(Contact_id)
Table 3
User_ID
Contact_id
PK(User_ID, Contact_id)
Here you would have ContactID in table 2 as an autoinc column.
Also, when inserting in Table 3, MySQL would throw an error if there is a duplicate.
To select all of a user's contacts, use:
SELECT *
FROM Table_2 JOIN Table_3
ON Table_2.Contact_id = Table_3.Contact_id
WHERE Table_3.User_ID = <userid>
Or, if you need it for a particular name:
SELECT *
FROM Table_1 JOIN Table_3
ON Table_1.User_ID = Table_3.User_ID
JOIN Table_2
ON Table_3.Contact_id = Table_2.Contact_id
WHERE Table_1.Name = <user name>
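A minimal sketch of this three-table layout, using Python's sqlite3 with an in-memory database rather than MySQL. The table names users, contacts and user_contacts and the helper functions are assumptions standing in for Table 1, Table 2 and Table 3. The composite primary key on the link table is what rejects a duplicate user/contact pair.

```python
import sqlite3


def build_contacts_db():
    """Create the three tables: users, contacts, and the link table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE contacts (contact_id INTEGER PRIMARY KEY,
                               address TEXT, phone_number TEXT);
        CREATE TABLE user_contacts (
            user_id    INTEGER NOT NULL REFERENCES users (user_id),
            contact_id INTEGER NOT NULL REFERENCES contacts (contact_id),
            PRIMARY KEY (user_id, contact_id)
        );
        """
    )
    return conn


def add_contact(conn, user_id, contact_id):
    """Link a user to a contact; return False if the database rejects the row
    (for example because the same pair already exists)."""
    try:
        conn.execute(
            "INSERT INTO user_contacts (user_id, contact_id) VALUES (?, ?)",
            (user_id, contact_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


def contacts_of(conn, user_id):
    """Return (contact_id, phone_number) for every contact of one user."""
    return conn.execute(
        """
        SELECT c.contact_id, c.phone_number
        FROM contacts AS c
        JOIN user_contacts AS uc ON c.contact_id = uc.contact_id
        WHERE uc.user_id = ?
        ORDER BY c.contact_id
        """,
        (user_id,),
    ).fetchall()
```

Removing one entry then only needs both key columns: DELETE FROM user_contacts WHERE user_id = ? AND contact_id = ?.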
There are two questions.
"How do I select all of a user's contacts?"
So you have a table tbl_contacts(user_id, contact_id), where both columns together form the primary key, so you won't get duplicated data.
If you want to list all contacts for user_id = ?
SELECT *
FROM tbl_contacts
WHERE user_id = ?
You might want to clarify your second question "Also, how do I narrow down the contact entry enough to delete it if need be?"
You probably have some other properties that belong to the user's contact (e.g. contact_name or contact_number), and you will need to search on those. Once a query returns exactly one record, you can run: DELETE FROM tbl_contacts WHERE user_id = ? AND contact_id = ?
If this is not the answer you wanted please clarify your question.