MySQL: Deduplicate and remove the duplicate row with the least data

I am working on a MySQL database table which has the following three columns: emails, name, surname.
What I need to do is deduplicate the emails. I know I can use a query such as one of these (they only select rows, they don't delete anything):
select distinct emails, name, surname from emails;
or
select emails, name, surname from emails group by emails having count(*) >= 2;
However, I also need to make sure that when a duplicate email address is found, the row that is kept is the one that has a name and/or surname value.
For example:
|id | emails      | name | surname |
|1  | bob@bob.com | bob  | paulson |
|2  | bob@bob.com |      |         |
In this case I would like to keep the first result and delete the second.
I have been looking into using 'case' or 'if' statements but am not experienced with using those. I tried expanding the above functions with those statements but to no avail.
Could anyone point me in the right direction?
PS: The first column in the table is an auto-incremented id value, in case that helps
UPDATE 1: So far @Bohemian's answer below is working great, but it fails in one case: when an email address is duplicated and one row has a name but no surname while the other has a surname but no name, it keeps both records. The only change needed is for one of these two records to be deleted, no matter which.
UPDATE 2: @Bohemian's answer is great, but after more testing I've found a fundamental flaw: it only works when one of the duplicate rows for an email has both the name and surname fields filled in (like the first entry in the table above). If an email is duplicated but none of its rows has both fields filled in, all of those rows are ignored and not deduplicated.
The last step for this query would be to work out how to delete the duplicates that don't meet the current conditions. If one row has just a name and the other just a surname, it really doesn't matter which gets deleted, as the email is the important thing to keep.

You could use this DELETE query, which is generic and can be easily adapted to support more fields:
DELETE tablename.*
FROM
tablename LEFT JOIN (
SELECT MIN(id) min_id
FROM
tablename t INNER JOIN (
SELECT
emails, MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
FROM
tablename
GROUP BY
emails) m
ON t.emails=m.emails
AND ((t.name IS NOT NULL) + (t.surname IS NOT NULL))=m.max_non_nulls
GROUP BY
t.emails) ids
ON tablename.id=ids.min_id
WHERE
ids.min_id IS NULL
This query returns the maximum number of non null fields, for every email:
SELECT
emails,
MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
FROM
tablename
GROUP BY
emails
I'm then joining this query with tablename, to get the minimum ID for every email that has the maximum number of non null fields:
SELECT MIN(id) min_id
FROM
tablename t INNER JOIN (
SELECT
emails, MAX((name IS NOT NULL) + (surname IS NOT NULL)) max_non_nulls
FROM
tablename
GROUP BY
emails) m
ON t.emails=m.emails
AND ((t.name IS NOT NULL) + (t.surname IS NOT NULL))=m.max_non_nulls
GROUP BY
t.emails
and then I'm deleting all rows that have an ID that is not returned by this query.
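The approach described above can be tried out end to end with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, made only so the example runs anywhere), so the multi-table DELETE is rewritten as a portable DELETE ... WHERE id NOT IN (...). The sample rows are made up, and it assumes that missing names are stored as NULL rather than as empty strings.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (id INTEGER PRIMARY KEY, emails TEXT, name TEXT, surname TEXT)"
)
conn.executemany(
    "INSERT INTO tablename (id, emails, name, surname) VALUES (?, ?, ?, ?)",
    [
        (1, "bob@bob.com", "bob", "paulson"),  # fullest row for this email
        (2, "bob@bob.com", None, None),
        (3, "amy@amy.com", "amy", None),       # tie: one non-NULL field each
        (4, "amy@amy.com", None, "adams"),
        (5, "sol@sol.com", None, None),        # not duplicated at all
    ],
)

# Keep, for every email, the lowest id among the rows that have the
# maximum number of non-NULL fields; delete everything else.
conn.execute(
    """
    DELETE FROM tablename
    WHERE id NOT IN (
        SELECT MIN(t.id)
        FROM tablename t
        JOIN (
            SELECT emails,
                   MAX((name IS NOT NULL) + (surname IS NOT NULL)) AS max_non_nulls
            FROM tablename
            GROUP BY emails
        ) m
          ON t.emails = m.emails
         AND (t.name IS NOT NULL) + (t.surname IS NOT NULL) = m.max_non_nulls
        GROUP BY t.emails
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM tablename ORDER BY id")]
print(remaining_ids)
```

In this sample, row 2 loses to the fuller row 1, and rows 3 and 4 tie, so the lower id wins. If the empty fields are actually empty strings, the IS NOT NULL tests would need to become comparisons against '' instead.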

This is easy with MySQL's multiple-table delete syntax:
delete b
from mytable a
join mytable b
on a.email = b.email
and a.id != b.id
where a.name is not null
and a.surname is not null

Delete records that have a duplicate email:
delete
from duplicate_email where id in(
select id from (
select id, email from duplicate_email group by email having count(id) > 1) as id
)
There is one problem, though: this query deletes only one row per duplicated email, so a single run is enough only when each email appears at most twice. If an email appears three or more times, repeat the query until it reports zero rows deleted.

Related

Delete duplicate rows in MySQL based on contents of another table

I have a MySQL (5.4) table that has some rows with duplicate fields (2-5 copies sometimes) that I'd like to remove, leaving only one. But it's not as simple as just picking the highest or lowest id. The duplicates I'd like to remove are those that don't have corresponding entries in another table.
Table tb_email_to_members has email_id (auto-incrementing) and email_address (and other fields that aren't relevant). For example:
email_id email_address
-------------------------
1 arnold#foo.com
2 foo#foo.com
3 foo#foo.com
4 foo#foo.com
5 jeanluc#foo.com
Table tb_tx has tx_id (auto-incrementing) and frn_email_id (and other fields that aren't relevant), where tb_tx.frn_email_id matches up with tb_email_to_members.email_id. For example:
tx_id frn_email_id
--------------------------
100 5
101 2
102 19
103 19
104 19
105 1
I want to remove rows where email_address is duplicated one or more times in tb_email_to_members, but only when there are NO rows containing frn_email_id in tb_tx for the email_id that comes from tb_email_to_members. I need to make sure to leave one row of the duplicates, even if none of them have corresponding entries in tb_tx. In the examples above, I want to remove rows 3 and 4 from tb_email_to_members, since only row 2 exists in tb_tx.
(In essence, tb_email_to_members maps email addresses to user accounts in another table yet, and tb_tx maps orders to those email addresses from tb_email_to_members.)
I can find the duplicates easily, and I see lots of code for deleting duplicates, but not with the tweak of needing to delete only certain duplicates based on the failure of a lookup from another table. Suggestions?
@MHardwick and @ShadowRay almost got it right. The following also checks that the email exists more than once in tb_email_to_members:
DELETE FROM tb_email_to_members
WHERE email_id NOT IN (SELECT frn_email_id FROM tb_tx)
AND email_address IN (SELECT email_address FROM tb_email_to_members GROUP BY email_address HAVING COUNT(email_address) > 1);
And obviously changing DELETE to SELECT * will show you what exactly you're about to delete.
Bonus points for knowing tb is short for tidbits?
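The question also asks that one row survive even when none of the duplicates has an entry in tb_tx. Here is a minimal sketch of one way to cover that case, using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, chosen so the example is runnable as-is). Rows 6 to 8 and their addresses are made-up test data added to the question's sample.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb_email_to_members (email_id INTEGER PRIMARY KEY, email_address TEXT)")
conn.execute("CREATE TABLE tb_tx (tx_id INTEGER PRIMARY KEY, frn_email_id INTEGER)")
conn.executemany(
    "INSERT INTO tb_email_to_members VALUES (?, ?)",
    [
        (1, "arnold@foo.com"),
        (2, "foo@foo.com"), (3, "foo@foo.com"), (4, "foo@foo.com"),
        (5, "jeanluc@foo.com"),
        (6, "dup@foo.com"), (7, "dup@foo.com"),  # duplicated, no orders at all
        (8, "solo@foo.com"),                     # unique, no orders
    ],
)
conn.executemany(
    "INSERT INTO tb_tx VALUES (?, ?)",
    [(100, 5), (101, 2), (102, 19), (103, 19), (104, 19), (105, 1)],
)

# Delete a row only if it has no order itself AND it is not the designated
# survivor (lowest email_id) of an address that has no orders on any row.
conn.execute(
    """
    DELETE FROM tb_email_to_members
    WHERE email_id NOT IN (
              SELECT frn_email_id FROM tb_tx WHERE frn_email_id IS NOT NULL
          )
      AND email_id NOT IN (
              SELECT MIN(m.email_id)
              FROM tb_email_to_members m
              LEFT JOIN tb_tx t ON t.frn_email_id = m.email_id
              GROUP BY m.email_address
              HAVING COUNT(t.tx_id) = 0
          )
    """
)

remaining_ids = [r[0] for r in conn.execute("SELECT email_id FROM tb_email_to_members ORDER BY email_id")]
print(remaining_ids)
```

Rows 3 and 4 go because row 2 carries the order, row 7 goes because row 6 is kept as the survivor, and the unique address in row 8 is left alone. In MySQL itself the subquery that reads tb_email_to_members would likely need to be wrapped in a derived table, since MySQL does not allow a subquery to read directly from the table being deleted from.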
This should answer your question:
DELETE FROM tb_email_to_members WHERE email_id NOT IN (select frn_email_id FROM tb_tx);
This, I think, does exactly what you want. It removes only the duplicate entries from tb_email_to_members where there is no related row in tb_tx, and leaves all of the originals.
Note that you didn't say anything about removing entries from tb_tx, so the duplicates in that table are left alone (in your example content, rows 102-104).
The approach I'm using here essentially does this, in pseudo code:
DELETE FROM table WHERE id_col IN (
SUBQUERY that selects an id column and applies a WHERE filter that makes sure each id is NOT in (
another SUBQUERY which only selects the first item from each grouping, very similar to the first SUBQUERY
)
)
There's another SUBQUERY in there (line 2) wrapping the whole thing up, which prevents MySQL from complaining that you can't select from and modify a table at the same time.
Note: this is likely to be slow if your data set is large. Back up your tables before deleting a lot of data manually!
I realize this is a rather complex query, but it does work.
DELETE FROM tb_email_to_members WHERE email_id IN (
SELECT * FROM (
SELECT ids.eid FROM (
SELECT tb_email_to_members.email_id AS eid, dup.email_id AS eid2, dup.email_address, frn_email_id
FROM tb_email_to_members
LEFT JOIN (
SELECT email_id, email_address FROM tb_email_to_members
GROUP BY email_address
HAVING count(email_id) > 1) AS dup
ON tb_email_to_members.email_address = dup.email_address
INNER JOIN tb_tx tx ON dup.email_id = tx.frn_email_id
) AS ids
WHERE ids.eid NOT IN (
SELECT tb_email_to_members.email_id AS eid FROM tb_email_to_members
LEFT JOIN (
SELECT email_id, email_address FROM tb_email_to_members
GROUP BY email_address
HAVING count(email_id) > 1) AS dup
ON tb_email_to_members.email_address = dup.email_address
INNER JOIN tb_tx tx ON dup.email_id = tx.frn_email_id
GROUP BY dup.email_id
)
) AS foo
)

Differences between two result sets MySQL

This is a very basic question, but I haven't been able to come up with an answer after reading various help resources.
I have a table group_affiliations which is a joining table between the tables users and groups. Relevant columns: Id, user_id, group_id. I am doing a data cleanup where users were assigned a group_id based on a location which used to be a 3 character abbreviation of a city but has since gone to spelling out full city (ex: a group_id for CHA was previously assigned and now a group_id for Charlotte). Most users currently have both group_ids associated with their user_id but some still only have the old group_id and were never assigned the new one.
What is the most efficient way of finding which ids are in this result set:
select user_id from group_affiliations where group_id=OldId;
and not in this result set:
select user_id from group_affiliations where group_id=NewId;
SELECT user_id
FROM group_affiliations
WHERE group_id = OldId
AND user_id NOT IN (
SELECT user_id FROM group_affiliations WHERE group_id = NewId
)
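How about using a JOIN? A LEFT JOIN that finds no matching row for the new group gives the users who only have the old one:
SELECT g1.user_id
FROM group_affiliations g1
LEFT JOIN group_affiliations g2
ON g2.user_id = g1.user_id
AND g2.group_id = NewId
WHERE g1.group_id = OldId
AND g2.user_id IS NULL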
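To check the "in the first result set but not in the second" logic on a small data set, here is a sketch using NOT EXISTS. It runs against Python's built-in sqlite3 module rather than MySQL (an assumption, made for portability), and the ids 10 and 20 for the old and new groups, along with the sample rows, are hypothetical.

```python
import sqlite3

OLD_ID, NEW_ID = 10, 20  # hypothetical group ids for the old and new group

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE group_affiliations (id INTEGER PRIMARY KEY, user_id INTEGER, group_id INTEGER)"
)
conn.executemany(
    "INSERT INTO group_affiliations (user_id, group_id) VALUES (?, ?)",
    [
        (1, OLD_ID), (1, NEW_ID),  # has both groups: should not be reported
        (2, OLD_ID),               # only the old group: should be reported
        (3, NEW_ID),               # only the new group: should not be reported
        (4, OLD_ID), (4, 99),      # old group plus an unrelated one: reported
    ],
)

# Users that have a row for the old group and no row for the new group.
query = """
    SELECT o.user_id
    FROM group_affiliations AS o
    WHERE o.group_id = ?
      AND NOT EXISTS (
          SELECT 1
          FROM group_affiliations AS n
          WHERE n.user_id = o.user_id
            AND n.group_id = ?
      )
    ORDER BY o.user_id
"""
users_missing_new = [row[0] for row in conn.execute(query, (OLD_ID, NEW_ID))]
print(users_missing_new)
```

The key point is that the two group_id tests have to apply to different rows of the same table, which is why the table is referenced twice. Testing both conditions against a single row can never exclude anybody.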

How to: Find and update all the entries where the value in one column shows up more than once

I have a table with the following columns:
subid - id of the resource
authorid - id of the author
ordering - order of author within citation
For an application where users can submit resources and cite multiple authors. Authors can cite primary and secondary authors in their submissions and usually do.
There is one case where a user (call him user 111) submitted all entries listing himself as the primary and the actual author as secondary. Unfortunately that person has left the project so it has fallen to me to fix this (I have to do it purely in sql).
I am trying to figure out how to build a query to do the following:
Find all entries
where the subid value shows up more than once in the table
where at least one of the authorid values is 111
where the ordering for 111 is greater than the ordering for any users that are not 111
& update them so
the not(111) author has ordering of '0'
and the 111 author has ordering '1'
Try this solution:
UPDATE tbl a
INNER JOIN
(
SELECT subid
FROM tbl
GROUP BY subid
HAVING COUNT(*) > 1 AND SUM(authorid = 111) > 0
) b ON a.subid = b.subid
SET a.ordering = (a.authorid = 111)
Replace tbl with your actual table name.
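The same idea can be sketched against a throwaway database to see which rows change. This uses Python's built-in sqlite3 module instead of MySQL (an assumption, for portability), so the UPDATE ... JOIN is rewritten with an IN subquery. The sample rows are made up.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (subid INTEGER, authorid INTEGER, ordering INTEGER)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (1, 111, 0), (1, 222, 1),  # 111 listed as primary: should be swapped
        (2, 333, 0),               # single author: untouched
        (3, 111, 0),               # 111 alone on a resource: untouched
        (4, 444, 0), (4, 555, 1),  # two authors, neither is 111: untouched
    ],
)

# For resources with more than one author where 111 is among them,
# (authorid = 111) evaluates to 1 for user 111 and 0 for everybody else.
conn.execute(
    """
    UPDATE tbl
    SET ordering = (authorid = 111)
    WHERE subid IN (
        SELECT subid
        FROM tbl
        GROUP BY subid
        HAVING COUNT(*) > 1 AND SUM(authorid = 111) > 0
    )
    """
)

rows = conn.execute("SELECT subid, authorid, ordering FROM tbl ORDER BY subid, authorid").fetchall()
print(rows)
```

Note that a resource with three or more authors would end up with several rows at ordering 0, so this fits only the two-author case described in the question.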

Create Contacts Database Which Refers to Users Without Duplicates

My question is similar to (but at the same time completely different from) this question:
Contacts Database
The question is simple: How can I create a Contacts database table which stores a user id and contact id without duplicating keys.
For example, if I have a table called Contacts, it would have a column user_id, and a column contact_id.
Once I do that, it should be as simple as inserting the user and the added contact. Once that is done though, how do I select all of a user's contacts? Also, how do I narrow down the contact entry enough to delete it if need be?
I ended up just creating a table with two foreign keys and then selecting them based on either of the fields.
For example (pseudo code--no specific language, just english):
Table Contact:
user = ForeignKey(from user table)
contact = ForeignKey(from user table)
Then whenever I need something from them, I'll check if the user field contains what I want and then I'll check if the contact field has what I want. This way I don't have to repeat records and I can still find what I need.
Thanks for your answers.
Similar to the question in the link. You would have 3 tables.
Table 1
User_ID
Name
PK(User_ID)
Table 2
Contact_id
Address
Phone_Number
etc...
PK(Contact_id)
Table 3
User_ID
Contact_id
PK(User_ID, Contact_id)
Here Contact_id in Table 2 would be an auto-increment column.
Also, when inserting into Table 3, MySQL will throw an error if the (User_ID, Contact_id) pair already exists.
To select all of a user's contacts, use:
SELECT *
FROM Table_2 JOIN Table_3
ON Table_2.Contact_id = Table_3.Contact_id
WHERE Table_3.User_ID = <userid>
Or, if you need it for a particular name:
SELECT *
FROM Table_1 JOIN Table_3
ON Table_1.User_ID = Table_3.User_ID
JOIN Table_2
ON Table_2.Contact_id = Table_3.Contact_id
WHERE Table_1.Name = <user name>
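The behaviour of the link table (Table 3 above) can be sketched as follows. The table name user_contacts is hypothetical, and Python's built-in sqlite3 module stands in for MySQL (an assumption). The composite primary key is what rejects the duplicate pair, and the same pair of columns is enough to single out one entry for deletion.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE user_contacts (          -- hypothetical name for "Table 3"
        user_id    INTEGER NOT NULL,
        contact_id INTEGER NOT NULL,
        PRIMARY KEY (user_id, contact_id) -- one row per (user, contact) pair
    )
    """
)
conn.execute("INSERT INTO user_contacts VALUES (1, 7)")
conn.execute("INSERT INTO user_contacts VALUES (1, 8)")

# Inserting the same pair again violates the composite primary key.
try:
    conn.execute("INSERT INTO user_contacts VALUES (1, 7)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# All contacts of user 1.
contacts_before = [
    r[0]
    for r in conn.execute(
        "SELECT contact_id FROM user_contacts WHERE user_id = ? ORDER BY contact_id", (1,)
    )
]

# Both key columns together identify exactly one entry to delete.
conn.execute("DELETE FROM user_contacts WHERE user_id = ? AND contact_id = ?", (1, 8))
contacts_after = [
    r[0]
    for r in conn.execute(
        "SELECT contact_id FROM user_contacts WHERE user_id = ? ORDER BY contact_id", (1,)
    )
]
print(duplicate_rejected, contacts_before, contacts_after)
```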
There are two questions here.
"How do I select all of a user's contacts?"
So you have a table tbl_contacts(user_id, contact_id) where both columns together form your primary key, so you won't get duplicated data.
If you want to list all contacts for user_id = ?
SELECT *
FROM tbl_contacts
WHERE user_id = ?
You might want to clarify your second question: "Also, how do I narrow down the contact entry enough to delete it if need be?"
You probably have some other properties that belong to the user's contact (e.g. contact_name or contact_number), and you will need to use those properties to search. Once a query returns exactly one record, you can run: DELETE FROM tbl_contacts WHERE user_id = ? AND contact_id = ?
If this is not the answer you wanted please clarify your question.

Delete Duplicate email addresses from Table in MYSQL

I have a table with columns for ID, firstname, lastname, address, email and so on.
Is there any way to delete duplicate email addresses from the TABLE?
Additional information (from comments):
If there are two rows with the same email address one would have a normal firstname and lastname but the other would have 'Instant' in the firstname. Therefore I can distinguish between them. I just want to delete the one with first name 'instant'.
Note, some records where the firstname='Instant' will have just 1 email address. I don't want to delete just one unique email address, so I can't just delete everything where firstname='Instant'.
Please help me out.
DELETE n1 FROM customers n1, customers n2 WHERE n1.ID > n2.ID AND n1.email = n2.email
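DELETE FROM customers WHERE id NOT IN (SELECT min_id FROM (SELECT MIN(id) AS min_id FROM customers GROUP BY email) AS t)
This keeps the lowest (first inserted) id for every email. The inner SELECT is wrapped in a derived table because MySQL does not allow a subquery to read directly from the table being deleted from.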
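A quick way to see the "keep the lowest id per email" rule in action is the sketch below. It runs on Python's built-in sqlite3 module rather than MySQL (an assumption, for portability). The table name customers is borrowed from the other answers, and the rows are made up.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, firstname TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers (id, firstname, email) VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Instant", "ann@example.com"),
        (3, "Ann", "ann@example.com"),
        (4, "Bob", "bob@example.com"),
    ],
)

# Keep only the lowest id per email. The extra derived table is not needed by
# SQLite; it is kept because MySQL rejects a subquery that reads directly
# from the table being deleted from.
conn.execute(
    """
    DELETE FROM customers
    WHERE id NOT IN (
        SELECT min_id
        FROM (SELECT MIN(id) AS min_id FROM customers GROUP BY email) AS first_rows
    )
    """
)

kept_ids = [r[0] for r in conn.execute("SELECT id FROM customers ORDER BY id")]
print(kept_ids)
```

This rule keeps a row purely by id, so it only matches the question if the 'Instant' row was always inserted after the real one.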
While MiPnamic's answer is essentially correct, it doesn't solve the problem of which record you keep and which you throw away (and how you sort out related records). The short answer is that this cannot be done programmatically.
A query like this:
SELECT email, MAX(ID), MAX(firstname), MAX(lastname), MAX(address)
FROM customers
GROUP BY email
makes it even worse, since you are potentially selecting a mixture of fields from the duplicate rows. You'd need to do something like:
SELECT csr2.*
FROM customers csr2
WHERE ID IN (
SELECT MAX(id)
FROM customers csr
GROUP BY email
);
to get a unique set of existing rows. Of course you still need to sort out all the related records (hint: those are the IDs in the customers table not returned by the query above).
I don't know if this will work in MYSQL (I haven't used it)... but you should be able to do something like the following snippets.
I'd suggest you run them in order to get a feel for if the right data is being selected. If it does work, then you probably want to create a constraint on the column.
Get all of the duplicate e-mail addresses:
SELECT
EMAILADDRESS, COUNT(1)
FROM
TABLE
GROUP BY EMAILADDRESS
HAVING COUNT(1) > 1
Then determine the IDs of the rows that have those addresses:
SELECT
ID
FROM
TABLE
WHERE
EMAILADDRESS IN (
SELECT
EMAILADDRESS
FROM
TABLE
GROUP BY EMAILADDRESS
HAVING COUNT(1) > 1
)
Then finally, delete the rows, based on the above and other constraints:
DELETE
FROM
TABLE
WHERE
ID IN (
SELECT
ID
FROM
TABLE
WHERE
EMAILADDRESS IN (
SELECT
EMAILADDRESS
FROM
TABLE
GROUP BY EMAILADDRESS
HAVING COUNT(1) > 1
)
)
AND FIRSTNAME = 'Instant'
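The three steps above boil down to "delete the 'Instant' row only when its email also appears on another row". Here is a minimal sketch of that rule, run on Python's built-in sqlite3 module as a stand-in for MySQL (an assumption). The table and column names (customers, firstname, email) and the rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, firstname TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers (id, firstname, email) VALUES (?, ?, ?)",
    [
        (1, "Ann", "a@example.com"),
        (2, "Instant", "a@example.com"),  # duplicate of row 1: should be deleted
        (3, "Instant", "b@example.com"),  # only row for this email: must stay
        (4, "Bob", "c@example.com"),
    ],
)

# Delete 'Instant' rows, but only for emails that occur more than once.
cursor = conn.execute(
    """
    DELETE FROM customers
    WHERE firstname = 'Instant'
      AND email IN (
          SELECT email
          FROM customers
          GROUP BY email
          HAVING COUNT(*) > 1
      )
    """
)
deleted_count = cursor.rowcount

kept_ids = [r[0] for r in conn.execute("SELECT id FROM customers ORDER BY id")]
print(deleted_count, kept_ids)
```

One caveat: if an email appears only on several 'Instant' rows, this rule deletes all of them, so that case would need separate handling. In MySQL the IN subquery would also likely need to be wrapped in a derived table, because it reads from the table being deleted from.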
1. Duplicate the table structure.
2. Put a unique key on the email column of the new table (just to be safe).
3. Do an INSERT into the new table, SELECTing the data from the old one and GROUPing BY the email address.
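Another way to dedupe, using forsvarir's answer above but modifying it a bit. This way you can keep whichever record you choose by changing the ORDER BY inside the OVER clause. Note that this is written in SQL Server syntax (BEGIN TRAN, square brackets); in MySQL, ROW_NUMBER() is available from version 8.0: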
BEGIN TRAN
DELETE
FROM [TABLE]
WHERE
ID IN (
SELECT a.ID
FROM
(
SELECT ROW_NUMBER() OVER(PARTITION BY Email ORDER BY Email) [RowNum], ID, Email
FROM [TABLE]
WHERE Email IN
(
SELECT
Email
FROM
[TABLE]
GROUP BY Email
HAVING COUNT(1) > 1
)
) a
WHERE a.RowNum > 1
)
--COMMIT TRAN
--ROLLBACK TRAN
You can follow this MySQL query:
DELETE p1
FROM Person p1, Person p2
WHERE p1.email = p2.email
AND p1.id> p2.id;