Removing reciprocal/similar records using Inner Join

I have a table with names, some of which are shorthand for others and some of which are similar but unrelated. For instance, Michael and Mike are reciprocal, yet Uncle Michael is not. I ran a script to find the one-way and two-way matches, e.g.
Michael | Mike
Mike | Michael
yet only
Michael | Uncle Michael
which indicates they are not matching pairs.
I'm trying to use that to then remove the shorter matching term (e.g. Mike).
I have a SqlFiddle demonstrating this. I can get as far as finding only the matching pairs, but I am unsure how to then write a DELETE t1 that deletes the shorter of the two records in each matching pair.

This might give you some insight from the database server's perspective. We can use a GROUP BY clause to group the names defined in a name pair, e.g. 'Mike' and 'Michael', and then count the number of distinct names in the result set. When more than one distinct name exists, we delete the shorter one. Otherwise we delete nothing, as only one distinct name exists, which we probably want to keep.
delete from Names where exists
(
select count(*) from
(select name from Names where (name='Michael' or name='Mike') group by name ) t
having count(*) >1
)
and name='Mike'
;
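The same idea can be applied to all reciprocal pairs at once instead of one hard-coded pair. Below is a minimal sketch, run against SQLite through Python's sqlite3 module rather than MySQL. The table names `names` and `pairs` and the columns `a` and `b` are assumptions standing in for whatever the matching script actually produced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names (name TEXT PRIMARY KEY);
    -- hypothetical output of the matching script: one row per one-way match
    CREATE TABLE pairs (a TEXT, b TEXT);
    INSERT INTO names VALUES ('Michael'), ('Mike'), ('Uncle Michael');
    INSERT INTO pairs VALUES ('Michael', 'Mike'), ('Mike', 'Michael'),
                             ('Michael', 'Uncle Michael');
""")

# A pair is reciprocal when the mirrored row (b, a) also exists.
# p1.a < p1.b visits each reciprocal pair only once, so a tie in
# length removes one name instead of both.
conn.execute("""
    DELETE FROM names
    WHERE name IN (
        SELECT CASE WHEN length(p1.a) < length(p1.b) THEN p1.a ELSE p1.b END
        FROM pairs AS p1
        JOIN pairs AS p2 ON p2.a = p1.b AND p2.b = p1.a
        WHERE p1.a < p1.b
    )
""")

remaining = [row[0] for row in conn.execute("SELECT name FROM names ORDER BY name")]
print(remaining)
```

Because the subquery reads only from `pairs`, the DELETE never selects from the table it is modifying, which sidesteps the restriction MySQL places on that pattern.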

Related

How do I search for an entry out of two SQL tables and know which table it came from?

I'm trying to find a specific entry. This entry can appear in only ONE of my two tables and will never repeat in either table.
Here is a scaled-down version example of my tables:
Table 1:
Date Name Room
2020/01/23 John 201
2020/01/22 Rebecca 203
Table 2 (does NOT have the same amount of columns):
Date Name
2020/01/23 Robert
2020/01/22 Sarah
To find this entry, I need to specify a date and a name. You can assume names never repeat.
So let's say I want to find Sarah 2020/01/22
She could appear in either Table 1 or Table 2, and I don't know which one and I need to know which table she's in.
I'm not sure how I would do this in a single SQL query. So far I just have two separate ones:
SELECT date,name from Table1 WHERE name="Sarah" and date='2020/01/22'
and
SELECT date,name from Table2 WHERE name="Sarah" and date='2020/01/22'
Is there a way to do it in a single query that also tells me which table it came from? It could be another field or some indication that I can get. Thanks.

Use union all, and add another column to each result set, with a literal value that indicates the table name:
select 't1' as which, date, name from table1 where name = 'Sarah' and date = '2020-01-22'
union all
select 't2' as which, date, name from table2 where name = 'Sarah' and date = '2020-01-22'
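A small runnable sketch of that query follows, using SQLite through Python's sqlite3 module as a stand-in for the real database. The table contents are the sample rows from the question; the bound parameters `:name` and `:date` are an addition for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (date TEXT, name TEXT, room INTEGER);
    CREATE TABLE table2 (date TEXT, name TEXT);
    INSERT INTO table1 VALUES ('2020/01/23', 'John', 201), ('2020/01/22', 'Rebecca', 203);
    INSERT INTO table2 VALUES ('2020/01/23', 'Robert'), ('2020/01/22', 'Sarah');
""")

# Each branch selects the same columns plus a literal that names its table.
query = """
    SELECT 't1' AS which, date, name FROM table1 WHERE name = :name AND date = :date
    UNION ALL
    SELECT 't2' AS which, date, name FROM table2 WHERE name = :name AND date = :date
"""
rows = conn.execute(query, {"name": "Sarah", "date": "2020/01/22"}).fetchall()
print(rows)
```

Only the columns common to both tables are selected, so the differing column counts do not matter.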

SELECT grouping by value in field

Given the following (greatly simplified) example table:
CREATE TABLE `permissions` (
`name` varchar(64) NOT NULL DEFAULT '',
`access` enum('read_only','read_write') NOT NULL DEFAULT 'read_only'
);
And the following example contents:
| name | access |
=====================
| foo | read_only |
| foo | read_write |
| bar | read_only |
What I want to do is run a SELECT query that fetches one row for each unique value in name, favouring those with an access value of read_write. Is there a way that this can be done, i.e. such that the results I would get are:
foo | read_write |
bar | read_only |
I may need to add new options to the access column in future, but they will always be in order of importance (lowest to highest) so, if possible, a solution that can cope with this would be especially useful.
Also, to clarify, my actual table includes other fields than these, which is why I'm not using a unique key on the name column; there will be multiple rows by name by design to suit various criteria.

The following will work on your data:
select name, max(access)
from permissions
group by name;
However, this orders by the string values, not the indexes. Here is another method:
select name,
substring_index(group_concat(access order by access desc), ',', 1) as access
from permissions
group by name;
It is rather funky that order by goes by the index but min() and max() use the character value. Some might even call that a bug.

You can create another table with the priority of the access (so you can add new options), and then group by and find the MIN() value of the priority table:
E.g. create a table called Priority with the values
| PriorityID| access |
========================
| 1 | read_write |
| 2 | read_only |
And then,
SELECT A.Name, B.Access
FROM (
SELECT A.name, MIN(B.PriorityID) AS Most_Valued_Option -- This will be 1 if there is a read_write for that name
FROM permissions A
INNER JOIN Priority B
ON A.Access = B.Access
GROUP BY A.Name ) A
INNER JOIN Priority B
ON A.Most_Valued_Option = B.PriorityID
-- Join that ID with the actual access
-- (and we will select the value of the access in the select statement)
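The priority-table approach can be tried end to end with the sketch below, which uses SQLite through Python's sqlite3 module rather than MySQL. The lower-case names `priority`, `priority_id` and `best_id` are illustrative choices, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE permissions (name TEXT, access TEXT);
    CREATE TABLE priority (priority_id INTEGER PRIMARY KEY, access TEXT);
    INSERT INTO permissions VALUES
        ('foo', 'read_only'), ('foo', 'read_write'), ('bar', 'read_only');
    INSERT INTO priority VALUES (1, 'read_write'), (2, 'read_only');
""")

# The inner query finds the best (lowest) priority id per name,
# the outer join translates that id back into the access value.
rows = conn.execute("""
    SELECT best.name, pr.access
    FROM (
        SELECT p.name, MIN(pr.priority_id) AS best_id
        FROM permissions AS p
        JOIN priority AS pr ON pr.access = p.access
        GROUP BY p.name
    ) AS best
    JOIN priority AS pr ON pr.priority_id = best.best_id
    ORDER BY best.name
""").fetchall()
print(rows)
```

Adding a new access option later then only requires a new row in the priority table.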

The solution proposed by Gordon is sufficient for the current requirements.
If we anticipate a future requirement for a priority order to be other than alphabetical string order (or by enum index value)...
As a modified version of Gordon's answer, I would be tempted to use the MySQL FIELD function and (its converse) ELT function, something like this:
SELECT p.name
, ELT(
MIN(
FIELD(p.access
,'read_write','read_some','read_only'
)
)
,'read_write','read_some','read_only'
) AS access
FROM `permissions` p
GROUP BY p.name
-- values are listed from highest to lowest priority; MIN(FIELD(...)) picks the first listed value present
If the specification is to pull the entire row, and not just the value of the access column, we could use an inline view query to find the preferred access, and a join back to the permissions table to pull the whole row...
SELECT p.*
FROM ( -- inline view, to get the highest priority value of access
SELECT r.name
, MIN(FIELD(r.access,'read_write','read_some','read_only')) AS ax
FROM `permissions` r
GROUP BY r.name
) q
JOIN `permissions` p
ON p.name = q.name
AND p.access = ELT(q.ax,'read_write','read_some','read_only')
Note that this query returns not just the access with the highest priority, but can also return any columns from that row.
With the FIELD and ELT functions, we can implement any ad-hoc ordering of a list of specific, known values. Not just alphabetic ordering, or ordering by the enum index value.
That logic for "priority" can be contained within the query, and won't rely on an extra column(s) in the permissions table, or the contents of any other table(s).
To get the behavior we are looking for, just specifying a priority for access, the "list of the values" used in the FIELD function will need to match the "list of values" in the ELT function, in the same order, and the lists should include all possible values of access.
Reference:
http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_elt
http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_field
ADVANCED USAGE
Not that you have a requirement to do this, but considering possible future requirements... we note that...
A different order of the "list of values" will result in a different ordering of priority of access. So a variety of queries could each implement their own different rules for the "priority". Which access value to look for first, second and so on, by reordering the complete "list of values".
Beyond just reordering, it is also possible to omit a possible value from the "list of values" in the FIELD and ELT functions. Consider for example, omitting the 'read_only' value from the list on this line:
, MIN(FIELD(r.access,'read_write','read_some')) AS ax
and from this line:
AND p.access = ELT(q.ax,'read_write','read_some')
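That limits the name rows returned, but more aggressively than it may seem. FIELD returns 0 for a value that is not in its list, so MIN yields 0 for any name that has a 'read_only' row, ELT(0, ...) returns NULL, and the join condition fails. The result: a name that has only 'read_only' for access is not returned, and neither is a name that has a 'read_only' row alongside a 'read_write' or 'read_some' row.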
Other modifications to the "list of values", where the lists don't "match" are also possible, to implement even more powerful rules. For example, we could exclude a name that has a row with 'read_only'.
For example, in the ELT function, in place of the 'read_only' value, we use a value that we know does not (and cannot) exist on any rows. To illustrate,
we can include 'read_only' as the "highest priority" on this line...
, MIN(FIELD(r.access,'read_only','read_write','read_some')) AS ax
^^^^^^^^^^^
so if a row with 'read_only' is found, that will take priority. But in the ELT function in the outer query, we can translate that back to a different value...
AND p.access = ELT(q.ax,'eXcluDe','read_write','read_some')
^^^^^^^^^
If we know that 'eXcluDe' doesn't exist in the access column, we have effectively excluded any name which has a 'read_only' row, even if there is a 'read_write' row.
Not that you have a specification or current requirement to do any of that. Something to keep in mind for future queries that do have these kinds of requirements.
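For readers without MySQL at hand, the whole-row variant can be approximated in SQLite through Python's sqlite3 module. SQLite has no FIELD or ELT, so this sketch swaps in a CASE expression that maps each access value to a rank; that substitution, the `note` column, and the rank numbers are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE permissions (name TEXT, access TEXT, note TEXT);
    INSERT INTO permissions VALUES
        ('foo', 'read_only',  'row A'),
        ('foo', 'read_write', 'row B'),
        ('bar', 'read_only',  'row C');
""")

# Rank 1 is the most preferred access value; anything unlisted ranks last.
RANK = "CASE {col} WHEN 'read_write' THEN 1 WHEN 'read_only' THEN 2 ELSE 3 END"
rank_inner = RANK.format(col="access")
rank_outer = RANK.format(col="p.access")

# Inline view finds the best rank per name, the join pulls the matching row.
rows = conn.execute(f"""
    SELECT p.name, p.access, p.note
    FROM permissions AS p
    JOIN (
        SELECT name, MIN({rank_inner}) AS best_rank
        FROM permissions
        GROUP BY name
    ) AS q ON q.name = p.name AND {rank_outer} = q.best_rank
    ORDER BY p.name
""").fetchall()
print(rows)
```

As with the FIELD and ELT lists, the priority rule lives entirely inside the query, so a different query can apply a different ranking.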

You can use distinct statement (or Group by)
SELECT distinct name, access
FROM tab;

This works too:
SELECT name, MAX(access)
FROM permissions
GROUP BY name ORDER BY MAX(access) desc

Finding non-matches on same table in MS Access

I'm a bit of a novice in MS Access but I've started doing some data validation at work and figured it was time to get down to a more simplified way of doing it.
First time posting. I'm having an issue trying to display only the non-matching values within the same table, i.e. errors.
I have a table (query) where I have employee details, one set from one database and one from another. Both should hold the same information, however some details in both are not correct and need to be updated. As an example see below:
Table1
Employee ID   Surname   EmpID       Surname1
123456789     Smith     123456789   Smith
654987321     Daniels   654987321   Volate
987654321     Hanks     987654321   Hanks
741852963     Donald    741852963   Draps
Now what I want to identify are the rows where "Surname" and "Surname1" do not match.
This should be:
Employee ID   Surname   EmpID       Surname1
741852963     Donald    741852963   Draps
654987321     Daniels   654987321   Volate
I'm going to append this to an Errors table where I can list all the errors where values don't match.
What I've tried is the following:
Field: Matches: IIf([Table1].[Surname]<>[Table1].[Surname1],"Yes","No")
This doesn't seem to work as all the results display as Yes and I know for a fact there are inconsistencies.
Does anyone know what or how to do this? Ask any questions if need be.
Thanks
UPDATE
Ok I think it might be better if I gave you all the actual names of the columns. I thought it would be easier to simplify it but maybe not.
Assignment   PayC              HRIS Assignment No   WAPayCycle
12345678     No Payroll        12345678             Pay Cycle 1
20001868     SCP Pay Cycle 1   20001868             SCP Pay Cycle 1
20003272-2   SCP Pay Cycle 1   #Error
20014627     SCP Pay Cycle 1   20014627             SCP Pay Cycle 1
So this gives an idea of what I am doing and the possible errors I need to account for. The first row has a mismatch, so I expect that to be flagged as an error. The 3rd row has a Null value in one column and a Null in another; however, one shows as #Error where the other is just blank. The rest are matched.
LINK TO SCREEN DUMPS
https://drive.google.com/open?id=0B-5TRrOketfyb0tCbElYSWNSM1k

This option handles Errors and Nulls in [HRIS Assignment No]:
SELECT * , IIf([Assignment]<>IIf(IsError([HRIS Assignment No]),"",Nz([HRIS Assignment No],"")),"Yes","No") As Err
FROM [pc look up]
WHERE [Assignment]<>IIf(IsError([HRIS Assignment No]),"",Nz([HRIS Assignment No],""))

This should work:
SELECT *
FROM Table1
WHERE [Employee ID] = EmpID
AND (Surname <> Surname1
OR Len(Nz(Surname,'')) = 0
OR Len(Nz(Surname1,'')) = 0)
Kind regards,
Rene

In your question you state "one from one database and one from another".
Assuming you start with two tables (you've shown us a query joining the four fields together?) then this query would work:
SELECT T1.[Employee ID]
,T1.Surname
,T2.EmpID
,T2.Surname1
FROM Table1 T1 INNER JOIN Table2 T2 ON T1.[Employee ID] = T2.EmpID AND
T1.Surname <> T2.Surname1
ORDER BY T1.[Employee ID]
An INNER JOIN will give you the result you're after. A LEFT JOIN will show all the values in Table1 (aliased as T1) and only those matching in Table2 (aliased as T2) - the other values will be NULL, a RIGHT JOIN will show it the other way around.
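A minimal sketch of the two-table mismatch query is shown here, run in SQLite through Python's sqlite3 module rather than in Access. The snake_case column names are stand-ins for the bracketed Access names, and COALESCE plays the role that Nz would play in Access, so that a missing surname counts as a mismatch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (employee_id TEXT, surname TEXT);
    CREATE TABLE table2 (emp_id TEXT, surname1 TEXT);
    INSERT INTO table1 VALUES ('123456789', 'Smith'), ('654987321', 'Daniels'),
                              ('987654321', 'Hanks'), ('741852963', 'Donald');
    INSERT INTO table2 VALUES ('123456789', 'Smith'), ('654987321', 'Volate'),
                              ('987654321', 'Hanks'), ('741852963', 'Draps');
""")

# Join on the employee id, keep only rows whose surnames differ.
# COALESCE turns NULL into '' so a missing surname is reported too.
rows = conn.execute("""
    SELECT t1.employee_id, t1.surname, t2.emp_id, t2.surname1
    FROM table1 AS t1
    JOIN table2 AS t2 ON t1.employee_id = t2.emp_id
    WHERE COALESCE(t1.surname, '') <> COALESCE(t2.surname1, '')
    ORDER BY t1.employee_id
""").fetchall()
print(rows)
```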

Is there any way to know which values in a set of options for a WHERE IN clause were the ones that matched?

I am trying to figure out if this is possible (I think it's not).
I have a query
Select ID from table where table.someCode IN (code1,code2,code3...)
As a result of this query, I will get all the rows that match this parameter.
My question basically is: is there a way to return which code, or codes, matched? Like: code1,code3 matched?
Thanks
EDIT ----------------------------
For example, I have rows like this
ID 1
name somename
somecode abc
ID 2
name someothername
somecode def
ID 3
name someotherothername
somecode qwer
So, I want to make a select ID from table where somecode IN (abc,asdf,wefwerw,qwer, etc...)
But I also want to know (without using a loop in my program to go through each result and collect all the codes) which codes from the IN list matched; in my example, abc,qwer.
Any idea?

If you need the select * then you'll have to either have your invoking code figure it out while parsing/looping through results or you'll need another query (eg SELECT distinct(id) FROM table where table.id IN (...), -or- SELECT id, count(*) from table where table.id IN (...) group by id;)
--- EDITED ANSWER:
nah you cant do that. sorry man.
here's examples of some mysql-only things you can do:
(i'll fill this in in a few minutes)

You could select the unique identifier that identifies each of them. Each of them must have a primary key. Your select by printing ID for each of them will print the ones that matched, code1 is printed iff code1 = ID, code2 is printed iff code2 = ID and so on.
UPDATE: Just change to SELECT ID, code.
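To make the "select the code as well" suggestion concrete, here is a minimal sketch using SQLite through Python's sqlite3 module. The table name `items` is an assumption (the question only calls it "table"), and the second query with DISTINCT is one possible way to get the matched codes without looping over the first result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, somecode TEXT);
    INSERT INTO items VALUES (1, 'somename', 'abc'),
                             (2, 'someothername', 'def'),
                             (3, 'someotherothername', 'qwer');
""")

wanted = ["abc", "asdf", "wefwerw", "qwer"]
placeholders = ", ".join("?" for _ in wanted)  # one bound parameter per code

# Returning somecode next to id shows which code each row matched on.
rows = conn.execute(
    f"SELECT id, somecode FROM items WHERE somecode IN ({placeholders}) ORDER BY id",
    wanted,
).fetchall()

# The distinct codes alone answer "which values of the IN list matched".
matched = [r[0] for r in conn.execute(
    f"SELECT DISTINCT somecode FROM items WHERE somecode IN ({placeholders}) ORDER BY somecode",
    wanted,
)]
print(rows, matched)
```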

Finding and dealing with duplicate users

In a large user database with the following format and sample data, we are trying to identify duplicated people:
id   first_name   last_name   email
---------------------------------------------------
1    chris        baker
2    chris        baker       chris#gmail.com
3    chris        baker       chris#hotmail.com
4    chris        baker       crayzyguy#crazy.com
5    carl         castle      castle#npr.org
6    mike         rotch       fakeuser#sample.com
I am using the following query:
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
This works great; I get a list of duplicates with the id numbers of the involved rows.
We would re-assign any associated data tied to a duplicate to the actual person (set user_id = 2 where user_id = 3), then we delete the duplicating user row.
The trouble comes after we make this report the first time, as we clean up the list after manually verifying that they are indeed duplicates -- some ARE NOT duplicates. There are 2 Chris Bakers that are legitimate users.
We don't want to keep seeing Chris Baker in subsequent duplicate reports until the end of time, so I am looking for a way to flag that user id 1 and user id 4 are NOT duplicates of each other for future reports, but they could be duplicated by new users added later.
What I tried
I added an is_not_duplicate field to the user table, but then if a new duplicate "Chris Baker" gets added to the database, that situation will not show on the duplicate report; the is_not_duplicate flag improperly excludes one of the accounts. My HAVING clause would not meet the > 1 threshold until there are two duplicates of Chris Baker, plus the "real" one marked is_not_duplicate.
Question Summed Up
How can I build exceptions into the above query without looping results or multiple queries?
Sub-queries are fine, but the size of the dataset makes every query count and I'd like the solution to be as performant as possible.

Try to add the is_not_duplicate boolean field and modify your code as follows:
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count",
SUM(is_not_duplicate) AS "real_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
AND
duplicate_count - real_count > 0
Newly added duplicates will have is_not_duplicate=0, so the real_count for that name will be less than duplicate_count and the row will be shown.
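The behaviour can be checked with a small sketch, here using SQLite through Python's sqlite3 module instead of MySQL. It assumes the earlier clean-up has already removed ids 2 and 3, that ids 1 and 4 were flagged, and that is_not_duplicate is NOT NULL with a default of 0; the new user with id 7 is invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name TEXT,
        is_not_duplicate INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO users (id, first_name, last_name, is_not_duplicate) VALUES
        (1, 'chris', 'baker', 1), (4, 'chris', 'baker', 1),
        (5, 'carl', 'castle', 0), (6, 'mike', 'rotch', 0);
""")

# A name is reported only if it occurs more than once AND at least one
# of its rows has not been confirmed as a legitimate separate user.
REPORT = """
    SELECT UPPER(first_name) || UPPER(last_name) AS name,
           COUNT(*) AS duplicate_count,
           SUM(is_not_duplicate) AS real_count
    FROM users
    GROUP BY UPPER(first_name) || UPPER(last_name)
    HAVING COUNT(*) > 1 AND COUNT(*) - SUM(is_not_duplicate) > 0
"""

before = conn.execute(REPORT).fetchall()
# A new, unreviewed Chris Baker signs up (hypothetical id 7).
conn.execute("INSERT INTO users (id, first_name, last_name) VALUES (7, 'Chris', 'Baker')")
after = conn.execute(REPORT).fetchall()
print(before, after)
```

The two confirmed users stay out of the report until a third, unconfirmed row with the same name appears.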

My brain is too fried to come up with the actual query for this at the moment, but I might be able to give you a nudge in a path that should work :)
What if you did add another column (maybe a table of valid duplicated users instead?...both will accomplish the same thing), and ran a subquery that would count up all of the valid duplicates and then you could compare against the count in your current query. You would exclude any users that have matching counts, and would pull in any with counts that are higher. Hopefully that makes sense; I will create a use case:
Chris Baker with id 1 and 4 are marked as valid_duplicates
There are 4 Chris Baker's in the system
You get a count of valid Chris Baker's
You get a count of all Chris Baker's
valid_count <> total_count, so return Chris Baker
*You probably can even modify the query so that it does not even list the duplicate id's (even if you get a duplicate marking of only 1 id). Rather than having to re-check which are the valids. This would be a little more complicated. Without it, at least you ignore Chris Baker until another enters the system
I have written up the basic query, dealing with excluding specific id's I will try to roll in tonight. But, this at least solves your initial need. If you do not need the more complicated query, do let me know so that I do not waste my time on it :)
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
WHERE NOT EXISTS
(
SELECT 1
FROM
(
SELECT
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "valid_duplicate_count"
FROM
users
WHERE
is_valid_duplicate = 1 -- true
GROUP BY
name
HAVING
valid_duplicate_count > 1
) AS duplicate_users
WHERE
duplicate_users.name = users.name
AND valid_duplicate_count = duplicate_count
)
GROUP BY
name
HAVING
duplicate_count > 1
Below is a query that should do the same as above, except that the final list only prints the ids that are not in the valid list. This actually ended up being a lot simpler than I thought. It is mostly the same as above; I kept the first query only to offer both options, and in case I messed something up, since this gets complicated with its many nested queries. If CTEs or even temp tables are available to you, breaking the query up that way might make it more expressive :). Hopefully this helps and is what you are looking for.
SELECT GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "final_duplicate_count"
-- This count could actually be 1 due to the nature of the query
FROM
users
-- get the list of duplicated user names
WHERE EXISTS
(
SELECT
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "total_duplicate_count"
FROM
users AS total_dup_users
-- ignore valid_users whose count still matches
WHERE NOT EXISTS
(
SELECT 1
FROM
(
SELECT
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "valid_duplicate_count"
FROM
users AS valid_users
WHERE
is_valid_duplicate = 1 -- true
GROUP BY
name
HAVING
valid_duplicate_count > 1
) AS duplicate_users
WHERE
-- join inner table to outer table
duplicate_users.name = total_dup_users.name
-- valid count check
AND valid_duplicate_count = total_duplicate_count
)
-- join inner table to outer table
AND total_dup_users.Name = users.Name
GROUP BY
name
HAVING
total_duplicate_count > 1
)
-- ignore users that are valid when doing the actual counts
AND NOT EXISTS
(
SELECT 1
FROM users AS valid
WHERE
-- join inner table to outer table
users.name =
CONCAT(UPPER(valid.first_name), UPPER(valid.last_name))
-- only valid users
AND valid.is_valid_duplicate = 1 -- true
)
GROUP BY
name

Since this is basically a many-to-many relationship I would add a new table not_duplicate with fields user1 and user2.
I would probably add two rows for each not_duplicate relationship such that I have one row for 2 -> 3 and a symmetric row for 3 -> 2 to ease querying, but that may introduce data inconsistencies so make sure you delete both rows at the same time (or have only one row and make the correct query in your script).

Well, it seems to me that the is_not_duplicate column is not complex enough to hold the information you want to store. From what I understand, you want to manually tell your detection that two distinct users are not duplicates of each other. So either you create a column like is_not_duplicate_of=other-user-id or, if you want to keep the possibility open that one user can be manually defined as not a duplicate of more than one user, you need a separate table with two user-id columns.
The query telling you the non-overridden duplicates probably has to be a bit more complex than the one you suggested; I cannot think of one that works with GROUP BY and HAVING logic. The only thing that comes to mind is something like
SELECT u1.* FROM users u1
INNER JOIN users u2
ON u1.id <> u2.id
AND UPPER(u2.first_name) = UPPER(u1.first_name)
AND UPPER(u2.last_name) = UPPER(u1.last_name)
WHERE NOT EXISTS (
SELECT *
FROM users_non_dups un
WHERE (un.id1 = u1.id AND un.id2 = u2.id)
OR (un.id1 = u2.id AND un.id2 = u1.id)
)
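A runnable sketch of the exception-table idea follows, using SQLite through Python's sqlite3 module rather than MySQL. It compares first_name and last_name directly, reports each pair once by requiring u1.id < u2.id (a small deviation from the u1.id <> u2.id join above), and assumes ids 1 and 4 have been manually recorded in users_non_dups as distinct people.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE users_non_dups (id1 INTEGER, id2 INTEGER);
    INSERT INTO users VALUES (1, 'chris', 'baker'), (2, 'chris', 'baker'),
                             (4, 'chris', 'baker'), (5, 'carl', 'castle');
    INSERT INTO users_non_dups VALUES (1, 4);  -- reviewed: two different people
""")

# Self-join on the name, then drop every pair listed in the exception
# table, checking both orientations so one row per pair is enough.
pairs = conn.execute("""
    SELECT u1.id, u2.id
    FROM users AS u1
    JOIN users AS u2
      ON u1.id < u2.id
     AND UPPER(u1.first_name) = UPPER(u2.first_name)
     AND UPPER(u1.last_name)  = UPPER(u2.last_name)
    WHERE NOT EXISTS (
        SELECT 1
        FROM users_non_dups AS un
        WHERE (un.id1 = u1.id AND un.id2 = u2.id)
           OR (un.id1 = u2.id AND un.id2 = u1.id)
    )
    ORDER BY u1.id, u2.id
""").fetchall()
print(pairs)
```

Pair 1 and 4 no longer shows up, while either of them can still be reported against a different account such as id 2.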

If you were to correct all duplicates each time you run the report, then a very simple solution might be to modify the query:
SELECT
GROUP_CONCAT(id) AS "ids",
MAX(id) AS "max_id",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
AND
max_id > MAX_ID_LAST_TIME_DUPLICATE_REPORT_WAS_GENERATED;

I would go ahead and make the "confirmed_unique" column, defaulted to "False".
In order to avoid the problems you mentioned, I would then select all elements that may look like duplicates and have a "False" entry for "confirmed_unique".

I am not sure if this will work, but could you consider the reverse logic of adding a *is_duplicate_of* column? That way you can mark duplicates by entering the ID of the first record at this column which will be greater than zero. The records that you wish to retain will have a 0 value at this field. You can set the default (unchecked records) to -1 to keep track of the validation status for each record.
Afterwards you can keep executing an SQL that will compare new records only with correct records having is_duplicate_of = 0 .

If you are OK with a slight change to the format of the report, you could do a self-join like this:
SELECT
CONCAT(u1.id,",", u2.id) AS "ids",
CONCAT(UPPER(u1.first_name), UPPER(u1.last_name)) AS "name"
FROM
users u1, users u2
WHERE
u1.id < u2.id AND
UPPER(u1.first_name) = UPPER(u2.first_name) AND
UPPER(u1.last_name) = UPPER(u2.last_name) AND
CONCAT(u1.id,",", u2.id) NOT IN (SELECT ids from not_dupe)
which reports duplicates as follows:
ids | name
----|--------
1,2 | CHRISBAKER
1,3 | CHRISBAKER
...
And the not_dupe table would have rows like below:
ids
------
1,2
3,4
...

I think it would make sense to create a lookup table storing the ids of the ones that are not duplicates. Thus confirmed non-duplicates are removed, and the query only has to add a small lookup on that table for the duplicates actually found.
For instance, in this example we would have
id 1 | id 2
2    | 4
if crayzyguy#crazy.com and chris#gmail.com are different persons.

If I were you, I would add some geolocation tables/fields to my database schema.
The probability that two end users have the same name AND live in the same place is very low, except in a very big town, but you can split geolocation into small areas too; it's a matter of granularity.
Good luck.

I would suggest you to create a couple of things:
A Boolean column to flag confirmed users
A String column to save ids
A trigger that will check if the first name and last name are already there, to fill in the flag and save in the string column all the ids of which this one is a possible duplicate.
And then build a report that looks for rows where the duplicate flag is true and decodes the string field to find the possible duplicates.

I gave Justin Pihony +1 as the 1st to suggest comparing the duplicate count with the not duplicate count, and Hrant Khachatrian +1 for being the 1st to show an efficient way of doing that.
Here is a slightly different method, plus some renaming to make everything a bit more self explanatory, plus some extra columns in the query to make it obvious which records need to be compared as potential duplicates.
I would call the new column "CONFIRMED_UNIQUE" instead of "IS_NOT_DUPLICATE". Like Hrant I would make it Boolean (tinyint(1) with 0=FALSE and 1=TRUE).
The "potential_duplicate_count" is the maximum number of records that would have to be deleted.
select
group_concat(case when not confirmed_unique then id end) as potential_duplicate_ids,
group_concat(case when confirmed_unique then id end) as confirmed_unique_ids,
concat(upper(first_name), upper(last_name)) as name,
sum( case when not confirmed_unique then 1 end ) - (not max(confirmed_unique)) as potential_duplicate_count
from
users
group by
name
having
potential_duplicate_count > 0

I see someone else has been voted down for the suggestion of merging, but nothing about your problem statement says the data needs to stay in place. The OP followed up with their solution, which happens to be a pure SQL one, but that doesn't imply that every solution needs to be limited to that.
The issue as I understand is around contacts having multiple, similar, but not necessarily identical records in your database, which has cost and reputational implications so you're looking to deduplicate these records.
I would write a batch job that searches for potential duplicates (this can be as complicated or as simple as you like) and then close the two records that it finds are dupes and create a new record.
To enable that you'd need four new columns:
Status, which would be either Open, Merged, Split
RelatedId, which would hold the value of who the record was merged with
ChainId, the new record Id
DateStatusChanged, obvious enough
Open would be the default status
Merged would be when the record is merged (effectively closed and replaced)
Split would be if the merge was reversed
So, as an example, go through all of the records that, for example, have the same name. Merge them in pairs. So if you have three Chris Bakers, records 1, 2 and 3, merge 1 and 2 to make record 4 and then 3 and 4 to make record 5. Your table would end up something like:
ID  NAME         STATUS  RELATEDID  CHAINID  DATESTATUSCHANGED  [other rows omitted]
1   Chris Baker  MERGED  2          4        27-AUG-2012
2   Chris Baker  MERGED  1          4        27-AUG-2012
3   Chris Baker  MERGED  4          5        28-AUG-2012
4   Chris Baker  MERGED  3          5        28-AUG-2012
5   Chris Baker  OPEN
This way you have a full record of what has happened to your data and can reverse any changes by unmerging. If, for example, contacts 1 and 2 weren't the same, you reverse the merge of 3 and 4, then reverse the merge of 1 and 2, and you'd end up with this:
ID  NAME         STATUS  RELATEDID  CHAINID  DATESTATUSCHANGED
1   Chris Baker  SPLIT   2          4        29-AUG-2012
2   Chris Baker  SPLIT   1          4        29-AUG-2012
3   Chris Baker  SPLIT   4          5        29-AUG-2012
4   Chris Baker  CLOSED  3          5        29-AUG-2012
5   Chris Baker  CLOSED                      29-AUG-2012
You could then manually merge, as you'd probably not want your job to automatically remerge split records.

Is there a good reason for not merging duplicate accounts into a single account?
From the comments, it seems like the information is being used mostly for contact information, so merging should be relatively painless and low risk. Once you merge users they will no longer appear in your duplicate report. Furthermore, your users table will actually shrink, which could help with performance.

Add an is_not_duplicate column of datatype bit to your table, and use the query below after setting the is_not_duplicate values:
SELECT GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name"
FROM users
GROUP BY name
HAVING COUNT(*) > SUM(CAST(is_not_duplicate AS UNSIGNED))
The query above compares the total number of duplicate rows with the total number of valid duplicate rows.

Why don't you make the email column to be a unique identifier in this case, and after you cleanse your records once, you do not allow duplicates from there onwards?
