How Do I Join Tables to Return a Specific Number of People Per Team Who Fit Certain Criteria? - mysql

This is a pretty specific one:
I have two tables (t1 and t2) -- t1 is my all-person table, where everybody in my database and all their data is housed, and t2 is my much smaller table of people who are actually going to do the work of talking to the people in t1.
As you can see in this sample SQL Fiddle, the people in t1 each have specific criteria assigned to them (age, rating, and team). My end-result will hopefully be: for every one worker in t2, the query will return 2 specific people from t1 if their teams match (the idea behind this is that I'm matching the workers in t2 to the person they're going to talk to in t1 based on their team).
But what makes it trickier is that there are two more sets of criteria that I want the query to satisfy also:
Only return people from t1 if their age is between 30 and 60 (everybody outside that range should be ignored).
And if there are more than two people who fit the above criteria, return the ones with the best rating first. (For example, if there are four people on a team and we only need two, the query should return the two with the best ratings, i.e. whichever ratings are closest to 100.)
The final thing that is difficult to wrap my head around is that there are multiple workers per team as well -- so if there are two t2 workers on 'Team A', the query should return four distinct t1 people on Team A: two attached to one worker and two attached to the other (and like I said above, they should be the four best-rated people, though it doesn't matter which two people go to which worker).
My hopeful output will look something like the following for all teams:
ID (t1)   Person (t1)    Team (t1)   Worker (t2)
539184    Smith, Jane    Team A      Smith, Bob
539186    Smith, Jim     Team A      Smith, Bob
537141    Smith, Danny   Team A      Smith, Bill
537162    Smith, James   Team A      Smith, Bill
Etc.
In reality, I'm doing something similar to this with tens of thousands of records -- which is why this is the only way I can imagine doing it, but I barely even know where to start. Any help would be greatly appreciated, and I'll add any additional information that would be helpful!

The SQL Fiddle did not work, but I'm still going ahead with the query format :)
SET @rank := 1, @curr := 0;
SELECT * FROM (
    SELECT @rank := IF(@curr = t1.id, @rank + 1, 1) AS rank,
           @curr := t1.id AS curr,
           <field_list>
    FROM t1
    INNER JOIN t2 ON t1.id = t2.ref_id
    WHERE t1.age BETWEEN 30 AND 60 <AND "whatever">
    ORDER BY t1.id, t1.rating DESC
) t
WHERE rank <= 2;
Replace <field_list> with the fields you want to select, like name, id, gender.
Replace <AND "whatever"> with any additional conditions you have, like "rating > 10", etc.
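On MySQL 8+ the same "top N per group" ranking can be done with ROW_NUMBER() instead of user variables. The sketch below checks the whole matching idea end to end using an in-memory SQLite database (window functions need SQLite 3.25+); the table layout, sample rows, and the rank-pairing step are all assumptions for illustration, not the asker's real schema:

```python
import sqlite3

# Assumed schema from the question: t1 holds people (age, rating, team),
# t2 holds workers. Window functions need SQLite >= 3.25 / MySQL >= 8.0.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE t1 (id INTEGER, person TEXT, team TEXT, age INTEGER, rating INTEGER);
CREATE TABLE t2 (id INTEGER, worker TEXT, team TEXT);
INSERT INTO t1 VALUES
  (539184, 'Smith, Jane',  'Team A', 35, 98),
  (539186, 'Smith, Jim',   'Team A', 41, 95),
  (537141, 'Smith, Danny', 'Team A', 50, 90),
  (537162, 'Smith, James', 'Team A', 33, 88),
  (537200, 'Smith, Old',   'Team A', 71, 99),  -- excluded by the age filter
  (537300, 'Smith, Low',   'Team A', 45, 10);  -- rating too low to be picked
INSERT INTO t2 VALUES (1, 'Smith, Bob', 'Team A'), (2, 'Smith, Bill', 'Team A');
""")

# Rank eligible people within each team, best rating first.
people = cur.execute("""
    SELECT id, person, team,
           ROW_NUMBER() OVER (PARTITION BY team ORDER BY rating DESC) AS rn
    FROM t1
    WHERE age BETWEEN 30 AND 60
""").fetchall()

# Rank workers within each team, then hand person ranks (2k-1, 2k)
# to the worker with rank k -- two people per worker, no overlaps.
workers = cur.execute("""
    SELECT worker, team,
           ROW_NUMBER() OVER (PARTITION BY team ORDER BY id) AS wn
    FROM t2
""").fetchall()

ranked = {}
for pid, person, team, rn in people:
    ranked.setdefault(team, {})[rn] = (pid, person)

assignments = []
for worker, team, wn in workers:
    for rn in (2 * wn - 1, 2 * wn):
        if rn in ranked.get(team, {}):
            pid, person = ranked[team][rn]
            assignments.append((pid, person, team, worker))
```

The pairing step is done in Python for clarity; in pure SQL you could join the two ranked derived tables on `team` and `CEIL(rn / 2) = wn`.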

Related

Finding non-matches on same table in MS Access

I'm a bit of a novice in MS Access, but I've started doing some data validation at work and figured it was time to get down to a more simplified way of doing it.
First time posting; I'm having an issue trying to display "only" the non-matching values within the same table, i.e. the errors.
I have a table (query) where I have employee details, one set from one database and one from another. Both have the same information in them; however, there are some details in both which are not correct and need to be updated. As an example, see below:
Table1
Employee ID   Surname   EmpID       Surname1
123456789     Smith     123456789   Smith
654987321     Daniels   654987321   Volate
987654321     Hanks     987654321   Hanks
741852963     Donald    741852963   Draps
Now what I want to identify are the rows that are not matched between "Surname" and "Surname1". This should be:
Employee ID   Surname   EmpID       Surname1
741852963     Donald    741852963   Draps
654987321     Daniels   654987321   Volate
I'm going to append these to an Errors table with which I can list all the rows where the values don't match.
What I've tried is the following:
Field: Matches: IIf([Table1].[Surname]<>[Table1].[Surname1],"Yes","No")
This doesn't seem to work as all the results display as Yes and I know for a fact there are inconsistencies.
Does anyone know what or how to do this? Ask any questions if need be.
Thanks
UPDATE
Ok, I think it might be better if I give you the actual names of the columns. I thought it would be easier to simplify it, but maybe not.
Assignment   PayC              HRIS Assignment No   WAPayCycle
12345678     No Payroll        12345678             Pay Cycle 1
20001868     SCP Pay Cycle 1   20001868             SCP Pay Cycle 1
20003272-2   SCP Pay Cycle 1   #Error
20014627     SCP Pay Cycle 1   20014627             SCP Pay Cycle 1
So this gives an idea of what I am doing and the possible errors I need to account for. The first row has a mismatch, so I expect that to error. The 3rd row has a Null in both comparison columns; however, one shows #Error where the other is just blank. The rest are matched.
LINK TO SCREEN DUMPS
https://drive.google.com/open?id=0B-5TRrOketfyb0tCbElYSWNSM1k
This option handles Errors and Nulls in [HRIS Assignment No]:
SELECT *, IIf([Assignment]<>IIf(IsError([HRIS Assignment No]),"",Nz([HRIS Assignment No],"")),"Yes","No") AS Err
FROM [pc look up]
WHERE [Assignment]<>IIf(IsError([HRIS Assignment No]),"",Nz([HRIS Assignment No],""))
This should work:
SELECT *
FROM Table1
WHERE EmployeeID = EmpID
AND (Surname <> Surname1
    OR Len(Nz(Surname,'')) = 0
    OR Len(Nz(Surname1,'')) = 0)
Kind regards,
Rene
In your question you state "one from one database and one from another".
Assuming you start with two tables (you've shown us a query joining the four fields together?) then this query would work:
SELECT T1.[Employee ID]
,T1.Surname
,T2.EmpID
,T2.Surname1
FROM Table1 T1 INNER JOIN Table2 T2 ON T1.[Employee ID] = T2.EmpID AND
T1.Surname <> T2.Surname1
ORDER BY T1.[Employee ID]
An INNER JOIN will give you the result you're after. A LEFT JOIN will show all the values in Table1 (aliased as T1) and only those matching in Table2 (aliased as T2) - the other values will be NULL, a RIGHT JOIN will show it the other way around.
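As a quick sanity check of the join approach, here is the same query run against an in-memory SQLite database standing in for Access (the sample rows come from the question; the Jet SQL above is equivalent apart from the bracketed field names):

```python
import sqlite3

# SQLite stands in for Access here; sample rows taken from the question.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Table1 ("Employee ID" INTEGER, Surname TEXT);
CREATE TABLE Table2 (EmpID INTEGER, Surname1 TEXT);
INSERT INTO Table1 VALUES (123456789, 'Smith'),  (654987321, 'Daniels'),
                          (987654321, 'Hanks'),  (741852963, 'Donald');
INSERT INTO Table2 VALUES (123456789, 'Smith'),  (654987321, 'Volate'),
                          (987654321, 'Hanks'),  (741852963, 'Draps');
""")

# Join on the ID and keep only the rows whose surnames disagree.
mismatches = cur.execute("""
    SELECT T1."Employee ID", T1.Surname, T2.EmpID, T2.Surname1
    FROM Table1 T1
    INNER JOIN Table2 T2
       ON T1."Employee ID" = T2.EmpID
      AND T1.Surname <> T2.Surname1
    ORDER BY T1."Employee ID"
""").fetchall()
```

Only the Daniels/Volate and Donald/Draps rows survive the join, which is exactly the error list the question asks for.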

Why would a SQL query need to be so complicated like this feature allows?

I am studying for SQL exam, and I came across this fact, regarding subqueries:
2. Main query and subquery can get data from different tables
When would this feature be useful? I find it difficult to imagine such a case.
Millions of situations call for finding information in different tables, it's the basis of relational data. Here's an example:
Find the emergency contact information for all students who are in a chemistry class:
SELECT Emergency_Name, Emergency_Phone
FROM tbl_StudentInfo
WHERE StudentID IN (SELECT b.StudentID
FROM tbl_ClassEnroll b
WHERE Subject = 'Chemistry')
SELECT * FROM tableA
WHERE id IN (SELECT id FROM tableB)
There are plenty of reasons why you have to get data from different tables, such as selecting something in the main query based on a subquery (or subqueries) against other tables. The usage is really broad.
Choose customers in the main query based on regions and their values:
SELECT * FROM customers
WHERE country IN (SELECT name FROM country WHERE name LIKE '%land%')
Or choose products priced higher or lower than the customers' average income, and so on...
You could do something like,
SELECT SUM(trans) as 'Transactions', branch.city as 'city'
FROM account
INNER JOIN branch
ON branch.bID = account.bID
GROUP BY branch.city
HAVING SUM(account.trans) < 0;
This would allow a company to identify which branches are making a loss. It would help the company decide whether to change its marketing approach in certain regions, in theory allowing it to become more dynamic and reactive to changes in the economy at any given time.
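A runnable sketch of that idea (SQLite here; the branch/account schema and the sample figures are invented for illustration):

```python
import sqlite3

# Invented sample data: Perth nets -30 (a loss); Sydney nets +100.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE branch  (bID INTEGER, city TEXT);
CREATE TABLE account (bID INTEGER, trans INTEGER);
INSERT INTO branch VALUES (1, 'Perth'), (2, 'Sydney');
INSERT INTO account VALUES (1, -50), (1, 20), (2, 100);
""")

# Aggregate transactions per city and keep only loss-making branches.
losses = cur.execute("""
    SELECT SUM(account.trans) AS transactions, branch.city
    FROM account
    INNER JOIN branch ON branch.bID = account.bID
    GROUP BY branch.city
    HAVING SUM(account.trans) < 0
""").fetchall()
```

Only Perth comes back, with its net total of -30.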

Update a record in table1 if table2 has updates (MySQL)

In MySQL, I have 2 tables (currently MyISAM).
Let's call them table1 and table2.
table1 has:
id
passwd
full_name
email
table2 has:
user_id
book (the book's title)
Basically, a user can rent books, and a row is created in table2, with records about the user_id, and the book title the user rented.
In table1, I would like to add a column called 'books_rented'.
In PHP, I can calculate all the rows that x user_id has, and it will return how many books that person rented.
However, I would like to know if it's possible to do that within MySQL itself, it would seem more optimal to me.
PS.: I am giving books as an example as I thought it would make it simpler, but the tables are actually employer/employee relational. If an employee deletes his account, then it won't really be optimal doing it with PHP as it would need to wait until the employer logs in again to update this. I can't really do this with PHP unless I run a cron job which I don't really like.
What is the relation between those 2 tables? I assume that table1.id matches table2.user_id.
UPDATE table1 SET books_rented = (SELECT COUNT(*) FROM table2 WHERE user_id = table1.id);
Please try this on your dev machine first.
I don't think adding the books_rented column is necessary. You can simply query for the quantity of books that a specific user has rented:
SELECT T1.id
, T1.passwd
, T1.full_name
, T1.email
, IFNULL(COUNT(T2.book), 0) AS books_rented
FROM table1 T1
LEFT JOIN table2 T2 ON T2.user_id = T1.id
GROUP BY T1.id
I suggest that you look at your design again, since there are cases you need to decide on:
renting the same book (will that add to the count of books rented?)
are you counting only the books that are currently being borrowed, or are all the books that have ever been rented included?
Since users and books are many to many relationship, you need to add a JUNCTION TABLE. So you will end up with: users, books and userbooks.
Triggers are exactly what I was looking for. Great tutorial about it.
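For reference, here is what the trigger approach can look like, sketched with SQLite trigger syntax (MySQL additionally needs FOR EACH ROW and a DELIMITER change, but the logic is the same; column names follow the question):

```python
import sqlite3

# Triggers keep books_rented in sync as rental rows come and go.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE table1 (id INTEGER PRIMARY KEY, full_name TEXT,
                     books_rented INTEGER DEFAULT 0);
CREATE TABLE table2 (user_id INTEGER, book TEXT);

-- bump the counter whenever a rental is added
CREATE TRIGGER rental_added AFTER INSERT ON table2
BEGIN
    UPDATE table1 SET books_rented = books_rented + 1 WHERE id = NEW.user_id;
END;

-- and decrement it when a rental row is removed
CREATE TRIGGER rental_removed AFTER DELETE ON table2
BEGIN
    UPDATE table1 SET books_rented = books_rented - 1 WHERE id = OLD.user_id;
END;

INSERT INTO table1 (id, full_name) VALUES (1, 'Alice');
INSERT INTO table2 VALUES (1, 'SQL 101'), (1, 'Joins in Depth');
DELETE FROM table2 WHERE book = 'SQL 101';
""")
rented = cur.execute("SELECT books_rented FROM table1 WHERE id = 1").fetchone()[0]
```

After two inserts and one delete, the counter correctly sits at 1 without any application-side bookkeeping, which also covers the account-deletion case from the question.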

Correct use of the HAVING clause, to return the unique row

I have 3 tables in this scenario: Teams, Players and PlayersInTeams.
A Player is just a registered user. No W/L data is associated with a Player.
A Team maintains win/loss records. If a player is playing by himself, then he plays with his "solo" team (a Team with only that Player in it). Every time Bob and Jan win together, their Team entry gets wins++. Every time Jason and Tommy lose together, their Team entry gets losses++.
The PlayersInTeams table only has 2 columns, and it's an intersection table between Players and Teams:
> desc PlayersInTeams ;
+------------+---------+
| Field | Type |
+------------+---------+
| fkPlayerId | int(11) |
| fkTeamId | int(11) |
+------------+---------+
So here is the tough part:
Because a Player can be part of multiple Teams, it is important to fetch the right TeamId from the Teams table at the beginning of a match.
A Player's SOLO team is given by
select fkTeamId from PlayersInTeams where
fkPlayerId=1 HAVING count(fkTeamId)=1;
NO IT'S NOT!! But I don't understand why.
I'm trying to say:
Get the fkTeamId from PlayersInTeams where
the fkPlayerId=1, but also, the count of rows
that have this particular fkTeamId is exactly 1.
The query returns (empty set), and actually if I change the HAVING clause to being incorrect (HAVING count(fkTeamId)<>1;), it returns the row I want.
To fix your query, add a group by. To compute the count per team, you'll need to change the where clause to return all teams that player 1 is on:
select fkTeamId
from PlayersInTeams
where fkTeamId in
(
select fkTeamId
from PlayersInTeams
where fkPlayerId = 1
)
group by
fkTeamId
having count(*) = 1;
Example at SQL Fiddle.
Below a detailed explanation of why your count(*) = 1 condition works in a surprising way. When a query contains an aggregate like count, but there is no group by clause, the database will treat the entire result set as a single group.
In databases other than MySQL, you could not select a column that is not in a group by without an aggregate. In MySQL, all those columns are returned with the first value encountered by the database (essentially a random value from the group.)
For example:
create table YourTable (player int, team int);
insert YourTable values (1,1), (1,2), (2,2);
select player
, team
, count(team)
from YourTable
where player = 1
-->
player   team   count(team)
1        1      2
The first two columns come from a random row with player = 1. The count(team) value is 2, because there are two rows with player = 1 and a non-null team. The count says nothing about the number of players in the team.
The most natural thing to do is to count the rows to see what is going on:
select fkTeamId, count(*)
from PlayersInTeams
where fkPlayerId=1
group by fkTeamId;
The group by clause is a more natural way to write the query:
select fkTeamId
from PlayersInTeams
where fkPlayerId=1
having count(fkteamid) = 1
However, if there is only one row for a player, then your original version should work -- the filtering would take it to one row, the fkTeamId would be the team on the row and the having would be satisfied. One possibility is that you have duplicate rows in the data.
If duplicates are a problem, you can do this:
select fkTeamId
from PlayersInTeams
where fkPlayerId=1
having count(distinct fkteamid) = 1
EDIT for "solo team":
As pointed out by Andomar, the definition of solo team is not quite what I expected. It is a player being the only player on the team. So, to get the list of teams where a given player is the team:
select fkTeamId
from PlayersInTeams
group by fkTeamId
having sum(fkPlayerId <> 1) = 0
That is, you cannot filter out the other players and expect to get this information. You specifically need them, to be sure they are not on the team.
If you wanted to get all solo teams:
select fkTeamId
from PlayersInTeams
group by fkTeamId
having count(*) = 1
HAVING is usually used with a GROUP BY statement - it's like a WHERE which gets applied to the grouped data.
SELECT fkTeamId
FROM PlayersInTeams
WHERE fkPlayerId = 1
GROUP BY fkTeamId
HAVING COUNT(fkPlayerId) = 1
SqlFiddle here: http://sqlfiddle.com/#!2/36530/21
Try finding teams with one player first, then using that to find the player id if they are in any of those teams:
select DISTINCT PlayersInTeams.fkTeamId
from (
select fkTeamId
from PlayersInTeams
GROUP BY fkTeamId
HAVING count(fkPlayerId)=1
) AS Sub
INNER JOIN PlayersInTeams
ON PlayersInTeams.fkTeamId = Sub.fkTeamId
WHERE PlayersInTeams.fkPlayerId = 1;
Everybody's answers here were very helpful. Besides missing GROUP BY, I found that my problem was mainly that my WHERE clause was "too early".
Say we had
insert into PlayersInTeams values
(1, 1), -- player 1 is in solo team id=1
(1, 2),(2,2) -- player 1&2 are on duo team id=2
;
Now we try:
select fkTeamId from PlayersInTeams
where fkPlayerId=1 -- X kills off data we need
group by fkTeamId
HAVING count(fkTeamId)=1;
In particular the WHERE was filtering the temporary resultset so that any row that didn't have fkPlayerId=1 was being cut out of the result set. Since each fkPlayerId is associated with each fkTeamId only once, of course the COUNT of the number of occurrences by fkTeamId is always 1. So you always get back just a list of the teams player 1 is part of, but you still don't know his SOLO team.
Gordon's answer with my addendum
select fkPlayerId,fkTeamId
from PlayersInTeams
group by fkTeamId
having count(fkTeamId) = 1
and fkPlayerId=1 ;
Was particularly good because it's a cheap query and does what we want. It chooses all the SOLO teams first, (chooses all fkTeamIds that only occur ONCE in the table (via HAVING count(*)=1)), then from there we go and say, "ok, now just give me the one that has this fkPlayerId=1". And voila, the solo team.
DUO teams were harder, but similar. The difficulty was similar to the problem above only with ambiguity from teams of 3 with teams of 2. Details in an SQL Fiddle.
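The "WHERE filters too early" behavior described above is easy to check with a small in-memory example (SQLite here; the data mirrors the addendum: player 1 is solo on team 1 and plays with player 2 on team 2):

```python
import sqlite3

# Player 1 is solo on team 1 and shares team 2 with player 2.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE PlayersInTeams (fkPlayerId INTEGER, fkTeamId INTEGER);
INSERT INTO PlayersInTeams VALUES (1, 1), (1, 2), (2, 2);
""")

# Naive version: WHERE throws away player 2's row first, so every
# remaining team trivially has a count of 1 -- both teams come back.
naive = cur.execute("""
    SELECT fkTeamId FROM PlayersInTeams
    WHERE fkPlayerId = 1
    GROUP BY fkTeamId
    HAVING COUNT(*) = 1
""").fetchall()

# Correct version: count ALL members per team first, then keep the
# single-member teams that player 1 belongs to.
solo = cur.execute("""
    SELECT fkTeamId FROM PlayersInTeams
    WHERE fkTeamId IN (SELECT fkTeamId FROM PlayersInTeams WHERE fkPlayerId = 1)
    GROUP BY fkTeamId
    HAVING COUNT(*) = 1
""").fetchall()
```

The naive query returns both teams, while the subquery version correctly returns only the solo team.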

Finding and dealing with duplicate users

In a large user database with the following format and sample data, we are trying to identify duplicated people:
id   first_name   last_name   email
---------------------------------------------------
1    chris        baker
2    chris        baker       chris@gmail.com
3    chris        baker       chris@hotmail.com
4    chris        baker       crayzyguy@crazy.com
5    carl         castle      castle@npr.org
6    mike         rotch       fakeuser@sample.com
I am using the following query:
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
This works great; I get a list of duplicates with the id numbers of the involved rows.
We would re-assign any associated data tied to a duplicate to the actual person (set user_id = 2 where user_id = 3), then we delete the duplicating user row.
The trouble comes after we make this report the first time, as we clean up the list after manually verifying that they are indeed duplicates -- some ARE NOT duplicates. There are 2 Chris Bakers that are legitimate users.
We don't want to keep seeing Chris Baker in subsequent duplicate reports until the end of time, so I am looking for a way to flag that user id 1 and user id 4 are NOT duplicates of each other for future reports, but they could be duplicated by new users added later.
What I tried
I added a is_not_duplicate field to the user table, but then if a new duplicate "Chris Baker" gets added to the database, it will cause this situation to not show on the duplicate report; the is_not_duplicate improperly excludes one of the accounts. My HAVING statement would not meet the > 1 threshold until there are -two- duplicates of Chris Baker, plus the "real" one marked is_not_duplicate.
Question Summed Up
How can I build exceptions into the above query without looping results or multiple queries?
Sub-queries are fine, but the size of the dataset makes every query count and I'd like the solution to be as performant as possible.
Try to add the is_not_duplicate boolean field and modify your code as follows:
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count",
SUM(is_not_duplicate) AS "real_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
AND
duplicate_count - real_count > 0
Newly added duplicates will have is_not_duplicate = 0, so the real_count for that name will be less than duplicate_count and the row will be shown.
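Here is that behavior checked with an in-memory SQLite database (SQLite's `||` replaces MySQL's CONCAT, and the aggregate expressions are repeated in HAVING; the sample rows are assumed):

```python
import sqlite3

# Users 1 and 4 have been manually confirmed as distinct people.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE users (id INTEGER, first_name TEXT, last_name TEXT,
                    is_not_duplicate INTEGER DEFAULT 0);
INSERT INTO users VALUES
  (1, 'chris', 'baker', 1),
  (4, 'chris', 'baker', 1),
  (5, 'carl', 'castle', 0);
""")

report = """
    SELECT GROUP_CONCAT(id) AS ids,
           COUNT(*) AS duplicate_count,
           SUM(is_not_duplicate) AS real_count
    FROM users
    GROUP BY UPPER(first_name) || UPPER(last_name)
    HAVING COUNT(*) > 1
       AND COUNT(*) - SUM(is_not_duplicate) > 0
"""
before = cur.execute(report).fetchall()  # confirmed pair stays hidden

cur.execute("INSERT INTO users VALUES (9, 'Chris', 'Baker', 0)")
after = cur.execute(report).fetchall()   # new duplicate resurfaces the group
```

The confirmed Chris Bakers stay out of the report until a third, unconfirmed one arrives, at which point the whole group shows up again for review.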
My brain is too fried to come up with the actual query for this at the moment, but I might be able to give you a nudge in a path that should work :)
What if you did add another column (maybe a table of valid duplicated users instead?...both will accomplish the same thing), and ran a subquery that would count up all of the valid duplicates and then you could compare against the count in your current query. You would exclude any users that have matching counts, and would pull in any with counts that are higher. Hopefully that makes sense; I will create a use case:
Chris Baker with id 1 and 4 are marked as valid_duplicates
There are 4 Chris Baker's in the system
You get a count of valid Chris Baker's
You get a count of all Chris Baker's
valid_count <> total_count, so return Chris Baker
You could probably even modify the query so that it does not list the already-validated ids at all (even when only 1 id in a group is marked), rather than having to re-check which ones are valid. That would be a little more complicated; without it, you at least ignore Chris Baker until another one enters the system.
I have written up the basic query, dealing with excluding specific id's I will try to roll in tonight. But, this at least solves your initial need. If you do not need the more complicated query, do let me know so that I do not waste my time on it :)
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
WHERE NOT EXISTS
(
SELECT 1
FROM
(
SELECT
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "valid_duplicate_count"
FROM
users
WHERE
is_valid_duplicate = 1 -- true
GROUP BY
name
HAVING
valid_duplicate_count > 1
) AS duplicate_users
WHERE
duplicate_users.name = users.name
AND valid_duplicate_count = duplicate_count
)
GROUP BY
name
HAVING
duplicate_count > 1
Below is the query that should do the same as above, but the final list will only print the ids that are not in the valid list. This actually ended up being a lot simpler than I thought. It is mostly the same as above; I only kept the version above to preserve both options, and in case I messed this one up -- it does get complicated with so many nested queries. If CTEs or even temp tables are available to you, it might make the query more expressive to break it up :). Hopefully this helps and is what you are looking for.
SELECT GROUP_CONCAT(id) AS "ids",
    CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
    COUNT(*) AS "final_duplicate_count"
    -- this count could actually be 1 due to the nature of the query
FROM
    users
-- get the list of duplicated user names
WHERE EXISTS
(
    SELECT
        CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
        COUNT(*) AS "total_duplicate_count"
    FROM
        users AS total_dup_users
    -- ignore valid_users whose count still matches
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM
        (
            SELECT
                CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
                COUNT(*) AS "valid_duplicate_count"
            FROM
                users AS valid_users
            WHERE
                is_valid_duplicate = 1 -- true
            GROUP BY
                name
            HAVING
                valid_duplicate_count > 1
        ) AS duplicate_users
        WHERE
            -- join inner table to outer table
            duplicate_users.name = total_dup_users.name
            -- valid count check
            AND valid_duplicate_count = total_duplicate_count
    )
    -- join inner table to outer table
    AND total_dup_users.Name = users.Name
    GROUP BY
        name
    HAVING
        total_duplicate_count > 1
)
-- ignore users that are valid when doing the actual counts
AND NOT EXISTS
(
    SELECT 1
    FROM users AS valid
    WHERE
        -- join inner table to outer table
        users.name =
            CONCAT(UPPER(valid.first_name), UPPER(valid.last_name))
        -- only valid users
        AND valid.is_valid_duplicate = 1 -- true
)
GROUP BY
    name
Since this is basically a many-to-many relationship I would add a new table not_duplicate with fields user1 and user2.
I would probably add two rows for each not_duplicate relationship such that I have one row for 2 -> 3 and a symmetric row for 3 -> 2 to ease querying, but that may introduce data inconsistencies so make sure you delete both rows at the same time (or have only one row and make the correct query in your script).
Well, it seems to me that the is_not_duplicate column is not complex enough to hold the information you want to store. From what I understand, you want to manually tell your detection that two distinct users are not duplicates of each other. So either you create a column like is_not_duplicate_of = other-user-id, or, if you want to keep open the possibility that one user can be manually defined as not a duplicate of more than one user, you need a separate table with two user-id columns.
The query telling you the non-overridden duplicates probably has to be a bit more complex than the one you suggested; I cannot think of one that works with group-by-and-having logic. The only thing that comes to my mind is something like:
SELECT u1.* FROM users u1
INNER JOIN users u2
ON u1.id <> u2.id
AND u2.name = u1.name
WHERE NOT EXISTS (
SELECT *
FROM users_non_dups un
WHERE (un.id1 = u1.id AND un.id2 = u2.id)
OR (un.id1 = u2.id AND un.id2 = u1.id)
)
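The pair-exclusion idea above can be checked with a small in-memory example (SQLite; a plain `name` column stands in for the CONCAT of first/last name, and `u1.id < u2.id` is used here so each pair is reported once rather than twice as with `<>`):

```python
import sqlite3

# Three alleged "chris baker" accounts; the pair (1, 2) has been
# manually confirmed as two different people.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE users (id INTEGER, name TEXT);
CREATE TABLE users_non_dups (id1 INTEGER, id2 INTEGER);
INSERT INTO users VALUES (1, 'chris baker'), (2, 'chris baker'), (3, 'chris baker');
INSERT INTO users_non_dups VALUES (1, 2);
""")

# Self-join on matching names; the NOT EXISTS drops pairs that were
# already confirmed as non-duplicates, in either stored order.
pairs = cur.execute("""
    SELECT u1.id, u2.id
    FROM users u1
    INNER JOIN users u2
       ON u1.id < u2.id AND u2.name = u1.name
    WHERE NOT EXISTS (
        SELECT 1 FROM users_non_dups un
        WHERE (un.id1 = u1.id AND un.id2 = u2.id)
           OR (un.id1 = u2.id AND un.id2 = u1.id)
    )
""").fetchall()
```

Only the pairs involving the still-unreviewed account 3 remain, which is the behavior the pair table is meant to produce.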
If you were to correct all duplicates each time you run the report, then a very simple solution might be to modify the query:
SELECT
GROUP_CONCAT(id) AS "ids",
MAX(id) AS "max_id",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
AND
max_id > MAX_ID_LAST_TIME_DUPLICATE_REPORT_WAS_GENERATED;
I would go ahead and make the "confirmed_unique" column, defaulted to "False."
Then, in order to avoid the problems you mentioned, I would select all elements that may look like duplicates and have a "False" entry for "confirmed_unique."
I am not sure if this will work, but could you consider the reverse logic of adding an *is_duplicate_of* column? That way you can mark duplicates by entering in this column the ID of the first record, which will be greater than zero. The records that you wish to retain will have a 0 value in this field. You can set the default (unchecked records) to -1 to keep track of the validation status of each record.
Afterwards you can keep executing an SQL query that compares new records only with correct records having is_duplicate_of = 0.
If you are OK with making a slight change to the format of the report, you could do a self-join like this:
SELECT
CONCAT(u1.id,",", u2.id) AS "ids",
CONCAT(UPPER(u1.first_name), UPPER(u1.last_name)) AS "name"
FROM
users u1, users u2
WHERE
u1.id < u2.id AND
UPPER(u1.first_name) = UPPER(u2.first_name) AND
UPPER(u1.last_name) = UPPER(u2.last_name) AND
CONCAT(u1.id,",", u2.id) NOT IN (SELECT ids from not_dupe)
which reports duplicates as follows:
ids | name
----|--------
1,2 | CHRISBAKER
1,3 | CHRISBAKER
...
And the not_dupe table would have rows like below:
ids
------
1,2
3,4
...
I think it would make sense to create a lookup table storing the ids of the ones that are not duplicates. Thus confirmed non-duplicates are removed, and the query only has to add a small lookup for duplicates actually found in the lookup table.
For instance, in this example we would have
id1   id2
2     4
if crayzyguy@crazy.com and chris@gmail.com are different persons.
If I were you, I would add some geolocalization tables/fields to my database schema.
The probability that two end-users have the same name AND live in the same place is very, very low -- except in very big towns -- but you can split geolocalization into small areas too; it's about granularity.
Good luck.
I would suggest you to create a couple of things:
A Boolean column to flag confirmed users
A String column to save ids
A trigger that will check if the first name and last name are already there to fill up the flag, and that saves in the string column all ids of which this one is a possible duplicate.
Then build a report that looks for rows flagged as duplicates and decodes the string field to match the possible duplicates.
I gave Justin Pihony +1 as the 1st to suggest comparing the duplicate count with the not duplicate count, and Hrant Khachatrian +1 for being the 1st to show an efficient way of doing that.
Here is a slightly different method, plus some renaming to make everything a bit more self explanatory, plus some extra columns in the query to make it obvious which records need to be compared as potential duplicates.
I would call the new column "CONFIRMED_UNIQUE" instead of "IS_NOT_DUPLICATE". Like Hrant I would make it Boolean (tinyint(1) with 0=FALSE and 1=TRUE).
The "potential_duplicate_count" is the maximum number of records that would have to be deleted.
select
group_concat(case when not confirmed_unique then id end) as potential_duplicate_ids,
group_concat(case when confirmed_unique then id end) as confirmed_unique_ids,
concat(upper(first_name), upper(last_name)) as name,
sum( case when not confirmed_unique then 1 end ) - (not max(confirmed_unique)) as potential_duplicate_count
from
users
group by
name
having
potential_duplicate_count > 0
I see someone else has been voted down for suggesting merging, but nothing in your problem statement says the data needs to stay in place. The OP followed up with their own solution, which happens to be a pure SQL one, but that doesn't imply every solution needs to be limited to that.
The issue as I understand is around contacts having multiple, similar, but not necessarily identical records in your database, which has cost and reputational implications so you're looking to deduplicate these records.
I would write a batch job that searches for potential duplicates (this can be as complicated or as simple as you like) and then close the two records that it finds are dupes and create a new record.
To enable that you'd need four new columns:
Status, which would be either Open, Merged, Split
RelatedId, which would hold the value of who the record was merged with
ChainId, the new record Id
DateStatusChanged, obvious enough
Open would be the default status
Merged would be when the record is merged (effectively closed and replaced)
Split would be if the merge was reversed
So, as an example, go through all of the records that, for example, have the same name. Merge them in pairs. So if you have three Chris Bakers, records 1, 2 and 3, merge 1 and 2 to make record 4 and then 3 and 4 to make record 5. Your table would end up something like:
ID NAME STATUS RELATEDID CHAINID DATESTATUSCHANGED [other rows omitted]
1 Chris Baker MERGED 2 4 27-AUG-2012
2 Chris Baker MERGED 1 4 27-AUG-2012
3 Chris Baker MERGED 4 5 28-AUG-2012
4 Chris Baker MERGED 3 5 28-AUG-2012
5 Chris Baker OPEN
This way you have a full record of what has happened to your data and can reverse any changes by unmerging. If, for example, contacts 1 and 2 weren't the same, you reverse the merge of 3 and 4, then reverse the merge of 1 and 2, and you'd end up with this:
ID NAME STATUS RELATEDID CHAINID DATESTATUSCHANGED
1 Chris Baker SPLIT 2 4 29-AUG-2012
2 Chris Baker SPLIT 1 4 29-AUG-2012
3 Chris Baker SPLIT 4 5 29-AUG-2012
4 Chris Baker CLOSED 3 5 29-AUG-2012
5 Chris Baker CLOSED 29-AUG-2012
You could then manually merge, as you'd probably not want your job to automatically remerge split records.
Is there a good reason for not merging duplicate accounts into a single account?
From the comments, it seems like the information is being used mostly for contact information, so merging should be relatively painless and low risk. Once you merge users they will no longer appear in your duplicate report. Furthermore, your users table will actually shrink, which could help with performance.
Add an is_not_duplicate column of datatype bit to your table, set its value where appropriate, and then use the query below:
SELECT GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name"
FROM users
GROUP BY name
HAVING COUNT(*) > SUM(CAST(is_not_duplicate AS UNSIGNED))
The above query compares the total number of duplicate rows with the number of confirmed-valid rows.
Why don't you make the email column a unique identifier in this case? After you cleanse your records once, you would not allow duplicates from there onwards.