mySQL one-to-many query

mySQL one-to-many query - mysql

I've got 3 tables that are something like this (simplified here ofc):
users
user_id
user_name
info
info_id
user_id
rate
contacts
contact_id
user_id
contact_data
users has a one-to-one relationship with info, although info doesn't always have a related entry.
users has a one-to-many relationship with contacts, although contacts doesn't always have related entries.
I know I can grab the proper 'users' + 'info' with a left join, is there a way to get all the data I want at once?
For example, one returned record might be:
user_id: 5
user_name: tom
info_id: 1
rate: 25.00
contact_id: 7
contact_data: 555-1212
contact_id: 8
contact_data: 555-1315
contact_id: 9
contact_data: 555-5511
Is this possible with a single query? Or must I use multiple?

It is possible to do what you're asking in one query, but you'd either need a variable number of columns which is evil because SQL isn't designed for that, or you'd have to have a fixed number of columns, which is even more evil because there is no sensible fixed number of columns you could choose.
I'd suggest using one of two alternatives:
1. Return one row for each contact data, repeating the data in other columns:
5 tom 1 25.00 7 555-1212
5 tom 1 25.00 8 555-1315
5 tom 1 25.00 9 555-5511
The problem with this of course is that redundant data is normally a bad idea, but if you don't have too much redundant data it will be OK. Use your judgement here.
2. Use two queries. This means a slightly longer turnaround time, but less data to transfer.
In most cases I'd prefer the second solution.
You should try to avoid making a large number of queries inside a loop. This can almost always be rewritten to a single query. But if using two queries is the most natural way to solve your problem, just use two queries. Don't try to cram all the data you need into a single query just for the sake of reducing the number of queries.

Each row of result must have the same columns, so you can't aggregate multiple rows of contact not having the other columns as well.
Hopefully, this query would achieve what you need:
SELECT
u.user_id as user_id,
u.user_name as user_name,
i.info_id as info_id,
i.rate as rate,
c.contact_id as contact_id,
c.contact_data as contact_data
FROM users as u
LEFT JOIN info as i ON i.user_id = u.user_id
LEFT JOIN contacts as c ON c.user_id = u.user_id

Related

Querying multiple SQL tables

I'm having trouble understanding joins and subqueries and when to use each. I'm sure that one of them is appropriate here.
I have a table ("owners") of (to keep things simple) unit numbers, names and email addresses.
I have another table ("widgets") of unit numbers and the number of widgets assigned to each unit. Each unit has 0, 1 or 2 widgets.
I need to send an email to each unit depending on whether they have 0, 1 or 2 widgets. In other words (and in plain English, not even remotely an attempt at semi-correct SQL):
select numwidgets from widgets where unit=x
then where owners.unit = widgets.unit
select unit, name, email
The data that I need to pass to my script will look like this:
unit name email widgets
1 Bob Smith bob#example.com 2
I can visualise in my mind the data that I need, but it's extracting it from two different tables that is the problem. The "owners" table is a permanent table, and the "widgets" table is a temporary one for tracking a specific issue that is being addressed in the email I'm sending. I don't need help sending the email, just creating the SQL I need to use to extract the data (numwidgets, name, email) for one email.
Thanks.
EDIT:
Input data:
owners table:
unit, name, email
1,Bob Smith, bob#example.com
widgets table:
unit,widgets
1,2

try this, a inner join selects all rows from both tables as long as there is a match between the columns in both tables.
Subqueries (also known as inner queries or nested queries) are a tool for performing operations in multiple steps. For example, if you wanted to take the sums of several columns, then average all of those values, you’d need to do each aggregation in a distinct step.
select owners.unit, name, email, widgets.numwidgets
from owners
inner join widgets On owners.unit = widgets.unit
where owners.unit = x

For your case you need an inner join. To understand that you need to see the concept of keys which is pretty simple.
In your tables unitnumber is the common column in both tables. So a join has to be applied based on this column.
Subqueries are used when the output of one query is given as input to another query i.e to perform related operations
Select o.unit,o.name,o.email ,w.numwidgets from owners o inner join widgets w on o.unit=w.unit where w.unit=X
In above query pass X = 0,1,2 as per the result you want
Thanks

I think you want:
select o.*, w.widgets
from owners o
inner join widget w
on o.unit = w.unit
where o.unit = 123;

Join multiple row/column MySQL

Trying to do things a bit different with a database, I got a table called "services", this table consist off pID, uID, serviceID.
Then I got a table called "user_profile", that of course got the same uID as used in the table services.
So a user can have multiple services, let's say
pID uID serviceID
1 1 101
2 1 102
3 1 104
4 2 105
So how do I join this to my user_profile data? I'm a bit confused about that.
Let's say somebody visits the profile with uID 1.
Then I need all the services to in the same SQL call if that's possible somehow?
Hope I make abit of sense.

In order to relate tables in SQL you must have in both tables the same column, in your example uID.
Then you write something like:
select a.uID,b.pID,b.serviceID from user_profile a left join services b on a.uID=b.uID

Efficiency of Query to Select Records based on Related Records in Composite Table

Setup
I am creating an event listing where users can narrow down results by several filters. Rather than having a table for each filter (i.e. event_category, event_price) I have the following database structure (to make it easy/flexible to add more filters later):
event
event_id title description [etc...]
-------------------------------------------
fllter
filter_id name slug
-----------------------------
1 Category category
2 Price price
filter_item
filter_item_id filter_id name slug
------------------------------------------------
1 1 Music music
2 1 Restaurant restaurant
3 2 High high
4 2 Low low
event_filter_item
event_id filter_item_id
--------------------------
1 1
1 4
2 1
2 3
Goal
I want to query the database and apply the filters that users specify. For example, if a user searches for events in 'Music' (category) priced 'Low' (price) then only one event will show (with event_id = 1).
The URL would look something like:
www.site.com/events?category=music&price=low
So I need to query the database with the filter 'slugs' I receive from the URL.
This is the query I have written to make this work:
SELECT ev.* FROM event ev
WHERE
EXISTS (SELECT * FROM event_filter_item efi
JOIN filter_item fi on fi.filter_item_id = efi.filter_item_id
JOIN filter f on f.filter_id = fi.filter_id
WHERE efi.event_id = ev.event_id AND f.slug = 'category' AND fi.slug ='music')
AND EXISTS (SELECT * FROM event_filter_item efi
JOIN filter_item fi on fi.filter_item_id = efi.filter_item_id
JOIN filter f on f.filter_id = fi.filter_id
WHERE efi.event_id = ev.event_id AND f.slug = 'price' AND fi.slug = 'low')
This query is currently hardcoded but would be dynamically generated in PHP based on what filters and slugs are present in the URL.
And the big question...
Is this a reasonable way to go about this? Does anyone see a problem with having multiple EXISTS() with sub-queries, and those subqueries performing several joins? This query is extremely quick with only a couple records in the database, but what about when there are thousands or tens of thousands?
Any guidance is really appreciated!
Best,
Chris

While EXISTS is just a form of JOIN, MySQL query optimizer is notoriously "stupid" about executing it optimally. In your case, it will probably do a full table scan on the outer table, then execute the correlated subquery for each row, which is bound to scale badly. People often rewrite EXISTS as explicit JOIN for that reason. Or, just use a smarter DBMS.
In addition to that, consider using a composite PK for filter_item, where FK is at the leading edge - InnoDB tables are clustered and you'd want to group items belonging to the same filter physically close together.
BTW, tens of thousands is not a "large" number of rows - to truly test the scalability use tens of millions or more.

MySQL select users on multiple criteria

My team working on a php/MySQL website for a school project. I have a table of users with typical information (ID,first name, last name, etc). I also have a table of questions with sample data like below. For this simplified example, all the answers to the questions are numerical.
Table Questions:
qid | questionText
1 | 'favorite number'
2 | 'gpa'
3 | 'number of years doing ...'
etc.
Users will have the ability fill out a form to answer any or all of these questions. Note: users are not required to answer all of the questions and the questions themselves are subject to change in the future.
The answer table looks like this:
Table Answers:
uid | qid | value
37 | 1 | 42
37 | 2 | 3.5
38 | 2 | 3.6
etc.
Now, I am working on the search page for the site. I would like the user to select what criteria they want to search on. I have something working, but I'm not sure it is efficient at all or if it will scale (not that these tables will ever be huge - like I said, it is a school project). For example, I might want to list all users whose favorite number is between 100 and 200 and whose GPA is above 2.0. Currently, I have a query builder that works (it creates a valid query that returns accurate results - as far as I can tell). A result of the query builder for this example would look like this:
SELECT u.ID, u.name (etc)
FROM User u
JOIN Answer a1 ON u.ID=a1.uid
JOIN Answer a2 ON u.ID=a2.uid
WHERE 1
AND (a1.qid=1 AND a1.value>100 AND a1.value<200)
AND (a2.qid=2 AND a2.value>2.0)
I add the WHERE 1 so that in the for loops, I can just add " AND (...)". I realize I could drop the '1' and just use implode(and,array) and add the where if array is not empty, but I figured this is equivalent. If not, I can change that easy enough.
As you can see, I add a JOIN for every criteria the searcher asks for. This also allows me to order by a1.value ASC, or a2.value, etc.
First question:
Is this table organization at least somewhat decent? We figured that since the number of questions is variable, and not every user answers every question, that something like this would be necessary.
Main question:
Is the query way too inefficient? I imagine that it is not ideal to join the same table to itself up to maybe a dozen or two times (if we end up putting that many questions in). I did some searching and found these two posts which seem to kind of touch on what I'm looking for:
Mutiple criteria in 1 query
This uses multiple nested (correct term?) queries in EXISTS
Search for products with multiple criteria
One of the comments by youssef azari mentions using 'query 1' UNION 'query 2'
Would either of these perform better/make more sense for what I'm trying to do?
Bonus question:
I left out above for simplicity's sake, but I actually have 3 tables (for number valued questions, booleans, and text)
The decision to have separate tables was because (as far as I could think of) it would either be that or have one big answers table with 3 value columns of different types, having 2 always empty.
This works with my current query builder - an example query would be
SELECT u.ID,...
FROM User u
JOIN AnswerBool b1 ON u.ID=b1.uid
JOIN AnswerNum n1 ON u.ID=n1.uid
JOIN AnswerText t1 ON u.ID=t1.uid
WHERE 1
AND (b1.qid=1 AND b1.value=true)
AND (n1.qid=16 AND n1.value<999)
AND (t1.qid=23 AND t1.value LIKE '...')
With that in mind, what is the best way to get my results?
One final piece of context:
I mentioned this is for a school project. While this is true, then eventual goal (it is an undergrad senior design project) is to have a department use our site for students creating teams for their senior design. For a rough estimate of size, every semester, the department would have somewhere around 200 or so students use our site to form teams. Obviously, when we're done, the department will (hopefully) check our site for security issues and other stuff they need to worry about (what with FERPA and all). We are trying to take into account all common security practices and scalablity concerns, but in the end, our code may be improved by others.
UPDATE
As per nnichols suggestion, I put in a decent amount of data and ran some tests on different queries. I put around 250 users in the table, and about 2000 answers in each of the 3 tables. I found the links provided very informative
(links removed because I can't hyperlink more than twice yet) Links are in nnichols' response
as well as this one that I found:
http://phpmaster.com/using-explain-to-write-better-mysql-queries/
I tried 3 different types of queries, and in the end, the one I proposed worked the best.
First: using EXISTS
SELECT u.ID,...
FROM User u WHERE 1
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=# AND value>#) -- or any condition on value
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=another # AND some_condition(value))
AND EXISTS
(SELECT * FROM AnswerText
...
I used 10 conditions on each of the 3 answer tables (resulting in 30 EXISTS)
Second: using IN - a very similar approach (maybe even exactly?) which yields the same results
SELECT u.ID,...
FROM User u WHERE 1
AND (u.ID) IN (SELECT uid FROM AnswerNumber WHERE qid=# AND ...)
...
again with 30 subqueries.
The third one I tried was the same as described above (using 30 JOINs)
The results of using EXPLAIN on the first two were as follows: (identical)
The primary query on table u had a type of ALL (bad, though users table is not huge) and rows searched was roughly twice the size of the user table (not sure why). Each other row in the output of EXPLAIN was a dependent query on the relevant answer table, with a type of eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall not bad.
For the query I suggested (JOINing):
The primary query was actually on whatever table you joined first (in my case AnswerBoolean) with type of ref (better than ALL). The number of rows searched was equal to the number of questions answered by anyone (as in 50 distinct questions have been answered by anyone) (which will be much less than the number of users). For each additional row in EXPLAIN output, it was a SIMPLE query with type eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall almost the same, but a smaller starting multiplier.
One final advantage to the JOIN method: it was the only one I could figure out how to order by various values (such as n1.value). Since the other two queries were using subqueries, I could not access the value of a specific subquery. Adding the order by clause did change the extra field in the first query to also have 'using temporary' (required, I believe, for order by's) and 'using filesort' (not sure how to avoid that). However, even with those slow-downs, the number of rows is still much less, and the other two (as far as I could get) cannot use order by.

You could answer most of these questions yourself with a suitably large test dataset and the use of EXPLAIN and/or the profiler.
Your INNER JOINs will almost certainly perform better than switching to EXISTS but again this is easy to test with a suitable test dataset and EXPLAIN.

Finding and dealing with duplicate users

In a large user database with the following format and sample data, we are trying to identify duplicated people:
id first_name last_name email
---------------------------------------------------
1 chris baker
2 chris baker chris#gmail.com
3 chris baker chris#hotmail.com
4 chris baker crayzyguy#crazy.com
5 carl castle castle#npr.org
6 mike rotch fakeuser#sample.com
I am using the following query:
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
This works great; I get a list of duplicates with the id numbers of the involved rows.
We would re-assign any associated data tied to a duplicate to the actual person (set user_id = 2 where user_id = 3), then we delete the duplicating user row.
The trouble comes after we make this report the first time, as we clean up the list after manually verifying that they are indeed duplicates -- some ARE NOT duplicates. There are 2 Chris Bakers that are legitimate users.
We don't want to keep seeing Chris Baker in subsequent duplicate reports until the end of time, so I am looking for a way to flag that user id 1 and user id 4 are NOT duplicates of each other for future reports, but they could be duplicated by new users added later.
What I tried
I added a is_not_duplicate field to the user table, but then if a new duplicate "Chris Baker" gets added to the database, it will cause this situation to not show on the duplicate report; the is_not_duplicate improperly excludes one of the accounts. My HAVING statement would not meet the > 1 threshold until there are -two- duplicates of Chris Baker, plus the "real" one marked is_not_duplicate.
Question Summed Up
How can I build exceptions into the above query without looping results or multiple queries?
Sub-queries are fine, but the size of the dataset makes every query count and I'd like the solution to be as performant as possible.

Try to add the is_not_duplicate boolean field and modify your code as follows:
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count",
SUM(is_not_duplicate) AS "real_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
AND
duplicate_count - real_count > 0
Newly added duplicates will have is_not_duplicate=0 so the real_count for that name will be less than duplicate_count and the row will be shown

My brain is too fried to come up with the actual query for this at the moment, but I might be able to give you a nudge in a path that should work :)
What if you did add another column (maybe a table of valid duplicated users instead?...both will accomplish the same thing), and ran a subquery that would count up all of the valid duplicates and then you could compare against the count in your current query. You would exclude any users that have matching counts, and would pull in any with counts that are higher. Hopefully that makes sense; I will create a use case:
Chris Baker with id 1 and 4 are marked as valid_duplicates
There are 4 Chris Baker's in the system
You get a count of valid Chris Baker's
You get a count of all Chris Baker's
valid_count <> total_count, so return Chris Baker
*You probably can even modify the query so that it does not even list the duplicate id's (even if you get a duplicate marking of only 1 id). Rather than having to re-check which are the valids. This would be a little more complicated. Without it, at least you ignore Chris Baker until another enters the system
I have written up the basic query, dealing with excluding specific id's I will try to roll in tonight. But, this at least solves your initial need. If you do not need the more complicated query, do let me know so that I do not waste my time on it :)
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
WHERE NOT EXISTS
(
SELECT 1
FROM
(
SELECT
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "valid_duplicate_count"
FROM
users
WHERE
is_valid_duplicate = 1 --true
GROUP BY
name
HAVING
valid_duplicate_count > 1
) AS duplicate_users
WHERE
duplicate_users.name = users.name
AND valid_duplicate_count = duplicate_count
)
GROUP BY
name
HAVING
duplicate_count > 1
Below is the query that should do the same as above, but the final list will only print the id's that are not in the valid list. This actually ended up being a lot simpler than I thought. And, it is mostly the same as above, but the only reason I kept above is to keep the two options and in case I messed the above up...it does get complicated as it is many nested queries. If CTE's are available to you, or even temp tables. It might make the query more expressive to break it up into temp tables :). Hopefully this helps and is what you are looking for
SELECT GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "final_duplicate_count"
--This count could actually be 1 due to the nature of the query
FROM
users
--get the list of duplicated user names
WHERE EXISTS
(
SELECT
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "total_duplicate_count"
FROM
users AS total_dup_users
--ignore valid_users whose count still matches
WHERE NOT EXISTS
(
SELECT 1
FROM
(
SELECT
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "valid_duplicate_count"
FROM
users AS valid_users
WHERE
is_valid_duplicate = 1 --true
GROUP BY
name
HAVING
valid_duplicate_count > 1
) AS duplicate_users
WHERE
--join inner table to outer table
duplicate_users.name = total_dup_users.name
--valid count check
AND valid_duplicate_count = total_duplicate_count
)
--join inner table to outer table
AND total_dup_users.Name = users.Name
GROUP BY
name
HAVING
duplicate_count > 1
)
--ignore users that are valid when doing the actual counts
AND NOT EXISTS
(
SELECT 1
FROM users AS valid
WHERE
--join inner table to outer table
users.name =
CONCAT(UPPER(valid.first_name), UPPER(valid.last_name))
--only valid users
AND valid.is_valid_duplicate = 1 --true
)
GROUP BY
FinalDuplicates.Name

Since this is basically a many-to-many relationship I would add a new table not_duplicate with fields user1 and user2.
I would probably add two rows for each not_duplicate relationship such that I have one row for 2 -> 3 and a symmetric row for 3 -> 2 to ease querying, but that may introduce data inconsistencies so make sure you delete both rows at the same time (or have only one row and make the correct query in your script).

well it seems to me that the is_not_duplicate column is not complex enough to hold the information you want to store - from what I understand you want to manually tell your detection that two distinct users are not duplicates of each other. so either you create a column like is_not_duplicate_of=other-user-id or if you want to keep the possibility open that one user can be manually defined not duplicate of more than one users, you need a seperate table with two user-id columns.
the query telling you the non overridden duplicates probably has to be a bit more complex than the one you suggested, I cannot think of one that works with a group by and having logic. The only thing that would come to my mind is something like
SELECT u1.* FROM users u1
INNER JOIN users u2
ON u1.id <> u2.id
AND u2.name = u1.name
WHERE NOT EXISTS (
SELECT *
FROM users_non_dups un
WHERE (un.id1 = u1.id AND un.id2 = u2.id)
OR (un.id1 = u2.id AND un.id2 = u1.id)
)

If you were to correct all duplicates each time you run the report, then a very simple solution might be to modify the query:
SELECT
GROUP_CONCAT(id) AS "ids",
MAX(id) AS "max_id",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
AND
max_id > MAX_ID_LAST_TIME_DUPLICATE_REPORT_WAS_GENERATED;

I would go ahead and make the "confirmed_unique" column, defaulted as "False."
In order to avoid the problems you mentioned,
Then I would select all elements that may look like duplicates and have a "False" entry for "confirmed_unique."

I am not sure if this will work, but could you consider the reverse logic of adding a *is_duplicate_of* column? That way you can mark duplicates by entering the ID of the first record at this column which will be greater than zero. The records that you wish to retain will have a 0 value at this field. You can set the default (unchecked records) to -1 to keep track of the validation status for each record.
Afterwards you can keep executing an SQL that will compare new records only with correct records having is_duplicate_of = 0 .

If you are ok to make a slight change to the format of the report. You could do a self-join like this -
SELECT
CONCAT(u1.id,",", u2.id) AS "ids",
CONCAT(UPPER(u1.first_name), UPPER(u1.last_name)) AS "name"
FROM
users u1, users u2
WHERE
u1.id < u2.id AND
UPPER(u1.first_name) = UPPER(u2.first_name) AND
UPPER(u1.last_name) = UPPER(u2.last_name) AND
CONCAT(u1.id,",", u2.id) NOT IN (SELECT ids from not_dupe)
which reports duplicates as follows:
ids | name
----|--------
1,2 | CHRISBAKER
1,3 | CHRISBAKER
...
And the not_dupe table would have rows like below:
ids
------
1,2
3,4
...

I think it would make sense to create a lookup-table storing the ids of the ones that are not duplicates. Thus confirmed non duplicants are removed and the query will only have to ad a small look up for duplicates actualy found on the lookup table.
for instance in this example we would have
id 1 | id 2
2 4
if crayzyguy#crazy.com and chris#gmail.com are diffrent persons.

If I were you, I will add some geolocalisation tables/fields to my database schema.
The probability two end-users are having the same names AND are living in the same place is very very low - except in very big town - but you can split geolocalization to small areas too - it's about granularity.
Good luck.

I would suggest you to create a couple of things:
A Boolean column to flag confirmed users
A String column to save ids
A trigger that will check if the first name and last name are already there to fill up the flag, and save in the string column all ids to which this one is a possible duplicate.
And then build a report that looks for duplicated true and decode the string field to match the possible duplicated

I gave Justin Pihony +1 as the 1st to suggest comparing the duplicate count with the not duplicate count, and Hrant Khachatrian +1 for being the 1st to show an efficient way of doing that.
Here is a slightly different method, plus some renaming to make everything a bit more self explanatory, plus some extra columns in the query to make it obvious which records need to be compared as potential duplicates.
I would call the new column "CONFIRMED_UNIQUE" instead of "IS_NOT_DUPLICATE". Like Hrant I would make it Boolean (tinyint(1) with 0=FALSE and 1=TRUE).
The "potential_duplicate_count" is the maximum number of records that would have to be deleted.
select
group_concat(case when not confirmed_unique then id end) as potential_duplicate_ids,
group_concat(case when confirmed_unique then id end) as confirmed_unique_ids,
concat(upper(first_name), upper(last_name)) as name,
sum( case when not confirmed_unique then 1 end ) - (not max(confirmed_unique)) as potential_duplicate_count
from
users
group by
name
having
potential_duplicate_count > 0

I see someone else has been voted down for the suggestion of merging, but nothing about your problem statement says the data needs to be inplace. The OP followed up with their solution which happens to be a put SQL one, that doesn't imply that every solution needs to be limited to that.
The issue as I understand is around contacts having multiple, similar, but not necessarily identical records in your database, which has cost and reputational implications so you're looking to deduplicate these records.
I would write a batch job that searches for potential duplicates (this can be as complicated or as simple as you like) and then close the two records that it finds are dupes and create a new record.
To enable that you'd need four new columns:
Status, which would be either Open, Merged, Split
RelatedId, which would hold the value of who the record was merged with
ChainId, the new record Id
DateStatusChanged, obvious enough
Open would be the default status
Merged would be when the record is merged (effectively closed and replaced)
Split would be if the merge was reversed
So, as an example, go through all of the records that, for example, have the same name. Merge them in pairs. So if you have three Chris Bakers, records 1, 2 and 3, merge 1 and 2 to make record 4 and then 3 and 4 to make record 5. Your table would end up something like:
ID NAME STATUS RELATEDID CHAINID DATESTATUSCHANGED [other rows omitted]
1 Chris Baker MERGED 2 4 27-AUG-2012
2 Chris Baker MERGED 1 4 27-AUG-2012
3 Chris Baker MERGED 4 5 28-AUG-2012
4 Chris Baker MERGED 3 5 28-AUG-2012
5 Chris Baker OPEN
This way you have a full record of what has happened to your data can reverse any changes by unmerging, if for example contacts 1 and 2 weren't the same you reverse the merge of 3 and 4, reverse the merge of 1 and 2, you'd end up with this:
ID NAME STATUS RELATEDID CHAINID DATESTATUSCHANGED
1 Chris Baker SPLIT 2 4 29-AUG-2012
2 Chris Baker SPLIT 1 4 29-AUG-2012
3 Chris Baker SPLIT 4 5 29-AUG-2012
4 Chris Baker CLOSED 3 5 29-AUG-2012
5 Chris Baker CLOSED 29-AUG-2012
You could then manually merge, as you'd probably not want your job to automatically remerge split records.

Is there a good reason for not merging duplicate accounts into a single account?
From the comments, it seems like the information is being used mostly for contact information so merging should be relatively painless and low risk. Once you merge users they will no longer appear in your duplicate report. Furthermore, you users table will actually shrink which could help with performance.

Add is_not_duplicate by datatype bit to your table and use below query after set is_not_duplicate data value:
SELECT GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name"
FROM users
GROUP BY name
HAVING COUNT(*) > SUM(CAST(is_not_duplicate AS INT))
above query compare total duplicate rows by total valid duplicate rows.

Why don't you make the email column to be a unique identifier in this case, and after you cleanse your records once, you do not allow duplicates from there onwards?

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

mySQL one-to-many query - mysql

Related

Querying multiple SQL tables

Join multiple row/column MySQL

Efficiency of Query to Select Records based on Related Records in Composite Table

MySQL select users on multiple criteria

Finding and dealing with duplicate users

Categories

Resources