Showing all duplicates, side by side, in MySQL

I have a table like so:
Table eventlog
user | user_group | event_date | event_dur.
---- | ---------- | ---------- | ----------
xyz  | 1          | 2009-1-1   | 3.5
xyz  | 2          | 2009-1-1   | 4.5
abc  | 2          | 2009-1-2   | 5
abc  | 1          | 2009-1-2   | 5
Notice that in the above sample data, the only reliable fields are the date and the user. Through an oversight that is 90% my fault, I have managed to allow users to duplicate their daily entries. In some instances the duplicates were intended to be updates to their duration, in others they were an attempt to change the user_group they were working with that day, and in other cases both.
Fortunately, I have a fairly strong idea (since this is an update to an older system) of which records are correct. (Basically, this all happened as an attempt to seamlessly merge the old DB with the new DB).
Unfortunately, I have to more or less do this by hand, or risk losing data that only exists on one side and not the other....
Long story short, I'm trying to figure out the right MySQL query to return all records that have more than one entry for a user on any given date. I have been struggling with GROUP BY and HAVING, but the best I can get is a list of one of the two duplicates, per duplicate, which would be great if I knew for sure it was the wrong one.
Here is the closest I've come:
SELECT *
FROM eventlog
GROUP BY event_date, user
HAVING COUNT(user) > 1
ORDER BY event_date, user
Any help with this would be extremely useful. If need be, I have the list of users/date for each set of duplicates, so I can go by hand and remove all 400 of them, but I'd much rather see them all at once.
Thanks!

Would this work?
SELECT event_date, user
FROM eventlog
GROUP BY event_date, user
HAVING COUNT(*) > 1
ORDER BY event_date, user
What's throwing me off is the COUNT(user) clause you have.

You can list all the field values of the duplicates with the GROUP_CONCAT function, but you still get only one row for each set.

I think this would work (untested)
SELECT *
FROM eventlog e1
WHERE 1 <
(
SELECT COUNT(*)
FROM eventlog e2
WHERE e1.event_date = e2.event_date
AND e1.user = e2.user
)
-- AND [maybe an additional constraint to find the bad duplicate]
ORDER BY event_date, user;
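For what it's worth, here is a small self-contained sketch of another way to see every duplicated row side by side: join the table to a derived table holding the (event_date, user) pairs that occur more than once. It uses Python's sqlite3 with an in-memory database purely as a stand-in for MySQL so it can be run anywhere. The extra row for user def is an assumption added to show that non-duplicates are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eventlog (user TEXT, user_group INTEGER, event_date TEXT, event_dur REAL)"
)
conn.executemany(
    "INSERT INTO eventlog VALUES (?, ?, ?, ?)",
    [
        ("xyz", 1, "2009-01-01", 3.5),
        ("xyz", 2, "2009-01-01", 4.5),
        ("abc", 2, "2009-01-02", 5.0),
        ("abc", 1, "2009-01-02", 5.0),
        ("def", 1, "2009-01-02", 2.0),  # assumed extra row with no duplicate
    ],
)

# The derived table d lists the duplicated (event_date, user) keys;
# joining back to eventlog returns every row that belongs to such a key.
duplicates = conn.execute(
    """
    SELECT e.user, e.user_group, e.event_date, e.event_dur
    FROM eventlog AS e
    JOIN (
        SELECT event_date, user
        FROM eventlog
        GROUP BY event_date, user
        HAVING COUNT(*) > 1
    ) AS d
      ON d.event_date = e.event_date AND d.user = e.user
    ORDER BY e.event_date, e.user, e.user_group
    """
).fetchall()

for row in duplicates:
    print(row)
```

The ORDER BY keeps the rows of each duplicate set next to each other, which should make the by-hand comparison easier.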

Related

SQL query help (yes, I have already tried nested select)

this is a screenshot of the database I am talking about
So suppose I have a database full of people with their ID numbers along with a year in which they made an entry plus their favorite show in that year. The years are always in the range 2014-2018, but not everyone has an entry for each year. How can I count the total number of people who have consistently had the same show as their favorite show over all the years they have been recorded for?
I tried doing a nested select but I kept getting an error. I have checked other SQL related questions here that talk about calculating 'change over the years', but none of those answers are compatible with my database and the solutions weren't transferable.

I think you need something like this:
See my SQLFiddle
select id, favorite_show, count(id) as total from people
group by id, favorite_show
having count(id) > 1

Hmmm . . . this gets the people who have only one show:
select count(*)
from (select person
from t
group by person
having min(show) = max(show)
) p;

You can count the number of different favorite shows someone has, and if that's 1 then they've had the same favorite every time.
SELECT COUNT(*)
FROM (SELECT 1
FROM yourTable
GROUP BY person_id
HAVING COUNT(DISTINCT favorite_show) = 1) AS x
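A quick way to sanity-check the COUNT(DISTINCT ...) idea is to run it on a few made-up rows. The sketch below uses Python's sqlite3 in memory as a stand-in; the table name people, its columns and the sample data are all assumptions, since the real schema was only shown in a screenshot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (person_id INTEGER, year INTEGER, favorite_show TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        (1, 2014, "A"), (1, 2015, "A"), (1, 2018, "A"),  # same show every year
        (2, 2014, "A"), (2, 2016, "B"),                  # changed favorite
        (3, 2017, "C"),                                  # single entry, trivially consistent
    ],
)

# One row per person whose entries all name the same show, then count those rows.
consistent = conn.execute(
    """
    SELECT COUNT(*)
    FROM (
        SELECT person_id
        FROM people
        GROUP BY person_id
        HAVING COUNT(DISTINCT favorite_show) = 1
    ) AS x
    """
).fetchone()[0]

print(consistent)
```

Note that a person with only one recorded year is counted as consistent here; if that is not wanted, an extra condition such as COUNT(*) > 1 in the HAVING clause would exclude them.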

Repair database based on two tables

I came back here with another question related to my previous ones. A while ago I created a simple web products parser app which helped me save prices from different websites and do some comparison, but after a while I found a relatively big problem. I will explain everything below.
I have a lot of MySQL tables with the following format:
products with id, name, link
products-prices with id, id_prod, price, availability and date
As you can see, in the products-prices table there is a column id_prod which links to the id in the products table. When I parsed the link for every product I thought they were unique, but in reality something happened and for every product I have 3-4 links. For example, let's consider www.example.com/smth: instead of storing it parsed like that (without http/s and without the trailing /) in the DB, I stored the whole link, and now I have 4 different products (basically the same one) with http://www.example.com/smth, https://www.example.com/smth, http://www.example.com/smth/, https://www.example.com/smth/. Now I want to run a query to repair my database: basically delete 1 to 3 entries, keep only one product in products, and also change the id_prod of every matching entry in products-prices.
I don't want a direct answer, instead if you can route me to a tutorial/concept of what syntax I need to use I will be more than thankful. Have a good day!
Edit, real world example
https://images2.imgbox.com/f5/a5/0bdvqXcu_o.png
https://images2.imgbox.com/22/e8/BTbPLCzE_o.png
In the first picture, you can see that the only difference between those 3 products is the link, and in the link the only difference is that one of them is http and the other ones are https, and of those 2 https links one has a trailing slash. In the second picture I have a lot (yes, I know, very inefficient) of entries which, in this example, I want to point to the product with id 2 from the first picture.

Try a simple grouping to ascertain the scale of the problem:
SELECT COUNT(PRODID) AS C, PRODID
FROM YOURTABLE
GROUP BY PRODID
HAVING COUNT(PRODID) >1
Once you have identified the scale of the issue, you could create a table to stage one row from each set of duplicates, using a sequence number based on the PRODID as below:
-- MySQL has no SELECT ... INTO <table>; CREATE TABLE ... AS SELECT does the same job
CREATE TABLE TmpTable AS
SELECT *
FROM
(SELECT
@row_number:=CASE
WHEN @PRODID = PRODID THEN @row_number + 1
ELSE 1
END AS SEQ,
@PRODID :=PRODID as PRODID
FROM
YOURTABLE
ORDER BY PRODID) dups
WHERE dups.SEQ = 1
You could then delete all rows in your source table:
DELETE FROM YOURTABLE
WHERE PRODID IN (SELECT PRODID FROM TmpTable)
And then finally write the rows back from your temp table:
INSERT INTO YOURTABLE
SELECT field1, field2 etc. FROM TmpTable
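Since the question is really about links that differ only by scheme and trailing slash, the general concept can be sketched as: normalize each link, pick one product per normalized link, repoint the prices, then delete the extras. The following is only a rough illustration using Python's sqlite3 in memory. The table name products_prices (standing in for products-prices), the normalize helper, the sample rows and the choice to keep the lowest id are all assumptions.

```python
import sqlite3

def normalize(link):
    """Strip the scheme and any trailing slash so URL variants compare equal."""
    for prefix in ("https://", "http://"):
        if link.startswith(prefix):
            link = link[len(prefix):]
            break
    return link.rstrip("/")

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, link TEXT);
    CREATE TABLE products_prices (id INTEGER PRIMARY KEY, id_prod INTEGER, price REAL);
    INSERT INTO products VALUES
        (1, 'smth', 'http://www.example.com/smth'),
        (2, 'smth', 'https://www.example.com/smth'),
        (3, 'smth', 'https://www.example.com/smth/'),
        (4, 'other', 'https://www.example.com/other');
    INSERT INTO products_prices VALUES (1, 1, 10.0), (2, 2, 11.0), (3, 3, 12.0), (4, 4, 5.0);
    """
)

# Map every product id to the lowest id that shares its normalized link.
keeper_by_link = {}
remap = {}
for prod_id, link in conn.execute("SELECT id, link FROM products ORDER BY id").fetchall():
    keeper = keeper_by_link.setdefault(normalize(link), prod_id)
    remap[prod_id] = keeper

with conn:  # one transaction: repoint prices, drop extras, store the clean links
    for prod_id, keeper in remap.items():
        if prod_id != keeper:
            conn.execute(
                "UPDATE products_prices SET id_prod = ? WHERE id_prod = ?", (keeper, prod_id)
            )
            conn.execute("DELETE FROM products WHERE id = ?", (prod_id,))
    for link, keeper in keeper_by_link.items():
        conn.execute("UPDATE products SET link = ? WHERE id = ?", (link, keeper))

products = conn.execute("SELECT id, link FROM products ORDER BY id").fetchall()
price_targets = conn.execute("SELECT id_prod FROM products_prices ORDER BY id").fetchall()
print(products)
print(price_targets)
```

Doing the repointing before the delete, inside one transaction, means no price row is ever left pointing at a product that no longer exists. A unique index on the normalized link afterwards would stop the variants from creeping back in.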

SQL performance of a large number of sum()s

Within my J2EE web application, I need to generate a bar chart representing the percentage of users in the system with specific alerts. (EDIT - I forgot to mention, the graph only deals with alerts associated with the first situation of each user, thus the min(date).)
A simplified (but structurally similar) version of my database schema is as follows :
users { id, name }
situations { id, user_id, date }
alerts { id, situation_id, alertA, alertB }
where users to situations are 1-n, and situations to alerts are 1-1.
I've omitted datatypes but the alerts (alertA and B) are booleans. In my actual case, there are many such alerts (30-ish).
So far, this is what I have come up with :
select sum(alerts.alertA), sum(alerts.alertB)
from alerts, (
select id, min(date)
from situations
group by user_id) as situations
where situations.id = alerts.situation_id;
and then divide these sums by
select count(users.id) from users;
This seems far from ideal.
Your recommendations/advice as to how to improve this query would be most appreciated (or maybe I need to re-think my database schema)...
Thanks,
Anthony
PS. I was also thinking of using a trigger to refresh a chart specific table whenever the alerts table is updated but I guess that's a subject for a different query (if it turns out to be problematic).

First, think about your schema again. You will have a lot of different alerts and you probably don't want to add a separate column for every one of those.
Consider changing your alerts table to something like { id, situation_id, type, value } where type would be (A,B,C,....) and value would be your boolean.
Your task to calculate the percentages would then split up into:
(1) Count the total number of users:
SELECT COUNT(id) AS total FROM users
(2) Find the "first" situation for each user:
SELECT MIN(situations.id) AS id, situations.user_id
-- selects the minimum date for every user_id
FROM (SELECT user_id, MIN(date) AS min_date
FROM situations
GROUP BY user_id) AS first_situation
-- gets the situations.id for user with minimum date
JOIN situations ON
first_situation.user_id = situations.user_id AND
first_situation.min_date = situations.date
-- limits number of situations per user to 1 (possible min_date duplicates)
GROUP BY situations.user_id
(3) Count users for whom an alert is set in at least one of the situations in the subquery:
SELECT
alerts.type,
COUNT(situations.user_id)
FROM ( ... situations.user_id, situations.id ... ) AS situations
JOIN alerts ON
situations.id = alerts.situation_id
WHERE
alerts.value = 1
GROUP BY
alerts.type
Put those three steps together to get something like:
SELECT
alerts.type,
COUNT(situations.user_id)/users.total
FROM (SELECT MIN(situations.id) AS id, situations.user_id
FROM (SELECT user_id, MIN(date) AS min_date
FROM situations
GROUP BY user_id) AS first_situation
JOIN situations ON
first_situation.user_id = situations.user_id AND
first_situation.min_date = situations.date
GROUP BY situations.user_id
) AS situations
JOIN alerts ON
situations.id = alerts.situation_id
JOIN (SELECT COUNT(id) AS total FROM users) AS users
WHERE
alerts.value = 1
GROUP BY
alerts.type
All queries written from my head without testing. Even if they don't work exactly like that, you should still get the idea!
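To make the idea concrete, here is a small runnable sketch of the three steps on made-up data, using Python's sqlite3 in memory as a stand-in for MySQL. It assumes the reshaped alerts table { id, situation_id, type, value } suggested above; the sample rows are invented, and the total is taken with a scalar subquery rather than a join, which is just one possible variation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE situations (id INTEGER PRIMARY KEY, user_id INTEGER, date TEXT);
    CREATE TABLE alerts (id INTEGER PRIMARY KEY, situation_id INTEGER, type TEXT, value INTEGER);
    INSERT INTO users VALUES (1, 'u1'), (2, 'u2'), (3, 'u3'), (4, 'u4');
    INSERT INTO situations VALUES
        (10, 1, '2020-01-01'), (11, 1, '2020-02-01'),
        (20, 2, '2020-01-05'),
        (30, 3, '2020-03-01');
    INSERT INTO alerts VALUES
        (1, 10, 'A', 1), (2, 10, 'B', 0),
        (3, 11, 'B', 1),  -- belongs to a later situation of user 1, so it is ignored
        (4, 20, 'A', 1), (5, 20, 'B', 1),
        (6, 30, 'A', 0), (7, 30, 'B', 0);
    """
)

# Share of all users whose first situation has each alert type set.
shares = conn.execute(
    """
    SELECT a.type,
           COUNT(DISTINCT s.user_id) * 1.0 / (SELECT COUNT(*) FROM users) AS share
    FROM (
        SELECT MIN(s2.id) AS id, s2.user_id
        FROM situations AS s2
        JOIN (SELECT user_id, MIN(date) AS min_date
              FROM situations
              GROUP BY user_id) AS f
          ON f.user_id = s2.user_id AND f.min_date = s2.date
        GROUP BY s2.user_id
    ) AS s
    JOIN alerts AS a ON a.situation_id = s.id
    WHERE a.value = 1
    GROUP BY a.type
    ORDER BY a.type
    """
).fetchall()

print(shares)
```

User 4 has no situation at all but still counts in the denominator, which matches dividing by the total number of users as in the question.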

Finding and dealing with duplicate users

In a large user database with the following format and sample data, we are trying to identify duplicated people:
id | first_name | last_name | email
---|------------|-----------|---------------------
1  | chris      | baker     |
2  | chris      | baker     | chris@gmail.com
3  | chris      | baker     | chris@hotmail.com
4  | chris      | baker     | crayzyguy@crazy.com
5  | carl       | castle    | castle@npr.org
6  | mike       | rotch     | fakeuser@sample.com
I am using the following query:
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
This works great; I get a list of duplicates with the id numbers of the involved rows.
We would re-assign any associated data tied to a duplicate to the actual person (set user_id = 2 where user_id = 3), then we delete the duplicating user row.
The trouble comes after we make this report the first time, as we clean up the list after manually verifying that they are indeed duplicates -- some ARE NOT duplicates. There are 2 Chris Bakers that are legitimate users.
We don't want to keep seeing Chris Baker in subsequent duplicate reports until the end of time, so I am looking for a way to flag that user id 1 and user id 4 are NOT duplicates of each other for future reports, but they could be duplicated by new users added later.
What I tried
I added an is_not_duplicate field to the user table, but then if a new duplicate "Chris Baker" gets added to the database, it will cause this situation to not show on the duplicate report; the is_not_duplicate flag improperly excludes one of the accounts. My HAVING clause would not meet the > 1 threshold until there are -two- duplicates of Chris Baker, plus the "real" one marked is_not_duplicate.
Question Summed Up
How can I build exceptions into the above query without looping results or multiple queries?
Sub-queries are fine, but the size of the dataset makes every query count and I'd like the solution to be as performant as possible.

Try to add the is_not_duplicate boolean field and modify your code as follows:
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count",
SUM(is_not_duplicate) AS "real_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
AND
duplicate_count - real_count > 0
Newly added duplicates will have is_not_duplicate=0, so the real_count for that name will be less than duplicate_count and the row will be shown.
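Here is a small runnable check of that behaviour, sketched with Python's sqlite3 in memory as a stand-in for MySQL. The sample rows and the new user with id 7 are assumptions, and the aggregate expressions are repeated in GROUP BY and HAVING instead of using the aliases so that the SQL does not depend on alias resolution rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,"
    " is_not_duplicate INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO users (id, first_name, last_name, is_not_duplicate) VALUES (?, ?, ?, ?)",
    [
        (1, "chris", "baker", 1),  # verified as a distinct person
        (4, "chris", "baker", 1),  # verified as a distinct person
        (5, "carl", "castle", 0),
        (6, "mike", "rotch", 0),
    ],
)

REPORT = """
    SELECT GROUP_CONCAT(id) AS ids,
           UPPER(first_name) || UPPER(last_name) AS name,
           COUNT(*) AS duplicate_count,
           SUM(is_not_duplicate) AS real_count
    FROM users
    GROUP BY UPPER(first_name), UPPER(last_name)
    HAVING COUNT(*) > 1 AND COUNT(*) - SUM(is_not_duplicate) > 0
"""

before = conn.execute(REPORT).fetchall()  # both Chris Bakers are cleared
conn.execute("INSERT INTO users (id, first_name, last_name) VALUES (7, 'Chris', 'Baker')")
after = conn.execute(REPORT).fetchall()   # the unverified newcomer brings the name back

print(before)
print(after)
```

The name stays off the report while every row for it is flagged and reappears as soon as an unflagged row with the same name is added.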

My brain is too fried to come up with the actual query for this at the moment, but I might be able to give you a nudge in a path that should work :)
What if you did add another column (maybe a table of valid duplicated users instead?...both will accomplish the same thing), and ran a subquery that would count up all of the valid duplicates and then you could compare against the count in your current query. You would exclude any users that have matching counts, and would pull in any with counts that are higher. Hopefully that makes sense; I will create a use case:
Chris Baker with id 1 and 4 are marked as valid_duplicates
There are 4 Chris Bakers in the system
You get a count of valid Chris Bakers
You get a count of all Chris Bakers
valid_count <> total_count, so return Chris Baker
*You can probably even modify the query so that it does not list the valid ids at all (even if that leaves only 1 id in the duplicate listing), which saves having to re-check which ones are valid. This would be a little more complicated. Without it, at least you ignore Chris Baker until another one enters the system.
I have written up the basic query, dealing with excluding specific id's I will try to roll in tonight. But, this at least solves your initial need. If you do not need the more complicated query, do let me know so that I do not waste my time on it :)
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
WHERE NOT EXISTS
(
SELECT 1
FROM
(
SELECT
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "valid_duplicate_count"
FROM
users
WHERE
is_valid_duplicate = 1 -- true
GROUP BY
name
HAVING
valid_duplicate_count > 1
) AS duplicate_users
WHERE
duplicate_users.name = users.name
AND valid_duplicate_count = duplicate_count
)
GROUP BY
name
HAVING
duplicate_count > 1
Below is the query that should do the same as above, but the final list will only print the ids that are not in the valid list. This actually ended up being a lot simpler than I thought. It is mostly the same as above; the only reason I kept the first version is to offer both options, and in case I messed something up, since it does get complicated with so many nested queries. If CTEs or even temp tables are available to you, breaking the query up with them might make it more expressive :). Hopefully this helps and is what you are looking for.
SELECT GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "final_duplicate_count"
-- This count could actually be 1 due to the nature of the query
FROM
users
-- get the list of duplicated user names
WHERE EXISTS
(
SELECT
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "total_duplicate_count"
FROM
users AS total_dup_users
-- ignore valid_users whose count still matches
WHERE NOT EXISTS
(
SELECT 1
FROM
(
SELECT
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "valid_duplicate_count"
FROM
users AS valid_users
WHERE
is_valid_duplicate = 1 -- true
GROUP BY
name
HAVING
valid_duplicate_count > 1
) AS duplicate_users
WHERE
-- join inner table to outer table
duplicate_users.name = total_dup_users.name
-- valid count check
AND valid_duplicate_count = total_duplicate_count
)
-- join inner table to outer table
AND total_dup_users.Name = users.Name
GROUP BY
name
HAVING
total_duplicate_count > 1
)
-- ignore users that are valid when doing the actual counts
AND NOT EXISTS
(
SELECT 1
FROM users AS valid
WHERE
-- join inner table to outer table
users.name =
CONCAT(UPPER(valid.first_name), UPPER(valid.last_name))
-- only valid users
AND valid.is_valid_duplicate = 1 -- true
)
GROUP BY
name

Since this is basically a many-to-many relationship I would add a new table not_duplicate with fields user1 and user2.
I would probably add two rows for each not_duplicate relationship such that I have one row for 2 -> 3 and a symmetric row for 3 -> 2 to ease querying, but that may introduce data inconsistencies so make sure you delete both rows at the same time (or have only one row and make the correct query in your script).

Well, it seems to me that the is_not_duplicate column is not complex enough to hold the information you want to store. From what I understand, you want to manually tell your detection that two distinct users are not duplicates of each other. So either you create a column like is_not_duplicate_of=other-user-id or, if you want to keep open the possibility that one user can be manually defined as not a duplicate of more than one user, you need a separate table with two user-id columns.
the query telling you the non overridden duplicates probably has to be a bit more complex than the one you suggested, I cannot think of one that works with a group by and having logic. The only thing that would come to my mind is something like
SELECT u1.* FROM users u1
INNER JOIN users u2
ON u1.id <> u2.id
AND u2.name = u1.name
WHERE NOT EXISTS (
SELECT *
FROM users_non_dups un
WHERE (un.id1 = u1.id AND un.id2 = u2.id)
OR (un.id1 = u2.id AND un.id2 = u1.id)
)
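A runnable sketch of this pair-table idea, using Python's sqlite3 in memory as a stand-in for MySQL, might look like the following. The users_non_dups table with columns id1 and id2 follows the query above; the sample rows are assumptions, and because the users table has no name column, the names are compared through first_name and last_name. Using u1.id < u2.id instead of <> lists each pair only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE users_non_dups (id1 INTEGER, id2 INTEGER);
    INSERT INTO users VALUES
        (1, 'chris', 'baker'), (2, 'chris', 'baker'), (4, 'chris', 'baker'),
        (5, 'carl', 'castle');
    INSERT INTO users_non_dups VALUES (1, 4);  -- 1 and 4 were verified as different people
    """
)

# Every pair of same-named users, minus the pairs recorded as verified non-duplicates.
pairs = conn.execute(
    """
    SELECT u1.id, u2.id
    FROM users AS u1
    JOIN users AS u2
      ON u1.id < u2.id
     AND UPPER(u1.first_name) = UPPER(u2.first_name)
     AND UPPER(u1.last_name) = UPPER(u2.last_name)
    WHERE NOT EXISTS (
        SELECT 1
        FROM users_non_dups AS un
        WHERE (un.id1 = u1.id AND un.id2 = u2.id)
           OR (un.id1 = u2.id AND un.id2 = u1.id)
    )
    ORDER BY u1.id, u2.id
    """
).fetchall()

print(pairs)
```

Only the exact pair (1, 4) is suppressed, so a later Chris Baker would still be reported against both of them.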

If you were to correct all duplicates each time you run the report, then a very simple solution might be to modify the query:
SELECT
GROUP_CONCAT(id) AS "ids",
MAX(id) AS "max_id",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
AND
max_id > MAX_ID_LAST_TIME_DUPLICATE_REPORT_WAS_GENERATED;

I would go ahead and make the "confirmed_unique" column, defaulted as "False."
In order to avoid the problems you mentioned, I would then select all elements that may look like duplicates and have a "False" entry for "confirmed_unique."

I am not sure if this will work, but could you consider the reverse logic of adding a *is_duplicate_of* column? That way you can mark duplicates by entering the ID of the first record at this column which will be greater than zero. The records that you wish to retain will have a 0 value at this field. You can set the default (unchecked records) to -1 to keep track of the validation status for each record.
Afterwards you can keep executing an SQL that will compare new records only with correct records having is_duplicate_of = 0 .

If you are OK with making a slight change to the format of the report, you could do a self-join like this -
SELECT
CONCAT(u1.id,",", u2.id) AS "ids",
CONCAT(UPPER(u1.first_name), UPPER(u1.last_name)) AS "name"
FROM
users u1, users u2
WHERE
u1.id < u2.id AND
UPPER(u1.first_name) = UPPER(u2.first_name) AND
UPPER(u1.last_name) = UPPER(u2.last_name) AND
CONCAT(u1.id,",", u2.id) NOT IN (SELECT ids from not_dupe)
which reports duplicates as follows:
ids | name
----|--------
1,2 | CHRISBAKER
1,3 | CHRISBAKER
...
And the not_dupe table would have rows like below:
ids
------
1,2
3,4
...

I think it would make sense to create a lookup table storing the ids of the ones that are not duplicates. Confirmed non-duplicates are thus removed, and the query only has to add a small lookup against that table for the duplicates it actually finds.
For instance, in this example we would have
id 1 | id 2
-----|-----
2    | 4
if crayzyguy@crazy.com and chris@gmail.com are different persons.

If I were you, I would add some geolocation tables/fields to my database schema.
The probability that two end users have the same name AND live in the same place is very low, except in very big cities, but you can split the geolocation into smaller areas too; it's a matter of granularity.
Good luck.

I would suggest you create a couple of things:
A Boolean column to flag confirmed users
A String column to save ids
A trigger that checks whether the first name and last name are already present, sets the flag accordingly, and saves in the string column all ids of which this row is a possible duplicate.
And then build a report that looks for rows flagged as duplicated and decodes the string field to match the possible duplicates.

I gave Justin Pihony +1 as the 1st to suggest comparing the duplicate count with the not duplicate count, and Hrant Khachatrian +1 for being the 1st to show an efficient way of doing that.
Here is a slightly different method, plus some renaming to make everything a bit more self explanatory, plus some extra columns in the query to make it obvious which records need to be compared as potential duplicates.
I would call the new column "CONFIRMED_UNIQUE" instead of "IS_NOT_DUPLICATE". Like Hrant I would make it Boolean (tinyint(1) with 0=FALSE and 1=TRUE).
The "potential_duplicate_count" is the maximum number of records that would have to be deleted.
select
group_concat(case when not confirmed_unique then id end) as potential_duplicate_ids,
group_concat(case when confirmed_unique then id end) as confirmed_unique_ids,
concat(upper(first_name), upper(last_name)) as name,
sum( case when not confirmed_unique then 1 end ) - (not max(confirmed_unique)) as potential_duplicate_count
from
users
group by
name
having
potential_duplicate_count > 0

I see someone else has been voted down for the suggestion of merging, but nothing about your problem statement says the data needs to be kept in place. The OP followed up with their solution, which happens to be a pure SQL one, but that doesn't imply that every solution needs to be limited to that.
The issue as I understand is around contacts having multiple, similar, but not necessarily identical records in your database, which has cost and reputational implications so you're looking to deduplicate these records.
I would write a batch job that searches for potential duplicates (this can be as complicated or as simple as you like) and then close the two records that it finds are dupes and create a new record.
To enable that you'd need four new columns:
Status, which would be either Open, Merged, Split
RelatedId, which would hold the value of who the record was merged with
ChainId, the new record Id
DateStatusChanged, obvious enough
Open would be the default status
Merged would be when the record is merged (effectively closed and replaced)
Split would be if the merge was reversed
So, as an example, go through all of the records that, for example, have the same name. Merge them in pairs. So if you have three Chris Bakers, records 1, 2 and 3, merge 1 and 2 to make record 4 and then 3 and 4 to make record 5. Your table would end up something like:
ID | NAME        | STATUS | RELATEDID | CHAINID | DATESTATUSCHANGED [other rows omitted]
1  | Chris Baker | MERGED | 2         | 4       | 27-AUG-2012
2  | Chris Baker | MERGED | 1         | 4       | 27-AUG-2012
3  | Chris Baker | MERGED | 4         | 5       | 28-AUG-2012
4  | Chris Baker | MERGED | 3         | 5       | 28-AUG-2012
5  | Chris Baker | OPEN   |           |         |
This way you have a full record of what has happened to your data and can reverse any changes by unmerging. If, for example, contacts 1 and 2 weren't the same, you would reverse the merge of 3 and 4, then reverse the merge of 1 and 2, and end up with this:
ID | NAME        | STATUS | RELATEDID | CHAINID | DATESTATUSCHANGED
1  | Chris Baker | SPLIT  | 2         | 4       | 29-AUG-2012
2  | Chris Baker | SPLIT  | 1         | 4       | 29-AUG-2012
3  | Chris Baker | SPLIT  | 4         | 5       | 29-AUG-2012
4  | Chris Baker | CLOSED | 3         | 5       | 29-AUG-2012
5  | Chris Baker | CLOSED |           |         | 29-AUG-2012
You could then manually merge, as you'd probably not want your job to automatically remerge split records.

Is there a good reason for not merging duplicate accounts into a single account?
From the comments, it seems like the information is being used mostly for contact information, so merging should be relatively painless and low risk. Once you merge users they will no longer appear in your duplicate report. Furthermore, your users table will actually shrink, which could help with performance.

Add an is_not_duplicate column of datatype bit to your table and, after setting its value for the verified rows, use the query below:
SELECT GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name"
FROM users
GROUP BY name
HAVING COUNT(*) > SUM(CAST(is_not_duplicate AS UNSIGNED))
The above query compares the total number of duplicate rows with the total number of valid duplicate rows.

Why don't you make the email column to be a unique identifier in this case, and after you cleanse your records once, you do not allow duplicates from there onwards?

Grouping MySQL results to only allow a record to add to the total if the last one found is over two hours old

I am currently tracking actions performed by employees in a table, which has four columns: id, user_id, action_time, and customer_id. In order to track performance, I can simply pick an employee on a date, and count the actions they've performed, easy peasy.
SELECT COUNT(DISTINCT `customer_id`) AS `action_count`
FROM `actions`
WHERE `user_id` = 1
AND DATE(`action_time`) = DATE(NOW())
However, I now wish to make it so that actions performed more than two hours apart will class as two actions towards the total. I've looked into grouping by HOUR() / 2 but an action performed at 9:59 and 10:01 will count as two, not quite what I want.
Anyone have any ideas?

You must self-JOIN the actions table, try something like this:
SELECT COUNT(DISTINCT id) FROM (
SELECT a1.id, ABS(UNIX_TIMESTAMP(a1.action_time) - UNIX_TIMESTAMP(a2.action_time))>=7200
AS action_time_diff FROM actions a1 JOIN actions a2 ON a1.user_id=a2.user_id) AS t
WHERE action_time_diff = 1
Not sure if this works, perhaps you should provide more exact details about the table structure.
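If doing part of the work outside SQL is acceptable, one possible reading of the requirement can be sketched in a few lines of Python: fetch the employee's actions for the day, walk them in time order, and count an action whenever the customer has not been seen yet or the customer's previous action is more than two hours old. The interpretation (gaps measured per customer against the immediately preceding action), the count_actions helper and the sample rows are all assumptions; sqlite3 in memory stands in for MySQL.

```python
import sqlite3
from datetime import datetime, timedelta

def count_actions(rows, gap=timedelta(hours=2)):
    """Count one action per customer, plus one more each time the gap since
    that customer's previous action exceeds `gap`."""
    last_seen = {}
    total = 0
    for customer_id, action_time in sorted(rows, key=lambda r: r[1]):
        when = datetime.strptime(action_time, "%Y-%m-%d %H:%M:%S")
        previous = last_seen.get(customer_id)
        if previous is None or when - previous > gap:
            total += 1
        last_seen[customer_id] = when
    return total

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actions (id INTEGER PRIMARY KEY, user_id INTEGER,"
    " action_time TEXT, customer_id INTEGER)"
)
conn.executemany(
    "INSERT INTO actions VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2012-08-27 09:59:00", 100),
        (2, 1, "2012-08-27 10:01:00", 100),  # two minutes later, same action
        (3, 1, "2012-08-27 13:30:00", 100),  # over two hours later, counts again
        (4, 1, "2012-08-27 10:30:00", 200),  # different customer
        (5, 2, "2012-08-27 10:00:00", 100),  # different employee, filtered out
    ],
)

rows = conn.execute(
    "SELECT customer_id, action_time FROM actions"
    " WHERE user_id = ? AND DATE(action_time) = ?",
    (1, "2012-08-27"),
).fetchall()
total = count_actions(rows)
print(total)
```

With this reading, the 9:59 and 10:01 actions on the same customer count once, which is the case that grouping by HOUR() / 2 got wrong.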
