sql query to find potential duplicate records

I am working on employer data and need to find duplicate employers based on their names.
The data looks like this:
Employer ID | Legal Name | Operating Name
------------- | ---------------| --------------------
1 | AA | AA
2 | BB | AA
3 | CC | BB
4 | DD | DD
5 | ZZ | ZZ
Now, if I search for all duplicates of employer AA, the query should return the following result:
Employer ID | Legal Name | Operating Name
------------- | ---------------| --------------------
1 | AA | AA
2 | BB | AA
3 | CC | BB
Employer 1's legal name and employer 2's operating name match the search string directly.
The catch is employer 3, which is not directly related to the search string: employer 2's legal name matches employer 3's operating name.
I need the search results up to the nth level. I am not sure whether this can be achieved with a recursive query or something like that.
Please help.
I tried to achieve this with a recursive CTE, but then I realized that it goes into infinite recursion. Here is the code:
DECLARE @SearchName VARCHAR(50)
SET @SearchName = 'AA'
;With CTE_EmployerNames
AS
(
    -- Anchor Member definition
    select *
    from [dbo].[Name_Table]
    where Leg_Name = @SearchName
       OR Op_Name = @SearchName
    UNION ALL
    -- Recursive Member definition
    select N.*
    from [dbo].[Name_Table] N
    JOIN CTE_EmployerNames C
      ON N.ID <> C.ID
     AND (N.Leg_Name = C.Leg_Name
       OR N.Leg_Name = C.Op_Name
       OR N.Op_Name = C.Leg_Name
       OR N.Op_Name = C.Op_Name)
)
select *
from CTE_EmployerNames
Update:
I created a stored procedure that does what I want, but it is a bit slow because of the loop and the cursor. For now it solves my problem at the cost of some execution time. Any suggestions for optimizing it, or another way to do this, would be highly appreciated. Thanks. Here is the code:
CREATE PROCEDURE [dbo].[Get_Similar_Name_Employers]
    @P_BaseName VARCHAR(100)
AS
BEGIN
    DECLARE @ID INT
    DECLARE @Leg_Name VARCHAR(50)
    DECLARE @Op_Name VARCHAR(50)
    -- Create temp table to hold data temporarily
    CREATE TABLE #Temp_Employers
    (
        [ID] [int] NULL,
        [Leg_Name] [varchar](50) NULL,
        [Op_Name] [varchar](50) NULL,
        [Status] [bit] null -- To keep track if that record is processed or not
    )
    -- Insert all records which are directly matching with search criteria
    INSERT INTO #Temp_Employers
    SELECT NT.ID, NT.Leg_Name, NT.Op_Name, 0
    FROM dbo.Name_Table NT
    WHERE NT.Leg_Name = @P_BaseName
       OR NT.Op_Name = @P_BaseName
    while EXISTS (SELECT 1 from #Temp_Employers where Status = 0) -- until all rows are processed
    BEGIN
        DECLARE @EmployerCursor CURSOR
        SET @EmployerCursor = CURSOR FAST_FORWARD
        FOR
        SELECT ID, Leg_Name, Op_Name
        from #Temp_Employers
        where Status = 0
        OPEN @EmployerCursor
        FETCH NEXT
        FROM @EmployerCursor
        INTO @ID, @Leg_Name, @Op_Name
        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- For every unprocessed record in temp table check if there is any possible duplicate.
            -- and insert all possible duplicate records in same table for further processing to find their possible duplicates
            INSERT INTO #Temp_Employers
            select ID, Leg_Name, Op_Name, 0
            from dbo.Name_Table
            WHERE (Leg_Name = @Leg_Name
                OR Op_Name = @Op_Name
                OR Leg_Name = @Op_Name
                OR Op_Name = @Leg_Name)
              AND ID NOT IN ( select ID
                              FROM #Temp_Employers)
            -- Update status of recently processed record to avoid processing again
            UPDATE #Temp_Employers
            SET Status = 1
            WHERE ID = @ID
            FETCH NEXT
            FROM @EmployerCursor
            INTO @ID, @Leg_Name, @Op_Name
        END
        -- close cursor and deallocate memory
        CLOSE @EmployerCursor
        DEALLOCATE @EmployerCursor
    END
    select ID,
           Leg_Name,
           Op_Name
    from #Temp_Employers
    Order By ID
    DROP TABLE #Temp_Employers
END

You are basically trying to build a directed acyclic graph in which the nodes are names, and you want to find all the names that lead to your employer.
There is an introductory tutorial at Oracle Tip: Solving directed graph problems with SQL, part 1, and a related StackOverflow question at Directed graph SQL.
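A minimal sketch of the graph idea, not taken from the linked tutorials. It uses Python's built-in sqlite3 module as a stand-in database and walks the names with a recursive CTE. The difference from the query in the question is UNION instead of UNION ALL: in SQLite that discards names that were already visited, so the recursion stops. As far as I know SQL Server only accepts UNION ALL in a recursive CTE, so treat this as an illustration of the traversal rather than a drop-in fix. The names find_related_employers and reachable are made up for the example.

```python
import sqlite3

def find_related_employers(rows, search_name):
    """Return employers reachable from search_name through shared names."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Name_Table (ID INTEGER PRIMARY KEY, Leg_Name TEXT, Op_Name TEXT)"
    )
    conn.executemany("INSERT INTO Name_Table VALUES (?, ?, ?)", rows)
    sql = """
        WITH RECURSIVE reachable(name) AS (
            -- Anchor: the name being searched for
            SELECT ?
            UNION  -- UNION (not UNION ALL) drops names already seen
            -- Recursive step: the other name of every row touching a known name
            SELECT CASE WHEN n.Leg_Name = r.name THEN n.Op_Name ELSE n.Leg_Name END
            FROM Name_Table n
            JOIN reachable r ON r.name IN (n.Leg_Name, n.Op_Name)
        )
        SELECT DISTINCT n.ID, n.Leg_Name, n.Op_Name
        FROM Name_Table n
        JOIN reachable r ON r.name IN (n.Leg_Name, n.Op_Name)
        ORDER BY n.ID
    """
    result = conn.execute(sql, (search_name,)).fetchall()
    conn.close()
    return result

# Sample data from the question
employers = [(1, "AA", "AA"), (2, "BB", "AA"), (3, "CC", "BB"),
             (4, "DD", "DD"), (5, "ZZ", "ZZ")]
print(find_related_employers(employers, "AA"))
# [(1, 'AA', 'AA'), (2, 'BB', 'AA'), (3, 'CC', 'BB')]
```

The CTE collects names rather than rows, which keeps the visited set small and makes the de-duplication by UNION sufficient to guarantee termination.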

You can do this with two self joins. I used DISTINCT to be safe - you don't need it for your example, but probably will for your actual data:
SELECT DISTINCT T2.EMPID, T2.LEGAL_NAME, T.LEGAL_NAME
FROM TABLE T
INNER JOIN TABLE T2 ON T.LEGAL_NAME = T2.OPERATING_NAME
INNER JOIN TABLE T3 ON T2.OPERATING_NAME = T3.OPERATING_NAME
WHERE T.LEGAL_NAME <> T3.LEGAL_NAME
Rename and alias tables and columns as you like.
SQL Fiddle Example
Edit - If you also want records where the op name is simply different from the legal name, UNION those in:
SELECT DISTINCT T2.EMPID, T2.LEGAL_NAME, T.LEGAL_NAME
FROM TABLE T
INNER JOIN TABLE T2 ON T.LEGAL_NAME = T2.OPERATING_NAME
INNER JOIN TABLE T3 ON T2.OPERATING_NAME = T3.OPERATING_NAME
WHERE T.LEGAL_NAME <> T3.LEGAL_NAME
UNION
SELECT EMPID, LEGAL_NAME, OP_NAME
FROM TABLE
WHERE LEGAL_NAME <> OP_NAME
SQL Fiddle Example 2

Related

Update MySQL Table from Subquery/Joined Same Table

I've seen many questions along this issue, but can't get this to work.
I want to UPDATE multiple columns in a table (but will start with one) based upon a calculated value from the same table.
It is a list of transactions per customer, per month.
TransID | Cust | Month | Value | PastValue | FutureValue
1 | 1 | 2018-01-01 | 45 |
2 | 1 | 2018-02-01 | 0 |
3 | 1 | 2018-03-01 | 35 |
4 | 1 | 2018-04-01 | 80 |
.
UPDATE tbl_transaction a
SET PrevMnthValue =
(SELECT COUNT(TransactionID) FROM tbl_transaction b WHERE b.Cust=a.Cust AND b.Month<a.Month)
But we get the dreaded 'Can't update a table using a where with a subquery of the same table' error.
I've tried to nest the subquery as this has been touted as a workaround:
UPDATE tbl_transactions a
SET
PastValue =
(
SELECT CNT FROM
(
SELECT
COUNT(TransactionID) AS CNT
FROM tbl_transactions b
WHERE
b.CustomerRef=a.CustomerRef AND b.Month<a.Month
) x
),
FutureValue =
(
SELECT CNT FROM
(
SELECT
COUNT(TransactionID) AS CNT
FROM tbl_transactions b
WHERE
b.CustomerRef=a.CustomerRef AND b.Month>a.Month
) x
)
But I get an UNKNOWN a.CustomerRef in WHERE clause. Where am I going wrong?

You can't update a table and read from it in a subquery at the same time.
The MySQL documentation says so:
You cannot update a table and select from the same table in a subquery.
First select the necessary data and save it somewhere, for example in a temporary table:
CREATE TEMPORARY TABLE IF NOT EXISTS `temp` AS (
SELECT
COUNT(`TransactionID`) AS CNT,
`CustomerRef`,
`Month`
FROM `tbl_transactions`
GROUP BY `CustomerRef`, `Month`
);
After that, you can use a JOIN in the UPDATE statement:
UPDATE `tbl_transactions` RIGHT
JOIN `temp` ON `temp`.`CustomerRef` = `tbl_transactions`.`CustomerRef`
AND `temp`.`Month` < `tbl_transactions`.`Month`
SET `tbl_transactions`.`PastValue` = `temp`.`cnt`
UPDATED: if you want to update several columns under different conditions, you can combine the temporary table, UPDATE + RIGHT JOIN and a CASE expression. For example:
UPDATE `tbl_transactions`
RIGHT JOIN `temp` ON `temp`.`CustomerRef` = `tbl_transactions`.`CustomerRef`
SET `tbl_transactions`.`PastValue` = CASE
WHEN `temp`.`Month` < `tbl_transactions`.`Month` THEN `temp`.`cnt`
ELSE `tbl_transactions`.`PastValue`
END,
`tbl_transactions`.`FutureValue` = CASE
WHEN `temp`.`Month` > `tbl_transactions`.`Month` THEN `temp`.`cnt`
ELSE `tbl_transactions`.`FutureValue`
END
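Here is a rough, runnable sketch of the same two-step shape (stage the counts, then update from the staging table). It is an assumption-laden illustration: it runs on SQLite through Python's sqlite3 module, which does not have MySQL's restriction, so it only shows the structure and the expected numbers. The staging table tmp_counts, its columns past_cnt and future_cnt, and the function fill_past_future are names invented for the example, and I read PastValue/FutureValue as "number of earlier/later transactions for the same customer", as in the question's subqueries.

```python
import sqlite3

def fill_past_future(rows):
    """rows: (TransactionID, CustomerRef, Month, Value) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE tbl_transactions (
            TransactionID INTEGER PRIMARY KEY,
            CustomerRef   INTEGER,
            Month         TEXT,
            Value         INTEGER,
            PastValue     INTEGER,
            FutureValue   INTEGER)""")
    conn.executemany(
        "INSERT INTO tbl_transactions (TransactionID, CustomerRef, Month, Value) "
        "VALUES (?, ?, ?, ?)", rows)
    # Step 1: compute both counts per transaction into a staging table.
    # A comparison evaluates to 0 or 1, so SUM() counts the matching rows.
    conn.execute("""
        CREATE TEMPORARY TABLE tmp_counts AS
        SELECT a.TransactionID,
               SUM(b.Month < a.Month) AS past_cnt,
               SUM(b.Month > a.Month) AS future_cnt
        FROM tbl_transactions a
        JOIN tbl_transactions b ON b.CustomerRef = a.CustomerRef
        GROUP BY a.TransactionID""")
    # Step 2: update the real table from the staging table only.
    conn.execute("""
        UPDATE tbl_transactions
        SET (PastValue, FutureValue) = (
            SELECT c.past_cnt, c.future_cnt
            FROM tmp_counts c
            WHERE c.TransactionID = tbl_transactions.TransactionID)""")
    result = conn.execute(
        "SELECT TransactionID, PastValue, FutureValue "
        "FROM tbl_transactions ORDER BY TransactionID").fetchall()
    conn.close()
    return result

sample = [(1, 1, "2018-01-01", 45), (2, 1, "2018-02-01", 0),
          (3, 1, "2018-03-01", 35), (4, 1, "2018-04-01", 80),
          (5, 2, "2018-01-01", 10)]
print(fill_past_future(sample))
# [(1, 0, 3), (2, 1, 2), (3, 2, 1), (4, 3, 0), (5, 0, 0)]
```

In MySQL the second step would more likely be written as an UPDATE ... JOIN against the staging table; the point of the sketch is that the counts are fully computed before the target table is written.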

You can try below
UPDATE tbl_transactions a
JOIN
( SELECT t.TransactionID, COUNT(p.TransactionID) AS CNT
FROM tbl_transactions t
LEFT JOIN tbl_transactions p
ON p.CustomerRef = t.CustomerRef AND p.Month < t.Month
GROUP BY t.TransactionID ) x
ON x.TransactionID = a.TransactionID
SET a.PastValue = x.CNT

How to write select query along with IF condition in place of column name in mysql query

SELECT DISTINCT o.receipt,
if(SELECT status FROM list WHERE receipt = o.receipt GROUP BY receipt) as status
FROM orderlist o
What is the correct way to write the above query with an IF condition so that it returns a result like the example below?
The same receipt (order ID) has more than one row, and these rows may have different status values. I want to set the value per receipt from the status of its first row, using an IF expression:
IF(status = 'shipped', "Yes", "NO");

If your data model looks like this (i.e. you have a way of distinguishing the order of events), then a LIMIT in your subquery might do.
drop table if exists t,t1;
create table t(id int);
create table t1(id int, tid int, status varchar(10));
insert into t values (1),(2);
insert into t1 values (1,1,'a'),(2,1,'Shipped'),(3,1,'Returned'), (4,2,'shipped');
select t.id,
if(
(select status from t1 where t1.tid = t.id order by id limit 1)
= 'Shipped','yes','no') shipped
from t;
Result
+------+---------+
| id | shipped |
+------+---------+
| 1 | no |
| 2 | yes |
+------+---------+
2 rows in set (0.00 sec)
But isn't shipment usually the last status?
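For anyone who wants to try the idea without a MySQL server, here is a small sketch that runs the same tables on SQLite via Python's sqlite3 module. Two things differ from the MySQL version and are my own substitutions: CASE replaces IF(), and lower() is applied because SQLite string comparison is case-sensitive by default, whereas the MySQL result above treats 'shipped' and 'Shipped' alike. The function name shipped_flags is hypothetical.

```python
import sqlite3

def shipped_flags(parent_ids, status_rows):
    """status_rows: (id, tid, status) tuples; the lowest id per tid is 'first'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER)")
    conn.execute("CREATE TABLE t1 (id INTEGER, tid INTEGER, status TEXT)")
    conn.executemany("INSERT INTO t VALUES (?)", [(p,) for p in parent_ids])
    conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", status_rows)
    sql = """
        SELECT t.id,
               CASE WHEN lower((SELECT status
                                FROM t1
                                WHERE t1.tid = t.id
                                ORDER BY t1.id   -- earliest event first
                                LIMIT 1)) = 'shipped'
                    THEN 'yes' ELSE 'no' END AS shipped
        FROM t
        ORDER BY t.id
    """
    result = conn.execute(sql).fetchall()
    conn.close()
    return result

statuses = [(1, 1, "a"), (2, 1, "Shipped"), (3, 1, "Returned"), (4, 2, "shipped")]
print(shipped_flags([1, 2], statuses))
# [(1, 'no'), (2, 'yes')]
```

Switching the subquery to ORDER BY t1.id DESC would test the last status instead of the first.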

Try this.
SELECT DISTINCT o.receipt, CASE status when 'Shipped' then 'Yes' ELSE 'No' END as status
FROM orderlist o join receipt r on o.receipt = r.receipt

Return rows with maximum date less than each value in a set of dates in SQL

Consider the following table:
CREATE TABLE foo (
id INT PRIMARY KEY,
effective_date DATETIME NOT NULL UNIQUE
)
Given a set of dates D, how do you fetch all rows from foo whose effective_date is the greatest value less than each date in D in a single query?
For simplicity, assume that each date will have exactly one matching row.
Suppose foo has the following rows.
---------------------
| id |effective_date|
---------------------
| 0 | 2013-01-07|
---------------------
| 1 | 2013-02-03|
---------------------
| 2 | 2013-04-19|
---------------------
| 3 | 2013-04-20|
---------------------
| 4 | 2013-05-11|
---------------------
| 5 | 2013-06-30|
---------------------
| 6 | 2013-12-08|
---------------------
If you were given D = {2013-02-20, 2013-06-30, 2013-12-19}, the query should return the following:
---------------------
| id |effective_date|
---------------------
| 1 | 2013-02-03|
| 4 | 2013-05-11|
| 6 | 2013-12-08|
If D had only one element, say D = {2013-06-30}, you could just do:
SELECT *
FROM foo
WHERE effective_date = (SELECT MAX(effective_date) FROM foo WHERE effective_date < '2013-06-30')
How do you generalize this query when the size of D is greater than 1, assuming D will be specified in an IN clause?

Actually, your problem is that a list of values is, in most cases, treated by MySQL as a row and not as a set. One possible solution is to generate the set properly in the application so that it looks like this:
SELECT '2013-02-20'
UNION ALL
SELECT '2013-06-30'
UNION ALL
SELECT '2013-12-19'
...and then use the resulting set inside a JOIN. It would be great if MySQL accepted a static list in ANY subqueries, as it does for the IN keyword, but it does not. ANY also expects a row set, not a list (which would be treated as a row with N columns, where N is the number of items in your list).
Fortunately, your particular case has an important restriction: the list cannot contain more items than there are rows in your foo table (it would make no sense otherwise). So you can build that list dynamically and then use it like this:
SELECT
foo.*,
final.period
FROM
(SELECT
period,
MAX(foo.effective_date) AS max_date
FROM
(SELECT
period
FROM
(SELECT
ELT(@i:=@i+1, '2013-02-20', '2013-06-30', '2013-12-19') AS period
FROM
foo
CROSS JOIN (SELECT @i:=0) AS init) AS dates
WHERE period IS NOT NULL) AS list
LEFT JOIN foo
ON foo.effective_date<list.period
GROUP BY period) AS final
LEFT JOIN foo
ON final.max_date=foo.effective_date
...your list will be iterated automatically via ELT(), so you can pass it directly to the query without any additional restructuring. Note, however, that this method iterates through all foo records to produce the row set, so it works, but building the set in the application may perform better.
The demo for your table can be found here.
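As a sketch of the "generate the set in the application" route mentioned above: the snippet below builds the date set as a VALUES list with bound parameters and joins it to foo. It runs on SQLite through Python's sqlite3 module, so it is an illustration under that assumption rather than MySQL syntax; rows_before_each and the CTE name d are made-up names, and dates are stored as ISO strings so that string order equals date order.

```python
import sqlite3

def rows_before_each(foo_rows, dates):
    """For each date, return the foo row with the greatest earlier effective_date."""
    if not dates:
        return []
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE foo (id INTEGER PRIMARY KEY, effective_date TEXT NOT NULL UNIQUE)"
    )
    conn.executemany("INSERT INTO foo VALUES (?, ?)", foo_rows)
    # One "(?)" per date turns the application-side list into a one-column row set.
    placeholders = ", ".join("(?)" for _ in dates)
    sql = f"""
        WITH d(period) AS (VALUES {placeholders})
        SELECT foo.id, foo.effective_date
        FROM d
        JOIN foo
          ON foo.effective_date = (SELECT MAX(f2.effective_date)
                                   FROM foo f2
                                   WHERE f2.effective_date < d.period)
        ORDER BY foo.effective_date
    """
    result = conn.execute(sql, list(dates)).fetchall()
    conn.close()
    return result

foo_rows = [(0, "2013-01-07"), (1, "2013-02-03"), (2, "2013-04-19"),
            (3, "2013-04-20"), (4, "2013-05-11"), (5, "2013-06-30"),
            (6, "2013-12-08")]
print(rows_before_each(foo_rows, ["2013-02-20", "2013-06-30", "2013-12-19"]))
# [(1, '2013-02-03'), (4, '2013-05-11'), (6, '2013-12-08')]
```

A date with no earlier row simply produces no output row, because the inner MAX() is NULL and the join condition fails.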

perhaps this can help :
SELECT *
FROM foo
WHERE effective_date IN
(
(SELECT MAX(effective_date) FROM foo WHERE effective_date < '2013-02-20'),
(SELECT MAX(effective_date) FROM foo WHERE effective_date < '2013-06-30'),
(SELECT MAX(effective_date) FROM foo WHERE effective_date < '2013-12-19')
)
result :
---------------------
| id |effective_date|
---------------------
| 1 | 2013-02-03| -- different
| 4 | 2013-05-11|
| 6 | 2013-12-08|
UPDATE - 06 December
create procedure :
DELIMITER $$
USE `test`$$ /*change database name*/
DROP PROCEDURE IF EXISTS `myList`$$
CREATE PROCEDURE `myList`(ilist VARCHAR(100))
BEGIN
/*var*/
/*DECLARE ilist VARCHAR(100) DEFAULT '2013-02-20,2013-06-30,2013-12-19';*/
DECLARE delimeter VARCHAR(10) DEFAULT ',';
DECLARE pos INT DEFAULT 0;
DECLARE item VARCHAR(100) DEFAULT '';
/*drop temporary table*/
DROP TABLE IF EXISTS tmpList;
/*loop*/
loop_item: LOOP
SET pos = pos + 1;
/*split*/
SET item =
REPLACE(
SUBSTRING(SUBSTRING_INDEX(ilist, delimeter, pos),
LENGTH(SUBSTRING_INDEX(ilist, delimeter, pos -1)) + 1),
delimeter, '');
/*break*/
IF item = '' THEN
LEAVE loop_item;
ELSE
/*create temporary table*/
CREATE TEMPORARY TABLE IF NOT EXISTS tmpList AS (
SELECT item AS sdate
);
END IF;
END LOOP loop_item;
/*view*/
SELECT * FROM tmpList;
END$$
DELIMITER ;
call procedure :
CALL myList('2013-02-20,2013-06-30,2013-12-19');
query :
SELECT
*,
(SELECT MAX(effective_date) FROM foo WHERE effective_date < sdate) AS effective_date
FROM tmpList
result :
------------------------------
| sdate |effective_date|
------------------------------
| 2013-02-20 | 2013-02-03 |
| 2013-06-30 | 2013-05-11 |
| 2013-12-19 | 2013-12-08 |

The bad way first (without ordered analytical functions, or rank/row_number)
sel tmp.min_effective_date, for_id.id
from
(
Sel crossed.effective_date,max(SRC.effective_date) as min_effective_date
from
foo as src
cross join
foo as crossed
where
src.effective_date < crossed.effective_date
and crossed.effective_date in
(given dates here)
group by 1
) tmp inner join foo as for_id on
tmp.effective_date =for_id.effective_date
Next, with row_number
SEL TGT.id, TGT.effective_date
FROM
(Sel id, effective_date, row_number() over(order by effective_date asc) as ordered from foo
) SRC
INNER JOIN
(Sel id, effective_date, row_number() over(order by effective_date asc) as ordered from foo) TGT
on
src.ordered - 1 = TGT.ordered
where src.effective_date in (given dates)
with ordered analytical functions:
sel f.id, tmp.eff
from foo as f inner join
(SEL ID, max(effective_date) over(order by effective_date asc ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) as eff
from foo
) TMP
on f.id = tmp.id
where f.effective_date in (given dates)
and tmp.eff is not null
the queries above assume id needs to be selected, and the ids in the source don't follow the same sequence (eg ascending) as the dates. Otherwise, you can straight away use the ordered analytical function.

sql matching users that have activities in common

so, i have a table with users and activities
users
id | activities
1 | "-2-3-4-"
2 | "-3-4-"
3 | "-1-2-3-4-"
activities
id | title
1 | running
2 | walking
3 | climbing
4 | singing
and I am trying to find, for the user with id 3, the users that have at least two activities in common with them
what I tried is this
SELECT u.id FROM users u
WHERE ( SELECT COUNT(a.id) FROM activities a
WHERE a.id IN(TRIM( ',' FROM REPLACE( u.activities, '-', ',' ) ))
AND a.id IN(1,2,3) ) >= 2
any ideas?

For the love of god, make a 3rd table that contains user_id and activity_id.
This isn't a suitable solution in any way.
You should have a table which makes the connection between users and activities, not store all activities in a row in your users table.
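To make that concrete, here is a minimal sketch of the connection table and the query it enables. It uses SQLite via Python's sqlite3 module purely so it can be run as-is; the table name user_activities, its columns, and the function users_sharing_activities are assumptions for the example, not something from the question.

```python
import sqlite3

def users_sharing_activities(links, user_id, minimum=2):
    """links: (user_id, activity_id) pairs. Returns (other_user, shared_count)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE user_activities (
            user_id     INTEGER NOT NULL,
            activity_id INTEGER NOT NULL,
            PRIMARY KEY (user_id, activity_id))""")
    conn.executemany("INSERT INTO user_activities VALUES (?, ?)", links)
    sql = """
        SELECT other.user_id, COUNT(*) AS shared
        FROM user_activities AS mine
        JOIN user_activities AS other
          ON other.activity_id = mine.activity_id   -- same activity
         AND other.user_id <> mine.user_id          -- but a different user
        WHERE mine.user_id = ?
        GROUP BY other.user_id
        HAVING COUNT(*) >= ?
        ORDER BY other.user_id
    """
    result = conn.execute(sql, (user_id, minimum)).fetchall()
    conn.close()
    return result

# The question's data: user 1 -> 2,3,4; user 2 -> 3,4; user 3 -> 1,2,3,4
links = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 1), (3, 2), (3, 3), (3, 4)]
print(users_sharing_activities(links, 3))
# [(1, 3), (2, 2)]
```

With the relationship stored one pair per row, the "at least two in common" condition becomes a plain self-join with GROUP BY and HAVING, and no string splitting is needed.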

You can first create a function that takes the users.activities string and splits it into integer activity ids, like this:
create FUNCTION dbo.SplitStringToIds (@acts nvarchar(MAX))
RETURNS @activityids TABLE (Id int)
AS
BEGIN
    DECLARE @stringToInsert nvarchar (max)
    set @stringToInsert=''
    DECLARE @intToInsert int
    set @intToInsert=0
    DECLARE @stidx int
    set @stidx=0
    DECLARE @endidx int
    set @endidx=0
    WHILE LEN(@acts) > 3
    BEGIN
        set @stidx=CHARINDEX('-', @acts, 1)
        set @acts=substring(@acts,@stidx+1,len(@acts))
        set @endidx=CHARINDEX('-', @acts, 1)-1
        set @stringToInsert=substring(@acts,1,@endidx)
        set @intToInsert=cast(@stringToInsert as int)
        INSERT INTO @activityids
        VALUES
        (
            @intToInsert
        )
    END
    -- Return the result of the function
    RETURN
END
GO
and then you can try something like this to get the users that share two or more activities with the user with id=3
select u.id,count(u.id) as ActivitiesCounter from users as u
cross apply SplitStringToIds(u.activities) as v
where v.id in (select v.id from users as u
cross apply SplitStringToIds(u.activities) as v
where u.id=3)
group by u.id having count(u.id)>=2
But I think that this way of storing relationships between tables is only going to cause you trouble, and it is better to add a relationship table if you can.

Split a MYSQL string from GROUP_CONCAT into an ( array, like, expression, list) that IN () can understand

This question follows on from MYSQL join results set wiped results during IN () in where clause?
So, short version of the question. How do you turn the string returned by GROUP_CONCAT into a comma-separated expression list that IN() will treat as a list of multiple items to loop over?
N.B. The MySQL docs appear to refer to the "( comma, separated, lists )" used by IN () as 'expression lists', and interestingly the pages on IN() seem to be more or less the only pages in the MySQL docs to ever refer to expression lists. So I'm not sure if functions intended for making arrays or temp tables would be any use here.
Long example-based version of the question: From a 2-table DB like this:
SELECT id, name, GROUP_CONCAT(tag_id) FROM person INNER JOIN tag ON person.id = tag.person_id GROUP BY person.id;
+----+------+----------------------+
| id | name | GROUP_CONCAT(tag_id) |
+----+------+----------------------+
| 1 | Bob | 1,2 |
| 2 | Jill | 2,3 |
+----+------+----------------------+
How can I turn this, which since it uses a string is treated as logical equivalent of ( 1 = X ) AND ( 2 = X )...
SELECT name, GROUP_CONCAT(tag.tag_id) FROM person LEFT JOIN tag ON person.id = tag.person_id
GROUP BY person.id HAVING ( ( 1 IN (GROUP_CONCAT(tag.tag_id) ) ) AND ( 2 IN (GROUP_CONCAT(tag.tag_id) ) ) );
Empty set (0.01 sec)
...into something where the GROUP_CONCAT result is treated as a list, so that for Bob, it would be equivalent to:
SELECT name, GROUP_CONCAT(tag.tag_id) FROM person INNER JOIN tag ON person.id = tag.person_id AND person.id = 1
GROUP BY person.id HAVING ( ( 1 IN (1,2) ) AND ( 2 IN (1,2) ) );
+------+--------------------------+
| name | GROUP_CONCAT(tag.tag_id) |
+------+--------------------------+
| Bob | 1,2 |
+------+--------------------------+
1 row in set (0.00 sec)
...and for Jill, it would be equivalent to:
SELECT name, GROUP_CONCAT(tag.tag_id) FROM person INNER JOIN tag ON person.id = tag.person_id AND person.id = 2
GROUP BY person.id HAVING ( ( 1 IN (2,3) ) AND ( 2 IN (2,3) ) );
Empty set (0.00 sec)
...so the overall result would be an exclusive search clause requiring all listed tags that doesn't use HAVING COUNT(DISTINCT ... ) ?
(note: This logic works without the AND, applying to the first character of the string. e.g.
SELECT name, GROUP_CONCAT(tag.tag_id) FROM person LEFT JOIN tag ON person.id = tag.person_id
GROUP BY person.id HAVING ( ( 2 IN (GROUP_CONCAT(tag.tag_id) ) ) );
+------+--------------------------+
| name | GROUP_CONCAT(tag.tag_id) |
+------+--------------------------+
| Jill | 2,3 |
+------+--------------------------+
1 row in set (0.00 sec)

Instead of using IN(), would using FIND_IN_SET() be an option too?
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_find-in-set
mysql> SELECT FIND_IN_SET('b','a,b,c,d');
-> 2
Here's a full example based on the example problem in the question, confirmed as tested by the asker in an earlier edit to the question:
SELECT name FROM person LEFT JOIN tag ON person.id = tag.person_id GROUP BY person.id
HAVING ( FIND_IN_SET(1, GROUP_CONCAT(tag.tag_id)) ) AND ( FIND_IN_SET(2, GROUP_CONCAT(tag.tag_id)) );
+------+
| name |
+------+
| Bob |
+------+
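If you want to experiment with this away from MySQL, the sketch below registers a rough Python emulation of FIND_IN_SET as a user-defined function in SQLite (via sqlite3) and then runs the same HAVING clause. The emulation is my own approximation of the documented behaviour (1-based position, 0 when absent, NULL when an argument is NULL); people_with_all_tags is a hypothetical helper name.

```python
import sqlite3

def find_in_set(needle, haystack):
    """Approximate MySQL FIND_IN_SET: 1-based position in a comma list, else 0."""
    if needle is None or haystack is None:
        return None
    items = str(haystack).split(",")
    needle = str(needle)
    return items.index(needle) + 1 if needle in items else 0

def people_with_all_tags(people, tags, required):
    """people: (id, name); tags: (person_id, tag_id); required: tag ids."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("FIND_IN_SET", 2, find_in_set)
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE tag (person_id INTEGER, tag_id INTEGER)")
    conn.executemany("INSERT INTO person VALUES (?, ?)", people)
    conn.executemany("INSERT INTO tag VALUES (?, ?)", tags)
    # One FIND_IN_SET test per required tag, all of which must be non-zero.
    having = " AND ".join(
        "FIND_IN_SET(?, GROUP_CONCAT(tag.tag_id))" for _ in required) or "1"
    sql = f"""
        SELECT name
        FROM person
        LEFT JOIN tag ON person.id = tag.person_id
        GROUP BY person.id
        HAVING {having}
        ORDER BY person.id
    """
    result = conn.execute(sql, list(required)).fetchall()
    conn.close()
    return result

people = [(1, "Bob"), (2, "Jill")]
tags = [(1, 1), (1, 2), (2, 2), (2, 3)]
print(people_with_all_tags(people, tags, [1, 2]))
# [('Bob',)]
```

The function compares whole list items, which is why it behaves as the question wants, unlike IN() applied to the concatenated string as a single value.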

You can pass a string as an array, using a separator, and split it inside a function that works with the results.
For a trivial example, if you have a string array like 'one|two|tree|four|five' and want to know whether two is in the array, you can do it this way:
delimiter |
create function str_in_array( split_index varchar(10), arr_str varchar(200), compares varchar(20) )
returns boolean
begin
declare resp boolean default 0;
declare arr_data varchar(20);
-- While the string is not empty
while( length( arr_str ) > 0 ) do
-- if the split index is in the string
if( locate( split_index, arr_str ) ) then
-- get the last data in the string
set arr_data = ( select substring_index(arr_str, split_index, -1) );
-- remove the last data in the string
set arr_str = ( select
replace(arr_str,
concat(split_index,
substring_index(arr_str, split_index, -1)
)
,'')
);
-- if the split index is not in the string
else
-- get the unique data in the string
set arr_data = arr_str;
-- empties the string
set arr_str = '';
end if;
-- in this trivial example, it returns if a string is in the array
if arr_data = compares then
set resp = 1;
end if;
end while;
return resp;
end
|
delimiter ;
I want to create a set of useful MySQL functions to work with this method. Anyone interested, please contact me.
For more examples, visit http://blog.idealmind.com.br/mysql/how-to-use-string-as-array-in-mysql-and-work-with/
