SQL: Detect duplicate customers - mysql

I'm trying to create a SQL query that will detect (possible) duplicate customers in my database.
I have two tables:
Customer with the columns: cid, firstname, lastname, zip. Note that cid is the unique customer id and primary key for this table.
IgnoreForDuplicateCustomer with the columns: cid1, cid2. Both columns are foreign keys that reference Customer(cid). This table is used to record that the customer with cid1 is not the same as the customer with cid2.
For example, suppose I have
a Customer entry with cid = 1, firstname="foo", lastname="anonymous" and zip="11231"
and another Customer entry with cid = 2, firstname="foo", lastname="anonymous" and zip="11231".
My SQL query should search for customers that have the same firstname, lastname and zip, and then detect that the customer with cid = 1 is the same as the customer with cid = 2.
However, it should be possible to say that customers cid = 1 and cid = 2 are not the same by storing a new entry in the IgnoreForDuplicateCustomer table with cid1 = 1 and cid2 = 2.
Detecting the duplicate customers works well with this SQL query:
SELECT cid, firstname, lastname, zip, COUNT(*) AS NumOccurrences
FROM Customer
GROUP BY firstname, lastname, zip
HAVING ( COUNT(*) > 1 )
My problem is that I am not able to integrate the IgnoreForDuplicateCustomer table into that query, so that, as in my previous example, the customers with cid = 1 and cid = 2 are not reported as the same, since there is an entry/rule in the IgnoreForDuplicateCustomer table.
So I tried to extend my previous query by adding a WHERE clause:
SELECT cid, firstname, lastname, COUNT(*) AS NumOccurrences
FROM Customer
WHERE cid NOT IN (
SELECT cid1 FROM IgnoreForDuplicateCustomer WHERE cid2=cid
UNION
SELECT cid2 FROM IgnoreForDuplicateCustomer WHERE cid1=cid
)
GROUP BY firstname, lastname, zip
HAVING ( COUNT(*) > 1 )
Unfortunately this additional WHERE clause has absolutely no impact on my result.
Any suggestions?

Here you are:
Select a.*
From (
select c1.cid 'CID1', c2.cid 'CID2'
from Customer c1
join Customer c2 on c1.firstname=c2.firstname
and c1.lastname=c2.lastname and c1.zip=c2.zip
and c1.cid < c2.cid) a
Left Join (
Select cid1 'CID1', cid2 'CID2'
From ignoreforduplicatecustomer one
Union
Select cid2 'CID1', cid1 'CID2'
From ignoreforduplicatecustomer two) b on a.cid1 = b.cid1 and a.cid2 = b.cid2
where b.cid1 is null
This will get you the IDs of duplicate records from customer table, which are not in table ignoreforduplicatecustomer.
Tested with:
CREATE TABLE IF NOT EXISTS `customer` (
`CID` int(11) NOT NULL AUTO_INCREMENT,
`Firstname` varchar(50) NOT NULL,
`Lastname` varchar(50) NOT NULL,
`ZIP` varchar(10) NOT NULL,
PRIMARY KEY (`CID`))
ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=100 ;
INSERT INTO `customer` (`CID`, `Firstname`, `Lastname`, `ZIP`) VALUES
(1, 'John', 'Smith', '1234'),
(2, 'John', 'Smith', '1234'),
(3, 'John', 'Smith', '1234'),
(4, 'Jane', 'Doe', '1234');
And:
CREATE TABLE IF NOT EXISTS `ignoreforduplicatecustomer` (
`CID1` int(11) NOT NULL,
`CID2` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `ignoreforduplicatecustomer` (`CID1`, `CID2`) VALUES
(1, 2);
Results for my test setup are:
CID1 CID2
1 3
2 3
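The same check can be reproduced outside MySQL. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for MySQL, and using NOT EXISTS in place of the LEFT JOIN ... IS NULL from the answer (both are anti-joins and should give the same pairs here). The variable name duplicate_pairs is just for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (
        cid INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, zip TEXT
    );
    CREATE TABLE IgnoreForDuplicateCustomer (cid1 INTEGER, cid2 INTEGER);
    INSERT INTO Customer VALUES
        (1, 'John', 'Smith', '1234'),
        (2, 'John', 'Smith', '1234'),
        (3, 'John', 'Smith', '1234'),
        (4, 'Jane', 'Doe',   '1234');
    INSERT INTO IgnoreForDuplicateCustomer VALUES (1, 2);
""")

# Self-join to pair up customers with identical name and zip (c1.cid < c2.cid
# keeps each pair once), then drop pairs listed in the ignore table in either order.
duplicate_pairs = conn.execute("""
    SELECT c1.cid, c2.cid
    FROM Customer c1
    JOIN Customer c2
      ON c1.firstname = c2.firstname
     AND c1.lastname  = c2.lastname
     AND c1.zip       = c2.zip
     AND c1.cid < c2.cid
    WHERE NOT EXISTS (
        SELECT 1
        FROM IgnoreForDuplicateCustomer i
        WHERE (i.cid1 = c1.cid AND i.cid2 = c2.cid)
           OR (i.cid1 = c2.cid AND i.cid2 = c1.cid)
    )
    ORDER BY c1.cid, c2.cid
""").fetchall()

print(duplicate_pairs)  # [(1, 3), (2, 3)]
```

The pair (1, 2) is suppressed by the ignore entry, while (1, 3) and (2, 3) are still reported, matching the test results above.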

Edit as per TPete's comment (didn't try it):
SELECT
C1.cid, C1.firstname, C1.lastname
FROM
Customer C1,
Customer C2
WHERE
C1.cid < C2.cid AND
C1.firstname = C2.firstname AND
C1.lastname = C2.lastname AND
C1.zip = C2.zip AND
NOT EXISTS
(SELECT 1 FROM IgnoreForDuplicateCustomer I WHERE (I.cid1 = C1.cid AND I.cid2 = C2.cid) OR (I.cid1 = C2.cid AND I.cid2 = C1.cid));
Initially I thought that IgnoreForDuplicateCustomer was a field in the customer table.

Crazy, but I think it works :)
First I join the Customer table with itself on the names to get the duplicates.
Then I exclude the keys in the IgnoreForDuplicateCustomer table (the UNION is there because the first query returns both cid1, cid2 and cid2, cid1).
Each pair will appear twice in the result, but I think you can get the info you need.
select c1.cid, c2.cid
from Customer c1
join Customer c2 on c1.firstname=c2.firstname
and c1.lastname=c2.lastname and c1.zip=c2.zip
and c1.cid!=c2.cid
except
(
select cid1,cid2 from IgnoreForDuplicateCustomer
UNION
select cid2,cid1 from IgnoreForDuplicateCustomer
)
second shot:
select firstname,lastname,zip from Customer
group by firstname,lastname,zip
having (count(*)>1)
except
select c1.firstname, c1.lastname, c1.zip
from Customer c1 join IgnoreForDuplicateCustomer ig on c1.cid=ig.cid1 join Customer c2 on ig.cid2=c2.cid
third:
select firstname,lastname,zip from (
select firstname,lastname,zip from Customer
group by firstname,lastname,zip
having (count(*)>1)
) X
where firstname not in (
select c1.firstname
from Customer c1 join IgnoreForDuplicateCustomer ig on c1.cid=ig.cid1 join Customer c2 on ig.cid2=c2.cid
)

Related

Write a SQL query to find and display a customer who made 2 consecutive orders in same category?

Question: Write a SQL query to find and display a customer who made 2 consecutive orders in the same category?
I am struggling with the answer. Any help would be appreciated.
Queries:
CREATE TABLE customers (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
name TEXT,
email TEXT);
CREATE TABLE orders (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
customer_id INTEGER,
item TEXT,
price REAL,
ORDER_DATE DATETIME,
category TEXT);
INSERT INTO customers (name, email) VALUES ("Doctor Who", "doctorwho@timelords.com");
INSERT INTO customers (name, email) VALUES ("Harry Potter", "harry@potter.com");
INSERT INTO customers (name, email) VALUES ("Captain Awesome", "captain@awesome.com");
INSERT INTO orders (customer_id, item, price,ORDER_DATE,category)
VALUES (1, "Sonic Screwdriver", 1000.00,'21-04-15 09.00.00','tools');
INSERT INTO orders (customer_id, item, price,ORDER_DATE,category)
VALUES (1, "Light", 1000.00,'21-10-15 09.00.00','tools');
INSERT INTO orders (customer_id, item, price,ORDER_DATE,category)
VALUES (2, "High Quality Broomstick", 40.00,'20-12-20 09.00.00','cleaner');
INSERT INTO orders (customer_id, item, price,ORDER_DATE,category)
VALUES (3, "TARDIS", 1000000.00,'21-01-20 09.00.00','other');
Step 1:
First, add a foreign key on the customer id column of the orders table, then join the customers and orders tables.
Step 2:
After joining both tables, run this query and you will get your result.
SELECT DISTINCT orders.category , customers.id,customers.name,customers.email FROM customers JOIN orders ON customers.id= orders.customer_id WHERE orders.category in ( select category from orders group by category having count(*) >= 2 )
You can also solve it by using LEAD. Get lead_category and lead_customers_id, then filter on category and customer id:
select * from
(SELECT orders.category , orders.item, customers.id,customers.name,customers.email,
LEAD(category) OVER (ORDER BY customers.id ASC, orders.ORDER_DATE ASC) AS lead_category,
LEAD(customers.id) OVER (ORDER BY customers.id ASC, orders.ORDER_DATE ASC) AS lead_customers_id
FROM orders
JOIN
customers ON
orders.customer_id = customers.id) AS T
where category = lead_category and id = lead_customers_id
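A variant of the LEAD idea can be sketched as follows, assuming Python's built-in sqlite3 module (with an SQLite build of 3.25 or newer, which is where window functions appeared) in place of MySQL 8. The sketch partitions by customer and orders by date, so each order is compared only with that customer's next order. Dates are written in ISO format here, and the names next_category and repeat_customers are illustrative.

```python
import sqlite3

# In-memory SQLite database; window functions need SQLite 3.25+ (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id INTEGER,
        category TEXT,
        order_date TEXT
    );
    INSERT INTO orders (customer_id, category, order_date) VALUES
        (1, 'tools',   '2021-04-15 09:00:00'),
        (1, 'tools',   '2021-10-15 09:00:00'),
        (2, 'cleaner', '2020-12-20 09:00:00'),
        (3, 'other',   '2021-01-20 09:00:00');
""")

# For every order, look at the category of the same customer's next order.
# A customer qualifies when an order and its successor share a category.
repeat_customers = [row[0] for row in conn.execute("""
    SELECT DISTINCT customer_id
    FROM (
        SELECT customer_id,
               category,
               LEAD(category) OVER (
                   PARTITION BY customer_id
                   ORDER BY order_date
               ) AS next_category
        FROM orders
    ) AS t
    WHERE category = next_category
    ORDER BY customer_id
""")]

print(repeat_customers)  # [1]
```

With PARTITION BY, the separate lead_customers_id comparison is no longer needed, because the window never crosses from one customer to the next.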
SELECT DISTINCT t1.customer_id
FROM orders t1
JOIN orders t2 USING (customer_id, category)
WHERE t1.ORDER_DATE < t2.ORDER_DATE
AND NOT EXISTS ( SELECT NULL
FROM orders t3
WHERE t1.customer_id = t3.customer_id
AND t1.category != t3.category
AND t1.ORDER_DATE < t3.ORDER_DATE
AND t3.ORDER_DATE < t2.ORDER_DATE )
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=a26be6164e027b1a4b0aa9a736764da3
I.e., we simply search for a pair of orders for the same customer and category where no order from the same customer in another category exists between them.
Join customers table if needed.
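To see how the "nothing in between" condition behaves, here is a small sketch, assuming Python's built-in sqlite3 module instead of MySQL and a few hypothetical rows: customer 1 orders tools twice in a row, while customer 2 orders cleaner, other, cleaner, so the two cleaner orders are not consecutive. The join uses an explicit ON clause rather than USING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id INTEGER,
        category TEXT,
        order_date TEXT
    );
    -- Hypothetical sample rows for this sketch (ISO dates).
    INSERT INTO orders (customer_id, category, order_date) VALUES
        (1, 'tools',   '2021-04-15 09:00:00'),
        (1, 'tools',   '2021-10-15 09:00:00'),
        (2, 'cleaner', '2020-12-20 09:00:00'),
        (2, 'other',   '2021-01-20 09:00:00'),
        (2, 'cleaner', '2021-02-20 09:00:00');
""")

# Pair up two orders of the same customer and category (t1 before t2), then
# reject the pair if an order in a different category lies between them.
consecutive_customers = [row[0] for row in conn.execute("""
    SELECT DISTINCT t1.customer_id
    FROM orders t1
    JOIN orders t2
      ON t1.customer_id = t2.customer_id
     AND t1.category = t2.category
    WHERE t1.order_date < t2.order_date
      AND NOT EXISTS (
          SELECT 1
          FROM orders t3
          WHERE t3.customer_id = t1.customer_id
            AND t3.category != t1.category
            AND t1.order_date < t3.order_date
            AND t3.order_date < t2.order_date
      )
    ORDER BY t1.customer_id
""")]

print(consecutive_customers)  # [1]
```

Customer 2 is excluded because the 'other' order falls between the two 'cleaner' orders.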

Search Column after LEFT JOIN

Currently I have two tables.
Customers:
id name status
1 adam 1
2 bob 1
3 cain 2
Orders:
customer_id item
1 apple
1 banana
1 bonbon
2 carrot
3 egg
I'm trying to do an INNER JOIN first then use the resulting table to query against.
So a user can type in a partial name or partial item and get all the names and items.
For example, if a user types in "b" it would kick back:
customer_id name status items
1 adam 1 apple/banana/bonbon
2 bob 1 carrot
What I am currently doing is:
SELECT * FROM(
SELECT customers.* , GROUP_CONCAT(orders.item SEPARATOR '|') as items
FROM customers
LEFT JOIN orders
ON customers.id = orders.customer_id
group by customers.id
) as t
WHERE t.status = 1 AND ( t.name LIKE "%b%" OR t.items LIKE "%b%")
Which does work, but it is incredibly slow (+2 seconds).
The strange part though is if I run the queries individually the subquery executes in .0004 seconds and the outer query executes in .006 seconds.
But for some reason combining them increases the wait time a lot.
Is there a more efficient way to do this?
CREATE TABLE IF NOT EXISTS `customers` (
`id` int(6),
`name` varchar(255) ,
`status` int(6),
PRIMARY KEY (`id`,`name`,`status`)
);
INSERT INTO `customers` (`id`, `name` , `status`) VALUES
('1', 'Adam' , 1),
('2', 'bob' , 1),
('3', 'cain' , 2);
CREATE TABLE IF NOT EXISTS `orders` (
`customer_id` int(6),
`item` varchar(255) ,
PRIMARY KEY (`customer_id`,`item`)
);
INSERT INTO `orders` (`customer_id`, `item`) VALUES
('1', 'apple'),
('1', 'banana'),
('1', 'bonbon'),
('2', 'carrot'),
('3', 'egg');
According to the query, you are trying to perform a full-text search on the fields name and item. I would suggest adding full-text indexes to them using ngram tokenisation as you are looking up by part of a word:
ALTER TABLE customers ADD FULLTEXT INDEX ft_idx_name (name) WITH PARSER ngram;
ALTER TABLE orders ADD FULLTEXT INDEX ft_idx_item (item) WITH PARSER ngram;
In this case, your query would look as follows:
SELECT
customers.*, GROUP_CONCAT(orders.item SEPARATOR '|')
FROM
customers
LEFT JOIN orders on customers.id = orders.customer_id
WHERE
orders.customer_id IS NOT NULL
AND customers.status = 1
AND (MATCH(customers.name) AGAINST('bo')
OR MATCH(orders.item) AGAINST('bo'))
GROUP BY
customers.id
If needed, you could modify ngram_token_size MySQL system variable as its value is 2 by default, which means two or more characters should be input to perform the search.
Another approach is to implement it by means of a dedicated search engine, e.g. Elasticsearch, when requirements evolve.
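For checking the expected result on the sample data, here is a minimal sketch, assuming Python's built-in sqlite3 module rather than MySQL. It also swaps in a different technique from the answers here: the filter runs before aggregation, with an EXISTS subquery for the item match, instead of applying LIKE to the concatenated string. Whether this is faster in MySQL is untested and would need measuring. The search function and its term parameter are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, status INTEGER);
    CREATE TABLE orders (
        customer_id INTEGER, item TEXT, PRIMARY KEY (customer_id, item)
    );
    INSERT INTO customers VALUES (1, 'adam', 1), (2, 'bob', 1), (3, 'cain', 2);
    INSERT INTO orders VALUES
        (1, 'apple'), (1, 'banana'), (1, 'bonbon'), (2, 'carrot'), (3, 'egg');
""")

def search(term):
    """Return active customers whose name or any item contains `term`."""
    pattern = f"%{term}%"
    return conn.execute("""
        SELECT c.id, c.name, c.status, GROUP_CONCAT(o.item, '|') AS items
        FROM customers c
        LEFT JOIN orders o ON o.customer_id = c.id
        WHERE c.status = 1
          AND (c.name LIKE ?
               OR EXISTS (SELECT 1 FROM orders m
                          WHERE m.customer_id = c.id AND m.item LIKE ?))
        GROUP BY c.id
        ORDER BY c.id
    """, (pattern, pattern)).fetchall()

results = search("b")
for row in results:
    print(row)
```

For the term "b" this returns adam (matched through banana and bonbon) and bob (matched by name), each with all of their items; cain is filtered out by status. The order of items inside the concatenated string is not guaranteed.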
SELECT * FROM(
SELECT customers.* , GROUP_CONCAT(orders.item SEPARATOR '|') as items
FROM customers
LEFT JOIN orders
ON customers.id = orders.customer_id AND customers.name LIKE "%adam" AND orders.item LIKE "%b"
group by customers.id
) as t
It will be faster to filter the records in the join condition of the LEFT JOIN.

Delete all duplicate rows in mysql

I have MySQL data that was imported from a CSV file and contains multiple duplicate records.
I picked all non-duplicates using DISTINCT.
Now I need to delete all duplicates using a SQL command.
Note: I don't need any of the duplicates; I just need to fetch only the non-duplicates.
Thanks.
For example, if the number 0123332546666 is repeated 11 times, I want to delete all of them.
Mysql table format
ID, PhoneNumber
Just COUNT the number of duplicates (with GROUP BY) and filter by HAVING. Then supply the query result to DELETE statement:
DELETE FROM Table1 WHERE PhoneNumber IN (SELECT a.PhoneNumber FROM (
SELECT COUNT(*) AS cnt, PhoneNumber FROM Table1 GROUP BY PhoneNumber HAVING cnt>1
) AS a);
http://sqlfiddle.com/#!9/a012d21/1
complete fiddle:
schema:
CREATE TABLE Table1
(`ID` int, `PhoneNumber` int)
;
INSERT INTO Table1
(`ID`, `PhoneNumber`)
VALUES
(1, 888),
(2, 888),
(3, 888),
(4, 889),
(5, 889),
(6, 111),
(7, 222),
(8, 333),
(9, 444)
;
delete query:
DELETE FROM Table1 WHERE PhoneNumber IN (SELECT a.PhoneNumber FROM (
SELECT COUNT(*) AS cnt, PhoneNumber FROM Table1 GROUP BY PhoneNumber HAVING cnt>1
) AS a);
You could try a left join against a subquery that finds the min id for each phone number, and delete the rows that do not match:
delete m
from m_table m
left join (
select min(id) as id, PhoneNumber
from m_table
group by PhoneNumber
) t on t.id = m.id
where t.PhoneNumber is null
Otherwise, if you want to delete all the duplicates without keeping even a single row, you could use:
delete m
from m_table m
INNER join (
select PhoneNumber
from m_table
group by PhoneNumber
having count(*) > 1
) t on t.PhoneNumber= m.PhoneNumber
Instead of deleting from the table, I would suggest creating a new one:
create table table2 as
select min(id) as id, phonenumber
from table1
group by phonenumber
having count(*) = 1;
Why? Deleting rows has a lot of overhead. If you are bringing the data in from an external source, then treat the first landing table as a staging table and the second as the final table.
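The two deletion strategies discussed above (remove every row whose number is duplicated, versus keep one row per number) can be compared side by side. This is a minimal sketch, assuming Python's built-in sqlite3 module in place of MySQL and the fiddle's sample rows. The helper remaining_ids is a name made up for this example, and the DELETE statements are written in a form SQLite accepts, without the extra derived table MySQL needs.

```python
import sqlite3

# Sample rows from the fiddle: 888 appears three times, 889 twice.
ROWS = [(1, 888), (2, 888), (3, 888), (4, 889), (5, 889),
        (6, 111), (7, 222), (8, 333), (9, 444)]

def remaining_ids(delete_sql):
    """Load the sample rows, run one DELETE, and return the surviving IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Table1 (ID INTEGER, PhoneNumber INTEGER)")
    conn.executemany("INSERT INTO Table1 VALUES (?, ?)", ROWS)
    conn.execute(delete_sql)
    ids = [row[0] for row in conn.execute("SELECT ID FROM Table1 ORDER BY ID")]
    conn.close()
    return ids

# Strategy 1: drop every row whose phone number occurs more than once.
drop_all = remaining_ids("""
    DELETE FROM Table1
    WHERE PhoneNumber IN (
        SELECT PhoneNumber FROM Table1
        GROUP BY PhoneNumber HAVING COUNT(*) > 1
    )
""")

# Strategy 2: keep the row with the lowest ID for each phone number.
keep_first = remaining_ids("""
    DELETE FROM Table1
    WHERE ID NOT IN (SELECT MIN(ID) FROM Table1 GROUP BY PhoneNumber)
""")

print(drop_all)    # [6, 7, 8, 9]
print(keep_first)  # [1, 4, 6, 7, 8, 9]
```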

Get all Items attached to sellerId - SQL

When I execute my query I get just 1 item back for the sellerId instead of 2. Does anyone know how I can express the following?
Select the name of the item and of the reseller for each item that belongs to a reseller with a rating higher than 4.
Current Query:
SELECT items.name, sellers.name
FROM items
inner JOIN sellers
on items.id=sellers.id
WHERE rating > 4
ORDER BY sellerId
The query for tables inc. data:
CREATE TABLE sellers (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(30) NOT NULL,
rating INTEGER NOT NULL
);
CREATE TABLE items (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(30) NOT NULL,
sellerId INTEGER REFERENCES sellers(id)
);
INSERT INTO sellers(id, name, rating) values(1, 'Roger', 3);
INSERT INTO sellers(id, name, rating) values(2, 'Penny', 5);
INSERT INTO items(id, name, sellerId) values(1, 'Notebook', 2);
INSERT INTO items(id, name, sellerId) values(2, 'Stapler', 1);
INSERT INTO items(id, name, sellerId) values(3, 'Pencil', 2);
You've got the wrong join, here's a corrected query;
SELECT items.name, sellers.name
FROM items
inner JOIN sellers
on items.sellerId=sellers.id
WHERE rating > 4
ORDER BY sellerId
You're joining on id = id, you want sellerid = id
Notice in your table definition that items.sellerId is the field that joins to sellers.id
CREATE TABLE items (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(30) NOT NULL,
sellerId INTEGER REFERENCES sellers(id)
);
You need to join on the correct column:
SELECT i.name, s.name
FROM items i INNER JOIN
sellers s
ON i.sellerid = s.id
----------^
WHERE rating > 4
ORDER BY i.sellerId
Note that I also introduced table aliases and qualified column names. These make a query easier to write and to read.
SELECT items.name, sellers.name
FROM items, sellers
WHERE items.sellerId = sellers.id and sellers.rating>4;
Here is the right query:
SELECT items.name as items, sellers.name as sellers
FROM sellers
INNER JOIN items
ON (sellers.id = items.sellerid)
WHERE sellers.rating > 4
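The effect of the wrong join column is easy to see on the sample data. Below is a minimal sketch, assuming Python's built-in sqlite3 module as the database; the variable names wrong_join and right_join are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sellers (
        id INTEGER NOT NULL PRIMARY KEY,
        name VARCHAR(30) NOT NULL,
        rating INTEGER NOT NULL
    );
    CREATE TABLE items (
        id INTEGER NOT NULL PRIMARY KEY,
        name VARCHAR(30) NOT NULL,
        sellerId INTEGER REFERENCES sellers(id)
    );
    INSERT INTO sellers VALUES (1, 'Roger', 3), (2, 'Penny', 5);
    INSERT INTO items VALUES
        (1, 'Notebook', 2), (2, 'Stapler', 1), (3, 'Pencil', 2);
""")

# Joining items.id to sellers.id pairs item 2 (Stapler) with seller 2 (Penny),
# which is why only one, unrelated, row came back.
wrong_join = conn.execute("""
    SELECT items.name, sellers.name
    FROM items
    INNER JOIN sellers ON items.id = sellers.id
    WHERE sellers.rating > 4
    ORDER BY items.id
""").fetchall()

# Joining on the foreign key returns both of Penny's items.
right_join = conn.execute("""
    SELECT items.name, sellers.name
    FROM items
    INNER JOIN sellers ON items.sellerId = sellers.id
    WHERE sellers.rating > 4
    ORDER BY items.id
""").fetchall()

print(wrong_join)  # [('Stapler', 'Penny')]
print(right_join)  # [('Notebook', 'Penny'), ('Pencil', 'Penny')]
```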

SQL Server : INNER JOIN returning incorrect row

I have two tables Person and Table1. I want to join the person table to table1 where the foodId is 2.
However, when I do the inner join it is only joining the record where foodId is 1.
Person:
id fName lName
1 John Smith
Table1:
id personId foodId date
1 1 1 2014-10-28
2 1 2 2014-10-28
The query I tried is:
SELECT *
FROM Person p
INNER JOIN Table1 t ON p.id = t.personId AND foodId = 2
I also tried:
SELECT *
FROM Person p
INNER JOIN Table1 t ON p.id = t.personId
WHERE t.foodId = 2
Both of those queries show empty results.
Any suggestions would be appreciated!
(SQLFiddle seems to be dead ... can't connect - therefore, posted here in all its glory...)
Setup as shown in original question:
DECLARE @Person TABLE (id INT, fName VARCHAR(20), lName VARCHAR(50))
INSERT INTO @Person
(id, fName, lName)
VALUES
(1, -- id - int
'John', -- fName - varchar(20)
'Smith' -- lName - varchar(50)
)
DECLARE @Table1 TABLE (id INT, PersonId INT, FoodID INT, T1Date DATE)
INSERT INTO @Table1
(id, PersonId, FoodID, T1Date)
VALUES
(1, -- id - int
1, -- PersonId - int
1, -- FoodID - int
'20141028' -- T1Date - date
), (2, 1, 2, '20141027')
Query #2 shown in original question:
SELECT *
FROM @Person p
INNER JOIN @Table1 t ON p.id = t.personId
WHERE t.foodId = 2
There must be something else going on here - or you've oversimplified the setup to make it not work anymore. But your query #2 DOES return one row - the one you would expect.
Maybe you need left join rather than inner join.
SELECT * FROM Table1 t LEFT JOIN Person p ON p.id = t.personId WHERE t.foodId = 2
When you do Inner Join, only records which are present in both tables will be shown.
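The difference between the two join types can be illustrated with a small sketch, assuming Python's built-in sqlite3 module instead of SQL Server. The second person (Jane Doe) is a hypothetical row added only to show what LEFT JOIN keeps, and inner_rows and left_rows are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (id INTEGER, fName TEXT, lName TEXT);
    CREATE TABLE Table1 (id INTEGER, personId INTEGER, foodId INTEGER, T1Date TEXT);
    -- Jane Doe is a hypothetical extra person with no rows in Table1.
    INSERT INTO Person VALUES (1, 'John', 'Smith'), (2, 'Jane', 'Doe');
    INSERT INTO Table1 VALUES
        (1, 1, 1, '2014-10-28'),
        (2, 1, 2, '2014-10-28');
""")

# INNER JOIN: only people with a matching foodId = 2 row are returned.
inner_rows = conn.execute("""
    SELECT p.id, p.fName, t.foodId
    FROM Person p
    INNER JOIN Table1 t ON p.id = t.personId
    WHERE t.foodId = 2
    ORDER BY p.id
""").fetchall()

# LEFT JOIN with the filter in the ON clause: every person is kept, and
# people without a matching row get NULL (None in Python) for foodId.
left_rows = conn.execute("""
    SELECT p.id, p.fName, t.foodId
    FROM Person p
    LEFT JOIN Table1 t ON p.id = t.personId AND t.foodId = 2
    ORDER BY p.id
""").fetchall()

print(inner_rows)  # [(1, 'John', 2)]
print(left_rows)   # [(1, 'John', 2), (2, 'Jane', None)]
```

On the data from the question, the inner join does return John's foodId = 2 row, which is consistent with the earlier answer that something else must be going on in the original setup.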