Delete duplicates in MySQL table

I am trying to write my first MySQL query. I need to delete rows that share the same article_number value. I wrote this query:
SELECT
article_number, COUNT(*)
FROM
article_numbers
GROUP BY
article_number
HAVING
COUNT(*) > 1
It shows me all the article numbers that are duplicated. But how can I delete all but one row for each duplicate?
Thanks
EDIT:
I tried this query:
delete article_numbers from article_numbers inner join
(select article_number
from article_numbers
group by article_number
having count(1) > 1) as duplicates
on (duplicates.article_number = article_numbers.article_number)
but it gives me this error:
Cannot delete or update a parent row: a foreign key constraint fails (api.products, CONSTRAINT products_article_number_id_foreign FOREIGN KEY (article_number_id) REFERENCES article_numbers (id))
EDIT 2:
I disabled the foreign key checks temporarily, and now my delete query works. But how can I modify it so that one of the duplicate rows is kept?

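Use a self join (written below as an implicit cross join filtered in the WHERE clause). It keeps the row with the lowest id in each group.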
Query
delete t1
from article_numbers t1,
article_numbers t2
where t1.id > t2.id
and t1.article_number = t2.article_number;

I use a rather simple query to remove dupes (this is SQL Server syntax; MySQL does not allow deleting through a CTE):
;WITH DEDUPE AS (
SELECT ROW_NUMBER() OVER(
PARTITION BY article_number
ORDER BY (SELECT 1)) AS RN
FROM article_numbers)
DELETE FROM DEDUPE
WHERE RN != 1
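
For anyone who wants to try the ROW_NUMBER() idea without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in (assumption: the bundled SQLite is 3.25 or newer, which is when window functions arrived). Deleting through the CTE itself is not something I would count on outside SQL Server, so the row numbers are computed in a derived table and the delete goes by id. The sample rows and helper names are made up for illustration.

```python
import sqlite3


def dedupe_with_row_number(conn):
    """Delete every row that is not the first (lowest id) in its article_number group."""
    conn.execute(
        """
        DELETE FROM article_numbers
        WHERE id IN (
            SELECT id FROM (
                SELECT id,
                       ROW_NUMBER() OVER (
                           PARTITION BY article_number ORDER BY id
                       ) AS rn
                FROM article_numbers
            ) AS ranked
            WHERE rn > 1
        )
        """
    )


def demo_row_number():
    # Hypothetical sample data: "A" appears three times, "B" twice, "C" once.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE article_numbers (id INTEGER PRIMARY KEY, article_number TEXT)"
    )
    conn.executemany(
        "INSERT INTO article_numbers VALUES (?, ?)",
        [(1, "A"), (2, "A"), (3, "B"), (4, "A"), (5, "B"), (6, "C")],
    )
    dedupe_with_row_number(conn)
    return conn.execute(
        "SELECT id, article_number FROM article_numbers ORDER BY id"
    ).fetchall()
```

Ordering by id inside the window (rather than by a constant) makes it predictable which row of each group survives.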

Delete c
from (select *, row_number() over(partition by article_number order by id) as r from article_numbers) c
where c.r != 1

Delete a row if same article_number but higher id exists:
delete from article_numbers t1
where exists (select 1 from article_numbers t2
where t2.article_number = t1.article_number
and t2.id > t1.id)
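This is core ANSI SQL, so it should work on SQL Server. MySQL, however, tends to reject a subquery that reads the table being deleted from (error 1093), so a join-based delete is the safer choice there.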
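
The EXISTS rule itself is easy to check in isolation. Below is a small sketch that runs it on SQLite through Python's sqlite3 module (a stand-in, not MySQL; the sample rows and function name are assumptions). Note that this variant keeps the highest id in each group, since a row is deleted whenever a higher id with the same article_number exists.

```python
import sqlite3


def demo_exists_delete():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE article_numbers (id INTEGER PRIMARY KEY, article_number TEXT)"
    )
    conn.executemany(
        "INSERT INTO article_numbers VALUES (?, ?)",
        [(1, "A"), (2, "A"), (3, "B"), (4, "A"), (5, "B"), (6, "C")],
    )
    # A row goes away if some other row has the same article_number and a higher id.
    conn.execute(
        """
        DELETE FROM article_numbers
        WHERE EXISTS (
            SELECT 1
            FROM article_numbers AS t2
            WHERE t2.article_number = article_numbers.article_number
              AND t2.id > article_numbers.id
        )
        """
    )
    return conn.execute(
        "SELECT id, article_number FROM article_numbers ORDER BY id"
    ).fetchall()
```

Flipping the comparison to `t2.id < article_numbers.id` would keep the lowest id instead.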

I think this would help:
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1

I modified my query and I think it works now:
SET FOREIGN_KEY_CHECKS=0;
delete article_numbers from article_numbers inner join
(select min(id) minid, article_number
from article_numbers
group by article_number
having count(1) > 1) as duplicates
on (duplicates.article_number = article_numbers.article_number and duplicates.minid <> article_numbers.id)
But it seems very complex. I will check @Ullas's method to see if it works, too.
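
One caveat with switching off foreign key checks: any products rows that pointed at a deleted duplicate are left referencing an id that no longer exists. A possible alternative, sketched here on SQLite via Python's sqlite3 module (a stand-in for MySQL; the table and column names come from the error message above, everything else is assumed), is to re-point the child rows at the surviving id first, so the delete passes with the constraint still enforced.

```python
import sqlite3


def merge_duplicates(conn):
    """Re-point child rows to the surviving id, then delete the extra parents."""
    # 1. Move every product onto the lowest id that shares its article_number.
    conn.execute(
        """
        UPDATE products
        SET article_number_id = (
            SELECT MIN(keep.id)
            FROM article_numbers AS dup
            JOIN article_numbers AS keep
              ON keep.article_number = dup.article_number
            WHERE dup.id = products.article_number_id
        )
        """
    )
    # 2. Nothing references the higher ids any more, so they can be deleted.
    conn.execute(
        """
        DELETE FROM article_numbers
        WHERE id NOT IN (
            SELECT MIN(id) FROM article_numbers GROUP BY article_number
        )
        """
    )


def demo_merge():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.execute(
        "CREATE TABLE article_numbers (id INTEGER PRIMARY KEY, article_number TEXT)"
    )
    conn.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, "
        "article_number_id INTEGER REFERENCES article_numbers (id))"
    )
    conn.executemany(
        "INSERT INTO article_numbers VALUES (?, ?)", [(1, "A"), (2, "A"), (3, "B")]
    )
    conn.executemany(
        "INSERT INTO products VALUES (?, ?)", [(10, 2), (11, 1), (12, 3)]
    )
    merge_duplicates(conn)
    parents = conn.execute(
        "SELECT id, article_number FROM article_numbers ORDER BY id"
    ).fetchall()
    children = conn.execute(
        "SELECT id, article_number_id FROM products ORDER BY id"
    ).fetchall()
    return parents, children
```

In this made-up data, product 10 starts out pointing at duplicate id 2 and ends up pointing at id 1, the row that is kept.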

Related

How to remove duplicate rows in MySQL?

I tried to remove duplicate rows from a table TT
here is my query
delete t1
from TT t1
, TT t2
where t1.id < t2.id
and t1.url = t2.url
Here id is the primary key and url has a unique key in the table TT. You must be wondering why there are duplicate rows despite the unique index.
It did happen and I don't know why, but right now I want to remove the duplicate rows first. I am able to run the query in phpMyAdmin, but no rows are deleted at all (and there are duplicate rows in the table TT).
What could be the reason? Thanks!
You can use ROW_NUMBER() to remove duplicates:
;WITH cte AS (
SELECT *
, ROW_NUMBER() OVER(PARTITION BY url ORDER BY url) AS rn
FROM TT
)
DELETE FROM cte
WHERE rn > 1

MySQL Select works fine but Delete hangs indefinitely based on the position of GROUP BY

select * from table1 where ID in (
select min(a.ID) from (select * from table1) a group by id_x, id_y, col_z having count(*) > 1)
The above query ran in 2.2 seconds, returning four results. Now when I change the select * to delete, it hangs indefinitely.
delete from table1 where ID in (
select min(a.ID) from (select * from table1) a group by id_x, id_y, col_z having count(*) > 1)
If I move the position of group by clause inside the alias select query, it will no longer hang.
delete from table1 where ID in (
select a.ID from (select min(ID) as ID from table1 group by id_x, id_y, col_z having count(*) > 1) a)
Why does it hang? Granted, (select * from table1) pulls millions of records, but the delete runs for hours without finishing. Can anybody explain what is holding the query up? It puzzles me because the select query works fine whereas the delete query hangs.
EDIT:
My focus here is why it hangs. I already posted a workaround that works fine. But in order to prevent this from happening again, I need to get to the root cause.
Use a JOIN instead of WHERE ID IN (SELECT ...).
DELETE t1
FROM table1 AS t1
JOIN (
SELECT MIN(id) AS minId
FROM table1
GROUP BY id_x, id_y, col_z
HAVING COUNT(*) > 1) AS t2
ON t1.id = t2.minId
I think your query is not being optimized because it has to recalculate the subquery after each deletion, since deleting a row could change the MIN(id) for that group. Using a JOIN requires the grouping and aggregation to be done just once.
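
To illustrate the "do the grouping just once" idea independently of how any particular optimizer behaves, here is a sketch that materializes the ids first and only then deletes. It runs on SQLite through Python's sqlite3 module, so it says nothing about MySQL's actual execution plan; the temp table name `ids_to_delete` and the sample rows are assumptions.

```python
import sqlite3


def delete_min_id_of_duplicate_groups(conn):
    """Compute the ids once, store them, then delete by that fixed list."""
    conn.execute(
        """
        CREATE TEMP TABLE ids_to_delete AS
        SELECT MIN(id) AS id
        FROM table1
        GROUP BY id_x, id_y, col_z
        HAVING COUNT(*) > 1
        """
    )
    # The delete now reads a snapshot that cannot change as rows disappear.
    cur = conn.execute(
        "DELETE FROM table1 WHERE id IN (SELECT id FROM ids_to_delete)"
    )
    deleted = cur.rowcount
    conn.execute("DROP TABLE ids_to_delete")
    return deleted


def demo_materialized_delete():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 (id INTEGER PRIMARY KEY, id_x INTEGER, id_y INTEGER, col_z TEXT)"
    )
    conn.executemany(
        "INSERT INTO table1 VALUES (?, ?, ?, ?)",
        [
            (1, 1, 1, "a"), (2, 1, 1, "a"), (3, 1, 1, "a"),
            (4, 2, 2, "b"), (5, 2, 2, "b"),
            (6, 3, 3, "c"),
        ],
    )
    deleted = delete_min_id_of_duplicate_groups(conn)
    remaining = [row[0] for row in conn.execute("SELECT id FROM table1 ORDER BY id")]
    return deleted, remaining
```

As in the original query, only the minimum id of each duplicated group is removed, so a group of three still has two rows afterwards.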
Try this:
delete t
from table1 t join
(select min(id) as min_id
from table1
group by id_x, id_y, col_z
having count(*) >= 2
) tt
on tt.min_id = t.id;
That said, you probably don't want to delete just the minimum id. I'm guessing you want to keep the most recent id. If so:
delete t
from table1 t left join
(select max(id) as max_id
from table1
group by id_x, id_y, col_z  -- no HAVING here: single-row groups must match too, or their only row would be deleted
) tt
on tt.max_id = t.id
where tt.max_id is null;
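
One thing worth checking with an anti-join of this shape is what happens to groups that have only one row: if the derived table is filtered with HAVING COUNT(*) >= 2, those rows never find a match and so look deletable. The sketch below compares both variants on SQLite via Python's sqlite3 module (a stand-in for MySQL; SQLite has no multi-table DELETE, so the same LEFT JOIN is wrapped in an IN subquery). The function name, flag, and sample rows are assumptions.

```python
import sqlite3


def remaining_after_delete(only_duplicate_groups):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 (id INTEGER PRIMARY KEY, id_x INTEGER, id_y INTEGER, col_z TEXT)"
    )
    conn.executemany(
        "INSERT INTO table1 VALUES (?, ?, ?, ?)",
        [
            (1, 1, 1, "a"), (2, 1, 1, "a"), (3, 1, 1, "a"),
            (4, 2, 2, "b"), (5, 2, 2, "b"),
            (6, 3, 3, "c"),  # a group with a single row
        ],
    )
    having = "HAVING COUNT(*) >= 2" if only_duplicate_groups else ""
    # Delete every row that is not the MAX(id) of a group in the derived table.
    conn.execute(
        f"""
        DELETE FROM table1
        WHERE id IN (
            SELECT t.id
            FROM table1 AS t
            LEFT JOIN (
                SELECT MAX(id) AS max_id
                FROM table1
                GROUP BY id_x, id_y, col_z
                {having}
            ) AS tt ON tt.max_id = t.id
            WHERE tt.max_id IS NULL
        )
        """
    )
    return [row[0] for row in conn.execute("SELECT id FROM table1 ORDER BY id")]
```

Without the HAVING filter the single-row group (id 6) survives; with it, that row is deleted along with the older duplicates.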

Delete MySQL duplicate entry and keep the most up-to-date one based on timestamp

I have 2 entries in a MySQL database that have the same value in the 'hwaddr' column, as shown in the added screenshot.
How do I search for duplicates based on the 'hwaddr' column and keep only the most up-to-date one based on the 'timestamp' column (deleting the old one)?
Try this to delete:
DELETE FROM `list` WHERE `id` IN (SELECT *
FROM (SELECT MIN( id )
FROM `list`
GROUP BY ip
HAVING COUNT( ip ) >1) AS e)
Assuming a greater primary key (id) means a later timestamp, you can use the query below, which is more efficient and deletes all duplicate records (not just one per group) except the latest:
DELETE b.*
FROM mytable b
LEFT JOIN (SELECT MAX(id) AS id FROM mytable GROUP BY hwaddr) a ON a.id=b.id
WHERE a.id IS NULL;
This is a single-statement approach; for a production server there may be better options.
Listing the rows with a duplicated hwaddr:
SELECT * FROM `general2` WHERE hwaddr in ( SELECT hwaddr FROM general2 GROUP BY hwaddr HAVING COUNT( * ) >1 )
Listing what will be deleted:
select t1.* FROM general2 t1, general2 t2 WHERE t1.hwaddr=t2.hwaddr AND t1.timestamp < t2.timestamp
Deleting by hwaddr, leaving the most recent row by timestamp:
DELETE t1 FROM general2 t1, general2 t2 WHERE t1.hwaddr=t2.hwaddr AND t1.timestamp < t2.timestamp
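
A detail to be aware of with the `t1.timestamp < t2.timestamp` rule: two rows with the same hwaddr and exactly the same timestamp both survive, because neither is strictly older. Here is a sketch on SQLite through Python's sqlite3 module (a stand-in for MySQL, with the self join expressed as EXISTS) that shows this and one possible tie-break on id. The `id` column, the `break_ties_by_id` flag, and the sample rows are assumptions.

```python
import sqlite3


def remaining_after_timestamp_delete(break_ties_by_id=False):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE general2 (id INTEGER PRIMARY KEY, hwaddr TEXT, timestamp INTEGER)"
    )
    conn.executemany(
        "INSERT INTO general2 VALUES (?, ?, ?)",
        [
            (1, "aa", 100), (2, "aa", 200),  # row 2 is newer
            (3, "bb", 50),                   # no duplicate
            (4, "cc", 10), (5, "cc", 10),    # identical timestamps
        ],
    )
    tie_break = (
        "OR (t2.timestamp = general2.timestamp AND t2.id > general2.id)"
        if break_ties_by_id
        else ""
    )
    # Delete a row when a "newer" row with the same hwaddr exists.
    conn.execute(
        f"""
        DELETE FROM general2
        WHERE EXISTS (
            SELECT 1
            FROM general2 AS t2
            WHERE t2.hwaddr = general2.hwaddr
              AND (t2.timestamp > general2.timestamp {tie_break})
        )
        """
    )
    return [row[0] for row in conn.execute("SELECT id FROM general2 ORDER BY id")]
```

With the tie-break enabled, the higher id wins among rows that share a timestamp, so exactly one row per hwaddr remains.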

DELETE FROM table WHERE NOT MAX

Okay so I have a table that has xid. Each xid can have several pids. I am trying to delete everything except the row that has the highest pid for each xid.
I am trying:
DELETE FROM table WHERE `pid` NOT IN
( SELECT MAX(`pid`)
FROM table
GROUP BY `xid`
)
If I use the same query but with SELECT instead of DELETE, I get all of the records that I want to delete. When the DELETE is there, I get the error:
#1093 - You can't specify target table 'mod_personnel' for update in FROM clause
Use a JOIN rather than NOT IN:
DELETE t1.* FROM table t1
LEFT JOIN (SELECT xid, MAX(pid) pid
FROM table
GROUP BY xid) t2
ON t1.xid = t2.xid AND t1.pid = t2.pid
WHERE t2.pid IS NULL
DELETE FROM table WHERE `pid` NOT IN
(SELECT maxpid FROM
( SELECT MAX(`pid`) as maxpid
FROM table
GROUP BY `xid`
)as m
)
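
A side note on matching by pid alone: the question does not say whether pid is unique across different xid values. If it is not, a pid that is the maximum for one xid can shield a lower row of another xid. The sketch below shows the difference on SQLite through Python's sqlite3 module (a stand-in; SQLite does not raise error 1093, so this only illustrates the matching logic). The table name is taken from the error message, the sample rows and the `per_xid` flag are assumptions; the MySQL analogue would be to include xid in the join condition.

```python
import sqlite3


def remaining_after_pid_delete(per_xid):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mod_personnel (xid INTEGER, pid INTEGER)")
    # xid 1 has pids 1 and 2; xid 2 has only pid 1.
    conn.executemany(
        "INSERT INTO mod_personnel VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)]
    )
    if per_xid:
        # Compare each row only with the maximum pid of its own xid.
        conn.execute(
            """
            DELETE FROM mod_personnel
            WHERE pid < (
                SELECT MAX(m.pid)
                FROM mod_personnel AS m
                WHERE m.xid = mod_personnel.xid
            )
            """
        )
    else:
        # Compare against the pooled set of maxima, as in the NOT IN query.
        conn.execute(
            """
            DELETE FROM mod_personnel
            WHERE pid NOT IN (
                SELECT MAX(pid) FROM mod_personnel GROUP BY xid
            )
            """
        )
    return conn.execute(
        "SELECT xid, pid FROM mod_personnel ORDER BY xid, pid"
    ).fetchall()
```

In the pooled version, row (1, 1) is kept because pid 1 happens to be the maximum for xid 2.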

Get around self-referencing in a DELETE query

I'm trying to delete all records which aren't the latest version under their name, but apparently you can't reference a table you are modifying in the same query.
I tried this but it doesn't work for the reasons above:
DELETE FROM table
WHERE CONCAT(name, version ) NOT IN (
SELECT CONCAT( name, MAX( version ) )
FROM table
GROUP BY name
)
How can I get around this?
Cheers
Wrap the inner reference in a derived table.
DELETE FROM table
WHERE Concat(name, version) NOT IN (SELECT nv
FROM (SELECT Concat(name, Max(version))
AS nv
FROM table
GROUP BY name) AS derived)
delete t1
from table_name1 t1, table_name1 t2
where t1.version < t2.version
and t1.name = t2.name;
Giving the table two aliases is what makes this work.
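
Comparing name and version as separate columns also sidesteps a subtle issue with the CONCAT key: different (name, version) pairs can produce the same string, for example ("a", 11) and ("a1", 1) both give "a11". Here is a sketch of that collision on SQLite through Python's sqlite3 module (a stand-in for MySQL, using `||` for concatenation). The table name `releases`, the `use_concat_key` flag, and the sample rows are hypothetical.

```python
import sqlite3


def remaining_after_version_delete(use_concat_key):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE releases (name TEXT, version INTEGER)")
    conn.executemany(
        "INSERT INTO releases VALUES (?, ?)", [("a", 11), ("a1", 1), ("a1", 2)]
    )
    if use_concat_key:
        # Key built by gluing name and version together, as with CONCAT().
        conn.execute(
            """
            DELETE FROM releases
            WHERE name || version NOT IN (
                SELECT name || MAX(version) FROM releases GROUP BY name
            )
            """
        )
    else:
        # Delete a row when the same name has a higher version.
        conn.execute(
            """
            DELETE FROM releases
            WHERE EXISTS (
                SELECT 1
                FROM releases AS r2
                WHERE r2.name = releases.name
                  AND r2.version > releases.version
            )
            """
        )
    return conn.execute(
        "SELECT name, version FROM releases ORDER BY name, version"
    ).fetchall()
```

With the concatenated key, the outdated row ("a1", 1) is kept because its key "a11" matches the latest version of "a".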