How to remove duplicate entries from a mysql db? - mysql

I have a table with some ids + titles. I want to make the title column unique, but it has over 600k records already, some of which are duplicates (sometimes several dozen times over).
How do I remove all duplicates, except one, so I can add a UNIQUE key to the title column after?

This command adds a unique key, and drops all rows that generate errors (due to the unique key). This removes duplicates.
ALTER IGNORE TABLE table ADD UNIQUE KEY idx1(title);
Edit: Note that this command may not work for InnoDB tables in some versions of MySQL, and the IGNORE clause of ALTER TABLE was removed entirely in MySQL 5.7. See this post for a workaround. (Thanks to "an anonymous user" for this information.)

Create a new table with just the distinct rows of the original table. There may be other ways but I find this the cleanest.
CREATE TABLE tmp_table AS SELECT DISTINCT [....] FROM main_table
More specifically:
The faster way is to insert distinct rows into a temporary table. Using delete, it took me a few hours to remove duplicates from a table of 8 million rows. Using insert and distinct, it took just 13 minutes.
CREATE TABLE tempTableName LIKE tableName;
CREATE INDEX ix_all_id ON tableName(cellId,attributeId,entityRowId,value);
INSERT INTO tempTableName(cellId,attributeId,entityRowId,value) SELECT DISTINCT cellId,attributeId,entityRowId,value FROM tableName;
TRUNCATE TABLE tableName;
INSERT INTO tableName SELECT * FROM tempTableName;
DROP TABLE tempTableName;
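As a rough, runnable sketch of the copy-the-distinct-rows idea above: this uses Python's built-in sqlite3 module as a stand-in for MySQL, so the swap is done with ALTER TABLE ... RENAME TO rather than MySQL's RENAME TABLE. The table name cells and its sample rows are made up for illustration.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL; table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cells (cellId INTEGER, attributeId INTEGER, value TEXT);
    INSERT INTO cells VALUES (1, 10, 'a'), (1, 10, 'a'), (2, 20, 'b'), (1, 10, 'a');
""")

# Copy only the distinct rows into a fresh table, then swap it in
# for the original.
conn.executescript("""
    CREATE TABLE cells_dedup AS
        SELECT DISTINCT cellId, attributeId, value FROM cells;
    DROP TABLE cells;
    ALTER TABLE cells_dedup RENAME TO cells;
""")

rows = conn.execute(
    "SELECT cellId, attributeId, value FROM cells ORDER BY cellId"
).fetchall()
print(rows)
```

Keep in mind that a table created with CREATE TABLE ... AS SELECT does not carry over the indexes and keys of the original, so those have to be recreated afterwards.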

Since the MySQL ALTER IGNORE TABLE syntax has been deprecated, you need to actually delete the duplicate data before adding an index.
First write a query that finds all the duplicates. Here I'm assuming that email is the field that contains duplicates.
SELECT
s1.email,
s1.id,
s1.created,
s2.id,
s2.created
FROM
student AS s1
INNER JOIN
student AS s2
WHERE
/* Emails are the same */
s1.email = s2.email AND
/* DON'T select both accounts,
only select the one created later.
The serial id could also be used here */
s2.created > s1.created
;
Next select only the unique duplicate ids:
SELECT
DISTINCT s2.id
FROM
student AS s1
INNER JOIN
student AS s2
WHERE
s1.email = s2.email AND
s2.created > s1.created
;
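Once you are sure that this query returns only the duplicate ids you want to delete, run the delete. You have to wrap each reference to the table in (SELECT * FROM tblname) so that MySQL doesn't complain about selecting from the table it is deleting from (error 1093).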
DELETE FROM
student
WHERE
id
IN (
SELECT
DISTINCT s2.id
FROM
(SELECT * FROM student) AS s1
INNER JOIN
(SELECT * FROM student) AS s2
WHERE
s1.email = s2.email AND
s2.created > s1.created
);
Then create the unique index:
ALTER TABLE
student
ADD UNIQUE INDEX
idx_student_unique_email(email)
;
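The three steps above (preview, delete, add the unique index) can be tried end to end with a small sketch. It assumes Python's built-in sqlite3 module as a stand-in for MySQL and uses made-up sample rows; SQLite accepts the subquery on the target table directly, so the (SELECT * FROM student) wrapper that MySQL needs is left out here.

```python
import sqlite3

# SQLite stands in for MySQL; the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, email TEXT, created TEXT);
    INSERT INTO student (id, email, created) VALUES
        (1, 'a@example.com', '2020-01-01'),
        (2, 'b@example.com', '2020-01-02'),
        (3, 'a@example.com', '2020-02-01'),
        (4, 'a@example.com', '2020-03-01');
""")

# Ids of every row that has an earlier-created row with the same email.
later_duplicates = """
    SELECT DISTINCT s2.id
    FROM student AS s1
    INNER JOIN student AS s2
        ON s1.email = s2.email AND s2.created > s1.created
"""

# Step 1: preview what would be deleted.
doomed = sorted(row[0] for row in conn.execute(later_duplicates))

# Step 2: delete those rows.
conn.execute(f"DELETE FROM student WHERE id IN ({later_duplicates})")

# Step 3: the unique index can now be created without errors.
conn.execute("CREATE UNIQUE INDEX idx_student_unique_email ON student (email)")

# A new duplicate is now rejected by the index.
try:
    conn.execute(
        "INSERT INTO student (email, created) VALUES ('a@example.com', '2021-01-01')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

remaining = conn.execute("SELECT id, email FROM student ORDER BY id").fetchall()
print(doomed, remaining, rejected)
```

One caveat: if two rows share both the email and the exact created value, neither counts as "later", so both survive and the index creation fails. Comparing on the serial id instead, as the comment in the query suggests, avoids that.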

The query below deletes all duplicates except the row with the lowest "id" value:
DELETE t1 FROM table_name t1, table_name t2 WHERE t1.id > t2.id AND t1.name = t2.name
In a similar way, we can keep the row with the highest "id" value instead:
DELETE t1 FROM table_name t1, table_name t2 WHERE t1.id < t2.id AND t1.name = t2.name

This shows how to do it in SQL Server 2000. I'm not completely familiar with MySQL syntax, but I'm sure there's something comparable.
create table #titles (iid int identity (1, 1), title varchar(200))
-- Repeat this step many times to create duplicates
insert into #titles(title) values ('bob')
insert into #titles(title) values ('bob1')
insert into #titles(title) values ('bob2')
insert into #titles(title) values ('bob3')
insert into #titles(title) values ('bob4')
DELETE T FROM
#titles T left join
(
select title, min(iid) as minid from #titles group by title
) D on T.title = D.title and T.iid = D.minid
WHERE D.minid is null
Select * FROM #titles

delete from student where id in (
SELECT distinct(s1.`student_id`) from student as s1 inner join student as s2
where s1.`sex` = s2.`sex` and
s1.`student_id` > s2.`student_id` and
s1.`sex` = 'M'
ORDER BY `s1`.`student_id` ASC
)

The solution posted by Nitin seems to be the most elegant / logical one.
However it has one issue:
ERROR 1093 (HY000): You can't specify target table 'student' for
update in FROM clause
This can however be resolved by using (SELECT * FROM student) instead of student:
DELETE FROM student WHERE id IN (
SELECT distinct(s1.`student_id`) FROM (SELECT * FROM student) AS s1 INNER JOIN (SELECT * FROM student) AS s2
WHERE s1.`sex` = s2.`sex` AND
s1.`student_id` > s2.`student_id` AND
s1.`sex` = 'M'
ORDER BY `s1`.`student_id` ASC
)
Give your +1's to Nitin for coming up with the original solution.

Deleting duplicates from MySQL tables is a common problem that usually comes with specific needs. In case anyone is interested, in "Remove duplicate rows in MySQL" I explain how to use a temporary table to delete MySQL duplicates in a reliable and fast way (with examples for different use cases).
In this case, something like this should work:
-- create a new temporary table
CREATE TABLE tmp_table1 LIKE table1;
-- add a unique constraint on the column that must not repeat
ALTER TABLE tmp_table1 ADD UNIQUE(title);
-- scan over the table to insert entries; rows that would violate the constraint are skipped
INSERT IGNORE INTO tmp_table1 SELECT * FROM table1 ORDER BY id;
-- rename tables
RENAME TABLE table1 TO backup_table1, tmp_table1 TO table1;
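Here is a small sketch of the same unique-constraint-plus-ignore approach that can be run locally. It assumes Python's built-in sqlite3 module in place of MySQL, so INSERT IGNORE becomes INSERT OR IGNORE and the swap uses ALTER TABLE ... RENAME TO; the sample rows are invented.

```python
import sqlite3

# SQLite stands in for MySQL; the data below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, title TEXT);
    INSERT INTO table1 (id, title) VALUES
        (1, 'x'), (2, 'y'), (3, 'x'), (4, 'z'), (5, 'y');

    -- New table with the unique constraint already in place.
    CREATE TABLE tmp_table1 (id INTEGER PRIMARY KEY, title TEXT UNIQUE);

    -- Rows that would violate the constraint are skipped. ORDER BY id
    -- makes the lowest id for each title the one that is kept.
    INSERT OR IGNORE INTO tmp_table1 SELECT id, title FROM table1 ORDER BY id;

    -- Keep the original as a backup and swap the clean table in.
    ALTER TABLE table1 RENAME TO backup_table1;
    ALTER TABLE tmp_table1 RENAME TO table1;
""")

kept = conn.execute("SELECT id, title FROM table1 ORDER BY id").fetchall()
backup_count = conn.execute("SELECT COUNT(*) FROM backup_table1").fetchone()[0]
print(kept, backup_count)
```

The ORDER BY on the copy decides which duplicate survives, and the untouched backup table makes it easy to check the result before dropping anything.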

Related

MySQL: delete query taking too long

I am trying to delete records with duplicate column values from a table, but it's taking forever. Basically it gets stuck, with no response for hours. The table is fairly large, with over 1.3M records. Is the query inefficient? Is there any way to optimize it?
delete n1 from ids n1, ids n2 where n1.id > n2.id and n1.user_id = n2.user_id
The database is remote, and I am using PuTTY to run queries.
Add an index:
ALTER TABLE ids ADD INDEX (user_id, id);
This makes it efficient to find all the rows with the same user ID and higher IDs.
It will also help to join with a subquery.
DELETE n1
FROM ids AS n1
JOIN (SELECT user_id, MIN(id) AS minid
FROM ids
GROUP BY user_id) AS n2
ON n1.user_id = n2.user_id AND n1.id > n2.minid
This will still be faster with the above index.
Yes, that query is very inefficient. Even if you used explicit joins, keep in mind that every row "N" is matched up with every row before "N", every row "N-1" is matched up with the rows before it, and so on.
Try something like this:
DROP TEMPORARY TABLE IF EXISTS keeps;
CREATE TEMPORARY TABLE keeps (
user_id INT,
keepID INT,
INDEX (user_id, keepID)
);
INSERT INTO keeps (user_id, keepID)
SELECT user_id, MIN(id) As keepID
FROM ids
GROUP BY user_id;
DELETE FROM ids WHERE (user_id, id) NOT IN (SELECT user_id, keepID FROM keeps);
DROP TEMPORARY TABLE IF EXISTS keeps;
I'm also tempted to suggest trying something like the below, but I can't remember if MySQL allows subquerying the delete table in the delete query ... which is why I suggested the temp table in the first one.
DELETE a
FROM ids AS a
WHERE EXISTS (
SELECT *
FROM ids AS b
WHERE b.id < a.id
AND b.user_id = a.user_id
)
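To make the cost difference discussed in these answers concrete, here is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for MySQL and a handful of invented rows. It counts the row pairs the original self-join has to match, then does the delete via a keeps table holding one MIN(id) per user_id.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ids (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO ids (id, user_id) VALUES
        (1, 7), (2, 7), (3, 8), (4, 7), (5, 8), (6, 9);
    CREATE INDEX idx_user_id ON ids (user_id, id);
""")

# Row pairs the self-join has to match: n * (n - 1) / 2 for a user_id
# that occurs n times.
pairs = conn.execute("""
    SELECT COUNT(*)
    FROM ids AS n1
    JOIN ids AS n2 ON n1.user_id = n2.user_id AND n1.id > n2.id
""").fetchone()[0]

# One keeper per user_id, computed in a single grouped pass.
conn.execute("""
    CREATE TEMP TABLE keeps AS
    SELECT user_id, MIN(id) AS keepID FROM ids GROUP BY user_id
""")

# id is the primary key, so comparing on keepID alone is enough here.
deleted = conn.execute(
    "DELETE FROM ids WHERE id NOT IN (SELECT keepID FROM keeps)"
).rowcount

remaining = conn.execute("SELECT id, user_id FROM ids ORDER BY id").fetchall()
print(pairs, deleted, remaining)
```

Even in this tiny example the self-join touches more pairs than there are rows to delete. For a user_id with many duplicates the pair count grows quadratically, while the grouped pass reads each row once.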

MySQL: Use a single Column to show as multiple select column in single row

Consider these example tables
E_id in Table1 is a primary key. From and Assign_to are foreign keys referenced with E_id.
I want to show a table like this:
I am not sure how I can implement it. Please share the SQL query which returns the desired table.
You could JOIN to Table1 twice:
SELECT
t2.work_name,
t1f.E_name AS `From`,
t1a.E_Name AS `Assign_to`
FROM Table2 t2
INNER JOIN Table1 t1f
ON t1f.E_id = t2.`from`
INNER JOIN Table1 t1a
ON t1a.E_id = t2.Assign_to
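The example tables from the question are not reproduced here, so the following sketch uses invented rows to show what joining Table1 twice produces. It assumes Python's built-in sqlite3 module as a stand-in for MySQL and quotes the from column with double quotes instead of backticks.

```python
import sqlite3

# SQLite stands in for MySQL; the employee and work rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (E_id INTEGER PRIMARY KEY, E_name TEXT);
    CREATE TABLE Table2 (
        work_name TEXT,
        "from" INTEGER REFERENCES Table1 (E_id),
        Assign_to INTEGER REFERENCES Table1 (E_id)
    );
    INSERT INTO Table1 VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO Table2 VALUES ('report', 1, 2), ('review', 3, 1);
""")

# Each alias of Table1 resolves one of the two foreign keys to a name.
rows = conn.execute("""
    SELECT t2.work_name,
           t1f.E_name AS "From",
           t1a.E_name AS "Assign_to"
    FROM Table2 AS t2
    INNER JOIN Table1 AS t1f ON t1f.E_id = t2."from"
    INNER JOIN Table1 AS t1a ON t1a.E_id = t2.Assign_to
    ORDER BY t2.work_name
""").fetchall()
print(rows)
```

If either foreign key can be NULL, a LEFT JOIN in place of the corresponding INNER JOIN keeps those rows in the result.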
You can solve that problem with a simple temp table. It is not the most sophisticated way to solve it, but the solution is easy to comprehend.
The steps are as follows:
Create a table with all data from table2
Add 2 columns to that table to store the name values for from and Assign_to
Update the columns with the name values from table1
Select your Data
MySQL-Code
-- create temp-table
CREATE TABLE table2_temp
SELECT * FROM table2;
-- add columns to enrich table with E_name from table1
ALTER TABLE table2_temp
ADD COLUMN E_name_from VARCHAR (125),
ADD COLUMN E_name_assign_to VARCHAR (125);
-- update temp-table with names from table1
-- for E_name_from
UPDATE table2_temp A
INNER JOIN table1 B ON (A.`from` = B.E_id)
SET A.E_name_from = B.E_name;
-- for E_name_assign_to
UPDATE table2_temp A
INNER JOIN table1 B ON (A.Assign_to = B.E_id)
SET A.E_name_assign_to = B.E_name;
-- now you can select your data from the temp-table
SELECT
work_name,
E_name_from AS `From`,
E_name_assign_to AS `Assign_to`
FROM
table2_temp;
-- drop table after work is done
drop table if exists table2_temp ;

Delete multiple rows without knowing the names of the rows

I have a huge database that contains writer names.
There are multiple records in my database but I don't know which rows are duplicate.
How can I delete duplicate rows without knowing the value?
Try:
delete from tbl
where writer_id in
(select writer_id
from (select * from tbl) t
where exists (select 1
from (select * from tbl) x
where x.writer_name = t.writer_name
and t.writer_id > x.writer_id));
See demo:
http://sqlfiddle.com/#!2/845ca3/1/0
This keeps the first row for each writer_name, in order of writer_id ascending.
The EXISTS subquery will run for every row, however. You could also try:
delete t
from
tbl t
left join ( select writer_name, min(writer_id) as writer_id
from tbl
group by writer_name ) x
on t.writer_name = x.writer_name
and x.writer_id = t.writer_id
where
x.writer_name is null;
demo: http://sqlfiddle.com/#!2/075f9/1/0
If there are no foreign key constraints on the table you could also use create table as select to create a new table without the duplicate entries, drop the old table, and rename the new table to that of the old table's name, getting what you want in the end. (This would not be the way to go if this table has foreign keys though)
That would look like this:
create table tbl2 as (select distinct writer_name from tbl);
drop table tbl;
alter table tbl2 add column writer_id int not null auto_increment first,
add primary key (writer_id);
rename table tbl2 to tbl;
demo: http://sqlfiddle.com/#!2/8886d/1/0
SELECT a.*
FROM the_table a
INNER JOIN the_table b ON a.field1 = b.field1 AND (etc)
WHERE a.pk != b.pk
Hope that query solves your problem.
DELETE a
FROM tbl a
LEFT JOIN tbl b
ON a.field1 = b.field1 (etc)
WHERE a.id < b.id
This should help you.

delete duplicate records in mysql

We have 2 tables called "post" and "post_extra".
The columns of the "post" table are: id, postdate, title, description
And for "post_extra" they are: eid, news_id, rating, views
The "id" field in the first table is related to "news_id" in the second table.
There are more than 100,000 records in the table, and many of them are duplicates. I want to keep only one record and remove the duplicate records in the "post" table that have the same title, and then remove the related records in "post_extra".
I ran this query in phpMyAdmin, but the server crashed and I had to restart it.
DELETE e
FROM Post p1, Post p2, Post_extra e
WHERE p1.postdate > p2.postdate
AND p1.title = p2.title
AND e.news_id = p1.id
How can I do this?
Suppose you have a table named 'tables' that contains the duplicate records.
First, group by the column you want to deduplicate on to see which values are duplicated. For the delete itself I use a self join instead of a nested query or a temporary table.
SELECT title, COUNT(*) AS cnt FROM `tables` GROUP BY title HAVING COUNT(*) > 1;
This query returns each duplicated title together with the number of rows that share it.
You don't need to create a temporary table in this case.
To delete duplicates except one record:
The table should have an auto-increment column. A possible solution that I've just come across:
DELETE t1 FROM tables t1, tables t2 WHERE t1.id > t2.id AND t1.title = t2.title
if you want to keep the row with the lowest auto increment id value OR
DELETE t1 FROM tables t1, tables t2 WHERE t1.id < t2.id AND t1.title = t2.title
if you want to keep the row with the highest auto increment id value.
You can cross-check the result by selecting the duplicate records again with the same kind of query:
SELECT title, COUNT(*) AS cnt FROM `tables` GROUP BY title HAVING COUNT(*) > 1;
If it returns 0 rows, then your query was successful.
This will keep entries with the lowest id for each title
DELETE p, e
FROM Post p
left join Post_extra e on e.news_id = p.id
where p.id not in
(
select * from
(
select min(id)
from post
group by title
) x
)
SQLFiddle demo
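For anyone who wants to try the keep-the-lowest-id logic without a MySQL server, here is a rough sketch using Python's built-in sqlite3 module. SQLite has no multi-table DELETE, so this variant (my substitution, not the query above) runs two deletes in one transaction; the sample rows are made up and only the columns needed for the example are included.

```python
import sqlite3

# SQLite stands in for MySQL; rows and the reduced column set are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE post_extra (eid INTEGER PRIMARY KEY, news_id INTEGER, views INTEGER);
    INSERT INTO post (id, title) VALUES
        (1, 'hello'), (2, 'hello'), (3, 'world'), (4, 'hello');
    INSERT INTO post_extra (eid, news_id, views) VALUES
        (10, 1, 5), (11, 2, 6), (12, 3, 7), (13, 4, 8);
""")

with conn:  # both deletes commit together, or roll back together on error
    # Delete the dependent rows first, while post still shows which
    # ids are the keepers (lowest id per title).
    conn.execute("""
        DELETE FROM post_extra
        WHERE news_id NOT IN (SELECT MIN(id) FROM post GROUP BY title)
    """)
    conn.execute("""
        DELETE FROM post
        WHERE id NOT IN (SELECT MIN(id) FROM post GROUP BY title)
    """)

posts = conn.execute("SELECT id, title FROM post ORDER BY id").fetchall()
extras = conn.execute("SELECT eid, news_id FROM post_extra ORDER BY eid").fetchall()
print(posts, extras)
```

Note that the first delete also removes any post_extra rows whose news_id does not match a post at all, which may or may not be what you want.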
You can delete duplicate records by creating a temporary table with a unique index on the fields that you need to check for duplicate values,
then issue
INSERT IGNORE INTO TemporaryTable SELECT * FROM TableWithDuplicates
You will get a temporary table without duplicates.
Then delete the records from the original table (TableWithDuplicates) by joining the two tables.
It should be something like
CREATE TEMPORARY TABLE `tmp_post` (
`id` INT(10) NULL,
`postDate` DATE NULL,
`title` VARCHAR(50) NULL,
`description` VARCHAR(50) NULL, UNIQUE INDEX `postDate_title_description` (`postDate`, `title`, `description`) );
INSERT IGNORE INTO tmp_post
SELECT id,postDate,title,description
FROM post ;
DELETE post.*
FROM post
LEFT JOIN tmp_post tmp ON tmp.id = post.id
WHERE tmp.id IS NULL ;
Sorry, I haven't tested this code.

Select a record that has a duplicate

I'd like to select all records from a table (names) where lastname is not unique. Preferably I would like to delete all records that are duplicates.
How would this be done? Assume that I don't want to rerun one query multiple times until it quits.
To find which lastnames have duplicates:
SELECT lastname, COUNT(lastname) AS rowcount
FROM table
GROUP BY lastname
HAVING rowcount > 1
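Before deleting anything it can help to look at both the duplicated values and the exact rows that would go. The sketch below assumes Python's built-in sqlite3 module as a stand-in for MySQL, an id primary key on the names table, and made-up sample rows.

```python
import sqlite3

# SQLite stands in for MySQL; the id column and the rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names (id INTEGER PRIMARY KEY, lastname TEXT);
    INSERT INTO names (id, lastname) VALUES
        (1, 'Smith'), (2, 'Jones'), (3, 'Smith'),
        (4, 'Lee'), (5, 'Smith'), (6, 'Jones');
""")

# Which last names occur more than once, and how often.
duplicated = conn.execute("""
    SELECT lastname, COUNT(*) AS rowcount
    FROM names
    GROUP BY lastname
    HAVING COUNT(*) > 1
    ORDER BY lastname
""").fetchall()

# Dry run: the rows a "keep the lowest id" delete would remove.
would_delete = conn.execute("""
    SELECT id, lastname
    FROM names
    WHERE id NOT IN (SELECT MIN(id) FROM names GROUP BY lastname)
    ORDER BY id
""").fetchall()
print(duplicated, would_delete)
```

Turning the second SELECT into a DELETE with the same WHERE clause removes all extra rows in one pass, so nothing has to be rerun. In MySQL the subquery would need the derived-table wrapper to avoid error 1093.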
To delete one of the duplicates of all the last names. Run until it doesn't do anything. Not very graceful.
DELETE FROM table
WHERE id IN (SELECT MAX(id)
FROM (SELECT * FROM table) AS t
GROUP BY lastname
HAVING COUNT(lastname) > 1)
The fastest and easiest way to delete duplicate records is by issuing a very simple command.
ALTER IGNORE TABLE [TABLENAME] ADD UNIQUE INDEX UNIQUE_INDEX ([FIELDNAME])
This will lock the table. If that is an issue, try:
delete t1 from table1 t1, table1 t2
where t1.duplicate_field = t2.duplicate_field (add more conditions if needed, e.g. and t1.duplicate_field2 = t2.duplicate_field2)
and t1.unique_field > t2.unique_field
and break it up into ranges to run faster
Duplicate of: How can I remove duplicate rows?
DELETE names
FROM names
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, lastname
FROM names
GROUP BY lastname
) as KeepRows ON
names.lastname = KeepRows.lastname
AND names.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
Assumption: you have a RowId column.
SELECT lastname, COUNT(*) AS mycountvar FROM names GROUP BY lastname HAVING mycountvar > 1;
and then
DELETE FROM names WHERE lastname = '$mylastnamevar' LIMIT $mycountvar-1
But: why don't you just flag the field "lastname" as unique, so that duplicates can't get in?