delete duplicate records in mysql - mysql

We have 2 tables called : "post" and " post_extra"
summery construction of "post" table's are: id,postdate,title,description
And for post_extra they are: eid,news_id,rating,views
"id" filed in the first table is related to "news_id" to the second table.
There are more than 100,000 records on the table, that many of them are duplicated. I want to keep only one record and remove duplicate records on "post" table that have the same title, and then remove the related record on "post_extra"
I ran this query on phpmyadmin but the server was crashed. And I had to restart it.
DELETE e
FROM Post p1, Post p2, Post_extra e
WHERE p1.postdate > p2.postdate
AND p1.title = p2.title
AND e.news_id = p1.id
How can I do this?

Suppose you have table named as 'tables' in which you have the duplicate records.
Firstly you have to do group by column on which you want to delete duplicate.But I am not doing it with group by.I am writing self join instead of writing nested query or creating temporary table.
SELECT * FROM `names` GROUP BY title, id having count(title) > 1;
This query return number of duplicate records with their title and id.
You don't need to create the temporary table in this case.
To Delete duplicate except one record:
In this table it should have auto increment column. The possible solution that I've just come across:
DELETE t1 FROM tables t1, tables t2 WHERE t1.id > t2.id AND t1.title = t2.title
if you want to keep the row with the lowest auto increment id value OR
DELETE t1 FROM tables t1, tables t2 WHERE t1.id < t2.id AND t1.title = n2.title
if you want to keep the row with the highest auto increment id value.
You can cross check your solution,by selecting the duplicate records again by given query:
SELECT * FROM `tables` GROUP BY title, id having count(title) > 1;
If it return 0 result, then you query is successful.

This will keep entries with the lowest id for each title
DELETE p, e
FROM Post p
left join Post_extra e on e.news_id = p.id
where id not in
(
select * from
(
select min(id)
from post
group by title
) x
)
SQLFiddle demo

You can delete duplicate record by creating a temporary table with unique index on the fields that you need to check for the duplicate value
then issue
Insert IGNORE into select * from TableWithDuplicates
You will get a temporary table without duplicates .
then delete the records from the original table (TableWithDuplicates) by JOIN the tables
Should be something like
CREATE TEMPORARY TABLE `tmp_post` (
`id` INT(10) NULL,
`postDate` DATE NULL,
`title` VARCHAR(50) NULL,
`description` VARCHAR(50) NULL, UNIQUE INDEX `postDate_title_description` (`postDate`, `title`, `description`) );
INSERT IGNORE INTO tmp_post
SELECT id,postDate,title,description
FROM post ;
DELETE post.*
FROM post
LEFT JOIN tmp_post tmp ON tmp.id = post.id
WHERE tmp.id IS NULL ;
Sorry I didn't tested this code

Related

Delete multiple rows without knowing the names of the rows

I have a huge database that contains writer names.
There are multiple records in my database but I don't know which rows are duplicate.
How can I delete duplicate rows without knowing the value?
Try:
delete from tbl
where writer_id in
(select writer_id
from (select * from tbl) t
where exists (select 1
from (select * from tbl) x
where x.writer_name = t.writer_name
and t.writer_id < x.writer_id));
See demo:
http://sqlfiddle.com/#!2/845ca3/1/0
This keeps the first row for each writer_name, in order of writer_id ascending.
The EXISTS subquery will run for every row, however. You could also try:
delete t
from
tbl t
left join ( select writer_name, min(writer_id) as writer_id
from tbl
group by writer_name ) x
on t .writer_name = x.writer_name
and x.writer_id = t .writer_id
where
x.writer_name is null;
demo: http://sqlfiddle.com/#!2/075f9/1/0
If there are no foreign key constraints on the table you could also use create table as select to create a new table without the duplicate entries, drop the old table, and rename the new table to that of the old table's name, getting what you want in the end. (This would not be the way to go if this table has foreign keys though)
That would look like this:
create table tbl2 as (select distinct writer_name from tbl);
drop table tbl;
alter table tbl2 add column writer_id int not null auto_increment first,
add primary key (writer_id);
rename table tbl2 to tbl;
demo: http://sqlfiddle.com/#!2/8886d/1/0
SELECT a.*
FROM the_table a
INNER JOIN the_table b ON a.field1 = b.field1 AND (etc)
WHERE a.pk != b.pk
hope that query can solve your problem.
DELETE a
FROM tbl a
LEFT JOIN tbl b
ON a.field1 = b.field1 (etc)
WHERE a.id < b.id
this must help you

Best way of deleting SQL rows how i want

How can i construct a SQL query to delete how i want.
I have two tables.
Table 1.
ID: Some Random Not Significant To This Question Columns : DateTime : UserID
Table 2.
ID: Some Random Not Significant To This Question Columns : DateTime : UserID
The two tables are related by DateTime and UserID
Is there anyway i can create a query so that it deletes from table 2 if no rows in table1 have a matching DateTime & UserID.
Thanks
You can use LEFT JOIN :
DELETE table2
FROM table2 t2 LEFT JOIN table1 t1 ON t1.`DateTime` = t2.`DateTime`
AND t1.`UserID` = t2.`UserID`
WHERE t1.`UserID` IS NULL
DELETE
FROM table2 t2
WHERE NOT EXISTS
(
SELECT NULL
FROM table1 t1
WHERE (t1.userId, t1.dateTime) = (t2.userId, t2.dateTime)
)
First of all: create a backup before you delete lots of records :)
The idea:
DELETE FROM
table1
WHERE
NOT EXISTS (SELECT 1 FROM table2 WHERE table1.referenceColumn = table2.referenceColumn)
You can check which records will be deleted by replacing the DELETE with SELECT *
And now the solution
DELETE FROM
table2
WHERE
NOT EXISTS (
SELECT 1 FROM
table1
WHERE
table2.UserID = table1.UserID
AND table2.DateTime = table1.DateTime
)

how to Delete Duplicate Rows but keeping 1 based on two columns

I have table called scheduler. It contains following columns:
ID
sequence_id
schedule_time (timestamp)
processed
source_order
I need to delete duplicate rows from the table but keeping 1 row which has same schedule_time and source_order for a particular sequence_id where processed=0
DELETE yourTable FROM yourTable LEFT OUTER JOIN (
SELECT MIN(ID) AS minID FROM yourTable WHERE processed = 0 GROUP BY schedule_time, source_order
) AS keepRowTable ON yourTable.ID = keepRowTable.minID
WHERE keepRowTable.ID IS NULL AND processed = 0
I apply from this post ;P How can I remove duplicate rows?
Have you seen it?
--fixed version--
DELETE yourTable FROM yourTable LEFT OUTER JOIN (
SELECT MIN(ID) AS minID FROM yourTable WHERE processed = 0 GROUP BY schedule_time, source_order
) AS keepRowTable ON yourTable.ID = keepRowTable.minID
WHERE keepRowTable.minID IS NULL AND processed = 0
For mysql
DELETE a from tbl a , tbl b WHERE a.Id>b.Id and
a.sequence_id= b.sequence_id and a.processed=0;
The fastest way to remove duplicates - is definitely to force them out by adding an index, leaving only one copy of each left in the table:
ALTER IGNORE TABLE dates ADD PRIMARY KEY (
ID
sequence_id
schedule_time
processed
source_order
)
Now if you have a key, you might need to delete it and so on, but the point is that when you add a unique key with IGNORE to a table with duplicates - the bahavior is to delete all the extra records / duplicates. So after you added this key, you now just need to delete it again to be able to make new duplicates :-)
Now if you need to do more complex filtering (on witch one of the duplicates to keep that you can not just include in indexes - although unlikely), you can create a table at the same time as you select and input what you want in it - all in the same query:
CREATE TABLE tmp SELECT ..fields.. GROUP BY ( ..what you need..)
DROP TABLE original_table
ALTER TABLE tmp RENAME TO original_table_name

How to remove duplicate entries from a mysql db?

I have a table with some ids + titles. I want to make the title column unique, but it has over 600k records already, some of which are duplicates (sometimes several dozen times over).
How do I remove all duplicates, except one, so I can add a UNIQUE key to the title column after?
This command adds a unique key, and drops all rows that generate errors (due to the unique key). This removes duplicates.
ALTER IGNORE TABLE table ADD UNIQUE KEY idx1(title);
Edit: Note that this command may not work for InnoDB tables for some versions of MySQL. See this post for a workaround. (Thanks to "an anonymous user" for this information.)
Create a new table with just the distinct rows of the original table. There may be other ways but I find this the cleanest.
CREATE TABLE tmp_table AS SELECT DISTINCT [....] FROM main_table
More specifically:
The faster way is to insert distinct rows into a temporary table. Using delete, it took me a few hours to remove duplicates from a table of 8 million rows. Using insert and distinct, it took just 13 minutes.
CREATE TABLE tempTableName LIKE tableName;
CREATE INDEX ix_all_id ON tableName(cellId,attributeId,entityRowId,value);
INSERT INTO tempTableName(cellId,attributeId,entityRowId,value) SELECT DISTINCT cellId,attributeId,entityRowId,value FROM tableName;
DROP TABLE tableName;
INSERT tableName SELECT * FROM tempTableName;
DROP TABLE tempTableName;
Since the MySql ALTER IGNORE TABLE has been deprecated, you need to actually delete the duplicate date before adding an index.
First write a query that finds all the duplicates. Here I'm assuming that email is the field that contains duplicates.
SELECT
s1.email
s1.id,
s1.created
s2.id,
s2.created
FROM
student AS s1
INNER JOIN
student AS s2
WHERE
/* Emails are the same */
s1.email = s2.email AND
/* DON'T select both accounts,
only select the one created later.
The serial id could also be used here */
s2.created > s1.created
;
Next select only the unique duplicate ids:
SELECT
DISTINCT s2.id
FROM
student AS s1
INNER JOIN
student AS s2
WHERE
s1.email = s2.email AND
s2.created > s1.created
;
Once you are sure that only contains the duplicate ids you want to delete, run the delete. You have to add (SELECT * FROM tblname) so that MySql doesn't complain.
DELETE FROM
student
WHERE
id
IN (
SELECT
DISTINCT s2.id
FROM
(SELECT * FROM student) AS s1
INNER JOIN
(SELECT * FROM student) AS s2
WHERE
s1.email = s2.email AND
s2.created > s1.created
);
Then create the unique index:
ALTER TABLE
student
ADD UNIQUE INDEX
idx_student_unique_email(email)
;
Below query can be used to delete all the duplicate except the one row with lowest "id" field value
DELETE t1 FROM table_name t1, table_name t2 WHERE t1.id > t2.id AND t1.name = t2.name
In the similar way, we can keep the row with the highest value in 'id' as follows
DELETE t1 FROM table_name t1, table_name t2 WHERE t1.id < t2.id AND t1.name = t2.name
This shows how to do it in SQL2000. I'm not completely familiar with MySQL syntax but I'm sure there's something comparable
create table #titles (iid int identity (1, 1), title varchar(200))
-- Repeat this step many times to create duplicates
insert into #titles(title) values ('bob')
insert into #titles(title) values ('bob1')
insert into #titles(title) values ('bob2')
insert into #titles(title) values ('bob3')
insert into #titles(title) values ('bob4')
DELETE T FROM
#titles T left join
(
select title, min(iid) as minid from #titles group by title
) D on T.title = D.title and T.iid = D.minid
WHERE D.minid is null
Select * FROM #titles
delete from student where id in (
SELECT distinct(s1.`student_id`) from student as s1 inner join student as s2
where s1.`sex` = s2.`sex` and
s1.`student_id` > s2.`student_id` and
s1.`sex` = 'M'
ORDER BY `s1`.`student_id` ASC
)
The solution posted by Nitin seems to be the most elegant / logical one.
However it has one issue:
ERROR 1093 (HY000): You can't specify target table 'student' for
update in FROM clause
This can however be resolved by using (SELECT * FROM student) instead of student:
DELETE FROM student WHERE id IN (
SELECT distinct(s1.`student_id`) FROM (SELECT * FROM student) AS s1 INNER JOIN (SELECT * FROM student) AS s2
WHERE s1.`sex` = s2.`sex` AND
s1.`student_id` > s2.`student_id` AND
s1.`sex` = 'M'
ORDER BY `s1`.`student_id` ASC
)
Give your +1's to Nitin for coming up with the original solution.
Deleting duplicates on MySQL tables is a common issue, that usually comes with specific needs. In case anyone is interested, here (Remove duplicate rows in MySQL) I explain how to use a temporary table to delete MySQL duplicates in a reliable and fast way (with examples for different use cases).
In this case, something like this should work:
-- create a new temporary table
CREATE TABLE tmp_table1 LIKE table1;
-- add a unique constraint
ALTER TABLE tmp_table1 ADD UNIQUE(id, title);
-- scan over the table to insert entries
INSERT IGNORE INTO tmp_table1 SELECT * FROM table1 ORDER BY sid;
-- rename tables
RENAME TABLE table1 TO backup_table1, tmp_table1 TO table1;

How to delete duplicate records in mysql database?

What's the best way to delete duplicate records in a mysql database using rails or mysql queries?
What you can do is copy the distinct records into a new table by:
select distinct * into NewTable from MyTable
Here's another idea in no particular language:
rs = `select a, b, count(*) as c from entries group by 1, 2 having c > 1`
rs.each do |a, b, c|
`delete from entries where a=#{a} and b=#{b} limit #{c - 1}`
end
Edit:
Kudos to Olaf for that "having" hint :)
well, if it's a small table, from rails console you can do
class ActiveRecord::Base
def non_id_attributes
atts = self.attributes
atts.delete('id')
atts
end
end
duplicate_groups = YourClass.find(:all).group_by { |element| element.non_id_attributes }.select{ |gr| gr.last.size > 1 }
redundant_elements = duplicate_groups.map { |group| group.last - [group.last.first] }.flatten
redundant_elements.each(&:destroy)
Check for Duplicate entries :
SELECT DISTINCT(req_field) AS field, COUNT(req_field) AS fieldCount FROM
table_name GROUP BY req_field HAVING fieldCount > 1
Remove Duplicate Queries :
DELETE FROM table_name
USING table_name, table_name AS vtable
WHERE
(table_name.id > vtable.id)
AND (table_name.req_field=req_field)
Replace req_field and table_name - should work without any issues.
New to SQL :-)
This is a classic question - often asked in interviews:-)
I don't know whether it'll work in MYSQL but it works in most databases -
> create table t(
> a char(2),
> b char(2),
> c smallint )
> select a,b,c,count(*) from t
> group by a,b,c
> having count(*) > 1
a b c
-- -- ------ -----------
(0 rows affected)
> insert into t values ("aa","bb",1)
(1 row affected)
> insert into t values ("aa","bb",1)
(1 row affected)
> insert into t values ("aa","bc",1)
(1 row affected)
> select a,b,c,count(*) from t group by a,b,c having count(*) > 1
a b c
-- -- ------ -----------
aa bb 1 2
(1 row affected)
If you have PK (id) in table (EMP) and want to older delete duplicate records with name column. For large data following query may be good approach.
DELETE t3
FROM (
SELECT t1.name, t1.id
FROM (
SELECT name
FROM EMP
GROUP BY name
HAVING COUNT(name) > 1
) AS t0 INNER JOIN EMP t1 ON t0.name = t1.name
) AS t2 INNER JOIN EMP t3 ON t3.name = t2.name
WHERE t2.id < t3.id;
suppose we have a table name tbl_product and there is duplicacy in the field p_pi_code and p_nats_id in maximum no of count then
first create a new table insert the data from existing table ...
ie from tbl_product to newtable1 if anything else then newtable1 to newtable2
CREATE TABLE `newtable2` (
`p_id` int(10) unsigned NOT NULL auto_increment,
`p_status` varchar(45) NOT NULL,
`p_pi_code` varchar(45) NOT NULL,
`p_nats_id` mediumint(8) unsigned NOT NULL,
`p_is_special` tinyint(4) NOT NULL,
PRIMARY KEY (`p_id`)
) ENGINE=InnoDB;
INSERT INTO newtable1 (p_status, p_pi_code, p_nats_id, p_is_special) SELECT
p_status, p_pi_code, p_nats_id, p_is_special FROM tbl_product group by p_pi_code;
INSERT INTO newtable2 (p_status, p_pi_code, p_nats_id, p_is_special) SELECT
p_status, p_pi_code, p_nats_id, p_is_special FROM newtable1 group by p_nats_id;
after that we see all the duplicacy in the field is removed
I had to do this recently on Oracle, but the steps would have been the same on MySQL. It was a lot of data, at least compared to what I'm used to working with, so my process to de-dup was comparatively heavyweight. I'm including it here in case someone else comes along with a similar problem.
My duplicate records had different IDs, different updated_at times, possibly different updated_by IDs, but all other columns the same. I wanted to keep the most recently updated of any duplicate set.
I used a combination of Rails logic and SQL to get it done.
Step one: run a rake script to identify the IDs of the duplicate records, using model logic. IDs go in a text file.
Step two: create a temporary table with one column, the IDs to delete, loaded from the text file.
Step three: create another temporary table with all the records I'm going to delete (just in case!).
CREATE TABLE temp_duplicate_models
AS (SELECT * FROM models
WHERE id IN (SELECT * FROM temp_duplicate_ids));
Step four: actual deleting.
DELETE FROM models WHERE id IN (SELECT * FROM temp_duplicate_ids);
You can use:
http://lenniedevilliers.blogspot.com/2008/10/weekly-code-find-duplicates-in-sql.html
to get the duplicates and then just delete them via Ruby code or SQL code (I would do it in SQL code but thats up to you :-)
If your table has a PK (or you can easily give it one), you can specify any number of columns in the table to be equal (to qualify is as a duplicate) with the following query (may be a bit messy looking but it works):
DELETE FROM table WHERE pk_id IN(
SELECT DISTINCT t3.pk_id FROM (
SELECT t1.* FROM table AS t1 INNER JOIN (
SELECT col1, col2, col3, col4, COUNT(*) FROM table
GROUP BY col1, col2, col3, col4 HAVING COUNT(*)>1) AS t2
ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3 AND
t1.col4 = t2.col4)
AS t3, (
SELECT t1.* FROM table AS t1 INNER JOIN (
SELECT col1, col2, col3, col4, COUNT(*) FROM table
GROUP BY col1, col2, col3, col4 HAVING COUNT(*)>1) AS t2
ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3 AND
t1.col4 = t2.col4)
AS t4
WHERE t3.col1 = t4.col1 AND t3.pk_id > t4.pk_id
)
This will leave the first record entered into the database, deleting the 'newest' duplicates. If you want to keep the last record, switch the > to <.
In MySql when I put something like
delete from A where IDA in (select IDA from A )
mySql said something like "you can't use the same table in the select part of the delete operation."
I've just have to delete some duplicate records, and I have succeeded with a .php program like that
<?php
...
$res = hacer_sql("SELECT MIN(IDESTUDIANTE) as IDTODELETE
FROM `estudiante` group by `LASTNAME`,`FIRSTNAME`,`CI`,`PHONE`
HAVING COUNT(*) > 1 )");
while ( $reg = mysql_fetch_assoc($res) ) {
hacer_sql("delete from estudiante where IDESTUDIANTE = {$reg['IDTODELETE']}");
}
?>
I am using Alter Table
ALTER IGNORE TABLE jos_city ADD UNIQUE INDEX(`city`);
I used #krukid's answer above to do the following on a table with around 70,000 entries:
rs = 'select a, b, count(*) as c from table group by 1, 2 having c > 1'
# get a hashmap
dups = MyModel.connection.select_all(rs)
# convert to array
dupsarr = dups.map { |i| [i.a, i.b, i.c] }
# delete dups
dupsarr.each do |a,b,c|
ActiveRecord::Base.connection.execute("delete from table_name where a=#{MyModel.sanitize(a)} and b=#{MyModel.sanitize(b)} limit #{c-1}")
end
Here is the rails solution I came up with. May not be the most efficient, but not a big deal if its a one time migration.
distinct_records = MyTable.all.group(:distinct_column_1, :distinct_column_2).map {|mt| mt.id}
duplicates = MyTable.all.to_a.reject!{|mt| distinct_records.include? mt.id}
duplicates.each(&:destroy)
First, groups by all columns that determine uniqueness, the example shows 2 but you could have more or less
Second, selects the inverse of that group...all other records
Third, Deletes all those records.
Firstly do group by column on which you want to delete duplicate.But I am not doing it with group by.I am writing self join.
You don't need to create the temporary table.
Delete duplicate except one record:
In this table it should have auto increment column.
The possible solution that I've just come across:
DELETE n1 FROM names n1, names n2 WHERE n1.id > n2.id AND n1.name = n2.name
if you want to keep the row with the lowest auto increment id value OR
DELETE n1 FROM names n1, names n2 WHERE n1.id < n2.id AND n1.name = n2.name
if you want to keep the row with the highest auto increment id value.
You can cross check your solution, find duplicate again:
SELECT * FROM `names` GROUP BY name, id having count(name) > 1;
If it return 0 result, then you query is successful.