How to delete entries that share a similar pattern in MySQL

I have a column that may contain entries like this:
abc.yahoo.com
efg.yahoo.com
hij.yahoo.com
I need to delete all the duplicates and LEAVE ONLY ONE, as I don't need the others. Such a command is easy to write if I know the second part (e.g. yahoo.com), but my problem is that this part is not fixed. I may also have entries such as:
abc.msn.com
efg.msn.com
hij.msn.com
And I want to treat all these cases at once. Is this possible?

To delete the duplicates you can use
DELETE t1 FROM your_table t1
LEFT JOIN
(
SELECT MIN(id) AS id
FROM your_table
GROUP BY SUBSTRING_INDEX(REVERSE(col), '.', 2)
) t2 ON t2.id = t1.id
WHERE t2.id IS NULL;
If you need to create a UNIQUE constraint for that, you can do the following:
1. Add another field to hold the domain value
ALTER TABLE your_table ADD COLUMN `domain` VARCHAR(100) NOT NULL DEFAULT '';
2. Update it with the correct values
UPDATE your_table SET domain = REVERSE(SUBSTRING_INDEX(REVERSE(col), '.', 2));
3. Add the unique constraint (the duplicates are already deleted, so a plain ALTER TABLE is enough; ALTER IGNORE was removed in MySQL 5.7)
ALTER TABLE your_table ADD UNIQUE domain (domain);
4. Add BEFORE INSERT and BEFORE UPDATE triggers to set the domain column
DELIMITER $$
-- Each trigger needs its own name, and the table name must not contain a trailing space
CREATE TRIGGER `your_table_bi` BEFORE INSERT ON `your_table` FOR EACH ROW
BEGIN
SET NEW.domain = REVERSE(SUBSTRING_INDEX(REVERSE(NEW.col), '.', 2));
END$$
CREATE TRIGGER `your_table_bu` BEFORE UPDATE ON `your_table` FOR EACH ROW
BEGIN
SET NEW.domain = REVERSE(SUBSTRING_INDEX(REVERSE(NEW.col), '.', 2));
END$$
DELIMITER ;
Note: this assumes the domain is the last two labels when the host is split on '.', so it will not work for a domain such as ebay.co.uk. For that you will probably need a stored function that returns the domain for a given host, and use it in place of REVERSE(SUBSTRING_INDEX(...)).
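To make the ebay.co.uk problem concrete, here is a minimal Python sketch of the logic such a function would need. It is only an illustration, not MySQL code: the names `MULTI_LABEL_SUFFIXES`, `registrable_domain` and `dedupe_by_domain` are made up for this example, and the suffix set is a tiny hand-picked sample (a real version would need the full Public Suffix List).

```python
# Hypothetical, deliberately tiny set of two-label suffixes (assumption for
# this sketch); a real implementation would load the full Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp"}


def registrable_domain(host: str) -> str:
    """Return the part of `host` that identifies the site, e.g. 'yahoo.com'."""
    labels = host.lower().strip(".").split(".")
    # Keep three labels when the last two form a known suffix, otherwise two.
    keep = 3 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


def dedupe_by_domain(hosts):
    """Keep the first host seen for each registrable domain, preserving order."""
    first_seen = {}
    for host in hosts:
        first_seen.setdefault(registrable_domain(host), host)
    return list(first_seen.values())
```

With this, `abc.yahoo.com` and `efg.yahoo.com` collapse into one entry, and so do `a.ebay.co.uk` and `b.ebay.co.uk`, which the "last two labels" rule would have grouped under `co.uk` together with every other .co.uk host.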

This is assuming that you just want to take out the letters before the first . then group on the column:
DELETE a FROM tbl a
LEFT JOIN
(
SELECT MIN(id) AS id
FROM tbl
GROUP BY SUBSTRING(column, LOCATE('.', column))
) b ON a.id = b.id
WHERE b.id IS NULL
Where id is your primary key column name, and column is the column that contains the values to group on.
This will also account for domains like xxx.co.uk where you have two parts at the end.
Make sure you have a backup of your current data or run this operation within a transaction (where you can ROLLBACK; if it didn't fit your needs).
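The "try it inside a transaction first" advice can be rehearsed on throwaway data. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained): SQLite has no multi-table DELETE, so the same "keep MIN(id) per group" rule is written with NOT IN, and `substr`/`instr` play the role of SUBSTRING/LOCATE. The sample hosts are invented.

```python
import sqlite3

# isolation_level=None means autocommit, so the explicit BEGIN / ROLLBACK /
# COMMIT statements below are the only transaction control in play.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, host TEXT)")
conn.executemany(
    "INSERT INTO tbl (host) VALUES (?)",
    [("abc.yahoo.com",), ("efg.yahoo.com",), ("abc.msn.com",),
     ("hij.msn.com",), ("www.ebay.co.uk",)],
)

# Keep the lowest id of every group; the group key is everything from the
# first '.' onward (".yahoo.com", ".msn.com", ".ebay.co.uk").
DEDUPE = """
DELETE FROM tbl
WHERE id NOT IN (
    SELECT MIN(id) FROM tbl
    GROUP BY substr(host, instr(host, '.'))
)
"""


def count_rows():
    return conn.execute("SELECT COUNT(*) FROM tbl").fetchone()[0]


# Dry run: delete, inspect, then undo.
conn.execute("BEGIN")
conn.execute(DEDUPE)
rows_after_delete = count_rows()
conn.execute("ROLLBACK")
rows_after_rollback = count_rows()

# Happy with the result, so run it for real.
conn.execute("BEGIN")
conn.execute(DEDUPE)
conn.execute("COMMIT")
remaining = [row[0] for row in conn.execute("SELECT host FROM tbl ORDER BY id")]
```

In MySQL the same rehearsal only works on a transactional engine such as InnoDB; on MyISAM a ROLLBACK will not bring the deleted rows back, so take the backup instead.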
EDIT: If after deleting the duplicates you want to replace the letters before the first . with *, you can simply use:
UPDATE tbl
SET column = CONCAT('*', SUBSTRING(column, LOCATE('.', column)))

Related

MYSQL Delete rows that has same value in 2 columns [duplicate]

I have a table with the following fields:
id (Unique)
url (Unique)
title
company
site_id
Now, I need to remove rows having same title, company and site_id. One way to do it will be using the following SQL along with a script (PHP):
SELECT title, site_id, location, id, count( * )
FROM jobs
GROUP BY site_id, company, title, location
HAVING count( * ) >1
After running this query, I can remove duplicates using a server side script.
But, I want to know if this can be done only using SQL query.
A really easy way to do this is to add a UNIQUE index on the 3 columns. When you write the ALTER statement, include the IGNORE keyword. Like so:
ALTER IGNORE TABLE jobs
ADD UNIQUE INDEX idx_name (site_id, title, company);
This will drop all the duplicate rows. As an added benefit, future INSERTs that are duplicates will error out. As always, you may want to take a backup before running something like this...
Edit: no longer works in MySQL 5.7+
This feature has been deprecated in MySQL 5.6 and removed in MySQL 5.7, so it doesn't work.
If you don't want to alter the column properties, then you can use the query below.
Since you have a column which has unique IDs (e.g., auto_increment columns), you can use it to remove the duplicates:
DELETE `a`
FROM
`jobs` AS `a`,
`jobs` AS `b`
WHERE
-- IMPORTANT: Ensures one version remains
-- Change "ID" to your unique column's name
`a`.`ID` < `b`.`ID`
-- Any duplicates you want to check for
AND (`a`.`title` = `b`.`title` OR `a`.`title` IS NULL AND `b`.`title` IS NULL)
AND (`a`.`company` = `b`.`company` OR `a`.`company` IS NULL AND `b`.`company` IS NULL)
AND (`a`.`site_id` = `b`.`site_id` OR `a`.`site_id` IS NULL AND `b`.`site_id` IS NULL);
In MySQL, you can simplify it even more with the NULL-safe equal operator (aka "spaceship operator"):
DELETE `a`
FROM
`jobs` AS `a`,
`jobs` AS `b`
WHERE
-- IMPORTANT: Ensures one version remains
-- Change "ID" to your unique column's name
`a`.`ID` < `b`.`ID`
-- Any duplicates you want to check for
AND `a`.`title` <=> `b`.`title`
AND `a`.`company` <=> `b`.`company`
AND `a`.`site_id` <=> `b`.`site_id`;
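Why the NULL handling matters is easy to see on a four-row table. This sketch uses Python's sqlite3 module rather than MySQL (an assumption for the sake of a runnable example); SQLite spells the NULL-safe comparison `IS` where MySQL uses `<=>`. The table and its rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT, company TEXT)")
conn.executemany(
    "INSERT INTO jobs (title, company) VALUES (?, ?)",
    [("dev", "acme"), ("dev", "acme"), ("dev", None), ("dev", None)],
)


def duplicate_ids(op):
    """Ids that have a later row with equal title/company under operator `op`."""
    sql = f"""
        SELECT DISTINCT a.id
        FROM jobs a
        JOIN jobs b
          ON a.id < b.id
         AND a.title {op} b.title
         AND a.company {op} b.company
        ORDER BY a.id
    """
    return [row[0] for row in conn.execute(sql)]


plain_equals = duplicate_ids("=")    # NULL = NULL is not true, so row 3 is missed
null_safe = duplicate_ids("IS")      # SQLite's counterpart of MySQL's <=>
```

Rows 3 and 4 share a NULL company, so only the NULL-safe comparison reports row 3 as a duplicate.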
MySQL has restrictions about referring to the table you are deleting from. You can work around that with a temporary table, like:
create temporary table tmpTable (id int);
insert into tmpTable
(id)
select id
from YourTable yt
where exists
(
select *
from YourTable yt2
where yt2.title = yt.title
and yt2.company = yt.company
and yt2.site_id = yt.site_id
and yt2.id > yt.id
);
delete
from YourTable
where ID in (select id from tmpTable);
From Kostanos' suggestion in the comments:
The only slow query above is DELETE, for cases where you have a very large database. This query could be faster:
DELETE FROM YourTable USING YourTable, tmpTable WHERE YourTable.id=tmpTable.id
Deleting duplicates from MySQL tables is a common problem, generally the result of a missing constraint that would have prevented those duplicates beforehand. But this common problem usually comes with specific needs that require specific approaches. The approach should differ depending on, for example, the size of the data, which duplicated entry should be kept (generally the first or the last one), whether there are indexes to be kept, or whether we want to perform any additional action on the duplicated data.
There are also some MySQL-specific quirks, such as not being able to reference the same table in a FROM clause when performing a table UPDATE (it raises MySQL error #1093). This limitation can be worked around by using an inner query with a temporary table (as suggested in some of the approaches above). But this inner query won't perform especially well when dealing with big data sources.
However, a better approach does exist to remove duplicates, that's both efficient and reliable, and that can be easily adapted to different needs.
The general idea is to create a new temporary table, usually adding a unique constraint to avoid further duplicates, and to INSERT the data from your former table into the new one, while taking care of the duplicates. This approach relies on simple MySQL INSERT queries, creates a new constraint to avoid further duplicates, and skips the need of using an inner query to search for duplicates and a temporary table that should be kept in memory (thus fitting big data sources too).
This is how it can be achieved. Given we have a table employee, with the following columns:
employee (id, first_name, last_name, start_date, ssn)
In order to delete the rows with a duplicate ssn column, and keeping only the first entry found, the following process can be followed:
-- create a new tmp_employee table
CREATE TABLE tmp_employee LIKE employee;
-- add a unique constraint
ALTER TABLE tmp_employee ADD UNIQUE(ssn);
-- scan over the employee table to insert employee entries
INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id;
-- rename tables
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
Technical explanation
Line #1 creates a new tmp_employee table with exactly the same structure as the employee table
Line #2 adds a UNIQUE constraint to the new tmp_employee table to avoid any further duplicates
Line #3 scans over the original employee table by id, inserting new employee entries into the new tmp_employee table, while ignoring duplicated entries
Line #4 renames tables, so that the new employee table holds all the entries without the duplicates, and a backup copy of the former data is kept in the backup_employee table
⇒ Using this approach, 1.6M records were reduced to 6k in less than 200s.
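The four steps can be tried end to end on a toy table. The sketch below runs them through Python's sqlite3 module as a stand-in for MySQL (an assumption made for a self-contained example): SQLite has no CREATE TABLE ... LIKE, so the copy is declared by hand with the UNIQUE constraint included, `INSERT OR IGNORE` plays the role of `INSERT IGNORE`, and the rename takes two statements. The employee rows are invented and the table is trimmed to three columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (id INTEGER PRIMARY KEY, first_name TEXT, ssn TEXT);
INSERT INTO employee (first_name, ssn) VALUES
    ('Ann', '111'), ('Bob', '222'), ('Ann B.', '111'), ('Cy', '333'), ('Bobby', '222');

-- Steps 1 and 2: same columns as employee, plus the UNIQUE constraint on ssn.
CREATE TABLE tmp_employee (id INTEGER PRIMARY KEY, first_name TEXT, ssn TEXT UNIQUE);

-- Step 3: copy in id order; a row whose ssn is already present is skipped.
INSERT OR IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id;

-- Step 4: swap the tables and keep the old data as a backup.
ALTER TABLE employee RENAME TO backup_employee;
ALTER TABLE tmp_employee RENAME TO employee;
""")

kept = conn.execute("SELECT id, first_name, ssn FROM employee ORDER BY id").fetchall()
backup_count = conn.execute("SELECT COUNT(*) FROM backup_employee").fetchone()[0]
```

Because the copy runs in ascending id order, the first row seen for each ssn wins, and the five original rows are still available in backup_employee.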
Chetan, following this process, you could quickly and easily remove all your duplicates and create a UNIQUE constraint by running:
CREATE TABLE tmp_jobs LIKE jobs;
ALTER TABLE tmp_jobs ADD UNIQUE(site_id, title, company);
INSERT IGNORE INTO tmp_jobs SELECT * FROM jobs ORDER BY id;
RENAME TABLE jobs TO backup_jobs, tmp_jobs TO jobs;
Of course, this process can be further modified to adapt it for different needs when deleting duplicates. Some examples follow.
✔ Variation for keeping the last entry instead of the first one
Sometimes we need to keep the last duplicated entry instead of the first one.
CREATE TABLE tmp_employee LIKE employee;
ALTER TABLE tmp_employee ADD UNIQUE(ssn);
INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id DESC;
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
On line #3, the ORDER BY id DESC clause gives the highest ids priority over the rest
✔ Variation for performing some tasks on the duplicates, for example keeping a count on the duplicates found
Sometimes we need to perform some further processing on the duplicated entries that are found (such as keeping a count of the duplicates).
CREATE TABLE tmp_employee LIKE employee;
ALTER TABLE tmp_employee ADD UNIQUE(ssn);
ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;
INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn) SELECT id, first_name, last_name, start_date, ssn FROM employee ORDER BY id ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
On line #3, a new column n_duplicates is created
On line #4, the INSERT INTO ... ON DUPLICATE KEY UPDATE query is used to perform an additional update when a duplicate is found (in this case, increasing a counter)
The INSERT INTO ... ON DUPLICATE KEY UPDATE query can be used to perform different types of updates for the duplicates found.
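The counting variant can be checked on a few rows as well. This sketch again uses Python's sqlite3 module in place of MySQL (an assumption for a runnable example); SQLite's `ON CONFLICT ... DO UPDATE` clause is the closest counterpart of `ON DUPLICATE KEY UPDATE`, and it needs the otherwise pointless `WHERE true` so that the parser can tell the upsert clause apart from a join condition. It requires SQLite 3.24 or later. The rows are invented and the table is trimmed to three columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (id INTEGER PRIMARY KEY, first_name TEXT, ssn TEXT);
INSERT INTO employee (first_name, ssn) VALUES
    ('Ann', '111'), ('Bob', '222'), ('Ann B.', '111'), ('Ann C.', '111');

CREATE TABLE tmp_employee (
    id INTEGER PRIMARY KEY,
    first_name TEXT,
    ssn TEXT UNIQUE,
    n_duplicates INTEGER DEFAULT 0
);

-- The column list is explicit because tmp_employee has one more column than
-- employee. On an ssn clash the existing row is kept and its counter bumped.
INSERT INTO tmp_employee (id, first_name, ssn)
SELECT id, first_name, ssn FROM employee WHERE true ORDER BY id
ON CONFLICT(ssn) DO UPDATE SET n_duplicates = n_duplicates + 1;
""")

counted = conn.execute(
    "SELECT id, first_name, ssn, n_duplicates FROM tmp_employee ORDER BY id"
).fetchall()
```

Ann's ssn appears three times in the source, so her surviving row ends up with a counter of 2, while Bob's stays at 0.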
✔ Variation for regenerating the auto-incremental field id
Sometimes we use an auto-incremental field and, in order to keep the index as compact as possible, we can take advantage of the deletion of the duplicates to regenerate the auto-incremental field in the new temporary table.
CREATE TABLE tmp_employee LIKE employee;
ALTER TABLE tmp_employee ADD UNIQUE(ssn);
INSERT IGNORE INTO tmp_employee (first_name, last_name, start_date, ssn) SELECT first_name, last_name, start_date, ssn FROM employee ORDER BY id;
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
On line #3, instead of selecting all the fields on the table, the id field is skipped so that the DB engine generates a new one automatically
✔ Further variations
Many further modifications are also doable depending on the desired behavior. As an example, the following queries will use a second temporary table to, besides 1) keep the last entry instead of the first one; and 2) increase a counter on the duplicates found; also 3) regenerate the auto-incremental field id while keeping the entry order as it was on the former data.
CREATE TABLE tmp_employee LIKE employee;
ALTER TABLE tmp_employee ADD UNIQUE(ssn);
ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;
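INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn) SELECT id, first_name, last_name, start_date, ssn FROM employee ORDER BY id DESC ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;
CREATE TABLE tmp_employee2 LIKE tmp_employee;
INSERT INTO tmp_employee2 (first_name, last_name, start_date, ssn, n_duplicates) SELECT first_name, last_name, start_date, ssn, n_duplicates FROM tmp_employee ORDER BY id;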
DROP TABLE tmp_employee;
RENAME TABLE employee TO backup_employee, tmp_employee2 TO employee;
If the IGNORE statement won't work like in my case, you can use the below statement:
CREATE TABLE your_table_deduped LIKE your_table;
INSERT your_table_deduped
SELECT *
FROM your_table
GROUP BY index1_id,
index2_id;
RENAME TABLE your_table TO your_table_with_dupes;
RENAME TABLE your_table_deduped TO your_table;
#OPTIONAL
ALTER TABLE `your_table` ADD UNIQUE `unique_index` (`index1_id`, `index2_id`);
#OPTIONAL
DROP TABLE your_table_with_dupes;
There is another solution :
DELETE t1 FROM my_table t1, my_table t2 WHERE t1.id < t2.id AND t1.my_field = t2.my_field AND t1.my_field_2 = t2.my_field_2 AND ...
A solution that is simple to understand and works with no primary key:
add a new boolean column
alter table mytable add tokeep boolean;
add a constraint on the duplicated columns AND the new column
alter table mytable add constraint preventdupe unique (mycol1, mycol2, tokeep);
set the boolean column to true. This will succeed only on one of the duplicated rows because of the new constraint
update ignore mytable set tokeep = true;
delete rows that have not been marked as tokeep
delete from mytable where tokeep is null;
drop the added column
alter table mytable drop tokeep;
I suggest that you keep the constraint you added, so that new duplicates are prevented in the future.
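The reason this trick works is that NULLs never collide in a unique index, while the value true does. The sketch below walks through the same steps with Python's sqlite3 module as a stand-in for MySQL (an assumption for a runnable example): `UPDATE OR IGNORE` is SQLite's spelling of `UPDATE IGNORE`, and the unique constraint is created as a unique index. The final "drop the added column" step is left out because SQLite cannot drop a column that is part of an index. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (mycol1 TEXT, mycol2 TEXT);   -- note: no primary key
INSERT INTO mytable VALUES
    ('a', 'x'), ('a', 'x'), ('a', 'x'), ('b', 'y'), ('c', 'z'), ('c', 'z');

-- The new column starts out NULL on every row.
ALTER TABLE mytable ADD COLUMN tokeep BOOLEAN;

-- NULLs count as distinct, so this succeeds despite the duplicates.
CREATE UNIQUE INDEX preventdupe ON mytable (mycol1, mycol2, tokeep);

-- Only one row per (mycol1, mycol2) can take the value 1; the others would
-- violate the index, so they are skipped and stay NULL.
UPDATE OR IGNORE mytable SET tokeep = 1;

DELETE FROM mytable WHERE tokeep IS NULL;
""")

rows = conn.execute(
    "SELECT mycol1, mycol2 FROM mytable ORDER BY mycol1, mycol2"
).fetchall()
```

Which of the identical rows survives is up to the engine, which does not matter here because the rows are indistinguishable on the deduplicated columns.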
This will delete the duplicate rows with same values for title, company and site. The last occurrence will be kept and the remaining duplicates will be deleted (if you want to keep the first occurrence and delete the others, change the comparison on id to be greater than e.g. t1.id > t2.id)
DELETE t1 FROM tablename t1
INNER JOIN tablename t2
WHERE
t1.id < t2.id AND
t1.title = t2.title AND
t1.company=t2.company AND
t1.site_ID=t2.site_ID;
If you have a large table with a huge number of records, the solutions above will either not work or take too long. In that case, there is a different solution:
-- Create temporary table
CREATE TABLE temp_table LIKE table1;
-- Add constraint
ALTER TABLE temp_table ADD UNIQUE(title, company,site_id);
-- Copy data
INSERT IGNORE INTO temp_table SELECT * FROM table1;
-- Rename and drop
RENAME TABLE table1 TO old_table1, temp_table TO table1;
DROP TABLE old_table1;
I have this query snippet for SQL Server, but I think it can be used in other DBMSs with small changes:
DELETE
FROM Table
WHERE Table.idTable IN (
SELECT MAX(idTable)
FROM Table
GROUP BY field1, field2, field3
HAVING COUNT(*) > 1)
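I forgot to mention that this query never removes the row with the lowest id in a group of duplicates: it deletes only the row with the highest id, so a group of three or more duplicates needs more than one run. If this works for you, try this query: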
DELETE
FROM jobs
WHERE jobs.id IN (
SELECT MAX(id)
FROM jobs
GROUP BY site_id, company, title, location
HAVING COUNT(*) > 1)
I found a simple way. (keep latest)
DELETE t1 FROM table_name t1 INNER JOIN table_name t2
WHERE t1.primary_id < t2.primary_id
AND t1.check_duplicate_col_1 = t2.check_duplicate_col_1
AND t1.check_duplicate_col_2 = t2.check_duplicate_col_2
...
Simple and fast for all cases:
CREATE TEMPORARY TABLE IF NOT EXISTS _temp_duplicates AS (SELECT dub.id FROM table_with_duplications dub GROUP BY dub.field_must_be_uniq_1, dub.field_must_be_uniq_2 HAVING COUNT(*) > 1);
DELETE FROM table_with_duplications WHERE id IN (SELECT id FROM _temp_duplicates);
The faster way is to insert distinct rows into a temporary table. Using delete, it took me a few hours to remove duplicates from a table of 8 million rows. Using insert and distinct, it took just 13 minutes.
CREATE TABLE tempTableName LIKE tableName;
CREATE INDEX ix_all_id ON tableName(cellId,attributeId,entityRowId,value);
INSERT INTO tempTableName(cellId,attributeId,entityRowId,value) SELECT DISTINCT cellId,attributeId,entityRowId,value FROM tableName;
TRUNCATE TABLE tableName;
INSERT INTO tableName SELECT * FROM tempTableName;
DROP TABLE tempTableName;
Delete duplicate rows using DELETE JOIN statement
MySQL provides you with the DELETE JOIN statement that you can use to remove duplicate rows quickly.
The following statement deletes duplicate rows and keeps the highest id:
DELETE t1 FROM contacts t1
INNER JOIN
contacts t2 WHERE
t1.id < t2.id AND t1.email = t2.email;
As of version 8.0 (2018), MySQL finally supports window functions.
Window functions are both handy and efficient. Here is a solution that demonstrates how to use them to solve this assignment.
In a subquery, we can use ROW_NUMBER() to assign a position to each record in the table within column1/column2 groups, ordered by id. If there are no duplicates, the record gets row number 1. If duplicates exist, they are numbered by ascending id (starting at 1).
Once records are properly numbered in the subquery, the outer query just deletes all records whose row number is not 1.
Query :
DELETE FROM tablename
WHERE id IN (
SELECT id
FROM (
SELECT
id,
ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id) rn
FROM tablename
) t
WHERE rn > 1
)
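To see the numbering and the delete together, here is a small sketch run through Python's sqlite3 module instead of MySQL 8 (an assumption made so the example is self-contained; SQLite has supported ROW_NUMBER() since 3.25). The table contents are invented; the query itself has the same shape as the one above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablename (id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT);
INSERT INTO tablename (column1, column2) VALUES
    ('a', 'x'), ('a', 'x'), ('b', 'y'), ('a', 'x'), ('b', 'z');
""")

# Position of every row inside its (column1, column2) group, ordered by id.
numbered = conn.execute("""
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rn
    FROM tablename
    ORDER BY id
""").fetchall()

# Delete everything that is not the first row of its group.
conn.execute("""
    DELETE FROM tablename
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rn
            FROM tablename
        ) AS t
        WHERE rn > 1
    )
""")

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM tablename ORDER BY id")]
```

Rows 2 and 4 repeat the ('a', 'x') pair of row 1, so they get row numbers 2 and 3 and are the only ones removed. Changing ORDER BY id to ORDER BY id DESC inside OVER() would keep the latest row of each group instead.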
I keep visiting this page any time I google "remove duplicates from mysql", but for me the IGNORE solutions don't work because my MySQL tables are InnoDB.
This code works reliably:
CREATE TABLE tableToclean_temp LIKE tableToclean;
ALTER TABLE tableToclean_temp ADD UNIQUE INDEX (fontsinuse_id);
INSERT IGNORE INTO tableToclean_temp SELECT * FROM tableToclean;
DROP TABLE tableToclean;
RENAME TABLE tableToclean_temp TO tableToclean;
tableToclean = the name of the table you need to clean
tableToclean_temp = a temporary table created and deleted
This solution will move the duplicates into one table and the uniques into another.
-- speed up creating uniques table if dealing with many rows
CREATE INDEX temp_idx ON jobs(site_id, company, title, location);
-- create the table with unique rows
INSERT jobs_uniques SELECT * FROM
(
SELECT *
FROM jobs
GROUP BY site_id, company, title, location
HAVING count(1) > 1
UNION
SELECT *
FROM jobs
GROUP BY site_id, company, title, location
HAVING count(1) = 1
) x;
-- create the table with duplicate rows
INSERT jobs_dupes
SELECT *
FROM jobs
WHERE id NOT IN
(SELECT id FROM jobs_uniques);
-- confirm the difference between uniques and dupes tables
SELECT COUNT(1)
AS jobs,
(SELECT COUNT(1) FROM jobs_dupes) + (SELECT COUNT(1) FROM jobs_uniques)
AS sum
FROM jobs;
Delete duplicate rows with the DELETE JOIN statement:
DELETE t1 FROM table_name t1
JOIN table_name t2
WHERE
t1.id < t2.id AND
t1.title = t2.title AND t1.company = t2.company AND t1.site_id = t2.site_id;
To delete the duplicate records in a table (note that rowid is an Oracle pseudocolumn; in MySQL, use the table's primary key in its place):
delete from job s
where rowid < any
(select rowid from job k
where s.site_id = k.site_id and
s.title = k.title and
s.company = k.company);
or
delete from job s
where rowid not in
(select max(rowid) from job k
where s.site_id = k.site_id and
s.title = k.title and
s.company = k.company);
Here is what I used, and it works (t_id is the column that should be unique):
create table temp_table like my_table;
insert into temp_table (id) select id from my_table GROUP by t_id;
delete from my_table where id not in (select id from temp_table);
drop table temp_table;
To remove duplicate records so that the combination of COLUMN1, COLUMN2 and COLUMN3 is unique (suppose the three-column unique constraint was left out of the table structure and duplicate entries have since been inserted):
DROP TABLE IF EXISTS TABLE_NAME_copy;
CREATE TABLE TABLE_NAME_copy LIKE TABLE_NAME;
INSERT INTO TABLE_NAME_copy
SELECT * FROM TABLE_NAME
GROUP BY COLUMN1, COLUMN2, COLUMN3;
DROP TABLE TABLE_NAME;
ALTER TABLE TABLE_NAME_copy RENAME TO TABLE_NAME;
Hope this helps.
I had a table where I forgot to define a primary key on the id column, although id was auto_increment. One day someone replayed the MySQL binary log against the database, which inserted some duplicate rows.
I removed the duplicate rows as follows:
1. Select one copy of each duplicated row and export the result:
select T1.* from table_name T1 inner join (select count(*) as c,id from table_name group by id) T2 on T1.id = T2.id where T2.c > 1 group by T1.id;
2. Delete the duplicated rows by id.
3. Insert the rows from the exported data.
4. Add the primary key on id.
This is perfect if you are trying to delete one of the duplicates and leave the other. Note that without subqueries you would get a #1093 error.
DELETE FROM table_name
WHERE id IN (
SELECT * FROM (SELECT n.id FROM table_name n
WHERE n.column2 != "value"
GROUP BY n.column HAVING COUNT(n.column ) > 1) x
)
I like to be a bit more specific as to which records I delete so here is my solution:
delete
from jobs c1
where not c1.location = 'Paris'
and c1.site_id > 64218
and exists
(
select * from jobs c2
where c2.site_id = c1.site_id
and c2.company = c1.company
and c2.location = c1.location
and c2.title = c1.title
and c2.site_id > 63412
and c2.site_id < 64219
)
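You can easily delete the duplicate records with this code:
// Assumes $db is an open mysqli connection (the mysql_* functions were removed in PHP 7)
// and that town_id is the key column of cities2.
$select = $db->prepare("SELECT town_id FROM cities2 WHERE city = ? ORDER BY town_id");
$delete = $db->prepare("DELETE FROM cities2 WHERE town_id = ?");
$qry = $db->query("SELECT city FROM cities");
while ($qry_row = $qry->fetch_assoc())
{
    $city = $qry_row['city'];
    $select->bind_param("s", $city);
    $select->execute();
    $result = $select->get_result();
    // Reset the list for every city; otherwise ids from earlier cities pile up
    $town_ids = array();
    while ($row = $result->fetch_row()) {
        $town_ids[] = $row[0];
    }
    // Keep the first row of each city and delete the rest
    for ($i = 1; $i < count($town_ids); $i++) {
        $town_id = $town_ids[$i];
        $delete->bind_param("s", $town_id);
        $delete->execute();
    }
}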
I had to do this with text fields and came across the limit of 100 bytes on the index.
I solved this by adding a column, storing an MD5 hash of the fields in it, and then doing the alter.
ALTER TABLE `table` ADD `merged` VARCHAR( 40 ) NOT NULL ;
UPDATE `table` SET `merged` = MD5(CONCAT(`col1`, `col2`, `col3`));
ALTER IGNORE TABLE `table` ADD UNIQUE INDEX idx_name (`merged`);

MySQL query works fine with Select but Delete query hangs indefinitely based on the position of GROUP BY [duplicate]

I have a table with the following fields:
id (Unique)
url (Unique)
title
company
site_id
Now, I need to remove rows having same title, company and site_id. One way to do it will be using the following SQL along with a script (PHP):
SELECT title, site_id, location, id, count( * )
FROM jobs
GROUP BY site_id, company, title, location
HAVING count( * ) >1
After running this query, I can remove duplicates using a server side script.
But, I want to know if this can be done only using SQL query.
A really easy way to do this is to add a UNIQUE index on the 3 columns. When you write the ALTER statement, include the IGNORE keyword. Like so:
ALTER IGNORE TABLE jobs
ADD UNIQUE INDEX idx_name (site_id, title, company);
This will drop all the duplicate rows. As an added benefit, future INSERTs that are duplicates will error out. As always, you may want to take a backup before running something like this...
Edit: no longer works in MySQL 5.7+
This feature has been deprecated in MySQL 5.6 and removed in MySQL 5.7, so it doesn't work.
If you don't want to alter the column properties, then you can use the query below.
Since you have a column which has unique IDs (e.g., auto_increment columns), you can use it to remove the duplicates:
DELETE `a`
FROM
`jobs` AS `a`,
`jobs` AS `b`
WHERE
-- IMPORTANT: Ensures one version remains
-- Change "ID" to your unique column's name
`a`.`ID` < `b`.`ID`
-- Any duplicates you want to check for
AND (`a`.`title` = `b`.`title` OR `a`.`title` IS NULL AND `b`.`title` IS NULL)
AND (`a`.`company` = `b`.`company` OR `a`.`company` IS NULL AND `b`.`company` IS NULL)
AND (`a`.`site_id` = `b`.`site_id` OR `a`.`site_id` IS NULL AND `b`.`site_id` IS NULL);
In MySQL, you can simplify it even more with the NULL-safe equal operator (aka "spaceship operator"):
DELETE `a`
FROM
`jobs` AS `a`,
`jobs` AS `b`
WHERE
-- IMPORTANT: Ensures one version remains
-- Change "ID" to your unique column's name
`a`.`ID` < `b`.`ID`
-- Any duplicates you want to check for
AND `a`.`title` <=> `b`.`title`
AND `a`.`company` <=> `b`.`company`
AND `a`.`site_id` <=> `b`.`site_id`;
MySQL has restrictions about referring to the table you are deleting from. You can work around that with a temporary table, like:
create temporary table tmpTable (id int);
insert into tmpTable
(id)
select id
from YourTable yt
where exists
(
select *
from YourTabe yt2
where yt2.title = yt.title
and yt2.company = yt.company
and yt2.site_id = yt.site_id
and yt2.id > yt.id
);
delete
from YourTable
where ID in (select id from tmpTable);
From Kostanos' suggestion in the comments:
The only slow query above is DELETE, for cases where you have a very large database. This query could be faster:
DELETE FROM YourTable USING YourTable, tmpTable WHERE YourTable.id=tmpTable.id
Deleting duplicates on MySQL tables is a common issue, that's genarally the result of a missing constraint to avoid those duplicates before hand. But this common issue usually comes with specific needs... that do require specific approaches. The approach should be different depending on, for example, the size of the data, the duplicated entry that should be kept (generally the first or the last one), whether there are indexes to be kept, or whether we want to perform any additional action on the duplicated data.
There are also some specificities on MySQL itself, such as not being able to reference the same table on a FROM cause when performing a table UPDATE (it'll raise MySQL error #1093). This limitation can be overcome by using an inner query with a temporary table (as suggested on some approaches above). But this inner query won't perform specially well when dealing with big data sources.
However, a better approach does exist to remove duplicates, that's both efficient and reliable, and that can be easily adapted to different needs.
The general idea is to create a new temporary table, usually adding a unique constraint to avoid further duplicates, and to INSERT the data from your former table into the new one, while taking care of the duplicates. This approach relies on simple MySQL INSERT queries, creates a new constraint to avoid further duplicates, and skips the need of using an inner query to search for duplicates and a temporary table that should be kept in memory (thus fitting big data sources too).
This is how it can be achieved. Given we have a table employee, with the following columns:
employee (id, first_name, last_name, start_date, ssn)
In order to delete the rows with a duplicate ssn column, and keeping only the first entry found, the following process can be followed:
-- create a new tmp_eployee table
CREATE TABLE tmp_employee LIKE employee;
-- add a unique constraint
ALTER TABLE tmp_employee ADD UNIQUE(ssn);
-- scan over the employee table to insert employee entries
INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id;
-- rename tables
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
Technical explanation
Line #1 creates a new tmp_eployee table with exactly the same structure as the employee table
Line #2 adds a UNIQUE constraint to the new tmp_eployee table to avoid any further duplicates
Line #3 scans over the original employee table by id, inserting new employee entries into the new tmp_eployee table, while ignoring duplicated entries
Line #4 renames tables, so that the new employee table holds all the entries without the duplicates, and a backup copy of the former data is kept on the backup_employee table
⇒ Using this approach, 1.6M registers were converted into 6k in less than 200s.
Chetan, following this process, you could fast and easily remove all your duplicates and create a UNIQUE constraint by running:
CREATE TABLE tmp_jobs LIKE jobs;
ALTER TABLE tmp_jobs ADD UNIQUE(site_id, title, company);
INSERT IGNORE INTO tmp_jobs SELECT * FROM jobs ORDER BY id;
RENAME TABLE jobs TO backup_jobs, tmp_jobs TO jobs;
Of course, this process can be further modified to adapt it for different needs when deleting duplicates. Some examples follow.
✔ Variation for keeping the last entry instead of the first one
Sometimes we need to keep the last duplicated entry instead of the first one.
CREATE TABLE tmp_employee LIKE employee;
ALTER TABLE tmp_employee ADD UNIQUE(ssn);
INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id DESC;
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
On line #3, the ORDER BY id DESC clause makes the last ID's to get priority over the rest
✔ Variation for performing some tasks on the duplicates, for example keeping a count on the duplicates found
Sometimes we need to perform some further processing on the duplicated entries that are found (such as keeping a count of the duplicates).
CREATE TABLE tmp_employee LIKE employee;
ALTER TABLE tmp_employee ADD UNIQUE(ssn);
ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;
INSERT INTO tmp_employee SELECT * FROM employee ORDER BY id ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
On line #3, a new column n_duplicates is created
On line #4, the INSERT INTO ... ON DUPLICATE KEY UPDATE query is used to perform an additional update when a duplicate is found (in this case, increasing a counter)
The INSERT INTO ... ON DUPLICATE KEY UPDATE query can be used to perform different types of updates for the duplicates found.
✔ Variation for regenerating the auto-incremental field id
Sometimes we use an auto-incremental field and, in order the keep the index as compact as possible, we can take advantage of the deletion of the duplicates to regenerate the auto-incremental field in the new temporary table.
CREATE TABLE tmp_employee LIKE employee;
ALTER TABLE tmp_employee ADD UNIQUE(ssn);
INSERT IGNORE INTO tmp_employee SELECT (first_name, last_name, start_date, ssn) FROM employee ORDER BY id;
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
On line #3, instead of selecting all the fields on the table, the id field is skipped so that the DB engine generates a new one automatically
✔ Further variations
Many further modifications are also doable depending on the desired behavior. As an example, the following queries will use a second temporary table to, besides 1) keep the last entry instead of the first one; and 2) increase a counter on the duplicates found; also 3) regenerate the auto-incremental field id while keeping the entry order as it was on the former data.
CREATE TABLE tmp_employee LIKE employee;
ALTER TABLE tmp_employee ADD UNIQUE(ssn);
ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;
INSERT INTO tmp_employee SELECT * FROM employee ORDER BY id DESC ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;
CREATE TABLE tmp_employee2 LIKE tmp_employee;
INSERT INTO tmp_employee2 SELECT (first_name, last_name, start_date, ssn) FROM tmp_employee ORDER BY id;
DROP TABLE tmp_employee;
RENAME TABLE employee TO backup_employee, tmp_employee2 TO employee;
If the IGNORE statement won't work like in my case, you can use the below statement:
CREATE TABLE your_table_deduped LIKE your_table;
INSERT your_table_deduped
SELECT *
FROM your_table
GROUP BY index1_id,
index2_id;
RENAME TABLE your_table TO your_table_with_dupes;
RENAME TABLE your_table_deduped TO your_table;
#OPTIONAL
ALTER TABLE `your_table` ADD UNIQUE `unique_index` (`index1_id`, `index2_id`);
#OPTIONAL
DROP TABLE your_table_with_dupes;
There is another solution :
DELETE t1 FROM my_table t1, my_table t2 WHERE t1.id < t2.id AND t1.my_field = t2.my_field AND t1.my_field_2 = t2.my_field_2 AND ...
A solution that is simple to understand and works with no primary key:
add a new boolean column
alter table mytable add tokeep boolean;
add a constraint on the duplicated columns AND the new column
alter table mytable add constraint preventdupe unique (mycol1, mycol2, tokeep);
set the boolean column to true. This will succeed only on one of the duplicated rows because of the new constraint
update ignore mytable set tokeep = true;
delete rows that have not been marked as tokeep
delete from mytable where tokeep is null;
drop the added column
alter table mytable drop tokeep;
I suggest that you keep the constraint you added, so that new duplicates are prevented in the future.
This will delete the duplicate rows with same values for title, company and site. The last occurrence will be kept and the remaining duplicates will be deleted (if you want to keep the first occurrence and delete the others, change the comparison on id to be greater than e.g. t1.id > t2.id)
DELETE t1 FROM tablename t1
INNER JOIN tablename t2
WHERE
t1.id < t2.id AND
t1.title = t2.title AND
t1.company=t2.company AND
t1.site_ID=t2.site_ID;
if you have a large table with huge number of records then above solutions will not work or take too much time. Then we have a different solution
-- Create temporary table
CREATE TABLE temp_table LIKE table1;
-- Add constraint
ALTER TABLE temp_table ADD UNIQUE(title, company,site_id);
-- Copy data
INSERT IGNORE INTO temp_table SELECT * FROM table1;
-- Rename and drop
RENAME TABLE table1 TO old_table1, temp_table TO table1;
DROP TABLE old_table1;
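Before the final DROP TABLE it may be worth checking that the copy holds what you expect. The following is only a sketch, assuming the table1 and temp_table names used above, and it is meant to be run after the INSERT IGNORE and before the rename:

```sql
-- Number of distinct (title, company, site_id) combinations in the original table.
SELECT COUNT(*) AS expected_rows
FROM (
    SELECT 1
    FROM table1
    GROUP BY title, company, site_id
) AS g;

-- Number of rows that INSERT IGNORE actually copied.
SELECT COUNT(*) AS copied_rows
FROM temp_table;
```

The two numbers should match when none of the three columns contains NULL and the table has no other unique keys. A UNIQUE index allows several rows with NULL in a key column, while GROUP BY puts all NULLs into one group, so with NULLs present copied_rows can be larger than expected_rows.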
I have this query snippet for SQL Server, but I think it can be used in other DBMSs with small changes:
DELETE
FROM Table
WHERE Table.idTable IN (
SELECT MAX(idTable)
FROM Table
GROUP BY field1, field2, field3
HAVING COUNT(*) > 1)
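I forgot to mention that this query never removes the row with the lowest id of the duplicated rows. It deletes only the row with the highest id in each group per run, so groups with more than two copies need several runs (repeat until no rows are deleted). In MySQL the subquery also has to be wrapped in a derived table to avoid error #1093. If this works for you, try this query:
DELETE
FROM jobs
WHERE jobs.id IN (
SELECT id FROM (
SELECT MAX(id) AS id
FROM jobs
GROUP BY site_id, company, title, location
HAVING COUNT(*) > 1) AS dup)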
I found a simple way. (keep latest)
DELETE t1 FROM table_name t1 INNER JOIN table_name t2
WHERE t1.primary_id < t2.primary_id
AND t1.check_duplicate_col_1 = t2.check_duplicate_col_1
AND t1.check_duplicate_col_2 = t2.check_duplicate_col_2
...
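Simple and fast. Each run removes one row (the one with the lowest id) from every group that still contains duplicates, so drop the temporary table and repeat until no rows are deleted:
CREATE TEMPORARY TABLE IF NOT EXISTS _temp_duplicates AS (SELECT MIN(dub.id) AS id FROM table_with_duplications dub GROUP BY dub.field_must_be_uniq_1, dub.field_must_be_uniq_2 HAVING COUNT(*) > 1);
DELETE FROM table_with_duplications WHERE id IN (SELECT id FROM _temp_duplicates);
DROP TEMPORARY TABLE _temp_duplicates;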
The faster way is to insert distinct rows into a temporary table. Using delete, it took me a few hours to remove duplicates from a table of 8 million rows. Using insert and distinct, it took just 13 minutes.
CREATE TABLE tempTableName LIKE tableName;
CREATE INDEX ix_all_id ON tableName(cellId,attributeId,entityRowId,value);
INSERT INTO tempTableName(cellId,attributeId,entityRowId,value) SELECT DISTINCT cellId,attributeId,entityRowId,value FROM tableName;
TRUNCATE TABLE tableName;
INSERT INTO tableName SELECT * FROM tempTableName;
DROP TABLE tempTableName;
Delete duplicate rows using DELETE JOIN statement
MySQL provides you with the DELETE JOIN statement that you can use to remove duplicate rows quickly.
The following statement deletes duplicate rows and keeps the highest id:
DELETE t1 FROM contacts t1
INNER JOIN
contacts t2 WHERE
t1.id < t2.id AND t1.email = t2.email;
As of version 8.0 (2018), MySQL finally supports window functions.
Window functions are both handy and efficient. Here is a solution that demonstrates how to use them to solve this assignment.
In a subquery, we can use ROW_NUMBER() to assign a position to each record in the table within column1/column2 groups, ordered by id. If there are no duplicates, the record gets row number 1. If duplicates exist, they are numbered by ascending id (starting at 1).
Once records are properly numbered in the subquery, the outer query just deletes all records whose row number is not 1.
Query :
DELETE FROM tablename
WHERE id IN (
SELECT id
FROM (
SELECT
id,
ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id) rn
FROM tablename
) t
WHERE rn > 1
)
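Since a DELETE cannot be undone outside a transaction, it can help to look at the affected rows first. The sketch below reuses the same numbering as a plain SELECT. It assumes MySQL 8.0 or later and the placeholder names tablename, column1 and column2 from the answer, with the subquery reading from the same table that is being cleaned:

```sql
-- Preview: every row except the lowest id in each (column1, column2) group.
SELECT t.*
FROM (
    SELECT
        x.*,
        ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rn
    FROM tablename AS x
) AS t
WHERE t.rn > 1
ORDER BY column1, column2, id;
```

If the listed rows are the ones you expect to lose, the DELETE with the same PARTITION BY and ORDER BY clauses removes exactly those ids.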
I keep visiting this page any time I google "remove duplicates from mysql", but for me the IGNORE solutions don't work because I have InnoDB tables.
This code works reliably:
CREATE TABLE tableToclean_temp LIKE tableToclean;
ALTER TABLE tableToclean_temp ADD UNIQUE INDEX (fontsinuse_id);
INSERT IGNORE INTO tableToclean_temp SELECT * FROM tableToclean;
DROP TABLE tableToclean;
RENAME TABLE tableToclean_temp TO tableToclean;
tableToclean = the name of the table you need to clean
tableToclean_temp = a temporary table created and deleted
This solution will move the duplicates into one table and the uniques into another.
-- speed up creating uniques table if dealing with many rows
CREATE INDEX temp_idx ON jobs(site_id, company, title, location);
-- create the table with unique rows
INSERT jobs_uniques SELECT * FROM
(
SELECT *
FROM jobs
GROUP BY site_id, company, title, location
HAVING count(1) > 1
UNION
SELECT *
FROM jobs
GROUP BY site_id, company, title, location
HAVING count(1) = 1
) x;
-- create the table with duplicate rows
INSERT jobs_dupes
SELECT *
FROM jobs
WHERE id NOT IN
(SELECT id FROM jobs_uniques);
-- confirm the difference between uniques and dupes tables
SELECT COUNT(1)
AS jobs,
(SELECT COUNT(1) FROM jobs_dupes) + (SELECT COUNT(1) FROM jobs_uniques)
AS sum
FROM jobs;
Delete duplicate rows with the DELETE JOIN statement:
DELETE t1 FROM table_name t1
JOIN table_name t2
WHERE
t1.id < t2.id AND
t1.title = t2.title AND t1.company = t2.company AND t1.site_id = t2.site_id;
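To delete the duplicate records in a table. Note that this is Oracle syntax: rowid is an Oracle pseudocolumn, and MySQL has no rowid, so there you would compare the primary key instead.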
delete from job s
where rowid < any
(select rowid from job k
where s.site_id = k.site_id and
s.title = k.title and
s.company = k.company);
or
delete from job s
where rowid not in
(select max(rowid) from job k
where s.site_id = k.site_id and
s.title = k.title and
s.company = k.company);
Here is what I used, and it works (t_id is the column that should be unique):
create table temp_table like my_table;
insert into temp_table (id) select min(id) from my_table group by t_id;
delete from my_table where id not in (select id from temp_table);
drop table temp_table;
To remove duplicate records so that a combination of columns, e.g. COLUMN1, COLUMN2, COLUMN3, is not repeated (suppose the three-column unique key was missed in the table structure and multiple duplicate entries have already been inserted):
DROP TABLE IF EXISTS TABLE_NAME_copy;
CREATE TABLE TABLE_NAME_copy LIKE TABLE_NAME;
INSERT INTO TABLE_NAME_copy
SELECT * FROM TABLE_NAME
GROUP BY COLUMN1, COLUMN2, COLUMN3;
DROP TABLE TABLE_NAME;
ALTER TABLE TABLE_NAME_copy RENAME TO TABLE_NAME;
Hope this helps.
I had a table where I forgot to add a primary key on the id column, although it had auto_increment on id. One day, a staff member replayed the MySQL binary log on the database, which inserted some duplicate rows.
I removed the duplicate rows as follows:
1. Select one copy of each duplicated row and export the result:
select T1.* from table_name T1 inner join (select count(*) as c,id from table_name group by id) T2 on T1.id = T2.id where T2.c > 1 group by T1.id;
2. Delete the duplicated rows by id.
3. Insert the rows from the exported data.
4. Add the primary key on id.
This is perfect if you are trying to delete one of the duplicates and leave the other. Note that without subqueries you would get a #1093 error.
DELETE FROM table_name
WHERE id IN (
SELECT * FROM (SELECT n.id FROM table_name n
WHERE n.column2 != "value"
GROUP BY n.column HAVING COUNT(n.column ) > 1) x
)
I like to be a bit more specific as to which records I delete so here is my solution:
delete
from jobs c1
where not c1.location = 'Paris'
and c1.site_id > 64218
and exists
(
select * from jobs c2
where c2.site_id = c1.site_id
and c2.company = c1.company
and c2.location = c1.location
and c2.title = c1.title
and c2.site_id > 63412
and c2.site_id < 64219
)
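You can delete the duplicate records with this PHP code. Note that it relies on the legacy mysql_* functions, which were removed in PHP 7:
$qry = mysql_query("SELECT * from cities");
while ($qry_row = mysql_fetch_array($qry))
{
    $qry2 = mysql_query("SELECT * from cities2 where city = '".$qry_row['city']."'");
    if (mysql_num_rows($qry2) > 1) {
        // Start with an empty list for each city, otherwise rows from earlier cities pile up
        $city_arry = array();
        while ($row = mysql_fetch_array($qry2)) {
            $city_arry[] = $row;
        }
        // Keep the first row (index 0) and delete the rest by town_id (first column)
        $total = sizeof($city_arry) - 1;
        for ($i = 1; $i <= $total; $i++) {
            mysql_query("delete from cities2 where town_id = '".$city_arry[$i][0]."'");
        }
    }
}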
I had to do this with text fields and came across the limit of 100 bytes on the index.
I solved this by adding a column, storing an MD5 hash of the fields in it, and then doing the ALTER.
ALTER TABLE `table` ADD `merged` VARCHAR( 40 ) NOT NULL ;
UPDATE `table` SET `merged` = MD5(CONCAT(`col1`, `col2`, `col3`));
ALTER IGNORE TABLE `table` ADD UNIQUE INDEX idx_name (`merged`);
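Two details of the hash are easy to miss: CONCAT returns NULL as soon as any argument is NULL, and plain concatenation cannot tell 'ab' followed by 'c' from 'a' followed by 'bc'. A possible variant is sketched below. It assumes a table named your_table (a placeholder) with the merged column already added, and that the '|' character never occurs in the data:

```sql
-- CONCAT_WS puts the separator between the values, so column boundaries stay visible.
-- IFNULL keeps a slot for NULL columns; CONCAT_WS would otherwise skip them.
UPDATE your_table
SET merged = MD5(CONCAT_WS('|',
        IFNULL(col1, ''),
        IFNULL(col2, ''),
        IFNULL(col3, '')));
```

With this variant a NULL and an empty string hash to the same value, which may or may not be what you want when deciding which rows count as duplicates.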

update column with multiple values parsed from current column data

Hoping someone can help me with a mysql query
Here’s what I have:
I have a table with a column “networkname” that contains data like this:
“VLAN-338-Network1-A,VLAN-364-Network2-A,VLAN-988-Network3-A,VLAN-1051-Network4-A”
I need a MySQL query that will update that column with only the vlan numbers in ascending order, stripping out everything else. ie.
“338, 364, 988, 1051”
Thanks,
David
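In this script, I create a procedure to loop through the networkname values and parse out the numbers to a separate table, and then update YourTable using a group_concat function. This assumes your networkname values follow the 'VLAN-XXX' pattern in your example where 'XXX' is the 3-4 digit number you want to extract. This also assumes each record has a unique ID. When running it from the mysql command-line client, switch the statement delimiter first (DELIMITER //) so that the semicolons inside the body do not end the CREATE PROCEDURE statement early, and restore it afterwards (DELIMITER ;).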
CREATE PROCEDURE networkname_parser()
BEGIN
-- load test data
drop table if exists YourTable;
create table YourTable
(
ID int not null auto_increment,
networkname nvarchar(100),
primary key (ID)
);
insert into YourTable(networkname) values
('VLAN-338-Network1-A,VLAN-364-Network2-A,VLAN-988-Network3-A,VLAN-1051-Network4-A'),
('VLAN-231-Network1-A,VLAN-4567-Network2-A'),
('VLAN-9876-Network1-A,VLAN-321-Network2-A,VLAN-1678-Network3-A');
-- add commas to the end of networkname for parsing
update YourTable set networkname = concat(networkname,',');
-- parse networkname into related table
drop table if exists ParseYourString;
create table ParseYourString(ID int,NetworkNumbers int);
while (select count(*) from YourTable where networkname like 'VLAN-%') > 0
do
insert into ParseYourString
select ID,replace(substr(networkname,6,4),'-','')
from YourTable
where networkname like 'VLAN-%';
update YourTable
set networkname = right(networkname,char_length(networkname)-instr(networkname,','))
where networkname like 'VLAN-%';
end while;
-- update YourTable.networkname with NetworkNumbers
update YourTable t
inner join (select ID,group_concat(networknumbers order by networknumbers asc) as networknumbers
from ParseYourString
group by ID) n
on n.ID = t.ID
set t.networkname = n.networknumbers;
END//
Call to procedure and select the results:
call networkname_parser();
select * from YourTable;
SQL Fiddle: http://www.sqlfiddle.com/#!2/01c77/1

How to copy a row and insert in same table with a autoincrement field in MySQL?

In MySQL I am trying to copy a row with an autoincrement column ID=1 and insert the data into same table as a new row with column ID=2.
How can I do this in a single query?
Use INSERT ... SELECT:
insert into your_table (c1, c2, ...)
select c1, c2, ...
from your_table
where id = 1
where c1, c2, ... are all the columns except id. If you want to explicitly insert with an id of 2 then include that in your INSERT column list and your SELECT:
insert into your_table (id, c1, c2, ...)
select 2, c1, c2, ...
from your_table
where id = 1
You'll have to take care of a possible duplicate id of 2 in the second case of course.
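As a concrete illustration of the first form, here is a small sketch. The table definition and sample data are made up for the example; only the INSERT ... SELECT pattern comes from the answer:

```sql
-- Hypothetical table, used only for illustration.
CREATE TABLE your_table (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    c1 VARCHAR(50),
    c2 INT
);

INSERT INTO your_table (c1, c2) VALUES ('first row', 10);

-- Copy row 1; id is left out so AUTO_INCREMENT assigns the next value.
INSERT INTO your_table (c1, c2)
SELECT c1, c2
FROM your_table
WHERE id = 1;

-- id that was generated for the copy in this session.
SELECT LAST_INSERT_ID();
```

Listing the columns explicitly keeps the statement valid when columns are later added to the table, at the cost of having to update the list if the new columns should be copied too.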
IMO, the best approach is to use only SQL statements to copy the row, while referencing only the columns you must and want to change.
CREATE TEMPORARY TABLE temp_table ENGINE=MEMORY
SELECT * FROM your_table WHERE id=1;
UPDATE temp_table SET id=0; /* Update other values at will. */
INSERT INTO your_table SELECT * FROM temp_table;
DROP TABLE temp_table;
See also av8n.com - How to Clone an SQL Record
Benefits:
The SQL statements mention only the fields that need to be changed during the cloning process. They do not know about, or care about, other fields. The other fields just go along for the ride, unchanged. This makes the SQL statements easier to write, easier to read, easier to maintain, and more extensible.
Only ordinary MySQL statements are used. No other tools or programming languages are required.
A fully-correct record is inserted in your_table in one atomic operation.
Say the table is user(id, user_name, user_email).
You can use this query:
INSERT INTO user (SELECT NULL,user_name, user_email FROM user WHERE id = 1)
This helped, and it supports BLOB/TEXT columns.
CREATE TEMPORARY TABLE temp_table
AS
SELECT * FROM source_table WHERE id=2;
-- 0 lets AUTO_INCREMENT assign a new id on insert; the copied id column is NOT NULL, so NULL may be rejected
UPDATE temp_table SET id=0 WHERE id=2;
INSERT INTO source_table SELECT * FROM temp_table;
DROP TEMPORARY TABLE temp_table;
For a quick, clean solution that doesn't require you to name columns, you can use a prepared statement as described here:
https://stackoverflow.com/a/23964285/292677
If you need a complex solution so you can do this often, you can use this procedure:
DELIMITER $$
CREATE PROCEDURE `duplicateRows`(_schemaName text, _tableName text, _whereClause text, _omitColumns text)
SQL SECURITY INVOKER
BEGIN
SELECT IF(TRIM(_omitColumns) <> '', CONCAT('id', ',', TRIM(_omitColumns)), 'id') INTO @omitColumns;
SELECT GROUP_CONCAT(COLUMN_NAME) FROM information_schema.columns
WHERE table_schema = _schemaName AND table_name = _tableName AND FIND_IN_SET(COLUMN_NAME,@omitColumns) = 0 ORDER BY ORDINAL_POSITION INTO @columns;
SET @sql = CONCAT('INSERT INTO ', _tableName, '(', @columns, ')',
'SELECT ', @columns,
' FROM ', _schemaName, '.', _tableName, ' ', _whereClause);
PREPARE stmt1 FROM @sql;
EXECUTE stmt1;
DEALLOCATE PREPARE stmt1;
END$$
DELIMITER ;
You can run it with:
CALL duplicateRows('database', 'table', 'WHERE condition = optional', 'omit_columns_optional');
Examples (MySQL procedures have no optional parameters, so pass an empty string for an argument you do not need)
CALL duplicateRows('acl', 'users', 'WHERE id = 200', ''); -- will duplicate the row for the user with id 200
CALL duplicateRows('acl', 'users', 'WHERE id = 200', 'created_ts'); -- same as above but will not copy the created_ts column value
CALL duplicateRows('acl', 'users', 'WHERE id = 200', 'created_ts,updated_ts'); -- same as above but also omits the updated_ts column
CALL duplicateRows('acl', 'users', '', ''); -- will duplicate all records in the table
DISCLAIMER: This solution is only for someone who will be repeatedly duplicating rows in many tables, often. It could be dangerous in the hands of a rogue user.
If you're able to use MySQL Workbench, you can do this by right-clicking the row and selecting 'Copy row', and then right-clicking the empty row and selecting 'Paste row', and then changing the ID, and then clicking 'Apply'.
(Screenshots of each step: copy the row, paste the copied row into the blank row, change the ID, apply.)
insert into MyTable(field1, field2, id_backup)
select field1, field2, uniqueId from MyTable where uniqueId = @Id;
A lot of great answers here. Below is a sample of the stored procedure that I wrote to accomplish this task for a Web App that I am developing:
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON
-- Create Temporary Table
SELECT * INTO #tempTable FROM <YourTable> WHERE Id = Id
--To trigger the auto increment
UPDATE #tempTable SET Id = NULL
--Update new data row in #tempTable here!
--Insert duplicate row with modified data back into your table
INSERT INTO <YourTable> SELECT * FROM #tempTable
-- Drop Temporary Table
DROP TABLE #tempTable
You can also pass in '0' as the value for the auto-increment column; the correct value will be used when the record is created. This is much easier than using temporary tables.
Source:
Copying rows in MySQL
(see the second comment, by TRiG, to the first solution, by Lore)
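A minimal sketch of the '0' approach, assuming a hypothetical table people(id, name, email) whose id column is AUTO_INCREMENT (the table and column names are made up for illustration). MySQL generates a new id for a 0 only when the NO_AUTO_VALUE_ON_ZERO SQL mode is not enabled, so it is worth checking that first:

```sql
-- Hypothetical table; names are for illustration only.
CREATE TABLE people (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50),
  email VARCHAR(100)
);
INSERT INTO people (name, email) VALUES ('Ann', 'ann@example.com');

-- Copy row 1. The 0 in the id position is replaced by the next
-- auto-increment value (unless NO_AUTO_VALUE_ON_ZERO is set).
INSERT INTO people
SELECT 0, name, email
FROM people
WHERE id = 1;
```

The SELECT list here still names the non-id columns; only the id is left for MySQL to fill in.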
I tend to use a variation of what mu is too short posted:
INSERT INTO something_log
SELECT NULL, s.*
FROM something AS s
WHERE s.id = 1;
As long as the tables have identical fields (excepting the auto increment on the log table), then this works nicely.
Since I use stored procedures whenever possible (to make life easier on other programmers who aren't too familiar with databases), this solves the problem of having to go back and update procedures every time you add a new field to a table.
It also ensures that if you add new fields to a table, they will start appearing in the log table immediately without having to update your database queries (unless, of course, you have some that set a field explicitly).
Warning: You will want to make sure to add any new fields to both tables at the same time so that the field order stays the same... otherwise you will start getting odd bugs. If you are the only one that writes database interfaces AND you are very careful then this works nicely. Otherwise, stick to naming all of your fields.
Note: On second thought, unless you are working on a solo project that you are sure won't have others working on it, stick to listing all field names explicitly and update your log statements as your schema changes. This shortcut is probably not worth the long-term headache it can cause, especially on a production system.
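For comparison, the explicit version recommended in the note might look like the sketch below. The column names (log_id, name, status) are assumptions for illustration; the answer above does not say what columns the two tables contain:

```sql
-- Assumed layout: something(id, name, status)
--                 something_log(log_id AUTO_INCREMENT, id, name, status)
INSERT INTO something_log (log_id, id, name, status)
SELECT NULL, s.id, s.name, s.status
FROM something AS s
WHERE s.id = 1;
```

Because every column is named on both sides, adding or reordering columns in either table cannot silently shift values into the wrong field; the statement either keeps working or fails loudly.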
INSERT INTO `dbMyDataBase`.`tblMyTable`
(
`IdAutoincrement`,
`Column2`,
`Column3`,
`Column4`
)
SELECT
NULL,
`Column2`,
`Column3`,
'CustomValue' AS Column4
FROM `dbMyDataBase`.`tblMyTable`
WHERE `tblMyTable`.`Column2` = 'UniqueValueOfTheKey'
;
/* mySQL 5.6 */
Try this:
INSERT INTO test_table (SELECT null,txt FROM test_table)
Every time you run this query, it inserts all the existing rows again with new ids, so the number of rows in your table doubles on each run and grows exponentially.
I used a table with two columns, id and txt, where id is auto-increment.
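A small sketch of the doubling behaviour, using the same two-column layout (the column types and sample values are assumptions):

```sql
-- id is auto-increment; txt holds arbitrary sample text.
CREATE TABLE test_table (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  txt VARCHAR(20)
);
INSERT INTO test_table (txt) VALUES ('a'), ('b');   -- 2 rows

INSERT INTO test_table SELECT NULL, txt FROM test_table;  -- now 4 rows
INSERT INTO test_table SELECT NULL, txt FROM test_table;  -- now 8 rows

SELECT COUNT(*) FROM test_table;  -- 8
```

Add a WHERE clause to the SELECT if you only want to copy specific rows rather than the whole table.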
I was looking for the same feature but I don't use MySQL. I wanted to copy ALL the fields except of course the primary key (id). This was a one shot query, not to be used in any script or code.
I found my way around with PL/SQL but I'm sure any other SQL IDE would do. I did a basic
SELECT *
FROM mytable
WHERE id=42;
Then export it to a SQL file where I could find the
INSERT INTO table (col1, col2, col3, ... , col42)
VALUES (1, 2, 3, ..., 42);
I just edited it and used it:
INSERT INTO table (col1, col2, col3, ... , col42)
VALUES (mysequence.nextval, 2, 3, ..., 42);
insert into your_table(col1,col2,col3) select col1+1,col2,col3 from your_table where col1=1;
Note: if col1 is the primary key, make sure the incremented value of col1 does not already exist; otherwise the insert fails with a duplicate-entry error.
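One way to sidestep the duplicate-key problem is to derive the new key from the current maximum instead of adding 1 to the source row. This is only a sketch under the assumption that col1 is a numeric primary key without AUTO_INCREMENT; it is not safe against concurrent inserts unless you lock the table or run it in a suitable transaction:

```sql
-- Copy the row with col1 = 1, giving the copy the key MAX(col1) + 1.
INSERT INTO your_table (col1, col2, col3)
SELECT m.next_id, t.col2, t.col3
FROM your_table AS t
CROSS JOIN (SELECT MAX(col1) + 1 AS next_id FROM your_table) AS m
WHERE t.col1 = 1;
```

If col1 is AUTO_INCREMENT, it is simpler to leave it out of the column list and let MySQL assign the value.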
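CREATE TEMPORARY TABLE IF NOT EXISTS `temp_table` LIKE source_table;
-- Clear out anything left over from a previous run
DELETE FROM temp_table;
INSERT INTO temp_table SELECT * FROM source_table where columnid = 2;
-- Remove AUTO_INCREMENT and the primary key so id can be set to NULL
ALTER TABLE temp_table MODIFY id INT NOT NULL;
ALTER TABLE temp_table DROP PRIMARY KEY;
ALTER TABLE temp_table MODIFY id INT NULL;
UPDATE temp_table SET id=NULL ;
-- A NULL id makes source_table assign a new auto-increment value
INSERT INTO source_table SELECT * FROM temp_table;
DROP TEMPORARY TABLE IF EXISTS temp_table ;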
Dump the row you want to SQL, then use the generated SQL, minus the ID column, to import it back in.