Quickly copying all values from one table into another – MySQL

Suppose I have two tables with BTREE indices that have the same number of rows but don't necessarily share data, and I want to copy the values in one column from one table into the other table. I've seen other questions' answers regarding SELECTing that involve generating sequential IDs for both tables, then performing some JOIN operation based on matching these IDs together. For example (from the first answer):
select result1.title1, title1.age1,result2.title2, title2.age2 from
(select @i:=@i+1 AS rowId, title1, age1 from tab1,(SELECT @i:=0) a) as result1 ,
(select @j:=@j+1 AS rowId,title2, age2 from tab2,(SELECT @j:=0) a ) as result2
where
result1.rowId = result2.rowId; #sic.
However, I have two concerns:
I'm not sure how to update this way, since I don't know if I can make a dummy column for the destination table on the fly (i.e. something equivalent to UPDATE (SELECT (@x:=@x+1), title1 FROM title1,(SELECT @x:=0) a ) INNER JOIN...).
I am suspicious that this is in O(n^2) time with respect to the number of rows. This would be the case if the dummy IDs were generated first, then each run of the JOIN required a linear search of one or both tables. Is this the case? If so, is there a faster way to do this?
For example, consider the following two tables:
CREATE TABLE t1 (
id INT NOT NULL PRIMARY KEY,
v INT
);
INSERT INTO t1 VALUES
(1, 3),
(3, 4),
(25, 7);
CREATE TABLE t2 LIKE t1;
INSERT INTO t2 VALUES
(6, 150),
(9, 143),
(14, 175);
Suppose I wanted to replace the v values in t1 with those in t2 in one query so that t1 becomes:
mysql> SELECT * FROM t1;
+----+------+
| id | v |
+----+------+
| 1 | 150 |
| 3 | 143 |
| 25 | 175 |
+----+------+
3 rows in set (0.00 sec)
How would I do so?

After lots of testing I think I've found a pretty viable solution. If you have sequential IDs on the source table, you can copy the values relatively quickly to the destination table (especially if the source table is ENGINE=MEMORY) using the following [admittedly somewhat unusual] query:
UPDATE t1 CROSS JOIN (SELECT @x:=0) a
SET t1.v=(@x:=@x+1)+(SELECT v FROM t2 WHERE id=@x)-@x;
The overhead from the SELECT v... subquery is significant, however, and it slows the query noticeably.
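If the pairing is purely positional, the matching can also be done in application code instead of inside one query. The following is only a sketch of that idea, not the approach tested above: it uses Python's built-in sqlite3 module as a stand-in for a MySQL connection (an assumption), the tables t1 and t2 from the question, and a hypothetical helper name copy_column_by_position. Each table is read once in id order, and each update is a primary-key lookup, so no row requires a scan of the other table.

```python
import sqlite3

def copy_column_by_position(conn):
    """Copy t2.v into t1.v, pairing rows by their position in id order.

    Hypothetical helper: the table names follow the question, and sqlite3
    stands in for a real MySQL connection.
    """
    dest_ids = [row[0] for row in conn.execute("SELECT id FROM t1 ORDER BY id")]
    src_vals = [row[0] for row in conn.execute("SELECT v FROM t2 ORDER BY id")]
    if len(dest_ids) != len(src_vals):
        raise ValueError("tables must have the same number of rows")
    # One primary-key lookup per row, so neither table is scanned repeatedly.
    conn.executemany("UPDATE t1 SET v = ? WHERE id = ?", zip(src_vals, dest_ids))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INT NOT NULL PRIMARY KEY, v INT);
    INSERT INTO t1 VALUES (1, 3), (3, 4), (25, 7);
    CREATE TABLE t2 (id INT NOT NULL PRIMARY KEY, v INT);
    INSERT INTO t2 VALUES (6, 150), (9, 143), (14, 175);
""")
copy_column_by_position(conn)
result = conn.execute("SELECT id, v FROM t1 ORDER BY id").fetchall()
print(result)
```

The trade-off is one round trip per batch of updates rather than a single statement, which may or may not be faster than the user-variable query depending on table size and network latency.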

Related

Update multiple rows in table with values from a temporary table

I'm trying to write a database migration script to add a column to a table that contains existing data, and then populate that column with appropriate data.
I'm doing the migration in a few steps. I've created a temporary table that contains a single column with ids like this:
new_column
==========
1000
1001
1002
1003
...
I now want to update my existing table so that each row in the temporary table above is used to update each row in my existing table. The existing table looks like this:
old_column_1 | old_column_2 | new_column
========================================
1 | 100 | null
2 | 101 | null
3 | 102 | null
...
I've tried a few variations of this sort of update -
select min(t.new_column)
from temp t
where t.new_column not in (select new_column from existing_table);
But I can't seem to get the syntax right...
Your problem is more complicated than you think. There's nothing reliable to join on. So, either you write a stored procedure which uses a cursor to loop through both tables and updates the existing table row by row (which can quickly become a performance nightmare, so I wouldn't recommend it), or you use this slightly complicated query:
CREATE TABLE temp
(id int auto_increment primary key, `new_column` int)
;
INSERT INTO temp
(`new_column`)
VALUES
(1000),
(1001),
(1002),
(1003)
;
CREATE TABLE existing
(`old_column_1` int, `old_column_2` int, `new_column` varchar(4))
;
INSERT INTO existing
(`old_column_1`, `old_column_2`, `new_column`)
VALUES
(1, 100, NULL),
(2, 101, NULL),
(3, 102, NULL)
;
update
existing e
inner join (
select * from (
select
t.*
from temp t
)t
inner join
(
select
e.old_column_1, e.old_column_2,
@rownum := @rownum + 1 as rn
from existing e
, (select @rownum:=0) vars
)e on t.id = e.rn
) sq on sq.old_column_1 = e.old_column_1 and sq.old_column_2 = e.old_column_2
set e.new_column = sq.new_column;
see it working live in an sqlfiddle
I added an auto_increment column in your temporary table. Either you do it this way, or you simulate a rownumber like I did here:
select
e.old_column_1, e.old_column_2,
@rownum := @rownum + 1 as rn
from existing e
, (select @rownum:=0) vars
If you want to influence which row gets which row number, you can use ORDER BY whatever_column ASC|DESC in there.
So, what the query basically does is create a row number for each row in your existing table and join on it against the auto_increment column in the temporary table. I then join this subquery back to the existing table, so that the column can easily be copied from the temporary table to the existing table.

Return default value for non-existing rows

To the very basic query
SELECT id, column1, column2
FROM table1
WHERE id IN ("id1", "id2", "id3")
in which the arguments in the WHERE clause are passed as a variable, I need to return values also for rows where the id doesn't exist. In general, this is very similar to the problem outlined here: SQL: How to return an non-existing row? However, there are multiple parameters in the WHERE clause.
The result right now when id2 doesn't exist:
-------------------------------
| id | column1 | column2 |
-------------------------------
| id1 | some text | some text |
| id3 | some text | some text |
-------------------------------
Desired outcome when id2 doesn't exist
-----------------------------------
| id | column1 | column2 |
-----------------------------------
| id1 | some text | some text |
| id2 | placeholder | placeholder |
| id3 | some text | some text |
-----------------------------------
My first thought was to create a temporary table and join it against the query. Unfortunately, I don't have the rights to create any kind of temporary table so that I am limited to a SELECT statement.
Is there a way to do that with a SQL SELECT query?
Edit:
Indeed, the above is a hypothetical situation. The WHERE clause can contain hundreds of ids, and the number of missing ones is unknown.
You can do a derived table to create something like a temp table, but it can only be used for this one query:
SELECT _dflt.id, COALESCE(t.column1, _dflt.column1) AS column1
FROM (
SELECT 'id1' AS id, 'placeholder text 1' as column1
UNION ALL
SELECT 'id2', 'placeholder text 2'
UNION ALL
SELECT 'id3', 'placeholder text 3'
) AS _dflt
LEFT OUTER JOIN table1 t USING (id);
Re comments:
I just tested the method above on MySQL 5.6.15 to see how many distinct SELECTs I can get with a series of UNION ALLs, one row per SELECT.
I got the derived table to return 5332 rows, but I think I could go higher if I had more RAM.
If I try one more UNION ALL, I get: ERROR 1064 (42000): memory exhausted near '' at line 10665. I only have 2.0GB of RAM configured on this VM.
It doesn't matter how many ids are unknown for this solution to work. Just put them all in the derived table. By using LEFT OUTER JOIN, it automatically finds those that exist in your table1, and for the ones that are missing, the entry from the derived table will be matched up with NULLs.
The COALESCE() function returns its first non-null argument, so it'll use columns from the matched rows if those are present. Where none is found, it'll default to the columns in the derived table.
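When the id list comes from an application, the derived table can be generated rather than typed by hand. The sketch below assumes Python with the built-in sqlite3 module standing in for MySQL, a two-column table1, and a single fixed 'placeholder' default; the names wanted_ids and derived are made up for illustration. The LEFT JOIN and COALESCE() logic is the same as in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id TEXT PRIMARY KEY, column1 TEXT);
    INSERT INTO table1 VALUES ('id1', 'some text'), ('id3', 'some text');
""")

wanted_ids = ["id1", "id2", "id3"]  # hypothetical input from the application

# One "SELECT ? AS id" per wanted id, glued together with UNION ALL.
derived = " UNION ALL ".join(["SELECT ? AS id"] * len(wanted_ids))
query = f"""
    SELECT d.id, COALESCE(t.column1, 'placeholder') AS column1
    FROM ({derived}) AS d
    LEFT JOIN table1 AS t ON t.id = d.id
    ORDER BY d.id
"""
# The ids are bound as parameters, never spliced into the SQL text.
rows = conn.execute(query, wanted_ids).fetchall()
print(rows)
```

Only the number of placeholders is interpolated into the statement; the id values themselves are passed as bound parameters.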
Create a stored procedure that would take as input id1, id2 and so on...
DELIMITER //
CREATE PROCEDURE P1(IN p_in varchar(5))
BEGIN
DECLARE count integer;
SELECT count(id) INTO count FROM table1
WHERE id = p_in;
IF count = 1 THEN
SELECT * from table1 where id = p_in;
ELSE
select p_in, 'some text', 'some text';
END IF;
END//
DELIMITER ;
Then call the procedure to get the desired output:
CALL P1('id1');
CALL P1('id2');
.. and so on from your program..
Project a derived table containing all the candidate ids you want, then left join to it:
select ids.id, coalesce(table1.column1,'placeholder')
From
(Select 'id1' as id
Union
Select 'id2'
Union
Select 'id3') ids
left join table1
on ids.id = table1.id
and table1.id in (...);
If you are producing the list of candidate ids from an external source (e.g. an application), you could insert the ids into a temporary table and then join to it (MySql doesn't support table variables yet).

Mysql: Create inline table within select statement?

Is there a way in MySql to create an inline table to use for join?
Something like:
SELECT LONG [1,2,3] as ID, VARCHAR(1) ['a','b','c'] as CONTENT
that would output
| ID | CONTENT |
| LONG | VARCHAR(1)|
+------+-----------+
| 1 | 'a' |
| 2 | 'b' |
| 3 | 'c' |
and that I could use in a join like this:
SELECT
MyTable.*,
MyInlineTable.CONTENT
FROM
MyTable
JOIN
(SELECT LONG [1,2,3] as ID, VARCHAR(1) ['a','b','c'] as CONTENT MyInlineTable)
ON MyTable.ID = MyInlineTable.ID
I realize that I can do
SELECT 1,'a' UNION SELECT 2,'b' UNION SELECT 3,'c'
But that seems pretty evil
I don't want to do a stored procedure because potentially a,b,c can change at every query and the size of the data as well. Also a stored procedure needs to be saved in the database, and I don't want to have to modify the database just for that.
View is the same thing.
What I am really looking for is something that does SELECT 1,'a' UNION SELECT 2,'b' UNION SELECT 3,'c' with a nicer syntax.
The only ways I can think of right now are using UNION or creating a TEMPORARY TABLE and inserting those values into it. Does that suit you?
TEMPORARY_TABLE (tested and it works):
Creation:
CREATE TEMPORARY TABLE MyInlineTable (id LONG, content VARCHAR(1) );
INSERT INTO MyInlineTable VALUES
(1, 'a'),
(2, 'b'),
(3, 'c');
Usage:
SELECT
MyTable.*,
MyInlineTable.CONTENT
FROM
MyTable
JOIN
MyInlineTable
ON MyTable.ID = MyInlineTable.ID;
TEMPORARY TABLE lifetime (the reference quoted below describes Transact-SQL; in MySQL, a TEMPORARY table is visible only to the session that created it and is dropped automatically when that session closes):
Temporary tables are automatically dropped when they go out of scope, unless they have already been explicitly dropped using DROP TABLE:
All other local temporary tables are dropped automatically at the end of the current session.
Global temporary tables are automatically dropped when the session that created the table ends and all other tasks have stopped referencing them. The association between a task and a table is maintained only for the life of a single Transact-SQL statement. This means that a global temporary table is dropped at the completion of the last Transact-SQL statement that was actively referencing the table when the creating session ended.
What I am really looking for is something that does SELECT 1,'a' UNION SELECT 2,'b' UNION SELECT 3,'c' with a nicer syntax.
Yes, it is possible with ROW CONSTRUCTOR introduced in MySQL 8.0.19:
VALUES ROW (1,'a'), ROW(2,'b'), ROW(3,'c')
and with JOIN:
SELECT *
FROM tab
JOIN (VALUES ROW (1,'a'), ROW(2,'b'), ROW(3,'c') ) sub(id, content)
ON tab.id = sub.id;
db<>fiddle demo
Yes. Do with stored procedures or views.

A multitude of the same id in an WHERE id IN () statement

I have a simple query that increases the value of a field by 1.
Now I used to loop over all id's and fire a query for each of them, but now that things are getting a bit resource heavy I wanted to optimize this. Normally I would just do
UPDATE table SET field = field + 1 WHERE id IN (all the ids here)
but now I have the problem that there are ids that occur twice (or more; I can't know that beforehand).
Is there a way to have the query run twice for id 4 if the query looks like this:
UPDATE table SET field = field + 1 WHERE id IN (1, 2, 3, 4, 4, 5)
Thanks,
lordstyx
Edit: sorry for not being clear enough.
The id here is an auto-increment field, so all ids are unique. The ids that have to be updated come indirectly from users, so I can't predict which id is going to occur how often.
If the ids are (1, 2, 3, 4, 4, 5), I need the field of the row with id 4 to be incremented by 2, and all the rest by 1.
If (1, 2, 3, 4, 4, 5) comes from a SELECT id ... query, then you can do something like this:
UPDATE yourTable
JOIN
( SELECT id
, COUNT(id) AS counter
....
GROUP BY id
) AS data
ON yourTable.id = data.id
SET yourTable.field = yourTable.field + data.counter
;
Since the input comes from users, perhaps you can manipulate it a bit. Change (1, 2, 3, 4, 4, 5) to (1), (2), (3), (4), (4), (5).
Then (having created a temporary table):
CREATE TABLE tempUpdate
( id INT )
;
Do the following procedure:
add the values in the temporary table,
run the update and
delete the values.
Code:
INSERT INTO tempUpdate
VALUES (1), (2), (3), (4), (4), (5)
;
UPDATE yourTable
JOIN
( SELECT id
, COUNT(id) AS counter
FROM tempUpdate
GROUP BY id
) AS data
ON yourTable.id = data.id
SET yourTable.field = yourTable.field + data.counter
;
DELETE FROM tempUpdate
;
No. But you could perform something like
UPDATE table
SET field = field + (LENGTH(',1,2,3,4,4,5,') - LENGTH(REPLACE(',1,2,3,4,4,5,', CONCAT(',', id, ','), ''))) / LENGTH(CONCAT(',', id, ','))
WHERE id IN (1, 2, 3, 4, 4, 5)
if you specifically need the row with id = 4 to be incremented twice.
Here is the solution you asked for, but I'm not sure it is what you need.
Let's say that your table is called test. You want to increase id. I've added a field idwas to easily show what the id was before the query:
CREATE TABLE `test` (
`id` int(10) unsigned NOT NULL auto_increment,
`idwas` int(8) unsigned default NULL,
PRIMARY KEY (`id`)
) ;
Let's fill it with data:
truncate table test;
insert into test(id) VALUES(1),(3),(15);
update test set idwas = id;
Now let's say that you have user input 1,3,5,3, so:
id 1 should be increased by 1
id 3 should be increased by 2
id 5 is missing, nothing to increase.
row with id 15 should not be changed because not in user input
We'll put the user input in a variable to be easier to use it:
SET @userInput = '1,3,5,3';
then do the magic:
SET @helperTable = CONCAT(
'SELECT us.id, count(us.id) as i FROM ',
'(SELECT ',REPLACE(@userInput, ',',' AS `id` UNION ALL SELECT '),
') AS us GROUP BY us.id');
SET @stmtText = CONCAT(
' UPDATE ',
'(',@helperTable,') AS h INNER JOIN test as t ON t.id = h.id',
' SET t.id = t.id + h.i');
PREPARE stmt FROM @stmtText;
EXECUTE stmt;
And this is the result:
mysql> SELECT * FROM test;
+----+-------+
| id | idwas |
+----+-------+
| 2 | 1 |
| 5 | 3 |
| 15 | 15 |
+----+-------+
3 rows in set (0.00 sec)
If it's reasonable, you could try doing a combination of what you had before and what you have now.
In whatever is creating this list, separate it into (depending on the language's constructs) some type of array. Follow this by sorting it, finding how many multiples of each there are, and doing whatever else you need to do to get the following: an array with (increment-number => list of ids), so you do one query for each increment amount. Thus, your example becomes
UPDATE table SET field = field + 1 WHERE id IN (1, 2, 3, 5)
UPDATE table SET field = field + 2 WHERE id IN (4)
In PHP, for example, I would take the array, sort it, use its contents as the keys for another array of the form (id => count), and then fold that over into the (count => list of ids) array.
It's not that efficient, but is definitely better than one query per id. It's also probably better than using iteration and string manipulation in SQL. Unless you're forced to use SQL to do everything (which it sounds like you're not), I wouldn't use it to do everything, when it's overly awkward to do so.
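The (id => count) and (count => list of ids) folding described above could look roughly like this in Python. This is a sketch under assumptions: the function names, the your_table and field identifiers, and the %s placeholder style (as used by common Python MySQL drivers) are all illustrative, and nothing is executed against a database here.

```python
from collections import Counter, defaultdict

def build_increment_batches(ids):
    """Group ids by how often they occur: {increment: [ids...]}."""
    counts = Counter(ids)                  # id -> number of occurrences
    batches = defaultdict(list)
    for id_, n in sorted(counts.items()):  # sorted for a stable id order
        batches[n].append(id_)
    return dict(batches)

def build_update_statements(ids, table="your_table", field="field"):
    """Return one (sql, params) pair per distinct increment amount.

    The table and field names are hypothetical and must come from trusted
    code, because identifiers cannot be bound as parameters.
    """
    statements = []
    for increment, batch in sorted(build_increment_batches(ids).items()):
        placeholders = ", ".join(["%s"] * len(batch))
        sql = (f"UPDATE {table} SET {field} = {field} + %s "
               f"WHERE id IN ({placeholders})")
        statements.append((sql, [increment, *batch]))
    return statements

batches = build_increment_batches([1, 2, 3, 4, 4, 5])
statements = build_update_statements([1, 2, 3, 4, 4, 5])
for sql, params in statements:
    print(sql, params)
```

Each (sql, params) pair would then be passed to the driver's cursor.execute(), giving one query per distinct increment amount instead of one per id.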
You could use the following:
create temporary table temp1 (id integer);
insert into temp1 (id) values (1),(2),(3),(4),(4),(5);
update your_table set your_field = your_field + (select count(*) from temp1 where id = your_table.id)
This solution requires you to format the id list like (1),(2),(3),(4),(4),(5) but I don't think that is a problem, right?
This worked on my test database, hope it works for you too!
Regards,
Arthur
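Driven from application code, the same temporary-table idea might be sketched as follows. The assumptions are that Python's built-in sqlite3 module stands in for MySQL and that your_table starts with all counters at zero; the added WHERE clause is a small variation that skips rows whose id is not in the list at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE your_table (id INTEGER PRIMARY KEY, your_field INTEGER NOT NULL);
    INSERT INTO your_table VALUES (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0);
    CREATE TEMPORARY TABLE temp1 (id INTEGER);
""")

user_ids = [1, 2, 3, 4, 4, 5]  # hypothetical user input, duplicates included

# executemany() inserts one row per id, so no (1),(2),... string is built.
conn.executemany("INSERT INTO temp1 (id) VALUES (?)", [(i,) for i in user_ids])

# Each row is incremented by the number of times its id appears in temp1.
conn.execute("""
    UPDATE your_table
    SET your_field = your_field
        + (SELECT COUNT(*) FROM temp1 WHERE temp1.id = your_table.id)
    WHERE id IN (SELECT id FROM temp1)
""")
conn.execute("DELETE FROM temp1")
conn.commit()

final = conn.execute("SELECT id, your_field FROM your_table ORDER BY id").fetchall()
print(final)
```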

How to delete duplicates on a MySQL table?

I need to DELETE duplicated rows for specified sid on a MySQL table.
How can I do this with an SQL query?
DELETE (DUPLICATED TITLES) FROM table WHERE SID = "1"
Something like this, but I don't know how to do it.
This removes duplicates in place, without making a new table.
ALTER IGNORE TABLE `table_name` ADD UNIQUE (title, SID)
Note: This only works well if the index fits in memory. Also, ALTER IGNORE TABLE was removed in MySQL 5.7.4, so this only applies to older versions.
Suppose you have a table employee, with the following columns:
employee (id, first_name, last_name, start_date)
In order to delete the rows with a duplicate first_name column:
delete
from employee using employee,
employee e1
where employee.id > e1.id
and employee.first_name = e1.first_name
Walkthrough: deleting duplicate rows in MySQL in place (assuming you have a timestamp column to sort by):
Create the table and insert some rows:
create table penguins(foo int, bar varchar(15), baz datetime);
insert into penguins values(1, 'skipper', now());
insert into penguins values(1, 'skipper', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(4, 'rico', now());
select * from penguins;
+------+----------+---------------------+
| foo | bar | baz |
+------+----------+---------------------+
| 1 | skipper | 2014-08-25 14:21:54 |
| 1 | skipper | 2014-08-25 14:21:59 |
| 3 | kowalski | 2014-08-25 14:22:09 |
| 3 | kowalski | 2014-08-25 14:22:13 |
| 3 | kowalski | 2014-08-25 14:22:15 |
| 4 | rico | 2014-08-25 14:22:22 |
+------+----------+---------------------+
6 rows in set (0.00 sec)
Remove the duplicates in place:
delete a
from penguins a
left join(
select max(baz) maxtimestamp, foo, bar
from penguins
group by foo, bar) b
on a.baz = maxtimestamp and
a.foo = b.foo and
a.bar = b.bar
where b.maxtimestamp IS NULL;
Query OK, 3 rows affected (0.01 sec)
select * from penguins;
+------+----------+---------------------+
| foo | bar | baz |
+------+----------+---------------------+
| 1 | skipper | 2014-08-25 14:21:59 |
| 3 | kowalski | 2014-08-25 14:22:15 |
| 4 | rico | 2014-08-25 14:22:22 |
+------+----------+---------------------+
3 rows in set (0.00 sec)
You're done, duplicate rows are removed, last one by timestamp is kept.
For those of you without a timestamp or unique column.
You don't have a timestamp or a unique index column to sort by? You're living in a state of degeneracy. You'll have to do additional steps to delete duplicate rows.
create the penguins table and add some rows
create table penguins(foo int, bar varchar(15));
insert into penguins values(1, 'skipper');
insert into penguins values(1, 'skipper');
insert into penguins values(3, 'kowalski');
insert into penguins values(3, 'kowalski');
insert into penguins values(3, 'kowalski');
insert into penguins values(4, 'rico');
select * from penguins;
# +------+----------+
# | foo | bar |
# +------+----------+
# | 1 | skipper |
# | 1 | skipper |
# | 3 | kowalski |
# | 3 | kowalski |
# | 3 | kowalski |
# | 4 | rico |
# +------+----------+
make a clone of the first table and copy into it.
drop table if exists penguins_copy;
create table penguins_copy as ( SELECT foo, bar FROM penguins );
#add an autoincrementing primary key:
ALTER TABLE penguins_copy ADD moo int AUTO_INCREMENT PRIMARY KEY first;
select * from penguins_copy;
# +-----+------+----------+
# | moo | foo | bar |
# +-----+------+----------+
# | 1 | 1 | skipper |
# | 2 | 1 | skipper |
# | 3 | 3 | kowalski |
# | 4 | 3 | kowalski |
# | 5 | 3 | kowalski |
# | 6 | 4 | rico |
# +-----+------+----------+
The max aggregate operates upon the new moo index:
delete a from penguins_copy a left join(
select max(moo) myindex, foo, bar
from penguins_copy
group by foo, bar) b
on a.moo = b.myindex and
a.foo = b.foo and
a.bar = b.bar
where b.myindex IS NULL;
#drop the extra column on the copied table
alter table penguins_copy drop moo;
select * from penguins_copy;
#drop the first table and put the copy table back:
drop table penguins;
create table penguins select * from penguins_copy;
observe and cleanup
drop table penguins_copy;
select * from penguins;
+------+----------+
| foo | bar |
+------+----------+
| 1 | skipper |
| 3 | kowalski |
| 4 | rico |
+------+----------+
Elapsed: 1458.359 milliseconds
What's that big SQL delete statement doing?
Table penguins, aliased as 'a', is left joined to a derived table aliased as 'b'. The derived table 'b' holds one row per (foo, bar) group, carrying that group's max timestamp [or max moo]. Table 'a' contributes every row of the table as (foo, bar, baz), and a row of 'a' matches a row of 'b' only when it IS the row with the max timestamp for its group.
For every row that is not that max, the joined maxtimestamp is NULL. Filter down to those NULL rows and you have the set of all rows that are not the latest timestamp baz within their foo and bar group. Delete those.
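Before running the DELETE, the same anti-join can be run as a SELECT to see which rows it would remove. The sketch below does that in Python with the built-in sqlite3 module standing in for MySQL (an assumption); the timestamps are stored as text copied from the sample output above, and the variable name doomed is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE penguins (foo INT, bar TEXT, baz TEXT);
    INSERT INTO penguins VALUES
        (1, 'skipper',  '2014-08-25 14:21:54'),
        (1, 'skipper',  '2014-08-25 14:21:59'),
        (3, 'kowalski', '2014-08-25 14:22:09'),
        (3, 'kowalski', '2014-08-25 14:22:13'),
        (3, 'kowalski', '2014-08-25 14:22:15'),
        (4, 'rico',     '2014-08-25 14:22:22');
""")

# Same join as the DELETE, but selecting: rows of 'a' with no partner in 'b'
# are the ones that are NOT the newest row of their (foo, bar) group.
doomed = conn.execute("""
    SELECT a.foo, a.bar, a.baz
    FROM penguins a
    LEFT JOIN (
        SELECT MAX(baz) AS maxtimestamp, foo, bar
        FROM penguins
        GROUP BY foo, bar
    ) b ON a.baz = b.maxtimestamp AND a.foo = b.foo AND a.bar = b.bar
    WHERE b.maxtimestamp IS NULL
    ORDER BY a.baz
""").fetchall()
print(doomed)
```

The three rows returned correspond to the "3 rows affected" reported by the DELETE in the walkthrough.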
Make a backup of the table before you run this.
Prevent this problem from ever happening again on this table:
If you got this to work and it put out your "duplicate row" fire, great. Now define a new composite unique key on your table (on those two columns) to prevent more duplicates from being added in the first place.
Like a good immune system, the bad rows shouldn't even be allowed into the table at the time of insert. Later on, all those programs adding duplicates will broadcast their protest, and when you fix them, this issue never comes up again.
The following removes duplicates for all SIDs, not just a single one.
With temp table
CREATE TABLE table_temp AS
SELECT * FROM table GROUP BY title, SID;
DROP TABLE table;
RENAME TABLE table_temp TO table;
Since table_temp is freshly created, it has no indexes. You'll need to recreate them after removing duplicates. You can check which indexes the table has with SHOW INDEXES IN table
Without temp table:
DELETE FROM `table` WHERE id IN (
SELECT all_duplicates.id FROM (
SELECT id FROM `table` WHERE (`title`, `SID`) IN (
SELECT `title`, `SID` FROM `table` GROUP BY `title`, `SID` having count(*) > 1
)
) AS all_duplicates
LEFT JOIN (
SELECT id FROM `table` GROUP BY `title`, `SID` having count(*) > 1
) AS grouped_duplicates
ON all_duplicates.id = grouped_duplicates.id
WHERE grouped_duplicates.id IS NULL
)
After running into this issue myself, on a huge database, I wasn't completely impressed with the performance of any of the other answers. I want to keep only the latest duplicate row, and delete the rest.
In a one-query statement, without a temp table, this worked best for me,
DELETE e.*
FROM employee e
WHERE id IN
(SELECT id
FROM (SELECT MIN(id) as id
FROM employee e2
GROUP BY first_name, last_name
HAVING COUNT(*) > 1) x);
The only caveat is that I have to run the query multiple times, but even with that, I found it worked better for me than the other options.
This always seems to work for me:
CREATE TABLE NoDupeTable LIKE DupeTable;
INSERT NoDupeTable SELECT * FROM DupeTable group by CommonField1,CommonFieldN;
Which keeps the lowest ID on each of the dupes and the rest of the non-dupe records.
I've also taken to doing the following so that the dupe issue no longer occurs after the removal:
CREATE TABLE NoDupeTable LIKE DupeTable;
Alter table NoDupeTable Add Unique `Unique` (CommonField1,CommonField2);
INSERT IGNORE NoDupeTable SELECT * FROM DupeTable;
In other words, I create a duplicate of the first table, add a unique index on the fields I don't want duplicates of, and then do an Insert IGNORE which has the advantage of not failing as a normal Insert would the first time it tried to add a duplicate record based on the two fields and rather ignores any such records.
Going forward, it becomes impossible to create any duplicate records based on those two fields.
The following works for all tables
CREATE TABLE `noDup` LIKE `Dup` ;
INSERT `noDup` SELECT DISTINCT * FROM `Dup` ;
DROP TABLE `Dup` ;
ALTER TABLE `noDup` RENAME `Dup` ;
Here is a simple answer:
delete a from target_table a left JOIN (select max(id_field) as id, field_being_repeated
from target_table GROUP BY field_being_repeated) b
on a.field_being_repeated = b.field_being_repeated
and a.id_field = b.id_field
where b.id_field is null;
This works for me to remove old records:
delete from table where id in
(select min(e.id)
from (select * from table) e
group by column1, column2
having count(*) > 1
);
You can replace min(e.id) with max(e.id) to remove the newest records instead.
delete p from
product p
inner join (
select max(id) as id, url from product
group by url
having count(*) > 1
) unik on unik.url = p.url and unik.id != p.id;
I find Werner's solution above to be the most convenient because it works regardless of the presence of a primary key, doesn't mess with tables, uses future-proof plain sql, is very understandable.
As I stated in my comment, that solution hasn't been properly explained though.
So this is mine, based on it.
1) add a new boolean column
alter table mytable add tokeep boolean;
2) add a constraint on the duplicated columns AND the new column
alter table mytable add constraint preventdupe unique (mycol1, mycol2, tokeep);
3) set the boolean column to true. This will succeed only on one of the duplicated rows because of the new constraint
update ignore mytable set tokeep = true;
4) delete rows that have not been marked as tokeep
delete from mytable where tokeep is null;
5) drop the added column
alter table mytable drop tokeep;
I suggest that you keep the constraint you added, so that new duplicates are prevented in the future.
This procedure will remove all duplicates (incl multiples) in a table, keeping the last duplicate. This is an extension of Retrieving last record in each group
Hope this is useful to someone.
DROP TABLE IF EXISTS UniqueIDs;
CREATE Temporary table UniqueIDs (id Int(11));
INSERT INTO UniqueIDs
(SELECT T1.ID FROM Table T1 LEFT JOIN Table T2 ON
(T1.Field1 = T2.Field1 AND T1.Field2 = T2.Field2 #Comparison Fields
AND T1.ID < T2.ID)
WHERE T2.ID IS NULL);
DELETE FROM Table WHERE id NOT IN (SELECT ID FROM UniqueIDs);
Another easy way... using UPDATE IGNORE:
You have to use an index on one or more columns (type index).
Create a new temporary reference column (not part of the index). In this column, you mark the uniques by updating it with the IGNORE clause. Step by step:
Add a temporary reference column to mark the uniques:
ALTER TABLE `yourtable` ADD `unique` VARCHAR(3) NOT NULL AFTER `lastcolname`;
=> this will add a column to your table.
Update the table and try to mark everything as unique, but ignore possible errors due to duplicate key issues (those records will be skipped):
UPDATE IGNORE `yourtable` SET `unique` = 'Yes' WHERE 1;
=> you will find your duplicate records will not be marked as unique = 'Yes', in other words only one of each set of duplicate records will be marked as unique.
Delete everything that's not unique:
DELETE FROM `yourtable` WHERE `unique` <> 'Yes';
=> This will remove all duplicate records.
Drop the column...
ALTER TABLE `yourtable` DROP `unique`;
If you want to keep the row with the lowest id value:
DELETE n1 FROM `yourTableName` n1, `yourTableName` n2 WHERE n1.id > n2.id AND n1.email = n2.email
If you want to keep the row with the highest id value:
DELETE n1 FROM `yourTableName` n1, `yourTableName` n2 WHERE n1.id < n2.id AND n1.email = n2.email
Deleting duplicates on MySQL tables is a common issue, that usually comes with specific needs. In case anyone is interested, here (Remove duplicate rows in MySQL) I explain how to use a temporary table to delete MySQL duplicates in a reliable and fast way, also valid to handle big data sources (with examples for different use cases).
Ali, in your case, you can run something like this:
-- create a new temporary table
CREATE TABLE tmp_table1 LIKE table1;
-- add a unique constraint
ALTER TABLE tmp_table1 ADD UNIQUE(sid, title);
-- scan over the table to insert entries
INSERT IGNORE INTO tmp_table1 SELECT * FROM table1 ORDER BY sid;
-- rename tables
RENAME TABLE table1 TO backup_table1, tmp_table1 TO table1;
delete from `table` where `table`.`SID` in
(
select t.SID from table t join table t1 on t.title = t1.title where t.SID > t1.SID
)
Love @eric's answer but it doesn't seem to work if you have a really big table (I'm getting The SELECT would examine more than MAX_JOIN_SIZE rows; check your WHERE and use SET SQL_BIG_SELECTS=1 or SET MAX_JOIN_SIZE=# if the SELECT is okay when I try to run it). So I limited the join query to only consider the duplicate rows and I ended up with:
DELETE a FROM penguins a
LEFT JOIN (SELECT COUNT(baz) AS num, MIN(baz) AS keepBaz, foo
FROM penguins
GROUP BY foo HAVING num > 1) b
ON a.baz != b.keepBaz
AND a.foo = b.foo
WHERE b.foo IS NOT NULL
The WHERE clause in this case allows MySQL to ignore any row that doesn't have a duplicate, and the join condition also skips the first instance of each duplicate, so only subsequent duplicates are deleted. Change MIN(baz) to MAX(baz) to keep the last instance instead of the first.
This works for large tables:
CREATE Temporary table duplicates AS select max(id) as id, url from links group by url having count(*) > 1;
DELETE l from links l inner join duplicates ld on ld.id = l.id WHERE ld.id IS NOT NULL;
To delete oldest change max(id) to min(id)
This here will make the column column_name into a primary key, and in the meantime ignore all errors. So it will delete the rows with a duplicate value for column_name.
ALTER IGNORE TABLE `table_name` ADD PRIMARY KEY (`column_name`);
I think this will work by basically copying the table and emptying it then putting only the distinct values back into it but please double check it before doing it on large amounts of data.
Creates a carbon copy of your table
create table temp_table like oldtablename;
insert temp_table select * from oldtablename;
Empties your original table
DELETE FROM oldtablename;
Copies all distinct values from the copied table back to your original table
INSERT oldtablename SELECT * from temp_table group by firstname,lastname,dob
Deletes your temp table.
Drop Table temp_table
You need to group by all fields that you want to keep distinct.
DELETE T2
FROM table_name T1
JOIN table_name T2 ON (T1.title = T2.title AND T1.ID < T2.ID)
Here is how I usually eliminate duplicates:
add a temporary column and name it whatever you want (I'll refer to it as active)
group by the fields that you think shouldn't be duplicated and set active to 1 for one row of each group; grouping selects only one row for each combination of those columns
delete the rows where active is zero
drop the active column
optionally (if it fits your purposes), add a unique index on those columns so that duplicates cannot occur again
You could just use a DISTINCT clause to select the "cleaned up" list (and here is a very easy example on how to do that).
Could it work if you count them, and then add a limit to your delete query leaving just one?
For example, if you have two or more, write your query like this:
DELETE FROM table WHERE SID = 1 LIMIT 1;
There are just a few basic steps when removing duplicate data from your table:
Back up your table!
Find the duplicate rows
Remove the duplicate rows
Here is the full tutorial: https://blog.teamsql.io/deleting-duplicate-data-3541485b3473