I'm trying to add a new column to a few MySQL (MariaDB) tables. I want the column to be auto-filled with sequential numbers. What I figured out so far is:
SELECT @count:=0;
UPDATE users SET ordering = @count:=@count+1;
It works perfectly. However, I don't know how to make it so that the order in which the numbers are assigned is based on another value, namely ascending order of another integer field called regdate. Here's an example.
Current result:
login | regdate | ordering
:------ | :---------- | -------:
user | 1633205589 | 1
guy | 16332060000 | 3
account | 16332090000 | 2
data | 16332095000 | 4
What I want:
login | regdate | ordering
:------ | :---------- | -------:
user | 1633205589 | 1
guy | 16332060000 | 2
account | 16332090000 | 3
data | 16332095000 | 4
I hope it's pretty clear and concise :)
You can use a joined table with ROW_NUMBER
CREATE TABLE users (
`login` VARCHAR(7),
`regdate` BIGINT, -- numeric, so ORDER BY sorts by value rather than as text
`ordering` INTEGER
);
INSERT INTO users
(`login`, `regdate`, `ordering`)
VALUES
('user', 1633205589, 1),
('guy', 16332060000, 3),
('account', 16332090000, 2),
('data', 16332095000, 4);
UPDATE users u1
JOIN (SELECT `login`, `regdate`, row_number() over (ORDER BY regdate ASC) rn FROM users) u2
ON u1.`login` = u2.`login` AND u1.`regdate` = u2.`regdate`
SET u1.ordering = u2.rn ;
SELECT * FROM users
login | regdate | ordering
:------ | :---------- | -------:
user | 1633205589 | 1
guy | 16332060000 | 2
account | 16332090000 | 3
data | 16332095000 | 4
UPDATE users usr1
JOIN (SELECT @a:=@a+1 rn, id FROM users, (SELECT @a:= 0) AS a) usr2
ON usr1.id = usr2.id
SET usr1.serial = usr2.rn;
Does users have a PRIMARY KEY? If so, the following won't work. (Please provide SHOW CREATE TABLE users;)
CREATE TABLE users2 LIKE users;
ALTER TABLE users2
ADD COLUMN ordering INT UNSIGNED NOT NULL AUTO_INCREMENT,
ADD PRIMARY KEY(ordering);
INSERT INTO users2
SELECT *, -- the existing columns
NULL -- 1,2,etc automatically filled in
FROM users
ORDER BY regdate; -- so the numbers follow ascending regdate
RENAME TABLE users TO trash,
users2 TO users;
DROP TABLE trash;
What happens when a new user is added? AUTO_INCREMENT automatically gives it the next higher number. Is that OK? If you need to renumber the rows, no method is good.
As long as your update just involves a single table, you can specify an order:
UPDATE users SET ordering = @count:=@count+1 ORDER BY regdate;
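For anyone who wants to try the ROW_NUMBER idea without a MySQL server at hand, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. It assumes login is unique, a SQLite build with window functions (3.25 or newer), and uses a correlated subquery because SQLite has no UPDATE ... JOIN; the table contents are the sample rows from the question.

```python
import sqlite3

# Hypothetical in-memory copy of the question's table (SQLite, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (login TEXT PRIMARY KEY, regdate INTEGER, ordering INTEGER)"
)
conn.executemany(
    "INSERT INTO users (login, regdate, ordering) VALUES (?, ?, ?)",
    [
        ("user", 1633205589, 1),
        ("guy", 16332060000, 3),
        ("account", 16332090000, 2),
        ("data", 16332095000, 4),
    ],
)

# Number the rows by ascending regdate, then copy each number back by login.
conn.execute(
    """
    UPDATE users
    SET ordering = (
        SELECT rn
        FROM (
            SELECT login, ROW_NUMBER() OVER (ORDER BY regdate) AS rn
            FROM users
        ) AS ranked
        WHERE ranked.login = users.login
    )
    """
)

renumbered = conn.execute(
    "SELECT login, ordering FROM users ORDER BY regdate"
).fetchall()
print(renumbered)
```

In MySQL 8 or MariaDB 10.2+ the UPDATE ... JOIN form shown in the answers does the same thing in one statement.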
I'm updating an existing table by adding data into an existing column.
I already have an output of the data to be inserted, but due to the number of records, I'm looking for the best way to insert this into my table without having to manually write each line of SQL.
Here's the (partial) SQL I want to insert:
INSERT INTO `tbl_user_variables_dobRE` (`user_id`, `value`) VALUES
(150, '1959-11-02'),
(151, '1948-04-20'),
(152, '1961-06-18'),
And this is the table I want to insert it into:
id | 7
username | guestinvite
password | BLANK
forname | forname
surname | surname
email | guestinvite@test.com
address_id | 286
type_id | 4
dob | 0000-00-00
plusGuest | 0
update | 2016-02-16 11:54:36
created | 2016-04-04 17:03:12
So I want to insert the second item into the 'dob' column where the first item = id.
Is there any way to do this programmatically, or do I have to write WHERE & OR statements for every line?
You tagged both MySql AND sql-server in your post. The following is assuming you're using SQL Server, but the idea would remain the same in MySQL (just different syntax)...
If I'm understanding correctly, it sounds like you want to do an UPDATE, not an INSERT, being that you're modifying existing rows.
You said that you have an output of the data to be inserted - Insert this into a TEMP table and JOIN it to the table you'd like to update where the id's match.
BEGIN TRANSACTION [Transaction1] -- Do large updates as transactions to avoid data loss
CREATE TABLE #temp ( -- Create temp table
[user_id] int,
[dob] nvarchar(20)
)
INSERT INTO #temp
-- YOUR SELECT GOES HERE
SELECT my_id as [user_id], my_dob as [dob]
UPDATE my_table
SET my_table.dob = t.dob
FROM tbl_user_variables_dobRE my_table
INNER JOIN #temp t ON t.user_id = my_table.id
DROP TABLE #temp
If your data looks good, commit the transaction: (Don't dwell too long, transactions lock table data!)
COMMIT TRANSACTION [Transaction1]
Otherwise:
ROLLBACK TRANSACTION [Transaction1]
The quickest way I can think of doing this is creating a temporary table with the new data that you want to add (you could possibly bulk import it all from say, a CSV file).
The temporary table will just need a couple of columns - one with user_id and the other one dob - you'll be getting rid of it after anyway.
You could then do something like this:
UPDATE tbl_user_variables_dobRE a
JOIN tmp_table b
ON ( a.user_id = b.user_id )
SET a.dob = b.dob
Once you've done that you can DROP your temporary table and be good to go - good luck!
Important
Be super-careful when updating data - it's so easy to mess up your data by forgetting to add a clause. If possible, do this with some test data before trying it with the real production data.
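The staging-table idea from both answers can be sketched end to end with Python's built-in sqlite3 module. This is only an illustration under assumptions: the target table is called users here (the question never names it), staging is a made-up name, and SQLite's correlated-subquery UPDATE stands in for MySQL's UPDATE ... JOIN.

```python
import sqlite3

def apply_dobs(conn, pairs):
    """Load (user_id, dob) pairs into a staging table and update users from it."""
    with conn:  # commit the data changes on success, roll them back on error
        conn.execute(
            "CREATE TEMP TABLE staging (user_id INTEGER PRIMARY KEY, dob TEXT)"
        )
        conn.executemany(
            "INSERT INTO staging (user_id, dob) VALUES (?, ?)", pairs
        )
        cur = conn.execute(
            """
            UPDATE users
            SET dob = (SELECT s.dob FROM staging AS s WHERE s.user_id = users.id)
            WHERE id IN (SELECT user_id FROM staging)
            """
        )
        conn.execute("DROP TABLE staging")
    return cur.rowcount  # number of rows that were updated

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, dob TEXT)")
conn.executemany(
    "INSERT INTO users (id, username, dob) VALUES (?, ?, ?)",
    [
        (7, "guestinvite", "0000-00-00"),
        (150, "user150", "0000-00-00"),
        (151, "user151", "0000-00-00"),
        (152, "user152", "0000-00-00"),
    ],
)

updated = apply_dobs(
    conn, [(150, "1959-11-02"), (151, "1948-04-20"), (152, "1961-06-18")]
)
dobs = dict(conn.execute("SELECT id, dob FROM users"))
print(updated, dobs)
```

The WHERE id IN (...) guard matters: without it, rows that have no entry in the staging table would have their dob overwritten with NULL.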
When I try to run the update query below, it takes about 40 hours to complete. So I added a time limitation (Update query with time limitation), but it still takes nearly the same time to complete. Is there any way to speed up this update?
EDIT: What I really want to do is get only the logs between some specific dates and run this update query on these records.
create table user
(userid varchar(30));
create table logs
( log_time timestamp,
log_detail varchar(100),
userid varchar(30));
insert into user values('user1');
insert into user values('user2');
insert into user values('user3');
insert into user values('');
insert into logs (log_detail, userid) values('no user mentioned','user3');
insert into logs (log_detail, userid) values('inserted by user2','user2');
insert into logs (log_detail, userid) values('inserted by user3',null);
Table before Update
log_time | log_detail | userid |
.. |-------------------|--------|
.. | no user mentioned | user3 |
.. | inserted by user2 | user2 |
.. | inserted by user3 | (null) |
Update query
update logs join user
set logs.userid=user.userid
where logs.log_detail LIKE concat("%",user.userID,"%") and user.userID != "";
Update query with time limitation
update logs join user
set logs.userid = IF (logs.log_time between '2015-08-01 17:39:44' AND '2015-08-11 00:39:41', user.userID, null)
where logs.log_detail LIKE concat("%",user.userID,"%") and user.userID != "";
Table after update
log_time | log_detail | userid |
.. |-------------------|--------|
.. | no user mentioned | user3 |
.. | inserted by user2 | user2 |
.. | inserted by user3 | user3 |
EDIT: Original question: Sql update statement with variable.
Log tables can easily fill up with tons of rows of data each month and even the best indexing won't help, especially in the case of a LIKE operator. Your log_detail column is 100 characters long and your search query is CONCAT("%",user.userID,"%"). Using a function in a SQL command can slow things down because the function is doing extra computations. And what you're trying to search for is, if your userID is John, %John%. So your query will scan every row in that table because indexes will be semi-useless. If you didn't have the first %, then the query would be able to utilize its indexes efficiently. Your query would, in effect, do an INDEX SCAN as opposed to an INDEX SEEK.
For more information on these concepts, see:
Index Seek VS Index Scan
Query tuning a LIKE operator
Alright, what can you do about this? Two strategies.
Option 1 is to limit the number of rows that you're searching
through. You had the right idea using time limitations to reduce the
number of rows to search through. What I would suggest is to put the
time limitation in the WHERE clause rather than in an IF, and to index
log_time. The optimizer decides the evaluation order itself (the order
in which you write the conditions does not matter), and with an index
on log_time it can narrow the rows down to the date range first, so the
LIKE only has to scan those rows.
update logs join user
set logs.userid=user.userid
where logs.log_time between '2015-08-01' and '2015-08-11'
and logs.log_detail LIKE concat('%',user.userID,'%')
and user.userID != '';
Option 2 depends on your control of the database. If you have total
control (and you have the time and money), MySQL has a feature called
Auto-Sharding. This is available in MySQL Cluster and MySQL
Fabric. I won't go over those products in much detail as the links
provided below can explain themselves much better than I could
summarize, but the idea behind Sharding is to split the rows into
horizontal tables, so to speak. The idea behind it is that you're
not searching through a long database table, but instead across
several sister tables at the same time. Searching through 10 tables
of 10 million rows is faster than searching through 1 table of 100
million rows.
Database Sharding - Wikipedia
MySQL Cluster
MySQL Fabric
First, the right place to put the time limitation is in the where clause, not an if:
update logs l left join
user u
on l.log_detail LIKE concat("%", u.userID)
set l.userid = u.userID
where l.log_time between '2015-08-01 17:39:44' AND '2015-08-11 00:39:41';
If you want to set the others to NULL do this before:
update logs l
set l.userid = NULL
where l.log_time not between '2015-08-01 17:39:44' AND '2015-08-11 00:39:41';
But, if you really want this to be fast, you need to use an index for the join. It is possible that this will use an index on user(userid):
update logs l left join
user u
on cast(substring_index(l.log_detail, ' ', -1) as char(30)) = u.userID
set l.userid = u.userID
where l.log_time between '2015-08-01 17:39:44' AND '2015-08-11 00:39:41';
Look at the explain on the equivalent select. It is really important that the cast() be to the same type as the UserId.
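To make the "extract the last word, then compare with equality" idea concrete, here is a small sketch using Python's built-in sqlite3 module. It is an illustration under assumptions, not the MySQL statement itself: last_word is a hypothetical helper registered to mimic SUBSTRING_INDEX(log_detail, ' ', -1), the log_time values are made up, and unlike the LEFT JOIN version it leaves rows alone when the last word is not a known user.

```python
import sqlite3

def last_word(text):
    """Rough equivalent of MySQL's SUBSTRING_INDEX(text, ' ', -1)."""
    return None if text is None else text.rsplit(" ", 1)[-1]

conn = sqlite3.connect(":memory:")
conn.create_function("last_word", 1, last_word)
conn.execute("CREATE TABLE user (userid TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE logs (log_time TEXT, log_detail TEXT, userid TEXT)")
conn.executemany("INSERT INTO user VALUES (?)", [("user1",), ("user2",), ("user3",)])
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?)",
    [
        ("2015-08-05 10:00:00", "no user mentioned", "user3"),
        ("2015-08-05 11:00:00", "inserted by user2", "user2"),
        ("2015-08-05 12:00:00", "inserted by user3", None),
        ("2015-09-01 09:00:00", "inserted by user1", None),  # outside the range
    ],
)

# Equality lookup against the user table instead of LIKE '%...%';
# the lower bound of BETWEEN comes first.
conn.execute(
    """
    UPDATE logs
    SET userid = last_word(log_detail)
    WHERE log_time BETWEEN ? AND ?
      AND last_word(log_detail) IN (SELECT u.userid FROM user AS u)
      AND userid IS NOT last_word(log_detail)
    """,
    ("2015-08-01 17:39:44", "2015-08-11 00:39:41"),
)

result = conn.execute(
    "SELECT log_detail, userid FROM logs ORDER BY log_time"
).fetchall()
print(result)
```

The last condition skips rows that already hold the right user, so only rows that actually change are written.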
You could add a new column called log_detail_reverse where a trigger can be set so that when you insert a new row, you also insert the log_detail column in reverse character order using the MySQL function reverse. When you're doing your update query, you also reverse the userID search. The net effect is that you then transform your INDEX SCAN to an INDEX SEEK, which will be much faster.
update logs join user
set logs.userid=user.userid
where logs.log_time between '2015-08-01' and '2015-08-11'
and logs.log_detail_reverse LIKE concat(reverse(user.userID), '%')
MySQL Trigger
The Trigger could be something like:
DELIMITER //
CREATE TRIGGER log_details_in_reverse
BEFORE INSERT
ON logs FOR EACH ROW
BEGIN
-- A trigger cannot UPDATE the table it fires on, so fill the new
-- column on the incoming row before it is stored
SET NEW.log_detail_reverse = REVERSE(NEW.log_detail);
END; //
DELIMITER ;
One thing about speeding up updates is not to update records that need no update. You only want to update records in a certain time range where the user doesn't match the user mentioned in the log text. Hence limit the records to be updated in your where clause.
update logs
set userid = substring_index(log_detail, ' ', -1)
where log_time between '2015-08-01 17:39:44' AND '2015-08-11 00:39:41'
and not userid <=> substring_index(log_detail, ' ', -1);
I'm currently working on a project with a MySQL Db of more than 8 million rows. I have been provided with a part of it to test some queries on it. It has around 20 columns out of which 5 are of use to me. Namely: First_Name, Last_Name, Address_Line1, Address_Line2, Address_Line3, RefundID
I have to create a unique but random RefundID for each row; that is not the problem. The problem is to create the same RefundID for those rows whose First_Name, Last_Name, Address_Line1, Address_Line2, Address_Line3 are the same.
This is my first real work related to MySQL with such large row count. So far I have created these queries:
-- Creating Temporary Table --
CREATE temporary table tempT (SELECT tt.First_Name, count(tt.Address_Line1) as
a1, count(tt.Address_Line2) as a2, count(tt.Address_Line3) as a3, tt.RefundID
FROM `tempTable` tt GROUP BY First_Name HAVING a1 >= 2 AND a2 >= 2 AND a3 >= 2);
-- Updating Rows with First_Name from tempT --
UPDATE `tempTable` SET RefundID = FLOOR(RAND()*POW(10,11))
WHERE First_Name IN (SELECT First_Name FROM tempT WHERE First_Name is not NULL);
This update query keeps on running but never ends, tempT has more than 30K rows. This query will then be run on the main DB with more than 800K rows.
Can someone help me out with this?
Regards
The solutions that seem obvious to me....
Don't use a random value - use a hash:
UPDATE yourtable
SET refundid = MD5(CONCAT_WS('|', 'some static salt', First_Name
, Last_Name, Address_Line1, Address_Line2, Address_Line3));
The problem is that if you are using an integer value for the refundId then there's a good chance of getting a collision (hint: CONV(SUBSTR(MD5(...),1,16),16,10) gives a 64-bit integer that fits in a BIGINT UNSIGNED; take 15 hex digits instead if the column is a signed BIGINT). But you didn't say what the type of the field was, nor how strict the 'unique' requirement was. It does carry out the update in a single pass though.
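The reason a hash works here is that identical inputs always give identical outputs, so rows with the same name and address end up with the same RefundID without any lookup. A minimal Python sketch of that property follows; refund_id is a hypothetical helper, the separator and NULL handling are my own choices, and the values will not match what MySQL's MD5 produces for the same row.

```python
import hashlib

def refund_id(first_name, last_name, line1, line2, line3, salt="some static salt"):
    """Derive a repeatable integer ID from the identifying fields."""
    fields = (salt, first_name, last_name, line1, line2, line3)
    # Join with a separator so ("ab", "c") and ("a", "bc") hash differently;
    # None (NULL) is treated as an empty string.
    key = "\x1f".join("" if f is None else f for f in fields)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # 15 hex digits = 60 bits, which fits in a signed 64-bit integer.
    return int(digest[:15], 16)

a = refund_id("Ann", "Lee", "1 Main St", None, None)
b = refund_id("Ann", "Lee", "1 Main St", None, None)
c = refund_id("Bob", "Lee", "1 Main St", None, None)
print(a == b, a != c)
```

Truncating the digest raises the collision risk, so how many digits to keep depends on how strict the uniqueness requirement is.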
An alternate approach which creates a densely packed sequence of numbers is to create a temporary table with the unique values from the original table and a random value. Order by the random value and set a monotonically increasing refundId - then use this as a look up table or update the original table:
CREATE TEMPORARY TABLE temptable AS
SELECT d.*, RAND() AS randomvalue, 0 AS refundId
FROM (SELECT DISTINCT First_Name
, Last_Name, Address_Line1, Address_Line2, Address_Line3
FROM yourtable) d;
SET @counter = -1;
UPDATE temptable t SET t.refundId = (@counter := @counter + 1)
ORDER BY t.randomvalue;
There are other solutions too - but the more efficient ones rely on having multiple copies of the data and/or using a procedural language.
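The lookup-table approach can be sketched in a few lines of Python to show the logic: collect the distinct groups, shuffle them, number them from 0, and map every row to its group's number. This is an illustration only; assign_dense_ids and the sample rows are made up, and for millions of rows you would do this in the database as described above.

```python
import random

def assign_dense_ids(rows, seed=None):
    """Give every distinct row a random but densely packed ID (0, 1, 2, ...)."""
    distinct = list(dict.fromkeys(rows))      # unique rows, first-seen order
    random.Random(seed).shuffle(distinct)     # randomize which group gets which ID
    lookup = {key: number for number, key in enumerate(distinct)}
    return [lookup[row] for row in rows]      # duplicates share the same ID

rows = [
    ("Ann", "Lee", "1 Main St"),
    ("Bob", "Ray", "2 Oak Ave"),
    ("Ann", "Lee", "1 Main St"),
    ("Cy", "Poe", "3 Elm Rd"),
]
ids = assign_dense_ids(rows, seed=42)
print(ids)
```

Because the IDs are just positions in the shuffled list, there are no gaps and no collisions, at the cost of needing the lookup table.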
Try using the following:
UPDATE `tempTable` x SET RefundID = FLOOR(RAND()*POW(10,11))
WHERE exists (SELECT 1 FROM tempT y WHERE First_Name is not NULL and x.First_Name=y.First_Name);
In MySQL, it is often more efficient to use join with update than to filter through the where clause using a subquery. The following might perform better:
UPDATE `tempTable` join
(SELECT distinct First_Name
FROM tempT
WHERE First_Name is not NULL
) fn
on `tempTable`.First_Name = fn.First_Name
SET RefundID = FLOOR(RAND()*POW(10,11));
I need to DELETE duplicated rows for specified sid on a MySQL table.
How can I do this with an SQL query?
DELETE (DUPLICATED TITLES) FROM table WHERE SID = "1"
Something like this, but I don't know how to do it.
This removes duplicates in place, without making a new table.
ALTER IGNORE TABLE `table_name` ADD UNIQUE (title, SID)
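Note: This only works well if the index fits in memory. ALTER IGNORE TABLE was removed in MySQL 5.7.4, so on newer MySQL versions this statement fails; MariaDB still supports it.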
Suppose you have a table employee, with the following columns:
employee (id, first_name, last_name, start_date)
In order to delete the rows with a duplicate first_name column:
delete
from employee using employee,
employee e1
where employee.id > e1.id
and employee.first_name = e1.first_name
Deleting duplicate rows in MySQL in place (assuming you have a timestamp column to sort by), a walkthrough:
Create the table and insert some rows:
create table penguins(foo int, bar varchar(15), baz datetime);
insert into penguins values(1, 'skipper', now());
insert into penguins values(1, 'skipper', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(4, 'rico', now());
select * from penguins;
+------+----------+---------------------+
| foo | bar | baz |
+------+----------+---------------------+
| 1 | skipper | 2014-08-25 14:21:54 |
| 1 | skipper | 2014-08-25 14:21:59 |
| 3 | kowalski | 2014-08-25 14:22:09 |
| 3 | kowalski | 2014-08-25 14:22:13 |
| 3 | kowalski | 2014-08-25 14:22:15 |
| 4 | rico | 2014-08-25 14:22:22 |
+------+----------+---------------------+
6 rows in set (0.00 sec)
Remove the duplicates in place:
delete a
from penguins a
left join(
select max(baz) maxtimestamp, foo, bar
from penguins
group by foo, bar) b
on a.baz = maxtimestamp and
a.foo = b.foo and
a.bar = b.bar
where b.maxtimestamp IS NULL;
Query OK, 3 rows affected (0.01 sec)
select * from penguins;
+------+----------+---------------------+
| foo | bar | baz |
+------+----------+---------------------+
| 1 | skipper | 2014-08-25 14:21:59 |
| 3 | kowalski | 2014-08-25 14:22:15 |
| 4 | rico | 2014-08-25 14:22:22 |
+------+----------+---------------------+
3 rows in set (0.00 sec)
You're done, duplicate rows are removed, last one by timestamp is kept.
For those of you without a timestamp or unique column.
You don't have a timestamp or a unique index column to sort by? You're living in a state of degeneracy. You'll have to do additional steps to delete duplicate rows.
create the penguins table and add some rows
create table penguins(foo int, bar varchar(15));
insert into penguins values(1, 'skipper');
insert into penguins values(1, 'skipper');
insert into penguins values(3, 'kowalski');
insert into penguins values(3, 'kowalski');
insert into penguins values(3, 'kowalski');
insert into penguins values(4, 'rico');
select * from penguins;
# +------+----------+
# | foo | bar |
# +------+----------+
# | 1 | skipper |
# | 1 | skipper |
# | 3 | kowalski |
# | 3 | kowalski |
# | 3 | kowalski |
# | 4 | rico |
# +------+----------+
make a clone of the first table and copy into it.
drop table if exists penguins_copy;
create table penguins_copy as ( SELECT foo, bar FROM penguins );
#add an autoincrementing primary key:
ALTER TABLE penguins_copy ADD moo int AUTO_INCREMENT PRIMARY KEY first;
select * from penguins_copy;
# +-----+------+----------+
# | moo | foo | bar |
# +-----+------+----------+
# | 1 | 1 | skipper |
# | 2 | 1 | skipper |
# | 3 | 3 | kowalski |
# | 4 | 3 | kowalski |
# | 5 | 3 | kowalski |
# | 6 | 4 | rico |
# +-----+------+----------+
The max aggregate operates upon the new moo index:
delete a from penguins_copy a left join(
select max(moo) myindex, foo, bar
from penguins_copy
group by foo, bar) b
on a.moo = b.myindex and
a.foo = b.foo and
a.bar = b.bar
where b.myindex IS NULL;
#drop the extra column on the copied table
alter table penguins_copy drop moo;
select * from penguins_copy;
#drop the first table and put the copy table back:
drop table penguins;
create table penguins select * from penguins_copy;
observe and cleanup
drop table penguins_copy;
select * from penguins;
+------+----------+
| foo | bar |
+------+----------+
| 1 | skipper |
| 3 | kowalski |
| 4 | rico |
+------+----------+
Elapsed: 1458.359 milliseconds
What's that big SQL delete statement doing?
Table penguins with alias 'a' is left joined on a subset of table penguins called alias 'b'. The right hand table 'b' which is a subset finds the max timestamp [ or max moo ] grouped by columns foo and bar. This is matched to left hand table 'a'. (foo,bar,baz) on left has every row in the table. The right hand subset 'b' has a (maxtimestamp,foo,bar) which is matched to left only on the one that IS the max.
Every row that is not that max has value maxtimestamp of NULL. Filter down on those NULL rows and you have a set of all rows grouped by foo and bar that isn't the latest timestamp baz. Delete those ones.
Make a backup of the table before you run this.
Prevent this problem from ever happening again on this table:
If you got this to work and it put out your "duplicate row" fire, great. Now define a new composite unique key on your table (on those two columns) to prevent more duplicates from being added in the first place.
Like a good immune system, the bad rows shouldn't even be allowed in to the table at the time of insert. Later on all those programs adding duplicates will broadcast their protest, and when you fix them, this issue never comes up again.
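Here is a small sketch of what that composite unique key buys you, using Python's built-in sqlite3 module as a stand-in for MySQL (where you would run ALTER TABLE penguins ADD UNIQUE (foo, bar) instead). The index name and the try_insert helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE penguins (foo INTEGER, bar TEXT)")
conn.executemany(
    "INSERT INTO penguins VALUES (?, ?)",
    [(1, "skipper"), (3, "kowalski"), (4, "rico")],
)
# Composite unique key on the two columns that define a duplicate.
conn.execute("CREATE UNIQUE INDEX penguins_foo_bar ON penguins (foo, bar)")

def try_insert(conn, foo, bar):
    """Return True if the row was stored, False if the unique key rejected it."""
    try:
        conn.execute("INSERT INTO penguins VALUES (?, ?)", (foo, bar))
        return True
    except sqlite3.IntegrityError:
        return False

dup_ok = try_insert(conn, 1, "skipper")   # same (foo, bar) as an existing row
new_ok = try_insert(conn, 5, "private")   # new combination
total = conn.execute("SELECT COUNT(*) FROM penguins").fetchone()[0]
print(dup_ok, new_ok, total)
```

The programs that were inserting duplicates now get an error at insert time, which is exactly the "protest" mentioned above.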
The following removes duplicates for all SIDs, not only a single one.
With temp table
CREATE TABLE table_temp AS
SELECT * FROM `table` GROUP BY title, SID;
DROP TABLE `table`;
RENAME TABLE table_temp TO `table`;
Since table_temp is freshly created it has no indexes. You'll need to recreate them after removing duplicates. You can check what indexes you have in the table with SHOW INDEXES IN `table`
Without temp table:
DELETE FROM `table` WHERE id IN (
SELECT all_duplicates.id FROM (
SELECT id FROM `table` WHERE (`title`, `SID`) IN (
SELECT `title`, `SID` FROM `table` GROUP BY `title`, `SID` having count(*) > 1
)
) AS all_duplicates
LEFT JOIN (
SELECT MIN(id) AS id FROM `table` GROUP BY `title`, `SID` having count(*) > 1
) AS grouped_duplicates
ON all_duplicates.id = grouped_duplicates.id
WHERE grouped_duplicates.id IS NULL
)
After running into this issue myself, on a huge database, I wasn't completely impressed with the performance of any of the other answers. I want to keep only the latest duplicate row, and delete the rest.
In a one-query statement, without a temp table, this worked best for me,
DELETE e.*
FROM employee e
WHERE id IN
(SELECT id
FROM (SELECT MIN(id) as id
FROM employee e2
GROUP BY first_name, last_name
HAVING COUNT(*) > 1) x);
The only caveat is that I have to run the query multiple times, but even with that, I found it worked better for me than the other options.
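The reason for the multiple runs is that each pass removes only one row (the lowest id) per duplicate group, so a group with three copies needs two passes. A sketch of the "repeat until nothing is deleted" loop, using Python's built-in sqlite3 module as a stand-in for MySQL; delete_older_duplicates and the sample data are made up for the example.

```python
import sqlite3

def delete_older_duplicates(conn):
    """Repeat the delete until no duplicate groups remain; return pass count."""
    passes = 0
    while True:
        cur = conn.execute(
            """
            DELETE FROM employee
            WHERE id IN (
                SELECT MIN(id)
                FROM employee
                GROUP BY first_name, last_name
                HAVING COUNT(*) > 1
            )
            """
        )
        if cur.rowcount == 0:   # nothing deleted: every group is unique now
            return passes
        passes += 1

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        (1, "Ann", "Lee"), (2, "Ann", "Lee"), (3, "Ann", "Lee"),
        (4, "Bob", "Ray"), (5, "Bob", "Ray"),
        (6, "Cy", "Poe"),
    ],
)

passes = delete_older_duplicates(conn)
remaining = [row[0] for row in conn.execute("SELECT id FROM employee ORDER BY id")]
print(passes, remaining)
```

The number of passes equals the size of the largest duplicate group minus one, so this stays cheap when most groups have only a couple of copies.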
This always seems to work for me:
CREATE TABLE NoDupeTable LIKE DupeTable;
INSERT NoDupeTable SELECT * FROM DupeTable group by CommonField1,CommonFieldN;
Which keeps the lowest ID on each of the dupes and the rest of the non-dupe records.
I've also taken to doing the following so that the dupe issue no longer occurs after the removal:
CREATE TABLE NoDupeTable LIKE DupeTable;
Alter table NoDupeTable Add Unique `Unique` (CommonField1,CommonField2);
INSERT IGNORE NoDupeTable SELECT * FROM DupeTable;
In other words, I create a copy of the first table's structure, add a unique index on the fields I don't want duplicated, and then do an INSERT IGNORE. Unlike a normal INSERT, which would fail the first time it tried to add a duplicate record based on the two fields, INSERT IGNORE simply skips such records.
Going forward, it becomes impossible to create any duplicate records based on those two fields.
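The copy-with-unique-index idea can be tried out with Python's built-in sqlite3 module, where INSERT OR IGNORE plays the role of MySQL's INSERT IGNORE. This is a sketch under assumptions: the table and column names are made up, and the ORDER BY id is my addition so that the lowest id of each duplicate set is the one that gets in first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dupe (id INTEGER PRIMARY KEY, title TEXT, sid INTEGER)")
conn.executemany(
    "INSERT INTO dupe VALUES (?, ?, ?)",
    [(1, "a", 1), (2, "a", 1), (3, "b", 1), (4, "a", 2), (5, "b", 1)],
)

# Same columns as dupe, plus the unique constraint that defines a duplicate.
conn.execute(
    """
    CREATE TABLE nodupe (
        id INTEGER PRIMARY KEY,
        title TEXT,
        sid INTEGER,
        UNIQUE (title, sid)
    )
    """
)

# Rows that would violate the unique constraint are silently skipped.
conn.execute(
    "INSERT OR IGNORE INTO nodupe SELECT id, title, sid FROM dupe ORDER BY id"
)

kept = conn.execute("SELECT id, title, sid FROM nodupe ORDER BY id").fetchall()
print(kept)
```

After checking the result, the original table can be dropped and the new one renamed into its place.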
The following works for any table, as long as the duplicate rows are identical in every column:
CREATE TABLE `noDup` LIKE `Dup` ;
INSERT `noDup` SELECT DISTINCT * FROM `Dup` ;
DROP TABLE `Dup` ;
ALTER TABLE `noDup` RENAME `Dup` ;
Here is a simple answer:
delete a from target_table a left JOIN (select max(id_field) as id, field_being_repeated
from target_table GROUP BY field_being_repeated) b
on a.field_being_repeated = b.field_being_repeated
and a.id_field = b.id_field
where b.id_field is null;
This works for me to remove old records:
delete from table where id in
(select min(e.id)
from (select * from table) e
group by column1, column2
having count(*) > 1
);
You can replace min(e.id) with max(e.id) to remove the newest records.
delete p from
product p
inner join (
select max(id) as id, url from product
group by url
having count(*) > 1
) unik on unik.url = p.url and unik.id != p.id;
I find Werner's solution above to be the most convenient because it works regardless of the presence of a primary key, doesn't mess with tables, uses future-proof plain SQL, and is very understandable.
As I stated in my comment, that solution hasn't been properly explained though.
So this is mine, based on it.
1) add a new boolean column
alter table mytable add tokeep boolean;
2) add a constraint on the duplicated columns AND the new column
alter table mytable add constraint preventdupe unique (mycol1, mycol2, tokeep);
3) set the boolean column to true. This will succeed only on one of the duplicated rows because of the new constraint
update ignore mytable set tokeep = true;
4) delete rows that have not been marked as tokeep
delete from mytable where tokeep is null;
5) drop the added column
alter table mytable drop tokeep;
I suggest that you keep the constraint you added, so that new duplicates are prevented in the future.
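The trick relies on NULLs not counting as equal in a unique constraint, so the constraint can be added while duplicates still exist, and only the first row of each set can then be marked. The steps can be sketched with Python's built-in sqlite3 module, where UPDATE OR IGNORE stands in for MySQL's UPDATE IGNORE; step 5 (dropping the column) is left out, and the sample data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (mycol1 TEXT, mycol2 TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("a", "x"), ("b", "y"), ("c", "z"), ("c", "z")],
)

# 1) new marker column, NULL for every existing row
conn.execute("ALTER TABLE mytable ADD COLUMN tokeep INTEGER")
# 2) unique constraint over the duplicated columns plus the marker;
#    NULL markers are treated as distinct, so this succeeds despite duplicates
conn.execute("CREATE UNIQUE INDEX preventdupe ON mytable (mycol1, mycol2, tokeep)")
# 3) try to mark every row; rows that would collide are skipped
conn.execute("UPDATE OR IGNORE mytable SET tokeep = 1")
# 4) whatever could not be marked is a duplicate
conn.execute("DELETE FROM mytable WHERE tokeep IS NULL")

remaining = conn.execute(
    "SELECT mycol1, mycol2 FROM mytable ORDER BY mycol1"
).fetchall()
print(remaining)
```

Which copy of each duplicate survives depends on the order in which the rows are visited, so this suits cases where the copies are interchangeable.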
This procedure will remove all duplicates (incl multiples) in a table, keeping the last duplicate. This is an extension of Retrieving last record in each group
Hope this is useful to someone.
DROP TABLE IF EXISTS UniqueIDs;
CREATE Temporary table UniqueIDs (id Int(11));
INSERT INTO UniqueIDs
(SELECT T1.ID FROM Table T1 LEFT JOIN Table T2 ON
(T1.Field1 = T2.Field1 AND T1.Field2 = T2.Field2 #Comparison Fields
AND T1.ID < T2.ID)
WHERE T2.ID IS NULL);
DELETE FROM Table WHERE id NOT IN (SELECT ID FROM UniqueIDs);
Another easy way... using UPDATE IGNORE:
You have to use an index on one or more columns (type index).
Create a new temporary reference column (not part of the index). In this column, you mark the uniques in by updating it with ignore clause. Step by step:
Add a temporary reference column to mark the uniques:
ALTER TABLE `yourtable` ADD `unique` VARCHAR(3) NOT NULL AFTER `lastcolname`;
=> this will add a column to your table.
Update the table, try to mark everything as unique, but ignore possible errors due to duplicate key issues (records will be skipped):
UPDATE IGNORE `yourtable` SET `unique` = 'Yes' WHERE 1;
=> you will find your duplicate records will not be marked as unique = 'Yes', in other words only one of each set of duplicate records will be marked as unique.
Delete everything that's not unique:
DELETE FROM `yourtable` WHERE `unique` <> 'Yes';
=> This will remove all duplicate records.
Drop the column...
ALTER TABLE `yourtable` DROP `unique`;
If you want to keep the row with the lowest id value:
DELETE n1 FROM `yourTableName` n1, `yourTableName` n2 WHERE n1.id > n2.id AND n1.email = n2.email;
If you want to keep the row with the highest id value:
DELETE n1 FROM `yourTableName` n1, `yourTableName` n2 WHERE n1.id < n2.id AND n1.email = n2.email;
Deleting duplicates on MySQL tables is a common issue, that usually comes with specific needs. In case anyone is interested, here (Remove duplicate rows in MySQL) I explain how to use a temporary table to delete MySQL duplicates in a reliable and fast way, also valid to handle big data sources (with examples for different use cases).
Ali, in your case, you can run something like this:
-- create a new temporary table
CREATE TABLE tmp_table1 LIKE table1;
-- add a unique constraint
ALTER TABLE tmp_table1 ADD UNIQUE(sid, title);
-- scan over the table to insert entries
INSERT IGNORE INTO tmp_table1 SELECT * FROM table1 ORDER BY sid;
-- rename tables
RENAME TABLE table1 TO backup_table1, tmp_table1 TO table1;
delete from `table` where `table`.`SID` in
(
select t.SID from `table` t join `table` t1 on t.title = t1.title where t.SID > t1.SID
)
Love @eric's answer but it doesn't seem to work if you have a really big table (I'm getting The SELECT would examine more than MAX_JOIN_SIZE rows; check your WHERE and use SET SQL_BIG_SELECTS=1 or SET MAX_JOIN_SIZE=# if the SELECT is okay when I try to run it). So I limited the join query to only consider the duplicate rows and I ended up with:
DELETE a FROM penguins a
LEFT JOIN (SELECT COUNT(baz) AS num, MIN(baz) AS keepBaz, foo
FROM penguins
GROUP BY foo HAVING num > 1) b
ON a.baz != b.keepBaz
AND a.foo = b.foo
WHERE b.foo IS NOT NULL
The WHERE clause in this case allows MySQL to ignore any row that doesn't have a duplicate, and the join condition also skips the first instance of each duplicate, so only subsequent duplicates will be deleted. Change MIN(baz) to MAX(baz) to keep the last instance instead of the first.
This works for large tables:
CREATE Temporary table duplicates AS select max(id) as id, url from links group by url having count(*) > 1;
DELETE l from links l inner join duplicates ld on ld.id = l.id WHERE ld.id IS NOT NULL;
To delete oldest change max(id) to min(id)
This will make the column column_name into a primary key and, in the meantime, ignore all errors, so it will delete the rows with a duplicate value for column_name. Note that ALTER IGNORE TABLE was removed in MySQL 5.7.4; it is still available in MariaDB.
ALTER IGNORE TABLE `table_name` ADD PRIMARY KEY (`column_name`);
I think this will work by basically copying the table and emptying it then putting only the distinct values back into it but please double check it before doing it on large amounts of data.
Creates a carbon copy of your table
create table temp_table like oldtablename;
insert temp_table select * from oldtablename;
Empties your original table
DELETE FROM oldtablename;
Copies all distinct values from the copied table back to your original table
INSERT oldtablename SELECT * from temp_table group by firstname,lastname,dob
Deletes your temp table.
Drop Table temp_table
You need to group by all fields that you want to keep distinct.
DELETE T2
FROM table_name T1
JOIN same_table_name T2 ON (T1.title = T2.title AND T1.ID < T2.ID);
Here is how I usually eliminate duplicates:
1) add a temporary column and name it whatever you want (I'll refer to it as active)
2) group by the fields that shouldn't be duplicated and set active to 1 on the selected rows; grouping selects only one row from each set of duplicates
3) delete the rows where active is zero
4) drop the active column
5) optionally (if it fits your purposes), add a unique index on those columns so duplicates can't come back
You could just use a DISTINCT clause to select the "cleaned up" list (and here is a very easy example on how to do that).
Could it work if you count them, and then add a limit to your delete query leaving just one?
For example, if you have two or more, write your query like this:
DELETE FROM table WHERE SID = 1 LIMIT 1;
There are just a few basic steps when removing duplicate data from your table:
Back up your table!
Find the duplicate rows
Remove the duplicate rows
Here is the full tutorial: https://blog.teamsql.io/deleting-duplicate-data-3541485b3473