Delete duplicate entries while keeping one - mysql

I have a table but it has no unique ID or primary key.
It has 3 columns in total.
name  user_id  role_id
ben   1        2
ben   1        2
sam   1        3
I'd like to remove one entry with the name Ben.
So output would look like this
name  user_id  role_id
ben   1        2
sam   1        3
Most of the examples show how to delete duplicate entries using an ID or primary key. How would I retain one entry whilst removing the others when there is no such column?
Using the following query I was able to get duplicated rows
SELECT name, user_id, role_id, count(*) FROM some_table
GROUP BY name, user_id, role_id
HAVING count(*) > 1
To clarify, I am looking to delete these rows.
I would prefer not to create a new table.

If you don't have to worry about other users accessing the table -
CREATE TABLE `new_table` AS
SELECT DISTINCT `name`, `user_id`, `role_id`
FROM `old_table`;
RENAME TABLE
`old_table` TO `backup`,
`new_table` TO `old_table`;
Or you could use your duplicates query to generate one DELETE statement per duplicated group, each removing all but one copy -
SELECT
`name`,
`user_id`,
`role_id`,
COUNT(*),
CONCAT('DELETE FROM some_table WHERE name=', QUOTE(`name`), ' AND user_id=', `user_id`, ' AND role_id=', `role_id`, ' LIMIT ', COUNT(*) - 1, ';') AS `delete_stmt`
FROM `some_table`
GROUP BY `name`, `user_id`, `role_id`
HAVING COUNT(*) > 1;
Or you could temporarily add a SERIAL column and then remove it after the delete -
ALTER TABLE `some_table` ADD COLUMN `temp_id` SERIAL;
DELETE `t1`.*
FROM `some_table` `t1`
LEFT JOIN (
SELECT MIN(`temp_id`) `min_temp_id`
FROM `some_table`
GROUP BY `name`, `user_id`, `role_id`
) `t2` ON `t1`.`temp_id` = `t2`.`min_temp_id`
WHERE `t2`.`min_temp_id` IS NULL;
ALTER TABLE `some_table` DROP COLUMN `temp_id`;
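The keep-the-lowest-id idea can be tried out without a MySQL server. This is only a rough sketch using Python's built-in sqlite3 module as a stand-in: SQLite has no DELETE ... JOIN, so the delete is written with NOT IN, and SQLite's implicit rowid plays the role of the temporary `temp_id` column. The function name dedupe_keep_one is made up for illustration.

```python
import sqlite3

def dedupe_keep_one(conn):
    """Delete every row except the lowest-rowid copy of each
    (name, user_id, role_id) combination."""
    conn.execute(
        """
        DELETE FROM some_table
        WHERE rowid NOT IN (
            SELECT MIN(rowid)
            FROM some_table
            GROUP BY name, user_id, role_id
        )
        """
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (name TEXT, user_id INTEGER, role_id INTEGER)")
# Sample rows taken from the question
conn.executemany(
    "INSERT INTO some_table VALUES (?, ?, ?)",
    [("ben", 1, 2), ("ben", 1, 2), ("sam", 1, 3)],
)
dedupe_keep_one(conn)
rows = conn.execute(
    "SELECT name, user_id, role_id FROM some_table ORDER BY name"
).fetchall()
print(rows)
```

On MySQL itself the LEFT JOIN form above is the one to use; the sketch only demonstrates which rows survive.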

Note that you are not saving anything by not having a primary key; MySQL (at least with InnoDB) needs a clustered index and, if the table has no primary key or suitable unique NOT NULL index, will create a hidden one for you. So I would first add a primary key:
alter table some_table add id serial primary key;
Then you can easily remove duplicates with:
delete a from some_table a join some_table b on a.name=b.name and a.user_id=b.user_id and a.role_id=b.role_id and b.id < a.id;

I would take the duplicate records and put them into another table.
CREATE TABLE some_new_table AS
SELECT
name,
user_id,
role_id
FROM some_table
GROUP BY name, user_id, role_id
HAVING count(*) > 1;
Then you can delete those records from your source table
DELETE a
FROM some_table a
INNER JOIN some_new_table b
ON a.name = b.name
AND a.user_id = b.user_id
AND a.role_id = b.role_id
Finally you can then insert the deduped records back into your table.
INSERT INTO some_table
SELECT
name,
user_id,
role_id
FROM some_new_table
If the volume of dupes is very large, you could also just create a new table with the deduped data, then either truncate the old table and insert from the new one, or drop the old table and rename the new one.

Related

MySql group by columns and assign them a unique group_id

Basically, this is an abstract task about removing duplicates from a database table whose rows are linked by id to several other tables...
For each set of repeating rows in the table, I need to assign a common group_id equal to the max(id) of the rows in that set. Please help.
My question in picture
https://i.stack.imgur.com/CVYG1.png
You can group the entities (to find the group ids) at first and then update the group_id of entities.
Example:
UPDATE `t`,
(SELECT `name`, `surname`, MAX(`id`) AS `group_id` FROM `t`
GROUP BY `name`, `surname`) AS `t1`
SET `t`.`group_id` = `t1`.`group_id`
WHERE `t`.`name` = `t1`.`name` AND `t`.`surname` = `t1`.`surname`
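To check what the update produces, here is a small sketch in Python with the built-in sqlite3 module standing in for MySQL. The sample rows are invented. SQLite accepts a correlated subquery on the table being updated, so the sketch uses that form; MySQL rejects it with error 1093, which is why the join form above is used there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT, surname TEXT, group_id INTEGER)"
)
# Hypothetical sample data: ids 1 and 3 are the same person
conn.executemany(
    "INSERT INTO t (id, name, surname) VALUES (?, ?, ?)",
    [(1, "ann", "lee"), (2, "bob", "ray"), (3, "ann", "lee"), (4, "ann", "ray")],
)
# Every row gets the highest id found among rows with the same name and surname
conn.execute(
    """
    UPDATE t
    SET group_id = (
        SELECT MAX(t2.id)
        FROM t AS t2
        WHERE t2.name = t.name AND t2.surname = t.surname
    )
    """
)
groups = conn.execute("SELECT id, group_id FROM t ORDER BY id").fetchall()
print(groups)
```

Rows 1 and 3 end up sharing group_id 3, while the unique rows simply get their own id.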

insert extra unwanted row to table

Why does executing this SQL query in MySQL add 2 rows to the table? Is the query executed twice?
INSERT INTO user(`usr_name`, `email`, `name`, `reg_date`, `role_id`)
(
SELECT "editor1",
"editor1#example.com",
"editor1",
"2005-12-20",
2
FROM `user`
WHERE (("admin", 3) IN (
SELECT usr_name, role_id
FROM `user`
)
AND NOT EXISTS (
SELECT usr_name, email
FROM `user`
WHERE usr_name = "editor1" OR email = "editor1#example.com"
))
)
result is here!
Apparently, two rows in user match the WHERE conditions.
The outer query selects FROM `user` but never uses any of its columns, so the constants are returned once for every row of user that passes the WHERE. So how about this instead:
INSERT INTO user(`usr_name`, `email`, `name`, `reg_date`, `role_id`)
SELECT t.*
FROM (SELECT 'editor1' as usr_name, 'editor1#example.com' as email,
'editor1' as name, '2005-12-20' as reg_date, 2 as role_id
) t
WHERE ('admin', 3) IN (SELECT usr_name, role_id
FROM `user`
) AND
NOT EXISTS (SELECT usr_name, email
FROM `user` u
WHERE u.usr_name = t.usr_name OR u.email = t.email
)
Or, better yet, put unique indexes on the fields that you don't want duplicated in the table:
create unique index idx_user_username on user(usr_name);
create unique index idx_user_email on user(email);
Let the database protect the table. It is there to help you.
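As a rough illustration of the database doing the protecting, the sketch below uses Python's built-in sqlite3 module in place of MySQL (an assumption made so it runs anywhere). The helper add_user is hypothetical. The second insert of the same user is rejected by the unique indexes instead of silently creating a duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (usr_name TEXT, email TEXT, name TEXT, reg_date TEXT, role_id INTEGER)"
)
conn.execute("CREATE UNIQUE INDEX idx_user_username ON user(usr_name)")
conn.execute("CREATE UNIQUE INDEX idx_user_email ON user(email)")

def add_user(row):
    """Try to insert a user; return False if a unique index rejects it."""
    try:
        conn.execute("INSERT INTO user VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

editor = ("editor1", "editor1#example.com", "editor1", "2005-12-20", 2)
first = add_user(editor)   # accepted
second = add_user(editor)  # rejected: usr_name and email already exist
count = conn.execute("SELECT COUNT(*) FROM user").fetchone()[0]
print(first, second, count)
```

In MySQL the rejected insert surfaces as a duplicate-key error, which the application can catch in the same way.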

How to sum(amount) and group by user_id on the same table

I have a table with many rows per user_id,
and I am trying to group the rows by user_id and sum their amount.
This is the table structure:
#  Name     Type           Null  Default
1  user_id  int(11)        No    None
2  amount   decimal(16,8)  No    None
3  aff      int(11)        No    0
4  jackpot  int(11)        No    0
5  paidout  int(11)        No    0
6  type     int(11)        No    0
7  created  timestamp      No    CURRENT_TIMESTAMP
I am trying this query without success:
update trans
SELECT * FROM trans group by user_id
set amount = (select sum(amount) from trans
Any help would be appreciated
You can do something like this :
UPDATE trans t
INNER JOIN (
select user_id, sum(amount) sumAmount
from trans
group by user_id
) subSum on subSum.user_id = t.user_id
SET t.amount = subSum.sumAmount
With user_id range :
UPDATE trans t
INNER JOIN (
select user_id, sum(amount) sumAmount
from trans
where user_id BETWEEN 0 AND 1000 -- sum only the users in this range
group by user_id
) subSum on subSum.user_id = t.user_id
SET t.amount = subSum.sumAmount
WHERE t.user_id BETWEEN 0 AND 1000 -- update only the users in this range
Using a temp table:
-- Create a table with user_id and the summed amount
CREATE TABLE trans_temp_sum_amount
SELECT user_id, sum(amount) sumAmount
FROM trans
GROUP BY user_id;
-- Update
UPDATE trans t
INNER JOIN trans_temp_sum_amount subSum
on subSum.user_id = t.user_id
SET t.amount = subSum.sumAmount;
-- Drop the temp table
DROP TABLE trans_temp_sum_amount;
I would advise you to use a VIEW instead of "deleting" all your old data:
CREATE VIEW trans_view AS
SELECT user_id, SUM(amount) AS total_amount FROM trans GROUP BY user_id;
I would also advise you to look at your program and update the amount every time it changes, instead of inserting a new row every time.
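A view keeps the original rows and recomputes the totals on every read. The sketch below shows that behaviour with Python's built-in sqlite3 module as a stand-in for MySQL; the sample amounts and the column alias total_amount are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans (user_id INTEGER, amount REAL)")
# Hypothetical sample rows
conn.executemany(
    "INSERT INTO trans VALUES (?, ?)", [(1, 1.5), (1, 2.5), (2, 4.0)]
)
conn.execute(
    """
    CREATE VIEW trans_view AS
    SELECT user_id, SUM(amount) AS total_amount
    FROM trans
    GROUP BY user_id
    """
)
query = "SELECT user_id, total_amount FROM trans_view ORDER BY user_id"
totals = conn.execute(query).fetchall()

# A new row is picked up by the view without any UPDATE
conn.execute("INSERT INTO trans VALUES (2, 1.0)")
totals_after = conn.execute(query).fetchall()
print(totals, totals_after)
```

The underlying trans rows are untouched, so the per-row history remains available if it is needed later.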

how to Delete Duplicate Rows but keeping 1 based on two columns

I have table called scheduler. It contains following columns:
ID
sequence_id
schedule_time (timestamp)
processed
source_order
I need to delete duplicate rows from the table where processed=0, keeping one row for each combination of schedule_time and source_order within a particular sequence_id.
DELETE yourTable FROM yourTable LEFT OUTER JOIN (
SELECT MIN(ID) AS minID FROM yourTable WHERE processed = 0 GROUP BY schedule_time, source_order
) AS keepRowTable ON yourTable.ID = keepRowTable.minID
WHERE keepRowTable.ID IS NULL AND processed = 0
I adapted this from the post "How can I remove duplicate rows?". Have you seen it?
Fixed version (the derived table exposes only minID, so the NULL check must use minID rather than ID):
DELETE yourTable FROM yourTable LEFT OUTER JOIN (
SELECT MIN(ID) AS minID FROM yourTable WHERE processed = 0 GROUP BY schedule_time, source_order
) AS keepRowTable ON yourTable.ID = keepRowTable.minID
WHERE keepRowTable.minID IS NULL AND processed = 0
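To see which rows such a delete keeps, here is a hedged sketch in Python using the built-in sqlite3 module instead of MySQL. SQLite lacks DELETE ... JOIN, so it uses NOT IN. It also groups by sequence_id in addition to schedule_time and source_order, which is one reading of the question rather than what the query above does. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE scheduler (
        ID INTEGER PRIMARY KEY,
        sequence_id INTEGER,
        schedule_time TEXT,
        processed INTEGER,
        source_order INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO scheduler VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "2020-01-01 00:00:00", 0, 5),
        (2, 10, "2020-01-01 00:00:00", 0, 5),  # duplicate of ID 1
        (3, 10, "2020-01-01 00:00:00", 1, 5),  # already processed: left alone
        (4, 11, "2020-01-01 00:00:00", 0, 5),  # different sequence_id
        (5, 10, "2020-01-02 00:00:00", 0, 5),  # different schedule_time
    ],
)
# Among unprocessed rows, keep only the lowest ID of each group
conn.execute(
    """
    DELETE FROM scheduler
    WHERE processed = 0
      AND ID NOT IN (
          SELECT MIN(ID)
          FROM scheduler
          WHERE processed = 0
          GROUP BY sequence_id, schedule_time, source_order
      )
    """
)
remaining = [r[0] for r in conn.execute("SELECT ID FROM scheduler ORDER BY ID")]
print(remaining)
```

Only ID 2 is removed; the processed row and the rows from other groups stay in place.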
For MySQL:
DELETE a FROM tbl a, tbl b WHERE a.Id > b.Id AND
a.sequence_id = b.sequence_id AND a.schedule_time = b.schedule_time AND
a.source_order = b.source_order AND a.processed = 0 AND b.processed = 0;
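The fastest way to remove duplicates is to force them out by adding a unique key with IGNORE, which leaves only one copy of each row in the table. (ALTER IGNORE TABLE was removed in MySQL 5.7.4, so this only works on older versions.) List only the columns that define a duplicate; including a column that is already unique, such as ID, would make every row distinct:
ALTER IGNORE TABLE dates ADD PRIMARY KEY (
sequence_id,
schedule_time,
processed,
source_order
)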
Now if you already have a key, you might need to drop it first and so on, but the point is that when you add a unique key with IGNORE to a table with duplicates, the behavior is to delete all the extra records. So after you have added this key, you just need to drop it again if you want to allow new duplicates :-)
Now if you need more complex filtering on which one of the duplicates to keep, something you cannot express through the index (although that is unlikely), you can create a table and fill it with the rows you want in a single statement:
CREATE TABLE tmp SELECT ..fields.. FROM original_table GROUP BY ..what you need..;
DROP TABLE original_table;
ALTER TABLE tmp RENAME TO original_table;

How to remove duplicate entries from a mysql db?

I have a table with some ids + titles. I want to make the title column unique, but it has over 600k records already, some of which are duplicates (sometimes several dozen times over).
How do I remove all duplicates, except one, so I can add a UNIQUE key to the title column after?
This command adds a unique key, and drops all rows that generate errors (due to the unique key). This removes duplicates.
ALTER IGNORE TABLE table ADD UNIQUE KEY idx1(title);
Edit: Note that this command may not work for InnoDB tables for some versions of MySQL. See this post for a workaround. (Thanks to "an anonymous user" for this information.)
Create a new table with just the distinct rows of the original table. There may be other ways but I find this the cleanest.
CREATE TABLE tmp_table AS SELECT DISTINCT [....] FROM main_table
More specifically:
The faster way is to insert distinct rows into a temporary table. Using delete, it took me a few hours to remove duplicates from a table of 8 million rows. Using insert and distinct, it took just 13 minutes.
CREATE TABLE tempTableName LIKE tableName;
CREATE INDEX ix_all_id ON tableName(cellId,attributeId,entityRowId,value);
INSERT INTO tempTableName(cellId,attributeId,entityRowId,value) SELECT DISTINCT cellId,attributeId,entityRowId,value FROM tableName;
TRUNCATE TABLE tableName;
INSERT INTO tableName SELECT * FROM tempTableName;
DROP TABLE tempTableName;
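The copy-distinct-rows approach can be sketched end to end as follows. This uses Python's built-in sqlite3 module rather than MySQL, so the swap is done with SQLite's ALTER TABLE ... RENAME TO instead of the truncate-and-reinsert step; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tableName (cellId INTEGER, attributeId INTEGER, entityRowId INTEGER, value TEXT)"
)
# Hypothetical data: the first row appears three times
conn.executemany(
    "INSERT INTO tableName VALUES (?, ?, ?, ?)",
    [(1, 1, 1, "a"), (1, 1, 1, "a"), (1, 1, 1, "a"), (2, 1, 1, "b")],
)
conn.commit()

# Copy the distinct rows, then swap the new table in under the old name
conn.executescript(
    """
    CREATE TABLE tempTableName AS
        SELECT DISTINCT cellId, attributeId, entityRowId, value
        FROM tableName;
    DROP TABLE tableName;
    ALTER TABLE tempTableName RENAME TO tableName;
    """
)
deduped = conn.execute(
    "SELECT cellId, attributeId, entityRowId, value FROM tableName ORDER BY cellId"
).fetchall()
print(deduped)
```

Note that a table created from a SELECT does not carry over indexes or constraints, so those would need to be recreated after the swap.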
Since MySQL's ALTER IGNORE TABLE has been deprecated (and removed as of MySQL 5.7.4), you need to actually delete the duplicate data before adding a unique index.
First write a query that finds all the duplicates. Here I'm assuming that email is the field that contains duplicates.
SELECT
s1.email,
s1.id,
s1.created,
s2.id,
s2.created
FROM
student AS s1
INNER JOIN
student AS s2
WHERE
/* Emails are the same */
s1.email = s2.email AND
/* DON'T select both accounts,
only select the one created later.
The serial id could also be used here */
s2.created > s1.created
;
Next select only the unique duplicate ids:
SELECT
DISTINCT s2.id
FROM
student AS s1
INNER JOIN
student AS s2
WHERE
s1.email = s2.email AND
s2.created > s1.created
;
Once you are sure that the result contains only the duplicate ids you want to delete, run the delete. You have to wrap the table as (SELECT * FROM tblname) so that MySQL doesn't complain about selecting from the table being modified.
DELETE FROM
student
WHERE
id
IN (
SELECT
DISTINCT s2.id
FROM
(SELECT * FROM student) AS s1
INNER JOIN
(SELECT * FROM student) AS s2
WHERE
s1.email = s2.email AND
s2.created > s1.created
);
Then create the unique index:
ALTER TABLE
student
ADD UNIQUE INDEX
idx_student_unique_email(email)
;
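The three steps above (find the later-created duplicates, delete them, add the unique index) can be sketched as one runnable script. It uses Python's built-in sqlite3 module in place of MySQL, and the student rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, email TEXT, created TEXT)")
# Hypothetical data: a@example.com was registered three times
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [
        (1, "a@example.com", "2020-01-01"),
        (2, "a@example.com", "2020-02-01"),
        (3, "b@example.com", "2020-01-15"),
        (4, "a@example.com", "2020-03-01"),
    ],
)

# Step 1: ids of rows that have an earlier-created row with the same email
dup_ids = [
    r[0]
    for r in conn.execute(
        """
        SELECT DISTINCT s2.id
        FROM student AS s1
        INNER JOIN student AS s2
            ON s1.email = s2.email AND s2.created > s1.created
        ORDER BY s2.id
        """
    )
]

# Step 2: delete them; Step 3: the unique index can now be created
conn.executemany("DELETE FROM student WHERE id = ?", [(i,) for i in dup_ids])
conn.execute("CREATE UNIQUE INDEX idx_student_unique_email ON student(email)")
remaining = conn.execute("SELECT id, email FROM student ORDER BY id").fetchall()
print(dup_ids, remaining)
```

One caveat worth checking on real data: two rows with the same email and identical created values are not caught by the `>` comparison, in which case comparing ids instead would be safer.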
The query below deletes all duplicates except the row with the lowest "id" value:
DELETE t1 FROM table_name t1, table_name t2 WHERE t1.id > t2.id AND t1.name = t2.name
Similarly, we can keep the row with the highest 'id' value instead:
DELETE t1 FROM table_name t1, table_name t2 WHERE t1.id < t2.id AND t1.name = t2.name
This shows how to do it in SQL Server 2000. I'm not completely familiar with MySQL syntax, but I'm sure there's something comparable.
create table #titles (iid int identity (1, 1), title varchar(200))
-- Repeat this step many times to create duplicates
insert into #titles(title) values ('bob')
insert into #titles(title) values ('bob1')
insert into #titles(title) values ('bob2')
insert into #titles(title) values ('bob3')
insert into #titles(title) values ('bob4')
DELETE T FROM
#titles T left join
(
select title, min(iid) as minid from #titles group by title
) D on T.title = D.title and T.iid = D.minid
WHERE D.minid is null
Select * FROM #titles
delete from student where id in (
SELECT distinct(s1.`student_id`) from student as s1 inner join student as s2
where s1.`sex` = s2.`sex` and
s1.`student_id` > s2.`student_id` and
s1.`sex` = 'M'
ORDER BY `s1`.`student_id` ASC
)
The solution posted by Nitin seems to be the most elegant / logical one.
However it has one issue:
ERROR 1093 (HY000): You can't specify target table 'student' for update in FROM clause
This can however be resolved by using (SELECT * FROM student) instead of student:
DELETE FROM student WHERE id IN (
SELECT distinct(s1.`student_id`) FROM (SELECT * FROM student) AS s1 INNER JOIN (SELECT * FROM student) AS s2
WHERE s1.`sex` = s2.`sex` AND
s1.`student_id` > s2.`student_id` AND
s1.`sex` = 'M'
ORDER BY `s1`.`student_id` ASC
)
Give your +1's to Nitin for coming up with the original solution.
Deleting duplicates on MySQL tables is a common issue, that usually comes with specific needs. In case anyone is interested, here (Remove duplicate rows in MySQL) I explain how to use a temporary table to delete MySQL duplicates in a reliable and fast way (with examples for different use cases).
In this case, something like this should work:
-- create a new temporary table
CREATE TABLE tmp_table1 LIKE table1;
-- add a unique constraint
ALTER TABLE tmp_table1 ADD UNIQUE(title);
-- scan over the table to insert entries; rows that violate the unique key are skipped
INSERT IGNORE INTO tmp_table1 SELECT * FROM table1 ORDER BY id;
-- rename tables
RENAME TABLE table1 TO backup_table1, tmp_table1 TO table1;
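For a quick local check of this pattern, the sketch below uses Python's built-in sqlite3 module, where INSERT OR IGNORE is the counterpart of MySQL's INSERT IGNORE and ALTER TABLE ... RENAME TO replaces RENAME TABLE. It assumes the unique constraint is on title alone and that the table has just id and title columns; the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, title TEXT)")
# Hypothetical data: "foo" appears three times
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, "foo"), (2, "foo"), (3, "bar"), (4, "foo")],
)

# New table with the unique constraint already in place
conn.execute("CREATE TABLE tmp_table1 (id INTEGER, title TEXT, UNIQUE(title))")
# Rows whose title is already present are silently skipped
conn.execute(
    "INSERT OR IGNORE INTO tmp_table1 SELECT id, title FROM table1 ORDER BY id"
)
# Keep the original as a backup and swap the deduplicated table in
conn.execute("ALTER TABLE table1 RENAME TO backup_table1")
conn.execute("ALTER TABLE tmp_table1 RENAME TO table1")

kept = conn.execute("SELECT id, title FROM table1 ORDER BY id").fetchall()
backup_count = conn.execute("SELECT COUNT(*) FROM backup_table1").fetchone()[0]
print(kept, backup_count)
```

Because the rows are inserted in id order, the copy with the lowest id is the one that survives for each title, and the untouched backup table allows the result to be verified before it is dropped.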