Remove duplicate MySQL entries [duplicate] - mysql

I've not seen anyone ask for help with this specific issue.
I've got a table with 300,000 rows in it. Each row has a unique id and several other columns, but there is no timestamp or similar.
The issue is that a user has managed to import new data into the DB, so some of the rows are now duplicated.
For each affected row there are 2 rows which are identical apart from the ID.
Is there any way to search the whole table, find the duplicated rows based on name, and remove the rows with the old ID?
I need to ensure that only duplicates are removed, and only the old entry of each pair.
So far I've come up with the following which shows the duplicate rows.
SELECT id, name, COUNT(name) AS cnt
FROM Sites
GROUP BY name
HAVING (cnt > 1)
This produces output with id, name and cnt, and shows there are 50,000 entries to be removed. The id shown does appear to be the old ID.
Is there any way to feed this into a DELETE command to remove the entries?
Thanks

As far as I understand, there are now two duplicate rows in the table and you want to delete the old one, i.e. the one with the smaller id.
You can INNER JOIN the table to itself.
First, confirm all rows to delete:
SELECT t1.* FROM Sites t1
INNER JOIN Sites t2
WHERE t1.name = t2.name
AND t1.id < t2.id
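This should only return the original rows, i.e. those with the smaller ID.
Second, if all the returned rows are correct, use that query to collect the IDs and feed them to a DELETE statement. Note that DELETE takes no *, and MySQL will not let a subquery read directly from the table being deleted from, so the ID list is wrapped in a derived table:
DELETE FROM Sites WHERE id IN (
SELECT id FROM (
SELECT t1.id FROM Sites t1
INNER JOIN Sites t2
ON t1.name = t2.name
AND t1.id < t2.id
) AS old_rows
)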
You can add more columns from your table with AND to match only rows that are exact duplicates.
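The self-join approach above can be tried out on a throwaway table before touching real data. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, the dialects differ in places), and the sample rows are made up.

```python
import sqlite3


def delete_older_duplicates(conn):
    """Delete every row that has a same-named row with a larger id."""
    conn.execute(
        """
        DELETE FROM Sites
        WHERE id IN (
            SELECT t1.id
            FROM Sites t1
            INNER JOIN Sites t2
                ON t1.name = t2.name   -- same name means duplicate
               AND t1.id < t2.id       -- t1 is the older row of the pair
        )
        """
    )
    return conn.execute("SELECT id, name FROM Sites ORDER BY id").fetchall()


def demo():
    # Hypothetical sample data: "alpha" and "beta" were imported twice.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Sites (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO Sites (id, name) VALUES (?, ?)",
        [(1, "alpha"), (2, "beta"), (3, "gamma"), (4, "alpha"), (5, "beta")],
    )
    return delete_older_duplicates(conn)
```

Calling demo() leaves the unique row and the newer copy of each duplicated name. On MySQL itself the ID subquery would additionally need the derived-table wrapper.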

This seems to have worked for me.
DELETE FROM Sites WHERE id IN (
SELECT * FROM (
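-- MIN(id) picks the old (lowest) id explicitly; a bare id is rejected under ONLY_FULL_GROUP_BY
SELECT MIN(id) AS id FROM Sites GROUP BY name HAVING COUNT(name) >= 2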
) AS a
);
Thanks

Assuming the old ID is the lower value, you can use FIRST_VALUE partitioned by name (I'm not familiar with your table) and ordered by ID.
https://mariadb.com/kb/en/first_value/
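A rough sketch of the FIRST_VALUE idea, again using Python's sqlite3 as a stand-in (an assumption; it needs an SQLite build with window functions, version 3.25 or later). On the MySQL side, window functions are only available from MySQL 8.0 or MariaDB 10.2 onwards. The sample rows are invented.

```python
import sqlite3


def older_duplicate_ids(conn):
    """Return ids of rows that are not the newest row for their name."""
    rows = conn.execute(
        """
        SELECT id
        FROM (
            SELECT id,
                   -- newest id per name: first value when sorted descending
                   FIRST_VALUE(id) OVER (
                       PARTITION BY name ORDER BY id DESC
                   ) AS newest_id
            FROM Sites
        ) AS ranked
        WHERE id <> newest_id
        ORDER BY id
        """
    ).fetchall()
    return [row[0] for row in rows]


def demo():
    # Hypothetical sample data: "alpha" and "beta" were imported twice.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Sites (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO Sites (id, name) VALUES (?, ?)",
        [(1, "alpha"), (2, "beta"), (3, "gamma"), (4, "alpha"), (5, "beta")],
    )
    return older_duplicate_ids(conn)
```

The returned ids are the candidates for deletion; inspecting them first mirrors the "confirm, then delete" order suggested in the other answer.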

Related

MySQL 5.7 remove duplicate rows in the same table based on multiple columns

I have a table with existing records and I want to add a unique constraint on multiple columns (app_instance_config_uuid, external_resource_id and spaceId), but first I need to remove the existing duplicates.
This is an example of the table I want to add the constraint to.
The best solution I found is:
DELETE FROM spaces_apps
WHERE id IN ( SELECT id FROM ( SELECT MIN(id) AS id FROM spaces_apps
GROUP BY spaceId, app_instance_config_uuid, external_resource_id
HAVING COUNT(id) > 1 ) temp )
but the issue is that it only deletes one duplicate per group; if a group has more than one duplicate, I need to run it again.
Note that this is MySQL 5.7, so ROW_NUMBER() and similar window-function approaches don't work.
UPDATE:
The first solution works even better when just changing IN to NOT IN and removing the HAVING clause! Thanks to @Pankaj for pointing this out!
DELETE FROM spaces_apps
WHERE id NOT IN ( SELECT id FROM ( SELECT MIN(id) AS id FROM spaces_apps
GROUP BY spaceId, app_instance_config_uuid, external_resource_id )temp )
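To see why the NOT IN variant clears every extra copy in one pass, here is a small sketch using Python's sqlite3 as a stand-in for MySQL 5.7 (an assumption). The column names come from the question; the sample values are invented.

```python
import sqlite3


def keep_lowest_id_per_group(conn):
    """Delete every row except the lowest id of each three-column group."""
    conn.execute(
        """
        DELETE FROM spaces_apps
        WHERE id NOT IN (
            SELECT id FROM (
                SELECT MIN(id) AS id
                FROM spaces_apps
                GROUP BY spaceId, app_instance_config_uuid, external_resource_id
            ) AS temp
        )
        """
    )
    rows = conn.execute("SELECT id FROM spaces_apps ORDER BY id").fetchall()
    return [row[0] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE spaces_apps (id INTEGER PRIMARY KEY, spaceId TEXT, "
        "app_instance_config_uuid TEXT, external_resource_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO spaces_apps VALUES (?, ?, ?, ?)",
        [
            (1, "s1", "a", "r"),
            (2, "s1", "a", "r"),  # duplicate of row 1
            (3, "s1", "a", "r"),  # second duplicate of row 1
            (4, "s2", "a", "r"),
            (5, "s1", "b", "r"),
        ],
    )
    return keep_lowest_id_per_group(conn)
```

Because every group contributes exactly one surviving id, groups with three or more copies are cleaned in a single statement, which is what the IN plus HAVING version could not do.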
I found a solution for this. It's not the prettiest, but it's the only one that works in my case.
DELETE t1 FROM table_name t1
INNER JOIN table_name t2
WHERE
t1.created_at < t2.created_at AND
t1.app_instance_config_uuid=t2.app_instance_config_uuid AND
t1.external_resource_id=t2.external_resource_id AND
t1.spaceId=t2.spaceId;

Mysql select optimization (huge db)

I have a select request in MySQL that takes between 25 and 30 s, which is extremely long, and I was wondering if you could help me speed it up.
CREATE TEMPORARY TABLE results(
id VARCHAR(30),
secondid VARCHAR(5),
allele VARCHAR(30),
translation VARCHAR(10),
level VARCHAR(20),
subgroup VARCHAR(20),
subgroup2 VARCHAR(20)
);
INSERT INTO results(id, secondid, allele, level) SELECT DISTINCT t1.id, t1.secondid, t1.texte, t3.texte
FROM database t1
JOIN database t2 ON t1.id=t2.id
JOIN database t3 ON t1.id=t3.id AND t1.secondid=t3.secondid
WHERE (t1.qualifier,t2.qualifier) = ("allele","organism") AND t3.qualifier = "level_length" AND t3.texte NOT REGEXP "X" AND t3.texte IS NOT NULL
AND t2.texte = ? AND t1.texte REGEXP ?
GROUP BY t1.texte;
UPDATE results SET translation = (SELECT t1.qualifier
FROM database t1
JOIN database t2 ON t1.id=t2.id AND t1.secondid=t2.secondid
JOIN database t3 ON t1.id=t3.id AND t1.secondid=t3.secondid
WHERE t1.qualifier IN ("protein","ncRNA","rRNA") AND t2.texte=results.allele AND t3.texte=results.level LIMIT 1);
UPDATE results SET subgroup = (SELECT t2.subgrp
FROM alleledb.alleleSubgroups t1
JOIN alleledb.subgroups t2 ON t1.subgroup=t2.subgroup
WHERE t1.gene=SUBSTRING_INDEX(results.allele, "*", 1) AND t1.species=? LIMIT 1);
ALTER TABLE results DROP id, DROP secondid;
SELECT * FROM results ORDER BY subgroup ASC, level ASC;
DROP TABLE results;
I need to join across several dbs (on the same id). The databases are huge, but the results to extract are few (less than 1% of the whole database). The majority of the results are stored in the same db, in different rows (with the same id and secondid). However, neither id nor secondid alone is unique to the rows I need to select; only the combination of the two is.
Thank you.
I would start by creating a proper composite index on your database table.
First on
(qualifier, id, secondid, texte)
This will help your joins and the WHERE tests, and the engine will NOT have to go back to the raw table data for the records, because the index already contains the columns you are interested in.
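As a rough illustration of the composite index, the sketch below builds it on an empty table and asks the planner whether a qualifier lookup uses it. It runs on Python's sqlite3, whose planner is not MySQL's, so treat it only as an analogy; on MySQL the equivalent check is EXPLAIN. The table name annotations and the index name are hypothetical.

```python
import sqlite3


def plan_for_qualifier_lookup():
    """Return the query plan text for a lookup by qualifier."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotations "
        "(id TEXT, secondid TEXT, qualifier TEXT, texte TEXT)"
    )
    # Composite index in the column order suggested in the answer.
    conn.execute(
        "CREATE INDEX idx_annotations_lookup "
        "ON annotations (qualifier, id, secondid, texte)"
    )
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT id, secondid, texte FROM annotations "
        "WHERE qualifier = 'allele'"
    ).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " ".join(str(row[-1]) for row in rows)
```

If the index is picked up, its name shows in the plan text. An index can be dropped again with DROP INDEX if it turns out not to pay for its size.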
Next, I would adjust the query and joins. Since you are specifically looking for the "allele" and "organism" qualifiers from t1 and t2 respectively, state them as such.
I have no idea what you are doing with your REGEXP "X" or "?" values for texte, but you'll figure that out afterwards.
Here is how I would revise the query:
insert into ...
SELECT DISTINCT
t1.id,
t1.secondid,
t1.texte,
t3.texte
FROM
database t1
JOIN database t2
ON t1.id = t2.id
AND t2.qualifier = 'organism'
JOIN database t3 ON
t1.id = t3.id
AND t1.secondid = t3.secondid
AND t3.qualifier = 'level_length'
WHERE
t1.qualifier = 'allele'
AND t1.texte REGEXP ?
-- I would move these t2 and t3 into the respective JOINs above directly.
AND t3.texte NOT REGEXP "X"
AND t3.texte IS NOT NULL
AND t2.texte = ?
GROUP BY
t1.texte;
As for your UPDATE commands, a second index on (id, secondid) will help the joins to t2 and t3, since those joins carry no qualifier condition.
Also, as Rick mentioned, without an ORDER BY clause you have no guarantee WHICH record is returned by the LIMIT 1.
First of all, thank you for all your help.
My first table (the one named database, used in the INSERT and the first UPDATE) looks like this:
I want everything in red. In other words, I need some parameters that have the same id and secondid as the "level", which is unique within the id, whereas other parameters may be repeated within the same id (but with a different secondid).
I am filtering on the allele name (ECK in the EC locus) with the REGEXP, and on species. For example, all alleles from the EC locus of human.
Then (last update), I take one parameter (allele), take a substring of it and go to a database that gives me one id (one row -> one id). I then use this id on another database that gives me one or two rows (one subgroup, or rarely two). Since my example only has one group, the absence of ORDER BY went unnoticed. But yes, I want to order the results (get the subgroup that contains the allele first). I don't know how to do that.
Finally, I can try to create an index, but given the size of the db, I'm wondering about the time it takes to build and the size of such an index. Would it significantly improve query time, and can I remove it afterwards?
The REGEXP "X" removes every match that is not relevant for this parameter (I don't want them).
The ? is user input (the species, which occurs twice, and the locus).
The operations on the first database take 30 s; the last operation on the two databases takes 1-2 s. The others (DROP, SELECT) take < 20 ms (not the problem).

How to de-duplicate records in MySQL?

This is a hard one. A third party has been sending us data from a fourth party. But they have done that in a horrible format, and they messed up and duplicated much of the data.
Now the data is all in one table, even though it should have been in many more than one. This has to do with a historical data format.
What SHOULD be one record with multiple related records in other tables is actually stored in our database as follows:
Id HistoricalId Field1 Field2 Field3 Field4 FieldX ...
1 327
2 data data data
3 data data data
4 data data
5 data data
6 328
7 data data data (etc etc)
Everything grossly simplified. So you always first have a sort of "header record", then records with the data, until there is a new header. Let's call all the records from one header to the next together a "superrecord" (for instance, in the example Ids 1 through 5 together form the first superrecord, and the next superrecord starts at Id 6).
Problem is: there are MANY duplicate "superrecords", easily identified by their duplicate HistoricalId in the header record. But they can be anywhere in the database (the records that form a superrecord are well sorted and not mixed up, but the superrecords themselves are mixed up).
So the puzzle: remove all duplicate superrecords. We are talking tens of thousands here, if not more.
So, how would you, in MySQL:
Find the Id of a duplicate superrecord's header (easy)
Find the Id of the next header record (i.e. the following superrecord)
Delete everything between (and including) the first Id and the second Id minus 1
And do this for all duplicate superrecords.
My head starts spinning. It must be possible with just MySQL, but how? I am just not experienced enough. Even though I am not bad at MySQL, here I cannot even see where to start. Or should I program something in PHP?
Anyone like a challenge? Thank you in advance!
UPDATE: Solved it thanks to you and two hours of hard work. See solution.
If you're open to copying to a different table etc., then...
You can figure out which records you want to delete: all records where the historical id also exists in some other record with a higher ID.
SELECT id, historical_id
FROM tbl t1
WHERE historical_id > 0
AND EXISTS
(SELECT 1 FROM tbl t2
WHERE t2.historical_id = t1.historical_id AND t2.id > t1.id)
Since each record has an ID, you can compute, for each record, the ID of its header record (this is what you mention in your comment). It is the maximum ID of any record at or before it where the historical id is filled in.
SELECT id, historical_id
,(SELECT MAX(t2.id) FROM tbl t2 WHERE t2.id <= t1.id AND t2.historical_id > 0) AS parent_id
FROM tbl t1
You can then match parent_id against the ids from the first query to get all the IDs you wish to delete.
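Putting the two queries together might look like the sketch below. It uses Python's sqlite3 as a stand-in for MySQL (an assumption), stores 0 in historical_id for data rows (also an assumption, chosen to match the > 0 test above), and keeps the last copy of each duplicated superrecord. Table and column names beyond those in the answer are hypothetical.

```python
import sqlite3


def ids_in_duplicate_superrecords(conn):
    """Ids of all rows whose header has a later header with the same historical_id."""
    rows = conn.execute(
        """
        SELECT r.id
        FROM tbl r
        WHERE (
            -- header of row r: the closest header at or before it
            SELECT MAX(h.id) FROM tbl h
            WHERE h.id <= r.id AND h.historical_id > 0
        ) IN (
            -- headers that are repeated further down the table
            SELECT t1.id FROM tbl t1
            WHERE t1.historical_id > 0
              AND EXISTS (SELECT 1 FROM tbl t2
                          WHERE t2.historical_id = t1.historical_id
                            AND t2.id > t1.id)
        )
        ORDER BY r.id
        """
    ).fetchall()
    return [row[0] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tbl (id INTEGER PRIMARY KEY, historical_id INTEGER, field1 TEXT)"
    )
    conn.executemany(
        "INSERT INTO tbl VALUES (?, ?, ?)",
        [
            (1, 327, None), (2, 0, "data"), (3, 0, "data"),  # superrecord 327
            (4, 328, None), (5, 0, "data"),                  # superrecord 328
            (6, 327, None), (7, 0, "data"), (8, 0, "data"),  # 327 again
        ],
    )
    doomed = ids_in_duplicate_superrecords(conn)
    conn.executemany("DELETE FROM tbl WHERE id = ?", [(i,) for i in doomed])
    remaining = [r[0] for r in conn.execute("SELECT id FROM tbl ORDER BY id")]
    return doomed, remaining
```

Collecting the ids first and deleting afterwards keeps the SELECT reviewable before anything is removed.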
I finally solved it. Thanks everyone, you all pointed me in the right direction.
Three queries are needed:
First, mark all duplicate header records by setting HistoricalID to -1:
UPDATE
t1 INNER JOIN
(SELECT MIN(id) AS keep, HistoricalID FROM t1
GROUP BY HistoricalID
HAVING count(*) > 1 AND HistoricalID > 0) t2
ON t1.HistoricalID = t2.HistoricalID
SET t1.HistoricalID = IF(t1.id=t2.keep, t1.HistoricalID , -1)
WHERE t1.HistoricalID > 0
Secondly copy HistoricalID from the header record to all other records below it (in the same superrecord). I can undo this later easily if needed.
UPDATE
t1 JOIN
( SELECT Id, @s:=IF(HistoricalID='', @s, HistoricalID) HistoricalID FROM
(SELECT * FROM t1 ORDER BY Id) r, (SELECT @s:='') t ) t2
ON t1.Id = t2.Id
SET t1.HistoricalID= t2.HistoricalID
Delete all duplicates:
DELETE FROM t1 WHERE HistoricalID = -1
It worked. Couldn't have done it without you!

MySQL: deleting rows based on a condition with data from another table and NO JOIN

I have a TABLE1 like this:
And a TABLE2 like this:
I want to delete entries from table 1 whose endTimestamp is not equal to the endTimestamp of ANY table 2 entry, within a margin of 1000 time units.
(I know that in this example all the entries from table 1 and table 2 have the same timestamp values, so the 5 entries from table 1 should be kept, and any others, if they existed, should be erased.)
Since the ids of the two tables are not related to each other, I can't perform a JOIN operation as far as I know.
How can I do this?
EDIT: Tried it here. It works there, but not on my server :|
You're looking for:
delete from table1 where endTimestamp not in (Select endTimestamp from table2)
Edit: As pointed out by @user2864740, you can very well use a JOIN here too, even if the ids of the two tables are not related to each other. To delete the rows that have no match, use a LEFT JOIN and keep only the unmatched rows:
DELETE table1 FROM table1 LEFT JOIN table2 ON table1.endTimestamp = table2.endTimestamp WHERE table2.endTimestamp IS NULL;
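Neither statement above accounts for the margin of 1000 time units from the question. One way to include it is a NOT EXISTS test on the absolute difference, sketched here with Python's sqlite3 standing in for MySQL (an assumption) and invented timestamps.

```python
import sqlite3


def delete_unmatched(conn, margin):
    """Delete table1 rows with no table2 endTimestamp within the margin."""
    conn.execute(
        """
        DELETE FROM table1
        WHERE NOT EXISTS (
            SELECT 1 FROM table2
            WHERE ABS(table1.endTimestamp - table2.endTimestamp) <= ?
        )
        """,
        (margin,),
    )
    rows = conn.execute("SELECT id FROM table1 ORDER BY id").fetchall()
    return [row[0] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, endTimestamp INTEGER)")
    conn.execute("CREATE TABLE table2 (id INTEGER PRIMARY KEY, endTimestamp INTEGER)")
    conn.executemany(
        "INSERT INTO table1 VALUES (?, ?)",
        [(1, 1000), (2, 5000), (3, 9000), (4, 20000)],
    )
    conn.executemany(
        "INSERT INTO table2 VALUES (?, ?)",
        [(10, 1500), (11, 9999)],
    )
    return delete_unmatched(conn, 1000)
```

Rows 1 and 3 survive because each lies within 1000 units of some table2 timestamp; rows 2 and 4 do not and are removed.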

Insert into one table by selecting data from multiple tables

I need to select data from two tables and insert it into one table. This kind of question has been asked and answered many times, but I have a specific requirement.
I have three tables in total: T1, T2 and T3.
My final goal is to insert the data into table T3, which has 15 columns in total. I need to fill 14 of those columns with data from table T2; for the remaining column I need to join T1 and T2 and fetch the value from that join. Please see the query below:
CREATE procedure proc_name
BEGIN
Insert into T3(
id,
col1,
col2,
....
...
col14)
select
(select id from T1 INNER JOIN T2 ON
(T1.somecol1=T2.somecol1,
T1. somecol2= T2.somecol2,
T1.somecol3 = T2.somecol3,
T1.somecol4= T2.somecol4)
ORDER BY T2.somecol5 LIMIT 1),
T2.col1,
T2.col2,
...
...
T2.col14 from T2;
END;
Here the other fourteen columns of T3 are related to the first column, id.
Whenever I call the above stored procedure, every record inserted into T3 gets the top 1 id from table T1, even though I have 10 ids in T1 in total.
After close observation I realised that this is because of the LIMIT 1, which fetches only the first id every time.
If I leave out LIMIT 1, the subquery returns all 10 ids and the query itself fails.
Is there any way I can get all the ids into table T3? Please advise.
Thanks in advance.
To elaborate on the comment by @jarlh (who is absolutely correct): you need a join between the two tables rather than selecting a value in a subquery.
INSERT INTO T3
(id
,col1
,col2
,...
,col14)
SELECT T1.id
,T2.col1
,T2.col2
,...
FROM T1
INNER JOIN T2 ON
T1.somecol1 = T2.somecol1
AND T1.somecol2 = T2.somecol2
AND T1.somecol3 = T2.somecol3
AND T1.somecol4 = T2.somecol4
With your code as it is, you would only ever expect a single value from T1.
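A cut-down sketch of the join-based INSERT ... SELECT, using Python's sqlite3 as a stand-in for MySQL (an assumption). The tables are reduced to two join columns and two data columns, and all sample values are invented.

```python
import sqlite3


def fill_t3(conn):
    """Insert one T3 row per matching T1/T2 pair, taking id from T1."""
    conn.execute(
        """
        INSERT INTO T3 (id, col1, col2)
        SELECT T1.id, T2.col1, T2.col2
        FROM T1
        INNER JOIN T2
            ON T1.somecol1 = T2.somecol1
           AND T1.somecol2 = T2.somecol2
        """
    )
    return conn.execute("SELECT id, col1, col2 FROM T3 ORDER BY id").fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T1 (id INTEGER, somecol1 TEXT, somecol2 TEXT)")
    conn.execute(
        "CREATE TABLE T2 (somecol1 TEXT, somecol2 TEXT, col1 TEXT, col2 TEXT)"
    )
    conn.execute("CREATE TABLE T3 (id INTEGER, col1 TEXT, col2 TEXT)")
    conn.executemany(
        "INSERT INTO T1 VALUES (?, ?, ?)", [(1, "a", "x"), (2, "b", "y")]
    )
    conn.executemany(
        "INSERT INTO T2 VALUES (?, ?, ?, ?)",
        [
            ("a", "x", "c1a", "c2a"),
            ("b", "y", "c1b", "c2b"),
            ("z", "z", "none", "none"),  # no partner in T1, so not inserted
        ],
    )
    return fill_t3(conn)
```

Each T2 row now picks up the id of its own matching T1 row instead of the single value returned by a LIMIT 1 subquery.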