How to de-duplicate records in MySQL? - mysql

This is a hard one. A third party has been sending us data from a fourth party, but in a horrible format, and they messed up and duplicated much of the data.
Now the data is all in one table, even though it should have been spread over several. This has to do with a historical data format.
What SHOULD be one record with multiple related records in other tables is actually stored in our database as follows:
Id HistoricalId Field1 Field2 Field3 Field4 FieldX ...
1 327
2 data data data
3 data data data
4 data data
5 data data
6 328
7 data data data (etc etc)
Everything grossly simplified. So you always first have a sort of "header record", then records with the data, until there is a new header. Let's call all the records from one header up to the next a "superrecord" (in the example, Ids 1 through 5 together form the first superrecord, and the next superrecord starts at Id 6).
Problem is: there are MANY duplicate "superrecords", easily identified by the duplicate HistoricalId in their header record. But they can be anywhere in the table (the records that form one superrecord are well sorted and not mixed up, but the superrecords themselves are).
So the puzzle: remove all duplicate superrecords. We are talking tens of thousands here, if not more.
So, how would you, in MySQL:
Find the Id of the header of a duplicate superrecord (easy)
Find the Id of the next header record (i.e. the start of the following superrecord)
Delete everything between (and including) the first Id and the second Id minus 1
And do this for all duplicate superrecords.
My head starts spinning. It must be possible with just MySQL, but how? I am just not experienced enough. Even though I am not bad at MySQL, here I cannot even see where to start. Or should I program something in PHP?
Anyone like a challenge? Thank you in advance!
UPDATE: Solved it thanks to you and two hours of hard work. See solution.

If you're open to copying to a different table etc., then...
You can figure out which header records you want to delete: all records whose historical ID also exists in some other record with a higher ID.
SELECT id, historical_id
FROM tbl t1
WHERE historical_id > 0
AND EXISTS
(SELECT 1 FROM tbl t2
WHERE t2.historical_id = t1.historical_id AND t2.id > t1.id)
Since each record has an ID, you can compute, for each record, the ID of its header record. (This is what you mention in your comment.) It is the maximum ID among the records at or before it where the historical ID is filled in.
SELECT id, historical_id,
(SELECT MAX(t2.id) FROM tbl t2 WHERE t2.id <= t1.id AND t2.historical_id <> 0) AS parent_id
FROM tbl t1
You can then match the PARENT_ID with the first query to get all the IDs you wish to delete
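The matching step can be sketched end to end. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, the table is called `tbl` with columns `id`, `historical_id` and `field1`, empty historical IDs are stored as 0, and the sample rows are made up. As in the first query, the copy with the highest header ID is the one that survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id INTEGER PRIMARY KEY, "
    "historical_id INTEGER NOT NULL DEFAULT 0, field1 TEXT)"
)
# Made-up sample: superrecord 327 appears twice (ids 1-3 and ids 6-8).
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(1, 327, None), (2, 0, "a"), (3, 0, "b"),
     (4, 328, None), (5, 0, "c"),
     (6, 327, None), (7, 0, "a"), (8, 0, "b")],
)

# c: every row with the id of its header (parent_id).
# h: that header row. A row is doomed when a later header
# carries the same historical_id as its own header.
doomed_sql = """
SELECT c.id
FROM (SELECT t1.id,
             (SELECT MAX(t2.id) FROM tbl t2
               WHERE t2.id <= t1.id AND t2.historical_id <> 0) AS parent_id
        FROM tbl t1) AS c
JOIN tbl h ON h.id = c.parent_id
WHERE EXISTS (SELECT 1 FROM tbl d
               WHERE d.historical_id = h.historical_id AND d.id > h.id)
ORDER BY c.id
"""
doomed = [row[0] for row in conn.execute(doomed_sql)]

# Delete by collected id, so the DELETE never reads from its own target table.
conn.executemany("DELETE FROM tbl WHERE id = ?", [(i,) for i in doomed])
remaining = [row[0] for row in conn.execute("SELECT id FROM tbl ORDER BY id")]
print(doomed, remaining)
```

Collecting the IDs first and deleting them in a second step sidesteps MySQL's restriction on selecting from the table that is being modified.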

I finally solved it. Thanks everyone, you all pointed me in the right direction.
Three queries are needed:
First mark all duplicate header records by setting HistoricalID to -1
UPDATE
t1 INNER JOIN
(SELECT MIN(id) AS keep, HistoricalID FROM t1
GROUP BY HistoricalID
HAVING count(*) > 1 AND HistoricalID > 0) t2
ON t1.HistoricalID = t2.HistoricalID
SET t1.HistoricalID = IF(t1.id=t2.keep, t1.HistoricalID , -1)
WHERE t1.HistoricalID > 0
Secondly copy HistoricalID from the header record to all other records below it (in the same superrecord). I can undo this later easily if needed.
UPDATE
t1 JOIN
( SELECT Id, @s:=IF(HistoricalID='', @s, HistoricalID) HistoricalID FROM
(SELECT * FROM t1 ORDER BY Id) r, (SELECT @s:='') t ) t2
ON t1.Id = t2.Id
SET t1.HistoricalID= t2.HistoricalID
Delete all duplicates:
DELETE FROM t1 WHERE HistoricalID = -1
It worked. Couldn't have done it without you!

Related

Remove duplicate MySQL entries [duplicate]

I've not seen anyone ask for help with this specific issue.
I've got a table with 300,000 rows in it. Each row has a unique id, several columns but there is no timestamp etc.
The issue I have is the user has managed to import new data in to the DB and so now some of the rows are duplicated.
For the rows having this issue there are 2 rows which are identical apart from the ID.
Is there any way to search the whole table, find the duplicated rows based on name and remove the rows with the old ID ?
I need to ensure only a duplicate is removed and only the old entry.
So far I've come up with the following which shows the duplicate rows.
SELECT id, name, COUNT(name) AS cnt
FROM Sites
GROUP BY name
HAVING (cnt > 1)
This produces output with id, name, cnt and shows there are 50,000 entries to be removed. The id shown does appear to be the old ID.
Is there any way to feed this into a DELETE command to remove the entries?
Thanks
As far as I understand, there are now two duplicate rows in the table and you want to delete the old one, i.e. the one with the smaller id.
You can INNER JOIN the same table.
First, confirm all rows to delete:
SELECT t1.* FROM Sites t1
INNER JOIN Sites t2
ON t1.name = t2.name
AND t1.id < t2.id
This should return only the original rows, i.e. those with the smaller ID.
Second, if all returned rows are correct, use that query to collect the IDs and feed them to a DELETE statement. MySQL does not let a DELETE read from its own target table in a plain subquery, so the inner query is wrapped in a derived table:
DELETE FROM Sites WHERE id IN (
SELECT id FROM (
SELECT t1.id FROM Sites t1
INNER JOIN Sites t2
ON t1.name = t2.name
AND t1.id < t2.id
) AS dup
)
You can add more columns with AND from your table to check exact duplicate rows.
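The preview-then-delete flow described above might look like this. It is a sketch, not the answerer's exact code: it runs against SQLite through Python's sqlite3 module rather than MySQL, the `Sites` table is reduced to `id` and `name`, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sites (id INTEGER PRIMARY KEY, name TEXT)")
# Invented sample: 'alpha' and 'beta' were imported twice.
conn.executemany(
    "INSERT INTO Sites VALUES (?, ?)",
    [(1, "alpha"), (2, "beta"), (3, "alpha"), (4, "gamma"), (5, "beta")],
)

# Rows that have a twin with the same name and a higher id.
older_sql = """
SELECT DISTINCT t1.id
FROM Sites t1
INNER JOIN Sites t2 ON t1.name = t2.name AND t1.id < t2.id
"""

# Step 1: preview what would be removed.
to_delete = sorted(row[0] for row in conn.execute(older_sql))

# Step 2: delete those rows. The extra "AS dup" layer is not needed by
# SQLite; it is kept because MySQL requires the derived table here.
conn.execute(
    "DELETE FROM Sites WHERE id IN (SELECT id FROM (" + older_sql + ") AS dup)"
)
remaining = conn.execute("SELECT id, name FROM Sites ORDER BY id").fetchall()
print(to_delete, remaining)
```

Running the SELECT first costs little and shows exactly which IDs will disappear before anything is deleted.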
This seems to have worked for me.
DELETE FROM Sites WHERE id IN (
SELECT * FROM (
SELECT MIN(id) FROM Sites GROUP BY name HAVING COUNT(name) >= 2
) AS a
);
Thanks
Assuming the old ID is a lower value, you can use FIRST_VALUE partitioned by name (not being familiar with your table) and ordered by ID.
https://mariadb.com/kb/en/first_value/
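A rough sketch of the window-function idea follows. Assumptions: SQLite (via Python's sqlite3, which needs SQLite 3.25 or later for window functions) stands in for the real server, MySQL itself would need version 8.0 or later, the `Sites` table has only `id` and `name`, and the ordering is descending so that the first value is the newest row, which is the one to keep.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sites (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO Sites VALUES (?, ?)",
    [(1, "alpha"), (2, "beta"), (3, "alpha"), (4, "gamma")],
)

# For each row, keep_id is the highest id among rows sharing its name.
pairs = conn.execute("""
    SELECT id,
           FIRST_VALUE(id) OVER (PARTITION BY name ORDER BY id DESC) AS keep_id
    FROM Sites
""").fetchall()

# Any row that is not its own keep_id is an older duplicate.
old_ids = sorted(row_id for row_id, keep_id in pairs if row_id != keep_id)
conn.executemany("DELETE FROM Sites WHERE id = ?", [(i,) for i in old_ids])
remaining = conn.execute("SELECT id, name FROM Sites ORDER BY id").fetchall()
print(old_ids, remaining)
```

Unlike a plain GROUP BY, this also handles names that occur three or more times, since every row except the newest is flagged.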

MySQL: deleting rows based on a condition with data from another table and NO JOIN

I have a TABLE1 like this:
And a TABLE2 like this:
I want to delete entries from table 1 whose endTimestamp is not equal to ANY table 2 entry endTimestamp, with a margin of 1000 time units.
(I know in this example all the entries from table 1 and table 2 have the same timestamp values, so the 5 entries from table 1 should be kept, and any other should be erased if existed)
Since the ids of both tables are not related to each other, I can't perform a JOIN operation as far as I know.
How can I do this?
EDIT: Tried here. Works, but does not work at my server :|
You're looking for :
delete from table1 where endTimestamp not in (Select endTimestamp from table2)
Edit: As pointed out by @user2864740, you can very well use a JOIN here too, even if the ids of both tables are not related to each other. A LEFT JOIN that keeps only the unmatched rows does the same thing:
DELETE table1 FROM table1 LEFT JOIN table2 ON table1.endTimestamp = table2.endTimestamp WHERE table2.endTimestamp IS NULL;
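Neither statement above takes the margin of 1000 time units from the question into account. One possible way to include it is a NOT EXISTS with a distance check. This is a sketch under assumptions: SQLite through Python's sqlite3 replaces MySQL, both tables are reduced to `id` and `endTimestamp`, the timestamps are plain integers, and the sample values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, endTimestamp INTEGER)")
conn.execute("CREATE TABLE table2 (id INTEGER PRIMARY KEY, endTimestamp INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)",
                 [(1, 10000), (2, 20500), (3, 50000)])
conn.executemany("INSERT INTO table2 VALUES (?, ?)",
                 [(1, 10400), (2, 21600)])

MARGIN = 1000  # allowed distance between the two timestamps

# Delete every table1 row that has no table2 row within MARGIN of it.
deleted = conn.execute(
    """
    DELETE FROM table1
    WHERE NOT EXISTS (
        SELECT 1 FROM table2
        WHERE ABS(table2.endTimestamp - table1.endTimestamp) <= ?
    )
    """,
    (MARGIN,),
).rowcount
kept = [row[0] for row in conn.execute("SELECT id FROM table1 ORDER BY id")]
print(deleted, kept)
```

Row 1 survives because 10400 is 400 units away; row 2 is 1100 units from its nearest neighbour and is removed along with row 3.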

MySQL: How to update fields with the values of the record before this one

I have a large table with 1 million records. I get this table from another company. All records have an autonumber id as PRIMARY KEY. A lot of the fields are empty, because some records "belong together" (as a group) and these fields are only filled in for the first record (a sort of header record).
I want to fill in the same values in all following records until that field is not empty (that is where a new "header" from a new group of records starts). (I know, bad database design, but it is what I get and I want to turn it into better database design as soon as it gets in, and this is just the first necessary step into a longer step process to get there.)
I am having a hard time getting this right. I want to UPDATE the table, in order of id, where, if the specific fields are empty, they are filled in with the values of the previous record.
I tried different solutions, all of them turn out not to be working.
My last one:
UPDATE t1
SET f1 =
(SELECT t2.f1 FROM t1 AS t2
WHERE t2.id = t1.id-1) ,
Verzend_tijd=
(SELECT t3.f2 FROM t1 as t3
WHERE t3.id = t1.id-1)
WHERE t2.f1 = ''
ORDER BY t1.id
I get the error:
You can't specify target table 't1' for update in FROM clause
Does anyone have an idea how to get this done? I also tried an INNER JOIN, but it turns out I cannot use ORDER BY in an UPDATE with an INNER JOIN, and the order is important!
I am a bit at a loss and Googling didn't bring me anything either.
After more and more Googling I found the answer right here on this website:
Mysql Updating a record with a value from the previous record
Sorry for the inconvenience
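For readers who cannot follow the link, the fill-down itself can also be done from client code. The sketch below is not the linked answer's technique: it is a plain loop in Python over rows sorted by id, run against SQLite (sqlite3) instead of MySQL, with an invented table `t1` holding `id`, `f1` and `f2`, where an empty `f1` marks a row that belongs to the header above it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, f1 TEXT, f2 TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [(1, "A", "x"), (2, "", ""), (3, "", ""), (4, "B", "y"), (5, "", "")],
)

# Read everything in id order first, so no cursor is open while updating.
rows = conn.execute("SELECT id, f1, f2 FROM t1 ORDER BY id").fetchall()

last_header = None  # (f1, f2) of the most recent header row
updates = []
for row_id, f1, f2 in rows:
    if f1 != "":
        last_header = (f1, f2)  # a new group starts here
    elif last_header is not None:
        updates.append((last_header[0], last_header[1], row_id))

conn.executemany("UPDATE t1 SET f1 = ?, f2 = ? WHERE id = ?", updates)
filled = conn.execute("SELECT id, f1, f2 FROM t1 ORDER BY id").fetchall()
print(filled)
```

Because the header values are carried in a variable rather than looked up at id-1, gaps in the id sequence do not break the fill.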

Updating statement

I have a problem with my database, which has many tables, but I am focused on the main one, named TblLivroPorta (t1), and a second one named Tblp_h (t2).
t1 communicates with the application, while t2 stores everything that happens to t1; you could say that t2 is t1's backup.
I want to find the data I am looking for in t2 and copy it to t1 so it can be accessed by the application.
The statement below gives me all data I want to copy back
select NOrdem, Num_Oficio from tblp_h where Num_Oficio != '3469/3ª V/TAPS/2012'
and Data_Saida between '2012-01-01' and '2012-11-30'
union select NOrdem, Num_Oficio from TbLivroPorta
where Num_Oficio is null and Data_Saida between '2012-01-01' and '2012-11-30'
So my difficulty is copying them back to t1.
I hope I have been clear.
You need to use INSERT INTO SELECT to copy the values from one table to another
http://www.w3schools.com/sql/sql_insert_into_select.asp
You need to ensure that your SELECT statement returns the columns that will be inserted into the target table
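A minimal sketch of INSERT INTO ... SELECT for this case, under several assumptions: SQLite through Python's sqlite3 is used in place of MySQL, both tables are reduced to the three columns named in the question, dates are stored as ISO text, and the rows as well as the excluded number "999/2012" are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
columns = "(NOrdem INTEGER, Num_Oficio TEXT, Data_Saida TEXT)"
conn.execute("CREATE TABLE TblLivroPorta " + columns)  # t1, read by the app
conn.execute("CREATE TABLE Tblp_h " + columns)         # t2, the history table
conn.executemany(
    "INSERT INTO Tblp_h VALUES (?, ?, ?)",
    [(1, "100/2012", "2012-03-01"),
     (2, "200/2012", "2013-01-15"),   # outside the date range
     (3, "999/2012", "2012-05-05")],  # the number to be excluded
)

# The SELECT list must line up with the column list of the INSERT.
copy_sql = """
INSERT INTO TblLivroPorta (NOrdem, Num_Oficio, Data_Saida)
SELECT NOrdem, Num_Oficio, Data_Saida
FROM Tblp_h
WHERE Num_Oficio <> ?
  AND Data_Saida BETWEEN '2012-01-01' AND '2012-11-30'
"""
copied = conn.execute(copy_sql, ("999/2012",)).rowcount
restored = conn.execute(
    "SELECT NOrdem, Num_Oficio, Data_Saida FROM TblLivroPorta"
).fetchall()
print(copied, restored)
```

If the target rows already exist in t1 with an empty Num_Oficio, an UPDATE would be needed instead of an INSERT; the question does not say which situation applies.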

sql delete all but 2 duplicates

I want to be able to limit the number of duplicate records in a MySQL database table to 2.
(Excluding the id field which is auto increment)
My table is set up like
id city item
---------------------
1 Miami 4
2 Detroit 5
3 Miami 4
4 Miami 18
5 Miami 4
So in that table, only row 5 would be deleted.
How can I do this?
MySQL has some foibles when reading and writing to the same table. So I don't actually know if this will work, the syntax is fine in many implementations of SQL, but I don't know if it's MySQL friendly...
DELETE FROM
yourTable
WHERE
1 < (SELECT COUNT(*)
FROM yourTable as Lookup
WHERE city = yourTable.city AND item = yourTable.item AND id < yourTable.id)
EDIT
Amazingly convoluted, but worth a try?
DELETE
yourTable
FROM
yourTable
INNER JOIN
(
SELECT
id
FROM
(
SELECT
id
FROM
yourTable
WHERE
1 < (SELECT COUNT(*)
FROM yourTable as Lookup
WHERE city = yourTable.city AND item = yourTable.item AND id < yourTable.id)
)
AS inner_deletes
)
AS deletes
ON deletes.id = yourTable.id
I think your problem here is that both your code and/or table structure allows inserting duplicates and you are asking this question when you should really fix your db and/or code.
I think a better solution is to avoid allowing more than 5 records in the first place: implement a validation so that if select count(*) > 3, the new insert is not accepted.
If you want to do this in the data tier, you have to use a stored procedure, because you first need to identify all the groups with more than 3 records and then delete only the last one.
Regards
Due to MySQL being notoriously difficult when it comes to updating queried tables (see for example the answers from Dems), the best I can figure out is sadly more than one statement but on the plus side fairly readable;
CREATE TEMPORARY TABLE Dump AS SELECT id FROM table1 WHERE id NOT IN
(SELECT MIN(id) FROM table1 GROUP BY city,item UNION
SELECT MAX(id) FROM table1 GROUP BY city,item);
DELETE FROM table1 where id in (select * from Dump);
DROP TEMPORARY TABLE Dump;
Not sure if it was important which duplicate was removed, this keeps the first and last.
In your reply to Joachim's answer, you ask about saving 3 or 5 rows, this is one way to accomplish it. Depending on how you are using this database, you could either call this in a loop, or you could turn it into a stored procedure. Either way, you would continue to run this entire block of code until Rows Affected = 0:
drop table if exists TempTable;
create table TempTable
select city, item,
count(*) as record_count,
min(id) as ItemToDrop -- this could be changed to max() if you
-- want to delete new stuff instead
from YourTable
group by city, item
having count(*) > 2; -- This value = number of rows you save
delete from YourTable
where id in (select ItemToDrop from TempTable);
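The "run until Rows Affected = 0" loop could be driven from client code roughly as below. This is a sketch with assumptions: it targets SQLite via Python's sqlite3 instead of MySQL, the helper name `trim_duplicates` and its `keep` parameter are made up, the data is the table from the question, and max(id) is used so that the newest surplus row is dropped, which matches the question's expectation that row 5 goes.

```python
import sqlite3

def trim_duplicates(conn, keep=2):
    """Delete surplus (city, item) rows until no group exceeds `keep`."""
    total = 0
    while True:
        conn.execute("DROP TABLE IF EXISTS TempTable")
        # One candidate per oversized group: its newest row.
        conn.execute(
            "CREATE TEMP TABLE TempTable AS "
            "SELECT MAX(id) AS ItemToDrop FROM YourTable "
            f"GROUP BY city, item HAVING COUNT(*) > {int(keep)}"
        )
        affected = conn.execute(
            "DELETE FROM YourTable WHERE id IN (SELECT ItemToDrop FROM TempTable)"
        ).rowcount
        if affected == 0:  # nothing left to trim
            break
        total += affected
    conn.execute("DROP TABLE IF EXISTS TempTable")
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (id INTEGER PRIMARY KEY, city TEXT, item INTEGER)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [(1, "Miami", 4), (2, "Detroit", 5), (3, "Miami", 4),
     (4, "Miami", 18), (5, "Miami", 4)],
)
removed = trim_duplicates(conn, keep=2)
left = [row[0] for row in conn.execute("SELECT id FROM YourTable ORDER BY id")]
print(removed, left)
```

Each pass removes at most one row per group, so a group with many surplus copies simply takes several passes before the loop ends.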