I have got a database with about 7000 cars, but unfortunately, only about 1000 are unique. Can I delete all the duplicated rows?
The schema looks like this:
Thank you!
Here is one way to do it:
delete t1
from mytable t1
inner join mytable t2
on t2.brand = t1.brand
and t2.model = t1.model
and t2.id < t1.id
This will delete duplicates on (brand, model) while retaining the one with the smallest id.
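To try the idea out on a small scale, here is a minimal sketch using Python's built-in sqlite3 module. This is an assumption on my part: the question is about MySQL, and SQLite does not accept the multi-table `DELETE t1 FROM ... JOIN` form, so the sketch uses an equivalent EXISTS condition. The sample rows are made up.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, brand TEXT, model TEXT)")
conn.executemany(
    "INSERT INTO mytable (id, brand, model) VALUES (?, ?, ?)",
    [
        (1, "Audi", "A4"),
        (2, "Audi", "A4"),   # duplicate of id 1
        (3, "BMW", "X5"),
        (4, "Audi", "A4"),   # duplicate of id 1
        (5, "BMW", "X5"),    # duplicate of id 3
        (6, "Fiat", "500"),
    ],
)

# Delete every row for which another row with the same (brand, model)
# and a smaller id exists; the smallest id of each group survives.
conn.execute(
    """
    DELETE FROM mytable
    WHERE EXISTS (
        SELECT 1
        FROM mytable AS t2
        WHERE t2.brand = mytable.brand
          AND t2.model = mytable.model
          AND t2.id < mytable.id
    )
    """
)

remaining = conn.execute(
    "SELECT id, brand, model FROM mytable ORDER BY id"
).fetchall()
print(remaining)
```

On real data, it is worth running the same condition as a SELECT COUNT(*) first to see how many rows would go.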
Related
I want to write a query that deletes duplicate rows, keeping only one row from each group that has the same values in two columns.
Perhaps because there is a lot of data, the following query runs for a very long time without finishing:
DELETE t1 FROM table t1 INNER JOIN table t2
WHERE t1.idx < t2.idx AND t1.Nm = t2.Nm AND t1.product = t2.product;
Can this query do what I want? If not, what is the other way?
Create an index on the 3 columns involved in the join condition:
CREATE INDEX idx_name
ON tablename (Nm, product, idx);
and execute the query like this:
DELETE t1 FROM tablename t1 INNER JOIN tablename t2
WHERE t1.Nm = t2.Nm AND t1.product = t2.product AND t1.idx < t2.idx;
As you can see in this simplified demo, the query will be executed using the index.
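If you want to check index usage yourself, a rough sketch with Python's sqlite3 module follows. Note the assumptions: SQLite stands in for MySQL here, so it uses `EXPLAIN QUERY PLAN` (in MySQL you would put `EXPLAIN` in front of the statement instead), and the join is written as a SELECT because SQLite has no multi-table DELETE.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (idx INTEGER PRIMARY KEY, Nm TEXT, product TEXT)"
)
conn.execute("CREATE INDEX idx_name ON tablename (Nm, product, idx)")

# Ask the planner how it would run the self-join used to find duplicates.
plan = conn.execute(
    """
    EXPLAIN QUERY PLAN
    SELECT t1.idx
    FROM tablename AS t1
    JOIN tablename AS t2
      ON t1.Nm = t2.Nm
     AND t1.product = t2.product
     AND t1.idx < t2.idx
    """
).fetchall()

# The last column of each plan row is a human-readable description.
details = [row[3] for row in plan]
for line in details:
    print(line)

uses_index = any("idx_name" in line for line in details)
print("index used:", uses_index)
```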
I have a table containing roughly 5 million rows and 150 columns. However, there are several similar rows that I would like to consider duplicates if they share the same values for 3 columns: ID, Order and Name.
However, I don't want to delete the duplicates at random. The row I consider the duplicate should be the one with the smaller count value (Count being another column) or, if the counts are the same, the one with the earliest date (Date is another column).
I have tried with the code below:
DELETE t1 FROM uploaddata_copy t1
JOIN uploaddata_copy t2
ON t2.Name = t1.Name
AND t2.ID = t1.ID
AND t2.Order = t1.Order
AND t2.Count < t1.Count
AND t2.Date < t1.Date
However (and this is probably due to my computer), it runs for about 25 minutes before the server connection times out, so I am left unsure whether the query is correct and simply needs to run longer, or whether it is inherently wrong and there is a quicker way of doing it.
A more accurate query would be:
DELETE t1
FROM uploaddata_copy t1 JOIN
uploaddata_copy t2
ON t2.Name = t1.Name AND
t2.ID = t1.ID AND
t2.Order = t1.Order AND
(t2.Count < t1.Count OR
t2.Count = t1.Count AND t2.Date < t1.Date
);
However, fixing the logic will not (in this case) improve performance. First, you want an index on uploaddata_copy(Name, Id, Order, Count, Date). This allows the "lookup" to be between the original data and only the index.
Second, start small. Add a LIMIT 1 or LIMIT 10 to see how long it takes to remove just a few rows. Deleting rows is a complicated process, because it affects the table, indexes, and the transaction log -- not to mention any triggers on the table.
If a lot of rows are being deleted, you might find it faster to re-create the table, but that depends heavily on the relative number of rows being removed.
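The re-create approach could look roughly like the sketch below. It runs against SQLite through Python's sqlite3 module rather than MySQL (an assumption), the column names `ord`, `cnt` and `dt` are hypothetical stand-ins for Order, Count and Date to avoid quoting reserved words, and it assumes the row to keep is the one with the higher count, then the later date. Flip the comparisons if you want the opposite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uploaddata_copy ("
    "pk INTEGER PRIMARY KEY, name TEXT, id INTEGER, ord INTEGER, "
    "cnt INTEGER, dt TEXT)"
)
conn.executemany(
    "INSERT INTO uploaddata_copy VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "a", 1, 1, 5, "2020-01-01"),
        (2, "a", 1, 1, 7, "2020-01-02"),  # higher count: kept
        (3, "b", 2, 1, 3, "2020-01-01"),
        (4, "b", 2, 1, 3, "2020-03-01"),  # same count, later date: kept
        (5, "c", 3, 1, 1, "2020-01-01"),  # no duplicate: kept
    ],
)

# Copy only the rows for which no "better" row exists, then swap the tables.
# CREATE TABLE ... AS SELECT does not copy keys or indexes; add them afterwards.
conn.executescript(
    """
    CREATE TABLE uploaddata_dedup AS
    SELECT *
    FROM uploaddata_copy AS cur
    WHERE NOT EXISTS (
        SELECT 1
        FROM uploaddata_copy AS better
        WHERE better.name = cur.name
          AND better.id = cur.id
          AND better.ord = cur.ord
          AND (better.cnt > cur.cnt
               OR (better.cnt = cur.cnt AND better.dt > cur.dt))
    );
    DROP TABLE uploaddata_copy;
    ALTER TABLE uploaddata_dedup RENAME TO uploaddata_copy;
    """
)

kept = [r[0] for r in conn.execute("SELECT pk FROM uploaddata_copy ORDER BY pk")]
print(kept)
```

Rows that tie on both count and date are all kept by this condition, so a final tie-break on a unique column may be needed.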
Why the join? You want to delete rows when there exists a "better" record. So use an EXISTS clause:
delete from dup using uploaddata_copy as dup
where exists
(
select *
from uploaddata_copy better
where better.name = dup.name
and better.id = dup.id
and better.order = dup.order
and (better.count > dup.count or (better.count = dup.count and better.date > dup.date))
);
(Please check my comparisons. This is how I understand it: a better record for name + id + order has a greater count, or the same count and a later date. You consider the worse record an undesired duplicate that you want to delete.)
You should have an index on at least uploaddata_copy(id, name, order), or better still on uploaddata_copy(id, name, order, count, date), for this delete statement to perform well.
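Here is a small sketch of the EXISTS delete together with the wider index, again run on SQLite via Python's sqlite3 module as a stand-in for MySQL (an assumption). The column names `ord`, `cnt` and `dt` and the index name `dup_lookup` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uploaddata_copy ("
    "pk INTEGER PRIMARY KEY, name TEXT, id INTEGER, ord INTEGER, "
    "cnt INTEGER, dt TEXT)"
)
# Index covering every column the EXISTS lookup touches.
conn.execute(
    "CREATE INDEX dup_lookup ON uploaddata_copy (id, name, ord, cnt, dt)"
)
conn.executemany(
    "INSERT INTO uploaddata_copy VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "a", 1, 1, 5, "2020-01-01"),  # lower count than pk 2: deleted
        (2, "a", 1, 1, 9, "2019-06-01"),
        (3, "b", 2, 1, 4, "2020-01-01"),  # exact tie with pk 4: both stay
        (4, "b", 2, 1, 4, "2020-01-01"),
        (5, "c", 3, 1, 2, "2020-01-01"),  # same count, earlier date: deleted
        (6, "c", 3, 1, 2, "2020-02-01"),
    ],
)

# Delete a row whenever a better row exists for the same name + id + ord.
conn.execute(
    """
    DELETE FROM uploaddata_copy
    WHERE EXISTS (
        SELECT 1
        FROM uploaddata_copy AS better
        WHERE better.name = uploaddata_copy.name
          AND better.id = uploaddata_copy.id
          AND better.ord = uploaddata_copy.ord
          AND (better.cnt > uploaddata_copy.cnt
               OR (better.cnt = uploaddata_copy.cnt
                   AND better.dt > uploaddata_copy.dt))
    )
    """
)

remaining = [r[0] for r in conn.execute("SELECT pk FROM uploaddata_copy ORDER BY pk")]
print(remaining)
```

Note that rows 3 and 4 both survive: when count and date are identical, neither row is "better", so a further tie-break (for example on the primary key) would be needed to remove one of them.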
Please try with this:
DELETE t1 FROM uploaddata_copy t1
JOIN uploaddata_copy t2
ON t2.Name = t1.Name
AND t2.ID = t1.ID
AND t2.Order = t1.Order
AND t2.Count < t1.Count
AND t2.Date < t1.Date
AND t2.primary_key != t1.primary_key
I have created two tables, t1 and t2, and loaded data from a CSV into them.
I then had to create a new PK column, because the existing columns that would naturally be the PK (t1.old_id, t2.old_id) are strings and are not absolutely fixed (using such columns as a key seems to be advised against?),
so I created an id INT AUTO_INCREMENT PK in each table.
One record in t1 is linked to many in t2, and I want to maintain referential integrity between these two tables.
I believe what I need to do is create an id INT NOT NULL column in t2 as an FK.
This t2.id is blank at the moment (as it depends on t1.id).
Am I right in thinking I need an UPDATE query with a JOIN of some description to make this work?
The following produces exactly the data that I want to put into my t2.id column, but I don't know how to do the update:
select t1.id
from t1
inner join t2
on t1.old_id = t2.old_id
You can use a join in your UPDATE statement like this:
UPDATE t2
JOIN t1 ON t1.old_id = t2.old_id
SET t2.id = t1.id
You can use a correlated UPDATE query like this:
UPDATE t2
SET id = (SELECT MAX(t1.id) FROM t1 WHERE t1.old_id = t2.old_id);
*Assuming you have a single t1.id for each t1.old_id
On a separate note, you should give t2.id a name like t2.t1ID, so as to remove ambiguity if and when t2 also gets an identity column of its own named id.
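A minimal sketch of the correlated update, using Python's sqlite3 module in place of MySQL (an assumption). It follows the naming suggestion above and calls the FK column `t1ID`; the sample old_id values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t1 (id INTEGER PRIMARY KEY AUTOINCREMENT, old_id TEXT);
    CREATE TABLE t2 (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        old_id TEXT,
        t1ID INTEGER REFERENCES t1 (id)   -- FK column, empty after the CSV load
    );
    INSERT INTO t1 (old_id) VALUES ('A-1'), ('B-2');
    INSERT INTO t2 (old_id) VALUES ('A-1'), ('A-1'), ('B-2'), ('Z-9');
    """
)

# Fill the FK by looking up the parent row through the old string key.
conn.execute(
    "UPDATE t2 SET t1ID = (SELECT t1.id FROM t1 WHERE t1.old_id = t2.old_id)"
)

result = conn.execute("SELECT old_id, t1ID FROM t2 ORDER BY id").fetchall()
print(result)
```

Rows in t2 without a matching parent (the 'Z-9' row here) end up with NULL, so check for those before adding a NOT NULL constraint to the column.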
In MySQL, how can I find all rows whose attribute1 is the same as a particular row's attribute1? I thought about doing
SELECT
t1.id
FROM
t AS t1
, t AS t2
WHERE
t2.id=123
AND t1.a=t2.a;
but it has been running for eons.
This should work to return the required rows.
SELECT t1.id
FROM t AS t1
JOIN t AS t2 ON (t1.a = t2.a and t1.id <> t2.id)
WHERE t2.id=123;
How many rows are in your table? Is the "a" column indexed? Adding an index should speed up the join.
Here's an example on SQLFiddle.
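For a quick local test, the same join can be run with Python's sqlite3 module (SQLite rather than MySQL, which is an assumption). The index name `t_a` and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT)")
conn.executemany(
    "INSERT INTO t (id, a) VALUES (?, ?)",
    [(1, "x"), (2, "y"), (123, "x"), (200, "x"), (300, "z")],
)
# Index on the joined column, so the lookup does not scan the whole table.
conn.execute("CREATE INDEX t_a ON t (a)")

# All rows sharing the value of "a" with row 123, excluding row 123 itself.
matches = [
    row[0]
    for row in conn.execute(
        """
        SELECT t1.id
        FROM t AS t1
        JOIN t AS t2 ON t1.a = t2.a AND t1.id <> t2.id
        WHERE t2.id = 123
        ORDER BY t1.id
        """
    )
]
print(matches)
```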
I have the following select query
SELECT t1.*
FROM t1, t2
WHERE t1.field1=t2.field1 And t1.field2=t2.field2 And t1.field3=t2.field3 ;
I want to convert this into a delete query. How should I write it?
What about this query:
DELETE FROM t1
WHERE EXISTS (
SELECT 1 FROM t2
WHERE t1.field1=t2.field1 And
t1.field2=t2.field2 And
t1.field3=t2.field3)
I'm not 100% sure about Access, but does this work?
DELETE t1
FROM t1, t2
WHERE t1.field1=t2.field1 And t1.field2=t2.field2 And t1.field3=t2.field3 ;
Try this
DELETE FROM t1
FROM t1 AS tt1, t2 AS tt2
WHERE tt1.field1=tt2.field1 And tt1.field2=tt2.field2 And tt1.field3=tt2.field3 ;
EDIT:
Did this in MS Access
DELETE DISTINCTROW t1.*
FROM t1 INNER JOIN t2 ON (t1.field3 = t2.field3) AND (t1.field2 = t2.field2) AND (t1.field1 = t2.field1);
It worked; you have to set the Unique Records property to Yes.
What about this:
DELETE t1.*
FROM t1
INNER JOIN t2 ON t1.field1=t2.field1
And t1.field2=t2.field2
And t1.field3=t2.field3
This will delete all records in t1 that have a matching record in t2 based on the three field values.
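To see the intended effect on a toy example, here is a sketch using Python's sqlite3 module. This is an assumption: it is neither Access nor MySQL, and SQLite does not allow a join in DELETE, so the match is expressed with EXISTS. The sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t1 (field1 INTEGER, field2 INTEGER, field3 INTEGER);
    CREATE TABLE t2 (field1 INTEGER, field2 INTEGER, field3 INTEGER);
    INSERT INTO t1 VALUES (1, 2, 3), (1, 2, 9), (4, 5, 6), (7, 8, 9);
    INSERT INTO t2 VALUES (1, 2, 3), (7, 8, 9), (1, 5, 9);
    """
)

# Remove a t1 row only when one t2 row matches on all three fields at once.
conn.execute(
    """
    DELETE FROM t1
    WHERE EXISTS (
        SELECT 1
        FROM t2
        WHERE t2.field1 = t1.field1
          AND t2.field2 = t1.field2
          AND t2.field3 = t1.field3
    )
    """
)

remaining = conn.execute(
    "SELECT field1, field2, field3 FROM t1 ORDER BY field1, field2, field3"
).fetchall()
print(remaining)
```

The row (1, 2, 9) stays even though each of its values appears somewhere in t2, because no single t2 row matches all three fields together.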
I've been struggling with something similar.
I found the easiest way is not to use a query at all, but to create an empty duplicate table with multiple Primary Keys with Duplicates OK set (Design View hold down Ctrl key and select the rows you want and then right click and select them all as Primary Keys).
Then copy and paste all the rows from your table into the new table. OK the error messages and you will find you have a table with only unique values in the fields you wanted.
That has the additional benefit of not allowing duplicate rows in your new table.
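Outside the Access UI, a comparable trick can be sketched in SQL: give the new table a composite key over the fields that must be unique and let the database skip the rows that collide. The sketch below uses Python's sqlite3 module and SQLite's `INSERT OR IGNORE`, which is an assumption and not Access syntax; the table names `src` and `dedup` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE src (field1 INTEGER, field2 INTEGER, field3 INTEGER, note TEXT);
    CREATE TABLE dedup (
        field1 INTEGER,
        field2 INTEGER,
        field3 INTEGER,
        note TEXT,
        PRIMARY KEY (field1, field2, field3)   -- composite key enforces uniqueness
    );
    INSERT INTO src VALUES
        (1, 2, 3, 'first'),
        (1, 2, 3, 'second'),
        (4, 5, 6, 'only'),
        (1, 2, 3, 'third');
    """
)

# Copy everything across; rows that would violate the key are silently skipped.
conn.execute(
    "INSERT OR IGNORE INTO dedup "
    "SELECT field1, field2, field3, note FROM src ORDER BY rowid"
)

keys = conn.execute(
    "SELECT field1, field2, field3 FROM dedup ORDER BY field1, field2, field3"
).fetchall()
print(keys)

# The key keeps protecting the table: a later duplicate insert is rejected.
try:
    conn.execute("INSERT INTO dedup VALUES (1, 2, 3, 'again')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("duplicate rejected:", rejected)
```

Which of the duplicate rows is kept depends on the order in which they are copied, so sort the source first if that matters.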