I have a table with roughly 5 million rows and 150 columns. Several rows are similar, and I would like to treat them as duplicates if they share the same values in 3 columns: ID, Order and Name.
However, I don't want to delete the duplicates at random. The row I consider the duplicate is the one with the smaller Count value (Count being another column) or, if the counts are equal, the one with the earlier Date (Date is another column).
I have tried with the code below:
DELETE t1 FROM uploaddata_copy t1
JOIN uploaddata_copy t2
ON t2.Name = t1.Name
AND t2.ID = t1.ID
AND t2.Order = t1.Order
AND t2.Count < t1.Count
AND t2.Date < t1.Date
However (and this is probably due to my computer), it runs for about 25 minutes before the server times out, so I'm unsure whether the query is correct and just needs to run longer, or whether it is inherently wrong and there is a quicker way of doing it.
A more accurate query would be:
DELETE t1
FROM uploaddata_copy t1 JOIN
uploaddata_copy t2
ON t2.Name = t1.Name AND
t2.ID = t1.ID AND
t2.Order = t1.Order AND
(t2.Count < t1.Count OR
t2.Count = t1.Count AND t2.Date < t1.Date
);
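To see why the OR form matters, here is a minimal sketch that runs both join conditions against a tiny, made-up table. It uses Python's sqlite3 module as a stand-in for MySQL, and the sample rows, the pk column and the helper rows_deleted are all hypothetical. It only lists which rows each condition would select for deletion; it does not delete anything.

```python
import sqlite3

# Hypothetical miniature of uploaddata_copy; SQLite stands in for MySQL here.
ROWS = [
    # (pk, ID, Order, Name, Count, Date)
    (1, 7, 1, "a", 5, "2020-01-01"),
    (2, 7, 1, "a", 5, "2020-02-01"),  # same Count as pk 1, later Date
    (3, 7, 1, "a", 9, "2019-12-01"),  # larger Count, earlier Date
    (4, 8, 1, "b", 3, "2020-01-01"),  # no duplicate at all
]


def rows_deleted(condition):
    """Return the pks of rows t1 that the self-join would select for deletion."""
    con = sqlite3.connect(":memory:")
    con.execute(
        'CREATE TABLE t (pk INTEGER PRIMARY KEY, ID INT, "Order" INT, '
        'Name TEXT, "Count" INT, "Date" TEXT)'
    )
    con.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    sql = f"""
        SELECT DISTINCT t1.pk
        FROM t t1 JOIN t t2
          ON t2.Name = t1.Name AND t2.ID = t1.ID AND t2."Order" = t1."Order"
         AND ({condition})
        ORDER BY t1.pk
    """
    return [pk for (pk,) in con.execute(sql)]


# Condition from the question: both comparisons must hold at once.
and_version = rows_deleted('t2."Count" < t1."Count" AND t2."Date" < t1."Date"')

# Condition from the corrected query: Count decides, Date only breaks ties.
or_version = rows_deleted(
    't2."Count" < t1."Count" OR '
    '(t2."Count" = t1."Count" AND t2."Date" < t1."Date")'
)

print(and_version, or_version)
```

With this sample data the AND form selects nothing, because no row is beaten on Count and Date at the same time, while the OR form selects every row in the group except one.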
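However, fixing the logic will not (in this case) improve performance. First, you want an index on uploaddata_copy(Name, ID, Order, Count, Date). Order is a reserved word in MySQL, so quote it with backticks when you create the index. With this index, the lookup of matching rows can be answered from the index alone, so the join is between the original data and only the index.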
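Second, start small. MySQL does not accept LIMIT on a multi-table DELETE like this one, so restrict the statement another way, for example with an extra condition on a narrow range of ID values, to see how long it takes to remove just a few rows. Deleting rows is a complicated process, because it affects the table, indexes, and the transaction log, not to mention any triggers on the table.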
If a lot of rows are being deleted, you might find it faster to re-create the table, but that depends heavily on the relative number of rows being removed.
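The re-create approach could look roughly like the sketch below. It runs on SQLite through Python's sqlite3 module rather than on MySQL, and the sample rows, the pk column and the intermediate name uploaddata_dedup are assumptions made for illustration. It keeps a row only when no other row in its group wins under the same comparison as the corrected query above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    'CREATE TABLE uploaddata_copy (pk INTEGER PRIMARY KEY, ID INT, '
    '"Order" INT, Name TEXT, "Count" INT, "Date" TEXT)'
)
con.executemany(
    "INSERT INTO uploaddata_copy VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 7, 1, "a", 5, "2020-01-01"),
        (2, 7, 1, "a", 5, "2020-02-01"),
        (3, 7, 1, "a", 9, "2019-12-01"),
        (4, 8, 1, "b", 3, "2020-01-01"),
    ],
)

# Copy only the rows for which no "winning" sibling exists, instead of
# deleting the losers one by one.
con.execute(
    """
    CREATE TABLE uploaddata_dedup AS
    SELECT *
    FROM uploaddata_copy t1
    WHERE NOT EXISTS (
        SELECT 1
        FROM uploaddata_copy t2
        WHERE t2.Name = t1.Name AND t2.ID = t1.ID AND t2."Order" = t1."Order"
          AND (t2."Count" < t1."Count"
               OR (t2."Count" = t1."Count" AND t2."Date" < t1."Date"))
    )
    """
)

# Swap the new table in place of the old one.
con.execute("DROP TABLE uploaddata_copy")
con.execute("ALTER TABLE uploaddata_dedup RENAME TO uploaddata_copy")

kept = [pk for (pk,) in con.execute("SELECT pk FROM uploaddata_copy ORDER BY pk")]
print(kept)
```

A table created with CREATE TABLE ... AS SELECT does not carry over the original indexes or constraints, so those would have to be re-created afterwards. Rows that tie on both Count and Date are all kept by this comparison.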
Why the join? You want to delete rows when there exists a "better" record. So use an EXISTS clause:
delete from dup using uploaddata_copy as dup
where exists
(
select *
from uploaddata_copy better
where better.name = dup.name
and better.id = dup.id
and better.order = dup.order
and (better.count > dup.count or (better.count = dup.count and better.date > dup.date))
);
(Please check my comparisons. This is how I understand it: a better record for name + id + order has a greater count, or the same count and a later date. The worse record is the undesired duplicate you want to delete.)
You'd need an index on at least uploaddata_copy(id, name, order), or better still on uploaddata_copy(id, name, order, count, date), for this delete statement to perform well.
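As a small check of the "better record" logic, the sketch below applies it to a few made-up rows using Python's sqlite3 module. SQLite has no DELETE ... USING, so the sketch uses a single-table DELETE with a correlated EXISTS. The pk column, the index name idx_dup and the sample data are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    'CREATE TABLE uploaddata_copy (pk INTEGER PRIMARY KEY, id INT, '
    '"order" INT, name TEXT, "count" INT, "date" TEXT)'
)
# The wider of the two indexes suggested above (the name is made up).
con.execute(
    'CREATE INDEX idx_dup ON uploaddata_copy (id, name, "order", "count", "date")'
)
con.executemany(
    "INSERT INTO uploaddata_copy VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 7, 1, "a", 5, "2020-01-01"),
        (2, 7, 1, "a", 5, "2020-02-01"),  # same count as pk 1, later date
        (3, 7, 1, "a", 9, "2019-12-01"),  # greatest count in its group
        (4, 8, 1, "b", 3, "2020-01-01"),  # only row in its group
    ],
)

# Delete every row for which a better row exists in the same group:
# greater count, or equal count and later date.
con.execute(
    """
    DELETE FROM uploaddata_copy
    WHERE EXISTS (
        SELECT 1
        FROM uploaddata_copy better
        WHERE better.name = uploaddata_copy.name
          AND better.id = uploaddata_copy.id
          AND better."order" = uploaddata_copy."order"
          AND (better."count" > uploaddata_copy."count"
               OR (better."count" = uploaddata_copy."count"
                   AND better."date" > uploaddata_copy."date"))
    )
    """
)

remaining = [pk for (pk,) in con.execute("SELECT pk FROM uploaddata_copy ORDER BY pk")]
print(remaining)
```

Only the row with the greatest count survives in the duplicated group, and the group with a single row is untouched.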
Please try with this:
DELETE t1 FROM uploaddata_copy t1
JOIN uploaddata_copy t2
ON t2.Name = t1.Name
AND t2.ID = t1.ID
AND t2.Order = t1.Order
AND t2.Count < t1.Count
AND t2.Date < t1.Date
AND t2.primary_key != t1.primary_key
Related
I want to write a query that deletes duplicate rows, keeping only one row from each set of rows that share the same values in two columns.
Perhaps because there is a lot of data, the following query runs for a long time without finishing:
DELETE t1 FROM table t1 INNER JOIN table t2
WHERE t1.idx < t2.idx AND t1.Nm = t2.Nm AND t1.product = t2.product;
Can this query do what I want? If not, what is the other way?
Create an index on the 3 columns involved in the join condition:
CREATE INDEX idx_name
ON tablename (Nm, product, idx);
and execute the query like this:
DELETE t1 FROM tablename t1 INNER JOIN tablename t2
WHERE t1.Nm = t2.Nm AND t1.product = t2.product AND t1.idx < t2.idx;
As you can see in this simplified demo, the query will be executed using the index.
I have got a database with about 7000 cars, but unfortunately, only about 1000 are unique. Can I delete all the duplicated rows?
The schema looks like this:
Thank you!
Here is one way to do it:
delete t1
from mytable t1
inner join mytable t2
on t2.brand = t1.brand
and t2.model = t1.model
and t2.id < t1.id
This will delete duplicates on (brand, model) while retaining the one with the smallest id.
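The claim that the smallest id survives can be checked on a handful of rows. This sketch uses Python's sqlite3 module, and because SQLite does not support the DELETE ... JOIN form, it expresses the same condition as a correlated EXISTS. The schema is an assumption (only id, brand and model are known from the query above) and the sample rows are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Assumed schema: only the three columns the delete statement refers to.
con.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, brand TEXT, model TEXT)")
con.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (1, "Audi", "A4"),
        (2, "Audi", "A4"),
        (3, "BMW", "X5"),
        (4, "Audi", "A4"),
        (5, "BMW", "X5"),
        (6, "Fiat", "500"),
    ],
)

# A row is deleted when another row with the same brand and model
# has a smaller id, which mirrors the join condition t2.id < t1.id.
con.execute(
    """
    DELETE FROM mytable
    WHERE EXISTS (
        SELECT 1
        FROM mytable t2
        WHERE t2.brand = mytable.brand
          AND t2.model = mytable.model
          AND t2.id < mytable.id
    )
    """
)

survivors = con.execute("SELECT id, brand, model FROM mytable ORDER BY id").fetchall()
print(survivors)
```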
I recently came across a query in one of our office discussions,
SELECT t1.id, t1.name, t1.date AS date_filter,
(SELECT t2.column_x
FROM table_2 t2
WHERE t2.date = date_filter LIMIT 1
) AS column_x
FROM table_1 t1
WHERE t1.category_id = 10
ORDER BY t1.date
LIMIT 10;
The sub-query returns a column value from a second table that matches the date from the first table.
This query does not run at an acceptable speed. What are the ways to improve its performance?
Cheers
It would help to have SHOW CREATE TABLE for both tables, plus EXPLAIN SELECT ...
Indexes needed:
t1: INDEX(category_id, date)
t2: INDEX(date)
The subquery does not make sense without an ORDER BY -- which "1" row do you want?
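A sketch of both points, run on SQLite through Python's sqlite3 module rather than on MySQL: the two suggested indexes are created, and the subquery gets an ORDER BY so that the single row it returns is well defined. The table layouts, the index names, the sample rows and the choice of t2.id as the tie-breaker are all assumptions. The subquery also correlates on t1.date directly instead of the date_filter alias.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE table_1 (id INTEGER PRIMARY KEY, name TEXT, date TEXT, category_id INT);
    CREATE TABLE table_2 (id INTEGER PRIMARY KEY, date TEXT, column_x TEXT);

    -- Indexes matching the suggestion above (names are made up).
    CREATE INDEX idx_t1_cat_date ON table_1 (category_id, date);
    CREATE INDEX idx_t2_date ON table_2 (date);

    INSERT INTO table_1 VALUES
        (1, 'first',  '2021-01-01', 10),
        (2, 'second', '2021-01-02', 10),
        (3, 'other',  '2021-01-01', 99);

    -- Two candidate rows share the date 2021-01-01.
    INSERT INTO table_2 VALUES
        (1, '2021-01-01', 'b'),
        (2, '2021-01-01', 'a'),
        (3, '2021-01-02', 'c');
    """
)

rows = con.execute(
    """
    SELECT t1.id, t1.name, t1.date,
           (SELECT t2.column_x
            FROM table_2 t2
            WHERE t2.date = t1.date
            ORDER BY t2.id      -- assumed tie-breaker: lowest id wins
            LIMIT 1) AS column_x
    FROM table_1 t1
    WHERE t1.category_id = 10
    ORDER BY t1.date
    LIMIT 10
    """
).fetchall()
print(rows)
```

With the ORDER BY in place, the row for 2021-01-01 always gets 'b' (the table_2 row with the lowest id); without it, either 'a' or 'b' would be a legitimate result.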
I've the following UPDATE statement
UPDATE Table1 t1
INNER JOIN Table2 t2 ON (t1.Day = t2.Day AND t1.Id = t2.Id)
SET
t1.Price = t2.Price,
t1.Name = t2.Name
WHERE t2.Id = 1
AND t2.Day = DATE_FORMAT(DATE_ADD('2013-11-01', INTERVAL 1 DAY),'%Y-%m-%d');
When running an EXPLAIN statement, I get this message back:
Impossible WHERE noticed after reading const tables
At the moment, selecting a range of 21 records takes about 0.400 seconds on average.
I've already added an index on the fields t2.Id and t2.Day. Basically, the requirement of this update statement is to take all records that exist in Table2 with the Id of 1 for each Day (or all dates between DayStart and DayEnd, which I have access to).
Is there any way to improve this in terms of performance, or should I not worry about the EXPLAIN result?
I assume that
SELECT * FROM Table1 t1
INNER JOIN Table2 t2 ON (t1.Day = t2.Day AND t1.Id = t2.Id)
WHERE t2.Id = 1
AND t2.Day = DATE_FORMAT(DATE_ADD('2013-11-01', INTERVAL 1 DAY),'%Y-%m-%d');
will return an empty result.
Impossible WHERE noticed after reading const tables is not performance related. EXPLAIN is just telling you that no row matches your WHERE condition. So maybe there is no row in Table2 with Id = 1 and Day = '2013-11-02'?
Once you have resolved the Impossible WHERE, you can start optimizing your query with the EXPLAIN result (400ms seems very slow).
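A quick way to test that assumption is to evaluate the date expression and count the matching rows directly. The sketch below does this on SQLite via Python's sqlite3 module, where date('2013-11-01', '+1 day') plays the role of the DATE_ADD/DATE_FORMAT expression. The Table2 layout and its two rows are made up so that Id = 1 and the target day both exist, but never in the same row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE Table2 (Id INT, Day TEXT, Price REAL, Name TEXT);
    INSERT INTO Table2 VALUES
        (1, '2013-11-01', 9.5, 'x'),
        (2, '2013-11-02', 7.0, 'y');
    """
)

# SQLite counterpart of DATE_ADD('2013-11-01', INTERVAL 1 DAY).
target_day = con.execute("SELECT date('2013-11-01', '+1 day')").fetchone()[0]

# Count the rows that satisfy both parts of the WHERE clause together.
matches = con.execute(
    "SELECT COUNT(*) FROM Table2 WHERE Id = ? AND Day = ?", (1, target_day)
).fetchone()[0]

print(target_day, matches)
```

A count of zero here corresponds to the situation EXPLAIN reports: the condition cannot match any row, so there is nothing to update.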
In MySQL, how can I find all rows whose attribute1 is the same as a particular row's attribute1? I thought about doing
SELECT
t1.id
FROM
t AS t1
, t AS t2
WHERE
t2.id=123
AND t1.a=t2.a;
but it has been running for eons.
This should work to return the required rows.
SELECT t1.id
FROM t AS t1
JOIN t AS t2 ON (t1.a = t2.a and t1.id <> t2.id)
WHERE t2.id=123;
How many rows are in your table? Is the "a" column indexed? Adding an index should speed up the join.
Here's an example on SQLFiddle.
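The same join can also be tried locally. This is a minimal sketch on SQLite via Python's sqlite3 module, with an index on the "a" column as suggested; the sample rows and the index name idx_t_a are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT);
    CREATE INDEX idx_t_a ON t (a);   -- index on the join column

    INSERT INTO t VALUES
        (123, 'red'),
        (124, 'red'),
        (125, 'blue'),
        (126, 'red');
    """
)

# All rows that share column a with row 123, excluding row 123 itself.
same_a = [
    row_id
    for (row_id,) in con.execute(
        """
        SELECT t1.id
        FROM t AS t1
        JOIN t AS t2 ON t1.a = t2.a AND t1.id <> t2.id
        WHERE t2.id = 123
        ORDER BY t1.id
        """
    )
]
print(same_a)
```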