I want to write a query that deletes duplicate rows, keeping only one row from each group in which two columns match.
Perhaps because of the large amount of data, the following query runs for a long time without finishing:
DELETE t1 FROM table t1 INNER JOIN table t2
WHERE t1.idx < t2.idx AND t1.Nm = t2.Nm AND t1.product = t2.product;
Can this query do what I want? If not, is there another way?
Create an Index on the 3 columns involved in the ON clause:
CREATE INDEX idx_name
ON tablename (Nm, product, idx);
and execute the query like this:
DELETE t1 FROM tablename t1 INNER JOIN tablename t2
ON t1.Nm = t2.Nm AND t1.product = t2.product AND t1.idx < t2.idx;
With this index in place, the query can be executed using the index.
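The same keep-the-lowest-idx dedup can be sketched in SQLite (via Python's built-in sqlite3 module), which lacks MySQL's multi-table DELETE ... JOIN syntax, so a correlated EXISTS does the job. The sample data here is invented; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablename (idx INTEGER PRIMARY KEY, Nm TEXT, product TEXT);
    CREATE INDEX idx_name ON tablename (Nm, product, idx);
    INSERT INTO tablename (idx, Nm, product) VALUES
        (1, 'a', 'x'), (2, 'a', 'x'), (3, 'a', 'y'), (4, 'b', 'x'), (5, 'b', 'x');
""")

# Delete every row for which a duplicate with a smaller idx exists.
conn.execute("""
    DELETE FROM tablename
    WHERE EXISTS (SELECT 1 FROM tablename t2
                  WHERE t2.Nm = tablename.Nm
                    AND t2.product = tablename.product
                    AND t2.idx < tablename.idx)
""")
print(conn.execute("SELECT idx, Nm, product FROM tablename ORDER BY idx").fetchall())
# [(1, 'a', 'x'), (3, 'a', 'y'), (4, 'b', 'x')]
```

Only the lowest idx per (Nm, product) pair survives, because those rows have no smaller-idx duplicate.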
I have a database with about 7000 cars, but unfortunately only about 1000 are unique. Can I delete all the duplicated rows?
The schema looks like this:
Thank you!
Here is one way to do it:
delete t1
from mytable t1
inner join mytable t2
on t2.brand = t1.brand
and t2.model = t1.model
and t2.id < t1.id
This will delete duplicates on (brand, model) while retaining the one with the smallest id.
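An equivalent way to express this in SQLite (sketched with Python's sqlite3, since SQLite has no multi-table DELETE) is to delete every id that is not the minimum of its (brand, model) group. The cars data below is invented, as the question's schema was not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER PRIMARY KEY, brand TEXT, model TEXT);
    INSERT INTO mytable (id, brand, model) VALUES
        (1, 'ford', 'focus'), (2, 'ford', 'focus'),
        (3, 'bmw', 'm3'), (4, 'ford', 'focus'), (5, 'bmw', 'x5');
""")

# Keep only the smallest id per (brand, model) group.
conn.execute("""
    DELETE FROM mytable
    WHERE id NOT IN (SELECT MIN(id) FROM mytable GROUP BY brand, model)
""")
print(conn.execute("SELECT id, brand, model FROM mytable ORDER BY id").fetchall())
# [(1, 'ford', 'focus'), (3, 'bmw', 'm3'), (5, 'bmw', 'x5')]
```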
I have a table containing roughly 5 million rows and 150 columns. However, there are several similar rows that I would like to consider duplicates if they share the same values for three columns: ID, Order, and Name.
However, I don't just want to delete the duplicates at random. I want the row I consider a duplicate to be the one with the smaller Count value (Count being another column), or, if the counts are equal, the one with the earlier Date (Date is another column).
I have tried with the code below:
DELETE t1 FROM uploaddata_copy t1
JOIN uploaddata_copy t2
ON t2.Name = t1.Name
AND t2.ID = t1.ID
AND t2.Order = t1.Order
AND t2.Count < t1.Count
AND t2.Date < t1.Date
However (and this is probably due to my computer) it seems to run indefinitely (~25 minutes) before timing out from the server, so I'm left unsure whether the query is correct and simply needs longer, or whether it is inherently wrong and there is a quicker way of doing it.
A more accurate query would be:
DELETE t1
FROM uploaddata_copy t1 JOIN
     uploaddata_copy t2
     ON t2.Name = t1.Name AND
        t2.ID = t1.ID AND
        t2.Order = t1.Order AND
        (t2.Count > t1.Count OR
         (t2.Count = t1.Count AND t2.Date > t1.Date)
        );
This deletes a row when some other row in the same (Name, ID, Order) group has a larger count, or the same count and a later date, so the "best" row survives.
However, fixing the logic will not (in this case) improve performance. First, you want an index on uploaddata_copy(Name, Id, Order, Count, Date). This allows the "lookup" to be between the original data and only the index.
Second, start small. Add a LIMIT 1 or LIMIT 10 to see how long it takes to remove just a few rows. Deleting rows is a complicated process, because it affects the table, indexes, and the transaction log -- not to mention any triggers on the table.
If a lot of rows are being deleted, you might find it faster to re-create the table, but that depends heavily on the relative number of rows being removed.
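The re-create-the-table route can be sketched in SQLite (via Python's sqlite3): copy only the rows that have no strictly better duplicate into a new table, then swap names. The data is invented; "Order" is quoted because it is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE uploaddata_copy (ID INT, "Order" INT, Name TEXT, Count INT, Date TEXT);
    INSERT INTO uploaddata_copy VALUES
        (1, 1, 'a', 5, '2020-01-01'),
        (1, 1, 'a', 9, '2020-01-01'),
        (1, 1, 'a', 9, '2020-02-01'),
        (2, 1, 'b', 3, '2020-01-01');
    -- Keep a row only if no duplicate beats it on Count, then Date.
    CREATE TABLE cleaned AS
        SELECT * FROM uploaddata_copy t1
        WHERE NOT EXISTS (
            SELECT 1 FROM uploaddata_copy t2
            WHERE t2.ID = t1.ID AND t2."Order" = t1."Order" AND t2.Name = t1.Name
              AND (t2.Count > t1.Count
                   OR (t2.Count = t1.Count AND t2.Date > t1.Date))
        );
    DROP TABLE uploaddata_copy;
    ALTER TABLE cleaned RENAME TO uploaddata_copy;
""")
print(conn.execute("SELECT ID, Name, Count, Date FROM uploaddata_copy ORDER BY ID").fetchall())
# [(1, 'a', 9, '2020-02-01'), (2, 'b', 3, '2020-01-01')]
```

The copy touches each row once instead of issuing per-row deletes, which is why this can win when most rows are being removed.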
Why the join? You want to delete rows when there exists a "better" record. So use an EXISTS clause:
delete from dup using uploaddata_copy as dup
where exists
(
select *
from uploaddata_copy better
where better.name = dup.name
and better.id = dup.id
and better.order = dup.order
and (better.count > dup.count or (better.count = dup.count and better.date > dup.date))
);
(Please check my comparisons. This is how I understand it: a better record for the same name + id + order has a greater count, or the same count and a later date. You consider the worse record an undesired duplicate that you want to delete.)
You'd have an index on uploaddata_copy(id, name, order) at least or better even on uploaddata_copy(id, name, order, count, date) for this delete statement to perform well.
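A minimal sketch of this EXISTS-based delete in SQLite (via Python's sqlite3; MySQL's DELETE ... USING syntax differs slightly): a row is removed when a "better" duplicate, with a higher Count or the same Count and a later Date, exists. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE uploaddata_copy (ID INT, "Order" INT, Name TEXT, Count INT, Date TEXT);
    INSERT INTO uploaddata_copy VALUES
        (1, 1, 'a', 5, '2020-01-01'),
        (1, 1, 'a', 8, '2020-03-01'),
        (2, 2, 'b', 4, '2020-01-01');
""")

# The strict comparisons stop a row from "beating" itself.
conn.execute("""
    DELETE FROM uploaddata_copy
    WHERE EXISTS (
        SELECT 1 FROM uploaddata_copy better
        WHERE better.ID = uploaddata_copy.ID
          AND better.Name = uploaddata_copy.Name
          AND better."Order" = uploaddata_copy."Order"
          AND (better.Count > uploaddata_copy.Count
               OR (better.Count = uploaddata_copy.Count
                   AND better.Date > uploaddata_copy.Date))
    )
""")
print(conn.execute("SELECT ID, Name, Count FROM uploaddata_copy ORDER BY ID").fetchall())
# [(1, 'a', 8), (2, 'b', 4)]
```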
Please try with this:
DELETE t1 FROM uploaddata_copy t1
JOIN uploaddata_copy t2
ON t2.Name = t1.Name
AND t2.ID = t1.ID
AND t2.Order = t1.Order
AND (t2.Count > t1.Count
     OR (t2.Count = t1.Count AND t2.Date > t1.Date))
AND t2.primary_key != t1.primary_key
I have a MySQL table T and there is an index on a column c1. My join query is as follows.
select something from T as t1 inner join T as t2 on ABS(t1.c1-t2.c1)<2;
I used explain to see whether MySQL uses index or not. It didn't use index for the above query. But it did use index for below query.
select something from T as t1 inner join T as t2 on t1.c1=t2.c1;
So how can make MySQL use index on the first query?
Thanks in advance.
You need to rewrite the condition so that the t2.c1 column is not wrapped inside a function call; otherwise the index on that column cannot be used.
A query like this should work for you:
SELECT something
FROM T AS t1
INNER JOIN T AS t2 ON t2.c1 > (t1.c1 - 2) and t2.c1 < (t1.c1 + 2);
NOTE: This assumes that you have an index on the c1 column.
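A quick sketch in SQLite (via Python's sqlite3) shows the rewrite is equivalent: wrapping the column in ABS() hides it from the index, while the plain range bounds on t2.c1 can be satisfied by an index seek. The table and data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T (c1 INTEGER);
    CREATE INDEX idx_c1 ON T (c1);
    INSERT INTO T (c1) VALUES (1), (2), (4), (5);
""")

abs_rows = conn.execute("""
    SELECT t1.c1, t2.c1 FROM T t1 JOIN T t2 ON ABS(t1.c1 - t2.c1) < 2
    ORDER BY t1.c1, t2.c1
""").fetchall()
range_rows = conn.execute("""
    SELECT t1.c1, t2.c1 FROM T t1 JOIN T t2
    ON t2.c1 > t1.c1 - 2 AND t2.c1 < t1.c1 + 2
    ORDER BY t1.c1, t2.c1
""").fetchall()
print(abs_rows == range_rows)  # True: same result, but only the second
                               # form lets the planner seek the index on c1
```

EXPLAIN on the range form (EXPLAIN QUERY PLAN in SQLite) should report an index search on c1 rather than a full scan of t2 per t1 row.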
I have a question that I think is simple enough but I seem to be having some trouble with it.
I have two tables. Each table has the exact same rows.
I am trying to perform a join on the two tables with the following query:
SELECT t1.* FROM `person` as t1
JOIN `person_temp` as t2
on t1.`date` = t2.`date`
and t1.`name` = t2.`name`
and t1.`country_id`= t2.`country_id`
The point of this query is to find all of the rows in t1 that match t2, where the combination of date, name, and country_id is the same (those three columns combined make a record unique). I don't think this query is correct for what I am trying to do, because even with the exact same data in both tables I am getting a much larger number of matches back.
Any ideas on how to edit this query to accomplish what I am trying to do?
Don't use join. Use exists:
SELECT t1.*
FROM `person` t1
where exists (select 1
from `person_temp` as t2
where t1.`date` = t2.`date`
and t1.`name` = t2.`name`
and t1.`country_id`= t2.`country_id`
);
For performance, you want a composite index on person_temp(date, name, country_id) (the columns can be in any order).
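A sketch of the EXISTS form in SQLite (via Python's sqlite3) shows why it fixes the inflated counts: each person row is returned at most once, even when person_temp holds several matching rows, which is exactly where the plain JOIN multiplies matches. The sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (date TEXT, name TEXT, country_id INT);
    CREATE TABLE person_temp (date TEXT, name TEXT, country_id INT);
    CREATE INDEX idx_pt ON person_temp (date, name, country_id);
    INSERT INTO person VALUES ('2020-01-01', 'ann', 1), ('2020-01-02', 'bob', 2);
    INSERT INTO person_temp VALUES
        ('2020-01-01', 'ann', 1),
        ('2020-01-01', 'ann', 1);
""")

# 'ann' has two matches in person_temp, but EXISTS yields her row once.
rows = conn.execute("""
    SELECT t1.* FROM person t1
    WHERE EXISTS (SELECT 1 FROM person_temp t2
                  WHERE t1.date = t2.date
                    AND t1.name = t2.name
                    AND t1.country_id = t2.country_id)
""").fetchall()
print(rows)  # [('2020-01-01', 'ann', 1)]
```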
I am trying to spot some broken records in a MS-SQL Database.
In a simplified example, the scenario is this:
I have 2 tables, simply put:
Table_1 : Id,Date,OpId
Table_2 : Date,OpId,EventName
And I have this business rule: if there is a record in Table_1, then at least one row should exist in Table_2 for that record's Table_1.Date and Table_1.OpId.
If there is a row in Table_1 with no matching row in Table_2, then the data is broken, whatever the reason.
To find out the incorrect data, I use:
SELECT *
FROM table_1 t1
LEFT JOIN table_2 t2 ON t1.Date = t2.Date AND t1.OpId = t2.OpId
WHERE t2.OpId IS NULL -- So, if there is no
-- matching row in table_2 then this is a mistake
But it takes too long to have the query completed.
Is there a faster or better way to approach similar scenarios?
For an anti semi join, NOT EXISTS in SQL Server usually performs at least as well as the other options (NOT IN, OUTER JOIN ... IS NULL, EXCEPT):
SELECT *
FROM table_1 t1
WHERE NOT EXISTS (SELECT *
FROM table_2 t2
WHERE t1.Date = t2.Date
AND t1.OpId = t2.OpId)
See Left outer join vs NOT EXISTS. You may well be missing a useful index though.
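The anti semi join can be sketched in SQLite (via Python's sqlite3): flag the Table_1 rows with no matching (Date, OpId) in Table_2. The data below is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (Id INT, Date TEXT, OpId INT);
    CREATE TABLE table_2 (Date TEXT, OpId INT, EventName TEXT);
    CREATE INDEX idx_t2 ON table_2 (Date, OpId);
    INSERT INTO table_1 VALUES (1, '2020-01-01', 10), (2, '2020-01-02', 20);
    INSERT INTO table_2 VALUES ('2020-01-01', 10, 'login');
""")

# Rows of table_1 with no companion row in table_2 are the broken data.
broken = conn.execute("""
    SELECT * FROM table_1 t1
    WHERE NOT EXISTS (SELECT 1 FROM table_2 t2
                      WHERE t1.Date = t2.Date AND t1.OpId = t2.OpId)
""").fetchall()
print(broken)  # [(2, '2020-01-02', 20)]
```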
With proper indexing there is not much more you can do (using NOT EXISTS instead of LEFT JOIN may be a little bit faster),
BUT
if Table_1 holds a relatively small amount of data, has no foreign keys or similar constraints, and this is a one-time procedure, then you can use a trick like this to drop the incorrect rows:
SELECT t1.*
INTO tempTable
FROM table_1 t1
WHERE EXISTS (SELECT * FROM table_2 t2 WHERE t1.Date = t2.Date AND t1.OpId = t2.OpId)
DROP TABLE table_1
EXEC sp_rename 'tempTable', 'Table_1'
This may be faster
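The copy-and-swap trick can be sketched in SQLite (via Python's sqlite3; SQL Server would use SELECT ... INTO and sp_rename instead): keep only the Table_1 rows that do have a Table_2 match, then replace the original table. The data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (Id INT, Date TEXT, OpId INT);
    CREATE TABLE table_2 (Date TEXT, OpId INT, EventName TEXT);
    INSERT INTO table_1 VALUES (1, '2020-01-01', 10), (2, '2020-01-02', 20);
    INSERT INTO table_2 VALUES ('2020-01-01', 10, 'login');
    -- Copy only the valid rows, then swap the tables.
    CREATE TABLE tempTable AS
        SELECT t1.* FROM table_1 t1
        WHERE EXISTS (SELECT 1 FROM table_2 t2
                      WHERE t1.Date = t2.Date AND t1.OpId = t2.OpId);
    DROP TABLE table_1;
    ALTER TABLE tempTable RENAME TO table_1;
""")
print(conn.execute("SELECT * FROM table_1").fetchall())
# [(1, '2020-01-01', 10)]
```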