Deleting duplicate rows on MySQL, getting a max row error - mysql

I am deleting duplicate rows on MySQL, keeping only the oldest row (lowest id), but I am getting a max row error:
DELETE n1
FROM item_audit n1, item_audit n2
WHERE n1.id > n2.id AND n1.description = n2.description

Keep in mind, with that join condition you are joining each row to every row before it (with the same description). This is one of those cases where a subquery will be much more effective than a join.
DELETE a
FROM item_audit a
WHERE (a.id, a.description) NOT IN
(SELECT * FROM
(
SELECT MIN(id), description
FROM item_audit
GROUP BY description
) AS realSubQ
)
Actually, assuming id is unique, it can be even simpler:
DELETE a
FROM item_audit a
WHERE a.id NOT IN
(SELECT * FROM
( SELECT MIN(id)
FROM item_audit
GROUP BY description
) AS realSubQ
)
As you discovered, MySQL needs to be "tricked" into being able to use the delete target in a subquery with the extra select * wrapper.
Alternatively, a join on the subquery could be used to reduce the size of the intermediate result set created behind the scenes.
DELETE a
FROM item_audit a
LEFT JOIN (SELECT MIN(id) AS firstId FROM item_audit GROUP BY description) AS aFirst
ON a.id = aFirst.firstId
WHERE aFirst.firstId IS NULL
;
If that fails, you can insert the first ids into a temporary table and run the delete (join or subquery version) against that.
CREATE TEMPORARY TABLE `old_ids`
SELECT MIN(ID) AS id
FROM item_audit
GROUP BY description;
DELETE a
FROM item_audit a
LEFT JOIN old_ids ON a.id = old_ids.id
WHERE old_ids.id IS NULL
;
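For an incremental delete, a LIMIT clause can be placed at the very end of the statement, with one caveat: MySQL does not accept LIMIT in the multi-table form (DELETE a FROM ...), so the statement has to be written in the single-table form, for example DELETE FROM item_audit WHERE id NOT IN (SELECT id FROM old_ids) LIMIT 1000 (the batch size is arbitrary). The temporary-table version has the benefit that the subquery does not need to be re-evaluated after every incremental delete (and the temporary table can be indexed to speed things up as well).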
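The incremental, temp-table-based delete can be tried out on a toy table. What follows is only a sketch in Python with the standard sqlite3 module, used as a runnable stand-in for MySQL. The sample rows and the delete_batch helper are hypothetical. Because typical SQLite builds do not support DELETE ... LIMIT, each batch is picked with a LIMIT inside a subquery instead, which is an assumption of this sketch rather than the MySQL form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item_audit (id INTEGER PRIMARY KEY, description TEXT)")
# Hypothetical sample data: ids 1..7, three distinct descriptions.
conn.executemany(
    "INSERT INTO item_audit (description) VALUES (?)",
    [("a",), ("b",), ("a",), ("c",), ("b",), ("a",), ("c",)],
)

# Materialise the ids to keep once, so the GROUP BY is not re-run per batch.
# The primary key doubles as the index on the temporary table.
conn.execute("CREATE TEMP TABLE old_ids (id INTEGER PRIMARY KEY)")
conn.execute(
    "INSERT INTO old_ids SELECT MIN(id) FROM item_audit GROUP BY description"
)


def delete_batch(conn, batch_size):
    """Delete at most batch_size duplicate rows; return how many were deleted."""
    cur = conn.execute(
        "DELETE FROM item_audit WHERE id IN ("
        "  SELECT id FROM item_audit"
        "  WHERE id NOT IN (SELECT id FROM old_ids)"
        "  LIMIT ?)",
        (batch_size,),
    )
    conn.commit()  # commit per batch so each transaction stays small
    return cur.rowcount


batches = []
while True:
    deleted = delete_batch(conn, 2)
    if deleted == 0:
        break
    batches.append(deleted)

remaining = conn.execute(
    "SELECT id, description FROM item_audit ORDER BY id"
).fetchall()
print(batches, remaining)
```

Committing after every batch is the point of the exercise: no single statement has to touch (or lock) all the duplicate rows at once.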

Related

In query with joins and multi-table/field ORDER BY, how to set LIMIT offset to start from a particular row identified by a unique id field?

Suppose I have four tables: tbl1 ... tbl4. Each has a unique numerical id field. tbl1, tbl2 and tbl3 each has a foreign key field for the next table in the sequence. E.g. tbl1 has a tbl2_id foreign key field, and so on. Each table also has a field order (and other fields not relevant to the question).
It is straightforward to join all four tables to return all rows of tbl1 together with corresponding fields from the other three tables. It is also easy to order this result set by a specific ORDER BY combination of the order fields. It is also easy to return just the row that corresponds to some particular id in tbl1, e.g. WHERE tbl1.id = 7777.
QUESTION: what query most efficiently returns (e.g.) 100 rows, starting from the row corresponding to id=7777, in the order determined by the specific combination of order fields?
Using ROW_NUMBER (or an emulation of it in MySQL versions < 8) to get the position of the id=7777 row, and then using that in a new version of the same query to set the offset in the LIMIT clause, would be one approach (with a read lock in between). But can it be done in a single query?
# FIRST QUERY: get row number of result row where tbl1.id = 7777
SELECT x.row_number
FROM
(SELECT @row_number:=@row_number+1 AS row_number, tbl1.id AS id
FROM (SELECT @row_number:=0) AS t, tbl1
INNER JOIN tbl2 ON tbl2.id = tbl1.tbl2_id
INNER JOIN tbl3 ON tbl3.id = tbl2.tbl3_id
INNER JOIN tbl4 ON tbl4.id = tbl3.tbl4_id
WHERE <some conditions>
ORDER BY tbl4.order, tbl3.order, tbl2.order, tbl1.order
) AS x
WHERE id=7777;
Store the row number from the above query and use it to bind :offset in the following query.
# SECOND QUERY : Get 100 rows starting from the one with id=7777
SELECT x.field1, x.field2, <etc.>
FROM
(SELECT @row_number:=@row_number+1 AS row_number, field1, field2
FROM (SELECT @row_number:=0) AS t, tbl1
INNER JOIN tbl2 ON tbl2.id = tbl1.tbl2_id
INNER JOIN tbl3 ON tbl3.id = tbl2.tbl3_id
INNER JOIN tbl4 ON tbl4.id = tbl3.tbl4_id
WHERE <same conditions as before>
ORDER BY tbl4.order, tbl3.order, tbl2.order, tbl1.order
) AS x
LIMIT :offset, 100;
Clarify question
In the general case, you won't ask for WHERE id1 > 7777. Instead, you have a tuple of (11,22,33,44) and you want to "continue where you left off".
Two discussions, with
That is messy, but not impossible. See "Iterating through a compound key". It gives an example of doing it with 2 columns; 4 columns coming from 4 tables is an extension of that.
A variation
Here is another discussion of such: https://dba.stackexchange.com/questions/164428/should-i-store-data-pre-ordered-rather-than-ordering-on-the-fly/164755#164755
When actually implementing this, I have found that letting the "100" (LIMIT) be flexible can be easier to think through. The idea is: reach forward 100 rows (with LIMIT 100,1). Let's say you get (111,222,333,444). If you are currently at (111, ...), then deal with id2/3/4. If it is, say, (113, ...), then do WHERE id1 < 113 and leave off any specification of id2/3/4. This means fetching fewer than 100 rows, but it lands you just shy of starting id1=113.
That is, it involves constructing a WHERE clause with between 1 and 4 conditions.
In all cases, your query says ORDER BY id1, id2, id3, id4. And the only use for LIMIT is in the probe to figure out how far ahead the 100th row is (with LIMIT 100,1).
I think I can dig out some old Perl code for that.
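In place of the Perl code, here is a minimal sketch of the basic "continue where you left off" idea for a two-column compound key, written in Python with the standard sqlite3 module as a runnable stand-in for MySQL. The table t, its payload column, and the next_page helper are hypothetical. For the four-table case the same pattern would be extended to the four order columns, plus a unique tie-breaker if the order values can repeat.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id1 INTEGER, id2 INTEGER, payload TEXT, "
    "PRIMARY KEY (id1, id2))"
)
# Hypothetical sample: keys (1,1) .. (3,3), nine rows in total.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(a, b, f"row-{a}-{b}") for a in range(1, 4) for b in range(1, 4)],
)


def next_page(conn, last_key, page_size):
    """Return up to page_size rows strictly after last_key in (id1, id2) order."""
    if last_key is None:  # first page: no starting point yet
        return conn.execute(
            "SELECT id1, id2, payload FROM t ORDER BY id1, id2 LIMIT ?",
            (page_size,),
        ).fetchall()
    a, b = last_key
    # "After (a, b)" means: a later id1, or the same id1 with a later id2.
    return conn.execute(
        "SELECT id1, id2, payload FROM t "
        "WHERE id1 > ? OR (id1 = ? AND id2 > ?) "
        "ORDER BY id1, id2 LIMIT ?",
        (a, a, b, page_size),
    ).fetchall()


pages = []
last = None
while True:
    page = next_page(conn, last, 4)
    if not page:
        break
    pages.append([(row[0], row[1]) for row in page])
    last = page[-1][:2]  # remember the key of the last row seen

print(pages)
```

No OFFSET is involved, so each page costs roughly the same regardless of how deep into the result set it starts, provided an index covers the ORDER BY columns.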

MySQL: delete query taking too long

I am trying to delete records with duplicate column values from a table, but it's taking forever. Basically it gets stuck, with no response for hours. The table is fairly large, with over 1.3M records. Is the query inefficient? Is there any way to optimize it?
delete n1 from ids n1, ids n2 where n1.id > n2.id and n1.user_id = n2.user_id
The database is remote, and I am using PuTTY to run queries.
Add an index:
ALTER TABLE ids ADD INDEX (user_id, id);
This makes it efficient to find all the rows with the same user ID and higher IDs.
It will also help to join with a subquery.
DELETE n1
FROM ids AS n1
JOIN (SELECT user_id, MIN(id) AS minid
FROM ids
GROUP BY user_id) AS n2
ON n1.user_id = n2.user_id AND n1.id > n2.minid
This will still be faster with the above index.
Yes, that query is very inefficient. Even if you used explicit joins, you need to keep in mind that basically every row "N" is being matched up with every row before "N", and every row "N-1" is being matched up with the rows before it.
Try something like this:
DROP TEMPORARY TABLE IF EXISTS keeps;
CREATE TEMPORARY TABLE keeps (
user_id INT,
keepID INT,
INDEX (user_id, keepID)
);
INSERT INTO keeps (user_id, keepID)
SELECT user_id, MIN(id) As keepID
FROM ids
GROUP BY user_id;
DELETE FROM ids WHERE (user_id, id) NOT IN (SELECT user_id, keepID FROM keeps);
DROP TEMPORARY TABLE IF EXISTS keeps;
I'm also tempted to suggest trying something like the below, but I can't remember if MySQL allows subquerying the delete table in the delete query ... which is why I suggested the temp table in the first one.
DELETE a
FROM ids AS a
WHERE EXISTS (
SELECT *
FROM ids AS b
WHERE b.id < a.id
AND b.user_id = a.user_id
)

Remove Duplicates from MYsql table using MinID and Complex Select

I found this here to delete records with min ID:
DELETE FROM Table WHERE id NOT IN (SELECT MIN(id) FROM Table GROUP BY FieldA)
However I don't want all the found dupes in the table to have the one with the lower ID removed, only a subset of them. I have other criteria for other dupes patterns. So I made my select to get the subset of records that have dupes AND other conditions where I then DO want the min ids removed:
Select min(Z),Max(Z),count(*) from Table
group by P,N
having count(*)>1 and Min(Z)!=Max(Z) and Min(Z)>0
I am unclear how to first get that subset of records and THEN remove the minID from the dupes in that subset
In MySQL use LEFT JOIN:
delete t
from table t left join
(Select min(Z), Max(Z), count(*)
from Table
group by P, N
having count(*) > 1 and Min(Z) <> Max(Z) and Min(Z) > 0
) tt
on t.? = tt.?
where tt.? is null;
It is unclear how the id is defined in your expression. It is also unclear whether the subquery generates the ids to keep or to delete. The version above assumes it is generating the ids to keep. (If it generates the ids to delete, then use inner join and get rid of the where clause.)

Possible speed up to delete via temp table in mysql

I am running a delete which removes all of the duplicates within a table. A duplicate is defined as a row where the tag_id, user_id, and is_self are all the same. My technique here is pretty standard: since the tags_users table itself needs to be referenced to know whether a duplicate exists, a temp table is created so that the delete can be performed on the same table that is being referenced. The problem is that this table has about a million rows, so the query takes about an hour to run. I believe this is related to the slow speed of defining this temp table and then referencing it, as it is un-indexed.
DELETE FROM tags_users WHERE id IN (
SELECT id FROM (
SELECT A.id FROM tags_users as A, tags_users as B WHERE A.id > B.id AND A.user_id = B.user_id AND A.tag_id = B.tag_id AND A.is_self = B.is_self GROUP BY A.id
) temp_dup_delete
);
I have reviewed the explain from this query listed here (Please note I'm on mysql 5.5 so I'm using EXPLAIN SELECT 1 to simulate EXPLAIN DELETE). I think the best possible solution to this is to define an index on the temp table, but I cannot figure out how to do this yet. The crux of my question here is: is there a way to improve the speed of this query considering the way it defines a temp table. Thank you to anyone that can help.
Here is an alternative approach. Use an aggregation query to find the minimum id for each set of key values -- this seems to be the row you want to keep.
Then, use left outer join to match to this table and delete all the rows in the original data that do not match.
delete tu
from tags_users tu left outer join
(select tag_id, user_id, is_self, min(id) as minid
from tags_users
group by tag_id, user_id, is_self
) tui
on tui.minid = tu.id
where tui.minid is null;

Eliminating duplicates from SQL query

What would be the best way to return one item from each id instead of all of the other items within the table. Currently the query below returns all manufacturers
SELECT m.name
FROM `default_ps_products` p
INNER JOIN `default_ps_products_manufacturers` m ON p.manufacturer_id = m.id
I have solved my question by using the DISTINCT value in my query:
SELECT DISTINCT m.name, m.id
FROM `default_ps_products` p
INNER JOIN `default_ps_products_manufacturers` m ON p.manufacturer_id = m.id
ORDER BY m.name
There are 4 main ways I can think of to delete duplicate rows:
method 1
delete all rows bigger than smallest or less than greatest rowid value. Example
delete from tableName a where rowid> (select min(rowid) from tableName b where a.key=b.key and a.key2=b.key2)
method 2
Usually faster, but you must recreate all indexes, constraints and triggers afterward.
Pull all rows as distinct into a new table, then drop the first table and rename the new table to the old table name.
Example:
create table t2 as select distinct * from t1; drop table t1; rename t2 to t1;
method 3
Delete using where exists based on rowid. Example:
delete from tableName a where exists(select 'x' from tableName b where a.key1=b.key1 and a.key2=b.key2 and b.rowid >a.rowid) Note if nulls are on column use nvl on column name.
method 4
collect first row for each key value and delete rows not in this set. Example
delete from tableName a where rowid not in(select min(rowid) from tableName b group by key1, key2)
note that you don't have to use nvl for method 4
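The NULL remarks on methods 3 and 4 can be checked with a small experiment. This is only a sketch in Python with the standard sqlite3 module, used as a runnable stand-in for the database the answer has in mind. The table t and its sample rows are hypothetical. SQLite's IS operator is assumed here to play the null-safe role that wrapping the column in nvl plays in the answer.

```python
import sqlite3


def make_table():
    # Hypothetical sample: rowids 1-4, two 'x' keys and two NULL keys.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (key1 TEXT)")
    conn.executemany(
        "INSERT INTO t (key1) VALUES (?)",
        [("x",), ("x",), (None,), (None,)],
    )
    return conn


def remaining(conn):
    return conn.execute("SELECT rowid, key1 FROM t ORDER BY rowid").fetchall()


# Method 3 as written: "=" never matches NULL, so NULL duplicates survive.
m3 = make_table()
m3.execute(
    "DELETE FROM t WHERE EXISTS ("
    " SELECT 1 FROM t AS b WHERE b.key1 = t.key1 AND b.rowid > t.rowid)"
)
rows_m3 = remaining(m3)

# Method 3 with a null-safe comparison: NULL keys now count as duplicates.
m3_safe = make_table()
m3_safe.execute(
    "DELETE FROM t WHERE EXISTS ("
    " SELECT 1 FROM t AS b WHERE b.key1 IS t.key1 AND b.rowid > t.rowid)"
)
rows_m3_safe = remaining(m3_safe)

# Method 4: GROUP BY puts all NULL keys in one group, no special handling.
m4 = make_table()
m4.execute(
    "DELETE FROM t WHERE rowid NOT IN ("
    " SELECT MIN(rowid) FROM t GROUP BY key1)"
)
rows_m4 = remaining(m4)

print(len(rows_m3), len(rows_m3_safe), len(rows_m4))
```

Note that method 3 as written keeps the row with the greatest rowid in each group, while method 4 keeps the smallest, so the two differ in which copy survives as well as in their NULL handling.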
Using DISTINCT is often a bad practice. It may be a sign that there is something wrong with your SELECT statement, or that your data structure is not normalized.
In your case I would use this (in assumption that default_ps_products_manufacturers has unique records).
SELECT m.id, m.name
FROM default_ps_products_manufacturers m
WHERE EXISTS (SELECT 1 FROM default_ps_products p WHERE p.manufacturer_id = m.id)
Or an equivalent query with IN:
SELECT m.id, m.name
FROM default_ps_products_manufacturers m
WHERE m.id IN (SELECT p.manufacturer_id FROM default_ps_products p)
The only caveat: among all the possible queries, it is better to pick the one with the best execution plan, which may depend on your vendor and on the physical structure, statistics, etc. of your database.
I think in most cases EXISTS will work better.
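To see that the three formulations agree, here is a small sketch in Python with the standard sqlite3 module as a runnable stand-in. The table names come from the question, while the sample manufacturers and products are hypothetical. It only compares results; it says nothing about which plan is faster on a given server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE default_ps_products_manufacturers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE default_ps_products (id INTEGER PRIMARY KEY, manufacturer_id INTEGER);
-- Hypothetical sample: three manufacturers, one of them without products.
INSERT INTO default_ps_products_manufacturers VALUES
    (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
INSERT INTO default_ps_products (manufacturer_id) VALUES (1), (1), (2), (1);
""")

# Plain join: one output row per product, so manufacturers repeat.
plain_join = conn.execute("""
    SELECT m.id, m.name
    FROM default_ps_products p
    INNER JOIN default_ps_products_manufacturers m ON p.manufacturer_id = m.id
    ORDER BY m.name
""").fetchall()

# DISTINCT collapses the repeats after the join has produced them.
via_distinct = conn.execute("""
    SELECT DISTINCT m.id, m.name
    FROM default_ps_products p
    INNER JOIN default_ps_products_manufacturers m ON p.manufacturer_id = m.id
    ORDER BY m.name
""").fetchall()

# EXISTS never produces the repeats in the first place.
via_exists = conn.execute("""
    SELECT m.id, m.name
    FROM default_ps_products_manufacturers m
    WHERE EXISTS (SELECT 1 FROM default_ps_products p
                  WHERE p.manufacturer_id = m.id)
    ORDER BY m.name
""").fetchall()

via_in = conn.execute("""
    SELECT m.id, m.name
    FROM default_ps_products_manufacturers m
    WHERE m.id IN (SELECT p.manufacturer_id FROM default_ps_products p)
    ORDER BY m.name
""").fetchall()

print(len(plain_join), via_distinct, via_exists, via_in)
```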