Using LIMIT on MySQL deletion across a two-column duplicate - mysql

I have a large MySQL table from which I need to delete duplicates. To qualify as a duplicate, a row must match another row on two columns:
SELECT * FROM JwDistanceSurnames n1, JwDistanceSurnames n2
WHERE n1.JwDistanceSurnameId > n2.JwDistanceSurnameId
AND n1.Surname1 = n2.Surname1
AND n1.Surname2 = n2.Surname2
LIMIT 1000;
Because it is a large table, I'd like to do it in batches. My understanding is that I ought to be able to use LIMIT to achieve this. However, this does not execute, citing a syntax error:
DELETE n1 FROM JwDistanceSurnames n1, JwDistanceSurnames n2
WHERE n1.JwDistanceSurnameId > n2.JwDistanceSurnameId
AND n1.Surname1 = n2.Surname1
AND n1.Surname2 = n2.Surname2
LIMIT 1000;
What's the error? Is it not possible to use this simple approach to batching here?
MCVE:
CREATE TABLE `JwDistanceSurnames` (
`JwDistanceSurnameId` int(11) NOT NULL AUTO_INCREMENT,
`Surname1` varchar(999) DEFAULT NULL,
`Surname2` varchar(999) DEFAULT NULL,
`JwScore` double NOT NULL,
PRIMARY KEY (`JwDistanceSurnameId`),
KEY `Surname1` (`Surname1`),
KEY `Surname2` (`Surname2`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
INSERT INTO `JwDistanceSurnames`
(`JwDistanceSurnameId`, `Surname1`, `Surname2`, `JwScore`)
VALUES (null,'williamsom' ,'williamson' ,0.959999999999998);
Repeat the insert a few times. Then run the delete. The expected output is a single row, with the given values. Which of the rows is kept is not important.
The error is:
Error Code: 1064. You have an error in your SQL syntax; check the
manual that corresponds to your MySQL server version for the right
syntax to use near 'ORDER BY n1.JwDistanceSurnameId LIMIT 1000' at
line 5

From this SO question, it appears that LIMIT cannot be used in a DELETE statement when more than one table is being referenced. One trick around this is to use LIMIT in a subquery to identify records for deletion, and then join back to the target table:
DELETE t1
FROM JwDistanceSurnames t1
INNER JOIN
(
SELECT n1.JwDistanceSurnameId
FROM JwDistanceSurnames n1
INNER JOIN JwDistanceSurnames n2
ON n1.JwDistanceSurnameId > n2.JwDistanceSurnameId
WHERE n1.Surname1 = n2.Surname1 AND n1.Surname2 = n2.Surname2
ORDER BY <some_column> -- IMPORTANT! without this you may get random records
LIMIT 1000
) t2
ON t1.JwDistanceSurnameId = t2.JwDistanceSurnameId;
So the subquery labelled t2 uses LIMIT to identify a batch of up to 1000 records for deletion, and then we join back to the target table to actually delete those records.
Also note that using LIMIT without ORDER BY is not really a well-defined thing, because SQL tables are modelled on unordered sets of records. If you have some business logic determining which order the batches should be deleted, then consider adding an ORDER BY clause (unless it truly does not matter, which would seem unlikely to me).
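To make the batching idea concrete, here is a minimal sketch of the surrounding loop: run the limited delete repeatedly until it affects no rows. It uses Python's sqlite3 module as a stand-in so it runs without a server; the function names and the batch size are assumptions for illustration. SQLite allows the subquery to read the table being deleted from, whereas on MySQL you would issue the derived-table join form shown above inside the same loop.

```python
import sqlite3


def make_demo_connection(copies=5):
    """Build an in-memory table holding `copies` identical surname pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE JwDistanceSurnames (
            JwDistanceSurnameId INTEGER PRIMARY KEY AUTOINCREMENT,
            Surname1 TEXT,
            Surname2 TEXT,
            JwScore REAL NOT NULL
        )
        """
    )
    conn.executemany(
        "INSERT INTO JwDistanceSurnames (Surname1, Surname2, JwScore) VALUES (?, ?, ?)",
        [("williamsom", "williamson", 0.96)] * copies,
    )
    conn.commit()
    return conn


def delete_duplicates_in_batches(conn, batch_size=1000):
    """Delete duplicates in batches; return the total number of rows removed."""
    total = 0
    while True:
        cur = conn.execute(
            """
            DELETE FROM JwDistanceSurnames
            WHERE JwDistanceSurnameId IN (
                -- DISTINCT: a row with several older twins appears once
                SELECT DISTINCT n1.JwDistanceSurnameId
                FROM JwDistanceSurnames n1
                JOIN JwDistanceSurnames n2
                  ON n1.JwDistanceSurnameId > n2.JwDistanceSurnameId
                 AND n1.Surname1 = n2.Surname1
                 AND n1.Surname2 = n2.Surname2
                ORDER BY n1.JwDistanceSurnameId
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            break  # nothing left to delete
        total += cur.rowcount
    return total
```

Committing after each batch keeps every transaction short, which is usually the point of batching on a large table.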

I think you can find the duplicates another way:
SELECT n.*
FROM JwDistanceSurnames n
JOIN
(
SELECT Surname1,Surname2,MIN(JwDistanceSurnameId) min_JwDistanceSurnameId
FROM JwDistanceSurnames
GROUP BY Surname1,Surname2
) l
ON n.Surname1=l.Surname1 AND n.Surname2=l.Surname2 AND n.JwDistanceSurnameId>l.min_JwDistanceSurnameId
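As a quick check of the keep-the-minimum-id idea, the sketch below runs that kind of query against a tiny in-memory table using Python's sqlite3 module (a stand-in for MySQL; the sample rows and the helper name find_duplicate_ids are made up for illustration). The join compares each row with its own group in the derived table, so only rows with an id above the group minimum come back.

```python
import sqlite3


def find_duplicate_ids(conn):
    """Return ids of rows that are not the lowest id of their surname pair."""
    rows = conn.execute(
        """
        SELECT n.JwDistanceSurnameId
        FROM JwDistanceSurnames n
        JOIN (
            SELECT Surname1, Surname2, MIN(JwDistanceSurnameId) AS min_id
            FROM JwDistanceSurnames
            GROUP BY Surname1, Surname2
        ) l
          ON n.Surname1 = l.Surname1
         AND n.Surname2 = l.Surname2
         AND n.JwDistanceSurnameId > l.min_id
        ORDER BY n.JwDistanceSurnameId
        """
    ).fetchall()
    return [row[0] for row in rows]


def make_sample_connection():
    """Hypothetical sample data: ids 2 and 4 repeat the pair stored at id 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE JwDistanceSurnames ("
        "JwDistanceSurnameId INTEGER PRIMARY KEY, Surname1 TEXT, Surname2 TEXT)"
    )
    conn.executemany(
        "INSERT INTO JwDistanceSurnames (Surname1, Surname2) VALUES (?, ?)",
        [("a", "b"), ("a", "b"), ("c", "d"), ("a", "b"), ("c", "e")],
    )
    return conn
```

Note that rows where either surname is NULL never match under =, so they would not be reported as duplicates by this query.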

Related

PHPMYADMIN: Access field in EXISTS clause

I am sure it's just a typo, but how to write the following query correctly in PHPMyAdmin?
SELECT DISTINCT `email_address` as tmp1
FROM `already_customer_checks`
WHERE `is_customer` = 0
AND NOT EXISTS (
SELECT *
FROM `already_customer_checks`
WHERE `email_address` = tmp1
AND `is_customer` = 1
)
Error: #1054 - Unknown table field 'tmp1' in where clause
Background: I want to get all e-mail addresses which exist with 'is_customer' = 0 and do not have another row in the table with 'is_customer' = 1.
Thank you very much in advance!
To do it with a subquery you need to put the alias tmp1 on the table, not on the column. And then:
SELECT DISTINCT `email_address`
FROM `already_customer_checks` as tmp1
WHERE `is_customer` = 0
AND NOT EXISTS (
SELECT *
FROM `already_customer_checks`
WHERE `email_address` = tmp1.`email_address`
AND `is_customer` = 1
)
You might also consider the comment proposed by @kmoser, which could be more efficient, if less clear. According to the MySQL docs:
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
But if you use the SQL proposed by @kmoser, you probably don't want to alias the email_address column with tmp1.
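The comment's exact SQL is not reproduced here, so the following is only an assumed shape of a LEFT JOIN alternative, placed next to the NOT EXISTS form for comparison. It is a sketch using Python's sqlite3 module with made-up sample rows; the aliases a, b and c and the helper names are illustrative.

```python
import sqlite3

# Correlated subquery: the alias tmp1 sits on the table, not on the column.
NOT_EXISTS_SQL = """
SELECT DISTINCT tmp1.email_address
FROM already_customer_checks AS tmp1
WHERE tmp1.is_customer = 0
  AND NOT EXISTS (
      SELECT 1
      FROM already_customer_checks AS c
      WHERE c.email_address = tmp1.email_address
        AND c.is_customer = 1
  )
ORDER BY tmp1.email_address
"""

# Anti-join: keep rows for which no matching customer row was joined.
LEFT_JOIN_SQL = """
SELECT DISTINCT a.email_address
FROM already_customer_checks AS a
LEFT JOIN already_customer_checks AS b
  ON b.email_address = a.email_address
 AND b.is_customer = 1
WHERE a.is_customer = 0
  AND b.email_address IS NULL
ORDER BY a.email_address
"""


def make_checks_db():
    """Hypothetical sample data: only b@x has no is_customer = 1 row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE already_customer_checks (email_address TEXT, is_customer INTEGER)"
    )
    conn.executemany(
        "INSERT INTO already_customer_checks VALUES (?, ?)",
        [("a@x", 0), ("a@x", 1), ("b@x", 0), ("b@x", 0), ("c@x", 1)],
    )
    return conn


def emails(conn, sql):
    """Run one of the queries and return the addresses as a list."""
    return [row[0] for row in conn.execute(sql)]
```

Both queries should return the same addresses on this data; which one is faster depends on the server version and the available indexes.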

Error occur when try to delete via select #1241 - Operand should contain 1 column(s)

I'm trying to delete a lot of data selected by a query. The SELECT works properly and returns 75k+ rows. I need to delete them, but when I try, this error occurs:
#1241 - Operand should contain 1 column(s). I'm using PHPMyAdmin.
DELETE FROM `crm_wsal_metadata`
WHERE `occurrence_id` = ANY
(SELECT *
FROM `crm_wsal_metadata`
WHERE `name` = `PostDate` AND `value` BETWEEN str_to_date('2018-12-26', '%Y-%m-%d') AND str_to_date('2020-05-31', '%Y-%m-%d')
GROUP BY `occurrence_id`)
Use
... SELECT `occurrence_id` ...
instead of SELECT *. The subquery on the right-hand side of = ANY must return exactly one column, and with GROUP BY you should select only grouped columns and aggregates, not the star (some proprietary extensions allow it, but I don't recommend relying on them).
I found the answer and will try to write it up step by step:
Why does this error happen?
In MySQL, you can't modify the same table which you use in the SELECT part.
This behavior is documented at http://dev.mysql.com/doc/refman/5.6/en/update.html
How can you make it work?
There are two ways:
Join the table to itself
UPDATE tbl AS a
INNER JOIN tbl AS b ON ....
SET a.col = b.col
Nest the subquery deeper into a from clause
UPDATE tbl SET col = (
SELECT ... FROM (SELECT.... FROM) AS x);
Personally, in my case the code looked like this:
DELETE FROM crm_wsal_metadata
WHERE occurrence_id = ANY (
SELECT occurrence_id FROM (
SELECT occurrence_id FROM crm_wsal_metadata WHERE name = 'PostDate' AND value BETWEEN str_to_date('2018-12-26', '%Y-%m-%d') AND str_to_date('2020-05-31', '%Y-%m-%d')
) AS search -- the derived table alias goes after its closing parenthesis
);
Sorry for the bad styling. I'm new to this :)

Delete all items in a database except the last date

I have a MySQL table that looks (very simplified) like this:
CREATE TABLE `logging` (
`id` bigint(20) NOT NULL,
`time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`level` smallint(3) NOT NULL,
`message` longtext CHARACTER SET utf8 COLLATE utf8_general_mysql500_ci NOT NULL
);
I would like to delete all rows of a specific level, except the last one (time is most recent).
Is there a way to select all rows with level set to a specific value and then delete all rows except the latest one in one single SQL query? How would I start solving this problem?
(As I said, this is a very simplified table, so please don't try to discuss possible design problems of this table. I removed some columns. It is designed per PSR-3 logging standard and I don't think there is an easy way to change that. What I want to solve is how I can select from a table and then delete all but some rows of the same table. I have only intermediate knowledge of MySQL.)
Thank you for pushing me in the right direction :)
Edit:
The Database version is /usr/sbin/mysqld Ver 8.0.18-0ubuntu0.19.10.1 for Linux on x86_64 ((Ubuntu))
You can use the ROW_NUMBER() window function (available since you are on version 8+):
DELETE lg FROM `logging` AS lg
WHERE lg.`id` IN
( SELECT t.`id`
FROM
(
SELECT t.*,
ROW_NUMBER() OVER (ORDER BY `time` DESC) as rn
FROM `logging` t
-- WHERE `level` = @lvl -- optionally add this line to restrict to a specific value of `level`
) t
WHERE t.rn > 1
)
This deletes all of the rows except the most recent one by time (assuming id is your primary key column).
You can do this:
SELECT COUNT(time) FROM logging WHERE level=some_level INTO @TIME_COUNT;
SET @TIME_COUNT = @TIME_COUNT-1;
PREPARE STMT FROM 'DELETE FROM logging WHERE level=some_level ORDER BY time ASC LIMIT ?;';
EXECUTE STMT USING @TIME_COUNT;
If you have an AUTO_INCREMENT id column - I would use it to determine the most recent entry. Here is one way doing that:
delete l
from (
select l1.level, max(id) as id
from logging l1
where l1.level = @level
) m
join logging l
on l.level = m.level
and l.id < m.id
An index on (level) should give you good performance and will support the MAX() subquery as well as the JOIN.
If you really need to use the time column, you can modify the query as follows:
delete l
from (
select l1.level, l1.id
from logging l1
where l1.level = @level
order by l1.time desc, l1.id desc
limit 1
) m
join logging l
on l.level = m.level
and l.id <> m.id
Here you would want to have an index on (level, time).
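For a runnable illustration of the "keep only the newest row of one level" idea, here is a small sketch using Python's sqlite3 module in place of MySQL; the helper names and sample rows are assumptions. SQLite lets the subquery read the table being deleted from directly. On MySQL that direct form raises error 1093, which is why the answers above use a join against a derived table instead.

```python
import sqlite3


def make_logging_db():
    """Hypothetical sample data: three rows at level 3, one row at level 5."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE logging ("
        "id INTEGER PRIMARY KEY, time TEXT NOT NULL, "
        "level INTEGER NOT NULL, message TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO logging (id, time, level, message) VALUES (?, ?, ?, ?)",
        [
            (1, "2020-01-01 00:00:00", 3, "oldest"),
            (2, "2020-01-03 00:00:00", 3, "newest"),
            (3, "2020-01-02 00:00:00", 3, "middle"),
            (4, "2020-01-01 00:00:00", 5, "other level"),
        ],
    )
    conn.commit()
    return conn


def delete_all_but_latest(conn, level):
    """Delete every row of `level` except the most recent; return rows removed."""
    cur = conn.execute(
        """
        DELETE FROM logging
        WHERE level = ?
          AND id <> (
              -- newest row of this level; id breaks ties on equal timestamps
              SELECT id
              FROM logging
              WHERE level = ?
              ORDER BY time DESC, id DESC
              LIMIT 1
          )
        """,
        (level, level),
    )
    conn.commit()
    return cur.rowcount
```

Rows of other levels are untouched, and if no row of the given level exists the subquery yields NULL, so nothing is deleted.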

How to delete rows from a table in database X, where the ID exists in Database Y

I've got 2 mysql 5.7 databases hosted on the same server (we're migrating from 1 structure to another)
I want to delete all the rows from database1.table_x where there is a corresponding row in database2.table_y
The column which contains the data to match on is called code
I'm able to do a SELECT which returns everything that is expected - this is effectively the set of data I want to delete.
An example select would be:
SELECT *
FROM `database1`.`table_x`
WHERE `code` NOT IN (SELECT `code`
FROM `database2`.`table_y`);
This works and it returns 5 rows within 138ms.
--
However, if I change the SELECT to a DELETE, e.g.
DELETE
FROM `database1`.`table_x`
WHERE `code` NOT IN (SELECT `code`
FROM `database2`.`table_y`);
The query seems to hang - there are no errors returned, so I have to manually cancel the query after about 3 minutes.
--
Could anyone advise the most efficient/fastest way to achieve this?
Try it like below; wrapping the subquery in a derived table may help:
DELETE FROM table_a WHERE `code` NOT IN (
select * from
(
SELECT `code` FROM `second_database`.`table_b`
) as t
);
Try the following query:
DELETE a
FROM first_database.table_a AS a
LEFT JOIN second_database.table_b AS b ON b.code = a.code
WHERE b.code IS NULL;
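One caveat when swapping NOT IN for a LEFT JOIN ... IS NULL: the two forms agree only if code is never NULL in the second table. The sketch below shows the difference on made-up data with Python's sqlite3 module; both tables live in one in-memory database as stand-ins for database1.table_x and database2.table_y, and the helper names are illustrative.

```python
import sqlite3

NOT_IN_SQL = """
SELECT code
FROM table_x
WHERE code NOT IN (SELECT code FROM table_y)
ORDER BY code
"""

ANTI_JOIN_SQL = """
SELECT x.code
FROM table_x AS x
LEFT JOIN table_y AS y ON y.code = x.code
WHERE y.code IS NULL
ORDER BY x.code
"""


def make_two_tables():
    """Hypothetical data: table_y contains one NULL code."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_x (code TEXT)")
    conn.execute("CREATE TABLE table_y (code TEXT)")
    conn.executemany("INSERT INTO table_x VALUES (?)", [("A",), ("B",), ("C",)])
    conn.executemany("INSERT INTO table_y VALUES (?)", [("A",), (None,)])
    return conn


def codes(conn, sql):
    """Run one of the queries and return the codes as a list."""
    return [row[0] for row in conn.execute(sql)]
```

With a NULL present in table_y, NOT IN matches no rows at all because the comparison against NULL is unknown, while the anti-join still returns B and C. If code is declared NOT NULL in both tables, the two forms select the same rows.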

Error #1093 - You can't specify target table 'relProductsPrices' for update in FROM clause

I'm upgrading and optimizing an old table structure.
In order to work properly with REPLACE INTO, I'm removing old zombie entries that interfere with the new unique key over 2 columns.
Query:
DELETE from `relProductsPrices` where `ID` in
(SELECT scanA.ID from `relProductsPrices` as scanA
inner join `relProductsPrices` as scanB
where scanA.ID < scanB.ID
and scanA.product = scanB.product
and scanA.priceName = scanB.priceName);
Error:
#1093 - You can't specify target table 'relProductsPrices' for update in FROM clause
I'm not sure how to get this into one MySQL query properly.
I hope this question is not a duplicate; I was unable to find a similar, adaptable entry. There are questions regarding this error, but I don't have an UPDATE query here at all, and the solution most people suggest (wrapping it in a subselect) is something I had already done.
Thanks in advance!
Try this:
DELETE FROM `relProductsPrices`
WHERE `ID` IN (
SELECT
tmp.ID
FROM (
SELECT
scanA.ID
FROM
`relProductsPrices` as scanA
INNER JOIN `relProductsPrices` as scanB
ON scanA.ID < scanB.ID
AND scanA.product = scanB.product
AND scanA.priceName = scanB.priceName
) as tmp
);