MySQL self join to flag duplicate rows - better way to do this?


I have a table that occasionally has duplicate row values, so I want to update anything except the first one and flag it as a duplicate. Currently I'm using this but it can be very slow:
UPDATE _gtemp X
JOIN _gtemp Y
ON CONCAT(X.gt_spid, "-", X.gt_cov) = CONCAT(Y.gt_spid, "-", Y.gt_cov)
AND Y.gt_dna = 0
AND Y.gt_gtid < X.gt_gtid
SET X.gt_dna = 1;
gt_spid is a numerical ID, and gt_cov is CHAR(3). I have an index on gt_spid and a 2nd index on gt_spid, gt_cov. At times this table can be upwards of 250,000 rows, but even at 30,000 it takes forever.
Is there a better way to accomplish this? I can change the table as needed.
CREATE TABLE `_gtemp` (
`gt_gtid` int(11) NOT NULL AUTO_INCREMENT,
`gt_group` varchar(10) DEFAULT NULL,
`gt_spid` int(11) DEFAULT NULL,
`gt_cov` char(3) DEFAULT NULL,
`gt_dna` tinyint(1) DEFAULT '0',
PRIMARY KEY (`gt_gtid`),
KEY `spid` (`gt_spid`),
KEY `spidcov` (`gt_spid`,`gt_cov`) USING HASH
)

Wrapping the join columns in CONCAT prevents the MySQL optimizer from using its indexes on those columns, which makes the query very slow.
Replace the CONCAT comparison with separate equality conditions combined with AND, like below:
UPDATE
_gtemp X
JOIN
_gtemp Y
ON
X.gt_spid = Y.gt_spid
AND
X.gt_cov = Y.gt_cov
AND
Y.gt_dna = 0
AND
Y.gt_gtid < X.gt_gtid
SET X.gt_dna = 1;
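One way to check whether the rewrite actually lets the optimizer use the spidcov index is to run EXPLAIN on the equivalent SELECT for both join conditions and compare the key column of the output. This is only a sketch: the plan you see depends on your data, your MySQL version, and the storage engine.

```sql
-- Plan for the original CONCAT-based join condition.
EXPLAIN
SELECT X.gt_gtid
FROM _gtemp X
JOIN _gtemp Y
  ON CONCAT(X.gt_spid, '-', X.gt_cov) = CONCAT(Y.gt_spid, '-', Y.gt_cov)
 AND Y.gt_dna = 0
 AND Y.gt_gtid < X.gt_gtid;

-- Plan for the rewritten join on the bare columns.
EXPLAIN
SELECT X.gt_gtid
FROM _gtemp X
JOIN _gtemp Y
  ON X.gt_spid = Y.gt_spid
 AND X.gt_cov = Y.gt_cov
 AND Y.gt_dna = 0
 AND Y.gt_gtid < X.gt_gtid;
```

If the second plan lists spidcov (or spid) under key for Y while the first shows no usable key, the CONCAT expression was the bottleneck.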

You can eliminate CONCAT from the ON clause and replace it with AND conditions, as follows.
I have also moved one restriction from the ON clause to the WHERE clause.
In addition, add an index on gt_dna.
UPDATE _gtemp X
JOIN _gtemp Y
ON X.gt_spid = Y.gt_spid
AND X.gt_cov = Y.gt_cov
AND Y.gt_dna = 0
SET X.gt_dna = 1
WHERE Y.gt_gtid < X.gt_gtid

Related

Serious MySQL query performance issues after adding condition

My problem is that I have a mysql query that runs really fast (0.3 seconds) even though it has a large amount of left joins and a few conditions on the joined columns, but when I add one more condition the query takes upwards of 180 seconds! I understand that the condition means the execution plan has to adjust to pull all potential records first and then apply the condition in a loop, but what's weird to me is that the fast query without the additional condition only returns 16 rows, and even just wrapping the query with the condition on the outer query takes a crazy amount of time when you would think it would only just add an additional loop through 16 rows...
If it matters this is using Amazon Aurora serverless which should align with mysql 5.7
Here's what the query looks like. You can see the additional condition is commented out. (The general table structure of the DB itself cannot change currently so please refrain from suggesting a full database restructuring)
select
e1.entityId as _id,
v1.Value,
v2.Value,
v3.Value,
v4.Value,
v5.Value,
v6.Value,
v7.Value,
v8.Value,
v9.Value,
v10.Value,
v11.Value,
v12.Value
from entity e1
left join val as v1 on (v1.entityId = e1.entityId and v1.attributeId = 1189)
left join val as v2 on (v2.entityId = e1.entityId and v2.attributeId = 1190)
left join entity as e2 on e2.entityId = (select entityId from entity where code = v1.Value and type = 88 limit 1)
left join val as v3 on (v3.entityId = e2.entityId and v3.attributeId = 507)
left join val as v4 on (v4.entityId = e2.entityId and v4.attributeId = 522)
left join val as v5 on (v5.entityId = e2.entityId and v5.attributeId = 558)
left join val as v6 on (v6.entityId = e2.entityId and v6.attributeId = 516)
left join val as v7 on (v7.entityId = e2.entityId and v7.attributeId = 518)
left join val as v8 on (v8.entityId = e2.entityId and v8.attributeId = 1384)
left join val as v9 on (v9.entityId = e2.entityId and v9.attributeId = 659)
left join val as v10 on (v10.entityId = e2.entityId and v10.attributeId = 519)
left join val as v11 on (v11.entityId = e2.entityId and v11.attributeId = 1614)
left join entity as e3 on e3.entityId = (select entityId from entity where code = v9.Value and type = 97 limit 1)
left join val as v12 on (v12.entityId = e3.entityId and v12.attributeId = 661)
where e1.type = 154
and v2.Value = 'foo'
and v5.Value = 'bar'
and v10.Value = 'foo2'
-- and v11.Value = 'bar2'
order by v3.Value asc;
And wrapping that in something like this still takes forever...
select *
from (
<query from above>
) sub
where sub.v11 = 'bar2';
query execution plan with the condition commented out (fast)
query execution plan with the condition included (slow)
I'm going to fiddle around with indexing on the "entity" tables to improve the execution plan regardless which will likely help... but can someone explain what's going on here and what I should be looking at in the execution plan that would indicate such bad performance? And why wrapping the fast query in a subquery so that the outer query should only loop over 16 rows takes a really long time?
EDIT: I noticed in the slow query that the far left execution is using a non-unique key lookup (which is on val.entityId) for "68e9145e-43eb-4581-9727-4212be41bef5" (v11) instead of the unique key lookup the rest are using (which is a composite index on entityId,attributeId). I presume this might be part of the issue, but why can't it use the composite index there like it does for the rest?
PS: For now since we know the result set will be small, we are implementing that last condition server side with a filter on the result set in our nodeJS server.
Here's the results of "SHOW CREATE TABLE entity" and "SHOW CREATE TABLE val"
CREATE TABLE `entity` (
`entityId` int(11) NOT NULL AUTO_INCREMENT,
`UID` varchar(64) NOT NULL,
`type` int(11) NOT NULL,
`code` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
PRIMARY KEY (`entityId`),
UNIQUE KEY `UID` (`UID`),
KEY `IX_Entity_Type` (`type`),
CONSTRAINT `FK_Entities_Types` FOREIGN KEY (`type`) REFERENCES `entityTypes` (`typeId`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=296138 DEFAULT CHARSET=latin1
CREATE TABLE `val` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`UID` varchar(64) NOT NULL,
`attributeId` int(11) NOT NULL,
`entityId` int(11) NOT NULL,
`Value` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
PRIMARY KEY (`id`),
UNIQUE KEY `UID` (`UID`),
UNIQUE KEY `idx_val_entityId_attributeId` (`entityId`,`attributeId`),
KEY `IX_val_attributeId` (`attributeId`),
KEY `IX_val_entityId` (`entityId`)
) ENGINE=InnoDB AUTO_INCREMENT=2325375 DEFAULT CHARSET=latin1

Please provide SHOW CREATE TABLE.
I would hope to see these composite indexes:
`val`: (entityId, attributeId) -- order is not critical
Alas, because code is LONGTEXT, this is not possible for entity: INDEX(type, code, entityId). Hence this will not be very efficient:
SELECT entityId
from entity
where code = v9.Value
and type = 97
limit 1
I see LIMIT without an ORDER BY -- do you care which value you get?
Probably that would be better written as
WHERE EXISTS ( SELECT 1 FROM entity
WHERE entityID = e3.entityID
AND code = v9.Value
AND type = 97 )
(Are you sure about the mixture of e3 and v9?)
Wrapping...
This forces the LEFT JOIN to become JOIN. And it gets rid of the then inner ORDER BY.
Then the Optimizer probably decides it is best to start with 68e9145e-43eb-4581-9727-4212be41bef5, which I call val AS v11:
JOIN val AS v11 ON (v11.entityId = e2.entityId
AND v11.attributeId = 1614
AND v11.Value = 'bar2')
If this is an EAV table, then all it does is verify that [, 1514] has value 'bar2'. This does not seem like a sensible test.
in addition to my former recommendation.
I would prefer EXPLAIN SELECT ....
EAV
Assuming val is a traditional EAV table, this would probably be much better:
CREATE TABLE `val` (
`attributeId` int(11) NOT NULL,
`entityId` int(11) NOT NULL,
`Value` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
PRIMARY KEY(`entityId`,`attributeId`),
KEY `IX_val_attributeId` (`attributeId`)
) ENGINE=InnoDB AUTO_INCREMENT=2325375 DEFAULT CHARSET=latin1
The two IDs have no practical use (unless I am missing something). If you are forced to use them because of a framework, that is unfortunate. Promoting (entityId, attributeId) to be the PK makes fetching value a little faster.
There is no useful way to include a LONGTEXT in any index, so some of my previous suggestions need changing.

Using SQL to select records with a single "true" bit field from several bit fields

If I have a table like this:
CREATE TABLE `Suppression` (
`SuppressionId` int(11) NOT NULL AUTO_INCREMENT,
`Address` varchar(255) DEFAULT NULL,
`BooleanOne` bit(1) NOT NULL DEFAULT '0',
`BooleanTwo` bit(1) NOT NULL DEFAULT '0',
`BooleanThree` bit(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`SuppressionId`)
)
Is there a set-based way in which I can select all records which have exactly one of the three bit fields = 1 without writing out the field names?
For example given:
1 10 Pretend Street 1 1 1
2 11 Pretend Street 0 0 0
3 12 Pretend Street 1 1 0
4 13 Pretend Street 0 1 0
5 14 Pretend Street 1 0 1
6 14 Pretend Street 1 0 0
I want to return records 4 and 6.

You could "add them up":
where cast(booleanone as unsigned) + cast(booleantwo as unsigned) + cast(booleanthree as unsigned) = 1
Or, use tuples:
where ( (booleanone, booleantwo, booleanthree) ) in ( (0b1, 0b0, 0b0), (0b0, 0b1, 0b0), (0b0, 0b0, 0b1) )
I'm not sure what you mean by "set-based".
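As a quick check of the "add them up" condition, here is a self-contained sketch that loads the six sample rows from the question and applies the filter. It assumes a scratch database where a table named Suppression does not already exist.

```sql
CREATE TABLE Suppression (
  SuppressionId INT NOT NULL AUTO_INCREMENT,
  Address       VARCHAR(255) DEFAULT NULL,
  BooleanOne    BIT(1) NOT NULL DEFAULT b'0',
  BooleanTwo    BIT(1) NOT NULL DEFAULT b'0',
  BooleanThree  BIT(1) NOT NULL DEFAULT b'0',
  PRIMARY KEY (SuppressionId)
);

-- Sample rows from the question; SuppressionId is assigned 1..6 in order.
INSERT INTO Suppression (Address, BooleanOne, BooleanTwo, BooleanThree) VALUES
  ('10 Pretend Street', 1, 1, 1),
  ('11 Pretend Street', 0, 0, 0),
  ('12 Pretend Street', 1, 1, 0),
  ('13 Pretend Street', 0, 1, 0),
  ('14 Pretend Street', 1, 0, 1),
  ('14 Pretend Street', 1, 0, 0);

-- Keep only rows whose three flags sum to exactly 1.
SELECT SuppressionId, Address
FROM Suppression
WHERE CAST(BooleanOne AS UNSIGNED)
    + CAST(BooleanTwo AS UNSIGNED)
    + CAST(BooleanThree AS UNSIGNED) = 1;
```

The per-row sums for the sample data are 3, 0, 2, 1, 2, 1, so only records 4 and 6 satisfy the condition, which matches what the question asks for.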

If your number of booleans can vary over time and you don't want to update your code, I suggest you make them lines and not columns.
For example:
CREATE TABLE `Suppression` (
`SuppressionId` int(11) NOT NULL AUTO_INCREMENT,
`Address` varchar(255) DEFAULT NULL,
`BooleanId` int(11) NOT NULL,
`BooleanValue` bit(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`SuppressionId`,`BooleanId`)
)
So with 1 query and a 'group by' you can check all values of your booleans, however numerous they are. Of course, this makes your tables bigger.
EDIT: I just came up with another idea: why not add a checksum column whose value is the sum of all your bits? You would update it on every write to the table, and then check only that column in your select.
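One possible way to get such a checksum column without maintaining it by hand is a stored generated column, assuming MySQL 5.7 or later and the original three-column Suppression table from the question. The names FlagCount and idx_flagcount are made up for this sketch.

```sql
-- MySQL recomputes FlagCount on every INSERT/UPDATE of the row.
ALTER TABLE Suppression
  ADD COLUMN FlagCount TINYINT UNSIGNED
    GENERATED ALWAYS AS (
      CAST(BooleanOne AS UNSIGNED)
      + CAST(BooleanTwo AS UNSIGNED)
      + CAST(BooleanThree AS UNSIGNED)
    ) STORED;

-- Optional: index the count so the lookup below does not scan the table.
ALTER TABLE Suppression ADD INDEX idx_flagcount (FlagCount);

-- Rows with exactly one flag set.
SELECT SuppressionId, Address
FROM Suppression
WHERE FlagCount = 1;
```

The column list inside the expression still has to be edited when a new flag column is added, so this only moves the maintenance from the queries to the table definition.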

If you
must use this denormalized way of representing these flags, and you
must be able to add new flag columns to your table in production, and you
cannot rewrite your queries by hand when you add columns,
then you must figure out how to write a program to write your queries.
You can use this query to retrieve a result set of boolean-valued columns, then you can use that result set in a program to write a query involving all those columns.
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
AND TABLE_NAME = 'Suppression'
AND COLUMN_NAME LIKE 'Boolean%'
AND DATA_TYPE = 'bit'
AND NUMERIC_PRECISION=1
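The "program" can even be MySQL itself. The following is a rough sketch that turns the column list into a summing expression with GROUP_CONCAT and runs it as a prepared statement. The user variables @flag_sum and @sql are names chosen for this example, and the sketch assumes at least one matching column exists (otherwise @flag_sum is NULL and PREPARE fails).

```sql
-- Build: CAST(`BooleanOne` AS UNSIGNED) + CAST(`BooleanTwo` AS UNSIGNED) + ...
SELECT GROUP_CONCAT(
         CONCAT('CAST(`', COLUMN_NAME, '` AS UNSIGNED)')
         ORDER BY ORDINAL_POSITION
         SEPARATOR ' + ')
  INTO @flag_sum
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'Suppression'
  AND COLUMN_NAME LIKE 'Boolean%'
  AND DATA_TYPE = 'bit'
  AND NUMERIC_PRECISION = 1;

-- Wrap the expression in the "exactly one flag set" query and execute it.
SET @sql = CONCAT(
  'SELECT SuppressionId, Address FROM Suppression WHERE ',
  @flag_sum, ' = 1');

PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
```

GROUP_CONCAT truncates its result at group_concat_max_len (1024 bytes by default), so a table with very many flag columns would need that session variable raised first.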
The approach you have proposed here will work exponentially more poorly as you add columns, unfortunately. Any time a software engineer says "exponential" it's time to run away screaming. Seriously.
A much more scalable approach is to build a one-to-many relationship between your Suppression rows and your flags. Add this table.
CREATE TABLE SuppressionFlags (
SuppressionId int(11) NOT NULL,
FlagName varchar(31) NOT NULL,
Value bit(1) NOT NULL DEFAULT '0',
PRIMARY KEY (SuppressionID, FlagName)
)
Then, when you want to insert a row with some flag variables, do this sequence of queries.
INSERT INTO Suppression (Address) VALUES ('some address');
SET @SuppressionId := LAST_INSERT_ID();
INSERT INTO SuppressionFlags (SuppressionId, FlagName, Value)
VALUES (@SuppressionId, 'BooleanOne', 1);
INSERT INTO SuppressionFlags (SuppressionId, FlagName, Value)
VALUES (@SuppressionId, 'BooleanTwo', 0);
INSERT INTO SuppressionFlags (SuppressionId, FlagName, Value)
VALUES (@SuppressionId, 'BooleanThree', 0);
This gives you one Suppression row with three flags set in the SuppressionFlags table. Note the use of @SuppressionId to set the Id values in the second table.
Then to find all rows with just one flag set, do this.
SELECT Suppression.SuppressionId, Suppression.Address
FROM Suppression
JOIN SuppressionFlags ON Suppression.SuppressionId = SuppressionFlags.SuppressionId
GROUP BY Suppression.SuppressionId, Suppression.Address
HAVING SUM(SuppressionFlags.Value) = 1
It gets a little trickier if you want more elaborate combinations. For example, if you want all rows with BooleanOne and either BooleanTwo or BooleanThree set, you need to do something like this.
SELECT S.SuppressionId, S.Address
FROM Suppression S
JOIN SuppressionFlags A ON S.SuppressionId=A.SuppressionId AND A.FlagName='BooleanOne'
JOIN SuppressionFlags B ON S.SuppressionId=B.SuppressionId AND B.FlagName='BooleanTwo'
JOIN SuppressionFlags C ON S.SuppressionId=C.SuppressionId AND C.FlagName='BooleanThree'
WHERE A.Value = 1 AND (B.Value = 1 OR C.Value = 1)
This common database pattern is called the attribute / value pattern. Because SQL doesn't easily let you use variables for column names (it doesn't really have reflection) this kind of way of naming your attributes is your best path to extensibility.
It's a little more SQL. But you can add as many new flags as you need, in production, without rewriting queries or getting a combinatorial explosion of flag-matching. And SQL is built to handle this kind of query.

Subtract from zero not working in query

I have this table:
CREATE TABLE `page` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`sortorder` SMALLINT(5) UNSIGNED NOT NULL,
PRIMARY KEY (`id`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
This is the data I have:
id sortorder
1 0
2 1
And I want to run this query:
select id from page where (sortorder = (select sortorder from page where id = 1) - 1)
(I'm trying to find the previous page, ie the one with the lower sortorder, if it exists. If none exists, I want an empty result set.)
The error I receive from mysql:
SQL Error (1690): BIGINT UNSIGNED value is out of range in '((select '0' from `page` where 1) - 1)'
And more specifically when I run:
select sortorder - 1 from page where id = 1
I get:
SQL Error (1690): BIGINT UNSIGNED value is out of range in '('0' - 1)'
What can I do to prevent this?

I usually use JOINs for this goal because they can be optimized better than the sub-queries. This query should produce the same result as yours but probably faster:
SELECT pp.*
FROM page cp # 'cp' from 'current page'
LEFT JOIN page pp # 'pp' from 'previous page'
ON pp.sortorder = cp.sortorder - 1
WHERE cp.id = 1
Unfortunately, it fails with the same out-of-range error, because the result of the subtraction (-1) cannot be represented as an UNSIGNED value.
It can be fixed by writing the JOIN condition as:
ON pp.sortorder + 1 = cp.sortorder
I moved the -1 to the other side of the equal sign and it turned to +1.
You can also fix your original query by using the same trick: moving -1 to the other side of the equal sign; this way it becomes +1 and there is no error any more:
select id
from page
where sortorder + 1 = (select sortorder from page where id = 1)
The problem with both queries now is that, because there is no index on column sortorder, MySQL is forced to check all the rows one by one until it finds one matching the WHERE (or ON) condition and this takes a lot of time and uses a lot of resources.
Fortunately, this can be fixed easily by adding an index on column sortorder:
ALTER TABLE page ADD INDEX(sortorder);
Now both queries can be used. The one using JOIN (and the ON condition with +1) is slightly faster.
The original query doesn't return any rows when the condition is not met. The JOIN query returns a row full of NULLs. It can be modified to return no rows by replacing LEFT JOIN with INNER JOIN.
You can circumvent the error altogether (and use any version of these queries) by removing the UNSIGNED attribute from column sortorder:
ALTER TABLE page
CHANGE COLUMN `sortorder` `sortorder` SMALLINT(5) NOT NULL;
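If changing the column definition is not an option, another possibility is to cast the value to SIGNED before subtracting, so the arithmetic is done on signed integers. This is a sketch against the page table from the question, not something taken from the answer above.

```sql
-- The subtraction now happens on a signed value, so 0 - 1 yields -1
-- instead of raising error 1690.
SELECT CAST(sortorder AS SIGNED) - 1 AS previous_sortorder
FROM page
WHERE id = 1;

-- Same idea applied to the original query: no page has sortorder -1,
-- so for id = 1 the result is simply an empty set.
SELECT id
FROM page
WHERE sortorder = (SELECT CAST(sortorder AS SIGNED)
                   FROM page
                   WHERE id = 1) - 1;
```

With the two sample rows, running the second query for id = 2 instead would compare against 1 - 1 = 0 and return the page with id 1.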

Try setting your SQL mode to 'NO_UNSIGNED_SUBTRACTION'. Note that assigning sql_mode this way replaces any other modes currently set for the session.
SET sql_mode = 'NO_UNSIGNED_SUBTRACTION'

MySQL: How to update unique key on duplicate unique key

Given the following table structure, how can I change the value of primary to 0 when a duplicate unique index is found?
CREATE TABLE `ncur` (
`user_id` INT NOT NULL,
`rank_id` INT NOT NULL,
`primary` TINYINT DEFAULT NULL,
PRIMARY KEY (`user_id`, `rank_id`),
UNIQUE (`user_id`, `primary`)
);
So, when I run a query like this:
UPDATE `ncur` SET `primary` = 1 WHERE `user_id` = 4 AND `rank_id` = 5;
When a constraint of user_id-primary is matched, I want it to set all primary values for user_id to NULL, and then complete the update query by updating the row it had found.

I am not as familiar with MySQL as I am with Oracle; however, I think this query should work for you:
UPDATE `ncur` a
SET `primary` = (
/* 1st Subquery */
SELECT 1 FROM (SELECT * FROM `ncur`) b
WHERE b.`user_id` = a.`user_id` AND b.`rank_id` = a.`rank_id`
AND a.`rank_id` = 5
UNION ALL
/* 2nd Subquery */
SELECT 0 FROM (SELECT * FROM `ncur`) b
WHERE b.`user_id` = a.`user_id` AND b.`rank_id` <> 5 AND a.`rank_id` <> 5
GROUP BY `user_id`
HAVING COUNT(*) = 1
)
WHERE `user_id` = 4
Justification:
The query updates all the records that have user_id = 4.
For each of such records, primary is set to a different value of 1, 0, or NULL, depending on the value of rank_id in this record as well as the information regarding how many other records with the same user_id exists in the table.
The subquery that returns the value for primary consists of two subqueries, at most one of which returns a value, depending on the circumstances.
1st Subquery: This subquery returns 1 for the record with rank_id = 5; otherwise it returns NULL.
2nd Subquery: This subquery returns 0 for the records with rank_id != 5 if there is only one such record in the table; otherwise it returns NULL.
Please note: if the query is run while there are no records with rank_id = 5, it will still update the other records according to the rules specified above. If this is not desired, the condition in the parent query must be changed from:
WHERE `user_id` = 4
to:
WHERE `user_id` = 4 AND
EXISTS(SELECT * FROM (SELECT * FROM `ncur`) b WHERE `rank_id` = 5)
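A simpler alternative, closer to how the question describes the desired behaviour, is to do it in two statements inside a transaction: first clear the existing primary flag for the user, then set it on the target row. This is a sketch that assumes a transactional engine such as InnoDB; the literal values 4 and 5 are the ones from the question.

```sql
START TRANSACTION;

-- Step 1: clear any existing primary flag for this user.
-- Multiple NULLs are allowed under UNIQUE (user_id, primary).
UPDATE `ncur`
SET `primary` = NULL
WHERE `user_id` = 4
  AND `primary` IS NOT NULL;

-- Step 2: mark the chosen rank as primary; no duplicate key can occur now.
UPDATE `ncur`
SET `primary` = 1
WHERE `user_id` = 4
  AND `rank_id` = 5;

COMMIT;
```

Because both updates commit together, other sessions never observe the intermediate state in which the user has no primary rank.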

How to add and populate a sequence column to a link table

Assuming I have a table like the one below:
create table filetype_filestatus (
id integer(11) not null auto_increment,
file_type_id integer(11) not null,
file_status_id integer(11) not null,
primary key (id)
)
I want to add a sequence column like so:
alter table filetype_filestatus add column sequence integer(11) not null;
alter table filetype_filestatus add unique key idx1 (file_type_id, file_status_id, sequence);
Now I want to add the column, which is straightforward, and populate it with some default values that satisfy the unique key.
The sequence column is to allow the user to arbitrarily order the display of file_status for a particular file_type. I'm not too concerned by the initial order since that can be revised in the application.
Ideally I would end up with something like:
FileType FileStatus Sequence
1 1 1
1 2 2
1 3 3
2 2 1
2 2 2
The best I can think of is something like:
update filetype_filestatus set sequence = file_type_id * 1000 + file_status_id;
Are there better approaches?

Hmm, I believe this should work:
UPDATE filetype_filestatus as a
SET sequence = (SELECT COALESCE(MAX(b.sequence), 0)
FROM filetype_filestatus as b
WHERE b.file_type_id = a.file_type_id) + 1
WHERE sequence = 0
I'd recommend adding the new column to the table by running the alter table statement (so every row gets the default of 0), then running the update statement, and only then adding the constraint (you have to do it in that order anyway). Anything that gets touched is updated to a sequence greater than 0, so this can safely be run multiple times, too.
EDIT:
As @Dems has pointed out, the subquery is being run before the update, and so the above doesn't actually work for this purpose. It does work on single-line inserts (which doesn't help at all here).
EDIT:
Gah, you have an id column, this works just fine (and yes, I tested this one first).
UPDATE filetype_filestatus as a
SET sequence = (SELECT COALESCE(COUNT(*), 0)
FROM filetype_filestatus as b
WHERE b.file_type_id = a.file_type_id
AND b.id < a.id) + 1
WHERE sequence = 0
Don't know about the performance implications, though.

If all you need are "some values that conform to idx1", why not just copy the id field? It is, after all, unique...
UPDATE
filetype_filestatus
SET
sequence = id;
EDIT
How to get sequential values, based on the OP's changes to the question being asked.
ROW_NUMBER() is not available in MySQL before version 8.0, and it is also my understanding that you can't use the table being updated in the source query as well.
create temporary table temp_filetype_filestatus (
id integer(11) not null auto_increment,
file_type_id integer(11) not null,
file_status_id integer(11) not null,
PRIMARY KEY (file_type_id, file_status_id),
KEY (id) -- an AUTO_INCREMENT column must be part of a key
);
INSERT INTO temp_filetype_filestatus (
file_type_id,
file_status_id
)
SELECT
file_type_id,
file_status_id
FROM
filetype_filestatus
ORDER BY
file_type_id,
file_status_id;
-- Update Option 1
UPDATE
filetype_filestatus
SET
sequence
=
(SELECT id FROM temp_filetype_filestatus
WHERE file_type_id = filetype_filestatus.file_type_id
AND file_status_id = filetype_filestatus.file_status_id)
-
(SELECT id FROM temp_filetype_filestatus
WHERE file_type_id = filetype_filestatus.file_type_id
ORDER BY file_status_id ASC LIMIT 1)
+
1;
-- Update Option 2
UPDATE
filetype_filestatus
SET
sequence
=
(SELECT COUNT(*) FROM temp_filetype_filestatus
WHERE file_type_id = filetype_filestatus.file_type_id
AND file_status_id <= filetype_filestatus.file_status_id);
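On MySQL 8.0 or later, where window functions are available, the temporary table can be avoided. The following sketch numbers the rows within each file_type_id by id and writes the result back through a join on a derived table. It assumes the sequence column has already been added and the unique key has not yet been created.

```sql
-- Number rows 1, 2, 3, ... within each file_type_id, ordered by id,
-- then copy that number into the new sequence column.
UPDATE filetype_filestatus AS f
JOIN (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY file_type_id
                              ORDER BY id) AS rn
    FROM filetype_filestatus
) AS r ON r.id = f.id
SET f.sequence = r.rn;

-- The unique key from the question can be added afterwards.
ALTER TABLE filetype_filestatus
  ADD UNIQUE KEY idx1 (file_type_id, file_status_id, sequence);
```

Because the numbering is per file_type_id rather than per (file_type_id, file_status_id), rows that share both values still receive distinct sequence numbers and so satisfy idx1.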
