Can I control which JOINed row gets used in an UPDATE? - mysql

Once upon a time, I had a table like this:
CREATE TABLE `Events` (
`EvtId` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`AlarmId` INT UNSIGNED,
-- Other fields omitted for brevity
PRIMARY KEY (`EvtId`)
);
AlarmId was permitted to be NULL.
Now, because I want to expand from zero-or-one alarm per event to zero-or-more alarms per event, in a software update I'm changing instances of my database to have this instead:
CREATE TABLE `Events` (
`EvtId` INT UNSIGNED NOT NULL AUTO_INCREMENT,
-- Other fields omitted for brevity
PRIMARY KEY (`EvtId`)
);
CREATE TABLE `EventAlarms` (
`EvtId` INT UNSIGNED NOT NULL,
`AlarmId` INT UNSIGNED NOT NULL,
PRIMARY KEY (`EvtId`, `AlarmId`),
CONSTRAINT `fk_evt` FOREIGN KEY (`EvtId`) REFERENCES `Events` (`EvtId`)
ON DELETE CASCADE ON UPDATE CASCADE
);
So far so good.
The data is easy to migrate, too:
INSERT INTO `EventAlarms`
SELECT `EvtId`, `AlarmId` FROM `Events` WHERE `AlarmId` IS NOT NULL;
ALTER TABLE `Events` DROP COLUMN `AlarmId`;
Thing is, my system requires that a downgrade also be possible. I accept that downgrades will sometimes be lossy in terms of data, and that's okay. However, they do need to work where possible, and result in the older database structure while making a best effort to keep as much original data as is reasonably possible.
In this case, that means going from zero-or-more alarms per event, to zero-or-one alarm per event. I could do it like this:
ALTER TABLE `Events` ADD COLUMN `AlarmId` INT UNSIGNED;
UPDATE `Events`
LEFT JOIN `EventAlarms` USING(`EvtId`)
SET `Events`.`AlarmId` = `EventAlarms`.`AlarmId`;
DROP TABLE `EventAlarms`;
… which is kind of fine, since I don't really care which one gets kept (it's best-effort, remember). However, as warned, this is not good for replication as the result may be unpredictable:
> SHOW WARNINGS;
Unsafe statement written to the binary log using statement format since
BINLOG_FORMAT = STATEMENT. Statements writing to a table with an auto-
increment column after selecting from another table are unsafe because the
order in which rows are retrieved determines what (if any) rows will be
written. This order cannot be predicted and may differ on master and the
slave.
Is there a way to somehow "order" or "limit" the join in the update, or shall I just skip this whole enterprise and stop trying to be clever? If the latter, how can I leave the downgraded AlarmId as NULL iff there were multiple rows in the new table between which we cannot safely distinguish? I do want to migrate the AlarmId if there is only one.
As a downgrade is a "one-time" maintenance operation, it doesn't have to be exactly real-time, but speed would be nice. Both tables could potentially have thousands of rows.
(MariaDB 5.5.56 on CentOS 7, but must also work on whatever ships with CentOS 6.)

First, we can perform a bit of analysis, with a self-join:
SELECT `A`.`EvtId`, COUNT(`B`.`EvtId`) AS `N`
FROM `EventAlarms` AS `A`
LEFT JOIN `EventAlarms` AS `B` ON (`A`.`EvtId` = `B`.`EvtId`)
GROUP BY `A`.`EvtId`
The result will look something like this:
EvtId N
--------------
370 1
371 1
372 4
379 1
380 1
382 16
383 1
384 1
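Note in passing that the self-join squares the per-event count: an event with k alarms produces k x k joined rows, so N is k² rather than k. That is harmless for the n = 1 filter used below (k = 1 exactly when k² = 1), but if you ever want the true counts, a plain single-table aggregate suffices:
SELECT `EvtId`, COUNT(*) AS `N`
FROM `EventAlarms`
GROUP BY `EvtId`;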
Now you can, if you like, drop all the rows representing events that map to more than one alarm (which you suggest as a fallback solution; I think this makes sense, though you could modify the below to leave one of them in place if you really wanted).
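If you did want to delete the ambiguous rows outright, here is a sketch; the extra derived-table wrapper is needed because MySQL will not let a DELETE select from the very table it is deleting from:
DELETE FROM `EventAlarms`
WHERE `EvtId` IN (
    SELECT `EvtId` FROM (
        SELECT `EvtId`
        FROM `EventAlarms`
        GROUP BY `EvtId`
        HAVING COUNT(*) > 1
    ) AS `dup`
);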
Instead of actually DELETEing anything, though, it's easier to introduce a new table, populated using the self-joining query shown above:
CREATE TEMPORARY TABLE `_migrate` (
`EvtId` INT UNSIGNED,
`n` INT UNSIGNED,
PRIMARY KEY (`EvtId`),
KEY `idx_n` (`n`)
);
INSERT INTO `_migrate`
SELECT `A`.`EvtId`, COUNT(`B`.`EvtId`) AS `n`
FROM `EventAlarms` AS `A`
LEFT JOIN `EventAlarms` AS `B` ON (`A`.`EvtId` = `B`.`EvtId`)
GROUP BY `A`.`EvtId`;
Then your update becomes:
UPDATE `Events`
LEFT JOIN `_migrate` ON (`Events`.`EvtId` = `_migrate`.`EvtId` AND `_migrate`.`n` = 1)
LEFT JOIN `EventAlarms` ON (`_migrate`.`EvtId` = `EventAlarms`.`EvtId`)
SET `Events`.`AlarmId` = `EventAlarms`.`AlarmId`
WHERE `EventAlarms`.`AlarmId` IS NOT NULL
And, finally, clean up after yourself:
DROP TABLE `_migrate`;
DROP TABLE `EventAlarms`;
MySQL still kicks out the same warning as before, but since we know that at most one value will be pulled from the source tables, we can basically just ignore it.
It should even be reasonably efficient, as we can tell from the equivalent EXPLAIN SELECT:
EXPLAIN SELECT `Events`.`EvtId` FROM `Events`
LEFT JOIN `_migrate` ON (`Events`.`EvtId` = `_migrate`.`EvtId` AND `_migrate`.`n` = 1)
LEFT JOIN `EventAlarms` ON (`_migrate`.`EvtId` = `EventAlarms`.`EvtId`)
WHERE `EventAlarms`.`AlarmId` IS NOT NULL
id select_type table type possible_keys key key_len ref rows Extra
---------------------------------------------------------------------------------------------------------------------
1 SIMPLE _migrate ref PRIMARY,idx_n idx_n 5 const 6 Using index
1 SIMPLE EventAlarms ref PRIMARY,fk_AlarmId PRIMARY 8 db._migrate.EvtId 1 Using where; Using index
1 SIMPLE Events eq_ref PRIMARY PRIMARY 8 db._migrate.EvtId 1 Using where; Using index

Use a subquery and user variables to select just one EventAlarms row per event.
In your update, instead of EventAlarms, use:
( SELECT `EvtId`, `AlarmId`
  FROM ( SELECT `EvtId`, `AlarmId`,
                @rn := IF( @EvtId = `EvtId`,
                           @rn + 1,
                           IF( @EvtId := `EvtId`, 1, 1 )
                         ) AS rn
         FROM `EventAlarms`
         CROSS JOIN ( SELECT @EvtId := 0, @rn := 0 ) AS vars
         ORDER BY `EvtId`, `AlarmId`
       ) AS t
  WHERE rn = 1
) AS SingleEventAlarms
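Spelled out against the schema in the question, the downgrade might then look like this. A sketch only: relying on user variables being updated in ORDER BY order inside a derived table is undocumented behaviour, although it has historically worked on the 5.x versions mentioned.
ALTER TABLE `Events` ADD COLUMN `AlarmId` INT UNSIGNED;
UPDATE `Events`
LEFT JOIN (
    SELECT `EvtId`, `AlarmId`
    FROM ( SELECT `EvtId`, `AlarmId`,
                  @rn := IF( @EvtId = `EvtId`,
                             @rn + 1,
                             IF( @EvtId := `EvtId`, 1, 1 )
                           ) AS rn
           FROM `EventAlarms`
           CROSS JOIN ( SELECT @EvtId := 0, @rn := 0 ) AS vars
           ORDER BY `EvtId`, `AlarmId`   -- deterministic pick: lowest AlarmId per event
         ) AS t
    WHERE rn = 1
) AS SingleEventAlarms USING (`EvtId`)
SET `Events`.`AlarmId` = `SingleEventAlarms`.`AlarmId`;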

Related

MySQL - Select only the rows that have not been selected in the last read

Problem description
I have a table, say trans_flow:
CREATE TABLE trans_flow (
id BIGINT(20) AUTO_INCREMENT PRIMARY KEY,
card_no VARCHAR(50) DEFAULT NULL,
money INT(20) DEFAULT NULL
)
New data is inserted into this table constantly.
Now, I want to fetch only the rows that have not been fetched by the previous query. For example, at 5:00, id ranges from 1 to 100, and I read rows 80 - 100 and do some processing. Then, at 5:01, the maximum id has reached 150, and I want to get exactly rows 101 - 150; otherwise, the processing program would read in old, already-processed data. Note that such queries are issued continuously. From a certain perspective, I want to implement a "streaming process" on top of MySQL.
A tentative idea
I have a simple but maybe ugly solution. I create an auxiliary table query_cursor which stores the beginning and end ids of one query:
CREATE TABLE query_cursor (
task_id VARCHAR(20) PRIMARY KEY COMMENT 'Specify which task is reading this table',
first_row_id BIGINT(20) DEFAULT NULL,
last_row_id BIGINT(20) DEFAULT NULL
)
During each query, I first update the query range stored in this table by:
UPDATE query_cursor
SET first_row_id = last_row_id + 1,
    last_row_id = (SELECT MAX(id) FROM trans_flow)
WHERE task_id = 'xxx'
And then, doing query on table trans_flow using stored cursors:
SELECT * FROM trans_flow
WHERE id BETWEEN (SELECT first_row_id FROM query_cursor WHERE task_id = 'xxx')
AND (SELECT last_row_id FROM query_cursor WHERE task_id = 'xxx')
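Two side notes on this. The UPDATE relies on MySQL evaluating SET assignments left to right, so first_row_id is computed from the old last_row_id (a subquery selecting from query_cursor itself would be rejected with error 1093). And if inserts can land between the two statements, it helps to run each round inside a transaction so the bookkeeping and the read stay consistent with each other. A minimal sketch, assuming the 'xxx' row already exists:
START TRANSACTION;
UPDATE query_cursor
SET first_row_id = last_row_id + 1,
    last_row_id = (SELECT MAX(id) FROM trans_flow)
WHERE task_id = 'xxx';
SELECT f.*
FROM trans_flow AS f
JOIN query_cursor AS c ON c.task_id = 'xxx'
WHERE f.id BETWEEN c.first_row_id AND c.last_row_id;
COMMIT;
-- Caveat: a row that grabbed a lower auto-increment id but commits after
-- MAX(id) was read can still be missed; a transaction alone does not
-- close that gap.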
Question for help
Is there a simpler and more elegant implementation that can achieve the same effect (the best if no need to use an auxiliary table)? The version of MySQL is 5.7.

Serious MySQL query performance issues after adding condition

My problem is that I have a MySQL query that runs really fast (0.3 seconds) even though it has a large number of left joins and a few conditions on the joined columns, but when I add one more condition the query takes upwards of 180 seconds! I understand that the extra condition means the execution plan has to pull all potential records first and then apply the condition in a loop. What's weird to me is that the fast query without the additional condition only returns 16 rows, and even just wrapping the query and applying the condition in an outer query takes a crazy amount of time, when you would think it would only add one more pass over 16 rows...
If it matters, this is using Amazon Aurora Serverless, which should align with MySQL 5.7.
Here's what the query looks like. You can see the additional condition is commented out. (The general table structure of the DB itself cannot change currently so please refrain from suggesting a full database restructuring)
select
e1.entityId as _id,
v1.Value,
v2.Value,
v3.Value,
v4.Value,
v5.Value,
v6.Value,
v7.Value,
v8.Value,
v9.Value,
v10.Value,
v11.Value,
v12.Value
from entity e1
left join val as v1 on (v1.entityId = e1.entityId and v1.attributeId = 1189)
left join val as v2 on (v2.entityId = e1.entityId and v2.attributeId = 1190)
left join entity as e2 on e2.entityId = (select entityId from entity where code = v1.Value and type = 88 limit 1)
left join val as v3 on (v3.entityId = e2.entityId and v3.attributeId = 507)
left join val as v4 on (v4.entityId = e2.entityId and v4.attributeId = 522)
left join val as v5 on (v5.entityId = e2.entityId and v5.attributeId = 558)
left join val as v6 on (v6.entityId = e2.entityId and v6.attributeId = 516)
left join val as v7 on (v7.entityId = e2.entityId and v7.attributeId = 518)
left join val as v8 on (v8.entityId = e2.entityId and v8.attributeId = 1384)
left join val as v9 on (v9.entityId = e2.entityId and v9.attributeId = 659)
left join val as v10 on (v10.entityId = e2.entityId and v10.attributeId = 519)
left join val as v11 on (v11.entityId = e2.entityId and v11.attributeId = 1614)
left join entity as e3 on e3.entityId = (select entityId from entity where code = v9.Value and type = 97 limit 1)
left join val as v12 on (v12.entityId = e3.entityId and v12.attributeId = 661)
where e1.type = 154
and v2.Value = 'foo'
and v5.Value = 'bar'
and v10.Value = 'foo2'
-- and v11.Value = 'bar2'
order by v3.Value asc;
And wrapping that in something like this still takes forever...
select *
from (
<query from above>
) sub
where sub.v11 = 'bar2';
query execution plan with the condition commented out (fast)
query execution plan with the condition included (slow)
I'm going to fiddle around with indexing on the "entity" tables to improve the execution plan regardless which will likely help... but can someone explain what's going on here and what I should be looking at in the execution plan that would indicate such bad performance? And why wrapping the fast query in a subquery so that the outer query should only loop over 16 rows takes a really long time?
EDIT: I noticed in the slow query that the far left execution is using a non-unique key lookup (which is on val.entityId) for "68e9145e-43eb-4581-9727-4212be41bef5" (v11) instead of the unique key lookup the rest are using (which is a composite index on entityId,attributeId). I presume this might be part of the issue, but why can't it use the composite index there like it does for the rest?
PS: For now since we know the result set will be small, we are implementing that last condition server side with a filter on the result set in our nodeJS server.
Here's the results of "SHOW CREATE TABLE entity" and "SHOW CREATE TABLE val"
CREATE TABLE `entity` (
`entityId` int(11) NOT NULL AUTO_INCREMENT,
`UID` varchar(64) NOT NULL,
`type` int(11) NOT NULL,
`code` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
PRIMARY KEY (`entityId`),
UNIQUE KEY `UID` (`UID`),
KEY `IX_Entity_Type` (`type`),
CONSTRAINT `FK_Entities_Types` FOREIGN KEY (`type`) REFERENCES `entityTypes` (`typeId`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=296138 DEFAULT CHARSET=latin1
CREATE TABLE `val` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`UID` varchar(64) NOT NULL,
`attributeId` int(11) NOT NULL,
`entityId` int(11) NOT NULL,
`Value` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
PRIMARY KEY (`id`),
UNIQUE KEY `UID` (`UID`),
UNIQUE KEY `idx_val_entityId_attributeId` (`entityId`,`attributeId`),
KEY `IX_val_attributeId` (`attributeId`),
KEY `IX_val_entityId` (`entityId`)
) ENGINE=InnoDB AUTO_INCREMENT=2325375 DEFAULT CHARSET=latin1
Looking at the SHOW CREATE TABLE output above, I would hope to see these composite indexes:
`val`: (entityId, attributeId) -- order is not critical
Alas, because code is LONGTEXT, this is not possible for entity: INDEX(type, code, entityId). Hence this will not be very efficient:
SELECT entityId
from entity
where code = v9.Value
and type = 97
limit 1
I see LIMIT without an ORDER BY -- do you care which value you get?
Probably that would be better written as
WHERE EXISTS ( SELECT 1 FROM entity
WHERE entityID = e3.entityID
AND code = v9.Value
AND type = 97 )
(Are you sure about the mixture of e3 and v9?)
Wrapping...
This forces the LEFT JOIN to become a JOIN, and it gets rid of the now-inner ORDER BY.
Then the Optimizer probably decides it is best to start with 68e9145e-43eb-4581-9727-4212be41bef5, which I call val AS v11:
JOIN val AS v11 ON (v11.entityId = e2.entityId
                AND v11.attributeId = 1614
                AND v11.Value = 'bar2')
If this is an EAV table, then all it does is verify that attribute 1614 of that entity has value 'bar2'. This does not seem like a sensible test. (This is in addition to my former recommendation.)
I would prefer EXPLAIN SELECT ....
EAV
Assuming val is a traditional EAV table, this would probably be much better:
CREATE TABLE `val` (
`attributeId` int(11) NOT NULL,
`entityId` int(11) NOT NULL,
`Value` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
PRIMARY KEY (`entityId`,`attributeId`),
KEY `IX_val_attributeId` (`attributeId`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
The two IDs have no practical use (unless I am missing something). If you are forced to use them because of a framework, that is unfortunate. Promoting (entityId, attributeId) to be the PK makes fetching value a little faster.
There is no useful way to include a LONGTEXT in any index, so some of my previous suggestions need changing.
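Not part of the suggestions above, but worth noting: a standard workaround for an unindexable LONGTEXT is to index a generated hash of it, the same trick the next question on this page uses for its featureMappings table. A sketch with illustrative names, requiring MySQL 5.7+ for generated columns:
ALTER TABLE entity
  ADD COLUMN codeHash CHAR(32)
      GENERATED ALWAYS AS (MD5(code)) STORED,  -- 5.7 can also index VIRTUAL columns
  ADD INDEX IX_entity_type_codeHash (type, codeHash);
-- The correlated lookup can then be index-driven, with an exact recheck
-- to guard against hash collisions:
--   WHERE type = 88 AND codeHash = MD5(v1.Value) AND code = v1.Value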

Efficient way of getting a single example of a row when using hashes

My database has a staging table with the following structure:
CREATE TABLE featureMappings (
id bigint(20) NOT NULL AUTO_INCREMENT,
visitId bigint(20) NOT NULL,
featureId bigint(20) NOT NULL,
textValue text DEFAULT NULL,
hashTextValue char(32) GENERATED ALWAYS AS (MD5(textValue)) VIRTUAL,
PRIMARY KEY (id));
ALTER TABLE featureMappings
ADD INDEX fsHashTextValue (featureId, hashTextValue)
In a typical run this table has approximately 40 - 100 million rows. There are a lot of duplicate text values so I am using the hashTextValue key to be able to index on this column.
The following query takes about 25 seconds to run:
CREATE TEMPORARY TABLE temp AS
SELECT
featureId,
hashTextValue
FROM
featureMappings
GROUP BY featureId, hashTextValue
Question
I'd like to extract the value in the textValue column alongside the featureId and hashTextValue columns.
I have tried two approaches. Both of these dramatically increased the query time, so I'm looking for a better solution.
Slow Option 1 - Adding textValue to the query
When running the below change to the query, the time to process went from 25 seconds to about 10 minutes. I've tried to google how textValue is retrieved when not using an aggregate function, but could not find a clear answer.
CREATE TEMPORARY TABLE temp AS
SELECT
featureId,
hashTextValue,
textValue # I also tried MIN(textValue)
FROM
featureMappings
GROUP BY featureId, hashTextValue
Complicated Option 2: Iterative Update
My preferred approach is to iterate over the unique combinations of the first query and then run a loop over the following queries:
SELECT featureId, hashTextValue INTO @fid, @htv
FROM temp
WHERE textValue IS NULL AND hashTextValue IS NOT NULL
LIMIT 1;
SELECT textValue
INTO @textValue
FROM featureMappings
WHERE featureId = @fid AND hashTextValue = @htv
LIMIT 1;
UPDATE temp
SET textValue = @textValue
WHERE featureId = @fid AND hashTextValue = @htv;
Server Configuration
This is being run on AWS RDS Aurora based on MySQL 5.7. The server has limited (2 GB) memory and usually has less freeable memory than the size of the index on the table.
Plan A: Dedup as you load. This is trivially done by making the PK of featureMappings be PRIMARY KEY(featureId, hashTextValue) and using INSERT IGNORE when loading the staging table.
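A sketch of Plan A's load step, assuming rows currently arrive from some source; the table name incoming below is hypothetical. Note that for (featureId, hashTextValue) to be the primary key, hashTextValue would have to be a STORED generated column (MySQL 5.7 does not allow VIRTUAL columns in a primary key) and textValue could no longer be nullable, since MD5(NULL) is NULL:
-- Duplicates on (featureId, hashTextValue) are silently skipped
-- rather than inserted.
INSERT IGNORE INTO featureMappings (visitId, featureId, textValue)
SELECT visitId, featureId, textValue
FROM incoming;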
Plan B: (Assuming there is something preventing Plan A) Change the table to have these indexes:
PRIMARY KEY (featureId, hashTextValue, id),
INDEX(id)
This still has the dups, but I am unclear on what needs to happen next.
Further...
SELECT featureId, hashTextValue INTO @fid, @htv
FROM temp
WHERE textValue IS NULL AND hashTextValue IS NOT NULL
LIMIT 1;
This has the problem of getting slower and slower as you eat through the items that match. It would be better to add an explicit PRIMARY KEY and walk through temp. In fact, it will be an order of magnitude faster (if temp is large). Let's say id is the PK; then:
SELECT @id := id, @fid := featureId, @htv := hashTextValue
FROM temp
WHERE textValue IS NULL AND hashTextValue IS NOT NULL
AND id > @id -- this picks up 'where you left off'
LIMIT 1;
(Initialize with SET @id := 0;)
Now that you have the id, the UPDATE becomes simpler and faster.
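For completeness, a sketch of the two follow-up statements under that scheme; since @id now identifies the temp row directly, the final UPDATE becomes a primary-key point lookup:
SELECT textValue INTO @textValue
FROM featureMappings
WHERE featureId = @fid AND hashTextValue = @htv
LIMIT 1;
UPDATE temp
SET textValue = @textValue
WHERE id = @id;  -- PK lookup; no need to re-match (featureId, hashTextValue)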

Subtract from zero not working in query

I have this table:
CREATE TABLE `page` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`sortorder` SMALLINT(5) UNSIGNED NOT NULL,
PRIMARY KEY (`id`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
This is the data I have:
id sortorder
1 0
2 1
And I want to run this query:
select id from page where (sortorder = (select sortorder from page where id = 1) - 1)
(I'm trying to find the previous page, ie the one with the lower sortorder, if it exists. If none exists, I want an empty result set.)
The error I receive from mysql:
SQL Error (1690): BIGINT UNSIGNED value is out of range in '((select '0' from `page` where 1) - 1)'
And more specifically when I run:
select sortorder - 1 from page where id = 1
I get:
SQL Error (1690): BIGINT UNSIGNED value is out of range in '('0' - 1)'
What can I do to prevent this?
I usually use JOINs for this goal because they can be optimized better than subqueries. This query should produce the same result as yours, but probably faster:
SELECT pp.*
FROM page cp # 'cp' from 'current page'
LEFT JOIN page pp # 'pp' from 'previous page'
ON pp.sortorder = cp.sortorder - 1
WHERE cp.id = 1
Unfortunately it fails with the same error message, about the result of the subtraction being out of UNSIGNED range.
It can be fixed by writing the JOIN condition as:
ON pp.sortorder + 1 = cp.sortorder
I moved the -1 to the other side of the equal sign and it turned to +1.
You can also fix your original query by using the same trick: moving -1 to the other side of the equal sign; this way it becomes +1 and there is no error any more:
select id
from page
where sortorder + 1 = (select sortorder from page where id = 1)
The problem with both queries now is that, because there is no index on column sortorder, MySQL is forced to check all the rows one by one until it finds one matching the WHERE (or ON) condition and this takes a lot of time and uses a lot of resources.
Fortunately, this can be fixed easily by adding an index on column sortorder:
ALTER TABLE page ADD INDEX(sortorder);
Now both queries can be used. The one using JOIN (and the ON condition with +1) is slightly faster.
The original query doesn't return any rows when the condition is not met. The JOIN query returns a row full of NULLs. It can be modified to return no rows by replacing LEFT JOIN with INNER JOIN.
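For reference, that no-rows variant is simply the earlier query with the join type changed:
SELECT pp.*
FROM page cp
INNER JOIN page pp
ON pp.sortorder + 1 = cp.sortorder
WHERE cp.id = 1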
You can circumvent the error altogether (and use any version of these queries) by removing the UNSIGNED attribute from column sortorder:
ALTER TABLE page
CHANGE COLUMN `sortorder` `sortorder` SMALLINT(5) NOT NULL;
Try setting your SQL mode to 'NO_UNSIGNED_SUBTRACTION':
SET sql_mode = 'NO_UNSIGNED_SUBTRACTION'
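This mode makes subtraction involving UNSIGNED operands produce a signed result instead of raising error 1690. A quick session-scoped check (note that assigning sql_mode like this replaces any other mode flags you may have set):
SET sql_mode = 'NO_UNSIGNED_SUBTRACTION';
SELECT CAST(0 AS UNSIGNED) - 1;
-- returns -1; without the mode this raises
-- SQL Error (1690): BIGINT UNSIGNED value is out of range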

Randomize Primary Keys Based on Existing Values

I have got an older database for which (for some really questionable and obscure reasons that I'd rather not go into here) I want to randomize or shuffle the primary keys.
Right now I have auto-increment fields in the MySQL database tables.
There are not many relations, and those that exist are not defined as foreign keys. The relationships do not need to be preserved.
All I'm looking for is to take the current values of the primary keys and replace them with random values drawn from the same set, like:
ID := new(ID)
where the new() function returns a value from the set of all old IDs, as a 1:1 mapping. E.g.
2 := 3
3 := 2
But not
2 := 3
3 := 3
Is there a way to change the data in the database with (ideally) a single SQL query per table?
Edit: I do not have really strict requirements. Consider to have exclusive access to the database if it helps, including changing constraints on the primary key back and forth, e.g. alter the table, do the operation, alter the table to previous schema. It is also possible to add another column for the new (or old) PK value.
Just a sketch of the procedure. Create two helper tables:
CREATE TABLE temp_old
( ai INT NOT NULL AUTO_INCREMENT
, id INT NOT NULL
, PRIMARY KEY (ai)
, INDEX old_idx (id, ai)
) ENGINE = InnoDB ;
CREATE TABLE temp_new
( ai INT NOT NULL AUTO_INCREMENT
, id INT NOT NULL
, PRIMARY KEY (ai)
, INDEX new_idx (id, ai)
) ENGINE = InnoDB ;
Copy the id values in different orders to the two tables (randomly in the 2nd table):
INSERT INTO temp_old
(id)
SELECT id
FROM tableX
ORDER BY id ;
INSERT INTO temp_new
(id)
SELECT id
FROM tableX
ORDER BY RAND() ;
Then we drop the primary key:
ALTER TABLE tableX
DROP PRIMARY KEY ;
to run the actual UPDATE statement:
UPDATE tableX AS t
JOIN temp_old AS o
ON o.id = t.id
JOIN temp_new AS n
ON n.ai = o.ai
SET t.id = n.id ;
Then recreate the primary key and drop the temp tables:
ALTER TABLE tableX
ADD PRIMARY KEY (id) ;
DROP TABLE temp_old ;
DROP TABLE temp_new ;
Tested in SQL-Fiddle
Here's a technique that creates a list of your ids in table order, along with a sequential number starting from 1; it also creates a list of your ids in a random order, along with a sequential number starting from 1. It then updates the ids by matching the sequential numbers.
There are issues with the performance of ORDER BY RAND() (and with its randomness).
If your keys are already sequential starting from 1, you can simplify this.
Update
Test as t
Inner Join (
Select
@rownum2 := @rownum2 + 1 as rank,
t2.id
From
Test t2,
(Select @rownum2 := 0) a1
) as o on t.id = o.id
Inner Join (
Select
@rownum := @rownum + 1 as rank,
t3.id
From
(Select id from Test order by Rand()) t3,
(Select @rownum := 0) a2
) as n on o.rank = n.rank
Set
t.id = n.id
http://sqlfiddle.com/#!2/3f354/1
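As for the simplification when the keys are already sequential starting from 1: the random row number can then be matched directly against id itself, so the table-order derived table drops out entirely. A sketch under that assumption (the same caveats about ORDER BY RAND() and transient duplicate-key errors on the primary key apply):
Update
Test as t
Inner Join (
Select
@rownum := @rownum + 1 as rank,
t3.id
From
(Select id from Test order by Rand()) t3,
(Select @rownum := 0) a2
) as n on t.id = n.rank
Set
t.id = n.id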
You could create a stored procedure that builds a temporary table containing all of the ids, then loops over each record, replacing the id with an id from the temp table and removing that id from the temp table as it goes. I don't believe there is a way to do what you are talking about in a single query, though.