How to optimize an UPDATE and JOIN query on practically identical tables? - mysql

I am trying to update one table based on another in the most efficient way.
Here is the table DDL of what I am trying to update
Table1
CREATE TABLE `customersPrimary` (
`id` int NOT NULL AUTO_INCREMENT,
`groupID` int NOT NULL,
`IDInGroup` int NOT NULL,
`name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `groupID-IDInGroup` (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Table2
CREATE TABLE `customersSecondary` (
`groupID` int NOT NULL,
`IDInGroup` int NOT NULL,
`name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Both tables are practically identical, but customersSecondary is by design a staging table for the other. The big difference is the primary keys: table 1 has an auto-incrementing primary key, while table 2 has a composite primary key.
In both tables the combination of groupID and IDInGroup are unique.
Here is the query I want to optimize
UPDATE customersPrimary
INNER JOIN customersSecondary ON
(customersPrimary.groupID = customersSecondary.groupID
AND customersPrimary.IDInGroup = customersSecondary.IDInGroup)
SET
customersPrimary.name = customersSecondary.name,
customersPrimary.address = customersSecondary.address
This query works but scans EVERY row in customersSecondary.
Adding
WHERE customersPrimary.groupID = (groupID)
Cuts it down significantly, to just the rows in customersSecondary with that groupID. But this is still often far more rows than are actually updated, since a group can be large. I think the WHERE clause needs improvement.
I can control table structure and add indexes. I will have to keep both tables.
Any suggestions would be helpful.

Your existing query requires a full table scan because you are saying update everything on the left based on the value on the right. Presumably the optimiser is choosing customersSecondary because it has fewer rows, or at least it thinks it has.
Is the full table scan causing you problems? Locking? Too slow? How long does it take? How frequently are the tables synced? How many records are there in each table? What is the rate of change in each of the tables?
You could add separate indices on name and address but that will take a good chunk of space. The better option is going to be to add an indexed updatedAt column and use that to track which records have been changed.
ALTER TABLE `customersPrimary`
ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT '2000-01-01 00:00:00',
ADD INDEX `idx_customer_primary_updated` (`updatedAt`);
ALTER TABLE `customersSecondary`
ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
ADD INDEX `idx_customer_secondary_updated` (`updatedAt`);
And then you can add updatedAt to your join criteria and the WHERE clause -
UPDATE customersPrimary cp
INNER JOIN customersSecondary cs
ON cp.groupID = cs.groupID
AND cp.IDInGroup = cs.IDInGroup
AND cp.updatedAt < cs.updatedAt
SET
cp.name = cs.name,
cp.address = cs.address,
cp.updatedAt = cs.updatedAt
WHERE cs.updatedAt > :last_query_run_time;
For :last_query_run_time you could use the last run time if you are storing it. Otherwise, if you know you are running the query every hour you could use NOW() - INTERVAL 65 MINUTE. Notice I have used more than one hour to make sure records aren't missed if there is a slight delay for some reason. Another option would be to use SELECT MAX(updatedAt) FROM customersPrimary -
UPDATE customersPrimary cp
INNER JOIN (SELECT MAX(updatedAt) maxUpdatedAt FROM customersPrimary) t
INNER JOIN customersSecondary cs
ON cp.groupID = cs.groupID
AND cp.IDInGroup = cs.IDInGroup
AND cp.updatedAt < cs.updatedAt
SET
cp.name = cs.name,
cp.address = cs.address,
cp.updatedAt = cs.updatedAt
WHERE cs.updatedAt > t.maxUpdatedAt;
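If you are not storing the last run time and instead rely on the schedule, the first query with the 65-minute window mentioned above would look like this (a sketch, assuming an hourly run):
UPDATE customersPrimary cp
INNER JOIN customersSecondary cs
ON cp.groupID = cs.groupID
AND cp.IDInGroup = cs.IDInGroup
AND cp.updatedAt < cs.updatedAt
SET
cp.name = cs.name,
cp.address = cs.address,
cp.updatedAt = cs.updatedAt
WHERE cs.updatedAt > NOW() - INTERVAL 65 MINUTE;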

Plan A:
Something like this would first find the "new" rows, then add only those:
UPDATE primary
JOIN ( SELECT ...
       FROM secondary
       LEFT JOIN primary ON ...
       WHERE primary... IS NULL ) AS new_rows
    ON ...
SET ...
Might secondary have changes? If so, a variant of that would work.
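Reading Plan A against the tables in the question, and assuming "new" means the (groupID, IDInGroup) pair does not yet exist in customersPrimary, the concrete statement would be an INSERT rather than an UPDATE, something like:
-- Insert only rows from the staging table that have no match in the primary table
INSERT INTO customersPrimary (groupID, IDInGroup, name, address)
SELECT s.groupID, s.IDInGroup, s.name, s.address
FROM customersSecondary s
LEFT JOIN customersPrimary p
    ON p.groupID = s.groupID
   AND p.IDInGroup = s.IDInGroup
WHERE p.id IS NULL;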
Plan B:
Better yet is to TRUNCATE TABLE secondary after it is folded into primary.
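A sketch of that fold-then-truncate step; using INSERT ... ON DUPLICATE KEY UPDATE here is my own choice (it leans on the unique key groupID-IDInGroup) and covers both new and changed rows:
-- Fold the staging table into the primary table, then empty it
INSERT INTO customersPrimary (groupID, IDInGroup, name, address)
SELECT groupID, IDInGroup, name, address
FROM customersSecondary
ON DUPLICATE KEY UPDATE
    name    = VALUES(name),
    address = VALUES(address);
TRUNCATE TABLE customersSecondary;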

Related

Update a table with longblobs is too slow

I have a table filled with data (about 20,000 records). I am trying to update it with data from another table, but I hit a timeout (30 seconds).
At first I tried a naive solution:
UPDATE TableWhithBlobs a
JOIN AnotherTable b on a.AnotherTableId = b.Id
SET a.SomeText= b.Description;
This query runs for much longer than 30 seconds, so I tried to avoid the join:
UPDATE TableWhithBlobs a
SET a.SomeText = (select b.Description from AnotherTable b where a.AnotherTableId = b.Id);
but this one is still very slow. Is there any way to make it fast?
Edit:
A bit of explanation about what I'm doing. Previously I had two tables, called TableWhithBlobs and AnotherTable in my script. TableWhithBlobs stored a link to AnotherTable, but this link was not a real foreign key, just a guid from AnotherTable, with a unique key constraint on it in TableWhithBlobs. I decided to fix this: remove the old field from TableWhithBlobs and add a normal foreign key to it (using the primary ID from AnotherTable). The query from the question just fills this new field with the correct data. After that, I delete the old guid reference and add the new foreign key constraint. Everything works fine with a small amount of data in TableWhithBlobs, but on the QA database with 20,000 rows it is extremely slow.
Update
SHOW CREATE TABLE TableWhithBlobs;
CREATE TABLE `TableWhithBlobs` (
`Id` bigint(20) NOT NULL AUTO_INCREMENT,
`AnotherTableId` char(36) CHARACTER SET ascii NOT NULL,
`ChunkNumber` bigint(20) NOT NULL,
`Content` longblob NOT NULL,
`SomeText` bigint(20) NOT NULL,
PRIMARY KEY (`Id`),
UNIQUE KEY `AnotherTableId` (`AnotherTableId`,`ChunkNumber`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1
SHOW CREATE TABLE AnotherTable ;
CREATE TABLE `AnotherTable` (
`Description` bigint(20) NOT NULL AUTO_INCREMENT,
`Id` char(36) CHARACTER SET ascii NOT NULL,
`Length` bigint(20) NOT NULL,
`ContentDigest` char(68) CHARACTER SET ascii NOT NULL,
`ContentAndMetadataDigest` char(68) CHARACTER SET ascii NOT NULL,
`Status` smallint(6) NOT NULL,
`ChunkStartNumber` bigint(20) NOT NULL DEFAULT '0',
`IsTestData` bit(1) NOT NULL DEFAULT b'0',
PRIMARY KEY (`Description`),
UNIQUE KEY `Id` (`Id`),
UNIQUE KEY `ContentAndMetadataDigest` (`ContentAndMetadataDigest`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1
P.S. Column names may look weird because I want to hide the actual production schema names.
innodb_buffer_pool_size is 134217728 (128 MB); RAM is 4 GB.
Result of
explain UPDATE TableWhithBlobs a JOIN AnotherTable b on a.AnotherTableId = b.Id
SET a.SomeText = b.Description;
Version: mysql Ver 14.14 Distrib 5.7.21-20, for debian-linux-gnu (x86_64) using 6.3
Some thoughts, none of which jump out as "the answer":
Increase innodb_buffer_pool_size to 1500M, assuming this does not lead to swapping.
Step back and look at "why" the BIGINT needs to be copied over so often. And whether "all" rows need updating.
Put the LONGBLOB into another table in parallel with the current one (a sketch follows below). That will add a JOIN for the cases when you need to fetch the blob, but may keep it out of the way for the current query. (I would not expect the blob to be "in the way", but apparently it is.)
What is in the blob? In some situations, it is better to have the blob in a file. A prime example is an image for a web site -- it could be accessed via http's <img...>.
Increase the timeout -- but this just "sweeps the problem under the rug" and probably leads to 30+ second delays in other things that are waiting for it. I don't recognize 30 seconds as a timeout amount. Look through SHOW VARIABLES LIKE '%out'; Try increasing any that are 30.
Do the update piecemeal -- but would this have other implications? (Anyway, Luuk should carry this option forward.)
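To illustrate the parallel-table idea above, a minimal sketch (the new table name is hypothetical, and the copy/drop steps assume you can afford the locking involved):
-- Sibling table that carries only the blob, keyed by the same Id
CREATE TABLE TableWhithBlobsContent (
  Id bigint(20) NOT NULL,
  Content longblob NOT NULL,
  PRIMARY KEY (Id),
  FOREIGN KEY (Id) REFERENCES TableWhithBlobs (Id)
) ENGINE=InnoDB;
INSERT INTO TableWhithBlobsContent (Id, Content)
SELECT Id, Content FROM TableWhithBlobs;
ALTER TABLE TableWhithBlobs DROP COLUMN Content;
Queries that need the blob then JOIN on Id; whether this actually helps the UPDATE is exactly the open question raised above.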
What about doing smaller updates?
UPDATE TableWhithBlobs a
JOIN AnotherTable b on a.AnotherTableId = b.Id
SET a.SomeText= b.Description
WHERE a.SomeText <> b.Description;
or even, in batches (MySQL does not allow LIMIT in a multi-table UPDATE, so the batch is limited through a derived table of ids instead):
UPDATE TableWhithBlobs a
JOIN (SELECT a2.Id
      FROM TableWhithBlobs a2
      JOIN AnotherTable b2 ON a2.AnotherTableId = b2.Id
      WHERE a2.SomeText <> b2.Description
      LIMIT 100) batch ON batch.Id = a.Id
JOIN AnotherTable b ON a.AnotherTableId = b.Id
SET a.SomeText = b.Description;
Your timeout problem should be solved 😉, but I do not know how many times you will have to run this before you finally get 0 rows affected...
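If this has to keep running until nothing is left to update, one way (a sketch; the procedure name is mine) is to wrap the batched UPDATE in a stored procedure that loops until a pass changes zero rows:
DELIMITER $$
CREATE PROCEDURE sync_sometext_in_batches()
BEGIN
  DECLARE affected INT DEFAULT 1;
  WHILE affected > 0 DO
    UPDATE TableWhithBlobs a
    JOIN (SELECT a2.Id
          FROM TableWhithBlobs a2
          JOIN AnotherTable b2 ON a2.AnotherTableId = b2.Id
          WHERE a2.SomeText <> b2.Description
          LIMIT 100) batch ON batch.Id = a.Id
    JOIN AnotherTable b ON a.AnotherTableId = b.Id
    SET a.SomeText = b.Description;
    SET affected = ROW_COUNT();  -- rows changed by the UPDATE above
  END WHILE;
END$$
DELIMITER ;
CALL sync_sometext_in_batches();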

Concurrent queries on composite index with order by id drastically slow

I have a table defined as follows:
CREATE TABLE `book` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`provider_id` int(10) unsigned DEFAULT '0',
`source_id` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,
`title` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`description` longtext COLLATE utf8_unicode_ci,
PRIMARY KEY (`id`),
UNIQUE KEY `provider` (`provider_id`,`source_id`),
KEY `idx_source_id` (`source_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1605425 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
When there are about 10 concurrent reads with the following SQL:
SELECT * FROM `book` WHERE (provider_id = '1' AND source_id = '1037122800') ORDER BY `book`.`id` ASC LIMIT 1
it becomes slow; it takes about 100 ms.
However, if I change it to
SELECT * FROM `book` WHERE (provider_id = '1' AND source_id = '221630001') LIMIT 1
then it is normal; it takes several ms.
I don't understand why adding ORDER BY id makes the query much slower. Could anyone explain?
Try selecting only the desired columns (SELECT column1, column2, ...) instead of *, or refer to this:
Why is my SQL Server ORDER BY slow despite the ordered column being indexed?
I'm not a mysql expert, and not able to perform a detailed analysis, but my guess would be that because you are providing values for the UNIQUE KEY in the WHERE clause, the engine can go and fetch that row directly using an index.
However, when you ask it to ORDER BY the id column, which is a PRIMARY KEY, that changes the access path. The engine now guesses that since it has an index on id, and you want to order by id, it is better to fetch that data in PK order, which will avoid a sort. In this case though, it leads to a slower result, as it has to compare every row to the criteria (a table scan).
Note that this is just conjecture. You would need to EXPLAIN both statements to see what is going on.
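For example (a sketch; the reported plans and row estimates will depend on your data and version):
EXPLAIN SELECT * FROM `book`
WHERE provider_id = '1' AND source_id = '1037122800'
ORDER BY `book`.`id` ASC LIMIT 1;
EXPLAIN SELECT * FROM `book`
WHERE provider_id = '1' AND source_id = '221630001'
LIMIT 1;
If the first shows the PRIMARY key with a large row estimate while the second shows the provider index, that would confirm the access-path guess above.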

In MySQL is it faster to execute one JOIN + one LIKE statement or two JOINs?

I have to create a cron job, which is simple in itself, but because it will run every minute I'm worried about performance. I have two tables, one has user names and the other has details about their network. Most of the time a user will belong to just one network, but it is theoretically possible that they might belong to more, but even then very few, maybe two or three. So, in order to reduce the number of JOINs, I saved the network ids separated by | in a field in the user table, e.g.
|1|3|9|
The (simplified for this question) user table structure is
CREATE TABLE `users` (
`u_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`userid` VARCHAR(500) NOT NULL UNIQUE,
`net_ids` VARCHAR(500) NOT NULL DEFAULT '',
PRIMARY KEY (`u_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The (also simplified) network table structure is
CREATE TABLE `network` (
`n_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`netname` VARCHAR(500) NOT NULL UNIQUE,
`login_time` DATETIME DEFAULT NULL,
`timeout_mins` TINYINT UNSIGNED NOT NULL DEFAULT 10,
PRIMARY KEY (`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I have to send a warning when a timeout occurs; my query is
SELECT N.netname, N.timeout_mins, N.n_id, U.userid FROM
(SELECT netname, timeout_mins, n_id FROM network
WHERE is_open = 1 AND notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, timeout_mins, login_time), NOW()) < 60) AS N
INNER JOIN users AS U ON U.net_ids LIKE CONCAT('%|', N.n_id, '|%');
I made N a subquery to reduce the number of rows joined. But I would like to know whether it would be faster to add a third table with u_id and n_id as columns, remove the net_ids column from users, and then join all three tables, because I read that using LIKE slows things down.
Which is the most efficient query to use in this case? One JOIN and a LIKE, or two JOINs?
P.S. I did some experimentation, and the initial timings for two JOINs are higher than for a JOIN and a LIKE. However, repeated runs of the same query speed things up a lot. I suspect something is cached somewhere, either in my app or in the database, and the two approaches become comparable, so I did not find this data satisfactory. It also contradicts what I was expecting based on what I have been reading.
I used this table:
CREATE TABLE `user_net` (
`u_id` BIGINT UNSIGNED NOT NULL,
`n_id` BIGINT UNSIGNED NOT NULL,
INDEX `u_id` (`u_id`),
FOREIGN KEY (`u_id`) REFERENCES `users`(`u_id`),
INDEX `n_id` (`n_id`),
FOREIGN KEY (`n_id`) REFERENCES `network`(`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
and this query:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid FROM
(SELECT netname, timeout_mins, n_id FROM network
WHERE is_open = 1 AND notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, timeout_mins, login_time), NOW()) < 60) AS N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id;
You should define composite indexes for the user_net table. One of them can (and should) be the primary key.
CREATE TABLE `user_net` (
`u_id` BIGINT UNSIGNED NOT NULL,
`n_id` BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (`u_id`, `n_id`),
INDEX `uid_nid` (`n_id`, `u_id`),
FOREIGN KEY (`u_id`) REFERENCES `users`(`u_id`),
FOREIGN KEY (`n_id`) REFERENCES `network`(`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I would also rewrite your query to:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid
FROM network N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id
WHERE N.is_open = 1
AND N.notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, N.timeout_mins, N.login_time), NOW()) < 60
While your subquery will probably not hurt much, there is no need for it.
Note that the last condition cannot use an index, because you have to combine two columns. If your MySQL version is at least 5.7.6 you can define an indexed virtual (calculated) column.
CREATE TABLE `network` (
`n_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`netname` VARCHAR(500) NOT NULL UNIQUE,
`login_time` DATETIME DEFAULT NULL,
`timeout_mins` TINYINT UNSIGNED NOT NULL DEFAULT 10,
`is_open` TINYINT UNSIGNED,
`notify` TINYINT UNSIGNED,
`timeout_dt` DATETIME AS (`login_time` + INTERVAL `timeout_mins` MINUTE),
PRIMARY KEY (`n_id`),
INDEX (`timeout_dt`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Now change the query to:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid
FROM network N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id
WHERE N.is_open = 1
AND N.notify = 1
AND N.timeout_dt < NOW() + INTERVAL 60 SECOND
and it will be able to use the index.
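If you would rather not recreate the table, roughly the same change can be applied to the existing network table (a sketch, again assuming MySQL 5.7.6+):
-- Add the virtual column and an index on it in place
ALTER TABLE `network`
  ADD COLUMN `timeout_dt` DATETIME AS (`login_time` + INTERVAL `timeout_mins` MINUTE),
  ADD INDEX (`timeout_dt`);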
You can also try to replace
INDEX (`timeout_dt`)
with
INDEX (`is_open`, `notify`, `timeout_dt`)
and see if it is of any help.
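That swap would look something like this, assuming the index added above kept the default name timeout_dt (the new index name is mine):
ALTER TABLE `network`
  DROP INDEX `timeout_dt`,
  ADD INDEX `idx_open_notify_timeout` (`is_open`, `notify`, `timeout_dt`);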
Reformulate to avoid hiding columns inside functions. I can't grok your date expression, but note this:
login_time < NOW() - INTERVAL timeout_mins MINUTE
If you can achieve something like that, then this index should help:
INDEX(is_open, notify, login_time)
If that is not good enough, let's see the other formulation so we can compare them.
Having stuff separated by comma (or |) is likely to be a really bad idea.
Bottom line: Assume that JOINs are not a performance problem, write the queries with as many JOINs as needed. Then let's optimize that.

MySQL group by query with subselect optimization

I have the following tables in MySQL:
CREATE TABLE `events` (
`pv_name` varchar(60) COLLATE utf8mb4_unicode_ci NOT NULL,
`time_stamp` bigint(20) unsigned NOT NULL,
`event_type` varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,
`value` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin,
`value_type` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`value_count` bigint(20) DEFAULT NULL,
`alarm_status` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`alarm_severity` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`pv_name`,`time_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED;
CREATE TEMPORARY TABLE `matching_pv_names` (
`pv_name` varchar(60) NOT NULL,
PRIMARY KEY (`pv_name`)
) ENGINE=Memory DEFAULT CHARSET=latin1;
The matching_pv_names table holds a subset of the unique events.pv_name values.
The following query runs using the 'loose index scan' optimization:
SELECT events.pv_name, MAX(events.time_stamp) AS time_stamp
FROM events
WHERE events.time_stamp <= time_stamp_in
GROUP BY events.pv_name;
Is it possible to improve the time of this query by restricting the events.pv_name values to those in the matching_pv_names table without losing the 'loose index scan' optimization?
Try one of the below queries to limit your output to matching values found in matching_pv_names.
Query 1:
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
INNER JOIN matching_pv_names pv ON e.pv_name = pv.pv_name
WHERE e.time_stamp <= time_stamp_in
GROUP BY e.pv_name;
Query 2:
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
WHERE e.time_stamp <= time_stamp_in
AND EXISTS ( select 1 from matching_pv_names pv WHERE e.pv_name = pv.pv_name )
GROUP BY e.pv_name;
Let me quote the manual here, since I think it applies to your case (emphasis mine):
If the WHERE clause contains range predicates (...), a loose index scan looks up the first key of each group that satisfies the range conditions, and again reads the least possible number of keys. This is possible under the following conditions:
The query is over a single table.
Knowing this, I believe Query 1 would not be able to use a loose index scan, but the second query probably could. If that is still not the case, you could also give the third method a try, which uses a derived table.
Query 3:
SELECT e.*
FROM (
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
WHERE e.time_stamp <= time_stamp_in
GROUP BY e.pv_name
) e
INNER JOIN matching_pv_names pv ON e.pv_name = pv.pv_name;
Your query is very efficient. You can 'prove' it by doing this:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
Most numbers refer to "rows touched", either in the index or in the data. You will see very low numbers. If the biggest one is about the number of rows returned, that is very good. (I tried a similar query and got about 2x; I don't know why.)
With that few rows touched, then either
Outputting the rows will overwhelm the run time. So, who cares about the efficiency; or
You were I/O-bound because of leapfrogging through the index (actually, the table in your case). Run it a second time; it will be fast because of caching.
The only way to speed up leapfrogging is to somehow move the desired rows next to each other. That seems unreasonable for this query.
As for playing games with another table -- Maybe. Will the JOIN significantly decrease the number of events to look at? Then Maybe. Otherwise, I say "a very efficient query is not going to get faster by adding complexity".

MySQL UPDATE Statement using LIKE with 2 Tables Takes Decades

Can you please advise why such a query would take so long (literally 20-30 minutes)?
I seem to have proper indexes set up, don't I?
UPDATE `temp_val_import_435` t1,
`attr_upc` t2 SET t1.`attr_id` = t2.`id` WHERE t1.`value` LIKE t2.`upc`
CREATE TABLE `attr_upc` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`upc` varchar(255) NOT NULL,
`last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `upc` (`upc`),
KEY `last_update` (`last_update`)
) ENGINE=InnoDB AUTO_INCREMENT=102739 DEFAULT CHARSET=utf8
CREATE TABLE `temp_val_import_435` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`attr_id` int(11) DEFAULT NULL,
`translation_id` int(11) DEFAULT NULL,
`source_value` varchar(255) NOT NULL,
`value` varchar(255) DEFAULT NULL,
`count` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `core_value_id` (`core_value_id`),
KEY `translation_id` (`translation_id`),
KEY `source_value` (`source_value`),
KEY `value` (`value`),
KEY `count` (`count`)
) ENGINE=InnoDB AUTO_INCREMENT=32768 DEFAULT CHARSET=utf8
Ed Cottrell's solution worked for me. Using = instead of LIKE sped up a smaller test query on 1000 rows by a lot.
I measured it two ways: one in phpMyAdmin, the other by looking at the time for DOM load (which of course involves other processes).
DOM load went from 44 seconds to 1 second, a 98% improvement.
But the difference in query execution time was much more dramatic, going from 43.4 seconds to 0.0052 seconds, a decrease of 99.988%. Pretty good. I will report back on results from huge datasets.
Use = instead of LIKE. = should be much faster than LIKE -- LIKE is only for matching patterns, as in '%something%', which matches anything with "something" anywhere in the text.
If you have this query:
SELECT * FROM myTable where myColumn LIKE 'blah'
MySQL can optimize this by pretending you typed myColumn = 'blah', because it sees that the pattern is fixed and has no wildcards. But what if you have this data in your upc column:
blah
foo
bar
%foo%
%bar
etc.
MySQL can't optimize your query in advance, because it's possible that the text it is trying to match is a pattern, like %foo%. So it has to run a LIKE pattern comparison of every single value of temp_val_import_435.value against every single value of attr_upc.upc. With a simple = and the indexes you have defined, this is unnecessary, and the query should be dramatically faster.
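In other words, assuming the upc values are plain literals rather than patterns, the statement from the question becomes simply:
UPDATE `temp_val_import_435` t1
JOIN `attr_upc` t2 ON t1.`value` = t2.`upc`
SET t1.`attr_id` = t2.`id`;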
In essence you are joining on a LIKE, which is going to be problematic (you would need EXPLAIN to see if MySQL is utilizing indexes at all). Try this:
UPDATE `temp_val_import_435` t1
INNER JOIN `attr_upc` t2
    ON t1.`value` LIKE t2.`upc`
SET t1.`attr_id` = t2.`id`;