I have a table filled with data (about 20,000 records). I am trying to update it with data from another table, but I keep hitting a timeout (30 seconds).
At first I tried a naive solution:
UPDATE TableWhithBlobs a
JOIN AnotherTable b on a.AnotherTableId = b.Id
SET a.SomeText= b.Description;
This script runs for much longer than 30 seconds, so I tried to get rid of the join:
UPDATE TableWhithBlobs a
SET a.SomeText = (select b.Description from AnotherTable b where a.AnotherTableId = b.Id);
but this one is still very slow. Is there any way to make it fast?
Edit:
A bit of explanation about what I'm doing. Previously I had two tables, which in my script are called TableWhithBlobs and AnotherTable. TableWhithBlobs stored a link to AnotherTable, but this link was not a real foreign key, just a guid taken from AnotherTable, and there is a unique key constraint on that guid in TableWhithBlobs. I decided to fix this: remove the old field from TableWhithBlobs and add a normal foreign key instead (using the primary Id from AnotherTable). The script from the question just fills this new field with the correct data. After that, I drop the old guid reference and add the new foreign key constraint. Everything works fine with a small amount of data in TableWhithBlobs, but on the QA database with 20,000 rows it is extremely slow.
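For context, here is a hedged sketch of the overall migration, using the column names from the schemas below (the exact DDL and the constraint name are assumptions, not the script actually used):
-- New BIGINT column that will hold AnotherTable's primary key (Description)
ALTER TABLE TableWhithBlobs ADD COLUMN SomeText bigint NOT NULL;
-- Populate it from the old guid link (the query from the question)
UPDATE TableWhithBlobs a
JOIN AnotherTable b ON a.AnotherTableId = b.Id
SET a.SomeText = b.Description;
-- Drop the old guid reference and add a real foreign key (constraint name is made up)
ALTER TABLE TableWhithBlobs
  DROP COLUMN AnotherTableId,
  ADD CONSTRAINT fk_tablewithblobs_anothertable
      FOREIGN KEY (SomeText) REFERENCES AnotherTable (Description);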
Update
SHOW CREATE TABLE TableWhithBlobs;
CREATE TABLE `TableWhithBlobs` (
`Id` bigint(20) NOT NULL AUTO_INCREMENT,
`AnotherTableId` char(36) CHARACTER SET ascii NOT NULL,
`ChunkNumber` bigint(20) NOT NULL,
`Content` longblob NOT NULL,
`SomeText` bigint(20) NOT NULL,
PRIMARY KEY (`Id`),
UNIQUE KEY `AnotherTableId` (`AnotherTableId`,`ChunkNumber`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1
SHOW CREATE TABLE AnotherTable ;
CREATE TABLE `AnotherTable` (
`Description` bigint(20) NOT NULL AUTO_INCREMENT,
`Id` char(36) CHARACTER SET ascii NOT NULL,
`Length` bigint(20) NOT NULL,
`ContentDigest` char(68) CHARACTER SET ascii NOT NULL,
`ContentAndMetadataDigest` char(68) CHARACTER SET ascii NOT NULL,
`Status` smallint(6) NOT NULL,
`ChunkStartNumber` bigint(20) NOT NULL DEFAULT '0',
`IsTestData` bit(1) NOT NULL DEFAULT b'0',
PRIMARY KEY (`Description`),
UNIQUE KEY `Id` (`Id`),
UNIQUE KEY `ContentAndMetadataDigest` (`ContentAndMetadataDigest`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1
PS. Column names may look weird because I want to hide the actual production schema names.
innodb_buffer_pool_size is 134217728 (128M); RAM is 4 GB.
Result of
explain UPDATE TableWhithBlobs a JOIN AnotherTable b on a.AnotherTableId =
b.Id SET a.SomeText= b.Description;
Version: mysql Ver 14.14 Distrib 5.7.21-20, for debian-linux-gnu (x86_64) using 6.3
Some thoughts, none of which jump out as "the answer":
Increase innodb_buffer_pool_size to 1500M, assuming this does not lead to swapping (see the sketch after this list).
Step back and look at "why" the BIGINT needs to be copied over so often. And whether "all" rows need updating.
Put the LONGBLOB into another table in parallel with the current one. That will add a JOIN for the cases when you need to fetch the blob, but may keep it out of the way for the current query. (I would not expect the blob to be "in the way", but apparently it is.)
What is in the blob? In some situations, it is better to have the blob in a file. A prime example is an image for a web site -- it could be accessed via http's <img...>.
Increase the timeout -- but this just "sweeps the problem under the rug" and probably leads to 30+ second delays in other things that are waiting for it. I don't recognize 30 seconds as a timeout amount. Look through SHOW VARIABLES LIKE '%out'; Try increasing any that are 30.
Do the update piecemeal -- but would this have other implications? (Anyway, Luuk should carry this option forward.)
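For the buffer-pool suggestion above, a minimal sketch of the resize, assuming MySQL 5.7.5 or later (where the buffer pool can be resized online; MySQL rounds the value up to a multiple of the chunk size):
SET GLOBAL innodb_buffer_pool_size = 1500 * 1024 * 1024;  -- value is in bytes
SELECT @@innodb_buffer_pool_size;                         -- verify the new size
To keep the setting across restarts, also put innodb_buffer_pool_size = 1500M in my.cnf (5.7 has no SET PERSIST).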
What about doing smaller updates?
UPDATE TableWhithBlobs a
JOIN AnotherTable b on a.AnotherTableId = b.Id
SET a.SomeText= b.Description
WHERE a.SomeText <> b.Description;
or even, in chunks (note that MySQL does not allow LIMIT directly in a multiple-table UPDATE, so the chunk of ids has to come from a derived table):
UPDATE TableWhithBlobs a
JOIN (SELECT a2.Id
        FROM TableWhithBlobs a2
        JOIN AnotherTable b2 ON a2.AnotherTableId = b2.Id
       WHERE a2.SomeText <> b2.Description
       LIMIT 100) chunk ON chunk.Id = a.Id
JOIN AnotherTable b ON a.AnotherTableId = b.Id
SET a.SomeText = b.Description;
Your timeout problem should be solved 😉, but I do not know how many times you will have to run this before you finally get 0 rows affected...
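If you would rather automate the repetition than re-run the statement by hand, here is a minimal sketch using a stored procedure (the procedure name is made up; ROW_COUNT() reports how many rows the last UPDATE changed):
DELIMITER //
CREATE PROCEDURE copy_description_in_chunks()
BEGIN
  DECLARE rows_updated INT DEFAULT 1;
  WHILE rows_updated > 0 DO
    UPDATE TableWhithBlobs a
    JOIN (SELECT a2.Id
            FROM TableWhithBlobs a2
            JOIN AnotherTable b2 ON a2.AnotherTableId = b2.Id
           WHERE a2.SomeText <> b2.Description
           LIMIT 100) chunk ON chunk.Id = a.Id
    JOIN AnotherTable b ON a.AnotherTableId = b.Id
    SET a.SomeText = b.Description;
    SET rows_updated = ROW_COUNT();  -- becomes 0 once everything matches
  END WHILE;
END //
DELIMITER ;
CALL copy_description_in_chunks();
Each pass touches at most 100 rows, so no single statement should come near the 30-second timeout.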
Related
I am trying to update one table based on another in the most efficient way.
Here is the table DDL of what I am trying to update
Table1
CREATE TABLE `customersPrimary` (
`id` int NOT NULL AUTO_INCREMENT,
`groupID` int NOT NULL,
`IDInGroup` int NOT NULL,
`name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `groupID-IDInGroup` (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Table2
CREATE TABLE `customersSecondary` (
`groupID` int NOT NULL,
`IDInGroup` int NOT NULL,
`name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Both tables are practically identical, but customersSecondary is by design a staging table for the other. The big difference is the primary keys: Table 1 has an auto-incrementing primary key, Table 2 has a composite primary key.
In both tables the combination of groupID and IDInGroup are unique.
Here is the query I want to optimize
UPDATE customersPrimary
INNER JOIN customersSecondary ON
(customersPrimary.groupID = customersSecondary.groupID
AND customersPrimary.IDInGroup = customersSecondary.IDInGroup)
SET
customersPrimary.name = customersSecondary.name,
customersPrimary.address = customersSecondary.address
This query works but scans EVERY row in customersSecondary.
Adding
WHERE customersPrimary.groupID = (groupID)
cuts it down significantly, to the number of rows in customersSecondary with that groupID. But this is still often far larger than the number of rows actually being updated, since a groupID can cover many rows. I think the WHERE clause needs improvement.
I can control table structure and add indexes. I will have to keep both tables.
Any suggestions would be helpful.
Your existing query requires a full table scan because you are saying "update everything on the left based on the value on the right". Presumably the optimiser is choosing customersSecondary because it has fewer rows, or at least it thinks it does.
Is the full table scan causing you problems? Locking? Too slow? How long does it take? How frequently are the tables synced? How many records are there in each table? What is the rate of change in each of the tables?
You could add separate indices on name and address but that will take a good chunk of space. The better option is going to be to add an indexed updatedAt column and use that to track which records have been changed.
ALTER TABLE `customersPrimary`
ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT '2000-01-01 00:00:00',
ADD INDEX `idx_customer_primary_updated` (`updatedAt`);
ALTER TABLE `customersSecondary`
ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
ADD INDEX `idx_customer_secondary_updated` (`updatedAt`);
And then you can add updatedAt to your join criteria and the WHERE clause -
UPDATE customersPrimary cp
INNER JOIN customersSecondary cs
ON cp.groupID = cs.groupID
AND cp.IDInGroup = cs.IDInGroup
AND cp.updatedAt < cs.updatedAt
SET
cp.name = cs.name,
cp.address = cs.address,
cp.updatedAt = cs.updatedAt
WHERE cs.updatedAt > :last_query_run_time;
For :last_query_run_time you could use the last run time if you are storing it. Otherwise, if you know you are running the query every hour you could use NOW() - INTERVAL 65 MINUTE. Notice I have used more than one hour to make sure records aren't missed if there is a slight delay for some reason. Another option would be to use SELECT MAX(updatedAt) FROM customersPrimary -
UPDATE customersPrimary cp
INNER JOIN (SELECT MAX(updatedAt) maxUpdatedAt FROM customersPrimary) t
INNER JOIN customersSecondary cs
ON cp.groupID = cs.groupID
AND cp.IDInGroup = cs.IDInGroup
AND cp.updatedAt < cs.updatedAt
SET
cp.name = cs.name,
cp.address = cs.address,
cp.updatedAt = cs.updatedAt
WHERE cs.updatedAt > t.maxUpdatedAt;
Plan A:
Something like this would first find the "new" rows, then add only those:
INSERT INTO primary ( ... )
    SELECT ...
        FROM secondary
        LEFT JOIN primary ON ...
        WHERE primary... IS NULL;
Might secondary have changes? If so, a variant of that would work.
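Concretely, with the table names from the question, one reading of Plan A ("insert only the rows that are not in customersPrimary yet") might look like this sketch:
INSERT INTO customersPrimary (groupID, IDInGroup, name, address)
SELECT s.groupID, s.IDInGroup, s.name, s.address
FROM customersSecondary s
LEFT JOIN customersPrimary p
       ON p.groupID = s.groupID
      AND p.IDInGroup = s.IDInGroup
WHERE p.id IS NULL;   -- keep only rows with no match in customersPrimary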
Plan B:
Better yet is to TRUNCATE TABLE secondary after it is folded into primary.
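A hedged sketch of Plan B, relying on the UNIQUE (groupID, IDInGroup) key so that existing rows are updated and new ones inserted, after which the staging table is emptied:
INSERT INTO customersPrimary (groupID, IDInGroup, name, address)
SELECT groupID, IDInGroup, name, address
FROM customersSecondary
ON DUPLICATE KEY UPDATE
  name    = VALUES(name),
  address = VALUES(address);
TRUNCATE TABLE customersSecondary;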
I am trying to copy a table in MySQL (version 5.7.38-1) using the following queries:
CREATE TABLE dest LIKE src;
INSERT INTO dest SELECT * FROM src;
Table dest is created and filled with records from table src. So far, so good. You would expect the two tables to be roughly the same size, but table dest is 646M, whereas table src is only 134M. After the CREATE step, table dest is 48K, more or less as expected.
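(As a side note, one way to see whether the extra space sits in the data or in the indexes is information_schema; the figures are approximate:)
SELECT table_name,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb,
       ROUND(data_free    / 1024 / 1024) AS free_mb
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name IN ('src', 'dest');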
Engine is InnoDB, default row format is dynamic and compression is on.
I have executed the following to see if it would help but to no avail:
ALTER TABLE dest ROW_FORMAT=COMPRESSED;
OPTIMIZE TABLE dest;
And this is SHOW CREATE TABLE src:
CREATE TABLE `src` (
`meta_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`post_id` bigint(20) unsigned NOT NULL DEFAULT '0',
`meta_key` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`meta_value` longtext COLLATE utf8mb4_unicode_ci,
PRIMARY KEY (`meta_id`),
KEY `post_id` (`post_id`),
KEY `meta_key` (`meta_key`(191))
) ENGINE=InnoDB AUTO_INCREMENT=6046271 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
I am aware that the mysql version is dated but changing that is outside my scope of control.
Two questions:
What is the reason for this unexpected behavior?
What is the solution to make Table dest smaller?
Thanks for your insights.
Various possibilities. The most likely is the indexes.
What engine? Was compression on? What row format?
Please provide SHOW CREATE TABLE src.
INSERT ... SELECT ... will feed the rows into the table one at a time (but a lot faster than one INSERT statement per row). If the engine is InnoDB, then presumably the src rows are read in PRIMARY KEY order, and the optimal order for inserting into dest is that same order. So I would expect the data's BTree to be effectively 'defragmented'.
Secondary indexes are another matter. They may or may not be efficiently ordered. And the "change buffer" may or may not compensate for the ordering. The resulting BTree for each secondary index may or may not be 'defragmented'.
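If you want to see which index the space went into, the persistent statistics table records a per-index size in pages (assuming innodb_stats_persistent is on, which is the default in 5.7):
SELECT table_name,
       index_name,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024) AS size_mb
FROM mysql.innodb_index_stats
WHERE database_name = DATABASE()   -- run this from the schema that holds the tables
  AND table_name IN ('src', 'dest')
  AND stat_name = 'size';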
What version of mysql/mariadb? I may have a tool to look deeper into the issue.
I have the following SQL query (DB is MySQL 5):
select
event.full_session_id,
DATE(min(event.date)),
event_exe.user_id,
COUNT(DISTINCT event_pat.user_id)
FROM
event AS event
JOIN event_participant AS event_pat ON
event.pat_id = event_pat.id
JOIN event_participant AS event_exe on
event.exe_id = event_exe.id
WHERE
event_pat.user_id <> event_exe.user_id
GROUP BY
event.full_session_id;
"SHOW CREATE TABLE event":
CREATE TABLE `event` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`session_id` varchar(64) DEFAULT NULL,
`full_session_id` varchar(72) DEFAULT NULL,
`pat_id` int(12) DEFAULT NULL,
`exe_id` int(12) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `SESSION_IDX` (`full_session_id`),
KEY `PAT_ID_IDX` (`pat_id`),
KEY `DATE_IDX` (`date`),
KEY `SESSLOGPATEXEC_IDX` (`full_session_id`,`date`,`pat_id`,`exe_id`)
) ENGINE=MyISAM AUTO_INCREMENT=371955 DEFAULT CHARSET=utf8
"SHOW CREATE TABLE event_participant":
CREATE TABLE `event_participant` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`user_id` varchar(64) NOT NULL,
`alt_user_id` varchar(64) NOT NULL,
`username` varchar(128) NOT NULL,
`usertype` varchar(32) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ALL_UNQ` (`user_id`,`alt_user_id`,`username`,`usertype`),
KEY `USER_ID_IDX` (`user_id`)
) ENGINE=MyISAM AUTO_INCREMENT=5397 DEFAULT CHARSET=utf8
Also, the query itself seems ugly, but this is legacy code on a production system, so we are not expected to change it (at least for now).
The problem is that there are around 36 million records in the event table (on the production system), so there have been frequent crashes of the DB machine due to "Using temporary; Using filesort" processing (they provided those EXPLAIN outputs; unfortunately, I don't have them right now. I'll try to add them to this post later).
The customer asks for a "quick fix" by adding indices. Currently we have indices on full_session_id, pat_id, date (separately) on event and user_id on event_participant.
Thus I'm thinking of creating a composite index (pat_id, exe_id, full_session_id, date) on event. This index covers the fields used in the joins (the equivalent of a WHERE?), then the GROUP BY, and then the aggregate (MIN) parts.
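For reference, the DDL for the index being considered would be something like this sketch (the index name is made up):
ALTER TABLE event
  ADD INDEX JOIN_GROUP_MIN_IDX (pat_id, exe_id, full_session_id, `date`);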
This is just an idea, because we currently don't have that kind of data volume to test with, so we are trying the most promising options first.
My question is:
Could the index above help performance? (It's quite confusing, because I have found two really contrasting results: https://dba.stackexchange.com/questions/158385/compound-index-on-inner-join-table
versus Separate Join clause in a Composite Index, where the latter suggests that a composite index on join columns won't work and the former that it will.)
Does this path (adding indices) look promising? Or should we forget it and just try to optimize the query instead?
Thanks in advance for your help :)
Update:
I have updated the full table description for the two related tables.
MySQL version is 5.1.69. But I think we don't need to worry about the ambiguous-data issue mentioned in the comments, because there won't be any ambiguity in our data: for each full_session_id, only one event_exe.user_id is returned (that's just business logic in the application).
So, what do you think about my 2 questions ?
I have a column with data that exceeds MySQL's index length limit. Therefore, I can't use a unique key.
There's a solution here to the problem without using a unique key: MySQL: Insert record if not exists in table
However, in the comments, people are having issues with inserting the same value into multiple columns. In my case, a lot of my values are 0, so I'll get duplicate values very often.
I'm using Node and node-mysql to access the database. I'm thinking I can have a variable that keeps track of all values that are currently being inserted. Before inserting, I check whether the value is currently being inserted. If so, I'll wait until it finishes inserting, then continue execution as if the value had originally been inserted. However, I feel like this will be very error prone.
Here's part of my table schema:
CREATE TABLE `links` (
`id` int(10) UNSIGNED NOT NULL,
`url` varchar(2083) CHARACTER SET latin1 COLLATE latin1_general_cs NOT NULL,
`likes` int(10) UNSIGNED NOT NULL,
`tweets` int(10) UNSIGNED NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
ALTER TABLE `links`
ADD PRIMARY KEY (`id`),
ADD KEY `url` (`url`(50));
I cannot put a unique key on url because it can be 2083 bytes, which is over MySQL's key size limit. likes and tweets will often be 0, so the linked solution will not work.
Is there another possible solution?
If you phrase your INSERT in a certain way, you can make use of WHERE NOT EXISTS to check first if the URL does not exist before completing the insert:
INSERT INTO links (`url`, `likes`, `tweets`)
SELECT 'http://www.google.com', 10, 15 FROM DUAL
WHERE NOT EXISTS
(SELECT 1 FROM links WHERE url='http://www.google.com');
This assumes that the id column is a primary key/auto increment, and MySQL will automatically assign a value to it.
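If the application needs to know whether the row was actually inserted, it can check the affected-row count on the same connection right after the INSERT, for example:
SELECT ROW_COUNT();   -- 1 if the URL was new and got inserted, 0 if it already existed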
Can you please advise why such a query would take so long (literally 20-30 minutes)?
I seem to have proper indexes set up, don't I?
UPDATE `temp_val_import_435` t1,
`attr_upc` t2 SET t1.`attr_id` = t2.`id` WHERE t1.`value` LIKE t2.`upc`
CREATE TABLE `attr_upc` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`upc` varchar(255) NOT NULL,
`last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `upc` (`upc`),
KEY `last_update` (`last_update`)
) ENGINE=InnoDB AUTO_INCREMENT=102739 DEFAULT CHARSET=utf8
CREATE TABLE `temp_val_import_435` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`attr_id` int(11) DEFAULT NULL,
`translation_id` int(11) DEFAULT NULL,
`source_value` varchar(255) NOT NULL,
`value` varchar(255) DEFAULT NULL,
`count` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `core_value_id` (`core_value_id`),
KEY `translation_id` (`translation_id`),
KEY `source_value` (`source_value`),
KEY `value` (`value`),
KEY `count` (`count`)
) ENGINE=InnoDB AUTO_INCREMENT=32768 DEFAULT CHARSET=utf8
Ed Cottrell's solution worked for me. Using = instead of LIKE sped up a smaller test query on 1000 rows by a lot.
I measured it in two ways: one in phpMyAdmin, the other by looking at the time for the DOM load (which of course involves other processes).
The DOM load went from 44 seconds to 1 second, a roughly 98% reduction.
But the difference in query execution time was much more dramatic, going from 43.4 seconds to 0.0052 seconds, a decrease of 99.988%. Pretty good. I will report back on results from huge datasets.
Use = instead of LIKE. = should be much faster than LIKE -- LIKE is only for matching patterns, as in '%something%', which matches anything with "something" anywhere in the text.
If you have this query:
SELECT * FROM myTable where myColumn LIKE 'blah'
MySQL can optimize this by pretending you typed myColumn = 'blah', because it sees that the pattern is fixed and has no wildcards. But what if you have this data in your upc column:
blah
foo
bar
%foo%
%bar
etc.
MySQL can't optimize your query in advance, because it's possible that the text it is trying to match is a pattern, like %foo%. So it has to evaluate the LIKE pattern match of every single value of temp_val_import_435.value against every single value of attr_upc.upc. With a simple = and the indexes you have defined, this is unnecessary, and the query should be dramatically faster.
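Applied to the statement from the question, the only change is the comparison operator in the join condition (assuming the upc values are plain text, not patterns):
UPDATE `temp_val_import_435` t1
JOIN `attr_upc` t2 ON t1.`value` = t2.`upc`
SET t1.`attr_id` = t2.`id`;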
In essence you are joining on a LIKE, which is going to be problematic (we would need the EXPLAIN output to see whether MySQL is utilizing indexes at all). Try this:
UPDATE `temp_val_import_435` t1
INNER JOIN `attr_upc` t2
ON t1.`value` LIKE t2.`upc`
SET t1.`attr_id` = t2.`id`;