I have this query:
"explain UPDATE requests R JOIN profile as P ON R.intern_id = P.intern_id OR R.intern_id_decoded = P.intern_id OR R.intern_id_full_decode = P.intern_id SET R.found_id=P.id WHERE R.id >= 28000001 AND R.id <= 28000001+2000000 AND R.found_id is NULL"
id  select_type  table  partitions  type   possible_keys                                         key      key_len  ref   rows       filtered  Extra
1   UPDATE       R      NULL        range  PRIMARY,intern_id_customer_id_batch_num,id_found_id  PRIMARY  4        NULL  3616888    10.00     Using where
1   SIMPLE       P      NULL        ALL    intern_id_dt_snapshot,intern_id                       NULL     NULL     NULL  179586254  27.10     Range checked for each record (index map: 0x6)
That query takes about 40 seconds to execute, and it updates 5,000-10,000 rows out of the 2-million-row range.
I am currently updating in 2-million-row "jobs" to make the join perform faster.
The whole table is 170 million records currently.
The EXPLAIN shows the second table (P) being read without an index; I am not sure whether that's normal.
The intern_id fields are VARCHARs; found_id and id are INTs.
Does the EXPLAIN output look like the query is performing well?
I would do this logic using multiple joins:
UPDATE requests r
LEFT JOIN profile p1 ON r.intern_id = p1.intern_id
LEFT JOIN profile p2 ON r.intern_id_decoded = p2.intern_id AND p1.id IS NULL
LEFT JOIN profile p3 ON r.intern_id_full_decode = p3.intern_id AND p2.id IS NULL
SET r.found_id = COALESCE(p1.id, p2.id, p3.id)
WHERE r.id >= 28000001 AND r.id <= 28000001 + 2000000
  AND r.found_id IS NULL;
Databases are very bad at optimizing OR in JOIN conditions. It might be better with explicit JOINs.
The ON conditions also ensure that only the first match is used.
I would do 3 chunked-up UPDATEs -- one for each of the ON conditions.
10K rows to update is excessive; crank it down to perhaps 1K. That means cranking the chunk size down to 200K. (It might even run faster.)
UPDATE ... ON P.intern_id = R.intern_id SET ... WHERE ...
UPDATE ... ON P.intern_id = R.intern_id_decoded SET ... WHERE ...
UPDATE ... ON P.intern_id = R.intern_id_full_decode SET ... WHERE ...
(The range is the same for each set of 3, thereby helping with caching of R.)
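For instance, the first of the three might look like this with the 200K chunk size suggested above (a sketch; the bounds are illustrative):

UPDATE requests R
JOIN profile P ON P.intern_id = R.intern_id
SET R.found_id = P.id
WHERE R.id >= 28000001 AND R.id < 28000001 + 200000
  AND R.found_id IS NULL;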
Possibly INDEX(found_id) would help, but this is not a given.
See here for more chunking suggestions, especially the tip on finding 1000 rows before starting the operation:
SELECT id FROM requests WHERE id > ... AND found_id IS NULL ORDER BY id LIMIT 1000,1;
Then use that id as the upper bound instead of the 2-millionth. A goal here is to even out the number of rows updated per chunk.
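Put together, one chunk iteration might look like this (a sketch; @prev_id is a hypothetical marker for where the previous chunk ended, and tail-end handling is omitted):

SELECT id INTO @next_id
FROM requests
WHERE id > @prev_id AND found_id IS NULL
ORDER BY id
LIMIT 1000, 1;

UPDATE requests R
JOIN profile P ON P.intern_id = R.intern_id
SET R.found_id = P.id
WHERE R.id > @prev_id AND R.id <= @next_id
  AND R.found_id IS NULL;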
Related
I have two tables, RequestHistoryLog and Request.
The RequestHistoryLog table has 1.2 million rows and these columns:
id(bigint), status(VARCHAR), byUser(VARCHAR),
delegatedUserFor(text), reqId(bigint)
reqId has a foreign key constraint: FOREIGN KEY (`reqId`) REFERENCES `Request` (`id`)
The Request table has 0.4 million rows and many columns:
id(bigint), title(VARCHAR), actionDateTime(Datetime), type(VARCHAR) etc.
In RequestHistoryLog there are multiple entries tracking the status of a request, and one Request has many history-log entries.
The delegatedUserFor column (a TEXT type) holds multiple names with emails.
An example is: 'X(x#xyz.com)A(a#xyz.com)Y(y#xyz.com)'
With the query below, I am trying to get the requests on which A (a#xyz.com) has set a status of "Approved", "Done", "Completed", "Queried", or "Rejected", or
on which some other user has set a status on behalf of A (in that case the entry goes to the delegatedUserFor column).
SELECT *
FROM
(SELECT r.*
FROM Request AS r
JOIN RequestHistoryLog AS rh ON r.id = rh.reqId
where rh.status IN ("Approved", "Done", "Completed",
"Queried", "Rejected")
and (rh.byUser='a#xyz.com'
or rh.delegatedUserFor like '%(a#xyz.com)%')
and r.type='custom'
) AS a
GROUP BY id
ORDER BY actionDateTime desc limit 10;
Here is sample data for both tables:
RequestHistoryLog Table

id  status       byUser          delegatedUserFor                        reqId
2   "Approved"   'A(a#xyz.com)'  ''                                      15
3   "Rejected"   'G(g#xyz.com)'  ''                                      15
4   "Approved"   'X(x#xyz.com)'  'A(a#xyz.com)Y(y#xyz.com)'              15
5   "Approved"   'X(x#xyz.com)'  'G(g#xyz.com)A(a#xyz.com)Y(y#xyz.com)'  16
6   "Rejected"   'B(b#xyz.com)'  ''                                      16
7   "Completed"  'Y(y#xyz.com)'  ''                                      16

Request Table

id  title       actionDateTime         ..........
15  "Request1"  '2021-11-23 01:23:20'  ..........
16  "Request2"  '2021-11-23 11:23:20'  ..........
Now I am getting the requests on which A has set a status, or on which another user has set a status on behalf of A.
The above query is taking a long time.
How can I optimize it to get a fast result?
Plan A: (Probably better if not many rows are type=custom)
Do a "semi-join":
SELECT r.*
FROM Request AS r
WHERE r.type = 'custom'
  AND EXISTS ( SELECT 1 FROM RequestHistoryLog AS rh
               WHERE rh.reqId = r.id
                 AND rh.status IN ("Approved", "Done", "Completed",
                                   "Queried", "Rejected")
                 AND ( rh.byUser = 'a#xyz.com'
                       OR rh.delegatedUserFor LIKE '%(a#xyz.com)%' ) )
ORDER BY r.actionDateTime DESC
LIMIT 10;
Note that the GROUP BY and nested SELECT are avoided. Have these indexes:
r: INDEX(type, actionDateTime)
rh: INDEX(reqId, status, byUser, delegatedUserFor)
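In DDL form, those suggestions might look like this (the index names are my own; delegatedUserFor is a TEXT column, so an ordinary index needs a prefix length, and 50 is just a guess):

ALTER TABLE Request ADD INDEX idx_type_adt (type, actionDateTime);
ALTER TABLE RequestHistoryLog ADD INDEX idx_req (reqId, status, byUser, delegatedUserFor(50));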
Plan B: (if type is often =custom and/or 'xyz' is rare)
FULLTEXT(byUser, delegatedUserFor)
and do
WHERE MATCH(byUser, delegatedUserFor) AGAINST ("+xyz" IN BOOLEAN MODE)
  AND (rh.byUser = 'a#xyz.com'
       OR rh.delegatedUserFor LIKE '%(a#xyz.com)%')
This should find the rows with domain xyz first via FULLTEXT (rapidly), then verify the other tests against those fewer rows. Other simplifications can be done too. Perhaps something like:
SELECT r.*
FROM ( SELECT DISTINCT rh.reqId
FROM RequestHistoryLog AS rh
WHERE MATCH ... AND ( ... OR ... )
AND rh.status IN (...)
) AS x
JOIN Request AS r ON r.id = x.reqId
WHERE r.type = 'custom'
ORDER BY r.actionDateTime desc
LIMIT 10;
(No other indexes needed.) The GROUP BY is replaced by DISTINCT, which is probably faster in this case. And the FULLTEXT index may be very fast.
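Creating that FULLTEXT index might look like this (InnoDB supports FULLTEXT as of MySQL 5.6; the index name is arbitrary):

ALTER TABLE RequestHistoryLog ADD FULLTEXT INDEX ft_users (byUser, delegatedUserFor);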
Note that FULLTEXT has a minimum word length (default 3), hence you need to avoid searching for "a" or any other string shorter than that. Also "com" may be so common as to be not worth searching for.
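To check the limits on your server (InnoDB and MyISAM have separate settings):

SHOW VARIABLES LIKE 'innodb_ft_min_token_size';  -- InnoDB, default 3
SHOW VARIABLES LIKE 'ft_min_word_len';           -- MyISAM, default 4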
Plan C
If there is some easy way to predict which one will be better, then have both queries and dynamically pick between them.
For example, when searching for ...#hp.com, note that "hp" is too short, making the fulltext approach unworkable.
You probably know which r.type values occur more than 20% of the time; for those, Plan B would be the better choice.
Plan D: if only one domain
If byUser and delegatedUserFor always have the same domain "xyz.com" or are blank, then add a domain column to rh and replace the messy test with AND rh.domain = 'xyz.com'. And still do something to get rid of the GROUP BY.
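A hypothetical sketch of adding and populating such a column, assuming the values always look like 'A(a#xyz.com)':

ALTER TABLE RequestHistoryLog ADD COLUMN domain VARCHAR(100);
UPDATE RequestHistoryLog
SET domain = SUBSTRING_INDEX(SUBSTRING_INDEX(byUser, '#', -1), ')', 1)  -- extracts 'xyz.com'
WHERE byUser LIKE '%#%';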
This query (along with a few others that I think have a related issue) did not take 30 seconds when MySQL was local on the same EC2 instance as the rest of the website; it took more like milliseconds.
Does anything look off?
SELECT *, chv_images.image_id
FROM chv_images
LEFT JOIN chv_storages ON chv_images.image_storage_id = chv_storages.storage_id
LEFT JOIN chv_users ON chv_images.image_user_id = chv_users.user_id
LEFT JOIN chv_albums ON chv_images.image_album_id = chv_albums.album_id
LEFT JOIN chv_categories ON chv_images.image_category_id = chv_categories.category_id
LEFT JOIN chv_meta ON chv_images.image_id = chv_meta.image_id
LEFT JOIN chv_likes ON chv_likes.like_content_type = "image"
    AND chv_likes.like_content_id = chv_images.image_id
    AND chv_likes.like_user_id = 1
LEFT JOIN chv_follows ON chv_follows.follow_followed_user_id = chv_images.image_user_id
LEFT JOIN chv_follows_projects ON chv_follows_projects.follows_project_project_id = chv_images.image_project_id
LEFT JOIN chv_projects ON chv_projects.project_id = follows_project_project_id
WHERE chv_follows.follow_user_id = '1'
   OR (follows_project_user_id = 1
       AND chv_projects.project_privacy = "public"
       AND chv_projects.project_is_public_upload = 1)
GROUP BY chv_images.image_id
ORDER BY chv_images.image_id DESC
LIMIT 0, 15
And this is what EXPLAIN shows:
[EXPLAIN screenshot not included]
Thank you
Update: This query has the same issue. It does not have a GROUP BY.
SELECT *, chv_images.image_id
FROM chv_images
LEFT JOIN chv_storages ON chv_images.image_storage_id = chv_storages.storage_id
LEFT JOIN chv_users ON chv_images.image_user_id = chv_users.user_id
LEFT JOIN chv_albums ON chv_images.image_album_id = chv_albums.album_id
LEFT JOIN chv_categories ON chv_images.image_category_id = chv_categories.category_id
LEFT JOIN chv_meta ON chv_images.image_id = chv_meta.image_id
LEFT JOIN chv_likes ON chv_likes.like_content_type = "image"
    AND chv_likes.like_content_id = chv_images.image_id
    AND chv_likes.like_user_id = 1
ORDER BY chv_images.image_id DESC
LIMIT 0, 15
That EXPLAIN shows several table-scans (type: ALL), so it's not surprising that it takes over 30 seconds.
Here's your EXPLAIN:
[EXPLAIN screenshot not included]
Notice the rows column shows an estimated 14420 rows read from the first table chv_images. It's doing a table-scan of all the rows.
In general, when you do a series of JOINs, you can multiply together all the values in the rows column of the EXPLAIN, and the final result is how many row-reads MySQL has to do. In this case it's 14420 * 2 * 1 * 1 * 2 * 1 * 916, or 52,834,880 row-reads. That should put into perspective the high cost of doing several table-scans in the same query.
You might help avoid those table-scans by creating some indexes on these tables:
ALTER TABLE chv_storages
ADD INDEX (storage_id);
ALTER TABLE chv_categories
ADD INDEX (category_id);
ALTER TABLE chv_likes
ADD INDEX (like_content_id, like_content_type, like_user_id);
Try creating those indexes and then run the EXPLAIN again.
The other tables are already doing lookups by primary key (type: eq_ref) or by secondary key (type: ref) so those are already optimized.
Your EXPLAIN shows your query uses a temporary table and filesort. You should reconsider whether you need the GROUP BY, because that's probably causing the extra work.
Another tip is to avoid using SELECT * because it might be forcing the query to read many extra columns that you don't need. Instead, explicitly name only the columns you need.
Are there any indexes on chv_images?
I propose:
CREATE INDEX idx_image_id ON chv_images (image_id);
(Bill's ideas are good. I'll take the discussion a different way...)
Explode-Implode -- If the LEFT JOINs match no more than 1 row, change, for example,
SELECT
...
LEFT JOIN chv_meta ON chv_images.image_id = chv_meta.image_id
into
SELECT ...,
( SELECT foo FROM chv_meta WHERE image_id = chv_images.image_id ) AS foo, ...
If that can be done for all the JOINs, you can get rid of GROUP BY. This will avoid the costly "explode-implode" where JOINs lead to more rows, then GROUP BY gets rid of the dups. (I suspect you can't move all the joins in.)
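For example, the chv_likes join only checks whether user 1 liked the image, so it could move into the SELECT list like this (a sketch; liked is a made-up alias):

SELECT chv_images.*,
       EXISTS ( SELECT 1 FROM chv_likes
                WHERE like_content_type = 'image'
                  AND like_content_id = chv_images.image_id
                  AND like_user_id = 1 ) AS liked
FROM chv_images
...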
OR -> UNION -- OR is hard to optimize. Your query looks like a good candidate for turning into UNION, then making more indexes that will become useful.
WHERE chv_follows.follow_user_id='1'
OR (follows_project_user_id = 1
AND chv_projects.project_privacy = "public"
AND chv_projects.project_is_public_upload = 1
)
Assuming that follows_project_user_id is in chv_images:
( SELECT ...
WHERE chv_follows.follow_user_id='1' )
UNION DISTINCT -- or ALL, if you are sure there won't be dups
( SELECT ...
WHERE follows_project_user_id = 1
AND chv_projects.project_privacy = "public"
AND chv_projects.project_is_public_upload = 1 )
Indexes needed:
chv_follows: (follow_user_id)
chv_projects: (project_privacy, project_is_public_upload) -- either order
But this has not yet handled the ORDER BY and LIMIT. The general pattern for such:
( SELECT ... ORDER BY ... LIMIT 15 )
UNION
( SELECT ... ORDER BY ... LIMIT 15 )
ORDER BY ... LIMIT 15
Yes, the ORDER BY and LIMIT are repeated.
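Putting the pieces together, the skeleton might look like the following (a sketch; I am guessing which tables the follows_project_* columns belong to):

( SELECT chv_images.*
  FROM chv_images
  JOIN chv_follows ON chv_follows.follow_followed_user_id = chv_images.image_user_id
  WHERE chv_follows.follow_user_id = 1
  ORDER BY chv_images.image_id DESC
  LIMIT 15 )
UNION DISTINCT
( SELECT chv_images.*
  FROM chv_images
  JOIN chv_follows_projects ON chv_follows_projects.follows_project_project_id = chv_images.image_project_id
  JOIN chv_projects ON chv_projects.project_id = chv_follows_projects.follows_project_project_id
  WHERE chv_follows_projects.follows_project_user_id = 1
    AND chv_projects.project_privacy = "public"
    AND chv_projects.project_is_public_upload = 1
  ORDER BY chv_images.image_id DESC
  LIMIT 15 )
ORDER BY image_id DESC
LIMIT 15;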
That works for page 1. If you want the next 15 rows, see http://mysql.rjweb.org/doc.php/pagination#pagination_and_union
After building those two sub-selects, look at them; I think you will be able to optimize each one, and may need new indexes because the Optimizer will start with a different 'first' table.
I wrote a query that was taking way too much time (32 minutes), so I tried other methods to find a faster one.
I finally wrote another one that takes under 5 seconds.
The problem is that I don't understand my optimization.
Can someone explain how it comes to be that much faster?
hugeTable has 494,500 rows
smallTable1 has 983 rows
smallTable2 has 983 rows
cursor.execute('''UPDATE hugeTable dst,
    (
        SELECT smallTable1.hugeTableId, smallTable2.valueForHugeTable
        FROM smallTable2
        INNER JOIN smallTable1 ON smallTable1.id = smallTable2.id
        -- This SELECT returns 983 rows
    ) src
SET dst.columnToUpdate = src.valueForHugeTable
WHERE dst.id2 = %s AND dst.id = src.hugeTableId;''', (inputId2,))
-- The condition dst.id2 = %s alone targets 983 rows.
-- The combination of dst.id2 = %s AND dst.id = src.hugeTableId targets a single unique row.
-- This query takes 32 minutes.
And here is a way to do the exact same request with more steps, but way faster:
-- First create a temporary table to hold the (983) rows from hugeTable that have to be updated
cursor.execute('''CREATE TEMPORARY TABLE tmpTable AS
SELECT * FROM hugeTable
WHERE id2 = %s;''', (inputId2,))
-- Update the rows in tmpTable instead of in hugeTable
cursor.execute('''UPDATE tmpTable dst,
    (
        SELECT smallTable1.hugeTableId, smallTable2.valueForHugeTable
        FROM smallTable2
        INNER JOIN smallTable1 ON smallTable1.id = smallTable2.id
        -- This SELECT returns 983 rows
    ) src
SET dst.columnToUpdate = src.valueForHugeTable
WHERE dst.id = src.hugeTableId;''')
-- Then delete the (983) rows we want to update
cursor.execute('DELETE FROM hugeTable WHERE id2 = %s;', (inputId2,))
-- And create new rows replacing the deleted ones with rows from tmpTable
cursor.execute('INSERT INTO hugeTable SELECT * FROM tmpTable;')
-- This takes a little under 5 seconds.
I would like to know why the first method takes so much time.
Understanding this will help me level up my MySQL skills.
Add a composite index to dst: INDEX(id2, id) (in either order).
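In DDL form (the index name is up to you):

ALTER TABLE hugeTable ADD INDEX idx_id2_id (id2, id);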
More
Case 1:
UPDATE hugeTable dst,
( SELECT smallTable1.hugeTableId, smallTable2.valueForHugeTable
FROM smallTable2
INNER JOIN smallTable1 ON smallTable1.id = smallTable2.id
)src SET dst.columnToUpdate = src.valueForHugeTable
WHERE dst.id2 = 1234
AND dst.id = src.hugeTableId;
Case 2:
CREATE TEMPORARY TABLE tmpTable AS
SELECT *
from hugeTable
WHERE id2 = 1234;
UPDATE tmpTable dst,
( SELECT smallTable1.hugeTableId, smallTable2.valueForHugeTable
FROM smallTable2
INNER JOIN smallTable1 ON smallTable1.id = smallTable2.id
)src SET dst.columnToUpdate = src.valueForHugeTable
WHERE dst.id = src.hugeTableId;
Without knowing the MySQL version and seeing the EXPLAINs, I can only guess at why they are so different...
The subquery ( SELECT ... JOIN ... ) may or may not be 'materialized' into an implicit temp table. (Newer versions are better at doing such.)
Such a materialized subquery may or may not have an index created for it. (Again, new versions are better.)
If there are no adequate indexes on either dst or src, then the amount of 'effort' is the product of the sizes of the two tables. Note that in Case 2, dst is much smaller. (This may be the answer you are looking for.)
If the tables are not fully cached in RAM, one approach could involve more I/O than the other. An I/O-bound query is often 10 times as slow as the same query when it is fully cached in RAM. (This is less likely to be the answer, but may be part of it.)
Having a 3-table UPDATE would probably eliminate some of the issues above. And it may (or may not) eliminate the timing difference.
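A sketch of that 3-table UPDATE, using the column names from the question:

UPDATE hugeTable dst
JOIN smallTable1 ON smallTable1.hugeTableId = dst.id
JOIN smallTable2 ON smallTable2.id = smallTable1.id
SET dst.columnToUpdate = smallTable2.valueForHugeTable
WHERE dst.id2 = 1234;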
For further discussion, please provide
MySQL version
SHOW CREATE TABLE -- for each table
How big is innodb_buffer_pool_size
SHOW TABLE STATUS -- for each table
EXPLAIN UPDATE ... -- for each UPDATE -- requires at least 5.6
What percentage of the table has ( id2 = inputId2 )?
I tried to come up with a query that updates records in a MySQL table using other records in the same table, but I had mixed results between local testing and production. I don't know much about subqueries, so I want to bring this question here. In local development with MySQL InnoDB 5.6.23, the query on a dataset of about 180k records takes 25 to 30 seconds. On a staging server with MySQL InnoDB 5.5.32 and a dataset of 254k records, the query seems to stall for hours until it's stopped, taking 100% of a CPU core.
This is the query I came up with:
UPDATE
`product_lang` AS `pl1`
SET
pl1.`name` = (
SELECT pl2.`name` FROM (SELECT `name`, `id_product`, `id_lang` FROM `product_lang`) AS `pl2`
WHERE pl1.`id_product` = pl2.`id_product`
AND pl2.`id_lang` = 1
)
WHERE
pl1.`id_lang` != 1
The objective is to replace the value of name in product records where id_lang is not 1 (the default language, for the sake of explaining) with the value of name from the corresponding record that has the default id_lang of 1.
I know that subqueries are inefficient, but I really don't know how to solve this problem, and it would be a great plus to leave this in SQL-land instead of using the app layer to do the heavy lifting.
If you write the query like this:
UPDATE product_lang pl1
SET pl1.name = (SELECT pl2.`name`
FROM (SELECT `name`, `id_product`, `id_lang`
FROM `product_lang`
) `pl2`
WHERE pl1.`id_product` = pl2.`id_product` AND pl2.`id_lang` = 1
)
WHERE pl1.`id_lang` <> 1
Then you have a problem. The only index that can help is on product_lang(id_lang).
I would recommend writing this as a join:
UPDATE product_lang pl1 join
(select id_product, pl.name
from product_lang
where id_lang = 1
) pl2
on pl1.id_lang <> 1 and pl2.id_product = pl1.id_product
SET pl1.name = pl2.name
WHERE pl1.id_lang <> 1
The indexes that you want for this query are product_lang(id_lang, id_product) and product_lang(id_product). However, this seems like a strange update, because it will set all the names to the name from language 1.
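In DDL form (index names are my own):

ALTER TABLE product_lang ADD INDEX idx_lang_product (id_lang, id_product);
ALTER TABLE product_lang ADD INDEX idx_product (id_product);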
UPDATE product_lang AS pl1
JOIN product_lang AS pl2 ON pl1.`id_product` = pl2.`id_product`
SET pl1.name = pl2.name
WHERE pl2.`id_lang` = 1
  AND pl1.`id_lang` != 1;
And have INDEX(id_lang, id_product).
Make sure that there is an index covering the columns id_product and id_lang.
UPDATE product_lang pl1
JOIN product_lang pl2
  ON pl1.id_product = pl2.id_product AND pl2.id_lang = 1
SET pl1.name = pl2.name
WHERE pl1.id_lang <> 1;
The composite index that will be required is on (id_product, id_lang).
I have this SQL request:
SELECT DISTINCT id_tr
FROM planning_requests a
WHERE EXISTS(
SELECT 1 FROM planning_requests b
WHERE a.id_tr = b.id_tr
AND trainer IS NOT NULL
AND trainer != 'FREE'
)
AND EXISTS(
SELECT 1 FROM planning_requests c
WHERE a.id_tr = c.id_tr
AND trainer IS NULL
)
but this request takes 168.9490 sec to execute, returning 23,162 rows out of 2,545,088 rows.
Should I use LEFT JOIN or NOT IN? And how can I rewrite it? Thanks.
You can speed this up by adding indexes. I would suggest: planning_requests(id_tr, trainer).
You can do this as:
create index planning_requests_id_trainer on planning_requests(id_tr, trainer);
Also, I think you are missing an = in the first subquery.
EDIT:
If you have a lot of duplicate values of id_tr, then in addition to the above indexes, it might make sense to phrase the query as:
select id_tr
from (select distinct id_tr
from planning_requests
) a
where . . .
The where conditions are being run on every row of the original table. The distinct is processed after the where.
I think your query can be simplified to this:
SELECT DISTINCT a.id_tr
FROM planning_requests a
JOIN planning_requests b
ON b.id_tr = a.id_tr
AND b.trainer IS NULL
WHERE a.trainer < 'FREE'
If you index planning_requests(trainer), then MySQL can utilize an index range to get all the rows that aren't FREE or NULL. All numeric strings will meet the < 'FREE' criteria, and it also won't return NULL values.
Then, use JOIN to make sure each record from that much smaller result set has a matching NULL record.
For the JOIN, index planning_requests(id_tr, trainer).
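In DDL form, those two suggestions would be (index names are my own):

CREATE INDEX planning_requests_trainer ON planning_requests (trainer);
CREATE INDEX planning_requests_id_trainer ON planning_requests (id_tr, trainer);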
It might be simpler if you don't mix types in a column like FREE and 1.