I have two tables RequestHistoryLog and Request.
The RequestHistoryLog table has these columns, with 1.2 million rows:
id(bigint), status(VARCHAR), byUser(VARCHAR),
delegatedUserFor(text), reqId(bigint)
reqId has a foreign key constraint: FOREIGN KEY (`reqId`) REFERENCES `Request` (`id`).
The Request table has many columns, with 0.4 million rows:
id(bigint), title(VARCHAR), actionDateTime(Datetime), type(VARCHAR) etc.
In RequestHistoryLog there are multiple entries per request, one for each status change; one Request has many history-log rows.
Here delegatedUserFor (a TEXT column) holds multiple names with emails.
Example is: 'X(x#xyz.com)A(a#xyz.com)Y(y#xyz.com)'
With the query below, I am trying to get the requests on which A (a#xyz.com) has set one of the statuses "Approved", "Done", "Completed", "Queried", or "Rejected", or on which some other user has set a status on behalf of A; in that case the entry goes into the delegatedUserFor column instead of byUser.
SELECT *
FROM
(SELECT r.*
FROM Request AS r
JOIN RequestHistoryLog AS rh ON r.id = rh.reqId
where rh.status IN ("Approved", "Done", "Completed",
"Queried", "Rejected")
and (rh.byUser='a#xyz.com'
or rh.delegatedUserFor like '%(a#xyz.com)%')
and r.type='custom'
) AS a
GROUP BY id
ORDER BY actionDateTime desc limit 10;
Sample data for both tables:
RequestHistoryLog Table
id status byUser delegatedUserFor reqId
2 "Approved" 'A(a#xyz.com)' '' 15
3 "Rejected" 'G(g#xyz.com)' '' 15
4 "Approved" 'X(x#xyz.com)' 'A(a#xyz.com)Y(y#xyz.com)' 15
5 "Approved" 'X(x#xyz.com)' 'G(g#xyz.com)A(a#xyz.com)Y(y#xyz.com)' 16
6 "Rejected" 'B(b#xyz.com)' '' 16
7 "Completed"'Y(y#xyz.com)' '' 16
Request Table
id title actionDateTime
15 "Request1" '2021-11-23 01:23:20' ..........
16 "Request2" '2021-11-23 11:23:20' ..........
Now I am getting the requests on which A has set a status, or on which another user has set one on behalf of A.
The above query is taking a long time.
How can I optimize it to get results faster?
Plan A: (Probably better if not many rows are type=custom)
Do a "semi-join":
SELECT r.*
FROM Request AS r
WHERE r.type = 'custom'
  AND EXISTS ( SELECT 1 FROM RequestHistoryLog AS rh
                WHERE r.id = rh.reqId
                  AND rh.status IN ("Approved", "Done", "Completed",
                                    "Queried", "Rejected")
                  AND ( rh.byUser = 'a#xyz.com'
                     OR rh.delegatedUserFor LIKE '%(a#xyz.com)%' ) )
ORDER BY r.actionDateTime desc
LIMIT 10;
Note that the GROUP BY and nested SELECT are avoided. Have these indexes:
r: INDEX(type, actionDateTime)
rh: INDEX(reqId, status, byUser, delegatedUserFor)
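Spelled out as DDL (index names are mine; note that delegatedUserFor is TEXT, so MySQL requires a prefix length here, and 100 is an arbitrary choice):
ALTER TABLE Request           ADD INDEX idx_type_adt (type, actionDateTime);
ALTER TABLE RequestHistoryLog ADD INDEX idx_req_status (reqId, status, byUser, delegatedUserFor(100));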
Plan B: (if type is often =custom and/or 'xyz' is rare)
FULLTEXT(byUser, delegatedUserFor)
and do
WHERE MATCH(byUser, delegatedUserFor) AGAINST ("+xyz" IN BOOLEAN MODE)
AND (rh.byUser = 'a#xyz.com'
     OR rh.delegatedUserFor LIKE '%(a#xyz.com)%')
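The FULLTEXT index itself would be built along these lines (the index name ft_users is illustrative):
ALTER TABLE RequestHistoryLog ADD FULLTEXT ft_users (byUser, delegatedUserFor);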
This should find the rows with domain xyz first by FULLTEXT (rapidly), then verify against the other tests (against fewer rows). Other simplifications can be done too. Perhaps something like
SELECT r.*
FROM ( SELECT DISTINCT rh.reqId
FROM RequestHistoryLog AS rh
WHERE MATCH ... AND ( ... OR ... )
AND rh.status IN (...)
) AS x
JOIN Request AS r ON r.id = x.reqId
WHERE r.type = 'custom'
ORDER BY r.actionDateTime desc
LIMIT 10;
(No other indexes needed.) The GROUP BY is replaced by DISTINCT, which is probably faster in this case. And the FULLTEXT index may be very fast.
Note that FULLTEXT has a minimum word length (default 3), hence you need to avoid searching for "a" or any other string shorter than that. Also "com" may be so common as to be not worth searching for.
Plan C
If there is some easy way to predict which one will be better, then have both queries and dynamically pick between them.
For example, when searching for ...#hp.com, note that "hp" is too short, making the fulltext approach unworkable.
You probably know which r.type values occur more than 20% of the time; for those, Plan B is the better choice.
Plan D: if only one domain
If byUser and delegatedUserFor either contain the same domain ("xyz.com") or are blank, then add a domain column to rh and replace the messy test with AND rh.domain = 'xyz.com'. And still do something to get rid of the GROUP BY.
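A minimal sketch of Plan D, assuming each row involves a single domain (the column name, sizes, and the SUBSTRING_INDEX extraction are my assumptions; the sample data uses '#' where a real address would have '@'):
ALTER TABLE RequestHistoryLog ADD COLUMN domain VARCHAR(100);
-- Backfill: peel out the text between '#' and ')' in byUser.
UPDATE RequestHistoryLog
   SET domain = SUBSTRING_INDEX(SUBSTRING_INDEX(byUser, '#', -1), ')', 1)
 WHERE byUser <> '';
ALTER TABLE RequestHistoryLog ADD INDEX idx_domain (domain);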
I want to improve my current query. I have this table called income, with a sourceId varchar field. I have a single SELECT for the fields I need, but I needed to add an extra field called isFirstIncome to represent whether the row is the first time that sourceId was used. This is my current query:
SELECT DISTINCT
`income`.*,
CASE WHEN (
SELECT
`income2`.id
FROM
`income` as `income2`
WHERE
`income2`.`sourceId` = `income`.`sourceId`
ORDER BY
`income2`.created asc
LIMIT 1
) = `income`.id THEN true ELSE false END
as isFirstIncome
FROM
`income` as `income`
WHERE `income`.incomeType IN ('passive', 'active') AND `income`.status = 'paid'
ORDER BY `income`.created desc
LIMIT 50
The query works but slows down if I keep increasing the LIMIT or OFFSET. Any suggestions?
UPDATE 1:
Added WHERE statements used on the original query
UPDATE 2:
MySQL version 5.7.22
You can achieve this using a window (analytic) function: ROW_NUMBER or RANK will get the desired result. Note that window functions require MySQL 8.0, so this will not run on the 5.7.22 mentioned above. The below query gives the desired output:
SELECT *,
CASE
WHEN Row_number()
OVER(
PARTITION BY sourceid
ORDER BY created ASC) = 1 THEN true
ELSE false
END AS isFirstIncome
FROM income
WHERE incomeType IN ('passive', 'active') AND status = 'paid'
ORDER BY created desc
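Given that version caveat, here is a 5.7-compatible sketch that joins against a per-sourceId minimum instead; it assumes created is unique within a sourceId:
SELECT i.*,
       (i.created = f.first_created) AS isFirstIncome
FROM income AS i
JOIN ( SELECT sourceId, MIN(created) AS first_created
       FROM income
       GROUP BY sourceId ) AS f USING (sourceId)
WHERE i.incomeType IN ('passive', 'active') AND i.status = 'paid'
ORDER BY i.created DESC
LIMIT 50;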
My first thought is that isFirstIncome should be an extra column in the table. It should be populated as the data is inserted.
If you don't like that, let's try to optimize the query...
Let's avoid doing the subquery more than 50 times. This requires turning the query inside-out. (It's like "explode-implode", where the query gathers lots of stuff, then sorts it and throws most of the rows away.)
To summarize:
do the least amount of effort to just identify the 50 rows.
JOIN to whatever tables are needed (including itself if appropriate); this is to get any other columns desired (including isFirstIncome).
SELECT i3.*,
( ... using i3 ... ) as isFirstIncome
FROM (
SELECT i1.id, i1.sourceId, i1.created
FROM `income` AS i1
WHERE i1.incomeType IN ('passive', 'active')
AND i1.status = 'paid'
ORDER BY i1.created DESC
LIMIT 50
) AS i2
JOIN income AS i3 USING(id)
ORDER BY i2.created DESC -- yes, repeated
(I left out the computation of isFirstIncome; it is discussed in other Answers. But note that it will be executed at most 50 times.)
(The aliases -- i1, i2, i3 -- are numbered in the order they will be "used"; this is to assist in following the SQL.)
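For concreteness, one hedged way to fill in the elided expression is to reuse the asker's correlated subquery against i3 (the i4 alias continues the numbering convention):
SELECT i3.*,
       ( SELECT i4.id
         FROM income AS i4
         WHERE i4.sourceId = i3.sourceId
         ORDER BY i4.created ASC
         LIMIT 1 ) = i3.id AS isFirstIncome
FROM ( SELECT i1.id, i1.sourceId, i1.created
       FROM income AS i1
       WHERE i1.incomeType IN ('passive', 'active')
         AND i1.status = 'paid'
       ORDER BY i1.created DESC
       LIMIT 50
     ) AS i2
JOIN income AS i3 USING(id)
ORDER BY i2.created DESC;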
To assist in performance, add
INDEX(status, incomeType, created, id, sourceId)
It should help with my formulation, but probably not for the other versions. Your version would benefit from
INDEX(sourceId, created, id)
I've studied and tried days worth of SQL queries to find "something" that will work. I have a table, apj32_facileforms_subrecords, that uses 7 columns. All the data I want to display is in 1 column - "value". The "record" displays the number of the entry. The "title" is what I would like to appear in the header row, but that's not as important as "value" to display in 1 row based upon "record" number.
I've tried a lot of CONCAT and various Pivot queries, but nothing seems to do more than "get close" to what I'd like as the end result.
Here's a screenshot of the table (not reproduced here):
The output "should" be linear, so that 1 row contains 9 columns:
Project; Zipcode; First Name; Last Name; Address; City; Phone; E-mail; Trade (in that order). And the values in the 9 columns come from "value" as they relate to the "record" number.
I know there are LOT of examples that are similar, but nothing I've found covers taking all the values from "value" and CONCAT to 1 row.
This works to get all the data I want - SELECT record,value FROM apj32_facileforms_subrecords WHERE (record IN (record,value)) ORDER BY record
But the values are still in multiple rows. I can play with that query to get just the values, but I'm still at a loss to get them into 1 row. I'll keep playing with that query to see if I can figure it out before one of the experts here shows me how simple it is to do that.
Any help would be appreciated.
Using SQL to flatten an EAV model representation into a relational representation can be somewhat convoluted, and not very efficient.
Two commonly used approaches are conditional aggregation and correlated subqueries in the SELECT list. Both approaches call out for careful indexing for suitable performance with large sets.
correlated subqueries example
Here's an example of the correlated subquery approach, to get one value of the "zipcode" attribute for some records
SELECT r.id
, ( SELECT v1.value
FROM `apj32_facileforms_subrecords` v1
WHERE v1.record = r.id
AND v1.name = 'zipcode'
ORDER BY v1.value LIMIT 0,1
) AS `Zipcode`
FROM ( SELECT 1 AS id ) r
Extending that, we repeat the correlated subquery, changing the attribute identifier ('firstname' in place of 'zipcode'). It looks like we could also reference the attribute by element, e.g. v2.element = 2:
SELECT r.id
, ( SELECT v1.value
FROM `apj32_facileforms_subrecords` v1
WHERE v1.record = r.id
AND v1.name = 'zipcode'
ORDER BY v1.value LIMIT 0,1
) AS `Zipcode`
, ( SELECT v2.value
FROM `apj32_facileforms_subrecords` v2
WHERE v2.record = r.id
AND v2.name = 'firstname'
ORDER BY v2.value LIMIT 0,1
) AS `First Name`
, ( SELECT v3.value
FROM `apj32_facileforms_subrecords` v3
WHERE v3.record = r.id
AND v3.name = 'lastname'
ORDER BY v3.value LIMIT 0,1
) AS `Last Name`
FROM ( SELECT 1 AS id UNION ALL SELECT 2 ) r
returns something like
id Zipcode First Name Last Name
-- ------- ---------- ---------
1 98228 David Bacon
2 98228 David Bacon
conditional aggregation approach example
We can use GROUP BY to collapse multiple rows into one row per entity, and use conditional tests in expressions to "pick out" attribute values with aggregate functions.
SELECT r.id
, MIN(IF(v.name = 'zipcode' ,v.value,NULL)) AS `Zip Code`
, MIN(IF(v.name = 'firstname' ,v.value,NULL)) AS `First Name`
, MIN(IF(v.name = 'lastname' ,v.value,NULL)) AS `Last Name`
FROM ( SELECT 1 AS id UNION ALL SELECT 2 ) r
LEFT
JOIN `apj32_facileforms_subrecords` v
ON v.record = r.id
GROUP
BY r.id
For more portable syntax, we can replace the MySQL IF() function with the more ANSI-standard CASE expression, e.g.
, MIN(CASE v.name WHEN 'zipcode' THEN v.value END) AS `Zip Code`
Note that MySQL does not support SQL Server PIVOT syntax, or Oracle MODEL syntax, or Postgres CROSSTAB or FILTER syntax.
To extend either of these approaches to be dynamic (returning a resultset with a variable number of columns and a variety of column names) is not possible in the context of a single SQL statement. But we could separately execute SQL statements to retrieve information that would let us dynamically construct a SQL statement of the form shown above, with an explicit set of columns to be returned.
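A sketch of that two-step approach in MySQL, building the column list with GROUP_CONCAT and running it via a prepared statement (a real version must watch group_concat_max_len and sanitize the attribute names used as identifiers):
SELECT GROUP_CONCAT(DISTINCT
         CONCAT('MIN(CASE v.name WHEN ''', name, ''' THEN v.value END) AS `', name, '`'))
  INTO @cols
  FROM apj32_facileforms_subrecords;

SET @sql = CONCAT('SELECT v.record, ', @cols,
                  ' FROM apj32_facileforms_subrecords v GROUP BY v.record');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;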
The approaches outlined above return a more traditional relational model (individual columns, each with a value).
non-relational munge of attributes and values into a single string
If we have some special delimiters, we could munge together a representation of the data using the GROUP_CONCAT function.
As a rudimentary example:
SELECT r.id
, GROUP_CONCAT(v.title,'=',v.value ORDER BY v.name) AS vals
FROM ( SELECT 1 AS id ) r
LEFT
JOIN `apj32_facileforms_subrecords` v
ON v.record = r.id
AND v.name in ('zipcode','firstname','lastname')
GROUP
BY r.id
This returns two columns, something like:
id vals
-- ---------------------------------------------------
1 First Name=David,Last Name=Bacon,Zip Code=98228
We need to be aware that the return from GROUP_CONCAT is limited to group_concat_max_len bytes. And here we have just squeezed the balloon, moving the problem to some later processing that must parse the resulting string. If any equal signs or commas appear in the values, they will make a mess of parsing the result string, so we would have to properly escape any delimiters that appear in the data; the GROUP_CONCAT expression is going to get more involved.
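For example, escaping backslashes and commas in the values before concatenating might look like this fragment, dropped into the query above in place of the original GROUP_CONCAT (in MySQL string literals, '\\' is a single backslash):
GROUP_CONCAT(v.title, '=',
             REPLACE(REPLACE(v.value, '\\', '\\\\'), ',', '\\,')
             ORDER BY v.name) AS vals
The '=' between title and value would need the same treatment if it can occur in the data.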
I have this query:
EXPLAIN UPDATE requests R
    JOIN profile AS P ON R.intern_id = P.intern_id
        OR R.intern_id_decoded = P.intern_id
        OR R.intern_id_full_decode = P.intern_id
    SET R.found_id = P.id
    WHERE R.id >= 28000001 AND R.id <= 28000001 + 2000000
      AND R.found_id IS NULL
id  select_type  table  partitions  type   possible_keys                                        key      key_len  ref   rows       filtered  Extra
1   UPDATE       R      NULL        range  PRIMARY,intern_id_customer_id_batch_num,id_found_id  PRIMARY  4        NULL  3616888    10.00     Using where
1   SIMPLE       P      NULL        ALL    intern_id_dt_snapshot,intern_id                      NULL     NULL     NULL  179586254  27.10     Range checked for each record (index map: 0x6)
That query takes about 40 seconds to execute, it's updating 5000-10000 rows from the set of 2 million rows.
I am currently updating in 2 million row "jobs" to make the join perform faster.
The whole table is 170 million records currently.
The EXPLAIN shows the second table (P) not using an index; I am not sure if that's normal.
The intern_id fields are VARCHARs; found_id and id are INTs.
Does the EXPLAIN output look like it's performing well?
I would do this logic using multiple joins:
UPDATE requests r LEFT JOIN
profile p1
ON r.intern_id = p1.intern_id LEFT JOIN
profile p2
ON r.intern_id_decoded = p2.intern_id AND p1.id IS NULL LEFT JOIN
profile p3
ON r.intern_id_full_decode = p3.intern_id AND p2.id IS NULL
SET r.found_id = COALESCE(p1.id, p2.id, p3.id)
WHERE R.id >= 28000001 AND R.id <= 28000001 + 2000000 AND
R.found_id is NULL;
Databases are very bad at optimizing OR in JOIN conditions. It might be better with explicit JOINs.
The ON conditions also ensure only the first match.
I would do 3 chunked-up UPDATEs -- one for each of the ON conditions.
10K rows to update is excessive; crank it down to perhaps 1K. That means cranking the chunking down to 200K. (The speed might even be faster.)
UPDATE ... ON P.intern_id = R.intern_id SET ... WHERE ...
UPDATE ... ON P.intern_id = R.intern_id_decoded SET ... WHERE ...
UPDATE ... ON P.intern_id = R.intern_id_full_decode SET ... WHERE ...
(The range is the same for each set of 3, thereby helping with caching of R.)
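Filled in against this schema, the first of the three might read like this sketch (200K chunk per the suggestion above):
UPDATE requests AS R
JOIN profile AS P ON P.intern_id = R.intern_id
SET R.found_id = P.id
WHERE R.id >= 28000001 AND R.id < 28000001 + 200000
  AND R.found_id IS NULL;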
Possibly INDEX(found_id) would help, but this is not a given.
See here for more chunking suggestions, especially the tip on finding 1000 rows before starting the operation:
SELECT id FROM requests WHERE id > ... AND found_id IS NULL LIMIT 1000,1;
Then using that as the limit instead of the 2-millionth. A goal here is to even out the number of rows updated.
This query (along with a few others that I think have a related issue) did not take 30 seconds when MySQL was local on the same EC2 instance as the rest of the website. More like milliseconds.
Does anything look off?
SELECT *, chv_images.image_id
FROM chv_images
LEFT JOIN chv_storages ON chv_images.image_storage_id = chv_storages.storage_id
LEFT JOIN chv_users ON chv_images.image_user_id = chv_users.user_id
LEFT JOIN chv_albums ON chv_images.image_album_id = chv_albums.album_id
LEFT JOIN chv_categories ON chv_images.image_category_id = chv_categories.category_id
LEFT JOIN chv_meta ON chv_images.image_id = chv_meta.image_id
LEFT JOIN chv_likes ON chv_likes.like_content_type = "image"
    AND chv_likes.like_content_id = chv_images.image_id
    AND chv_likes.like_user_id = 1
LEFT JOIN chv_follows ON chv_follows.follow_followed_user_id = chv_images.image_user_id
LEFT JOIN chv_follows_projects ON chv_follows_projects.follows_project_project_id = chv_images.image_project_id
LEFT JOIN chv_projects ON chv_projects.project_id = follows_project_project_id
WHERE chv_follows.follow_user_id = '1'
   OR (follows_project_user_id = 1
       AND chv_projects.project_privacy = "public"
       AND chv_projects.project_is_public_upload = 1)
GROUP BY chv_images.image_id
ORDER BY chv_images.image_id DESC
LIMIT 0, 15
And this is what EXPLAIN shows (attached as a screenshot, not reproduced here):
Thank you
Update: This query has the same issue. It does not have a GROUP BY.
SELECT *, chv_images.image_id
FROM chv_images
LEFT JOIN chv_storages ON chv_images.image_storage_id = chv_storages.storage_id
LEFT JOIN chv_users ON chv_images.image_user_id = chv_users.user_id
LEFT JOIN chv_albums ON chv_images.image_album_id = chv_albums.album_id
LEFT JOIN chv_categories ON chv_images.image_category_id = chv_categories.category_id
LEFT JOIN chv_meta ON chv_images.image_id = chv_meta.image_id
LEFT JOIN chv_likes ON chv_likes.like_content_type = "image"
    AND chv_likes.like_content_id = chv_images.image_id
    AND chv_likes.like_user_id = 1
ORDER BY chv_images.image_id DESC
LIMIT 0, 15
That EXPLAIN shows several table-scans (type: ALL), so it's not surprising that it takes over 30 seconds.
Here's your EXPLAIN (attached as a screenshot, not reproduced here).
Notice the rows column shows an estimated 14420 rows read from the first table, chv_images. It's doing a table-scan of all the rows.
In general, when you do a series of JOINs, you can multiply together all the values in the rows column of the EXPLAIN, and the final result is how many row-reads MySQL has to do. In this case it's 14420 * 2 * 1 * 1 * 2 * 1 * 916, or 52,834,880 row-reads. That should put into perspective the high cost of doing several table-scans in the same query.
You might help avoid those table-scans by creating some indexes on these tables:
ALTER TABLE chv_storages
ADD INDEX (storage_id);
ALTER TABLE chv_categories
ADD INDEX (category_id);
ALTER TABLE chv_likes
ADD INDEX (like_content_id, like_content_type, like_user_id);
Try creating those indexes and then run the EXPLAIN again.
The other tables are already doing lookups by primary key (type: eq_ref) or by secondary key (type: ref) so those are already optimized.
Your EXPLAIN shows your query uses a temporary table and filesort. You should reconsider whether you need the GROUP BY, because that's probably causing the extra work.
Another tip is to avoid using SELECT * because it might be forcing the query to read many extra columns that you don't need. Instead, explicitly name only the columns you need.
Are there any indexes on chv_images?
I propose:
CREATE INDEX idx_image_id ON chv_images (image_id);
(Bill's ideas are good. I'll take the discussion a different way...)
Explode-Implode -- If the LEFT JOINs match no more than 1 row, change, for example,
SELECT
...
LEFT JOIN chv_meta ON chv_images.image_id = chv_meta.image_id
into
SELECT ...,
( SELECT foo FROM chv_meta WHERE image_id = chv_images.image_id ) AS foo, ...
If that can be done for all the JOINs, you can get rid of GROUP BY. This will avoid the costly "explode-implode" where JOINs lead to more rows, then GROUP BY gets rid of the dups. (I suspect you can't move all the joins in.)
OR -> UNION -- OR is hard to optimize. Your query looks like a good candidate for turning into UNION, then making more indexes that will become useful.
WHERE chv_follows.follow_user_id='1'
OR (follows_project_user_id = 1
AND chv_projects.project_privacy = "public"
AND chv_projects.project_is_public_upload = 1
)
Assuming that follows_project_user_id is in chv_images:
( SELECT ...
WHERE chv_follows.follow_user_id='1' )
UNION DISTINCT -- or ALL, if you are sure there won't be dups
( SELECT ...
WHERE follows_project_user_id = 1
AND chv_projects.project_privacy = "public"
AND chv_projects.project_is_public_upload = 1 )
Indexes needed:
chv_follows: (follow_user_id)
chv_projects: (project_privacy, project_is_public_upload) -- either order
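In DDL form, those two suggestions would be:
ALTER TABLE chv_follows  ADD INDEX (follow_user_id);
ALTER TABLE chv_projects ADD INDEX (project_privacy, project_is_public_upload);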
But this has not yet handled the ORDER BY and LIMIT. The general pattern for such:
( SELECT ... ORDER BY ... LIMIT 15 )
UNION
( SELECT ... ORDER BY ... LIMIT 15 )
ORDER BY ... LIMIT 15
Yes, the ORDER BY and LIMIT are repeated.
That works for page 1. If you want the next 15 rows, see http://mysql.rjweb.org/doc.php/pagination#pagination_and_union
After building those two sub-selects, look at them; I think you will be able to optimize each one, and may need new indexes because the Optimizer will start with a different 'first' table.
The query below gives me 2 out of the 3 answers I'm looking for. On the sub-query SELECT I get NULL instead of "no".
The 3 possible values for the column isCyl are blank, "yes", or "no".
I'm not sure if the sub-query is the best way to go about it, but I don't know how else to re-state the query.
The schedule table has a series of columns to show what tasks must be completed on an assignment. Related tables store the results of the tasks if they were assigned to be completed. So I need to test if a specific task was scheduled. If so, then I need to see if the results of the task have been recorded in the related table. For brevity I am only showing one of the columns here.
SELECT s.`reckey`,
if(s.cylinders="T",
(select
if(c.areckey is not null,
"yes",
"no"
)
from cylinders c where c.areckey = s.reckey limit 1
)
,""
) as isCyl
from schedule s
where s.assignmentDate between 20161015 and 20161016
order by s.reckey
Use a LEFT JOIN, which returns NULL for columns in the child table when there's no match.
SELECT s.reckey, IF(s.cylinders = "T",
IF(c.areckey IS NOT NULL, 'yes', 'no'),
"") AS isCyl
FROM schedule AS s
LEFT JOIN cylinders AS c ON c.areckey = s.reckey
WHERE s.assignmentDate between 20161015 and 20161016
ORDER BY s.reckey
If there can be multiple rows in cylinders with the same areckey, change it to:
LEFT JOIN (select distinct areckey FROM cylinders) AS c on c.areckey = s.reckey
or use SELECT DISTINCT in the main query.
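An alternative sketch that sidesteps duplicate rows entirely is an EXISTS test inside the IF(), equivalent in intent to the LEFT JOIN version:
SELECT s.reckey,
       IF(s.cylinders = 'T',
          IF(EXISTS (SELECT 1 FROM cylinders c WHERE c.areckey = s.reckey),
             'yes', 'no'),
          '') AS isCyl
FROM schedule AS s
WHERE s.assignmentDate BETWEEN 20161015 AND 20161016
ORDER BY s.reckey;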