FULL table scan in LEFT JOIN using OR - mysql

How can this query be optimized to avoid the full table scan described below?
I've got a slow query that's taking approximately 15 seconds to return.
Let's get this part out of the way - I've confirmed all indexes are there.
When I run EXPLAIN, it shows that there is a FULL TABLE scan ran on the crosswalk table (the index for fromQuestionCategoryJoinID is not used, even if I attempt to force) - if I remove either of the fields and the OR, the index is used and query completes in milliseconds.
SELECT c.id, c.name, GROUP_CONCAT(DISTINCT tags.externalDisplayID SEPARATOR ', ') AS tags
FROM checklist c
LEFT JOIN questionchecklistjoin qcheckj on qcheckj.checklistID = c.id
LEFT JOIN questioncategoryjoin qcatj ON qcatj.questionID = qcheckj.questionID
LEFT JOIN questioncategoryjoin qcatjsub on qcatjsub.parentQuestionID = qcatj.questionID
LEFT JOIN crosswalk cw on (cw.fromQuestionCategoryJoinID = qcatj.id OR cw.fromQuestionCategoryJoinID = qcatjsub.id)
-- index used if I remove OR, eg.: LEFT JOIN crosswalk cw on (cw.fromQuestionCategoryJoinID = qcatj.id)
LEFT JOIN questioncategoryjoin qcj1 on qcj1.id = cw.toQuestionCategoryJoinID
LEFT JOIN question tags on tags.id = qcj1.questionID
GROUP BY c.id
ORDER BY c.name, tags.externalDisplayID;

Split the query into two queries for each part of the OR. Then combine them with UNION.
SELECT id, name, GROUP_CONCAT(DISTINCT externalDisplayID SEPARATOR ', ') AS tags
FROM (
SELECT c.id, c.name, tags.externalDisplayID
FROM checklist c
LEFT JOIN questionchecklistjoin qcheckj on qcheckj.checklistID = c.id
LEFT JOIN questioncategoryjoin qcatj ON qcatj.questionID = qcheckj.questionID
LEFT JOIN crosswalk cw on cw.fromQuestionCategoryJoinID = qcatj.id
LEFT JOIN questioncategoryjoin qcj1 on qcj1.id = cw.toQuestionCategoryJoinID
LEFT JOIN question tags on tags.id = qcj1.questionID
UNION ALL
SELECT c.id, c.name, tags.externalDisplayID
FROM checklist c
LEFT JOIN questionchecklistjoin qcheckj on qcheckj.checklistID = c.id
LEFT JOIN questioncategoryjoin qcatj ON qcatj.questionID = qcheckj.questionID
LEFT JOIN questioncategoryjoin qcatjsub on qcatjsub.parentQuestionID = qcatj.questionID
LEFT JOIN crosswalk cw on cw.fromQuestionCategoryJoinID = qcatjsub.id
LEFT JOIN questioncategoryjoin qcj1 on qcj1.id = cw.toQuestionCategoryJoinID
LEFT JOIN question tags on tags.id = qcj1.questionID
) AS x
GROUP BY x.id
ORDER BY x.name
Also, it doesn't make sense to include externalDisplayID in ORDER BY, because that will order by its value from a random row in the group. You could put ORDER BY externalDisplayID in the GROUP_CONCAT() arguments if that's what you want.

There is a second inefficiency going on here. I call it "explode-implode". First a bunch of JOINs (potentially) expand the number of rows in an intermediate table, then GROUP BY c.id collapses the number of rows back to what you started with (one row of output per row of checkpoint).
Before trying to help with that, please answer:
Is LEFT really needed?
How many rows in each table? (Especially in cw)
Can you get rid of DISTINCT?
Barmar's answer can possibly be improved upon by delaying the JOINs to qcj1andtagsuntil after theUNION`:
SELECT ...
FROM ( SELECT ...
FROM first few tables
UNION ALL
SELECT ...
FROM first few tables
) AS u
[LEFT] JOIN qcj1
[LEFT] JOIN tags
GROUP BY ...
ORDER BY ...
Another optimization (again building on Barmar's)
GROUP BY x.id
ORDER BY x.name
-->
GROUP BY x.name, x.id
ORDER BY x.name, x.id
When the items in GROUP BY and ORDER BY are the "same", they can be done in a single action, thereby saving (at least) a sort.
x.name, x.id is deterministic, where as x.name might put two rows with the same name in a different order, depending (perhaps) on the phase of the moon.
These indexes may help:
qcheckj: INDEX(checklistID, questionID)
qcatj: INDEX(questionID, id)
qcatjsub: INDEX(parentQuestionID, id)
cw: INDEX(fromQuestionCategoryJoinID, toQuestionCategoryJoinID)

Related

SQL query optimization for speed

So I was working on the problem of optimizing the following query I have already optimized this to the fullest from my side can this be further optimized?
select distinct name ad_type
from dim_ad_type x where exists ( select 1
from sum_adserver_dimensions sum
left join dim_ad_tag_map on dim_ad_tag_map.id=sum.ad_tag_map_id and dim_ad_tag_map.client_id=sum.client_id
left join dim_site on dim_site.id = dim_ad_tag_map.site_id
left join dim_geo on dim_geo.id = sum.geo_id
left join dim_region on dim_region.id=dim_geo.region_id
left join dim_device_category on dim_device_category.id=sum.device_category_id
left join dim_ad_unit on dim_ad_unit.id=dim_ad_tag_map.ad_unit_id
left join dim_monetization_channel on dim_monetization_channel.id=dim_ad_tag_map.monetization_channel_id
left join dim_os on dim_os.id = sum.os_id
left join dim_ad_type on dim_ad_type.id = dim_ad_tag_map.ad_type_id
left join dim_integration_type on dim_integration_type.id = dim_ad_tag_map.integration_type_id
where sum.client_id = 50
and dim_ad_type.id=x.id
)
order by 1
Your query although joined ok, is an overall bloat. You are using the dim_ad_type table on the outside, just to make sure it exists on the inside as well. You have all those left-joins that have NO bearing on the final outcome, why are they even there. I would simplify by reversing the logic. By tracing your INNER query for the same dim_ad_type table, I find the following is the direct line. sum -> dim_ad_tag_map -> dim_ad_type. Just run that.
select distinct
dat.name Ad_Type
from
sum_adserver_dimensions sum
join dim_ad_tag_map tm
on sum.ad_tag_map_id = tm.id
and sum.client_id = tm.client_id
join dim_ad_type dat
on tm.ad_type_id = dat.id
where
sum.client_id = 50
order by
1
Your query was running ALL dim_ad_types, then finding all the sums just to find those that matched. Run it direct starting with the one client, then direct with JOINs.

Not getting the right results using GROUP_CONCAT in query

I have 7 tables to work with inside a query:
tb_post, tb_spots, users, td_sports, tb_spot_types, tb_users_sports, tb_post_media
This is the query I am using:
SELECT po.id_post AS id_post,
po.description_post as description_post,
sp.id_spot as id_spot,
po.date_post as date_post,
u.id AS userid,
u.user_type As tipousuario,
u.username AS username,
spo.id_sport AS sportid,
spo.sport_icon as sporticon,
st.logo_spot_type as spottypelogo,
sp.city_spot AS city_spot,
sp.country_spot AS country_spot,
sp.latitud_spot as latitudspot,
sp.longitud_spot as longitudspot,
sp.short_name AS spotshortname,
sp.verified_spot AS spotverificado,
u.profile_image AS profile_image,
sp.verified_spot_by as spotverificadopor,
uv.id AS spotverificador,
uv.user_type AS spotverificadornivel,
pm.media_type AS mediatype,
pm.media_file AS mediafile,
GROUP_CONCAT(tus.user_sport_sport) sportsdelusuario,
GROUP_CONCAT(logosp.sport_icon) sportsdelusuariologos,
GROUP_CONCAT(pm.media_file) mediapost,
GROUP_CONCAT(pm.media_type) mediaposttype
FROM tb_posts po
LEFT JOIN tb_spots sp ON po.spot_post = sp.id_spot
LEFT JOIN users u ON po.uploaded_by_post = u.id
LEFT JOIN tb_sports spo ON sp.sport_spot = spo.id_sport
LEFT JOIN tb_spot_types st ON sp.type_spot = st.id_spot_type
LEFT JOIN users uv ON sp.verified_spot_by = uv.id
LEFT JOIN tb_users_sports tus ON tus.user_sport_user = u.id
LEFT JOIN tb_sports logosp ON logosp.id_sport = tus.user_sport_sport
LEFT JOIN tb_post_media pm ON pm.media_post = po.id_post
WHERE po.status = 1
GROUP BY po.id_post,uv.id
I am having problems with some of the GROUP_CONCAT groups:
GROUP_CONCAT(tus.user_sport_sport) sportsdelusuario is giving me the right items but repeated, all items twice
GROUP_CONCAT(logosp.sport_icon) sportsdelusuariologos is giving me the right items but repeated, all items twice
GROUP_CONCAT(pm.media_file) mediapost is giving me the right items but repeated four times
GROUP_CONCAT(pm.media_type) mediaposttype s giving me the right items but repeated four times
I can put here all tables structures if you need them.
Multiple one-to-many relations JOINed in a query have a multiplicative affect on aggregation results; the standard solution is subqueries:
You can change
GROUP_CONCAT(pm.media_type) mediaposttype
...
LEFT JOIN tb_post_media pm ON pm.media_post = po.id_post
to
pm.mediaposttype
...
LEFT JOIN (
SELECT media_post, GROUP_CONCAT(media_type) AS mediaposttype
FROM tb_post_media
GROUP BY media_post
) AS pm ON pm.media_post = po.id_post
If tb_post_media is very big, and the po.status = 1 condition in the outer query would significantly reduce the results of the subquery, it can be worth replicating the original join within the subquery to filter down it's results.
Similarly, the correlated version I mentioned in the comments can also be more performant if the outer query has relatively few results. (Calculating the GROUP_CONCAT() for each individually can cost less than calculating it for all once if you would only actually using very few of the results of the latter).
or just add DISTINCT to all the group_concat, e.g., GROUP_CONCAT(DISTINCT pm.media_type)

How can I merge these two left joins into a single one?

How can I merge these two left joins: http://sqlfiddle.com/#!9/1d2954/69/0
SELECT d.`id`, (adcount + bdcount)
FROM `docs` d
LEFT JOIN
(
SELECT da.`doc_id`, COUNT(da.`doc_id`) AS adcount FROM `docs_scod_a` da
INNER JOIN `scod_a` a ON a.`id` = da.`scod_a_id`
WHERE a.`ver_a` IN ('AA', 'AB')
GROUP BY da.`doc_id`
) ad ON ad.`doc_id` = d.`id`
LEFT JOIN
(
SELECT db.`doc_id`, COUNT(db.`doc_id`) AS bdcount FROM `docs_scod_b` db
INNER JOIN `scod_b` b ON b.`id` = db.`scod_b_id`
WHERE b.`ver_b` IN ('BA', 'BB')
GROUP BY db.`doc_id`
) bd ON bd.`doc_id` = d.`id`
to be a Single left join just to ease its use in my code, while making it no less slower?
Let me first emphasize that your method of doing the calculation is the better method. You have two separate dimensions and aggregating them separately is often the most efficient method for doing the calculation. It is also the most scalable method.
That said, your query should be equivalent to this version:
SELECT d.id,
count(distinct a.id),
count(distinct b.id)
FROM docs d left join
docs_scod_a da
ON da.doc_id = d.id LEFT JOIN
scod_a a
ON a.id = da.scod_a_id AND a.ver_a IN ('AA', 'AB') LEFT JOIN
docs_scod_b db
ON db.doc_id = d.id LEFT JOIN
scod_b b
ON b.id = db.scod_b_id AND b.ver_b IN ('BA', 'BB')
GROUP BY d.id
ORDER BY d.id;
This query is more expensive than it looks, because the COUNT(DISTINCT) incurs additional overhead compared to COUNT().
And here is the SQL Fiddle.
And, because LEFT JOIN can return NULL values, your query is more correctly written as:
SELECT d.`id`, COALESCE(adcount, 0) + COALESCE(bdcount, 0)
If you were having problems with the results, this small change might fix those problems.
Performance may be a big problem, depending on sizes of each table. It appears to be an "inflate-deflate" situation since it first "inflates" the number of rows via JOIN, then "deflates" via GROUP BY. The formulation below avoids inflation-deflation.
But first, if I understand this subquery correctly, this
SELECT da.`doc_id`, COUNT(da.`doc_id`) AS adcount
FROM `docs_scod_a` da
INNER JOIN `scod_a` a ON a.`id` = da.`scod_a_id`
WHERE a.`ver_a` IN ('AA', 'AB')
GROUP BY da.`doc_id`
can be rewritten as
SELECT `doc_id`,
( SELECT COUNT(*)
FROM `scod_a`
WHERE `id` = da.`scod_a_id`
AND `ver_a` IN ('AA', 'AB')
) AS adcount
FROM `docs_scod_a` AS da
If that is correct, then the entire query becomes
SELECT d.id,
( SELECT COUNT(*)
FROM docs_scod_a ds
JOIN scod_a s ON s.id = ds.scod_a_id
WHERE ds.doc_id = d.id
AND s.ver_a IN ('AA', 'AB')
) +
( SELECT COUNT(*)
FROM docs_scod_b ds
JOIN scod_b s ON s.id = ds.scod_b_id
WHERE ds.doc_id = d.id
AND s.ver_b IN ('BA', 'BB')
)
FROM docs AS d
Which needs these indexes:
docs_scod_a: (doc_id, scod_a_id), (scod_a_id, doc_id)
docs_scod_b: (doc_id, scod_b_id), (scod_b_id, doc_id)
scod_a: (ver_a, id)
scod_b: (ver_b, id)
docs: -- presumably has PRIMARY KEY(id)
Note the lack of GROUP BY.
docs_scod_a smells like a many-to-many mapping table. I recommend you follow the tips here.
(No COALESCE is needed since COUNT will simply return zero.)
(I don't know whether my version is better (faster or whatever) than Gordon's, nor whether my indexes will help his formulation.)

MYSQL - Left join a double relationship?

SELECT stores.ID, store_info.display_name, store_info.address,store_info.phone,
IFNULL(
GROUP_CONCAT(DISTINCT storeBrands.display_name ORDER BY storeBrands.name),
GROUP_CONCAT(chainBrands.display_name ORDER BY chainBrands.name)
) AS brands,
IFNULL(
GROUP_CONCAT(DISTINCT storeFilters.name ORDER BY storeFilters.name),
GROUP_CONCAT(DISTINCT chainFilters.name ORDER BY chainFilters.name)
) AS filters
FROM stores
LEFT JOIN store_info ON stores.ID = store_info.storeID
LEFT JOIN store_brands ON stores.ID = store_brands.store
LEFT JOIN chain_brands ON stores.chainID = chain_brands.chain
LEFT JOIN brands AS storeBrands ON store_brands.brand = storeBrands.ID
LEFT JOIN brands AS chainBrands ON chain_brands.brand = chainBrands.ID
LEFT JOIN store_filters ON stores.ID = store_filters.store
LEFT JOIN chain_filters ON stores.chainID = chain_filters.chain
LEFT JOIN filters AS storeFilters ON store_filters.filter = storeFilters.ID
LEFT JOIN filters AS chainFilters ON chain_filters.filter = chainFilters.ID
WHERE stores.city = 1
GROUP BY stores.ID
I have updated this question because I have solved the initial problem myself, but there's still one more question:
How can I improve on this?
I feel like I've made a lot of progress already. I have gone from doing a union with subqueries, to doing a single query with subqueries, to improving my joins up to the point where I don't need to do a subquery for each row anymore.
However, it still feels like it could be better. I'm very insecure about my joins.
Does anyone have tips of improvement here?
The goal:
I want this query to get results from a hierarchy. We have 'parents' (chains) that share the same brands and filters (and other things) as their own children(stores). The idea is for the 'child' to inherit the parent's settings as a fallback, but completely ignores it when it starts setting its own data.
So, basically, with one query, you want "either this data or that data", never both. One or the other. (Another reason why UNIONwon't really fit)
If you want the chain's only when the store has none, formulate a UNION. One operand of the UNION joins the master data to the store data, while the other operand--instantiated only when the store has none--joins the master data to the chain data. That's what one uses a UNION for: "I sometimes want these, and I sometimes want those."
I managed to fix this issue myself, also making it way more efficient, removing most subqueries!
SELECT stores.ID, store_info.display_name, store_info.address,store_info.phone,
IFNULL(
GROUP_CONCAT(DISTINCT storeBrands.display_name ORDER BY storeBrands.name),
GROUP_CONCAT(chainBrands.display_name ORDER BY chainBrands.name)
) AS brands,
IFNULL(
GROUP_CONCAT(DISTINCT storeFilters.name ORDER BY storeFilters.name),
GROUP_CONCAT(DISTINCT chainFilters.name ORDER BY chainFilters.name)
) AS filters
FROM stores
LEFT JOIN store_info ON stores.ID = store_info.storeID
LEFT JOIN store_brands ON stores.ID = store_brands.store
LEFT JOIN chain_brands ON stores.chainID = chain_brands.chain
LEFT JOIN brands AS storeBrands ON store_brands.brand = storeBrands.ID
LEFT JOIN brands AS chainBrands ON chain_brands.brand = chainBrands.ID
LEFT JOIN store_filters ON stores.ID = store_filters.store
LEFT JOIN chain_filters ON stores.chainID = chain_filters.chain
LEFT JOIN filters AS storeFilters ON store_filters.filter = storeFilters.ID
LEFT JOIN filters AS chainFilters ON chain_filters.filter = chainFilters.ID
WHERE stores.city = $cityID
GROUP BY stores.ID"

What Would be the Correct SELECT Statement for This?

SELECT *
FROM notifications
INNER JOIN COMMENT
ON COMMENT.id = notifications.source_id
WHERE idblog IN (SELECT blogs_id
FROM blogs
WHERE STATUS = "active")
INNER JOIN reportmsg
ON reportmsg.msgid = notifications.source_id
WHERE uid =: uid
ORDER BY notificationid DESC
LIMIT 20;
Here I am INNER JOINing notifications with comment and reportmsg; then filtering content with WHERE.
But my problem is that for the first INNER JOIN [i.e, with comment], before joining notifications with comment, I want to match notifications.idblog with blogs.blogs_id and SELECT only those rows where blogs.status = "active".
For better understanding of the code above:
Here, for INNER JOIN, with comment I want to SELECT only those rows in notifications whose idblog matches blogs.blogs_id and has status = "active".
The second INNER JOIN with reportmsg needs not to be altered. I.e, it only filters through uid.
As you can see from the image below, you can just need to merge other tables to notifications table using LEFT JOIN like that:
SELECT n.notificationid, n.uid, n.idblog, n.source_id,
b.blogs_id, b.status,
c.id,
r.msgid
-- ... and the other columns you want
FROM notifications n
LEFT JOIN blogs b ON b.blogs_id = n.idblog AND b.STATUS = "active" AND n.uid =: uid
LEFT JOIN comment c ON c.id = n.source_id
LEFT JOIN reportmsg r ON r.msgid = n.source_id
ORDER BY n.notificationid DESC
LIMIT 20;
There's no need/reason to filter before the second join because you only use inner joins and then the order of joins and WHERE-conditions don't matter:
SELECT n.*, c.*, r.*
FROM notifications AS n
JOIN COMMENT as c
ON n.source_id = c.id
LEFT JOIN blogs as b
ON n.idblogs = b.blogs_id
AND B.STATUS = 'active'
JOIN reportmsg AS R
ON n.source_id = r.msgid
WHERE uid =: uid
ORDER BY notificationid DESC
LIMIT 20
You can switch the order of joins, you can move B.STATUS = 'active' into the join-condition, but all queries will return the same result. (After the edit it's a LEFT JOIN, of course now the result differs)
And of course you shouldn't use *, better list only the columns you actually need.
if query optimizer does its work, it does not matter where you put filtering statement in INNER JOIN case but in the LEFT JOIN it has effects. Putting filtering statement in LEFT JOIN conditions cause table filtered at first and joined after while putting filtering statement in WHERE clause will filter results of join. Hence, if you want to use LEFT JOIN your query must look like:
SELECT nt.*
FROM notifications nt
LEFT JOIN Blogs bg on nt.blogs_id = bg.blogs_id and bg.STATUS = "active"
LEFT JOIN COMMENT cm ON cm.id = nt.source_id
LEFT JOIN reportmsg rm ON rm.msgid = nt.source_id
WHERE uid =: uid
ORDER BY nt.notificationid DESC
LIMIT 20;
It's very unclear what you are after here.. while your table diagram is useful, you should really supply some sample data and an expected result even if it is just a couple of dummy rows for each table.
Queries work row by row, both INNER JOINs are applied to the same notification row and non-matching rows are discarded.
Any filter applies to both JOIN and any returned rows must have a match in BOTH comment and reportmsg.
Perhaps you want two LEFT JOINs that can apply different filters and guessing from the table names perhaps it could look like this:
SELECT *
FROM notifications n
LEFT JOIN blogs b
ON n.blogId = b.blogs_id
LEFT JOIN comment c
ON c.id = n.source_id
AND b.status = "Active"
LEFT JOIN reportmsg rm
ON rm.msgid = n.source_id
WHERE n.uid =: uid
AND (c.id IS NOT NULL OR rm.msgid IS NOT NULL)
ORDER BY n.notificationid DESC
LIMIT 20
You also should work on your naming convention:
notifications, comment -> pick either plural or singular table names
notifications.notificationid, comment.id -> pick adding table name to id
notificationid, source_id -> pick underscore or no separation
idblog, notificationid -> pick prepending or appending id
Currently you pretty much have to look up every id field every time you want to use one.
You should change your query to this:
SELECT *
FROM notifications
INNER JOIN comment ON comment.id = notifications.source_id
INNER JOIN reportmsg ON reportmsg.msgid=notifications.source_id
LEFT JOIN blogs ON notifications.idblog = blogs.blogs_id
WHERE blogs.status = 'active'
ORDER BY notificationid DESC
LIMIT 20;