mysql not using index on simple OR condition - mysql

I have run into the age-old problem of MySQL refusing to use an index for seemingly basic stuff.
The query in question:
SELECT c.*
FROM app_comments c
LEFT JOIN app_comments reply_c ON c.reply_to = reply_c.id
WHERE (c.external_id = '840774' AND c.external_context = 'deals')
OR (reply_c.external_id = '840774' AND reply_c.external_context = 'deals')
ORDER BY c.reply_to ASC, c.date ASC
EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE c ALL external_context,external_id,idx_app_comments_externals NULL NULL NULL 903507 Using filesort
1 SIMPLE reply_c eq_ref PRIMARY PRIMARY 4 altero_full.c.reply_to 1 Using where
There are indexes on external_id and external_context separately, and I also tried adding a composite index (idx_app_comments_externals), but that did not help at all.
The query executes in 4-6 seconds in production (>1m records), but removing the OR part of the WHERE condition decreases that to 0.05s (it still uses filesort though).
Clearly indexes don't work here, but I have no idea why. Can anyone explain this?
P.S. We're using MariaDB 10.3.18, could that be at fault here?

MySQL (and MariaDB) cannot optimize OR conditions on different columns or tables. Note that in the context of the query plan c and reply_c are considered different tables. These queries are usually optimized "by hand" with UNION statements, which often contain a lot of code duplication. But in your case and with a quite recent version, which supports CTEs (Common Table Expressions) you can avoid most of it:
WITH p AS (
SELECT *
FROM app_comments
WHERE external_id = '840774'
AND external_context = 'deals'
)
SELECT * FROM p
UNION DISTINCT
SELECT c.* FROM p JOIN app_comments c ON c.reply_to = p.id
ORDER BY reply_to ASC, date ASC
Good indices for this query would be a composite one on (external_id, external_context) (in any order) and a separate one on (reply_to).
You will not avoid a "filesort" this way, but that shouldn't be a problem once the data have been filtered down to a small set.
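As a sanity check on the rewrite, here is a minimal sketch that compares the result of the original OR query with the CTE/UNION version on a handful of made-up rows. It uses Python's built-in sqlite3 module as a stand-in for MariaDB, so it says nothing about the execution plan; it only illustrates that the two forms return the same rows. The sample data is hypothetical, and plain UNION is used because SQLite does not accept the UNION DISTINCT spelling (the meaning is the same).

```python
import sqlite3

# Toy data (hypothetical): SQLite stands in for MariaDB purely to compare results.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE app_comments (
    id INTEGER PRIMARY KEY,
    reply_to INTEGER,
    external_id TEXT,
    external_context TEXT,
    date INTEGER
);
INSERT INTO app_comments VALUES
    (1, NULL, '840774', 'deals', 1),  -- matching top-level comment
    (2, 1,    '840774', 'deals', 2),  -- reply that also matches by itself
    (3, 1,    '0',      '',      3),  -- reply matched only through its parent
    (4, NULL, '999',    'deals', 4),  -- unrelated comment
    (5, 4,    '999',    'deals', 5);  -- reply to the unrelated comment
""")

# The original query shape: OR across the two sides of the self-join.
or_query = """
SELECT c.id
FROM app_comments c
LEFT JOIN app_comments reply_c ON c.reply_to = reply_c.id
WHERE (c.external_id = '840774' AND c.external_context = 'deals')
   OR (reply_c.external_id = '840774' AND reply_c.external_context = 'deals')
ORDER BY c.reply_to ASC, c.date ASC
"""

# The rewrite: matching rows first, then their replies, deduplicated by UNION.
union_query = """
WITH p AS (
    SELECT id, reply_to, date
    FROM app_comments
    WHERE external_id = '840774'
      AND external_context = 'deals'
)
SELECT id, reply_to, date FROM p
UNION
SELECT c.id, c.reply_to, c.date FROM p JOIN app_comments c ON c.reply_to = p.id
ORDER BY reply_to ASC, date ASC
"""

or_ids = [row[0] for row in conn.execute(or_query)]
union_ids = [row[0] for row in conn.execute(union_query)]
print(or_ids, union_ids)
```

Row 2 matches both branches, which is why the deduplicating UNION (rather than UNION ALL) matters here.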

With the equality predicates on external_id and external_context columns in the WHERE clause, MySQL could make effective use of an index... when those predicates specify the subset of rows that can possibly satisfy the query.
But with the OR added to the WHERE clause, the rows to be returned from c are no longer limited by the external_id and external_context values. Rows with any values in those columns could now be returned.
And that negates the big benefit of using an index range scan operation: very quickly eliminating vast swaths of rows from being considered. Yes, an index range scan is used to quickly locate rows. That is true. But the meat of the matter is that the range scan operation uses the index to quickly bypass millions and millions of rows that can't possibly be returned.
This is not behavior specific to MariaDB 10.3. We are going to observe the same behavior in MariaDB 10.2, MySQL 5.7, MySQL 5.6.
I'm questioning the join operation: Is it necessary to return multiple copies of rows from c when there are multiple matching rows from reply_c ? Or is the specification to just return distinct rows from c ?
We can look at the required resultset as two parts.
1) the rows from app_comments with equality predicates on external_id and external_context
SELECT c.*
FROM app_comments c
WHERE c.external_id = '840774'
AND c.external_context = 'deals'
ORDER
BY c.external_id
, c.external_context
, c.reply_to
, c.date
For optimal performance (setting aside a covering index, which the * in the SELECT list rules out), an index like this could be used to satisfy both the range scan operation and the ORDER BY (eliminating a Using filesort operation):
... ON app_comments (external_id, external_context, reply_to, date)
2) The second part of the result is the reply_to rows related to matching rows
SELECT d.*
FROM app_comments d
JOIN app_comments e
ON e.id = d.reply_to
WHERE e.external_id = '840774'
AND e.external_context = 'deals'
ORDER
BY d.reply_to
, d.date
The same index recommended before can be used to access rows in e (range scan operation). Ideally, that index would also include the id column. Our best option is probably to modify the index to include the id column following date:
... ON app_comments (external_id, external_context, reply_to, date, id)
Or, for equivalent performance, at the expense of an extra index, we could define an index like this:
... ON app_comments (external_id, external_context, id)
For accessing rows from d with a range scan, we likely want an index:
... ON app_comments (reply_to, date)
We can combine the two sets with a UNION ALL set operator; but there's potential for the same row being returned by both queries. A UNION operator would force a unique sort to eliminate duplicate rows. Or we could add a condition to the second query to eliminate rows that will be returned by the first query.
SELECT d.*
FROM app_comments d
JOIN app_comments e
ON e.id = d.reply_to
WHERE e.external_id = '840774'
AND e.external_context = 'deals'
HAVING NOT ( d.external_id <=> '840774'
AND d.external_context <=> 'deals'
)
ORDER
BY d.reply_to
, d.date
To combine the two parts, wrap each part in a set of parens, add the UNION ALL set operator between them, and add an ORDER BY at the end (outside the parens), something like this:
(
SELECT c.*
FROM app_comments c
WHERE c.external_id = '840774'
AND c.external_context = 'deals'
ORDER
BY c.external_id
, c.external_context
, c.reply_to
, c.date
)
UNION ALL
(
SELECT d.*
FROM app_comments d
JOIN app_comments e
ON e.id = d.reply_to
WHERE e.external_id = '840774'
AND e.external_context = 'deals'
HAVING NOT ( d.external_id <=> '840774'
AND d.external_context <=> 'deals'
)
ORDER
BY d.reply_to
, d.date
)
ORDER BY `reply_to`, `date`
This will need a "Using filesort" operation over the combined set, but now we've got a really good shot at getting a good execution plan for each part.
There's still my question of how many rows we should return when there are multiple matching reply_to rows.

However, the name index is not used for lookups in the following queries:
SELECT * FROM test
WHERE last_name='Jones' OR first_name='John';

Related

MySQL query optimization - getting the last post of all threads

My MySQL query is loading very slow (over 30 secs), I was wondering what tweaks I can make to optimize it.
The query should return the last post with the string "?" of all threads.
SELECT FeedbackId, ParentFeedbackId, PageId, FeedbackTitle, FeedbackText, FeedbackDate
FROM ReaderFeedback AS c
LEFT JOIN (
SELECT max(FeedbackId) AS MaxFeedbackId
FROM ReaderFeedback
WHERE ParentFeedbackId IS NOT NULL
GROUP BY ParentFeedbackId
) AS d ON d.MaxFeedbackId = c.FeedbackId
WHERE ParentFeedbackId IS NOT NULL
AND FeedbackText LIKE '%?%'
GROUP BY ParentFeedbackId
ORDER BY d.MaxFeedbackId DESC LIMIT 50
Before discussing this problem, I have formatted your SQL:
SELECT feedbackid,
parentfeedbackid,
pageid,
feedbacktitle,
feedbacktext,
feedbackdate
FROM readerfeedback AS c
LEFT JOIN (SELECT Max(feedbackid) AS MaxFeedbackId
FROM readerfeedback
WHERE parentfeedbackid IS NOT NULL
GROUP BY parentfeedbackid) AS d
ON d.maxfeedbackid = c.feedbackid
WHERE parentfeedbackid IS NOT NULL
AND feedbacktext LIKE '%?%'
GROUP BY parentfeedbackid
ORDER BY d.maxfeedbackid DESC
LIMIT 50
Your SQL contains an inefficient search condition:
feedbacktext LIKE '%?%'
Because of the leading wildcard it cannot benefit from an index and needs a full scan. I suggest you add a new field
isQuestion BOOLEAN
to your table, and then add logic in your program to set this field whenever a feedbacktext is inserted or updated.
Finally, you can query on this field and benefit from an index.
Firstly your SQL is not valid. The outer Group by is not valid.
According to the SQL the second group by is not needed. I moved the 2 where into inner SQL, as well as the limit, wonder if the following is quicker:
SELECT FeedbackId, ParentFeedbackId, PageId, FeedbackTitle, FeedbackText, FeedbackDate
FROM ReaderFeedback AS c
JOIN (
SELECT max(FeedbackId) AS MaxFeedbackId
FROM ReaderFeedback
WHERE ParentFeedbackId IS NOT NULL
AND FeedbackText LIKE '%?%'
GROUP BY ParentFeedbackId
ORDER BY 1 DESC LIMIT 50
) AS d ON d.MaxFeedbackId = c.FeedbackId
Please also have a look at your table structure and see whether any normalisation could be done to improve speed.

How to fix SQL query with Left Join and subquery?

I have SQL query with LEFT JOIN:
SELECT COUNT(stn.stocksId) AS count_stocks
FROM MedicalFacilities AS a
LEFT JOIN stocks stn ON
(stn.stocksIdMF = ( SELECT b.MedicalFacilitiesIdUser
FROM medicalfacilities AS b
WHERE b.MedicalFacilitiesIdUser = a.MedicalFacilitiesIdUser
ORDER BY stn.stocksId DESC LIMIT 1)
AND stn.stocksEndDate >= UNIX_TIMESTAMP() AND stn.stocksStartDate <= UNIX_TIMESTAMP())
With this query I want to select one row from the stocks table that meets the conditions and whose field equals a.MedicalFacilitiesIdUser.
I always get count_stocks = 0 in the result, but I need to get 1.
The count(...) aggregate doesn't count null, so its argument matters:
COUNT(stn.stocksId)
Since stn is your right hand table, this will not count anything if the left join misses. You could use:
COUNT(*)
which counts every row, even if all its columns are null. Or a column from the left hand table (a) that is never null:
COUNT(a.ID)
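The difference is easy to see on a tiny example. The sketch below uses Python's built-in sqlite3 module with two hypothetical, simplified tables (facilities and stocks stand in for the real ones); COUNT behaves the same way in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE facilities (id INTEGER PRIMARY KEY);
CREATE TABLE stocks (stocksId INTEGER PRIMARY KEY, stocksIdMF INTEGER);
INSERT INTO facilities VALUES (1);
-- stocks is left empty on purpose, so the LEFT JOIN finds no match
""")

# One facility row survives the LEFT JOIN, with NULLs for every stocks column.
count_col, count_star = conn.execute("""
SELECT COUNT(s.stocksId), COUNT(*)
FROM facilities a
LEFT JOIN stocks s ON s.stocksIdMF = a.id
""").fetchone()

print(count_col, count_star)
```

COUNT(s.stocksId) skips the NULL produced by the missed join, while COUNT(*) counts the surviving row from the left-hand table.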
Your subquery in the on looks very strange to me:
on stn.stocksIdMF = ( SELECT b.MedicalFacilitiesIdUser
FROM medicalfacilities AS b
WHERE b.MedicalFacilitiesIdUser = a.MedicalFacilitiesIdUser
ORDER BY stn.stocksId DESC LIMIT 1)
This is comparing MedicalFacilitiesIdUser to stocksIdMF. Admittedly, you have no sample data or data layouts, but the naming of the columns suggests that these are not the same thing. Perhaps you intend:
on stn.stocksIdMF = ( SELECT b.stocksId
-----------------------------^
FROM medicalfacilities AS b
WHERE b.MedicalFacilitiesIdUser = a.MedicalFacilitiesIdUser
ORDER BY b.stocksId DESC
LIMIT 1)
Also, ordering by stn.stocksid wouldn't do anything useful, because that would be coming from outside the subquery.
Your subquery seems redundant, and the main query is hard to read because many of the join conditions could be placed in the WHERE clause. The original query might also have a performance issue.
Recall that WHERE is an implicit join and JOIN is an explicit join. For inner joins, query optimizers make no distinction between the two if they use the same expressions, but readability and maintainability are another matter. Note that with a LEFT JOIN, conditions on the right-hand table placed in WHERE also filter out the unmatched rows, so the join then behaves like an inner join.
Consider the revised version (notice I added a GROUP BY):
SELECT COUNT(stn.stocksId) AS count_stocks
FROM MedicalFacilities AS a
LEFT JOIN stocks stn ON stn.stocksIdMF = a.MedicalFacilitiesIdUser
WHERE stn.stocksEndDate >= UNIX_TIMESTAMP()
AND stn.stocksStartDate <= UNIX_TIMESTAMP()
GROUP BY stn.stocksId
ORDER BY stn.stocksId DESC
LIMIT 1

Poor Performance from MySQL JOIN - How to Make Improvements?

A bit of a generic question title but I have the following query:
SELECT t.from_number, COUNT(*) AS calls
FROM t
WHERE t.organisation_id = 999
AND t.direction = 'inbound'
AND t.start_time BETWEEN '2014-03-26' AND NOW()
AND t.from_number != ''
GROUP BY t.from_number
ORDER BY calls DESC LIMIT 20
and it executes in 488ms.
However, as well as retrieving the data from that table, I need to look up who the number belongs to.
SELECT t.from_number, COUNT(*) AS calls
FROM t
LEFT JOIN n on CONCAT('44', n.number) = t.from_number
WHERE t.organisation_id = 999
AND t.direction = 'inbound'
AND t.start_time BETWEEN '2014-03-26' AND NOW()
AND t.from_number != ''
GROUP BY t.from_number
ORDER BY calls DESC LIMIT 20
As soon as I add the JOIN the query execution time jumps up to anything from 8 - 12 seconds and that's only to find the organisation that the number belongs to, I'd need yet another join after that to retrieve the organisation name from the organisations table.
The cardinalities of t and n are > 2,000,000 and ~ 63,000 respectively and, as you can guess from above, the numbers are stored slightly differently in each:
t stores numbers as 123456789, since the country code (44) is stored in a separate column, but n stores numbers as 44123456789. That is why I need to use CONCAT, but I didn't think this would affect performance since it's not in the WHERE clause.
As far as I can tell, I have indexed the important columns in each table.
Are there any suggestions on how I can improve the performance of queries when it comes to these tables?
Update
EXPLAIN output added
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1 SIMPLE t index_merge organisation_id,start_time,direction,from_number organisation_id,direction 4,13 NULL 4174 Using intersect(organisation_id,direction); Using where; Using temporary; Using filesort
1 SIMPLE n index NULL number 768 NULL 62759 Using index
The problem is on the JOIN clause:
LEFT JOIN n on CONCAT('44', n.number) = t.from_number
It is joining the tables using the result of the function CONCAT('44', n.number).
Some databases (such as Oracle) can create an index based on a function, but MySQL could not when this was written (functional key parts only arrived later, in MySQL 8.0.13). So it cannot use an index on table n for lookups in the join.
A solution would be to create a new column on n with the result of the used function and to index it.
You could use code similar to:
ALTER TABLE n ADD COLUMN extended_number varchar(128) null;
UPDATE n
SET extended_number = CONCAT('44', number);
CREATE INDEX ext_numb_idx
ON n (extended_number);
After this, modify the JOIN clause of the query:
SELECT t.from_number, COUNT(*) AS calls
FROM t
LEFT JOIN n on n.extended_number = t.from_number
WHERE t.organisation_id = 999
AND t.direction = 'inbound'
AND t.start_time BETWEEN '2014-03-26' AND NOW()
AND t.from_number != ''
GROUP BY t.from_number
ORDER BY calls DESC LIMIT 20
Then MySQL will use the newly created index and will execute the query much faster.

mySQL query ignoring NOT IN function

This query is processing and running but it is completely ignoring the NOT IN section
SELECT * FROM `offers` as `o` WHERE `o`.country_iso = '$country_iso' AND `o`.`id`
not in (select distinct(offer_id) from aff_disabled_offers
where offer_id = 'o.id' and user_id = '1') ORDER by rand() LIMIT 7
Maybe your "not in" query returns nothing.
Shouldn't the
where offer_id='o.id'
Be
where offer_id=o.id
?
guido has the answer... it looks like you meant to create a correlated subquery. 'o.id' is being seen as a literal.
SOME CAUTIONS:
You usually want some sort of guarantee that the subquery in the NOT IN predicate does NOT return a NULL value. If you don't have that guarantee enforced from the database, adding a WHERE/HAVING return_expr IS NOT NULL in the subquery is sufficient to give you that guarantee.
That correlated subquery is going to eat your lunch, performance wise, on large sets. As will that ORDER BY rand().
Generally, an anti-join pattern turns out to be much more efficient on large sets:
SELECT o.*
FROM offers o
LEFT
JOIN aff_disabled_offers d
ON d.user_id = '1'
AND d.offer_id = o.id
WHERE d.offer_id IS NULL
AND o.country_iso = '$country_iso'
ORDER BY rand()
LIMIT 7
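To make the points above concrete, here is a small sketch using Python's built-in sqlite3 module with hypothetical sample rows. It compares the quoted-literal subquery, the correlated subquery and the anti-join, and then shows the NULL caution. SQLite is only a stand-in here: the type-coercion details of comparing a number with the string 'o.id' differ from MySQL, but in both cases the literal matches no disabled offer, so nothing is filtered out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offers (id INTEGER PRIMARY KEY, country_iso TEXT);
CREATE TABLE aff_disabled_offers (offer_id INTEGER, user_id INTEGER);
INSERT INTO offers VALUES (1, 'US'), (2, 'US'), (3, 'US');
INSERT INTO aff_disabled_offers VALUES (2, 1);  -- offer 2 is disabled for user 1
""")

def ids(sql):
    """Run a query and return the first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# 'o.id' in quotes is a string literal, not a reference to the outer row.
literal = ids("""
SELECT o.id FROM offers o
WHERE o.country_iso = 'US'
  AND o.id NOT IN (SELECT offer_id FROM aff_disabled_offers
                   WHERE offer_id = 'o.id' AND user_id = 1)
""")

# Without the quotes the subquery is correlated with the outer row.
correlated = ids("""
SELECT o.id FROM offers o
WHERE o.country_iso = 'US'
  AND o.id NOT IN (SELECT offer_id FROM aff_disabled_offers
                   WHERE offer_id = o.id AND user_id = 1)
""")

# The anti-join pattern: keep offers with no matching disabled row.
anti_join = ids("""
SELECT o.id FROM offers o
LEFT JOIN aff_disabled_offers d ON d.user_id = 1 AND d.offer_id = o.id
WHERE d.offer_id IS NULL AND o.country_iso = 'US'
""")

# The NULL caution: one NULL in an uncorrelated NOT IN list hides every row.
conn.execute("INSERT INTO aff_disabled_offers VALUES (NULL, 1)")
with_null = ids("""
SELECT o.id FROM offers o
WHERE o.id NOT IN (SELECT offer_id FROM aff_disabled_offers WHERE user_id = 1)
""")
null_guarded = ids("""
SELECT o.id FROM offers o
WHERE o.id NOT IN (SELECT offer_id FROM aff_disabled_offers
                   WHERE user_id = 1 AND offer_id IS NOT NULL)
""")

print(literal, correlated, anti_join, with_null, null_guarded)
```

The anti-join is unaffected by the NULL row, because a NULL offer_id never satisfies the join condition.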

Tips for improving this slow mysql query?

I'm using a query which generally executes in under a second, but sometimes takes between 10-40 seconds to finish. I'm actually not totally clear on how the subquery works, I just know that it works, in that it gives me 15 rows for each faverprofileid.
I'm logging slow queries and it's telling me 5823244 rows were examined, which is odd because there aren't anywhere close to that many rows in any of the tables involved (the favorites table has the most at 50,000 rows).
Can anyone offer me some pointers? Is it an issue with the subquery and needing to use filesort?
EDIT: Running explain shows that the users table is not using an index (even though id is the primary key). Under extra it says: Using temporary; Using filesort.
SELECT F.id,F.created,U.username,U.fullname,U.id,I.*
FROM favorites AS F
INNER JOIN users AS U ON F.faver_profile_id = U.id
INNER JOIN items AS I ON F.notice_id = I.id
WHERE faver_profile_id IN (360,379,95,315,278,1)
AND F.removed = 0
AND I.removed = 0
AND F.collection_id is null
AND I.nudity = 0
AND (SELECT COUNT(*) FROM favorites WHERE faver_profile_id = F.faver_profile_id
AND created > F.created AND removed = 0 AND collection_id is null) < 15
ORDER BY F.faver_profile_id, F.created DESC;
The number of rows examined is large because many rows have been examined more than once. You are getting this because of a poorly optimized query plan, which results in table scans where index lookups should have been performed. In this case the number of rows examined grows multiplicatively, i.e. it is of an order of magnitude comparable to the product of the row counts of more than one table.
Make sure that you have run ANALYZE TABLE on your three tables.
Read up on how to avoid table scans, then identify and create any missing indexes.
Rerun ANALYZE TABLE and run EXPLAIN on your queries again; the number of examined rows should drop dramatically. If not, post the full EXPLAIN plan.
Use query hints to force the use of indexes (to see the index names for a table, use SHOW INDEX):
SELECT
F.id,F.created,U.username,U.fullname,U.id,I.*
FROM favorites AS F FORCE INDEX (faver_profile_id_key)
INNER JOIN users AS U FORCE INDEX FOR JOIN (PRIMARY) ON F.faver_profile_id = U.id
INNER JOIN items AS I FORCE INDEX FOR JOIN (PRIMARY) ON F.notice_id = I.id
WHERE faver_profile_id IN (360,379,95,315,278,1)
AND F.removed = 0
AND I.removed = 0
AND F.collection_id is null
AND I.nudity = 0
AND (SELECT COUNT(*) FROM favorites FORCE INDEX (faver_profile_id_key) WHERE faver_profile_id = F.faver_profile_id
AND created > F.created AND removed = 0 AND collection_id is null) < 15
ORDER BY F.faver_profile_id, F.created DESC;
You may also change your query to use GROUP BY faver_profile_id/HAVING count < 15 instead of the nested SELECT COUNT(*) subquery, as suggested by vartec. The performance of both your original and vartec's query should be comparable if both are properly optimized, e.g. using hints (your query would use nested index lookups, whereas vartec's query would use a hash-based strategy).
I think with GROUP BY and HAVING it should be faster.
Is that what you want?
SELECT F.id,F.created,U.username,U.fullname,U.id, I.field1, I.field2, count(*) as CNT
FROM favorites AS F
INNER JOIN users AS U ON F.faver_profile_id = U.id
INNER JOIN items AS I ON F.notice_id = I.id
WHERE faver_profile_id IN (360,379,95,315,278,1)
AND F.removed = 0
AND I.removed = 0
AND F.collection_id is null
AND I.nudity = 0
GROUP BY F.id,F.created,U.username,U.fullname,U.id,I.field1, I.field2
HAVING CNT < 15
ORDER BY F.faver_profile_id, F.created DESC;
Don't know which fields from items you need, so I've put placeholders.
I suggest you use MySQL's EXPLAIN to see how your server handles the query. My bet is that your indexes aren't optimal, but EXPLAIN should do much better than my bet.
You could do a loop on each id and use limit instead of the count(*) subquery:
foreach $id in [123,456,789]:
SELECT
F.id,
F.created,
U.username,
U.fullname,
U.id,
I.*
FROM
favorites AS F INNER JOIN
users AS U ON F.faver_profile_id = U.id INNER JOIN
items AS I ON F.notice_id = I.id
WHERE
F.faver_profile_id = {$id} AND
I.removed = 0 AND
I.nudity = 0 AND
F.removed = 0 AND
F.collection_id is null
ORDER BY
F.faver_profile_id,
F.created DESC
LIMIT
15;
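The loop above is pseudocode. Here is a minimal runnable sketch of the same idea using Python's built-in sqlite3 module, which also shows what the correlated COUNT(*) subquery from the question does: it keeps a row only if fewer than N newer rows exist for the same profile, i.e. the N newest rows per profile. The table is a hypothetical, simplified favorites table (the joins to users and items are left out), and N is 2 instead of 15 to keep the data small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE favorites (
    id INTEGER PRIMARY KEY,
    faver_profile_id INTEGER,
    created INTEGER,
    removed INTEGER DEFAULT 0,
    collection_id INTEGER
);
INSERT INTO favorites (faver_profile_id, created) VALUES
    (1, 10), (1, 20), (1, 30), (2, 10);
""")

TOP_N = 2              # the question uses 15
profile_ids = [1, 2]   # hypothetical profile ids

# Correlated subquery: keep a row if fewer than TOP_N newer rows exist
# for the same profile.
subquery_rows = conn.execute("""
    SELECT F.faver_profile_id, F.created
    FROM favorites AS F
    WHERE F.faver_profile_id IN (1, 2)
      AND F.removed = 0
      AND F.collection_id IS NULL
      AND (SELECT COUNT(*) FROM favorites
           WHERE faver_profile_id = F.faver_profile_id
             AND created > F.created
             AND removed = 0
             AND collection_id IS NULL) < ?
    ORDER BY F.faver_profile_id, F.created DESC
""", (TOP_N,)).fetchall()

# Loop variant: one small query per profile with ORDER BY ... LIMIT.
loop_rows = []
for profile_id in profile_ids:
    loop_rows += conn.execute("""
        SELECT faver_profile_id, created
        FROM favorites
        WHERE faver_profile_id = ?
          AND removed = 0
          AND collection_id IS NULL
        ORDER BY created DESC
        LIMIT ?
    """, (profile_id, TOP_N)).fetchall()

print(subquery_rows)
print(loop_rows)
```

The two variants agree here because the created values are distinct within each profile; with ties, the subquery can return more than N rows per profile, while LIMIT cuts off at exactly N.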
I'll suppose the result of that query is intended to be shown as a paged list. In that case, perhaps you could consider doing a simpler "unjoined query" and then a second query for each row, reading only the 15, 20 or 30 elements shown. Isn't a JOIN a heavy operation? This would simplify the query, and it wouldn't become slower as the joined tables grow.
Tell me if I'm wrong, please.