Poor Performance from MySQL JOIN - How to Make Improvements? - mysql

A bit of a generic question title but I have the following query:
SELECT t.from_number, COUNT(*) AS calls
FROM t
WHERE t.organisation_id = 999
AND t.direction = 'inbound'
AND t.start_time BETWEEN '2014-03-26' AND NOW()
AND t.from_number != ''
GROUP BY t.from_number
ORDER BY calls DESC LIMIT 20
and it executes in 488ms.
However, as well as retrieving the data from that table, I need to look up who the number belongs to.
SELECT t.from_number, COUNT(*) AS calls
FROM t
LEFT JOIN n on CONCAT('44', n.number) = t.from_number
WHERE t.organisation_id = 999
AND t.direction = 'inbound'
AND t.start_time BETWEEN '2014-03-26' AND NOW()
AND t.from_number != ''
GROUP BY t.from_number
ORDER BY calls DESC LIMIT 20
As soon as I add the JOIN, the query execution time jumps to anywhere from 8 to 12 seconds, and that is only to find the organisation that the number belongs to. I would need yet another join after that to retrieve the organisation name from the organisations table.
The cardinalities of t and n are > 2,000,000 and ~ 63,000 respectively, and, as you can guess from above, the numbers are stored slightly differently in each:
t stores numbers as 44123456789, but n stores numbers as 123456789 since the country code (44) is stored in a separate column, which is why I need to use the CONCAT. I didn't think this would affect performance since it's not in the WHERE clause.
As far as I can tell, I have indexed the important columns in each table.
Are there any suggestions on how I can improve the performance of queries when it comes to these tables?
Update
EXPLAIN output added
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1 SIMPLE t index_merge organisation_id,start_time,direction,from_number organisation_id,direction 4,13 NULL 4174 Using intersect(organisation_id,direction); Using where; Using temporary; Using filesort
1 SIMPLE n index NULL number 768 NULL 62759 Using index

The problem is on the JOIN clause:
LEFT JOIN n on CONCAT('44', n.number) = t.from_number
It is joining the tables using the result of the function CONCAT('44', n.number).
Some databases (such as Oracle) can create an index based on a function, but older MySQL versions cannot (indexed generated columns arrived in MySQL 5.7 and functional key parts in 8.0.13). So, on those versions, it cannot use any index on table n to make the join.
A solution would be to create a new column on n with the result of the used function and to index it.
You could use code similar to:
ALTER TABLE n ADD COLUMN extended_number varchar(128) null;
UPDATE n
SET extended_number = CONCAT('44', number);
CREATE INDEX ext_numb_idx
ON n (extended_number);
After this, modify the JOIN clause of the query:
SELECT t.from_number, COUNT(*) AS calls
FROM t
LEFT JOIN n on n.extended_number = t.from_number
WHERE t.organisation_id = 999
AND t.direction = 'inbound'
AND t.start_time BETWEEN '2014-03-26' AND NOW()
AND t.from_number != ''
GROUP BY t.from_number
ORDER BY calls DESC LIMIT 20
Then MySQL will use the newly created index and will execute the query much faster.
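On MySQL 5.7 or later, the same idea can be sketched with a stored generated column instead of the ALTER/UPDATE pair, so the extra column cannot drift out of sync when n.number changes. This is only a sketch: the VARCHAR(128) length is carried over from the example above, and it assumes extended_number ends up with the same character set and collation as t.from_number (a mismatch can stop the index from being used).

```sql
-- Sketch, assuming MySQL 5.7.6+ (generated columns) and that
-- extended_number has not already been added as a plain column.
-- The server maintains the value, so no UPDATE is needed.
ALTER TABLE n
  ADD COLUMN extended_number VARCHAR(128)
    GENERATED ALWAYS AS (CONCAT('44', number)) STORED,
  ADD INDEX ext_numb_idx (extended_number);

-- Check the plan: the row for n should show key = ext_numb_idx
-- with type = ref rather than a scan of the whole number index.
EXPLAIN
SELECT t.from_number, COUNT(*) AS calls
FROM t
LEFT JOIN n ON n.extended_number = t.from_number
WHERE t.organisation_id = 999
  AND t.direction = 'inbound'
  AND t.start_time BETWEEN '2014-03-26' AND NOW()
  AND t.from_number != ''
GROUP BY t.from_number
ORDER BY calls DESC LIMIT 20;
```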

Related

Why two mysql selects executed separately are much faster than one combined?

I want to understand why two queries run separately take around 400ms in total, but when I combine them using a sub-select the query takes around 12 seconds.
I have two InnoDB tables:
event: 99 914 rows
event_prizes: 24 540 770 rows
Below are my queries:
SELECT
id
FROM
event e
WHERE
e.status != 'SCHEDULED';
-- takes 130ms, returns 2406 rows
SELECT
id, count(*)
FROM
event_prizes
WHERE event_id in (
-- 2406 ids returned from the previous query
)
GROUP BY
id;
-- takes 270ms, returns the same amount of rows
On the other hand, when I run the query below:
SELECT
id, count(*)
FROM
event_prizes
WHERE event_id in (
SELECT
id
FROM
event e
WHERE
e.status != 'SCHEDULED'
)
GROUP BY
id;
-- takes 12seconds
I guess in the second case MySQL does a full scan of the event_prizes table?
Is there a better way to write a single query for this case?
You can use an INNER JOIN instead of a sub-select:
SELECT ep.id, COUNT(*)
FROM event_prizes ep INNER JOIN event e ON ep.event_id = e.id
WHERE e.status <> 'SCHEDULED'
GROUP BY ep.id
Make sure you are using
a PRIMARY KEY on event.id
a PRIMARY KEY on event_prizes.id
a FOREIGN KEY on event_prizes.event_id
You can also try the following indices at least:
event(status)
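As a sketch, the keys suggested above could be created like this. The index and constraint names are made up for illustration, and the statements assume these keys do not exist yet. In InnoDB, adding the foreign key also gives event_prizes.event_id an index if it does not have one; on a table of this size the ALTER may take a while.

```sql
-- Hypothetical names; skip any statement whose key already exists.
CREATE INDEX idx_event_status ON event (status);

-- Requires event_prizes.event_id and event.id to have matching types.
ALTER TABLE event_prizes
  ADD CONSTRAINT fk_event_prizes_event
  FOREIGN KEY (event_id) REFERENCES event (id);

-- Compare this plan with the plan of the sub-select version.
EXPLAIN
SELECT ep.id, COUNT(*)
FROM event_prizes ep
INNER JOIN event e ON ep.event_id = e.id
WHERE e.status <> 'SCHEDULED'
GROUP BY ep.id;
```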

mysql not using index on simple OR condition

I have run into the age-old problem of MySQL refusing to use an index for seemingly basic stuff.
The query in question:
SELECT c.*
FROM app_comments c
LEFT JOIN app_comments reply_c ON c.reply_to = reply_c.id
WHERE (c.external_id = '840774' AND c.external_context = 'deals')
OR (reply_c.external_id = '840774' AND reply_c.external_context = 'deals')
ORDER BY c.reply_to ASC, c.date ASC
EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE c ALL external_context,external_id,idx_app_comments_externals NULL NULL NULL 903507 Using filesort
1 SIMPLE reply_c eq_ref PRIMARY PRIMARY 4 altero_full.c.reply_to 1 Using where
There are indexes on external_id and external_context separately, and I also tried adding a composite index (idx_app_comments_externals), but that did not help at all.
The query executes in 4-6 seconds in production (>1m records), but removing the OR part of the WHERE condition decreases that to 0.05s (it still uses filesort though).
Clearly indexes don't work here, but I have no idea why. Can anyone explain this?
P.S. We're using MariaDB 10.3.18, could that be at fault here?
MySQL (and MariaDB) generally cannot use indexes efficiently for OR conditions that span different tables (for different columns of the same table, an index merge can sometimes help). Note that in the context of the query plan c and reply_c are considered different tables. These queries are usually optimized "by hand" with UNION statements, which often contain a lot of code duplication. But in your case, with a fairly recent version that supports CTEs (Common Table Expressions), you can avoid most of it:
WITH p AS (
SELECT *
FROM app_comments
WHERE external_id = '840774'
AND external_context = 'deals'
)
SELECT * FROM p
UNION DISTINCT
SELECT c.* FROM p JOIN app_comments c ON c.reply_to = p.id
ORDER BY reply_to ASC, date ASC
Good indices for this query would be a composite one on (external_id, external_context) (in any order) and a separate one on (reply_to).
You will not avoid a "filesort" this way, but that shouldn't be a problem when the data are filtered down to a small set.
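For illustration, the two indices could be created as follows. The index names are hypothetical, and the composite index mentioned in the question (idx_app_comments_externals) may already cover the first one.

```sql
-- Hypothetical index names; skip the first statement if an equivalent
-- composite index on these two columns already exists.
CREATE INDEX idx_comments_external ON app_comments (external_id, external_context);

-- Lets the second branch of the UNION find replies by parent id.
CREATE INDEX idx_comments_reply_to ON app_comments (reply_to);
```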
With the equality predicates on external_id and external_context columns in the WHERE clause, MySQL could make effective use of an index... when those predicates specify the subset of rows that can possibly satisfy the query.
But with the OR added to the WHERE clause, the rows to be returned from c are no longer limited by external_id and external_context values. It's now possible that rows with other values of those columns could be returned; rows with any values of those columns.
And that negates the big benefit of using an index range scan operation... very quickly eliminating vast swaths of rows from being considered. Yes, an index range scan is used to quickly locate rows. That is true. But the meat of the matter is that the range scan operation uses the index to quickly bypass millions and millions of rows that can't possibly be returned.
This is not behavior specific to MariaDB 10.3. We are going to observe the same behavior in MariaDB 10.2, MySQL 5.7, MySQL 5.6.
I'm questioning the join operation: Is it necessary to return multiple copies of rows from c when there are multiple matching rows from reply_c ? Or is the specification to just return distinct rows from c ?
We can look at the required resultset as two parts.
1) the rows from app_comments with equality predicates on external_id and external_context
SELECT c.*
FROM app_comments c
WHERE c.external_id = '840774'
AND c.external_context = 'deals'
ORDER
BY c.external_id
, c.external_context
, c.reply_to
, c.date
For optimal performance (excluding considering a covering index because of the * in the SELECT list), an index like this could be used to satisfy both the range scan operation and the order by (eliminating a Using filesort operation)
... ON app_comments (external_id, external_context, reply_to, date)
2) The second part of the result is the reply_to rows related to matching rows
SELECT d.*
FROM app_comments d
JOIN app_comments e
ON e.id = d.reply_to
WHERE e.external_id = '840774'
AND e.external_context = 'deals'
ORDER
BY d.reply_to
, d.date
The same index recommended before can be used to access rows in e (range scan operation). Ideally, that index would also include the id column. Our best option is probably to modify the index to include the id column following date
... ON app_comments (external_id, external_context, reply_to, date, id)
Or, for equivalent performance, at the expense of an extra index, we could define an index like this:
... ON app_comments (external_id, external_context, id)
For accessing rows from d with a range scan, we likely want an index:
... ON app_comments (reply_to, date)
We can combine the two sets with a UNION ALL set operator; but there's potential for the same row being returned by both queries. A UNION operator would force a unique sort to eliminate duplicate rows. Or we could add a condition to the second query to eliminate rows that will be returned by the first query.
SELECT d.*
FROM app_comments d
JOIN app_comments e
ON e.id = d.reply_to
WHERE e.external_id = '840774'
AND e.external_context = 'deals'
HAVING NOT ( d.external_id <=> '840774'
AND d.external_context <=> 'deals'
)
ORDER
BY d.reply_to
, d.date
To combine the two parts, wrap each part in a set of parens, add the UNION ALL set operator between them, and put an ORDER BY at the end (outside the parens), something like this:
(
SELECT c.*
FROM app_comments c
WHERE c.external_id = '840774'
AND c.external_context = 'deals'
ORDER
BY c.external_id
, c.external_context
, c.reply_to
, c.date
)
UNION ALL
(
SELECT d.*
FROM app_comments d
JOIN app_comments e
ON e.id = d.reply_to
WHERE e.external_id = '840774'
AND e.external_context = 'deals'
HAVING NOT ( d.external_id <=> '840774'
AND d.external_context <=> 'deals'
)
ORDER
BY d.reply_to
, d.date
)
ORDER BY `reply_to`, `date`
This will need a "Using filesort" operation over the combined set, but now we've got a really good shot at getting a good execution plan for each part.
There's still my question of how many rows we should return when there are multiple matching reply_to rows.
However, the name index is not used for lookups in the following queries:
SELECT * FROM test
WHERE last_name='Jones' OR first_name='John';
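A common workaround for this kind of OR, sketched here under the assumption that the name index is on (last_name, first_name) and that the table's rows are unique, is to split the query so that each branch can use its own index:

```sql
-- Assumption: first_name is not the leftmost column of the existing index,
-- so it needs a separate index (hypothetical name) to be searchable alone.
CREATE INDEX idx_test_first_name ON test (first_name);

-- Each branch can use one index; UNION (without ALL) drops the rows
-- that satisfy both conditions so they are not returned twice.
SELECT * FROM test WHERE last_name = 'Jones'
UNION
SELECT * FROM test WHERE first_name = 'John';
```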

How to make JOINS faster?

I had this query to start out with:
SELECT DISTINCT spentits.*
FROM `spentits`
WHERE (spentits.user_id IN
(SELECT following_id
FROM `follows`
WHERE `follows`.`follower_id` = '44'
AND `follows`.`accepted` = 1)
OR spentits.user_id = '44')
ORDER BY id DESC
LIMIT 15 OFFSET 0
This query takes 10ms to execute.
But once I add a simple join in:
SELECT DISTINCT spentits.*
FROM `spentits`
LEFT JOIN wishlist_items ON wishlist_items.user_id = 44 AND wishlist_items.spentit_id = spentits.id
WHERE (spentits.user_id IN
(SELECT following_id
FROM `follows`
WHERE `follows`.`follower_id` = '44'
AND `follows`.`accepted` = 1)
OR spentits.user_id = '44')
ORDER BY id DESC
LIMIT 15 OFFSET 0
The execution time increased by 11x. Now it takes around 120ms to execute. What's interesting is that if I remove either the LEFT JOIN clause or the ORDER BY id DESC, the time goes back to 10ms.
I am new to databases so I don't understand this. Why is it that removing either one of these clauses speeds it up 11x ? And how can I keep it as is but make it faster?
I have indexes on spentits.user_id, follows.follower_id, follows.accepted, and on primary ids of each table.
EXPLAIN:
1 PRIMARY spentits index index_spentits_on_user_id PRIMARY 4 NULL 15 Using where; Using temporary
1 PRIMARY wishlist_items ref index_wishlist_items_on_user_id,index_wishlist_items_on_spentit_id index_wishlist_items_on_spentit_id 5 spentit.spentits.id 1 Using where; Distinct
2 SUBQUERY follows index_merge index_follows_on_follower_id,index_follows_on_following_id,index_follows_on_accepted index_follows_on_follower_id,index_follows_on_accepted 5,2 NULL 566 Using intersect(index_follows_on_follower_id,index_follows_on_accepted); Using where
You should also have an index on:
wishlist_items.spentit_id
because you are joining on that column.
The LEFT JOIN is easy to explain: conceptually, every entry on the left is paired with the fitting entries on the right (in practice MySQL does this with nested-loop lookups rather than by building a full cross product). So if your spentits table is large, it will take the server some time. I would suggest you get rid of your subquery and make three joins. Start with the smallest table to avoid big amounts of data.
In the 2nd example the subselect may run for every spentits.user_id.
If you write it like this, it will be faster because the subselect runs once:
SELECT DISTINCT spentits.*
FROM `spentits`
LEFT JOIN (SELECT following_id
FROM `follows`
WHERE `follows`.`follower_id` = '44'
AND `follows`.`accepted` = 1) AS `follow`
ON follow.following_id = spentits.user_id
LEFT JOIN wishlist_items ON wishlist_items.user_id = 44 AND wishlist_items.spentit_id = spentits.id
-- keep rows from followed users, plus the user's own rows
WHERE (follow.following_id IS NOT NULL
OR spentits.user_id = '44')
ORDER BY spentits.id DESC
LIMIT 15 OFFSET 0
As you can see, the subselect moved to the FROM part of the query and creates an imaginary table (or view).
This imaginary table is an inline view.
JOINs and inline views are usually faster than a subselect in the WHERE part.

Tips for improving this slow mysql query?

I'm using a query which generally executes in under a second, but sometimes takes between 10-40 seconds to finish. I'm actually not totally clear on how the subquery works, I just know that it works, in that it gives me 15 rows for each faverprofileid.
I'm logging slow queries and it's telling me 5823244 rows were examined, which is odd because there aren't anywhere close to that many rows in any of the tables involved (the favorites table has the most at 50,000 rows).
Can anyone offer me some pointers? Is it an issue with the subquery and needing to use filesort?
EDIT: Running explain shows that the users table is not using an index (even though id is the primary key). Under extra it says: Using temporary; Using filesort.
SELECT F.id,F.created,U.username,U.fullname,U.id,I.*
FROM favorites AS F
INNER JOIN users AS U ON F.faver_profile_id = U.id
INNER JOIN items AS I ON F.notice_id = I.id
WHERE faver_profile_id IN (360,379,95,315,278,1)
AND F.removed = 0
AND I.removed = 0
AND F.collection_id is null
AND I.nudity = 0
AND (SELECT COUNT(*) FROM favorites WHERE faver_profile_id = F.faver_profile_id
AND created > F.created AND removed = 0 AND collection_id is null) < 15
ORDER BY F.faver_profile_id, F.created DESC;
The number of rows examined is large because many rows have been examined more than once. You are getting this because of a poorly optimized query plan, which results in table scans where index lookups should have been performed. In this case the number of rows examined grows multiplicatively, i.e. it is of an order of magnitude comparable to the product of the total number of rows in more than one table.
Make sure that you have run ANALYZE TABLE on your three tables.
Read up on how to avoid table scans, then identify and create any missing indexes.
Rerun ANALYZE TABLE and re-explain your queries:
the number of examined rows must drop dramatically;
if not, post the full explain plan.
Use query hints to force the use of indices (to see the index names for a table, use SHOW INDEX):
SELECT
F.id,F.created,U.username,U.fullname,U.id,I.*
FROM favorites AS F FORCE INDEX (faver_profile_id_key)
INNER JOIN users AS U FORCE INDEX FOR JOIN (PRIMARY) ON F.faver_profile_id = U.id
INNER JOIN items AS I FORCE INDEX FOR JOIN (PRIMARY) ON F.notice_id = I.id
WHERE faver_profile_id IN (360,379,95,315,278,1)
AND F.removed = 0
AND I.removed = 0
AND F.collection_id is null
AND I.nudity = 0
AND (SELECT COUNT(*) FROM favorites FORCE INDEX (faver_profile_id_key) WHERE faver_profile_id = F.faver_profile_id
AND created > F.created AND removed = 0 AND collection_id is null) < 15
ORDER BY F.faver_profile_id, F.created DESC;
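If the correlated COUNT(*) subquery is kept, a composite index along these lines might let each execution read only a narrow slice of favorites. This is a sketch: the index name is hypothetical, and the column order simply follows the usual rule of putting equality-tested columns before the range-tested one.

```sql
-- Hypothetical index name. faver_profile_id, removed and collection_id
-- are tested for equality (or IS NULL) in the subquery; created is the
-- range condition, so it goes last.
CREATE INDEX idx_fav_profile_created
  ON favorites (faver_profile_id, removed, collection_id, created);

-- Confirm that the subquery picks it up before and after adding hints.
SHOW INDEX FROM favorites;
```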
You may also change your query to use GROUP BY faver_profile_id/HAVING count > 15 instead of the nested SELECT COUNT(*) subquery, as suggested by vartec. The performance of both your original and vartec's query should be comparable if both are properly optimized e.g. using hints (your query would use nested index lookups, whereas vartec's query would use a hash-based strategy.)
I think with GROUP BY and HAVING it should be faster.
Is that what you want?
SELECT F.id,F.created,U.username,U.fullname,U.id, I.field1, I.field2, count(*) as CNT
FROM favorites AS F
INNER JOIN users AS U ON F.faver_profile_id = U.id
INNER JOIN items AS I ON F.notice_id = I.id
WHERE faver_profile_id IN (360,379,95,315,278,1)
AND F.removed = 0
AND I.removed = 0
AND F.collection_id is null
AND I.nudity = 0
GROUP BY F.id,F.created,U.username,U.fullname,U.id,I.field1, I.field2
HAVING CNT < 15
ORDER BY F.faver_profile_id, F.created DESC;
Don't know which fields from items you need, so I've put placeholders.
I suggest you use MySQL's EXPLAIN to see how your MySQL server handles the query. My bet is your indexes aren't optimal, but EXPLAIN should do much better than my bet.
You could do a loop on each id and use limit instead of the count(*) subquery:
foreach $id in [123,456,789]:
SELECT
F.id,
F.created,
U.username,
U.fullname,
U.id,
I.*
FROM
favorites AS F INNER JOIN
users AS U ON F.faver_profile_id = U.id INNER JOIN
items AS I ON F.notice_id = I.id
WHERE
F.faver_profile_id = {$id} AND
I.removed = 0 AND
I.nudity = 0 AND
F.removed = 0 AND
F.collection_id is null
ORDER BY
F.faver_profile_id,
F.created DESC
LIMIT
15;
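On servers with window functions (MySQL 8.0+ or MariaDB 10.2+, which the question does not confirm), the per-id loop can be sketched as a single query using ROW_NUMBER(). Note this is a different technique from the loop above, and it ranks rows after the items filters are applied, so it returns up to 15 visible rows per profile rather than filtering the 15 newest favorites. The column aliases are introduced here only for illustration.

```sql
-- Sketch: top 15 favorites per profile in one pass.
SELECT fav_id, created, username, fullname, user_id, notice_id
FROM (
    SELECT F.id AS fav_id,
           F.created,
           U.username,
           U.fullname,
           U.id AS user_id,
           F.notice_id,
           -- number each profile's rows from newest to oldest
           ROW_NUMBER() OVER (PARTITION BY F.faver_profile_id
                              ORDER BY F.created DESC) AS rn
    FROM favorites AS F
    INNER JOIN users AS U ON F.faver_profile_id = U.id
    INNER JOIN items AS I ON F.notice_id = I.id
    WHERE F.faver_profile_id IN (360,379,95,315,278,1)
      AND F.removed = 0
      AND I.removed = 0
      AND F.collection_id IS NULL
      AND I.nudity = 0
) AS ranked
WHERE rn <= 15
ORDER BY user_id, created DESC;
```

The items columns needed by the application would have to be listed explicitly in the inner SELECT, since I.* would clash with the aliased id columns.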
I'll suppose the result of that query is intended to be shown as a paged list. In that case, perhaps you could consider doing a simpler query without joins, and then a second query for each row to read only the 15, 20 or 30 elements shown. Isn't a JOIN a heavy operation? This would simplify the query, and it wouldn't become slower as the joined tables grow.
Tell me if I'm wrong, please.

How to optimize query looking for rows where conditional join rows do not exist?

I've got a table of keywords that I regularly refresh against a remote search API, and I have another table that gets a row each time I refresh one of the keywords. I use this table to block multiple processes from stepping on each other and refreshing the same keyword, as well as for stat collection. So when I spin up my program, it queries for all the keywords that don't have a request currently in process, and don't have a successful one within the last 15 mins, or whatever the interval is. All was working fine for a while, but now the keywords_requests table has almost 2 million rows in it and things are bogging down badly. I've got indexes on almost every column in the keywords_requests table, but to no avail.
I'm logging slow queries and this one is taking forever, as you can see. What can I do?
# Query_time: 20 Lock_time: 0 Rows_sent: 568 Rows_examined: 1826718
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT JOIN `keywords_requests` as KeywordsRequest
ON (
KeywordsRequest.keyword_id = Keyword.id
AND (KeywordsRequest.status = 'success' OR KeywordsRequest.status = 'active')
AND KeywordsRequest.source_id = '29'
AND KeywordsRequest.created > FROM_UNIXTIME(1234551323)
)
WHERE KeywordsRequest.id IS NULL
GROUP BY Keyword.id
ORDER BY KeywordsRequest.created ASC;
It seems your most selective index on keywords_requests is the one on KeywordsRequest.created.
Try to rewrite the query this way:
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN (
SELECT *
FROM `keywords_requests` as kr
WHERE created > FROM_UNIXTIME(1234567890) /* Happy unix_time! */
) AS KeywordsRequest
ON (
KeywordsRequest.keyword_id = Keyword.id
AND (KeywordsRequest.status = 'success' OR KeywordsRequest.status = 'active')
AND KeywordsRequest.source_id = '29'
)
WHERE keyword_id IS NULL;
It will (hopefully) hash join two not so large sources.
And Bill Karwin is right, you don't need the GROUP BY or ORDER BY
There is no fine control over the plans in MySQL, but you can try (try) to improve your query in the following ways:
Create a composite index on (keyword_id, status, source_id, created) and make it so:
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
-- one anti-join per status, so each lookup is a pure equality + range
-- on the composite index; a keyword qualifies only if BOTH find nothing
LEFT OUTER JOIN `keywords_requests` krs
ON (
krs.keyword_id = Keyword.id
AND krs.status = 'success'
AND krs.source_id = '29'
AND krs.created > FROM_UNIXTIME(1234567890)
)
LEFT OUTER JOIN `keywords_requests` kra
ON (
kra.keyword_id = Keyword.id
AND kra.status = 'active'
AND kra.source_id = '29'
AND kra.created > FROM_UNIXTIME(1234567890)
)
WHERE krs.keyword_id IS NULL
AND kra.keyword_id IS NULL
This ideally should use NESTED LOOPS on your index.
Create a composite index on (status, source_id, created) and make it so:
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN (
SELECT *
FROM `keywords_requests` kr
WHERE
status = 'success'
AND source_id = '29'
AND created > FROM_UNIXTIME(1234567890)
UNION ALL
SELECT *
FROM `keywords_requests` kr
WHERE
status = 'active'
AND source_id = '29'
AND created > FROM_UNIXTIME(1234567890)
) AS KeywordsRequest
ON KeywordsRequest.keyword_id = Keyword.id
WHERE KeywordsRequest.keyword_id IS NULL
This will hopefully use HASH JOIN on even more restricted hash table.
When diagnosing MySQL query performance, one of the first things you need to analyze is the report from EXPLAIN.
If you learn to read the information EXPLAIN gives you, then you can see where queries are failing to make use of indexes, or where they are causing expensive filesorts, or other performance red flags.
I notice in your query, the GROUP BY is irrelevant, since there will be only one NULL row returned from KeywordsRequest. Also the ORDER BY is irrelevant, since you're ordering by a column that will always be NULL due to your WHERE clause. If you remove these clauses, you'll probably eliminate a filesort.
Also consider rewriting the query into other forms, and measure the performance of each. For example:
SELECT k.id, k.keyword
FROM `keywords` AS k
WHERE NOT EXISTS (
SELECT * FROM `keywords_requests` AS kr
WHERE kr.keyword_id = k.id
AND kr.status IN ('success', 'active')
AND kr.source_id = '29'
AND kr.created > FROM_UNIXTIME(1234551323)
);
Other tips:
Is kr.source_id an integer? If so, compare to the integer 29 instead of the string '29'.
Are there appropriate indexes on keyword_id, status, source_id, created? Perhaps even a compound index over all four columns would be best, since MySQL will generally use only one index per table in a given query.
You did a screenshot of your EXPLAIN output and posted a link in the comments. I see that the query is not using an index from Keywords, which makes sense since you're scanning every row in that table anyway. The phrase "Not exists" indicates that MySQL has optimized the LEFT OUTER JOIN a bit.
I think this should be improved over your original query. The GROUP BY/ORDER BY was probably causing it to save an intermediate data set as a temporary table, and sorting it on disk (which is very slow!). What you'd look for is "Using temporary; using filesort" in the Extra column of EXPLAIN information.
So you may have improved it enough already to mitigate the bottleneck for now.
I do notice that the possible keys probably indicate that you have individual indexes on four columns. You may be able to improve that by creating a compound index:
CREATE INDEX kr_cover ON keywords_requests
(keyword_id, created, source_id, status);
You can give MySQL a hint to use a specific index:
... FROM `keywords_requests` AS kr USE INDEX (kr_cover) WHERE ...
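A variant worth measuring, offered only as a sketch: MySQL can use the parts of a compound index up to and including the first range condition, so placing the equality-tested columns before created may let more of the index be used. The index name kr_lookup is hypothetical, and '29' is kept as a string only because the column type is not stated.

```sql
-- Hypothetical index name: equality columns first, range column last.
CREATE INDEX kr_lookup ON keywords_requests
  (keyword_id, source_id, status, created);

-- Compare the plan against the one obtained with kr_cover.
EXPLAIN
SELECT k.id, k.keyword
FROM `keywords` AS k
WHERE NOT EXISTS (
  SELECT 1 FROM `keywords_requests` AS kr
  WHERE kr.keyword_id = k.id
    AND kr.source_id = '29'
    AND kr.status IN ('success', 'active')
    AND kr.created > FROM_UNIXTIME(1234551323)
);
```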
Dunno about MySQL but in MSSQL the lines of attack I would take are:
1) Create a covering index on KeywordsRequest status, source_id and created
2) UNION the results to get around the OR on KeywordsRequest.status
3) Use NOT EXISTS instead of the Outer Join (and try with UNION instead of OR too)
Try this
SELECT Keyword.id, Keyword.keyword
FROM keywords as Keyword
LEFT JOIN (select * from keywords_requests where source_id = '29' and (status = 'success' OR status = 'active')
AND created > FROM_UNIXTIME(1234551323)
) as KeywordsRequest
ON (
KeywordsRequest.keyword_id = Keyword.id
)
WHERE KeywordsRequest.id IS NULL
GROUP BY Keyword.id
ORDER BY KeywordsRequest.created ASC;