Optimizing self join in MySQL

The following query takes forever to complete. I've added the indexes on all fields included in the join, tried putting the where conditions into the join and I thought I'd ask for advice before tinkering with FORCE/USE indexes. It just seems that indexes should be used on both sides of this join. Seems only i1 is being used.
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE a ALL i1,i2,i3 2399303 100.00 Using temporary
1 SIMPLE b ref i1,i2,i3 i1 5 db.a.bt 11996 100.00 Using where
create index i1 on obs(bt);
create index i2 on obs(st);
create index i3 on obs(bt,st);
create index i4 on obs(sid);
explain extended
select distinct b.sid
from obs a inner join obs b on a.bt = b.bt and a.st = b.st
where
a.sid != b.sid and
abs( datediff( b.sid_start_date , a.sid_expire_date ) ) < 60;
I've tried both ALTER TABLE and CREATE INDEX above to add indexes to obs.

Since you are not selecting any columns from table a, it may be better to use EXISTS. EXISTS lets you check whether a matching row is present in a table without joining to it, and the engine can stop at the first match instead of producing every matching pair of rows and then de-duplicating them. That often improves performance, although the actual gain depends on the plan the optimizer picks. I also like EXISTS because I think it makes the query easier to understand when you come back to it months later.
select distinct b.sid
from obs b
where exists (Select 1
From obs a
Where a.bt = b.bt
and a.st = b.st
and a.sid != b.sid
and abs( datediff( b.sid_start_date , a.sid_expire_date ) ) < 60);
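To check that the EXISTS form returns the same rows as the self join, here is a minimal sketch. It assumes SQLite (via Python's sqlite3 module) as a stand-in for MySQL, uses a julianday() difference in place of DATEDIFF(), and the four rows are made-up illustration data, not the asker's table.

```python
import sqlite3

# SQLite stands in for MySQL here; the rows are invented illustration data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE obs (
    sid INTEGER, bt INTEGER, st INTEGER,
    sid_start_date TEXT, sid_expire_date TEXT
);
INSERT INTO obs VALUES
    (1, 10, 100, '2020-01-01', '2020-03-01'),
    (2, 10, 100, '2020-03-15', '2020-06-01'),
    (3, 10, 100, '2021-01-01', '2021-06-01'),
    (4, 20, 200, '2020-01-01', '2020-03-01');
""")

# Original shape: self join plus DISTINCT.
join_sql = """
SELECT DISTINCT b.sid
FROM obs a JOIN obs b ON a.bt = b.bt AND a.st = b.st
WHERE a.sid != b.sid
  AND abs(julianday(b.sid_start_date) - julianday(a.sid_expire_date)) < 60
"""

# Rewritten shape: correlated EXISTS, no join in the outer query.
exists_sql = """
SELECT DISTINCT b.sid
FROM obs b
WHERE EXISTS (
    SELECT 1 FROM obs a
    WHERE a.bt = b.bt AND a.st = b.st AND a.sid != b.sid
      AND abs(julianday(b.sid_start_date) - julianday(a.sid_expire_date)) < 60
)
"""

join_result = sorted(row[0] for row in conn.execute(join_sql))
exists_result = sorted(row[0] for row in conn.execute(exists_sql))
print(join_result, exists_result)
```

Only sid 2 starts within 60 days of another row's expiry in the same (bt, st) group, so both queries return just that sid. This says nothing about speed on MySQL; compare the EXPLAIN output of both forms on the real table for that.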

Related

Find employees latest activity is slow when adding ORDER BY

I am working on a legacy system in Laravel and I am trying to pull the latest action of some specific types of actions an employee has done.
Performance is good when I don't add ORDER BY. When adding it the query will go from something like 130 ms to 18 seconds. There are about 1.5 million rows in the actions table.
How do I fix the performance problem?
I have tried to isolate the problem by cutting out all the other parts of the query so it is more readable for you:
SELECT
employees.id,
(
SELECT DATE_FORMAT(actions.date, '%Y-%m-%d')
FROM pivot
JOIN actions
ON pivot.actions_id = actions.id
WHERE employees.id = pivot.employee_id
AND (actions.type = 'meeting'
OR (actions.type = 'phone_call'
AND JSON_VALID(actions.data) = 1
AND actions.data->>'$.update_status' = 1))
LIMIT 1
) AS latest_action
FROM employees
ORDER BY latest_action DESC
I tried using LEFT JOIN and MAX() instead but it didn't seem to solve my problem.
I only used a subquery because the original query is already very complex. But if you have an alternative suggestion, I am all ears.
UPDATE
Result of EXPLAIN:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY employees NULL ALL NULL NULL NULL NULL 15217 10 Using where
2 DEPENDENT SUBQUERY pivot NULL ref actions_type_index,pivot_type_index pivot_type_index 4 dev.employees.id 104 11.11 Using index condition
2 DEPENDENT SUBQUERY actions NULL eq_ref PRIMARY,Logs PRIMARY 4 dev.pivot.actions_id 1 6.68 Using where
UPDATE 2
Here are the indexes. I don't think the employee_type part of the indexes is important for my specific query, but maybe it should be reworked?
# pivot table
KEY `actions_type_index` (`actions_id`,`employee_type`),
KEY `pivot_type_index` (`employee_id`,`employee_type`)
# actions table
KEY `Logs` (`type`,`id`,`is_log`)
# I tried to add `date` index to `actions` table but the problem remains.
KEY `date_index` (`date`)
First of all your query is very non-optimal.
I would rewrite it this way:
SELECT
e.id,
DATE_FORMAT(MAX(a.date), '%Y-%m-%d') AS latest_action
FROM employees e
LEFT JOIN pivot p ON p.employee_id = e.id
LEFT JOIN actions a ON p.actions_id = a.id AND (a.type = 'meeting'
OR (a.type = 'phone_call'
AND JSON_VALID(a.data) = 1
AND a.data->>'$.update_status' = 1))
GROUP BY e.id
ORDER BY latest_action DESC
Obviously there must be indexes on p.employee_id, p.actions_id, a.date. Also would be good on a.type.
Also it would be good to replace a.data->>'$.update_status' with some simple field with an index on it.
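A minimal sketch of the LEFT JOIN + MAX() + GROUP BY shape follows, so the result can be inspected on a tiny data set. It assumes SQLite (Python's sqlite3 module) in place of MySQL, strftime() in place of DATE_FORMAT(), and a hypothetical plain integer column update_status instead of the JSON lookup, as the answer suggests. All rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# update_status is a hypothetical plain column standing in for the JSON field.
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY);
CREATE TABLE actions (
    id INTEGER PRIMARY KEY, type TEXT, date TEXT, update_status INTEGER
);
CREATE TABLE pivot (employee_id INTEGER, actions_id INTEGER);
INSERT INTO employees VALUES (1), (2), (3);
INSERT INTO actions VALUES
    (1, 'meeting',    '2020-01-05 10:00:00', 0),
    (2, 'phone_call', '2020-02-01 09:00:00', 1),
    (3, 'phone_call', '2020-03-01 09:00:00', 0),
    (4, 'meeting',    '2020-01-20 12:00:00', 0),
    (5, 'email',      '2020-05-01 08:00:00', 0);
INSERT INTO pivot VALUES (1, 1), (1, 2), (1, 3), (2, 4), (2, 5);
""")

query = """
SELECT e.id,
       strftime('%Y-%m-%d', MAX(a.date)) AS latest_action
FROM employees e
LEFT JOIN pivot p ON p.employee_id = e.id
LEFT JOIN actions a ON p.actions_id = a.id
     AND (a.type = 'meeting'
          OR (a.type = 'phone_call' AND a.update_status = 1))
GROUP BY e.id
ORDER BY latest_action DESC
"""

latest = conn.execute(query).fetchall()
for employee_id, latest_action in latest:
    print(employee_id, latest_action)
```

Employee 3 has no actions at all and still appears, with a NULL latest_action, because both joins are LEFT joins. Actions that fail the type filter (the email, and the phone call with update_status 0) are ignored by MAX().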

What sequence should indexes have?

I have the following SQL statement, which combines data from 3 tables of a MySQL database. The query takes a very long time to complete. I am trying to use indexes to speed up the process.
SELECT
A.ID_SITE AS OBJECT_ID,
B.SITE_NAME AS OBJECT_NAME,
A.POLYGON,
C.TIME_KEY AS DATE_TIME_KEY,
B.ADDRESS,
B.REGION,
B.DISTRICT,
B.LOCATION,
B.LOCATION_TYPE
FROM TABLE_C AS C
INNER JOIN TABLE_A AS A
ON C.ID_OBJECT = A.ID_SITE
INNER JOIN TABLE_B B
ON A.ID_SITE = B.SITE_ID AND TRACK_IND != 1
WHERE
(C.TIME_KEY BETWEEN '2018-10-01 00:00:00' AND '2018-10-31 23:59:59')
AND
C.ID_TIME_MODE = 3
AND
C.ID_SUBOBJECT_TYPE = 1
AND (
C.CONG_POWER >= 1 OR
C.DIGITAL_POWER >= 3
)
AND
C.ID_OBJECT NOT IN (20158, 26875)
AND
A.MONTH_KEY = '2018-10-01'
I need some advice. In what sequence is the best way to create and use index in my case?
What I did right now:
CREATE INDEX index_a ON TABLE_A (ID_SITE);
CREATE INDEX index_b ON TABLE_B (SITE_ID, TRACK_IND);
CREATE INDEX index_c ON TABLE_C (TIME_KEY, ID_TIME_MODE, ID_SUBOBJECT_TYPE, CONG_POWER, DIGITAL_POWER, ID_OBJECT);
CREATE INDEX index_a_month_key ON TABLE_A (MONTH_KEY);
I also think it might be better to use FORCE INDEX, but I am not sure how to use it correctly in my case.
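For your query, the best indexes would probably be the following. The columns compared with = (ID_TIME_MODE, ID_SUBOBJECT_TYPE) come first and the range column (TIME_KEY) comes after them, because a B-tree index can only narrow the search on columns up to and including the first range condition: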
TABLE_C(ID_TIME_MODE, ID_SUBOBJECT_TYPE, TIME_KEY, CONG_POWER, DIGITAL_POWER, ID_OBJECT)
TABLE_A(ID_SITE, MONTH_KEY)
TABLE_B(SITE_ID)
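The column order can be tried out with a small sketch. It assumes SQLite (Python's sqlite3 module) as a stand-in, so the planner and the EXPLAIN QUERY PLAN output differ from MySQL's EXPLAIN. The index name index_c_eq_first is hypothetical, and the index is shortened to the three filtering columns to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE_C (
    ID_OBJECT INTEGER, TIME_KEY TEXT, ID_TIME_MODE INTEGER,
    ID_SUBOBJECT_TYPE INTEGER, CONG_POWER INTEGER, DIGITAL_POWER INTEGER
);
-- Equality columns first, range column (TIME_KEY) last.
CREATE INDEX index_c_eq_first
    ON TABLE_C (ID_TIME_MODE, ID_SUBOBJECT_TYPE, TIME_KEY);
""")

plan = conn.execute("""
EXPLAIN QUERY PLAN
SELECT ID_OBJECT FROM TABLE_C
WHERE TIME_KEY BETWEEN '2018-10-01 00:00:00' AND '2018-10-31 23:59:59'
  AND ID_TIME_MODE = 3
  AND ID_SUBOBJECT_TYPE = 1
""").fetchall()

# The human-readable plan step is the last column of each row.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

The printed plan should name index_c_eq_first and list the two equality columns together with the TIME_KEY bounds as search constraints. On MySQL, the comparable check is to run EXPLAIN on the real query and look at the key and key_len columns.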

SQL optimization - slow query

Given SQL takes 1.2s:
SELECT DISTINCT contracts.id, jt0.id, jt1.id, jt2.id, jt3.id FROM contracts
LEFT JOIN accounts jt0 ON jt0.id = contracts.account_id AND jt0.deleted=0
LEFT JOIN manufacturers jt1 ON jt1.id = contracts.manufacturer_id AND jt1.deleted=0
LEFT JOIN products jt2 ON jt2.id = contracts.product_id AND jt2.deleted=0
LEFT JOIN users jt3 ON jt3.id = contracts.assigned_user_id AND jt3.deleted=0
WHERE contracts.deleted=0
ORDER BY contracts.application_number ASC
LIMIT 0,21
here is what explain extended returns:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE contracts ref idx_contracts_deleted idx_contracts_deleted 2 const 18968 100.00 Using where; Using temporary; Using filesort
1 SIMPLE jt0 eq_ref PRIMARY,idx_accnt_id_del,idx_accnt_assigned_del PRIMARY 108 xxx.contracts.account_id 1 100.00
1 SIMPLE jt1 eq_ref PRIMARY,idx_manufacturers_id_deleted,idx_manufacturers_deleted PRIMARY 108 xxx.contracts.manufacturer_id 1 100.00
1 SIMPLE jt2 eq_ref PRIMARY,idx_products_id_deleted,idx_products_deleted PRIMARY 108 xxx.contracts.product_id 1 100.00
1 SIMPLE jt3 eq_ref PRIMARY,idx_users_id_del,idx_users_id_deleted,idx_users_deleted PRIMARY 108 xxx.contracts.assigned_user_id 1 100.00
I need the DISTINCT, I need all the joins to be LEFT joins, and I need the ORDER BY and the LIMIT.
Can I optimize it somehow?
These are the only suggestions I've got:
I hope the id's are defined as primary keys and the foreign keys with the relation between the tables.
Maybe the application_number can be indexed (then the sort will be faster)
Maybe, if you are using MyISAM, the sql could be faster if you lock the tables before selecting (don't forget to unlock afterwards)
Try changing the indexes on the subsidiary tables to include the deleted column:
accounts(id, deleted)
manufacturers(id, deleted)
products(id, deleted)
users(id, deleted)
With both columns in the index, MySQL can evaluate the join condition and the deleted = 0 filter from the index alone, without reading the table rows.
Another suggestion is to figure out what is causing the duplication in values and to use subqueries to eliminate the duplicates, rather than distinct.
For instance, with the above indexes:
from contracts c left join
(select id
from accounts
where deleted = 0
group by id
) a
on c.account_id = a.id
. . .
The subquery should only use the index, which might speed things up.
First create necessary index on the following columns.
contracts.application_number, manufacturers.deleted, products.deleted, users.deleted
SELECT DISTINCT contracts.id, jt0.id, jt1.id, jt2.id, jt3.id
FROM contracts
LEFT JOIN accounts jt0
ON jt0.deleted=0 AND jt0.id = contracts.account_id
LEFT JOIN manufacturers jt1
ON jt1.deleted=0 AND jt1.id = contracts.manufacturer_id
LEFT JOIN products jt2
ON jt2.deleted=0 AND jt2.id = contracts.product_id
LEFT JOIN users jt3
ON jt3.deleted=0 AND jt3.id = contracts.assigned_user_id
WHERE contracts.deleted=0
ORDER BY contracts.application_number ASC
LIMIT 0,21
As you mentioned, you already have an index on contracts.deleted, so you could try:
FROM
(SELECT * FROM contracts USE INDEX (<deletedIndexName>) WHERE contracts.deleted = 0) AS contracts
LEFT JOIN
accounts jt0
ON
jt0.id = contracts.account_id
LEFT JOIN
...
Try a little referential integrity? I bet the query is much faster with inner joins. It should be, because the query optimizer has more to work with. You're paying the price at select time for not taking more care with create.
I would also remove deleted rows to their own tables, and strike the deleted columns. Your queries will be simpler and likely run faster.
Try these indexes:
CREATE INDEX PAW_IDX1921682121 ON products(deleted,id);
CREATE INDEX PAW_IDX1196677611 ON manufacturers(deleted,id);
CREATE INDEX PAW_IDX1360881332 ON users(deleted,id);
CREATE INDEX PAW_IDX1028958902 ON accounts(deleted,id);
CREATE INDEX PAW_IDX1564931998 ON contracts(deleted,application_number);

Left JOIN faster or Inner Join faster?

So... which one is faster? (NULL values are not an issue, and the join columns are indexed.)
SELECT * FROM A
JOIN B b ON b.id = a.id
JOIN C c ON c.id = b.id
WHERE A.id = '12345'
Using Left Joins:
SELECT * FROM A
LEFT JOIN B ON B.id=A.bid
LEFT JOIN C ON C.id=B.cid
WHERE A.id = '12345'
Here are the actual queries; both return the same result.
Query (0.2693sec) :
EXPLAIN EXTENDED SELECT *
FROM friend_events, zcms_users, user_events,
EVENTS WHERE friend_events.userid = '13006'
AND friend_events.state =0
AND UNIX_TIMESTAMP( friend_events.t ) >=1258923485
AND friend_events.xid = user_events.id
AND user_events.eid = events.eid
AND events.active =1
AND zcms_users.id = user_events.userid
EXPLAIN
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE zcms_users ALL PRIMARY NULL NULL NULL 43082
1 SIMPLE user_events ref PRIMARY,eid,userid userid 4 zcms_users.id 1
1 SIMPLE events eq_ref PRIMARY,active PRIMARY 4 user_events.eid 1 Using where
1 SIMPLE friend_events eq_ref PRIMARY PRIMARY 8 user_events.id,const 1 Using where
LEFT JOIN query (0.0393 sec):
EXPLAIN EXTENDED SELECT *
FROM `friend_events`
LEFT JOIN `user_events` ON user_events.id = friend_events.xid
LEFT JOIN `events` ON user_events.eid = events.eid
LEFT JOIN `zcms_users` ON user_events.userid = zcms_users.id
WHERE (
events.active =1
)
AND (
friend_events.userid = '13006'
)
AND (
friend_events.state =0
)
AND (
UNIX_TIMESTAMP( friend_events.t ) >=1258923485
)
EXPLAIN
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE friend_events ALL PRIMARY NULL NULL NULL 53113 Using where
1 SIMPLE user_events eq_ref PRIMARY,eid PRIMARY 4 friend_events.xid 1 Using where
1 SIMPLE zcms_users eq_ref PRIMARY PRIMARY 4 user_events.userid 1
1 SIMPLE events eq_ref PRIMARY,active PRIMARY 4 user_events.eid 1 Using where
It depends; run them both to find out; then run an 'explain select' for an explanation.
The actual performance difference may range from "virtually non-existent" to "pretty significant" depending on how many rows in A with id='12345' have no matching records in B and C.
Update (based on posted query plans)
When you use INNER JOIN it doesn't matter (results-wise, not performance-wise) which table to start with, so the optimizer tries to pick the one it thinks would perform best. It seems you have indexes on all the appropriate PK / FK columns, and either you don't have an index on friend_events.userid or there are too many records with userid = '13006' for it to be used; either way the optimizer picks the table with fewer rows as the "base", in this case zcms_users.
When you use LEFT JOIN it does matter (results-wise) which table to start with; thus friend_events is picked. Now why it takes less time that way I'm not quite sure; I'm guessing friend_events.userid condition helps. If you were to add an index (is it really varchar, btw? not numeric?) on that, your INNER JOIN might behave differently (and become faster) as well.
The INNER JOIN has to do an extra check to remove any records from A that don't have matching records in B and C. Depending on the number of records initially returned from A it COULD have an impact.
LEFT JOIN returns all rows from A and adds data from B/C only where the join condition is true. INNER JOIN has to do some extra checking on both tables. So, I guess that explains why LEFT JOIN is faster.
Use EXPLAIN to see the query plan. It's probably the same plan for both cases, so I doubt it makes much difference, assuming there are no rows that don't match. But these are two different queries so it really doesn't make sense to compare them - you should just use the correct one.
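To make the "two different queries" point concrete, here is a small sketch using SQLite through Python's sqlite3 module as a stand-in for MySQL. The tables A and B and the active column are hypothetical and only loosely modelled on the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (id INTEGER PRIMARY KEY, bid INTEGER);
CREATE TABLE B (id INTEGER PRIMARY KEY, active INTEGER);
INSERT INTO A VALUES (1, 10), (2, 99);  -- bid 99 has no matching row in B
INSERT INTO B VALUES (10, 1);
""")

# INNER JOIN drops the A row that has no match in B.
inner_rows = conn.execute(
    "SELECT A.id, B.id FROM A JOIN B ON B.id = A.bid ORDER BY A.id"
).fetchall()

# LEFT JOIN keeps it and fills the B columns with NULL.
left_rows = conn.execute(
    "SELECT A.id, B.id FROM A LEFT JOIN B ON B.id = A.bid ORDER BY A.id"
).fetchall()

# A WHERE condition on the right-hand table rejects the NULL-extended rows,
# so this LEFT JOIN returns the same rows as the INNER JOIN.
left_filtered_rows = conn.execute(
    "SELECT A.id, B.id FROM A LEFT JOIN B ON B.id = A.bid "
    "WHERE B.active = 1 ORDER BY A.id"
).fetchall()

print(inner_rows, left_rows, left_filtered_rows)
```

The third query mirrors the posted LEFT JOIN query, where events.active = 1 sits in the WHERE clause. That filter is one reason both posted queries can return the same result even though the join types differ.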
Why not use the "INNER JOIN" keyword instead of "LEFT JOIN"?

How to optimize query looking for rows where conditional join rows do not exist?

I've got a table of keywords that I regularly refresh against a remote search API, and I have another table that gets a row each time I refresh one of the keywords. I use this table to block multiple processes from stepping on each other and refreshing the same keyword, as well as for stat collection. So when I spin up my program, it queries for all the keywords that don't have a request currently in process and don't have a successful one within the last 15 minutes, or whatever the interval is. All was working fine for a while, but now the keywords_requests table has almost 2 million rows in it and things are bogging down badly. I've got indexes on almost every column in the keywords_requests table, but to no avail.
I'm logging slow queries and this one is taking forever, as you can see. What can I do?
# Query_time: 20 Lock_time: 0 Rows_sent: 568 Rows_examined: 1826718
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT JOIN `keywords_requests` as KeywordsRequest
ON (
KeywordsRequest.keyword_id = Keyword.id
AND (KeywordsRequest.status = 'success' OR KeywordsRequest.status = 'active')
AND KeywordsRequest.source_id = '29'
AND KeywordsRequest.created > FROM_UNIXTIME(1234551323)
)
WHERE KeywordsRequest.id IS NULL
GROUP BY Keyword.id
ORDER BY KeywordsRequest.created ASC;
It seems your most selective index on Keywords is one on KeywordRequest.created.
Try to rewrite query this way:
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN (
SELECT *
FROM `keywords_requests` as kr
WHERE created > FROM_UNIXTIME(1234567890) /* Happy unix_time! */
) AS KeywordsRequest
ON (
KeywordsRequest.keyword_id = Keyword.id
AND (KeywordsRequest.status = 'success' OR KeywordsRequest.status = 'active')
AND KeywordsRequest.source_id = '29'
)
WHERE keyword_id IS NULL;
It will (hopefully) hash join two not so large sources.
And Bill Karwin is right, you don't need the GROUP BY or ORDER BY
There is no fine control over the plans in MySQL, but you can try (try) to improve your query in the following ways:
Create a composite index on (keyword_id, status, source_id, created) and make it so:
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN `keywords_requests` kr_success
ON (
kr_success.keyword_id = Keyword.id
AND kr_success.status = 'success'
AND kr_success.source_id = '29'
AND kr_success.created > FROM_UNIXTIME(1234567890)
)
LEFT OUTER JOIN `keywords_requests` kr_active
ON (
kr_active.keyword_id = Keyword.id
AND kr_active.status = 'active'
AND kr_active.source_id = '29'
AND kr_active.created > FROM_UNIXTIME(1234567890)
)
WHERE kr_success.keyword_id IS NULL
AND kr_active.keyword_id IS NULL
This ideally should use NESTED LOOPS on your index.
Create a composite index on (status, source_id, created) and make it so:
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN (
SELECT *
FROM `keywords_requests` kr
WHERE
status = 'success'
AND source_id = '29'
AND created > FROM_UNIXTIME(1234567890)
UNION ALL
SELECT *
FROM `keywords_requests` kr
WHERE
status = 'active'
AND source_id = '29'
AND created > FROM_UNIXTIME(1234567890)
) AS kr
ON kr.keyword_id = Keyword.id
WHERE kr.keyword_id IS NULL
This will hopefully use HASH JOIN on even more restricted hash table.
When diagnosing MySQL query performance, one of the first things you need to analyze is the report from EXPLAIN.
If you learn to read the information EXPLAIN gives you, then you can see where queries are failing to make use of indexes, or where they are causing expensive filesorts, or other performance red flags.
I notice in your query, the GROUP BY is irrelevant, since there will be only one NULL row returned from KeywordRequests. Also the ORDER BY is irrelevant, since you're ordering by a column that will always be NULL due to your WHERE clause. If you remove these clauses, you'll probably eliminate a filesort.
Also consider rewriting the query into other forms, and measure the performance of each. For example:
SELECT k.id, k.keyword
FROM `keywords` AS k
WHERE NOT EXISTS (
SELECT * FROM `keywords_requests` AS kr
WHERE kr.keyword_id = k.id
AND kr.status IN ('success', 'active')
AND kr.source_id = '29'
AND kr.created > FROM_UNIXTIME(1234551323)
);
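Before measuring the two forms against each other, it is worth confirming that they return the same keywords. The sketch below assumes SQLite (Python's sqlite3 module) instead of MySQL, stores created as plain Unix seconds so FROM_UNIXTIME() is not needed, and uses invented rows and a made-up cutoff of 1000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE keywords (id INTEGER PRIMARY KEY, keyword TEXT);
CREATE TABLE keywords_requests (
    id INTEGER PRIMARY KEY, keyword_id INTEGER,
    status TEXT, source_id INTEGER, created INTEGER
);
INSERT INTO keywords VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma'), (4, 'delta');
INSERT INTO keywords_requests VALUES
    (1, 1, 'success', 29, 2000),  -- recent success: keyword 1 is blocked
    (2, 2, 'success', 29,  500),  -- too old to count
    (3, 3, 'active',  29, 2000),  -- in progress: keyword 3 is blocked
    (4, 4, 'failed',  29, 2000),  -- wrong status
    (5, 2, 'success', 30, 2000);  -- other source
""")
cutoff = 1000  # made-up cutoff in Unix seconds

# Anti-join written as LEFT JOIN ... IS NULL.
left_join_sql = """
SELECT k.id FROM keywords AS k
LEFT JOIN keywords_requests AS kr
       ON kr.keyword_id = k.id
      AND kr.status IN ('success', 'active')
      AND kr.source_id = 29
      AND kr.created > ?
WHERE kr.id IS NULL
ORDER BY k.id
"""

# The same anti-join written as NOT EXISTS.
not_exists_sql = """
SELECT k.id FROM keywords AS k
WHERE NOT EXISTS (
    SELECT 1 FROM keywords_requests AS kr
    WHERE kr.keyword_id = k.id
      AND kr.status IN ('success', 'active')
      AND kr.source_id = 29
      AND kr.created > ?
)
ORDER BY k.id
"""

left_join_ids = [row[0] for row in conn.execute(left_join_sql, (cutoff,))]
not_exists_ids = [row[0] for row in conn.execute(not_exists_sql, (cutoff,))]
print(left_join_ids, not_exists_ids)
```

Both forms return keywords 2 and 4. Which one runs faster on MySQL depends on the version and the available indexes, so measure each against the real table.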
Other tips:
Is kr.source_id an integer? If so, compare to the integer 29 instead of the string '29'.
Are there appropriate indexes on keyword_id, status, source_id, created? Perhaps even a compound index over all four columns would be best, since MySQL typically uses only one index per table reference in a given query.
You did a screenshot of your EXPLAIN output and posted a link in the comments. I see that the query is not using an index from Keywords, which makes sense since you're scanning every row in that table anyway. The phrase "Not exists" indicates that MySQL has optimized the LEFT OUTER JOIN a bit.
I think this should be improved over your original query. The GROUP BY/ORDER BY was probably causing it to save an intermediate data set as a temporary table, and sorting it on disk (which is very slow!). What you'd look for is "Using temporary; using filesort" in the Extra column of EXPLAIN information.
So you may have improved it enough already to mitigate the bottleneck for now.
I do notice that the possible keys probably indicate that you have individual indexes on four columns. You may be able to improve that by creating a compound index:
CREATE INDEX kr_cover ON keywords_requests
(keyword_id, created, source_id, status);
You can give MySQL a hint to use a specific index:
... FROM `keywords_requests` AS kr USE INDEX (kr_cover) WHERE ...
Dunno about MySQL, but in MSSQL the lines of attack I would take are:
1) Create a covering index on KeywordsRequest status, source_id and created
2) UNION the results to get around the OR on KeywordsRequest.status
3) Use NOT EXISTS instead of the outer join (and try with UNION instead of OR too)
Try this
SELECT Keyword.id, Keyword.keyword
FROM keywords as Keyword
LEFT JOIN (select * from keywords_requests
where source_id = '29' and (status = 'success' OR status = 'active')
AND created > FROM_UNIXTIME(1234551323)
) as KeywordsRequest
ON (
KeywordsRequest.keyword_id = Keyword.id
)
WHERE KeywordsRequest.id IS NULL;