Optimizing MySQL query with inner join - mysql

I've spent a lot of time optimizing this query but it's starting to slow down with larger tables. I imagine these are probably the worst types of questions but I'm looking for some guidance. I'm not really at liberty to disclose the database schema so hopefully this is enough information. Thanks,
SELECT tblA.id, tblB.id, tblC.id, tblD.id
FROM tblA, tblB, tblC, tblD
INNER JOIN (SELECT max(tblB.id) AS xid
FROM tblB
WHERE tblB.rdd = 11305
GROUP BY tblB.index_id
ORDER BY NULL) AS rddx
ON tblB.id = rddx.xid
WHERE
tblA.id = tblB.index_id
AND tblC.name = tblD.s_type
AND tblD.name = tblA.s_name
GROUP BY tblA.s_name
ORDER BY NULL;
There is a one-to-many relationship between:
tblA.id and tblB.index_id
tblC.name and tblD.s_type
tblD.name and tblA.s_name
+----+-------------+------------+--------+---------------+-----------+---------+------------------------------+-------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+-----------+---------+------------------------------+-------+------------------------------+
| 1 | PRIMARY | derived2 | ALL | NULL | NULL | NULL | NULL | 32568 | Using temporary |
| 1 | PRIMARY | tblB | eq_ref | PRIMARY | PRIMARY | 8 | rddx.xid | 1 | |
| 1 | PRIMARY | tblA | eq_ref | PRIMARY | PRIMARY | 8 | tblB.index_id | 1 | Using where |
| 1 | PRIMARY | tblD | eq_ref | PRIMARY | PRIMARY | 22 | tblA.s_name | 1 | Using where |
| 1 | PRIMARY | tblC | eq_ref | PRIMARY | PRIMARY | 22 | tblD.s_type | 1 | |
| 2 | DERIVED | tblB | ref | rdd_idx | rdd_idx | 7 | | 65722 | Using where; Using temporary |
+----+-------------+------------+--------+---------------+-----------+---------+------------------------------+-------+------------------------------+

Unless I've misunderstood the information that you've provided I believe you could re-write the above query as follows
EXPLAIN SELECT tblA.id, MAX(tblB.id), tblC.id, tblD.id
FROM tblA
LEFT JOIN tblD ON tblD.name = tblA.s_name
LEFT JOIN tblC ON tblC.name = tblD.s_type
LEFT JOIN tblB ON tblA.id = tblB.index_id
WHERE tblB.rdd = 11305
ORDER BY NULL;
I can't provide an EXPLAIN for this, since the plan depends on the data in your database. It would be interesting to see the EXPLAIN output for this query.
Keep in mind that EXPLAIN only gives you an estimate of what will happen. You can use SHOW SESSION STATUS to see in detail what actually happened when a query ran. Make sure to run FLUSH STATUS before the query you are investigating so that the counters start from clean values, and run the query itself rather than its EXPLAIN so that the counters reflect real execution. So in this case you would run
FLUSH STATUS;
SELECT tblA.id, MAX(tblB.id), tblC.id, tblD.id
FROM tblA
LEFT JOIN tblD ON tblD.name = tblA.s_name
LEFT JOIN tblC ON tblC.name = tblD.s_type
LEFT JOIN tblB ON tblA.id = tblB.index_id
WHERE tblB.rdd = 11305
ORDER BY NULL;
SHOW SESSION STATUS LIKE 'ha%';
This gives you a number of indicators to show what actually happened when a query executed.
Handler_read_rnd_next - Number of requests to read next row in the data file
Handler_read_key - Number of requests to read a row based on a key
Handler_read_next - Number of requests to read the next row in key order
Using these values you can see exactly what is going on under the hood.
Unfortunately without knowing the data in the tables, engine type and the data types used in the queries it is quite hard to advise on how you could optimize.

I have updated the query to use explicit joins instead of join conditions in the WHERE clause. That way, as a developer, you can see the relationships between the tables at a glance.
The relationships are A->B, A->D and D->C. Getting the highest tblB ID for the matching "ID = Index_ID" with RDD = 11305 does not require a full sub-query; instead, the MAX() moves up into the field selection list. I would make sure you have an index on tblB on (index_id, rdd). Finally, STRAIGHT_JOIN forces MySQL to join the tables in the order they are listed in the query.
-- EDIT FROM COMMENT --
It appears you are getting NULLs from tblB. This typically indicates a valid tblA record with no tblB record for the same ID that has RDD = 11305. That said, it appears you are only concerned with the entries associated with 11305, so I'm adjusting the query accordingly. Please make sure you have an index on tblB on the "RDD" column (at least in the first position, if it is a multi-column index).
As you can see in this one, I'm pre-querying from table B only for 11305 entries and pre-grouping by the index_ID (as linked to tblA). This gives me one record per index where they will exist... From THIS result, I'm joining back to A, then directly back to B again, but based on that highest match ID found, then D and C as was before. So NOW, you can get any column from any of the tables and get proper record in question... There should be no NULL values left in this query.
Hopefully, I've clarified HOW I'm getting the pieces together for you.
SELECT STRAIGHT_JOIN
PreQuery.HighestPerIndexID,
tblA.id,
tblA.AnotherAField,
tblA.Etc,
tblB.SomeOtherField,
tblB.AnotherField,
tblC.id,
tblD.id
FROM
( select PQ1.Index_ID,
max( PQ1.ID ) as HighestPerIndexID
from tblB PQ1
where PQ1.RDD = 11305
group by PQ1.Index_ID ) PreQuery
JOIN tblA
on PreQuery.Index_ID = tblA.ID
join tblB
on PreQuery.HighestPerIndexID = tblB.ID
join tblD
on tblA.s_Name = tblD.name
join tblC
on tblD.s_type = tblC.Name
ORDER BY
tblA.s_Name
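To make the pre-query pattern above concrete, here is a minimal sketch that runs the same shape of query against a tiny in-memory database. It uses Python's built-in sqlite3 module purely as a stand-in for MySQL (so there is no STRAIGHT_JOIN), and the sample rows, the index name idx_b_rdd_index and the alias pre are made up for illustration.

```python
import sqlite3

# SQLite is used only as a runnable stand-in for MySQL; the data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblA (id INTEGER PRIMARY KEY, s_name TEXT);
    CREATE TABLE tblB (id INTEGER PRIMARY KEY, index_id INTEGER, rdd INTEGER);
    -- rdd first, as suggested in the answer (index name is hypothetical)
    CREATE INDEX idx_b_rdd_index ON tblB (rdd, index_id);
    INSERT INTO tblA VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
    INSERT INTO tblB VALUES
        (10, 1, 11305), (11, 1, 11305), (12, 1, 999),
        (20, 2, 11305),
        (30, 3, 999);
""")

# Step 1 (derived table): highest tblB.id per index_id, only for rdd = 11305.
# Step 2: join back to tblA and to tblB on that highest id.
rows = conn.execute("""
    SELECT tblA.id, tblA.s_name, tblB.id
    FROM (SELECT index_id, MAX(id) AS highest_id
          FROM tblB
          WHERE rdd = 11305
          GROUP BY index_id) AS pre
    JOIN tblA ON tblA.id = pre.index_id
    JOIN tblB ON tblB.id = pre.highest_id
    ORDER BY tblA.s_name
""").fetchall()

print(rows)
```

Row 3 of tblA has no tblB entry with rdd = 11305, so it drops out entirely instead of coming back with NULLs, which is the behaviour the edited answer is aiming for.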

Related

Self join on a huge table with conditions is taking a lot of time, optimize query

I have a master table which has details.
For every session, I want to find all the combinations of each product with every other product in that same session.
create table combinations as
select
a.main_id,
a.sub_id as sub_id_x,
b.sub_id as sub_id_y,
count(*) as count1,
a.dates as rundate
from
master_table a
left join
master_table b
on a.session_id = b.session_id
and a.visit_number = b.visit_number
and a.main_id = b.main_id
and a.sub_id != b.sub_id
where
a.sub_id is not null
and b.sub_id is not null
group by
a.main_id,
a.sub_id,
b.sub_id,
rundate;
I ran EXPLAIN on the query:
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| 1 | SIMPLE | a | NULL | ALL | NULL | NULL | NULL | NULL | 298148 | 90.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | b | NULL | ALL | NULL | NULL | NULL | NULL | 298148 | 0.08 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
The main issue is, my master table consists of 80 million rows. This query is taking more than 24 hours to execute.
All the columns are indexed and I am doing a self join.
Would creating a copy of the table first ('master_table_2') and then joining against it make my query faster?
Is there any way to optimize the query time?
As your table consists of a lot of rows, the join query will take a lot of time if it is not optimized properly and the WHERE clause is not used properly. But an optimized query could save your time and effort. The following link has a good explanation about the optimization of the join queries and its facts -
Optimization of Join Queries
@Marcus Adams has already provided a similar answer here
Another option is to select from each table individually and do the combining in application code. This only pays off under some specific conditions, so you will have to try both approaches (the join query and the application-side processing) and compare their performance. I have gotten better performance with this method on one occasion.
Suppose a join query is like as the following -
SELECT A.a1, B.b1, A.a2
FROM A
INNER JOIN B
ON A.a3=B.b3
WHERE B.b3=C;
What I am suggesting is to query A and B individually, each with its own conditions, and then build the desired result in application code.
N.B.: This is an unorthodox approach and should not be assumed to work in every case.
Hope it helps.
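As a rough sketch of what "process in the code end" could look like for the example join above, the snippet below runs the join once and then rebuilds the same result from two single-table queries. It uses Python's sqlite3 module as a stand-in for MySQL; the sample rows and the value chosen for C are invented for the illustration.

```python
import sqlite3

# In-memory stand-in database; table contents are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (a1 TEXT, a2 TEXT, a3 INTEGER);
    CREATE TABLE B (b1 TEXT, b3 INTEGER);
    INSERT INTO A VALUES ('a-one', 'x', 7), ('a-two', 'y', 7), ('a-three', 'z', 8);
    INSERT INTO B VALUES ('b-one', 7), ('b-two', 8), ('b-three', 7);
""")
C = 7  # the constant from WHERE B.b3 = C (value is an assumption)

# Reference result: the join as written in the answer.
joined = conn.execute(
    "SELECT A.a1, B.b1, A.a2 FROM A INNER JOIN B ON A.a3 = B.b3 "
    "WHERE B.b3 = ? ORDER BY A.a1, B.b1", (C,)).fetchall()

# Application-side version: two single-table queries...
a_rows = conn.execute("SELECT a1, a2 FROM A WHERE a3 = ?", (C,)).fetchall()
b_rows = conn.execute("SELECT b1 FROM B WHERE b3 = ?", (C,)).fetchall()

# ...combined in code. Because both sides are pinned to the same constant,
# every matching A row pairs with every matching B row.
combined = sorted((a1, b1, a2) for a1, a2 in a_rows for (b1,) in b_rows)

print(combined == joined)
```

Whether this beats the join depends heavily on how many rows each side returns, so it is worth measuring both on real data before committing to it.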

sql query optimization sugarcrm

I am trying to optimize sql query in mysql db. Tried various variations with indexes, but nothing helps. Maybe I am missing something
Query:
SELECT count(1) AS fAccounts
from sugarcrm.accounts t4,
( SELECT t3.related_id
FROM sugarcrm.prospect_lists_prospects t3, sugarcrm.prospect_list_campaigns t2
where t3.deleted=0
and t3.related_type='Accounts'
and t3.prospect_list_id=t2.prospect_list_id
and t2.deleted=0
and t2.campaign_id='10909eb7-8080-45b6-8c9f-563b42be91e5'
) t3
where t4.deleted=0
and t4.id=t3.related_id;
Explain:
+----+-------------+------------+--------+---------------------------------------------------+----------------+---------+------------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------------------------------------------+----------------+---------+------------------------------+--------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 5000 | |
| 1 | PRIMARY | t4 | eq_ref | PRIMARY;idx_accnt_id_del;idx_accnt_assigned_del | PRIMARY | 108 | t3.related_id | 1 | Using where |
| 2 | DERIVED | t2 | ref | idx_pro_id;idx_cam_id;idx_prospect_list_campaigns | idx_cam_id | 111 | | 1 | Using where |
| 2 | DERIVED | t3 | ref | idx_plp_pro_id;idx_plp_rel_id_2 | idx_plp_pro_id | 111 | sugarcrm.t2.prospect_list_id | 463968 | Using where |
+----+-------------+------------+--------+---------------------------------------------------+----------------+---------+------------------------------+--------+-------------+
The inner query is the trouble maker. There are two ways it could be performed: Start with t2 and do a "Nested Loop Join" to t3 or vice versa. The Optimizer will look at the WHERE clause and the table sizes and the indexes to estimate which one would be best to start with. Let's give the optimizer the 'best' index for going each way:
Starting with t2:
t2: INDEX(deleted, campaign_id) -- in either order
t3: INDEX(prospect_list_id, deleted, related_type) -- in any order
Starting with t3:
t3: INDEX(deleted, related_type) -- in either order
t2: INDEX(prospect_list_id, deleted, campaign_id) -- in any order
Rather than adding 2 indexes to each table, let's do
t2: INDEX(campaign_id, deleted, prospect_list_id) -- in this order
t3: INDEX(related_type, deleted, prospect_list_id) -- in this order
Similarly, t4 (which will be last) needs
INDEX(deleted, id)
unless it is InnoDB and already has PRIMARY KEY(id), which will be 'clustered' with the data.
There is a problem... When you do a JOIN, then compute aggregates, the JOIN first gives you an explosion of rows, then the COUNT() counts too many of them, thereby getting an inflated number. So, be sure to sanity check the results.
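A small sketch of that row explosion, using Python's sqlite3 module as a stand-in for MySQL. The table list_members and its rows are hypothetical and only loosely modelled on the prospect list tables in the question.

```python
import sqlite3

# Invented data: acc-1 appears on two lists, acc-2 on one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id TEXT PRIMARY KEY);
    CREATE TABLE list_members (list_id INTEGER, related_id TEXT);
    INSERT INTO accounts VALUES ('acc-1'), ('acc-2');
    INSERT INTO list_members VALUES (1, 'acc-1'), (2, 'acc-1'), (1, 'acc-2');
""")

# COUNT(*) after the join counts joined rows, not accounts.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM accounts a
    JOIN list_members m ON m.related_id = a.id
""").fetchone()[0]

# Counting distinct account ids gives the number of accounts involved.
distinct_accounts = conn.execute("""
    SELECT COUNT(DISTINCT a.id)
    FROM accounts a
    JOIN list_members m ON m.related_id = a.id
""").fetchone()[0]

print(joined_rows, distinct_accounts)
```

If the two numbers differ on your real data, decide which one the report actually needs before tuning the query.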
Since the only need for t4 is to verify that the related_id is there, the query could be reformulated as
SELECT COUNT(*) AS fAccounts
FROM prospect_lists_prospects t3
-- Note the use of `JOIN...ON...`:
JOIN prospect_list_campaigns t2 ON t3.prospect_list_id=t2.prospect_list_id
where t3.deleted=0
and t3.related_type='Accounts'
and t2.deleted=0
and t2.campaign_id='10909eb7-8080-45b6-8c9f-563b42be91e5'
AND EXISTS ( SELECT *
FROM accounts t4
WHERE t4.id = t3.related_id
)
This still needs the suggested indexes (one per table).
Since you don't use DISTINCT anywhere, I see no need to bother creating a temporary table. Try this one:
SELECT
count(1) AS fAccounts
from sugarcrm.accounts t4 inner join
sugarcrm.prospect_lists_prospects t3 on t4.id=t3.related_id inner join
sugarcrm.prospect_list_campaigns t2 on t3.prospect_list_id=t2.prospect_list_id
where t3.deleted=0
and t3.related_type='Accounts'
and t2.deleted=0
and t2.campaign_id='10909eb7-8080-45b6-8c9f-563b42be91e5'
and t4.deleted=0

Why is my MySQL query so slow?

I'm trying to figure out why this query is so slow (it takes about 6 seconds to return a result)
SELECT DISTINCT
c.id
FROM
z1
INNER JOIN
c ON (z1.id = c.id)
INNER JOIN
i ON (c.member_id = i.member_id)
WHERE
c.id NOT IN (... big list of ids which should be excluded)
This is execution plan
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
| 1 | SIMPLE | z1 | index | PRIMARY | PRIMARY | 4 | NULL | 318563 | 99.85 | Using where; Using index; Using temporary |
| 1 | SIMPLE | c | eq_ref | PRIMARY,member_id | PRIMARY | 4 | z1.id | 1 | 100.00 | |
| 1 | SIMPLE | i | eq_ref | PRIMARY | PRIMARY | 4 | c.member_id | 1 | 100.00 | Using index |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
Is it because MySQL has to read almost the whole first table? Can it be adjusted?
You can try to replace c with a subquery.
SELECT DISTINCT
c.id
FROM
z1
INNER JOIN
(select c.id, c.member_id
from c
WHERE
c.id NOT IN (... big list of ids which should be excluded)) c ON (z1.id = c.id)
INNER JOIN
i ON (c.member_id = i.member_id)
to leave only necessary id's
It is impossible to say from the information you've provided whether there is a faster way to obtain the same data (we would need to know about data distributions and which foreign keys are mandatory). However, assuming that this is a hierarchical data set, the plan is probably not optimal: the only predicate that reduces the number of rows is c.id NOT IN (...).
The first question to ask yourself when optimizing any query is: do I need all the rows? How many rows is this returning?
I'm struggling to see any utility in a query which returns a list of 'id' values (implying a set of autoincrement integers).
An index generally can't be used for a NOT IN (or <>), hence the most efficient solution is probably to start with a full table scan on 'c', which should be the outcome of StanislavL's query.
Since you don't use the values from i and z1, the joins could be replaced with 'exists', which may help performance.
I would consider creating a compound index for c(id, member_id). This way the query should work at index level only without scanning any rows in tables.
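Here is a hedged sketch of the EXISTS rewrite mentioned above, checked against the original join form on a tiny data set. It uses Python's sqlite3 module as a stand-in for MySQL; the sample rows and the single excluded id are invented, and the real exclusion list from the question is not reproduced.

```python
import sqlite3

# Stand-in schema with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE z1 (id INTEGER PRIMARY KEY);
    CREATE TABLE i (member_id INTEGER PRIMARY KEY);
    CREATE TABLE c (id INTEGER PRIMARY KEY, member_id INTEGER);
    INSERT INTO z1 VALUES (1), (2), (3), (4);
    INSERT INTO i VALUES (100), (200);
    INSERT INTO c VALUES (1, 100), (2, 200), (3, 300), (4, 100), (5, 100);
""")
excluded = (2,)  # placeholder for the "big list of ids"
marks = ",".join("?" * len(excluded))

# Original shape: joins plus DISTINCT.
join_ids = conn.execute(f"""
    SELECT DISTINCT c.id
    FROM z1
    INNER JOIN c ON z1.id = c.id
    INNER JOIN i ON c.member_id = i.member_id
    WHERE c.id NOT IN ({marks})
    ORDER BY c.id
""", excluded).fetchall()

# EXISTS shape: drive from c and only test that matching rows exist.
# Each c row can appear at most once, so no DISTINCT is needed.
exists_ids = conn.execute(f"""
    SELECT c.id
    FROM c
    WHERE c.id NOT IN ({marks})
      AND EXISTS (SELECT 1 FROM z1 WHERE z1.id = c.id)
      AND EXISTS (SELECT 1 FROM i WHERE i.member_id = c.member_id)
    ORDER BY c.id
""", excluded).fetchall()

print(join_ids, exists_ids)
```

Whether MySQL actually runs the EXISTS form faster depends on the version and the data, so compare the EXPLAIN output of both on the real tables.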

Indexes and optimization

I'm not brilliant when it comes to going beyond the basics with MySQL, however, I'm trying to optimize a query:
SELECT DATE_FORMAT(t.completed, '%H') AS hour, t.orderId, t.completed as stamp,
t.deadline as deadline, t.completedBy as user, p.largeFormat as largeFormat
FROM tasks t
JOIN orders o ON o.id=t.orderId
JOIN products p ON p.id=o.productId
WHERE DATE(t.completed) = '2013-09-11'
AND t.type = 7
AND t.completedBy IN ('user1', 'user2')
AND t.suspended = '0'
AND o.shanleys = 0
LIMIT 0,100
+----+-------------+-------+--------+----------------------------+-----------+---------+-----------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------------+-----------+---------+-----------------+-------+-------------+
| 1 | SIMPLE | o | ref | PRIMARY,productId,shanleys | shanleys | 2 | const | 54464 | Using where |
| 1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 4 | sfp.o.productId | 1 | |
| 1 | SIMPLE | t | ref | NewIndex1 | NewIndex1 | 5 | sfp.o.id | 6 | Using where |
+----+-------------+-------+--------+----------------------------+-----------+---------+-----------------+-------+-------------+
Before some of the indexes were added it was performing full table scans on both the p table and the o table.
Basically, I thought that MySQL would:
limit down the rows from the tasks table with the where clauses (should be 84 rows without the joins)
then go through the orders table to the products table to get a flag (largeFormat).
My questions are why does MySQL look up 50000+ rows when it's only got 84 different ids to look for, and is there a way that I can optimize the query?
I'm not able to add new fields or new tables.
Thank you in advance!
SQL needs to work on available indexes to best qualify the query
I would have a compound index on
( type, suspended, completedby, completed)
to match the criteria you have... Your orders and products tables appear ok with their existing indexes.
SELECT
DATE_FORMAT(t.completed, '%H') AS hour,
t.orderId,
t.completed as stamp,
t.deadline,
t.completedBy as user,
p.largeFormat as largeFormat
FROM
tasks t
JOIN orders o
ON t.orderId = o.id
AND o.shanleys = 0
JOIN products p
ON o.productId = p.id
WHERE
t.type = 7
AND t.suspended = 0
AND t.completedBy IN ('user1', 'user2')
AND t.completed >= '2013-09-11'
AND t.completed < '2013-09-12'
LIMIT
0,100
I suspect that suspended is a flag stored as a number (int). If so, compare it against a numeric 0 rather than the quoted string '0'.
For datetime fields, applying a function to the column (such as DATE(t.completed)) prevents MySQL from using an index on that column effectively. So if you only care about one day (or a range, in other queries), compare the raw column against a half-open range:
the datetime field >= '2013-09-11', which implies midnight (00:00:00) at the start of that day,
AND the datetime field < '2013-09-12', which covers everything up to 23:59:59 on 2013-09-11.
That is the entire day, and the index can take advantage of it.
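The equivalence of the two date filters can be checked on a handful of boundary values. The sketch below uses Python's sqlite3 module as a stand-in for MySQL; note that SQLite stores these timestamps as text and compares them as strings, whereas MySQL compares real DATETIME values, so treat it as an illustration of the idea rather than of MySQL's internals. The index name idx_tasks_completed and the rows are made up.

```python
import sqlite3

# Stand-in table with timestamps right on the day boundaries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, completed TEXT);
    CREATE INDEX idx_tasks_completed ON tasks (completed);
    INSERT INTO tasks VALUES
        (1, '2013-09-10 23:59:59'),
        (2, '2013-09-11 00:00:00'),
        (3, '2013-09-11 14:30:00'),
        (4, '2013-09-11 23:59:59'),
        (5, '2013-09-12 00:00:00');
""")

FUNCTION_FILTER = "SELECT id FROM tasks WHERE DATE(completed) = '2013-09-11' ORDER BY id"
RANGE_FILTER = ("SELECT id FROM tasks "
                "WHERE completed >= '2013-09-11' AND completed < '2013-09-12' "
                "ORDER BY id")

by_function = conn.execute(FUNCTION_FILTER).fetchall()
by_range = conn.execute(RANGE_FILTER).fetchall()
print(by_function == by_range)

# Print SQLite's plan for each form; the function form wraps the column,
# so the planner cannot seek into the index on completed for it.
for label, sql in (("function", FUNCTION_FILTER), ("range", RANGE_FILTER)):
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print(label, row[3])
```

On MySQL the analogous check is to run EXPLAIN on both forms of the tasks query and compare the key and rows columns.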

understanding mysql explain

So, I've never understood MySQL's EXPLAIN output. I understand the broad concepts: that you should have at least one entry in the possible_keys column for it to use an index, and that simple queries are better. But what is the difference between ref and eq_ref? What is the best way to go about optimizing queries?
For example, this is my latest query that I'm trying to figure out why it takes forever (generated from django models) :
+----+-------------+---------------------+--------+-----------------------------------------------------------+---------------------------------+---------+--------------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+--------+-----------------------------------------------------------+---------------------------------+---------+--------------------------------------+------+---------------------------------+
| 1 | SIMPLE | T6 | ref | yourock_achiever_achievement_id,yourock_achiever_alias_id | yourock_achiever_alias_id | 4 | const | 244 | Using temporary; Using filesort |
| 1 | SIMPLE | T5 | eq_ref | PRIMARY | PRIMARY | 4 | paul.T6.achievement_id | 1 | Using index |
| 1 | SIMPLE | T4 | ref | yourock_achiever_achievement_id,yourock_achiever_alias_id | yourock_achiever_achievement_id | 4 | paul.T6.achievement_id | 298 | |
| 1 | SIMPLE | yourock_alias | eq_ref | PRIMARY | PRIMARY | 4 | paul.T4.alias_id | 1 | Using index |
| 1 | SIMPLE | yourock_achiever | ref | yourock_achiever_achievement_id,yourock_achiever_alias_id | yourock_achiever_alias_id | 4 | paul.T4.alias_id | 152 | |
| 1 | SIMPLE | yourock_achievement | eq_ref | PRIMARY | PRIMARY | 4 | paul.yourock_achiever.achievement_id | 1 | |
+----+-------------+---------------------+--------+-----------------------------------------------------------+---------------------------------+---------+--------------------------------------+------+---------------------------------+
6 rows in set (0.00 sec)
I had hoped to learn enough about mysql explain that the query wouldn't be needed. Alas, it seems that you can't get enough information from the explain statement and you need the raw SQL. Query :
SELECT `yourock_achievement`.`id`,
`yourock_achievement`.`modified`,
`yourock_achievement`.`created`,
`yourock_achievement`.`string_id`,
`yourock_achievement`.`owner_id`,
`yourock_achievement`.`name`,
`yourock_achievement`.`description`,
`yourock_achievement`.`owner_points`,
`yourock_achievement`.`url`,
`yourock_achievement`.`remote_image`,
`yourock_achievement`.`image`,
`yourock_achievement`.`parent_achievement_id`,
`yourock_achievement`.`slug`,
`yourock_achievement`.`true_points`
FROM `yourock_achievement`
INNER JOIN
`yourock_achiever`
ON `yourock_achievement`.`id` = `yourock_achiever`.`achievement_id`
INNER JOIN
`yourock_alias`
ON `yourock_achiever`.`alias_id` = `yourock_alias`.`id`
INNER JOIN
`yourock_achiever` T4
ON `yourock_alias`.`id` = T4.`alias_id`
INNER JOIN
`yourock_achievement` T5
ON T4.`achievement_id` = T5.`id`
INNER JOIN
`yourock_achiever` T6
ON T5.`id` = T6.`achievement_id`
WHERE
T6.`alias_id` = 6
ORDER BY
`yourock_achievement`.`modified` DESC
Paul:
eq_ref
One row is read from this table for each combination of rows from the previous tables. Other than the system and const types, this is the best possible join type. It is used when all parts of an index are used by the join and the index is a PRIMARY KEY or UNIQUE index.
eq_ref can be used for indexed columns that are compared using the = operator. The comparison value can be a constant or an expression that uses columns from tables that are read before this table. In the following examples, MySQL can use an eq_ref join to process ref_table:
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column=other_table.column;
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column_part1=other_table.column
AND ref_table.key_column_part2=1;
ref
All rows with matching index values are read from this table for each combination of rows from the previous tables. ref is used if the join uses only a leftmost prefix of the key or if the key is not a PRIMARY KEY or UNIQUE index (in other words, if the join cannot select a single row based on the key value). If the key that is used matches only a few rows, this is a good join type.
ref can be used for indexed columns that are compared using the = or <=> operator. In the following examples, MySQL can use a ref join to process ref_table:
SELECT * FROM ref_table WHERE key_column=expr;
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column=other_table.column;
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column_part1=other_table.column
AND ref_table.key_column_part2=1;
These are copied verbatim from the MySQL manual: http://dev.mysql.com/doc/refman/5.0/en/using-explain.html
If you could post your query that is taking forever, I could help pinpoint what is slowing it down. Also, please specify what your definition of forever is. Also, if you could provide your "SHOW CREATE TABLE xxx;" statements for these tables, I could help in optimizing your query as much as possible.
What jumps out at me immediately as a possible point of improvement is the "Using temporary; Using filesort;". This means that a temporary table was created to satisfy the query (not necessarily a bad thing), and that the GROUP BY/ORDER BY you designated could not be retrieved from an index, thus resulting in a filesort.
Your query seems to process (244 * 298 * 152) = 11,052,224 records, which, according to Using temporary; Using filesort, need to be sorted.
This can take a long time.
If you post your query here, we probably will be able to optimize it somehow.
Update:
Your query does indeed perform a number of nested loops and seems to yield a lot of rows, which then need to be sorted.
Could you please run the following query:
SELECT COUNT(*)
FROM `yourock_achievement`
INNER JOIN
`yourock_achiever`
ON `yourock_achievement`.`id` = `yourock_achiever`.`achievement_id`
INNER JOIN
`yourock_alias`
ON `yourock_achiever`.`alias_id` = `yourock_alias`.`id`
INNER JOIN
`yourock_achiever` T4
ON `yourock_alias`.`id` = T4.`alias_id`
INNER JOIN
`yourock_achievement` T5
ON T4.`achievement_id` = T5.`id`
INNER JOIN
`yourock_achiever` T6
ON T5.`id` = T6.`achievement_id`
WHERE
T6.`alias_id` = 6