joining table in mysql not using index properly? - mysql

I have four tables that I am trying to join and output the result to a new table. My code looks like this:
create table tbl
select a.dte, a.permno, (ret - rf) f0_xs_ret, (xs_ret - (betav*xs_mkt)) f0_resid, mkt_cap last_year_mkt_cap, betav beta_value
from a inner join b using (dte)
inner join c on (year(a.dte) = c.yr and a.permno = c.permno)
inner join d on (a.permno = d.permno and year(a.dte)-1 = year(d.dte));
All of the tables have multiple indices and for table a, (dte, permno) identify a unique record, for table b, dte id's a unique record, for table c, (yr, permno) id a unique record and for table d, (dte, permno) id a unique record. the explain from the select part of the query is:
+----+-------------+-------+--------+-------------------+---------+---------+---------- ------------------------+--------+-------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------+---------+---------+---------- ------------------------+--------+-------------------+
| 1 | SIMPLE | d | ALL | idx1 | NULL | NULL | NULL | 264129 | |
| 1 | SIMPLE | c | ref | idx2 | idx2 | 4 | achernya.d.permno | 16 | |
| 1 | SIMPLE | b | ALL | PRIMARY,idx2 | NULL | NULL | NULL | 12336 | Using join buffer |
| 1 | SIMPLE | a | eq_ref | PRIMARY,idx1,idx2 | PRIMARY | 7 | achernya.b.dte,achernya.d.permno | 1 | Using where |
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
Why does mysql have to read so many rows to process this thing? and if i am reading this correctly, it has to read (264129*16*12336) rows which should take a good month.
Could someone please explain what's going on here?

MySQL has to read the rows because you're using functions as your join conditions. An index on dte will not help resolve YEAR(dte) in a query. If you want to make this fast, then put the year in its own column to use in joins and move the index to that column, even if that means some denormalization.
As for the other columns in your index that you don't apply functions to, they may not be used if the index won't provide much benefit, or they aren't the leftmost column in the index and you don't use the leftmost prefix of that index in your join condition.
Sometimes MySQL does not use an index, even if one is available. One circumstance under which this occurs is when the optimizer estimates that using the index would require MySQL to access a very large percentage of the rows in the table. (In this case, a table scan is likely to be much faster because it requires fewer seeks.)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html

Related

Bad query performance. Where do I need to create the indexes?

I have a relativelly simple query, but it's performance is really bad.
I'm currently at something like 0.800 sec. per query.
Is there anything that I could change to make it faster?
I've already tried indexing the columns, used in where statement and join, but nothing works.
Here's the query that I'm using:
SELECT c.bloqueada, c.nomeF, c.confirmadoData
FROM encomendas_c_linhas el1
LEFT JOIN encomendas_c c ON c.lojaEncomenda=el1.lojaEncomenda
WHERE (el1.artigoE IN (342197) OR el1.artigoD IN (342197))
AND el1.desmembrado = 1
Here's the EXPLAIN:
As Bill Karwin asked, here follows the Query used to create the table:
Table "encomendas_c_linhas"
https://pastebin.com/73WcsDnE
Table "encomendas_c"
https://pastebin.com/yCUx3wh0
In your EXPLAIN, we see that it's accessing the el1 table with type: ALL which means it's scanning the whole table. The rows field shows an estimate of 644,236 rows scanned (this is approximate).
In your table definition for the el1 table, you have this index:
KEY `order_item_id_desmembrado` (`order_item_id`,`desmembrado`),
Even though desmembrado appears in an index, that index will not help this query. Your query searches for desmembrado = 1, with no condition for order_item_id.
Think of a telephone book: I can search for people with last name 'Smith' and the order of the phone book helps me find them quickly. But if I search for people with the first name 'Sarah' the book doesn't help. Only if I search on the leftmost column(s) of the index does it help.
So you need an index with desmembrado as the leftmost column. Then the search for desmembrado = 1 might use the index to select those matching rows.
ALTER TABLE encomendas_c_linhas ADD INDEX (desmembrado);
Note that if a large enough portion of the table matches, MySQL skips that index anyway. There's no advantage in using the index if it will match a large portion of rows. In my experience, MySQL's optimizer's judgement is that it avoids the index if the condition matches > 20% of the rows of the table.
The other conditions are in a disjunction (terms of an OR expression). There's no way to optimize these with a single index. Again, the telephone book example: Search for people with last name 'Smith' OR first name 'Sarah'. The last name lookup is optimized, but the first name lookup is not. No problem, we can make another index with first name listed first, so the first name lookup will be optimized. But in general, MySQL will use only one index per table reference per query. So optimizing the first name lookup will spoil the last name lookup.
Here's a workaround: Rewrite the OR condition into a UNION of two queries, with one term in each query:
SELECT c.bloqueada, c.nomeF, c.confirmadoData
FROM encomendas_c_linhas el1
LEFT JOIN encomendas_c c ON c.lojaEncomenda=el1.lojaEncomenda
WHERE el1.artigoD IN ('342197')
AND el1.desmembrado = 1
UNION
SELECT c.bloqueada, c.nomeF, c.confirmadoData
FROM encomendas_c_linhas el1
LEFT JOIN encomendas_c c ON c.lojaEncomenda=el1.lojaEncomenda
WHERE el1.artigoE IN ('342197')
AND el1.desmembrado = 1;
Make sure there's an index for each case.
ALTER TABLE encomendas_c_linhas
ADD INDEX des_artigoD (desmembrado, artigoD),
ADD INDEX des_artigoE (desmembrado, artigoE);
Each of these compound indexes may be used in the respective subquery, so it will optimize the lookup of two columns in each case.
Also notice I put the values in quotes, like IN ('342197') because the columns are varchar, and you need to compare to a varchar to make use of the index. Comparing a varchar column to an integer value will make the match successfully, but will not use the index.
Here's an EXPLAIN I tested for the previous query, that shows the two new indexes are used, and it shows ref: const,const which means both columns of the index are used for the lookup.
+----+--------------+------------+------+-------------------------+---------------+---------+------------------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+------------+------+-------------------------+---------------+---------+------------------------+------+-----------------------+
| 1 | PRIMARY | el1 | ref | des_artigoD,des_artigoE | des_artigoD | 41 | const,const | 1 | Using index condition |
| 1 | PRIMARY | c | ref | lojaEncomenda | lojaEncomenda | 61 | test.el1.lojaEncomenda | 2 | NULL |
| 2 | UNION | el1 | ref | des_artigoD,des_artigoE | des_artigoE | 41 | const,const | 1 | Using index condition |
| 2 | UNION | c | ref | lojaEncomenda | lojaEncomenda | 61 | test.el1.lojaEncomenda | 2 | NULL |
| NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | Using temporary |
+----+--------------+------------+------+-------------------------+---------------+---------+------------------------+------+-----------------------+
But we can do one step better. Sometimes adding other columns to the index is helpful, because if all columns needed by the query are included in the index, then it doesn't have to read the table rows at all. This is called a covering index and it's indicated if you see "Using index" in the EXPLAIN Extra field.
So here's the new index definition:
ALTER TABLE encomendas_c_linhas
ADD INDEX des_artigoD (desmembrado, artigoD, lojaEncomenda),
ADD INDEX des_artigoE (desmembrado, artigoE, lojaEncomenda);
The third column is not used for lookup, but it's used when joining to the other table c.
You can also get the same covering index effect by creating the second index suggested in the answer by #TheImpaler:
create index ix2 on encomendas_c (lojaEncomenda, bloqueada, nomeF, confirmadoData);
We see the EXPLAIN now shows the "Using index" note for all table references:
+----+--------------+------------+------+-------------------------+-------------+---------+------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+------------+------+-------------------------+-------------+---------+------------------------+------+--------------------------+
| 1 | PRIMARY | el1 | ref | des_artigoD,des_artigoE | des_artigoD | 41 | const,const | 1 | Using where; Using index |
| 1 | PRIMARY | c | ref | lojaEncomenda,ix2 | ix2 | 61 | test.el1.lojaEncomenda | 1 | Using index |
| 2 | UNION | el1 | ref | des_artigoD,des_artigoE | des_artigoE | 41 | const,const | 1 | Using where; Using index |
| 2 | UNION | c | ref | lojaEncomenda,ix2 | ix2 | 61 | test.el1.lojaEncomenda | 1 | Using index |
| NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | Using temporary |
+----+--------------+------------+------+-------------------------+-------------+---------+------------------------+------+--------------------------+
I would create the following indexes:
create index ix1 on encomendas_c_linhas (artigoE, artigoD, desmembrado);
create index ix2 on encomendas_c (lojaEncomenda, bloqueada, nomeF, confirmadoData);
The first one is the critical one. The second one will improve performance further.

Slow mysql query, join on huge table, not using indexes

SELECT archive.id, archive.file, archive.create_timestamp, archive.spent
FROM archive LEFT JOIN submissions ON archive.id = submissions.id
WHERE submissions.id is NULL
AND archive.file is not NULL
AND archive.create_timestamp < DATE_SUB(NOW(), INTERVAL 6 month)
AND spent = 0
ORDER BY archive.create_timestamp ASC LIMIT 10000
EXPLAIN result:
+----+-------------+--------------------+--------+--------------------------------+------------------+---------+--------------------------------------------+-----------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+--------+--------------------------------+------------------+---------+--------------------------------------------+-----------+--------------------------------------+
| 1 | SIMPLE | archive | range | create_timestamp,file_in | create_timestamp | 4 | NULL | 111288502 | Using where |
| 1 | SIMPLE | submissions | eq_ref | PRIMARY | PRIMARY | 4 | production.archive.id | 1 | Using where; Using index; Not exists |
+----+-------------+--------------------+--------+--------------------------------+------------------+---------+--------------------------------------------+-----------+--------------------------------------+
I've tried hinting use of indexes for archive table with:
USE INDEX (create_timestamp,file_in)
Archive table is huge, ~150mil records.
Any help with speeding up this query would be greatly appreciated.
You want to use a composite index. For this query:
create index archive_file_spent_createts on archive(file, spent, createtimestamp);
In such an index, you want the where conditions with = to come first, followed by up to one column with an inequality. In this case, I'm not sure if MySQL will use the index for the order by.
This assumes that the where conditions do, indeed, significantly reduce the size of the data.

Why is my MySQL query is so slow?

I'm trying to figure out why that query so slow (take about 6 second to get result)
SELECT DISTINCT
c.id
FROM
z1
INNER JOIN
c ON (z1.id = c.id)
INNER JOIN
i ON (c.member_id = i.member_id)
WHERE
c.id NOT IN (... big list of ids which should be excluded)
This is execution plan
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
| 1 | SIMPLE | z1 | index | PRIMARY | PRIMARY | 4 | NULL | 318563 | 99.85 | Using where; Using index; Using temporary |
| 1 | SIMPLE | c | eq_ref | PRIMARY,member_id | PRIMARY | 4 | z1.id | 1 | 100.00 | |
| 1 | SIMPLE | i | eq_ref | PRIMARY | PRIMARY | 4 | c.member_id | 1 | 100.00 | Using index |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
is it because mysql has to take out almost whole 1st table ? Can it be adjusted ?
You can try to replace c with a subquery.
SELECT DISTINCT
c.id
FROM
z1
INNER JOIN
(select c.id
from c
WHERE
c.id NOT IN (... big list of ids which should be excluded)) c ON (z1.id = c.id)
INNER JOIN
i ON (c.member_id = i.member_id)
to leave only necessary id's
It is imposible to say from the information you've provided whether there is a faster solution to obtaining the same data (we would need to know abou data distributions and what foreign keys are obligatory). However assuming that this is a hierarchical data set, then the plan is probably not optimal: the only predicate to reduce the number of rows is c.id NOT IN.....
The first question to ask yourself when optimizing any query is Do I need all the rows? How many rows is this returning?
I'm struggling to see any utlity in a query which returns a list of 'id' values (implying a set of autoincrement integers).
You can't use an index for a NOT IN (or <>) hence the most eficient solution is probably to start with a full table scan on 'c' - which should be the outcome of StanislavL's query.
Since you don't use the values from i and z, the joins could be replaced with 'exists' which may help performance.
I would consider creating a compound index for c(id, member_id). This way the query should work at index level only without scanning any rows in tables.

MySQL Query Optimization; SELECT multiple fields vs. JOIN

We've got a relatively straightforward query that does LEFT JOINs across 4 tables. A is the "main" table or the top-most table in the hierarchy. B links to A, C links to B. Furthermore, X links to A. So the hierarchy is basically
A
C => B => A
X => A
The query is essentially:
SELECT
a.*, b.*, c.*, x.*
FROM
a
LEFT JOIN b ON b.a_id = a.id
LEFT JOIN c ON c.b_id = b.id
LEFT JOIN x ON x.a_id = a.id
WHERE
b.flag = true
ORDER BY
x.date DESC
LIMIT 25
Via EXPLAIN, I've confirmed that the correct indexes are in place, and that the built-in MySQL query optimizer is using those indexes correctly and properly.
So here's the strange part...
When we run the query as is, it takes about 1.1 seconds to run.
However, after doing some checking, it seems that if I removed most of the SELECT fields, I get a significant speed boost.
So if instead we made this into a two-step query process:
First query same as above except change the SELECT clause to only SELECT a.id instead of SELECT *
Second query also same as above, except change the WHERE clause to only do an a.id IN agains the result of Query 1 instead of what we have before
The result is drastically different. It's .03 seconds for the first query and .02 for the second query.
Doing this two-step query in code essentially gives us a 20x boost in performance.
So here's my question:
Shouldn't this type of optimization already be done within the DB engine? Why does the difference in which fields that are actually SELECTed make a difference on the overall performance of the query?
At the end of the day, it's merely selecting the exact same 25 rows and returning the exact same full contents of those 25 rows. So, why the wide disparity in performance?
ADDED 2012-08-24 13:02 PM PDT
Thanks eggyal and invertedSpear for the feedback. First off, it's not a caching issue -- I've run tests running both queries multiple times (about 10 times) alternating between each approach. The result averages at 1.1 seconds for the first (single query) approach and .03+.02 seconds for the second (2 query) approach.
In terms of indexes, I thought I had done an EXPLAIN to ensure that we're going thru the keys, and for the most part we are. However, I just did a quick check again and one interesting thing to note:
The slower "single query" approach doesn't show the Extra note of "Using index" for the third line:
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | t1 | index | PRIMARY | shop_group_id_idx | 5 | NULL | 102 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t1.organization_id | 1 | Using where |
| 1 | SIMPLE | t0 | ref | bundle_idx,shop_id_idx | shop_id_idx | 4 | dbmodl_v18.t1.organization_id | 309 | |
| 1 | SIMPLE | t3 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t0.id | 1 | |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
While it does show "Using index" for when we query for just the IDs:
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | t1 | index | PRIMARY | shop_group_id_idx | 5 | NULL | 102 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t1.organization_id | 1 | Using where |
| 1 | SIMPLE | t0 | ref | bundle_idx,shop_id_idx | shop_id_idx | 4 | dbmodl_v18.t1.organization_id | 309 | Using index |
| 1 | SIMPLE | t3 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t0.id | 1 | |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
The strange thing is that both do list the correct index being used... but I guess it begs the questions:
Why are they different (considering all the other clauses are the exact same)? And is this an indication of why it's slower?
Unfortunately, the MySQL docs do not give much information for when the "Extra" column is blank/null in the EXPLAIN results.
More important than speed, you have a flaw in your query logic. When you test a LEFT JOINed column in the WHERE clause (other than testing for NULL), you force that join to behave as if it were an INNER JOIN. Instead, you'd want:
SELECT
a.*, b.*, c.*, x.*
FROM
a
LEFT JOIN b ON b.a_id = a.id
AND b.flag = true
LEFT JOIN c ON c.b_id = b.id
LEFT JOIN x ON x.a_id = a.id
ORDER BY
x.date DESC
LIMIT 25
My next suggestion would be to examine all of those .*'s in your SELECT. Do you really need all the columns from all the tables?

SQL Query Optimization

I am trying to speed up this django app (note: I didn't design this... just stuck maintaining it) and the biggest bottle neck seems to be these queries that are being generated by the admin. We have a content class that 4-5 other sub-classes inherit from and anytime the master list is pulled up in the admin a query like this is generated:
SELECT `content_content`.`id`,
`content_content`.`issue_id`,
`content_content`.`slug`,
`content_content`.`section_id`,
`content_content`.`priority`,
`content_content`.`group_id`,
`content_content`.`rotatable`,
`content_content`.`pub_status`,
`content_content`.`created_on`,
`content_content`.`modified_on`,
`content_content`.`old_pk`,
`content_content`.`content_type_id`,
`content_image`.`content_ptr_id`,
`content_image`.`caption`,
`content_image`.`kicker`,
`content_image`.`pic`,
`content_image`.`crop_x`,
`content_image`.`crop_y`,
`content_image`.`crop_side`,
`content_issue`.`id`,
`content_issue`.`special_issue_name`,
`content_issue`.`web_publish_date`,
`content_issue`.`issue_date`,
`content_issue`.`fm_name`,
`content_issue`.`arts_name`,
`content_issue`.`comments`,
`content_section`.`id`,
`content_section`.`name`,
`content_section`.`audiodizer_id`
FROM `content_image`
INNER
JOIN `content_content`
ON `content_image`.`content_ptr_id` = `content_content`.`id`
INNER
JOIN `content_issue`
ON `content_content`.`issue_id` = `content_issue`.`id`
INNER
JOIN `content_section`
ON `content_content`.`section_id` = `content_section`.`id`
WHERE NOT ( `content_content`.`pub_status` = -1 )
ORDER BY `content_issue`.`issue_date` DESC LIMIT 30
I ran an EXPLAIN on this and got the following:
+----+-------------+-----------------+--------+-------------------------------------------------------------------------------------------------+---------+---------+--------------------------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+--------+-------------------------------------------------------------------------------------------------+---------+---------+--------------------------------------+-------+---------------------------------+
| 1 | SIMPLE | content_image | ALL | PRIMARY | NULL | NULL | NULL | 40499 | Using temporary; Using filesort |
| 1 | SIMPLE | content_content | eq_ref | PRIMARY,issue_id,content_content_issue_id,content_content_section_id,content_content_pub_status | PRIMARY | 4 | content_image.content_ptr_id | 1 | Using where |
| 1 | SIMPLE | content_section | eq_ref | PRIMARY | PRIMARY | 4 | content_content.section_id | 1 | |
| 1 | SIMPLE | content_issue | eq_ref | PRIMARY | PRIMARY | 4 | content_content.issue_id | 1 | |
+----+-------------+-----------------+--------+-------------------------------------------------------------------------------------------------+---------+---------+--------------------------------------+-------+---------------------------------+
Now, from what I've read, I need to somehow figure out how to make the query to content_image not be terrible; however, I'm drawing a blank on where to start.
Currently, judging by the execution plan, MySQL is starting with content_image, retrieving all rows, and only thereafter using primary keys on the other tables: content_image has a foreign key to content_content, and content_content has foreign keys to content_issue and content_section. Also, only after all the joins are complete can it make much use of the ORDER BY content_issue.issue_date DESC LIMIT 30, since it can't tell which of these joins might fail, and therefore, how many records from content_issue will really be needed before it can get the first thirty rows of output.
So, I would try the following:
Change JOIN content_issue to JOIN (SELECT * FROM content_issue ORDER BY issue_date DESC LIMIT 30) content_issue. This will allow MySQL, if it starts with content_issue and works its way to the other tables, to grab a very small subset of content_issue.
Note: properly speaking, this changes the semantics of the query: it means that only records from at most the last 30 content_issues will be retrieved, and therefore that if some of those issues don't have published contents with images, then fewer than 30 records will be retrieved. I don't have enough information about your data to gauge whether this change of semantics would actually change the results you get.
Also note: I'm not suggesting to remove the ORDER BY content_issue.issue_date DESC LIMIT 30 from the end of the query. I think you want it in both places.
Add an index on content_issue.issue_date, to optimize the above subquery.
Add an index on content_image.content_ptr_id, so MySQL can work its way from content_content to content_image without doing a full table scan.