SQL Query Optimization - mysql

I am trying to speed up this Django app (note: I didn't design this... just stuck maintaining it) and the biggest bottleneck seems to be the queries generated by the admin. We have a content class that 4-5 other sub-classes inherit from, and any time the master list is pulled up in the admin a query like this is generated:
SELECT `content_content`.`id`,
`content_content`.`issue_id`,
`content_content`.`slug`,
`content_content`.`section_id`,
`content_content`.`priority`,
`content_content`.`group_id`,
`content_content`.`rotatable`,
`content_content`.`pub_status`,
`content_content`.`created_on`,
`content_content`.`modified_on`,
`content_content`.`old_pk`,
`content_content`.`content_type_id`,
`content_image`.`content_ptr_id`,
`content_image`.`caption`,
`content_image`.`kicker`,
`content_image`.`pic`,
`content_image`.`crop_x`,
`content_image`.`crop_y`,
`content_image`.`crop_side`,
`content_issue`.`id`,
`content_issue`.`special_issue_name`,
`content_issue`.`web_publish_date`,
`content_issue`.`issue_date`,
`content_issue`.`fm_name`,
`content_issue`.`arts_name`,
`content_issue`.`comments`,
`content_section`.`id`,
`content_section`.`name`,
`content_section`.`audiodizer_id`
FROM `content_image`
INNER
JOIN `content_content`
ON `content_image`.`content_ptr_id` = `content_content`.`id`
INNER
JOIN `content_issue`
ON `content_content`.`issue_id` = `content_issue`.`id`
INNER
JOIN `content_section`
ON `content_content`.`section_id` = `content_section`.`id`
WHERE NOT ( `content_content`.`pub_status` = -1 )
ORDER BY `content_issue`.`issue_date` DESC LIMIT 30
I ran an EXPLAIN on this and got the following:
+----+-------------+-----------------+--------+-------------------------------------------------------------------------------------------------+---------+---------+--------------------------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+--------+-------------------------------------------------------------------------------------------------+---------+---------+--------------------------------------+-------+---------------------------------+
| 1 | SIMPLE | content_image | ALL | PRIMARY | NULL | NULL | NULL | 40499 | Using temporary; Using filesort |
| 1 | SIMPLE | content_content | eq_ref | PRIMARY,issue_id,content_content_issue_id,content_content_section_id,content_content_pub_status | PRIMARY | 4 | content_image.content_ptr_id | 1 | Using where |
| 1 | SIMPLE | content_section | eq_ref | PRIMARY | PRIMARY | 4 | content_content.section_id | 1 | |
| 1 | SIMPLE | content_issue | eq_ref | PRIMARY | PRIMARY | 4 | content_content.issue_id | 1 | |
+----+-------------+-----------------+--------+-------------------------------------------------------------------------------------------------+---------+---------+--------------------------------------+-------+---------------------------------+
Now, from what I've read, I need to somehow figure out how to make the query to content_image not be terrible; however, I'm drawing a blank on where to start.

Currently, judging by the execution plan, MySQL is starting with content_image, retrieving all rows, and only thereafter using primary keys on the other tables: content_image has a foreign key to content_content, and content_content has foreign keys to content_issue and content_section. Also, only after all the joins are complete can it make much use of the ORDER BY content_issue.issue_date DESC LIMIT 30, since it can't tell which of these joins might fail, and therefore, how many records from content_issue will really be needed before it can get the first thirty rows of output.
So, I would try the following:
Change JOIN content_issue to JOIN (SELECT * FROM content_issue ORDER BY issue_date DESC LIMIT 30) content_issue. This will allow MySQL, if it starts with content_issue and works its way to the other tables, to grab a very small subset of content_issue.
Note: properly speaking, this changes the semantics of the query: it means that only records from at most the last 30 content_issues will be retrieved, and therefore that if some of those issues don't have published contents with images, then fewer than 30 records will be retrieved. I don't have enough information about your data to gauge whether this change of semantics would actually change the results you get.
Also note: I'm not suggesting to remove the ORDER BY content_issue.issue_date DESC LIMIT 30 from the end of the query. I think you want it in both places.
Add an index on content_issue.issue_date, to optimize the above subquery.
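Add an index on content_image.content_ptr_id, so MySQL can work its way from content_content to content_image without doing a full table scan. (The PRIMARY entry under possible_keys for content_image suggests this column may already be the primary key, in which case no additional index is needed.)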
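Put together, the suggestions above could look roughly like the following sketch. The index name is hypothetical and the column list is shortened for readability; as noted, limiting content_issue inside the derived table changes the semantics slightly.

```sql
-- Hypothetical index name; supports the ORDER BY ... LIMIT inside the derived table.
CREATE INDEX idx_content_issue_issue_date ON content_issue (issue_date);

SELECT content_content.id,
       content_content.slug,
       content_image.caption,
       content_issue.issue_date,
       content_section.name          -- remaining columns omitted for brevity
FROM (
    -- Only the 30 most recent issues are considered at all.
    SELECT *
    FROM content_issue
    ORDER BY issue_date DESC
    LIMIT 30
) AS content_issue
INNER JOIN content_content
        ON content_content.issue_id = content_issue.id
INNER JOIN content_image
        ON content_image.content_ptr_id = content_content.id
INNER JOIN content_section
        ON content_content.section_id = content_section.id
WHERE NOT (content_content.pub_status = -1)
ORDER BY content_issue.issue_date DESC
LIMIT 30;
```

Running EXPLAIN on this version should show whether MySQL now starts from the derived table instead of scanning content_image.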

Related

Joining table twice makes the query slow

My problem is that my query is very slow when I JOIN the same table twice.
I want to retrieve all the products from a given category. But since the product can be in multiple categories I also want to get the (c.canonical) category that should provide the URL base. Therefore I have 2 extra JOIN on categories AS c and categories_products AS cp2.
Original query
SELECT p.product_id
FROM products AS p
JOIN categories_products AS cp
ON p.product_id = cp.product_id
JOIN product_variants AS pv
ON pv.product_id = p.product_id
WHERE cp.category_id = 2
AND p.status = 2
GROUP BY p.product_id
ORDER BY cp.product_sortorder ASC
LIMIT 0, 40
EXPLAIN
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
|----|-------------|-------|--------|------------------------|------------------------|---------|-------------------------|------|----------------------------------------------|
| 1 | SIMPLE | cp | ref | FK_categories_products | FK_categories_products | 4 | const | 1074 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 4 | superlove.cp.product_id | 1 | Using where |
| 1 | SIMPLE | pv | ref | FK_product_variants | FK_product_variants | 4 | superlove.p.product_id | 1 | Using where |
Slow query
SELECT p.product_id, c.category_id
FROM products AS p
JOIN categories_products AS cp
ON p.product_id = cp.product_id
JOIN categories_products AS cp2 -- Extra line
ON p.product_id = cp2.product_id -- Extra line
JOIN categories AS c -- Extra line
ON cp2.category_id = c.category_id -- Extra line
JOIN product_variants AS pv
ON pv.product_id = p.product_id
WHERE cp.category_id = 2
AND p.status = 2
AND c.canonical = 1 -- Extra line
GROUP BY p.product_id
ORDER BY cp.product_sortorder ASC
LIMIT 0, 40
EXPLAIN
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
|----|-------------|-------|--------|------------------------|------------------------|---------|--------------------------|------|----------------------------------------------|
| 1 | SIMPLE | c | ALL | PRIMARY | (null) | (null) | (null) | 221 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | cp2 | ref | FK_categories_products | FK_categories_products | 4 | superlove.c.category_id | 33 | |
| 1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 4 | superlove.cp2.product_id | 1 | Using where |
| 1 | SIMPLE | pv | ref | FK_product_variants | FK_product_variants | 4 | superlove.p.product_id | 1 | Using where |
| 1 | SIMPLE | cp | ref | FK_categories_products | FK_categories_products | 4 | const | 1074 | Using where |
The MySQL optimizer seems to have trouble with this query. I get the impression that only rather few products would be in the requested category, but there would likely be many canonical categories. However, the optimizer apparently cannot tell that cp.category_id = 2 is a stronger condition than c.canonical = 1, so it starts the new query with c instead of cp, leading to a lot of superfluous rows along the way.
Providing data to the optimizer
Your first attempt should be trying to provide the optimizer with the required data: using the ANALYZE TABLE command, you can collect information about key distribution. For this to work, you'd have to have suitable keys in place. So perhaps you should add a key on categories.canonical. Then MySQL would know that there are (if I understand you correctly) only two distinct values for that column, and perhaps even how many rows in each. With a bit of luck, that would tell it that using c.canonical = 1 as the starting point would be a poor choice.
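A minimal sketch of that first step, assuming no index on categories.canonical exists yet (the index name is made up for illustration):

```sql
-- Hypothetical index name; gives the optimizer statistics on canonical.
ALTER TABLE categories ADD INDEX idx_categories_canonical (canonical);

-- Refresh key distribution statistics for the tables involved in the join.
ANALYZE TABLE categories, categories_products, products;
```

Re-running EXPLAIN afterwards shows whether the optimizer now picks cp as the first table.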
Forcing join order
If that does not help, then I suggest you force the order using STRAIGHT_JOIN. In particular, you might want to force cp as the first table, just as your original (and fast) query had it. If that solves the problem, you can stick to that solution. If not, then you should provide a new EXPLAIN output, so we can see where that approach fails.
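As a sketch, the slow query with the join order pinned so that cp comes first might read as follows. It assumes categories_products can be looked up efficiently by product_id for the cp2 join, which the EXPLAIN output does not confirm.

```sql
SELECT p.product_id, c.category_id
FROM categories_products AS cp                 -- forced to be the first table
STRAIGHT_JOIN products AS p
        ON p.product_id = cp.product_id
STRAIGHT_JOIN categories_products AS cp2
        ON cp2.product_id = p.product_id
STRAIGHT_JOIN categories AS c
        ON c.category_id = cp2.category_id
STRAIGHT_JOIN product_variants AS pv
        ON pv.product_id = p.product_id
WHERE cp.category_id = 2
  AND p.status = 2
  AND c.canonical = 1
GROUP BY p.product_id
ORDER BY cp.product_sortorder ASC
LIMIT 0, 40;
```

STRAIGHT_JOIN makes MySQL read the left table before the right one, so the tables are visited in exactly the order written.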
Schema considerations
One more thing to consider: your question implies that for every product, there is exactly one canonical category associated with it. But your database schema does not reflect that fact. You might want to consider ways to modify your schema to reflect that fact. For example, you could have a column called canonical_category_id in products table, and use categories_products for non-canonical categories only. If you use such a setup, you might want to create a VIEW which joins products to all their categories, both canonical and non-canonical ones, using a UNION like this:
CREATE VIEW products_all_categories AS
SELECT product_id, canonical_category_id AS category_id
FROM products
UNION ALL
SELECT product_id, category_id
FROM categories_products
You could use this instead of categories_products in those places where you don't care whether a category is canonical or not. You could even rename the table and name the view categories_products instead, so that your existing queries work as they used to. You should add an index on the two columns from products used in this query. Perhaps even two indices, one for either order of these columns.
I'm not sure whether this whole setup would be acceptable in your application, or whether it would really bring the intended speed gain. In the end, you might be forced to maintain redundant data, like a products.canonical column in addition to a reference to the canonical category in the categories_products table. I know redundant data is ugly from a design point of view, but for the sake of performance it might be necessary in order to avoid long computations, at least on an RDBMS which doesn't support materialized views. You could probably use triggers to keep the data consistent, though I have no actual experience there.

mysql slow complex query with order by

The query below is very slow, even without the ORDER BY, and I can't figure out why. I'm guessing it's the WHERE on date_affidavit_filed, but how can I make it fast with that ORDER BY as well? Perhaps a subselect on the job ids that match the WHERE, and then pass that into the rest of the query, but I still need to order by servername like this. Any suggestions?
explain select sql_no_cache court_county, job.id as jid, job_status,
DATE_FORMAT(job.datetime_served, '%m/%d/%Y') as dserved ,
CONCAT(server.namefirst, ' ', server.namelast) as servername, client_name,
DATE_FORMAT(job.datetime_received, '%m/%d/%Y') as dtrec ,
DATE_FORMAT(job.datetime_give2server, '%m/%d/%Y') as dtg2s,
DATE_FORMAT(date_kase_filed, '%m/%d/%Y') as dkf,
DATE_FORMAT(job.date_sent_to_court, '%m/%d/%Y') as dtstc ,
TO_DAYS(datetime_served )-TO_DAYS(date_kase_filed) as totaldays from job
left join kase on kase.id=job.kase_id
left join server on job.server_id=server.id
left join client on kase.client_id=client.id
left join LUcourt on LUcourt.id=kase.court_id
where date_affidavit_filed is not null and date_affidavit_filed !='' order by servername;
+----+-------------+---------+--------+----------------------+---------+---------+-----------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+--------+----------------------+---------+---------+-----------------------+--------+----------------------------------------------+
| 1 | SIMPLE | job | ALL | date_affidavit_filed | NULL | NULL | NULL | 365212 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | kase | eq_ref | PRIMARY | PRIMARY | 4 | pserve.job.kase_id | 1 | |
| 1 | SIMPLE | server | eq_ref | PRIMARY | PRIMARY | 4 | pserve.job.server_id | 1 | |
| 1 | SIMPLE | client | eq_ref | PRIMARY | PRIMARY | 4 | pserve.kase.client_id | 1 | |
| 1 | SIMPLE | LUcourt | eq_ref | PRIMARY | PRIMARY | 4 | pserve.kase.court_id | 1 | |
+----+-------------+---------+--------+----------------------+---------+---------+-----------------------+--------+----------------------------------------------+
Check that you have indexes on the following columns: job.kase_id and job.server_id.
Also, you are ordering by a calculated field, which is not optimal. Perhaps order by an indexed field instead.
If you need to preserve that exact sort, you might want to add a field in the DB for that value. And populate it with appropriate values or set up a trigger on the DB to populate it for you automatically.
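A rough sketch of the stored-column approach, assuming a new column called namefull on server (the column name, its VARCHAR(255) type, and the index and trigger names are all made up for illustration):

```sql
-- Hypothetical column holding the precomputed sort value.
ALTER TABLE server ADD COLUMN namefull VARCHAR(255);

-- Backfill existing rows.
UPDATE server SET namefull = CONCAT(namefirst, ' ', namelast);

-- Keep the column in sync on future writes.
CREATE TRIGGER server_namefull_before_insert
BEFORE INSERT ON server
FOR EACH ROW SET NEW.namefull = CONCAT(NEW.namefirst, ' ', NEW.namelast);

CREATE TRIGGER server_namefull_before_update
BEFORE UPDATE ON server
FOR EACH ROW SET NEW.namefull = CONCAT(NEW.namefirst, ' ', NEW.namelast);

-- Index the stored value so it can be used for sorting.
CREATE INDEX idx_server_namefull ON server (namefull);
```

The query would then use ORDER BY server.namefull. Whether the filesort actually disappears depends on the join order the optimizer picks, so it is worth checking with EXPLAIN.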
This can speed up the order by:
CREATE INDEX namefull ON server (namefirst,namelast);
if you do ORDER BY server.namefirst, server.namelast (without parentheses, since MySQL rejects a parenthesized column list in ORDER BY) instead of ORDER BY servername, which should produce the same ordering in most cases.
You can also create indexes on each table on any field you are left joining, that can improve the performance of your query too.
When you write,
where date_affidavit_filed is not null and date_affidavit_filed !=''
you are selecting practically most of the rows, or at least so many that it is not worthwhile to go through the index. The query planner sees that there is an index involving date_affidavit_filed (it is listed under possible_keys), but decides not to use it and does a full scan of job instead. Since the WHERE clause only involves date_affidavit_filed, we know it's not a missing-key issue; it must be a cardinality issue.
| 1 | SIMPLE | job | ALL | date_affidavit_filed | NULL | NULL | NULL | 365212 | Using where; Using temporary; Using filesort |
You can try optimizing this by creating an index on
date_affidavit_filed, kase_id, server_id
in that order. How many rows are returned by the query?
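As a sketch, the suggested composite index and a quick selectivity check could be written like this (the index name is hypothetical):

```sql
-- Hypothetical index name; columns in the order suggested above.
CREATE INDEX idx_job_affidavit_kase_server
    ON job (date_affidavit_filed, kase_id, server_id);

-- How selective is the WHERE clause? Compare matching rows to the table size.
SELECT COUNT(*)                    AS matching_rows,
       (SELECT COUNT(*) FROM job)  AS total_rows
FROM job
WHERE date_affidavit_filed IS NOT NULL
  AND date_affidavit_filed != '';
```

If matching_rows is close to total_rows, the optimizer is likely to keep choosing the full scan regardless of the index.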
You are selecting everything that isn't empty really.
That really means everything.
I don't know how many rows of data you have, but it's a lot to go through.
Try narrowing your query to a date range or specific client.
If you really need everything, don't output it one row at a time. Instead, build up a big string with all the formatting in the software you use for output, and once you have finished looping through the results, output it in one go.
You could also use paging.
Just add limit 0,30 on page 1, limit 30,30 on page 2, etc., and let the end user walk through the pages.
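A shortened sketch of the paged query follows. The select list is trimmed, and the job.id tie-breaker is an addition (not in the original query) so that rows with equal names keep a stable order across pages.

```sql
-- Page 1: rows 1-30. For page 2 use LIMIT 30, 30; for page 3 LIMIT 60, 30; and so on.
SELECT job.id AS jid,
       job_status,
       CONCAT(server.namefirst, ' ', server.namelast) AS servername
FROM job
LEFT JOIN server ON job.server_id = server.id
WHERE date_affidavit_filed IS NOT NULL
  AND date_affidavit_filed != ''
ORDER BY servername, job.id      -- job.id is an assumed tie-breaker
LIMIT 0, 30;
```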

joining table in mysql not using index properly?

I have four tables that I am trying to join and output the result to a new table. My code looks like this:
create table tbl
select a.dte, a.permno, (ret - rf) f0_xs_ret, (xs_ret - (betav*xs_mkt)) f0_resid, mkt_cap last_year_mkt_cap, betav beta_value
from a inner join b using (dte)
inner join c on (year(a.dte) = c.yr and a.permno = c.permno)
inner join d on (a.permno = d.permno and year(a.dte)-1 = year(d.dte));
All of the tables have multiple indices and for table a, (dte, permno) identify a unique record, for table b, dte id's a unique record, for table c, (yr, permno) id a unique record and for table d, (dte, permno) id a unique record. the explain from the select part of the query is:
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
| 1 | SIMPLE | d | ALL | idx1 | NULL | NULL | NULL | 264129 | |
| 1 | SIMPLE | c | ref | idx2 | idx2 | 4 | achernya.d.permno | 16 | |
| 1 | SIMPLE | b | ALL | PRIMARY,idx2 | NULL | NULL | NULL | 12336 | Using join buffer |
| 1 | SIMPLE | a | eq_ref | PRIMARY,idx1,idx2 | PRIMARY | 7 | achernya.b.dte,achernya.d.permno | 1 | Using where |
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
Why does MySQL have to read so many rows to process this? If I am reading this correctly, it has to read (264129*16*12336) rows, which should take a good month.
Could someone please explain what's going on here?
MySQL has to read the rows because you're using functions as your join conditions. An index on dte will not help resolve YEAR(dte) in a query. If you want to make this fast, then put the year in its own column to use in joins and move the index to that column, even if that means some denormalization.
As for the other columns in your index that you don't apply functions to, they may not be used if the index won't provide much benefit, or they aren't the leftmost column in the index and you don't use the leftmost prefix of that index in your join condition.
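A minimal sketch of the "year in its own column" idea, assuming a and d get a new column named yr (the column and index names are hypothetical, and the SMALLINT type is a guess):

```sql
-- Store the year once instead of computing YEAR(dte) for every row in the join.
ALTER TABLE a ADD COLUMN yr SMALLINT;
ALTER TABLE d ADD COLUMN yr SMALLINT;
UPDATE a SET yr = YEAR(dte);
UPDATE d SET yr = YEAR(dte);

-- Indexes matching the new join conditions.
CREATE INDEX idx_a_permno_yr ON a (permno, yr);
CREATE INDEX idx_d_permno_yr ON d (permno, yr);

-- Same select list as the original; only the join conditions change.
SELECT a.dte, a.permno, (ret - rf) f0_xs_ret, (xs_ret - (betav*xs_mkt)) f0_resid,
       mkt_cap last_year_mkt_cap, betav beta_value
FROM a
INNER JOIN b USING (dte)
INNER JOIN c ON (a.yr = c.yr AND a.permno = c.permno)
INNER JOIN d ON (a.permno = d.permno AND d.yr = a.yr - 1);
```

With plain column comparisons on both sides, the optimizer can at least consider the (permno, yr) indexes. The yr columns have to be kept up to date whenever dte changes.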
Sometimes MySQL does not use an index, even if one is available. One circumstance under which this occurs is when the optimizer estimates that using the index would require MySQL to access a very large percentage of the rows in the table. (In this case, a table scan is likely to be much faster because it requires fewer seeks.)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html

Optimizing / improving a slow mysql query - indexing? reorganizing?

First off, I've looked at several other questions about optimizing sql queries, but I'm still unclear for my situation what is causing my problem. I read a few articles on the topic as well and have tried implementing a couple possible solutions, as I'll describe below, but nothing has yet worked or even made an appreciable dent in the problem.
The application is a nutrition tracking system - users enter the foods they eat and based on an imported USDA database the application breaks down the foods to the individual nutrients and gives the user a breakdown of the nutrient quantities on a (for now) daily basis.
Here's a PDF of the abbreviated database schema, and here it is as a (perhaps poor quality) JPG. I made this in OpenOffice - if there are suggestions for better ways to visualize a database, I'm open to suggestions on that front as well! The blue tables are directly from the USDA, and the green and black tables are ones I've made. I've omitted a lot of data in order to not clutter things up unnecessarily.
Here's the query I'm trying to run that takes a very long time:
SELECT listing.date_time,listing.nutrdesc,data.total_nutr_mass,listing.units
FROM
(SELECT nutrdesc, nutr_no, date_time, units
FROM meals, nutr_def
WHERE meals.users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
AND (nutr_no <100000
OR nutr_no IN
(SELECT nutr_def_nutr_no
FROM nutr_rights
WHERE nutr_rights.users_userid = '2'))
) as listing
LEFT JOIN
(SELECT nutrdesc, date_time, nut_data.nutr_no, sum(ingred_gram_mass*entry_qty_num*nutr_val/100) AS total_nutr_mass
FROM nut_data, recipe_ingredients, food_entries, meals, nutr_def
WHERE nut_data.nutr_no = nutr_def.nutr_no
AND ndb_no = ingred_ndb_no
AND foods_food_id = entry_ident
AND meals_meal_id = meal_id
AND users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
GROUP BY date_time,nut_data.nutr_no ) as data
ON data.date_time = listing.date_time
AND listing.nutr_no = data.nutr_no
ORDER BY listing.date_time,listing.nutrdesc,listing.units
So I know that's rather complex - The first select gets a listing of all the nutrients that the user consumed within the given date range, and the second fills in all the quantities.
When I implement them separately, the first query is really fast, but the second is slow and gets very slow when the date ranges get large. The join makes the whole thing ridiculously slow. I know that the 'main' problem is the join between these two derived tables, and I can get rid of that and do the join by hand basically in php much faster, but I'm not convinced that's the whole story.
For example: for 1 month of data, the query takes about 8 seconds, which is slow, but not completely terrible. Separately, each query takes ~.01 and ~2 seconds respectively. 2 seconds still seems high to me.
If I try to retrieve a year's worth of data, it takes several (>10) minutes to run the whole query, which is problematic - the client-server connection sometimes times out, and in any case I don't want to sit there with a spinning 'please wait' icon. Mainly, I feel like there's a problem because it takes more than 12x as long to retrieve 12x more information, when it should take less than 12x as long, if I were doing things right.
Here's the 'explain' for each of the slow queries: (the whole thing, and just the second half).
Whole thing:
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 5053 | Using temporary; Using filesort |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 4341 | |
| 4 | DERIVED | meals | range | PRIMARY,day_ind | day_ind | 9 | NULL | 30 | Using where; Using temporary; Using filesort |
| 4 | DERIVED | food_entries | ref | meals_meal_id | meals_meal_id | 5 | nutrition.meals.meal_id | 15 | Using where |
| 4 | DERIVED | recipe_ingredients | ref | foods_food_id,ingred_ndb_no | foods_food_id | 4 | nutrition.food_entries.entry_ident | 2 | |
| 4 | DERIVED | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | |
| 4 | DERIVED | nut_data | ref | PRIMARY | PRIMARY | 36 | nutrition.nutr_def.nutr_no,nutrition.recipe_ingredients.ingred_ndb_no | 1 | |
| 2 | DERIVED | meals | range | day_ind | day_ind | 9 | NULL | 30 | Using where |
| 2 | DERIVED | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | Using where |
| 3 | DEPENDENT SUBQUERY | nutr_rights | index_subquery | users_userid,nutr_def_nutr_no | nutr_def_nutr_no | 19 | func | 1 | Using index; Using where |
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
10 rows in set (2.82 sec)
Second chunk (data):
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | meals | range | PRIMARY,day_ind | day_ind | 9 | NULL | 30 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | food_entries | ref | meals_meal_id | meals_meal_id | 5 | nutrition.meals.meal_id | 15 | Using where |
| 1 | SIMPLE | recipe_ingredients | ref | foods_food_id,ingred_ndb_no | foods_food_id | 4 | nutrition.food_entries.entry_ident | 2 | |
| 1 | SIMPLE | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | |
| 1 | SIMPLE | nut_data | ref | PRIMARY | PRIMARY | 36 | nutrition.nutr_def.nutr_no,nutrition.recipe_ingredients.ingred_ndb_no | 1 | |
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
5 rows in set (0.00 sec)
I've 'analyzed' all the tables involved in the query, and added an index on the datetime field that is joining meals and food entries. I called it 'day_ind'. I hoped that would accelerate things, but it didn't seem to make a difference. I also tried removing the 'sum' function, as I understand that having a function in the query will frequently mean a full table scan, which is obviously much slower. Unfortunately removing the 'sum' didn't seem to make a difference either (well, about 3-5% or so, but not the order of magnitude that I'm looking for).
I would love any suggestions and will be happy to provide any more information you need to help diagnose and improve this problem. Thanks in advance!
A few rows in your EXPLAIN have type ALL, which indicates a full table scan, and the plan also creates temporary tables. You could add indexes on the affected columns if they are not there already.
Sort and GROUP BY are usually the performance killers; if you have extra memory available, you can adjust the MySQL memory settings to avoid physical I/O for the temporary table.
Lastly, try to make sure the data types of the join attributes match, i.e. that data.date_time and listing.date_time have the same data format.
Hope that helps.
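For the memory settings mentioned above, a sketch of what checking and raising them for the current session might look like. The 64 MB figure is an arbitrary example, not a recommendation; MySQL uses the smaller of the two values as the limit for in-memory temporary tables.

```sql
-- Inspect the current limits.
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';

-- Raise both for this session only (67108864 bytes = 64 MB, an example value).
SET SESSION tmp_table_size = 67108864;
SET SESSION max_heap_table_size = 67108864;
```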
Okay, so I eventually figured out what I'm gonna end up doing. I couldn't make the 'data' query any faster - that's still the bottleneck. But now I've made it so the total query process is pretty close to linear, not exponential.
I split the query into two parts and made each one into a temporary table. Then I added an index for each of those temp tables and did the join separately afterwards. This made the total execution time for 1 month of data drop from 8 to 2 seconds, and for 1 year of data from ~10 minutes to ~30 seconds. Good enough for now, I think. I can work with that.
Thanks for the suggestions. Here's what I ended up doing:
create table listing (
SELECT nutrdesc, nutr_no, date_time, units
FROM meals, nutr_def
WHERE meals.users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
AND (
nutr_no <100000 OR nutr_no IN (
SELECT nutr_def_nutr_no
FROM nutr_rights
WHERE nutr_rights.users_userid = '2'
)
)
);
create table data (
SELECT nutrdesc, date_time, nut_data.nutr_no, sum(ingred_gram_mass*entry_qty_num*nutr_val/100) AS total_nutr_mass
FROM nut_data, recipe_ingredients, food_entries, meals, nutr_def
WHERE nut_data.nutr_no = nutr_def.nutr_no
AND ndb_no = ingred_ndb_no
AND foods_food_id = entry_ident
AND meals_meal_id = meal_id
AND users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
GROUP BY date_time,nut_data.nutr_no
);
create index joiner on data(nutr_no, date_time);
create index joiner on listing(nutr_no, date_time);
SELECT listing.date_time,listing.nutrdesc,data.total_nutr_mass,listing.units
FROM listing
LEFT JOIN data
ON data.date_time = listing.date_time
AND listing.nutr_no = data.nutr_no
ORDER BY listing.date_time,listing.nutrdesc,listing.units;
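The statements above create regular tables. If true temporary tables are wanted (dropped automatically when the connection closes), a variant along these lines might work; it also declares the index while the table is created. This is a sketch for the listing table only, and data would follow the same pattern.

```sql
-- TEMPORARY tables are private to the connection and vanish when it ends.
CREATE TEMPORARY TABLE listing (
    INDEX joiner (nutr_no, date_time)   -- index defined up front
)
SELECT nutrdesc, nutr_no, date_time, units
FROM meals, nutr_def
WHERE meals.users_userid = '2'
  AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
  AND (
      nutr_no < 100000 OR nutr_no IN (
          SELECT nutr_def_nutr_no
          FROM nutr_rights
          WHERE nutr_rights.users_userid = '2'
      )
  );
```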

Why scan type is changed from ALL to RANGE when using LIMIT on SQL queries + Optimize query

I have this query
SELECT l.licitatii_id,
l.nume,
l.data_publicarii,
l.data_limita
FROM licitatii_ue l
INNER JOIN domenii_licitatii dl
ON l.licitatii_id = dl.licitatii_id
AND dl.tip_licitatie = '2'
INNER JOIN domenii d
ON dl.domenii_id = d.domenii_id
AND d.status = 1
AND d.tip_domeniu = '1'
WHERE l.status = 1
AND Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita)
AS DATE)))
< '1300683793'
GROUP BY l.licitatii_id
ORDER BY data_publicarii DESC
Explain outputs:
+-----+--------------+--------+---------+-------------------------------------+----------+----------+---------------------------+-------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
| 1 | SIMPLE | d | ALL | PRIMARY,key_status_tip_domeniu | NULL | NULL | NULL | 120 | 85.83 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dl | ref | PRIMARY,tip_licitatie,licitatii_id | PRIMARY | 4 | web61db1.d.domenii_id | 6180 | 100.00 | Using where; Using index |
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | web61db1.dl.licitatii_id | 1 | 100.00 | Using where |
+-----+--------------+--------+---------+-------------------------------------+----------+----------+---------------------------+-------+-----------+----------------------------------------------+
As you see type=ALL for d table
now if I add LIMIT 100 to the query
plan changes to range:
+-----+--------------+--------+---------+-------------------------------------+-------------------------+----------+---------------------------+-------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
| 1 | SIMPLE | d | range | PRIMARY,key_status_tip_domeniu | key_status_tip_domeniu | 9 | NULL | 103 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dl | ref | PRIMARY,tip_licitatie,licitatii_id | PRIMARY | 4 | web61db1.d.domenii_id | 6180 | 100.00 | Using where; Using index |
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | web61db1.dl.licitatii_id | 1 | 100.00 | Using where |
+-----+--------------+--------+---------+-------------------------------------+-------------------------+----------+---------------------------+-------+-----------+----------------------------------------------+
Why does this happen?
Can this query be optimized further? Both queries take 13 seconds.
Table schema is visible on gist github
MySQL chooses domenii as the leading table for the join.
This table is filtered on (status, tip_domeniu) = (1, 1).
It does not seem to be a very selective condition, so normally a full table scan with filtering would be preferred over the index scan.
We can see that MySQL expects to examine 120 records in domenii, of which about 86% (the filtered column), i.e. roughly 103, would satisfy this condition.
When you add a LIMIT, the number of records expected to be processed is decreased, and MySQL considers the index scan more efficient for this.
Note that this condition:
Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita) AS DATE))) < '1300683793'
is not sargable, so you prevent the optimizer from using an index on data_limita.
Create the following indexes:
licitatii_ue (status, data_limita)
licitatii_ue (status, data_publicarii)
and rewrite the query like this:
SELECT l.licitatii_id,
l.nume,
l.data_publicarii,
l.data_limita
FROM licitatii_ue l
JOIN domenii_licitatii dl
ON l.licitatii_id = dl.licitatii_id
AND dl.tip_licitatie = '2'
JOIN domenii d
ON dl.domenii_id = d.domenii_id
AND d.status = 1
AND d.tip_domeniu = '1'
WHERE l.status = 1
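-- Assumes data_limita stores a Unix timestamp (as the original FROM_UNIXTIME call implies)
-- and that the session time zone is UTC, so midnights are multiples of 86400.
AND l.data_limita < ((1300683793 - 1) DIV 86400) * 86400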
GROUP BY
l.licitatii_id
ORDER BY
data_publicarii DESC
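The two indexes listed before the rewritten query could be created with statements like these (the index names are made up for illustration):

```sql
-- Hypothetical index names; column order as given in the answer.
CREATE INDEX idx_licitatii_status_limita
    ON licitatii_ue (status, data_limita);
CREATE INDEX idx_licitatii_status_publicarii
    ON licitatii_ue (status, data_publicarii);
```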
Ah, the mysteries of the query optimizer are many and unknowable...
At a quick glance, the most obvious thing to optimize might be the
AND Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita)
AS DATE)))
clause.
Depending on the number of records in the licitatii_ue table, this looks like an expensive operation, and it will bypass any indices available.
ALL is table scan, range is range scan (due to LIMIT). Nothing bad with that, actually it also causes a key to be used (key_status_tip_domeniu).
The reason you are slow is, most likely, that you are using ORDER BY data_publicarii DESC (this is easy to test: just drop the ORDER BY and benchmark the query; I would expect a difference of a few orders of magnitude).
MySQL admits (under the Extra column of EXPLAIN) that it is using filesort (needed for the ORDER BY because it can't use, or does not know how to use, an index). Adding yet another index to the mix might help, especially if you confirm that the ORDER BY is making it slow.
EDIT
Actually, you do have a cardinal sin in your query:
Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita)
AS DATE)))
< '1300683793'
Avoid applying any functions to your field values if you can apply them to a constant. So switch it around and rewrite it as
l.data_limita < some_function('1300683793')
However complex some_function is, it will be calculated only once, because the MySQL planner will know it is a constant. The way you wrote it forces MySQL to apply unix_timestamp, timestampadd, cast and from_unixtime to the value of data_limita from each row. In I/O-bound systems this will usually just burn some extra CPU cycles while waiting for the disks to spin around (however, it might become significant: your system might get CPU bound, which is just a bad thing). The biggest difference is that you lose the possibility of using an index on data_limita.
Finally, all your indexes are single-field indexes; MySQL does some index merging, but is not stellar at it. You might want to try creating indexes that cover all your conditions and the sort order (in order of selectivity for the target query).
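One way to sanity-check a constant-side rewrite is to evaluate both forms on a few sample values. The sketch below assumes data_limita holds integer Unix timestamps and that the session time zone is UTC; the expression on the right is one possible some_function, not necessarily the only one.

```sql
-- Assumption: UTC session, so day boundaries fall on multiples of 86400.
SET time_zone = '+00:00';

SELECT x,
       UNIX_TIMESTAMP(TIMESTAMPADD(DAY, 1, CAST(FROM_UNIXTIME(x) AS DATE)))
           < 1300683793                            AS original_form,
       x < ((1300683793 - 1) DIV 86400) * 86400    AS constant_side_form
FROM (
    SELECT 1300579200 AS x          -- 2011-03-20 00:00:00 UTC
    UNION ALL SELECT 1300665599     -- 2011-03-20 23:59:59 UTC
    UNION ALL SELECT 1300665600     -- 2011-03-21 00:00:00 UTC
) AS samples;
```

The two result columns should agree on every row. In the second form the column stands alone on the left-hand side, so an index on data_limita becomes usable.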