Unable to optimize MySQL query which uses an ORDER BY clause - mysql

I'm using Drupal 6 with MySQL version 5.0.95 and am at an impasse: one of my queries, which displays content based on the most recent article date, is slow and, because it runs so frequently, kills site performance altogether. The query in question is below:
SELECT n.nid,
n.title,
ma.field_article_date_format_value,
ma.field_article_summary_value
FROM node n
INNER JOIN content_type_article ma ON n.nid=ma.nid
INNER JOIN term_node tn ON n.nid=tn.nid
WHERE tn.tid= 153
AND n.status=1
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0, 11;
The EXPLAIN of the query shows the below result:
+----+-------------+-------+--------+--------------------------+---------+---------+----------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------+---------+---------+----------------------+-------+---------------------------------+
| 1 | SIMPLE | tn | ref | PRIMARY,nid | PRIMARY | 4 | const | 19006 | Using temporary; Using filesort |
| 1 | SIMPLE | ma | ref | nid,ix_article_date | nid | 4 | drupal_mm_stg.tn.nid | 1 | |
| 1 | SIMPLE | n | eq_ref | PRIMARY,node_status_type | PRIMARY | 4 | drupal_mm_stg.ma.nid | 1 | Using where |
+----+-------------+-------+--------+--------------------------+---------+---------+----------------------+-------+---------------------------------+
This query seemed relatively simple and straightforward: it retrieves articles which belong to category (term) 153 and have status 1 (published). But from what I've learnt browsing about it, Using temporary and Using filesort apparently mean the query is bound to fail.
Removing field_article_date_format_value from the ORDER BY clause gets rid of Using temporary; Using filesort and reduces the query execution time, but the ordering is required and cannot be traded off. Unfortunately, the same is true of site performance.
My hunch is that most of the trouble comes from the term_node table, which maps articles to categories. It is a many-to-many relationship table, meaning that if article X is associated with 5 categories C1....C5 it will have 5 entries in that table. This table comes out of the box with Drupal.
Dealing with heavy DB content is new to me. After going through some similar questions (When ordering by date desc, "Using temporary" slows down query; MySQL performance optimization: order by datetime field), I tried to create a composite index on content_type_article, combining the datetime field used in the ORDER BY clause with another key (nid), and tried to FORCE INDEX:
SELECT n.nid, n.title,
ma.field_article_date_format_value,
ma.field_article_summary_value
FROM node n
INNER JOIN content_type_article ma FORCE INDEX (ix_article_date) ON n.nid=ma.nid
INNER JOIN term_node tn ON n.nid=tn.nid
WHERE tn.tid= 153
AND n.status=1
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0, 11;
The result and the following EXPLAIN query did not seem to help much
+----+-------------+-------+--------+--------------------------+-----------------+---------+----------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------+-----------------+---------+----------------------+-------+---------------------------------+
| 1 | SIMPLE | tn | ref | PRIMARY,nid | PRIMARY | 4 | const | 18748 | Using temporary; Using filesort |
| 1 | SIMPLE | ma | ref | ix_article_date | ix_article_date | 4 | drupal_mm_stg.tn.nid | 1 | |
| 1 | SIMPLE | n | eq_ref | PRIMARY,node_status_type | PRIMARY | 4 | drupal_mm_stg.ma.nid | 1 | Using where |
+----+-------------+-------+--------+--------------------------+-----------------+---------+----------------------+-------+---------------------------------+
The fields n.nid, ca.nid, ma.field_article_date_format_value are all indexed. Querying the DB with Limit 0,11 takes approximately 7-10 seconds with the ORDER BY clause but without it the query barely takes a second. The database engine is MyISAM. Any help on this would be greatly appreciated.
Any answer that could help me get this query running like a normal one (at the same speed as a query without the sort by date) would be great. My attempts at creating a composite index combining nid and field_article_date_format_value and using it in the query did not help. I'm open to providing additional info on the problem and to any new suggestions.

Taking a look at your query and the EXPLAIN, it seems that having n.status=1 in the WHERE clause makes the search very inefficient, because you need to return the whole set defined by the joins and only then apply status = 1. Try starting the join from the term_node table, which is immediately filtered by the WHERE, and then make the joins, adding the status condition in the join itself. Give it a try and please tell me how it goes.
SELECT n.nid, n.title,
ma.field_article_date_format_value,
ma.field_article_summary_value
FROM term_node tn
INNER JOIN node n ON n.nid=tn.nid AND n.status=1
INNER JOIN content_type_article ma FORCE INDEX (ix_article_date) ON n.nid=ma.nid
WHERE tn.tid= 153
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0, 11;

Using temporary; Using filesort means only that MySQL needs to construct a temporary result table and sort it to get the result you need. This is often a consequence of the ORDER BY ... DESC LIMIT 0,n construct you're using to get the latest postings. In itself it's not a sign of failure. See this: http://www.mysqlperformanceblog.com/2009/03/05/what-does-using-filesort-mean-in-mysql/
Here are some things to try. I am not totally sure they'll work; it's hard to know without having your data to experiment with.
Is there a BTREE index on content_type_article.field_article_date_format_value ? If so, that may help.
Do you HAVE to display the 11 most recent articles? Or can you display the 11 most recent articles that have appeared in the last week or month? If so you could add this line to your WHERE clause. It would filter your stuff by date rather than having to look all the way back to the beginning of time for matching articles. This will be especially helpful if you have a long-established Drupal site.
AND ma.field_article_date_format_value >= (NOW() - INTERVAL 1 MONTH)
First, try to flip the order of the INNER JOIN operations. Second, incorporate the tid=153 into the join criterion. This MAY reduce the size of the temp table you need to sort. All together my suggestions are as follows:
SELECT n.nid,
n.title,
ma.field_article_date_format_value,
ma.field_article_summary_value
FROM node n
INNER JOIN term_node tn ON (n.nid=tn.nid AND tn.tid = 153)
INNER JOIN content_type_article ma ON n.nid=ma.nid
WHERE n.status=1
AND ma.field_article_date_format_value >= (NOW() - INTERVAL 1 MONTH)
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0, 11;

1) Covering indexes
I think the simple answer may be "covering indexes".
Especially on the content_type_article table. The "covering index" has the expression in the ORDER BY as the leading column, and includes all of the columns that are being referenced by the query. Here's the index I created (on my test table):
CREATE INDEX ct_article_ix9
ON content_type_article
(field_article_date_format_value, nid, field_article_summary_value);
And here's an excerpt of the EXPLAIN I get from the query (after I build example tables, using the InnoDB engine, including a covering index on each table):
select_type  table  type   key             ref          Extra
-----------  -----  -----  --------------  -----------  ------------------------
SIMPLE       ma     index  ct_article_ix9  NULL         Using index
SIMPLE       n      ref    node_ix9        ma.nid       Using where; Using index
SIMPLE       tn     ref    term_node_ix9   n.nid,const  Using where; Using index
Note that there's no 'Using filesort' shown in the plan, and the plan shows 'Using index' for each table referenced in the query, which basically means that all of the data needed by the query is retrieved from the index pages, with no need to reference any pages from the underlying table. (Your tables have a lot more rows than my test tables, but if you can get an explain plan that looks like this, you may get better performance.)
For completeness, here's the entire EXPLAIN output:
+----+-------------+-------+-------+---------------+----------------+---------+---------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+----------------+---------+---------------------+------+--------------------------+
| 1 | SIMPLE | ma | index | NULL | ct_article_ix9 | 27 | NULL | 1 | Using index |
| 1 | SIMPLE | n | ref | node_ix9 | node_ix9 | 10 | testps.ma.nid,const | 11 | Using where; Using index |
| 1 | SIMPLE | tn | ref | term_node_ix9 | term_node_ix9 | 10 | testps.n.nid,const | 11 | Using where; Using index |
+----+-------------+-------+-------+---------------+----------------+---------+---------------------+------+--------------------------+
3 rows in set (0.00 sec)
I made no changes to your query, except to omit the FORCE INDEX hint. Here are the other two "covering indexes" that I created on the other two tables referenced in the query:
CREATE INDEX node_ix9
ON node (`nid`,`status`,`title`);
CREATE INDEX term_node_ix9
ON term_node (nid,tid);
(Note that if nid is the clustering key on the node table, you may not need the covering index on the node table.)
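The covering-index idea can be tried out quickly without a MySQL server. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in, so the plan wording is SQLite's ("COVERING INDEX") rather than MySQL's ("Using index"), and the `body` column is a hypothetical extra column added to show what happens when a query needs something the index does not contain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content_type_article (
        id INTEGER PRIMARY KEY,
        nid INTEGER,
        field_article_date_format_value TEXT,
        field_article_summary_value TEXT,
        body TEXT  -- hypothetical column that is NOT part of the index
    );
    CREATE INDEX ct_article_ix9 ON content_type_article
        (field_article_date_format_value, nid, field_article_summary_value);
""")

def plan(sql):
    """Return the detail strings of SQLite's EXPLAIN QUERY PLAN for sql."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Every referenced column is in the index, so the table itself is never read.
covered = plan("""
    SELECT nid, field_article_date_format_value, field_article_summary_value
    FROM content_type_article
    ORDER BY field_article_date_format_value DESC LIMIT 11
""")

# `body` is not in the index, so the index alone cannot answer this query.
not_covered = plan("""
    SELECT nid, body
    FROM content_type_article
    ORDER BY field_article_date_format_value DESC LIMIT 11
""")

print(covered)
print(not_covered)
```

The first plan should mention a covering index and no sort step; the second cannot be covered because it has to fetch `body` from the table rows.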
2) Use correlated subqueries in place of joins?
If the previous idea doesn't improve anything, then, as another alternative, since the original query is returning a maximum of 11 rows, you might try rewriting the query to avoid the join operations, and instead make use of correlated subqueries. Something like the query below.
Note that this query differs significantly from the original query. The difference is that with this query, a row from the content_type_article table will be returned only one time. With the query using the joins, a row from that table could be matched to multiple rows from the node and term_node tables, which would return that same row more than once. This may be viewed as either desirable or undesirable; it really depends on the cardinality, and on whether the resultset meets the specification.
SELECT ( SELECT n2.nid
FROM node n2
WHERE n2.nid = ma.nid
AND n2.status = 1
LIMIT 1
) AS `nid`
, ( SELECT n3.title
FROM node n3
WHERE n3.nid = ma.nid
AND n3.status = 1
LIMIT 1
) AS `title`
, ma.field_article_date_format_value
, ma.field_article_summary_value
FROM content_type_article ma
WHERE EXISTS
( SELECT 1
FROM node n1
WHERE n1.nid = ma.nid
AND n1.status = 1
)
AND EXISTS
( SELECT 1
FROM term_node tn
WHERE tn.nid = ma.nid
AND tn.tid = 153
)
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0,11
(Sometimes, a query using this type of "correlated subquery" can have considerably WORSE performance than an equivalent query that does join operations. But in some cases, a query like this can actually perform better, especially given a very limited number of rows being returned.)
Here's the explain output for that query:
+----+--------------------+-------+-------+---------------+----------------+---------+---------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+---------------+----------------+---------+---------------------+------+--------------------------+
| 1 | PRIMARY | ma | index | NULL | ct_article_ix9 | 27 | NULL | 11 | Using where; Using index |
| 5 | DEPENDENT SUBQUERY | tn | ref | term_node_ix9 | term_node_ix9 | 10 | testps.ma.nid,const | 13 | Using where; Using index |
| 4 | DEPENDENT SUBQUERY | n1 | ref | node_ix9 | node_ix9 | 10 | testps.ma.nid,const | 12 | Using where; Using index |
| 3 | DEPENDENT SUBQUERY | n3 | ref | node_ix9 | node_ix9 | 10 | testps.ma.nid,const | 12 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | n2 | ref | node_ix9 | node_ix9 | 10 | testps.ma.nid,const | 12 | Using where; Using index |
+----+--------------------+-------+-------+---------------+----------------+---------+---------------------+------+--------------------------+
5 rows in set (0.00 sec)
Note that again, each access is 'Using index', which means the query is satisfied directly from index pages, rather than having to visit any data pages in the underlying table.
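The difference in result sets between the join form and the EXISTS form can be seen on a toy dataset. This is a minimal sketch, assuming SQLite (via Python's sqlite3) as a stand-in for MySQL and a hypothetical situation in which node 1 is mapped to term 153 twice; the real term_node table may or may not contain such rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (nid INTEGER, title TEXT, status INTEGER);
    CREATE TABLE content_type_article (nid INTEGER, field_article_date_format_value TEXT);
    CREATE TABLE term_node (tid INTEGER, nid INTEGER);
    INSERT INTO node VALUES (1, 'foo', 1), (2, 'bar', 1);
    INSERT INTO content_type_article VALUES (1, '2012-01-01'), (2, '2012-01-02');
    -- Hypothetical data: node 1 is mapped to term 153 twice.
    INSERT INTO term_node VALUES (153, 1), (153, 1), (153, 2);
""")

# Join form: one output row per matching term_node row.
join_rows = conn.execute("""
    SELECT n.nid FROM node n
    INNER JOIN content_type_article ma ON n.nid = ma.nid
    INNER JOIN term_node tn ON n.nid = tn.nid
    WHERE tn.tid = 153 AND n.status = 1
    ORDER BY ma.field_article_date_format_value DESC, n.nid
""").fetchall()

# EXISTS form: each article row is returned at most once.
exists_rows = conn.execute("""
    SELECT ma.nid FROM content_type_article ma
    WHERE EXISTS (SELECT 1 FROM node n1 WHERE n1.nid = ma.nid AND n1.status = 1)
      AND EXISTS (SELECT 1 FROM term_node tn WHERE tn.nid = ma.nid AND tn.tid = 153)
    ORDER BY ma.field_article_date_format_value DESC
""").fetchall()

print(join_rows)    # [(2,), (1,), (1,)]
print(exists_rows)  # [(2,), (1,)]
```

If duplicates like this can occur in the real data, the two queries are not interchangeable, and a LIMIT 11 on the join form may return fewer than 11 distinct articles.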
Example tables
Here are the example tables (along with the indexes) that I built and populated, based on the information from your question:
CREATE TABLE `node` (`id` INT PRIMARY KEY, `nid` INT, `title` VARCHAR(10),`status` INT);
CREATE INDEX node_ix9 ON node (`nid`,`status`,`title`);
INSERT INTO `node` VALUES (1,1,'foo',1),(2,2,'bar',0),(3,3,'fee',1),(4,4,'fi',0),(5,5,'fo',1),(6,6,'fum',0),(7,7,'derp',1);
INSERT INTO `node` SELECT id+7,nid+7,title,`status` FROM node;
INSERT INTO `node` SELECT id+14,nid+14,title,`status` FROM node;
INSERT INTO `node` SELECT id+28,nid+28,title,`status` FROM node;
INSERT INTO `node` SELECT id+56,nid+56,title,`status` FROM node;
CREATE TABLE content_type_article (id INT PRIMARY KEY, nid INT, field_article_date_format_value DATETIME, field_article_summary_value VARCHAR(10));
CREATE INDEX ct_article_ix9 ON content_type_article (field_article_date_format_value, nid, field_article_summary_value);
INSERT INTO content_type_article VALUES (1001,1,'2012-01-01','foo'),(1002,2,'2012-01-02','bar'),(1003,3,'2012-01-03','fee'),(1004,4,'2012-01-04','fi'),(1005,5,'2012-01-05','fo'),(1006,6,'2012-01-06','fum'),(1007,7,'2012-01-07','derp');
INSERT INTO content_type_article SELECT id+7,nid+7, DATE_ADD(field_article_date_format_value,INTERVAL 7 DAY),field_article_summary_value FROM content_type_article;
INSERT INTO content_type_article SELECT id+14,nid+14, DATE_ADD(field_article_date_format_value,INTERVAL 14 DAY),field_article_summary_value FROM content_type_article;
INSERT INTO content_type_article SELECT id+28,nid+28, DATE_ADD(field_article_date_format_value,INTERVAL 28 DAY),field_article_summary_value FROM content_type_article;
INSERT INTO content_type_article SELECT id+56,nid+56, DATE_ADD(field_article_date_format_value,INTERVAL 56 DAY),field_article_summary_value FROM content_type_article;
CREATE TABLE term_node (id INT, tid INT, nid INT);
CREATE INDEX term_node_ix9 ON term_node (nid,tid);
INSERT INTO term_node VALUES (2001,153,1),(2002,153,2),(2003,153,3),(2004,153,4),(2005,153,5),(2006,153,6),(2007,153,7);
INSERT INTO term_node SELECT id+7, tid, nid+7 FROM term_node;
INSERT INTO term_node SELECT id+14, tid, nid+14 FROM term_node;
INSERT INTO term_node SELECT id+28, tid, nid+28 FROM term_node;
INSERT INTO term_node SELECT id+56, tid, nid+56 FROM term_node;

MySQL is "optimizing" your query so that it selects from the term_node table first, even though you are specifying to select from node first. Not knowing the data, I'm not sure which is the optimal way. The term_node table is certainly where your performance issues are, since ~19,000 records are being selected from there.
Limits without ORDER BY are almost always faster because MySQL stops as soon as it finds the specified limit. With an ORDER BY, it first has to find all the records and sort them, then get the specified limit.
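The cost difference can be pictured with a small simulation. This is only a sketch in plain Python, not MySQL internals: `CountingScan` is a hypothetical helper that counts how many rows a consumer pulls, and the 19,006-row table with shuffled "dates" is made-up data sized after the EXPLAIN estimate in the question.

```python
from itertools import islice

class CountingScan:
    """Iterate over rows while counting how many were actually read."""
    def __init__(self, rows):
        self.rows = rows
        self.rows_read = 0

    def __iter__(self):
        for row in self.rows:
            self.rows_read += 1
            yield row

# Hypothetical table: (nid, date) pairs in no particular date order.
table = [(nid, (nid * 7919) % 19006) for nid in range(19006)]

# LIMIT 11 without ORDER BY: stop as soon as 11 rows have been produced.
unordered = CountingScan(table)
first_11 = list(islice(unordered, 11))

# ORDER BY date DESC LIMIT 11 with no usable index: read every row, then sort.
ordered = CountingScan(table)
top_11 = sorted(ordered, key=lambda row: row[1], reverse=True)[:11]

print(unordered.rows_read, ordered.rows_read)  # 11 19006
```

An index that already delivers rows in the requested order lets the sorted case behave like the first one, which is what the index-based suggestions in the other answers are aiming for.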
The simple thing to try is moving your WHERE condition into the JOIN clause, which is where it should be. That filter is specific to the table being joined. This will make sure MySQL doesn't optimize it incorrectly.
INNER JOIN term_node tn ON n.nid=tn.nid AND tn.tid=153
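For an INNER JOIN, a filter written in the ON clause and the same filter written in the WHERE clause are logically equivalent, so any speed-up from this change would have to come from the optimizer picking a different plan, which is worth confirming with EXPLAIN. The sketch below only checks the equivalence of the results, using SQLite through Python's sqlite3 as a stand-in for MySQL and a few made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (nid INTEGER PRIMARY KEY, status INTEGER);
    CREATE TABLE term_node (tid INTEGER, nid INTEGER);
    INSERT INTO node VALUES (1, 1), (2, 0), (3, 1), (4, 1);
    INSERT INTO term_node VALUES (153, 1), (153, 2), (153, 3), (7, 3), (7, 4);
""")

# Filter on tid in the WHERE clause (as in the question).
in_where = conn.execute("""
    SELECT n.nid FROM node n
    INNER JOIN term_node tn ON n.nid = tn.nid
    WHERE tn.tid = 153 AND n.status = 1
    ORDER BY n.nid
""").fetchall()

# Same filter moved into the join condition.
in_on = conn.execute("""
    SELECT n.nid FROM node n
    INNER JOIN term_node tn ON n.nid = tn.nid AND tn.tid = 153
    WHERE n.status = 1
    ORDER BY n.nid
""").fetchall()

print(in_where, in_on)  # [(1,), (3,)] [(1,), (3,)]
```

This equivalence does not hold for LEFT JOIN, where moving a condition between ON and WHERE changes which rows survive.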
A more complicated option is to do a SELECT on the term_node table and JOIN on that. That's called a derived table, and you will see it marked as DERIVED in the EXPLAIN. Since you said it was a many-to-many relationship, I added the DISTINCT keyword to reduce the number of records to join on.
SELECT ...
FROM node n
INNER JOIN content_type_article ma FORCE INDEX (ix_article_date) ON n.nid=ma.nid
INNER JOIN (SELECT DISTINCT nid FROM term_node WHERE tid=153) tn ON n.nid=tn.nid
WHERE n.status=1
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0,11
MySQL 5.0 has some limitations with derived tables, so this may not work, although there are workarounds.

You really want to avoid the sort operation happening at all if you can by taking advantage of a pre-sorted index.
To find out if this is possible, imagine your data denormalised into a single table, and ensure that everything that must be included in your WHERE clause is specifiable with a SINGLE VALUE. e.g. if you must use an IN clause on one of the columns, then sorting is inevitable.
Here's a screenshot of some sample data:
So, if you DID have your data denormalised, you could query on tid and status using single values and then sort by date descending. That would mean the following index in that case would work perfectly:
create index ix1 on denormalisedtable(tid, status, date desc);
If you had this, your query would only hit the top 10 rows and would never need to sort.
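The effect of such an index on the sort step can be checked on an empty toy table. This is a sketch under several assumptions: SQLite (through Python's sqlite3) stands in for MySQL, the plan wording is SQLite's, and the date column is called `article_date` here purely for illustration. One MySQL-specific caveat: versions before 8.0 parse but ignore the DESC keyword in an index definition, though they can still read an ascending index in reverse order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE denormalisedtable (
        nid INTEGER, tid INTEGER, status INTEGER, article_date TEXT
    )
""")

QUERY = """
    SELECT nid FROM denormalisedtable
    WHERE tid = 153 AND status = 1
    ORDER BY article_date DESC LIMIT 10
"""

def plan():
    """Return the detail strings of SQLite's EXPLAIN QUERY PLAN for QUERY."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]

# Without the index: full scan plus a temporary structure for the sort.
plan_before = plan()

# Equality columns first, sort column last, as in the index suggested above.
conn.execute(
    "CREATE INDEX ix1 ON denormalisedtable (tid, status, article_date DESC)"
)
plan_after = plan()

print(plan_before)
print(plan_after)
```

The first plan should include a temporary B-tree for the ORDER BY, while the second should search ix1 and contain no separate sort step.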
So - how do you get the same performance WITHOUT denormalising...
I think you should be able to use the STRAIGHT_JOIN clause to force the order that MySQL selects from the tables - you want to get it to select from the table you are SORTING last.
Try this:
SELECT n.nid,
n.title,
ma.field_article_date_format_value,
ma.field_article_summary_value
FROM node n
STRAIGHT_JOIN term_node tn ON n.nid=tn.nid
STRAIGHT_JOIN content_type_article ma ON n.nid=ma.nid
WHERE tn.tid= 153
AND n.status=1
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0, 11;
The idea is to get MySQL to select from the node table and then from the term_node table and THEN FINALLY from the content_type_article table (the table containing the column you are sorting on).
This last join is your most important one and you want it to happen using an index so that the LIMIT clause can work without needing to sort the data.
This single index MIGHT do the trick:
create index ix1 on content_type_article(nid, field_article_date_format_value desc);
or
create index ix1 on content_type_article(nid, field_article_date_format_value desc, field_article_summary_value);
(for a covering index)
I say MIGHT, because I don't know enough about the MySQL optimiser to know if it's clever enough to handle the multiple 'nid' column values that will be getting fed into the content_type_article without having to resort the data.
Logically, it should be able to work quickly - e.g. if 5 nid values are getting fed into the final content_type_article table, then it should be able to get the top 10 of each directly from the index and merge the results together then pick the final top 10, meaning a total of 50 rows read from this table instead of the full 19006 that you're seeing currently.
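The "top 10 per nid, then merge" reasoning can be written out as a small Python sketch. It illustrates the logic only and is not a claim about what the MySQL optimiser actually does; the five nid values and their integer "dates" are made up, and each per-nid list is assumed to arrive already sorted newest first, as it would from an index.

```python
import heapq
from itertools import islice

# Hypothetical index contents: for each of 5 nid values, 20 dates sorted
# newest first (integers stand in for dates; larger means more recent).
per_nid_sorted = {
    nid: [1000 - nid - 5 * i for i in range(20)] for nid in range(5)
}

# Take at most the top 10 from each nid: these are the only rows read.
candidate_rows = sum(min(10, len(dates)) for dates in per_nid_sorted.values())
streams = [islice(dates, 10) for dates in per_nid_sorted.values()]

# Merge the already-sorted streams and keep the overall top 10.
top_10 = list(islice(heapq.merge(*streams, reverse=True), 10))

print(candidate_rows)  # 50
print(top_10)          # [1000, 999, 998, 997, 996, 995, 994, 993, 992, 991]
```

The result matches a full sort of all 100 values followed by taking the first 10, while touching at most 50 of them.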
Let me know how it goes.
If it works for you, further optimisation will be possible using covering indexes on the other tables to speed up the first two joins.

Related

Optimizing ORDER BY and WHERE on sql queries with JOIN

I am currently working on an online e-commerce platform back-office.
I currently have around 70 000 products and I would like to speed up the display of data so that the employees can work more efficiently.
I am using MySQL "Ver 14.14 Distrib 5.7.28".
Basically for my back office (I will not explicitly list the details of the columns because I don't think it really matters), I have:
A main table node_node containing basic information for all data like creation_date, last_modification_date for example (date fields)
A main table staff_node_staffnode containing basic information for all data created by employees (like products, brands, etc ...). It contains mainly the fields owner_id (foreign key to the staff table that I will not detail here) and is_verified (boolean field) and a foreign key node_ptr_id pointing to node_node
Data structure tables like product_merchandise, product_brand which contain their own fields and a foreign key staffnode_ptr_id pointing to staff_node_staffnode
I first run a query to retrieve all the IDs of the products I want to display (given the large amount of data I prefer first retrieving only the ids of the product of my list which will be limited to 30 per page, and then on this subset retrieve more data with more joins on other tables)
SELECT id from product_merchandise pm
INNER JOIN staff_node_staffnode sns ON sns.node_ptr_id = pm.staffnode_ptr_id
INNER JOIN node_node nn ON nn.id = sns.node_ptr_id
ORDER BY creation_date DESC LIMIT 30;
There is an index on product_merchandise(staffnode_ptr_id) and staff_node_staffnode(node_ptr_id) and node_node(id).
It takes between 2 and 3 seconds on average to run this query which is too long.
EDIT: as suggested in the comments, here is the output of the EXPLAIN query. EXPLAIN ANALYZE is not working on my Mysql version.
+----+-------------+-------+------------+--------+---------------+------------------------------+---------+------------------------+-------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------+------------------------------+---------+------------------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | pm | NULL | index | PRIMARY | product_merchandise_447d3092 | 5 | NULL | 69623 | 100.00 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | sns | NULL | eq_ref | PRIMARY | PRIMARY | 4 | db.pm.staffnode_ptr_id | 1 | 100.00 | Using index |
| 1 | SIMPLE | nn | NULL | eq_ref | PRIMARY | PRIMARY | 4 | db.pm.staffnode_ptr_id | 1 | 100.00 | NULL |
+----+-------------+-------+------------+--------+---------------+------------------------------+---------+------------------------+-------+----------+----------------------------------------------+
I decided to add an index creation_date_idx on node_node(creation_date) and when I force the use of it, I get between 0.10s and 0.15s, which is perfect:
SELECT id from product_merchandise pm
INNER JOIN staff_node_staffnode sns ON sns.node_ptr_id = pm.staffnode_ptr_id
INNER JOIN node_node nn FORCE INDEX(creation_date_idx) ON nn.id = sns.node_ptr_id
ORDER BY creation_date DESC LIMIT 30;
The problem now is that the staff working on the products should be able to filter according to different parameters, for example owner_id.
SELECT id from product_merchandise pm
INNER JOIN staff_node_staffnode sns ON sns.node_ptr_id = pm.staffnode_ptr_id
INNER JOIN node_node nn FORCE INDEX(creation_date_idx) ON nn.id = sns.node_ptr_id
WHERE sns.owner_id = [NUMBER]
ORDER BY creation_date DESC LIMIT 30;
The result is terrible (I stopped the query after around 30s, but I assume it could have taken much longer), which makes sense because I force the use of the index creation_date_idx, which is not relevant here.
If I remove the use of this index, I get better results (1-2 s.) but I come back to the first issue which is: the calculation time is too long.
EDIT: as suggested, here is the output of the EXPLAIN for
SELECT id from product_merchandise pm
INNER JOIN staff_node_staffnode sns ON sns.node_ptr_id = pm.staffnode_ptr_id
INNER JOIN node_node nn ON nn.id = sns.node_ptr_id
WHERE sns.owner_id = [NUMBER]
ORDER BY creation_date DESC LIMIT 30;
+----+-------------+-------+------------+--------+---------------------------------------+------------------------------+---------+------------------------+-------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------------------------------+------------------------------+---------+------------------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | pm | NULL | index | PRIMARY | product_merchandise_447d3092 | 5 | NULL | 69220 | 100.00 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | sns | NULL | eq_ref | PRIMARY,staff_node_staffnode_5e7b1936 | PRIMARY | 4 | db.pm.staffnode_ptr_id | 1 | 19.00 | Using where |
| 1 | SIMPLE | nn | NULL | eq_ref | PRIMARY | PRIMARY | 4 | db.pm.staffnode_ptr_id | 1 | 100.00 | NULL |
+----+-------------+-------+------------+--------+---------------------------------------+------------------------------+---------+------------------------+-------+----------+----------------------------------------------+
I guess I should create another index but I don't really know on what columns.
Moreover, the staff should be able to filter on 5 different fields (let's say they are all VARCHAR or FOREIGN KEY or BOOLEAN) and order by these different fields as well. Those fields could be from the table product_merchandise (product_name for example) or staff_node_staffnode (creator or is_verified) or even node_node (creation_date for example).
I hope I made myself clear enough.
Thank you for your time, I would appreciate any help !
Have a great day.
I put it here since it doesn't fit in the comments;
here is the list of indexes you need to improve your performance:
product_merchandise(id,staffnode_ptr_id)
staff_node_staffnode(node_ptr_id,owner_id)
node_node(id,creation_date DESC)
Change/add your indexes to match the list above and let's see how it changes the performance.
Thank you eshirvana for your suggestion.
I post an answer instead of editing my original question because the results of my tests are quite long. I hope this will not be a problem.
First of all I forgot to mention that staffnode_ptr_id was the primary key of product_merchandise and that node_ptr_id was the primary key of staff_node_staffnode.
Then here are the indexes I have besides the PRIMARY indexes:
CREATE INDEX node_creationdate_idx ON node_node(creation_date);
CREATE INDEX node_id_creationdate_idx ON node_node(id,creation_date);
CREATE INDEX staffnode_nodeptrid_ownerid_idx ON staff_node_staffnode(node_ptr_id,owner_id);
I didn't specify DESC for the index node_id_creationdate_idx because the ordering could be ASC or DESC depending on the case.
Here are the results of the speed tests I ran (I executed the queries 10 times for each case):
The details can be found on this link
No index forced, ordering by 'creation_date' only
average: 2.4473010037094354 fastest: 2.0254166573286057 slowest: 2.891202986240387
Forcing index 'node_creationdate_idx', ordering by 'creation_date' only
average: 0.045951709523797034 fastest: 0.03917844220995903 slowest: 0.06625311821699142
No index forced, ordering by 'creation_date' and filtering on 'owner_id'
average: 1.7595138054341077 fastest: 1.08128846809268 slowest: 2.858897101134062
Forcing index 'node_creationdate_idx', ordering by 'creation_date' and filtering on 'owner_id'
average: infinity
The results above correspond to what I was stating in my original post.
If I try ordering by sku which is a VARCHAR column of the product_merchandise table, the calculation is very fast no matter what
No index forced, ordering by 'sku' only
average: 0.0022248398512601853 fastest: 0.0017771385610103607 slowest: 0.0032510906457901
No index forced, ordering by 'sku' and filtering on 'owner_id'
average: 0.00639396645128727 fastest: 0.0025643371045589447 slowest: 0.0197000615298748
On the results below, I tried to force the use of the new indexes staffnode_nodeptrid_ownerid_idx and node_id_creationdate_idx
Forcing index 'staffnode_nodeptrid_ownerid_idx', ordering by 'creation_date' only
average: 2.1846631478518246 fastest: 1.665839608758688 slowest: 2.5894345454871655
Forcing index 'staffnode_nodeptrid_ownerid_idx', ordering by 'creation_date' and filtering on 'owner_id'
average: 0.9459988728165627 fastest: 0.726978026330471 slowest: 1.1611059792339802
Forcing index 'node_id_creationdate_idx', ordering by 'creation_date' only
average: 1.7628929097205401 fastest: 1.5384734570980072 slowest: 1.9222845435142517
Forcing index 'node_id_creationdate_idx', ordering by 'creation_date' and filtering on 'owner_id'
average: 1.2311949148774146 fastest: 0.9017647355794907 slowest: 1.4749027229845524
Forcing indexes 'node_id_creationdate_idx' and 'staffnode_nodeptrid_ownerid_idx', ordering by 'creation_date' only
average: 1.5638799782842399 fastest: 1.3537045568227768 slowest: 1.8629941195249557
Forcing indexes 'node_id_creationdate_idx' and 'staffnode_nodeptrid_ownerid_idx', ordering by 'creation_date' and filtering on 'owner_id'
average: 1.6410113696008921 fastest: 1.2819141708314419 slowest: 2.3169863671064377
In conclusion:
I get slightly better results with those indexes, although it's still too long in my opinion
It seems that the problem lies in the fact that creation_date does not belong to the table product_merchandise and therefore indexing on it is not really efficient
What do you suggest ? Should I change the structure of my tables ?
Thank you for your help !

Self join on a huge table with conditions is taking a lot of time, optimize query

I have a master table which has details.
I want to find, for every session, all the combinations of each product in that session with every other product in the same session.
create table combinations as
select
a.main_id,
a.sub_id as sub_id_x,
b.sub_id as sub_id_y,
count(*) as count1,
a.dates as rundate
from
master_table a
left join
master_table b
on a.session_id = b.session_id
and a.visit_number = b.visit_number
and a.main_id = b.main_id
and a.sub_id != b.sub_id
where
a.sub_id is not null
and b.sub_id is not null
group by
a.main_id,
a.sub_id,
b.sub_id,
rundate;
I ran EXPLAIN on the query:
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| 1 | SIMPLE | a | NULL | ALL | NULL | NULL | NULL | NULL | 298148 | 90.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | b | NULL | ALL | NULL | NULL | NULL | NULL | 298148 | 0.08 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
The main issue is, my master table consists of 80 million rows. This query is taking more than 24 hours to execute.
All the columns are indexed and I am doing a self join.
Would first creating a like table 'master_table_2' and then doing the join make my query faster?
Is there any way to optimize the query time?
Because your table has so many rows, the join will take a long time unless the query is optimized and the WHERE clause is used effectively. An optimized query could save you time and effort. The following link has a good explanation of how join queries are optimized:
Optimization of Join Queries
@Marcus Adams has already provided a similar answer here.
Another option is to run the SELECTs individually and do the processing in application code. This only applies in specific conditions, so you will have to try both approaches (the join query and application-side processing) and compare their performance. I got better performance using this method once.
Suppose a join query is like as the following -
SELECT A.a1, B.b1, A.a2
FROM A
INNER JOIN B
ON A.a3=B.b3
WHERE B.b3=C;
What I am trying to say is: query A and B individually with the necessary conditions, then build your desired result in application code.
N.B.: This is an unorthodox approach, and it should not be taken for granted that it applies in every case.
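The application-side approach can be sketched as follows for the A/B example. This is a minimal illustration, assuming Python with its built-in sqlite3 module in place of the real database, made-up rows, and a hypothetical value of 5 for the constant C. It only shows that the two routes produce the same rows; whether the split version is faster has to be measured on the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (a1 TEXT, a2 TEXT, a3 INTEGER);
    CREATE TABLE B (b1 TEXT, b3 INTEGER);
    INSERT INTO A VALUES ('a-one', 'x', 5), ('a-two', 'y', 5), ('a-three', 'z', 9);
    INSERT INTO B VALUES ('b-one', 5), ('b-two', 9);
""")
C = 5  # hypothetical value for the constant in WHERE B.b3 = C

# Route 1: the single join query from the example.
joined = conn.execute("""
    SELECT A.a1, B.b1, A.a2
    FROM A INNER JOIN B ON A.a3 = B.b3
    WHERE B.b3 = ?
    ORDER BY A.a1, B.b1
""", (C,)).fetchall()

# Route 2: two independent, filtered queries combined in application code.
# Because the join key is fixed to C on both sides, every matching A row
# pairs with every matching B row.
a_rows = conn.execute("SELECT a1, a2 FROM A WHERE a3 = ?", (C,)).fetchall()
b_rows = conn.execute("SELECT b1 FROM B WHERE b3 = ?", (C,)).fetchall()
combined = sorted((a1, b1, a2) for a1, a2 in a_rows for (b1,) in b_rows)

print(joined)
print(combined)
```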
Hope it helps.

MySQL InnoDB indexes slowing down sorts

I am using MySQL 5.6 on FreeBSD and have just recently switched from MyISAM tables to InnoDB to gain the advantages of foreign key constraints and transactions.
After the switch, I discovered that a query on a table with 100,000 rows that was previously taking .003 seconds, was now taking 3.6 seconds. The query looked like this:
SELECT *
FROM USERS u
JOIN MIGHT_FLOCK mf ON (u.USER_ID = mf.USER_ID)
WHERE u.STATUS = 'ACTIVE' AND u.ACCESS_ID >= 8 ORDER BY mf.STREAK DESC LIMIT 0,100
I noticed that if I removed the ORDER BY clause, the execution time dropped back down to .003 seconds, so the problem is obviously in the sorting.
I then discovered that if I added back the ORDER BY but removed indexes on the columns referred to in the query (STATUS and ACCESS_ID), the query execution time would take the normal .003 seconds.
Then I discovered that if I added back the indexes on the STATUS and ACCESS_ID columns, but used IGNORE INDEX (STATUS,ACCESS_ID), the query would still execute in the normal .003 seconds.
Is there something about InnoDB and sorting results when referencing an indexed column in a WHERE clause that I don't understand?
Or am I doing something wrong?
EXPLAIN for the slow query returns the following results:
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
| 1 | SIMPLE | u | ref | PRIMARY,STATUS,ACCESS_ID | STATUS | 2 | const | 53902 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | mf | eq_ref | PRIMARY | PRIMARY | 4 | PRO_MIGHT.u.USER_ID | 1 | NULL |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
EXPLAIN for the fast query returns the following results:
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
| 1 | SIMPLE | mf | index | PRIMARY | STREAK | 2 | NULL | 100 | NULL |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | PRO_MIGHT.mf.USER_ID | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
Any help would be greatly appreciated.
In the slow case, MySQL is making an assumption that the index on STATUS will greatly limit the number of users it has to sort through. MySQL is wrong. Presumably most of your users are ACTIVE. MySQL is picking up 50k user rows, checking their ACCESS_ID, joining to MIGHT_FLOCK, sorting the results and taking the first 100 (out of 50k).
In the fast case, you have told MySQL it can't use either index on USERS. MySQL is using its next-best index, it is taking the first 100 rows from MIGHT_FLOCK using the STREAK index (which is already sorted), then joining to USERS and picking up the user rows, then checking that your users are ACTIVE and have an ACCESS_ID at or above 8. This is much faster because only 100 rows are read from disk (x2 for the two tables).
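The difference between the two plans can be imitated outside the database. The sketch below is a toy model, not how MySQL executes anything internally: the data is random and hypothetical (about half the users ACTIVE, ACCESS_ID uniform from 1 to 10), and a pre-sorted Python list stands in for the STREAK index.

```python
import random

random.seed(0)

# Hypothetical data: 100,000 users, each with exactly one MIGHT_FLOCK row.
users = {
    uid: {
        "status": "ACTIVE" if random.random() < 0.5 else "INACTIVE",
        "access_id": random.randint(1, 10),
    }
    for uid in range(100_000)
}
streak = {uid: random.randint(0, 30_000) for uid in users}

# Built once, like an index: user ids ordered by STREAK descending.
streak_index = sorted(streak, key=streak.get, reverse=True)


def slow_plan(limit=100):
    """Start from the STATUS index, join, sort everything, keep `limit` rows."""
    active = [uid for uid, u in users.items() if u["status"] == "ACTIVE"]
    rows_read = len(active)
    joined = [(streak[uid], uid) for uid in active if users[uid]["access_id"] >= 8]
    joined.sort(reverse=True)  # plays the role of the filesort
    return [uid for _, uid in joined[:limit]], rows_read


def fast_plan(limit=100):
    """Walk the STREAK index in order and stop once `limit` rows qualify."""
    result, rows_read = [], 0
    for uid in streak_index:
        rows_read += 1
        u = users[uid]
        if u["status"] == "ACTIVE" and u["access_id"] >= 8:
            result.append(uid)
            if len(result) == limit:
                break
    return result, rows_read


slow_ids, slow_read = slow_plan()
fast_ids, fast_read = fast_plan()
print("rows read (slow plan):", slow_read)
print("rows read (fast plan):", fast_read)
```

Both plans return rows with the same top streak values; the second simply stops reading as soon as it has enough qualifying rows, which is why it only pays off when a reasonable share of rows pass the WHERE conditions.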
I would recommend:
Drop the index on STATUS unless you frequently need to retrieve INACTIVE users (not ACTIVE users). This index is not helping you.
Read this question to understand why your sorts are so slow. You can probably tune InnoDB for better sort performance to prevent these kinds of problems.
If you have very few users with ACCESS_ID at or above 8 you should see a dramatic improvement already. If not you might have to use STRAIGHT_JOIN in your select clause.
Example below:
SELECT *
FROM MIGHT_FLOCK mf
STRAIGHT_JOIN USERS u ON (u.USER_ID = mf.USER_ID)
WHERE u.STATUS = 'ACTIVE' AND u.ACCESS_ID >= 8 ORDER BY mf.STREAK DESC LIMIT 0,100
STRAIGHT_JOIN forces MySQL to access the MIGHT_FLOCK table before the USERS table based on the order in which you specify those two tables in the query.
To answer the question "Why did the behaviour change" you should start by understanding the statistics that MySQL keeps on each index: http://dev.mysql.com/doc/refman/5.6/en/myisam-index-statistics.html. If statistics are not up to date or if InnoDB is not providing sufficient information to MySQL, the query optimiser can (and does) make stupid decisions about how to join tables.

Select parent row only if it has no children

I have a MySQL database in which table A has a one-to-many relation to table B, and I would like to select all rows in table A that have no children in table B. I have tried using
SELECT id FROM A WHERE NOT EXISTS (SELECT * FROM B WHERE B.id=A.id)
and
SELECT id FROM A LEFT JOIN B ON A.id=B.id WHERE B.id IS NULL
Both of these seem slow. Is there a faster query to achieve the same thing?
In case this is relevant, in my database table A has about 500,000 rows and table B has about 3 to 4 million rows.
Edit: For the actual tables in my database, explain gives me:
+----+--------------------+------------------+-------+---------------+---------------------------+---------+------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------------+-------+---------------+---------------------------+---------+------+---------+--------------------------+
| 1 | PRIMARY | frontend_form471 | index | NULL | frontend_form471_61a633e8 | 32 | NULL | 671927 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | SchoolData | index | PRIMARY | PRIMARY | 49 | NULL | 3121110 | Using where; Using index |
+----+--------------------+------------------+-------+---------------+---------------------------+---------+------+---------+--------------------------+
for
select number from frontend_form471 where not exists (select * from SchoolData where SchoolData.`f471 Application Number`=frontend_form471.number)
and
+----+-------------+------------------+-------+---------------+---------------------------+---------+------+---------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------+---------------------------+---------+------+---------+------------------------------------------------+
| 1 | SIMPLE | frontend_form471 | index | NULL | frontend_form471_61a633e8 | 32 | NULL | 671927 | Using index; Using temporary |
| 1 | SIMPLE | SchoolData | index | PRIMARY | PRIMARY | 49 | NULL | 3121110 | Using where; Using index; Not exists; Distinct |
+----+-------------+------------------+-------+---------------+---------------------------+---------+------+---------+------------------------------------------------+
for
select distinct number from frontend_form471 left join SchoolData on frontend_form471.number=SchoolData.`f471 Application Number` where SchoolData.`f471 Application Number` is NULL
where in my case frontend_form471 is table A and SchoolData is table B
Edit2: In table B (SchoolData) in my database, id is the first part of a two part primary key, so it is indexed and there are still multiple entries in B with the same id.
SELECT id FROM A LEFT OUTER JOIN B ON A.id=B.id WHERE B.id IS NULL
You can do this. The outer join might improve performance a little, but not much.
Newer database systems will probably optimize your query anyway, so there won't be any difference.
The right approach here is caching: try the query cache and, if possible, application-level caching.
Of course, you also need proper indexes.
By proper I mean on both tables, preferably hash indexes, since a hash lookup takes constant time whereas a tree lookup takes logarithmic time.
Try putting an explain before the query to see what really slows this down.
If you really need this to be fast, you may need to refactor your data structure.
You could create a trigger that maintains a flag in table A indicating whether there is a corresponding entry in table B. Of course, this is data redundancy, but sometimes it's worth it. Just think of it as caching.
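A rough sketch of such a trigger-maintained flag is below. It uses SQLite through Python so that it can be run as-is; MySQL trigger syntax differs (for example, it requires FOR EACH ROW), and the has_children column, the part column, and the trigger names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- has_children is the hypothetical redundant flag on the parent table.
    CREATE TABLE A (id INTEGER PRIMARY KEY, has_children INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE B (id INTEGER NOT NULL, part INTEGER NOT NULL, PRIMARY KEY (id, part));
    CREATE INDEX a_has_children ON A (has_children);

    -- Any new child row marks its parent.
    CREATE TRIGGER b_after_insert AFTER INSERT ON B
    BEGIN
        UPDATE A SET has_children = 1 WHERE id = NEW.id;
    END;

    -- After a delete, recompute the flag from the children that remain.
    CREATE TRIGGER b_after_delete AFTER DELETE ON B
    BEGIN
        UPDATE A SET has_children = EXISTS (SELECT 1 FROM B WHERE id = OLD.id)
        WHERE id = OLD.id;
    END;

    INSERT INTO A (id) VALUES (1), (2), (3);
    INSERT INTO B VALUES (1, 1), (1, 2), (3, 1);
""")


def childless():
    """Parents without children, answered from the flag alone."""
    sql = "SELECT id FROM A WHERE has_children = 0 ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


before = childless()
conn.execute("DELETE FROM B WHERE id = 3")
after = childless()
print(before, after)
```

The trade-off is extra work on every write to B in exchange for a lookup that never touches B at all.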
One last thought: you could try SELECT id FROM A WHERE id NOT IN (SELECT id FROM B). It may be a little faster because no actual joining is necessary; however, it may also be slower because the lookup in the set from B may be a full scan. I am not really sure how this will be processed, but it may be worth a try.
It's going to be slow no matter how you look at it. Worst case performance is going to be a full cross join creating 2 trillion potential matches (4 mill * 500k).
The second one will most likely perform faster, since it's a single query.
Your indexing is poor.
For all forms (EXISTS, IN, LEFT JOIN) you should have indexes on id in both tables
You could try
SELECT id FROM A WHERE A.id NOT IN (SELECT id FROM B)
but I don't know if this will be any faster. I would have tried the left join first. I think your problem has more to do with indexes. Do you have indexes on both id fields?
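One caveat when swapping between these forms: NOT EXISTS, LEFT JOIN ... IS NULL and NOT IN return the same rows only as long as B.id contains no NULLs. The sketch below checks this with made-up rows, using SQLite as a stand-in for MySQL (the NULL handling shown is standard SQL). In the question B.id is part of the primary key, so it cannot be NULL there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY);
    CREATE TABLE B (id INTEGER, part INTEGER);  -- id is nullable in this toy schema
    CREATE INDEX b_id ON B (id);
    INSERT INTO A VALUES (1), (2), (3), (4);
    INSERT INTO B VALUES (1, 1), (1, 2), (3, 1);
""")

QUERIES = {
    "not_exists": "SELECT id FROM A WHERE NOT EXISTS "
                  "(SELECT * FROM B WHERE B.id = A.id) ORDER BY id",
    "left_join": "SELECT A.id FROM A LEFT JOIN B ON A.id = B.id "
                 "WHERE B.id IS NULL ORDER BY A.id",
    "not_in": "SELECT id FROM A WHERE id NOT IN (SELECT id FROM B) ORDER BY id",
}


def run_all():
    """Run every variant and collect the returned ids."""
    return {name: [row[0] for row in conn.execute(sql)] for name, sql in QUERIES.items()}


without_nulls = run_all()

# A single NULL in B.id makes "id NOT IN (...)" evaluate to NULL, never TRUE,
# for every row that is not an outright match.
conn.execute("INSERT INTO B VALUES (NULL, 1)")
with_nulls = run_all()

print(without_nulls)
print(with_nulls)
```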
Be sure to have an index on A.id and another one on B.id.
What seems a bit strange is that you join A.id with B.id. Is B.id the foreign key to A, or is it the primary key of B?
If your schema is something like this:
CREATE TABLE b(
id int,
value varchar(255)
);
CREATE TABLE a(
id int,
father_id int,
value varchar(255)
);
If you want all the rows of table B that don't have a child in table A, why don't you try something like:
SELECT * FROM B WHERE id NOT IN (SELECT father_id FROM A GROUP BY father_id)
I haven't tested it, but I think it's faster. Remember to put an index on id.
Hope this helps
Why not try an empty value instead of NULL? In SQL, the NULL value is never true in comparison to any other value, even NULL. An expression that contains NULL always produces a NULL value unless otherwise indicated in the documentation for the operators and functions involved in the expression.

understanding mysql explain

So, I've never understood the EXPLAIN output of MySQL. I understand the gross concepts: that you should have at least one entry in the possible_keys column for it to use an index, and that simple queries are better. But what is the difference between ref and eq_ref? What is the best way to optimize queries?
For example, this is my latest query that I'm trying to figure out why it takes forever (generated from django models) :
+----+-------------+---------------------+--------+-----------------------------------------------------------+---------------------------------+---------+--------------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+--------+-----------------------------------------------------------+---------------------------------+---------+--------------------------------------+------+---------------------------------+
| 1 | SIMPLE | T6 | ref | yourock_achiever_achievement_id,yourock_achiever_alias_id | yourock_achiever_alias_id | 4 | const | 244 | Using temporary; Using filesort |
| 1 | SIMPLE | T5 | eq_ref | PRIMARY | PRIMARY | 4 | paul.T6.achievement_id | 1 | Using index |
| 1 | SIMPLE | T4 | ref | yourock_achiever_achievement_id,yourock_achiever_alias_id | yourock_achiever_achievement_id | 4 | paul.T6.achievement_id | 298 | |
| 1 | SIMPLE | yourock_alias | eq_ref | PRIMARY | PRIMARY | 4 | paul.T4.alias_id | 1 | Using index |
| 1 | SIMPLE | yourock_achiever | ref | yourock_achiever_achievement_id,yourock_achiever_alias_id | yourock_achiever_alias_id | 4 | paul.T4.alias_id | 152 | |
| 1 | SIMPLE | yourock_achievement | eq_ref | PRIMARY | PRIMARY | 4 | paul.yourock_achiever.achievement_id | 1 | |
+----+-------------+---------------------+--------+-----------------------------------------------------------+---------------------------------+---------+--------------------------------------+------+---------------------------------+
6 rows in set (0.00 sec)
I had hoped to learn enough about mysql explain that the query wouldn't be needed. Alas, it seems that you can't get enough information from the explain statement and you need the raw SQL. Query :
SELECT `yourock_achievement`.`id`,
`yourock_achievement`.`modified`,
`yourock_achievement`.`created`,
`yourock_achievement`.`string_id`,
`yourock_achievement`.`owner_id`,
`yourock_achievement`.`name`,
`yourock_achievement`.`description`,
`yourock_achievement`.`owner_points`,
`yourock_achievement`.`url`,
`yourock_achievement`.`remote_image`,
`yourock_achievement`.`image`,
`yourock_achievement`.`parent_achievement_id`,
`yourock_achievement`.`slug`,
`yourock_achievement`.`true_points`
FROM `yourock_achievement`
INNER JOIN
`yourock_achiever`
ON `yourock_achievement`.`id` = `yourock_achiever`.`achievement_id`
INNER JOIN
`yourock_alias`
ON `yourock_achiever`.`alias_id` = `yourock_alias`.`id`
INNER JOIN
`yourock_achiever` T4
ON `yourock_alias`.`id` = T4.`alias_id`
INNER JOIN
`yourock_achievement` T5
ON T4.`achievement_id` = T5.`id`
INNER JOIN
`yourock_achiever` T6
ON T5.`id` = T6.`achievement_id`
WHERE
T6.`alias_id` = 6
ORDER BY
`yourock_achievement`.`modified` DESC
Paul:
eq_ref
One row is read from this table for each combination of rows from the previous tables. Other than the system and const types, this is the best possible join type. It is used when all parts of an index are used by the join and the index is a PRIMARY KEY or UNIQUE index.
eq_ref can be used for indexed columns that are compared using the = operator. The comparison value can be a constant or an expression that uses columns from tables that are read before this table. In the following examples, MySQL can use an eq_ref join to process ref_table:
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column=other_table.column;
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column_part1=other_table.column
AND ref_table.key_column_part2=1;
ref
All rows with matching index values are read from this table for each combination of rows from the previous tables. ref is used if the join uses only a leftmost prefix of the key or if the key is not a PRIMARY KEY or UNIQUE index (in other words, if the join cannot select a single row based on the key value). If the key that is used matches only a few rows, this is a good join type.
ref can be used for indexed columns that are compared using the = or <=> operator. In the following examples, MySQL can use a ref join to process ref_table:
SELECT * FROM ref_table WHERE key_column=expr;
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column=other_table.column;
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column_part1=other_table.column
AND ref_table.key_column_part2=1;
These are copied verbatim from the MySQL manual: http://dev.mysql.com/doc/refman/5.0/en/using-explain.html
If you could post your query that is taking forever, I could help pinpoint what is slowing it down. Also, please specify what your definition of forever is. Also, if you could provide your "SHOW CREATE TABLE xxx;" statements for these tables, I could help in optimizing your query as much as possible.
What jumps out at me immediately as a possible point of improvement is the "Using temporary; Using filesort;". This means that a temporary table was created to satisfy the query (not necessarily a bad thing), and that the GROUP BY/ORDER BY you designated could not be retrieved from an index, thus resulting in a filesort.
Your query seems to process (244 * 298 * 152) = 11,052,224 records, which, according to Using temporary; Using filesort, need to be sorted.
This can take a long time.
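The estimate is just the product of the rows column of the EXPLAIN output in the question, one factor per table in join order. A quick check follows; note these are the optimizer's estimates, not exact row counts.

```python
from math import prod

# "rows" column from the EXPLAIN output, in join order:
# T6, T5, T4, yourock_alias, yourock_achiever, yourock_achievement
estimated_rows = [244, 1, 298, 1, 152, 1]

# Nested-loop joins multiply: each row from one table drives a lookup in the next.
total = prod(estimated_rows)
print(f"{total:,}")
```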
If you post your query here, we probably will be able to optimize it somehow.
Update:
Your query does indeed perform a number of nested loops and seems to yield a large number of rows, which then need to be sorted.
Could you please run the following query:
SELECT COUNT(*)
FROM `yourock_achievement`
INNER JOIN
`yourock_achiever`
ON `yourock_achievement`.`id` = `yourock_achiever`.`achievement_id`
INNER JOIN
`yourock_alias`
ON `yourock_achiever`.`alias_id` = `yourock_alias`.`id`
INNER JOIN
`yourock_achiever` T4
ON `yourock_alias`.`id` = T4.`alias_id`
INNER JOIN
`yourock_achievement` T5
ON T4.`achievement_id` = T5.`id`
INNER JOIN
`yourock_achiever` T6
ON T5.`id` = T6.`achievement_id`
WHERE
T6.`alias_id` = 6