Slow MySQL query after adding multiple OR conditions to indexed column - mysql

I have a query running on MySQL (v5.5 -- I know it's old but it's what I have to work with for now). The table A below has ~16 million rows and B has ~700,000. The query looks something like this:
SELECT A.id, A.x, A.y, A.z, B.foo FROM A STRAIGHT_JOIN B ON A.id = B.id
where A.x = 53 ORDER BY A.y desc LIMIT 0, 30;
There's an index set up on A.id as well as on B.id.
There's also an index set up on (A.x, A.y) (this key/index is called DocsByType).
This query has worked great so far; its performance has always been sub-second or thereabouts. Recently, though, I've needed to occasionally check against an additional possible value for A.x in the WHERE clause. The following query now performs very poorly, taking ~15 seconds on average to complete:
SELECT A.id, A.x, A.y, A.z, B.foo FROM A STRAIGHT_JOIN B ON A.id = B.id
where (A.x = 18 or A.x = 53) ORDER BY A.y desc LIMIT 0, 30;
The explain for the fast query with only one comparison looks like this:
+----+-------------+-------+------+-----------------------------------------------------+------------+---------+----------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-----------------------------------------------------+------------+---------+----------------+---------+-------------+
| 1 | SIMPLE | A | ref | Documents1,Documents2,Documents3,DocsByType,KEY_AID | DocsByType | 4 | const | 1870603 | Using where |
| 1 | SIMPLE | B | ref | KEY_BID | KEY_BID | 4 | mydb.B.id | 1 | |
+----+-------------+-------+------+-----------------------------------------------------+------------+---------+----------------+---------+-------------+
The explain for the multi-comparison query looks like this:
+----+-------------+-------+-------+-----------------------------------------------------+------------+---------+----------------+---------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-----------------------------------------------------+------------+---------+----------------+---------+-----------------------------+
| 1 | SIMPLE | A | range | Documents1,Documents2,Documents3,DocsByType,KEY_AID | DocsByType | 4 | NULL | 1878693 | Using where; Using filesort |
| 1 | SIMPLE | B | ref | KEY_BID | KEY_BID | 4 | mydb.B.id | 1 | |
+----+-------------+-------+-------+-----------------------------------------------------+------------+---------+----------------+---------+-----------------------------+
I can see that there's a filesort operation that's not in the first query. Also the type is "range" instead of "ref", and the ref is "NULL" instead of "const". Removing the order by clause fixes it completely, so that it completes in less than a second, but it's important that the results are sorted.
Query optimization is not my strong suit. I would have thought that it would have worked exactly the same given that the column is already indexed. Can anyone explain why this behaves the way it does and suggest a way to optimize the query? Please also note that the new query might need to use 3, 4 or even 5 possible values for the where clause (but always against the same column).
I've also tried running the queries using MySQL 5.8 but the result is the same. My table is using the MyISAM engine.

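Suppose you have a big list of people's names, and the goal is to find the first 30 Smiths (ordered by first name). The first query is fast because it is essentially doing the WHERE, ORDER BY and LIMIT all at once: a list sorted by last name and then first name (like your index on (x, y)) already holds the Smiths in first-name order, so it can read 30 entries and stop.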
The second is messier because it is effectively done thus:
Find the first names of all the 'Smiths',
Find the first names of all the 'Joneses'
Sort the first names and show the first 30
There are two things to speed up your slow query:
( SELECT A.id, A.x, A.y, A.z, B.foo FROM A JOIN B ON A.id = B.id
where (A.x = 18)
ORDER BY A.y desc LIMIT 30 )
UNION ALL
( SELECT A.id, A.x, A.y, A.z, B.foo FROM A JOIN B ON A.id = B.id
where (A.x = 53) -- Note
ORDER BY A.y desc LIMIT 30 )
ORDER BY A.y desc LIMIT 0, 30 -- Yes, repeated
Comments:
STRAIGHT_JOIN is unnecessary, JOIN will happen to do the same thing
Each subquery will use INDEX(x,y) and make use of LIMIT.
UNION ALL is faster than the default (UNION DISTINCT), and is appropriate in this case
If you need to "paginate", the limits need to be handled as described here: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#or
Any number of UNIONs can be tacked together. However, at some point, the cost of all the unions will outweigh the benefit. (It is not practical to try to predict where the cutoff is.)
It would be faster to do the LIMIT 30 before JOINing to B. That way, you would do only 30 lookups in B; my way needs 60; your original query needed lots more.
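To make the idea behind the UNION ALL rewrite concrete, here is a minimal Python sketch. It is only a model, not what MySQL does internally: the `index` dict is a hypothetical stand-in for INDEX(x, y), and `top_n_union` is an illustrative name. Each "subquery" reads at most n pre-sorted entries for its own x value, and the outer step merges those short lists and keeps n rows.

```python
import heapq
from itertools import islice

def top_n_union(index, x_values, n):
    """Model of the UNION ALL rewrite: at most n rows per x value, merged, then n overall.

    `index` is a hypothetical stand-in for INDEX(x, y): a dict mapping each
    x value to its rows as (y, id) tuples, already sorted by y descending.
    """
    # Each "subquery" reads at most n entries from its own slice of the index.
    per_value = [index.get(x, [])[:n] for x in x_values]
    # Merge the pre-sorted slices (descending y) and stop after n rows overall.
    merged = heapq.merge(*per_value, key=lambda row: row[0], reverse=True)
    return list(islice(merged, n))

# Made-up sample data: three rows for x = 18 and three for x = 53.
index = {
    18: [(90, "a"), (70, "b"), (10, "c")],
    53: [(95, "d"), (80, "e"), (60, "f")],
}
result = top_n_union(index, [18, 53], 3)
print(result)
```

With k values of x and LIMIT n, this touches at most k * n index entries instead of every row matching any of the values, which is why the rewrite helps as long as k stays small.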

Related

Self join on a huge table with conditions is taking a lot of time, optimize query

I have a master table which has details.
I want to find, for every session, all combinations of each product with every other product in that same session.
create table combinations as
select
a.main_id,
a.sub_id as sub_id_x,
b.sub_id as sub_id_y,
count(*) as count1,
a.dates as rundate
from
master_table a
left join
master_table b
on a.session_id = b.session_id
and a.visit_number = b.visit_number
and a.main_id = b.main_id
and a.sub_id != b.sub_id
where
a.sub_id is not null
and b.sub_id is not null
group by
a.main_id,
a.sub_id,
b.sub_id,
rundate;
I ran EXPLAIN on the query:
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| 1 | SIMPLE | a | NULL | ALL | NULL | NULL | NULL | NULL | 298148 | 90.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | b | NULL | ALL | NULL | NULL | NULL | NULL | 298148 | 0.08 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
The main issue is, my master table consists of 80 million rows. This query is taking more than 24 hours to execute.
All the columns are indexed and I am doing a self join.
Would creating a like table 'master_table_2' first and then doing the join make my query faster?
Is there any way to optimize the query time?
Since your table has a lot of rows, the join will take a long time unless the query is optimized and the WHERE clause is used properly. An optimized query could save you time and effort. The following link has a good explanation of how join queries are optimized:
Optimization of Join Queries
@Marcus Adams has already provided a similar answer here
Another option is to select from each table individually and combine the results in application code. This only pays off under specific conditions, so you will have to try both approaches (the join query and processing in code) and compare their performance. I have had better performance with this method on one occasion.
Suppose the join query looks like the following:
SELECT A.a1, B.b1, A.a2
FROM A
INNER JOIN B
ON A.a3=B.b3
WHERE B.b3=C;
What I am suggesting is to query A and B individually, each with its own conditions, and then assemble the desired result in application code.
N.B.: This is an unorthodox approach and should not be assumed to apply in every case.
Hope it helps.
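As a rough illustration of "processing in code" for the self join in the question, here is a hedged Python sketch. It assumes the rows of master_table can be read as dicts with the column names from the question (session_id, visit_number, main_id, sub_id, dates); `count_pairs` is an illustrative name, not an existing function. It groups rows by the join key first, so each row is only paired with rows from its own group.

```python
from collections import Counter, defaultdict

def count_pairs(rows):
    """Count ordered (sub_id_x, sub_id_y) pairs sharing session, visit and main_id.

    `rows` is an iterable of dicts using the column names from the question;
    how they are fetched from master_table is left out (an assumption).
    """
    groups = defaultdict(list)
    for r in rows:
        if r["sub_id"] is not None:  # mirrors "sub_id is not null"
            key = (r["session_id"], r["visit_number"], r["main_id"])
            groups[key].append((r["sub_id"], r["dates"]))

    counts = Counter()
    for (_, _, main_id), members in groups.items():
        # Pair each row with every other row of the same group, like the self join.
        for sub_x, rundate in members:
            for sub_y, _ in members:
                if sub_x != sub_y:
                    counts[(main_id, sub_x, sub_y, rundate)] += 1
    return counts

# Made-up sample rows: two sessions that each contain products p and q.
rows = [
    {"session_id": 1, "visit_number": 1, "main_id": 10, "sub_id": "p", "dates": "2020-01-01"},
    {"session_id": 1, "visit_number": 1, "main_id": 10, "sub_id": "q", "dates": "2020-01-01"},
    {"session_id": 1, "visit_number": 1, "main_id": 10, "sub_id": None, "dates": "2020-01-01"},
    {"session_id": 2, "visit_number": 1, "main_id": 10, "sub_id": "p", "dates": "2020-01-01"},
    {"session_id": 2, "visit_number": 1, "main_id": 10, "sub_id": "q", "dates": "2020-01-01"},
]
pairs = count_pairs(rows)
print(pairs)
```

The work is still quadratic within each group, so whether this beats the database depends on how large the groups are; it would have to be measured against the join, as the answer says.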

Why is my MySQL query so slow?

I'm trying to figure out why this query is so slow (it takes about 6 seconds to return a result):
SELECT DISTINCT
c.id
FROM
z1
INNER JOIN
c ON (z1.id = c.id)
INNER JOIN
i ON (c.member_id = i.member_id)
WHERE
c.id NOT IN (... big list of ids which should be excluded)
This is the execution plan:
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
| 1 | SIMPLE | z1 | index | PRIMARY | PRIMARY | 4 | NULL | 318563 | 99.85 | Using where; Using index; Using temporary |
| 1 | SIMPLE | c | eq_ref | PRIMARY,member_id | PRIMARY | 4 | z1.id | 1 | 100.00 | |
| 1 | SIMPLE | i | eq_ref | PRIMARY | PRIMARY | 4 | c.member_id | 1 | 100.00 | Using index |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
Is it because MySQL has to read almost the whole first table? Can it be adjusted?
You can try to replace c with a subquery.
SELECT DISTINCT
c.id
FROM
z1
INNER JOIN
(select c.id, c.member_id
from c
WHERE
c.id NOT IN (... big list of ids which should be excluded)) c ON (z1.id = c.id)
INNER JOIN
i ON (c.member_id = i.member_id)
to leave only necessary id's
It is impossible to say from the information you've provided whether there is a faster way to obtain the same data (we would need to know about the data distributions and which foreign keys are obligatory). However, assuming that this is a hierarchical data set, the plan is probably not optimal: the only predicate that reduces the number of rows is c.id NOT IN (...).
The first questions to ask yourself when optimizing any query are: do I need all the rows, and how many rows is this returning?
I'm struggling to see any utility in a query which returns a list of 'id' values (implying a set of autoincrement integers).
You can't use an index for a NOT IN (or <>), hence the most efficient solution is probably to start with a full table scan on 'c', which should be the outcome of StanislavL's query.
Since you don't use the values from i and z1, the joins could be replaced with 'exists', which may help performance.
I would consider creating a compound index for c(id, member_id). This way the query should work at index level only without scanning any rows in tables.

Unable to optimize MySQL query which uses a ORDER BY clause

I'm using Drupal 6 with MySQL version 5.0.95 and am at an impasse: one of my queries, which displays content ordered by most recent article date, is slow, and because it runs so frequently it kills the site's performance altogether. The query in question is below:
SELECT n.nid,
n.title,
ma.field_article_date_format_value,
ma.field_article_summary_value
FROM node n
INNER JOIN content_type_article ma ON n.nid=ma.nid
INNER JOIN term_node tn ON n.nid=tn.nid
WHERE tn.tid= 153
AND n.status=1
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0, 11;
The EXPLAIN of the query shows the below result:
+----+-------------+-------+--------+--------------------------+---------+---------+----------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------+---------+---------+----------------------+-------+---------------------------------+
| 1 | SIMPLE | tn | ref | PRIMARY,nid | PRIMARY | 4 | const | 19006 | Using temporary; Using filesort |
| 1 | SIMPLE | ma | ref | nid,ix_article_date | nid | 4 | drupal_mm_stg.tn.nid | 1 | |
| 1 | SIMPLE | n | eq_ref | PRIMARY,node_status_type | PRIMARY | 4 | drupal_mm_stg.ma.nid | 1 | Using where |
+----+-------------+-------+--------+--------------------------+---------+---------+----------------------+-------+---------------------------------+
This query seemed relatively simple and straightforward: it retrieves articles which belong to category (term) 153 and have status 1 (published). But apparently Using temporary and Using filesort mean the query is bound to fail, from what I've learnt browsing about it.
Removing field_article_date_format_value from the ORDER BY clause gets rid of Using temporary; Using filesort and reduces the query execution time, but the ordering is required and cannot be traded off; unfortunately, the same holds for site performance.
My hunch is that most of the trouble comes from the term_node table which maps articles to categories and is a many-many relationship table meaning if article X is associated to 5 categories C1....C5 it will have 5 entries in that table, this table is from out-of-the-box drupal.
Dealing with heavy DB content is something new to me. After going through some similar questions (When ordering by date desc, "Using temporary" slows down query; MySQL performance optimization: order by datetime field), I tried to create a composite index on content_type_article that contains the datetime field used in the ORDER BY clause along with another key (nid), and tried to FORCE INDEX:
SELECT n.nid, n.title,
ma.field_article_date_format_value,
ma.field_article_summary_value
FROM node n
INNER JOIN content_type_article ma FORCE INDEX (ix_article_date) ON n.nid=ma.nid
INNER JOIN term_node tn ON n.nid=tn.nid
WHERE tn.tid= 153
AND n.status=1
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0, 11;
Neither the result nor the following EXPLAIN output suggested that this helped much:
+----+-------------+-------+--------+--------------------------+-----------------+---------+----------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------+-----------------+---------+----------------------+-------+---------------------------------+
| 1 | SIMPLE | tn | ref | PRIMARY,nid | PRIMARY | 4 | const | 18748 | Using temporary; Using filesort |
| 1 | SIMPLE | ma | ref | ix_article_date | ix_article_date | 4 | drupal_mm_stg.tn.nid | 1 | |
| 1 | SIMPLE | n | eq_ref | PRIMARY,node_status_type | PRIMARY | 4 | drupal_mm_stg.ma.nid | 1 | Using where |
+----+-------------+-------+--------+--------------------------+-----------------+---------+----------------------+-------+---------------------------------+
The fields n.nid, ca.nid, ma.field_article_date_format_value are all indexed. Querying the DB with Limit 0,11 takes approximately 7-10 seconds with the ORDER BY clause but without it the query barely takes a second. The database engine is MyISAM. Any help on this would be greatly appreciated.
Any answer that could help me get this query to run like a normal one (at the same speed as the query without the sort by date) would be great. My attempts at creating a composite index on nid and field_article_date_format_value and using it in the query did not help. I'm open to providing additional info on the problem and to any new suggestions.
Taking a look at your query and the EXPLAIN, it seems that having n.status=1 in the WHERE clause makes the search very inefficient, because the whole set defined by the joins has to be produced before status = 1 is applied. Try starting the join from the term_node table, which is immediately filtered by the WHERE, and then add the other joins with the status condition applied right in the join. Give it a try and please tell me how it goes.
SELECT n.nid, n.title,
ma.field_article_date_format_value,
ma.field_article_summary_value
FROM term_node tn
INNER JOIN node n ON n.nid=tn.nid AND n.status=1
INNER JOIN content_type_article ma FORCE INDEX (ix_article_date) ON n.nid=ma.nid
WHERE tn.tid= 153
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0, 11;
Using temporary; Using filesort means only that MySQL needs to construct a temporary result table and sort it to get the result you need. This is often a consequence of the ORDER BY ... DESC LIMIT 0,n construct you're using to get the latest postings. In itself it's not a sign of failure. See this: http://www.mysqlperformanceblog.com/2009/03/05/what-does-using-filesort-mean-in-mysql/
Here are some things to try. I am not totally sure they'll work; it's hard to know without having your data to experiment with.
Is there a BTREE index on content_type_article.field_article_date_format_value ? If so, that may help.
Do you HAVE to display the 11 most recent articles? Or can you display the 11 most recent articles that have appeared in the last week or month? If so you could add this line to your WHERE clause. It would filter your stuff by date rather than having to look all the way back to the beginning of time for matching articles. This will be especially helpful if you have a long-established Drupal site.
AND ma.field_article_date_format_value >= (NOW() - INTERVAL 1 MONTH)
First, try to flip the order of the INNER JOIN operations. Second, incorporate the tid=153 into the join criterion. This MAY reduce the size of the temp table you need to sort. All together my suggestions are as follows:
SELECT n.nid,
n.title,
ma.field_article_date_format_value,
ma.field_article_summary_value
FROM node n
INNER JOIN term_node tn ON (n.nid=tn.nid AND tn.tid = 153)
INNER JOIN content_type_article ma ON n.nid=ma.nid
WHERE n.status=1
AND ma.field_article_date_format_value >= (NOW() - INTERVAL 1 MONTH)
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0, 11;
1) Covering indexes
I think the simple answer may be "covering indexes".
Especially on the content_type_article table. The "covering index" has the expression in the ORDER BY as the leading column, and includes all of the columns that are being referenced by the query. Here's the index I created (on my test table):
CREATE INDEX ct_article_ix9
ON content_type_article
(field_article_date_format_value, nid, field_article_summary_value);
And here's an excerpt of the EXPLAIN I get from the query (after I build example tables, using the InnoDB engine, including a covering index on each table):
select_type table type key ref Extra
------ ----- ----- -------------- ----------- ------------------------
SIMPLE ma index ct_article_ix9 NULL Using index
SIMPLE n ref node_ix9 ma.nid Using where; Using index
SIMPLE tn ref term_node_ix9 n.nid,const Using where; Using index
Note that there's no 'Using filesort' shown in the plan, and the plan shows 'Using index' for each table referenced in the query, which basically means that all of the data needed by the query is retrieved from the index pages, with no need to reference any pages from the underlying table. (Your tables have a lot more rows than my test tables, but if you can get an explain plan that looks like this, you may get better performance.)
For completeness, here's the entire EXPLAIN output:
+----+-------------+-------+-------+---------------+----------------+---------+---------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+----------------+---------+---------------------+------+--------------------------+
| 1 | SIMPLE | ma | index | NULL | ct_article_ix9 | 27 | NULL | 1 | Using index |
| 1 | SIMPLE | n | ref | node_ix9 | node_ix9 | 10 | testps.ma.nid,const | 11 | Using where; Using index |
| 1 | SIMPLE | tn | ref | term_node_ix9 | term_node_ix9 | 10 | testps.n.nid,const | 11 | Using where; Using index |
+----+-------------+-------+-------+---------------+----------------+---------+---------------------+------+--------------------------+
3 rows in set (0.00 sec)
I made no changes to your query, except to omit the FORCE INDEX hint. Here are the other two "covering indexes" that I created on the other two tables referenced in the query:
CREATE INDEX node_ix9
ON node (`nid`,`status`,`title`);
CREATE INDEX term_node_ix9
ON term_node (nid,tid);
(Note that if nid is the clustering key on the node table, you may not need the covering index on the node table.)
2) Use correlated subqueries in place of joins?
If the previous idea doesn't improve anything, then, as another alternative, since the original query is returning a maximum of 11 rows, you might try rewriting the query to avoid the join operations, and instead make use of correlated subqueries. Something like the query below.
Note that this query differs significantly from the original query. The difference is that with this query, a row from the content_type_article table will be returned only one time. With the query using the joins, a row from that table could be matched to multiple rows from the node and term_node tables, which would return that same row more than once. This may be viewed as either desirable or undesirable; it really depends on the cardinality, and on whether the resultset meets the specification.
SELECT ( SELECT n2.nid
FROM node n2
WHERE n2.nid = ma.nid
AND n2.status = 1
LIMIT 1
) AS `nid`
, ( SELECT n3.title
FROM node n3
WHERE n3.nid = ma.nid
AND n3.status = 1
LIMIT 1
) AS `title`
, ma.field_article_date_format_value
, ma.field_article_summary_value
FROM content_type_article ma
WHERE EXISTS
( SELECT 1
FROM node n1
WHERE n1.nid = ma.nid
AND n1.status = 1
)
AND EXISTS
( SELECT 1
FROM term_node tn
WHERE tn.nid = ma.nid
AND tn.tid = 153
)
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0,11
(Sometimes, a query using this type of "correlated subquery" can have considerably WORSE performance than an equivalent query that does join operations. But in some cases, a query like this can actually perform better, especially given a very limited number of rows being returned.)
Here's the explain output for that query:
+----+--------------------+-------+-------+---------------+----------------+---------+---------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+---------------+----------------+---------+---------------------+------+--------------------------+
| 1 | PRIMARY | ma | index | NULL | ct_article_ix9 | 27 | NULL | 11 | Using where; Using index |
| 5 | DEPENDENT SUBQUERY | tn | ref | term_node_ix9 | term_node_ix9 | 10 | testps.ma.nid,const | 13 | Using where; Using index |
| 4 | DEPENDENT SUBQUERY | n1 | ref | node_ix9 | node_ix9 | 10 | testps.ma.nid,const | 12 | Using where; Using index |
| 3 | DEPENDENT SUBQUERY | n3 | ref | node_ix9 | node_ix9 | 10 | testps.ma.nid,const | 12 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | n2 | ref | node_ix9 | node_ix9 | 10 | testps.ma.nid,const | 12 | Using where; Using index |
+----+--------------------+-------+-------+---------------+----------------+---------+---------------------+------+--------------------------+
5 rows in set (0.00 sec)
Note that again, each access is 'Using index', which means the query is satisfied directly from index pages, rather than having to visit any data pages in the underlying table.
Example tables
Here are the example tables (along with the indexes) that I built and populated, based on the information from your question:
CREATE TABLE `node` (`id` INT PRIMARY KEY, `nid` INT, `title` VARCHAR(10),`status` INT);
CREATE INDEX node_ix9 ON node (`nid`,`status`,`title`);
INSERT INTO `node` VALUES (1,1,'foo',1),(2,2,'bar',0),(3,3,'fee',1),(4,4,'fi',0),(5,5,'fo',1),(6,6,'fum',0),(7,7,'derp',1);
INSERT INTO `node` SELECT id+7,nid+7,title,`status` FROM node;
INSERT INTO `node` SELECT id+14,nid+14,title,`status` FROM node;
INSERT INTO `node` SELECT id+28,nid+28,title,`status` FROM node;
INSERT INTO `node` SELECT id+56,nid+56,title,`status` FROM node;
CREATE TABLE content_type_article (id INT PRIMARY KEY, nid INT, field_article_date_format_value DATETIME, field_article_summary_value VARCHAR(10));
CREATE INDEX ct_article_ix9 ON content_type_article (field_article_date_format_value, nid, field_article_summary_value);
INSERT INTO content_type_article VALUES (1001,1,'2012-01-01','foo'),(1002,2,'2012-01-02','bar'),(1003,3,'2012-01-03','fee'),(1004,4,'2012-01-04','fi'),(1005,5,'2012-01-05','fo'),(1006,6,'2012-01-06','fum'),(1007,7,'2012-01-07','derp');
INSERT INTO content_type_article SELECT id+7,nid+7, DATE_ADD(field_article_date_format_value,INTERVAL 7 DAY),field_article_summary_value FROM content_type_article;
INSERT INTO content_type_article SELECT id+14,nid+14, DATE_ADD(field_article_date_format_value,INTERVAL 14 DAY),field_article_summary_value FROM content_type_article;
INSERT INTO content_type_article SELECT id+28,nid+28, DATE_ADD(field_article_date_format_value,INTERVAL 28 DAY),field_article_summary_value FROM content_type_article;
INSERT INTO content_type_article SELECT id+56,nid+56, DATE_ADD(field_article_date_format_value,INTERVAL 56 DAY),field_article_summary_value FROM content_type_article;
CREATE TABLE term_node (id INT, tid INT, nid INT);
CREATE INDEX term_node_ix9 ON term_node (nid,tid);
INSERT INTO term_node VALUES (2001,153,1),(2002,153,2),(2003,153,3),(2004,153,4),(2005,153,5),(2006,153,6),(2007,153,7);
INSERT INTO term_node SELECT id+7, tid, nid+7 FROM term_node;
INSERT INTO term_node SELECT id+14, tid, nid+14 FROM term_node;
INSERT INTO term_node SELECT id+28, tid, nid+28 FROM term_node;
INSERT INTO term_node SELECT id+56, tid, nid+56 FROM term_node;
MySQL is "optimizing" your query so that it selects from the term_node table first, even though you are specifying to select from node first. Not knowing the data, I'm not sure which is the optimal way. The term_node table is certainly where your performance issues are, since ~19,000 records are being selected from there.
Limits without ORDER BY are almost always faster because MySQL stops as soon as it finds the specified limit. With an ORDER BY, it first has to find all the records and sort them, then get the specified limit.
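The difference can be shown with a toy Python model. This is not MySQL's executor, just an illustration under the assumption that rows arrive in no useful order; `limit_only` and `order_then_limit` are illustrative names. The first function can stop after n matches, the second has to examine every row before it knows which n are the latest.

```python
import heapq

def limit_only(rows, predicate, n):
    """LIMIT without ORDER BY: stop scanning as soon as n rows match."""
    examined = 0
    out = []
    for row in rows:
        examined += 1
        if predicate(row):
            out.append(row)
            if len(out) == n:
                break
    return out, examined

def order_then_limit(rows, predicate, sort_key, n):
    """ORDER BY ... DESC LIMIT n without a usable index: read every row, then pick n."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if predicate(row):
            matches.append(row)
    return heapq.nlargest(n, matches, key=sort_key), examined

# Made-up data: 1000 rows that all belong to term 153.
rows = [{"nid": i, "tid": 153, "date": i} for i in range(1000)]
in_term = lambda r: r["tid"] == 153

fast, examined_fast = limit_only(rows, in_term, 11)
slow, examined_slow = order_then_limit(rows, in_term, lambda r: r["date"], 11)
print(examined_fast, examined_slow)
```

An index that already delivers the rows in the ORDER BY order turns the second case back into the first, which is what the index suggestions in the other answers are aiming at.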
The simple thing to try is moving your WHERE condition into the JOIN clause, which is where it should be. That filter is specific to the table being joined. This will make sure MySQL doesn't optimize it incorrectly.
INNER JOIN term_node tn ON n.nid=tn.nid AND tn.tid=153
A more complicated thing is to do a SELECT on the term_node table and JOIN on that. That's called a DERIVED TABLE and you will see it defined as such in the EXPLAIN. Since you said it was a many-to-many, I added a DISTINCT parameter to reduce the numbers of records to join on.
SELECT ...
FROM node n
INNER JOIN content_type_article ma FORCE INDEX (ix_article_date) ON n.nid=ma.nid
INNER JOIN (SELECT DISTINCT nid FROM term_node WHERE tid=153) tn ON n.nid=tn.nid
WHERE n.status=1
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0,11
MySQL 5.0 has some limitations with derived tables, so this may not work, although there are workarounds.
You really want to avoid the sort operation happening at all if you can by taking advantage of a pre-sorted index.
To find out if this is possible, imagine your data denormalised into a single table, and ensure that everything that must be included in your WHERE clause is specifiable with a SINGLE VALUE. e.g. if you must use an IN clause on one of the columns, then sorting is inevitable.
Here's a screenshot of some sample data:
So, if you DID have your data denormalised, you could query on tid and status using single values and then sort by date descending. That would mean the following index in that case would work perfectly:
create index ix1 on denormalisedtable(tid, status, date desc);
If you had this, your query would only hit the top 10 rows and would never need to sort.
So - how do you get the same performance WITHOUT denormalising...
I think you should be able to use the STRAIGHT_JOIN clause to force the order that MySQL selects from the tables - you want to get it to select from the table you are SORTING last.
Try this:
SELECT n.nid,
n.title,
ma.field_article_date_format_value,
ma.field_article_summary_value
FROM node n
STRAIGHT_JOIN term_node tn ON n.nid=tn.nid
STRAIGHT_JOIN content_type_article ma ON n.nid=ma.nid
WHERE tn.tid= 153
AND n.status=1
ORDER BY ma.field_article_date_format_value DESC
LIMIT 0, 11;
The idea is to get MySQL to select from the node table and then from the term_node table and THEN FINALLY from the content_type_article table (the table containing the column you are sorting on).
This last join is your most important one and you want it to happen using an index so that the LIMIT clause can work without needing to sort the data.
This single index MIGHT do the trick:
create index ix1 on content_type_article(nid, field_article_date_format_value desc);
or
create index ix1 on content_type_article(nid, field_article_date_format_value desc, field_article_summary_value);
(for a covering index)
I say MIGHT, because I don't know enough about the MySQL optimiser to know if it's clever enough to handle the multiple 'nid' column values that will be getting fed into the content_type_article without having to resort the data.
Logically, it should be able to work quickly - e.g. if 5 nid values are getting fed into the final content_type_article table, then it should be able to get the top 10 of each directly from the index and merge the results together then pick the final top 10, meaning a total of 50 rows read from this table instead of the full 19006 that you're seeing currently.
Let me know how it goes.
If it works for you, further optimisation will be possible using covering indexes on the other tables to speed up the first two joins.

MySQL Query Optimization; SELECT multiple fields vs. JOIN

We've got a relatively straightforward query that does LEFT JOINs across 4 tables. A is the "main" table or the top-most table in the hierarchy. B links to A, C links to B. Furthermore, X links to A. So the hierarchy is basically
A
C => B => A
X => A
The query is essentially:
SELECT
a.*, b.*, c.*, x.*
FROM
a
LEFT JOIN b ON b.a_id = a.id
LEFT JOIN c ON c.b_id = b.id
LEFT JOIN x ON x.a_id = a.id
WHERE
b.flag = true
ORDER BY
x.date DESC
LIMIT 25
Via EXPLAIN, I've confirmed that the correct indexes are in place, and that the built-in MySQL query optimizer is using those indexes correctly and properly.
So here's the strange part...
When we run the query as is, it takes about 1.1 seconds to run.
However, after doing some checking, it seems that if I removed most of the SELECT fields, I get a significant speed boost.
So if instead we made this into a two-step query process:
First query same as above except change the SELECT clause to only SELECT a.id instead of SELECT *
Second query also same as above, except change the WHERE clause to only do an a.id IN against the result of Query 1 instead of what we had before
The result is drastically different. It's .03 seconds for the first query and .02 for the second query.
Doing this two-step query in code essentially gives us a 20x boost in performance.
So here's my question:
Shouldn't this type of optimization already be done within the DB engine? Why does the difference in which fields that are actually SELECTed make a difference on the overall performance of the query?
At the end of the day, it's merely selecting the exact same 25 rows and returning the exact same full contents of those 25 rows. So, why the wide disparity in performance?
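For reference, the two-step approach described above can be sketched as follows. This is a minimal sketch under several assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the tables a, b and x are cut-down hypothetical versions of the real ones (table c is left out), and the data is one-to-one so the joins do not multiply rows. It only shows that both approaches return the same rows, not the timing difference.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER, flag INTEGER);
    CREATE TABLE x (id INTEGER PRIMARY KEY, a_id INTEGER, date TEXT);
""")
# Made-up data: 100 rows per table, every second row flagged, distinct dates.
for i in range(1, 101):
    day = (date(2012, 1, 1) + timedelta(days=i)).isoformat()
    conn.execute("INSERT INTO a VALUES (?, ?)", (i, f"payload {i}"))
    conn.execute("INSERT INTO b VALUES (?, ?, ?)", (i, i, i % 2))
    conn.execute("INSERT INTO x VALUES (?, ?, ?)", (i, i, day))

JOINS = "FROM a LEFT JOIN b ON b.a_id = a.id LEFT JOIN x ON x.a_id = a.id"
FILTER = "WHERE b.flag = 1 ORDER BY x.date DESC LIMIT 25"

# Single-query approach: filter, sort and fetch all columns in one statement.
single = conn.execute(f"SELECT a.*, b.*, x.* {JOINS} {FILTER}").fetchall()

# Step 1: same joins and filter, but select only the ids.
ids = [row[0] for row in conn.execute(f"SELECT a.id {JOINS} {FILTER}")]

# Step 2: fetch the full rows for just those ids, restoring the sort order.
marks = ",".join("?" * len(ids))
two_step = conn.execute(
    f"SELECT a.*, b.*, x.* {JOINS} WHERE a.id IN ({marks}) ORDER BY x.date DESC",
    ids,
).fetchall()

print(len(single), two_step == single)
```

If a row in a can match several rows in b or x, step 1 may return duplicate ids and step 2 may return more than 25 rows, so the equivalence shown here depends on the one-to-one assumption.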
ADDED 2012-08-24 13:02 PM PDT
Thanks eggyal and invertedSpear for the feedback. First off, it's not a caching issue -- I've run tests running both queries multiple times (about 10 times) alternating between each approach. The result averages at 1.1 seconds for the first (single query) approach and .03+.02 seconds for the second (2 query) approach.
In terms of indexes, I thought I had done an EXPLAIN to ensure that we're going thru the keys, and for the most part we are. However, I just did a quick check again and one interesting thing to note:
The slower "single query" approach doesn't show the Extra note of "Using index" for the third line:
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | t1 | index | PRIMARY | shop_group_id_idx | 5 | NULL | 102 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t1.organization_id | 1 | Using where |
| 1 | SIMPLE | t0 | ref | bundle_idx,shop_id_idx | shop_id_idx | 4 | dbmodl_v18.t1.organization_id | 309 | |
| 1 | SIMPLE | t3 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t0.id | 1 | |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
While it does show "Using index" for when we query for just the IDs:
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | t1 | index | PRIMARY | shop_group_id_idx | 5 | NULL | 102 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t1.organization_id | 1 | Using where |
| 1 | SIMPLE | t0 | ref | bundle_idx,shop_id_idx | shop_id_idx | 4 | dbmodl_v18.t1.organization_id | 309 | Using index |
| 1 | SIMPLE | t3 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t0.id | 1 | |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
The strange thing is that both do list the correct index being used... but I guess it raises the questions:
Why are they different (considering all the other clauses are the exact same)? And is this an indication of why it's slower?
Unfortunately, the MySQL docs do not give much information for when the "Extra" column is blank/null in the EXPLAIN results.
More important than speed, you have a flaw in your query logic. When you test a LEFT JOINed column in the WHERE clause (other than testing for NULL), you force that join to behave as if it were an INNER JOIN. Instead, you'd want:
SELECT
a.*, b.*, c.*, x.*
FROM
a
LEFT JOIN b ON b.a_id = a.id
AND b.flag = true
LEFT JOIN c ON c.b_id = b.id
LEFT JOIN x ON x.a_id = a.id
ORDER BY
x.date DESC
LIMIT 25
My next suggestion would be to examine all of those .*'s in your SELECT. Do you really need all the columns from all the tables?
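To illustrate the LEFT JOIN point above, here is a small sketch using an in-memory SQLite database as a stand-in for MySQL (the join semantics shown are standard SQL). The tables a and b and their three sample rows are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER, flag INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    -- a.id = 1 has a flagged b row, a.id = 2 an unflagged one, a.id = 3 none.
    INSERT INTO b VALUES (10, 1, 1), (20, 2, 0);
""")

# Filter in WHERE: rows of a without a flagged b are dropped,
# so the LEFT JOIN behaves like an INNER JOIN.
where_rows = conn.execute("""
    SELECT a.id, b.id FROM a
    LEFT JOIN b ON b.a_id = a.id
    WHERE b.flag = 1
    ORDER BY a.id
""").fetchall()

# Filter in ON: every row of a is kept, with NULLs where no flagged b exists.
on_rows = conn.execute("""
    SELECT a.id, b.id FROM a
    LEFT JOIN b ON b.a_id = a.id AND b.flag = 1
    ORDER BY a.id
""").fetchall()

print(where_rows)  # [(1, 10)]
print(on_rows)     # [(1, 10), (2, None), (3, None)]
```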

MySQL performance using where

A simple query like the one below, properly indexed, on a table populated with roughly 2M rows is taking a lot longer to complete than I was hoping for: 95 rows in set (2.06 sec).
As this is my first experience with tables this size, is this normal behavior?
Query:
SELECT t.id, t.symbol, t.feed, t.time,
FLOOR(UNIX_TIMESTAMP(t.time)/(60*15)) as diff
FROM data as t
WHERE t.symbol = 'XYZ'
AND DATE(t.time) = '2011-06-02'
AND t.feed = '1M'
GROUP BY diff
ORDER BY t.time ASC;
...and Explain:
+----+-------------+-------+------+--------------------+--------+---------+-------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+--------------------+--------+---------+-------+--------+----------------------------------------------+
| 1 | SIMPLE | t | ref | unique,feed,symbol | symbol | 1 | const | 346392 | Using where; Using temporary; Using filesort |
+----+-------------+-------+------+--------------------+--------+---------+-------+--------+----------------------------------------------+
Try this:
...
AND t.time >= '2011-06-02' AND t.time < '2011-06-03'
...
Otherwise, your index(es) are wrong for this query. I'd expect one on (symbol, feed, time, id) or (feed, symbol, time, id) to cover it.
Edit, after comment:
If you put a function or other processing on a column, any index on that column is liable to be ignored. The index is on x, not on f(x), basically.
This change allows the index to be used because we now test a <= x < b on the raw column to ignore the time part, rather than takeofftime(x).
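A rough way to see the difference is to compare query plans for the two predicates. The sketch below uses an in-memory SQLite database as a stand-in, so the plan text is SQLite's and not MySQL's EXPLAIN output; the index name symbol_feed_time and the sample rows are invented, and only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (id INTEGER PRIMARY KEY, symbol TEXT, feed TEXT, time TEXT);
    CREATE INDEX symbol_feed_time ON data (symbol, feed, time);
    INSERT INTO data (symbol, feed, time) VALUES
        ('XYZ', '1M', '2011-06-01 23:59:00'),
        ('XYZ', '1M', '2011-06-02 00:00:00'),
        ('XYZ', '1M', '2011-06-02 15:30:00'),
        ('XYZ', '1M', '2011-06-03 00:00:00'),
        ('ABC', '1M', '2011-06-02 10:00:00');
""")

BASE = "SELECT id, time FROM data WHERE symbol = 'XYZ' AND feed = '1M' AND "
wrapped = BASE + "DATE(time) = '2011-06-02'"                     # function on the column
ranged = BASE + "time >= '2011-06-02' AND time < '2011-06-03'"   # raw column, half-open range

def plan(sql):
    """Return the 'detail' text of SQLite's EXPLAIN QUERY PLAN for sql."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Both predicates select the same rows...
wrapped_rows = sorted(conn.execute(wrapped).fetchall())
ranged_rows = sorted(conn.execute(ranged).fetchall())

# ...but only the range form lets the time column take part in the index search.
print(plan(wrapped))
print(plan(ranged))
```

In the printed plans, the range version should list a time bound among the index search terms, while the DATE(time) version stops at symbol and feed and has to filter the remaining rows one by one.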