Speed up self join in MySQL

I have a table of connections between different objects, and I'm essentially doing a graph traversal using self joins.
My table is defined as:
CREATE TABLE `connections` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`position` int(11) NOT NULL,
`dId` bigint(20) NOT NULL,
`sourceId` bigint(20) NOT NULL,
`targetId` bigint(20) NOT NULL,
`type` bigint(20) NOT NULL,
`weight` float NOT NULL DEFAULT '1',
`refId` bigint(20) NOT NULL,
`ts` bigint(20) NOT NULL,
PRIMARY KEY (`id`),
KEY `sourcetype` (`type`,`sourceId`,`targetId`),
KEY `targettype` (`type`,`targetId`,`sourceId`),
KEY `complete` (`dId`,`sourceId`,`targetId`,`type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The table contains about 3M entries (~1K of type 1, 1M of type 2, and 2M of type 3).
Queries over 2 or 3 hops are actually quite fast (of course it takes a while to receive all the results), but getting the count for a 3-hop query is terribly slow (> 30 s).
Here's the query (which returns 2M):
SELECT
count(*)
FROM
`connections` AS `t0`
JOIN
`connections` AS `t1` ON `t1`.`targetid`=`t0`.`sourceid`
JOIN
`connections` AS `t2` ON `t2`.`targetid`=`t1`.`sourceid`
WHERE
`t2`.dId = 1
AND
`t2`.`sourceid` = 1
AND
`t2`.`type` = 1
AND
`t1`.`type` = 2
AND
`t0`.`type` = 3;
Here's the corresponding EXPLAIN:
+----+-------------+-------+------+--------------------------------+------------+---------+----------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys                  | key        | key_len | ref                        | rows | Extra                    |
+----+-------------+-------+------+--------------------------------+------------+---------+----------------------------+------+--------------------------+
|  1 | SIMPLE      | t2    | ref  | targettype,complete,sourcetype | complete   | 16      | const,const                |  100 | Using where; Using index |
|  1 | SIMPLE      | t1    | ref  | targettype,sourcetype          | targettype | 8       | const                      | 2964 | Using where; Using index |
|  1 | SIMPLE      | t0    | ref  | targettype,sourcetype          | sourcetype | 16      | const,travtest.t1.targetId | 2964 | Using index              |
+----+-------------+-------+------+--------------------------------+------------+---------+----------------------------+------+--------------------------+
Edit: Here is the EXPLAIN after adding an index on type:
+----+-------------+-------+------+-------------------------------------+------------+---------+----------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys                       | key        | key_len | ref                        | rows | Extra                    |
+----+-------------+-------+------+-------------------------------------+------------+---------+----------------------------+------+--------------------------+
|  1 | SIMPLE      | t2    | ref  | type,complete,sourcetype,targettype | complete   | 16      | const,const                |  100 | Using where; Using index |
|  1 | SIMPLE      | t1    | ref  | type,sourcetype,targettype          | sourcetype | 16      | const,travtest.t2.targetId |    2 | Using index              |
|  1 | SIMPLE      | t0    | ref  | type,sourcetype,targettype          | sourcetype | 16      | const,travtest.t1.targetId |    2 | Using index              |
+----+-------------+-------+------+-------------------------------------+------------+---------+----------------------------+------+--------------------------+
Is there a way to improve this?
2nd edit:
EXPLAIN EXTENDED:
+----+-------------+-------+------+-------------------------------------+------------+---------+----------------------------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+-------------------------------------+------------+---------+----------------------------+------+----------+--------------------------+
| 1 | SIMPLE | t2 | ref | type,complete,sourcetype,targettype | complete | 16 | const,const | 100 | 100.00 | Using where; Using index |
| 1 | SIMPLE | t1 | ref | type,sourcetype,targettype | sourcetype | 16 | const,travtest.t2.targetId | 1 | 100.00 | Using index |
| 1 | SIMPLE | t0 | ref | type,sourcetype,targettype | sourcetype | 16 | const,travtest.t1.targetId | 1 | 100.00 | Using index |
+----+-------------+-------+------+-------------------------------------+------------+---------+----------------------------+------+----------+--------------------------+
SHOW WARNINGS;
+-------+------+--------------------------------------------------------------------------------------------+
| Level | Code | Message |
+-------+------+--------------------------------------------------------------------------------------------+
| Note | 1003 | /* select#1 */ select count(0) AS `count(*)` from `travtest`.`connections` `t0` |
| | | join `travtest`.`connections` `t1` join `travtest`.`connections` `t2` |
| | | where ((`travtest`.`t0`.`sourceId` = `travtest`.`t1`.`targetId`) and |
| | | (`travtest`.`t1`.`sourceId` = `travtest`.`t2`.`targetId`) and (`travtest`.`t0`.`type` = 3) |
| | | and (`travtest`.`t1`.`type` = 2) and (`travtest`.`t2`.`type` = 1) and |
| | | (`travtest`.`t2`.`sourceId` = 1) and (`travtest`.`t2`.`dId` = 1)) |
+-------+------+--------------------------------------------------------------------------------------------+

Create an index on the sourceid, targetid and type columns, then try this query:
SELECT
count(*)
FROM
`connections` AS `t0`
JOIN
`connections` AS `t1` ON `t1`.`targetid`=`t0`.`sourceid` and `t1`.`type` = 2
JOIN
`connections` AS `t2` ON `t2`.`targetid`=`t1`.`sourceid` and `t2`.dId = 1 AND `t2`.`sourceid` = 1 AND `t2`.`type` = 1
WHERE
`t0`.`type` = 3;
-------UPDATE-----
I think those indexes are right, and with tables this big you have reached the best optimization you can get. I don't think you can improve this query with other optimizations such as table partitioning/sharding.
You could implement some kind of caching if the data does not change often; otherwise, the only way I see is to scale vertically.
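If the data changes rarely, a minimal caching sketch could look like this (the cache table, its name, and the refresh job are assumptions for illustration, not part of the original answer):
-- Hypothetical cache table holding precomputed 3-hop counts per (dId, sourceId).
CREATE TABLE `hop3_count_cache` (
  `dId` bigint(20) NOT NULL,
  `sourceId` bigint(20) NOT NULL,
  `cnt` bigint(20) NOT NULL,
  PRIMARY KEY (`dId`, `sourceId`)
) ENGINE=InnoDB;

-- Refresh from a scheduled job instead of paying > 30 s per request:
REPLACE INTO hop3_count_cache (dId, sourceId, cnt)
SELECT 1, 1, count(*)
FROM connections AS t0
JOIN connections AS t1 ON t1.targetId = t0.sourceId
JOIN connections AS t2 ON t2.targetId = t1.sourceId
WHERE t2.dId = 1 AND t2.sourceId = 1 AND t2.type = 1
  AND t1.type = 2
  AND t0.type = 3;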

You are probably handling graph data, am I right?
Your 3-hop query leaves little room for optimization. It is a bushy tree: a lot of connections are made. I think the JOIN order and the indexes are right.
EXPLAIN tells me t2 produces about 100 targetIds. If you get rid of t2 from the join and instead add t1.sourceId IN (those 100 targetIds), it will take about the same time as the 3-way self join.
But what about breaking the 100 targets down into 10 sub IN lists? If that reduces response time, run the 10 queries at once from multiple threads.
MySQL has no parallel query feature, so you have to do it yourself; see the sketch below.
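A sketch of that decomposition (the chunked ids below are placeholders; in practice they come from the first query):
-- Step 1: fetch the ~100 targetIds that t2 produces, once.
SELECT targetId
FROM connections
WHERE dId = 1 AND sourceId = 1 AND type = 1;

-- Step 2: count one chunk of ~10 ids per connection/thread,
-- e.g. with placeholder ids 5, 9 and 13 in the first chunk:
SELECT count(*)
FROM connections AS t0
JOIN connections AS t1 ON t1.targetId = t0.sourceId
WHERE t1.type = 2
  AND t0.type = 3
  AND t1.sourceId IN (5, 9, 13);

-- Step 3: add up the 10 partial counts in the application.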
And did you try graph databases like Jena or Sesame? I am not sure a graph database is faster than MySQL, though.

This is not an answer, just FYI.
If MySQL or the other DBs are too slow for you, you can implement your own graph database. The paper Literature Survey of Graph Databases [1] is a nice piece of work: it surveys several graph databases and describes many techniques.
Scalable Semantic Web Data Management Using Vertical Partitioning [2] introduces vertical partitioning, but your 3M edges are not big, so vertical partitioning can't help you. [2] also introduces another concept, materialized path expressions, which I think can help you; see the sketch after the references.
[1] http://www.systap.com/pubs/graph_databases.pdf
[2] http://db.csail.mit.edu/projects/cstore/abadirdf.pdf
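A rough sketch of the materialized path expressions idea, i.e. precomputing the 2-hop part of the traversal so the 3-hop count becomes a single join (the paths_2hop table is an assumption for illustration, not taken verbatim from [2]):
-- Materialize the type-2 -> type-3 hops once.
CREATE TABLE paths_2hop AS
SELECT t1.sourceId AS sourceId, t0.targetId AS targetId
FROM connections t0
JOIN connections t1 ON t1.targetId = t0.sourceId
WHERE t0.type = 3 AND t1.type = 2;

ALTER TABLE paths_2hop ADD KEY (sourceId);

-- The 3-hop count is then one join against the precomputed paths.
-- Note: path tables can grow large and must be refreshed when connections change.
SELECT count(*)
FROM paths_2hop p
JOIN connections t2 ON p.sourceId = t2.targetId
WHERE t2.dId = 1 AND t2.sourceId = 1 AND t2.type = 1;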

You comment that your query returns 2M (assuming you mean 2 million). Is that the final count, or just the number of rows it works through? Your query appears to be looking specifically for a single t2 dId, sourceId and type joined to the other tables, but starting with connection t0.
I would drop your existing indexes and keep just the following one, so the engine does not try to use the others and get confused about how to join. Also, by having the target id in the index (as you already had), it is a covering index, and the engine does not have to go to the actual raw data pages to confirm any other criteria or pull values from them.
The only index is based on your ultimate criteria, as in t2 by the source, type and dId. Since the targetId (part of the index) is the source of the next link in the daisy chain, you use the same index for source and type going up the chain. No confusion over which index to use:
INDEX ON (sourceId, type, dId, targetId)
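Expressed as DDL against the posted schema (the new index name chain is made up):
ALTER TABLE connections
  DROP KEY sourcetype,
  DROP KEY targettype,
  DROP KEY complete,
  ADD KEY chain (sourceId, type, dId, targetId);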
I would try reversing it, starting from what hopefully is the smallest set and working out... Something like:
SELECT
COUNT(*)
FROM
`connections` t2
JOIN `connections` t1
ON t2.targetID = t1.sourceid
AND t1.`type` = 2
JOIN `connections` t0
ON t1.targetid = t0.sourceid
AND t0.`type` = 3
where
t2.sourceid = 1
AND t2.type = 1
AND t2.dID = 1

Related

Optimize join query with SUM, GROUP BY and ORDER BY clauses

I have the following database schema
keywords(id, keyword, lang): about 8M records
topics(id, topic, lang): about 2.6M records
topic_keywords(topic_id, keyword_id, weight): about 200M records
In a script, I have about 50-100 keywords, each with an additional field keyword_score, and I want to retrieve the top 20 topics that correspond to those keywords based on the following formula: SUM(keyword_score * topic_weight)
The solution I currently implement in my script is:
I create a temporary table temporary_keywords(keyword_id, keyword_score)
I insert all 50-100 keywords into it with their keyword_score
Then I execute the following query to retrieve the topics
SELECT topic_id, SUM(weight * keyword_score) AS score
FROM temporary_keywords
JOIN topic_keywords USING keyword_id
GROUP BY topic_id
ORDER BY score DESC
LIMIT 20
This solution works, but in some cases it takes up to 3 seconds to execute, which is too much for me.
I'm asking if there is a way to optimize this query, or should I redesign the data structure into a NoSQL database?
Any other solutions or ideas beyond what is listed above are most appreciated
UPDATE (SHOW CREATE TABLE)
CREATE TABLE `topic_keywords` (
`topic_id` int(11) NOT NULL,
`keyword_id` int(11) NOT NULL,
`weight` float DEFAULT '0',
PRIMARY KEY (`topic_id`,`keyword_id`),
KEY `keyword_id_idx` (`keyword_id`,`topic_id`,`weight`)
)
CREATE TEMPORARY TABLE temporary_keywords
( keyword_id INT PRIMARY KEY NOT NULL,
keyword_score DOUBLE
)
EXPLAIN QUERY
+----+-------------+--------------------+------+----------------------+----------------------+---------+--------------------------------------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+------+----------------------+----------------------+---------+--------------------------------------+----------+---------------------------------+
| 1 | SIMPLE | temporary_keywords | ALL | PRIMARY | NULL | NULL | NULL | 100 | Using temporary; Using filesort |
| 1 | SIMPLE | topic_keywords | ref | keyword_id_idx | keyword_id_idx | 4 | topics.temporary_keywords.keyword_id | 10778853 | Using index |
+----+-------------+--------------------+------+----------------------+----------------------+---------+--------------------------------------+----------+---------------------------------+
Incorrect, but uncaught, syntax.
JOIN topic_keywords USING keyword_id
-->
JOIN topic_keywords USING(keyword_id)
If that does not fix it, please provide EXPLAIN FORMAT=JSON SELECT ...
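For reference, a minimal end-to-end sketch with the corrected syntax (the keyword ids and scores are placeholders):
-- Load the 50-100 keywords and their scores (placeholder values).
CREATE TEMPORARY TABLE temporary_keywords
( keyword_id INT PRIMARY KEY NOT NULL,
  keyword_score DOUBLE
);
INSERT INTO temporary_keywords (keyword_id, keyword_score)
VALUES (101, 0.9), (205, 0.4), (317, 0.7);

-- Corrected join: USING requires parentheses around the column list.
SELECT topic_id, SUM(weight * keyword_score) AS score
FROM temporary_keywords
JOIN topic_keywords USING (keyword_id)
GROUP BY topic_id
ORDER BY score DESC
LIMIT 20;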

What is the fastest way to join several tables matching specific column values in MySQL

I have 3 tables that look like this:
CREATE TABLE big_table_1 (
id INT(11),
col1 TINYINT(1),
col2 TINYINT(1),
col3 TINYINT(1),
PRIMARY KEY (`id`)
)
And so on for big_table_2 and big_table_3. The col1, col2, col3 values are either 0, 1 or null.
I'm looking for ids whose col1 value equals 1 in each table. I join the tables as follows, using the simplest method I can think of:
SELECT t1.id
FROM big_table_1 AS t1
INNER JOIN big_table_2 AS t2 ON t2.id = t1.id
INNER JOIN big_table_3 AS t3 ON t3.id = t1.id
WHERE t1.col1 = 1
AND t2.col1 = 1
AND t3.col1 = 1;
With 10 million rows per table, the query takes about 40 seconds to execute on my machine:
407231 rows in set (37.19 sec)
Explain results:
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
| 1 | SIMPLE | t3 | ALL | PRIMARY | NULL | NULL | NULL | 10999387 | Using where |
| 1 | SIMPLE | t1 | eq_ref | PRIMARY | PRIMARY | 4 | testDB.t3.id | 1 | Using where |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | testDB.t3.id | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
If I declare an index on col1, the result is slightly slower:
407231 rows in set (40.84 sec)
I have also tried the following query:
SELECT t1.id
FROM (SELECT distinct ta1.id FROM big_table_1 ta1 WHERE ta1.col1=1) as t1
WHERE EXISTS (SELECT ta2.id FROM big_table_2 ta2 WHERE ta2.col1=1 AND ta2.id = t1.id)
AND EXISTS (SELECT ta3.id FROM big_table_3 ta3 WHERE ta3.col1=1 AND ta3.id = t1.id);
But it's slower:
407231 rows in set (44.01 sec) [with index on col1]
407231 rows in set (1 min 36.52 sec) [without index on col1]
Is the aforementioned simple method basically the fastest way to do this in MySQL? Would it be necessary to shard the table onto multiple servers in order to get the result faster?
Addendum: EXPLAIN results for Andrew's code as requested (I trimmed the tables down to 1 million rows only, and the index is on id and col1):
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 332814 | |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 333237 | Using where; Using join buffer |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 333505 | Using where; Using join buffer |
| 4 | DERIVED | big_table_3 | index | NULL | PRIMARY | 5 | NULL | 1000932 | Using where; Using index |
| 3 | DERIVED | big_table_2 | index | NULL | PRIMARY | 5 | NULL | 1000507 | Using where; Using index |
| 2 | DERIVED | big_table_1 | index | NULL | PRIMARY | 5 | NULL | 1000932 | Using where; Using index |
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
INNER JOIN (same as JOIN) lets the optimizer pick whether to use the table to its left or the table to its right. The simplified SELECT you presented could start with any of the three tables.
The optimizer likes to start with the table with the WHERE clause. Your simplified example implies that each table is equally good IF there is an INDEX starting with col1. (See retraction below.)
The second and subsequent tables need a different rule for indexing. In your simplified example, col1 is used for filtering and id is used for JOINing. INDEX(col1, id) and INDEX(id, col1) work equally well for getting to the second table.
I keep saying "your simplified example" because as soon as you change anything, most of the advice in these answers is up for grabs.
(The retraction) When you have a column with "low cardinality" such as your col%, with only 0, 1 and NULL as possibilities, INDEX(col1) is essentially useless, since it is faster to blindly scan the table than to use the index.
On the other hand, INDEX(col1, ...) may be useful, as mentioned for the second table.
However neither is useful for the first table. If you have such an INDEX, it will be ignored.
Then comes "covering". Again, your example is unrealistically simplistic because there are essentially no fields touched other than id and col1. A "covering" index includes all the fields of a table that are touched in the query. A covering index is virtually always smaller than the data, so it takes less effort to run through a covering index, hence faster.
(Retract the retraction) INDEX(col1, id), in that order, is a useful covering index for the first table.
Imagine how my discussion had gone if you had not mentioned that col1 had only 3 values. Quite different.
And we have not gotten to ORDER BY, IN(...), BETWEEN...AND..., engine differences, tricks with the PRIMARY KEY, LEFT JOIN, etc.
More insight into building indexes from Selects.
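Concretely, the indexing advice above amounts to something like this (the index names are made up):
-- First table: covering index with the filter column first.
ALTER TABLE big_table_1 ADD INDEX idx_col1_id (col1, id);

-- Second and third tables: either column order works for the join;
-- (id, col1) is shown here.
ALTER TABLE big_table_2 ADD INDEX idx_id_col1 (id, col1);
ALTER TABLE big_table_3 ADD INDEX idx_id_col1 (id, col1);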
ANALYZE TABLE should not be necessary.
For kicks, try it with a covering index (a composite of id, col1).
So just one index: make it the primary composite key, with no other indexes.
Then run ANALYZE TABLE xxx (3 times total, once per table).
Then fire the query off, hoping the MySQL CBO isn't too dense to figure it out.
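A sketch of that experiment, for one of the three tables (primary key columns must be NOT NULL, so the NULLs in col1 would first have to be mapped to something, e.g. 0):
UPDATE big_table_1 SET col1 = 0 WHERE col1 IS NULL;
ALTER TABLE big_table_1
  MODIFY col1 TINYINT(1) NOT NULL,
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, col1);
ANALYZE TABLE big_table_1;
-- ...repeat for big_table_2 and big_table_3.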
A second idea is to see the results without a WHERE clause: convert it all into JOIN ... ON clauses.
Have you tried this:
SELECT t1.id
FROM
(SELECT id from big_table_1 where col1 = 1) AS t1
INNER JOIN (SELECT id from big_table_2 where col1 = 1) AS t2 ON t2.id = t1.id
INNER JOIN (SELECT id from big_table_3 where col1 = 1) AS t3 ON t3.id = t1.id

Why is a COUNT() query from a large table much faster than SUM()

I have a data warehouse with the following tables:
main
about 8 million records
CREATE TABLE `main` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`cid` mediumint(8) unsigned DEFAULT NULL, -- This is the customer id
`iid` mediumint(8) unsigned DEFAULT NULL, -- This is the item id
`pid` tinyint(3) unsigned DEFAULT NULL, -- This is the period id
`qty` double DEFAULT NULL,
`sales` double DEFAULT NULL,
`gm` double DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=7978349 DEFAULT CHARSET=latin1
period
This table has about 50 records and has the following fields
id
month
year
customer
This has about 23,000 records and the following fields
id
number //This field is unique
name //This is simply a description field
The following query runs very fast (less than 1 second) and returns a count of about 2,000:
select count(*)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
But this query, which is the same as the previous one but sums instead of counting, is much slower (more than 45 seconds):
select sum(sales)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
When I explain each query, the ONLY difference I see is that on the 'count()'
query the 'Extra' field says 'Using index', while for the 'sum()' query this field is NULL.
Explain count() query
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | c | const | PRIMARY,idx_customer | idx_customer | 11 | const | 1 | Using index |
| 1 | SIMPLE | p | ref | PRIMARY,idx_period | idx_period | 4 | const | 6 | Using index |
| 1 | SIMPLE | m | ref | idx_pci,idx_pic | idx_pci | 6 | mydb.p.id,const | 7 | Using index |
Explain sum() query
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | c | const | PRIMARY,idx_customer | idx_customer | 11 | const | 1 | Using index |
| 1 | SIMPLE | p | ref | PRIMARY,idx_period | idx_period | 4 | const | 6 | Using index |
| 1 | SIMPLE | m | ref | idx_pci,idx_pic | idx_pci | 6 | mydb.p.id,const | 7 | NULL |
Why is the count() so much faster than sum()? Shouldn't it be using the index for both?
What can I do to make the sum() go faster?
Thanks in advance!
EDIT
All the tables show that it is using Engine InnoDB
Also, as a side note, if I just do a 'SELECT *' query, it runs very quickly (less than 2 seconds). I would expect SUM() to take no longer than that, since SELECT * has to retrieve the rows anyway...
SOLVED
This is what I've learned:
Since the sales field is not part of the index, it has to retrieve the records from the hard drive (which can be kind of slow).
I'm not too familiar with this, but it looks like I/O performance can be increased by switching to an SSD (solid-state drive). I'll have to research this more.
For now, I think I'm going to create another layer of summary in order to get the performance I'm looking for.
I redefined my index on the main table to be (pid,cid,iid,sales,gm,qty) and now the sum() queries are running VERY fast!
Thanks everybody!
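In DDL terms, the redefinition described in the SOLVED note might look like this (assuming idx_pci is the key being redefined):
-- Covering index: pid/cid lookups plus the sales, gm and qty values
-- can now be read straight from the index without touching the row data.
ALTER TABLE main
  DROP KEY idx_pci,
  ADD KEY idx_pci (pid, cid, iid, sales, gm, qty);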
An index is, in essence, a list of the key values.
When you do the count() query, the actual data in the database can be ignored and just the index used.
When you do the sum(sales) query, each row has to be read from disk to get the sales figure, hence it is much slower.
Additionally, indexes can be read in bulk and then processed in memory, while the disk access will randomly thrash the drive, trying to read rows from all over the disk.
Finally, the index itself may have summaries of the counts (to help with the plan generation)
Update
You actually have three indexes on your table:
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
So you only have indexes on the columns id, pid, cid, iid. (As an aside, most databases are smart enough to combine indexes, so you could probably optimize your indexes somewhat)
If you added another key like KEY idx_sales(id, sales), that could improve performance; but given the likely numeric distribution of sales values, you would be adding extra cost on updates, which is likely a bad thing.
The simple answer is that count() is only counting rows. This can be satisfied by the index.
The sum() needs to identify each row and then fetch the page in order to get the sales column. This adds a lot of overhead -- about one page load per row.
If you add sales into the index, then it should also go very fast, because it will not have to fetch the original data.

MySQL does not use primary bigint index

I am fighting some performance problems on a very simple table which seems to be very slow when fetching data by its primary key (bigint).
I have this table with 124 million entries:
CREATE TABLE `nodes` (
`id` bigint(20) NOT NULL,
`lat` float(13,7) NOT NULL,
`lon` float(13,7) NOT NULL,
PRIMARY KEY (`id`),
KEY `lat_index` (`lat`),
KEY `lon_index` (`lon`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
and a simple query which takes some ids from another table using an IN clause to fetch data from the nodes table, but it takes about 1 hour just to fetch a few rows.
EXPLAIN shows me it's not using the primary key as index; it's simply scanning the whole table. Why is that? id and the id column from the other table are both of type bigint(20).
mysql> EXPLAIN SELECT lat, lon FROM nodes WHERE id IN (SELECT node_id FROM ways_elements WHERE way_id = '4962890');
+----+--------------------+-------------------+------+---------------+--------+---------+-------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------------+------+---------------+--------+---------+-------+-----------+-------------+
| 1 | PRIMARY | nodes | ALL | NULL | NULL | NULL | NULL | 124035228 | Using where |
| 2 | DEPENDENT SUBQUERY | ways_elements | ref | way_id | way_id | 8 | const | 2 | Using where |
+----+--------------------+-------------------+------+---------------+--------+---------+-------+-----------+-------------+
The query SELECT node_id FROM ways_elements WHERE way_id = '4962890' simply returns two node ids, so the whole query should only return two rows, but it takes more or less 1 hour.
Using FORCE INDEX (PRIMARY) didn't help. And even apart from that: why does MySQL not take that index, since it's a primary key? EXPLAIN doesn't even mention anything in the possible_keys column, though select_type shows PRIMARY.
Am I doing something wrong?
How does this perform?
SELECT lat, lon FROM nodes t1 join ways_elements t2 on (t1.id=t2.node_id) WHERE t2.way_id = '4962890'
I suspect that your query is checking each row in nodes against each item in the IN clause.
This is what is called a correlated subquery (see the reference manual, or this popular question posted on Stack Overflow). A better query to use is:
SELECT lat,
lon
FROM nodes n
JOIN ways_elements w ON n.id = w.node_id
WHERE way_id = '4962890'

MySQL performance with GROUP BY and JOIN

After spending a lot of time with variants to this question I'm wondering if someone can help me optimize this query or indexes.
I have three temp tables ref1, ref2, ref3 all defined as below, with ref1 and ref2 each having about 6000 rows and ref3 only 3 rows:
CREATE TEMPORARY TABLE ref1 (
id INT NOT NULL AUTO_INCREMENT,
val INT,
PRIMARY KEY (id)
)
ENGINE = MEMORY;
The slow query is against a table like so, with about 1M rows:
CREATE TABLE t1 (
d DATETIME NOT NULL,
id1 INT NOT NULL,
id2 INT NOT NULL,
id3 INT NOT NULL,
x INT NULL,
PRIMARY KEY (id1, d, id2, id3)
)
ENGINE = INNODB;
The query in question:
SELECT id1, SUM(x)
FROM t1
INNER JOIN ref1 ON ref1.id = t1.id1
INNER JOIN ref2 ON ref2.id = t1.id2
INNER JOIN ref3 ON ref3.id = t1.id3
WHERE d BETWEEN '2011-03-01' AND '2011-04-01'
GROUP BY id1;
The temp tables are used to filter the result set down to just the items a user is looking for.
EXPLAIN
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+---------------------------------+
| 1 | SIMPLE | ref1 | ALL | PRIMARY | NULL | NULL | NULL | 6000 | Using temporary; Using filesort |
| 1 | SIMPLE | t1 | ref | PRIMARY | PRIMARY | 4 | med31new.ref1.id | 38 | Using where |
| 1 | SIMPLE | ref3 | ALL | PRIMARY | NULL | NULL | NULL | 3 | Using where; Using join buffer |
| 1 | SIMPLE | ref2 | eq_ref | PRIMARY | PRIMARY | 4 | med31new.t1.id2 | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+---------------------------------+
(on a different system with ~5M rows, EXPLAIN shows t1 first in the list, with "Using where; Using index; Using temporary; Using filesort")
Is there something obvious I'm missing that would prevent the temporary table from being used?
First, filesort does not mean a file is written to disk to perform the sort; it's the name of MySQL's sorting algorithm. Check what-does-using-filesort-mean-in-mysql.
So the problematic keyword in your EXPLAIN is Using temporary, not Using filesort. For that you can play with tmp_table_size and max_heap_table_size (put the same value in both) to allow more in-memory work and avoid temporary table creation; see the sketch below, and check this link on the subject with its remarks about documentation mistakes.
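For example (the 256M figure is an arbitrary illustration, not a recommendation):
-- Allow larger in-memory temporary tables; keep both settings equal.
SET SESSION tmp_table_size      = 256 * 1024 * 1024;
SET SESSION max_heap_table_size = 256 * 1024 * 1024;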
Then you could try different index policies and see the results, but do not try to avoid the filesort.
One last thing, not related: you compute SUM(x), but x can take NULL values. SUM(COALESCE(x, 0)) may be better if you do not want NULL values in a group to make your sum NULL.
Add an index on JUST the date column. Since that is the criterion on the first table, and the others are just joins, it will be optimized against the date first... the joins are secondary.
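As DDL (the index name is made up):
-- Index the date column alone so the range predicate drives the plan.
ALTER TABLE t1 ADD INDEX idx_d (d);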
Isn't this:
SELECT id1, SUM(x)
FROM t1
INNER JOIN ref1 ON ref1.id = t1.id1
INNER JOIN ref2 ON ref2.id = t1.id2
INNER JOIN ref3 ON ref3.id = t1.id3
WHERE d BETWEEN '2011-03-01' AND '2011-04-01'
GROUP BY id1;
exactly equivalent to:
select id1, SUM(x)
FROM t1
WHERE d BETWEEN '2011-03-01' AND '2011-04-01'
group by id1;
What are the extra tables being used for? I think the temp table mentioned in another answer refers to MySQL creating a temporary table during query execution. If you're hoping that a sub-query (or derived table) will minimize the number of operations required in the join, that might speed up the query, but I don't see joined data being selected.