After spending a lot of time with variants to this question I'm wondering if someone can help me optimize this query or indexes.
I have three temp tables ref1, ref2, ref3 all defined as below, with ref1 and ref2 each having about 6000 rows and ref3 only 3 rows:
CREATE TEMPORARY TABLE ref1 (
id INT NOT NULL AUTO_INCREMENT,
val INT,
PRIMARY KEY (id)
)
ENGINE = MEMORY;
The slow query is against a table like so, with about 1M rows:
CREATE TABLE t1 (
d DATETIME NOT NULL,
id1 INT NOT NULL,
id2 INT NOT NULL,
id3 INT NOT NULL,
x INT NULL,
PRIMARY KEY (id1, d, id2, id3)
)
ENGINE = INNODB;
The query in question:
SELECT id1, SUM(x)
FROM t1
INNER JOIN ref1 ON ref1.id = t1.id1
INNER JOIN ref2 ON ref2.id = t1.id2
INNER JOIN ref3 ON ref3.id = t1.id3
WHERE d BETWEEN '2011-03-01' AND '2011-04-01'
GROUP BY id1;
The temp tables are used to filter the result set down to just the items a user is looking for.
EXPLAIN
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+---------------------------------+
| 1 | SIMPLE | ref1 | ALL | PRIMARY | NULL | NULL | NULL | 6000 | Using temporary; Using filesort |
| 1 | SIMPLE | t1 | ref | PRIMARY | PRIMARY | 4 | med31new.ref1.id | 38 | Using where |
| 1 | SIMPLE | ref3 | ALL | PRIMARY | NULL | NULL | NULL | 3 | Using where; Using join buffer |
| 1 | SIMPLE | ref2 | eq_ref | PRIMARY | PRIMARY | 4 | med31new.t1.id2 | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+---------------------------------+
(on a different system with ~5M rows EXPLAIN show t1 first in the list, with "Using where; Using index; Using temporary; Using filesort")
Is there something obvious I'm missing that would prevent the temporary table from being used?
First, filesort does not mean a file is written to disk to perform the sort; it is simply the name of MySQL's sort algorithm (a quicksort). Check what-does-using-filesort-mean-in-mysql.
So the problematic keyword in your EXPLAIN is Using temporary, not Using filesort. For that you can play with tmp_table_size and max_heap_table_size (put the same value on both) to allow more in-memory work and avoid on-disk temporary table creation; check this link on the subject, with remarks about documentation mistakes.
Then you could try different index policies and compare the results, but do not try to avoid the filesort.
Last thing, not related: you take SUM(x), but x can hold NULL values. SUM() skips NULLs, yet it returns NULL for a group in which every x is NULL, so SUM(COALESCE(x, 0)) is better if you do not want any group's sum to come back as NULL.
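A quick sanity check of those NULL semantics (a minimal sketch in SQLite, whose SUM() behaves the same way as MySQL's here; table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id1 INTEGER, x INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)",
                 [(1, 10), (1, None), (2, None)])

# SUM() skips NULLs within a group, but yields NULL for a group where
# every x is NULL; COALESCE turns that NULL into 0.
rows = conn.execute(
    "SELECT id1, SUM(x), SUM(COALESCE(x, 0)) "
    "FROM t1 GROUP BY id1 ORDER BY id1"
).fetchall()
print(rows)  # [(1, 10, 10), (2, None, 0)]
```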
Add an index on JUST the date column. Since that is the range criterion on the first table, and the others are just joins, it will be optimized against the date first... the joins are secondary.
Isn't this:
SELECT id1, SUM(x)
FROM t1
INNER JOIN ref1 ON ref1.id = t1.id1
INNER JOIN ref2 ON ref2.id = t1.id2
INNER JOIN ref3 ON ref3.id = t1.id3
WHERE d BETWEEN '2011-03-01' AND '2011-04-01'
GROUP BY id1;
exactly equivalent to:
select id1, SUM(x)
FROM t1
WHERE d BETWEEN '2011-03-01' AND '2011-04-01'
group by id1;
What are the extra tables being used for? I think the temp table mentioned in another answer refers to MySQL creating a temporary table during query execution. If you're hoping to create a sub-query (or table) that will minimize the number of operations required in a join, that might speed up the query, but I don't see any joined data being selected.
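For what it's worth, the join form and the plain GROUP BY are only equivalent when every t1.id1/id2/id3 value appears in the corresponding ref table; otherwise the INNER JOINs filter rows out. A toy SQLite sketch (made-up data) showing the difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id1 INTEGER, x INTEGER);
CREATE TABLE ref1 (id INTEGER PRIMARY KEY);
INSERT INTO t1 VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO ref1 VALUES (1), (3);  -- the user's filter: ids 1 and 3 only
""")

# With the join, the row for id1 = 2 is dropped before aggregation.
with_join = conn.execute(
    "SELECT SUM(x) FROM t1 INNER JOIN ref1 ON ref1.id = t1.id1"
).fetchone()[0]
without_join = conn.execute("SELECT SUM(x) FROM t1").fetchone()[0]
print(with_join, without_join)  # 40 60
```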
Related
Can anyone suggest a good index to make this query run quicker?
SELECT
s.*,
sl.session_id AS session_id,
sl.lesson_id AS lesson_id
FROM
cdu_sessions s
INNER JOIN cdu_sessions_lessons sl ON sl.session_id = s.id
WHERE
(s.sort = '1') AND
(s.enabled = '1') AND
(s.teacher_id IN ('193', '1', '168', '1797', '7622', '19951'))
Explain:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | SIMPLE | s | NULL | ALL | PRIMARY | NULL | NULL | NULL | 2993 | 0.50 | Using where
1 | SIMPLE | sl | NULL | ref | session_id,ix2 | ix2 | 4 | ealteach_main.s.id | 5 | 100.00 | Using index
cdu_sessions looks like this:
------------------------------------------------
id | int(11)
name | varchar(255)
map_location | enum('classroom', 'school'...)
sort | tinyint(1)
sort_titles | tinyint(1)
friend_gender | enum('boy', 'girl'...)
friend_name | varchar(255)
friend_description | varchar(2048)
friend_description_format | varchar(128)
friend_description_audio | varchar(255)
friend_description_audio_fid | int(11)
enabled | tinyint(1)
created | int(11)
teacher_id | int(11)
custom | int(1)
------------------------------------------------
cdu_sessions_lessons contains 3 fields - id, session_id and lesson_id
Thanks!
Without looking at the query plan, row counts and the data distribution in each table, it is hard to predict a good index to make it run faster.
But, I would say that this might help:
create index sessions_teacher_idx on cdu_sessions(teacher_id); -- should it be a Primary Key?..
looking at where condition you could use a composite index for table cdu_sessions
create index idx1 on cdu_sessions(teacher_id, sort, enabled);
and looking to join and select for table cdu_sessions_lessons
create index idx2 on cdu_sessions_lessons(session_id, lesson_id);
First, write the query so no type conversions are necessary. All the comparisons in the where clause are to numbers, so use numeric constants:
SELECT s.*,
sl.session_id, -- unnecessary because s.id is in the result set
sl.lesson_id
FROM cdu_sessions s INNER JOIN
cdu_sessions_lessons sl
ON sl.session_id = s.id
WHERE s.sort = 1 AND
s.enabled = 1 AND
s.teacher_id IN (193, 1, 168, 1797, 7622, 19951);
Although it might not be happening in this specific case, mixing types can impede the use of indexes.
I removed the column aliases (AS session_id, for instance). These were redundant because each alias matched the column name, so the query wasn't changing anything.
For this query, first look at the WHERE clause. All the column references are from one table. These should go in the index, with the equality comparisons first:
create index idx_cdu_sessions_4 on cdu_sessions(sort, enabled, teacher_id, id)
I added id because it is also used in the JOIN.
Formally, id is not needed in the index if it is the primary key. However, I like to be explicit if I want it there.
Next you want an index for the second table. Only two columns are referenced from there, so they can both go in the index. The first column should be the one used in the join:
create index idx_cdu_sessions_lessons_2 on cdu_sessions_lessons(session_id, lesson_id);
I have two tables, both having a column called key, and I would like to do something like SELECT key FROM table1 MINUS SELECT key FROM table2 in MySQL (5.6.19). (table1 contains about 1.5 million rows, table2 about 100,000.) The things I tried are:
SELECT `key` FROM table1 WHERE `key` NOT IN (SELECT `key` FROM table2);
SELECT a.`key` FROM table1 a LEFT JOIN table2 b USING (`key`) WHERE b.`key` IS NULL;
SELECT a.`key` FROM table1 a LEFT JOIN table2 b ON a.`key` = b.`key` WHERE b.`key` IS NULL;
But all of them are unbelievably inefficient! (After waiting for a result from the first query I stopped it and started all the queries in parallel during the night. The first one took an insane 12 hours 51 min, the second and third 7 hours 32 min and 7 hours 53 min, respectively.)
How can this be done efficiently? Is it a problem only in MySQL, or in all SQL implementations in general? (In case it matters: key is of type char(32); table1 contains many more columns, also a lot of strings; table2 has, apart from key, only some integer columns.) Thank you very much for any hint in advance!
@Steve Rukuts: EXPLAIN SELECT a.`key` FROM table1 a LEFT JOIN table2 b ON a.`key` = b.`key` WHERE b.`key` IS NULL; gives:
+----+-------------+-------+-------+---------------+------+---------+------+---------+-----------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+---------+-----------------------------------------------------------------+
| 1 | SIMPLE | a | index | NULL | Key | 97 | NULL | 1372811 | Using index |
| 1 | SIMPLE | b | index | NULL | Key | 33 | NULL | 101580 | Using where; Using index; Using join buffer (Block Nested Loop) |
+----+-------------+-------+-------+---------------+------+---------+------+---------+-----------------------------------------------------------------+
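For reference, the three formulations (plus a NOT EXISTS variant, which is often worth trying in MySQL 5.6) are logically interchangeable, apart from one edge case: NOT IN returns no rows at all if table2's key column contains a NULL. A toy SQLite sketch (made-up data) demonstrating the equivalence:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 ("key" TEXT);
CREATE TABLE table2 ("key" TEXT);
CREATE INDEX idx_t2_key ON table2("key");  -- lets the anti-join probe an index
INSERT INTO table1 VALUES ('a'), ('b'), ('c');
INSERT INTO table2 VALUES ('b');
""")

q_not_in = conn.execute(
    'SELECT "key" FROM table1 '
    'WHERE "key" NOT IN (SELECT "key" FROM table2) ORDER BY "key"'
).fetchall()
q_left_join = conn.execute(
    'SELECT a."key" FROM table1 a LEFT JOIN table2 b ON a."key" = b."key" '
    'WHERE b."key" IS NULL ORDER BY a."key"'
).fetchall()
q_not_exists = conn.execute(
    'SELECT "key" FROM table1 t WHERE NOT EXISTS '
    '(SELECT 1 FROM table2 b WHERE b."key" = t."key") ORDER BY "key"'
).fetchall()
print(q_not_in == q_left_join == q_not_exists)  # True
```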
I have 3 tables that look like this:
CREATE TABLE big_table_1 (
id INT(11),
col1 TINYINT(1),
col2 TINYINT(1),
col3 TINYINT(1),
PRIMARY KEY (`id`)
)
And so on for big_table_2 and big_table_3. The col1, col2, col3 values are either 0, 1 or null.
I'm looking for id's whose col1 value equals 1 in each table. I join them as follows, using the simplest method I can think of:
SELECT t1.id
FROM big_table_1 AS t1
INNER JOIN big_table_2 AS t2 ON t2.id = t1.id
INNER JOIN big_table_3 AS t3 ON t3.id = t1.id
WHERE t1.col1 = 1
AND t2.col1 = 1
AND t3.col1 = 1;
With 10 million rows per table, the query takes about 40 seconds to execute on my machine:
407231 rows in set (37.19 sec)
Explain results:
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
| 1 | SIMPLE | t3 | ALL | PRIMARY | NULL | NULL | NULL | 10999387 | Using where |
| 1 | SIMPLE | t1 | eq_ref | PRIMARY | PRIMARY | 4 | testDB.t3.id | 1 | Using where |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | testDB.t3.id | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
If I declare an index on col1, the result is slightly slower:
407231 rows in set (40.84 sec)
I have also tried the following query:
SELECT t1.id
FROM (SELECT distinct ta1.id FROM big_table_1 ta1 WHERE ta1.col1=1) as t1
WHERE EXISTS (SELECT ta2.id FROM big_table_2 ta2 WHERE ta2.col1=1 AND ta2.id = t1.id)
AND EXISTS (SELECT ta3.id FROM big_table_3 ta3 WHERE ta3.col1=1 AND ta3.id = t1.id);
But it's slower:
407231 rows in set (44.01 sec) [with index on col1]
407231 rows in set (1 min 36.52 sec) [without index on col1]
Is the aforementioned simple method basically the fastest way to do this in MySQL? Would it be necessary to shard the table onto multiple servers in order to get the result faster?
Addendum: EXPLAIN results for Andrew's code as requested (I trimmed the tables down to 1 million rows only, and the index is on id and col1):
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 332814 | |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 333237 | Using where; Using join buffer |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 333505 | Using where; Using join buffer |
| 4 | DERIVED | big_table_3 | index | NULL | PRIMARY | 5 | NULL | 1000932 | Using where; Using index |
| 3 | DERIVED | big_table_2 | index | NULL | PRIMARY | 5 | NULL | 1000507 | Using where; Using index |
| 2 | DERIVED | big_table_1 | index | NULL | PRIMARY | 5 | NULL | 1000932 | Using where; Using index |
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
INNER JOIN (same as JOIN) lets the optimizer pick whether to use the table to its left or the table to its right. The simplified SELECT you presented could start with any of the three tables.
The optimizer likes to start with the table with the WHERE clause. Your simplified example implies that each table is equally good IF there is an INDEX starting with col1. (See retraction below.)
The second and subsequent tables need a different rule for indexing. In your simplified example, col1 is used for filtering and id is used for JOINing. INDEX(col1, id) and INDEX(id, col1) work equally well for getting to the second table.
I keep saying "your simplified example" because as soon as you change anything, most of the advice in these answers is up for grabs.
(The retraction) When you have a column with "low cardinality", such as your col1/col2/col3 with only 0, 1 and NULL as possible values, INDEX(col1) is essentially useless, since it is faster to blindly scan the table than to use the index.
On the other hand, INDEX(col1, ...) may be useful, as mentioned for the second table.
However neither is useful for the first table. If you have such an INDEX, it will be ignored.
Then comes "covering". Again, your example is unrealistically simplistic because there are essentially no fields touched other than id and col1. A "covering" index includes all the fields of a table that are touched in the query. A covering index is virtually always smaller than the data, so it takes less effort to run through a covering index, hence faster.
(Retract the retraction) INDEX(col1, id), in that order is a useful covering index for the first table.
Imagine how my discussion had gone if you had not mentioned that col1 had only 3 values. Quite different.
And we have not gotten to ORDER BY, IN(...), BETWEEN...AND..., engine differences, tricks with the PRIMARY KEY, LEFT JOIN, etc.
More insight into building indexes from Selects.
ANALYZE TABLE should not be necessary.
For kicks, try it with a covering index (a composite of id, col1).
So: a single index; make it the composite primary key. No other indexes.
Then run ANALYZE TABLE xxx (3 times total, once per table).
Then fire it off, hoping the MySQL cost-based optimizer isn't too dense to figure it out.
Second idea is to see the results without a WHERE clause: convert it all into the JOIN ... ON clauses.
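That second idea, sketched with toy SQLite data (for INNER JOINs, moving the filters from WHERE into the ON clauses returns identical rows; it only matters semantically for OUTER joins):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE big_table_1 (id INTEGER PRIMARY KEY, col1 INTEGER);
CREATE TABLE big_table_2 (id INTEGER PRIMARY KEY, col1 INTEGER);
CREATE TABLE big_table_3 (id INTEGER PRIMARY KEY, col1 INTEGER);
INSERT INTO big_table_1 VALUES (1, 1), (2, 0), (3, 1);
INSERT INTO big_table_2 VALUES (1, 1), (2, 1), (3, NULL);
INSERT INTO big_table_3 VALUES (1, 1), (2, 1), (3, 1);
""")

where_form = conn.execute("""
    SELECT t1.id FROM big_table_1 t1
    JOIN big_table_2 t2 ON t2.id = t1.id
    JOIN big_table_3 t3 ON t3.id = t1.id
    WHERE t1.col1 = 1 AND t2.col1 = 1 AND t3.col1 = 1
""").fetchall()

on_form = conn.execute("""
    SELECT t1.id FROM big_table_1 t1
    JOIN big_table_2 t2 ON t2.id = t1.id AND t2.col1 = 1
    JOIN big_table_3 t3 ON t3.id = t1.id AND t3.col1 = 1
    WHERE t1.col1 = 1
""").fetchall()
print(where_form == on_form)  # True
```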
Have you tried this:
SELECT t1.id
FROM
(SELECT id from big_table_1 where col1 = 1) AS t1
INNER JOIN (SELECT id from big_table_2 where col1 = 1) AS t2 ON t2.id = t1.id
INNER JOIN (SELECT id from big_table_3 where col1 = 1) AS t3 ON t3.id = t1.id
I have a table of connections between different objects and I'm basically trying a graph traversal using self joins.
My table is defined as:
CREATE TABLE `connections` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`position` int(11) NOT NULL,
`dId` bigint(20) NOT NULL,
`sourceId` bigint(20) NOT NULL,
`targetId` bigint(20) NOT NULL,
`type` bigint(20) NOT NULL,
`weight` float NOT NULL DEFAULT '1',
`refId` bigint(20) NOT NULL,
`ts` bigint(20) NOT NULL,
PRIMARY KEY (`id`),
KEY `sourcetype` (`type`,`sourceId`,`targetId`),
KEY `targettype` (`type`,`targetId`,`sourceId`),
KEY `complete` (`dId`,`sourceId`,`targetId`,`type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The table contains about 3M entries (~1K of type 1, 1M of type 2, and 2M of type 3).
The queries over 2 or 3 hops are actually quite fast (of course it takes a while to receive all the results), but getting the count of a query for 3 hops is terribly slow (> 30s).
Here's the query (which returns 2M):
SELECT
count(*)
FROM
`connections` AS `t0`
JOIN
`connections` AS `t1` ON `t1`.`targetid`=`t0`.`sourceid`
JOIN
`connections` AS `t2` ON `t2`.`targetid`=`t1`.`sourceid`
WHERE
`t2`.dId = 1
AND
`t2`.`sourceid` = 1
AND
`t2`.`type` = 1
AND
`t1`.`type` = 2
AND
`t0`.`type` = 3;
Here's the corresponding EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t2 ref targettype,complete,sourcetype complete 16 const,const 100 Using where; Using index
1 SIMPLE t1 ref targettype,sourcetype targettype 8 const 2964 Using where; Using index
1 SIMPLE t0 ref targettype,sourcetype sourcetype 16 const,travtest.t1.targetId 2964 Using index
Edit: Here is the EXPLAIN after adding and index to type:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t2 ref type,complete,sourcetype,targettype complete 16 const,const 100 Using where; Using index
1 SIMPLE t1 ref type,sourcetype,targettype sourcetype 16 const,travtest.t2.targetId 2 Using index
1 SIMPLE t0 ref type,sourcetype,targettype sourcetype 16 const,travtest.t1.targetId 2 Using index
Is there a way to improve this?
2nd edit:
EXPLAIN EXTENDED:
+----+-------------+-------+------+-------------------------------------+------------+---------+----------------------------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+-------------------------------------+------------+---------+----------------------------+------+----------+--------------------------+
| 1 | SIMPLE | t2 | ref | type,complete,sourcetype,targettype | complete | 16 | const,const | 100 | 100.00 | Using where; Using index |
| 1 | SIMPLE | t1 | ref | type,sourcetype,targettype | sourcetype | 16 | const,travtest.t2.targetId | 1 | 100.00 | Using index |
| 1 | SIMPLE | t0 | ref | type,sourcetype,targettype | sourcetype | 16 | const,travtest.t1.targetId | 1 | 100.00 | Using index |
+----+-------------+-------+------+-------------------------------------+------------+---------+----------------------------+------+----------+--------------------------+
SHOW WARNINGS;
+-------+------+--------------------------------------------------------------------------------------------+
| Level | Code | Message |
+-------+------+--------------------------------------------------------------------------------------------+
| Note | 1003 | /* select#1 */ select count(0) AS `count(*)` from `travtest`.`connections` `t0` |
| | | join `travtest`.`connections` `t1` join `travtest`.`connections` `t2` |
| | | where ((`travtest`.`t0`.`sourceId` = `travtest`.`t1`.`targetId`) and |
| | | (`travtest`.`t1`.`sourceId` = `travtest`.`t2`.`targetId`) and (`travtest`.`t0`.`type` = 3) |
| | | and (`travtest`.`t1`.`type` = 2) and (`travtest`.`t2`.`type` = 1) and |
| | | (`travtest`.`t2`.`sourceId` = 1) and (`travtest`.`t2`.`dId` = 1)) |
+-------+------+--------------------------------------------------------------------------------------------+
Create an index on the sourceid, targetid and type columns, then try this query:
SELECT
count(*)
FROM
`connections` AS `t0`
JOIN
`connections` AS `t1` ON `t1`.`targetid`=`t0`.`sourceid` and `t1`.`type` = 2
JOIN
`connections` AS `t2` ON `t2`.`targetid`=`t1`.`sourceid` and `t2`.dId = 1 AND `t2`.`sourceid` = 1 AND `t2`.`type` = 1
WHERE
`t0`.`type` = 3;
-------UPDATE-----
I think that those indices are right, and with tables this big you have reached the best optimization you can get. I don't think you can improve this query with further optimizations such as table partitioning/sharding.
You could implement some kind of caching if the data does not change often; otherwise the only way I see is to scale vertically.
Probably you are handling graph data, am I right?
Your 3-hop query has little chance of further optimization. It is a bushy tree; a lot of connections are made. I think the JOIN order and the indexes are right.
EXPLAIN tells me t2 produces about 100 targetId values. If you get rid of t2 from the join and add t1.sourceId IN (those 100 targetIds), it will take about the same time as the 3-way self join.
But what about breaking the 100 targets down into 10 smaller IN lists? If that reduces the response time, run the 10 queries at once with multiple threads.
MySQL has no parallel query feature, so you have to do it yourself.
And have you tried graph databases like Jena or Sesame? I am not sure a graph database would be faster than MySQL.
This is not the answer, just FYI.
If MySQL or the other DBs are slow for you, you can implement your own graph database. The paper Literature Survey of Graph Databases [1] is a nice piece of work on graph databases: it surveys several of them and covers many techniques.
Scalable Semantic Web Data Management Using Vertical Partitioning [2] introduces vertical partitioning, but your 3M edges are not big, so vertical partitioning can't help you. [2] also introduces another concept, materialized path expressions, which I think can help you.
[1] http://www.systap.com/pubs/graph_databases.pdf
[2] http://db.csail.mit.edu/projects/cstore/abadirdf.pdf
You comment that your query returns 2M (assuming you mean 2 million). Is that the final count, or just the number of rows it works through? Your query appears to be specifically looking for a single t2 dId, sourceId and type joined to the other tables, but starting with connection t0.
I would drop your existing indexes and keep just the following, so the engine does not try to use the others and get confused about how to join. Also, by having targetId in the index (as you already had), it becomes a covering index, and the engine does not have to go to the actual row data on the page to confirm any other criteria or pull values from it.
The only index is based on your ultimate criteria, as in t2 by the source, type and dId. Since targetId (part of the index) is the source of the next hop in the daisy chain, you are using the same index for the source and type lookups going up the chain. No confusion over which index to use:
ALTER TABLE connections ADD INDEX (sourceId, type, dId, targetId);
I would try reversing the order, starting from what is hopefully the smallest set and working out... something like:
SELECT
COUNT(*)
FROM
`connections` t2
JOIN `connections` t1
ON t2.targetID = t1.sourceid
AND t1.`type` = 2
JOIN `connections` t0
ON t1.targetid = t0.sourceid
AND t0.`type` = 3
where
t2.sourceid = 1
AND t2.type = 1
AND t2.dID = 1
I have the following mysql query which is taking about 3 minutes to run. It does have 2 sub queries, but the tables have very few rows. When doing an explain, it looks like the "using temporary" might be the culprit. Apparently, it looks like the database is creating a temporary table for all three queries as noted in the "using temporary" designation below.
What confuses me is that the MySQL documentation says Using temporary is generally caused by GROUP BY and ORDER BY, neither of which I'm using. Do the subqueries cause an implicit GROUP BY or ORDER BY? Are the subqueries making a temporary table necessary regardless of GROUP BY or ORDER BY? Any recommendations on how to restructure this query so MySQL can handle it more efficiently? Any other tuning ideas for the MySQL settings?
mysql> explain
SELECT DISTINCT COMPANY_ID, COMPANY_NAME
FROM COMPANY
WHERE ID IN (SELECT DISTINCT ID FROM CAMPAIGN WHERE CAMPAIGN_ID IN (SELECT
DISTINCT CAMPAIGN_ID FROM AD
WHERE ID=10 AND (AD_STATUS='R' OR AD_STATUS='T'))
AND (STATUS_CODE='L' OR STATUS_CODE='A' OR STATUS_CODE='C'));
+----+--------------------+----------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------+------+---------------+------+---------+------+------+------------------------------+
| 1 | PRIMARY | COMPANY | ALL | NULL | NULL | NULL | NULL | 1207 | Using where; Using temporary |
| 2 | DEPENDENT SUBQUERY | CAMPAIGN | ALL | NULL | NULL | NULL | NULL | 880 | Using where; Using temporary |
| 3 | DEPENDENT SUBQUERY | AD | ALL | NULL | NULL | NULL | NULL | 264 | Using where; Using temporary |
+----+--------------------+----------+------+---------------+------+---------+------+------+------------------------------+
thanks!
Phil
I don't know the structure of your schema, but I would try the following:
CREATE INDEX i_company_id ON company(id); -- should it be a Primary Key?..
CREATE INDEX i_campaign_id ON campaign(id); -- same, PK here?
CREATE INDEX i_ad_id ON ad(id); -- the same question applies
ANALYZE TABLE company, campaign, ad;
And your query can be simplified like this:
SELECT DISTINCT c.company_id, c.company_name
FROM company c
JOIN campaign cg ON c.id = cg.id
JOIN ad ON cg.campaign_id = ad.campaign_id
WHERE ad.id = 10
AND ad.ad_status IN ('R', 'T')
  AND cg.status_code IN ('L', 'A', 'C');
The DISTINCT clauses in the subqueries are slowing things down significantly for you; the final one is sufficient.
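As a sanity check that the flattened JOIN form returns the same rows as the nested IN version, here is a toy SQLite run (the schema and data are made up to match the column names in the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company  (id INTEGER, company_id TEXT, company_name TEXT);
CREATE TABLE campaign (id INTEGER, campaign_id INTEGER, status_code TEXT);
CREATE TABLE ad       (id INTEGER, campaign_id INTEGER, ad_status TEXT);
INSERT INTO company  VALUES (1, 'C1', 'Acme'), (2, 'C2', 'Globex');
INSERT INTO campaign VALUES (1, 100, 'L'), (2, 200, 'X');
INSERT INTO ad       VALUES (10, 100, 'R'), (11, 200, 'R');
""")

nested = conn.execute("""
    SELECT DISTINCT company_id, company_name FROM company
    WHERE id IN (SELECT DISTINCT id FROM campaign
                 WHERE campaign_id IN (SELECT DISTINCT campaign_id FROM ad
                                       WHERE id = 10 AND ad_status IN ('R','T'))
                 AND status_code IN ('L','A','C'))
""").fetchall()

joined = conn.execute("""
    SELECT DISTINCT c.company_id, c.company_name
    FROM company c
    JOIN campaign cg ON c.id = cg.id
    JOIN ad ON cg.campaign_id = ad.campaign_id
    WHERE ad.id = 10 AND ad.ad_status IN ('R','T')
      AND cg.status_code IN ('L','A','C')
""").fetchall()
print(nested == joined)  # True
```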