The "Explaining MySQL EXPLAIN" chapter in the O'Reilly book Optimizing SQL Statements has this question at the end.
The following is an example of a business need that retrieves orphaned parent records in a parent/child relationship. This SQL query can be written in three different ways; while all three produce the same results, the QEP shows three different paths.
mysql> EXPLAIN SELECT p.*
-> FROM parent p
-> WHERE p.id NOT IN (SELECT c.parent_id FROM child c)\G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: p
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 160
Extra: Using where
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: c
type: index_subquery
possible_keys: parent_id
key: parent_id
key_len: 4
ref: func
rows: 1
Extra: Using index
2 rows in set (0.00 sec)
mysql> EXPLAIN SELECT p.*
-> FROM parent p
-> LEFT JOIN child c ON p.id = c.parent_id
-> WHERE c.child_id IS NULL\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: p
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 160
Extra:
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: c
type: ref
possible_keys: parent_id
key: parent_id
key_len: 4
ref: test.p.id
rows: 1
Extra: Using where; Using index; Not exists
2 rows in set (0.00 sec)
mysql> EXPLAIN SELECT p.*
-> FROM parent p
-> WHERE NOT EXISTS
-> (SELECT parent_id FROM child c WHERE c.parent_id = p.id)\G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: p
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 160
Extra: Using where
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: c
type: ref
possible_keys: parent_id
key: parent_id
key_len: 4
ref: test.p.id
rows: 1
Extra: Using index
2 rows in set (0.00 sec)
Which is best? Will data growth over time cause a different QEP to perform better?
There is no answer in the book, nor on the internet as far as I could research.
There is an old article from 2009 which I've seen linked on Stack Overflow many times. The test there shows that the NOT EXISTS query is 27% (actually 26%) slower than the other two queries (LEFT JOIN and NOT IN).
However, the optimizer has been improved from version to version, and a perfect optimizer would create the same execution plan for all three queries. But as long as the optimizer is not perfect, the answer to "Which query is faster?" can depend on the actual setup (which includes version, settings, and data).
I've run similar tests in the past, and all I remember is that LEFT JOIN has never been significantly slower than any other method. But out of curiosity I've just created a new test on MariaDB 10.3.13 (portable Windows version, default settings).
Dummy data:
set @parents = 1000;
drop table if exists parent;
create table parent(
parent_id mediumint unsigned primary key
);
insert into parent(parent_id)
select seq
from seq_1_to_1000000
where seq <= @parents
;
drop table if exists child;
create table child(
child_id mediumint unsigned primary key,
parent_id mediumint unsigned not null,
index (parent_id)
);
insert into child(child_id, parent_id)
select seq as child_id
, floor(rand(1)*@parents)+1 as parent_id
from seq_1_to_1000000
;
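A portability note: seq_1_to_1000000 comes from MariaDB's SEQUENCE storage engine and does not exist in plain MySQL. A minimal sketch of the same parent load using a recursive CTE instead (MySQL 8.0+; my own substitution, not part of the original test):

-- assumption: no SEQUENCE engine available, so generate the numbers
-- with a recursive CTE (MySQL 8.0+)
-- set session cte_max_recursion_depth = 1000000; -- needed for large @parents
set @parents = 1000;
insert into parent(parent_id)
with recursive seq as (
select 1 as n
union all
select n + 1 from seq where n < @parents
)
select n from seq;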
NOT IN:
set @start = TIME(SYSDATE(6));
select count(*) into @cnt
from parent p
where p.parent_id not in (select parent_id from child c);
select @cnt, TIMEDIFF(TIME(SYSDATE(6)), @start);
LEFT JOIN:
set @start = TIME(SYSDATE(6));
select count(*) into @cnt
from parent p
left join child c on c.parent_id = p.parent_id
where c.parent_id is null;
select @cnt, TIMEDIFF(TIME(SYSDATE(6)), @start);
NOT EXISTS:
set @start = TIME(SYSDATE(6));
select count(*) into @cnt
from parent p
where not exists (
select *
from child c
where c.parent_id = p.parent_id
);
select @cnt, TIMEDIFF(TIME(SYSDATE(6)), @start);
Execution times in milliseconds:
@parents | 1000 | 10000 | 100000 | 1000000
-----------|------|-------|--------|--------
NOT IN | 21 | 38 | 175 | 4459
LEFT JOIN | 24 | 40 | 183 | 1508
NOT EXISTS | 26 | 44 | 180 | 4463
I've executed the queries multiple times and kept the lowest time value. SYSDATE() is probably not the best way to measure execution time, so don't take these numbers as exact. However, we can see that up to 100K parent rows there is not much difference, and the NOT IN method is a bit faster. But with 1M parent rows the LEFT JOIN is three times faster.
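If you want more trustworthy numbers, the session profiler is one alternative to SYSDATE() arithmetic (deprecated since MySQL 5.6.7, but still available in both MySQL and MariaDB); a sketch:

-- enable per-statement profiling for this session
set profiling = 1;
select count(*)
from parent p
where not exists (select 1 from child c where c.parent_id = p.parent_id);
-- lists each statement with its duration in seconds
show profiles;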
Conclusion
So what is the answer? I could just say "LEFT JOIN wins". But the truth is that this test proves nothing, and the answer is (as so often): "It depends." When performance matters, the best you can do is to run your own tests with real queries against real data. If you don't have real data (yet), create dummy data with the amount and distribution you expect to have in the future.
It depends on what version of MySQL you are using. In older versions, IN ( SELECT ...) performed terribly. In the latest version, it is often as good as the other variants. Also, MariaDB has some optimization differences, probably in this area.
EXISTS( SELECT 1 ... ) is perhaps the clearest in stating the intent, and it has perhaps always been fast (ever since it was introduced).
NOT IN and NOT EXISTS are a different animal.
Some things in your question that may have an impact: func and index_subquery. In similar queries you may not see these, and that difference may lead to performance differences.
Or, to repeat myself:
"There have been a number of improvements in the Optimizer since 2009.
"To the Author (Quassnoi): Please rerun your tests, and specify which version they are being run against. Note also that MySQL and MariaDB may yield different results.
"To the Reader: Test the variants yourself, do not blindly trust the conclusions in this blog."
Related
I have a MySQL query, something like this:
SELECT
Main.Code,
Nt,
Ss,
Nac,
Price,
Ei,
Quant,
Dateadded,
Sh,
Crit,
CAST(Ss * Quant AS DECIMAL(10, 2)) AS Qss,
CAST(Price * Quant AS DECIMAL(10, 2)) AS Qprice,
`Extra0`.`Value`
FROM
Main
LEFT OUTER JOIN
`Extra_fields` AS `Extra0` ON `Extra0`.`Code` = `Main`.`Code`
AND `Extra0`.`Nf` = 2
ORDER BY `Code`
The query is very slow (about 10 sec.). The query without this part:
LEFT OUTER JOIN Extra_fields AS Extra0 ON Extra0.Code = Main.Code AND Extra0.Nf=2
is fast.
Is there some way to optimize the first query?
You want to add an index on the joined table to help look up values by Code and Nf, and include the Value column so the index can also satisfy the column you need in the select-list:
ALTER TABLE Extra_fields ADD KEY (Code, Nf, Value);
You may also benefit from adding an index on Main.Code, so the table is read in sorted order without having to do a filesort:
ALTER TABLE Main ADD KEY (Code);
I ran EXPLAIN on your query and got this:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Main
partitions: NULL
type: index
possible_keys: NULL
key: Code
key_len: 5
ref: NULL
rows: 1
filtered: 100.00
Extra: NULL
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: Extra0
partitions: NULL
type: ref
possible_keys: code
key: code
key_len: 10
ref: test.Main.Code,const
rows: 1
filtered: 100.00
Extra: Using index
The first table shows no filesort. I had to use ...FROM Main FORCE INDEX(Code)..., but that could be because I tested with no rows in the table.
The second table shows it is using an index-only access method ("Extra: Using index"). I assume only three columns from Extra_fields are referenced, and all other columns are from Main.
I have a table sample with two columns, id and cnt, and another table PostTags with two columns, postid and tagid.
I want to update all cnt values with their corresponding counts and I have written the following query:
UPDATE sample SET
cnt = (SELECT COUNT(tagid)
FROM PostTags
WHERE sample.postid = PostTags.postid
GROUP BY PostTags.postid)
I intend to update the entire column at once, and I seem to accomplish this. But performance-wise, is this the best way, or is there a better way?
EDIT
I've been running this query (without the GROUP BY) for over an hour for ~18M records. I'm looking for a query that performs better.
That query should not take an hour. I just did a test, running a query like yours on a table of 87520 keywords and matching rows in a many-to-many table of 2776445 movie_keyword rows. In my test, it took 32 seconds.
The crucial part that you're probably missing is that you must have an index on the lookup column, which is PostTags.postid in your example.
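If that index is missing, adding it is a one-line change. A sketch, in the same style as the other ALTER TABLE statements on this page (the column pair is my suggestion; a compound (postid, tagid) index would also make the COUNT(tagid) subquery index-only):

ALTER TABLE PostTags ADD KEY (postid, tagid);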
Here's the EXPLAIN from my test (finally we can do EXPLAIN on UPDATE statements in MySQL 5.6):
mysql> explain update kc1 set count =
(select count(*) from movie_keyword
where kc1.keyword_id = movie_keyword.keyword_id) \G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: kc1
type: index
possible_keys: NULL
key: PRIMARY
key_len: 4
ref: NULL
rows: 98867
Extra: Using temporary
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: movie_keyword
type: ref
possible_keys: k_m
key: k_m
key_len: 4
ref: imdb.kc1.keyword_id
rows: 17
Extra: Using index
Having an index on keyword_id is important. In my case, I had a compound index, but a single-column index would help too.
CREATE TABLE `movie_keyword` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`movie_id` int(11) NOT NULL,
`keyword_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `k_m` (`keyword_id`,`movie_id`)
);
The difference between COUNT(*) and COUNT(movie_id) should be immaterial, assuming movie_id is NOT NULLable. But I use COUNT(*) because it'll still count as an index-only query if my index is defined only on the keyword_id column.
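For completeness, the single-column variant mentioned above would look like this (a sketch; this index is not part of the original schema):

ALTER TABLE movie_keyword ADD KEY (keyword_id);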
Remove the unnecessary GROUP BY and the statement looks good. If, however, you expect many sample.cnt values to already contain the correct count, then you would update many records that need no update. This may create some overhead (larger rollback segments, triggers executed, etc.) and thus take longer.
In order to only update the records that need be updated, join:
UPDATE sample
INNER JOIN
(
SELECT postid, COUNT(tagid) as cnt
FROM PostTags
GROUP BY postid
) tags ON tags.postid = sample.postid
SET sample.cnt = tags.cnt
WHERE sample.cnt != tags.cnt OR sample.cnt IS NULL;
Here is the SQL fiddle: http://sqlfiddle.com/#!2/d5e88.
I have a query that is giving me problems and I can't understand why MySQL's query optimizer is behaving the way it is. Here is the background info:
I have 3 tables. Two are relatively small and one is large.
Table 1 (very small, 727 rows):
CREATE TABLE ipa (
ipa_id int(11) NOT NULL AUTO_INCREMENT,
ipa_code int(11) DEFAULT NULL,
ipa_name varchar(100) DEFAULT NULL,
payorcode varchar(2) DEFAULT NULL,
compid int(11) DEFAULT '2',
PRIMARY KEY (ipa_id),
KEY ipa_code (ipa_code) )
ENGINE=MyISAM
Table 2 (smallish, 59455 rows):
CREATE TABLE assign_ipa (
assignid int(11) NOT NULL AUTO_INCREMENT,
ipa_id int(11) NOT NULL,
userid int(11) NOT NULL,
username varchar(20) DEFAULT NULL,
compid int(11) DEFAULT NULL,
PayorCode char(10) DEFAULT NULL,
PRIMARY KEY (assignid),
UNIQUE KEY assignid (assignid,ipa_id),
KEY ipa_id (ipa_id)
) ENGINE=MyISAM
Table 3 (large, 24,711,730 rows):
CREATE TABLE master_final (
IPA int(11) DEFAULT NULL,
MbrCt smallint(6) DEFAULT '0',
PayorCode varchar(4) DEFAULT 'WC',
KEY idx_IPA (IPA)
) ENGINE=MyISAM
Now for the query. I'm doing a 3-way join, using the two smaller tables essentially to subset the big table on one of its indexed values. Basically, I get a list of IDs for a user, SJones, and query the big table for those IDs.
mysql> explain
SELECT master_final.PayorCode, sum(master_final.Mbrct) AS MbrCt
FROM master_final
INNER JOIN ipa ON ipa.ipa_code = master_final.IPA
INNER JOIN assign_ipa ON ipa.ipa_id = assign_ipa.ipa_id
WHERE assign_ipa.username = 'SJones'
GROUP BY master_final.PayorCode, master_final.ipa\G;
************* 1. row *************
id: 1
select_type: SIMPLE
table: master_final
type: ALL
possible_keys: idx_IPA
key: NULL
key_len: NULL
ref: NULL
rows: 24711730
Extra: Using temporary; Using filesort
************* 2. row *************
id: 1
select_type: SIMPLE
table: ipa
type: ref
possible_keys: PRIMARY,ipa_code
key: ipa_code
key_len: 5
ref: wc_test.master_final.IPA
rows: 1
Extra: Using where
************* 3. row *************
id: 1
select_type: SIMPLE
table: assign_ipa
type: ref
possible_keys: ipa_id
key: ipa_id
key_len: 4
ref: wc_test.ipa.ipa_id
rows: 37
Extra: Using where
3 rows in set (0.00 sec)
This query takes forever (like 30 minutes!). The EXPLAIN output tells me why: it's doing a full table scan on the big table even though there is a perfectly good index, and it's not using it. I don't understand this. I can look at the query and see that it only needs to fetch a couple of IDs from the big table. If I can do it, why can't MySQL's optimizer?
To illustrate, here are the IDs associated with 'SJones':
mysql> select username, ipa_id from assign_ipa where username='SJones';
+----------+--------+
| username | ipa_id |
+----------+--------+
| SJones | 688 |
| SJones | 689 |
+----------+--------+
2 rows in set (0.02 sec)
Now, I can rewrite the query, substituting the ipa_id values for the username in the WHERE clause. To me this is equivalent to the original query; MySQL sees it differently. With this change, the optimizer makes use of the index on the big table.
mysql> explain
SELECT master_final.PayorCode, sum(master_final.Mbrct) AS MbrCt
FROM master_final
INNER JOIN ipa ON ipa.ipa_code = master_final.IPA
INNER JOIN assign_ipa ON ipa.ipa_id = assign_ipa.ipa_id
WHERE assign_ipa.ipa_id in ('688','689')
GROUP BY master_final.PayorCode, master_final.ipa\G;
************* 1. row *************
id: 1
select_type: SIMPLE
table: ipa
type: range
possible_keys: PRIMARY,ipa_code
key: PRIMARY
key_len: 4
ref: NULL
rows: 2
Extra: Using where; Using temporary; Using filesort
************* 2. row *************
id: 1
select_type: SIMPLE
table: assign_ipa
type: ref
possible_keys: ipa_id
key: ipa_id
key_len: 4
ref: wc_test.ipa.ipa_id
rows: 37
Extra: Using where
************* 3. row *************
id: 1
select_type: SIMPLE
table: master_final
type: ref
possible_keys: idx_IPA
key: idx_IPA
key_len: 5
ref: wc_test.ipa.ipa_code
rows: 34953
Extra: Using where
3 rows in set (0.00 sec)
The only thing I've changed is the WHERE clause, which doesn't even directly touch the big table. And yet the optimizer uses the index idx_IPA on the big table, and the full table scan is gone. Rewritten like this, the query is very fast.
OK, that's a lot of background. Now my question: why does the WHERE clause matter to the optimizer? Either WHERE clause returns the same result set from the smaller table, yet I get dramatically different performance depending on which one I use. Obviously, I want to use the WHERE clause containing the username rather than passing all the associated IDs to the query. As written, though, this doesn't seem possible?
Can someone explain why this is happening?
How might I rewrite my query to avoid the full table scan?
Thanks for sticking with me. I know its a very longish question.
Not quite sure if I'm right, but I think the following is happening here. This:
WHERE assign_ipa.username = 'SJones'
may create a temporary table, since it requires a full table scan. Temporary tables have no indexes, and they tend to slow things down a lot.
The second case
INNER JOIN ipa ON ipa.ipa_code = master_final.IPA
INNER JOIN assign_ipa ON ipa.ipa_id = assign_ipa.ipa_id
WHERE assign_ipa.ipa_id in ('688','689')
on the other hand allows for joins on indexed columns, which is fast. Additionally, it can be transformed to
SELECT .... FROM master_final WHERE IPA IN (688, 689) ...
and I think MySQL is doing that, too.
Creating an index on assign_ipa.username may help.
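For example, a sketch in the same ALTER TABLE style used elsewhere on this page:

ALTER TABLE assign_ipa ADD KEY (username);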
Edit
I rethought the problem and now have a different explanation.
The reason, of course, is the missing index. This means that MySQL has no clue how large the result of filtering assign_ipa would be (MySQL does not store such counts), so it starts with the joins first, where it can rely on keys.
That's what rows 2 and 3 of the EXPLAIN output tell us.
And after that, it tries to filter the result by assign_ipa.username, which has no key, as stated in row 1.
As soon as there is an index, it filters assign_ipa first and joins afterwards, using the corresponding indexes.
This is probably not a direct answer to your question, but here are a few things that you can do:
Run ANALYZE TABLE ... it will update the table statistics, which have a great impact on what the optimizer decides to do.
If you still think the joins are not in the order you wish them to be (which happens in your case, and thus the optimizer is not using indexes as you expect it to), you can use STRAIGHT_JOIN. From the documentation: "STRAIGHT_JOIN forces the optimizer to join the tables in the order in which they are listed in the FROM clause. You can use this to speed up a query if the optimizer joins the tables in nonoptimal order."
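Applied to the query from the question, that might look like the following sketch (assuming the join order you actually want is assign_ipa → ipa → master_final):

SELECT master_final.PayorCode, SUM(master_final.MbrCt) AS MbrCt
FROM assign_ipa
STRAIGHT_JOIN ipa ON ipa.ipa_id = assign_ipa.ipa_id
STRAIGHT_JOIN master_final ON master_final.IPA = ipa.ipa_code
WHERE assign_ipa.username = 'SJones'
GROUP BY master_final.PayorCode, master_final.IPA;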
For me, putting the WHERE condition right into the join sometimes makes a difference and speeds things up. For example, you can write:
...t1 INNER JOIN t2 ON t1.k1 = t2.k2 AND t2.k2=something...
instead of
...t1 INNER JOIN t2 ON t1.k1 = t2.k2 .... WHERE t2.k2=something...
So this is definitely not an explanation of why you see that behavior, just a few hints. The query optimizer is a strange beast, but fortunately there is the EXPLAIN command, which can help you trick it into behaving the way you want.
I have the following query:
SELECT t.id
FROM account_transaction t
JOIN transaction_code tc ON t.transaction_code_id = tc.id
JOIN account a ON t.account_number = a.account_number
GROUP BY tc.id
When I do an EXPLAIN the first row shows, among other things, this:
table: t
type: ALL
possible_keys: account_id,transaction_code_id,account_transaction_transaction_code_id,account_transaction_account_number
key: NULL
rows: 465663
Why is key NULL?
Another issue you may be encountering is a data type mismatch. For example, if your column is a string type (CHAR, for example) and your query does not quote a number, then MySQL won't use the index.
SELECT * FROM tbl WHERE col = 12345; # No index
SELECT * FROM tbl WHERE col = '12345'; # Index
Source: Just fought this same issue today, and learned the hard way on MySQL 5.1. :)
Edit: Additional information to verify this:
mysql> desc das_table \G
*************************** 1. row ***************************
Field: das_column
Type: varchar(32)
Null: NO
Key: PRI
Default:
Extra:
*************************** 2. row ***************************
[SNIP!]
mysql> explain select * from das_table where das_column = 189017 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: das_column
type: ALL
possible_keys: PRIMARY
key: NULL
key_len: NULL
ref: NULL
rows: 874282
Extra: Using where
1 row in set (0.00 sec)
mysql> explain select * from das_table where das_column = '189017' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: das_column
type: const
possible_keys: PRIMARY
key: PRIMARY
key_len: 34
ref: const
rows: 1
Extra:
1 row in set (0.00 sec)
It might be because the statistics are broken, or because MySQL knows that you always have a 1:1 ratio between the two tables.
You can force an index to be used in the query and see if that speeds things up. If it does, run ANALYZE TABLE to make sure the statistics are up to date.
By specifying USE INDEX (index_list), you can tell MySQL to use only one of the named indexes to find rows in the table. The alternative syntax IGNORE INDEX (index_list) can be used to tell MySQL to not use some particular index or indexes. These hints are useful if EXPLAIN shows that MySQL is using the wrong index from the list of possible indexes.
You can also use FORCE INDEX, which acts like USE INDEX (index_list) but with the addition that a table scan is assumed to be very expensive. In other words, a table scan is used only if there is no way to use one of the given indexes to find rows in the table.
Each hint requires the names of indexes, not the names of columns. The name of a PRIMARY KEY is PRIMARY. To see the index names for a table, use SHOW INDEX.
From http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
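Applied to the query from the question, the two suggestions might look like this sketch (transaction_code_id is one of the index names EXPLAIN listed under possible_keys):

-- refresh index statistics so the optimizer has current cardinality data
ANALYZE TABLE account_transaction;

SELECT t.id
FROM account_transaction t FORCE INDEX (transaction_code_id)
JOIN transaction_code tc ON t.transaction_code_id = tc.id
JOIN account a ON t.account_number = a.account_number
GROUP BY tc.id;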
Index for the GROUP BY (= implicit ORDER BY)
...
GROUP BY tc.id
The GROUP BY does an implicit sort on tc.id.
tc.id is not listed as a possible key,
but t.transaction_code_id is.
Change the code to
SELECT t.id
FROM account_transaction t
JOIN transaction_code tc ON t.transaction_code_id = tc.id
JOIN account a ON t.account_number = a.account_number
GROUP BY t.transaction_code_id
This will bring the potential index transaction_code_id into play.
Indexes for the joins
If the joins (nearly) fully join the three tables, there's no need to use an index, so MySQL doesn't.
Other reasons for not using an index
If a large percentage of the rows under consideration (40%, IIRC) contain the same value, MySQL does not use an index, because not using the index is faster.
table:
foreign_id_1
foreign_id_2
integer
date1
date2
primary(foreign_id_1, foreign_id_2)
Query: delete from table where (foreign_id_1 = ? or foreign_id_2 = ?) and date2 < ?
Without the date condition the query takes about 40 seconds, which is too long :( With the date condition it takes much longer.
The options are:
create another table and insert select, then rename
use limit and run query multiple times
split query to run for foreign_id_1 then foreign_id_2
use select then delete by single row
Is there any faster way?
mysql> explain select * from compatibility where user_id = 193 or person_id = 193 \G
id: 1
select_type: SIMPLE
table: compatibility
type: index_merge
possible_keys: PRIMARY,compatibility_person_id_user_id
key: PRIMARY,compatibility_person_id_user_id
key_len: 4,4
ref: NULL
rows: 2
Extra: Using union(PRIMARY,compatibility_person_id_user_id); Using where
1 row in set (0.00 sec)
mysql> explain select * from compatibility where (user_id = 193 or person_id = 193) and updated_at < '2010-12-02 22:55:33' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: compatibility
type: index_merge
possible_keys: PRIMARY,compatibility_person_id_user_id
key: PRIMARY,compatibility_person_id_user_id
key_len: 4,4
ref: NULL
rows: 2
Extra: Using union(PRIMARY,compatibility_person_id_user_id); Using where
1 row in set (0.00 sec)
Having an OR in your WHERE makes MySQL reluctant (if not completely unwilling) to use indexes on your user_id and/or person_id fields (if there are any; showing the CREATE TABLE would indicate whether there are).
If you can add indexes (or modify existing ones, since I'm thinking of compound indexes), I'd likely add two:
ALTER TABLE compatibility
ADD INDEX user_id_updated_at (user_id, updated_at),
ADD INDEX persona_id_updated_at (person_id, updated_at);
Correspondingly, assuming the rows to DELETE don't have to be deleted atomically (i.e., occur at the same instant):
DELETE FROM compatibility WHERE user_id = 193 AND updated_at < '2010-12-02 22:55:33';
DELETE FROM compatibility WHERE person_id = 193 AND updated_at < '2010-12-02 22:55:33';
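If these statements still touch many rows, they combine well with the "use LIMIT and run the query multiple times" option from the question; a sketch (repeat each statement until it affects 0 rows, so no single run holds locks for long):

DELETE FROM compatibility
WHERE user_id = 193 AND updated_at < '2010-12-02 22:55:33'
LIMIT 10000;

DELETE FROM compatibility
WHERE person_id = 193 AND updated_at < '2010-12-02 22:55:33'
LIMIT 10000;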
By now the data volume is 40M rows (+33%) and rapidly growing, so I've started looking for another, possibly NoSQL, solution.
Thanks.