MySQL WHERE EXISTS subquery performance between 5.7 and 8.0

I have an animals table (~2.7m records) and a breeds table (~2.7m records) with a one-to-many relationship (one animal can have multiple breeds). I'm trying to query all distinct breeds for a specific species. As I'm not a SQL expert, my initial thought was to go with a simple SELECT DISTINCT breed ... JOIN, but that query took about 10 seconds, which seemed much longer than I'd expect. So I changed it to a SELECT DISTINCT ... WHERE EXISTS subquery, and it executed in about 100ms on 5.7, which is much more reasonable. But now I'm migrating to MySQL 8 and the exact same query takes anywhere from 10 to 30 seconds. Here are the table definitions:
CREATE TABLE `animals` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(150) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`species` varchar(50) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`sex` enum('Male','Female') CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`dob` date DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `animals_name_index` (`name`),
KEY `animals_dob_index` (`dob`),
KEY `animals_sex_index` (`sex`),
KEY `animals_species_index` (`species`,`id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=2807152 DEFAULT CHARSET=utf8mb3 COLLATE=utf8_unicode_ci
CREATE TABLE `animal_breeds` (
`animal_id` int unsigned DEFAULT NULL,
`breed` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
UNIQUE KEY `animal_breeds_animal_id_breed_unique` (`animal_id`,`breed`),
KEY `animal_breeds_breed_animal_id_index` (`breed`,`animal_id`) USING BTREE,
CONSTRAINT `animal_breeds_animal_id_foreign` FOREIGN KEY (`animal_id`) REFERENCES `animals` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Here is the query I'm running:
SELECT SQL_NO_CACHE *
FROM
(
SELECT DISTINCT `breed`
FROM `animal_breeds`
) AS `subQuery`
WHERE
EXISTS (
SELECT `breed`
FROM `animal_breeds`
INNER JOIN `animals` ON `animals`.`id` = `animal_breeds`.`animal_id`
WHERE `animals`.`species` = 'Dog' AND `animal_breeds`.`breed` = `subQuery`.`breed`
);
Here are the two EXPLAIN statements from 5.7 and 8.0
MySQL 5.7
284 rows in set, 1 warning (0.02 sec)
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 7775 | 100.00 | Using where
3 | DEPENDENT SUBQUERY | animal_breeds | NULL | ref | animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index | animal_breeds_breed_animal_id_index | 1022 | allBreeds.breed | 348 | 100.00 | Using where; Using index
3 | DEPENDENT SUBQUERY | animals | NULL | eq_ref | PRIMARY,animals_species_index | PRIMARY | 4 | animal_breeds.animal_id | 1 | 50.00 | Using where
2 | DERIVED | animal_breeds | NULL | range | animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index | animal_breeds_breed_animal_id_index | 1022 | NULL | 7775 | 100.00 | Using index for group-by
MySQL 8.0.27
284 rows in set, 1 warning (27.92 sec)
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 7776 | 100.00 | NULL
1 | PRIMARY | <subquery3> | NULL | eq_ref | <auto_distinct_key> | <auto_distinct_key> | 1022 | allBreeds.breed | 1 | 100.00 | NULL
3 | MATERIALIZED | animals | NULL | ref | PRIMARY,animals_species_index | animals_species_index | 153 | const | 1390666 | 100.00 | Using index
3 | MATERIALIZED | animal_breeds | NULL | ref | animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index | animal_breeds_animal_id_breed_unique | 5 | animals.id | 1 | 100.00 | Using index
2 | DERIVED | animal_breeds | NULL | range | animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index | animal_breeds_breed_animal_id_index | 1022 | NULL | 7776 | 100.00 | Using index for group-by
Lastly, both of these databases are running the base Docker image with no configuration changes, although the query also runs poorly on a VPS running MySQL 8 with some tweaked settings. I also read through a thread about someone having a similar problem, but the comments/answer didn't seem to help in my case.
Any help would be much appreciated!
EDIT:
Here are the execution plans for the SELECT DISTINCT ... JOIN, and for the same join without DISTINCT:
SELECT DISTINCT ab.breed
FROM animal_breeds ab
INNER JOIN animals a on a.id=ab.animal_id
WHERE a.species='Dog'
MySQL 5.7
284 rows in set (25.27 sec)
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | SIMPLE | a | NULL | ref | PRIMARY,animals_species_index,id_species | animals_species_index | 153 | const | 1385271 | 100.00 | Using index; Using temporary
1 | SIMPLE | ab | NULL | ref | animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index | animal_breeds_animal_id_breed_unique | 5 | a.id | 1 | 100.00 | Using index
MySQL 8.0
284 rows in set (29.45 sec)
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | SIMPLE | a | NULL | ref | PRIMARY,animals_species_index,id_species | animals_species_index | 153 | const | 1390666 | 100.00 | Using index; Using temporary
1 | SIMPLE | ab | NULL | ref | animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index | animal_breeds_animal_id_breed_unique | 5 | a.id | 1 | 100.00 | Using index
SELECT ab.breed
FROM animal_breeds ab
INNER JOIN animals a on a.id=ab.animal_id
WHERE a.species='Dog'
MySQL 5.7
2722722 rows in set (26.69 sec)
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | SIMPLE | a | NULL | ref | PRIMARY,animals_species_index,id_species | animals_species_index | 153 | const | 1385271 | 100.00 | Using index
1 | SIMPLE | ab | NULL | ref | animal_breeds_animal_id_breed_unique | animal_breeds_animal_id_breed_unique | 5 | a.id | 1 | 100.00 | Using index
MySQL 8.0
2722722 rows in set (32.49 sec)
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | SIMPLE | a | NULL | ref | PRIMARY,animals_species_index,id_species | animals_species_index | 153 | const | 1390666 | 100.00 | Using index
1 | SIMPLE | ab | NULL | ref | animal_breeds_animal_id_breed_unique | animal_breeds_animal_id_breed_unique | 5 | a.id | 1 | 100.00 | Using index

Filtering animals before joining it to breeds will improve performance (10x faster in some cases):
SELECT DISTINCT ab.breed
FROM animal_breeds ab
WHERE ab.animal_id IN (
SELECT a.id
FROM animals a
WHERE a.species = 'Dog');
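Separately, the 8.0 EXPLAIN in the question shows the EXISTS being rewritten into a materialized semijoin (the MATERIALIZED / <subquery3> rows), a transformation newer 8.0 releases can apply to EXISTS subqueries. As an experiment only, you can disable those strategies for your session and re-run the original EXISTS query to see whether the 5.7-style plan comes back; these optimizer_switch flags are standard 8.0 switches, but whether they actually help in this exact case is an assumption to verify with EXPLAIN:
-- session-only experiment: disable semijoin and subquery materialization
SET SESSION optimizer_switch = 'semijoin=off,materialization=off';
-- re-run the EXISTS query and compare the plan and timing, then switch back:
SET SESSION optimizer_switch = 'semijoin=on,materialization=on';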

Try writing the query without an inner join, starting from the table that contains the columns used in the WHERE conditions. Here is one possible variant:
SELECT DISTINCT ab.breed
FROM animals a
LEFT JOIN animal_breeds ab on a.id = ab.animal_id
WHERE a.species = 'Dog'
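Whichever rewrite you try, EXPLAIN ANALYZE (available since MySQL 8.0.18) is a useful check here, because unlike plain EXPLAIN it reports actual row counts and per-step timings. For example, against the query above:
EXPLAIN ANALYZE
SELECT DISTINCT ab.breed
FROM animals a
LEFT JOIN animal_breeds ab ON a.id = ab.animal_id
WHERE a.species = 'Dog';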

Consider this:
Change the table name animals to pets. (I think of "Fido" as a pet and "dog" as an animal. Because of this, it took me a long time to figure out the schema.)
Move species to the other table, now called pet_species. Though I cannot imagine a "cat" breed being called "retriever", it does make sense that (species, breed) is a hierarchical pair of terms.
Another confusion: Technically speaking a "dog" is "Canis familiaris", which is [technically] two terms 'genus' and 'species'.
Moving species to the other table will lead to some changes to the queries. You could have it in both tables, though DB purists say that redundant information is a "no-no". I have not thought of a compromise between the two positions.
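A rough sketch of the layout being suggested; the definitions below are illustrative guesses, not taken from the original schema:
CREATE TABLE `pets` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(150) NOT NULL,
`sex` enum('Male','Female') DEFAULT NULL,
`dob` date DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
CREATE TABLE `pet_species` (
`pet_id` int unsigned NOT NULL,
`species` varchar(50) NOT NULL,
`breed` varchar(255) NOT NULL,
UNIQUE KEY `pet_species_unique` (`pet_id`,`species`,`breed`),
KEY `pet_species_species_breed_index` (`species`,`breed`),
CONSTRAINT `pet_species_pet_id_foreign` FOREIGN KEY (`pet_id`) REFERENCES `pets` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB;
With (species, breed) indexed together, SELECT DISTINCT breed FROM pet_species WHERE species = 'Dog' can be answered with a loose index scan on a single table, with no join at all.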

Related

mysql: index on json array not used when joining another table

I have 2 tables:
CREATE TABLE `directory` (
`id` bigint NOT NULL,
`datasets` json DEFAULT NULL,
PRIMARY KEY (`id`) USING BTREE,
KEY `idx_datasets` ((cast(`datasets` as unsigned array)))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
CREATE TABLE `dataset` (
`id` bigint NOT NULL,
`name` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL,
PRIMARY KEY (`id`) USING BTREE,
KEY `idx_name` (`name`,`id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
The query below uses indexes on the two tables as expected:
explain
SELECT * FROM dataset d inner join `directory` dir
on JSON_CONTAINS(dir.datasets, cast(d.id as json))
where d.id = 111;
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE d const PRIMARY PRIMARY 8 const 1 100.00
1 SIMPLE dir range idx_datasets idx_datasets 9 2 100.00 Using where
However, this query uses an index only on the left table:
explain
SELECT * FROM dataset d inner join `directory` dir
on JSON_CONTAINS(dir.datasets, cast(d.id as json))
where d.name like '111';
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE d range idx_name idx_name 259 1 100.00 Using index condition
1 SIMPLE dir ALL 1000 100.00 Using where; Using join buffer (hash join)
Could someone explain the difference between the two queries?
I changed the condition from "like" to "=", and the result is the same:
explain
SELECT * FROM dataset d inner join `directory` dir
on JSON_CONTAINS(dir.datasets, cast(d.id as json))
where d.name = '111';
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE d ref idx_name idx_name 259 const 1 100.00
1 SIMPLE dir ALL 1000 100.00 Using where; Using join buffer (hash join)
This is caused by the LIKE expression in the WHERE clause of the second query.
Here is why, as explained in these threads:
1- Equals(=) vs. LIKE
2- SQL 'like' vs '=' performance
EDIT: it looks like the issue here is caused by the fact that in the first query you are searching against the primary key, while in the second you don't.
More details in this question:
Behavior of WHERE clause on a primary key field
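One way to see the difference in practice (a manual two-step sketch, not from the original post): resolve the dataset id first, then probe directory with a literal value, which puts the lookup back in the shape of the first, index-using query:
SELECT id FROM dataset WHERE name = '111';
-- suppose this returns 111, then:
EXPLAIN
SELECT * FROM `directory` dir
WHERE JSON_CONTAINS(dir.datasets, CAST(111 AS JSON));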

Why would an indexed column return results slowly when querying for `IS NULL`?

I have a table with 25 million rows, indexed appropriately.
But adding the clause AND status IS NULL turns a super fast query into a crazy slow query.
Please help me speed it up.
Query:
SELECT
student_id,
grade,
status
FROM
grades
WHERE
class_id = 1
AND status IS NULL -- This line delays results from <200ms to 40-70s!
AND grade BETWEEN 0 AND 0.7
LIMIT 25;
Table:
CREATE TABLE IF NOT EXISTS `grades` (
`student_id` BIGINT(20) NOT NULL,
`class_id` INT(11) NOT NULL,
`grade` FLOAT(10,6) DEFAULT NULL,
`status` INT(11) DEFAULT NULL,
UNIQUE KEY `unique_key` (`student_id`,`class_id`),
KEY `class_id` (`class_id`),
KEY `status` (`status`),
KEY `grade` (`grade`)
) ENGINE=INNODB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Local development returns results instantly (<200ms). The production server shows a huge slowdown (40-70 seconds!).
Can you point me in the right direction to debug?
Explain:
+----+-------------+--------+-------------+-----------------------+-----------------+---------+------+-------+--------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------------+-----------------------+-----------------+---------+------+-------+--------------------------------------------------------+
| 1 | SIMPLE | grades | index_merge | class_id,status,grade | status,class_id | 5,4 | NULL | 26811 | Using intersect(status,class_id); Using where |
+----+-------------+--------+-------------+-----------------------+-----------------+---------+------+-------+--------------------------------------------------------+
A SELECT statement will normally use only one index per table (the index_merge shown here is the exception).
Presumably the query before just did a scan using the sole index class_id for your condition class_id = 1, which probably filters your result set nicely before checking the other conditions.
The optimiser is 'incorrectly' choosing an index merge on class_id and status for the second query and checking 26811 rows, which is probably not optimal. You could hint at the class_id index by adding USE INDEX (class_id) after the table name in the FROM clause.
You may get some joy with a composite index on (class_id, status, grade), which may run the query faster as it can match the first two and then range-scan the grade. I'm not sure how this works with NULL though.
I'm guessing the ORDER BY pushed the optimiser to choose the class_id index again and returned your query to its original speed.
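Two concrete things to try, as sketches (the composite index name below is made up; note that InnoDB secondary indexes do store NULLs, so an IS NULL lookup can use a B-tree index):
-- composite index matching the WHERE clause: equality on class_id,
-- IS NULL on status, then a range on grade
ALTER TABLE grades ADD INDEX idx_class_status_grade (class_id, status, grade);
-- or steer the optimizer away from the index_merge intersect
SELECT student_id, grade, status
FROM grades USE INDEX (class_id)
WHERE class_id = 1
AND status IS NULL
AND grade BETWEEN 0 AND 0.7
LIMIT 25;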

MySQL : Avoid Temporary/Filesort Caused by GROUP BY Clause

I've got a fairly simple query that seeks to display the number of email addresses that are subscribed along with the number unsubscribed, grouped by client.
The query:
SELECT
client_id,
COUNT(CASE WHEN subscribed = 1 THEN subscribed END) AS subs,
COUNT(CASE WHEN subscribed = 0 THEN subscribed END) AS unsubs
FROM
contacts_emailAddresses
LEFT JOIN contacts ON contacts.id = contacts_emailAddresses.contact_id
GROUP BY
client_id
Schema of relevant tables follows. contacts_emailAddresses is a junction table between contacts (which has the client_id) and emailAddresses (which is not actually used in this query).
CREATE TABLE `contacts` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`firstname` varchar(255) NOT NULL DEFAULT '',
`middlename` varchar(255) NOT NULL DEFAULT '',
`lastname` varchar(255) NOT NULL DEFAULT '',
`gender` varchar(5) DEFAULT NULL,
`client_id` mediumint(10) unsigned DEFAULT NULL,
`datasource` varchar(10) DEFAULT NULL,
`external_id` int(10) unsigned DEFAULT NULL,
`created` timestamp NULL DEFAULT NULL,
`trash` tinyint(1) NOT NULL DEFAULT '0',
`updated` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `client_id` (`client_id`),
KEY `external_id combo` (`client_id`,`datasource`,`external_id`),
KEY `trash` (`trash`),
KEY `lastname` (`lastname`),
KEY `firstname` (`firstname`),
CONSTRAINT `contacts_ibfk_1` FOREIGN KEY (`client_id`) REFERENCES `clients` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=14742974 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT
CREATE TABLE `contacts_emailAddresses` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`contact_id` int(10) unsigned NOT NULL,
`emailAddress_id` int(11) unsigned DEFAULT NULL,
`primary` tinyint(1) unsigned NOT NULL DEFAULT '0',
`subscribed` tinyint(1) unsigned NOT NULL DEFAULT '1',
`modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `contact_id` (`contact_id`),
KEY `subscribed` (`subscribed`),
KEY `combo` (`contact_id`,`emailAddress_id`) USING BTREE,
KEY `emailAddress_id` (`emailAddress_id`) USING BTREE,
CONSTRAINT `contacts_emailAddresses_ibfk_1` FOREIGN KEY (`contact_id`) REFERENCES `contacts` (`id`),
CONSTRAINT `contacts_emailAddresses_ibfk_2` FOREIGN KEY (`emailAddress_id`) REFERENCES `emailAddresses` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=24700918 DEFAULT CHARSET=utf8
Here's the EXPLAIN:
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
| 1 | SIMPLE | contacts_emailAddresses | ALL | NULL | NULL | NULL | NULL | 10176639 | Using temporary; Using filesort |
| 1 | SIMPLE | contacts | eq_ref | PRIMARY | PRIMARY | 4 | icarus.contacts_emailAddresses.contact_id | 1 | |
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
2 rows in set (0.08 sec)
The problem here clearly is the GROUP BY clause, as I can remove the JOIN (and the items that depend on it) and the performance still is terrible (40+ seconds). There are 10m records in contacts_emailAddresses, 12m-some records in contacts, and 10–15 client records for the grouping.
From the doc:
Temporary tables can be created under conditions such as these:
If there is an ORDER BY clause and a different GROUP BY clause, or if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue, a temporary table is created.
DISTINCT combined with ORDER BY may require a temporary table.
If you use the SQL_SMALL_RESULT option, MySQL uses an in-memory temporary table, unless the query also contains elements (described later) that require on-disk storage.
I'm obviously not combining the GROUP BY with an ORDER BY, and I have tried multiple things to ensure that the GROUP BY is on a column that should be properly placed in the join queue (including rewriting the query to put contacts in the FROM and instead join to contacts_emailAddresses), all to no avail.
Any suggestions for performance tuning would be much appreciated!
I think the only real shot you have of getting away from a "Using temporary; Using filesort" operation (given the current schema, the current query, and the specified resultset) would be to use correlated subqueries in the SELECT list.
SELECT c.client_id
, (SELECT IFNULL(SUM(es.subscribed=1),0)
FROM contacts_emailAddresses es
JOIN contacts cs
ON cs.id = es.contact_id
WHERE cs.client_id = c.client_id
) AS subs
, (SELECT IFNULL(SUM(eu.subscribed=0),0)
FROM contacts_emailAddresses eu
JOIN contacts cu
ON cu.id = eu.contact_id
WHERE cu.client_id = c.client_id
) AS unsubs
FROM contacts c
GROUP BY c.client_id
This may run quicker than the original query, or it may not. Those correlated subqueries are going to get run for each row returned by the outer query. If that outer query is returning a boatload of rows, that's a whole boatload of subquery executions.
Here's the output from an EXPLAIN:
id select_type table type possible_keys key key_len ref Extra
-- ------------------ ----- ----- ----------------------------------- ---------- ------- ------ ------------------------
1 PRIMARY c index (NULL) client_id 5 (NULL) Using index
3 DEPENDENT SUBQUERY cu ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
3 DEPENDENT SUBQUERY eu ref contact_id,combo contact_id 4 cu.id Using where
2 DEPENDENT SUBQUERY cs ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
2 DEPENDENT SUBQUERY es ref contact_id,combo contact_id 4 cs.id Using where
For optimum performance of this query, we'd really like to see "Using index" in the Extra column of the explain for the eu and es tables. But to get that, we'd need a suitable index, one with a leading column of contact_id and including the subscribed column. For example:
CREATE INDEX cemail_IX2 ON contacts_emailAddresses (contact_id, subscribed);
With the new index available, EXPLAIN output shows MySQL will use the new index:
id select_type table type possible_keys key key_len ref Extra
-- ------------------ ----- ----- ----------------------------------- ---------- ------- ------ ------------------------
1 PRIMARY c index (NULL) client_id 5 (NULL) Using index
3 DEPENDENT SUBQUERY cu ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
3 DEPENDENT SUBQUERY eu ref contact_id,combo,cemail_IX2 cemail_IX2 4 cu.id Using where; Using index
2 DEPENDENT SUBQUERY cs ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
2 DEPENDENT SUBQUERY es ref contact_id,combo,cemail_IX2 cemail_IX2 4 cs.id Using where; Using index
NOTES
This is the kind of problem where introducing a little redundancy can improve performance. (Just like we do in a traditional data warehouse.)
For optimum performance, what we'd really like is to have the client_id column available on the contacts_emailAddresses table, without a need to JOIN to the contacts table.
In the current schema, the foreign key relationship to contacts table gets us the client_id (rather, the JOIN operation in the original query is what gets it for us.) If we could avoid that JOIN operation entirely, we could satisfy the query entirely from a single index, using the index to do the aggregation, and avoiding the overhead of the "Using temporary; Using filesort" and JOIN operations...
With the client_id column available, we'd create a covering index like...
... ON contacts_emailAddresses (client_id, subscribed)
Then, we'd have a blazingly fast query...
SELECT e.client_id
, SUM(e.subscribed=1) AS subs
, SUM(e.subscribed=0) AS unsubs
FROM contacts_emailAddresses e
GROUP BY e.client_id
That would get us a "Using index" in the query plan, and the query plan for this resultset just doesn't get any better than that.
But that would require a change to your schema, so it doesn't really answer your question.
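For reference, that schema change might look roughly like this (a sketch only; keeping the copied client_id in sync afterwards is left to the application or to triggers):
ALTER TABLE contacts_emailAddresses ADD COLUMN client_id MEDIUMINT UNSIGNED NULL;
UPDATE contacts_emailAddresses e
JOIN contacts c ON c.id = e.contact_id
SET e.client_id = c.client_id;
CREATE INDEX cemail_client_sub ON contacts_emailAddresses (client_id, subscribed);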
Without the client_id column, then the best we're likely to do is a query like the one Gordon posted in his answer (though you still need to add the GROUP BY c.client_id to get the specified result.) The index Gordon recommended will be of benefit...
... ON contacts_emailAddresses(contact_id, subscribed)
With that index defined, the standalone index on contact_id is redundant. The new index will be a suitable replacement to support the existing foreign key constraint. (The index on just contact_id could be dropped.)
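A sketch of that cleanup, assuming the cemail_IX2 index from above is already in place:
-- the FK on contact_id stays satisfied by cemail_IX2 (and by the existing `combo` index)
ALTER TABLE contacts_emailAddresses DROP INDEX contact_id;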
Another approach would be to do the aggregation on the "big" table first, before doing the JOIN, since it's the driving table for the outer join. Actually, since that foreign key column is defined as NOT NULL, and there's a foreign key, it's not really an "outer" join at all.
SELECT c.client_id
, SUM(s.subs) AS subs
, SUM(s.unsubs) AS unsubs
FROM ( SELECT e.contact_id
, SUM(e.subscribed=1) AS subs
, SUM(e.subscribed=0) AS unsubs
FROM contacts_emailAddresses e
GROUP BY e.contact_id
) s
JOIN contacts c
ON c.id = s.contact_id
GROUP BY c.client_id
Again, we need an index with contact_id as the leading column and including the subscribed column, for best performance. (The plan for s should show "Using index".) Unfortunately, that's still going to materialize a fairly sizable resultset (derived table s) as a temporary MyISAM table, and the MyISAM table isn't going to be indexed.

simple mysql query working slower than nested Select

I am doing a simple Select from a single table.
CREATE TABLE `book` (
`Book_Id` int(10) NOT NULL AUTO_INCREMENT,
`Book_Name` varchar(100) COLLATE utf8_turkish_ci DEFAULT NULL ,
`Book_Active` bit(1) NOT NULL DEFAULT b'1' ,
`Author_Id` int(11) NOT NULL,
PRIMARY KEY (`Book_Id`),
KEY `FK_Author` (`Author_Id`),
CONSTRAINT `FK_Author` FOREIGN KEY (`Author_Id`) REFERENCES `author` (`Author_Id`) ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=5947698 DEFAULT CHARSET=utf8 COLLATE=utf8_turkish_ci ROW_FORMAT=COMPACT
table : book
columns :
Book_Id (INTEGER 10) | Book_Name (VARCHAR 100) | Author_Id (INTEGER 10) | Book_Active (Boolean)
I have indexes on three columns: Book_Id (primary key), Author_Id (FK), Book_Active.
first query :
SELECT * FROM book WHERE Author_Id = 1 AND Book_Active = 1
EXPLAIN :
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE book ref FK_Author,index_Book_Active FK_Author 4 const 4488510 Using where
second query :
SELECT b.* FROM book b
WHERE Book_Active=1
AND Book_Id IN (SELECT Book_Id FROM book WHERE Author_Id=1)
EXPLAIN :
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY book ref index_Book_Active index_Book_Active 1 const 9369399 Using where
2 DEPENDENT SUBQUERY book unique_subquery PRIMARY,FK_Author PRIMARY 4 func 1 Using where
The data statistics is like this :
16.8 million books
10.5 million Book_Active=true
6.3 million Book_Active = false
And For Author_Id=1
2.4 million Book_Active=false
5000 Book_Active=true
The first query takes 6.7 seconds. The second query takes 0.0002 seconds.
What is the cause of this enormous difference? Is it the right thing to use the nested select query?
edit: added "sql explain"
In the first case, MySQL uses the FK_Author index (this gives us ~4.5M rows), then it has to check every one of those rows against the Book_Active = 1 condition; no index can be used for that check.
In the second case: InnoDB implicitly appends the primary key to every secondary index. Thus, when MySQL executes the part SELECT b.* FROM book b WHERE Book_Active=1, it already has Book_Id from the index_Book_Active index. Then, for the subquery, it has to match Book_Id against rows with Author_Id = 1; Author_Id is a constant and is the prefix of the FK_Author index, and the primary key implicitly included in that index can be matched against the primary key implicitly included in the index_Book_Active index. In your case it is faster to intersect the two indexes than to use one index to retrieve 4.5M rows and scan them sequentially.
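If the goal is to make the first, straightforward query fast rather than keep the nested form, the usual fix is a composite index covering both predicates. A sketch (the index name is made up):
ALTER TABLE book ADD INDEX idx_author_active (Author_Id, Book_Active);
-- the simple query should now resolve both conditions from one index
EXPLAIN SELECT * FROM book WHERE Author_Id = 1 AND Book_Active = 1;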

MySQL 5.1 using filesort event when an index is present

Probably I'm missing some silly thing... Apparently MySQL 5.1 keeps doing a Filesort even when there is an index that matches exactly the column in the ORDER BY clause. To post it here, I've oversimplified the data model, but the issue is still happening:
Table definition:
CREATE TABLE `event` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`owner_id` int(11) DEFAULT NULL,
`date_created` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `owner_id` (`owner_id`),
KEY `date_created` (`date_created`),
CONSTRAINT `event_ibfk_1` FOREIGN KEY (`owner_id`) REFERENCES `user_profile` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=utf8;
My problem is that even a simple SELECT is showing "Using filesort":
explain select * from event order by date_created desc;
And the result for the query explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE event ALL NULL NULL NULL NULL 6 Using filesort
Is there any way for this type of query to use the index instead of doing a filesort?
Thanks in advance to everybody.
Since your CREATE TABLE statement indicates that you have less than 10 rows (AUTO_INCREMENT=7) and using FORCE INDEX on my installation will make MySQL use the index, I'm guessing the optimizer thinks a table scan is faster (less random I/O) than an index scan (since you're selecting all columns, not just date_created). This is confirmed by the following:
mysql> explain select date_created from event order by date_created;
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
| 1 | SIMPLE | event | index | NULL | date_created | 9 | NULL | 1 | Using index |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
1 row in set (0.00 sec)
In the above case, the index scan is faster because only the indexed column needs to be returned.
The MySQL documentation has some cases where using an index is considered slower: http://dev.mysql.com/doc/refman/5.1/en/how-to-avoid-table-scan.html
Q: Is there any way for this type of query to use the index instead of doing a filesort?
A: To have MySQL use the index if at all possible, try:
EXPLAIN SELECT * FROM event FORCE INDEX (date_created) ORDER BY date_created DESC;
By using the FORCE INDEX (index_name), this tells MySQL to make use of the index if it's at all possible. Absent that directive, MySQL will choose the most efficient way to return the result set. A filesort may be more efficient than using the index.