I am doing a simple Select from a single table.
CREATE TABLE `book` (
`Book_Id` int(10) NOT NULL AUTO_INCREMENT,
`Book_Name` varchar(100) COLLATE utf8_turkish_ci DEFAULT NULL ,
`Book_Active` bit(1) NOT NULL DEFAULT b'1' ,
`Author_Id` int(11) NOT NULL,
PRIMARY KEY (`Book_Id`),
KEY `FK_Author` (`Author_Id`),
CONSTRAINT `FK_Author` FOREIGN KEY (`Author_Id`) REFERENCES `author` (`Author_Id`) ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=5947698 DEFAULT CHARSET=utf8 COLLATE=utf8_turkish_ci ROW_FORMAT=COMPACT
table : book
columns :
Book_Id (INTEGER 10) | Book_Name (VARCHAR 100) | Author_Id (INTEGER 10) | Book_Active (Boolean)
I have Indexes on three columns : Book_Id (PRIMARY key) , Author_Id (FK) , Book_Active .
first query :
SELECT * FROM book WHERE Author_Id = 1 AND Book_Active = 1
EXPLAIN :
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE book ref FK_Author,index_Book_Active FK_Author 4 const 4488510 Using where
second query :
SELECT b.* FROM book b
WHERE Book_Active=1
AND Book_Id IN (SELECT Book_Id FROM book WHERE Author_Id=1)
EXPLAIN :
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY book ref index_Book_Active index_Book_Active 1 const 9369399 Using where
2 DEPENDENT SUBQUERY book unique_subquery PRIMARY,FK_Author PRIMARY 4 func 1 Using where
The data statistics is like this :
16.8 million books
10.5 million Book_Active=true
6.3 million Book_Active = false
And For Author_Id=1
2.4 million Book_Active=false
5000 Book_Active=true
The first query takes 6.7 seconds. The second query takes 0.0002 seconds.
What is the cause of this enormous difference? Is using the nested SELECT query the right thing to do?
edit: added "sql explain"
In the first case, MySQL uses the FK_Author index, which yields about 4.5M rows, and then has to check every one of those rows against the Book_Active = 1 condition; the index cannot help with that part.
In the second case, InnoDB implicitly appends the primary key to every secondary index. So when MySQL executes the outer part, SELECT b.* FROM book b WHERE Book_Active=1, it already has Book_Id from the index. For the subquery it then has to match Book_Id with Author_Id; Author_Id is a constant and is the prefix of its index, and the primary key implicitly included there can be matched against the implicit primary key from the Book_Active index. In your case it is faster to intersect two indexes than to use one index to retrieve 4.5M rows and scan them sequentially.
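A common alternative to rewriting the query is a composite index that covers both conditions, so that the first query can narrow down to the roughly 5000 matching rows directly. This is only a sketch: the index name idx_author_active is made up for illustration, and the actual gain should be verified with EXPLAIN on the real data.

```sql
-- Hypothetical index name; the column order puts both equality columns in one index.
ALTER TABLE book ADD INDEX idx_author_active (Author_Id, Book_Active);

-- Re-check the plan of the original first query after adding the index.
EXPLAIN SELECT * FROM book WHERE Author_Id = 1 AND Book_Active = 1;
```

If the optimizer picks the new index, the rows estimate in EXPLAIN should drop from millions to something close to the number of active books for that author.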
Related
I have an animals table (~2.7m records) and a breeds table (~2.7m records) that have a one to many relationship (one animal can have multiple breeds). I'm trying to query all distinct breeds for a specific species. As I'm not a SQL expert, my initial thought was to go with a simple SELECT DISTINCT breed ... JOIN, but this query took about 10 seconds which seemed much longer than I'd expect. So I changed this to a SELECT DISTINCT ... WHERE EXISTS subquery and it executed in about 100ms in 5.7, which is much more reasonable. But now I'm migrating to MySQL 8 and this exact query takes anywhere from 10-30 seconds. Here are the table definitions:
CREATE TABLE `animals` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(150) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`species` varchar(50) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`sex` enum('Male','Female') CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`dob` date DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `animals_name_index` (`name`),
KEY `animals_dob_index` (`dob`),
KEY `animals_sex_index` (`sex`),
KEY `animals_species_index` (`species`,`id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=2807152 DEFAULT CHARSET=utf8mb3 COLLATE=utf8_unicode_ci
CREATE TABLE `animal_breeds` (
`animal_id` int unsigned DEFAULT NULL,
`breed` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
UNIQUE KEY `animal_breeds_animal_id_breed_unique` (`animal_id`,`breed`),
KEY `animal_breeds_breed_animal_id_index` (`breed`,`animal_id`) USING BTREE,
CONSTRAINT `animal_breeds_animal_id_foreign` FOREIGN KEY (`animal_id`) REFERENCES `animals` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Here is the query I'm running:
SELECT SQL_NO_CACHE *
FROM
(
SELECT DISTINCT `breed`
FROM `animal_breeds`
) AS `subQuery`
WHERE
EXISTS (
SELECT `breed`
FROM `animal_breeds`
INNER JOIN `animals` ON `animals`.`id` = `animal_breeds`.`animal_id`
WHERE `animals`.`species` = 'Dog' AND `animal_breeds`.`breed` = `subQuery`.`breed`
);
Here are the two EXPLAIN statements from 5.7 and 8.0
MySQL 5.7
284 rows in set, 1 warning (0.02 sec)
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> NULL ALL NULL NULL NULL NULL 7775 100.00 Using where
3 DEPENDENT SUBQUERY animal_breeds NULL ref animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index animal_breeds_breed_animal_id_index 1022 allBreeds.breed 348 100.00 Using where; Using index
3 DEPENDENT SUBQUERY animals NULL eq_ref PRIMARY,animals_species_index PRIMARY 4 animal_breeds.animal_id 1 50.00 Using where
2 DERIVED animal_breeds NULL range animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index animal_breeds_breed_animal_id_index 1022 NULL 7775 100.00 Using index for group-by
MySQL 8.0.27
284 rows in set, 1 warning (27.92 sec)
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> NULL ALL NULL NULL NULL NULL 7776 100.00 NULL
1 PRIMARY <subquery3> NULL eq_ref <auto_distinct_key> <auto_distinct_key> 1022 allBreeds.breed 1 100.00 NULL
3 MATERIALIZED animals NULL ref PRIMARY,animals_species_index animals_species_index 153 const 1390666 100.00 Using index
3 MATERIALIZED animal_breeds NULL ref animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index animal_breeds_animal_id_breed_unique 5 animals.id 1 100.00 Using index
2 DERIVED animal_breeds NULL range animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index animal_breeds_breed_animal_id_index 1022 NULL 7776 100.00 Using index for group-by
Lastly, both of these databases are using the base Docker image with no changes to the configuration, although the query still runs poorly on a VPS running MySQL 8 with some tweaked settings. I also read through a thread about someone having a similar problem, but the comments and answer didn't seem to help in my case.
Any help would be much appreciated!
EDIT:
Here is the execution plan for the SELECT DISTINCT ... JOIN:
SELECT DISTINCT ab.breed
FROM animal_breeds ab
INNER JOIN animals a on a.id=ab.animal_id
WHERE a.species='Dog'
MySQL 5.7
284 rows in set (25.27 sec)
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE a NULL ref PRIMARY,animals_species_index,id_species animals_species_index 153 const 1385271 100.00 Using index; Using temporary
1 SIMPLE ab NULL ref animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index animal_breeds_animal_id_breed_unique 5 a.id 1 100.00 Using index
MySQL 8.0
284 rows in set (29.45 sec)
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE a NULL ref PRIMARY,animals_species_index,id_species animals_species_index 153 const 1390666 100.00 Using index; Using temporary
1 SIMPLE ab NULL ref animal_breeds_animal_id_breed_unique,animal_breeds_breed_animal_id_index animal_breeds_animal_id_breed_unique 5 a.id 1 100.00 Using index
SELECT ab.breed
FROM animal_breeds ab
INNER JOIN animals a on a.id=ab.animal_id
WHERE a.species='Dog'
MySQL 5.7
2722722 rows in set (26.69 sec)
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE a NULL ref PRIMARY,animals_species_index,id_species animals_species_index 153 const 1385271 100.00 Using index
1 SIMPLE ab NULL ref animal_breeds_animal_id_breed_unique animal_breeds_animal_id_breed_unique 5 a.id 1 100.00 Using index
MySQL 8.0
2722722 rows in set (32.49 sec)
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE a NULL ref PRIMARY,animals_species_index,id_species animals_species_index 153 const 1390666 100.00 Using index
1 SIMPLE ab NULL ref animal_breeds_animal_id_breed_unique animal_breeds_animal_id_breed_unique 5 a.id 1 100.00 Using index
Filtering animals before joining it to breeds will improve performance (10x faster in some cases):
SELECT DISTINCT ab.breed
FROM animal_breeds ab
WHERE ab.animal_id IN (
SELECT a.id
FROM animals a
WHERE a.species = 'Dog');
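Try writing the query without an inner join, starting from the table that contains the columns used in the WHERE conditions. Here is one possible variant (note that, unlike the inner join, the LEFT JOIN can also return a NULL breed if a matching animal has no breeds):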
SELECT DISTINCT ab.breed
FROM animals a
LEFT JOIN animal_breeds ab on a.id = ab.animal_id
WHERE a.species = 'Dog'
Consider this:
Change the table name animals to pets. (I think of "Fido" as a pet and "dog" as an animal. Because of this, it took me a long time to figure out the schema.)
Move species to the other table, now called pet_species. Though I cannot imagine a "cat" breed being called "retriever", it does make sense that (species, breed) is a hierarchical pair of terms.
Another confusion: Technically speaking a "dog" is "Canis familiaris", which is [technically] two terms 'genus' and 'species'.
Moving species to the other table will lead to some changes to the queries. You could have it in both tables, though DB purists say that redundant information is a "no-no". I have not thought of a compromise between the two stands.
I have 2 tables:
CREATE TABLE `directory` (
`id` bigint NOT NULL,
`datasets` json DEFAULT NULL,
PRIMARY KEY (`id`) USING BTREE,
KEY `idx_datasets` ((cast(`datasets` as unsigned array)))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
CREATE TABLE `dataset` (
`id` bigint NOT NULL,
`name` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL,
PRIMARY KEY (`id`) USING BTREE,
KEY `idx_name` (`name`,`id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
The query below uses indexes on the two tables as expected:
explain
SELECT * FROM dataset d inner join `directory` dir
on JSON_CONTAINS(dir.datasets, cast(d.id as json))
where d.id = 111;
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE d const PRIMARY PRIMARY 8 const 1 100.00
1 SIMPLE dir range idx_datasets idx_datasets 9 2 100.00 Using where
However, this query uses index only on the left table
explain
SELECT * FROM dataset d inner join `directory` dir
on JSON_CONTAINS(dir.datasets, cast(d.id as json))
where d.name like '111';
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE d range idx_name idx_name 259 1 100.00 Using index condition
1 SIMPLE dir ALL 1000 100.00 Using where; Using join buffer (hash join)
Could someone explain the difference between the two queries?
I changed the condition from "like" to "=", and the result is the same:
explain
SELECT * FROM dataset d inner join `directory` dir
on JSON_CONTAINS(dir.datasets, cast(d.id as json))
where d.name = '111';
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE d ref idx_name idx_name 259 const 1 100.00
1 SIMPLE dir ALL 1000 100.00 Using where; Using join buffer (hash join)
This is caused by the LIKE expression in the WHERE clause of the second query.
Here is why, as explained in some threads:
1- Equals(=) vs. LIKE
2- SQL 'like' vs '=' performance
EDIT: It looks like the issue here is caused by the fact that in the first query you are searching against the primary key, while in the second you are not.
More details in this question:
Behavior of WHERE clause on a primary key field
I have an animals table with about 3 million records. The table has, among a few other columns, an id, name, and owner_id column. I have an animal_breeds table with about 2.5 million records. The table only has an animal_id and breed column.
I'm trying to find the distinct breed values that are associated with a specific owner_id, but the query is taking 20 seconds or so. Here's the query:
SELECT DISTINCT `breed`
FROM `animal_breeds`
INNER JOIN `animals` ON `animals`.`id` = `animal_breeds`.`animal_id`
WHERE `animals`.`owner_id` = ? ;
The tables have all appropriate indices. I can't denormalize the table by adding a breed column to the animals table because it is possible for animals to be assigned multiple breeds. I also have this problem with a few other large tables that have one-to-many relationships.
Is there a more performant way to achieve what I'm looking for? It seems like a pretty simple problem but I can't seem to figure out the best way to achieve this other than pre-calculating and caching the results.
Here is the EXPLAIN output from my query. Notice the "Using temporary":
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 "SIMPLE" "a" NULL "ref" "PRIMARY,animals_animal_id_index" "animals_animal_id_index" "153" "const" 1126303 100.00 "Using index; Using temporary"
1 "SIMPLE" "ab" NULL "ref" "animal_breeds_animal_id_breed_unique,animal_breeds_animal_id_index,animal_breeds_breed_index" "animal_breeds_animal_id_breed_unique" "5" "pedigreeonline.a.id" 1 100.00 "Using index"
And as requested, here are the create table statements (I left off a few unrelated columns and indices from the animals table). I believe the animal_breeds_animal_id_index index on animal_breeds table is redundant because of the unique key on the table, but we can ignore that for now as long as it's not causing the problem :)
CREATE TABLE `animals` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`owner_id` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `animals_animal_id_index` (`owner_id`,`id`),
KEY `animals_name_index` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=2470843 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
CREATE TABLE `animal_breeds` (
`animal_id` int(10) unsigned DEFAULT NULL,
`breed` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
UNIQUE KEY `animal_breeds_animal_id_breed_unique` (`animal_id`,`breed`),
KEY `animal_breeds_animal_id_index` (`animal_id`),
KEY `animal_breeds_breed_index` (`breed`),
CONSTRAINT `animal_breeds_animal_id_foreign` FOREIGN KEY (`animal_id`) REFERENCES `animals` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Any help would be appreciated. Thanks!
With knowledge about your data you can try something like this:
SELECT
b.*
FROM
(
SELECT
DISTINCT `breed`
FROM
`animal_breeds`
) AS b
WHERE
EXISTS (
SELECT
*
FROM
animal_breeds AS ab
INNER JOIN animals AS a ON ab.animal_id = a.id
WHERE
b.breed = ab.breed
AND a.owner_id = ?
)
;
The idea is to get a short list of distinct breeds without any filtering (for a small list this would be quite fast) and then filter that list further with a correlated subquery. As the list is short, only a few subqueries would be executed, and they only check for existence, which is much faster than any grouping (DISTINCT is effectively grouping).
This will only work if your distinct list is quite short.
With randomly generated data based on your answers, the above query gave me the following execution plan:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 2 100.00
3 SUBQUERY a ref PRIMARY,animals_animal_id_index animals_animal_id_index 153 const 1011 100.00 Using index
3 SUBQUERY ab ref animal_breeds_animal_id_breed_unique,`animal_breeds_animal_id_index`,animal_breeds_animal_id_index `animal_breeds_animal_id_index` 5 test.a.id 2 100.00 Using index
2 DERIVED animal_breeds range animal_breeds_animal_id_breed_unique,`animal_breeds_breed_index`,animal_breeds_breed_index `animal_breeds_breed_index` 1022 2 100.00 Using index for group-by
Alternatively, you can try writing the WHERE clause like this:
...
WHERE
b.breed IN (
SELECT
ab.breed
FROM
animal_breeds AS ab
INNER JOIN animals AS a ON ab.animal_id = a.id
WHERE
a.owner_id = ?
)
For this query:
SELECT DISTINCT ab.`breed`
FROM `animal_breeds` ab INNER JOIN
`animals` a
ON a.`id` = ab.`animal_id`
WHERE a.`owner_id` = ? ;
You want indexes on animals(owner_id, id) and animal_breeds(animal_id, breed). The order of the columns in the composite index is important.
With the right index, I imagine that this will be very fast.
EDIT:
According to the EXPLAIN, there are 1,126,303 matches for the value you are using. The time is due to removing duplicates. Given the sizes of the tables, it is surprising that there would be so many rows matching a single value.
Why are these two queries with the only difference being the campaign_id (a foreign key to another table) getting different performance and different EXPLAIN results?
Query 1 - Avg time: 0.21s
SELECT tx_time, campaign_id, tx_amount, tx_status FROM tx WHERE
campaign_id=6963 ORDER BY tx_time DESC LIMIT 2500;
Query 2 - Avg time: 0.29s
SELECT tx_time, campaign_id, tx_amount, tx_status FROM tx WHERE
campaign_id=6946 ORDER BY tx_time DESC LIMIT 2500;
Query 1 vs Query 2 EXPLAIN:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE tx NULL index tx_campaign_id tx_time 4 NULL 85591 2.92 Using where
1 SIMPLE tx NULL ref tx_campaign_id tx_campaign_id 4 const 106312 100 Using index condition; Using filesort
UPDATE: After adding (tx_id,tx_time,campaign_id) and (tx_id,tx_time) indexes and running ANALYZE, Query 1 has improved to 0.15s but Query 2 has slowed to 13s. Updated EXPLAINs:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE tx NULL index tx_campaign_id tx_time 4 NULL 75450 3.31 Using where
1 SIMPLE tx NULL ref tx_campaign_id tx_campaign_id 4 const 117400 100.00 Using index condition; Using filesort
Table tx:
CREATE TABLE tx (
tx_id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
tx_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
campaign_id int(10) unsigned NOT NULL,
tx_amount decimal(12,5) unsigned NOT NULL,
tx_geo varchar(2) NOT NULL,
tx_langauge varchar(511) NOT NULL,
tx_ua varchar(511) NOT NULL,
tx_ip varchar(45) NOT NULL,
tx_status tinyint(255) DEFAULT NULL,
PRIMARY KEY (tx_id),
KEY tx_campaign_id (campaign_id),
KEY tx_time (tx_time) USING BTREE,
KEY tx_amount (tx_amount) USING BTREE,
KEY tx_time_campaign_id (tx_id,tx_time,campaign_id) USING BTREE,
KEY tx_id_time (tx_id,tx_time) USING BTREE,
CONSTRAINT campaign_idcampaign_id FOREIGN KEY (campaign_id) REFERENCES campaign (campaign_id) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=10855433 DEFAULT CHARSET=utf8
You need INDEX(campaign_id, tx_time) with the columns in that order.
In general, put the = column first, namely campaign_id. In this case, that takes care of the entire WHERE clause, so you can move on to the ORDER BY. Then add all the columns in the ORDER BY, namely tx_time.
Having successfully built an index that handles those, then the processing can stop at the LIMIT rows and avoid a 'filesort'.
Index Cookbook
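A minimal sketch of the suggested index, using the tx table from the question. The index name idx_campaign_time is only an illustrative choice, and whether the optimizer actually picks the index should be checked with EXPLAIN on the real data.

```sql
-- Equality column first (campaign_id), then the ORDER BY column (tx_time).
-- The index name is hypothetical.
ALTER TABLE tx ADD INDEX idx_campaign_time (campaign_id, tx_time);

-- Re-check the plan: ideally the new index is chosen and "Using filesort" disappears.
EXPLAIN SELECT tx_time, campaign_id, tx_amount, tx_status FROM tx WHERE
campaign_id=6946 ORDER BY tx_time DESC LIMIT 2500;
```

With such an index, the rows for one campaign_id are already stored in tx_time order, so the server can read them from the newest end and stop once the LIMIT is reached.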
Without seeing your schema, it's hard to be sure, but I'm guessing it's because the optimizer is trying to figure out which index is more useful.
I assume you don't have a compound index (transaction_id, tx_time) on the table; if you had, the optimizer would probably use that (and be faster).
If you think about how the query would work, you can either first find all the records based on transaction ID, and then sort them based on time, or you can sort the records based on time, and discard the ones that don't belong to the transaction id you care about.
The first option (find all the matching transactions, then sort them) is fastest if you have lots of transaction IDs, and not that many time stamps.
The second is fastest if you have lots of time stamps, and not that many transaction IDs. That's why the number of rows considered varies between query plans.
The best way to optimize this is to create the compound index, and to make sure you update the statistics the query optimizer uses.
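Refreshing and inspecting the optimizer statistics could look roughly like this, assuming the tx table from the question. Both statements are standard MySQL, but the cardinality values they report are estimates.

```sql
-- Recalculate the index statistics used by the optimizer.
ANALYZE TABLE tx;

-- Inspect the indexes and their estimated cardinality.
SHOW INDEX FROM tx;
```

Comparing the Cardinality column for the campaign_id and tx_time indexes gives a rough idea of why the optimizer prefers one plan for some values and a different plan for others.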
I have a MyISAM table with a primary key spanning 5 columns. I run a SELECT with a WHERE clause that ANDs conditions on each of those 5 columns. Using the primary key (a multi-column index) it takes 25 s; using a single-column index on one of the columns it takes 1 s. I profiled the query, and most of the 25 s is spent in the “Sending data” stage. The primary key has a cardinality of about 7M and the single-column index about 80. Am I missing something?
CREATE TABLE `mytable` (
`a` int(11) unsigned NOT NULL,
`b` varchar(2) NOT NULL,
`c` int(11) unsigned NOT NULL,
`d` varchar(560) NOT NULL,
`e` varchar(45) NOT NULL,
PRIMARY KEY (`a`,`e`,`d`,`b`,`c`),
KEY `d` (`d`),
KEY `e` (`e`),
KEY `b` (`b`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
EXPLAIN SELECT * FROM mytable USE INDEX (PRIMARY)
WHERE a=12 AND e=1319677200 AND d='69.171.242.53' AND b='*' AND c=0;
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE i ref PRIMARY PRIMARY 4 const 5912231 Using where
EXPLAIN SELECT * FROM mytable
WHERE a=12 AND e=1319677200 AND d='69.171.242.53' AND b='*' AND c=0;
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE i ref PRIMARY,d,e,b d 562 const 158951 Using where
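The problem is caused by implicit type casting: e is a varchar column but is compared with a number, so MySQL can use only the first column (a) of the primary key, as the key_len of 4 in the EXPLAIN shows.
Try quoting the values for every varchar column (b, d, e):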
SELECT * FROM mytable USE INDEX (PRIMARY)
WHERE a=12 AND e='1319677200' AND d='69.171.242.53' AND b='*' AND c=0;