From what I understand about multi-column indexes, they are only useful if you're using columns starting from the left and not skipping any. That is, with an index on (a, b, c), you can query on (a), (a, b), or (a, b, c).
But today I found out that when there's an index (BTREE on an InnoDB table) on:
some_varchar, some_bigint, other_varchar
I can query:
SELECT MAX(some_bigint) FROM the_table
and the plan for it says:
id: 1
select_type: SIMPLE
table: the_table
type: index
possible_keys: NULL
key: index_some_varchar_some_bigint_other_varchar
key_len: 175
ref: NULL
rows: 1
Extra: Using index
This seems to disagree with the docs. It's also confusing since the key is set, but possible_keys isn't.
How does this work in practice? If the index is ordered by some_varchar (or a prefix of it) first, how can MySQL get a MAX of the second column from it?
(a guess would be that MySQL collects some extra information about all columns in an index, but if that's true - is it possible to see it directly?)
My understanding of the indexes was correct, but my understanding of what Using index means was wrong.
Using index doesn't necessarily mean that the value was accessed via a fast lookup. It simply means that the row data was not accessed. When the type is index and the Extra is Using index, it still means that the whole index is being scanned:
From the documentation:
The index join type is the same as ALL, except that the index tree is scanned.
For a MAX lookup which is actually using a prefix of an index, the explain looks like this:
id: 1
select_type: SIMPLE
table: NULL
type: NULL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra: Select tables optimized away
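If you want to verify what a query really did, the session Handler status counters are useful; they count actual storage-engine row and index reads. A minimal sketch, reusing the table and column names from the question (the exact counter values depend on your data):

mysql> FLUSH STATUS;
mysql> SELECT MAX(some_bigint) FROM the_table;
mysql> SHOW SESSION STATUS LIKE 'Handler_read%';

For the full index scan you would expect Handler_read_first to be 1 and Handler_read_next to be roughly the number of rows in the index, whereas a MAX that is "optimized away" leaves the read counters essentially untouched.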
The MySQL 5.7 documentation states:
The filtered column indicates an estimated percentage of table rows that will be filtered by the table condition. That is, rows shows the estimated number of rows examined and rows × filtered / 100 shows the number of rows that will be joined with previous tables.
To attempt to understand this better, I tried it out on a query using the MySQL Sakila Sample Database. The table in question has the following structure:
mysql> SHOW CREATE TABLE film \G
*************************** 1. row ***************************
Table: film
Create Table: CREATE TABLE `film` (
`film_id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) NOT NULL,
`description` text,
`release_year` year(4) DEFAULT NULL,
`language_id` tinyint(3) unsigned NOT NULL,
`original_language_id` tinyint(3) unsigned DEFAULT NULL,
`rental_duration` tinyint(3) unsigned NOT NULL DEFAULT '3',
`rental_rate` decimal(4,2) NOT NULL DEFAULT '4.99',
`length` smallint(5) unsigned DEFAULT NULL,
`replacement_cost` decimal(5,2) NOT NULL DEFAULT '19.99',
`rating` enum('G','PG','PG-13','R','NC-17') DEFAULT 'G',
`special_features` set('Trailers','Commentaries','Deleted Scenes','Behind the Scenes') DEFAULT NULL,
`last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`film_id`),
KEY `idx_title` (`title`),
KEY `idx_fk_language_id` (`language_id`),
KEY `idx_fk_original_language_id` (`original_language_id`),
CONSTRAINT `fk_film_language` FOREIGN KEY (`language_id`) REFERENCES `language` (`language_id`) ON UPDATE CASCADE,
CONSTRAINT `fk_film_language_original` FOREIGN KEY (`original_language_id`) REFERENCES `language` (`language_id`) ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1001 DEFAULT CHARSET=utf8
And this is the EXPLAIN plan for the query:
mysql> EXPLAIN SELECT * FROM film WHERE release_year=2006 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: film
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1000
filtered: 10.00
Extra: Using where
This table's sample dataset has 1,000 total rows, and all of them have release_year set to 2006. Using the formula in the MySQL documentation:
rows x filtered / 100 = "number of rows that will be joined with previous tables"
So,
1,000 x 10 / 100 = 100 = "100 rows will be joined with previous tables"
Huh? What "previous table"? There is no JOIN going on here.
What about the first portion of the quote from the documentation? "Estimated percentage of table rows that will be filtered by the table condition." Well, the table condition is release_year = 2006, and all records have that value, so shouldn't filtered be either 0.00 or 100.00 (depending on what they mean by "filtered")?
Maybe it's behaving strangely because there's no index on release_year? So I created one:
mysql> CREATE INDEX test ON film(release_year);
The filtered column now shows 100.00. So, shouldn't it have shown 0.00 before I added the index? Hm. What if I make half the table have release_year be 2006, and the other half not?
mysql> UPDATE film SET release_year=2017 ORDER BY RAND() LIMIT 500;
Query OK, 500 rows affected (0.03 sec)
Rows matched: 500 Changed: 500 Warnings: 0
Now the EXPLAIN looks like this:
mysql> EXPLAIN SELECT * FROM film WHERE release_year=2006 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: film
partitions: NULL
type: ref
possible_keys: test
key: test
key_len: 2
ref: const
rows: 500
filtered: 100.00
Extra: Using index condition
And, since I decided to confuse myself even further:
mysql> EXPLAIN SELECT * FROM film WHERE release_year!=2006 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: film
partitions: NULL
type: ALL
possible_keys: test
key: NULL
key_len: NULL
ref: NULL
rows: 1000
filtered: 50.10
Extra: Using where
So, an estimate of 501 rows will be filtered by the table condition and "joined with previous tables"?
I simply do not understand.
I realize it's an "estimate", but on what is this estimate based? If an index being present moves the estimate to 100.00, shouldn't its absence be 0.00, not 10.00? And what's with that 50.10 result in the last query?
Is filtered at all useful in determining if a query can be optimized further, or how to optimize it further, or is it generally just "noise" that can be ignored?
…number of rows that will be joined with previous tables…
In the absence of any joins, I believe this can be taken to mean the number of rows returned.
UPDATE: the documentation, now at least, says "following tables", but the point still stands. Thanks @WilsonHauck.
To take each of your examples in turn:
1000 rows, all from 2006, no index…
EXPLAIN SELECT * FROM film WHERE release_year = 2006
key: NULL
rows: 1000
filtered: 10.00
Extra: Using where
Here the engine expects to visit 1000 rows, and expects to return around 10% of them.
As the query is not using an index, it makes sense to predict that every row will be checked, but unfortunately the filtered estimate is inaccurate. I don't know how the engine makes this prediction, but since it doesn't know that all the rows are from 2006 (until it checks them), it's not the craziest thing in the world.
Perhaps, in the absence of further information, the engine expects any simple = condition to reduce the result set to 10% of the available rows.
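If you want to see where the optimizer gets a number like that, MySQL 5.6+ can dump the optimizer trace. A sketch using the film table from the question (the structure of the trace varies between versions):

mysql> SET optimizer_trace = 'enabled=on';
mysql> SELECT * FROM film WHERE release_year = 2006;
mysql> SELECT trace FROM information_schema.OPTIMIZER_TRACE \G
mysql> SET optimizer_trace = 'enabled=off';

In 5.7 the trace includes the condition-filtering estimates, so you can see whether a fixed default selectivity or a statistics-based guess was applied.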
1000 rows, half from 2006, with index…
EXPLAIN SELECT * FROM film WHERE release_year = 2006
key: test
rows: 500
filtered: 100.00
Extra: Using index condition
Here the engine expects to visit 500 rows and expects to return all of them.
Now that the query is using the new index, the engine can make more accurate predictions. It can very quickly see that 500 rows match the condition, and will have to visit only and exactly those to satisfy the query.
EXPLAIN SELECT * FROM film WHERE release_year != 2006
key: NULL
rows: 1000
filtered: 50.10
Extra: Using where
Here the engine expects to visit 1000 rows and return 50.10% of them
The engine has opted not to use the index; perhaps the != operation is not quite as simple as = in this case, and therefore it makes sense to predict that every row will be visited.
The engine has, however, made a fairly accurate prediction of how many of the visited rows will be returned. I don't know where the extra 0.10% comes from, but perhaps the engine has used the index or the results of previous queries to recognise that around 50% of the rows will match the condition.
It's a bit of a dark art, but the filtered value does give you some fairly useful information, and some insight into why the engine has made certain decisions
If the number of rows is high and the filtered rows estimate is low (and accurate), it may be a good indication that a carefully applied index could speed up the query
how can I make use of it?
High numbers (ideally filtered: 100.00) indicate that the query is using a "good" index, or that an index would be useless.
Consider a table with a deleted_at TIMESTAMP NULL column (soft deletion) and no index on it, where about 99% of the rows contain NULL (i.e. are not deleted). Now with a query like
SELECT * FROM my_table WHERE deleted_at IS NULL
you might see
filtered: 99.00
In this case an index on deleted_at would be useless, due to the overhead of a second lookup (finding the filtered rows in the clustered index). In the worst case the index might even hurt performance, if the optimizer decides to use it.
But if you query for "deleted" rows with
SELECT * FROM my_table WHERE deleted_at IS NOT NULL
you should get something like
filtered: 1.00
The low number indicates that the query could benefit from an index. If you now create the index on (deleted_at), EXPLAIN will show you
filtered: 100.00
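As a sketch of that last step (assuming the table is literally named my_table, as in the example above):

mysql> CREATE INDEX idx_deleted_at ON my_table (deleted_at);
mysql> EXPLAIN SELECT * FROM my_table WHERE deleted_at IS NOT NULL \G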
I would say: anything >= 10% is not worth creating an index for, at least for single-column conditions.
A different story is when you have a condition on multiple columns like
WHERE a=1 AND b=2
Assuming 1M rows in the table and a cardinality of 10 for both columns (each column contains 10 distinct values, randomly distributed), with an index on (a) the engine would analyze 100K rows (10%, due to the index on a) and return 10K rows (10% of 10%, due to the condition on b). EXPLAIN should show you rows: 100000, filtered: 10.00. In this case extending the single-column index on (a) to a composite index on (a, b) should improve the query time by a factor of 10, and EXPLAIN should show you rows: 10000, filtered: 100.00.
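A sketch of that hypothetical scenario (the table and index names here are made up for illustration):

CREATE TABLE t (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  a INT NOT NULL,
  b INT NOT NULL,
  KEY idx_a (a)
) ENGINE=InnoDB;

-- with ~1M rows and 10 distinct values in each of a and b:
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 2;
-- expected (roughly): rows: 100000, filtered: 10.00

ALTER TABLE t ADD KEY idx_a_b (a, b);
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 2;
-- expected (roughly): rows: 10000, filtered: 100.00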
However, that is all more theory than practice. The reason: I often see filtered: 100.00 when it should rather be 1.00, at least for low-cardinality columns, and at least on MariaDB. That might be different for MySQL (I can't test that right now), but your example shows similar behavior (10.00 instead of 100.00).
Actually, I don't remember when the filtered value has ever helped me. The first things I look at are: the order of the tables (if it's a JOIN), the used key, the used key length, and the number of examined rows.
From the existing 5.7 documentation, currently at
https://dev.mysql.com/doc/refman/5.7/en/explain-output.html
filtered (JSON name: filtered)
The filtered column indicates an estimated percentage of table rows that will be filtered by the table condition. The maximum value is 100, which means no filtering of rows occurred. Values decreasing from 100 indicate increasing amounts of filtering. rows shows the estimated number of rows examined and rows × filtered shows the number of rows that will be joined with the following table. For example, if rows is 1000 and filtered is 50.00 (50%), the number of rows to be joined with the following table is 1000 × 50% = 500.
So you would have to write one of these engines to understand it perfectly, but the estimate is based not on the contents themselves; it is based on metadata about the contents, and on statistics.
Let me give you a specific made-up example. I'm not saying any SQL platform does exactly what I describe here; this is just an illustration:
You have a table with 1000 rows, where the max value for the year column is 2010 and the min value is 2000. Without any other information, you can "guess" that WHERE year = 2007 will match about 10% of all rows, assuming an even distribution.
In this case EXPLAIN would show rows: 1000 and filtered: 10.00.
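Worked out under that assumption: the range 2000 through 2010 contains 11 possible years, so an equality condition would be guessed to match about 1000 / 11 ≈ 91 of the 1000 rows, i.e. roughly 9%, which an engine might well round to the 10% shown above.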
To answer your final question: filtered might be useful if (as shown above) you have only one "default" value that is throwing everything off; you might decide to use, say, NULL instead of a default to get your queries to perform better. Or you might see that statistics need to be run on your tables more often because the ranges change a lot. This depends a lot on the given platform and your data model.
I find the "filtered" column to be useless.
EXPLAIN (today) uses crude statistics to derive many of the numbers it shows. "Filtered" is an example of how bad they can be.
To get even deeper into numbers, run EXPLAIN FORMAT=JSON SELECT ... This, in newer versions of MySQL, will provide the "cost" for each possible execution plan. Hence, it gives you clues of what options it thought about and the "cost basis" for the plan that was picked. Unfortunately, it uses a constant for fetching a row -- giving no weighting to whether the row came from disk or was already cached.
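For example (using the film table from earlier in this question; the exact JSON fields vary between versions):

mysql> EXPLAIN FORMAT=JSON SELECT * FROM film WHERE release_year = 2006 \G

In 5.7 the JSON output includes a cost_info block (read_cost, eval_cost, prefix_cost) for the chosen plan, alongside rows_examined_per_scan and filtered, which is where the constant per-row fetch cost mentioned above shows up.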
A more precise metric of what work was done can be derived after the fact via the STATUS "Handler%" values. I discuss that, plus simple optimization techniques, in http://mysql.rjweb.org/doc.php/index_cookbook_mysql.
Histograms exist in MySQL 8.0 and MariaDB 10.0; they will provide more precision. They probably help make "filtered" somewhat useful.
Is there a problem, issue, or performance hit when using duplicate WHERE clauses?
Example SQL code:
SELECT * FROM `table`
WHERE `field` = 1
AND `field` = 1
AND `field2` = 22
AND `field2` = 22
Does the optimizer eliminate the duplicates?
The WHERE clause works like an if condition in any programming language: it compares a given value with the field value in the MySQL table.
If the given value is equal to the field value in the table, that row is returned.
You won't face any problems or issues from duplicate conditions, but at a bigger scale this might slightly decrease performance.
EDIT:
You can issue an EXPLAIN statement, which tells MySQL to display some information about how it would execute a SELECT query without actually executing it.
This way you can see exactly what is going to be executed.
To use EXPLAIN, just put the word EXPLAIN in front of the SELECT statement:
mysql> EXPLAIN SELECT * FROM table WHERE 0\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: NULL
type: NULL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra: Impossible WHERE
Normally, EXPLAIN returns more information than that, including non-NULL information about the indexes that will be used to scan tables, the types of joins that will be used, and estimates of the number of rows that will need to be examined from each table.
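A way to check what the optimizer actually does with the duplicated conditions is to look at the rewritten statement MySQL reports after an EXPLAIN (you need EXPLAIN EXTENDED in 5.6 and earlier; from 5.7 plain EXPLAIN is enough). A sketch using the query from the question:

mysql> EXPLAIN EXTENDED SELECT * FROM `table`
    -> WHERE `field` = 1 AND `field` = 1
    -> AND `field2` = 22 AND `field2` = 22;
mysql> SHOW WARNINGS \G

The Note 1003 warning contains the statement as the optimizer rewrote it, so you can see whether the duplicate predicates were collapsed.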
I have a table `t_table1` that includes 3 fields:
`field1` tinyint(1) unsigned default NULL,
`field2` tinyint(1) unsigned default NULL,
`field3` tinyint(1) unsigned NOT NULL default '0',
and an index:
KEY `t_table1_index1` (`field1`,`field2`,`field3`),
When I run this SQL1:
EXPLAIN SELECT * FROM table1 AS c WHERE c.field1 = 1 AND c.field2 = 0 AND c.field3 = 0
Then it shows:
select_type: SIMPLE
type: ALL
possible_keys: t_table1_index1
key: NULL
key_len: NULL
rows: 1042
Extra: Using where
I think this says that my index is useless in this case.
But when I run this SQL2:
EXPLAIN SELECT * FROM table1 AS c WHERE c.field1 = 1 AND c.field2 = 1 AND c.field3 = 1
it shows:
select_type: SIMPLE
type: ref
possible_keys: t_table1_index1
key: t_table1_index1
key_len: 5
ref: const,const,const
rows: 1
Extra: Using where
In this case it used my index.
So please explain to me:
Why can SQL1 not use the index?
For SQL1, how can I change the index or rewrite the SQL so it performs more quickly?
Thanks!
The query optimizer uses many data points to decide how to execute a query. One of those is the selectivity of the index. To use an index requires potentially more disk accesses per row returned than a table scan because the engine has to read the index and then fetch the actual row (unless the entire query can be satisfied from the index alone). As the index becomes less selective (i.e. more rows match the criteria) the efficiency of using that index goes down. At some point it becomes cheaper to do a full table scan.
In your second example the optimizer was able to ascertain that the values you provided would result in only one row being fetched, so the index lookup was the correct approach.
In the first example, the values were not very selective, with an estimate of 1042 rows being returned out of 1776. Using the index would mean searching the index, building a list of selected row references, and then fetching each row. With so many rows being selected, using the index would have been more work than just scanning the entire table linearly and filtering out the non-matching rows.
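If you want to see the numbers the optimizer is weighing, you can compare its estimate against the real selectivity, and against a forced-index plan (table and index names as in the question):

mysql> SELECT COUNT(*) FROM table1 WHERE field1 = 1 AND field2 = 0 AND field3 = 0;
mysql> EXPLAIN SELECT * FROM table1 AS c FORCE INDEX (t_table1_index1)
    -> WHERE c.field1 = 1 AND c.field2 = 0 AND c.field3 = 0 \G

If the forced-index plan still has to visit most of the table, the optimizer's preference for a full scan is probably correct.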
From time to time I encounter strange MySQL behavior. Let's assume I have indexes (type, rel, created), (type), and (rel). The best choice for a query like this one:
SELECT id FROM tbl
WHERE rel = 3 AND type = 3
ORDER BY created;
would be to use index (type, rel, created).
But MySQL decides to intersect the indexes (type) and (rel), and that leads to worse performance. Here is an example:
mysql> EXPLAIN
-> SELECT id FROM tbl
-> WHERE rel = 3 AND type = 3
-> ORDER BY created\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: tbl
type: index_merge
possible_keys: idx_type,idx_rel,idx_type_rel_created
key: idx_type,idx_rel
key_len: 1,2
ref: NULL
rows: 4343
Extra: Using intersect(idx_type,idx_rel); Using where; Using filesort
And the same query, but with a hint added:
mysql> EXPLAIN
-> SELECT id FROM tbl USE INDEX (idx_type_rel_created)
-> WHERE rel = 3 AND type = 3
-> ORDER BY created\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: tbl
type: ref
possible_keys: idx_type_rel_created
key: idx_type_rel_created
key_len: 3
ref: const,const
rows: 8906
Extra: Using where
I think MySQL picks the execution plan with the smaller number in the rows column of the EXPLAIN output. From that point of view, the index intersection with 4343 rows really does look better than my combined index with 8906 rows. So maybe the problem lies in those numbers?
mysql> SELECT COUNT(*) FROM tbl WHERE type=3 AND rel=3;
+----------+
| COUNT(*) |
+----------+
| 3056 |
+----------+
From this I can conclude that MySQL is mistaken in calculating the approximate number of rows for the combined index.
So, what can I do here to make MySQL take the right execution plan?
I cannot use optimizer hints, because I have to stick to the Django ORM.
The only solution I have found so far is to remove those single-column indexes.
MySQL version is 5.1.49.
The table structure is:
CREATE TABLE tbl (
`id` int(11) NOT NULL AUTO_INCREMENT,
`type` tinyint(1) NOT NULL,
`rel` smallint(2) NOT NULL,
`created` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_type` (`type`),
KEY `idx_rel` (`rel`),
KEY `idx_type_rel_created` (`type`,`rel`,`created`)
) ENGINE=MyISAM;
It's hard to tell exactly why MySQL chooses index_merge intersection over the index scan, but you should note that for a composite index, statistics are stored for each leading prefix of its columns.
The value of information_schema.statistics.cardinality for the column rel of the composite index will show the cardinality of the prefix (type, rel), not of rel by itself.
If there is a correlation between rel and type, then the cardinality of (type, rel) will be less than the product of the cardinalities of rel and type taken separately from the indexes on the corresponding columns.
That's why the number of rows is calculated incorrectly (an intersection cannot be larger in size than a union).
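You can inspect those stored prefix cardinalities directly (tbl and the index names below are from the question):

mysql> SELECT INDEX_NAME, SEQ_IN_INDEX, COLUMN_NAME, CARDINALITY
    -> FROM information_schema.STATISTICS
    -> WHERE TABLE_NAME = 'tbl'
    -> ORDER BY INDEX_NAME, SEQ_IN_INDEX;

For idx_type_rel_created, the CARDINALITY shown on the rel row is that of the (type, rel) prefix, not of rel alone.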
You can forbid index_merge intersection by setting it to off in @@optimizer_switch:
SET optimizer_switch = 'index_merge_intersection=off';
Another thing worth mentioning: you would not have the problem if you deleted the index on type alone. That index is not required anyway, since it duplicates the leading part of the composite index.
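For example:

mysql> ALTER TABLE tbl DROP INDEX idx_type;

The composite index idx_type_rel_created can serve any query that the single-column idx_type could, because type is its leftmost column.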
Sometimes an intersection on the same table can be useful, and you may not want to remove a single-column index, because some other query works well with the intersection.
In such a case, if the bad execution plan concerns only one specific query, a solution is to exclude the unwanted index. That prevents the use of the intersection for that query only:
In your example :
SELECT id FROM tbl IGNORE INDEX(idx_type)
WHERE rel = 3 AND type = 3
ORDER BY created;