Why does removing this index in MySQL speed up my query 100x? - mysql

I have the following MySQL table (simplified):
CREATE TABLE `track` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title` varchar(256) NOT NULL,
`is_active` tinyint(1) NOT NULL,
PRIMARY KEY (`id`),
KEY `is_active` (`is_active`, `id`)
) ENGINE=MyISAM AUTO_INCREMENT=7495088 DEFAULT CHARSET=utf8
The 'is_active' column marks rows that I want to ignore in most, but not all, of my queries. I have some queries that read chunks out of this table periodically. One of them looks like this:
SELECT id,title from track where (track.is_active=1 and track.id > 5580702) ORDER BY id ASC LIMIT 10;
This query takes over a minute to execute. Here's the execution plan:
> EXPLAIN SELECT id,title from track where (track.is_active=1 and track.id > 5580702) ORDER BY id ASC LIMIT 10;
+----+-------------+-------+------+----------------+--------+---------+-------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------+--------+---------+-------+---------+-------------+
| 1 | SIMPLE | t | ref | PRIMARY,is_active | is_active | 1 | const | 3747543 | Using where |
+----+-------------+-------+------+----------------+--------+---------+-------+---------+-------------+
Now, if I tell MySQL to ignore the 'is_active' index, the query happens instantaneously.
> EXPLAIN SELECT id,title from track IGNORE INDEX(is_active) WHERE (track.is_active=1 AND track.id > 5580702) ORDER BY id ASC LIMIT 10;
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
| 1 | SIMPLE | t | range | PRIMARY | PRIMARY | 4 | NULL | 1597518 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
Now, what's really strange is that if I FORCE MySQL to use the 'is_active' index, the query once again happens instantaneously!
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
| 1 | SIMPLE | t | range | is_active |is_active| 5 | NULL | 1866730 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
I just don't understand this behavior. In the 'is_active' index, rows should be sorted by is_active, followed by id. I use both the 'is_active' and 'id' columns in my query, so it seems like it should only need to do a few hops around the tree to find the IDs, then use those IDs to retrieve the titles from the table.
What's going on?
EDIT: More info on what I'm doing:
Query cache is disabled
Running OPTIMIZE TABLE and ANALYZE TABLE had no effect
6,620,372 rows have 'is_active' set to True. 874,714 rows have 'is_active' set to False.
Using FORCE INDEX(is_active) once again speeds up the query.
MySQL version 5.1.54

It looks like MySQL is making a poor decision about how to use the index.
From that query plan, it is showing it could have used either the PRIMARY or is_active index, and it has chosen is_active in order to narrow by track.is_active first. However, it is only using the first column of the index (track.is_active). That gets it 3747543 results which then have to be filtered and sorted.
If it had chosen the PRIMARY index, it would be able to narrow down to 1597518 rows using the index, and they would be retrieved in order of track.id already, which should require no further sorting. That would be faster.
New information:
In the third case, where you are using FORCE INDEX, MySQL is using the is_active index, but now instead of using only the first column it is using both columns (see key_len: 5 rather than 1). It is therefore able to narrow by is_active and to sort and filter by id using the same index. Since is_active is a single constant, the ORDER BY is satisfied by the second column (i.e. the rows from a single branch of the index are already in sorted order). This seems to be an even better outcome than using PRIMARY, and probably what you intended in the first place, right?
I don't know why it wasn't using both columns of this index without FORCE INDEX, unless the query has changed in a subtle way in between. If not I'd put it down to MySQL making bad decisions.
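To make the "both columns" case concrete, here is a minimal Python sketch that models the (is_active, id) index as a sorted list of tuples. It is an illustration only, not how MyISAM stores its BTree; the data and the helper name next_active_ids are made up.

```python
import bisect

# Hypothetical miniature of the (is_active, id) index: a sorted list of tuples.
# Made-up data: ids 1..100, where every id divisible by 3 is inactive.
index = sorted((0 if row_id % 3 == 0 else 1, row_id) for row_id in range(1, 101))

def next_active_ids(index, after_id, limit):
    """Seek once to just past (1, after_id), then read `limit` entries in order."""
    start = bisect.bisect_right(index, (1, after_id))
    # Entries within the is_active = 1 branch are already sorted by id,
    # so no separate sort step is needed for ORDER BY id.
    return [row_id for active, row_id in index[start:start + limit] if active == 1]

print(next_active_ids(index, 50, 10))
```

With both columns in play, the work is one seek plus a short forward read. When only the first column is used, every is_active = 1 entry has to be visited and filtered on id afterwards.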

I think the speedup is due to your where clause. I am assuming that it is only retrieving a small subset of the rows in the entire large table. It is faster to do a table scan of the retrieved data for is_active on the small subset than to do the filtering through a large index file. Traversing a single column index is much faster than traversing a combined index.

A few things you could try:
Run OPTIMIZE and CHECK on your table, so MySQL will re-calculate index values
Have a look at http://dev.mysql.com/doc/refman/5.1/en/index-hints.html - you can tell MySQL which index to choose in different cases

Related

mysql doesn't use index. help me figure out why

Here is the table test:
show create table test;
CREATE TABLE `test` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`body` longtext NOT NULL,
`timestamp` int(11) NOT NULL,
`handle_after` datetime NOT NULL,
`status` varchar(100) NOT NULL,
`queue_id` varchar(255) NOT NULL,
PRIMARY KEY (`id`),
KEY `idxTimestampStatus` (`timestamp`,`status`),
KEY `idxTimestampStatus2` (`status`,`timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=80000 DEFAULT CHARSET=utf8
There are two SELECTs:
1) select * from test where status = 'in_queue' and timestamp > 1625721850;
2) select id from test where status = 'in_queue' and timestamp > 1625721850;
For the first SELECT, EXPLAIN shows that no index is used.
For the second SELECT, the index idxTimestampStatus2 is used.
MariaDB [db]> explain select * from test where status = 'in_queue' and timestamp > 1625721850;
+------+-------------+-------+------+----------------------------------------+------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+----------------------------------------+------+---------+------+----------+-------------+
| 1 | SIMPLE | test | ALL | idxTimestampStatus,idxTimestampStatus2 | NULL | NULL | NULL | 80000 | Using where |
+------+-------------+-------+------+----------------------------------------+------+---------+------+----------+-------------+
MariaDB [db]> explain select id from test where status = 'in_queue' and timestamp > 1625721850;
+------+-------------+-------+------+----------------------------------------+---------------------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+----------------------------------------+---------------------+---------+-------+------+--------------------------+
| 1 | SIMPLE | test | ref | idxTimestampStatus,idxTimestampStatus2 | idxTimestampStatus2 | 302 | const | 4 | Using where; Using index |
+------+-------------+-------+------+----------------------------------------+---------------------+---------+-------+------+--------------------------+
Help me figure out what I'm doing wrong.
How should I create an index for the first SELECT?
Why does the number of columns affect index usage?
What you saw is to be expected. (The "number of columns" did not cause what you saw.) Read all the points below; various combinations of them should address all the issues raised in both the Question and Comments.
Deciding between index and table scan:
The Optimizer uses statistics to decide between using an index and doing a full table scan.
If less than (about) 20% of the rows need to be fetched, the index will be used. This involves bouncing back and forth between the index's BTree and the data's BTree.
If more of the table is needed, then it is deemed more efficient to simply scan the table, ignoring any rows that don't match the WHERE.
The "20%" is not a hard-and-fast number.
SELECT id ... status ... timestamp;
In InnoDB, a secondary index implicitly includes the columns of the PRIMARY KEY.
If all the columns mentioned in the query are in an index, then that index is "covering". This means that all the work can be done in the index's BTree without touching the data's BTree.
Using index == "covering". (That is, EXPLAIN gives this clue.)
"Covering" overrides the "20%" discussion.
SELECT * ... status ... timestamp;
SELECT * needs to fetch all columns, so "covering" does not apply and the "20%" becomes relevant.
If 1625721850 were a larger number (so that fewer rows matched), the EXPLAIN would switch from ALL to using an index.
idxTimestampStatus2 (status,timestamp)
The order of the clauses in WHERE does not matter.
The order of the columns in a "composite" index is important. ("Composite" == multi-column)
Put the = column(s) first, then one "range" (eg >) column.
More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
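The "covering" rule above can be sketched as a simple set check. This is a simplified model under the assumptions stated in the answer (InnoDB secondary indexes implicitly carry the PRIMARY KEY columns); the function name is_covering is hypothetical and is not a MySQL API.

```python
def is_covering(index_columns, primary_key_columns, query_columns):
    """Return True if every column the query touches is available in the index.

    Assumes an InnoDB-style secondary index, which implicitly includes
    the PRIMARY KEY columns.
    """
    available = set(index_columns) | set(primary_key_columns)
    return set(query_columns) <= available

idx_timestamp_status2 = ["status", "timestamp"]
primary_key = ["id"]
all_columns = ["id", "body", "timestamp", "handle_after", "status", "queue_id"]

# SELECT id ... WHERE status = ... AND timestamp > ...
print(is_covering(idx_timestamp_status2, primary_key, ["id", "status", "timestamp"]))
# SELECT * ... WHERE status = ... AND timestamp > ...
print(is_covering(idx_timestamp_status2, primary_key, all_columns))
```

The first query can be answered from the index alone, which is why EXPLAIN reports "Using index"; the second needs columns such as body that only exist in the data BTree.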

How do < and > affect index usage in a SQL query?

Why does one of these queries use the index while the other does not?
CREATE TABLE `testtable` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`a` int(11) NOT NULL,
`b` int(11) NOT NULL,
`c` int(11) NOT NULL,
`d` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_abd` (`a`,`b`,`d`)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=utf8;
explain select * from testtable where a > 1;
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | testtable | NULL | ALL | idx_abd | NULL | NULL | NULL | 10 | 80.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
explain select * from testtable where a < 1;
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+-----------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+-----------------------+
| 1 | SIMPLE | testtable | NULL | range | idx_abd | idx_abd | 4 | NULL | 1 | 100.00 | Using index condition |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+-----------------------+
Why can't the first query use the index when the second one does?
How does the index work internally?
In the first case, the MySQL optimizer decided (based on statistics) that it is better to do a full table scan than to do index lookups followed by data lookups.
In your first query, the condition used (a > 1) effectively needs to access 10 out of 11 rows. Always remember that MySQL does cost-based optimization (it tries to minimize the cost). The process is basically:
Assign a cost to each operation.
Evaluate how many operations each possible plan would take.
Sum up the total.
Choose the plan with the lowest overall cost.
Now, default MySQL cost for io_block_read_cost is 1. In the first query, you are going to roughly have two times the I/O block reads (first for index lookups and then Data lookups). So, the cost would come out roughly as 20, in case MySQL decides to use the index. Instead, if it does the Table Scan directly, the cost would be roughly 11 (Data lookup on all the rows). That is why, it decided to use Table Scan instead of Range based Index Scan.
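The arithmetic above can be written out as a small Python sketch. This is a deliberately simplified model of the reasoning in this answer, not MySQL's actual cost model; the function names are made up, and the only figure taken from the text is the io_block_read_cost default of 1.

```python
IO_BLOCK_READ_COST = 1.0  # default value quoted in the answer

def index_plan_cost(matching_rows):
    # Simplification: one index read plus one data read per matching row.
    return 2 * matching_rows * IO_BLOCK_READ_COST

def table_scan_cost(total_rows):
    # Simplification: one data read per row in the table.
    return total_rows * IO_BLOCK_READ_COST

def choose_plan(matching_rows, total_rows):
    """Pick whichever plan has the lower estimated cost."""
    if index_plan_cost(matching_rows) < table_scan_cost(total_rows):
        return "range scan on index"
    return "full table scan"

# a > 1: roughly 10 of 11 rows match, so the index plan costs about 20 versus 11.
print(choose_plan(matching_rows=10, total_rows=11))
# a < 1: roughly 1 row matches, so the index plan costs about 2 versus 11.
print(choose_plan(matching_rows=1, total_rows=11))
```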
If you want details about the cost breakdown, run each of these queries with EXPLAIN FORMAT=JSON prefixed to it, like below:
EXPLAIN format=JSON select * from testtable where a > 1;
You can also see how Optimizer compared various plans before locking into a particular strategy. To do this, execute the queries below:
/* Turn tracing on (it's off by default): */
SET optimizer_trace="enabled=on";
SELECT * FROM testtable WHERE a > 1; /* your query here */
SELECT * FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
/* possibly more queries...
When done with tracing, disable it: */
SET optimizer_trace="enabled=off";
Check more details at MySQL documentation: https://dev.mysql.com/doc/internals/en/optimizer-tracing.html
The alternative is to read both the index and the data pages. On such small data, that can be less efficient (although the difference in performance -- like the duration of each query -- is quite small).
Your table has 10 rows, which presumably are all on a single data page. MySQL considers it more efficient to just read the 10 rows directly and do the comparison.
The value of indexes is when you have larger tables, particularly tables that span many data pages. One primary use is to reduce the number of data pages being read.

Query performance on primary index vs index

I have a table on mysql and two queries whose performances are quite different. I have extracted plans of the queries, but I couldn't fully understand the reason behind the performance difference.
The table:
+-------------+----------------------------------------------+------------------------------------+
| TableA | | |
+-------------+----------------------------------------------+------------------------------------+
| id | int(10) unsigned NOT NULL AUTO_INCREMENT | |
| userId | int(10) | unsigned DEFAULT NULL |
| created | timestamp | NOT NULL DEFAULT CURRENT_TIMESTAMP |
| PRIMARY KEY | id | |
| KEY userId | userId | |
| KEY created | created | |
+-------------+----------------------------------------------+------------------------------------+
Keys/Indices: The primary key on id field, a key on userId field ASC, another key on created field ASC.
tableA is a very big table, it contains millions of rows.
The query I run on this table is:
The user with id 1234 has 1.5M records in this table. I want to fetch its latest 100 rows. In order to achieve this, I have 2 different queries:
Query 1:
SELECT * FROM tableA USE INDEX (userId)
WHERE userId=1234 ORDER BY created DESC LIMIT 100;
Query 2:
SELECT * FROM tableA
WHERE userId=1234 ORDER BY id DESC LIMIT 100;
Since id field of tableA is auto increment, the condition of being latest is preserved. These 2 queries return the same result. However, there is a huge performance difference.
Query plans are:
+----------+-----------------------------------------------+-------------------------------+------+---------------------------------------+
| Query No | Operation | Params | Rows | Raw desc |
+----------+-----------------------------------------------+-------------------------------+------+---------------------------------------+
| Query 1 | Sort(using file sort) Unique index scan (ref) | table: tableA; index: userId; | 2.5M | Using index condition; Using filesort |
| Query 2 | Unique index scan (ref) | table: tableA; index: userId; | 2.5M | Using where |
+----------+-----------------------------------------------+-------------------------------+------+---------------------------------------+
+--------+-------------+
| | Performance |
+--------+-------------+
| Query1 | 7.5 s |
+--------+-------------+
| Query2 | 741 ms |
+--------+-------------+
I understand that there is a sorting operation in Query 1. In each query, the index used is userId. But why is no sorting needed in Query 2? How does the primary key affect this?
Mysql 5.7
Edit: There are more columns on the table, I have extracted them from the table definition above.
Since id field of tableA is auto increment, the condition of being latest is preserved.
That is usually a valid statement.
WHERE userId=1234 ORDER BY created DESC LIMIT 100
needs this 'composite' index: (userId, created). With that, it will hit only 100 rows, regardless of the table size or the number of rows for that user.
The same goes for
WHERE userId=1234 ORDER BY id DESC LIMIT 100;
Namely that it needs (userId, id). However, in InnoDB, when you say INDEX(x) it silently tacks on the PRIMARY KEY columns. So you effectively get INDEX(x,id). This is why your plain INDEX(userId) worked well.
EXPLAIN rarely (if ever) takes into account the LIMIT. This is why 'Rows' is "2.5M" for both queries.
The first query might (or might not) have used INDEX(userId) if you took out the USE INDEX hint. The choice depends on what percentage of the table has userId = 1234. If it is less than about 20%, the index would be used. But it would bounce back and forth between the secondary index and the data -- all 1.5 million times. If more than 20%, it would avoid the bouncing by simply reading all the "millions" of rows, ignoring those that don't apply.
Note: What you had for Q1 will still read at least 1.5M rows, sort them ("Using filesort"), then peel off the desired 100. But with INDEX(userId, created), it can skip the sort and look at only 100 rows.
I cannot explain "Unique index scan" without seeing SHOW CREATE TABLE and the un-annotated EXPLAIN. (EXPLAIN FORMAT=JSON SELECT... might provide more insight.)
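The difference between INDEX(userId) plus a filesort and a composite INDEX(userId, created) can be sketched in Python. This is a toy model with made-up data, assuming the index behaves like a sorted list of (userId, created, id) tuples; the function names are hypothetical.

```python
import bisect

# Made-up table rows: (id, userId, created). Odd ids belong to user 1234.
table = [(i, 1234 if i % 2 else 99, 1000 + i) for i in range(1, 2001)]

# Hypothetical INDEX(userId, created), with the PRIMARY KEY (id) appended.
index = sorted((user_id, created, row_id) for row_id, user_id, created in table)

def latest_via_composite_index(index, user_id, limit):
    """Seek to the end of the user's range, then walk backward `limit` entries."""
    pos = bisect.bisect_left(index, (user_id + 1,)) - 1
    result = []
    while pos >= 0 and index[pos][0] == user_id and len(result) < limit:
        result.append(index[pos][2])
        pos -= 1
    return result

def latest_via_filesort(table, user_id, limit):
    """Collect every row for the user, sort them all, then keep `limit` rows."""
    matching = [(created, row_id) for row_id, uid, created in table if uid == user_id]
    matching.sort(reverse=True)
    return [row_id for _, row_id in matching[:limit]]

print(latest_via_composite_index(index, 1234, 100)[:5])
```

Both functions return the same ids, but the first touches only 100 index entries, while the second has to gather and sort all 1,000 of the user's rows before discarding most of them.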

Why is my query checking 1000's of rows even though the table is indexed?

The table has around 20K rows and the following create code:
CREATE TABLE `inventory` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`TID` int(11) DEFAULT NULL,
`RID` int(11) DEFAULT NULL,
`CID` int(11) DEFAULT NULL,
`value` text COLLATE utf8_unicode_ci,
PRIMARY KEY (`ID`),
KEY `index_TID_CID_value` (`TID`,`CID`,`value`(25))
);
and this is the result of the explain query
mysql> explain select rowID from inventory where TID=4 and CID=28 and value=3290843588097;
+----+-------------+------------+------+------------------------+-----------------------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+------------------------+-----------------------+---------+-------------+------+-------------+
| 1 | SIMPLE | inventory | ref | index_TID_CID_value | index_TID_CID_value | 10 | const,const | 9181 | Using where |
+----+-------------+------------+------+------------------------+-----------------------+---------+-------------+------+-------------+
1 row in set (0.00 sec)
The combination of TID=4 and CID=28 has around 13K rows in the table.
My questions are:
Why is the explain result telling me that around 9k rows will be examined to get the final result?
Why is the ref column showing only const,const? Since 3 columns are included in the multi-column index, shouldn't ref be const,const,const?
Update 7 Oct 2016
Query:
select rowID from inventory where TID=4 and CID=28 and value=3290843588097;
I ran it about 10 times and took the times of the last five (they were the same)
No index - 0.02 seconds
Index (TID, CID) - 0.03 seconds
Index (TID, CID, value) - 0.00 seconds
Also, the same explain query looks different today. How? Note that key_len has changed to 88, ref has changed to const,const,const, and the rows to examine have dropped to 2.
mysql> explain select rowID from inventory where TID=4 and CID=28 and value='3290843588097';
+----+-------------+-----------+------+----------------------+---------------------+---------+-------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+----------------------+---------------------+---------+-------------------+------+-------------+
| 1 | SIMPLE | inventory | ref | index_TID_CID_value | index_TID_CID_value | 88 | const,const,const | 2 | Using where |
+----+-------------+-----------+------+----------------------+---------------------+---------+-------------------+------+-------------+
1 row in set (0.04 sec)
To explicitly answer your questions.
The explain plan is giving you ~9k rows queried due to the fact that the engine needs to search through the index tree to find the row IDs that match your where-clause criteria. The index is going to produce a mapping of each possible combination of the index column values to a list of the rowIDs associated with that combination. In effect, the engine is searching those combinations to find the right one; this is done by scanning the combination, hence the ~9k amount.
Since your where-clause criteria involves all three of the index columns, the engine is optimizing the search by leveraging the index for the first two columns, and then short-circuiting the third column and getting all rowID results for that combination.
In your specific use-case, I'm assuming you want to optimize performance of the search. I would recommend that you create just an index on TID and CID (not value). The reason for this is that you currently only have 2 combinations of these values out of ~20k records. This means that using an index with just 2 columns, the engine will be able to almost immediately cut out half of the records when doing a search on all three values. (This is all assuming that this index will be applied to a table with a much larger dataset.) Since your metrics are based off a smaller dataset, you may not be seeing the order of magnitude of performance differences between using the index and not.

MySQL using different index depending on limit value with ORDER BY query

This is weird to me:
One table 'ACTIVITIES' with one index on ACTIVITY_DATE. The exact same query with different LIMIT value results in different execution plan.
Here it is:
mysql> explain select * from ACTIVITIES order by ACTIVITY_DATE desc limit 20
-> ;
+----+-------------+------------+-------+---------------+-------------+---------+------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+-------------+---------+------+------+-------+
| 1 | SIMPLE | ACTIVITIES | index | NULL | ACTI_DATE_I | 4 | NULL | 20 | |
+----+-------------+------------+-------+---------------+-------------+---------+------+------+-------+
1 row in set (0.00 sec)
mysql> explain select * from ACTIVITIES order by ACTIVITY_DATE desc limit 150
-> ;
+----+-------------+------------+------+---------------+------+---------+------+-------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+------+---------+------+-------+----------------+
| 1 | SIMPLE | ACTIVITIES | ALL | NULL | NULL | NULL | NULL | 10629 | Using filesort |
+----+-------------+------------+------+---------------+------+---------+------+-------+----------------+
1 row in set (0.00 sec)
How come when I limit 150 it is not using the index? I mean, scanning 150 lines seems faster than scanning 10629 rows, right?
EDIT
The query uses the index till "limit 96" and starts filesort at "limit 97".
The table has nothing specific, even not a foreign key, here is the complete create table:
mysql> show create table ACTIVITIES\G
*************************** 1. row ***************************
Table: ACTIVITIES
Create Table: CREATE TABLE `ACTIVITIES` (
`ACTIVITY_ID` int(11) NOT NULL AUTO_INCREMENT,
`ACTIVITY_DATE` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`USER_KEY` varchar(50) NOT NULL,
`ITEM_KEY` varchar(50) NOT NULL,
`ACTIVITY_TYPE` varchar(1) NOT NULL,
`EXTRA` varchar(500) DEFAULT NULL,
`IS_VISIBLE` varchar(1) NOT NULL DEFAULT 'Y',
PRIMARY KEY (`ACTIVITY_ID`),
KEY `ACTI_USER_I` (`USER_KEY`,`ACTIVITY_DATE`),
KEY `ACTIVITY_ITEM_I` (`ITEM_KEY`,`ACTIVITY_DATE`),
KEY `ACTI_ITEM_TYPE_I` (`ITEM_KEY`,`ACTIVITY_TYPE`,`ACTIVITY_DATE`),
KEY `ACTI_DATE_I` (`ACTIVITY_DATE`)
) ENGINE=InnoDB AUTO_INCREMENT=10091 DEFAULT CHARSET=utf8 COMMENT='Logs activity'
1 row in set (0.00 sec)
mysql>
I also tried to run "ANALYZE TABLE ACTIVITIES" but that did not change a thing.
That's the way things go. Bear with me a minute...
The Optimizer would like to use an INDEX, in this case ACTI_DATE_I. But it does not want to use it if that would be slower.
Plan A: Use the index.
Reach into the BTree-structured index at the end (because of DESC)
Scan backward
For each row in the index, look up the corresponding row in the data. Note: The index has (ACTIVITY_DATE, ACTIVITY_ID) because the PRIMARY KEY is implicitly appended to any secondary key. To reach into the "data" using the PK (ACTIVITY_ID) is another BTree lookup, potentially random. Hence, it is potentially slow. (But not very slow in your case.)
This stops after LIMIT rows.
Plan B: Ignore the index
Scan the table, building a tmp table. (Likely to be in-memory.)
Sort the tmp table
Peel off LIMIT rows.
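The two plans above can be sketched in Python to show where each one spends its effort. This is a toy model with made-up data (a dict standing in for the data BTree and a sorted list standing in for ACTI_DATE_I); the counters only illustrate the shape of the work, not real MySQL costs.

```python
# Made-up table: ACTIVITY_ID -> ACTIVITY_DATE (deterministic pseudo-random dates).
table = {activity_id: (activity_id * 7919) % 1009 for activity_id in range(1, 1001)}

# Secondary index on ACTIVITY_DATE with the PRIMARY KEY implicitly appended.
index = sorted((date, activity_id) for activity_id, date in table.items())

def plan_a(limit):
    """Walk the index backward, look up each row by PK, stop after `limit` rows."""
    rows, pk_lookups = [], 0
    for date, activity_id in reversed(index):
        if len(rows) == limit:
            break
        _ = table[activity_id]  # lookup in the data by primary key
        pk_lookups += 1
        rows.append(activity_id)
    return rows, pk_lookups

def plan_b(limit):
    """Scan the whole table into a temporary list, sort it, peel off `limit` rows."""
    tmp = [(date, activity_id) for activity_id, date in table.items()]
    scanned = len(tmp)
    tmp.sort(reverse=True)
    return [activity_id for _, activity_id in tmp[:limit]], scanned

rows_a, lookups = plan_a(20)
rows_b, scanned = plan_b(20)
print(rows_a == rows_b, lookups, scanned)
```

Plan A's work grows with LIMIT (one potentially random lookup per row), while Plan B's work is fixed by the table size. That is why the optimizer's choice can flip at some LIMIT value.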
In your case (96 -- 1% of 10K) it is surprising that it picked the table scan. Normally, the cutoff is somewhere around 10%-30% of the number of rows in the table.
ANALYZE TABLE should have caused a recalculation of the statistics, which could have convinced it to go with the other Plan.
What version of MySQL are you using? (No, I don't know of any changes in this area.)
One thing you could try: OPTIMIZE TABLE ACTIVITIES; That will rebuild the table, thereby repacking the blocks and leading to potentially different statistics. If that helps, I would like to know it -- since I normally say "Optimize table is useless".