I'm trying to troubleshoot a performance issue on MySQL, so I wanted to create a smaller version of a table to work with. When I add a LIMIT clause to the query, it goes from about 2 seconds (for the full insert) to astronomical (42 minutes).
mysql> select pr.player_id, max(pr.insert_date) as insert_date from player_record pr
inner join date_curr dc on pr.player_id = dc.player_id where pr.insert_date < '2012-05-15'
group by pr.player_id;
+------------+-------------+
| player_id  | insert_date |
+------------+-------------+
| 1002395119 | 2012-05-14 |
...
| 1002395157 | 2012-05-14 |
| 1002395187 | 2012-05-14 |
| 1002395475 | 2012-05-14 |
+------------+-------------+
105776 rows in set (2.19 sec)
mysql> select pr.player_id, max(pr.insert_date) as insert_date from player_record pr
inner join date_curr dc on pr.player_id = dc.player_id where pr.insert_date < '2012-05-15'
group by pr.player_id limit 1;
+------------+-------------+
| player_id | insert_date |
+------------+-------------+
| 1000000080 | 2012-05-14 |
+------------+-------------+
1 row in set (42 min 23.26 sec)
mysql> describe player_record;
+------------------------+------------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+------------------------+------+-----+---------+-------+
| player_id | int(10) unsigned | NO | PRI | NULL | |
| insert_date | date | NO | PRI | NULL | |
| xp | int(10) unsigned | YES | | NULL | |
+------------------------+------------------------+------+-----+---------+-------+
17 rows in set (0.01 sec) (most columns removed)
There are 20 million rows in the player_record table, so I am creating two tables in memory for the specific dates I am looking to compare.
CREATE temporary TABLE date_curr
(
player_id INT UNSIGNED NOT NULL,
insert_date DATE,
PRIMARY KEY player_id (player_id, insert_date)
) ENGINE=MEMORY;
INSERT into date_curr
SELECT player_id,
MAX(insert_date) AS insert_date
FROM player_record
WHERE insert_date BETWEEN '2012-05-15' AND '2012-05-15' + INTERVAL 6 DAY
GROUP BY player_id;
CREATE TEMPORARY TABLE date_prev LIKE date_curr;
INSERT into date_prev
SELECT pr.player_id,
MAX(pr.insert_date) AS insert_date
FROM player_record pr
INNER join date_curr dc
ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id limit 0,20000;
date_curr has 216k entries, and date_prev has 105k entries if I don't use a limit.
These tables are just part of the process, used to trim down another table (500 million rows) to something manageable. date_curr includes the player_id and insert_date from the current week, and date_prev has the player_id and most recent insert_date from BEFORE the current week for any player_id present in date_curr.
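The two-step construction described above (latest date within the current week, then latest pre-week date for those same players) can be sketched with toy data. The snippet below is a stand-in only: it uses SQLite via Python's sqlite3 instead of MySQL MEMORY tables, and six invented rows instead of 20 million.

```python
import sqlite3

# Toy stand-in for the MySQL process above: SQLite instead of MEMORY tables,
# six rows instead of 20 million.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE player_record (
    player_id   INTEGER NOT NULL,
    insert_date TEXT    NOT NULL,
    PRIMARY KEY (player_id, insert_date))""")
cur.executemany("INSERT INTO player_record VALUES (?, ?)", [
    (1, "2012-05-10"), (1, "2012-05-14"), (1, "2012-05-16"),
    (2, "2012-05-12"), (2, "2012-05-17"),
    (3, "2012-05-01"),                     # no activity in the current week
])

# date_curr: latest insert_date per player within the current week.
cur.execute("""CREATE TEMP TABLE date_curr AS
    SELECT player_id, MAX(insert_date) AS insert_date
    FROM player_record
    WHERE insert_date BETWEEN '2012-05-15' AND '2012-05-21'
    GROUP BY player_id""")

# date_prev: latest insert_date BEFORE the week, only for players in date_curr.
cur.execute("""CREATE TEMP TABLE date_prev AS
    SELECT pr.player_id, MAX(pr.insert_date) AS insert_date
    FROM player_record pr
    JOIN date_curr dc ON pr.player_id = dc.player_id
    WHERE pr.insert_date < '2012-05-15'
    GROUP BY pr.player_id""")

print(sorted(cur.execute("SELECT * FROM date_curr")))  # players 1 and 2
print(sorted(cur.execute("SELECT * FROM date_prev")))  # their pre-week maxima
```

Player 3 drops out of date_prev because the join restricts it to players present in date_curr, exactly as in the MySQL version.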
Here is the explain output:
mysql> explain SELECT pr.player_id,
MAX(pr.insert_date) AS insert_date
FROM player_record pr
INNER JOIN date_curr dc
ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id
LIMIT 0,20000;
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | pr | range | PRIMARY,insert_date | insert_date | 3 | NULL | 396828 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dc | ALL | PRIMARY | NULL | NULL | NULL | 216825 | Using where; Using join buffer |
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
2 rows in set (0.03 sec)
This is on a system with 24G of RAM dedicated to the database, and it is currently pretty much idle. This specific database is the test copy, so it is completely static. I restarted MySQL and it still shows the same behavior.
Here is the 'show profile all' output, with most of the time spent on 'Copying to tmp table'.
| Status | Duration | CPU_user | CPU_system | Context_voluntary | Context_involuntary | Block_ops_in | Block_ops_out | Messages_sent | Messages_received | Page_faults_major | Page_faults_minor | Swaps | Source_function | Source_file | Source_line |
| Copying to tmp table | 999.999999 | 999.999999 | 0.383941 | 110240 | 18983 | 16160 | 448 | 0 | 0 | 0 | 43 | 0 | exec | sql_select.cc | 1976 |
A bit of a long answer but I hope you can learn something from this.
Based on the evidence in the EXPLAIN output, you can see that there were two possible indexes the MySQL query optimizer could have used:
possible_keys
PRIMARY,insert_date
However, the MySQL query optimizer decided to use the following index:
key
insert_date
This is one of those rare occasions where the MySQL query optimizer used the wrong index, and there is a probable cause. You are working on a static development database, which you probably restored from production to develop against.
When the MySQL optimizer needs to decide which index to use in a query, it looks at the statistics for all the possible indexes. You can read more about those statistics at http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-other-changes-statistics-estimation.html for a starter.
Updates, inserts and deletes on a table change the index statistics. It might be that, because of the static data, the MySQL server had stale statistics and chose the wrong index. At this point, however, that is just a guess at a possible root cause.
Now let's dive into the indexes. There were two possible indexes to use: the primary key and the index on insert_date, and MySQL used insert_date. Remember that during query execution MySQL will normally use only one index per table. Let's look at the difference between the primary key index and the insert_date index.
Simple fact about a primary key index (aka clustered):
A primary key index is normally a btree structure that contains the data rows, i.e. it is the table, since it contains the data.
Simple fact about a secondary index (aka non-clustered):
A secondary index is normally a btree structure that contains the data being indexed (the columns in the index) and a pointer to the location of the data row in the primary key index.
This is a subtle but big difference.
Let me explain: when you read a primary key index, you are reading the table, and the table is stored in primary key order as well. Thus, to find a value, you search the index and read the data in a single operation.
When you read a secondary index, you search the index, find the pointer, then read the primary key index to find the data the pointer refers to. That is essentially two operations, making a secondary-index read roughly twice as costly as a primary-key read.
In your case, since the optimizer chose insert_date as the index, it was doing double the work just to do the join. That is problem one.
Now, when you LIMIT a record set, that is the last piece of the query's execution. MySQL has to take the entire record set, sort it (if not sorted already) based on the ORDER BY and GROUP BY conditions, then take the number of records you asked for and send them back based on the LIMIT clause. MySQL has to do a lot of work to keep track of which records to send and where it is in the record set, and so on. LIMIT does have a performance hit, but I suspect there is a contributing factor, so read on.
Look at your GROUP BY: it is on player_id, while the index used is on insert_date. GROUP BY essentially orders your record set, but here it had no suitable index to use for that ordering (remember, an index is sorted in the order of the column(s) it contains). Essentially you were asking for a sort on player_id while the index used was sorted on insert_date.
This step caused the filesort problem: the data returned from reading the secondary index plus the primary key index (remember, the two operations) then has to be sorted. A sort of this size typically spills to disk because it is too expensive to do entirely in memory. Thus the entire query result was written to disk and sorted, painfully slowly, to get you your results.
By removing the insert_date index, MySQL will now use the primary key index, which means the data is already ordered by player_id and insert_date (the GROUP BY order). This eliminates the need to read the secondary index and then follow the pointer into the primary key index, i.e. the table, and since the data is already sorted, MySQL has very little work to do when applying the GROUP BY part of the query.
The following is again a bit of an educated guess; if you could post the results of the EXPLAIN statement after the index was dropped, I could probably confirm my thinking. By using the wrong index, the results were sorted on disk so that the LIMIT could be applied properly. Removing the LIMIT allows MySQL to sort in memory, since it does not have to apply the LIMIT and keep track of what is being returned. The LIMIT probably caused the temporary table to be created. Once again, it is difficult to say without seeing the difference between the statements, i.e. the EXPLAIN output for both.
Hopefully this gives you a better understanding of indexes and why they are a double-edged sword.
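The sorting argument above can be illustrated outside MySQL. The sketch below uses SQLite via Python's sqlite3 (a different engine, so only the shape of the plans carries over, not MySQL's actual behavior): grouping along the composite primary key needs no sort step, while forcing a secondary index on insert_date makes the planner add a temporary B-tree for the GROUP BY, SQLite's analogue of "Using temporary; Using filesort".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE player_record (
    player_id   INTEGER NOT NULL,
    insert_date TEXT    NOT NULL,
    PRIMARY KEY (player_id, insert_date))""")

query = ("SELECT player_id, MAX(insert_date) FROM player_record {hint} "
         "WHERE insert_date < '2012-05-15' GROUP BY player_id")

def plan(sql):
    # EXPLAIN QUERY PLAN is SQLite's rough analogue of MySQL's EXPLAIN.
    return "\n".join(row[-1] for row in cur.execute("EXPLAIN QUERY PLAN " + sql))

# With only the composite primary key available, rows are read in
# player_id order, so no sort step appears in the plan.
plan_pk = plan(query.format(hint=""))

# Force a secondary index on insert_date (like the one MySQL chose):
# rows now arrive in insert_date order and the GROUP BY needs a sort.
cur.execute("CREATE INDEX idx_date ON player_record (insert_date)")
plan_idx = plan(query.format(hint="INDEXED BY idx_date"))

print(plan_pk)
print(plan_idx)   # contains "USE TEMP B-TREE FOR GROUP BY"
```

The contrast is the point: same query, but the index choice decides whether a sort step exists at all.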
I had the same problem. When I added FORCE INDEX (id), the query went back to the few milliseconds it took without the LIMIT, while producing the same results.
Related
How can one of these queries use the index while the other cannot?
CREATE TABLE `testtable` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`a` int(11) NOT NULL,
`b` int(11) NOT NULL,
`c` int(11) NOT NULL,
`d` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_abd` (`a`,`b`,`d`)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=utf8;
explain select * from testtable where a > 1;
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | testtable | NULL | ALL | idx_abd | NULL | NULL | NULL | 10 | 80.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
explain select * from testtable where a < 1;
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+-----------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+-----------------------+
| 1 | SIMPLE | testtable | NULL | range | idx_abd | idx_abd | 4 | NULL | 1 | 100.00 | Using index condition |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+-----------------------+
Why can't the first one use the index, while the second one can?
How does the index work internally?
In the first case, the MySQL optimizer (based on statistics) decided that it was better to do a full table scan instead of first doing index lookups and then data lookups.
In your first query, the condition (a > 1) effectively needs to access 10 out of 11 rows. Always remember that MySQL does cost-based optimization (it tries to minimize the cost). The process is basically:
Assign a cost to each operation.
Evaluate how many operations each possible plan would take.
Sum up the total.
Choose the plan with the lowest overall cost.
Now, the default MySQL value for io_block_read_cost is 1. In the first query, you are going to have roughly twice the I/O block reads (first the index lookups, then the data lookups). So the cost would come out at roughly 20 if MySQL decided to use the index. If instead it does the table scan directly, the cost is roughly 11 (a data lookup on every row). That is why it decided to use a table scan instead of a range-based index scan.
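The arithmetic in that paragraph can be written out explicitly. The numbers below are the rough estimates from this answer, not values from MySQL's real cost model, which has more terms:

```python
# Toy cost comparison, using the rough estimates from the text
# (MySQL's real cost model is more detailed than this).
io_block_read_cost = 1.0   # MySQL's default charge per block read

rows_total = 11            # all rows in testtable
rows_matching = 10         # rows satisfying a > 1

# Range-scan plan: an index read plus a data-row read per matching row.
cost_index_plan = rows_matching * 2 * io_block_read_cost

# Full-table-scan plan: one data read per row, no index reads.
cost_table_scan = rows_total * io_block_read_cost

best = min([("range index scan", cost_index_plan),
            ("full table scan", cost_table_scan)], key=lambda p: p[1])
print(best)   # the table scan wins, 11 vs 20
```

With almost every row matching, paying twice per row through the index can never beat reading the table once.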
If you want details about the cost breakdown, run each of these queries prefixed with EXPLAIN FORMAT=JSON, like below:
EXPLAIN FORMAT=JSON select * from testtable where a > 1;
You can also see how the optimizer compared the various plans before settling on a particular strategy. To do this, execute the queries below:
/* Turn tracing on (it's off by default): */
SET optimizer_trace="enabled=on";
SELECT * FROM testtable WHERE a > 1; /* your query here */
SELECT * FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
/* possibly more queries...
When done with tracing, disable it: */
SET optimizer_trace="enabled=off";
Check more details at MySQL documentation: https://dev.mysql.com/doc/internals/en/optimizer-tracing.html
The alternative is to read both the index and the data pages. On such small data, that can be less efficient (although the difference in performance, such as the duration of each query, is quite small).
Your table has 10 rows, which presumably are all on a single data page. MySQL considers it more efficient to just read the 10 rows directly and do the comparison.
The value of indexes is when you have larger tables, particularly tables that span many data pages. One primary use is to reduce the number of data pages being read.
I have a table on mysql and two queries whose performances are quite different. I have extracted plans of the queries, but I couldn't fully understand the reason behind the performance difference.
The table:
+-------------+------------------+------------------------------------+
| TableA      |                  |                                    |
+-------------+------------------+------------------------------------+
| id          | int(10) unsigned | NOT NULL AUTO_INCREMENT            |
| userId      | int(10) unsigned | DEFAULT NULL                       |
| created     | timestamp        | NOT NULL DEFAULT CURRENT_TIMESTAMP |
| PRIMARY KEY | id               |                                    |
| KEY userId  | userId           |                                    |
| KEY created | created          |                                    |
+-------------+------------------+------------------------------------+
Keys/Indices: The primary key on id field, a key on userId field ASC
, another key on created field ASC.
tableA is a very big table, it contains millions of rows.
The query I run on this table is:
The user with id 1234 has 1.5M records in this table. I want to fetch its latest 100 rows. In order to achieve this, I have 2 different queries:
Query 1:
SELECT * FROM tableA USE INDEX (userId)
WHERE userId=1234 ORDER BY created DESC LIMIT 100;
Query 2:
SELECT * FROM tableA
WHERE userId=1234 ORDER BY id DESC LIMIT 100;
Since id field of tableA is auto increment, the condition of being latest is preserved. These 2 queries return the same result. However, there is a huge performance difference.
Query plans are:
+----------+-----------------------------------------------+-------------------------------+------+---------------------------------------+
| Query No | Operation | Params | Rows | Row desc |
+----------+-----------------------------------------------+-------------------------------+------+---------------------------------------+
| Query 1 | Sort(using file sort) Unique index scan (ref) | table: tableA; index: userId; | 2.5M | Using index condition; Using filesort |
| Query 2 | Unique index scan (ref) | table: tableA; index: userId; | 2.5M | Using where |
+----------+-----------------------------------------------+-------------------------------+------+---------------------------------------+
+--------+-------------+
| | Performance |
+--------+-------------+
| Query1 | 7.5 s       |
+--------+-------------+
| Query2 | 741 ms |
+--------+-------------+
I understand that there is a sorting operation in Query 1. In each query, the index used is userId. But why is there no sort in Query 2? How does the primary key index affect this?
Mysql 5.7
Edit: There are more columns on the table, I have extracted them from the table definition above.
Since id field of tableA is auto increment, the condition of being latest is preserved.
That is usually a valid statement.
WHERE userId=1234 ORDER BY created DESC LIMIT 100
needs this 'composite' index: (userId, created). With that, it will hit only 100 rows, regardless of the table size or the number of rows for that user.
The same goes for
WHERE userId=1234 ORDER BY id DESC LIMIT 100;
Namely that it needs (userId, id). However, in InnoDB, when you say INDEX(x) it silently tacks on the PRIMARY KEY columns. So you effectively get INDEX(x,id). This is why your plain INDEX(userId) worked well.
EXPLAIN rarely (if ever) takes into account the LIMIT. This is why 'Rows' is "2.5M" for both queries.
The first query might (or might not) have used INDEX(userId) if you took out the USE INDEX hint. The choice depends on what percentage of the table has userId = 1234. If it is less than about 20%, the index would be used. But it would bounce back and forth between the secondary index and the data -- all 1.5 million times. If more than 20%, it would avoid the bouncing by simply reading all the "millions" of rows, ignoring those that don't apply.
Note: What you had for Q1 will still read at least 1.5M rows, sort them ("Using filesort"), then peel off the desired 100. But with INDEX(userId, created), it can skip the sort and look at only 100 rows.
I cannot explain "Unique index scan" without seeing SHOW CREATE TABLE and the un-annotated EXPLAIN. (EXPLAIN FORMAT=JSON SELECT... might provide more insight.)
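The claim that a composite (userId, created) index removes the sort can be sketched on another engine. The toy below uses SQLite via Python's sqlite3, whose EXPLAIN QUERY PLAN reports an explicit sort step ("USE TEMP B-TREE FOR ORDER BY") much like MySQL's "Using filesort"; the table and data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tableA (id INTEGER PRIMARY KEY, userId INTEGER, created TEXT)")
cur.execute("CREATE INDEX idx_user ON tableA (userId)")

q = ("EXPLAIN QUERY PLAN SELECT * FROM tableA "
     "WHERE userId = 1234 ORDER BY created DESC LIMIT 100")

# Single-column index: rows for the user are found quickly, but they are
# not in created order, so the plan includes an explicit sort step.
plan_single = "\n".join(row[-1] for row in cur.execute(q))

# Composite index: within one userId the rows already come out ordered by
# created (scanned backward for DESC), so the sort step disappears.
cur.execute("CREATE INDEX idx_user_created ON tableA (userId, created)")
plan_composite = "\n".join(row[-1] for row in cur.execute(q))

print(plan_single)      # contains "USE TEMP B-TREE FOR ORDER BY"
print(plan_composite)   # no sort step
```

This mirrors the answer's point: with the right composite index, the LIMIT 100 can stop after 100 index entries instead of sorting 1.5M rows first.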
So I have combed the web and can't seem to find an answer. I have a table with the following structure
Table structure for table `search_tags`
--
CREATE TABLE IF NOT EXISTS `search_tags` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`LOOK_UP_TO_CAT_ID` int(11) NOT NULL,
`SEARCH_TAG` text COLLATE utf8_unicode_520_ci NOT NULL,
`DATE` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`SOURCE` varchar(225) COLLATE utf8_unicode_520_ci NOT NULL,
`SOURCE_ID` int(11) NOT NULL,
`WEIGHT` int(11) NOT NULL DEFAULT '1000',
PRIMARY KEY (`ID`),
KEY `LOOK_UP_TO_CAT_ID` (`LOOK_UP_TO_CAT_ID`),
KEY `WEIGHT` (`WEIGHT`),
FULLTEXT KEY `SEARCH_TAG` (`SEARCH_TAG`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_520_ci AUTO_INCREMENT=1 ;
The table sits with 800000+ rows and is growing.
When I run a query with a GROUP BY on LOOK_UP_TO_CAT_ID, it takes between 1 and 2 seconds to run. I need to run multiple versions of this base query with joins to other tables, but this seems to be where the bottleneck lies, as adding joins to it doesn't slow it down:
SELECT LOOK_UP_TO_CAT_ID, WEIGHT
FROM `search_tags`
WHERE `SEARCH_TAG` LIKE '%metallica%'
GROUP BY `LOOK_UP_TO_CAT_ID`
Removing the GROUP BY drops the query time down to 0.1 seconds, which seems much more acceptable, but then I end up with duplicates.
Using EXPLAIN with the GROUP BY shows that it's creating a temporary table rather than using the index:
+----+-------------+-------------+------+-------------------+------+---------+------+--------+----------------------------------------------+--+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | |
+----+-------------+-------------+------+-------------------+------+---------+------+--------+----------------------------------------------+--+
| 1 | SIMPLE | search_tags | ALL | LOOK_UP_TO_CAT_ID | NULL | NULL | NULL | 825087 | Using where; Using temporary; Using filesort | |
+----+-------------+-------------+------+-------------------+------+---------+------+--------+----------------------------------------------+--+
So I'm not sure whether MySQL is doing the right thing here, but to me at least it seems wrong not to use the index. What would be the best way to speed this query up?
Edit:
Here's an example of my data:
+----+-------------------+----------------------------------+------------+---------------+-----------+--------+
| ID | LOOK_UP_TO_CAT_ID | SEARCH_TAG | DATE | SOURCE | SOURCE_ID | WEIGHT |
+----+-------------------+----------------------------------+------------+---------------+-----------+--------+
| 1 | 521 | METALLICA | 2017-02-18 | artist | 15 | 1 |
| 2 | 521 | METALLICA - NOTHING ELSE MATTERS | 2017-02-18 | tracklisting | 22 | 2 |
| 3 | 522 | METALLICA | 2017-02-18 | artist | 15 | 1 |
| 4 | 522 | METALLICA - ST. Anger | 2017-02-18 | product_title | 522 | 2 |
+----+-------------------+----------------------------------+------------+---------------+-----------+--------+
Desired Result
+-------------------+--------+
| LOOK_UP_TO_CAT_ID | WEIGHT |
+-------------------+--------+
| 521 | 1 |
| 522 | 1 |
+-------------------+--------+
A few suggestions for you.
SEARCH_TAG LIKE '%metallica%' will never, in this world of woe, use an index. The pattern haystack LIKE '%needle%' (leading %) requires MySQL to examine every value in the column for a match. haystack LIKE 'needle%' (trailing % only) does not have this problem.
You have a FULLTEXT index on your SEARCH_TAG column, so use it! WHERE MATCH(SEARCH_TAG) AGAINST ('metallica') is the form of the WHERE clause you need.
You have lots of single-column indexes on your table. These are generally unhelpful for making queries faster unless they happen to match exactly what you're trying to do. You'll be better off using compound covering indexes designed for the queries you're running.
The example query in your question is
SELECT LOOK_UP_TO_CAT_ID, WEIGHT
FROM search_tags
WHERE SEARCH_TAG LIKE '%metallica%'
GROUP BY LOOK_UP_TO_CAT_ID
If you change it to this it will make more SQL sense and run faster.
SELECT LOOK_UP_TO_CAT_ID, MAX(WEIGHT)
FROM search_tags
WHERE SEARCH_TAG LIKE 'metallica%'
GROUP BY LOOK_UP_TO_CAT_ID
(Notice I got rid of the leading %.)
If you add a compound covering index on (SEARCH_TAG, LOOK_UP_TO_CAT_ID, WEIGHT), this query should become quite fast. The entire query can be satisfied from the index: MySQL random-accesses the index to find your SEARCH_TAG, then does a loose index scan to get the results you requested. (Since SEARCH_TAG is a TEXT column, you will need to index a prefix of it.)
(An aside: don't worry when you see filesort in EXPLAIN output for a GROUP BY or ORDER BY query. It's part of how MySQL satisfies the query; the file in filesort doesn't necessarily mean a slow file on a hard drive.)
In a way, your question doesn't make sense. You have a full-text index but are using LIKE, which does a table scan. You need to use MATCH() to use the full-text index.
What I really think is happening is that the data volume being returned is large. When you execute the query without an order by or group by, results are returned as they are generated. You see results because some rows that are scanned early match your conditions.
A group by/order by needs to read all the results.
You can check this by doing a count(*) instead of select:
SELECT COUNT(*)
FROM `search_tags`
WHERE `SEARCH_TAG` LIKE '%metallica%';
I suspect this might take longer.
You can eliminate the performance hit of the duplicate elimination by using a correlated subquery:
SELECT st.LOOK_UP_TO_CAT_ID, st.WEIGHT
FROM `search_tags` st
WHERE `SEARCH_TAG` LIKE '%metallica%' AND
st.id = (SELECT MIN(st2.id) FROM search_tags st2 WHERE st2.LOOK_UP_TO_CAT_ID = st.LOOK_UP_TO_CAT_ID);
This specifically needs an index on search_tags(LOOK_UP_TO_CAT_ID, ID) for performance.
However, you probably want to use MATCH() to also take advantage of the full text index.
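The correlated-subquery rewrite can be checked against the sample data from the question. The sketch below uses SQLite via Python's sqlite3 as a stand-in for MySQL (SQLite's LIKE is case-insensitive for ASCII, so '%metallica%' matches the upper-case rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE search_tags (
    id INTEGER PRIMARY KEY, LOOK_UP_TO_CAT_ID INTEGER,
    SEARCH_TAG TEXT, WEIGHT INTEGER)""")
cur.executemany("INSERT INTO search_tags VALUES (?, ?, ?, ?)", [
    (1, 521, "METALLICA",                        1),
    (2, 521, "METALLICA - NOTHING ELSE MATTERS", 2),
    (3, 522, "METALLICA",                        1),
    (4, 522, "METALLICA - ST. Anger",            2),
])

# One row per LOOK_UP_TO_CAT_ID: the row with the smallest id in its group.
rows = sorted(cur.execute("""
    SELECT st.LOOK_UP_TO_CAT_ID, st.WEIGHT
    FROM search_tags st
    WHERE st.SEARCH_TAG LIKE '%metallica%'
      AND st.id = (SELECT MIN(st2.id) FROM search_tags st2
                   WHERE st2.LOOK_UP_TO_CAT_ID = st.LOOK_UP_TO_CAT_ID)"""))
print(rows)   # matches the desired result in the question
```

The MIN(id) predicate keeps exactly one representative per category, which is what the GROUP BY was being (ab)used for.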
I was busying myself with exploring GROUP BY optimizations, using the classical "max salary per department" query, and suddenly got weird results. The dump below comes straight from my console. NO COMMANDS were issued between these two EXPLAINs; only some time had passed.
mysql> explain select name, t1.dep_id, salary
from emploee t1
JOIN ( select dep_id, max(salary) msal
from emploee
group by dep_id
) t2
ON t1.salary=t2.msal and t1.dep_id = t2.dep_id
order by salary desc;
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 4 | Using temporary; Using filesort |
| 1 | PRIMARY | t1 | ref | dep_id | dep_id | 8 | t2.dep_id,t2.msal | 1 | |
| 2 | DERIVED | emploee | index | NULL | dep_id | 8 | NULL | 84 | Using index |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
3 rows in set (0.00 sec)
mysql> explain select name, t1.dep_id, salary
from emploee t1
JOIN ( select dep_id, max(salary) msal
from emploee
group by dep_id
) t2
ON t1.salary=t2.msal and t1.dep_id = t2.dep_id
order by salary desc;
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 4 | Using temporary; Using filesort |
| 1 | PRIMARY | t1 | ref | dep_id | dep_id | 8 | t2.dep_id,t2.msal | 3 | |
| 2 | DERIVED | emploee | range | NULL | dep_id | 4 | NULL | 9 | Using index for group-by |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
3 rows in set (0.00 sec)
As you may notice, it examined ten times fewer rows in the second run. I assume it's because some internal counters changed, but I don't want to depend on those counters. So: is there a way to hint MySQL to always use the "Using index for group-by" behavior?
Or, if my speculation is wrong, is there any other explanation for this behavior, and how do I fix it?
CREATE TABLE `emploee` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`dep_id` int(11) NOT NULL,
`salary` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `dep_id` (`dep_id`,`salary`)
) ENGINE=InnoDB AUTO_INCREMENT=85 DEFAULT CHARSET=latin1 |
+-----------+
| version() |
+-----------+
| 5.5.19 |
+-----------+
Hm, showing the cardinality of the indexes might help, but keep in mind: range scans are usually slower than full index scans here.
Because it thinks it can match the full index in the first run, it uses the full index. In the second run it drops that and goes for a range, but it estimates the number of rows satisfying that larger range as wildly lower than for the smaller full index, because all the cardinality statistics have changed. Compare it to this: why would "AA" match 84 rows, but "A[any character]" match only 9 (note that it uses 8 bytes of the key in the first plan and 4 bytes in the second)? The second run will not actually read fewer rows; EXPLAIN just guesses the number of rows differently after an update of its index metadata. Note also that EXPLAIN does not tell you what a query will do, only what it probably will do.
Updating the cardinality can or will occur when:
The cardinality (the number of different key values) in every index of a table is calculated when a table is opened, at SHOW TABLE STATUS and ANALYZE TABLE and on other circumstances (like when the table has changed too much). Note that all tables are opened, and the statistics are re-estimated, when the mysql client starts if the auto-rehash setting is set on (the default).
So, assume this can happen 'at any point' due to 'changed too much', and yes, connecting with the mysql client can alter the server's choice of indexes. Also, the mysql client reconnecting after its connection timed out counts as connecting with auto-rehash, AFAIK. If you want to help MySQL find the proper plan, run ANALYZE TABLE once in a while, especially after heavy updating. If you think the cardinality it estimates is often wrong, you can raise the number of pages it samples to build its statistics, but keep in mind that a higher number means a longer-running cardinality update, which you don't want happening too often on a busy table whose 'data has changed too much'.
TL;DR: it guesses rows differently, but you'd actually prefer the first behavior if the data makes that possible.
Adding:
On the previously linked page, we can probably also find why dep_id in particular might have this problem:
small values like 1 or 2 can result in very inaccurate estimates of cardinality
I could imagine the number of distinct dep_id values is typically quite small, and I have indeed observed such 'bouncing' cardinality on non-unique indexes with a small range of values compared to the number of rows in my own databases. It easily guesses a true range of 1-10 distinct values as being in the hundreds, and then back down again the next time, based purely on the specific sample pages it picks and some algorithm that tries to extrapolate from them.
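The ANALYZE TABLE advice carries over to other engines, and SQLite makes the resulting statistics easy to inspect: its plain ANALYZE records per-index cardinality estimates in sqlite_stat1. A toy sketch via Python's sqlite3 (table layout borrowed from the question, contents invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE emploee (
    id INTEGER PRIMARY KEY, name TEXT, dep_id INTEGER, salary INTEGER)""")
cur.execute("CREATE INDEX dep_id ON emploee (dep_id, salary)")
# 3 departments x 10 salaries = 30 rows.
cur.executemany("INSERT INTO emploee (dep_id, salary) VALUES (?, ?)",
                [(d, s) for d in range(3) for s in range(10)])

cur.execute("ANALYZE")   # SQLite's counterpart of MySQL's ANALYZE TABLE
stats = {idx: stat for _, idx, stat in
         cur.execute("SELECT tbl, idx, stat FROM sqlite_stat1")}
# Format: "<total rows> <avg rows per dep_id> <avg rows per (dep_id, salary)>"
print(stats["dep_id"])
```

These are exactly the kind of per-prefix cardinality numbers the optimizer uses to decide between a full index scan and a range scan, and they only change when the statistics are refreshed.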
I have a query:
select SQL_NO_CACHE id from users
where id>1 and id <1000
and id in ( select owner_id from comments and content_type='Some_string');
(Note: this is a shortened version of an actual larger query used for my Sphinx indexing, but it reproduces the problem.)
This query takes about 3.5 seconds (changing the range to id = 1..5000 makes it about 15 seconds).
The users table has about 35000 entries and the comments table has about 8000 entries.
Explain on above query:
explain select SQL_NO_CACHE id from users
where id>1 and id <1000
and id in ( select distinct owner_id from d360_core_comments);
| id | select_type        | table              | type  | possible_keys | key     | key_len | ref  | rows | Extra                        |
| 1  | PRIMARY            | users              | range | PRIMARY       | PRIMARY | 4       | NULL | 1992 | Using where; Using index     |
| 2  | DEPENDENT SUBQUERY | d360_core_comments | ALL   | NULL          | NULL    | NULL    | NULL | 6901 | Using where; Using temporary |
where the individual subquery (select owner_id from d360_core_comments where content_type='Community20::Topic';) on its own takes almost 0.0 seconds.
However, if I add an index on (owner_id, content_type) (note the order here),
create index tmp_user on d360_core_comments (owner_id,content_type);
My subquery runs as is in ~0.0 seconds with NO index used:
mysql> explain select owner_id from d360_core_comments where content_type='Community20::Topic';
| id | select_type | table              | type | possible_keys | key  | key_len | ref  | rows | Extra       |
| 1  | SIMPLE      | d360_core_comments | ALL  | NULL          | NULL | NULL    | NULL | 6901 | Using where |
However now my main query (select SQL_NO_CACHE id from users where id>1 and id <1000 and id in ( select owner_id from d360_core_comments where content_type='Community20::Topic');)
now runs in ~0 seconds with following explain:
mysql> explain select SQL_NO_CACHE id from users where id>1 and id <1000 and id in ( select owner_id from d360_core_comments where content_type='Community20::Topic');
| id | select_type        | table              | type           | possible_keys | key      | key_len | ref  | rows | Extra                    |
| 1  | PRIMARY            | users              | range          | PRIMARY       | PRIMARY  | 4       | NULL | 1992 | Using where; Using index |
| 2  | DEPENDENT SUBQUERY | d360_core_comments | index_subquery | tmp_user      | tmp_user | 5       | func | 34   | Using where              |
So the main questions I have are:
If the index defined on the table used in my subquery is not getting used in my actual subquery then how it is optimizing the query here?
And why in the first place the first query was taking so much time when the actual subquery and main query independently are much faster?
What seems to happen in the full query without the index is that MySQL builds (some sort of) temporary table of all the owner_id values the subquery generates. Then, for each row from the users table that matches the id constraint, a lookup into this temporary construct is performed. It is unclear whether the overhead is in creating the temporary construct, or whether the lookup is implemented suboptimally (so that all elements are matched linearly for each row from the outer query).
When you create the index on owner_id, this doesn't change anything when you run only the subquery, because it has no condition on owner_id, nor does the index cover the content_type column.
However, when you run the full query with the index, there is more information available, since we now have values coming from the outer query that should be matched to owner_id, which is covered by the index. So the execution now seems to be to run the first part of the outer query, and for each matching row do an index lookup by owner_id. In other words, a possible execution plan is:
From Index-Users-Id Get all id matching id>1 and id <1000
For Each Row
Include Row If Index-Comment-OwnerId Contains row.Id
And Row Matches content_type='Some_string'
So in this case, the work to run 1000 (I assume) index lookups is faster than building a temporary construct of the 8000 possible owner_id. But this is only a hypothesis, since I don't know MySQL very well.
If you read this section of the MySQL Reference Manual: Optimizing Subqueries with EXISTS Strategy, you'll see that the query optimizer transforms your subquery condition from:
id in ( select distinct owner_id
from d360_core_comments
where content_type='Community20::Topic')
into:
exists ( select 1
from d360_core_comments
where content_type='Community20::Topic'
and owner_id = users.id )
This is why an index on (owner_id, content_type) is not useful when the subquery is run as a standalone query, but is useful when you consider the transformed subquery.
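That the IN form and the EXISTS form return identical rows is easy to check. The sketch below uses SQLite via Python's sqlite3 with invented data (SQLite plans the two forms differently than MySQL 5.x did, but the equivalence of results is the point here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE d360_core_comments (owner_id INTEGER, content_type TEXT)")
cur.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(1, 8)])
cur.executemany("INSERT INTO d360_core_comments VALUES (?, ?)", [
    (2, "Community20::Topic"), (2, "Community20::Topic"),   # duplicate owner
    (5, "Community20::Topic"), (6, "Other"),
])

in_rows = sorted(cur.execute(
    "SELECT id FROM users WHERE id > 1 AND id < 1000 AND id IN "
    "(SELECT owner_id FROM d360_core_comments "
    " WHERE content_type = 'Community20::Topic')"))

exists_rows = sorted(cur.execute(
    "SELECT id FROM users u WHERE id > 1 AND id < 1000 AND EXISTS "
    "(SELECT 1 FROM d360_core_comments c "
    " WHERE c.content_type = 'Community20::Topic' AND c.owner_id = u.id)"))

print(in_rows, exists_rows)   # identical: duplicate owners collapse either way
```

Note that the EXISTS form makes the dependency on the outer row (u.id) explicit, which is exactly why the (owner_id, content_type) index helps the full query but not the standalone subquery.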
The first thing you should know is that MySQL cannot optimize dependent subqueries; it is a long-standing, well-known MySQL deficiency that is supposed to be fixed in MySQL 6.x (just google for "mysql dependent subquery" and you will see). That is, the subquery is basically executed for each matching row in the users table. Since you have an additional condition, the overall execution time depends on that condition. The solution is to replace the subquery with a join (the very optimization you expect from MySQL under the hood).
Second, there is a syntax error in your subquery, and I think there was a condition on owner_id. Thus, when you add an index on owner_id it is used, but it is not enough for the second condition (hence no "Using index"); why it is not mentioned in EXPLAIN at all is a question (I think because of the condition on users.id).
Third, I do not know why you need the id > 1 and id < 5000 conditions, but you should understand that these are two range conditions that require a very accurate, sometimes non-obvious and data-dependent indexing approach (as opposed to equality conditions). If you do not actually need them and added them only to understand why the query takes so long, then that was a bad idea and they shed no light.
In case the conditions are required, and the index on owner_id is still there, I would rewrite the query as follows:
SELECT id
FROM (
SELECT owner_id as id
FROM comments
WHERE owner_id < 5000 AND content_type = 'some_string'
) as ids
JOIN users USING (id)
WHERE id > 1;
P.S. A composite index on (content_type, owner_id) will even be better for the query.
Step 1: Use id BETWEEN x AND y instead of id >= x AND id <= y. You may find some surprising gains because it indexes better.
Step 2: Adjust your sub-SELECT to do the filtering so it doesn't have to be done twice:
SELECT SQL_NO_CACHE id
FROM users
WHERE id IN (SELECT owner_id
FROM comments
WHERE content_type='Some_string'
AND owner_id BETWEEN 1 AND 1000);
There seem to be several errors in your statement. You're selecting 2 through 999, for instance, presumably off by one on both ends, and the subselect wasn't valid.
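Step 2's point, pushing the range filter into the subquery so it is applied once, can be sketched with toy tables (SQLite via Python's sqlite3; all table contents are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE comments (owner_id INTEGER, content_type TEXT)")
cur.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(1, 2001)])
cur.executemany("INSERT INTO comments VALUES (?, ?)", [
    (10, "Some_string"), (500, "Some_string"),
    (1500, "Some_string"),                      # outside the 1..1000 range
    (20, "Other"),
])

# The range condition now lives inside the subquery, so the outer query
# only probes ids that already passed both filters.
rows = sorted(cur.execute(
    "SELECT id FROM users WHERE id IN "
    "(SELECT owner_id FROM comments "
    " WHERE content_type = 'Some_string' AND owner_id BETWEEN 1 AND 1000)"))
print(rows)   # owner 1500 and the 'Other' comment are filtered out
```

The subquery shrinks to a handful of candidate ids before the users table is touched at all, which is the whole benefit of the rewrite.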