MySQL composite index column order & performance

I have a table with approximately 500,000 rows and I'm testing two composite indexes for it. The first index puts the ORDER BY column last; the second one has the same columns in reverse order.
What I don't understand is why the second index appears to offer better performance, estimating 30 rows to be scanned versus 889 for the first query. I was under the impression the second index could not be used properly, since the ORDER BY column is not last. Could anyone explain why this is the case? MySQL prefers the first index if both exist.
Note that the second EXPLAIN lists possible_keys as NULL but still lists a chosen key.
1) First index
ALTER TABLE user ADD INDEX test1_idx (city_id, quality);
(cardinality 12942)
EXPLAIN SELECT * FROM user u WHERE u.city_id = 3205 ORDER BY u.quality DESC LIMIT 30;
+----+-------------+-------+--------+---------------+-----------+---------+----------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+-----------+---------+----------------+------+-------------+
| 1 | SIMPLE | u | ref | test1_idx | test1_idx | 3 | const | 889 | Using where |
+----+-------------+-------+--------+---------------+-----------+---------+----------------+------+-------------+
2) Second index (same fields in reverse order)
ALTER TABLE user ADD INDEX test2_idx (quality, city_id);
(cardinality 7549)
EXPLAIN SELECT * FROM user u WHERE u.city_id = 3205 ORDER BY u.quality DESC LIMIT 30;
+----+-------------+-------+--------+---------------+-----------+---------+----------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+-----------+---------+----------------+------+-------------+
| 1 | SIMPLE | u | index | NULL | test2_idx | 5 | NULL | 30 | Using where |
+----+-------------+-------+--------+---------------+-----------+---------+----------------+------+-------------+
UPDATE:
The second query does not perform well in a real-life scenario, whereas the first one does, as expected. I would still be curious why MySQL's EXPLAIN gives such contradictory information.

The rows value in EXPLAIN is just an estimate of the number of rows that MySQL believes it must examine to produce the result.
I remember reading an article by Peter Zaitsev of Percona saying that this number can be very inaccurate, so you cannot compare query efficiency simply on the basis of this number.
I agree with you that the first index will produce better results in normal scenarios.
You should have noticed that the type column in the first EXPLAIN is ref, while it is index for the second. ref is usually better than an index scan. As you mentioned, if both keys exist, MySQL prefers the first one.

Judging from the key lengths, your column types are probably:
city_id: MEDIUMINT (3 bytes)
quality: SMALLINT (2 bytes)
As far as I know, for
SELECT * FROM user u WHERE u.city_id = 3205 ORDER BY u.quality DESC LIMIT 30;
the second index (quality, city_id) cannot be fully used: an ORDER BY acts like a range scan, and a range can only be applied to the trailing part of an index, yet here quality comes first. The first index fits the query perfectly.
MySQL is sometimes not so smart about this; the number of rows matching the targeted city_id may affect which index it decides to use.
You may try the FORCE INDEX keyword:
FORCE INDEX(test1_idx)
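For example, applied to the query above (FORCE INDEX is standard MySQL syntax; this is just the question's query with the hint added):
SELECT * FROM user u FORCE INDEX (test1_idx)
WHERE u.city_id = 3205
ORDER BY u.quality DESC LIMIT 30;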

Related

MySQL: EXPLAIN returns more rows than the actual number

I have a table which contains about 40M rows by actual count:
select count(*) from xxxs;
returns 38000389
but the explain:
mysql> explain select * from xxxs where s_uuid = "21eaef";
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------------+
| 1 | SIMPLE | xxxs | NULL | ALL | NULL | NULL | NULL | NULL | 56511776 | 10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------------+
1 row in set, 1 warning (0.06 sec)
Why is rows 56M, which is much larger than 40M?
Thanks
UPDATE
1. The above query may take several minutes. Is that normal? How can I tune its performance?
2. I plan to create an index on s_uuid. I guess it will improve the performance. Am I right?
The "rows" in EXPLAIN is an estimate based on statistics that were gathered in the recent past. The value is rarely exact; sometimes it is even off by more than a factor of two.
Still, the estimate is usually "good enough" for the Optimizer to decide how to perform the query.
Another place to see this estimate of row count is via
SHOW TABLE STATUS LIKE 'xxxs';
(As mentioned in a Comment) Adding this is likely to speed up select * from xxxs where s_uuid = "21eaef";:
INDEX(s_uuid)
I say "likely to" because, if a lot of rows have s_uuid = "21eaef", the Optimizer will shun the index and simply scan the entire table rather than bouncing back and forth from the index's BTree and the data's BTree. You can see the "shun" in EXPLAIN by having Possible keys = idx_uuid but key = NULL.
There will be cases where the Optimizer makes the 'wrong' choice. But we can discuss that in another Q&A.
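As a minimal sketch of the suggested change (using the table name from the question; ANALYZE TABLE is the standard way to refresh the statistics behind the rows estimate):
ALTER TABLE xxxs ADD INDEX idx_uuid (s_uuid);
ANALYZE TABLE xxxs;  -- refresh index statistics so the rows estimate improves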

Does ranging the SQL query speed up the query time?

There is a table words containing word and id columns and 50,000 records. I know the words with the structure %XC%A are between id=30000 and id=35000.
Now consider the following queries:
SELECT * FROM words WHERE word LIKE '%XCX%A'
and
SELECT * FROM words WHERE id>30000 and id < 35000 and word LIKE '%XCX%A'
From time consuming perspective, is there any difference between them?
Well, let's find out...
Here's a data set of approximately 50000 words. Some of the words (but only in the range 30000 to 35000) follow the pattern described:
EXPLAIN
SELECT * FROM words WHERE word LIKE '%XCX%A';
+----+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
| 1 | SIMPLE | words | index | NULL | word | 14 | NULL | 50976 | Using where; Using index |
+----+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
EXPLAIN
SELECT * FROM words WHERE id>30000 and id < 35000 and word LIKE '%XCX%A';
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | words | range | PRIMARY | PRIMARY | 4 | NULL | 1768 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
We can see that the first query scans the entire dataset (50976 rows), while the second query only scans rows between the given ids (in my example there are approximately 1768 rows between ids 30000 and 35000; there are lots of unused ids, but that's just a side effect of the way in which the data was created).
So, we can see that by adding the range, MySQL only has to scan (at worst) one fifth of the data set (5,000 rows instead of 50,000 rows). This isn't going to make much of a difference on such a small dataset, but it will on a dataset 100 or 1,000 times this size.
One thing to note is that the two queries will return the same data set (because we know that valid values are only to be found within that id range), but they won't necessarily return the dataset in the same order. For consistency, you would need an ORDER BY clause.
Another thing to note is, of course, that there's no point indexing word (for this query anyway), because a pattern with a leading wildcard ('%...') cannot use an index.
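For completeness, a version of the ranged query with a deterministic order (the ORDER BY here is my addition, not part of the original comparison):
SELECT * FROM words
WHERE id > 30000 AND id < 35000 AND word LIKE '%XCX%A'
ORDER BY id;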

MySQL GROUP BY is slower when using index

I'm running on an AWS m4.large (2 vCPUs, 8 GB RAM) and I'm seeing slightly surprising behaviour regarding MySQL and GROUP BY. I have this test database:
CREATE TABLE demo (
time INT,
word VARCHAR(30),
count INT
);
CREATE INDEX timeword_idx ON demo(time, word);
I insert 4,000,000 records with (uniformly) random words "t%s" % random.randint(0, 30000) and times random.randint(0, 86400).
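For reference, one row of that distribution could be generated in plain SQL like this (a sketch only; the question's data was actually produced by a Python script):
INSERT INTO demo (time, word, count)
VALUES (FLOOR(RAND() * 86400), CONCAT('t', FLOOR(RAND() * 30000)), 1);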
SELECT word, time, sum(count) FROM demo GROUP BY time, word;
3996922 rows in set (1 min 28.29 sec)
EXPLAIN SELECT word, time, sum(count) FROM demo GROUP BY time, word;
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+
| 1 | SIMPLE | demo | index | NULL | timeword_idx | 38 | NULL | 4002267 | |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+
and then I don't use the index:
SELECT word, time, sum(count) FROM demo IGNORE INDEX (timeword_idx) GROUP BY time, word;
3996922 rows in set (34.75 sec)
EXPLAIN SELECT word, time, sum(count) FROM demo IGNORE INDEX (timeword_idx) GROUP BY time, word;
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| 1 | SIMPLE | demo | ALL | NULL | NULL | NULL | NULL | 4002267 | Using temporary; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
As you can see, using the index makes the query take three times as long. That only partly surprises me: with the index, the query can avoid reading the time and word columns from the table, but the index is so sparse that it shouldn't gain much; worse, it turns a direct scan into a random access pattern when it comes to retrieving count.
I would just like to confirm that this is the reason, and I wonder whether there is a compact rule for when an index will actually bring worse performance when used for GROUP BY.
EDIT:
I followed Gordon Linoff's answer and used:
CREATE INDEX timeword_idx ON demo(time, word, count);
The "covering index" computes the results 10 times faster when compared with the full scan:
SELECT word, time, sum(count) FROM demo GROUP BY time, word;
3996922 rows in set (3.36 sec)
EXPLAIN SELECT word, time, sum(count) FROM demo GROUP BY time, word;
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+
| 1 | SIMPLE | demo | index | NULL | timeword_idx | 43 | NULL | 4002267 | Using index |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+
Very impressive!
You have a reasonably sized table, so the problem might be sequential access of the data or thrashing. Using the index requires going through the index and then looking up the data in the data pages to get the count.
This can actually be worse than just reading the pages and doing a sort, because the pages are not read in order. Sequential reads are considerably more optimized than random reads. In the worst case, the page cache is full and the random reads require flushing pages. If this happens, a single page might need to be read multiple times. With only 4 million relatively small rows, thrashing is unlikely unless you are severely memory constrained.
If this interpretation is correct, then including count in the index should speed the query:
CREATE INDEX timeword_idx ON demo(time, word, count);
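One practical detail not spelled out in the answer: because the new index reuses the name timeword_idx, the old index must be dropped first (DROP INDEX ... ON is standard MySQL syntax):
DROP INDEX timeword_idx ON demo;
CREATE INDEX timeword_idx ON demo(time, word, count);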
From the manual page How MySQL Uses Indexes
Indexes are less important for queries on small tables, or big tables
where report queries process most or all of the rows. When a query
needs to access most of the rows, reading sequentially is faster than
working through an index. Sequential reads minimize disk seeks, even
if not all the rows are needed for the query.
As for tacking on more columns to create covering indexes (ones in which the data pages are not accessed because all the data is available in the index), be careful: they come at a cost. In your case the index is getting wide anyway, but a careful balance is always needed.
As spencer alludes to, cardinality always plays a role with ranges. For cardinality information, use the SHOW INDEX FROM tblName command. It is not the driving issue for your query, but it is useful in other settings. Let me rephrase that: cardinality is very high for your table, so your index is deemed a hindrance for that query.
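For example:
SHOW INDEX FROM demo;
The Cardinality column in its output is MySQL's estimate of the number of unique values in the index, which is what the optimizer weighs when choosing between indexes.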

MySQL InnoDB indexes slowing down sorts

I am using MySQL 5.6 on FreeBSD and have just recently switched from using MyISAM tables to InnoDB to gain advances of foreign key constraints and transactions.
After the switch, I discovered that a query on a table with 100,000 rows that was previously taking .003 seconds, was now taking 3.6 seconds. The query looked like this:
SELECT *
-> FROM USERS u
-> JOIN MIGHT_FLOCK mf ON (u.USER_ID = mf.USER_ID)
-> WHERE u.STATUS = 'ACTIVE' AND u.ACCESS_ID >= 8 ORDER BY mf.STREAK DESC LIMIT 0,100
I noticed that if I removed the ORDER BY clause, the execution time dropped back down to .003 seconds, so the problem is obviously in the sorting.
I then discovered that if I added back the ORDER BY but removed indexes on the columns referred to in the query (STATUS and ACCESS_ID), the query execution time would take the normal .003 seconds.
Then I discovered that if I added back the indexes on the STATUS and ACCESS_ID columns, but used IGNORE INDEX (STATUS,ACCESS_ID), the query would still execute in the normal .003 seconds.
Is there something about InnoDB and sorting results when referencing an indexed column in a WHERE clause that I don't understand?
Or am I doing something wrong?
EXPLAIN for the slow query returns the following results:
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
| 1 | SIMPLE | u | ref | PRIMARY,STATUS,ACCESS_ID | STATUS | 2 | const | 53902 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | mf | eq_ref | PRIMARY | PRIMARY | 4 | PRO_MIGHT.u.USER_ID | 1 | NULL |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
EXPLAIN for the fast query returns the following results:
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
| 1 | SIMPLE | mf | index | PRIMARY | STREAK | 2 | NULL | 100 | NULL |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | PRO_MIGHT.mf.USER_ID | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
Any help would be greatly appreciated.
In the slow case, MySQL is making an assumption that the index on STATUS will greatly limit the number of users it has to sort through. MySQL is wrong. Presumably most of your users are ACTIVE. MySQL is picking up 50k user rows, checking their ACCESS_ID, joining to MIGHT_FLOCK, sorting the results and taking the first 100 (out of 50k).
In the fast case, you have told MySQL it can't use either index on USERS. MySQL is using its next-best index, it is taking the first 100 rows from MIGHT_FLOCK using the STREAK index (which is already sorted), then joining to USERS and picking up the user rows, then checking that your users are ACTIVE and have an ACCESS_ID at or above 8. This is much faster because only 100 rows are read from disk (x2 for the two tables).
I would recommend:
Drop the index on STATUS unless you frequently need to retrieve INACTIVE users (not ACTIVE users); this index is not helping you (a sketch follows the STRAIGHT_JOIN example below).
Read this question to understand why your sorts are so slow. You can probably tune InnoDB for better sort performance to prevent these kinds of problems.
If you have very few users with ACCESS_ID at or above 8 you should see a dramatic improvement already. If not you might have to use STRAIGHT_JOIN in your select clause.
Example below:
SELECT *
FROM MIGHT_FLOCK mf
STRAIGHT_JOIN USERS u ON (u.USER_ID = mf.USER_ID)
WHERE u.STATUS = 'ACTIVE' AND u.ACCESS_ID >= 8 ORDER BY mf.STREAK DESC LIMIT 0,100
STRAIGHT_JOIN forces MySQL to access the MIGHT_FLOCK table before the USERS table based on the order in which you specify those two tables in the query.
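For the first recommendation, a minimal sketch (assuming the index on USERS is literally named STATUS, as the EXPLAIN output suggests):
ALTER TABLE USERS DROP INDEX STATUS;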
To answer the question "Why did the behaviour change" you should start by understanding the statistics that MySQL keeps on each index: http://dev.mysql.com/doc/refman/5.6/en/myisam-index-statistics.html. If statistics are not up to date or if InnoDB is not providing sufficient information to MySQL, the query optimiser can (and does) make stupid decisions about how to join tables.

MySQL has indexed tables and EXPLAIN looks good, but still not using index

I am trying to optimize a query, and all looks well when I go to EXPLAIN it, but it still comes up in the log_queries_not_using_indexes log.
Here is the query:
SELECT t1.id_id,t1.change_id,t1.like_id,t1.dislike_id,t1.user_id,t1.from_id,t1.date_id,t1.type_id,t1.photo_id,t1.mobile_id,t1.mobiletype_id,t1.linked_id
FROM recent AS t1
LEFT JOIN users AS t2 ON t1.user_id = t2.id_id
WHERE t2.active_id=1 AND t1.postedacommenton_id='0' AND t1.type_id!='Friends'
ORDER BY t1.id_id DESC LIMIT 35;
So it grabs 'wallpost' data, and then I joined in the USERS table to make sure the user is still an active user (the number 1), plus two other small ANDs.
When I run this with the EXPLAIN in phpmyadmin it shows
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | t1 | index | user_id | PRIMARY | 4 | NULL | 35 | Using where
1 | SIMPLE | t2 | eq_ref | PRIMARY,active_id | PRIMARY | 4 | hnet_user_info.t1.user_id | 1 | Using where
It shows the t1 query found 35 rows using "WHERE", and the t2 query found 1 row (the user), using "WHERE"
So I can't figure out why it's showing up in the log_queries_not_using_indexes report.
Any tips? I can post more info if you need it.
tl;dr: ignore the "not using index" warning. A query execution time of 1.3 milliseconds is not a problem; there is nothing to optimize here. Look at the entire performance profile to find bottlenecks.
Trust the database engine. The database query planner will use indices when it determines that doing so is beneficial. In this case, due to the low cardinality estimates (35x1), the query planner decided that there was no reason to use indices for the actual execution plan. If indices were used in a trivial case like this it could actually increase the query execution time.
As always, use the 97/3 rule.
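If the log noise itself is the concern, the logging can also be turned off or filtered; these are standard MySQL system variables, not something specific to this query:
SET GLOBAL log_queries_not_using_indexes = OFF;
SET GLOBAL min_examined_row_limit = 1000;  -- only log queries that examine at least this many rows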