MySQL indexes when multiple are possible

Given the following -
drop table if exists learning_indexes;
create table learning_indexes (
id INT NOT NULL,
col1 CHAR(30),
col2 CHAR(30),
col3 CHAR(30),
PRIMARY KEY (id),
index idx_col1 (col1),
index idx_col1_col2 (col1,col2)
);
explain
select
col1,col2
from
learning_indexes
where
col1 = 'FOO'
and col2 = 'BAR';
Why does MySQL pick idx_col1 over idx_col1_col2?
+----+-------------+------------------+------+------------------------+----------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+------------------------+----------+---------+-------+------+-------------+
| 1 | SIMPLE | learning_indexes | ref | idx_col1,idx_col1_col2 | idx_col1 | 91 | const | 1 | Using where |
+----+-------------+------------------+------+------------------------+----------+---------+-------+------+-------------+
This is my version information -
+-------------------------+---------------------+
| Variable_name | Value |
+-------------------------+---------------------+
| innodb_version | 1.1.8 |
| protocol_version | 10 |
| slave_type_conversions | |
| version | 5.5.29 |
| version_comment | Source distribution |
| version_compile_machine | i386 |
| version_compile_os | osx10.7 |
+-------------------------+---------------------+

I agree with Floaf that MySQL sometimes chooses the wrong index, but I don't think that is the case here. MySQL factors the number of rows and the distribution of the data into its decision about which index to use.
For a rather simple query like this one, MySQL will likely not use any index at all if the table contains fewer than about 100 rows or is empty. It seems to be computationally cheaper just to scan all table rows than to go through the index. In your explain plan, you can see that the "key" column says idx_col1, but the "Extra" column doesn't say "Using index".
If the table contains more than about 100 rows, MySQL will start using idx_col1. The explain plan will show you this. Only when more than about 100 rows actually contain the string 'FOO' in col1 will MySQL notice that idx_col1 doesn't narrow the tentative result set enough, since it would still have to scan those remaining rows for the value 'BAR' in col2. At that point it will switch to idx_col1_col2.
I'm not entirely sure how MySQL decides so quickly which index to use, but I think it has something to do with heuristics and the cardinality of the indexed columns, i.e. how "selective" an indexed column is.
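One way to check this behaviour yourself is to load some test data and re-run the EXPLAIN. The following is only a sketch: the helper table digits, the row counts and the filler values are made up for illustration, and the plan you get will depend on your MySQL version and on the index statistics.

```sql
-- Hypothetical helper table holding the digits 0-9, used only to generate rows.
DROP TABLE IF EXISTS digits;
CREATE TABLE digits (d INT NOT NULL PRIMARY KEY);
INSERT INTO digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- Generate 1000 rows (id 0..999): every 5th row has col1 = 'FOO' (200 rows),
-- and every 10th row additionally has col2 = 'BAR' (100 rows).
INSERT INTO learning_indexes (id, col1, col2, col3)
SELECT n,
       IF(n % 5 = 0, 'FOO', CONCAT('X', n)),
       IF(n % 10 = 0, 'BAR', CONCAT('Y', n)),
       NULL
FROM (
    SELECT a.d + 10 * b.d + 100 * c.d AS n
    FROM digits a CROSS JOIN digits b CROSS JOIN digits c
) AS nums;

-- Refresh the index statistics, then look at the plan again.
ANALYZE TABLE learning_indexes;

EXPLAIN
SELECT col1, col2
FROM learning_indexes
WHERE col1 = 'FOO'
  AND col2 = 'BAR';
```

Compare the key and key_len columns before and after loading the data to see whether the optimizer changes its mind.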

I can't explain your case here, but sometimes MySQL simply chooses the "wrong" index. Maybe the table is small enough that the optimizer figures it does not make any difference in this case.
This query is so simple that the optimizer should be able to work out which index is the most appropriate.
I can say from experience that when queries get more complex, and especially when tables grow very large, MySQL sometimes (seemingly at random) decides to pick another index, and queries can then go from 0.01 seconds to 100+ seconds. So if you know which index is the right one, use FORCE INDEX(). Even if you use USE INDEX(), MySQL sometimes chooses another index, with devastating results for query speed.
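Applied to the table from the question, an index hint could look like the sketch below. It simply asks the optimizer to use the two-column index; whether that is actually faster should be checked against your own data.

```sql
-- Ask the optimizer to use the composite index instead of idx_col1.
EXPLAIN
SELECT col1, col2
FROM learning_indexes FORCE INDEX (idx_col1_col2)
WHERE col1 = 'FOO'
  AND col2 = 'BAR';
```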

Related

MySQL returning only part of results from FULL TEXT SEARCH after version 5.6 to 5.7 upgrade

I have a query that consists of two full text searches in boolean mode (combined with the OR operator) that worked just fine on MySQL 5.6 and that fails after bumping MySQL to version 5.7. Both DBs have the exact same set of records, and both are hosted on AWS (InnoDB, Aurora).
Query below (don't pay too much attention to the table/column names as I tried to anonymise them):
SELECT
cars.id
FROM cars
INNER JOIN driver_licenses ON driver_licenses.car_id = cars.id
INNER JOIN drivers ON drivers.id = driver_licenses.driver_id AND drivers.noobie_driver = 0
WHERE (
(MATCH(cars.name) AGAINST ('mark*' IN BOOLEAN MODE))
OR (MATCH(drivers.first_name, drivers.last_name, drivers.email) AGAINST ('mark*' IN BOOLEAN MODE))
);
Of course I have a fulltext index on the [first_name, last_name, email] columns, as well as a btree index on noobie_driver. There are two indices on cars.name: one btree and one fulltext.
Before the upgrade, the query returned proper results (hundreds of rows out of a few million records in total).
After the upgrade, it seems that the query/optimizer focuses only on the first condition and completely disregards the second full text search (by driver's names and email). It returns only a few records, those related directly to the search result of cars.name.
When the queries are run separately (first for cars.name and then for the driver details) and then combined, they return the same results as before the upgrade.
Also, when I force MySQL to ignore the noobie_driver index (or remove the noobie_driver condition), both full text search conditions are taken into consideration.
Running EXPLAIN in both DBs returns the same results.
+----+-------------+---------------------------+------------+--------+------------------------------------------------------------------------------------------------------+---------------------------------------------------+---------+-----------------------------------------------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------------------+------------+--------+------------------------------------------------------------------------------------------------------+---------------------------------------------------+---------+-----------------------------------------------------+------+----------+-------------+
| 1 | SIMPLE | drivers | NULL | ref | PRIMARY,index_drivers_on_noobie_driver | index_drivers_on_noobie_driver | 1 | const | 6798 | 100.00 | NULL |
| 1 | SIMPLE | driver_licenses | NULL | ref | index_driver_licenses_on_car_id,index_driver_licenses_on_driver_id | index_driver_licenses_on_driver_id | 5 | Rental.drivers.id | 1 | 100.00 | Using where |
| 1 | SIMPLE | cars | NULL | eq_ref | PRIMARY | PRIMARY | 4 | Rental.driver_licenses.car_id | 1 | 100.00 | Using where |
+----+-------------+---------------------------+------------+--------+------------------------------------------------------------------------------------------------------+---------------------------------------------------+---------+-----------------------------------------------------+------+----------+-------------+
Tomorrow I'll be working on rebuilding the index/table(s) to see if that changes the behaviour on 5.7; once it's done I'll come back with more details. Running OPTIMIZE TABLE on all 3 tables hasn't fixed anything.
I'm wondering:
Have I missed something, and is this behaviour a feature in 5.7?
How can I overcome the issue and keep the exact same query (that is, without ignoring the index or performing two separate queries and combining the results afterwards)?
OK, dropping and recreating the index on the noobie_driver column seems to do the trick on the smoke-environment database, which contains just a few thousand records in the drivers table:
DROP INDEX index_drivers_on_noobie_driver ON drivers;
CREATE INDEX index_drivers_on_noobie_driver USING BTREE ON drivers(noobie_driver);
BUT with production data, where the drivers table holds ~2 million records, dropping and recreating the index did not help. I'm starting to believe this could be a bug specific to this MySQL version.
I will update the question once I learn something new.
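For reference, the two workarounds described in the question (ignoring the noobie_driver index, and running the two searches separately and combining them) could be written roughly as follows. This is a sketch built from the anonymised names in the question, not a confirmed fix. Note that UNION removes duplicate ids, so row counts may differ from the original OR query if a car matches through several drivers.

```sql
-- Workaround 1: keep the original query but tell MySQL to ignore the btree index.
SELECT cars.id
FROM cars
INNER JOIN driver_licenses ON driver_licenses.car_id = cars.id
INNER JOIN drivers IGNORE INDEX (index_drivers_on_noobie_driver)
        ON drivers.id = driver_licenses.driver_id AND drivers.noobie_driver = 0
WHERE MATCH(cars.name) AGAINST ('mark*' IN BOOLEAN MODE)
   OR MATCH(drivers.first_name, drivers.last_name, drivers.email)
      AGAINST ('mark*' IN BOOLEAN MODE);

-- Workaround 2: one full text search per branch, combined with UNION.
SELECT cars.id
FROM cars
INNER JOIN driver_licenses ON driver_licenses.car_id = cars.id
INNER JOIN drivers ON drivers.id = driver_licenses.driver_id AND drivers.noobie_driver = 0
WHERE MATCH(cars.name) AGAINST ('mark*' IN BOOLEAN MODE)
UNION
SELECT cars.id
FROM cars
INNER JOIN driver_licenses ON driver_licenses.car_id = cars.id
INNER JOIN drivers ON drivers.id = driver_licenses.driver_id AND drivers.noobie_driver = 0
WHERE MATCH(drivers.first_name, drivers.last_name, drivers.email)
      AGAINST ('mark*' IN BOOLEAN MODE);
```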

Mysql: explain returns more rows than the actual number

I have a table that contains about 40M rows according to a count.
select count(*) from xxxs;
returns 38000389
but the explain:
mysql> explain select * from xxxs where s_uuid = "21eaef";
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------------+
| 1 | SIMPLE | xxxs | NULL | ALL | NULL | NULL | NULL | NULL | 56511776 | 10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------------+
1 row in set, 1 warning (0.06 sec)
Why is the rows value 56M, which is much larger than 40M?
Thanks
UPDATE
1. The above query may take several minutes. Is that normal? How can I tune the performance?
2. I plan to create an index on s_uuid. I guess it will improve the performance. Am I right?
The "rows" in EXPLAIN is an estimate based on statistics that were gathered in the recent past. The value is rarely exact; sometimes it is even off by more than a factor of two.
Still, the estimate is usually "good enough" for the Optimizer to decide how to perform the query.
Another place to see this estimate of row count is via
SHOW TABLE STATUS LIKE 'xxxs';
(As mentioned in a comment) Adding this index is likely to speed up select * from xxxs where s_uuid = "21eaef";
INDEX(s_uuid)
I say "likely to" because, if a lot of rows have s_uuid = "21eaef", the Optimizer will shun the index and simply scan the entire table rather than bouncing back and forth between the index's BTree and the data's BTree. You can see the "shun" in EXPLAIN when possible_keys lists the index but key = NULL.
There will be cases where the Optimizer makes the 'wrong' choice. But we can discuss that in another Q&A.
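Put together, the steps above could look like the following sketch. The index name idx_s_uuid is made up, and the statement assumes s_uuid is a CHAR/VARCHAR column (a TEXT column would need a prefix length). Building an index on a table of this size can take a while.

```sql
-- Add the index (hypothetical name) and refresh the statistics.
ALTER TABLE xxxs ADD INDEX idx_s_uuid (s_uuid);
ANALYZE TABLE xxxs;

-- Check whether the optimizer now picks the index (look at possible_keys and key).
EXPLAIN SELECT * FROM xxxs WHERE s_uuid = '21eaef';

-- The estimated row count is also exposed here; like EXPLAIN, it is only an estimate.
SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'xxxs';
```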

Very simple AVG() aggregation query on MySQL server takes ridiculously long time

I am using MySQL server via an Amazon cloud service, with default settings. The table involved, mytable, is of InnoDB type and has about 1 billion rows.
The query is:
select count(*), avg(`01`) from mytable where `date` = "2017-11-01";
Which takes almost 10 min to execute. I have an index on date. The EXPLAIN of this query is:
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| 1 | SIMPLE | mytable | ref | date | date | 3 | const | 1411576 | NULL |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
The indexes from this table are:
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| mytable | 0 | PRIMARY | 1 | ESI | A | 60398679 | NULL | NULL | | BTREE | | |
| mytable | 0 | PRIMARY | 2 | date | A | 1026777555 | NULL | NULL | | BTREE | | |
| mytable | 1 | lse_cd | 1 | lse_cd | A | 1919210 | NULL | NULL | YES | BTREE | | |
| mytable | 1 | zone | 1 | zone | A | 732366 | NULL | NULL | YES | BTREE | | |
| mytable | 1 | date | 1 | date | A | 85564796 | NULL | NULL | | BTREE | | |
| mytable | 1 | ESI_index | 1 | ESI | A | 6937686 | NULL | NULL | | BTREE | | |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
If I remove AVG():
select count(*) from mytable where `date` = "2017-11-01";
It only takes 0.15 sec to return the count. The count of this specific query is 692792; The counts are similar for other dates.
I don't have an index on 01. Is that an issue? Why does AVG() take so long to compute? There must be something I didn't do properly.
Any suggestion is appreciated!
To count the number of rows with a specific date, MySQL has to locate that value in the index (which is pretty fast, after all that is what indexes are made for) and then read the subsequent entries of the index until it finds the next date. Depending on the datatype of esi, this will sum up to reading some MB of data to count your 700k rows. Reading some MB does not take much time (and that data might even already be cached in the buffer pool, depending on how often you use the index).
To calculate the average for a column that is not included in the index, MySQL will, again, use the index to find all rows for that date (the same as before). But additionally, for every row it finds, it has to read the actual table data for that row, which means using the primary key to locate the row, reading some bytes, and repeating this 700k times. This "random access" is a lot slower than the sequential read in the first case. (It is made worse by the fact that "some bytes" is really a whole page of innodb_page_size (16KB by default), so you may have to read up to 700k * 16KB = 11GB, compared to "some MB" for count(*). Depending on your memory configuration, some of this data might not be cached and will have to be read from disk.)
A solution to this is to include all used columns in the index (a "covering index"), e.g. create an index on date, 01. Then MySQL does not need to access the table itself, and can proceed, similar to the first method, by just reading the index. The size of the index will increase a bit, so MySQL will need to read "some more MB" (and perform the avg-operation), but it should still be a matter of seconds.
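A minimal sketch of such a covering index for the query in the question is shown below. The index name idx_date_01 is made up, and on a table with about 1 billion rows the ALTER TABLE itself will take a long time.

```sql
-- Hypothetical covering index: both columns used by the query are in the index.
ALTER TABLE mytable ADD INDEX idx_date_01 (`date`, `01`);

-- If the index covers the query, the Extra column should mention "Using index".
EXPLAIN
SELECT COUNT(*), AVG(`01`)
FROM mytable
WHERE `date` = '2017-11-01';
```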
In the comments, you mentioned that you need to calculate the average over 24 columns. If you want to calculate the avg for several columns at the same time, you would need a covering index on all of them, e.g. date, 01, 02, ..., 24 to prevent table access. Be aware that an index that contains all columns requires as much storage space as the table itself (and it will take a long time to create such an index), so it might depend on how important this query is if it is worth those resources.
To avoid the MySQL-limit of 16 columns per index, you could split it into two indexes (and two queries). Create e.g. the indexes date, 01, .., 12 and date, 13, .., 24, then use
select * from (select `date`, avg(`01`), ..., avg(`12`)
from mytable where `date` = ...) as part1
cross join (select avg(`13`), ..., avg(`24`)
from mytable where `date` = ...) as part2;
Make sure to document this well, as there is no obvious reason to write the query this way, but it might be worth it.
If you only ever average over a single column at a time, you could add 24 separate indexes (on date, 01; on date, 02; ...). In total they will require even more space, but each might be a little bit faster (as they are individually smaller). The buffer pool might still favour the full index, depending on factors like usage patterns and memory configuration, so you may have to test it.
Since date is part of your primary key, you could also consider changing the primary key to date, esi. If you find the dates by the primary key, you would not need an additional step to access the table data (as you already access the table), so the behaviour would be similar to the covering index. But this is a significant change to your table and can affect all other queries (that e.g. use esi to locate rows), so it has to be considered carefully.
As you mentioned, another option would be to build a summary table where you store precalculated values, especially if you do not add or modify rows for past dates (or can keep them up-to-date with a trigger).
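A summary table along those lines might be sketched as follows. The table name mytable_daily_summary and its columns are hypothetical, `date` is assumed to be a DATE column, and only the 01 column is shown; the other 23 columns would be added in the same way.

```sql
-- Hypothetical summary table: one row per date with precalculated values.
CREATE TABLE mytable_daily_summary (
    `date`    DATE   NOT NULL,
    row_count BIGINT NOT NULL,
    avg_01    DOUBLE NULL,
    PRIMARY KEY (`date`)
);

-- Fill (or refresh) the row for one date; run this once per date that changed.
REPLACE INTO mytable_daily_summary (`date`, row_count, avg_01)
SELECT `date`, COUNT(*), AVG(`01`)
FROM mytable
WHERE `date` = '2017-11-01'
GROUP BY `date`;

-- The report query then reads a single row instead of ~700k.
SELECT row_count, avg_01
FROM mytable_daily_summary
WHERE `date` = '2017-11-01';
```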
For MyISAM tables, COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause.
For example:
SELECT COUNT(*) FROM student;
https://dev.mysql.com/doc/refman/5.6/en/group-by-functions.html#function_count
If you add AVG() or something else, you lose this optimization

Slow query after upgrade mysql from 5.5 to 5.6

We're upgrading mysql from 5.5 to 5.6 and some queries are deadly slow now.
Queries that took 0.005 seconds before are now taking 49 seconds.
Queries on 5.6 are skipping the indexes, it seems:
+----+-------------+-------+-------+----------------------------------------------------+---------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+----------------------------------------------------+---------+---------+------+--------+-------------+
| 1 | SIMPLE | pens | index | index_contents_on_slug,index_contents_on_slug_hash | PRIMARY | 4 | NULL | 471440 | Using where |
+----+-------------+-------+-------+----------------------------------------------------+---------+---------+------+--------+-------------+
1 row in set (0.00 sec)
But are not being skipped on 5.5:
+----+-------------+-------+-------------+----------------------------------------------------+----------------------------------------------------+---------+------+------+----------------------------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+----------------------------------------------------+----------------------------------------------------+---------+------+------+----------------------------------------------------------------------------------------------+
| 1 | SIMPLE | pens | index_merge | index_contents_on_slug,index_contents_on_slug_hash | index_contents_on_slug_hash,index_contents_on_slug | 768,768 | NULL | 2 | Using union(index_contents_on_slug_hash,index_contents_on_slug); Using where; Using filesort |
+----+-------------+-------+-------------+----------------------------------------------------+----------------------------------------------------+---------+------+------+----------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Both DBs were created from the same mysql dump.
Are these indexes not being constructed when I do the import on 5.6? How do I force the index creation?
The query:
SELECT `pens`.* FROM `pens` WHERE (slug_hash = 'style' OR slug = 'style') ORDER BY `pens`.`id` DESC LIMIT 1
Edit: Removed the schema
Ultimately the accepted answer above is correct.
The help from @RandomSeed got me thinking in the right direction. Basically the optimization plans created in 5.6 are significantly different from those in 5.5, so you'll probably have to rework your query, much like I did.
I did not end up using the FORCE INDEX, but instead removed portions of the query until I determined what was causing 5.6 to miss the index. Then I reworked the application logic to deal with that.
The slow query in v5.6 is caused by the engine being unable to, or deciding not to, merge the two relevant indexes (index_contents_on_slug_hash, index_contents_on_slug) in order to process your query. Remember that a query normally uses only one index per table. To take advantage of several indexes on the same table, the engine has to merge the results from these indexes on the fly (in memory). This is the meaning of the index_merge and Using union(...) notices in your execution plan. Obviously, this consumes time and memory.
Quick fix (and probably the preferred solution anyway): add a two-column index on slug and slug_hash.
ALTER TABLE pens ADD INDEX index_contents_on_slug_and_slug_hash (
slug, slug_hash
);
Now your new server is probably unable to merge these indexes because it results in an index too large to fit in the buffer. Your new server probably has a much smaller value for key_buffer_size (if the table is MyISAM) or for innodb_buffer_pool_size (if InnoDB) than there used to be in your older installation.
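Because a composite index is matched starting from its leftmost column, it is worth confirming with EXPLAIN that it really helps an OR across the two columns. A commonly used alternative for this kind of condition, sketched here under the assumption that the two single-column indexes from the question still exist, is to rewrite the OR as a UNION so that each branch can use its own index:

```sql
-- Each branch can use its own single-column index; UNION removes duplicate rows.
(SELECT `pens`.* FROM `pens` WHERE slug_hash = 'style')
UNION
(SELECT `pens`.* FROM `pens` WHERE slug = 'style')
ORDER BY id DESC
LIMIT 1;
```

Run EXPLAIN on both variants on the 5.6 server to see which plan the optimizer actually chooses.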

How to speed up mysql select in database with highly redundant key values

I have a very simple MySQL database with only 3 columns but several million rows.
Two of the columns (hid1, hid2) describe study objects (about 50,000 of them) and the third column (score) is the result of a comparison of hid1 with hid2. Thus, the number of rows is max(hid1)*max(hid2), which is quite a big number. Because the table has to be written only once and read many million times, I selected a MyISAM table (I hope this was a good idea). Initially, the plan was to retrieve 'score' for a given pair of hid1,hid2, but it turned out to be more convenient to retrieve all scores (and hid2) for a given hid1.
My table ("result") looks like this:
+-------+-----------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-----------------------+------+-----+---------+-------+
| hid1 | mediumint(8) unsigned | YES | MUL | NULL | |
| hid2 | mediumint(8) unsigned | YES | | NULL | |
| score | float | YES | | NULL | |
+-------+-----------------------+------+-----+---------+-------+
and a typical query would be
select hid1,hid2,score from result where hid1=13531 into outfile "/tmp/ttt"
Here is the problem: the query just takes too long, at least sometimes. For some 'hid1' values, I get the result back in under a second. For other hid1 values (particularly for big numbers), I have to wait for up to 40 sec. As I said, I have to run thousands of these queries, so I am interested in speeding things up.
Let me reiterate: there are about 50,000 hits to the query, and I don't need them in any particular order. Am I doing something wrong here, or is a relational database like MySQL not up to this task?
What I have already tried is increasing the key_buffer in /etc/mysql/my.cnf. This appeared to help, but not much. The index on hid1 is a few GB. Does the key_buffer have to be bigger than the index size to be effective?
Any hint would be appreciated.
Edit: here is an example run with the corresponding 'explain' output:
select hid1,hid2,score from result where hid1=132885 into outfile "/tmp/ttt"
Query OK, 16465 rows affected (31.88 sec)
As you can see below, the index hid1_index is actually being used:
mysql> explain select hid1,hid2,score from result where hid1=132885 into outfile "/tmp/ttt";
+----+-------------+--------+------+---------------+------------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------------+---------+-------+-------+-------------+
| 1 | SIMPLE | result | ref | hid1_index | hid1_index | 4 | const | 15456 | Using where |
+----+-------------+--------+------+---------------+------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)
What I do find puzzling is that queries with low numbers for hid1 are always much faster than those with high numbers. This is not what I would expect when an index is used.
Two random suggestions, based on a query pattern that always involve equality filter on hid1:
Use an InnoDB table instead and take advantage of a clustered index on (hid1, hid2). That way all rows belonging to the same hid1 will be physically located together, which will speed up retrieval.
Hash-partition the table on hid1, with a suitable number of partitions.
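Both suggestions could be tried on a copy of the data, roughly as sketched below. The table names result_innodb and result_part are made up, the primary key assumes that each (hid1, hid2) pair occurs only once and that neither column contains NULLs, and the partition count of 64 is arbitrary.

```sql
-- Suggestion 1: InnoDB copy whose primary key (the clustered index) is (hid1, hid2),
-- so all rows for one hid1 are stored next to each other.
CREATE TABLE result_innodb (
    hid1  MEDIUMINT UNSIGNED NOT NULL,
    hid2  MEDIUMINT UNSIGNED NOT NULL,
    score FLOAT,
    PRIMARY KEY (hid1, hid2)
) ENGINE=InnoDB;

-- Loading in primary key order keeps the insert mostly sequential.
INSERT INTO result_innodb (hid1, hid2, score)
SELECT hid1, hid2, score FROM result ORDER BY hid1, hid2;

-- Suggestion 2: hash-partitioned copy; a query on hid1 = ... touches one partition.
CREATE TABLE result_part (
    hid1  MEDIUMINT UNSIGNED NOT NULL,
    hid2  MEDIUMINT UNSIGNED NOT NULL,
    score FLOAT,
    PRIMARY KEY (hid1, hid2)
)
PARTITION BY HASH (hid1)
PARTITIONS 64;
```

After loading, the query from the question can be timed against each copy to see which layout helps more.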
The simplest way to optimize a query like that is to use an index. A simple statement like
alter table result add index(hid1);
would improve the query you sent. Moreover, if you want to search by both fields at once, you can include both fields in the index:
alter table result add index(hid1, hid2);
That way, MySQL can access the rows in an organized way and find the information you want.
If you run an explain on the first query without the index, you might see something like
| select_type | table  | type | possible_keys | rows    | Extra       |
| SIMPLE      | result | ALL  |               | 7765605 | Using where |
After adding the index, you should see
| select_type | table  | type | possible_keys | rows    | Extra |
| SIMPLE      | result | ref  | hid1          | 2816304 |       |
This tells you that in the first case MySQL needs to check ALL the rows, and in the second case it can find the information using a ref lookup on the index.
If you know the combination of hid1 and hid2 is unique, you should consider making that your primary key. That will automatically also add an index to hid1. See: http://dev.mysql.com/doc/refman/5.5/en/multiple-column-indexes.html
Also, check the output of EXPLAIN. See: http://dev.mysql.com/doc/refman/5.5/en/select-optimization.html and related links.