Recently I ran into a question: how does MySQL implement the loose index scan?
For example, the test table structure is:
CREATE TABLE test (
id int(11) NOT NULL default '0',
v1 int(10) unsigned NOT NULL default '0',
v2 int(10) unsigned NOT NULL default '0',
v3 int(10) unsigned NOT NULL default '0',
PRIMARY KEY (id),
KEY v1_v2_v3 (v1,v2,v3)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
select * from test;
+----+----+-----+----+
| id | v1 | v2 | v3 |
+----+----+-----+----+
| 1 | 1 | 0 | 1 |
| 2 | 3 | 1 | 2 |
| 10 | 4 | 10 | 10 |
| 0 | 4 | 100 | 0 |
| 3 | 4 | 100 | 3 |
| 5 | 5 | 9 | 5 |
| 8 | 7 | 3 | 8 |
| 7 | 7 | 4 | 7 |
| 30 | 8 | 15 | 30 |
+----+----+-----+----+
Now let's look at two queries.
First one:
mysql> explain select v1,v2 from test group by v1,v2;
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
| 1 | SIMPLE | test | range | NULL | v1_v2_v3 | 8 | NULL | 3 | Using index for group-by |
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
I know that Using index for group-by means MySQL uses the loose index scan to run the query. But why is the rows column in the explain output 3? I wonder how MySQL can scan only three rows and still get the query result.
Second one:
mysql> explain select max(v3) from test where v1>3 group by v1,v2;
+----+-------------+-------+-------+---------------+----------+---------+------+------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+----------+---------+------+------+---------------------------------------+
| 1 | SIMPLE | test | range | v1_v2_v3 | v1_v2_v3 | 8 | NULL | 1 | Using where; Using index for group-by |
+----+-------------+-------+-------+---------------+----------+---------+------+------+---------------------------------------+
1 row in set (0.00 sec)
mysql> explain select max(v2) from test where v1>3 group by v1,v2;
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
| 1 | SIMPLE | test | range | v1_v2_v3 | v1_v2_v3 | 4 | NULL | 4 | Using where; Using index |
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
The only difference between the above two queries is in the select list: one uses max(v3), the other max(v2). But why does max(v3) use the loose index scan while max(v2) doesn't? I don't understand why the GROUP BY Optimization documentation says:
The only aggregate functions used in the select list (if any) are MIN() and MAX(), and all of them refer to the same column. The column must be in the index and must immediately follow the columns in the GROUP BY.
Why must the column immediately follow the columns in the GROUP BY?
I have been searching the net for a long time, but to no avail. Please help, or give me some ideas on how to achieve this.
Thanks!
This is too long for a comment.
Essentially, when asking "why does the optimizer behave a certain way", the answer is because the designers implemented it that way. If you want to know "why", you would have to ask them . . . that is not an appropriate question for a general-purpose forum.
I want to point out a few things, though. If you think that the max(v2) behavior is a bug, then you can report it at bugs.mysql.com. I don't think it is a bug for two reasons:
The documentation explicitly states how the optimization works, and this query is not documented to use the index ("v2" does not follow the keys in the group by).
Even if it were documented differently, the use of an aggregation function on a group by key is, shall I say, non-sensical. It is valid SQL, but it is simply verbose and unnecessary. Such constructs are way down on the list of priorities for database implementors.
Finally, MySQL does not really use statistics (very well?) when creating the query plan. However, in most databases, validating a query plan on 9 rows (which fit on a single data page) often results in a query plan that does a full table scan and "inefficient" algorithms. As an example, an algorithm such as bubble sort is quite inefficient on large numbers of rows, but it can be the most efficient sorting algorithm on a (very) small number of rows.
Is there any reason to use max(v2) in the query? The result is the same even if you do not use the max() function. If you change the query to "select v2 from test where v1 > 3 group by v1, v2", it will be executed using the loose index scan method.
And here are the reasons why the column must immediately follow the columns in the GROUP BY.
v1 v2 v3
1 1 1
1 1 2
1 1 10
1 2 1
1 2 2
1 2 8
In this case, select max(v3) from t1 group by v1, v2 can be performed with a loose index scan. Only the last key of each (v1, v2) group needs to be read, as shown in the following figure.
v1 v2 v3
1 1 1
1 1 2
1 1 10 ------------------> 10 return
1 2 1
1 2 2
1 2 8 ------------------> 8 return
However, if you perform select max(v3) from t1 group by v1, a loose index scan is not possible, because within a single v1 value the v3 values are not in sorted order. You have to access all the keys to find the maximum value (=10).
v1 v2 v3
1 1 1 ------------------> (x)
1 1 2 ------------------> (x)
1 1 10 ------------------> 10 return
1 2 1 ------------------> (x)
1 2 2 ------------------> (x)
1 2 8 ------------------> (x)
Note that you can use the following command to see how many records are accessed using loose index scan (or tight index scan).
flush status;
select max(v3) from t1 group by v1,v2; -- perform loose index scan
show session status like 'Handler_read_key%';
flush status;
select max(v3) from t1 group by v1; -- perform tight index scan
show session status like 'Handler_read_key%';
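To make the difference concrete, here is a minimal Python sketch that models the index as a sorted list of (v1, v2, v3) tuples. It is only an illustration of the idea, not MySQL's actual implementation, and the function names are made up for this example. The loose variant seeks once per (v1, v2) group and reads the last entry of the group; the tight variant has to read every entry.

```python
from bisect import bisect_right

# Index entries in (v1, v2, v3) order, the way a B-tree on (v1, v2, v3) keeps them.
INDEX = sorted([
    (1, 1, 1), (1, 1, 2), (1, 1, 10),
    (1, 2, 1), (1, 2, 2), (1, 2, 8),
])

def loose_max_v3(index):
    """Model of MAX(v3) ... GROUP BY v1, v2: one seek per group."""
    results, seeks, pos = [], 0, 0
    while pos < len(index):
        v1, v2, _ = index[pos]
        # Seek just past the last entry of the (v1, v2) group.
        end = bisect_right(index, (v1, v2, float("inf")))
        seeks += 1
        # Within a group v3 is sorted, so the last entry holds MAX(v3).
        results.append((v1, v2, index[end - 1][2]))
        pos = end  # jump straight to the next group
    return results, seeks

def tight_max_v3_by_v1(index):
    """Model of MAX(v3) ... GROUP BY v1: v3 is unsorted within v1, read all."""
    best, reads = {}, 0
    for v1, _, v3 in index:
        reads += 1
        best[v1] = max(best.get(v1, v3), v3)
    return sorted(best.items()), reads

print(loose_max_v3(INDEX))        # ([(1, 1, 10), (1, 2, 8)], 2)
print(tight_max_v3_by_v1(INDEX))  # ([(1, 10)], 6)
```

The seek and read counts here belong to the model only; the Handler_read_key figures reported by a real server will differ in detail.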
Hello, I have a table created by the following query (MariaDB version 10.5.9):
CREATE TABLE `test` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`status` varchar(60) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `test_status_IDX` (`status`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4
I always thought that the primary key is by default the clustered index, which also defines the order of the rows in the table, but here it seems that the index on status is picked as the clustered index. Why is this happening and how can I change it?
MariaDB [test]> select * from test;
+----+--------+
| id | status |
+----+--------+
| 2 | cfrc |
| 5 | hjr |
| 1 | or |
| 3 | test |
| 6 | verve |
| 4 | yes |
+----+--------+
6 rows in set (0.001 sec)
It is not safe to assume that the results of SELECT will be ordered by any column across DB engines. You should always use ORDER BY col [ASC|DESC] if you expect sorting to happen. I see records being displayed in the order they were added, but that can change after deletions/insertions etc., and should not be relied on. See here for more details.
(I am going to cite MySQL docs in my answer but in the context of this question, the information applies to MariaDB as well.)
First of all, let's talk about index extensions. Whenever you define a secondary index (i.e. any index that is not the clustered index), the InnoDB engine automatically extends it behind the scenes by appending the primary key columns to it. That is called an index extension.
The extended index therefore contains the columns you defined in your original secondary index (in the same order) with the columns of the primary key added after them. So, in your example, InnoDB extends test_status_IDX (let's call the extended index X) to the columns (status, id).
Now let's look at the query select * from test;. There is no WHERE clause here, so all the optimizer needs to do to satisfy this query is fetch all columns for all rows of the table. This boils down to fetching status & id since there are no other columns in the table. These exact fields happen to be stored within the extended index X. This makes index X a covering index for this query. A covering index is an index that, given a query, can fully produce the results of the query without having to read any actual data rows.
Therefore, the optimizer reads & returns the values needed for the result of the query from index X, in the order that they appear there, which is by status, hence the order you observed.
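As a rough illustration of why the rows come back in status order, the following Python sketch models the two structures with plain containers. It is a simplified model under the assumptions above (a secondary index whose entries are (status, id) pairs), not how InnoDB stores data, and the function names are invented for the example.

```python
# Rows as held in the clustered index: keyed and ordered by the primary key (id).
clustered = {1: "or", 2: "cfrc", 3: "test", 4: "yes", 5: "hjr", 6: "verve"}

# Model of the extended secondary index on status: (status, id) entries kept
# in sorted order, i.e. the primary key is appended to each index entry.
secondary = sorted((status, pk) for pk, status in clustered.items())

def select_all_via_secondary(index):
    """Answer SELECT id, status from the index alone (a covering index)."""
    return [(pk, status) for status, pk in index]

def select_all_via_clustered(rows):
    """Answer the same query by walking the clustered index in id order."""
    return [(pk, rows[pk]) for pk in sorted(rows)]

print(select_all_via_secondary(secondary))
# [(2, 'cfrc'), (5, 'hjr'), (1, 'or'), (3, 'test'), (6, 'verve'), (4, 'yes')]
print(select_all_via_clustered(clustered))
# [(1, 'or'), (2, 'cfrc'), (3, 'test'), (4, 'yes'), (5, 'hjr'), (6, 'verve')]
```

Both functions return the same set of rows; only the order differs, which is exactly what the question observed.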
To further demonstrate and extend (pun intended) this point, let's reproduce the example (tested with MariaDB 10.4):
1. First create the table & add the rows
CREATE TABLE foo (
id int(10) unsigned NOT NULL AUTO_INCREMENT,
status varchar(60) DEFAULT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB;
INSERT INTO foo VALUES
(1, 'or'),
(2, 'cfrc'),
(3, 'test'),
(4, 'yes'),
(5, 'hjr'),
(6, 'verve');
SELECT * FROM foo;
+----+--------+
| id | status |
+----+--------+
| 1 | or |
| 2 | cfrc |
| 3 | test |
| 4 | yes |
| 5 | hjr |
| 6 | verve |
+----+--------+
2. Now let's add the secondary index and check the order again
CREATE INDEX secondary_idx ON foo (status);
SELECT * FROM foo;
+----+--------+
| id | status |
+----+--------+
| 2 | cfrc |
| 5 | hjr |
| 1 | or |
| 3 | test |
| 6 | verve |
| 4 | yes |
+----+--------+
As described above, the rows are returned in the order they appear in the (extended) secondary_idx
3. Now let's drop the index and re-add it with a prefix length of 2. For a VARCHAR column the prefix length is counted in characters, so the index will not store the full value of the column but only its first two characters. The extended index is then no longer a covering index, because it cannot fully produce the results of the query. Thus the clustered index will be used.
ALTER TABLE foo DROP INDEX secondary_idx;
CREATE INDEX secondary_idx ON foo (status(2));
SELECT * FROM foo;
+----+--------+
| id | status |
+----+--------+
| 1 | or |
| 2 | cfrc |
| 3 | test |
| 4 | yes |
| 5 | hjr |
| 6 | verve |
+----+--------+
4. Let's showcase this behaviour in another way. Here we will retain the original secondary index (without a prefix length) but we will add a 3rd column to the table. This will once again render the secondary index a non covering index (because it does not contain the 3rd column), therefore, the clustered index will be used here as well.
ALTER TABLE foo DROP INDEX secondary_idx;
CREATE INDEX secondary_idx ON foo (status);
ALTER TABLE foo ADD bar integer NOT NULL;
SELECT * FROM foo;
+----+--------+-----+
| id | status | bar |
+----+--------+-----+
| 1 | or | 0 |
| 2 | cfrc | 0 |
| 3 | test | 0 |
| 4 | yes | 0 |
| 5 | hjr | 0 |
| 6 | verve | 0 |
+----+--------+-----+
Adding bar to the index (or dropping it from the table) will again make the query use the secondary index.
ALTER TABLE foo DROP INDEX secondary_idx;
CREATE INDEX secondary_idx ON foo (status, bar);
SELECT * FROM foo;
+----+--------+-----+
| id | status | bar |
+----+--------+-----+
| 2 | cfrc | 0 |
| 5 | hjr | 0 |
| 1 | or | 0 |
| 3 | test | 0 |
| 6 | verve | 0 |
| 4 | yes | 0 |
+----+--------+-----+
You can also use EXPLAIN on all of the SELECT statements above to see which index is used at each stage.
#aprsa is right. I falsely assumed that the results would be in the same order as the clustered index, but in this case (using InnoDB) the status index is used to evaluate the query, which is why the result appears to be 'sorted' by status. If I select the id then the primary index is used and the results appear to be 'sorted' by the id. In another engine this might not be true.
That particular table is composed of 2 BTrees:
The data, sorted by the PRIMARY KEY. Yes, it is clustered and is ordered 1,2,3,...
The secondary index, sorted by status. Each secondary index contains a copy of the PK so that it can reach into the other BTree to get the rest of the columns (not that there are any more!). That is, this BTree is equivalent to a 2-column table with PRIMARY KEY(status) plus an id.
Note how the output is in status order. I have to assume it decided to simply read the secondary index in its order to provide the results.
Yes, you must specify an ORDER BY if you want a particular ordering. You must not assume the details I just discussed. Who knows, tomorrow there may be something else going on, such as an in-memory "hash" that has the information scrambled in some other way!
(This Answer applies to both MySQL and MariaDB. However, MariaDB is already playing a game with hashing that MySQL has not yet picked up. Be warned! Or simply add an ORDER BY.)
I have a simple InnoDB table with 1M+ rows and some simple indexes.
I need to sort this table by first_public and id columns and get some of them, this is why I've indexed first_public column.
first_public is unique at the moment, but in real life it might be not.
mysql> desc table;
+--------------+-------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+-------------------------+------+-----+---------+----------------+
| id | bigint unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(255) | NO | | NULL | |
| id_category | int | NO | MUL | NULL | |
| active | smallint | NO | | NULL | |
| status | enum('public','hidden') | NO | | NULL | |
| first_public | datetime | YES | MUL | NULL | |
| created_at | timestamp | YES | | NULL | |
| updated_at | timestamp | YES | | NULL | |
+--------------+-------------------------+------+-----+---------+----------------+
8 rows in set (0.06 sec)
It works well as long as the offset stays around 130000:
mysql> explain select id from table where active = 1 and status = 'public' order by first_public desc, id desc limit 24 offset 130341;
+----+-------------+--------+------------+-------+---------------+---------------------+---------+------+--------+----------+----------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------+---------------------+---------+------+--------+----------+----------------------------------+
| 1 | SIMPLE | table | NULL | index | NULL | firstPublicDateIndx | 6 | NULL | 130365 | 5.00 | Using where; Backward index scan |
+----+-------------+--------+------------+-------+---------------+---------------------+---------+------+--------+----------+----------------------------------+
1 row in set, 1 warning (0.00 sec)
But when I try to get the next rows (with an offset of 140000+), it looks like MySQL doesn't use the first_public column index at all.
mysql> explain select id from table where active = 1 and status = 'public' order by first_public desc, id desc limit 24 offset 140341;
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-----------------------------+
| 1 | SIMPLE | table | NULL | ALL | NULL | NULL | NULL | NULL | 1133533 | 5.00 | Using where; Using filesort |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-----------------------------+
1 row in set, 1 warning (0.00 sec)
I tried adding the first_public column to the select clause, but nothing changed.
What am I doing wrong?
MySQL's optimizer tries to estimate the cost of doing your query, to decide if it's worth using an index. Sometimes it compares the cost of using the index versus just reading the rows in order, and discarding the ones that don't belong in the result.
In this case, once the OFFSET grew past roughly 140k, it gave up on using the index.
Keep in mind how OFFSET works. There's no way of looking up the location of an offset by an index. Indexes help to look up rows by value, not by position. So to do an OFFSET query, it has to examine all the rows from the first matching row on up. Then it discards the rows it examined up to the offset, and then counts out enough rows to meet the LIMIT and returns those.
It's like if you wanted to read pages 500-510 in a book, but to do this, you had to read pages 1-499 first. Then when someone asks you to read pages 511-520, you have to read pages 1-510 over again.
Eventually the offset gets to be so large that it's less expensive to read 140000 rows in a table-scan, than to read 140000 index entries + 140000 rows.
What?!? Is OFFSET really so expensive? Yes, it is. It's much more common to look up rows by value, so MySQL is optimized for that usage.
So if you can reimagine your pagination queries to look up rows by value instead of using LIMIT/OFFSET, you'll be much happier.
For example, suppose you read "page" 1000, and you see that the highest id value on that page is 13999. When the client requests the next page, you can do the query:
SELECT ... FROM mytable WHERE id > 13999 ORDER BY id LIMIT 24;
This does the lookup by the value of id, which is optimized because it utilizes the primary key index. Then it reads just 24 rows and returns them (MySQL is at least smart enough to stop reading after it reaches the OFFSET + LIMIT rows).
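Here is a small runnable sketch of the two paging styles. It uses Python's built-in sqlite3 module purely as a stand-in so the example is self-contained; the table mytable, its columns, and the helper names are assumptions for illustration, and the cost behaviour of MySQL itself is not measured here. The point is only that "remember the last id you saw" returns the same page as LIMIT/OFFSET.

```python
import sqlite3

# In-memory stand-in database with 100 rows, ids 1..100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO mytable (id, name) VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 101)],
)

PAGE_SIZE = 24

def page_by_offset(page_number):
    """LIMIT/OFFSET paging: every skipped row still has to be walked past."""
    return conn.execute(
        "SELECT id FROM mytable ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, page_number * PAGE_SIZE),
    ).fetchall()

def page_after(last_seen_id):
    """Keyset paging: look up by value, starting right after the last id seen."""
    return conn.execute(
        "SELECT id FROM mytable WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, PAGE_SIZE),
    ).fetchall()

first = page_after(0)               # ids 1..24
second = page_after(first[-1][0])   # ids 25..48, found by value, not by position
print(second == page_by_offset(1))  # True
```

The client has to carry the last id of the previous page along with the request, which is the price of avoiding OFFSET.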
The best index is
INDEX(active, status, first_public, id)
Using huge offsets is terribly inefficient -- it must scan over 140341 + 24 rows to perform the query.
If you are trying to "walk through" the table, use the technique of "remembering where you left off". More discussion of this: http://mysql.rjweb.org/doc.php/pagination
The reason for the Optimizer to abandon the index: It decided that the bouncing back and forth between the index and the table was possibly worse than simply scanning the entire table. (The cutoff is about 20%, but varies widely.)
SELECT AVG(table1.column1) as a,
table2.column2
FROM table1
LEFT OUTER JOIN table2
ON table2.column2 = table1.column2
GROUP BY table2.column2 ORDER BY a DESC LIMIT 10
This is MySQL code. I have 1.5 million rows in table1 and 200,000 rows in table2.
I am still waiting for the query to finish.
Does anybody know a way to make it run in a shorter time?
Lot of comments in the same vein but I thought I'd give a thorough answer. I'm gonna use one of my own tables/databases here for explanation. Let's take this query:
SELECT A.id, B.asin FROM AmazonWishlistItems A LEFT JOIN AmazonWishlistItemPrices B ON (B.asin = A.asin) WHERE A.asin LIKE "%C%"
This query returns about 851 rows and takes 0.5 seconds. If we add the word EXPLAIN to the query, MySQL tells us what this query is doing.
mysql> EXPLAIN SELECT A.id, B.asin FROM AmazonWishlistItems A LEFT JOIN AmazonWishlistItemPrices B ON (B.asin = A.asin) WHERE A.asin LIKE "%C%";
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | A | ALL | NULL | NULL | NULL | NULL | 1183 | Using where |
| 1 | SIMPLE | B | ALL | NULL | NULL | NULL | NULL | 6594 | |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
2 rows in set (0.00 sec)
The important column to look at here is rows, as this is the number of records MySQL has to examine. In this case it has to read all the rows of both tables A and B, even though only 851 fit the condition. This is how tables can get out of hand quickly: this one has only 6594 records to search through, but left alone it could easily reach your 1.5 million rows.
So we can cut this down by adding an index to the table, allowing MySQL to store a reference for each record.
ALTER TABLE AmazonWishlistItemPrices ADD INDEX idx_asin (asin)
This simply says: create an index called idx_asin and use the column asin to do the indexing. If we re-run our EXPLAIN...
mysql> EXPLAIN SELECT A.id, B.asin FROM AmazonWishlistItems A LEFT JOIN AmazonWishlistItemPrices B ON (B.asin = A.asin) WHERE A.asin LIKE "%C%";
+----+-------------+-------+------+---------------+----------+---------+---------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+---------------------+------+-------------+
| 1 | SIMPLE | A | ALL | NULL | NULL | NULL | NULL | 1183 | Using where |
| 1 | SIMPLE | B | ref | idx_asin | idx_asin | 12 | mah_database.A.asin | 6 | Using index |
+----+-------------+-------+------+---------------+----------+---------+---------------------+------+-------------+
2 rows in set (0.00 sec)
We're down to six rows, and you can see in possible_keys that it has found our index. You may find that with certain joins and WHERE clauses your indexes are ignored; that's simply MySQL saying "I'm going to have to get all the data anyway" because of the conditions you've provided.
It's best to use numeric keys for indexing; you can get away with some varchars, but they take up more disk space. You should have a PRIMARY KEY on each table where possible. So look at your database structure and consider adding some indexes.
One final thing: to check whether your table has indexes, you can use SHOW CREATE TABLE followed by the table name.
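To show why the rows figure drops so sharply, here is a toy Python model of the join with and without an index on the joined column. It only counts how many entries of B are examined; the dictionary stands in for idx_asin and is an assumption of the sketch (a real index is a B-tree, not a hash), and the sample data is made up.

```python
from collections import defaultdict

def join_without_index(a_rows, b_rows):
    """Nested-loop LEFT JOIN: every row of B is examined for every row of A."""
    examined, out = 0, []
    for asin in a_rows:
        matches = []
        for b_asin in b_rows:
            examined += 1
            if b_asin == asin:
                matches.append(b_asin)
        # LEFT JOIN keeps the A row even when nothing in B matches.
        out.extend((asin, m) for m in (matches or [None]))
    return out, examined

def join_with_index(a_rows, b_rows):
    """Same join, but B is reached through a lookup structure keyed on asin."""
    index = defaultdict(list)  # stands in for idx_asin
    for b_asin in b_rows:
        index[b_asin].append(b_asin)
    examined, out = 0, []
    for asin in a_rows:
        matches = index.get(asin, [])
        examined += len(matches)  # only the matching entries are touched
        out.extend((asin, m) for m in (matches or [None]))
    return out, examined

a = ["A1", "A2", "A3"]
b = ["A1", "A1", "A2", "B9"]
print(join_without_index(a, b)[1])  # 12 entries of B examined
print(join_with_index(a, b)[1])     # 3 entries of B examined
```

Both functions produce the same joined rows; the index changes only how much of B has to be read to find them.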
I have this select query, ItemType is varchar type and ItemComments is int type:
select * from ItemInfo where ItemType="item_type" order by ItemComments desc limit 1
You can see this query has 3 conditions:
where 'ItemType' equals a specific value;
order by 'ItemComments'
with descending order
The interesting thing is, when I select rows with all three conditions, it's getting very slow. But if I drop any one of the three (except condition 2), the query runs quite fast. See:
select * from ItemInfo where ItemType="item_type" order by ItemComments desc limit 1;
/* Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 16.318 sec. */
select * from ItemInfo where ItemType="item_type" order by ItemComments limit 1;
/* Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 0.140 sec. */
select * from ItemInfo order by ItemComments desc limit 1;
/* Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 0.015 sec. */
Plus,
I'm using MySQL 5.7 with InnoDB engine.
I have created indexes on both ItemType and ItemComments and table ItemInfo contains 2 million rows.
I have looked into many possible explanations, like MySQL support for descending indexes, composite indexes and so on. But these still can't explain why query #1 runs slowly while queries #2 and #3 run well.
It would be much appreciated if anyone could help me out.
Updates: create table and explain info
Create code:
CREATE TABLE `ItemInfo` (
`ItemID` VARCHAR(255) NOT NULL,
`ItemType` VARCHAR(255) NOT NULL,
`ItemPics` VARCHAR(255) NULL DEFAULT '0',
`ItemName` VARCHAR(255) NULL DEFAULT '0',
`ItemComments` INT(50) NULL DEFAULT '0',
`ItemScore` DECIMAL(10,1) NULL DEFAULT '0.0',
`ItemPrice` DECIMAL(20,2) NULL DEFAULT '0.00',
`ItemDate` DATETIME NULL DEFAULT '1971-01-01 00:00:00',
PRIMARY KEY (`ItemID`, `ItemType`),
INDEX `ItemDate` (`ItemDate`),
INDEX `ItemComments` (`ItemComments`),
INDEX `ItemType` (`ItemType`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
Explain result:
mysql> explain select * from ItemInfo where ItemType="item_type" order by ItemComments desc limit 1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| 1 | SIMPLE | i | NULL | index | ItemType | ItemComments | 5 | NULL | 83 | 1.20 | Using where |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
mysql> explain select * from ItemInfo where ItemType="item_type" order by ItemComments limit 1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| 1 | SIMPLE | i | NULL | index | ItemType | ItemComments | 5 | NULL | 83 | 1.20 | Using where |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
mysql> explain select * from ItemInfo order by ItemComments desc limit 1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------+
| 1 | SIMPLE | i | NULL | index | NULL | ItemComments | 5 | NULL | 1 | 100.00 | NULL |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------+
Query from O. Jones:
mysql> explain
-> SELECT a.*
-> FROM ItemInfo a
-> JOIN (
-> SELECT MAX(ItemComments) ItemComments, ItemType
-> FROM ItemInfo
-> GROUP BY ItemType
-> ) maxcomm ON a.ItemType = maxcomm.ItemType
-> AND a.ItemComments = maxcomm.ItemComments
-> WHERE a.ItemType = 'item_type';
+----+-------------+------------+------------+-------+----------------------------------------+-------------+---------+---------------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------+----------------------------------------+-------------+---------+---------------------------+---------+----------+--------------------------+
| 1 | PRIMARY | a | NULL | ref | ItemComments,ItemType | ItemType | 767 | const | 27378 | 100.00 | Using where |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 772 | mydb.a.ItemComments,const | 10 | 100.00 | Using where; Using index |
| 2 | DERIVED | ItemInfo | NULL | index | PRIMARY,ItemDate,ItemComments,ItemType | ItemType | 767 | NULL | 2289466 | 100.00 | NULL |
+----+-------------+------------+------------+-------+----------------------------------------+-------------+---------+---------------------------+---------+----------+--------------------------+
I'm not sure if I executed this query correctly, but it did not return the records even after quite a long time.
Query from Vijay. I added the ItemType join condition, because with only max_comnt it returns items from other ItemTypes:
SELECT ifo.* FROM ItemInfo ifo
JOIN (SELECT ItemType, MAX(ItemComments) AS max_comnt FROM ItemInfo WHERE ItemType="item_type") inn_ifo
ON ifo.ItemComments = inn_ifo.max_comnt and ifo.ItemType = inn_ifo.ItemType
/* Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 7.441 sec. */
explain result:
+----+-------------+------------+------------+-------------+-----------------------+-----------------------+---------+-------+-------+----------+-----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------------+-----------------------+-----------------------+---------+-------+-------+----------+-----------------------------------------------------+
| 1 | PRIMARY | <derived2> | NULL | system | NULL | NULL | NULL | NULL | 1 | 100.00 | NULL |
| 1 | PRIMARY | ifo | NULL | index_merge | ItemComments,ItemType | ItemComments,ItemType | 5,767 | NULL | 88 | 100.00 | Using intersect(ItemComments,ItemType); Using where |
| 2 | DERIVED | ItemInfo | NULL | ref | ItemType | ItemType | 767 | const | 27378 | 100.00 | NULL |
+----+-------------+------------+------------+-------------+-----------------------+-----------------------+---------+-------+-------+----------+-----------------------------------------------------+
And I want to explain why I used order with limit in the first place: I was planning to fetch records from the table randomly with a specific probability. The random index is generated in Python and sent to MySQL as a variable. But then I found it cost so much time that I decided to just use the first record I got.
Inspired by O. Jones and Vijay, I tried using the max function, but it doesn't perform well:
select max(ItemComments) from ItemInfo where ItemType='item_type'
/* Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 6.225 sec. */
explain result:
+----+-------------+------------+------------+------+---------------+----------+---------+-------+-------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+------+---------------+----------+---------+-------+-------+----------+-------+
| 1 | SIMPLE | ItemInfo | NULL | ref | ItemType | ItemType | 767 | const | 27378 | 100.00 | NULL |
+----+-------------+------------+------------+------+---------------+----------+---------+-------+-------+----------+-------+
Thanks to everyone who contributed to this question. I hope you can suggest more solutions based on the information above.
Please provide CURRENT SHOW CREATE TABLE ItemInfo.
For most of those queries, you need the composite index
INDEX(ItemType, ItemComments)
For the last one, you need
INDEX(ItemComments)
For that especially slow query, please provide EXPLAIN SELECT ....
Discussion - Why does INDEX(ItemType, ItemComments) help with where ItemType="item_type" order by ItemComments desc limit 1?
An index is structured in a BTree (see Wikipedia), thereby making searching for an individual item very fast, and making scanning in a particular order very fast.
where ItemType="item_type" says to filter on ItemType, but there are a lot of such entries in the index. In this index, they are ordered by ItemComments (for a given ItemType). The direction desc suggests starting with the highest value of ItemComments; that is the 'end' of the index items for that ItemType. Finally limit 1 says to stop after one item is found. (Somewhat like finding the last "S" in your Rolodex.)
So the query is to 'drill down' the BTree to the end of the entries for ItemType in the composite INDEX(ItemType, ItemComments) and grab one entry -- a very efficient task.
Actually SELECT * implies that there is one more step, namely to get all the columns for that one row. That info is not in the index, but over in the BTree for ItemInfo -- which contains all the columns for all the rows, ordered by the PRIMARY KEY.
The "secondary index" (INDEX(ItemType, ItemComments)) implicitly contains a copy of the relevant PRIMARY KEY columns, so we now have the values of ItemID and ItemType. With those, we can drill down this other BTree to find the desired row and fetch all (*) the columns.
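The drill-down described above can be sketched in Python by modelling the composite index as a sorted list and using binary search in place of the BTree descent. This is a simplified model with made-up sample rows and a hypothetical helper name, not InnoDB's actual data layout.

```python
from bisect import bisect_right

# Model of INDEX(ItemType, ItemComments): entries sorted by type, then by
# comments, each carrying a copy of the primary key (ItemID, ItemType).
index = sorted([
    ("book", 3, ("b1", "book")),
    ("book", 40, ("b2", "book")),
    ("item_type", 7, ("i1", "item_type")),
    ("item_type", 95, ("i2", "item_type")),
    ("item_type", 12, ("i3", "item_type")),
    ("toy", 500, ("t1", "toy")),
])

# Model of the clustered index: full rows keyed by the primary key.
table = {
    pk: {"ItemID": pk[0], "ItemType": pk[1], "ItemComments": comments}
    for _, comments, pk in index
}

def top_commented(item_type):
    """WHERE ItemType = ? ORDER BY ItemComments DESC LIMIT 1 with one seek."""
    # Binary search to just past the last index entry for item_type.
    end = bisect_right(index, (item_type, float("inf")))
    if end == 0 or index[end - 1][0] != item_type:
        return None  # no entry for this type
    _, _, pk = index[end - 1]
    # Second lookup: use the primary key to fetch all columns of the row.
    return table[pk]

print(top_commented("item_type"))
# {'ItemID': 'i2', 'ItemType': 'item_type', 'ItemComments': 95}
```

Only one index entry and one row are touched, no matter how many rows share the same ItemType.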
Your query with ascending order can take advantage of your index on ItemComments.
SELECT * ... ORDER BY ... LIMIT 1 is a notorious performance antipattern. Why? The server must sort a whole mess of rows, just to discard all but the first.
You might try this (for your descending order variant). It's a little more verbose but much more efficient.
SELECT a.*
FROM ItemInfo a
JOIN (
SELECT MAX(ItemComments) ItemComments, ItemType
FROM ItemInfo
GROUP BY ItemType
) maxcomm ON a.ItemType = maxcomm.ItemType
AND a.ItemComments = maxcomm.ItemComments
WHERE a.ItemType = 'item_type'
Why does this work? It uses GROUP BY / MAX() to find the maximum value rather than ORDER BY ... DESC LIMIT 1. The subquery does your search.
To make this work as efficiently as possible you need a compound (multicolumn) index on (ItemType, ItemComments). Create that with
ALTER TABLE ItemInfo ADD INDEX ItemTypeCommentIndex (ItemType, ItemComments);
When you create the new index, drop your index on ItemType, because the new index is redundant with that one.
MySQL's query planner is smart enough to see the outer WHERE clause before it runs the inner GROUP BY query, so it doesn't have to aggregate the whole table.
With that compound index MySQL can use a loose index scan to satisfy the subquery. Those are almost miraculously fast. You should read up on the topic.
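To give a feel for why a loose index scan is cheap, here is a toy Python model. It assumes the index is just a sorted list of (ItemType, ItemComments) pairs with made-up values, and it jumps from group to group with a binary search instead of walking every entry. This is a sketch of the idea, not MySQL's actual implementation.

```python
from bisect import bisect_right

# Hypothetical contents of INDEX(ItemType, ItemComments), already sorted.
index = [
    ("a", 1), ("a", 4), ("a", 9),
    ("b", 2), ("b", 3),
    ("c", 7),
]

def max_per_group(entries):
    """Return ({ItemType: MAX(ItemComments)}, number of groups visited)."""
    result = {}
    groups_visited = 0
    pos = 0
    while pos < len(entries):
        group = entries[pos][0]
        # Seek straight past the end of this group rather than stepping
        # through each of its entries; the binary search stands in for a
        # BTree seek.
        end = bisect_right(entries, (group, float("inf")), lo=pos)
        # Within a group the entries are sorted, so the last one is the max.
        result[group] = entries[end - 1][1]
        groups_visited += 1
        pos = end
    return result, groups_visited

maxima, groups_visited = max_per_group(index)
```

The work grows with the number of distinct ItemType values rather than with the number of rows, which is the property that makes the GROUP BY / MAX() subquery fast.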
Your query will select all the rows that match the where condition. After that it will sort those rows according to the order by clause, and then it will select the first row. A better query would be something like
SELECT ifo.* FROM ItemInfo ifo
JOIN (SELECT MAX(ItemComments) AS max_comnt FROM ItemInfo WHERE ItemType="item_type") inn_ifo
ON ifo.ItemComments = inn_ifo.max_comnt
WHERE ifo.ItemType="item_type"
This is faster because the subquery only finds the maximum value of the column. Finding MAX() is only O(n), but the fastest comparison-based sorting algorithm is O(n log n). So if you avoid the order by clause, the query will perform faster.
Hope this helped.
I need the SQL equivalent of this.
I have a table like this
ID MN MX
-- -- --
A 0 3
B 4 6
C 7 9
Given a number, say 5, I want to find the ID of the row where MN and MX contain that number, in this case that would be B.
Obviously,
SELECT ID FROM T WHERE ? BETWEEN MN AND MX
would do, but I have 9 million rows and I want this to run as fast as possible. In particular, I know that there can be only one matching row, I know that the MN-MX ranges cover the space completely, and so on. With all these constraints on the possible answers, there should be some optimizations I can make. Shouldn't there be?
All I have so far is indexing MN and using the following
SELECT ID FROM T WHERE ? BETWEEN MN AND MX ORDER BY MN LIMIT 1
but that is weak.
If you have an index spanning MN and MX it should be pretty fast, even with 9M rows.
alter table T add index mn_mx (mn, mx);
Edit
I just tried a test w/ a 1M row table
mysql> select count(*) from T;
+----------+
| count(*) |
+----------+
| 1000001 |
+----------+
1 row in set (0.17 sec)
mysql> show create table T\G
*************************** 1. row ***************************
Table: T
Create Table: CREATE TABLE `T` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`mn` int(10) DEFAULT NULL,
`mx` int(10) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `mn_mx` (`mn`,`mx`)
) ENGINE=InnoDB AUTO_INCREMENT=1048561 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql> select * from T order by rand() limit 1;
+--------+-----------+-----------+
| id | mn | mx |
+--------+-----------+-----------+
| 112940 | 948004986 | 948004989 |
+--------+-----------+-----------+
1 row in set (0.65 sec)
mysql> explain select id from T where 948004987 between mn and mx;
+----+-------------+-------+-------+---------------+-------+---------+------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+--------+--------------------------+
| 1 | SIMPLE | T | range | mn_mx | mn_mx | 5 | NULL | 239000 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+--------+--------------------------+
1 row in set (0.00 sec)
mysql> select id from T where 948004987 between mn and mx;
+--------+
| id |
+--------+
| 112938 |
| 112939 |
| 112940 |
| 112941 |
+--------+
4 rows in set (0.03 sec)
In my example I just had an incrementing range of mn values and set mx to mn + 3, which is why I got more than one row, but the same should apply to you.
Edit 2
Reworking your query will definitely be better
mysql> explain select id from T where mn<=947892055 and mx>=947892055;
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| 1 | SIMPLE | T | range | mn_mx | mn_mx | 5 | NULL | 9 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
It's worth noting that even though the first explain reported many more rows to be scanned, I had a large enough InnoDB buffer pool to keep the entire table in RAM after creating it, so it was still pretty fast.
If there are no gaps in your set, a simple gte comparison will work:
SELECT ID FROM T WHERE ? >= MN ORDER BY MN DESC LIMIT 1
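A quick way to sanity-check the lower-bound idea is to run it against the three sample rows from the question. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for MySQL (an assumption for convenience; the index name mn_idx and the helper find_id are made up). The row wanted is the one with the largest MN that does not exceed the number, hence the descending order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (ID TEXT PRIMARY KEY, MN INTEGER, MX INTEGER)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [("A", 0, 3), ("B", 4, 6), ("C", 7, 9)],  # sample rows from the question
)
conn.execute("CREATE INDEX mn_idx ON T (MN)")

def find_id(n):
    """Return the ID of the range containing n, assuming gap-free ranges."""
    row = conn.execute(
        # Largest MN that is <= n: with an index on MN this is a single
        # seek followed by reading one entry.
        "SELECT ID FROM T WHERE MN <= ? ORDER BY MN DESC LIMIT 1",
        (n,),
    ).fetchone()
    return row[0] if row else None

found = [find_id(n) for n in (0, 5, 9)]
```

Note that this only holds when the ranges really do cover the space with no gaps; otherwise a number that falls into a gap would be attributed to the preceding range, and an extra check against MX would be needed.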