I have a MySQL (v8.0.30) table that stores trades; the schema is as follows:
CREATE TABLE `log_fill` (
`id` bigint NOT NULL AUTO_INCREMENT,
`orderId` varchar(63) NOT NULL,
`clientOrderId` varchar(36) NOT NULL,
`symbol` varchar(31) NOT NULL,
`executionId` varchar(255) DEFAULT NULL,
`executionSide` tinyint NOT NULL COMMENT '0 = long, 1 = short',
`executionSize` decimal(15,2) NOT NULL,
`executionPrice` decimal(21,8) unsigned NOT NULL,
`executionTime` bigint unsigned NOT NULL,
`executionValue` decimal(21,8) NOT NULL,
`executionFee` decimal(13,8) NOT NULL,
`feeAsset` varchar(63) DEFAULT NULL,
`positionSizeBeforeFill` decimal(21,8) DEFAULT NULL,
`apiKey` int NOT NULL,
`side` varchar(20) DEFAULT NULL,
`reconciled` tinyint unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `executionId` (`executionId`,`executionSide`),
KEY `apiKey` (`apiKey`)
) ENGINE=InnoDB AUTO_INCREMENT=6522695 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
As you can see, there's a BTREE index on the apiKey column, which identifies a user; this way I can quickly retrieve all trades for a specific user.
My goal is a query that returns positionSizeBeforeFill + executionSize for the last record, given an apiKey and a symbol. So I wrote the following:
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
ORDER BY id DESC
However, the execution is extremely slow (around 100 ms). I've noticed that running either WHERE or ORDER BY alone (not both together) drastically reduces execution time. For example,
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
only takes 220 microseconds to execute. The number of records after filtering by apiKey and symbol is 388.
Similarly,
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
ORDER BY id DESC
takes 26 microseconds (on a 3 million records table).
All in all, running WHERE and ORDER BY separately takes microseconds, but combining them scales execution time up to milliseconds (around 1000x more).
Running EXPLAIN on the slow query shows that it has to examine 116032 rows.
I tried creating a temporary table, hoping MySQL would perform the sort only on the filtered records, but the outcome is the same. I was wondering whether the problem might be the index (whose cardinality is 203), but how can that be the case when WHERE alone takes very little time? I could not find similar cases in other questions or forums. I think I just fail at understanding how InnoDB selects data; I thought it would first filter by WHERE and then perform ORDER BY on the filtered rows. How can I improve this? Thanks!
Edit: The EXPLAIN statement on the slow query returns
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+----------------------------------+
| 1 | SIMPLE | log_fill_tmp | NULL | ref | apiKey | apiKey | 4 | const | 116032 | 10.00 | Using where; Backward index scan |
The query with WHERE only
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+-------------+
| 1 | SIMPLE | log_fill_tmp | NULL | ref | apiKey | apiKey | 4 | const | 116032 | 10.00 | Using where |
The query with ORDER BY only on the full table
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+-------+---------------+---------+---------+------+---------+----------+---------------------+
| 1 | SIMPLE | log_fill_tmp | NULL | index | NULL | PRIMARY | 8 | NULL | 2503238 | 100.00 | Backward index scan |
26 microseconds for any query against a 3-million-row table implies that you have the Query cache enabled. Please rerun your timings with SELECT SQL_NO_CACHE .... (Even milliseconds would be suspicious.) Were all 3M rows returned? Shoveling that many rows probably takes more than a second under any circumstances.
Meanwhile, to speed up the first two queries, add
INDEX(symbol, apiKey, id)
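Concretely, that might look like the following (the index name is illustrative):
ALTER TABLE log_fill ADD INDEX idx_symbol_apiKey_id (symbol, apiKey, id);
With both equality columns leading and id last, the matching rows are adjacent in the index and already ordered by id, so ORDER BY id DESC becomes a short backward range scan rather than a 116K-row crawl.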
EXPLAIN gives only approximate counts (e.g., 116032). A "cardinality" of 203 is also just an estimate, but it is used by the Optimizer in some situations. Please get the exact count just to check that there really are any rows:
SELECT COUNT(*)
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
With the ORDER BY id DESC, it will scan the entire B+Tree that holds the data. As it says, it will do a 'Backward index scan'. However, since the "index" is the PRIMARY KEY and the PK is clustered with the data, it is really referring to the data's BTree.
The EXPLAIN for the first query decided that the indexes were not useful enough for the WHERE; instead it avoided the sort (ORDER BY) by doing the backward full table scan, same as the 3rd query. (And it ignored any rows that did not match the WHERE.)
I added a composite index on (apiKey, symbol) and now the query runs in as little as 0.2 ms. To reduce the creation time for the index, I reduced the number of records from 3M to 500K; the gain in time is about 97%, and I believe it will be even more on the full table. I thought that with just the apiKey index it would first filter by user, then by symbol, but I was probably wrong.
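For reference, the fix was along these lines (the index name is illustrative):
ALTER TABLE log_fill ADD INDEX idx_apiKey_symbol (apiKey, symbol);
Since InnoDB secondary indexes implicitly append the PRIMARY KEY, this index's entries are effectively ordered by (apiKey, symbol, id), which is why ORDER BY id DESC becomes cheap once both equality columns lead the index.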
There is a table test:
show create table test;
CREATE TABLE `test` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`body` longtext NOT NULL,
`timestamp` int(11) NOT NULL,
`handle_after` datetime NOT NULL,
`status` varchar(100) NOT NULL,
`queue_id` varchar(255) NOT NULL,
PRIMARY KEY (`id`),
KEY `idxTimestampStatus` (`timestamp`,`status`),
KEY `idxTimestampStatus2` (`status`,`timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=80000 DEFAULT CHARSET=utf8
There are two SELECTs:
1) select * from test where status = 'in_queue' and timestamp > 1625721850;
2) select id from test where status = 'in_queue' and timestamp > 1625721850;
In the first SELECT, EXPLAIN shows me that no indexes are used.
In the second SELECT, the index idxTimestampStatus2 is used.
MariaDB [db]> explain select * from test where status = 'in_queue' and timestamp > 1625721850;
+------+-------------+-------+------+----------------------------------------+------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+----------------------------------------+------+---------+------+----------+-------------+
| 1 | SIMPLE | test | ALL | idxTimestampStatus,idxTimestampStatus2 | NULL | NULL | NULL | 80000 | Using where |
+------+-------------+-------+------+----------------------------------------+------+---------+------+----------+-------------+
MariaDB [db]> explain select id from test where status = 'in_queue' and timestamp > 1625721850;
+------+-------------+-------+------+----------------------------------------+---------------------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+----------------------------------------+---------------------+---------+-------+------+--------------------------+
| 1 | SIMPLE | test | ref | idxTimestampStatus,idxTimestampStatus2 | idxTimestampStatus2 | 302 | const | 4 | Using where; Using index |
+------+-------------+-------+------+----------------------------------------+---------------------+---------+-------+------+--------------------------+
Help me figure out what I'm doing wrong.
How should I create an index for the first SELECT?
Why does the number of columns affect the index usage?
What you saw is to be expected. (The "number of columns" did not cause what you saw.) Read all the points below; various combinations of them should address all the issues raised in both the Question and Comments.
Deciding between index and table scan:
The Optimizer uses statistics to decide between using an index and doing a full table scan.
If less than (about) 20% of the rows need to be fetched, the index will be used. This involves bouncing back and forth between the index's BTree and the data's BTree.
If more of the table is needed, then it is deemed more efficient to simply scan the table, ignoring any rows that don't match the WHERE.
The "20%" is not a hard-and-fast number.
SELECT id ... status ... timestamp;
In InnoDB, a secondary index implicitly includes the columns of the PRIMARY KEY.
If all the columns mentioned in the query are in an index, then that index is "covering". This means that all the work can be done in the index's BTree without touching the data's BTree.
Using index == "covering". (That is, EXPLAIN gives this clue.)
"Covering" overrides the "20%" discussion.
SELECT * ... status ... timestamp;
SELECT * needs to fetch all columns, so "covering" does not apply and the "20%" becomes relevant.
If 1625721850 were a larger number (so that fewer rows matched), the EXPLAIN would switch from ALL to range.
idxTimestampStatus2 (status,timestamp)
The order of the clauses in WHERE does not matter.
The order of the columns in a "composite" index is important. ("Composite" == multi-column)
Put the = column(s) first, then one "range" (eg >) column.
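Applied to the first SELECT (status = ... AND timestamp > ...), those rules yield exactly the shape of the existing idxTimestampStatus2, e.g.:
ALTER TABLE test ADD INDEX idx_status_timestamp (status, timestamp); -- redundant here; same shape as idxTimestampStatus2
So the index definition is already right; it is the row-count estimate (the "20%" discussion above) that steers the first query to a table scan.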
More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
I have the following table (it has more data columns; I removed them because it would make for a long post):
CREATE TABLE `members` (
`memberid` int(11) NOT NULL AUTO_INCREMENT,
`firstname` varchar(45) COLLATE utf8_unicode_ci DEFAULT NULL,
`lastname` varchar(45) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`memberid`),
KEY `members_lname_ix` (`lastname`)
) ENGINE=InnoDB AUTO_INCREMENT=1019 DEFAULT CHARSET=utf8
COLLATE=utf8_unicode_ci;
By default, a user only ever accesses 10-20 rows from this table at a time, usually sorted by the lastname column; it's all paginated server-side. So I decided to add an index on lastname to help with sorting. However, the index does not seem to be working like I would expect it to. When I run EXPLAIN SELECT * FROM members ORDER BY lastname ASC I get:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra
1 | simple | members | ALL | null | null | null | null | 711 | using filesort
I can at least confirm the index exists because if I run SHOW INDEX FROM members I get:
Table | Non_Unique | Key_name | Seq_in_ix | Col_name | Collation | Cardinality | Sub part | Packed | Null | Ix type
members | 0 | PRIMARY | 1 | memberid | A | 711 | null | null | (blank) | BTREE
members | 1 | members_lname_ix | 1 | lastname | A | 711 | null | null | YES | BTREE
If I add USE INDEX (members_lname_ix), both possible_keys and key remain null. However, if I add FORCE INDEX (members_lname_ix), possible_keys remains null and key shows members_lname_ix. This is my first time trying to apply indexing, but to me this doesn't seem very intuitive - it feels like MySQL should know that I created an index for lastname, no? I can't quite figure out what I'm doing wrong here unless I am misunderstanding something. Is the solution here to just keep using FORCE INDEX?
There are two ways to perform that query:
Plan A (as you were expecting):
Scan through the index sequentially, reading the entire (estimated) 711 rows.
Randomly look up each row in the data BTree. This involves reading the entire dataset.
Deliver the data in order.
Plan B (what it does):
Scan through the data, reading all 711 rows.
Sort the data
Deliver the sorted data.
Plan B does not touch the index at all; this was deemed to be a bigger savings than not having to sort the data.
In a table as tiny as yours, it would be hard to see a difference in speed. (In my test case, it took under 10 milliseconds either way.) In huge tables, the difference could be significant.
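Note that a LIMIT changes the math: with server-side pagination, Plan A can stop after a handful of rows, so the index becomes attractive. A minimal sketch:
SELECT * FROM members ORDER BY lastname ASC LIMIT 20; -- the optimizer will likely walk members_lname_ix and stop after ~20 row lookups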
For optimal pagination, see http://mysql.rjweb.org/doc.php/pagination
I am trying to find the closest gene, given position information, from a gene table. Here is an example:
SELECT chrom, txStart, txEnd, name2, strand
FROM wgEncodeGencodeCompV12
WHERE chrom = 'chr1' AND txStart < 713885 AND strand = '+'
ORDER BY txStart DESC LIMIT 1;
My test runs have been pretty slow, which is problematic.
Here is an EXPLAIN output with default indexing (by chrom):
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | wgEncodeGencodeCompV12 | ref | chrom | chrom | 257 | const | 15843 | Using where; Using filesort |
Filesort is used and is probably causing all the sluggishness?
I tried speeding up the sorting by indexing (chrom, txStart, strand), or just txStart alone, but it only got slower (?). My reasoning is that txStart is not selective enough to be a good index, and that a whole-table scan in this case is actually faster?
Here is the EXPLAIN output with the additional indexing:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | wgEncodeGencodeCompV12 | range | chrom,closest_gene_lookup | closest_gene_lookup | 261 | NULL | 57 | Using where |
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | wgEncodeGencodeCompV12 | range | chrom,txStart | txStart | 4 | NULL | 1571 | Using where |
Table structure
CREATE TABLE `wgEncodeGencodeCompV12` (
`bin` smallint(5) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
`chrom` varchar(255) NOT NULL,
`strand` char(1) NOT NULL,
`txStart` int(10) unsigned NOT NULL,
`txEnd` int(10) unsigned NOT NULL,
`cdsStart` int(10) unsigned NOT NULL,
`cdsEnd` int(10) unsigned NOT NULL,
`exonCount` int(10) unsigned NOT NULL,
`exonStarts` longblob NOT NULL,
`exonEnds` longblob NOT NULL,
`score` int(11) DEFAULT NULL,
`name2` varchar(255) NOT NULL,
`cdsStartStat` enum('none','unk','incmpl','cmpl') NOT NULL,
`cdsEndStat` enum('none','unk','incmpl','cmpl') NOT NULL,
`exonFrames` longblob NOT NULL,
KEY `chrom` (`chrom`,`bin`),
KEY `name` (`name`),
KEY `name2` (`name2`)
)
Is there a way to make this more efficient? I appreciate your time!
(Update) Solution:
Combining both commenters' suggestions significantly improved run time.
In your case (a query on a single table, no joins, no complicated stuff) it is important to understand the distribution of values in each column, and to understand how the database server utilizes the indexes. When you have a field with a rather big range of different values, that one should be used for indexing. (E.g., an index on strand would just split the whole data into + or -, and downstream filters would have to process each row of either the + or - result set; that's near the worst case.)
So far, we know that txStart has the most differentiated values distribution amongst the interesting columns of your query.
So, your query definitely should utilize an index on that column! But a BTree index, not a hash index (operators <, <=, >, etc. are fast on BTree, but not on hash).
Try again with just a single (BTree) index on txStart (I know you already tried that, but please try again and avoid all secondary indexes, etc.).
Multi-column indexes are nice, but their complexity makes them not as fast as plain single-column indexes; MySQL's optimizer is rather stupid in selecting the optimal indexes ;-)
Another important factor could be the dynamic row size (because of using longblob columns). But I am not up-to-date on the current state of MySQL in that regard.
The index that you want is: wgEncodeGencodeCompV12(chrom, strand, txStart).
In general, you want the fields with equalities as the first columns in the index. Then add one field with the inequality.
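A sketch of that index plus the query it serves (the index name is illustrative):
ALTER TABLE wgEncodeGencodeCompV12 ADD INDEX idx_chrom_strand_txStart (chrom, strand, txStart);
SELECT chrom, txStart, txEnd, name2, strand
FROM wgEncodeGencodeCompV12
WHERE chrom = 'chr1' AND strand = '+' AND txStart < 713885
ORDER BY txStart DESC LIMIT 1;
With the equalities on chrom and strand leading, the index entries are already sorted by txStart, so the ORDER BY ... DESC LIMIT 1 can stop after reading a single entry - no filesort at all.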
My MySQL database has over 350 million rows, and is growing. It's 32GB in size right now. I am using SSD's and lots of RAM, but would like to seek advice to make sure I am using appropriate indexes.
CREATE TABLE `qcollector` (
`key` bigint(20) NOT NULL AUTO_INCREMENT,
`instrument` char(4) DEFAULT NULL,
`datetime` datetime DEFAULT NULL,
`last` double DEFAULT NULL,
`lastsize` int(10) DEFAULT NULL,
`totvol` int(10) DEFAULT NULL,
`bid` double DEFAULT NULL,
`ask` double DEFAULT NULL,
PRIMARY KEY (`key`),
KEY `datetime_index` (`datetime`)
) ENGINE=InnoDB;
show index from qcollector;
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| qcollector | 0 | PRIMARY | 1 | key | A | 378866659 | NULL | NULL | | BTREE | | |
| qcollector | 1 | datetime_index | 1 | datetime | A | 63144443 | NULL | NULL | YES | BTREE | | |
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set (0.03 sec)
select * from qcollector order by datetime desc limit 1;
+-----------+------------+---------------------+---------+----------+---------+---------+--------+
| key | instrument | datetime | last | lastsize | totvol | bid | ask |
+-----------+------------+---------------------+---------+----------+---------+---------+--------+
| 389054487 | ES | 2012-06-29 15:14:59 | 1358.25 | 2 | 2484771 | 1358.25 | 1358.5 |
+-----------+------------+---------------------+---------+----------+---------+---------+--------+
1 row in set (0.09 sec)
A typical query that is slow (full table scan; this query takes 3-4 minutes):
explain select date(datetime), count(lastsize) from qcollector where instrument = 'ES' and datetime > '2011-01-01' and time(datetime) between '15:16:00' and '15:29:00' group by date(datetime) order by date(datetime) desc;
+------+-------------+------------+------+----------------+------+---------+------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+------------+------+----------------+------+---------+------+-----------+----------------------------------------------+
| 1 | SIMPLE | qcollector | ALL | datetime_index | NULL | NULL | NULL | 378866659 | Using where; Using temporary; Using filesort |
+------+-------------+------------+------+----------------+------+---------+------+-----------+----------------------------------------------+
A couple ideas for you to consider:
A covering index (that is, an index that includes ALL of the columns referenced in the query) may help some. Such an index is going to require more disk (SSD?) space, but it will remove the necessity for MySQL to visit the data pages to lookup the values of the columns that aren't in the index.
ON qcollector (datetime,instrument,lastsize)
or
ON qcollector (instrument,datetime,lastsize)
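For example, picking the second ordering (the index name is illustrative):
ALTER TABLE qcollector ADD INDEX idx_instr_dt_lastsize (instrument, datetime, lastsize);
This makes the index covering for the query above, though the TIME(datetime) predicate still cannot drive a range scan (see below).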
Do you really need to exclude rows that have a NULL value for lastsize from the count? Could you return a count of all rows instead? If you could instead return COUNT(1) or SUM(1), then the query wouldn't need to reference the lastsize column, so it wouldn't be needed in an index to make it a covering index.
The COUNT(lastsize) expression is equivalent to SUM(IF(lastsize IS NULL,0,1))
Do you need to return dates when there are only NULL lastsize values for the datetime range, or could all of the rows with a NULL lastsize be excluded? That is, could you include a predicate like
AND lastsize IS NOT NULL
in your query?
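Putting those suggestions together, the query might become something like this sketch (not a drop-in replacement - the semantics change if you really do need the dates that have only NULL lastsize values):
SELECT DATE(datetime), COUNT(1)
FROM qcollector
WHERE instrument = 'ES'
  AND datetime > '2011-01-01'
  AND TIME(datetime) BETWEEN '15:16:00' AND '15:29:00'
  AND lastsize IS NOT NULL
GROUP BY DATE(datetime)
ORDER BY DATE(datetime) DESC;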
Those may help some.
I think the big problem is that the predicates on the TIME(datetime) expression are not sargable. That is, MySQL won't use an index range scan operation for those. The predicate on the bare datetime column is sargable... that's why the EXPLAIN is showing the datetime_index as a possible key.
And the other big problem is that the query is doing GROUP BY and ORDER BY operations on a derived expression, which is going to require MySQL to generate an intermediate result set (as a temporary MyISAM table), and then process that result set. And that can be a lot of heavy lifting when there are lots of rows to process.
As far as table changes, I would consider using separate DATE and TIME columns, and using a TIMESTAMP datatype in place of DATETIME (if you need to store the date and time together). I would rewrite the query to reference the bare DATE and bare TIME columns, and consider adding a covering index that included all columns referenced in the rewritten query, with leading columns being the columns with the highest cardinality (and having the most selective predicates in the query.)
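A hedged sketch of that restructuring, using hypothetical trade_date and trade_time columns:
ALTER TABLE qcollector ADD COLUMN trade_date DATE, ADD COLUMN trade_time TIME;
-- one-time backfill; on 350M rows, do this in batches
UPDATE qcollector SET trade_date = DATE(datetime), trade_time = TIME(datetime);
ALTER TABLE qcollector ADD INDEX idx_instr_date_time (instrument, trade_date, trade_time, lastsize);
SELECT trade_date, COUNT(lastsize)
FROM qcollector
WHERE instrument = 'ES'
  AND trade_date > '2011-01-01'
  AND trade_time BETWEEN '15:16:00' AND '15:29:00'
GROUP BY trade_date
ORDER BY trade_date DESC;
Both predicates are now sargable, and the index covers every column the query references.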
When you use date and time functions on a column, the indexes cannot be used efficiently. You could also store the date and time in separate columns and index those, though this will take up more storage space.
You may also want to consider adding multi-column indexes. An index on (instrument, datetime) would probably help you here.
I've got a weird problem with a MySQL index. I have a table views_video:
CREATE TABLE `views_video` (
`video_id` smallint(5) unsigned NOT NULL,
`record_date` date NOT NULL,
`region` char(2) NOT NULL DEFAULT '',
`views` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`video_id`,`record_date`,`region`),
KEY `video_id` (`video_id`)
)
The table contains 3.4 million records.
I run the EXPLAIN on this query:
SELECT video_id, views FROM views_video where video_id <= 156
I got:
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
| 1 | SIMPLE | views_video | range | PRIMARY,video_id | video_id | 2 | NULL | 587984 | Using where |
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
But when I run the EXPLAIN on this query:
SELECT video_id, views FROM views_video where video_id <= 157
I got:
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | views_video | ALL | PRIMARY,video_id | NULL | NULL | NULL | 3412892 | Using where |
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
video_id is from 1 to 1034. There is nothing special between 156 and 157.
What happens here?
Update:
I have added more data into the database. Now video_id runs from 1 to 1064, and the table has 3.8M records. The threshold has moved: the difference is now between 114 and 115.
I'm guessing that with 3.4 million records, and only 1064 possible entries for your key, your selectivity is very low. (In other words, there are many duplicates, which makes it far less useful as a key.) The optimizer is taking its best guess if it is more efficient to use the key or not. You've found a threshold for that decision.
It might be the key population.
Run these:
SELECT COUNT(1)/20 INTO @FivePctOfData FROM views_video;
SELECT COUNT(1) AS videoidcount, video_id
FROM views_video
WHERE video_id <= 157
GROUP BY video_id;
The query optimizer probably took a vacation when one of the keys hit the 5% threshold.
You said there are 3.4 million rows. 5% would be 170,000. Perhaps this number was exceeded at some point in the query optimizer's life cycle on your query.
If you've added/deleted substantial data since creating the table, it's worthwhile to try ANALYZE TABLE on it. It frequently solves a lot of phantom indexing issues, and it's very fast even on large tables.
Update: Also, the unique index values are very low compared to the number of rows in the table. MySQL won't use indexes when a single index value points to too many rows. Try constraining the query further with another column that's part of the primary key.
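For instance (the date constant below is hypothetical):
ANALYZE TABLE views_video;
SELECT video_id, views
FROM views_video
WHERE video_id <= 157 AND record_date >= '2013-01-01';
ANALYZE TABLE refreshes the index statistics, and the extra predicate on the second PRIMARY KEY column gives the optimizer a more selective estimate to work with.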