Optimizing MySQL indexes for query (trading tick data database) - mysql

My MySQL database has over 350 million rows, and is growing. It's 32GB in size right now. I am using SSD's and lots of RAM, but would like to seek advice to make sure I am using appropriate indexes.
CREATE TABLE `qcollector` (
`key` bigint(20) NOT NULL AUTO_INCREMENT,
`instrument` char(4) DEFAULT NULL,
`datetime` datetime DEFAULT NULL,
`last` double DEFAULT NULL,
`lastsize` int(10) DEFAULT NULL,
`totvol` int(10) DEFAULT NULL,
`bid` double DEFAULT NULL,
`ask` double DEFAULT NULL,
PRIMARY KEY (`key`),
KEY `datetime_index` (`datetime`)
) ENGINE=InnoDB;
show index from qcollector;
+------------+------------+----------------+--------------+-------------+-----------+-- -----------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| qcollector | 0 | PRIMARY | 1 | key | A | 378866659 | NULL | NULL | | BTREE | | |
| qcollector | 1 | datetime_index | 1 | datetime | A | 63144443 | NULL | NULL | YES | BTREE | | |
+------------+------------+----------------+--------------+-------------+-----------+------ -------+----------+--------+------+------------+---------+---------------+
2 rows in set (0.03 sec)
select * from qcollector order by datetime desc limit 1;
+-----------+------------+---------------------+---------+----------+---------+---------+--------+
| key | instrument | datetime | last | lastsize | totvol | bid | ask |
+-----------+------------+---------------------+---------+----------+---------+---------+--------+
| 389054487 | ES | 2012-06-29 15:14:59 | 1358.25 | 2 | 2484771 | 1358.25 | 1358.5 |
+-----------+------------+---------------------+---------+----------+---------+---------+--------+
1 row in set (0.09 sec)
A typical query that is slow (full table scan, this query takes 3-4 minutes):
explain select date(datetime), count(lastsize) from qcollector where instrument = 'ES' and datetime > '2011-01-01' and time(datetime) between '15:16:00' and '15:29:00' group by date(datetime) order by date(datetime) desc;
+------+-------------+------------+------+----------------+------+---------+------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+------------+------+----------------+------+---------+------+-----------+----------------------------------------------+
| 1 | SIMPLE | qcollector | ALL | datetime_index | NULL | NULL | NULL | 378866659 | Using where; Using temporary; Using filesort |
+------+-------------+------------+------+----------------+------+---------+------+-----------+----------------------------------------------+

A couple ideas for you to consider:
A covering index (that is, an index that includes ALL of the columns referenced in the query) may help some. Such an index is going to require more disk (SSD?) space, but it will remove the necessity for MySQL to visit the data pages to lookup the values of the columns that aren't in the index.
ON qcollector (datetime,instrument,lastsize)
or
ON qcollector (instrument,datetime,lastsize)
Do you really need to exclude rows that have a NULL value for lastsize from the count? Could you return a count of all rows instead? If you could instead return COUNT(1) or SUM(1), then the query wouldn't need to reference the lastsize column, so it wouldn't be needed in an index to make it a covering index.
The COUNT(lastsize) expression is equivalent to SUM(IF(lastsize IS NULL,0,1))
Do you need to return dates when there are only NULL lastsize values for the datetime range, or could all of the rows with a NULL lastsize be excluded? That is, could you include a predicate like
AND lastsize IS NOT NULL
in your query?
Those may help some.
I think the big problem is that the predicates on the TIME(datetime) expression are not sargable. That is, MySQL won't use an index range scan operation for those. The predicate on the bare datetime column is sargable... that's why the EXPLAIN is showing the datetime_index as a possible key.
And the other big problem is that the query is doing GROUP BY and ORDER BY operations on a derived expression, which is going to require MySQL to generate an intermediate result set (as a temporary MyISAM table), and then process that result set. And that can be a lot of heavy lifting when there are lots of rows to process.
As far as table changes, I would consider using separate DATE and TIME columns, and using a TIMESTAMP datatype in place of DATETIME (if you need to store the date and time together). I would rewrite the query to reference the bare DATE and bare TIME columns, and consider adding a covering index that included all columns referenced in the rewritten query, with leading columns being the columns with the highest cardinality (and having the most selective predicates in the query.)

When you use date and time functions on a column the indexes cannot be used efficiently. You could also store the date and time in separate columns and index those, though this will take up more storage space.
You may also want to consider adding multi-column indexes. An index on (instrument, datetime) would probably help you here.

Related

MySQL - Query becomes slow when combining WHERE and ORDER BY

I have a MySQL (v8.0.30) table that stores trades, the schema is the following:
CREATE TABLE `log_fill` (
`id` bigint NOT NULL AUTO_INCREMENT,
`orderId` varchar(63) NOT NULL,
`clientOrderId` varchar(36) NOT NULL,
`symbol` varchar(31) NOT NULL,
`executionId` varchar(255) DEFAULT NULL,
`executionSide` tinyint NOT NULL COMMENT '0 = long, 1 = short',
`executionSize` decimal(15,2) NOT NULL,
`executionPrice` decimal(21,8) unsigned NOT NULL,
`executionTime` bigint unsigned NOT NULL,
`executionValue` decimal(21,8) NOT NULL,
`executionFee` decimal(13,8) NOT NULL,
`feeAsset` varchar(63) DEFAULT NULL,
`positionSizeBeforeFill` decimal(21,8) DEFAULT NULL,
`apiKey` int NOT NULL,
`side` varchar(20) DEFAULT NULL,
`reconciled` tinyint unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `executionId` (`executionId`,`executionSide`),
KEY `apiKey` (`apiKey`)
) ENGINE=InnoDB AUTO_INCREMENT=6522695 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
As you can see, there's a BTREE index on the column apiKey which stands for a user, this way I can quickly retrieve all trades for a specific user.
My goal is a query that returns positionSizeBeforeFill + executionSize for the last record, given an apiKey and a symbol. So I wrote the following:
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
ORDER BY id DESC
However the execution is extremely slow (around 100ms). I've noticed that running either WHERE or ORDER BY (and not both together) drastically reduces execution time. For example
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
only takes 220 microseconds to execute. The number of records after filtering by apiKey and symbol is 388.
Similarly,
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
ORDER BY id DESC
takes 26 microseconds (on a 3 million records table).
All in all, separately running WHERE and ORDER BY takes microseconds of execution, when I combine them we scale up to milliseconds (around 1000x more).
Running EXPLAIN on the slow query it turns out it has to examine 116032 rows.
I tried to create a temporary table hoping for MySQL to perform sorting only on the filtered records, but the outcome is the same. Was wondering whether the problem might be the index (whose cardinality is 203), but how can it be the case when WHERE alone takes very little time? I could not find similar cases on other questions or forums. I think I just fail at understanding how InnoDB selects data, I thought it would first filter by WHERE and then perform ORDER BY on the filtered rows. How can I improve this? Thanks!
Edit: The EXPLAIN statement on the slow query returns
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+----------------------------------+
| 1 | SIMPLE | log_fill_tmp | NULL | ref | apiKey | apiKey | 4 | const | 116032 | 10.00 | Using where; Backward index scan |
The query with WHERE only
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+-------------+
| 1 | SIMPLE | log_fill_tmp | NULL | ref | apiKey | apiKey | 4 | const | 116032 | 10.00 | Using where |
The query with ORDER BY only on the full table
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+-------+---------------+---------+---------+------+---------+----------+---------------------+
| 1 | SIMPLE | log_fill_tmp | NULL | index | NULL | PRIMARY | 8 | NULL | 2503238 | 100.00 | Backward index scan |
26 microseconds for any query against a 3-million-row table implies that you have the Query cache enabled. Please rerunning your timings with SELECT SQL_NO_CACHE .... (Even milliseconds would be suspicious.) Were all 3M rows returned? Shoveling that many rows probably takes more than a second under any circumstances.
Meanwhile, to speed up the first two queries, add
INDEX(symbol, apikey, id)
EXPLAIN gives only approximate (eg, 116032) counts. A "cardinality" of 203 is also just an estimate, but it is used by the Optimizer in some situations. Please get the exact count just to check that there really are any rows:
SELECT COUNT(*)
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
With the ORDER BY id DESC, it will scan the entire B+Tree what holds the data. As it says, it will do a 'Backward index scan'. However, since the "index" is the PRIMARY KEY and the PK is clustered with the data, it is really referring to the data's BTree.
The EXPLAIN for the first query decided that the indexes were not useful enough for WHERE; instead it avoided the sort (ORDER BY) by doing the Backward full table scan, same as the 3rd query. (And ignored any rows that did not match the WHERE.
I added a composite index on (apiKey, symbol) and now the query runs in as little as 0.2ms. In order to reduce the creation time for the index I reduce the number of records from 3M to 500K and the gain in time is about 97%, I believe it's going to be more on the full table. I thought using just the apiKey index it would first filter out by user, then by symbol, but I'm probably wrong.

MySQL - Created index isn't showing up as possible key

I have the following table (it has more data columns, removed them because it would be a long post):
CREATE TABLE `members` (
`memberid` int(11) NOT NULL AUTO_INCREMENT,
`firstname` varchar(45) COLLATE utf8_unicode_ci DEFAULT NULL,
`lastname` varchar(45) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`memberid`),
KEY `members_lname_ix` (`lastname`)
) ENGINE=InnoDB AUTO_INCREMENT=1019 DEFAULT CHARSET=utf8
COLLATE=utf8_unicode_ci;
By default, a user only ever accesses 10-20 rows from this table at a time and it is usually sorted by the lastname column, it's all paginated server side. so I decided to add an index to lastname to help with sorting, however the index does not seem to be working like I would expect it to. when I run EXPLAIN SELECT * FROM members ORDER BY lastname ASC I get:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra
1 | simple | members | ALL | null | null | null | null | 711 | using filesort
I can at least confirm the index exists because if I run SHOW INDEX FROM members I get:
Table | Non_Unique | Key_name | Seq_in_ix | Col_name | Collation | Cardinality | Sub part | Packed | Null | Ix type
members | 0 | PRIMARY | 1 | memberid | A | 711 | null | null | (blank) | BTREE
members | 1 | members_lname_ix | 1 | lastname | A | 711 | null | null | YES | BTREE
if I add USE INDEX (members_lname_ix) both possible_keys and key will remain null. However if I add FORCE INDEX (members_lname_ix) possible_keys remains null and key shows members_lname_ix. This is my first time trying to apply indexing but to me this doesn't seem very intuitive - it feels like mysql should know that I created an index for lastname, no? I can't quite figure out what I'm doing wrong here unless I am misunderstanding something. Is the solution here to just keep using FORCE INDEX?
There are two ways to perform that query:
Plan A (as you were expecting):
Scan through the index sequentially, reading the entire (estimated) 711 rows.
Randomly look up each row in the data BTree. This involves reading the entire dataset.
Deliver the data in order.
Plan B (what it does):
Scan through the data, reading all 711 rows.
Sort the data
Deliver the sorted data.
Plan B does not touch the index at all; this was deemed to be a bigger savings than not having to sort the data.
In a table as tiny as yours, it would be hard to see a difference in speed. (In my test case, it took under 10 milliseconds either way.) In huge tables, the difference could be significant.
For optimal pagination, see http://mysql.rjweb.org/doc.php/pagination

MySql refuses to use index

I'm new to query optimizations so I accept I don't understand everything yet but I do not understand why even this simple query isn't optimized as expected.
My table:
+------------------+-----------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------------+-----------+------+-----+-------------------+----------------+
| tasktransitionid | int(11) | NO | PRI | NULL | auto_increment |
| taskid | int(11) | NO | MUL | NULL | |
| transitiondate | timestamp | NO | MUL | CURRENT_TIMESTAMP | |
+------------------+-----------+------+-----+-------------------+----------------+
My indexes:
+-----------------+------------+-------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-----------------+------------+-------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| tasktransitions | 0 | PRIMARY | 1 | tasktransitionid | A | 952 | NULL | NULL | | BTREE | | |
| tasktransitions | 1 | transitiondate_ix | 1 | transitiondate | A | 952 | NULL | NULL | | BTREE | | |
+-----------------+------------+-------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
My query:
SELECT taskid FROM tasktransitions WHERE transitiondate>'2013-09-31 00:00:00';
gives this:
+----+-------------+-----------------+------+-------------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+------+-------------------+------+---------+------+------+-------------+
| 1 | SIMPLE | tasktransitions | ALL | transitiondate_ix | NULL | NULL | NULL | 1082 | Using where |
+----+-------------+-----------------+------+-------------------+------+---------+------+------+-------------+
If I understand everything correctly Using where and ALL means that all rows are retrieved from the storage engine and filtered at server layer. This is sub-optimal. Why does it refuse to use the index and only retrieve the requested range from the storage engine (innoDB)?
Cheers
MySQL will not use the index if it estimates that it would select a significantly large portion of the table, and it thinks that a table-scan is actually more efficient in those cases.
By analogy, this is the reason the index of a book doesn't contain very common words like "the" -- because it would be a waste of time to look up the word in the index and find the list of page numbers is a very long list, even every page in the book. It would be more efficient to simply read the book cover to cover.
My experience is that this happens in MySQL if a query's search criteria would match greater than 20% of the table, and this is usually the right crossover point. There could be some variation based on the data types, size of table, etc.
You can give a hint to MySQL to convince it that a table-scan would be prohibitively expensive, so it would be much more likely to use the index. This is not usually necessary, but you can do it like this:
SELECT taskid FROM tasktransitions FORCE INDEX (transitiondate_ix)
WHERE transitiondate>'2013-09-31 00:00:00';
I once was trying to join two tables and MySQL was refusing to use an index, resulting in >500ms queries, sometimes a few seconds. Turns out the column I was joining on had different encodings on each table. Changing both to the same encoding sped up the query to consistently less than 100ms.
Just in case, it helps somebody.
I have a table with a varchar column _id (long int coded as string). I added an index for this column, but query was still slow. I was executing this query:
select * from table where (_id = 2221835089) limit 1
I realized that the _id column wasn't been generated as string (I'm Laravel as DB framework). Well, if query is executed with the right data type in the where clause everything worked like a charm:
select * from table where (_id = '2221835089') limit 1
I am new at my MySQL 8.0, have finished 2 simple tutorials completely, and there is only two subjects that has not worked for me, one of them is indexing. I read the section labeled "2 Answers" and found that using
the statement suggested at the end of said section, seems to defeat the
purpose of the original USE INDEX or FORCE INDEX statement below. The suggested statement is like getting a table sorted via a WHERE statement instead of MySQL using USE INDEX or FORCE INDEX. It works, but seems to me it is not the same as using the natural USE INDEX or FORCE INDEX. Does any one knows why MySQL is ignoring my simple request to index a 10 row table on the Lname column?
Field
Type
Null
Key
Default
Extra
ID
int
NO
PRI
Null
auto_increment
Lname
varchar(20)
NO
MUL
Null
Fname
varchar(20)
NO
Mul
Null
City
varchar(15)
NO
Null
Birth_Date
date
NO
Null
CREATE INDEX idx_Lname ON TestTable (Lname);
SELECT * FROM TestTable USE INDEX (idx_Lname);
SELECT * From Testtable FORCE INDEX (idx_LastFirst);

MySQL join issue, query hangs

I have a table holding numeric data points with timestamps, like so:
CREATE TABLE `value_table1` (
`datetime` datetime NOT NULL,
`value` decimal(14,8) DEFAULT NULL,
KEY `datetime` (`datetime`)
) ENGINE=InnoDB;
My table holds a data point for every 5 seconds, so timestamps in the table will be, e.g.:
"2013-01-01 10:23:35"
"2013-01-01 10:23:40"
"2013-01-01 10:23:45"
"2013-01-01 10:23:50"
I have a few such value tables, and it is sometimes necessary to look at the ratio between two value series.
I therefore attempted a join, but it seems to not work:
SELECT value_table1.datetime, value_table1.value / value_table2.rate
FROM value_table1
JOIN value_table2
ON value_table1.datetime = value_table2.datetime
ORDER BY value_table1.datetime ASC;
Running EXPLAIN on the query shows:
+----+-------------+--------------+------+---------------+------+---------+------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+---------------+------+---------+------+-------+---------------------------------+
| 1 | SIMPLE | value_table1 | ALL | NULL | NULL | NULL | NULL | 83784 | Using temporary; Using filesort |
| 1 | SIMPLE | value_table2 | ALL | NULL | NULL | NULL | NULL | 83735 | |
+----+-------------+--------------+------+---------------+------+---------+------+-------+---------------------------------+
Edit
Problem solved, no idea where my index disappeared to. EXPLAIN showed it, thanks!
Thanks!
As your explain shows, the query is not using indexes on the join. Without indexes, it has to scan every row in both tables to process the join.
First of all, make sure the columns used in the join are both indexed.
If they are, then it might be the column type that is causing issues. You could create an integer representation of the time, and then use that to join the two tables.

MySQL Index is bigger than the data stored

I have a database with the following stats
Tables Data Index Total
11 579,6 MB 0,9 GB 1,5 GB
So as you can see the Index is close to 2x bigger. And there is one table with ~7 million rows that takes up at least 99% of this.
I also have two indexes that are very similar
a) UNIQUE KEY `idx_customer_invoice` (`customer_id`,`invoice_no`),
b) KEY `idx_customer_invoice_order` (`customer_id`,`invoice_no`,`order_no`)
Update: Here is the table definition (at least structurally) of the largest table
CREATE TABLE `invoices` (
`id` int(10) unsigned NOT NULL auto_increment,
`customer_id` int(10) unsigned NOT NULL,
`order_no` varchar(10) default NULL,
`invoice_no` varchar(20) default NULL,
`customer_no` varchar(20) default NULL,
`name` varchar(45) NOT NULL default '',
`archived` tinyint(4) default NULL,
`invoiced` tinyint(4) default NULL,
`time` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
`group` int(11) default NULL,
`customer_group` int(11) default NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_customer_invoice` (`customer_id`,`invoice_no`),
KEY `idx_time` (`time`),
KEY `idx_order` (`order_no`),
KEY `idx_customer_invoice_order` (`customer_id`,`invoice_no`,`order_no`)
) ENGINE=InnoDB AUTO_INCREMENT=9146048 DEFAULT CHARSET=latin1 |
Update 2:
mysql> show indexes from invoices;
+----------+------------+----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| invoices | 0 | PRIMARY | 1 | id | A | 7578066 | NULL | NULL | | BTREE | |
| invoices | 0 | idx_customer_invoice | 1 | customer_id | A | 17 | NULL | NULL | | BTREE | |
| invoices | 0 | idx_customer_invoice | 2 | invoice_no | A | 7578066 | NULL | NULL | YES | BTREE | |
| invoices | 1 | idx_time | 1 | time | A | 541290 | NULL | NULL | | BTREE | |
| invoices | 1 | idx_order | 1 | order_no | A | 6091 | NULL | NULL | YES | BTREE | |
| invoices | 1 | idx_customer_invoice_order | 1 | customer_id | A | 17 | NULL | NULL | | BTREE | |
| invoices | 1 | idx_customer_invoice_order | 2 | invoice_no | A | 7578066 | NULL | NULL | YES | BTREE | |
| invoices | 1 | idx_customer_invoice_order | 3 | order_no | A | 7578066 | NULL | NULL | YES | BTREE | |
+----------+------------+----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
My questions are:
Is there a way to find unused indexes in MySQL?
Are there any common mistakes that impact the size of the index?
Can indexA safely be removed?
How can you measure the size of each index? All I get is the total of all indexes.
You can remove index A, because, as you have noted, it is a subset of another index. And it's possible to do this without disrupting normal processing.
The size of the index files is not alarming in itself and it can easily be true that the net benefit is positive. In other words, the usefulness and value of an index shouldn't be discounted because it results in a large file.
Index design is a complex and subtle art involving a deep understanding of the query optimizer explanations and extensive testing. But one common mistake is to include too few fields in an index in order to make it smaller. Another is to test indexes with insufficient, or insufficiently representative data.
I may be wrong, but the first index (idx_customer_invoice) is UNIQUE, the second (idx_customer_invoice_order) is not, so you'll probably lose the uniqueness constraint when you remove it. No?
Is there a way to find unused indexes in MySQL?
The database engine optimizer will select a proper index when attempting to optimize your query. Depending on when you collected statistics on your indexes last, the index which is chosen will vary. Unused indexes could suddenly become used because of new data repartition.
Can indexA safely be removed?
I would say yes, if indexA and indexB are B-Tree indexes. This is because an index that starts with the same columns in the same order will have the same structure.
use
show indexes from table;
to define what indexes do you have in a particular table. Cardinality would tell how useful your index is.
You can remove your indexes safely (it will not break a table), but beware: some queries might execute slower. First you should analyze your queries to decide whether you need a certain index or not.
I don't think you can find out data length of a particular index, though.
BUT, I think you probably think that if indexes length is greater than data length twice is something abnormal... Well, you are wrong. All of your indexes might be useful ;) If you have a table that provides a lot of information and you have to search on it upon a large number of column, it can easily be that indexes of this table will 2 times bigger in size that the tables data.
indexA can remove because there's a
indexB include indexA
what impact your index length is
your column type and column length
use:
select index_length from information_schema.tables
where table_name='your_table_name' and
table_schema='your_db_name';
get your table index_length