I'm running a fairly simple auto catalog
CREATE TABLE catalog_auto (
id INT(10) UNSIGNED NOT NULL auto_increment,
make varchar(35),
make_t varchar(35),
model varchar(40),
model_t varchar(40),
model_year SMALLINT(4) UNSIGNED,
fuel varchar(35),
gearbox varchar(15),
wd varchar(5),
engine_cc SMALLINT(4) UNSIGNED,
variant varchar(40),
body varchar(30),
power_ps SMALLINT(4) UNSIGNED,
power_kw SMALLINT(4) UNSIGNED,
power_hp SMALLINT(4) UNSIGNED,
max_rpm SMALLINT(5) UNSIGNED,
torque SMALLINT(5) UNSIGNED,
top_spd SMALLINT(5) UNSIGNED,
seats TINYINT(2) UNSIGNED,
doors TINYINT(1) UNSIGNED,
weight_kg SMALLINT(5) UNSIGNED,
lkm_def TINYINT(3) UNSIGNED,
lkm_mix TINYINT(3) UNSIGNED,
lkm_urb TINYINT(3) UNSIGNED,
tank_cap TINYINT(3) UNSIGNED,
co2 SMALLINT(5) UNSIGNED,
PRIMARY KEY(id),
INDEX `gi`(`make`,`model`,`model_year`,`fuel`,`gearbox`,`wd`,`engine_cc`),
INDEX `mkt`(`make`,`make_t`),
INDEX `mdt`(`make`,`model`,`model_t`)
);
The table has about 60,000 rows so far, so nothing that simple queries couldn't handle even without indexes.
The point is, I'm trying to get the hang of using indexes, so I made a few, based on my most frequent queries.
Say I want engine_cc for a specific set of criteria, like so:
SELECT DISTINCT engine_cc FROM catalog_auto WHERE make='audi' AND model='a4' and model_year=2006 AND fuel='diesel' AND gearbox='manual' AND wd='front';
EXPLAIN says:
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------+------+--------------------------+
| 1 | SIMPLE | catalog_auto | ref | gi,mkt,mdt | gi | 408 | const,const,const,const,const,const | 8 | Using where; Using index |
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------+------+--------------------------+
The query is using the gi index as expected; no problem here.
After selecting the base criteria, I need the rest of the columns as well:
SELECT * FROM catalog_auto WHERE make='audi' AND model='a4' and model_year=2006 AND fuel='diesel' AND gearbox='manual' AND wd='front' AND engine_cc=1968;
EXPLAIN says:
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------------+------+-------------+
| 1 | SIMPLE | catalog_auto | ref | gi,mkt,mdt | gi | 411 | const,const,const,const,const,const,const | 3 | Using where |
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------------+------+-------------+
It selected a key, but Extra no longer shows "Using index". The query is nevertheless very fast (1 row in set (0.00 sec)), but since the table doesn't have that many rows, I assume it would be the same even without indexing.
Tried it like this:
SELECT * FROM catalog_auto WHERE id IN (SELECT id FROM catalog_auto WHERE make='audi' AND model='a6' AND model_year=2009);
Again, in EXPLAIN:
+----+--------------------+--------------+-----------------+--------------------+---------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------+-----------------+--------------------+---------+---------+------+-------+-------------+
| 1 | PRIMARY | catalog_auto | ALL | NULL | NULL | NULL | NULL | 59060 | Using where |
| 2 | DEPENDENT SUBQUERY | catalog_auto | unique_subquery | PRIMARY,gi,mkt,mdt | PRIMARY | 4 | func | 1 | Using where |
+----+--------------------+--------------+-----------------+--------------------+---------+---------+------+-------+-------------+
Still not using any index on the outer query, not even the PRIMARY KEY. Shouldn't this at least use the PRIMARY KEY?
Documentation says: MySQL can ignore a key even if it finds one, if it determines that a full table scan would be faster, depending on the query.
Is that the reason why it's not using any of the indexes? Is this good practice? If not, how would you recommend indexing columns for a SELECT * statement so that it always uses an index, given the above query?
I'm not much of a MySQL expert, so any pointers would be greatly appreciated.
Using MySQL 5.5 with InnoDB.
I'm basically giving the same answer as @DStanley, but I want to expand on it more than I can fit in a comment.
The "Using index" note means that the query is using only the index to get the columns it needs.
The absence of this note doesn't mean the query isn't using an index.
What you should look at is the key column in the EXPLAIN report:
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------------+------+-------------+
| 1 | SIMPLE | catalog_auto | ref | gi,mkt,mdt | gi | 411 | const,const,const,const,const,const,const | 3 | Using where |
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------------+------+-------------+
The key column says the optimizer chose to use the gi index. So it is using an index. And the ref column confirms that it's referencing all seven columns of that index.
The fact that it must fetch more of the columns to return * means it can't claim "Using [only] index".
Also read this excerpt from https://dev.mysql.com/doc/refman/5.6/en/explain-output.html:
Using index
The column information is retrieved from the table using only information in the index tree without having to do an additional seek to read the actual row. This strategy can be used when the query uses only columns that are part of a single index.
I think of this analogy, to a telephone book:
If you look up a business in a phone book, it's efficient because the book is alphabetized by the name. When you find it, you also have the phone number right there in the same entry. So if that's all you need, it's very quick. That's an index-only query.
If you want extra information about the business, like their hours or credentials or whether they carry a certain product, you have to do the extra step of using that phone number to call them and ask. That's a couple of extra minutes of time to get that information. But you were still able to find the phone number without having to read the entire phone book, so at least it didn't take hours or days. That's a query that used an index, but had to also go look up the row from the table to get other data.
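As a rough illustration against the catalog_auto table from the question, the two statements below differ only in one selected column. This is a sketch, not measured output: the exact Extra text depends on the server version and on table statistics.

```sql
-- Every referenced column (make, model, model_year, engine_cc) is part of
-- the `gi` index, so the result can be answered from the index alone.
-- Extra would be expected to include "Using index".
EXPLAIN
SELECT make, model, model_year, engine_cc
FROM catalog_auto
WHERE make = 'audi' AND model = 'a4' AND model_year = 2006;

-- power_hp is not part of `gi`, so after finding the matching index
-- entries the engine must also read each row from the table.
-- The key column would still be expected to name an index,
-- but Extra would not include "Using index".
EXPLAIN
SELECT make, model, model_year, power_hp
FROM catalog_auto
WHERE make = 'audi' AND model = 'a4' AND model_year = 2006;
```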
I'm not a MySQL expert, but my guess is that the index was used for the row lookup, but the actual data has to be retrieved from the data pages, so an additional lookup is necessary.
In your first query, the data you ask for is available by looking only at the index keys. When you ask for columns that aren't in the index in the second and third queries, the engine uses the key to do a SEEK on the data tables, so it's still very fast.
With SQL performance, since the optimizer has a lot of freedom to choose the "best" plan, the proof is in the pudding when it comes to indexing. If adding an index makes a common query faster, great, use it. If not, then save the space and overhead of maintaining the index (or look for a better index).
Note that you don't get a free lunch: additional indices can actually slow down a system, particularly if you have frequent inserts or updates on columns that are indexed, since the system will have to constantly maintain those indices.
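On the third query from the question: in MySQL 5.5 an IN (SELECT ...) subquery is typically run as a dependent subquery, evaluated once per outer row, which matches the type ALL and 59060 rows shown in that EXPLAIN. A minimal sketch of the same lookup without the subquery, so that the conditions can use the leading columns of gi directly:

```sql
-- Same rows as the IN (SELECT id ...) form, but the WHERE clause is applied
-- to the outer table itself, so make/model/model_year can be matched
-- against the first three columns of the `gi` index.
EXPLAIN
SELECT *
FROM catalog_auto
WHERE make = 'audi' AND model = 'a6' AND model_year = 2009;
```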
I have a MySQL (v8.0.30) table that stores trades, the schema is the following:
CREATE TABLE `log_fill` (
`id` bigint NOT NULL AUTO_INCREMENT,
`orderId` varchar(63) NOT NULL,
`clientOrderId` varchar(36) NOT NULL,
`symbol` varchar(31) NOT NULL,
`executionId` varchar(255) DEFAULT NULL,
`executionSide` tinyint NOT NULL COMMENT '0 = long, 1 = short',
`executionSize` decimal(15,2) NOT NULL,
`executionPrice` decimal(21,8) unsigned NOT NULL,
`executionTime` bigint unsigned NOT NULL,
`executionValue` decimal(21,8) NOT NULL,
`executionFee` decimal(13,8) NOT NULL,
`feeAsset` varchar(63) DEFAULT NULL,
`positionSizeBeforeFill` decimal(21,8) DEFAULT NULL,
`apiKey` int NOT NULL,
`side` varchar(20) DEFAULT NULL,
`reconciled` tinyint unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `executionId` (`executionId`,`executionSide`),
KEY `apiKey` (`apiKey`)
) ENGINE=InnoDB AUTO_INCREMENT=6522695 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
As you can see, there's a BTREE index on the column apiKey which stands for a user, this way I can quickly retrieve all trades for a specific user.
My goal is a query that returns positionSizeBeforeFill + executionSize for the last record, given an apiKey and a symbol. So I wrote the following:
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
ORDER BY id DESC
However, the execution is extremely slow (around 100 ms). I've noticed that running with either the WHERE or the ORDER BY alone (not both together) drastically reduces execution time. For example,
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
only takes 220 microseconds to execute. The number of records after filtering by apiKey and symbol is 388.
Similarly,
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
ORDER BY id DESC
takes 26 microseconds (on a 3 million records table).
All in all, running WHERE and ORDER BY separately takes microseconds, but when I combine them the time goes up to milliseconds (around 1000x more).
Running EXPLAIN on the slow query it turns out it has to examine 116032 rows.
I tried to create a temporary table, hoping MySQL would perform the sort only on the filtered records, but the outcome is the same. I was wondering whether the problem might be the index (whose cardinality is 203), but how can that be the case when WHERE alone takes very little time? I could not find similar cases in other questions or forums. I think I just fail to understand how InnoDB selects data; I thought it would first filter by WHERE and then perform ORDER BY on the filtered rows. How can I improve this? Thanks!
Edit: The EXPLAIN statement on the slow query returns
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+----------------------------------+
| 1 | SIMPLE | log_fill_tmp | NULL | ref | apiKey | apiKey | 4 | const | 116032 | 10.00 | Using where; Backward index scan |
The query with WHERE only
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+-------------+
| 1 | SIMPLE | log_fill_tmp | NULL | ref | apiKey | apiKey | 4 | const | 116032 | 10.00 | Using where |
The query with ORDER BY only on the full table
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+-------+---------------+---------+---------+------+---------+----------+---------------------+
| 1 | SIMPLE | log_fill_tmp | NULL | index | NULL | PRIMARY | 8 | NULL | 2503238 | 100.00 | Backward index scan |
26 microseconds for any query against a 3-million-row table implies that you have the Query cache enabled. Please rerun your timings with SELECT SQL_NO_CACHE .... (Note that MySQL 8.0 removed the Query cache, so on 8.0.30 the hint has no effect, and the figure may instead reflect how the client measures time.) Even milliseconds would be suspicious. Were all 3M rows returned? Shoveling that many rows probably takes more than a second under any circumstances.
Meanwhile, to speed up the first two queries, add
INDEX(symbol, apikey, id)
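Spelled out as a statement, that suggestion might look like the sketch below. The index name idx_symbol_apikey_id is made up for illustration, and the LIMIT 1 is an assumption based on the stated goal of reading only the last record.

```sql
-- Both equality columns first, then the column used for ordering.
-- idx_symbol_apikey_id is a hypothetical name.
ALTER TABLE log_fill
  ADD INDEX idx_symbol_apikey_id (symbol, apiKey, id);

-- With that index, the newest matching row sits at one end of the
-- (symbol, apiKey) range, so no separate sort should be needed.
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
ORDER BY id DESC
LIMIT 1;  -- assumption: only the most recent fill is wanted
```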
EXPLAIN gives only approximate (eg, 116032) counts. A "cardinality" of 203 is also just an estimate, but it is used by the Optimizer in some situations. Please get the exact count just to check that there really are any rows:
SELECT COUNT(*)
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
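Since EXPLAIN only estimates, another way to check what really happens is EXPLAIN ANALYZE, which is available from MySQL 8.0.18 and therefore on the 8.0.30 server mentioned in the question. It executes the statement and reports actual row counts and timings for each step of the plan. A sketch using the slow query as posted:

```sql
-- Runs the query for real and prints, per plan step, the estimated
-- versus actual number of rows and the time spent.
EXPLAIN ANALYZE
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
ORDER BY id DESC;
```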
With the ORDER BY id DESC, it will scan the entire B+Tree that holds the data. As it says, it will do a 'Backward index scan'. However, since the "index" is the PRIMARY KEY and the PK is clustered with the data, it is really referring to the data's BTree.
The EXPLAIN for the first query decided that the indexes were not useful enough for WHERE; instead it avoided the sort (ORDER BY) by doing the Backward full table scan, same as the 3rd query. (And ignored any rows that did not match the WHERE.)
I added a composite index on (apiKey, symbol) and now the query runs in as little as 0.2 ms. In order to reduce the creation time for the index, I reduced the number of records from 3M to 500K, and the gain in time is about 97%; I believe it's going to be more on the full table. I thought that with just the apiKey index it would first filter by user, then by symbol, but I'm probably wrong.
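For reference, a sketch of what that change might look like; the index name idx_apikey_symbol is hypothetical. InnoDB secondary indexes implicitly carry the primary key at the end of each entry, so an index on (apiKey, symbol) is effectively ordered by (apiKey, symbol, id), which is why the ORDER BY id DESC can be served without a sort.

```sql
-- Composite index covering both equality conditions.
-- idx_apikey_symbol is a made-up name for illustration.
ALTER TABLE log_fill
  ADD INDEX idx_apikey_symbol (apiKey, symbol);

-- Check that the new index is picked and that "rows" drops to roughly
-- the number of matching records instead of everything for the apiKey.
EXPLAIN
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
ORDER BY id DESC;

-- The old single-column index is now a left prefix of the new one,
-- so it could be dropped if nothing else depends on it by name:
-- ALTER TABLE log_fill DROP INDEX apiKey;
```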
I have a table with around 500,000 rows, with a composite primary key index. I'm trying a simple select statement as follows
select * from transactions where account_id='1' and transaction_id='003a4955acdbcf72b5909f122f84d';
The explain statement gives me this
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
-------------------------------------------------------------------------------------------------------------------------------------
1 | SIMPLE | transactions | NULL | const | PRIMARY | PRIMARY | 94 | const,const | 1 | 100.00 | NULL
My primary index is on account_id and transaction_id.
My engine is InnoDB.
When I run my query it takes around 156 milliseconds.
Given that explain shows that only one row needs to be examined, I'm not sure how to optimize this further. What changes can I do to significantly improve this?
I'm going to speculate a bit, given the information provided: your primary key is composed of an integer field account_id and a varchar one called transaction_id.
Since they're both components of the PRIMARY index created when you defined them as PRIMARY KEY(account_id, transaction_id), that index is already the best you can have for this query.
I think the bottleneck here is the transaction_id: as a string, it requires more effort to index and to search. Changing its type to one that is easier to search (e.g. a numeric type) would probably help.
The only other improvement I see is to simplify the PRIMARY KEY itself, either by removing the account_id field (it seems useless to me, since the transaction_id looks like it is unique, but that depends on your design) or by substituting the whole key with an integer AUTO_INCREMENT value (not recommended).
I have the following table (it has more data columns, removed them because it would be a long post):
CREATE TABLE `members` (
`memberid` int(11) NOT NULL AUTO_INCREMENT,
`firstname` varchar(45) COLLATE utf8_unicode_ci DEFAULT NULL,
`lastname` varchar(45) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`memberid`),
KEY `members_lname_ix` (`lastname`)
) ENGINE=InnoDB AUTO_INCREMENT=1019 DEFAULT CHARSET=utf8
COLLATE=utf8_unicode_ci;
By default, a user only ever accesses 10-20 rows from this table at a time, usually sorted by the lastname column; it's all paginated server side. So I decided to add an index on lastname to help with sorting, but the index does not seem to work the way I would expect. When I run EXPLAIN SELECT * FROM members ORDER BY lastname ASC I get:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra
1 | simple | members | ALL | null | null | null | null | 711 | using filesort
I can at least confirm the index exists because if I run SHOW INDEX FROM members I get:
Table | Non_Unique | Key_name | Seq_in_ix | Col_name | Collation | Cardinality | Sub part | Packed | Null | Ix type
members | 0 | PRIMARY | 1 | memberid | A | 711 | null | null | (blank) | BTREE
members | 1 | members_lname_ix | 1 | lastname | A | 711 | null | null | YES | BTREE
If I add USE INDEX (members_lname_ix), both possible_keys and key remain null. However, if I add FORCE INDEX (members_lname_ix), possible_keys remains null and key shows members_lname_ix. This is my first time trying to apply indexing, but to me this doesn't seem very intuitive: it feels like MySQL should know that I created an index for lastname, no? I can't quite figure out what I'm doing wrong here, unless I am misunderstanding something. Is the solution here to just keep using FORCE INDEX?
There are two ways to perform that query:
Plan A (as you were expecting):
Scan through the index sequentially, reading the entire (estimated) 711 rows.
Randomly look up each row in the data BTree. This involves reading the entire dataset.
Deliver the data in order.
Plan B (what it does):
Scan through the data, reading all 711 rows.
Sort the data.
Deliver the sorted data.
Plan B does not touch the index at all; the optimizer deemed that a bigger saving than avoiding the sort.
In a table as tiny as yours, it would be hard to see a difference in speed. (In my test case, it took under 10 milliseconds either way.) In huge tables, the difference could be significant.
For optimal pagination, see http://mysql.rjweb.org/doc.php/pagination
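Because the real workload reads 10-20 rows per page rather than the whole table, the plan for the paginated query is the one worth checking. The sketch below shows a first page and a "continue where the previous page ended" follow-up; the values 'Smith' and 1234 are placeholders for the last row already shown, and rows with a NULL lastname are ignored for simplicity.

```sql
-- First page: with a small LIMIT, the optimizer may prefer walking
-- members_lname_ix in order (Plan A) over sorting all rows (Plan B).
-- memberid is added as a tie-breaker so the order is deterministic.
EXPLAIN
SELECT *
FROM members
ORDER BY lastname ASC, memberid ASC
LIMIT 20;

-- Next page: resume after the last row of the previous page instead of
-- using OFFSET. 'Smith' / 1234 are placeholder values.
SELECT *
FROM members
WHERE lastname > 'Smith'
   OR (lastname = 'Smith' AND memberid > 1234)
ORDER BY lastname ASC, memberid ASC
LIMIT 20;
```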
If you use a count on a non-null column, on one table, without any WHERE clause, the optimizer just returns the number of rows in that table.
If you ask for a DISTINCT count on a UNIQUE non-null column, like the PRIMARY KEY, the answer should be the same, but this time MariaDB does the calculation instead.
And if you LEFT JOIN other tables, still with no WHERE clause, the result should still be the number of rows in that table.
Is there a reason for MariaDB not using those optimizations? Is there a case where the DISTINCT count of an unfiltered primary key could give any result other than the number of rows in that table?
Case:
CREATE TABLE products (
our_article_id varchar(50) CHARACTER SET utf8 NOT NULL,
...,
PRIMARY KEY(our_article_id)
);
CREATE TABLE product_article_id (
article_id varchar(255) COLLATE utf8_bin NOT NULL,
our_article_id varchar(50) CHARACTER SET utf8 NOT NULL,
...
PRIMARY KEY(article_id),
INDEX(our_article_id)
);
Count queries. 1st, basic count:
DESCRIBE SELECT COUNT(our_article_id) FROM products;
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
2nd, DISTINCT on the primary key:
DESCRIBE SELECT COUNT(DISTINCT our_article_id) FROM products;
+------+-------------+----------+-------+---------------+---------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+-------+---------------+---------+---------+------+--------+-------------+
| 1 | SIMPLE | products | index | NULL | PRIMARY | 152 | NULL | 225089 | Using index |
+------+-------------+----------+-------+---------------+---------+---------+------+--------+-------------+
3rd, DISTINCT on the PRIMARY KEY, and a LEFT JOIN without a WHERE clause:
DESCRIBE SELECT COUNT(DISTINCT our_article_id) FROM products LEFT JOIN product_article_id USING (our_article_id);
+------+-------------+--------------------+-------+---------------+---------+---------+----------------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+--------------------+-------+---------------+---------+---------+----------------------------------+--------+-------------+
| 1 | SIMPLE | products | index | NULL | PRIMARY | 152 | NULL | 225089 | Using index |
| 1 | SIMPLE | product_article_id | ref | PRIMARY | PRIMARY | 152 | testseek.products.our_article_id | 12579 | Using index |
+------+-------------+--------------------+-------+---------------+---------+---------+----------------------------------+--------+-------------+
"Is there a reason for mariadb not using thous optimizations?" -- There are a zillion missing optimizations in MySQL/MariaDB; that's missing. Let's look at the history.
MySQL started about 2 decades ago as a lean and mean database engine. It focused on features that most people needed, while minimizing the overhead. This meant that a lot of rare optimizations were not in the early releases, and only get added over time if they seem important enough.
Take the PRIMARY KEY, for example. It is defined as UNIQUE. It is BTree organized. And, with InnoDB, it is also defined as Clustered. Other vendors allow various combinations of clustering, non-BTree indexing, etc. MySQL decided that the limitations were "good enough" for "most" people.
Over the years, the 'worst' omissions have been gradually fixed. Transactions are probably the biggest and most important. They arrived in 2001(?), and MyISAM is being removed this year (2016) with the advent of 8.0.
4.1 (2002?) saw subqueries. Before that, creating a tmp table was "good enough". Now (8.0) subqueries are being one-upped by CTEs, which cover a few things that neither tmp tables nor subqueries can do efficiently.
There have been a huge number of optimizations put into MySQL 5.6 and 5.7 and MariaDB 10.x; you probably have not used more than a couple of them. The product is into "diminishing returns". It would damage its "lean and mean" heritage if it slowed down the optimizer to check for the next thousand extremely rare optimizations.
Meanwhile, guys like me spend a lot of time saying "MySQL/MariaDB doesn't have that; here's the workaround". It's the shorter COUNT(*) in your case. Since there is a clean workaround, it may be another decade before your suggestions are implemented. It is OK to file a bug report with bugs.mysql.com or mariadb.com to suggest the optimizations.
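Concretely, since our_article_id is the PRIMARY KEY of products (unique and NOT NULL), and a LEFT JOIN keeps every products row at least once, all three queries in the question should return the same number as the plain form below:

```sql
-- Equivalent to COUNT(DISTINCT our_article_id) on products, with or
-- without the LEFT JOIN, because the primary key has no duplicates
-- and no NULLs.
SELECT COUNT(*) FROM products;
```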
Another, almost never needed case, is INDEX(a ASC, b DESC) as a way of optimizing ORDER BY a ASC, b DESC. That is coming with 8.0. But I doubt if more than one query in 5,000 really needs it. (I have seen a lot of queries.) I suggest that its rarity is why it took two decades to implement it. The lack of a clean workaround is why it did not take another decade.
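A minimal sketch of that mixed-direction case, assuming MySQL 8.0 and a hypothetical table t with columns a and b. Before 8.0 the DESC keyword in an index definition was parsed but ignored, so the index was stored ascending on both columns.

```sql
-- Hypothetical table for illustration only.
CREATE TABLE t (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  a  INT NOT NULL,
  b  INT NOT NULL,
  INDEX a_asc_b_desc (a ASC, b DESC)
);

-- On 8.0 the index order matches the ORDER BY exactly, so the plan
-- should be able to read the index in order instead of using a filesort.
EXPLAIN
SELECT a, b
FROM t
ORDER BY a ASC, b DESC;
```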
I am trying to find the closest gene, given position information, from a gene table. Here is an example:
SELECT chrom, txStart, txEnd, name2, strand FROM
wgEncodeGencodeCompV12 WHERE chrom = 'chr1' AND txStart < 713885 AND
strand = '+' ORDER BY txStart DESC LIMIT 1;
My test runs have been pretty slow, which is problematic.
Here is an EXPLAIN output with default indexing (by chrom):
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | wgEncodeGencodeCompV12 | ref | chrom | chrom | 257 | const | 15843 | Using where; Using filesort |
Filesort is used and is probably causing all the sluggishness?
I tried speeding up the sorting by indexing (chrom, txStart, strand), or just txStart alone, but it only got slower (?). My reasoning is that txStart is not selective enough to be a good index, and that a whole-table scan is actually faster in this case?
Here is the EXPLAIN output with the additional indexing:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | wgEncodeGencodeCompV12 | range | chrom,closest_gene_lookup | closest_gene_lookup | 261 | NULL | 57 | Using where |
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | wgEncodeGencodeCompV12 | range | chrom,txStart | txStart | 4 | NULL | 1571 | Using where |
Table structure
CREATE TABLE `wgEncodeGencodeCompV12` (
`bin` smallint(5) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
`chrom` varchar(255) NOT NULL,
`strand` char(1) NOT NULL,
`txStart` int(10) unsigned NOT NULL,
`txEnd` int(10) unsigned NOT NULL,
`cdsStart` int(10) unsigned NOT NULL,
`cdsEnd` int(10) unsigned NOT NULL,
`exonCount` int(10) unsigned NOT NULL,
`exonStarts` longblob NOT NULL,
`exonEnds` longblob NOT NULL,
`score` int(11) default NULL,
`name2` varchar(255) NOT NULL,
`cdsStartStat` enum('none','unk','incmpl','cmpl') NOT NULL,
`cdsEndStat` enum('none','unk','incmpl','cmpl') NOT NULL,
`exonFrames` longblob NOT NULL,
KEY `chrom` (`chrom`,`bin`),
KEY `name` (`name`),
KEY `name2` (`name2`)
)
Is there a way to make this more efficient? I appreciate your time!
(Update) Solution:
Combining both commenters' suggestions significantly improved run time.
In your case (a query on a single table, no joins, nothing complicated), it is important to understand the distribution of values in each column and how the database server uses the indexes. A field with a large range of distinct values is the one to index. (For example, an index on strand would just split the whole data into + or -, and downstream filters would have to process every row of whichever half matched; that's near the worst case.)
So far, we know that txStart has the most varied distribution of values among the columns of interest in your query.
So your query should definitely use an index on that column! But a btree index, not a hash index (operators <, <=, > etc. are fast on btree, but not on hash).
Try again with just a single (btree) index on txStart (I know you already tried that, but please try again and avoid all secondary indexes etc.).
Multi-column indexes are nice, but their complexity makes them not as fast as plain single-column indexes, and MySQL's optimizer is rather stupid in selecting the optimal indexes ;-)
Another important factor could be the dynamic row size (because of using longblob columns). But I am not up-to-date on the current state of MySQL in that regard.
The index that you want is: wgEncodeGencodeCompV12(chrom, strand, txStart).
In general, you want the fields with equalities as the first columns in the index. Then add one field with the inequality.
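Written out against the table from the question, that rule might look like the sketch below. The index name chrom_strand_txStart is made up; the query is the one from the question with the conditions reordered to mirror the index.

```sql
-- Equality columns (chrom, strand) first, then the range/sort column.
-- chrom_strand_txStart is a hypothetical name.
ALTER TABLE wgEncodeGencodeCompV12
  ADD INDEX chrom_strand_txStart (chrom, strand, txStart);

-- Within one (chrom, strand) pair the index entries are ordered by
-- txStart, so the closest txStart below 713885 is the last entry of the
-- range and the plan should no longer need a filesort.
EXPLAIN
SELECT chrom, txStart, txEnd, name2, strand
FROM wgEncodeGencodeCompV12
WHERE chrom = 'chr1' AND strand = '+' AND txStart < 713885
ORDER BY txStart DESC
LIMIT 1;
```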