I have a primary key on (A, B) where A is an INT and B is an INT. Would queries searching on A run faster if I had an index on A instead?
I understand the leftmost prefix rule, but I'm curious if a multi-column key/index performs worse than a single-column key/index due to the key being longer.
In some cases it may perform worse - if the rest of the columns are large, for example:
A: int, B: varchar(128), C: text
index on A will perform better than index on A,B,C
In most cases it perform the same; in your case you have 4 vs 8 bytes, so overhead of having a second index is not worth doing it.
Keep in mind that primary key performs better than a secondary index, especially if the storage engine is InnoDB (the primary key is a clustering index) and it's not a covering query (it have to access the table to load data not stored in the index)
Actually in InnoDB all secondary indexes contain the primary key, so they are larger than the PK by default.
You have a situation where the composite key has two components. The first is 4 bytes and the second 4 bytes. The total key is 8 bytes.
A primary key index is clustered, meaning that the "leaf"s of the b-tree are the actual records themselves. A clustered index is going to be faster to access than other types of indexes.
One consideration in the performance of an index is the size of the key (as well as additional columns being kept in the index). An index with a 4-byte key is going to be smaller than an index with an 8-byte key. This means less disk usage and less storage in memory. However, the gains here might be pretty small. After all, a million rows in the table would correspond to at most a 10-20 million bytes (indexes have additional overheads in them).
Another consideration is the performance of data modification steps. In a clustered index, inserting/modifying a key value in the middle of a table requires re-writing the records themselves. However, you question does not seem to be address data modification.
If you have already defined the primary key index, then adding another index is additional overhead for the system. You might find that both indexes are occupying memory, so instead of saving space you are actually adding to it.
Ultimately, the answer to this type of rather arcane question is to do some timing tests. If the B column were much, much larger than the A component, I could possibly see some gains. For queries that only use A, I could possibly see some gains. However, my guess is that such gains would be quite minimal.
Related
I am asking this question with repect to mysql database.I read that clustered index orders the table based on primary key or columns that we provide for making clustered index, where as in non clustered index there is separate space taken for key and record pointer.
Also I read as there is no separate index table, clustered index is faster than non clustered index where as non clustered index must first look into index table find corresponding record pointer and fetch record data
Does that mean there is no extra space taken for clustered index?
PS:I know that there are already some similar answers on this question but I can't understand.
There is no extra space taken because every InnoDB table is stored as the clustered index. There is in fact only the clustered index, and secondary indexes. There's no separate storage for data, because all the unindexed columns are simply stored in the terminal nodes of the clustered index. You might like to read more about it here: https://dev.mysql.com/doc/refman/8.0/en/innodb-index-types.html
It is true that if you do a lookup using a secondary index, and then select columns besides those in the secondary index, InnoDB would do a sort of double lookup. Once to search the secondary index, which results in the value of the primary key(s) where the value you are searching for is found, and then it uses those primary keys to search the clustered index to combine with the other columns.
This double-lookup is mitigated partially by the Adaptive Hash, which is a cache of frequently-searched values. This cache is populated automatically as you run queries. So over time, if you run queries for the same values over again, it isn't so costly.
The situation is more complex than your question.
First, let's talk only about ENGINE=InnoDB; other engines work differently.
There is about 1% overhead for the non-leaf BTree nodes to "cluster" the PRIMARY KEY with the data.
If you do not explicitly specify a PRIMARY KEY, it may be able to use a UNIQUE key as the PK. But if not, then a hidden, 6-byte number will be used for the PK. This would take more space than if you had, say, a 4-byte INT for the PK! That is, you cannot create a table without a PRIMARY KEY.
The above 2 items is TMI; think of the PK as taking no extra space.
Yes, lookup by the PK is faster than lookup by a secondary key. But if you need a secondary key, then create it. Playing a game of first fetching ids, then fetching the rows is slower than doing all the work in a single query.
A Secondary key also uses BTree also. But it is sorted by the key's column(s) and does not include all the other columns. Instead, it includes the PK's columns. (Hence the "double-lookup" that Bill mentioned.)
A "covering index" is one that contains all the columns needed for a particular SELECT. In that case, all the work can be done in the index's BTree, thereby avoiding the double-lookup. That is, a covering index is as fast as a primary key lookup. (I would guess that 20% of indexes are "covering" or could be made covering by adding a column or two.)
BTrees have a bunch of overhead. A Rule of Thumb: Add up the size of each column (4 bytes for INT, etc), then multiply by 2 or 3. The result will often be a good estimate of the disk space needed for the Data or Index Btree.
This discussion does not cover FULLEXT or SPATIAL indexes.
I am reading Effective Mysql - Optimizing Mysql Statements and in chapter 3 there was this explanation:
The secondary indexes in InnoDB use the B-tree data structure; however, they differ from the MyISAM implementation. In InnoDB, the secondary index stores the physical value of the primary key. In MyISAM, the secondary index stores a pointer to the data that contains the primary key value.
This is important for two reasons. First, the size of secondary indexes in InnoDB can be much larger when a large primary key is defined—for example when your primary key in InnoDB is 40 bytes in length. As the number of secondary indexes increase, the comparison size of the indexes can become significant. The second difference is that the secondary index now includes the primary key value and is not required as part of the index. This can be a significant performance improvement with table joins and covering indexes.
There are many questions that come to my mind, mostly due to lack of understanding of what author is trying to convey.
It is unclear what the author means in the second difference in
second paragraph. What is not required as part of index anymore?
Does InnoDB secondary index B-tree only store PK value or PK value
and Pointer to it? or PK Value and pointer to data row?
What kind of performance improvement would there be due to the storage method (2nd question's answer)?
This question contains an example and also an answer. He explains how it contains PK value, but what I am still not understanding is,
To complete the join, if the pointer is not there in the secondary index and only the value, wont MySQL do a full index scan on Primary Key index with that value from secondary index? How would that be efficient than having the pointer also?
The secondary index is an indirect way to access the data. Unlike the primary (clustered) index, when you traverse the secondary index in InnoDB and you reach the leaf node you find a primary key value for the corresponding row the query is looking for. Using this value you traverse the primary index to fetch the row. This means 2 index look ups in InnoDB.
For MyISAM because the leaf of the secondary node is a pointer to the actual row you only require 1 index lookup.
The secondary index is formed based on certain attributes of your table that are not the PK. Hence the PK is not required to be part of the index by definition. Whether it is (InnoDB) or not (MyISAM) is implementation detail with corresponding performance implications.
Now the approach that InnoDB follows might at first seem inefficient in comparison to MyISAM (2 lookups vs 1 lookup) but it is not because the primary index is kept in memory so the penalty is low.
But the advantage is that InnoDB can split and move rows to optimize the table layout on inserts/updates/deletes of rows without needing to do any updates on the secondary index since it does not refer to the affected rows directly
Basics..
MyISAM's PRIMARY KEY and secondary keys work the same. -- Both are BTrees in the .MYI file where a "pointer" in the leaf node points to the .MYD file.
The "pointer" is either a byte offset into the .MYD file, or a record number (for FIXED). Either results in a "seek" into the .MYD file.
InnoDB's data, including the columns of the PRIMARY KEY, is stored in one BTree ordered by the PK.
This makes a PK lookup slightly faster. Both drill down a BTree, but MyISAM needs an extra seek.
Each InnoDB secondary key is stored in a separate BTree. But in this case the leaf nodes contain any extra columns of the PK. So, a secondary key lookup first drills down that BTree based on the secondary key. There it will find all the columns of both the secondary key and the primary key. If those are all the columns you need, this is a "covering index" for the query, and nothing further is done. (Faster than MyISAM.)
But usually you need some other columns, so the column(s) of the PK are used to drill down the data/PK BTree to find the rest of the columns in the row. (Slower than MyISAM.)
So, there are some cases where MyISAM does less work; some cases where InnoDB does less work. There are a lot of other things going on; InnoDB is winning many comparison benchmarks over MyISAM.
Caching...
MyISAM controls the caching of 1KB index blocks in the key_buffer. Data blocks are cached by the Operating System.
InnoDB caches both data and secondary index blocks (16KB in both cases) in the buffer_pool.
"Caching" refers to swapping in/out blocks as needed, with roughly a "least recently used" algorithm.
No BTree is loaded into RAM. No BTree is explicitly kept in RAM. Every block is requested as needed, with the hope that it is cached in RAM. For data and/or index(es) smaller than the associated buffer (key_buffer / buffer_pool), the BTree may happen to stay in RAM until shutdown.
The source-of-truth is on disk. (OK, there are complex tricks that InnoDB uses with log files to avoid loss of data when a crash occurs before blocks are flushed to disk. That cleanup automatically occurs when restarting after the crash.)
Pulling the plug..
MyISAM:
Mess #1: Indexes will be left in an unclean state. CHECK TABLE and REPAIR TABLE are needed.
Mess #2: If you are in the middle of UPDATEing a thousand rows in a single statement, some will be updated, some won't.
InnoDB:
As alluded to above, InnoDB performs things atomically, even across pulling the plug. No index is left mangled. No UPDATE is left half-finished; it will be ROLLBACKed.
Example..
Given
columns a,b,c,d,e,f,g
PRIMARY KEY(a,b,c)
INDEX(c,d)
The BTree leaf nodes will contain:
MyISAM:
for the PK: a,b,c,pointer
for secondary: c,d,pointer
InnoDB:
for the PK: a,b,c,d,e,f,g (the entire row is stored with the PK)
for secondary: c,d,a,b
What could be the possible downside of having UNIQUE constraint for a large string (varchar) (roughly 100 characters or so) in MYSQL during :
insert phase
retrieval phase (on another primary key)
Can the length of the query impact the performance of read/writes ? (Apart from disk/memory usage for book-keeping).
Thanks
Several issues. There is a limit on the size of a column in an index (191, 255, 767, 3072, etc, depending on various things).
Your column fits within the limit.
Simply make a UNIQUE or PRIMARY key for that column. There are minor performance concerns, but keep this in mind: Fetching a row is more costly than any datatype issues involving the key used to locate it.
Your column won't fit.
Now the workarounds get ugly.
Index prefixing (INDEX foo(50)) has a number of problems and inefficiencies.
UNIQUE foo(50) is flat out wrong. It is declaring that the first 50 characters are constrained to be unique, not the entire column.
Workarounds with hashing the string (cf md5, sha1, etc) have a number of problems and inefficiencies. Still, this may be the only viable way to enforce uniqueness of a long string.
(I'll elaborate if needed.)
Fetching a row (Assuming the statement is parsed and the PRIMARY KEY is available.)
Drill down the BTree containing the data (and ordered by the PK). This may involve bring a block (or more) from disk into the buffer_pool.
Parse the block to find the row. (There are probably dozens of rows in the block.)
At some point in the process lock the row for reading and/or be blocked by some other connection that is, say, updating or deleting.
Pick apart the row -- that is, split into columns.
For any text/blob columns needed, reach into the off-record storage. (Wide columns are not stored with the narrow columns of the row; they are stored in other block(s).) The costly part is locating (and reading from disk if not cached) the extra block(s) containing the big TEXT/BLOB.
Convert from the internal storage (not word-aligned, little-endian, etc) into the desired format. (A small amount of CPU code, but necessary. This means that the data files are compatible across OS and even hardware.)
If the next step is to compare two strings (for JOIN or ORDER BY), then that a simple subroutine call to a scan over however many characters there are. (OK, most utf8 collations are not 'simple'.) And, yes, comparing two INTs would be faster.
Disk space
Should INT be used instead of VARCHAR(100) for the PRIMARY KEY? It depends.
Every secondary key has a copy of the PRIMARY KEY in it. This implies that a PK that is VARCHAR(100) makes secondary indexes bulkier than if the PK were INT.
If there are no secondary keys, then the above comment implies that INT is the bulkier approach!
If there are more than 2 secondary keys, then using varchar is likely to be bulkier.
(For exactly one secondary key, it is a tossup.)
Speed
If all the columns of a SELECT are in a secondary index, the query may be performed entirely in the index's BTree. ("Covering index", as indicated in EXPLAIN by "Using index".) This is sometimes a worthwhile optimization.
If the above does not apply, and it is useful to look up row(s) via a secondary index, then there are two BTree lookups -- once in the index, then via the PK. This is sometimes a noticeable slowdown.
The point here is that artificially adding an INT id may be slower than simply using the bulky VARCHAR as the PK. Each case should be judged on its tradeoffs; I am not making a blanket statement.
I am using InnoDB. My Index selectivity (cardinality / total-rows) is < 100%, roughly 96-98%.
I would like to know if the columns, which are not part of the keys, are also stored in sorted order. This influences my tables' design.
Would also be interest to understand how much performance degradation in lookup I can expect when index selectivity is < 100%.
(I get these question since for InnoDB it's only mentioned that indexes are clustered and there's TID/RP stored after the index)
No, it doesn't matter for the order of the non-keyed columns.
The answer to your second is more complex - I'm going to walk through it since I think you might be misunderstanding InnoDB a little -
There are two types of indexes, primary and secondary.
The primary key index is clustered - that is, data is stored in the leaves of the B+Tree. Looking up by primary key is just one tree traversal, and you've got the row(s) you're looking for.
Looking up by secondary key requires searching through the secondary key, finding the primary key rows that match, and then looking through the primary key to get the data.
You only need to worry about selectivity of secondary (non clustered) indexes, since the primary (clustered) index will always have a selectivity of 1. How selective a secondary index needs to be varies a lot - for one; it depends on the width of the index versus the width of the row. It also depends on if you have memory fit, since if secondary keys don't "follow" the primary key, it may cause random IO to look up each of the rows from the clustered index.
what is the performance characteristic for Unique Indexes in Mysql and Indexes in general in MySQl (like the Primary Key Index):
Given i will insert or update a record in my databse: Will the speed of updating the record (=building/updating the indexes) be different if the table has 10 Thousand records compared to 100 Million records. Or said differently, does the Index buildingtime after changing one row depend on the total indexsize?
Does this also apply for any other indexes in Mysql like the Primary Key index?
Thank you very much
Tom
Most indexes in MySQL are really the same internally -- they're B-tree data structures. As such, updating a B-tree index is an O(log n) operation. So it does cost more as the number of entries in the index increases, but not badly.
In general, the benefit you get from an index far outweighs the cost of updating it.
A typical MySQL implementation of an index is as a set of sorted values (not sure if any storage engine uses different strategies, but I believe this holds for the popular ones) -- therefore, updating the index inevitably takes longer as it grows. However, the slow-down need not be all that bad -- locating a key in a sorted index of N keys is O(log N), and it's possible (though not trivial) to make the update O(1) (at least in the amortized sense) after the finding. So, if you square the number of records as in your example, and you pick a storage engine with highly optimized implementation, you could reasonably hope for the index update to take only twice as long on the big table as it did on the small table.
Note that if new primary key values are always larger than the previous (i.e. autoincrement integer field), your index will not need to be rebuilt.