I have one interesting question: what is the difference between a clustered index and a unique index? Which is better and faster, and why?
With MySQL's InnoDB engine, you get 3 choices of ordinary indexes:
PRIMARY KEY -- "clustered" with the data and "unique". (Often AUTO_INCREMENT)
UNIQUE -- unique, not clustered.
INDEX -- non unique, not clustered.
All are implemented as BTrees. Clustered means that the data is in the leaf node.
Considerations:
Any of the 3 choices can help with performance of searching for row(s).
InnoDB needs a PRIMARY KEY.
If you want the database to spit at you when you are trying to insert something that is already there, you need to specify the column(s) of the PRIMARY KEY or a UNIQUE.
Secondary keys (UNIQUE or INDEX) go through the PRIMARY KEY to find the data. Corollary 1: Finding a row by the PK is faster. Corollary 2: A bulky (eg, VARCHAR(255)) PK is an extra burden on every secondary key.
A "covering" secondary key does not need to go beyond its BTree.
Data and indexes are cached in RAM in 16KB blocks; caching may be an important consideration in performance.
UUID/GUID indexes are terrible when the index cannot be fully cached -- due to high I/O.
An INSERT must immediately check the PRIMARY KEY and any UNIQUE key(s) for duplicate, but it can delay updating other secondary keys. (This may have impact on performance during inserts.)
From those details, it is sometimes possible to deduce the answers to your questions.
(Caveat: Engines other than InnoDB have some different characteristics.)
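To make the UUID point above a bit more concrete, here is a rough sketch in Python. It is a toy model, not InnoDB: the index is just a sorted list, a "block" is assumed to hold 100 keys (ROWS_PER_BLOCK is made up), and the index does not grow while the new keys are placed. It only counts how many distinct blocks a batch of new keys would land in.

```python
import bisect
import random

ROWS_PER_BLOCK = 100  # assumed; real 16KB blocks hold a varying number of rows

def blocks_touched(existing_sorted_keys, new_keys):
    """Count distinct blocks that would receive at least one of the new keys."""
    touched = set()
    for key in new_keys:
        position = bisect.bisect_left(existing_sorted_keys, key)
        touched.add(position // ROWS_PER_BLOCK)
    return len(touched)

# AUTO_INCREMENT-like keys: every new key sorts after all existing ones.
sequential_existing = list(range(10_000))
sequential_new = range(10_000, 11_000)

# UUID-like keys: 128 random bits, so new keys land anywhere in the order.
rng = random.Random(0)
random_existing = sorted(rng.getrandbits(128) for _ in range(10_000))
random_new = [rng.getrandbits(128) for _ in range(1_000)]

print(blocks_touched(sequential_existing, sequential_new))
print(blocks_touched(random_existing, random_new))
```

In this model the sequential keys all land in the same block at the end of the index, while the random keys are spread over nearly every block. If those blocks are not all cached, each one is a potential disk read.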
I am asking this question with respect to the MySQL database. I read that a clustered index orders the table based on the primary key or the columns we provide for the clustered index, whereas in a non-clustered index separate space is taken for the key and a record pointer.
I also read that, since there is no separate index table, a clustered index is faster than a non-clustered index, whereas a non-clustered index must first look into the index table, find the corresponding record pointer, and then fetch the record data.
Does that mean there is no extra space taken for clustered index?
PS: I know that there are already some similar answers to this question, but I can't understand them.
There is no extra space taken because every InnoDB table is stored as the clustered index. There is in fact only the clustered index, and secondary indexes. There's no separate storage for data, because all the unindexed columns are simply stored in the terminal nodes of the clustered index. You might like to read more about it here: https://dev.mysql.com/doc/refman/8.0/en/innodb-index-types.html
It is true that if you do a lookup using a secondary index, and then select columns besides those in the secondary index, InnoDB would do a sort of double lookup. Once to search the secondary index, which results in the value of the primary key(s) where the value you are searching for is found, and then it uses those primary keys to search the clustered index to combine with the other columns.
This double-lookup is mitigated partially by the Adaptive Hash Index, which is a cache of frequently-searched values. This cache is populated automatically as you run queries. So over time, if you run queries for the same values over again, it isn't so costly.
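A minimal sketch of the double lookup, in Python. Each "B-tree" here is just a dict and the table, column names and rows are made up for illustration; it only shows the order of the two steps, not how InnoDB stores pages.

```python
# Toy model: each "B-tree" is a plain dict; real InnoDB pages are not modeled.
clustered = {            # primary key -> full row (the table itself)
    1: {"id": 1, "email": "a@example.com", "name": "Ann"},
    2: {"id": 2, "email": "b@example.com", "name": "Bob"},
}
secondary_email = {      # secondary key value -> primary key value
    "a@example.com": 1,
    "b@example.com": 2,
}

def find_by_pk(pk):
    """One traversal: the clustered index leaf already holds the row."""
    return clustered[pk], 1

def find_by_email(email):
    """Two traversals: the secondary index yields the PK, then the
    clustered index yields the remaining columns."""
    pk = secondary_email[email]
    row, _ = find_by_pk(pk)
    return row, 2

row, traversals = find_by_email("b@example.com")
print(row["name"], traversals)
```

The second element of each return value is simply the number of index traversals the toy lookup performed.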
The situation is more complex than your question.
First, let's talk only about ENGINE=InnoDB; other engines work differently.
There is about 1% overhead for the non-leaf BTree nodes to "cluster" the PRIMARY KEY with the data.
If you do not explicitly specify a PRIMARY KEY, InnoDB will use the first UNIQUE key whose columns are all NOT NULL as the PK. If there is none, a hidden 6-byte row ID will be used for the PK. This would take more space than if you had, say, a 4-byte INT for the PK! That is, an InnoDB table always has a PRIMARY KEY, whether or not you declare one.
The above 2 items are TMI; think of the PK as taking no extra space.
Yes, lookup by the PK is faster than lookup by a secondary key. But if you need a secondary key, then create it. Playing a game of first fetching ids, then fetching the rows is slower than doing all the work in a single query.
A secondary key also uses a BTree. But it is sorted by the key's column(s) and does not include all the other columns. Instead, it includes the PK's columns. (Hence the "double-lookup" that Bill mentioned.)
A "covering index" is one that contains all the columns needed for a particular SELECT. In that case, all the work can be done in the index's BTree, thereby avoiding the double-lookup. That is, a covering index is as fast as a primary key lookup. (I would guess that 20% of indexes are "covering" or could be made covering by adding a column or two.)
BTrees have a bunch of overhead. A Rule of Thumb: Add up the size of each column (4 bytes for INT, etc), then multiply by 2 or 3. The result will often be a good estimate of the disk space needed for the Data or Index Btree.
This discussion does not cover FULLTEXT or SPATIAL indexes.
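The Rule of Thumb about BTree overhead can be written out as simple arithmetic. This Python sketch is only that rule; the function name and the example column sizes are made up, and the result is a rough range, not what MySQL will report.

```python
def estimate_btree_bytes(column_sizes, row_count):
    """Rule of thumb from above: sum the column sizes, then multiply by 2 or 3.

    Returns a (low, high) estimate in bytes for the whole BTree.
    """
    raw_bytes = sum(column_sizes) * row_count
    return raw_bytes * 2, raw_bytes * 3

# Hypothetical table: two 4-byte INTs and one 8-byte BIGINT, 1 million rows.
low, high = estimate_btree_bytes([4, 4, 8], 1_000_000)
print(low, high)
```

For a secondary index, pass the sizes of the index's columns plus the PK's columns, since those are what its BTree stores.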
I am reading Effective MySQL - Optimizing MySQL Statements, and in chapter 3 there was this explanation:
The secondary indexes in InnoDB use the B-tree data structure; however, they differ from the MyISAM implementation. In InnoDB, the secondary index stores the physical value of the primary key. In MyISAM, the secondary index stores a pointer to the data that contains the primary key value.
This is important for two reasons. First, the size of secondary indexes in InnoDB can be much larger when a large primary key is defined—for example when your primary key in InnoDB is 40 bytes in length. As the number of secondary indexes increase, the comparison size of the indexes can become significant. The second difference is that the secondary index now includes the primary key value and is not required as part of the index. This can be a significant performance improvement with table joins and covering indexes.
There are many questions that come to my mind, mostly due to lack of understanding of what author is trying to convey.
It is unclear what the author means by the second difference in the second paragraph. What is not required as part of the index anymore?
Does the InnoDB secondary index B-tree store only the PK value, or the PK value and a pointer to it, or the PK value and a pointer to the data row?
What kind of performance improvement would there be due to the storage method (2nd question's answer)?
This question contains an example and also an answer. He explains how it contains PK value, but what I am still not understanding is,
To complete the join, if the secondary index holds only the value and not a pointer, won't MySQL do a full index scan on the primary key index with that value from the secondary index? How would that be more efficient than having the pointer as well?
The secondary index is an indirect way to access the data. Unlike the primary (clustered) index, when you traverse the secondary index in InnoDB and you reach the leaf node you find a primary key value for the corresponding row the query is looking for. Using this value you traverse the primary index to fetch the row. This means 2 index look ups in InnoDB.
For MyISAM, because the leaf node of the secondary index holds a pointer to the actual row, you only require 1 index lookup.
The secondary index is formed based on certain attributes of your table that are not the PK. Hence the PK is not required to be part of the index by definition. Whether it is (InnoDB) or not (MyISAM) is implementation detail with corresponding performance implications.
Now the approach that InnoDB follows might at first seem inefficient in comparison to MyISAM (2 lookups vs 1 lookup), but it usually is not, because the frequently used parts of the primary index tend to stay cached in memory, so the penalty is low.
The advantage is that InnoDB can split and move rows to optimize the table layout on inserts/updates/deletes of rows without needing to do any updates on the secondary index, since it does not refer to the affected rows directly.
Basics..
MyISAM's PRIMARY KEY and secondary keys work the same. -- Both are BTrees in the .MYI file where a "pointer" in the leaf node points to the .MYD file.
The "pointer" is either a byte offset into the .MYD file, or a record number (for FIXED). Either results in a "seek" into the .MYD file.
InnoDB's data, including the columns of the PRIMARY KEY, is stored in one BTree ordered by the PK.
This makes a PK lookup slightly faster. Both drill down a BTree, but MyISAM needs an extra seek.
Each InnoDB secondary key is stored in a separate BTree. In this case the leaf nodes contain the secondary key's columns plus any columns of the PK that are not already in the key. So, a secondary key lookup first drills down that BTree based on the secondary key. There it will find all the columns of both the secondary key and the primary key. If those are all the columns you need, this is a "covering index" for the query, and nothing further is done. (Faster than MyISAM.)
But usually you need some other columns, so the column(s) of the PK are used to drill down the data/PK BTree to find the rest of the columns in the row. (Slower than MyISAM.)
So, there are some cases where MyISAM does less work; some cases where InnoDB does less work. There are a lot of other things going on; InnoDB is winning many comparison benchmarks over MyISAM.
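The "covering" test described above boils down to a set comparison. A small Python sketch, with a made-up function name; it ignores everything else the optimizer considers and only checks whether the columns a query needs are all present in the secondary key's BTree.

```python
def is_covering(index_columns, pk_columns, needed_columns):
    """True if an InnoDB secondary index can answer the query on its own.

    The index's leaf nodes hold the index columns plus the PK columns,
    so the query is covered when it needs nothing outside that set.
    """
    available = set(index_columns) | set(pk_columns)
    return set(needed_columns) <= available

# PRIMARY KEY(a,b,c) and INDEX(c,d), on a table with columns a..g
print(is_covering(["c", "d"], ["a", "b", "c"], ["a", "d"]))  # no second lookup
print(is_covering(["c", "d"], ["a", "b", "c"], ["d", "e"]))  # needs e from the row
```

When it returns False, the PK columns found in the secondary key are used to drill down the data/PK BTree for the rest of the row.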
Caching...
MyISAM controls the caching of 1KB index blocks in the key_buffer. Data blocks are cached by the Operating System.
InnoDB caches both data and secondary index blocks (16KB in both cases) in the buffer_pool.
"Caching" refers to swapping in/out blocks as needed, with roughly a "least recently used" algorithm.
No BTree is loaded into RAM. No BTree is explicitly kept in RAM. Every block is requested as needed, with the hope that it is cached in RAM. For data and/or index(es) smaller than the associated buffer (key_buffer / buffer_pool), the BTree may happen to stay in RAM until shutdown.
The source-of-truth is on disk. (OK, there are complex tricks that InnoDB uses with log files to avoid loss of data when a crash occurs before blocks are flushed to disk. That cleanup automatically occurs when restarting after the crash.)
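A bare-bones sketch of block caching with a plain "least recently used" policy, in Python. The class and its names are made up for illustration; the real key_buffer and buffer_pool use more refined variations of the idea, and nothing here models writes or flushing.

```python
from collections import OrderedDict

class BlockCache:
    """Toy block cache: holds at most `capacity` blocks, evicts the
    least recently used block when full."""

    def __init__(self, capacity, read_block_from_disk):
        self.capacity = capacity
        self.read_block_from_disk = read_block_from_disk  # hypothetical callback
        self.blocks = OrderedDict()   # block id -> block contents, oldest first
        self.disk_reads = 0

    def get(self, block_id):
        if block_id in self.blocks:
            # Cache hit: mark the block as most recently used.
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        # Cache miss: the block has to come from disk.
        self.disk_reads += 1
        data = self.read_block_from_disk(block_id)
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used
        return data

cache = BlockCache(capacity=2, read_block_from_disk=lambda b: f"block-{b}")
for block_id in (1, 2, 1, 3, 1):
    cache.get(block_id)
print(cache.disk_reads)
```

With room for only 2 blocks, block 2 is evicted when block 3 arrives, while block 1 stays because it was used more recently. If everything fits in the cache, each block is read from disk only once.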
Pulling the plug..
MyISAM:
Mess #1: Indexes will be left in an unclean state. CHECK TABLE and REPAIR TABLE are needed.
Mess #2: If you are in the middle of UPDATEing a thousand rows in a single statement, some will be updated, some won't.
InnoDB:
As alluded to above, InnoDB performs things atomically, even across pulling the plug. No index is left mangled. No UPDATE is left half-finished; it will be ROLLBACKed.
Example..
Given
columns a,b,c,d,e,f,g
PRIMARY KEY(a,b,c)
INDEX(c,d)
The BTree leaf nodes will contain:
MyISAM:
for the PK: a,b,c,pointer
for secondary: c,d,pointer
InnoDB:
for the PK: a,b,c,d,e,f,g (the entire row is stored with the PK)
for secondary: c,d,a,b
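The same example as a few lines of Python, in case it helps to see the rule spelled out. The function names are made up; the logic is just "index columns, then whatever the leaf node needs to reach the row".

```python
def innodb_secondary_leaf(pk_columns, index_columns):
    """Index columns first, then any PK columns not already in the index."""
    extra = [col for col in pk_columns if col not in index_columns]
    return list(index_columns) + extra

def myisam_leaf(index_columns):
    """Index columns plus a pointer into the .MYD file (PK or secondary alike)."""
    return list(index_columns) + ["pointer"]

pk = ["a", "b", "c"]
secondary = ["c", "d"]

print(myisam_leaf(pk))                       # ['a', 'b', 'c', 'pointer']
print(myisam_leaf(secondary))                # ['c', 'd', 'pointer']
print(innodb_secondary_leaf(pk, secondary))  # ['c', 'd', 'a', 'b']
```

Note that c is not repeated in the InnoDB secondary key, since it is already there as part of INDEX(c,d).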
Are clustered indexes created and stored separately from the actual data in MySQL? If so, why can we not have more than one clustered index? All we would need to do is create another index and store it in memory.
A clustered index is, at least partially, the way the table is physically sorted and stored, i.e. the row order on disk. That's why you can only have one. But because it reflects the physical layout of the rows, it's potentially more compact and performant than a typical index.
UPDATE:
As @RickJames excellently points out below, in InnoDB (MySQL's default engine since 5.5.5), a lookup is typically a two-stage process. One b-tree relates a secondary key to a primary key, a second b-tree relates the primary key to the location of a data record. If retrieving data on primary key, only the second lookup is necessary. In that sense, a b-tree lookup is always necessary.
Additionally, according to the MySQL documentation:
Typically, the clustered index is synonymous with the primary key. [1]
And the reason it's considered "clustered" and not just a primary key is that InnoDB attempts to order the data records according to primary key and leaves room for future records to be inserted in the correct location in its data pages [2].
Because of that, not only is a query on an InnoDB primary key one fewer b-tree lookup than a secondary index, but the primary key b-tree can be significantly smaller because of the physical ordering of the data on disk.
It stands to reason that even if there were a mechanism to make a secondary index that pointed directly to a data record (like a MyISAM index), it wouldn't perform as well as InnoDB's primary/clustered index.
So, it's fundamentally the (at least partial) physical ordering of data records by primary key which prevents you from getting the same performance from a secondary index.
MySQL's InnoDB does the following for its PRIMARY KEY: The data records are in PK order, stored together in a B+Tree structure. This allows for rapid point-queries and range scans. That is, the value at the 'bottom' of the tree has all the columns of the table.
InnoDB's secondary keys are also in a B+Tree, but the bottom values are the PK columns. Hence, a second lookup is needed to fetch a row(s) by a secondary key.
Note that a secondary key could contain all the columns of the table, thereby acting like a second clustered index. The drawback is that any modification to the table would necessarily involve changes to both BTrees.
MyISAM, in contrast, throws the data into a file (the .MYD) and has every index in its own BTree in the .MYI file. The bottom of each BTree is a pointer (row number or byte offset) into the .MYD. The PK is not implemented any differently than a secondary key.
(Note: FULLTEXT and SPATIAL indexes are not covered by the above discussion.)
I am using InnoDB. My Index selectivity (cardinality / total-rows) is < 100%, roughly 96-98%.
I would like to know if the columns, which are not part of the keys, are also stored in sorted order. This influences my tables' design.
I would also be interested to understand how much performance degradation in lookups I can expect when index selectivity is < 100%.
(I ask these questions because, for InnoDB, it's only mentioned that indexes are clustered and that there's a TID/RP stored after the index.)
No, it doesn't matter for the order of the non-keyed columns.
The answer to your second question is more complex. I'm going to walk through it, since I think you might be misunderstanding InnoDB a little:
There are two types of indexes, primary and secondary.
The primary key index is clustered - that is, data is stored in the leaves of the B+Tree. Looking up by primary key is just one tree traversal, and you've got the row(s) you're looking for.
Looking up by secondary key requires searching through the secondary key, finding the primary key rows that match, and then looking through the primary key to get the data.
You only need to worry about selectivity of secondary (non clustered) indexes, since the primary (clustered) index will always have a selectivity of 1. How selective a secondary index needs to be varies a lot. For one, it depends on the width of the index versus the width of the row. It also depends on whether your data fits in memory, since if secondary keys don't "follow" the primary key, it may cause random IO to look up each of the rows from the clustered index.
What is the difference between MySQL unique and non-unique index in terms of performance?
Let us say I want to make an index on a combo of 2 columns, and the combination is unique, but I create a non-unique index. Will that have any significant effect on the performance or the memory MySQL uses?
Same question: is there a difference between a primary key and a unique index?
UNIQUE and PRIMARY KEY are constraints, not indexes. Though most databases implement these constraints by using an index. The additional overhead of the constraint in addition to the index is insignificant, especially when you count the cost of tracking down and correcting unintentional duplicates when (not if) they occur.
Indexes are usually more effective if you have a high selectivity. This is the ratio of the number of distinct values to the total number of rows.
For example, in a column for Social Security Number, you may have 1 million rows with 1 million distinct values. So the selectivity is 1000000/1000000 = 1.0 (although there are rare historical exceptions, SSN's are intended to be unique).
But another column in that table, "gender" may only have two distinct values over 1 million rows. 2/1000000 = very low selectivity.
An index with a UNIQUE or PRIMARY KEY constraint is guaranteed to have a selectivity of 1.0, so it will always be as effective as an index can be.
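The selectivity ratio is easy to compute by hand. A short Python sketch using made-up column data (1,000 rows instead of 1 million, to keep it small):

```python
def selectivity(values):
    """Ratio of distinct values to total rows, between 0 and 1."""
    values = list(values)
    if not values:
        raise ValueError("selectivity is undefined for an empty column")
    return len(set(values)) / len(values)

ssn_like = range(1000)        # every value distinct
gender = ["M", "F"] * 500     # two distinct values over 1000 rows

print(selectivity(ssn_like))  # 1.0
print(selectivity(gender))    # 0.002
```

In MySQL itself, the equivalent check would be comparing a COUNT(DISTINCT ...) on the column against COUNT(*) for the table.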
You asked about the difference between a primary key and a unique constraint. Chiefly, it's that you can have only one primary key constraint per table (even if that constraint's definition includes multiple columns), whereas you can have multiple unique constraints. A column with a unique constraint may permit NULLs, whereas columns in primary key constraints must not permit NULLs. Otherwise, primary key and unique are very similar in their implementation and their use.
You asked in a comment about whether to use MyISAM or InnoDB. In MySQL, they use the term storage engine. There are a bunch of subtle differences between these two storage engines, but the chief ones are:
InnoDB supports transactions, so you can choose to roll back or commit changes. MyISAM is effectively always autocommit.
InnoDB enforces foreign key constraints. MyISAM doesn't enforce or even store foreign key constraints.
If these features are things you need in your application, then you should use InnoDB.
To respond to your comment, it's not that simple. InnoDB is actually faster than MyISAM in quite a few cases, so it depends on your application's mix of selects, updates, concurrent queries, indexes, buffer configuration, etc.
See http://www.mysqlperformanceblog.com/2007/01/08/innodb-vs-myisam-vs-falcon-benchmarks-part-1/ for a very thorough performance comparison of the storage engines. InnoDB wins over MyISAM frequently enough that it's clearly not possible to say one is faster than the other.
As with most performance-related questions, the only way to answer it for your application is to test both configurations using your application and a representative sample of data, and measure the results.
On a non-unique index that just happens to be unique and a unique index? I'm not sure, but I'd guess not a lot. The optimiser should examine the cardinality of the index and use that (it will always be the number of rows, for a unique index).
As far as a primary key is concerned, probably quite a lot, but it depends which engine you use.
The InnoDB engine (which is used by many people) always clusters rows on the primary key. This means that the PK is essentially combined with the actual row data. If you're doing a lot of lookups by PK (or indeed, range scans etc), this is a Good Thing, because it means that it won't need to fetch as many blocks from the disc.
A non-PK unique index will never be clustered in InnoDB.
On the other hand, some other engines (MyISAM in particular) don't cluster the PK, so the primary key is just like a normal unique index.