Why does InnoDB require a clustered index upon creating a table? - mysql

Even if I don't have a primary key or unique key, InnoDB still creates a clustered index on a synthetic column, as described here:
https://dev.mysql.com/doc/refman/5.5/en/innodb-index-types.html
So, why does InnoDB have to require a clustered index? Is there a definite reason a clustered index must exist here?
In Oracle Database or MSSQL I don't see this requirement.
Also, I don't think a clustered index has such a tremendous advantage compared to an ordinary (heap) table.
It is true that looking up data via the clustering key needs no additional disk read and is faster than when there is no clustered index, but without one, secondary indexes can look rows up faster by using a physical row ID.
Therefore, I don't see any reason for insisting on using it.

Other vendors have a "ROWID" or something like that. InnoDB is much simpler. Instead of having that animal, it simply requires something that you will usually want anyway. In both cases, it is a value that uniquely identifies a row. This is needed for the guts of transactions -- knowing which row(s) to lock, etc., to provide transactional integrity. (I won't go into the rationale here.)
In requiring (or providing) a PK, and in making certain other simplifications, InnoDB sacrifices several little-used (or easily worked-around) features: multiple clustered indexes, tables with no PK, etc.
Since the "synthetic column" takes 6 bytes, it is almost always better to simply provide id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, even if you don't use it. But if you don't use it, but do have a non-NULL UNIQUE key, then you may as well make it the PK. (As MySQL does by default.)
A lookup by a secondary key first gets the PK value from the secondary key's BTree. Then the main BTree (with the data ordered by the PK) is drilled down to find the row. Hence, secondary keys can be slower than use of the PK. (Usually this is not enough slower to matter.) So this points out one design decision that required a PK. (Other vendors use a ROWID, or something similar, to locate the record instead of the PK.)
Back to "Why?". There are many decisions in MySQL where the designers said "simplicity is better for this free product, let's not bother building some complex, but little-used feature. At first there were no subqueries (temp tables were a workaround). No Views (they are only syntactic sugar). No Materialized Views (OK, this may be a failing; but they can be simulated). No bit-mapped or hash or isam (etc) indexing (BTree is very good for "all-around" usage).
Also, by always "clustering" the PK with the data, lookups via the PK are inherently faster than the competition (no going through a ROWID). (Secondary-key lookups may not be faster.)
Another difference -- MySQL was very late in implementing "index merge", wherein it uses two indexes, then ANDs or ORs the results. This can be efficient with row IDs, but not with clustered PKs.
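For instance (hypothetical table; whether index merge is actually chosen depends on the data and the MySQL version):

    -- Two independent secondary indexes on a and b
    CREATE TABLE t (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        a  INT NOT NULL,
        b  INT NOT NULL,
        INDEX (a),
        INDEX (b)
    ) ENGINE=InnoDB;

    -- The optimizer may scan both indexes and union the results;
    -- EXPLAIN then reports type=index_merge, "Using union(a,b)"
    EXPLAIN SELECT * FROM t WHERE a = 1 OR b = 2;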
(I'm not a MySQL/MariaDB/Percona developer, but I have used them since 1999, and have been to virtually all major MySQL Conferences, where inside info is often divulged. So, I think I have enough insight into their thinking to present this answer.)

Related

MySQL: What is the impact of varchar length on performance when used as a primary key?

What would be the performance penalty of using strings as primary keys instead of bigints etc.? String comparison is much more expensive than integer comparison, but on the other hand I can imagine that internally a DBMS will compute hash keys to reduce the penalty.
An application that I work on uses strings as primary keys in several tables (MySQL). It is not trivial to change this, and I'd like to know what can be gained performance-wise to justify the work.
on the other hand I can imagine that internally a DBMS will compute hash keys to reduce the penalty.
The DB needs to maintain a B-Tree (or a similar structure) with the keys kept in order.
If the key were hashed and the hash stored in the B-Tree, that would be fine for rapidly checking the uniqueness of the key -- the key could still be looked up efficiently. But you would not be able to search efficiently for a range of data (e.g. with LIKE), because the B-Tree would no longer be ordered according to the String value.
So I think most DBs really store the String in the B-Tree, which can (1) take more space than numeric values and (2) require the B-Tree to be rebalanced if keys are inserted in arbitrary order (there is no notion of ever-increasing values as with a numeric PK).
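To make the range point concrete (users/username are hypothetical names):

    -- With the string key stored in order in the B-Tree, a prefix
    -- search walks one contiguous slice of the index:
    SELECT * FROM users WHERE username LIKE 'ab%';

    -- A hash of the key could still answer an exact match...
    SELECT * FROM users WHERE username = 'abigail';
    -- ...but hashes of 'ab...' values are scattered, so the LIKE
    -- query above could not use a hashed key and would need a scan.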
The penalty in practice can range from insignificant to huge. It all depends on the usage, the number of rows, the average size of the string key, the queries which join tables, etc.
In our product we use varchar(32) for primary keys (GUIDs) and we haven't run into performance issues with this. Our product is a web site under extreme load, and stability is critical.
We use SQL Server 2005.
Edit: In our biggest tables we have more than 3,000,000 records, with lots of inserts and selects on them. I think that in general the benefit of migrating to an int key would be very low, but the cost of migrating very high.
One thing to watch out for is page splits (I know this can happen in SQL Server - probably the same in MySQL).
Primary keys are physically ordered. By using an auto-increment integer, you guarantee that each time you insert you are inserting the next number up, so there is no need for the db to reorder the keys. If you use strings, however, the PK you insert may need to be placed in the middle of the other keys to maintain PK order. That process of reordering the PKs on insert can get expensive.
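Roughly, with a hypothetical table per key style:

    -- Auto-increment key: each new row lands after the highest key,
    -- so pages fill in order and never split on insert
    INSERT INTO orders_int (customer) VALUES ('Ann');

    -- Random string key (a GUID): the new row lands at an arbitrary
    -- point in key order, and a full page there must be split in two
    INSERT INTO orders_guid (id, customer) VALUES (UUID(), 'Ann');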
It depends on several factors: the RDBMS, the number of indexes involving those columns. But in general it will be more efficient to use ints, followed by bigints.
Any performance gains depend on usage, so without concrete examples of table schema and query workload it is hard to say.
Unless it makes sense in the domain (I'm thinking of something unique like a social security number), a surrogate integer key is a good choice; referring objects do not need to have their FK references updated when the referenced object changes.

Is automatically indexing primary key really good?

In some DBMSs, like MySQL, the primary key is always indexed by default. I know indexing can make operations like selection and comparison on the indexed column much faster, but it can also slow down other operations like insertion and update. There are cases where there are few selections on the primary key of a table, in which case indexing will not bring much benefit. In such cases, wouldn't it be better not to index the primary key?
Clarification: I just learned that a primary key is actually implemented by a special index, like the clustered index in InnoDB. An index can definitely be used to enforce the uniqueness constraint of a primary key, but is it really necessary to use an index to do this? From what I know, an index is often implemented as a BTree, which can improve the performance of many more operations than just checking uniqueness, which could be done simply with a hashtable. So why not use other, simpler structures to enforce uniqueness that have less negative impact on the performance of insert and update operations?
The article here mentions a similar point:
Unique indexes use as much space as nonunique indexes do. The value of every column as well as the record's location is stored. This can be a waste if you use the unique index as a constraint and never as an index. Put another way, you may rely on the unique index to enforce uniqueness but never write a query that uses the unique value. In this case, there's no need for MySQL to store the locations of every record in the index: you'll never use them.
And in the following paragraph:
Unfortunately, there's no way to signal your intentions to MySQL. In the future, we'll likely find a feature introduced for this specific case. The MyISAM storage engine already has support for unique columns without an index (it uses a hash-based system), but the mechanism isn't exposed at the SQL level yet.
The "hash-based system" is an example of what I meant by "other simpler structures".
A primary key that isn't indexed is neither primary nor even a key.
Your question doesn't make sense.
Let's go back in history about 20 years, when MySQL was just getting started. The inventor said to himself, "What indexing system is simple, efficient, and generally useful?" The answer was BTree. So BTrees were all that existed for a long time in MySQL. Then he asked himself, "What bells and whistles should we put on the PRIMARY KEY?" The answer was KISS: make it identical to other UNIQUE indexes. This was the MyISAM engine.
Later (about 15 years ago) another inventor joined forces. He brought the 'simple', yet transactional, InnoDB engine. Since transactions really need a PK, InnoDB has a PK that is UNIQUE and clustered. And, again, the data+PK is a BTree.
Every so often someone would ask, "Do we need bitmap indexes, hash indexes, a second clustered index, etc.?" The answer always came back: "No, BTree is good enough." A few non-MySQL engines have been invented to do non-BTree indexes. Perhaps the most successful is Tokutek with its "Fractal index"; MariaDB now includes TokuDB. Another is the "columnar indexing" of InfiniDB.
(Apologies to Monty and Heikki if they did not actually ask those questions.)
Hash and BTree indexes are about equally fast for "point queries". But for "range queries", Hash is useless and BTree is excellent. Why implement both when one is clearly better?
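To illustrate the two query shapes (t is a hypothetical table keyed on id):

    -- Point query: hash and BTree are both fast here
    SELECT * FROM t WHERE id = 12345;

    -- Range query: a BTree walks one contiguous slice of the index;
    -- a hash index has no ordering to exploit, so it cannot help
    SELECT * FROM t WHERE id BETWEEN 12000 AND 13000;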

What are the benefits of a sequential index key, and how much wiggle room do I have?

Sequential keys allow one to use a clustered index. How material is that benefit? How much is lost if (say) 1% of the keys are out of sequential order by one or two positions?
Short:
A clustered index, in general, can be used on anything that is sortable. Sequentiality (no gaps) is not required; your records will be maintained in order with common index-maintenance principles (the only difference is that with a clustered index the leaves are big, because they hold the data too).
Long:
Good clustering can give you orders of magnitude improvements.
Basically with good clustering you will be reading data very efficiently on any spinning media.
The measure of whether the clustering is good should come from examining the most common queries (those that actually read data and cannot be answered by indexes alone).
So, for example, if you have a composite natural key as the primary key on which the table is clustered, AND you always access the data by a leading subset of that key, then simple sequential disk reads will answer your query in the most efficient way.
However, if the most common access path does not follow the natural key (for example, the application spends 95% of its time looking for the last 5 records within a group, AND the date of update is not part of the clustered index), then you will not be doing sequential reads, and your choice of clustered index might not be the best.
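A sketch of the two situations, with made-up table and column names:

    -- Clustered on the composite natural key
    CREATE TABLE orders (
        customer_id INT UNSIGNED NOT NULL,
        order_date  DATE NOT NULL,
        order_no    INT UNSIGNED NOT NULL,
        updated_at  TIMESTAMP NOT NULL,
        total       DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (customer_id, order_date, order_no)
    ) ENGINE=InnoDB;

    -- Matches the clustering: one contiguous, sequential slice
    SELECT * FROM orders WHERE customer_id = 42;

    -- Does not match it: the last 5 updates cannot be found without
    -- reading and sorting the whole customer_id = 42 slice
    SELECT * FROM orders WHERE customer_id = 42
    ORDER BY updated_at DESC LIMIT 5;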
So, all this is at the level of physical implementation - this is where things depend on the usage.
Note:
Not so relevant today, but tomorrow I would expect most DBs to run off SSDs, where access times keep improving and random-access reads are similar in speed to sequential reads; with that, the importance of clustered indexes will diminish.
You need to understand the purpose of the clustered-index.
It may be helpful in some cases to speed up inserts, but mostly we use clustered indexes to make queries faster.
Consider the case where you want to read a range of keys from a table - this is very common - it's called a range scan.
Range scans on a clustered index are massively better than range scans on a secondary index (when not using a covering index). This is the main case for using clustered indexes. It mostly saves 1 IO operation per row in your result. That can be the difference between a query needing, say, 10 IO operations and 1000.
It really is amazing, particularly if you have no blobs and lots of records per page.
If you have no SPECIFIC performance problem that you need to fix, don't worry about it.
But do also remember that it is possible to make a composite primary key, and that your "unique ID" need not be the whole primary key. A common (very good) technique is to add something which you want to range-scan as the FIRST part of the PK, and add a unique (otherwise meaningless) ID afterwards.
So consider the case where you want to scan your table by time: you can make the time the first part of the PK (it is not going to be unique, so it's not enough on its own), and a unique ID the second.
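Something like this (names are illustrative; note that InnoDB wants the AUTO_INCREMENT column to lead some index, hence the extra KEY):

    CREATE TABLE events (
        created DATETIME NOT NULL,
        id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
        payload VARCHAR(255),
        PRIMARY KEY (created, id),   -- time first, id for uniqueness
        KEY (id)
    ) ENGINE=InnoDB;

    -- The time range scan now reads one contiguous slice of the data
    SELECT * FROM events
    WHERE created >= '2015-01-01' AND created < '2015-02-01';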
Do not, however, do premature optimisation. If your database fits in memory (say, 32 GB), you don't care about IO operations; it's never going to do any reads anyway.

What is the use of a MySQL index key?

Hi, I am a newbie to MySQL.
Here are my questions:
What is the use of a MySQL index key?
Does it make a difference to MySQL queries whether an index key is defined or not?
Are all primary keys indexed by default?
1- Defining an index on a column (or set of columns) makes searching on that column (or set) much faster, at the expense of additional disk space.
2- Yes, the difference is that queries using that column will be much faster.
3- Yes, as it's usual to search by the primary key, it makes sense for that column to always be indexed.
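As a quick illustration of points 1 and 2 (table and index names are made up):

    -- Without an index on last_name, this query scans every row:
    SELECT * FROM customers WHERE last_name = 'Smith';

    -- After adding one, the same query jumps straight to the matches:
    CREATE INDEX idx_last_name ON customers (last_name);
    SELECT * FROM customers WHERE last_name = 'Smith';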
Read more on MySQL indexing here.
An index is indeed an additional set of records. Nothing more.
Things that make indexes access faster are:
Internally, there's a better chance that the engine keeps the index in the buffer than the whole table's rows
The index is smaller, so parsing it means reading fewer blocks from the hard drive
The index is already sorted, so finding a given value is easy
If the column is NOT NULL, it's even faster (for various reasons, but the most important thing to know is that the engine doesn't store NULL values in indexes)
Whether or not an index is useful is not so easy to guess (obviously I'm not talking about the primary key) and should be investigated. Here are some drawbacks, cases where an index might slow down your operations:
It will slow down inserts and updates on indexed fields
It requires more maintenance: statistics have to be built for each index, so the computation could take significantly longer if you add many indexes
It might slow down queries when the statistics are not up to date. This effect can be catastrophic, because the engine would actually go "the wrong way"
It might slow things down when the queries are inadequate (in any case, indexes should not be the rule but the exception: no index, unless certain queries urgently need one. I know usually every table has at least one index, but that should come after investigation)
We could comment on this last point a lot, but I think every case is special, and many examples of this already exist on the internet.
Now, about your question 'Are all primary keys indexed by default?', I must say that this is not how the optimizer works. When several indexes are defined on a table, the most efficient index combination is chosen on the fly from run-time data and some static data (index statistics), in order to reach the best performance. There's no default index per se; every situation leads to a different result.
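You can watch the optimizer make that choice with EXPLAIN (the table and columns here are hypothetical):

    -- "possible_keys" lists the candidate indexes; "key" shows the
    -- one the optimizer actually chose for this query
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42 AND status = 'open';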

Are multi-column primary keys in MySQL an optimisation problem?

I've been looking into using multi-column primary keys, and as performance is extremely important given the size of our traffic and database, I need to know if there is anything to consider before I start throwing out the unique-ID method on many of my tables and start using multi-column primary keys.
So, what are the performance/optimisation pros/cons to using multi column primary keys versus a basic single column, auto-inc primary key?
Is there a particular reason that you need/want to use multi-column keys instead of an (I assume) already created single-column key?
One of the problems with Natural Keys is dealing with cascading an update to the key value across all the foreign keys. A surrogate key such as an auto-increment column avoids this.
In terms of performance, depending on the row count, the data types of the columns, your storage engine, and the amount of RAM you have dedicated to MySQL, multi-column keys can affect performance due to the sheer size of the index.
In my experience, it is almost always easier in terms of development and maintenance to use a surrogate key as the PK and then create indexes that cover your queries across the natural keys. However, the only way to determine the true performance impact for your application is to benchmark it with a realistic load and dataset.
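A sketch of that approach, with made-up names:

    CREATE TABLE order_items (
        id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        order_no INT UNSIGNED NOT NULL,
        line_no  INT UNSIGNED NOT NULL,
        qty      INT NOT NULL,
        UNIQUE KEY uk_order_line (order_no, line_no),  -- the natural key
        KEY ix_order_qty (order_no, qty)  -- covers qty-by-order queries
    ) ENGINE=InnoDB;

    -- Answered entirely from ix_order_qty; no row lookup needed
    SELECT order_no, qty FROM order_items WHERE order_no = 1001;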
I wouldn't think that there would be any performance problems with a multi-column primary key. It's more or less equivalent to having any other multi-column index (you will spend a little more time computing index values when doing inserts).
Sometimes the data model makes more sense with multiple keys. I'd worry about being straightforward first and worry about performance second. You can always add more indexes, improve your queries, or twiddle server settings.
I think the most I've encountered was a 4-column primary key. Makes me cringe a little bit, but it worked¹.
[1] "worked" is defined to mean "the application performed to specification", and is not meant to imply that actual tasks were accomplished using said application. :)