Index vs Auto Increment ID set as PK - mysql

In MySQL, what is the difference between an index and an ID column set to AUTO_INCREMENT and used as the primary key?
Will this also increase the speed of searching the database? Or is the AUTO_INCREMENT ID just for the benefit of the user, and the computer doesn't consider it while searching the database? Reading about INDEX on w3schools.com I came across this line:
Indexes are used to retrieve data from the database very fast. The
users cannot see the indexes, they are just used to speed up
searches/queries.

In MySQL, the primary key creates an index on the key . . . but the original data pages are the leaves of the index. This may be a little convoluted, but the effect is that the data is actually sorted on the data pages.
A regular index is implemented as a b-tree (note: the "b" stands for "balanced" rather than "binary", contrary to what many people believe). The leaves are stored separately from the original data.
auto_increment is a property of one column of a table, where the value is set to a new value on each insert and the new value is larger than the previous value. The increment is usually 1, but that is not guaranteed. auto_increment does not directly relate to indexing, but is almost always associated with the primary key of the table.
So, in both cases, you have an index. The primary key index is slightly smaller because storage is combined with the data pages themselves. On the other hand, the data needs to be in order on the disk, which can complicate inserts and updates. On the other hand, auto-increment guarantees that all new rows go at the end of the data. On the other hand, I've run out of hands.
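To make the distinction concrete, here is a minimal sketch (the table and column names are hypothetical, not from the question): the PRIMARY KEY becomes InnoDB's clustered index, while the extra KEY is a secondary B-tree whose leaf entries carry the primary key value.
-- Hypothetical example table.
CREATE TABLE orders (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  customer_id INT UNSIGNED NOT NULL,
  created_at  DATETIME NOT NULL,
  PRIMARY KEY (id),              -- clustered index: the row data lives in this B-tree
  KEY idx_customer (customer_id) -- secondary index: stored separately, leaves hold the PK value
) ENGINE=InnoDB;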

When you index a column, MySQL builds a B-tree (or possibly another data structure) to speed up the search process.
The ID column, as the primary key, is indexed by default.
AUTO_INCREMENT means you want MySQL to automatically assign a value to the column whenever a new row is inserted. The value is set incrementally.
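As a small illustration of that behaviour (table and column names are made up), you simply omit the column on INSERT and MySQL assigns the next value, which the connection can read back with LAST_INSERT_ID():
-- Hypothetical example table.
CREATE TABLE users (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

INSERT INTO users (name) VALUES ('alice');  -- id is assigned automatically
INSERT INTO users (name) VALUES ('bob');    -- next value
SELECT LAST_INSERT_ID();                    -- id generated by the last INSERT on this connection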

AUTO_INCREMENT is an integer sequence generator and nothing else. It has no inherent relation to indexes of any kind, and only exists to generate unique sequential numbers. It is frequently used to generate integer surrogate keys which are frequently used as primary keys.
You don't have to use either AUTO_INCREMENT or integer IDs as an index so long as you have one or more fields that you can use to uniquely identify a row.
In fact, in terms of scalability, sequence generators like AUTO_INCREMENT are counter-productive as you can only ever have a single instance of a sequence generator, limiting the number of 'master/write' servers and/or bottlenecking insert performance to that of the node running the generator.

Related

How to normalize an already started database with primary key ID values, as small as possible?

As time passes and the database receives a lot of INSERTs, the auto increment primary key IDs of the tables can get quite big, especially if the INSERTs come from some automated process. I know that a MySQL INT field (for example) has this range: signed from -2147483648 to 2147483647, or unsigned from 0 to 4294967295. It is difficult to reach those limit values (but not impossible). Is there a command or process that normalizes the IDs to the smallest possible values? If the auto increment ID of a table arrived at 967295, but then for some reason the last 967200 INSERTs are deleted, how do I normalize the table so that the next auto increment is 96, while respecting all secondary keys and keeping the structure intact?
There is no command because primary keys aren't supposed to change.
You could write a program to periodically scan the table for large gaps and then change the primary keys to use the next available one. And you'd also have to change everywhere it is used as a foreign key.
This will not work with an autoincremented primary key. Autoincrement is a simple counter, it will not jump gaps for you. You'd have to normalize every entry in the table every time, then reset the autoincrement to max(id). Rewriting most of the rows in the table will be slow and will cause locks. Alternatively, you'd write an insert trigger to find the next available key; slow and complicated.
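If you did attempt such a renumbering, the final step - resetting the counter - is a plain ALTER (sketch only; the table name is illustrative, and InnoDB will not let you set the counter below MAX(id) + 1, so the renumbering has to happen first):
-- Sketch: reset the counter after the surviving rows have been renumbered.
ALTER TABLE things AUTO_INCREMENT = 96;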
This assumes there is nothing outside the database referring to the primary key. For example, http://example.com/thing/123 where 123 is an ID in the things table.
There are two better solutions.
Use a UUID primary key.
Use a bigint primary key.
UUID primary keys solve and create problems. They're awkward in MySQL. And they require 16 bytes, 12 more than an int.
Bigints work just like ints. They only take 4 more bytes. They're the native integer type on any modern hardware. They're a simple upgrade that only has to be done once: alter the type of the primary key and its foreign keys. You might run out of 2 billion or even 4 billion primary keys. You're not going to run out of 9,223,372,036,854,775,807.
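The upgrade itself is a one-time ALTER of the key column and of every column that references it; a hypothetical sketch (your table and column names will differ, and if a FOREIGN KEY constraint exists you may need to drop and re-add it, or temporarily disable foreign_key_checks):
-- Hypothetical table and column names.
ALTER TABLE things  MODIFY id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
ALTER TABLE widgets MODIFY thing_id BIGINT UNSIGNED NOT NULL;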

Type of primary key index and what happens on modifying them in MySQL

The highest rated answer here mentions how in MySQL records are stored in the order of primary index. Does that mean the primary index created is a sparse index? And if so, what happens if one modifies the primary key, by changing the column on which it is constructed or modifying one of the entries.
Apologies for asking 2 questions in the same post, but I thought it was better to ask them together. I came upon this doubt while dealing with tables that were extremely slow when queried, so I thought maybe inserting rows in an order based on an actual column in the table would help.
I'll answer about MySQL's default storage engine, InnoDB. This is the storage engine you should use for all tables in MySQL, unless you have a very specific reason not to.
Answer to Question 1: The primary key is not a sparse index. I know that term to refer to an index that only has values for a subset of rows in the table. I.e. an index with a WHERE clause. But the primary key must account for all rows in the table. It is the column or columns you use if you need to reference any single row uniquely.
Even though the clustered index is stored in order by primary key values, it isn't necessarily stored in that order on disk. Each page in an InnoDB tablespace has a link to the "next" page, which may not be physically located immediately after it. It may be much later in the file, or even earlier. The pages may be interspersed with pages for other indexes (even other tables, if you are using a shared tablespace). Is this what you meant by a sparse index?
Answer to Question 2: If you change the column(s) of the table's primary key, the storage of the table must be restructured. InnoDB uses the primary key as the clustered index, which means the rest of the columns are stored along with leaf nodes of the B-tree data structure. Changing the primary key column(s) may change the order of storage, and also the size of each B-tree node (both internal nodes and leaf nodes).
This means while you are restructuring the primary key, your server temporarily needs a lot of extra storage space — up to double the size of the table. Once the restructuring is finished, the old version of the table is automatically dropped.
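For reference, the restructuring described above is triggered by a single (but potentially heavy) ALTER; a hypothetical sketch with invented table and column names:
-- The whole table is rebuilt around the new clustered index,
-- so expect extra disk usage while this runs.
ALTER TABLE measurements
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (device_id, recorded_at);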

MySQL query and insertion optimisation with varchar(255) UUIDs

I think this question has been asked in some shape or form, but I couldn't find a question that asked exactly what I wish to understand, so I thought I'd put the question here.
Problem statement
I have built a web application with a MySQL database of, say, customer records with an INT(11) id PK AI field and a VARCHAR(255) uuid field. The uuid field is not indexed nor set as unique. The uuid field is used as a public identifier, so it's part of URLs etc. - e.g. https://web.com/get_customer/[uuid]. This was done because the UUID is 'harder' to guess for a regular John Doe - but understand that it is certainly not 'unguessable' in theory. But the issue now is that as the database grows larger, I have observed that the query to retrieve a particular customer record is taking longer to complete.
My thoughts on how to solve the issue
The solution that comes to mind is to make the uuid field unique and also index it. But I've been doing some reading in relation to this, and various blog posts and StackOverflow answers have described putting indexes on UUIDs as being really bad for performance. I also read that it will increase the time it takes to insert a new customer record into the database, as MySQL will take time to find the correct location in which to place the record as part of the index.
The above mentioned https://web.com/get_customer/[uuid] can be accessed without having to authenticate, which is why I'm not using the id field for the same. It is possible for me to consider moving to integer based UUIDs (I don't need the UUIDs to be universally unique - they just need to be unique for that particular table) - will that improve the indexing performance, and in turn the insertion and querying performance?
Is there a good blog post or information page on how to best set up a database for such a requirement? I need the ability to store a customer record which is 'hard' to guess, easy to insert, and easy to query in a large data set.
Any assistance is most appreciated. Thank you!
The received wisdom you mention about putting indexes on UUIDs only comes up when you use them in place of autoincrementing primary keys. Why? The entire table (InnoDB) is built behind the primary key as a clustered index, and bulk loading works best when the index values are sequential.
You certainly can put an ordinary index on your UUID column. If you want your INSERT operations to fail in the astronomically unlikely event you get a random duplicate UUID value, you can use an index like this.
ALTER TABLE customer ADD UNIQUE INDEX uuid_constraint (uuid);
But duplicate UUIDv4s are very rare indeed. They have 122 random bits, and most software generating them these days uses cryptographic-quality random number generators. Omitting the UNIQUE index is, I believe, an acceptable risk. (Don't use UUIDv1, 2, 3, or 5: they're not hard enough to guess to keep your data secure.)
If your UUID index isn't unique, you save time on INSERTs and UPDATEs: they don't need to look at the index to detect uniqueness constraint violations.
Edit. When UUID data is in a UNIQUE index, INSERTs are more costly than they are in a similar non-unique index. Should you use a UNIQUE index? Not if you have a high volume of INSERTs. If you have a low volume of INSERTs it's fine to use UNIQUE.
This is the index to use if you omit UNIQUE:
ALTER TABLE customer ADD INDEX uuid (uuid);
To make lookups very fast you can use covering indexes. If your most common lookup query is, for example,
SELECT uuid, givenname, surname, email
FROM customer
WHERE uuid = :uuid
you can create this so-called covering index.
ALTER TABLE customer
ADD INDEX uuid_covering (uuid, givenname, surname, email);
Then your query will be satisfied directly from the index and therefore be faster.
There's always an extra cost to INSERT and UPDATE operations when you have more indexes. But the cost of a full table scan for a query is, in a large table, far far greater than the extra INSERT or UPDATE cost. That's doubly true if you do a lot of queries.
In computer science there's often a space / time tradeoff. SQL indexes use space to save time. It's generally considered a good tradeoff.
(There's all sorts of trickery available to you by using composite primary keys to speed things up. But that's a topic for when you have gigarows.)
(You can also save index and table space by storing UUIDs in BINARY(16) columns and use UUID_TO_BIN() and BIN_TO_UUID() functions to convert them. )
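If you do go the BINARY(16) route, a rough sketch of what it might look like (the uuid_bin column name is invented here; UUID_TO_BIN() and BIN_TO_UUID() require MySQL 8.0):
-- Store the UUID compactly and convert at the edges.
ALTER TABLE customer ADD COLUMN uuid_bin BINARY(16);
UPDATE customer SET uuid_bin = UUID_TO_BIN(uuid);
ALTER TABLE customer ADD INDEX uuid_bin_idx (uuid_bin);

-- Look up by the text form of the UUID:
SELECT * FROM customer
WHERE uuid_bin = UUID_TO_BIN('3f06af63-a93c-11e4-9797-00505690773f');

-- Convert back for display:
SELECT BIN_TO_UUID(uuid_bin) AS uuid FROM customer LIMIT 1;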

How does MySQL determine if an INSERT is unique?

I would like to know if there is an implicit SELECT being run prior to performing an INSERT on a table that has any column defined as UNIQUE. I cannot find anything about this in the documentation for INSERT.
I have asked some other questions that nobody seems to be able to answer - perhaps because I'm not properly explaining myself - that are related to the above question.
If I understand correctly, then I assume the following would be true:
CASE 1:
You have a table with 1 billion rows. Each row has a UUID column which is unique. If you perform an insert the server must do some kind of implicit SELECT COUNT(*) FROM table WHERE UUID = [new uuid] and determine if the count is 0 or 1. Correct?
CASE 2:
You have a table with 1 billion rows. Each row has a composite unique key consisting of a DATE and a UUID. If you perform an insert the server must do some kind of implicit SELECT COUNT(*) FROM table WHERE DATE = [date] AND UUID = [new uuid] and check if the count is 0 or 1. Yes?
I use the word implicit because at some point, somewhere in the process, the server MUST be checking the value. If not, it would require that the laws of physics dictate that two identical rows cannot exist - and as far as I'm informed, physics doesn't play a big role when it comes to the uniqueness of numbers written down somewhere, in binary, on a magnetic disk in a computer.
Let's assume your 1 billion rows are equally and sequentially distributed across 2,000 different dates. Would this not mean that case 2 would perform the insert faster because it can look up the UUIDs segmented into dates? If not, then would it be better to use case 1 for insert speed - and in that case, why?
This question is theoretical, so don't bother with considering regular SELECT performance in this case. The primary key wouldn't be the UUID+DATE index.
As a response to comments: The UUID in my case is designed solely for the purpose of avoiding duplicate entries because of bad connections. Since you cannot make the same entry for a different date twice (without it logically being a new entry), the UUID does not need to be globally unique - it needs only be unique for each date. This is why I can permit it being part of a composite key.
There are a few flaws and misconceptions in the previous answers; rather than point them out, I will start from scratch.
Referring to InnoDB only...
An INDEX (including UNIQUE and PRIMARY KEY) is a BTree. BTrees are very efficient at locating one row based on the key the BTree is sorted on. (They are also efficient at scanning in key order.) The "fan out" of a typical BTree in MySQL is on the order of 100. So, for a million rows, the BTree is about 3 levels deep (log100(million)); for a trillion rows, it is only twice as deep (approximately). So, even if nothing is cached, it takes only 3 disk hits to locate one particular row in a million-row index.
I am being loose here with "index" versus "table" because they are essentially the same (in InnoDB, at least). Both are BTrees. What differs is what is in the leaf nodes: The leaf nodes of a table BTree have all the columns. (I am ignoring the off-block storage for TEXT/BLOB in InnoDB.) An INDEX (other than the PRIMARY KEY) has a copy of the PRIMARY KEY in its leaf nodes. This is how a secondary key can get from the INDEX BTree to the rest of the row's columns, and how InnoDB avoids storing multiple copies of all the columns.
The PRIMARY KEY is "clustered" with the data. That is one BTree contains both all the columns of all the rows, and it is ordered according to the PRIMARY KEY specification.
Locating a record by PRIMARY KEY is one BTree search. Locating a record by a SECONDARY KEY is two BTree searches, one in the secondary INDEX's BTree which gives you the PRIMARY KEY; then a second one to drill down the data/PK BTree.
PRIMARY KEY(UUID)... Since the UUID is very random, the "next" row you INSERT will be located at a 'random' spot. If the table is much bigger than can be cached in the buffer_pool, the block the new row needs to go into is very likely not to be cached. This leads to a disk hit to pull the block into cache (the buffer pool), and eventually another disk hit to write it back to disk.
Since a PRIMARY KEY is a UNIQUE KEY, something else is going on at the same time (No SELECT COUNT(*) etc). The UNIQUEness is checked after the block is fetched and before deciding whether to give a "duplicate key" error, or to store the row. Also, if the block is "full" then the block will need to be 'split' to make room for the new row.
INDEX(UUID) or UNIQUE(UUID)... There is a BTree for that index. On INSERT, some randomly located block will need to be fetched, modified, possibly split, and written back to disk, very much like the PK discussion above. If you had UNIQUE(UUID), there would also be a check for UNIQUEness and possibly an error message. In either case, there is, now and/or later, disk I/O.
AUTO_INCREMENT PK... If the PRIMARY KEY is an auto_increment, then new records are added to the 'last' block in the data BTree. When it gets full (every 100 or so records) there is (logically) a block split and flush of the old block to disk. (Actually, the I/O is probably delayed and done in the background.)
PRIMARY KEY(id) + UNIQUE(UUID)... Two BTrees. On an INSERT, there is activity in both. This is likely to be worse than simply PRIMARY KEY(UUID). Add up the disk hits above to see what I mean.
"Disk hits" are the killer in huge tables, and especially with UUIDs. "Count the disk hits" to get a feel for performance, especially when comparing two possible techniques.
Now for your secret sauce... PRIMARY KEY(date, UUID)... You are allowing the same UUID to show up on two different days. This can help! Back to how a PK works and checking for UNIQUEness... The "compound" index (date, UUID) is checked for UNIQUEness as the record is inserted. The records are sorted by date+UUID, so all of today's records are clumped together. IF (and this might be a big IF) one day's data fits in the buffer pool (but the entire table does not), then this is what is happening every morning... INSERTs are suddenly adding new records to the "end" of the table because of the new "date". These inserts are occurring randomly within the new date. Blocks in the buffer_pool are being pushed out to disk to make room for the new blocks. But, nicely, what you see is smooth, fast, INSERTs. This is unlike what you saw with PRIMARY KEY(UUID), when many rows had to wait for a disk read before UNIQUEness could be checked. All of today's blocks stay cached, and you don't have to wait for I/O.
But, if you ever get so big that you cannot fit one day's data in the buffer pool, things will start slowing down, first at the end of the day, then it will creep earlier and earlier as the frequency of INSERTs increases.
By the way, PARTITION BY RANGE(date), together with PRIMARY KEY(uuid, date) has somewhat similar characteristics. (Yes I deliberately flipped the PK columns.)
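A hypothetical sketch of those two layouts (column names invented; note that every unique key on a partitioned InnoDB table, including the PK, must contain the partitioning column):
-- Layout A: clustered by (date, uuid), so each day's inserts land in one clump.
CREATE TABLE entries (
  `date`  DATE NOT NULL,
  uuid    BINARY(16) NOT NULL,
  payload VARCHAR(255),
  PRIMARY KEY (`date`, uuid)
) ENGINE=InnoDB;

-- Layout B: partitioned by date range, with the PK columns flipped.
CREATE TABLE entries_partitioned (
  `date`  DATE NOT NULL,
  uuid    BINARY(16) NOT NULL,
  payload VARCHAR(255),
  PRIMARY KEY (uuid, `date`)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(`date`)) (
  PARTITION p2024 VALUES LESS THAN (TO_DAYS('2025-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);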
When inserting large amounts of data in a table, keep in mind that the data ends up being physically stored on a disk somewhere. To actually read and write the data from the disk, MySQL (and most other RDBMS) uses something called a clustered index. If you specify a Primary Key or a Unique Index on a table, the column or columns participating in the key/index becomes the clustered index key. This means that on the disk, data is physically stored in the same order as the values in the key column(s).
By utilising the clustered index, the database engine can quickly determine whether a value already exists, without having to scan the whole table. In theory, if a table contains N = 1,000,000 records, the engine on average needs log2(N) ≈ 20 operations to check if a value exists, regardless of how many columns participate in the index. For secondary indexes, a B-tree or a hash table is typically used (search the web for these terms for a detailed explanation of how they work).
The conclusion of this article is wrong:
"... MySQL is unable to buffer enough data to guarantee a value is
unique and is therefore caused to perform a tremendous amount of
reading for each insert to guarantee uniqueness"
This is incorrect. Checking uniqueness does not really require any additional work, as the engine had to locate the place to insert the new record anyway. What causes the performance slowdown is the use of UUIDs. Remember that UUIDs are randomly generated whenever a new record is inserted. This means that the new record needs to be inserted at a random physical position on the disk, and this causes existing data to be shifted around to accommodate the new record. If, on the other hand, the index column is a value that increases monotonically (such as an auto-increment INT), new records will always be inserted after the last record, meaning no existing data will ever need to be moved.
In your case, there won't be any performance difference between case 1 and case 2. But you will still run into trouble because of the randomness of the UUID's. It would be much better if you used an auto-incrementing value instead of the UUID. Also, since UUID's are always unique by nature, it really doesn't make much sense to index them with a UNIQUE constraint. Alternatively, if you really must use UUID's, make sure that you have a primary key on your table, that is based on auto-incrementing INT's, to ensure that new records are never randomly inserted on the disk.
This is the very purpose of a UNIQUE constraint:
A UNIQUE index creates a constraint such that all values in the index must be distinct. An error occurs if you try to add a new row [or update an existing row] with a key value that matches [another] existing row.
Earlier in the same manual page, it is stated that
A column list of the form (col1,col2,...) creates a multiple-column index. Index key values are formed by concatenating the values of the given columns.
How this constraint is implemented is not documented, but it must somehow equate to a preliminary SELECT with the values to be inserted/updated. The cost of such a check is often negligible, because, by definition, the fields are indexed (this overhead becomes relevant when dealing with bulk inserts).
The number of columns covered by the index is not meaningful in terms of performance (for example, compared to the number of rows in the table). It does impact the disk space occupied by the index, but this should really not matter in your design decisions.
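A minimal illustration of the constraint in action, using the composite key from case 2 (table and column names are made up):
CREATE TABLE events (
  `date` DATE     NOT NULL,
  uuid   CHAR(36) NOT NULL,
  UNIQUE KEY uq_date_uuid (`date`, uuid)
) ENGINE=InnoDB;

INSERT INTO events VALUES ('2024-01-01', '123e4567-e89b-12d3-a456-426614174000'); -- stored
INSERT INTO events VALUES ('2024-01-01', '123e4567-e89b-12d3-a456-426614174000'); -- rejected with a duplicate-key error
INSERT INTO events VALUES ('2024-01-02', '123e4567-e89b-12d3-a456-426614174000'); -- stored: different date, so the composite key is unique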

Can a database table be without a primary key?

Can anyone tell me if a table in a relational database (such as MySQL / SQL SERVER) can be without a primary key?
For example, I could have table day_temperature, where I register temperature and time. I don't see the reason to have a primary key for such a table.
Technically, you can declare such a table.
But in your case, the time should be made the PRIMARY KEY, since it's probably wrong to have different temperatures for the same time and probably useless to have the same one recorded more than once.
Logically, each table should have a PRIMARY KEY so that you can distinguish two records.
If you don't have a candidate key in your data, just create a surrogate one (AUTO_INCREMENT, SERIAL or whatever your database offers).
The only excuse for not having a PRIMARY KEY is a log or similar table which is subject to heavy DML, where having an index on it would impact performance beyond the level of tolerance.
Like always it depends.
A table does not have to have a primary key. It is much more important to have the correct indexes. How a primary key affects indexes depends on the database engine (e.g. whether it creates a unique index for the primary key column/columns).
However, in your case (and 99% of other cases too), I would add a new auto-increment unique column like temp_id and make it a surrogate primary key.
It makes maintaining this table much easier -- for example, finding and removing records (e.g. duplicated records) -- and believe me, for every table there comes a time to fix things :(.
If the possibility of having duplicate entries (for example for the same time) is not a problem, and you don't expect to have to query for specific records or range of records, you can do without any kind of key.
You don't need a PK, but it's recommended that you have one. It's the best way to identify unique rows. Sometimes you don't want an auto-incremental int PK, but rather create the PK on something else. For example, in your case, if there's only one unique row per time, you should create the PK on the time. It makes lookups based on time faster, plus it ensures that they're unique (you can be sure that the data integrity isn't violated).
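A minimal sketch of that, using the day_temperature example from the question (column types are assumptions):
CREATE TABLE day_temperature (
  time        DATETIME NOT NULL,
  temperature DECIMAL(5,2) NOT NULL,
  PRIMARY KEY (time)
) ENGINE=InnoDB;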
Even if you do not add a primary key to an InnoDB table in MySQL, MySQL adds a hidden clustered index to that table. If you do not define a primary key, MySQL locates the first UNIQUE index where all the key columns are NOT NULL and InnoDB uses it as the clustered index.
If the table has no primary key or suitable UNIQUE index, InnoDB internally generates a clustered index GEN_CLUST_INDEX on a synthetic column containing row ID values.
https://dev.mysql.com/doc/refman/8.0/en/innodb-index-types.html
The time would then become your primary key. It will help index that column so that you can query data based on, say, a date range. The PK is what ultimately makes your row unique, so in your example, the datetime is the PK.
I would include a surrogate/auto-increment key, especially if there is any possibility of duplicate time/temperature readings. You would have no other way to uniquely identify a duplicate row.
I ran into the same question on one of the tables I built.
The problem was that the PK was supposed to be composed of all the columns of the table, which is all well, but it means the table size will grow very fast with each row inserted.
I chose not to have a PK, and to only have an index on the column I do the lookups on.
When you replicate a database in MySQL, a table without a primary key may cause delays in replication.
http://lists.mysql.com/mysql/227217
The most common mistake when using ROW or MIXED is the failure to
verify that every table you want to replicate has a PRIMARY KEY on
it. This is a mistake because when a ROW event (such as the one
documented above) is sent to the slave and neither the master's copy
nor the slave's copy of the table has a PRIMARY KEY on the table,
there is no way to easily identify which unique row you want
replication to change.
According to your answer I would consider three options:
Put a PK on both columns; this way, for each time+temp combination there can be only one row. This solution allows multiple rows with the same temp or the same time - there just can't be two rows with the same temp AND time (see the sketch after this list).
Don't put a PK at all, but do put a unique index on both columns - one unique index containing both. This would allow NULLs in temp and time, but incurs more space to maintain the index.
These two options would be best for retrieval speed if you have heavy reads, but would result in a lower insert rate, as the indexes have to be updated as well.
Don't put any index at all, nor a PK. This would be best for inserts but very bad for searching; useful for logging where retrieval is done by another mechanism, or where the inserting device is not required to check for duplicates.
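A rough sketch of the first two options, again using the day_temperature example (table names and column types are assumptions):
-- Option 1: composite primary key on both columns.
CREATE TABLE day_temperature_pk (
  time        DATETIME NOT NULL,
  temperature DECIMAL(5,2) NOT NULL,
  PRIMARY KEY (time, temperature)
) ENGINE=InnoDB;

-- Option 2: no primary key, just a composite unique index (columns may stay nullable).
CREATE TABLE day_temperature_uq (
  time        DATETIME NULL,
  temperature DECIMAL(5,2) NULL,
  UNIQUE KEY uq_time_temp (time, temperature)
) ENGINE=InnoDB;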
Also, it is very important to consider cardinality here and think about the future consequences of using an auto-incremented number. If you're planning to do A LOT OF inserts then even an auto-incremented unsigned BIGINT would be a risk because it would eventually run out. In your example I guess you'll be saving data daily - for how long? This would be problematic if you saved temp every minute... so I'll take this as an extreme example.
I guess it is best to think about what you need from the table. Are you doing "save-and-forget" for the entire year for the temp at every minute? Are you going to use this table frequently in real-time decision making in your business logic? I think it is best to segregate data necessary for real-time (OLTP) from long-term saved data that is required seldom and whose retrieval latency is allowed to be high (OLAP). It's even worth duplicating the data into two different tables: one heavily indexed that gets erased once in a while to control cardinality, and a second that is actually saved on a magnetic disk with almost no indexes at all (it is possible to move a schema from your main filesystem onto another filesystem).
I've got a better example of a table that doesn't need a primary key - a joiner table. Say I have a table with something called "capabilities", and another table with something called "groups", and I want a joiner table that tells me all the capabilities that all the groups might have, so it's basically:
create table capability_group (
  capability_id varchar(32),
  group_id varchar(32)
);
There is no reason to have a primary key on that, because you never address a single row - you either want all the capabilities for a given group, or all the groups for a given capability. It would be better to have a unique constraint on (capability_id, group_id), and separate indexes on both fields.
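In SQL, that recommendation might look like this (the index names are invented):
create table capability_group (
  capability_id varchar(32) not null,
  group_id      varchar(32) not null,
  unique key uq_capability_group (capability_id, group_id),
  key idx_group (group_id)
);
Note that a separate index on capability_id alone is redundant here, because lookups by capability_id can already use the leftmost prefix of the unique key; only group_id needs its own index.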