MySQL performance with combined unique key

I am going to use a unique key for my messages table, which will be a combination of three columns: fromid, localid and timestamp.
My question is: will message insertion get slow, or will it perform well?

Compared to what?
Having a primary key or unique index always affects insertion time, because the new values must be compared against the values already in the index. In most environments the unique index fits in memory, so this amounts to a few comparison operations plus an insert -- nothing to worry about, and much cheaper than either network overhead or disk I/O.
If you have a really large table compared to available memory, then the operations could start to take more time.
If your application needs to enforce this unique index, then you should use it. You probably won't notice the additional overhead for enforcing uniqueness, unless you do very intense performance tests.
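For illustration, the table in question might look something like this; the column types and the surrogate id are assumptions, not taken from the question:

    CREATE TABLE messages (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        fromid      BIGINT UNSIGNED NOT NULL,
        localid     BIGINT UNSIGNED NOT NULL,
        `timestamp` DATETIME NOT NULL,
        body        TEXT,
        PRIMARY KEY (id),
        -- the composite unique key from the question; checked on every INSERT
        UNIQUE KEY uq_from_local_ts (fromid, localid, `timestamp`)
    ) ENGINE=InnoDB;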

Related

Does having auto_increment ID degrade performance compared to no auto_increment?

I have a big table with an ID column defined as BIGINT AUTO_INCREMENT. At some point in its history we decided to generate IDs on the client side as a pseudo-random number.
My question is simple: given that the column is still defined as AUTO_INCREMENT, does this hurt performance even though we ALWAYS supply the ID in INSERT queries? Is this significant -- should I bother to change the schema? The table is EXTREMELY huge and busy, and I am afraid I cannot do this without downtime.
Trust the database to give you the "best" in most situations. There is a lot of code (at least in InnoDB) to make AUTO_INCREMENT fast and efficient, even when multiple connections are inserting into the same table.
Notes on AUTO_INCREMENT and home-grown IDs:
AUTO_INCREMENT probably suffers from fewer locking delays and deadlocks.
I would not focus on the primary key for optimization; look elsewhere.
Supplying your own ID means you have to send it to the server; this increases (by a tiny amount) the bandwidth and latency.
SELECT LAST_INSERT_ID() is not costly, and is isolated to the connection; that is, a class of race conditions is avoided (see the sketch after these notes).
If you are supplying a number to the auto_inc column, you are defeating many of its optimizations.
Do you have a second column with your generated id? That is, do you have both a PRIMARY KEY and UNIQUE key? If so, that is overhead, especially for INSERTs, which must deal with all unique keys.
"some pseudo-random number" -- How do you deal with occasional duplicates?
What do you mean by "extremely busy"? If that is mostly bumping a counter of social media "Likes", then moving that column to a parallel table ('vertical' partitioning) is likely to provide a significant overall optimization.
Please provide SHOW CREATE TABLE and the main queries (read and write) against the table. This will help focus this Q&A to better help you.
To change the PRIMARY KEY requires downtime. (I think that is the case even with MySQL 8.0's new ALTER optimizations.)
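As a minimal sketch of the per-connection behavior mentioned above (the table name and columns are hypothetical):

    CREATE TABLE events (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        payload VARCHAR(255) NOT NULL,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;

    INSERT INTO events (payload) VALUES ('hello');
    -- LAST_INSERT_ID() is scoped to this connection: it returns the id
    -- generated by this session's last insert, so concurrent sessions
    -- cannot race on it.
    SELECT LAST_INSERT_ID();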
Technically there is still some performance impact from using an auto-increment column even though you supply your own values: the engine must still update the auto-increment counter in the background so that a value can be generated whenever an INSERT does not supply one.
If you are using InnoDB, there is also a lock taken on insert to prevent duplicate auto-increment values. This behavior is controlled by innodb_autoinc_lock_mode, which you can relax if you want to be pedantic about performance (see the sketch below).
With all that said, this should not have any measurable effect on real-world performance, as the auto-increment counter is a variable the engine can access quickly.
If you haven't had any issues up to this point, just leave it as is; the performance will not degrade as records are added.
The official information regarding this topic is very limited, unfortunately.
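For reference, the lock behavior discussed above is governed by the innodb_autoinc_lock_mode server variable. It is read-only at runtime, so changing it means editing the server configuration and restarting, but you can inspect it like this:

    -- 0 = traditional, 1 = consecutive, 2 = interleaved (no AUTO-INC
    -- table lock, but requires row-based binary logging to be safe)
    SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode';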

MySQL: What is the impact of varchar length on performance when used as a primary key?

What would be the performance penalty of using strings as primary keys instead of bigints etc.? String comparison is much more expensive than integer comparison, but on the other hand I can imagine that internally a DBMS will compute hash keys to reduce the penalty.
An application that I work on uses strings as primary keys in several tables (MySQL). It is not trivial to change this, and I'd like to know what can be gained performance-wise to justify the work.
on the other hand I can imagine that internally a DBMS will compute hash keys to reduce the penalty.
The DB needs to maintain a B-Tree (or a similar structure) with the keys kept in order.
If the key were hashed and the hash stored in the B-Tree, that would be fine for rapidly checking the uniqueness of the key -- the key could still be looked up efficiently. But you could not search efficiently for ranges of data (e.g. with LIKE), because the B-Tree would no longer be ordered by the string value.
So I think most databases really do store the string in the B-Tree, which can (1) take more space than numeric values and (2) require the B-Tree to be rebalanced when keys are inserted in arbitrary order (there is no notion of ever-increasing values as with a numeric PK).
The penalty in practice can range from insignificant to huge. It all depends on the usage: the number of rows, the average size of the string key, the queries that join on it, etc.
In our product we use varchar(32) for primary keys (GUIDs) and we haven't run into performance issues with this. Our product is a web site under extreme load, and it is critical that it stays stable.
We use SQL Server 2005.
Edit: In our biggest tables we have more than 3,000,000 records, with lots of inserts and selects on them. I think in general the benefit of migrating to an int key would be very low, but the cost of the migration very high.
One thing to watch out for is page splits (I know this can happen in SQL Server - probably the same in MySQL).
Primary keys are physically ordered (they form the clustered index). By using an auto-increment integer you guarantee that each insert gets the next number up, so the database never needs to reorder the keys. If you use strings, however, the PK you insert may need to be placed in the middle of the existing keys to maintain the order, and that reordering on insert can get expensive.
It depends on several factors: the RDBMS, the number of indexes involving those columns. But in general it will be more efficient to use ints, followed by bigints.
Any performance gains depend on usage, so without concrete examples of table schema and query workload it is hard to say.
Unless it makes sense in the domain (I'm thinking of something naturally unique, like a social security number), a surrogate integer key is a good choice; referring objects do not need their FK references updated when the referenced object changes. A sketch of the two approaches follows.
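A hypothetical side-by-side of the two designs; table and column names are made up for illustration:

    -- natural string key: wider index entries, and every foreign key
    -- elsewhere must store (and compare) the string
    CREATE TABLE person_natural (
        ssn  CHAR(11) NOT NULL,
        name VARCHAR(100) NOT NULL,
        PRIMARY KEY (ssn)
    ) ENGINE=InnoDB;

    -- surrogate key: narrow, append-only PK; uniqueness of the natural
    -- key is still enforced by a secondary unique index
    CREATE TABLE person_surrogate (
        id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        ssn  CHAR(11) NOT NULL,
        name VARCHAR(100) NOT NULL,
        PRIMARY KEY (id),
        UNIQUE KEY uq_ssn (ssn)
    ) ENGINE=InnoDB;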

How does MySQL determine if an INSERT is unique?

I would like to know if there is an implicit SELECT being run prior to performing an INSERT on a table that has any column defined as UNIQUE. I cannot find anything about this in the documentation for INSERT.
I have asked some other questions that nobody seems to be able to answer - perhaps because I'm not properly explaining myself - that are related to the above question.
If I understand correctly, then I assume the following would be true:
CASE 1:
You have a table with 1 billion rows. Each row has a UUID column which is unique. If you perform an insert the server must do some kind of implicit SELECT COUNT(*) FROM table WHERE UUID = [new uuid] and determine if the count is 0 or 1. Correct?
CASE 2:
You have a table with 1 billion rows. Each row has a composite unique key consisting of a DATE and a UUID. If you perform an insert the server must do some kind of implicit SELECT COUNT(*) FROM table WHERE DATE = [date] AND UUID = [new uuid] and check if the count is 0 or 1. Yes?
I use the word implicit because at some point, somewhere in the process, the server MUST be checking the value. If not, it would require the laws of physics to dictate that two identical rows cannot exist -- and as far as I'm informed, physics doesn't play a big role in the uniqueness of numbers written down somewhere, in binary, on a magnetic disk in a computer.
Let's assume your 1 billion rows are equally and sequentially distributed across 2,000 different dates. Would this not mean that case 2 would perform the insert faster because it can look up the UUIDs segmented into dates? If not, then would it be better to use case 1 for insert speed - and in that case, why?
This question is theoretical, so don't bother with considering regular SELECT performance in this case. The primary key wouldn't be the UUID+DATE index.
As a response to comments: The UUID in my case is designed solely for the purpose of avoiding duplicate entries because of bad connections. Since you cannot make the same entry for a different date twice (without it logically being a new entry), the UUID does not need to be globally unique - it needs only be unique for each date. This is why I can permit it being part of a composite key.
There are a few flaws and misconceptions in the previous answers; rather than point them out, I will start from scratch.
Referring to InnoDB only...
An INDEX (including UNIQUE and PRIMARY KEY) is a BTree. BTrees are very efficient at locating one row based on the key the BTree is sorted on. (They are also efficient at scanning in key order.) The "fan out" of a typical BTree in MySQL is on the order of 100. So, for a million rows, the BTree is about 3 levels deep (log100(million)); for a trillion rows, it is only about twice as deep. So, even if nothing is cached, it takes only 3 disk hits to locate one particular row in a million-row index.
I am being loose here with "index" versus "table" because they are essentially the same (in InnoDB, at least). Both are BTrees. What differs is what is in the leaf nodes: the leaf nodes of the table's BTree have all the columns. (I am ignoring the off-block storage for TEXT/BLOB in InnoDB.) An INDEX (other than the PRIMARY KEY) has a copy of the PRIMARY KEY in each leaf node. This is how a secondary key gets from the INDEX BTree to the rest of the row's columns, and how InnoDB avoids storing multiple copies of all the columns.
The PRIMARY KEY is "clustered" with the data. That is one BTree contains both all the columns of all the rows, and it is ordered according to the PRIMARY KEY specification.
Locating a record by PRIMARY KEY is one BTree search. Locating a record by a SECONDARY KEY is two BTree searches, one in the secondary INDEX's BTree which gives you the PRIMARY KEY; then a second one to drill down the data/PK BTree.
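To make the one-search vs. two-search distinction concrete, here is a hypothetical table (names, types, and the literal values are assumptions):

    CREATE TABLE t (
        id   BIGINT UNSIGNED NOT NULL,
        uuid CHAR(36) NOT NULL,
        body TEXT,
        PRIMARY KEY (id),          -- clustered: leaf nodes hold the full row
        UNIQUE KEY uq_uuid (uuid)  -- secondary: leaf nodes hold (uuid, id)
    ) ENGINE=InnoDB;

    -- one BTree search: drills straight down the clustered index
    SELECT body FROM t WHERE id = 42;

    -- two BTree searches: uq_uuid yields the PRIMARY KEY value, then the
    -- clustered index is searched for that id to read the other columns
    SELECT body FROM t WHERE uuid = '11111111-2222-3333-4444-555555555555';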
PRIMARY KEY(UUID)... Since the UUID is very random, the "next" row you INSERT will land at a 'random' spot. If the table is much bigger than can be cached in the buffer_pool, the block the new row needs to go into is very likely not cached. This leads to a disk hit to pull the block into cache (the buffer pool), and eventually another disk hit to write it back to disk.
Since a PRIMARY KEY is a UNIQUE KEY, the uniqueness check happens as part of the same operation (there is no separate SELECT COUNT(*), etc.). The UNIQUEness is checked after the block is fetched and before deciding whether to give a "duplicate key" error or store the row. Also, if the block is "full", it will need to be 'split' to make room for the new row.
INDEX(UUID) or UNIQUE(UUID)... There is a BTree for that index. On INSERT, some randomly located block will need to be fetched, modified, possibly split, and written back to disk, very much like the PK discussion above. If you had UNIQUE(UUID), there would also be a check for UNIQUEness and possibly an error message. In either case, there is, now and/or later, disk I/O.
AUTO_INCREMENT PK... If the PRIMARY KEY is an auto_increment, then new records are added to the 'last' block in the data BTree. When it gets full (every 100 or so records) there is (logically) a block split and flush of the old block to disk. (Actually, the I/O is probably delayed and done in the background.)
PRIMARY KEY(id) + UNIQUE(UUID)... Two BTrees. On an INSERT, there is activity in both. This is likely to be worse than simply PRIMARY KEY(UUID). Add up the disk hits above to see what I mean.
"Disk hits" are the killer in huge tables, and especially with UUIDs. "Count the disk hits" to get a feel for performance, especially when comparing two possible techniques.
Now for your secret sauce... PRIMARY KEY(date, UUID)... You are allowing the same UUID to show up on two different days. This can help! Back to how a PK works and checking for UNIQUEness... The "compound" index (date, UUID) is checked for UNIQUEness as the record is inserted. The records are sorted by date+UUID, so all of today's records are clumped together. IF (and this might be a big IF) one day's data fits in the buffer pool (but the entire table does not), then this is what is happening every morning... INSERTs are suddenly adding new records to the "end" of the table because of the new "date". These inserts are occurring randomly within the new date. Blocks in the buffer_pool are being pushed out to disk to make room for the new blocks. But, nicely, what you see is smooth, fast, INSERTs. This is unlike what you saw with PRIMARY KEY(UUID), when many rows had to wait for a disk read before UNIQUEness could be checked. All of today's blocks stay cached, and you don't have to wait for I/O.
But, if you ever get so big that you cannot fit one day's data in the buffer pool, things will start slowing down, first at the end of the day, then it will creep earlier and earlier as the frequency of INSERTs increases.
By the way, PARTITION BY RANGE(date), together with PRIMARY KEY(uuid, date) has somewhat similar characteristics. (Yes I deliberately flipped the PK columns.)
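A sketch of that partitioned variant, with assumed types. Note that MySQL requires every unique key on a partitioned table to include all columns of the partitioning expression, which PRIMARY KEY(uuid, d) satisfies:

    CREATE TABLE entries (
        uuid    CHAR(36) NOT NULL,
        d       DATE NOT NULL,
        payload TEXT,
        PRIMARY KEY (uuid, d)      -- uuid first, as suggested
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(d)) (
        PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
        PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );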
When inserting large amounts of data in a table, keep in mind that the data ends up being physically stored on disk somewhere. To read and write rows, MySQL's InnoDB (like some other RDBMSs) uses a clustered index: the primary key (or, failing that, the first UNIQUE NOT NULL index) becomes the clustered index key. This means that on disk, the data is physically stored in the same order as the values in the key column(s).
By utilising the clustered index, the database engine can quickly determine whether a value already exists, without scanning the whole table. In theory, if a table contains N = 1,000,000 records, the engine needs on average roughly log2(N) = 20 operations to check whether a value exists, regardless of how many columns participate in the index. For secondary indexes, a B-tree or a hash table is typically used (search the web for these terms for a detailed explanation of how they work).
The conclusion of this article is wrong:
"... MySQL is unable to buffer enough data to guarantee a value is
unique and is therefore caused to perform a tremendous amount of
reading for each insert to guarantee uniqueness"
This is incorrect. Checking uniqueness does not really require any additional work, because the engine has to locate the insertion point for the new record anyway. What causes the slowdown is the use of UUIDs. Remember that UUIDs are randomly generated whenever a new record is inserted, so the new record needs to be inserted at a random physical position on the disk, and this causes existing data to be shifted around to accommodate it. If, on the other hand, the index column is a monotonically increasing value (such as an auto-increment INT), new records are always inserted after the last record, and no existing data ever needs to be moved.
In your case, there won't be any performance difference between case 1 and case 2, but you will still run into trouble because of the randomness of the UUIDs. It would be much better to use an auto-incrementing value instead of the UUID. Also, since UUIDs are unique by nature anyway, it doesn't make much sense to enforce a UNIQUE constraint on them. Alternatively, if you really must use UUIDs, make sure your table has a primary key based on an auto-incrementing INT, so that new records are never inserted at random positions on disk (a sketch follows).
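A sketch of that suggested layout (names are hypothetical): the monotonic surrogate key keeps inserts append-only, while the UUID lives in a secondary index:

    CREATE TABLE records (
        id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        uuid CHAR(36) NOT NULL,
        PRIMARY KEY (id),     -- monotonic: new rows go at the end
        KEY idx_uuid (uuid)   -- lookup index; not UNIQUE, per the advice above
    ) ENGINE=InnoDB;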
This is the very purpose of a UNIQUE constraint:
A UNIQUE index creates a constraint such that all values in the index must be distinct. An error occurs if you try to add a new row [or update an existing row] with a key value that matches [another] existing row.
Earlier in the same manual page, it is stated that
A column list of the form (col1,col2,...) creates a multiple-column index. Index key values are formed by concatenating the values of the given columns.
How this constraint is implemented is not documented, but it must somehow equate to a preliminary SELECT with the values to be inserted/updated. The cost of such a check is often negligible, because, by definition, the fields are indexed (this overhead becomes relevant when dealing with bulk inserts).
The number of columns covered by the index is not meaningful in terms of performance (for example, compared to the number of rows in the table). It does impact the disk space occupied by the index, but this should really not matter in your design decisions.
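To see the constraint in action, a minimal hypothetical example:

    CREATE TABLE t2 (
        d    DATE NOT NULL,
        uuid CHAR(36) NOT NULL,
        UNIQUE KEY uq (d, uuid)
    ) ENGINE=InnoDB;

    INSERT INTO t2 VALUES ('2024-01-01', '11111111-2222-3333-4444-555555555555');
    -- running the exact same INSERT again fails with ERROR 1062 (ER_DUP_ENTRY);
    -- no explicit SELECT is issued by the client -- the index check is implicit
    INSERT IGNORE INTO t2 VALUES ('2024-01-01', '11111111-2222-3333-4444-555555555555');
    -- INSERT IGNORE demotes the duplicate-key error to a warning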

Better to insert at end of InnoDB primary key, or scattered throughout?

What are the performance characteristics of inserting many small records from many clients into an InnoDB table, where the inserts all happen at the end of the primary key (e.g. with UUIDs where the leading digits are based on a timestamp) vs. scattered throughout the primary key (e.g. with UUIDs where the leading digits aren't based on a timestamp)? Is one preferable to the other?
Appending keys to the end of an index is preferred because the index does not need to be reordered.
When inserting rows in the middle of the primary key index, since actual table data is stored on the same page as the primary keys in InnoDB, page data must be reordered (and relocated if a page fills up). MySQL does leave room for growth in each page, but some reordering and relocating is inevitable.
Page size in InnoDB is 16K, so if inserted rows are small, the effect is less.
Appending rows to the end of an index also requires fewer locks, though there may be more contention there. Try to insert multiple rows in the same statement.
Appending also causes less fragmentation on disk, so sequential pages stay closer together. Disk fragmentation doesn't matter much, however, unless you are querying large amounts of sequential rows or performing table scanning instead of using indexes.
I wouldn't create an incremental surrogate primary key just so you can insert rows in order, unless your writes (inserts) outnumber your reads (or perhaps your rows are large and you are experiencing performance issues). If your reads outnumber your writes, being able to use a natural primary key can be a huge performance benefit.
Appending is more performant, but you should choose your method by considering all factors.
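One way to get timestamp-leading UUIDs in MySQL 8.0, if that fits your design, is the swap flag of UUID_TO_BIN(); a sketch with an assumed table:

    CREATE TABLE msgs (
        id      BINARY(16) NOT NULL,
        payload TEXT,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;

    -- UUID() produces a version-1 (time-based) UUID; UUID_TO_BIN(..., 1)
    -- swaps the time fields to the front, so successive ids are roughly
    -- ascending and inserts land near the end of the clustered index.
    INSERT INTO msgs (id, payload) VALUES (UUID_TO_BIN(UUID(), 1), 'hello');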

MySQL: Unique index performance characteristics for large datasets?

What are the performance characteristics of unique indexes in MySQL, and of indexes in general (like the primary key index)?
Given that I insert or update a record in my database: will the speed of updating the record (i.e. building/updating the indexes) differ between a table with 10 thousand records and one with 100 million records? Put differently, does the index update time after changing one row depend on the total index size?
Does this also apply to other indexes in MySQL, like the primary key index?
Most indexes in MySQL are really the same internally -- they're B-tree data structures. As such, updating a B-tree index is an O(log n) operation. So it does cost more as the number of entries in the index increases, but not badly.
In general, the benefit you get from an index far outweighs the cost of updating it.
A typical MySQL index is implemented as a set of sorted values (I'm not sure whether any storage engine uses a different strategy, but I believe this holds for the popular ones) -- therefore, updating the index inevitably takes longer as it grows. However, the slow-down need not be all that bad: locating a key in a sorted index of N keys is O(log N), and it is possible (though not trivial) to make the update O(1) (at least in the amortized sense) after the key has been found. So, if you square the number of records as in your example (10 thousand squared is 100 million), and you pick a storage engine with a highly optimized implementation, you can reasonably hope the index update takes only about twice as long on the big table as on the small one, since log(N^2) = 2 log N.
Note that if new primary key values are always larger than the previous ones (e.g. an auto-increment integer field), new entries are simply appended at the right edge of the index, so it never needs to be reorganized.
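To see the logarithmic scaling for the question's numbers, assuming a B-tree fan-out of about 100 (the figure is illustrative only):

    -- LOG(B, X) returns the base-B logarithm of X: roughly the BTree depth
    -- at fan-out B. Squaring the row count merely doubles the depth.
    SELECT LOG(100, 10000)     AS depth_10k,   -- = 2 levels
           LOG(100, 100000000) AS depth_100m;  -- = 4 levels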