I've read a couple of online articles about the performance of using UUIDs as primary keys in MySQL, and a common theme, whether they are for or against, is the idea that non-sequential data hurts index performance.
https://blog.codinghorror.com/primary-keys-ids-versus-guids/
The generated GUIDs should be partially sequential for best performance
https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/
Create function to rearrange UUID fields and use it (after showing how rearranging UUID can drastically improve performance)
However, I simply cannot understand how non-sequential data impacts indexes such as B-trees, hashes, and clustered indexes.
You can add my UUID blog to your list. (It applies to MySQL equally well.)
Note that the performance problem does not occur until the index (whether clustered or not; whether a BTree, hash, or other structure) is too big to be cached in RAM. At that point the "next" UUID you reach for (or try to insert) is unlikely to be in RAM, thereby necessitating I/O, which impacts performance.
In contrast, inserting rows keyed on a datetime, and doing so somewhat chronologically, will mostly be inserting into the same block of a BTree. This means that the "next" row is unlikely to need I/O.
I/O is the biggest factor in performance.
My blog points out how Type 1 UUIDs can be turned into something akin to a timestamp, thereby achieving the "locality of reference" that leads to less I/O, hence more speed. MySQL 8.0 has built-in functions that do the same thing as my Stored Functions. Still, you need Type 1 UUIDs and you need to call the function in order to decrease the I/O.
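For reference, here is a minimal sketch of those 8.0 built-ins (UUID_TO_BIN / BIN_TO_UUID with the swap flag; the table and column names are just illustrative):
-- UUID() yields a Type 1 UUID; the swap flag (second argument = 1) moves the
-- timestamp bytes to the front, so successive inserts land near each other in the PK.
CREATE TABLE t (
  id  BINARY(16) NOT NULL PRIMARY KEY,
  val VARCHAR(100)
);
INSERT INTO t (id, val) VALUES (UUID_TO_BIN(UUID(), 1), 'example');
-- Convert back to the familiar text form when reading:
SELECT BIN_TO_UUID(id, 1) AS id_text, val FROM t;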
Related
I am administering a MySQL db for a payments processing system. For various legacy reasons it was originally built using CHAR(14) for many of the primary keys, which store a sequential ID based on a prefix identifying the type of data followed by a base36 encoded string representing a large number in sequence, e.g.
'PA00003NFMWHMQ' translating to 'payment 286103946050'
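(For reference, the numeric part of such a key can be checked with a base-36 conversion right in MySQL, assuming the value fits in the 64-bit range that CONV works with; the two-character type prefix is stripped off first:)
SELECT CONV(SUBSTRING('PA00003NFMWHMQ', 3), 36, 10);  -- returns 286103946050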
The advantage here is a semi-unique key that is still sequential, disadvantage being large values used for both clustered and non-clustered indexes, slowing down joins and lookups and requiring extra memory/storage.
I'm considering migrating them all to integers prior to releasing an API although I like the uniqueness too. I'm also wary of premature optimisation.
I'm not looking for a definite answer here, only some experienced opinions.
Thanks!
My first thought is "are you going to have to hang onto this ID for backwards compatibility anyway?" IDs that have meaning, like yours, tend to get stored and referenced in external systems. Will you wind up with a table that has an integer primary key for internal use and a char(14) legacy id and two indexes? That may still be an improvement, but it impacts whether this change is worth it. Keep this in mind in the rest of my commentary.
If you can switch completely to auto-incremented integers and get rid of special ID generation code, that should certainly make things simpler and inserts faster. How much simpler and faster you need to determine. Is it just one extra function somewhere deep in the creation code that doesn't bother anybody? Or does it impact the code and design all over the place?
...disadvantage being large values used for both clustered and non-clustered indexes, slowing down joins and lookups and requiring extra memory/storage.
As with any performance claim, the first thing would be to investigate if they're true. Is a char(14) key really slowing down joins and consuming memory and storage?
A char(14) (14 bytes) isn't much larger than an integer (4 bytes). The extra 10 bytes per row is just 10 MB per million records. But that's just to store the key. Every reference adds another 10 bytes. And every index featuring it another 10+ bytes. Still, I wouldn't assume this is a major storage and memory issue without measuring it.
Disk and memory are generally much cheaper than developer time. This doesn't mean to be wasteful, but consider whether saving a few gigs is worth however long this is likely to take (and the testing). Or if you can buy a bigger disk and more memory instead. For example, I have one project that could benefit from using enum fields instead of strings. But I haven't bothered because that would mean more developer time to make the change, and also to maintain the enum field. Instead it's still cheaper to pay for extra disk. That may change, and when it does I'll reconsider.
Similarly with joins. If the keys are indexed, they should perform well regardless of whether they're char or int. But you need to test.
I'd suggest you make a sanitized copy of the database, or generate one of decent size, perhaps using your test factories, and run some performance tests with char(14) and with int. Be sure to test realistic workloads so you know whether this change will have a real impact on performance. Just running bare SQL queries may give you an outsized impression of their effect. Also call the real functions you'd use in production; their cost might be swamping any SQL impact.
'PA00003NFMWHMQ' translating to 'payment 286103946050'
I'm considering migrating them all to integers prior to releasing an API
Exposing primary keys (or any other piece of implementation information) to the outside world has security and compatibility considerations. It's knowledge an attacker can use; for example, they can predict what the next key will be. Don't do it.
Instead, assign each thing you're exposing a random API ID, like a UUIDv4 (don't use MySQL's UUID() function; it generates guessable UUIDv1 values). Store them as BINARY(16) if space is a big concern.
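A minimal sketch of that split (the names are made up, and the UUIDv4 would come from your application or framework, not from MySQL's UUID(); UUID_TO_BIN is 8.0+, and UNHEX(REPLACE(uuid, '-', '')) does the same packing on older versions):
CREATE TABLE payment (
  id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- internal key, never exposed
  api_id BINARY(16) NOT NULL,                                  -- random UUIDv4 shown to API clients
  amount DECIMAL(12,2) NOT NULL,
  UNIQUE KEY (api_id)
);
-- The application supplies the v4 UUID:
INSERT INTO payment (api_id, amount)
VALUES (UUID_TO_BIN('f47ac10b-58cc-4372-a567-0e02b2c3d479'), 19.99);
-- API lookups use the opaque id, never the internal integer:
SELECT id, amount FROM payment
WHERE api_id = UUID_TO_BIN('f47ac10b-58cc-4372-a567-0e02b2c3d479');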
Then it doesn't matter what your primary key is. You can change your design when you like.
The advantage here is a semi-unique key that is still sequential...
This is a puzzler. Primary keys must be unique, so I'm not sure what you mean by "semi-unique". Do you mean across tables? That the ID of a row in table A is probably distinct from the ID of a row in table B? If that's the case, consider UUID primary keys. Or consider whether this is truly an advantage you can actually use, given the "semi" part.
We are in the process of migrating from MySQL to PGSQL and we have a 100 million row table.
When I was trying to ascertain how much space both systems use, I found much less difference for tables, but found huge differences for indexes.
MySQL indexes were occupying more space than the table data itself, while Postgres was using considerably less.
When digging for the reason, I found that MySQL uses B+trees to store the indexes and Postgres uses B-trees.
MySQL's usage of indexes is a little different: it stores the data along with the (clustered) index, which accounts for the increased size, but Postgres doesn't.
Now the questions:
Comparing B-trees and B+trees in database terms, is it better to use B+trees since they are better for range queries, costing O(log N) for the lookup plus O(m) for a range of m entries?
In B-trees the lookup is logarithmic, but for range queries the cost shoots up towards O(N) since there is no underlying linked-list structure over the data nodes. With that said, why does Postgres use B-trees? Does it perform well for range queries? (It does, but how does it handle them internally with B-trees?)
The above question is from a Postgres point of view; from a MySQL perspective, why does it use more storage than Postgres, and what is the performance benefit of using B+trees in reality?
I could have missed/misunderstood many things, so please feel free to correct my understanding here.
Edit to answer Rick James's questions
I am using InnoDB engine for MySQL
I built the index after populating the data - same way I did in postgres
The indexes are not UNIQUE indexes, just normal indexes
There were no random inserts; I used CSV loading in both Postgres and MySQL, and only after this did I create the indexes.
The Postgres block size for both indexes and data is 8KB. I am not sure for MySQL, but I didn't change it, so it must be the default.
I would not call the rows big: they have around 4 text fields about 200 characters long, 4 decimal fields, and 2 bigint fields 19 digits long.
The PK is a bigint column with 19 digits; I am not sure if this counts as bulky. On what scale should we differentiate bulky vs non-bulky?
The MySQL table size was 600 MB and Postgres was around 310 MB, both including indexes - this amounts to about 48% bigger, if my math is right. But is there a way I can measure the index size alone in MySQL, excluding the table size? That could lead to better numbers, I guess.
Machine info: I had enough RAM (256 GB) to fit all the tables and indexes together, but I don't think we need to go down this route at all; I didn't see any noticeable performance difference between them.
Additional Questions
When do we say fragmentation occurs? Is there a way to do de-fragmentation so that we can say that, beyond this point, there is nothing more to be done? I am using CentOS, by the way.
Is there a way to measure the index size alone in MySQL, ignoring the primary key since it is clustered, so that we can actually see which type is occupying more space, if any?
First, and foremost, if you are not using InnoDB, close this question, rebuild with InnoDB, then see if you need to re-open the question. MyISAM is not preferred and should not be discussed.
How did you build the indexes in MySQL? There are several ways to explicitly or implicitly build indexes; they lead to better or worse packing.
MySQL: Data and Indexes are stored in B+Trees composed of 16KB blocks.
MySQL: UNIQUE indexes (including the PRIMARY KEY) must be updated as you insert rows. So, a UNIQUE index will necessarily have a lot of block splits, etc.
MySQL: The PRIMARY KEY is clustered with the data, so it effectively takes zero space. If you load the data in PK order, then the block fragmentation is minimal.
Non-UNIQUE secondary keys may be built on the fly, which leads to some fragmentation. Or they can be constructed after the table is loaded; this leads to denser packing.
Secondary keys (UNIQUE or not) implicitly include the PRIMARY KEY in them. If the PK is "large" then the secondary keys are bulky. What is your PK? Is this the 'answer'?
In theory, totally random inserts into a BTree lead to the blocks being about 69% full. Maybe this is the answer. Is MySQL about 45% bigger (1/0.69 ≈ 1.45)?
With 100M rows, probably many operations are I/O-bound because you don't have enough RAM to cache all the data and/or index blocks needed. If everything is cached, then B-Tree versus B+Tree won't make much difference. Let's analyze what needs to happen for a range query when things are not fully cached.
With either type of Tree, the operation starts with a drill-down in the Tree. For MySQL, 100M rows will have a B+Tree of about 4 levels deep. The 3 non-leaf nodes (again 16KB blocks) will be cached (if they weren't already) and be reused. Even for Postgres, this caching probably occurs. (I don't know Postgres.) Then the range scan starts. With MySQL it walks through the rest of the block. (Rule of Thumb: 100 rows in a block.) Ditto for Postgres?
At the end of the block something different has to happen. For MySQL, there is a link to the next block. That block (with 100 more rows) is fetched from disk (if not cached). For a B-Tree the non-leaf nodes need to be traversed again. 2, probably 3, levels are still cached. I would expect another non-leaf node to need fetching from disk only about once per 10K rows (10K = 100*100). That is, Postgres might hit the disk 1% more often than MySQL, even on a "cold" system.
On the other hand, if the rows are so fat that only 1 or 2 can fit in a 16K block, the "100" I kept using is more like "2", and the 1% becomes maybe 50%. That is, if you have big rows this could be the "answer". Is it?
What is the block size in Postgres? Note that many of the computations above depend on the relative size between the block and the data. Could this be an answer?
Conclusion: I've given you 4 possible answers. Would you like to augment the question to confirm or refute that each of these apply? (Existence of secondary indexes, large PK, inefficient building of secondary indexes, large rows, block size, ...)
Addenda about PRIMARY KEY
For InnoDB, another thing to note... It is best to have a PRIMARY KEY in the definition of the table before loading the data. It is also best to sort the data in PK order before LOAD DATA. Without specifying any PRIMARY KEY or UNIQUE key, InnoDB builds a hidden 6-byte PK; this is usually sub-optimal.
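For example, something along these lines (the table, column, and file names are only illustrative, and the CSV is assumed to be pre-sorted by the PK):
-- Define the PRIMARY KEY before loading, load in PK order,
-- then add secondary indexes so they are built densely.
CREATE TABLE big_table (
  id   BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  col1 VARCHAR(200),
  col2 DECIMAL(15,4)
);
LOAD DATA LOCAL INFILE 'big_table_sorted.csv'
  INTO TABLE big_table
  FIELDS TERMINATED BY ',';
ALTER TABLE big_table ADD INDEX idx_col1 (col1);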
In databases you often have queries that return ranges of data, such as IDs from 100 to 200.
In this case:
A B-tree needs to follow the path from the root to the leaves for every single entry to get the data pointer.
A B+-tree can 'walk' through the leaves, and has to follow the path from the root down to the leaves only for the first entry (i.e. for ID 100).
This is because B+-trees store the data (or data pointers) only in the leaves, and the leaves are linked so that you can perform a rapid in-order traversal.
B+-Tree
Another point is:
In B+-trees the inner nodes store only pointers to other nodes, without any data pointers, so you have more space for pointers, you can store more node pointers per memory page, and you need fewer I/O operations.
So for range queries B+-trees are the optimal data structure. For single selections B-trees might be better (because of the depth/size of the tree), since the data pointers are also located inside the tree.
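As a concrete (illustrative) example of the kind of range query in question:
SELECT * FROM orders WHERE id BETWEEN 100 AND 200;
-- In a B+-tree the engine drills down once to the leaf holding id 100, then
-- follows the leaf-to-leaf links; in a plain B-tree it may have to revisit
-- inner nodes while walking the range.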
MySQL and PostgreSQL aren't really comparable here. InnoDB uses an index to store table data (and secondary indexes just point at the primary key). This is great for single-row primary-key lookups, and with B+ trees it does OK with range queries on the primary-key field, but it has performance drawbacks for everything else.
PostgreSQL uses heap tables and stores indexes separately. It supports a number of different indexing algorithms. Depending on your range query, a btree index may not help you and you may need a GiST index instead. Similarly, GIN indexes work well for membership lookups (for arrays, full-text search, etc.).
I think btree is used because it excels at the simple use case: which rows contain the following data? This becomes a building block of GIN, for example.
But it isn't true that PostgreSQL cannot use B+ trees. GiST is built on B+ Tree indexes in a generalized format. So PostgreSQL gives you the option to use B+ trees where they come in handy.
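For illustration, here are sketches of the index kinds mentioned above (the table and column names are assumptions):
-- Plain btree index: good for equality and simple range predicates.
CREATE INDEX docs_created_idx ON docs (created_at);
-- GIN index for membership-style lookups, e.g. full-text search:
CREATE INDEX docs_fts_idx ON docs USING GIN (to_tsvector('english', body));
-- GiST index (assumes region is a geometric or range type with a GiST opclass):
CREATE INDEX docs_region_idx ON docs USING GIST (region);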
I was reading an ebook chapter about indexes and indexing strategies. Many of these aspects I already know, but I got stuck on clustered indexes in InnoDB. Here is the quote:
Clustering gives the largest improvement for I/O-bound workloads. If
the data fits in memory the order in which it’s accessed doesn’t
really matter, so clustering doesn’t give much benefit.
I believe this is true, but how am I supposed to guess whether the data will fit in memory? How does the database decide when to process the data in memory, and when not?
Let's say we have a table Employee with columns ID, Name, and Phone, filled with 100,000 records.
If, for example, I put the clustered index on the ID column and perform this query
SELECT * FROM Employee;
How do I know if this will benefit from the clustered index?
It's somewhat related to this thread:
Difference between In memory databases and disk memory database
but I am still not sure how the database will behave.
Your example might be 20MB.
"In memory" really means "in the InnoDB buffer_pool", whose size is controlled by innodb_buffer_pool_size, which should be set to about 70% of available RAM.
If your query hits the disk instead of finding everything cached in the buffer_pool, it will run (this is just a Rule of Thumb) 10 times as slow.
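For example (just a sketch; the right number depends on your RAM and what else runs on the machine):
-- my.cnf, [mysqld] section, on a dedicated 32GB server:
--   innodb_buffer_pool_size = 22G
-- Check or adjust at runtime (online resizing works in 5.7+):
SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;
SET GLOBAL innodb_buffer_pool_size = 22 * 1024 * 1024 * 1024;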
What you are saying on "clustered index" is misleading. Let me turn things around...
InnoDB really needs a PRIMARY KEY.
A PK is (by definition in MySQL) UNIQUE.
There can be only one PK for a table.
The PK can be a "natural" key composed of one (or more) columns that 'naturally' work.
If you don't have a "natural" choice, then use id INT UNSIGNED NOT NULL AUTO_INCREMENT.
The PK and the data are stored in the same BTree. (Actually a B+Tree.) This leads to "the PK is clustered with the data".
The real question is not whether something is clustered, but whether it is cached in RAM. (Remember the 10x RoT.)
If the table is small, it will stay in cache (once all its blocks are touched), hence avoid disk hits.
If some subset of a huge table is "hot", it will tend to stay in cache.
If you must access a huge table "randomly", you will suffer a slowdown due to lots of disk hits. (This happens when using UUIDs as PRIMARY KEY or other type of INDEX.)
How does the database decide when to process the data in memory, and when not?
That's 'wrong', too. All processing is in memory. On a block-by-block basis, pieces of the tables and indexes are moved into / out of the buffer_pool. A block (in InnoDB) is 16KB. And the buffer_pool is a "cache" of such blocks.
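You can get a rough idea of how often block fetches miss the buffer_pool from the status counters (crude, but indicative):
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- Innodb_buffer_pool_read_requests = logical (in-memory) block reads
-- Innodb_buffer_pool_reads         = reads that had to go to disk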
SELECT * FROM Employee;
is simple, but costly. It operates thus:
"Open" table Employee (if not already open -- a different 'cache' handles this).
Go to the start of the table. This involves drilling down the left side of the PK's BTree to the first leaf node (block). And fetch it into the buffer_pool if not already cached.
Read a row -- this will be in that leaf node.
Read next row -- this is probably in the same block. If not, get the 'next' block (read from disk if necessary).
Repeat step 4 until finished with the table.
Things get more interesting if you have a WHERE clause. And then it depends on whether the PK or some other INDEX is involved.
Etc, etc.
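A quick (illustrative) way to see which path a given query takes is EXPLAIN:
EXPLAIN SELECT * FROM Employee WHERE ID = 12345;      -- PK lookup: one drill-down
EXPLAIN SELECT * FROM Employee WHERE Name = 'Smith';  -- full scan unless Name is indexed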
I was optimizing a 3 GB table as a MEMORY table in order to do some analysis on it, and I was curious if adding indexes even help a MEMORY table. Since the data is all in memory anyway, is this just redundant?
No, they're not redundant.
Yes, continue to use indexes.
On smaller tables, access to a MEMORY table via a non-indexed column may seem almost as fast as via an indexed one, because full table scans in memory are so fast; but as the table grows, or as you join tables to build larger result sets, there will be a difference.
Regardless of the storage method the engine uses (disk/memory), proper indexes will improve performance as long as the storage engine supports them. How the indexes are implemented may vary, but I know they are implemented in the MEMORY, InnoDB, and MyISAM table types. BTW: the default index type for MEMORY tables is HASH instead of BTREE.
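To make that explicit (a sketch; the table and column names are made up), you can pick the index type per index on a MEMORY table:
CREATE TABLE lookup_mem (
  id   INT NOT NULL,
  code CHAR(10) NOT NULL,
  KEY idx_hash  (id)   USING HASH,   -- MEMORY default: good for = lookups
  KEY idx_btree (code) USING BTREE   -- better for ranges / prefix matches
) ENGINE = MEMORY;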
Also, I generally don't recommend coding to your storage engine. What's a MEMORY table today may need to be changed to InnoDB tomorrow; the SQL and schema should stand on their own.
No. Indexing has little to do with raw data-access speed; an index reorganizes data in order to optimize specific queries.
For example, if you add a balanced binary tree index to a one-million-row column, you will be able to find the item you want in about 20 read operations, instead of an average of half a million.
So placing that million rows in memory, which is (say) 100x faster than disk, will speed up a brute-force search by 100x. Adding the index will further improve the speed by a factor of about twenty-five thousand (500,000 / 20) by allowing the DB to perform a smarter search instead of a merely faster one.
Things are more complicated than this, because other factors come into play, and you rarely get such a large benefit from an index. Smarter searches are also slower on a one-by-one basis: those 20 index seeks cost much more than 20 brute-force seeks. Then there's index maintenance, etc.
But my suggestion is to keep the data in memory if you can -- and index them.
Sequential keys allow one to use a clustered index. How material is that benefit? How much is lost if, say, 1% of the keys are out of sequential order by one or two ranks?
Thanks,
JDelage
Short:
A clustered index, in general, can be used on anything that is sortable. Sequentiality (no gaps) is not required - your records will be maintained in order by the usual index-maintenance mechanisms (the only difference is that with a clustered index the leaves are big, because they hold the data too).
Long:
Good clustering can give you orders of magnitude improvements.
Basically with good clustering you will be reading data very efficiently on any spinning media.
Whether the clustering is good should be evaluated by examining the most common queries (those that actually read data and cannot be answered by indexes alone).
So, for example, if you have a composite natural key as the primary key on which the table is clustered, AND you always access the data according to a leading subset of that key, then simple sequential disk reads will answer your query in the most efficient way.
However, if the most common way to access this data is not according to the natural key (for example, the application spends 95% of its time looking for the last 5 records within a group AND the date of update is not part of the clustered index), then you will not be doing sequential reads, and your choice of clustered index might not be the best.
So, all this is at the level of physical implementation - this is where things depend on the usage.
Note:
Not so relevant today, but going forward I would expect most DBs to run off SSDs, where access times keep getting better; and since random reads are similar in speed to sequential reads on SSDs, the importance of clustered indexes will diminish.
You need to understand the purpose of the clustered-index.
It may be helpful in some cases to speed up inserts, but mostly we use clustered indexes to make queries faster.
Consider the case where you want to read a range of keys from a table - this is very common - it's called a range scan.
Range scans on a clustered index are massively better than range scans on a secondary index (when not using a covering index). This is the main case for using clustered indexes. It typically saves one I/O operation per row in your result, which can be the difference between a query needing, say, 10 I/O operations and 1000.
It really is amazing, particularly if you have no blobs and lots of records per page.
If you have no SPECIFIC performance problem that you need to fix, don't worry about it.
But do also remember that it is possible to make a composite primary key, and that your "unique ID" need not be the whole primary key. A common (very good) technique is to add something you want to range-scan as the FIRST part of the PK, and a unique (meaningless) ID afterwards.
So consider the case where you want to scan your table by time - you can make the time the first part of the PK (it is not going to be unique, so it's not enough on its own), and a unique ID the second.
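A sketch of that pattern (the table and column names are hypothetical):
CREATE TABLE readings (
  created DATETIME NOT NULL,
  id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  payload VARCHAR(255),
  PRIMARY KEY (created, id),  -- clustered: rows are stored in time order
  KEY (id)                    -- satisfies InnoDB's rule that the auto column must lead some index
) ENGINE = InnoDB;
-- Time-range scans now read consecutive blocks of the clustered index:
SELECT * FROM readings WHERE created >= NOW() - INTERVAL 1 DAY;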
Do not, however, do premature optimisation. If your database fits in memory (say, 32GB), you don't care about I/O operations; it's never going to do any disk reads anyway.