I'm working on an application that will be implementing a hex value as a business key (in addition to an auto increment field as primary key) similar to the URL id seen in Gmail. I will be adding a unique constraint to the column and was originally thinking of storing the value as a bigint to get away from searching a varchar field but was wondering if that's necessary if the field is unique.
Internal joins would be done using the auto increment field and the hex value would be used in the where clause for filtering.
What sort of performance hit would there be in simply storing the value as a varchar(x), or perhaps a char(x) over the additional work in doing the conversion to and from hex to store the value as an integer in the database? Is it worth the additional complexity?
I did a quick test on a small number of rows (50k) and had similar search result times. If there is a large performance issue would it be linear, or exponential?
I'm using InnoDB as the engine.
Is your hex value a GUID? Although I used to worry about the performance of such long items as indexes, I have found that on modern databases the performance difference on even millions of records is fairly insignificant.
A potentially larger problem is the memory that the index consumes (16 byte vs 4 byte int, for example), but on servers that I control I can allocate for that. As long as the index can be in memory, I find that there is more overhead from other operations that the size of the index element doesn't make a noticeable difference.
On the upside, if you use a GUID you gain server independence for records created and more flexibility in merging data on multiple servers (which is something I care about, as our system aggregates data from child systems).
There is a graph on this article that seems to back up my suspicion: Myths, GUID vs Autoincrement
The hex value is generated from a UUID (Java's implementation); it's hashed and truncated to smaller length (likely 16 characters). The algorithm for which is still under discussion (currently SHA). An advantage I see of storing the value in hex vs integer is that if we needed to grow the size (which I don't see happening with this application at 16 char) we could simply increase the truncated length and leave the old values without fear of collision. Converting to integer values wouldn't work as nicely for that.
The reason for the truncation vs simply using a GUID/UUID is simply to make the URL's and API's (which is where these will be used) more friendly.
All else being equal, keeping the data smaller will make it run faster. Mostly because it'll take less space, so less disk i/o, less memory needed to hold the index, etc etc. 50k rows isn't enough to notice that though...
Related
I am administering a MySQL db for a payments processing system. For various legacy reasons it was originally built using CHAR(14) for many of the primary keys, which store a sequential ID based on a prefix identifying the type of data followed by a base36 encoded string representing a large number in sequence, e.g.
'PA00003NFMWHMQ' translating to 'payment 286103946050'
The advantage here is a semi-unique key that is still sequential, disadvantage being large values used for both clustered and non-clustered indexes, slowing down joins and lookups and requiring extra memory/storage.
I'm considering migrating them all to integers prior to releasing an API although I like the uniqueness too. I'm also wary of premature optimisation.
I'm not looking for a definite answer here, only some experienced opinions.
Thanks!
My first thought is "are you going to have to hang onto this ID for backwards compatibility anyway?" IDs that have meaning, like yours, tend to get stored and referenced in external systems. Will you wind up with a table that has an integer primary key for internal use and a char(14) legacy id and two indexes? That may still be an improvement, but it impacts whether this change is worth it. Keep this in mind in the rest of my commentary.
If you can switch completely to auto incremented integers and get rid of special ID generation code, that should certainly make things simpler and inserts faster. How much simpler and faster you need to determine. Is it just one extra function somewhere deep in the creation code that doesn't bother anybody? Or is does it impact the code and design all over the place?
...disadvantage being large values used for both clustered and non-clustered indexes, slowing down joins and lookups and requiring extra memory/storage.
As with any performance claim, the first thing would be to investigate if they're true. Is a char(14) key really slowing down joins and consuming memory and storage?
A char(14) (14 bytes) isn't much larger than an integer (4 bytes). The extra 10 bytes per row is just 10 MB per million records. But that's just to store the key. Every reference adds another 10 bytes. And every index featuring it another 10+ bytes. Still, I wouldn't assume this is a major storage and memory issue without measuring it.
Disk and memory are generally much cheaper than developer time. This doesn't mean to be wasteful, but consider whether saving a few gigs is worth however long this is likely to take (and the testing). Or if you can buy a bigger disk and more memory instead. For example, I have one project that could benefit from using enum fields instead of strings. But I haven't bothered because that would mean more developer time to make the change, and also to maintain the enum field. Instead it's still cheaper to pay for extra disk. That may change, and when it does I'll reconsider.
Similar with joins. If they're indexed, they should perform well regardless of whether it's a char or int. But you need testing.
I'd suggest you make a sanitized copy of the database, or generate one of decent size perhaps using your test factories, and run some performance tests with char(14) and with int. Be sure to test reality and whether this change will have a real impact on performance. Just running bare SQL queries may give you an outsized impression of their impact on performance. Also call the real functions you'd use in production, they might be swamping any SQL impact.
'PA00003NFMWHMQ' translating to 'payment 286103946050'
I'm considering migrating them all to integers prior to releasing an API
Exposing primary keys (or any other piece of implementation information) to the outside world has security and compatibility considerations. Its knowledge an attacker can use, for example they can predict what the next key will be. Don't do it.
Instead, assign each thing you're exposing a random API ID like a UUIDv4 (don't use MySQL's UUID function, it is guessable UUIDv1). Store them as binary(16) if space is a big concern.
Then it doesn't matter what your primary key is. You can change your design when you like.
The advantage here is a semi-unique key that is still sequential...
This is a puzzler. Primary keys must be unique, so I'm not sure what you mean by "semi-unique". Do you mean across tables? That the ID of a row in column A is probably unique from the row in column B? If that's the case, consider UUID primary keys. Or consider whether this is truly an advantage that you can actually use because of the semi part.
I'm developing a program and ran into a bug where inserting a value in a tables column, that has the type int, and the value is larger than Integer.MAX_VALUE it spits out an error saying the number is too large. I read that the fix for this is to quite simply just alter the table to BigInt and that should fix it. But that made me thinking, why don't all programmers just use the max column values (such as Varchar(255), BigInt, etc.) rather than something smaller like Varchar(30) or Int?
Wouldn't this almost completely eliminate an error like mine occurring when you're not sure of whats going to be inserted, especially if it's based off of users input? Is there any cons into just using the largest possible type you need for the columns? Would the table size be bigger even if you just "2" in a big int column (even though that would work with int?). Is there a performance loss?
Thanks!
For Varchar, the reason you generally don't just use MAX is because it stores it differently and puts limitations on your index maintenance operations. For instance, you cannot rebuild an index "online" with a varchar(max) field on it. While there's a little hand waving involved, basically varchar(max) data gets stored off row so there's overhead in maintaining that extra data store.
For numeric types, the main thing is space. Bigint is an 8 byte signed integer whereas an int is only 4 bytes. If you dont need a space bigger than 2.4 billion, that's just wasted space (and often a lot of it if you have, say, 2.4 billion rows of data).
Data Compression can solve some of those issues, but not without the cost of having to de-compress the data when it's queried.
So the reasons are varied, but with the possible exception of using larger size varchars (not varchar(max)), picking the "right" data type for your data is just a good idea.
I can't speak to any RDBMS other than SQL Server (but I imagine this applies to all of them)... A BIG INT takes up twice as much space as an INT... which means less data fitting onto a page meaning less data in cache meaning slower performance.
In SQL Server there are actually 4 INT types:
TINYINT (1 byte),
SMALLINT (2 bytes),
INT (4 bytes),
BIGINT (8 bytes).
A good database developer will put very careful thought into choosing the proper data type based on the data that's expected to be put in the column. Aside from the issue of storage space, data types function as data constraints. So if I choose TINYINT as my data type, that means I only expect to see values between 0 and 255 and will reject anything that falls outside of that range.
If a coworker were to submit a table design with all VARCHAR(255) & BIGINTs, I'd reject it and have them size everything appropriately. It's lazy thinking like that, that causes huge problem on the DB side of the house.
why don't all programmers just use the max column values (such as
Varchar(255), BigInt, etc.) rather than something smaller like
Varchar(30) or Int?
Some do exactly that. It's also not at all uncommon to see developers store numeric or date/time values in varchar columns too.
I often see performance and storage costs called out as the reason not to do this. Those are considerations (which vary by DBMS) but a more important one in the world of relational databases is data integrity. The chosen datatype is a critical part of the data model because it determines the domain of data that can be stored. On top of that, relational databases provide check, referential, and NULL constraints to further limit column values.
Wouldn't this almost completely eliminate an error like mine occurring
when you're not sure of whats going to be inserted, especially if it's
based off of users input?
Of course, but why stop at a 64-bit integer? Why not NUMERIC(1000)? That's a rhetorical question to point out that one must know about the business domain so data can be properly modeled and validation rules enforced. A 64-bit integer is certainly overkill to store a person's number of children but you may end up with a value of several billion due to careless data entry. The column data type is the last defense for bad data and is especially important when it's based off of users input.
All that being said, one can use a RDBMS as nothing more than a dumb storage engine and enforce data integrity rules (if any) in application code. In that case, storage and performance are the only consideration.
Is there much of a speed difference for index lookups by using string for the primary key versus the actual uuid type, specifically if the string has a prefix like user-94a942de-05d3-481c-9e0c-da319eb69206 (making the lookup have to traverse 5-6 characters before getting to something unique)?
This is a micro-optimization and is unlikely to cause a real performance problem until you get to enormous scales. Use the key that best fits your design. That said, here's the details...
UUID is a built in PostgreSQL type. It's basically a 128 bit integer. It should perform as an index just as well as any other large integer. Postgres has no built in UUID generating function. You can install various modules to do it on the database, or you can do it on the client. Generating the UUID on the client distributes the extra work (not much extra work) away from the server.
MySQL does not have a built in UUID type. Instead there's a UUID function which can generate a UUID as a string of hex numbers. Because it's a string, UUID keys may have a performance and storage hit. It may also interfere with replication.
The string UUID will be longer; hex characters only encode 4 bits of data per byte so a hex string UUID needs 256 bits to store 128 bits of information. This means more storage and memory per column which can impact performance.
Normally this would mean comparisons are twice as long, since the key being compared is twice as long. However, UUIDs are normally unique in the first few bytes, so the whole UUID does not need to be compared to know they're different. Long story short: comparing string vs binary UUIDs shouldn't cause a noticeable performance difference in a real application... though the fact that MySQL UUIDs are UTF8 encoded might add cost.
Using UUIDs on PostgreSQL is fine, it's a built-in type. MySQL's implementation of UUID keys is pretty incomplete, I'd steer away from it. Steer away from MySQL while you're at it.
The real problem with UUIDs comes when the table (or at least the index) is too big to be cached in RAM. When this happens, the 'next' uuid needs to be stored into (or fetch from) some random block that is unlikely to be cached. This leads to more and more I/O as the table grows.
AUTO_INCREMENT ids usually don't suffer that I/O growth because INSERTs always go at the 'end' of the table and SELECTs usually cluster near the end. This leads to effective use of the cache, thereby avoiding the death-by-IO.
My UUID blog discusses how to make "Type-1" UUIDs less costly to performance, at least for MySQL.
Use the built-in UUID type that maps to a 128-bit int. Not just for performance, but to prevent strings like "password1" from showing up in that column.
Is there any performance benefit in using the exact data types needed for a column? Or is it just storage optimisation?
For example, I'm creating a users table and I know for certainty that there will only be 200 users in total. When I'm manipulating the data in the the server, doing some select/update/insert/delete, is there any performance difference between using TINYINT - UN for the users_id column or using just INT?
The same applies to the user's name. I know, for now, that the user with the longest name length is 48, but I don't know if in the future there won't be a new user inserted in the table with a name with 65 characters in length. Is there any performance benefit in reserving only the needed lenght, for now, using VARCHAR(48) or can I avoid having to check constantly the column allowed length for each new user and use just VARCHAR(255)?
There is little advantage in either case.
For the number, you do gain a slight performance advantage. Typically, integers are 4 and a tinyint is 1 byte. So, if you have multiple smaller fields, then your records will be smaller. Smaller records then imply fewer data pages and ultimately slightly faster queries. This shows up when you start to have lots of records.
For the varchar, you don't even have that advantage. Both varchar(48) and varchar(255) occupy the same amount of space (there is one addition byte for lengths greater than 255). The values determine the space for this data type.
In other cases, it can make a big difference. In particular, storing dates as the native format is usually important, both to take advantage of date/time functions and to make better use of indexes.
tl;dr: Is assigning rows IDs of {unixtimestamp}{randomdigits} (such as 1308022796123456) as a BIGINT a good idea if I don't want to deal with UUIDs?
Just wondering if anyone has some insight into any performance or other technical considerations / limitations in regards to IDs / PRIMARY KEYs assigned to database records across multiple servers.
My PHP+MySQL application runs on multiple servers, and the data needs to be able to be merged. So I've outgrown the standard sequential / auto_increment integer method of identifying rows.
My research into a solution brought me to the concept of using UUIDs / GUIDs. However the need to alter my code to deal with converting UUID strings to binary values in MySQL seems like a bit of a pain/work. I don't want to store the UUIDs as VARCHAR for storage and performance reasons.
Another possible annoyance of UUIDs stored in a binary column is the fact that rows IDs aren't obvious when looking at the data in PhpMyAdmin - I could be wrong about this though - but straight numbers seem a lot simpler overall anyway and are universal across any kind of database system with no conversion required.
As a middle ground I came up with the idea of making my ID columns a BIGINT, and assigning IDs using the current unix timestamp followed by 6 random digits. So lets say my random number came about to be 123456, my generated ID today would come out as: 1308022796123456
A one in 10 million chance of a conflict for rows created within the same second is fine with me. I'm not doing any sort of mass row creation quickly.
One issue I've read about with randomly generated UUIDs is that they're bad for indexes, as the values are not sequential (they're spread out all over the place). The UUID() function in MySQL addresses this by generating the first part of the UUID from the current timestamp. Therefore I've copied that idea of having the unix timestamp at the start of my BIGINT. Will my indexes be slow?
Pros of my BIGINT idea:
Gives me the multi-server/merging advantages of UUIDs
Requires very little change to my application code (everything is already programmed to handle integers for IDs)
Half the storage of a UUID (8 bytes vs 16 bytes)
Cons:
??? - Please let me know if you can think of any.
Some follow up questions to go along with this:
Should I use more or less than 6 random digits at the end? Will it make a difference to index performance?
Is one of these methods any "randomer" ?: Getting PHP to generate 6 digits and concatenating them together -VS- getting PHP to generate a number in the 1 - 999999 range and then zerofilling to ensure 6 digits.
Thanks for any tips. Sorry about the wall of text.
I have run into this very problem in my professional life. We used timestamp + random number and ran into serious issues when our applications scaled up (more clients, more servers, more requests). Granted, we (stupidly) used only 4 digits, and then change to 6, but you would be surprised how often that the errors still happen.
Over a long enough period of time, you are guaranteed to get duplicate key errors. Our application is mission critical, and therefore even the smallest chance it could fail to due inherently random behavior was unacceptable. We started using UUIDs to avoid this issue, and carefully managed their creation.
Using UUIDs, your index size will increase, and a larger index will result in poorer performance (perhaps unnoticeable, but poorer none-the-less). However MySQL supports a native UUID type (never use varchar as a primary key!!), and can handle indexing, searching,etc pretty damn efficiently even compared to bigint. The biggest performance hit to your index is almost always the number of rows indexed, rather than the size of the item being index (unless you want to index on a longtext or something ridiculous like that).
To answer you question: Bigint (with random numbers attached) will be ok if you do not plan on scaling your application/service significantly. If your code can handle the change without much alteration and your application will not explode if a duplicate key error occurs, go with it. Otherwise, bite-the-bullet and go for the more substantial option.
You can always implement a larger change later, like switching to an entirely different backend (which we are now facing... :P)
You can manually change the autonumber starting number.
ALTER TABLE foo AUTO_INCREMENT = ####
An unsigned int can store up to 4,294,967,295, lets round it down to 4,290,000,000.
Use the first 3 digits for the server serial number, and the final 7 digits for the row id.
This gives you up to 430 servers (including 000), and up to 10 million IDs for each server.
So for server #172 you manually change the autonumber to start at 1,720,000,000, then let it assign IDs sequentially.
If you think you might have more servers, but less IDs per server, then adjust it to 4 digits per server and 6 for the ID (i.e. up to 1 million IDs).
You can also split the number using binary digits instead of decimal digits (perhaps 10 binary digits per server, and 22 for the ID. So, for example, server 76 starts at 2^22*76 = 318,767,104 and ends at 322,961,407).
For that matter you don't even need a clear split. Take 4,294,967,295 divide it by the maximum number of servers you think you will ever have, and that's your spacing.
You could use a bigint if you think you need more identifiers, but that's a seriously huge number.
Use the GUID as a unique index, but also calculate a 64-bit (BIGINT) hash of the GUID, store that in a separate NOT UNIQUE column, and index it. To retrieve, query for a match to both columns - the 64-bit index should make this efficient.
What's good about this is that the hash:
a. Doesn't have to be unique.
b. Is likely to be well-distributed.
The cost: extra 8-byte column and its index.
If you want to use the timestamp method then do this:
Give each server a number, to that append the proccess ID of the application that is doing the insert (or the thread ID) (in PHP it's getmypid()), then to that append how long that process has been alive/active for (in PHP it's getrusage()), and finally add a counter that starts at 0 at the start of each script invocation (i.e. each insert within the same script adds one to it).
Also, you don't need to store the full unix timestamp - most of those digits are for saying it's year 2011, and not year 1970. So if you can't get a number saying how long the process was alive for, then at least subtract a fixed timestamp representing today - that way you'll need far less digits.