MySQL: primary key is an 8-byte string. Is it better to use BIGINT or BINARY(8)? - mysql

We need to store many rows in a MySQL (InnoDB) table, all of them having an 8-byte binary string as primary key.
I was wondering whether it was best to use the BIGINT column type (which holds 64-bit, thus 8-byte, integers) or BINARY(8), which is fixed-length.
Since we're using those ids as strings in our application, and not numbers, storing them as binary strings sounds more coherent to me. However, I wonder if there are performance issues with this. Does it make any difference?
If that matters, we are reading/storing these ids using hex notation (like page_id = 0x1122334455667788).
We wouldn't use integers in queries anyway: we're writing a PHP application and, as you surely know, PHP has no "unsigned long long int" type, so integer size is machine-dependent.

I'd use BINARY(8) if that matches your design.
Otherwise you'll always have a conversion overhead, in performance or complexity, somewhere. There won't be much (if any) difference between the two types at the RDBMS level.
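As a sketch of what the two representations hold, using the 0x1122334455667788 id from the question: a BIGINT UNSIGNED stores the 64-bit value itself, while BINARY(8) stores the same eight bytes verbatim. Python is used here purely to illustrate the byte layout; the variable names are made up.

```python
import struct

# A hypothetical 8-byte id, written in the hex notation from the question.
page_id = 0x1122334455667788

# Stored as BIGINT UNSIGNED: the 64-bit value itself (8 bytes on disk).
as_int = page_id

# Stored as BINARY(8): the same 8 bytes, here big-endian.
as_bytes = struct.pack(">Q", page_id)
assert as_bytes == bytes.fromhex("1122334455667788")

# Round-trip back to the integer: no information is lost either way.
assert struct.unpack(">Q", as_bytes)[0] == page_id
```

Either column type carries the identical 8 bytes; the choice mostly decides where the conversion happens.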

Related

Alternate field for autoincrement PK

I use an auto-increment primary key on tables where I store, for example, posts and comments.
I don't want to expose the PK to the HTTP client, however, I still use it internally in my API implementation to perform quick lookups.
When a user wants to retrieve a post by id, I want to have an alternate unique key on the table.
I wonder what is the best (most common) way to use as type for this field.
The most obvious to me would be to use a UUID or GUID.
I wonder if there is a straightforward way to generate a random numeric key for this instead for performance.
What is your take on the best approach for this situation?
MySQL has a function that generates a 128-bit version 1 UUID, as described in RFC 4122, and returns it as a hex string with dashes, in the conventional UUID format.
https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid
A true UUID is meant to be globally unique in space and time. Usually it's overkill unless you need a distributed set of independent servers to generate unique values without some central uniqueness validation, which could create a bottleneck.
MySQL also has a function UUID_SHORT() which generates a 64-bit numeric value. This does not conform with the RFC, but it might be useful for your case.
https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid-short
Read the description of the UUID_SHORT() implementation. Since the upper bits seldom change and the lower bits simply increment monotonically, it avoids the performance and fragmentation issues caused by inserting random UUID values into an index.
The UUID_SHORT value also fits in a MySQL BIGINT UNSIGNED without having to use UNHEX().
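For illustration, the manual's description of UUID_SHORT() can be sketched like this; the server id and startup time below are invented placeholder values, not real server state:

```python
import itertools

# Toy reconstruction of how UUID_SHORT() composes its 64-bit value,
# per the MySQL manual: (server_id & 255) in the top byte, the server
# startup time in the middle 32 bits, and an incrementing counter
# in the low 24 bits. SERVER_ID and STARTUP_SECONDS are assumptions.
SERVER_ID = 1
STARTUP_SECONDS = 1_600_000_000
_counter = itertools.count()

def uuid_short():
    return ((SERVER_ID & 255) << 56) \
         | ((STARTUP_SECONDS & 0xFFFFFFFF) << 24) \
         | (next(_counter) & 0xFFFFFF)

a, b = uuid_short(), uuid_short()
assert b == a + 1   # the low bits increment monotonically
assert a < 2**64    # fits in BIGINT UNSIGNED without UNHEX()
```

Because only the low bits change between consecutive calls, inserts land at the end of the index, just like AUTO_INCREMENT.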

What is the performance hit of using a string type vs a uuid type for a UUID primary key?

Is there much of a speed difference for index lookups by using string for the primary key versus the actual uuid type, specifically if the string has a prefix like user-94a942de-05d3-481c-9e0c-da319eb69206 (making the lookup have to traverse 5-6 characters before getting to something unique)?
This is a micro-optimization and is unlikely to cause a real performance problem until you get to enormous scales. Use the key that best fits your design. That said, here's the details...
UUID is a built in PostgreSQL type. It's basically a 128 bit integer. It should perform as an index just as well as any other large integer. Postgres has no built in UUID generating function. You can install various modules to do it on the database, or you can do it on the client. Generating the UUID on the client distributes the extra work (not much extra work) away from the server.
MySQL does not have a built in UUID type. Instead there's a UUID function which can generate a UUID as a string of hex numbers. Because it's a string, UUID keys may have a performance and storage hit. It may also interfere with replication.
The string UUID will be longer; each hex character occupies a byte but encodes only 4 bits of data, so a hex-string UUID needs 256 bits to store 128 bits of information. This means more storage and memory per column, which can impact performance.
Normally this would mean comparisons are twice as long, since the key being compared is twice as long. However, UUIDs are normally unique in the first few bytes, so the whole UUID does not need to be compared to know they're different. Long story short: comparing string vs binary UUIDs shouldn't cause a noticeable performance difference in a real application... though the fact that MySQL UUIDs are UTF8 encoded might add cost.
Using UUIDs on PostgreSQL is fine, it's a built-in type. MySQL's implementation of UUID keys is pretty incomplete, I'd steer away from it. Steer away from MySQL while you're at it.
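The size difference discussed above is easy to see with a quick sketch using Python's uuid module:

```python
import uuid

u = uuid.uuid4()
as_string = str(u)   # canonical text form, with dashes
as_bytes = u.bytes   # the raw 128-bit value

assert len(as_string) == 36  # 32 hex chars + 4 dashes
assert len(as_bytes) == 16   # 128 bits

# Random UUIDs usually differ in the first few bytes, so comparisons
# rarely have to scan the full 36-character string anyway.
```

The text form more than doubles the storage per key, which matters most for index size rather than comparison cost.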
The real problem with UUIDs comes when the table (or at least the index) is too big to be cached in RAM. When this happens, the 'next' uuid needs to be stored into (or fetch from) some random block that is unlikely to be cached. This leads to more and more I/O as the table grows.
AUTO_INCREMENT ids usually don't suffer that I/O growth because INSERTs always go at the 'end' of the table and SELECTs usually cluster near the end. This leads to effective use of the cache, thereby avoiding the death-by-IO.
My UUID blog discusses how to make "Type-1" UUIDs less costly to performance, at least for MySQL.
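The cache-locality argument above can be illustrated with a toy model; the page size and key counts are arbitrary assumptions, not InnoDB's actual layout:

```python
import uuid

PAGE_SIZE = 100  # rows per "page" in this toy model (an assumption)

def pages_touched(keys):
    # Map each key to the page it lands on in sorted key order, then
    # count how many distinct pages the last 1000 inserts hit.
    page_of = {k: i // PAGE_SIZE for i, k in enumerate(sorted(keys))}
    return len({page_of[k] for k in keys[-1000:]})

sequential = list(range(100_000))                         # AUTO_INCREMENT-style
random_keys = [uuid.uuid4().int for _ in range(100_000)]  # random-UUID-style

# Sequential ids cluster recent inserts onto a handful of pages...
assert pages_touched(sequential) <= 11
# ...while random UUIDs scatter them across hundreds of pages.
assert pages_touched(random_keys) > 100
```

Once the index outgrows RAM, each of those scattered pages is a potential disk read, which is the I/O growth described above.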
Use the built-in UUID type that maps to a 128-bit int. Not just for performance, but to prevent strings like "password1" from showing up in that column.

MySQL - using String as Primary Key

I saw a similar post on Stack Overflow already, but wasn't quite satisfied.
Let's say I offer a Web service. http://foo.com/SERVICEID
SERVICEID is a unique string ID used to reference the service (base64: lower/uppercase letters + numbers), similar to how URL-shortener services generate IDs for URLs.
I understand that there are inherent performance issues with comparing strings versus integers.
But I am curious of how to maximally optimize a primary key of type String.
I am using MySQL, (currently using MyISAM engine, though I admittedly don't understand all the engine differences).
Thanks.
Update: for my purposes the string was actually just a base62-encoded integer, so the primary key was an integer. Since you're not likely to ever exceed BIGINT's range, it doesn't make much sense to use anything else (for my particular use case).
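For reference, a minimal base62 codec like the one the update alludes to might look like this; it's a generic sketch, not the asker's actual code:

```python
# Digits, then uppercase, then lowercase: one common base62 ordering.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def b62_encode(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def b62_decode(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

assert b62_decode(b62_encode(123456789)) == 123456789
```

The database only ever sees the integer; the base62 string exists solely at the URL boundary.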
There's nothing wrong with using a CHAR or VARCHAR as a primary key.
Sure, it'll take up a little more space than an INT in many cases, but there are many cases where it's the most logical choice. It may even reduce the number of columns you need, improving efficiency, by avoiding a separate ID field.
For instance, country codes or state abbreviations already have standardised character codes and this would be a good reason to use a character based primary key rather than make up an arbitrary integer ID for each in addition.
If your external ID is base64, your internal ID is a binary string. Use that as the key in your database, with type BINARY(n) if fixed-length or VARBINARY if variable-length. The binary form is only 3/4 the length of the base64 one.
And just convert from/to base64 in your service.
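A sketch of that boundary conversion, assuming an 8-byte internal id and URL-safe base64 without padding (the id value here is made up):

```python
import base64

raw = bytes.fromhex("1122334455667788")  # 8-byte internal id (assumed)

# Outbound: binary -> URL-safe base64, padding stripped for clean URLs.
external = base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
assert len(raw) == 8
assert len(external) == 11  # base64 inflates size by a factor of 4/3

# Inbound: re-pad to a multiple of 4, then decode back to the key bytes.
padded = external + "=" * (-len(external) % 4)
assert base64.urlsafe_b64decode(padded) == raw
```

The database column stores `raw` (BINARY(8) here); only the service layer ever sees the base64 text.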
Using a string as the primary key column type is not a good approach: if the values cannot be generated sequentially, with an incremental pattern, inserts may cause index fragmentation and decrease database performance.

Storing very large integers in MySQL

I need to store a very large number (tens of millions) of 512-bit SHA-2 hashes in a MySQL table. To save space, I'd like to store them in binary form, rather than as a string of hex digits. I'm using an ORM (DBix::Class), so the specific details of the storage will be abstracted from the code, which can inflate them to any object or structure that I choose.
MySQL's BIGINT type is 64 bits. So I could theoretically split the hash up amongst eight BIGINT columns. That seems pretty ridiculous though. My other thought was just using a single BLOB column, but I have heard that they can be slow to access due to MySQL's treating them as variable-length fields.
If anyone could offer some wisdom that will save me a couple hours of benchmarking various methods, I'd appreciate it.
Note: Automatic -1 to anyone who says "just use postgres!" :)
Have you considered 'binary(64)' ? See MySQL binary type.
Use the type BINARY(64)?
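As a quick size check on the suggestion above: a SHA-512 digest is exactly 64 bytes, so BINARY(64) holds it verbatim, while storing it as hex would need a 128-character column:

```python
import hashlib

digest = hashlib.sha512(b"example document").digest()

assert len(digest) == 64         # fits a BINARY(64) column exactly
assert len(digest.hex()) == 128  # hex form would need CHAR(128)
```

At tens of millions of rows, that factor of two is 640 MB of raw key data saved, before index overhead.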

MySQL performance of unique varchar field vs unique bigint

I'm working on an application that will be implementing a hex value as a business key (in addition to an auto increment field as primary key) similar to the URL id seen in Gmail. I will be adding a unique constraint to the column and was originally thinking of storing the value as a bigint to get away from searching a varchar field but was wondering if that's necessary if the field is unique.
Internal joins would be done using the auto increment field and the hex value would be used in the where clause for filtering.
What sort of performance hit would there be in simply storing the value as a varchar(x), or perhaps a char(x) over the additional work in doing the conversion to and from hex to store the value as an integer in the database? Is it worth the additional complexity?
I did a quick test on a small number of rows (50k) and had similar search result times. If there is a large performance issue would it be linear, or exponential?
I'm using InnoDB as the engine.
Is your hex value a GUID? Although I used to worry about the performance of such long items as indexes, I have found that on modern databases the performance difference on even millions of records is fairly insignificant.
A potentially larger problem is the memory that the index consumes (a 16-byte GUID vs a 4-byte INT, for example), but on servers that I control I can allocate for that. As long as the index fits in memory, I find there is more overhead from other operations, so the size of the index element doesn't make a noticeable difference.
On the upside, if you use a GUID you gain server independence for records created and more flexibility in merging data on multiple servers (which is something I care about, as our system aggregates data from child systems).
There is a graph on this article that seems to back up my suspicion: Myths, GUID vs Autoincrement
The hex value is generated from a UUID (Java's implementation); it's hashed and truncated to a smaller length (likely 16 characters). The hash algorithm is still under discussion (currently SHA). An advantage I see of storing the value in hex vs integer is that if we needed to grow the size (which I don't see happening with this application at 16 chars), we could simply increase the truncated length and leave the old values without fear of collision. Converting to integer values wouldn't work as nicely for that.
The reason for the truncation vs simply using a GUID/UUID is simply to make the URL's and API's (which is where these will be used) more friendly.
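A sketch of the scheme the asker describes, assuming SHA-256 as the hash; the function name and default length here are illustrative, not the actual implementation:

```python
import hashlib
import uuid

def business_key(u: uuid.UUID, length: int = 16) -> str:
    # Hash the UUID's raw bytes and keep a prefix of the hex digest.
    # Increasing `length` later only affects newly generated keys.
    return hashlib.sha256(u.bytes).hexdigest()[:length]

key = business_key(uuid.uuid4())
assert len(key) == 16
assert all(c in "0123456789abcdef" for c in key)
```

The same function works regardless of whether the column holding the key is CHAR(16) or a binary/integer encoding of those hex digits.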
All else being equal, keeping the data smaller will make it run faster. Mostly because it'll take less space, so less disk i/o, less memory needed to hold the index, etc etc. 50k rows isn't enough to notice that though...