Better alternative to autonumber primary keys - ms-access

I am looking for a better primary key than the AutoNumber data type, mainly because it's limited to a Long Integer, when I really just need the field to hold a number or text string that will never repeat, no matter HOW many records are added to or deleted from the table. The problem is I am not sure how to implement something like turning the current date and time into a hexadecimal string and using that as a unique field for the primary key.
Am I just being too paranoid about running out of space?
-- EDITED 03-16-2010 # 1237 hours --
I had a person who, at the time, I thought was a wonderful reference for Access-related questions tell me that Replication IDs are just a counter for the number of times an item was replicated... hence I never explored them further. After the number of replies, I have up-modded and accepted an answer. I guess I was just having a stupid newbie Access developer moment. Seriously though, thank you again to everyone who replied!

GUID.
They are pretty unique
http://en.wikipedia.org/wiki/Globally_Unique_Identifier
You did not mention your programming language.
C# would be something like
string myKey = Guid.NewGuid().ToString();

Why do you think you'll run out of space? Perhaps you do not realize how big a 64-bit integer is, exactly. It allows for around 18 billion billion records. If you created 100 records per second, it would take over five billion years to run out of integers.

Why are you limited to a long integer? When you specify an AutoNumber field, you can tell it to use a Replication ID instead of Long Integer and it will be a unique 128-bit value called a GUID.
Although you can use the current date and time as a primary key, here's why not to:
The current date and time are not as unique as you might think. If you make records very rapidly, you could end up with two being inserted between clock ticks, causing both to end up with the same time. Or your computer's clock could just get reset backwards. Or DST could end and if you're storing local times, you'll end up with duplicate times.

Per John's answer, you're unlikely to run out of long integers. But if you'd prefer a unique string, the easiest solution is a UUID. It doesn't take inputs, but the odds of ever generating two identical UUIDs are negligible.
For example, in Python:
import uuid
print(uuid.uuid4())  # prints a random (version 4) UUID in its 36-character string form
There are UUID functions available in nearly all languages: http://en.wikipedia.org/wiki/Uuid

A very simple solution is to use an Autonumber with option "Random" instead of "Increment". I read somewhere that since the numbers are not contiguous, it has the added bonus of improving concurrency when adding new records from several clients simultaneously.

Related

Limit of SQL 'auto increment' as primary key

I want to create an online billboard system where everyone can post a topic, as my project.
I'm trying to design the database using SQL to store the information for each topic, including the topic's id as the primary key.
At first I designed the id as an integer with auto-increment, as I think it's the simplest way. Then I thought about it and realized that the integer has a limit (the number may be high, but it is there), so I'm here looking for another method.
Now I'm thinking of some pseudo-random algorithms, or using a hash of the topic's name, but it's still not clear.
I also found GUIDs while researching here, but I'm not sure whether they can be used.
I'd appreciate suggestions on ways to deal with the size limit of an integer primary key, or keywords for further research.
This answer assumes MySQL/MariaDB, because it uses the terminology "auto-increment" for such columns (as opposed to other databases that use identity or serial).
If int isn't big enough, you can use bigint.
Although I might consider it unlikely that you'll exceed the threshold for int (it works for many applications), exceeding bigint's maximum value would require great effort on your and your computer's part for a long, long time.
This is explained in the documentation.
With int, the maximum value supported by SQL Server is 2,147,483,647.
Just for completeness, I will also add that yet another option is to change the data type of the column to bigint (maximum value 9,223,372,036,854,775,807 - this will allow you to insert a million rows per second for almost 300,000 years in a row).
Or if you fear that you will overflow even that, you can consider using decimal(38,0) - the maximum here is a number consisting of 38 9's (which would let you maintain that same pace for a whopping 3,170,979,198,376,458,650,431,253 years).
http://sqlblog.com/blogs/hugo_kornelis
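As a quick sanity check on those figures, here is a small Python sketch; the million-rows-per-second rate is just the pace assumed in the answer above, not a measured number.

# Rough back-of-the-envelope check of how long a key type lasts
# at a sustained insert rate; the results match the figures quoted above.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def years_until_overflow(max_value, rows_per_second):
    # years of continuous inserting before the key type overflows
    return max_value / (rows_per_second * SECONDS_PER_YEAR)

bigint_max = 9_223_372_036_854_775_807   # 64-bit signed bigint
decimal38_max = 10**38 - 1               # decimal(38,0): thirty-eight 9's

print(years_until_overflow(bigint_max, 1_000_000))     # ~292,471 years
print(years_until_overflow(decimal38_max, 1_000_000))  # ~3.17e24 years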

Theoretical situation about MySQL

I searched Google for a question I ask myself since this morning but couldn't find any information or article about it.
I was wondering about the following situation, hoping to improve performance (a little % still):
Context: I have two columns: ID and AddedAt (AddedAt is the Unix timestamp of when the row is created).
Theoretically, if you insert a new row, ID will be +1 and AddedAt will be the current time.
Now, let's say it is impossible in the current situation to have two simultaneous inserts; would it be better to use AddedAt as the PK and remove the ID column? AddedAt would then be a single unique column serving as both the PK and the Unix timestamp. So in the end, I would have one column instead of two.
The only downside I see is maybe the size of the key that will be created on AddedAt, since a Unix timestamp these days is 10 digits.
Would it be better in this situation? What's your opinion?
EDIT: What about using timestamp + ms?
Timestamps are in seconds. While you might not have simultaneous inserts now, as the world tends to speed up you might get multiple inserts in a second. Build your system to function soundly--don't use timestamps as primary keys.
Also, with statement-based replication, timestamps sometimes aren't consistent across DBs... Row-based replication alleviates this, but it's still another reason for concern when using them.
From a good-convention standpoint, primary keys should have some clear meaning to others outside yourself if they're anything other than just a plain old auto-incrementing id field. Generally, people expect numbers or char values for keys, not things like blobs, timestamps, datetimes, etc. This is especially true if the key is later used as a foreign key in another table; using a timestamp as a foreign key can be confusing to later developers. Sure, if you have a varchar GUID field you know is unique, use it as the key. Just remember that when it's used as a foreign key, you're also going to eat up quite a bit of memory if you have a huge string.
Assuming you can guarantee that two events won't occur within the same 1-second interval, then sure, you could use the timestamp field as a PK.
That being said, why are you worried about key sizes? A timestamp may be 10 digits, but its internal storage requirement is only 4 bytes. By comparison, an int is also 4 bytes, so you wouldn't be losing anything - unless you're using bigints, in which case it's 8 bytes.
Also, note that timestamp fields are subject to the Y2038 problem. They're essentially Unix timestamps that auto-format into a human-readable date for you. If your app is going to be around for more than 26 years, then you should stick with an int/bigint, whose wraparound point is "however fast you insert rows", not a fixed date/time.
The primary key is not only a technical thing, it is the business representation of something that makes each object represented by a row unique.
A timestamp is a unique field of your object because you cannot (in your case) insert two objects at the same time, but it is NOT the primary definition of a business object (if you had a business object called "timestamp" then yes, the time when it was inserted should be the primary key)
An ID stands for "my client has a physical id that represents him": in the past, we would give numbers to clients on papers, bills...
Never forget that computer science is not the objective per se but the means to achieve your goals.
I would leave the ID column as the primary key, as there may be scenarios in which the Unix timestamp gives you a value you're not expecting. One is that inserting very quickly in succession can return the same timestamp; another is the server admin deciding to monkey with the server's time settings.
Doing joins will probably be much more obvious too, as people typically expect the primary key to be some sort of unique id, not a timestamp.
Yes, of course, but the performance gain will be minimal, and only while adding new records.
Moreover, you will be forced to use the timestamp for foreign keys in all related objects.
It is worth considering only if you expect many inserts per second and a lot of records (to save storage on the id column and its index), but as you said, the timestamp will be unique, so that's at most 1 record per second :-)

MySQL PRIMARY KEYs: UUID / GUID vs BIGINT (timestamp+random)

tl;dr: Is assigning rows IDs of {unixtimestamp}{randomdigits} (such as 1308022796123456) as a BIGINT a good idea if I don't want to deal with UUIDs?
Just wondering if anyone has some insight into any performance or other technical considerations / limitations in regards to IDs / PRIMARY KEYs assigned to database records across multiple servers.
My PHP+MySQL application runs on multiple servers, and the data needs to be able to be merged. So I've outgrown the standard sequential / auto_increment integer method of identifying rows.
My research into a solution brought me to the concept of using UUIDs / GUIDs. However the need to alter my code to deal with converting UUID strings to binary values in MySQL seems like a bit of a pain/work. I don't want to store the UUIDs as VARCHAR for storage and performance reasons.
Another possible annoyance of UUIDs stored in a binary column is that row IDs aren't obvious when looking at the data in phpMyAdmin - I could be wrong about this though - but straight numbers seem a lot simpler overall anyway and are universal across any kind of database system with no conversion required.
As a middle ground I came up with the idea of making my ID columns a BIGINT, and assigning IDs using the current unix timestamp followed by 6 random digits. So let's say my random number came about to be 123456, my generated ID today would come out as: 1308022796123456
A roughly one-in-a-million chance of a conflict for rows created within the same second is fine with me. I'm not doing any sort of mass row creation quickly.
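For reference, that ID scheme could be sketched roughly like this in Python (the 6-digit suffix width is just the number used above):

import random
import time

def make_id(random_digits=6):
    # <unix seconds><zero-padded random digits>, e.g. 1308022796123456
    ts = int(time.time())
    suffix = random.randrange(10 ** random_digits)
    return ts * 10 ** random_digits + suffix

print(make_id())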
One issue I've read about with randomly generated UUIDs is that they're bad for indexes, as the values are not sequential (they're spread out all over the place). The UUID() function in MySQL addresses this by generating the first part of the UUID from the current timestamp. Therefore I've copied that idea of having the unix timestamp at the start of my BIGINT. Will my indexes be slow?
Pros of my BIGINT idea:
Gives me the multi-server/merging advantages of UUIDs
Requires very little change to my application code (everything is already programmed to handle integers for IDs)
Half the storage of a UUID (8 bytes vs 16 bytes)
Cons:
??? - Please let me know if you can think of any.
Some follow up questions to go along with this:
Should I use more or less than 6 random digits at the end? Will it make a difference to index performance?
Is one of these methods any "randomer" ?: Getting PHP to generate 6 digits and concatenating them together -VS- getting PHP to generate a number in the 1 - 999999 range and then zerofilling to ensure 6 digits.
Thanks for any tips. Sorry about the wall of text.
I have run into this very problem in my professional life. We used timestamp + random number and ran into serious issues when our applications scaled up (more clients, more servers, more requests). Granted, we (stupidly) used only 4 digits and then changed to 6, but you would be surprised how often the errors still happen.
Over a long enough period of time, you are guaranteed to get duplicate key errors. Our application is mission critical, and therefore even the smallest chance it could fail due to inherently random behavior was unacceptable. We started using UUIDs to avoid this issue, and carefully managed their creation.
Using UUIDs, your index size will increase, and a larger index will result in poorer performance (perhaps unnoticeable, but poorer nonetheless). However, UUIDs can be stored in a compact binary form (never use varchar as a primary key!!), and MySQL can handle indexing, searching, etc. pretty efficiently with them, even compared to bigint. The biggest performance hit to your index is almost always the number of rows indexed, rather than the size of the item being indexed (unless you want to index on a longtext or something ridiculous like that).
To answer your question: bigint (with random numbers attached) will be OK if you do not plan on scaling your application/service significantly. If your code can handle the change without much alteration and your application will not explode if a duplicate key error occurs, go with it. Otherwise, bite the bullet and go for the more substantial option.
You can always implement a larger change later, like switching to an entirely different backend (which we are now facing... :P)
You can manually change the auto-increment starting number.
ALTER TABLE foo AUTO_INCREMENT = ####
An unsigned int can store up to 4,294,967,295; let's round it down to 4,290,000,000.
Use the first 3 digits for the server serial number, and the final 7 digits for the row id.
This gives you up to 430 servers (including 000), and up to 10 million IDs for each server.
So for server #172 you manually change the autonumber to start at 1,720,000,000, then let it assign IDs sequentially.
If you think you might have more servers, but less IDs per server, then adjust it to 4 digits per server and 6 for the ID (i.e. up to 1 million IDs).
You can also split the number using binary digits instead of decimal digits (perhaps 10 binary digits per server, and 22 for the ID. So, for example, server 76 starts at 2^22*76 = 318,767,104 and ends at 322,961,407).
For that matter you don't even need a clear split. Take 4,294,967,295 divide it by the maximum number of servers you think you will ever have, and that's your spacing.
You could use a bigint if you think you need more identifiers, but that's a seriously huge number.
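A small Python sketch of computing those starting offsets (the server numbers are just the ones from the examples above); you would then apply the result with ALTER TABLE foo AUTO_INCREMENT = <start>:

def start_decimal(server_no, id_digits=7):
    # server #172 with 7 id digits -> 1,720,000,000
    return server_no * 10 ** id_digits

def start_binary(server_no, id_bits=22):
    # server #76 with 22 id bits -> 318,767,104 (block ends at 322,961,407)
    return server_no * 2 ** id_bits

print(start_decimal(172))  # 1720000000
print(start_binary(76))    # 318767104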
Use the GUID as a unique index, but also calculate a 64-bit (BIGINT) hash of the GUID, store that in a separate NOT UNIQUE column, and index it. To retrieve, query for a match to both columns - the 64-bit index should make this efficient.
What's good about this is that the hash:
a. Doesn't have to be unique.
b. Is likely to be well-distributed.
The cost: extra 8-byte column and its index.
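A minimal Python sketch of that idea; the choice of hash (the low 64 bits of SHA-256 here) is just one reasonable option, not something prescribed by the answer:

import hashlib
import uuid

def guid_hash64(guid):
    # 64-bit, BIGINT-sized hash of the GUID; deliberately allowed to collide
    digest = hashlib.sha256(guid.bytes).digest()
    return int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF  # keep it in signed BIGINT range

g = uuid.uuid4()
print(g, guid_hash64(g))  # store both columns; query on (hash, guid) so the 64-bit index does the work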
If you want to use the timestamp method then do this:
Give each server a number; to that, append the process ID of the application doing the insert (or the thread ID) (in PHP it's getmypid()); then append how long that process has been alive/active (in PHP it's getrusage()); and finally add a counter that starts at 0 at the start of each script invocation (i.e. each insert within the same script adds one to it).
Also, you don't need to store the full Unix timestamp - most of those digits go toward saying it's the year 2011 and not 1970. So if you can't get a number saying how long the process has been alive, then at least subtract a fixed timestamp representing today - that way you'll need far fewer digits.
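A rough Python equivalent of that recipe; the field widths, the use of os.getpid(), and the seconds-since-midnight component are my own assumptions standing in for PHP's getmypid()/getrusage():

import itertools
import os
import time

SERVER_NO = 7                 # assumed: each server is assigned a small number
_counter = itertools.count()  # restarts at 0 on every script invocation

def make_id():
    pid = os.getpid() % 100_000             # process id, capped to 5 digits
    seq = next(_counter) % 1_000            # per-invocation insert counter, 3 digits
    secs_today = int(time.time()) % 86_400  # seconds since midnight instead of the full timestamp
    return int(f"{SERVER_NO:03d}{pid:05d}{seq:03d}{secs_today:05d}")

print(make_id())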

MySQL: Using DATETIME as primary key

My database will be storing a large number of data points, so I am using an unsigned BIGINT as the primary key.
Would it ever make sense to use a DATETIME object as the primary key?
Thanks,
Yes, of course it makes sense for a date/time to be a key or part of a key if you need to uniquely identify discrete points or periods of time. I can't say if that applies to your scenario, but as a general rule there's no fundamental reason why keys can't be based on time - almost any data warehouse does it.
No because it can't be guaranteed to be unique. Stick with BIGINT. You can put a nice index on the DateTime for querying and it will be good enough.
It wouldn't make sense, as you would be limited to one record per second without any actual reason for that.
It makes sense if your data comes from a single time-ordered set. Say, a record of financial transactions. If you have multiple data points which naturally occurred at different instants, but have the same timestamp due to rounding, change the low-order bits to discriminate them.
This is more problematic in MySQL than in other databases, because timestamps are stored with only 1-second precision. (Edit: as of 5.6.4, MySQL has microsecond precision on time types.)
If you happen to have multiple observations per second this will fail. For this reason it's probably better not to unless you can guarantee that there will never be more than one point per second.

UUID performance in MySQL?

We're considering using UUID values as primary keys for our MySQL database. The data being inserted is generated from dozens, hundreds, or even thousands of remote computers and being inserted at a rate of 100-40,000 inserts per second, and we'll never do any updates.
The database itself will typically get to around 50M records before we start to cull data, so not a massive database, but not tiny either. We're also planning to run on InnoDB, though we are open to changing that if there is a better engine for what we're doing.
We were ready to go with Java's Type 4 UUID, but in testing have been seeing some strange behavior. For one, we're storing as varchar(36) and I now realize we'd be better off using binary(16) - though how much better off I'm not sure.
The bigger question is: how badly does this random data screw up the index when we have 50M records? Would we be better off if we used, for example, a type-1 UUID where the leftmost bits were timestamped? Or maybe we should ditch UUIDs entirely and consider auto_increment primary keys?
I'm looking for general thoughts/tips on the performance of different types of UUIDs when they are stored as an index/primary key in MySQL. Thanks!
At my job, we use UUID as PKs. What I can tell you from experience is DO NOT USE THEM as PKs (SQL Server by the way).
It's one of those things that when you have fewer than 1000 records it's OK, but when you have millions, it's the worst thing you can do. Why? Because UUIDs are not sequential, so every time a new record is inserted, MSSQL needs to go find the correct page to insert the record into, and then insert the record. The really ugly consequence of this is that the pages end up in all different sizes and fragmented, so now we have to do periodic defragmentation.
When you use an autoincrement, MSSQL will always go to the last page, and you end up with equally sized pages (in theory) so the performance to select those records is much better (also because the INSERTs will not block the table/page for so long).
However, the big advantage of using UUID as PKs is that if we have clusters of DBs, there will not be conflicts when merging.
I would recommend the following model:
PK INT Identity
Additional column automatically generated as UUID.
This way, the merge process is possible (UUID would be your REAL key, while the PK would just be something temporary that gives you good performance).
NOTE: The best solution is to use NEWSEQUENTIALID (like I was saying in the comments), but for a legacy app without much time to refactor (and even worse, without control over all inserts), it is not possible.
But indeed as of 2017, I'd say the best solution here is NEWSEQUENTIALID or doing Guid.Comb with NHibernate.
A UUID is a Universally Unique ID. It's the universally part that you should be considering here.
Do you really need the IDs to be universally unique? If so, then UUIDs may be your only choice.
I would strongly suggest that if you do use UUIDs, you store them as a number and not as a string. If you have 50M+ records, then the saving in storage space will improve your performance (although I couldn't say by how much).
If your IDs do not need to be universally unique, then I don't think that you can do much better than just using auto_increment, which guarantees that IDs will be unique within a table (since the value will increment each time).
Something to take into consideration is that auto-increments are generated one at a time and cannot be produced in parallel. The fight over using UUIDs eventually comes down to what you want to achieve versus what you potentially sacrifice.
On performance, briefly:
A UUID like the one above is 36 characters long, including dashes. If you store this as VARCHAR(36), you're going to decrease compare performance dramatically. This is your primary key; you don't want it to be slow.
At its bit level, a UUID is 128 bits, which means it will fit into 16 bytes. Note this is not very human readable, but it will keep storage low, and is only 4 times larger than a 32-bit int, or 2 times larger than a 64-bit int. I would use a VARBINARY(16). Theoretically, this can work without a lot of overhead.
I recommend reading the following two posts:
Brian "Krow" Aker's Idle Thoughts - Myths, GUID vs Autoincrement
To UUID or not to UUID ?
I reckon between the two, they answer your question.
I tend to avoid UUIDs simply because they are a pain to store and a pain to use as a primary key, but there are advantages. The main one is that they are UNIQUE.
I usually solve the problem and avoid UUID by using dual key fields.
COLLECTOR = UNIQUE ASSIGNED TO A MACHINE
ID = RECORD COLLECTED BY THE COLLECTOR (auto_inc field)
This offers me two things: the speed of auto-inc fields, and uniqueness of the data once it is collected and grouped together in a central location. I also know, while browsing the data, where it was collected, which is often quite important for my needs.
I have seen many cases, while dealing with other data sets for clients, where they decided to use UUIDs but then still kept a field for where the data was collected, which really is a waste of effort. Simply using two (or more if needed) fields as your key really helps.
I have just seen too many performance hits using UUIDs. They feel like a cheat...
Instead of centrally generating unique keys for each insertion, how about allocating blocks of keys to individual servers? When they run out of keys, they can request a new block. Then you avoid the overhead of connecting for each insert.
Keyserver maintains the next available id.
Server 1 requests an id block.
Keyserver returns (1, 1000).
Server 1 can insert 1000 records until it needs to request a new block.
Server 2 requests an id block.
Keyserver returns (1001, 2000).
etc...
You could come up with a more sophisticated version where a server could request the number of needed keys, or return unused blocks to the keyserver, which would then of course need to maintain a map of used/unused blocks.
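A toy, in-memory Python sketch of that block-allocation idea (in practice the key server would be a separate service or a table updated in a transaction):

class KeyServer:
    def __init__(self, block_size=1000):
        self.block_size = block_size
        self.next_id = 1

    def request_block(self):
        # hand out an inclusive (start, end) range reserved for one server
        start = self.next_id
        end = start + self.block_size - 1
        self.next_id = end + 1
        return start, end

class AppServer:
    def __init__(self, key_server):
        self.key_server = key_server
        self.current = None
        self.block_end = -1

    def next_key(self):
        if self.current is None or self.current > self.block_end:
            self.current, self.block_end = self.key_server.request_block()
        key = self.current
        self.current += 1
        return key

ks = KeyServer()
s1, s2 = AppServer(ks), AppServer(ks)
print(s1.next_key(), s2.next_key())  # 1 and 1001: each server draws from its own block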
I realize this question is rather old, but I did hit upon it in my research. Since then a number of things have happened (SSDs are ubiquitous, InnoDB got updates, etc.).
In my research I found this rather interesting post on performance:
claiming that due to the randomness of a GUID/UUID, index trees can get rather unbalanced. In the MariaDB KB I found another post that suggested a solution.
But since then, the new UUID_TO_BIN function takes care of this. This function is only available in MySQL (tested on version 8.0.18) and not in MariaDB (version 10.4.10).
TL;DR: Store UUID as converted/optimized BINARY(16) values.
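For illustration, here is a small Python sketch of roughly what MySQL's UUID_TO_BIN(uuid, 1) does for a version-1 UUID: it moves the time-high group to the front so values generated close together in time also sort close together in the index. This is my approximation of the documented byte swap, not the server code itself:

import uuid

def uuid_to_ordered_bin(u):
    b = u.bytes  # time_low(4) | time_mid(2) | time_hi(2) | clock_seq(2) | node(6)
    return b[6:8] + b[4:6] + b[0:4] + b[8:]  # time_hi | time_mid | time_low | rest

u = uuid.uuid1()
print(uuid_to_ordered_bin(u).hex())  # store this in a BINARY(16) column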
I would assign each server a numeric ID in a transactional manner.
Then, each record inserted will just autoincrement its own counter.
Combination of ServerID and RecordID will be unique.
ServerID field can be indexed and future select performance based on ServerID (if needed) may be much better.
The short answer is that many databases have performance problems (in particular with high INSERT volumes) due to a conflict between their indexing method and UUIDs' deliberate entropy in the high-order bits. There are several common hacks:
choose a different index type (e.g. nonclustered on MSSQL) that doesn't mind it
munge the data to move the entropy to lower-order bits (e.g. reordering bytes of V1 UUIDs on MySQL)
make the UUID a secondary key with an auto-increment int primary key
... but these are all hacks--and probably fragile ones at that.
The best answer, but unfortunately the slowest one, is to demand your vendor improve their product so it can deal with UUIDs as primary keys just like any other type. They shouldn't be forcing you to roll your own half-baked hack to make up for their failure to solve what has become a common use case and will only continue to grow.
What about some hand-crafted UID? Give each of the thousands of servers an ID and make the primary key a combo key of (autoincrement, MachineID)?
Since the primary key is generated in a decentralised way, you don't have the option of using an auto_increment anyway.
If you don't have to hide the identity of the remote machines, use Type 1 UUIDs instead of Type 4 UUIDs. They are easier to generate and at least won't hurt the performance of the database.
The same goes for varchar (char, really) vs. binary: it can only help matters. Is it really important how much performance is improved?
The main case where UUIDs cause miserable performance is ...
When the INDEX is too big to be cached in the buffer_pool, each lookup tends to be a disk hit. For HDD, this can slow down the access by 10x or worse. (No, that is not a typo for "10%".) With SSDs, the slowdown is less, but still significant.
This applies to any "hash" (MD5, SHA256, etc), with one exception: A type-1 UUID with its bits rearranged.
Background and manual optimization: UUIDs
MySQL 8.0: see UUID_TO_BIN() and BIN_TO_UUID()
MariaDB 10.7 carries this further with its UUID datatype.