Is a GUID unique 100% of the time? - language-agnostic

Is a GUID unique 100% of the time?
Will it stay unique over multiple threads?

While each generated GUID is not
guaranteed to be unique, the total
number of unique keys (2128 or
3.4×1038) is so large that the probability of the same number being
generated twice is very small. For
example, consider the observable
universe, which contains about 5×1022
stars; every star could then have
6.8×1015 universally unique GUIDs.
From Wikipedia.
These are some good articles on how a GUID is made (for .NET) and how you could get the same guid in the right situation.
https://ericlippert.com/2012/04/24/guid-guide-part-one/
https://ericlippert.com/2012/04/30/guid-guide-part-two/
https://ericlippert.com/2012/05/07/guid-guide-part-three/
​​

If you are scared of the same GUID values then put two of them next to each other.
Guid.NewGuid().ToString() + Guid.NewGuid().ToString();
If you are too paranoid then put three.

The simple answer is yes.
Raymond Chen wrote a great article on GUIDs and why substrings of GUIDs are not guaranteed unique. The article goes in to some depth as to the way GUIDs are generated and the data they use to ensure uniqueness, which should go to some length in explaining why they are :-)

As a side note, I was playing around with Volume GUIDs in Windows XP. This is a very obscure partition layout with three disks and fourteen volumes.
\\?\Volume{23005604-eb1b-11de-85ba-806d6172696f}\ (F:)
\\?\Volume{23005605-eb1b-11de-85ba-806d6172696f}\ (G:)
\\?\Volume{23005606-eb1b-11de-85ba-806d6172696f}\ (H:)
\\?\Volume{23005607-eb1b-11de-85ba-806d6172696f}\ (J:)
\\?\Volume{23005608-eb1b-11de-85ba-806d6172696f}\ (D:)
\\?\Volume{23005609-eb1b-11de-85ba-806d6172696f}\ (P:)
\\?\Volume{2300560b-eb1b-11de-85ba-806d6172696f}\ (K:)
\\?\Volume{2300560c-eb1b-11de-85ba-806d6172696f}\ (L:)
\\?\Volume{2300560d-eb1b-11de-85ba-806d6172696f}\ (M:)
\\?\Volume{2300560e-eb1b-11de-85ba-806d6172696f}\ (N:)
\\?\Volume{2300560f-eb1b-11de-85ba-806d6172696f}\ (O:)
\\?\Volume{23005610-eb1b-11de-85ba-806d6172696f}\ (E:)
\\?\Volume{23005611-eb1b-11de-85ba-806d6172696f}\ (R:)
| | | | |
| | | | +-- 6f = o
| | | +---- 69 = i
| | +------ 72 = r
| +-------- 61 = a
+---------- 6d = m
It's not that the GUIDs are very similar but the fact that all GUIDs have the string "mario" in them. Is that a coincidence or is there an explanation behind this?
Now, when googling for part 4 in the GUID I found approx 125.000 hits with volume GUIDs.
Conclusion: When it comes to Volume GUIDs they aren't as unique as other GUIDs.

It should not happen. However, when .NET is under a heavy load, it is possible to get duplicate guids. I have two different web servers using two different sql servers. I went to merge the data and found I had 15 million guids and 7 duplicates.

Yes, a GUID should always be unique. It is based on both hardware and time, plus a few extra bits to make sure it's unique. I'm sure it's theoretically possible to end up with two identical ones, but extremely unlikely in a real-world scenario.
Here's a great article by Raymond Chen on Guids:
https://blogs.msdn.com/oldnewthing/archive/2008/06/27/8659071.aspx
​
​
​

Guids are statistically unique. The odds of two different clients generating the same Guid are infinitesimally small (assuming no bugs in the Guid generating code). You may as well worry about your processor glitching due to a cosmic ray and deciding that 2+2=5 today.
Multiple threads allocating new guids will get unique values, but you should get that the function you are calling is thread safe. Which environment is this in?

Eric Lippert has written a very interesting series of articles about GUIDs.
There are on the order 230 personal computers in the world (and of
course lots of hand-held devices or non-PC computing devices that have
more or less the same levels of computing power, but lets ignore
those). Let's assume that we put all those PCs in the world to the
task of generating GUIDs; if each one can generate, say, 220 GUIDs per
second then after only about 272 seconds -- one hundred and fifty
trillion years -- you'll have a very high chance of generating a
collision with your specific GUID. And the odds of collision get
pretty good after only thirty trillion years.
GUID Guide, part one
GUID Guide, part two
GUID Guide, part three

Theoretically, no, they are not unique. It's possible to generate an identical guid over and over. However, the chances of it happening are so low that you can assume they are unique.
I've read before that the chances are so low that you really should stress about something else--like your server spontaneously combusting or other bugs in your code. That is, assume it's unique and don't build in any code to "catch" duplicates--spend your time on something more likely to happen (i.e. anything else).
I made an attempt to describe the usefulness of GUIDs to my blog audience (non-technical family memebers). From there (via Wikipedia), the odds of generating a duplicate GUID:
1 in 2^128
1 in 340 undecillion (don’t worry, undecillion is not on the
quiz)
1 in 3.4 × 10^38
1 in 340,000,000,000,000,000,000,000,000,000,000,000,000

None seems to mention the actual math of the probability of it occurring.
First, let's assume we can use the entire 128 bit space (Guid v4 only uses 122 bits).
We know that the general probability of NOT getting a duplicate in n picks is:
(1-1/2128)(1-2/2128)...(1-(n-1)/2128)
Because 2128 is much much larger than n, we can approximate this to:
(1-1/2128)n(n-1)/2
And because we can assume n is much much larger than 0, we can approximate that to:
(1-1/2128)n^2/2
Now we can equate this to the "acceptable" probability, let's say 1%:
(1-1/2128)n^2/2 = 0.01
Which we solve for n and get:
n = sqrt(2* log 0.01 / log (1-1/2128))
Which Wolfram Alpha gets to be 5.598318 × 1019
To put that number into perspective, lets take 10000 machines, each having a 4 core CPU, doing 4Ghz and spending 10000 cycles to generate a Guid and doing nothing else. It would then take ~111 years before they generate a duplicate.

From http://www.guidgenerator.com/online-guid-generator.aspx
What is a GUID?
GUID (or UUID) is an acronym for 'Globally Unique Identifier' (or 'Universally Unique Identifier'). It is a 128-bit integer number used to identify resources. The term GUID is generally used by developers working with Microsoft technologies, while UUID is used everywhere else.
How unique is a GUID?
128-bits is big enough and the generation algorithm is unique enough that if 1,000,000,000 GUIDs per second were generated for 1 year the probability of a duplicate would be only 50%. Or if every human on Earth generated 600,000,000 GUIDs there would only be a 50% probability of a duplicate.

Is a GUID unique 100% of the time?
Not guaranteed, since there are several ways of generating one. However, you can try to calculate the chance of creating two GUIDs that are identical and you get the idea: a GUID has 128 bits, hence, there are 2128 distinct GUIDs – much more than there are stars in the known universe. Read the wikipedia article for more details.

MSDN:
There is a very low probability that the value of the new Guid is all zeroes or equal to any other Guid.

If your system clock is set properly and hasn't wrapped around, and if your NIC has its own MAC (i.e. you haven't set a custom MAC) and your NIC vendor has not been recycling MACs (which they are not supposed to do but which has been known to occur), and if your system's GUID generation function is properly implemented, then your system will never generate duplicate GUIDs.
If everyone on earth who is generating GUIDs follows those rules then your GUIDs will be globally unique.
In practice, the number of people who break the rules is low, and their GUIDs are unlikely to "escape". Conflicts are statistically improbable.

I experienced a duplicate GUID.
I use the Neat Receipts desktop scanner and it comes with proprietary database software. The software has a sync to cloud feature, and I kept getting an error upon syncing. A gander at the logs revealed the awesome line:
"errors":[{"code":1,"message":"creator_guid: is already
taken","guid":"C83E5734-D77A-4B09-B8C1-9623CAC7B167"}]}
I was a bit in disbelief, but surely enough, when I found a way into my local neatworks database and deleted the record containing that GUID, the error stopped occurring.
So to answer your question with anecdotal evidence, no. A duplicate is possible. But it is likely that the reason it happened wasn't due to chance, but due to standard practice not being adhered to in some way. (I am just not that lucky) However, I cannot say for sure. It isn't my software.
Their customer support was EXTREMELY courteous and helpful, but they must have never encountered this issue before because after 3+ hours on the phone with them, they didn't find the solution. (FWIW, I am very impressed by Neat, and this glitch, however frustrating, didn't change my opinion of their product.)

For more better result the best way is to append the GUID with the timestamp (Just to make sure that it stays unique)
Guid.NewGuid().ToString() + DateTime.Now.ToString();

GUID algorithms are usually implemented according to the v4 GUID specification, which is essentially a pseudo-random string. Sadly, these fall into the category of "likely non-unique", from Wikipedia (I don't know why so many people ignore this bit): "... other GUID versions have different uniqueness properties and probabilities, ranging from guaranteed uniqueness to likely non-uniqueness."
The pseudo-random properties of V8's JavaScript Math.random() are TERRIBLE at uniqueness, with collisions often coming after only a few thousand iterations, but V8 isn't the only culprit. I've seen real-world GUID collisions using both PHP and Ruby implementations of v4 GUIDs.
Because it's becoming more and more common to scale ID generation across multiple clients, and clusters of servers, entropy takes a big hit -- the chances of the same random seed being used to generate an ID escalate (time is often used as a random seed in pseudo-random generators), and GUID collisions escalate from "likely non-unique" to "very likely to cause lots of trouble".
To solve this problem, I set out to create an ID algorithm that could scale safely, and make better guarantees against collision. It does so by using the timestamp, an in-memory client counter, client fingerprint, and random characters. The combination of factors creates an additive complexity that is particularly resistant to collision, even if you scale it across a number of hosts:
http://usecuid.org/

I have experienced the GUIDs not being unique during multi-threaded/multi-process unit-testing (too?). I guess that has to do with, all other tings being equal, the identical seeding (or lack of seeding) of pseudo random generators. I was using it for generating unique file names. I found the OS is much better at doing that :)
Trolling alert
You ask if GUIDs are 100% unique. That depends on the number of GUIDs it must be unique among. As the number of GUIDs approach infinity, the probability for duplicate GUIDs approach 100%.

In a more general sense, this is known as the "birthday problem" or "birthday paradox". Wikipedia has a pretty good overview at:
Wikipedia - Birthday Problem
In very rough terms, the square root of the size of the pool is a rough approximation of when you can expect a 50% chance of a duplicate. The article includes a probability table of pool size and various probabilities, including a row for 2^128. So for a 1% probability of collision you would expect to randomly pick 2.6*10^18 128-bit numbers. A 50% chance requires 2.2*10^19 picks, while SQRT(2^128) is 1.8*10^19.
Of course, that is just the ideal case of a truly random process. As others mentioned, a lot is riding on the that random aspect - just how good is the generator and seed? It would be nice if there was some hardware support to assist with this process which would be more bullet-proof except that anything can be spoofed or virtualized. I suspect that might be the reason why MAC addresses/time-stamps are no longer incorporated.

The Answer of "Is a GUID is 100% unique?" is simply "No" .
If You want 100% uniqueness of GUID then do following.
generate GUID
check if that GUID is Exist in your table column where you are looking for uniquensess
if exist then goto step 1 else step 4
use this GUID as unique.

The hardest part is not about generating a duplicated Guid.
The hardest part is designed a database to store all of the generated ones to check if it is actually duplicated.
From WIKI:
For example, the number of random version 4 UUIDs which need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion, computed as follows:
enter image description here
This number is equivalent to generating 1 billion UUIDs per second for about 85 years, and a file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes, many times larger than the largest databases currently in existence, which are on the order of hundreds of petabytes

GUID stands for Global Unique Identifier
In Brief:
(the clue is in the name)
In Detail:
GUIDs are designed to be unique; they are calculated using a random method based on the computers clock and computer itself, if you are creating many GUIDs at the same millisecond on the same machine it is possible they may match but for almost all normal operations they should be considered unique.

I think that when people bury their thoughts and fears in statistics, they tend to forget the obvious. If a system is truly random, then the result you are least likely to expect (all ones, say) is equally as likely as any other unexpected value (all zeros, say). Neither fact prevents these occurring in succession, nor within the first pair of samples (even though that would be statistically "truly shocking"). And that's the problem with measuring chance: it ignores criticality (and rotten luck) entirely.
IF it ever happened, what's the outcome? Does your software stop working? Does someone get injured? Does someone die? Does the world explode?
The more extreme the criticality, the worse the word "probability" sits in the mouth. In the end, chaining GUIDs (or XORing them, or whatever) is what you do when you regard (subjectively) your particular criticality (and your feeling of "luckiness") to be unacceptable. And if it could end the world, then please on behalf of all of us not involved in nuclear experiments in the Large Hadron Collider, don't use GUIDs or anything else indeterministic!

Enough GUIDs to assign one to each and every hypothetical grain of sand on every hypothetical planet around each and every star in the visible universe.
Enough so that if every computer in the world generates 1000 GUIDs a second for 200 years, there might (MIGHT) be a collision.
Given the number of current local uses for GUIDs (one sequence per table per database for instance) it is extraordinarily unlikely to ever be a problem for us limited creatures (and machines with lifetimes that are usually less than a decade if not a year or two for mobile phones).
... Can we close this thread now?

Related

Pros and cons of Flake ids and cryptographic Ids

A distributed system can generate unique ids either by Flake or cryptographic ids (e.g., 128 bit murmur3).
Wonder what are the pros and cons of each method.
I'm going to assume 128-bit ids, kind-a like UUIDs. Let's start at a baseline, though
TL;DR: Use random ids. If and only if you have database performance issues try flake ids.
Auto-increment ids
Auto-increment ids are when your backend system assigns a unique, densely-packed id to each new entity. This is usually done by a database, but not always.
The clear advantage is that the id is guaranteed unique to your system, though 128 bits is probably overkill.
The first disadvantage is that you leak information every time you expose your id. You leak what other ids there are (an attacker can easily guess what to look for). You also leak how busy your system is (your competition now knows how many ids you create in a time period and can infer, say financial information).
The second disadvantage is that your backend is no longer as scalable. You are tied to some slow, less scalable id generator that will always be a bottleneck in a large system.
Random ids
Random ids are when you just generate 128 random bytes. v4 UUIDs 122-bit random ids (e.g. 2bbfb5ba-f5a2-11e7-8c3f-9a214cf093ae). These are also practically unique.
Random ids get rid of both of the disadvantages of auto-increment ids: they leak no information and are infinitely scalable.
The disadvantage comes when storing ids in b-trees (à la databases) because they randomize the memory/disk pages that the tree accesses. This may be a source of slow-downs to your system.
To me this is still the ideal id scheme, and you should have a good reason to move off of it. (i.e. profiler data).
Flake ids
Flake ids are random ids with except that the high k bits are taken from the lower bits of a timestamp. For example, you may get the following three ids in a row, where the top bits are really close together.
2bbfb5baf5a211e78c3f9a214cf093ae
2bbf9d4ec10c41049fb1671d6616b213
2bc6bb66e5964fb59050fcf3beed51b1
While you may leak some information, it isn't much if your k and timestamp granularity are designed well.
But if you mal-design the ids they can be less-than-helpful, either too infrequently updated—leading the b-trees to rely on the top random bits negating the usefulness—or too frequently—where you thrash the database because your updates.
Note: By time granularity, I mean how frequently the low bits of a timestamp change. Depending on your data throughput, you probably want this to be hour, deca-minutes, or minutes. It's a balance.
If you see the ids otherwise semantic-less (i.e. never infer anything from the top bits) then you can change any of these parameters at any time without interruption—even going back to purely random where k = 0.
Cryptographic ids
I'm assuming by this you mean ids have some semantic information encrypted in them. Maybe like hashids?
Disadvantages abound:
You'll have different length ids for different data, unless you have a fixed-length protocol.
You'll be tempted to add more and more info to the ids.
Look random, but no mitigation to add flake-like timestamps to the front
Ids become tied to the system that made it. You may start asking that system for decrypted versions of the id instead of just asking for the data it points to.
Your system burns time decrypting ids to extract data.
You add encryption problems
what happens if the secret-key is leaked? (Better not have too sensitive of data in there, customer name, or heaven forbid a credit card number)
coordinating key rotation.
Small ids like hashid can be brute-forced attack.
As you can see, I am not a fan of semantic ids in general. There are a few places where I use them, though I call them tokens. These don't get stored as keys in a database (or likely not stored anywhere).
For example I use encryption for pagination tokens: encrypted {last-id / context} of a pagination API. I prefer this over having the client pass the last element of the prior page because we keep the database context hidden from the user. It's simpler for everyone, and the encryption is little more than obfuscation (no sensitive information).

I know a GUID is nearly unique. But is it acceptable practice to assume it is unique?

So I completely understand the mathematical unlikeliness of creating two GUID values with the same number. But is it acceptable practice to assume they are unique?
For example I am working with a system for dealing with medical files. When I began to layout the database structure the manager (Not very technically knowledgeable, but likes to think he is and delegates things that would be better left for the more technically minded to decide) says he wants to use GUID's to separate different medical records instead of INT because it is "More unique". I explained how an INT is always going to be unique because it is sequential. I suggested we use BigINT if it will make him feel more comfortable since there are more numbers in that then if the population of the planet increased to the point people would only fit standing next to one another across the planet, but he is insisting on using GUIDs.
My feeling is although it is NEARLY IMPOSSIBLE for there to be a mix up, when dealing with medical records, why take the chance? What is the advantage of using a GUID vs an INT in this scenario?
But is it acceptable practice to assume it is unique?
Yes. That is the entire purpose of UUID, to be used as a reliable unique identifier without centralized coordination. (A GUID is Microsoft’s variation of a UUID.)
Only you (or your appropriate management) can make the final judgement for your particular project.
But if you truly begin to appreciate the enormity of the numerical range of 12x bits (which is actually incomprehensible to the human mind), then you know you can remove the usage of a properly generated UUID from your list of worries.
By “properly generated” I mean things like using the date-time Versions, or for lower number of values use the random (Version 4) if backed by a cryptographically-strong random number generator. Nearly every modern operating system today includes a UUID generation library. Or you can use the OSSP UUID project. Improperly-generated would include roll-your-own implementations you may see bandied about the inter webs.
As for the suggestion to use a database’s auto-incrementing serial/sequence number, every database person I know with years of real-world experience has been burned by those. I’ve never heard of or read of anyone ever having a collision with properly-generated UUIDs. I'm not saying sequences are necessarily bad or don't have their place, I'm just saying that all I can do is laugh when I hear people turn away from a UUID because of some beyond-astrononomically incomprehensibly minute possibility of a UUID collision and choose a sequence instead.
when dealing with medical records, why take the chance?
Your medical system is far far more likely to fail because of faulty data-entry or other human error with handling records. But do you post 3 clerks on duty to independently triple-enter the same data to reduce that chance of error? No. And that risk is incomprehensibly mathematically more likely to happen than a UUID problem. Yet every medical facility I know of accepts that enormous risk without even thinking about it.
What is the advantage of using a GUID vs an INT
The advantages include:
No need to manage your sequences.Examples include: Resetting for development, test, and production environments. Or when restoring a backup. Or fixing the sequence after faults in the system’s serial generation library (my own experience).
Avoid users’ intuited assumptions being confused about missing numbers in the sequence. I've had that conversation far too often.
Federating data between distributed systems.This is the biggest advantage, each system can act independently yet easily share data back and forth with other systems. Without UUIDs, the administrative overhead and the risk of error are bothersome at first and only grow over time.
Downsides include:
Larger memory and storage usage.Serial numbers are usually 32-bit integers, sometimes 64-bit. A good database with native support for UUID as a data type will use 128 bits.
Less readable by humans.One workaround is to just read several of the first or last digits for casual work.
Possibly less efficient indexing, with very large number of entries.
using an incrementing integer ID ensures only uniqueness within its own domain/type, an advantage of UUIDs/GUIDs is that they uniquely identify the owning thing in the entire universe.
So if you have multiple objects, say MedicalRecord, ID = 5, VaccinationForm, ID = 5 then you need to specify both the type ("medicalRecord" or "vaccinationForm" with the ID value of 5) whereas with a GUID you only need to store a single quanta of information to uniquely identify it.
It can be argued that using GUIDs is a waste of space as they are 16 bytes long (a 128-bit value).
If your system is self-contained and not interfacing with others you might want to use SQL Server's "sequence" concept, where instead of each table storing its own identity sequence, the sequence is maintained for all tables, making it a Locally-Unique ID value. You can use any size integer too.
See here: https://msdn.microsoft.com/en-us/library/ff878091.aspx

Handling close-to-impossible collisions on should-be-unique values

There are many systems that depend on the uniqueness of some particular value. Anything that uses GUIDs comes to mind (eg. the Windows registry or other databases), but also things that create a hash from an object to identify it and thus need this hash to be unique.
A hash table usually doesn't mind if two objects have the same hash because the hashing is just used to break down the objects into categories, so that on lookup, not all objects in the table, but only those objects in the same category (bucket) have to be compared for identity to the searched object.
Other implementations however (seem to) depend on the uniqueness. My example (that's what lead me to asking this) is Mercurial's revision IDs. An entry on the Mercurial mailing list correctly states
The odds of the changeset hash
colliding by accident in your first
billion commits is basically zero. But
we will notice if it happens. And
you'll get to be famous as the guy who
broke SHA1 by accident.
But even the tiniest probability doesn't mean impossible. Now, I don't want an explanation of why it's totally okay to rely on the uniqueness (this has been discussed here for example). This is very clear to me.
Rather, I'd like to know (maybe by means of examples from your own work):
Are there any best practices as to covering these improbable cases anyway?
Should they be ignored, because it's more likely that particularly strong solar winds lead to faulty hard disk reads?
Should they at least be tested for, if only to fail with a "I give up, you have done the impossible" message to the user?
Or should even these cases get handled gracefully?
For me, especially the following are interesting, although they are somewhat touchy-feely:
If you don't handle these cases, what do you do against gut feelings that don't listen to probabilities?
If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernonva?
If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernova?
The answer to that is you aren't testing to spot a GUID collision occurring by chance. You're testing to spot a GUID collision occurring because of a bug in the GUID code, or a precondition that the GUID code relies on that you've violated (or been tricked into violating by some attacker), such as in V1 that MAC addresses are unique and time goes forward. Either is considerably more likely than supernova-based bugs.
However, not every client of the GUID code should be testing its correctness, especially in production code. That's what unit tests are supposed to do, so trade off the cost of missing a bug that your actual use would catch but the unit tests didn't, against the cost of second-guessing your libraries all the time.
Note also that GUIDs only work if everyone who is generating them co-operates. If your app generates the IDs on machines you countrol, then you might not need GUIDs anyway - a locally unique ID like an incrementing counter might do you fine. Obviously Mercurial can't use that, hence it uses hashes, but eventually SHA-1 will fall to an attack that generates collisions (or, even worse, pre-images), and they'll have to change.
If your app generates non-hash "GUIDs" on machines you don't control, like clients, then forget about accidental collisions, you're worried about deliberate collisions by malicious clients trying to DOS your server. Protecting yourself against that will probably protect you against accidents anyway.
Or should even these cases get handled gracefully?
The answer to this is probably "no". If you could handle colliding GUIDs gracefully, like a hashtable does, then why bother with GUIDs at all? The whole point of an "identifier" is that if two things have the same ID, then they're the same. If you don't want to treat them the same, just initially direct them into buckets like a hashtable does, then use a different scheme (like a hash).
Given a good 128 bit hash, the probably of colliding with a specific hash value given a random input is:
1 / 2 ** 128 which is approximately equal to 3 * 10 ** -39.
The probability of seeing no collisions (p) given n samples can be computed using the logic used to explain the birthday problem.
p = (2 ** 128)! / (2 ** (128 * n) * (2 ** 128 - n)!)
where !denotes the factorial function. We can then plot the probability of no collisions as the number of samples increases:
Probability of a random SHA-1 collision as the number of samples increases. http://img21.imageshack.us/img21/9186/sha1collision.png
Between 10**17 and 10**18 hashes we begin to see non-trivial possibilities of collision from 0.001% to 0.14% and finally 13% with 10**19 hashes. So in a system with a million, billion, records counting on uniqueness is probably unwise (and such systems are conceivable), but in the vast majority of systems the probability of a collision is so small that you can rely on the uniqueness of your hashes for all practical purposes.
Now, theory aside, it is far more likely that collisions could be introduced into your system either through bugs or someone attacking your system and so onebyone's answer provides good reasons to check for collisions even though the probability of an accidental collision are vanishingly small (that is to say the probability of bugs or malice is much higher than an accidental collision).

Unique, numeric, incremental identifier

I need to generate unique, incremental, numeric transaction id's for each request I make to a certain XML RPC. These numbers only need to be unique across my domain, but will be generated on multiple machines.
I really don't want to have to keep track of this number in a database and deal with row locking etc on every single transaction. I tried to hack this using a microsecond timestamp, but there were collisions with just a few threads - my application needs to support hundreds of threads.
Any ideas would be appreciated.
Edit: What if each transaction id just has to be larger than the previous request's?
If you're going to be using this from hundreds of threads, working on multiple machines, and require an incremental ID, you're going to need some centralized place to store and lock the last generated ID number. This doesn't necessarily have to be in a database, but that would be the most common option. A central server that did nothing but serve IDs could provide the same functionality, but that probably defeats the purpose of distributing this.
If they need to be incremental, any form of timestamp won't be guaranteed unique.
If you don't need them to be incremental, a GUID would work. Potentially doing some type of merge of the timestamp + a hardware ID on each system could give unique identifiers, but the ID number portion would not necessarily be unique.
Could you use a pair of Hardware IDs + incremental timestamps? This would make each specific machine's IDs incremental, but not necessarily be unique across the entire domain.
---- EDIT -----
I don't think using any form of timestamp is going to work for you, for 2 reasons.
First, you'll never be able to guarantee that 2 threads on different machines won't try to schedule at exactly the same time, no matter what resolution of timer you use. At a high enough resolution, it would be unlikely, but not guaranteed.
Second, to make this work, even if you could resolve the collision issue above, you'd have to get every system to have exactly the same clock with microsecond accuracy, which isn't really practical.
This is a very difficult problem, particularly if you don't want to create a performance bottleneck. You say that the IDs need to be 'incremental' and 'numeric' -- is that a concrete business constraint, or one that exists for some other purpose?
If these aren't necessary you can use UUIDs, which most common platforms have libraries for. They allow you to generate many (millions!) of IDs in very short timespans and be quite comfortable with no collisions. The relevant article on wikipedia claims:
In other words, only after generating
1 billion UUIDs every second for the
next 100 years, the probability of
creating just one duplicate would be
about 50%.
If you remove 'incremental' from your requirements, you could use a GUID.
I don't see how you can implement incremental across multiple processes without some sort of common data.
If you target a Windows platform, did you try Interlocked API ?
Google for GUID generators for whatever language you are looking for, and then convert that to a number if you really need it to be numeric. It isn't incremental though.
Or have each thread "reserve" a thousand (or million, or billion) transaction IDs and hand them out one at a time, and "reserve" the next bunch when it runs out. Still not really incremental.
I'm with the GUID crowd, but if that's not possible, could you consider using db4o or SQL Lite over a heavy-weight database?
If each client can keep track of its own "next id", then you could talk to a sentral server and get a range of id's, perhaps a 1000 at a time. Once a client runs out of id's, it will have to talk to the server again.
This would make your system have a central source of id's, and still avoid having to talk to the database for every id.

1-1 mappings for id obfuscation

I'm using sequential ids as primary keys and there are cases where I don't want those ids to be visible to users, for example I might want to avoid urls like ?invoice_id=1234 that allow users to guess how many invoices the system as a whole is issuing.
I could add a database field with a GUID or something conjured up from hash functions, random strings and/or numeric base conversions, but schemes of that kind have three issues that I find annoying:
Having to allocate the extra database field. I know I could use the GUID as my primary key, but my auto-increment integer PK's are the right thing for most purposes, and I don't want to change that.
Having to think about the possibility of hash/GUID collisions. I give my full assent to all the arguments about GUID collisions being as likely as spontaneous combustion or whatever, but disregarding exceptional cases because they're exceptional goes against everything else I've been taught, and it continues to bother me even when I know I should be more bothered about other things.
I don't know how to safely trim hash-based identifiers, so even if my private ids are 16 or 32 bits, I'm stuck with 128 bit generated identifiers that are a nuisance in urls.
I'm interested in 1-1 mappings of an id range, stretchable or shrinkable so that for example 16-bit ids are mapped to 16 bit ids, 32 bit ids mapped to 32 bit ids, etc, and that would stop somebody from trying to guess the total number of ids allocated or the rate of id allocation over a period.
For example, if my user ids are 16 bit integers (0..65535), then an example of a transformation that somewhat obfuscates the id allocation is the function f(x) = (x mult 1001) mod 65536. The internal id sequence of 1, 2, 3 becomes the public id sequence of 1001, 2002, 3003. With a further layer of obfuscation from base conversion, for example to base 36, the sequence becomes 'rt', '1jm', '2bf'. When the system gets a request to the url ?userid=2bf, it converts from base 36 to get 3003 and it applies the inverse transformation g(x) = (x mult 1113) mod 65536 to get back to the internal id=3.
A scheme of that kind is enough to stop casual observation by casual users, but it's easily solvable by someone who's interested enough to try to puzzle it through. Can anyone suggest something that's a bit stronger, but is easily implementable in say PHP without special libraries? This is getting close to a roll-your-own encryption scheme, so maybe there is a proper encryption algorithm that's widely available and has the stretchability property mentioned above?
EDIT: Stepping back a little bit, some discussion at codinghorror about choosing from three kinds of keys - surrogate (guid-based), surrogate (integer-based), natural. In those terms, I'm trying to hide an integer surrogate key from users but I'm looking for something shrinkable that makes urls that aren't too long, which I don't know how to do with the standard 128-bit GUID. Sometimes, as commenter Princess suggests below, the issue can be sidestepped with a natural key.
EDIT 2/SUMMARY:
Given the constraints of the question I asked (stretchability, reversibility, ease of implementation), the most suitable solution so far seems to be the XOR-based obfuscation suggested by Someone and Breton.
It would be irresponsible of me to assume that I can achieve anything more than obfuscation/security by obscurity. The knowledge that it's an integer sequence is probably a crib that any competent attacker would be able to take advantage of.
I've given some more thought to the idea of the extra database field. One advantage of the extra field is that it makes it a lot more straightforward for future programmers who are trying to familiarise themselves with the system by looking at the database. Otherwise they'd have to dig through the source code (or documentation, ahem) to work out how a request to a given url is resolved to a given record in the database.
If I allow the extra database field, then some of the other assumptions in the question become irrelevant (for example the transformation doesn't need to be reversible). That becomes a different question, so I'll leave it there.
I find that simple XOR encryption is best suited for URL obfuscation. You can continue using whatever serial number you are using without change. Further XOR encryption doesn't increase the length of source string. If your text is 22 bytes, the encrypted string will be 22 bytes too. It's not easy enough as to be guessed like rot 13 but not heavy weight like DSE/RSA.
Search the net for PHP XOR encryption to find some implementation. The first one I found is here.
I've toyed with this sort of thing myself, in my amateurish way, and arrived at a kind of kooky number scrambling algorithm, involving mixed radices. Basically I have a function that maps a number between 0-N to another number in the 0-N range. For URLS I then map that number to a couple of english words. (words are easier to remember).
A simplified version of what I do, without mixed radices: You have a number that is 32 bits, so ahead of time, have a passkey which is 32-bits long, and XOR the passkey with your input number. Then shuffle the bits around in a determinate reordering. (possibly based on your passkey).
The nice thing about this is
No collisions, as long as you shuffle and xor the same way each time
No need to store the obfuscated keys in the database
Still use your ordered IDS internally, since you can reverse the obfuscation
You can repeat the operation several times to get more obfuscated results.
if you're up for the mixed radix version, it's basically the same, except that I add the steps of converting the input to a mixed raddix number, using the maximum range's prime factors as the digit's bases. Then I shuffle the digits around, keeping the bases with the digits, and turn it back into a standard integer.
You might find it useful to revisit the idea of using a GUID, because you can construct GUIDs in a way that isn't subject to collision.
Check out the Wikipedia page on GUIDs - the "Type 1" algorithm uses both the MAC address of the PC, and the current date/time as inputs. This guarantees that collisions are simply impossible.
Alternatively, if you create a GUID column in your database as an alternative-key (keep using your auto-increment primary keys), define it as unique. Then, if your GUID generation approach does give a duplicate, you'll get an appropriate error on insert that you can handle.
I saw this question yesterday: how reddit generates an alphanum id
I think it's a reasonably good method (and particularily clever)
it uses Python
def to_base(q, alphabet):
if q < 0: raise ValueError, "must supply a positive integer"
l = len(alphabet)
converted = []
while q != 0:
q, r = divmod(q, l)
converted.insert(0, alphabet[r])
return "".join(converted) or '0'
def to36(q):
return to_base(q, '0123456789abcdefghijklmnopqrstuvwxyz')
Add a char(10) field to your order table... call it 'order_number'.
After you create a new order, randomly generate an integer from 1...9999999999. Check to see if it exists in the database under 'order_number'. If not, update your latest row with this value. If it does exist, pick another number at random.
Use 'order_number' for publicly viewable URLs, maybe always padded with zeros.
There's a race condition concern for when two threads attempt to add the same number at the same time... you could do a table lock if you were really concerned, but that's a big hammer. Add a second check after updating, re-select to ensure it's unique. Call recursively until you get a unique entry. Dwell for a random number of milliseconds between calls, and use the current time as a seed for the random number generator.
Swiped from here.
UPDATED As with using the GUID aproach described by Bevan, if the column is constrained as unique, then you don't have to sweat it. I guess this is no different that using a GUID, except that the customer and Customer Service will have an easier time referring to the order.
I've found a much simpler way. Say you want to map N digits, pseudorandomly to N digits. you find the next highest prime from N, and you make your function
prandmap(x) return x * nextPrime(N) % N
this will produce a function that repeats (or has a period) every N, no number is produced twice until x=N+1. It always starts at 0, but is pseudorandom thereafter.
I honestly thing encrypting/decrypting query string data is a bad approach to this problem. The easiest solution is sending data using POST instead of GET. If users are clicking on links with querystring data, you have to resort to some javascript hacks to send data by POST (keep accessibility in mind for users with Javascript turned off). This doesn't prevent users from viewing source, but at the very least it keeps sensitive from being indexed by search engines, assuming the data you're trying to hide really that sensitive in the first place.
Another approach is to use a natural unique key. For example, if you're issuing invoices to customers on a monthly basis, then "yyyyMM[customerID]" uniquely identifies a particular invoice for a particular user.
From your description, personally, I would start off by working with whatever standard encryption library is available (I'm a Java programmer, but I assume, say, a basic AES encryption library must be available for PHP):
on the database, just key things as you normally would
whenever you need to transmit a key to/from a client, use a fairly strong, standard encryption system (e.g. AES) to convert the key to/from a string of garbage. As your plain text, use a (say) 128-byte buffer containing: a (say) 4-byte key, 60 random bytes, and then a 64-byte medium-quality hash of the previous 64 bytes (see Numerical Recipes for an example)-- obviously when you receive such a string, you decrypt it then check if the hash matches before hitting the DB. If you're being a bit more paranoid, send an AES-encrypted buffer of random bytes with your key in an arbitrary position, plus a secure hash of that buffer as a separate parameter. The first option is probably a reasonable tradeoff between performance and security for your purposes, though, especially when combined with other security measures.
the day that you're processing so many invoices a second that AES encrypting them in transit is too performance expensive, go out and buy yourself a big fat server with lots of CPUs to celebrate.
Also, if you want to hide that the variable is an invoice ID, you might consider calling it something other than "invoice_id".