Where do UUID namespaces come from? - language-agnostic

The UUID specification defines 4 predefined namespaces which it describes as "potentially interesting" - meaning among other things, "if other people have generated UUIDs in this namespace you can verify them":
6ba7b810-9dad-11d1-80b4-00c04fd430c8 for DNS
6ba7b811-9dad-11d1-80b4-00c04fd430c8 for URL
6ba7b812-9dad-11d1-80b4-00c04fd430c8 for ISO OID
6ba7b814-9dad-11d1-80b4-00c04fd430c8 for X.500 DN
Where did these come from?
Specifically:
If I'm generating my own namespace UUID do I need to avoid anything in particular?
I'm aware how big the UUID space is, but does this have any implication on collisions?
Why have they chosen the 4th octet to increase as a kind of UUID 'version number'?
Do my questions imply that I'm missing something fundamental about UUIDs?

First, to be clear, this whole discussion is limited to version 3 & 5 UUIDs. In my (anecdotal) experience, version 4 (random) UUIDs are most commonly used.
RFC 4122's namespaced UUID generation algorithm ambiguously begins:
Allocate a UUID to use as a "name space ID"
There is no other mention of "name space ID" allocation, and neither I nor Python have found any standardized namespaces beyond the four listed in RFC 4122.
So the answer to your first question,
If I'm generating my own namespace UUID do I need to avoid anything in particular?
You only need to avoid the four standard namespaces.
The next question,
I'm aware how big the UUID space is, but does this have any implication on collisions?
Has two parts:
Will UUIDs within your namespace collide? Verbatim from 4122:
The UUIDs generated from two different names in [your] namespace should be different (with very high probability).
Will your namespace UUID collide with other namespaces? I couldn't find a direct answer, since there's no standard for "name space ID" allocation, but the argument in section 4.1.1 seems relevant:
Interoperability, in any form, with variants other than the one
defined here is not guaranteed, and is not likely to be an issue in
practice.
Why have they chosen the 4th octet to increase as a kind of UUID 'version number'?
This one's a bit of a mystery. Luckily, we have a spec for UUIDs, so we can mine them for some insight.
Note that the (0-index) 8th octet starts with 8 in all cases, so we're dealing with RFC 4122 variant UUIDs. Phew.
Now check octet 6 for the version: 1, we're dealing with version 1 time-based UUIDs.
This answer has a handy algorithm for extracting Python datetimes from version 1 UUIDs. Applying the algorithm yields a time on February 4th, 1998. I have yet to find meaning in this date. Incrementing the (0-indexed) 3rd octet adds the smallest encodable time interval (100ns) to the date.
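A minimal Python sketch of that decoding, assuming the standard-library uuid module, whose time attribute exposes the 60-bit version 1 timestamp as a count of 100ns intervals since 1582-10-15. The helper name v1_datetime is made up for illustration.

```python
import datetime
import uuid

# Version 1 timestamps count 100 ns intervals since the Gregorian reform date.
GREGORIAN_EPOCH = datetime.datetime(1582, 10, 15)

def v1_datetime(u):
    """Return the (naive) datetime encoded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("not a version 1 UUID")
    # u.time is in 100 ns units; timedelta resolution is 1 microsecond.
    return GREGORIAN_EPOCH + datetime.timedelta(microseconds=u.time // 10)

# Print each predefined namespace, its decoded time, and its offset
# (in 100 ns ticks) from the DNS namespace.
for name in ("NAMESPACE_DNS", "NAMESPACE_URL", "NAMESPACE_OID", "NAMESPACE_X500"):
    ns = getattr(uuid, name)
    print(name, ns, v1_datetime(ns), ns.time - uuid.NAMESPACE_DNS.time)
```

The last column shows the four namespaces sitting a few 100ns ticks apart, which is consistent with them having been allocated as neighbouring time-based UUIDs.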
Do my questions imply that I'm missing something fundamental about UUIDs?
Nope. There is very little discussion of UUID namespaces, since random UUIDs are so easy.

If I'm generating my own namespace UUID do I need to avoid anything in particular?
No. Your namespace UUID can be any UUID generated in any of the normal ways. So, for example, you would probably want to generate a version 1 or version 4 UUID to use as your namespace UUID. This can be done with the uuidgen program on Linux or OS X. Or you can easily generate a version 1 or version 4 UUID online.
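As a rough sketch of how this might look in Python's standard uuid module: generate a random UUID once, keep it as your namespace, and derive version 5 UUIDs from names. MY_NAMESPACE and record_id are hypothetical names; in practice you would generate the namespace once and hard-code it.

```python
import uuid

# Hypothetical: generate this once, then hard-code the result in your project.
MY_NAMESPACE = uuid.uuid4()

def record_id(name, namespace=MY_NAMESPACE):
    """Version 5 (SHA-1 based) UUID for `name` within `namespace`."""
    return uuid.uuid5(namespace, name)

a = record_id("alice@example.com")
b = record_id("alice@example.com")
c = record_id("bob@example.com")

# Same namespace and name give the same UUID; a different name gives another.
print(a == b, a == c, a.version)
```

Anyone who knows both the namespace UUID and the name can regenerate the same identifier, which is the "verify them" property mentioned in the question.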

Related

I know a GUID is nearly unique. But is it acceptable practice to assume it is unique?

So I completely understand the mathematical unlikeliness of creating two GUID values with the same number. But is it acceptable practice to assume they are unique?
For example, I am working with a system for dealing with medical files. When I began to lay out the database structure, the manager (not very technically knowledgeable, but likes to think he is, and delegates things that would be better left for the more technically minded to decide) said he wants to use GUIDs to separate different medical records instead of INT because it is "more unique". I explained how an INT is always going to be unique because it is sequential. I suggested we use BigINT if it will make him feel more comfortable, since it holds more numbers than would be needed even if the population of the planet increased to the point that people could only fit standing next to one another across the planet, but he is insisting on using GUIDs.
My feeling is although it is NEARLY IMPOSSIBLE for there to be a mix up, when dealing with medical records, why take the chance? What is the advantage of using a GUID vs an INT in this scenario?
But is it acceptable practice to assume it is unique?
Yes. That is the entire purpose of UUID, to be used as a reliable unique identifier without centralized coordination. (A GUID is Microsoft’s variation of a UUID.)
Only you (or your appropriate management) can make the final judgement for your particular project.
But if you truly begin to appreciate the enormity of the numerical range of 12x bits (which is actually incomprehensible to the human mind), then you know you can remove the usage of a properly generated UUID from your list of worries.
By “properly generated” I mean things like using the date-time Versions, or for lower number of values use the random (Version 4) if backed by a cryptographically-strong random number generator. Nearly every modern operating system today includes a UUID generation library. Or you can use the OSSP UUID project. Improperly-generated would include roll-your-own implementations you may see bandied about the inter webs.
As for the suggestion to use a database’s auto-incrementing serial/sequence number, every database person I know with years of real-world experience has been burned by those. I’ve never heard of or read of anyone ever having a collision with properly-generated UUIDs. I'm not saying sequences are necessarily bad or don't have their place, I'm just saying that all I can do is laugh when I hear people turn away from a UUID because of some beyond-astronomically, incomprehensibly minute possibility of a UUID collision and choose a sequence instead.
when dealing with medical records, why take the chance?
Your medical system is far far more likely to fail because of faulty data-entry or other human error with handling records. But do you post 3 clerks on duty to independently triple-enter the same data to reduce that chance of error? No. And that risk is incomprehensibly mathematically more likely to happen than a UUID problem. Yet every medical facility I know of accepts that enormous risk without even thinking about it.
What is the advantage of using a GUID vs an INT
The advantages include:
No need to manage your sequences. Examples include: resetting for development, test, and production environments. Or when restoring a backup. Or fixing the sequence after faults in the system’s serial generation library (my own experience).
Avoiding users’ confusion about missing numbers in the sequence. I've had that conversation far too often.
Federating data between distributed systems. This is the biggest advantage: each system can act independently yet easily share data back and forth with other systems. Without UUIDs, the administrative overhead and the risk of error are bothersome at first and only grow over time.
Downsides include:
Larger memory and storage usage. Serial numbers are usually 32-bit integers, sometimes 64-bit. A good database with native support for UUID as a data type will use 128 bits.
Less readable by humans. One workaround is to just read several of the first or last digits for casual work.
Possibly less efficient indexing, with very large number of entries.
Using an incrementing integer ID ensures uniqueness only within its own domain/type; an advantage of UUIDs/GUIDs is that they uniquely identify the owning thing in the entire universe.
So if you have multiple objects, say MedicalRecord, ID = 5, VaccinationForm, ID = 5, then you need to specify both the type ("medicalRecord" or "vaccinationForm") and the ID value of 5, whereas with a GUID you only need to store a single quantum of information to uniquely identify it.
It can be argued that using GUIDs is a waste of space as they are 16 bytes long (a 128-bit value).
If your system is self-contained and not interfacing with others you might want to use SQL Server's "sequence" concept, where instead of each table storing its own identity sequence, the sequence is maintained for all tables, making it a Locally-Unique ID value. You can use any size integer too.
See here: https://msdn.microsoft.com/en-us/library/ff878091.aspx

Are there any inobvious ways of abusing GUIDs?

GUIDs are typically used for uniquely identifying all kinds of entities - requests from external systems, files, whatever. Work like magic - you call a "GiveMeGuid()" (UuidCreate() on Windows) function - and a fresh new GUID is here at your service.
Given my code really calls that "GiveMeGuid()" function each time I need a new GUID is there any not so obvious way to misuse it?
Just found an answer to an old question: How deterministic Are .Net GUIDs?. Requoting it:
It's not a complete answer, but I can tell you that the 13th hex digit is always 4 because it denotes the version of the algorithm used to generate the GUID (id est, v4); also, and I quote Wikipedia:
Cryptanalysis of the WinAPI GUID generator shows that, since the sequence of V4 GUIDs is pseudo-random, given the initial state one can predict up to the next 250 000 GUIDs returned by the function UuidCreate. This is why GUIDs should not be used in cryptography, e.g., as random keys.
So, if you got lucky and get same seed, you'll break 250k mirrors in sequence. To quote another Wikipedia piece:
While each generated GUID is not guaranteed to be unique, the total number of unique keys (2^128 or 3.4×10^38) is so large that the probability of the same number being generated twice is extremely small.
Bottom line: maybe one form of misuse is to assume a GUID is always unique.
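The fixed version digit is easy to observe. The sketch below uses Python's uuid.uuid4(), which is a different generator from the WinAPI UuidCreate discussed in the quote, so it only illustrates the layout of version 4 values, not the predictability issue. The helper name version_digit is made up.

```python
import uuid

def version_digit(u):
    """13th hex digit of the canonical form, which encodes the UUID version."""
    return u.hex[12]

samples = [uuid.uuid4() for _ in range(1000)]

# Collect the distinct version digits seen across the sample.
print({version_digit(u) for u in samples})
# Check whether this small sample contains any duplicates.
print(len(set(samples)) == len(samples))
```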
It depends. Some implementations of GUID generation are time-dependent, so calling CreateGuid in quick succession MAY create clashing GUIDs.
edit: I now remember the problem. I was once working on some php code where the GUID generating function was reseeding the RNG with the system time each call. Don't do this.
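A small Python sketch of that failure mode (the original was PHP; bad_guid is a made-up stand-in, not the code in question). Reseeding from a clock with one-second resolution means every call within the same second starts from the same state.

```python
import random

def bad_guid(now):
    """Anti-pattern: reseed the generator from the clock on every call."""
    rng = random.Random(int(now))        # seed has one-second resolution
    return "%032x" % rng.getrandbits(128)

# Two calls within the same second produce the same "unique" ID.
first = bad_guid(1_700_000_000.10)
second = bad_guid(1_700_000_000.90)
print(first == second)
```

Seeding once at startup, or better, using the operating system's random source, avoids this.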
The only way I can see of misusing a Guid is trying to interpret the value in some logical manner. Not that it really invites you to do so, which is one of the characteristics around Guid's that I really like.
Some GUIDs include an identifier of the machine they were generated on, so they can be used in client/server environments, but some don't. If yours don't, be careful about using them in, for instance, a database that multiple clients access.
Maybe the entropy could be manipulated by playing with some parameters used to generate the GUIDs in the first place (e.g. interface identifiers).

What exactly is GUID? Why and where I should use it?

What exactly is GUID? Why and where I should use it?
I've seen references to GUID in a lot of places, and in wikipedia,
but it is not very clear telling you where to use it.
If someone could answer this, it would be nice.
Thanks
GUID technically stands for globally unique identifier. What it is, actually, is a 128 bit structure that is unlikely to ever repeat or create a collision. If you do the maths, the domain of values is in the undecillions.
Use guids when you have multiple independent systems or clients generating ID's that need to be unique.
For example, if I have 5 client apps creating and inserting transactional data into a table that has a unique constraint on the ID, then use guids. This prevents having to force a client to request an issued ID from the server first.
This is also great for object factories and systems that have numerous object types stored in different tables where you don't want any 2 objects to have the same ID. This makes caching and scavenging schemas much easier to implement.
A GUID is a "Globally Unique IDentifier". You use it anywhere that you need an identifier that is guaranteed to be different from every other.
Usually, you only need a value to be "locally unique" -- the Primary Key identity in a database table, for example, needs only be different from the other rows in that table, but can be the same as the ID in other tables. (No need for a GUID here.)
GUIDs are generally used when you will be defining an ID that must be different from an ID that someone else (outside of your control) will be defining. One such place is the interface identifier on ActiveX controls. Anyone can create an ActiveX control without knowing which other controls it will be used alongside, and there's nothing to stop everyone from giving their controls the same name. GUIDs keep them distinct.
Time-based (version 1) GUIDs are a combination of the time (in very small fractions of a second), so each is assured to be different from any GUID defined before or later, and a number defining your location (sometimes taken from the MAC address of your network card), so it is assured to be different from any other GUID defined right now by someone else.
They are also sometimes known as UUIDs (universally unique ID).
As addition to all the other answers, here is an online GUID generator:
http://www.guidgenerator.com/
What is a GUID?
GUID (or UUID) is an acronym for
'Globally Unique Identifier' (or
'Universally Unique Identifier'). It
is a 128-bit integer number used to
identify resources. The term GUID is
generally used by developers working
with Microsoft technologies, while
UUID is used everywhere else.
How unique is a GUID?
128-bits is big enough and the
generation algorithm is unique enough
that if 1,000,000,000 GUIDs per
second were generated for 1 year the
probability of a duplicate would be
only 50%. Or if every human on Earth
generated 600,000,000 GUIDs there
would only be a 50% probability of a
duplicate.
How are GUIDs used?
GUIDs are used in software development
as database keys, component
identifiers, or just about anywhere
else a truly unique identifier is
required. GUIDs are also used to
identify all interfaces and objects in
COM programming.
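The "50%" figures quoted above can be sanity-checked with the birthday approximation. The sketch below assumes version 4 GUIDs with 122 random bits and a perfectly uniform generator; both are assumptions, not something the quoted page states. Under those assumptions a billion GUIDs per second for one year comes out well below 50%, and 50% is reached after roughly 86 years at that rate, so the quoted numbers are best read as loose illustrations.

```python
import math

RANDOM_BITS = 122                      # assumption: version 4, 122 random bits
SPACE = 2 ** RANDOM_BITS

def collision_probability(n):
    """Birthday approximation: P(at least one duplicate among n random IDs)."""
    return -math.expm1(-n * n / (2 * SPACE))

def ids_for_probability(p):
    """Approximate number of IDs needed to reach collision probability p."""
    return math.sqrt(2 * SPACE * math.log(1 / (1 - p)))

SECONDS_PER_YEAR = 365 * 24 * 3600
one_year = 10**9 * SECONDS_PER_YEAR    # a billion IDs per second for a year

print(collision_probability(one_year))
# Years needed, at a billion IDs per second, to reach a 50% chance.
print(ids_for_probability(0.5) / 10**9 / SECONDS_PER_YEAR)
```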
A GUID is a "Globally Unique ID". Also called a UUID (Universally Unique ID).
It's basically a 128 bit number that is generated in a way (see RFC 4122 http://www.ietf.org/rfc/rfc4122.txt) that makes it nearly impossible for duplicates to be generated. This way, I can generate GUIDs without some third party organization having to give them to me to ensure they are unique.
One widespread use of GUIDs is as identifiers for COM entities on Windows (classes, typelibs, interfaces, etc.). Using GUIDs, developers could build their COM components without going to Microsoft to get a unique identifier. Even though identifying COM entities is a major use of GUIDs, they are used for many things that need unique identifiers. Some developers will generate GUIDs for database records to provide them an ID that can be used even when they must be unique across many different databases.
Generally, you can think of a GUID as a serial number that can be generated by anyone at anytime and they'll know that the serial number will be unique.
Other ways to get unique identifiers include getting a domain name. To ensure the uniqueness of domain names, you have to get it from some organization (ultimately administered by ICANN).
Because GUIDs can be unwieldy (from a human readable point of view they are a string of hexadecimal numbers, usually grouped like so: aaaaaaaa-bbbb-cccc-dddd-ffffffffffff), some namespaces that need unique names across different organization use another scheme (often based on Internet domain names).
So, the namespace for Java packages by convention starts with the organization's domain name (reversed) followed by names that are determined in some organization-specific way. For example, a Java package might be named:
com.example.jpackage
This means that dealing with name collisions becomes the responsibility of each organization.
XML namespaces are also made unique in a similar way - by convention, someone creating an XML namespace is supposed to make it 'underneath' a registered domain name under their control. For example:
xmlns="http://www.w3.org/1999/xhtml"
Another way that unique IDs have been managed is for Ethernet MAC addresses. A company that makes Ethernet cards has to get a block of addresses assigned to them by the IEEE (I think it's the IEEE). In this case the scheme has worked pretty well, and even if a manufacturer screws up and issues cards with duplicate MAC addresses, things will still work OK as long as those cards are not on the same subnet, since beyond a subnet, only the IP address is used to route packets. Although there are some other uses of MAC addresses that might be affected - one of the algorithms for generating GUIDs uses the MAC address as one parameter. This GUID generation method is not as widely used anymore because it is considered a privacy threat.
One example of a scheme to come up with unique identifiers that didn't work very well was the Microsoft provided ID's for 'VxD' drivers in Windows 9x. Developers of third party VxD drivers were supposed to ask Microsoft for a set of IDs to use for any drivers the third party wrote. This way, Microsoft could ensure there were not duplicate IDs. Unfortunately, many driver writers never bothered, and simply used whatever ID was in the example VxD they used as a starting point. I'm not sure how much trouble this caused - I don't think VxD ID uniqueness was absolutely necessary, but it probably affected some functionality in some APIs.
GUID or UUID (globally vs Universally) Unique IDentifier is, well, a unique ID :) When you need something really unique machine generated, there are libraries to get you one.
See GUID on wikipedia for details.
As to when you don't need a GUID, it is when a counter that you control (one way or another, like a SERIAL SQL type or a sequence) gets incremented. Indexing a "text" value (GUID in textual form) or a 128 bit binary value (which a GUID is) is far more expensive than an integer.
Someone said they are conceptually 128-bit random values, and that is substantially true, but having done a little reading on UUID (GUID usually refers to Microsoft's implementation of UUID), I see that there are several different UUID versions, and most of them are not actually random. So it is possible to generate a UUID for a machine (or something else) and be able to reliably repeat that process to obtain the same UUID down the road, which is important for some applications.
For me it's easier to think of them as simply "128-bit random values". Which is essentially what they are. There are some algorithms for including a bit of information in a few digits of your GUID (thus the random part gets a bit smaller), but still they are pretty large almost-random values.
Since they are so large, it is extremely unlikely that two GUIDs will ever be generated that are the same. For all practical purposes, every GUID ever generated is unique in the world.
I'll leave it to you to figure out where to use them, but other answers already have some examples. Let your imagination run wild. :)
Can be a hard thing to understand because of all the maths that goes on behind generating them. Think of it as a unique id. You can get Visual Studio to generate one for you, or .NET if you happen to be using C# or one of the many other applications or websites. They are considered unique because there is such a silly small chance you'll see the same one twice that it isn't worth considering.
128-bit Globally Unique ID. You can generate GUIDs from now until sunset and you'll never generate the same GUID twice, and neither will anyone else. They are used a lot with COM.
As for example of something you would use them for, we use them in one of our products. Our users can generate categories and cards on various devices. We want to make sure that we don't confuse a category made on one device with a category created on a different one, so it's important that IDs are unique no matter who generates them, where they generate them, and when they generate them. So we use GUIDs (actually we use our own scheme using 64-bit numbers but they are similar to GUIDs).
I worked on an ACD call center system a few years back where we wanted to gather call detail records from multiple call processors into a single database. I set up a column in MS SQL to generate a GUID for the database key rather than using a system-generated sequential ID (identity column). Back then, this required setting the default value to NewID (or generating it in the code, but the NewID() function was safer). Of course, having a large value for a key may raise a few eyebrows, but I would rather give up the space than risk a collision.
I didn't see anyone address using a GUID as a database key so I thought it might help to know you could do that too.
GUID stands for "Globally Unique Identifier" and you use it when you want to have, erm, a Globally Unique Identifier.
In RSS feeds, for example, you should have a GUID for each item in the feed. That way, the feed reader software can keep track of whether you have read that item or not. Without a GUID, it would be impossible to tell.
A GUID differs from something like a database ID in that no matter who creates an object -- you, me, the guy down the street -- our GUIDs will always be different. There should be no collisions using a GUID.
You'll also see the term UUID, which stands for "Universally Unique Identifier." There is essentially no difference between the two. UUID is the more appropriate term. GUID is the term used by Microsoft.
If you need to generate an identifier that needs to be unique during the whole lifetime of your application, you use a GUID.
Imagine you have a server with sessions, if you give each session a GUID, you are certain that it will be unique for every session ever created by your server. This is useful for tracing bugs.
One particularly useful application of GUIDs that I've found is using them to track unique visitors in webapps where the visitors are anonymous (i.e. not logged in or registered).
GUID = Global Unique IDentifier.
Use it when you want to uniquely identify something in a global context.
This generator can be handy.
The Wikipedia article on GUIDs is pretty clear on what they are used for - maybe rephrasing your question would help - what do you need a GUID for?
To actually see what it looks like on a windows computer, go to cmd or powershell.
Powershell => [guid]::NewGuid()
CMD => powershell [guid]::NewGuid()

1-1 mappings for id obfuscation

I'm using sequential ids as primary keys and there are cases where I don't want those ids to be visible to users, for example I might want to avoid urls like ?invoice_id=1234 that allow users to guess how many invoices the system as a whole is issuing.
I could add a database field with a GUID or something conjured up from hash functions, random strings and/or numeric base conversions, but schemes of that kind have three issues that I find annoying:
Having to allocate the extra database field. I know I could use the GUID as my primary key, but my auto-increment integer PK's are the right thing for most purposes, and I don't want to change that.
Having to think about the possibility of hash/GUID collisions. I give my full assent to all the arguments about GUID collisions being as likely as spontaneous combustion or whatever, but disregarding exceptional cases because they're exceptional goes against everything else I've been taught, and it continues to bother me even when I know I should be more bothered about other things.
I don't know how to safely trim hash-based identifiers, so even if my private ids are 16 or 32 bits, I'm stuck with 128 bit generated identifiers that are a nuisance in urls.
I'm interested in 1-1 mappings of an id range, stretchable or shrinkable so that for example 16-bit ids are mapped to 16 bit ids, 32 bit ids mapped to 32 bit ids, etc, and that would stop somebody from trying to guess the total number of ids allocated or the rate of id allocation over a period.
For example, if my user ids are 16 bit integers (0..65535), then an example of a transformation that somewhat obfuscates the id allocation is the function f(x) = (x mult 1001) mod 65536. The internal id sequence of 1, 2, 3 becomes the public id sequence of 1001, 2002, 3003. With a further layer of obfuscation from base conversion, for example to base 36, the sequence becomes 'rt', '1jm', '2bf'. When the system gets a request to the url ?userid=2bf, it converts from base 36 to get 3003 and it applies the inverse transformation g(x) = (x mult 1113) mod 65536 to get back to the internal id=3.
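The arithmetic in this example can be reproduced in a few lines of Python (the question asks about PHP; Python is used here only as an illustration). The function names encode_id and decode_id are made up, and pow(a, -1, m) for the modular inverse assumes Python 3.8 or later.

```python
MODULUS = 2 ** 16          # 16-bit id range from the example
MULTIPLIER = 1001          # must be odd so it is invertible modulo 2**16
INVERSE = pow(MULTIPLIER, -1, MODULUS)   # modular inverse (Python 3.8+)

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n):
    """Render a non-negative integer in base 36."""
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out or "0"

def encode_id(internal):
    """Internal id -> public base-36 string."""
    return to_base36(internal * MULTIPLIER % MODULUS)

def decode_id(public):
    """Public base-36 string -> internal id."""
    return int(public, 36) * INVERSE % MODULUS

print(INVERSE, [encode_id(i) for i in (1, 2, 3)], decode_id("2bf"))
```

Because the multiplier is coprime to the modulus, the mapping is a bijection on 0..65535, so no two internal ids share a public id.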
A scheme of that kind is enough to stop casual observation by casual users, but it's easily solvable by someone who's interested enough to try to puzzle it through. Can anyone suggest something that's a bit stronger, but is easily implementable in say PHP without special libraries? This is getting close to a roll-your-own encryption scheme, so maybe there is a proper encryption algorithm that's widely available and has the stretchability property mentioned above?
EDIT: Stepping back a little bit, some discussion at codinghorror about choosing from three kinds of keys - surrogate (guid-based), surrogate (integer-based), natural. In those terms, I'm trying to hide an integer surrogate key from users but I'm looking for something shrinkable that makes urls that aren't too long, which I don't know how to do with the standard 128-bit GUID. Sometimes, as commenter Princess suggests below, the issue can be sidestepped with a natural key.
EDIT 2/SUMMARY:
Given the constraints of the question I asked (stretchability, reversibility, ease of implementation), the most suitable solution so far seems to be the XOR-based obfuscation suggested by Someone and Breton.
It would be irresponsible of me to assume that I can achieve anything more than obfuscation/security by obscurity. The knowledge that it's an integer sequence is probably a crib that any competent attacker would be able to take advantage of.
I've given some more thought to the idea of the extra database field. One advantage of the extra field is that it makes it a lot more straightforward for future programmers who are trying to familiarise themselves with the system by looking at the database. Otherwise they'd have to dig through the source code (or documentation, ahem) to work out how a request to a given url is resolved to a given record in the database.
If I allow the extra database field, then some of the other assumptions in the question become irrelevant (for example the transformation doesn't need to be reversible). That becomes a different question, so I'll leave it there.
I find that simple XOR encryption is best suited for URL obfuscation. You can continue using whatever serial number you are using without change. Furthermore, XOR encryption doesn't increase the length of the source string: if your text is 22 bytes, the encrypted string will be 22 bytes too. It's not as easily guessed as ROT13, but not as heavyweight as DES/RSA.
Search the net for PHP XOR encryption to find some implementation. The first one I found is here.
I've toyed with this sort of thing myself, in my amateurish way, and arrived at a kind of kooky number scrambling algorithm, involving mixed radices. Basically I have a function that maps a number between 0-N to another number in the 0-N range. For URLS I then map that number to a couple of english words. (words are easier to remember).
A simplified version of what I do, without mixed radices: You have a number that is 32 bits, so ahead of time, have a passkey which is 32-bits long, and XOR the passkey with your input number. Then shuffle the bits around in a determinate reordering. (possibly based on your passkey).
The nice thing about this is
No collisions, as long as you shuffle and xor the same way each time
No need to store the obfuscated keys in the database
Still use your ordered IDS internally, since you can reverse the obfuscation
You can repeat the operation several times to get more obfuscated results.
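A minimal Python sketch of the simplified XOR-and-shuffle idea, for 32-bit inputs. The key value and the particular bit permutation are arbitrary choices made up for this illustration, not the ones used by the answer's author, and this is obfuscation only, not encryption.

```python
BITS = 32
KEY = 0x5A3C96E1                      # hypothetical secret passkey
# Hypothetical fixed bit permutation: output bit i is taken from input bit
# PERM[i]. Multiplying by 7 (coprime to 32) visits every position once.
PERM = [(7 * i + 3) % BITS for i in range(BITS)]

def scramble(n):
    """XOR with the key, then move each bit to its permuted position."""
    n ^= KEY
    out = 0
    for i, src in enumerate(PERM):
        out |= ((n >> src) & 1) << i
    return out

def unscramble(m):
    """Undo the bit permutation, then undo the XOR."""
    n = 0
    for i, src in enumerate(PERM):
        n |= ((m >> i) & 1) << src
    return n ^ KEY

print([hex(scramble(i)) for i in (1, 2, 3)])
```

Inputs are assumed to lie in 0..2^32-1. Both steps are bijections on that range, which is why no collisions can occur and why the original id can always be recovered.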
If you're up for the mixed radix version, it's basically the same, except that I add the steps of converting the input to a mixed radix number, using the maximum range's prime factors as the digits' bases. Then I shuffle the digits around, keeping the bases with the digits, and turn it back into a standard integer.
You might find it useful to revisit the idea of using a GUID, because you can construct GUIDs in a way that isn't subject to collision.
Check out the Wikipedia page on GUIDs - the "Type 1" algorithm uses both the MAC address of the PC, and the current date/time as inputs. Provided the MAC address is genuinely unique and the clock is not set backwards, this makes collisions practically impossible.
Alternatively, if you create a GUID column in your database as an alternative-key (keep using your auto-increment primary keys), define it as unique. Then, if your GUID generation approach does give a duplicate, you'll get an appropriate error on insert that you can handle.
I saw this question yesterday: how reddit generates an alphanum id
I think it's a reasonably good method (and particularly clever)
it uses Python
def to_base(q, alphabet):
    # Convert a non-negative integer to a string in the given alphabet.
    if q < 0:
        raise ValueError("must supply a positive integer")
    l = len(alphabet)
    converted = []
    while q != 0:
        # Peel off the least significant digit and prepend it.
        q, r = divmod(q, l)
        converted.insert(0, alphabet[r])
    return "".join(converted) or '0'

def to36(q):
    return to_base(q, '0123456789abcdefghijklmnopqrstuvwxyz')
Add a char(10) field to your order table... call it 'order_number'.
After you create a new order, randomly generate an integer from 1...9999999999. Check to see if it exists in the database under 'order_number'. If not, update your latest row with this value. If it does exist, pick another number at random.
Use 'order_number' for publicly viewable URLs, maybe always padded with zeros.
There's a race condition concern for when two threads attempt to add the same number at the same time... you could do a table lock if you were really concerned, but that's a big hammer. Add a second check after updating, re-select to ensure it's unique. Call recursively until you get a unique entry. Dwell for a random number of milliseconds between calls, and use the current time as a seed for the random number generator.
Swiped from here.
UPDATED: As with the GUID approach described by Bevan, if the column is constrained as unique, then you don't have to sweat it. I guess this is no different than using a GUID, except that the customer and Customer Service will have an easier time referring to the order.
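A rough sketch of the unique-column variant, using Python's built-in sqlite3 purely for illustration; the table layout and the create_order helper are assumptions. It also swaps the time-seeded generator suggested above for the secrets module, and lets the UNIQUE constraint detect clashes instead of a separate check-then-update step.

```python
import secrets
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " order_number TEXT NOT NULL UNIQUE)"      # the constraint does the checking
)

def create_order(conn, max_attempts=10):
    """Insert an order with a random 10-digit public number; retry on clashes."""
    for _ in range(max_attempts):
        number = "%010d" % (secrets.randbelow(9_999_999_999) + 1)
        try:
            with conn:                          # commits, or rolls back on error
                conn.execute(
                    "INSERT INTO orders (order_number) VALUES (?)", (number,)
                )
            return number
        except sqlite3.IntegrityError:
            continue                            # number already taken: try again
    raise RuntimeError("could not allocate a unique order number")

print(create_order(conn), create_order(conn))
```

Because the database enforces uniqueness atomically, two threads picking the same number cannot both succeed, so no table lock or re-select is needed.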
I've found a much simpler way. Say you want to map the numbers 0..N-1 pseudorandomly onto 0..N-1. You find the next prime above N, and you make your function
prandmap(x) return x * nextPrime(N) % N
This produces a function that repeats (has a period) every N; no number is produced twice until x = N. It always starts at 0, but is pseudorandom thereafter.
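A quick Python sketch to check the claim for one value of N. The helpers is_prime and next_prime are naive implementations written for this illustration, and N is passed as a second argument rather than being fixed.

```python
def is_prime(n):
    """Trial division; fine for small illustrative values."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_prime(n):
    """Smallest prime strictly greater than n."""
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate

def prandmap(x, n):
    return x * next_prime(n) % n

N = 1000
outputs = [prandmap(x, N) for x in range(N)]
# Check that every value in 0..N-1 appears exactly once, and peek at the start.
print(sorted(outputs) == list(range(N)), outputs[:5])
```

Note that the effective multiplier is nextPrime(N) mod N, which can be small: it is 9 for N = 1000 and 1 for N = 10 (where the map is the identity), so the output can look quite regular.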
I honestly think encrypting/decrypting query string data is a bad approach to this problem. The easiest solution is sending data using POST instead of GET. If users are clicking on links with querystring data, you have to resort to some javascript hacks to send data by POST (keep accessibility in mind for users with Javascript turned off). This doesn't prevent users from viewing source, but at the very least it keeps sensitive data from being indexed by search engines, assuming the data you're trying to hide is really that sensitive in the first place.
Another approach is to use a natural unique key. For example, if you're issuing invoices to customers on a monthly basis, then "yyyyMM[customerID]" uniquely identifies a particular invoice for a particular user.
From your description, personally, I would start off by working with whatever standard encryption library is available (I'm a Java programmer, but I assume, say, a basic AES encryption library must be available for PHP):
on the database, just key things as you normally would
whenever you need to transmit a key to/from a client, use a fairly strong, standard encryption system (e.g. AES) to convert the key to/from a string of garbage. As your plain text, use a (say) 128-byte buffer containing: a (say) 4-byte key, 60 random bytes, and then a 64-byte medium-quality hash of the previous 64 bytes (see Numerical Recipes for an example)-- obviously when you receive such a string, you decrypt it then check if the hash matches before hitting the DB. If you're being a bit more paranoid, send an AES-encrypted buffer of random bytes with your key in an arbitrary position, plus a secure hash of that buffer as a separate parameter. The first option is probably a reasonable tradeoff between performance and security for your purposes, though, especially when combined with other security measures.
the day that you're processing so many invoices a second that AES encrypting them in transit is too performance expensive, go out and buy yourself a big fat server with lots of CPUs to celebrate.
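The plaintext layout described above (4-byte key, 60 random bytes, 64-byte hash of those 64 bytes) can be sketched as follows. This is only an illustration under assumptions: SHA-512 is swapped in as the 64-byte hash, the function names are hypothetical, and the AES step itself is left out because Python's standard library has no AES implementation. In practice the 128-byte buffer would be encrypted before sending and decrypted before `unpack_key` is called.

```python
import hashlib
import hmac
import secrets
import struct

BUFFER_SIZE = 128  # 4-byte key + 60 random bytes + 64-byte hash


def pack_key(key: int) -> bytes:
    """Build the 128-byte plaintext buffer for a 32-bit database key."""
    head = struct.pack(">I", key) + secrets.token_bytes(60)  # 64 bytes
    return head + hashlib.sha512(head).digest()              # + 64 bytes


def unpack_key(buffer: bytes) -> int:
    """Verify the hash, then return the key. Raise ValueError if tampered."""
    if len(buffer) != BUFFER_SIZE:
        raise ValueError("unexpected buffer length")
    head, digest = buffer[:64], buffer[64:]
    if not hmac.compare_digest(hashlib.sha512(head).digest(), digest):
        raise ValueError("hash mismatch, refusing to query the database")
    return struct.unpack(">I", head[:4])[0]


buffer = pack_key(123456)
assert unpack_key(buffer) == 123456
```

The hash check is what lets the server reject garbage input before touching the database.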
Also, if you want to hide that the variable is an invoice ID, you might consider calling it something other than "invoice_id".

Is a GUID unique 100% of the time?

Is a GUID unique 100% of the time?
Will it stay unique over multiple threads?
While each generated GUID is not guaranteed to be unique, the total number of unique keys (2^128 or 3.4×10^38) is so large that the probability of the same number being generated twice is very small. For example, consider the observable universe, which contains about 5×10^22 stars; every star could then have 6.8×10^15 universally unique GUIDs.
From Wikipedia.
These are some good articles on how a GUID is made (for .NET) and how you could get the same guid in the right situation.
https://ericlippert.com/2012/04/24/guid-guide-part-one/
https://ericlippert.com/2012/04/30/guid-guide-part-two/
https://ericlippert.com/2012/05/07/guid-guide-part-three/
If you are scared of the same GUID values then put two of them next to each other.
Guid.NewGuid().ToString() + Guid.NewGuid().ToString();
If you are too paranoid then put three.
The simple answer is yes.
Raymond Chen wrote a great article on GUIDs and why substrings of GUIDs are not guaranteed unique. The article goes into some depth about the way GUIDs are generated and the data they use to ensure uniqueness, which should go some way toward explaining why they are :-)
As a side note, I was playing around with Volume GUIDs in Windows XP. This is a very obscure partition layout with three disks and fourteen volumes.
\\?\Volume{23005604-eb1b-11de-85ba-806d6172696f}\ (F:)
\\?\Volume{23005605-eb1b-11de-85ba-806d6172696f}\ (G:)
\\?\Volume{23005606-eb1b-11de-85ba-806d6172696f}\ (H:)
\\?\Volume{23005607-eb1b-11de-85ba-806d6172696f}\ (J:)
\\?\Volume{23005608-eb1b-11de-85ba-806d6172696f}\ (D:)
\\?\Volume{23005609-eb1b-11de-85ba-806d6172696f}\ (P:)
\\?\Volume{2300560b-eb1b-11de-85ba-806d6172696f}\ (K:)
\\?\Volume{2300560c-eb1b-11de-85ba-806d6172696f}\ (L:)
\\?\Volume{2300560d-eb1b-11de-85ba-806d6172696f}\ (M:)
\\?\Volume{2300560e-eb1b-11de-85ba-806d6172696f}\ (N:)
\\?\Volume{2300560f-eb1b-11de-85ba-806d6172696f}\ (O:)
\\?\Volume{23005610-eb1b-11de-85ba-806d6172696f}\ (E:)
\\?\Volume{23005611-eb1b-11de-85ba-806d6172696f}\ (R:)
                                     | | | | |
                                     | | | | +-- 6f = o
                                     | | | +---- 69 = i
                                     | | +------ 72 = r
                                     | +-------- 61 = a
                                     +---------- 6d = m
It's not that the GUIDs are very similar but the fact that all GUIDs have the string "mario" in them. Is that a coincidence or is there an explanation behind this?
Now, when googling for part 4 in the GUID I found approx 125.000 hits with volume GUIDs.
Conclusion: When it comes to Volume GUIDs they aren't as unique as other GUIDs.
It should not happen. However, when .NET is under a heavy load, it is possible to get duplicate guids. I have two different web servers using two different sql servers. I went to merge the data and found I had 15 million guids and 7 duplicates.
Yes, a GUID should always be unique. It is based on both hardware and time, plus a few extra bits to make sure it's unique. I'm sure it's theoretically possible to end up with two identical ones, but extremely unlikely in a real-world scenario.
Here's a great article by Raymond Chen on Guids:
https://blogs.msdn.com/oldnewthing/archive/2008/06/27/8659071.aspx
Guids are statistically unique. The odds of two different clients generating the same Guid are infinitesimally small (assuming no bugs in the Guid generating code). You may as well worry about your processor glitching due to a cosmic ray and deciding that 2+2=5 today.
Multiple threads allocating new guids will get unique values, but you should check that the function you are calling is thread safe. Which environment is this in?
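The question does not name an environment, so as an illustration only, here is a rough way to exercise a generator from several threads in Python. `uuid.uuid4()` draws from the operating system's random source; the helper names and the thread counts are arbitrary assumptions. A passing run shows no duplicates in this sample, not a proof of uniqueness.

```python
import threading
import uuid


def generate_concurrently(thread_count: int = 8, per_thread: int = 1000) -> list:
    """Generate random UUIDs from several threads and return them all."""
    results = []
    lock = threading.Lock()

    def worker() -> None:
        local = [uuid.uuid4() for _ in range(per_thread)]
        with lock:  # guard the shared list explicitly
            results.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


ids = generate_concurrently()
print(len(ids), "generated,", len(set(ids)), "distinct")
```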
Eric Lippert has written a very interesting series of articles about GUIDs.
There are on the order of 2^30 personal computers in the world (and of course lots of hand-held devices or non-PC computing devices that have more or less the same levels of computing power, but let's ignore those). Let's assume that we put all those PCs in the world to the task of generating GUIDs; if each one can generate, say, 2^20 GUIDs per second then after only about 2^72 seconds -- one hundred and fifty trillion years -- you'll have a very high chance of generating a collision with your specific GUID. And the odds of collision get pretty good after only thirty trillion years.
GUID Guide, part one
GUID Guide, part two
GUID Guide, part three
Theoretically, no, they are not unique. It's possible to generate an identical guid over and over. However, the chances of it happening are so low that you can assume they are unique.
I've read before that the chances are so low that you really should stress about something else--like your server spontaneously combusting or other bugs in your code. That is, assume it's unique and don't build in any code to "catch" duplicates--spend your time on something more likely to happen (i.e. anything else).
I made an attempt to describe the usefulness of GUIDs to my blog audience (non-technical family members). From there (via Wikipedia), the odds of generating a duplicate GUID:
1 in 2^128
1 in 340 undecillion (don’t worry, undecillion is not on the
quiz)
1 in 3.4 × 10^38
1 in 340,000,000,000,000,000,000,000,000,000,000,000,000
No one seems to mention the actual math of the probability of it occurring.
First, let's assume we can use the entire 128-bit space (Guid v4 only uses 122 bits).
We know that the general probability of NOT getting a duplicate in n picks is:
(1 - 1/2^128)(1 - 2/2^128)...(1 - (n-1)/2^128)
Because 2^128 is much much larger than n, we can approximate this to:
(1 - 1/2^128)^(n(n-1)/2)
And because we can assume n is much much larger than 0, we can approximate that to:
(1 - 1/2^128)^(n^2/2)
Now we can equate this to a threshold probability, let's say 1%. Since the expression is the probability of NOT having a duplicate, this is the point at which a duplicate has become 99% likely:
(1 - 1/2^128)^(n^2/2) = 0.01
Which we solve for n and get:
n = sqrt(2 * log 0.01 / log(1 - 1/2^128))
Which Wolfram Alpha gets to be 5.598318 × 10^19
To put that number into perspective, let's take 10000 machines, each having a 4-core CPU running at 4 GHz and spending 10000 cycles to generate a Guid and doing nothing else. It would then take ~111 years before they reach that number of Guids.
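The numbers above can be reproduced without Wolfram Alpha. The sketch below is one way to do it in Python; the function name is made up for illustration. Note that `1 - 1/2**128` rounds to exactly 1.0 in double precision, so `math.log1p` is used to evaluate log(1 - x) for the tiny x.

```python
import math


def picks_for_no_duplicate_probability(p_no_duplicate: float, bits: int = 128) -> float:
    """Solve (1 - 1/2^bits)^(n^2/2) = p_no_duplicate for n."""
    # log1p(-x) evaluates log(1 - x) accurately when x is far below
    # the double-precision epsilon.
    return math.sqrt(2 * math.log(p_no_duplicate) / math.log1p(-(2.0 ** -bits)))


n = picks_for_no_duplicate_probability(0.01)

# 10000 machines * 4 cores * 4e9 cycles per second / 10000 cycles per Guid
guids_per_second = 10_000 * 4 * 4e9 / 10_000
years = n / guids_per_second / (365.25 * 24 * 3600)

print(f"n = {n:.6e}")
print(f"years = {years:.1f}")
```

Running it gives an n of about 5.598 × 10^19 and a duration of roughly 111 years, in line with the figures in the answer.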
From http://www.guidgenerator.com/online-guid-generator.aspx
What is a GUID?
GUID (or UUID) is an acronym for 'Globally Unique Identifier' (or 'Universally Unique Identifier'). It is a 128-bit integer number used to identify resources. The term GUID is generally used by developers working with Microsoft technologies, while UUID is used everywhere else.
How unique is a GUID?
128-bits is big enough and the generation algorithm is unique enough that if 1,000,000,000 GUIDs per second were generated for 1 year the probability of a duplicate would be only 50%. Or if every human on Earth generated 600,000,000 GUIDs there would only be a 50% probability of a duplicate.
Is a GUID unique 100% of the time?
Not guaranteed, since there are several ways of generating one. However, you can try to calculate the chance of creating two GUIDs that are identical and you get the idea: a GUID has 128 bits, hence, there are 2^128 distinct GUIDs – much more than there are stars in the known universe. Read the Wikipedia article for more details.
MSDN:
There is a very low probability that the value of the new Guid is all zeroes or equal to any other Guid.
If your system clock is set properly and hasn't wrapped around, and if your NIC has its own MAC (i.e. you haven't set a custom MAC) and your NIC vendor has not been recycling MACs (which they are not supposed to do but which has been known to occur), and if your system's GUID generation function is properly implemented, then your system will never generate duplicate GUIDs.
If everyone on earth who is generating GUIDs follows those rules then your GUIDs will be globally unique.
In practice, the number of people who break the rules is low, and their GUIDs are unlikely to "escape". Conflicts are statistically improbable.
I experienced a duplicate GUID.
I use the Neat Receipts desktop scanner and it comes with proprietary database software. The software has a sync to cloud feature, and I kept getting an error upon syncing. A gander at the logs revealed the awesome line:
"errors":[{"code":1,"message":"creator_guid: is already taken","guid":"C83E5734-D77A-4B09-B8C1-9623CAC7B167"}]}
I was a bit in disbelief, but sure enough, when I found a way into my local neatworks database and deleted the record containing that GUID, the error stopped occurring.
So to answer your question with anecdotal evidence, no. A duplicate is possible. But it is likely that the reason it happened wasn't due to chance, but due to standard practice not being adhered to in some way. (I am just not that lucky) However, I cannot say for sure. It isn't my software.
Their customer support was EXTREMELY courteous and helpful, but they must have never encountered this issue before because after 3+ hours on the phone with them, they didn't find the solution. (FWIW, I am very impressed by Neat, and this glitch, however frustrating, didn't change my opinion of their product.)
For a better result, append a timestamp to the GUID (just to make sure that it stays unique):
Guid.NewGuid().ToString() + DateTime.Now.ToString();
GUID algorithms are usually implemented according to the v4 GUID specification, which is essentially a pseudo-random string. Sadly, these fall into the category of "likely non-unique", from Wikipedia (I don't know why so many people ignore this bit): "... other GUID versions have different uniqueness properties and probabilities, ranging from guaranteed uniqueness to likely non-uniqueness."
The pseudo-random properties of V8's JavaScript Math.random() are TERRIBLE at uniqueness, with collisions often coming after only a few thousand iterations, but V8 isn't the only culprit. I've seen real-world GUID collisions using both PHP and Ruby implementations of v4 GUIDs.
Because it's becoming more and more common to scale ID generation across multiple clients, and clusters of servers, entropy takes a big hit -- the chances of the same random seed being used to generate an ID escalate (time is often used as a random seed in pseudo-random generators), and GUID collisions escalate from "likely non-unique" to "very likely to cause lots of trouble".
To solve this problem, I set out to create an ID algorithm that could scale safely, and make better guarantees against collision. It does so by using the timestamp, an in-memory client counter, client fingerprint, and random characters. The combination of factors creates an additive complexity that is particularly resistant to collision, even if you scale it across a number of hosts:
http://usecuid.org/
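To make the combination of factors concrete, here is a loose Python sketch of an ID built from a timestamp, an in-memory counter, a host fingerprint and random characters. It is not the cuid algorithm or its format; every name, field width and separator below is an assumption chosen for illustration.

```python
import itertools
import os
import secrets
import socket
import threading
import time
import zlib

_counter = itertools.count()
_counter_lock = threading.Lock()


def _fingerprint() -> str:
    """Four hex digits derived from the host name and process ID."""
    raw = f"{socket.gethostname()}:{os.getpid()}".encode()
    return format(zlib.crc32(raw) & 0xFFFF, "04x")


def make_id() -> str:
    """Combine timestamp, counter, fingerprint and randomness into one ID."""
    with _counter_lock:
        count = next(_counter) % 0x10000        # wraps after 65536 IDs
    timestamp_ms = time.time_ns() // 1_000_000  # milliseconds since the epoch
    return "-".join([
        format(timestamp_ms, "x"),
        format(count, "04x"),
        _fingerprint(),
        secrets.token_hex(4),                   # 32 random bits
    ])


print(make_id())
```

Within one process the counter alone separates any 65536 consecutive IDs; the fingerprint and random part are there for the multi-host case the answer describes.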
I have experienced GUIDs not being unique during multi-threaded/multi-process unit testing (too?). I guess that has to do with, all other things being equal, the identical seeding (or lack of seeding) of pseudo-random generators. I was using them for generating unique file names. I found the OS is much better at doing that :)
Trolling alert
You ask if GUIDs are 100% unique. That depends on the number of GUIDs it must be unique among. As the number of GUIDs approaches infinity, the probability of duplicate GUIDs approaches 100%.
In a more general sense, this is known as the "birthday problem" or "birthday paradox". Wikipedia has a pretty good overview at:
Wikipedia - Birthday Problem
In very rough terms, the square root of the size of the pool is a rough approximation of when you can expect a 50% chance of a duplicate. The article includes a probability table of pool size and various probabilities, including a row for 2^128. So for a 1% probability of collision you would expect to randomly pick 2.6*10^18 128-bit numbers. A 50% chance requires 2.2*10^19 picks, while SQRT(2^128) is 1.8*10^19.
Of course, that is just the ideal case of a truly random process. As others mentioned, a lot is riding on that random aspect - just how good is the generator and seed? It would be nice if there were some hardware support to assist with this process, which would be more bullet-proof, except that anything can be spoofed or virtualized. I suspect that might be the reason why MAC addresses/time-stamps are no longer incorporated.
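The figures quoted from the birthday-problem table can be checked with the usual approximation n ≈ sqrt(2 N ln(1/(1 - p))) for a pool of size N. A minimal Python sketch, with function names invented for this example and assuming an ideal uniform generator:

```python
import math


def picks_for_collision_probability(p: float, bits: int) -> float:
    """Approximate number of random picks for collision probability p."""
    pool_size = 2.0 ** bits
    # -log1p(-p) is ln(1 / (1 - p)), computed accurately for small p.
    return math.sqrt(2.0 * pool_size * -math.log1p(-p))


def collision_probability(picks: float, bits: int) -> float:
    """Approximate collision probability after the given number of picks."""
    pool_size = 2.0 ** bits
    # -expm1(-x) is 1 - exp(-x), computed accurately for small x.
    return -math.expm1(-(picks * picks) / (2.0 * pool_size))


print(f"1%  : {picks_for_collision_probability(0.01, 128):.2e}")
print(f"50% : {picks_for_collision_probability(0.50, 128):.2e}")
print(f"p at sqrt(2^128) picks: {collision_probability(2.0 ** 64, 128):.3f}")
```

The last line shows why the square root is only a rough rule of thumb: at exactly 2^64 picks the chance of a collision is about 39%, somewhat short of 50%.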
The answer to "Is a GUID 100% unique?" is simply "No".
If you want 100% uniqueness of a GUID, then do the following:
1. Generate a GUID.
2. Check whether that GUID exists in the table column where you need uniqueness.
3. If it exists, go to step 1; otherwise go to step 4.
4. Use this GUID as unique.
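The steps above can be sketched in Python as follows. An in-memory `set` stands in for the table column, and the function name and the `generate` parameter are hypothetical. With a real database, a unique constraint on the column plus a retry on violation would be the more robust way to get the same effect, since a separate check-then-insert can race between clients.

```python
import uuid
from typing import Callable, Set


def new_unique_guid(existing: Set[uuid.UUID],
                    generate: Callable[[], uuid.UUID] = uuid.uuid4) -> uuid.UUID:
    """Return a GUID that is not already in `existing`, and record it."""
    while True:
        candidate = generate()          # step 1: generate a GUID
        if candidate in existing:       # steps 2-3: already present, retry
            continue
        existing.add(candidate)         # step 4: safe to use
        return candidate


known: Set[uuid.UUID] = set()
first = new_unique_guid(known)
second = new_unique_guid(known)
print(first != second, len(known))
```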
The hardest part is not about generating a duplicated Guid.
The hardest part is designing a database to store all of the generated ones so you can check whether a new one is actually a duplicate.
From Wikipedia:
For example, the number of random version 4 UUIDs which need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion, computed as follows:
n ≈ 1/2 + sqrt(1/4 + 2 × ln(2) × 2^122) ≈ 2.71 × 10^18
This number is equivalent to generating 1 billion UUIDs per second for about 85 years, and a file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes, many times larger than the largest databases currently in existence, which are on the order of hundreds of petabytes
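The quoted figures are easy to sanity-check. The short Python sketch below uses the standard birthday-bound expression for a 50% collision chance over the 122 random bits of a version 4 UUID; the variable names are chosen for this example, and exabytes are taken as decimal (10^18 bytes).

```python
import math

V4_RANDOM_BITS = 122  # a version 4 UUID has 122 random bits

# Picks needed for a 50% chance of at least one collision.
n = 0.5 + math.sqrt(0.25 + 2 * math.log(2) * 2.0 ** V4_RANDOM_BITS)

seconds = n / 1e9                        # at 1 billion UUIDs per second
years = seconds / (365.25 * 24 * 3600)
exabytes = n * 16 / 1e18                 # 16 bytes per UUID

print(f"n        = {n:.3e}")
print(f"years    = {years:.0f}")
print(f"exabytes = {exabytes:.1f}")
```

This gives roughly 2.71 × 10^18 UUIDs, about 86 years and about 43 exabytes, close to the rounded numbers in the quote.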
GUID stands for Globally Unique Identifier
In Brief:
(the clue is in the name)
In Detail:
GUIDs are designed to be unique. They are calculated using a random method based on the computer's clock and the computer itself. If you are creating many GUIDs in the same millisecond on the same machine, it is possible they may match, but for almost all normal operations they should be considered unique.
I think that when people bury their thoughts and fears in statistics, they tend to forget the obvious. If a system is truly random, then the result you are least likely to expect (all ones, say) is equally as likely as any other unexpected value (all zeros, say). Neither fact prevents these occurring in succession, nor within the first pair of samples (even though that would be statistically "truly shocking"). And that's the problem with measuring chance: it ignores criticality (and rotten luck) entirely.
IF it ever happened, what's the outcome? Does your software stop working? Does someone get injured? Does someone die? Does the world explode?
The more extreme the criticality, the worse the word "probability" sits in the mouth. In the end, chaining GUIDs (or XORing them, or whatever) is what you do when you regard (subjectively) your particular criticality (and your feeling of "luckiness") to be unacceptable. And if it could end the world, then please on behalf of all of us not involved in nuclear experiments in the Large Hadron Collider, don't use GUIDs or anything else nondeterministic!
Enough GUIDs to assign one to each and every hypothetical grain of sand on every hypothetical planet around each and every star in the visible universe.
Enough so that if every computer in the world generates 1000 GUIDs a second for 200 years, there might (MIGHT) be a collision.
Given the number of current local uses for GUIDs (one sequence per table per database for instance) it is extraordinarily unlikely to ever be a problem for us limited creatures (and machines with lifetimes that are usually less than a decade if not a year or two for mobile phones).
... Can we close this thread now?