I am developing a chat feature for my website.
In my MySQL database I use a 64bit signed integer for the chat_id attribute - which is an autoincrement.
So I am worried that once my system obtains a lot of traffic, the chat_id value could overflow.
So my question is: can a 64-bit integer overflow in practice?
And if so, is there a 128-bit integer in MySQL, JavaScript and PHP?
A 64-bit signed integer has a maximum value of 9,223,372,036,854,775,807. It is extremely unlikely that the service you are building will ever have this many chat sessions. Not because I don't believe it will be popular one day, but because this number is incredibly large.
To give an indication how large this number is (from Wikipedia):
In Java the time in milliseconds is a 64-bit signed integer. It will take 292 million years to overflow...
So, no, you won't need more than a 64-bit signed integer for a unique chat_id, all the more so because you use an auto-increment.
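To get a feel for the numbers above, here is a minimal sketch of the arithmetic. The insert rates are assumptions picked for illustration, not measurements of any real system.

```python
# How long would a signed 64-bit counter last at a given insert rate?
MAX_INT64 = 2**63 - 1                      # 9,223,372,036,854,775,807
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_until_overflow(ids_per_second: float) -> float:
    """Years needed to use up every positive signed 64-bit value."""
    return MAX_INT64 / ids_per_second / SECONDS_PER_YEAR


# One ID per millisecond (the Java clock case quoted above).
print(f"{years_until_overflow(1_000):,.0f} years")
# A hypothetical one million new chats per second.
print(f"{years_until_overflow(1_000_000):,.0f} years")
```

Even at the hypothetical rate of a million new chats per second, the counter would last on the order of hundreds of thousands of years.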
The largest 64 bit unsigned integer is a staggering 18,446,744,073,709,551,615 - this is 18 quintillion, 446 quadrillion, 744 trillion, 73 billion, 709 million, 551 thousand and 615
If by chat ID you're referring to each unique chat started between two users you'd struggle to reach this amount even if you held the largest messaging service in the entire world.
If for some reason you believe you need values beyond this, you could store the ID in a VARCHAR as a combination of digits and letters instead, e.g. 00-09, then 0a-0Z, and so on, counting up systematically for as long as you like (this is really crude and not a good idea in this exact form, but you get the point). That way, if you need more space, you can just increase the size of the field.
When it comes to being an ID value, no, its theoretical limit can not really be reached in practical usage for your scenario.
However, if you tried to store a randomly generated number in it instead, such as a UUID, you could get matching values once you reach a large number of entries. (A UUID requires more than 64 bits to store, but I've seen people try to "speed it up" by downgrading it into a long number and storing it as such.) Please note this part about UUIDs: most folks just observe how theoretically improbable a match looks, then they pump tens of millions of records in within a month and find they have collisions on a regular basis, because the chance of a collision rises dramatically as more and more of such "fake UUIDs" are generated.
To sum it up - for your auto-increment, it's fine. But don't try to "squeeze" a 128bit UUID into it and randomize it for some (future) migrations or something. "Hacks" often backfire.
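How quickly the collision chance grows can be estimated with the usual birthday approximation. The sketch below assumes the values are uniformly random over the whole bit width. A truncated or badly seeded UUID can be much less random than that, in which case the real numbers would be worse.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Birthday approximation: chance that n uniformly random
    `bits`-bit values contain at least one duplicate."""
    space = 2**bits
    # 1 - exp(-n(n-1) / (2 * space)); expm1 keeps precision for tiny results
    return -math.expm1(-n * (n - 1) / (2 * space))


for bits in (32, 64, 122):
    for n in (10**6, 10**8, 10**10):
        print(f"{bits:>3} bits, n={n:.0e}: {collision_probability(n, bits):.3e}")
```

The probability grows roughly with the square of the number of records, so halving the number of random bits hurts far more than intuition suggests.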
Does anyone know if PayPal's "PNREF" (returned from zero-dollar authorizations) is always 12 characters?
This I ask because I want to optimize my mySQL storage.
And also, I trust SO's answer more than PP's :-D
Don't "optimize" your storage. Not only do server-grade terabyte sized drives cost just a few hundred dollars, making the cost of storing a handful of bytes nearly zero, but VARCHAR(255) columns only take up as much space as you have content because they are variable length.
If you ran a million transactions and saved ten bytes on each, you've saved all of ten megabytes of data, or about $0.0001 worth of storage. I'm presuming if you've run a million transactions you can afford the bytes. The PayPal fees will be literally several quadrillion times higher.
In actuality there's zero savings between 12 characters in VARCHAR(12) and VARCHAR(255). Internally these are represented as a single length byte plus N bytes for the content. For regular 7-bit values that means 13 bytes per entry.
The only difference is you're arbitrarily limiting the former to 12 characters and will get truncation errors (if this flag is set, as it is on newer versions of MySQL) if you insert longer values, or you'll lose data and have no idea until it's probably too late to fix it.
Just use VARCHAR(255) so that your code doesn't explode when PayPal decides today's the day to use 14 characters. These things can change without warning and without any logical reason.
So I completely understand the mathematical unlikeliness of creating two GUID values with the same number. But is it acceptable practice to assume they are unique?
For example, I am working on a system for dealing with medical files. When I began to lay out the database structure, the manager (not very technically knowledgeable, but likes to think he is, and delegates things that would be better left for the more technically minded to decide) said he wants to use GUIDs to separate different medical records instead of INT because it is "more unique". I explained how an INT is always going to be unique because it is sequential. I suggested we use BIGINT if it would make him feel more comfortable, since it has more numbers than we would need even if the population of the planet increased to the point that people could only fit standing next to one another across the planet, but he is insisting on using GUIDs.
My feeling is although it is NEARLY IMPOSSIBLE for there to be a mix up, when dealing with medical records, why take the chance? What is the advantage of using a GUID vs an INT in this scenario?
But is it acceptable practice to assume it is unique?
Yes. That is the entire purpose of UUID, to be used as a reliable unique identifier without centralized coordination. (A GUID is Microsoft’s variation of a UUID.)
Only you (or your appropriate management) can make the final judgement for your particular project.
But if you truly begin to appreciate the enormity of the numerical range of 12x bits (which is actually incomprehensible to the human mind), then you know you can remove the usage of a properly generated UUID from your list of worries.
By “properly generated” I mean things like using the date-time Versions, or for lower number of values use the random (Version 4) if backed by a cryptographically-strong random number generator. Nearly every modern operating system today includes a UUID generation library. Or you can use the OSSP UUID project. Improperly-generated would include roll-your-own implementations you may see bandied about the inter webs.
As for the suggestion to use a database’s auto-incrementing serial/sequence number, every database person I know with years of real-world experience has been burned by those. I’ve never heard of or read of anyone ever having a collision with properly-generated UUIDs. I'm not saying sequences are necessarily bad or don't have their place, I'm just saying that all I can do is laugh when I hear people turn away from a UUID because of some beyond-astronomically, incomprehensibly minute possibility of a UUID collision and choose a sequence instead.
when dealing with medical records, why take the chance?
Your medical system is far far more likely to fail because of faulty data-entry or other human error with handling records. But do you post 3 clerks on duty to independently triple-enter the same data to reduce that chance of error? No. And that risk is incomprehensibly mathematically more likely to happen than a UUID problem. Yet every medical facility I know of accepts that enormous risk without even thinking about it.
What is the advantage of using a GUID vs an INT
The advantages include:
No need to manage your sequences. Examples include: Resetting for development, test, and production environments. Or when restoring a backup. Or fixing the sequence after faults in the system’s serial generation library (my own experience).
Avoid users’ intuited assumptions being confused about missing numbers in the sequence. I've had that conversation far too often.
Federating data between distributed systems. This is the biggest advantage, each system can act independently yet easily share data back and forth with other systems. Without UUIDs, the administrative overhead and the risk of error are bothersome at first and only grow over time.
Downsides include:
Larger memory and storage usage. Serial numbers are usually 32-bit integers, sometimes 64-bit. A good database with native support for UUID as a data type will use 128 bits.
Less readable by humans. One workaround is to just read several of the first or last digits for casual work.
Possibly less efficient indexing, with very large number of entries.
using an incrementing integer ID ensures only uniqueness within its own domain/type, an advantage of UUIDs/GUIDs is that they uniquely identify the owning thing in the entire universe.
So if you have multiple objects, say MedicalRecord, ID = 5, VaccinationForm, ID = 5 then you need to specify both the type ("medicalRecord" or "vaccinationForm" with the ID value of 5) whereas with a GUID you only need to store a single piece of information to uniquely identify it.
It can be argued that using GUIDs is a waste of space as they are 16 bytes long (a 128-bit value).
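The size difference is easy to check. A minimal sketch using Python's standard uuid module (the variable name is only illustrative):

```python
import uuid

record_id = uuid.uuid4()           # random (version 4) UUID

print(record_id.version)           # the UUID version number
print(len(record_id.bytes))        # raw size in bytes
print(len(str(record_id)))         # size of the usual hyphenated text form
print(len(record_id.hex))          # size of the hex form without hyphens
```

The raw value is 16 bytes, but the hyphenated text form is 36 characters, so a database that stores GUIDs as text rather than as a native or binary type pays more than double.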
If your system is self-contained and not interfacing with others you might want to use SQL Server's "sequence" concept, where instead of each table storing its own identity sequence, the sequence is maintained for all tables, making it a Locally-Unique ID value. You can use any size integer too.
See here: https://msdn.microsoft.com/en-us/library/ff878091.aspx
I have been working with SQL for about 2 years now, and it has always been on my mind.
Best practices say assign the length of the column to what you are expecting
The SQL wants a specific row dedicated as a Primary key but it's also in best practice for an A_i field... But what length to assign it? If left blank it defaults to 11, which represents 999,999,999
Which seems fine, but best practices also state never to actually clear anything from a database; just append a 0 or 1 to represent deleted, this is for archival/recovery purposes.. Also can be used for auditing of what users want to clear..
Take this example:
I have a website which is around for years, following the best practices in terms of not deleting anything from the database; my database/website traffic is very heavy with tons of unique users/visitors per day.
Now, if I leave the SQL default length of 11, what would happen if my table reaches the maximum length and then another user decides to register? It would throw an error and not continue, which would cause a small amount of downtime for new users, because a database administrator would have to log in to the SQL server and change the length. That is not much effort, but it IS effort which can be avoided during early development.
What I do, when creating a table is give a length of 255 which in the back of my mind, something is telling me 'this is not good practice' but it avoids the very slim possibility of the example stated above.
When compared to a text field, which does not have a specified length, why cannot this be the same in terms of an A_I field.
Don't get me wrong, I completely understand the data types available.
I have performed an amount of research both through google and SO, but the results have pointed to questions about altering the table to increase the current length. This is not what i'm asking for.
Overall:
So overall, what I am trying to ask; what is the ideal length for an A_I field? to minimize the slim risk of an error being thrown if it maxes out the length but also keeping best practices in mind.
The reason is simple:
as a primary key, the ID should be sized to fit what you are expecting.
If you specify a varchar, the drawback is a bigger index,
which could slow down both read and write performance.
int(11) does not store up to 99,999,999,999; the 11 is only a display width.
A signed INT only stores up to 2,147,483,647.
If you set it to unsigned,
then it allows 4,294,967,295 records (4 billion!).
Facebook has just over 1 billion users!
So I don't see anyone having a user base four times bigger anytime soon...
A couple of the best practices are explained very well in this article:
http://net.tutsplus.com/tutorials/other/top-20-mysql-best-practices/
Smaller Columns Are Faster
Integers are fixed length, but varchars are not.
Index and Use Same Column Types for Joins
Analyze your application or system. Estimate how many users will register per day and per year. Once you know this, decide how "safe" you want to be, in terms of how many years you want the system to run without needing to modify it. Say 100 years is enough. Then multiply the expected number of annual user registrations by 100 and make sure the PK is large enough to accommodate that many values.
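That estimate can be written down as a few lines of code. This is only a sketch: the function name and the sign-up figures are made up for illustration, while the upper bounds are those of MySQL's unsigned integer types.

```python
# Unsigned upper bounds of MySQL's integer column types.
UNSIGNED_MAX = [
    ("TINYINT UNSIGNED", 2**8 - 1),
    ("SMALLINT UNSIGNED", 2**16 - 1),
    ("MEDIUMINT UNSIGNED", 2**24 - 1),
    ("INT UNSIGNED", 2**32 - 1),
    ("BIGINT UNSIGNED", 2**64 - 1),
]


def smallest_key_type(rows_per_year: int, years: int) -> str:
    """Pick the smallest type that can number rows_per_year * years rows."""
    needed = rows_per_year * years
    for name, max_value in UNSIGNED_MAX:
        if needed <= max_value:
            return name
    raise ValueError("more rows than any MySQL integer type can number")


# Hypothetical load: 10 million sign-ups a year, planned for 100 years.
print(smallest_key_type(10_000_000, 100))
```

With those assumed figures a plain INT UNSIGNED is still enough. BIGINT only becomes necessary past roughly 4.29 billion rows.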
I'm using sequential ids as primary keys and there are cases where I don't want those ids to be visible to users, for example I might want to avoid urls like ?invoice_id=1234 that allow users to guess how many invoices the system as a whole is issuing.
I could add a database field with a GUID or something conjured up from hash functions, random strings and/or numeric base conversions, but schemes of that kind have three issues that I find annoying:
Having to allocate the extra database field. I know I could use the GUID as my primary key, but my auto-increment integer PK's are the right thing for most purposes, and I don't want to change that.
Having to think about the possibility of hash/GUID collisions. I give my full assent to all the arguments about GUID collisions being as likely as spontaneous combustion or whatever, but disregarding exceptional cases because they're exceptional goes against everything else I've been taught, and it continues to bother me even when I know I should be more bothered about other things.
I don't know how to safely trim hash-based identifiers, so even if my private ids are 16 or 32 bits, I'm stuck with 128 bit generated identifiers that are a nuisance in urls.
I'm interested in 1-1 mappings of an id range, stretchable or shrinkable so that for example 16-bit ids are mapped to 16 bit ids, 32 bit ids mapped to 32 bit ids, etc, and that would stop somebody from trying to guess the total number of ids allocated or the rate of id allocation over a period.
For example, if my user ids are 16 bit integers (0..65535), then an example of a transformation that somewhat obfuscates the id allocation is the function f(x) = (x mult 1001) mod 65536. The internal id sequence of 1, 2, 3 becomes the public id sequence of 1001, 2002, 3003. With a further layer of obfuscation from base conversion, for example to base 36, the sequence becomes 'rt', '1jm', '2bf'. When the system gets a request to the url ?userid=2bf, it converts from base 36 to get 3003 and it applies the inverse transformation g(x) = (x mult 1113) mod 65536 to get back to the internal id=3.
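The transformation above can be sketched in a few lines. Python is used here only for brevity (the same arithmetic works in PHP), and the function names are illustrative.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"
MODULUS = 65536                          # 16-bit id space, as in the example
MULTIPLIER = 1001                        # odd, so it is invertible mod 2**16
INVERSE = pow(MULTIPLIER, -1, MODULUS)   # modular inverse (Python 3.8+)


def to_base36(n: int) -> str:
    """Render a non-negative integer in base 36."""
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits)) or "0"


def encode(internal_id: int) -> str:
    """Internal id -> public id: multiply mod 65536, then base 36."""
    return to_base36(internal_id * MULTIPLIER % MODULUS)


def decode(public_id: str) -> int:
    """Public id -> internal id: parse base 36, multiply by the inverse."""
    return int(public_id, 36) * INVERSE % MODULUS


print([encode(i) for i in (1, 2, 3)])
print(decode("2bf"))
```

Any odd multiplier has an inverse modulo a power of two, which is what makes the mapping one-to-one and reversible.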
A scheme of that kind is enough to stop casual observation by casual users, but it's easily solvable by someone who's interested enough to try to puzzle it through. Can anyone suggest something that's a bit stronger, but is easily implementable in say PHP without special libraries? This is getting close to a roll-your-own encryption scheme, so maybe there is a proper encryption algorithm that's widely available and has the stretchability property mentioned above?
EDIT: Stepping back a little bit, some discussion at codinghorror about choosing from three kinds of keys - surrogate (guid-based), surrogate (integer-based), natural. In those terms, I'm trying to hide an integer surrogate key from users but I'm looking for something shrinkable that makes urls that aren't too long, which I don't know how to do with the standard 128-bit GUID. Sometimes, as commenter Princess suggests below, the issue can be sidestepped with a natural key.
EDIT 2/SUMMARY:
Given the constraints of the question I asked (stretchability, reversibility, ease of implementation), the most suitable solution so far seems to be the XOR-based obfuscation suggested by Someone and Breton.
It would be irresponsible of me to assume that I can achieve anything more than obfuscation/security by obscurity. The knowledge that it's an integer sequence is probably a crib that any competent attacker would be able to take advantage of.
I've given some more thought to the idea of the extra database field. One advantage of the extra field is that it makes it a lot more straightforward for future programmers who are trying to familiarise themselves with the system by looking at the database. Otherwise they'd have to dig through the source code (or documentation, ahem) to work out how a request to a given url is resolved to a given record in the database.
If I allow the extra database field, then some of the other assumptions in the question become irrelevant (for example the transformation doesn't need to be reversible). That becomes a different question, so I'll leave it there.
I find that simple XOR encryption is best suited for URL obfuscation. You can continue using whatever serial number you are using without change. Furthermore, XOR encryption doesn't increase the length of the source string: if your text is 22 bytes, the encrypted string will be 22 bytes too. It's not as easy to guess as ROT13, but not as heavyweight as DES/RSA.
Search the net for PHP XOR encryption to find some implementation. The first one I found is here.
I've toyed with this sort of thing myself, in my amateurish way, and arrived at a kind of kooky number scrambling algorithm, involving mixed radices. Basically I have a function that maps a number between 0-N to another number in the 0-N range. For URLS I then map that number to a couple of english words. (words are easier to remember).
A simplified version of what I do, without mixed radices: You have a number that is 32 bits, so ahead of time, have a passkey which is 32-bits long, and XOR the passkey with your input number. Then shuffle the bits around in a determinate reordering. (possibly based on your passkey).
The nice thing about this is
No collisions, as long as you shuffle and xor the same way each time
No need to store the obfuscated keys in the database
Still use your ordered IDS internally, since you can reverse the obfuscation
You can repeat the operation several times to get more obfuscated results.
If you're up for the mixed radix version, it's basically the same, except that I add the steps of converting the input to a mixed radix number, using the maximum range's prime factors as the digits' bases. Then I shuffle the digits around, keeping the bases with the digits, and turn it back into a standard integer.
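The simplified version (XOR with a passkey, then a fixed reordering of the bits) might look like the sketch below. The key value and all names are placeholders. Deriving the permutation from a seeded random.Random is just one convenient way to get a repeatable shuffle. In a real system you would probably store the permutation explicitly, since the shuffle produced for a given seed is not guaranteed to stay the same across Python versions.

```python
import random

BITS = 32
KEY = 0x5F3759DF        # hypothetical 32-bit passkey; pick and keep your own

# Derive a fixed bit permutation from the key so it is repeatable.
_rng = random.Random(KEY)
PERM = list(range(BITS))
_rng.shuffle(PERM)      # bit i of the input moves to bit PERM[i] of the output

# Inverse permutation: bit PERM[i] moves back to bit i.
INV_PERM = [0] * BITS
for src, dst in enumerate(PERM):
    INV_PERM[dst] = src


def _permute(value: int, mapping) -> int:
    """Move every set bit of value from position src to mapping[src]."""
    out = 0
    for src, dst in enumerate(mapping):
        if value >> src & 1:
            out |= 1 << dst
    return out


def obfuscate(n: int) -> int:
    return _permute(n ^ KEY, PERM)


def deobfuscate(n: int) -> int:
    return _permute(n, INV_PERM) ^ KEY


print(obfuscate(1), obfuscate(2), deobfuscate(obfuscate(2)))
```

Both steps are bijections on 32-bit values, which is why there are no collisions and nothing extra needs to be stored. This is still obfuscation, not encryption.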
You might find it useful to revisit the idea of using a GUID, because you can construct GUIDs in a way that isn't subject to collision.
Check out the Wikipedia page on GUIDs - the "Type 1" algorithm uses both the MAC address of the PC and the current date/time as inputs. This makes collisions practically impossible, provided the MAC address really is unique and the system clock is not set backwards.
Alternatively, if you create a GUID column in your database as an alternative-key (keep using your auto-increment primary keys), define it as unique. Then, if your GUID generation approach does give a duplicate, you'll get an appropriate error on insert that you can handle.
I saw this question yesterday: how reddit generates an alphanum id
I think it's a reasonably good method (and particularly clever).
It uses Python:
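def to_base(q, alphabet):
    """Convert a non-negative integer q to a string using the given alphabet."""
    if q < 0:
        raise ValueError("must supply a non-negative integer")
    base = len(alphabet)
    converted = []
    while q != 0:
        q, r = divmod(q, base)
        converted.insert(0, alphabet[r])
    return "".join(converted) or "0"

def to36(q):
    """Base-36 form of q, using digits followed by lowercase letters."""
    return to_base(q, "0123456789abcdefghijklmnopqrstuvwxyz")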
Add a char(10) field to your order table... call it 'order_number'.
After you create a new order, randomly generate an integer from 1...9999999999. Check to see if it exists in the database under 'order_number'. If not, update your latest row with this value. If it does exist, pick another number at random.
Use 'order_number' for publicly viewable URLs, maybe always padded with zeros.
There's a race condition concern for when two threads attempt to add the same number at the same time... you could do a table lock if you were really concerned, but that's a big hammer. Add a second check after updating, re-select to ensure it's unique. Call recursively until you get a unique entry. Dwell for a random number of milliseconds between calls, and use the current time as a seed for the random number generator.
Swiped from here.
UPDATED As with the GUID approach described by Bevan, if the column is constrained as unique, then you don't have to sweat it. I guess this is no different than using a GUID, except that the customer and Customer Service will have an easier time referring to the order.
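With a unique constraint in place, the check-then-update dance collapses into "insert and retry on conflict". Here is a rough sketch of that idea, using Python's built-in sqlite3 as a stand-in for MySQL. The table layout, function name and retry limit are assumptions for illustration.

```python
import secrets
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " order_number TEXT NOT NULL UNIQUE)"   # the constraint does the checking
)


def create_order(max_attempts: int = 10) -> str:
    """Insert an order with a random 10-digit public number, retrying on clashes."""
    for _ in range(max_attempts):
        # Random integer in 1..9999999999, zero-padded to 10 digits.
        candidate = f"{secrets.randbelow(9_999_999_999) + 1:010d}"
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO orders (order_number) VALUES (?)", (candidate,)
                )
            return candidate
        except sqlite3.IntegrityError:
            continue    # number already taken: draw another one
    raise RuntimeError("could not find a free order number")


print(create_order())
```

Because the database enforces uniqueness atomically, two threads drawing the same number cannot both succeed, so no table lock or re-select is needed.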
I've found a much simpler way. Say you want to map N values pseudorandomly onto N values. You find the next prime above N, and you make your function
prandmap(x) return x * nextPrime(N) % N
This produces a function that repeats (has a period) every N: no output is produced twice for x in 0..N-1, and the sequence starts over at x = N. It always starts at 0, but is pseudorandom thereafter.
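A quick sketch of that function, with a naive next_prime helper written just for this example:

```python
from math import gcd, isqrt


def next_prime(n: int) -> int:
    """Smallest prime strictly greater than n (trial division; fine for small n)."""
    candidate = n + 1
    while candidate < 2 or any(
        candidate % d == 0 for d in range(2, isqrt(candidate) + 1)
    ):
        candidate += 1
    return candidate


def make_prandmap(n: int):
    """Build x -> x * next_prime(n) % n, a one-to-one map on 0..n-1."""
    multiplier = next_prime(n)
    assert gcd(multiplier, n) == 1      # coprime, so the map is a bijection
    return lambda x: x * multiplier % n


prandmap = make_prandmap(1000)
print([prandmap(x) for x in range(8)])
```

One caveat worth knowing: x * p mod N is the same as x * (p mod N) mod N, so the effective multiplier is only the gap between N and the next prime. For N = 1000 that gap is 9, and for N = 10 it is 1, which turns the map into the identity. So the output can look far less random than hoped. A larger multiplier that is coprime to N avoids this.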
I honestly think encrypting/decrypting query string data is a bad approach to this problem. The easiest solution is sending data using POST instead of GET. If users are clicking on links with query string data, you have to resort to some JavaScript hacks to send data by POST (keep accessibility in mind for users with JavaScript turned off). This doesn't prevent users from viewing source, but at the very least it keeps sensitive data from being indexed by search engines, assuming the data you're trying to hide is really that sensitive in the first place.
Another approach is to use a natural unique key. For example, if you're issuing invoices to customers on a monthly basis, then "yyyyMM[customerID]" uniquely identifies a particular invoice for a particular user.
From your description, personally, I would start off by working with whatever standard encryption library is available (I'm a Java programmer, but I assume, say, a basic AES encryption library must be available for PHP):
on the database, just key things as you normally would
whenever you need to transmit a key to/from a client, use a fairly strong, standard encryption system (e.g. AES) to convert the key to/from a string of garbage. As your plain text, use a (say) 128-byte buffer containing: a (say) 4-byte key, 60 random bytes, and then a 64-byte medium-quality hash of the previous 64 bytes (see Numerical Recipes for an example)-- obviously when you receive such a string, you decrypt it then check if the hash matches before hitting the DB. If you're being a bit more paranoid, send an AES-encrypted buffer of random bytes with your key in an arbitrary position, plus a secure hash of that buffer as a separate parameter. The first option is probably a reasonable tradeoff between performance and security for your purposes, though, especially when combined with other security measures.
the day that you're processing so many invoices a second that AES encrypting them in transit is too performance expensive, go out and buy yourself a big fat server with lots of CPUs to celebrate.
Also, if you want to hide that the variable is an invoice ID, you might consider calling it something other than "invoice_id".
Is a GUID unique 100% of the time?
Will it stay unique over multiple threads?
While each generated GUID is not guaranteed to be unique, the total number of unique keys (2^128 or 3.4×10^38) is so large that the probability of the same number being generated twice is very small. For example, consider the observable universe, which contains about 5×10^22 stars; every star could then have 6.8×10^15 universally unique GUIDs.
From Wikipedia.
These are some good articles on how a GUID is made (for .NET) and how you could get the same guid in the right situation.
https://ericlippert.com/2012/04/24/guid-guide-part-one/
https://ericlippert.com/2012/04/30/guid-guide-part-two/
https://ericlippert.com/2012/05/07/guid-guide-part-three/
If you are scared of the same GUID values then put two of them next to each other.
Guid.NewGuid().ToString() + Guid.NewGuid().ToString();
If you are too paranoid then put three.
The simple answer is yes.
Raymond Chen wrote a great article on GUIDs and why substrings of GUIDs are not guaranteed unique. The article goes in to some depth as to the way GUIDs are generated and the data they use to ensure uniqueness, which should go to some length in explaining why they are :-)
As a side note, I was playing around with Volume GUIDs in Windows XP. This is a very obscure partition layout with three disks and fourteen volumes.
\\?\Volume{23005604-eb1b-11de-85ba-806d6172696f}\ (F:)
\\?\Volume{23005605-eb1b-11de-85ba-806d6172696f}\ (G:)
\\?\Volume{23005606-eb1b-11de-85ba-806d6172696f}\ (H:)
\\?\Volume{23005607-eb1b-11de-85ba-806d6172696f}\ (J:)
\\?\Volume{23005608-eb1b-11de-85ba-806d6172696f}\ (D:)
\\?\Volume{23005609-eb1b-11de-85ba-806d6172696f}\ (P:)
\\?\Volume{2300560b-eb1b-11de-85ba-806d6172696f}\ (K:)
\\?\Volume{2300560c-eb1b-11de-85ba-806d6172696f}\ (L:)
\\?\Volume{2300560d-eb1b-11de-85ba-806d6172696f}\ (M:)
\\?\Volume{2300560e-eb1b-11de-85ba-806d6172696f}\ (N:)
\\?\Volume{2300560f-eb1b-11de-85ba-806d6172696f}\ (O:)
\\?\Volume{23005610-eb1b-11de-85ba-806d6172696f}\ (E:)
\\?\Volume{23005611-eb1b-11de-85ba-806d6172696f}\ (R:)
                                     | | | | |
                                     | | | | +-- 6f = o
                                     | | | +---- 69 = i
                                     | | +------ 72 = r
                                     | +-------- 61 = a
                                     +---------- 6d = m
It's not that the GUIDs are very similar but the fact that all GUIDs have the string "mario" in them. Is that a coincidence or is there an explanation behind this?
Now, when googling for part 4 of the GUID I found approx. 125,000 hits with volume GUIDs.
Conclusion: When it comes to Volume GUIDs they aren't as unique as other GUIDs.
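For what it's worth, these values parse as version-1 style UUIDs, where the last 48 bits are the "node" field and the first field holds the low bits of a timestamp. The sketch below only decodes three of the values listed above with Python's uuid module. It does not explain why the node bytes on this machine happen to spell a name.

```python
import uuid

volume_ids = [
    "23005604-eb1b-11de-85ba-806d6172696f",
    "23005605-eb1b-11de-85ba-806d6172696f",
    "23005611-eb1b-11de-85ba-806d6172696f",
]

for text in volume_ids:
    u = uuid.UUID(text)
    node = u.node.to_bytes(6, "big")          # last 48 bits of the value
    # version, low timestamp bits, node as hex, node bytes 2-6 as ASCII
    print(u.version, hex(u.time_low), node.hex(), node[1:].decode("ascii"))
```

In that layout, GUIDs generated on one machine in quick succession are expected to share the node field and differ only in the low timestamp bits, which matches the pattern in the listing.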
It should not happen. However, when .NET is under a heavy load, it is possible to get duplicate guids. I have two different web servers using two different sql servers. I went to merge the data and found I had 15 million guids and 7 duplicates.
Yes, a GUID should always be unique. It is based on both hardware and time, plus a few extra bits to make sure it's unique. I'm sure it's theoretically possible to end up with two identical ones, but extremely unlikely in a real-world scenario.
Here's a great article by Raymond Chen on Guids:
https://blogs.msdn.com/oldnewthing/archive/2008/06/27/8659071.aspx
Guids are statistically unique. The odds of two different clients generating the same Guid are infinitesimally small (assuming no bugs in the Guid generating code). You may as well worry about your processor glitching due to a cosmic ray and deciding that 2+2=5 today.
Multiple threads allocating new guids will get unique values, but you should check that the function you are calling is thread safe. Which environment is this in?
Eric Lippert has written a very interesting series of articles about GUIDs.
There are on the order 2^30 personal computers in the world (and of
course lots of hand-held devices or non-PC computing devices that have
more or less the same levels of computing power, but lets ignore
those). Let's assume that we put all those PCs in the world to the
task of generating GUIDs; if each one can generate, say, 2^20 GUIDs per
second then after only about 2^72 seconds -- one hundred and fifty
trillion years -- you'll have a very high chance of generating a
collision with your specific GUID. And the odds of collision get
pretty good after only thirty trillion years.
GUID Guide, part one
GUID Guide, part two
GUID Guide, part three
Theoretically, no, they are not unique. It's possible to generate an identical guid over and over. However, the chances of it happening are so low that you can assume they are unique.
I've read before that the chances are so low that you really should stress about something else--like your server spontaneously combusting or other bugs in your code. That is, assume it's unique and don't build in any code to "catch" duplicates--spend your time on something more likely to happen (i.e. anything else).
I made an attempt to describe the usefulness of GUIDs to my blog audience (non-technical family members). From there (via Wikipedia), the odds of generating a duplicate GUID:
1 in 2^128
1 in 340 undecillion (don’t worry, undecillion is not on the quiz)
1 in 3.4 × 10^38
1 in 340,000,000,000,000,000,000,000,000,000,000,000,000
No one seems to mention the actual math of the probability of it occurring.
First, let's assume we can use the entire 128 bit space (Guid v4 only uses 122 bits).
We know that the general probability of NOT getting a duplicate in n picks is:
(1 - 1/2^128)(1 - 2/2^128)...(1 - (n-1)/2^128)
Because 2^128 is much much larger than n, we can approximate this to:
(1 - 1/2^128)^(n(n-1)/2)
And because we can assume n is much much larger than 0, we can approximate that to:
(1 - 1/2^128)^(n^2/2)
Now we can equate this to the "acceptable" probability, let's say 1%:
(1 - 1/2^128)^(n^2/2) = 0.01
Which we solve for n and get:
n = sqrt(2 * log 0.01 / log(1 - 1/2^128))
Which Wolfram Alpha gets to be 5.598318 × 10^19
To put that number into perspective, let's take 10000 machines, each having a 4-core CPU running at 4 GHz and spending 10000 cycles to generate a Guid and doing nothing else. It would then take ~111 years before they generate a duplicate.
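The figures above can be reproduced without Wolfram Alpha. This is a minimal sketch with illustrative variable names. Note that the 0.01 in the derivation is the probability of seeing no duplicate, so the resulting n is the point at which a collision has become about 99% likely.

```python
import math

SPACE_BITS = 128
P_NO_DUPLICATE = 0.01          # the threshold used in the derivation above

# log(1 - 2**-128) evaluates to log(1.0) in floating point, so use log1p.
log_term = math.log1p(-2.0**-SPACE_BITS)
n = math.sqrt(2 * math.log(P_NO_DUPLICATE) / log_term)
print(f"n = {n:.6e}")

# machines * cores * cycles per second / cycles per GUID
guids_per_second = 10_000 * 4 * 4e9 / 10_000
years = n / guids_per_second / (365.25 * 24 * 3600)
print(f"{years:.0f} years")
```

Setting P_NO_DUPLICATE to 0.99 instead gives the number of GUIDs at which the collision chance reaches 1%, which is a noticeably smaller n.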
From http://www.guidgenerator.com/online-guid-generator.aspx
What is a GUID?
GUID (or UUID) is an acronym for 'Globally Unique Identifier' (or 'Universally Unique Identifier'). It is a 128-bit integer number used to identify resources. The term GUID is generally used by developers working with Microsoft technologies, while UUID is used everywhere else.
How unique is a GUID?
128-bits is big enough and the generation algorithm is unique enough that if 1,000,000,000 GUIDs per second were generated for 1 year the probability of a duplicate would be only 50%. Or if every human on Earth generated 600,000,000 GUIDs there would only be a 50% probability of a duplicate.
Is a GUID unique 100% of the time?
Not guaranteed, since there are several ways of generating one. However, you can try to calculate the chance of creating two GUIDs that are identical and you get the idea: a GUID has 128 bits, hence, there are 2^128 distinct GUIDs – much more than there are stars in the known universe. Read the Wikipedia article for more details.
MSDN:
There is a very low probability that the value of the new Guid is all zeroes or equal to any other Guid.
If your system clock is set properly and hasn't wrapped around, and if your NIC has its own MAC (i.e. you haven't set a custom MAC) and your NIC vendor has not been recycling MACs (which they are not supposed to do but which has been known to occur), and if your system's GUID generation function is properly implemented, then your system will never generate duplicate GUIDs.
If everyone on earth who is generating GUIDs follows those rules then your GUIDs will be globally unique.
In practice, the number of people who break the rules is low, and their GUIDs are unlikely to "escape". Conflicts are statistically improbable.
I experienced a duplicate GUID.
I use the Neat Receipts desktop scanner and it comes with proprietary database software. The software has a sync to cloud feature, and I kept getting an error upon syncing. A gander at the logs revealed the awesome line:
"errors":[{"code":1,"message":"creator_guid: is already taken","guid":"C83E5734-D77A-4B09-B8C1-9623CAC7B167"}]}
I was a bit in disbelief, but surely enough, when I found a way into my local neatworks database and deleted the record containing that GUID, the error stopped occurring.
So to answer your question with anecdotal evidence, no. A duplicate is possible. But it is likely that the reason it happened wasn't due to chance, but due to standard practice not being adhered to in some way. (I am just not that lucky) However, I cannot say for sure. It isn't my software.
Their customer support was EXTREMELY courteous and helpful, but they must have never encountered this issue before because after 3+ hours on the phone with them, they didn't find the solution. (FWIW, I am very impressed by Neat, and this glitch, however frustrating, didn't change my opinion of their product.)
For a better result, append a timestamp to the GUID (just to make sure that it stays unique):
string id = Guid.NewGuid().ToString() + DateTime.Now.ToString();
GUID algorithms are usually implemented according to the v4 GUID specification, which is essentially a pseudo-random string. Sadly, these fall into the category of "likely non-unique", from Wikipedia (I don't know why so many people ignore this bit): "... other GUID versions have different uniqueness properties and probabilities, ranging from guaranteed uniqueness to likely non-uniqueness."
The pseudo-random properties of V8's JavaScript Math.random() are TERRIBLE at uniqueness, with collisions often coming after only a few thousand iterations, but V8 isn't the only culprit. I've seen real-world GUID collisions using both PHP and Ruby implementations of v4 GUIDs.
Because it's becoming more and more common to scale ID generation across multiple clients, and clusters of servers, entropy takes a big hit -- the chances of the same random seed being used to generate an ID escalate (time is often used as a random seed in pseudo-random generators), and GUID collisions escalate from "likely non-unique" to "very likely to cause lots of trouble".
To solve this problem, I set out to create an ID algorithm that could scale safely, and make better guarantees against collision. It does so by using the timestamp, an in-memory client counter, client fingerprint, and random characters. The combination of factors creates an additive complexity that is particularly resistant to collision, even if you scale it across a number of hosts:
http://usecuid.org/
I have also experienced GUIDs not being unique, during multi-threaded/multi-process unit testing. I guess that has to do with, all other things being equal, the identical seeding (or lack of seeding) of pseudo-random generators. I was using them to generate unique file names. I found the OS is much better at doing that :)
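The seeding problem is easy to reproduce. The Python sketch below is an illustration, not the code from the tests described above: it builds v4-style GUIDs from two pseudo-random generators that happen to start from the same seed, as two worker processes might. `pseudo_guid` and the seed value are hypothetical.

```python
import random
import uuid

def pseudo_guid(rng: random.Random) -> uuid.UUID:
    """Build a v4-style UUID from the given pseudo-random generator."""
    return uuid.UUID(int=rng.getrandbits(128), version=4)

# Two "workers" whose generators were seeded identically.
worker_a = random.Random(12345)
worker_b = random.Random(12345)

ids_a = [pseudo_guid(worker_a) for _ in range(3)]
ids_b = [pseudo_guid(worker_b) for _ in range(3)]

print(ids_a == ids_b)  # True: every "unique" id is duplicated
```

Python's own `uuid.uuid4()` avoids this by drawing its bytes from the operating system (`os.urandom`) rather than from a seeded generator, which matches the observation that the OS does a better job.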
Trolling alert
You ask if GUIDs are 100% unique. That depends on the number of GUIDs it must be unique among. As the number of GUIDs approaches infinity, the probability of duplicate GUIDs approaches 100%.
In a more general sense, this is known as the "birthday problem" or "birthday paradox". Wikipedia has a pretty good overview at:
Wikipedia - Birthday Problem
In very rough terms, the square root of the size of the pool is a rough approximation of when you can expect a 50% chance of a duplicate. The article includes a probability table of pool size and various probabilities, including a row for 2^128. So for a 1% probability of collision you would expect to randomly pick 2.6*10^18 128-bit numbers. A 50% chance requires 2.2*10^19 picks, while SQRT(2^128) is 1.8*10^19.
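The figures above can be reproduced with the usual birthday-bound approximation, n ≈ SQRT(2 * N * ln(1 / (1 - p))), where N is the pool size and p the target collision probability. The Python sketch below assumes ideal, uniformly random picks; the function name is made up for illustration.

```python
import math

def picks_for_collision_probability(bits: int, p: float) -> float:
    """Approximate number of random picks from 2**bits values that gives
    a probability p of at least one collision (birthday approximation)."""
    pool = 2.0 ** bits
    return math.sqrt(2.0 * pool * math.log(1.0 / (1.0 - p)))

print(f"{picks_for_collision_probability(128, 0.01):.1e}")  # 2.6e+18
print(f"{picks_for_collision_probability(128, 0.50):.1e}")  # 2.2e+19
print(f"{math.sqrt(2.0 ** 128):.1e}")                       # 1.8e+19
```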
Of course, that is just the ideal case of a truly random process. As others mentioned, a lot is riding on that random aspect: just how good are the generator and the seed? It would be nice if there were hardware support to assist with this process, which would be more bullet-proof, except that anything can be spoofed or virtualized. I suspect that might be the reason why MAC addresses and timestamps are no longer incorporated.
The answer to "Is a GUID 100% unique?" is simply "No".
If you want 100% uniqueness, do the following:
1. Generate a GUID.
2. Check whether that GUID already exists in the table column where you need uniqueness.
3. If it exists, go to step 1; otherwise go to step 4.
4. Use this GUID as unique.
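The steps above can be sketched in Python as a generate-check-retry loop. A plain `set` stands in for the table column here, which is an assumption for illustration; with a real database you would more likely rely on a unique constraint and retry when the insert fails. The `generate` parameter is a hypothetical hook that makes the retry path testable.

```python
import uuid
from typing import Callable, Set

def new_unique_guid(
    existing: Set[uuid.UUID],
    generate: Callable[[], uuid.UUID] = uuid.uuid4,
) -> uuid.UUID:
    """Return a GUID that is not yet in `existing`, and record it there."""
    while True:
        candidate = generate()          # step 1: generate a GUID
        if candidate not in existing:   # steps 2-3: check, retry on a duplicate
            existing.add(candidate)
            return candidate            # step 4: use it

seen: Set[uuid.UUID] = set()
first = new_unique_guid(seen)
second = new_unique_guid(seen)
print(first != second)  # True
```

Note that this only guarantees uniqueness among the GUIDs recorded in that one store, not globally.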
The hardest part is not generating a duplicate GUID.
The hardest part is designing a database that can store all of the generated ones so you can check whether a duplicate actually occurred.
From WIKI:
For example, the number of random version 4 UUIDs which need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion, computed as follows:
n ≈ 1/2 + SQRT(1/4 + 2 * ln(2) * 2^122) ≈ 2.71*10^18
This number is equivalent to generating 1 billion UUIDs per second for about 85 years, and a file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes, many times larger than the largest databases currently in existence, which are on the order of hundreds of petabytes
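The arithmetic behind the quoted figures can be checked with a few lines of Python. This is a sketch under stated assumptions: a version 4 UUID has 122 random bits, a year is taken as 365.25 days, and an exabyte as 10^18 bytes. The results land close to the quoted 85 years and 45 exabytes; the exact values depend on rounding and on those assumptions.

```python
import math

RANDOM_BITS = 122                          # random bits in a version 4 UUID
SECONDS_PER_YEAR = 365.25 * 24 * 3600      # assumption: Julian year
UUIDS_PER_SECOND = 1e9                     # generation rate from the quote
BYTES_PER_UUID = 16

# Birthday-bound estimate for a 50% chance of at least one collision.
n = 0.5 + math.sqrt(0.25 + 2 * math.log(2) * 2 ** RANDOM_BITS)

years = n / UUIDS_PER_SECOND / SECONDS_PER_YEAR
exabytes = n * BYTES_PER_UUID / 1e18

print(f"{n:.2e}")                          # 2.71e+18
print(f"{years:.0f} years")
print(f"{exabytes:.0f} exabytes")
```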
GUID stands for Globally Unique Identifier
In Brief:
(the clue is in the name)
In Detail:
GUIDs are designed to be unique. They are calculated using a random method based on the computer's clock and the computer itself. If you create many GUIDs in the same millisecond on the same machine, it is possible that two may match, but for almost all normal operations they should be considered unique.
I think that when people bury their thoughts and fears in statistics, they tend to forget the obvious. If a system is truly random, then the result you are least likely to expect (all ones, say) is equally as likely as any other unexpected value (all zeros, say). Neither fact prevents these occurring in succession, nor within the first pair of samples (even though that would be statistically "truly shocking"). And that's the problem with measuring chance: it ignores criticality (and rotten luck) entirely.
IF it ever happened, what's the outcome? Does your software stop working? Does someone get injured? Does someone die? Does the world explode?
The more extreme the criticality, the worse the word "probability" sits in the mouth. In the end, chaining GUIDs (or XORing them, or whatever) is what you do when you regard (subjectively) your particular criticality (and your feeling of "luckiness") to be unacceptable. And if it could end the world, then please on behalf of all of us not involved in nuclear experiments in the Large Hadron Collider, don't use GUIDs or anything else indeterministic!
Enough GUIDs to assign one to each and every hypothetical grain of sand on every hypothetical planet around each and every star in the visible universe.
Enough so that if every computer in the world generates 1000 GUIDs a second for 200 years, there might (MIGHT) be a collision.
Given the number of current local uses for GUIDs (one sequence per table per database for instance) it is extraordinarily unlikely to ever be a problem for us limited creatures (and machines with lifetimes that are usually less than a decade if not a year or two for mobile phones).
... Can we close this thread now?