Generating unique tokens that can't be guessed


I have a system that needs to schedule some stuff and return identifiers to the scheduled tasks to some foreign objects. The user would basically do this:
identifier = MyLib.Schedule(something)
# Nah, let's unschedule it.
MyLib.Unschedule(identifier)
I use this kind of pattern a lot in internal code, and I always use plain integers as the identifier. But if the identifiers are used by untrusted code, a malicious user could break the entire system by doing a single Unschedule(randint()).
I need the users of the code to be able to only unschedule identifiers they have actually scheduled.
The only solution I can think of is to generate, e.g., 64-bit random numbers as identifiers, and keep track of which identifiers are currently handed out to avoid the ridiculously unlikely duplicates. Or 128-bit? When can I say "this is random enough, no duplicates could possibly occur", if ever?
Or better yet, is there a more sensible way to do this? Is there a way to generate identifier tokens that the generator can easily keep track of (avoiding duplicates) but is indistinguishable from random numbers to the recipient?
EDIT - Solution based on the accepted answer:
from Crypto.Cipher import AES
import struct, os, itertools

class AES_UniqueIdentifier(object):
    def __init__(self):
        self.salt = os.urandom(8)
        self.count = itertools.count(0)
        self.cipher = AES.new(os.urandom(16), AES.MODE_ECB)

    def Generate(self):
        return self.cipher.encrypt(self.salt +
                                   struct.pack("Q", next(self.count)))

    def Verify(self, identifier):
        "Return true if identifier was generated by this object."
        return self.cipher.decrypt(identifier)[0:8] == self.salt

Depending on how many active IDs you have, 64 bits can be too little. By the birthday paradox, you'd end up with essentially the level of protection you might expect from 32 bit identifiers.
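To put rough numbers on the birthday argument, here is a small Python sketch. The function name is made up for illustration, and it uses the usual approximation n ≈ sqrt(2 · 2^b · ln(1/(1-p))) for the number of random b-bit ids you can hand out before a collision becomes likely with probability p.

```python
import math

def ids_until_collision(bits, probability=0.5):
    """Approximate number of random `bits`-bit ids that can be drawn
    before a duplicate appears with the given probability
    (birthday approximation, not an exact count)."""
    space = 2 ** bits
    return math.sqrt(2 * space * math.log(1 / (1 - probability)))

if __name__ == "__main__":
    for bits in (32, 64, 128):
        print(bits, "bits:", f"{ids_until_collision(bits):.3g}", "ids")
```

For 64-bit ids the result is roughly 5 billion, i.e. a little over 2^32, which is where the "protection of 32-bit identifiers" remark comes from.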
Besides, probably the best way to create these is to use some salted hash function, such as SHA-1 or MD5 or whatever your framework already has, with a randomly chosen salt (kept secret), and those generate at least 128 bits anyway, exactly for the reason mentioned above. If you use something that creates longer hash values, I don't really see any reason to truncate them.
To create identifiers you can check without storing them, take something easy to detect, such as having the same 64 bit patterns twice (giving a total of 128 bits) and encrypt that with some constant secret key, using AES or some other cipher with a block size of 128 bits (or whatever you picked). If and when the user sends some alleged key, decrypt and check for your easy-to-spot pattern.

It sounds to me like you might be overthinking this problem. This sounds 100% like an application for a GUID/UUID. Python even has a built-in way to generate them. The whole point of GUIDs/UUIDs is that the odds of collision are astronomically low, and by using a string instead of an encrypted token you can skip the decrypting operation in the verify step. I think this would also eliminate a whole slew of problems you might encounter regarding key management, and increase the speed of the whole process.
EDIT:
With a UUID, your verify method would just be a comparison between the given UUID and the stored one. Since the odds of a collision between two UUIDs are incredibly low, you shouldn't have to worry about false positives. In your example, it appears that the same object is doing both encryption and decryption, without a third party reading the stored data. If this is the case, you aren't gaining anything by passing around encrypted data except that the bits you're passing around aren't easy to guess. I think a UUID would give you the same benefits, without the overhead of the encryption operations.
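As a minimal sketch of that idea in Python: the Scheduler class and its method names below are hypothetical stand-ins for MyLib, and the sketch assumes version 4 UUIDs from uuid.uuid4(), which CPython draws from os.urandom.

```python
import uuid

class Scheduler:
    """Hands out random UUID strings and only unschedules ids it issued."""

    def __init__(self):
        self._tasks = {}  # identifier -> scheduled task

    def schedule(self, task):
        identifier = uuid.uuid4().hex  # 122 random bits, 32 hex characters
        self._tasks[identifier] = task
        return identifier

    def unschedule(self, identifier):
        # Verification is a plain lookup: unknown ids are simply rejected.
        if identifier in self._tasks:
            del self._tasks[identifier]
            return True
        return False
```

Usage would mirror the question: identifier = scheduler.schedule(something), later scheduler.unschedule(identifier). A guessed identifier just returns False.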

Make your identifier long enough that it can't reasonably be guessed. In addition, let Unschedule wait for 1 second if the token is not in use, so that a brute-force attack is no longer feasible. As the other answer said, session IDs in web applications are exactly the same problem, and I have already seen session IDs that were 64 random characters long.

This is the same problem as dealing with session identifiers in ordinary web applications. Predictable session ids can easily lead to session hijacking.
Have a look at how session ids are generated. Here is the content of a typical PHPSESSID cookie:
bf597801be237aa8531058dab94a08a9
If you want to be dead sure no brute-force attack is feasible, do the calculations backward: How many attempts can a cracker make per second? How many unique ids are in use at a random point in time? How many ids are there in total? How long would it take for the cracker to cover, say, 1% of the total space of ids? Adjust the number of bits accordingly.
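The backward calculation can be sketched in a few lines of Python. The function names and the example figures in the usage section are assumptions chosen for illustration, not measurements.

```python
import math

def seconds_to_cover(id_bits, guesses_per_second, fraction=0.01):
    """Time an attacker needs to try `fraction` of all possible ids."""
    return fraction * 2 ** id_bits / guesses_per_second

def bits_needed(active_ids, guesses_per_second, attack_seconds, max_expected_hits):
    """Smallest id size (in bits) for which the expected number of valid
    ids hit during the attack stays at or below `max_expected_hits`."""
    total_guesses = guesses_per_second * attack_seconds
    return math.ceil(math.log2(total_guesses * active_ids / max_expected_hits))

if __name__ == "__main__":
    # Hypothetical figures: a million guesses per second for a year,
    # a million live ids, and at most a one-in-a-million chance of a hit.
    year = 365 * 24 * 3600
    print(bits_needed(10 ** 6, 10 ** 6, year, 1e-6), "bits")
    print(seconds_to_cover(64, 10 ** 6) / year, "years to cover 1% of 64-bit ids")
```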

Do you need this pattern in a distributed or local environment?
If you're local, most OO languages should support the notion of object identity, so if you create an opaque handle - just create a new object.
handle = new Object(); // in Java
No other client can fake this.
If you need to use this in distributed environments, you may keep a pool of handles per session, so that a foreign session can never use a stolen handle.

Related

Does RAND() have too little entropy to generate salts?

I have to encrypt several pieces of data in my DB (not passwords).
I would like to use a salt to prevent rainbow table attacks.
I'm generating a salt like this:
mysalt = UNHEX(SHA2(RAND(),512))
Is RAND() (the MySQL function) a sufficient source of entropy? All my salt values should differ from each other, but that won't hold if my PRNG produces too many collisions. Does it depend on the number of records in the DB? If so, what is the limit with RAND()? What would be a good alternative technique if the above isn't good enough? Finally, is it a good idea to salt passwords too?

I have to encrypt several pieces of data in my DB (not passwords). I would like to use a salt to prevent rainbow table attacks.
As already commented, you should use proper terminology to make sure we're talking about the same thing. Based on the comments, let's assume you're generating an IV (initialization vector) for a block cipher mode (CBC, CFB or OFB). The IV enables safe key reuse. It has nothing to do with rainbow tables.
mysalt = UNHEX(SHA2(RAND(),512))
RAND() generates a float between 0 and 1. Your expression effectively hashes the string form of that float, so a more explicit notation would be SHA2(CONVERT(RAND(), CHAR), 256). Note that SHA2() only accepts 224, 256, 384, 512 (or 0) as its length argument. AES uses a 128-bit IV anyway, so there is no reason to generate more than 16 bytes; truncate the digest.
Is RAND() (the MySQL function) a sufficient source of entropy?
It depends on the mode used. For some modes (CTR, OFB) the IV only has to be unique (even a simple counter is good enough); for CBC and CFB the IV needs to be "unpredictable" (a stronger requirement than random).
I am unable to find out how the internal state of RAND() is seeded, so I cannot guarantee that it can't be brute-forced (this can be done for an NLFSR PRNG seeded from a timestamp). I'm not really willing to dig into the MySQL source code (as far as I recall MySQL uses the OS rand function, so it will depend on the underlying system; I may be wrong here). So RAND() can be safe for generating a CBC IV, under the assumption that the initial state is random enough. It seems no one here is able to confirm or deny this assumption. RANDOM_BYTES(), by contrast, is guaranteed to use a cryptographically secure random source.
Does it depend on the number of records in the DB?
If so, what is the limit with RAND()?
You need a unique key-IV combination. The more records you have, the higher the probability of a collision. And here we come back to the size of the initial RAND() state, which is system-dependent, though someone in the comments claims it's 4 bytes. That's really not much.
What would be a good alternative technique if the above isn't good enough?
Finally, is it a good idea to salt passwords too?
As already commented, RANDOM_BYTES() uses the random generator from the SSL library, which is required to be cryptographically secure.

Pros and cons of Flake ids and cryptographic Ids

A distributed system can generate unique ids either by Flake or cryptographic ids (e.g., 128 bit murmur3).
What are the pros and cons of each method?

I'm going to assume 128-bit ids, kind of like UUIDs. Let's start with a baseline, though.
TL;DR: Use random ids. If and only if you have database performance issues try flake ids.
Auto-increment ids
Auto-increment ids are when your backend system assigns a unique, densely-packed id to each new entity. This is usually done by a database, but not always.
The clear advantage is that the id is guaranteed unique to your system, though 128 bits is probably overkill.
The first disadvantage is that you leak information every time you expose your id. You leak what other ids there are (an attacker can easily guess what to look for). You also leak how busy your system is (your competition now knows how many ids you create in a time period and can infer, say financial information).
The second disadvantage is that your backend is no longer as scalable. You are tied to some slow, less scalable id generator that will always be a bottleneck in a large system.
Random ids
Random ids are when you just generate 128 random bits. v4 UUIDs are 122-bit random ids (UUIDs are written like 2bbfb5ba-f5a2-11e7-8c3f-9a214cf093ae). These are also practically unique.
Random ids get rid of both of the disadvantages of auto-increment ids: they leak no information, and they scale without coordination because every node can generate them independently.
The disadvantage comes when storing ids in b-trees (à la databases) because they randomize the memory/disk pages that the tree accesses. This may be a source of slow-downs to your system.
To me this is still the ideal id scheme, and you should have a good reason to move off of it (e.g. profiler data).
Flake ids
Flake ids are random ids, except that the high k bits are taken from the lower bits of a timestamp. For example, you may get the following three ids in a row, where the top bits are really close together.
2bbfb5baf5a211e78c3f9a214cf093ae
2bbf9d4ec10c41049fb1671d6616b213
2bc6bb66e5964fb59050fcf3beed51b1
While you may leak some information, it isn't much if your k and timestamp granularity are designed well.
But if you design the ids badly they can be less than helpful: either the timestamp bits update too infrequently, so the b-tree ends up relying on the random bits and the benefit is negated, or they update too frequently, so you thrash the database.
Note: By time granularity, I mean how frequently the low bits of a timestamp change. Depending on your data throughput, you probably want this to be hours, tens of minutes, or minutes. It's a balance.
If you treat the ids as otherwise semantic-less (i.e. never infer anything from the top bits), then you can change any of these parameters at any time without interruption, even going back to purely random ids where k = 0.
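A rough Python sketch of the flake scheme as described above. The function name and the defaults (k = 32, one-minute granularity, 128 bits in total) are assumptions for illustration, not part of any particular flake implementation.

```python
import secrets
import time

def flake_id(k=32, granularity_s=60, total_bits=128, now=None):
    """Return an id whose top k bits are the low k bits of a coarse
    timestamp and whose remaining bits are random."""
    if now is None:
        now = time.time()
    bucket = int(now // granularity_s)        # changes once per granularity period
    high = bucket & ((1 << k) - 1)            # keep only the low k bits of the bucket
    low = secrets.randbits(total_bits - k)    # the rest stays random
    return (high << (total_bits - k)) | low

if __name__ == "__main__":
    for _ in range(3):
        print(f"{flake_id():032x}")
```

Ids generated within the same minute share their top 32 bits, so inserts cluster in the same region of a b-tree; setting k = 0 gives back purely random ids.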
Cryptographic ids
I'm assuming by this you mean ids have some semantic information encrypted in them. Maybe like hashids?
Disadvantages abound:
You'll have different length ids for different data, unless you have a fixed-length protocol.
You'll be tempted to add more and more info to the ids.
They look random, but you can't mitigate the b-tree problem by adding flake-like timestamps to the front.
Ids become tied to the system that made it. You may start asking that system for decrypted versions of the id instead of just asking for the data it points to.
Your system burns time decrypting ids to extract data.
You add encryption problems:
    What happens if the secret key is leaked? (Better not to have overly sensitive data in there, such as a customer name or, heaven forbid, a credit card number.)
    Coordinating key rotation.
Small ids like hashids can be brute-forced.
As you can see, I am not a fan of semantic ids in general. There are a few places where I use them, though I call them tokens. These don't get stored as keys in a database (or likely not stored anywhere).
For example I use encryption for pagination tokens: encrypted {last-id / context} of a pagination API. I prefer this over having the client pass the last element of the prior page because we keep the database context hidden from the user. It's simpler for everyone, and the encryption is little more than obfuscation (no sensitive information).

What algorithm should be used when doing file checksums to find dupes?

Is taking an MD5 sum still suitable for checking for file dupes? I know that it isn't secure, but does that really matter in the case of trying to find file dupes?
Should I be using something in the SHA family instead?
What is best practice in this use case?

In this particular case, choice of algorithm probably isn't that significant. The key reasons for using SHA1 over MD5 all relate to creating cryptographically secure signatures.
MD5 should be perfectly acceptable for this task, as you probably don't need to worry about people maliciously crafting files to generate false duplicates.

If you care about performance, I think it would be better to check for matching file sizes first, then use a fast hash function (CRC32, or MD5, which should be faster than SHA1), and only for the possible duplicates found this way fall back to MD5, SHA1 or SHA256 (depending on how critical the task is).

SHA1 is slightly better as a checksum than MD5. It is what Git uses.

MD5 has known vulnerabilities at this point, but that may not be a problem for your application. It's still reasonably good for distinguishing piles of bits. If something comes up with no match, then you know you haven't already seen it, since the algorithm is deterministic. If something comes back as a match, you should actually compare it to the blob that it ostensibly matched before acting as if it's really a duplicate. MD5 is relatively fast, but if you can't afford full-text comparisons on hash collisions, you should probably use a stronger hash, like SHA-256.
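The check described above (hash first, then compare the actual bytes before treating a match as a duplicate) could look roughly like this in Python. The function names are made up for illustration, and a cheap file-size pre-filter is added as an assumption to avoid hashing files that cannot match.

```python
import filecmp
import hashlib
import os
from collections import defaultdict

def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths, algorithm="md5"):
    """Return lists of paths whose contents are byte-for-byte identical."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[file_digest(path, algorithm)].append(path)
        for candidates in by_hash.values():
            confirmed = []  # groups confirmed by full comparison
            for path in candidates:
                for group in confirmed:
                    if filecmp.cmp(group[0], path, shallow=False):
                        group.append(path)
                        break
                else:
                    confirmed.append([path])
            groups.extend(g for g in confirmed if len(g) > 1)
    return groups
```

If full comparisons are too expensive, the filecmp step can be dropped and a stronger algorithm such as "sha256" passed instead.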

For the described purpose there is no clearly preferable solution; both hash functions will solve the problem. That said, MD5 will usually be slightly faster than SHA1.
Example in python:
#!/usr/bin/env python
import hashlib, cProfile

def repeat(f, loops=10000000):
    # Decorator: call f() `loops` times so the profiler has something to measure.
    def wrapper():
        for i in range(loops): f()
    return wrapper

@repeat
def test_md5():
    md5 = hashlib.md5(); md5.update(b"hello"); md5.hexdigest()

@repeat
def test_sha1():
    sha = hashlib.sha1(); sha.update(b"hello"); sha.hexdigest()

cProfile.run('test_md5()')
cProfile.run('test_sha1()')
#
# 40000004 function calls in 59.841 CPU seconds
#
# ....
#
# 40000004 function calls in 65.346 CPU seconds
#
# ....

What you are talking about is a checksum, which is related to (but not the same as) a cryptographic hash.
Yes, both MD5 and even CRC work just fine as checksums, as long as you are not concerned with a malicious user intentionally crafting two different files with the same checksum. If that is a concern, use a cryptographically unbroken hash such as SHA-256; practical collisions have been demonstrated for SHA1 as well.

While MD5 does have a few collisions, I've always used it for files and it's worked just fine.

We use MD5 at my work for exactly what you're considering. Works great. We only need to detect duplicate uploads on a per-customer basis, which reduces our exposure to the birthday problem, but MD5 would still be sufficient for us if we had to detect duplicates across all uploads rather than per customer. If you can believe the internet, the probability p of a collision given n samples and a hash size of b bits is bounded by:
p <= n (n - 1) / (2 * 2 ^ b)
A few years back I ran this calculation for n = 10^9 and b = 128 and came up with p <= 1.469E-21. To put that in perspective, 10^9 files is one per second for 32 years. So we don't compare files in the event of a collision. If md5 says the uploads were the same, they're the same.
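The figure above is easy to re-check. A small Python sketch of the bound follows; the function name is chosen here for illustration, and exact fractions are used so that no precision is lost before the final conversion to a float.

```python
from fractions import Fraction

def collision_bound(n, bits):
    """Upper bound p <= n (n - 1) / (2 * 2^bits) on the probability that
    n random `bits`-bit hashes contain at least one collision."""
    return Fraction(n * (n - 1), 2 * 2 ** bits)

if __name__ == "__main__":
    p = collision_bound(10 ** 9, 128)
    print(f"{float(p):.3e}")  # 1.469e-21
```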

Are there any non-obvious ways of abusing GUIDs?

GUIDs are typically used for uniquely identifying all kinds of entities: requests from external systems, files, whatever. They work like magic: you call a "GiveMeGuid()" function (UuidCreate() on Windows) and a fresh new GUID is at your service.
Given that my code really does call that "GiveMeGuid()" function each time I need a new GUID, is there any not-so-obvious way to misuse it?

Just found an answer to an old question: How deterministic Are .Net GUIDs?. Requoting it:
It's not a complete answer, but I can tell you that the 13th hex digit is always 4 because it denotes the version of the algorithm used to generate the GUID (id est, v4); also, and I quote Wikipedia:
Cryptanalysis of the WinAPI GUID generator shows that, since the sequence of V4 GUIDs is pseudo-random, given the initial state one can predict up to the next 250 000 GUIDs returned by the function UuidCreate. This is why GUIDs should not be used in cryptography, e.g., as random keys.
So, if you get unlucky and end up with the same seed, you'll break 250k mirrors in sequence. To quote another Wikipedia piece:
While each generated GUID is not guaranteed to be unique, the total number of unique keys (2^128 or 3.4×10^38) is so large that the probability of the same number being generated twice is extremely small.
Bottom line: one form of misuse may be to assume that a GUID is always unique.

It depends. Some implementations of GUID generation are time-dependent, so calling CreateGuid in quick succession MAY create clashing GUIDs.
Edit: I now remember the problem. I was once working on some PHP code where the GUID-generating function was reseeding the RNG with the system time on each call. Don't do this.
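To illustrate why per-call reseeding is a problem, here is a small Python analogue. It is not the PHP code in question; both function names are hypothetical, and the clock value is passed in as a parameter so that the effect is reproducible.

```python
import random
import uuid

def reseeding_guid(now):
    """Buggy pattern: reseed the PRNG from the clock (whole seconds) on
    every call, so all calls within the same second return the same value."""
    random.seed(int(now))
    return "%032x" % random.getrandbits(128)

def safer_guid():
    """Let the library draw from the OS random source instead."""
    return uuid.uuid4().hex

if __name__ == "__main__":
    # Two calls within the same second clash.
    print(reseeding_guid(1700000000.1) == reseeding_guid(1700000000.9))
    print(safer_guid() == safer_guid())
```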

The only way I can see of misusing a GUID is trying to interpret the value in some logical manner. Not that it really invites you to do so, which is one of the characteristics of GUIDs that I really like.

Some GUIDs include an identifier of the machine they were generated on, so they can be used in client/server environments, but some don't. If yours don't, be sure not to use them in, for instance, a database that multiple clients access.

Maybe the entropy could be manipulated by playing with some parameters used to generate the GUIDs in the first place (e.g. interface identifiers).

1-1 mappings for id obfuscation

I'm using sequential ids as primary keys and there are cases where I don't want those ids to be visible to users, for example I might want to avoid urls like ?invoice_id=1234 that allow users to guess how many invoices the system as a whole is issuing.
I could add a database field with a GUID or something conjured up from hash functions, random strings and/or numeric base conversions, but schemes of that kind have three issues that I find annoying:
Having to allocate the extra database field. I know I could use the GUID as my primary key, but my auto-increment integer PK's are the right thing for most purposes, and I don't want to change that.
Having to think about the possibility of hash/GUID collisions. I give my full assent to all the arguments about GUID collisions being as likely as spontaneous combustion or whatever, but disregarding exceptional cases because they're exceptional goes against everything else I've been taught, and it continues to bother me even when I know I should be more bothered about other things.
I don't know how to safely trim hash-based identifiers, so even if my private ids are 16 or 32 bits, I'm stuck with 128 bit generated identifiers that are a nuisance in urls.
I'm interested in 1-1 mappings of an id range, stretchable or shrinkable so that for example 16-bit ids are mapped to 16 bit ids, 32 bit ids mapped to 32 bit ids, etc, and that would stop somebody from trying to guess the total number of ids allocated or the rate of id allocation over a period.
For example, if my user ids are 16 bit integers (0..65535), then an example of a transformation that somewhat obfuscates the id allocation is the function f(x) = (x mult 1001) mod 65536. The internal id sequence of 1, 2, 3 becomes the public id sequence of 1001, 2002, 3003. With a further layer of obfuscation from base conversion, for example to base 36, the sequence becomes 'rt', '1jm', '2bf'. When the system gets a request to the url ?userid=2bf, it converts from base 36 to get 3003 and it applies the inverse transformation g(x) = (x mult 1113) mod 65536 to get back to the internal id=3.
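For reference, the example transformation above can be written out in Python as follows. The constant and function names are made up for illustration, and the three-argument pow with a negative exponent needs Python 3.8 or later.

```python
MODULUS = 1 << 16                        # ids are 16-bit: 0..65535
MULTIPLIER = 1001                        # must be odd to be invertible mod 2^16
INVERSE = pow(MULTIPLIER, -1, MODULUS)   # modular inverse, equals 1113
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n):
    """Encode a non-negative integer in base 36."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def public_id(internal_id):
    """f(x) = (x * 1001) mod 65536, then base 36."""
    return to_base36(internal_id * MULTIPLIER % MODULUS)

def internal_id(public):
    """Parse base 36, then g(x) = (x * 1113) mod 65536."""
    return int(public, 36) * INVERSE % MODULUS

if __name__ == "__main__":
    print([public_id(i) for i in (1, 2, 3)])  # ['rt', '1jm', '2bf']
    print(internal_id("2bf"))                 # 3
```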
A scheme of that kind is enough to stop casual observation by casual users, but it's easily solvable by someone who's interested enough to try to puzzle it through. Can anyone suggest something that's a bit stronger, but is easily implementable in say PHP without special libraries? This is getting close to a roll-your-own encryption scheme, so maybe there is a proper encryption algorithm that's widely available and has the stretchability property mentioned above?
EDIT: Stepping back a little bit, some discussion at codinghorror about choosing from three kinds of keys - surrogate (guid-based), surrogate (integer-based), natural. In those terms, I'm trying to hide an integer surrogate key from users but I'm looking for something shrinkable that makes urls that aren't too long, which I don't know how to do with the standard 128-bit GUID. Sometimes, as commenter Princess suggests below, the issue can be sidestepped with a natural key.
EDIT 2/SUMMARY:
Given the constraints of the question I asked (stretchability, reversibility, ease of implementation), the most suitable solution so far seems to be the XOR-based obfuscation suggested by Someone and Breton.
It would be irresponsible of me to assume that I can achieve anything more than obfuscation/security by obscurity. The knowledge that it's an integer sequence is probably a crib that any competent attacker would be able to take advantage of.
I've given some more thought to the idea of the extra database field. One advantage of the extra field is that it makes it a lot more straightforward for future programmers who are trying to familiarise themselves with the system by looking at the database. Otherwise they'd have to dig through the source code (or documentation, ahem) to work out how a request to a given url is resolved to a given record in the database.
If I allow the extra database field, then some of the other assumptions in the question become irrelevant (for example the transformation doesn't need to be reversible). That becomes a different question, so I'll leave it there.

I find that simple XOR encryption is best suited for URL obfuscation. You can keep using whatever serial number you are using, without change. Furthermore, XOR encryption doesn't increase the length of the source string: if your text is 22 bytes, the encrypted string will be 22 bytes too. It's not as easy to guess as ROT13, but not as heavyweight as DES/RSA.
Search the net for PHP XOR encryption to find some implementation. The first one I found is here.

I've toyed with this sort of thing myself, in my amateurish way, and arrived at a kind of kooky number-scrambling algorithm involving mixed radices. Basically I have a function that maps a number between 0 and N to another number in the 0 to N range. For URLs I then map that number to a couple of English words (words are easier to remember).
A simplified version of what I do, without mixed radices: you have a number that is 32 bits, so ahead of time pick a passkey that is 32 bits long, and XOR the passkey with your input number. Then shuffle the bits around in a deterministic reordering (possibly based on your passkey).
The nice thing about this is
No collisions, as long as you shuffle and xor the same way each time
No need to store the obfuscated keys in the database
Still use your ordered IDS internally, since you can reverse the obfuscation
You can repeat the operation several times to get more obfuscated results.
If you're up for the mixed-radix version, it's basically the same, except that I add the steps of converting the input to a mixed-radix number, using the prime factors of the maximum range as the digits' bases. Then I shuffle the digits around, keeping the bases with the digits, and turn it back into a standard integer.
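The simplified XOR-and-shuffle version could be sketched in Python like this. The passkey value and the particular bit permutation are arbitrary assumptions made for the example, and this is obfuscation rather than encryption.

```python
BITS = 32
PASSKEY = 0x5DEECE66  # hypothetical 32-bit passkey; keep yours secret

# A fixed bit reordering: bit i moves to position PERM[i]. Multiplying by
# an odd number modulo 32 yields every position exactly once.
PERM = [(13 * i + 7) % BITS for i in range(BITS)]

def obfuscate(x):
    """XOR with the passkey, then move each bit to its permuted position."""
    x ^= PASSKEY
    out = 0
    for src, dst in enumerate(PERM):
        if (x >> src) & 1:
            out |= 1 << dst
    return out

def deobfuscate(y):
    """Undo the bit permutation, then XOR with the passkey again."""
    x = 0
    for src, dst in enumerate(PERM):
        if (y >> dst) & 1:
            x |= 1 << src
    return x ^ PASSKEY
```

Because both steps are bijections on 32-bit values, distinct inputs always give distinct outputs and nothing has to be stored in the database.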

You might find it useful to revisit the idea of using a GUID, because you can construct GUIDs in a way that isn't subject to collision.
Check out the Wikipedia page on GUIDs: the "Type 1" (version 1) algorithm uses both the MAC address of the PC and the current date/time as inputs. This rules out collisions in practice, as long as MAC addresses really are unique and the clock is not set backwards.
Alternatively, if you create a GUID column in your database as an alternative-key (keep using your auto-increment primary keys), define it as unique. Then, if your GUID generation approach does give a duplicate, you'll get an appropriate error on insert that you can handle.

I saw this question yesterday: how reddit generates an alphanum id
I think it's a reasonably good method (and particularly clever).
It uses Python:
def to_base(q, alphabet):
    if q < 0: raise ValueError("must supply a positive integer")
    l = len(alphabet)
    converted = []
    while q != 0:
        q, r = divmod(q, l)
        converted.insert(0, alphabet[r])
    return "".join(converted) or '0'

def to36(q):
    return to_base(q, '0123456789abcdefghijklmnopqrstuvwxyz')

Add a char(10) field to your order table... call it 'order_number'.
After you create a new order, randomly generate an integer from 1...9999999999. Check to see if it exists in the database under 'order_number'. If not, update your latest row with this value. If it does exist, pick another number at random.
Use 'order_number' for publicly viewable URLs, maybe always padded with zeros.
There's a race condition concern for when two threads attempt to add the same number at the same time... you could do a table lock if you were really concerned, but that's a big hammer. Add a second check after updating, re-select to ensure it's unique. Call recursively until you get a unique entry. Dwell for a random number of milliseconds between calls, and use the current time as a seed for the random number generator.
Swiped from here.
UPDATED: As with the GUID approach described by Bevan, if the column is constrained as unique, then you don't have to sweat it. I guess this is no different than using a GUID, except that the customer and Customer Service will have an easier time referring to the order.

I've found a much simpler way. Say you want to map the numbers 0..N-1 pseudorandomly onto 0..N-1. You find the next prime above N, and you make your function
prandmap(x) return x * nextPrime(N) % N
This produces a function with period N: no output is produced twice for 0 <= x < N. It always starts at 0, but is pseudorandom thereafter.
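A runnable Python sketch of this mapping follows; next_prime is a naive trial-division helper written for the example and is not a library function. One caveat worth checking for your own N: the quality depends on nextPrime(N) mod N. For N = 10 the next prime is 11, and x * 11 % 10 is just x again.

```python
def next_prime(n):
    """Smallest prime strictly greater than n (naive trial division)."""
    candidate = n + 1
    while candidate < 2 or any(
        candidate % d == 0 for d in range(2, int(candidate ** 0.5) + 1)
    ):
        candidate += 1
    return candidate

def prandmap(x, n):
    """Map x in 0..n-1 to another value in 0..n-1 without collisions."""
    return x * next_prime(n) % n

if __name__ == "__main__":
    n = 1000
    outputs = [prandmap(x, n) for x in range(n)]
    print(sorted(outputs) == list(range(n)))  # every value appears exactly once
    print([prandmap(x, 10) for x in range(10)])  # degenerates to the identity
```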

I honestly think encrypting/decrypting query string data is a bad approach to this problem. The easiest solution is sending data using POST instead of GET. If users are clicking on links with query string data, you have to resort to some JavaScript hacks to send data by POST (keep accessibility in mind for users with JavaScript turned off). This doesn't prevent users from viewing source, but at the very least it keeps sensitive data from being indexed by search engines, assuming the data you're trying to hide is really that sensitive in the first place.
Another approach is to use a natural unique key. For example, if you're issuing invoices to customers on a monthly basis, then "yyyyMM[customerID]" uniquely identifies a particular invoice for a particular user.

From your description, personally, I would start off by working with whatever standard encryption library is available (I'm a Java programmer, but I assume, say, a basic AES encryption library must be available for PHP):
on the database, just key things as you normally would
whenever you need to transmit a key to/from a client, use a fairly strong, standard encryption system (e.g. AES) to convert the key to/from a string of garbage. As your plain text, use a (say) 128-byte buffer containing: a (say) 4-byte key, 60 random bytes, and then a 64-byte medium-quality hash of the previous 64 bytes (see Numerical Recipes for an example)-- obviously when you receive such a string, you decrypt it then check if the hash matches before hitting the DB. If you're being a bit more paranoid, send an AES-encrypted buffer of random bytes with your key in an arbitrary position, plus a secure hash of that buffer as a separate parameter. The first option is probably a reasonable tradeoff between performance and security for your purposes, though, especially when combined with other security measures.
the day that you're processing so many invoices a second that AES encrypting them in transit is too performance expensive, go out and buy yourself a big fat server with lots of CPUs to celebrate.
Also, if you want to hide that the variable is an invoice ID, you might consider calling it something other than "invoice_id".
