Possibilities to repeat password-based derivated key? - function

A few months ago I entered to cryptography, and I have a doubt.
Technically, a PBKDF, converts any password (with any keylength), to a one key with a specific keylength. I understand this is for can use any user entered password with cipher algorithms, resulting no errors of keylength.
For example, if AES 128 accepts 128 bit key size, I have 2^128 possibilities to found the correct key (X) when I decrypt with brute force. But user password possibilities are infinites (in theory, in practice a far away keylength value delimits possibilities). So, a infinite number of user passwords when program applies a PBKDF, becomes to the same 128 bit derivated key (X). Anyway, minimum a 128 bit user password, applying PBKDF, results the correct derivated key (X). This is true? I'm only intented apply logic to concept.
Anyway, I remember 128 bit keylength brute force implies very much time.

Yes, of course, there are many more possible passphrases than there are
keys. On the other hand, assuming the hash function is good, finding a
collision would take 2^64 work, and finding a preimage would take 2^128
work. So this is not a problem in practice.
Edit in reply to comment:
It sounds like you're saying that you can pick a nice long random
password, but it's possible it will generate the same hash as a very
short one? Well, yes, it's possible, but with probability so low that in
practice it's not worth worrying about.
Let's consider all the possible 8-character passwords. 94 printable
characters, raised to the 8th power, gives fewer than 2^53
possibilities. Out of a universe of 2^128 hashes, the probability of
hitting one of these by accident is less than 2^-75, or less than 1 in
10^22. It's far more likely we'll be hit by a major asteroid strike and
civilization will end.

Related

Using HASH as an alternate dimension key in Snowflake?

(Submitting on behalf of a Snowflake client)
.........................
I want to create a dimension with features as JSON attribute.
I am thinking of using HASH to uniquely identify my rows, including the JSON column.
I expect to have a few million rows in that dimension.
Snowflake documentation (https://docs.snowflake.net/manuals/sql-reference/functions/hash.html) says that HASH is likely to produce duplicates for 4 billion rows or more... and warn against using HASH as a key...
Is using a HASH value as key a reasonable approach when only having a few million row members?
.........................
Any ideas, alternative recommendations, or possible work-arounds? THANK YOU.
That's a fun question.
Assuming that the hash is really, truly, random, calculating the probability of a collision is really an extension of the birthday problem. We can approximate the probability as
p(collision) ≈ 1 - e^(-(n^2)/2d))
where n is the number of values, and d is the size of the domain. Plugging in 2^32 (4 billion) in as n, and 2^64 in as d, we get p ≈ .39, so there's a pretty high likelihood of a collision.
But if n is only a few million, this probability is much lower. E.g., for n = 10,000,000, we get p ≈ .0000027. That sounds pretty safe, but there's clearly some risk. And this assumes that the hash is perfect, so you should probably nudge that probability up a bit.
You could try a longer, more standard hash, like SHA-2, which Snowflake supports. There's always some risk of collision, but if you make the hash long enough, this will become vanishingly small—which is all you can hope for with a hash.
A better alternative to hashing, though, might be to put the JSON in a separate table and use autoincrement to assign a real unique identifier to each record. You'd then join using this key. If you do it right, it should always work, and I'd expect join perf to be better as well.

Generating unique tokens that can't be guessed

I have a system that needs to schedule some stuff and return identifiers to the scheduled tasks to some foreign objects. The user would basically do this:
identifier = MyLib.Schedule(something)
# Nah, let's unschedule it.
MyLib.Unschedule(identifier)
I use this kind of pattern a lot in internal code, and I always use plain integers as the identifier. But if the identifiers are used by untrusted code, a malicious user could break the entire system by doing a single Unschedule(randint()).
I need the users of the code to be able to only unschedule identifiers they have actually scheduled.
The only solution I can think of is to generate i.e 64-bit random numbers as identifiers, and keep track of which identifiers are currently handed out to avoid the ridiculously unlikely duplicates. Or 128-bit? When can I say "this is random enough, no duplicates could possibly occur", if ever?
Or better yet, is there a more sensible way to do this? Is there a way to generate identifier tokens that the generator can easily keep track of (avoiding duplicates) but is indistinguishable from random numbers to the recipient?
EDIT - Solution based on the accepted answer:
from Crypto.Cipher import AES
import struct, os, itertools
class AES_UniqueIdentifier(object):
def __init__(self):
self.salt = os.urandom(8)
self.count = itertools.count(0)
self.cipher = AES.new(os.urandom(16), AES.MODE_ECB)
def Generate(self):
return self.cipher.encrypt(self.salt +
struct.pack("Q", next(self.count)))
def Verify(self, identifier):
"Return true if identifier was generated by this object."
return self.cipher.decrypt(identifier)[0:8] == self.salt
Depending on how many active IDs you have, 64 bits can be too little. By the birthday paradox, you'd end up with essentially the level of protection you might expect from 32 bit identifiers.
Besides, probably the best way to create these is to use some salted hash function, such as SHA-1 or MD5 or whatever your framework already has, with a randomly chosen salt (kept secret), and those generate at least 128 bits anyway, exactly for the reason mentioned above. If you use something that creates longer hash values, I don't really see any reason to truncate them.
To create identifiers you can check without storing them, take something easy to detect, such as having the same 64 bit patterns twice (giving a total of 128 bits) and encrypt that with some constant secret key, using AES or some other cipher with a block size of 128 bits (or whatever you picked). If and when the user sends some alleged key, decrypt and check for your easy-to-spot pattern.
It sounds to me like you might be over thinking this problem. This sounds 100% like an application for a GUID/UUID. Python even has a built in way to generate them. The whole point of GUID/UUIDs is that the odds of collision are astronomical, and by using a string instead of an encrypted token you can skip the decrypting operation in the verify step. I think this would also eliminate a whole slew of problems you might encounter regarding key management, and increase the speed of the whole process.
EDIT:
With a UUID, your verify method would just be a comparison between the given UUID and the stored one. Since the odds of a collision between two UUIDs is incredibly low, you shouldn't have to worry about false positives. In your example, it appears that the same object is doing both encryption and decryption, without a third party reading the stored data. If this is the case, you aren't gaining anything by passing around encrypted data except that the bits your passing around aren't easy to guess. I think a UUID would give you the same benefits, without the overhead of the encryption operations.
You make your identifier long enough, so it can't be reasonable guessed. In addition, let Unschedule wait for 1 second, if the token is not in use, so a brute force attack is not feasible anymore. Like the other answer said, session IDs in Webapplications are exactly the same problem, and I already saw session IDs which where 64 random characters long.
This is the same problem as dealing with session identifiers in ordinary web applications. Predictable session ids can easily lead to session hijacking.
Have a look at how session ids are generated. Here the content of a typical PHPSESSID cookie:
bf597801be237aa8531058dab94a08a9
If you want to be dead sure no brute-force attack is feasible, do the calculations backward: How many attempts can a cracker do per second? How many different unique id's are used at a random point in time? How many id's are there in total? How long would it take for the cracker to cover, say 1 % of the total space of ids? Adjust number of bits accordingly.
Do you need this pattern in a distributed or local environment?
If you're local, most OO languages should support the notion of object identity, so if you create an opaque handle - just create a new object.
handle = new Object(); // in Java
No other client can fake this.
If you need to use this in distributes environments, you may keep a pool of handles per session, so that a foreign session can never use a stolen handle.

Can I use part of MD5 hash for data identification?

I use MD5 hash for identifying files with unknown origin. No attacker here, so I don't care that MD5 has been broken and one can intendedly generate collisions.
My problem is I need to provide logging so that different problems are diagnosed easier. If I log every hash as a hex string that's too long, inconvenient and looks ugly, so I'd like to shorten the hash string.
Now I know that just taking a small part of a GUID is a very bad idea - GUIDs are designed to be unique, but part of them are not.
Is the same true for MD5 - can I take say first 4 bytes of MD5 and assume that I only get collision probability higher due to the reduced number of bytes compared to the original hash?
The short answer is yes, you can use the first 4 bytes as an id. Beware of the birthday paradox though:
http://en.wikipedia.org/wiki/Birthday_paradox
The risk of a collision rapidly increases as you add more files. With 50.000 there's roughly 25% chance that you'll get an id collision.
EDIT: Ok, just read the link to your other question and with 100.000 files the chance of collision is roughly 70%.
Here is a related topic you may refer to
What is the probability that the first 4 bytes of MD5 hash computed from file contents will collide?
Another way to shorten the hash is to convert it to something more efficient than HEX like Base64 or some variant there-of.
Even if you're determined to take on 4 characters, taking 4 characters of base64 gives you more bits than hex.

Handling close-to-impossible collisions on should-be-unique values

There are many systems that depend on the uniqueness of some particular value. Anything that uses GUIDs comes to mind (eg. the Windows registry or other databases), but also things that create a hash from an object to identify it and thus need this hash to be unique.
A hash table usually doesn't mind if two objects have the same hash because the hashing is just used to break down the objects into categories, so that on lookup, not all objects in the table, but only those objects in the same category (bucket) have to be compared for identity to the searched object.
Other implementations however (seem to) depend on the uniqueness. My example (that's what lead me to asking this) is Mercurial's revision IDs. An entry on the Mercurial mailing list correctly states
The odds of the changeset hash
colliding by accident in your first
billion commits is basically zero. But
we will notice if it happens. And
you'll get to be famous as the guy who
broke SHA1 by accident.
But even the tiniest probability doesn't mean impossible. Now, I don't want an explanation of why it's totally okay to rely on the uniqueness (this has been discussed here for example). This is very clear to me.
Rather, I'd like to know (maybe by means of examples from your own work):
Are there any best practices as to covering these improbable cases anyway?
Should they be ignored, because it's more likely that particularly strong solar winds lead to faulty hard disk reads?
Should they at least be tested for, if only to fail with a "I give up, you have done the impossible" message to the user?
Or should even these cases get handled gracefully?
For me, especially the following are interesting, although they are somewhat touchy-feely:
If you don't handle these cases, what do you do against gut feelings that don't listen to probabilities?
If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernonva?
If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernova?
The answer to that is you aren't testing to spot a GUID collision occurring by chance. You're testing to spot a GUID collision occurring because of a bug in the GUID code, or a precondition that the GUID code relies on that you've violated (or been tricked into violating by some attacker), such as in V1 that MAC addresses are unique and time goes forward. Either is considerably more likely than supernova-based bugs.
However, not every client of the GUID code should be testing its correctness, especially in production code. That's what unit tests are supposed to do, so trade off the cost of missing a bug that your actual use would catch but the unit tests didn't, against the cost of second-guessing your libraries all the time.
Note also that GUIDs only work if everyone who is generating them co-operates. If your app generates the IDs on machines you countrol, then you might not need GUIDs anyway - a locally unique ID like an incrementing counter might do you fine. Obviously Mercurial can't use that, hence it uses hashes, but eventually SHA-1 will fall to an attack that generates collisions (or, even worse, pre-images), and they'll have to change.
If your app generates non-hash "GUIDs" on machines you don't control, like clients, then forget about accidental collisions, you're worried about deliberate collisions by malicious clients trying to DOS your server. Protecting yourself against that will probably protect you against accidents anyway.
Or should even these cases get handled gracefully?
The answer to this is probably "no". If you could handle colliding GUIDs gracefully, like a hashtable does, then why bother with GUIDs at all? The whole point of an "identifier" is that if two things have the same ID, then they're the same. If you don't want to treat them the same, just initially direct them into buckets like a hashtable does, then use a different scheme (like a hash).
Given a good 128 bit hash, the probably of colliding with a specific hash value given a random input is:
1 / 2 ** 128 which is approximately equal to 3 * 10 ** -39.
The probability of seeing no collisions (p) given n samples can be computed using the logic used to explain the birthday problem.
p = (2 ** 128)! / (2 ** (128 * n) * (2 ** 128 - n)!)
where !denotes the factorial function. We can then plot the probability of no collisions as the number of samples increases:
Probability of a random SHA-1 collision as the number of samples increases. http://img21.imageshack.us/img21/9186/sha1collision.png
Between 10**17 and 10**18 hashes we begin to see non-trivial possibilities of collision from 0.001% to 0.14% and finally 13% with 10**19 hashes. So in a system with a million, billion, records counting on uniqueness is probably unwise (and such systems are conceivable), but in the vast majority of systems the probability of a collision is so small that you can rely on the uniqueness of your hashes for all practical purposes.
Now, theory aside, it is far more likely that collisions could be introduced into your system either through bugs or someone attacking your system and so onebyone's answer provides good reasons to check for collisions even though the probability of an accidental collision are vanishingly small (that is to say the probability of bugs or malice is much higher than an accidental collision).

1-1 mappings for id obfuscation

I'm using sequential ids as primary keys and there are cases where I don't want those ids to be visible to users, for example I might want to avoid urls like ?invoice_id=1234 that allow users to guess how many invoices the system as a whole is issuing.
I could add a database field with a GUID or something conjured up from hash functions, random strings and/or numeric base conversions, but schemes of that kind have three issues that I find annoying:
Having to allocate the extra database field. I know I could use the GUID as my primary key, but my auto-increment integer PK's are the right thing for most purposes, and I don't want to change that.
Having to think about the possibility of hash/GUID collisions. I give my full assent to all the arguments about GUID collisions being as likely as spontaneous combustion or whatever, but disregarding exceptional cases because they're exceptional goes against everything else I've been taught, and it continues to bother me even when I know I should be more bothered about other things.
I don't know how to safely trim hash-based identifiers, so even if my private ids are 16 or 32 bits, I'm stuck with 128 bit generated identifiers that are a nuisance in urls.
I'm interested in 1-1 mappings of an id range, stretchable or shrinkable so that for example 16-bit ids are mapped to 16 bit ids, 32 bit ids mapped to 32 bit ids, etc, and that would stop somebody from trying to guess the total number of ids allocated or the rate of id allocation over a period.
For example, if my user ids are 16 bit integers (0..65535), then an example of a transformation that somewhat obfuscates the id allocation is the function f(x) = (x mult 1001) mod 65536. The internal id sequence of 1, 2, 3 becomes the public id sequence of 1001, 2002, 3003. With a further layer of obfuscation from base conversion, for example to base 36, the sequence becomes 'rt', '1jm', '2bf'. When the system gets a request to the url ?userid=2bf, it converts from base 36 to get 3003 and it applies the inverse transformation g(x) = (x mult 1113) mod 65536 to get back to the internal id=3.
A scheme of that kind is enough to stop casual observation by casual users, but it's easily solvable by someone who's interested enough to try to puzzle it through. Can anyone suggest something that's a bit stronger, but is easily implementable in say PHP without special libraries? This is getting close to a roll-your-own encryption scheme, so maybe there is a proper encryption algorithm that's widely available and has the stretchability property mentioned above?
EDIT: Stepping back a little bit, some discussion at codinghorror about choosing from three kinds of keys - surrogate (guid-based), surrogate (integer-based), natural. In those terms, I'm trying to hide an integer surrogate key from users but I'm looking for something shrinkable that makes urls that aren't too long, which I don't know how to do with the standard 128-bit GUID. Sometimes, as commenter Princess suggests below, the issue can be sidestepped with a natural key.
EDIT 2/SUMMARY:
Given the constraints of the question I asked (stretchability, reversibility, ease of implementation), the most suitable solution so far seems to be the XOR-based obfuscation suggested by Someone and Breton.
It would be irresponsible of me to assume that I can achieve anything more than obfuscation/security by obscurity. The knowledge that it's an integer sequence is probably a crib that any competent attacker would be able to take advantage of.
I've given some more thought to the idea of the extra database field. One advantage of the extra field is that it makes it a lot more straightforward for future programmers who are trying to familiarise themselves with the system by looking at the database. Otherwise they'd have to dig through the source code (or documentation, ahem) to work out how a request to a given url is resolved to a given record in the database.
If I allow the extra database field, then some of the other assumptions in the question become irrelevant (for example the transformation doesn't need to be reversible). That becomes a different question, so I'll leave it there.
I find that simple XOR encryption is best suited for URL obfuscation. You can continue using whatever serial number you are using without change. Further XOR encryption doesn't increase the length of source string. If your text is 22 bytes, the encrypted string will be 22 bytes too. It's not easy enough as to be guessed like rot 13 but not heavy weight like DSE/RSA.
Search the net for PHP XOR encryption to find some implementation. The first one I found is here.
I've toyed with this sort of thing myself, in my amateurish way, and arrived at a kind of kooky number scrambling algorithm, involving mixed radices. Basically I have a function that maps a number between 0-N to another number in the 0-N range. For URLS I then map that number to a couple of english words. (words are easier to remember).
A simplified version of what I do, without mixed radices: You have a number that is 32 bits, so ahead of time, have a passkey which is 32-bits long, and XOR the passkey with your input number. Then shuffle the bits around in a determinate reordering. (possibly based on your passkey).
The nice thing about this is
No collisions, as long as you shuffle and xor the same way each time
No need to store the obfuscated keys in the database
Still use your ordered IDS internally, since you can reverse the obfuscation
You can repeat the operation several times to get more obfuscated results.
if you're up for the mixed radix version, it's basically the same, except that I add the steps of converting the input to a mixed raddix number, using the maximum range's prime factors as the digit's bases. Then I shuffle the digits around, keeping the bases with the digits, and turn it back into a standard integer.
You might find it useful to revisit the idea of using a GUID, because you can construct GUIDs in a way that isn't subject to collision.
Check out the Wikipedia page on GUIDs - the "Type 1" algorithm uses both the MAC address of the PC, and the current date/time as inputs. This guarantees that collisions are simply impossible.
Alternatively, if you create a GUID column in your database as an alternative-key (keep using your auto-increment primary keys), define it as unique. Then, if your GUID generation approach does give a duplicate, you'll get an appropriate error on insert that you can handle.
I saw this question yesterday: how reddit generates an alphanum id
I think it's a reasonably good method (and particularily clever)
it uses Python
def to_base(q, alphabet):
if q < 0: raise ValueError, "must supply a positive integer"
l = len(alphabet)
converted = []
while q != 0:
q, r = divmod(q, l)
converted.insert(0, alphabet[r])
return "".join(converted) or '0'
def to36(q):
return to_base(q, '0123456789abcdefghijklmnopqrstuvwxyz')
Add a char(10) field to your order table... call it 'order_number'.
After you create a new order, randomly generate an integer from 1...9999999999. Check to see if it exists in the database under 'order_number'. If not, update your latest row with this value. If it does exist, pick another number at random.
Use 'order_number' for publicly viewable URLs, maybe always padded with zeros.
There's a race condition concern for when two threads attempt to add the same number at the same time... you could do a table lock if you were really concerned, but that's a big hammer. Add a second check after updating, re-select to ensure it's unique. Call recursively until you get a unique entry. Dwell for a random number of milliseconds between calls, and use the current time as a seed for the random number generator.
Swiped from here.
UPDATED As with using the GUID aproach described by Bevan, if the column is constrained as unique, then you don't have to sweat it. I guess this is no different that using a GUID, except that the customer and Customer Service will have an easier time referring to the order.
I've found a much simpler way. Say you want to map N digits, pseudorandomly to N digits. you find the next highest prime from N, and you make your function
prandmap(x) return x * nextPrime(N) % N
this will produce a function that repeats (or has a period) every N, no number is produced twice until x=N+1. It always starts at 0, but is pseudorandom thereafter.
I honestly thing encrypting/decrypting query string data is a bad approach to this problem. The easiest solution is sending data using POST instead of GET. If users are clicking on links with querystring data, you have to resort to some javascript hacks to send data by POST (keep accessibility in mind for users with Javascript turned off). This doesn't prevent users from viewing source, but at the very least it keeps sensitive from being indexed by search engines, assuming the data you're trying to hide really that sensitive in the first place.
Another approach is to use a natural unique key. For example, if you're issuing invoices to customers on a monthly basis, then "yyyyMM[customerID]" uniquely identifies a particular invoice for a particular user.
From your description, personally, I would start off by working with whatever standard encryption library is available (I'm a Java programmer, but I assume, say, a basic AES encryption library must be available for PHP):
on the database, just key things as you normally would
whenever you need to transmit a key to/from a client, use a fairly strong, standard encryption system (e.g. AES) to convert the key to/from a string of garbage. As your plain text, use a (say) 128-byte buffer containing: a (say) 4-byte key, 60 random bytes, and then a 64-byte medium-quality hash of the previous 64 bytes (see Numerical Recipes for an example)-- obviously when you receive such a string, you decrypt it then check if the hash matches before hitting the DB. If you're being a bit more paranoid, send an AES-encrypted buffer of random bytes with your key in an arbitrary position, plus a secure hash of that buffer as a separate parameter. The first option is probably a reasonable tradeoff between performance and security for your purposes, though, especially when combined with other security measures.
the day that you're processing so many invoices a second that AES encrypting them in transit is too performance expensive, go out and buy yourself a big fat server with lots of CPUs to celebrate.
Also, if you want to hide that the variable is an invoice ID, you might consider calling it something other than "invoice_id".