How do I assess the hash collision probability? - language-agnostic

I'm developing a back-end application for a search system. The search system copies files to a temporary directory and gives them random names. Then it passes the temporary files' names to my application. My application must process each file within a limited period of time, otherwise it is shut down - that's a watchdog-like security measure. Processing files is likely to take a long time, so I need to design the application to handle this scenario. If my application gets shut down, the next time the search system wants to index the same file it will likely give it a different temporary name.
The obvious solution is to provide an intermediate layer between the search system and the backend. It will queue the request to the backend and wait for the result to arrive. If the request times out in the intermediate layer - no problem, the backend will continue working, only the intermediate layer is restarted and it can retrieve the result from the backend when the request is later repeated by the search system.
The problem is how to identify the files. Their names change randomly. I intend to use a hash function like MD5 to hash the file contents. I'm well aware of the birthday paradox and used an estimation from the linked article to compute the probability. If I assume I have no more than 100,000 files, the probability of two files having the same MD5 (128 bit) is about 1.47×10^-29.
Should I care about such a collision probability, or just assume that equal hash values mean equal file contents?

Equal hash means equal file, unless someone malicious is messing around with your files and injecting collisions (this could be the case if the files are downloaded from the internet). If that is a concern, go for a SHA-2 based function.
There are no accidental MD5 collisions; 1.47×10^-29 is a really, really small number.
To overcome the cost of rehashing big files, I would use a three-phase identity scheme:
Filesize alone
Filesize + a hash of four 64K chunks taken at different positions in the file
A full hash
So if you see a file with a new size, you know for certain you do not have a duplicate, and so on through the phases.
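A minimal sketch of that three-phase idea in Python (the sampling offsets, the 64K chunk size, and the use of MD5 here are just assumptions; only compute the next phase when the previous one matches an existing file):

import hashlib
import os

CHUNK = 64 * 1024  # 64K sample size

def quick_signature(path):
    # Phase 2: filesize plus a hash of four 64K samples at different offsets.
    size = os.path.getsize(path)
    h = hashlib.md5()
    with open(path, "rb") as f:
        for pos in (0, size // 3, 2 * size // 3, max(size - CHUNK, 0)):
            f.seek(pos)
            h.update(f.read(CHUNK))
    return size, h.hexdigest()

def full_hash(path):
    # Phase 3: hash the whole file, streamed in chunks.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()

Phase 1 is just os.path.getsize(path); if no known file has that size, you can stop there.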

Just because the probability is 1/X, it does not mean that it won't happen to you until you have X records. It's like the lottery: you're not likely to win, but somebody out there will win.
With the speed and capacity of computers these days (not even talking about security, just reliability) there is really no reason not to use a bigger/better hash function than MD5 for anything critical. Stepping up to SHA-1 should help you sleep better at night, but if you want to be extra cautious then go to SHA-256 and never think about it again.
If performance is truly an issue, then use BLAKE2, which is actually faster than MD5 while supporting 256+ bits to make collisions less likely. However, while BLAKE2 has been well adopted, it would probably require adding a new dependency to your project.
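For what it's worth, if your stack is anything like Python's, the stronger functions are often already available; hashlib, for example, ships MD5, SHA-256 and BLAKE2, so trying a different one is a one-line change. A sketch (the 1 MB read size is arbitrary):

import hashlib

def file_digest(path, algorithm="sha256"):
    # Stream the file through the chosen hash; "md5", "sha256" and "blake2b"
    # are all guaranteed to be present in hashlib.
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()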

I think you shouldn't.
However, you should if two files with identical content still need to be treated as distinct by their real names rather than by their MD5. For example, in a search system two documents might have exactly the same content but still be distinct because they're located in different places.

I came up with a Monte Carlo approach to be able to sleep safely while using UUID for distributed systems that have to serialize without collisions.
from random import randint
from math import log
from collections import Counter

def colltest(exp):
    # Draw random values from a 2**exp keyspace until the first repeat,
    # then report how many orders of magnitude (base-2) of draws that took.
    uniques = set()
    while True:
        r = randint(0, 2**exp)
        if r in uniques:
            return int(log(len(uniques) + 1, 2))
        uniques.add(r)

for k, v in sorted(Counter(colltest(20) for i in range(1000)).items()):
    print(k, "hash orders of magnitude events before collision:", v)
would print something like:
5 hash orders of magnitude events before collision: 1
6 hash orders of magnitude events before collision: 5
7 hash orders of magnitude events before collision: 21
8 hash orders of magnitude events before collision: 91
9 hash orders of magnitude events before collision: 274
10 hash orders of magnitude events before collision: 469
11 hash orders of magnitude events before collision: 138
12 hash orders of magnitude events before collision: 1
I had heard the rule of thumb before: if you need to store about 2**(x/2) keys, use a hashing function that has a keyspace of at least 2**x.
Repeated experiments (1000 runs against a 2**20 space) show that you sometimes get a collision as early as about 2**(x/4) draws.
For uuid4, which has 122 random bits, that means I sleep safely while several computers pick random UUIDs until I have about 2**31 items. Peak traffic in the system I am thinking about is roughly 10-20 events per second; I'm assuming an average of 7. That gives me an operating window of roughly 10 years, even with that extreme paranoia.
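A quick back-of-the-envelope check of that window (the 7 events/second figure is the assumption made above):

SECONDS_PER_YEAR = 365 * 24 * 3600
events = 2**31                 # comfort limit from the 2**(bits/4) rule above
rate = 7                       # assumed average events per second
print(events / rate / SECONDS_PER_YEAR)   # ~9.7 years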

Here's an interactive calculator that lets you estimate the probability of collision for any hash size and number of objects: http://everydayinternetstuff.com/2015/04/hash-collision-probability-calculator/

Related

What would be the most efficient way to generate a Discord-like username discriminator?

What's the most efficient way to find a random, yet unique, username discriminator, similar to what Discord and other services have begun using?
For instance, 1000 users may have the username JohnSmith but they'll all have distinct discriminators. So one user may be JohnSmith#3482 while another is JohnSmith#4782. When a new user registers (or changes their username) to JohnSmith, what would be the most efficient way to find an available discriminator?
For this example, let's assume a discriminator is numeric and between 0000-9999, and always 4 digits.
One method would be to fetch the discriminators of all users with the name JohnSmith and loop over them with an incremental counter until it found a number not occupied. However, this would be loading a lot of rows and wouldn't result in a truly random number.
Another option would be the same as the first, but generate a random number and check it against the results until an opening is found. This, however, could result in a very long process if only 1 opening exists, and would require tracking already-tested numbers to know when all options have been exhausted.
A third option, a hybrid of the two, would be to find all unused discriminators, persist them to an array, then randomly select one from the array.
Is there an easier way or more efficient manner than these?
Fetching all would be the slowest and most expensive operation. This kind of application should already be optimized to look up a user by username for many reasons, such as login. So the best way is to generate a random number and look up whether it exists.
If you are expecting a lot of collisions then it could get expensive as well. For that I would suggest increasing the total range of the discriminator so collisions are rarer: 100 users => 1000-9999, 1000 users => 10000-99999.
You can try some other hacky ways, like picking the last JohnSmith's discriminator, adding a small random number to it mod the total range, and then checking for a collision. That should give a similarly random distribution, but having a large range is the best option.
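A sketch of the pick-random-and-check approach; discriminator_taken stands in for whatever indexed database lookup you already have (the function name and the retry limit are assumptions):

import random

def assign_discriminator(username, discriminator_taken, attempts=20):
    # Try random 4-digit discriminators before falling back to a full scan.
    for _ in range(attempts):
        candidate = f"{random.randint(0, 9999):04d}"
        if not discriminator_taken(username, candidate):
            return candidate
    return None  # caller falls back to the fetch-all / find-a-free-slot path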
Something similar is how Git determines the short SHA: https://git-scm.com/book/en/v2/Git-Tools-Revision-Selection

Pros and cons of Flake ids and cryptographic Ids

A distributed system can generate unique ids either by Flake or cryptographic ids (e.g., 128 bit murmur3).
Wonder what are the pros and cons of each method.
I'm going to assume 128-bit ids, kind of like UUIDs. Let's start at a baseline, though.
TL;DR: Use random ids. If and only if you have database performance issues try flake ids.
Auto-increment ids
Auto-increment ids are when your backend system assigns a unique, densely-packed id to each new entity. This is usually done by a database, but not always.
The clear advantage is that the id is guaranteed unique to your system, though 128 bits is probably overkill.
The first disadvantage is that you leak information every time you expose your id. You leak what other ids there are (an attacker can easily guess what to look for). You also leak how busy your system is (your competition now knows how many ids you create in a time period and can infer, say, financial information).
The second disadvantage is that your backend is no longer as scalable. You are tied to some slow, less scalable id generator that will always be a bottleneck in a large system.
Random ids
Random ids are when you just generate 128 random bits. v4 UUIDs are 122-bit random ids (e.g. 2bbfb5ba-f5a2-11e7-8c3f-9a214cf093ae). These are also practically unique.
Random ids get rid of both of the disadvantages of auto-increment ids: they leak no information and are infinitely scalable.
The disadvantage comes when storing ids in b-trees (à la databases) because they randomize the memory/disk pages that the tree accesses. This may be a source of slow-downs to your system.
To me this is still the ideal id scheme, and you should have a good reason (i.e. profiler data) to move off of it.
Flake ids
Flake ids are random ids except that the high k bits are taken from the lower bits of a timestamp. For example, you may get the following three ids in a row, where the top bits are really close together.
2bbfb5baf5a211e78c3f9a214cf093ae
2bbf9d4ec10c41049fb1671d6616b213
2bc6bb66e5964fb59050fcf3beed51b1
While you may leak some information, it isn't much if your k and timestamp granularity are designed well.
But if you design the ids badly they can be less than helpful: the timestamp bits can update too infrequently (so the b-tree falls back on the random bits, negating the benefit) or too frequently (so your inserts thrash the database).
Note: By time granularity, I mean how frequently the low bits of a timestamp change. Depending on your data throughput, you probably want this to be hours, tens of minutes, or minutes. It's a balance.
If you treat the ids as otherwise semantic-less (i.e. never infer anything from the top bits), then you can change any of these parameters at any time without interruption, even going back to purely random ids where k = 0.
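A sketch of a flake-style 128-bit id along those lines; the choice of k = 20 timestamp bits and minute granularity are placeholders you would tune to your throughput:

import secrets
import time

K = 20                      # high bits taken from the clock
TOTAL_BITS = 128
GRANULARITY = 60            # seconds per timestamp tick (here: minutes)

def flake_id():
    # High K bits follow a coarse clock, the rest are random.
    tick = int(time.time() // GRANULARITY) & ((1 << K) - 1)
    random_part = secrets.randbits(TOTAL_BITS - K)
    return (tick << (TOTAL_BITS - K)) | random_part

print(f"{flake_id():032x}")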
Cryptographic ids
I'm assuming by this you mean ids have some semantic information encrypted in them. Maybe like hashids?
Disadvantages abound:
You'll have different length ids for different data, unless you have a fixed-length protocol.
You'll be tempted to add more and more info to the ids.
They look random, but there is no easy way to mitigate the b-tree issue by adding flake-like timestamps to the front.
Ids become tied to the system that made them. You may start asking that system for decrypted versions of the id instead of just asking for the data it points to.
Your system burns time decrypting ids to extract data.
You add encryption problems:
what happens if the secret key is leaked? (Better not have data that's too sensitive in there: a customer name, or heaven forbid a credit card number.)
coordinating key rotation.
Small ids like hashids can be brute-forced.
As you can see, I am not a fan of semantic ids in general. There are a few places where I use them, though I call them tokens. These don't get stored as keys in a database (or likely not stored anywhere).
For example, I use encryption for pagination tokens: an encrypted {last-id / context} for a pagination API. I prefer this over having the client pass the last element of the prior page because we keep the database context hidden from the user. It's simpler for everyone, and the encryption is little more than obfuscation (no sensitive information).
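As an illustration of that pagination-token idea, here is a sketch using Fernet from the third-party cryptography package; the token contents and the key handling are assumptions, not the author's actual scheme:

import json
from cryptography.fernet import Fernet

fernet = Fernet(Fernet.generate_key())   # in practice, load the key from config

def make_page_token(last_id, context):
    # Opaque cursor: the client cannot read or forge the database context.
    return fernet.encrypt(json.dumps({"last_id": last_id, "ctx": context}).encode())

def read_page_token(token):
    return json.loads(fernet.decrypt(token))

token = make_page_token(1234, "orders-by-date")
print(read_page_token(token))   # {'last_id': 1234, 'ctx': 'orders-by-date'}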

Sorting an AS3 ByteArray

I am designing an Air application that needs to store thousands of records in memory, and needs to sort them efficiently, by various keys.
I thought of using a ByteArray, since that would avoid all the overhead of normal AS3 objects, and would allow me to use memory more efficiently.
However, the challenge is how to sort the records inside the ByteArray. I have thought of two possibilities:
1- Implement quick-sort or heap-sort in AS3, and sort the array this way. However, I am not sure this will be performant enough. For example, ByteArrays do not have methods to copy chunks of memory around; it has to be done byte-by-byte.
2- Create an AIR Native Extension (ANE) that takes the ByteArray and sorts it, using C. The drawback of this is that it will be harder to implement for all the platforms it needs to run on.
What would you recommend? Do you have any previous experience doing something similar?
I'd say use Array or Vector objects. You can sort Arrays on whatever key you want via sortOn(), and Vectors via sort(), so you can achieve whatever behavior you need, as the latter accepts a comparison function as its parameter. And I believe you won't get anywhere with ByteArrays, since sorting objects really just reorders references, while a ByteArray contains the actual data.
You should never design anything that HAS to have hundreds of thousands of anything in memory at once. Offload stuff while you don't need it. Do the math on what 100,000 records costs: every single byte you add to a record adds about 100 KB across the whole set, so storing just one int per record is already around 400 KB.
If your records have two 20-character strings (a first and last name), and a String character effectively costs about 8 bytes once overhead is counted, you have just filled memory with roughly 32 MB of nothing more than first and last names. Most 'bad' computers have, what... 2 GB of memory? That's a noticeable slice gone on names alone. Even if you managed to pack this down to ByteArray level with superhuman uber bit-shifting, you're still only reducing the data by a factor of about 8, so now you have about 4 MB for just first and last names and no other data. You could survive on that, except I suspect your records have more data than 2 strings. 20 strings per record? The plain-object version is eating over 300 MB. Offload everything but 100 records at a time, and those names cost you a few dozen kilobytes. And yes, you can load and offload while sorting.
Copying chunks of memory is still proportional to how much data you move. The reason Vectors of objects are fast to reorder is that a swap moves a reference/pointer, a single 32- or 64-bit number, instead of copying a chunk of record data.
It's not clear what you're sorting. A single byte only holds 256 values, so clearly you're using more than one byte per record. You want to evaluate each set of, say, 2000+ bytes against each other set of 2000+ bytes? Like: "Ah, last name is bytes 32-83, so extract those bytes; for every group of 4 bytes, bit shift them 0, 8, 16, 24 bits respectively, add them together, concatenate their integer values into a String, do a comparison; now compare bytes 84-124 against the bytes in the next record; now transfer bytes 0-2000 to location 443530-441530 and..." Do these records have variable-length strings or arrays in them? Oh lord.
Flash is not the place to write in assembly!
Use objects and test the speed and memory consumption. If either makes you cry, use more conventional methods of reducing load, like offloading material temporarily into text files. The ugliest you should get is avoiding objects by storing each individual property in a different Vector (Vector.<int>, Vector.<String>, etc.) and having the same index refer to one record across the board.
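The "sort references, not data" point is language-agnostic; here is a Python sketch of that parallel-vectors layout, sorting an index instead of moving the records themselves (the field names and sample data are made up):

# One list per field; index i is one record.
first_names = ["Ann", "Bob", "Cid"]
last_names  = ["Zane", "Young", "Xu"]
ages        = [34, 27, 41]

# Sort an index by last name; no record data is copied or moved.
order = sorted(range(len(last_names)), key=last_names.__getitem__)
for i in order:
    print(last_names[i], first_names[i], ages[i])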

Best practice for storing GPS data of a tracking app in mysql database

I have a data model question for a GPS tracking app. When someone uses our app it will save latitude, longitude, current speed, timestamp and burned_calories every 5 seconds. When a workout is completed, the average speed, total time/distance and burned calories of the workout will be stored in a database. So far so good.
What we want is to also store the data that is saved every 5 seconds, so we can use it later on to plot graphs/charts of a workout, for example.
How should we store this amount of data in a database? A single workout can contain 720 rows if someone runs for an hour. Perhaps a serialised/gzcompressed data array in a single row, although I'm aware that this is bad practice.
Would a relational one-to-many/many-to-many model be out of the question? I know MySQL can easily handle large amounts of data, but we are talking about 720 rows per workout * two workouts a week * 7000 users = over 10 million rows a week.
(Of course we could store the data only every 10 seconds to halve the number of rows, or every 20 seconds, etc., but it would still be a large amount of data over time, and the accuracy of the graphs would decrease.)
How would you do this?
Thanks in advance for your input!
Just some ideas:
Quantize your lat/lon data. I believe that for technical reasons, the data most likely will be quantized already, so if you can detect that quantization, you might use it. The idea here is to turn double numbers into reasonable integers. In the worst case, you may quantize to the precision double numbers provide, which means using 64 bit integers, but I very much doubt your data is even close to that resolution. Perhaps a simple grid with about one meter edge length is enough for you?
Compute differences. Most numbers will be fairly large in terms of absolute values, but also very close together (unless your members run around half the world…). So this will result in rather small numbers. Furthermore, as long as people run with constant speed into a constant direction, you will quite often see the same differences. The coarser your spatial grid in step 1, the more likely you get exactly the same differences here.
Compute a Huffman code for these differences. You might try encoding lat and long movement separately, or computing a single code with 2d displacement vectors at its leaves. Try both and compare the results.
Store the result in a BLOB, together with the dictionary to decode your Huffman code, and the initial position so you can return data to absolute coordinates.
The result should be a fairly small set of data for each data set, which you can retrieve and decompress as a whole. Retrieving individual parts from the database is not possible, but it sounds like you wouldn't be needing that.
The benefit of Huffman coding over gzip is that you won't have to artificially introduce an intermediate byte stream. Directly encoding the actual differences you encounter, with their individual properties, should work much better.
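A sketch of the first two steps (quantize, then delta-encode); the roughly one-metre grid and the sample track are assumptions, and a real version would feed these deltas into the Huffman coder:

GRID = 1e-5   # about one metre of latitude per step (assumed grid size)

def quantize(points):
    # Turn (lat, lon) doubles into integer grid coordinates.
    return [(round(lat / GRID), round(lon / GRID)) for lat, lon in points]

def deltas(cells):
    # Store the first cell absolutely, then only the differences.
    out = [cells[0]]
    for (lat0, lon0), (lat1, lon1) in zip(cells, cells[1:]):
        out.append((lat1 - lat0, lon1 - lon0))
    return out

track = [(52.370216, 4.895168), (52.370301, 4.895150), (52.370388, 4.895139)]
print(deltas(quantize(track)))
# [(5237022, 489517), (8, -2), (9, -1)]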

1-1 mappings for id obfuscation

I'm using sequential ids as primary keys and there are cases where I don't want those ids to be visible to users, for example I might want to avoid urls like ?invoice_id=1234 that allow users to guess how many invoices the system as a whole is issuing.
I could add a database field with a GUID or something conjured up from hash functions, random strings and/or numeric base conversions, but schemes of that kind have three issues that I find annoying:
Having to allocate the extra database field. I know I could use the GUID as my primary key, but my auto-increment integer PK's are the right thing for most purposes, and I don't want to change that.
Having to think about the possibility of hash/GUID collisions. I give my full assent to all the arguments about GUID collisions being as likely as spontaneous combustion or whatever, but disregarding exceptional cases because they're exceptional goes against everything else I've been taught, and it continues to bother me even when I know I should be more bothered about other things.
I don't know how to safely trim hash-based identifiers, so even if my private ids are 16 or 32 bits, I'm stuck with 128 bit generated identifiers that are a nuisance in urls.
I'm interested in 1-1 mappings of an id range, stretchable or shrinkable so that for example 16-bit ids are mapped to 16 bit ids, 32 bit ids mapped to 32 bit ids, etc, and that would stop somebody from trying to guess the total number of ids allocated or the rate of id allocation over a period.
For example, if my user ids are 16 bit integers (0..65535), then an example of a transformation that somewhat obfuscates the id allocation is the function f(x) = (x * 1001) mod 65536. The internal id sequence of 1, 2, 3 becomes the public id sequence of 1001, 2002, 3003. With a further layer of obfuscation from base conversion, for example to base 36, the sequence becomes 'rt', '1jm', '2bf'. When the system gets a request to the url ?userid=2bf, it converts from base 36 to get 3003 and it applies the inverse transformation g(x) = (x * 1113) mod 65536 to get back to the internal id=3.
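That worked example in code (Python is used only for illustration; 1113 is the modular inverse of 1001 mod 65536):

M = 65536                      # 16-bit id space
K = 1001
K_INV = pow(K, -1, M)          # 1113 (pow with -1 needs Python 3.8+)

def obfuscate(internal_id):
    return (internal_id * K) % M

def deobfuscate(public_id):
    return (public_id * K_INV) % M

assert obfuscate(3) == 3003
assert deobfuscate(3003) == 3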
A scheme of that kind is enough to stop casual observation by casual users, but it's easily solvable by someone who's interested enough to try to puzzle it through. Can anyone suggest something that's a bit stronger, but is easily implementable in say PHP without special libraries? This is getting close to a roll-your-own encryption scheme, so maybe there is a proper encryption algorithm that's widely available and has the stretchability property mentioned above?
EDIT: Stepping back a little bit, there is some discussion at codinghorror about choosing from three kinds of keys - surrogate (guid-based), surrogate (integer-based), natural. In those terms, I'm trying to hide an integer surrogate key from users, but I'm looking for something shrinkable that makes urls that aren't too long, which I don't know how to do with the standard 128-bit GUID. Sometimes, as commenter Princess suggests below, the issue can be sidestepped with a natural key.
EDIT 2/SUMMARY:
Given the constraints of the question I asked (stretchability, reversibility, ease of implementation), the most suitable solution so far seems to be the XOR-based obfuscation suggested by Someone and Breton.
It would be irresponsible of me to assume that I can achieve anything more than obfuscation/security by obscurity. The knowledge that it's an integer sequence is probably a crib that any competent attacker would be able to take advantage of.
I've given some more thought to the idea of the extra database field. One advantage of the extra field is that it makes it a lot more straightforward for future programmers who are trying to familiarise themselves with the system by looking at the database. Otherwise they'd have to dig through the source code (or documentation, ahem) to work out how a request to a given url is resolved to a given record in the database.
If I allow the extra database field, then some of the other assumptions in the question become irrelevant (for example the transformation doesn't need to be reversible). That becomes a different question, so I'll leave it there.
I find that simple XOR encryption is best suited for URL obfuscation. You can continue using whatever serial number you are using without change, and XOR encryption doesn't increase the length of the source string: if your text is 22 bytes, the encrypted string will be 22 bytes too. It's not as trivially guessable as ROT13, but it's not heavyweight like DES/RSA either.
Search the net for PHP XOR encryption to find some implementation. The first one I found is here.
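The idea in a couple of lines, language aside (the key below is obviously just an example value that you would keep server-side):

SECRET = 0xA3C59AC7            # example 32-bit key

def xor_obfuscate(n):
    return n ^ SECRET          # applying it again reverses it

public = xor_obfuscate(1234)
assert xor_obfuscate(public) == 1234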
I've toyed with this sort of thing myself, in my amateurish way, and arrived at a kind of kooky number-scrambling algorithm involving mixed radices. Basically I have a function that maps a number between 0-N to another number in the 0-N range. For URLs I then map that number to a couple of English words (words are easier to remember).
A simplified version of what I do, without mixed radices: you have a number that is 32 bits, so ahead of time, have a passkey which is 32 bits long, and XOR the passkey with your input number. Then shuffle the bits around in a deterministic reordering (possibly based on your passkey).
The nice things about this are:
No collisions, as long as you shuffle and xor the same way each time
No need to store the obfuscated keys in the database
Still use your ordered IDS internally, since you can reverse the obfuscation
You can repeat the operation several times to get more obfuscated results.
If you're up for the mixed radix version, it's basically the same, except that I add the steps of converting the input to a mixed radix number, using the maximum range's prime factors as the digits' bases. Then I shuffle the digits around, keeping the bases with the digits, and turn it back into a standard integer.
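A sketch of the simplified XOR-plus-shuffle variant (no mixed radices); the passkey value and the seeded bit permutation are placeholders:

import random

BITS = 32
PASSKEY = 0x5E2D1C4B
# A fixed, reproducible permutation of the bit positions.
PERM = list(range(BITS))
random.Random(PASSKEY).shuffle(PERM)

def scramble(n):
    n ^= PASSKEY
    out = 0
    for dst, src in enumerate(PERM):          # move bit src -> bit dst
        out |= ((n >> src) & 1) << dst
    return out

def unscramble(n):
    out = 0
    for dst, src in enumerate(PERM):          # move bit dst back -> bit src
        out |= ((n >> dst) & 1) << src
    return out ^ PASSKEY

assert unscramble(scramble(123456)) == 123456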
You might find it useful to revisit the idea of using a GUID, because you can construct GUIDs in a way that isn't subject to collision.
Check out the Wikipedia page on GUIDs - the "Type 1" algorithm uses both the MAC address of the PC and the current date/time as inputs. This makes collisions practically impossible.
Alternatively, if you create a GUID column in your database as an alternative-key (keep using your auto-increment primary keys), define it as unique. Then, if your GUID generation approach does give a duplicate, you'll get an appropriate error on insert that you can handle.
I saw this question yesterday: how reddit generates an alphanum id
I think it's a reasonably good method (and particularly clever).
It uses Python:
def to_base(q, alphabet):
    # Convert a non-negative integer to a string in the given alphabet's base.
    if q < 0:
        raise ValueError("must supply a positive integer")
    l = len(alphabet)
    converted = []
    while q != 0:
        q, r = divmod(q, l)
        converted.insert(0, alphabet[r])
    return "".join(converted) or '0'

def to36(q):
    return to_base(q, '0123456789abcdefghijklmnopqrstuvwxyz')
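Combined with the multiplier scheme from the question (1113 being the inverse of 1001 mod 65536), the round trip looks like this:

print(to36(3003))                      # '2bf'
internal_id = (int('2bf', 36) * 1113) % 65536
print(internal_id)                     # 3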
Add a char(10) field to your order table... call it 'order_number'.
After you create a new order, randomly generate an integer from 1...9999999999. Check to see if it exists in the database under 'order_number'. If not, update your latest row with this value. If it does exist, pick another number at random.
Use 'order_number' for publicly viewable URLs, maybe always padded with zeros.
There's a race condition concern for when two threads attempt to add the same number at the same time... you could do a table lock if you were really concerned, but that's a big hammer. Add a second check after updating, re-select to ensure it's unique. Call recursively until you get a unique entry. Dwell for a random number of milliseconds between calls, and use the current time as a seed for the random number generator.
Swiped from here.
UPDATED As with the GUID approach described by Bevan, if the column is constrained as unique, then you don't have to sweat it. I guess this is no different than using a GUID, except that the customer and Customer Service will have an easier time referring to the order.
I've found a much simpler way. Say you want to map the numbers 0..N-1 pseudorandomly onto themselves: you find the next prime above N, and you make your function
prandmap(x) = (x * nextPrime(N)) mod N
This produces a function that repeats (or has a period) every N; no value is produced twice until you wrap around after N inputs. It always maps 0 to 0, but is pseudorandom thereafter.
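A sketch of that in Python; next_prime is written out with trial division just to keep the example self-contained (how scrambled the output looks depends on the prime you end up with):

def next_prime(n):
    # Smallest prime strictly greater than n (trial division; fine for small N).
    c = n + 1
    while any(c % d == 0 for d in range(2, int(c ** 0.5) + 1)):
        c += 1
    return c

N = 10000
P = next_prime(N)                      # 10007

def prandmap(x):
    # Bijection on 0..N-1: multiply by a prime above N, reduce mod N.
    return (x * P) % N

assert len({prandmap(x) for x in range(N)}) == N    # no two inputs collide
print([prandmap(x) for x in range(5)])              # [0, 7, 14, 21, 28]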
I honestly think encrypting/decrypting query string data is a bad approach to this problem. The easiest solution is sending data using POST instead of GET. If users are clicking on links with querystring data, you have to resort to some javascript hacks to send data by POST (keep accessibility in mind for users with Javascript turned off). This doesn't prevent users from viewing source, but at the very least it keeps sensitive data from being indexed by search engines, assuming the data you're trying to hide is really that sensitive in the first place.
Another approach is to use a natural unique key. For example, if you're issuing invoices to customers on a monthly basis, then "yyyyMM[customerID]" uniquely identifies a particular invoice for a particular user.
From your description, personally, I would start off by working with whatever standard encryption library is available (I'm a Java programmer, but I assume, say, a basic AES encryption library must be available for PHP):
on the database, just key things as you normally would
whenever you need to transmit a key to/from a client, use a fairly strong, standard encryption system (e.g. AES) to convert the key to/from a string of garbage. As your plain text, use a (say) 128-byte buffer containing: a (say) 4-byte key, 60 random bytes, and then a 64-byte medium-quality hash of the previous 64 bytes (see Numerical Recipes for an example)-- obviously when you receive such a string, you decrypt it then check if the hash matches before hitting the DB. If you're being a bit more paranoid, send an AES-encrypted buffer of random bytes with your key in an arbitrary position, plus a secure hash of that buffer as a separate parameter. The first option is probably a reasonable tradeoff between performance and security for your purposes, though, especially when combined with other security measures.
The day that you're processing so many invoices a second that AES-encrypting them in transit is too big a performance cost, go out and buy yourself a big fat server with lots of CPUs to celebrate.
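A sketch in that spirit, but leaning on an authenticated mode (AES-GCM from the third-party cryptography package) instead of the hand-rolled hash check described above; the key handling and padding sizes here are assumptions:

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

KEY = AESGCM.generate_key(bit_length=128)   # in practice, load from secure config
aead = AESGCM(KEY)

def wrap_id(internal_id):
    # Pad the 4-byte id with random bytes, then encrypt-and-authenticate.
    nonce = os.urandom(12)
    plaintext = internal_id.to_bytes(4, "big") + os.urandom(60)
    return nonce + aead.encrypt(nonce, plaintext, None)

def unwrap_id(token):
    nonce, ciphertext = token[:12], token[12:]
    plaintext = aead.decrypt(nonce, ciphertext, None)   # raises if tampered with
    return int.from_bytes(plaintext[:4], "big")

assert unwrap_id(wrap_id(1234)) == 1234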
Also, if you want to hide that the variable is an invoice ID, you might consider calling it something other than "invoice_id".