Chance of a hash collision

Apologies if this is a duplicate question; most of those I've found are over my head, so I may have missed the answer.
For a given hash, say MD5 (128 bits), what is the chance of a collision among 10^12 hashes?
My maths is not great, I've come up with this equation (I think it's correct) but have no idea how to solve it:
Collision_Chance = 1 - (1 - 1/2^128)^(10^12)
I'm guessing it's somewhere around 10^-26; does this sound about right?
Thanks
Edit: I think my estimate is very wrong; see the birthday paradox.

What does your formula say for 2^128 + 1 values? I believe it does not say that the collision probability is 1, so it cannot be right. Actually, I know it is not: the correct formula is rather large and unwieldy, but there are good approximations using the exponential of a fraction. SO does not typeset formulas, so I won't try to write them down here.
The best key word to search for is probably “birthday attack”.
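As a minimal sketch of that approximation (Python just for illustration; the function name is made up), the usual birthday bound is p ~ 1 - exp(-n(n-1)/(2d)) with d = 2^128 here:

from math import expm1

def birthday_collision_prob(n, bits):
    # Approximate P(at least one collision) among n uniformly random
    # values in a space of size 2**bits, via 1 - exp(-n*(n-1)/(2*2**bits)).
    exponent = -n * (n - 1) / (2 ** (bits + 1))
    return -expm1(exponent)  # -expm1(x) == 1 - exp(x), stable for tiny p

print(birthday_collision_prob(10**12, 128))  # roughly 1.5e-15

For n = 10^12 this gives about 1.5 * 10^-15, which confirms the asker's edit: the 10^-26 guess (one value against a fixed target) is far too low once all pairs are considered.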

Why would a hash collision be a problem? Hashes are not designed to generate unique values, only to facilitate a fast first comparison.
If you are having trouble with hash collisions, you're using them wrong.

Related

How to perform Arithmetic on Ones Complement Numbers and correct overflow?

For some backstory, I'm making a program that can do arithmetic on ones complement numbers. To do this I'm converting a binary string into a BigInteger, performing the math using said BigIntegers, and then converting that back into a binary string. The only problem occurs when the end result goes below -127 or above +127, because I don't know how to correct it due to the nature of ones complement numbers. I was hoping I could somehow instead convert them like unsigned numbers and do what this answer says to do.
There are also a couple of other questions that I got from reading the linked question. I put them in block quotes. I'm just asking for information on what they mean, and explain it to me.
Firstly
I know that the (r-1)'s complement for a base-r number should do an end-around carry if the highest bit produces a carry.
Secondly
End-around carry is actually rather simple: it changes the modulus of the addition operation from r^n to r^n - 1.
And lastly
Again, let's keep the carry bit where it is. If you look at the numbers as unsigned integers, we're computing 13 + 11 = 24. However, due to the wrap-around carry, addition is done modulo 15, so we end up with 9, which represents -6 (the correct result).
If someone can explain these quotes to me and provide some web pages for me to read I would greatly appreciate it! :)
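To make the quoted points concrete, here is a minimal Python sketch of end-around-carry addition, which is exactly addition modulo 2^bits - 1 (helper names are made up; this is not the asker's BigInteger code):

def ones_complement_add(a, b, bits=8):
    # Add two ones-complement bit patterns of the given width.
    # Any carry out of the top bit is added back in at the bottom
    # (the end-around carry), giving arithmetic modulo 2**bits - 1.
    mask = (1 << bits) - 1
    total = (a & mask) + (b & mask)
    result = (total & mask) + (total >> bits)
    return result & mask

def oc_value(pattern, bits=8):
    # Interpret a ones-complement bit pattern as a signed integer.
    if pattern >> (bits - 1):  # sign bit set, so the value is negative
        return -((~pattern) & ((1 << bits) - 1))
    return pattern

print(ones_complement_add(13, 11, bits=4))  # 9: 13 + 11 = 24 wraps modulo 15
print(oc_value(9, bits=4))                  # -6, matching the third quote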

AS3 function producing combinations of array, no duplicates

This sounds like a duplicate question, as there are several questions similar to this, but they don't specifically ask this (or I just haven't found it! :) )
I have an array, this one has two distinct elements, "a" and "b", and a length of four total elements:
var list:Array = ["a","a","b","b"];
I'm looking for all combinations, using all elements, no duplicates.
This should yield:
aabb
abab
abba
bbaa
baba
baab
Searching for a solution for this has given me results similar to these:
a,b,ab,ba,aab,abb,aba, etc
or
a a b b, a a b b, a a b b, etc
Mind you, the application that would ultimately use this function would have two distinct elements, "a" and "b", and a length of 50 total elements:
var list:Array = ["a","a","a","a","a","a","a","a","a","a",
"a","a","a","a","a","a","a","a","a","a",
"a","a","a","a","a",
"b","b","b","b","b","b","b","b","b","b",
"b","b","b","b","b","b","b","b","b","b",
"b","b","b","b","b"]
...so a brute force solution like I used with aabb wouldn't be feasible.
Any help, especially using AS3 code, would be appreciated, even if it is simply pointing me to the right google search :)
Here is a JavaScript answer that might get you started: Permutations in JavaScript? (they're both ECMAScript implementations, so converting to ActionScript should only require minor changes)
It doesn't handle the uniqueness requirement, but it might point you in the right direction.
However, there are a few things you might need to consider first. I don't think it will be feasible to pre-compute all unique permutations upfront.
Based on this answer about unique permutations, it looks like there are 50! / (25! * 25!) = 126,410,606,437,752 unique permutations for 25 a's and 25 b's.
To give an idea of how large that number is: if each permutation took just 1 byte of memory (in practice it will be more than this), that would be 126,410,606,437,752 bytes, or about 126,410.6 gigabytes.
Plus, the algorithm for generating the permutations has complexity O(n!), so quite apart from the memory constraints, it might take far too long to generate the list.
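If the arrangements really must be enumerated, one option is to generate them lazily instead of storing them all. Here is a minimal Python sketch of the classic next-lexicographic-permutation step (the same logic ports to AS3); started from the sorted list, it visits each distinct arrangement of a multiset exactly once:

def next_permutation(seq):
    # Advance seq (a list) to its next lexicographic permutation in place.
    # Returns False when seq is already the last permutation.
    i = len(seq) - 2
    while i >= 0 and seq[i] >= seq[i + 1]:
        i -= 1
    if i < 0:
        return False
    j = len(seq) - 1
    while seq[j] <= seq[i]:
        j -= 1
    seq[i], seq[j] = seq[j], seq[i]
    seq[i + 1:] = reversed(seq[i + 1:])
    return True

items = sorted("aabb")
while True:
    print("".join(items))  # aabb, abab, abba, baab, baba, bbaa
    if not next_permutation(items):
        break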

Solving Project Euler #305

Problem #305
Let's call S the (infinite) string that is made by concatenating the consecutive positive integers (starting from 1) written down in base 10.
Thus, S = 1234567891011121314151617181920212223242...
It's easy to see that any number will show up an infinite number of times in S.
Let's call f(n) the starting position of the nth occurrence of n in S. For example, f(1)=1, f(5)=81, f(12)=271 and f(7780)=111111365.
Find the sum of f(3^k) for 1 <= k <= 13.
How can I go about solving this?
Calculating S to an arbitrary size is deceptively easy but, as you have probably already found out, not practical: it simply becomes too big.
As is common for the newer Project Euler Problems, brute force simply does not work.
That said, you can still look at S for small values of k and maybe construct a formula that will solve the problem in parts (the first few values are easy to handle in memory); see the sketch below. Also, look at Problem 40.
Note: remember the one-minute rule. (Most problems can be solved in a few milliseconds.)
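As a sketch of that small-case exploration (Python for illustration; this brute force verifies the given examples but cannot reach k = 13):

def build_S(limit):
    # Concatenate 1, 2, 3, ... until the string has at least `limit` characters.
    parts, total, i = [], 0, 1
    while total < limit:
        digits = str(i)
        parts.append(digits)
        total += len(digits)
        i += 1
    return "".join(parts)

def f(n, S):
    # 1-based start of the n-th occurrence of str(n) in S, counting overlaps.
    # Returns None if the prefix S is too short to contain n occurrences.
    target, pos = str(n), -1
    for _ in range(n):
        pos = S.find(target, pos + 1)
        if pos < 0:
            return None
    return pos + 1

S = build_S(10**6)
print(f(1, S), f(5, S), f(12, S))  # 1 81 271, matching the problem statement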
My estimate of the running time is O(n^2 log n), so this brute-force approach is not feasible.
Note that you are supposed to solve Project Euler problems yourself, which IMHO applies in particular to newer problems.

What are "magic numbers" in computer programming?

When people talk about the use of "magic numbers" in computer programming, what do they mean?
Magic numbers are any numbers in your code whose meaning isn't immediately obvious to someone with very little knowledge of it.
For example, the following piece of code:
sz = sz + 729;
has a magic number in it and would be far better written as:
sz = sz + CAPACITY_INCREMENT;
Some extreme views state that you should never have any numbers in your code except -1, 0 and 1 but I prefer a somewhat less dogmatic view since I would instantly recognise 24, 1440, 86400, 3.1415, 2.71828 and 1.414 - it all depends on your knowledge.
However, even though I know there are 1440 minutes in a day, I would probably still use a MINS_PER_DAY identifier, since it makes searching for them that much easier. Who's to say that the capacity increment mentioned above wouldn't also be 1440, and you end up changing the wrong value? This is especially true for the low numbers: the chance of dual use of 37197 is relatively low, but the chance of using 5 for multiple things is pretty high.
Use of an identifier means that you wouldn't have to go through all your 700 source files and change 729 to 730 when the capacity increment changed. You could just change the one line:
#define CAPACITY_INCREMENT 729
to:
#define CAPACITY_INCREMENT 730
and recompile the lot.
Contrast this with magic constants which are the result of naive people thinking that just because they remove the actual numbers from their code, they can change:
x = x + 4;
to:
#define FOUR 4
x = x + FOUR;
That adds absolutely zero extra information to your code and is a total waste of time.
"magic numbers" are numbers that appear in statements like
if days == 365
Assuming you didn't know there were 365 days in a year, you'd find this statement meaningless. Thus, it's good practice to assign all "magic" numbers (numbers that have some kind of significance in your program) to a constant,
DAYS_IN_A_YEAR = 365
And from then on, compare to that instead. It's easier to read, and if the earth ever gets knocked out of alignment, and we gain an extra day... you can easily change it (other numbers might be more likely to change).
There's more than one meaning. The one given by most answers already (an arbitrary unnamed number) is a very common one, and the only thing I'll say about that is that some people go to the extreme of defining...
#define ZERO 0
#define ONE 1
If you do this, I will hunt you down and show no mercy.
Another kind of magic number, though, is used in file formats. It's a value, typically included as the first thing in the file, which helps identify the file format, the version of the file format and/or the endianness of the particular file.
For example, you might have a magic number of 0x12345678. If you see that magic number, it's a fair guess you're seeing a file of the correct format. If you see, on the other hand, 0x78563412, it's a fair guess that you're seeing an endian-swapped version of the same file format.
The term "magic number" gets abused a bit, though, referring to almost anything that identifies a file format - including quite long ASCII strings in the header.
http://en.wikipedia.org/wiki/File_format#Magic_number
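As an illustrative sketch (Python, reusing the hypothetical 0x12345678 magic from above rather than any real format's value):

import struct

MAGIC = 0x12345678  # hypothetical magic number; real formats define their own

def probe(path):
    # Read the first four bytes and compare them against the magic number
    # in both byte orders to detect an endian-swapped file.
    with open(path, "rb") as fh:
        raw = fh.read(4)
    if len(raw) < 4:
        return "file too short"
    (big,) = struct.unpack(">I", raw)     # big-endian interpretation
    (little,) = struct.unpack("<I", raw)  # little-endian interpretation
    if big == MAGIC:
        return "expected format (big-endian)"
    if little == MAGIC:
        return "expected format (byte-swapped)"
    return "unknown format"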
Wikipedia is your friend (Magic Number article)
Most of the answers so far have described a magic number as a constant that isn't self-describing. Being a little bit of an "old-school" programmer myself, I'll add that back in the day we described magic numbers as any constant assigned some special purpose that influences the behaviour of the code. For example, the number 999999, or MAX_INT, or something else completely arbitrary.
The big problem with magic numbers is that their purpose can easily be forgotten, or the value used in another perfectly reasonable context.
As a crude and terribly contrived example:
int i = 0;
while (i != 99999)
{
    DoSomeCleverCalculationBasedOnTheValueOf(i);
    if (escapeConditionReached)
    {
        i = 99999;  /* the magic value doubles as a loop-exit sentinel */
    }
}
Whether the constant is named or not isn't really the issue. In the case of my awful example, the value influences behaviour; but what if we need to change the value of "i" while looping?
Clearly in the example above, you don't NEED a magic number to exit the loop. You could replace it with a break statement, and that is the real issue with magic numbers, that they are a lazy approach to coding, and without fail can always be replaced by something less prone to either failure, or to losing meaning over time.
Anything that doesn't have a readily apparent meaning to anyone but the application itself.
if (foo == 3) {
    // do something
} else if (foo == 4) {
    // delete all users
}
Magic numbers are special values of certain variables that cause the program to behave in a special manner.
For example, a communication library might take a timeout parameter and define the magic number -1 to indicate an infinite timeout.
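A hypothetical sketch of that pattern (Python; the names are made up, not any particular library's API):

INFINITE_TIMEOUT = -1  # the magic sentinel described above

def wait_for_reply(sock, timeout=INFINITE_TIMEOUT):
    # sock is assumed to be a socket.socket.
    # -1 means "wait forever"; any other value is a timeout in seconds.
    sock.settimeout(None if timeout == INFINITE_TIMEOUT else timeout)
    return sock.recv(4096)

The sentinel works, but the named constant keeps the special value from being mistaken for a real duration.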
The term magic number is usually used to describe some numeric constant in code. The number appears without any further description and thus its meaning is esoteric.
The use of magic numbers can be avoided by using named constants.
Using numbers in calculations other than 0 or 1 that aren't defined by some identifier or variable. An identifier not only makes the number easy to change in several places by changing it in one place, it also makes it clear to the reader what the number is for.
In a different, schoolbook sense of the phrase, a magic number is a three-digit number where the sum of the squares of the first two digits equals the square of the third. For example, 202, since 2*2 + 0*0 = 2*2. A typical exercise is to write a program in Java that accepts an integer and prints whether it is a magic number or not.
It may seem a bit banal, but there IS at least one real magic number in every programming language.
0
I argue that it is THE magic wand to rule them all in virtually every programmer's quiver of magic wands.
FALSE is inevitably 0
TRUE is not(FALSE), but not necessarily 1! Could be -1 (0xFFFF)
NULL is inevitably 0 (the pointer)
And most compilers allow it unless their typechecking is utterly rabid.
0 is the base index of array elements, except in languages that are so antiquated that the base index is '1'. One can then conveniently code for(i = 0; i < 32; i++), and expect that 'i' will start at the base (0), and increment to, and stop at 32-1... the 32nd member of an array, or whatever.
0 is the end of many programming language strings. The "stop here" value.
0 is likewise built into the X86 instructions to 'move strings efficiently'. Saves many microseconds.
0 is often used by programmers to indicate that "nothing went wrong" in a routine's execution. It is the "not-an-exception" code value. One can use it to indicate the lack of thrown exceptions.
Zero is the answer most often given by programmers to the amount of work it would take to do something completely trivial, like change the color of the active cell to purple instead of bright pink. "Zero, man, just like zero!"
0 is the count of bugs in a program that we aspire to achieve. 0 exceptions unaccounted for, 0 loops unterminated, 0 recursion pathways that cannot actually be taken. 0 is the asymptote that we're trying to achieve in programming labor, girlfriend (or boyfriend) "issues", lousy restaurant experiences and general idiosyncrasies of one's car.
Yes, 0 is a magic number indeed. FAR more magic than any other value. Nothing ... ahem, comes close.

Should we compare floating point numbers for equality against a *relative* error?

So far I've seen many posts dealing with equality of floating point numbers. The standard answer to a question like "how should we decide if x and y are equal?" is
abs(x - y) < epsilon
where epsilon is a fixed, small constant. This is because the "operands" x and y are often the results of some computation where a rounding error is involved, hence the standard equality operator == is not what we mean, and what we should really ask is whether x and y are close, not equal.
Now, I feel that if x is "almost equal" to y, then also x*10^20 should be "almost equal" to y*10^20, in the sense that the relative error should be the same (but "relative" to what?). But with these big numbers, the above test would fail, i.e. that solution does not "scale".
How would you deal with this issue? Should we rescale the numbers or rescale epsilon? How?
(Or is my intuition wrong?)
Here is a related question, but I don't like its accepted answer: the reinterpret_cast thing seems a bit tricky to me, and I don't understand what's going on. Please try to provide a simple test.
It all depends on the specific problem domain. Yes, using relative error will be more correct in the general case, but it can be significantly less efficient since it involves an extra floating-point division. If you know the approximate scale of the numbers in your problem, using an absolute error is acceptable.
This page outlines a number of techniques for comparing floats. It also goes over a number of important issues, such as those with subnormals, infinities, and NaNs. It's a great read, I highly recommend reading it all the way through.
As an alternative solution, why not just round or truncate the numbers and then make a straight comparison? By setting the number of significant digits in advance, you can be certain of the accuracy within that bound.
The problem is that with very big numbers, comparing to epsilon will fail.
Perhaps a better (but slower) solution would be to use division, for example:
max(a, b) / min(a, b) < 1 + eps
Now the 'error' will be relative. (Note that this only works as written for positive values; zero and mixed signs need separate handling.)
Using relative error is at least not as bad as using absolute errors, but it has subtle problems for values near zero due to rounding issues. A far from perfect, but somewhat robust algorithm combines absolute and relative error approaches:
boolean approxEqual(float a, float b, float absEps, float relEps) {
    // Absolute error check, needed when comparing numbers near zero.
    float diff = Math.abs(a - b);
    if (diff <= absEps) {
        return true;
    }
    // Symmetric relative error check without division.
    return diff <= relEps * Math.max(Math.abs(a), Math.abs(b));
}
I adapted this code from Bruce Dawson's excellent article Comparing Floating Point Numbers, 2012 Edition, required reading for anyone doing floating-point comparisons -- an amazingly complex topic with many pitfalls.
Most of the time when code compares values, it is doing so to answer some sort of question. For example:
If I know what a function returned when given a value of X, can I assume it will return the same thing if given Y?
If I have a slow but accurate method of computing a function, I am willing to accept some inaccuracy in exchange for speed, and I want to test a candidate function that seems to fit the bill: are the outputs from that function close enough to the known-accurate one to be considered "correct"?
To answer the first question, code should ideally do a bit-wise comparison on the value, though unless a language supports the new operators added to IEEE-754 in 2009 that may be less efficient than ideal. To answer the second question, one should define what degree of accuracy is required and test against that.
I don't think there's much merit in a general-purpose method which regards as equal things which are close, since different applications will have differing requirements for both absolute and relative tolerance, based upon what exact questions the tests are supposed to answer.