Is the use of machine epsilon appropriate for floating-point equality tests? - language-agnostic

This is a follow-up to Testing for floating-point value equality: Is there a standard name for the “precision” constant?.
There is a very similar question Double.Epsilon for equality, greater than, less than, less than or equal to, greater than or equal to.
It is well known that an equality test for two floating-point values x and y should look more like this (rather than a straightforward =):
abs( x - y ) < epsilon , where epsilon is some very small value.
How to choose a value for epsilon?
It would obviously be preferable to choose as small a value as possible for epsilon, to get the highest possible precision for the equality check.
As an example, the .NET framework offers a constant System.Double.Epsilon (= 4.94066 × 10^-324), which represents the smallest positive System.Double value that is greater than zero.
However, it turns out that this particular value can't be reliably used as epsilon, since:
0 + System.Double.Epsilon ≠ 0
1 + System.Double.Epsilon = 1 (!)
which is, if I understand correctly, because that constant is less than machine epsilon.
→ Is this correct?
→ Does this also mean that I can reliably use epsilon := machine epsilon for equality tests?
Removed these two questions, as they are already adequately answered by the second SO question linked-to above.
The linked-to Wikipedia article says that for 64-bit floating-point numbers (i.e. the double type in many languages), machine epsilon is equal to:
2^-53, or approx. 0.000000000000000111 (a number with 15 zeroes after the decimal point)
→ Does it follow from this that all 64-bit floating point values are guaranteed to be accurate to 14 (if not 15) digits?
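For a quick concrete check: Python floats are IEEE 754 doubles, so 5e-324 plays the role of System.Double.Epsilon there, and the two behaviors above are easy to reproduce:

tiny = 5e-324             # smallest positive subnormal double, like System.Double.Epsilon
print(0.0 + tiny != 0.0)  # True: tiny is distinguishable from zero
print(1.0 + tiny == 1.0)  # True: tiny is far too small to move 1.0 to the next double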

How to choose a value for epsilon?
Short Answer: You take a small value which fits your application's needs.
Long Answer: Nobody can know which calculations your application does, or how accurate you expect your results to be. Since rounding errors accumulate, machine epsilon will almost always be far too small, so you have to choose your own value. Depending on your needs, 0.01 may be sufficient, or maybe 0.00000000000001 or less will be.
The question is, do you really want/need to do equality tests on floating point values? Maybe you should redesign your algorithms.

In the past when I have had to use an epsilon value it's been very much bigger than the machine epsilon value.
Although it was for 32-bit floats (rather than 64-bit doubles), we found that an epsilon value of 10^-6 was needed for most (if not all) calculated values in our particular application.
The value of epsilon you choose depends on the scale of your numbers. If you are dealing with the very large (10^10, say) then you might need a larger value of epsilon, as your significant digits don't stretch very far into the fractional part (if at all). If you are dealing with the very small (10^-10, say) then obviously you need an epsilon value that's smaller than this.
You need to do some experimentation, performing your calculations and checking the differences between your output values. Only when you know the range of your potential answers will you be able to decide on a suitable value for your application.

The sad truth is: There is no appropriate epsilon for floating-point comparisons. Use another approach for floating-point equality tests if you don't want to run into serious bugs.
Approximate floating-point comparison is an amazingly tricky field, and the abs(x - y) < eps approach works only for a very limited range of values, mainly because of the absolute difference not taking into account the magnitude of the compared values, but also due to the significant digit cancellation occurring in the subtraction of two floating-point values with different exponents.
There are better approaches, using relative differences or ULPs, but they have their own shortcomings and pitfalls. Read Bruce Dawson's excellent article Comparing Floating Point Numbers, 2012 Edition for a great introduction into how tricky floating-point comparisons really are -- a must-read for anyone doing floating-point programming IMHO! I'm sure countless thousands of man-years have been spent finding out the subtle bugs due to naive floating-point comparisons.
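To make the ULPs idea concrete, here is a small Python sketch of the bit-reinterpretation trick Dawson describes (a simplified illustration: it assumes finite doubles of the same sign):

import struct

def ulp_distance(a, b):
    # How many representable doubles lie between a and b.
    # Works because, for finite doubles of the same sign, the ordering
    # of the raw bit patterns matches the ordering of the values.
    ia = struct.unpack('<q', struct.pack('<d', a))[0]
    ib = struct.unpack('<q', struct.pack('<d', b))[0]
    return abs(ia - ib)

def almost_equal(a, b, max_ulps=4):
    return ulp_distance(a, b) <= max_ulps

print(0.1 + 0.2 == 0.3)              # False
print(almost_equal(0.1 + 0.2, 0.3))  # True: the results are only 1 ULP apart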

I also have questions regarding what the correct procedure is. However, I believe one should do:
abs(x - y) <= 0.5 * eps * max(abs(x), abs(y))
instead of:
abs(x - y) < eps
The reason for this arises from the definition of machine epsilon. Using Python code:

import numpy as np

real = np.float64
eps = np.finfo(real).eps

## Let's get the machine epsilon: halve dx until x + dx == x
x, dx = real(1), real(1)
while x + dx != x:
    dx /= real(2)
print("eps = %e dx = %e eps*x/2 = %e" % (eps, dx, eps*x/real(2)))

Which gives: eps = 2.220446e-16 dx = 1.110223e-16 eps*x/2 = 1.110223e-16

## Now for x = 16
x, dx = real(16), real(1)
while x + dx != x:
    dx /= real(2)
print("eps = %e dx = %e eps*x/2 = %e" % (eps, dx, eps*x/real(2)))

Which now gives: eps = 2.220446e-16 dx = 1.776357e-15 eps*x/2 = 1.776357e-15

## For x not equal to 2**n
x, dx = real(36), real(1)
while x + dx != x:
    dx /= real(2)
print("eps = %e dx = %e eps*x/2 = %e" % (eps, dx, eps*x/real(2)))

Which returns: eps = 2.220446e-16 dx = 3.552714e-15 eps*x/2 = 3.996803e-15
However, despite the difference between dx and eps*x/2, we see that dx <= eps*x/2,
thus it serves the purpose for equality tests, checking for tolerances when testing for convergence in numerical procedures, etc.
This is similar to what is described in:
www.ibiblio.org/pub/languages/fortran/ch1-8.html#02
However, if someone knows of better procedures, or if something here is incorrect, please do say.

Related

Is second order method worse than first order method?

I was thinking about an elementary question in numerical analysis.
When discretizing an ordinary differential equation, it is well known that a second order method is more accurate than a first order method, since the truncation error is O(dx^2) for a second order method and O(dx) for a first order method. This is true when 0 < dx < 1.
What if dx > 1? For example, if the domain is 0 to 10000 and the mesh size is 1000, then dx = 10. In this case, is the second order method less accurate than the first order method, since dx^2 = 100 and dx = 10? We can encounter this when dealing with large-scale problems, such as climate modeling (the cloud size could be several kilometers).
A second order method is not more accurate than a first order method simply because dx^2 < dx for some value of dx. The order is a statement about the asymptotic rate of convergence for small dx.
Additionally, comparing dx^2 to dx directly doesn't make sense, because dx isn't a unitless quantity, it's a length. So you're trying to compare an area to a length, which is meaningless.
In big-O notation, if a quantity converges with O(dx^2), then that typically means that the error is of the form e = a2 dx^2 + a3 dx^3 + ... The leading coefficient a2 is in the units of X/meters^2, where X is whatever units your error is in, and maybe you use some other length instead of meters. Similarly, for a first order solution, the error is in the form b1 dx + b2 dx^2 + ..., where b1 is in units of X/meters.
So if you decide you can neglect the non-leading terms (which you probably can't for large values of dx), the comparison isn't between dx^2 and dx, it's between a2 dx^2 and b1 dx. There is obviously a crossover between these two error terms, but it's not at dx = 1, it's at dx = b1/a2. If your discretization is that coarse, you're probably not in the asymptotic regime in which you can ignore higher-order terms, and your solution is probably very inaccurate anyway.
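To see the asymptotics numerically, here is a quick Python sketch (the test problem y' = -y, y(0) = 1 is an arbitrary choice of mine) comparing forward Euler (first order) with the explicit midpoint method (second order):

import math

def euler(dx, steps):
    y = 1.0
    for _ in range(steps):
        y += dx * (-y)              # y' = -y
    return y

def midpoint(dx, steps):
    y = 1.0
    for _ in range(steps):
        y_half = y + 0.5 * dx * (-y)
        y += dx * (-y_half)
    return y

exact = math.exp(-1.0)              # y(1) for y' = -y, y(0) = 1
for steps in (10, 20):              # dx = 0.1, then dx = 0.05
    dx = 1.0 / steps
    print(dx, abs(euler(dx, steps) - exact), abs(midpoint(dx, steps) - exact))
# Halving dx roughly halves the Euler error but quarters the midpoint error.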

numerical issues causing the difference in outputs of two programs?

I have two codes that theoretically should return the exact same output. However, this does not happen. The issue is that the two codes handle very small numbers (doubles) to the order of 1e-100 or so. I suspect that there could be some numerical issues which are related to that, and lead to the two outputs being different even though they should be theoretically the same.
Does it indeed make sense that handling numbers on the order of 1e-100 cause such problems? I don't mind the difference in output, if I could safely assume that the source is numerical issues. Does anyone have a good source/reference that talks about issues that come up with stability of algorithms when they handle numbers in such order?
Thanks.
Does anyone have a good source/reference that talks about issues that come up with stability of algorithms when they handle numbers in such order?
The first reference that comes to mind is What Every Computer Scientist Should Know About Floating-Point Arithmetic. It covers floating-point maths in general.
As far as numerical stability is concerned, the best references probably depend on the numerical algorithm in question. Two wide-ranging works that come to mind are:
Numerical Recipes by Press et al;
Matrix Computations by Golub and Van Loan.
It is not necessarily the small numbers that are causing the problem.
How do you check whether the outputs are the "exact same"?
I would check equality with tolerance. You may consider the floating point numbers x and y equal if either fabs(x-y) < 1.0e-6 or fabs(x-y) < fabs(x)*1.0e-6 holds.
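In Python, that combined check might look like this (a direct transcription of the two fabs conditions; the 1.0e-6 tolerance is just the example value above, not a universal constant):

def roughly_equal(x, y, tol=1.0e-6):
    # absolute tolerance (useful near zero) OR relative tolerance
    return abs(x - y) < tol or abs(x - y) < abs(x) * tol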
Usually, there is a HUGE difference between the two algorithms if there are numerical issues. Often, a small change in the input may result in extreme changes in the output, if the algorithm suffers from numerical issues.
What makes you think that there are "numerical issues"?
If possible, change your algorithm to use Kahan Summation (aka compensated summation). From Wikipedia:
function KahanSum(input)
    var sum = 0.0
    var c = 0.0                // A running compensation for lost low-order bits.
    for i = 1 to input.length do
        y = input[i] - c       // So far, so good: c is zero.
        t = sum + y            // Alas, sum is big, y small, so low-order digits of y are lost.
        c = (t - sum) - y      // (t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y).
        sum = t                // Algebraically, c should always be zero. Beware eagerly optimising compilers!
                               // Next time around, the lost low part will be added to y in a fresh attempt.
    return sum
This works by keeping a second running total of the cumulative error, similar to the Bresenham line drawing algorithm. The end result is that you get precision that is nearly double the data type's advertised precision.
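For reference, a runnable Python version of that pseudocode (a minimal sketch; Python's standard library also offers math.fsum, which is more accurate still):

import math

def kahan_sum(values):
    total = 0.0
    c = 0.0                          # running compensation for lost low-order bits
    for v in values:
        y = v - c
        t = total + y
        c = (t - total) - y
        total = t
    return total

vals = [0.1] * 10**7
naive = 0.0
for v in vals:
    naive += v
print(naive)                         # about 999999.99984: rounding error accumulates
print(kahan_sum(vals))               # agrees with math.fsum(vals), the correctly
                                     # rounded sum, to within an ulp or two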
Another technique I use is to sort my numbers from small to large (by magnitude, ignoring sign) and add or subtract the small numbers first, then the larger ones. This has the virtue that if you add and subtract the same value multiple times, such numbers may cancel exactly and can be removed from the list.

CUDA, float precision

I am using CUDA 4.0 on a GeForce GTX 580 (Fermi). I have numbers as small as 7.721155e-43. I want to multiply them with each other just once, or, better said, I want to calculate 7.721155e-43 * 7.721155e-43.
My experience showed me I can't do it straightforwardly. Could you please give me a suggestion? Do I need to use double precision? How?
The magnitude of the smallest normal IEEE single-precision number is about 1.18e-38; the smallest denormal gets you down to about 1.40e-45. As a consequence, an operand of magnitude 7.82e-43 will comprise only about 9 non-zero bits, which in itself may already be a problem, even before you get to the multiplication (whose result will underflow to zero in single precision). So you may also want to look at any upstream computation that produces these tiny numbers.
If these small numbers are intermediate terms in a mathematical expression, rewriting that expression into a mathematically equivalent one that does not involve tiny intermediates would be one way of addressing the issue. Or you could scale some operands by factors that are powers of two (so as to not incur additional round-off due to the scaling). For example, scale by 2^24 = 16777216.
Lastly, you can switch part of the computation to double precision. To do so, simply introduce temporary variables of type double, perform the computation on them, then convert the final result back to float:
float r, f = 7.721155e-43f;
double d, t;
d = (double)f; // explicit cast is not necessary, since converting to wider type
t = d * d;
[... more intermediate computation, leaving result in 't' ...]
r = (float)t; // since conversion is to narrower type, cast will avoid warnings
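The effect is easy to reproduce off-device with numpy, since host IEEE single- and double-precision arithmetic behaves the same way (an illustrative sketch of mine, not CUDA code):

import numpy as np

x = np.float32(7.721155e-43)     # already subnormal in single precision
print(x * x)                     # 0.0: the product underflows float32 entirely

d = np.float64(x)                # widen to double, as in the C fragment above
print(d * d)                     # ~5.961623e-85, comfortably within double range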
In statistics we often have to work with likelihoods that end up being very small numbers, and the standard technique is to use logs for everything. Then multiplication on a log scale is just addition, and all intermediate numbers are stored as logs. Indeed, it can take a bit of getting used to, but the alternative will often fail even when doing relatively modest computations. In R (for my convenience!), which uses doubles and prints 7 significant figures by default, btw:
> 7.721155e-43 * 7.721155e-43
[1] 5.961623e-85
> exp(log(7.721155e-43) + log(7.721155e-43))
[1] 5.961623e-85

Repeated application of functions

Reading this question got me thinking: For a given function f, how can we know that a loop of this form:
while (x > 2)
    x = f(x)
will stop for any value x? Is there some simple criterion?
(The fact that f(x) < x for x > 2 doesn't seem to help since the series may converge).
Specifically, can we prove this for sqrt and for log?
For these functions, a proof that ceil(f(x))<x for x > 2 would suffice. You could do one iteration -- to arrive at an integer number, and then proceed by simple induction.
For the general case, probably the best idea is to use well-founded induction to prove this property. However, as Moron pointed out in the comments, this could be impossible in the general case and the right ordering is, in many cases, quite hard to find.
Edit, in reply to Amnon's comment:
If you wanted to use well-founded induction, you would have to define another strict order that is well-founded. In the case of the functions you mentioned this is not hard: you can take x << y if and only if ceil(x) < ceil(y), where << is a symbol for this new order. This order is of course well-founded on numbers greater than 2, and both sqrt and log are decreasing with respect to it, so you can apply well-founded induction.
Of course, in general case such an order is much more difficult to find. This is also related, in some way, to total correctness assertions in Hoare logic, where you need to guarantee similar obligations on each loop construct.
There's a general theorem for when the sequence of iterations will converge. (A convergent sequence may not stop in a finite number of steps, but it is getting closer to a target. You can get as close to the target as you like by going far enough out in the sequence.)
The sequence x, f(x), f(f(x)), ... will converge if f is a contraction mapping. That is, there exists a positive constant k < 1 such that for all x and y, |f(x) - f(y)| <= k |x-y|.
(The fact that f(x) < x for x > 2 doesn't seem to help since the series may converge).
If we're talking about floats here, that's not true. If for all x > n f(x) is strictly less than x, it will reach n at some point (because there's only a limited number of floating point values between any two numbers).
Of course this means you need to prove that f(x) is actually less than x using floating point arithmetic (i.e. proving it is less than x mathematically does not suffice, because then f(x) = x may still be true with floats when the difference is not enough).
There is no general algorithm to determine whether a function f and a variable x will end or not in that loop. The Halting problem is reducible to that problem.
For sqrt and log, we could safely do that because we happen to know the mathematical properties of those functions. Say, sqrt approaches 1, and log eventually goes negative. So the loop condition x > 2 has to become false at some point.
Hope that helps.
In the general case, all that can be said is that the loop will terminate when it encounters an x_i ≤ 2. That doesn't mean that the sequence will converge, nor does it even mean that it is bounded below 2. It only means that the sequence contains a value that is not greater than 2.
That said, any sequence containing a subsequence that converges to a value strictly less than two will (eventually) halt. That is the case for the sequence x_{i+1} = sqrt(x_i), since it converges to 1. The sequence y_{i+1} = log(y_i) will contain a value less than 2 before becoming undefined over the reals (it is well defined on the extended complex plane C*, but I don't think it will, in general, converge, except at any stable points that may exist, i.e. where z = log(z)). Ultimately, what this means is that you need to perform some upfront analysis on the sequence to better understand its behavior.
The standard test for convergence of a sequence x_i to a point z is that, given ε > 0, there is an n such that for all i > n, |x_i - z| < ε.
As an aside, consider the Mandelbrot set, M. The test for whether a particular point c in C is an element of M is whether the sequence z_{i+1} = z_i^2 + c is unbounded, which occurs whenever there is some |z_i| > 2. For some elements of M the sequence converges (such as c = 0), but for many it does not (such as c = -1).
Sure. For all positive numbers x, the following inequality holds:
log(x) <= x - 1
(this is a pretty basic result from real analysis; it suffices to observe that the second derivative of log is always negative for all positive x, so the function is concave down, and that x-1 is tangent to the function at x = 1). From this it follows essentially immediately that your while loop must terminate within the first ceil(x) - 2 steps -- though in actuality it terminates much, much faster than that.
A similar argument will establish your result for f(x) = sqrt(x); specifically, you can use the fact that:
sqrt(x) <= x/(2 sqrt(2)) + 1/sqrt(2)
for all positive x.
If you're asking whether this result holds for actual programs, instead of mathematically, the answer is a little bit more nuanced, but not much. Basically, many languages don't actually have hard accuracy requirements for the log function, so if your particular language implementation had an absolutely terrible math library this property might fail to hold. That said, it would need to be a really, really terrible library; this property will hold for any reasonable implementation of log.
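As a quick empirical check of how fast these loops terminate in practice, here is a small Python sketch (the starting value 1e6 is arbitrary):

import math

def iterations_until_leq_2(f, x):
    # Count the iterations of: while x > 2: x = f(x)
    n = 0
    while x > 2:
        x = f(x)
        n += 1
    return n

print(iterations_until_leq_2(math.sqrt, 1e6))  # 5
print(iterations_until_leq_2(math.log, 1e6))   # 3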
I suggest reading this wikipedia entry which provides useful pointers. Without additional knowledge about f, nothing can be said.

How would you write this algorithm for large combinations in the most compact way?

The number of combinations of k items which can be retrieved from N items is described by the following formula:
c = N! / (k! * (N - k)!)
An example would be how many combinations of 6 Balls can be drawn from a drum of 48 Balls in a lottery draw.
Optimize this formula to run with the smallest O time complexity
This question was inspired by the new WolframAlpha math engine and the fact that it can calculate extremely large combinations very quickly, e.g. the one linked below, and by a subsequent discussion on the topic on another forum.
http://www97.wolframalpha.com/input/?i=20000000+Choose+15000000
I'll post some info/links from that discussion after some people take a stab at the solution.
Any language is acceptable.
Python: O(min(k, n-k)^2)

def choose(n, k):
    k = min(k, n - k)
    p = q = 1
    for i in range(k):
        p *= n - i
        q *= 1 + i
    return p // q   # exact: q = k! always divides p = n(n-1)...(n-k+1)
Analysis:
The size of p and q will increase linearly inside the loop, if n-i and 1+i can be considered to have constant size.
The cost of each multiplication will then also increase linearly.
The sum of these costs over all iterations becomes an arithmetic series over k.
My conclusion: O(k^2)
If rewritten to use floating point numbers, the multiplications will be atomic operations, but we will lose a lot of precision. It even overflows for choose(20000000, 15000000). (Not a big surprise, since the result would be around 0.2119620413 × 10^4884378.)
def choose(n, k):
    k = min(k, n - k)
    result = 1.0
    for i in range(k):
        result *= 1.0 * (n - i) / (1 + i)
    return result
Notice that WolframAlpha returns a "Decimal Approximation". If you don't need absolute precision, you could do the same thing by calculating the factorials with Stirling's Approximation.
Now, Stirling's approximation requires the evaluation of (n/e)^n, where e is the base of the natural logarithm, which will be by far the slowest operation. But this can be done using the techniques outlined in another stackoverflow post.
If you use double precision and repeated squaring to accomplish the exponentiation, the operations will be:
3 evaluations of a Stirling approximation, each requiring O(log n) multiplications and one square root evaluation.
2 multiplications
1 division
The number of operations could probably be reduced with a bit of cleverness, but the total time complexity is going to be O(log n) with this approach. Pretty manageable.
EDIT: There's also bound to be a lot of academic literature on this topic, given how common this calculation is. A good university library could help you track it down.
EDIT2: Also, as pointed out in another response, the values will easily overflow a double, so a floating point type with very extended precision will need to be used for even moderately large values of k and n.
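A cheap way to experiment with this approach is the standard library's log-gamma function, which is itself typically computed with a Stirling-type series. A sketch (accurate here to roughly 7-8 significant digits of the mantissa, since lgamma's small relative error is magnified by the subtractions):

import math

def log10_choose(n, k):
    # log10 of C(n, k) via log-gamma, in effectively constant time
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(10)

print(round(10 ** log10_choose(48, 6)))        # 12271512, the lottery example
lg = log10_choose(20000000, 15000000)
print("%.6fe+%d" % (10 ** (lg % 1), lg // 1))  # mantissa and decimal exponent
                                               # of the huge case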
I'd solve it in Mathematica:
Binomial[n, k]
Man, that was easy...
Python: approximation in O(1) ?
Using the Python decimal implementation to calculate an approximation. Since it does not use any explicit loop, and the numbers are limited in size, I think it will execute in O(1).
from decimal import Decimal

ln = lambda z: z.ln()
exp = lambda z: z.exp()
sinh = lambda z: (exp(z) - exp(-z))/2
sqrt = lambda z: z.sqrt()

pi = Decimal('3.1415926535897932384626433832795')
e = Decimal('2.7182818284590452353602874713527')

# Stirling's approximation of the gamma-function.
# Simplification by Robert H. Windschitl.
# Source: http://en.wikipedia.org/wiki/Stirling%27s_approximation
gamma = lambda z: sqrt(2*pi/z) * (z/e*sqrt(z*sinh(1/z)+1/(810*z**6)))**z

def choose(n, k):
    n = Decimal(str(n))
    k = Decimal(str(k))
    return gamma(n+1)/gamma(k+1)/gamma(n-k+1)
Example:
>>> choose(20000000,15000000)
Decimal('2.087655025913799812289651991E+4884377')
>>> choose(130202807,65101404)
Decimal('1.867575060806365854276707374E+39194946')
Any higher, and it will overflow. The exponent seems to be limited to 40000000.
Given a reasonable number of values for n and k, calculate them in advance and use a lookup table.
It's dodging the issue in some fashion (you're offloading the calculation), but it's a useful technique if you're having to determine large numbers of values.
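A minimal sketch of the lookup-table idea in Python (N_MAX and the names are my own choices): build Pascal's triangle once, then every query is an O(1) table lookup.

N_MAX = 1000                     # assumed upper bound on n

# table[n][k] == C(n, k), built with the Pascal's-triangle recurrence
table = [[1]]
for n in range(1, N_MAX + 1):
    prev = table[-1]
    table.append([1] + [prev[k - 1] + prev[k] for k in range(1, n)] + [1])

def choose(n, k):
    return table[n][k] if 0 <= k <= n else 0

print(choose(48, 6))             # 12271512: the lottery example from the question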
MATLAB:
The cheater's way (using the built-in function NCHOOSEK): 13 characters, O(?)
nchoosek(N,k)
My solution: 36 characters, O(min(k,N-k))
a=min(k,N-k);
prod(N-a+1:N)/prod(1:a)
I know this is a really old question, but I struggled with a solution to this problem for a long while, until I found a really simple one written in VB 6; after porting it to C#, here is the result:
// caveat: int overflows for even moderately large inputs;
// use long or System.Numerics.BigInteger there
public int NChooseK(int n, int k)
{
    var result = 1;
    for (var i = 1; i <= k; i++)
    {
        result *= n - (k - i);  // multiply by (n-k+1), then (n-k+2), ..., up to n
        result /= i;            // exact at every step: the running value is C(n-k+i, i)
    }
    return result;
}
The final code is so simple you won't believe it will work until you run it.
Also, the original article gives some nice explanation on how he reached the final algorithm.