Why does division by zero in IEEE754 standard results in Infinite value? - language-agnostic

I'm just curious, why in IEEE-754 any non zero float number divided by zero results in infinite value? It's a nonsense from the mathematical perspective. So I think that correct result for this operation is NaN.
Function f(x) = 1/x is not defined when x=0, if x is a real number. For example, function sqrt is not defined for any negative number and sqrt(-1.0f) if IEEE-754 produces a NaN value. But 1.0f/0 is Inf.
But for some reason this is not the case in IEEE-754. There must be a reason for this, maybe some optimization or compatibility reasons.
So what's the point?

It's a nonsense from the mathematical perspective.
Yes. No. Sort of.
The thing is: Floating-point numbers are approximations. You want to use a wide range of exponents and a limited number of digits and get results which are not completely wrong. :)
The idea behind IEEE-754 is that every operation could trigger "traps" which indicate possible problems. They are
Illegal (senseless operation like sqrt of negative number)
Overflow (too big)
Underflow (too small)
Division by zero (The thing you do not like)
Inexact (This operation may give you wrong results because you are losing precision)
Now many people like scientists and engineers do not want to be bothered with writing trap routines. So Kahan, the inventor of IEEE-754, decided that every operation should also return a sensible default value if no trap routines exist.
They are
NaN for illegal values
signed infinities for Overflow
signed zeroes for Underflow
NaN for indeterminate results (0/0) and infinities for (x/0 x != 0)
normal operation result for Inexact
The thing is that in 99% of all cases zeroes are caused by underflow and therefore in 99%
of all times Infinity is "correct" even if wrong from a mathematical perspective.

I'm not sure why you would believe this to be nonsense.
The simplistic definition of a / b, at least for non-zero b, is the unique number of bs that has to be subtracted from a before you get to zero.
Expanding that to the case where b can be zero, the number that has to be subtracted from any non-zero number to get to zero is indeed infinite, because you'll never get to zero.
Another way to look at it is to talk in terms of limits. As a positive number n approaches zero, the expression 1 / n approaches "infinity". You'll notice I've quoted that word because I'm a firm believer in not propagating the delusion that infinity is actually a concrete number :-)
NaN is reserved for situations where the number cannot be represented (even approximately) by any other value (including the infinities), it is considered distinct from all those other values.
For example, 0 / 0 (using our simplistic definition above) can have any amount of bs subtracted from a to reach 0. Hence the result is indeterminate - it could be 1, 7, 42, 3.14159 or any other value.
Similarly things like the square root of a negative number, which has no value in the real plane used by IEEE754 (you have to go to the complex plane for that), cannot be represented.

In mathematics, division by zero is undefined because zero has no sign, therefore two results are equally possible, and exclusive: negative infinity or positive infinity (but not both).
In (most) computing, 0.0 has a sign. Therefore we know what direction we are approaching from, and what sign infinity would have. This is especially true when 0.0 represents a non-zero value too small to be expressed by the system, as it frequently the case.
The only time NaN would be appropriate is if the system knows with certainty that the denominator is truly, exactly zero. And it can't unless there is a special way to designate that, which would add overhead.

NOTE:
I re-wrote this following a valuable comment from #Cubic.
I think the correct answer to this has to come from calculus and the notion of limits. Consider the limit of f(x)/g(x) as x->0 under the assumption that g(0) == 0. There are two broad cases that are interesting here:
If f(0) != 0, then the limit as x->0 is either plus or minus infinity, or it's undefined. If g(x) takes both signs in the neighborhood of x==0, then the limit is undefined (left and right limits don't agree). If g(x) has only one sign near 0, however, the limit will be defined and be either positive or negative infinity. More on this later.
If f(0) == 0 as well, then the limit can be anything, including positive infinity, negative infinity, a finite number, or undefined.
In the second case, generally speaking, you cannot say anything at all. Arguably, in the second case NaN is the only viable answer.
Now in the first case, why choose one particular sign when either is possible or it might be undefined? As a practical matter, it gives you more flexibility in cases where you do know something about the sign of the denominator, at relatively little cost in the cases where you don't. You may have a formula, for example, where you know analytically that g(x) >= 0 for all x, say, for example, g(x) = x*x. In that case the limit is defined and it's infinity with sign equal to the sign of f(0). You might want to take advantage of that as a convenience in your code. In other cases, where you don't know anything about the sign of g, you cannot generally take advantage of it, but the cost here is just that you need to trap for a few extra cases - positive and negative infinity - in addition to NaN if you want to fully error check your code. There is some price there, but it's not large compared to the flexibility gained in other cases.
Why worry about general functions when the question was about "simple division"? One common reason is that if you're computing your numerator and denominator through other arithmetic operations, you accumulate round-off errors. The presence of those errors can be abstracted into the general formula format shown above. For example f(x) = x + e, where x is the analytically correct, exact answer, e represents the error from round-off, and f(x) is the floating point number that you actually have on the machine at execution.

Related

NaN and +-INF in floating point number system following IEEE754

In the standard, representation of NaN and INF is like this:
For NaN: exponent = emax+1 & mantissa != 0;
For INF: exponent = emax+1 & mantissa = 0;
Their are many ways and calculations resulting these two value.
But what ACTUALLY is NaN(INF)?
And HOW does the system "decide" or "judge" to store value as these one(two)?
Here may be a case seeming to be odd to me:
a = b = 1*2(emax);
then calculating c = a+b, the actual result is 1*2^(emax+1);
Now, c is not an available FP value according to the standard;
then how does the system store c in device?
Is it NaN?
If yes, how can this be even reasonable?
I mean, 1*2^(emax+1) IS(Should be) a Number...in a common sense...?
If this is the case, then how ACTUALLY does the standard think what a NaN is?
If not, then how do we deal with this???
I'm considering one like this:
let eM = emax+1;
then 1d.d...d * 2^(eM-1) = 1d.d...d * 2^(emax)
with 1d.d...d having legal number of digits by the system.
This is actually a way like that dealing with denormalized number.
The thing here is this:
Is the judgement posterior or prior to the completion of calculation?
If it's the former, the above may be a problem or not?
On the other hand, then the task seems undonable...
Is there anyone ever thinking about this issue?
Thx for considering it!!
Note: things for +-INF are also presented.
From Wikipedia:
The five possible exceptions are:
Invalid operation (e.g., square root of a negative number) (returns qNaN by default).
Division by zero (an operation on finite operands gives an exact infinite result, e.g., 1/0 or log(0)) (returns ±infinity by default).
Overflow (a result is too large to be represented correctly) (returns ±infinity by default (for round-to-nearest mode)).
Underflow (a result is very small (outside the normal range) and is inexact) (returns a denormalized value by default).
...

numerical issues causing the difference in outputs of two programs?

I have two codes that theoretically should return the exact same output. However, this does not happen. The issue is that the two codes handle very small numbers (doubles) to the order of 1e-100 or so. I suspect that there could be some numerical issues which are related to that, and lead to the two outputs being different even though they should be theoretically the same.
Does it indeed make sense that handling numbers on the order of 1e-100 cause such problems? I don't mind the difference in output, if I could safely assume that the source is numerical issues. Does anyone have a good source/reference that talks about issues that come up with stability of algorithms when they handle numbers in such order?
Thanks.
Does anyone have a good source/reference that talks about issues that come up with stability of algorithms when they handle numbers in such order?
The first reference that comes to mind is What Every Computer Scientist Should Know About Floating-Point Arithmetic. It covers floating-point maths in general.
As far as numerical stability is concerned, the best references probably depend on the numerical algorithm in question. Two wide-ranging works that come to mind are:
Numerical Recipes by Press et al;
Matrix Computations by Golub and Van Loan.
It is not necessarily the small numbers that are causing the problem.
How do you check whether the outputs are the "exact same"?
I would check equality with tolerance. You may consider the floating point numbers x and y equal if either fabs(x-y) < 1.0e-6 or fabs(x-y) < fabs(x)*1.0e-6 holds.
Usually, there is a HUGE difference between the two algorithms if there are numerical issues. Often, a small change in the input may result in extreme changes in the output, if the algorithm suffers from numerical issues.
What makes you think that there are "numerical issues"?
If possible, change your algorithm to use Kahan Summation (aka compensated summation). From Wikipedia:
function KahanSum(input)
var sum = 0.0
var c = 0.0 //A running compensation for lost low-order bits.
for i = 1 to input.length do
y = input[i] - c //So far, so good: c is zero.
t = sum + y //Alas, sum is big, y small, so low-order digits of y are lost.
c = (t - sum) - y //(t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
sum = t //Algebraically, c should always be zero. Beware eagerly optimising compilers!
//Next time around, the lost low part will be added to y in a fresh attempt.
return sum
This works by keeping a second running total of the cumulative error, similar to the Bresenham line drawing algorithm. The end result is that you get precision that is nearly double the data type's advertised precision.
Another technique I use is to sort my numbers from small to large (by manitude, ignoring sign) and add or subtract the small numbers first, then the larger ones. This has the virtue that if you add and subtract the same value multiple times, such numbers may cancel exactly and can be removed from the list.

For any finite floating point value, is it guaranteed that x - x == 0?

Floating point values are inexact, which is why we should rarely use strict numerical equality in comparisons. For example, in Java this prints false (as seen on ideone.com):
System.out.println(.1 + .2 == .3);
// false
Usually the correct way to compare results of floating point calculations is to see if the absolute difference against some expected value is less than some tolerated epsilon.
System.out.println(Math.abs(.1 + .2 - .3) < .00000000000001);
// true
The question is about whether or not some operations can yield exact result. We know that for any non-finite floating point value x (i.e. either NaN or an infinity), x - x is ALWAYS NaN.
But if x is finite, is any of this guaranteed?
x * -1 == -x
x - x == 0
(In particular I'm most interested in Java behavior, but discussions for other languages are also welcome.)
For what it's worth, I think (and I may be wrong here) the answer is YES! I think it boils down to whether or not for any finite IEEE-754 floating point value, its additive inverse is always computable exactly. Since e.g. float and double has one dedicated bit just for the sign, this seems to be the case, since it only needs flipping of the sign bit to find the additive inverse (i.e. the significand should be left intact).
Related questions
Correct Way to Obtain The Most Negative Double
How many double numbers are there between 0.0 and 1.0?
Both equalities are guaranteed with IEEE 754 floating-point, because the results of both x-x and x * -1 are representable exactly as floating-point numbers of the same precision as x. In this case, regardless of the rounding mode, the exact values have to be returned by a compliant implementation.
EDIT: Comparing to the .1 + .2 example.
You can't add .1 and .2 in IEEE 754 because you can't represent them to pass to +. Addition, subtraction, multiplication, division and square root return the unique floating-point value which, depending on the rounding mode, is immediately below, immediately above, nearest with a rule to handle ties, ..., the result of the operation on the same arguments in R. Consequently, when the result (in R) happens to be representable as a floating-point number, this number is automatically the result regardless of the rounding mode.
The fact that your compiler lets you write 0.1 as shorthand for a different, representable number without a warning is orthogonal to the definition of these operations. When you write - (0.1) for instance, the - is exact: it returns exactly the opposite of its argument. On the other hand, its argument is not 0.1, but the double that your compiler uses in its place.
In short, another part of the reason why the operation x * (-1) is exact is that -1 can be represented as a double.
Although x - x may give you -0 rather than true 0, -0 compares as equal to 0, so you will be safe with your assumption that any finite number minus itself will compare equal to zero.
See Is there a floating point value of x, for which x-x == 0 is false? for more details.

Repeated application of functions

Reading this question got me thinking: For a given function f, how can we know that a loop of this form:
while (x > 2)
x = f(x)
will stop for any value x? Is there some simple criterion?
(The fact that f(x) < x for x > 2 doesn't seem to help since the series may converge).
Specifically, can we prove this for sqrt and for log?
For these functions, a proof that ceil(f(x))<x for x > 2 would suffice. You could do one iteration -- to arrive at an integer number, and then proceed by simple induction.
For the general case, probably the best idea is to use well-founded induction to prove this property. However, as Moron pointed out in the comments, this could be impossible in the general case and the right ordering is, in many cases, quite hard to find.
Edit, in reply to Amnon's comment:
If you wanted to use well-founded induction, you would have to define another strict order, that would be well-founded. In case of the functions you mentioned this is not hard: you can take x << y if and only if ceil(x) < ceil(y), where << is a symbol for this new order. This order is of course well-founded on numbers greater then 2, and both sqrt and log are decreasing with respect to it -- so you can apply well-founded induction.
Of course, in general case such an order is much more difficult to find. This is also related, in some way, to total correctness assertions in Hoare logic, where you need to guarantee similar obligations on each loop construct.
There's a general theorem for when then sequence of iterations will converge. (A convergent sequence may not stop in a finite number of steps, but it is getting closer to a target. You can get as close to the target as you like by going far enough out in the sequence.)
The sequence x, f(x), f(f(x)), ... will converge if f is a contraction mapping. That is, there exists a positive constant k < 1 such that for all x and y, |f(x) - f(y)| <= k |x-y|.
(The fact that f(x) < x for x > 2 doesn't seem to help since the series may converge).
If we're talking about floats here, that's not true. If for all x > n f(x) is strictly less than x, it will reach n at some point (because there's only a limited number of floating point values between any two numbers).
Of course this means you need to prove that f(x) is actually less than x using floating point arithmetic (i.e. proving it is less than x mathematically does not suffice, because then f(x) = x may still be true with floats when the difference is not enough).
There is no general algorithm to determine whether a function f and a variable x will end or not in that loop. The Halting problem is reducible to that problem.
For sqrt and log, we could safely do that because we happen to know the mathematical properties of those functions. Say, sqrt approaches 1, log eventually goes negative. So the condition x < 2 has to be false at some point.
Hope that helps.
In the general case, all that can be said is that the loop will terminate when it encounters xi≤2. That doesn't mean that the sequence will converge, nor does it even mean that it is bounded below 2. It only means that the sequence contains a value that is not greater than 2.
That said, any sequence containing a subsequence that converges to a value strictly less than two will (eventually) halt. That is the case for the sequence xi+1 = sqrt(xi), since x converges to 1. In the case of yi+1 = log(yi), it will contain a value less than 2 before becoming undefined for elements of R (though it is well defined on the extended complex plane, C*, but I don't think it will, in general converge except at any stable points that may exist (i.e. where z = log(z)). Ultimately what this means is that you need to perform some upfront analysis on the sequence to better understand its behavior.
The standard test for convergence of a sequence xi to a point z is that give ε > 0, there is an n such that for all i > n, |xi - z| < ε.
As an aside, consider the Mandelbrot Set, M. The test for a particular point c in C for an element in M is whether the sequence zi+1 = zi2 + c is unbounded, which occurs whenever there is a |zi| > 2. Some elements of M may converge (such as 0), but many do not (such as -1).
Sure. For all positive numbers x, the following inequality holds:
log(x) <= x - 1
(this is a pretty basic result from real analysis; it suffices to observe that the second derivative of log is always negative for all positive x, so the function is concave down, and that x-1 is tangent to the function at x = 1). From this it follows essentially immediately that your while loop must terminate within the first ceil(x) - 2 steps -- though in actuality it terminates much, much faster than that.
A similar argument will establish your result for f(x) = sqrt(x); specifically, you can use the fact that:
sqrt(x) <= x/(2 sqrt(2)) + 1/sqrt(2)
for all positive x.
If you're asking whether this result holds for actual programs, instead of mathematically, the answer is a little bit more nuanced, but not much. Basically, many languages don't actually have hard accuracy requirements for the log function, so if your particular language implementation had an absolutely terrible math library this property might fail to hold. That said, it would need to be a really, really terrible library; this property will hold for any reasonable implementation of log.
I suggest reading this wikipedia entry which provides useful pointers. Without additional knowledge about f, nothing can be said.

Should we compare floating point numbers for equality against a *relative* error?

So far I've seen many posts dealing with equality of floating point numbers. The standard answer to a question like "how should we decide if x and y are equal?" is
abs(x - y) < epsilon
where epsilon is a fixed, small constant. This is because the "operands" x and y are often the results of some computation where a rounding error is involved, hence the standard equality operator == is not what we mean, and what we should really ask is whether x and y are close, not equal.
Now, I feel that if x is "almost equal" to y, then also x*10^20 should be "almost equal" to y*10^20, in the sense that the relative error should be the same (but "relative" to what?). But with these big numbers, the above test would fail, i.e. that solution does not "scale".
How would you deal with this issue? Should we rescale the numbers or rescale epsilon? How?
(Or is my intuition wrong?)
Here is a related question, but I don't like its accepted answer, for the reinterpret_cast thing seems a bit tricky to me, I don't understand what's going on. Please try to provide a simple test.
It all depends on the specific problem domain. Yes, using relative error will be more correct in the general case, but it can be significantly less efficient since it involves an extra floating-point division. If you know the approximate scale of the numbers in your problem, using an absolute error is acceptable.
This page outlines a number of techniques for comparing floats. It also goes over a number of important issues, such as those with subnormals, infinities, and NaNs. It's a great read, I highly recommend reading it all the way through.
As an alternative solution, why not just round or truncate the numbers and then make a straight comparison? By setting the number of significant digits in advance, you can be certain of the accuracy within that bound.
The problem is that with very big numbers, comparing to epsilon will fail.
Perhaps a better (but slower) solution would be to use division, example:
div(max(a, b), min(a, b)) < eps + 1
Now the 'error' will be relative.
Using relative error is at least not as bad as using absolute errors, but it has subtle problems for values near zero due to rounding issues. A far from perfect, but somewhat robust algorithm combines absolute and relative error approaches:
boolean approxEqual(float a, float b, float absEps, float relEps) {
// Absolute error check needed when comparing numbers near zero.
float diff = abs(a - b);
if (diff <= absEps) {
return true;
}
// Symmetric relative error check without division.
return (diff <= relEps * max(abs(a), abs(b)));
}
I adapted this code from Bruce Dawson's excellent article Comparing Floating Point Numbers, 2012 Edition, a required read for anyone doing floating-point comparisons -- an amazingly complex topic with many pitfalls.
Most of the time when code compares values, it is doing so to answer some sort of question. For example:
If I know what a function returned when given a value of X, can I assume it will return the same thing if given Y?
If I have a method of computing a function which is slow but accurate, I am willing to accept some inaccuracy in exchange for speed, and I want to test a candidate function which seems to fit the bill, are the outputs from that function close enough to the known-accurate one to be considered "correct".
To answer the first question, code should ideally do a bit-wise comparison on the value, though unless a language supports the new operators added to IEEE-754 in 2009 that may be less efficient than ideal. To answer the second question, one should define what degree of accuracy is required and test against that.
I don't think there's much merit in a general-purpose method which regards as equal things which are close, since different applications will have differing requirements for both absolute and relative tolerance, based upon what exact questions the tests are supposed to answer.