Is floating point precision mutable or invariant? - binary

I keep getting mixed answers of whether floating point numbers (i.e. float, double, or long double) have one and only one value of precision, or have a precision value which can vary.
One topic called float vs. double precision seems to imply that floating point precision is an absolute.
However, another topic called Difference between float and double says,
In general a double has 15 to 16 decimal digits of precision
Another source says,
Variables of type float typically have a precision of about 7 significant digits
Variables of type double typically have a precision of about 16 significant digits
I don't like to refer to approximations like the above if I'm working with sensitive code that can break easily when my values are not exact. So let's set the record straight. Is floating point precision mutable or invariant, and why?

The precision is fixed, which is exactly 53 binary digits for double-precision (or 52 if we exclude the implicit leading 1). This comes out to about 15 decimal digits.
The OP asked me to elaborate on why having exactly 53 binary digits means "about" 15 decimal digits.
To understand this intuitively, let's consider a less-precise floating-point format: instead of a 52-bit mantissa like double-precision numbers have, we're just going to use a 4-bit mantissa.
So, each number will look like: (-1)s × 2yyy × 1.xxxx (where s is the sign bit, yyy is the exponent, and 1.xxxx is the normalised mantissa). For the immediate discussion, we'll focus only on the mantissa and not the sign or exponent.
Here's a table of what 1.xxxx looks like for all xxxx values (all rounding is half-to-even, just like how the default floating-point rounding mode works):
xxxx | 1.xxxx | value | 2dd | 3dd
--------+----------+----------+-------+--------
0000 | 1.0000 | 1.0 | 1.0 | 1.00
0001 | 1.0001 | 1.0625 | 1.1 | 1.06
0010 | 1.0010 | 1.125 | 1.1 | 1.12
0011 | 1.0011 | 1.1875 | 1.2 | 1.19
0100 | 1.0100 | 1.25 | 1.2 | 1.25
0101 | 1.0101 | 1.3125 | 1.3 | 1.31
0110 | 1.0110 | 1.375 | 1.4 | 1.38
0111 | 1.0111 | 1.4375 | 1.4 | 1.44
1000 | 1.1000 | 1.5 | 1.5 | 1.50
1001 | 1.1001 | 1.5625 | 1.6 | 1.56
1010 | 1.1010 | 1.625 | 1.6 | 1.62
1011 | 1.1011 | 1.6875 | 1.7 | 1.69
1100 | 1.1100 | 1.75 | 1.8 | 1.75
1101 | 1.1101 | 1.8125 | 1.8 | 1.81
1110 | 1.1110 | 1.875 | 1.9 | 1.88
1111 | 1.1111 | 1.9375 | 1.9 | 1.94
How many decimal digits do you say that provides? You could say 2, in that each value in the two-decimal-digit range is covered, albeit not uniquely; or you could say 3, which covers all unique values, but do not provide coverage for all values in the three-decimal-digit range.
For the sake of argument, we'll say it has 2 decimal digits: the decimal precision will be the number of digits where all values of those decimal digits could be represented.
Okay, then, so what happens if we halve all the numbers (so we're using yyy = -1)?
xxxx | 1.xxxx | value | 1dd | 2dd
--------+----------+-----------+-------+--------
0000 | 1.0000 | 0.5 | 0.5 | 0.50
0001 | 1.0001 | 0.53125 | 0.5 | 0.53
0010 | 1.0010 | 0.5625 | 0.6 | 0.56
0011 | 1.0011 | 0.59375 | 0.6 | 0.59
0100 | 1.0100 | 0.625 | 0.6 | 0.62
0101 | 1.0101 | 0.65625 | 0.7 | 0.66
0110 | 1.0110 | 0.6875 | 0.7 | 0.69
0111 | 1.0111 | 0.71875 | 0.7 | 0.72
1000 | 1.1000 | 0.75 | 0.8 | 0.75
1001 | 1.1001 | 0.78125 | 0.8 | 0.78
1010 | 1.1010 | 0.8125 | 0.8 | 0.81
1011 | 1.1011 | 0.84375 | 0.8 | 0.84
1100 | 1.1100 | 0.875 | 0.9 | 0.88
1101 | 1.1101 | 0.90625 | 0.9 | 0.91
1110 | 1.1110 | 0.9375 | 0.9 | 0.94
1111 | 1.1111 | 0.96875 | 1. | 0.97
By the same criteria as before, we're now dealing with 1 decimal digit. So you can see how, depending on the exponent, you can have more or less decimal digits, because binary and decimal floating-point numbers do not map cleanly to each other.
The same argument applies to double-precision floating point numbers (with the 52-bit mantissa), only in that case you're getting either 15 or 16 decimal digits depending on the exponent.

All modern computers use binary floating-point arithmetic. That means we have a binary mantissa, which has typically 24 bits for single precision, 53 bits for double precision and 64 bits for extended precision. (Extended precision is available on x86 processors, but not on ARM or possibly other types of processors.)
24, 53, and 64 bit mantissas mean that for a floating-point number between 2k and 2k+1 the next larger number is 2k-23, 2k-52 and 2k-63 respectively. That's the resolution. The rounding error of each floating-point operation is at most half of that.
So how does that translate into decimal numbers? It depends.
Take k = 0 and 1 ≤ x < 2. The resolution is 2-23, 2-52, and 2-63 which is about 1.19×10-7, 2.2×10-16, and 1.08×10-19 respectively. That's a bit less than 7, 16, and 19 decimals. Then take k = 3 and
8 ≤ x < 16. The difference between two floating-point numbers is now 8 times larger. For 8 ≤ x < 10 you get just over 6, less than 15, and just over 18 decimals respectively. But for 10 ≤ x < 16 you get one decimal more!
You get the highest number of decimal digits if x is only a bit less than 2k+1 and only a bit more than 10n, for example 1000 ≤ x < 1024. You get the lowest number of decimal digits if x is just a bit higher than 2k and a bit less than 10n, for example 1⁄1024 ≤ x < 1⁄1000 . The same binary precision can produce decimal precision that varies by up to 1.3 digits or log10 (2×10).
Of course, you could just read the article "What every computer scientist should know about floating-point arithmetic."

80x86 code using its hardware coprocessor (originally the 8087) provide three levels of precision: 32-bit, 64-bit, and 80-bit. Those very closely follow the IEEE-754 standard of 1985. The recent standard specifies a 128-bit format. The floating point formats have 24, 53, 65, and 113 mantissa bits which correspond to 7.22, 15.95, 19.57, and 34.02 decimal digits of precision.
The formula is mantissa_bits / log_2 10 where the log base two of ten is 3.321928095.
While the precision of any particular implementation does not vary, it may appear to when a floating point value is converted to decimal. Note that the value 0.1 does not have an exact binary representation. It is a repeating bit pattern (0.0001100110011001100110011001100...) like we are used to in decimal for 0.3333333333333 to approximate 1/3.
Many languages often don't support the 80-bit format. Some C compilers may offer long double which uses either 80-bit floats or 128-bit floats. Alas, it might also use a 64-bit float, depending on the implementation.
The NPU has 80 bit registers and performs all operations using the full 80 bit result. Code which calculates within the NPU stack benefit from this extra precision. Unfortunately, poor code generation—or poorly written code— might truncate or round intermediate calculations by storing them in a 32-bit or 64-bit variable.

Is floating point precision mutable or invariant, and why?
Typically, given any numbers in the same power-of-2 range, the floating point precision is invariant - a fixed value. The absolute precision changes with each power-of-2 step. Over the entire FP range, the precision is approximately relative to the magnitude. Relating this relative binary precision in terms of a decimal precision incurs a wobble varying between DBL_DIG and DBL_DECIMAL_DIG decimal digits - Typically 15 to 17.
What is precision? With FP, it makes most sense to discuss relative precision.
Floating point numbers have the form of:
Sign * Significand * pow(base,exponent)
They have a logarithmic distribution. There are about as many different floating point numbers between 100.0 and 3000.0 ( a range of 30x) as there are between 2.0 and 60.0. This is true regardless of the underlying storage representation.
1.23456789e100 has about the same relative precision as 1.23456789e-100.
Most computers implemment double as binary64. This format has 53 bits of binary precision.
The n numbers between 1.0 and 2.0 have the same absolute precision of 1 part in ((2.0-1.0)/pow(2,52).
Numbers between 64.0 and 128.0, also n, have the same absolute precision of 1 part in ((128.0-64.0)/pow(2,52).
Even group of numbers between powers of 2, have the same absolute precision.
Over the entire normal range of FP numbers, this approximates a uniform relative precision.
When these numbers are represented as decimal, the precision wobbles: Numbers 1.0 to 2.0 have 1 more bit of absolute precision than numbers 2.0 to 4.0. 2 more bits than 4.0 to 8.0, etc.
C provides DBL_DIG, DBL_DECIMAL_DIG, and their float and long double counterparts. DBL_DIG indicates the minimum relative decimal precision. DBL_DECIMAL_DIG can be thought of as the maximum relative decimal precision.
Typically this means given double will have at 15 to 17 decimal digits of precision.
Consider 1.0and its next representable double, the digits do not change until the 17th significant decimal digit. Each next double is pow(2,-52) or about 2.2204e-16 apart.
/*
1 234567890123456789 */
1.000000000000000000...
1.000000000000000222...
Now consider "8.521812787393891"and its next representable number as a decimal string using 16 significant decimal digits. Both of these strings, converted to double are the same 8.521812787393891142073699... even though they differ in the 16th digit. Saying this double had 16 digits of precision was over-stated.
/*
1 234567890123456789 */
8.521812787393891
8.521812787393891142073699...
8.521812787393892

No, it is variable. Starting point is the very weak IEEE-754 standard, it only nailed down the format of floating pointer numbers as they are stored in memory. You can count on 7 digits of precision for single precision, 15 digits for double precision.
But a major flaw in that standard is that it does not specify how calculations are to be performed. And there's trouble, the Intel 8087 floating point processor in particular has caused programmers many sleepless nights. A significant design flaw in that chip is that it stores floating point values with more bits than the memory format. 80 bits instead of 32 or 64. The theory behind that design choice is that this allows to be intermediate calculations to be more precise and cause less round-off error.
Sounds like a good idea, that however did not turn out well in practice. A compiler writer will try to generate code that leaves intermediate values stored in the FPU as long as possible. Important to code speed, storing the value back to memory is expensive. Trouble is, he often must store values back, the number of registers in the FPU are limited and the code might cross a function boundary. At which point the value gets truncated back and loses a lot of precision. Small changes to the source code can now produce drastically different values. Also, the non-optimized build of a program produces different results from the optimized one. In a completely undiagnosable way, you'd have to look at the machine code to know why the result is different.
Intel redesigned their processor to solve this problem, the SSE instruction set calculates with the same number of bits as the memory format. Slow to catch on however, redesigning the code generator and optimizer of a compiler is a significant investment. The big three C++ compilers have all switched. But for example the x86 jitter in the .NET Framework still generates FPU code, it always will.
Then there is systemic error, losing precision as inevitable side-effect of the conversion and calculation. Conversion first, humans work in numbers in base 10 but the processor uses base 2. Nice round numbers we use, like 0.1 cannot be converted to nice round numbers on the processor. 0.1 is perfect as a sum of powers of 10 but there is no finite sum of powers of 2 that produce the same value. Converting it produces an infinite number of 1s and 0s in the same manner that you can't perfectly write down 10 / 3. So it needs to be truncated to fit the processor and that produces a value that's off by +/- 0.5 bit from the decimal value.
And calculation produces error. A multiplication or division doubles the number of bits in the result, rounding it to fit it back into the stored value produces +/- 0.5 bit error. Subtraction is the most dangerous operation and can cause loss of a lot of significant digits. If you, say, calculate 1.234567f - 1.234566f then the result has only 1 significant digit left. That's a junk result. Summing the difference between numbers that have nearly the same value is a very common in numerical algorithms.
Getting excessive systemic errors is ultimately a flaw in the mathematical model. Just as an example, you never want to use Gaussian elimination, it is very unfriendly to precision. And always consider an alternative approach, LU Decomposition is an excellent approach. It is however not that common that a mathematician was involved in building the model and accounted for the expected precision of the result. A common book like Numerical Recipes also doesn't pay enough attention to precision, albeit that it indirectly steers you away from bad models by proposing the better one. In the end, a programmer often gets stuck with the problem. Well, it was easy then anybody could do it and I'd be out of a good paying job :)

The type of a floating point variable defines what range of values and how many fractional bits (!) can be represented. As there is no integer relation between decimal and binary fraction, the decimal fraction is actually an approximation.
Second: Another problem is the precision arithmetic operations are performed. Just think of 1.0/3.0 or PI. Such values cannot be represented with a limited number of digits - neither decimal, nor binary. So the values have to be rounded to fit into the given space. The more fractional digits are available, the higher the precision.
Now think of multiple such operations being applied, e.g. PI/3.0 . This would require to round twice: PI as such is not exact and the result neither. This will loose precision twice, if repreated it becomes worse.
So, back to float and double: float has according to the standard (C11, Annex F, also for the rest) less bits available, so roundig will be less precise than for double. Just think of having a decimal with 2 fractional digits (m.ff, call it float) and one with four (m.ffff, call it double). If double is used for all calculations, you can have more operations until your result has only 2 correct fractional digits, than if you already start with float, even if a float result would suffice.
Note that on some (embedded) CPUs like ARM Cortex-M4F, the hardware FPU only supports folat (single precision), so double arithmetic will be much more costly. Other MCUs have no hardware floating point calculator at all, so they have to be simulated my software (very costly). On most GPUs, float is also much cheaper to perform than double, sometimes by more than a factor of 10.

The storage has a precise digit count in binary, as other answers explain.
One thing to know, the CPU can run operations at a different precision internally, like 80 bits. It means that code like that can trigger :
void Kaboom( float a, float b, float c ) // same is true for other floating point types.
{
float sum1 = a+b+c;
float sum2 = a+b;
sum2 += c; // let's assume that the compiler did not keep sum2 in a register and the value was write to memory then load again.
if (sum1 !=sum2)
throw "kaboom"; // this can happen.
}
It is more likely with more complex computation.

I'm going to add the off-beat answer here, and say that since you've tagged this question as C++, there is no guarantee whatsoever about precision of floating point data. The vast majority of implementations use IEEE-754 when implementing their floating point types, but that is not required. The only thing required by the C++ language is that (C++ spec §3.9.1.8):
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum and minimum values of each arithmetic type for an implementation.

The amount of space required to store a float will be constant, and likewise a double; the amount of useful precision will in relative terms generally vary, however, between one part in 223 and one part in 224 for float, or one part in 252 and 253 for double. Precision very near zero isn't that good, with the second-smallest positive value being twice as big as the smallest, which will in turn be infinitely greater than zero. Throughout the most of the range, however, precision will vary as described above.
Note that while it often isn't practical to have types whose relative precision varies by less than a factor of two throughout its range, the variation in precision can sometimes cause calculations to yield much less accurate calculations than it would appear they should. Consider, for example, 16777215.0f + 4.0f - 4.0f. All of the values would be precisely representable as float using the same scale, and the nearest values to the large one are +/- one part in 16,777,215, but the first addition yields a result in part of the float range where values are separated by one part in only 8,388,610, causing the result to be rounded to 16,777,220. Consequently, subtracting 4 yields 16,777,216 rather than 16,777,215. For most values of float near 16777216, adding 4.0f and subtracting 4.0f would yield the original value unchanged, but the changing precision right at the break-over point causes the result to be off by an extra bit in the lowest place.

Well the answer to this is simple but complicated. These numbers are stored in binary. Depending on if it is a float or a double, the computer uses different amounts of binary to store the number. The precision that you get depends on your binary. If you don't know how binary numbers work, it would be a good idea to look it up. But simply put, some numbers need more ones and zeros than other numbers.
So the precision is fixed (same number of binary digits), but the actual precision that you get depends on the numbers that you are using.

Related

What decimal value does the 8-bit binary number 11011111 have if it is on a computer using signed-magnitude representation?

I am confused on how to present this in decimal value
Would it be just the negative value of 223 or would it be -125?
Would it be just the negative value of 223 or would it be -125?
Neither. Signed magnitude representation reserves one bit to represent the sign, and the remaining bits make up the value.
In this case:
bits 0-6 are 1011111, which is 95 in decimal
bit 7 is 1, meaning the value is negative
So the result is -95.
As for your calculation of -125, it appears you got that backwards. 125 is 1111101 in binary. Definitely got your wires crossed there.

SQL command Floor return result wrong [duplicate]

Consider the following code:
0.1 + 0.2 == 0.3 -> false
0.1 + 0.2 -> 0.30000000000000004
Why do these inaccuracies happen?
Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.
For 0.1 in the standard binary64 format, the representation can be written exactly as
0.1000000000000000055511151231257827021181583404541015625 in decimal, or
0x1.999999999999ap-4 in C99 hexfloat notation.
In contrast, the rational number 0.1, which is 1/10, can be written exactly as
0.1 in decimal, or
0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.
The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.
A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.
Side Note: All positional (base-N) number systems share this problem with precision
Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...
You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001)** - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.
Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).
So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)
Side Side Note: Working with Floats in Programming
In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.
You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:
Do not do if (x == y) { ... }
Instead do if (abs(x - y) < myToleranceValue) { ... }.
where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These can be used as tolerance values but their effectiveness depends on the magnitude (size) of the numbers you're working with, since calculations with large numbers may exceed the epsilon threshold.
A Hardware Designer's Perspective
I believe I should add a hardware designer’s perspective to this since I design and build floating point hardware. Knowing the origin of the error may help in understanding what is happening in the software, and ultimately, I hope this helps explain the reasons for why floating point errors happen and seem to accumulate over time.
1. Overview
From an engineering perspective, most floating point operations will have some element of error since the hardware that does the floating point computations is only required to have an error of less than one half of one unit in the last place. Therefore, much hardware will stop at a precision that's only necessary to yield an error of less than one half of one unit in the last place for a single operation which is especially problematic in floating point division. What constitutes a single operation depends upon how many operands the unit takes. For most, it is two, but some units take 3 or more operands. Because of this, there is no guarantee that repeated operations will result in a desirable error since the errors add up over time.
2. Standards
Most processors follow the IEEE-754 standard but some use denormalized, or different standards
. For example, there is a denormalized mode in IEEE-754 which allows representation of very small floating point numbers at the expense of precision. The following, however, will cover the normalized mode of IEEE-754 which is the typical mode of operation.
In the IEEE-754 standard, hardware designers are allowed any value of error/epsilon as long as it's less than one half of one unit in the last place, and the result only has to be less than one half of one unit in the last place for one operation. This explains why when there are repeated operations, the errors add up. For IEEE-754 double precision, this is the 54th bit, since 53 bits are used to represent the numeric part (normalized), also called the mantissa, of the floating point number (e.g. the 5.3 in 5.3e5). The next sections go into more detail on the causes of hardware error on various floating point operations.
3. Cause of Rounding Error in Division
The main cause of the error in floating point division is the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y, Z = X * (1/Y). A division is computed iteratively i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in the slow division, and the size in bits of the quotient selection table is usually the width of the radix, or a number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k>=2. So for example, a typical Quotient Selection Table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2+2= 4 bits (plus a few optional bits).
3.1 Division Rounding Error: Approximation of Reciprocal
What reciprocals are in the quotient selection table depend on the division method: slow division such as SRT division, or fast division such as Goldschmidt division; each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case, though, all reciprocals are approximations of the actual reciprocal and introduce some element of error. Both slow division and fast division methods calculate the quotient iteratively, i.e. some number of bits of the quotient are calculated each step, then the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build, and fast division methods calculate a variable number of digits per step and are usually more expensive to build. The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal, so they are prone to error.
4. Rounding Errors in Other Operations: Truncation
Another cause of the rounding errors in all operations are the different modes of truncation of the final answer that IEEE-754 allows. There's truncate, round-towards-zero, round-to-nearest (default), round-down, and round-up. All methods introduce an element of error of less than one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error is especially problematic in exponentiation, which involves some form of repeated multiplication.
5. Repeated Operations
Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if not watched. This is the reason that in computations that require a bounded error, mathematicians use methods such as using the round-to-nearest even digit in the last place of IEEE-754, because, over time, the errors are more likely to cancel each other out, and Interval Arithmetic combined with variations of the IEEE 754 rounding modes to predict rounding errors, and correct them. Because of its low relative error compared to other rounding modes, round to nearest even digit (in the last place), is the default rounding mode of IEEE-754.
Note that the default rounding mode, round-to-nearest even digit in the last place, guarantees an error of less than one half of one unit in the last place for one operation. Using the truncation, round-up, and round down alone may result in an error that is greater than one half of one unit in the last place, but less than one unit in the last place, so these modes are not recommended unless they are used in Interval Arithmetic.
6. Summary
In short, the fundamental reason for the errors in floating point operations is a combination of the truncation in hardware, and the truncation of a reciprocal in the case of division. Since the IEEE-754 standard only requires an error of less than one half of one unit in the last place for a single operation, the floating point errors over repeated operations will add up unless corrected.
It's broken in the exact same way the decimal (base-10) notation you learned in grade school and use every day is broken, just for base-2.
To understand, think about representing 1/3 as a decimal value. It's impossible to do exactly! The world will end before you finish writing the 3's after the decimal point, and so instead we write to some number of places and consider it sufficiently accurate.
In the same way, 1/10 (decimal 0.1) cannot be represented exactly in base 2 (binary) as a "decimal" value; a repeating pattern after the decimal point goes on forever. The value is not exact, and therefore you can't do exact math with it using normal floating point methods. Just like with base 10, there are other values that exhibit this problem as well.
Most answers here address this question in very dry, technical terms. I'd like to address this in terms that normal human beings can understand.
Imagine that you are trying to slice up pizzas. You have a robotic pizza cutter that can cut pizza slices exactly in half. It can halve a whole pizza, or it can halve an existing slice, but in any case, the halving is always exact.
That pizza cutter has very fine movements, and if you start with a whole pizza, then halve that, and continue halving the smallest slice each time, you can do the halving 53 times before the slice is too small for even its high-precision abilities. At that point, you can no longer halve that very thin slice, but must either include or exclude it as is.
Now, how would you piece all the slices in such a way that would add up to one-tenth (0.1) or one-fifth (0.2) of a pizza? Really think about it, and try working it out. You can even try to use a real pizza, if you have a mythical precision pizza cutter at hand. :-)
Most experienced programmers, of course, know the real answer, which is that there is no way to piece together an exact tenth or fifth of the pizza using those slices, no matter how finely you slice them. You can do a pretty good approximation, and if you add up the approximation of 0.1 with the approximation of 0.2, you get a pretty good approximation of 0.3, but it's still just that, an approximation.
For double-precision numbers (which is the precision that allows you to halve your pizza 53 times), the numbers immediately less and greater than 0.1 are 0.09999999999999999167332731531132594682276248931884765625 and 0.1000000000000000055511151231257827021181583404541015625. The latter is quite a bit closer to 0.1 than the former, so a numeric parser will, given an input of 0.1, favour the latter.
(The difference between those two numbers is the "smallest slice" that we must decide to either include, which introduces an upward bias, or exclude, which introduces a downward bias. The technical term for that smallest slice is an ulp.)
In the case of 0.2, the numbers are all the same, just scaled up by a factor of 2. Again, we favour the value that's slightly higher than 0.2.
Notice that in both cases, the approximations for 0.1 and 0.2 have a slight upward bias. If we add enough of these biases in, they will push the number further and further away from what we want, and in fact, in the case of 0.1 + 0.2, the bias is high enough that the resulting number is no longer the closest number to 0.3.
In particular, 0.1 + 0.2 is really 0.1000000000000000055511151231257827021181583404541015625 + 0.200000000000000011102230246251565404236316680908203125 = 0.3000000000000000444089209850062616169452667236328125, whereas the number closest to 0.3 is actually 0.299999999999999988897769753748434595763683319091796875.
P.S. Some programming languages also provide pizza cutters that can split slices into exact tenths. Although such pizza cutters are uncommon, if you do have access to one, you should use it when it's important to be able to get exactly one-tenth or one-fifth of a slice.
(Originally posted on Quora.)
Floating point rounding errors. 0.1 cannot be represented as accurately in base-2 as in base-10 due to the missing prime factor of 5. Just as 1/3 takes an infinite number of digits to represent in decimal, but is "0.1" in base-3, 0.1 takes an infinite number of digits in base-2 where it does not in base-10. And computers don't have an infinite amount of memory.
My answer is quite long, so I've split it into three sections. Since the question is about floating point mathematics, I've put the emphasis on what the machine actually does. I've also made it specific to double (64 bit) precision, but the argument applies equally to any floating point arithmetic.
Preamble
An IEEE 754 double-precision binary floating-point format (binary64) number represents a number of the form
value = (-1)^s * (1.m51m50...m2m1m0)2 * 2e-1023
in 64 bits:
The first bit is the sign bit: 1 if the number is negative, 0 otherwise1.
The next 11 bits are the exponent, which is offset by 1023. In other words, after reading the exponent bits from a double-precision number, 1023 must be subtracted to obtain the power of two.
The remaining 52 bits are the significand (or mantissa). In the mantissa, an 'implied' 1. is always2 omitted since the most significant bit of any binary value is 1.
1 - IEEE 754 allows for the concept of a signed zero - +0 and -0 are treated differently: 1 / (+0) is positive infinity; 1 / (-0) is negative infinity. For zero values, the mantissa and exponent bits are all zero. Note: zero values (+0 and -0) are explicitly not classed as denormal2.
2 - This is not the case for denormal numbers, which have an offset exponent of zero (and an implied 0.). The range of denormal double precision numbers is dmin ≤ |x| ≤ dmax, where dmin (the smallest representable nonzero number) is 2-1023 - 51 (≈ 4.94 * 10-324) and dmax (the largest denormal number, for which the mantissa consists entirely of 1s) is 2-1023 + 1 - 2-1023 - 51 (≈ 2.225 * 10-308).
Turning a double precision number to binary
Many online converters exist to convert a double precision floating point number to binary (e.g. at binaryconvert.com), but here is some sample C# code to obtain the IEEE 754 representation for a double precision number (I separate the three parts with colons (:):
public static string BinaryRepresentation(double value)
{
long valueInLongType = BitConverter.DoubleToInt64Bits(value);
string bits = Convert.ToString(valueInLongType, 2);
string leadingZeros = new string('0', 64 - bits.Length);
string binaryRepresentation = leadingZeros + bits;
string sign = binaryRepresentation[0].ToString();
string exponent = binaryRepresentation.Substring(1, 11);
string mantissa = binaryRepresentation.Substring(12);
return string.Format("{0}:{1}:{2}", sign, exponent, mantissa);
}
Getting to the point: the original question
(Skip to the bottom for the TL;DR version)
Cato Johnston (the question asker) asked why 0.1 + 0.2 != 0.3.
Written in binary (with colons separating the three parts), the IEEE 754 representations of the values are:
0.1 => 0:01111111011:1001100110011001100110011001100110011001100110011010
0.2 => 0:01111111100:1001100110011001100110011001100110011001100110011010
Note that the mantissa is composed of recurring digits of 0011. This is key to why there is any error to the calculations - 0.1, 0.2 and 0.3 cannot be represented in binary precisely in a finite number of binary bits any more than 1/9, 1/3 or 1/7 can be represented precisely in decimal digits.
Also note that we can decrease the power in the exponent by 52 and shift the point in the binary representation to the right by 52 places (much like 10-3 * 1.23 == 10-5 * 123). This then enables us to represent the binary representation as the exact value that it represents in the form a * 2p. where 'a' is an integer.
Converting the exponents to decimal, removing the offset, and re-adding the implied 1 (in square brackets), 0.1 and 0.2 are:
0.1 => 2^-4 * [1].1001100110011001100110011001100110011001100110011010
0.2 => 2^-3 * [1].1001100110011001100110011001100110011001100110011010
or
0.1 => 2^-56 * 7205759403792794 = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794 = 0.200000000000000011102230246251565404236316680908203125
To add two numbers, the exponent needs to be the same, i.e.:
0.1 => 2^-3 * 0.1100110011001100110011001100110011001100110011001101(0)
0.2 => 2^-3 * 1.1001100110011001100110011001100110011001100110011010
sum = 2^-3 * 10.0110011001100110011001100110011001100110011001100111
or
0.1 => 2^-55 * 3602879701896397 = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794 = 0.200000000000000011102230246251565404236316680908203125
sum = 2^-55 * 10808639105689191 = 0.3000000000000000166533453693773481063544750213623046875
Since the sum is not of the form 2n * 1.{bbb} we increase the exponent by one and shift the decimal (binary) point to get:
sum = 2^-2 * 1.0011001100110011001100110011001100110011001100110011(1)
= 2^-54 * 5404319552844595.5 = 0.3000000000000000166533453693773481063544750213623046875
There are now 53 bits in the mantissa (the 53rd is in square brackets in the line above). The default rounding mode for IEEE 754 is 'Round to Nearest' - i.e. if a number x falls between two values a and b, the value where the least significant bit is zero is chosen.
a = 2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875
= 2^-2 * 1.0011001100110011001100110011001100110011001100110011
x = 2^-2 * 1.0011001100110011001100110011001100110011001100110011(1)
b = 2^-2 * 1.0011001100110011001100110011001100110011001100110100
= 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125
Note that a and b differ only in the last bit; ...0011 + 1 = ...0100. In this case, the value with the least significant bit of zero is b, so the sum is:
sum = 2^-2 * 1.0011001100110011001100110011001100110011001100110100
= 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125
whereas the binary representation of 0.3 is:
0.3 => 2^-2 * 1.0011001100110011001100110011001100110011001100110011
= 2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875
which only differs from the binary representation of the sum of 0.1 and 0.2 by 2-54.
The binary representation of 0.1 and 0.2 are the most accurate representations of the numbers allowable by IEEE 754. The addition of these representation, due to the default rounding mode, results in a value which differs only in the least-significant-bit.
TL;DR
Writing 0.1 + 0.2 in a IEEE 754 binary representation (with colons separating the three parts) and comparing it to 0.3, this is (I've put the distinct bits in square brackets):
0.1 + 0.2 => 0:01111111101:0011001100110011001100110011001100110011001100110[100]
0.3 => 0:01111111101:0011001100110011001100110011001100110011001100110[011]
Converted back to decimal, these values are:
0.1 + 0.2 => 0.300000000000000044408920985006...
0.3 => 0.299999999999999988897769753748...
The difference is exactly 2-54, which is ~5.5511151231258 × 10-17 - insignificant (for many applications) when compared to the original values.
Comparing the last few bits of a floating point number is inherently dangerous, as anyone who reads the famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (which covers all the major parts of this answer) will know.
Most calculators use additional guard digits to get around this problem, which is how 0.1 + 0.2 would give 0.3: the final few bits are rounded.
In addition to the other correct answers, you may want to consider scaling your values to avoid problems with floating-point arithmetic.
For example:
var result = 1.0 + 2.0; // result === 3.0 returns true
... instead of:
var result = 0.1 + 0.2; // result === 0.3 returns false
The expression 0.1 + 0.2 === 0.3 returns false in JavaScript, but fortunately integer arithmetic in floating-point is exact, so decimal representation errors can be avoided by scaling.
As a practical example, to avoid floating-point problems where accuracy is paramount, it is recommended1 to handle money as an integer representing the number of cents: 2550 cents instead of 25.50 dollars.
1 Douglas Crockford: JavaScript: The Good Parts: Appendix A - Awful Parts (page 105).
Floating point numbers stored in the computer consist of two parts, an integer and an exponent that the base is taken to and multiplied by the integer part.
If the computer were working in base 10, 0.1 would be 1 x 10⁻¹, 0.2 would be 2 x 10⁻¹, and 0.3 would be 3 x 10⁻¹. Integer math is easy and exact, so adding 0.1 + 0.2 will obviously result in 0.3.
Computers don't usually work in base 10, they work in base 2. You can still get exact results for some values, for example 0.5 is 1 x 2⁻¹ and 0.25 is 1 x 2⁻², and adding them results in 3 x 2⁻², or 0.75. Exactly.
The problem comes with numbers that can be represented exactly in base 10, but not in base 2. Those numbers need to be rounded to their closest equivalent. Assuming the very common IEEE 64-bit floating point format, the closest number to 0.1 is 3602879701896397 x 2⁻⁵⁵, and the closest number to 0.2 is 7205759403792794 x 2⁻⁵⁵; adding them together results in 10808639105689191 x 2⁻⁵⁵, or an exact decimal value of 0.3000000000000000444089209850062616169452667236328125. Floating point numbers are generally rounded for display.
In short it's because:
Floating point numbers cannot represent all decimals precisely in binary
So just like 10/3 which does not exist in base 10 precisely (it will be 3.33... recurring), in the same way 1/10 doesn't exist in binary.
So what? How to deal with it? Is there any workaround?
In order to offer The best solution I can say I discovered following method:
parseFloat((0.1 + 0.2).toFixed(10)) => Will return 0.3
Let me explain why it's the best solution.
As others mentioned in above answers it's a good idea to use ready to use Javascript toFixed() function to solve the problem. But most likely you'll encounter with some problems.
Imagine you are going to add up two float numbers like 0.2 and 0.7 here it is: 0.2 + 0.7 = 0.8999999999999999.
Your expected result was 0.9 it means you need a result with 1 digit precision in this case.
So you should have used (0.2 + 0.7).tofixed(1)
but you can't just give a certain parameter to toFixed() since it depends on the given number, for instance
0.22 + 0.7 = 0.9199999999999999
In this example you need 2 digits precision so it should be toFixed(2), so what should be the paramter to fit every given float number?
You might say let it be 10 in every situation then:
(0.2 + 0.7).toFixed(10) => Result will be "0.9000000000"
Damn! What are you going to do with those unwanted zeros after 9?
It's the time to convert it to float to make it as you desire:
parseFloat((0.2 + 0.7).toFixed(10)) => Result will be 0.9
Now that you found the solution, it's better to offer it as a function like this:
function floatify(number){
return parseFloat((number).toFixed(10));
}
Let's try it yourself:
function floatify(number){
return parseFloat((number).toFixed(10));
}
function addUp(){
var number1 = +$("#number1").val();
var number2 = +$("#number2").val();
var unexpectedResult = number1 + number2;
var expectedResult = floatify(number1 + number2);
$("#unexpectedResult").text(unexpectedResult);
$("#expectedResult").text(expectedResult);
}
addUp();
input{
width: 50px;
}
#expectedResult{
color: green;
}
#unexpectedResult{
color: red;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input id="number1" value="0.2" onclick="addUp()" onkeyup="addUp()"/> +
<input id="number2" value="0.7" onclick="addUp()" onkeyup="addUp()"/> =
<p>Expected Result: <span id="expectedResult"></span></p>
<p>Unexpected Result: <span id="unexpectedResult"></span></p>
You can use it this way:
var x = 0.2 + 0.7;
floatify(x); => Result: 0.9
As W3SCHOOLS suggests there is another solution too, you can multiply and divide to solve the problem above:
var x = (0.2 * 10 + 0.1 * 10) / 10; // x will be 0.3
Keep in mind that (0.2 + 0.1) * 10 / 10 won't work at all although it seems the same!
I prefer the first solution since I can apply it as a function which converts the input float to accurate output float.
FYI, the same problem exists for multiplication, for instance 0.09 * 10 returns 0.8999999999999999. Apply the flotify function as a workaround: flotify(0.09 * 10) returns 0.9
Floating point rounding error. From What Every Computer Scientist Should Know About Floating-Point Arithmetic:
Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.
My workaround:
function add(a, b, precision) {
var x = Math.pow(10, precision || 2);
return (Math.round(a * x) + Math.round(b * x)) / x;
}
precision refers to the number of digits you want to preserve after the decimal point during addition.
No, not broken, but most decimal fractions must be approximated
Summary
Floating point arithmetic is exact, unfortunately, it doesn't match up well with our usual base-10 number representation, so it turns out we are often giving it input that is slightly off from what we wrote.
Even simple numbers like 0.01, 0.02, 0.03, 0.04 ... 0.24 are not representable exactly as binary fractions. If you count up 0.01, .02, .03 ..., not until you get to 0.25 will you get the first fraction representable in base2. If you tried that using FP, your 0.01 would have been slightly off, so the only way to add 25 of them up to a nice exact 0.25 would have required a long chain of causality involving guard bits and rounding. It's hard to predict so we throw up our hands and say "FP is inexact", but that's not really true.
We constantly give the FP hardware something that seems simple in base 10 but is a repeating fraction in base 2.
How did this happen?
When we write in decimal, every fraction (specifically, every terminating decimal) is a rational number of the form
a / (2n x 5m)
In binary, we only get the 2n term, that is:
a / 2n
So in decimal, we can't represent 1/3. Because base 10 includes 2 as a prime factor, every number we can write as a binary fraction also can be written as a base 10 fraction. However, hardly anything we write as a base10 fraction is representable in binary. In the range from 0.01, 0.02, 0.03 ... 0.99, only three numbers can be represented in our FP format: 0.25, 0.50, and 0.75, because they are 1/4, 1/2, and 3/4, all numbers with a prime factor using only the 2n term.
In base10 we can't represent 1/3. But in binary, we can't do 1/10 or 1/3.
So while every binary fraction can be written in decimal, the reverse is not true. And in fact most decimal fractions repeat in binary.
Dealing with it
Developers are usually instructed to do < epsilon comparisons, better advice might be to round to integral values (in the C library: round() and roundf(), i.e., stay in the FP format) and then compare. Rounding to a specific decimal fraction length solves most problems with output.
Also, on real number-crunching problems (the problems that FP was invented for on early, frightfully expensive computers) the physical constants of the universe and all other measurements are only known to a relatively small number of significant figures, so the entire problem space was "inexact" anyway. FP "accuracy" isn't a problem in this kind of application.
The whole issue really arises when people try to use FP for bean counting. It does work for that, but only if you stick to integral values, which kind of defeats the point of using it. This is why we have all those decimal fraction software libraries.
I love the Pizza answer by Chris, because it describes the actual problem, not just the usual handwaving about "inaccuracy". If FP were simply "inaccurate", we could fix that and would have done it decades ago. The reason we haven't is because the FP format is compact and fast and it's the best way to crunch a lot of numbers. Also, it's a legacy from the space age and arms race and early attempts to solve big problems with very slow computers using small memory systems. (Sometimes, individual magnetic cores for 1-bit storage, but that's another story.)
Conclusion
If you are just counting beans at a bank, software solutions that use decimal string representations in the first place work perfectly well. But you can't do quantum chromodynamics or aerodynamics that way.
A lot of good answers have been posted, but I'd like to append one more.
Not all numbers can be represented via floats/doubles
For example, the number "0.2" will be represented as "0.200000003" in single precision in IEEE754 float point standard.
Model for store real numbers under the hood represent float numbers as
Even though you can type 0.2 easily, FLT_RADIX and DBL_RADIX is 2; not 10 for a computer with FPU which uses "IEEE Standard for Binary Floating-Point Arithmetic (ISO/IEEE Std 754-1985)".
So it is a bit hard to represent such numbers exactly. Even if you specify this variable explicitly without any intermediate calculation.
Some statistics related to this famous double precision question.
When adding all values (a + b) using a step of 0.1 (from 0.1 to 100) we have ~15% chance of precision error. Note that the error could result in slightly bigger or smaller values.
Here are some examples:
0.1 + 0.2 = 0.30000000000000004 (BIGGER)
0.1 + 0.7 = 0.7999999999999999 (SMALLER)
...
1.7 + 1.9 = 3.5999999999999996 (SMALLER)
1.7 + 2.2 = 3.9000000000000004 (BIGGER)
...
3.2 + 3.6 = 6.800000000000001 (BIGGER)
3.2 + 4.4 = 7.6000000000000005 (BIGGER)
When subtracting all values (a - b where a > b) using a step of 0.1 (from 100 to 0.1) we have ~34% chance of precision error.
Here are some examples:
0.6 - 0.2 = 0.39999999999999997 (SMALLER)
0.5 - 0.4 = 0.09999999999999998 (SMALLER)
...
2.1 - 0.2 = 1.9000000000000001 (BIGGER)
2.0 - 1.9 = 0.10000000000000009 (BIGGER)
...
100 - 99.9 = 0.09999999999999432 (SMALLER)
100 - 99.8 = 0.20000000000000284 (BIGGER)
*15% and 34% are indeed huge, so always use BigDecimal when precision is of big importance. With 2 decimal digits (step 0.01) the situation worsens a bit more (18% and 36%).
Given that nobody has mentioned this...
Some high level languages such as Python and Java come with tools to overcome binary floating point limitations. For example:
Python's decimal module and Java's BigDecimal class, that represent numbers internally with decimal notation (as opposed to binary notation). Both have limited precision, so they are still error prone, however they solve most common problems with binary floating point arithmetic.
Decimals are very nice when dealing with money: ten cents plus twenty cents are always exactly thirty cents:
>>> 0.1 + 0.2 == 0.3
False
>>> Decimal('0.1') + Decimal('0.2') == Decimal('0.3')
True
Python's decimal module is based on IEEE standard 854-1987.
Python's fractions module and Apache Common's BigFraction class. Both represent rational numbers as (numerator, denominator) pairs and they may give more accurate results than decimal floating point arithmetic.
Neither of these solutions is perfect (especially if we look at performances, or if we require a very high precision), but still they solve a great number of problems with binary floating point arithmetic.
Did you try the duct tape solution?
Try to determine when errors occur and fix them with short if statements, it's not pretty but for some problems it is the only solution and this is one of them.
if( (n * 0.1) < 100.0 ) { return n * 0.1 - 0.000000000000001 ;}
else { return n * 0.1 + 0.000000000000001 ;}
I had the same problem in a scientific simulation project in c#, and I can tell you that if you ignore the butterfly effect it's gonna turn to a big fat dragon and bite you in the a**
Those weird numbers appear because computers use binary(base 2) number system for calculation purposes, while we use decimal(base 10).
There are a majority of fractional numbers that cannot be represented precisely either in binary or in decimal or both. Result - A rounded up (but precise) number results.
Many of this question's numerous duplicates ask about the effects of floating point rounding on specific numbers. In practice, it is easier to get a feeling for how it works by looking at exact results of calculations of interest rather than by just reading about it. Some languages provide ways of doing that - such as converting a float or double to BigDecimal in Java.
Since this is a language-agnostic question, it needs language-agnostic tools, such as a Decimal to Floating-Point Converter.
Applying it to the numbers in the question, treated as doubles:
0.1 converts to 0.1000000000000000055511151231257827021181583404541015625,
0.2 converts to 0.200000000000000011102230246251565404236316680908203125,
0.3 converts to 0.299999999999999988897769753748434595763683319091796875, and
0.30000000000000004 converts to 0.3000000000000000444089209850062616169452667236328125.
Adding the first two numbers manually or in a decimal calculator such as Full Precision Calculator, shows the exact sum of the actual inputs is 0.3000000000000000166533453693773481063544750213623046875.
If it were rounded down to the equivalent of 0.3 the rounding error would be 0.0000000000000000277555756156289135105907917022705078125. Rounding up to the equivalent of 0.30000000000000004 also gives rounding error 0.0000000000000000277555756156289135105907917022705078125. The round-to-even tie breaker applies.
Returning to the floating point converter, the raw hexadecimal for 0.30000000000000004 is 3fd3333333333334, which ends in an even digit and therefore is the correct result.
Can I just add; people always assume this to be a computer problem, but if you count with your hands (base 10), you can't get (1/3+1/3=2/3)=true unless you have infinity to add 0.333... to 0.333... so just as with the (1/10+2/10)!==3/10 problem in base 2, you truncate it to 0.333 + 0.333 = 0.666 and probably round it to 0.667 which would be also be technically inaccurate.
Count in ternary, and thirds are not a problem though - maybe some race with 15 fingers on each hand would ask why your decimal math was broken...
The kind of floating-point math that can be implemented in a digital computer necessarily uses an approximation of the real numbers and operations on them. (The standard version runs to over fifty pages of documentation and has a committee to deal with its errata and further refinement.)
This approximation is a mixture of approximations of different kinds, each of which can either be ignored or carefully accounted for due to its specific manner of deviation from exactitude. It also involves a number of explicit exceptional cases at both the hardware and software levels that most people walk right past while pretending not to notice.
If you need infinite precision (using the number π, for example, instead of one of its many shorter stand-ins), you should write or use a symbolic math program instead.
But if you're okay with the idea that sometimes floating-point math is fuzzy in value and logic and errors can accumulate quickly, and you can write your requirements and tests to allow for that, then your code can frequently get by with what's in your FPU.
Just for fun, I played with the representation of floats, following the definitions from the Standard C99 and I wrote the code below.
The code prints the binary representation of floats in 3 separated groups
SIGN EXPONENT FRACTION
and after that it prints a sum, that, when summed with enough precision, it will show the value that really exists in hardware.
So when you write float x = 999..., the compiler will transform that number in a bit representation printed by the function xx such that the sum printed by the function yy be equal to the given number.
In reality, this sum is only an approximation. For the number 999,999,999 the compiler will insert in bit representation of the float the number 1,000,000,000
After the code I attach a console session, in which I compute the sum of terms for both constants (minus PI and 999999999) that really exists in hardware, inserted there by the compiler.
#include <stdio.h>
#include <limits.h>
void
xx(float *x)
{
unsigned char i = sizeof(*x)*CHAR_BIT-1;
do {
switch (i) {
case 31:
printf("sign:");
break;
case 30:
printf("exponent:");
break;
case 23:
printf("fraction:");
break;
}
char b=(*(unsigned long long*)x&((unsigned long long)1<<i))!=0;
printf("%d ", b);
} while (i--);
printf("\n");
}
void
yy(float a)
{
int sign=!(*(unsigned long long*)&a&((unsigned long long)1<<31));
int fraction = ((1<<23)-1)&(*(int*)&a);
int exponent = (255&((*(int*)&a)>>23))-127;
printf(sign?"positive" " ( 1+":"negative" " ( 1+");
unsigned int i = 1<<22;
unsigned int j = 1;
do {
char b=(fraction&i)!=0;
b&&(printf("1/(%d) %c", 1<<j, (fraction&(i-1))?'+':')' ), 0);
} while (j++, i>>=1);
printf("*2^%d", exponent);
printf("\n");
}
void
main()
{
float x=-3.14;
float y=999999999;
printf("%lu\n", sizeof(x));
xx(&x);
xx(&y);
yy(x);
yy(y);
}
Here is a console session in which I compute the real value of the float that exists in hardware. I used bc to print the sum of terms outputted by the main program. One can insert that sum in python repl or something similar also.
-- .../terra1/stub
# qemacs f.c
-- .../terra1/stub
# gcc f.c
-- .../terra1/stub
# ./a.out
sign:1 exponent:1 0 0 0 0 0 0 fraction:0 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1
sign:0 exponent:1 0 0 1 1 1 0 fraction:0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0
negative ( 1+1/(2) +1/(16) +1/(256) +1/(512) +1/(1024) +1/(2048) +1/(8192) +1/(32768) +1/(65536) +1/(131072) +1/(4194304) +1/(8388608) )*2^1
positive ( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
-- .../terra1/stub
# bc
scale=15
( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
999999999.999999446351872
That's it. The value of 999999999 is in fact
999999999.999999446351872
You can also check with bc that -3.14 is also perturbed. Do not forget to set a scale factor in bc.
The displayed sum is what inside the hardware. The value you obtain by computing it depends on the scale you set. I did set the scale factor to 15. Mathematically, with infinite precision, it seems it is 1,000,000,000.
Since Python 3.5 you can use math.isclose() function for testing approximate equality:
>>> import math
>>> math.isclose(0.1 + 0.2, 0.3)
True
>>> 0.1 + 0.2 == 0.3
False
The trap with floating point numbers is that they look like decimal but they work in binary.
The only prime factor of 2 is 2, while 10 has prime factors of 2 and 5. The result of this is that every number that can be written exactly as a binary fraction can also be written exactly as a decimal fraction but only a subset of numbers that can be written as decimal fractions can be written as binary fractions.
A floating point number is essentially a binary fraction with a limited number of significant digits. If you go past those significant digits then the results will be rounded.
When you type a literal in your code or call the function to parse a floating point number to a string, it expects a decimal number and it stores a binary approximation of that decimal number in the variable.
When you print a floating point number or call the function to convert one to a string it prints a decimal approximation of the floating point number. It is possible to convert a binary number to decimal exactly, but no language I'm aware of does that by default when converting to a string*. Some languages use a fixed number of significant digits, others use the shortest string that will "round trip" back to the same floating point value.
* Python does convert exactly when converting a floating point number to a "decimal.Decimal". This is the easiest way I know of to obtain the exact decimal equivalent of a floating point number.
Floating point numbers are represented, at the hardware level, as fractions of binary numbers (base 2). For example, the decimal fraction:
0.125
has the value 1/10 + 2/100 + 5/1000 and, in the same way, the binary fraction:
0.001
has the value 0/2 + 0/4 + 1/8. These two fractions have the same value, the only difference is that the first is a decimal fraction, the second is a binary fraction.
Unfortunately, most decimal fractions cannot have exact representation in binary fractions. Therefore, in general, the floating point numbers you give are only approximated to binary fractions to be stored in the machine.
The problem is easier to approach in base 10. Take for example, the fraction 1/3. You can approximate it to a decimal fraction:
0.3
or better,
0.33
or better,
0.333
etc. No matter how many decimal places you write, the result is never exactly 1/3, but it is an estimate that always comes closer.
Likewise, no matter how many base 2 decimal places you use, the decimal value 0.1 cannot be represented exactly as a binary fraction. In base 2, 1/10 is the following periodic number:
0.0001100110011001100110011001100110011001100110011 ...
Stop at any finite amount of bits, and you'll get an approximation.
For Python, on a typical machine, 53 bits are used for the precision of a float, so the value stored when you enter the decimal 0.1 is the binary fraction.
0.00011001100110011001100110011001100110011001100110011010
which is close, but not exactly equal, to 1/10.
It's easy to forget that the stored value is an approximation of the original decimal fraction, due to the way floats are displayed in the interpreter. Python only displays a decimal approximation of the value stored in binary. If Python were to output the true decimal value of the binary approximation stored for 0.1, it would output:
>>> 0.1
0.1000000000000000055511151231257827021181583404541015625
This is a lot more decimal places than most people would expect, so Python displays a rounded value to improve readability:
>>> 0.1
0.1
It is important to understand that in reality this is an illusion: the stored value is not exactly 1/10, it is simply on the display that the stored value is rounded. This becomes evident as soon as you perform arithmetic operations with these values:
>>> 0.1 + 0.2
0.30000000000000004
This behavior is inherent to the very nature of the machine's floating-point representation: it is not a bug in Python, nor is it a bug in your code. You can observe the same type of behavior in all other languages ​​that use hardware support for calculating floating point numbers (although some languages ​​do not make the difference visible by default, or not in all display modes).
Another surprise is inherent in this one. For example, if you try to round the value 2.675 to two decimal places, you will get
>>> round (2.675, 2)
2.67
The documentation for the round() primitive indicates that it rounds to the nearest value away from zero. Since the decimal fraction is exactly halfway between 2.67 and 2.68, you should expect to get (a binary approximation of) 2.68. This is not the case, however, because when the decimal fraction 2.675 is converted to a float, it is stored by an approximation whose exact value is :
2.67499999999999982236431605997495353221893310546875
Since the approximation is slightly closer to 2.67 than 2.68, the rounding is down.
If you are in a situation where rounding decimal numbers halfway down matters, you should use the decimal module. By the way, the decimal module also provides a convenient way to "see" the exact value stored for any float.
>>> from decimal import Decimal
>>> Decimal (2.675)
>>> Decimal ('2.67499999999999982236431605997495353221893310546875')
Another consequence of the fact that 0.1 is not exactly stored in 1/10 is that the sum of ten values ​​of 0.1 does not give 1.0 either:
>>> sum = 0.0
>>> for i in range (10):
... sum + = 0.1
...>>> sum
0.9999999999999999
The arithmetic of binary floating point numbers holds many such surprises. The problem with "0.1" is explained in detail below, in the section "Representation errors". See The Perils of Floating Point for a more complete list of such surprises.
It is true that there is no simple answer, however do not be overly suspicious of floating virtula numbers! Errors, in Python, in floating-point number operations are due to the underlying hardware, and on most machines are no more than 1 in 2 ** 53 per operation. This is more than necessary for most tasks, but you should keep in mind that these are not decimal operations, and every operation on floating point numbers may suffer from a new error.
Although pathological cases exist, for most common use cases you will get the expected result at the end by simply rounding up to the number of decimal places you want on the display. For fine control over how floats are displayed, see String Formatting Syntax for the formatting specifications of the str.format () method.
This part of the answer explains in detail the example of "0.1" and shows how you can perform an exact analysis of this type of case on your own. We assume that you are familiar with the binary representation of floating point numbers.The term Representation error means that most decimal fractions cannot be represented exactly in binary. This is the main reason why Python (or Perl, C, C ++, Java, Fortran, and many others) usually doesn't display the exact result in decimal:
>>> 0.1 + 0.2
0.30000000000000004
Why ? 1/10 and 2/10 are not representable exactly in binary fractions. However, all machines today (July 2010) follow the IEEE-754 standard for the arithmetic of floating point numbers. and most platforms use an "IEEE-754 double precision" to represent Python floats. Double precision IEEE-754 uses 53 bits of precision, so on reading the computer tries to convert 0.1 to the nearest fraction of the form J / 2 ** N with J an integer of exactly 53 bits. Rewrite :
1/10 ~ = J / (2 ** N)
in :
J ~ = 2 ** N / 10
remembering that J is exactly 53 bits (so> = 2 ** 52 but <2 ** 53), the best possible value for N is 56:
>>> 2 ** 52
4503599627370496
>>> 2 ** 53
9007199254740992
>>> 2 ** 56/10
7205759403792793
So 56 is the only possible value for N which leaves exactly 53 bits for J. The best possible value for J is therefore this quotient, rounded:
>>> q, r = divmod (2 ** 56, 10)
>>> r
6
Since the carry is greater than half of 10, the best approximation is obtained by rounding up:
>>> q + 1
7205759403792794
Therefore the best possible approximation for 1/10 in "IEEE-754 double precision" is this above 2 ** 56, that is:
7205759403792794/72057594037927936
Note that since the rounding was done upward, the result is actually slightly greater than 1/10; if we hadn't rounded up, the quotient would have been slightly less than 1/10. But in no case is it exactly 1/10!
So the computer never "sees" 1/10: what it sees is the exact fraction given above, the best approximation using the double precision floating point numbers from the "" IEEE-754 ":
>>>. 1 * 2 ** 56
7205759403792794.0
If we multiply this fraction by 10 ** 30, we can observe the values ​​of its 30 decimal places of strong weight.
>>> 7205759403792794 * 10 ** 30 // 2 ** 56
100000000000000005551115123125L
meaning that the exact value stored in the computer is approximately equal to the decimal value 0.100000000000000005551115123125. In versions prior to Python 2.7 and Python 3.1, Python rounded these values ​​to 17 significant decimal places, displaying “0.10000000000000001”. In current versions of Python, the displayed value is the value whose fraction is as short as possible while giving exactly the same representation when converted back to binary, simply displaying “0.1”.
Another way to look at this: Used are 64 bits to represent numbers. As consequence there is no way more than 2**64 = 18,446,744,073,709,551,616 different numbers can be precisely represented.
However, Math says there are already infinitely many decimals between 0 and 1. IEE 754 defines an encoding to use these 64 bits efficiently for a much larger number space plus NaN and +/- Infinity, so there are gaps between accurately represented numbers filled with numbers only approximated.
Unfortunately 0.3 sits in a gap.
Imagine working in base ten with, say, 8 digits of accuracy. You check whether
1/3 + 2 / 3 == 1
and learn that this returns false. Why? Well, as real numbers we have
1/3 = 0.333.... and 2/3 = 0.666....
Truncating at eight decimal places, we get
0.33333333 + 0.66666666 = 0.99999999
which is, of course, different from 1.00000000 by exactly 0.00000001.
The situation for binary numbers with a fixed number of bits is exactly analogous. As real numbers, we have
1/10 = 0.0001100110011001100... (base 2)
and
1/5 = 0.0011001100110011001... (base 2)
If we truncated these to, say, seven bits, then we'd get
0.0001100 + 0.0011001 = 0.0100101
while on the other hand,
3/10 = 0.01001100110011... (base 2)
which, truncated to seven bits, is 0.0100110, and these differ by exactly 0.0000001.
The exact situation is slightly more subtle because these numbers are typically stored in scientific notation. So, for instance, instead of storing 1/10 as 0.0001100 we may store it as something like 1.10011 * 2^-4, depending on how many bits we've allocated for the exponent and the mantissa. This affects how many digits of precision you get for your calculations.
The upshot is that because of these rounding errors you essentially never want to use == on floating-point numbers. Instead, you can check if the absolute value of their difference is smaller than some fixed small number.
It's actually pretty simple. When you have a base 10 system (like ours), it can only express fractions that use a prime factor of the base. The prime factors of 10 are 2 and 5. So 1/2, 1/4, 1/5, 1/8, and 1/10 can all be expressed cleanly because the denominators all use prime factors of 10. In contrast, 1/3, 1/6, and 1/7 are all repeating decimals because their denominators use a prime factor of 3 or 7. In binary (or base 2), the only prime factor is 2. So you can only express fractions cleanly which only contain 2 as a prime factor. In binary, 1/2, 1/4, 1/8 would all be expressed cleanly as decimals. While, 1/5 or 1/10 would be repeating decimals. So 0.1 and 0.2 (1/10 and 1/5) while clean decimals in a base 10 system, are repeating decimals in the base 2 system the computer is operating in. When you do math on these repeating decimals, you end up with leftovers which carry over when you convert the computer's base 2 (binary) number into a more human readable base 10 number.
From https://0.30000000000000004.com/
Decimal numbers such as 0.1, 0.2, and 0.3 are not represented exactly in binary encoded floating point types. The sum of the approximations for 0.1 and 0.2 differs from the approximation used for 0.3, hence the falsehood of 0.1 + 0.2 == 0.3 as can be seen more clearly here:
#include <stdio.h>
int main() {
printf("0.1 + 0.2 == 0.3 is %s\n", 0.1 + 0.2 == 0.3 ? "true" : "false");
printf("0.1 is %.23f\n", 0.1);
printf("0.2 is %.23f\n", 0.2);
printf("0.1 + 0.2 is %.23f\n", 0.1 + 0.2);
printf("0.3 is %.23f\n", 0.3);
printf("0.3 - (0.1 + 0.2) is %g\n", 0.3 - (0.1 + 0.2));
return 0;
}
Output:
0.1 + 0.2 == 0.3 is false
0.1 is 0.10000000000000000555112
0.2 is 0.20000000000000001110223
0.1 + 0.2 is 0.30000000000000004440892
0.3 is 0.29999999999999998889777
0.3 - (0.1 + 0.2) is -5.55112e-17
For these computations to be evaluated more reliably, you would need to use a decimal-based representation for floating point values. The C Standard does not specify such types by default but as an extension described in a technical Report.
The _Decimal32, _Decimal64 and _Decimal128 types might be available on your system (for example, GCC supports them on selected targets, but Clang does not support them on OS X).
Since this thread branched off a bit into a general discussion over current floating point implementations I'd add that there are projects on fixing their issues.
Take a look at https://posithub.org/ for example, which showcases a number type called posit (and its predecessor unum) that promises to offer better accuracy with fewer bits. If my understanding is correct, it also fixes the kind of problems in the question. Quite interesting project, the person behind it is a mathematician it Dr. John Gustafson. The whole thing is open source, with many actual implementations in C/C++, Python, Julia and C# (https://hastlayer.com/arithmetics).
Normal arithmetic is base-10, so decimals represent tenths, hundredths, etc. When you try to represent a floating-point number in binary base-2 arithmetic, you are dealing with halves, fourths, eighths, etc.
In the hardware, floating points are stored as integer mantissas and exponents. Mantissa represents the significant digits. Exponent is like scientific notation but it uses a base of 2 instead of 10. For example 64.0 would be represented with a mantissa of 1 and exponent of 6. 0.125 would be represented with a mantissa of 1 and an exponent of -3.
Floating point decimals have to add up negative powers of 2
0.1b = 0.5d
0.01b = 0.25d
0.001b = 0.125d
0.0001b = 0.0625d
0.00001b = 0.03125d
and so on.
It is common to use a error delta instead of using equality operators when dealing with floating point arithmetic. Instead of
if(a==b) ...
you would use
delta = 0.0001; // or some arbitrarily small amount
if(a - b > -delta && a - b < delta) ...

Is Double data type trimmed in MySQL while checking uniqueness? [duplicate]

Consider the following code:
0.1 + 0.2 == 0.3 -> false
0.1 + 0.2 -> 0.30000000000000004
Why do these inaccuracies happen?
Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.
For 0.1 in the standard binary64 format, the representation can be written exactly as
0.1000000000000000055511151231257827021181583404541015625 in decimal, or
0x1.999999999999ap-4 in C99 hexfloat notation.
In contrast, the rational number 0.1, which is 1/10, can be written exactly as
0.1 in decimal, or
0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.
The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.
A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.
Side Note: All positional (base-N) number systems share this problem with precision
Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...
You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001)** - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.
Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).
So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)
Side Side Note: Working with Floats in Programming
In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.
You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:
Do not do if (x == y) { ... }
Instead do if (abs(x - y) < myToleranceValue) { ... }.
where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These can be used as tolerance values but their effectiveness depends on the magnitude (size) of the numbers you're working with, since calculations with large numbers may exceed the epsilon threshold.
A Hardware Designer's Perspective
I believe I should add a hardware designer’s perspective to this since I design and build floating point hardware. Knowing the origin of the error may help in understanding what is happening in the software, and ultimately, I hope this helps explain the reasons for why floating point errors happen and seem to accumulate over time.
1. Overview
From an engineering perspective, most floating point operations will have some element of error since the hardware that does the floating point computations is only required to have an error of less than one half of one unit in the last place. Therefore, much hardware will stop at a precision that's only necessary to yield an error of less than one half of one unit in the last place for a single operation which is especially problematic in floating point division. What constitutes a single operation depends upon how many operands the unit takes. For most, it is two, but some units take 3 or more operands. Because of this, there is no guarantee that repeated operations will result in a desirable error since the errors add up over time.
2. Standards
Most processors follow the IEEE-754 standard but some use denormalized, or different standards
. For example, there is a denormalized mode in IEEE-754 which allows representation of very small floating point numbers at the expense of precision. The following, however, will cover the normalized mode of IEEE-754 which is the typical mode of operation.
In the IEEE-754 standard, hardware designers are allowed any value of error/epsilon as long as it's less than one half of one unit in the last place, and the result only has to be less than one half of one unit in the last place for one operation. This explains why when there are repeated operations, the errors add up. For IEEE-754 double precision, this is the 54th bit, since 53 bits are used to represent the numeric part (normalized), also called the mantissa, of the floating point number (e.g. the 5.3 in 5.3e5). The next sections go into more detail on the causes of hardware error on various floating point operations.
3. Cause of Rounding Error in Division
The main cause of the error in floating point division is the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y, Z = X * (1/Y). A division is computed iteratively i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in the slow division, and the size in bits of the quotient selection table is usually the width of the radix, or a number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k>=2. So for example, a typical Quotient Selection Table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2+2= 4 bits (plus a few optional bits).
3.1 Division Rounding Error: Approximation of Reciprocal
What reciprocals are in the quotient selection table depend on the division method: slow division such as SRT division, or fast division such as Goldschmidt division; each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case, though, all reciprocals are approximations of the actual reciprocal and introduce some element of error. Both slow division and fast division methods calculate the quotient iteratively, i.e. some number of bits of the quotient are calculated each step, then the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build, and fast division methods calculate a variable number of digits per step and are usually more expensive to build. The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal, so they are prone to error.
4. Rounding Errors in Other Operations: Truncation
Another cause of the rounding errors in all operations are the different modes of truncation of the final answer that IEEE-754 allows. There's truncate, round-towards-zero, round-to-nearest (default), round-down, and round-up. All methods introduce an element of error of less than one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error is especially problematic in exponentiation, which involves some form of repeated multiplication.
5. Repeated Operations
Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if not watched. This is the reason that in computations that require a bounded error, mathematicians use methods such as using the round-to-nearest even digit in the last place of IEEE-754, because, over time, the errors are more likely to cancel each other out, and Interval Arithmetic combined with variations of the IEEE 754 rounding modes to predict rounding errors, and correct them. Because of its low relative error compared to other rounding modes, round to nearest even digit (in the last place), is the default rounding mode of IEEE-754.
Note that the default rounding mode, round-to-nearest even digit in the last place, guarantees an error of less than one half of one unit in the last place for one operation. Using the truncation, round-up, and round down alone may result in an error that is greater than one half of one unit in the last place, but less than one unit in the last place, so these modes are not recommended unless they are used in Interval Arithmetic.
6. Summary
In short, the fundamental reason for the errors in floating point operations is a combination of the truncation in hardware, and the truncation of a reciprocal in the case of division. Since the IEEE-754 standard only requires an error of less than one half of one unit in the last place for a single operation, the floating point errors over repeated operations will add up unless corrected.
It's broken in the exact same way the decimal (base-10) notation you learned in grade school and use every day is broken, just for base-2.
To understand, think about representing 1/3 as a decimal value. It's impossible to do exactly! The world will end before you finish writing the 3's after the decimal point, and so instead we write to some number of places and consider it sufficiently accurate.
In the same way, 1/10 (decimal 0.1) cannot be represented exactly in base 2 (binary) as a "decimal" value; a repeating pattern after the decimal point goes on forever. The value is not exact, and therefore you can't do exact math with it using normal floating point methods. Just like with base 10, there are other values that exhibit this problem as well.
Most answers here address this question in very dry, technical terms. I'd like to address this in terms that normal human beings can understand.
Imagine that you are trying to slice up pizzas. You have a robotic pizza cutter that can cut pizza slices exactly in half. It can halve a whole pizza, or it can halve an existing slice, but in any case, the halving is always exact.
That pizza cutter has very fine movements, and if you start with a whole pizza, then halve that, and continue halving the smallest slice each time, you can do the halving 53 times before the slice is too small for even its high-precision abilities. At that point, you can no longer halve that very thin slice, but must either include or exclude it as is.
Now, how would you piece all the slices in such a way that would add up to one-tenth (0.1) or one-fifth (0.2) of a pizza? Really think about it, and try working it out. You can even try to use a real pizza, if you have a mythical precision pizza cutter at hand. :-)
Most experienced programmers, of course, know the real answer, which is that there is no way to piece together an exact tenth or fifth of the pizza using those slices, no matter how finely you slice them. You can do a pretty good approximation, and if you add up the approximation of 0.1 with the approximation of 0.2, you get a pretty good approximation of 0.3, but it's still just that, an approximation.
For double-precision numbers (which is the precision that allows you to halve your pizza 53 times), the numbers immediately less and greater than 0.1 are 0.09999999999999999167332731531132594682276248931884765625 and 0.1000000000000000055511151231257827021181583404541015625. The latter is quite a bit closer to 0.1 than the former, so a numeric parser will, given an input of 0.1, favour the latter.
(The difference between those two numbers is the "smallest slice" that we must decide to either include, which introduces an upward bias, or exclude, which introduces a downward bias. The technical term for that smallest slice is an ulp.)
In the case of 0.2, the numbers are all the same, just scaled up by a factor of 2. Again, we favour the value that's slightly higher than 0.2.
Notice that in both cases, the approximations for 0.1 and 0.2 have a slight upward bias. If we add enough of these biases in, they will push the number further and further away from what we want, and in fact, in the case of 0.1 + 0.2, the bias is high enough that the resulting number is no longer the closest number to 0.3.
In particular, 0.1 + 0.2 is really 0.1000000000000000055511151231257827021181583404541015625 + 0.200000000000000011102230246251565404236316680908203125 = 0.3000000000000000444089209850062616169452667236328125, whereas the number closest to 0.3 is actually 0.299999999999999988897769753748434595763683319091796875.
P.S. Some programming languages also provide pizza cutters that can split slices into exact tenths. Although such pizza cutters are uncommon, if you do have access to one, you should use it when it's important to be able to get exactly one-tenth or one-fifth of a slice.
(Originally posted on Quora.)
Floating point rounding errors. 0.1 cannot be represented as accurately in base-2 as in base-10 due to the missing prime factor of 5. Just as 1/3 takes an infinite number of digits to represent in decimal, but is "0.1" in base-3, 0.1 takes an infinite number of digits in base-2 where it does not in base-10. And computers don't have an infinite amount of memory.
My answer is quite long, so I've split it into three sections. Since the question is about floating point mathematics, I've put the emphasis on what the machine actually does. I've also made it specific to double (64 bit) precision, but the argument applies equally to any floating point arithmetic.
Preamble
An IEEE 754 double-precision binary floating-point format (binary64) number represents a number of the form
value = (-1)^s * (1.m51m50...m2m1m0)2 * 2e-1023
in 64 bits:
The first bit is the sign bit: 1 if the number is negative, 0 otherwise1.
The next 11 bits are the exponent, which is offset by 1023. In other words, after reading the exponent bits from a double-precision number, 1023 must be subtracted to obtain the power of two.
The remaining 52 bits are the significand (or mantissa). In the mantissa, an 'implied' 1. is always2 omitted since the most significant bit of any binary value is 1.
1 - IEEE 754 allows for the concept of a signed zero - +0 and -0 are treated differently: 1 / (+0) is positive infinity; 1 / (-0) is negative infinity. For zero values, the mantissa and exponent bits are all zero. Note: zero values (+0 and -0) are explicitly not classed as denormal2.
2 - This is not the case for denormal numbers, which have an offset exponent of zero (and an implied 0.). The range of denormal double precision numbers is dmin ≤ |x| ≤ dmax, where dmin (the smallest representable nonzero number) is 2-1023 - 51 (≈ 4.94 * 10-324) and dmax (the largest denormal number, for which the mantissa consists entirely of 1s) is 2-1023 + 1 - 2-1023 - 51 (≈ 2.225 * 10-308).
Turning a double precision number to binary
Many online converters exist to convert a double precision floating point number to binary (e.g. at binaryconvert.com), but here is some sample C# code to obtain the IEEE 754 representation for a double precision number (I separate the three parts with colons (:):
public static string BinaryRepresentation(double value)
{
long valueInLongType = BitConverter.DoubleToInt64Bits(value);
string bits = Convert.ToString(valueInLongType, 2);
string leadingZeros = new string('0', 64 - bits.Length);
string binaryRepresentation = leadingZeros + bits;
string sign = binaryRepresentation[0].ToString();
string exponent = binaryRepresentation.Substring(1, 11);
string mantissa = binaryRepresentation.Substring(12);
return string.Format("{0}:{1}:{2}", sign, exponent, mantissa);
}
Getting to the point: the original question
(Skip to the bottom for the TL;DR version)
Cato Johnston (the question asker) asked why 0.1 + 0.2 != 0.3.
Written in binary (with colons separating the three parts), the IEEE 754 representations of the values are:
0.1 => 0:01111111011:1001100110011001100110011001100110011001100110011010
0.2 => 0:01111111100:1001100110011001100110011001100110011001100110011010
Note that the mantissa is composed of recurring digits of 0011. This is key to why there is any error to the calculations - 0.1, 0.2 and 0.3 cannot be represented in binary precisely in a finite number of binary bits any more than 1/9, 1/3 or 1/7 can be represented precisely in decimal digits.
Also note that we can decrease the power in the exponent by 52 and shift the point in the binary representation to the right by 52 places (much like 10-3 * 1.23 == 10-5 * 123). This then enables us to represent the binary representation as the exact value that it represents in the form a * 2p. where 'a' is an integer.
Converting the exponents to decimal, removing the offset, and re-adding the implied 1 (in square brackets), 0.1 and 0.2 are:
0.1 => 2^-4 * [1].1001100110011001100110011001100110011001100110011010
0.2 => 2^-3 * [1].1001100110011001100110011001100110011001100110011010
or
0.1 => 2^-56 * 7205759403792794 = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794 = 0.200000000000000011102230246251565404236316680908203125
To add two numbers, the exponent needs to be the same, i.e.:
0.1 => 2^-3 * 0.1100110011001100110011001100110011001100110011001101(0)
0.2 => 2^-3 * 1.1001100110011001100110011001100110011001100110011010
sum = 2^-3 * 10.0110011001100110011001100110011001100110011001100111
or
0.1 => 2^-55 * 3602879701896397 = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794 = 0.200000000000000011102230246251565404236316680908203125
sum = 2^-55 * 10808639105689191 = 0.3000000000000000166533453693773481063544750213623046875
Since the sum is not of the form 2n * 1.{bbb} we increase the exponent by one and shift the decimal (binary) point to get:
sum = 2^-2 * 1.0011001100110011001100110011001100110011001100110011(1)
= 2^-54 * 5404319552844595.5 = 0.3000000000000000166533453693773481063544750213623046875
There are now 53 bits in the mantissa (the 53rd is in square brackets in the line above). The default rounding mode for IEEE 754 is 'Round to Nearest' - i.e. if a number x falls between two values a and b, the value where the least significant bit is zero is chosen.
a = 2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875
= 2^-2 * 1.0011001100110011001100110011001100110011001100110011
x = 2^-2 * 1.0011001100110011001100110011001100110011001100110011(1)
b = 2^-2 * 1.0011001100110011001100110011001100110011001100110100
= 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125
Note that a and b differ only in the last bit; ...0011 + 1 = ...0100. In this case, the value with the least significant bit of zero is b, so the sum is:
sum = 2^-2 * 1.0011001100110011001100110011001100110011001100110100
= 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125
whereas the binary representation of 0.3 is:
0.3 => 2^-2 * 1.0011001100110011001100110011001100110011001100110011
= 2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875
which only differs from the binary representation of the sum of 0.1 and 0.2 by 2-54.
The binary representation of 0.1 and 0.2 are the most accurate representations of the numbers allowable by IEEE 754. The addition of these representation, due to the default rounding mode, results in a value which differs only in the least-significant-bit.
TL;DR
Writing 0.1 + 0.2 in a IEEE 754 binary representation (with colons separating the three parts) and comparing it to 0.3, this is (I've put the distinct bits in square brackets):
0.1 + 0.2 => 0:01111111101:0011001100110011001100110011001100110011001100110[100]
0.3 => 0:01111111101:0011001100110011001100110011001100110011001100110[011]
Converted back to decimal, these values are:
0.1 + 0.2 => 0.300000000000000044408920985006...
0.3 => 0.299999999999999988897769753748...
The difference is exactly 2-54, which is ~5.5511151231258 × 10-17 - insignificant (for many applications) when compared to the original values.
Comparing the last few bits of a floating point number is inherently dangerous, as anyone who reads the famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (which covers all the major parts of this answer) will know.
Most calculators use additional guard digits to get around this problem, which is how 0.1 + 0.2 would give 0.3: the final few bits are rounded.
In addition to the other correct answers, you may want to consider scaling your values to avoid problems with floating-point arithmetic.
For example:
var result = 1.0 + 2.0; // result === 3.0 returns true
... instead of:
var result = 0.1 + 0.2; // result === 0.3 returns false
The expression 0.1 + 0.2 === 0.3 returns false in JavaScript, but fortunately integer arithmetic in floating-point is exact, so decimal representation errors can be avoided by scaling.
As a practical example, to avoid floating-point problems where accuracy is paramount, it is recommended1 to handle money as an integer representing the number of cents: 2550 cents instead of 25.50 dollars.
1 Douglas Crockford: JavaScript: The Good Parts: Appendix A - Awful Parts (page 105).
Floating point numbers stored in the computer consist of two parts, an integer and an exponent that the base is taken to and multiplied by the integer part.
If the computer were working in base 10, 0.1 would be 1 x 10⁻¹, 0.2 would be 2 x 10⁻¹, and 0.3 would be 3 x 10⁻¹. Integer math is easy and exact, so adding 0.1 + 0.2 will obviously result in 0.3.
Computers don't usually work in base 10, they work in base 2. You can still get exact results for some values, for example 0.5 is 1 x 2⁻¹ and 0.25 is 1 x 2⁻², and adding them results in 3 x 2⁻², or 0.75. Exactly.
The problem comes with numbers that can be represented exactly in base 10, but not in base 2. Those numbers need to be rounded to their closest equivalent. Assuming the very common IEEE 64-bit floating point format, the closest number to 0.1 is 3602879701896397 x 2⁻⁵⁵, and the closest number to 0.2 is 7205759403792794 x 2⁻⁵⁵; adding them together results in 10808639105689191 x 2⁻⁵⁵, or an exact decimal value of 0.3000000000000000444089209850062616169452667236328125. Floating point numbers are generally rounded for display.
In short it's because:
Floating point numbers cannot represent all decimals precisely in binary
So just like 10/3 which does not exist in base 10 precisely (it will be 3.33... recurring), in the same way 1/10 doesn't exist in binary.
So what? How to deal with it? Is there any workaround?
In order to offer The best solution I can say I discovered following method:
parseFloat((0.1 + 0.2).toFixed(10)) => Will return 0.3
Let me explain why it's the best solution.
As others mentioned in above answers it's a good idea to use ready to use Javascript toFixed() function to solve the problem. But most likely you'll encounter with some problems.
Imagine you are going to add up two float numbers like 0.2 and 0.7 here it is: 0.2 + 0.7 = 0.8999999999999999.
Your expected result was 0.9 it means you need a result with 1 digit precision in this case.
So you should have used (0.2 + 0.7).tofixed(1)
but you can't just give a certain parameter to toFixed() since it depends on the given number, for instance
0.22 + 0.7 = 0.9199999999999999
In this example you need 2 digits precision so it should be toFixed(2), so what should be the paramter to fit every given float number?
You might say let it be 10 in every situation then:
(0.2 + 0.7).toFixed(10) => Result will be "0.9000000000"
Damn! What are you going to do with those unwanted zeros after 9?
It's the time to convert it to float to make it as you desire:
parseFloat((0.2 + 0.7).toFixed(10)) => Result will be 0.9
Now that you found the solution, it's better to offer it as a function like this:
function floatify(number){
return parseFloat((number).toFixed(10));
}
Let's try it yourself:
function floatify(number){
return parseFloat((number).toFixed(10));
}
function addUp(){
var number1 = +$("#number1").val();
var number2 = +$("#number2").val();
var unexpectedResult = number1 + number2;
var expectedResult = floatify(number1 + number2);
$("#unexpectedResult").text(unexpectedResult);
$("#expectedResult").text(expectedResult);
}
addUp();
input{
width: 50px;
}
#expectedResult{
color: green;
}
#unexpectedResult{
color: red;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input id="number1" value="0.2" onclick="addUp()" onkeyup="addUp()"/> +
<input id="number2" value="0.7" onclick="addUp()" onkeyup="addUp()"/> =
<p>Expected Result: <span id="expectedResult"></span></p>
<p>Unexpected Result: <span id="unexpectedResult"></span></p>
You can use it this way:
var x = 0.2 + 0.7;
floatify(x); => Result: 0.9
As W3SCHOOLS suggests there is another solution too, you can multiply and divide to solve the problem above:
var x = (0.2 * 10 + 0.1 * 10) / 10; // x will be 0.3
Keep in mind that (0.2 + 0.1) * 10 / 10 won't work at all although it seems the same!
I prefer the first solution since I can apply it as a function which converts the input float to accurate output float.
FYI, the same problem exists for multiplication, for instance 0.09 * 10 returns 0.8999999999999999. Apply the flotify function as a workaround: flotify(0.09 * 10) returns 0.9
Floating point rounding error. From What Every Computer Scientist Should Know About Floating-Point Arithmetic:
Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.
My workaround:
function add(a, b, precision) {
var x = Math.pow(10, precision || 2);
return (Math.round(a * x) + Math.round(b * x)) / x;
}
precision refers to the number of digits you want to preserve after the decimal point during addition.
No, not broken, but most decimal fractions must be approximated
Summary
Floating point arithmetic is exact, unfortunately, it doesn't match up well with our usual base-10 number representation, so it turns out we are often giving it input that is slightly off from what we wrote.
Even simple numbers like 0.01, 0.02, 0.03, 0.04 ... 0.24 are not representable exactly as binary fractions. If you count up 0.01, .02, .03 ..., not until you get to 0.25 will you get the first fraction representable in base2. If you tried that using FP, your 0.01 would have been slightly off, so the only way to add 25 of them up to a nice exact 0.25 would have required a long chain of causality involving guard bits and rounding. It's hard to predict so we throw up our hands and say "FP is inexact", but that's not really true.
We constantly give the FP hardware something that seems simple in base 10 but is a repeating fraction in base 2.
How did this happen?
When we write in decimal, every fraction (specifically, every terminating decimal) is a rational number of the form
a / (2n x 5m)
In binary, we only get the 2n term, that is:
a / 2n
So in decimal, we can't represent 1/3. Because base 10 includes 2 as a prime factor, every number we can write as a binary fraction also can be written as a base 10 fraction. However, hardly anything we write as a base10 fraction is representable in binary. In the range from 0.01, 0.02, 0.03 ... 0.99, only three numbers can be represented in our FP format: 0.25, 0.50, and 0.75, because they are 1/4, 1/2, and 3/4, all numbers with a prime factor using only the 2n term.
In base10 we can't represent 1/3. But in binary, we can't do 1/10 or 1/3.
So while every binary fraction can be written in decimal, the reverse is not true. And in fact most decimal fractions repeat in binary.
Dealing with it
Developers are usually instructed to do < epsilon comparisons, better advice might be to round to integral values (in the C library: round() and roundf(), i.e., stay in the FP format) and then compare. Rounding to a specific decimal fraction length solves most problems with output.
Also, on real number-crunching problems (the problems that FP was invented for on early, frightfully expensive computers) the physical constants of the universe and all other measurements are only known to a relatively small number of significant figures, so the entire problem space was "inexact" anyway. FP "accuracy" isn't a problem in this kind of application.
The whole issue really arises when people try to use FP for bean counting. It does work for that, but only if you stick to integral values, which kind of defeats the point of using it. This is why we have all those decimal fraction software libraries.
I love the Pizza answer by Chris, because it describes the actual problem, not just the usual handwaving about "inaccuracy". If FP were simply "inaccurate", we could fix that and would have done it decades ago. The reason we haven't is because the FP format is compact and fast and it's the best way to crunch a lot of numbers. Also, it's a legacy from the space age and arms race and early attempts to solve big problems with very slow computers using small memory systems. (Sometimes, individual magnetic cores for 1-bit storage, but that's another story.)
Conclusion
If you are just counting beans at a bank, software solutions that use decimal string representations in the first place work perfectly well. But you can't do quantum chromodynamics or aerodynamics that way.
A lot of good answers have been posted, but I'd like to append one more.
Not all numbers can be represented via floats/doubles
For example, the number "0.2" will be represented as "0.200000003" in single precision in IEEE754 float point standard.
Model for store real numbers under the hood represent float numbers as
Even though you can type 0.2 easily, FLT_RADIX and DBL_RADIX is 2; not 10 for a computer with FPU which uses "IEEE Standard for Binary Floating-Point Arithmetic (ISO/IEEE Std 754-1985)".
So it is a bit hard to represent such numbers exactly. Even if you specify this variable explicitly without any intermediate calculation.
Some statistics related to this famous double precision question.
When adding all values (a + b) using a step of 0.1 (from 0.1 to 100) we have ~15% chance of precision error. Note that the error could result in slightly bigger or smaller values.
Here are some examples:
0.1 + 0.2 = 0.30000000000000004 (BIGGER)
0.1 + 0.7 = 0.7999999999999999 (SMALLER)
...
1.7 + 1.9 = 3.5999999999999996 (SMALLER)
1.7 + 2.2 = 3.9000000000000004 (BIGGER)
...
3.2 + 3.6 = 6.800000000000001 (BIGGER)
3.2 + 4.4 = 7.6000000000000005 (BIGGER)
When subtracting all values (a - b where a > b) using a step of 0.1 (from 100 to 0.1) we have ~34% chance of precision error.
Here are some examples:
0.6 - 0.2 = 0.39999999999999997 (SMALLER)
0.5 - 0.4 = 0.09999999999999998 (SMALLER)
...
2.1 - 0.2 = 1.9000000000000001 (BIGGER)
2.0 - 1.9 = 0.10000000000000009 (BIGGER)
...
100 - 99.9 = 0.09999999999999432 (SMALLER)
100 - 99.8 = 0.20000000000000284 (BIGGER)
*15% and 34% are indeed huge, so always use BigDecimal when precision is of big importance. With 2 decimal digits (step 0.01) the situation worsens a bit more (18% and 36%).
Given that nobody has mentioned this...
Some high level languages such as Python and Java come with tools to overcome binary floating point limitations. For example:
Python's decimal module and Java's BigDecimal class, that represent numbers internally with decimal notation (as opposed to binary notation). Both have limited precision, so they are still error prone, however they solve most common problems with binary floating point arithmetic.
Decimals are very nice when dealing with money: ten cents plus twenty cents are always exactly thirty cents:
>>> 0.1 + 0.2 == 0.3
False
>>> Decimal('0.1') + Decimal('0.2') == Decimal('0.3')
True
Python's decimal module is based on IEEE standard 854-1987.
Python's fractions module and Apache Common's BigFraction class. Both represent rational numbers as (numerator, denominator) pairs and they may give more accurate results than decimal floating point arithmetic.
Neither of these solutions is perfect (especially if we look at performances, or if we require a very high precision), but still they solve a great number of problems with binary floating point arithmetic.
Did you try the duct tape solution?
Try to determine when errors occur and fix them with short if statements, it's not pretty but for some problems it is the only solution and this is one of them.
if( (n * 0.1) < 100.0 ) { return n * 0.1 - 0.000000000000001 ;}
else { return n * 0.1 + 0.000000000000001 ;}
I had the same problem in a scientific simulation project in c#, and I can tell you that if you ignore the butterfly effect it's gonna turn to a big fat dragon and bite you in the a**
Those weird numbers appear because computers use binary(base 2) number system for calculation purposes, while we use decimal(base 10).
There are a majority of fractional numbers that cannot be represented precisely either in binary or in decimal or both. Result - A rounded up (but precise) number results.
Many of this question's numerous duplicates ask about the effects of floating point rounding on specific numbers. In practice, it is easier to get a feeling for how it works by looking at exact results of calculations of interest rather than by just reading about it. Some languages provide ways of doing that - such as converting a float or double to BigDecimal in Java.
Since this is a language-agnostic question, it needs language-agnostic tools, such as a Decimal to Floating-Point Converter.
Applying it to the numbers in the question, treated as doubles:
0.1 converts to 0.1000000000000000055511151231257827021181583404541015625,
0.2 converts to 0.200000000000000011102230246251565404236316680908203125,
0.3 converts to 0.299999999999999988897769753748434595763683319091796875, and
0.30000000000000004 converts to 0.3000000000000000444089209850062616169452667236328125.
Adding the first two numbers manually or in a decimal calculator such as Full Precision Calculator, shows the exact sum of the actual inputs is 0.3000000000000000166533453693773481063544750213623046875.
If it were rounded down to the equivalent of 0.3 the rounding error would be 0.0000000000000000277555756156289135105907917022705078125. Rounding up to the equivalent of 0.30000000000000004 also gives rounding error 0.0000000000000000277555756156289135105907917022705078125. The round-to-even tie breaker applies.
Returning to the floating point converter, the raw hexadecimal for 0.30000000000000004 is 3fd3333333333334, which ends in an even digit and therefore is the correct result.
Can I just add; people always assume this to be a computer problem, but if you count with your hands (base 10), you can't get (1/3+1/3=2/3)=true unless you have infinity to add 0.333... to 0.333... so just as with the (1/10+2/10)!==3/10 problem in base 2, you truncate it to 0.333 + 0.333 = 0.666 and probably round it to 0.667 which would be also be technically inaccurate.
Count in ternary, and thirds are not a problem though - maybe some race with 15 fingers on each hand would ask why your decimal math was broken...
The kind of floating-point math that can be implemented in a digital computer necessarily uses an approximation of the real numbers and operations on them. (The standard version runs to over fifty pages of documentation and has a committee to deal with its errata and further refinement.)
This approximation is a mixture of approximations of different kinds, each of which can either be ignored or carefully accounted for due to its specific manner of deviation from exactitude. It also involves a number of explicit exceptional cases at both the hardware and software levels that most people walk right past while pretending not to notice.
If you need infinite precision (using the number π, for example, instead of one of its many shorter stand-ins), you should write or use a symbolic math program instead.
But if you're okay with the idea that sometimes floating-point math is fuzzy in value and logic and errors can accumulate quickly, and you can write your requirements and tests to allow for that, then your code can frequently get by with what's in your FPU.
Just for fun, I played with the representation of floats, following the definitions from the Standard C99 and I wrote the code below.
The code prints the binary representation of floats in 3 separated groups
SIGN EXPONENT FRACTION
and after that it prints a sum, that, when summed with enough precision, it will show the value that really exists in hardware.
So when you write float x = 999..., the compiler will transform that number in a bit representation printed by the function xx such that the sum printed by the function yy be equal to the given number.
In reality, this sum is only an approximation. For the number 999,999,999 the compiler will insert in bit representation of the float the number 1,000,000,000
After the code I attach a console session, in which I compute the sum of terms for both constants (minus PI and 999999999) that really exists in hardware, inserted there by the compiler.
#include <stdio.h>
#include <limits.h>
void
xx(float *x)
{
unsigned char i = sizeof(*x)*CHAR_BIT-1;
do {
switch (i) {
case 31:
printf("sign:");
break;
case 30:
printf("exponent:");
break;
case 23:
printf("fraction:");
break;
}
char b=(*(unsigned long long*)x&((unsigned long long)1<<i))!=0;
printf("%d ", b);
} while (i--);
printf("\n");
}
void
yy(float a)
{
int sign=!(*(unsigned long long*)&a&((unsigned long long)1<<31));
int fraction = ((1<<23)-1)&(*(int*)&a);
int exponent = (255&((*(int*)&a)>>23))-127;
printf(sign?"positive" " ( 1+":"negative" " ( 1+");
unsigned int i = 1<<22;
unsigned int j = 1;
do {
char b=(fraction&i)!=0;
b&&(printf("1/(%d) %c", 1<<j, (fraction&(i-1))?'+':')' ), 0);
} while (j++, i>>=1);
printf("*2^%d", exponent);
printf("\n");
}
void
main()
{
float x=-3.14;
float y=999999999;
printf("%lu\n", sizeof(x));
xx(&x);
xx(&y);
yy(x);
yy(y);
}
Here is a console session in which I compute the real value of the float that exists in hardware. I used bc to print the sum of terms outputted by the main program. One can insert that sum in python repl or something similar also.
-- .../terra1/stub
# qemacs f.c
-- .../terra1/stub
# gcc f.c
-- .../terra1/stub
# ./a.out
sign:1 exponent:1 0 0 0 0 0 0 fraction:0 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1
sign:0 exponent:1 0 0 1 1 1 0 fraction:0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0
negative ( 1+1/(2) +1/(16) +1/(256) +1/(512) +1/(1024) +1/(2048) +1/(8192) +1/(32768) +1/(65536) +1/(131072) +1/(4194304) +1/(8388608) )*2^1
positive ( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
-- .../terra1/stub
# bc
scale=15
( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
999999999.999999446351872
That's it. The value of 999999999 is in fact
999999999.999999446351872
You can also check with bc that -3.14 is also perturbed. Do not forget to set a scale factor in bc.
The displayed sum is what inside the hardware. The value you obtain by computing it depends on the scale you set. I did set the scale factor to 15. Mathematically, with infinite precision, it seems it is 1,000,000,000.
Since Python 3.5 you can use math.isclose() function for testing approximate equality:
>>> import math
>>> math.isclose(0.1 + 0.2, 0.3)
True
>>> 0.1 + 0.2 == 0.3
False
The trap with floating point numbers is that they look like decimal but they work in binary.
The only prime factor of 2 is 2, while 10 has prime factors of 2 and 5. The result of this is that every number that can be written exactly as a binary fraction can also be written exactly as a decimal fraction but only a subset of numbers that can be written as decimal fractions can be written as binary fractions.
A floating point number is essentially a binary fraction with a limited number of significant digits. If you go past those significant digits then the results will be rounded.
When you type a literal in your code or call the function to parse a floating point number to a string, it expects a decimal number and it stores a binary approximation of that decimal number in the variable.
When you print a floating point number or call the function to convert one to a string it prints a decimal approximation of the floating point number. It is possible to convert a binary number to decimal exactly, but no language I'm aware of does that by default when converting to a string*. Some languages use a fixed number of significant digits, others use the shortest string that will "round trip" back to the same floating point value.
* Python does convert exactly when converting a floating point number to a "decimal.Decimal". This is the easiest way I know of to obtain the exact decimal equivalent of a floating point number.
Floating point numbers are represented, at the hardware level, as fractions of binary numbers (base 2). For example, the decimal fraction:
0.125
has the value 1/10 + 2/100 + 5/1000 and, in the same way, the binary fraction:
0.001
has the value 0/2 + 0/4 + 1/8. These two fractions have the same value, the only difference is that the first is a decimal fraction, the second is a binary fraction.
Unfortunately, most decimal fractions cannot have exact representation in binary fractions. Therefore, in general, the floating point numbers you give are only approximated to binary fractions to be stored in the machine.
The problem is easier to approach in base 10. Take for example, the fraction 1/3. You can approximate it to a decimal fraction:
0.3
or better,
0.33
or better,
0.333
etc. No matter how many decimal places you write, the result is never exactly 1/3, but it is an estimate that always comes closer.
Likewise, no matter how many base 2 decimal places you use, the decimal value 0.1 cannot be represented exactly as a binary fraction. In base 2, 1/10 is the following periodic number:
0.0001100110011001100110011001100110011001100110011 ...
Stop at any finite amount of bits, and you'll get an approximation.
For Python, on a typical machine, 53 bits are used for the precision of a float, so the value stored when you enter the decimal 0.1 is the binary fraction.
0.00011001100110011001100110011001100110011001100110011010
which is close, but not exactly equal, to 1/10.
It's easy to forget that the stored value is an approximation of the original decimal fraction, due to the way floats are displayed in the interpreter. Python only displays a decimal approximation of the value stored in binary. If Python were to output the true decimal value of the binary approximation stored for 0.1, it would output:
>>> 0.1
0.1000000000000000055511151231257827021181583404541015625
This is a lot more decimal places than most people would expect, so Python displays a rounded value to improve readability:
>>> 0.1
0.1
It is important to understand that in reality this is an illusion: the stored value is not exactly 1/10, it is simply on the display that the stored value is rounded. This becomes evident as soon as you perform arithmetic operations with these values:
>>> 0.1 + 0.2
0.30000000000000004
This behavior is inherent to the very nature of the machine's floating-point representation: it is not a bug in Python, nor is it a bug in your code. You can observe the same type of behavior in all other languages ​​that use hardware support for calculating floating point numbers (although some languages ​​do not make the difference visible by default, or not in all display modes).
Another surprise is inherent in this one. For example, if you try to round the value 2.675 to two decimal places, you will get
>>> round (2.675, 2)
2.67
The documentation for the round() primitive indicates that it rounds to the nearest value away from zero. Since the decimal fraction is exactly halfway between 2.67 and 2.68, you should expect to get (a binary approximation of) 2.68. This is not the case, however, because when the decimal fraction 2.675 is converted to a float, it is stored by an approximation whose exact value is :
2.67499999999999982236431605997495353221893310546875
Since the approximation is slightly closer to 2.67 than 2.68, the rounding is down.
If you are in a situation where rounding decimal numbers halfway down matters, you should use the decimal module. By the way, the decimal module also provides a convenient way to "see" the exact value stored for any float.
>>> from decimal import Decimal
>>> Decimal (2.675)
>>> Decimal ('2.67499999999999982236431605997495353221893310546875')
Another consequence of the fact that 0.1 is not exactly stored in 1/10 is that the sum of ten values ​​of 0.1 does not give 1.0 either:
>>> sum = 0.0
>>> for i in range (10):
... sum + = 0.1
...>>> sum
0.9999999999999999
The arithmetic of binary floating point numbers holds many such surprises. The problem with "0.1" is explained in detail below, in the section "Representation errors". See The Perils of Floating Point for a more complete list of such surprises.
It is true that there is no simple answer, however do not be overly suspicious of floating virtula numbers! Errors, in Python, in floating-point number operations are due to the underlying hardware, and on most machines are no more than 1 in 2 ** 53 per operation. This is more than necessary for most tasks, but you should keep in mind that these are not decimal operations, and every operation on floating point numbers may suffer from a new error.
Although pathological cases exist, for most common use cases you will get the expected result at the end by simply rounding up to the number of decimal places you want on the display. For fine control over how floats are displayed, see String Formatting Syntax for the formatting specifications of the str.format () method.
This part of the answer explains in detail the example of "0.1" and shows how you can perform an exact analysis of this type of case on your own. We assume that you are familiar with the binary representation of floating point numbers.The term Representation error means that most decimal fractions cannot be represented exactly in binary. This is the main reason why Python (or Perl, C, C ++, Java, Fortran, and many others) usually doesn't display the exact result in decimal:
>>> 0.1 + 0.2
0.30000000000000004
Why ? 1/10 and 2/10 are not representable exactly in binary fractions. However, all machines today (July 2010) follow the IEEE-754 standard for the arithmetic of floating point numbers. and most platforms use an "IEEE-754 double precision" to represent Python floats. Double precision IEEE-754 uses 53 bits of precision, so on reading the computer tries to convert 0.1 to the nearest fraction of the form J / 2 ** N with J an integer of exactly 53 bits. Rewrite :
1/10 ~ = J / (2 ** N)
in :
J ~ = 2 ** N / 10
remembering that J is exactly 53 bits (so> = 2 ** 52 but <2 ** 53), the best possible value for N is 56:
>>> 2 ** 52
4503599627370496
>>> 2 ** 53
9007199254740992
>>> 2 ** 56/10
7205759403792793
So 56 is the only possible value for N which leaves exactly 53 bits for J. The best possible value for J is therefore this quotient, rounded:
>>> q, r = divmod (2 ** 56, 10)
>>> r
6
Since the carry is greater than half of 10, the best approximation is obtained by rounding up:
>>> q + 1
7205759403792794
Therefore the best possible approximation for 1/10 in "IEEE-754 double precision" is this above 2 ** 56, that is:
7205759403792794/72057594037927936
Note that since the rounding was done upward, the result is actually slightly greater than 1/10; if we hadn't rounded up, the quotient would have been slightly less than 1/10. But in no case is it exactly 1/10!
So the computer never "sees" 1/10: what it sees is the exact fraction given above, the best approximation using the double precision floating point numbers from the "" IEEE-754 ":
>>>. 1 * 2 ** 56
7205759403792794.0
If we multiply this fraction by 10 ** 30, we can observe the values ​​of its 30 decimal places of strong weight.
>>> 7205759403792794 * 10 ** 30 // 2 ** 56
100000000000000005551115123125L
meaning that the exact value stored in the computer is approximately equal to the decimal value 0.100000000000000005551115123125. In versions prior to Python 2.7 and Python 3.1, Python rounded these values ​​to 17 significant decimal places, displaying “0.10000000000000001”. In current versions of Python, the displayed value is the value whose fraction is as short as possible while giving exactly the same representation when converted back to binary, simply displaying “0.1”.
Another way to look at this: Used are 64 bits to represent numbers. As consequence there is no way more than 2**64 = 18,446,744,073,709,551,616 different numbers can be precisely represented.
However, Math says there are already infinitely many decimals between 0 and 1. IEE 754 defines an encoding to use these 64 bits efficiently for a much larger number space plus NaN and +/- Infinity, so there are gaps between accurately represented numbers filled with numbers only approximated.
Unfortunately 0.3 sits in a gap.
Imagine working in base ten with, say, 8 digits of accuracy. You check whether
1/3 + 2 / 3 == 1
and learn that this returns false. Why? Well, as real numbers we have
1/3 = 0.333.... and 2/3 = 0.666....
Truncating at eight decimal places, we get
0.33333333 + 0.66666666 = 0.99999999
which is, of course, different from 1.00000000 by exactly 0.00000001.
The situation for binary numbers with a fixed number of bits is exactly analogous. As real numbers, we have
1/10 = 0.0001100110011001100... (base 2)
and
1/5 = 0.0011001100110011001... (base 2)
If we truncated these to, say, seven bits, then we'd get
0.0001100 + 0.0011001 = 0.0100101
while on the other hand,
3/10 = 0.01001100110011... (base 2)
which, truncated to seven bits, is 0.0100110, and these differ by exactly 0.0000001.
The exact situation is slightly more subtle because these numbers are typically stored in scientific notation. So, for instance, instead of storing 1/10 as 0.0001100 we may store it as something like 1.10011 * 2^-4, depending on how many bits we've allocated for the exponent and the mantissa. This affects how many digits of precision you get for your calculations.
The upshot is that because of these rounding errors you essentially never want to use == on floating-point numbers. Instead, you can check if the absolute value of their difference is smaller than some fixed small number.
It's actually pretty simple. When you have a base 10 system (like ours), it can only express fractions that use a prime factor of the base. The prime factors of 10 are 2 and 5. So 1/2, 1/4, 1/5, 1/8, and 1/10 can all be expressed cleanly because the denominators all use prime factors of 10. In contrast, 1/3, 1/6, and 1/7 are all repeating decimals because their denominators use a prime factor of 3 or 7. In binary (or base 2), the only prime factor is 2. So you can only express fractions cleanly which only contain 2 as a prime factor. In binary, 1/2, 1/4, 1/8 would all be expressed cleanly as decimals. While, 1/5 or 1/10 would be repeating decimals. So 0.1 and 0.2 (1/10 and 1/5) while clean decimals in a base 10 system, are repeating decimals in the base 2 system the computer is operating in. When you do math on these repeating decimals, you end up with leftovers which carry over when you convert the computer's base 2 (binary) number into a more human readable base 10 number.
From https://0.30000000000000004.com/
Decimal numbers such as 0.1, 0.2, and 0.3 are not represented exactly in binary encoded floating point types. The sum of the approximations for 0.1 and 0.2 differs from the approximation used for 0.3, hence the falsehood of 0.1 + 0.2 == 0.3 as can be seen more clearly here:
#include <stdio.h>
int main() {
printf("0.1 + 0.2 == 0.3 is %s\n", 0.1 + 0.2 == 0.3 ? "true" : "false");
printf("0.1 is %.23f\n", 0.1);
printf("0.2 is %.23f\n", 0.2);
printf("0.1 + 0.2 is %.23f\n", 0.1 + 0.2);
printf("0.3 is %.23f\n", 0.3);
printf("0.3 - (0.1 + 0.2) is %g\n", 0.3 - (0.1 + 0.2));
return 0;
}
Output:
0.1 + 0.2 == 0.3 is false
0.1 is 0.10000000000000000555112
0.2 is 0.20000000000000001110223
0.1 + 0.2 is 0.30000000000000004440892
0.3 is 0.29999999999999998889777
0.3 - (0.1 + 0.2) is -5.55112e-17
For these computations to be evaluated more reliably, you would need to use a decimal-based representation for floating point values. The C Standard does not specify such types by default but as an extension described in a technical Report.
The _Decimal32, _Decimal64 and _Decimal128 types might be available on your system (for example, GCC supports them on selected targets, but Clang does not support them on OS X).
Since this thread branched off a bit into a general discussion over current floating point implementations I'd add that there are projects on fixing their issues.
Take a look at https://posithub.org/ for example, which showcases a number type called posit (and its predecessor unum) that promises to offer better accuracy with fewer bits. If my understanding is correct, it also fixes the kind of problems in the question. Quite interesting project, the person behind it is a mathematician it Dr. John Gustafson. The whole thing is open source, with many actual implementations in C/C++, Python, Julia and C# (https://hastlayer.com/arithmetics).
Normal arithmetic is base-10, so decimals represent tenths, hundredths, etc. When you try to represent a floating-point number in binary base-2 arithmetic, you are dealing with halves, fourths, eighths, etc.
In the hardware, floating points are stored as integer mantissas and exponents. Mantissa represents the significant digits. Exponent is like scientific notation but it uses a base of 2 instead of 10. For example 64.0 would be represented with a mantissa of 1 and exponent of 6. 0.125 would be represented with a mantissa of 1 and an exponent of -3.
Floating point decimals have to add up negative powers of 2
0.1b = 0.5d
0.01b = 0.25d
0.001b = 0.125d
0.0001b = 0.0625d
0.00001b = 0.03125d
and so on.
It is common to use a error delta instead of using equality operators when dealing with floating point arithmetic. Instead of
if(a==b) ...
you would use
delta = 0.0001; // or some arbitrarily small amount
if(a - b > -delta && a - b < delta) ...

"Frac" function bug in Delphi? [duplicate]

Consider the following code:
0.1 + 0.2 == 0.3 -> false
0.1 + 0.2 -> 0.30000000000000004
Why do these inaccuracies happen?
Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.
For 0.1 in the standard binary64 format, the representation can be written exactly as
0.1000000000000000055511151231257827021181583404541015625 in decimal, or
0x1.999999999999ap-4 in C99 hexfloat notation.
In contrast, the rational number 0.1, which is 1/10, can be written exactly as
0.1 in decimal, or
0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.
The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.
A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.
Side Note: All positional (base-N) number systems share this problem with precision
Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...
You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001)** - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.
Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).
So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)
Side Side Note: Working with Floats in Programming
In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.
You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:
Do not do if (x == y) { ... }
Instead do if (abs(x - y) < myToleranceValue) { ... }.
where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These can be used as tolerance values but their effectiveness depends on the magnitude (size) of the numbers you're working with, since calculations with large numbers may exceed the epsilon threshold.
A Hardware Designer's Perspective
I believe I should add a hardware designer’s perspective to this since I design and build floating point hardware. Knowing the origin of the error may help in understanding what is happening in the software, and ultimately, I hope this helps explain the reasons for why floating point errors happen and seem to accumulate over time.
1. Overview
From an engineering perspective, most floating point operations will have some element of error since the hardware that does the floating point computations is only required to have an error of less than one half of one unit in the last place. Therefore, much hardware will stop at a precision that's only necessary to yield an error of less than one half of one unit in the last place for a single operation which is especially problematic in floating point division. What constitutes a single operation depends upon how many operands the unit takes. For most, it is two, but some units take 3 or more operands. Because of this, there is no guarantee that repeated operations will result in a desirable error since the errors add up over time.
2. Standards
Most processors follow the IEEE-754 standard but some use denormalized, or different standards
. For example, there is a denormalized mode in IEEE-754 which allows representation of very small floating point numbers at the expense of precision. The following, however, will cover the normalized mode of IEEE-754 which is the typical mode of operation.
In the IEEE-754 standard, hardware designers are allowed any value of error/epsilon as long as it's less than one half of one unit in the last place, and the result only has to be less than one half of one unit in the last place for one operation. This explains why when there are repeated operations, the errors add up. For IEEE-754 double precision, this is the 54th bit, since 53 bits are used to represent the numeric part (normalized), also called the mantissa, of the floating point number (e.g. the 5.3 in 5.3e5). The next sections go into more detail on the causes of hardware error on various floating point operations.
3. Cause of Rounding Error in Division
The main cause of the error in floating point division is the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y, Z = X * (1/Y). A division is computed iteratively i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in the slow division, and the size in bits of the quotient selection table is usually the width of the radix, or a number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k>=2. So for example, a typical Quotient Selection Table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2+2= 4 bits (plus a few optional bits).
3.1 Division Rounding Error: Approximation of Reciprocal
What reciprocals are in the quotient selection table depend on the division method: slow division such as SRT division, or fast division such as Goldschmidt division; each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case, though, all reciprocals are approximations of the actual reciprocal and introduce some element of error. Both slow division and fast division methods calculate the quotient iteratively, i.e. some number of bits of the quotient are calculated each step, then the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build, and fast division methods calculate a variable number of digits per step and are usually more expensive to build. The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal, so they are prone to error.
4. Rounding Errors in Other Operations: Truncation
Another cause of the rounding errors in all operations are the different modes of truncation of the final answer that IEEE-754 allows. There's truncate, round-towards-zero, round-to-nearest (default), round-down, and round-up. All methods introduce an element of error of less than one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error is especially problematic in exponentiation, which involves some form of repeated multiplication.
5. Repeated Operations
Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if not watched. This is the reason that in computations that require a bounded error, mathematicians use methods such as using the round-to-nearest even digit in the last place of IEEE-754, because, over time, the errors are more likely to cancel each other out, and Interval Arithmetic combined with variations of the IEEE 754 rounding modes to predict rounding errors, and correct them. Because of its low relative error compared to other rounding modes, round to nearest even digit (in the last place), is the default rounding mode of IEEE-754.
Note that the default rounding mode, round-to-nearest even digit in the last place, guarantees an error of less than one half of one unit in the last place for one operation. Using the truncation, round-up, and round down alone may result in an error that is greater than one half of one unit in the last place, but less than one unit in the last place, so these modes are not recommended unless they are used in Interval Arithmetic.
6. Summary
In short, the fundamental reason for the errors in floating point operations is a combination of the truncation in hardware, and the truncation of a reciprocal in the case of division. Since the IEEE-754 standard only requires an error of less than one half of one unit in the last place for a single operation, the floating point errors over repeated operations will add up unless corrected.
It's broken in the exact same way the decimal (base-10) notation you learned in grade school and use every day is broken, just for base-2.
To understand, think about representing 1/3 as a decimal value. It's impossible to do exactly! The world will end before you finish writing the 3's after the decimal point, and so instead we write to some number of places and consider it sufficiently accurate.
In the same way, 1/10 (decimal 0.1) cannot be represented exactly in base 2 (binary) as a "decimal" value; a repeating pattern after the decimal point goes on forever. The value is not exact, and therefore you can't do exact math with it using normal floating point methods. Just like with base 10, there are other values that exhibit this problem as well.
Most answers here address this question in very dry, technical terms. I'd like to address this in terms that normal human beings can understand.
Imagine that you are trying to slice up pizzas. You have a robotic pizza cutter that can cut pizza slices exactly in half. It can halve a whole pizza, or it can halve an existing slice, but in any case, the halving is always exact.
That pizza cutter has very fine movements, and if you start with a whole pizza, then halve that, and continue halving the smallest slice each time, you can do the halving 53 times before the slice is too small for even its high-precision abilities. At that point, you can no longer halve that very thin slice, but must either include or exclude it as is.
Now, how would you piece all the slices in such a way that would add up to one-tenth (0.1) or one-fifth (0.2) of a pizza? Really think about it, and try working it out. You can even try to use a real pizza, if you have a mythical precision pizza cutter at hand. :-)
Most experienced programmers, of course, know the real answer, which is that there is no way to piece together an exact tenth or fifth of the pizza using those slices, no matter how finely you slice them. You can do a pretty good approximation, and if you add up the approximation of 0.1 with the approximation of 0.2, you get a pretty good approximation of 0.3, but it's still just that, an approximation.
For double-precision numbers (which is the precision that allows you to halve your pizza 53 times), the numbers immediately less and greater than 0.1 are 0.09999999999999999167332731531132594682276248931884765625 and 0.1000000000000000055511151231257827021181583404541015625. The latter is quite a bit closer to 0.1 than the former, so a numeric parser will, given an input of 0.1, favour the latter.
(The difference between those two numbers is the "smallest slice" that we must decide to either include, which introduces an upward bias, or exclude, which introduces a downward bias. The technical term for that smallest slice is an ulp.)
In the case of 0.2, the numbers are all the same, just scaled up by a factor of 2. Again, we favour the value that's slightly higher than 0.2.
Notice that in both cases, the approximations for 0.1 and 0.2 have a slight upward bias. If we add enough of these biases in, they will push the number further and further away from what we want, and in fact, in the case of 0.1 + 0.2, the bias is high enough that the resulting number is no longer the closest number to 0.3.
In particular, 0.1 + 0.2 is really 0.1000000000000000055511151231257827021181583404541015625 + 0.200000000000000011102230246251565404236316680908203125 = 0.3000000000000000444089209850062616169452667236328125, whereas the number closest to 0.3 is actually 0.299999999999999988897769753748434595763683319091796875.
P.S. Some programming languages also provide pizza cutters that can split slices into exact tenths. Although such pizza cutters are uncommon, if you do have access to one, you should use it when it's important to be able to get exactly one-tenth or one-fifth of a slice.
(Originally posted on Quora.)
Floating point rounding errors. 0.1 cannot be represented as accurately in base-2 as in base-10 due to the missing prime factor of 5. Just as 1/3 takes an infinite number of digits to represent in decimal, but is "0.1" in base-3, 0.1 takes an infinite number of digits in base-2 where it does not in base-10. And computers don't have an infinite amount of memory.
My answer is quite long, so I've split it into three sections. Since the question is about floating point mathematics, I've put the emphasis on what the machine actually does. I've also made it specific to double (64 bit) precision, but the argument applies equally to any floating point arithmetic.
Preamble
An IEEE 754 double-precision binary floating-point format (binary64) number represents a number of the form
value = (-1)^s * (1.m51m50...m2m1m0)2 * 2e-1023
in 64 bits:
The first bit is the sign bit: 1 if the number is negative, 0 otherwise1.
The next 11 bits are the exponent, which is offset by 1023. In other words, after reading the exponent bits from a double-precision number, 1023 must be subtracted to obtain the power of two.
The remaining 52 bits are the significand (or mantissa). In the mantissa, an 'implied' 1. is always2 omitted since the most significant bit of any binary value is 1.
1 - IEEE 754 allows for the concept of a signed zero - +0 and -0 are treated differently: 1 / (+0) is positive infinity; 1 / (-0) is negative infinity. For zero values, the mantissa and exponent bits are all zero. Note: zero values (+0 and -0) are explicitly not classed as denormal2.
2 - This is not the case for denormal numbers, which have an offset exponent of zero (and an implied 0.). The range of denormal double precision numbers is dmin ≤ |x| ≤ dmax, where dmin (the smallest representable nonzero number) is 2-1023 - 51 (≈ 4.94 * 10-324) and dmax (the largest denormal number, for which the mantissa consists entirely of 1s) is 2-1023 + 1 - 2-1023 - 51 (≈ 2.225 * 10-308).
Turning a double precision number to binary
Many online converters exist to convert a double precision floating point number to binary (e.g. at binaryconvert.com), but here is some sample C# code to obtain the IEEE 754 representation for a double precision number (I separate the three parts with colons (:):
public static string BinaryRepresentation(double value)
{
long valueInLongType = BitConverter.DoubleToInt64Bits(value);
string bits = Convert.ToString(valueInLongType, 2);
string leadingZeros = new string('0', 64 - bits.Length);
string binaryRepresentation = leadingZeros + bits;
string sign = binaryRepresentation[0].ToString();
string exponent = binaryRepresentation.Substring(1, 11);
string mantissa = binaryRepresentation.Substring(12);
return string.Format("{0}:{1}:{2}", sign, exponent, mantissa);
}
Getting to the point: the original question
(Skip to the bottom for the TL;DR version)
Cato Johnston (the question asker) asked why 0.1 + 0.2 != 0.3.
Written in binary (with colons separating the three parts), the IEEE 754 representations of the values are:
0.1 => 0:01111111011:1001100110011001100110011001100110011001100110011010
0.2 => 0:01111111100:1001100110011001100110011001100110011001100110011010
Note that the mantissa is composed of recurring digits of 0011. This is key to why there is any error to the calculations - 0.1, 0.2 and 0.3 cannot be represented in binary precisely in a finite number of binary bits any more than 1/9, 1/3 or 1/7 can be represented precisely in decimal digits.
Also note that we can decrease the power in the exponent by 52 and shift the point in the binary representation to the right by 52 places (much like 10-3 * 1.23 == 10-5 * 123). This then enables us to represent the binary representation as the exact value that it represents in the form a * 2p. where 'a' is an integer.
Converting the exponents to decimal, removing the offset, and re-adding the implied 1 (in square brackets), 0.1 and 0.2 are:
0.1 => 2^-4 * [1].1001100110011001100110011001100110011001100110011010
0.2 => 2^-3 * [1].1001100110011001100110011001100110011001100110011010
or
0.1 => 2^-56 * 7205759403792794 = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794 = 0.200000000000000011102230246251565404236316680908203125
To add two numbers, the exponent needs to be the same, i.e.:
0.1 => 2^-3 * 0.1100110011001100110011001100110011001100110011001101(0)
0.2 => 2^-3 * 1.1001100110011001100110011001100110011001100110011010
sum = 2^-3 * 10.0110011001100110011001100110011001100110011001100111
or
0.1 => 2^-55 * 3602879701896397 = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794 = 0.200000000000000011102230246251565404236316680908203125
sum = 2^-55 * 10808639105689191 = 0.3000000000000000166533453693773481063544750213623046875
Since the sum is not of the form 2n * 1.{bbb} we increase the exponent by one and shift the decimal (binary) point to get:
sum = 2^-2 * 1.0011001100110011001100110011001100110011001100110011(1)
= 2^-54 * 5404319552844595.5 = 0.3000000000000000166533453693773481063544750213623046875
There are now 53 bits in the mantissa (the 53rd is in square brackets in the line above). The default rounding mode for IEEE 754 is 'Round to Nearest' - i.e. if a number x falls between two values a and b, the value where the least significant bit is zero is chosen.
a = 2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875
= 2^-2 * 1.0011001100110011001100110011001100110011001100110011
x = 2^-2 * 1.0011001100110011001100110011001100110011001100110011(1)
b = 2^-2 * 1.0011001100110011001100110011001100110011001100110100
= 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125
Note that a and b differ only in the last bit; ...0011 + 1 = ...0100. In this case, the value with the least significant bit of zero is b, so the sum is:
sum = 2^-2 * 1.0011001100110011001100110011001100110011001100110100
= 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125
whereas the binary representation of 0.3 is:
0.3 => 2^-2 * 1.0011001100110011001100110011001100110011001100110011
= 2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875
which only differs from the binary representation of the sum of 0.1 and 0.2 by 2-54.
The binary representation of 0.1 and 0.2 are the most accurate representations of the numbers allowable by IEEE 754. The addition of these representation, due to the default rounding mode, results in a value which differs only in the least-significant-bit.
TL;DR
Writing 0.1 + 0.2 in a IEEE 754 binary representation (with colons separating the three parts) and comparing it to 0.3, this is (I've put the distinct bits in square brackets):
0.1 + 0.2 => 0:01111111101:0011001100110011001100110011001100110011001100110[100]
0.3 => 0:01111111101:0011001100110011001100110011001100110011001100110[011]
Converted back to decimal, these values are:
0.1 + 0.2 => 0.300000000000000044408920985006...
0.3 => 0.299999999999999988897769753748...
The difference is exactly 2-54, which is ~5.5511151231258 × 10-17 - insignificant (for many applications) when compared to the original values.
Comparing the last few bits of a floating point number is inherently dangerous, as anyone who reads the famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (which covers all the major parts of this answer) will know.
Most calculators use additional guard digits to get around this problem, which is how 0.1 + 0.2 would give 0.3: the final few bits are rounded.
In addition to the other correct answers, you may want to consider scaling your values to avoid problems with floating-point arithmetic.
For example:
var result = 1.0 + 2.0; // result === 3.0 returns true
... instead of:
var result = 0.1 + 0.2; // result === 0.3 returns false
The expression 0.1 + 0.2 === 0.3 returns false in JavaScript, but fortunately integer arithmetic in floating-point is exact, so decimal representation errors can be avoided by scaling.
As a practical example, to avoid floating-point problems where accuracy is paramount, it is recommended1 to handle money as an integer representing the number of cents: 2550 cents instead of 25.50 dollars.
1 Douglas Crockford: JavaScript: The Good Parts: Appendix A - Awful Parts (page 105).
Floating point numbers stored in the computer consist of two parts, an integer and an exponent that the base is taken to and multiplied by the integer part.
If the computer were working in base 10, 0.1 would be 1 x 10⁻¹, 0.2 would be 2 x 10⁻¹, and 0.3 would be 3 x 10⁻¹. Integer math is easy and exact, so adding 0.1 + 0.2 will obviously result in 0.3.
Computers don't usually work in base 10, they work in base 2. You can still get exact results for some values, for example 0.5 is 1 x 2⁻¹ and 0.25 is 1 x 2⁻², and adding them results in 3 x 2⁻², or 0.75. Exactly.
The problem comes with numbers that can be represented exactly in base 10, but not in base 2. Those numbers need to be rounded to their closest equivalent. Assuming the very common IEEE 64-bit floating point format, the closest number to 0.1 is 3602879701896397 x 2⁻⁵⁵, and the closest number to 0.2 is 7205759403792794 x 2⁻⁵⁵; adding them together results in 10808639105689191 x 2⁻⁵⁵, or an exact decimal value of 0.3000000000000000444089209850062616169452667236328125. Floating point numbers are generally rounded for display.
In short it's because:
Floating point numbers cannot represent all decimals precisely in binary
So just like 10/3 which does not exist in base 10 precisely (it will be 3.33... recurring), in the same way 1/10 doesn't exist in binary.
So what? How to deal with it? Is there any workaround?
In order to offer The best solution I can say I discovered following method:
parseFloat((0.1 + 0.2).toFixed(10)) => Will return 0.3
Let me explain why it's the best solution.
As others mentioned in above answers it's a good idea to use ready to use Javascript toFixed() function to solve the problem. But most likely you'll encounter with some problems.
Imagine you are going to add up two float numbers like 0.2 and 0.7 here it is: 0.2 + 0.7 = 0.8999999999999999.
Your expected result was 0.9 it means you need a result with 1 digit precision in this case.
So you should have used (0.2 + 0.7).tofixed(1)
but you can't just give a certain parameter to toFixed() since it depends on the given number, for instance
0.22 + 0.7 = 0.9199999999999999
In this example you need 2 digits precision so it should be toFixed(2), so what should be the paramter to fit every given float number?
You might say let it be 10 in every situation then:
(0.2 + 0.7).toFixed(10) => Result will be "0.9000000000"
Damn! What are you going to do with those unwanted zeros after 9?
It's the time to convert it to float to make it as you desire:
parseFloat((0.2 + 0.7).toFixed(10)) => Result will be 0.9
Now that you found the solution, it's better to offer it as a function like this:
function floatify(number){
return parseFloat((number).toFixed(10));
}
Let's try it yourself:
function floatify(number){
return parseFloat((number).toFixed(10));
}
function addUp(){
var number1 = +$("#number1").val();
var number2 = +$("#number2").val();
var unexpectedResult = number1 + number2;
var expectedResult = floatify(number1 + number2);
$("#unexpectedResult").text(unexpectedResult);
$("#expectedResult").text(expectedResult);
}
addUp();
input{
width: 50px;
}
#expectedResult{
color: green;
}
#unexpectedResult{
color: red;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input id="number1" value="0.2" onclick="addUp()" onkeyup="addUp()"/> +
<input id="number2" value="0.7" onclick="addUp()" onkeyup="addUp()"/> =
<p>Expected Result: <span id="expectedResult"></span></p>
<p>Unexpected Result: <span id="unexpectedResult"></span></p>
You can use it this way:
var x = 0.2 + 0.7;
floatify(x); => Result: 0.9
As W3SCHOOLS suggests there is another solution too, you can multiply and divide to solve the problem above:
var x = (0.2 * 10 + 0.1 * 10) / 10; // x will be 0.3
Keep in mind that (0.2 + 0.1) * 10 / 10 won't work at all although it seems the same!
I prefer the first solution since I can apply it as a function which converts the input float to accurate output float.
FYI, the same problem exists for multiplication, for instance 0.09 * 10 returns 0.8999999999999999. Apply the flotify function as a workaround: flotify(0.09 * 10) returns 0.9
Floating point rounding error. From What Every Computer Scientist Should Know About Floating-Point Arithmetic:
Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.
My workaround:
function add(a, b, precision) {
var x = Math.pow(10, precision || 2);
return (Math.round(a * x) + Math.round(b * x)) / x;
}
precision refers to the number of digits you want to preserve after the decimal point during addition.
No, not broken, but most decimal fractions must be approximated
Summary
Floating point arithmetic is exact, unfortunately, it doesn't match up well with our usual base-10 number representation, so it turns out we are often giving it input that is slightly off from what we wrote.
Even simple numbers like 0.01, 0.02, 0.03, 0.04 ... 0.24 are not representable exactly as binary fractions. If you count up 0.01, .02, .03 ..., not until you get to 0.25 will you get the first fraction representable in base2. If you tried that using FP, your 0.01 would have been slightly off, so the only way to add 25 of them up to a nice exact 0.25 would have required a long chain of causality involving guard bits and rounding. It's hard to predict so we throw up our hands and say "FP is inexact", but that's not really true.
We constantly give the FP hardware something that seems simple in base 10 but is a repeating fraction in base 2.
How did this happen?
When we write in decimal, every fraction (specifically, every terminating decimal) is a rational number of the form
a / (2n x 5m)
In binary, we only get the 2n term, that is:
a / 2n
So in decimal, we can't represent 1/3. Because base 10 includes 2 as a prime factor, every number we can write as a binary fraction also can be written as a base 10 fraction. However, hardly anything we write as a base10 fraction is representable in binary. In the range from 0.01, 0.02, 0.03 ... 0.99, only three numbers can be represented in our FP format: 0.25, 0.50, and 0.75, because they are 1/4, 1/2, and 3/4, all numbers with a prime factor using only the 2n term.
In base10 we can't represent 1/3. But in binary, we can't do 1/10 or 1/3.
So while every binary fraction can be written in decimal, the reverse is not true. And in fact most decimal fractions repeat in binary.
Dealing with it
Developers are usually instructed to do < epsilon comparisons, better advice might be to round to integral values (in the C library: round() and roundf(), i.e., stay in the FP format) and then compare. Rounding to a specific decimal fraction length solves most problems with output.
Also, on real number-crunching problems (the problems that FP was invented for on early, frightfully expensive computers) the physical constants of the universe and all other measurements are only known to a relatively small number of significant figures, so the entire problem space was "inexact" anyway. FP "accuracy" isn't a problem in this kind of application.
The whole issue really arises when people try to use FP for bean counting. It does work for that, but only if you stick to integral values, which kind of defeats the point of using it. This is why we have all those decimal fraction software libraries.
I love the Pizza answer by Chris, because it describes the actual problem, not just the usual handwaving about "inaccuracy". If FP were simply "inaccurate", we could fix that and would have done it decades ago. The reason we haven't is because the FP format is compact and fast and it's the best way to crunch a lot of numbers. Also, it's a legacy from the space age and arms race and early attempts to solve big problems with very slow computers using small memory systems. (Sometimes, individual magnetic cores for 1-bit storage, but that's another story.)
Conclusion
If you are just counting beans at a bank, software solutions that use decimal string representations in the first place work perfectly well. But you can't do quantum chromodynamics or aerodynamics that way.
A lot of good answers have been posted, but I'd like to append one more.
Not all numbers can be represented via floats/doubles
For example, the number "0.2" will be represented as "0.200000003" in single precision in IEEE754 float point standard.
Model for store real numbers under the hood represent float numbers as
Even though you can type 0.2 easily, FLT_RADIX and DBL_RADIX is 2; not 10 for a computer with FPU which uses "IEEE Standard for Binary Floating-Point Arithmetic (ISO/IEEE Std 754-1985)".
So it is a bit hard to represent such numbers exactly. Even if you specify this variable explicitly without any intermediate calculation.
Some statistics related to this famous double precision question.
When adding all values (a + b) using a step of 0.1 (from 0.1 to 100) we have ~15% chance of precision error. Note that the error could result in slightly bigger or smaller values.
Here are some examples:
0.1 + 0.2 = 0.30000000000000004 (BIGGER)
0.1 + 0.7 = 0.7999999999999999 (SMALLER)
...
1.7 + 1.9 = 3.5999999999999996 (SMALLER)
1.7 + 2.2 = 3.9000000000000004 (BIGGER)
...
3.2 + 3.6 = 6.800000000000001 (BIGGER)
3.2 + 4.4 = 7.6000000000000005 (BIGGER)
When subtracting all values (a - b where a > b) using a step of 0.1 (from 100 to 0.1) we have ~34% chance of precision error.
Here are some examples:
0.6 - 0.2 = 0.39999999999999997 (SMALLER)
0.5 - 0.4 = 0.09999999999999998 (SMALLER)
...
2.1 - 0.2 = 1.9000000000000001 (BIGGER)
2.0 - 1.9 = 0.10000000000000009 (BIGGER)
...
100 - 99.9 = 0.09999999999999432 (SMALLER)
100 - 99.8 = 0.20000000000000284 (BIGGER)
*15% and 34% are indeed huge, so always use BigDecimal when precision is of big importance. With 2 decimal digits (step 0.01) the situation worsens a bit more (18% and 36%).
Given that nobody has mentioned this...
Some high level languages such as Python and Java come with tools to overcome binary floating point limitations. For example:
Python's decimal module and Java's BigDecimal class, that represent numbers internally with decimal notation (as opposed to binary notation). Both have limited precision, so they are still error prone, however they solve most common problems with binary floating point arithmetic.
Decimals are very nice when dealing with money: ten cents plus twenty cents are always exactly thirty cents:
>>> 0.1 + 0.2 == 0.3
False
>>> Decimal('0.1') + Decimal('0.2') == Decimal('0.3')
True
Python's decimal module is based on IEEE standard 854-1987.
Python's fractions module and Apache Common's BigFraction class. Both represent rational numbers as (numerator, denominator) pairs and they may give more accurate results than decimal floating point arithmetic.
Neither of these solutions is perfect (especially if we look at performances, or if we require a very high precision), but still they solve a great number of problems with binary floating point arithmetic.
Did you try the duct tape solution?
Try to determine when errors occur and fix them with short if statements, it's not pretty but for some problems it is the only solution and this is one of them.
if( (n * 0.1) < 100.0 ) { return n * 0.1 - 0.000000000000001 ;}
else { return n * 0.1 + 0.000000000000001 ;}
I had the same problem in a scientific simulation project in c#, and I can tell you that if you ignore the butterfly effect it's gonna turn to a big fat dragon and bite you in the a**
Those weird numbers appear because computers use binary(base 2) number system for calculation purposes, while we use decimal(base 10).
There are a majority of fractional numbers that cannot be represented precisely either in binary or in decimal or both. Result - A rounded up (but precise) number results.
Many of this question's numerous duplicates ask about the effects of floating point rounding on specific numbers. In practice, it is easier to get a feeling for how it works by looking at exact results of calculations of interest rather than by just reading about it. Some languages provide ways of doing that - such as converting a float or double to BigDecimal in Java.
Since this is a language-agnostic question, it needs language-agnostic tools, such as a Decimal to Floating-Point Converter.
Applying it to the numbers in the question, treated as doubles:
0.1 converts to 0.1000000000000000055511151231257827021181583404541015625,
0.2 converts to 0.200000000000000011102230246251565404236316680908203125,
0.3 converts to 0.299999999999999988897769753748434595763683319091796875, and
0.30000000000000004 converts to 0.3000000000000000444089209850062616169452667236328125.
Adding the first two numbers manually or in a decimal calculator such as Full Precision Calculator, shows the exact sum of the actual inputs is 0.3000000000000000166533453693773481063544750213623046875.
If it were rounded down to the equivalent of 0.3 the rounding error would be 0.0000000000000000277555756156289135105907917022705078125. Rounding up to the equivalent of 0.30000000000000004 also gives rounding error 0.0000000000000000277555756156289135105907917022705078125. The round-to-even tie breaker applies.
Returning to the floating point converter, the raw hexadecimal for 0.30000000000000004 is 3fd3333333333334, which ends in an even digit and therefore is the correct result.
Can I just add; people always assume this to be a computer problem, but if you count with your hands (base 10), you can't get (1/3+1/3=2/3)=true unless you have infinity to add 0.333... to 0.333... so just as with the (1/10+2/10)!==3/10 problem in base 2, you truncate it to 0.333 + 0.333 = 0.666 and probably round it to 0.667 which would be also be technically inaccurate.
Count in ternary, and thirds are not a problem though - maybe some race with 15 fingers on each hand would ask why your decimal math was broken...
The kind of floating-point math that can be implemented in a digital computer necessarily uses an approximation of the real numbers and operations on them. (The standard version runs to over fifty pages of documentation and has a committee to deal with its errata and further refinement.)
This approximation is a mixture of approximations of different kinds, each of which can either be ignored or carefully accounted for due to its specific manner of deviation from exactitude. It also involves a number of explicit exceptional cases at both the hardware and software levels that most people walk right past while pretending not to notice.
If you need infinite precision (using the number π, for example, instead of one of its many shorter stand-ins), you should write or use a symbolic math program instead.
But if you're okay with the idea that sometimes floating-point math is fuzzy in value and logic and errors can accumulate quickly, and you can write your requirements and tests to allow for that, then your code can frequently get by with what's in your FPU.
Just for fun, I played with the representation of floats, following the definitions from the Standard C99 and I wrote the code below.
The code prints the binary representation of floats in 3 separated groups
SIGN EXPONENT FRACTION
and after that it prints a sum, that, when summed with enough precision, it will show the value that really exists in hardware.
So when you write float x = 999..., the compiler will transform that number in a bit representation printed by the function xx such that the sum printed by the function yy be equal to the given number.
In reality, this sum is only an approximation. For the number 999,999,999 the compiler will insert in bit representation of the float the number 1,000,000,000
After the code I attach a console session, in which I compute the sum of terms for both constants (minus PI and 999999999) that really exists in hardware, inserted there by the compiler.
#include <stdio.h>
#include <limits.h>
void
xx(float *x)
{
unsigned char i = sizeof(*x)*CHAR_BIT-1;
do {
switch (i) {
case 31:
printf("sign:");
break;
case 30:
printf("exponent:");
break;
case 23:
printf("fraction:");
break;
}
char b=(*(unsigned long long*)x&((unsigned long long)1<<i))!=0;
printf("%d ", b);
} while (i--);
printf("\n");
}
void
yy(float a)
{
int sign=!(*(unsigned long long*)&a&((unsigned long long)1<<31));
int fraction = ((1<<23)-1)&(*(int*)&a);
int exponent = (255&((*(int*)&a)>>23))-127;
printf(sign?"positive" " ( 1+":"negative" " ( 1+");
unsigned int i = 1<<22;
unsigned int j = 1;
do {
char b=(fraction&i)!=0;
b&&(printf("1/(%d) %c", 1<<j, (fraction&(i-1))?'+':')' ), 0);
} while (j++, i>>=1);
printf("*2^%d", exponent);
printf("\n");
}
void
main()
{
float x=-3.14;
float y=999999999;
printf("%lu\n", sizeof(x));
xx(&x);
xx(&y);
yy(x);
yy(y);
}
Here is a console session in which I compute the real value of the float that exists in hardware. I used bc to print the sum of terms outputted by the main program. One can insert that sum in python repl or something similar also.
-- .../terra1/stub
# qemacs f.c
-- .../terra1/stub
# gcc f.c
-- .../terra1/stub
# ./a.out
sign:1 exponent:1 0 0 0 0 0 0 fraction:0 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1
sign:0 exponent:1 0 0 1 1 1 0 fraction:0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0
negative ( 1+1/(2) +1/(16) +1/(256) +1/(512) +1/(1024) +1/(2048) +1/(8192) +1/(32768) +1/(65536) +1/(131072) +1/(4194304) +1/(8388608) )*2^1
positive ( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
-- .../terra1/stub
# bc
scale=15
( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
999999999.999999446351872
That's it. The value of 999999999 is in fact
999999999.999999446351872
You can also check with bc that -3.14 is also perturbed. Do not forget to set a scale factor in bc.
The displayed sum is what inside the hardware. The value you obtain by computing it depends on the scale you set. I did set the scale factor to 15. Mathematically, with infinite precision, it seems it is 1,000,000,000.
Since Python 3.5 you can use math.isclose() function for testing approximate equality:
>>> import math
>>> math.isclose(0.1 + 0.2, 0.3)
True
>>> 0.1 + 0.2 == 0.3
False
The trap with floating point numbers is that they look like decimal but they work in binary.
The only prime factor of 2 is 2, while 10 has prime factors of 2 and 5. The result of this is that every number that can be written exactly as a binary fraction can also be written exactly as a decimal fraction but only a subset of numbers that can be written as decimal fractions can be written as binary fractions.
A floating point number is essentially a binary fraction with a limited number of significant digits. If you go past those significant digits then the results will be rounded.
When you type a literal in your code or call the function to parse a floating point number to a string, it expects a decimal number and it stores a binary approximation of that decimal number in the variable.
When you print a floating point number or call the function to convert one to a string it prints a decimal approximation of the floating point number. It is possible to convert a binary number to decimal exactly, but no language I'm aware of does that by default when converting to a string*. Some languages use a fixed number of significant digits, others use the shortest string that will "round trip" back to the same floating point value.
* Python does convert exactly when converting a floating point number to a "decimal.Decimal". This is the easiest way I know of to obtain the exact decimal equivalent of a floating point number.
Floating point numbers are represented, at the hardware level, as fractions of binary numbers (base 2). For example, the decimal fraction:
0.125
has the value 1/10 + 2/100 + 5/1000 and, in the same way, the binary fraction:
0.001
has the value 0/2 + 0/4 + 1/8. These two fractions have the same value, the only difference is that the first is a decimal fraction, the second is a binary fraction.
Unfortunately, most decimal fractions cannot have exact representation in binary fractions. Therefore, in general, the floating point numbers you give are only approximated to binary fractions to be stored in the machine.
The problem is easier to approach in base 10. Take for example, the fraction 1/3. You can approximate it to a decimal fraction:
0.3
or better,
0.33
or better,
0.333
etc. No matter how many decimal places you write, the result is never exactly 1/3, but it is an estimate that always comes closer.
Likewise, no matter how many base 2 decimal places you use, the decimal value 0.1 cannot be represented exactly as a binary fraction. In base 2, 1/10 is the following periodic number:
0.0001100110011001100110011001100110011001100110011 ...
Stop at any finite amount of bits, and you'll get an approximation.
For Python, on a typical machine, 53 bits are used for the precision of a float, so the value stored when you enter the decimal 0.1 is the binary fraction.
0.00011001100110011001100110011001100110011001100110011010
which is close, but not exactly equal, to 1/10.
It's easy to forget that the stored value is an approximation of the original decimal fraction, due to the way floats are displayed in the interpreter. Python only displays a decimal approximation of the value stored in binary. If Python were to output the true decimal value of the binary approximation stored for 0.1, it would output:
>>> 0.1
0.1000000000000000055511151231257827021181583404541015625
This is a lot more decimal places than most people would expect, so Python displays a rounded value to improve readability:
>>> 0.1
0.1
It is important to understand that in reality this is an illusion: the stored value is not exactly 1/10, it is simply on the display that the stored value is rounded. This becomes evident as soon as you perform arithmetic operations with these values:
>>> 0.1 + 0.2
0.30000000000000004
This behavior is inherent to the very nature of the machine's floating-point representation: it is not a bug in Python, nor is it a bug in your code. You can observe the same type of behavior in all other languages ​​that use hardware support for calculating floating point numbers (although some languages ​​do not make the difference visible by default, or not in all display modes).
Another surprise is inherent in this one. For example, if you try to round the value 2.675 to two decimal places, you will get
>>> round (2.675, 2)
2.67
The documentation for the round() primitive indicates that it rounds to the nearest value away from zero. Since the decimal fraction is exactly halfway between 2.67 and 2.68, you should expect to get (a binary approximation of) 2.68. This is not the case, however, because when the decimal fraction 2.675 is converted to a float, it is stored by an approximation whose exact value is :
2.67499999999999982236431605997495353221893310546875
Since the approximation is slightly closer to 2.67 than 2.68, the rounding is down.
If you are in a situation where rounding decimal numbers halfway down matters, you should use the decimal module. By the way, the decimal module also provides a convenient way to "see" the exact value stored for any float.
>>> from decimal import Decimal
>>> Decimal (2.675)
>>> Decimal ('2.67499999999999982236431605997495353221893310546875')
Another consequence of the fact that 0.1 is not exactly stored in 1/10 is that the sum of ten values ​​of 0.1 does not give 1.0 either:
>>> sum = 0.0
>>> for i in range (10):
... sum + = 0.1
...>>> sum
0.9999999999999999
The arithmetic of binary floating point numbers holds many such surprises. The problem with "0.1" is explained in detail below, in the section "Representation errors". See The Perils of Floating Point for a more complete list of such surprises.
It is true that there is no simple answer, however do not be overly suspicious of floating virtula numbers! Errors, in Python, in floating-point number operations are due to the underlying hardware, and on most machines are no more than 1 in 2 ** 53 per operation. This is more than necessary for most tasks, but you should keep in mind that these are not decimal operations, and every operation on floating point numbers may suffer from a new error.
Although pathological cases exist, for most common use cases you will get the expected result at the end by simply rounding up to the number of decimal places you want on the display. For fine control over how floats are displayed, see String Formatting Syntax for the formatting specifications of the str.format () method.
This part of the answer explains in detail the example of "0.1" and shows how you can perform an exact analysis of this type of case on your own. We assume that you are familiar with the binary representation of floating point numbers.The term Representation error means that most decimal fractions cannot be represented exactly in binary. This is the main reason why Python (or Perl, C, C ++, Java, Fortran, and many others) usually doesn't display the exact result in decimal:
>>> 0.1 + 0.2
0.30000000000000004
Why ? 1/10 and 2/10 are not representable exactly in binary fractions. However, all machines today (July 2010) follow the IEEE-754 standard for the arithmetic of floating point numbers. and most platforms use an "IEEE-754 double precision" to represent Python floats. Double precision IEEE-754 uses 53 bits of precision, so on reading the computer tries to convert 0.1 to the nearest fraction of the form J / 2 ** N with J an integer of exactly 53 bits. Rewrite :
1/10 ~ = J / (2 ** N)
in :
J ~ = 2 ** N / 10
remembering that J is exactly 53 bits (so> = 2 ** 52 but <2 ** 53), the best possible value for N is 56:
>>> 2 ** 52
4503599627370496
>>> 2 ** 53
9007199254740992
>>> 2 ** 56/10
7205759403792793
So 56 is the only possible value for N which leaves exactly 53 bits for J. The best possible value for J is therefore this quotient, rounded:
>>> q, r = divmod (2 ** 56, 10)
>>> r
6
Since the carry is greater than half of 10, the best approximation is obtained by rounding up:
>>> q + 1
7205759403792794
Therefore the best possible approximation for 1/10 in "IEEE-754 double precision" is this above 2 ** 56, that is:
7205759403792794/72057594037927936
Note that since the rounding was done upward, the result is actually slightly greater than 1/10; if we hadn't rounded up, the quotient would have been slightly less than 1/10. But in no case is it exactly 1/10!
So the computer never "sees" 1/10: what it sees is the exact fraction given above, the best approximation using the double precision floating point numbers from the "" IEEE-754 ":
>>>. 1 * 2 ** 56
7205759403792794.0
If we multiply this fraction by 10 ** 30, we can observe the values ​​of its 30 decimal places of strong weight.
>>> 7205759403792794 * 10 ** 30 // 2 ** 56
100000000000000005551115123125L
meaning that the exact value stored in the computer is approximately equal to the decimal value 0.100000000000000005551115123125. In versions prior to Python 2.7 and Python 3.1, Python rounded these values ​​to 17 significant decimal places, displaying “0.10000000000000001”. In current versions of Python, the displayed value is the value whose fraction is as short as possible while giving exactly the same representation when converted back to binary, simply displaying “0.1”.
Another way to look at this: Used are 64 bits to represent numbers. As consequence there is no way more than 2**64 = 18,446,744,073,709,551,616 different numbers can be precisely represented.
However, Math says there are already infinitely many decimals between 0 and 1. IEE 754 defines an encoding to use these 64 bits efficiently for a much larger number space plus NaN and +/- Infinity, so there are gaps between accurately represented numbers filled with numbers only approximated.
Unfortunately 0.3 sits in a gap.
Imagine working in base ten with, say, 8 digits of accuracy. You check whether
1/3 + 2 / 3 == 1
and learn that this returns false. Why? Well, as real numbers we have
1/3 = 0.333.... and 2/3 = 0.666....
Truncating at eight decimal places, we get
0.33333333 + 0.66666666 = 0.99999999
which is, of course, different from 1.00000000 by exactly 0.00000001.
The situation for binary numbers with a fixed number of bits is exactly analogous. As real numbers, we have
1/10 = 0.0001100110011001100... (base 2)
and
1/5 = 0.0011001100110011001... (base 2)
If we truncated these to, say, seven bits, then we'd get
0.0001100 + 0.0011001 = 0.0100101
while on the other hand,
3/10 = 0.01001100110011... (base 2)
which, truncated to seven bits, is 0.0100110, and these differ by exactly 0.0000001.
The exact situation is slightly more subtle because these numbers are typically stored in scientific notation. So, for instance, instead of storing 1/10 as 0.0001100 we may store it as something like 1.10011 * 2^-4, depending on how many bits we've allocated for the exponent and the mantissa. This affects how many digits of precision you get for your calculations.
The upshot is that because of these rounding errors you essentially never want to use == on floating-point numbers. Instead, you can check if the absolute value of their difference is smaller than some fixed small number.
It's actually pretty simple. When you have a base 10 system (like ours), it can only express fractions that use a prime factor of the base. The prime factors of 10 are 2 and 5. So 1/2, 1/4, 1/5, 1/8, and 1/10 can all be expressed cleanly because the denominators all use prime factors of 10. In contrast, 1/3, 1/6, and 1/7 are all repeating decimals because their denominators use a prime factor of 3 or 7. In binary (or base 2), the only prime factor is 2. So you can only express fractions cleanly which only contain 2 as a prime factor. In binary, 1/2, 1/4, 1/8 would all be expressed cleanly as decimals. While, 1/5 or 1/10 would be repeating decimals. So 0.1 and 0.2 (1/10 and 1/5) while clean decimals in a base 10 system, are repeating decimals in the base 2 system the computer is operating in. When you do math on these repeating decimals, you end up with leftovers which carry over when you convert the computer's base 2 (binary) number into a more human readable base 10 number.
From https://0.30000000000000004.com/
Decimal numbers such as 0.1, 0.2, and 0.3 are not represented exactly in binary encoded floating point types. The sum of the approximations for 0.1 and 0.2 differs from the approximation used for 0.3, hence the falsehood of 0.1 + 0.2 == 0.3 as can be seen more clearly here:
#include <stdio.h>
int main() {
printf("0.1 + 0.2 == 0.3 is %s\n", 0.1 + 0.2 == 0.3 ? "true" : "false");
printf("0.1 is %.23f\n", 0.1);
printf("0.2 is %.23f\n", 0.2);
printf("0.1 + 0.2 is %.23f\n", 0.1 + 0.2);
printf("0.3 is %.23f\n", 0.3);
printf("0.3 - (0.1 + 0.2) is %g\n", 0.3 - (0.1 + 0.2));
return 0;
}
Output:
0.1 + 0.2 == 0.3 is false
0.1 is 0.10000000000000000555112
0.2 is 0.20000000000000001110223
0.1 + 0.2 is 0.30000000000000004440892
0.3 is 0.29999999999999998889777
0.3 - (0.1 + 0.2) is -5.55112e-17
For these computations to be evaluated more reliably, you would need to use a decimal-based representation for floating point values. The C Standard does not specify such types by default but as an extension described in a technical Report.
The _Decimal32, _Decimal64 and _Decimal128 types might be available on your system (for example, GCC supports them on selected targets, but Clang does not support them on OS X).
Since this thread branched off a bit into a general discussion over current floating point implementations I'd add that there are projects on fixing their issues.
Take a look at https://posithub.org/ for example, which showcases a number type called posit (and its predecessor unum) that promises to offer better accuracy with fewer bits. If my understanding is correct, it also fixes the kind of problems in the question. Quite interesting project, the person behind it is a mathematician it Dr. John Gustafson. The whole thing is open source, with many actual implementations in C/C++, Python, Julia and C# (https://hastlayer.com/arithmetics).
Normal arithmetic is base-10, so decimals represent tenths, hundredths, etc. When you try to represent a floating-point number in binary base-2 arithmetic, you are dealing with halves, fourths, eighths, etc.
In the hardware, floating points are stored as integer mantissas and exponents. Mantissa represents the significant digits. Exponent is like scientific notation but it uses a base of 2 instead of 10. For example 64.0 would be represented with a mantissa of 1 and exponent of 6. 0.125 would be represented with a mantissa of 1 and an exponent of -3.
Floating point decimals have to add up negative powers of 2
0.1b = 0.5d
0.01b = 0.25d
0.001b = 0.125d
0.0001b = 0.0625d
0.00001b = 0.03125d
and so on.
It is common to use a error delta instead of using equality operators when dealing with floating point arithmetic. Instead of
if(a==b) ...
you would use
delta = 0.0001; // or some arbitrarily small amount
if(a - b > -delta && a - b < delta) ...

largest integer that can be stored in a double such that all integers less than can be accurately stored as well

This is some more clarification to the question that was already answered some time ago here: biggest integer that can be stored in a double
The top answer mentions that "the largest integer such that it and all smaller integers can be stored in IEEE 64-bit doubles without losing precision. An IEEE 64-bit double has 52 bits of mantissa, so I think it's 2^53:
because:
253 + 1 cannot be stored, because the 1 at the start and the 1 at the end have too many zeros in between.
Anything less than 253 can be stored, with 52 bits explicitly stored in the mantissa, and then the exponent in effect giving you another one.
253 obviously can be stored, since it's a small power of 2.
Can someone clarify the first point? What does he mean by that? is he talking about for example if it were a 4 bit number 1000 + 0001, you can't store that in 4 bits? 253 is just the first bit 1 and the rest 0's right? how come you can't add a 1 to that without losing precision?
also, "The largest integer such that it and all smaller integers can be stored in IEEE". Is there some general rule such that if I wanted to find the largest n bit integer such that it and all smaller integers can be stored in IEEE, could I simply say that it is 2n? example if I were to find the largest 4 bit integers such that it and all integer below it can be represented, it would be 2^4?
is he talking about for example if it were a 4 bit number 1000 + 0001, you can't store that in 4 bits?
No, he is saying that you can't store that in 3 bits. Using the usual binary notation.
253 is just the first bit 1 and the rest 0's right?
Yes, and so are 1, 2, 4, …, 253, 254, 255, …, 2123, 2124, … and also 0.125.
This is floating-point we are talking about. 253 is just an implicit 1 with all explicit significand bits 0, yes, but it is not the only number with this property. The crucial property is that the ULP for representing 253 is 2. So 253 can be represented as all powers of two that are in range, and 253+1 cannot because the ULP is too large in that neighborhood.
also, "The largest integer such that it and all smaller integers can be stored in IEEE". Is there some general rule such that if I wanted to find the largest n bit integer such that it and all smaller integers can be stored in IEEE, could I simply say that it is 2n?
Yes, in binary IEEE 754 floating-point, all “largest integer such that it and all smaller integers can be stored” are powers of two, and specifically 2n where n is the significand's width (counting the implicit bit).