Representing numbers with IEEE754 with Round to Nearest Even - binary

I'm currently learning about IEEE754 standard and rounding, and I have an exercise which is the following:
Add -325.875 to 0.546875 in IEEE754, but with 3 bits dedicated to the mantissa instead of 23.
I'm having a lot of trouble doing this, especially representing the intermediary values, and the guard/round/sticky bits. Can someone give me a step-by-step solution, to the problem?
My biggest problem is that obviously I can't represent 0.546875 as 0.100011 as that would have more precision than the system has. So how would that be represented?
Apologies if the wording is confusing.

Preliminaries
The preferred term for the fraction portion of a floating-point number is “significand,” not “mantissa.” “Mantissa” is an old word for the fraction portion of a logarithm. Mantissas are logarithmic; adding to the mantissa multiplies the number represented. Significands are linear; adding to the significand adds to the number represented (as scaled by the exponent).
When working with a significand, use its mathematical precision, not the number of bits in the storage format. The IEEE-754 binary32 format has 23 bits in its primary field for the encoding of a significand, but another bit is encoded via the exponent field. Mathematically, numbers in the binary32 format behave as if they have 24 bits in their significands.
So, the task is to work with numbers with four bits in their significands, not three.
Work
In binary, −325.875 is −101000101.1112•2. In scientific notation, that is −1.010001011112•28. Rounding it to four bits in the significand gives −1.0102•28.
In binary, 0.546875 is .1000112. In scientific notation, that is 1.000112•2−1. Rounding it to four bits in the significand gives 1.0012•2−1. Note that the first four bits are 1000, but they are immediately followed by 11, so we round up. 1.00011 is closer to 1.001 than it is to 1.000.
So, in a floating-point format with four-bit significands, we want to add −1.0102•28 and 1.0012•2−1. If we adjust the latter number to have the same exponent as the former, we have −1.0102•28 and 0.0000000010012•28. To add those, we note the signs are different, so we want to subtract the magnitudes. It may help to line up the digits as we were taught in elementary school:
1.010000000000
0.000000001001
——————————————
1.001111110111
Thus, the mathematical result would be −1.0011111101112•28. However, we need to round the significand to four bits. The first four bits are 1001, but they are followed by 11, so we round up, producing 1010. So the final result is −1.0102•28.
−1.0102•28 is −1.25•28 = −320.

Related

Negative fixed point number representation

I am writing a generic routine for converting fixed-point numbers between decimal and binary representations.
For positive numbers the processing is simple, however when things come to negative ones I found divergent sources. Someone says there is a single bit used to hold the sign while others say the whole number should be represented in a pseudo integer using 2's complement even it is negative.
Please anyone tell me which source is correct or is there a standard representation for signed fixed point numbers?
Additionally, if the 2's complement representation was correct then how to represent negative numbers with zero integer part. For example -0.125?
Fixed-point numbers are just binary values where the place values have been changed. Assigning place values to the bits is an arbitrary human activity, and we can do it in any way that makes sense. Normally we talk about binary integers so it is convenient to assign the place value 2^0 = 1 to the LSB, 2^1=2 to the bit to the left of the LSB, and so on. For an N bit integer the place value of the MSB becomes 2^(N-1). If we want a two's-complement representation, we change the place value of the MSB to -2^(N-1) and all of the other bit place values are unchanged.
For fixed-point values, if we want F bits to represent a fractional part of the number, then the place value of the LSB becomes 2^(0-F)
and the place value of the MSB becomes 2^(N-1-F) for unsigned numbers and -2^(N-1-F) for signed numbers.
So, how would we represent -0.125 in a two's-complement fixed-point value? That is equal to 0.875 - 1, so we can use a representation where the place value of the MSB is -1 and the value of all of the other bits adds up to 0.875. If you choose a
4-bit fixed-point number with 3 fraction bits you would say that
1111 binary equals -0.125 decimal. Adding up the place values of the bits we have (-1) + 0.5 + 0.25 + 0.125 = -0.125. My personal preference is to write the binary number as 1.111 to note which bits are fraction and which are integer.
The reason we use this approach is that the normal integer arithmetic operators still work.
It's easiest to think of fixed-point numbers as scaled integers — rather than shifted integers. For a given fixed-point type, there is a fixed scale which is a power of two (or ten). To convert from the real value to the integer representation, multiply by that scale. To convert back again, simply divide. Then the issue of how negative values are represented becomes a detail of the integer type with which you are representing your number.
Please anyone tell me which source is correct...
Both are problematic.
Your first source is incorrect. The given example is not...
the same as 2's complement numbers.
In two’s complement, the MSB's (most significant bit's) weight is negated but the other bits still contribute positive values. Thus a two’s complement number with all bits set to 1 does not produce the minimum value.
Your second source could be a little misleading where it says...
shifting the bit pattern of a number to the right by 1 bit always divide the number by 2.
This statement brushes over the matter of underflow that occurs when the LSB (least significant bit) is set to 1, and the resultant rounding. Right-shifting commonly results in rounding towards negative infinity while division results in rounding towards zero (truncation). Both produce the same behavior for positive numbers: 3/2 == 1 and 3>>1 == 1. For negative numbers, they are contrary: -3/2 == -1 but -3>>1 == -2.
...is there a standard representation for signed fixed point numbers?
I don't think so. There are language-specific standards, e.g. ISO/IEC TR 18037 (draft). But the convention of scaling integers to approximate real numbers of predetermined range and resolution is well established. How the underlying integers are represented is another matter.
Additionally, if the 2's complement representation was correct then how to represent negative numbers with zero integer part. For example -0.125?
That depends on the format of your integer and your choice of radix. Assuming a 16-bit two’s complement number representing binary fixed-point values, the scaling factor is 2^15 which is 32,768. Multiply the value to store as an integer: -0.125*32768. == -4096 and divide to retrieve it: -4096/32768. == -0.125.

How does exponent bias make comparison easier

I'm reading this article about exponent bias in floating point numbers and it says the following:
n IEEE 754 floating point numbers, the exponent is biased in the
engineering sense of the word – the value stored is offset from the
actual value by the exponent bias. Biasing is done because exponents
have to be signed values in order to be able to represent both tiny
and huge values, but two's complement, the usual representation for
signed values, would make comparison harder. To solve this problem the
exponent is biased before being stored, by adjusting its value to put
it within an unsigned range suitable for comparison. By arranging the
fields so that the sign bit is in the most significant bit position,
the biased exponent in the middle, then the mantissa in the least
significant bits, the resulting value will be ordered properly,
whether it's interpreted as a floating point or integer value. This
allows high speed comparisons of floating point numbers using fixed
point hardware.
I've also found this explanation from wikipedia's article about offset binary:
This has the consequence that the "zero" value is represented by a 1
in the most significant bit and zero in all other bits, and in general
the effect is conveniently the same as using two's complement except
that the most significant bit is inverted. It also has the consequence
that in a logical comparison operation, one gets the same result as
with a two's complement numerical comparison operation, whereas, in
two's complement notation a logical comparison will agree with two's
complement numerical comparison operation if and only if the numbers
being compared have the same sign. Otherwise the sense of the
comparison will be inverted, with all negative values being taken as
being larger than all positive values.
I don't really understand what kind of comparison they are talking about here. Can someone please explain using a simple example?
'Comparison' here refers to the usual comparison of numbers by size: 5 > 4, etc. Suppose floating-point numbers were stored with as
[sign bit] [unbiased exponent] [mantissa]
For example, if the exponent is a 2's complement 3-bit binary number and the mantissa is a 4-bit unsigned binary number, you'd have
1 010 1001 = 4.5
1 110 0111 = 0.21875
You can see that the first is bigger than the second, but to figure this out, the computer would have to calculate 1.001 x 2^2 and 0.111 x 2^(-2) and then compare the resulting floating-point numbers. This is already complex with floating-point hardware, and if there is no such hardware for this computer, then...
So the number is stored as
[sign bit] [biased exponent] [mantissa]
Using the same 3-bit binary number for the exponent (this time biased; see a related question) and unsigned 4-bit mantissa, we have
1 101 1001 = 4.5
1 001 0111 = 0.21875
But now comparison is very easy! You can treat the two numbers as integers 11011001 and 10010111 and see that the first is obviously bigger: obvious even to a computer, as integer comparisons are easy. This is why biased exponents are used.

Why does the binary fourteen bit floating point model used in textbooks use bias 16 where as the IEEE single precision uses bias 127?

In my computer architecture course we use a 14 bit binary model;(1 bit for sign,5 bits for exponent, and 8 bits for our mantissa). When inputting the Exponent my instructor has us add 16 to offset it.(bias 16) Why are we using 16 bias? Is it because 5 bits can only represent up to 31 numbers? If so please elaborate and compare to IEEE single precision that uses a 127 bias when using the exponent. Lastly if someone can give me a clear definition of bias used in this context and in binary I would greatly appreciate it. Please comment if anything I said was unclear.
The IEEE 754 binary float formats follow a simple pattern for the exponent bias. When the exponent has p bits the bias is . With this the exponent has an equal number of positive and negative exponents.
For single precision floats p is 8 and therefore the bias is 127. For your format p is 5 and the bias is 15. Maybe your instructor changed the bias to 16 because the format don't support denorm, infinity and NaN.
There are several ways of representing a range of numbers including both positive and negative. Adding a bias is particularly flexible. The range [-n, m) can be represented by adding n to each number, mapping it to the range [0, m+n).
That system is used for the exponent in all the floating point systems I have used. It simplifies some comparisons, because larger unsigned binary value of the non-sign bits represents larger absolute magnitude of the float, except for special values such as NaNs.
For float exponents, the bias is around half the exponent range, so that approximately half the values are on each side of zero. Exact balance is impossible because there are an even number of bit patterns, and one is used for zero.
As discussed in another answer, the IEEE 754 standard would use a bias of 15 for a 5 bit exponent.
There are several possible reasons for choosing 16:
There is some actual technical reason, such as the suggested one of not treating 31 as special.
Bias 16 makes the representation of 1.0 particularly simple, with a single non-zero bit.
Being subtly different from IEEE 754 helps convince students that floating point does not imply IEEE 754. There are other floating point formats.
Being subtly different from IEEE 754 may discourage use of existing tools to get the results for exercises without understanding how the representation works.
It is an arbitrary choice of one of the reasonable values for the exponent bias, without reference to IEEE 754.

Is the most significant decimal digits precision that can be converted to binary and back to decimal without loss of significance 6 or 7.225?

I've come across two different precision formulas for floating-point numbers.
⌊(N-1) log10(2)⌋ = 6 decimal digits (Single-precision)
and
N log10(2) ≈ 7.225 decimal digits (Single-precision)
Where N = 24 Significant bits (Single-precision)
The first formula is found at the top of page 4 of "IEEE Standard 754 for Binary Floating-Point Arithmetic" written by, Professor W. Kahan.
The second formula is found on the Wikipedia article "Single-precision floating-point format" under section IEEE 754 single-precision binary floating-point format: binary32.
For the first formula, Professor W. Kahan says
If a decimal string with at most 6 sig. dec. is converted to Single and then converted back to the same number of sig. dec.,
then the final string should match the original.
For the second formula, Wikipedia says
...the total precision is 24 bits (equivalent to log10(224) ≈ 7.225 decimal digits).
The results of both formulas (6 and 7.225 decimal digits) are different, and I expected them to be the same because I assumed they both were meant to represent the most significant decimal digits which can be converted to floating-point binary and then converted back to decimal with the same number of significant decimal digits that it started with.
Why do these two numbers differ, and what is the most significant decimal digits precision that can be converted to binary and back to decimal without loss of significance?
These are talking about two slightly different things.
The 7.2251 digits is the precision with which a number can be stored internally. For one example, if you did a computation with a double precision number (so you were starting with something like 15 digits of precision), then rounded it to a single precision number, the precision you'd have left at that point would be approximately 7 digits.
The 6 digits is talking about the precision that can be maintained through a round-trip conversion from a string of decimal digits, into a floating point number, then back to another string of decimal digits.
So, let's assume I start with a number like 1.23456789 as a string, then convert that to a float32, then convert the result back to a string. When I've done this, I can expect 6 digits to match exactly. The seventh digit might be rounded though, so I can't necessarily expect it to match (though it probably will be +/- 1 of the original string.
For example, consider the following code:
#include <iostream>
#include <iomanip>
int main() {
double init = 987.23456789;
for (int i = 0; i < 100; i++) {
float f = init + i / 100.0;
std::cout << std::setprecision(10) << std::setw(20) << f;
}
}
This produces a table like the following:
987.2345581 987.2445679 987.2545776 987.2645874
987.2745972 987.2845459 987.2945557 987.3045654
987.3145752 987.324585 987.3345947 987.3445435
987.3545532 987.364563 987.3745728 987.3845825
987.3945923 987.404541 987.4145508 987.4245605
987.4345703 987.4445801 987.4545898 987.4645386
987.4745483 987.4845581 987.4945679 987.5045776
987.5145874 987.5245972 987.5345459 987.5445557
987.5545654 987.5645752 987.574585 987.5845947
987.5945435 987.6045532 987.614563 987.6245728
987.6345825 987.6445923 987.654541 987.6645508
987.6745605 987.6845703 987.6945801 987.7045898
987.7145386 987.7245483 987.7345581 987.7445679
987.7545776 987.7645874 987.7745972 987.7845459
987.7945557 987.8045654 987.8145752 987.824585
987.8345947 987.8445435 987.8545532 987.864563
987.8745728 987.8845825 987.8945923 987.904541
987.9145508 987.9245605 987.9345703 987.9445801
987.9545898 987.9645386 987.9745483 987.9845581
987.9945679 988.0045776 988.0145874 988.0245972
988.0345459 988.0445557 988.0545654 988.0645752
988.074585 988.0845947 988.0945435 988.1045532
988.114563 988.1245728 988.1345825 988.1445923
988.154541 988.1645508 988.1745605 988.1845703
988.1945801 988.2045898 988.2145386 988.2245483
If we look through this, we can see that the first six significant digits always follow the pattern precisely (i.e., each result is exactly 0.01 greater than its predecessor). As we can see in the original double, the value is actually 98x.xx456--but when we convert the single-precision float to decimal, we can see that the 7th digit frequently would not be read back in correctly--since the subsequent digit is greater than 5, it should round up to 98x.xx46, but some of the values won't (e.g,. the second to last item in the first column is 988.154541, which would be round down instead of up, so we'd end up with 98x.xx45 instead of 46. So, even though the value (as stored) is precise to 7 digits (plus a little), by the time we round-trip the value through a conversion to decimal and back, we can't depend on that seventh digit matching precisely any more (even though there's enough precision that it will a lot more often than not).
1. That basically means 7 digits, and the 8th digit will be a little more accurate than nothing, but not a whole lot--for example, if we were converting from a double of 1.2345678, the .225 digits of precision mean that the last digit would be with about +/- .775 of the what started out there (whereas without the .225 digits of precision, it would be basically +/- 1 of what started out there).
what is the most significant decimal digits precision that can be
converted to binary and back to decimal without loss of significance?
The most significant decimal digits precision that can be converted to binary and back to decimal without loss of significance (for single-precision floating-point numbers or 24-bits) is 6 decimal digits.
Why do these two numbers differ...
The numbers 6 and 7.225 differ, because they define two different things. 6 is the most decimal digits that can be round-tripped. 7.225 is the approximate number of decimal digits precision for a 24-bit binary integer because a 24-bit binary integer can have 7 or 8 decimal digits depending on its specific value.
7.225 was found using the specific binary integer formula.
dspec = b·log10(2) (dspec
= specific decimal digits, b = bits)
However, what you normally need to know, are the minimum and maximum decimal digits for a b-bit integer. The following formulas are used to find the min and max decimal digits (7 and 8 respectively for 24-bits) of a specific binary integer.
dmin = ⌈(b-1)·log10(2)⌉ (dmin
= min decimal digits, b = bits, ⌈x⌉ = smallest integer ≥ x)
dmax = ⌈b·log10(2)⌉ (dmax
= max decimal digits, b = bits, ⌈x⌉ = smallest integer ≥ x)
To learn more about how these formulas are derived, read Number of Decimal Digits In a Binary Integer, written by Rick Regan.
This is all well and good, but you may ask, why is 6 the most decimal digits for a round-trip conversion if you say that the span of decimal digits for a 24-bit number is 7 to 8?
The answer is — because the above formulas only work for integers and not floating-point numbers!
Every decimal integer has an exact value in binary. However, the same cannot be said for every decimal floating-point number. Take .1 for example. .1 in binary is the number 0.000110011001100..., which is a repeating or recurring binary. This can produce rounding error.
Moreover, it takes one more bit to represent a decimal floating-point number than it does to represent a decimal integer of equal significance. This is because floating-point numbers are more precise the closer they are to 0, and less precise the further they are from 0. Because of this, many floating-point numbers near the minimum and maximum value ranges (emin = -126 and emax = +127 for single-precision) lose 1 bit of precision due to rounding error. To see this visually, look at What every computer programmer should know about floating point, part 1, written by Josh Haberman.
Furthermore, there are at least 784,757 positive seven-digit decimal numbers that cannot retain their original value after a round-trip conversion. An example of such a number that cannot survive the round-trip is 8.589973e9. This is the smallest positive number that does not retain its original value.
Here's the formula that you should be using for floating-point number precision that will give you 6 decimal digits for round-trip conversion.
dmax = ⌊(b-1)·log10(2)⌋ (dmax
= max decimal digits, b = bits, ⌊x⌋ = largest integer ≤ x)
To learn more about how this formula is derived, read Number of Digits Required For Round-Trip Conversions, also written by Rick Regan. Rick does an excellent job showing the formulas derivation with references to rigorous proofs.
As a result, you can utilize the above formulas in a constructive way; if you understand how they work, you can apply them to any programming language that uses floating-point data types. All you have to know is the number of significant bits that your floating-point data type has, and you can find their respective number of decimal digits that you can count on to have no loss of significance after a round-trip conversion.
June 18, 2017 Update: I want to include a link to Rick Regan's new article which goes into more detail and in my opinion better answers this question than any answer provided here. His article is "Decimal Precision of Binary Floating-Point Numbers" and can be found on his website www.exploringbinary.com.
Do keep in mind that they are the exact same formulas. Remember your high-school math book identity:
Log(x^y) == y * Log(x)
It helps to actually calculate the values for N = 24 with your calculator:
Kahan's: 23 * Log(2) = 6.924
Wikipedia's: Log(2^24) = 7.225
Kahan was forced to truncate 6.924 down to 6 digits because of floor(), bummer. The only actual difference is that Kahan used 1 less bit of precision.
Pretty hard to guess why, the professor might have relied on old notes. Written before IEEE-754 and not taking into account that the 24th bit of precision is for free. The format uses a trick, the most significant bit of a floating point value that isn't 0 is always 1. So it doesn't need to be stored. The processor adds it back before it performs a calculation. Turning 23 bits of stored precision into 24 of effective precision.
Or he took into account that the conversion from a decimal string to a binary floating point value itself generates an error. Many nice round decimal values, like 0.1, cannot be perfectly converted to binary. It has an endless number of digits, just like 1/3 in decimal. That however generates a result that is off by +/- 0.5 bits, achieved by simple rounding. So the result is accurate to 23.5 * Log(2) = 7.074 decimal digits. If he assumed that the conversion routine is clumsy and doesn't properly round then the result can be off by +/-1 bit and N-1 is appropriate. They are not clumsy.
Or he thought like a typical scientist or (heaven forbid) accountant and wants the result of a calculation converted back to decimal as well. Such as you'd get when you trivially look for a 7 digit decimal number whose conversion back-and-forth does not produce the same number. Yes, that adds another +/- 0.5 bit error, summing up to 1 bit error total.
But never, never make that mistake, you always have to include any errors you get from manipulating the number in a calculation. Some of them lose significant digits very quickly, subtraction in particular is very dangerous.

32 bit mantissa representation of 2455.1152

If I wanted to represent -2455.1152 as 32 bit I know the first bit is 1 (negative sign) but I can get the 2455 to binary as 10010010111 but for the fractional part I'm not too sure. .1152 could have an infinite number of fractional parts. Would that mean that only up to 23 bits are used to represent the fractional part? So since 2445 uses 11 bits, bits 11 to 0 are for the fractional part?
for the binary representation I have 10010010111.00011101001. Exponent is 10. 10+127=137. 137 as binary is 10001001.
full representation would be:
1 10001001 1001001011100011101001
is that right?
It looks like you are trying to devise your own floating-point representation, but you used a fixed-point tag so I will explain how to convert your real number to a traditional fixed-point representation. First, you need to decide how many bits will be used to represent the fractional part of the number. Just for the sake of discussion let's say that 16 bits will be used for the fractional part, 15 bits for the integer part, and one bit reserved for the sign bit. Now, multiply the absolute value of the real number by 2^{16}: 2455.1152 * 65536 = 160898429.747. You can either round to the nearest integer or just truncate. Suppose we just truncate to 160898429. Converting this to hexadecimal we get 0x09971D7D. To make this negative, invert and add a 1 to the LSB, and the final result is 0xF668E283.
To convert back to a real number just reverse the process. Take the absolute value of the fixed-point representation and divide by 2^{16}. In this case we would find that the fixed-point representation is equal to the real number -2455.1151886 . The accuracy can be improved by rounding instead of truncating when converting from real to fixed-point, or by allowing more bits for the fractional part.