Convert from format with 5 exponent bits to format with 4 exponent bits - binary

Consider the following two 9-bit floating-point representations based on the IEEE floating-point
format.
Format A:
There is 1 sign bit.
There are k = 5 exponent bits. The exponent bias is 15.
There are n = 3 fraction bits.
Format B:
There is 1 sign bit.
There are k = 4 exponent bits. The exponent bias is 7.
There are n = 4 fraction bits.
In the following table, you are given some bit patterns in format A, and your task is to convert them to the closest value in format B. In addition, give the values of the numbers represented by the format A and format B bit patterns.
I'm currently stuck on 3 cases:
Format A        Value       Format B    Value
1 00111 010     -5/1024     ?           ?
0 00000 111     7/131072    ?           ?
1 11100 000     -8192       ?           ?
I am able to work out the decimal value for all 3 cases, but I am struggling with the conversion to format B.
For the first case, if I change to format B, the biased exponent would be -8 + bias = -8 + 7 = -1, so is it correct to make the exponent field all 0 (denormalized value)? And what should the frac part be?
For the second case, I think it is right to make the exp field all 0 (denormalized value), but what is the correct frac part?
For the last case, the exponent overflows (13 + 7 = 20, which does not fit in 4 bits), so what should the result be?
I really need to understand how this works, not only the answer. Thank you for any help!

The exponent field encodes an exponent. The code 0 means subnormal. The code 1 is the minimum normal exponent. With a bias of 7, the code of 1 encodes an exponent of 1−7 = −6. Therefore, the minimum exponent is −6. To encode a subnormal number, you need to adjust its exponent to be −6.
The value −5/1024 equals −101₂ • 2^−10. Shifting to make its exponent −6 gives −101₂ • 2^−10 = −0.0101₂ • 2^−6. So the leading bit of the significand is 0 (confirming it is subnormal), and the trailing bits are 0101.
For 7/131,072, shifting to make the exponent −6 gives 111₂ • 2^−17 = 0.00000000111₂ • 2^−6. The significand does not fit into five bits (one leading plus four trailing), so this number cannot be represented in the format. It is also less than half of the smallest positive format B value, 0.0001₂ • 2^−6 = 2^−10, so rounding to the nearest representable value gives +0.
For −8192, shifting to make the exponent the largest representable value, 7, gives −1₂ • 2^13 = −1000000₂ • 2^7. So this number cannot be represented in the format. Rounding is implemented by choosing the rounding direction as if the exponent range were unbounded, so this should be rounded upward in magnitude (downward when the sign is considered). Rounding upward in magnitude from beyond the largest finite value produces infinity. So this number is rounded to −∞, which is represented with the sign bit set, all ones in the exponent field, and all zeros in the primary significand field.
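If you want to check work like this mechanically, here is a small C++ sketch. It is mine, not part of the original exercise, and the helper names decodeSmallFloat and encodeSmallFloat are made up; it decodes a bit pattern of either format to its exact value and re-encodes it in format B using round-to-nearest (ties to even, relying on the default floating-point rounding mode), with overflow going to infinity as described above.

#include <bitset>
#include <cmath>
#include <cstdio>
#include <string>

// Decode a small IEEE-like format: 1 sign bit, expBits exponent bits, fracBits fraction bits.
double decodeSmallFloat(unsigned bits, int expBits, int fracBits) {
    const int bias = (1 << (expBits - 1)) - 1;
    unsigned frac = bits & ((1u << fracBits) - 1);
    unsigned exp  = (bits >> fracBits) & ((1u << expBits) - 1);
    double   sign = ((bits >> (expBits + fracBits)) & 1u) ? -1.0 : 1.0;
    if (exp == (1u << expBits) - 1)                      // all ones: infinity or NaN
        return frac == 0 ? sign * INFINITY : NAN;
    if (exp == 0)                                        // subnormal: 0.frac * 2^(1 - bias)
        return sign * std::ldexp((double)frac, 1 - bias - fracBits);
    return sign * std::ldexp((double)((1u << fracBits) | frac),   // normal: 1.frac * 2^(exp - bias)
                             (int)exp - bias - fracBits);
}

// Encode a value in the same kind of format, rounding to nearest (ties to even),
// with overflow going to infinity. Assumes the default rounding mode.
unsigned encodeSmallFloat(double v, int expBits, int fracBits) {
    const int bias = (1 << (expBits - 1)) - 1;
    const int emin = 1 - bias, emax = bias;
    const unsigned expAllOnes = (1u << expBits) - 1;
    unsigned signField = (std::signbit(v) ? 1u : 0u) << (expBits + fracBits);
    double mag = std::fabs(v);
    if (mag == 0.0) return signField;                                  // +/- zero
    if (std::isinf(mag)) return signField | (expAllOnes << fracBits);  // +/- infinity
    int e;
    double m = std::frexp(mag, &e);                       // mag = m * 2^e with 0.5 <= m < 1
    m *= 2.0; e -= 1;                                     // now 1 <= m < 2
    if (e < emin) { m = std::ldexp(mag, -emin); e = emin; }            // subnormal range: m < 1
    double r = std::nearbyint(std::ldexp(m, fracBits));   // round the significand to fracBits bits
    if (r >= std::ldexp(2.0, fracBits)) { r = std::ldexp(1.0, fracBits); e += 1; }  // carry out
    if (e > emax) return signField | (expAllOnes << fracBits);         // overflow -> infinity
    if (r < std::ldexp(1.0, fracBits)) return signField | (unsigned)r; // subnormal or zero
    return signField | ((unsigned)(e + bias) << fracBits)              // normal
                     | ((unsigned)r - (1u << fracBits));
}

int main() {
    unsigned examples[] = { 0b100111010, 0b000000111, 0b111100000 };   // the three format A cases
    for (unsigned a : examples) {
        double v = decodeSmallFloat(a, 5, 3);        // format A: k = 5, n = 3
        unsigned b = encodeSmallFloat(v, 4, 4);      // format B: k = 4, n = 4
        std::printf("A = %s  value = %g  ->  B = %s  value = %g\n",
                    std::bitset<9>(a).to_string().c_str(), v,
                    std::bitset<9>(b).to_string().c_str(), decodeSmallFloat(b, 4, 4));
    }
}

For the three patterns in the question this should print the format B encodings 1 0000 0101 (−5/1024), 0 0000 0000 (+0) and 1 1111 0000 (−∞).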

Related

How do I convert to binary if I have fewer bits?

I'm given a 6 bit exponent range and a 4 bit mantissa range (MMMM|EEEEEE). I want to convert the number 171.125 to this form.
Converting both parts to binary I get 10101011.11111101 and normalising I have
1.0101011 11111101 x 2^7.
I found that the bias in this case is not 127 but is given by a formula, which gives the bias as 31. So for the 6-bit exponent part: exp − 31 = 7 ⇒ exp = 38 = 100110.
However, my mantissa range is too small. What do I do with the fractional part?
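One way to deal with a significand that doesn't fit is simply to round it to the width you have. A minimal sketch (assuming an implicit leading 1 and round-to-nearest, neither of which the question's MMMM|EEEEEE format explicitly specifies):

#include <cmath>
#include <cstdio>

int main() {
    double x = 171.125;
    int e;
    double m = std::frexp(x, &e);                   // x = m * 2^e with 0.5 <= m < 1
    m *= 2.0; e -= 1;                               // normalize to 1.ffff... * 2^e
    double sig = std::nearbyint(std::ldexp(m, 4));  // keep 4 fraction bits: round m * 2^4
    if (sig == 32) { sig = 16; e += 1; }            // rounding carried into the exponent
    std::printf("significand = %g/16, exponent = %d, value = %g\n",
                sig, e, std::ldexp(sig, e - 4));
}

Under those assumptions the extra fraction bits are simply rounded away, and 171.125 lands on 1.0101₂ x 2^7 = 168.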

Why the offset is calculated as `2^{n-1} - 1` instead of `2^{n-1}` for floating point exponent representation

I'm trying to understand why offset K in binary offset notation is calculated as
2^{n-1}-1 instead of 2^{n-1} for floating point exponent representation. Here is my reasoning for 2^{n-1}.
Four bits can represent values in the range [-8;7], so 0000 represents -8. An offset from zero here is 8 and can be calculated as 2^{n-1}. Using this offset we can define representation of any number, for example, the number 3.
What number do we need to add to -8 to get 3? It's 11, so 3 in offset binary is represented as 1011. And the formula seems to be number to represent + offset.
However, the real formula is number to represent + offset - 1, and so the correct representation is 1010. Can someone please explain why we also subtract an additional one?
I am posting this as an answer to better explain my thoughts, but even though I'll quote the standard a few times, I haven't found an explicitly stated reason.
In the following, I'll refer to the IEEE 754 standard (and successive revisions) for floating point representation, even if OP doesn't mention it (if I'm wrong, please, let me know).
The question is about the particular representation of the exponent in a floating point number.
In subclause 3.3, Sets of floating-point data, it is said (emphasis mine):
The set of finite floating-point numbers representable within a
particular format is determined by the following integer parameters:
― b = the radix, 2 or 10
― p = the number of digits in the significand (precision)
― emax = the maximum exponent e
― emin = the minimum exponent e
emin shall be 1 − emax for all formats.
Later it specifies:
The smallest positive normal floating-point number is b^emin and the largest is b^emax × (b − b^(1−p)). The non-zero floating-point numbers for a format with magnitude less than b^emin are called subnormal because their magnitudes lie between zero and the smallest normal magnitude.
In 3.4 Binary interchange format encodings:
Representations of floating-point data in the binary interchange formats are encoded in k bits in the following three fields (...):
a) 1-bit sign S
b) w-bit biased exponent E = e + bias
c) (t = p − 1)-bit trailing significand field digit string T = d_1 d_2 ... d_(p−1); the leading bit of the significand, d_0, is implicitly encoded in the biased exponent E
(...)
The range of the encoding’s biased exponent E shall include:
― every integer between 1 and 2^w − 2, inclusive, to encode normal numbers
― the reserved value 0 to encode ±0 and subnormal numbers
― the reserved value 2^w − 1 to encode ±∞ and NaNs.
For example, a 32-bit floating-point number has these parameters:
k, storage width in bits: 32
p, precision in bits: 24
emax, maximum exponent e: 127
emin, minimum exponent e: -126
bias, E − e: 127
w, exponent field width in bits: 8
t, trailing significand field width in bits: 23
In this Q&A it is pointed out that: "The purpose of the bias is so that the exponent is stored in unsigned form, making it easier to do comparisons."
Considering the above-mentioned 32-bit floating-point representation, a normal (not subnormal) number has an encoded biased exponent E in the range between 1 and 254.
The reason behind the particular choice of the range [−126, 127] for the exponent could be, in my opinion, to extend the range of representable numbers: very small numbers are already covered by subnormals, so a bigger (even if only by one) maximum exponent can take care of the big ones.
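For concreteness, here is a tiny sketch (mine) of how those parameters fall out of the exponent field width w, using the binary32 numbers from the table above:

#include <cstdio>

int main() {
    const int w = 8;                       // exponent field width for binary32
    const int bias = (1 << (w - 1)) - 1;   // 2^(w-1) - 1 = 127
    const int emax = bias;                 // largest normal exponent, 127
    const int emin = 1 - emax;             // smallest normal exponent, -126
    // E = e + bias; E = 0 and E = 2^w - 1 are reserved (subnormals/zero and infinities/NaNs).
    std::printf("bias = %d, emin = %d, emax = %d, normal E range = 1..%d\n",
                bias, emin, emax, (1 << w) - 2);
}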

Understanding offset-k method of representing negative integers

I'm reading this article about the offset-K method of representing negative integers. Can someone please explain the following passage using some examples:
One logical way to represent signed integers is to have enough range
in binary numbers so that the zero can be offset to the middle of the
range of positive binary numbers. Then the magnitude of a negative
binary number can be simply subtracted from that zero point.
I understand the mechanics, e.g. to represent the number 4 in 11 bits, I'll do 4 + 1023 = 1027, but I can't understand the logic behind it and why it works.
If we have numbers ranging from -8 to +8 we can remove the sign by adding 8 to all of our numbers. The numbers would then be 0 to +16. It is rather like shifting the scale (as in converting Celsius to Kelvin) to obtain only positive values.
This representation allows operations on the biased numbers to be the same as for unsigned integers, but actually represents both positive and negative values.
This method is called by several names: Excess-K, offset binary, or biased representation; it uses a fixed value K as the biasing value.
A value is represented by the unsigned number which is K greater than the intended value.
Biased representations are now primarily used for the exponent of floating-point numbers. The IEEE floating-point standard defines the exponent field of a single-precision (32-bit) number as an 8-bit excess-127 field.
To understand more clearly, consider the two examples below.
Example 1: 4-bit pattern
0110: the digit/column value of the most significant bit is 8, so 4-bit patterns are referred to as Excess-8 notation.
To convert this example, find the value of the entire pattern as though it were a standard binary number:
= (0 x 8) + (1 x 4) + (1 x 2) + (0 x 1)
= 0 + 4 + 2 + 0
= 6
Then subtract the excess value, 8, from the sum: 6 - 8.
The result is a signed value, -2.
Example 2: 5-bit pattern
11110: the digit/column value of the most significant bit is 16, so 5-bit patterns are referred to as Excess-16 notation.
To convert this example, find the value of the entire pattern as though it were a standard binary number:
= (1 x 16) + (1 x 8) + (1 x 4) + (1 x 2) + (0 x 1)
= 16 + 8 + 4 + 2 + 0
= 30
Then subtract the excess value, 16, from the sum: 30 - 16.
The result is a signed value, +14.
Hopefully this makes the logic clear: in these excess-K encodings, a most significant bit of 0 indicates a negative value and a most significant bit of 1 indicates a non-negative value.
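The same arithmetic in code, reusing the two examples above plus the 3 → 1011 case from the earlier excess-8 question (a small sketch):

#include <cstdio>

int main() {
    // Excess-K (offset binary): stored pattern = value + K, value = stored pattern - K.
    const int K4 = 8;                                          // excess-8 for 4-bit patterns
    const int K5 = 16;                                         // excess-16 for 5-bit patterns
    std::printf("0110  in excess-8  = %d\n", 0b0110 - K4);     // example 1: -2
    std::printf("11110 in excess-16 = %d\n", 0b11110 - K5);    // example 2: +14
    std::printf("3 encoded in excess-8 = %d (binary 1011)\n", 3 + K4);   // 8 + 3 = 11 = 1011
}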

Is the most significant decimal digits precision that can be converted to binary and back to decimal without loss of significance 6 or 7.225?

I've come across two different precision formulas for floating-point numbers.
⌊(N-1) log10(2)⌋ = 6 decimal digits (Single-precision)
and
N log10(2) ≈ 7.225 decimal digits (Single-precision)
Where N = 24 Significant bits (Single-precision)
The first formula is found at the top of page 4 of "IEEE Standard 754 for Binary Floating-Point Arithmetic" written by, Professor W. Kahan.
The second formula is found on the Wikipedia article "Single-precision floating-point format" under section IEEE 754 single-precision binary floating-point format: binary32.
For the first formula, Professor W. Kahan says
If a decimal string with at most 6 sig. dec. is converted to Single and then converted back to the same number of sig. dec.,
then the final string should match the original.
For the second formula, Wikipedia says
...the total precision is 24 bits (equivalent to log10(2^24) ≈ 7.225 decimal digits).
The results of both formulas (6 and 7.225 decimal digits) are different, and I expected them to be the same because I assumed they both were meant to represent the most significant decimal digits which can be converted to floating-point binary and then converted back to decimal with the same number of significant decimal digits that it started with.
Why do these two numbers differ, and what is the most significant decimal digits precision that can be converted to binary and back to decimal without loss of significance?
These are talking about two slightly different things.
The 7.225 digits¹ refers to the precision with which a number can be stored internally. For one example, if you did a computation with a double precision number (so you were starting with something like 15 digits of precision), then rounded it to a single precision number, the precision you'd have left at that point would be approximately 7 digits.
The 6 digits is talking about the precision that can be maintained through a round-trip conversion from a string of decimal digits, into a floating point number, then back to another string of decimal digits.
So, let's assume I start with a number like 1.23456789 as a string, then convert that to a float32, then convert the result back to a string. When I've done this, I can expect 6 digits to match exactly. The seventh digit might be rounded though, so I can't necessarily expect it to match (though it will probably be within ±1 of the original digit).
For example, consider the following code:
#include <iostream>
#include <iomanip>

int main() {
    double init = 987.23456789;
    for (int i = 0; i < 100; i++) {
        // Convert the double to single precision (rounding), then print it with 10 significant digits.
        float f = init + i / 100.0;
        std::cout << std::setprecision(10) << std::setw(20) << f;
        if (i % 4 == 3)
            std::cout << '\n';    // four values per row, matching the table below
    }
}
This produces a table like the following:
987.2345581 987.2445679 987.2545776 987.2645874
987.2745972 987.2845459 987.2945557 987.3045654
987.3145752 987.324585 987.3345947 987.3445435
987.3545532 987.364563 987.3745728 987.3845825
987.3945923 987.404541 987.4145508 987.4245605
987.4345703 987.4445801 987.4545898 987.4645386
987.4745483 987.4845581 987.4945679 987.5045776
987.5145874 987.5245972 987.5345459 987.5445557
987.5545654 987.5645752 987.574585 987.5845947
987.5945435 987.6045532 987.614563 987.6245728
987.6345825 987.6445923 987.654541 987.6645508
987.6745605 987.6845703 987.6945801 987.7045898
987.7145386 987.7245483 987.7345581 987.7445679
987.7545776 987.7645874 987.7745972 987.7845459
987.7945557 987.8045654 987.8145752 987.824585
987.8345947 987.8445435 987.8545532 987.864563
987.8745728 987.8845825 987.8945923 987.904541
987.9145508 987.9245605 987.9345703 987.9445801
987.9545898 987.9645386 987.9745483 987.9845581
987.9945679 988.0045776 988.0145874 988.0245972
988.0345459 988.0445557 988.0545654 988.0645752
988.074585 988.0845947 988.0945435 988.1045532
988.114563 988.1245728 988.1345825 988.1445923
988.154541 988.1645508 988.1745605 988.1845703
988.1945801 988.2045898 988.2145386 988.2245483
If we look through this, we can see that the first six significant digits always follow the pattern precisely (i.e., each result is exactly 0.01 greater than its predecessor). As we can see from the original double, the value is actually 98x.xx456, but when we convert the single-precision float to decimal, the 7th digit frequently would not be read back in correctly. Since the subsequent digit is greater than 5, it should round up to 98x.xx46, but some of the values won't (e.g., the second-to-last item in the first column is 988.154541, which would round down instead of up, so we'd end up with 98x.xx45 instead of 46). So, even though the value as stored is precise to 7 digits (plus a little), by the time we round-trip the value through a conversion to decimal and back, we can't depend on that seventh digit matching precisely any more (even though there is enough precision that it will more often than not).
1. That basically means 7 digits, and the 8th digit will be a little more accurate than nothing, but not a whole lot. For example, if we were converting from a double of 1.2345678, the .225 digits of precision mean that the last digit would be within about ±0.775 of what started out there (whereas without the .225 digits of precision, it would be basically ±1 of what started out there).
what is the most significant decimal digits precision that can be
converted to binary and back to decimal without loss of significance?
The most significant decimal digits precision that can be converted to binary and back to decimal without loss of significance (for single-precision floating-point numbers or 24-bits) is 6 decimal digits.
Why do these two numbers differ...
The numbers 6 and 7.225 differ because they define two different things. 6 is the most decimal digits that can be round-tripped. 7.225 is the approximate number of decimal digits of precision for a 24-bit binary integer, because a 24-bit binary integer can have 7 or 8 decimal digits depending on its specific value.
7.225 was found using the specific binary integer formula:
d_spec = b · log10(2)    (d_spec = specific decimal digits, b = bits)
However, what you normally need to know are the minimum and maximum decimal digits for a b-bit integer. The following formulas are used to find the min and max decimal digits (7 and 8, respectively, for 24 bits) of a specific binary integer.
d_min = ⌈(b − 1) · log10(2)⌉    (d_min = min decimal digits, b = bits, ⌈x⌉ = smallest integer ≥ x)
d_max = ⌈b · log10(2)⌉    (d_max = max decimal digits, b = bits, ⌈x⌉ = smallest integer ≥ x)
To learn more about how these formulas are derived, read Number of Decimal Digits In a Binary Integer, written by Rick Regan.
This is all well and good, but you may ask, why is 6 the most decimal digits for a round-trip conversion if you say that the span of decimal digits for a 24-bit number is 7 to 8?
The answer is — because the above formulas only work for integers and not floating-point numbers!
Every decimal integer has an exact value in binary. However, the same cannot be said for every decimal floating-point number. Take .1, for example: .1 in binary is the number 0.000110011001100..., a repeating (recurring) binary fraction. This can produce rounding error.
Moreover, it takes one more bit to represent a decimal floating-point number than it does to represent a decimal integer of equal significance. This is because floating-point numbers are more precise the closer they are to 0, and less precise the further they are from 0. Because of this, many floating-point numbers near the minimum and maximum value ranges (emin = -126 and emax = +127 for single-precision) lose 1 bit of precision due to rounding error. To see this visually, look at What every computer programmer should know about floating point, part 1, written by Josh Haberman.
Furthermore, there are at least 784,757 positive seven-digit decimal numbers that cannot retain their original value after a round-trip conversion. An example of such a number that cannot survive the round-trip is 8.589973e9. This is the smallest positive number that does not retain its original value.
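You can check the cited example with a quick round trip; a small sketch (the output noted in the comment is what round-to-nearest should produce):

#include <cstdio>

int main() {
    // 8.589973e9 is cited above as the smallest positive seven-digit decimal
    // that does not survive a decimal -> float -> decimal round trip.
    float f = 8.589973e9f;
    std::printf("%.7g\n", (double)f);   // expect 8589973504 to print as 8.589974e+09, not ...73
}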
Here's the formula that you should be using for floating-point number precision that will give you 6 decimal digits for round-trip conversion.
d_max = ⌊(b − 1) · log10(2)⌋    (d_max = max decimal digits, b = bits, ⌊x⌋ = largest integer ≤ x)
To learn more about how this formula is derived, read Number of Digits Required For Round-Trip Conversions, also written by Rick Regan. Rick does an excellent job showing the formula's derivation with references to rigorous proofs.
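A quick numeric check of the formulas above for b = 24 significant bits (a small sketch):

#include <cmath>
#include <cstdio>

int main() {
    const double b = 24.0;                                     // significant bits, binary32
    double d_spec  = b * std::log10(2.0);                      // ~7.225
    double d_min   = std::ceil((b - 1) * std::log10(2.0));     // 7
    double d_max   = std::ceil(b * std::log10(2.0));           // 8
    double d_round = std::floor((b - 1) * std::log10(2.0));    // 6 digits survive a round trip
    std::printf("d_spec = %.3f, d_min = %g, d_max = %g, round-trip digits = %g\n",
                d_spec, d_min, d_max, d_round);
}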
As a result, you can utilize the above formulas in a constructive way; if you understand how they work, you can apply them to any programming language that uses floating-point data types. All you have to know is the number of significant bits that your floating-point data type has, and you can find the number of decimal digits you can count on to survive a round-trip conversion without loss of significance.
June 18, 2017 Update: I want to include a link to Rick Regan's new article which goes into more detail and in my opinion better answers this question than any answer provided here. His article is "Decimal Precision of Binary Floating-Point Numbers" and can be found on his website www.exploringbinary.com.
Do keep in mind that they are the exact same formulas. Remember your high-school math book identity:
Log(x^y) == y * Log(x)
It helps to actually calculate the values for N = 24 with your calculator:
Kahan's: 23 * Log(2) = 6.924
Wikipedia's: Log(2^24) = 7.225
Kahan was forced to truncate 6.924 down to 6 digits because of floor(), bummer. The only actual difference is that Kahan used 1 less bit of precision.
Pretty hard to guess why; the professor might have relied on old notes, written before IEEE-754 and not taking into account that the 24th bit of precision is free. The format uses a trick: the most significant bit of a floating-point value that isn't 0 is always 1, so it doesn't need to be stored. The processor adds it back before it performs a calculation, turning 23 bits of stored precision into 24 bits of effective precision.
Or he took into account that the conversion from a decimal string to a binary floating-point value itself generates an error. Many nice round decimal values, like 0.1, cannot be perfectly converted to binary; they have an endless number of digits, just like 1/3 in decimal. That, however, generates a result that is off by at most ±0.5 bits, achieved by simple rounding. So the result is accurate to 23.5 * Log(2) = 7.074 decimal digits. If he assumed that the conversion routine is clumsy and doesn't properly round, then the result can be off by ±1 bit and N−1 is appropriate. They are not clumsy.
Or he thought like a typical scientist or (heaven forbid) accountant and wanted the result of a calculation converted back to decimal as well, such as you'd get when you trivially look for a 7-digit decimal number whose conversion back and forth does not produce the same number. Yes, that adds another ±0.5 bit of error, summing up to 1 bit of error total.
But never, never make that mistake; you always have to include any errors you get from manipulating the number in a calculation. Some operations lose significant digits very quickly; subtraction in particular is very dangerous.

Decimal/Hexadecimal/Binary Conversion

Right now I'm preparing for my AP Computer Science exam, and I need some help understanding how to convert between decimal, hexadecimal, and binary values by hand. The book that I'm using (Barron's) includes an example but does not explain it very well.
What are the formulas that one should use for conversion between these number types?
Are you happy that you understand number bases? If not, then you will need to read up on this or you'll just be blindly following some rules.
Plenty of books would spend a whole chapter or more on this...
Binary is base 2, Decimal is base 10, Hexadecimal is base 16.
So Binary uses digits 0 and 1, Decimal uses 0-9, Hexadecimal uses 0-9 and then we run out so we use A-F as well.
So the position of a decimal digit indicates units, tens, hundreds, thousands... these are the "powers of 10"
The position of a binary digit indicates units, 2s, 4s, 8s, 16s, 32s...the powers of 2
The position of hex digits indicates units, 16s, 256s...the powers of 16
For binary to decimal, add up each 1 multiplied by its 'power', so working from right to left:
1001 binary = 1*1 + 0*2 + 0*4 + 1*8 = 9 decimal
For binary to hex, you can either work it out the total number in decimal and then convert to hex, or you can convert each 4-bit sequence into a single hex digit:
1101 binary = 13 decimal = D hex
1111 0001 binary = F1 hex
For hex to binary, reverse the previous example - it's not too bad to do in your head because you just need to work out which of 8,4,2,1 you need to add up to get the desired value.
For decimal to binary, it's more of a long division problem: find the biggest power of 2 that is not larger than your input, set the corresponding binary bit to 1, and subtract that power of 2 from the original decimal number. Repeat until you have zero left.
E.g. for 87:
the powers of two are 1, 2, 4, 8, 16, 32, 64, ...; the highest one not exceeding 87 is 64
64 is 2^6 so we set the relevant bit to 1 in our result: 1000000
87 - 64 = 23
the next highest power of 2 smaller than 23 is 16, so set the bit: 1010000
repeat for 4,2,1
final result 1010111 binary
i.e. 64+16+4+2+1 = 87 in decimal
For hex to decimal, it's like binary to decimal, only you multiply by 1,16,256... instead of 1,2,4,8...
For decimal to hex, it's like decimal to binary, only you are looking for powers of 16, not 2. This is the hardest one to do manually.
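Here is the decimal-to-binary method just described, plus the reverse direction, in code, using 87 from the worked example above (a small sketch):

#include <cstdio>
#include <string>

int main() {
    // Decimal -> binary: repeatedly subtract the biggest power of two that still fits.
    int n = 87;
    int p = 1;
    while (p * 2 <= n) p *= 2;            // largest power of two not exceeding n (64 here)
    std::string bits;
    while (p > 0) {
        if (n >= p) { bits += '1'; n -= p; } else { bits += '0'; }
        p /= 2;
    }
    std::printf("87 in binary: %s\n", bits.c_str());       // 1010111

    // Binary -> decimal: add up the place value of every 1 (done here by doubling left to right).
    int value = 0;
    for (char c : bits) value = value * 2 + (c - '0');
    std::printf("back to decimal: %d\n", value);            // 87
}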
This is a very fundamental question whose detailed answer, even at an entry level, could easily run to a couple of pages. Try to Google it :-)