how to represent floating points in binary? - binary

I have been working on these three lab questions for about 5 hours. I'm stuck on the last question.
Consider the following floating point number representation which
stores a floating point number is 16 bits. You have a sign-bit, a six
bit (excess-32) exponent, and a nine bit mantissa.
Explain how the 9-bit mantissa might get you into trouble.
Here is the preceding question. Not sure if it will help in analysis.
What is the range of exponents it supports?
000000 to 111111 or 0 to 63 where exponent values less
than 32 are negative, and exponent values greater than 32 are
positive.
I have a pretty good foundation for floating points and converting between decimals and floating points. Any guidance would be greatly appreciated.

To me, the ratio mantissa to exponent is a bit off. Even if we assume there is a hidden bit, effectively making this into a 10 bit mantissa (with top bit always set), you can represent + or - 2^31, but in 2^31/2^10 = 2^21 steps (i.e. steps of 2097152).
I'd rather use 11 bits mantissa and 5 bit exponent, making this 2^15/2^11 = 2^4, i.e. steps of 16.
So for me the trouble would be that 9+1 bits is simply too imprecise, compared to the relatively large exponent.

My guess is that a nine-bit mantissa simply provides far too little precision, so that any operation apart from trivial ones, will make the calculation far too inexact to be useful.
I admit this answer is a little bit far-fetched, but apart from this, I can't see a problem with the representation.

Related

Representing numbers with IEEE754 with Round to Nearest Even

I'm currently learning about IEEE754 standard and rounding, and I have an exercise which is the following:
Add -325.875 to 0.546875 in IEEE754, but with 3 bits dedicated to the mantissa instead of 23.
I'm having a lot of trouble doing this, especially representing the intermediary values, and the guard/round/sticky bits. Can someone give me a step-by-step solution, to the problem?
My biggest problem is that obviously I can't represent 0.546875 as 0.100011 as that would have more precision than the system has. So how would that be represented?
Apologies if the wording is confusing.
Preliminaries
The preferred term for the fraction portion of a floating-point number is “significand,” not “mantissa.” “Mantissa” is an old word for the fraction portion of a logarithm. Mantissas are logarithmic; adding to the mantissa multiplies the number represented. Significands are linear; adding to the significand adds to the number represented (as scaled by the exponent).
When working with a significand, use its mathematical precision, not the number of bits in the storage format. The IEEE-754 binary32 format has 23 bits in its primary field for the encoding of a significand, but another bit is encoded via the exponent field. Mathematically, numbers in the binary32 format behave as if they have 24 bits in their significands.
So, the task is to work with numbers with four bits in their significands, not three.
Work
In binary, −325.875 is −101000101.1112•2. In scientific notation, that is −1.010001011112•28. Rounding it to four bits in the significand gives −1.0102•28.
In binary, 0.546875 is .1000112. In scientific notation, that is 1.000112•2−1. Rounding it to four bits in the significand gives 1.0012•2−1. Note that the first four bits are 1000, but they are immediately followed by 11, so we round up. 1.00011 is closer to 1.001 than it is to 1.000.
So, in a floating-point format with four-bit significands, we want to add −1.0102•28 and 1.0012•2−1. If we adjust the latter number to have the same exponent as the former, we have −1.0102•28 and 0.0000000010012•28. To add those, we note the signs are different, so we want to subtract the magnitudes. It may help to line up the digits as we were taught in elementary school:
1.010000000000
0.000000001001
——————————————
1.001111110111
Thus, the mathematical result would be −1.0011111101112•28. However, we need to round the significand to four bits. The first four bits are 1001, but they are followed by 11, so we round up, producing 1010. So the final result is −1.0102•28.
−1.0102•28 is −1.25•28 = −320.

Minimum number of bits required for a range of numbers

Im trying to work out an answer for a question about meassuring pressures.
The meassurments are supposed to be stored in binary floating point format and my task is to determine the minimum number of bits required to do so with some constraints;
Maximum pressure is 1e+07 Pa
Minimum pressure is 10 Pa
Accuracy of meassurments is 0.001 %
So if I understand it correctly, I could happen to measssure
1e+07 + 0.00001 * 1e+07 = 10000100 Pa
and would want to store it precisely. This would mean I would need 24 bits, since
2^23< 10000100 <2^24-1.
Is this including the 1 bit for a negative sign? Since we don't meassure negative pressures, would a more accurate answer be 23 bits?
And also for the minimum pressure, I could happen to meassure 9.9999 Pa and would want to store this correctly, so 4 decimals.
Since I could do the same type of calculation and end up with
2^13<9999<2^14-1
Is this already covered in the 23-24 bits I chose first?
I'm very new to this so any help or just clarification would be appreciated.
Unless you are asking this question for (i.) academic interest or (ii.) are really short on storage - neither of which I am going to address - I would strongly advocate that you don't worry about the numbers of bits and instead use a standard float (4 bytes) or double (8 bytes). Databases and software packages understand floats and doubles. If you try to craft your own floating point format using the minimum number of bits then you are adding a lot of work for yourself.
A float will hold 7 significant digits, whereas a double will hold 16. If you want to get technical a float (single precision) uses 8 bits for the exponent, 1 bit for the sign and the remaining 23 bits for the significand, whilst a double (double precision) uses 11 bits for the exponent, 1 bit for the sign and 52 bits for the significand.
So I suggest you focus instead on whether a float or a double best meets your needs.
I know this doesn't answer your question, but I hope it addresses what you need.

How do I round this binary number to the nearest even

I have this binary representation of 0.1:
0.00011001100110011001100110011001100110011001100110011001100110
I need to round it to the nearest even to be able to store it in the double precision floating point. I can't seem to understand how to do that. Most tutorials talk about guard, round and sticky bits - where are they in this representation?
Also I've found the following explanation:
Let’s see what 0.1 looks like in double-precision. First, let’s write
it in binary, truncated to 57 significant bits:
0.000110011001100110011001100110011001100110011001100110011001…
Bits 54 and beyond total to greater than half the value of bit
position 53, so this rounds up to
0.0001100110011001100110011001100110011001100110011001101
This one doesn't talk about GRS bits, why? Aren't they always required?
The text you quote is from my article Why 0.1 Does Not Exist In Floating-Point . In that article I am showing how to do the conversion by hand, and the "GRS" bits are an IEEE implementation detail. Even if you are using a computer to do the conversion, you don't have to use IEEE arithmetic (and you shouldn't if you want to do it correctly ), so the GRS bits won't come into play there either. In any case, the GRS bits apply to calculations, not really to the conceptual idea of conversion.

Least Significant Bit if there is a binary point

If I have eight binary positions for instance:
7 6 5 4 3 2 1 0
its clear what the MSB and LSB is - the Most Significant Bit is the leftmost position (7) and Least Significant Bit the rightmost position (0)
But what if I have a binary.point (that is a fraction speaking decimal) i.e
7 6 5 4 3 2 1 0 -1 -2 -3
What could one here say is the most significant bit - is it the bit closest to the binary point or is it at position -3?
Its regarding floating point numbers
well, in any case it is the "last digit"
From a conceptual point of view, you can just think with decimal numbers :
Which digit is the least interesting, that is the least important if it changes, in :
123.45 $
clearly if i change the 5 i'm changing the "cents"
, which are the least important.
As in 12,345 $ , the least important is still the "5"
Though in 12,345.00 $ , the least important is the last 0.
You're talking about decimal points.
You mean fixed representation ? Or floating point number ?
If it is a fixed representation, then how do you know where is the decimal point ? Is it a convention somewhere ?
because both 123.45 and 12,345 could be, in decimal, represented by "12345". In the first case assuming that we have integers, in the second assuming that we have two-decimal-digits numbers.
If you talk about using floating points, then the notion of "least significant digit" would mean the least siginificant bit of the mantissa
but just talking about the "least significant bit" of a sequence of bit, whatever it means, is always the "last" one (or "first" one depending on conventions).
EDIT
For floating point numbers, then you have to remember that it's a complex thing.
I mean complex in the sense that you have several "numbers" packed into the representation in bits.
To take the "usual" 64 floating point (see http://en.wikipedia.org/wiki/Double-precision_floating-point_format)
you have the first bit for sign, the next 11 are an integer that is the "exponent", and the remaining one are the "mantissa" ("fractional part") (thought as 1.something_meant_by_your_mantissa_bits )
The last (64th) bit is the 52th (last) bit of the mantissa, thus the least significant bit of the corresponding number. But the 12th bit is also a last bit, the last-one of the 11-bit exponent.
I'll try to make again a example with decimal digits (it is a bit simpler and different from the IEEE "double precision")
Let's say we write 8 decimal digits numbers with 5-digits "mantissa" and 3-digits exponent.
in "1.2345 E 678" (meaning 1.2345 times ten to 678)
the "5" is the "least important digit".
If we "pack" the number like the IEEE double, "3digit-exponent then 5digit-mantissa" we would have
67812345
so the last digit of the big block is actually the least significant bit of the number.
But if we had another convention of packing, for example if the floating point number is supposed to be "5digit-mantissa then 3digit-exponent"
that is
12345678
and in our 8-digit the least significant one is the actually the 5th and not the last (8th) one.
then the least significant digit is not the last one.
It strongly depends on your convention of what your numbers (bits) mean.
Usually we talk about least significant bits in Bytes or Words that have meaning like memory address or just rough group of bits. Thus, you can think of them as integer numbers in anyway.
For the convenience, I guess the IEEE standard for floating point numbers put the mantissa at the end of the group so that the last bit, that is the least significant bit in the number, corresponds to the usual programming sense of "least significant bit of the group seen as a big pack of bits".

Help Understanding 8bit Floating Point Conversions with Decimals and Binary

I'm in a basic Engineering class and we're going through binary conversions. I can figure out the base 10 to binary or hex conversions really well, however the 8bit floating point conversions are kicking my ass and I can't find anything online that breaks it down in a n00b level and shows the steps? Wondering if any gurus have found anything online that would be helpful for this situation.
I have questions like 00101010(8bfp) = what number in base 10
Whenever I want to remember how floating point works, I refer back to the wikipedia page on 32 bit floats. I think it lays out the concepts pretty well.
http://en.wikipedia.org/wiki/Single_precision_floating-point_format
Note that wikipedia doesn't know what 8 bit floats are, I think your professor may have invented them ;)
Binary floating point formats are usually broken down into 3 fields: Sign bit, exponent and mantissa. The sign bit is simply set to 1 if the entire number should be negative, and 0 if the number is positive. The exponent is usually an unsigned int with an offset, where 2 to the 0'th power (1) is in the middle of the range. It's simpler in hardware and software to compare sizes this way. The mantissa works similarly to the mantissa in regular scientific notation, with the following caveat: The most significant bit is hidden. This is due to the requirement of normalizing scientific notation to have one significant digit above the decimal point. Remember when your math teacher in elementary school would whack your knuckles with a ruler for writing 35.648 x 10^6 or 0.35648 x 10^8 instead of the correct 3.5648 x 10^7? Since binary only has two states, this required digit above the decimal point is always one, and eliminating it allows another bit of accuracy at the low end of the mantissa.