Minimum number of bits required for a range of numbers - binary

Im trying to work out an answer for a question about meassuring pressures.
The meassurments are supposed to be stored in binary floating point format and my task is to determine the minimum number of bits required to do so with some constraints;
Maximum pressure is 1e+07 Pa
Minimum pressure is 10 Pa
Accuracy of meassurments is 0.001 %
So if I understand it correctly, I could happen to measssure
1e+07 + 0.00001 * 1e+07 = 10000100 Pa
and would want to store it precisely. This would mean I would need 24 bits, since
2^23< 10000100 <2^24-1.
Is this including the 1 bit for a negative sign? Since we don't meassure negative pressures, would a more accurate answer be 23 bits?
And also for the minimum pressure, I could happen to meassure 9.9999 Pa and would want to store this correctly, so 4 decimals.
Since I could do the same type of calculation and end up with
2^13<9999<2^14-1
Is this already covered in the 23-24 bits I chose first?
I'm very new to this so any help or just clarification would be appreciated.

Unless you are asking this question for (i.) academic interest or (ii.) are really short on storage - neither of which I am going to address - I would strongly advocate that you don't worry about the numbers of bits and instead use a standard float (4 bytes) or double (8 bytes). Databases and software packages understand floats and doubles. If you try to craft your own floating point format using the minimum number of bits then you are adding a lot of work for yourself.
A float will hold 7 significant digits, whereas a double will hold 16. If you want to get technical a float (single precision) uses 8 bits for the exponent, 1 bit for the sign and the remaining 23 bits for the significand, whilst a double (double precision) uses 11 bits for the exponent, 1 bit for the sign and 52 bits for the significand.
So I suggest you focus instead on whether a float or a double best meets your needs.
I know this doesn't answer your question, but I hope it addresses what you need.

Related

32 bit mantissa representation of 2455.1152

If I wanted to represent -2455.1152 as 32 bit I know the first bit is 1 (negative sign) but I can get the 2455 to binary as 10010010111 but for the fractional part I'm not too sure. .1152 could have an infinite number of fractional parts. Would that mean that only up to 23 bits are used to represent the fractional part? So since 2445 uses 11 bits, bits 11 to 0 are for the fractional part?
for the binary representation I have 10010010111.00011101001. Exponent is 10. 10+127=137. 137 as binary is 10001001.
full representation would be:
1 10001001 1001001011100011101001
is that right?
It looks like you are trying to devise your own floating-point representation, but you used a fixed-point tag so I will explain how to convert your real number to a traditional fixed-point representation. First, you need to decide how many bits will be used to represent the fractional part of the number. Just for the sake of discussion let's say that 16 bits will be used for the fractional part, 15 bits for the integer part, and one bit reserved for the sign bit. Now, multiply the absolute value of the real number by 2^{16}: 2455.1152 * 65536 = 160898429.747. You can either round to the nearest integer or just truncate. Suppose we just truncate to 160898429. Converting this to hexadecimal we get 0x09971D7D. To make this negative, invert and add a 1 to the LSB, and the final result is 0xF668E283.
To convert back to a real number just reverse the process. Take the absolute value of the fixed-point representation and divide by 2^{16}. In this case we would find that the fixed-point representation is equal to the real number -2455.1151886 . The accuracy can be improved by rounding instead of truncating when converting from real to fixed-point, or by allowing more bits for the fractional part.

Calculate which numbers cause problems when converting decimal to floating point?

I know when converting some numbers from binary to floating point there can be problems.
For example: 1.4 results in 1.39999. 6.2 is 6.1999999, 6.6 is 6.599999, etc.
Is there a way to calculate which numbers will cause these problems? Like create some sort of spreadsheet or database with numbers from 1-50,000 which don't convert exactly?
The errors in floating point calculations are rooted in the way that floating point numbers are stored. Depending on which precision you are using (usually either single (float) or double-precision). Double-precision floats take more space, but are roughly twice as precise.
Floating point numbers are typically stored in an IEEE format and thus only the most significant bits are included.
In a string of bits stored in this way to represent a floating point number, there are different parts that comprise the number. First, there is a sign bit indicating whether the number is positive or negative. Next, the exponent portion of the number is stored (in single-precision floats, this takes 8 bits). After the exponent, all remaining bits represent the significant digits of the number. Thus, the bits 1 10000000 01000000000000000000000 represent -0.5 The sign bit (first one) signifies that the number is negative. The next eight bits are the exponent. Usually, the exponent is stored with a bias so that small numbers may be stored precisely in addition to large ones. To do this, the exponent used for calculations is -127 (if an 8 bit exponent) + the exponent bits interpreted as an integer (in our case 128). All remaining bits signify the significant digits of the number starting with the ones place and moving rightward from there we cut the value in half each time (1 is 1, but 01 is 1/2 and 001 is 1/4) in our case the total number is -0.5 (-1 for the sign bit * 2^1 for the exponent * 0.5 for the remaining bits)
For further example, here is a converter that uses checkboxes to indicate the bits. At worst, you could write a bot to test all of the combinations you care about.
In general, if it cannot be described with a fraction that is not directly made of combinations of powers of two, it will be rounded. (If you can't get it with 1/2 + 1/4 + 1/8 etc. it will probably be rounded)
Almost all numbers are troublesome. The minor exception are those, that when repeatedly multiplied by 2 get rid of the fractional part and end up being less than 2^24.
e.g.
0.125 -> 0.25 -> 0.5 -> 1.0 ! OK
6.4 --> 12.8 --> 25.6 -> 51.2 -> 102.4 : we got a cycle! (.4 .8 .6 .2 ... )
EDIT
Given the purpose/context of the issue, the inexactness seems to cause trouble when the floating point is rounded towards zero.
e.g.
10.2 == 10.199999809, when the next floating point would be
next == 10.200000763 <-- the difference to the wanted value is ~4 times higher
vs.
10.3 == 10.300000197, when the previous (rounded down fp would be)
prev == 10.299992370, <-- this would have been also ~4x further from away
Perhaps it's time to talk to the programmer of the CNC driver...

"Safe" Arithmetic of float numbers containing whole number on CUDA

I am currently making some research on using float numbers instead of integers on CUDA for arithmetic operations. The necessity arises since integer arithmetic is very slow compared to floating point arithmetic, and thus there is possible performance enhancement when using floats instead of integers.
I've made a small experiments and written a simple program that just loops and adds 1.0f to a variable..It came out that this works up to 16777216.0f ..adding further 1.0f to the number will leave the number unchanged... So I was wondering weather this number is the maximum number which as far as operators +,-,* involving solely whole numbers will lead to accurate whole number results, say with +/-0.0001 accuracy?
Regards
Daniel
Jonathan Dursi pointed to some important links to explain floating point in his comment.
If you look, you'll note that 16777216 is 2^24. Floating point (single precision) has 23 bits plus the implicit '1' (since values are normalised). With 24 bits you would be able to represent any integer from 1.0 * 2^0 to 1.11..11b * 2^23 (actually you get the negative numbers too since the sign bit is separate, and zero with special coding). You get the extra value (2^24) since that can be represented as 1.0 * 2^24.
As soon as you try to add 1 to 2^24 you will observe that you are subject to the rounding error described in the links Jonathan posted.
So for integers, you will need to restrict your range to [-2^24,2^24]. If you cannot do that, you either need some careful checking or else restrict yourself to integers!

how to represent floating points in binary?

I have been working on these three lab questions for about 5 hours. I'm stuck on the last question.
Consider the following floating point number representation which
stores a floating point number is 16 bits. You have a sign-bit, a six
bit (excess-32) exponent, and a nine bit mantissa.
Explain how the 9-bit mantissa might get you into trouble.
Here is the preceding question. Not sure if it will help in analysis.
What is the range of exponents it supports?
000000 to 111111 or 0 to 63 where exponent values less
than 32 are negative, and exponent values greater than 32 are
positive.
I have a pretty good foundation for floating points and converting between decimals and floating points. Any guidance would be greatly appreciated.
To me, the ratio mantissa to exponent is a bit off. Even if we assume there is a hidden bit, effectively making this into a 10 bit mantissa (with top bit always set), you can represent + or - 2^31, but in 2^31/2^10 = 2^21 steps (i.e. steps of 2097152).
I'd rather use 11 bits mantissa and 5 bit exponent, making this 2^15/2^11 = 2^4, i.e. steps of 16.
So for me the trouble would be that 9+1 bits is simply too imprecise, compared to the relatively large exponent.
My guess is that a nine-bit mantissa simply provides far too little precision, so that any operation apart from trivial ones, will make the calculation far too inexact to be useful.
I admit this answer is a little bit far-fetched, but apart from this, I can't see a problem with the representation.

When I'm multiplying a float using multu, should I ignore the result in the LO register?

In our project, we take two floats from the user, store them in integer registers, and treat them as a IEEE 754 single precision floats, manipulating the bits by masking. So after I multiply the 23 bits of fraction value, should I take into account the result placed in the LO register if I want to return a single precision float (32 bits) as the product?
First off, I hope you mean 24 bits of value, since you'll need to include the implicit mantissa bit in your multiplication.
Second, if you you want your multiplication to be correctly rounded, as in IEEE-754, you will (sometimes) need the low part of the multiply in order to deliver the correct rounded result.
On the other hand, if you don't need to implement correct rounding, and you left-shift your fraction bits before multiplication, you will be able to ignore the low word of the result.